Opinion

The case for reproducible product research

AI dashboards hallucinate. Your Monday morning numbers shouldn't. A short argument for deterministic research.

March 7, 2026 · 5 min read

The silent rewrite of your dashboard

Last Tuesday, your weekly growth dashboard told you your p75 time to value was 6.2 days. You wrote it down. You mentioned it in a Slack thread. Someone on the exec team screenshotted it for the board deck. This Tuesday, the same dashboard tells you the p75 is 5.1 days. You did not change the data. You did not reprocess any events. You did not adjust any definitions. The number just moved.

The number moved because there is a language model sitting between your event store and your dashboard, and last week it interpreted the question one way and this week it interpreted it a different way. The prompt did not change. The data did not change. The model version changed. Or the temperature drifted. Or a retry path inside the orchestration layer sent the request down a slightly different chain. Or the underlying provider swapped a sub-model in the middle of a patch release without announcing it. You will never know which. The dashboard is just a window onto a black box, and the black box has opinions that shift over time.

This is not a theoretical risk. This is the 2026 product analytics stack, live and in production at most companies running an AI-first analytics tool. The pitch sounds modern. Ask a question in plain English, get a chart. What you are actually buying is a dashboard that quietly rewrites itself every week, and the team consuming it has no way to tell the difference between a genuine product movement and a silent prompt drift. The CEO is making decisions on a number. The number is wrong. Nobody in the room can prove it.

What determinism means

The fix is not complicated to describe. It is called determinism, and it used to be the default for all analytics work. A deterministic research is one where the same question, applied to the same data, produces the same answer. Every time. No retries, no drift, no model updates in the middle of the night. Same input, same output, forever.

That is it. That is the whole idea. It is not a technical feature. It is a guarantee, and it is the guarantee that separates research you can build a business on from research you can only vaguely refer to in meetings.

A reproducible research is the operational form of a deterministic one. It is a research where the methodology is documented, the inputs are pinned, the computation is traceable, and a teammate who opens it six months from now can re-run it and get the same conclusion. Not roughly the same conclusion. The same conclusion, down to the digit. If they get a different number, either the underlying data changed, or something is broken. Those are the only two possibilities. There is no third answer where "the AI had a bad day."
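To make "pinned" concrete, here is a minimal Python sketch of what a reproducible research run could look like. Everything in it is illustrative rather than Tivalio's actual schema: the `ResearchRun` record, the nearest-rank p75, and the input digest are assumptions standing in for a documented methodology, pinned inputs, and a traceable computation.

```python
from dataclasses import dataclass
import hashlib
import json
import math

@dataclass(frozen=True)
class ResearchRun:
    methodology: str   # pinned, versioned definition, e.g. "ttv_p75@v3" (hypothetical)
    input_digest: str  # fingerprint of the exact rows that were used
    result: float

def p75_ttv(ttv_days: list[float]) -> float:
    """Nearest-rank p75 over the sorted values: a fixed, documented method."""
    ranked = sorted(ttv_days)
    return ranked[math.ceil(0.75 * len(ranked)) - 1]

def run_research(ttv_days: list[float]) -> ResearchRun:
    digest = hashlib.sha256(json.dumps(sorted(ttv_days)).encode()).hexdigest()
    return ResearchRun("ttv_p75@v3", digest, p75_ttv(ttv_days))

rows = [1.5, 2.0, 3.2, 4.8, 5.1, 6.2, 7.0, 9.4]

# Re-running on the same rows must reproduce the same run, down to the digit.
assert run_research(rows) == run_research(rows)
```

If two runs disagree, the digests tell you which of the only two possibilities you are in: the inputs changed, or something is broken.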

The reason this matters is not academic. It is the reason you can compare this week's p75 to last week's p75 and say something meaningful about the difference. If either number was produced by a non-deterministic process, the comparison is worthless. You cannot detect a two-tenths-of-a-day improvement against a background of two-tenths-of-a-day silent drift. The signal and the noise are the same size, and the noise is invisible because it is produced by an opaque layer that does not report its own variance.
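A toy simulation makes the size argument tangible. Assume a real 0.2-day improvement and a non-deterministic layer that silently adds up to 0.2 days of noise to each week's reading; the numbers below are invented for illustration, not drawn from any real stack.

```python
import random

random.seed(0)  # seeding is, fittingly, what makes this demo reproducible

TRUE_IMPROVEMENT = 0.2  # days: the real week-over-week p75 movement
SILENT_DRIFT = 0.2      # days: noise added by a non-deterministic layer

# Each observed week-over-week delta = real signal + this week's drift
# minus last week's drift.
deltas = [
    TRUE_IMPROVEMENT
    + random.uniform(-SILENT_DRIFT, SILENT_DRIFT)
    - random.uniform(-SILENT_DRIFT, SILENT_DRIFT)
    for _ in range(10_000)
]

# Roughly one week in eight, a genuine improvement reads as flat or negative.
wrong_sign = sum(d <= 0 for d in deltas) / len(deltas)
print(f"real improvement reads as no improvement: {wrong_sign:.0%} of weeks")
```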

Why AI analytics can't be reproducible

The technical reasons are unpleasant and boring and true.

First, language models have temperature. Even at temperature zero, which many production analytics tools are not running at anyway, the output is not guaranteed deterministic across providers, across load balancers, or across GPU hardware revisions. The underlying matrix math is floating-point, the execution order is parallel, and small numerical differences compound through a long generation. The same prompt, to the same model, on the same day, can produce a different output. This is not a bug. It is the baseline behavior of the technology.
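The floating-point point is easy to see in plain Python, no GPU required. Addition order changes the result, and parallel hardware does not promise an order.

```python
# Floating-point addition is not associative: grouping changes the result.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False
print(a, b)    # 0.6000000000000001 0.6

# The same values summed in a different order can land on a different float.
# A parallel reduction picks its order per run, per batch, per hardware
# revision, so bit-identical outputs are not promised.
import random
xs = [random.random() for _ in range(100_000)]
print(sum(xs) == sum(reversed(xs)))  # often False
```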

Second, prompts drift. The internal prompts that production analytics tools use to translate a natural-language question into a query are not static. They are improved every week by the team maintaining them. Each improvement is a tiny rewrite of the translation layer. Each rewrite changes, in small ways, how the same question gets answered. Your Tuesday morning number is downstream of a prompt file that changed on Monday afternoon.

Third, models update. Providers push new versions of base models on a schedule nobody controls. Sometimes the update is announced. Often it is not. A "stable" model endpoint is a moving target over any window longer than a few months, and analytics questions that span quarters are not going to see the same model at the start and the end of the window.

Fourth, the same question asked twice gets two answers. This is the headline consequence of the three points above, and it is the one that kills reproducible research on an LLM substrate. If you cannot ask a question twice and get the same result, you cannot trust the first answer on its own.

An AI-first analytics layer is, by construction, not reproducible. Same question, same data, two answers. If your weekly growth review runs on top of one of these layers, you are comparing numbers across weeks that were produced by a process that cannot guarantee the comparison is valid. The movement you are reacting to might not exist.

Computed, not guessed

Here is the Tivalio frame, because it is the hill I will die on. AI is a useful tool for understanding a question. It is not a useful tool for inventing the answer. The two are different jobs, and they need different engines.

When a user types "what is slowing down my TTV right now" into a natural-language analytics tool, the AI is being asked to do two jobs at once. It has to understand what the user meant by the question, and then it has to compute the answer. The first job is a language job. The second is a math job. Language models are good at the first. They are not deterministic at the second, and the gap between the two is where the weekly number gets quietly rewritten.

The right architecture separates the two. Use an LLM to interpret the question and route it to the right research template. Then use a deterministic, documented, auditable computation to produce the answer. The computation is not a prompt. It is a pinned methodology with fixed inputs and fixed outputs. The research template is the same this week as it was last week and as it will be next year. The answer moves if and only if the underlying data moved. Nothing else can move it.
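A minimal sketch of that separation, with hypothetical names (`route_question`, `RESEARCH_LIBRARY`) standing in for the real thing. In production the router would call an LLM; the point is that the number never comes from it.

```python
import math

def ttv_p75(ttv_days: list[float]) -> float:
    """Pinned methodology: nearest-rank p75. A function, not a prompt."""
    ranked = sorted(ttv_days)
    return ranked[math.ceil(0.75 * len(ranked)) - 1]

# The research library: fixed, versioned templates. The same template
# this week, last week, and next year.
RESEARCH_LIBRARY = {"ttv_p75@v3": ttv_p75}

def route_question(question: str) -> str:
    """The only place a model belongs: mapping language to a template id.
    In production this would call an LLM. The routing can afford to be
    fuzzy, because a wrong route is visible; a wrong number is not."""
    q = question.lower()
    if "ttv" in q or "time to value" in q:
        return "ttv_p75@v3"
    raise ValueError("no matching research template")

def answer(question: str, rows: list[float]) -> float:
    template = RESEARCH_LIBRARY[route_question(question)]
    return template(rows)  # deterministic: only the data can move this

rows = [1.5, 2.0, 3.2, 4.8, 5.1, 6.2, 7.0, 9.4]
assert answer("what is our TTV right now?", rows) == answer("what is our TTV right now?", rows)
```

The design choice worth noticing: the model's output is a template id, a value you can log, diff, and audit. If the router misfires, you see the wrong chart title. You do not see a plausible chart built on a silently different definition.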

This is what "computed, not guessed" means, and it is the thing Tivalio builds around. Every research in the library has a fixed methodology. Every answer is traceable to the rows that produced it. Every re-run on the same inputs returns the same result. You can compare this week's p75 to last week's p75 and know the comparison is real. That comparison is the whole reason you were measuring in the first place, and an AI-first analytics tool cannot give it to you.

The broader argument about why scalar summaries (like activation rate) conceal shape drift on top of process drift lives in our piece on activation rate as a vanity metric. Both pieces point at the same thing: your weekly growth review deserves a number you can trust, and you cannot trust a number that a model might feel differently about next week. Computed, not guessed. Pick one.

Stop reading dashboards.
Start answering questions.

Connect your data in 5 minutes. See your TTV distribution the same day.

Free forever · No credit card · Cancel anytime