airunidentity.com

Why AI Runs Cannot Be Reproduced Without Identity

Reproduction is not about getting the same output. It is about knowing what ran. Without identity, neither is possible.

What reproduction requires

Reproduction requires two things. First, a complete description of the execution conditions. Second, the ability to confirm that a subsequent execution operated under the same conditions.

In traditional software, this is routine. The binary is versioned. The configuration is stored. The environment is documented. Given these artifacts, a second execution can be set up under known-identical conditions. If the outputs differ, the divergence can be isolated because the conditions are known.

AI runs do not have this. The model version may not be disclosed. The system prompt may have changed between executions. The retrieval context depends on index state that is not versioned. The agent's tool-call sequence is not recorded. There is no artifact that captures what the run was.

Without that artifact, reproduction cannot begin. You cannot reproduce what you cannot identify.
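The missing artifact can be made concrete. As an illustrative sketch (the field names and hashing scheme here are assumptions, not an existing standard), a run identity might collect the conditions listed above (model version, system prompt, retrieval index state, tool-call sequence, sampling parameters) into one canonical, hashable record:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunIdentity:
    """Hypothetical record of the conditions that defined one AI run."""
    model_version: str        # exact checkpoint identifier, not a model family name
    system_prompt_sha: str    # hash of the system prompt actually in effect
    retrieval_index_id: str   # snapshot id of the index that was queried
    tool_call_trace: tuple    # ordered sequence of tool calls the agent made
    sampling_params: tuple    # e.g. (("temperature", 0.7), ("top_p", 0.9))

    def run_id(self) -> str:
        """Canonical hash: runs with equal conditions get equal ids."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```

Two runs with identical conditions hash to the same id; changing any one condition changes the id. That property is what makes "knowing what ran" checkable rather than asserted.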

The distinction from non-determinism

The reproducibility problem in AI is commonly attributed to non-determinism. Models use sampling. Temperature introduces variance. The same input can produce different outputs. This is true but beside the point.

Reproduction is not about expecting identical output. It is about knowing what ran. A non-deterministic system can still be reproducible if the execution conditions are identified. You accept that outputs will vary. But you know precisely what produced them.

The problem is not that AI outputs vary. The problem is that the conditions producing those outputs are unidentified. Two engineers cannot compare runs because neither can confirm what conditions their run operated under.

Non-determinism is a property of the model. Reproducibility failure is a property of the infrastructure. Conflating the two obscures the actual problem and misdirects engineering effort toward temperature settings instead of toward the absence of run identity.
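The distinction fits in a few lines. In this toy sketch (the `identity` strings and dictionary shape are invented for illustration), two runs are comparable when their identities match, even though their sampled outputs differ:

```python
def runs_comparable(run_a: dict, run_b: dict) -> bool:
    """Two runs can be meaningfully compared iff they executed under the
    same identified conditions; their outputs are still allowed to differ."""
    return run_a["identity"] == run_b["identity"]

# Same conditions, different sampled outputs: reproducible in the sense
# that matters, because we know precisely what produced both.
run_a = {"identity": "ckpt-7:prompt-9f2c:idx-41",
         "output": "Paris is the capital of France."}
run_b = {"identity": "ckpt-7:prompt-9f2c:idx-41",
         "output": "France's capital is Paris."}
assert runs_comparable(run_a, run_b)

# Different conditions: comparison is meaningless even if outputs agree.
run_c = {"identity": "ckpt-8:prompt-9f2c:idx-41",
         "output": "Paris is the capital of France."}
assert not runs_comparable(run_a, run_c)
```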

What breaks downstream

Debugging. A user reports an incorrect output. The engineer attempts to reproduce the issue. They send the same prompt. They get a different response. Was the original issue a model error, a context problem, a prompt issue, or a retrieval failure? Without identity, the engineer cannot isolate the cause. They cannot even confirm they are looking at the same type of execution.

Compliance. A regulator requests evidence that a system operated under documented conditions at a specific time. Reproduction is the standard method for demonstrating this. If the execution cannot be reproduced — not the output, but the conditions — the compliance requirement cannot be met. The system produced output. Whether it produced that output under the claimed conditions is unprovable.

Regression analysis. A model update is deployed. Output quality degrades. The team needs to compare pre-update and post-update behavior. They cannot. The pre-update runs have no identity. The conditions under which they executed were never captured in a way that allows comparison. The regression is observable in aggregate. The mechanism is invisible.

Each of these failures is independent. Each is active in production systems. Each traces back to the same absence.

Why current approaches do not resolve this

Prompt versioning captures one input. It does not capture model state, retrieval context, system prompt configuration, or agent trajectory. A versioned prompt submitted to two different model checkpoints with two different retrieval indices is two different runs. Prompt versioning treats them as the same.
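The collision is easy to demonstrate. In this toy example (field names are hypothetical), keying runs by prompt hash alone merges two executions that differed in model checkpoint and retrieval index:

```python
# Two executions of the same versioned prompt under different conditions.
runs = [
    {"prompt_sha": "a1b2", "model": "ckpt-2024-01", "index": "idx-v7"},
    {"prompt_sha": "a1b2", "model": "ckpt-2024-03", "index": "idx-v9"},
]

# Prompt versioning keys runs by prompt alone: the two runs collide.
by_prompt = {r["prompt_sha"] for r in runs}
assert len(by_prompt) == 1

# Keying by the full set of conditions keeps them distinct.
full_ids = {(r["prompt_sha"], r["model"], r["index"]) for r in runs}
assert len(full_ids) == 2
```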

Experiment tracking systems record parameters and metrics. They were designed for training pipelines, not inference runs. They track hyperparameters, dataset splits, and training curves. The execution conditions of an inference run — the full state that produced a specific output — are outside their data model.

Output comparison compares results. When results differ, it cannot explain why. Two different outputs from two unidentified runs are two data points with no causal connection. Comparison without identity is correlation without mechanism.

Each approach captures a fragment. None capture the run. The fragments do not compose into identity because they were never designed to. They are observations about an unidentified thing, stored in separate systems, with no shared reference.

What identity would make possible

For reproduction to function, a system must capture the full set of conditions that define a run — not as scattered logs, but as a single, referenceable identity. The identity must persist. It must travel with the output. It must be comparable across systems and across time.

For debugging to work, an engineer would need to retrieve the identity of the reported run and compare it against their own. For compliance to work, a reviewer would need to reference the identity of a historical run and confirm its conditions. For regression analysis to work, pre-update and post-update runs would need identities that allow structural comparison of their conditions.
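Structural comparison of conditions is the operation all three cases reduce to. A minimal sketch, assuming identities are stored as flat key/value records (the field names are illustrative):

```python
def identity_diff(pre: dict, post: dict) -> dict:
    """Structural comparison of two identified runs: returns, for each
    condition that changed, the (before, after) pair of values."""
    keys = pre.keys() | post.keys()
    return {k: (pre.get(k), post.get(k))
            for k in keys if pre.get(k) != post.get(k)}

# Regression analysis: the diff localizes the change to the checkpoint.
pre  = {"model": "ckpt-7", "system_prompt_sha": "f3c9", "index": "idx-41"}
post = {"model": "ckpt-8", "system_prompt_sha": "f3c9", "index": "idx-41"}
assert identity_diff(pre, post) == {"model": ("ckpt-7", "ckpt-8")}
```

Debugging and compliance are the degenerate case: an empty diff confirms that a second execution, or a historical one, ran under the same conditions.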

None of this requires identical outputs. All of it requires identified runs. The infrastructure does not exist. The primitive is missing. Everything built on top of that missing primitive inherits the failure.