Why AI Systems Are Not Reproducible

Reproduction requires knowing what ran. No system records the full composition of a run. Every attempt at reproduction is therefore an approximation.

The Problem

Reproduction requires a complete record

To reproduce a run, you must know every component that shaped it. The model version. The system prompt. The parameters. The retrieval context. The tool configuration. The conversation history as it existed at execution time.

No AI system records this full composition. Systems record fragments. A model name. A timestamp. The user prompt. Sometimes the output. These fragments are insufficient for reproduction.

The gap between what is recorded and what is needed for reproduction is not small. It includes every component that determines behavior but is not logged: in the typical case, everything except the model name and the user prompt.
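The components listed above, and the gap between them and a typical log, can be sketched as a data structure. This is a minimal illustration, not a description of any existing system: `RunComposition`, its field names, `missing_fields`, and the sample log entry are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class RunComposition:
    """Hypothetical record of every component that shaped one run."""
    model_version: str                     # exact checkpoint, not just a model name
    system_prompt: str
    parameters: dict[str, Any]             # e.g. temperature, token limits
    retrieval_context: tuple[str, ...]     # documents retrieved for this run
    tool_configuration: dict[str, Any]
    conversation_history: tuple[str, ...]  # as it existed at execution time


# Names of the components a complete record would need (assumed set).
REQUIRED_FIELDS = frozenset(f.name for f in fields(RunComposition))


def missing_fields(logged: dict[str, Any]) -> set[str]:
    """Return the required components that a log entry does not contain."""
    return set(REQUIRED_FIELDS - set(logged))


# A fragmentary log of the kind described above (illustrative values only).
typical_log = {
    "model_name": "example-model",
    "timestamp": "2024-01-01T00:00:00Z",
    "user_prompt": "Summarize the report.",
    "output": "The report says ...",
}

print(sorted(missing_fields(typical_log)))
```

In this sketch the log holds four fields and still covers none of the six required components, because a model name is not a model version and a user prompt is not the conversation history.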

Why It Matters

Three domains that require reproduction

Debugging requires reproducing a failure. When a model produces an incorrect output, the first step is to reproduce the conditions that led to it. Without the full composition, you cannot reproduce those conditions. You can only re-run under current conditions, which may differ from the original.

Audit requires confirming what happened. An auditor must verify that a run operated under the claimed conditions. Without reproduction, the auditor can only review logs. Logs describe what was observed, not what actually ran. The audit reviews a report, not the execution.

Compliance requires demonstrating control. Regulators require evidence that systems operate within defined parameters. Without reproduction, operators can only assert compliance. They cannot demonstrate it.

What Fails

What currently fails to enable reproduction

Partial logs capture fragments. A model name is not a model version. A model version is not the full configuration. A prompt is not the full context. Each log field captures one dimension. The run is multi-dimensional.

Model versions alone are insufficient. Even with an exact model checkpoint, behavior depends on the system prompt, the retrieved context, and the parameters. Knowing the model is necessary but not sufficient.

Output inspection cannot recover composition. Looking at what a model produced does not reveal what produced it. Multiple compositions can generate the same output. Output is a result, not a record.
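The point that output does not identify its composition can be shown with a toy example. `toy_run` below is a hypothetical stand-in for a model call, not a real model or API. It only demonstrates that a mapping from composition to output need not be invertible.

```python
def toy_run(composition: dict) -> str:
    """Toy stand-in for a model call (assumption: not a real model).

    The output depends only on the user prompt, so the other
    components leave no trace in the result.
    """
    return composition["user_prompt"].strip().upper()


# Two different compositions (illustrative values only).
composition_a = {
    "system_prompt": "Answer briefly.",
    "temperature": 0.0,
    "user_prompt": "ping",
}
composition_b = {
    "system_prompt": "Answer formally.",
    "temperature": 0.7,
    "user_prompt": "ping",
}

output_a = toy_run(composition_a)
output_b = toy_run(composition_b)

# Same output, different compositions: the output alone cannot
# tell you which composition produced it.
print(output_a == output_b, composition_a == composition_b)
```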

Logs do not establish identity. A collection of log entries does not define the run as a single object with a fixed composition. Without identity, there is no defined object to reproduce. Reproduction requires a target. The target does not exist.

The Structural Failure

This is a missing category

The failure is not that logging is incomplete. The failure is that no system is designed to record the full composition of a run as a defined, identifiable object.

This is a missing category in AI systems. Not a missing feature. Not a missing tool. A missing structural concept. Until a run has an identity that captures its full composition, reproduction is impossible by construction.
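One possible way to give a run such an identity is to hash a canonical serialization of its full composition. The text above does not prescribe a mechanism, so treat the following as a sketch under stated assumptions: `run_identity`, the choice of SHA-256 over canonical JSON, and the sample composition are all illustrative.

```python
import hashlib
import json


def run_identity(composition: dict) -> str:
    """Derive an identifier from the full composition (assumed scheme).

    Keys are sorted and whitespace is fixed so that the same composition
    always serializes to the same bytes, and therefore the same digest.
    """
    canonical = json.dumps(
        composition,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative composition; every value here is hypothetical.
original = {
    "model_version": "example-model@checkpoint-1",
    "system_prompt": "Answer briefly.",
    "parameters": {"temperature": 0.2, "max_tokens": 256},
    "retrieval_context": ["doc-17 text", "doc-42 text"],
    "tool_configuration": {"search": {"enabled": True}},
    "conversation_history": ["user: Summarize the report."],
}

# The same composition written in a different key order.
reordered = dict(reversed(list(original.items())))

# A composition that differs in a single parameter.
changed = {**original, "parameters": {"temperature": 0.3, "max_tokens": 256}}

print(run_identity(original) == run_identity(reordered))
print(run_identity(original) == run_identity(changed))
```

Under this scheme, two runs share an identity only if every recorded component matches, which gives reproduction a defined target. The sketch assumes every component is JSON-serializable and that list order is meaningful.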

→ AI Run Failure Modes

The specific ways AI runs fail without identity.