Cross-System AI Identity Disagreement
Two systems. Same prompt. Different outputs. No way to determine why. This is the default state of multi-system AI operation.
What cross-system disagreement looks like in practice
Team A runs a prompt through their pipeline. Team B runs what they believe is the same prompt through theirs. The outputs differ. Not slightly. Substantively. The clinical recommendation is different. The risk assessment diverges. The generated code solves a different problem.
Both teams check their logs. The prompt is the same. The model name is the same. The timestamps are close. Everything that was recorded matches. Yet the outputs disagree.
The disagreement is not a bug. It is not an error. It is the expected outcome when unidentified runs are compared across systems. The comparison has no foundation. The teams are comparing outputs while having no access to the conditions that produced them.
In practice, this surfaces during integration testing, multi-team deployments, vendor evaluations, and production incident reviews. It is treated as an anomaly each time. It is structural.
Why disagreement cannot be resolved without identity
Resolution requires isolation. To understand why two runs produced different outputs, you must identify which condition differed. This requires knowing the full set of conditions for each run. Not some conditions. All conditions that could influence the output.
Without identity, this knowledge does not exist. Each run is a black box with an input and an output. The inputs match. The outputs do not. The space between input and output — where every meaningful difference resides — is invisible to both systems.
Teams fall back on heuristics. They re-run. They compare again. They adjust parameters. They cannot systematically isolate the divergence because they cannot see the conditions that diverged. The resolution process is trial and error applied to an identification problem. It terminates when the outputs happen to converge, not when the cause is found.
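If full condition sets were recorded, isolation would be a set difference rather than trial and error. A minimal sketch of that idea, assuming each run exposed a hypothetical `conditions` mapping (no such record exists in today's systems):

```python
def diff_conditions(run_a: dict, run_b: dict) -> dict:
    """Return the conditions that differ between two fully recorded runs.

    Keys present in only one run, or with unequal values, are the
    candidate sources of output divergence.
    """
    keys = set(run_a) | set(run_b)
    return {
        k: (run_a.get(k), run_b.get(k))
        for k in keys
        if run_a.get(k) != run_b.get(k)
    }

# Hypothetical condition records; illustrative field names only.
run_a = {"model": "m-1.2", "temperature": 0.7, "system_prompt_hash": "ab12"}
run_b = {"model": "m-1.2", "temperature": 0.2, "system_prompt_hash": "ab12"}

print(diff_conditions(run_a, run_b))  # {'temperature': (0.7, 0.2)}
```

The point is not the diff itself, which is trivial, but its precondition: both sides must hold the complete condition set. That is exactly what is missing.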
The four possible sources of divergence
When two systems disagree, the divergence originates from one or more of four sources. Without identity, none can be isolated.
Model. The two systems may use different model versions. Not different model names — those would be visible. Different checkpoints, different fine-tunes, different quantization levels, or different provider-side updates applied without notification. The model field in the API call is identical. The actual model is not.
Configuration. System prompts, temperature settings, token limits, stop sequences, frequency penalties — any parameter that shapes the model's behavior. Two systems using the "same" configuration may have drifted through independent deployments, feature flags, or manual adjustments. The configuration is not captured as part of the run. It exists only in the deployment state at the moment of execution.
Context. In retrieval-augmented systems, the documents injected into context depend on the index state, the embedding model, the similarity threshold, and the chunking strategy. Two systems with different indices will retrieve different documents for the same query. The model responds to what it reads. If what it reads differs, the output differs. The retrieval context is not part of the audit trail.
Prompt. The user-visible prompt may be identical. The actual prompt — after template expansion, variable injection, chain-of-thought formatting, and system-level preprocessing — may differ. Prompt assembly pipelines are complex. Without identity, the final assembled prompt is not captured. What the model received is unknown.
Four sources. Any combination. No mechanism to isolate. This is not a tooling limitation. This is the absence of the primitive that tooling would require.
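A record sufficient to isolate these four sources would have to capture each of them at execution time, not reconstruct them afterward. A hypothetical sketch of such a record (all field names are illustrative; this is not an existing schema):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Conditions a run would need to record for cross-system comparison.

    Each field group maps to one of the four divergence sources.
    All names here are illustrative; no such standard record exists today.
    """

    # Model: the checkpoint actually served, not just the alias in the API call.
    model_alias: str          # the "model" field sent to the API
    model_checkpoint: str     # provider-side checkpoint / quantization id

    # Configuration: every parameter that shapes behavior.
    temperature: float
    max_tokens: int
    system_prompt: str

    # Context: what retrieval actually injected.
    retrieved_doc_ids: tuple  # document ids, in injection order
    index_version: str
    embedding_model: str

    # Prompt: the final assembled prompt the model received.
    assembled_prompt: str
```

Note that `model_alias` and `assembled_prompt` are the only fields most systems log today; the rest is the invisible space between input and output.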
What this costs in multi-team and multi-environment deployments
In single-team, single-environment systems, cross-system disagreement is inconvenient. In multi-team deployments, it is corrosive.
A platform team deploys a model service. Multiple product teams consume it. Each product team wraps the service with their own system prompts, retrieval pipelines, and post-processing logic. When outputs diverge between teams, the platform team cannot determine whether the divergence originates in their service or in the consuming team's wrapper. Neither side has the information to isolate the cause.
In staging-to-production promotion, a model behaves differently in production than it did in staging. The environments are nominally identical. The runs are not identified. There is no way to compare what actually ran in staging against what is running in production. The promotion process assumes equivalence. The assumption is unverifiable.
In vendor evaluation, two providers are compared on identical prompts. The outputs differ. The evaluation cannot determine whether the difference is in the model, the provider's infrastructure, the API's preprocessing, or the evaluation framework's prompt assembly. The comparison lacks a controlled variable. Identity would be that variable.
There is no mechanism today that enables cross-system AI run comparison
No two systems today can confirm they executed the same run. There is no shared reference point. There is no common identity. There is no way to structurally compare execution conditions across system boundaries.
Observability platforms compare metrics. They do not compare runs. Tracing systems follow requests through services. They do not identify what executed within those services. Logging frameworks record events. They do not establish that two events in different systems correspond to equivalent executions.
For cross-system comparison to function, each participating system must establish a shared identity for each run — an identity that captures the conditions of execution, travels across system boundaries, and can be compared without requiring access to internal system state.
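One way such an identity could work is a content-addressed fingerprint: a hash over a canonical serialization of the run's conditions, comparable across systems without sharing any internal state. A sketch under that assumption (the condition keys are illustrative):

```python
import hashlib
import json


def run_identity(conditions: dict) -> str:
    """Derive a cross-system identity from a run's full condition set.

    Canonical JSON (sorted keys, fixed separators) makes the hash
    reproducible on any system that holds the same conditions.
    """
    canonical = json.dumps(conditions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two systems computing over identical conditions derive identical
# identities; any divergent condition yields a different identity.
a = run_identity({"model_checkpoint": "ckpt-441", "temperature": 0.7})
b = run_identity({"model_checkpoint": "ckpt-441", "temperature": 0.7})
c = run_identity({"model_checkpoint": "ckpt-442", "temperature": 0.7})
assert a == b and a != c
```

The hard part is not the hashing; it is that nothing today captures the condition set the hash would be computed over.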
That identity does not exist. Until it does, cross-system disagreement is not a problem to be solved. It is a condition to be endured.