airunidentity.com

Audit Gaps in AI Systems

Audit trails exist. They record submissions and responses. They do not record what ran. The gap between those two statements is where accountability disappears.

Purpose of Audit

What an audit trail is supposed to provide

An audit trail exists to answer one question: what happened? Not approximately. Not plausibly. The record should reconstruct the conditions under which a system operated at a specific point in time.

In traditional software, this is achievable. The binary is versioned. The configuration is stored. The inputs are logged. The state machine is known. Given the audit record, a reviewer can reconstruct the execution path from input to output. The trail is a map, and the territory is static enough for the map to be useful.
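A minimal sketch of why this works when execution is deterministic. The toy `run` function and the record fields are illustrative, not any real system's schema: the point is that the record captures everything the output depended on, so a reviewer can replay and verify.

```python
import hashlib

def run(binary_version: str, config: dict, user_input: str) -> str:
    # Stand-in for a traditional system: the output is fully determined
    # by versioned artifact + stored configuration + logged input.
    return f"{binary_version}|{config['mode']}|{user_input.upper()}"

# The audit record captures every condition of execution.
record = {
    "binary_version": "app-2.3.1",      # versioned binary
    "config": {"mode": "strict"},       # stored configuration
    "input": "hello",                   # logged input
}
record["output_sha256"] = hashlib.sha256(
    run(record["binary_version"], record["config"], record["input"]).encode()
).hexdigest()

# Given only the record, a reviewer replays the run and confirms the
# recorded output is exactly what this state produces.
replayed = run(record["binary_version"], record["config"], record["input"])
replay_matches = (
    hashlib.sha256(replayed.encode()).hexdigest() == record["output_sha256"]
)
```

The map is useful because the territory is static: replaying the same state yields the same output, every time.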

This is the expectation that regulators carry forward. This is the expectation that compliance frameworks encode. This is the expectation that AI systems structurally cannot meet.

What Is Recorded

What AI audit trails actually provide

An AI audit trail typically records the prompt submitted, the response returned, a timestamp, and a user or session identifier. Some systems add token counts, latency metrics, or cost data. Advanced implementations log the model name.
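A typical record has roughly this shape. The field names and values are illustrative, not any vendor's schema; they mirror the fields listed above:

```python
from dataclasses import dataclass

@dataclass
class TypicalAuditRecord:
    # The fields most AI audit trails actually capture.
    prompt: str             # what the user submitted
    response: str           # what came back
    timestamp: str          # when
    user_id: str            # who
    model_name: str         # the name -- not the active version
    prompt_tokens: int      # optional extras: token counts,
    completion_tokens: int  # latency, cost
    latency_ms: float
    cost_usd: float

record = TypicalAuditRecord(
    prompt="Summarize the attached policy.",
    response="The policy states that...",
    timestamp="2024-06-01T12:00:00Z",
    user_id="user-1234",
    model_name="gpt-4",
    prompt_tokens=842,
    completion_tokens=156,
    latency_ms=1240.0,
    cost_usd=0.03,
)
```

Every field describes the endpoints of the exchange. No field describes the execution between them.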

This looks like an audit trail. It has the shape of one. It is stored in databases. It can be queried. Reports can be generated from it. Dashboards render it.

But it does not answer the question an audit trail is supposed to answer. It records that a submission occurred and a response was returned. It does not record what executed between those two events.

The gap is not small. The gap is the entire execution.

Hidden Information

The specific information an audit gap hides

Between the prompt and the response, four categories of information determine the output. None are captured by standard audit trails.

Model version. Not the model name. The specific version, checkpoint, or fine-tune that was active at the moment of execution. Model providers update weights, adjust safety filters, modify behavior — sometimes without version number changes. The audit trail says "gpt-4." It does not say which gpt-4.

System prompt state. The system prompt is not static. It is modified by developers, by A/B testing frameworks, by feature flags, by deployment pipelines. The audit trail records the user prompt. The system prompt — the instruction set that shaped the entire response — is absent.

Retrieval context. In retrieval-augmented systems, the documents retrieved and injected into context determine the response. The retrieval set depends on the index state, the embedding model version, the similarity threshold, the chunk boundaries. None of these are captured. The audit trail does not know what the model was reading when it answered.

Agent state. In agentic systems, the model makes decisions across multiple steps — tool calls, memory lookups, branching logic. The audit trail records the final output. The intermediate decisions, the tools invoked, the order of operations, the state at each step — none of this appears.

The audit trail records the endpoints. Everything between them is dark.
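The four dark categories can be written down as the fields a complete record would need. Every name here is hypothetical; the structure simply enumerates what standard trails omit:

```python
from dataclasses import dataclass, field

@dataclass
class ExecutionState:
    """The information that determined the output -- and that standard
    audit trails do not capture. All field names are illustrative."""
    # Model version: the exact weights, not the marketing name.
    model_checkpoint: str          # checkpoint or fine-tune identifier
    # System prompt state: the instruction set active at run time.
    system_prompt: str
    active_feature_flags: dict
    # Retrieval context: what the model was reading when it answered.
    index_snapshot_id: str
    embedding_model_version: str
    similarity_threshold: float
    retrieved_chunks: list
    # Agent state: the intermediate decisions behind the final output.
    tool_calls: list = field(default_factory=list)
    step_states: list = field(default_factory=list)

# A hypothetical snapshot of one execution:
state = ExecutionState(
    model_checkpoint="ft-2024-05-17-a",
    system_prompt="You are a clinical assistant...",
    active_feature_flags={"concise_mode": True},
    index_snapshot_id="idx-9031",
    embedding_model_version="embed-v3",
    similarity_threshold=0.78,
    retrieved_chunks=["chunk-17", "chunk-42"],
)
```

Each field is a condition the output depended on. The standard trail records none of them.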

Why Logs Fail

Why this cannot be closed by capturing more logs

The instinct is to log more. Capture the system prompt. Record the retrieval context. Store the agent trajectory. This appears reasonable.

It does not work. Not because the data is unavailable. Because logs, however comprehensive, do not establish identity. They record observations about a run. They do not identify the run.

Consider: two runs execute with the same prompt, the same model name, the same system prompt text, and produce different outputs. The logs are identical in every field that was captured. The runs are different. The logs cannot distinguish them because the logs describe attributes, not identity.

Any valid audit system must be able to answer: "Is this the same run?" Logs cannot answer this question. They can describe what was observed. They cannot establish what ran.

More fields in the log do not change this. A longer description of an unidentified thing is still a description of an unidentified thing.
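The argument above can be made concrete. In this sketch, two runs are byte-identical in every logged field yet produce different outputs, because the output depends on deployment state the trail never records. The `call_model` stand-in and all names are illustrative:

```python
import hashlib
import json

def call_model(prompt: str, unlogged_state: dict) -> str:
    # Stand-in for an AI system whose output depends on state the audit
    # trail never captures: the active checkpoint, the live system-prompt
    # revision, the retrieval snapshot.
    checkpoint = unlogged_state["checkpoint"]
    return f"decision: {'approve' if checkpoint == 'ckpt-b' else 'deny'}"

def logged_fields(prompt: str) -> dict:
    # Everything the audit trail records about a run.
    return {"prompt": prompt, "model": "gpt-4",
            "system_prompt": "You are a careful analyst."}

def attribute_hash(fields: dict) -> str:
    return hashlib.sha256(json.dumps(fields, sort_keys=True).encode()).hexdigest()

prompt = "Should this application be approved?"
# Two runs: identical in every logged field...
log_a, log_b = logged_fields(prompt), logged_fields(prompt)
# ...but executed against different, unlogged deployment states.
out_a = call_model(prompt, {"checkpoint": "ckpt-a"})
out_b = call_model(prompt, {"checkpoint": "ckpt-b"})

logs_identical = attribute_hash(log_a) == attribute_hash(log_b)  # True
runs_identical = out_a == out_b                                  # False
```

Adding more fields to `logged_fields` only lengthens the description. It cannot answer "is this the same run?" because identity was never in the record.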

Regulated Contexts

What audit failure means in regulated contexts

In healthcare, a model generates a clinical recommendation. The recommendation is acted upon. Six months later, an adverse outcome triggers a review. The review requires reconstructing what the model knew, what instructions it operated under, and what context shaped the recommendation. The audit trail provides a timestamp and the text of the recommendation. Nothing else.

In financial services, a model assists in a lending decision. The decision is challenged as discriminatory. The regulator asks: under what conditions did this model operate? What data informed this specific decision? The audit trail provides the applicant data submitted and the decision returned. The conditions under which the model processed that data are not recorded.

In legal proceedings, a model drafts analysis that influences case strategy. The analysis later proves incorrect. The question arises: was the model operating with current information? Was the system prompt appropriate for this use case? The audit trail cannot answer. It recorded the question and the answer. Everything between is missing.

The regulated context does not create the audit gap. It reveals the audit gap. The gap exists in every AI system. Regulation simply makes the consequences of the gap visible, expensive, and potentially criminal.