In clean benchmarks, an output can look convincing because the evaluation surface is fixed. In deployment, the world is dirty: sensors drift, maps expire, logs go missing, and failure cases appear outside the prompt. A reliable system therefore needs a claim-to-replay contract before a claim is treated as action-worthy.
The contract is simple: for each claim, record the input, the assumptions, the evidence path, known failure samples, the boundary conditions, and verification of any repair. If an independent process cannot replay the claim from that record, the system should lower its autonomy rather than feign certainty.
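One way to make the contract concrete is a small record type plus a replay check. The sketch below is illustrative only: the names `ClaimRecord`, `Autonomy`, `can_replay`, and `allowed_autonomy`, the three autonomy levels, and the rule of stepping down exactly one level are all assumptions made for the example, not part of the contract as stated.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Any, Callable, List


class Autonomy(IntEnum):
    # Hypothetical levels; the text only says autonomy should be lowered.
    HUMAN_REVIEW = 0
    SUPERVISED = 1
    AUTONOMOUS = 2


@dataclass
class ClaimRecord:
    """One claim plus the six items the contract asks to record."""
    claim: Any                                   # the output being asserted
    input_data: Any                              # input that produced the claim
    assumptions: List[str] = field(default_factory=list)
    evidence_path: List[str] = field(default_factory=list)
    failure_samples: List[Any] = field(default_factory=list)
    boundary_conditions: List[str] = field(default_factory=list)
    repair_verified: bool = False                # was a repair re-checked?


def can_replay(record: ClaimRecord, replay: Callable[[Any], Any]) -> bool:
    """Return True if `replay` reproduces the claim from the recorded input.

    `replay` stands in for "another process"; a missing input, an empty
    evidence path, or any exception counts as a failed replay (assumption).
    """
    if record.input_data is None or not record.evidence_path:
        return False
    try:
        return replay(record.input_data) == record.claim
    except Exception:
        return False


def allowed_autonomy(
    record: ClaimRecord,
    replay: Callable[[Any], Any],
    requested: Autonomy,
) -> Autonomy:
    """Grant the requested level only if the claim replays; else step down one."""
    if can_replay(record, replay):
        return requested
    return Autonomy(max(requested - 1, Autonomy.HUMAN_REVIEW))


if __name__ == "__main__":
    record = ClaimRecord(
        claim=4,
        input_data=(2, 2),
        assumptions=["inputs are integers"],
        evidence_path=["adder"],
        boundary_conditions=["no overflow handling"],
    )
    # A replay that agrees with the claim keeps the requested level.
    print(allowed_autonomy(record, lambda xy: xy[0] + xy[1], Autonomy.AUTONOMOUS))
    # A replay that disagrees lowers it by one level.
    print(allowed_autonomy(record, lambda xy: xy[0] * 3, Autonomy.AUTONOMOUS))
```

Treating a replay exception the same as a mismatch is a deliberate choice in this sketch: a claim that cannot be re-run is handled like a claim that re-runs to a different answer, so missing logs or expired maps reduce autonomy instead of passing silently.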