Public Release - 2026-05-27

Evidence has a shape.

Evidence is not a paragraph that supports a claim. For high-risk AI, evidence should be an inspectable object that carries provenance, assumptions, an artifact, a replay path, a boundary, a failure case, a timestamp, and a hash.

[Figure: claim-to-receipt visual showing a claim being validated and structured into an inspectable receipt schema.]

Definition

A claim is a sentence. A receipt is an object.

Many AI systems leave explanations. Fewer leave receipts. The difference is not rhetorical. A receipt is machine-readable enough to be indexed, replayed, bounded, expired, challenged, and repaired.

This is the narrow point of today's release: evidence is not attitude, confidence, or a polished narrative. Evidence has a structure that can fail.

Receipt Schema

The minimum shape of an evidence object.

Field | Question | Failure if missing | Public use
claim_id | Which claim is supported? | Evidence floats without target. | Maps artifacts to statements.
source | Where did the evidence originate? | Post-hoc contamination hides. | Preserves provenance.
assumption | What must be true? | The claim silently overreaches. | Exposes dependency.
artifact | Which file, log, trace, or table supports it? | No inspectable object exists. | Connects claim to material.
replay_path | How can another reader inspect it? | The evidence cannot be checked. | Turns trust into procedure.
boundary | What does this not prove? | A local result becomes a universal claim. | Prevents overclaiming.
failure_case | What would break it? | The claim becomes unfalsifiable. | Invites useful challenges.
timestamp/hash | When and which exact object? | Versions blur together. | Fixes identity.
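
The table above can be read as a validation rule: each missing field produces a named failure. The following is a minimal Python sketch, assuming a receipt is a plain dictionary keyed by the field names in the table. The function name `missing_field_failures` and the choice to treat empty values as missing are illustrative assumptions, not part of the release.

```python
# Failure messages copied from the "Failure if missing" column of the table.
# The combined timestamp/hash row is split into two keys here (an assumption).
FAILURE_IF_MISSING = {
    "claim_id": "Evidence floats without target.",
    "source": "Post-hoc contamination hides.",
    "assumption": "The claim silently overreaches.",
    "artifact": "No inspectable object exists.",
    "replay_path": "The evidence cannot be checked.",
    "boundary": "A local result becomes a universal claim.",
    "failure_case": "The claim becomes unfalsifiable.",
    "timestamp": "Versions blur together.",
    "hash": "Versions blur together.",
}


def missing_field_failures(receipt: dict) -> dict:
    """Return {field: failure} for every schema field that is absent or empty."""
    failures = {}
    for field, failure in FAILURE_IF_MISSING.items():
        value = receipt.get(field)
        # Assumption: None, "", and [] all count as missing.
        if value is None or value == "" or value == []:
            failures[field] = failure
    return failures


# Usage: a hypothetical receipt that names a claim and an artifact but nothing else.
partial = {"claim_id": "example_claim", "artifact": ["raw score file"]}
for field, failure in missing_field_failures(partial).items():
    print(f"{field}: {failure}")
```

Returning the failure text rather than a boolean keeps the check aligned with the table: a reader sees which field is absent and what goes wrong because of it.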

Positive Receipt

Supports a bounded claim with provenance, artifact identity, replay path, and an explicit failure condition.

Negative Receipt

Records why the current evidence is not sufficient: missing nulls, stale provenance, incomplete execution certificate, or no clean denominator.

Failed Receipt

Shows where the evidence chain breaks and turns that break into a repair target rather than a hidden weakness.

Counterexample Receipt

Lets an external reader submit the claim, required receipt, missing field, and wrong-action risk in a compact format.

Bad Evidence Shapes

Not every artifact is a receipt.

A screenshot without provenance is weak. A table without a denominator is incomplete. A log without a replay path is opaque. A claim without a boundary is too large. A post-hoc explanation without a timestamp can reward the system for inventing a story after the fact.

The purpose of a receipt schema is not to make the system sound more rigorous. It is to make weak evidence fail in public before it becomes action authority.

Minimal Interface

A claim should compile into a receipt.

{
  "claim_id": "agent_improves_after_failure_rounds",
  "source": "public benchmark artifact",
  "assumption": ["same task family", "same scoring rule", "no post-hoc relabeling"],
  "artifact": ["task table", "raw score file", "scoring script"],
  "replay_path": "run scorer on released artifact",
  "boundary": ["does not prove human-like wisdom", "does not prove deployment safety"],
  "failure_case": ["stronger baseline wins", "score leak found", "task labels inconsistent"],
  "timestamp": "2026-05-27",
  "hash": "artifact identity"
}
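
The `hash` value in the example above is a placeholder. One way to fix artifact identity, sketched here under stated assumptions, is to hash the artifact's raw bytes and let a later reader recompute the digest. The use of SHA-256, the helper name `artifact_hash`, and the sample bytes are illustrative assumptions, not something the release specifies.

```python
import hashlib
import json


def artifact_hash(data: bytes) -> str:
    """Hex SHA-256 digest of the artifact's raw bytes."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical artifact contents standing in for a released raw score file.
raw_scores = b"task,score\nt1,0.5\nt2,0.75\n"

receipt = {
    "claim_id": "agent_improves_after_failure_rounds",
    "artifact": ["raw score file"],
    "timestamp": "2026-05-27",
    "hash": artifact_hash(raw_scores),
}

# A reader replaying the receipt recomputes the digest and compares.
assert receipt["hash"] == artifact_hash(raw_scores)
# Any change to the artifact yields a different digest.
assert receipt["hash"] != artifact_hash(raw_scores + b"t3,0.9\n")

print(json.dumps(receipt, indent=2))
```

With a content hash in place, the timestamp says when the receipt was issued and the digest says which exact object it refers to, so two versions of the same file can no longer be confused.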

Counterexample Route

The best attack is a minimal failing receipt.

A useful counterexample is not a vague objection. It identifies the original claim, the receipt that should support it, the missing or broken field, and why that gap could lead to a wrong action.
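
The four parts of a useful counterexample can be written down as a small record. This is a minimal sketch, assuming each part is free text; the class name `Counterexample`, its field names, and the `is_specific` check are hypothetical and do not describe the project's actual submission format.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Counterexample:
    claim: str              # the original claim being challenged
    required_receipt: str   # the receipt that should support it
    broken_field: str       # the missing or broken receipt field
    wrong_action_risk: str  # why the gap could lead to a wrong action

    def is_specific(self) -> bool:
        """True only if every part is filled in; a vague objection fails this."""
        return all(value.strip() for value in asdict(self).values())


# Hypothetical submission against the example claim in this release.
challenge = Counterexample(
    claim="agent_improves_after_failure_rounds",
    required_receipt="raw score file with scoring script",
    broken_field="replay_path",
    wrong_action_risk="An unreplayable score could be used to justify an action.",
)
print(challenge.is_specific())

# An objection that names the claim but nothing else is not yet a counterexample.
vague = Counterexample("agent_improves_after_failure_rounds", "", "", "")
print(vague.is_specific())
```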

That is the public standard we want: do not trust the project first; inspect the shape of its evidence.

Boundary

What this release does not claim.

Not claimed | Reason | Current public position | Next evidence needed
Production safety | Receipt interfaces are not deployment certification. | Public protocol and artifact surface. | Third-party deployment audit.
Financial performance | Evidence discipline is not live alpha. | Proof-carrying action infrastructure. | Clean denominator and post-cost edge.
Real-robot validation | Public simulation and benchmark evidence do not replace hardware. | Robot-learning research layer. | Physical robot logs.
Universal reliability | A receipt can support only a bounded claim. | Claim-to-receipt interface. | Stronger counterexamples and independent replication.