Evidence Map | Ouroboros Project

Core Evidence Cards

Three cards a working engineer can audit in one minute.

Read each card in the same order: claim, protocol, evidence, boundary, challenge route, and public artifacts. If a card cannot survive this order, the claim should be narrowed.

Card | What to check first | Engineering question | Failure route
WB | Repeated-round failure learning, not one-shot score | Can the task, scorer, and confidence interval be replayed? | Scorer bug, leakage, stronger baseline
PCA | Action authority requires a proof envelope | Can the system prove it is allowed to act? | Missing warrant, receipt, null arm, or regret route
RO | Relations and constraints decay over time | Can the system see which relations make action unsafe? | Relation deletion, stale evidence, unclosed control debt

WisdomBench

Learning from failure is a measurable trajectory.

Claim: An agent's first answer is not enough. The benchmark measures whether repeated failure plus feedback changes later behavior.
Formula or Protocol: Task x strategy x seed x round trajectories, scored with repeat-failure rate, normalized improvement, transfer checks, and confidence intervals.
Evidence: Public task definitions, raw scores, scorer code, bootstrap intervals, negative cells, and a reproducible artifact package.
What this does NOT prove: It does not prove human-like wisdom, universal agent improvement, or deployment readiness.
How to attack it: Find a scorer bug, leakage path, stronger baseline, irreproducible task, or claim wider than the artifact supports.
DOI-Code-Data: Zenodo / GitHub / HF dataset
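
The scoring step in this card can be sketched in a few lines. This is a minimal illustration, not the benchmark's published scorer: it assumes a trajectory is a list of per-round pass/fail outcomes for one task, strategy, and seed, and the definitions of repeat-failure rate and normalized improvement are assumptions chosen for the example. All function names are hypothetical. The interval is a plain percentile bootstrap over per-trajectory values.

```python
import random
from statistics import mean

def repeat_failure_rate(rounds):
    """Fraction of failed rounds that are immediately followed by another failure.

    `rounds` is a list of booleans (True = success), one per round.
    Assumed definition, for illustration only.
    """
    after_failure = [nxt for prev, nxt in zip(rounds, rounds[1:]) if not prev]
    if not after_failure:
        return 0.0
    return sum(1 for ok in after_failure if not ok) / len(after_failure)

def normalized_improvement(first_score, last_score, max_score=1.0):
    """Gain from first to last round as a share of the available headroom."""
    headroom = max_score - first_score
    if headroom <= 0:
        return 0.0
    return (last_score - first_score) / headroom

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval can be replayed
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# One trajectory: two failures, then two successes.
trajectory = [False, False, True, True]
print(repeat_failure_rate(trajectory))
print(normalized_improvement(first_score=0.2, last_score=0.6))
print(bootstrap_ci([0.1, 0.4, 0.5, 0.5, 0.7, 0.9]))
```

Fixing the resampling seed is what lets a reader replay the interval as well as the score, which is the first check this card asks for.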

Proof-Carrying Action

High-risk AI action needs an evidence envelope before authority.

Claim: In high-risk settings, an answer is not an action. Authority should require an auditable evidence envelope.
Formula or Protocol: goal -> observation -> relation field -> thesis -> falsifier -> warrant -> threshold -> receipt -> regret -> clean learning.
Evidence: Public schema, no-go reports, no-credit repair discipline, negative-space memory, receipt closure gaps, and counterexample intake.
What this does NOT prove: It does not prove returns, customer deployment, private execution performance, or that a proof envelope makes every action correct.
How to attack it: Show a case that passes without a stated threshold, falsifier, receipt, null arm, regret route, or no-credit boundary.
DOI-Code-Data: GitHub / public note / regret note
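
A gate over the protocol chain might look like the sketch below. It is an illustration under stated assumptions, not the project's public schema: the class name `ProofEnvelope`, the field types, and the single numeric `evidence_score` are hypothetical. It models only the chain steps from goal through regret; the final clean-learning step happens after the action and is left out, as are the null arm and no-credit boundary.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class ProofEnvelope:
    """Evidence fields named in the protocol chain; None means not supplied."""
    goal: Optional[str] = None
    observation: Optional[str] = None
    relation_field: Optional[str] = None
    thesis: Optional[str] = None
    falsifier: Optional[str] = None
    warrant: Optional[str] = None
    threshold: Optional[float] = None
    receipt: Optional[str] = None
    regret_route: Optional[str] = None

def missing_fields(envelope):
    """Names of envelope fields that were never filled in."""
    return [f.name for f in fields(envelope) if getattr(envelope, f.name) is None]

def may_act(envelope, evidence_score):
    """Deny unless every field is present and the score meets the stated threshold.

    Returns (allowed, missing_field_names).
    """
    missing = missing_fields(envelope)
    if missing:
        return False, missing
    return evidence_score >= envelope.threshold, []

# An envelope with no falsifier is refused regardless of the score.
env = ProofEnvelope(
    goal="rotate credentials", observation="key age 91 days",
    relation_field="service A depends on key", thesis="rotation is safe now",
    warrant="change ticket approved", threshold=0.8,
    receipt="audit log entry", regret_route="rollback to previous key",
)
print(may_act(env, evidence_score=0.95))
```

The gate defaults to denial: a high score cannot compensate for a missing field, which matches the attack route of showing a case that passes without a stated threshold, falsifier, or receipt.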

Relational Observability

Adaptive intelligence must observe relations, not only objects.

Claim: Many failures come from missing relations: actor pressure, constraints, stale context, control debt, and feedback loops.
Formula or Protocol: R_t = observed relations; D_c = unclosed control debt; H_e = evidence half-life. Gate action when R_t decays or D_c grows.
Evidence: Systems evidence protocol, relation-field audit tables, public interface pages, and artifact-gate work orders.
What this does NOT prove: It does not prove universal intelligence, physical robot performance, or that every relevant relation is observable in practice.
How to attack it: Find a relation deletion case where scalar metrics look safe but a necessary constraint, actor, or feedback loop is missing.
DOI-Code-Data: Technology / Systems / Portfolio Archive Index
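
The card names R_t, D_c, and H_e but does not give a functional form, so the following is one possible reading rather than the project's definition. It assumes each tracked relation has an age since it was last observed, that its weight halves every H_e time units, that R_t is the mean weight across relations, and that action is gated when R_t falls below a floor or D_c has grown since the previous check. The function names and the `r_floor` parameter are hypothetical.

```python
def relation_weight(age, half_life):
    """Exponential decay: the weight halves every `half_life` time units."""
    return 0.5 ** (age / half_life)

def observed_relations(ages, half_life):
    """R_t as the mean freshness weight over tracked relations (assumed form)."""
    if not ages:
        return 0.0
    return sum(relation_weight(a, half_life) for a in ages) / len(ages)

def gate_action(ages, half_life, debt_now, debt_before, r_floor=0.5):
    """Return (allowed, reasons). Blocks when R_t is below the floor or D_c grew."""
    reasons = []
    r_t = observed_relations(ages, half_life)
    if r_t < r_floor:
        reasons.append(f"R_t={r_t:.2f} below floor {r_floor}")
    if debt_now > debt_before:
        reasons.append(f"D_c grew from {debt_before} to {debt_now}")
    return (not reasons), reasons

# Fresh relations and stable debt: allowed.
print(gate_action(ages=[0, 2], half_life=10, debt_now=1, debt_before=1))
# Stale relations (three half-lives old): blocked.
print(gate_action(ages=[30, 30], half_life=10, debt_now=1, debt_before=1))
```

With no tracked relations the sketch returns R_t = 0 and blocks, so a relation deletion case fails closed instead of looking safe.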

Evidence Layers

What supports the research program.

Different papers use different evidence layers. The public claim is strongest when a paper combines protocol, raw logs, provenance, negative results, and explicit stop rules.

Layer | Evidence | Where used | Boundary
E0 | Formal definitions, protocols, equations, and claim registries | P04-P21 | Framework support, not empirical deployment proof
E1 | Longitudinal API / text-agent panels with repeated rounds and seeds | P01-P04 | Text-agent evidence, not robot evidence
E2 | RLBench self-trained low-dimensional imitation baseline, 6,300 trials | P05-P06 | Not a public VLA/SOTA leaderboard
E3 | Public-factory sidecars and supported-set VLA/LIBERO evaluation | P05-P07 | Supported-set evidence, not full benchmark domination
E4 | Macro mining, representation compression, and perspectival grounding registries | P08 | Representation-search evidence, not priority over all compression research
E5 | Failure-antigen labels, recovery-adapter pilots, and trajectory provenance | P09 | Offline/pilot recovery evidence, not full trained-policy improvement
E6 | World-model counterfactual, cybernetic, social, and supra-body ablation artifacts | P10-P14 | Simulation and protocol evidence, not real-world deployment
E7 | Public simulator bridges and AI-for-science benchmark proposals | P19 | Benchmark framing, not climate-control authority
E8 | Adverse perception, detector/tracker logs, DAWN1027/COCO128/public panels | P20 | Robust evidence gating, not detector SOTA
E9 | Human-intervention residual schemas and synthetic/public teleoperation ingestion | P21 | No human-subject or wearable-robot validation yet

Claim Boundary

What the public archive does not claim.

The project does not claim universal proof over all AI systems, physical robot deployment, detector SOTA, autonomous enforcement, trading returns, medical validity, or digital identity continuity.

The strongest current public claim is narrower: reliable intelligence should be evaluated after experience, failure, feedback, and perturbation, and the evidence must retain provenance and explicit limitations.

Falsification

What would change our mind.

A claim boundary is only useful if the project names what would weaken it. These are public routes for lowering, revising, or repairing a claim.

Counterexample | Effect | Repair | Public route
Scorer bug | Metric or table changes | Patch scorer and version the result | WisdomBench issue
Proof-envelope false positive | Action gate is too weak | Tighten ActionWarrant and add a test | Proof action issue
No-go false block | System is over-conservative | Add negative-space task and wait receipt | Counterexample packet
Public simulator mismatch | Evidence tier must be lowered | Move claim to weaker layer | Evidence-boundary edit
Private data required | Public claim is unsupported | Remove or mark as private-only | Claim-boundary correction

Open the public counterexample route / View example packets

Artifact/Repro Scope

Gate 1: Metric
What exactly is measured: first-attempt success, improvement, recovery, transfer, or action gating?

Gate 2: Provenance
Raw logs, manifests, versioned files, supported-set boundaries, and no duplicate evidence cells.

Gate 3: Negative Results
Failed cells, weak baselines, low success counts, and limitations are retained rather than hidden.

Gate 4: Cost
Evidence should state whether it came from local runs, API panels, cloud GPU work, public simulators, or small pilots.
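
The four gates above can be read as a checklist over an artifact manifest. The sketch below is illustrative only: the manifest keys (`metric`, `raw_logs`, `evidence_cells`, `negative_results`, `cost_source`) and the function name are hypothetical, and the project's actual manifest format may differ.

```python
ALLOWED_COST_SOURCES = {
    "local run", "API panel", "cloud GPU", "public simulator", "small pilot",
}

def check_artifact(manifest):
    """Return a list of gate failures; an empty list means all four gates pass."""
    failures = []
    # Gate 1: the measured quantity must be named.
    if not manifest.get("metric"):
        failures.append("Gate 1: metric not named")
    # Gate 2: raw logs must exist and evidence cells must not repeat.
    cells = manifest.get("evidence_cells", [])
    if not manifest.get("raw_logs") or len(cells) != len(set(cells)):
        failures.append("Gate 2: missing raw logs or duplicate evidence cells")
    # Gate 3: the key must be present, even if the list is empty.
    if "negative_results" not in manifest:
        failures.append("Gate 3: negative results not reported")
    # Gate 4: the evidence source must be one of the named kinds.
    if manifest.get("cost_source") not in ALLOWED_COST_SOURCES:
        failures.append("Gate 4: cost source not stated")
    return failures

manifest = {
    "metric": "recovery",
    "raw_logs": ["run_001.jsonl"],
    "evidence_cells": ["task1/seed0", "task1/seed1"],
    "negative_results": ["task3: 0/10 successes"],
    "cost_source": "local run",
}
print(check_artifact(manifest))
```

Gate 3 checks for the presence of the `negative_results` key, not for a non-empty list, so an artifact can state that no failed cells were found but cannot stay silent on the question.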

Why It Matters

Reliable action needs more than a single model response.

The evidence program treats perception, memory, social calibration, failure recovery, workflow context, and action boundaries as coordinated subsystems. This is why the public site separates papers from evidence and why every strong claim has a boundary.

Use DOI records for public priority and inspectability.
Keep venue-specific or non-public materials outside the public evidence route.
Use evidence pages to explain what is proven, partial, or pending.
Reserve claims about real robots and independent human studies for future external validation.
