Evidence Discipline

Evidence map and claim boundaries.

This page is the shortest public route for checking what the project claims, what evidence exists, what that evidence does not prove, and how to attack the results without access to private systems.

[Figure: Evidence and residual scoring visual.]

Core Evidence Cards

Three cards a working engineer can audit in one minute.

Read each card in the same order: claim, protocol, evidence, boundary, attack route, and public artifacts. If a card cannot survive this order, the claim should be narrowed.

Card | What to check first | Engineering question | Failure route
WB | Repeated-round failure learning, not one-shot score | Can the task, scorer, and confidence interval be replayed? | Scorer bug, leakage, stronger baseline
PCA | Action authority requires a proof envelope | Can the system prove it is allowed to act? | Missing warrant, receipt, null arm, or regret route
RO | Relations and constraints decay over time | Can the system see which relations make action unsafe? | Relation deletion, stale evidence, unclosed control debt

WisdomBench

Learning from failure is a measurable trajectory.

Claim
An agent's first answer is not enough. The benchmark measures whether repeated failure plus feedback changes later behavior.
Formula or Protocol
Task x strategy x seed x round trajectories, scored with repeat-failure rate, normalized improvement, transfer checks, and confidence intervals.
Evidence
Public task definitions, raw scores, scorer code, bootstrap intervals, negative cells, and a reproducible artifact package.
What this does NOT prove
It does not prove human-like wisdom, universal agent improvement, or deployment readiness.
How to attack it
Find a scorer bug, leakage path, stronger baseline, irreproducible task, or claim wider than the artifact supports.
DOI-Code-Data
Zenodo / GitHub / HF dataset
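
The scoring protocol on this card can be sketched in a few lines. This is a minimal illustration, not the project's scorer: the function names, the sample trajectories, and the exact definitions of repeat-failure rate and normalized improvement are assumptions made for the example. The confidence interval uses a standard percentile bootstrap over per-trajectory values.

```python
import random
from statistics import mean


def repeat_failure_rate(rounds):
    """Share of post-failure rounds that fail again (assumed definition)."""
    after_failure = [later for earlier, later in zip(rounds, rounds[1:]) if not earlier]
    if not after_failure:
        return 0.0
    return sum(1 for ok in after_failure if not ok) / len(after_failure)


def normalized_improvement(first, last):
    """Gain from first to last round as a share of remaining headroom (assumed)."""
    if first >= 1.0:
        return 0.0
    return (last - first) / (1.0 - first)


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical data: (task, strategy, seed) -> success flag per round.
trajectories = {
    ("task_a", "reflect", 0): [False, False, True, True],
    ("task_a", "reflect", 1): [False, True, True, True],
    ("task_a", "baseline", 0): [False, False, False, True],
    ("task_a", "baseline", 1): [True, False, False, False],
}

rates = [repeat_failure_rate(r) for r in trajectories.values()]
first = mean(int(r[0]) for r in trajectories.values())
last = mean(int(r[-1]) for r in trajectories.values())
print("repeat-failure rates:", rates)
print("normalized improvement:", normalized_improvement(first, last))
print("95% CI for mean repeat-failure rate:", bootstrap_ci(rates))
```

Fixing the bootstrap seed keeps the interval replayable, which is the property the card asks an auditor to check first.
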
Proof-Carrying Action

High-risk AI action needs an evidence envelope before authority.

Claim
In high-risk settings, an answer is not an action. Authority should require an auditable evidence envelope.
Formula or Protocol
goal -> observation -> relation field -> thesis -> falsifier -> warrant -> threshold -> receipt -> regret -> clean learning.
Evidence
Public schema, no-go reports, no-credit repair discipline, negative-space memory, receipt closure gaps, and counterexample intake.
What this does NOT prove
It does not prove returns, customer deployment, private execution performance, or that a proof envelope makes every action correct.
How to attack it
Show a case that passes without a stated threshold, falsifier, receipt, null arm, regret route, or no-credit boundary.
DOI-Code-Data
GitHub / public note / regret note
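
One way to picture the envelope check is as a record whose pre-action fields must all be filled before authority is granted. The sketch below is illustrative only: `ProofEnvelope`, `PRE_ACTION_FIELDS`, and the helper functions are hypothetical names, and the field list is assembled from the protocol and attack route on this card rather than taken from the public schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProofEnvelope:
    """Hypothetical envelope; every field is free text for illustration."""
    goal: Optional[str] = None
    observation: Optional[str] = None
    relation_field: Optional[str] = None
    thesis: Optional[str] = None
    falsifier: Optional[str] = None
    warrant: Optional[str] = None
    threshold: Optional[str] = None
    null_arm: Optional[str] = None
    regret_route: Optional[str] = None
    receipt: Optional[str] = None  # written only after the action runs


# Assumed split: everything except the receipt must exist before acting.
PRE_ACTION_FIELDS = (
    "goal", "observation", "relation_field", "thesis", "falsifier",
    "warrant", "threshold", "null_arm", "regret_route",
)


def missing_before_action(envelope):
    """Names of required pre-action fields that are empty or absent."""
    return [name for name in PRE_ACTION_FIELDS if not getattr(envelope, name)]


def may_act(envelope):
    """Authority is granted only when no pre-action field is missing."""
    return not missing_before_action(envelope)


envelope = ProofEnvelope(
    goal="rotate credentials",
    observation="key age exceeds policy",
    relation_field="service owner, dependent jobs",
    thesis="rotation reduces exposure",
    warrant="policy section reference",
    threshold="rotate if key age > 90 days",
    null_arm="do nothing for one cycle",
    regret_route="restore previous key",
)
print(may_act(envelope), missing_before_action(envelope))
```

In this example the envelope has no falsifier, so the check refuses authority and names the gap, which mirrors the attack route stated on the card.
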
Relational Observability

Adaptive intelligence must observe relations, not only objects.

Claim
Many failures come from missing relations: actor pressure, constraints, stale context, control debt, and feedback loops.
Formula or Protocol
R_t = observed relations; D_c = unclosed control debt; H_e = evidence half-life. Gate action when R_t decays or D_c grows.
Evidence
Systems evidence protocol, relation-field audit tables, public interface pages, and artifact-gate work orders.
What this does NOT prove
It does not prove universal intelligence, physical robot performance, or that every relevant relation is observable in practice.
How to attack it
Find a relation deletion case where scalar metrics look safe but a necessary constraint, actor, or feedback loop is missing.
DOI-Code-Data
Technology / Systems / Portfolio DOI
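
The gating rule on this card can be sketched under some loudly stated assumptions: each observed relation loses weight exponentially with half-life H_e, R_t is the mean weight over the relations an action requires (unobserved relations count as zero), and D_c is a simple count of open control items. None of these choices, nor the names and the 0.5 floor, come from the project's protocol; they only make the rule concrete.

```python
def freshness(age, half_life):
    """Weight of an observation of the given age under exponential decay."""
    return 0.5 ** (age / half_life)


def relation_coverage(required, observed_ages, half_life):
    """R_t: mean freshness over required relations; unobserved ones score 0."""
    if not required:
        return 1.0
    total = sum(
        freshness(observed_ages[name], half_life) if name in observed_ages else 0.0
        for name in required
    )
    return total / len(required)


def gate(r_now, debt_now, debt_prev, r_floor=0.5):
    """Return the reasons to block the action; an empty list means proceed."""
    reasons = []
    if r_now < r_floor:
        reasons.append("relation coverage decayed below floor")
    if debt_now > debt_prev:
        reasons.append("unclosed control debt grew")
    return reasons


# Hypothetical relations and observation ages, in hours.
required = ["actor_pressure", "constraint", "feedback_loop"]
observed_ages = {"actor_pressure": 2.0, "constraint": 30.0}  # feedback_loop unseen

r_t = relation_coverage(required, observed_ages, half_life=24.0)
print(round(r_t, 3), gate(r_t, debt_now=3, debt_prev=2))
```

Scoring a missing relation as zero is what lets the relation-deletion attack show up in R_t even when scalar task metrics look safe.
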

Evidence Layers

What supports the research program.

Different papers use different evidence layers. The public claim is strongest when a paper combines protocol, raw logs, provenance, negative results, and explicit stop rules.

Layer | Evidence | Where used | Boundary
E0 | Formal definitions, protocols, equations, and claim registries | P04-P21 | Framework support, not empirical deployment proof
E1 | Longitudinal API / text-agent panels with repeated rounds and seeds | P01-P04 | Text-agent evidence, not robot evidence
E2 | RLBench self-trained low-dimensional imitation baseline, 6,300 trials | P05-P06 | Not a public VLA/SOTA leaderboard
E3 | Public-factory sidecars and supported-set VLA/LIBERO evaluation | P05-P07 | Supported-set evidence, not full benchmark domination
E4 | Macro mining, representation compression, and perspectival grounding registries | P08 | Representation-search evidence, not priority over all compression research
E5 | Failure-antigen labels, recovery-adapter pilots, and trajectory provenance | P09 | Offline/pilot recovery evidence, not full trained-policy improvement
E6 | World-model counterfactual, cybernetic, social, and supra-body ablation artifacts | P10-P14 | Simulation and protocol evidence, not real-world deployment
E7 | Public simulator bridges and AI-for-science benchmark proposals | P19 | Benchmark framing, not climate-control authority
E8 | Adverse perception, detector/tracker logs, DAWN1027/COCO128/public panels | P20 | Robust evidence gating, not detector SOTA
E9 | Human-intervention residual schemas and synthetic/public teleoperation ingestion | P21 | No human-subject or wearable-robot validation yet

Claim Boundary

What the public archive does not claim.

The project does not claim universal proof over all AI systems, physical robot deployment, detector SOTA, autonomous enforcement, trading returns, medical validity, or digital identity continuity.

The strongest current public claim is narrower: reliable intelligence should be evaluated after experience, failure, feedback, and perturbation, and the evidence must retain provenance and explicit limitations.

Falsification

What would change our mind.

A claim boundary is only useful if the project names what would weaken it. These are public routes for lowering, revising, or repairing a claim.

Counterexample | Effect | Repair | Public route
Scorer bug | Metric or table changes | Patch scorer and version the result | WisdomBench issue
Proof-envelope false positive | Action gate is too weak | Tighten ActionWarrant and add a test | Proof action issue
No-go false block | System is over-conservative | Add negative-space task and wait receipt | Counterexample packet
Public simulator mismatch | Evidence tier must be lowered | Move claim to weaker layer | Evidence-boundary edit
Private data required | Public claim is unsupported | Remove or mark as private-only | Claim-boundary correction

Open the public counterexample route / View example packets

Artifact/Repro Scope

Gate 1

Metric

What exactly is measured: first-attempt success, improvement, recovery, transfer, or action gating?

Gate 2

Provenance

Raw logs, manifests, versioned files, and supported-set boundaries are published, with no duplicate evidence cells.

Gate 3

Negative Results

Failed cells, weak baselines, low success counts, and limitations are retained rather than hidden.

Gate 4

Cost

Evidence should state whether it came from local runs, API panels, cloud GPU work, public simulators, or small pilots.
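
The four gates can be read as a checklist over an artifact manifest. The sketch below assumes a hypothetical manifest layout (the keys `metric`, `files`, `evidence_cells`, `failed_cells`, and `source`, plus the file path, are invented for illustration) and shows one possible pass/fail rule per gate; it is not the project's artifact-gate tooling.

```python
# Allowed values are taken from the gate descriptions; the spellings are assumed.
ALLOWED_METRICS = {
    "first_attempt_success", "improvement", "recovery", "transfer", "action_gating",
}
ALLOWED_SOURCES = {
    "local_run", "api_panel", "cloud_gpu", "public_simulator", "small_pilot",
}


def check_gates(manifest):
    """Return a pass/fail flag for each of the four gates."""
    files = manifest.get("files", [])
    cells = manifest.get("evidence_cells", [])
    return {
        # Gate 1: the measured quantity is named explicitly.
        "metric": manifest.get("metric") in ALLOWED_METRICS,
        # Gate 2: every file is versioned and no evidence cell is duplicated.
        "provenance": bool(files)
        and all(f.get("version") for f in files)
        and len(set(cells)) == len(cells),
        # Gate 3: failed cells are listed, even if the list is empty.
        "negative_results": manifest.get("failed_cells") is not None,
        # Gate 4: the compute or data source is stated.
        "cost": manifest.get("source") in ALLOWED_SOURCES,
    }


# Hypothetical manifest for a small recovery study.
manifest = {
    "metric": "recovery",
    "files": [{"path": "raw/run_001.jsonl", "version": "v1"}],
    "evidence_cells": [("task_a", "reflect", 0), ("task_a", "reflect", 1)],
    "failed_cells": [("task_a", "reflect", 1)],
    "source": "local_run",
}
print(check_gates(manifest))
```

Requiring the `failed_cells` key to be present, rather than merely non-empty, is a design choice in this sketch: a study with no failures still has to say so.
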

[Figure: Supra-body architecture visual.]

Why It Matters

Reliable action needs more than a single model response.

The evidence program treats perception, memory, social calibration, failure recovery, workflow context, and action boundaries as coordinated subsystems. This is why the public site separates papers from evidence and why every strong claim has a boundary.

  • Use DOI records for public priority and inspectability.
  • Use anonymous packages for double-blind venues.
  • Use evidence pages to explain what is proven, partial, or pending.
  • Reserve claims about real robots and independent human studies for future external validation.