Counterexample Packets | Ouroboros Project

Packet A

Formula counterexample.

Use this when a published formula, gate, invariant, or protocol rule fails on a concrete public case.

| Field | What to submit | Invalid if | Minimal replay |
| --- | --- | --- | --- |
| Target rule | Formula id, paper section, registry row, or protocol rule being challenged. | The target is a general theme rather than an exact rule. | Open the cited rule and list its variables exactly. |
| Counterexample state | Concrete variable assignment, boundary case, toy state, or public trace snippet. | The case needs private assumptions or hidden system behavior. | Substitute values into the rule and show the contradiction. |
| Expected vs observed | What the rule predicts and what the counterexample shows. | The result depends on changing the stated rule. | Check whether the inequality, gate, invariant, or transition fails. |
| Repair | Narrow domain, add precondition, split rule, or downgrade claim boundary. | The repair demands private disclosure. | File a proof-carrying-action issue with the exact patch target. |
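The "substitute values and show the contradiction" replay can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FormulaPacket`, `replay`, and the toy rule "for all x >= 0, x**2 >= x" are hypothetical and are not taken from any project formula.

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, float]


@dataclass
class FormulaPacket:
    """Hypothetical container for a Packet A submission."""
    target_rule: str                      # exact rule id or statement being challenged
    state: State                          # concrete variable assignment
    rule: Callable[[State], bool]         # the rule as stated, unchanged
    in_domain: Callable[[State], bool]    # the rule's stated preconditions


def replay(packet: FormulaPacket) -> str:
    """Substitute the state into the rule and report the outcome."""
    if not packet.in_domain(packet.state):
        return "invalid: state is outside the rule's stated domain"
    if packet.rule(packet.state):
        return "no counterexample: rule holds on this state"
    return "counterexample: rule fails on this state"


# Toy rule, for illustration only: "for all x >= 0, x**2 >= x".
toy = FormulaPacket(
    target_rule="toy-rule-1",
    state={"x": 0.5},
    rule=lambda s: s["x"] ** 2 >= s["x"],
    in_domain=lambda s: s["x"] >= 0,
)
result = replay(toy)
print(result)
```

Keeping the domain check separate from the rule check mirrors the "Invalid if" column: a state outside the stated preconditions is rejected before the rule is evaluated, and a natural repair for the toy rule would be narrowing its domain to x >= 1.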

Packet B

Data leakage.

Use this when a benchmark or evidence table may have used information that should not be available.

| Field | What to submit | Invalid if | Minimal replay |
| --- | --- | --- | --- |
| Leak target | Dataset row ids, split ids, task ids, timestamp, provenance record, or manifest hash. | The report relies on non-public data or unverifiable suspicion. | Locate the same ids in public artifact files. |
| Leak mechanism | Duplicate rows, train/test overlap, future timestamp, label in prompt, answer in rubric, or split mismatch. | The mechanism is only "maybe the model knew it." | Run duplicate, timestamp, split, or label-presence checks. |
| Observed impact | Which metric, claim, or table row may be invalidated. | No affected result is named. | Recompute the affected row with leaked items removed or quarantined. |
| Repair | Quarantine rows, rebuild split, add leakage CI, or downgrade the evidence claim. | The repair requires exposing private logs. | Open a WisdomBench or evidence-boundary issue. |
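The duplicate, split, and timestamp checks named in the replay column might look like the sketch below. The function names, row ids, and dates are hypothetical and chosen for illustration; real checks would read ids from the public artifact files.

```python
from datetime import date
from typing import Dict, Iterable, List, Set


def duplicate_ids(ids: Iterable[str]) -> Set[str]:
    """Return ids that appear more than once within a single split."""
    seen: Set[str] = set()
    dups: Set[str] = set()
    for row_id in ids:
        if row_id in seen:
            dups.add(row_id)
        seen.add(row_id)
    return dups


def split_overlap(train_ids: Iterable[str], test_ids: Iterable[str]) -> Set[str]:
    """Return ids present in both the train and the test split."""
    return set(train_ids) & set(test_ids)


def future_rows(timestamps: Dict[str, date], cutoff: date) -> List[str]:
    """Return ids whose timestamp is later than the stated data cutoff."""
    return sorted(row_id for row_id, ts in timestamps.items() if ts > cutoff)


# Hypothetical ids and dates, for illustration only.
train = ["r1", "r2", "r3", "r3"]
test = ["r3", "r4"]
stamps = {"r1": date(2024, 1, 5), "r4": date(2024, 6, 1)}

leak_report = {
    "duplicates_in_train": duplicate_ids(train),
    "train_test_overlap": split_overlap(train, test),
    "after_cutoff": future_rows(stamps, cutoff=date(2024, 3, 1)),
}
print(leak_report)
```

Each check returns the offending ids rather than a yes/no answer, so the report can name the exact rows to quarantine.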

Packet C

Stronger baseline.

Use this when a simpler public baseline beats or matches a reported effect under the same boundary.

| Field | What to submit | Invalid if | Minimal replay |
| --- | --- | --- | --- |
| Target metric | Repeat failure rate, normalized improvement, no-go false block rate, action coverage, or task score. | The metric differs from the project metric. | Confirm the metric, split, scoring script, and sample count match. |
| Baseline input | Public script, task subset, seeds, model family, allowed context, and scoring command. | The baseline uses extra information or fewer constraints. | Run the baseline under the same evidence boundary. |
| Observed result | The baseline wins, matches, or removes the claimed effect with confidence interval or exact count. | Only cherry-picked examples are shown. | Compare aggregate and task-level results under the same stop rule. |
| Repair | Add the baseline, revise effect size, retain negative result, or narrow the contribution. | The repair asks to hide failed results. | Open a WisdomBench or evidence-boundary issue. |
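One way to report "confidence interval or exact count" is a Wilson score interval over success counts. The choice of Wilson interval, the `compare` helper, and the counts below are assumptions made for this sketch; the project may prescribe a different interval or scoring script.

```python
import math
from typing import Dict, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def compare(reported: Tuple[int, int], baseline: Tuple[int, int]) -> Dict[str, object]:
    """Summarize two (successes, n) counts scored under the same boundary."""
    rep_s, rep_n = reported
    base_s, base_n = baseline
    return {
        "same_sample_count": rep_n == base_n,
        "reported_rate": rep_s / rep_n,
        "baseline_rate": base_s / base_n,
        "reported_ci": wilson_interval(rep_s, rep_n),
        "baseline_ci": wilson_interval(base_s, base_n),
    }


# Hypothetical counts, for illustration only.
summary = compare(reported=(62, 100), baseline=(60, 100))
print(summary)
```

The summary deliberately stops short of declaring a winner: it places exact counts and intervals side by side, and the `same_sample_count` flag surfaces one of the mismatches that would make the packet invalid.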

Packet D

Reproduction failure.

Use this when a public task, scorer, command, dataset card, or artifact package cannot be replayed.

| Field | What to submit | Invalid if | Minimal replay |
| --- | --- | --- | --- |
| Target artifact | Task id, scorer function, file name, manifest row, command, or public release hash. | The target cannot be identified from public files. | Open the artifact and confirm the exact version. |
| Replay context | OS, Python version if relevant, command, seed, expected output, and observed output. | The report omits the command or depends on a local private file. | Run the command from a clean checkout or public package. |
| Failure mode | Missing file, broken import, nondeterministic score, scorer mismatch, or undocumented dependency. | The report is only "it did not work." | Confirm the smallest failing step and attach safe logs. |
| Repair | README patch, pinned dependency, manifest update, CI test, or result quarantine. | The repair requires private credentials. | Open the relevant public issue with no secrets. |
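The replay context row lists fields that can be captured automatically when the command is run. The sketch below assumes a hypothetical `replay` helper and uses a stand-in command; a real packet would cite the public command from the artifact under test.

```python
import platform
import subprocess
import sys
from typing import Dict, List


def replay(command: List[str], seed: int, expected: str) -> Dict[str, object]:
    """Run a public command and record the fields a reproduction packet asks for."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=60)
    observed = proc.stdout.strip()
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "command": command,
        "seed": seed,  # recorded only; the command itself must consume the seed
        "expected": expected,
        "observed": observed,
        "exit_code": proc.returncode,
        "matches": proc.returncode == 0 and observed == expected,
    }


# Stand-in command, for illustration only.
report = replay(
    [sys.executable, "-c", "print('score=0.5')"],
    seed=0,
    expected="score=0.5",
)
print(report["matches"])
```

Recording expected and observed output together keeps the report from collapsing into "it did not work": a mismatch names the exact difference, and standard error can be attached separately once it has been checked for secrets.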

Packet E

Claim-boundary overreach.

Use this when public wording claims more than the public evidence can support.

| Field | What to submit | Invalid if | Minimal replay |
| --- | --- | --- | --- |
| Exact wording | Quote the sentence, page, card, DOI note, README line, or public page section. | The challenged claim is paraphrased too broadly. | Trace the exact words to the linked evidence. |
| Evidence envelope | Which artifact supports it, what sample/domain it covers, and what it does not cover. | The critique demands proof of a claim not made. | Check artifact manifest, limitations, and claim boundary. |
| Overreach | Missing external validation, real-world deployment, human study, robot hardware, or independent replication. | The missing evidence is already disclosed as a limitation. | Decide whether the public wording still needs narrowing. |
| Repair | Boundary downgrade, limitation sentence, evidence-card update, or stronger citation route. | The repair asks for private system disclosure. | Open a website or evidence-boundary issue. |

Packet F

Credit leak.

Use this when repair work, semantic guesses, bootstrap notes, or paper-only artifacts appear to receive metric, reward, denominator, gate, or clean-learning credit.

| Field | What to submit | Invalid if | Minimal replay |
| --- | --- | --- | --- |
| Target credit path | Metric, reward, denominator, gate, clean-learning, or public evidence-card field. | No public claim, log, table, or field is named. | Inspect public artifact, registry, and source tags. |
| Forbidden source | Repair intent, semantic guess, bootstrap note, paper-only result, price-only receipt, or private-only trace. | It requires private logs to judge. | Remove or quarantine the source and recompute the affected row if public data permits. |
| Observed leak | Where the credit entered a table, claim, issue, demo, or public route. | It only says the system might have self-rewarded. | Compare no-credit flag, denominator rule, and claim boundary. |
| Repair | Quarantine, no-credit label, denominator recompute, clean/dirty split, or regression test. | The repair asks for private execution disclosure. | Open a public evidence-boundary or proof-action issue. |
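The "quarantine the source and recompute" replay amounts to rebuilding the denominator from rows whose source tag is allowed. The tag strings, row layout, and `recompute_rate` helper below are hypothetical; they borrow the forbidden-source names from the table but do not reflect the project's actual schema.

```python
from typing import Dict, List, Optional

# Hypothetical source tags, named after the "Forbidden source" row.
FORBIDDEN_SOURCES = {
    "repair_intent",
    "semantic_guess",
    "bootstrap_note",
    "paper_only",
    "price_only_receipt",
    "private_only_trace",
}


def recompute_rate(rows: List[Dict[str, object]]) -> Dict[str, object]:
    """Recompute a success rate with forbidden-source rows quarantined."""
    clean = [r for r in rows if r["source"] not in FORBIDDEN_SOURCES]
    quarantined = [r["id"] for r in rows if r["source"] in FORBIDDEN_SOURCES]
    rate: Optional[float] = None
    if clean:
        rate = sum(1 for r in clean if r["success"]) / len(clean)
    return {"denominator": len(clean), "rate": rate, "quarantined": quarantined}


# Hypothetical rows, for illustration only.
rows = [
    {"id": "t1", "source": "public_run", "success": True},
    {"id": "t2", "source": "public_run", "success": False},
    {"id": "t3", "source": "repair_intent", "success": True},
    {"id": "t4", "source": "paper_only", "success": True},
]
rate_with_leak = sum(1 for r in rows if r["success"]) / len(rows)
clean_view = recompute_rate(rows)
print(rate_with_leak, clean_view)
```

In this toy data the rate drops from 0.75 to 0.5 once the two forbidden rows leave both numerator and denominator, which is the kind of before/after pair the "Observed leak" field asks for.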

Packet G

Authority leak.

Use this when research-only, shadow, suggestion, or no-go output is presented as permission to act.

| Field | What to submit | Invalid if | Minimal replay |
| --- | --- | --- | --- |
| Target transition | Research-only, shadow, suggestion, or no-go output that appears to become action, gate, or public deployment authority. | No route, wording, API field, or UI label is named. | Trace public wording against the warrant requirement. |
| Missing warrant | Threshold, falsifier, null arm, receipt, human review, risk gate, or closure contract that should block action. | The public page already labels the output non-deployment or research-only. | Check ActionWarrant boundary or proof-action mini demo. |
| Observed authority | Public wording that implies an action is authorized, safe to deploy, or ready for live execution. | The claim depends on private interpretation rather than public wording. | Compare with claim boundary and downgrade if needed. |
| Repair | Downgrade label, add no-action boundary, block gate transition, or add regression test. | The repair asks for harmful operational details. | Open a proof-action or website issue with public evidence only. |
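A regression test for an authority leak could encode the table as a simple predicate. Everything in this sketch is an assumption: the label strings, the warrant item names, the rule that every listed warrant must be present, and the `authority_leak` helper are illustrative and do not describe the actual ActionWarrant boundary.

```python
from typing import Dict, List, Set

# Hypothetical labels and warrant items, named after the table rows.
NON_ACTION_LABELS = {"research-only", "shadow", "suggestion", "no-go"}
WARRANT_ITEMS = [
    "threshold",
    "falsifier",
    "null_arm",
    "receipt",
    "human_review",
    "risk_gate",
    "closure_contract",
]


def authority_leak(label: str, implies_action: bool, warrants: Set[str]) -> Dict[str, object]:
    """Flag public wording that implies action without a complete warrant."""
    missing: List[str] = [w for w in WARRANT_ITEMS if w not in warrants]
    leak = implies_action and (label in NON_ACTION_LABELS or bool(missing))
    return {"leak": leak, "missing_warrants": missing}


# Hypothetical case: a shadow output described publicly as ready to deploy.
finding = authority_leak("shadow", implies_action=True, warrants={"threshold"})
print(finding)
```

Whether the wording implies action is still a human judgment made from the quoted public text; the predicate only makes the consequence of that judgment repeatable.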

Copy Block

A compact issue body.

Counterexample class:
Credit path or authority transition, if relevant:
Target claim or artifact:
Public input:
Expected behavior:
Observed failure:
Evidence gap:
Minimal reproduction:
Proposed repair:
Safety boundary confirmed: no private data, no keys, no operational harm instructions.

If the objection cannot be written in this shape, it may still be useful privately, but it is not yet a public scientific counterexample.
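Whether a draft fits this shape can be checked mechanically before filing. The sketch below assumes one `Field: value` pair per line and treats the credit-path line as optional; the `missing_fields` helper is hypothetical and does not handle multi-line values.

```python
from typing import Dict, List

# Field names copied from the issue body; the credit-path line is optional.
REQUIRED_FIELDS = [
    "Counterexample class",
    "Target claim or artifact",
    "Public input",
    "Expected behavior",
    "Observed failure",
    "Evidence gap",
    "Minimal reproduction",
    "Proposed repair",
]


def missing_fields(body: str) -> List[str]:
    """Return required fields that are absent or left empty in an issue body."""
    filled: Dict[str, str] = {}
    for line in body.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            filled[key.strip()] = value.strip()
    return [name for name in REQUIRED_FIELDS if not filled.get(name)]


# Hypothetical draft, for illustration only.
draft = """Counterexample class: Reproduction failure
Target claim or artifact: scorer command in the public README
Public input:
Expected behavior: command prints a score
Observed failure: ImportError on a clean checkout
"""
gaps = missing_fields(draft)
print(gaps)
```

A non-empty result lists exactly which parts of the objection still need a public, replayable form.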

Not A Useful Packet

What we will not treat as evidence.

General disagreement without a target claim or artifact.
A private log request or private deployment guess.
A benchmark complaint without task id, input, or expected behavior.
A stronger-baseline claim without the same scoring boundary.
A formula objection without a concrete variable assignment or boundary case.
A data-leakage claim without public row, split, timestamp, hash, or provenance evidence.
A credit-leak claim without a named public metric, denominator, reward, gate, or clean-learning field.
An authority-leak claim without a named route, UI label, API field, README line, or public wording.
A security probe, credential test, harassment route, or harmful operational instruction.

Why This Exists

The goal is not to win an argument. The goal is to make the next repair obvious.

A public evidence field becomes stronger when its failures are small enough to replay.

Seven small ways to make the work stronger.
