Counterexample Packets

Five small ways to make the work stronger.

A useful objection is not a vibe check. It is a packet: target claim, public input, expected behavior, observed failure, and the narrowest repair. These examples show five shapes of critique we can replay.

[Figure: evidence shape of a counterexample packet: target, input, failure, repair.]

Packet A

Formula counterexample.

Use this when a published formula, gate, invariant, or protocol rule fails on a concrete public case.

Field | What to submit | Invalid if | Minimal replay
Target rule | Formula id, paper section, registry row, or protocol rule being challenged. | The target is a general theme rather than an exact rule. | Open the cited rule and list its variables exactly.
Counterexample state | Concrete variable assignment, boundary case, toy state, or public trace snippet. | The case needs private assumptions or hidden system behavior. | Substitute values into the rule and show the contradiction.
Expected vs observed | What the rule predicts and what the counterexample shows. | The result depends on changing the stated rule. | Check whether the inequality, gate, invariant, or transition fails.
Repair | Narrow domain, add precondition, split rule, or downgrade claim boundary. | The repair demands private disclosure. | File a proof-carrying-action issue with the exact patch target.
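The substitution step in a formula counterexample can be sketched in a few lines. This is a minimal illustration, not a project tool: the `Rule` class, `check_counterexample`, and the toy rule "a * b >= a for all a, b >= 0" are all hypothetical names and content invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

State = Mapping[str, float]


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for a published rule: a domain plus a claim."""
    rule_id: str
    precondition: Callable[[State], bool]  # stated domain of the rule
    claim: Callable[[State], bool]         # what the rule predicts inside it


def check_counterexample(rule: Rule, state: State) -> str:
    """Substitute a concrete state into the rule and classify the outcome."""
    if not rule.precondition(state):
        return "out_of_domain"  # the rule never claimed to cover this state
    return "holds" if rule.claim(state) else "counterexample"


# Toy rule (an assumption for illustration): a * b >= a whenever a, b >= 0.
toy_rule = Rule(
    rule_id="toy-rule-1",
    precondition=lambda s: s["a"] >= 0 and s["b"] >= 0,
    claim=lambda s: s["a"] * s["b"] >= s["a"],
)

# Narrowest repair: keep the claim, tighten the domain to b >= 1.
repaired_rule = Rule(
    rule_id="toy-rule-1-repaired",
    precondition=lambda s: s["a"] >= 0 and s["b"] >= 1,
    claim=toy_rule.claim,
)

state = {"a": 2.0, "b": 0.5}
print(check_counterexample(toy_rule, state))       # counterexample
print(check_counterexample(repaired_rule, state))  # out_of_domain
```

Keeping the precondition separate from the claim makes the repair visible as a domain change rather than a silent edit to the rule itself.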

Packet B

Data leakage.

Use this when a benchmark or evidence table may have used information that should not be available.

Field | What to submit | Invalid if | Minimal replay
Leak target | Dataset row ids, split ids, task ids, timestamp, provenance record, or manifest hash. | The report relies on non-public data or unverifiable suspicion. | Locate the same ids in public artifact files.
Leak mechanism | Duplicate rows, train/test overlap, future timestamp, label in prompt, answer in rubric, or split mismatch. | The mechanism is only "maybe the model knew it." | Run duplicate, timestamp, split, or label-presence checks.
Observed impact | Which metric, claim, or table row may be invalidated. | No affected result is named. | Recompute the affected row with leaked items removed or quarantined.
Repair | Quarantine rows, rebuild split, add leakage CI, or downgrade the evidence claim. | The repair requires exposing private logs. | Open a WisdomBench or evidence-boundary issue.
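The duplicate, overlap, and timestamp checks named above might look like the following sketch. The row layout (`id` and `timestamp` keys), the `find_leaks` function, and the idea of a single evaluation cutoff date are assumptions made for illustration; real artifact files will need their own loader.

```python
from collections import Counter
from datetime import date


def find_leaks(train_rows, test_rows, cutoff):
    """Return row ids flagged by three simple, replayable leakage checks."""
    train_ids = {row["id"] for row in train_rows}
    test_ids = [row["id"] for row in test_rows]
    return {
        # Same id appears on both sides of the split.
        "overlap": sorted(train_ids & set(test_ids)),
        # Same id appears more than once in the test split.
        "duplicates": sorted(i for i, n in Counter(test_ids).items() if n > 1),
        # Training row dated after the assumed evaluation cutoff.
        "future": sorted(r["id"] for r in train_rows if r["timestamp"] > cutoff),
    }


# Toy rows, invented for the example.
train = [
    {"id": "t1", "timestamp": date(2024, 1, 5)},
    {"id": "t2", "timestamp": date(2024, 3, 1)},
]
test = [
    {"id": "t2", "timestamp": date(2024, 3, 1)},
    {"id": "e1", "timestamp": date(2024, 1, 9)},
    {"id": "e1", "timestamp": date(2024, 1, 9)},
]

report = find_leaks(train, test, cutoff=date(2024, 2, 1))
print(report)
```

Each flagged id can then be cited directly as the leak target, and the affected metric recomputed with those rows removed.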

Packet C

Stronger baseline.

Use this when a simpler public baseline beats or matches a reported effect under the same boundary.

Field | What to submit | Invalid if | Minimal replay
Target metric | Repeat failure rate, normalized improvement, no-go false block rate, action coverage, or task score. | The metric differs from the project metric. | Confirm the metric, split, scoring script, and sample count match.
Baseline input | Public script, task subset, seeds, model family, allowed context, and scoring command. | The baseline uses extra information or fewer constraints. | Run the baseline under the same evidence boundary.
Observed result | The baseline wins, matches, or removes the claimed effect with confidence interval or exact count. | Only cherry-picked examples are shown. | Compare aggregate and task-level results under the same stop rule.
Repair | Add the baseline, revise effect size, retain negative result, or narrow the contribution. | The repair asks to hide failed results. | Open a WisdomBench or evidence-boundary issue.
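One way to sketch the "same boundary" comparison is to refuse to compare at all unless the metric, split, scorer, and sample count match, then report exact task-level counts. The dictionary layout, the `compare_under_same_boundary` function, and the scores below are hypothetical, and the sketch assumes a higher score is better.

```python
BOUNDARY_KEYS = ("metric", "split", "scorer", "n_tasks")


def compare_under_same_boundary(reported, baseline):
    """Compare task-level scores only if both runs share one boundary."""
    mismatched = [
        key for key in BOUNDARY_KEYS
        if reported["boundary"][key] != baseline["boundary"][key]
    ]
    if mismatched:
        raise ValueError("boundary mismatch: " + ", ".join(mismatched))
    if not reported["scores"] or set(reported["scores"]) != set(baseline["scores"]):
        raise ValueError("task ids are empty or differ between runs")

    counts = {"baseline_wins": 0, "ties": 0, "baseline_losses": 0}
    total_diff = 0.0
    for task_id in sorted(reported["scores"]):
        # Positive diff means the baseline scored higher on this task.
        diff = baseline["scores"][task_id] - reported["scores"][task_id]
        total_diff += diff
        if diff > 0:
            counts["baseline_wins"] += 1
        elif diff < 0:
            counts["baseline_losses"] += 1
        else:
            counts["ties"] += 1
    counts["mean_diff"] = total_diff / len(reported["scores"])
    return counts


# Toy values, invented for the example.
boundary = {"metric": "task_score", "split": "test", "scorer": "v1", "n_tasks": 4}
reported = {
    "boundary": boundary,
    "scores": {"task-1": 1.0, "task-2": 0.0, "task-3": 1.0, "task-4": 0.0},
}
baseline = {
    "boundary": dict(boundary),
    "scores": {"task-1": 1.0, "task-2": 1.0, "task-3": 1.0, "task-4": 0.0},
}

summary = compare_under_same_boundary(reported, baseline)
print(summary)
```

Reporting wins, ties, and losses across every task, rather than selected examples, is what separates this from a cherry-picked comparison.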

Packet D

Reproduction failure.

Use this when a public task, scorer, command, dataset card, or artifact package cannot be replayed.

Field | What to submit | Invalid if | Minimal replay
Target artifact | Task id, scorer function, file name, manifest row, command, or public release hash. | The target cannot be identified from public files. | Open the artifact and confirm the exact version.
Replay context | OS, Python version if relevant, command, seed, expected output, and observed output. | The report omits the command or depends on a local private file. | Run the command from a clean checkout or public package.
Failure mode | Missing file, broken import, nondeterministic score, scorer mismatch, or undocumented dependency. | The report is only "it did not work." | Confirm the smallest failing step and attach safe logs.
Repair | README patch, pinned dependency, manifest update, CI test, or result quarantine. | The repair requires private credentials. | Open the relevant public issue with no secrets.
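A replay report can be captured mechanically so that the command, environment, expected output, and observed output travel together. The sketch below is an assumption about how that might be done, not a project script: `replay` is a hypothetical helper, and the command is a stand-in one-liner rather than a real scorer.

```python
import platform
import subprocess
import sys


def replay(command, expected_stdout, runs=2):
    """Run a command several times and classify the smallest failure seen."""
    outputs = []
    for _ in range(runs):
        proc = subprocess.run(command, capture_output=True, text=True, timeout=60)
        outputs.append((proc.returncode, proc.stdout.strip()))

    if any(code != 0 for code, _ in outputs):
        failure = "nonzero_exit"
    elif len(set(outputs)) > 1:
        failure = "nondeterministic"  # same command, different output
    elif outputs[0][1] != expected_stdout:
        failure = "output_mismatch"
    else:
        failure = None

    return {
        "os": platform.system(),
        "python": platform.python_version(),
        "command": command,
        "expected": expected_stdout,
        "observed": [out for _, out in outputs],
        "failure": failure,
    }


# Stand-in for a public scoring command; prints 10.
command = [sys.executable, "-c", "print(sum(range(5)))"]
ok_report = replay(command, expected_stdout="10")
bad_report = replay(command, expected_stdout="11")
print(ok_report["failure"], bad_report["failure"])
```

Running the command more than once is a cheap way to tell a nondeterministic score apart from a plain mismatch before filing the report.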

Packet E

Claim-boundary overreach.

Use this when public wording claims more than the public evidence can support.

Field | What to submit | Invalid if | Minimal replay
Exact wording | Quote the sentence, page, card, DOI note, README line, or public page section. | The challenged claim is paraphrased too broadly. | Trace the exact words to the linked evidence.
Evidence envelope | Which artifact supports it, what sample/domain it covers, and what it does not cover. | The critique demands proof of a claim not made. | Check artifact manifest, limitations, and claim boundary.
Overreach | Missing external validation, real-world deployment, human study, robot hardware, or independent replication. | The missing evidence is already disclosed as a limitation. | Decide whether the public wording still needs narrowing.
Repair | Boundary downgrade, limitation sentence, evidence-card update, or stronger citation route. | The repair asks for private system disclosure. | Open a website or evidence-boundary issue.

Copy Block

A compact issue body.

Counterexample class:
Target claim or artifact:
Public input:
Expected behavior:
Observed failure:
Evidence gap:
Minimal reproduction:
Proposed repair:
Safety boundary confirmed: no private data, no keys, no operational harm instructions.

If the objection cannot be written in this shape, it may still be useful privately, but it is not yet a public scientific counterexample.
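Whether an objection fits this shape can be checked before it is filed. The following sketch renders the issue body from a dictionary and rejects packets with empty fields. The field names come from the copy block; the `render_issue` function and the sample packet are hypothetical.

```python
FIELDS = [
    "Counterexample class",
    "Target claim or artifact",
    "Public input",
    "Expected behavior",
    "Observed failure",
    "Evidence gap",
    "Minimal reproduction",
    "Proposed repair",
]
SAFETY_LINE = (
    "Safety boundary confirmed: no private data, no keys, "
    "no operational harm instructions."
)


def render_issue(packet):
    """Render the compact issue body, refusing incomplete packets."""
    missing = [name for name in FIELDS if not packet.get(name, "").strip()]
    if missing:
        raise ValueError("incomplete packet, missing: " + ", ".join(missing))
    lines = [f"{name}: {packet[name].strip()}" for name in FIELDS]
    lines.append(SAFETY_LINE)
    return "\n".join(lines)


# Sample packet, invented for the example.
packet = {
    "Counterexample class": "Formula counterexample",
    "Target claim or artifact": "toy-rule-1",
    "Public input": "a = 2.0, b = 0.5",
    "Expected behavior": "a * b >= a",
    "Observed failure": "a * b = 1.0, which is less than a = 2.0",
    "Evidence gap": "Rule domain does not exclude b < 1",
    "Minimal reproduction": "Substitute the values into the rule",
    "Proposed repair": "Add precondition b >= 1",
}
print(render_issue(packet))
```

The safety line is appended as fixed text here. It still has to be true of the packet, which no script can confirm on the author's behalf.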

Not A Useful Packet

What we will not treat as evidence.

  • General disagreement without a target claim or artifact.
  • A private log request or private deployment guess.
  • A benchmark complaint without task id, input, or expected behavior.
  • A stronger-baseline claim without the same scoring boundary.
  • A formula objection without a concrete variable assignment or boundary case.
  • A data-leakage claim without public row, split, timestamp, hash, or provenance evidence.
  • A security probe, credential test, harassment route, or harmful operational instruction.

Why This Exists

The goal is not to win an argument. The goal is to make the next repair obvious.

A public evidence field becomes stronger when its failures are small enough to replay.