The substrate
contradiction-preserving · evidence-first
Benchmark · extraction quality vs price

Which model extracts donto-grade claims — for how much?

donto's pitch is deep extraction: an exhaustive, multi-directional, contradiction-preserving, evidence-anchored firehose of claims — not flat fact-stuffing. We bake off LLMs at that task on a fixed source and score each against a hand-authored Claude gold benchmark. Every model gets its own tuned config (reasoning mode, token budget, prompt variant, provider pin) so the comparison is fair — a model is never handicapped by a one-size recipe. The goal: find the quality/price sweet spot far below our current deepseek lane ($0.1372/M blended).

18 models tested
FREE: nemotron-3-super-120b-a12b:free
$526: qwen3-235b-a22b-2507
352,175 backlog chunks

Recommendation — faithfulness-adjusted

don't pick on GoldScore alone

Never pick on GoldScore alone — the adversarial faithfulness pass reordered everything.
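
The reordering can be reproduced from the leaderboard's own columns. The sketch below assumes adj = Gold × faith✓% / 100, rounded to the nearest integer; the page never states that formula, but it matches every row that carries a faith✓% value. The short model names, tuple layout and function name are illustrative only.

```python
# (short name, GoldScore, faith-verified %) for the six models that went
# through the adversarial faithfulness pass, copied from the leaderboard.
ROWS = [
    ("ministral-3b", 85, 45.8),
    ("qwen3-235b (superb)", 84, 96.0),
    ("gpt-oss-120b:free", 82, 64.0),
    ("ling-2.6-flash", 78, 68.0),
    ("nemotron-3-super:free", 76, 93.75),
    ("granite-4.1-8b", 74, 56.0),
]


def adjusted(gold: float, faith_pct: float) -> int:
    """Assumed formula: scale GoldScore by the share of claims verified faithful."""
    return round(gold * faith_pct / 100)


# Rank once by raw GoldScore and once by the faithfulness-adjusted score.
by_gold = [name for name, _, _ in sorted(ROWS, key=lambda r: -r[1])]
by_adj = [name for name, _, _ in sorted(ROWS, key=lambda r: -adjusted(r[1], r[2]))]

for name, gold, faith in ROWS:
    print(f"{name:24s} gold={gold:3d} faith={faith:6.2f}% adj={adjusted(gold, faith)}")
print("by GoldScore:", by_gold)
print("by adj:      ", by_adj)
```

Under this assumption, ministral-3b leads on raw GoldScore and finishes last once adjusted, while qwen3-235b and nemotron-3-super move to the top two places.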

Best free (the ~1/50th goal)
nvidia/nemotron-3-super-120b-a12b:free — 93.75% faithful, $0
Best paid (contested cases)
qwen/qwen3-235b-a22b-2507 — 96% faithful, ~$526 backlog

Two-lane: route bulk/uncontested chunks to nemotron-super120 (FREE) after a contested-document validation batch (its contradiction-preservation is unproven, not failed); route known-contested chunks to qwen3-235b (the only cheap model both abundant AND genuinely faithful, near the deepseek/mimo anchor). REJECT ling26flash (fabricates spans), ministral3b/granite4-8b/gptoss120 (collapse the contested core; GoldScores inflated by padding).

⚠ GoldScore-inflated (high abundance, low real faithfulness) — rejected: ministral3b, gptoss120, granite4-8b, ling26flash
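
A minimal sketch of the two-lane routing described above. The `known_contested` flag and the `bulk_validated` switch are hypothetical: the page does not say how contested chunks are identified or how the validation batch is recorded. Sending bulk chunks to the paid lane until validation passes is also an assumption.

```python
from dataclasses import dataclass

# Model ids as listed on the leaderboard.
BULK_MODEL = "nvidia/nemotron-3-super-120b-a12b:free"
CONTESTED_MODEL = "qwen/qwen3-235b-a22b-2507"
REJECTED = {
    "inclusionai/ling-2.6-flash",
    "mistralai/ministral-3b-2512",
    "ibm-granite/granite-4.1-8b",
    "openai/gpt-oss-120b:free",
}


@dataclass
class Chunk:
    chunk_id: str
    known_contested: bool  # hypothetical flag; detection method is not specified


def route(chunk: Chunk, bulk_validated: bool) -> str:
    """Pick the extraction model for one backlog chunk."""
    if chunk.known_contested:
        return CONTESTED_MODEL
    # The free lane only opens after the contested-document validation batch;
    # until then this sketch falls back to the paid lane (an assumption).
    model = BULK_MODEL if bulk_validated else CONTESTED_MODEL
    assert model not in REJECTED
    return model


print(route(Chunk("c-001", known_contested=True), bulk_validated=True))
print(route(Chunk("c-002", known_contested=False), bulk_validated=True))
print(route(Chunk("c-003", known_contested=False), bulk_validated=False))
```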

00

Leaderboard — vs the Calder gold

updated 2026-06-30T20:44:18Z
# | model | mode | Gold | faith✓% | adj | claims+preds+inf+ent%+contst+fab% (digits run together in the source) | $in/M | $out/M | backlog$ | verdict
1 | qwen/qwen3-235b-a22b-2507 | none | 89 |  | 89 | 105841126465736.2 | 0.09 | 0.1 | $526 | abundant (unverified)
2 | deepseek/deepseek-v4-flash | on | 88 |  | 88 | 4122698561331.5 | 0.098 | 0.196 | $1,152 | abundant (unverified)
3 | xiaomi/mimo-v2.5-pro | off | 88 |  | 88 | 798215163601030.1 | 0.435 | 0.87 | $3,410 | abundant (unverified)
4 | qwen/qwen3-235b-a22b-2507 | none | 84 | 96 | 81 | 4882643174191.2 | 0.09 | 0.1 | $526 | ✓ faithful + abundant
5 | nvidia/nemotron-3-super-120b-a12b:free | on | 76 | 93.75 | 71 | 479256071220.4 | 0 | 0 | FREE | ✓ faithful (contested unproven)
6 | qwen/qwen3-235b-a22b-2507 | none | 70 |  | 70 | 326298110112631.3 | 0.09 | 0.1 | $526 | abundant (unverified)
7 | google/gemma-3-12b-it | none | 63 |  | 63 | 174910541317.2 | 0.05 | 0.15 | $504 | abundant (unverified)
8 | mistralai/mistral-nemo | none | 62 |  | 62 | 150146624030 | 0.02 | 0.03 | $134 | abundant (unverified)
9 | inclusionai/ling-2.6-flash | none | 78 | 68 | 53 | 334298242321119.2 | 0.01 | 0.03 | $101 | ⚠ partial: some fabrication
10 | openai/gpt-oss-120b:free | on | 82 | 64 | 52 | 3121682968170 | 0 | 0 | FREE | ✗ low faithfulness · GoldScore inflated
11 | arcee-ai/trinity-mini | on | 43 |  | 43 | 776401460 | 0.045 | 0.15 | $731 | partial (unverified)
12 | ibm-granite/granite-4.1-8b | none | 74 | 56 | 41 | 25991906162.7 | 0.05 | 0.1 | $392 | ✗ low faithfulness · GoldScore inflated
13 | amazon/nova-micro-v1 | none | 40 |  | 40 | 8368034418.1 | 0.035 | 0.14 | $431 | partial (unverified)
14 | mistralai/ministral-3b-2512 | none | 85 | 45.8 | 39 | 3232886051340 | 0.1 | 0.1 | $560 | ✗ low faithfulness · GoldScore inflated
15 | meta-llama/llama-3.1-8b-instruct | none | 28 |  | 28 | 432403910 | 0.02 | 0.03 | $134 | fails gold bar
16 | qwen/qwen3.5-9b | on | 28 |  | 28 | 164206205.5 | 0.1 | 0.15 | $1,008 | fails gold bar
17 | qwen/qwen-2.5-7b-instruct | none | 25 |  | 25 | 222201720 | 0.04 | 0.1 | $358 | fails gold bar
18 | nvidia/nemotron-3-nano-30b-a3b:free | on | 11 |  | 11 | 220300 | 0 | 0 | FREE | fails gold bar

GoldScore 0–100 = entity coverage (30) · contested/consensus/bitemporal structure (30) · inference (15) · predicate diversity (15) · faithfulness/low-fabrication (10). contst and fab% are heuristic signals (substring + lexical); the rigorous faithfulness check is an adversarial multi-agent pass on the leaders. backlog$ = rough projection over 352,175 pending chunks; FREE models = the headline ~1/50th-and-beyond goal (subject to free-tier rate limits).
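
The page does not say how backlog$ is projected. The sketch below is a reverse-engineered guess, not the benchmark's actual formula: assuming about 3,360M input tokens and 2,240M output tokens across the 352,175 chunks (roughly 9.5k in and 6.4k out per chunk), plus a 1.5x multiplier for priced models whose mode is "on", reproduces every priced backlog$ figure in the leaderboard to the dollar. All three constants are fitted assumptions.

```python
# Fitted assumptions, NOT stated in the source: token totals for the whole
# backlog (in millions) and a cost multiplier for reasoning-on runs.
BACKLOG_INPUT_MTOK = 3360.0
BACKLOG_OUTPUT_MTOK = 2240.0
REASONING_MULTIPLIER = 1.5


def project_backlog(in_per_m: float, out_per_m: float, reasoning_on: bool = False) -> float:
    """Rough backlog cost in dollars from $/M input and output prices."""
    base = in_per_m * BACKLOG_INPUT_MTOK + out_per_m * BACKLOG_OUTPUT_MTOK
    return base * (REASONING_MULTIPLIER if reasoning_on else 1.0)


# (model, $in/M, $out/M, mode is "on") taken from the leaderboard rows.
EXAMPLES = [
    ("qwen/qwen3-235b-a22b-2507", 0.09, 0.1, False),
    ("deepseek/deepseek-v4-flash", 0.098, 0.196, True),
    ("xiaomi/mimo-v2.5-pro", 0.435, 0.87, False),
    ("mistralai/mistral-nemo", 0.02, 0.03, False),
]

for model, price_in, price_out, reasoning_on in EXAMPLES:
    cost = project_backlog(price_in, price_out, reasoning_on)
    print(f"{model:32s} ${cost:,.0f}")
```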

01

The gold benchmark

hand-authored by Claude

The fixed source is the Calder Reservoir dam-failure incident report, a contested two-inquiry case. It was chosen because it stresses donto's thesis: two expert panels blame different causes, there is a consensus point both agree on, and a press release is later corrected. A faithful extraction must therefore preserve the contradiction, attribute claims to their source, and reify the correction over time rather than collapse it. Claude hand-extracted the gold:

303 claims · 160 predicates · 63 inferences · 94 entities · 14 contested edges

02

Method — the recipe + per-model fairness

same recipe · tuned configs
prompt
SUPERB extract_broad.txt (donto-api) — abundance-native, emit-free, multi-directional, paraconsistent, evidence-first, non-fabrication principle. Source bookended as inert data.
gleaning
thread-nudge gleaning, up to 6 passes, dry-stop once a pass adds fewer than 3 new claims; session_id provider stickiness.
scoring
vs the 303-claim Calder gold; GoldScore = entity-cov 30 + contested-structure 30 + inference 15 + predicate-diversity 15 + faithfulness 10.
fairness
Each model gets its own tuned config below (reasoning mode, token budgets, prompt variant, provider pin) so a model is never handicapped by a one-size recipe. prompt_variant=superb_fewshot anchors JSONL format for models that drift to prose.
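
The gleaning rule above (up to 6 passes, stop once a pass adds fewer than 3 new claims) can be sketched as a small loop. This is an illustration only: `extract_pass` stands in for the real model call, claims are deduplicated by plain string equality, and the thread nudging and session_id stickiness are not modelled.

```python
from typing import Callable, Iterable, Set, Tuple

MAX_PASSES = 6          # "up to 6 passes"
DRY_STOP_THRESHOLD = 3  # stop once a pass adds fewer than 3 new claims


def glean(extract_pass: Callable[[int, Set[str]], Iterable[str]]) -> Tuple[Set[str], int]:
    """Run repeated extraction passes; return (all claims, passes run)."""
    seen: Set[str] = set()
    passes_run = 0
    for pass_no in range(1, MAX_PASSES + 1):
        # Only claims not already collected count as new for the dry-stop test.
        new = set(extract_pass(pass_no, seen)) - seen
        seen |= new
        passes_run = pass_no
        if len(new) < DRY_STOP_THRESHOLD:
            break
    return seen, passes_run


def fake_pass(pass_no: int, seen: Set[str]) -> Iterable[str]:
    """Hypothetical stand-in for a model call: yields 10, 5, then 2 new claims."""
    sizes = {1: 10, 2: 5, 3: 2}
    return [f"claim-{pass_no}-{i}" for i in range(sizes.get(pass_no, 0))]


claims, passes = glean(fake_pass)
print(len(claims), "claims after", passes, "passes")
```

With the stand-in above, the third pass adds only 2 claims, so the loop stops there rather than using all 6 passes.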

Per-model config (fairness)

model | reasoning | max_tok | prompt | pin | tuned | notes
qwen/qwen3-235b-a22b-2507 | none | 10000 | lean |  |  | qwen3-235b with LEAN prompt (~2700 tok vs ~3958 SUPERB; lens list compressed, FACT SHAPE + ANCHOR kept verbatim). Tests: keep gold quality at ~30% fewer input tokens (compact's quality cliff avoided).
deepseek/deepseek-v4-flash | 4000 | 12000 | superb | GMICloud |  | Reasoning model. GMICloud pin (others ignore reasoning cap -> 0 content). Production lane.
xiaomi/mimo-v2.5-pro | off | 8000 | superb |  |  | Reasoning model run reasoning-OFF (reasoning:{enabled:false}). Highest raw abundance.
qwen/qwen3-235b-a22b-2507 | none | 10000 | superb |  |  | Big 235B MoE paid ($0.094/M). Reliable, strong (217 claims pass1).
nvidia/nemotron-3-super-120b-a12b:free | 4000 | 12000 | superb |  |  | FREE, 120B MoE reasoning. Strong abundance; does not tag h:true (inference score low). Free-tier rate-limited (needs retry).
qwen/qwen3-235b-a22b-2507 | none | 10000 | compact |  |  | qwen3-235b with COMPACT prompt (~200 tok vs ~3900 SUPERB). Context-reduction experiment: does quality hold at a fraction of the input tokens?
google/gemma-3-12b-it | none | 8000 | superb |  |  | Gemma3 12B paid ($0.090/M).
mistralai/mistral-nemo | none | 8000 | superb |  |  | Cheap paid ($0.024/M, 5.7x cheaper). No reasoning.
inclusionai/ling-2.6-flash | none | 8000 | superb |  |  | Cheapest paid tool-capable ($0.018/M, 7.6x cheaper). Reliable, no 429.
openai/gpt-oss-120b:free | 3000 | 12000 | superb_fewshot |  | · | Free, often 429. gpt-oss drifts to prose -> fewshot variant to anchor JSONL.
arcee-ai/trinity-mini | 3000 | 10000 | superb |  | · | Paid reasoning ($0.087/M).
ibm-granite/granite-4.1-8b | none | 8000 | superb |  |  | IBM Granite 8B paid ($0.070/M).
amazon/nova-micro-v1 | none | 8000 | superb |  |  | Amazon Nova Micro paid ($0.077/M).
mistralai/ministral-3b-2512 | none | 8000 | superb |  |  | Ministral 3B paid ($0.100/M).
meta-llama/llama-3.1-8b-instruct | none | 8000 | superb |  |  | Floor model. Collapses contradictions, 0 inferences, loops. Documented fail.
qwen/qwen3.5-9b | 3000 | 10000 | superb |  | · | Paid 9B reasoning ($0.12/M).
qwen/qwen-2.5-7b-instruct | none | 8000 | superb |  |  | Cheap paid ($0.064/M).
nvidia/nemotron-3-nano-30b-a3b:free | 2000 | 10000 | superb |  | · | RETUNE: reasoning:4000 ate the budget -> only 2 claims. Lower reasoning to 2000 so output has room.
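
One way to carry these per-model settings in code is a small config record that is turned into request parameters. This is a sketch under assumptions: only `reasoning:{enabled:false}` appears on this page, while the other request keys (`max_tokens`, the `reasoning` token-cap shape, `provider_pin`, `prompt_variant` as a request field) are hypothetical and would need checking against whatever API the harness actually calls.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Union


@dataclass(frozen=True)
class ModelConfig:
    model: str
    reasoning: Union[int, str]  # token cap, or "none" / "off" as in the table
    max_tok: int
    prompt: str = "superb"
    pin: Optional[str] = None


def build_request(cfg: ModelConfig) -> Dict[str, Any]:
    """Translate one table row into request parameters (key names are assumed)."""
    req: Dict[str, Any] = {
        "model": cfg.model,
        "max_tokens": cfg.max_tok,
        "prompt_variant": cfg.prompt,
    }
    if cfg.reasoning == "off":
        # Matches the reasoning:{enabled:false} setting quoted in the notes.
        req["reasoning"] = {"enabled": False}
    elif cfg.reasoning != "none":
        # Hypothetical shape for a reasoning token cap.
        req["reasoning"] = {"max_tokens": int(cfg.reasoning)}
    if cfg.pin is not None:
        req["provider_pin"] = cfg.pin  # hypothetical key for the provider pin
    return req


CONFIGS = [
    ModelConfig("qwen/qwen3-235b-a22b-2507", "none", 10000, prompt="lean"),
    ModelConfig("deepseek/deepseek-v4-flash", 4000, 12000, pin="GMICloud"),
    ModelConfig("xiaomi/mimo-v2.5-pro", "off", 8000),
]

for cfg in CONFIGS:
    print(build_request(cfg))
```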

Living benchmark — new models are added continuously. Data: model-experiments/leaderboard.jsonl. Generated 2026-06-30T20:44:18Z.