Which model extracts donto-grade claims — for how much?
donto's pitch is deep extraction: an exhaustive, multi-directional, contradiction-preserving, evidence-anchored firehose of claims — not flat fact-stuffing. We bake off LLMs at that task on a fixed source and score each against a hand-authored Claude gold benchmark. Every model gets its own tuned config (reasoning mode, token budget, prompt variant, provider pin) so the comparison is fair — a model is never handicapped by a one-size recipe. The goal: find the quality/price sweet spot far below our current deepseek lane ($0.1372/M blended).
Recommendation — faithfulness-adjusted
Never pick on GoldScore alone — the adversarial faithfulness pass reordered everything.
Two-lane: route bulk/uncontested chunks to nemotron-super120 (FREE) after a contested-document validation batch (its contradiction-preservation is unproven, not failed); route known-contested chunks to qwen3-235b (the only cheap model both abundant AND genuinely faithful, near the deepseek/mimo anchor). REJECT ling26flash (fabricates spans), ministral3b/granite4-8b/gptoss120 (collapse the contested core; GoldScores inflated by padding).
⚠ GoldScore-inflated (high abundance, low real faithfulness) — rejected: ministral3b, gptoss120, granite4-8b, ling26flash
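A minimal sketch of the two-lane routing described above. The lane models and the reject list come from the recommendation. The `Chunk` type, its `contested` flag, the `bulk_validated` switch, and the choice to send everything to the contested lane until the validation batch has passed are assumptions made for illustration.

```python
from dataclasses import dataclass

# Lane assignments taken from the recommendation above.
BULK_LANE = "nvidia/nemotron-3-super-120b-a12b:free"
CONTESTED_LANE = "qwen/qwen3-235b-a22b-2507"
REJECTED = {"ling26flash", "ministral3b", "granite4-8b", "gptoss120"}


@dataclass
class Chunk:
    """Hypothetical chunk record; `contested` is assumed to be known upstream."""
    chunk_id: str
    contested: bool


def pick_lane(chunk: Chunk, bulk_validated: bool = False) -> str:
    """Return the model slug for a chunk.

    Assumption: until the contested-document validation batch has passed
    (bulk_validated=True), the free lane is not trusted and every chunk
    goes to the contested lane.
    """
    if chunk.contested or not bulk_validated:
        return CONTESTED_LANE
    return BULK_LANE
```

Keeping the validation gate as an explicit argument makes the "unproven, not failed" status of the free lane visible in the routing code rather than hiding it in configuration.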
Leaderboard — vs the Calder gold
| # | model | mode | Gold | faith✓% | adj | claims | preds | inf | ent% | contst | fab% | $in/M | $out/M | backlog$ | verdict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | qwen/qwen3-235b-a22b-2507 | none | 89 | — | 89 | 1058 | 411 | 264 | 65 | 73 | 6.2 | 0.09 | 0.1 | $526 | abundant (unverified) |
| 2 | deepseek/deepseek-v4-flash | on | 88 | — | 88 | 412 | 269 | 85 | 61 | 33 | 1.5 | 0.098 | 0.196 | $1,152 | abundant (unverified) |
| 3 | xiaomi/mimo-v2.5-pro | off | 88 | — | 88 | 798 | 215 | 163 | 60 | 103 | 0.1 | 0.435 | 0.87 | $3,410 | abundant (unverified) |
| 4 | qwen/qwen3-235b-a22b-2507 | none | 84 | 96 | 81 | 488 | 264 | 31 | 74 | 19 | 1.2 | 0.09 | 0.1 | $526 | ✓ faithful + abundant |
| 5 | nvidia/nemotron-3-super-120b-a12b:free | on | 76 | 93.75⚠ | 71 | 479 | 256 | 0 | 71 | 22 | 0.4 | 0 | 0 | FREE | ✓ faithful (contested unproven) |
| 6 | qwen/qwen3-235b-a22b-2507 | none | 70 | — | 70 | 326 | 298 | 110 | 11 | 26 | 31.3 | 0.09 | 0.1 | $526 | abundant (unverified) |
| 7 | google/gemma-3-12b-it | none | 63 | — | 63 | 174 | 91 | 0 | 54 | 13 | 17.2 | 0.05 | 0.15 | $504 | abundant (unverified) |
| 8 | mistralai/mistral-nemo | none | 62 | — | 62 | 150 | 146 | 62 | 40 | 3 | 0 | 0.02 | 0.03 | $134 | abundant (unverified) |
| 9 | inclusionai/ling-2.6-flash | none | 78 | 68 | 53 | 334 | 298 | 242 | 32 | 11 | 19.2 | 0.01 | 0.03 | $101 | ⚠ partial — some fabrication |
| 10 | openai/gpt-oss-120b:free | on | 82 | 64⚠ | 52 | 312 | 168 | 29 | 68 | 17 | 0 | 0 | 0 | FREE | ✗ low faithfulness · GoldScore inflated |
| 11 | arcee-ai/trinity-mini | on | 43 | — | 43 | 77 | 64 | 0 | 14 | 6 | 0 | 0.045 | 0.15 | $731 | partial (unverified) |
| 12 | ibm-granite/granite-4.1-8b | none | 74 | 56⚠ | 41 | 259 | 91 | 90 | 61 | 6 | 2.7 | 0.05 | 0.1 | $392 | ✗ low faithfulness · GoldScore inflated |
| 13 | amazon/nova-micro-v1 | none | 40 | — | 40 | 83 | 68 | 0 | 34 | 4 | 18.1 | 0.035 | 0.14 | $431 | partial (unverified) |
| 14 | mistralai/ministral-3b-2512 | none | 85 | 45.8 | 39 | 323 | 288 | 60 | 51 | 34 | 0 | 0.1 | 0.1 | $560 | ✗ low faithfulness · GoldScore inflated |
| 15 | meta-llama/llama-3.1-8b-instruct | none | 28 | — | 28 | 43 | 24 | 0 | 39 | 1 | 0 | 0.02 | 0.03 | $134 | fails gold bar |
| 16 | qwen/qwen3.5-9b | on | 28 | — | 28 | 164 | 2 | 0 | 62 | 0 | 5.5 | 0.1 | 0.15 | $1,008 | fails gold bar |
| 17 | qwen/qwen-2.5-7b-instruct | none | 25 | — | 25 | 22 | 22 | 0 | 17 | 2 | 0 | 0.04 | 0.1 | $358 | fails gold bar |
| 18 | nvidia/nemotron-3-nano-30b-a3b:free | on | 11 | — | 11 | 2 | 2 | 0 | 3 | 0 | 0 | 0 | 0 | FREE | fails gold bar |
GoldScore 0–100 = entity coverage (30) · contested/consensus/bitemporal structure (30) · inference (15) · predicate diversity (15) · faithfulness/low-fabrication (10). contst and fab% are heuristic signals (substring + lexical); the rigorous faithfulness check is an adversarial multi-agent pass on the leaders. backlog$ = rough projection over 352,175 pending chunks; FREE models = the headline ~1/50th-and-beyond goal (subject to free-tier rate limits).
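The weighting above, and the adj column, can be sketched as follows. The weights are the published ones. The per-component fractions (0.0 to 1.0) are a hypothetical input, since the report does not say how each component is measured. The adj rule is inferred from the table, not stated: every row with a faith✓% value is consistent with adj = round(Gold × faith✓% / 100), and unverified rows carry Gold through unchanged.

```python
from typing import Optional

# Published GoldScore weights (sum to 100).
WEIGHTS = {
    "entity_coverage": 30,
    "contested_structure": 30,
    "inference": 15,
    "predicate_diversity": 15,
    "faithfulness": 10,
}


def gold_score(fractions: dict) -> float:
    """Weighted sum of component fractions, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)


def adjusted(gold: float, faith_pct: Optional[float] = None) -> int:
    """Faithfulness-adjusted score (rule inferred from the leaderboard rows).

    Rows without an adversarial faithfulness result keep their GoldScore.
    """
    if faith_pct is None:
        return round(gold)
    return round(gold * faith_pct / 100)


# Reproduces leaderboard rows: qwen3-235b and ministral-3b.
print(adjusted(84, 96))    # 81
print(adjusted(85, 45.8))  # 39
```

This is why ministral-3b drops from a GoldScore of 85 to the bottom third: its 45.8% faithfulness more than halves the score.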
The gold benchmark
The fixed source is the Calder Reservoir dam-failure incident report, a contested two-inquiry case. It was chosen because it stresses donto's thesis: two expert panels blame different causes, there's a consensus point both agree on, and a press release is later corrected. A faithful extraction must therefore preserve the contradiction, attribute claims to their source, and reify the correction over time, not collapse it. Claude hand-extracted the gold:
Method — the recipe + per-model fairness
- prompt: SUPERB extract_broad.txt (donto-api). Abundance-native, emit-free, multi-directional, paraconsistent, evidence-first, with a non-fabrication principle. Source bookended as inert data.
- gleaning: thread-nudge gleaning, up to 6 passes, dry-stop after a pass adds <3 new; session_id provider stickiness.
- scoring: vs the 303-claim Calder gold; GoldScore = entity-cov 30 + contested-structure 30 + inference 15 + predicate-diversity 15 + faithfulness 10.
- fairness: Each model gets its own tuned config below (reasoning mode, token budgets, prompt variant, provider pin) so a model is never handicapped by a one-size recipe. prompt_variant=superb_fewshot anchors JSONL format for models that drift to prose.
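The gleaning rule (up to 6 passes, stop once a pass adds fewer than 3 new claims) can be sketched as a small loop. This is an illustration only: `extract_pass` is a hypothetical callable standing in for one model call, claims are treated as plain strings deduplicated by exact match, and keeping the claims from the final "dry" pass is an assumption.

```python
from typing import Callable, Iterable, List, Tuple


def glean(
    extract_pass: Callable[[int], Iterable[str]],
    max_passes: int = 6,
    dry_threshold: int = 3,
) -> Tuple[List[str], int]:
    """Run extraction passes until one adds fewer than `dry_threshold` new claims.

    Returns the deduplicated claims in first-seen order and the number of
    passes actually run.
    """
    claims = {}  # dict used as an insertion-ordered set: claim -> first pass seen
    passes_run = 0
    for pass_no in range(1, max_passes + 1):
        before = len(claims)
        for claim in extract_pass(pass_no):
            claims.setdefault(claim, pass_no)
        passes_run = pass_no
        if len(claims) - before < dry_threshold:
            break  # dry-stop: this pass added too little to justify another call
    return list(claims), passes_run
```

A real run would also carry the session_id so that every pass lands on the same provider; that plumbing is left out here.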
Per-model config (fairness)
| model | reasoning | max_tok | prompt | pin | tuned | notes |
|---|---|---|---|---|---|---|
| qwen/qwen3-235b-a22b-2507 | none | 10000 | lean | — | ✓ | qwen3-235b with LEAN prompt (~2700 tok vs ~3958 SUPERB; lens list compressed, FACT SHAPE + ANCHOR kept verbatim). Tests: keep gold quality at ~30% fewer input tokens (compact's quality cliff avoided). |
| deepseek/deepseek-v4-flash | 4000 | 12000 | superb | GMICloud | ✓ | Reasoning model. GMICloud pin (others ignore reasoning cap -> 0 content). Production lane. |
| xiaomi/mimo-v2.5-pro | off | 8000 | superb | — | ✓ | Reasoning model run reasoning-OFF (reasoning:{enabled:false}). Highest raw abundance. |
| qwen/qwen3-235b-a22b-2507 | none | 10000 | superb | — | ✓ | Big 235B MoE paid ($0.094/M). Reliable, strong (217 claims pass1). |
| nvidia/nemotron-3-super-120b-a12b:free | 4000 | 12000 | superb | — | ✓ | FREE, 120B MoE reasoning. Strong abundance; does not tag h:true (inference score low). Free-tier rate-limited (needs retry). |
| qwen/qwen3-235b-a22b-2507 | none | 10000 | compact | — | ✓ | qwen3-235b with COMPACT prompt (~200 tok vs ~3900 SUPERB) — context-reduction experiment: does quality hold at a fraction of the input tokens? |
| google/gemma-3-12b-it | none | 8000 | superb | — | ✓ | Gemma3 12B paid ($0.090/M). |
| mistralai/mistral-nemo | none | 8000 | superb | — | ✓ | Cheap paid ($0.024/M, 5.7x cheaper). No reasoning. |
| inclusionai/ling-2.6-flash | none | 8000 | superb | — | ✓ | Cheapest paid tool-capable ($0.018/M, 7.6x cheaper). Reliable, no 429. |
| openai/gpt-oss-120b:free | 3000 | 12000 | superb_fewshot | — | · | Free, often 429. gpt-oss drifts to prose -> fewshot variant to anchor JSONL. |
| arcee-ai/trinity-mini | 3000 | 10000 | superb | — | · | Paid reasoning ($0.087/M). |
| ibm-granite/granite-4.1-8b | none | 8000 | superb | — | ✓ | IBM Granite 8B paid ($0.070/M). |
| amazon/nova-micro-v1 | none | 8000 | superb | — | ✓ | Amazon Nova Micro paid ($0.077/M). |
| mistralai/ministral-3b-2512 | none | 8000 | superb | — | ✓ | Ministral 3B paid ($0.100/M). |
| meta-llama/llama-3.1-8b-instruct | none | 8000 | superb | — | ✓ | Floor model. Collapses contradictions, 0 inferences, loops. Documented fail. |
| qwen/qwen3.5-9b | 3000 | 10000 | superb | — | · | Paid 9B reasoning ($0.12/M). |
| qwen/qwen-2.5-7b-instruct | none | 8000 | superb | — | ✓ | Cheap paid ($0.064/M). |
| nvidia/nemotron-3-nano-30b-a3b:free | 2000 | 10000 | superb | — | · | RETUNE: reasoning:4000 ate the budget -> only 2 claims. Lower reasoning to 2000 so output has room. |
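The blended prices quoted in the notes column, and the backlog$ column of the leaderboard, can be reproduced with a short calculation. None of the three constants below is stated in the report; they are reverse-engineered from the published figures and should be read as assumptions: a 60/40 input/output token split, roughly 15,900 tokens per chunk (about 5,600M tokens over the backlog), and a 1.5× multiplier for rows run with reasoning on.

```python
PENDING_CHUNKS = 352_175      # from the leaderboard footnote
INPUT_SHARE = 0.6             # assumption: inferred from the blended prices
TOKENS_PER_CHUNK = 15_900     # assumption: inferred from the backlog$ column
REASONING_MULTIPLIER = 1.5    # assumption: inferred from reasoning-on rows


def blended_price(price_in: float, price_out: float) -> float:
    """Blended $/M tokens under the assumed input/output split."""
    return INPUT_SHARE * price_in + (1 - INPUT_SHARE) * price_out


def backlog_usd(price_in: float, price_out: float, reasoning_on: bool) -> float:
    """Rough dollar projection over the pending backlog."""
    million_tokens = PENDING_CHUNKS * TOKENS_PER_CHUNK / 1_000_000
    cost = blended_price(price_in, price_out) * million_tokens
    return cost * REASONING_MULTIPLIER if reasoning_on else cost


# deepseek-v4-flash, the current production lane (reasoning on).
print(round(blended_price(0.098, 0.196), 4))    # 0.1372
print(round(backlog_usd(0.098, 0.196, True)))   # 1152
# qwen3-235b (no reasoning).
print(round(backlog_usd(0.09, 0.10, False)))    # 526
```

Under these assumptions the "5.7x cheaper" note for mistral-nemo follows directly: 0.1372 / 0.024 ≈ 5.7.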
Living benchmark — new models are added continuously. Data: model-experiments/leaderboard.jsonl. Generated 2026-06-30T20:44:18Z.