Benchmark · extraction quality vs price

Which model extracts donto-grade claims — for how much?

donto's pitch is deep extraction: an exhaustive, multi-directional, contradiction-preserving, evidence-anchored firehose of claims — not flat fact-stuffing. We bake off LLMs at that task on a fixed source and score each against a hand-authored Claude gold benchmark. Every model gets its own tuned config (reasoning mode, token budget, prompt variant, provider pin) so the comparison is fair — a model is never handicapped by a one-size recipe. The goal: find the quality/price sweet spot far below our current deepseek lane ($0.1372/M blended).

models tested

FREE

nemotron-3-super-120b-a12b:free

$526

qwen3-235b-a22b-2507

352,175

backlog chunks

Recommendation — faithfulness-adjusted

don't pick on GoldScore alone

Never pick on GoldScore alone — the adversarial faithfulness pass reordered everything.

Best free (the ~1/50th goal)

nvidia/nemotron-3-super-120b-a12b:free — 93.75% faithful, $0

Best paid (contested cases)

qwen/qwen3-235b-a22b-2507 — 96% faithful, ~$526 backlog

Two-lane: route bulk/uncontested chunks to nemotron-super120 (FREE) after a contested-document validation batch (its contradiction-preservation is unproven, not failed); route known-contested chunks to qwen3-235b (the only cheap model both abundant AND genuinely faithful, near the deepseek/mimo anchor). REJECT ling26flash (fabricates spans), ministral3b/granite4-8b/gptoss120 (collapse the contested core; GoldScores inflated by padding).
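
The two-lane policy above can be sketched as a small routing function. This is a minimal illustration, not the production router: the `Chunk` type, the `is_contested` flag, and the `bulk_lane_validated` switch are hypothetical names, and falling back to the paid lane before the validation batch has passed is an assumption of this sketch.

```python
from dataclasses import dataclass

# Model IDs come from the recommendation; all other names are hypothetical.
BULK_MODEL = "nvidia/nemotron-3-super-120b-a12b:free"
CONTESTED_MODEL = "qwen/qwen3-235b-a22b-2507"


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    is_contested: bool  # assumed to be decided upstream of the router


def route_chunk(chunk: Chunk, bulk_lane_validated: bool = False) -> str:
    """Return the model ID a chunk should be sent to."""
    if chunk.is_contested:
        # Known-contested chunks always go to the faithful paid lane.
        return CONTESTED_MODEL
    # Assumption: until the free model passes a contested-document
    # validation batch, uncontested chunks also stay on the paid lane.
    return BULK_MODEL if bulk_lane_validated else CONTESTED_MODEL
```

Keeping the validation gate as an explicit argument makes the "unproven, not failed" status of the free lane visible in the routing code itself.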

⚠ GoldScore-inflated (high abundance, low real faithfulness) — rejected: ministral3b, gptoss120, granite4-8b, ling26flash

Leaderboard — vs the Calder gold

updated 2026-06-30T20:44:18Z

#	model	mode	Gold	faith✓%	adj	claims	preds	inf	ent%	contst	fab%	$in/M	$out/M	backlog$	verdict
1	qwen/qwen3-235b-a22b-2507	none	89	—	89	1058	411	264	65	73	6.2	0.09	0.1	$526	abundant (unverified)
2	deepseek/deepseek-v4-flash	on	88	—	88	412	269	85	61	33	1.5	0.098	0.196	$1,152	abundant (unverified)
3	xiaomi/mimo-v2.5-pro	off	88	—	88	798	215	163	60	103	0.1	0.435	0.87	$3,410	abundant (unverified)
4	qwen/qwen3-235b-a22b-2507	none	84	96	81	488	264	31	74	19	1.2	0.09	0.1	$526	✓ faithful + abundant
5	nvidia/nemotron-3-super-120b-a12b:free	on	76	93.75⚠	71	479	256	0	71	22	0.4	0	0	FREE	✓ faithful (contested unproven)
6	qwen/qwen3-235b-a22b-2507	none	70	—	70	326	298	110	11	26	31.3	0.09	0.1	$526	abundant (unverified)
7	google/gemma-3-12b-it	none	63	—	63	174	91	0	54	13	17.2	0.05	0.15	$504	abundant (unverified)
8	mistralai/mistral-nemo	none	62	—	62	150	146	62	40	3	0	0.02	0.03	$134	abundant (unverified)
9	inclusionai/ling-2.6-flash	none	78	68	53	334	298	242	32	11	19.2	0.01	0.03	$101	⚠ partial — some fabrication
10	openai/gpt-oss-120b:free	on	82	64⚠	52	312	168	29	68	17	0	0	0	FREE	✗ low faithfulness · GoldScore inflated
11	arcee-ai/trinity-mini	on	43	—	43	77	64	0	14	6	0	0.045	0.15	$731	partial (unverified)
12	ibm-granite/granite-4.1-8b	none	74	56⚠	41	259	91	90	61	6	2.7	0.05	0.1	$392	✗ low faithfulness · GoldScore inflated
13	amazon/nova-micro-v1	none	40	—	40	83	68	0	34	4	18.1	0.035	0.14	$431	partial (unverified)
14	mistralai/ministral-3b-2512	none	85	45.8	39	323	288	60	51	34	0	0.1	0.1	$560	✗ low faithfulness · GoldScore inflated
15	meta-llama/llama-3.1-8b-instruct	none	28	—	28	43	24	0	39	1	0	0.02	0.03	$134	fails gold bar
16	qwen/qwen3.5-9b	on	28	—	28	164	2	0	62	0	5.5	0.1	0.15	$1,008	fails gold bar
17	qwen/qwen-2.5-7b-instruct	none	25	—	25	22	22	0	17	2	0	0.04	0.1	$358	fails gold bar
18	nvidia/nemotron-3-nano-30b-a3b:free	on	11	—	11	2	2	0	3	0	0	0	0	FREE	fails gold bar

GoldScore 0–100 = entity coverage (30) · contested/consensus/bitemporal structure (30) · inference (15) · predicate diversity (15) · faithfulness/low-fabrication (10). contst and fab% are heuristic signals (substring + lexical); the rigorous faithfulness check is an adversarial multi-agent pass on the leaders. adj = Gold scaled by faith✓% where that pass was run (e.g. 84 × 96% ≈ 81), otherwise Gold unchanged. backlog$ = rough projection over 352,175 pending chunks; FREE models = the headline ~1/50th-and-beyond goal (subject to free-tier rate limits).
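
As a rough sketch of the scoring arithmetic: the weights below are the ones listed above, but how each component is normalised is not stated here, so `gold_score` simply assumes every component arrives as a fraction in [0, 1]. The `adjusted_score` rule (Gold × faith✓% / 100, rounded) is inferred from the leaderboard rows, which it reproduces; treat it as an assumption, not the scorer's actual code. All function names are hypothetical.

```python
from typing import Optional

# Component weights as listed in the GoldScore definition (sum to 100).
WEIGHTS = {
    "entity_coverage": 30,
    "contested_structure": 30,
    "inference": 15,
    "predicate_diversity": 15,
    "faithfulness": 10,
}


def gold_score(components: dict[str, float]) -> float:
    """Weighted sum of component fractions, each clamped to [0, 1]."""
    return sum(
        weight * min(max(components[name], 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )


def adjusted_score(gold: float, faith_pct: Optional[float]) -> int:
    """Scale GoldScore by verified faithfulness; pass through if unverified."""
    if faith_pct is None:  # shown as a dash in the faith column
        return round(gold)
    return round(gold * faith_pct / 100)
```

For example, `adjusted_score(85, 45.8)` gives 39 and `adjusted_score(76, 93.75)` gives 71, matching the ministral-3b and nemotron-super rows.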

The gold benchmark

hand-authored by Claude

The fixed source is the Calder Reservoir dam-failure incident report, a contested two-inquiry case. It was chosen because it stresses donto's thesis: two expert panels blame different causes, there is a consensus point both agree on, and a press release is later corrected. A faithful extraction must therefore preserve the contradiction, attribute claims to their source, and reify the correction over time, not collapse it. Claude hand-extracted the gold:

303

claims

160

predicates

inferences

entities

contested edges

Method — the recipe + per-model fairness

same recipe · tuned configs

prompt: SUPERB extract_broad.txt (donto-api) — abundance-native, emit-free, multi-directional, paraconsistent, evidence-first, non-fabrication principle. Source bookended as inert data.
gleaning: thread-nudge gleaning, up to 6 passes, dry-stop after a pass adds <3 new; session_id provider stickiness.
scoring: vs the 303-claim Calder gold; GoldScore = entity-cov 30 + contested-structure 30 + inference 15 + predicate-diversity 15 + faithfulness 10.
fairness: Each model gets its own tuned config below (reasoning mode, token budgets, prompt variant, provider pin) so a model is never handicapped by a one-size recipe. prompt_variant=superb_fewshot anchors JSONL format for models that drift to prose.
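
The gleaning rule in the recipe (up to 6 passes, dry-stop once a pass adds fewer than 3 new claims) can be sketched as the loop below. `extract_pass` is a hypothetical callable standing in for one model call plus parsing, and claims are represented as plain strings purely for illustration; session stickiness and the nudge prompt are out of scope.

```python
from typing import Callable, Iterable

MAX_PASSES = 6          # "up to 6 passes"
DRY_STOP_THRESHOLD = 3  # stop after a pass that adds fewer than 3 new claims


def glean(
    extract_pass: Callable[[int, set[str]], Iterable[str]],
) -> tuple[set[str], int]:
    """Run gleaning passes; return (all claims seen, passes actually run)."""
    seen: set[str] = set()
    passes_run = 0
    for pass_no in range(1, MAX_PASSES + 1):
        # Only claims not already collected count as "new" for the dry-stop.
        new_claims = set(extract_pass(pass_no, seen)) - seen
        seen |= new_claims
        passes_run = pass_no
        if len(new_claims) < DRY_STOP_THRESHOLD:
            break
    return seen, passes_run
```

Passing the already-seen set into each pass lets a caller build the nudge from what has been extracted so far, which is one plausible reading of "thread-nudge".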

Per-model config (fairness)

model	reasoning	max_tok	prompt	pin	tuned	notes
qwen/qwen3-235b-a22b-2507	none	10000	lean	—	✓	qwen3-235b with LEAN prompt (~2700 tok vs ~3958 SUPERB; lens list compressed, FACT SHAPE + ANCHOR kept verbatim). Tests: keep gold quality at ~30% fewer input tokens (compact's quality cliff avoided).
deepseek/deepseek-v4-flash	4000	12000	superb	GMICloud	✓	Reasoning model. GMICloud pin (others ignore reasoning cap -> 0 content). Production lane.
xiaomi/mimo-v2.5-pro	off	8000	superb	—	✓	Reasoning model run reasoning-OFF (reasoning:{enabled:false}). Highest raw abundance.
qwen/qwen3-235b-a22b-2507	none	10000	superb	—	✓	Big 235B MoE paid ($0.094/M). Reliable, strong (217 claims pass1).
nvidia/nemotron-3-super-120b-a12b:free	4000	12000	superb	—	✓	FREE, 120B MoE reasoning. Strong abundance; does not tag h:true (inference score low). Free-tier rate-limited (needs retry).
qwen/qwen3-235b-a22b-2507	none	10000	compact	—	✓	qwen3-235b with COMPACT prompt (~200 tok vs ~3900 SUPERB) — context-reduction experiment: does quality hold at a fraction of the input tokens?
google/gemma-3-12b-it	none	8000	superb	—	✓	Gemma3 12B paid ($0.090/M).
mistralai/mistral-nemo	none	8000	superb	—	✓	Cheap paid ($0.024/M, 5.7x cheaper). No reasoning.
inclusionai/ling-2.6-flash	none	8000	superb	—	✓	Cheapest paid tool-capable ($0.018/M, 7.6x cheaper). Reliable, no 429.
openai/gpt-oss-120b:free	3000	12000	superb_fewshot	—	·	Free, often 429. gpt-oss drifts to prose -> fewshot variant to anchor JSONL.
arcee-ai/trinity-mini	3000	10000	superb	—	·	Paid reasoning ($0.087/M).
ibm-granite/granite-4.1-8b	none	8000	superb	—	✓	IBM Granite 8B paid ($0.070/M).
amazon/nova-micro-v1	none	8000	superb	—	✓	Amazon Nova Micro paid ($0.077/M).
mistralai/ministral-3b-2512	none	8000	superb	—	✓	Ministral 3B paid ($0.100/M).
meta-llama/llama-3.1-8b-instruct	none	8000	superb	—	✓	Floor model. Collapses contradictions, 0 inferences, loops. Documented fail.
qwen/qwen3.5-9b	3000	10000	superb	—	·	Paid 9B reasoning ($0.12/M).
qwen/qwen-2.5-7b-instruct	none	8000	superb	—	✓	Cheap paid ($0.064/M).
nvidia/nemotron-3-nano-30b-a3b:free	2000	10000	superb	—	·	RETUNE: reasoning:4000 ate the budget -> only 2 claims. Lower reasoning to 2000 so output has room.
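
The blended $/M figures quoted in the notes are consistent with a 60% input / 40% output token mix applied to the $in/M and $out/M columns. That weighting is inferred from the numbers, not stated anywhere in this page, so the sketch below treats it as an assumption; the function names are hypothetical.

```python
# Assumption: 60/40 input/output mix, inferred from the published figures.
INPUT_SHARE = 0.6


def blended_price(
    in_per_m: float, out_per_m: float, input_share: float = INPUT_SHARE
) -> float:
    """Blended $/M tokens for a given input/output price pair."""
    return input_share * in_per_m + (1.0 - input_share) * out_per_m


def times_cheaper(candidate: float, baseline: float) -> float:
    """How many times cheaper `candidate` is than `baseline` (both $/M)."""
    return baseline / candidate


deepseek = blended_price(0.098, 0.196)  # current production lane
nemo = blended_price(0.02, 0.03)        # mistralai/mistral-nemo
print(f"deepseek ${deepseek:.4f}/M, nemo ${nemo:.3f}/M, "
      f"{times_cheaper(nemo, deepseek):.1f}x cheaper")
```

Under that mix the deepseek lane comes out at $0.1372/M and mistral-nemo at $0.024/M, about 5.7x cheaper, which agrees with the figures quoted in the notes column.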

Living benchmark — new models are added continuously. Data: model-experiments/leaderboard.jsonl. Generated 2026-06-30T20:44:18Z.