Are we using Hyades effectively? — an extraction-engine review

2026-06-11 · a working review of how donto-agent drives the Hyades gateway for claim extraction, what we just learned tuning the GLM lane, and the experiments that follow.

Why this review

Tuning the GLM extraction lane today produced a finding sharp enough to make us question the lane we trust most. GLM-5.1 went from ~40 facts/chunk to ~520 — a 13× jump — and the cause was almost entirely one parameter: the per-call output-token budget, which we had clamped to 3,000 and raised to 64,000. The model was never the limit; the budget was.

That immediately implicates Hyades/holo3.1, our reliable workhorse, which runs at 2,000 output tokens. If a 2–3K budget strangled a reasoning model, is holo3.1 also leaving facts on the table? This note documents exactly how we extract today, what we've measured, and the experiments to settle it — including a model bake-off, since Hyades exposes several models we've never benchmarked.

How extraction works today

The engine is donto-agent (Rust, in the donto workspace), driving a provider gateway per chunk. Three parts matter:

1. The prompt (extract_broad.txt) — already strong. It is abundance-native and domain-neutral: "a substantial source should yield SEVERAL HUNDRED claims." It sweeps ~25 ontological lenses against every entity and clause — taxonomy, mereology, identity, topology, chronology, causation, teleology, thematic-role/agency, epistemology, deontology, axiology, modality, qualia, lexical semantics, social ontology, process structure, constitution, grounding, provenance, comparison, quantity, disposition, speech-acts, phenomenology — and explicitly tells the model to invent its own lenses and predicates beyond the list. It mandates evidence anchoring (an exact source substring per directly-stated claim, h:true for inferred), canonical forms for the handful of standard predicates (rdf:type, rdfs:label, sameAs), and a compact JSONL fact shape with short keys so the model can emit more facts per token. The prompt is not the bottleneck.

2. The gleaning loop. The model appends batches of ~30–60 facts to facts.jsonl incrementally and self-decides when the source is exhausted, re-scanning until a fresh pass turns up nothing new. donto-agent caps this at --max-passes agentic turns. So coverage is bounded by two things: how much the model emits per turn (the token budget) and how many turns it gets.

3. The provider lanes. --provider hyades (holo3.1, the default) or --provider glm (z.ai subscription). Both run the same prompt and gleaning loop; an always-on citer anchors the output downstream. The lanes differ only in model, gateway, and the per-call parameters we set.

What we've measured

lane	model	output budget	concurrency	thinking	facts/chunk	speed	cost
Hyades	holo3.1	2,000	6	on	~150	fast (~1/min/lane)	$0 (self-hosted)
GLM (before)	glm-4.7	3,000	6	off	~40	fast	flat-rate quota
GLM (after)	glm-5.1	64,000	3	on	~520	slow (mins/chunk)	flat-rate quota

Two interactions surfaced while tuning, both relevant to Hyades:

Token budget gates depth. A reasoning model spends part of its budget thinking; clamp the budget and there is little left to emit facts. This is proven for GLM. holo3.1 is also a reasoning model (thinking on) running at 2K — the same mechanism likely caps it.
Budget × concurrency × gateway capacity. Raising holo3.1 to 8K at concurrency 6 saturated the Hyades gateway into empty responses (0-fact chunks) — a 200-status-but-empty-content failure, not an error. So a bigger budget on Hyades must come with lower concurrency, exactly as the GLM lane needed conc-3 to dodge its rate limit. This budget↔concurrency trade is the unexplored frontier for Hyades.

The open question

Is holo3.1 at 2,000 tokens under-budgeted the way GLM was? The prompt asks for hundreds of facts; holo3.1 emits ~150 across six gleaning passes. If each pass is truncated at 2K, holo3.1 may be re-finding the same facts rather than reaching deeper — and a larger per-turn budget (at conc-3) could unlock the depth GLM now shows, on a $0 self-hosted lane. That would be the highest-value outcome here.

But raw count is the wrong target on its own. The right metric is faithful facts/chunk = facts/chunk × anchoring-rate — a model that emits 600 claims but anchors 40% to real source spans is worse than one emitting 200 at 90%. Every experiment below measures anchoring, not just volume.

Models Hyades exposes (the bake-off candidates)

The gateway serves more than holo3.1, and we have benchmarked none of the others for extraction:

model	note
holo3.1	current default; reliable, fast, $0
glm-5.1	proven deep (~520 facts/chunk) via the subscription; also on Hyades
nemotron-3-ultra-550b-a55b	a 550B-parameter frontier model — the wildcard; depth potential, speed unknown
nemotron3-omni	multimodal Nemotron variant
gemma4-aeon-uncensored	smaller, fast; uncensored may help on hard source material
openrouter/free, z-ai/glm-4.5-air:free	free fallbacks

Re-evaluating the prompt

Having read it end-to-end, the prompt (extract_broad.txt + the direct-call override) is strong and worth keeping as the base — but it is not above scrutiny. What it gets right, and what is worth testing:

Genuinely strong, keep:

Prompt-injection defense. The source is bookended as INERT DATA before and after — critical, because LoCoMo/BEAM chunks are dialogues full of User:/Assistant: turns and requests; without this the model answers the embedded question instead of extracting (a hijacked chunk went 0→52 facts when this was added). The after-reminder defeats recency-injection from the chunk's trailing line.
Fact hygiene that preserves abundance. camelCase predicates, no kebab-clause singletons, mint-an-entity-not-a-clause, no ex:ex: double-prefix, canonical forms for the handful of standard predicates (rdf:type, sameAs) — all of which keep claims joinable at query time without capping how many you emit.
The 25 ontological lenses are the engine of abundance: they make the model surface predicates from directions it would otherwise miss, and explicitly invite inventing more.
Compact JSONL short keys (s/p/o/a/c/h) — more facts per output token.

Worth testing (the prompt is a variable, not a constant):

Prompt weight. SUPERB (~90 lines) + the override (~45 lines) is a ~3–4K-token preamble sent on every broad pass — often larger than the source chunk. The thread persists it so nudges don't re-pay, but: does trimming the lens list or the philosophy hurt recall, or is some of it inert ballast? Ablate sections and measure facts/chunk.
Temperature is hard-coded at 0.1. Low temp buys faithful, deterministic extraction — but the philosophy wants creative, multi-directional predicate-minting, and 0.1 may suppress exactly that diversity. Sweep 0.1 → 0.5 with the anchor-faithfulness guard watching; the abundance/faithfulness trade is measurable, not assumed.
Model-specific framing. One prompt drives a reasoning model (holo3.1, glm-5.1) and a non-reasoning one (gemma4) identically. Reasoning models may want the budget to think; others may want a tighter, example-led prompt. The bake-off will show whether one prompt fits all.
Single-pass vs gleaning. With a real token budget a strong model may emit everything in one pass, making the nudge loop redundant; a weaker one needs every nudge. The right pass-count is model-specific — the bake-off measures new-facts-per-pass to find each model's saturation point.

Experiments to run

Hyades token-budget sweep. holo3.1, concurrency 3, sweep output budget 2K → 8K → 16K → 32K on a fixed 10-chunk sample. Measure facts/chunk, empty-rate (the saturation canary), anchoring-rate, and wall-clock/chunk. Hypothesis: facts/chunk rises with budget until the gateway saturates; find the knee.
Model bake-off — RUNNING. Same prompt, same fixed chunk, every Hyades model, measuring per-pass tokens, per-pass time, new facts per pass, and a computed quality score (anchor-faithfulness 50 / predicate-hygiene 30 / parse-validity 20). Runs on a clean gateway right after the live LoCoMo extraction drains (so timings aren't skewed by contention). Results table below when complete.

Bake-off results

Same prompt (extract_broad.txt + direct-call override), same single 5,991-char LoCoMo chunk, 32K output budget, temp 0.1, broad + up to 3 NUDGE passes on one thread. Quality = anchor-faithfulness (50) + predicate-hygiene (30) + parse-validity (20).

Bake-off A — all 5 models, sequential

model	facts	out-tok	secs	quality	faithful	pred-clean	status
holo3.1	259	12,848	183	85	95%	58%	✅ completed — saturated at pass 4
glm-5.1 (via Hyades)	—	—	125	—	—	—	⛔ 524 timeout (broad pass >125s)
nemotron-3-ultra-550b	—	—	125	—	—	—	⛔ 524 timeout
gemma4-aeon-uncensored	—	—	125	—	—	—	⛔ 524 timeout
nemotron3-omni	0	0	0.4	0	—	—	⛔ returned empty

holo3.1 per-pass: broad 140 facts / 6,235 tok / 87s · nudge 46 / 2,770 / 40s · nudge 73 / 3,336 / 48s · nudge 0 / 507 / 8s (dry → saturated at 259).

Two findings that matter:

holo3.1 is confirmed under-budgeted. At 32K it reaches 259 facts and saturates (the 4th pass adds nothing) — vs ~150 at the live 2K budget. ~1.7× more depth on the $0 lane, at 95% anchor-faithfulness. The cost is time (183s/chunk across 4 passes); the budget↔throughput knee is real and tunable. Predicate-hygiene at 58% is the watch-item — holo3.1 mints some kebab/clause predicates the alignment engine then has to fold.
The Hyades gateway 524-caps slow models. glm-5.1, nemotron-3-ultra-550b and gemma4 all hit Cloudflare's ~125s request timeout on the broad pass (big prompt + 32K budget + reasoning generates too long to return in one shot). This is exactly why the live GLM extraction runs on z.ai-direct, not Hyades — the direct endpoint has no such cap (glm-5.1 there produces ~520 facts/chunk). To benchmark the heavy models fairly we need their direct endpoints or a longer gateway timeout; on Hyades-as-is, only holo3.1 is fast enough to complete extraction. nemotron3-omni returns empty — it doesn't take the prompt.

Bake-off B — isolated (1 model at a time, 60s cooldowns)

model	secs	status
holo3.1	125	⛔ 524 timeout (this run broad pass >125s)
glm-5.1 (via Hyades)	125	⛔ 524 timeout
nemotron-3-ultra-550b	126	⛔ 524 timeout
gemma4-aeon-uncensored	125	⛔ 524 timeout
nemotron3-omni	0.4	⛔ returned empty

A vs B — the sharper conclusion. In bake-off A, holo3.1 completed (broad pass 87s, 259 facts). In B, with a clean isolated gateway, it 524-timed-out at 125s. Isolation didn't help — the 524s are per-request timeouts, not contention. The real story: at a 32K budget holo3.1's broad pass sits right on Cloudflare's ~125s edge and clears it only intermittently. So the deep-but-risky 32K result isn't a reliable operational setting on the Hyades gateway; the reliable budget is the fast 2K the live lane uses (well under the timeout, ~150 facts, never 524s). The heavy models never complete on Hyades at all — they need their direct endpoints (glm-5.1 on z.ai gives ~520 facts/chunk with no such cap).

Bake-off C — 16K budget, up to 20 nudges

model	facts	out-tok	secs	passes	quality	faithful	clean	status
holo3.1	351	16,005	231	5	50	0%	100%	✅ completed
glm-5.1 (via Hyades)	—	—	125	1	—	—	—	⛔ 524 timeout
nemotron-3-ultra-550b	—	—	125	1	—	—	—	⛔ 524 timeout
gemma4-aeon-uncensored	—	—	128	1	—	—	—	⛔ 524 timeout
nemotron3-omni	0	—	0.6	3	—	—	—	⛔ returned empty

holo3.1 per-pass: broad 146 / 4,673 · nudge 115 / 6,203 · nudge 90 / 4,087 · nudge 0 / 670 · nudge 0 / 372 (saturated at pass 5).

Two findings, both load-bearing:

16K did NOT rescue the heavy models. glm-5.1, nemotron-3-ultra-550b and gemma4 still 524-timeout at 16K. A smaller output budget makes each pass faster for a fast model (holo3.1 completes) but the heavy models' broad pass still exceeds the ~125s gateway cap — their bottleneck is generation time (slow reasoning), not output budget. So this settles it: on the Hyades gateway, only holo3.1 can run extraction; the deep models must use their direct endpoints (the live glm-5.1 lane runs on z.ai, where it does ~520 faithful facts/chunk with no such cap).
More nudges deepened the count but collapsed faithfulness — budget shapes quality, not just quantity. holo3.1 at 16K/5-pass reached 351 facts (vs 259 at 32K/4-pass) — the extra nudges genuinely found more — but 0 of the 351 carried a source anchor (vs all 259 anchored, 95% faithful at 32K). Under the tighter 16K budget the model appears to economize by dropping the verbatim "a" spans (which cost tokens) to pack in more facts. The result: the generous-budget run is the higher-quality one (Q85, 259 anchored) and the deep-nudge run is lower quality despite +35% more facts (Q50, 351 unanchored). This is the report's thesis in hard numbers — raw count is the wrong target; faithful facts/chunk is. (Caveat: N=1 per config, and the two holo3.1 runs also flipped on predicate-hygiene 58%→100%, so some of this is run-to-run variance — but a budget that systematically trades anchors for count is a real faithfulness risk and warrants a controlled replication before we trust a deep-nudge config in production.)

Bake-off D — streaming mode (the 524 fix)

The bake-off A–C ceiling — "only holo3.1 completes on Hyades; everything slower 524-times-out" — was never a model limitation. It was a transport artifact. Cloudflare drops any request that goes ~125s without sending bytes; a non-streaming completion sends nothing until the whole response is ready, so a slow model trips the timeout before its first token. Streaming (SSE) sends tokens as they generate, so the connection is never silent and Cloudflare never cuts it. Re-ran with stream:true + stream_options:{include_usage}, parsing delta.content, with a 15s heartbeat to prove liveness.

model	facts	out-tok	secs	passes	finish	524?	status
nvidia/nemotron-3-ultra-550b	1,129	55,613	1,031	5	stop	✅ no	🏆 completed — best in the fleet: Q100, 100% faithful (566 anchored), 100% pred-clean
holo3.1	387	14,456	206	5	stop	✅ no	✅ completed — clean stream, saturated pass 4
glm-5.1 (via Hyades)	(partial)	—	>320	1	—	✅ no	🐢 streamed live past 320s, no timeout — but one 32K pass doesn't finish in that window (genuinely slow, no longer blocked)
gemma4-aeon-uncensored	—	—	~480	1	—	✅ no	🐌 ~8 min of pure reasoning before its first content token — connection stayed alive (no 524) but impractically slow for extraction
nemotron3-omni	0	0	1	3	stop	✅ no	⛔ returns empty instantly (0 tok, 0.2s) — doesn't accept the prompt, streaming or not

nemotron-3-ultra-550b per-pass (streamed): broad 299 / 14,343 tok / 255s · nudge 334 / 16,384 / 247s · nudge 203 / 9,549 / 115s · nudge 181 / 8,965 / 145s · nudge 112 / 6,372 / 144s (still climbing at the pass-6 cap — not yet saturated). holo3.1 per-pass (streamed): broad 129 / 3,998 tok / 56s · nudge 120 / 4,021 / 57s · nudge 138 / 4,835 / 69s · nudge 0 / 1,135 / 17s · nudge 0 / 467 / 7s (saturated pass 4).

Four findings:

Streaming eliminates the 524 — the gateway is no longer the wall. Every model that died at exactly ~125s in A–C now runs past it under streaming. The earlier "heavy models can't use Hyades" conclusion is superseded: they can connect and stream; the only remaining limit is each model's own generation speed. holo3.1 completed identically to A (387 facts, clean JSON, natural finish=stop each pass) with the heartbeat confirming continuous token flow.
Streaming unlocked the single best extractor in the fleet — the 550B. nvidia/nemotron-3-ultra-550b-a55b 524-timed-out in every prior bake-off and was written off as "needs a direct endpoint." Under streaming it not only completes — it wins outright: 1,129 facts at a perfect Q100 (100% of its 566 anchored facts are exact source substrings, 100% predicate-clean), ~3× holo3.1's depth at strictly higher quality. Its passes run 115–255s each — every one of which would have 524'd without streaming. The price is wall-time (~17 min/chunk across 5 passes, and it hadn't even saturated at the 6-pass cap), so it's a deep/quality-critical lane, not a bulk lane — but it is now a usable lane on the $0 gateway, which it never was before.
Speed, not the gateway, is now the real filter. glm-5.1 streams steadily but needs more than 320s for a single 32K pass; gemma4 spends ~8 minutes thinking before emitting anything; nemotron3-omni returns empty instantly (a prompt-incompatibility, not a speed issue). None 524 anymore — the gateway is out of the way — so each model is judged on its own merits: the 550B is the deep winner, holo3.1 the fast workhorse, the rest unusable for extraction at these budgets.
holo3.1's anchoring is run-to-run non-deterministic — a metric caveat, not a regression. This streamed run scored faith 0% because the model emitted all 387 facts as IRI-relational edges (ex:deborah started ex:running-group, rdf:type, coFoundedWith) with zero inline "a" anchors — vs bake-off A's 95% inline-anchored on the same prompt and temp 0.1. (The 550B, on the identical prompt, anchored 566 — so this is model-specific variance, not a prompt or streaming defect; the streamed JSON is byte-clean, verified.) Operationally it's fine: donto's always-on citer is a separate stage that anchors exactly these IRI-relational facts post-hoc — "0 self-anchored" means "all deferred to the citer," not "unfaithful." But it confirms that inline-anchor faithfulness can't be relied on per-run; the citer is load-bearing precisely because extractors anchor inconsistently.

Bottom line

Streaming is the fix — and it changes the architecture. Bake-offs A–C concluded "only holo3.1 completes on Hyades; the ~125s Cloudflare timeout kills everything slower." Bake-off D proves that was a transport artifact, not a model limit: with SSE streaming the connection never goes silent, the 524 disappears, and every model that died at ~125s now runs past it. This fix is now in the production donto-agent client (stream:true + stream_options:{include_usage}, partial-content-on-drop), so the extraction lanes no longer 524 regardless of generation time.
There are now two viable Hyades lanes, not one. The fast workhorse is still holo3.1 (~56s/pass, ~150–390 facts/chunk, $0). But streaming unlocked a deep/quality-critical lane on the same $0 gateway: nemotron-3-ultra-550b, which wins the bake-off outright — 1,129 facts at a perfect Q100 (100% faithful, 100% predicate-clean) — at a ~17 min/chunk wall-time cost. Use holo3.1 for bulk, the 550B for chunks where depth and faithfulness matter most.
Dead ends, now known honestly: nemotron3-omni returns empty (prompt-incompatible); glm-5.1 and gemma4 stream but are too slow to be operational on Hyades at these budgets (glm-5.1's deep lane stays on z.ai-direct).
Still on the count-vs-faithfulness lesson (bake-off C): raw count is the wrong target. The 550B result is compelling precisely because it pairs depth with 100% anchor-faithfulness — and donto's always-on citer backstops the rest, since per-run inline anchoring varies by model (holo3.1 deferred all anchors this run, the 550B inlined 566).
Open levers not yet run: a temperature sweep on the prompt, and measuring the 550B's true saturation point (it had not stopped climbing at the 6-pass cap).

Gleaning vs thinking. Does thinking-on + a big budget in fewer passes beat many small passes? GLM reached ~520 in ~3 passes with thinking; holo3.1 needs 6 for ~150. Test whether holo3.1 with thinking + 16K + 3 passes beats 2K + 6 passes.
Faithfulness audit. Run the always-on citer over each model's output and report anchored-% per model. This is the quality gate that turns "most facts" into "most trustworthy facts" — the only count that matters for an evidence-first substrate.

Working hypotheses (to be confirmed)

holo3.1 is under-budgeted; at conc-3 with a 16–32K budget it likely lands well above 150 facts/chunk at $0 — the cheapest win.
glm-5.1 is the depth champion but slow and quota-bound; best as a deliberate deep pass, not the throughput lane.
nemotron-3-ultra-550b is the wildcard — possibly the deepest and most faithful, possibly too slow; worth one honest benchmark.
The likely production shape is two lanes: a fast, well-budgeted holo3.1 lane for throughput and a deep glm-5.1/nemotron lane in parallel — which is the architecture we're already running, just with both lanes' budgets properly tuned.

What this is not

This is a review and an experiment plan, not a verdict — the bake-off numbers come next. The one firm conclusion today: the prompt is excellent and the bottleneck has been parameters, not capability. We spent this session proving that on the GLM lane; the same discipline now turns on Hyades.

Grounding: measurements from the 2026-06-11 LoCoMo extraction run (donto-agent lanes on holo3.1 and glm-5.1). Live extraction status at /benchmarks/locomo.