# Are we using Hyades effectively? — an extraction-engine review

_2026-06-11 · a working review of how donto-agent drives the Hyades gateway for claim extraction, what we just learned tuning the GLM lane, and the experiments that follow._

## Why this review

Tuning the GLM extraction lane today produced a finding sharp enough to make us question the lane we trust most. GLM-5.1 went from **~40 facts/chunk to ~520** — a **13× jump** — and the cause was almost entirely one parameter: the per-call output-token budget, which we had clamped to 3,000 and raised to 64,000. The model was never the limit; the budget was.

That immediately implicates **Hyades/holo3.1**, our reliable workhorse, which runs at **2,000 output tokens**. If a 2–3K budget strangled a reasoning model, is holo3.1 also leaving facts on the table? This note documents exactly how we extract today, what we've measured, and the experiments to settle it — including a model bake-off, since Hyades exposes several models we've never benchmarked.

## How extraction works today

The engine is **donto-agent** (Rust, in the donto workspace), driving a provider gateway per chunk. Three parts matter:

**1. The prompt (`extract_broad.txt`) — already strong.** It is abundance-native and domain-neutral: "a substantial source should yield SEVERAL HUNDRED claims." It sweeps **~25 ontological lenses** against every entity and clause — taxonomy, mereology, identity, topology, chronology, causation, teleology, thematic-role/agency, epistemology, deontology, axiology, modality, qualia, lexical semantics, social ontology, process structure, constitution, grounding, provenance, comparison, quantity, disposition, speech-acts, phenomenology — and explicitly tells the model to invent its own lenses and predicates beyond the list. It mandates evidence anchoring (an exact source substring per directly-stated claim, `h:true` for inferred), canonical forms for the handful of standard predicates (`rdf:type`, `rdfs:label`, `sameAs`), and a compact JSONL fact shape with short keys so the model can emit *more* facts per token. **The prompt is not the bottleneck.**

**2. The gleaning loop.** The model appends batches of ~30–60 facts to `facts.jsonl` incrementally and **self-decides when the source is exhausted**, re-scanning until a fresh pass turns up nothing new. `donto-agent` caps this at `--max-passes` agentic turns. So coverage is bounded by two things: how much the model emits *per turn* (the token budget) and how many turns it gets.

**3. The provider lanes.** `--provider hyades` (holo3.1, the default) or `--provider glm` (z.ai subscription). Both run the same prompt and gleaning loop; an always-on citer anchors the output downstream. The lanes differ only in model, gateway, and the per-call parameters we set.
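The gleaning loop in part 2 can be sketched as follows. This is a minimal illustration, not the `donto-agent` implementation (which is Rust and not shown here): `ask_model` is a hypothetical callable standing in for one agentic turn, facts are simplified to triples, and "exhausted" is approximated by a pass that adds no unseen fact, whereas the real loop lets the model decide for itself.

```python
from typing import Callable, Iterable

Fact = tuple[str, str, str]  # (subject, predicate, object), a simplification


def glean(ask_model: Callable[[int], Iterable[Fact]],
          max_passes: int) -> tuple[list[Fact], list[int]]:
    """Run up to max_passes turns; stop early when a pass adds nothing new."""
    seen: set[Fact] = set()
    facts: list[Fact] = []
    new_per_pass: list[int] = []
    for turn in range(max_passes):
        fresh = 0
        for fact in ask_model(turn):
            if fact not in seen:      # re-found facts do not count as progress
                seen.add(fact)
                facts.append(fact)
                fresh += 1
        new_per_pass.append(fresh)
        if fresh == 0:                # a dry pass: treat the source as exhausted
            break
    return facts, new_per_pass
```

The returned `new_per_pass` list is the quantity the rest of this review cares about: it shows whether later passes reach deeper or only re-find earlier facts.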

## What we've measured

| lane | model | output budget | concurrency | thinking | facts/chunk | speed | cost |
|---|---|---|---|---|---|---|---|
| Hyades | holo3.1 | **2,000** | 6 | on | **~150** | fast (~1/min/lane) | $0 (self-hosted) |
| GLM (before) | glm-4.7 | 3,000 | 6 | off | ~40 | fast | flat-rate quota |
| GLM (after) | glm-5.1 | **64,000** | 3 | on | **~520** | slow (mins/chunk) | flat-rate quota |

Two interactions surfaced while tuning, both relevant to Hyades:

- **Token budget gates depth.** A reasoning model spends part of its budget *thinking*; clamp the budget and there is little left to emit facts. This is proven for GLM. holo3.1 is also a reasoning model (thinking on) running at 2K — the same mechanism likely caps it.
- **Budget × concurrency × gateway capacity.** Raising holo3.1 to 8K at concurrency 6 **saturated the Hyades gateway into empty responses** (0-fact chunks) — a 200-status-but-empty-content failure, not an error. So a bigger budget on Hyades must come with **lower concurrency**, exactly as the GLM lane needed conc-3 to dodge its rate limit. This budget↔concurrency trade is the unexplored frontier for Hyades.
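Because the saturation failure arrives with a 200 status, it has to be detected from the body rather than the status code. A minimal sketch of that check and of an empty-rate over a batch, assuming a hypothetical `ChunkResult` record (the field names are illustrative, not taken from `donto-agent`):

```python
from dataclasses import dataclass


@dataclass
class ChunkResult:
    status: int    # HTTP status returned by the gateway
    content: str   # raw completion text for the chunk


def is_silent_empty(result: ChunkResult) -> bool:
    """Success status with blank content: invisible to status-only checks."""
    return result.status == 200 and not result.content.strip()


def empty_rate(results: list[ChunkResult]) -> float:
    """Fraction of chunks that came back silently empty (0.0 for no results)."""
    if not results:
        return 0.0
    return sum(is_silent_empty(r) for r in results) / len(results)
```

Tracking this rate per budget and concurrency setting gives an early signal of gateway saturation before fact counts are even parsed.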

## The open question

**Is holo3.1 at 2,000 tokens under-budgeted the way GLM was?** The prompt asks for hundreds of facts; holo3.1 emits ~150 across six gleaning passes. If each pass is truncated at 2K, holo3.1 may be re-finding the same facts rather than reaching deeper — and a larger per-turn budget (at conc-3) could unlock the depth GLM now shows, on a **$0 self-hosted lane**. That would be the highest-value outcome here.

But raw count is the wrong target on its own. The right metric is **faithful facts/chunk = facts/chunk × anchoring-rate**: a model that emits 600 claims but anchors only 25% of them to real source spans (150 faithful) is worse than one emitting 200 at 90% (180 faithful). Every experiment below measures anchoring, not just volume.
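As a rough sketch of how this metric could be computed from a `facts.jsonl` file: the code below counts a fact as anchored when its `"a"` value is an exact substring of the source. Taking the rate over all emitted facts is an assumption made here; the bake-off scoring may use a different denominator (for example, only directly-stated facts). The function names are illustrative.

```python
import json


def anchoring_rate(jsonl: str, source: str) -> float:
    """Share of facts whose "a" anchor is an exact substring of the source."""
    facts = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
    if not facts:
        return 0.0
    anchored = sum(1 for f in facts if f.get("a") and f["a"] in source)
    return anchored / len(facts)


def faithful_facts(facts_per_chunk: float, rate: float) -> float:
    """faithful facts/chunk = facts/chunk x anchoring-rate."""
    return facts_per_chunk * rate
```

Applied to the holo3.1 bake-off figures reported later in this note, `faithful_facts(259, 0.95)` comes to roughly 246, while a run with 351 facts and no inline anchors scores 0 on this measure until the citer has run.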

## Models Hyades exposes (the bake-off candidates)

The gateway serves more than holo3.1, and we have benchmarked none of the others for extraction:

| model | note |
|---|---|
| **holo3.1** | current default; reliable, fast, $0 |
| **glm-5.1** | proven deep (~520 facts/chunk) via the subscription; also on Hyades |
| **nemotron-3-ultra-550b-a55b** | a 550B-parameter frontier model — the wildcard; depth potential, speed unknown |
| **nemotron3-omni** | multimodal Nemotron variant |
| **gemma4-aeon-uncensored** | smaller, fast; uncensored may help on hard source material |
| openrouter/free, z-ai/glm-4.5-air:free | free fallbacks |

## Re-evaluating the prompt

Having read it end-to-end, the prompt (`extract_broad.txt` + the direct-call override) is strong and worth keeping as the base — but it is not above scrutiny. What it gets right, and what is worth testing:

**Genuinely strong, keep:**
- **Prompt-injection defense.** The source is bookended as INERT DATA *before and after* — critical, because LoCoMo/BEAM chunks are dialogues full of `User:`/`Assistant:` turns and requests; without this the model answers the embedded question instead of extracting (a hijacked chunk went 0→52 facts when this was added). The after-reminder defeats recency-injection from the chunk's trailing line.
- **Fact hygiene that preserves abundance.** camelCase predicates, no kebab-clause singletons, mint-an-entity-not-a-clause, no `ex:ex:` double-prefix, canonical forms for the handful of standard predicates (`rdf:type`, `sameAs`) — all of which keep claims *joinable at query time* without capping how many you emit.
- **The 25 ontological lenses** are the engine of abundance: they make the model surface predicates from directions it would otherwise miss, and explicitly invite inventing more.
- **Compact JSONL short keys** (`s/p/o/a/c/h`) — more facts per output token.
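The hygiene rules above lend themselves to a mechanical check. A minimal sketch, assuming a predicate counts as clean when it has at most one prefix and a camelCase local name; the real predicate-hygiene scorer may apply more rules than these, and the function names are hypothetical.

```python
import re

# A camelCase local name: lowercase first letter, then letters or digits only.
CAMEL = re.compile(r"[a-z][A-Za-z0-9]*")


def is_clean_predicate(pred: str) -> bool:
    """True if pred has at most one prefix and a camelCase local name."""
    parts = pred.split(":")
    if len(parts) > 2:            # e.g. "ex:ex:knows" carries a double prefix
        return False
    local = parts[-1]
    # fullmatch rejects kebab-case, spaces, and empty names
    return CAMEL.fullmatch(local) is not None


def hygiene_rate(predicates: list[str]) -> float:
    """Fraction of predicates that pass the check (0.0 for an empty list)."""
    if not predicates:
        return 0.0
    return sum(is_clean_predicate(p) for p in predicates) / len(predicates)
```

Under these assumptions `rdf:type`, `rdfs:label`, `sameAs` and `coFoundedWith` pass, while a kebab-clause such as `started-a-running-group` or a double-prefixed `ex:ex:knows` fails.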

**Worth testing (the prompt is a variable, not a constant):**
1. **Prompt weight.** SUPERB (~90 lines) + the override (~45 lines) is a ~3–4K-token preamble sent on every broad pass — often larger than the source chunk. The thread persists it so nudges don't re-pay, but: does trimming the lens list or the philosophy hurt recall, or is some of it inert ballast? Ablate sections and measure facts/chunk.
2. **Temperature is hard-coded at 0.1.** Low temp buys faithful, deterministic extraction — but the philosophy *wants* creative, multi-directional predicate-minting, and 0.1 may suppress exactly that diversity. Sweep 0.1 → 0.5 with the anchor-faithfulness guard watching; the abundance/faithfulness trade is measurable, not assumed.
3. **Model-specific framing.** One prompt drives a reasoning model (holo3.1, glm-5.1) and a non-reasoning one (gemma4) identically. Reasoning models may want the budget to *think*; others may want a tighter, example-led prompt. The bake-off will show whether one prompt fits all.
4. **Single-pass vs gleaning.** With a real token budget a strong model may emit everything in one pass, making the nudge loop redundant; a weaker one needs every nudge. The right pass-count is model-specific — the bake-off measures new-facts-per-pass to find each model's saturation point.

## Experiments to run

1. **Hyades token-budget sweep.** holo3.1, concurrency **3**, sweep output budget 2K → 8K → 16K → 32K on a fixed 10-chunk sample. Measure facts/chunk, **empty-rate** (the saturation canary), anchoring-rate, and wall-clock/chunk. Hypothesis: facts/chunk rises with budget until the gateway saturates; find the knee.

2. **Model bake-off (complete; results below).** Same prompt, same fixed chunk, every Hyades model, measuring **per-pass tokens, per-pass time, new facts per pass**, and a computed **quality score** (anchor-faithfulness 50 / predicate-hygiene 30 / parse-validity 20). It was scheduled for a clean gateway, right after the live LoCoMo extraction drained, so timings aren't skewed by contention.

<!-- BAKEOFF:START -->
### Bake-off results

Same prompt (`extract_broad.txt` + direct-call override), same single **5,991-char LoCoMo chunk**, 32K output budget, temp 0.1, broad + up to 3 NUDGE passes on one thread. Quality = anchor-faithfulness (50) + predicate-hygiene (30) + parse-validity (20).
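The quality weighting can be read as a simple weighted sum. The sketch below assumes a linear combination of the three rates, which is consistent with the scores reported in the tables but is not confirmed as the exact scorer; parse-validity is not reported per run, so 100% is assumed in the examples.

```python
def quality_score(faithful: float, pred_clean: float, parse_valid: float) -> float:
    """Weighted sum of three rates in [0, 1], on a 0-100 scale."""
    return 50 * faithful + 30 * pred_clean + 20 * parse_valid


# holo3.1 rates from the results tables, with parse-validity assumed to be 1.0:
score_a = quality_score(0.95, 0.58, 1.0)   # 95% faithful, 58% pred-clean
score_c = quality_score(0.0, 1.0, 1.0)     # 0% faithful, 100% pred-clean
```

Under that assumption the first example rounds to the reported Q85 and the second gives the reported Q50, which suggests the reported scores follow this weighting.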

#### Bake-off A — all 5 models, sequential

| model | facts | out-tok | secs | quality | faithful | pred-clean | status |
|---|---|---|---|---|---|---|---|
| **holo3.1** | **259** | 12,848 | 183 | **85** | 95% | 58% | ✅ completed — saturated at pass 4 |
| glm-5.1 (via Hyades) | — | — | 125 | — | — | — | ⛔ 524 timeout (broad pass >125s) |
| nemotron-3-ultra-550b | — | — | 125 | — | — | — | ⛔ 524 timeout |
| gemma4-aeon-uncensored | — | — | 125 | — | — | — | ⛔ 524 timeout |
| nemotron3-omni | 0 | 0 | 0.4 | 0 | — | — | ⛔ returned empty |

**holo3.1 per-pass:** broad 140 facts / 6,235 tok / 87s · nudge 46 / 2,770 / 40s · nudge 73 / 3,336 / 48s · nudge **0** / 507 / 8s (dry → saturated at 259).

**Two findings that matter:**
1. **holo3.1 is confirmed under-budgeted.** At 32K it reaches **259 facts and *saturates*** (the 4th pass adds nothing) — vs ~150 at the live 2K budget. **~1.7× more depth on the $0 lane, at 95% anchor-faithfulness.** The cost is time (183s/chunk across 4 passes); the budget↔throughput knee is real and tunable. Predicate-hygiene at 58% is the watch-item — holo3.1 mints some kebab/clause predicates the alignment engine then has to fold.
2. **The Hyades gateway 524-caps slow models.** glm-5.1, nemotron-3-ultra-550b and gemma4 all hit Cloudflare's ~125s request timeout on the broad pass (big prompt + 32K budget + reasoning generates too long to return in one shot). This is *exactly why the live GLM extraction runs on **z.ai-direct**, not Hyades* — the direct endpoint has no such cap (glm-5.1 there produces ~520 facts/chunk). To benchmark the heavy models fairly we need their direct endpoints or a longer gateway timeout; on Hyades-as-is, **only holo3.1 is fast enough to complete extraction.** nemotron3-omni returns empty — it doesn't take the prompt.

#### Bake-off B — isolated (1 model at a time, 60s cooldowns)

| model | facts | secs | status |
|---|---|---|---|
| holo3.1 | 0 | 125 | ⛔ 524 timeout (broad pass exceeded 125s in this run) |
| glm-5.1 (via Hyades) | 0 | 125 | ⛔ 524 timeout |
| nemotron-3-ultra-550b | 0 | 126 | ⛔ 524 timeout |
| gemma4-aeon-uncensored | 0 | 125 | ⛔ 524 timeout |
| nemotron3-omni | 0 | 0.4 | ⛔ returned empty |

**A vs B — the sharper conclusion.** In bake-off A, holo3.1 *completed* (broad pass 87s, 259 facts). In B, with a clean isolated gateway, it **524-timed-out at 125s.** Isolation didn't help — the 524s are per-request timeouts, not contention. The real story: at a 32K budget holo3.1's broad pass sits **right on Cloudflare's ~125s edge** and clears it only intermittently. So the deep-but-risky 32K result isn't a reliable operational setting on the Hyades gateway; the **reliable budget is the fast 2K the live lane uses** (well under the timeout, ~150 facts, never 524s). The heavy models never complete on Hyades at all — they need their direct endpoints (glm-5.1 on z.ai gives ~520 facts/chunk with no such cap).

#### Bake-off C — 16K budget, up to 20 nudges

| model | facts | out-tok | secs | passes | quality | faithful | pred-clean | status |
|---|---|---|---|---|---|---|---|---|
| holo3.1 | **351** | 16,005 | 231 | 5 | 50 | **0%** | 100% | ✅ completed |
| glm-5.1 (via Hyades) | — | — | 125 | 1 | — | — | — | ⛔ 524 timeout |
| nemotron-3-ultra-550b | — | — | 125 | 1 | — | — | — | ⛔ 524 timeout |
| gemma4-aeon-uncensored | — | — | 128 | 1 | — | — | — | ⛔ 524 timeout |
| nemotron3-omni | 0 | — | 0.6 | 3 | — | — | — | ⛔ returned empty |

**holo3.1 per-pass (facts / tokens):** broad 146 / 4,673 · nudge 115 / 6,203 · nudge 90 / 4,087 · nudge 0 / 670 · nudge 0 / 372 (saturated at pass 4; pass 5 confirmed it).

**Two findings, both load-bearing:**
1. **16K did NOT rescue the heavy models.** glm-5.1, nemotron-3-ultra-550b and gemma4 still 524-timeout at 16K. A smaller output budget makes each pass faster *for a fast model* (holo3.1 completes), but the heavy models' broad pass still exceeds the ~125s gateway cap: their bottleneck is generation **time** (slow reasoning), not output budget. So this settles it: **on the Hyades gateway, only holo3.1 can run extraction; the deep models must use their direct endpoints** (the live glm-5.1 lane runs on z.ai, where it does ~520 facts/chunk with no such cap).
2. **The 16K run deepened the count but collapsed inline anchoring: budget shapes quality, not just quantity.** holo3.1 at 16K/5-pass reached **351 facts** (vs 259 at 32K/4-pass). The larger nudge allowance is not what found more: both runs had three productive passes before going dry, and the 16K run's passes simply yielded more each (146/115/90 vs 140/46/73). But **0 of the 351 carried a source anchor** (vs **all 259 anchored, 95% faithful** at 32K). Under the tighter 16K budget the model appears to economize by dropping the verbatim `"a"` spans (which cost tokens) to pack in more facts. The result: the **generous-budget run is the higher-*quality* one (Q85, 259 anchored)** and the 16K run is *lower* quality despite +35% more facts (Q50, 351 unanchored). This is the report's thesis in hard numbers: **raw count is the wrong target; faithful facts/chunk is.** (Caveat: N=1 per config, and the two holo3.1 runs also flipped on predicate-hygiene 58%→100%, so some of this is run-to-run variance. But a budget that systematically trades anchors for count is a real faithfulness risk and warrants a controlled replication before we trust a 16K deep-nudge config in production.)

#### Bake-off D — streaming mode (the 524 fix)

The bake-off A–C ceiling — "only holo3.1 completes on Hyades; everything slower 524-times-out" — was **never a model limitation. It was a transport artifact.** Cloudflare drops any request that goes ~125s without sending bytes; a non-streaming completion sends *nothing* until the whole response is ready, so a slow model trips the timeout before its first token. **Streaming (SSE) sends tokens as they generate**, so the connection is never silent and Cloudflare never cuts it. Re-ran with `stream:true` + `stream_options:{include_usage}`, parsing `delta.content`, with a 15s heartbeat to prove liveness.
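The parsing side of that change can be sketched as below. It assumes the gateway emits OpenAI-style server-sent events (`data: {...}` lines ending in `data: [DONE]`, with the usage block arriving in a final chunk), which matches the field names above but has not been verified against Hyades itself. `collect_stream` is an illustrative name, and the network call and heartbeat timer are left out.

```python
import json
from typing import Iterable, Optional


def collect_stream(lines: Iterable[str]) -> tuple[str, Optional[dict]]:
    """Join delta.content pieces from SSE lines; return (text, usage)."""
    parts: list[str] = []
    usage: Optional[dict] = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                       # skip blank separators and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break                          # end-of-stream sentinel
        event = json.loads(payload)
        if event.get("usage"):
            usage = event["usage"]         # token counts, when usage is requested
        for choice in event.get("choices") or []:
            piece = (choice.get("delta") or {}).get("content")
            if piece:
                parts.append(piece)
    return "".join(parts), usage
```

Because text accumulates piece by piece, a caller that catches a dropped connection can still keep whatever arrived before the drop, which is the behaviour the partial-content-on-drop handling relies on.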

| model | facts | out-tok | secs | passes | finish | 524? | status |
|---|---|---|---|---|---|---|---|
| **nvidia/nemotron-3-ultra-550b** | **1,129** | 55,613 | 1,031 | 5 | stop | ✅ no | 🏆 **completed — best in the fleet: Q100, 100% faithful (566 anchored), 100% pred-clean** |
| **holo3.1** | 387 | 14,456 | 206 | 5 | stop | ✅ no | ✅ completed — clean stream, saturated pass 4 |
| glm-5.1 (via Hyades) | (partial) | — | >320 | 1 | — | ✅ no | 🐢 streamed live past 320s, no timeout — but one 32K pass doesn't finish in that window (genuinely slow, no longer *blocked*) |
| gemma4-aeon-uncensored | — | — | ~480 | 1 | — | ✅ no | 🐌 ~8 min of pure reasoning before its first content token — connection stayed alive (no 524) but impractically slow for extraction |
| nemotron3-omni | 0 | 0 | 1 | 3 | stop | ✅ no | ⛔ returns empty instantly (0 tok, 0.2s) — doesn't accept the prompt, streaming or not |

**nemotron-3-ultra-550b per-pass (streamed):** broad 299 / 14,343 tok / 255s · nudge 334 / 16,384 / 247s · nudge 203 / 9,549 / 115s · nudge 181 / 8,965 / 145s · nudge 112 / 6,372 / 144s (still climbing at the pass-6 cap — not yet saturated).
**holo3.1 per-pass (streamed):** broad 129 / 3,998 tok / 56s · nudge 120 / 4,021 / 57s · nudge 138 / 4,835 / 69s · nudge 0 / 1,135 / 17s · nudge 0 / 467 / 7s (saturated pass 4).
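The saturation point in these per-pass lines can be read off mechanically. A small sketch, with `saturation_pass` as an illustrative helper name, applied to the fact counts listed above:

```python
from typing import Optional


def saturation_pass(new_facts_per_pass: list[int]) -> Optional[int]:
    """1-based index of the first pass that added no new facts, else None."""
    for index, new_facts in enumerate(new_facts_per_pass, start=1):
        if new_facts == 0:
            return index
    return None


# New facts per pass, copied from the streamed per-pass lines above.
holo_streamed = [129, 120, 138, 0, 0]
nemotron_streamed = [299, 334, 203, 181, 112]

holo_saturation = saturation_pass(holo_streamed)          # first dry pass
nemotron_saturation = saturation_pass(nemotron_streamed)  # no dry pass yet
```

For holo3.1 the first dry pass is pass 4; for the 550B no pass came back dry, so the helper returns `None`, matching the note that it had not yet saturated. The per-pass counts also sum to the table totals (387 and 1,129).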

**Four findings:**
1. **Streaming eliminates the 524: the gateway is no longer the wall.** Every model that died at ~125s in A–C now runs *past* it under streaming. The earlier "heavy models can't use Hyades" conclusion is **superseded**: they *can* connect and stream; the only remaining limit is each model's own generation **speed**. holo3.1 completed cleanly, as it did in A (387 facts this run vs 259 there, clean JSON, natural `finish=stop` each pass), with the heartbeat confirming continuous token flow.
2. **Streaming unlocked the single best extractor in the fleet — the 550B.** `nvidia/nemotron-3-ultra-550b-a55b` 524-timed-out in *every* prior bake-off and was written off as "needs a direct endpoint." Under streaming it not only completes — it **wins outright: 1,129 facts at a perfect Q100** (100% of its 566 anchored facts are exact source substrings, 100% predicate-clean), **~3× holo3.1's depth at strictly higher quality.** Its passes run 115–255s each — every one of which would have 524'd without streaming. The price is wall-time (~17 min/chunk across 5 passes, and it hadn't even saturated at the 6-pass cap), so it's a **deep/quality-critical lane, not a bulk lane** — but it is now a *usable* lane on the $0 gateway, which it never was before.
3. **Speed, not the gateway, is now the real filter.** glm-5.1 streams steadily but needs more than 320s for a single 32K pass; gemma4 spends ~8 minutes thinking before emitting anything; nemotron3-omni returns empty instantly (a prompt-incompatibility, not a speed issue). None 524 anymore — the gateway is out of the way — so each model is judged on its own merits: the 550B is the deep winner, holo3.1 the fast workhorse, the rest unusable for extraction at these budgets.
4. **holo3.1's anchoring is run-to-run non-deterministic: a metric caveat, not a regression.** This streamed run scored **0%** anchor-faithfulness because the model emitted all 387 facts as IRI-relational edges (`ex:deborah started ex:running-group`, `rdf:type`, `coFoundedWith`) with **zero inline `"a"` anchors**, vs bake-off A's 95% inline-anchored on the same prompt and temp 0.1. (The 550B, on the identical prompt, anchored 566, so this is model-specific variance, not a prompt or streaming defect; the streamed JSON is byte-clean, verified.) Operationally it's fine: donto's always-on **citer** is a separate stage that anchors exactly these IRI-relational facts post-hoc, so "0 self-anchored" means "all deferred to the citer," not "unfaithful." But it confirms that **inline-anchor faithfulness can't be relied on per-run**; the citer is load-bearing precisely because extractors anchor inconsistently.

### Bottom line

- **Streaming is the fix — and it changes the architecture.** Bake-offs A–C concluded "only holo3.1 completes on Hyades; the ~125s Cloudflare timeout kills everything slower." Bake-off D proves that was a **transport artifact, not a model limit**: with SSE streaming the connection never goes silent, the 524 disappears, and *every* model that died at ~125s now runs past it. **This fix is now in the production `donto-agent` client** (`stream:true` + `stream_options:{include_usage}`, partial-content-on-drop), so the extraction lanes no longer 524 regardless of generation time.
- **There are now two viable Hyades lanes, not one.** The fast workhorse is still **holo3.1** (~56s/pass, ~150–390 facts/chunk, $0). But streaming unlocked a **deep/quality-critical lane on the same $0 gateway: `nemotron-3-ultra-550b`**, which wins the bake-off outright — **1,129 facts at a perfect Q100** (100% faithful, 100% predicate-clean) — at a ~17 min/chunk wall-time cost. Use holo3.1 for bulk, the 550B for chunks where depth and faithfulness matter most.
- **Dead ends, now known honestly:** `nemotron3-omni` returns empty (prompt-incompatible); glm-5.1 and gemma4 stream but are too slow to be operational on Hyades at these budgets (glm-5.1's deep lane stays on z.ai-direct).
- **The count-vs-faithfulness lesson still stands** (bake-off C): raw count is the wrong target. The 550B result is compelling precisely because it pairs depth *with* 100% anchor-faithfulness, and donto's always-on **citer** backstops the rest, since per-run inline anchoring varies by model (holo3.1 deferred all anchors this run, the 550B inlined 566).
- Open levers not yet run: a **temperature sweep** on the prompt, and measuring the 550B's true saturation point (it had not stopped climbing at the 6-pass cap).
<!-- BAKEOFF:END -->

3. **Gleaning vs thinking.** Does thinking-on + a big budget in *fewer* passes beat many small passes? GLM reached ~520 in ~3 passes with thinking; holo3.1 needs 6 for ~150. Test whether holo3.1 with thinking + 16K + 3 passes beats 2K + 6 passes.

4. **Faithfulness audit.** Run the always-on citer over each model's output and report anchored-% per model. This is the quality gate that turns "most facts" into "most *trustworthy* facts" — the only count that matters for an evidence-first substrate.

## Working hypotheses (to be confirmed)

- holo3.1 is **under-budgeted**; at conc-3 with a 16–32K budget it likely lands well above 150 facts/chunk at $0 — the cheapest win.
- glm-5.1 is the **depth champion** but slow and quota-bound; best as a deliberate deep pass, not the throughput lane.
- nemotron-3-ultra-550b is the **wildcard** — possibly the deepest *and* most faithful, possibly too slow; worth one honest benchmark.
- The likely production shape is **two lanes**: a fast, well-budgeted holo3.1 lane for throughput and a deep glm-5.1/nemotron lane in parallel — which is the architecture we're already running, just with both lanes' budgets properly tuned.

## What this is not

This is a review and an experiment plan, not a verdict: the bake-off results above are N=1 per configuration, and the remaining experiments have yet to run. The one firm conclusion today: **the prompt is excellent and the bottleneck has been parameters, not capability.** We spent this session proving that on the GLM lane; the same discipline now turns on Hyades.

_Grounding: measurements from the 2026-06-11 LoCoMo extraction run (`donto-agent` lanes on holo3.1 and glm-5.1). Live extraction status at [/benchmarks/locomo](/benchmarks/locomo)._
