The 48-hour build, read closely — a real skeleton that isn't load-bearing yet
Living document. 2026-06-12. A second read of the Past 48 Hours Change Report: not "what was committed" (that report already says it well) but "what is actually true now" — verified against the migrations, the tests, and the live database — and what it means for the benchmark work we are about to resume. For the full reports index, see Reports.
The change report is honest about what landed. This note asks the next question: of everything that landed, what is real, what is exercised, and what should we make load-bearing first? The short answer: the work is real and tested; it is almost entirely unexercised; and the benchmark — not another feature wave — is what tells us which parts to bring to life.
TL;DR
| claim | verdict | evidence |
|---|---|---|
| The 48-hour work is real, not stubs | true | 22 migrations 0156–0177, each a SQL capability + a Rust invariants test + client wiring; e.g. 0174_cross_domain_analogy_engine.sql is 259 lines of in-DB SQL with a 150-line test |
| The W0→H8 execution plan is complete | true | 72478dc H8 is the tip; execution report published; 260 Rust files, 157 migrations, 148 test files |
| donto is now a "governed memory substrate" | true as capability, not yet as practice | the governance machinery exists and passes its invariant tests, but the governance tables are nearly empty |
| The new capabilities are in use | mostly false | feature tables hold 1–16 demo rows against a ~41.5M-statement store |
The 48 hours converted donto's implicit conventions into durable, tested, conformance-gated SQL. That is the skeleton of a governed substrate. A skeleton that passes its own tests is not the same as a substrate under load.
1. The work is real — this is not benchmark theatre
It would be easy to dismiss "H5: cross-domain analogy engine" or "H4: hypothesis tournaments" as headline features. They aren't headlines. Each horizon commit follows the same disciplined shape the change report describes — and spot-checking confirms it:
- A migration that is the feature. 0174_cross_domain_analogy_engine.sql implements the analogy engine as 259 lines of in-database SQL (tables + functions), not application glue. 0173_hypothesis_tournaments.sql creates three real tables (donto_hypothesis_tournament, _entry, _signal).
- A test that pins the invariant. Each ships an invariants_*.rs (e.g. 150 lines for analogy) that the build runs.
- Client/route wiring so the capability is reachable.
Many of the "tables?" features are deliberately SQL functions/views, not tables — standing_v2, predicate_closure_v2, grounded_contradiction_policy, the generic_rule_evaluator, the calibration_frames. That is the right shape for logic: it's exercised when called, not when a row is inserted. So a low row count for those isn't a gap — but it does mean "is it used?" has to be answered by call-sites and downstream data, not by count(*).
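One rough way to answer "is it used?" for function-shaped features is to count references in the source tree rather than rows in the database. The sketch below assumes the names in the paragraph above are the literal identifiers, and that call-sites live in .rs and .sql files under a single root; both are assumptions for illustration, not facts about the donto repository.

```python
import re
from pathlib import Path

# Names copied from the paragraph above. Whether they are the exact SQL
# identifiers is an assumption made for this sketch.
LOGIC_NAMES = [
    "standing_v2",
    "predicate_closure_v2",
    "grounded_contradiction_policy",
    "generic_rule_evaluator",
]


def count_call_sites(root, names, suffixes=(".rs", ".sql")):
    """Count whole-identifier occurrences of each name in files under root."""
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b") for name in names
    }
    counts = {name: 0 for name in names}
    for path in Path(root).rglob("*"):
        # Skip directories and files with other extensions.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for name, pattern in patterns.items():
            counts[name] += len(pattern.findall(text))
    return counts
```

The count includes the defining migration itself, so a function with a count of one is defined but has no textual reference elsewhere. A textual count is only a proxy: it shows the capability is wired in, not that the call path runs against production data.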
The verdict on realness is unambiguous: 42 commits of migration-backed, tested substrate logic. The execution plan is genuinely complete.
2. …but almost none of it is exercised
Here is the part the changelog doesn't foreground. Against a live store of ~41.5M statements, the new governance and horizon tables hold:
| table | rows | what it would hold under real use |
|---|---|---|
| donto_rule_agenda | 12,684 | scheduled rule work (genuinely running) |
| donto_argument | 2,509 | contradiction/support edges between claims |
| donto_identity_edge | 169 | resolved/blocked entity identities |
| donto_policy_action_key | 16 | policy ordering keys |
| donto_canonical_shadow_scope | 5 | scoped canonical rebuild registrations |
| donto_agent_reputation | 4 | per-agent trust context |
| donto_export_tier | 4 | tiered export closures |
| donto_standard_import_lift | 3 | standards interop lifts |
| donto_hypothesis_tournament | 2 | competing-hypothesis brackets |
| donto_cross_domain_analogy | 1 | cross-domain structural matches |
| donto_lens_parent | 0 | lens-algebra composition edges |
| donto_lens / donto_loss_report | 35 / 13 | active lenses / recorded translation losses |
Two things stand out. First, the only capability under real load is the rule agenda (12.7k rows) — the W2 "loop heartbeat" actually beats. Second, everything in the judgment and horizon layers is at demo scale — one analogy, two tournaments, four reputations. These aren't failures; they're seeds. But they make the honest status precise:
donto has the machinery to hold standing, fold contradictions, run tournaments, track reputation, and export under tiered closure. It does not yet do those things at the scale of its own claim store. The contradiction layer that is supposed to be the differentiator covers 2,509 of ~41.5M statements — about 0.006%.
This is the same gap the original PRD snapshot flagged ("the contradiction/identity machinery — the differentiator — is barely exercised") — and 48 hours of excellent kernel work did not close it, because closing it isn't kernel work. It's consumer pressure.
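The coverage figure is plain arithmetic over the row counts in the table, and it is worth being able to re-run as the numbers move. A minimal sketch, using the counts quoted in this note and the approximate 41.5M store size; it treats "rows per statement" as the same rough ratio the note uses, not as a precise measure of how many statements an edge touches.

```python
# Row counts copied from the table above; the store size is the
# approximate figure quoted in this note (~41.5M statements).
STORE_STATEMENTS = 41_500_000

ROW_COUNTS = {
    "donto_rule_agenda": 12_684,
    "donto_argument": 2_509,
    "donto_identity_edge": 169,
    "donto_policy_action_key": 16,
    "donto_canonical_shadow_scope": 5,
    "donto_agent_reputation": 4,
    "donto_export_tier": 4,
    "donto_standard_import_lift": 3,
    "donto_hypothesis_tournament": 2,
    "donto_cross_domain_analogy": 1,
    "donto_lens_parent": 0,
    "donto_lens": 35,
    "donto_loss_report": 13,
}


def coverage_pct(rows, total=STORE_STATEMENTS):
    """Rows as a percentage of the statement store."""
    return 100.0 * rows / total


# Largest tables first, with their share of the store.
for table, rows in sorted(ROW_COUNTS.items(), key=lambda kv: -kv[1]):
    print(f"{table:30s} {rows:>7,d}  {coverage_pct(rows):.4f}%")
```

For donto_argument this gives roughly 0.006%, matching the figure above; every table below it rounds to zero at that precision.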
3. Why this is the right moment to return to the benchmark
The benchmark is not a side quest from the substrate work — it is the instrument that applies consumer pressure to exactly these capabilities. And the LoCoMo results from this same 48-hour window already tell us which capabilities to make load-bearing first. Read them together:
- Claims-only recall: 0.244 vs episodic 0.837. Holding rich claims doesn't help if the reader can't get the answer out of them.
- Bitemporal valid_time: temporal 0.50 → 0.725 (+45%). The one clean, donto-specific win.
- Passage recall caps at 0.650; every retrieval-composition trick was neutral-to-negative.
- (new, this session) Claim-as-recall-arm: a +18–21pt session-recall gain that did NOT convert to accuracy. Multi-hop end-to-end stayed flat (5/11 → 5/11) even though the missing evidence sessions were recovered.
The through-line across all four is sharp: on a weak/cheap reader, better retrieval does not convert; the only thing that converts is a pre-computed, answer-shaped fact the reader could not derive itself (the resolved date worked because it handed the reader the answer).
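Because these results mix relative gains ("+45%") with point gains ("+18–21pt"), it helps to compute both forms from the same before/after pair. The sketch below uses only the scores quoted in the list above; the function name and labels are illustrative.

```python
def deltas(before, after):
    """Return (absolute change in points, relative change in percent)."""
    absolute_pts = (after - before) * 100.0
    relative_pct = (after - before) / before * 100.0
    return absolute_pts, relative_pct


# Before/after scores quoted in the list above.
RESULTS = {
    "temporal, with bitemporal valid_time": (0.50, 0.725),
    "multi-hop end-to-end, with claim recall arm": (5 / 11, 5 / 11),
}

for name, (before, after) in RESULTS.items():
    abs_pts, rel_pct = deltas(before, after)
    print(f"{name}: {abs_pts:+.1f} pts absolute, {rel_pct:+.0f}% relative")
```

The temporal result is +22.5 points absolute, which is the +45% relative gain quoted above; the multi-hop result is zero on both measures.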
Now map that onto the 48-hour capability list. The features that produce answer-shaped facts are precisely the ones worth exercising first:
| converts (answer-shaping) → exercise first | doesn't convert (retrieval) → deprioritise |
|---|---|
| Bitemporal valid_time (proven +45%) | hybrid / multi-scope retrieval (measured negative) |
| Standing: which claim to trust, returned as the answer | cross-encoder rerank on passages (cheap-regime ceiling) |
| Predicate closure + aligned match: pre-joins a multi-hop chain into one retrievable claim | claim-embedding as a second recall index (recovers sessions, doesn't convert) |
| Grounded contradiction policy: returns the resolved position, not both raw sides | |
| Aggregate / synthesized claims (the untested next step): move multi-hop reasoning off the reader into the substrate | |
So the benchmark isn't just a scoreboard; it's the selection function over the 48-hour build. It says: stop investing in retrieval, start exercising the answer-shaping capabilities — standing, closure, contradiction policy, and especially the not-yet-built aggregate claim that pre-computes the multi-hop join. That is the bridge from "the skeleton passes its tests" to "the skeleton carries weight."
4. What to do next (and the order)
- Make the claim layer load-bearing in real /recall, as a facet not a passage. The benchmark proved claims lose as a returned passage and as a recall index. They win only as tags on the returned session: the resolved date today; standing and a held-contradiction verdict next. Wire those into the bundle.
- Build and benchmark the aggregate/synthesized claim. This is the one lever the benchmark predicts will move multi-hop, because it pre-computes the cross-session join the weak reader can't do. It is also the natural consumer of predicate closure + aligned match, i.e. it exercises three 48-hour capabilities at once.
- Let the benchmark drive contradiction coverage up from 0.006%. Knowledge-update and temporal questions are where held contradictions + bitemporal belief should shine; run those categories specifically and write the deltas back as donto_argument edges.
- Resume the LoCoMo run on the working free reader (glm-5.1 on Hyades; holo3.1 is currently returning empty streams, gateway-side). Report deltas, not absolutes, until a stable reader is pinned.
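To make "pre-computes the cross-session join" concrete, here is a minimal sketch of an aggregate claim built from two single-hop claims. Everything in it is hypothetical: the Claim shape, the predicate names, and the synthesize function are illustrations of the idea, not donto's schema or its predicate-closure implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim shape, for illustration only."""
    subject: str
    predicate: str
    obj: str
    sessions: tuple  # ids of the sessions that evidence this claim


def synthesize(claims, first_pred, second_pred, joined_pred):
    """Pre-join subject -first_pred-> x -second_pred-> obj into one claim."""
    # Index the second-hop claims by their subject for the join.
    second_hop = {}
    for claim in claims:
        if claim.predicate == second_pred:
            second_hop.setdefault(claim.subject, []).append(claim)

    joined = []
    for first in claims:
        if first.predicate != first_pred:
            continue
        for second in second_hop.get(first.obj, []):
            # The aggregate claim carries the evidence sessions of both hops.
            evidence = tuple(sorted(set(first.sessions + second.sessions)))
            joined.append(Claim(first.subject, joined_pred, second.obj, evidence))
    return joined


claims = [
    Claim("alice", "adopted", "rex", ("s3",)),
    Claim("rex", "breed", "beagle", ("s9",)),
]
aggregates = synthesize(claims, "adopted", "breed", "adopted_pet_breed")
```

Here the two hops live in different sessions (s3 and s9), and the aggregate hands the reader a single answer-shaped claim with both sessions attached as evidence. Whether this converts to multi-hop accuracy is exactly what the benchmark run would need to show.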
Bottom line
The 48-hour build is exactly what the change report says it is: a complete, tested, migration-backed conversion of donto from a claim store into a governed-memory substrate. Verified — the code is real and the plan is done.
The honest addendum is that the substrate is a skeleton that passes its own tests but has not yet been put under load. Most of its governance and horizon tables hold a handful of demo rows (1 to 16) against a ~41.5-million-statement store. The differentiating machinery (contradiction, identity, standing, lenses, tournaments) exists and is idle.
The benchmark is how we end the idleness. It has already told us which capabilities convert (answer-shaping facts) and which don't (retrieval), and it points at the one piece still unbuilt — the aggregate claim — as the next thing to make real. That is where we go next.
See also: Past 48 Hours Change Report · The shape of donto's return · Future Substrate Report · LoCoMo Config C.