The 48-hour build, read closely — a real skeleton that isn't load-bearing yet

Living document. 2026-06-12. A second read of the Past 48 Hours Change Report: not "what was committed" (that report already says it well) but "what is actually true now" — verified against the migrations, the tests, and the live database — and what it means for the benchmark work we are about to resume. For the full reports index, see Reports.

The change report is honest about what landed. This note asks the next question: of everything that landed, what is real, what is exercised, and what should we make load-bearing first? The short answer: the work is real and tested; it is almost entirely unexercised; and the benchmark — not another feature wave — is what tells us which parts to bring to life.

TL;DR

claim	verdict	evidence
The 48-hour work is real, not stubs	true	22 migrations `0156–0177`, each a SQL capability + a Rust invariants test + client wiring; e.g. `0174_cross_domain_analogy_engine.sql` is 259 lines of in-DB SQL with a 150-line test
The W0→H8 execution plan is complete	true	`72478dc H8` is the tip; execution report published; 260 Rust files, 157 migrations, 148 test files
donto is now a "governed memory substrate"	true as capability, not yet as practice	the governance machinery exists and passes its invariant tests, but the governance tables are nearly empty
The new capabilities are in use	mostly false	feature tables hold 1–16 demo rows against a ~41.5M-statement store

The 48 hours converted donto's implicit conventions into durable, tested, conformance-gated SQL. That is the skeleton of a governed substrate. A skeleton that passes its own tests is not the same as a substrate under load.

1. The work is real — this is not benchmark theatre

It would be easy to dismiss "H5: cross-domain analogy engine" or "H4: hypothesis tournaments" as headline features. They aren't headlines. Each horizon commit follows the same disciplined shape the change report describes — and spot-checking confirms it:

A migration that is the feature. 0174_cross_domain_analogy_engine.sql implements the analogy engine as 259 lines of in-database SQL (tables + functions), not application glue. 0173_hypothesis_tournaments.sql creates three real tables (donto_hypothesis_tournament, _entry, _signal).
A test that pins the invariant. Each ships an invariants_*.rs (e.g. 150 lines for analogy) that the build runs.
Client/route wiring so the capability is reachable.

Many of the "tables?" features are deliberately SQL functions/views, not tables — standing_v2, predicate_closure_v2, grounded_contradiction_policy, the generic_rule_evaluator, the calibration_frames. That is the right shape for logic: it's exercised when called, not when a row is inserted. So a low row count for those isn't a gap — but it does mean "is it used?" has to be answered by call-sites and downstream data, not by count(*).

The verdict on realness is unambiguous: 42 commits of migration-backed, tested substrate logic. The execution plan is genuinely complete.

2. …but almost none of it is exercised

Here is the part the changelog doesn't foreground. Against a live store of ~41.5M statements, the new governance and horizon tables hold:

table	rows	what it would hold under real use
`donto_rule_agenda`	12,684	scheduled rule work — genuinely running
`donto_argument`	2,509	contradiction/support edges between claims
`donto_identity_edge`	169	resolved/blocked entity identities
`donto_policy_action_key`	16	policy ordering keys
`donto_canonical_shadow_scope`	5	scoped canonical rebuild registrations
`donto_agent_reputation`	4	per-agent trust context
`donto_export_tier`	4	tiered export closures
`donto_standard_import_lift`	3	standards interop lifts
`donto_hypothesis_tournament`	2	competing-hypothesis brackets
`donto_cross_domain_analogy`	1	cross-domain structural matches
`donto_lens_parent`	0	lens-algebra composition edges
`donto_lens` / `donto_loss_report`	35 / 13	active lenses / recorded translation losses

Two things stand out. First, the only capability under real load is the rule agenda (12.7k rows) — the W2 "loop heartbeat" actually beats. Second, everything in the judgment and horizon layers is at demo scale — one analogy, two tournaments, four reputations. These aren't failures; they're seeds. But they make the honest status precise:

donto has the machinery to hold standing, fold contradictions, run tournaments, track reputation, and export under tiered closure. It does not yet do those things at the scale of its own claim store. The contradiction layer that is supposed to be the differentiator covers 2,509 of ~41.5M statements — about 0.006%.
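The coverage figure is plain arithmetic over the row counts in the table above. A minimal sketch, assuming the approximate store size of 41.5M statements quoted in this note and treating each row as touching one statement, as the note does; the `ROWS` dictionary and `coverage_pct` are illustrative names, not part of donto:

```python
# Row counts copied from the table above. STORE_SIZE is the approximate
# statement count quoted in this note (~41.5M), not an exact figure.
STORE_SIZE = 41_500_000

ROWS = {
    "donto_rule_agenda": 12_684,
    "donto_argument": 2_509,
    "donto_identity_edge": 169,
    "donto_hypothesis_tournament": 2,
    "donto_cross_domain_analogy": 1,
}


def coverage_pct(rows: int, store_size: int = STORE_SIZE) -> float:
    """Rows expressed as a percentage of the statement store."""
    return 100.0 * rows / store_size


for table, rows in ROWS.items():
    print(f"{table:32s} {rows:>7,d} {coverage_pct(rows):.5f}%")
```

For `donto_argument` this gives about 0.00605%, which rounds to the 0.006% quoted above. Even the busiest table, the rule agenda, is about 0.03% of the store by the same measure.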

This is the same gap the original PRD snapshot flagged ("the contradiction/identity machinery — the differentiator — is barely exercised") — and 48 hours of excellent kernel work did not close it, because closing it isn't kernel work. It's consumer pressure.

3. Why this is the right moment to return to the benchmark

The benchmark is not a side quest from the substrate work — it is the instrument that applies consumer pressure to exactly these capabilities. And the LoCoMo results from this same 48-hour window already tell us which capabilities to make load-bearing first. Read them together:

Claims-only recall: 0.244 vs episodic 0.837. Holding rich claims doesn't help if the reader can't get the answer out of them.
Bitemporal valid_time: temporal 0.50 → 0.725 (+22.5 points, i.e. +45% relative). The one clean, donto-specific win.
Passage recall caps at 0.650; every retrieval-composition trick was neutral-to-negative.
(new, this session) Claim-as-recall-arm: a +18–21pt session-recall gain that did NOT convert to accuracy — multi-hop end-to-end stayed flat (5/11 → 5/11) even though the missing evidence sessions were recovered.

The through-line across all four is sharp: on a weak/cheap reader, better retrieval does not convert; the only thing that converts is a pre-computed, answer-shaped fact the reader could not derive itself (the resolved date worked because it handed the reader the answer).
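Because the results above mix relative percentages and point gains, it helps to compute both for every comparison. A small sketch using only the figures quoted in the list; the `deltas` helper is a hypothetical name, not part of the benchmark harness:

```python
from typing import Tuple


def deltas(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute change in points, relative change in percent)."""
    absolute_pts = (after - before) * 100
    relative_pct = (after - before) / before * 100
    return absolute_pts, relative_pct


# Bitemporal valid_time on the temporal category: 0.50 -> 0.725.
temporal = deltas(0.50, 0.725)

# Multi-hop end-to-end with the claim-as-recall arm: 5/11 -> 5/11.
multi_hop = deltas(5 / 11, 5 / 11)

print(f"temporal:  {temporal[0]:+.1f} pts, {temporal[1]:+.1f}% relative")
print(f"multi-hop: {multi_hop[0]:+.1f} pts, {multi_hop[1]:+.1f}% relative")
```

The temporal gain is 22.5 points absolute, which is the +45% relative figure quoted above. The multi-hop change is zero on both measures despite the session-recall gain.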

Now map that onto the 48-hour capability list. The features that produce answer-shaped facts are precisely the ones worth exercising first:

converts (answer-shaping) → exercise first	doesn't convert (retrieval) → deprioritise
Bitemporal `valid_time` (proven +45%)	hybrid / multi-scope retrieval (measured negative)
Standing — which claim to trust, returned as the answer	cross-encoder rerank on passages (cheap-regime ceiling)
Predicate closure + aligned match — pre-joins a multi-hop chain into one retrievable claim	claim-embedding as a second recall index (recovers sessions, doesn't convert)
Grounded contradiction policy — returns the resolved position, not both raw sides
Aggregate / synthesized claims (the untested next step) — move multi-hop reasoning off the reader into the substrate

So the benchmark isn't just a scoreboard; it's the selection function over the 48-hour build. It says: stop investing in retrieval, start exercising the answer-shaping capabilities — standing, closure, contradiction policy, and especially the not-yet-built aggregate claim that pre-computes the multi-hop join. That is the bridge from "the skeleton passes its tests" to "the skeleton carries weight."
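To make "pre-computes the multi-hop join" concrete, here is a toy sketch. It is not donto's predicate closure or aligned match, which live as SQL in the migrations and whose signatures this note does not show. The `Claim` type, the example predicates, and `synthesize_two_hop` are all hypothetical, chosen only to illustrate the shape of an aggregate claim:

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    obj: str
    sessions: FrozenSet[str]  # sessions that evidence this claim


def synthesize_two_hop(
    claims: List[Claim], first: str, second: str, derived: str
) -> List[Claim]:
    """Join (a first b) with (b second c) into one claim (a derived c)."""
    # Index the second-hop claims by subject so each join is a dict lookup.
    by_subject: Dict[str, List[Claim]] = {}
    for c in claims:
        if c.predicate == second:
            by_subject.setdefault(c.subject, []).append(c)

    out: List[Claim] = []
    for c in claims:
        if c.predicate != first:
            continue
        for nxt in by_subject.get(c.obj, []):
            # The aggregate keeps provenance from both evidence sessions.
            out.append(Claim(c.subject, derived, nxt.obj, c.sessions | nxt.sessions))
    return out


# Two facts stated in different sessions (made-up data).
claims = [
    Claim("alice", "adopted", "rex", frozenset({"s3"})),
    Claim("rex", "breed", "beagle", frozenset({"s9"})),
]
aggregate = synthesize_two_hop(claims, "adopted", "breed", "adopted_breed")
print(aggregate)
```

The point of the sketch is that the reader receives one claim ("alice adopted a beagle", evidenced by s3 and s9) instead of two passages it would have to join itself.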

4. What to do next (and the order)

Make the claim layer load-bearing in real /recall, as a facet not a passage. The benchmark showed that claims lose both as a returned passage and as a recall index. So far they win only as tags on the returned session: the resolved date today, with standing and a held-contradiction verdict next. Wire those into the bundle.
Build and benchmark the aggregate/synthesized claim. This is the one lever the benchmark predicts will move multi-hop, because it pre-computes the cross-session join the weak reader can't do. It is also the natural consumer of predicate closure + aligned match — i.e. it exercises three 48-hour capabilities at once.
Let the benchmark drive contradiction coverage up from 0.006%. Knowledge-update and temporal questions are where held contradictions + bitemporal belief should shine; run those categories specifically and write the deltas back as donto_argument edges.
Resume the LoCoMo run on the working free reader (glm-5.1 on Hyades; holo3.1 is currently returning empty streams, gateway-side). Report deltas, not absolutes, until a stable reader is pinned.
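The first item above, claims as facets on the returned session, can be sketched as follows. This is an illustration only: the real /recall bundle format is not shown in this note, so `SessionHit`, `attach_facets`, `render`, and the example date are all assumptions:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SessionHit:
    session_id: str
    passage: str
    facets: Dict[str, str] = field(default_factory=dict)


def attach_facets(
    hit: SessionHit,
    *,
    resolved_date: Optional[str] = None,
    standing: Optional[str] = None,
    contradiction_verdict: Optional[str] = None,
) -> SessionHit:
    """Tag a returned session with answer-shaped facts, skipping absent ones."""
    candidates = {
        "resolved_date": resolved_date,
        "standing": standing,
        "contradiction_verdict": contradiction_verdict,
    }
    for name, value in candidates.items():
        if value is not None:
            hit.facets[name] = value
    return hit


def render(hit: SessionHit) -> str:
    """Place facets ahead of the passage so the reader sees them first."""
    tags = " ".join(f"[{k}={v}]" for k, v in hit.facets.items())
    return f"{tags} {hit.passage}".strip()


# Made-up session text and date, for illustration only.
hit = attach_facets(
    SessionHit("s12", "We moved last Friday."), resolved_date="2023-05-05"
)
print(render(hit))
```

The passage itself is unchanged. The claim layer contributes only the tag, which matches the one shape the benchmark has shown to convert so far.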

Bottom line

The 48-hour build is exactly what the change report says it is: a complete, tested, migration-backed conversion of donto from a claim store into a governed-memory substrate. Verified — the code is real and the plan is done.

The honest addendum is that the substrate is a skeleton that passes its own tests but has not yet been put under load. Most of its governance and horizon tables hold single-digit demo rows against a store of roughly 41.5 million statements. The differentiating machinery (contradiction, identity, standing, lenses, tournaments) exists and is largely idle.

The benchmark is how we end the idleness. It has already told us which capabilities convert (answer-shaping facts) and which don't (retrieval), and it points at the one piece still unbuilt — the aggregate claim — as the next thing to make real. That is where we go next.