
The 48-hour build, read closely — a real skeleton that isn't load-bearing yet

Living document. 2026-06-12. A second read of the Past 48 Hours Change Report: not "what was committed" (that report already says it well) but "what is actually true now" — verified against the migrations, the tests, and the live database — and what it means for the benchmark work we are about to resume. For the full reports index, see Reports.

The change report is honest about what landed. This note asks the next question: of everything that landed, what is real, what is exercised, and what should we make load-bearing first? The short answer: the work is real and tested; it is almost entirely unexercised; and the benchmark — not another feature wave — is what tells us which parts to bring to life.


TL;DR

claim | verdict | evidence
The 48-hour work is real, not stubs | true | 22 migrations 0156–0177, each a SQL capability + a Rust invariants test + client wiring; e.g. 0174_cross_domain_analogy_engine.sql is 259 lines of in-DB SQL with a 150-line test
The W0→H8 execution plan is complete | true | 72478dc H8 is the tip; execution report published; 260 Rust files, 157 migrations, 148 test files
donto is now a "governed memory substrate" | true as capability, not yet as practice | the governance machinery exists and passes its invariant tests, but the governance tables are nearly empty
The new capabilities are in use | mostly false | feature tables hold 1–16 demo rows against a ~41.5M-statement store

The 48 hours converted donto's implicit conventions into durable, tested, conformance-gated SQL. That is the skeleton of a governed substrate. A skeleton that passes its own tests is not the same as a substrate under load.


1. The work is real — this is not benchmark theatre

It would be easy to dismiss "H5: cross-domain analogy engine" or "H4: hypothesis tournaments" as headline features. They aren't headlines. Each horizon commit follows the same disciplined shape the change report describes — and spot-checking confirms it:

  • A migration that is the feature. 0174_cross_domain_analogy_engine.sql implements the analogy engine as 259 lines of in-database SQL (tables + functions), not application glue. 0173_hypothesis_tournaments.sql creates three real tables (donto_hypothesis_tournament, _entry, _signal).
  • A test that pins the invariant. Each ships an invariants_*.rs (e.g. 150 lines for analogy) that the build runs.
  • Client/route wiring so the capability is reachable.

Many of the "tables?" features are deliberately SQL functions/views, not tables: standing_v2, predicate_closure_v2, grounded_contradiction_policy, the generic_rule_evaluator, the calibration_frames. That is the right shape for logic: it's exercised when called, not when a row is inserted. So a low row count for those isn't a gap. It does mean, though, that "is it used?" has to be answered by call-sites and downstream data, not by count(*).

The verdict on realness is unambiguous: 42 commits of migration-backed, tested substrate logic. The execution plan is genuinely complete.


2. …but almost none of it is exercised

Here is the part the changelog doesn't foreground. Against a live store of ~41.5M statements, the new governance and horizon tables hold:

table | rows | what it would hold under real use
donto_rule_agenda | 12,684 | scheduled rule work (genuinely running)
donto_argument | 2,509 | contradiction/support edges between claims
donto_identity_edge | 169 | resolved/blocked entity identities
donto_policy_action_key | 16 | policy ordering keys
donto_canonical_shadow_scope | 5 | scoped canonical rebuild registrations
donto_agent_reputation | 4 | per-agent trust context
donto_export_tier | 4 | tiered export closures
donto_standard_import_lift | 3 | standards interop lifts
donto_hypothesis_tournament | 2 | competing-hypothesis brackets
donto_cross_domain_analogy | 1 | cross-domain structural matches
donto_lens_parent | 0 | lens-algebra composition edges
donto_lens / donto_loss_report | 35 / 13 | active lenses / recorded translation losses

Two things stand out. First, the only capability under real load is the rule agenda (12.7k rows) — the W2 "loop heartbeat" actually beats. Second, everything in the judgment and horizon layers is at demo scale — one analogy, two tournaments, four reputations. These aren't failures; they're seeds. But they make the honest status precise:

donto has the machinery to hold standing, fold contradictions, run tournaments, track reputation, and export under tiered closure. It does not yet do those things at the scale of its own claim store. The contradiction layer that is supposed to be the differentiator covers 2,509 of ~41.5M statements — about 0.006%.
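
The coverage figure is plain arithmetic on the two counts quoted above. A minimal sketch, assuming (as this note does) that each donto_argument edge counts as one covered statement; the function name is chosen here for illustration only:

```python
def coverage_percent(covered: int, total: int) -> float:
    """Share of the store touched by a capability, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * covered / total


# Counts quoted in this note; the store size is approximate (~41.5M).
ARGUMENT_EDGES = 2_509
STATEMENTS = 41_500_000

pct = coverage_percent(ARGUMENT_EDGES, STATEMENTS)
print(f"{pct:.4f}%")  # prints 0.0060%
```

The same helper applies to any other row count in the table, which makes it easy to re-check the figure as the benchmark writes new edges back.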

This is the same gap the original PRD snapshot flagged ("the contradiction/identity machinery — the differentiator — is barely exercised") — and 48 hours of excellent kernel work did not close it, because closing it isn't kernel work. It's consumer pressure.


3. Why this is the right moment to return to the benchmark

The benchmark is not a side quest from the substrate work — it is the instrument that applies consumer pressure to exactly these capabilities. And the LoCoMo results from this same 48-hour window already tell us which capabilities to make load-bearing first. Read them together:

  • Claims-only recall: 0.244 vs episodic 0.837. Holding rich claims doesn't help if the reader can't get the answer out of them.
  • Bitemporal valid_time: temporal 0.50 → 0.725 (+45%). The one clean, donto-specific win.
  • Passage recall caps at 0.650; every retrieval-composition trick was neutral-to-negative.
  • (new, this session) Claim-as-recall-arm: a +18–21pt session-recall gain that did NOT convert to accuracy — multi-hop end-to-end stayed flat (5/11 → 5/11) even though the missing evidence sessions were recovered.

The through-line across all four is sharp: on a weak/cheap reader, better retrieval does not convert; the only thing that converts is a pre-computed, answer-shaped fact the reader could not derive itself (the resolved date worked because it handed the reader the answer).
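
The results above mix relative gains (+45%) with absolute points (+18–21pt), so it helps to compute both from the same pair of scores. A small sketch using only the numbers quoted in the list; the helper name is illustrative:

```python
def lift(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    if before <= 0:
        raise ValueError("before must be positive")
    points = (after - before) * 100.0
    relative = (after - before) / before * 100.0
    return points, relative


# Bitemporal valid_time on the temporal category: 0.50 -> 0.725.
temporal_points, temporal_relative = lift(0.50, 0.725)

# Multi-hop end-to-end with the claim recall arm: 5/11 -> 5/11.
multihop_points, multihop_relative = lift(5 / 11, 5 / 11)

print(f"temporal: +{temporal_points:.1f} pt, +{temporal_relative:.0f}% relative")
print(f"multi-hop: {multihop_points:+.1f} pt")
```

Read this way, the temporal win is 22.5 points absolute (45% relative), while the recall-arm change is zero points on accuracy despite the session-recall gain.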

Now map that onto the 48-hour capability list. The features that produce answer-shaped facts are precisely the ones worth exercising first:

converts (answer-shaping) → exercise first | doesn't convert (retrieval) → deprioritise
Bitemporal valid_time (proven +45%) | hybrid / multi-scope retrieval (measured negative)
Standing: which claim to trust, returned as the answer | cross-encoder rerank on passages (cheap-regime ceiling)
Predicate closure + aligned match: pre-joins a multi-hop chain into one retrievable claim | claim-embedding as a second recall index (recovers sessions, doesn't convert)
Grounded contradiction policy: returns the resolved position, not both raw sides |
Aggregate / synthesized claims (the untested next step): move multi-hop reasoning off the reader into the substrate |

So the benchmark isn't just a scoreboard; it's the selection function over the 48-hour build. It says: stop investing in retrieval, start exercising the answer-shaping capabilities — standing, closure, contradiction policy, and especially the not-yet-built aggregate claim that pre-computes the multi-hop join. That is the bridge from "the skeleton passes its tests" to "the skeleton carries weight."
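
To make "pre-computes the multi-hop join" concrete, here is a minimal sketch of the idea, not donto's implementation. The Claim shape, the synthesize function, the joined-predicate naming, and the sample data are all hypothetical; in donto this logic would presumably live in SQL on top of predicate closure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim shape: a triple plus the sessions it came from."""
    subject: str
    predicate: str
    obj: str
    sessions: frozenset[str]


def synthesize(claims: list[Claim]) -> list[Claim]:
    """Pre-join two-hop chains (A -p-> B, B -q-> C) into one claim A -p/q-> C."""
    by_subject: dict[str, list[Claim]] = {}
    for claim in claims:
        by_subject.setdefault(claim.subject, []).append(claim)

    joined: list[Claim] = []
    for first in claims:
        for second in by_subject.get(first.obj, []):
            if second is first:
                continue  # skip self-loops
            joined.append(
                Claim(
                    subject=first.subject,
                    predicate=f"{first.predicate}/{second.predicate}",
                    obj=second.obj,
                    # Keep provenance from both hops.
                    sessions=first.sessions | second.sessions,
                )
            )
    return joined


# Illustrative data only: two facts stated in different sessions.
claims = [
    Claim("alice", "adopted", "rex", frozenset({"s3"})),
    Claim("rex", "breed", "beagle", frozenset({"s9"})),
]
aggregates = synthesize(claims)
```

The single synthesized claim carries the answer to a question that would otherwise require the reader to connect sessions s3 and s9 itself, which is the work the weak reader was observed not to do.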


4. What to do next (and the order)

  1. Make the claim layer load-bearing in real /recall, as a facet not a passage. The benchmark proved claims lose as a returned passage and as a recall index. They win only as tags on the returned session — the resolved date today; standing and a held-contradiction verdict next. Wire those into the bundle.
  2. Build and benchmark the aggregate/synthesized claim. This is the one lever the benchmark predicts will move multi-hop, because it pre-computes the cross-session join the weak reader can't do. It is also the natural consumer of predicate closure + aligned match — i.e. it exercises three 48-hour capabilities at once.
  3. Let the benchmark drive contradiction coverage up from 0.006%. Knowledge-update and temporal questions are where held contradictions + bitemporal belief should shine; run those categories specifically and write the deltas back as donto_argument edges.
  4. Resume the LoCoMo run on the working free reader (glm-5.1 on Hyades; holo3.1 is currently returning empty streams, gateway-side). Report deltas, not absolutes, until a stable reader is pinned.
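
Step 1's "facet, not a passage" shape can be sketched as follows. This is an illustration under stated assumptions, not the /recall API: SessionHit, attach_facets, render, the facet keys, and the sample values are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class SessionHit:
    """Hypothetical recall result: the session text plus claim-derived tags."""
    session_id: str
    text: str
    facets: dict[str, str] = field(default_factory=dict)


def attach_facets(
    hits: list[SessionHit], facts_by_session: dict[str, dict[str, str]]
) -> list[SessionHit]:
    """Tag each returned session with pre-computed facts; never add passages."""
    for hit in hits:
        hit.facets.update(facts_by_session.get(hit.session_id, {}))
    return hits


def render(hit: SessionHit) -> str:
    """Format a hit for the reader, with facets on the header line."""
    tags = "; ".join(f"{key}={value}" for key, value in sorted(hit.facets.items()))
    header = f"[{hit.session_id}] {tags}" if tags else f"[{hit.session_id}]"
    return f"{header}\n{hit.text}"


# Illustrative data only.
hits = [
    SessionHit("s3", "We finally moved last Friday."),
    SessionHit("s7", "Nothing new this week."),
]
facts = {"s3": {"resolved_date": "2023-05-12", "standing": "accepted"}}
bundle = [render(hit) for hit in attach_facets(hits, facts)]
```

The design point is that the set of returned sessions is unchanged; only the header line gains answer-shaped tags such as a resolved date, which is the one form in which claims were measured to help.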

Bottom line

The 48-hour build is exactly what the change report says it is: a complete, tested, migration-backed conversion of donto from a claim store into a governed-memory substrate. Verified — the code is real and the plan is done.

The honest addendum is that the substrate is a skeleton that passes its own tests but has not yet been put under load. Apart from the rule agenda, its governance and horizon tables hold demo-scale row counts (mostly between 0 and 16 rows) against a ~41.5-million-statement store, and even the 2,509 argument edges cover about 0.006% of it. The differentiating machinery (contradiction, identity, standing, lenses, tournaments) exists and is idle.

The benchmark is how we end the idleness. It has already told us which capabilities convert (answer-shaping facts) and which don't (retrieval), and it points at the one piece still unbuilt — the aggregate claim — as the next thing to make real. That is where we go next.

See also: Past 48 Hours Change Report · The shape of donto's return · Future Substrate Report · LoCoMo Config C.