# The 48-hour build, read closely — a real skeleton that isn't load-bearing yet

_Living document. 2026-06-12. A second read of the [Past 48 Hours Change Report](/reports/past-48-hours-change-report): not "what was committed" (that report already says it well) but "what is actually true now" — verified against the migrations, the tests, and the live database — and what it means for the benchmark work we are about to resume. For the full reports index, see [Reports](/reports)._

The change report is honest about what landed. This note asks the next question: **of everything that landed, what is real, what is exercised, and what should we make load-bearing first?** The short answer: the work is real and tested; it is almost entirely unexercised; and the benchmark — not another feature wave — is what tells us which parts to bring to life.

---

## TL;DR

| claim | verdict | evidence |
|---|---|---|
| The 48-hour work is real, not stubs | **true** | 22 migrations `0156–0177`, each a SQL capability + a Rust invariants test + client wiring; e.g. `0174_cross_domain_analogy_engine.sql` is 259 lines of in-DB SQL with a 150-line test |
| The W0→H8 execution plan is complete | **true** | `72478dc H8` is the tip; execution report published; 260 Rust files, 157 migrations, 148 test files |
| donto is now a "governed memory substrate" | **true as capability, not yet as practice** | the governance *machinery* exists and passes its invariant tests, but the governance *tables are nearly empty* |
| The new capabilities are in use | **mostly false** | most feature tables hold **0–16 demo rows** against a **~41.5M-statement** store |

> The 48 hours converted donto's implicit conventions into durable, tested, conformance-gated SQL. That is the skeleton of a governed substrate. A skeleton that passes its own tests is not the same as a substrate under load.

---

## 1. The work is real — this is not benchmark theatre

It would be easy to dismiss "H5: cross-domain analogy engine" or "H4: hypothesis tournaments" as headline features. They aren't headlines. Each horizon commit follows the same disciplined shape the change report describes — and spot-checking confirms it:

- **A migration that *is* the feature.** `0174_cross_domain_analogy_engine.sql` implements the analogy engine as **259 lines of in-database SQL** (tables + functions), not application glue. `0173_hypothesis_tournaments.sql` creates three real tables (`donto_hypothesis_tournament`, `_entry`, `_signal`).
- **A test that pins the invariant.** Each ships an `invariants_*.rs` (e.g. 150 lines for analogy) that the build runs.
- **Client/route wiring** so the capability is reachable.

Many of the "tables?" features are deliberately **SQL functions/views, not tables** — `standing_v2`, `predicate_closure_v2`, `grounded_contradiction_policy`, the `generic_rule_evaluator`, the `calibration_frames`. That is the right shape for *logic*: it's exercised when called, not when a row is inserted. So a low row count for those isn't a gap — but it does mean "is it used?" has to be answered by call-sites and downstream data, not by `count(*)`.

The verdict on realness is unambiguous: **42 commits of migration-backed, tested substrate logic. The execution plan is genuinely complete.**

---

## 2. …but almost none of it is exercised

Here is the part the changelog doesn't foreground. Against a live store of **~41.5M statements**, the new governance and horizon tables hold:

| table | rows | what it would hold under real use |
|---|---:|---|
| `donto_rule_agenda` | **12,684** | scheduled rule work — *genuinely running* |
| `donto_argument` | 2,509 | contradiction/support edges between claims |
| `donto_identity_edge` | 169 | resolved/blocked entity identities |
| `donto_policy_action_key` | 16 | policy ordering keys |
| `donto_canonical_shadow_scope` | 5 | scoped canonical rebuild registrations |
| `donto_agent_reputation` | 4 | per-agent trust context |
| `donto_export_tier` | 4 | tiered export closures |
| `donto_standard_import_lift` | 3 | standards interop lifts |
| `donto_hypothesis_tournament` | **2** | competing-hypothesis brackets |
| `donto_cross_domain_analogy` | **1** | cross-domain structural matches |
| `donto_lens_parent` | **0** | lens-algebra composition edges |
| `donto_lens` | 35 | active lenses |
| `donto_loss_report` | 13 | recorded translation losses |

Two things stand out. First, **the only capability under real load is the rule agenda** (12.7k rows) — the W2 "loop heartbeat" actually beats. Second, **everything in the judgment and horizon layers is at demo scale** — one analogy, two tournaments, four reputations. These aren't failures; they're seeds. But they make the honest status precise:

> donto has the *machinery* to hold standing, fold contradictions, run tournaments, track reputation, and export under tiered closure. It does not yet *do* those things at the scale of its own claim store. The contradiction layer that is supposed to be the differentiator holds **2,509** argument edges against **~41.5M** statements, a ratio of about **0.006%**.

This is the same gap the original PRD snapshot flagged ("the contradiction/identity machinery — the differentiator — is barely exercised") — and 48 hours of excellent kernel work did not close it, because closing it isn't kernel work. **It's consumer pressure.**
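
The 0.006% figure can be reproduced from the row counts in the table above. The sketch below is only that arithmetic: the store size is the report's approximate 41.5M, the dictionary is a hand-copied subset of the table, and `coverage_pct` is a hypothetical helper name, not anything in donto. It is a ratio of rows to statements, not a measure of how many statements are touched by an edge.

```python
# Row counts copied by hand from the table above; the store size is the
# report's approximate figure (~41.5M statements), not a live query.
STORE_STATEMENTS = 41_500_000

TABLE_ROWS = {
    "donto_rule_agenda": 12_684,
    "donto_argument": 2_509,
    "donto_identity_edge": 169,
    "donto_policy_action_key": 16,
    "donto_hypothesis_tournament": 2,
    "donto_cross_domain_analogy": 1,
    "donto_lens_parent": 0,
}


def coverage_pct(rows: int, store: int = STORE_STATEMENTS) -> float:
    """Rows expressed as a percentage of the statement store."""
    return 100.0 * rows / store


for table, rows in TABLE_ROWS.items():
    print(f"{table:32s} {rows:>7,d}  {coverage_pct(rows):.4f}%")
```

Even the rule agenda, the one table under real load, comes out at a few hundredths of a percent of the store on this ratio.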

---

## 3. Why this is the right moment to return to the benchmark

The benchmark is not a side quest from the substrate work — it is the **instrument that applies consumer pressure** to exactly these capabilities. And the LoCoMo results from this same 48-hour window already tell us *which* capabilities to make load-bearing first. Read them together:

- **Claims-only recall: 0.244 vs episodic 0.837.** Holding rich claims doesn't help if the reader can't get the answer out of them.
- **Bitemporal `valid_time`: temporal 0.50 → 0.725 (+22.5 points, +45% relative).** The one clean, donto-specific win.
- **Passage recall caps at 0.650; every retrieval-composition trick was neutral-to-negative.**
- **(new, this session) Claim-as-recall-arm: a +18–21pt *session-recall* gain that did NOT convert to accuracy** — multi-hop end-to-end stayed flat (5/11 → 5/11) even though the missing evidence sessions were recovered.

The through-line across all four is sharp: **on a weak/cheap reader, better *retrieval* does not convert; the only thing that converts is a pre-computed, answer-shaped *fact* the reader could not derive itself** (the resolved date worked because it handed the reader the answer).
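
The bullets above mix point gains and relative gains, so it helps to keep the two apart. A minimal sketch of the arithmetic, using only figures quoted in this note (`gains` is a hypothetical helper name):

```python
from __future__ import annotations


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute_pts = (after - before) * 100
    relative_pct = (after - before) / before * 100
    return round(absolute_pts, 1), round(relative_pct, 1)


# Temporal accuracy with bitemporal valid_time: 0.50 -> 0.725
print(gains(0.50, 0.725))

# Multi-hop end-to-end with the claim recall arm: 5/11 -> 5/11
print(gains(5 / 11, 5 / 11))
```

The first call gives 22.5 points and 45% relative; the second gives zero on both, which is the "recall gain that did not convert" in numeric form.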

Now map that onto the 48-hour capability list. The features that *produce answer-shaped facts* are precisely the ones worth exercising first:

| converts (answer-shaping) → exercise first | doesn't convert (retrieval) → deprioritise |
|---|---|
| **Bitemporal `valid_time`** (proven +45%) | hybrid / multi-scope retrieval (measured negative) |
| **Standing** — *which* claim to trust, returned as the answer | cross-encoder rerank on passages (cheap-regime ceiling) |
| **Predicate closure + aligned match** — pre-joins a multi-hop chain into one retrievable claim | claim-embedding as a second recall index (recovers sessions, doesn't convert) |
| **Grounded contradiction policy** — returns the *resolved* position, not both raw sides | |
| **Aggregate / synthesized claims** (the untested next step) — move multi-hop reasoning off the reader into the substrate | |

So the benchmark isn't just a scoreboard; it's the **selection function** over the 48-hour build. It says: stop investing in retrieval, start exercising the answer-shaping capabilities — standing, closure, contradiction policy, and especially the not-yet-built **aggregate claim** that pre-computes the multi-hop join. That is the bridge from "the skeleton passes its tests" to "the skeleton carries weight."
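
To make "pre-computes the multi-hop join" concrete, here is a minimal in-memory sketch of what an aggregate claim could look like. Everything in it is an assumption for illustration: the `Claim` shape, the `synthesize_two_hop` function, the predicate names, and the example data are hypothetical and are not donto's schema or API. In donto this would presumably live in SQL on top of predicate closure, not in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim shape; not donto's actual schema."""
    subject: str
    predicate: str
    obj: str
    sessions: frozenset  # ids of the sessions the claim was extracted from


def synthesize_two_hop(claims, first, second, derived):
    """Join (s, first, m) with (m, second, o) into one claim (s, derived, o)."""
    # Index the second-hop claims by subject so the join is a dict lookup.
    by_subject = {}
    for c in claims:
        if c.predicate == second:
            by_subject.setdefault(c.subject, []).append(c)

    out = []
    for a in claims:
        if a.predicate != first:
            continue
        for b in by_subject.get(a.obj, []):
            # The derived claim keeps the evidence sessions of both hops.
            out.append(Claim(a.subject, derived, b.obj, a.sessions | b.sessions))
    return out


# Invented example data: two facts stated in different sessions.
claims = [
    Claim("alice", "adopted", "rex", frozenset({"s3"})),
    Claim("rex", "breed", "beagle", frozenset({"s9"})),
]
for c in synthesize_two_hop(claims, "adopted", "breed", "adopted_breed"):
    print(c.subject, c.predicate, c.obj, sorted(c.sessions))
```

The point of the shape is that the reader receives one answer-shaped fact with both evidence sessions attached, instead of two passages it would have to join itself.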

---

## 4. What to do next (and the order)

1. **Make the claim layer load-bearing in real `/recall`, as a facet not a passage.** The benchmark proved claims lose as a returned passage and as a recall index. They win only as *tags on the returned session* — the resolved date today; standing and a held-contradiction verdict next. Wire those into the bundle.
2. **Build and benchmark the aggregate/synthesized claim.** This is the one lever the benchmark predicts will move multi-hop, because it pre-computes the cross-session join the weak reader can't do. It is also the natural consumer of predicate closure + aligned match — i.e. it *exercises* three 48-hour capabilities at once.
3. **Let the benchmark drive contradiction coverage up from 0.006%.** Knowledge-update and temporal questions are where held contradictions + bitemporal belief should shine; run those categories specifically and write the deltas back as `donto_argument` edges.
4. **Resume the LoCoMo run on the working free reader** (glm-5.1 on Hyades; holo3.1 is currently returning empty streams, gateway-side). Report deltas, not absolutes, until a stable reader is pinned.
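
"Deltas, not absolutes" in step 4 can be sketched as a per-category comparison of two runs on the same reader and the same question set. This is an assumed reporting shape, not the existing harness: `accuracy_deltas` and the `(correct, total)` tuples are hypothetical, and the only real numbers used are the multi-hop 5/11 → 5/11 result quoted earlier.

```python
def accuracy_deltas(baseline, variant):
    """Per-category accuracy change, in points, between two runs on one reader.

    Both arguments map a category name to a (correct, total) pair.
    """
    deltas = {}
    for category, (base_ok, base_n) in baseline.items():
        var_ok, var_n = variant[category]
        if base_n != var_n:
            # A delta is only meaningful over the same question set.
            raise ValueError(
                f"{category}: question sets differ ({base_n} vs {var_n})"
            )
        deltas[category] = round(100.0 * (var_ok - base_ok) / base_n, 1)
    return deltas


# Multi-hop with and without the claim recall arm, as reported above.
print(accuracy_deltas({"multi_hop": (5, 11)}, {"multi_hop": (5, 11)}))
```

Refusing to compare runs with different totals keeps a reader outage, such as empty streams dropping questions, from showing up as a spurious gain or loss.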

---

## Bottom line

The 48-hour build is exactly what the change report says it is: a complete, tested, migration-backed conversion of donto from a claim *store* into a governed-memory *substrate*. Verified — the code is real and the plan is done.

The honest addendum is that the substrate is a **skeleton that passes its own tests but has not yet been put under load**. Most of its governance and horizon tables hold between 0 and 16 demo rows against a ~41.5M-statement store. The differentiating machinery (contradiction, identity, standing, lenses, tournaments) exists and is idle.

The benchmark is how we end the idleness. It has already told us which capabilities convert (answer-shaping facts) and which don't (retrieval), and it points at the one piece still unbuilt — the aggregate claim — as the next thing to make real. That is where we go next.

_See also: [Past 48 Hours Change Report](/reports/past-48-hours-change-report) · [The shape of donto's return](/reports/donto-memory-return-shape) · [Future Substrate Report](/reports/future-substrate) · [LoCoMo Config C](/reports/locomo-claims-only-config-c)._
