The shape of donto's return — what a memory query actually hands the agent
Living document. 2026-06-12. The full schema of the payload donto returns to a reader/agent in the LoCoMo agentic-memory benchmark, field by field, and the empirical reason each field exists. Companion to the LoCoMo Config C study.
When an agent asks donto a question, donto returns a memory bundle — a ranked, JSON-serialisable list of evidence the agent reads to answer. This report documents the exact shape of that bundle, where each field comes from in the substrate, and — the part that matters — why the shape is what it is, because every field was earned by a measured result, not designed up front.
The short version: donto holds a bitemporal, evidence-anchored, paraconsistent claim graph, but what it returns is deliberately narrower and reader-shaped. The substrate is rich; the return is the minimal, highest-signal slice of it that lets a (possibly weak) reader answer in the fewest tokens. Those are two different schemas, and the report covers both.
1. Two schemas: what donto holds vs what donto returns
What donto holds (the substrate row — donto_statement):
| column | type | meaning |
|---|---|---|
| statement_id | uuid | identity of the claim |
| subject | text | the thing the claim is about (an ex:/ctx: IRI or entity) |
| predicate | text | the freely-minted relation (occupation, attendedEvent, occurredAt, …) |
| object_iri | text | object when it's another entity |
| object_lit | jsonb | object when it's a literal: {"v": "...", "dt": "..."} |
| context | text | provenance bucket (e.g. ctx:claims/locomo/<chunk>) |
| tx_time | tstzrange | transaction time: when donto recorded this (system time) |
| valid_time | daterange | valid time: when the fact is/was true in the world |
| flags | int | maturity / status bits |
| content_hash | bytea | content address (auto-dedup) |
This is the bitemporal, paraconsistent core: a fact has both a recorded-at (tx_time) and a true-in-the-world (valid_time); contradictory claims coexist as legal state (no argmax, no overwrite); each claim is anchored to a source via donto_evidence_link → donto_span (offsets into the original document). Around it sit derived layers: per-claim vectors (donto_claim_embedding, bge-small 384-dim, HNSW), predicate embeddings + alignment closure (query-time predicate folding), entity fingerprints (identity resolution).
What donto returns (the memory bundle): a JSON array of memory items, each a small, flat, reader-legible record. This is what lands in the agent's context:
[
{
"subject": "caroline",
"predicate": "attendedEvent",
"object": "lgbtq support group",
"text": "[1:56 pm on 8 May, 2023]",
"source": "search"
},
{
"subject": "Caroline",
"predicate": "said",
"object": "I went to a LGBTQ support group yesterday and it was so powerful.",
"text": "[1:56 pm on 8 May, 2023]",
"source": "search"
}
]
Every item carries at most: subject · predicate · object (the fact), a text field (a bracketed temporal tag and/or a verbatim source quote), and a source tag (which retrieval arm produced it). That's the whole return schema. The rest of this report is why each field is there and not more.
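As a minimal sketch of that schema, the item can be written as a Python `TypedDict` with a small validator. The five field names come from the bundle above; the names `MemoryItem` and `validate_item`, and the choice to treat `text` and `source` as optional, are assumptions made for illustration and are not donto's API.

```python
from typing import TypedDict

# Hypothetical names: MemoryItem and validate_item are illustrative only.
class MemoryItem(TypedDict, total=False):
    subject: str
    predicate: str
    object: str
    text: str      # bracketed temporal tag and/or verbatim quote
    source: str    # retrieval arm: "recall" or "search"

REQUIRED = ("subject", "predicate", "object")
ALLOWED = REQUIRED + ("text", "source")
ARMS = {"recall", "search"}

def validate_item(item: dict) -> list[str]:
    """Return a list of problems; an empty list means the item fits the schema."""
    problems = [f"missing field: {k}" for k in REQUIRED if k not in item]
    problems += [f"unexpected field: {k}" for k in item if k not in ALLOWED]
    problems += [
        f"non-string value: {k}"
        for k in item
        if k in ALLOWED and not isinstance(item[k], str)
    ]
    source = item.get("source")
    if isinstance(source, str) and source not in ARMS:
        problems.append(f"unknown source arm: {source}")
    return problems
```

A validator like this makes the "and not more" part checkable: any substrate field that leaks into an item (for example `tx_time`) shows up as an unexpected field.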
2. The return schema, field by field — and why
subject / predicate / object — the fact triple
The atomic unit. Three forms ship through the same three fields:
- A structured claim: caroline · occupation · artist. Compact, joinable, abundance-native (donto mints predicates freely, so the same fact appears from many directions).
- A resolved temporal claim: melanie-sunrise · occurredAt · 2022. Same three fields; the object is a resolved date (see valid_time below).
- A natural passage: Caroline · said · "I went to a LGBTQ support group yesterday…". The subject is the speaker, the predicate is said, the object is the verbatim sentence.
Why the same three fields carry all three: the reader gets one uniform shape regardless of whether the evidence is a structured fact or a raw sentence — no schema-switching, no special-casing. The content of the object is what varies (an IRI label, a date, or a sentence), not the envelope.
text — the temporal tag (and optional source quote)
A bracketed string like [1:56 pm on 8 May, 2023], optionally followed by a verbatim "quote".
Why it exists — measured, not assumed. Temporal questions ask for the event date, but a claim by itself carries no usable date (valid_time defaults to the unbounded (,)). Without a date in the return, the reader is blind on temporal questions — in an early run it answered every "when…" with the session date and scored ~0.18 on the temporal category. Tagging each item with its resolved/anchored date lifted temporal 0.50 → 0.725 (+45%) with no other change. The date is the single most load-bearing non-fact field in the return.
The optional quote (carried inside text, not as a separate field) exists for the same reason at finer grain: where a structured claim is terse, the anchored source span gives the reader the words actually said, which it reasons over better than an abstract triple.
source — which retrieval arm produced this item
"recall" (holder-scoped episodic recall) vs "search" (claim/passage recall). Diagnostic, not load-bearing for the reader, but it lets us attribute accuracy to a retrieval path and tune the mix.
What is deliberately NOT in the return
Just as important as the fields present:
- No tx_time. The agent doesn't care when donto recorded a fact, only when it was true. tx_time stays in the substrate for audit/bitemporal queries; it never reaches the reader.
- No raw valid_time range, IRIs, confidences, maturity, content hashes, or context IRIs. These are substrate plumbing. Exposing them is pure token cost and reader noise.
- No mixing of representations within a bundle when it hurts. We measured that adding structured triples on top of natural passages lowered accuracy (0.625 → 0.583): the triples were noise to a weak reader already given prose. So the return is representation-homogeneous by default: prose for prose-friendly readers, claims when claims win.
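The narrowing from substrate row to returned item can be sketched as a single projection function. This is an illustration under stated assumptions: the row is a plain dict keyed by the substrate column names, and `to_memory_item`, `resolved_date`, and `quote` are hypothetical names for values the real pipeline would derive elsewhere.

```python
from typing import Optional

# Assumed, simplified stand-in for a donto_statement row plus its anchor.
# to_memory_item, resolved_date and quote are hypothetical names.
def to_memory_item(row: dict, source: str,
                   resolved_date: Optional[str] = None,
                   quote: Optional[str] = None) -> dict:
    """Project a rich substrate row onto the narrow return shape."""
    # The object is either a literal {"v": ..., "dt": ...} or another entity.
    if row.get("object_lit") is not None:
        obj = str(row["object_lit"]["v"])
    else:
        obj = row["object_iri"]
    item = {
        "subject": row["subject"],
        "predicate": row["predicate"],
        "object": obj,
        "source": source,
    }
    # text = bracketed temporal tag, optionally followed by a verbatim quote.
    parts = []
    if resolved_date:
        parts.append(f"[{resolved_date}]")
    if quote:
        parts.append(f'"{quote}"')
    if parts:
        item["text"] = " ".join(parts)
    # tx_time, valid_time, flags, content_hash and context are not copied.
    return item
```

The omission is structural rather than a filter applied afterwards: fields that are never copied cannot cost tokens.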
3. Why this shape — the empirical journey
The return schema is the residue of a sequence of measured experiments (free holo3.1 reader, reader held fixed, LoCoMo 160-question stratified, 0 gateway-blanked answers). Each shape change is justified by a number.
| return shape | what the reader got | accuracy | median tokens |
|---|---|---|---|
| episodic | whole dialogue sessions | 0.837 | ~19,000 |
| claims-only (triples) | subject·predicate·object + date | 0.362 | ~1,700 |
| claims + valid_time | + resolved occurredAt dates | 0.412 (temporal 0.50→0.725) | ~1,700 |
| passages (turns) | natural sentences | 0.650 | ~2,800 |
| passages + claims (fusion) | both | 0.583 ↓ | ~3,700 |
What the journey taught — and why the final shape is what it is:
- Structured ≠ readable. Terse triples are compact (~1.7K tokens) but a weak reader loses most of the answer (0.362). The fact existing in the substrate is not the same as the reader being able to use it.
- valid_time is the bitemporal differentiator. Populating when an event was true and surfacing it in text lifted temporal +45%: the one place a contradiction-preserving bitemporal store structurally beats episodic chunk-stuffing, which can only re-derive the date each time and usually grabs the speech date.
- Prose beats triples for a weak reader, per token. Returning natural passages instead of triples took accuracy 0.362 → 0.650 at ~1/7th of episodic's tokens: ~78% of episodic accuracy at ~15% of its cost. The return is shaped to give the reader what it reasons over best, minimally.
- More structure is not free. Stacking triples onto passages reduced accuracy (noise). The return omits anything that doesn't earn its tokens — a measured discipline, not a stylistic one.
- Delivery, not just ranking, is the gate. Before blaming the reader, we measured whether the answer-bearing evidence is even in the return: at top-20 it was missing 27% of the time (present ~73%). Widening and adding locality (the neighbours of each retrieved passage, by source offset) raised presence to ~89%. The return's selection, not just its field schema, is part of "the shape."
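The "share of episodic accuracy at share of episodic tokens" figures quoted in this section follow directly from the table. A short worked check, using only the numbers reported above (the dictionary and function names are illustrative):

```python
# Figures copied from the table above: (accuracy, approximate median tokens).
RUNS = {
    "episodic": (0.837, 19_000),
    "claims-only": (0.362, 1_700),
    "passages": (0.650, 2_800),
    "passages + claims": (0.583, 3_700),
}

def relative_to_episodic(name: str) -> tuple[float, float]:
    """Return (share of episodic accuracy, share of episodic tokens)."""
    acc, tok = RUNS[name]
    base_acc, base_tok = RUNS["episodic"]
    return acc / base_acc, tok / base_tok

for name in RUNS:
    acc_share, tok_share = relative_to_episodic(name)
    print(f"{name:18s} {acc_share:5.0%} of accuracy at {tok_share:5.0%} of tokens")
```

For passages this gives roughly 78% of episodic accuracy at roughly 15% of its tokens, matching the figures in the list above.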
4. Why the shape generalises (it is not a conversation trick)
The return schema is corpus-agnostic by construction, even though LoCoMo is conversational:
| LoCoMo instance | general substrate mechanism |
|---|---|
| subject·predicate·object claims | domain-neutral claim model: the same for genealogy, clinical, legal corpora |
| occurredAt / valid_time date tag | bitemporal valid-time: any corpus with temporal facts |
| Speaker · said · sentence passage | fine-grained passage recall: sentences/passages of any document (segmentation by sentence boundary, not by speaker) |
| neighbour turns (±1) | offset-locality — adjacent passages by source offset; donto_span already stores offsets, so this is general document structure |
| session-date anchor | document / event timestamp as the relative-time anchor |
The return is always: the minimal set of evidence items — structured facts and/or anchored source passages — ranked and locality-expanded, each tagged with the time it was true, in one uniform reader-legible shape. Nothing in that sentence mentions conversations.
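Offset-locality is the most mechanical of these generalisations, so a minimal sketch may help. It assumes passages are identified by their position in source-offset order; the function name and the choice to return results in document order are assumptions, not donto's implementation.

```python
def expand_with_neighbours(hits: list[int], n_passages: int,
                           radius: int = 1) -> list[int]:
    """Add the +/- radius neighbours of each retrieved passage.

    Passages are identified by their position in source-offset order
    (0 .. n_passages - 1). The result is in document order, deduplicated.
    """
    keep = set()
    for idx in hits:
        lo = max(0, idx - radius)               # clamp at document start
        hi = min(n_passages - 1, idx + radius)  # clamp at document end
        keep.update(range(lo, hi + 1))
    return sorted(keep)
```

Nothing here depends on speakers or turns: any corpus whose passages carry source offsets can be expanded the same way.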
5. The facet space — everything donto could return, and what we've tried
The current return uses one facet (the flat triple, optionally + date + anchor). But the substrate already computes ~20 facets about any thing (visible on the admin job page: evidence anchors, contradictions held, bitemporal belief, corroboration, argument graph, two-hop paths, entity reach, aligned claims, cross-source connections, standing…). The richer return is a dossier per relevant thing, not a flat fact list. This is the living checklist of return-shapes — tested vs not — and which benchmark problem each targets.
✅ tried (with result) · 🔄 in progress · ⬜ not yet tried. Numbers are LoCoMo, free holo3.1 reader, reader fixed; 24-Q samples unless "full-160".
Representation of the evidence (how a fact is shown)
- ✅ Claim triple s·p·o: 0.362–0.417. Compact, joinable, always-present; but a weak reader loses most of the answer.
- ✅ Claim + valid_time (resolved occurredAt date): temporal 0.50 → 0.725 (+45%). The bitemporal win. ✓
- ✅ Claim + evidence-span anchor (the phrase the claim came from): 0.524 (> claims-alone 0.417). Anchor helps; phrase-spans are fragmentary. ✓ (validates "triple always, anchor when available")
- ✅ Natural passages (source sentences/turns) — 0.650 full-160. ✓ current winner — prose >> triples for a weak reader, at ~15% of episodic's tokens.
- ⬜ Claim + FULL source-sentence anchor (not the phrase-span — map claim→its sentence) — should beat claim+span; the better anchor.
- ⬜ Realtime re-anchoring: when a claim is unanchored (no evidence_link), fetch the revision (blob, GCS) and run the citer on demand to locate a span at query time. → "the anchor is never truly absent, only precomputed-or-recoverable."
Retrieval composition (which facts get selected)
- ✅ Single-scope cosine (claims or passages) — the baselines above.
- ✅ Adjacency / locality (±1 neighbour passages by offset) — 0.662 full-160 but +2.3× tokens, hurt open-domain. Marginal.
- ✅ Entity-claim aggregation (all of an entity's claims) — multi-hop 0.50→0.67, but temporal→0.0. No single scope serves all.
- ✅ Multi-scope + cosine rerank — 0.583 (worse): cosine rerank sorts dissimilar-but-relevant aggregation out.
- ✅ Two-hop graph recall (graph-turns: question entity → 1-hop neighbours → passages about that neighbourhood): 0.542, negative (multi-hop 2/6 < passages' 3/6). Graph-neighbour passages were noise the weak reader couldn't exploit.
- ⬜ Cross-encoder rerank (relevance, not cosine similarity): Zep's 86.5%→94.7% mechanism. Env-blocked (fastembed reranker registry broken; torch too heavy). The missing piece for proper multi-scope.
- ⬜ Query decomposition (sub-queries → multi-shot retrieve → union) — for open-domain (implied-across-passages answers).
Graph / structural facets (donto's reasoning layer)
- ⬜ Two-Hop Paths (full seed→pred→mid→pred→far paths) → multi-hop.
- ⬜ Incoming / Backlinks (claims whose object is the entity) → reach the entity from the other side.
- ⬜ Argument Graph (supports / rebuts / undercuts edges) → reasoning, which claim wins.
- ⬜ Co-occurring entities → who/what appears with the entity.
Temporal / belief facets
- ✅ valid_time tag (above).
- ⬜ Bitemporal Belief (valid_time × tx_time per claim) → distinguish when true from when learned.
- ⬜ Contradictions Held (same s·p, different o, held paraconsistently) → knowledge-update ("what is X's job now" when it changed); episodic-stuffing can't do this, donto holds both.
- ⬜ Version History (retract-not-delete revisions) → temporal evolution.
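The "Contradictions Held" facet (same s·p, different o) can be illustrated with a simple grouping over triples. This is a sketch only: the function name is hypothetical, the example values in any usage are invented, and a real version would also need to consult valid_time and to exclude predicates that are legitimately multi-valued.

```python
from collections import defaultdict

def held_contradictions(
    claims: list[tuple[str, str, str]],
) -> dict[tuple[str, str], list[str]]:
    """Group (subject, predicate, object) claims and keep the (s, p) pairs
    that hold more than one distinct object. Nothing is overwritten or ranked."""
    objects: dict[tuple[str, str], list[str]] = defaultdict(list)
    for s, p, o in claims:
        if o not in objects[(s, p)]:
            objects[(s, p)].append(o)  # keep first-seen order, drop duplicates
    return {sp: objs for sp, objs in objects.items() if len(objs) > 1}
```

Both objects survive side by side, which is the paraconsistent behaviour the facet relies on; choosing between them is left to the time tags or to the reader.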
Confidence / provenance facets
- ⬜ Standing kernel per claim (maturity, corroboration count, contradiction-pressure, recency) → tell the reader how much to trust each fact.
- ⬜ Corroboration (same triple independently asserted elsewhere) → reliability via agreement.
- ⬜ Aligned Claims / predicate folding (occupation ↔ job via the closure) → synonymy at query time.
- ⬜ Cross-Source Connections (other docs mentioning the entity) → triangulation (Lens Engine).
Return-schema shape itself
- ✅ Flat list of items (current).
- ⬜ Per-item dossier: {claim, anchor, valid_time, standing, contradicts:[…], supportedBy:[…], corroboration:N, neighbours:[…]}. A fact with its context, the opposite of source-stuffing (distilled structure, not verbatim text).
- ⬜ Entity-profile return: one consolidated dossier for the question's entity (all facets about it), rather than N independent rows.
- ⬜ Resource URIs as tool affordances (agentic only — why): minimal bundle + handles the agent dereferences on demand.
Reading of the checklist so far: representation (prose vs triples) and the valid_time facet gave the real wins; every retrieval-composition trick (adjacency, entity-aggregation, multi-scope, two-hop graph) was neutral-to-negative on the free reader — extra context is noise it can't exploit. The robust lesson: on a weak reader, changing what you retrieve plateaus at simple passage recall (0.650); the gains come from changing the index/representation (so the same passage is found better, or the fact is shown better) and from moving reasoning into the substrate. That motivates a set of new donto concepts — each dissolves a measured tradeoff instead of adding retrieval noise:
Candidate new donto concepts (invented from the failure pattern)
- ✅ Interrogative Index — index each fact by the questions it answers (GLM-generated at index time), retrieval is question→question. Concept sound; effect modest (and it does not scale with build). Killer example: for the turn "I went to a LGBTQ support group yesterday" GLM generated the verbatim benchmark question "When did Caroline go to the LGBTQ support group?". Now measured at FULL build (conv-26, 417 turns / 1,508 generated questions vs the earlier 44%-build): IQ-alone 0.420 @20 / 0.553 @40 — far below turn-text recall (0.793 / 0.853), because GLM can't anticipate every benchmark phrasing. As a UNION supplement it still recovers evidence turns text missed: 0.793→0.827 @20, 0.853→0.873 @40 (+2.0–3.4 pts; 3–5 recovered). Key finding: the full build barely beat the 44% build (0.873 vs 0.873 @40; 0.827 vs 0.820 @20) → the supplement saturates early and is not a step-change. Build-expensive (one GLM call per turn) for a +2–3pt recall bump that, per all prior runs, does not convert to accuracy through the weak reader. Verdict: keep as a cheap union arm, not a primary index.
- ⬜ Bi-representational claim: store s·p·o and a canonical one-sentence prose realisation together; return the sentence (reader-ready), join on the triple. Dissolves the prose-vs-triples tradeoff (no fusion noise, because it's one unit).
- ✅ Synthesized / aggregate claims: the substrate pre-composes scattered atoms into answer-shaped higher-order claims, so multi-hop becomes single-hop retrieval and the combination lives in the substrate, not the reader. Built and benchmarked. Per conversation, per speaker, glm-5.1 composed ~7 self-contained summary sentences from that entity's atoms (e.g. "Caroline has been creating art since age 17, working primarily in painting and stained glass…"), stored as synthesized claims + embedded: 140 aggregate claims over the LoCoMo corpus, injected as summary facets. Result (matched 41Q, glm-5.1 reader, free on Hyades): net flat (0.732 → 0.732), but multi-hop moved the predicted way, 7/12 → 8/12 (+1), at the cost of open-domain 7/12 → 6/12 (−1). So the lever points the right direction on its target category, but the effect is marginal on a strong reader (glm-5.1 already does multi-hop from the sessions; baseline multi-hop was 0.58, not 0.0) and the always-on injection adds noise to inference questions. Verdict: real but small here. Tried next with relevance-gating (distance threshold) + a date facet stacked in, re-run on holo3.1 (the other reader): baseline 0.646 → full stack 0.625 (−0.021), with multi-hop flat (churn +1/−1), temporal flat, open-domain −1. So across both readers the answer-shaping facet stack does NOT beat session-rich episodic.
The decisive lesson (why facets don't add to sessions). The valid_time +45% was a claims-only win: the reader had only claims, so the resolved date was the only signal. When the reader already has the full sessions (which contain the relative-time phrases and the scattered facts), the date/aggregate facets are redundant, and gated-or-not they add as much noise as signal. Claim-derived facets help in impoverished/token-constrained contexts and with weak readers, not as additions to a session-rich strong-reader setup. Implication for the benchmark: donto's memory edge on session-rich LoCoMo is not raw accuracy (full-session episodic + a good reader is hard to beat); it's token-efficiency (claims-only answers at ~1/11th the tokens), bitemporal knowledge-update, faithful abstention, and scaling past the context window. The next runs pivot there (the claims-only token-efficiency curve and the knowledge-update category) rather than bolting more facets onto episodic.
⭐ Where the aggregate actually wins: the token-constrained regime
Re-running the aggregate as the context (claims + aggregates + resolved dates, no raw sessions) instead of an addition to sessions flips it from redundant to load-bearing. holo3.1, matched 48Q:
| config | accuracy | context tokens |
|---|---|---|
| episodic (full sessions) | 0.646 | ~5,305 |
| claims + aggregates + dates (depth 5) | 0.479 | ~1,013 (5.2× fewer) |
| claims + aggregates + dates (depth 15) | 0.542 | ~1,403 (3.8× fewer) |
| raw claims-only (no aggregates) | ~0.244 | ~1,700 |
Firmed up at 96 questions (holo3.1): episodic 0.604 (~5,249 tok) vs claims+aggregates depth-15 0.521 (~1,376 tok) — 86% of episodic accuracy at 3.8× fewer tokens, and per-category it matches episodic on multi-hop (9/24=9/24), beats temporal (20 vs 18), ≈open-domain (8 vs 9); the entire gap is single-hop (13 vs 22) — verbatim "what exactly did X say" recall that the claim abstraction drops and the (sparse) evidence spans can't restore.
The aggregates roughly DOUBLE raw claims-only accuracy (0.244 → 0.479) — the answer-shaped composition is the right lever precisely when the reader has no sessions to fall back on. And the efficiency is the story: 74% of episodic's accuracy at 19% of the tokens. Per category it's not a uniform haircut — claims+aggregates matches episodic on temporal (8/12, the date facets fully carry it) and open-domain (6/12), and loses only on single-hop (12→6) and multi-hop (5→3), the categories that need a specific fact lifted verbatim from a session. Deepening claim recall (5 → 15) walks up that curve: 0.542 at ~1,403 tokens — 84% of episodic accuracy at 26% of its tokens — and it recovers multi-hop (3→5, = episodic), beats episodic on temporal (9 vs 8), and matches open-domain. After that the entire remaining gap is single-hop (6/12 vs 12/12): "what exactly did X say" questions whose verbatim detail the claim abstraction drops. So the honest donto memory value on session-rich LoCoMo is not beating episodic accuracy — it's near-episodic answers at a fraction of the tokens, with the gap isolated to a single category (single-hop verbatim recall).
What single-hop actually is — and why the path forward is routing, not facets. A diagnostic settled it: in every single-hop miss the gold answer is already in the claim layer (e.g. "What book did Caroline recommend?" → the claim Caroline · lovedBook · Becoming Nicole exists) — it's a recall/ranking gap, not extraction. The answer-claim just isn't in the vector-recall top-15, because the entity ("Caroline") floods and the discriminating word ("book") is camelCased into the predicate. Adding a lexical (FTS) arm does surface it: single-hop climbs 6→8 / 12. But always-on lexical injection trades that gain against inference (open-domain) questions that don't want term-matched facts — every K nets flat (0.542): +2 single-hop, −2 open-domain. So single-hop is recoverable, and the lever that recovers it is real — but it must fire only when the question is specific/factoid. That is query-adaptive recall (route lexical-vs-semantic by question shape), which is the substrate's alignment frontier, not another returned facet. Best net config stands at claims+aggregates depth-15 — 86% of episodic accuracy at a quarter of the tokens — and the way past it is routing, the thing donto is actually built to do.
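To make "route lexical-vs-semantic by question shape" concrete, here is a deliberately crude sketch. The cue lists, the arm names, and the function `choose_arms` are all hypothetical; the text above does not specify how routing is or will be implemented, and a real router would likely be learned or substrate-driven, not a pair of regular expressions.

```python
import re

# Hypothetical heuristic: the cue lists and arm names are illustrative only.
FACTOID_CUES = re.compile(r"^(what|which|who|where)\b", re.IGNORECASE)
INFERENCE_CUES = re.compile(
    r"\b(would|likely|might|could|probably|why)\b", re.IGNORECASE
)

def choose_arms(question: str) -> list[str]:
    """Pick retrieval arms from the surface shape of the question."""
    q = question.strip()
    if FACTOID_CUES.search(q) and not INFERENCE_CUES.search(q):
        return ["semantic", "lexical"]  # specific factoid: add the FTS arm
    return ["semantic"]                 # inference-style: semantic recall only
```

The point of the sketch is the gating itself: the lexical arm fires for "What book did Caroline recommend?" and stays off for inference-style questions, so its single-hop gain would not be paid for with open-domain losses.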
- ⬜ Temporal frame: valid_time + event-ordering relations (before/after/during) → temporal-ordering questions ("after Y"), the bitemporal differentiator extended.
- ✅ Evidence-bound dual-indexed unit (claim-as-index → return source sentence): retrieve by the finer claim embedding (one fact per vector, not a turn's blurred multi-fact average) but return the claim's source sentence as prose. Strong negative: 0.208 (24Q) vs the same-sentences-via-turn-index 0.625. The claim is a worse retrieval index than the turn, not a sharper one: claim-nearest-neighbours map (via evidence span → containing turn) to the wrong turns, and atomized/temporal claims crowd the top-K. Conclusion: for a weak reader, the claim is a worse solo index than the turn, and a worse return object. (But see "Multi-index recall" below: as a second index unioned with the turn index, the claim embedding recovers evidence the turn index blurs past; worse alone, additive in combination.)
- ✅ Multi-index recall (turn ∪ claim → turns): the redeeming corollary of the negative above. A turn's embedding is a blurred multi-fact average; a claim's is one fact; they miss different evidence turns. Reader-independent recall@K over 250 Q / all 10 convs: claim-index alone is worse (0.628 @20 / 0.638 @40 vs turn-index 0.732 / 0.819), but the UNION recovers what the turn index blurs past: 0.732 → 0.808 @20 (+7.6 pts), 0.819 → 0.874 @40 (+5.5 pts); 19/250 evidence turns recovered, concentrated in the hard categories (multi-hop 76→83, temporal 82→91). This is the largest recall lift of any idea here (vs the Interrogative Index's +2–3). Whether it converts to accuracy through the weak reader is being A/B'd now (a turns-claimunion recall mode: turn-arm 40 + a 15-turn claim supplement, merged in dialogue order). The claim layer's value as a second retrieval index (not a replacement) is the live hypothesis.
- ✅ Date-facet injection (passages + resolved-event-date tags): the facet-layer architecture made concrete. Keep passages as content+recall, append the top query-ranked resolved occurredAt dates as a compact tag block (the reader can't compute "when was yesterday" from prose). 559 resolved event dates in the LoCoMo claim layer. Smoke test was a bullseye: for "When did Caroline go to the LGBTQ support group?" the top facet is 2023-05-07 | caroline went to a lgbtq support group. A/B (24Q): temporal 5/6 (0.83), so the facet works, but overall 0.500 vs passages 0.625, because 10 date items distract on non-temporal questions (open-domain 1/6). Helps its category, taxes the rest; at 6-per-category the net is noisy. Lesson: a facet must be query-gated, not dumped. Inject the date tag only onto the passages that mention that event, not as a global block. (Refinement pending.)
- ⬜ On-demand re-anchoring (the still-open half of the dual unit): unanchored claims get an anchor recovered at query time (fetch source revision → locate + cite on demand) rather than at index time. "The anchor is never absent, only precomputed-or-recoverable." Untested; orthogonal to the index-sharpness question.
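The multi-index union and the recall@K measurement can be sketched in a few lines. Assumptions are stated loudly: turns are identified by their dialogue position, claim hits are taken to be already mapped to the turn containing their evidence, the limits of 40 and 15 mirror the recall mode described in the list above, and recall is defined here as "at least one gold evidence turn retrieved per question", which may differ from the definition used in the actual measurement.

```python
def union_in_dialogue_order(turn_hits: list[int], claim_hits: list[int],
                            turn_limit: int = 40,
                            claim_limit: int = 15) -> list[int]:
    """Union the top turn-index hits with a claim-index supplement.

    Both inputs are ranked lists of turn positions. The output is
    deduplicated and sorted into dialogue order.
    """
    merged = set(turn_hits[:turn_limit]) | set(claim_hits[:claim_limit])
    return sorted(merged)

def recall_at_k(retrieved: list[list[int]], gold: list[set[int]]) -> float:
    """Fraction of questions with at least one gold evidence turn retrieved
    (an assumed definition, for illustration)."""
    hits = sum(1 for got, want in zip(retrieved, gold) if want & set(got))
    return hits / len(gold)
```

Because the union only ever adds turns, its recall can never fall below the turn index alone; the open question noted above is whether the extra turns help or distract the reader.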
⭐ The decisive result: the claim layer is a second RECALL index (at the real unit — the session)
Earlier tests fought the wrong battle — they used the claim as a return object or a turn-granular index, both below the real baseline. Re-anchoring on how donto-memory actually recalls (whole session chunks; real /recall = 0.837, the number to beat — every turn/claim variant scored under it) flips the picture. Measured reader-independently at session granularity, fusing a claim-recall arm into the episodic session arm (claim → its ctx:claims/... context → the parent session chunk; donto-native, no side-table) nearly closes the recall gap on exactly the baseline's two failure buckets:
| recall@K | episodic | claim-only | union | multi-hop | open-domain |
|---|---|---|---|---|---|
| @5 | 0.684 | 0.737 | 0.895 | 0.80 → 0.95 | 0.56 → 0.83 |
| @10 | 0.789 | 0.921 | 0.974 | 0.95 → 1.00 | 0.61 → 0.94 |
The claim arm alone beats episodic at both K, and the union takes multi-hop session recall to 20/20. The 0.837 baseline's gap is multi-hop (17 fails) + open-domain (10 fails) — precisely where the claim arm recovers the missed sessions (8 then 7 sessions episodic-vector blurred past). This is the legitimate, generalizable form of "claim as a second index": not a return format, but a fused recall arm whose hits promote their source passage into the bundle. It belongs in donto-memory's real /recall (the semantic-claim arm, currently empty for this corpus), where it helps every consumer — genealogy "which source attests X", memory multi-session — not just this benchmark.
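A minimal sketch of the fusion described above: claim hits are mapped through their context to the parent session, and any session the episodic arm missed is promoted into the result. The function name and the `context_to_session` lookup are assumptions; in donto the mapping is said to go through the claim's ctx:claims/... context, the details of which are not shown here.

```python
def fuse_sessions(episodic: list[str], claim_contexts: list[str],
                  context_to_session: dict[str, str], k: int) -> list[str]:
    """Union the episodic top-k sessions with sessions promoted by claim hits.

    episodic: ranked session ids from the episodic arm.
    claim_contexts: ranked context IRIs of the recalled claims.
    context_to_session: assumed lookup from a claim context to its session.
    """
    fused = list(episodic[:k])
    for ctx in claim_contexts[:k]:
        session = context_to_session.get(ctx)
        if session is not None and session not in fused:
            fused.append(session)  # promote the claim's source session
    return fused
```

The claim never enters the bundle itself; it only decides which additional sessions do, which is the sense in which it acts as a recall arm and not as a return format.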
…but the recall gain does NOT convert to accuracy (the honest reader-conversion result)
A matched end-to-end A/B settled the open question, and it's a negative. Same GLM reader+judge, same 40 questions, recall-limit 5; baseline = episodic /recall, fusion = episodic ∪ claim-promoted sessions (verified to add the sessions: avg recalled 5 → 7.8):
| matched (40Q) | baseline | fusion | Δ |
|---|---|---|---|
| overall | 0.550 | 0.575 | +0.025 (1 question) |
| multi-hop | 5/11 | 5/11 | 0.00 |
| temporal | 8/12 | 8/12 | 0.00 |
| open-domain | 4/12 | 5/12 | +1 |
| single-hop | 5/5 | 5/5 | 0.00 |
The +21-pt session-recall gain produced ~zero accuracy gain, and multi-hop — where recall improved most (0.80→0.95 reader-independent) — was completely flat. Adding the evidence session the reader was missing didn't help it answer, because for multi-hop the bottleneck is cross-session reasoning, not retrieval: the weak reader can't combine the facts even when all the sessions are present. This is the through-line of every experiment here — recall improvements don't convert through a weak reader; the only lever that ever converted was the valid_time facet (temporal 0.50→0.725), because it handed the reader a computed answer-shaped fact it couldn't derive itself.
Implication (the real lever). donto's edge isn't better retrieval — every system has hybrid recall. It's pre-computing verifiable, answer-shaped claims so a cheap/weak reader can answer without reasoning: the resolved date (works), and — the untested next step — the synthesized/aggregate claim that pre-joins a multi-hop chain (X · did · {a,b,c} evidence-linked to the atoms), moving the cross-session reasoning off the reader and into the substrate. That, not recall, is what should move multi-hop. (Caveat: GLM reader, not the holo3.1 0.837 baseline — Hyades was down; the delta is what's reported, and the delta is ~0.)
What the rest of the negatives add up to
Across the return-shape tests, the passage/session is the best unit for the returned text (passages 0.650 vs claim-triples 0.362; full sessions 0.837). Claims are a worse returned object for a weak reader — but, per above, an excellent recall signal. The other decisive win came from the claim layer's facets: resolved valid_time lifted temporal 0.50 → 0.725 (+45%) by handing the reader a fact it could not compute. So donto's claim layer earns its keep two ways — as a fused recall arm (find the right session) and as a facet/tag layer (the resolved date, a held contradiction, a corroboration count, injected onto the returned passages) — not as the returned object itself. Return shape the evidence supports: sessions/passages for text, claims for recall and for the tags passages can't carry themselves.
The unifying move: push work into the substrate at index time (generate the questions, the prose, the aggregates, the temporal order, the anchors) so query time is a tight match returning reader-ready prose — which is what a weak reader, slow-query-tolerant, token-relaxed regime wants. None dump source documents; all are distilled structure; all generalise beyond conversations.
6. The design principle, stated once
donto's substrate is maximal (hold every claim, every contradiction, both time axes, every anchor, forever). donto's return is minimal (the fewest, highest-signal evidence items, in the representation the reader reasons over best, with only the temporal tag that a fact can't carry itself). The gap between the two is the engineering — and every field that survived into the return survived because a measurement said it earned its tokens.
That is the full shape of what donto hands an agent: a small, flat, time-tagged, evidence-anchored bundle — backed by a bitemporal paraconsistent graph the agent never has to see.
See also: LoCoMo Config C — claims-only recall · all benchmarks · Are we using Hyades effectively?.