# The shape of donto's return — what a memory query actually hands the agent

_Living document. 2026-06-12. The full schema of the payload donto returns to a reader/agent in the LoCoMo agentic-memory benchmark, field by field, and the empirical reason each field exists. Companion to the [LoCoMo Config C study](/reports/locomo-claims-only-config-c)._

When an agent asks donto a question, donto returns a **memory bundle** — a ranked, JSON-serialisable list of evidence the agent reads to answer. This report documents the *exact* shape of that bundle, where each field comes from in the substrate, and — the part that matters — **why the shape is what it is**, because every field was earned by a measured result, not designed up front.

The short version: donto holds a **bitemporal, evidence-anchored, paraconsistent claim graph**, but what it *returns* is deliberately narrower and reader-shaped. The substrate is rich; the return is the minimal, highest-signal slice of it that lets a (possibly weak) reader answer in the fewest tokens. Those are two different schemas, and the report covers both.

---

## 1. Two schemas: what donto *holds* vs what donto *returns*

**What donto holds (the substrate row — `donto_statement`):**

| column | type | meaning |
|---|---|---|
| `statement_id` | uuid | identity of the claim |
| `subject` | text | the thing the claim is about (an `ex:`/`ctx:` IRI or entity) |
| `predicate` | text | the freely-minted relation (`occupation`, `attendedEvent`, `occurredAt`, …) |
| `object_iri` | text | object when it's another entity |
| `object_lit` | jsonb | object when it's a literal: `{"v": "...", "dt": "..."}` |
| `context` | text | provenance bucket (e.g. `ctx:claims/locomo/<chunk>`) |
| `tx_time` | tstzrange | **transaction time** — when donto *recorded* this (system time) |
| `valid_time` | daterange | **valid time** — when the fact is/was *true in the world* |
| `flags` | int | maturity / status bits |
| `content_hash` | bytea | content address (auto-dedup) |

This is the **bitemporal, paraconsistent** core: a fact has both a *recorded-at* (`tx_time`) and a *true-in-the-world* (`valid_time`); contradictory claims coexist as legal state (no argmax, no overwrite); each claim is anchored to a source via `donto_evidence_link → donto_span` (offsets into the original document). Around it sit derived layers: per-claim vectors (`donto_claim_embedding`, bge-small 384-dim, HNSW), predicate embeddings + alignment closure (query-time predicate folding), entity fingerprints (identity resolution).

**What donto returns (the memory bundle):** a JSON array of *memory items*, each a small, flat, reader-legible record. This is what lands in the agent's context:

```json
[
  {
    "subject":   "caroline",
    "predicate": "attendedEvent",
    "object":    "lgbtq support group",
    "text":      "[1:56 pm on 8 May, 2023]",
    "source":    "search"
  },
  {
    "subject":   "Caroline",
    "predicate": "said",
    "object":    "I went to a LGBTQ support group yesterday and it was so powerful.",
    "text":      "[1:56 pm on 8 May, 2023]",
    "source":    "search"
  }
]
```

Every item carries at most: **`subject` · `predicate` · `object`** (the fact), a **`text`** field (a bracketed temporal tag and/or a verbatim source quote), and a **`source`** tag (which retrieval arm produced it). That's the whole return schema. The rest of this report is *why each field is there and not more*.

---

## 2. The return schema, field by field — and why

### `subject` / `predicate` / `object` — the fact triple
The atomic unit. Three forms ship through the same three fields:
- **A structured claim** — `caroline · occupation · artist`. Compact, joinable, abundance-native (donto mints predicates freely, so the same fact appears from many directions).
- **A resolved temporal claim** — `melanie-sunrise · occurredAt · 2022`. Same three fields; the object is a resolved date (see `valid_time` below).
- **A natural passage** — `Caroline · said · "I went to a LGBTQ support group yesterday…"`. The subject is the speaker, the predicate is `said`, the object is the verbatim sentence.

**Why the same three fields carry all three:** the reader gets one uniform shape regardless of whether the evidence is a structured fact or a raw sentence — no schema-switching, no special-casing. The *content* of the object is what varies (an IRI label, a date, or a sentence), not the envelope.

### `text` — the temporal tag (and optional source quote)
A bracketed string like `[1:56 pm on 8 May, 2023]`, optionally followed by a verbatim `"quote"`.

**Why it exists — measured, not assumed.** Temporal questions ask for the *event* date, but a claim by itself carries no usable date (`valid_time` defaults to the unbounded `(,)`). Without a date in the return, the reader is *blind* on temporal questions — in an early run it answered every "when…" with the session date and scored ~0.18 on the temporal category. Tagging each item with its resolved/anchored date lifted temporal **0.50 → 0.725 (+45%)** with no other change. The date is the single most load-bearing non-fact field in the return.

The optional quote field exists for the same reason at finer grain: where a structured claim is terse, the anchored source span gives the reader the *words actually said* — grounding it reasons over better than an abstract triple.

### `source` — which retrieval arm produced this item
`"recall"` (holder-scoped episodic recall) vs `"search"` (claim/passage recall). Diagnostic, not load-bearing for the reader, but it lets us *attribute* accuracy to a retrieval path and tune the mix.

### What is deliberately **NOT** in the return
Just as important as the fields present:
- **No `tx_time`.** The agent doesn't care when donto recorded a fact, only when it was true. `tx_time` stays in the substrate for audit/bitemporal queries; it never reaches the reader.
- **No raw `valid_time` range, IRIs, confidences, maturity, content hashes, or context IRIs.** These are substrate plumbing. Exposing them is pure token cost and reader noise.
- **No mixing of representations within a bundle when it hurts.** We *measured* that adding structured triples on top of natural passages **lowered** accuracy (0.625 → 0.583) — the triples were noise to a weak reader already given prose. So the return is representation-homogeneous by default: prose for prose-friendly readers, claims when claims win.

---

## 3. Why this shape — the empirical journey

The return schema is the residue of a sequence of measured experiments (free `holo3.1` reader, reader held fixed, LoCoMo 160-question stratified, 0 gateway-blanked answers). Each shape change is justified by a number.

| return shape | what the reader got | accuracy | median tokens |
|---|---|---|---|
| **episodic** | whole dialogue sessions | 0.837 | ~19,000 |
| **claims-only (triples)** | `subject·predicate·object` + date | 0.362 | ~1,700 |
| claims + `valid_time` | + resolved `occurredAt` dates | 0.412 (temporal 0.50→0.725) | ~1,700 |
| **passages (turns)** | natural sentences | 0.650 | ~2,800 |
| passages + claims (fusion) | both | 0.583 ↓ | ~3,700 |

What the journey taught — and why the final shape is what it is:

1. **Structured ≠ readable.** Terse triples are compact (~1.7K tokens) but a weak reader loses most of the answer (0.362). The fact existing in the substrate is not the same as the reader being able to *use* it.
2. **`valid_time` is the bitemporal differentiator.** Populating *when an event was true* and surfacing it in `text` lifted temporal +45% — the one place a contradiction-preserving bitemporal store structurally beats episodic chunk-stuffing, which can only re-derive the date each time and usually grabs the speech date.
3. **Prose beats triples for a weak reader, per token.** Returning natural passages instead of triples took accuracy 0.362 → 0.650 at ~1/7th of episodic's tokens — ~78% of episodic accuracy at ~15% of its cost. The return is shaped to give the reader *what it reasons over best*, minimally.
4. **More structure is not free.** Stacking triples onto passages *reduced* accuracy (noise). The return omits anything that doesn't earn its tokens — a measured discipline, not a stylistic one.
5. **Delivery, not just ranking, is the gate.** Before blaming the reader, we measured whether the answer-bearing evidence is even *in* the return: at top-20 it was missing 27% of the time. Widening and adding **locality** (the neighbours of each retrieved passage, by source offset) raised that to ~89%. The return's *selection* — not just its field schema — is part of "the shape."

---

## 4. Why the shape generalises (it is not a conversation trick)

The return schema is corpus-agnostic by construction, even though LoCoMo is conversational:

| LoCoMo instance | general substrate mechanism |
|---|---|
| `subject·predicate·object` claims | domain-neutral claim model — the same for genealogy, clinical, legal corpora |
| `occurredAt` / `valid_time` date tag | bitemporal valid-time — any corpus with temporal facts |
| `Speaker · said · sentence` passage | **fine-grained passage recall** — sentences/passages of any document (segmentation by sentence boundary, not by speaker) |
| neighbour turns (±1) | **offset-locality** — adjacent passages by source offset; `donto_span` already stores offsets, so this is general document structure |
| session-date anchor | **document / event timestamp** as the relative-time anchor |

The return is always: *the minimal set of evidence items — structured facts and/or anchored source passages — ranked and locality-expanded, each tagged with the time it was true, in one uniform reader-legible shape.* Nothing in that sentence mentions conversations.

---

## 5. The facet space — everything donto *could* return, and what we've tried

The current return uses **one** facet (the flat triple, optionally + date + anchor). But the substrate already computes ~20 facets about any thing (visible on the admin job page: evidence anchors, contradictions held, bitemporal belief, corroboration, argument graph, two-hop paths, entity reach, aligned claims, cross-source connections, standing…). The richer return is a **dossier per relevant thing**, not a flat fact list. This is the living checklist of return-shapes — tested vs not — and which benchmark problem each targets.

_✅ tried (with result) · 🔄 in progress · ⬜ not yet tried. Numbers are LoCoMo, free holo3.1 reader, reader fixed; 24-Q samples unless "full-160"._

### Representation of the evidence (how a fact is shown)
- ✅ **Claim triple** `s·p·o` — **0.362–0.417**. Compact, joinable, always-present; but a weak reader loses most of the answer.
- ✅ **Claim + `valid_time`** (resolved `occurredAt` date) — **temporal 0.50 → 0.725 (+45%)**. The bitemporal win. ✓
- ✅ **Claim + evidence-span anchor** (the phrase the claim came from) — **0.524** (> claims-alone 0.417). Anchor helps; phrase-spans are fragmentary. ✓ (validates "triple always, anchor when available")
- ✅ **Natural passages** (source sentences/turns) — **0.650 full-160**. ✓ **current winner** — prose >> triples for a weak reader, at ~15% of episodic's tokens.
- ⬜ **Claim + FULL source-sentence anchor** (not the phrase-span — map claim→its sentence) — should beat claim+span; the better anchor.
- ⬜ **Realtime re-anchoring** — when a claim is unanchored (no `evidence_link`), fetch the revision (blob, GCS) and run the citer *on demand* to locate a span at query time. → "the anchor is never truly absent, only precomputed-or-recoverable."

### Retrieval composition (which facts get selected)
- ✅ **Single-scope cosine** (claims or passages) — the baselines above.
- ✅ **Adjacency / locality** (±1 neighbour passages by offset) — **0.662 full-160** but +2.3× tokens, *hurt* open-domain. Marginal.
- ✅ **Entity-claim aggregation** (all of an entity's claims) — multi-hop 0.50→**0.67**, but temporal→**0.0**. No single scope serves all.
- ✅ **Multi-scope + cosine rerank** — **0.583** (worse): cosine rerank sorts dissimilar-but-relevant aggregation out.
- ✅ **Two-hop graph recall** (`graph-turns`: question entity → 1-hop neighbours → passages about that neighbourhood) — **0.542, negative** (multi-hop 2/6 < passages' 3/6). Graph-neighbour passages were noise the weak reader couldn't exploit.
- ⬜ **Cross-encoder rerank** (relevance, not cosine similarity) — Zep's 86.5%→94.7% mechanism. **Env-blocked** (fastembed reranker registry broken; torch too heavy). The missing piece for proper multi-scope.
- ⬜ **Query decomposition** (sub-queries → multi-shot retrieve → union) — for **open-domain** (implied-across-passages answers).

### Graph / structural facets (donto's reasoning layer)
- ⬜ **Two-Hop Paths** (full `seed→pred→mid→pred→far` paths) → **multi-hop**.
- ⬜ **Incoming / Backlinks** (claims whose *object* is the entity) → reach the entity from the other side.
- ⬜ **Argument Graph** (supports / rebuts / undercuts edges) → **reasoning**, which claim wins.
- ⬜ **Co-occurring entities** → who/what appears with the entity.

### Temporal / belief facets
- ✅ **`valid_time` tag** (above).
- ⬜ **Bitemporal Belief** (`valid_time` × `tx_time` per claim) → distinguish *when true* from *when learned*.
- ⬜ **Contradictions Held** (same `s·p`, different `o`, held paraconsistently) → **knowledge-update** ("what is X's job *now*" when it changed) — episodic-stuffing can't do this; donto holds both.
- ⬜ **Version History** (retract-not-delete revisions) → temporal evolution.

### Confidence / provenance facets
- ⬜ **Standing kernel** per claim (maturity, corroboration count, contradiction-pressure, recency) → tell the reader *how much to trust* each fact.
- ⬜ **Corroboration** (same triple independently asserted elsewhere) → reliability via agreement.
- ⬜ **Aligned Claims / predicate folding** (`occupation`↔`job` via the closure) → **synonymy** at query time.
- ⬜ **Cross-Source Connections** (other docs mentioning the entity) → triangulation (Lens Engine).

### Return-schema shape itself
- ✅ **Flat list** of items (current).
- ⬜ **Per-item dossier**: `{claim, anchor, valid_time, standing, contradicts:[…], supportedBy:[…], corroboration:N, neighbours:[…]}` — a fact *with its context*, the opposite of source-stuffing (distilled structure, not verbatim text).
- ⬜ **Entity-profile return**: one consolidated dossier for the question's entity (all facets about it), rather than N independent rows.
- ⬜ **Resource URIs as tool affordances** (agentic only — [why](/reports/single-shot-vs-agentic-memory)): minimal bundle + handles the agent dereferences on demand.

**Reading of the checklist so far:** representation (prose vs triples) and the `valid_time` facet gave the real wins; **every retrieval-composition trick (adjacency, entity-aggregation, multi-scope, two-hop graph) was neutral-to-negative on the free reader** — extra context is noise it can't exploit. The robust lesson: on a weak reader, *changing what you retrieve* plateaus at simple passage recall (0.650); the gains come from *changing the index/representation* (so the same passage is found better, or the fact is shown better) and from *moving reasoning into the substrate*. That motivates a set of **new donto concepts** — each dissolves a measured tradeoff instead of adding retrieval noise:

### Candidate new donto concepts (invented from the failure pattern)
- ✅ **Interrogative Index** — index each fact by the *questions it answers* (GLM-generated at index time), retrieval is question→question. **Concept sound; effect modest (and it does not scale with build).** Killer example: for the turn *"I went to a LGBTQ support group yesterday"* GLM generated the verbatim benchmark question *"When did Caroline go to the LGBTQ support group?"*. Now measured at **FULL build** (conv-26, 417 turns / 1,508 generated questions vs the earlier 44%-build): IQ-alone **0.420 @20 / 0.553 @40** — far *below* turn-text recall (0.793 / 0.853), because GLM can't anticipate every benchmark phrasing. As a **UNION supplement** it still recovers evidence turns text missed: 0.793→**0.827 @20**, 0.853→**0.873 @40** (+2.0–3.4 pts; 3–5 recovered). **Key finding: the full build barely beat the 44% build** (0.873 vs 0.873 @40; 0.827 vs 0.820 @20) → the supplement saturates early and is *not* a step-change. Build-expensive (one GLM call per turn) for a +2–3pt recall bump that, per all prior runs, does not convert to accuracy through the weak reader. **Verdict: keep as a cheap union arm, not a primary index.**
- ⬜ **Bi-representational claim** — store `s·p·o` **and** a canonical one-sentence prose realisation together; return the sentence (reader-ready), join on the triple. Dissolves the prose-vs-triples tradeoff (no fusion noise — it's one unit).
- ✅ **Synthesized / aggregate claims** — the substrate pre-composes scattered atoms into answer-shaped higher-order claims, so multi-hop becomes single-hop retrieval and the *combination* lives in the substrate, not the reader. **Built and benchmarked.** Per conversation, per speaker, glm-5.1 composed ~7 self-contained summary sentences from that entity's atoms (e.g. *"Caroline has been creating art since age 17, working primarily in painting and stained glass…"*), stored as synthesized claims + embedded — **140 aggregate claims** over the LoCoMo corpus, injected as `summary` facets. **Result (matched 41Q, glm-5.1 reader, free on Hyades): net flat (0.732 → 0.732), but multi-hop moved the predicted way — 7/12 → 8/12 (+1) — at the cost of open-domain 7/12 → 6/12 (−1).** So the lever points the right direction on its target category, but the effect is **marginal on a strong reader** (glm-5.1 already does multi-hop from the sessions; baseline multi-hop was 0.58, not 0.0) and the always-on injection adds noise to inference questions. Verdict: real but small here; tried next with **relevance-gating** (distance threshold) + a **date facet** stacked in, re-run on **holo3.1** (the other reader): **baseline 0.646 → full stack 0.625 (−0.021)** — multi-hop flat (churn +1/−1), temporal flat, open-domain −1. So across **both readers the answer-shaping facet stack does NOT beat session-rich episodic.**

> **The decisive lesson (why facets don't add to sessions).** The `valid_time` +45% was a **claims-only** win — the reader had *only* claims, so the resolved date was the only signal. When the reader already has the full **sessions** (which contain the relative-time phrases and the scattered facts), the date/aggregate facets are **redundant**, and gated-or-not they add as much noise as signal. Claim-derived facets help in **impoverished/token-constrained** contexts and with **weak readers**, not as additions to a session-rich strong-reader setup. **Implication for the benchmark:** donto's memory edge on session-rich LoCoMo is **not raw accuracy** (full-session episodic + a good reader is hard to beat) — it's **token-efficiency** (claims-only answers at ~1/11th the tokens), **bitemporal knowledge-update**, **faithful abstention**, and **scaling past the context window**. The next runs pivot there (the claims-only token-efficiency curve and the knowledge-update category) rather than bolting more facets onto episodic.

#### ⭐ Where the aggregate actually wins: the token-constrained regime
Re-running the aggregate **as the context** (claims + aggregates + resolved dates, **no raw sessions**) instead of an addition to sessions flips it from redundant to load-bearing. holo3.1, matched 48Q:

| config | accuracy | context tokens |
|---|---|---|
| episodic (full sessions) | 0.646 | ~5,305 |
| claims + aggregates + dates (depth 5) | 0.479 | ~1,013 (5.2× fewer) |
| **claims + aggregates + dates (depth 15)** | **0.542** | **~1,403 (3.8× fewer)** |
| raw claims-only (no aggregates) | ~0.244 | ~1,700 |

**Firmed up at 96 questions** (holo3.1): episodic **0.604** (~5,249 tok) vs claims+aggregates depth-15 **0.521** (~1,376 tok) — **86% of episodic accuracy at 3.8× fewer tokens**, and per-category it **matches episodic on multi-hop (9/24=9/24), beats temporal (20 vs 18), ≈open-domain (8 vs 9)**; the *entire* gap is **single-hop (13 vs 22)** — verbatim "what exactly did X say" recall that the claim abstraction drops and the (sparse) evidence spans can't restore.

**The aggregates roughly DOUBLE raw claims-only accuracy (0.244 → 0.479)** — the answer-shaped composition is the right lever *precisely when the reader has no sessions to fall back on*. And the efficiency is the story: **74% of episodic's accuracy at 19% of the tokens.** Per category it's not a uniform haircut — claims+aggregates **matches episodic on temporal (8/12, the date facets fully carry it) and open-domain (6/12)**, and loses only on **single-hop (12→6) and multi-hop (5→3)**, the categories that need a specific fact lifted verbatim from a session. Deepening claim recall (5 → 15) walks **up** that curve: **0.542 at ~1,403 tokens — 84% of episodic accuracy at 26% of its tokens** — and it **recovers multi-hop (3→5, = episodic), beats episodic on temporal (9 vs 8), and matches open-domain**. After that the entire remaining gap is **single-hop (6/12 vs 12/12)**: "what exactly did X say" questions whose verbatim detail the claim abstraction drops. So the honest donto memory value on session-rich LoCoMo is **not** beating episodic accuracy — it's **near-episodic answers at a fraction of the tokens**, with the gap isolated to a *single* category (single-hop verbatim recall).

**What single-hop actually is — and why the path forward is routing, not facets.** A diagnostic settled it: in *every* single-hop miss the gold answer **is already in the claim layer** (e.g. *"What book did Caroline recommend?"* → the claim `Caroline · lovedBook · Becoming Nicole` exists) — it's a **recall/ranking** gap, not extraction. The answer-claim just isn't in the vector-recall top-15, because the entity ("Caroline") floods and the discriminating word ("book") is camelCased into the predicate. Adding a **lexical (FTS) arm** does surface it: single-hop climbs **6→8 / 12**. But always-on lexical injection trades that gain against inference (open-domain) questions that don't want term-matched facts — **every K nets flat (0.542)**: +2 single-hop, −2 open-domain. So single-hop is *recoverable*, and the lever that recovers it is real — but it must fire **only when the question is specific/factoid**. That is **query-adaptive recall** (route lexical-vs-semantic by question shape), which is the substrate's alignment frontier, not another returned facet. **Best net config stands at claims+aggregates depth-15 — 86% of episodic accuracy at a quarter of the tokens** — and the way past it is routing, the thing donto is actually built to do.
- ⬜ **Temporal frame** — `valid_time` + event-ordering relations (before/after/during) → temporal-ordering questions ("after Y"), the bitemporal differentiator extended.
- ✅ **Evidence-bound dual-indexed unit (claim-as-index → return source sentence)** — retrieve by the finer **claim** embedding (one fact per vector, not a turn's blurred multi-fact average) but return the claim's **source sentence** as prose. **Strong negative: 0.208 (24Q) vs the same-sentences-via-turn-index 0.625.** The claim is a *worse* retrieval index than the turn, not a sharper one — claim-nearest-neighbours map (via evidence span → containing turn) to the *wrong* turns, and atomized/temporal claims crowd the top-K. **Conclusion: the claim is a worse *solo* index than the turn, and a worse return object — for a weak reader.** (But see "Multi-index recall" below: as a *second* index unioned with the turn index, the claim embedding recovers evidence the turn index blurs past — worse alone, additive in combination.)
- ✅ **Multi-index recall (turn ∪ claim → turns)** — the redeeming corollary of the negative above. A turn's embedding is a blurred multi-fact average; a claim's is one fact; **they miss *different* evidence turns.** Reader-independent recall@K over **250 Q / all 10 convs**: claim-index alone is worse (0.628 @20 / 0.638 @40 vs turn-index 0.732 / 0.819), **but the UNION recovers what the turn index blurs past — 0.732 → 0.808 @20 (+7.6 pts), 0.819 → 0.874 @40 (+5.5 pts); 19/250 evidence turns recovered, concentrated in the hard categories** (multi-hop 76→83, temporal 82→91). This is the largest recall lift of any idea here (vs the Interrogative Index's +2–3). **Whether it converts to accuracy through the weak reader is being A/B'd now** (a `turns-claimunion` recall mode: turn-arm 40 + a 15-turn claim supplement, merged in dialogue order). The claim layer's value as a *second retrieval index* (not a replacement) is the live hypothesis.
- ✅ **Date-facet injection (passages + resolved-event-date tags)** — the facet-layer architecture made concrete: keep passages as content+recall, append the top query-ranked **resolved** `occurredAt` dates as a compact tag block (the reader can't compute "when was *yesterday*" from prose). 559 resolved event dates in the LoCoMo claim layer. Smoke test was a bullseye — for *"When did Caroline go to the LGBTQ support group?"* the top facet is `2023-05-07 | caroline went to a lgbtq support group`. **A/B (24Q): temporal 5/6 (0.83) — the facet works — but overall 0.500 vs passages 0.625**, because 10 date items distract on non-temporal questions (open-domain 1/6). Helps its category, taxes the rest; at 6-per-category the net is noisy. **Lesson: a facet must be query-gated, not dumped** — inject the date tag only onto the passages that mention that event, not as a global block. (Refinement pending.)
- ⬜ **On-demand re-anchoring** (the still-open half of the dual unit) — unanchored claims get an anchor *recovered at query time* (fetch source revision → locate + cite on demand) rather than at index time. "The anchor is never absent, only precomputed-or-recoverable." Untested; orthogonal to the index-sharpness question.

### ⭐ The decisive result: the claim layer is a second RECALL index (at the real unit — the session)
Earlier tests fought the wrong battle — they used the claim as a *return* object or a turn-granular index, both below the real baseline. Re-anchoring on how donto-memory **actually** recalls (whole **session** chunks; real `/recall` = **0.837**, the number to beat — every turn/claim variant scored *under* it) flips the picture. Measured **reader-independently** at session granularity, fusing a claim-recall arm into the episodic session arm (claim → its `ctx:claims/...` context → the parent session chunk; donto-native, no side-table) **nearly closes the recall gap on exactly the baseline's two failure buckets:**

| recall@K | episodic | claim-only | **union** | multi-hop | open-domain |
|---|---|---|---|---|---|
| @5 | 0.684 | 0.737 | **0.895** | 0.80 → **0.95** | 0.56 → **0.83** |
| @10 | 0.789 | 0.921 | **0.974** | 0.95 → **1.00** | 0.61 → **0.94** |

The claim arm **alone beats episodic at both K**, and the union takes multi-hop session recall to **20/20**. The 0.837 baseline's gap is multi-hop (17 fails) + open-domain (10 fails) — precisely where the claim arm recovers the missed sessions (8 then 7 sessions episodic-vector blurred past). **This is the legitimate, generalizable form of "claim as a second index": not a return format, but a fused recall arm whose hits promote their source passage into the bundle.** It belongs in donto-memory's real `/recall` (the semantic-claim arm, currently empty for this corpus), where it helps every consumer — genealogy "which source attests X", memory multi-session — not just this benchmark.

#### …but the recall gain does NOT convert to accuracy (the honest reader-conversion result)
A matched end-to-end A/B settled the open question — and it's a **negative**. Same GLM reader+judge, same 44 questions, recall-limit 5; baseline = episodic `/recall`, fusion = episodic ∪ claim-promoted sessions (verified to add the sessions: avg recalled 5 → 7.8):

| matched (40Q) | baseline | fusion | Δ |
|---|---|---|---|
| **overall** | 0.550 | 0.575 | **+0.025** (1 question) |
| **multi-hop** | 5/11 | **5/11** | **0.00** |
| temporal | 8/12 | 8/12 | 0.00 |
| open-domain | 4/12 | 5/12 | +1 |
| single-hop | 5/5 | 5/5 | 0.00 |

**The +21-pt session-recall gain produced ~zero accuracy gain, and multi-hop — where recall improved most (0.80→0.95 reader-independent) — was completely flat.** Adding the evidence session the reader was missing didn't help it answer, because for multi-hop the bottleneck is **cross-session reasoning, not retrieval**: the weak reader can't *combine* the facts even when all the sessions are present. This is the through-line of every experiment here — **recall improvements don't convert through a weak reader**; the *only* lever that ever converted was the `valid_time` **facet** (temporal 0.50→0.725), because it handed the reader a *computed* answer-shaped fact it couldn't derive itself.

**Implication (the real lever).** donto's edge isn't better retrieval — every system has hybrid recall. It's **pre-computing verifiable, answer-shaped claims** so a cheap/weak reader can answer without reasoning: the resolved date (works), and — the untested next step — the **synthesized/aggregate claim** that pre-joins a multi-hop chain (`X · did · {a,b,c}` evidence-linked to the atoms), moving the cross-session reasoning *off* the reader and *into* the substrate. That, not recall, is what should move multi-hop. (Caveat: GLM reader, not the holo3.1 0.837 baseline — Hyades was down; the *delta* is what's reported, and the delta is ~0.)

### What the rest of the negatives add up to
Across the *return-shape* tests, **the passage/session is the best unit for the returned text** (passages 0.650 vs claim-triples 0.362; full sessions 0.837). Claims are a worse *returned object* for a weak reader — but, per above, an excellent *recall signal*. **The other decisive win came from the claim layer's *facets***: resolved `valid_time` lifted temporal **0.50 → 0.725 (+45%)** by handing the reader a fact it could not compute. So donto's claim layer earns its keep two ways — **as a fused recall arm** (find the right session) and **as a facet/tag layer** (the resolved date, a held contradiction, a corroboration count, injected onto the returned passages) — **not** as the returned object itself. Return shape the evidence supports: *sessions/passages for text, claims for recall and for the tags passages can't carry themselves.*

The unifying move: push work **into the substrate at index time** (generate the questions, the prose, the aggregates, the temporal order, the anchors) so query time is a *tight match returning reader-ready prose* — which is what a weak reader, slow-query-tolerant, token-relaxed regime wants. None dump source documents; all are distilled structure; all generalise beyond conversations.

---

## 6. The design principle, stated once

> donto's substrate is **maximal** (hold every claim, every contradiction, both time axes, every anchor, forever). donto's **return** is **minimal** (the fewest, highest-signal evidence items, in the representation the reader reasons over best, with only the temporal tag that a fact can't carry itself). The gap between the two is the engineering — and every field that survived into the return survived because a measurement said it earned its tokens.

That is the full shape of what donto hands an agent: a small, flat, time-tagged, evidence-anchored bundle — backed by a bitemporal paraconsistent graph the agent never has to see.

_See also: [LoCoMo Config C — claims-only recall](/reports/locomo-claims-only-config-c) · [all benchmarks](/benchmarks) · [Are we using Hyades effectively?](/reports/hyades-extraction-effectiveness)._
