# What the memory benchmarks taught us — donto's honest scorecard

_Living document. 2026-06-13. A synthesis of a multi-day, free-reader (holo3.1) investigation across **LoCoMo** and **LongMemEval**, asking one question: where does donto's claim/recall substrate actually add value for agent memory, and where doesn't it? The component findings live in [the return-shape report](/reports/donto-memory-return-shape), the [LoCoMo board](/benchmarks/locomo), and the [LongMemEval board](/benchmarks/longmemeval); this note ties them together. All runs use a free reader+judge (holo3.1 on Hyades) — deltas, not absolutes, are the signal._

The short version: **donto's recall does its job everywhere we probed.** Its demonstrated wins are **token-efficiency**, **out-of-the-box knowledge-update**, and **answer-*shaping* that lifts even a weak reader** on its own weak spots. The one place it couldn't convert was single-hop verbatim recall, where the fix is to *re-rank* a specific token rather than *shape* the answer — a routing/reader-capability frontier, not a retrieval one.

---

## The scorecard

| finding | benchmark | result | verdict |
|---|---|---|---|
| **Token-efficiency** | LoCoMo | claims + answer-shaped aggregates = **86% of episodic accuracy at 3.8× fewer tokens** (0.52 vs 0.60, 96Q) | ✅ **real donto win** |
| **Knowledge-update** | LongMemEval | **0.923** out of the box; recency re-ranking adds nothing | ✅ **differentiator strong, baked into the dated representation** |
| **Facets on session-rich episodic** | LoCoMo | redundant — both readers net flat-to-negative | ❌ no value when sessions are already present |
| **Single-hop verbatim** | LoCoMo | recall/alignment gap (answers are *in* the claim layer but out-ranked in vector recall); a lexical arm recovers some (6→8/12) but trades against open-domain | ⚠️ needs query-adaptive **routing**, not more facets |
| **Single-session-preference** | LongMemEval | **0.70**; every miss had the answer session recalled, so the gap is the reader failing to *apply* the preference. A **pre-distilled salience facet lifts it to 0.767** | ✅ **answer-shaping helps even a weak reader** |

---

## 1. The win: token-efficiency (LoCoMo)

On session-rich LoCoMo, full-session episodic recall is hard to beat on raw accuracy, and **bolting claim facets onto sessions is redundant**: the sessions already contain the facts, and both readers went flat-to-negative. But flip the framing. Use the **claim layer as the context itself** (atomic claims + LLM-composed **answer-shaped aggregate claims** + resolved dates, *no raw session transcripts*) and you get **86% of episodic accuracy at roughly a quarter of the tokens** (0.52 vs 0.60 at 3.8× fewer tokens), matching episodic on multi-hop and *beating* it on temporal. The aggregates more than double a raw claims-only baseline (0.24 → 0.52). This is the substrate's answer-shaping earning its keep: near-episodic answers, far fewer tokens.
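The ratios quoted above can be checked directly from the reported figures. A minimal sketch, assuming only the numbers stated in this section (0.52, 0.60, 0.24, and the 3.8× token reduction); the function and variable names are illustrative, not part of donto:

```python
def relative_accuracy(candidate: float, baseline: float) -> float:
    """Fraction of the baseline's accuracy that the candidate retains."""
    return candidate / baseline


episodic_acc = 0.60      # full-session episodic recall (reported)
claim_layer_acc = 0.52   # claims + aggregates + resolved dates (reported)
claims_only_acc = 0.24   # raw claims-only baseline (reported)
token_reduction = 3.8    # claim-layer context uses 3.8x fewer tokens (reported)

retained = relative_accuracy(claim_layer_acc, episodic_acc)
token_fraction = 1 / token_reduction
aggregate_gain = claim_layer_acc / claims_only_acc
# Accuracy retained per unit of token budget, relative to episodic recall.
efficiency = retained / token_fraction

print(f"accuracy retained vs episodic: {retained:.3f}")
print(f"fraction of episodic tokens:   {token_fraction:.3f}")
print(f"gain over claims-only:         {aggregate_gain:.2f}x")
print(f"relative accuracy per token:   {efficiency:.2f}x")
```

From the rounded accuracies the retained fraction comes out near 0.87; the 86% in the scorecard presumably derives from unrounded per-question counts, which are not reproduced here.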

## 2. The differentiator, already working: knowledge-update (LongMemEval)

Knowledge-update is the category donto was built for: a fact *changes* over sessions, the answer is the current value, and old values are distractors. donto handles it at **0.923 on the free reader, out of the box**, because the bitemporal ingest stamps each session with `valid_from`, and the reader is given the current date and instructed to lead with the latest value. We tested an explicit **recency** recall (re-sort recalled sessions latest-first); it added nothing (0.923 → 0.910, pure churn). **The bitemporal win is in the *representation*, not in recall re-ranking**, and with only 6 misses there was no headroom to justify extracting a claim layer here.
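One way to see why re-sorting adds nothing: once every item carries its own date, the current value is determined by the dates, not by the order in which items arrive. The sketch below illustrates that property under stated assumptions. Only `valid_from` is taken from the text; `DatedFact`, `current_value`, and the example data are hypothetical, and in the benchmark runs this selection is performed by the reader model, not by code like this.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class DatedFact:
    subject: str
    predicate: str
    value: str
    valid_from: date  # date stamp attached at ingest


def current_value(
    facts: Iterable[DatedFact], subject: str, predicate: str, as_of: date
) -> Optional[str]:
    """Return the value with the latest valid_from that is not after as_of."""
    matching = [
        f for f in facts
        if f.subject == subject and f.predicate == predicate and f.valid_from <= as_of
    ]
    if not matching:
        return None
    return max(matching, key=lambda f: f.valid_from).value


# Invented example data: the newer fact is listed first, the older one second.
facts = [
    DatedFact("user", "livesIn", "Lisbon", date(2023, 9, 1)),
    DatedFact("user", "livesIn", "Boston", date(2022, 1, 10)),
]

today = date(2024, 1, 1)
# The result is identical whichever order the facts are supplied in.
print(current_value(facts, "user", "livesIn", today))
print(current_value(list(reversed(facts)), "user", "livesIn", today))
```

Because the selection depends only on the stamps, a latest-first re-sort of the recalled items changes nothing about which value is current, which is consistent with the flat result reported above.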

## 3. The line that actually matters: *shaping* the answer helps the reader; *re-ranking* what's retrieved doesn't

Both weak spots had the answer *already recalled* — the limit was the reader, not retrieval. But the two responded oppositely, and the difference is the whole lesson:

- **LongMemEval single-session-preference**: answer-shaping **works**. The baseline scored 0.70, and a diagnostic showed **all 9 misses had the answer session recalled**: these are preference-*application* questions ("suggest a hotel honoring my taste") where the weak reader didn't apply the recalled preference. A **pre-distilled salience facet** (extract the user's relevant stated preference and surface it as a prominent top item) **lifted it to 0.767** (recovered 4, lost 2). Pre-computing the answer-relevant fact and putting it front-and-centre helps even a weak reader. (An earlier run that showed this *hurting* was caused by a bug: the facet was written to an ignored key. The figure here is from the corrected run.)
- **LoCoMo single-hop**: re-ranking **doesn't**. Every miss's answer *is* in the claim layer (e.g. `Caroline · lovedBook · Becoming Nicole`), just out-ranked in vector recall. But the levers that try to *re-rank/route* what's retrieved (an always-on lexical arm, 6→8/12 but trading against open-domain; a per-claim rank-gate, net worse; a per-question LLM router, which misroutes on the free reader and is net-negative) don't convert, because the discriminating signal is a *specific token* the reader has to pick out, not a *shape* the substrate can hand it pre-made. Closing it needs question-level adaptive recall with a reliable classifier: donto's alignment frontier.
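The salience-facet idea from the first bullet can be pictured as a context-assembly step. This is a minimal illustration, not donto's implementation: `build_context`, `keyword_distill`, and the `[salient preference]` label are hypothetical names, the reported runs used an LLM distiller in place of the keyword stub, and keeping the recalled sessions alongside the facet is an assumption of this sketch.

```python
from typing import Callable, Optional


def build_context(
    question: str,
    recalled_sessions: list[str],
    distill: Callable[[str, list[str]], Optional[str]],
) -> str:
    """Assemble reader context, placing a distilled preference first if one is found."""
    parts: list[str] = []
    facet = distill(question, recalled_sessions)
    if facet:
        # Surface the answer-relevant preference as the top item.
        parts.append(f"[salient preference] {facet}")
    # Assumption: the recalled evidence is kept; the facet is prepended, not substituted.
    parts.extend(recalled_sessions)
    return "\n\n".join(parts)


def keyword_distill(question: str, sessions: list[str]) -> Optional[str]:
    """Stand-in distiller: return the first sentence containing 'prefer', if any."""
    for session in sessions:
        for sentence in session.split(". "):
            if "prefer" in sentence.lower():
                return sentence.strip().rstrip(".")
    return None


# Invented example sessions for illustration only.
sessions = [
    "We talked about travel. I prefer boutique hotels with a rooftop pool. Thanks",
    "Later we discussed flight times.",
]
print(build_context("Suggest a hotel for my trip.", sessions, keyword_distill))
```

The point of the design is that the reader no longer has to locate the preference inside a long session before applying it; it only has to apply what is already at the top.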

The distinction is sharp: **when the substrate can pre-compute the answer-relevant content and surface it (aggregates for multi-hop, a resolved date for temporal, a distilled preference), it helps the reader — even a weak one. When the fix is to *re-order or route* atomic evidence the reader must still parse, a weak reader/classifier can't be bootstrapped from the substrate.**

**But answer-shaping is a scalpel, not a hammer — and that bounds it.** A *generic* "distill the relevant facts" facet applied to **multi-session** synthesis questions *hurt* (0.70 → 0.63): when the answer needs many facts combined, distilling the 17K-token context to a few lines **drops** what's needed (a weak distiller compresses out the wrong things) — the same failure mode as aggregation losing a single-hop verbatim fact. So shaping lifts the reader only when the answer is a **focused, reliably-distillable fact** (a preference, a date, a specific multi-hop chain); it backfires when it compresses a synthesis-heavy context. Knowing *which* questions to shape (and which to leave as full evidence) is, once more, **routing** — the reader/classifier-bound frontier.
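The routing decision described above can be framed as a guarded dispatch: shape only when a classifier is confident the question has a focused, distillable answer, and otherwise pass the full evidence through. The sketch below is a hypothetical illustration. The category labels, the `classify` callable, and the 0.8 threshold are all assumptions, and the findings in this section suggest that with a weak classifier such a router can be net-negative.

```python
from typing import Callable

# Question types where shaping helped in the runs described above.
# The label strings themselves are illustrative.
SHAPEABLE = {"preference", "temporal", "multi_hop"}


def choose_strategy(
    question: str,
    classify: Callable[[str], tuple[str, float]],
    min_confidence: float = 0.8,  # hypothetical threshold
) -> str:
    """Return "shape" or "full_evidence" for a question."""
    category, confidence = classify(question)
    if category in SHAPEABLE and confidence >= min_confidence:
        return "shape"
    # Default to full evidence: mis-shaping a synthesis question drops needed facts.
    return "full_evidence"


def stub_classifier(question: str) -> tuple[str, float]:
    """Toy stand-in for a real classifier, keyed on a single word."""
    if "suggest" in question.lower():
        return ("preference", 0.9)
    return ("multi_session", 0.6)


print(choose_strategy("Suggest a hotel honoring my taste.", stub_classifier))
print(choose_strategy("Summarise everything I said about my projects.", stub_classifier))
```

Defaulting to full evidence reflects the asymmetry reported here: leaving a shapeable question unshaped costs some accuracy, while shaping a synthesis-heavy question can remove facts the answer needs.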

---

## 4. What this says about donto's value

The pattern across both benchmarks is consistent and clarifying:

- **donto's recall is sound.** In every category we diagnosed, the right evidence was retrieved. The substrate is not the bottleneck.
- **donto's claim/answer-shaping layer pays off as *efficiency*** — near-episodic accuracy at a fraction of the tokens — which is exactly the axis that matters at scale and when context windows bind, and the axis session-rich in-context benchmarks under-reward.
- **donto's bitemporal representation pays off as *reliability*** on changing facts (knowledge-update), with the win sitting in the dated representation rather than any clever re-ranking.
- **donto's answer-*shaping* helps the reader; donto's re-*ranking* of atomic evidence doesn't.** Pre-distilled, surfaced content (aggregates, resolved dates, a distilled preference) lifts accuracy even for a weak reader; re-ordering/routing the same atomic claims (single-hop lexical/rank-gate/router) doesn't convert, because the reader still has to do the picking.

So the honest pitch is not "donto wins raw accuracy on session-rich benchmarks" (a strong reader + full context is hard to beat there). It is: **donto delivers near-episodic accuracy at a fraction of the tokens, handles changing facts reliably, surfaces the right evidence in every category we diagnosed, and, where it can pre-shape the answer-relevant fact, lifts the reader on its own weak spots (preference 0.70→0.767).**

The one place we couldn't convert was **single-hop verbatim recall**: an always-on lexical arm (6→8 but trades against open-domain), a per-claim rank-gate (net worse), and a per-question LLM router (misroutes on the free reader, net-negative) all failed — because there the fix is to *re-rank* a specific token the reader must still parse, not a *shape* the substrate can hand over pre-made. That one is genuinely a routing/reader-capability frontier.

The clean conclusion: **donto's value is token-efficiency, reliable bitemporal recall, and answer-*shaping* that helps even a weak reader. The residual gap (single-hop) is a re-ranking/routing problem that needs a reliable classifier — i.e. a stronger reader — not another retrieval facet.**

_See also: [the shape of donto's return](/reports/donto-memory-return-shape) · [LoCoMo board](/benchmarks/locomo) · [LongMemEval board](/benchmarks/longmemeval) · [capability vs exercise](/reports/capability-vs-exercise)._
