What the memory benchmarks taught us — donto's honest scorecard

Living document. 2026-06-13. A synthesis of a multi-day, free-reader (holo3.1) investigation across LoCoMo and LongMemEval, asking one question: where does donto's claim/recall substrate actually add value for agent memory, and where doesn't it? The component findings live in the return-shape report, the LoCoMo board, and the LongMemEval board; this note ties them together. All runs use a free reader+judge (holo3.1 on Hyades) — deltas, not absolutes, are the signal.

The short version: donto's recall does its job everywhere we probed. Its demonstrated wins are token-efficiency, out-of-the-box knowledge-update, and answer-shaping that lifts even a weak reader on its own weak spots. The one place it couldn't convert was single-hop verbatim recall, where the fix is to re-rank a specific token rather than shape the answer — a routing/reader-capability frontier, not a retrieval one.

The scorecard

finding	benchmark	result	verdict
Token-efficiency	LoCoMo	claims + answer-shaped aggregates = 86% of episodic accuracy at 3.8× fewer tokens (0.52 vs 0.60, 96Q)	✅ real donto win
Knowledge-update	LongMemEval	0.923 out of the box; recency re-ranking adds nothing	✅ differentiator strong, baked into the dated representation
Facets on session-rich episodic	LoCoMo	redundant — both readers net flat-to-negative	❌ no value when sessions are already present
Single-hop verbatim	LoCoMo	recall/alignment gap (answers are in the claim layer but out-ranked); a lexical arm recovers some (6/12 → 8/12) but trades against open-domain	⚠️ needs query-adaptive routing, not more facets
Single-session-preference	LongMemEval	0.70, every miss recalled — reader can't apply the preference; a pre-distilled salience facet lifts it to 0.767	✅ answer-shaping helps even a weak reader

1. The win: token-efficiency (LoCoMo)

On session-rich LoCoMo, full-session episodic recall is hard to beat on raw accuracy, and bolting claim facets onto sessions is redundant: the sessions already contain the facts, and both readers went flat-to-negative. But flip the framing and use the claim layer as the context itself (atomic claims + LLM-composed answer-shaped aggregate claims + resolved dates, no raw session transcripts), and you get 86% of episodic accuracy (0.52 vs 0.60) at roughly a quarter of the tokens (3.8× fewer), matching episodic on multi-hop and beating it on temporal. The aggregates roughly double a raw claims-only baseline (0.24 → 0.52). This is the substrate's answer-shaping earning its keep: near-episodic answers, far fewer tokens.
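The trade-off can be made concrete with a little arithmetic. This is a minimal sketch that uses only the figures reported above (0.52 vs 0.60 accuracy, 3.8× fewer tokens); the function names are illustrative and not part of donto.

```python
def accuracy_retained(claim_acc: float, episodic_acc: float) -> float:
    """Fraction of episodic accuracy kept by the claim-layer context."""
    return claim_acc / episodic_acc


def accuracy_per_token_gain(claim_acc: float, episodic_acc: float,
                            token_reduction: float) -> float:
    """Accuracy bought per token, relative to episodic (1.0 = parity)."""
    return accuracy_retained(claim_acc, episodic_acc) * token_reduction


# Figures from the scorecard: 0.52 vs 0.60 accuracy, 3.8x fewer tokens.
retained = accuracy_retained(0.52, 0.60)
token_share = 1 / 3.8
gain = accuracy_per_token_gain(0.52, 0.60, 3.8)

print(f"accuracy retained:  {retained:.3f}")
print(f"share of tokens:    {token_share:.3f}")
print(f"accuracy per token: {gain:.2f}x episodic")
```

The rounded scores give a retention ratio a little under 0.87; the reported 86% is presumably computed from unrounded scores. Either way, each token of claim-layer context buys roughly three times the accuracy of a token of raw session transcript.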

2. The differentiator, already working: knowledge-update (LongMemEval)

Knowledge-update is the category donto was built for: a fact changes over sessions, the answer is the current value, and old values are distractors. donto handles it at 0.923 on the free reader, out of the box, because the bitemporal ingest stamps each session with valid_from and the reader is told the current date and to lead with the latest value. We tested an explicit recency re-rank (re-sort recalled sessions latest-first); it added nothing (0.923 → 0.910, pure churn). The bitemporal win is in the representation, not in recall re-ranking, and with only 6 misses there was no headroom to justify extracting a claim layer here.
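A minimal sketch of why the dated representation can be enough on its own. Only the valid_from stamp, the current-date instruction, and the latest-first re-sort come from the description above; the Session type, the render_context function, the prompt wording, and the example sessions are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Session:
    valid_from: date  # when the session's facts became valid
    text: str


def render_context(sessions, today, latest_first=False):
    """Render recalled sessions with their dates for the reader prompt."""
    if latest_first:
        # The explicit recency re-rank that was tested and added nothing.
        sessions = sorted(sessions, key=lambda s: s.valid_from, reverse=True)
    lines = [
        f"Current date: {today.isoformat()}",
        "If a fact changed over time, lead with the latest value.",
    ]
    lines += [f"[{s.valid_from.isoformat()}] {s.text}" for s in sessions]
    return "\n".join(lines)


# Invented example: the employer changes between two sessions.
recalled = [
    Session(date(2023, 1, 5), "I work at Acme."),
    Session(date(2023, 6, 2), "I just moved to Globex."),
]
print(render_context(recalled, today=date(2023, 7, 1)))
```

Both orderings carry the same dated lines, so a reader that follows the instruction can identify the current value either way. That is consistent with the re-sort producing churn rather than a gain.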

3. The line that actually matters: shaping the answer helps the reader; re-ranking what's retrieved doesn't

Both weak spots had the answer already recalled — the limit was the reader, not retrieval. But the two responded oppositely, and the difference is the whole lesson:

LongMemEval single-session-preference: answer-shaping works. The baseline is 0.70, and a diagnostic showed all 9 misses had the answer session recalled: these are preference-application questions ("suggest a hotel honoring my taste") where the weak reader didn't apply the recalled preference. A pre-distilled salience facet (extract the user's relevant stated preference and surface it as a prominent top item) lifted it to 0.767 (recovered 4, lost 2). Pre-computing the answer-relevant fact and putting it front-and-centre helps even a weak reader. (An earlier run that showed this hurting was the result of a bug: the facet was written to an ignored key. The figure here is from the corrected run.)
LoCoMo single-hop: re-ranking doesn't. Every miss's answer is in the claim layer (e.g. Caroline · lovedBook · Becoming Nicole), just out-ranked in vector recall. But the levers that try to re-rank/route what's retrieved don't convert: an always-on lexical arm (6/12 → 8/12, but trades against open-domain), a per-claim rank-gate (net worse), and a per-question LLM router (misroutes on the free reader, net-negative). The discriminating signal is a specific token the reader has to pick out, not a shape the substrate can hand it pre-made. Closing the gap needs question-level adaptive recall with a reliable classifier, which is donto's alignment frontier.
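The lexical-arm trade-off can be illustrated with a toy recall function. This is a sketch under stated assumptions, not donto's recall: the slot-reservation merge, the tokenizer, the vector scores, and every claim except the Becoming Nicole example are invented for illustration.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; camelCase predicates like lovedBook are split."""
    return {t.lower() for t in re.findall(r"[A-Za-z][a-z]*|\d+", text)}


def lexical_score(query: str, claim: str) -> int:
    """Number of query tokens that also appear in the claim."""
    return len(tokens(query) & tokens(claim))


def recall(query, scored_claims, k=3, lexical_slots=0):
    """scored_claims: (claim_text, vector_score) pairs.

    Reserve `lexical_slots` of the k results for the best token-overlap
    hits, then fill the remaining slots in vector order.
    """
    by_vector = [c for c, _ in sorted(scored_claims, key=lambda p: p[1], reverse=True)]
    # sorted() is stable, so lexical ties keep their vector order.
    by_lexical = sorted(by_vector, key=lambda c: lexical_score(query, c), reverse=True)
    picked = by_lexical[:lexical_slots]
    for claim in by_vector:
        if len(picked) >= k:
            break
        if claim not in picked:
            picked.append(claim)
    return picked


query = "What book did Caroline love?"
answer = "Caroline · lovedBook · Becoming Nicole"
claims = [  # vector scores are made up; the answer is out-ranked
    ("Caroline · attended · pride parade", 0.70),
    ("Caroline · hobby · painting", 0.68),
    ("Melanie · read · Charlotte's Web", 0.66),
    (answer, 0.61),
]
print(recall(query, claims))                   # vector only
print(recall(query, claims, lexical_slots=1))  # always-on lexical arm
```

With one reserved slot the answer claim enters the top three, but it displaces a vector hit. On a question where that displaced claim was the needed evidence, the same arm costs accuracy, which is the shape of the single-hop versus open-domain trade-off described above.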

The distinction is sharp: when the substrate can pre-compute the answer-relevant content and surface it (aggregates for multi-hop, a resolved date for temporal, a distilled preference), it helps the reader — even a weak one. When the fix is to re-order or route atomic evidence the reader must still parse, a weak reader/classifier can't be bootstrapped from the substrate.
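Surfacing a pre-distilled fact can be sketched as follows. The key names, the heading text, and the example preference are hypothetical; the unknown-key check is an assumption added here to show one way of making a facet written to an unread key fail loudly instead of being dropped silently.

```python
KNOWN_KEYS = {"salience", "sessions"}


def build_context(payload: dict) -> str:
    """Render the reader context with the distilled facet as the top item."""
    unknown = set(payload) - KNOWN_KEYS
    if unknown:
        # A facet stored under an unread key would otherwise vanish silently.
        raise KeyError(f"unrecognized context keys: {sorted(unknown)}")
    parts = []
    if payload.get("salience"):
        parts.append("MOST RELEVANT STATED PREFERENCE:\n" + payload["salience"])
    parts.extend(payload.get("sessions", []))
    return "\n\n".join(parts)


context = build_context({
    "salience": "Prefers small, quiet hotels within walking distance of museums.",
    "sessions": ["[session 1 transcript]", "[session 2 transcript]"],
})
print(context)
```

The recalled sessions are unchanged; the only difference is that the answer-relevant fact is stated once, up front, so the reader does not have to find and apply it unaided.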

But answer-shaping is a scalpel, not a hammer — and that bounds it. A generic "distill the relevant facts" facet applied to multi-session synthesis questions hurt (0.70 → 0.63): when the answer needs many facts combined, distilling the 17K-token context to a few lines drops what's needed (a weak distiller compresses out the wrong things) — the same failure mode as aggregation losing a single-hop verbatim fact. So shaping lifts the reader only when the answer is a focused, reliably-distillable fact (a preference, a date, a specific multi-hop chain); it backfires when it compresses a synthesis-heavy context. Knowing which questions to shape (and which to leave as full evidence) is, once more, routing — the reader/classifier-bound frontier.
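The bound on shaping amounts to a gate in front of the shaper. A minimal sketch, assuming the question category is already known; the category labels and function name are illustrative, and producing a reliable category is exactly the unsolved routing step.

```python
# Categories where shaping is reported above to help; labels are illustrative.
SHAPEABLE = {"single-session-preference", "temporal", "multi-hop"}


def choose_context(category: str, shaped: str, full_evidence: str) -> str:
    """Use the shaped context only for focused, reliably-distillable questions."""
    return shaped if category in SHAPEABLE else full_evidence


print(choose_context("temporal", "shaped", "full"))
print(choose_context("multi-session", "shaped", "full"))
```

The gate itself is trivial. Its value depends entirely on how often the category label is right, so a weak classifier moves the failure upstream rather than removing it.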

4. What this says about donto's value

The pattern across both benchmarks is consistent and clarifying:

donto's recall is sound. In every category we diagnosed, the right evidence was retrieved. The substrate is not the bottleneck.
donto's claim/answer-shaping layer pays off as efficiency — near-episodic accuracy at a fraction of the tokens — which is exactly the axis that matters at scale and when context windows bind, and the axis session-rich in-context benchmarks under-reward.
donto's bitemporal representation pays off as reliability on changing facts (knowledge-update), with the win sitting in the dated representation rather than any clever re-ranking.
donto's answer-shaping helps the reader; donto's re-ranking of atomic evidence doesn't. Pre-distilled, surfaced content (aggregates, resolved dates, a distilled preference) lifts accuracy even for a weak reader; re-ordering/routing the same atomic claims (single-hop lexical/rank-gate/router) doesn't convert, because the reader still has to do the picking.

So the honest pitch is not "donto wins raw accuracy on session-rich benchmarks" (a strong reader + full context is hard to beat there). It is: donto delivers near-episodic answers at a fraction of the tokens, handles changing facts reliably, retrieved the right evidence in every category we diagnosed, and, where it can pre-shape the answer-relevant fact, lifts the reader on its own weak spots (preference 0.70 → 0.767).

The one place we couldn't convert was single-hop verbatim recall: an always-on lexical arm (6/12 → 8/12, but trades against open-domain), a per-claim rank-gate (net worse), and a per-question LLM router (misroutes on the free reader, net-negative) all failed, because there the fix is to re-rank a specific token the reader must still parse, not a shape the substrate can hand over pre-made. That one is genuinely a routing/reader-capability frontier.

The clean conclusion: donto's value is token-efficiency, reliable bitemporal recall, and answer-shaping that helps even a weak reader. The residual gap (single-hop) is a re-ranking/routing problem that needs a reliable classifier — i.e. a stronger reader — not another retrieval facet.