Answer-Shaping: when a knowledge substrate should pre-compute the answer, and when it must not
donto papers · original research · 2026-06-13 · empirical, working draft v1
Abstract. Retrieval-augmented systems are usually evaluated on whether they fetch the right evidence. We report a different and sharper lever: a substrate that holds extracted, aligned claims can pre-shape the answer-relevant content — compose an aggregate claim, resolve a date, distil a stated preference — and surface it ahead of raw evidence. Across two agent-memory benchmarks (LoCoMo, LongMemEval) on a deliberately weak, free reader, we find a clean and reproducible law: answer-shaping lifts accuracy when the answer is a single reliably-distillable fact, and degrades it when the answer requires synthesizing many facts. A pre-distilled preference facet lifts LongMemEval single-session-preference from 0.70 to 0.767; the same generic distillation applied to multi-session synthesis questions drops accuracy from 0.70 to 0.63. Separately, using answer-shaped aggregate claims as the entire context (no raw transcripts) reaches 86% of full-episodic accuracy at 3.8× fewer tokens. We name the principle (shape a scalpel, never a hammer), give the mechanism (pre-shaping helps the reader apply a fact it already has; it cannot help the reader combine facts it must still parse), and draw the design consequence: the substrate's job is not just to retrieve but to decide which questions it may answer for the reader — a routing problem, not a retrieval one.
1. The setup, and why a weak reader is the right instrument
All results use a free, weak reader+judge (holo3.1) rather than a frontier model. This is deliberate. A strong reader with the full context can answer almost anything, which masks whether the substrate added value; a weak reader makes the substrate's contribution visible as a delta. We report deltas, not absolutes — the absolutes would change with the reader, but the direction and shape of each lever is the finding.
Two benchmarks, chosen because they stress different things:
- LoCoMo — 1,540 QA over long multi-session dialogues; categories: multi-hop, temporal, open-domain, single-hop.
- LongMemEval (_s) — 500 QA, ~50 sessions/question; categories include knowledge-update, single-session-preference, multi-session, temporal.
donto ingests each as evidence-anchored claims with valid-time, then offers the reader different return shapes: raw recalled sessions (episodic), atomic claims, or answer-shaped artifacts (aggregates, resolved dates, distilled preferences).
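As a rough illustration of what an evidence-anchored claim with valid-time and the three return shapes might look like, here is a minimal Python sketch. The names `Claim`, `ReturnShape`, `evidence_session`, `valid_from` and `valid_to`, along with the session id and date in the example, are assumptions made for illustration and not donto's actual schema; only the example triple is taken from this paper (section 4).

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class ReturnShape(Enum):
    EPISODIC = "episodic"            # raw recalled sessions
    ATOMIC = "atomic"                # individual extracted claims
    ANSWER_SHAPED = "answer_shaped"  # aggregates, resolved dates, distilled preferences


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    obj: str
    evidence_session: str              # hypothetical anchor back to the source session
    valid_from: Optional[date] = None  # valid-time start; None means unknown/open
    valid_to: Optional[date] = None    # valid-time end; None means still valid

    def valid_on(self, day: date) -> bool:
        """True if the claim's valid-time interval covers `day`."""
        after_start = self.valid_from is None or self.valid_from <= day
        before_end = self.valid_to is None or day <= self.valid_to
        return after_start and before_end

    def render(self) -> str:
        """Render the claim as a subject · predicate · object triple."""
        return f"{self.subject} · {self.predicate} · {self.obj}"


# Session id and date below are invented for the example.
claim = Claim("Caroline", "lovedBook", "Becoming Nicole",
              evidence_session="session-3", valid_from=date(2023, 5, 1))
print(claim.render())
print(claim.valid_on(date(2023, 6, 1)))
```

Keeping the evidence anchor on every claim is what would let a reader, or a later operator, fall back from a shaped artifact to the raw session it was derived from.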
2. Result 1 — token-efficiency: answer-shaped aggregates as the whole context
On session-rich LoCoMo, full-session episodic recall is a strong baseline and bolting claim facets onto the sessions is redundant — the sessions already contain the facts; both readers went flat-to-negative. The win appears only when you replace the transcripts:
| context | accuracy (96Q) | tokens | note |
|---|---|---|---|
| full episodic (raw sessions) | 0.604 | ~19,000 | strong baseline |
| atomic claims only | 0.244 | ~1,690 | recall-limited; collapses |
| claims + answer-shaped aggregates + dates | 0.521 | ~5,000 | 86% of episodic at 3.8× fewer tokens |
The aggregate claims — LLM-composed summaries that preserve every specific named detail and are themselves embedded and recalled — roughly double the atomic-claims baseline (0.24 → 0.52) and recover most of the episodic accuracy at roughly a quarter of the tokens, matching episodic on multi-hop and beating it on temporal. This is the substrate paying off on the axis that binds at scale (context budget), which session-rich in-context benchmarks under-reward.
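The headline ratios can be re-derived from the table. The short Python check below uses only the figures reported above; the token counts are the approximate values from the table, and the helper name `relative_to_baseline` is introduced here for illustration.

```python
# Figures copied from the table above (96-question LoCoMo subset; token counts are approximate).
contexts = {
    "full episodic": {"accuracy": 0.604, "tokens": 19_000},
    "atomic claims only": {"accuracy": 0.244, "tokens": 1_690},
    "claims + aggregates + dates": {"accuracy": 0.521, "tokens": 5_000},
}

baseline = contexts["full episodic"]


def relative_to_baseline(name: str) -> tuple[float, float]:
    """Return (share of baseline accuracy retained, token reduction factor)."""
    row = contexts[name]
    return row["accuracy"] / baseline["accuracy"], baseline["tokens"] / row["tokens"]


retained, reduction = relative_to_baseline("claims + aggregates + dates")
print(f"{retained:.0%} of episodic accuracy at {reduction:.1f}x fewer tokens")
# prints "86% of episodic accuracy at 3.8x fewer tokens"
```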
Mechanism. An aggregate is the substrate doing, once and at rest, the cross-claim composition the reader would otherwise have to do every query — and doing it where it has the whole claim layer in view, not a truncated window.
3. Result 2 — the scalpel/hammer law
The decisive experiments are two applications of the same pre-distillation operator to different question shapes:
| question shape | operator | result | direction |
|---|---|---|---|
| single-session-preference (apply one stated taste) | distil the user's relevant stated preference → surface as top item | 0.70 → 0.767 | ✅ helps |
| multi-session (synthesize across many sessions) | generic "distil the relevant facts" → surface as top item | 0.70 → 0.63 | ❌ hurts |
A diagnostic on the preference misses showed that all 9 had the answer session already recalled: retrieval was not the problem; the weak reader simply failed to apply the recalled preference. Pre-computing that one fact and putting it front and centre fixed it (recovered 4, lost 2, a net gain of 2 of 30 questions). The same distillation on multi-session questions threw away what the synthesis needed: compressing a 17K-token context to a few lines drops the long tail of facts a synthesis answer must combine. This is the same failure mode as an aggregate losing a single-hop verbatim token.
The law. Pre-shaping helps exactly when the answer is a focused, reliably-distillable fact (a preference, a date, one multi-hop chain). It hurts when the answer is a synthesis over many facts, because distillation is lossy and a weak distiller compresses out the wrong things. Shape a scalpel, never a hammer.
This unifies several smaller observations into one rule: aggregates help multi-hop (a focused chain) but a specific aggregate can drop a single-hop verbatim token; resolved dates help temporal; distilled preferences help preference; generic distillation helps none of the synthesis categories.
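One way to read the unified rule is as a per-category policy table: which shaping operator, if any, may be surfaced ahead of the evidence. The sketch below is an assumption-laden illustration, not donto's implementation. The names `SHAPING_POLICY` and `build_context` are hypothetical, mapping single-hop to "no shaping" is our reading of the verbatim-token caveat, and the preference text in the usage lines is invented.

```python
from typing import Optional

# Category -> shaping operator, read off the observations above.
# None means "do not distil; return the evidence untouched".
SHAPING_POLICY: dict[str, Optional[str]] = {
    "multi-hop": "aggregate",
    "temporal": "resolved_date",
    "single-session-preference": "distilled_preference",
    "multi-session": None,  # synthesis: generic distillation hurt (0.70 -> 0.63)
    "single-hop": None,     # assumption: keep verbatim tokens intact
}


def build_context(category: str, evidence: list[str], shaped: dict[str, str]) -> list[str]:
    """Surface the shaped artifact as the top item only when the policy allows it."""
    operator = SHAPING_POLICY.get(category)  # unknown categories fall back to None
    if operator is None or operator not in shaped:
        return list(evidence)
    return [shaped[operator]] + list(evidence)


# Usage with made-up content:
evidence = ["session 12 transcript", "session 40 transcript"]
shaped = {"distilled_preference": "User prefers vegetarian recipes."}
print(build_context("single-session-preference", evidence, shaped))  # shaped item first
print(build_context("multi-session", evidence, shaped))              # evidence only
```

Note that the table presupposes the category is known. At query time it is not, which is the routing problem discussed in section 5.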
4. Result 3 — re-ranking is not shaping (and the substrate can't bootstrap a weak reader on it)
A natural alternative to shaping is re-ranking: leave the atomic evidence intact but re-order/route it so the discriminating item rises. We tried three re-ranking variants on LoCoMo single-hop, where the answer to every missed question is present in the claim layer (e.g. Caroline · lovedBook · Becoming Nicole) but out-ranked in vector recall:
- always-on lexical arm: 6→8 of 12 single-hop, but trades against open-domain;
- per-claim rank-gate: net worse;
- per-question LLM router (classify then choose recall): misroutes on the weak reader, net-negative.
None produced a net gain. The reason completes the picture: re-ranking still hands the reader atomic evidence it must parse to find a specific token; shaping hands the reader the token. A weak reader can be helped by the latter and not the former. So the residual single-hop gap is a routing/classifier problem: it needs a reliable question classifier (i.e. a stronger reader), not another retrieval facet.
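To make the contrast concrete, here is one plausible form an always-on lexical arm could take: blend each claim's vector score with its token overlap against the question. The paper does not specify how the arm was built, so everything here is an assumption: the function names, the `lexical_weight` value, the distractor claim and both vector scores are invented for illustration.

```python
import re


def tokens(text: str) -> set[str]:
    """Lower-cased word tokens; camelCase predicates such as lovedBook are split."""
    return {t.lower() for t in re.findall(r"[A-Za-z][a-z]*|[0-9]+", text)}


def lexical_overlap(question: str, claim: str) -> float:
    """Fraction of question tokens that also appear in the claim."""
    q = tokens(question)
    return len(q & tokens(claim)) / len(q) if q else 0.0


def rerank(question: str, scored_claims: list[tuple[str, float]],
           lexical_weight: float = 0.5) -> list[str]:
    """Re-order (claim, vector_score) pairs by vector score plus weighted lexical overlap."""
    blended = [(text, score + lexical_weight * lexical_overlap(question, text))
               for text, score in scored_claims]
    return [text for text, _ in sorted(blended, key=lambda pair: pair[1], reverse=True)]


# Invented scores: the distractor out-ranks the answer claim on vector score alone.
candidates = [
    ("Caroline · attended · pride parade", 0.62),   # hypothetical distractor
    ("Caroline · lovedBook · Becoming Nicole", 0.55),
]
ranked = rerank("Which book did Caroline love?", candidates)
print(ranked[0])
```

Even when the blend lifts the answer claim to the top, the reader still receives a list of atomic triples and has to pick the right token out of it, which is the step the weak reader fails at.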
5. The design consequence: a substrate that decides what it may answer
The standard RAG contract is "retrieve well, let the reader read." Our results argue for a richer contract:
- Retrieve the evidence (donto's recall is sound — in every diagnosed category the right evidence was present).
- Decide the question's shape — focused-fact vs synthesis.
- For focused-fact questions, pre-shape the answer (aggregate / date / preference) and surface it; for synthesis questions, return full evidence and do not distil.
The second step, deciding the question's shape, is the open frontier, and it is a routing problem. The honest boundary of this work: we showed that shaping helps the scalpel cases and hurts the hammer cases; we did not produce a reliable, reader-independent classifier to tell them apart at query time (our LLM router on the weak reader misrouted). Building that classifier is the next paper, ideally from substrate signals (claim count, predicate spread, contradiction-pressure, question embedding) rather than a brittle category list.
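A shape classifier driven by substrate signals might start as simply as the threshold rule sketched below. This is a placeholder to show the interface, not a result. The `SubstrateSignals` fields follow the signals named above (the question embedding is omitted), but their scales and every threshold are untuned assumptions.

```python
from dataclasses import dataclass


@dataclass
class SubstrateSignals:
    claim_count: int               # recalled claims relevant to the question
    predicate_spread: int          # distinct predicates among those claims
    contradiction_pressure: float  # assumed 0..1 share of recalled claims in conflict


def classify_shape(signals: SubstrateSignals,
                   max_claims: int = 5,
                   max_predicates: int = 2,
                   max_contradiction: float = 0.2) -> str:
    """Return 'focused-fact' only when every signal is small; otherwise 'synthesis'.

    Thresholds are hypothetical. The rule defaults to 'synthesis' (return full
    evidence, do not distil) because shaping a synthesis question is the costly error.
    """
    focused = (signals.claim_count <= max_claims
               and signals.predicate_spread <= max_predicates
               and signals.contradiction_pressure <= max_contradiction)
    return "focused-fact" if focused else "synthesis"


print(classify_shape(SubstrateSignals(claim_count=2, predicate_spread=1, contradiction_pressure=0.0)))
print(classify_shape(SubstrateSignals(claim_count=40, predicate_spread=9, contradiction_pressure=0.1)))
```

The asymmetric default reflects the two results: a missed shaping opportunity costs a possible lift, while a wrongly applied distillation actively removes facts the answer needs.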
6. Relation to donto's other claims
- The token-efficiency result is the empirical face of emit-free / defer-joining: the substrate composes the answer-shaped aggregate once, the reader pays a fraction of the tokens forever after.
- The scalpel/hammer law is why donto's value is answer-shaping, not raw accuracy on session-rich benchmarks: a strong reader with full context is hard to beat on accuracy there, but it pays full token cost and cannot be improved by the substrate; a substrate that knows which fact to pre-compute lifts even a weak reader at a fraction of the cost.
- The routing frontier is where a contradiction-aware operator (the bitemporal sheaf / sheaf H¹) re-enters: contradiction-pressure and claim-spread are exactly the substrate signals a non-brittle shape-classifier should consume.
7. Reproducibility and honesty notes
- Free reader+judge (holo3.1) throughout; deltas not absolutes; the reader was held fixed and fair (no test-derived prompt tuning).
- The preference result recovered 4 and lost 2 of 30 questions: a real but small-N lift. We report it as directional, corroborated by the diagnostic that retrieval was not the bottleneck. An earlier run that reported the opposite (shaping hurts preference) was the product of a bug, corrected here: the facet was written to a key the renderer ignored. We flag it because the corrected and uncorrected runs are a clean lesson in measuring the plumbing before the hypothesis.
- All findings generalize beyond conversational data only insofar as the claim layer does; we make no cross-domain claim yet.
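To show why the preference lift is reported as directional, the sketch below reconstructs the accuracies from the counts given above (9 misses of 30, then 4 recovered and 2 lost) and runs an exact two-sided sign test on the 6 questions that changed outcome. The sign test is our addition for illustration; it is not an analysis the experiments themselves report.

```python
from math import comb


def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided exact sign test on discordant pairs (ties dropped), null p = 0.5."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


total = 30
correct_before = total - 9              # 9 preference misses at baseline
correct_after = correct_before + 4 - 2  # recovered 4, lost 2

before = correct_before / total
after = correct_after / total
p_value = sign_test_p(4, 2)
print(f"{before:.2f} -> {after:.3f}, p = {p_value:.4f}")
# prints "0.70 -> 0.767, p = 0.6875"
```

With only six changed outcomes the test cannot separate the lift from noise, which is consistent with treating it as a direction to be confirmed at larger N.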
The contribution is not a leaderboard number. It is a law about when a knowledge substrate should answer for the reader — and the consequent reframing of the substrate's job from retrieve to retrieve, then decide whether to shape.
See also: the synthesis these results came from, the honest scorecard; the theory companion The Bitemporal Sheaf; the full program in the donto research agenda.