Answer-Shaping: when a knowledge substrate should pre-compute the answer, and when it must not

donto papers · original research · 2026-06-13 · empirical, working draft v1

Abstract. Retrieval-augmented systems are usually evaluated on whether they fetch the right evidence. We report a different and sharper lever: a substrate that holds extracted, aligned claims can pre-shape the answer-relevant content — compose an aggregate claim, resolve a date, distil a stated preference — and surface it ahead of raw evidence. Across two agent-memory benchmarks (LoCoMo, LongMemEval) on a deliberately weak, free reader, we find a clean and reproducible law: answer-shaping lifts accuracy when the answer is a single reliably-distillable fact, and degrades it when the answer requires synthesizing many facts. A pre-distilled preference facet lifts LongMemEval single-session-preference from 0.70 to 0.767; the same generic distillation applied to multi-session synthesis questions drops accuracy from 0.70 to 0.63. Separately, using answer-shaped aggregate claims as the entire context (no raw transcripts) reaches 86% of full-episodic accuracy at 3.8× fewer tokens. We name the principle (shape a scalpel, never a hammer), give the mechanism (pre-shaping helps the reader apply a fact it already has; it cannot help the reader combine facts it must still parse), and draw the design consequence: the substrate's job is not just to retrieve but to decide which questions it may answer for the reader — a routing problem, not a retrieval one.

1. The setup, and why a weak reader is the right instrument

All results use a free, weak reader+judge (holo3.1) rather than a frontier model. This is deliberate. A strong reader with the full context can answer almost anything, which masks whether the substrate added value; a weak reader makes the substrate's contribution visible as a delta. We report deltas, not absolutes: the absolutes would change with the reader, but the direction and shape of each lever are the finding.

Two benchmarks, chosen because they stress different things:

LoCoMo — 1,540 QA over long multi-session dialogues; categories: multi-hop, temporal, open-domain, single-hop.
LongMemEval (_s) — 500 QA, ~50 sessions/question; categories include knowledge-update, single-session-preference, multi-session, temporal.

donto ingests each as evidence-anchored claims with valid-time, then offers the reader different return shapes: raw recalled sessions (episodic), atomic claims, or answer-shaped artifacts (aggregates, resolved dates, distilled preferences).
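As a concrete picture of the objects involved, here is a minimal sketch of an evidence-anchored claim with valid-time and the three return shapes. The field names, the `ReturnShape` enum and the `session-7` anchor are illustrative assumptions, not donto's actual schema; only the claim triple itself is taken from this paper.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class ReturnShape(Enum):
    EPISODIC = "episodic"            # raw recalled sessions
    ATOMIC = "atomic"                # individual claims
    ANSWER_SHAPED = "answer_shaped"  # aggregates, resolved dates, distilled preferences


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    obj: str
    evidence_session: str               # hypothetical anchor back to the source session
    valid_from: Optional[date] = None   # valid-time; None when the source gives no date
    valid_to: Optional[date] = None

    def render(self) -> str:
        # Same "subject · predicate · object" notation the paper uses for claims.
        return f"{self.subject} · {self.predicate} · {self.obj}"


# "session-7" is a made-up identifier for illustration only.
claim = Claim("Caroline", "lovedBook", "Becoming Nicole", evidence_session="session-7")
print(claim.render())
```

The point of the sketch is only that every claim carries its evidence anchor and its valid-time, so any return shape built from claims can still be traced back to a session.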

2. Result 1 — token-efficiency: answer-shaped aggregates as the whole context

On session-rich LoCoMo, full-session episodic recall is a strong baseline and bolting claim facets onto the sessions is redundant — the sessions already contain the facts; both readers went flat-to-negative. The win appears only when you replace the transcripts:

context	accuracy (96Q)	tokens	note
full episodic (raw sessions)	0.604	~19,000	strong baseline
atomic claims only	0.244	~1,690	recall-limited; collapses
claims + answer-shaped aggregates + dates	0.521	~5,000	86% of episodic at 3.8× fewer tokens

The aggregate claims — LLM-composed summaries that preserve every specific named detail and are themselves embedded and recalled — roughly double the atomic-claims baseline (0.24 → 0.52) and recover most of the episodic accuracy at roughly a quarter of the tokens, matching episodic on multi-hop and beating it on temporal. This is the substrate paying off on the axis that binds at scale (context budget), which session-rich in-context benchmarks under-reward.
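The headline ratios follow directly from the table above. The short check below recomputes them from the reported figures; the token counts are the approximate values given in the table, so the ratios are approximate too.

```python
# Figures copied from the table above (token counts are approximate).
episodic_acc, episodic_tokens = 0.604, 19_000
atomic_acc = 0.244
shaped_acc, shaped_tokens = 0.521, 5_000

retained = shaped_acc / episodic_acc            # share of episodic accuracy kept
token_ratio = episodic_tokens / shaped_tokens   # how many times fewer tokens
lift_over_atomic = shaped_acc / atomic_acc      # gain over the atomic-claims baseline

print(f"accuracy retained: {retained:.0%}")            # 86%
print(f"token reduction:   {token_ratio:.1f}x")        # 3.8x
print(f"lift over atomic:  {lift_over_atomic:.2f}x")   # 2.14x
```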

Mechanism. An aggregate is the substrate doing, once and at rest, the cross-claim composition the reader would otherwise have to do every query — and doing it where it has the whole claim layer in view, not a truncated window.
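The "once and at rest" idea can be sketched as a store that composes an aggregate a single time and serves it to every later query. This is an illustration under assumptions: the class, its method names and the second claim are hypothetical, and the composer is a plain string join standing in for the LLM composition described above.

```python
from collections import defaultdict
from typing import Callable


class AggregateStore:
    """Composes one aggregate per subject and reuses it until new claims arrive."""

    def __init__(self, compose: Callable[[list[str]], str]):
        self._compose = compose
        self._claims: dict[str, list[str]] = defaultdict(list)
        self._aggregates: dict[str, str] = {}
        self.compose_calls = 0

    def ingest(self, subject: str, claim: str) -> None:
        self._claims[subject].append(claim)
        self._aggregates.pop(subject, None)   # a new claim invalidates the stored aggregate

    def aggregate(self, subject: str) -> str:
        if subject not in self._aggregates:
            self.compose_calls += 1           # composition cost is paid here, once
            self._aggregates[subject] = self._compose(self._claims[subject])
        return self._aggregates[subject]


# Stand-in composer (string join); the paper's aggregates are LLM-composed.
store = AggregateStore(compose=lambda claims: "; ".join(claims))
store.ingest("Caroline", "lovedBook Becoming Nicole")
store.ingest("Caroline", "attended pottery class")   # hypothetical claim
for _ in range(3):                                    # three queries, one composition
    context = store.aggregate("Caroline")
print(store.compose_calls)
```

Three queries trigger one composition; the reader never repeats the cross-claim work.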

3. Result 2 — the scalpel/hammer law

The decisive experiments are two applications of the same pre-distillation operator to different question shapes:

question shape	operator	result	direction
single-session-preference (apply one stated taste)	distil the user's relevant stated preference → surface as top item	0.70 → 0.767	✅ helps
multi-session (synthesize across many sessions)	generic "distil the relevant facts" → surface as top item	0.70 → 0.63	❌ hurts

A diagnostic on the preference misses showed that all 9 had the answer session already recalled. Retrieval was not the problem; the weak reader simply failed to apply the recalled preference. Pre-computing that one fact and putting it front and centre fixed it (4 misses recovered, 2 previously correct answers lost). The same distillation on multi-session questions threw away what the synthesis needed: compressing a 17K-token context to a few lines drops the long tail of facts a synthesis answer must combine, the same failure mode as an aggregate losing a single-hop verbatim token.
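The preference numbers can be checked by hand, and the same arithmetic shows how much weight a 30-question sample can bear. The sketch below assumes a baseline of 21 of 30 correct (inferred from 0.70 and the 9 misses) and applies an exact two-sided sign test to the 6 questions that changed outcome. The choice of test is ours for illustration; it is not an analysis reported in the experiments.

```python
from math import comb

n_questions = 30
baseline_correct = 21       # inferred: 0.70 * 30, leaving the 9 misses
recovered, lost = 4, 2      # misses fixed by shaping, correct answers broken by it

shaped_correct = baseline_correct + recovered - lost
print(f"{shaped_correct}/{n_questions} = {shaped_correct / n_questions:.3f}")  # 23/30 = 0.767

# Exact two-sided sign test on the discordant questions (null: flips are 50/50).
n = recovered + lost
k = min(recovered, lost)
p_two_sided = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n)
print(f"p = {p_two_sided:.4f}")  # p = 0.6875
```

With only six discordant questions the test cannot separate the lift from chance, which is why the result is best read as directional and leaned on together with the retrieval diagnostic.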

The law. Pre-shaping helps exactly when the answer is a focused, reliably-distillable fact (a preference, a date, one multi-hop chain). It hurts when the answer is a synthesis over many facts, because distillation is lossy and a weak distiller compresses out the wrong things. Shape a scalpel, never a hammer.

This unifies several smaller observations into one rule: aggregates help multi-hop (a focused chain) but a specific aggregate can drop a single-hop verbatim token; resolved dates help temporal; distilled preferences help preference; generic distillation helps none of the synthesis categories.

4. Result 3 — re-ranking is not shaping (and the substrate can't bootstrap a weak reader on it)

A natural alternative to shaping is re-ranking: leave the atomic evidence intact but re-order or route it so the discriminating item rises. We tried three variants on LoCoMo single-hop, where the answer to every miss is present in the claim layer (e.g. Caroline · lovedBook · Becoming Nicole) but out-ranked in vector recall:

always-on lexical arm: 6→8 of 12 single-hop, but trades against open-domain;
per-claim rank-gate: net worse;
per-question LLM router (classify then choose recall): misroutes on the weak reader, net-negative.

None converted into a net gain. The reason completes the picture: re-ranking still hands the reader atomic evidence it must parse to find a specific token; shaping hands the reader the token. A weak reader can be helped by the latter and not the former. So the residual single-hop gap is a routing/classifier problem, not a retrieval one: it needs a reliable question classifier, i.e. a stronger reader, rather than another retrieval facet.
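The difference between the two operations is easiest to see side by side. The sketch below is a toy: `rerank_lexical` is a simple token-overlap re-ranker, not the lexical arm used in the experiments, and the two distractor claims are hypothetical. Only the Caroline claim comes from the text above.

```python
import re


def tokens(text: str) -> set[str]:
    # Split camelCase predicates such as "lovedBook" before lowercasing.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return set(re.findall(r"[a-z0-9]+", spaced.lower()))


def rerank_lexical(question: str, claims: list[str]) -> list[str]:
    """Re-ranking: the same atomic claims, reordered by token overlap with the question."""
    q = tokens(question)
    return sorted(claims, key=lambda c: len(q & tokens(c)), reverse=True)


def shape(answer_item: str, claims: list[str]) -> list[str]:
    """Shaping: a pre-computed answer item is placed ahead of the evidence."""
    return [f"ANSWER-SHAPED: {answer_item}"] + claims


claims = [
    "Caroline · attended · pottery class",     # hypothetical distractor
    "Alex · lovedBook · Some Other Title",     # hypothetical distractor
    "Caroline · lovedBook · Becoming Nicole",
]
question = "Which book did Caroline love?"

reranked = rerank_lexical(question, claims)
shaped = shape("Caroline's loved book: Becoming Nicole", claims)
print(reranked[0])
print(shaped[0])
```

Even when re-ranking lifts the right claim to the top, the reader still receives a list of triples and must locate the answer token itself. The shaped context states the token outright.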

5. The design consequence: a substrate that decides what it may answer

The standard RAG contract is "retrieve well, let the reader read." Our results argue for a richer contract:

1. Retrieve the evidence (donto's recall is sound: in every diagnosed category the right evidence was present).
2. Decide the question's shape: focused-fact vs synthesis.
3. For focused-fact questions, pre-shape the answer (aggregate / date / preference) and surface it; for synthesis questions, return full evidence and do not distil.

Step 2 is the open frontier and it is a routing problem. The honest boundary of this work: we showed that shaping helps the scalpel cases and hurts the hammer cases; we did not produce a reliable, reader-independent classifier to tell them apart at query time (our LLM router on the weak reader misrouted). Building that classifier — ideally from substrate signals (claim count, predicate spread, contradiction-pressure, question embedding) rather than a brittle category list — is the next paper.
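The three-step contract can be sketched as a small pipeline. Everything about the classifier here is hypothetical: the `Signals` fields echo two of the substrate signals named above, but the thresholds are untuned placeholders, and no working classifier of this kind is reported in this work. The sketch shows only where such a classifier would sit.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Shape(Enum):
    FOCUSED_FACT = "focused_fact"
    SYNTHESIS = "synthesis"


@dataclass
class Signals:
    claim_count: int        # recalled claims relevant to the question
    predicate_spread: int   # distinct predicates among those claims


def classify(signals: Signals, max_claims: int = 5, max_predicates: int = 2) -> Shape:
    # Hypothetical, untuned thresholds; a placeholder for the open classifier problem.
    if signals.claim_count <= max_claims and signals.predicate_spread <= max_predicates:
        return Shape.FOCUSED_FACT
    return Shape.SYNTHESIS


def build_context(evidence: list[str], signals: Signals,
                  shaper: Callable[[list[str]], Optional[str]]) -> list[str]:
    # Step 1 (retrieval) is assumed done: `evidence` is the recalled material.
    if classify(signals) is Shape.FOCUSED_FACT:   # step 2: decide the shape
        shaped = shaper(evidence)
        if shaped is not None:
            return [shaped] + evidence            # step 3: shaped item first, evidence kept
    return evidence                               # step 3: synthesis gets full evidence, undistilled


evidence = ["session A text", "session B text"]
focused = build_context(evidence, Signals(2, 1), lambda ev: "preference: vegetarian")
synthesis = build_context(evidence, Signals(40, 9), lambda ev: "preference: vegetarian")
print(focused[0], "|", synthesis[0])
```

Note the asymmetry in the fallback: when the classifier is unsure or the shaper returns nothing, the reader gets the full evidence, since the results above suggest a wrongly applied distillation costs more than a missed one.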

6. Relation to donto's other claims

The token-efficiency result is the empirical face of emit-free / defer-joining: the substrate composes the answer-shaped aggregate once, the reader pays a fraction of the tokens forever after.
The scalpel/hammer law is why donto's value is answer-shaping, not raw accuracy on session-rich benchmarks: a strong reader with full context is hard to beat on accuracy there, but it pays full token cost and cannot be improved by the substrate; a substrate that knows which fact to pre-compute lifts even a weak reader at a fraction of the cost.
The routing frontier is where a contradiction-aware operator (the bitemporal sheaf / sheaf H¹) re-enters: contradiction-pressure and claim-spread are exactly the substrate signals a non-brittle shape-classifier should consume.

7. Reproducibility and honesty notes

Free reader+judge (holo3.1) throughout; deltas not absolutes; the reader was held fixed and fair (no test-derived prompt tuning).
The preference result recovered 4 and lost 2 of 30 questions (net +2), a real but small-N lift; we report it as directional, corroborated by the diagnostic that retrieval was not the bottleneck. An earlier run that reported the opposite (shaping hurts preference) was a bug: the facet was written to a key the renderer ignored. It is corrected here, and we flag it because the corrected and uncorrected runs are a clean lesson in measuring the plumbing before the hypothesis.
All findings generalize beyond conversational data only insofar as the claim layer does; we make no cross-domain claim yet.
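The facet-key bug noted above is the kind of failure a one-line plumbing check catches before any accuracy is measured. The sketch below is a toy reconstruction under assumptions: the key names, the renderer and the preference text are all hypothetical, and the real renderer is not shown in this paper.

```python
FACET_KEY = "preference_facet"   # hypothetical key the renderer reads


def render_context(payload: dict) -> str:
    """Toy renderer: emits the facet first, then the recalled sessions."""
    parts = []
    if FACET_KEY in payload:
        parts.append(f"[preference] {payload[FACET_KEY]}")
    parts.extend(payload.get("sessions", []))
    return "\n".join(parts)


def facet_reaches_reader(payload: dict, facet_text: str) -> bool:
    # The check: does the text the reader actually sees contain the facet?
    return facet_text in render_context(payload)


facet = "prefers vegetarian recipes"   # made-up preference for illustration
good = {FACET_KEY: facet, "sessions": ["session text"]}
bad = {"pref_facet": facet, "sessions": ["session text"]}   # written to an ignored key

assert facet_reaches_reader(good, facet)
assert not facet_reaches_reader(bad, facet)   # silently dropped: the run measures nothing
```

Running a check like this on the rendered prompt, not on the stored payload, is what separates a negative result from a facet that never reached the reader.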

The contribution is not a leaderboard number. It is a law about when a knowledge substrate should answer for the reader — and the consequent reframing of the substrate's job from retrieve to retrieve, then decide whether to shape.

See also: the synthesis these results came from, the honest scorecard; the theory companion The Bitemporal Sheaf; the full program in the donto research agenda.