# Answer-Shaping: when a knowledge substrate should pre-compute the answer, and when it must not

_donto papers · original research · 2026-06-13 · empirical, working draft v1_

**Abstract.** Retrieval-augmented systems are usually evaluated on whether they fetch the *right evidence*. We report a different and sharper lever: a substrate that holds extracted, aligned claims can **pre-shape** the answer-relevant content — compose an aggregate claim, resolve a date, distil a stated preference — and surface it ahead of raw evidence. Across two agent-memory benchmarks (LoCoMo, LongMemEval) on a deliberately weak, free reader, we find a clean and reproducible law: **answer-shaping lifts accuracy when the answer is a single reliably-distillable fact, and *degrades* it when the answer requires synthesizing many facts.** A pre-distilled preference facet lifts LongMemEval single-session-preference from 0.70 to 0.767; the *same generic distillation* applied to multi-session synthesis questions drops accuracy from 0.70 to 0.63. Separately, using answer-shaped aggregate claims *as the entire context* (no raw transcripts) reaches **86% of full-episodic accuracy at 3.8× fewer tokens**. We name the principle (**shape a scalpel, never a hammer**), give the mechanism (pre-shaping helps the reader *apply* a fact it already has; it cannot help the reader *combine* facts it must still parse), and draw the design consequence: the substrate's job is not just to retrieve but to decide *which questions it may answer for the reader* — a routing problem, not a retrieval one.

---

## 1. The setup, and why a weak reader is the right instrument

All results use a **free, weak reader+judge** (holo3.1) rather than a frontier model. This is deliberate. A strong reader with the full context can answer almost anything, which *masks* whether the substrate added value; a weak reader makes the substrate's contribution visible as a delta. We report deltas, not absolutes — the absolutes would change with the reader, but the *direction and shape* of each lever is the finding.

Two benchmarks, chosen because they stress different things:
- **LoCoMo** — 1,540 QA over long multi-session dialogues; categories: multi-hop, temporal, open-domain, single-hop.
- **LongMemEval (`_s`)** — 500 QA, ~50 sessions/question; categories include knowledge-update, single-session-preference, multi-session, temporal.

donto ingests each as evidence-anchored claims with valid-time, then offers the reader different *return shapes*: raw recalled sessions (episodic), atomic claims, or **answer-shaped** artifacts (aggregates, resolved dates, distilled preferences).
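As a rough illustration of what "evidence-anchored claims with valid-time" and the three return shapes could look like as data, here is a minimal sketch. The class and field names (`Claim`, `ReturnShape`, `evidence_ref`, `valid_from`, `valid_to`) and the `session-3` reference are assumptions made for illustration, not donto's actual schema.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class ReturnShape(Enum):
    """The three return shapes named in the text (enum names are illustrative)."""
    EPISODIC = "episodic"            # raw recalled sessions
    ATOMIC = "atomic"                # individual claims
    ANSWER_SHAPED = "answer_shaped"  # aggregates, resolved dates, distilled preferences


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    obj: str
    evidence_ref: str                  # hypothetical pointer back to the source session
    valid_from: Optional[date] = None  # valid-time start, None if unknown
    valid_to: Optional[date] = None    # valid-time end, None if unknown

    def valid_at(self, when: date) -> bool:
        """True if the claim holds at `when`; a missing bound is treated as open."""
        after_start = self.valid_from is None or self.valid_from <= when
        before_end = self.valid_to is None or when <= self.valid_to
        return after_start and before_end

    def render(self) -> str:
        """Render in the `subject · predicate · object` form used in this paper."""
        return f"{self.subject} · {self.predicate} · {self.obj}"


claim = Claim("Caroline", "lovedBook", "Becoming Nicole", evidence_ref="session-3")
print(claim.render())
```

Keeping the evidence pointer on every claim is what would let a shaped artifact be traced back to the sessions it came from.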

---

## 2. Result 1 — token-efficiency: answer-shaped aggregates as the whole context

On session-rich LoCoMo, full-session episodic recall is a strong baseline and **bolting claim facets onto the sessions is redundant** — the sessions already contain the facts; both readers went flat-to-negative. The win appears only when you *replace* the transcripts:

| context | accuracy (96Q) | tokens | note |
|---|---|---|---|
| full episodic (raw sessions) | 0.604 | ~19,000 | strong baseline |
| atomic claims only | 0.244 | ~1,690 | recall-limited; collapses |
| **claims + answer-shaped aggregates + dates** | **0.521** | **~5,000** | **86% of episodic at 3.8× fewer tokens** |

The aggregate claims — LLM-composed summaries that *preserve every specific named detail* and are themselves embedded and recalled — roughly **double** the atomic-claims baseline (0.24 → 0.52) and recover most of the episodic accuracy at roughly a quarter of the tokens, matching episodic on multi-hop and *beating* it on temporal. This is the substrate paying off on the axis that binds at scale (context budget), which session-rich in-context benchmarks under-reward.
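The headline ratios can be re-derived from the table. The sketch below only restates the reported figures (the token counts are the table's own approximations); the dictionary keys and function names are illustrative.

```python
# Figures copied from the table above; token counts are approximate.
rows = {
    "episodic": {"accuracy": 0.604, "tokens": 19_000},
    "atomic": {"accuracy": 0.244, "tokens": 1_690},
    "shaped": {"accuracy": 0.521, "tokens": 5_000},
}


def relative_accuracy(name: str, baseline: str = "episodic") -> float:
    """Accuracy of `name` as a fraction of the baseline's accuracy."""
    return rows[name]["accuracy"] / rows[baseline]["accuracy"]


def token_reduction(name: str, baseline: str = "episodic") -> float:
    """How many times fewer tokens `name` uses than the baseline."""
    return rows[baseline]["tokens"] / rows[name]["tokens"]


print(f"{relative_accuracy('shaped'):.0%} of episodic accuracy")
print(f"{token_reduction('shaped'):.1f}x fewer tokens than episodic")
print(f"{relative_accuracy('shaped', 'atomic'):.2f}x the atomic-claims accuracy")
```

This gives 86% of episodic accuracy, a 3.8× token reduction, and about 2.14× the atomic-claims accuracy, matching the figures quoted in the text.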

**Mechanism.** An aggregate is the substrate doing, once and at rest, the cross-claim composition the reader would otherwise have to do every query — and doing it where it has the whole claim layer in view, not a truncated window.
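A toy sketch of "compose once, at rest" follows. It assumes claims are grouped by subject and uses a plain string join as a stand-in for the LLM composer; both choices are assumptions for illustration, as is the second example claim. The one property it tries to mirror is that the composed text keeps every named detail verbatim.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Claim = Tuple[str, str, str]  # (subject, predicate, object)


def join_composer(claims: List[Claim]) -> str:
    """Stand-in for the LLM composer: keeps every predicate and object verbatim."""
    subject = claims[0][0]
    details = "; ".join(f"{p} {o}" for _, p, o in claims)
    return f"{subject}: {details}"


def build_aggregates(
    claims: List[Claim],
    compose: Callable[[List[Claim]], str] = join_composer,
) -> Dict[str, str]:
    """Group claims by subject and compose one aggregate per group, at ingest time."""
    grouped: Dict[str, List[Claim]] = defaultdict(list)
    for claim in claims:
        grouped[claim[0]].append(claim)
    return {subject: compose(group) for subject, group in grouped.items()}


claims = [
    ("Caroline", "lovedBook", "Becoming Nicole"),
    ("Caroline", "attended", "pottery class"),  # hypothetical claim
]
aggregates = build_aggregates(claims)  # composed once; later queries reuse it
print(aggregates["Caroline"])
```

The composition cost is paid at ingest, so each later query only pays for reading the stored aggregate.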

---

## 3. Result 2 — the scalpel/hammer law

The decisive experiments are two applications of the *same* pre-distillation operator to *different* question shapes:

| question shape | operator | result | direction |
|---|---|---|---|
| single-session-**preference** (apply one stated taste) | distil the user's relevant stated preference → surface as top item | **0.70 → 0.767** | ✅ helps |
| **multi-session** (synthesize across many sessions) | generic "distil the relevant facts" → surface as top item | **0.70 → 0.63** | ❌ hurts |

A diagnostic on the preference misses showed that **all 9 had the answer session already recalled**. Retrieval was not the problem; the weak reader simply failed to *apply* the recalled preference. Pre-computing that one fact and putting it front-and-centre partly fixed this: 4 of the 9 misses were recovered and 2 previously correct answers were lost, a net gain of 2 on 30 questions (21 → 23 correct). The *same* distillation on multi-session questions **threw away** what the synthesis needed: compressing a 17K-token context to a few lines drops the long tail of facts a synthesis answer must combine. This is the same failure mode as an aggregate losing a single-hop verbatim token.

> **The law.** Pre-shaping helps exactly when the answer is a **focused, reliably-distillable fact** (a preference, a date, one multi-hop chain). It hurts when the answer is a **synthesis** over many facts, because distillation is lossy and a weak distiller compresses out the wrong things. *Shape a scalpel, never a hammer.*

This unifies several smaller observations into one rule: aggregates help multi-hop (a focused chain) but a *specific* aggregate can drop a single-hop verbatim token; resolved dates help temporal; distilled preferences help preference; generic distillation helps none of the synthesis categories.
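The preference-row numbers in this section can be checked from the reported counts (30 questions, 4 recovered, 2 lost). The sketch below also runs a two-sided exact sign test on the discordant pairs; that test is our addition for illustration, not an analysis reported by the experiments.

```python
from math import comb

n_questions = 30           # preference questions, as stated in Section 7
baseline_accuracy = 0.70
recovered, lost = 4, 2     # misses fixed by shaping / correct answers broken by it

baseline_correct = round(baseline_accuracy * n_questions)  # 21
baseline_misses = n_questions - baseline_correct           # 9, matching the diagnostic
shaped_correct = baseline_correct + recovered - lost       # 23
shaped_accuracy = shaped_correct / n_questions


def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided exact sign test on discordant pairs (unchanged answers ignored)."""
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(baseline_misses, shaped_correct, f"{shaped_accuracy:.3f}")
print(f"{sign_test_p(recovered, lost):.4f}")
```

The counts reproduce the 9 misses and the 0.767 figure. The sign test gives p = 0.6875 on six changed answers, which is consistent with treating the lift as directional rather than statistically established.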

---

## 4. Result 3 — re-ranking is not shaping (and the substrate can't bootstrap a weak reader on it)

A natural alternative to *shaping* is *re-ranking*: leave the atomic evidence intact but re-order/route it so the discriminating item rises. We tried three on LoCoMo single-hop, where every miss's answer **is** present in the claim layer (e.g. `Caroline · lovedBook · Becoming Nicole`) but out-ranked in vector recall:

- always-on lexical arm: 6→8 of 12 single-hop, but **trades against open-domain**;
- per-claim rank-gate: **net worse**;
- per-question LLM router (classify then choose recall): **misroutes on the weak reader, net-negative**.

None yielded a net gain. The reason completes the picture: **re-ranking still hands the reader atomic evidence it must parse to find a specific token**; *shaping* hands the reader the token. A weak reader can be helped by the latter and not the former. So the residual single-hop gap is a **routing/classifier** problem, not a retrieval one: closing it needs a *reliable* question classifier (i.e. a stronger reader), not another retrieval facet.
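The difference between the two operations can be shown on a toy claim list. This is an illustrative sketch only: the lexical-overlap scorer is a simple stand-in rather than the lexical arm used in the experiments, the two distractor claims are hypothetical, and `shape` assumes the (subject, predicate) pair is already known, which is exactly the routing step that remains unsolved.

```python
import re
from typing import List, Optional, Set, Tuple

Claim = Tuple[str, str, str]  # (subject, predicate, object)


def words(text: str) -> Set[str]:
    """Lowercased word set; camelCase predicates are split into words first."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return set(re.findall(r"[a-z]+", spaced.lower()))


def rerank(question: str, claims: List[Claim]) -> List[Claim]:
    """Re-order claims by word overlap with the question (ties keep input order)."""
    q = words(question)
    return sorted(claims, key=lambda c: -len(q & words(" ".join(c))))


def shape(claims: List[Claim], subject: str, predicate: str) -> Optional[str]:
    """Return the answer token itself for a known (subject, predicate) pair."""
    for s, p, o in claims:
        if s == subject and p == predicate:
            return o
    return None


claims = [
    ("Caroline", "attended", "pottery class"),     # hypothetical distractor
    ("Melanie", "readBook", "Charlotte's Web"),    # hypothetical distractor
    ("Caroline", "lovedBook", "Becoming Nicole"),  # claim quoted in the text
]
question = "Which book did Caroline love?"

ranked = rerank(question, claims)                # still a list the reader must parse
answer = shape(claims, "Caroline", "lovedBook")  # the token itself
print(ranked[0], len(ranked), answer)
```

Even when re-ranking puts the right claim first, the reader receives three claims and must pick the token out; shaping returns `Becoming Nicole` directly.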

---

## 5. The design consequence: a substrate that decides what it may answer

The standard RAG contract is "retrieve well, let the reader read." Our results argue for a richer contract:

1. **Retrieve the evidence** (donto's recall is sound — in every diagnosed category the right evidence was present).
2. **Decide the question's shape** — focused-fact vs synthesis.
3. **For focused-fact questions, pre-shape the answer** (aggregate / date / preference) and surface it; **for synthesis questions, return full evidence and do not distil.**

Step 2 is the open frontier and it is a *routing* problem. The honest boundary of this work: we showed *that* shaping helps the scalpel cases and hurts the hammer cases; we did **not** produce a reliable, reader-independent classifier to tell them apart at query time (our LLM router on the weak reader misrouted). Building that classifier — ideally from substrate signals (claim count, predicate spread, contradiction-pressure, question embedding) rather than a brittle category list — is the next paper.
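The three-step contract can be sketched as a small routing function with the classifier left as a pluggable argument. All names here (`QuestionShape`, `Recall`, `build_context`) are hypothetical, and the constant classifiers in the usage lines are placeholders: the sketch does not propose a classifier, which remains the open problem described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class QuestionShape(Enum):
    FOCUSED_FACT = "focused_fact"
    SYNTHESIS = "synthesis"


@dataclass
class Recall:
    evidence: List[str]                  # step 1: full recalled evidence
    shaped_answer: Optional[str] = None  # pre-shaped artifact, if one exists


def build_context(
    question: str,
    recall: Recall,
    classify: Callable[[str], QuestionShape],
) -> List[str]:
    """Steps 2 and 3: classify the question, then shape or pass evidence through."""
    shape = classify(question)
    if shape is QuestionShape.FOCUSED_FACT and recall.shaped_answer is not None:
        # Focused fact: surface the shaped artifact ahead of the raw evidence.
        return [recall.shaped_answer, *recall.evidence]
    # Synthesis, or nothing to surface: return full evidence, undistilled.
    return list(recall.evidence)


recall = Recall(evidence=["session A", "session B"], shaped_answer="shaped preference")
focused = build_context("q", recall, lambda q: QuestionShape.FOCUSED_FACT)
synthesis = build_context("q", recall, lambda q: QuestionShape.SYNTHESIS)
print(focused)
print(synthesis)
```

Treating every non-focused case as "return full evidence" is a deliberately conservative default in this sketch: a misrouted focused-fact question loses only the lift, whereas a misrouted synthesis question would lose evidence.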

---

## 6. Relation to donto's other claims

- The **token-efficiency** result is the empirical face of *emit-free / defer-joining*: the substrate composes the answer-shaped aggregate once, the reader pays a fraction of the tokens forever after.
- The **scalpel/hammer law** is why donto's value is *answer-shaping*, not raw accuracy on session-rich benchmarks: a strong reader with full context is hard to beat on accuracy there, but it pays full token cost and cannot be improved by the substrate; a substrate that knows *which* fact to pre-compute lifts even a weak reader at a fraction of the cost.
- The **routing** frontier is where a contradiction-aware operator (the [bitemporal sheaf](/papers/bitemporal-sheaf) / sheaf `H¹`) re-enters: contradiction-pressure and claim-spread are exactly the substrate signals a non-brittle shape-classifier should consume.

---

## 7. Reproducibility and honesty notes

- Free reader+judge (holo3.1) throughout; deltas not absolutes; the reader was held fixed and fair (no test-derived prompt tuning).
- The preference result recovered 4 and lost 2 of 30 questions: a real but small-N lift. We report it as directional, corroborated by the diagnostic that retrieval was not the bottleneck. An earlier run reported the *opposite* (shaping hurts preference); that was a bug, since the facet was written to a key the renderer ignored, and it is corrected here. We flag it because the corrected and uncorrected runs are a clean lesson in measuring the plumbing before the hypothesis.
- All findings generalize beyond conversational data only insofar as the claim layer does; we make no cross-domain claim yet.
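The plumbing bug noted above (a facet written to a key the renderer ignores) is the kind of failure a small end-to-end check can catch before any accuracy run. The sketch below is a hypothetical reconstruction: the key name `shaped_facets`, the renderer, and the example facet text are all invented for illustration and are not the actual harness.

```python
from typing import Dict, List

FACET_KEY = "shaped_facets"  # hypothetical key the renderer reads


def render_prompt(question: str, context: Dict[str, List[str]]) -> str:
    """Toy renderer: facets first, then evidence, then the question."""
    parts = [
        *context.get(FACET_KEY, []),
        *context.get("evidence", []),
        f"Question: {question}",
    ]
    return "\n".join(parts)


def facet_reaches_prompt(context: Dict[str, List[str]], facet: str) -> bool:
    """Plumbing check: does the facet text actually appear in the rendered prompt?"""
    return facet in render_prompt("placeholder question", context)


facet = "Stated preference: vegetarian recipes"  # hypothetical facet text
good = {FACET_KEY: [facet], "evidence": ["session text"]}
bad = {"facets": [facet], "evidence": ["session text"]}  # wrong key, silently dropped

print(facet_reaches_prompt(good, facet), facet_reaches_prompt(bad, facet))
```

Asserting on the rendered prompt, rather than on the context dictionary, is what exposes the mismatch: the miskeyed context looks populated but never reaches the reader.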

The contribution is not a leaderboard number. It is a **law about when a knowledge substrate should answer for the reader** — and the consequent reframing of the substrate's job from *retrieve* to *retrieve, then decide whether to shape*.

---

_See also: the synthesis these results came from, [the honest scorecard](/reports/memory-benchmarks-scorecard); the theory companion [The Bitemporal Sheaf](/papers/bitemporal-sheaf); the full program in the [donto research agenda](/papers/research-agenda)._