# LoCoMo Config C — claims-only recall (does the philosophy win *and* shrink the context?)

_Living document. Started 2026-06-11. Part of the [LoCoMo benchmark](/benchmarks/locomo) coverage curve._

donto's thesis is that an **evidence-anchored, contradiction-preserving claim layer** — not episodic chunk-stuffing — is what should win agent-memory benchmarks. Config C is the cleanest test of that thesis on LoCoMo: **throw away the dialogue entirely and let the reader answer from claims alone.** If the claims carry the answer, two things should happen at once — accuracy holds up, *and* the context collapses toward (or below) Zep's compact-retrieval regime.

**Two results, one lesson.** *Lexical* claims-only scored **0.244** vs episodic's 0.837 — an honest negative that pinned the bottleneck on claim **recall**, not claim existence. So we built **semantic claim recall** (embed every claim, retrieve by meaning + bridging neighbours): it lifts claims-only to **0.391** (+60%, multi-hop 0.05→0.29) at the *same* ~11× context collapse (median ~1,700 tokens vs episodic's 19,029). Still short of episodic, but the bottleneck moved exactly where predicted — better claim retrieval, better answers. Both results below; live per-run AI transcripts on the [LoCoMo board](/benchmarks/locomo).

---

## Update — semantic claim recall (the bottleneck was retrieval)

The lexical verdict ("claims exist but recall drops the answer") had an obvious test: stop retrieving claims by lexical overlap and retrieve them **by meaning**. We:

1. **Embedded every claim** — a new `ClaimProvider` in the embed coordinator (`target='claim'` → `donto_claim_embedding`, bge-small, HNSW cosine); the **distributed fleet** embedded all **55,741** LoCoMo claims (no box load, 0 submit errors).
2. **Recall by ANN, not lexical** — embed the question, nearest-neighbour the *whole conversation's* claim layer (drops the lossy 8-chunk routing entirely), **+ bounded neighbour claims** (claims sharing a subject with a hit) so multi-hop has its bridging facts. Still claim-centric: ~30–40 dated triples, never dialogue.
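
The two recall steps above can be sketched as follows. This is a minimal illustration, not the `ClaimProvider` code: it uses exact brute-force cosine over toy 2-d vectors in place of the HNSW index and bge-small embeddings, and the `Claim` class, the `recall` function, and the `k` / `max_neighbours` parameters are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    obj: str
    vec: tuple  # toy embedding; the real pipeline stores bge-small vectors


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recall(question_vec, claims, k=30, max_neighbours=10):
    """Top-k claims by cosine, plus a bounded set of same-subject neighbours."""
    ranked = sorted(claims, key=lambda c: cosine(question_vec, c.vec), reverse=True)
    hits = ranked[:k]
    hit_subjects = {c.subject for c in hits}
    # Neighbour expansion: claims that missed the top-k but share a subject
    # with a hit, so a multi-hop question can still see its bridging facts.
    neighbours = [c for c in ranked[k:] if c.subject in hit_subjects]
    return hits + neighbours[:max_neighbours]


# Toy data (invented for illustration only).
claims = [
    Claim("melanie", "paintedSunrise", "2022", (1.0, 0.0)),
    Claim("melanie", "hobby", "pottery", (0.0, 1.0)),
    Claim("caroline", "occupation", "counsellor", (0.1, 0.9)),
]
context = recall((1.0, 0.1), claims, k=1, max_neighbours=5)
```

With `k=1` only the sunrise claim is a direct hit; the second Melanie claim comes in through the subject-neighbour arm, while the unrelated Caroline claim stays out.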

**Result (same 160-question stratified sample, clean answers):**

| metric | lexical C | **semantic C** | episodic A | Zep |
|---|---|---|---|---|
| overall accuracy | 0.244 | **0.391** | 0.837 | 0.947 |
| multi-hop | 0.05 | **0.29** | — | — |
| temporal | 0.275 | **0.40** | — | — |
| open-domain | 0.225 | **0.385** | — | — |
| single-hop | 0.425 | **0.475** | — | — |
| median context | 1,690 | **1,705** | 19,029 | 5,760 |
| retrieval p50 | ~10 s | **0.9 s** | — | 0.087 s |

**Every category improves, multi-hop ~6×** — the category lexical recall broke on (0.05) is exactly the one neighbour-expanded semantic recall rescues (0.29). And it's *free on context*: the ~11× collapse holds (~1,700 tok), while retrieval gets **~10× faster** (HNSW ANN beats the per-question chunk-route + lexical sort). The thesis half that's now proven: **the claim layer's value is unlocked by semantic claim recall, and it stays radically token-efficient.** It is still below episodic (0.837) — the remaining gap is recall depth/precision and reader reasoning over terse triples, not the claims' existence.

**Honesty note on the number.** This semantic run logged **0.391 over the 128 questions the reader actually answered**; the raw figure over all 160 was 0.31 because the Hyades gateway (which serves the holo3.1 reader) **destabilised in the run's tail** and blanked 32 responses, all clustered at the end (an empty response scores as wrong). Per our standing rule, *empty-under-load is a transient failure, not an answer*, so the clean 0.391 is the result, with the gateway flap flagged. (The same gateway flap forced the discard of the first lexical attempt; the gateway recovers between runs.)
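
The gap between the two figures is purely a choice of denominator. A minimal sketch, assuming 50 correct answers (inferred from 0.391 × 128 ≈ 50, not a logged count) and a hypothetical `accuracy` helper:

```python
def accuracy(results, count_empties_as_wrong):
    """results: list of True / False verdicts, with None for a blanked response."""
    answered = [r for r in results if r is not None]
    denom = len(results) if count_empties_as_wrong else len(answered)
    return sum(answered) / denom if denom else 0.0


# 50 correct is an inferred count; 128 answered and 32 blanked are from the run.
results = [True] * 50 + [False] * 78 + [None] * 32

raw = accuracy(results, count_empties_as_wrong=True)     # 50 / 160
clean = accuracy(results, count_empties_as_wrong=False)  # 50 / 128
```

Here `raw` is 0.3125 (reported as 0.31) and `clean` is 0.390625 (reported as 0.391).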

---

## The three configs (recap)

| config | reader sees | tests |
|---|---|---|
| **A** — episodic baseline | the recalled dialogue chunks (donto-memory hybrid /recall) | the 0%-claim point |
| **B** — episodic + claims | dialogue **plus** blended claim hits | does adding claims lift accuracy? |
| **C** — claims-only | **only** the claim layer, no dialogue | does the philosophy win *and* collapse context? |

Config A is already run: **258 questions, 0.837 accuracy on the holo3.1 reader, median context 19,029 tokens.** That last number matters — episodic recall on LoCoMo pulls whole sessions, so the context is **~3.3× larger than Zep's 5,760-token median.** That is exactly the inefficiency the claim layer is meant to fix.

---

## What Config C required (and a real finding about the substrate)

Building C surfaced an architectural gap worth recording: **the donto-agent claim layer (`ctx:claims/locomo/*`) is not wired into donto-memory's `/recall`.** Recall's `semantic-claim` module reads memory-native claims, not the extraction output — so a naive "recall the claims" returns only episodic rows. The substrate-wide `/search` doesn't help either: its FTS index covers subject + object names but **not predicates**, and LoCoMo claims encode their meaning in freely-minted predicates (`playsBasketball`, `occupation`, `sensoryExperience`) over often-opaque object IRIs. So neither existing recall path retrieves these claims well.

**The fix — chunk-routed claims recall.** Config C routes through the proven retriever instead of replacing it:

1. Run the holder-scoped hybrid `/recall` (FTS + bge-small vector RRF) to find the **k most relevant chunks** for the question — this gives relevance *and* conversation-scoping for free (LoCoMo QA is per-conversation; cross-conversation leakage would be unfair).
2. Map each recalled chunk to its claim context (`ctx:claims/locomo/<chunk-statement-id>`) and pull those chunks' claims (~190–250 per chunk).
3. Rank the pooled claims by question-token overlap and keep the top 40.
4. Hand the reader **only** those 40 claims — compact `subject · predicate · object` triples, no dialogue.

This is faithful to the benchmark (same retrieval backbone as Config A) while swapping verbose dialogue for structured claims.
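
Steps 3 and 4 can be sketched as below. This is an illustration under stated assumptions, not the `run_locomo.py` code: the tokenizer, the camelCase splitting in `humanize`, and the function names are hypothetical, and the example claims are invented.

```python
import re


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def humanize(predicate):
    # Assumption: split camelCase so `playsBasketball` can match "basketball".
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", predicate)


def rank_claims(question, claims, top_n=40):
    """Rank (subject, predicate, object) triples by question-token overlap."""
    q = tokens(question)

    def overlap(claim):
        s, p, o = claim
        return len(q & tokens(f"{s} {humanize(p)} {o}"))

    return sorted(claims, key=overlap, reverse=True)[:top_n]


def format_for_reader(claims):
    # Compact `subject · predicate · object` lines, no dialogue.
    return "\n".join(" · ".join(c) for c in claims)


pooled = [
    ("Maria", "sensoryExperience", "ocean"),
    ("John", "occupation", "teacher"),
    ("John", "playsBasketball", "true"),
]
top = rank_claims("Does John play basketball?", pooled, top_n=2)
prompt_context = format_for_reader(top)
```

Note the weakness that the per-category results later expose: "play" does not match "plays", and a bridging claim that shares no tokens with the question scores zero and is dropped.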

A second fix fell out of it: the holo3.1 reader call was **non-streaming and timing out** on the gateway (a slow reasoning model on a big prompt trips Cloudflare's ~125 s no-data limit). The reader now **streams (SSE)**, the same fix that went into the production `donto-agent` extractor, so the connection never goes silent. (The token budget is untouched.)
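
A minimal sketch of the consuming side of a streamed reader call. It assumes the gateway emits OpenAI-compatible chat-completion chunks (`data: {...}` lines ending with `data: [DONE]`); that payload shape and the `collect_sse_text` name are assumptions, not confirmed details of the Hyades gateway.

```python
import json


def collect_sse_text(lines):
    """Join the text deltas from an iterable of SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and ':' comment / keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        delta = event["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")
    return "".join(parts)


# Simulated stream; a real client would iterate over the HTTP response lines.
fake_stream = [
    ": keep-alive",
    'data: {"choices": [{"delta": {"content": "May "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "7"}}]}',
    "data: [DONE]",
]
answer = collect_sse_text(fake_stream)
```

Because bytes arrive as soon as the model starts emitting tokens, the connection is not idle for the whole generation, which is what keeps it under the proxy's no-data limit.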

---

## Result so far: the context collapses ~11×

Measured on the Config C retrieval path (gateway-independent — this is prompt construction, not the reader):

| config | median context tokens | vs Zep (5,760) | accuracy |
|---|---|---|---|
| **A — episodic** | **19,029** | 3.3× larger | **0.837** (258 Q) |
| **C — claims-only** | **1,690** | **0.29× — 3.4× smaller** | **0.244** (160 Q, 0 empties) |
| _Zep reference_ | 5,760 | — | 0.947 |

**The context collapses ~11× (19,029 → 1,690 tokens, ~3.4× below Zep's median) — but so does accuracy (0.837 → 0.244).** Claims-only recall, *as currently built*, does **not** win: trading whole dialogue sessions for ~40 anchored triples per question keeps a fraction of the tokens but loses most of the answers. The honest verdict on this configuration: **the claim layer's existence is not sufficient — recalling and presenting the *right* claims is the unsolved part.**
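
The ratios in the table and the paragraph above follow directly from the measured medians. A small worked check (the variable names are illustrative):

```python
EPISODIC_TOKENS = 19_029
CLAIMS_ONLY_TOKENS = 1_690
ZEP_TOKENS = 5_760

EPISODIC_ACC = 0.837
CLAIMS_ONLY_ACC = 0.244

collapse = EPISODIC_TOKENS / CLAIMS_ONLY_TOKENS   # ~11.3x fewer tokens than episodic
below_zep = ZEP_TOKENS / CLAIMS_ONLY_TOKENS       # ~3.4x fewer tokens than Zep
episodic_vs_zep = EPISODIC_TOKENS / ZEP_TOKENS    # ~3.3x more tokens than Zep

token_share = CLAIMS_ONLY_TOKENS / EPISODIC_TOKENS  # ~0.09 of episodic's tokens
accuracy_share = CLAIMS_ONLY_ACC / EPISODIC_ACC     # ~0.29 of episodic's accuracy
```

Put together: lexical claims-only keeps roughly 29% of episodic's accuracy on roughly 9% of its tokens.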

(Reader+judge = holo3.1 on Hyades, Zep's gpt-5.4-class methodology. Scored on a stratified 160-question sample, 40 per category, **0 empty responses** — a clean run. An earlier attempt was discarded: the gateway died mid-run, blanking the back half, and the claims were fed without session dates — see "what went wrong first," below.)
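
For reference, a stratified sample of this shape (40 per category) can be drawn as sketched below. The sampling code actually used is not shown in this document, so the function name, the `category` field, and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(questions, per_category=40, seed=0):
    """Draw up to `per_category` questions from each category, reproducibly."""
    by_category = defaultdict(list)
    for q in questions:
        by_category[q["category"]].append(q)
    rng = random.Random(seed)  # fixed seed so every config sees the same questions
    sample = []
    for category in sorted(by_category):
        pool = by_category[category]
        sample.extend(rng.sample(pool, min(per_category, len(pool))))
    return sample


# Synthetic question pool, only to exercise the function.
categories = ["single-hop", "temporal", "open-domain", "multi-hop"]
pool = [{"id": f"{c}-{i}", "category": c} for c in categories for i in range(60)]
sample = stratified_sample(pool)
```

Fixing the seed matters for the comparisons in this report: every recall variant has to be scored on the same 160 questions for the per-category deltas to mean anything.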

### Per-category — where claims-only breaks

| category | claims-only (C) | n | reading |
|---|---|---|---|
| single-hop | **0.425** | 40 | best — one fact, often present in the top-40 claims |
| temporal | **0.275** | 40 | the session-date tag (added after v1) lifts this from 0.175 |
| open-domain | **0.225** | 40 | broad questions need more than 40 triples |
| **multi-hop** | **0.050** | 40 | worst — chaining across claims fails; terse triples lose the connective reasoning that dialogue carries |

The shape is the finding: **claims-only degrades smoothly with reasoning depth**, decent on single-fact lookup and near-zero on multi-hop. That points squarely at the bottleneck. It is not whether the facts exist (they do: 272/272 chunks, ~52K anchored claims), but **claim retrieval + ranking** (an 8-chunk route + top-40 lexical-overlap rank drops the bridging facts) and the reader's ability to chain bare triples. This is steering signal, not a grade: the differentiator that needs building is **semantic claim recall** (claim embeddings + multi-hop-aware ranking), not more extraction.

### What went wrong first (and why the number is trustworthy now)

The first scored run read 0.156 — and was thrown out, correctly. Two faults: (1) the Hyades gateway recovered from its 502 outage, ran ~76 questions, then **failed again mid-run**, blanking the entire back half (43/160 empty, all clustered at the end — empty answers score as wrong); and (2) the reader was fed **dateless** triples, so every temporal question was blind. Fixing both — session-date tags from the dataset (the claims' own `valid_time` is unbounded `(,)`, so the date had to come from the session), and a fully stable gateway — gives the clean **0.244 / 0 empties** above. The lesson is the standing one: *empty-under-load is a transient failure, never an answer.*

---

## Update 2 — hybrid recall + the honest scaling caveat

The 0.391 cosine run had a flaw I only caught later: **32 of its 160 answers were gateway-blanked** (clustered in the tail) and silently scored as wrong — so its "0.391" was really *over the 128 it actually answered*. Two fixes followed.

**(a) The harness now refuses to record a blank.** A reader/judge failure (empty or ungraded) is a gateway fault, not a wrong answer — so it's no longer written; a resumable loop rides out the flap and re-attempts every question until it gets a real graded answer. **Result: 0 empties, every question scored for real.** This is the methodology fix that makes any number trustworthy.
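
A minimal sketch of that loop, assuming a hypothetical `ask` callable that returns `True` / `False` for a graded answer and `None` for a gateway fault. The `max_rounds` bound is an addition for safety in this sketch; the harness described above keeps retrying until every question is graded.

```python
import time


def score_all(question_ids, ask, results=None, max_rounds=50, backoff_s=0.0):
    """Re-attempt unanswered questions until each has a real graded verdict."""
    results = {} if results is None else results  # pass a saved dict to resume
    for _ in range(max_rounds):
        pending = [q for q in question_ids if q not in results]
        if not pending:
            break
        for q in pending:
            verdict = ask(q)
            if verdict is not None:   # a blank is never written
                results[q] = verdict
        time.sleep(backoff_s)         # let the gateway recover between rounds
    return results


# Simulated flaky gateway: every question fails once, then answers.
attempts = {}


def flaky_ask(q):
    attempts[q] = attempts.get(q, 0) + 1
    return None if attempts[q] == 1 else (q % 2 == 0)


scores = score_all(range(6), flaky_ask)
```

Because `results` only ever contains graded verdicts, accuracy computed from it cannot be silently deflated by a gateway flap.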

**(b) Recall went hybrid.** The failure analysis showed **61% of misses were a *ranking* gap** — the answer-claim existed in the conversation but pure-cosine top-K didn't surface it. So recall now fuses **vector (bge-small cosine) + lexical (FTS over the humanized triple) + entity-neighbour** arms by Reciprocal-Rank Fusion. On a controlled **A/B over 32 matched, fully-answered questions, reader held byte-identical**, hybrid beat cosine **0.438 → 0.563**.
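
Reciprocal-Rank Fusion itself is small enough to show in full. This sketch assumes the conventional constant `k=60`; the constant used in the actual run is not stated here, and the claim ids are placeholders.

```python
def rrf(rankings, k=60, top_n=40):
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


vector_arm = ["c1", "c2", "c3"]   # cosine order
lexical_arm = ["c3", "c1", "c4"]  # FTS order
entity_arm = ["c2", "c5"]         # entity-neighbour order

fused = rrf([vector_arm, lexical_arm, entity_arm])
```

RRF only looks at ranks, never raw scores, so an arm that ranks noise highly contributes as much as an arm that ranks signal highly. That property is relevant to how the fused list behaves at scale.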

**The honest caveat: that A/B win did *not* scale — and the clean comparison reversed it.** I re-ran **both** cosine and hybrid to completion on the same full 160 (0 empties each, reader identical). The result:

| recall (claims-only, 160 Q clean) | overall | multi-hop | temporal | open-domain | single-hop |
|---|---|---|---|---|---|
| **cosine** (pure semantic ANN) | **0.362** | 0.20 | **0.50** | 0.275 | **0.475** |
| hybrid (vector+lexical+entity RRF) | 0.325 | 0.225 | 0.425 | 0.275 | 0.375 |

**Cosine beats hybrid.** The hybrid change was **net-negative at scale** — the lexical and entity-neighbour arms add noise that dilutes cosine's good ranking (single-hop fell 0.475→0.375, temporal 0.50→0.425) for only a marginal multi-hop gain (0.20→0.225). The 32-question A/B's +29% was small-sample optimism on the easy slice. **Lesson, recorded honestly: the ranking-method lever is tapped out — pure semantic claim recall (0.362) is the claims-only baseline, and we reverted to it.** The remaining headroom is not in *how* we rank claims; it's in the claims themselves (temporal `valid_time`, extraction depth) and the reader's ability to reason over terse triples.

**Temporal was never going to move with ranking, so we fixed it at the substrate.** Hybrid recall did *not* improve temporal (it slipped 0.50→0.425 in the full-160 comparison above, which is expected: it's not a ranking problem), and I refused to fix it with a reader hack. The real cause: every temporal question asks for the **event** date, but claims only carried the **session (speech)** date, and the dialogue's relative phrase ("yesterday", "last year") was never resolved. The fix is donto-native: **populate `valid_time`**. A GLM enrichment pass resolves each event to an absolute time against the session anchor and writes an `occurredAt` claim (minute-precise literal) plus a day-granular `valid_time`. Coverage so far: 239/272 chunks, **762 `occurredAt` claims**, embedded so semantic recall surfaces them.

---

## Update 3 — populating `valid_time` lifts temporal +45% (the bitemporal differentiator)

Re-ran cosine recall on the same full 160 with the temporal claims now in the pool. Reader identical, 0 empties:

| claims-only, cosine recall (160 Q clean) | overall | multi-hop | **temporal** | open-domain | single-hop |
|---|---|---|---|---|---|
| cosine baseline | 0.362 | 0.20 | 0.50 | 0.275 | 0.475 |
| **cosine + `valid_time`** | **0.412** | 0.20 | **0.725** | 0.25 | 0.475 |

**Temporal jumps 0.50 → 0.725 (+45% relative)**, carrying overall claims-only **0.362 → 0.412**. Multi-hop and single-hop are unchanged and open-domain moves by a single question (0.275 → 0.25 on n=40), which is the signature of a clean, targeted fix (no reader change, no leakage into other modes). The transcripts confirm the mechanism, not luck. The two flagship failures now pass: *"When did Melanie paint a sunrise?"* → **2022** ✓ (resolved "last year") and *"When did Caroline go to the LGBTQ support group?"* → **May 7** ✓ (the off-by-one "yesterday" is gone). Each passes because recall now surfaces a `… occurredAt 2022 / 2023-05-07` claim the reader reads directly.

**This is the thesis made concrete.** Temporal is where a contradiction-preserving **bitemporal** substrate *should* beat episodic-stuffing — and it does, once `valid_time` is actually populated. The win isn't a prompt trick or a retrieval tweak; it's donto holding *when an event was true in the world* as first-class state. Episodic chunk-stuffing has no equivalent — it can only re-read the dialogue and re-derive the date every time (and usually grabs the speech date, as our own pre-fix reader did).

**Honest residuals** (the next temporal lever): the remaining temporal misses are (a) **resolver precision** — "the Sunday before 25 May" resolved to May 20 (a day off), one event mis-resolved to 2019; and (b) **reader disambiguation** — when both a session-date claim and a resolved `occurredAt` claim are present, the reader sometimes still picks the session date. Both are tractable: a deterministic date resolver (vs the LLM doing calendar arithmetic) for (a), and surfacing `valid_time` more prominently than the speech-date tag for (b).
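
A minimal sketch of what a deterministic resolver for (a) could look like, covering only the three phrase shapes mentioned in this report. The `resolve` function, its phrase grammar, and the session dates used below are hypothetical; in particular the year 2023 for the "Sunday before 25 May" case is an assumption taken from the other dated example.

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]  # matches date.weekday(): Monday = 0
MONTHS = {m: i for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}


def resolve(phrase, session_date):
    """Resolve a relative phrase against the session date; None if unsupported."""
    p = phrase.lower().strip()
    if p == "yesterday":
        return (session_date - timedelta(days=1)).isoformat()
    if p == "last year":
        return str(session_date.year - 1)  # year granularity only
    m = re.fullmatch(r"the (\w+) before (\d{1,2}) (\w+)", p)
    if m and m.group(1) in WEEKDAYS and m.group(3) in MONTHS:
        # Assumption: the named day falls in the session's year.
        anchor = date(session_date.year, MONTHS[m.group(3)], int(m.group(2)))
        back = (anchor.weekday() - WEEKDAYS.index(m.group(1))) % 7 or 7
        return (anchor - timedelta(days=back)).isoformat()
    return None


sunday = resolve("the Sunday before 25 May", date(2023, 6, 1))
```

Under the 2023 assumption, 25 May is a Thursday, so calendar arithmetic gives 21 May rather than the 20 May the LLM resolver produced. Unsupported phrases return `None`, which leaves room to fall back to the LLM pass instead of guessing.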

---

## Update 4 — representation: passages beat triples, and the cheap-regime ceiling

The triple-based claim layer topped out around **0.41** on the free reader. The question we then asked — *is the reader the limit, or are we not delivering the fact?* — turned out to be the right one. Two reader-independent measurements settled it:

- **Coverage** (is the exact answer-bearing passage even in what the reader sees?): at top-20 it was missing **27%** of the time. Not a reasoning problem — a delivery problem.
- **Representation**: a weak reader reasons over **natural prose**, not terse triples.
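
The coverage measurement can be sketched as below. It assumes each question comes with a set of gold answer-bearing passage ids; the function name and the data shapes are illustrative.

```python
def coverage_at_k(retrieved, gold, k=20):
    """Fraction of questions with at least one gold passage in the top-k.

    retrieved: {question_id: [passage_id, ...]} in rank order
    gold:      {question_id: {passage_id, ...}} answer-bearing passages
    """
    if not gold:
        return 0.0
    covered = sum(
        1 for qid, gold_ids in gold.items()
        if gold_ids & set(retrieved.get(qid, [])[:k])
    )
    return covered / len(gold)


# Toy data: only q1 has its gold passage retrieved, and only at rank 2.
retrieved = {"q1": ["p3", "p9"], "q2": ["p1"], "q3": []}
gold = {"q1": {"p9"}, "q2": {"p7"}, "q3": {"p2"}}

cov_20 = coverage_at_k(retrieved, gold, k=20)
cov_1 = coverage_at_k(retrieved, gold, k=1)
```

The metric never calls the reader, which is what makes it reader-independent: a miss here is a delivery failure regardless of how strong the reader is.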

So we switched the return from claim-triples to **dialogue passages** (the general principle: fine-grained passage recall, corpus-agnostic — sentences/passages of any document, not "speaker turns"). Reader held identical, full 160, 0 empties:

| return (free reader, full 160) | overall | multi-hop | temporal | open-domain | single-hop | tokens |
|---|---|---|---|---|---|---|
| claim-triples (cosine) | 0.362 | 0.20 | 0.50 | 0.275 | 0.475 | 1.7K |
| **passages (top-40)** | **0.650** | 0.45 | 0.825 | 0.475 | 0.85 | **2.8K** |
| _episodic (whole sessions)_ | _0.837_ | — | — | — | — | _19K_ |

**Passage recall is +80% over claim-triples and reaches ~78% of episodic's accuracy at ~15% of its tokens** — the high-accuracy/minimal-token target. The two strong categories (single-hop 0.85, temporal 0.825) are near episodic; the gap is the **reasoning-heavy** categories (multi-hop 0.45, open-domain 0.475).

Then we hit a **ceiling that maps cleanly onto the [single-shot-vs-multi-scope analysis](/reports/single-shot-vs-agentic-memory)** — every attempt to push past passage-recall, on the *cheap* regime (free reader, cosine retrieval), failed for an instructive reason:

| lever (24-Q samples, vs passages 0.625) | result | why |
|---|---|---|
| **adjacency** (±1 neighbour passages) | 0.662 full-160, but +2.3× tokens; **hurt open-domain** | coverage 82→89% but lossy conversion; more passages = more distractors for inference Qs |
| **entity-claim aggregation** (gather all of an entity's claims) | multi-hop 0.50→**0.67** but temporal **→0.0** | a single scope can't serve all categories — fixes aggregation, loses the dated passages |
| **multi-scope + cosine rerank** (passage + entity scopes) | **0.583 ↓** | cosine rerank sorts the *dissimilar-but-relevant* aggregation passages back **out** — it collapses multi-scope to semantic-only |

The third row is the key negative: **multi-scope composition is real (it's how Zep gets 86.5% → 94.7%), but it needs a *cross-encoder* reranker** that scores (query, passage) *relevance*, not a bi-encoder cosine that only measures *similarity*. We could not run one — `fastembed`'s reranker registry is broken in this environment (bi-encoders download, cross-encoders don't) and installing `torch`/`sentence-transformers` isn't justified on this box. So that lever is **environment-blocked, not disproven**.
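
The collapse is easy to reproduce on toy data. In this sketch the 2-d vectors are invented so that the entity-scope passages are relevant in intent but dissimilar to the query; the scope names and the `cosine_rerank` function are illustrative.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cosine_rerank(query_vec, scopes, top_n):
    """Pool every scope's passages, then keep the top_n by cosine to the query."""
    pooled = [(pid, name, vec)
              for name, items in scopes.items()
              for pid, vec in items]
    pooled.sort(key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return [(pid, name) for pid, name, _ in pooled[:top_n]]


scopes = {
    "passage": [("p1", (0.9, 0.1)), ("p2", (0.8, 0.2)), ("p3", (0.7, 0.3))],
    "entity": [("e1", (0.2, 0.9)), ("e2", (0.1, 1.0))],
}
kept = cosine_rerank((1.0, 0.0), scopes, top_n=3)
surviving_scopes = {name for _, name in kept}
```

Every slot goes to the passage scope and the entity scope vanishes: the rerank has turned a two-scope pool back into semantic-only recall.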

**The honest conclusion for the constrained regime** (free `holo3.1` reader, minimal tokens, cosine-only retrieval): **simple passage recall (0.650) is the ceiling.** Sophistication — adjacency, entity aggregation, multi-scope — does not convert to accuracy without (a) a **cross-encoder reranker** (to compose scopes) and (b) a **stronger reader** (to use the richer context). That is exactly the regime Zep operates in: gpt-5.4 reader + cross-encoder rerank + multi-scope composition. Our 0.650 single-scope number is honestly closest to Zep's *auto-search* (single-call) regime (86.5% — with their far stronger reader), **not** the 94.7% multi-scope headline.

**What this maps for the program:** the remaining accuracy lives behind two doors we've deliberately kept shut here — the strong reader (codex, declined: budget) and the cross-encoder reranker (env-blocked) — plus the **agentic track** ([why](/reports/single-shot-vs-agentic-memory)), where the reader fetches the scattered facts itself and which is the natural fix for multi-hop aggregation. None of those are cheats or conversation tricks; they're the next real builds.

---

## Status & next

- ✅ **Config C harness built + scored** — `run_locomo.py --claims-only` (chunk-routed, conversation-scoped, **dated** top-40 claims) + a streaming reader. Clean 160-Q stratified run, 0 empties.
- ✅ **Context-collapse confirmed** — ~11× smaller than episodic, ~3.4× smaller than Zep.
- ✅ **Honest verdict recorded** — claims-only **0.244** vs episodic **0.837**: this configuration trades too much accuracy for the token saving. The win, if it exists, is in **better claim recall**, not raw claim count.
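- ✅ **Semantic claim recall**: the obvious next lever, now built. Individual claims are embedded and ranked by vector similarity + neighbour bridging (not lexical overlap), and the 8-chunk route is dropped in favour of the whole conversation's claim layer. Config C re-scored at **0.362** on the clean full-160 run (see Update 2).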
- ⬜ **Config B** — episodic **+** claims (the augmentation hypothesis): does adding claims to the full dialogue *lift* episodic's 0.837? This is likelier to show the claim layer's value than claims-only.
- ⬜ **Zep-comparable pass** — once a claim-recall config is competitive, the gpt-5.4 reader pass for the headline vs Zep 0.947.

The framing, per donto's benchmark philosophy: this is an **instrument for building donto**, not a scoreboard. A clean negative — claims-only loses accuracy at an 11× context cut, worst on multi-hop — is exactly the reading that says *build semantic claim recall next.* The claims are there and anchored; the substrate just can't yet surface the right few for a reasoning chain.

_See also: [LoCoMo run checklist](/benchmarks/locomo) · [Are we using Hyades effectively?](/reports/hyades-extraction-effectiveness) · [all benchmarks](/benchmarks)._