LoCoMo Config C — claims-only recall (does the philosophy win and shrink the context?)

Living document. Started 2026-06-11. Part of the LoCoMo benchmark coverage curve.

donto's thesis is that an evidence-anchored, contradiction-preserving claim layer — not episodic chunk-stuffing — is what should win agent-memory benchmarks. Config C is the cleanest test of that thesis on LoCoMo: throw away the dialogue entirely and let the reader answer from claims alone. If the claims carry the answer, two things should happen at once — accuracy holds up, and the context collapses toward (or below) Zep's compact-retrieval regime.

Two results, one lesson. Lexical claims-only scored 0.244 vs episodic's 0.837 — an honest negative that pinned the bottleneck on claim recall, not claim existence. So we built semantic claim recall (embed every claim, retrieve by meaning + bridging neighbours): it lifts claims-only to 0.391 (+60%, multi-hop 0.05→0.29) at the same ~11× context collapse (median ~1,700 tokens vs episodic's 19,029). Still short of episodic, but the bottleneck moved exactly where predicted — better claim retrieval, better answers. Both results below; live per-run AI transcripts on the LoCoMo board.

Update — semantic claim recall (the bottleneck was retrieval)

The lexical verdict ("claims exist but recall drops the answer") had an obvious test: stop retrieving claims by lexical overlap and retrieve them by meaning. We:

Embedded every claim — a new ClaimProvider in the embed coordinator (target='claim' → donto_claim_embedding, bge-small, HNSW cosine); the distributed fleet embedded all 55,741 LoCoMo claims (no box load, 0 submit errors).
Recall by ANN, not lexical — embed the question, nearest-neighbour the whole conversation's claim layer (drops the lossy 8-chunk routing entirely), + bounded neighbour claims (claims sharing a subject with a hit) so multi-hop has its bridging facts. Still claim-centric: ~30–40 dated triples, never dialogue.
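
The recall shape above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not donto's code: the vectors are toy two-dimensional values, a brute-force cosine scan stands in for the HNSW index, and `Claim`, `recall_claims`, `k`, and `max_neighbours` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    obj: str
    vec: tuple  # toy embedding; the real layer uses bge-small vectors


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recall_claims(question_vec, claims, k=3, max_neighbours=2):
    """Top-k claims by cosine, plus a bounded set of same-subject neighbours."""
    ranked = sorted(claims, key=lambda c: cosine(question_vec, c.vec), reverse=True)
    hits = ranked[:k]
    hit_subjects = {c.subject for c in hits}
    # Neighbours are claims that missed the top-k but share a subject with a hit.
    neighbours = [c for c in ranked[k:] if c.subject in hit_subjects][:max_neighbours]
    return hits + neighbours


claims = [
    Claim("Melanie", "paints", "sunrise", (1.0, 0.0)),
    Claim("Melanie", "occupation", "artist", (0.0, 1.0)),
    Claim("Caroline", "attends", "support group", (0.1, 0.9)),
]
picked = recall_claims((1.0, 0.1), claims, k=1, max_neighbours=1)
print([c.predicate for c in picked])
```

In this toy data the Melanie occupation claim is less similar to the question than the Caroline claim, yet it is the one that comes back, because it shares a subject with the top hit. That is the bridging behaviour multi-hop questions depend on.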

Result (same 160-question stratified sample, clean answers):

metric	lexical C	semantic C	episodic A	Zep
overall accuracy	0.244	0.391	0.837	0.947
multi-hop	0.05	0.29	—	—
temporal	0.275	0.40	—	—
open-domain	0.225	0.385	—	—
single-hop	0.425	0.475	—	—
median context	1,690	1,705	19,029	5,760
retrieval p50	~10 s	0.9 s	—	0.087 s

Every category improves, multi-hop ~6× — the category lexical recall broke on (0.05) is exactly the one neighbour-expanded semantic recall rescues (0.29). And it's free on context: the ~11× collapse holds (~1,700 tok), while retrieval gets ~10× faster (HNSW ANN beats the per-question chunk-route + lexical sort). The thesis half that's now proven: the claim layer's value is unlocked by semantic claim recall, and it stays radically token-efficient. It is still below episodic (0.837) — the remaining gap is recall depth/precision and reader reasoning over terse triples, not the claims' existence.

Honesty note on the number. This semantic run logged 0.391 over the 128 questions the reader actually answered. The raw figure over all 160 was 0.31, because the Hyades gateway (the holo3.1 reader) destabilised in the run's tail and blanked 32 responses, all clustered at the end, and an empty response scores as wrong. Per our standing rule, empty-under-load is a transient failure, not an answer, so the clean 0.391 is the result, with the gateway flap flagged. (Same gateway-flap caveat as the lexical run; the gateway recovers between runs.)
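
The two figures differ only in the denominator. The sketch below shows this, assuming 50 correct answers, a count inferred from 0.391 × 128 rather than taken from the run logs. The `accuracy` helper is hypothetical.

```python
def accuracy(results, count_empty_as_wrong):
    """results: True (correct), False (wrong), or None (blank response)."""
    answered = [r for r in results if r is not None]
    correct = sum(1 for r in answered if r)
    denom = len(results) if count_empty_as_wrong else len(answered)
    return correct / denom if denom else 0.0


# 160 questions: 128 answered, 32 blanked. The 50 correct is an inferred count.
results = [True] * 50 + [False] * 78 + [None] * 32
clean = accuracy(results, count_empty_as_wrong=False)
raw = accuracy(results, count_empty_as_wrong=True)
print(f"clean={clean:.3f} raw={raw:.2f}")
```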

The three configs (recap)

config	reader sees	tests
A — episodic baseline	the recalled dialogue chunks (donto-memory hybrid /recall)	the 0%-claim point
B — episodic + claims	dialogue plus blended claim hits	does adding claims lift accuracy?
C — claims-only	only the claim layer, no dialogue	does the philosophy win and collapse context?

Config A is already run: 258 questions, 0.837 accuracy on the holo3.1 reader, median context 19,029 tokens. That last number matters — episodic recall on LoCoMo pulls whole sessions, so the context is ~3.3× larger than Zep's 5,760-token median. That is exactly the inefficiency the claim layer is meant to fix.

What Config C required (and a real finding about the substrate)

Building C surfaced an architectural gap worth recording: the donto-agent claim layer (ctx:claims/locomo/*) is not wired into donto-memory's /recall. Recall's semantic-claim module reads memory-native claims, not the extraction output — so a naive "recall the claims" returns only episodic rows. The substrate-wide /search doesn't help either: its FTS index covers subject + object names but not predicates, and LoCoMo claims encode their meaning in freely-minted predicates (playsBasketball, occupation, sensoryExperience) over often-opaque object IRIs. So neither existing recall path retrieves these claims well.

The fix — chunk-routed claims recall. Config C routes through the proven retriever instead of replacing it:

Run the holder-scoped hybrid /recall (FTS + bge-small vector RRF) to find the k most relevant chunks for the question — this gives relevance and conversation-scoping for free (LoCoMo QA is per-conversation; cross-conversation leakage would be unfair).
Map each recalled chunk to its claim context (ctx:claims/locomo/<chunk-statement-id>) and pull those chunks' claims (~190–250 per chunk).
Rank the pooled claims by question-token overlap and keep the top 40.
Hand the reader only those 40 claims — compact subject · predicate · object triples, no dialogue.
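
Steps 3 and 4 can be sketched as follows. The tokenizer, the camelCase splitting, and the tie-breaking rule are assumptions made for illustration; the harness's actual overlap function is not shown in this document, and `tokens`, `rank_claims`, and `render` are hypothetical names.

```python
import re


def tokens(text):
    """Lowercase word tokens; splits camelCase predicates such as playsBasketball."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return set(re.findall(r"[a-z0-9]+", spaced.lower()))


def rank_claims(question, claims, keep=40):
    """claims: (subject, predicate, object) triples pooled from the recalled chunks."""
    q = tokens(question)
    scored = [(len(q & tokens(" ".join(c))), i, c) for i, c in enumerate(claims)]
    scored.sort(key=lambda t: (-t[0], t[1]))  # overlap descending, input order on ties
    return [c for _, _, c in scored[:keep]]


def render(claims):
    """Compact 'subject · predicate · object' lines, no dialogue."""
    return "\n".join(" · ".join(c) for c in claims)


pooled = [
    ("Mary", "occupation", "nurse"),
    ("John", "occupation", "teacher"),
    ("John", "playsBasketball", "true"),
]
top = rank_claims("Does John play basketball?", pooled, keep=2)
print(render(top))
```

A ranking like this only rewards surface-token matches, so a bridging claim that shares no words with the question is dropped. That is consistent with the multi-hop collapse reported in the per-category results.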

This is faithful to the benchmark (same retrieval backbone as Config A) while swapping verbose dialogue for structured claims.

A second fix fell out of it: the holo3.1 reader call was non-streaming and timing out on the gateway (a slow reasoning model on a big prompt trips Cloudflare's ~125s no-data limit). The reader now streams (SSE) — the same fix this session put into the production donto-agent extractor — so the connection never goes silent. (The token budget is untouched.)

Result so far: the context collapses ~11×

Measured on the Config C retrieval path (gateway-independent — this is prompt construction, not the reader):

config	median context tokens	vs Zep (5,760)	accuracy
A — episodic	19,029	3.3× larger	0.837 (258 Q)
C — claims-only	1,690	0.29× — 3.4× smaller	0.244 (160 Q, 0 empties)
Zep reference	5,760	—	0.947

The context collapses ~11× (19,029 → 1,690 tokens, ~3.4× below Zep's median) — but so does accuracy (0.837 → 0.244). Claims-only recall, as currently built, does not win: trading whole dialogue sessions for ~40 anchored triples per question keeps a fraction of the tokens but loses most of the answers. The honest verdict on this configuration: the claim layer's existence is not sufficient — recalling and presenting the right claims is the unsolved part.
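
The ratios quoted here follow directly from the three median token counts in the table. A quick check, with nothing assumed beyond those medians:

```python
episodic, claims_only, zep = 19_029, 1_690, 5_760

collapse = episodic / claims_only          # episodic vs claims-only
below_zep = zep / claims_only              # how far claims-only sits under Zep
episodic_vs_zep = episodic / zep           # episodic overhead relative to Zep

print(f"{collapse:.1f}x, {below_zep:.1f}x, {episodic_vs_zep:.1f}x")
```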

(Reader+judge = holo3.1 on Hyades, Zep's gpt-5.4-class methodology. Scored on a stratified 160-question sample, 40 per category, 0 empty responses — a clean run. An earlier attempt was discarded: the gateway died mid-run, blanking the back half, and the claims were fed without session dates — see "what went wrong first," below.)

Per-category — where claims-only breaks

category	claims-only (C)	n	reading
single-hop	0.425	40	best — one fact, often present in the top-40 claims
temporal	0.275	40	the session-date tag (added after v1) lifts this from 0.175
open-domain	0.225	40	broad questions need more than 40 triples
multi-hop	0.050	40	worst — chaining across claims fails; terse triples lose the connective reasoning that dialogue carries

The shape is the finding: claims-only degrades smoothly with reasoning depth — decent on single-fact lookup, near-zero on multi-hop. That points squarely at the bottleneck: not whether the facts exist (272/272 chunks, ~52K anchored claims do), but claim retrieval + ranking (an 8-chunk route + top-40 lexical-overlap rank drops the bridging facts) and the reader's ability to chain bare triples. This is steering signal, not a grade: the differentiator that needs building is semantic claim recall (claim embeddings + multi-hop-aware ranking), not more extraction.

What went wrong first (and why the number is trustworthy now)

The first scored run read 0.156 — and was thrown out, correctly. Two faults: (1) the Hyades gateway recovered from its 502 outage, ran ~76 questions, then failed again mid-run, blanking the entire back half (43/160 empty, all clustered at the end — empty answers score as wrong); and (2) the reader was fed dateless triples, so every temporal question was blind. Fixing both — session-date tags from the dataset (the claims' own valid_time is unbounded (,), so the date had to come from the session), and a fully stable gateway — gives the clean 0.244 / 0 empties above. The lesson is the standing one: empty-under-load is a transient failure, never an answer.

Update 2 — hybrid recall + the honest scaling caveat

The 0.391 cosine run had a flaw I only caught later: 32 of its 160 answers were gateway-blanked (clustered in the tail). Scored as wrong, they pull the all-160 figure down to 0.31; the reported 0.391 covers only the 128 questions the reader actually answered. Two fixes followed.

(a) The harness now refuses to record a blank. A reader/judge failure (empty or ungraded) is a gateway fault, not a wrong answer — so it's no longer written; a resumable loop rides out the flap and re-attempts every question until it gets a real graded answer. Result: 0 empties, every question scored for real. This is the methodology fix that makes any number trustworthy.
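
One way to get the "never record a blank" behaviour is a loop that persists only graded results and backs off on anything else. The sketch below is an assumption about shape, not the harness itself: `run_resumable`, the result dictionary keys, the JSON results file, and the exponential backoff are all hypothetical.

```python
import json
import os
import time


def run_resumable(questions, ask, path, max_sleep=60.0, sleep=time.sleep):
    """Re-attempt each question until `ask` returns a graded result.

    questions: dict of question id -> question text
    ask: callable returning {"answer": str, "score": 0 or 1}, or None on failure
    path: JSON file of completed results, so an interrupted run can resume
    """
    done = {}
    if os.path.exists(path):
        with open(path) as f:
            done = json.load(f)
    pending = [qid for qid in questions if qid not in done]
    delay = 1.0
    while pending:
        qid = pending[0]
        result = ask(questions[qid])
        if not result or not result.get("answer") or result.get("score") is None:
            sleep(delay)  # gateway fault: back off and write nothing
            delay = min(delay * 2, max_sleep)
            continue
        done[qid] = result
        with open(path, "w") as f:
            json.dump(done, f)  # persist only real, graded answers
        pending.pop(0)
        delay = 1.0
    return done
```

As written, the loop never gives up on a question, which matches the stated policy but means a permanently dead gateway stalls the run instead of producing a number.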

(b) Recall went hybrid. The failure analysis showed 61% of misses were a ranking gap — the answer-claim existed in the conversation but pure-cosine top-K didn't surface it. So recall now fuses vector (bge-small cosine) + lexical (FTS over the humanized triple) + entity-neighbour arms by Reciprocal-Rank Fusion. On a controlled A/B over 32 matched, fully-answered questions, reader held byte-identical, hybrid beat cosine 0.438 → 0.563.
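
For reference, Reciprocal-Rank Fusion over the three arms looks like the sketch below. The constant `k=60` is the value commonly used for RRF and is an assumption here, as are the function name and the alphabetical tie-break; the document does not state the harness's settings.

```python
def rrf_fuse(rankings, k=60, top_n=40):
    """Fuse ranked lists of claim ids: score(id) = sum over arms of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, claim_id in enumerate(ranking, start=1):
            scores[claim_id] = scores.get(claim_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_n]


vector_arm = ["a", "b", "c"]
lexical_arm = ["c", "a", "d"]
entity_arm = ["c", "e"]
fused = rrf_fuse([vector_arm, lexical_arm, entity_arm])
print(fused)
```

In this toy input, "c" ranks only third in the vector arm but wins the fused list because two other arms put it first. RRF lets weaker arms outvote a good vector ranking in the same way, which is one plausible reading of the dilution reported in the full-160 comparison.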

The honest caveat: that A/B win did not scale — and the clean comparison reversed it. I re-ran both cosine and hybrid to completion on the same full 160 (0 empties each, reader identical). The result:

recall (claims-only, 160 Q clean)	overall	multi-hop	temporal	open-domain	single-hop
cosine (pure semantic ANN)	0.362	0.20	0.50	0.275	0.475
hybrid (vector+lexical+entity RRF)	0.325	0.225	0.425	0.275	0.375

Cosine beats hybrid. The hybrid change was net-negative at scale — the lexical and entity-neighbour arms add noise that dilutes cosine's good ranking (single-hop fell 0.475→0.375, temporal 0.50→0.425) for only a marginal multi-hop gain (0.20→0.225). The 32-question A/B's +29% was small-sample optimism on the easy slice. Lesson, recorded honestly: the ranking-method lever is tapped out — pure semantic claim recall (0.362) is the claims-only baseline, and we reverted to it. The remaining headroom is not in how we rank claims; it's in the claims themselves (temporal valid_time, extraction depth) and the reader's ability to reason over terse triples.

Temporal did not improve, as expected, and we fixed it at the substrate. Hybrid recall did not lift temporal (it slipped 0.50 → 0.425), which fits the diagnosis that temporal is not a ranking problem, and I refused to fix it with a reader hack. The real cause: every temporal question asks for the event date, but claims only carried the session (speech) date, and the dialogue's relative phrase ("yesterday", "last year") was never resolved. The fix is donto-native: populate valid_time. A GLM enrichment pass resolves each event to an absolute time against the session anchor and writes an occurredAt claim (minute-precise literal) plus a day-granular valid_time. Result: 239/272 chunks, 762 occurredAt claims, embedded so semantic recall surfaces them.

Update 3 — populating `valid_time` lifts temporal +45% (the bitemporal differentiator)

Re-ran cosine recall on the same full 160 with the temporal claims now in the pool. Reader identical, 0 empties:

claims-only, cosine recall (160 Q clean)	overall	multi-hop	temporal	open-domain	single-hop
cosine baseline	0.362	0.20	0.50	0.275	0.475
cosine + `valid_time`	0.412	0.20	0.725	0.25	0.475

Temporal jumps 0.50 → 0.725 (+45% relative), carrying overall claims-only 0.362 → 0.412. The other categories are flat or within one question (open-domain 0.275 → 0.25 is one question out of 40), which is the signature of a clean, targeted fix (no reader change, no leakage into other modes). The transcripts confirm the mechanism, not luck. The two flagship failures now pass: "When did Melanie paint a sunrise?" → 2022 ✓ (resolved "last year") and "When did Caroline go to the LGBTQ support group?" → May 7 ✓ (the off-by-one "yesterday" is gone). Each passes because recall now surfaces a … occurredAt 2022 / 2023-05-07 claim the reader reads directly.

This is the thesis made concrete. Temporal is where a contradiction-preserving bitemporal substrate should beat episodic-stuffing — and it does, once valid_time is actually populated. The win isn't a prompt trick or a retrieval tweak; it's donto holding when an event was true in the world as first-class state. Episodic chunk-stuffing has no equivalent — it can only re-read the dialogue and re-derive the date every time (and usually grabs the speech date, as our own pre-fix reader did).

Honest residuals (the next temporal lever): the remaining temporal misses are (a) resolver precision — "the Sunday before 25 May" resolved to May 20 (a day off), one event mis-resolved to 2019; and (b) reader disambiguation — when both a session-date claim and a resolved occurredAt claim are present, the reader sometimes still picks the session date. Both are tractable: a deterministic date resolver (vs the LLM doing calendar arithmetic) for (a), and surfacing valid_time more prominently than the speech-date tag for (b).
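
A deterministic resolver for residual (a) could start as small as the sketch below. It handles only three phrase shapes, returns an ISO string at the granularity the phrase supports, and assumes that a date without a year belongs to the session's year. The function names and the session dates in the usage lines are assumptions for illustration, not values from the dataset.

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]


def weekday_before(anchor, weekday_name):
    """Latest date strictly before `anchor` that falls on the named weekday."""
    target = WEEKDAYS.index(weekday_name.lower())
    back = (anchor.weekday() - target) % 7 or 7
    return anchor - timedelta(days=back)


def resolve(phrase, session_date):
    """Resolve a relative phrase against the session date; None if unrecognised."""
    p = phrase.strip().lower()
    if p == "yesterday":
        return (session_date - timedelta(days=1)).isoformat()
    if p == "last year":
        return str(session_date.year - 1)  # year-granular
    m = re.fullmatch(r"the (\w+) before (\d{1,2}) (\w+)", p)
    if m and m.group(1) in WEEKDAYS and m.group(3) in MONTHS:
        # Assumption: a date with no year refers to the session's year.
        anchor = date(session_date.year, MONTHS.index(m.group(3)) + 1, int(m.group(2)))
        return weekday_before(anchor, m.group(1)).isoformat()
    return None  # leave valid_time unbounded instead of guessing


print(resolve("yesterday", date(2023, 5, 8)))
print(resolve("last year", date(2023, 5, 8)))
print(resolve("the Sunday before 25 May", date(2023, 6, 1)))
```

If the conversation year is 2023, 25 May is a Thursday and the preceding Sunday is 21 May, one day later than the May 20 the LLM pass produced, which matches the "a day off" miss described above.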

Update 4 — representation: passages beat triples, and the cheap-regime ceiling

The triple-based claim layer topped out around 0.41 on the free reader. The question we then asked — is the reader the limit, or are we not delivering the fact? — turned out to be the right one. Two reader-independent measurements settled it:

Coverage (is the exact answer-bearing passage even in what the reader sees?): at top-20 it was missing 27% of the time. Not a reasoning problem — a delivery problem.
Representation: a weak reader reasons over natural prose, not terse triples.
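
The coverage measurement is reader-independent because it only asks whether an answer-bearing passage appears in the top-k. A minimal sketch, assuming each question comes with a set of gold passage ids (how those ids are obtained is not specified in this document) and using the hypothetical name `coverage_at_k`:

```python
def coverage_at_k(retrieved, gold, k=20):
    """Fraction of questions with at least one gold passage in the top-k.

    retrieved: {question id: ranked list of passage ids}
    gold: {question id: set of answer-bearing passage ids}
    """
    hits = sum(
        1 for qid, answers in gold.items()
        if answers & set(retrieved.get(qid, [])[:k])
    )
    return hits / len(gold) if gold else 0.0


retrieved = {"q1": ["p3", "p1"], "q2": ["p9", "p8", "p7"], "q3": []}
gold = {"q1": {"p1"}, "q2": {"p7"}, "q3": {"p2"}}
print(coverage_at_k(retrieved, gold, k=2), coverage_at_k(retrieved, gold, k=3))
```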

So we switched the return from claim-triples to dialogue passages (the general principle: fine-grained passage recall, corpus-agnostic — sentences/passages of any document, not "speaker turns"). Reader held identical, full 160, 0 empties:

return (free reader, full 160)	overall	multi-hop	temporal	open-domain	single-hop	tokens
claim-triples (cosine)	0.362	0.20	0.50	0.275	0.475	1.7K
passages (top-40)	0.650	0.45	0.825	0.475	0.85	2.8K
episodic (whole sessions)	0.837	—	—	—	—	19K

Passage recall is +80% over claim-triples and reaches ~78% of episodic's accuracy at ~15% of its tokens — the high-accuracy/minimal-token target. The two strong categories (single-hop 0.85, temporal 0.825) are near episodic; the gap is the reasoning-heavy categories (multi-hop 0.45, open-domain 0.475).

Then we hit a ceiling that maps cleanly onto the single-shot-vs-multi-scope analysis — every attempt to push past passage-recall, on the cheap regime (free reader, cosine retrieval), failed for an instructive reason:

lever (24-Q samples, vs passages 0.625)	result	why
adjacency (±1 neighbour passages)	0.662 full-160, but +2.3× tokens; hurt open-domain	coverage 82→89% but lossy conversion; more passages = more distractors for inference Qs
entity-claim aggregation (gather all of an entity's claims)	multi-hop 0.50→0.67 but temporal →0.0	a single scope can't serve all categories — fixes aggregation, loses the dated passages
multi-scope + cosine rerank (passage + entity scopes)	0.583 ↓	cosine rerank sorts the dissimilar-but-relevant aggregation passages back out — it collapses multi-scope to semantic-only
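
The adjacency lever in the first row amounts to widening each hit by one passage on either side. The sketch below assumes passages are addressed by their position in the conversation and uses the hypothetical name `expand_adjacent`; it returns positions in document order, not rank order.

```python
def expand_adjacent(hit_indices, n_passages, radius=1):
    """Add each hit's neighbours within `radius`, clipped to bounds and deduplicated."""
    keep = set()
    for i in hit_indices:
        keep.update(range(max(0, i - radius), min(n_passages, i + radius + 1)))
    return sorted(keep)


expanded = expand_adjacent([0, 5, 6], n_passages=8)
print(expanded)
```

Three hits become six passages in this example, which shows where the extra tokens and the extra distractors come from.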

The third row is the key negative: multi-scope composition is real (it's how Zep gets 86.5% → 94.7%), but it needs a cross-encoder reranker that scores (query, passage) relevance, not a bi-encoder cosine that only measures similarity. We could not run one — fastembed's reranker registry is broken in this environment (bi-encoders download, cross-encoders don't) and installing torch/sentence-transformers isn't justified on this box. So that lever is environment-blocked, not disproven.

The honest conclusion for the constrained regime (free holo3.1 reader, minimal tokens, cosine-only retrieval): simple passage recall (0.650) is the ceiling. Sophistication — adjacency, entity aggregation, multi-scope — does not convert to accuracy without (a) a cross-encoder reranker (to compose scopes) and (b) a stronger reader (to use the richer context). That is exactly the regime Zep operates in: gpt-5.4 reader + cross-encoder rerank + multi-scope composition. Our 0.650 single-scope number is honestly closest to Zep's auto-search (single-call) regime (86.5% — with their far stronger reader), not the 94.7% multi-scope headline.

What this maps for the program: the remaining accuracy lives behind two doors we've deliberately kept shut here, the strong reader (codex, declined: budget) and the cross-encoder reranker (env-blocked), plus the agentic track, where the reader fetches the scattered facts itself and which is the natural fix for multi-hop aggregation. None of those are cheats or conversation tricks; they're the next real builds.

Status & next

✅ Config C harness built + scored — run_locomo.py --claims-only (chunk-routed, conversation-scoped, dated top-40 claims) + a streaming reader. Clean 160-Q stratified run, 0 empties.
✅ Context-collapse confirmed — ~11× smaller than episodic, ~3.4× smaller than Zep.
✅ Honest verdict recorded — claims-only 0.244 vs episodic 0.837: this configuration trades too much accuracy for the token saving. The win, if it exists, is in better claim recall, not raw claim count.
✅ Semantic claim recall: built and re-scored (see the updates above). Individual claims are embedded and ranked by vector similarity with neighbour bridging instead of lexical overlap; clean full-160 results are 0.362 with cosine recall and 0.412 once valid_time is populated.
⬜ Config B — episodic + claims (the augmentation hypothesis): does adding claims to the full dialogue lift episodic's 0.837? This is likelier to show the claim layer's value than claims-only.
⬜ Zep-comparable pass — once a claim-recall config is competitive, the gpt-5.4 reader pass for the headline vs Zep 0.947.

The framing, per donto's benchmark philosophy: this is an instrument for building donto, not a scoreboard. A clean negative — claims-only loses accuracy at an 11× context cut, worst on multi-hop — is exactly the reading that says build semantic claim recall next. The claims are there and anchored; the substrate just can't yet surface the right few for a reasoning chain.