LoCoMo — run checklist

Goal: a full 1,540-question LoCoMo run (categories 1–4; the adversarial category 5 is excluded, matching Zep's denominator) that scores donto-memory against Zep's 94.7%. Because donto is built to win through its philosophical extraction, the run must also produce the extraction-coverage curve, which is meant to show the claim layer (not episodic stuffing) doing the work.

Reference (getzep.com/research, 2026): LoCoMo 94.7% (1,459/1,540), gpt-5.4 reader (reasoning=medium) + gpt-5.4 CoT judge, retrieval p50 87ms, median context 5,760 tok.

Status: ✅ done · 🔄 in progress · ⛔ blocked · ⬜ not started. Last updated 2026-06-11.

📡 Live monitor

auto-checked every 5 min · last check 2026-06-22 18:10 UTC · status: ⚠️ service donto-home=failed; service dontosrv=inactive; service donto-memory-api=inactive; service donto-embed-coordinator=inactive; service donto-align-daemon=inactive; swap=3749MB;

signal	value
claim extraction	/272 chunks real (n_claims>0) · pending · empty · worker `inactive`
claims in substrate	statements (`ctx:claims/locomo`)
predicate folding inputs	/ LoCoMo predicates embedded
contributor fleet	entity vecs · queued · rate -59/min
baseline reader (config A)	258/1540 answered, 0 empty
box	load 0.64 · mem 2568/15987MB · swap 3749MB
services	⛔donto-home ⛔dontosrv ⛔donto-memory-api ⛔donto-embed-coordinator ⛔donto-align-daemon

Foundation — the corpus and its memory

✅ Dataset acquired — locomo10.json (10 conversations, 272 sessions, 1,540 gradable QA across cat 1–4).
✅ Episodic ingest — all 272 sessions replayed into donto-memory (holder=user:locomo:<id>, valid_from = session date), 0 failures.
✅ Chunk embeddings — 272/272 episodic chunks embedded (bge-small). Vector recall works now.

The claim layer — the philosophical extraction (the part that must win)

✅ Rich claim extraction: DONE (272/272). Two lanes: Hyades/holo3.1 (~150 facts/chunk) and z.ai glm-5.1 (thinking on, 64K budget, ~520 facts/chunk). Corpus average is ~190 facts/chunk, with 0 contention-empties. The last straggler chunk (conv-43 session 8, which kept hitting transient z.ai network errors) was closed by a post-hoc streaming holo3.1 retry (177 facts, 169 anchored). The new SSE transport no longer fails with 524 errors, so the retry completed on the first attempt. 272/272 = 100%.
✅ Extraction reconciled — every chunk has n_claims>0, verified in donto_statement (the early contention-empties were re-queued and re-extracted deep).

Embeddings for folding — prioritized to the contributor fleet

✅ Prioritized LoCoMo embeddings to front of fleet queue — prioritize_locomo_embeddings.sh ran; 7,308 LoCoMo entities backdated to the front and embedded ahead of the general backlog.
✅ Predicate embeddings: DONE 100% (13,195/13,195). The deep extraction minted ~13K predicates. The distributed fleet stalled on the last ~1,400: workers embedded them fine, but their submit calls returned 504 at the Cloudflare layer (the upsert was too slow under load), so the results were never written back. Fixed by embedding the final 1,407 locally (embed.py --only-missing, bge-small, ~14 min) straight into donto_predicate_embedding, bypassing the coordinator and Cloudflare; the stuck queue leases were then reconciled to done.
✅ Entity embeddings — DONE — LoCoMo's front-queued entities fully embedded (powers identity folding).

Predicate folding — the differentiator

✅ Alignment + closure rebuilt over the LoCoMo predicates — folding LIVE. A full donto_hybrid_align_batch ran over the predicate space (proposed 12,331, registered 12,331) and the closure was rebuilt to 1,081,391 rows (+24,715). All 13,195 LoCoMo predicates are embedded and carry closure rows; donto_match_aligned(…, p_expand_predicates=true) — the function donto-memory's recall uses — now expands query predicates through this closure.
✅ Folding probed end-to-end. Confirmed the closure folds predicates at query time (e.g. occupation → {research:occupation, ex:occupation, …} at confidence 1.0). Honest scope: the hybrid gate (semantic cosine ≥0.82 and a trigram-lexical vote) is deliberately high-precision — it folds orthographic / cross-namespace variants but not lexically-distant synonyms (occupation↔currentJob). That semantic-synonym case is covered by the complementary path — donto-memory's vector recall over claim text — not by predicate folding. Whether conservative folding suffices, or the synonym gate should loosen, is exactly what the scored coverage curve will measure (we won't loosen substrate-wide thresholds blind — that risks false folds in the genealogy corpus too).
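The hybrid gate described above can be illustrated with a minimal Python sketch. It is not donto's implementation: the 0.82 cosine threshold comes from the text, while the trigram threshold (0.3), the namespace stripping, the helper names, and the toy vectors in the usage note are all assumptions made for illustration.

```python
import math

COSINE_MIN = 0.82   # semantic threshold quoted in the checklist
TRIGRAM_MIN = 0.3   # ASSUMPTION: the real lexical threshold is not stated


def trigrams(s: str) -> set[str]:
    """Character trigrams of a lowercased, space-padded string."""
    padded = f"  {s.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def local_name(predicate: str) -> str:
    """ASSUMPTION: compare names with the namespace prefix removed."""
    return predicate.rsplit(":", 1)[-1]


def should_fold(pred_a: str, vec_a: list[float],
                pred_b: str, vec_b: list[float]) -> bool:
    """Fold only when BOTH the semantic and the lexical vote pass."""
    semantic = cosine(vec_a, vec_b) >= COSINE_MIN
    lexical = trigram_similarity(local_name(pred_a), local_name(pred_b)) >= TRIGRAM_MIN
    return semantic and lexical
```

Under these assumptions, `should_fold("occupation", [1.0, 0.0], "research:occupation", [0.99, 0.1])` passes both votes, while `should_fold("occupation", [1.0, 0.0], "currentJob", [0.9, 0.3])` passes the cosine vote but fails the trigram vote. That mirrors the scope stated above: cross-namespace variants fold, lexically distant synonyms do not.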

The runs

🔄 Config A: episodic-only baseline (1,540 Q), the 0%-extraction point. The first pass was contention-corrupted (121 empty reads); it is re-running clean on a free gateway (0 empties confirmed). Accuracy is ~0.79 on holo3.1 so far.
⬜ Extraction-coverage curve — configs at 0 / 10 / 50 / 100% claim coverage, controlled deterministically at query time over the 272 per-chunk claim contexts (no racing extraction):
- ⬜ Config B — episodic + claims (does adding the claim layer lift accuracy?)
- ✅ Config C: claims-only recall. SCORED, full investigation. Progression: claim triples 0.362 → adding valid_time (temporal +45%) gives 0.412 → passage recall 0.650 (free reader, full 160, 2.8K tok), which is 78% of episodic's 0.837 at 15% of its tokens. Findings: prose beats triples for a weak reader; coverage, not the reader, was the gap; in the cheap regime (free reader, cosine) passage recall at 0.650 is the ceiling, because adjacency, entity aggregation, and multi-scope don't convert without a cross-encoder reranker (env-blocked) and a stronger reader. This maps to Zep's auto-search 86.5% (single-scope) vs multi-scope 94.7%. Write-ups: Config C · return shape · single-shot vs agentic.
✅ Token-efficiency result: the donto value axis, firmed (holo3.1, 96Q). The honest finding on session-rich LoCoMo: full-session episodic is hard to beat on raw accuracy, and bolting claim facets onto sessions is redundant noise (both readers net flat to negative). But when the substrate is built as the context itself (claims + answer-shaped aggregate claims + resolved dates, no raw sessions), it hits 0.521 vs episodic 0.604, i.e. 86% of episodic accuracy at ~1,376 tok (3.8× fewer). Per category it matches episodic on multi-hop, beats it on temporal, and is roughly level on open-domain. The aggregates roughly double raw claims-only (0.244→0.52). The entire remaining gap is single-hop verbatim recall: a diagnostic proved the answers are in the claim layer but not surfaced (a recall/alignment gap). A lexical arm recovers them (6→8/12), but always-on injection trades against open-domain, and a naive rank-gate didn't route cleanly, so closing the gap needs question-level adaptive recall (donto's alignment frontier), not more facets. Write-up: return shape. Headline: donto's claim layer delivers near-episodic answers at about a quarter of the tokens.
⬜ Zep-comparable run — flip LOCOMO_BACKEND=codex (gpt-5.4 reader + judge) for the headline number vs Zep's 94.7%.
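One way the coverage levels in the curve could be fixed deterministically at query time is sketched below. This is an assumption about mechanism, not the harness's actual code: each chunk id is ranked by a seeded hash, and a coverage level exposes the first k ranked claim contexts. The function names, the seed string, and the chunk-id format are hypothetical.

```python
import hashlib


def chunk_rank(chunk_id: str, seed: str = "locomo-coverage") -> int:
    """Stable pseudo-random rank for a chunk, independent of run order."""
    digest = hashlib.sha256(f"{seed}:{chunk_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def covered_chunks(chunk_ids: list[str], coverage: float) -> set[str]:
    """Return the chunks whose claim contexts are visible at this coverage.

    Because every level takes a prefix of the same ranking, the subsets are
    nested: the 10% set is contained in the 50% set, and so on.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    ranked = sorted(chunk_ids, key=chunk_rank)
    k = round(len(ranked) * coverage)
    return set(ranked[:k])


# Hypothetical ids standing in for the 272 per-chunk claim contexts.
chunks = [f"chunk-{i}" for i in range(272)]
levels = {c: covered_chunks(chunks, c) for c in (0.0, 0.1, 0.5, 1.0)}
```

With 272 chunks this yields 0, 27, 136, and 272 visible claim contexts. Nesting the subsets means any accuracy change between adjacent points on the curve comes only from the claims that were added, and no extraction job has to run or race while the reader is being scored.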

Reporting

⬜ Metrics — accuracy overall + per category (multi-hop / temporal / open-domain / single-hop), retrieval p50/p95, median context tokens.
⬜ Honest write-up — per-category deltas attributing the win to the claim layer; caveats (reader/judge model, gateway).
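The metrics listed above could be computed from per-question result records along these lines. This is a sketch under stated assumptions: the record fields (`category`, `correct`, `retrieval_ms`, `context_tokens`) are hypothetical names, and the percentile uses the nearest-rank definition, which may differ from whatever the reference numbers used.

```python
import math
import statistics
from collections import defaultdict


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results: list[dict]) -> dict:
    """Aggregate accuracy, latency percentiles, and context size."""
    by_category = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(bool(r["correct"]))
    latencies = [r["retrieval_ms"] for r in results]
    return {
        "accuracy": sum(bool(r["correct"]) for r in results) / len(results),
        "per_category": {c: sum(v) / len(v) for c, v in by_category.items()},
        "retrieval_p50_ms": percentile(latencies, 50),
        "retrieval_p95_ms": percentile(latencies, 95),
        "median_context_tokens": statistics.median(
            r["context_tokens"] for r in results
        ),
    }


# Toy records for illustration only; these are not measured results.
sample = [
    {"category": "multi-hop", "correct": True, "retrieval_ms": 80, "context_tokens": 1000},
    {"category": "multi-hop", "correct": False, "retrieval_ms": 90, "context_tokens": 2000},
    {"category": "temporal", "correct": True, "retrieval_ms": 100, "context_tokens": 3000},
    {"category": "single-hop", "correct": True, "retrieval_ms": 200, "context_tokens": 4000},
]
report = summarize(sample)
```

Reporting accuracy per category alongside the overall figure keeps a strong category from masking a weak one, which matters here because the remaining gap is concentrated in single-hop recall.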

Hard rules: never share the gateway between extraction and reader; done ≠ extracted; build before measuring.