LoCoMo — run checklist
Goal: a full 1,540-question LoCoMo run (categories 1–4; the adversarial category 5 is excluded, matching Zep's denominator), scoring donto-memory against Zep's 94.7%. Because donto is built to win through its philosophical extraction, the run also produces the extraction-coverage curve that shows the claim layer, not episodic stuffing, doing the work.
Reference (getzep.com/research, 2026): LoCoMo 94.7% (1,459/1,540), gpt-5.4 reader (reasoning=medium) + gpt-5.4 CoT judge, retrieval p50 87ms, median context 5,760 tok.
Status: ✅ done · 🔄 in progress · ⛔ blocked · ⬜ not started. Last updated 2026-06-11.
📡 Live monitor
auto-checked every 5 min · last check 2026-06-13 18:05 UTC · status: ⚠️ swap=6473MB
| signal | value |
|---|---|
| claim extraction | 272/272 chunks real (n_claims>0) · 0 pending · 0 empty · worker inactive |
| claims in substrate | 52396 statements (ctx:claims/locomo) |
| predicate folding inputs | 13195/13376 LoCoMo predicates embedded |
| contributor fleet | 3160590 entity vecs · 0 queued · rate 0/min |
| baseline reader (config A) | 258/1540 answered, 0 empty |
| box | load 3.72 · mem 6009/15987MB · swap 6473MB |
| services | ✅donto-home ✅dontosrv ✅donto-memory-api ✅donto-embed-coordinator ✅donto-align-daemon |
Foundation — the corpus and its memory
- ✅ Dataset acquired: `locomo10.json` (10 conversations, 272 sessions, 1,540 gradable QA across cat 1–4).
- ✅ Episodic ingest: all 272 sessions replayed into donto-memory (`holder=user:locomo:<id>`, `valid_from` = session date), 0 failures.
- ✅ Chunk embeddings: 272/272 episodic chunks embedded (bge-small). Vector recall works now.
The claim layer — the philosophical extraction (the part that must win)
- ✅ Rich claim extraction: DONE (272/272). Two lanes: Hyades/holo3.1 (~150 facts/chunk) + z.ai glm-5.1 (thinking-on, 64K budget, ~520 facts/chunk). Corpus average ~190 facts/chunk, 0 contention-empties. The last straggler chunk (conv-43 session 8, which hit repeated transient z.ai network errors) was closed by a post-hoc streaming holo3.1 retry (177 facts, 169 anchored): the new SSE transport no longer returns 524s, so the retry completed on the first try. 272/272 = 100%.
- ✅ Extraction reconciled: every chunk has `n_claims>0`, verified in `donto_statement` (the early contention-empties were re-queued and re-extracted deep).
Embeddings for folding — prioritized to the contributor fleet
- ✅ Prioritized LoCoMo embeddings to front of fleet queue: `prioritize_locomo_embeddings.sh` ran; 7,308 LoCoMo entities were backdated to the front and embedded ahead of the general backlog.
- ✅ Predicate embeddings: DONE 100% (13,195/13,195). The deep extraction minted ~13K predicates. The distributed fleet stalled on the last ~1,400: workers embedded them fine, but their submit calls 504'd at the Cloudflare layer (the upsert was too slow under load), so they never wrote back. Fixed by embedding the final 1,407 locally (`embed.py --only-missing`, bge-small, ~14 min) straight into `donto_predicate_embedding`, bypassing the coordinator/CF; the stuck queue leases were then reconciled to `done`.
- ✅ Entity embeddings: DONE. LoCoMo's front-queued entities are fully embedded (this powers identity folding).
Predicate folding — the differentiator
- ✅ Alignment + closure rebuilt over the LoCoMo predicates: folding LIVE. A full `donto_hybrid_align_batch` ran over the predicate space (proposed 12,331, registered 12,331) and the closure was rebuilt to 1,081,391 rows (+24,715). All 13,195 LoCoMo predicates are embedded and carry closure rows; `donto_match_aligned(…, p_expand_predicates=true)`, the function donto-memory's recall uses, now expands query predicates through this closure.
- ✅ Folding probed end-to-end. Confirmed the closure folds predicates at query time (e.g. `occupation → {research:occupation, ex:occupation, …}` at confidence 1.0). Honest scope: the hybrid gate (semantic cosine ≥0.82 and a trigram-lexical vote) is deliberately high-precision. It folds orthographic and cross-namespace variants but not lexically distant synonyms (occupation↔currentJob). That semantic-synonym case is covered by the complementary path, donto-memory's vector recall over claim text, not by predicate folding. Whether conservative folding suffices, or the synonym gate should loosen, is exactly what the scored coverage curve will measure (we won't loosen substrate-wide thresholds blind, since that risks false folds in the genealogy corpus too).
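The gate described above can be sketched as a conjunction of two votes. This is a minimal illustration, not donto's implementation: the 0.82 cosine floor comes from the text, while the trigram threshold (`trgm_min=0.5`), the simplified pg_trgm-style padding, the choice to compare only the local name after the namespace prefix, and every function name here are assumptions made for the example.

```python
from __future__ import annotations

import math


def trigrams(s: str) -> set[str]:
    """Character trigrams of a lowercased, space-padded string (simplified)."""
    padded = f"  {s.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def local_name(predicate: str) -> str:
    """Assumption: strip a namespace prefix such as 'ex:' before the lexical vote."""
    return predicate.rsplit(":", 1)[-1]


def should_fold(pred_a, vec_a, pred_b, vec_b, cos_min=0.82, trgm_min=0.5) -> bool:
    """Fold only when BOTH the semantic and the lexical vote pass."""
    semantic_ok = cosine(vec_a, vec_b) >= cos_min
    lexical_ok = trigram_similarity(local_name(pred_a), local_name(pred_b)) >= trgm_min
    return semantic_ok and lexical_ok
```

Under these assumptions, `occupation` and `ex:occupation` fold when their vectors agree, while `occupation` and `currentJob` do not fold even with near-identical vectors, because they share no trigrams. That is the high-precision behaviour the scope note describes.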
The runs
- 🔄 Config A: episodic-only baseline (1,540 Q), the 0%-extraction point. The first pass was contention-corrupted (121 empty reads); it is re-running clean on a free gateway (0 empties confirmed). Score so far on holo3.1: ~0.79.
- ⬜ Extraction-coverage curve — configs at 0 / 10 / 50 / 100% claim coverage, controlled deterministically at query time over the 272 per-chunk claim contexts (no racing extraction):
- ⬜ Config B — episodic + claims (does adding the claim layer lift accuracy?)
- ✅ Config C: claims-only recall, SCORED, full investigation. Claim-triples 0.362 → + `valid_time` temporal +45% (0.412) → passage recall 0.650 (free reader, full 160, 2.8K tok = 78% of episodic's 0.837 at 15% of its tokens). Found: prose beats triples for a weak reader; coverage, not the reader, was the gap; in the cheap regime (free reader, cosine), passage recall at 0.650 is the ceiling, because adjacency, entity aggregation, and multi-scope don't convert without a cross-encoder reranker (env-blocked) and a stronger reader. This maps to Zep's auto-search 86.5% (single-scope) vs multi-scope 94.7%. Write-ups: Config C · return shape · single-shot vs agentic.
- ✅ Token-efficiency result: the donto value axis, firmed (holo3.1, 96Q). The honest finding on session-rich LoCoMo: full-session episodic is hard to beat on raw accuracy, and bolting claim facets onto sessions is redundant noise (both readers net flat to negative). But when the substrate is built as the context itself (claims + answer-shaped aggregate claims + resolved dates, no raw sessions), it hits 0.521 vs episodic 0.604 = 86% of episodic accuracy at ~1,376 tok (3.8× fewer). Per category, it matches episodic on multi-hop, beats it on temporal, and roughly ties on open-domain. The aggregates roughly double raw claims-only accuracy (0.244→0.52). The entire remaining gap is single-hop verbatim recall: a diagnostic proved the answers are in the claim layer but not surfaced (a recall/alignment gap). A lexical arm recovers them (6→8/12), but always-on injection trades against open-domain, and a naive rank-gate didn't cleanly route, so closing the gap needs question-level adaptive recall (donto's alignment frontier), not more facets. Write-up: return shape. Headline: donto's claim layer delivers near-episodic answers at a quarter of the tokens.
- ⬜ Zep-comparable run: flip `LOCOMO_BACKEND=codex` (gpt-5.4 reader + judge) for the headline number vs Zep's 94.7%.
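One way to hold claim coverage fixed at query time, as the coverage-curve item calls for, is to rank the per-chunk claim contexts by a stable hash and admit the first N% of that ranking. The sketch below assumes this approach; the chunk ids, the seed string, and both function names are hypothetical, and the actual run may select chunks differently.

```python
import hashlib


def chunk_rank(chunk_id: str, seed: str = "locomo-coverage") -> int:
    """Stable pseudo-random rank for a chunk; identical on every run and machine."""
    digest = hashlib.sha256(f"{seed}:{chunk_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def covered_chunks(chunk_ids, coverage_pct: float) -> set:
    """Return the chunk ids whose claim contexts are visible at this coverage level."""
    if not 0 <= coverage_pct <= 100:
        raise ValueError("coverage_pct must be between 0 and 100")
    ordered = sorted(chunk_ids, key=chunk_rank)
    keep = round(len(ordered) * coverage_pct / 100)
    return set(ordered[:keep])


# Hypothetical ids standing in for the 272 per-chunk claim contexts.
chunk_ids = [f"chunk-{i}" for i in range(272)]
curve = {pct: covered_chunks(chunk_ids, pct) for pct in (0, 10, 50, 100)}
```

Because the ranking is fixed, each coverage level is a superset of the one below it, so the points on the curve differ only by the claim contexts added, and no extraction job needs to run while the reader is being scored.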
Reporting
- ⬜ Metrics — accuracy overall + per category (multi-hop / temporal / open-domain / single-hop), retrieval p50/p95, median context tokens.
- ⬜ Honest write-up — per-category deltas attributing the win to the claim layer; caveats (reader/judge model, gateway).
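The metrics listed above could be computed from per-question records along these lines. The record fields (`category`, `correct`, `retrieval_ms`, `context_tokens`) and the nearest-rank percentile are assumptions for the sketch, not the harness's actual schema.

```python
import math
from collections import defaultdict
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results):
    """Aggregate per-question records into overall, per-category, latency, and token metrics."""
    by_category = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(bool(r["correct"]))
    latencies = [r["retrieval_ms"] for r in results]
    return {
        "accuracy": sum(bool(r["correct"]) for r in results) / len(results),
        "per_category": {c: sum(v) / len(v) for c, v in by_category.items()},
        "retrieval_p50_ms": percentile(latencies, 50),
        "retrieval_p95_ms": percentile(latencies, 95),
        "median_context_tokens": median(r["context_tokens"] for r in results),
    }
```

Reporting per-category accuracy from the same records as the overall figure keeps the denominators consistent, which matters when comparing against a published 1,540-question total.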
Hard rules: never share the gateway between extraction and reader; done ≠ extracted; build before measuring.