# LoCoMo — run checklist

**Goal:** a full **1,540-question** LoCoMo run (categories 1–4; the adversarial cat-5 is excluded, matching Zep's denominator), scoring donto-memory against **Zep's 94.7%**. Because donto is built to win *through its philosophical extraction*, the run also produces the extraction-coverage curve that shows the claim layer (not episodic stuffing) doing the work.

**Reference (getzep.com/research, 2026):** LoCoMo **94.7%** (1,459/1,540), gpt-5.4 reader (reasoning=medium) + gpt-5.4 CoT judge, retrieval p50 87ms, median context **5,760 tok**.

_Status: ✅ done · 🔄 in progress · ⛔ blocked · ⬜ not started. Last updated 2026-06-11._

<!-- MONITOR:START -->
## 📡 Live monitor

_auto-checked every 5 min · last check **2026-06-22 18:10 UTC** · status: **⚠️  service donto-home=failed; service dontosrv=inactive; service donto-memory-api=inactive; service donto-embed-coordinator=inactive; service donto-align-daemon=inactive; swap=3749MB;**_

| signal | value |
|---|---|
| claim extraction | **/272** chunks real (n_claims>0) ·  pending ·  empty · worker `inactive` |
| claims in substrate |  statements (`ctx:claims/locomo`) |
| predicate folding inputs | / LoCoMo predicates embedded |
| contributor fleet |  entity vecs ·  queued · rate -59/min |
| baseline reader (config A) | 258/1540 answered, 0 empty |
| box | load 0.64 · mem 2568/15987MB · swap 3749MB |
| services | ⛔donto-home ⛔dontosrv ⛔donto-memory-api ⛔donto-embed-coordinator ⛔donto-align-daemon |
<!-- MONITOR:END -->

---

## Foundation — the corpus and its memory

- ✅ **Dataset acquired** — `locomo10.json` (10 conversations, 272 sessions, 1,540 gradable QA across cat 1–4).
- ✅ **Episodic ingest** — all 272 sessions replayed into donto-memory (`holder=user:locomo:<id>`, `valid_from` = session date), 0 failures.
- ✅ **Chunk embeddings** — 272/272 episodic chunks embedded (bge-small). *Vector recall works now.*
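
As a sanity check on the 1,540 denominator, the category filter can be sketched as below. This is a minimal sketch, assuming each conversation in `locomo10.json` carries a `qa` list whose items have an integer `category` field; the field names are assumptions about the dataset layout, and the records shown are synthetic stand-ins, not real LoCoMo data.

```python
from collections import Counter

# Categories 1-4 are graded; category 5 (adversarial) is excluded.
GRADABLE_CATEGORIES = {1, 2, 3, 4}


def count_gradable(samples):
    """Count QA items per category, keeping only the gradable categories."""
    counts = Counter()
    for sample in samples:
        for qa in sample.get("qa", []):
            category = qa.get("category")
            if category in GRADABLE_CATEGORIES:
                counts[category] += 1
    return counts


# Synthetic stand-in for the parsed JSON (assumed shape, not real data).
samples = [
    {"sample_id": "conv-a", "qa": [{"category": 1}, {"category": 5}, {"category": 2}]},
    {"sample_id": "conv-b", "qa": [{"category": 4}, {"category": 4}]},
]
counts = count_gradable(samples)
total = sum(counts.values())
```

Run against the real file (for example via `json.load`), `total` would be expected to equal the 1,540 quoted above if the assumed field names hold.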

## The claim layer — the philosophical extraction (the part that must win)

- ✅ **Rich claim extraction — DONE (272/272).** Two lanes: Hyades/holo3.1 (~150 facts/chunk) + z.ai glm-5.1 (thinking-on, 64K budget, **~520 facts/chunk**). Corpus average **~190 facts/chunk**, **0 contention-empties**. The last straggler chunk (conv-43 session 8 — repeated transient z.ai network errors) was closed by a post-hoc **streaming** holo3.1 retry (177 facts, 169 anchored): the new SSE transport no longer 524s, so the retry completed first try. **272/272 = 100%.**
- ✅ **Extraction reconciled** — every chunk has `n_claims>0`, verified in `donto_statement` (the early contention-empties were re-queued and re-extracted deep).

## Embeddings for folding — prioritized to the contributor fleet

- ✅ **Prioritized LoCoMo embeddings to front of fleet queue** — `prioritize_locomo_embeddings.sh` ran; 7,308 LoCoMo entities backdated to the front and embedded ahead of the general backlog.
- ✅ **Predicate embeddings — DONE 100%** (13,195/13,195). The deep extraction minted ~13K predicates. The distributed fleet stalled on the last ~1,400 — workers embedded them fine but their **submit calls 504'd at the Cloudflare layer** (the upsert was too slow under load), so they never wrote back. Fixed by embedding the final 1,407 **locally** (`embed.py --only-missing`, bge-small, ~14 min) straight into `donto_predicate_embedding`, bypassing the coordinator/CF; the stuck queue leases were then reconciled to `done`.
- ✅ **Entity embeddings — DONE** — LoCoMo's front-queued entities fully embedded (powers identity folding).

## Predicate folding — the differentiator

- ✅ **Alignment + closure rebuilt over the LoCoMo predicates — folding LIVE.** A full `donto_hybrid_align_batch` ran over the predicate space (proposed **12,331**, registered 12,331) and the closure was rebuilt to **1,081,391 rows** (+24,715). All 13,195 LoCoMo predicates are embedded and carry closure rows; `donto_match_aligned(…, p_expand_predicates=true)` — the function donto-memory's recall uses — now expands query predicates through this closure.
- ✅ **Folding probed end-to-end.** Confirmed the closure folds predicates at query time (e.g. `occupation → {research:occupation, ex:occupation, …}` at confidence 1.0). **Honest scope:** the hybrid gate (semantic cosine ≥0.82 **and** a trigram-lexical vote) is deliberately high-precision — it folds orthographic / cross-namespace variants but **not** lexically-distant synonyms (`occupation`↔`currentJob`). That semantic-synonym case is covered by the complementary path — donto-memory's **vector recall over claim text** — not by predicate folding. Whether conservative folding suffices, or the synonym gate should loosen, is exactly what the scored coverage curve will measure (we won't loosen substrate-wide thresholds blind — that risks false folds in the genealogy corpus too).
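
The two-vote gate described above can be illustrated with a small sketch. This is not donto's implementation: only the 0.82 cosine floor comes from this checklist, while the trigram threshold (`TRIGRAM_MIN`), the padding scheme, and the function names are hypothetical choices made for illustration.

```python
import math

COSINE_MIN = 0.82   # semantic floor, as stated in the checklist
TRIGRAM_MIN = 0.30  # hypothetical lexical floor; the source gives no value


def trigrams(text):
    """Character trigrams of the lowercased, space-padded string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def should_fold(name_a, vec_a, name_b, vec_b):
    """Fold only when BOTH the semantic and the lexical vote pass."""
    semantic_ok = cosine(vec_a, vec_b) >= COSINE_MIN
    lexical_ok = trigram_similarity(name_a, name_b) >= TRIGRAM_MIN
    return semantic_ok and lexical_ok
```

With toy vectors that are close in cosine terms, `should_fold("occupation", [1.0, 0.0], "ex:occupation", [0.9, 0.1])` passes both votes, while `occupation` against `currentJob` fails the lexical vote because the two names share no trigrams. That is the high-precision behaviour noted above: cross-namespace variants fold, lexically distant synonyms do not.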

## The runs

- 🔄 **Config A — episodic-only baseline** (1,540 Q) — the 0%-extraction point. The earlier run was contention-corrupted (121 empty reads); it is re-running clean on a free gateway (0 empties confirmed). Accuracy is ~0.79 on holo3.1 so far.
- ⬜ **Extraction-coverage curve** — configs at 0 / 10 / 50 / 100% claim coverage, controlled deterministically at query time over the 272 per-chunk claim contexts (no racing extraction):
  - ⬜ Config B — episodic + claims (does adding the claim layer lift accuracy?)
  - ✅ **Config C — claims-only recall — SCORED, full investigation.** Claim triples scored 0.362; adding `valid_time` (temporal +45%) lifted that to 0.412; **passage recall reached 0.650** (free reader, full 160, 2.8K tok = 78% of episodic's 0.837 at 15% of its tokens). Findings: prose beats triples for a weak reader; *coverage* (not the reader) was the gap; in the cheap regime (free reader, cosine) passage recall at 0.650 is the **ceiling**: adjacency, entity aggregation and multi-scope don't convert without a cross-encoder reranker (env-blocked) plus a stronger reader. This maps to Zep's auto-search 86.5% (single-scope) vs multi-scope 94.7%. Write-ups: [Config C](/reports/locomo-claims-only-config-c) · [return shape](/reports/donto-memory-return-shape) · [single-shot vs agentic](/reports/single-shot-vs-agentic-memory).
- ✅ **Token-efficiency result — the donto value axis, firmed (holo3.1, 96Q).** The honest finding on session-rich LoCoMo: full-session episodic is hard to beat on raw accuracy, and bolting claim facets onto sessions is redundant noise (both readers net-flat-to-negative). But built **as the context itself** — claims + answer-shaped **aggregate claims** + resolved dates, *no raw sessions* — the substrate hits **0.521 vs episodic 0.604 = 86% of episodic accuracy at ~1,376 tok (3.8× fewer)**, and per-category **matches episodic on multi-hop, beats it on temporal, ≈open-domain**. The aggregates roughly **double** raw claims-only (0.244→0.52). The entire remaining gap is **single-hop verbatim recall** — a diagnostic proved the answers are *in* the claim layer but not surfaced (a recall/alignment gap); a lexical arm recovers them (6→8/12) but always-on injection trades against open-domain, and a naive rank-gate didn't cleanly route — so closing it needs **question-level adaptive recall** (donto's alignment frontier), not more facets. Write-up: [return shape](/reports/donto-memory-return-shape). *Headline: donto's claim layer delivers near-episodic answers at a quarter of the tokens.*
- ⬜ **Zep-comparable run** — flip `LOCOMO_BACKEND=codex` (gpt-5.4 reader + judge) for the headline number vs Zep's 94.7%.
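
One way to hold coverage fixed at query time, as the extraction-coverage curve above requires, is to rank the 272 per-chunk claim contexts by a stable hash and keep a prefix. This is a sketch under assumptions, not the harness itself: the chunk identifiers, the salt, and both function names are hypothetical.

```python
import hashlib


def chunk_rank(chunk_id, salt="locomo-coverage"):
    """Stable pseudo-random rank for a chunk, independent of run order."""
    digest = hashlib.sha256(f"{salt}:{chunk_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def select_chunks(chunk_ids, coverage):
    """Return the claim contexts visible at a given coverage fraction."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    ranked = sorted(chunk_ids, key=chunk_rank)
    keep = round(len(ranked) * coverage)
    return set(ranked[:keep])


# Hypothetical identifiers for the 272 per-chunk claim contexts.
chunk_ids = [f"chunk-{i:03d}" for i in range(272)]
subsets = {c: select_chunks(chunk_ids, c) for c in (0.0, 0.1, 0.5, 1.0)}
```

Because every coverage level takes a prefix of the same ranking, the subsets are nested (the 10% set sits inside the 50% set), so any accuracy change between points on the curve reflects added claim contexts rather than a reshuffled sample.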

## Reporting

- ⬜ **Metrics** — accuracy overall + **per category** (multi-hop / temporal / open-domain / single-hop), retrieval p50/p95, median context tokens.
- ⬜ **Honest write-up** — per-category deltas attributing the win to the claim layer; caveats (reader/judge model, gateway).
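
The metrics listed above could be computed from per-question records along these lines. This is a minimal sketch: the record fields (`category`, `correct`, `retrieval_ms`, `context_tokens`) are hypothetical names, the sample records are invented for illustration, and the percentile uses the nearest-rank method, which may differ from what the real harness reports.

```python
import math
from collections import defaultdict
from statistics import median


def percentile(values, q):
    """Nearest-rank percentile (q in 0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records):
    """Overall and per-category accuracy plus latency and context stats."""
    if not records:
        raise ValueError("no records to summarize")
    by_category = defaultdict(list)
    for record in records:
        by_category[record["category"]].append(record["correct"])
    latencies = [r["retrieval_ms"] for r in records]
    return {
        "accuracy": sum(r["correct"] for r in records) / len(records),
        "per_category": {c: sum(v) / len(v) for c, v in by_category.items()},
        "retrieval_p50_ms": percentile(latencies, 50),
        "retrieval_p95_ms": percentile(latencies, 95),
        "median_context_tokens": median(r["context_tokens"] for r in records),
    }


# Invented records, for illustration only.
records = [
    {"category": "temporal", "correct": True, "retrieval_ms": 80, "context_tokens": 1000},
    {"category": "temporal", "correct": False, "retrieval_ms": 90, "context_tokens": 1400},
    {"category": "single-hop", "correct": True, "retrieval_ms": 100, "context_tokens": 1200},
    {"category": "multi-hop", "correct": True, "retrieval_ms": 400, "context_tokens": 2000},
]
report = summarize(records)
```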

---

_Hard rules: never share the gateway between extraction and reader; `done` ≠ extracted; build before measuring. → [all benchmarks](/benchmarks)_