# LongMemEval — run checklist

**Goal:** a full **500-question** LongMemEval (`longmemeval_s`) run, scored donto-memory vs **Zep 90.2%** — then the extraction/folding upside that proves the win comes from the philosophy, not episodic stuffing.

**Reference (getzep.com/research, 2026):** LongMemEval **90.2%** (451/500), gpt-5.4 reader (reasoning=medium) + gpt-5.4 CoT judge, retrieval p50 104ms, median context **4,408 tok**.

_Status: ✅ done · 🔄 in progress · ⛔ blocked · ⬜ not started. Last updated 2026-06-11._

---

## Headline result — BASELINE COMPLETE ✅

| metric | donto-memory | Zep | verdict |
|---|---|---|---|
| **accuracy** | **91.0%** (455/500) | 90.2% (451/500) | **donto wins** |
| retrieval p50 | 220 ms | 104 ms | Zep ~2× faster |
| retrieval p95 | 452 ms | 162 ms | Zep ~2.8× faster |
| **median context** | **55,836 tok** | 4,408 tok | **Zep ~13× leaner** |

**Read honestly:** donto wins accuracy from **raw episodic recall with no claim extraction at all** — but pays with ~13× the context tokens (it recalls whole session transcripts where Zep recalls distilled graph edges). The accuracy win is real; the token cost is the motivation for the extraction work below. *Per-type:* user 0.97 · knowledge-update 0.95 · temporal 0.95 · multi-session 0.83 · single-session-preference 0.70 (weakest).
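
The verdict column can be re-derived from the raw numbers in the table. The following is a minimal sketch of that arithmetic; the dictionary layout and variable names are illustrative assumptions, and only the figures themselves come from the table above.

```python
# Figures copied from the headline table; the dict layout is illustrative only.
donto = {"correct": 455, "total": 500, "p50_ms": 220, "p95_ms": 452, "median_ctx_tok": 55_836}
zep = {"correct": 451, "total": 500, "p50_ms": 104, "p95_ms": 162, "median_ctx_tok": 4_408}


def accuracy(row):
    """Fraction of questions judged correct."""
    return row["correct"] / row["total"]


# Accuracy gap in percentage points and in whole questions.
accuracy_gap_pts = (accuracy(donto) - accuracy(zep)) * 100
question_gap = donto["correct"] - zep["correct"]

# How many times slower / heavier donto is than Zep on each cost metric.
p50_ratio = donto["p50_ms"] / zep["p50_ms"]
p95_ratio = donto["p95_ms"] / zep["p95_ms"]
context_ratio = donto["median_ctx_tok"] / zep["median_ctx_tok"]

print(f"accuracy: {accuracy(donto):.1%} vs {accuracy(zep):.1%} "
      f"(+{accuracy_gap_pts:.1f} pts, {question_gap} questions)")
print(f"p50 ratio: {p50_ratio:.2f}x, p95 ratio: {p95_ratio:.2f}x")
print(f"median context ratio: {context_ratio:.2f}x")
```

On these figures the accuracy gap is 4 questions (about 0.8 points), and the context ratio is roughly 12.7, which the table rounds to ~13×.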

## Foundation

- ✅ **Dataset** — `longmemeval_s` (500 instances, the retrieval-stressed variant, ~50 sessions/question).
- ✅ **Episodic ingest** — 25,112 records (`holder=user:lme_s_official:<qid>`, `valid_from` injected per session).
- ✅ **Chunk embeddings** — 25,112 / 25,112 (100%).
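
As an illustration of the ingest shape described above, here is a minimal sketch of one episodic record per session. The `EpisodicRecord` class, its `text` field, and both helper functions are hypothetical and are not the donto-memory API; only the `holder` format and the per-session `valid_from` come from the checklist.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EpisodicRecord:
    """Hypothetical record shape; not the actual donto-memory schema."""
    holder: str           # scoping key, format taken from the checklist
    valid_from: datetime  # session date, injected at ingest
    text: str             # raw session transcript (no claim extraction)


def holder_for(qid: str) -> str:
    """Build the per-question holder key in the format the checklist shows."""
    return f"user:lme_s_official:{qid}"


def records_for_question(qid, sessions):
    """Turn (session_date, transcript) pairs into one record per session."""
    return [
        EpisodicRecord(holder=holder_for(qid), valid_from=when, text=transcript)
        for when, transcript in sessions
    ]


# Toy input: the question id and transcripts are made up for the example.
records = records_for_question(
    "example_qid",
    [
        (datetime(2023, 5, 1), "user: I just moved to Lisbon."),
        (datetime(2023, 8, 9), "user: I relocated to Porto last week."),
    ],
)
print(records[0].holder)
```

Keeping the session date on every record is what later lets a reader tell which of two conflicting statements is the more recent one.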

## The scored baseline (episodic-only, Zep-comparable)

- ✅ **Full 500 answered** — gpt-5.4 reader + judge (official per-type prompts), 0 empty after a contention-free re-run of 32 throttled answers.
- ✅ **Metrics** — 91.0% overall, per-type computed, abstention measured.
- ✅ **Latency + context measured** — p50 220ms / p95 452ms / median 55,836 tok.
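
The metrics in this section can be computed from per-question results along these lines. This is a sketch under assumptions: the result keys (`question_type`, `correct`, `latency_ms`, `context_tokens`) are hypothetical, and the percentile method used in the actual run is not stated here, so the inclusive method below is one reasonable choice rather than the run's definition.

```python
from collections import defaultdict
from statistics import median, quantiles


def summarize(results):
    """Aggregate per-question results into overall, per-type, and cost metrics.

    `results` is a list of dicts with hypothetical keys:
    question_type (str), correct (bool), latency_ms, context_tokens.
    """
    by_type = defaultdict(list)
    for r in results:
        by_type[r["question_type"]].append(r["correct"])

    latencies = [r["latency_ms"] for r in results]
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = quantiles(latencies, n=100, method="inclusive")[94]

    return {
        "accuracy": sum(r["correct"] for r in results) / len(results),
        "per_type": {t: sum(v) / len(v) for t, v in by_type.items()},
        "p50_ms": median(latencies),
        "p95_ms": p95,
        "median_context_tok": median(r["context_tokens"] for r in results),
    }


# Toy data only; these are not results from the run.
results = [
    {"question_type": "knowledge-update", "correct": True, "latency_ms": 100, "context_tokens": 1000},
    {"question_type": "knowledge-update", "correct": False, "latency_ms": 200, "context_tokens": 2000},
    {"question_type": "multi-session", "correct": True, "latency_ms": 300, "context_tokens": 3000},
    {"question_type": "multi-session", "correct": True, "latency_ms": 400, "context_tokens": 4000},
]
summary = summarize(results)
print(summary)
```

Reporting the median rather than the mean for context size keeps a handful of very long recalls from dominating the figure.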

## The philosophy upside — mostly NOT yet done (the bigger story)

- ✅ **Knowledge-update probe (free reader): donto's bitemporal value is in the *representation*, not recall re-ranking.**
  - Re-ran the 78 knowledge-update questions on the **free holo3.1** reader (vs the codex 0.95 above): **0.923 (72/78)**.
  - Then tested an explicit **recency** recall (re-sort recalled sessions latest-`valid_from`-first, to surface the *current* value of a changed fact): **0.897, slightly worse** (recovered 1, lost 3).
  - Conclusion: donto's dated episodic ingest (`valid_from` per session + current-date prompt) **already** lets the reader pick the latest value of a changing fact. The bitemporal win is baked into the representation, and re-ordering adds nothing.
  - With only 6 misses, claim-layer extraction is **not justified by the data here** (little headroom on this category). The real weak point to target is single-session-preference (~0.70), where a **pre-distilled preference-salience facet lifts 0.70→0.767**: the substrate pre-shaping the answer-relevant preference helps even the weak reader. (An earlier "it hurts" run was a wrong-key no-op, since corrected.)
- ⬜ **Predicate folding for LME claims** — after extraction (embed predicates → alignment closure).
- ⬜ **Coverage curve + claims-only recall** — does the claim layer lift the donto-signature categories *and* collapse the 55,836-token context toward Zep's 4,408?
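
The recency recall tested in the knowledge-update probe amounts to re-sorting the recalled sessions by `valid_from`, newest first. Here is a minimal sketch of that step; the session dict keys (`session_id`, `score`) and the function name are hypothetical, and the real recall pipeline may differ.

```python
from datetime import datetime


def recency_first(recalled):
    """Return recalled sessions ordered latest-valid_from-first.

    sorted() is stable, so sessions sharing a valid_from keep their
    original (relevance) order relative to each other.
    """
    return sorted(recalled, key=lambda s: s["valid_from"], reverse=True)


# Toy recall result, listed in relevance order; the values are made up.
recalled = [
    {"session_id": "s1", "valid_from": datetime(2023, 1, 10), "score": 0.91},
    {"session_id": "s2", "valid_from": datetime(2023, 6, 2), "score": 0.74},
    {"session_id": "s3", "valid_from": datetime(2023, 3, 15), "score": 0.66},
]

reordered = [s["session_id"] for s in recency_first(recalled)]
print(reordered)
```

The probe's numbers are consistent with each other: starting from 72 correct, recovering 1 and losing 3 leaves 70, and 70/78 ≈ 0.897. One possible reading, not tested here, is that a pure date sort discards the relevance ordering that the reader was benefiting from.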

---

_What this proves so far: episodic recall alone reaches parity-plus on accuracy while doing less work at ingest — but at a real token cost. The extraction layer is the lever for winning accuracy **and** efficiency together. → [all benchmarks](/benchmarks)_
