LongMemEval — run checklist

Goal: a full 500-question LongMemEval (longmemeval_s) run that scores donto-memory against Zep's published 90.2%, followed by the extraction/folding work intended to show that the win comes from the philosophy, not from episodic stuffing.

Reference (getzep.com/research, 2026): LongMemEval 90.2% (451/500), gpt-5.4 reader (reasoning=medium) + gpt-5.4 CoT judge, retrieval p50 104ms, median context 4,408 tok.

Status: ✅ done · 🔄 in progress · ⛔ blocked · ⬜ not started. Last updated 2026-06-11.

Headline result — BASELINE COMPLETE ✅

metric	donto-memory	Zep	verdict
accuracy	91.0% (455/500)	90.2% (451/500)	donto wins
retrieval p50	220 ms	104 ms	Zep ~2× faster
retrieval p95	452 ms	162 ms	Zep wins
median context	55,836 tok	4,408 tok	Zep ~13× leaner

Read honestly: donto wins on accuracy using raw episodic recall with no claim extraction at all, but pays for it with ~13× the context tokens (it recalls whole session transcripts where Zep recalls distilled graph edges). The accuracy win is real, though narrow (a 4-question margin, 455 vs 451 of 500); the token cost is the motivation for the extraction work below. Per-type accuracy: user 0.97 · knowledge-update 0.95 · temporal 0.95 · multi-session 0.83 · single-session-preference 0.70 (weakest).
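The ratios quoted in the verdict column can be re-derived from the headline table. A minimal sketch, using only the figures already listed there; the dictionary keys and function names are illustrative, not part of donto-memory or Zep.

```python
# Figures copied from the headline table; nothing here is newly measured.
DONTO = {"correct": 455, "total": 500, "p50_ms": 220, "p95_ms": 452, "median_ctx_tok": 55_836}
ZEP = {"correct": 451, "total": 500, "p50_ms": 104, "p95_ms": 162, "median_ctx_tok": 4_408}


def accuracy(row):
    """Fraction of questions judged correct."""
    return row["correct"] / row["total"]


def ratio(metric):
    """How many times larger donto's value is than Zep's for one metric."""
    return DONTO[metric] / ZEP[metric]


margin_questions = DONTO["correct"] - ZEP["correct"]

print(f"accuracy: {accuracy(DONTO):.1%} vs {accuracy(ZEP):.1%} ({margin_questions} questions apart)")
print(f"p50 latency ratio: {ratio('p50_ms'):.1f}x")
print(f"p95 latency ratio: {ratio('p95_ms'):.1f}x")
print(f"median context ratio: {ratio('median_ctx_tok'):.1f}x")
```

The unrounded ratios come out at roughly 2.1× (p50), 2.8× (p95) and 12.7× (median context), which is where the "~2×" and "~13×" in the table come from.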

Foundation

✅ Dataset — longmemeval_s (500 instances, the retrieval-stressed variant, ~50 sessions/question).
✅ Episodic ingest — 25,112 records (holder=user:lme_s_official:<qid>, valid_from injected per session).
✅ Chunk embeddings — 25,112 / 25,112 (100%).
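For orientation, the ingest convention above (one holder per question, valid_from taken from the session date) could look like the sketch below. Only the holder pattern and the valid_from field come from this checklist; the `EpisodicRecord` class, the `text` field, the sample question id, and the sample content are assumptions made for illustration, not donto-memory's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

HOLDER_PREFIX = "user:lme_s_official"  # prefix taken from the checklist


@dataclass(frozen=True)
class EpisodicRecord:
    holder: str           # one holder per question: user:lme_s_official:<qid>
    valid_from: datetime  # session date injected at ingest
    text: str             # hypothetical field: the chunk content


def holder_for(qid: str) -> str:
    """Build the per-question holder string."""
    return f"{HOLDER_PREFIX}:{qid}"


def make_record(qid: str, session_date: datetime, text: str) -> EpisodicRecord:
    """Create one episodic record stamped with its session date."""
    return EpisodicRecord(holder_for(qid), session_date, text)


def embedding_coverage(embedded: int, total: int) -> float:
    """Fraction of ingested records that have a chunk embedding."""
    if total <= 0:
        raise ValueError("total must be positive")
    return embedded / total


# "q_example" and the text are made-up placeholders, not real dataset values.
record = make_record("q_example", datetime(2023, 5, 20), "user: I just moved.")
print(record.holder)
print(f"{embedding_coverage(25_112, 25_112):.0%}")
```

Scoping the holder per question keeps each question's ~50 sessions isolated from the others at recall time, assuming recall is filtered by holder.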

The scored baseline (episodic-only, Zep-comparable)

✅ Full 500 answered: gpt-5.4 reader + judge (official per-type prompts); 0 empty answers after re-running the 32 throttled answers without contention.
✅ Metrics — 91.0% overall, per-type computed, abstention measured.
✅ Latency + context measured — p50 220ms / p95 452ms / median 55,836 tok.
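The latency and context figures are order statistics over per-question measurements. A minimal sketch of how such a summary can be computed, assuming a nearest-rank percentile; the checklist does not say which percentile method the run used, and the sample values below are made up for illustration.

```python
import math
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_ms, context_tokens):
    """Summarize one run in the same shape as the checklist's latency/context line."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "median_ctx_tok": median(context_tokens),
    }


# Illustrative samples only; these are not measurements from the run.
sample_latencies = [180, 200, 210, 220, 230, 250, 300, 380, 450, 470]
sample_contexts = [40_000, 52_000, 55_000, 58_000, 61_000]
summary = summarize(sample_latencies, sample_contexts)
print(summary)
```

Other percentile definitions (for example linear interpolation) can give slightly different p50/p95 values on the same data, so the method should match on both sides of any comparison.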

The philosophy upside — NOT yet done (the bigger story)

✅ Knowledge-update probe (free reader): donto's bitemporal value is in the representation, not in recall re-ranking. Re-ran the 78 knowledge-update questions on the free holo3.1 reader (vs the codex 0.95 above): 0.923 (72/78). Then tested an explicit recency recall that re-sorts recalled sessions latest-valid_from-first to surface the current value of a changed fact: 0.897 (70/78), slightly worse (recovered 1, lost 3). Conclusion: donto's dated episodic ingest (valid_from per session plus a current-date prompt) already lets the reader pick the latest value of a changing fact, so the bitemporal win is baked into the representation and re-ordering adds nothing. With only 6 misses there is little headroom in this category, so claim-layer extraction is not justified by the data here. The real weak point to target is single-session-preference (~0.70), where a pre-distilled preference-salience facet lifts accuracy 0.70→0.767: having the substrate pre-shape the answer-relevant preference helps even the weak reader. (An earlier run that suggested the facet hurts was a wrong-key no-op and has been corrected.)
⬜ Predicate folding for LME claims — after extraction (embed predicates → alignment closure).
⬜ Coverage curve + claims-only recall — does the claim layer lift the donto-signature categories and collapse the 55,836-token context toward Zep's 4,408?
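The recency recall tested in the knowledge-update probe amounts to a stable re-sort of the recalled sessions, and its net effect follows from the recovered/lost counts. A sketch under stated assumptions: `RecalledSession`, its fields other than valid_from, and the sample sessions are hypothetical; the counts (72 correct, 1 recovered, 3 lost, 78 questions) are taken from the probe.

```python
from datetime import date
from typing import NamedTuple


class RecalledSession(NamedTuple):
    session_id: str   # hypothetical identifier
    valid_from: date  # session date injected at ingest
    score: float      # hypothetical retrieval score


def recency_first(sessions):
    """Re-sort recalled sessions latest valid_from first; ties keep their recall order."""
    return sorted(sessions, key=lambda s: s.valid_from, reverse=True)


def net_accuracy(baseline_correct, recovered, lost, total):
    """Accuracy after a change that fixes some answers and breaks others."""
    return (baseline_correct + recovered - lost) / total


# Made-up recall result, ordered by retrieval score.
recalled = [
    RecalledSession("s1", date(2023, 1, 10), 0.91),
    RecalledSession("s2", date(2023, 6, 2), 0.85),
    RecalledSession("s3", date(2023, 3, 15), 0.80),
]
print([s.session_id for s in recency_first(recalled)])

baseline = net_accuracy(72, 0, 0, 78)
with_recency = net_accuracy(72, 1, 3, 78)
print(f"{baseline:.3f} -> {with_recency:.3f}")
```

The re-sort discards the retrieval ordering entirely, which is one plausible reason it lost more answers than it recovered; the probe itself does not establish the cause.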

What this proves so far: episodic recall alone reaches parity-plus on accuracy while doing less work at ingest, but at a real token cost. The extraction layer is the lever for winning accuracy and efficiency together.