LongMemEval — run checklist
Goal: a full 500-question LongMemEval (longmemeval_s) run that scores donto-memory against Zep's 90.2%, followed by the extraction/folding upside that proves the win comes from the philosophy, not episodic stuffing.
Reference (getzep.com/research, 2026): LongMemEval 90.2% (451/500), gpt-5.4 reader (reasoning=medium) + gpt-5.4 CoT judge, retrieval p50 104ms, median context 4,408 tok.
Status: ✅ done · 🔄 in progress · ⛔ blocked · ⬜ not started. Last updated 2026-06-11.
Headline result — BASELINE COMPLETE ✅
| metric | donto-memory | Zep | verdict |
|---|---|---|---|
| accuracy | 91.0% (455/500) | 90.2% (451/500) | donto wins |
| retrieval p50 | 220 ms | 104 ms | Zep ~2× faster |
| retrieval p95 | 452 ms | 162 ms | Zep wins |
| median context | 55,836 tok | 4,408 tok | Zep ~13× leaner |
Read honestly: donto wins accuracy from raw episodic recall with no claim extraction at all — but pays with ~13× the context tokens (it recalls whole session transcripts where Zep recalls distilled graph edges). The accuracy win is real; the token cost is the motivation for the extraction work below. Per-type: user 0.97 · knowledge-update 0.95 · temporal 0.95 · multi-session 0.83 · single-session-preference 0.70 (weakest).
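The ratios quoted above can be re-derived from the table. This is a minimal arithmetic sketch; the dictionary layout and variable names are illustrative only, and the numbers are copied from the headline table.

```python
# Headline figures copied from the table above (dict layout is illustrative).
donto = {"correct": 455, "total": 500, "p50_ms": 220, "p95_ms": 452, "median_ctx_tok": 55_836}
zep = {"correct": 451, "total": 500, "p50_ms": 104, "p95_ms": 162, "median_ctx_tok": 4_408}


def accuracy(run):
    """Fraction of questions judged correct."""
    return run["correct"] / run["total"]


# Accuracy gap, in percentage points and in raw question count.
acc_gap_points = (accuracy(donto) - accuracy(zep)) * 100
question_gap = donto["correct"] - zep["correct"]

# How much more context and latency donto spends relative to Zep.
ctx_ratio = donto["median_ctx_tok"] / zep["median_ctx_tok"]
p50_ratio = donto["p50_ms"] / zep["p50_ms"]
p95_ratio = donto["p95_ms"] / zep["p95_ms"]

print(f"accuracy gap: {acc_gap_points:.1f} points ({question_gap} questions)")
print(f"context ratio: {ctx_ratio:.1f}x, p50 ratio: {p50_ratio:.2f}x, p95 ratio: {p95_ratio:.2f}x")
```

The accuracy margin works out to 4 questions out of 500, while the context ratio rounds to the ~13× quoted above.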
Foundation
- ✅ Dataset: `longmemeval_s` (500 instances, the retrieval-stressed variant, ~50 sessions/question).
- ✅ Episodic ingest: 25,112 records (`holder=user:lme_s_official:<qid>`, `valid_from` injected per session).
- ✅ Chunk embeddings: 25,112 / 25,112 (100%).
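The ingest convention above (one holder per question, one `valid_from` per session) could look roughly like the sketch below. The actual donto-memory ingest API is not shown in this checklist, so `EpisodicRecord`, `holder_for`, and `records_for_session` are hypothetical names used only for illustration; only the holder format and the per-session `valid_from` come from the checklist.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EpisodicRecord:
    """Hypothetical record shape; the real schema is not given in this checklist."""
    holder: str
    valid_from: datetime
    text: str


def holder_for(qid: str) -> str:
    # Holder format as stated in the checklist: user:lme_s_official:<qid>
    return f"user:lme_s_official:{qid}"


def records_for_session(qid: str, session_date: datetime, chunks: list[str]) -> list[EpisodicRecord]:
    """Stamp every chunk of one session with the same holder and valid_from."""
    holder = holder_for(qid)
    return [EpisodicRecord(holder=holder, valid_from=session_date, text=c) for c in chunks]


# Usage with made-up inputs (the qid and text are placeholders, not dataset values).
recs = records_for_session("q1", datetime(2023, 5, 20), ["first chunk", "second chunk"])
print(recs[0].holder, recs[0].valid_from.isoformat())
```

Scoping the holder by question id keeps each question's haystack isolated, so recall for one question cannot pull sessions that belong to another.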
The scored baseline (episodic-only, Zep-comparable)
- ✅ Full 500 answered — gpt-5.4 reader + judge (official per-type prompts), 0 empty after a contention-free re-run of 32 throttled answers.
- ✅ Metrics — 91.0% overall, per-type computed, abstention measured.
- ✅ Latency + context measured — p50 220ms / p95 452ms / median 55,836 tok.
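A minimal sketch of how the baseline metrics above could be aggregated from per-question results. The result field names (`type`, `correct`, `latency_ms`, `context_tokens`) and the nearest-rank percentile are assumptions; the checklist does not say which percentile method or result schema the run used.

```python
import math
from collections import defaultdict
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile (an assumed method, not confirmed by the run)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results):
    """Aggregate overall accuracy, per-type accuracy, latency percentiles, median context."""
    by_type = defaultdict(list)
    for r in results:
        by_type[r["type"]].append(r["correct"])
    latencies = [r["latency_ms"] for r in results]
    return {
        "accuracy": sum(r["correct"] for r in results) / len(results),
        "per_type": {t: sum(v) / len(v) for t, v in by_type.items()},
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "median_ctx_tok": median([r["context_tokens"] for r in results]),
    }


# Toy data only; these are not rows from the actual run.
toy = [
    {"type": "knowledge-update", "correct": True, "latency_ms": 100, "context_tokens": 1000},
    {"type": "knowledge-update", "correct": False, "latency_ms": 200, "context_tokens": 2000},
    {"type": "multi-session", "correct": True, "latency_ms": 300, "context_tokens": 3000},
    {"type": "multi-session", "correct": True, "latency_ms": 400, "context_tokens": 4000},
]
summary = summarize(toy)
print(summary)
```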
The philosophy upside — NOT yet done (the bigger story)
- ✅ Knowledge-update probe (free reader): donto's bitemporal value is in the representation, not in recall re-ranking. Re-ran the 78 knowledge-update questions on the free holo3.1 reader (vs the codex 0.95 above): 0.923 (72/78). Then tested an explicit recency recall (re-sort recalled sessions latest-`valid_from`-first, to surface the current value of a changed fact): 0.897, slightly worse (recovered 1, lost 3). Conclusion: donto's dated episodic ingest (`valid_from` per session + current-date prompt) already lets the reader pick the latest value of a changing fact, so the bitemporal win is baked into the representation and re-ordering adds nothing. With only 6 misses, the data does not justify claim-layer extraction for this category (little headroom). The real weak point to target is single-session-preference (~0.70), where a pre-distilled preference-salience facet lifts 0.70→0.767: the substrate pre-shaping the answer-relevant preference helps even the weak reader. (An earlier "it hurts" run was a wrong-key no-op and has been corrected.)
- ⬜ Predicate folding for LME claims: after extraction (embed predicates → alignment closure).
- ⬜ Coverage curve + claims-only recall: does the claim layer lift the donto-signature categories and collapse the 55,836-token context toward Zep's 4,408?
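The recency experiment in the knowledge-update probe can be sketched as follows: re-sort recalled sessions so the latest `valid_from` comes first, then compare per-question correctness against the baseline to count recovered and lost answers. The session dict shape and the function names are hypothetical; only the sort order and the recovered/lost bookkeeping come from the probe description.

```python
from datetime import date


def recency_first(sessions):
    """Re-sort recalled sessions so the latest valid_from comes first."""
    return sorted(sessions, key=lambda s: s["valid_from"], reverse=True)


def compare_runs(baseline, variant):
    """Count questions the variant recovered (wrong -> right) and lost (right -> wrong)."""
    recovered = sum(1 for q in baseline if not baseline[q] and variant[q])
    lost = sum(1 for q in baseline if baseline[q] and not variant[q])
    return recovered, lost


# Toy recall result; session ids and dates are placeholders.
recalled = [
    {"id": "s1", "valid_from": date(2023, 1, 5)},
    {"id": "s2", "valid_from": date(2023, 6, 1)},
    {"id": "s3", "valid_from": date(2023, 3, 12)},
]
order = [s["id"] for s in recency_first(recalled)]

# Toy per-question correctness for the two runs.
baseline = {"q1": True, "q2": False, "q3": True}
variant = {"q1": False, "q2": True, "q3": True}
recovered, lost = compare_runs(baseline, variant)

# Sanity check of the reported figures: 72 correct, +1 recovered, -3 lost, out of 78.
variant_correct = 72 + 1 - 3
print(order, recovered, lost, round(variant_correct / 78, 3))
```

Tracking recovered and lost separately, rather than only the net score, makes it visible when a re-ranking fixes some questions while breaking others, which is what the probe reports.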
What this proves so far: episodic recall alone reaches parity-plus on accuracy while doing less work at ingest, but at a real token cost. The extraction layer is the lever for winning accuracy and efficiency together.