donto benchmark

LongMemEval — run checklist

Goal: a full 500-question LongMemEval (longmemeval_s) run that scores donto-memory against Zep's 90.2%, followed by the extraction/folding upside that proves the win comes from the philosophy, not episodic stuffing.

Reference (getzep.com/research, 2026): LongMemEval 90.2% (451/500), gpt-5.4 reader (reasoning=medium) + gpt-5.4 CoT judge, retrieval p50 104ms, median context 4,408 tok.

Status: ✅ done · 🔄 in progress · ⛔ blocked · ⬜ not started. Last updated 2026-06-11.


Headline result — BASELINE COMPLETE ✅

metric           donto-memory      Zep               verdict
accuracy         91.0% (455/500)   90.2% (451/500)   donto wins
retrieval p50    220 ms            104 ms            Zep ~2× faster
retrieval p95    452 ms            162 ms            Zep wins
median context   55,836 tok        4,408 tok         Zep ~13× leaner

Read honestly: donto wins on accuracy (by 4 questions, 0.8 points) from raw episodic recall with no claim extraction at all, but pays with ~13× the context tokens, because it recalls whole session transcripts where Zep recalls distilled graph edges. The accuracy win is real; the token cost is the motivation for the extraction work below. Per-type accuracy: user 0.97 · knowledge-update 0.95 · temporal 0.95 · multi-session 0.83 · single-session-preference 0.70 (weakest).
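
The headline comparison can be re-derived from the raw counts in the table. The sketch below only recomputes the ratios from the figures quoted above; the dictionary layout and function names are illustrative assumptions, not part of the benchmark harness.

```python
# Figures copied from the headline table; the dict layout is an assumption.
donto = {"correct": 455, "total": 500, "p50_ms": 220, "p95_ms": 452, "median_ctx_tok": 55_836}
zep = {"correct": 451, "total": 500, "p50_ms": 104, "p95_ms": 162, "median_ctx_tok": 4_408}


def accuracy(run):
    """Fraction of questions judged correct."""
    return run["correct"] / run["total"]


def ratio(a, b, key):
    """How many times larger a[key] is than b[key]."""
    return a[key] / b[key]


extra_correct = donto["correct"] - zep["correct"]
gap_points = (accuracy(donto) - accuracy(zep)) * 100

print(f"accuracy: {accuracy(donto):.1%} vs {accuracy(zep):.1%} "
      f"(+{extra_correct} questions, {gap_points:.1f} points)")
print(f"p50 latency ratio:    {ratio(donto, zep, 'p50_ms'):.1f}x")
print(f"p95 latency ratio:    {ratio(donto, zep, 'p95_ms'):.1f}x")
print(f"median context ratio: {ratio(donto, zep, 'median_ctx_tok'):.1f}x")
```

The accuracy margin is 4 questions out of 500, while the context ratio is roughly 12.7×, which is the trade-off the rest of the checklist is trying to close.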

Foundation

  • Dataset longmemeval_s (500 instances, the retrieval-stressed variant, ~50 sessions/question).
  • Episodic ingest — 25,112 records (holder=user:lme_s_official:<qid>, valid_from injected per session).
  • Chunk embeddings — 25,112 / 25,112 (100%).
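
As a rough illustration of the ingest shape described above, the sketch below builds one episodic record per session chunk. Only the holder format and the valid_from field come from the checklist; the EpisodicRecord class, the text field, and the helper names are hypothetical and do not reflect donto-memory's actual API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EpisodicRecord:
    holder: str       # per-question namespace, format taken from the checklist
    valid_from: date  # session date injected at ingest
    text: str         # hypothetical field: the session chunk's transcript text


def holder_for(qid: str) -> str:
    """Build the per-question holder key (format from the checklist)."""
    return f"user:lme_s_official:{qid}"


def make_record(qid: str, session_date: date, text: str) -> EpisodicRecord:
    """Wrap one session chunk as a dated episodic record."""
    return EpisodicRecord(holder=holder_for(qid), valid_from=session_date, text=text)


# Sanity check on the ingest totals quoted above.
records_per_question = 25_112 / 500
print(f"{records_per_question:.1f} records per question on average")
```

The average of about 50 records per question lines up with the ~50 sessions per question stated for the dataset.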

The scored baseline (episodic-only, Zep-comparable)

  • Full 500 answered — gpt-5.4 reader + judge (official per-type prompts), 0 empty after a contention-free re-run of 32 throttled answers.
  • Metrics — 91.0% overall, per-type computed, abstention measured.
  • Latency + context measured — p50 220ms / p95 452ms / median 55,836 tok.
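
For readers reproducing the latency figures, a minimal sketch of a p50/p95 computation follows. The checklist does not say which percentile method was used, so the nearest-rank definition here is an assumption, and the sample latencies are made up purely to exercise the function.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. Other definitions interpolate instead."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-query retrieval latencies in milliseconds (not benchmark data).
latencies_ms = [180, 195, 210, 220, 230, 260, 310, 380, 452, 600]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"p50={p50} ms, p95={p95} ms")
```

With small samples the choice of percentile method can move p95 noticeably, so it is worth fixing one definition when comparing against published figures.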

The philosophy upside — NOT yet done (the bigger story)

  • Knowledge-update probe (free reader): donto's bitemporal value is in the representation, not in recall re-ranking. Re-running the 78 knowledge-update questions on the free holo3.1 reader (vs the codex 0.95 above) scored 0.923 (72/78). An explicit recency recall, which re-sorts recalled sessions latest-valid_from-first to surface the current value of a changed fact, scored 0.897 (70/78): slightly worse (recovered 1, lost 3). Conclusion: the dated episodic ingest (valid_from per session plus a current-date prompt) already lets the reader pick the latest value of a changing fact, so the bitemporal win is baked into the representation and re-ordering adds nothing. With only 6 misses there is little headroom in this category, and claim-layer extraction is not justified by the data here. The real weak point to target is single-session-preference (~0.70), where a pre-distilled preference-salience facet lifts accuracy 0.70→0.767: pre-shaping the answer-relevant preference in the substrate helps even the weak reader. (An earlier run that suggested the facet hurts was a wrong-key no-op and has been corrected.)
  • Predicate folding for LME claims — after extraction (embed predicates → alignment closure).
  • Coverage curve + claims-only recall — does the claim layer lift the donto-signature categories and collapse the 55,836-token context toward Zep's 4,408?
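
The recency recall tested in the knowledge-update probe amounts to re-sorting the recalled sessions by valid_from, newest first. The sketch below shows that re-sort on made-up sessions and then reproduces the score accounting from the probe; the session dictionaries and their text are hypothetical, and only the valid_from field and the counts (78 questions, 72 correct, recovered 1, lost 3) come from the checklist.

```python
from datetime import date


def recency_first(sessions):
    """Re-sort recalled sessions so the latest valid_from comes first."""
    return sorted(sessions, key=lambda s: s["valid_from"], reverse=True)


# Hypothetical recalled sessions in relevance order (not benchmark data).
recalled = [
    {"valid_from": date(2023, 3, 1), "text": "I just moved to Lisbon."},
    {"valid_from": date(2023, 5, 20), "text": "I relocated to Porto last week."},
    {"valid_from": date(2023, 1, 15), "text": "I'm thinking about leaving Madrid."},
]

reordered = recency_first(recalled)
print([s["valid_from"].isoformat() for s in reordered])

# Score accounting for the probe, using the counts quoted in the checklist.
total = 78
baseline_correct = 72
recovered, lost = 1, 3
recency_correct = baseline_correct + recovered - lost

print(f"baseline: {baseline_correct / total:.3f}")
print(f"recency:  {recency_correct / total:.3f}")
```

Because each recalled session already carries its date, the re-sort changes only the order the reader sees, which is consistent with the probe finding no gain from it.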

What this proves so far: episodic recall alone reaches parity-plus on accuracy while doing less work at ingest — but at a real token cost. The extraction layer is the lever for winning accuracy and efficiency together.