Memory benchmarks — live status board

Living checklists for every agent-memory benchmark donto runs. donto aims to win these benchmarks through its extraction philosophy (evidence-anchored, contradiction-preserving, bitemporal claims) rather than by stuffing episodic chunks into context. Each page therefore tracks not just the score but whether the claim layer is doing the work. Pages are updated as runs progress.

Last updated: 2026-06-11.

benchmark	reference target	donto status
LongMemEval (`_s`, 500 Q)	Zep 90.2%	✅ baseline 91.0% (episodic-only; extraction upside pending)
LoCoMo (1,540 Q)	Zep 94.7%	🔄 building claim layer → coverage curve → scored run
BEAM-10M (~200 Q)	Hindsight 64.1%	⏸ paused (partial 68.4% at ~5% coverage)

The method (same for all)

Build the foundation: ingest, embed chunks, and extract the claim layer. Verify extraction in donto_statement, not by the queue's done flag.
Align: embed predicates and entities, then build the alignment closure so predicate folding is live. Do this on an idle machine, never while extraction is loading it.
Measure: run the scored evaluation on a clean gateway (never one shared with extraction), and report accuracy per category, retrieval latency, and context tokens.
Prove the philosophy: produce the extraction-coverage curve and claims-only recall, showing that the win comes from the claim layer (especially on knowledge-update, temporal, and multi-hop questions), not from raw recall.
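
The report named in the Measure step can be sketched as below. This is a minimal illustration, not donto's actual tooling: the `QuestionResult` record and the `summarize_run` function are hypothetical names, and the choice of median for latency and mean for context tokens is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class QuestionResult:
    """One judged question from a scored run (hypothetical record shape)."""
    category: str        # e.g. "knowledge-update", "temporal", "multi-hop"
    correct: bool        # verdict from the judge
    latency_ms: float    # retrieval latency for this question
    context_tokens: int  # tokens handed to the reader


def summarize_run(results):
    """Return per-category accuracy plus run-wide latency and context size."""
    if not results:
        raise ValueError("no results to summarize")

    # Group the judge verdicts by question category.
    by_category = defaultdict(list)
    for r in results:
        by_category[r.category].append(r.correct)

    return {
        "accuracy_by_category": {
            cat: sum(flags) / len(flags)
            for cat, flags in sorted(by_category.items())
        },
        "overall_accuracy": sum(r.correct for r in results) / len(results),
        # Assumption: median for latency, mean for context tokens.
        "median_latency_ms": median([r.latency_ms for r in results]),
        "mean_context_tokens": mean([r.context_tokens for r in results]),
    }
```

Calling `summarize_run` once per coverage level and collecting the per-category accuracies would give the points of an extraction-coverage curve.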

Hard rules (paid for in lost hours)

Never share the inference gateway between a producer (extraction) and a consumer (reader/judge): holo3.1 returns empty responses under load, which silently corrupts both.
status='done' ≠ extracted: a queue item marked done may have produced no claims, so reconcile against n_claims>0 and the rows in donto_statement.
Build before measuring: no scored run until the claim layer is real and folding is live.
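
The reconciliation rule can be sketched as follows. The names `status`, `n_claims`, and `donto_statement` come from the rules above; everything else (the `doc_id` key, the in-memory row shape, the `find_unextracted` function) is a hypothetical stand-in for whatever the real queue and statement store expose.

```python
def find_unextracted(queue_rows, statement_doc_ids):
    """Return ids of queue items marked done that lack verified claims.

    queue_rows: iterable of dicts with "doc_id", "status", "n_claims"
                (hypothetical shape of a queue record).
    statement_doc_ids: set of doc ids that have at least one row in
                donto_statement (assumed to be gathered beforehand).
    """
    suspects = []
    for row in queue_rows:
        if row["status"] != "done":
            continue  # only items claiming to be finished need checking

        # Treat a missing or null n_claims as zero claims.
        has_claims = (row.get("n_claims") or 0) > 0
        has_statements = row["doc_id"] in statement_doc_ids

        # An item counts as extracted only when both checks agree.
        if not (has_claims and has_statements):
            suspects.append(row["doc_id"])
    return suspects
```

Any id this returns is a candidate for re-queuing before a scored run; requiring both checks is a deliberately strict reading of the rule.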

These benchmarks are an instrument for building donto: every reading (a 55K-token context, an empty response under load, an under-built claim layer) is a steering signal for what to build next.