Memory benchmarks — live status board
Living checklists for every agent-memory benchmark donto runs. donto is built to win these benchmarks through its philosophical extraction — evidence-anchored, contradiction-preserving, bitemporal claims — not episodic-chunk stuffing. So each page tracks not just the score but whether the claim layer is doing the work. Updated as runs progress.
Last updated: 2026-06-11.
| benchmark | reference target | donto status |
|---|---|---|
| LongMemEval (_s, 500 Q) | Zep 90.2% | ✅ baseline 91.0% (episodic-only; extraction upside pending) |
| LoCoMo (1,540 Q) | Zep 94.7% | 🔄 building claim layer → coverage curve → scored run |
| BEAM-10M (~200 Q) | Hindsight 64.1% | ⏸ paused (partial 0.684 at ~5% coverage) |
The method (same for all)
- Build the foundation: ingest, embed chunks, extract the claim layer (verified in `donto_statement`, not the queue's `done` flag).
- Align: embed predicates/entities, build the alignment closure so predicate folding is live (on a quiet box, never while extraction loads it).
- Measure: scored run on a clean gateway (never shared with extraction); report accuracy per category, retrieval latency, and context tokens.
- Prove the philosophy: the extraction-coverage curve and claims-only recall, showing the win comes from the claim layer (especially knowledge-update / temporal / multi-hop), not raw recall.
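The report from the Measure step can be sketched roughly as follows. This is a minimal illustration, not donto's actual harness: `RunRecord`, its fields, and `summarize` are hypothetical names, and the sample rows are made-up placeholders, not benchmark results.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class RunRecord:
    """One judged question from a scored run (hypothetical shape)."""
    category: str         # e.g. "knowledge-update", "temporal", "multi-hop"
    correct: bool         # judge verdict for this question
    retrieval_ms: float   # retrieval latency for this question
    context_tokens: int   # tokens handed to the reader


def summarize(records):
    """Group records by category; report accuracy, latency, and context size."""
    by_category = defaultdict(list)
    for r in records:
        by_category[r.category].append(r)

    report = {}
    for category, rows in sorted(by_category.items()):
        report[category] = {
            "n": len(rows),
            "accuracy": sum(r.correct for r in rows) / len(rows),
            "median_retrieval_ms": median(r.retrieval_ms for r in rows),
            "mean_context_tokens": mean(r.context_tokens for r in rows),
        }
    return report


# Placeholder data for illustration only; not real benchmark output.
sample = [
    RunRecord("temporal", True, 120.0, 4000),
    RunRecord("temporal", False, 180.0, 6000),
    RunRecord("multi-hop", True, 200.0, 8000),
]
report = summarize(sample)
for category, stats in report.items():
    print(category, stats)
```

Keeping the three numbers side by side per category makes it easier to see whether an accuracy gain came with a larger context or slower retrieval.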
Hard rules (paid for in lost hours)
- Never share the inference gateway between a producer (extraction) and a consumer (reader/judge): holo3.1 returns empty-under-load and silently corrupts both.
- `status='done'` ≠ extracted: reconcile against `n_claims>0` and `donto_statement`.
- Build before measuring: no scored run until the claim layer is real and folding is live.
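The reconciliation rule above might be checked along these lines. This is a sketch under stated assumptions: the row keys (`id`, `status`, `n_claims`) and the `statement_counts` mapping are hypothetical stand-ins for however the queue and `donto_statement` are actually queried.

```python
def find_false_done(queue_rows, statement_counts):
    """Return ids of queue items marked done that have no extracted claims.

    queue_rows: iterable of dicts with hypothetical keys
        "id", "status", "n_claims".
    statement_counts: dict mapping item id -> number of rows found for it
        in donto_statement (how that count is obtained is left out here).
    """
    suspect = []
    for row in queue_rows:
        if row["status"] != "done":
            continue  # only items claiming to be finished need reconciling
        has_claims = (row.get("n_claims") or 0) > 0
        in_statements = statement_counts.get(row["id"], 0) > 0
        if not (has_claims and in_statements):
            suspect.append(row["id"])
    return suspect


# Illustrative data only.
queue = [
    {"id": "a", "status": "done", "n_claims": 12},
    {"id": "b", "status": "done", "n_claims": 0},
    {"id": "c", "status": "done", "n_claims": 5},
    {"id": "d", "status": "pending", "n_claims": 0},
]
counts = {"a": 12, "c": 0}
suspects = find_false_done(queue, counts)
print(suspects)  # ['b', 'c']
```

An item is flagged unless both signals agree, so a `done` flag with a nonzero `n_claims` but nothing in `donto_statement` is still treated as not extracted.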
These benchmarks are an instrument for building donto: every reading — a 55K-token context, an empty-under-load, an under-built claim layer — is steering signal for what to build next.