dontobenchmarks

Memory benchmarks — live status board

Living checklists for every agent-memory benchmark donto runs. donto is built to win these benchmarks through its philosophical extraction (evidence-anchored, contradiction-preserving, bitemporal claims), not by stuffing episodic chunks into the context. Each page therefore tracks not just the score but whether the claim layer is doing the work. Pages are updated as runs progress.

Last updated: 2026-06-11.

| benchmark | reference target | donto status |
|---|---|---|
| LongMemEval (_s, 500 Q) | Zep 90.2% | ✅ baseline 91.0% (episodic-only; extraction upside pending) |
| LoCoMo (1,540 Q) | Zep 94.7% | 🔄 building claim layer → coverage curve → scored run |
| BEAM-10M (~200 Q) | Hindsight 64.1% | ⏸ paused (partial 0.684 at ~5% coverage) |

The method (same for all)

  1. Build the foundation — ingest, embed chunks, extract the claim layer (verified in donto_statement, not the queue's done flag).
  2. Align — embed predicates/entities, build the alignment closure so predicate folding is live (on a quiet box — never while extraction loads it).
  3. Measure — scored run on a clean gateway (never shared with extraction); report accuracy per category, retrieval latency, and context tokens.
  4. Prove the philosophy — the extraction-coverage curve and claims-only recall, showing the win comes from the claim layer (especially knowledge-update / temporal / multi-hop), not raw recall.
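As a rough illustration of the reporting in step 3, the sketch below aggregates accuracy, retrieval latency, and context tokens per category. The `ScoredItem` record and the `summarize` function are hypothetical names, not donto's actual harness. Counting empty reader outputs separately is also an assumption, made so that an empty response shows up as its own signal rather than as an ordinary wrong answer.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ScoredItem:
    """One judged question. Hypothetical record shape, not donto's schema."""
    category: str        # e.g. "knowledge-update", "temporal", "multi-hop"
    correct: bool        # judge verdict
    answer: str          # raw reader output; may be empty
    latency_ms: float    # retrieval latency for this question
    context_tokens: int  # tokens of context handed to the reader


def summarize(items):
    """Group judged items by category and report the step-3 metrics."""
    buckets = defaultdict(list)
    for item in items:
        buckets[item.category].append(item)

    report = {}
    for category, rows in buckets.items():
        n = len(rows)
        report[category] = {
            "n": n,
            # Empty outputs are tallied separately so they stay visible.
            "empty": sum(1 for r in rows if not r.answer.strip()),
            # Accuracy is over all questions, including empty answers.
            "accuracy": sum(1 for r in rows if r.correct) / n,
            "mean_latency_ms": sum(r.latency_ms for r in rows) / n,
            "mean_context_tokens": sum(r.context_tokens for r in rows) / n,
        }
    return report
```

Running the same summary at several extraction-coverage levels would give one point per level for the coverage curve in step 4, with the knowledge-update, temporal, and multi-hop rows being the ones to watch.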

Hard rules (paid for in lost hours)

  • Never share the inference gateway between a producer (extraction) and a consumer (reader/judge): holo3.1 returns empty responses under load, which silently corrupts both.
  • status='done' ≠ extracted — reconcile against n_claims>0 and donto_statement.
  • Build before measuring — no scored run until the claim layer is real and folding is live.
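The reconciliation rule above can be sketched as a small check. This is a minimal illustration under assumed data shapes: queue rows are taken to be dicts with `id`, `status`, and `n_claims` keys, and `statement_counts` is taken to map each id to the number of rows found for it in donto_statement. The function name and both shapes are hypothetical; the real queue and table schemas are not shown here.

```python
def find_unextracted(queue_rows, statement_counts):
    """Return ids marked done that have no claims to show for it.

    queue_rows: iterable of dicts with "id", "status", "n_claims" (assumed shape).
    statement_counts: dict mapping id -> rows found in donto_statement (assumed).
    """
    suspect = []
    for row in queue_rows:
        if row["status"] != "done":
            continue  # only the done flag is under suspicion here
        reported = row.get("n_claims") or 0          # treat missing/None as 0
        stored = statement_counts.get(row["id"], 0)  # absent id means no rows
        if reported <= 0 or stored <= 0:
            suspect.append(row["id"])
    return suspect


queue = [
    {"id": "a", "status": "done", "n_claims": 3},
    {"id": "b", "status": "done", "n_claims": 0},     # done, but nothing reported
    {"id": "c", "status": "done", "n_claims": 2},     # reported, but nothing stored
    {"id": "d", "status": "pending", "n_claims": 0},  # not done, so not checked
]
counts = {"a": 3, "c": 0}
print(find_unextracted(queue, counts))
```

Any id this returns would need re-extraction before a scored run, in line with the build-before-measuring rule.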

These benchmarks are an instrument for building donto: every reading (a 55K-token context, an empty response under load, an under-built claim layer) is a steering signal for what to build next.
