# Memory benchmarks — live status board

Living checklists for every agent-memory benchmark donto runs. donto is built to **win these benchmarks through its philosophical extraction** (evidence-anchored, contradiction-preserving, bitemporal claims) rather than by stuffing episodic chunks into context. Each page therefore tracks not just the score but whether the *claim layer* is doing the work. The board is updated as runs progress.

_Last updated: 2026-06-11._

| benchmark | reference target | donto status |
|---|---|---|
| **[LongMemEval](/benchmarks/longmemeval)** (`_s`, 500 Q) | Zep 90.2% | ✅ baseline **91.0%** (episodic-only; extraction upside pending) |
| **[LoCoMo](/benchmarks/locomo)** (1,540 Q) | Zep 94.7% | 🔄 building claim layer → coverage curve → scored run |
| **[BEAM-10M](/benchmarks/beam)** (~200 Q) | Hindsight 64.1% | ⏸ paused (partial 68.4% at ~5% coverage) |

## The method (same for all)

1. **Build the foundation** — ingest, embed chunks, extract the claim layer (verified in `donto_statement`, not the queue's `done` flag).
2. **Align** — embed predicates/entities, build the alignment closure so **predicate folding** is live (on a quiet box — never while extraction loads it).
3. **Measure** — scored run on a clean gateway (never shared with extraction); report accuracy **per category**, retrieval latency, and context tokens.
4. **Prove the philosophy** — the extraction-coverage curve and claims-only recall, showing the win comes from the claim layer (especially knowledge-update / temporal / multi-hop), not raw recall.
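
The reporting in step 3 can be sketched as a small aggregation over scored results. This is a minimal illustration, not donto's actual harness: the function name `summarize_run` and the record keys (`category`, `correct`, `latency_ms`, `context_tokens`) are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def summarize_run(results):
    """Aggregate a scored run into per-category accuracy, latency, and context size.

    `results` is a list of dicts with hypothetical keys:
    category (str), correct (bool), latency_ms (float), context_tokens (int).
    """
    by_category = defaultdict(list)
    for row in results:
        by_category[row["category"]].append(row)

    summary = {}
    for category, rows in sorted(by_category.items()):
        summary[category] = {
            "n": len(rows),
            # Fraction of questions in this category judged correct.
            "accuracy": sum(1 for r in rows if r["correct"]) / len(rows),
            "mean_latency_ms": mean(r["latency_ms"] for r in rows),
            "mean_context_tokens": mean(r["context_tokens"] for r in rows),
        }
    return summary


if __name__ == "__main__":
    # Made-up rows for illustration only; not real benchmark output.
    demo = [
        {"category": "temporal", "correct": True, "latency_ms": 100.0, "context_tokens": 1000},
        {"category": "temporal", "correct": False, "latency_ms": 300.0, "context_tokens": 3000},
        {"category": "multi-hop", "correct": True, "latency_ms": 200.0, "context_tokens": 2000},
    ]
    for category, stats in summarize_run(demo).items():
        print(category, stats)
```

Reporting per category rather than as one overall number keeps the knowledge-update, temporal, and multi-hop results visible, which is where step 4 expects the claim layer to show its effect.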

## Hard rules (paid for in lost hours)

- **Never share the inference gateway** between a producer (extraction) and a consumer (reader/judge): holo3.1 returns empty responses under load, which silently corrupts both.
- **`status='done'` ≠ extracted** — reconcile against `n_claims>0` and `donto_statement`.
- **Build before measuring** — no scored run until the claim layer is real and folding is live.
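
The reconciliation rule above can be sketched as a check over queue rows. This is a hedged illustration only: the row shape (`id`, `status`, `n_claims`) and the `statement_counts` mapping (item id to number of rows found in `donto_statement`) are assumptions about how the data might be loaded, not donto's real schema or API.

```python
def find_false_done(queue_rows, statement_counts):
    """Return ids of queue items marked done that have no verified claims.

    queue_rows: iterable of dicts with hypothetical keys id, status, n_claims.
    statement_counts: dict mapping item id -> number of rows found for it in
        donto_statement (assumed to be loaded by the caller).
    """
    suspect = []
    for row in queue_rows:
        if row["status"] != "done":
            continue  # only the 'done' flag is being audited here
        claimed = row.get("n_claims") or 0
        stored = statement_counts.get(row["id"], 0)
        # 'done' counts as extracted only if both sources show claims.
        if claimed <= 0 or stored <= 0:
            suspect.append(row["id"])
    return suspect


if __name__ == "__main__":
    # Made-up rows for illustration only.
    queue = [
        {"id": "a", "status": "done", "n_claims": 3},
        {"id": "b", "status": "done", "n_claims": 0},
        {"id": "c", "status": "done", "n_claims": 2},
        {"id": "d", "status": "pending", "n_claims": 0},
    ]
    stored = {"a": 3}
    print(find_false_done(queue, stored))
```

Any id this returns is an item to re-queue or inspect before a scored run, in line with the build-before-measuring rule.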

_These benchmarks are an instrument for building donto: every reading (a 55K-token context, an empty response under load, an under-built claim layer) is a steering signal for what to build next._
