BEAM-10M — run checklist

Goal: complete the full BEAM-10M run (Hindsight/Vectorize's benchmark; 10 conversations × ~13.3M tokens; ~200 questions across tiers up to 10M tokens) and score donto-memory against Hindsight's 64.1%. This is donto's hardest agent-memory stress test.

Reference (Hindsight, arXiv 2512.12818): BEAM-10M 64.1% (vs Honcho 40.6%, RAG baseline 24.9%). The benchmark itself is Tavakoli et al. (arXiv 2510.27246, ICLR 2026).

Status: ✅ done · 🔄 in progress · ⏸ paused · ⬜ not started. Last updated 2026-06-11. STATUS: PAUSED — box dedicated to the LoCoMo + LongMemEval Zep comparison.


Partial result (honest, in a deliberately under-built state)

donto scored 0.684 vs Hindsight's 0.641 on identical questions/metrics, but at ~5% claim coverage and ~4% vectors at run time. This is a floor, not a like-for-like comparison: most of BEAM had not been ingested into the claim layer when the run happened. Treat it as "donto clears the bar even mostly-empty," not a final number.

Foundation

  • Dataset — 10 conversations, slim JSON loader.
  • Firehose ingest — episodic chunks via asyncpg COPY (~36× faster than /memorize), bitemporal dates carried through.
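A minimal sketch of what a COPY-based firehose can look like, assuming asyncpg's `copy_records_to_table`. The table name `memory_chunk`, the column names, and the turn dictionary shape are assumptions made for illustration, not donto's actual schema; the two timestamps stand in for the bitemporal dates (when the turn happened vs. when it was recorded).

```python
from datetime import datetime
from typing import NamedTuple


class Chunk(NamedTuple):
    conversation_id: str
    seq: int
    text: str
    valid_at: datetime     # when the turn happened in the conversation
    recorded_at: datetime  # when this ingest wrote the row


# Hypothetical column list; the real table layout is not given in this checklist.
COLUMNS = ["conversation_id", "seq", "text", "valid_at", "recorded_at"]


def to_records(conversation_id: str, turns: list[dict], recorded_at: datetime) -> list[Chunk]:
    """Turn one conversation's turns into rows, carrying both dates through."""
    return [
        Chunk(conversation_id, seq, turn["text"], turn["timestamp"], recorded_at)
        for seq, turn in enumerate(turns)
    ]


async def firehose(dsn: str, chunks: list[Chunk]) -> None:
    """Bulk-load rows with a single COPY instead of one request per chunk."""
    import asyncpg  # imported lazily so to_records() works without the driver installed

    conn = await asyncpg.connect(dsn)
    try:
        await conn.copy_records_to_table("memory_chunk", records=chunks, columns=COLUMNS)
    finally:
        await conn.close()
```

With a reachable Postgres instance this would be driven by something like `asyncio.run(firehose(dsn, to_records(...)))`. The speedup reported above comes from the source's own measurement; this sketch only shows the shape of the approach.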

Build the layer (the long haul)

  • Chunk embeddings to 100% — ~279K of ~1.27M embedded (~22%). Paused. This is the job for the contributor fleet (donto.org/help) — the auto-enqueue fleet-starvation bug was fixed 2026-06-11, so memory_chunk work now flows to workers.
  • Claim extraction — donto-agent-beam, holo3.1, into ctx:claims/beam. ~24K chunks done of ~337K queued. Paused.
  • Reconcile — verified claims in donto_statement, no contention-empties (done ≠ extracted).
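The remaining work implied by the counts above can be made explicit with a little arithmetic. This sketch uses only the rounded figures from the checklist (~279K of ~1.27M embedded, ~24K of ~337K queued chunks extracted), so the results are approximate; the `progress` helper is introduced here for illustration.

```python
def progress(done: int, total: int) -> tuple[float, int]:
    """Return (fraction complete, items remaining)."""
    return done / total, total - done


embedded_frac, embed_left = progress(279_000, 1_270_000)
extracted_frac, extract_left = progress(24_000, 337_000)

print(f"embeddings: {embedded_frac:.1%} done, ~{embed_left:,} chunks left")
print(f"claims:     {extracted_frac:.1%} done, ~{extract_left:,} chunks left")
# embeddings: 22.0% done, ~991,000 chunks left
# claims:     7.1% done, ~313,000 chunks left
```

Note that the claim-extraction percentage is measured against the queue (~337K), not against all ~1.27M chunks.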

Folding + scoring

  • Predicate + entity embeddings for BEAM claims, then alignment closure.
  • Scored answer run — codex answer phase (~200 Q).
  • Empty-instance baseline — the no-leakage control (proves recall, not memorization).
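One way to read the empty-instance control is as a per-question comparison: the same questions are answered once against the populated instance and once against an empty one, and the difference is the part of the score attributable to recall. The sketch below assumes per-question scores in [0, 1] keyed by question id; the function name and the toy scores are hypothetical, not results from any run.

```python
from statistics import mean


def leakage_check(populated: dict[str, float], empty: dict[str, float]) -> dict[str, float]:
    """Compare mean scores on the questions both runs answered."""
    shared = sorted(populated.keys() & empty.keys())
    if not shared:
        raise ValueError("the two runs share no question ids")
    pop = mean(populated[q] for q in shared)
    base = mean(empty[q] for q in shared)
    # A high empty-instance score would suggest the answering model already
    # knows the answers, i.e. leakage or memorization rather than recall.
    return {"populated": pop, "empty_baseline": base, "lift": pop - base}


# Toy data for illustration only.
toy_populated = {"q1": 1.0, "q2": 1.0, "q3": 0.0, "q4": 1.0}
toy_empty = {"q1": 0.0, "q2": 0.0, "q3": 0.0, "q4": 1.0}
result = leakage_check(toy_populated, toy_empty)
print(result)
# {'populated': 0.75, 'empty_baseline': 0.25, 'lift': 0.5}
```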

Why paused: LoCoMo + LongMemEval is the current priority (smaller corpora, faster signal, same philosophy under test). BEAM resumes after — the firehose ingest + fixed embed fleet make a full build tractable.

source: beam.md