Single-shot vs agentic, and one-call vs multi-scope — what the memory benchmarks actually measure
Living document. 2026-06-12. Corrected after reading getzep.com/research directly. Two different axes are at play in agentic-memory evaluation — retrieval composition and reader agency — and conflating them produces unfair comparisons. This pins down what Zep's two headline numbers actually are. Companion to the shape of donto's return.
TL;DR (corrected)
Zep reports two LoCoMo numbers, and the gap between them is the whole story:
- Auto search — 86.5% — "a single API call that retrieves across every scope, applies a cross-scope rerank, and packs the result into a character-bounded context block. No scope selection, no client-side composition." Median context 2,680 tokens.
- Multi-scope — 94.7% (the headline) — "five parallel searches across facts, entities, episodes, observations, and thread summaries, composed at the client." Median context 5,760 tokens.
The crucial correction to my earlier draft: both are single-shot retrieve-then-read. The reader answers in one pass in both cases. What changes between 86.5% and 94.7% is the retrieval composition (one auto call vs five typed searches composed + reranked) and the token budget (2,680 → 5,760). So a +8.2-point jump comes purely from composing more retrieval scopes into the single bundle, at ~2× the tokens. That is not the reader becoming an agent.
This means there are two independent axes, and you must say which one you're on:
| axis | cheap end | rich end | who moves on it |
|---|---|---|---|
| A — retrieval composition | one search → one bundle | many typed searches, composed + cross-scope reranked | Zep does (86.5 → 94.7); OMEGA's rerank/time-decay is here too |
| B — reader agency | one passive read of a fixed bundle | agent loops: re-queries, follows references, fetches on demand | nobody has published a number here; every leaderboard number is a single read |
1. The two axes, precisely
Axis A — retrieval composition (how the bundle is built):
question ─▶ [1 search] ─▶ bundle ─▶ read (Zep auto-search, 86.5%)
question ─▶ [5 typed searches → compose → rerank] ─▶ bigger bundle ─▶ read (Zep multi-scope, 94.7%)
Both end in one read. The richer composition wins by getting more of the right evidence into the single bundle — at a token cost. This is where almost all "memory system" engineering lives (graph vs vector, typed scopes, cross-encoder rerank, time-decay).
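The second pipeline (typed searches → compose → rerank → pack) can be sketched as below. This is a minimal illustration under stated assumptions, not Zep's or donto's implementation: the `Hit` type, the `compose` function, the toy word-overlap scorer (standing in for a real cross-encoder), and the character budget are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    scope: str  # which typed search produced this text, e.g. "facts"
    text: str

def overlap_score(question: str, text: str) -> float:
    """Toy relevance: fraction of question words that appear in the text.
    Assumption: stands in for a real cross-encoder reranker."""
    q = set(question.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def compose(question: str, per_scope: dict[str, list[str]], budget_chars: int) -> list[Hit]:
    """Pool hits from every scope, rerank them on one scale, pack to a budget."""
    pool = [Hit(scope, text) for scope, texts in per_scope.items() for text in texts]
    ranked = sorted(pool, key=lambda h: overlap_score(question, h.text), reverse=True)
    bundle, used = [], 0
    for hit in ranked:
        if used + len(hit.text) > budget_chars:
            continue  # does not fit; a shorter, lower-ranked hit still might
        bundle.append(hit)
        used += len(hit.text)
    return bundle

# Hypothetical results of three typed searches for one question.
per_scope = {
    "facts": ["Ana moved to Lisbon in 2023", "Ana likes tea"],
    "episodes": ["Ana said she started pottery classes in Lisbon"],
    "entities": ["Lisbon: capital of Portugal"],
}
bundle = compose("where did Ana start pottery classes", per_scope, budget_chars=80)
for hit in bundle:
    print(hit.scope, "|", hit.text)
```

The point of the sketch is the ordering of steps: hits from all scopes compete on a single score before the budget is applied, so a strong hit from a later scope can displace weak hits from an earlier one.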
Axis B — reader agency (whether the reader can act):
question ─▶ [agent] ⇄ tools (search / follow-URI / aggregate-entity / expand-span) ─▶ … N turns … ─▶ answer
Here the reader is an agent — it inspects a result and decides what to fetch next, looping until it can answer. No published LoCoMo/LongMemEval number is on this axis. The fingerprint: every system (Zep included) reports a single median context figure, which only exists if the reader gets one bundle and reads once. An agentic loop has no single context size.
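To make the "no single context size" point concrete, here is a toy loop. Everything in it is an assumption for illustration: the in-memory `STORE` and `INDEX`, the `search` and `follow` tools, and the scripted policy that stands in for an LLM deciding what to fetch next. It is not donto's tool surface.

```python
# Toy memory: entity URI -> facts. All names and data are hypothetical.
STORE = {
    "ent:ana": ["Ana does pottery", "Ana runs on Sundays", "Ana plays chess"],
}
# Toy search index: keyword -> (matching fact, URI of the entity it belongs to).
INDEX = {"pottery": ("Ana does pottery", "ent:ana")}

def search(query: str):
    """Return (fact, entity_uri) for the first index keyword in the query, else None."""
    for keyword, hit in INDEX.items():
        if keyword in query.lower():
            return hit
    return None

def follow(uri: str) -> list[str]:
    """Fetch every fact attached to an entity URI."""
    return STORE.get(uri, [])

def agent_answer(question: str, max_turns: int = 4):
    """Scripted stand-in for an agent: search once, then follow the entity URI."""
    seen: list[str] = []
    context_sizes: list[int] = []  # characters of evidence held after each turn
    pending_uri = None
    for _ in range(max_turns):
        if pending_uri is None:
            hit = search(question)
            if hit is None:
                break
            fact, pending_uri = hit
            seen.append(fact)
        else:
            new = [f for f in follow(pending_uri) if f not in seen]
            if not new:
                break  # nothing left to fetch; ready to answer
            seen.extend(new)
        context_sizes.append(sum(len(f) for f in seen))
    return seen, context_sizes

evidence, sizes = agent_answer("What does Ana do besides pottery?")
print(evidence)
print(sizes)
```

The context grows turn by turn, so there is one size per turn rather than a single figure to report as a median.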
My earlier draft sloppily framed the gap between Zep's two numbers as "single-shot vs agentic." The correct statement: Zep's 94.7% is single-read, multi-scope retrieval. Agentic reading (axis B) is a separate thing that nobody on the board does.
2. The numbers, attributed
| system | benchmark | accuracy | reader | what produced it |
|---|---|---|---|---|
| Zep — auto search | LoCoMo | 86.5% | gpt-5.4 | one API call, cross-scope rerank, 2,680 tok bundle, one read |
| Zep — multi-scope | LoCoMo | 94.7% (1,459/1,540) | gpt-5.4 | 5 typed searches composed + rerank, 5,760 tok, one read |
| Zep | LongMemEval | 90.2% (451/500) | gpt-5.4 | multi-scope, 4,408 tok, one read |
| OMEGA | LongMemEval | 95.4% | GPT-4.1 | bge-small+FTS + cross-encoder rerank + time-decay, one read |
| HydraDB | LongMemEval | 90.79% | Gemini 3.0 Pro | contextual-chunked retrieval, one read |
| paper baseline | LongMemEval | 60–64% | GPT-4o | full context, one read |
(Zep rows: reader = gpt-5.4 reasoning-medium; judge = gpt-5.4 CoT; per getzep.com/research, fetched 2026-06-12.)
The headline numbers are strong-reader, single-read, multi-scope-retrieval. The reader is doing a lot of the work (the score conflates reader × memory — see the return-shape report); the memory system's contribution is the +8.2 points from auto-search → multi-scope.
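A quick check of the arithmetic behind these figures, using only the counts and median token figures reported in the table. Nothing here is new data; the variable names are mine.

```python
# Counts and medians as reported in the table above.
locomo_multi = 1459 / 1540   # Zep multi-scope, LoCoMo
longmemeval = 451 / 500      # Zep, LongMemEval
gap_points = 94.7 - 86.5     # multi-scope minus auto search, in percentage points
token_ratio = 5760 / 2680    # median context, multi-scope over auto search

print(f"LoCoMo multi-scope: {locomo_multi:.1%}")
print(f"LongMemEval:        {longmemeval:.1%}")
print(f"Gap:                {gap_points:.1f} points")
print(f"Token ratio:        {token_ratio:.2f}x")
```

The token ratio is what the "~2×" in the TL;DR refers to.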
3. Where donto actually sits — and which Zep number is the fair target
This corrects our comparison baseline:
- donto's runs are single-read (one formatted bundle → one holo3.1 read → one judge). So we're on axis B's cheap end, like everyone.
- donto's retrieval has been single-scope so far — we test one signal at a time (episodic-only, or claims-only, or turns-only). That is structurally closest to Zep's auto search (86.5%) regime, not the multi-scope 94.7% headline. Comparing our single-scope numbers to Zep's 94.7% was the wrong target; the apples-to-apples cheap-retrieval target is ~86.5% (and Zep gets there with a gpt-5.4 reader; we're on free holo3.1).
- We have not yet built Zep-style multi-scope composition. Our one attempt to combine scopes (turns + claims) was a naive concatenation and it hurt (0.625 → 0.583), precisely because we lacked the cross-scope rerank that Zep credits for making composition work. That's the missing piece, and it's a concrete, corpus-agnostic thing to build: retrieve several typed donto scopes (passages, claims, entities, valid_time-resolved facts), rerank across them, pack to a budget.
So the corrected program for the leaderboard (single-read) track:
- Build multi-scope retrieval with cross-scope rerank — the actual mechanism behind 94.7%, which we'd been missing. Compose donto's typed scopes (passage / claim / entity-aggregate / temporal) and rerank into one budgeted bundle.
- Compare honestly: single-scope donto ≈ Zep auto-search regime; multi-scope+rerank donto ≈ Zep multi-scope regime, keeping reader strength in mind (our free holo3.1 reader vs their gpt-5.4 caps the absolute number).
4. The agentic track (axis B) — still the more general, more donto-native path
Independently of axis A, the agentic track remains the direction that could exceed both Zep numbers, because it adds a capability nobody on the board has: the reader fetches what it's missing.
- It is the natural fix for donto's hardest category, multi-hop questions that are really aggregation ("what activities does X do?", with the answers scattered across many turns): an agent sees one fact, follows the entity URI, and fetches the rest. A single-read system (with any scope composition) must pre-bake that aggregation into the bundle; an agent does not need to.
- donto is built to be navigated (stable IRIs, evidence links, donto_argument contradiction edges, bitemporal valid_time) and already exposes the MCP tools (donto_recall / donto_search / donto_memorize).
- But its numbers are not leaderboard-comparable. An agentic result must be reported in its own column, never beside single-read numbers.
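One way to enforce the "own column" rule is to tag every reported result with its position on both axes and refuse to compare silently across them. The sketch below is hypothetical: the `Result` record, the enum names, and the `caveats` helper are mine, and the agentic entry is a placeholder with no measured accuracy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Retrieval(Enum):  # axis A
    ONE_CALL = "one-call"
    MULTI_SCOPE = "multi-scope"

class Agency(Enum):     # axis B
    SINGLE_READ = "single-read"
    AGENTIC = "agentic"

@dataclass(frozen=True)
class Result:
    system: str
    benchmark: str
    accuracy: Optional[float]  # None = not yet measured
    retrieval: Retrieval
    agency: Agency
    reader_model: str

def caveats(a: Result, b: Result) -> list[str]:
    """List every reason two results are not directly comparable."""
    notes = []
    if a.benchmark != b.benchmark:
        notes.append("different benchmark")
    if a.agency != b.agency:
        notes.append("different reader agency: report in separate columns")
    if a.retrieval != b.retrieval:
        notes.append("different retrieval composition")
    if a.reader_model != b.reader_model:
        notes.append("different reader model")
    return notes

zep_auto = Result("Zep auto search", "LoCoMo", 0.865,
                  Retrieval.ONE_CALL, Agency.SINGLE_READ, "gpt-5.4")
zep_multi = Result("Zep multi-scope", "LoCoMo", 0.947,
                   Retrieval.MULTI_SCOPE, Agency.SINGLE_READ, "gpt-5.4")
# Placeholder for a future agentic run; no number exists yet.
donto_agentic = Result("donto agentic", "LoCoMo", None,
                       Retrieval.MULTI_SCOPE, Agency.AGENTIC, "holo3.1")

print(caveats(zep_auto, zep_multi))
print(caveats(zep_multi, donto_agentic))
```

An empty list means the two numbers can share a column; anything else should be stated next to the comparison.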
5. The corrected bottom line
- Zep 94.7% = single-read + multi-scope retrieval (5 composed searches + rerank, 5,760 tok, gpt-5.4). The plain single-call version is 86.5% (2,680 tok). The +8.2 points is retrieval composition, not reader agency.
- Two axes, never conflate them: retrieval composition (one call vs many composed+reranked) and reader agency (one read vs agent loop). Zep moves on the first; the whole leaderboard sits at the cheap end of the second.
- donto's fair single-read targets: single-scope ≈ auto-search regime (~86.5%); to chase 94.7% we must build multi-scope composition with cross-scope rerank — the piece our naive fusion lacked.
- The agentic track is donto's most general and most native bet, and the real fix for aggregation — but it is a separate, non-comparable claim.
See also: the shape of donto's return · LoCoMo Config C · all benchmarks.