Single-shot vs agentic, and one-call vs multi-scope — what the memory benchmarks actually measure

Living document. 2026-06-12. Corrected after reading getzep.com/research directly. Two different axes are at play in agentic-memory evaluation — retrieval composition and reader agency — and conflating them produces unfair comparisons. This pins down what Zep's two headline numbers actually are. Companion to the shape of donto's return.

TL;DR (corrected)

Zep reports two LoCoMo numbers, and the gap between them is the whole story:

Auto search — 86.5% — "a single API call that retrieves across every scope, applies a cross-scope rerank, and packs the result into a character-bounded context block. No scope selection, no client-side composition." Median context 2,680 tokens.
Multi-scope — 94.7% (the headline) — "five parallel searches across facts, entities, episodes, observations, and thread summaries, composed at the client." Median context 5,760 tokens.

The crucial correction to my earlier draft: both are single-shot retrieve-then-read. The reader answers in one pass in both cases, and it is the same gpt-5.4 reader. What changes between 86.5% and 94.7% is the retrieval composition (one auto call vs five typed searches, composed and reranked) and the token budget (2,680 → 5,760). So the +8.2-point jump comes entirely from composing more retrieval scopes into the single bundle, at ~2.1× the tokens. That is not the reader becoming an agent.
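The arithmetic behind that gap can be checked in a few lines. This is only a worked calculation on the four figures quoted above; the variable names are illustrative.

```python
# Figures as quoted above from getzep.com/research (LoCoMo, same reader in both runs).
auto = {"accuracy_pct": 86.5, "median_tokens": 2680}
multi = {"accuracy_pct": 94.7, "median_tokens": 5760}

# Accuracy gain in percentage points, rounded to avoid float noise.
gain_points = round(multi["accuracy_pct"] - auto["accuracy_pct"], 1)

# How much larger the multi-scope bundle is, as a ratio and in absolute tokens.
token_ratio = multi["median_tokens"] / auto["median_tokens"]
extra_tokens = multi["median_tokens"] - auto["median_tokens"]

print(f"gain: +{gain_points} points")       # gain: +8.2 points
print(f"token ratio: {token_ratio:.2f}x")   # token ratio: 2.15x
print(f"extra tokens: {extra_tokens}")      # extra tokens: 3080
```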

This means there are two independent axes, and you must say which one you're on:

axis	cheap end	rich end	who moves on it
A — retrieval composition	one search → one bundle	many typed searches, composed + cross-scope reranked	Zep does (86.5 → 94.7); OMEGA's rerank/time-decay is here too
B — reader agency	one passive read of a fixed bundle	agent loops: re-queries, follows references, fetches on demand	nobody published — every leaderboard number is a single read

1. The two axes, precisely

Axis A — retrieval composition (how the bundle is built):

question ─▶ [1 search]                         ─▶ bundle ─▶ read     (Zep auto-search, 86.5%)
question ─▶ [5 typed searches → compose → rerank] ─▶ bigger bundle ─▶ read   (Zep multi-scope, 94.7%)

Both end in one read. The richer composition wins by getting more of the right evidence into the single bundle — at a token cost. This is where almost all "memory system" engineering lives (graph vs vector, typed scopes, cross-encoder rerank, time-decay).
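A minimal sketch of the second pipeline, assuming a toy in-memory store and a word-overlap scorer in place of a real memory API. The scope names follow the five that Zep lists; `STORE`, `search_scope`, and `multi_scope_bundle` are hypothetical names, and the global sort is only a stand-in for a real cross-scope reranker.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy store: one list of text items per scope.
STORE = {
    "facts": ["Ana moved to Lisbon in 2023", "Ana works as a nurse"],
    "entities": ["Ana: person, lives in Lisbon"],
    "episodes": ["Ana said she started pottery classes"],
    "observations": ["Ana often mentions hiking on weekends"],
    "thread_summaries": ["Thread 7: Ana discusses hobbies and her move"],
}

def search_scope(scope, query, k=2):
    """Toy lexical search: score items in one scope by query-word overlap."""
    query_words = set(query.lower().split())
    hits = []
    for text in STORE[scope]:
        item_words = set(text.lower().replace(":", " ").replace(",", " ").split())
        score = len(query_words & item_words)
        if score > 0:
            hits.append({"scope": scope, "text": text, "score": score})
    hits.sort(key=lambda h: h["score"], reverse=True)
    return hits[:k]

def multi_scope_bundle(query, k=2):
    """Fan out one search per scope in parallel, then compose and rerank."""
    scopes = list(STORE)
    with ThreadPoolExecutor(max_workers=len(scopes)) as pool:
        per_scope = list(pool.map(lambda s: search_scope(s, query, k), scopes))
    merged = [hit for hits in per_scope for hit in hits]
    # Stand-in for a cross-scope rerank: one global sort over all scopes.
    merged.sort(key=lambda h: h["score"], reverse=True)
    return merged

bundle = multi_scope_bundle("where does ana live and what hobbies")
for hit in bundle:
    print(hit["scope"], "|", hit["text"])
```

The reader still sees one bundle and reads once; only the way the bundle is assembled differs from the single-search path.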

Axis B — reader agency (whether the reader can act):

question ─▶ [agent] ⇄ tools (search / follow-URI / aggregate-entity / expand-span) ─▶ … N turns … ─▶ answer

Here the reader is an agent — it inspects a result and decides what to fetch next, looping until it can answer. No published LoCoMo/LongMemEval number is on this axis. The fingerprint: every system (Zep included) reports a single median context figure, which only exists if the reader gets one bundle and reads once. An agentic loop has no single context size.
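The loop can be sketched as follows, assuming a scripted policy in place of an LLM reader and a toy memory in place of a real store. `tool_search`, `tool_follow_entity`, and `scripted_policy` are hypothetical names, not donto or Zep APIs. The point of the sketch is the last return value: context size is a per-turn sequence, not one figure.

```python
# Hypothetical toy memory: turns by id, plus an entity index over them.
TURNS = {
    "t1": "Sam went climbing on Saturday",
    "t2": "Sam bakes bread every Sunday",
    "t3": "Sam joined a chess club",
    "t4": "Lee bought a new bike",
}
ENTITY_INDEX = {"Sam": ["t1", "t2", "t3"], "Lee": ["t4"]}

def tool_search(query):
    """Return the first turn id whose text shares a word with the query."""
    query_words = set(query.lower().split())
    for turn_id, text in TURNS.items():
        if query_words & set(text.lower().split()):
            return [turn_id]
    return []

def tool_follow_entity(name):
    """Return every turn id linked to an entity."""
    return ENTITY_INDEX.get(name, [])

def scripted_policy(question, seen):
    """Stand-in for an LLM reader deciding what to do next."""
    if not seen:
        return ("search", question)
    entity = TURNS[seen[0]].split()[0]  # toy rule: the first word is the entity
    if set(ENTITY_INDEX.get(entity, [])) - set(seen):
        return ("follow_entity", entity)
    return ("answer", None)

def run_agent(question, max_turns=5):
    seen, context_sizes = [], []
    for _ in range(max_turns):
        action, arg = scripted_policy(question, seen)
        if action == "answer":
            break
        found = tool_search(arg) if action == "search" else tool_follow_entity(arg)
        seen += [t for t in found if t not in seen]
        # Context size in words after this turn; it changes every turn.
        context_sizes.append(sum(len(TURNS[t].split()) for t in seen))
    return [TURNS[t] for t in seen], context_sizes

evidence, sizes = run_agent("what activities does sam do")
print(evidence)
print(sizes)
```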

My earlier draft sloppily framed the gap between Zep's two numbers as "single-shot vs agentic." Correct statement: Zep's 94.7% is single-read, multi-scope retrieval. Agentic (axis B) is a separate thing nobody on the board does.

2. The numbers, attributed

system	benchmark	accuracy	reader	what produced it
Zep — auto search	LoCoMo	86.5%	gpt-5.4	one API call, cross-scope rerank, 2,680 tok bundle, one read
Zep — multi-scope	LoCoMo	94.7% (1,459/1,540)	gpt-5.4	5 typed searches composed + rerank, 5,760 tok, one read
Zep	LongMemEval	90.2% (451/500)	gpt-5.4	multi-scope, 4,408 tok, one read
OMEGA	LongMemEval	95.4%	GPT-4.1	bge-small+FTS + cross-encoder rerank + time-decay, one read
HydraDB	LongMemEval	90.79%	Gemini 3.0 Pro	contextual-chunked retrieval, one read
paper baseline	LongMemEval	60–64%	GPT-4o	full context, one read

(Zep rows: reader = gpt-5.4 reasoning-medium; judge = gpt-5.4 CoT; per getzep.com/research, fetched 2026-06-12.)
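As a quick consistency check, the two Zep rows that give raw counts can be recomputed. This only redoes the division on the counts in the table; the dictionary keys are illustrative labels.

```python
# (correct, total) counts as given in the table above.
counts = {
    "Zep multi-scope / LoCoMo": (1459, 1540),
    "Zep / LongMemEval": (451, 500),
}

# Accuracy in percent, rounded to one decimal as the table reports it.
accuracy = {name: round(100 * correct / total, 1)
            for name, (correct, total) in counts.items()}

for name, pct in accuracy.items():
    print(f"{name}: {pct}%")
# Zep multi-scope / LoCoMo: 94.7%
# Zep / LongMemEval: 90.2%
```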

The headline numbers are strong-reader, single-read, multi-scope-retrieval. The reader is doing a lot of the work: the score conflates reader × memory (see the return-shape report). The one contribution these numbers isolate is retrieval composition, the +8.2 points from auto-search → multi-scope, because the reader is gpt-5.4 in both runs.

3. Where donto actually sits — and which Zep number is the fair target

This corrects our comparison baseline:

donto's runs are single-read (one formatted bundle → one holo3.1 read → one judge). So we're on axis B's cheap end, like everyone.
donto's retrieval has been single-scope so far: we test one signal at a time (episodic-only, or claims-only, or turns-only). That is structurally closest to Zep's auto search (86.5%) regime, not the multi-scope 94.7% headline. Comparing our single-scope numbers to Zep's 94.7% was the wrong target; the apples-to-apples cheap-retrieval target is ~86.5% (and Zep gets there with a gpt-5.4 reader; we're on free holo3.1). Even that is not an exact match: per Zep's description, auto search still retrieves across every scope and applies a cross-scope rerank inside its one call, which our single-scope runs do not.
We have not yet built Zep-style multi-scope composition. Our one attempt to combine scopes (turns + claims) was a naive concatenation and it hurt (0.625 → 0.583) — precisely because we lacked the cross-scope rerank that Zep credits for making composition work. That's the missing piece, and it's a concrete, corpus-agnostic thing to build: retrieve several typed donto scopes (passages, claims, entities, valid_time-resolved facts), rerank across them, pack to a budget.
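One plausible way naive concatenation can hurt under a fixed budget is sketched below. This is an illustration, not a reproduction of the 0.625 → 0.583 run: the hits, their relevance scores, and the word-count budget are all made up, and `pack` is a hypothetical helper.

```python
def pack(items, budget_words):
    """Add items in the given order until the next one would exceed the budget."""
    bundle, used = [], 0
    for item in items:
        cost = len(item["text"].split())  # crude token proxy: word count
        if used + cost > budget_words:
            break
        bundle.append(item)
        used += cost
    return bundle

# Hypothetical hits from two scopes, each already ranked within its own scope.
turns = [
    {"scope": "turns", "text": "Kim said she adopted a dog named Rex", "score": 0.9},
    {"scope": "turns", "text": "Kim said the weather was nice today", "score": 0.2},
    {"scope": "turns", "text": "Kim said hello and asked about lunch", "score": 0.1},
]
claims = [
    {"scope": "claims", "text": "Kim owns a dog named Rex", "score": 0.8},
    {"scope": "claims", "text": "Kim adopted Rex in March 2024", "score": 0.7},
]

# Naive fusion: concatenate scope by scope, then cut at the budget.
naive = pack(turns + claims, budget_words=20)

# Cross-scope rerank: one global ordering by relevance, then cut at the budget.
ranked = sorted(turns + claims, key=lambda h: h["score"], reverse=True)
reranked = pack(ranked, budget_words=20)

print([h["scope"] for h in naive])     # ['turns', 'turns']
print([h["scope"] for h in reranked])  # ['turns', 'claims', 'claims']
```

In the naive bundle, low-relevance turns use up the budget before any claim gets in; the reranked bundle spends the same budget on the three strongest items across both scopes.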

So the corrected program for the leaderboard (single-read) track:

Build multi-scope retrieval with cross-scope rerank — the actual mechanism behind 94.7%, which we'd been missing. Compose donto's typed scopes (passage / claim / entity-aggregate / temporal) and rerank into one budgeted bundle.
Compare honestly: single-scope donto ≈ Zep auto-search regime; multi-scope+rerank donto ≈ Zep multi-scope regime. Keep reader strength in mind throughout: our free holo3.1 vs their gpt-5.4 caps the absolute number.

4. The agentic track (axis B) — still the more general, more donto-native path

Independently of axis A, the agentic track remains the direction that could exceed both Zep numbers, because it adds a capability nobody on the board has: the reader fetches what it's missing.

It is the natural fix for donto's hardest category, multi-hop questions that are really aggregation ("what activities does X do?", with the answers scattered across many turns): an agent sees one fact, follows the entity URI, and fetches the rest. Single-read retrieval (any scope) must pre-bake that aggregation into the bundle; an agent doesn't.
donto is built to be navigated — stable IRIs, evidence links, donto_argument contradiction edges, bitemporal valid_time — and already exposes the MCP tools (donto_recall / donto_search / donto_memorize).
But its numbers are not leaderboard-comparable. An agentic result must be reported in its own column, never beside single-read numbers.
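The reporting rule can be made mechanical by tagging every result with both axes and grouping on the reader axis. A minimal sketch, assuming a hypothetical `Result` record; the two Zep rows use the figures quoted in this document, and the agentic row carries no score because none exists yet.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Result:
    system: str
    benchmark: str
    accuracy_pct: Optional[float]  # None when no number has been measured
    retrieval: str                 # axis A: e.g. "auto", "multi-scope", "on-demand"
    reader: str                    # axis B: "single-read" or "agentic"

def comparable(a, b):
    """Two results share a column only if benchmark and reader regime match."""
    return a.benchmark == b.benchmark and a.reader == b.reader

def leaderboard_columns(results):
    """Group results into columns keyed by (benchmark, reader regime)."""
    columns = {}
    for r in results:
        columns.setdefault((r.benchmark, r.reader), []).append(r)
    return columns

zep_auto = Result("Zep auto search", "LoCoMo", 86.5, "auto", "single-read")
zep_multi = Result("Zep multi-scope", "LoCoMo", 94.7, "multi-scope", "single-read")
# Placeholder for a future agentic run: deliberately has no accuracy.
donto_agentic = Result("donto agentic (not yet run)", "LoCoMo", None, "on-demand", "agentic")

columns = leaderboard_columns([zep_auto, zep_multi, donto_agentic])
for key, rows in columns.items():
    print(key, [r.system for r in rows])
```

Axis A is recorded but does not split columns, since both Zep numbers are legitimately single-read; axis B does, so an agentic score can never land beside them.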

5. The corrected bottom line

Zep 94.7% = single-read + multi-scope retrieval (5 composed searches + rerank, 5,760 tok, gpt-5.4). The plain single-call version is 86.5% (2,680 tok). The +8.2 points is retrieval composition, not reader agency.
Two axes, never conflate them: retrieval composition (one call vs many composed+reranked) and reader agency (one read vs agent loop). Zep moves on the first; the whole leaderboard sits at the cheap end of the second.
donto's fair single-read targets: single-scope ≈ auto-search regime (~86.5%); to chase 94.7% we must build multi-scope composition with cross-scope rerank — the piece our naive fusion lacked.
The agentic track is donto's most general and most native bet, and the real fix for aggregation — but it is a separate, non-comparable claim.