# Single-shot vs agentic, and one-call vs multi-scope — what the memory benchmarks actually measure

_Living document. 2026-06-12. Corrected after reading [getzep.com/research](https://www.getzep.com/research/) directly. Two **different axes** are at play in agentic-memory evaluation — retrieval composition and reader agency — and conflating them produces unfair comparisons. This pins down what Zep's two headline numbers actually are. Companion to [the shape of donto's return](/reports/donto-memory-return-shape)._

## TL;DR (corrected)

Zep reports **two LoCoMo numbers**, and the gap between them is the whole story:

- **Auto search — 86.5%** — *"a single API call that retrieves across every scope, applies a cross-scope rerank, and packs the result into a character-bounded context block. No scope selection, no client-side composition."* Median context **2,680 tokens**.
- **Multi-scope — 94.7%** (the headline) — *"five parallel searches across facts, entities, episodes, observations, and thread summaries, composed at the client."* Median context **5,760 tokens**.

The crucial correction to my earlier draft: **both are single-shot retrieve-then-*read***. The *reader* answers in one pass in both cases. What changes between 86.5% and 94.7% is the **retrieval composition** (one auto call vs five typed searches composed + reranked) and the **token budget** (2,680 → 5,760 median tokens). So a **+8.2-point** jump comes purely from *composing more retrieval scopes into the single bundle*, at ~2.1× the tokens. That is **not** the reader becoming an agent.
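
The arithmetic behind that gap, as a quick check. The figures are the ones quoted above; the "points per 1k extra tokens" ratio is only an illustrative way to express the cost, not a metric Zep reports.

```python
# Figures as quoted above for the two Zep LoCoMo configurations.
auto = {"accuracy_pct": 86.5, "median_tokens": 2680}
multi = {"accuracy_pct": 94.7, "median_tokens": 5760}

# Accuracy gain in percentage points (rounded to hide float noise).
delta_points = round(multi["accuracy_pct"] - auto["accuracy_pct"], 1)

# How much bigger the multi-scope bundle is.
token_ratio = multi["median_tokens"] / auto["median_tokens"]
extra_tokens = multi["median_tokens"] - auto["median_tokens"]

# Illustrative cost figure (an assumption of this sketch, not a published metric).
points_per_1k_extra = delta_points / (extra_tokens / 1000)

print(f"accuracy gain: +{delta_points} points")
print(f"token ratio: {token_ratio:.2f}x ({extra_tokens} extra tokens)")
print(f"gain per 1k extra tokens: {points_per_1k_extra:.2f} points")
```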

This means there are two independent axes, and you must say which one you're on:

| axis | cheap end | rich end | who moves on it |
|---|---|---|---|
| **A — retrieval composition** | one search → one bundle | many typed searches, composed + cross-scope reranked | **Zep does (86.5 → 94.7)**; OMEGA's rerank/time-decay is here too |
| **B — reader agency** | one passive read of a fixed bundle | agent loops: re-queries, follows references, fetches on demand | **nobody published** — every leaderboard number is a single read |

---

## 1. The two axes, precisely

**Axis A — retrieval composition** (how the bundle is built):
```
question ─▶ [1 search]                         ─▶ bundle ─▶ read     (Zep auto-search, 86.5%)
question ─▶ [5 typed searches → compose → rerank] ─▶ bigger bundle ─▶ read   (Zep multi-scope, 94.7%)
```
Both end in **one read**. The richer composition wins by getting more of the right evidence into the single bundle — at a token cost. This is where almost all "memory system" engineering lives (graph vs vector, typed scopes, cross-encoder rerank, time-decay).
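
A minimal sketch of the second pipeline (fan out, cross-scope rerank, pack to a budget). Everything here is hypothetical: `Hit`, `compose`, and the word-overlap scorer are illustrative names, the scorer is a toy stand-in for a real cross-encoder, and the searches run sequentially rather than in parallel. It is not Zep's implementation, only the general shape the description implies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hit:
    scope: str    # e.g. "facts", "entities", "episodes"
    text: str
    score: float  # scope-local score; NOT comparable across scopes


Searcher = Callable[[str], list[Hit]]


def overlap_score(question: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of question words found in text."""
    q = set(question.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def compose(question: str, searchers: dict[str, Searcher], budget_chars: int,
            rerank: Callable[[str, str], float] = overlap_score) -> str:
    # 1. Fan out: one typed search per scope.
    candidates = [hit for search in searchers.values() for hit in search(question)]
    # 2. Cross-scope rerank: rescore every candidate with ONE scorer so that
    #    hits from different scopes become comparable.
    ranked = sorted(candidates, key=lambda h: rerank(question, h.text), reverse=True)
    # 3. Pack best-first into a character budget, skipping hits that do not fit.
    lines, used = [], 0
    for hit in ranked:
        line = f"[{hit.scope}] {hit.text}"
        if used + len(line) + 1 > budget_chars:
            continue
        lines.append(line)
        used += len(line) + 1
    return "\n".join(lines)


# Stub searchers with made-up data, standing in for real typed searches.
SEARCHERS: dict[str, Searcher] = {
    "facts": lambda q: [Hit("facts", "Alice moved to Lisbon in 2023", 0.90)],
    "episodes": lambda q: [Hit("episodes", "Alice: the weather was nice", 0.95)],
}

bundle = compose("where did Alice move", SEARCHERS, budget_chars=200)
print(bundle)
```

Note that the episode hit has the higher scope-local score but ranks second after the shared rescoring; that reordering is the whole point of reranking across scopes before packing.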

**Axis B — reader agency** (whether the reader can act):
```
question ─▶ [agent] ⇄ tools (search / follow-URI / aggregate-entity / expand-span) ─▶ … N turns … ─▶ answer
```
Here the reader is an **agent** — it inspects a result and decides what to fetch next, looping until it can answer. **No published LoCoMo/LongMemEval number is on this axis.** The fingerprint: every system (Zep included) reports a single **median context** figure, which only exists if the reader gets one bundle and reads once. An agentic loop has no single context size.
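
To make the contrast concrete, here is a toy sketch of such a loop. The tool names (`search`, `follow_uri`), the `ex:alice` identifier, and the scripted policy are all hypothetical; a real agent would replace `scripted_policy` with a model call. The sketch records the evidence size at each decision point, which shows why no single context figure describes an agentic run.

```python
from typing import Callable, Optional

Tool = Callable[[str], list[str]]
# A policy returns ("call", tool_name, argument) or ("answer", text).
Policy = Callable[[str, list[str]], tuple]


def run_agent(question: str, tools: dict[str, Tool], policy: Policy,
              max_turns: int = 5) -> tuple[Optional[str], list[int]]:
    evidence: list[str] = []
    context_sizes: list[int] = []  # characters of evidence seen at each decision
    for _ in range(max_turns):
        context_sizes.append(sum(len(e) for e in evidence))
        action = policy(question, evidence)
        if action[0] == "answer":
            return action[1], context_sizes
        _, tool_name, arg = action
        evidence.extend(tools[tool_name](arg))
    return None, context_sizes  # ran out of turns without answering


# Made-up data: one search hit that points at an entity with more facts.
ENTITY_FACTS = {"ex:alice": ["Alice goes hiking.", "Alice plays chess.", "Alice paints."]}
TOOLS: dict[str, Tool] = {
    "search": lambda q: ["Alice goes hiking. (entity: ex:alice)"],
    "follow_uri": lambda uri: ENTITY_FACTS.get(uri, []),
}


def scripted_policy(question: str, evidence: list[str]) -> tuple:
    """Stand-in for an LLM: search first, then follow the entity, then answer."""
    if not evidence:
        return ("call", "search", question)
    if len(evidence) == 1:
        return ("call", "follow_uri", "ex:alice")
    return ("answer", " ".join(evidence[1:]))


answer, sizes = run_agent("What activities does Alice do?", TOOLS, scripted_policy)
print(answer)
print(sizes)  # one entry per decision, growing as the agent fetches more
```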

My earlier draft sloppily framed Zep's 94.7% as "single-shot vs agentic." Correct statement: **Zep's 94.7% is single-*read*, multi-*scope* retrieval.** Agentic reading (axis B) is a separate capability that no system on the board reports.

---

## 2. The numbers, attributed

| system | benchmark | accuracy | reader | what produced it |
|---|---|---|---|---|
| **Zep — auto search** | LoCoMo | **86.5%** | gpt-5.4 | one API call, cross-scope rerank, **2,680 tok** bundle, one read |
| **Zep — multi-scope** | LoCoMo | **94.7%** (1,459/1,540) | gpt-5.4 | **5 typed searches composed** + rerank, **5,760 tok**, one read |
| Zep | LongMemEval | 90.2% (451/500) | gpt-5.4 | multi-scope, **4,408 tok**, one read |
| OMEGA | LongMemEval | 95.4% | GPT-4.1 | bge-small+FTS + cross-encoder rerank + time-decay, one read |
| HydraDB | LongMemEval | 90.79% | Gemini 3.0 Pro | contextual-chunked retrieval, one read |
| paper baseline | LongMemEval | 60–64% | GPT-4o | full context, one read |

_(Zep rows: reader = gpt-5.4 reasoning-medium; judge = gpt-5.4 CoT; per getzep.com/research, fetched 2026-06-12. Other rows use the readers listed in the table.)_
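
A quick consistency check of the two rows that give raw counts. This only recomputes the percentages from the counts in the table; it says nothing about how the judging was done.

```python
# (label, correct, total, reported accuracy in %) copied from the table above.
reported = [
    ("Zep multi-scope, LoCoMo", 1459, 1540, 94.7),
    ("Zep, LongMemEval", 451, 500, 90.2),
]


def pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to one decimal like the table."""
    return round(100 * correct / total, 1)


for label, correct, total, claimed in reported:
    computed = pct(correct, total)
    status = "matches" if computed == claimed else "differs from"
    missed = total - correct
    print(f"{label}: {computed}% {status} reported {claimed}% ({missed} missed)")
```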

The headline numbers are **strong-reader, single-read, multi-scope-retrieval** results. The reader is doing a lot of the work (the score conflates reader × memory; see the [return-shape report](/reports/donto-memory-return-shape)); the part attributable to retrieval composition is the +8.2 points from auto-search → multi-scope.

---

## 3. Where donto actually sits — and which Zep number is the fair target

This corrects our comparison baseline:

- **donto's runs are single-read** (one formatted bundle → one holo3.1 read → one judge). So we're on **axis B's cheap end**, like everyone.
- **donto's retrieval has been *single-scope*** so far — we test *one* signal at a time (episodic-only, or claims-only, or turns-only). That is structurally closest to Zep's **auto search (86.5%)** regime, **not** the multi-scope 94.7% headline. Comparing our single-scope numbers to Zep's *94.7%* was the wrong target; the apples-to-apples cheap-retrieval target is **~86.5%** (and Zep gets there with a gpt-5.4 reader; we're on free holo3.1).
- **We have not yet built Zep-style multi-scope composition.** Our one attempt to combine scopes (turns + claims) was a *naive concatenation* and it **hurt** (0.625 → 0.583) — precisely because we lacked the **cross-scope rerank** that Zep credits for making composition work. That's the missing piece, and it's a concrete, corpus-agnostic thing to build: retrieve several typed donto scopes (passages, claims, entities, `valid_time`-resolved facts), **rerank across them**, pack to a budget.

So the corrected program for the **leaderboard (single-read) track**:
1. Build **multi-scope retrieval with cross-scope rerank** — the actual mechanism behind 94.7%, which we'd been missing. Compose donto's typed scopes (passage / claim / entity-aggregate / temporal) and rerank into one budgeted bundle.
2. Compare honestly: single-scope donto ≈ Zep auto-search regime; multi-scope+rerank donto ≈ Zep multi-scope regime, **keeping reader strength in mind** (our free holo3.1 vs their gpt-5.4 caps the absolute number).
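
One way naive concatenation can hurt under a fixed budget, as a toy illustration. This is not a diagnosis of the actual 0.625 → 0.583 run: the data is made up, `relevance` is a word-overlap stand-in for a real reranker, and the function names are hypothetical. It only shows the mechanism by which scope-ordered packing can crowd out the one item that matters.

```python
def relevance(question: str, text: str) -> int:
    """Toy scorer: number of question words that appear in the text."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def pack(items: list[str], budget_chars: int) -> list[str]:
    """Take items in order until the next one would exceed the budget."""
    out, used = [], 0
    for item in items:
        if used + len(item) > budget_chars:
            break
        out.append(item)
        used += len(item)
    return out


def naive_concat(turns: list[str], claims: list[str], budget_chars: int) -> list[str]:
    # Scope order decides what survives: all turns first, claims only if room remains.
    return pack(turns + claims, budget_chars)


def rerank_then_pack(question: str, turns: list[str], claims: list[str],
                     budget_chars: int) -> list[str]:
    # One shared score across both scopes decides what survives.
    merged = sorted(turns + claims, key=lambda t: relevance(question, t), reverse=True)
    return pack(merged, budget_chars)


question = "where does alice work now"
turns = ["bob: nice weather today", "alice: i had pasta for lunch"]
claims = ["alice began to work at the observatory"]

print(naive_concat(turns, claims, budget_chars=60))
print(rerank_then_pack(question, turns, claims, budget_chars=60))
```

With this data the naive bundle fills up with the two irrelevant turns and drops the claim, while the reranked bundle keeps the claim and drops the turns.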

---

## 4. The agentic track (axis B) — still the more general, more donto-native path

Independently of axis A, the **agentic** track remains the direction that could exceed *both* Zep numbers, because it adds a capability nobody on the board has: the reader **fetches what it's missing**.

- It is the natural fix for donto's hardest category — **multi-hop = aggregation** ("what activities does X do?", scattered across many turns): an agent sees one fact, follows the entity URI, fetches them all. Single-read (any scope) must pre-bake that aggregation into the bundle; an agent doesn't.
- donto is built to be navigated — stable IRIs, evidence links, `donto_argument` contradiction edges, bitemporal `valid_time` — and already exposes the MCP tools (`donto_recall` / `donto_search` / `donto_memorize`).
- **But its numbers are not leaderboard-comparable.** An agentic result must be reported in its own column, never beside single-read numbers.
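
The reporting rule in the last bullet can be enforced mechanically. The sketch below is one hypothetical way to do it: `Track`, `Result`, `comparable`, and `leaderboard` are illustrative names, not existing donto code, and the agentic row carries a placeholder value rather than a measurement.

```python
from dataclasses import dataclass
from enum import Enum


class Track(Enum):
    SINGLE_READ = "single-read"
    AGENTIC = "agentic"


@dataclass(frozen=True)
class Result:
    system: str
    benchmark: str
    accuracy_pct: float
    reader: str
    track: Track


def comparable(a: Result, b: Result) -> bool:
    """Two results may share a column only on the same benchmark AND track."""
    return a.benchmark == b.benchmark and a.track is b.track


def leaderboard(results: list[Result], benchmark: str, track: Track) -> list[Result]:
    """Rows for one benchmark and one track only, best first."""
    rows = [r for r in results if r.benchmark == benchmark and r.track is track]
    return sorted(rows, key=lambda r: r.accuracy_pct, reverse=True)


zep_auto = Result("Zep auto search", "LoCoMo", 86.5, "gpt-5.4", Track.SINGLE_READ)
zep_multi = Result("Zep multi-scope", "LoCoMo", 94.7, "gpt-5.4", Track.SINGLE_READ)
# Placeholder value, NOT a measurement: no agentic number exists yet.
donto_agentic = Result("donto agentic", "LoCoMo", 0.0, "holo3.1", Track.AGENTIC)

results = [zep_auto, zep_multi, donto_agentic]
for row in leaderboard(results, "LoCoMo", Track.SINGLE_READ):
    print(f"{row.system}: {row.accuracy_pct}% ({row.reader})")
print(comparable(zep_multi, donto_agentic))
```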

---

## 5. The corrected bottom line

- **Zep 94.7% = single-read + multi-scope retrieval (5 composed searches + rerank, 5,760 tok, gpt-5.4).** The plain single-call version is **86.5%** (2,680 tok). The +8.2 points is retrieval composition, not reader agency.
- **Two axes, never conflate them:** retrieval composition (one call vs many composed+reranked) and reader agency (one read vs agent loop). Zep moves on the first; the whole leaderboard sits at the cheap end of the second.
- **donto's fair single-read targets:** single-scope ≈ auto-search regime (~86.5%); to chase 94.7% we must build **multi-scope composition with cross-scope rerank** — the piece our naive fusion lacked.
- **The agentic track is donto's most general and most native bet**, and the real fix for aggregation — but it is a separate, non-comparable claim.

_See also: [the shape of donto's return](/reports/donto-memory-return-shape) · [LoCoMo Config C](/reports/locomo-claims-only-config-c) · [all benchmarks](/benchmarks)._