# The verifiable epistemic trail: agent memory as auditable state, not text

_donto papers · original research · 2026-06-13 · reframing + method, working draft v1_

**Abstract.** Agent memory is almost universally a pile of remembered *text* — chunks, summaries, embedded passages — recalled by similarity and pasted into a context window. That representation answers "what did the user say about X" but structurally *cannot* answer the questions that matter when an agent's memory is wrong, contested, or audited: **what did the agent believe at 14:02 last Tuesday, why did it believe it, what evidence anchored it, and was that belief later corrected?** We argue agent memory should instead be a **bitemporal, evidence-anchored claim trail** — every write a holder-scoped claim stamped with transaction-time and bound to its source — and that this reframing turns *accountability* into a query rather than a feature. We show the trail already exists in donto's live store: ~1.505M holder-scoped memory records (`donto_x_memory_record`), ~158.6K of them written by one agent (`agent:omega-bot`) across 15 days, each anchored to a `donto_statement` and recallable *as of* any past instant with a single `tx_time @>` predicate. We demonstrate **epistemic replay** working today — the agent held 37,243 currently-believed records as of 2026-06-05 versus ~158.6K now, and the difference is computable, not lost — and we map the recall *audit* log (`donto_x_memory_access`, 49,349 `retrieved` events with actor, query-hash, rank and score). We are blunt about the gaps: the replay query layer is a raw SQL predicate, not a product surface; the trail is a *mix* of typed and episodic claims, not a uniform chunk store; only one of five audit `access_kind`s is populated; and there is **no accountability benchmark** — agent-memory benchmarks score recall, and none of them asks the questions a trail is for. Building that benchmark is the contribution this paper sets up; the substrate is already the trail.

---

## 1. The reframing: from "what is remembered" to "what was believed, and why"

Every agent-memory system answers one question well — *given this query, what relevant prior text exists?* — because every agent-memory system is, underneath, a retriever over remembered text. LoCoMo, LongMemEval, BEAM all score this: did the system surface the passage that lets the reader answer the question. Our companion empirical work ([Answer-Shaping](/papers/answer-shaping)) lives entirely inside that frame: it improves *what gets surfaced* and *in what shape*.

But there is a second class of question that an agent in production *will* be asked and that a text-pile cannot answer:

| accountability question | text-pile memory | claim-trail memory |
|---|---|---|
| What did the agent believe at past instant *t*? | ✗ (the index reflects *now*) | ✓ `tx_time @> t` |
| Why did it believe it — what evidence? | ✗ (chunk has no anchor) | ✓ `donto_evidence_link → span → revision → blob` |
| Was belief *b* later corrected, and when? | ✗ (overwrite, or duplicate) | ✓ supersede chain on `tx_time` (I3) |
| Which memories did it *act on* for answer *a*? | partial (no recall log) | partial: `donto_x_memory_access` logs *retrieved*; *cited* is schema-only |
| Reconstruct the exact context it reasoned over last Tuesday | ✗ | ✓ as-of recall of the believed set |

These are not exotic. They are the questions a regulator, a debugger, a red-teamer, or a user disputing "you told me X" actually asks. The standard memory representation answers *none* of them, because text remembered is not state recorded: a chunk has no transaction-time of its own (the index is mutated in place), no separable evidence link, and no notion of being superseded rather than replaced.

**The claim.** Agent memory should be an *auditable epistemic trail*: a bitemporal, evidence-anchored sequence of claims whose every entry is independently verifiable (it is bound to its source) and replayable (it carries the transaction-time at which it became believed). Under this reframing, the agent-memory problem stops being only *accuracy* and becomes also *accountability* — and accountability is the axis where a substrate beats a text-pile structurally, not marginally.

---

## 2. Why only donto can make this claim

The reframing is cheap to *state* and expensive to *back*, because it needs a store with three properties simultaneously — and they are donto's three foundational invariants:

1. **Never delete (I3 — supersede, never overwrite).** Replay requires every past belief to still be on disk. A mutable vector index cannot answer "what did you believe last Tuesday" because last Tuesday's index is gone. donto's memory *state* is versioned by `tx_time`: one record (`66ee367a-…`) already carries **101** `donto_x_memory_state` versions — 100 with a closed `tx_time` upper bound and exactly **1** open successor, none destroyed. That is a literal supersede chain on disk, not a metaphor.
2. **Two time axes on every fact.** The trail needs *transaction-time* (when the agent came to believe it) distinct from *valid-time* (when the fact holds in the world). `donto_x_memory_record.tx_time` is a `tstzrange` with an open upper bound for "currently believed"; the anchored `donto_statement` carries `valid_time` separately, so "the agent learned on Jun-5 that the user moved in 2023" keeps *learned-when* and *true-when* apart, which a single timestamp on a chunk cannot do.
3. **Evidence-first anchoring.** "Why did it believe it" requires each belief to point at its source. Every memory record satisfies a `one_anchor` check constraint binding it to exactly one `root_statement` / `root_frame` / `root_context`; the substrate holds **2,799,094** `donto_evidence_link` rows tying statements to character-spans of stored sources. A remembered chunk, by contrast, *is* the evidence and the claim at once, with no separable anchor to verify. (Anchoring *coverage* over the memory trail specifically is not yet measured — see §5.)

No collapse-on-conflict store has all three. A KB that dedups loses the superseded belief (breaks replay); one with a single time axis conflates learned-when with true-when (breaks the world-vs-belief distinction the trail exists to record); one without evidence links cannot answer "why." donto has them because it was built paraconsistent and bitemporal from the substrate up — the trail is not a memory *feature*, it is what those invariants *are* when the holder is an agent.
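The three invariants can be modelled in a few lines. What follows is a minimal Python sketch, not donto's implementation: the `Belief` class, its field names, and the `supersede` helper are hypothetical, chosen only to mirror the `tx_time` / `valid_time` split and the `one_anchor` constraint described above. The 2023-01-01 valid-time is a placeholder for "moved in 2023".

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class Belief:
    """Hypothetical trail entry; field names are illustrative, not donto's columns."""
    holder: str
    claim: str
    tx_from: datetime                      # transaction-time: when the holder came to believe it
    tx_to: Optional[datetime] = None       # None models the open upper bound ("currently believed")
    valid_from: Optional[datetime] = None  # valid-time: when the fact holds in the world
    root_statement: Optional[str] = None   # exactly one of these three anchors must be set
    root_frame: Optional[str] = None
    root_context: Optional[str] = None

    def __post_init__(self) -> None:
        anchors = (self.root_statement, self.root_frame, self.root_context)
        if sum(a is not None for a in anchors) != 1:
            raise ValueError("one_anchor: exactly one root anchor is required")


def supersede(trail: Tuple[Belief, ...], old: Belief, new_claim: str,
              anchor: str, at: datetime) -> Tuple[Belief, ...]:
    """Close the old belief's transaction range and append an open successor.

    The old entry stays in the trail with a closed upper bound; no entry is dropped.
    """
    closed = replace(old, tx_to=at)
    successor = Belief(old.holder, new_claim, tx_from=at,
                       valid_from=old.valid_from, root_statement=anchor)
    return tuple(closed if b is old else b for b in trail) + (successor,)


# Learned on Jun-5 that the user moved in 2023: two separate time axes.
learned = datetime(2026, 6, 5, tzinfo=timezone.utc)
first = Belief("agent:omega-bot", "user moved", tx_from=learned,
               valid_from=datetime(2023, 1, 1, tzinfo=timezone.utc),  # placeholder date
               root_statement="stmt:1")
corrected_at = datetime(2026, 6, 10, tzinfo=timezone.utc)
trail = supersede((first,), first, "user moved (revised)", "stmt:2", corrected_at)
```

After the call, `trail` holds both entries: the original with a closed transaction range and the successor with an open one. A store that overwrote `first` in place would leave no way to ask what was believed between the two instants.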

---

## 3. The trail already exists — measured

This is not a proposal for a schema; the schema is live and carrying an agent's full memory. The store grows as the bot runs, so counts are as of this writing (2026-06-13) and accurate only to within the last few hours of traffic:

| quantity | value | source |
|---|---|---|
| holder-scoped memory records | **~1,505,087** | `donto_x_memory_record` |
| distinct holders / sessions (substrate-wide) | **1,206** / **20,947** | same |
| records written by `agent:omega-bot` | **~158,606** | `holder_iri='agent:omega-bot'` |
| omega-bot sessions | **29** | distinct `session_iri` for that holder |
| omega-bot belief span | **15 days**, 2026-05-30 → 06-13 | `min/max(lower(tx_time))` |
| recall audit events (`retrieved`) | **49,349** | `donto_x_memory_access` |
| evidence links (substrate-wide) | **2,799,094** | `donto_evidence_link` |
| max state-version chain on one record | **101** (100 closed + 1 open) | `donto_x_memory_state` |

Each omega-bot record is a real claim: scoped by `holder_iri='agent:omega-bot'` and `session_iri='discord:<guild>:<channel>'`, anchored to a `donto_statement`, and stamped with the transaction-time the bot came to believe it. **The trail is heterogeneous, not a chunk pile:** the anchored statements are dominated by typed assertions (`rdf:type` ~31K, `donto:includes`, `donto:content`, `donto:presupposes`) with `mem:episodic/chunk` itself a minority (~934). This is the point of the reframing made visible: the bot's Discord life lands as *claims with types and structure*, not raw embedded passages, and it does so with no extra instrumentation because it writes through donto-memory rather than into an index.

### 3.1 Epistemic replay is a query, today

The headline capability — *reconstruct the agent's currently-believed set as of any past transaction-time* — is **not** future work. It is a `tx_time` containment predicate over data that already exists, because I3 kept every version:

```sql
-- records agent:omega-bot believed (currently-valid) as of 2026-06-05 00:00 UTC
SELECT count(*) FROM donto_x_memory_record
WHERE holder_iri = 'agent:omega-bot'
  AND tx_time @> '2026-06-05 00:00:00+00'::timestamptz;
--> 37243

-- what it believes now (open upper bound)
SELECT count(*) FROM donto_x_memory_record
WHERE holder_iri = 'agent:omega-bot' AND upper(tx_time) IS NULL;
--> 158606
```

As of Jun-5 the agent held **37,243** currently-believed records; now it holds **~158,606**. The ~121K-record difference is the *net* change since then (records learned minus any superseded in the interval), and both sides of it are recoverable rather than overwritten. Replace the instant-containment predicate with a join against `donto_statement.valid_time` and you separate *when it learned* from *when the fact was true*: the two-axis replay the reframing promised. **This is the PROVEN tier: as-of belief reconstruction works on the live store with one line of SQL.** A text-pile cannot return this number at all. (The caveat is in the word *count*: we have proven the believed *set* is reconstructable, not that re-executing the agent against that frozen set reproduces its past behaviour. That re-execution is the counterfactual-replay harness listed in §4, still unbuilt.)
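The containment semantics behind those two counts can be made concrete with an in-memory sketch. This is illustrative Python, not the store's code: the `Record` tuple and the helper names are hypothetical, and the half-open `[)` bound is an assumption (it is the default for PostgreSQL range constructors, but the source does not state how donto builds its ranges). The sketch also shows why a difference between two counts is a net figure: it equals records learned minus records no longer believed. For the live numbers, 158,606 - 37,243 = 121,363.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, NamedTuple, Optional, Set


class Record(NamedTuple):
    record_id: str
    tx_from: datetime
    tx_to: Optional[datetime]  # None models an open upper bound


def believed_at(r: Record, t: datetime) -> bool:
    """Half-open containment: tx_from <= t < tx_to (assumed '[)' bounds)."""
    return r.tx_from <= t and (r.tx_to is None or t < r.tx_to)


def replay(records: Iterable[Record], t: datetime) -> Set[str]:
    """Ids of the records believed as of instant t."""
    return {r.record_id for r in records if believed_at(r, t)}


def belief_diff(records: Iterable[Record], then: datetime,
                now: datetime) -> Dict[str, Set[str]]:
    """Split the change between two instants into kept / learned / dropped."""
    records = list(records)  # iterated twice below
    before, after = replay(records, then), replay(records, now)
    return {"kept": before & after,
            "learned": after - before,
            "dropped": before - after}


def day(d: int) -> datetime:
    return datetime(2026, 6, d, tzinfo=timezone.utc)


# Toy trail (made-up data): "b" is superseded on Jun-6, "c" is learned on Jun-7.
records = [
    Record("a", day(1), None),
    Record("b", day(2), day(6)),
    Record("c", day(7), None),
]
diff = belief_diff(records, day(5), day(13))
net = len(replay(records, day(13))) - len(replay(records, day(5)))
```

Here `diff["learned"]` is `{"c"}` and `diff["dropped"]` is `{"b"}`, so the count difference `net` is zero even though the believed set changed. Reporting the two components separately is the more informative replay output.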

### 3.2 The recall log is an accountability substrate

`donto_x_memory_access` records *what the agent acted on*: every recall writes rows with `actor_iri` (who recalled), `query_hash`, `access_kind`, `rank`, and `score`, and the log is itself bitemporal. The `access_kind` domain is the skeleton of a full accountability ledger:

```sql
CHECK (access_kind IN ('retrieved','surfaced','cited','ignored','corrected'))
```

`retrieved` is populated (49,349 events; e.g. `user:locomo:conv-26` 9,297, `agent:omega-bot` 1,161). The other four — *surfaced* (made it into the prompt), *cited* (the answer used it), *ignored* (recalled but unused), *corrected* (later refuted) — are **schema-only today**. That gap is the honest center of this paper (§5): the trail records what was *retrieved*, but the loop that closes accountability — recording what was *used*, *ignored*, and *corrected*, then feeding `corrected` back into [standing](/papers/standing-dynamics) — is specified by the schema and not yet wired by the consumer.
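To illustrate what a fully populated ledger would allow, here is a small Python sketch under stated assumptions. The `AccessEvent` class is hypothetical; its `record_id` field is an assumption (the source lists only `actor_iri`, `query_hash`, `access_kind`, `rank`, and `score`), and the `cited` event in the sample data is invented, since that kind is schema-only today.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set

# Mirrors the CHECK constraint's domain.
ACCESS_KINDS = ("retrieved", "surfaced", "cited", "ignored", "corrected")


@dataclass(frozen=True)
class AccessEvent:
    """Hypothetical in-memory stand-in for one audit row."""
    actor_iri: str
    query_hash: str
    record_id: str   # assumed link to the memory record; not named in the source
    access_kind: str
    rank: int
    score: float

    def __post_init__(self) -> None:
        if self.access_kind not in ACCESS_KINDS:
            raise ValueError(f"unknown access_kind: {self.access_kind!r}")


def coverage(events: Iterable[AccessEvent]) -> Dict[str, int]:
    """Event count per access_kind, including kinds with zero events."""
    counts = Counter(e.access_kind for e in events)
    return {kind: counts.get(kind, 0) for kind in ACCESS_KINDS}


def dark_kinds(events: Iterable[AccessEvent]) -> List[str]:
    """Kinds the schema allows but no event populates."""
    return [kind for kind, n in coverage(events).items() if n == 0]


def retrieved_but_uncited(events: Iterable[AccessEvent], query_hash: str) -> Set[str]:
    """Records recalled for a query and never cited: candidates for 'ignored'."""
    events = [e for e in events if e.query_hash == query_hash]
    retrieved = {e.record_id for e in events if e.access_kind == "retrieved"}
    cited = {e.record_id for e in events if e.access_kind == "cited"}
    return retrieved - cited


# Made-up sample: two recalls for one query, one of them later cited.
events = [
    AccessEvent("agent:omega-bot", "q1", "rec:1", "retrieved", 1, 0.91),
    AccessEvent("agent:omega-bot", "q1", "rec:2", "retrieved", 2, 0.74),
    AccessEvent("agent:omega-bot", "q1", "rec:1", "cited", 1, 0.91),
]
```

With only `retrieved` populated, `retrieved_but_uncited` would return every recalled record, which is why logging `cited` at answer-assembly time is the step that makes the action audit meaningful.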

---

## 4. What a claim-trail memory can answer that a text-pile cannot

The reframing earns its keep on queries no episodic-text memory can express. Each below is a `tx_time`/evidence/supersede query over the existing tables — the *shape* of an accountability benchmark:

- **As-of belief.** "List everything the agent currently believed about user U as of `t`." → records where `holder_iri='agent:U-handler'` and `tx_time @> t`. (PROVEN at the aggregate level in §3.1.)
- **Provenance.** "For belief *b*, show the source span and document." → `root_statement → donto_evidence_link → span → revision → blob`. (The join is mechanical and the links exist substrate-wide; coverage over the memory trail is unmeasured.)
- **Correction audit.** "Which beliefs were superseded, by what, and when?" → the supersede chain in `tx_time` (closed-upper records and their open successor). (Versioning is live: the chain of 100 closed versions plus 1 open successor proves it; the *diff* surface is unbuilt.)
- **Action audit.** "Which memories did the agent *act on* to produce answer *a*?" → `donto_x_memory_access` filtered by `query_hash`/`actor_iri`. (`retrieved` populated; `cited` not.)
- **Counterfactual replay.** "Re-run the agent's reasoning over only what it believed at `t`." → recall the as-of set, feed it as context. (The set is queryable; the harness to *re-execute* the agent against a frozen belief-set is unbuilt.)

A text-pile answers the *first half of provenance* (here is the chunk) and **none** of the rest, because it has no transaction-time of its own, no separable evidence anchor, and no supersede semantics.
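Two of the queries above, provenance and correction audit, can be sketched over toy in-memory structures. This is an illustrative Python model, not the donto read path: the `Version` tuple, the dictionary layouts, and all sample data are hypothetical, and a character-offset span into a stored blob is assumed as the shape of an evidence link.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, List, NamedTuple, Optional, Tuple


class Version(NamedTuple):
    record_id: str
    holder: str
    value: str
    tx_from: datetime
    tx_to: Optional[datetime]  # None = open (current) version


def corrections(versions: List[Version], holder: str, start: datetime,
                end: datetime) -> List[Tuple[str, str, datetime]]:
    """(old value, new value, when) for each supersede with start <= when < end."""
    chains = defaultdict(list)
    for v in versions:
        if v.holder == holder:
            chains[v.record_id].append(v)
    found = []
    for chain in chains.values():
        chain.sort(key=lambda v: v.tx_from)
        for old, new in zip(chain, chain[1:]):
            if old.tx_to is not None and start <= old.tx_to < end:
                found.append((old.value, new.value, old.tx_to))
    return sorted(found, key=lambda c: c[2])


def provenance(statement_id: str, links: List[Dict[str, str]],
               spans: Dict[str, dict], revisions: Dict[str, str]) -> List[str]:
    """Follow statement -> evidence link -> span -> revision blob; return quoted text."""
    quotes = []
    for link in links:
        if link["statement"] == statement_id:
            span = spans[link["span"]]
            blob = revisions[span["revision"]]
            quotes.append(blob[span["start"]:span["end"]])
    return quotes


def day(d: int) -> datetime:
    return datetime(2026, 6, d, tzinfo=timezone.utc)


# Made-up sample data.
versions = [
    Version("rec:1", "agent:omega-bot", "user lives at address A", day(1), day(6)),
    Version("rec:1", "agent:omega-bot", "user lives at address B", day(6), None),
]
blob = "On Monday the user said they moved to a new flat."
quote = "moved to a new flat"
start = blob.index(quote)
revisions = {"rev:1": blob}
spans = {"span:1": {"revision": "rev:1", "start": start, "end": start + len(quote)}}
links = [{"statement": "stmt:2", "span": "span:1"}]
```

Calling `corrections(versions, "agent:omega-bot", day(1), day(13))` returns the single A-to-B supersede, and `provenance("stmt:2", links, spans, revisions)` returns the quoted span. A text-pile can supply the chunk text but has neither the chain nor the separable span to run either function against.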

---

## 5. Honest status: proven / conjectured / speculative

In the three-tier candor of the seed papers:

- **PROVEN (measured on the live store).** The trail exists and is bitemporal: ~1.5M holder-scoped records, an agent's ~158.6K-record / 15-day / 29-session trail, anchored via a `one_anchor` constraint, with 2.8M evidence links substrate-wide and state versioned to 101 deep (a genuine 100-closed + 1-open supersede chain). **Epistemic replay works today** — as-of belief reconstruction is a one-line `tx_time @>` query returning 37,243 (Jun-5) versus ~158,606 (now). A recall audit log (`donto_x_memory_access`) is live with 49,349 events carrying actor/query/rank/score. These are facts about the database, not aspirations.
- **CONJECTURED (and the failure mode named).** That a claim-trail memory matches text-pile memory *on recall accuracy* while strictly dominating it *on accountability*. The recall-parity half is plausible but **not free**: [Answer-Shaping](/papers/answer-shaping) and the [memory scorecard](/reports/memory-benchmarks-scorecard) show atomic-claims-*only* recall **collapses** (LoCoMo 0.244 vs 0.604 episodic, 96Q), so a trail that stores *only* claims and discards transcripts can *lose accuracy*. **The failure mode is explicit:** if "claim trail" is read as "throw away the text," recall regresses on synthesis questions. The defensible version stores claims *plus* their anchored source spans (the trail is the accountability layer *over* retained evidence, not a replacement for it). Whether that composite matches episodic recall at competitive token cost is the open measurement, not a settled result. A second open question lives here too: anchoring *coverage over the memory trail specifically* is unmeasured (§2, item 3), so "every belief is verifiable" is an architectural guarantee of the schema, not yet an empirical rate.
- **SPECULATIVE (flagged).** That `corrected` access events, fed back, make standing a reality-steered loop ([standing-dynamics](/papers/standing-dynamics)) — i.e. that the trail is not just an audit record but the *training signal* by which the agent's memory self-corrects. The mechanism is plausible and I3-safe (corrections are new superseding writes, never deletions), but nothing populates `corrected` yet and there is no experiment. Flagged, not claimed.

The sharpest honest negative: **there is no accountability benchmark, and the field has no incentive to build one** because every existing benchmark scores recall. We can replay an agent's believed set but cannot yet *score* a system on whether it replays *correctly*, attributes provenance *faithfully*, or surfaces corrections *completely*. The reframing is only as strong as the benchmark that operationalizes it — and that benchmark is the work this paper exists to motivate.

---

## 6. The experiment that would establish it

Two builds, then a measurement:

1. **Replay query layer.** Wrap the live `tx_time @>` / evidence-join / supersede-chain queries as a first-class read API (`recall_as_of(holder, t)`, `provenance(belief)`, `corrections(holder, window)`) on donto-memory — Rust, reusing the existing recall path, no new tables (I3: corrections are superseding writes). The data is there; this exposes it. Populate the four dark `access_kind`s by having the consumer log `surfaced`/`cited`/`ignored` at answer-assembly time and `corrected` on retraction.
2. **An accountability benchmark.** Take an existing dialogue corpus (LongMemEval `_s`, which already has dated knowledge-updates) and *derive* an accountability test set: for each knowledge-update, the questions become "what did the agent believe *before* the update," "after," "show the evidence for the post-update belief," "list the superseded prior belief." Score on **replay fidelity** (does as-of recall match the held-out ground-truth belief-set), **provenance precision** (does the cited span actually contain the fact — reuse the [decoupled-anchoring](/papers/decoupled-anchoring) citer as the grader), and **correction recall** (fraction of refuted beliefs surfaced as `corrected`).
3. **The head-to-head.** Run the same corpus through a strong episodic-text baseline (a mutable vector index) and donto's claim trail. The expected result — the one that would establish the reframing — is a **separation, not a tie**: comparable recall accuracy, but the text-pile scores ~0 on every accountability metric (it cannot answer them) while the trail scores high. A separation on a metric the baseline *structurally cannot reach* is a stronger result than a recall-leaderboard delta, and it is the honest shape of donto's advantage here. (The recall-parity leg is the CONJECTURED tier and the experiment must report it honestly — including a loss, if claims-plus-spans still trails episodic.)
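The three proposed scores can be written down as simple set and ratio computations. The Python below is a sketch under assumptions, not the benchmark's definition: Jaccard overlap is one possible reading of "replay fidelity", the substring check is only a stand-in for the decoupled-anchoring citer as grader, and the empty-input conventions are arbitrary choices made so the functions are total.

```python
from typing import Iterable, Set, Tuple


def replay_fidelity(recalled: Set[str], truth: Set[str]) -> float:
    """Jaccard overlap of the as-of recall with the ground-truth belief-set."""
    if not recalled and not truth:
        return 1.0  # convention: nothing to recall, nothing recalled
    return len(recalled & truth) / len(recalled | truth)


def provenance_precision(citations: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (fact, cited_span_text) pairs whose span contains the fact.

    Case-insensitive substring match is a placeholder grader, assumed for illustration.
    """
    pairs = list(citations)
    if not pairs:
        return 0.0  # convention: a system that cites nothing earns no precision
    hits = sum(1 for fact, span in pairs if fact.lower() in span.lower())
    return hits / len(pairs)


def correction_recall(refuted: Set[str], surfaced_corrected: Set[str]) -> float:
    """Fraction of refuted beliefs that the system surfaced as corrected."""
    if not refuted:
        return 1.0  # convention: no corrections were required
    return len(refuted & surfaced_corrected) / len(refuted)


# Made-up example: a trail that answers imperfectly vs a baseline that cannot answer.
truth = {"b1", "b2", "b3"}
refuted = {"b0", "b9"}
trail_scores = (
    replay_fidelity({"b1", "b2"}, truth),
    provenance_precision([("moved to", "they moved to a new flat"),
                          ("new job", "they moved to a new flat")]),
    correction_recall(refuted, {"b0"}),
)
baseline_scores = (
    replay_fidelity(set(), truth),
    provenance_precision([]),
    correction_recall(refuted, set()),
)
```

Under these conventions a baseline that returns nothing for the accountability questions scores zero on all three, which is the separation the head-to-head is designed to expose. The recall-accuracy leg would be scored separately with the corpus's existing metric.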

---

## 7. Why this is donto's to write

The reframing — *memory as auditable state, not text* — is only credible from a store that already holds an agent's life as bitemporal, evidence-anchored, never-deleted claims and can replay them with a one-line predicate. That is the precise intersection of I3, bitemporality, and evidence-first anchoring, the same three invariants the [bitemporal sheaf](/papers/bitemporal-sheaf) turns into a contradiction measure. The sheaf asks *what shape do the agent's disagreements have*; this paper asks *what did the agent believe, and can it prove why* — two readings of the same trail. donto can pose both because, uniquely, it kept the data: **~158,600 typed, dated, holder-scoped beliefs on disk for one agent over fifteen days — and not one of them overwritten.**

---

_See also: the empirical companion [Answer-Shaping](/papers/answer-shaping) and its [honest scorecard](/reports/memory-benchmarks-scorecard); the return-shape study [donto-memory return shape](/reports/donto-memory-return-shape); the theory companion [The Bitemporal Sheaf](/papers/bitemporal-sheaf); the feedback loop in [Standing as a dynamical system](/papers/standing-dynamics); the full program in the [donto research agenda](/papers/research-agenda)._
