# Contested retrieval: returning disagreement as a first-class result

_donto papers · original research · 2026-06-13 · method, working draft v1_

**Abstract.** Every retrieval-augmented system makes a silent commitment the moment it picks a passage: it *resolves* disagreement on the reader's behalf. When two sources conflict, RAG returns the higher-scored one and discards the other — so the model is handed "X" when the corpus actually says "X, but a second source disagrees." The reader has no way to know it was lied to by omission, and it answers confidently wrong while the faithfulness metric scores it correct against the passage it was given. We propose **contested retrieval**: a retrieval contract that returns, alongside the ranked answer, a **contestedness score** and the **dissenting evidence** — the model receives `X (contested: a second source says Y)` instead of `X`. This is not a re-ranking trick layered on a collapsing store; it is only *possible* on a store that (a) never deleted the dissent (donto invariant **I3** — supersede, never overwrite) and (b) already maintains a corpus-wide measure of how contested each fact is. donto has both: ~42M live statements retain every conflicting attestation (`donto_statement` `reltuples` ≈ 42.0M, 2026-06-12), and `donto_paraconsistency_density` already holds **~235K subject/predicate/time-window cells with two or more disagreeing polarities** as standing, pre-computed state. The novel claim is a faithfulness contract: contested retrieval should *reduce confidently-wrong answers on contested-fact questions without hurting settled-fact or knowledge-update accuracy* — because the contestedness annotation is near-zero overhead on settled facts and decisive on disputed ones. We give the mechanism over real tables, the experiment that would establish it, its dependence on the [bitemporal sheaf](/papers/bitemporal-sheaf)'s `H¹` for the *principled* score, and an honest accounting of what is built (the dissent, plus a noisy density proxy) versus what is not (the recall wiring and the controlled experiment).

---

## 1. The silent-resolution bug in the RAG contract

The standard RAG loop is: embed the question, retrieve top-`k` passages by similarity, hand them to the reader. The faithfulness literature then asks whether the reader *grounded its answer in the retrieved passages*. But there is a prior, unexamined failure: **what the retriever chose not to return.** If the corpus contains both `birthplace = Coen` and `birthplace = McIvor` for the same person and the retriever surfaces only Coen, a perfectly faithful reader produces a confidently wrong "Coen" — faithful to its context, unfaithful to the corpus. The faithfulness metric scores it correct against the passage, and the system has no signal that it just laundered a contested claim into a settled one.

This is structural, not a tuning problem. A store that collapses on conflict (dedup, pick-a-winner, invalidate-on-update) has *physically thrown away* the dissent, so even a perfect retriever cannot return it — there is nothing to return. The contract can only change on a store that kept the disagreement. That is donto's paraconsistent stance made operational: hold every incompatible claim forever as legal state, then **return the disagreement instead of hiding it.**

> **The contested-retrieval contract.** Retrieval returns a triple `⟨answer, contestedness, dissent⟩`: the ranked best answer, a scalar contestedness ∈ [0,1] for the answered subject/predicate, and the top dissenting attestations with their sources. The reader is *instructed to surface contestedness* — to answer "McIvor (contested: a second source says Coen)" rather than "McIvor." Settled facts get contestedness ≈ 0 and the contract degrades to ordinary RAG.
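
As a minimal sketch of the return shape, assuming nothing about donto's actual API: the class names, field names, the threshold default `tau = 0.5`, and the example score of 0.8 are all hypothetical and chosen only for illustration. The sketch shows the one property the contract promises, namely that a settled fact renders exactly as ordinary RAG would render it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dissent:
    """One dissenting attestation and its source (hypothetical shape)."""
    value: str
    source: str


@dataclass(frozen=True)
class ContestedResult:
    """The <answer, contestedness, dissent> triple; names are illustrative."""
    answer: str
    contestedness: float  # scalar in [0, 1]
    dissent: tuple[Dissent, ...] = ()

    def __post_init__(self) -> None:
        if not 0.0 <= self.contestedness <= 1.0:
            raise ValueError("contestedness must lie in [0, 1]")


def render_for_reader(result: ContestedResult, tau: float = 0.5) -> str:
    """Format the triple for the reader's context.

    At or below the assumed threshold tau, or with no dissent attached,
    the output is the bare answer: the contract degrades to ordinary RAG.
    """
    if result.contestedness <= tau or not result.dissent:
        return result.answer
    notes = "; ".join(f"{d.source} says {d.value}" for d in result.dissent)
    return f"{result.answer} (contested: {notes})"


settled = ContestedResult(answer="Cairns", contestedness=0.0)
disputed = ContestedResult(
    answer="McIvor",
    contestedness=0.8,  # invented score for the demo
    dissent=(Dissent(value="Coen", source="a second source"),),
)
print(render_for_reader(settled))   # Cairns
print(render_for_reader(disputed))  # McIvor (contested: a second source says Coen)
```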

---

## 2. Why only donto can make this claim

The contract needs four things at once, which are exactly donto's foundational invariants plus its alignment fabric:

| requirement | donto provides it as | verified (2026-06-12) |
|---|---|---|
| the dissent still exists to be returned | **I3** — no destructive overwrite; conflicting attestations coexist as live rows in `donto_statement` | ~42M live statements (`reltuples` estimate) |
| a per-fact "how contested is this" signal, pre-computed | `donto_paraconsistency_density` (subject, predicate, window, `distinct_polarities`, `distinct_contexts`, `conflict_score`) | **~235K rows with `distinct_polarities ≥ 2`** over ~27.7K distinct subjects; refreshed `2026-05-15 → 2026-06-12` by `analyzer_paraconsistency.rs` |
| the dissenting *evidence*, not just the dissenting *claim* | `donto_evidence_link` → span → revision → blob | ~2.57M evidence links |
| query-time alignment so "different words for the same predicate" count as one contest | `donto_predicate_closure` (~1.05M rows), `donto_match_aligned` | live |

No collapse-on-conflict store has the first row, so it cannot have the third either. A vector DB returns the winner and its neighbors; it has no notion that the loser is *evidence of contest*. donto's density table is the difference: contestedness is **not** computed at query time from whatever happened to be retrieved (which would be circular — the retriever's own omission would suppress the signal), it is **standing, corpus-wide state** maintained ahead of the query. A real example from the live store:

```
subject       predicate   distinct_polarities  distinct_contexts  conflict_score
ex:ajaxdavis  rdf:type    2                    1125               0.529
```

`ex:ajaxdavis` is asserted to be of competing types across **1,125 distinct contexts**: a junk-drawer identity that any collapsing store would have silently resolved to whichever type indexed first. Contested retrieval would hand the reader *both* types together with the cell's `conflict_score` of 0.529: "this entity's type is disputed across the corpus." (Note that this row is mid-table, not the most contested; see §4 for why the *top* of the proxy is dominated by extraction noise, which is itself the honest limitation of the v1 score.)

---

## 3. Mechanism: a thin annotation layer over recall

The design is deliberately small — it is an annotation, not a new retriever — because the heavy lifting (holding dissent, scoring contest) is already done. Over donto's existing recall path:

1. **Recall as today.** `/recall` (holder-scoped) or `/search` (substrate-wide FTS + hybrid vector) returns ranked statements for the question. This is unchanged; donto's recall is sound (the [answer-shaping](/papers/answer-shaping) diagnostics found the right evidence present in *every* diagnosed category — the gap was the reader applying it, never recall missing it).
2. **Resolve the answered (subject, predicate) under the alignment closure.** Fold the retrieved top claim's predicate through `donto_predicate_closure` so `birthPlace`, `bornIn`, `placeOfBirth` are one contest, not three (no-brittle-logic: this is the alignment engine's job, not a synonym list).
3. **Read the standing contestedness.** Look up `donto_paraconsistency_density` for that folded (subject, predicate) **as-of** the question's transaction-time. `conflict_score` (a function of `distinct_polarities` and `distinct_contexts`) is the contestedness scalar; `sample_statements` (the `uuid[]` the analyzer already stores) points at the dissenting rows.
4. **Attach the dissent.** Pull the top dissenting attestations and their `donto_evidence_link` spans, distinct from the answered one (a different polarity, or a different context).
5. **Return the triple, instruct the reader.** Answer + `contestedness` + dissent spans, with a system instruction: *if contestedness exceeds a threshold τ, state the answer AND that it is contested and name the dissent.*

I3 is respected throughout — this is pure read; nothing is merged, retracted, or deleted. The whole layer is a Rust addition to the recall route plus one indexed lookup against a table that already exists (`donto_paraconsistency_density_score_idx` already covers `distinct_polarities ≥ 2`); it adds one bounded query to a path that already runs sub-second.

```
question ──▶ recall (unchanged) ──▶ fold (predicate_closure)
                                        │
                          donto_paraconsistency_density[(subj,pred) as-of t]
                                        │
        ⟨ answer , conflict_score , dissent spans (evidence_link) ⟩ ──▶ reader
```
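
The five steps can be sketched end to end over in-memory stand-ins. This is an illustration under stated assumptions, not the planned Rust route: the dictionaries below only mimic `donto_predicate_closure`, `donto_statement`, and `donto_paraconsistency_density`, every row (including `ex:person-1` and the score 0.7) is invented, the as-of transaction-time lookup is omitted, and the toy treats "a different value" as dissent where the real layer would compare polarity or context.

```python
from typing import NamedTuple, Optional


class Statement(NamedTuple):
    id: str
    subject: str
    predicate: str
    value: str
    source: str


class DensityCell(NamedTuple):
    conflict_score: float
    sample_statements: tuple  # statement ids the analyzer stored for the cell


# Toy stand-ins for the real tables; all rows are invented for illustration.
PREDICATE_CLOSURE = {"bornIn": "birthPlace", "placeOfBirth": "birthPlace"}
STATEMENTS = {
    "s1": Statement("s1", "ex:person-1", "bornIn", "McIvor", "source A"),
    "s2": Statement("s2", "ex:person-1", "placeOfBirth", "Coen", "source B"),
}
DENSITY = {("ex:person-1", "birthPlace"): DensityCell(0.7, ("s1", "s2"))}


def fold(predicate: str) -> str:
    """Step 2: map a predicate to its canonical form under the closure."""
    return PREDICATE_CLOSURE.get(predicate, predicate)


def annotate(top: Statement) -> dict:
    """Steps 3-5: read standing contestedness and attach the dissent."""
    cell: Optional[DensityCell] = DENSITY.get((top.subject, fold(top.predicate)))
    if cell is None:  # no density row: treat the fact as settled
        return {"answer": top.value, "contestedness": 0.0, "dissent": []}
    dissent = [
        {"value": s.value, "source": s.source}
        for s in (STATEMENTS[i] for i in cell.sample_statements)
        if s.id != top.id and s.value != top.value
    ]
    return {
        "answer": top.value,
        "contestedness": cell.conflict_score,
        "dissent": dissent,
    }


# Step 1 (recall) is assumed to have returned s1 as the top claim.
result = annotate(STATEMENTS["s1"])
print(result)
```

Note that `bornIn` and `placeOfBirth` land in the same cell only because of the fold; without step 2 the lookup would miss and the dispute would be reported as settled.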

---

## 4. The score, and why the principled version needs the sheaf

`donto_paraconsistency_density.conflict_score` is a **proxy**, and a deliberately coarse one. It aggregates over a (subject, predicate, time-window) cell — distinct polarities, distinct contexts. That is enough to *flag* contest and is already maintained corpus-wide, which is why contested retrieval is buildable today. It has two known weaknesses, and we name both:

**(a) It is noisy at the top.** Sorted by `conflict_score`, the most-contested cells are not the interesting genealogical disputes but **extraction junk** — `ex:python · rdf:type`, `ex:step-2 · rdf:type`, `ex:print-statement · rdf:type` (each `conflict_score = 1.0` across hundreds of contexts), artifacts of code-extraction minting the same generic IRIs with conflicting types. ~102K of the ~235K ≥2-polarity cells sit at `conflict_score = 1.0`. A deployable contested-retrieval layer therefore cannot just read the top of the table; it needs a relevance gate (subject is in the recalled answer, and ideally a non-generic entity), or the noise dominates. This is the abundance trade-off showing through: the same firehose that gives us real disputes for free also mints generic junk, and the v1 proxy does not tell them apart.

**(b) It sees a cell, not the graph — so it conflates change with contest.** This is the weakness it shares with all pairwise contradiction machinery (`donto_argument`, 2,525 edges). It cannot distinguish:

- *the world changed* — residence `Cairns@2019` then `Brisbane@2023`, which is **not** a contradiction (it is the LongMemEval knowledge-update case donto already handles at **0.923**, per [the scorecard](/reports/memory-benchmarks-scorecard)); from
- *the sources disagree* — birthplace `Coen` vs `McIvor` at the same valid-time, which **is** one.

The density proxy can over-flag the first as contested if it pools across valid-times, which would *hurt* knowledge-update accuracy — the exact failure mode the claim must avoid. The principled contestedness score is the **valid-time `H¹` mass** of the [bitemporal sheaf](/papers/bitemporal-sheaf): non-zero valid-time `H¹` is genuine disagreement; a fact that merely *changed over time* has `H¹ = 0` along the valid axis (Example A of that paper). So:

> **Contested retrieval is the consumer of the sheaf's contradiction measure.** The density table is the deployable v1 score (flag now, behind a relevance gate and a valid-time guard); `H¹(·, now)` is the correct v2 score that distinguishes contest from change *and* is not fooled by generic-IRI pile-ups, because it measures an obstruction over a recalled subgraph rather than counting contexts in a cell. Contested retrieval is *why* the sheaf's `H¹` needs to exist — it is the retrieval surface that turns a cohomology number into a faithfulness gain.
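
The valid-time guard on the v1 proxy can be illustrated with a much cruder check than `H¹`. The sketch below is a pairwise interval-overlap test, not the sheaf computation, and it is offered only to make the change-versus-contest distinction concrete. The `Claim` shape, the half-open year intervals, and the specific end years and birth year are assumptions added for the demo; the source gives only `Cairns@2019`, `Brisbane@2023`, and the Coen/McIvor dispute.

```python
from itertools import combinations
from typing import NamedTuple


class Claim(NamedTuple):
    value: str
    valid_from: int  # inclusive year
    valid_to: int    # exclusive year


def overlaps(a: Claim, b: Claim) -> bool:
    """True when the two half-open valid-time intervals intersect."""
    return a.valid_from < b.valid_to and b.valid_from < a.valid_to


def is_contested(claims: list[Claim]) -> bool:
    """Contest: two different values asserted for overlapping valid-times."""
    return any(
        a.value != b.value and overlaps(a, b)
        for a, b in combinations(claims, 2)
    )


# The world changed: consecutive, non-overlapping residences.
residence = [Claim("Cairns", 2019, 2023), Claim("Brisbane", 2023, 2026)]
# The sources disagree: two birthplaces for the same valid-time.
birthplace = [Claim("Coen", 1900, 1901), Claim("McIvor", 1900, 1901)]

print(is_contested(residence))   # False
print(is_contested(birthplace))  # True
```

A density cell that pools both residence claims into one window would count two values and flag a contest; the guard declines to, which is the behavior the knowledge-update stratum requires.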

This is the honest dependency: the cleanest version of the experiment below wants Stage-0 `H¹`, which is in build (`packages/donto-sheaf`, migrations 0178–0183), not yet a measured analytic.
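
The relevance gate named in (a) can be sketched as a filter over density cells. This is a hedged illustration: the `Cell` shape and the `relevance_gate` signature are hypothetical, and the genericity test is deliberately passed in as a callable because the source does not say how "non-generic entity" would be decided. The hand-written set used in the demo is a placeholder, not a proposal.

```python
from typing import Callable, Iterable, NamedTuple


class Cell(NamedTuple):
    subject: str
    predicate: str
    conflict_score: float


def relevance_gate(
    cells: Iterable[Cell],
    recalled_subjects: set[str],
    is_generic: Callable[[str], bool],
) -> list[Cell]:
    """Keep cells whose subject was actually recalled and is not generic."""
    return [
        c for c in cells
        if c.subject in recalled_subjects and not is_generic(c.subject)
    ]


cells = [
    Cell("ex:python", "rdf:type", 1.0),       # extraction junk named in (a)
    Cell("ex:ajaxdavis", "rdf:type", 0.529),  # the mid-table row from section 2
    Cell("ex:step-2", "rdf:type", 1.0),       # junk, and not recalled either
]
# Placeholder genericity test for the demo only.
generic_iris = {"ex:python", "ex:step-2", "ex:print-statement"}

kept = relevance_gate(
    cells,
    recalled_subjects={"ex:ajaxdavis", "ex:python"},
    is_generic=lambda iri: iri in generic_iris,
)
print(kept)
```

Sorting the raw cells by `conflict_score` would put both junk rows first; after the gate only the recalled, non-generic subject survives, regardless of its rank in the table.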

---

## 5. The experiment that would establish the claim

The claim is two-sided and both sides must hold; a one-sided win is not the result.

**Hypothesis.** Contested retrieval (a) **reduces confidently-wrong answers** on contested-fact questions, and (b) **does not hurt settled-fact or knowledge-update accuracy**.

**Setup.** Free, weak reader+judge (holo3.1) throughout — the same instrument and rationale as [answer-shaping](/papers/answer-shaping): a weak reader makes the substrate's contribution visible as a delta; a strong reader masks it. Two arms, identical retrieval, differing only in the return shape:

- **control:** ordinary RAG — top-`k`, winner only.
- **treatment:** contested retrieval — `⟨answer, conflict_score, dissent⟩` with the surface-contestedness instruction.

**Strata** (this is what makes it a clean test rather than an average):

| stratum | source | what it tests | predicted direction |
|---|---|---|---|
| **contested-fact** | genealogy corpus, subjects with `distinct_polarities ≥ 2`, relevance-gated to real entities (e.g. Coen/McIvor birthplace) | does the annotation prevent confidently-wrong answers? | treatment better (confidently-wrong rate ↓) |
| **knowledge-update** | LongMemEval knowledge-update (current 0.923) | does the *changed-not-contested* case get falsely flagged? | treatment **=** control (must not regress) |
| **settled-fact** | LongMemEval/LoCoMo single-hop, `distinct_polarities = 1` | overhead-only cost on uncontested facts | treatment **=** control |

**Primary metric:** the **confidently-wrong rate** on the contested-fact stratum — answers stated without a contest caveat that are wrong or arbitrary tie-breaks — not raw accuracy (a contested question often has no single right answer; the right *behavior* is to surface the contest). **Guard metrics:** accuracy on knowledge-update and settled-fact strata, which must not drop. A judge rubric that credits "states the answer and flags the contest" as correct on contested-fact, and penalizes a false contest-flag on settled/knowledge-update, operationalizes both sides.
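
Both sides of the rubric reduce to simple counts once the judge has labelled each answer. The following is a sketch under assumptions: the `Judged` record, its field names, and the stratum labels are hypothetical, and the seven rows are invented to exercise the arithmetic, not results.

```python
from typing import NamedTuple


class Judged(NamedTuple):
    stratum: str           # "contested", "knowledge_update", or "settled"
    answer_ok: bool        # judge accepted the stated answer (not wrong, not an arbitrary tie-break)
    flagged_contest: bool  # the reader added a contest caveat


def confidently_wrong_rate(rows: list[Judged]) -> float:
    """Share of contested-stratum answers that are not OK and carry no caveat."""
    contested = [r for r in rows if r.stratum == "contested"]
    if not contested:
        return 0.0
    bad = sum(1 for r in contested if not r.answer_ok and not r.flagged_contest)
    return bad / len(contested)


def guard_accuracy(rows: list[Judged], stratum: str) -> float:
    """Accuracy on a guard stratum; a false contest flag counts as an error."""
    guard = [r for r in rows if r.stratum == stratum]
    if not guard:
        return 0.0
    ok = sum(1 for r in guard if r.answer_ok and not r.flagged_contest)
    return ok / len(guard)


rows = [
    Judged("contested", False, False),        # confidently wrong
    Judged("contested", False, True),         # arbitrary pick, but caveated
    Judged("contested", True, True),
    Judged("contested", True, False),
    Judged("knowledge_update", True, False),
    Judged("knowledge_update", True, True),   # false contest flag: penalized
    Judged("settled", True, False),
]
print(confidently_wrong_rate(rows))               # 0.25
print(guard_accuracy(rows, "knowledge_update"))   # 0.5
print(guard_accuracy(rows, "settled"))            # 1.0
```

Penalizing the false flag inside `guard_accuracy` is the design choice that makes the test two-sided: a treatment arm that caveats everything would drive the primary metric to zero while collapsing both guards.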

**The decisive cell** is the knowledge-update guard: it is where the *density proxy* is predicted to fail (over-flagging change as contest) and where the *valid-time `H¹` score* is predicted to pass. Running both scores in the treatment arm turns this experiment into a *measurement of the proxy's cost* and a *justification for the sheaf*, in one table — a sharp result either way. (A null primary result is also informative: it would say a weak reader cannot exploit a contestedness annotation any more than it could exploit re-ranked evidence — the [answer-shaping §4](/papers/answer-shaping) lesson — which would push the contribution toward stronger readers.)

---

## 6. Why this beats the obvious alternatives

- **vs. "just retrieve more (higher `k`)":** more passages can *contain* the dissent but the reader still has no signal that two of its passages conflict about the *same* fact — it sees two facts, not one contest. [Answer-shaping §4](/papers/answer-shaping) showed exactly this: re-ranking/expanding hands the reader atomic evidence it must parse to notice the conflict; a weak reader does not. Contested retrieval hands it the *conclusion that there is a conflict*, pre-computed.
- **vs. "let the reader detect contradictions in-context":** that re-derives on every query, and unreliably, what `donto_paraconsistency_density` already computed once over the whole corpus. It is also blind to any dissent the retriever omitted.
- **vs. confidence calibration on the reader:** reader confidence is about the *reader's* uncertainty; contestedness is about the *corpus's* disagreement. They are orthogonal — a reader can be confident about a corpus-contested fact, which is precisely the bug.

The contribution is the reframing: **faithfulness is not only "did you ground in what you retrieved" — it is "did you return that the corpus disagrees."** Only a store that kept the disagreement can be held to that standard, and donto is the one that did.

---

## 7. Status: proven / conjectured / speculative

In the seed papers' three-tier candor:

- **Built (solid).** The two preconditions exist and are measured on the live store: the dissent is retained (I3, ~42M statements; conflicting attestations coexist) and a corpus-wide contestedness proxy is computed and standing (`donto_paraconsistency_density`, ~235K `distinct_polarities ≥ 2` cells over ~27.7K subjects, refreshed through 2026-06-12 by `analyzer_paraconsistency.rs`). `sample_statements` and `donto_evidence_link` (~2.57M) supply the dissenting evidence. The annotation layer of §3 is a small, I3-safe read over tables that already exist.
- **Conjectured (open, failure mode named).** The two-sided result of §5: contested retrieval *reduces confidently-wrong answers on contested-fact questions without hurting knowledge-update or settled-fact accuracy*. **Named failure modes:** (1) the density proxy over-flags *change* as *contest* (pooling across valid-times), which would regress the knowledge-update guard; (2) the proxy is junk-dominated at the top (§4a), so without a relevance gate it would caveat the wrong things. The mitigation for both is the valid-time `H¹` score over a recalled subgraph, itself unproven until the Stage-0 sheaf is measured. The experiment is fully specified and runnable today against the gated proxy; the clean version waits on `H¹`. **Not yet wired:** the recall routes do **not** currently read `donto_paraconsistency_density`. This was verified by grep: the table is referenced only by the analytics/standing layer (`analyzer_paraconsistency.rs`, `analyzer_sheaf.rs`, the standing migrations), never by `/recall` or `/search` in `donto` or `donto-memory`. That wiring is the build prerequisite for the experiment.
- **Speculative (flagged).** That contestedness should *flow into answer-shaping's router* — contested-fact questions are a distinct shape that should be routed to "return the contest, do not distil," joining the scalpel/hammer taxonomy as a third class. And that a corpus's contestedness *profile* (where the dissent concentrates, once junk is gated out) is itself a deliverable — a contested-knowledge map — independent of any single query. No claim yet; both are measurements to run once the layer ships.

The honest one-liner: donto already *holds* the disagreement and already *scores* it (coarsely); what is unproven is that *handing the reader the disagreement* buys faithfulness on a weak reader — and the cleanest proof of that needs the sheaf's `H¹` to tell contest from change and to see past the proxy's noise. That is a clean experiment waiting on one build, not a hope.

---

_See also: the contradiction measure this paper consumes, [The Bitemporal Sheaf](/papers/bitemporal-sheaf); the empirical law on when to shape vs. return evidence, [Answer-Shaping](/papers/answer-shaping); the conflicts pairwise edges miss, [the research agenda's cycle-contradictions entry](/papers/research-agenda); the full program in the [donto research agenda](/papers/research-agenda); the benchmark grounding in [the honest scorecard](/reports/memory-benchmarks-scorecard)._
