Contested retrieval: returning disagreement as a first-class result

donto papers · original research · 2026-06-13 · method, working draft v1

Abstract. Every retrieval-augmented system makes a silent commitment the moment it picks a passage: it resolves disagreement on the reader's behalf. When two sources conflict, RAG returns the higher-scored one and discards the other — so the model is handed "X" when the corpus actually says "X, but a second source disagrees." The reader has no way to know it was lied to by omission, and it answers confidently wrong while the faithfulness metric scores it correct against the passage it was given. We propose contested retrieval: a retrieval contract that returns, alongside the ranked answer, a contestedness score and the dissenting evidence — the model receives X (contested: a second source says Y) instead of X. This is not a re-ranking trick layered on a collapsing store; it is only possible on a store that (a) never deleted the dissent (donto invariant I3 — supersede, never overwrite) and (b) already maintains a corpus-wide measure of how contested each fact is. donto has both: ~42M live statements retain every conflicting attestation (donto_statement reltuples ≈ 42.0M, 2026-06-12), and donto_paraconsistency_density already holds ~235K subject/predicate/time-window cells with two or more disagreeing polarities as standing, pre-computed state. The novel claim is a faithfulness contract: contested retrieval should reduce confidently-wrong answers on contested-fact questions without hurting settled-fact or knowledge-update accuracy — because the contestedness annotation is near-zero overhead on settled facts and decisive on disputed ones. We give the mechanism over real tables, the experiment that would establish it, its dependence on the bitemporal sheaf's H¹ for the principled score, and an honest accounting of what is built (the dissent, plus a noisy density proxy) versus what is not (the recall wiring and the controlled experiment).

1. The silent-resolution bug in the RAG contract

The standard RAG loop is: embed the question, retrieve top-k passages by similarity, hand them to the reader. The faithfulness literature then asks whether the reader grounded its answer in the retrieved passages. But there is a prior, unexamined failure: what the retriever chose not to return. If the corpus contains both birthplace = Coen and birthplace = McIvor for the same person and the retriever surfaces only Coen, a perfectly faithful reader produces a confidently wrong "Coen" — faithful to its context, unfaithful to the corpus. The faithfulness metric scores it correct against the passage, and the system has no signal that it just laundered a contested claim into a settled one.

This is structural, not a tuning problem. A store that collapses on conflict (dedup, pick-a-winner, invalidate-on-update) has physically thrown away the dissent, so even a perfect retriever cannot return it — there is nothing to return. The contract can only change on a store that kept the disagreement. That is donto's paraconsistent stance made operational: hold every incompatible claim forever as legal state, then return the disagreement instead of hiding it.

The contested-retrieval contract. Retrieval returns a triple ⟨answer, contestedness, dissent⟩: the ranked best answer, a scalar contestedness ∈ [0,1] for the answered subject/predicate, and the top dissenting attestations with their sources. The reader is instructed to surface contestedness — to answer "McIvor (contested: a second source says Coen)" rather than "McIvor." Settled facts get contestedness ≈ 0 and the contract degrades to ordinary RAG.
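To make the return shape concrete, here is a minimal sketch of the triple as a Python data structure, with a renderer that surfaces the contest. The names `Dissent`, `ContestedResult`, and `render`, and the default threshold `tau = 0.5`, are illustrative assumptions and not part of donto.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dissent:
    """One dissenting attestation and where it came from."""
    value: str
    source: str


@dataclass(frozen=True)
class ContestedResult:
    """The triple <answer, contestedness, dissent> from the contract."""
    answer: str
    contestedness: float  # scalar in [0, 1]
    dissent: tuple[Dissent, ...] = ()

    def __post_init__(self) -> None:
        if not 0.0 <= self.contestedness <= 1.0:
            raise ValueError("contestedness must lie in [0, 1]")


def render(result: ContestedResult, tau: float = 0.5) -> str:
    """Surface the contest above tau; otherwise behave like ordinary RAG.

    tau is a hypothetical threshold; 0.5 is an arbitrary default.
    """
    if result.contestedness <= tau or not result.dissent:
        return result.answer
    clauses = "; ".join(f"{d.source} says {d.value}" for d in result.dissent)
    return f"{result.answer} (contested: {clauses})"


contested = ContestedResult("McIvor", 0.8, (Dissent("Coen", "a second source"),))
settled = ContestedResult("McIvor", 0.0)
print(render(contested))  # McIvor (contested: a second source says Coen)
print(render(settled))    # McIvor
```

With contestedness at or below the threshold, the renderer returns the bare answer, which is the "degrades to ordinary RAG" behavior described above.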

2. Why only donto can make this claim

The contract needs four things at once, which are exactly donto's foundational invariants plus its alignment fabric:

requirement	donto provides it as	verified (2026-06-12)
the dissent still exists to be returned	I3 — no destructive overwrite; conflicting attestations coexist as live rows in `donto_statement`	~42M live statements (`reltuples` estimate)
a per-fact "how contested is this" signal, pre-computed	`donto_paraconsistency_density` (subject, predicate, window, `distinct_polarities`, `distinct_contexts`, `conflict_score`)	~235K rows with `distinct_polarities ≥ 2` over ~27.7K distinct subjects; refreshed `2026-05-15 → 2026-06-12` by `analyzer_paraconsistency.rs`
the dissenting evidence, not just the dissenting claim	`donto_evidence_link` → span → revision → blob	~2.57M evidence links
query-time alignment so "different words for the same predicate" count as one contest	`donto_predicate_closure` (~1.05M rows), `donto_match_aligned`	live

No collapse-on-conflict store has the first row, so it cannot have the third either. A vector DB returns the winner and its neighbors; it has no notion that the loser is evidence of contest. donto's density table is the difference: contestedness is not computed at query time from whatever happened to be retrieved (which would be circular — the retriever's own omission would suppress the signal), it is standing, corpus-wide state maintained ahead of the query. A real example from the live store:

subject       predicate   distinct_polarities  distinct_contexts  conflict_score
ex:ajaxdavis  rdf:type    2                    1125               0.529

ex:ajaxdavis is asserted to be of competing types across 1,125 distinct contexts — a junk-drawer identity any collapsing store would have silently resolved to whichever type indexed first. Contested retrieval would hand the reader both with a high contestedness score: "this entity's type is disputed across the corpus." (Note this row is mid-table, not the most-contested — see §4 for why the top of the proxy is dominated by extraction noise, which is itself the honest limitation of the v1 score.)
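The circularity argument can be shown in a few lines. The toy below is a sketch under invented data: the subject IRI and the retrieval scores are hypothetical, and "number of distinct values" stands in for whatever contest signal a system computes. It is not donto's scoring.

```python
# Toy corpus rows: (subject, predicate, value, retrieval_score).
# The subject IRI and the scores are invented for illustration.
CORPUS = [
    ("ex:person-1", "birthplace", "Coen", 0.91),
    ("ex:person-1", "birthplace", "McIvor", 0.88),
]


def top_k(rows, k):
    """Winner-takes-all retrieval: keep the k highest-scored rows."""
    return sorted(rows, key=lambda r: r[3], reverse=True)[:k]


def distinct_values(rows):
    """A crude contest signal: how many different values are asserted."""
    return len({r[2] for r in rows})


retrieved = top_k(CORPUS, k=1)

# Computed at query time from what the retriever happened to return:
query_time_signal = distinct_values(retrieved)

# Computed ahead of time over the whole corpus (standing state):
standing_signal = distinct_values(CORPUS)

print(query_time_signal, standing_signal)  # 1 2
```

The query-time signal sees one value because the retriever already dropped the other; only the standing signal still sees two.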

3. Mechanism: a thin annotation layer over recall

The design is deliberately small — it is an annotation, not a new retriever — because the heavy lifting (holding dissent, scoring contest) is already done. Over donto's existing recall path:

(1) Recall as today. /recall (holder-scoped) or /search (substrate-wide FTS + hybrid vector) returns ranked statements for the question. This is unchanged; donto's recall is sound (the answer-shaping diagnostics found the right evidence present in every diagnosed category: the gap was the reader applying it, never recall missing it).
(2) Resolve the answered (subject, predicate) under the alignment closure. Fold the retrieved top claim's predicate through donto_predicate_closure so birthPlace, bornIn, placeOfBirth are one contest, not three (no-brittle-logic: this is the alignment engine's job, not a synonym list).
(3) Read the standing contestedness. Look up donto_paraconsistency_density for that folded (subject, predicate) as-of the question's transaction-time. conflict_score (a function of distinct_polarities and distinct_contexts) is the contestedness scalar; sample_statements (the uuid[] the analyzer already stores) points at the dissenting rows.
(4) Attach the dissent. Pull the top dissenting attestations and their donto_evidence_link spans, distinct from the answered one (a different polarity, or a different context).
(5) Return the triple, instruct the reader. Answer + contestedness + dissent spans, with a system instruction: if contestedness exceeds a threshold τ, state the answer AND that it is contested and name the dissent.

I3 is respected throughout — this is pure read; nothing is merged, retracted, or deleted. The whole layer is a Rust addition to the recall route plus one indexed lookup against a table that already exists (donto_paraconsistency_density_score_idx already covers distinct_polarities ≥ 2); it adds one bounded query to a path that already runs sub-second.

question ──▶ recall (unchanged) ──▶ fold (predicate_closure)
                                        │
                          donto_paraconsistency_density[(subj,pred) as-of t]
                                        │
        ⟨ answer , conflict_score , dissent spans (evidence_link) ⟩ ──▶ reader
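The recall, fold, lookup, attach, return path of this section can be sketched end to end with in-memory stand-ins. Everything here is an assumption made for illustration: the dictionaries only imitate the roles of donto_predicate_closure, donto_paraconsistency_density, and donto_evidence_link, their real schemas are not reproduced, the rows are invented, the as-of transaction-time lookup is omitted, and "a different value" is used as a simplified stand-in for "a different polarity or context". The actual layer is described as a Rust addition to the recall route; Python is used here only for brevity.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the predicate closure: alias -> canonical predicate.
PREDICATE_CLOSURE = {"bornIn": "birthPlace", "placeOfBirth": "birthPlace"}

# Hypothetical stand-in for the density table:
# (subject, folded predicate) -> (conflict_score, sample statement ids).
DENSITY = {("ex:person-1", "birthPlace"): (0.8, ["s1", "s2"])}

# Hypothetical stand-in for statements plus evidence spans:
# statement id -> (value, evidence span).
STATEMENTS = {
    "s1": ("McIvor", "register A, p. 3"),
    "s2": ("Coen", "register B, p. 7"),
}


@dataclass(frozen=True)
class Claim:
    """The top claim that ordinary recall already returned."""
    statement_id: str
    subject: str
    predicate: str
    value: str


def fold(predicate: str) -> str:
    """Map a predicate to its canonical form (identity if unknown)."""
    return PREDICATE_CLOSURE.get(predicate, predicate)


def annotate(top: Claim):
    """Return (answer, conflict_score, dissent) for a recalled claim."""
    key = (top.subject, fold(top.predicate))
    score, sample_ids = DENSITY.get(key, (0.0, []))
    dissent = [
        STATEMENTS[sid]
        for sid in sample_ids
        if sid != top.statement_id and STATEMENTS[sid][0] != top.value
    ]
    return top.value, score, dissent


def reader_instruction(score: float, tau: float = 0.5) -> str:
    """Pick the system instruction; tau = 0.5 is an arbitrary default."""
    if score > tau:
        return "State the answer, say that it is contested, and name the dissent."
    return "Answer normally."


answer, score, dissent = annotate(Claim("s1", "ex:person-1", "bornIn", "McIvor"))
print(answer, score, dissent)
print(reader_instruction(score))
```

A subject with no density row falls through to a score of 0.0 and an empty dissent list, so the reader receives the same input it would under ordinary RAG.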

4. The score, and why the principled version needs the sheaf

donto_paraconsistency_density.conflict_score is a proxy, and a deliberately coarse one. It aggregates over a (subject, predicate, time-window) cell — distinct polarities, distinct contexts. That is enough to flag contest and is already maintained corpus-wide, which is why contested retrieval is buildable today. It has two known weaknesses, and we name both:

(a) It is noisy at the top. Sorted by conflict_score, the most-contested cells are not the interesting genealogical disputes but extraction junk — ex:python · rdf:type, ex:step-2 · rdf:type, ex:print-statement · rdf:type (each conflict_score = 1.0 across hundreds of contexts), artifacts of code-extraction minting the same generic IRIs with conflicting types. ~102K of the ~235K ≥2-polarity cells sit at conflict_score = 1.0. A deployable contested-retrieval layer therefore cannot just read the top of the table; it needs a relevance gate (subject is in the recalled answer, and ideally a non-generic entity), or the noise dominates. This is the abundance trade-off showing through: the same firehose that gives us real disputes for free also mints generic junk, and the v1 proxy does not tell them apart.
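A relevance gate of the kind described in (a) might look like the sketch below. The function name `relevance_gate`, the cell layout, and especially the `is_generic` predicate are assumptions: how to recognize a generic IRI is not specified here, so the example passes in a hard-coded set purely as a placeholder.

```python
from typing import Callable, Iterable


def relevance_gate(
    cells: Iterable[dict],
    recalled_subjects: Iterable[str],
    is_generic: Callable[[str], bool],
) -> list[dict]:
    """Keep only density cells whose subject was actually recalled
    and is not judged generic by the supplied predicate."""
    recalled = set(recalled_subjects)
    return [
        c for c in cells
        if c["subject"] in recalled and not is_generic(c["subject"])
    ]


# Cells taken from the examples in the text (scores as quoted there).
cells = [
    {"subject": "ex:python", "predicate": "rdf:type", "conflict_score": 1.0},
    {"subject": "ex:step-2", "predicate": "rdf:type", "conflict_score": 1.0},
    {"subject": "ex:ajaxdavis", "predicate": "rdf:type", "conflict_score": 0.529},
]

# Placeholder only: a real gate would need a principled genericity test.
GENERIC = {"ex:python", "ex:step-2", "ex:print-statement"}

kept = relevance_gate(
    cells,
    recalled_subjects=["ex:ajaxdavis", "ex:python"],
    is_generic=lambda iri: iri in GENERIC,
)
print([c["subject"] for c in kept])  # ['ex:ajaxdavis']
```

The gate filters on the recalled answer first and on genericity second, so a high conflict_score alone never causes a caveat.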

(b) It sees a cell, not the graph — so it conflates change with contest. This is the weakness it shares with all pairwise contradiction machinery (donto_argument, 2,525 edges). It cannot distinguish:

the world changed — residence Cairns@2019 then Brisbane@2023, which is not a contradiction (it is the LongMemEval knowledge-update case donto already handles at 0.923, per the scorecard); from
the sources disagree — birthplace Coen vs McIvor at the same valid-time, which is one.

The density proxy can over-flag the first as contested if it pools across valid-times, which would hurt knowledge-update accuracy: the exact failure mode the claim must avoid. The principled contestedness score is the valid-time H¹ mass of the bitemporal sheaf: non-zero valid-time H¹ is genuine disagreement, while a fact that merely changed over time has H¹ = 0 along the valid axis (Example A of The Bitemporal Sheaf paper). So:

Contested retrieval is the consumer of the sheaf's contradiction measure. The density table is the deployable v1 score (flag now, behind a relevance gate and a valid-time guard); H¹(·, now) is the correct v2 score that distinguishes contest from change and is not fooled by generic-IRI pile-ups, because it measures an obstruction over a recalled subgraph rather than counting contexts in a cell. Contested retrieval is why the sheaf's H¹ needs to exist — it is the retrieval surface that turns a cohomology number into a faithfulness gain.
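The valid-time guard mentioned for the v1 score can be illustrated with a simple interval check. This is not the sheaf's H¹ and makes no claim to approximate it; it is only a sketch of the weaker idea that two differing values count as a contest when their valid-time intervals overlap, and as a change when they do not. The `Attestation` type and the year intervals are invented for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Attestation:
    value: str
    valid_from: int  # inclusive, e.g. a year
    valid_to: int    # exclusive


def overlaps(a: Attestation, b: Attestation) -> bool:
    """True when the two half-open valid-time intervals intersect."""
    return a.valid_from < b.valid_to and b.valid_from < a.valid_to


def is_contested(attestations: list[Attestation]) -> bool:
    """Contest = some pair disagrees on value over overlapping valid-time."""
    return any(
        a.value != b.value and overlaps(a, b)
        for a, b in combinations(attestations, 2)
    )


# The world changed: different values, disjoint valid-times (invented bounds).
residence = [Attestation("Cairns", 2019, 2023), Attestation("Brisbane", 2023, 2027)]

# The sources disagree: different values at the same valid-time.
birthplace = [Attestation("Coen", 1900, 1901), Attestation("McIvor", 1900, 1901)]

print(is_contested(residence))   # False
print(is_contested(birthplace))  # True
```

A pairwise check like this still sees only a cell, so it inherits weakness (b) in every respect except the pooling across valid-times.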

This is the honest dependency: the cleanest version of the experiment below wants Stage-0 H¹, which is in build (packages/donto-sheaf, migrations 0178–0183), not yet a measured analytic.

5. The experiment that would establish the claim

The claim is two-sided and both sides must hold; a one-sided win is not the result.

Hypothesis. Contested retrieval (a) reduces confidently-wrong answers on contested-fact questions, and (b) does not hurt settled-fact or knowledge-update accuracy.

Setup. Free, weak reader+judge (holo3.1) throughout — the same instrument and rationale as answer-shaping: a weak reader makes the substrate's contribution visible as a delta; a strong reader masks it. Two arms, identical retrieval, differing only in the return shape:

control: ordinary RAG — top-k, winner only.
treatment: contested retrieval — ⟨answer, conflict_score, dissent⟩ with the surface-contestedness instruction.

Strata (this is what makes it a clean test rather than an average):

stratum	source	what it tests	predicted direction
contested-fact	genealogy corpus, subjects with `distinct_polarities ≥ 2`, relevance-gated to real entities (e.g. Coen/McIvor birthplace)	does the annotation prevent confidently-wrong answers?	treatment better (lower confidently-wrong rate)
knowledge-update	LongMemEval knowledge-update (current 0.923)	does the changed-not-contested case get falsely flagged?	treatment = control (must not regress)
settled-fact	LongMemEval/LoCoMo single-hop, `distinct_polarities = 1`	overhead-only cost on uncontested facts	treatment = control

Primary metric: the confidently-wrong rate on the contested-fact stratum — answers stated without a contest caveat that are wrong or arbitrary tie-breaks — not raw accuracy (a contested question often has no single right answer; the right behavior is to surface the contest). Guard metrics: accuracy on knowledge-update and settled-fact strata, which must not drop. A judge rubric that credits "states the answer and flags the contest" as correct on contested-fact, and penalizes a false contest-flag on settled/knowledge-update, operationalizes both sides.
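The primary and guard metrics can be written down directly. The sketch below assumes the judge emits one record per answer with a verdict label and a flag for whether the answer carried a contest caveat; the `Judged` record, the label names, and the sample rows are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judged:
    stratum: str           # "contested-fact" | "knowledge-update" | "settled-fact"
    verdict: str           # judge label: "correct" | "wrong" | "arbitrary"
    flagged_contest: bool  # did the answer carry a contest caveat?


def confidently_wrong_rate(rows: list[Judged]) -> float:
    """Share of contested-fact answers given without a caveat that are
    wrong or arbitrary tie-breaks."""
    contested = [r for r in rows if r.stratum == "contested-fact"]
    if not contested:
        raise ValueError("no contested-fact rows")
    bad = [
        r for r in contested
        if not r.flagged_contest and r.verdict in ("wrong", "arbitrary")
    ]
    return len(bad) / len(contested)


def guard_accuracy(rows: list[Judged], stratum: str) -> float:
    """Accuracy on a guard stratum, counting a false contest flag as a miss."""
    guard = [r for r in rows if r.stratum == stratum]
    if not guard:
        raise ValueError(f"no rows for stratum {stratum!r}")
    ok = [r for r in guard if r.verdict == "correct" and not r.flagged_contest]
    return len(ok) / len(guard)


# Invented judge output, for illustration only.
rows = [
    Judged("contested-fact", "wrong", False),
    Judged("contested-fact", "arbitrary", False),
    Judged("contested-fact", "correct", True),
    Judged("contested-fact", "correct", True),
    Judged("knowledge-update", "correct", False),
    Judged("knowledge-update", "correct", True),  # false contest flag
]

print(confidently_wrong_rate(rows))              # 0.5
print(guard_accuracy(rows, "knowledge-update"))  # 0.5
```

Penalizing a false contest flag inside the guard metric is what keeps the result two-sided: a treatment arm that caveats everything would score well on the primary metric and fail the guards.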

The decisive cell is the knowledge-update guard: it is where the density proxy is predicted to fail (over-flagging change as contest) and where the valid-time H¹ score is predicted to pass. Running both scores in the treatment arm turns this experiment into a measurement of the proxy's cost and a justification for the sheaf, in one table — a sharp result either way. (A null primary result is also informative: it would say a weak reader cannot exploit a contestedness annotation any more than it could exploit re-ranked evidence — the answer-shaping §4 lesson — which would push the contribution toward stronger readers.)

6. Why this beats the obvious alternatives

vs. "just retrieve more (higher k)": more passages can contain the dissent but the reader still has no signal that two of its passages conflict about the same fact — it sees two facts, not one contest. Answer-shaping §4 showed exactly this: re-ranking/expanding hands the reader atomic evidence it must parse to notice the conflict; a weak reader does not. Contested retrieval hands it the conclusion that there is a conflict, pre-computed.
vs. "let the reader detect contradictions in-context": that re-derives, every query and unreliably, what donto_paraconsistency_density already computed once over the whole corpus — and it is blind to dissent the retriever omitted.
vs. confidence calibration on the reader: reader confidence is about the reader's uncertainty; contestedness is about the corpus's disagreement. They are orthogonal — a reader can be confident about a corpus-contested fact, which is precisely the bug.

The contribution is the reframing: faithfulness is not only "did you ground in what you retrieved" — it is "did you return that the corpus disagrees." Only a store that kept the disagreement can be held to that standard, and donto is the one that did.

7. Status: proven / conjectured / speculative

In the seed papers' three-tier candor:

Built (solid). The two preconditions exist and are measured on the live store: the dissent is retained (I3, ~42M statements; conflicting attestations coexist) and a corpus-wide contestedness proxy is computed and standing (donto_paraconsistency_density, ~235K distinct_polarities ≥ 2 cells over ~27.7K subjects, refreshed through 2026-06-12 by analyzer_paraconsistency.rs). sample_statements and donto_evidence_link (~2.57M) supply the dissenting evidence. The annotation layer of §3 is a small, I3-safe read over tables that already exist.
Conjectured (open, failure mode named). The two-sided result of §5 — contested retrieval reduces confidently-wrong on contested-fact without hurting knowledge-update/settled accuracy. Named failure modes: (1) the density proxy over-flags change as contest (pooling across valid-times), which would regress the knowledge-update guard; (2) the proxy is junk-dominated at the top (§4a), so without a relevance gate it would caveat the wrong things. Mitigation for both is the valid-time H¹ score over a recalled subgraph — itself unproven until Stage-0 sheaf is measured. The experiment is fully specified and runnable today against the gated proxy; the clean version waits on H¹. Not yet wired: the recall routes do not currently read donto_paraconsistency_density — verified by grep: it is referenced only by the analytics/standing layer (analyzer_paraconsistency.rs, analyzer_sheaf.rs, the standing migrations), never by /recall or /search in donto or donto-memory. That wiring is the build prerequisite for the experiment.
Speculative (flagged). That contestedness should flow into answer-shaping's router — contested-fact questions are a distinct shape that should be routed to "return the contest, do not distil," joining the scalpel/hammer taxonomy as a third class. And that a corpus's contestedness profile (where the dissent concentrates, once junk is gated out) is itself a deliverable — a contested-knowledge map — independent of any single query. No claim yet; both are measurements to run once the layer ships.

The honest one-liner: donto already holds the disagreement and already scores it (coarsely); what is unproven is that handing the reader the disagreement buys faithfulness on a weak reader — and the cleanest proof of that needs the sheaf's H¹ to tell contest from change and to see past the proxy's noise. That is a clean experiment waiting on one build, not a hope.

See also: the contradiction measure this paper consumes, The Bitemporal Sheaf; the empirical law on when to shape vs. return evidence, Answer-Shaping; the conflicts pairwise edges miss, the research agenda's cycle-contradictions entry; the full program in the donto research agenda; the benchmark grounding in the honest scorecard.