How donto works, one document at a time
donto is a bitemporal, paraconsistent, evidence-first claim substrate. Generating typed knowledge used to be the scarce step; now an LLM emits an unbounded firehose of claims for ~$0.0001 each. donto's job is to hold that firehose without throwing most of it away — keeping contradictions, anchoring every claim to its source, and re-ranking by reality over time. Below, we follow a single source document through all ten stages, end to end, with nothing skipped.
Source document in → answer with provenance out
Ten stages. Each one is a real, running part of the substrate — extraction and anchoring are deliberately separate; typing, joining and identity resolution are all deferred to query time. Jump to any stage:
- 01 Ingest the source: blob · document · revision
- 02 Extract free claims: generative abundance
- 03 Anchor the evidence: always-on citer
- 04 Hold it, paraconsistently: bitemporal · I3
- 05 Embed into the fabric: bge-small · 384d
- 06 Align at query time: closure · identity-as-hypothesis
- 07 Measure contradiction: H⁰ / H¹ · cocycles
- 08 Rank by reality: ⟨maturity, corroboration, pressure, recency⟩
- 09 Recall + reconcile: FTS + vector + RRF
- 10 Return with provenance: fact → span → blob
Ingest the source
A source document arrives. donto content-addresses it (SHA-256), stores the bytes as a GCS-backed blob, and records a document + revision. Nothing is summarised-and-discarded — the raw resource is kept so every fact can point back to it.
| stored object | what it is |
|---|---|
| blob | content-addressed bytes, deduped, GCS-backed |
| document | the resource (searchable body + metadata) |
| revision | a point-in-time version of the document |
| evidence_link | later: fact → span inside this revision |
Rule: never summarise-and-discard. An un-stored source is a bug.
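A minimal sketch of the ingest step, assuming an in-memory dictionary in place of the GCS-backed blob store. The `Store` class, its `ingest` method and the sample sentence are hypothetical illustrations, not donto's API; only the SHA-256 content addressing and the blob / document / revision split come from the description above.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Store:
    # Hypothetical in-memory stand-in for the GCS-backed blob store.
    blobs: dict = field(default_factory=dict)      # sha256 hex -> raw bytes
    revisions: dict = field(default_factory=dict)  # document id -> [blob hashes]

    def ingest(self, doc_id: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Content addressing: identical bytes are stored once.
        self.blobs.setdefault(digest, data)
        revs = self.revisions.setdefault(doc_id, [])
        # Record a new revision only when the content actually changed.
        if not revs or revs[-1] != digest:
            revs.append(digest)
        return digest


store = Store()
h1 = store.ingest("doc:letter-1", b"Tommy was removed to Yarrabah.")
h2 = store.ingest("doc:letter-1-copy", b"Tommy was removed to Yarrabah.")
```

Two documents with identical bytes share one blob, while each keeps its own revision history.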
Extract free claims
A guided LLM reads the source and emits an unbounded, multi-directional space of subject–predicate–object claims, inventing predicates as it goes. A gleaning loop pushes for coverage, not a count. Typing and joining are deferred — emit free now.
1,084,128 distinct predicates — the signature of abundance, not a schema to maintain.
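One way to picture a gleaning loop that pushes for coverage rather than a count: keep asking the extractor for claims it has not yet produced, and stop only when a pass yields nothing new. This is a sketch under assumptions; `glean`, `fake_extract` and the `max_passes` safety cap are hypothetical, and `fake_extract` is a stub standing in for the guided LLM.

```python
from typing import Callable, Iterable, Set, Tuple

Claim = Tuple[str, str, str]  # (subject, predicate, object)


def glean(source: str,
          extract: Callable[[str, Set[Claim]], Iterable[Claim]],
          max_passes: int = 5) -> Set[Claim]:
    claims: Set[Claim] = set()
    for _ in range(max_passes):
        # Show the extractor what we already have and ask for more.
        new = set(extract(source, claims)) - claims
        if not new:  # coverage reached: the pass added nothing
            break
        claims |= new
    return claims


def fake_extract(source: str, seen: Set[Claim]):
    # Stub for the LLM: returns at most one unseen claim per pass.
    pool = [
        ("ex:tommy", "removedTo", "ex:yarrabah"),
        ("ex:tommy", "hasRemovalOrderTo", "ex:yarrabah"),
    ]
    return [c for c in pool if c not in seen][:1]


claims = glean("sample source text", fake_extract)
```

The loop has no target count: it ends when the extractor stops finding new claims, or at the cap.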
Anchor the evidence
A separate, always-on citer attaches each claim to an evidence span in the source — or honestly flags it as an unanchorable hypothesis (never a bogus span). Extraction says WHAT; anchoring says WHERE. This is what makes donto evidence-first.
Evidence links so far: 3.2M and counting.
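The anchor-or-flag rule can be sketched as follows, assuming the citer is given a candidate quote and looks for it verbatim in the source. Exact substring matching is a simplification chosen for the example, and the `Anchor` type and `anchor` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Anchor:
    start: Optional[int]
    end: Optional[int]
    status: str  # "anchored" or "hypothesis"


def anchor(source: str, quote: str) -> Anchor:
    idx = source.find(quote) if quote else -1
    if idx < 0:
        # No supporting span found: flag it, never invent offsets.
        return Anchor(None, None, "hypothesis")
    return Anchor(idx, idx + len(quote), "anchored")


text = "Tommy was removed to Yarrabah."  # illustrative sentence
found = anchor(text, "removed to Yarrabah")
missing = anchor(text, "was born in Coen")
```

A claim whose quote is not in the source comes back with no offsets and the status `hypothesis`.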
Hold it — paraconsistently
Claims are written to donto_statement: bitemporal (valid-time + transaction-time) and paraconsistent. Contradictions are legal state held forever — never overwritten (invariant I3: retract/supersede, never DELETE). A vector DB collapses; donto holds.
- valid time: when the claim is true in the world
- transaction time: when donto came to believe it
- belief A (kept): ex:tommy removedTo ex:yarrabah
- belief B (kept): ex:tommy hasRemovalOrderTo ex:yarrabah
Both beliefs are kept. 6,487,078 contested subject–predicate pairs live side by side — invariant I3: retract or supersede, never DELETE.
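To make invariant I3 concrete, here is a small append-only ledger in which retraction closes a statement's transaction-time interval instead of removing the row. The `Statement` and `Ledger` classes and their field names are assumptions for illustration; the actual `donto_statement` schema is not shown in this document.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Statement:
    subject: str
    predicate: str
    obj: str
    valid_from: Optional[str]          # valid time: when it holds in the world
    tx_from: datetime                  # transaction time: when it was believed
    tx_to: Optional[datetime] = None   # set on retract; the row stays


class Ledger:
    def __init__(self) -> None:
        self.rows: List[Statement] = []

    def assert_(self, s: str, p: str, o: str,
                valid_from: Optional[str] = None) -> Statement:
        row = Statement(s, p, o, valid_from, datetime.now(timezone.utc))
        self.rows.append(row)  # contradictions are simply appended
        return row

    def retract(self, row: Statement) -> None:
        if row.tx_to is None:
            row.tx_to = datetime.now(timezone.utc)  # close, never delete

    def believed(self) -> List[Statement]:
        return [r for r in self.rows if r.tx_to is None]


ledger = Ledger()
a = ledger.assert_("ex:tommy", "removedTo", "ex:yarrabah")
b = ledger.assert_("ex:tommy", "hasRemovalOrderTo", "ex:yarrabah")
both_held = len(ledger.believed())
ledger.retract(a)
```

After the retraction the ledger still holds two rows; only the set of currently believed statements shrinks.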
Embed into the fabric
Claims, predicates and entities are embedded (bge-small, 384-dim, HNSW) by a distributed coordinator that hands out disjoint work, so CPU and GPU workers scale linearly and never double-embed. This powers semantic recall and alignment.
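The "disjoint work" guarantee can be illustrated with a coordinator that hands each worker a batch under a lock, so no item is ever claimed twice. This is a single-process sketch using threads; the `Coordinator` class and `worker` function are hypothetical, and the embedding call is replaced by a list append.

```python
import threading
from typing import Iterable, List


class Coordinator:
    def __init__(self, ids: Iterable[int]) -> None:
        self._pending: List[int] = list(ids)
        self._lock = threading.Lock()

    def claim(self, n: int) -> List[int]:
        # Atomically remove a batch so no other worker can receive it.
        with self._lock:
            batch, self._pending = self._pending[:n], self._pending[n:]
            return batch


def worker(coord: Coordinator, out: List[int], batch_size: int = 3) -> None:
    while True:
        batch = coord.claim(batch_size)
        if not batch:
            return
        out.extend(batch)  # stand-in for embedding the batch


coord = Coordinator(range(20))
results: List[List[int]] = [[] for _ in range(4)]
threads = [threading.Thread(target=worker, args=(coord, r)) for r in results]
for t in threads:
    t.start()
for t in threads:
    t.join()
done = [i for r in results for i in r]
```

Every id is processed exactly once, however the batches end up split among the workers.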
Align at query time
Freely-minted predicates are folded by similarity (donto_predicate_closure): occupation ↔ currentJob become comparable without a hand-kept synonym table. Entity identity is a hypothesis resolved at query time, not a destructive merge at write time.
occupation ─┐
            ├─ closure fold (cos ≥ τ) ─▶ one fact
currentJob ─┘   no synonym table, learned
ex:coen ≈ ex:coen-qld   identity = hypothesis,
ex:coen ≠ ex:mcivor     resolved at query time
Predicates are folded by similarity (donto_predicate_closure); entity identity stays a hypothesis until you ask, never a destructive write-time merge.
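A closure fold by cosine threshold might look like the sketch below, which groups predicates whose embeddings have similarity at least τ. The three-dimensional vectors are made-up toy values (the real embeddings are 384-dimensional), and taking the transitive closure with union-find is an assumption about how the fold is formed.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fold(vectors: Dict[str, List[float]], tau: float) -> Dict[str, str]:
    parent = {p: p for p in vectors}

    def find(p: str) -> str:
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    names = sorted(vectors)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= tau:
                parent[find(b)] = find(a)  # merge the two classes
    return {p: find(p) for p in vectors}


toy = {
    "occupation": [0.9, 0.1, 0.0],
    "currentJob": [0.85, 0.2, 0.05],
    "birthPlace": [0.0, 0.1, 0.95],
}
classes = fold(toy, tau=0.9)
```

With these toy vectors `occupation` and `currentJob` land in the same class, while `birthPlace` stays on its own.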
Measure contradiction
A cellular sheaf reads the argument, incompatibility and identity layers and computes H⁰ (the consistent core) and H¹ (irreducible disagreement) — including loop conflicts (cocycles) that pairwise edges literally cannot see. Pressure flows to standing.
ex:tommy ──hasRemovalOrderTo──▶ ex:yarrabah
│ ▲
└──────────── removedTo ───────────┘
loop residual ‖δx‖ = 2√2 ≈ 2.828 (cocycle)
Two claims about ex:tommy → ex:yarrabah are each locally fine but disagree around the loop, a contradiction the pairwise argument layer cannot see. Total contradiction pressure across the substrate: 8.0M.
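Why a loop can hide a conflict that no single edge shows is easiest to see in a scalar toy: each edge asks for a fixed offset between two node values, and a global assignment exists only if the offsets sum to zero around every closed loop. The sketch below is that cycle-consistency check only. It is not donto's cellular sheaf and does not reproduce the 2√2 figure; the function name and edge values are hypothetical.

```python
from typing import List, Tuple

Edge = Tuple[str, str, float]  # (u, v, required value of x_v - x_u)


def loop_residual(cycle: List[Edge]) -> float:
    """Sum of required offsets around a closed loop of edges."""
    for (_, v, _), (u_next, _, _) in zip(cycle, cycle[1:] + cycle[:1]):
        if v != u_next:
            raise ValueError("edges do not form a closed loop")
    return sum(w for _, _, w in cycle)


# Each edge is satisfiable on its own in both loops.
consistent = [("a", "b", 1.0), ("b", "c", 2.0), ("c", "a", -3.0)]
conflicted = [("a", "b", 1.0), ("b", "c", 1.0), ("c", "a", 1.0)]

ok = loop_residual(consistent)
bad = loop_residual(conflicted)
```

A zero residual means some assignment satisfies every edge at once; a non-zero residual means none does, even though every individual edge looks fine.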
Rank by reality
Every claim carries a standing vector. donto re-ranks by reality over time instead of deleting on conflict — corroborated, low-pressure, recent claims rise; contested ones are surfaced, not silently dropped.
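As a rough illustration of ranking on a standing vector, the sketch below scores each claim with a weighted sum in which pressure counts against it, and returns every claim with a contested flag rather than dropping any. The weights, the linear scoring rule and the numeric values are all assumptions; the document names the four components but not how they are combined.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Standing:
    maturity: float
    corroboration: float
    pressure: float
    recency: float


# Hypothetical weights: pressure lowers the score, the rest raise it.
WEIGHTS = Standing(maturity=1.0, corroboration=1.0, pressure=-1.0, recency=0.5)


def score(s: Standing) -> float:
    return (WEIGHTS.maturity * s.maturity
            + WEIGHTS.corroboration * s.corroboration
            + WEIGHTS.pressure * s.pressure
            + WEIGHTS.recency * s.recency)


def rank(claims: Dict[str, Standing]) -> List[Tuple[str, bool]]:
    ordered = sorted(claims, key=lambda c: score(claims[c]), reverse=True)
    # Contested claims stay in the result, tagged instead of removed.
    return [(c, claims[c].pressure > 0) for c in ordered]


claims = {
    "claim:A": Standing(0.8, 0.9, 0.0, 0.5),
    "claim:B": Standing(0.8, 0.9, 1.5, 0.5),
}
ranked = rank(claims)
```

Both claims are returned; the one under pressure ranks lower and carries the contested flag.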
Recall + reconcile
A question triggers hybrid recall (lexical FTS + semantic vector, fused with RRF). The results are closure-folded, then reconciled by a query-time sheaf pass that tags contested claims. The response is token-efficient, abstention-aware, and scoped to the asker.
| arm | catches |
|---|---|
| lexical FTS | exact terms, names, IDs |
| vector ANN | meaning: paraphrases that lexical search misses |
| RRF | fuses both into one ranking |
| reconcile | tags contested claims, re-rank-neutral |
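Reciprocal rank fusion itself is short enough to show in full. Each item scores the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is the value commonly used for RRF and is an assumption here, as are the claim ids; donto's own setting is not stated.

```python
from typing import Dict, List


def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


lexical = ["c1", "c2", "c3"]  # hypothetical FTS results, best first
vector = ["c3", "c1", "c4"]   # hypothetical ANN results, best first
fused = rrf([lexical, vector])
```

Items found by both arms (`c1`, `c3`) rise above items found by only one, which is the point of fusing lexical and semantic recall.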
Return with provenance
The answer comes back WITH its trail: each fact links to its evidence span, the span to its revision, the revision to the original blob — plus contradiction tags. Fully auditable: you can always get back to the source sentence.
| family | contexts |
|---|---|
| claims | 4,624 |
| books | 363 |
| budget-smoke | 7 |
| anchor-test | 3 |
| cross | 2 |
| agent | 1 |
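The fact → span → revision → blob trail can be walked with three lookups. The sketch below uses in-memory dictionaries with made-up ids, a made-up sentence and character offsets on the decoded text; all of these are assumptions for illustration, not donto's storage layout.

```python
from typing import Dict

# Hypothetical stored objects, keyed by illustrative ids.
blobs: Dict[str, bytes] = {
    "blob:example": b"Tommy was removed to Yarrabah.",
}
revisions: Dict[str, dict] = {
    "rev:1": {"document": "doc:letter-1", "blob": "blob:example"},
}
evidence_links: Dict[str, dict] = {
    "fact:1": {"revision": "rev:1", "start": 10, "end": 29},
}


def trace(fact_id: str) -> dict:
    link = evidence_links[fact_id]    # fact -> span
    rev = revisions[link["revision"]]  # span -> revision
    text = blobs[rev["blob"]].decode("utf-8")  # revision -> blob
    return {
        "fact": fact_id,
        "span": text[link["start"]:link["end"]],
        "revision": link["revision"],
        "document": rev["document"],
        "blob": rev["blob"],
    }


trail = trace("fact:1")
```

The returned trail carries the exact source text behind the fact along with the ids needed to fetch the original bytes again.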
The live substrate, right now
| metric | value | what it means |
|---|---|---|
| currently believed statements | 42,430,187 | live claim state |
| distinct predicates | 1,084,128 | freely minted |
| distinct contexts | 77,989 | scopes / provenance |
| evidence links | 3,217,207 | fact → span |
| contested subject–predicate pairs | 6,487,078 | held, not resolved |
| total contradiction pressure | 7,999,558 | Σ ‖δx‖ |
| argument edges | 3,085 | typed support / attack |
| identity hypotheses | 169 | same-referent, query-time |
| consumer namespaces | 353 | domains sharing one instance |
| retracted statements | 365 | superseded, never deleted |
Claim volume grows in all directions; typing & joining happen at query time.
Live from dontosrv /discovery — refreshed every few minutes.