Reports

Working research notes from building donto — extraction engineering, benchmark studies, substrate design. Living documents, updated as the work progresses.

Last updated: 2026-07-02.

report	what it covers
Hyades/holo3.1: the empty `tool_calls.arguments` bug — a gateway diagnostic	A full technical diagnostic for the Hyades gateway operator: ~1 in 3 structured calls to holo3.1 returns HTTP 200 with a well-formed tool-call envelope whose `arguments` is empty — while `usage` bills 600–1,000+ completion tokens. Production-scale measurement (1,064 calls / 204 runs in a day: 43% broad-pass vs 30% nudge empties, 2× input-size correlation, flat by hour, load-independent), a serial 8-call probe reproducing at 37.5% with raw captured bodies (success vs failure side-by-side), everything ruled out (schema, budgets, load, tool_choice form, transport), ranked harness hypotheses (tool-call parser drop is #1), the server-side fix wishlist, the client-side mitigations already shipped (dominated-rung ladder fix: worst-case slot burn halved), and a copy-paste repro. Fixing it recovers ~⅓ of the lane's effective throughput
📄 donto — a contradiction-preserving substrate for knowledge (LaTeX PDF treatise)	The consolidated, citable account of the two-day run as a 6-page PDF. What donto is (bitemporal, paraconsistent, evidence-first claim substrate for the age of generative abundance); the instrument (reader-ladder; the definitive full 1,539-Q LoCoMo at 53.3%, reader-bound); what was built (full LongMemEval claim layer; the claim + evidence-span recall unit, validated and deployed to `/recall`; the 19h-runaway kill + disk save); the self-correction (cross-reader check refuted "claims beat the dump" — robust wins are token-efficiency 3.5–4.8× + claims+spans > bare-claims; the accuracy win needs the non-fitting regime, an honest tie in the proxy); and a frank answer to "can donto be the future of AI and knowledge?" — what's warranted vs the larger epistemic-OS bet. Every number measured, not asserted.
Claim + evidence-span recall — the LongMemEval pivot, and an honest LoCoMo ceiling	The definitive full run + the substrate change it produced. Full LoCoMo, all 1,539 questions, cheap `flash-lite` reader = 53.3% (single-hop 0.73 — the reader reads; inference categories 0.27–0.34 — the reader refuses to infer), framed against the reader-ladder (same dump: flash-lite 45% → deepseek 75% → gpt-5.4 86%): LoCoMo is reader/benchmark-capped, not donto-capped, and since its conversations fit in context the dump saturates retrieval so donto can't shine there either way. The pivot to LongMemEval: the right recall unit is the claim + its evidence span (lifts accuracy on both readers; atomized triples hurt synthesis, the anchored snippet restores it; `valid_time` was not the bottleneck — an honest negative) — but a cross-reader check corrected the headline: claims+spans beat the dump on temporal only with a weak reader; a strong reader (GLM 1.00) wins on a dump that fits. The reader-independent win is token-efficiency (3.5–4.8× fewer); the accuracy win needs the non-fitting `_m` split. Honest science over a clean story. Now deployed: donto-memory `/recall` ships an `agent-claims` arm (`2535ab7`) surfacing `ctx:claims/<corpus>/<chunk>` claims+spans, verified live. Plus the ops: a 19h runaway query killed, disk 99%→39% + cron, the GIN-insert perf wall.
Every LoCoMo failure, and why it fails	A per-question post-mortem of donto's best LoCoMo run (turn-recall depth 100, 75.8%). All 48 failures classified by root cause: ~60% reader/judge-bound (the free reader won't infer, or the judge rejected a correct relative-date answer — true accuracy is higher), ~40% recall-bound (donto's lever). Key actionable pattern: multi-hop failures are almost all list/aggregation questions (reader returns a subset) → pre-joined aggregate claims are the next experiment. Honest map of which levers are donto's and which are the reader's or grader's.
What moved LoCoMo: the claim layer — and where the sheaf comes in	An honest accounting of how the recent substrate program changed the LoCoMo benchmark. Semantic claim recall — enabled by taking claim-embedding coverage to 100% — lifted claims-only LoCoMo 2.4× (15.6% → 37.5%) at fewer tokens (913 vs 1,435), and the bitemporal claim layer drove temporal accuracy 18% → 72%. Claims-only answers at ~1/21 the context of episodic stuffing (913 vs 19,029 tok) — the scaling story. Reported faithfully: the sheaf contradiction-math is not yet in the LoCoMo recall path and did not produce these numbers; it's the teed-up next lever (contradiction-pressure-aware reranking for knowledge-update; claim-graph bridging for multi-hop). Plus the no-shortcuts road to 100%
GPU acceleration for donto — a brief for an implementing friend	A practical hand-off for someone with a GPU. Two workloads: (1) turnkey embedding backfill — ~42M claims need `bge-small` (384-dim) vectors, ~2–4/sec on our CPU vs hundreds–thousands/sec on GPU; a ready-made distributed worker (`github thomasdavis/donto-embed-worker`, `donto.org/embed` queue, `fastembed-gpu`+`EMBED_CUDA=1`) means plug-and-play, no donto internals; (2) sheaf NSD training — GPU sweeps over exported subgraphs (JSON in / checkpoints out, no DB access), parity-gated into Rust inference. Includes hardware/cost guidance + a one-paragraph version to forward
Is the Internet an extension of human memory?	A reading of a note the author wrote in 2011 — catching himself storing search actions instead of facts. It independently described cognitive offloading (Sparrow's "Google Effect," named that same year) and transactive memory, named the still-open creative cost (an index has coverage but no adjacency, and adjacency is where epiphany happens — "instead of a novel, I am a dictionary"), and proposed — almost verbatim — donto's storage contract: store the idea first, then the path to the source. The throughline: the 2011 personal dilemma is the exact design problem donto solves at the substrate level (evidence-anchored claims = idea + path; the episodic-vs-claims benchmark tension; the Lens/sheaf "epiphany over held premises")
Sheaf neural networks for donto	Research report on sheaf neural networks (cellular-sheaf GNNs) — arguably the missing mathematics for a contradiction-preserving substrate. Cellular sheaves put a vector space (stalk) on each node + a learned restriction map on each edge; their cohomology computably measures whether local views glue into a consistent whole (`H⁰`) and where they can't (`H¹`). Maps line-for-line onto donto: claims = stalks, query-time alignment/identity = restriction maps, paraconsistency = non-zero `H¹`, contradiction-pressure = localized disagreement norm, multi-source fusion + contested-source reasoning = sheaf data-fusion / discourse-sheaf opinion dynamics. Built to handle heterophily + oversmoothing — the exact failures donto's claim graph triggers. Includes concrete build proposals + honest compute caveats
Memory benchmarks — donto's honest scorecard	A multi-day free-reader synthesis across LoCoMo + LongMemEval. The pattern: donto's recall does its job everywhere; the accuracy limits are the reader and query-time routing, not retrieval. Demonstrated wins — token-efficiency (claims+aggregates = 86% of episodic accuracy at 3.8× fewer tokens), knowledge-update out-of-box (0.923, bitemporal baked into the dated representation), and targeted answer-shaping (a distilled-preference facet lifts single-session-preference 0.70→0.767). The honest limit: answer-shaping is a scalpel (helps focused-fact questions, hurts synthesis), and single-hop is a routing gap — both reader/router-bound, not retrieval
The Seven Sisters across 32 cultures	A new example consumer: the Pleiades / Seven Sisters story (Greek, Aboriginal Australian, Native American, Polynesian, Andean, African, Near Eastern + the deep-time "oldest story" debate) saved to disk and extracted into donto as evidence-anchored claims via the `donto-agent` GLM lane. Covers the engineering (opencode 0-facts → donto-agent; chunking ~doubles fact density; the 429-throttle + pacing; the 0-fact-≠-done fix) and the substrate findings (~11k claims, ~3k invented predicates, the pursuit motif at 246 stmts, the lost-Pleiad puzzle at 114, agricultural-calendar convergence, identity-as-hypothesis)
The 48-hour build, read closely	A verification pass over the change report: the 48-hour work is real and tested (22 migrations `0156–0177`, each a SQL capability + invariants test), but almost entirely unexercised — governance/horizon tables hold 1–16 demo rows against a ~41.5M-statement store (analogy 1, tournament 2, reputation 4; only `rule_agenda` at 12.7k is under real load). Reads it together with the LoCoMo results to name the selection function: answer-shaping facts convert (valid_time +45%, standing, closure, aggregates), retrieval doesn't — so the benchmark is what makes the skeleton load-bearing, and the aggregate claim is the next thing to build
Past 48 Hours Change Report	Full audit of the committed work from 2026-06-10 10:14 UTC to 2026-06-12 10:14 UTC across `donto` and `donto-web`: substrate execution-plan waves, public docs, reports, benchmark analysis, route fixes, why each class of change was made, and what should come next
Future Substrate Report	What donto is still missing to become a century-scale knowledge and memory substrate for AI systems: durable identity, deeper source objects, semantic stability, memory lifecycle, governance, cryptographic trust, federation, multimodal evidence, scale, and model integration
Single-shot vs agentic, one-call vs multi-scope	What the benchmarks measure, corrected from Zep's page: their 94.7% is single-read with multi-scope retrieval (5 composed searches + rerank, 5,760 tok); the plain single-call auto-search is 86.5% (2,680 tok). Two axes — retrieval composition vs reader agency — and which Zep number is donto's fair target
The shape of donto's return	The full schema of the memory bundle donto hands an agent — every field (`subject·predicate·object`, the temporal `text` tag, `source`), where it comes from in the bitemporal claim graph, and the measured reason each one earned its tokens. Substrate is maximal; the return is minimal
LoCoMo Config C — claims-only recall	Throwing away the dialogue and answering from the claim layer alone. Clean result: context collapses ~11× (19,029→1,690 tok) but so does accuracy — 0.244 vs episodic 0.837. An honest negative: the bottleneck is claim recall (worst on multi-hop), not claim existence. Next: semantic claim recall
Are we using Hyades effectively?	Review of how donto-agent drives the Hyades gateway for claim extraction; the token-budget finding from tuning GLM (40→520 facts/chunk); a model bake-off (incl. the streaming-mode 524 fix) + experiment plan

Full archive

All research reports (migrated from the former genes.apexpots.com/research).