Sheaf neural networks — the missing mathematics for a contradiction-preserving substrate
Living document. 2026-06-13 (iteration 5 — refined across five passes; recent references independently verified). A research report on sheaf neural networks (cellular-sheaf-based GNNs) and why they may be the natural mathematical framework for donto: a bitemporal, paraconsistent, evidence-first CLAIM substrate. Researched from the primary literature; the donto mapping is the payload. (Note: the term is sheaf — from algebraic topology — not "sheath".)
One-paragraph thesis. A standard graph says only that two things are connected. A cellular sheaf says how — it puts a vector space (a "stalk") on every node and edge and a linear "restriction map" on every node→edge incidence that translates one local view into the shared language of the connection. That is exactly donto's world: every claim/entity carries its own local representation; edges between them are typed (argument, identity-as-hypothesis, alignment); and crucially connected things often disagree. Sheaf theory was built to handle disagreement formally: its cohomology measures, computably, whether a set of local views can be glued into one consistent global picture — and where they can't. Standard GNNs assume the trivial sheaf (everything lives in one shared space, neighbours should agree) and so collapse on heterophily and oversmooth — the precise failure modes donto's claim graph would trigger. Sheaf neural networks remove that assumption. They give donto a principled, learnable operator that reconciles where claims are consistent and localizes contradiction where they aren't — without ever deleting either side. That is donto's entire philosophy, expressed as linear algebra over a graph.
1. What a cellular sheaf is (the 90-second version)
Take a graph G = (V, E). A cellular sheaf F attaches:
- a vector space F(v), the stalk, to each node v (its local feature space / "private opinion");
- a vector space F(e) to each edge e, a shared discourse space for that connection;
- a linear restriction map F_{v⊴e} : F(v) → F(e) for each incident node–edge pair, describing how v's local view appears in the shared space of edge e.
A standard graph is the special case where every stalk is the same space and every restriction map is the identity (the trivial sheaf). Sheaf theory is what you get when you stop assuming that.
Global sections & cohomology. A global section is an assignment of a vector to every node that all the restriction maps agree on across every edge — a globally consistent state. The space of these is the 0-th cohomology H⁰(F). The 1-st cohomology H¹(F) measures the obstruction: non-zero H¹ means no globally consistent assignment exists — the data is genuinely contradictory, and H¹ quantifies how much. This is not a metaphor; it is a computable linear-algebra invariant.
The sheaf Laplacian. From the restriction maps you build a coboundary operator δ (for each edge, the difference of the two incident projections) and the sheaf Laplacian L_F = δᵀδ. It generalizes the graph Laplacian: its kernel is exactly H⁰ (the consistent states), and the diffusion ẋ = −L_F x flows the system toward consistency as the restriction maps define it — not toward a flat global average. The trivial-sheaf case recovers ordinary graph diffusion (and ordinary GNN oversmoothing).
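The construction above can be sketched in a few lines. This is a minimal illustration in plain Python (exact rational arithmetic, no dependencies), assuming every stalk has the same dimension d and taking H¹ as the cokernel of δ; the function names are hypothetical and the dense matrices are only sensible for tiny graphs.

```python
from fractions import Fraction

def coboundary(num_nodes, d, edges):
    """Dense coboundary matrix δ as a list of rows.

    edges: list of (u, v, F_u, F_v); F_u and F_v are d×d restriction maps
    (lists of rows) from the stalks of u and v into the edge stalk.
    Each edge contributes the d rows of F_v·x_v − F_u·x_u.
    """
    rows = []
    for u, v, F_u, F_v in edges:
        for i in range(d):
            row = [Fraction(0)] * (num_nodes * d)
            for j in range(d):
                row[u * d + j] -= Fraction(F_u[i][j])
                row[v * d + j] += Fraction(F_v[i][j])
            rows.append(row)
    return rows

def rank(matrix):
    """Matrix rank by exact Gaussian elimination over the rationals."""
    m = [row[:] for row in matrix]
    r = 0
    n_cols = len(m[0]) if m else 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            if m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def cohomology_dims(num_nodes, d, edges):
    """(dim H⁰, dim H¹) = (dim ker δ, dim coker δ)."""
    rk = rank(coboundary(num_nodes, d, edges))
    return num_nodes * d - rk, len(edges) * d - rk

def sheaf_laplacian(delta):
    """L_F = δᵀδ as a dense matrix."""
    n = len(delta[0])
    return [[sum(r[i] * r[j] for r in delta) for j in range(n)] for i in range(n)]

# Trivial sheaf on a triangle: scalar stalks, identity restriction maps.
ONE = [[1]]
triangle = [(0, 1, ONE, ONE), (1, 2, ONE, ONE), (2, 0, ONE, ONE)]
print(cohomology_dims(3, 1, triangle))  # (1, 1)
```

For the trivial sheaf, `sheaf_laplacian` returns the ordinary graph Laplacian (2 on the diagonal, −1 off it for the triangle), which is the "recovers ordinary graph diffusion" statement in concrete form.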
2. Sheaf neural networks: what they are and what they fix
A sheaf neural network (SNN) replaces the graph Laplacian in a GNN's message-passing/diffusion step with a sheaf Laplacian, and (in the modern versions) learns the restriction maps from data. The lineage:
| work | contribution |
|---|---|
| Hansen & Gebhart, Sheaf Neural Networks (2020) | First SNN: message-passing via the sheaf Laplacian; stalks can have different dimensions; restriction maps encode per-edge compatibility constraints. |
| Bodnar, Di Giovanni, Chamberlain, Liò, Bronstein, Neural Sheaf Diffusion (NeurIPS 2022) | The breakthrough: a topological theory of heterophily and oversmoothing. Shows GNNs implicitly use the trivial sheaf; a hierarchy of richer (learnable) sheaves provably expands the model's control over its asymptotic behaviour and achieves linear class separation in the diffusion limit. Strong heterophilic-benchmark results. (arXiv:2202.04579, code) |
| Barbero et al., Sheaf NNs with Connection Laplacians (2022) | Restriction maps constrained to orthogonal (O(d)) "connection" maps — a geometric, cheaper-to-learn family. |
| Nonlinear Sheaf Diffusion (2024) | Nonlinear sheaf Laplacians; richer dynamics (bounded-confidence / antagonistic flows). (arXiv:2403.00337) |
| Sheaf theory: from deep geometry to deep learning (2026, 117pp) | Survey + generalization of cellular-sheaf concepts to arbitrary posets, a new algorithm to compute sheaf cohomology on finite posets, and an open-problems list. (arXiv:2502.15476) |
The equations (from Neural Sheaf Diffusion), so this is concrete, not hand-wavy. Features live in stalks of small dimension d (typically 1–8). The sheaf Laplacian acts at a node as
(Δx)_v = Σ_{e=(v,u)} ( x_v − R_e · x_u )
where R_e : ℝ^d → ℝ^d is the (learned) restriction map for edge e. Diffusion is ∂x/∂t = −Δx, discretized into the layer update
x^{t+1} = x^t − α · Δ x^t (α a learned diffusion coefficient; nonlinearity + channel mixing per layer)
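As a rough illustration of this update, the sketch below runs the linear part of the layer on scalar stalks (d = 1). It assumes a fixed α and a single scalar restriction map per edge used in both directions, which is only a valid sheaf Laplacian for r = ±1; the nonlinearity and channel mixing are left out, and the function names are hypothetical.

```python
def sheaf_laplacian_apply(x, edges):
    """(Δx)_v = sum over edges at v of (x_v − R_e·x_u), for scalar stalks.

    edges: list of (v, u, r), where r is the scalar restriction map.
    Assumption: r is used in both directions, so r should be +1 or −1.
    """
    out = [0.0] * len(x)
    for v, u, r in edges:
        out[v] += x[v] - r * x[u]
        out[u] += x[u] - r * x[v]
    return out

def diffuse(x, edges, alpha=0.25, steps=50):
    """Iterate x ← x − α·Δx (linear part of the layer only)."""
    for _ in range(steps):
        lap = sheaf_laplacian_apply(x, edges)
        x = [xi - alpha * li for xi, li in zip(x, lap)]
    return x

# Trivial sheaf (r = +1): opposite opinions are averaged away.
print(diffuse([1.0, -1.0], [(0, 1, 1.0)]))
# Sign-flipping map (r = −1): the opposition is the consistent state, so it persists.
print(diffuse([1.0, -1.0], [(0, 1, -1.0)]))  # [1.0, -1.0]
```

With r = +1 both entries decay toward 0, the oversmoothing behaviour of the trivial sheaf. With r = −1 the state (1, −1) already lies in the kernel of the Laplacian and is left untouched.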
The model learns the restriction maps in one of three families, trading expressivity for cost: diagonal R_e = diag(σ(w_e)) (cheapest), orthogonal R_e ∈ O(d) via Cayley transform (geometric, rotation-like — preserves norms, good for "different view, same magnitude"), and general R_e ∈ GL(d) (full d×d, most expressive). On the standard heterophilic node-classification suite (Texas, Wisconsin, Cornell, Film, Squirrel, Chameleon) sheaf diffusion beats GCN/GAT — on the order of +10–15 accuracy points on Squirrel/Chameleon, where vanilla convolution barely clears chance. For donto the takeaway is the shape of the win: it is largest exactly where neighbours disagree.
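To make the three families concrete, here is a small sketch of how each map could be parameterised, following the description in the text. The Cayley case is written out only for d = 2, the helper names are hypothetical, and nothing here is taken from the reference implementation.

```python
import math

def diagonal_map(w):
    """R_e = diag(sigmoid(w)): one gate in (0, 1) per stalk dimension (d parameters)."""
    gates = [1.0 / (1.0 + math.exp(-wi)) for wi in w]
    d = len(w)
    return [[gates[i] if i == j else 0.0 for j in range(d)] for i in range(d)]

def cayley_map_2d(a):
    """R_e = (I − A)(I + A)^{-1} for the skew-symmetric A = [[0, a], [−a, 0]].

    The result is a rotation, so it preserves norms (d(d−1)/2 = 1 parameter for d = 2).
    """
    s = 1.0 + a * a
    c, t = (1.0 - a * a) / s, 2.0 * a / s
    return [[c, -t], [t, c]]

def general_map(params, d):
    """Unconstrained d×d map filled row by row (d² parameters).

    Note: nothing forces this matrix to be invertible.
    """
    return [list(params[i * d:(i + 1) * d]) for i in range(d)]

def gram(m):
    """mᵀm, which equals the identity exactly when m is orthogonal."""
    d = len(m)
    return [[sum(m[k][i] * m[k][j] for k in range(d)) for j in range(d)] for i in range(d)]

print(diagonal_map([0.0, 0.0]))  # [[0.5, 0.0], [0.0, 0.5]]
print(gram(cayley_map_2d(0.7)))  # approximately the 2×2 identity
```

The parameter counts (d, d(d−1)/2, d²) are the cost side of the expressivity trade-off: for d = 4 that is 4, 6, and 16 learned numbers per edge.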
Recent developments (2025–2026) — the field is moving toward exactly donto's constraints. The 2022 breakthrough has fanned out into a research program, and the new directions read like a list of donto's open problems:
| work | contribution | why it matters for donto |
|---|---|---|
| Bundle Neural Networks (ICLR 2025) | Replaces full sheaves with flat vector bundles — node-level orthogonal maps that make the diffusion a transport with no curvature penalty; the discrete BuNN layer is a special case of a sheaf NN; trains at scale, mitigates over-squashing as well as oversmoothing. (arXiv:2405.15540) | The direct answer to this report's "scale unproven" caveat: a cheaper restriction-map family (orthogonal, node-level) that keeps the disagreement-aware geometry but runs on big graphs and long-range tasks. The natural Stage-1 default. |
| Sheaf Attention Networks (2024/25) | Puts attention on the sheaf — learns the restriction maps and per-edge attention weights jointly; strictly generalizes GAT (GAT = the trivial-sheaf special case). (openreview) | donto already wants to weight edges by source reliability / standing; this is attention and sheaf transport in one operator — the discourse-sheaf source-weighting of §4 with a learnable attention head. |
| Sheaf Hypergraph Networks (NeurIPS 2023) + Hypergraph Neural Sheaf Diffusion (2025) | Lifts sheaves to hypergraphs — restriction maps from each node into a shared hyperedge stalk; captures higher-order (n-ary) relations, not just pairwise. (arXiv:2309.17116, arXiv:2505.05702) | donto's contexts are n-ary — a claim binds subject, predicate, object, evidence, time together; a donto_context is literally a hyperedge over its statements. The hypergraph sheaf is the honest model for a context's internal consistency, not a pairwise approximation. |
| Copresheaf Topological Neural Networks (NeurIPS 2025) | A unifying framework: learnable anisotropic maps on directed edges subsume GCN/GAT/sheaf-NNs as special cases; copresheaves handle asymmetric relations natively. (arXiv:2505.21251) | donto edges are directed and typed (argues-against, daughter-of, supersedes) — asymmetric by nature. Copresheaves are the categorically-correct structure for that, where a plain sheaf's symmetric restriction would lose the direction. |
| Sheaf ↔ transformer / topos line (Mahadevan and others, 2025–26) | Recasts attention as a sheaf/topos construction; situates message-passing as a local-to-global ("gluing") computation. (arXiv:2603.14831) | The conceptual unification: the same gluing math underlies both the transformer reader and the substrate's reconciliation — donto's "emit-free, defer-joining to query time" is a local-to-global sheaf computation. |
The trajectory is unmistakable: higher-order (hypergraph), directed/asymmetric (copresheaf), source-weighted (attention), and scalable (bundle) — every one of which is a structural feature of donto's claim/context graph rather than a convenience. The field is converging on the exact object donto already is.
The two failures SNNs were built to remove — both of which donto would hit:
- Heterophily. Standard GNNs assume connected nodes are similar and should be averaged together. On heterophilic graphs (neighbours differ), GCNs "saturate slightly above chance." donto's claim graph is intrinsically heterophilic: an argues-against edge connects claims that disagree; an identity-as-hypothesis edge connects entities that might not be the same. Averaging across those is exactly wrong. Sheaf restriction maps let the model learn the transformation between disagreeing neighbours instead of blending them.
- Oversmoothing. Deep GNNs diffuse everything toward one mean vector, erasing local detail. The non-trivial sheaf Laplacian has a richer kernel (H⁰) than "the all-ones vector," so sheaf diffusion can run deep without collapsing distinct claims into mush: it converges to the consistent-where-consistent state, preserving genuine differences.
3. Why this is donto's mathematics, line for line
donto's design principles each have a precise sheaf-theoretic counterpart. This is the heart of the report.
| donto concept | sheaf-theoretic counterpart |
|---|---|
| A claim / entity with its own free-typed local representation | a node stalk F(v) — a per-node vector space, dimensions need not match |
| Query-time alignment / predicate folding / identity-as-hypothesis (occupation↔currentJob, Olympus(myth)↔Mt.Olympus(geo)) | restriction maps: learned linear maps that translate one node's view into a shared edge space; identity becomes a map, not an assertion |
| Paraconsistency — hold incompatible claims forever, never resolve-by-deletion | non-zero H¹ — the substrate represents inconsistency as a first-class, measured quantity instead of collapsing it |
| Contradiction pressure (a standing-v1 component) | the norm of the disagreement at an edge / a localized H¹ contribution — a computable contradiction score per claim, per region |
| "Re-rank by reality over time, don't delete on conflict" | sheaf diffusion ẋ = −L_F x — flows toward the consistent subspace while keeping the contradictory components visible; reconcile-where-consistent, preserve-where-not |
| Multi-source evidence fusion (the same fact attested by many sources) | the canonical result that "sheaves are the canonical data structure for sensor/data fusion" (Robinson) — cohomology measures whether sources can be reconciled and flags where they can't |
| Contested knowledge / source reliability (genealogy: Federal Court vs Lorimer vs Tindale, each an interpretive witness) | opinion dynamics on discourse sheaves (Hansen & Ghrist) — private node opinions vs a public discourse space, with restriction maps modeling what each source commits to publicly, and the Laplacian driving public consensus while private disagreement persists — even modeling deception/propaganda |
| Trust kernel / standing ⟨maturity, corroboration, contradiction-pressure, recency⟩ | corroboration ≈ agreement in H⁰; contradiction-pressure ≈ H¹ mass; the sheaf Laplacian gives a single operator that produces both |
| The lens engine (relationships emerge at the intersection of perspectives) | restriction maps into different edge spaces = different lenses; a claim's behaviour under several sheaves = its multi-perspective signature |
The fit is not loose analogy. donto already says "identity is a hypothesis, not a merge"; a sheaf says "identity is a restriction map, and whether two things glue is H¹." donto already says "preserve contradictions and re-rank by reality"; sheaf diffusion is a re-ranking flow that preserves the contradictory subspace. Sheaf theory is, essentially, the linear-algebra formalization of the donto canon.
4. Concrete ways donto could use it
Ordered by effort vs payoff (to be pressure-tested in later iterations of this report):
- H¹ as a computed contradiction-pressure signal (low effort, high value). donto already wants a per-claim/per-region "contradiction pressure" in standing v1. Build the sheaf Laplacian over a subgraph (a holder's claims, or a contested entity's neighbourhood) with simple restriction maps (even diagonal/scalar to start) and read off the local disagreement norm. This is a principled, computable replacement for ad-hoc contradiction counting, and it localizes conflict to specific claim pairs.
- Sheaf diffusion for query-time reconciliation (medium). Today donto folds predicates via similarity at query time. A learned sheaf over the recalled claim neighbourhood would reconcile the consistent part (fold occupation/currentJob) while holding the genuinely contradictory part separate, and return both, with the contradiction quantified. This is donto's "emit-free, defer-joining" done with a principled operator.
- A sheaf-NN scorer over the claim graph (medium-high). donto's claim graph is heterophilic and the contradiction/identity machinery is currently barely exercised (per the capability-vs-exercise report: donto_argument ≈ 0.006% of statements). A Neural-Sheaf-Diffusion model is the natural learner for this graph shape: it could rank claims, predict missing argument edges, and resolve identity hypotheses because it's built for disagreement, where a vanilla GNN would oversmooth the contradictions away.
- Discourse-sheaf source modeling for the genealogy example (high, research-y). Model each source (court, genealogist, oral history) as a node with a private stalk and a restriction map into a shared "what-happened" discourse space. The harmonic section = the reconcilable core; the H¹ = the irreducible contested residue. This is exactly the genealogy frontier ("figure out what actually happened across contradictory witnesses") with a rigorous engine.
Honest caveats (from the literature itself):
- Compute. Sheaf Laplacians and especially cohomology are heavier than graph Laplacians; the survey and the industry write-up both flag cost and the need for specialized optimization. donto would apply it to recalled subgraphs, not the 41.5M-statement whole, at least initially.
- Modeling expertise. Sheaves are "not plug-and-play" — choosing stalks/restriction-map families is a design act. Start with learnable low-dimensional diagonal/orthogonal maps (the Barbero/Bodnar recipes), not bespoke hand-built sheaves.
- Scale unproven. Billion-node sheaf systems are an open problem industry-wide. donto's advantage: it can scope the sheaf to a query's neighbourhood, where size is bounded.
4½. A worked example: contradiction as cohomology (and the part donto can't currently see)
Case A — a single contested attribute (the easy win). A real donto example: a person e whose birthplace is attested as Coen by one source and McIvor by another (an actual Rosie-search revision). Model it as a 3-node sheaf — two source nodes s₁, s₂ and the entity e — with a 2-dimensional birthplace stalk (Coen=(1,0), McIvor=(0,1)). Restriction maps carry each source's asserted value into e's shared birthplace space. A global section (an element of H⁰) would require a single value both edges agree on. Because (1,0) ≠ (0,1), none exists in that component; the obstruction lives in H¹, and its norm ‖(1,0)−(0,1)‖ = √2 is the contradiction-pressure for that attribute. Re-rank the sources by reliability (a weight on each restriction map) and the harmonic/H⁰ part shifts toward the trusted value while H¹ keeps the dissent on the books — donto's "re-rank by reality, never delete" as one linear operator.
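The numbers in Case A can be reproduced with a short sketch. It assumes identity restriction maps and treats source reliability as a scalar weight per edge. The disagreement is measured as a weighted least-squares residual, a simple stand-in for the contradiction-pressure score rather than a cohomology computation. All names are hypothetical.

```python
import math

COEN, MCIVOR = (1.0, 0.0), (0.0, 1.0)  # one-hot birthplace stalk

def reconcile(assertions, weights):
    """Entity value minimising sum_i w_i·||x_e − a_i||² (the weighted mean)."""
    total = sum(weights)
    dim = len(assertions[0])
    return tuple(
        sum(w * a[k] for w, a in zip(weights, assertions)) / total
        for k in range(dim)
    )

def residual(assertions, weights, x_e):
    """Square root of the weighted disagreement left after reconciling."""
    energy = sum(
        w * sum((x_e[k] - a[k]) ** 2 for k in range(len(a)))
        for w, a in zip(weights, assertions)
    )
    return math.sqrt(energy)

sources = [COEN, MCIVOR]
print(math.dist(COEN, MCIVOR))        # sqrt(2), the raw source-to-source gap

equal = reconcile(sources, [1.0, 1.0])
print(equal, residual(sources, [1.0, 1.0], equal))   # (0.5, 0.5) 1.0

trusted = reconcile(sources, [3.0, 1.0])              # first source weighted 3×
print(trusted, residual(sources, [3.0, 1.0], trusted))
```

Raising the weight on the first source moves the reconciled value to (0.75, 0.25), toward Coen, while the residual stays strictly positive: the dissenting attestation is down-weighted, not removed.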
Case B — the contradiction donto currently cannot detect (the real argument). donto's contradiction machinery today is pairwise (donto_argument edges between two claims). Sheaf cohomology sees a strictly larger class of conflict: cycles that are consistent on every edge but inconsistent around the loop. Genealogy gives a clean one. Three attestations:
- A: Kitty is the daughter of Bujilkabu
- B: Kitty is the spouse of Bujilkabu
- C: daughter-of and spouse-of are disjoint for the same pair
Each pair of statements is individually fine; no single donto_argument edge fires. But compose the restriction maps around the Kitty–Bujilkabu cycle and they fail to return to the identity — a non-trivial 1-cocycle, H¹ ≠ 0. (This is exactly the kind of incest-conflict flagged by hand in the EKY-2026 vs Brady-2013 Kitty work — found by a person, invisible to pairwise edges.) Sheaf cohomology turns that hand-detection into an automatic, localizable computation. This is the single most concrete reason to bring sheaves into donto: it upgrades contradiction detection from "two claims that explicitly negate" to "any loop of claims that can't be globally reconciled" — which is where most real-world contested-knowledge errors actually hide.
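The "fails to return to the identity" test can be sketched directly. The encoding below is a toy assumption, not donto's: the two relation claims are given identity transport maps and the disjointness constraint is given a sign flip on a 2-dimensional stalk. The sketch checks the loop composition (holonomy) only and does not compute H¹.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def identity(d):
    return [[1 if i == j else 0 for j in range(d)] for i in range(d)]

def holonomy(maps):
    """Compose the edge transport maps in loop order (first map applied first)."""
    h = identity(len(maps[0]))
    for m in maps:
        h = matmul(m, h)
    return h

def loop_is_consistent(maps):
    """True when transporting any vector around the loop returns it unchanged."""
    return holonomy(maps) == identity(len(maps[0]))

# Hypothetical encoding: AGREE carries a value across unchanged, EXCLUDE flips its sign.
AGREE = [[1, 0], [0, 1]]
EXCLUDE = [[-1, 0], [0, -1]]

print(loop_is_consistent([AGREE, AGREE, AGREE]))    # True
print(loop_is_consistent([AGREE, AGREE, EXCLUDE]))  # False
```

Each map is invertible, so any single edge (and any open path) can be satisfied on its own. Only closing the loop exposes the conflict, which is why a pairwise check never fires.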
4¾. A minimal build path (grounded in donto's existing tables)
donto already has every input a sheaf needs — this is assembly, not green-field. A staged plan that fails cheap and only escalates when a stage pays off:
Stage 0 — H¹ contradiction-pressure as a read-only analytic (days, no ML training).
- Graph: a recalled subgraph, not the 41.5M-statement whole: a holder's claims, or a contested entity's neighbourhood. Edges already exist: donto_argument, donto_identity_edge, and claim co-occurrence within a context.
- Stalks: start d=1 scalar (or reuse the existing bge-small claim embeddings in donto_claim_embedding as d-dim stalk features; they're already computed).
- Restriction maps: start diagonal R_e = diag(σ(w_e)); seed w_e from the alignment closure (donto_predicate_closure / donto_match_aligned): high-confidence folds → near-identity maps, weak/contested links → small maps. (This reuses the day-0 iterative_scan alignment work directly.)
- Output: build δ, form L_F = δᵀδ, compute the per-node disagreement norm ‖(δx)‖ and the bottom of the spectrum (H⁰ dim, H¹ mass). Write the per-claim disagreement as a standing-v1 contradiction-pressure value, and flag any non-trivial 1-cocycle (the Case-B cycle conflicts) as new donto_argument / argument-cluster rows. Pure linear algebra on a bounded subgraph: cheap, idempotent, I3-safe (additive writes only).
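A minimal sketch of the Stage-0 disagreement norm, under stated assumptions: the same diagonal gate σ(w_e) is applied to both endpoints of an edge, so the edge residual is σ(w_e) ⊙ (x_v − x_u), and a claim's pressure is the root of the summed residual energy over its incident edges. The claim ids, features, and function names are hypothetical, and no donto table is read or written.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def edge_residuals(x, edges):
    """Per-edge residual (δx)_e = σ(w_e) ⊙ (x_v − x_u).

    x: dict claim_id -> feature list; edges: list of (u, v, w) with w the
    gate logits of the diagonal restriction map (one per stalk dimension).
    """
    out = []
    for u, v, w in edges:
        gate = [sigmoid(wk) for wk in w]
        out.append([g * (a - b) for g, a, b in zip(gate, x[v], x[u])])
    return out

def contradiction_pressure(x, edges):
    """Per-claim pressure: sqrt of the residual energy on incident edges."""
    energy = {claim: 0.0 for claim in x}
    for (u, v, _), r in zip(edges, edge_residuals(x, edges)):
        e = sum(c * c for c in r)
        energy[u] += e
        energy[v] += e
    return {claim: math.sqrt(e) for claim, e in energy.items()}

# Hypothetical recalled subgraph: c1 and c2 agree, c3 disagrees with c2.
claims = {"c1": [1.0, 0.0], "c2": [1.0, 0.0], "c3": [0.0, 1.0]}
strong = [("c1", "c2", [10.0, 10.0]), ("c2", "c3", [10.0, 10.0])]
weak = [("c1", "c2", [10.0, 10.0]), ("c2", "c3", [-10.0, -10.0])]

print(contradiction_pressure(claims, strong))
print(contradiction_pressure(claims, weak))
```

With high-confidence links, c2 and c3 each carry a pressure close to √2 and c1 carries none, so the conflict is localized to the disagreeing pair. Seeding the c2–c3 link with a low logit shrinks the same disagreement to almost zero, which is the "weak/contested links → small maps" behaviour described above.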
Stage 1 — a learnable sheaf over the claim graph (a real model, scoped). Port the twitter-research/neural-sheaf-diffusion recipe (diagonal → O(d) → general restriction maps, d=2–8) onto exported recalled subgraphs. Tasks it's built for and donto needs: predict missing argument edges, resolve identity hypotheses (identity-as-hypothesis → a learned restriction map + a glue/no-glue verdict from H¹), and score claims under contradiction. This is the model that fits a heterophilic claim graph where a vanilla GNN oversmooths the disagreement away.
Stage 2 — query-time sheaf reconciliation in /recall (the payoff). At recall, build the scoped sheaf over the returned claims, run a few sheaf-diffusion steps: fold the consistent part (the occupation↔currentJob case), return the harmonic/H⁰ summary as the answer-shaped fact, and attach the H¹ residue as an explicit "contested — N sources disagree" tag. That is donto's emit-free/defer-joining and its contradiction-preservation in one operator, and it slots exactly where the benchmark work said donto's value lives (answer-shaping facets that help even a weak reader — §ref the scorecard).
Engineering posture (per the donto canon). Rust-first for anything load-bearing (the Stage-0 analytic can be a pg_donto SQL/Rust routine over a subgraph; the sheaf Laplacian is sparse and small once scoped). Prototype Stage-1 in Python/torch over exported subgraphs to validate, then port the winning piece into the Cargo workspace — don't scatter it into a side repo. Scope every sheaf to a query's neighbourhood so size stays bounded (the honest answer to the "billion-node sheaf is unproven" caveat: donto never builds the global sheaf).
5. The strategic read (and a candor note)
There is industry momentum: a widely-shared write-up argues sheaf theory is "NVIDIA's stealth deep-learning bet," frames it as the paradigm move beyond GNNs (upgrading the representational substrate, not just attention/residual tweaks), and points to a legal-reasoning company (Iqidis) building dynamic, contradiction-aware knowledge graphs — the same shape as donto's contested-knowledge domains. The market-size numbers in that piece are speculative and should be treated as hype-adjacent; the mathematics and the published benchmark wins on heterophily are not. (source)
The defensible conclusion: sheaf neural networks are the most precise existing formalization of what donto claims to be — a substrate that holds disagreement, aligns at edges, and re-ranks by consistency without deletion. Adopting the vocabulary alone (stalks, restriction maps, H⁰/H¹, sheaf Laplacian) sharpens the canon; adopting the operator (a scoped sheaf Laplacian for contradiction-pressure, then a sheaf-NN over the claim graph) is a concrete, literature-backed path to making donto's differentiating machinery load-bearing.
From research to build
This report is the literature grounding; the no-shortcuts implementation PRD that turns it into an engineering plan lives in the donto repo at docs/sheaf-prd/ (branch sheaf-prd) — a 7-document suite (master + data-model + the three stage specs + engineering/rollout, with an authoritative migration ledger), generated by a 28-agent ultracode workflow (map → adversarial design → synthesis) and human-reconciled. The adversarial pass caught correctness bugs the naive design would have shipped — notably that scalar d=1 restriction maps mathematically cannot produce the Kitty cocycle (so Stage 0 defaults to d≥2 sign-carrying maps), and that H¹ requires the up-Laplacian L₁=δδᵀ, not just L_F=δᵀδ. See also the original-research companions on /papers: The Bitemporal Sheaf (the temporal generalization) and the research agenda.
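The second point follows from a short, standard linear-algebra argument, sketched here for a finite graph with real stalks and the usual inner products (notation as in §1; C¹ denotes the space of edge-level assignments and L₁ the operator named above):

```latex
\begin{aligned}
H^0(F) &= \ker\delta = \ker(\delta^{\top}\delta) = \ker L_F
  && \text{since } \langle \delta^{\top}\delta x, x\rangle = \lVert \delta x\rVert^2,\\
H^1(F) &= C^1/\operatorname{im}\delta \;\cong\; (\operatorname{im}\delta)^{\perp}
  = \ker\delta^{\top} = \ker(\delta\delta^{\top}) = \ker L_1
  && \text{since } \langle \delta\delta^{\top} y, y\rangle = \lVert \delta^{\top} y\rVert^2 .
\end{aligned}
```

So the kernel of L_F only sees H⁰. Reading off H¹ requires the kernel of δδᵀ, or equivalently dim H¹ = dim C¹ − rank δ.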
References (primary)
- Hansen & Gebhart, Sheaf Neural Networks (2020) — pdf
- Bodnar et al., Neural Sheaf Diffusion (NeurIPS 2022) — arXiv:2202.04579 · code
- Barbero et al., Sheaf NNs with Connection Laplacians (2022) — pdf
- Nonlinear Sheaf Diffusion in GNNs (2024) — arXiv:2403.00337
- Hansen & Ghrist, Opinion Dynamics on Discourse Sheaves — pdf
- Robinson, Sheaves are the canonical data structure for sensor integration (2017) — arXiv:1603.01446
- Sheaf theory: from deep geometry to deep learning (2026 survey) — arXiv:2502.15476
References (recent, 2024–2026)
- Bundle Neural Networks for message diffusion on graphs (ICLR 2025) — arXiv:2405.15540
- Sheaf Attention Networks — openreview
- Sheaf Hypergraph Networks (NeurIPS 2023) — arXiv:2309.17116
- Hypergraph Neural Sheaf Diffusion (2025) — arXiv:2505.05702
- Copresheaf Topological Neural Networks (2025) — arXiv:2505.21251
- Neural Networks as Local-to-Global Computations — arXiv:2603.14831
Iteration 5 (2026-06-13): added the recent-developments sweep (bundle / attention / hypergraph / copresheaf / topos lines) and the second reference block — every reference in it independently verified to resolve (titles/authors/venues confirmed: Hajij et al. CTNN NeurIPS 2025; Bosca & Ghrist 2026; Choi/Kim/Oh HNSD 2025; Bundle NNs ICLR 2025; Sheaf Attention OpenReview). Earlier iterations built the cellular-sheaf primer, the Neural-Sheaf-Diffusion equations + heterophily numbers, the line-for-line donto mapping, the worked H¹-as-contradiction-pressure examples (Coen-vs-McIvor; the Kitty daughter-vs-spouse cycle pairwise edges can't see), and the staged build path grounded in donto's existing tables.