dontopaper

Generative-abundance metrics: predicate proliferation as signal, not noise

donto papers · original research · 2026-06-13 · measurement framework, working draft v1

Abstract. A freely-extracted knowledge base accumulates a vast number of distinct predicates — donto's live store holds ~1.16M of them over ~42.0M statements. The standing reflex of every schema-disciplined system is to read that number as a quality failure: uncontrolled vocabulary, normalize it away. We argue the reflex is backwards. Proliferation is the signature of abundance — the observable trace of an emit-free pipeline that invents predicates as it reads — and it carries recoverable signal that aggressive ingest-time typing destroys irreversibly. The contribution is a metric suite purpose-built for emit-free stores: predicate-mass concentration C, the singleton fraction S, fold density δ and fold yield Y, alignment-debt-repayment rate ρ, and a recoverable-signal measure R that compares recall on the freely-typed store against recall on a normalized projection. We ground every metric in live donto-pg numbers (73.1% of predicates are used exactly once; the top 100 predicates carry 85.1% of all statements). We also report — honestly — that our one measured fold yield is contaminated: expanding the hub predicate bornIn lifts recall a raw +8.7%, but only +7.0% of that is real near-synonym recovery (wasBornIn, birthplaceOf) and the rest is test-fixture pollution in the closure — a live instance of the very "counterfeit fold" failure mode the suite is built to catch. We state the one decisive head-to-head we have not run (R) and predict its sign: recall on the freely-typed store ≥ recall on a normalized projection, at strictly lower ingest cost. The metric definitions are new and R is to-be-run; the snapshot is consistent with the prediction but does not prove it. This is the measurement sibling of the economic argument in Alignment debt: that paper says carrying the debt is cheaper; this one says how to read the store's vital signs and which number to stop being alarmed by.


1. The number everyone misreads

Show a data engineer "1.16 million distinct predicates" and the trained response is immediate: that is a vocabulary that has lost control; the store needs canonicalization. The response is correct under the assumption that built the field — that a predicate is an expensive, deliberate schema decision, so a million of them means a million failures of discipline.

That assumption no longer holds. donto's predicates were not authored by a schema committee; they were minted by extraction lanes told to invent predicates freely (/mnt/donto-data/workspace/donto/apps/donto-api/prompts/extract_broad.txt; see decoupled anchoring). At ~$0.0001 per fact, a guided model emits an unbounded, multi-directional space of properties about any entity — bornIn, birthPlace, placeOfBirth, wasBornIn, birthplaceOf and dozens of other near-synonyms of one relation — as a feature of how it reads, not a failure of how it was governed. The million predicates are the trace of abundance: every one is a real, evidence-anchored statement that a collapse-on-conflict store would have thrown away.

So the question this paper answers is not "how do we reduce the predicate count?" It is: what are the right vital signs of an emit-free store, such that proliferation reads as health and the genuine pathologies (un-repayable debt, counterfeit folds, dead tail) read as alarms? A wrong instrument — one that alarms on predicate count — drives you to normalize at ingest, and normalization at ingest is the one move you cannot undo (it violates I3, no destructive overwrite). The instrument matters because it determines whether you keep the signal or burn it.


2. What the live store actually looks like

Before defining metrics, the shape. Measured 2026-06-13 against donto-pg (donto_statement, upper(tx_time) IS NULL); the store is live and growing, so counts are point-in-time to ±a few statements.

quantity value what it says
live statements 42,040,511 the firehose
distinct live predicates 1,160,474 the proliferation
predicates used exactly once (singletons) 848,796 (73.1%) the long tail
predicates used ≤ 2 times 1,005,598 (86.7%) the tail is the store, by count
median predicate use-count 1 most predicates are mentioned once and never again
predicates used ≥ 1,000 times 677 the head, by count
statements under those 677 head predicates 38,364,614 (91.3%) the head carries the mass
statements under the top 100 predicates 35,756,596 (85.1%) extreme concentration
max single-predicate use-count 4,164,423 the hottest hub

The store is a textbook heavy-headed power law: a tiny head (100–677 predicates) carries ~85–91% of all statements, while a vast tail (849K singletons) carries the diversity — one statement each. This is not disorder. It is the precise shape you would predict if a fluent generator emits common relations constantly (name, bornIn, parentOf) and rare, situation-specific relations once (wasWitnessAtBaptismOf, boiledDownTallowAt). The head is where recall and folding pay off; the tail is where abundance lives and where normalization would do its damage. Any honest metric suite has to treat the two regions differently — a single store-wide average over a power law is a lie.

2.1 The fold side

donto folds these predicates at read time through donto_predicate_closure (1,045,762 rows; 1,035,738 self-identity + 10,024 fold edges) via the donto_match_aligned(...) SQL function, whose p_expand_predicates flag is exactly the knob this paper's metrics toggle. The fabric that feeds it:

quantity value source
predicate embeddings (fold candidates) 1,035,744 donto_predicate_embedding
fold edges (relation <> 'self') 10,024 donto_predicate_closure
— by kind 8,438 close_match · 1,412 exact_equivalent · 132 inverse_equivalent · 30 sub_property_of · 12 decomposition group by relation
predicates with ≥ 1 fold target 3,874 (~0.33% of distinct) distinct non-self predicate_iri
mean fold fan-out (among folded) 2.59 (max 85) per-predicate edge count
entity embeddings 3,160,590 donto_entity_embedding
identity edges 169 donto_identity_edge

And the repayment is live: the continuous alignment daemon (donto_align_daemon_status) was last ticking 2026-06-13 14:21 UTC — ~298 ticks, 7,387 predicates + 47,315 entities embedded, 2,571 predicate alignments proposed and registered, with a real GLM-cap-aware backlog of 20,686 predicate candidates and 317 identity candidates still queued (cap_reason = entity adjudication: transient GLM rate-limit). The store is in the middle of paying down its alignment debt, throttled by a flat-subscription rate limit, not finished. That ongoing process is itself a metric (§3.4).


3. The metric suite

Each metric below is (a) defined precisely, (b) instrumentable from tables that already exist, (c) given its live value, and (d) tagged with the failure mode that makes the naïve version misleading. The design rule throughout is the canon's no brittle logic — every threshold is a measured property of the data, never a hand-maintained list of "good" predicates.

3.1 Predicate-mass concentration C

Definition. C(k) = fraction of live statements carried by the top-k predicates by use-count. Report the curve, not a point.

Live: C(100) = 0.851, C(677) = 0.913. Why it matters: C separates the queryable head from the abundant tail. A healthy emit-free store has high C (the firehose concentrates mass on common relations the generator reaches constantly) and a long tail (it keeps minting rare ones). A falling C over time with a flat tail would be the real alarm — mass leaking into a middle band of near-duplicate medium-frequency predicates that should have folded but didn't. Failure mode: C alone cannot distinguish "healthy head" from "the head is three unfolded spellings of name." Pair it with fold yield (§3.3).

3.2 Singleton fraction S — the proliferation gauge

Definition. S = fraction of distinct predicates used exactly once.

Live: S = 848,796 / 1,160,474 = 0.731. Why it matters: S is the direct proliferation signal, and it is the number people misread. High S is not noise — it is the generator exercising its full expressive range, minting a bespoke predicate for a one-off relation it actually found in the source. The vast majority of these are never queried, never folded, and cost nothing until something asks for them (the alignment-debt "cold tail"). The non-brittle reading: do not alarm on S rising; alarm only if a singleton is queried and cannot be folded to a populated equivalent — i.e. abundance that turned into an actual recall miss. That conditional is what §3.5 measures. Failure mode: treating S as a quality score and normalizing the tail away — the destructive, I3-violating move; the tail's singletons are exactly the rare evidence-anchored facts a contested-knowledge store exists to keep.

3.3 Fold density δ and fold yield Y

Definition. Fold density δ = fraction of queried predicates that have ≥ 1 fold target. Fold yield Y(p) = the recall lift from expanding predicate p = (statements with fold ON − statements with fold OFF) / (statements with fold OFF).

Live aggregate δ ≈ 0.33% (3,874 of 1.16M predicates have any fold), but the aggregate is the wrong denominator — queries cluster on the head, and the head is where folds concentrate. A measured Y on a head predicate that does fold (bornIn, which has 70 closure equivalents; timings on donto-pg, three warm runs):

read on bornIn statements cold warm
fold OFF (p_expand_predicates=false) 1,407 ~19.5 ms ~5 ms
fold ON (donto_match_aligned, conf ≥ 0.8) 1,530 ~81 ms ~7 ms

Raw Y(bornIn) = +8.7% recall at ~4× cold / ~1.4× warm read cost. But the headline number is dirty, and the honest decomposition is the real finding. Of the +123 recovered statements:

recovered under count kind
wasBornIn 78 real near-synonym ✅
birthplaceOf 21 real near-synonym ✅
test:*/bornIn fixtures (24 distinct, 1 each) 24 closure pollution ❌

So the real near-synonym yield is +7.0% (99 statements an exact-match read could never reach), and +1.7% is test-fixture predicates that leaked into donto_predicate_closure (67 of bornIn's 70 closure edges are test:-prefixed fixtures; only 2 are production). Why this matters more, not less: Y is the metric that proves proliferation carries recoverable signal — wasBornIn and birthplaceOf were minted freely and are invisible to an exact read, recovered only because folding ran at query time. And the contamination is itself the §6 "counterfeit fold" failure mode caught in the live store: a Y reported without auditing what each recovered statement actually folded from would have overstated the win by ~20%. Failure modes, both live here: (i) measuring δ store-wide (0.33%, looks dead) instead of workload-weighted (the head, where it is the live action); (ii) reporting Y as a single number without the fold-provenance breakdown that separates real recovery from polluted edges.

3.4 Alignment-debt-repayment rate ρ

Definition. ρ = the rate at which the continuous aligner converts un-folded predicate/entity candidates into closure edges, relative to the standing backlog. Read directly off donto_align_daemon_status.

Live: backlog of 20,686 predicate candidates + 317 identity candidates; 2,571 alignments registered to date across ~298 ticks; throttled by a GLM rate limit (cap_reason = entity adjudication: transient GLM rate-limit). Why it matters: an emit-free store is healthy only if debt is being servicedρ > 0 and the backlog bounded. ρ is the store's metabolism. A rising, unbounded backlog with ρ → 0 is a genuine pathology (the aligner died, or the firehose outran it). This is the metric the alignment-debt paper flagged as speculative ("debt-repayment as a health signal"); here it is concretely a donto_align_daemon_status query. Failure mode: reading backlog size as alarm in absolute terms — it is expected to be large on a budget-throttled flat-subscription lane (the canon's spend constraint). The signal is ρ and backlog trend, not backlog magnitude. (See also §6: ρ is confounded by the throttle and must be normalized by available aligner-calls.)

3.5 Recoverable-signal R — the metric that settles the argument

Definition. For a fixed recall workload Q, R = recall(freely-typed store, fold ON) − recall(normalized-at-ingest projection). Positive R means proliferation + late folding kept signal that ingest-time typing lost.

This is the head-to-head, and the one number that converts the thesis from assertion to result. R > 0 would establish that recall on the freely-typed store ≥ recall on a normalized projection — the agenda's target result. We have the freely-typed arm (live) and the read-path operator (donto_match_aligned); we have not built the normalized-at-ingest projection arm (the same Arm B that alignment-debt §7 specifies). So R is defined and instrumentable but unmeasured — §5 is the experiment. The §3.3 measurement is a one-predicate lower-bound intuition for R, and even there the honest figure is the +7.0% real recovery, not the +8.7% raw: a normalizer that collapsed wasBornIn/birthplaceOf to a different canonical choice than the query uses would miss those 99 statements outright, and one that collapsed them irreversibly could never recover the distinction. But one predicate is not a workload; R is the workload-level claim.


4. Why only donto can compute these

The suite needs three things at once, and they are donto's three invariants — the same intersection that makes the bitemporal sheaf and alignment-debt buildable:

  1. A genuinely freely-minted store (emit-free ingest). You cannot measure proliferation-as-signal on a store that never proliferated. The 1.16M predicates are real, minted by extract_broad.txt, not a synthetic stress test. S = 0.731 is an empirical property of how a fluent extractor reads, observable only on a store that let it.
  2. No destructive overwrite (I3). R is only meaningful if both arms are reachable from the same facts: the freely-typed arm exists and a normalized projection can be derived from it without having already destroyed the variants. A store that normalized at ingest has only Arm B and can never run the comparison. donto keeps every variant as a live statement and the fold as a separate, revisable donto_predicate_closure edge — which is also why the §3.3 contamination is recoverable (drop the bad edges and re-read), not baked in.
  3. Query-time folding as a first-class operator. Y and the fold-ON arm of R require folding to be a read-path function (donto_match_aligned with p_expand_predicates), not a batch ETL. Without it, "recoverable signal" is a hypothesis you cannot test; with it, it is a function call with a boolean flag — and, as §3.3 shows, an auditable one.

A vector DB cannot run this (dedup collapses the variants — S is undefined, Y is zero by construction). A normalized KG cannot run this (it is the projection arm — there is no freely-typed arm to compare against). Only a paraconsistent, emit-free, align-at-read substrate has both arms and the operator that connects them.


5. The experiment that turns R into a result

The decisive measurement is a single head-to-head over donto's own corpus, reusing harnesses that already exist (LongMemEval, LoCoMo, genealogy recall):

  1. Arm A (live). The freely-typed store. Recall via donto_match_aligned(..., p_expand_predicates=true). Prerequisite: purge the test:-prefixed fixture edges from donto_predicate_closure first, or §3.3's contamination inflates Arm A.
  2. Arm B (to build). A normalized-at-ingest projection: take the same extracted facts and canonicalize predicates forward at ingest, using the same alignment signals (donto_predicate_embedding + donto_predicate_closure) applied write-side instead of read-side. A bounded build — the alignment fabric already exists; it is run in the other direction.
  3. Fixed workload Q. Run the same recall queries against both arms, computing R = recall(A) − recall(B) overall and per power-law band (head / mid / tail) — because the prediction is band-dependent: R ≈ 0 on the head (both arms fold common predicates the same way) and R > 0 on the mid/tail, where ingest-time normalization had to guess a canonical form and guessed differently than the query, or collapsed a distinction the query needed.
  4. Cost. Pair R with ingest cost (A: zero normalize; B: the per-fact normalization surcharge Δ_norm·N from alignment-debt) to state the joint claim: R ≥ 0 at strictly lower ingest cost.

Prediction. R ≥ 0 workload-wide, concentrated in the mid/tail bands, at lower ingest cost — because late folding is revisable (it re-amortizes as alignment improves; the live ρ > 0 daemon is doing exactly this) while ingest-time normalization is a one-shot irreversible guess. Abundance does not merely tolerate deferral; deferral recovers signal that prevention destroyed.

The snapshot is consistent with this — high S, positive real Y on the one head predicate measured, a live repayment daemon — but consistency is not proof, and the same one-predicate probe surfaced a closure-pollution defect that would bias the result if uncorrected. R itself is unmeasured until Arm B exists.


6. The honest failure modes

Named explicitly, in the spirit of the seed papers — and one of them is already realized in §3.3:

  • S could be hiding dead weight, not signal. The thesis that singletons are recoverable signal holds only to the extent that queried singletons fold to populated equivalents. If a large share of the tail is genuinely junk (kebab-clause objects, containsCommaCharacter-style padding — the exact hygiene failure the extraction prompt fights), then S overstates health. This is measurable: sample queried singletons, check fold reachability and evidence-anchoring (via the decoupled anchoring citer). The honest version of S is anchored, fold-reachable singleton fraction, not raw S.
  • R (and Y) can be ≤ the headline if folds are wrong — and ours already are, partly. A counterfeit fold returns statements that should not match, so Arm A can lose on precision even while winning on recall. We did not have to hypothesize this: 67 of bornIn's 70 closure edges are test: fixtures, inflating raw Y from a real +7.0% to a reported +8.7%. The 0.8 confidence floor and the close_match/exact_equivalent split are guardrails but did not stop fixture leakage; R must be reported as recall and precision, with fold provenance audited. This couples directly to the alignment-debt quality caveat.
  • Band-weighting can manufacture either sign. Over a head-only workload, R ≈ 0 (the boring true answer) and the thesis under-delivers; over a tail-heavy workload, R is large but on rarely-queried predicates whose practical value is small. The honest report is the per-band curve weighted by a real workload's query distribution, not a flattering slice.
  • The aligner's throttle confounds ρ. ρ is currently capped by the GLM rate limit, so a low ρ reflects budget, not the store's intrinsic alignability. ρ is a health signal conditional on the lane's available capacity; report it normalized by available aligner-calls, or it measures the wallet, not the store.

7. Status: measured / defined / conjectured / speculative

  • Measured (solid). The store shape: 1,160,474 distinct predicates over ~42.04M statements; S = 0.731 (848,796 singletons), median use 1, max 4,164,423; C(100) = 0.851, C(677) = 0.913; 10,024 fold edges over 3,874 folded predicates; a live aligner mid-repayment (backlog 20,686 predicate candidates, ~298 ticks). All reproducible donto-pg reads with the queries in §2–§3.
  • Measured with a caveat. Y(bornIn): raw +8.7% recall, of which +7.0% is real near-synonym recovery (wasBornIn 78, birthplaceOf 21) and +1.7% is test-fixture pollution in donto_predicate_closure (67/70 of bornIn's edges are test: fixtures). The real number is the +7.0%; the gap is itself a finding (counterfeit-fold detection works only if you audit provenance).
  • Defined (solid). The six metrics (C, S, δ, Y, ρ, R), each instrumentable from existing tables, each paired with its failure mode. The definitions follow from the power-law structure of an emit-free store and are internally consistent.
  • Conjectured (open). The headline — R ≥ 0 at lower ingest cost, concentrated in the mid/tail. Snapshot-consistent (real Y > 0, revisable folds, live ρ), but the projection arm (Arm B) and the head-to-head are unbuilt. §5 is the test; §6 names the conditions (junk tail, counterfeit folds, head-only workloads) under which it fails.
  • Speculative (flagged). That this suite, tracked over time, is a better operational dashboard for an emit-free store than predicate-count alarmism — that an operator should watch ρ, fold-reachable S, and the C curve's shape rather than the raw predicate total. We propose it; we have not run it as a longitudinal instrument.

The contribution is not a leaderboard number. It is a reframing of what to measure: in a store where generating a fact is nearly free, predicate proliferation is the signature of abundance, not a defect — and the right vital signs read the fold-reachability and repayment of that abundance, never its raw count. The one measurement we ran end-to-end (Y) both confirmed the thesis (real recall recovered from freely-minted variants) and caught the suite's own named failure mode in the live data. The number that would settle the whole argument, R, is defined, instrumentable, and waiting on a bounded build.


See also: the economic sibling Alignment debt (why carrying the debt is cheaper); the faithfulness mechanism that keeps the tail real, Decoupled anchoring; the contradiction-measurement theory in The Bitemporal Sheaf; the full program in the donto research agenda; abundance grounding in /reports.