Generative-abundance metrics: predicate proliferation as signal, not noise
donto papers · original research · 2026-06-13 · measurement framework, working draft v1
Abstract. A freely-extracted knowledge base accumulates a vast number of distinct predicates — donto's live store holds ~1.16M of them over ~42.0M statements. The standing reflex of every schema-disciplined system is to read that number as a quality failure: uncontrolled vocabulary, normalize it away. We argue the reflex is backwards. Proliferation is the signature of abundance — the observable trace of an emit-free pipeline that invents predicates as it reads — and it carries recoverable signal that aggressive ingest-time typing destroys irreversibly. The contribution is a metric suite purpose-built for emit-free stores: predicate-mass concentration C, the singleton fraction S, fold density δ and fold yield Y, alignment-debt-repayment rate ρ, and a recoverable-signal measure R that compares recall on the freely-typed store against recall on a normalized projection. We ground every metric in live donto-pg numbers (73.1% of predicates are used exactly once; the top 100 predicates carry 85.1% of all statements). We also report — honestly — that our one measured fold yield is contaminated: expanding the hub predicate bornIn lifts recall a raw +8.7%, but only +7.0% of that is real near-synonym recovery (wasBornIn, birthplaceOf) and the rest is test-fixture pollution in the closure — a live instance of the very "counterfeit fold" failure mode the suite is built to catch. We state the one decisive head-to-head we have not run (R) and predict its sign: recall on the freely-typed store ≥ recall on a normalized projection, at strictly lower ingest cost. The metric definitions are new and R is to-be-run; the snapshot is consistent with the prediction but does not prove it. This is the measurement sibling of the economic argument in Alignment debt: that paper says carrying the debt is cheaper; this one says how to read the store's vital signs and which number to stop being alarmed by.
1. The number everyone misreads
Show a data engineer "1.16 million distinct predicates" and the trained response is immediate: that is a vocabulary that has lost control; the store needs canonicalization. The response is correct under the assumption that built the field — that a predicate is an expensive, deliberate schema decision, so a million of them means a million failures of discipline.
That assumption no longer holds. donto's predicates were not authored by a schema committee; they were minted by extraction lanes told to invent predicates freely (/mnt/donto-data/workspace/donto/apps/donto-api/prompts/extract_broad.txt; see decoupled anchoring). At ~$0.0001 per fact, a guided model emits an unbounded, multi-directional space of properties about any entity — bornIn, birthPlace, placeOfBirth, wasBornIn, birthplaceOf and dozens of other near-synonyms of one relation — as a feature of how it reads, not a failure of how it was governed. The million predicates are the trace of abundance: every one is a real, evidence-anchored statement that a collapse-on-conflict store would have thrown away.
So the question this paper answers is not "how do we reduce the predicate count?" It is: what are the right vital signs of an emit-free store, such that proliferation reads as health and the genuine pathologies (un-repayable debt, counterfeit folds, dead tail) read as alarms? A wrong instrument — one that alarms on predicate count — drives you to normalize at ingest, and normalization at ingest is the one move you cannot undo (it violates I3, no destructive overwrite). The instrument matters because it determines whether you keep the signal or burn it.
2. What the live store actually looks like
Before defining metrics, the shape. Measured 2026-06-13 against donto-pg (donto_statement, upper(tx_time) IS NULL); the store is live and growing, so counts are point-in-time to ±a few statements.
| quantity | value | what it says |
|---|---|---|
| live statements | 42,040,511 | the firehose |
| distinct live predicates | 1,160,474 | the proliferation |
| predicates used exactly once (singletons) | 848,796 (73.1%) | the long tail |
| predicates used ≤ 2 times | 1,005,598 (86.7%) | the tail is the store, by count |
| median predicate use-count | 1 | most predicates are mentioned once and never again |
| predicates used ≥ 1,000 times | 677 | the head, by count |
| statements under those 677 head predicates | 38,364,614 (91.3%) | the head carries the mass |
| statements under the top 100 predicates | 35,756,596 (85.1%) | extreme concentration |
| max single-predicate use-count | 4,164,423 | the hottest hub |
The store is a textbook heavy-headed power law: a tiny head (100–677 predicates) carries ~85–91% of all statements, while a vast tail (849K singletons) carries the diversity — one statement each. This is not disorder. It is the precise shape you would predict if a fluent generator emits common relations constantly (name, bornIn, parentOf) and rare, situation-specific relations once (wasWitnessAtBaptismOf, boiledDownTallowAt). The head is where recall and folding pay off; the tail is where abundance lives and where normalization would do its damage. Any honest metric suite has to treat the two regions differently — a single store-wide average over a power law is a lie.
2.1 The fold side
donto folds these predicates at read time through donto_predicate_closure (1,045,762 rows; 1,035,738 self-identity + 10,024 fold edges) via the donto_match_aligned(...) SQL function, whose p_expand_predicates flag is exactly the knob this paper's metrics toggle. The fabric that feeds it:
| quantity | value | source |
|---|---|---|
| predicate embeddings (fold candidates) | 1,035,744 | donto_predicate_embedding |
fold edges (relation <> 'self') |
10,024 | donto_predicate_closure |
| — by kind | 8,438 close_match · 1,412 exact_equivalent · 132 inverse_equivalent · 30 sub_property_of · 12 decomposition | group by relation |
| predicates with ≥ 1 fold target | 3,874 (~0.33% of distinct) | distinct non-self predicate_iri |
| mean fold fan-out (among folded) | 2.59 (max 85) | per-predicate edge count |
| entity embeddings | 3,160,590 | donto_entity_embedding |
| identity edges | 169 | donto_identity_edge |
And the repayment is live: the continuous alignment daemon (donto_align_daemon_status) was last ticking 2026-06-13 14:21 UTC — ~298 ticks, 7,387 predicates + 47,315 entities embedded, 2,571 predicate alignments proposed and registered, with a real GLM-cap-aware backlog of 20,686 predicate candidates and 317 identity candidates still queued (cap_reason = entity adjudication: transient GLM rate-limit). The store is in the middle of paying down its alignment debt, throttled by a flat-subscription rate limit, not finished. That ongoing process is itself a metric (§3.4).
3. The metric suite
Each metric below is (a) defined precisely, (b) instrumentable from tables that already exist, (c) given its live value, and (d) tagged with the failure mode that makes the naïve version misleading. The design rule throughout is the canon's no brittle logic — every threshold is a measured property of the data, never a hand-maintained list of "good" predicates.
3.1 Predicate-mass concentration C
Definition.
C(k)= fraction of live statements carried by the top-kpredicates by use-count. Report the curve, not a point.
Live: C(100) = 0.851, C(677) = 0.913. Why it matters: C separates the queryable head from the abundant tail. A healthy emit-free store has high C (the firehose concentrates mass on common relations the generator reaches constantly) and a long tail (it keeps minting rare ones). A falling C over time with a flat tail would be the real alarm — mass leaking into a middle band of near-duplicate medium-frequency predicates that should have folded but didn't. Failure mode: C alone cannot distinguish "healthy head" from "the head is three unfolded spellings of name." Pair it with fold yield (§3.3).
3.2 Singleton fraction S — the proliferation gauge
Definition.
S= fraction of distinct predicates used exactly once.
Live: S = 848,796 / 1,160,474 = 0.731. Why it matters: S is the direct proliferation signal, and it is the number people misread. High S is not noise — it is the generator exercising its full expressive range, minting a bespoke predicate for a one-off relation it actually found in the source. The vast majority of these are never queried, never folded, and cost nothing until something asks for them (the alignment-debt "cold tail"). The non-brittle reading: do not alarm on S rising; alarm only if a singleton is queried and cannot be folded to a populated equivalent — i.e. abundance that turned into an actual recall miss. That conditional is what §3.5 measures. Failure mode: treating S as a quality score and normalizing the tail away — the destructive, I3-violating move; the tail's singletons are exactly the rare evidence-anchored facts a contested-knowledge store exists to keep.
3.3 Fold density δ and fold yield Y
Definition. Fold density
δ= fraction of queried predicates that have ≥ 1 fold target. Fold yieldY(p)= the recall lift from expanding predicatep= (statements with fold ON − statements with fold OFF) / (statements with fold OFF).
Live aggregate δ ≈ 0.33% (3,874 of 1.16M predicates have any fold), but the aggregate is the wrong denominator — queries cluster on the head, and the head is where folds concentrate. A measured Y on a head predicate that does fold (bornIn, which has 70 closure equivalents; timings on donto-pg, three warm runs):
read on bornIn |
statements | cold | warm |
|---|---|---|---|
fold OFF (p_expand_predicates=false) |
1,407 | ~19.5 ms | ~5 ms |
fold ON (donto_match_aligned, conf ≥ 0.8) |
1,530 | ~81 ms | ~7 ms |
Raw Y(bornIn) = +8.7% recall at ~4× cold / ~1.4× warm read cost. But the headline number is dirty, and the honest decomposition is the real finding. Of the +123 recovered statements:
| recovered under | count | kind |
|---|---|---|
wasBornIn |
78 | real near-synonym ✅ |
birthplaceOf |
21 | real near-synonym ✅ |
test:*/bornIn fixtures (24 distinct, 1 each) |
24 | closure pollution ❌ |
So the real near-synonym yield is +7.0% (99 statements an exact-match read could never reach), and +1.7% is test-fixture predicates that leaked into donto_predicate_closure (67 of bornIn's 70 closure edges are test:-prefixed fixtures; only 2 are production). Why this matters more, not less: Y is the metric that proves proliferation carries recoverable signal — wasBornIn and birthplaceOf were minted freely and are invisible to an exact read, recovered only because folding ran at query time. And the contamination is itself the §6 "counterfeit fold" failure mode caught in the live store: a Y reported without auditing what each recovered statement actually folded from would have overstated the win by ~20%. Failure modes, both live here: (i) measuring δ store-wide (0.33%, looks dead) instead of workload-weighted (the head, where it is the live action); (ii) reporting Y as a single number without the fold-provenance breakdown that separates real recovery from polluted edges.
3.4 Alignment-debt-repayment rate ρ
Definition.
ρ= the rate at which the continuous aligner converts un-folded predicate/entity candidates into closure edges, relative to the standing backlog. Read directly offdonto_align_daemon_status.
Live: backlog of 20,686 predicate candidates + 317 identity candidates; 2,571 alignments registered to date across ~298 ticks; throttled by a GLM rate limit (cap_reason = entity adjudication: transient GLM rate-limit). Why it matters: an emit-free store is healthy only if debt is being serviced — ρ > 0 and the backlog bounded. ρ is the store's metabolism. A rising, unbounded backlog with ρ → 0 is a genuine pathology (the aligner died, or the firehose outran it). This is the metric the alignment-debt paper flagged as speculative ("debt-repayment as a health signal"); here it is concretely a donto_align_daemon_status query. Failure mode: reading backlog size as alarm in absolute terms — it is expected to be large on a budget-throttled flat-subscription lane (the canon's spend constraint). The signal is ρ and backlog trend, not backlog magnitude. (See also §6: ρ is confounded by the throttle and must be normalized by available aligner-calls.)
3.5 Recoverable-signal R — the metric that settles the argument
Definition. For a fixed recall workload
Q,R = recall(freely-typed store, fold ON) − recall(normalized-at-ingest projection). PositiveRmeans proliferation + late folding kept signal that ingest-time typing lost.
This is the head-to-head, and the one number that converts the thesis from assertion to result. R > 0 would establish that recall on the freely-typed store ≥ recall on a normalized projection — the agenda's target result. We have the freely-typed arm (live) and the read-path operator (donto_match_aligned); we have not built the normalized-at-ingest projection arm (the same Arm B that alignment-debt §7 specifies). So R is defined and instrumentable but unmeasured — §5 is the experiment. The §3.3 measurement is a one-predicate lower-bound intuition for R, and even there the honest figure is the +7.0% real recovery, not the +8.7% raw: a normalizer that collapsed wasBornIn/birthplaceOf to a different canonical choice than the query uses would miss those 99 statements outright, and one that collapsed them irreversibly could never recover the distinction. But one predicate is not a workload; R is the workload-level claim.
4. Why only donto can compute these
The suite needs three things at once, and they are donto's three invariants — the same intersection that makes the bitemporal sheaf and alignment-debt buildable:
- A genuinely freely-minted store (emit-free ingest). You cannot measure proliferation-as-signal on a store that never proliferated. The 1.16M predicates are real, minted by
extract_broad.txt, not a synthetic stress test.S = 0.731is an empirical property of how a fluent extractor reads, observable only on a store that let it. - No destructive overwrite (I3).
Ris only meaningful if both arms are reachable from the same facts: the freely-typed arm exists and a normalized projection can be derived from it without having already destroyed the variants. A store that normalized at ingest has only Arm B and can never run the comparison. donto keeps every variant as a live statement and the fold as a separate, revisabledonto_predicate_closureedge — which is also why the §3.3 contamination is recoverable (drop the bad edges and re-read), not baked in. - Query-time folding as a first-class operator.
Yand the fold-ON arm ofRrequire folding to be a read-path function (donto_match_alignedwithp_expand_predicates), not a batch ETL. Without it, "recoverable signal" is a hypothesis you cannot test; with it, it is a function call with a boolean flag — and, as §3.3 shows, an auditable one.
A vector DB cannot run this (dedup collapses the variants — S is undefined, Y is zero by construction). A normalized KG cannot run this (it is the projection arm — there is no freely-typed arm to compare against). Only a paraconsistent, emit-free, align-at-read substrate has both arms and the operator that connects them.
5. The experiment that turns R into a result
The decisive measurement is a single head-to-head over donto's own corpus, reusing harnesses that already exist (LongMemEval, LoCoMo, genealogy recall):
- Arm A (live). The freely-typed store. Recall via
donto_match_aligned(..., p_expand_predicates=true). Prerequisite: purge thetest:-prefixed fixture edges fromdonto_predicate_closurefirst, or §3.3's contamination inflates Arm A. - Arm B (to build). A normalized-at-ingest projection: take the same extracted facts and canonicalize predicates forward at ingest, using the same alignment signals (
donto_predicate_embedding+donto_predicate_closure) applied write-side instead of read-side. A bounded build — the alignment fabric already exists; it is run in the other direction. - Fixed workload
Q. Run the same recall queries against both arms, computingR = recall(A) − recall(B)overall and per power-law band (head / mid / tail) — because the prediction is band-dependent:R ≈ 0on the head (both arms fold common predicates the same way) andR > 0on the mid/tail, where ingest-time normalization had to guess a canonical form and guessed differently than the query, or collapsed a distinction the query needed. - Cost. Pair
Rwith ingest cost (A: zero normalize; B: the per-fact normalization surchargeΔ_norm·Nfrom alignment-debt) to state the joint claim:R ≥ 0at strictly lower ingest cost.
Prediction.
R ≥ 0workload-wide, concentrated in the mid/tail bands, at lower ingest cost — because late folding is revisable (it re-amortizes as alignment improves; the liveρ > 0daemon is doing exactly this) while ingest-time normalization is a one-shot irreversible guess. Abundance does not merely tolerate deferral; deferral recovers signal that prevention destroyed.
The snapshot is consistent with this — high S, positive real Y on the one head predicate measured, a live repayment daemon — but consistency is not proof, and the same one-predicate probe surfaced a closure-pollution defect that would bias the result if uncorrected. R itself is unmeasured until Arm B exists.
6. The honest failure modes
Named explicitly, in the spirit of the seed papers — and one of them is already realized in §3.3:
Scould be hiding dead weight, not signal. The thesis that singletons are recoverable signal holds only to the extent that queried singletons fold to populated equivalents. If a large share of the tail is genuinely junk (kebab-clause objects,containsCommaCharacter-style padding — the exact hygiene failure the extraction prompt fights), thenSoverstates health. This is measurable: sample queried singletons, check fold reachability and evidence-anchoring (via the decoupled anchoring citer). The honest version ofSis anchored, fold-reachable singleton fraction, not rawS.R(andY) can be ≤ the headline if folds are wrong — and ours already are, partly. A counterfeit fold returns statements that should not match, so Arm A can lose on precision even while winning on recall. We did not have to hypothesize this: 67 ofbornIn's 70 closure edges aretest:fixtures, inflating rawYfrom a real +7.0% to a reported +8.7%. The 0.8 confidence floor and theclose_match/exact_equivalentsplit are guardrails but did not stop fixture leakage;Rmust be reported as recall and precision, with fold provenance audited. This couples directly to the alignment-debt quality caveat.- Band-weighting can manufacture either sign. Over a head-only workload,
R ≈ 0(the boring true answer) and the thesis under-delivers; over a tail-heavy workload,Ris large but on rarely-queried predicates whose practical value is small. The honest report is the per-band curve weighted by a real workload's query distribution, not a flattering slice. - The aligner's throttle confounds
ρ.ρis currently capped by the GLM rate limit, so a lowρreflects budget, not the store's intrinsic alignability.ρis a health signal conditional on the lane's available capacity; report it normalized by available aligner-calls, or it measures the wallet, not the store.
7. Status: measured / defined / conjectured / speculative
- Measured (solid). The store shape: 1,160,474 distinct predicates over ~42.04M statements;
S = 0.731(848,796 singletons), median use 1, max 4,164,423;C(100) = 0.851,C(677) = 0.913; 10,024 fold edges over 3,874 folded predicates; a live aligner mid-repayment (backlog 20,686 predicate candidates, ~298 ticks). All reproducibledonto-pgreads with the queries in §2–§3. - Measured with a caveat.
Y(bornIn): raw +8.7% recall, of which +7.0% is real near-synonym recovery (wasBornIn78,birthplaceOf21) and +1.7% is test-fixture pollution indonto_predicate_closure(67/70 ofbornIn's edges aretest:fixtures). The real number is the +7.0%; the gap is itself a finding (counterfeit-fold detection works only if you audit provenance). - Defined (solid). The six metrics (
C,S,δ,Y,ρ,R), each instrumentable from existing tables, each paired with its failure mode. The definitions follow from the power-law structure of an emit-free store and are internally consistent. - Conjectured (open). The headline —
R ≥ 0at lower ingest cost, concentrated in the mid/tail. Snapshot-consistent (realY > 0, revisable folds, liveρ), but the projection arm (Arm B) and the head-to-head are unbuilt. §5 is the test; §6 names the conditions (junk tail, counterfeit folds, head-only workloads) under which it fails. - Speculative (flagged). That this suite, tracked over time, is a better operational dashboard for an emit-free store than predicate-count alarmism — that an operator should watch
ρ, fold-reachableS, and theCcurve's shape rather than the raw predicate total. We propose it; we have not run it as a longitudinal instrument.
The contribution is not a leaderboard number. It is a reframing of what to measure: in a store where generating a fact is nearly free, predicate proliferation is the signature of abundance, not a defect — and the right vital signs read the fold-reachability and repayment of that abundance, never its raw count. The one measurement we ran end-to-end (Y) both confirmed the thesis (real recall recovered from freely-minted variants) and caught the suite's own named failure mode in the live data. The number that would settle the whole argument, R, is defined, instrumentable, and waiting on a bounded build.
See also: the economic sibling Alignment debt (why carrying the debt is cheaper); the faithfulness mechanism that keeps the tail real, Decoupled anchoring; the contradiction-measurement theory in The Bitemporal Sheaf; the full program in the donto research agenda; abundance grounding in /reports.