The Seven Sisters across 32 cultures — a contested-knowledge testbed for donto

Living document. 2026-06-12. We built a cross-cultural corpus of the Seven Sisters / Pleiades story — Greek, Aboriginal Australian, Native American, Polynesian, Andean, African, Near Eastern, and the deep-time "oldest story" scholarship — saved every source to disk, and extracted it into donto as evidence-anchored claims via the donto-agent GLM lane. This is the report on what we built, what the extraction engineering taught us, and what the substrate now holds. Snapshot taken mid-run (~110 of 261 chunks extracted); numbers will grow.

The Seven Sisters is, plausibly, the most widely shared story on Earth. The same six-or-seven stars in Taurus carry a strikingly convergent narrative across cultures that never met: seven sisters (or girls, or wives, or hens), usually pursued by a man or a hunter, who escape into the sky; and almost everywhere, a puzzle — the cultures say seven, but only six are easy to see. That convergence is either the deepest coincidence in world folklore or evidence that the story is older than the human dispersal out of Africa — a claim now argued from the proper motion of the stars themselves.

That is not a literature problem. It is a donto problem. The sources disagree, conflate myth with geography, attach the same sky to incompatible meanings, and span 100,000 years of contested transmission. You cannot answer "what is the Seven Sisters story?" by picking a winner and deleting the rest — the disagreement is the data. donto is built to hold exactly this: bitemporal, paraconsistent, evidence-first, identity-as-hypothesis. So we used the Seven Sisters as a new example consumer — the same machinery as the genealogy work, pointed at a different kind of contested knowledge.

1. Why this is the right test for the substrate

donto's north star is contested reality: hold every claim about a thing, anchored to its source, link the incompatible ones, and re-rank by corroboration over time instead of collapsing on conflict. The Seven Sisters exercises every hard invariant at once:

Paraconsistency. Is the "lost" seventh sister Electra (hiding her face in grief for Troy) or Merope (ashamed of marrying a mortal)? Greek sources themselves disagree, and that is before you add the Aboriginal, Kiowa, and Cherokee accounts of why one star is faint or missing. donto holds all of them as legal state.
Identity-as-hypothesis. Is Māori Matariki "the same as" the Greek Pleiades? As Japanese Subaru? As Aboriginal Kungkarrangkalpa? They are the same stars with different ontologies. Identity is a query-time hypothesis the alignment engine proposes — not a merge baked in at write time.
Evidence-first. Every claim is anchored to the span of source text it came from, so "the Pleiades mark the planting season" is traceable to Hesiod, to the Zulu isiLimela, and to the Andean Qullqa — independently.
Cross-source triangulation (the discovery thesis). The new understanding is supposed to fall out of connecting evidence-anchored claims across sources — exactly the lost-Pleiad and deep-time questions below.

The substrate stays domain-neutral; the Seven Sisters just proves it on a harder corpus than genealogy.
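As a rough illustration of what "evidence-first" and "paraconsistent" mean as a data shape, here is a minimal Python sketch. The Claim fields, the ex:identifiedAs predicate, the source labels, and the source sentences are all hypothetical (the sentences are illustrative paraphrases, not quotations from the corpus), and none of this is donto's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    obj: str
    source: str  # hypothetical source label
    start: int   # character offsets of the supporting span in the source text
    end: int


def anchor(subject, predicate, obj, source, text, excerpt):
    """Build a claim only if the excerpt really occurs in the source text."""
    start = text.find(excerpt)
    if start < 0:
        raise ValueError(f"excerpt not found in {source!r}")
    return Claim(subject, predicate, obj, source, start, start + len(excerpt))


# Illustrative paraphrases, NOT quotations from the corpus.
SOURCES = {
    "source-a": "One account says Electra hid her face in grief for Troy.",
    "source-b": "Another says Merope was ashamed of marrying a mortal.",
}

store = [
    anchor("ex:lost-pleiad", "ex:identifiedAs", "ex:Electra", "source-a",
           SOURCES["source-a"], "Electra hid her face in grief for Troy"),
    anchor("ex:lost-pleiad", "ex:identifiedAs", "ex:Merope", "source-b",
           SOURCES["source-b"], "Merope was ashamed of marrying a mortal"),
]


def competing(claims, subject, predicate):
    """Return every distinct object asserted for (subject, predicate).

    Nothing is overwritten or deleted: incompatible answers coexist.
    """
    return sorted({c.obj for c in claims
                   if c.subject == subject and c.predicate == predicate})
```

Calling competing(store, "ex:lost-pleiad", "ex:identifiedAs") returns both candidates, and each claim's offsets can be sliced back out of its source text to recover the supporting span.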

2. What we built

Mirroring the genealogy pipeline (/home/ajax/donto-resources/, ingest_frontier.py): save every source to disk, register it as a retrievable document, extract facts anchored to it.

63 sources, ~465 KB, 32 cultures, saved to raw/<culture>/<source>.md with provenance frontmatter (title, culture, era, author, source URL, license). Public-domain primary texts preferred; for copyrighted scholarship we save the claims, not the prose.
Coverage by tradition: Greek/Roman primary texts (Hesiod, Homer, Ovid, Hyginus, Apollodorus, Aratus, Sappho, Pausanias); Aboriginal Australian (public material only — NMA Songlines, Kungkarrangkalpa, the Wurundjeri Karatgurk fire-sisters); Native American (Devils Tower / Bear Lodge across Kiowa·Lakota·Cheyenne·Arapaho·Crow, Cherokee, Navajo, Blackfoot, Monache, Onondaga); pan-cultural astronomy (Subaru, Krittika, Mao, Kimah, al-Thurayya, Parvin); Pacific / Andean / Mesoamerican / African (Matariki, Makaliʻi, Qullqa, Tianquiztli, Tzab-ek, isiLimela, Khuseti); and a dedicated deep-time cluster (Norris's 100,000-year argument, d'Huy's phylomythology, the lost-Pleiad survey, the Nebra disc, and the critiques).

CARE respect. Aboriginal Seven Sisters songlines hold restricted, secret-sacred knowledge. We used only public, published material, treated First Nations sources as attributed interpretive witnesses (never ground truth), and built a restricted: true frontmatter flag that hard-skips a source from extraction. Nothing secret-sacred was sought, saved, or extracted.
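The hard-skip can be sketched in a few lines of Python. This assumes the frontmatter is a leading block of key: value lines delimited by --- lines; that layout and the function names are assumptions for illustration, not the project's actual driver code.

```python
def parse_frontmatter(text):
    """Split a document into (metadata dict, body).

    Assumes a leading '---' delimited block of 'key: value' lines.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at all
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            meta[key.strip()] = value.strip()
    # Fail closed: a malformed header must not let a restricted source through.
    raise ValueError("unterminated frontmatter block")


def is_restricted(meta):
    """True when the source carries 'restricted: true'."""
    return meta.get("restricted", "false").lower() == "true"


def extractable_body(text):
    """Return the body to extract, or None if the source must be skipped."""
    meta, body = parse_frontmatter(text)
    return None if is_restricted(meta) else body
```

The check runs before any text reaches the model, so a restricted source is never chunked or sent for extraction. Raising on a malformed header is a deliberate fail-closed choice in this sketch.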

3. Extraction engineering — what actually worked (and what didn't)

This is the part worth keeping. We learned more about the extraction stack here than from the corpus.

3.1 opencode/glm-5.1 returned zero facts; donto-agent/glm-4.7 returned hundreds

The first attempt ran the donto-api /extract-and-ingest opencode path (glm-5.1, thinking-off). On the exact same source it produced 0 facts across all four passes — the documented "empty extraction" failure mode. Switching to donto-agent extract --provider glm (the Rust harness, glm-4.7 with thinking enabled) turned the same source into 200+ anchored claims. The lesson, now load-bearing: for GLM extraction, glm-4.7 + thinking-on is the productive config; glm-5.1/thinking-off is a dead lane.

3.2 Chunking is the abundance lever — measured

A whole file self-saturates the gleaning loop: the model declares itself "done" by choice, not budget. Raising --max-passes from 3 to 8 on a 13.8 KB source changed nothing — it still stopped at 3 passes / 284 facts. But the same source, chunked, stays productive:

input	facts	density
whole 13.8 KB file	284	~20 facts/KB
one 1.8 KB chunk	82 (used all 4 passes)	~45 facts/KB

So we split each source into ~2 KB chunks and run a 4-pass gleaning loop per chunk, with every chunk writing into that source's single context. Chunking roughly doubles fact density (~20 to ~45 facts/KB in the measurement above) and is what turns a medium corpus into tens of thousands of claims. This is "abundance" made operational: emit freely, defer the joining to query time.
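A minimal chunker along these lines, plus the density arithmetic from the table, might look like the following Python sketch. Splitting on blank-line paragraph boundaries is an assumption (the report does not say where the real driver cuts), and chunk_text and density are hypothetical names.

```python
def chunk_text(text, target_bytes=2048):
    """Greedily pack paragraphs into chunks of roughly target_bytes (UTF-8).

    A paragraph is never split, so one oversized paragraph becomes its own chunk.
    """
    chunks, current, size = [], [], 0
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        n = len(para.encode("utf-8"))
        if current and size + n > target_bytes:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def density(facts, kilobytes):
    """Facts per KB of source text."""
    return facts / kilobytes


# The two measurements reported above.
whole_file = density(284, 13.8)  # about 20.6 facts/KB
one_chunk = density(82, 1.8)     # about 45.6 facts/KB
ratio = one_chunk / whole_file   # a little over 2x
```

Keeping paragraphs intact means each chunk stays readable on its own, at the cost of chunk sizes that only approximate the 2 KB target.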

3.3 The GLM rate limit, and pacing vs. lane rotation

At 8 parallel workers the run flew for ~38 minutes, then hit a wall of HTTP 429 code 1302 — "rate limit reached for requests." Crucially this is not the weekly quota (code 1310): single probes kept succeeding, so it's a requests/time throttle. The fix was to dial concurrency down to 3 (which sits under the throttle) and make the runner resumable so the ~200 throttled chunks simply re-queue. We also tested the Hyades/holo3.1 lane as a failover, but it was returning empty (0 tokens) under load, so GLM-paced remained the productive path. Trade-off accepted: slower, but steady and complete.
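The pacing logic can be sketched as follows in Python. The two error codes come from the report; ProviderError, run_round, and the extract callable are hypothetical stand-ins for whatever the real runner uses, and the sketch makes no network calls.

```python
from concurrent.futures import ThreadPoolExecutor

RATE_LIMIT = 1302    # requests/time throttle (per the report)
WEEKLY_QUOTA = 1310  # weekly quota exhausted (per the report)


class ProviderError(Exception):
    """Hypothetical error type carrying the provider's error code."""

    def __init__(self, code):
        super().__init__(f"provider error {code}")
        self.code = code


def run_round(chunk_ids, extract, workers=3):
    """Run one paced round; return (done, requeue).

    extract(chunk_id) is assumed to return a fact count or raise ProviderError.
    """
    def attempt(cid):
        try:
            return cid, extract(cid), None
        except ProviderError as err:
            return cid, 0, err.code

    done, requeue = {}, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the re-queue order is stable.
        for cid, facts, code in pool.map(attempt, chunk_ids):
            if code == WEEKLY_QUOTA:
                raise RuntimeError("weekly quota exhausted; pacing will not help")
            if code == RATE_LIMIT:
                requeue.append(cid)  # throttled: try again in a later round
            elif code is not None:
                raise ProviderError(code)  # unknown failure: surface it
            else:
                done[cid] = facts
    return done, requeue
```

Distinguishing the two codes matters because they call for opposite responses: a throttle is worth waiting out at lower concurrency, while an exhausted quota is not.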

3.4 A 0-fact result is not "done" (the no-shortcuts fix)

The throttle exposed a correctness bug worth naming: a chunk that returns zero facts (empty/failed response) must not be marked done, because that would be a coverage lie. We made ingested > 0 a precondition of success, so empty/failed chunks re-queue instead of silently passing. "Extracted" means facts verified in donto_statement, never merely "job attempted".
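In a resumable manifest, that rule reduces to one guard. The Python sketch below assumes a JSON manifest keyed by chunk id; the file layout and function names are hypothetical, and the ingested count is taken as an argument where the real runner would verify it against donto_statement.

```python
import json
from pathlib import Path


def load_manifest(path):
    """Load the manifest, or start empty on the first run."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def save_manifest(manifest, path):
    Path(path).write_text(json.dumps(manifest, indent=2, sort_keys=True))


def record(manifest, chunk_id, ingested):
    """Mark a chunk done only when at least one fact was verified as ingested."""
    if ingested > 0:
        manifest[chunk_id] = {"status": "done", "ingested": ingested}
    else:
        # Empty or failed response: leave it pending so it re-queues.
        manifest[chunk_id] = {"status": "pending", "ingested": 0}
    return manifest[chunk_id]["status"]


def pending(manifest, all_chunk_ids):
    """Chunks still owed an extraction, including ones never attempted."""
    return [c for c in all_chunk_ids
            if manifest.get(c, {}).get("status") != "done"]
```

On restart the runner only needs pending(manifest, all_chunk_ids) to know what is left, so a throttled or crashed run resumes without redoing verified work.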

4. What the substrate holds now (mid-run snapshot)

As of this writing, ~110 of 261 chunks are extracted and the run continues:

~11,000 statements across 48 contexts
~3,000 distinct predicates — the abundance signature: the model invents the relations as it goes (originalGreekTitle, keyedTo, providesGuidanceOn, datesToEra, ascendsToBecome, pursuedBy, …). These are not a proliferation problem; they are the firehose donto is designed to tame at query time, not at write time.
~2,600 distinct subjects
Evidence-anchored throughout — essentially every claim carries the source span it was extracted from (literal-object claims anchor lexically; relational claims carry their excerpt).

Claims by tradition (statements, snapshot):

tradition	statements	tradition	statements
aboriginal-australian	3,321	zulu	216
astronomy (the cluster itself)	1,268	filipino	194
deep-time / oldest-story	1,242	blackfoot	166
greek	1,024	aztec	155
african (synthesis)	726	maori	151
indonesian	532	maya	144
native-american	391	hopi	112
arabic	352	hawaiian	96
chinese	335	berber	84
polynesian	248	cherokee	78
biblical	242	european	74
… Hindu, Japanese, Persian, Lakota, Kiowa, Onondaga, Navajo, Khoikhoi, and Inca chunks were throttle casualties; they have been re-queued and are filling in.

5. The findings the corpus is built to surface

The point of holding all of this in one paraconsistent store is triangulation. Three patterns are already measurable in the claims, and they are exactly where donto's query-time alignment will earn its keep.

5.1 The pursuit motif is genuinely pan-cultural

246 statements across the corpus encode a pursuit / flight / hunter / transformation relation — and they come from cultures that never met: the Greek Orion chasing the Pleiades for seven years; the Western Desert ancestral man Wati Nyiru pursuing the Kungkarrangkalpa sisters across thousands of kilometres of songline; the Kiowa/Lakota giant bear chasing seven girls onto the rock that becomes Devils Tower; the Monache wild-onion wives fleeing their husbands. Held side by side and anchored to source, this is the raw material for testing d'Huy's claim that the gendered "pursuit" structure (motif I115A) is Paleolithic.

5.2 The "lost Pleiad": seven named, six seen

114 statements touch the seven-vs-six puzzle and the lost/hidden/shy seventh sister. The Greek tradition gives two competing explanations (Electra's grief; Merope's shame); donto holds both, anchored, as parallel claims rather than picking one. The same "one is missing/faint" beat recurs in Aboriginal, Cherokee, Onondaga, Kiowa, and Nez Perce accounts. This is the crux of the deep-time hypothesis (Norris & Norris 2021): ~100,000 years ago the proper motion of Pleione and Atlas left them resolvable as separate stars, so seven were visible, and the story may encode a sky no living human has seen. We deliberately also ingested the counter-argument (d'Huy's "reductionist" critique; the Baltic sieve counterexample; the convergent-number-seven alternative). Proponent and skeptic now sit in the store side by side as competing claims, which is the substrate's native shape; explicit contradiction edges between them are not drawn yet.

5.3 The agricultural calendar is the other universal

213 statements encode a season / harvest / planting / heliacal-rising relation. Across Hesiod, the Andean Qullqa (whose June brightness still tracks El Niño), the Zulu isiLimela ("the digging stars"), Javanese pranata mangsa rice-timing, and the Aztec New Fire ceremony, the Pleiades' heliacal rising marks when to plant, sail, or count the year. Whether this is shared inheritance or independent convergence (the same conspicuous cluster, the same agricultural need) is a question you can only pose once the claims are in one place, aligned by meaning rather than by string.
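The statement counts in 5.1 to 5.3 are lexical matches over free-typed predicates, not aligned meanings. A minimal Python sketch of that kind of count is below; the regex patterns and the sample triples are guesses for illustration (only pursuedBy and ascendsToBecome appear in this report), not the patterns actually used.

```python
import re
from collections import Counter

# Hypothetical patterns: the report does not list the real ones.
MOTIFS = {
    "pursuit": re.compile(r"pursu|chas|flee|fled|flight|hunt|transform", re.I),
    "lost_sister": re.compile(r"lost|hidden|missing|faint", re.I),
    "calendar": re.compile(r"season|harvest|plant|heliacal", re.I),
}


def count_motifs(statements):
    """Count statements whose predicate or object matches each motif.

    statements: iterable of (subject, predicate, object) string triples.
    A statement counts at most once per motif.
    """
    counts = Counter()
    for _subject, predicate, obj in statements:
        haystack = f"{predicate} {obj}"
        for name, pattern in MOTIFS.items():
            if pattern.search(haystack):
                counts[name] += 1
    return counts


sample = [
    ("ex:pleiades", "pursuedBy", "ex:orion"),
    ("ex:isiLimela", "marksSeason", "planting"),  # invented predicate
    ("ex:merope", "ascendsToBecome", "star"),
]
motif_counts = count_motifs(sample)
```

A count like this is cheap and needs no synonym table, but it misses any predicate that expresses the motif in words the pattern does not anticipate, which is why the numbers are best read as lower bounds until predicates are aligned.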

5.4 Identity-as-hypothesis, waiting for the alignment engine

The corpus now contains dozens of culture-specific names for one object: Pleiades, Subaru, Matariki, Krittika, Mao, Kimah, al-Thurayya, Parvin, Qullqa, Tianquiztli, isiLimela, Khuseti, Karatgurk. None are merged. The need is already visible within a single tradition: the store holds both ex:electra and ex:Electra, and both ex:asterope and ex:asterope-22-tau, as distinct subjects, because the model mints near-duplicate identities as it goes. Resolving those is emphatically not a job for a hand-maintained alias table (that would be the anti-pattern); it is the continuous alignment engine's job, done by embedding similarity and resolved at query time. The next step is to let donto's pgvector + alignment engine propose electra ≈ Electra and Matariki ≈ Pleiades ≈ Subaru as scored, query-time identity hypotheses, and then ask the genuinely new question: which meanings travel with the stars (the sisters, the pursuit, the calendar) and which are local inventions?
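To make "scored, query-time identity hypotheses" concrete, here is a small Python sketch. It uses character-trigram cosine similarity as a stand-in for real embeddings, which is an assumption made purely so the example runs without a model; the function names and the 0.6 threshold are hypothetical and this is not the alignment engine's implementation.

```python
import math
from collections import Counter
from itertools import combinations


def trigram_vector(name):
    """Bag of character trigrams for the local name, case-folded."""
    local = name.split(":", 1)[-1].lower()  # drop the 'ex:' style prefix
    padded = f"  {local} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def propose_identities(subjects, threshold=0.6):
    """Return scored (a, b, score) hypotheses, best first.

    The subjects themselves are never merged or rewritten.
    """
    vectors = {s: trigram_vector(s) for s in subjects}
    hypotheses = []
    for a, b in combinations(sorted(subjects), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            hypotheses.append((a, b, round(score, 3)))
    return sorted(hypotheses, key=lambda h: -h[2])


subjects = ["ex:electra", "ex:Electra", "ex:asterope",
            "ex:asterope-22-tau", "ex:merope"]
hypotheses = propose_identities(subjects)
```

A surface-form stand-in like this can only catch the near-duplicate spellings (electra and Electra, asterope and asterope-22-tau). It proposes nothing for Matariki and Pleiades, which share no characters; that cross-cultural case is exactly where semantic embeddings are needed.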

6. Project checklist

Done

Project scaffolded, mirroring the genealogy consumer (raw/<culture>/<source>.md, chunking driver, resumable manifest, CARE restricted: flag)
63 sources / 32 cultures saved to disk with provenance frontmatter (public-domain primaries preferred; claims-only for copyrighted scholarship)
Extraction engine built on the Rust donto-agent lane — chunk-by-chunk, multi-pass gleaning, evidence-anchored into donto_statement
Diagnosed the opencode/glm-5.1 0-fact failure; switched to donto-agent/glm-4.7 + thinking-on
Established chunking as the abundance lever (measured ~20 → ~45 facts/KB)
Handled the GLM 429 / code-1302 throttle: paced concurrency, made the run resumable, and made 0-fact-≠-done (no-shortcuts rule)
~11k evidence-anchored claims across 23+ traditions ingested (run continuing)
Report published to donto.org/reports

In progress / next

Finish extracting the remaining chunks (118/261 done) — extraction lane switched to Hyades holo3.1 @ concurrency 2
Backfill the GLM-throttle-casualty traditions (Hindu, Japanese, Persian, Lakota, Kiowa, Onondaga, Navajo, Khoikhoi, Inca)
Run the always-on citer to separate stated (anchored) from interpreted (hypothesis) claims
Embed predicates + entities and run the continuous alignment engine to propose identity hypotheses (Matariki ≈ Pleiades ≈ Subaru; electra ≈ Electra)
Draw donto_argument contradiction edges (Electra-vs-Merope; Norris-vs-d'Huy)
Query-time cross-cultural synthesis: rank convergences by corroboration, surface contradictions
Add longer primary texts (full book chapters) for deeper per-source abundance

7. Honest limits & next steps

The run is incomplete and uneven. GLM-throttle casualties left some traditions (Hindu, Japanese, Lakota, Persian…) under-extracted; they are re-queuing. The final corpus will be larger and more balanced.
Claims are free-typed and unaligned. The cross-cultural counts above are lexical regex matches over invented predicates, deliberately not a hand-maintained synonym table (that would be the anti-pattern). Real triangulation waits on the alignment engine folding predicates and proposing identities at query time. That is the next build.
No contradiction edges yet. The Electra-vs-Merope and Norris-vs-d'Huy disagreements are present as parallel claims but not yet linked with donto_argument edges. Promoting them to explicit contradictions is where the Lens Engine's verifier comes in.
Synthesis pass to come. With the corpus complete, the payoff is a query-time synthesis: rank the cross-cultural convergences by corroboration, surface the contradictions, and let the substrate — not a hand-written essay — say what is genuinely shared about the Seven Sisters.

The pipeline, corpus, and driver live at /home/ajax/donto-resources/seven-sisters/. It is resumable, CARE-respecting, and Rust-first on the donto-agent GLM lane — a second example consumer proving the substrate on contested knowledge that spans the whole of human time.