# The Seven Sisters across 32 cultures — a contested-knowledge testbed for donto

_Living document. 2026-06-12. We built a cross-cultural corpus of the Seven Sisters / Pleiades story — Greek, Aboriginal Australian, Native American, Polynesian, Andean, African, Near Eastern, and the deep-time "oldest story" scholarship — saved every source to disk, and extracted it into donto as evidence-anchored claims via the `donto-agent` GLM lane. This is the report on what we built, what the extraction engineering taught us, and what the substrate now holds. Snapshot taken mid-run (~110 of 261 chunks extracted); numbers will grow._

The Seven Sisters is, plausibly, the most widely shared story on Earth. The same six-or-seven stars in Taurus carry a strikingly convergent narrative across cultures that never met: seven sisters (or girls, or wives, or hens), usually **pursued** by a man or a hunter, who **escape into the sky**; and almost everywhere, a puzzle — the cultures say *seven*, but only *six* are easy to see. That convergence is either the deepest coincidence in world folklore or evidence that the story is older than the human dispersal out of Africa — a claim now argued from the proper motion of the stars themselves.

That is not a literature problem. It is a **donto** problem. The sources disagree, conflate myth with geography, attach the same sky to incompatible meanings, and span 100,000 years of contested transmission. You cannot answer "what is the Seven Sisters story?" by picking a winner and deleting the rest — the disagreement *is* the data. donto is built to hold exactly this: bitemporal, paraconsistent, evidence-first, identity-as-hypothesis. So we used the Seven Sisters as a new **example consumer** — the same machinery as the genealogy work, pointed at a different kind of contested knowledge.

---

## 1. Why this is the right test for the substrate

donto's north star is contested reality: hold every claim about a thing, anchored to its source, link the incompatible ones, and re-rank by corroboration over time instead of collapsing on conflict. The Seven Sisters exercises every hard invariant at once:

- **Paraconsistency.** Is the "lost" seventh sister Electra (hiding her face in grief for Troy) or Merope (ashamed of marrying a mortal)? Greek sources themselves disagree, and that is *before* you add the Aboriginal, Kiowa, and Cherokee accounts of why one star is faint or missing. donto holds all of them as legal state.
- **Identity-as-hypothesis.** Is Māori *Matariki* "the same as" the Greek *Pleiades*? As Japanese *Subaru*? As Aboriginal *Kungkarrangkalpa*? They are the same stars with different ontologies. Identity is a query-time hypothesis the alignment engine proposes — not a merge baked in at write time.
- **Evidence-first.** Every claim is anchored to the span of source text it came from, so "the Pleiades mark the planting season" is traceable to Hesiod, to the Zulu *isiLimela*, and to the Andean *Qullqa* — independently.
- **Cross-source triangulation (the discovery thesis).** The *new understanding* is supposed to fall out of connecting evidence-anchored claims across sources — exactly the lost-Pleiad and deep-time questions below.

The substrate stays domain-neutral; the Seven Sisters just proves it on a harder corpus than genealogy.

---

## 2. What we built

Mirroring the genealogy pipeline (`/home/ajax/donto-resources/`, `ingest_frontier.py`): **save every source to disk, register it as a retrievable document, extract facts anchored to it.**

- **63 sources, ~465 KB, 32 cultures**, saved to `raw/<culture>/<source>.md` with provenance frontmatter (title, culture, era, author, source URL, license). Public-domain primary texts preferred; for copyrighted scholarship we save the *claims*, not the prose.
- **Coverage by tradition:** Greek/Roman primary texts (Hesiod, Homer, Ovid, Hyginus, Apollodorus, Aratus, Sappho, Pausanias); Aboriginal Australian (public material only — NMA *Songlines*, Kungkarrangkalpa, the Wurundjeri *Karatgurk* fire-sisters); Native American (Devils Tower / Bear Lodge across Kiowa·Lakota·Cheyenne·Arapaho·Crow, Cherokee, Navajo, Blackfoot, Monache, Onondaga); pan-cultural astronomy (Subaru, Krittika, Mao, Kimah, al-Thurayya, Parvin); Pacific / Andean / Mesoamerican / African (Matariki, Makaliʻi, Qullqa, Tianquiztli, Tzab-ek, isiLimela, Khuseti); and a dedicated **deep-time** cluster (Norris's 100,000-year argument, d'Huy's phylomythology, the lost-Pleiad survey, the Nebra disc, and the *critiques*).

**CARE respect.** Aboriginal Seven Sisters songlines hold restricted, secret-sacred knowledge. We used only public, published material, treated First Nations sources as attributed interpretive witnesses (never ground truth), and built a `restricted: true` frontmatter flag that hard-skips a source from extraction. Nothing secret-sacred was sought, saved, or extracted.

---

## 3. Extraction engineering — what actually worked (and what didn't)

This is the part worth keeping. We learned more about the extraction stack here than from the corpus.

### 3.1 opencode/glm-5.1 returned **zero** facts; `donto-agent`/glm-4.7 returned hundreds
The first attempt ran the donto-api `/extract-and-ingest` opencode path (glm-5.1, thinking-off). On the *exact same source* it produced **0 facts** across all four passes — the documented "empty extraction" failure mode. Switching to **`donto-agent extract --provider glm`** (the Rust harness, glm-4.7 with thinking **enabled**) turned the same source into 200+ anchored claims. The lesson, now load-bearing: for GLM extraction, **glm-4.7 + thinking-on is the productive config; glm-5.1/thinking-off is a dead lane.**

### 3.2 Chunking is the abundance lever — measured
A whole file self-saturates the gleaning loop: the model declares itself "done" by *choice*, not budget. Raising `--max-passes` from 3 to 8 on a 13.8 KB source changed nothing — it still stopped at 3 passes / 284 facts. But the same source, **chunked**, stays productive:

| input | facts | density |
|---|---|---|
| whole 13.8 KB file | 284 | ~20 facts/KB |
| one 1.8 KB chunk | 82 (used *all* 4 passes) | ~45 facts/KB |

So we split each source into ~2 KB chunks and run a 4-pass gleaning loop per chunk, all writing into the source's one context. Chunking roughly **doubles** fact density and is what turns a medium corpus into tens of thousands of claims. This is "abundance" made operational: emit freely, defer the joining to query time.

### 3.3 The GLM rate limit, and pacing vs. lane rotation
At 8 parallel workers the run flew for ~38 minutes, then hit a wall of **HTTP 429 code 1302 — "rate limit reached for requests."** Crucially this is *not* the weekly quota (code 1310): single probes kept succeeding, so it's a requests/time throttle. The fix was to **dial concurrency down to 3** (which sits under the throttle) and make the runner resumable so the ~200 throttled chunks simply re-queue. We also tested the Hyades/holo3.1 lane as a failover, but it was returning empty (0 tokens) under load, so GLM-paced remained the productive path. Trade-off accepted: slower, but steady and complete.

### 3.4 A 0-fact result is **not** "done" (the no-shortcuts fix)
The throttle exposed a correctness bug worth naming: a chunk that returns *zero* facts (empty/failed response) must **not** be marked done — that's a coverage lie. We made `ingested > 0` a precondition of success, so empty/failed chunks re-queue instead of silently passing. "Extracted" means *facts verified in `donto_statement`*, never *job attempted*.

---

## 4. What the substrate holds now (mid-run snapshot)

As of this writing, ~110 of 261 chunks are extracted and the run continues:

- **~11,000 statements** across **48 contexts**
- **~3,000 distinct predicates** — the abundance signature: the model invents the relations as it goes (`originalGreekTitle`, `keyedTo`, `providesGuidanceOn`, `datesToEra`, `ascendsToBecome`, `pursuedBy`, …). These are *not* a proliferation problem; they are the firehose donto is designed to tame at query time, not at write time.
- **~2,600 distinct subjects**
- **Evidence-anchored throughout** — essentially every claim carries the source span it was extracted from (literal-object claims anchor lexically; relational claims carry their excerpt).

**Claims by tradition (statements, snapshot):**

| tradition | statements | tradition | statements |
|---|---:|---|---:|
| aboriginal-australian | 3,321 | zulu | 216 |
| astronomy (the cluster itself) | 1,268 | filipino | 194 |
| deep-time / oldest-story | 1,242 | blackfoot | 166 |
| greek | 1,024 | aztec | 155 |
| african (synthesis) | 726 | maori | 151 |
| indonesian | 532 | maya | 144 |
| native-american | 391 | hopi | 112 |
| arabic | 352 | hawaiian | 96 |
| chinese | 335 | berber | 84 |
| polynesian | 248 | cherokee | 78 |
| biblical | 242 | european | 74 |
| _…Hindu, Japanese, Persian, Lakota, Kiowa, Onondaga, Navajo, Khoikhoi, Inca chunks were throttle-casualties, re-queued and filling in._ | | | |

---

## 5. The findings the corpus is built to surface

The point of holding all of this in one paraconsistent store is triangulation. Three patterns are already measurable in the claims, and they are exactly where donto's query-time alignment will earn its keep.

### 5.1 The pursuit motif is genuinely pan-cultural
**246 statements** across the corpus encode a *pursuit / flight / hunter / transformation* relation — and they come from cultures that never met: the Greek Orion chasing the Pleiades for seven years; the Western Desert ancestral man **Wati Nyiru** pursuing the Kungkarrangkalpa sisters across thousands of kilometres of songline; the Kiowa/Lakota giant **bear** chasing seven girls onto the rock that becomes Devils Tower; the Monache **wild-onion wives** fleeing their husbands. Held side by side and anchored to source, this is the raw material for testing d'Huy's claim that the gendered "pursuit" structure (motif I115A) is Paleolithic.

### 5.2 The "lost Pleiad": seven named, six seen
**114 statements** touch the seven-vs-six puzzle and the lost/hidden/shy seventh sister. The Greek tradition gives *two competing* explanations (Electra's grief; Merope's shame) — donto holds both, anchored, as contradiction rather than picking one. The same "one is missing/faint" beat recurs in Aboriginal, Cherokee, Onondaga, Kiowa, and Nez Perce accounts. This is the crux of **the deep-time hypothesis** (Norris & Norris 2021): ~100,000 years ago the proper motion of Pleione and Atlas left them resolvable as separate stars — so *seven* were visible — and the story may encode a sky no living human has seen. We deliberately also ingested the **counter-argument** (d'Huy's "reductionist" critique; the Baltic sieve counterexample; the convergent-number-seven alternative). Proponent and skeptic now sit in the store as linked, competing claims — the substrate's native shape.

### 5.3 The agricultural calendar is the other universal
**213 statements** encode a season / harvest / planting / heliacal-rising relation. Across Hesiod, the Andean *Qullqa* (whose June brightness still tracks El Niño), the Zulu *isiLimela* ("the digging stars"), Javanese *pranata mangsa* rice-timing, and the Aztec New Fire ceremony, the Pleiades' heliacal rising marks **when to plant, sail, or count the year**. Whether this is shared inheritance or independent convergence (the same conspicuous cluster, the same agricultural need) is a question you can only *pose* once the claims are in one place, aligned by meaning rather than by string.

### 5.4 Identity-as-hypothesis, waiting for the alignment engine
The corpus now contains dozens of culture-specific names for one object — Pleiades, Subaru, Matariki, Krittika, Mao, Kimah, al-Thurayya, Parvin, Qullqa, Tianquiztli, isiLimela, Khuseti, Karatgurk. None are merged. The need is already visible *within* a single tradition: the store holds both `ex:electra` and `ex:Electra`, and both `ex:asterope` and `ex:asterope-22-tau`, as distinct subjects — the model minting near-duplicate identities as it goes. Resolving those is emphatically **not** a hand-maintained alias table (that would be the anti-pattern); it is the continuous alignment engine's job, done by embedding similarity and resolved at query time. The next step is to let donto's pgvector + alignment engine **propose** `electra ≈ Electra` and `Matariki ≈ Pleiades ≈ Subaru` as scored, query-time identity hypotheses — and then ask the genuinely new question: *which meanings travel with the stars (the sisters, the pursuit, the calendar) and which are local inventions?*

---

## 6. Project checklist

**Done**

- [x] Project scaffolded, mirroring the genealogy consumer (`raw/<culture>/<source>.md`, chunking driver, resumable manifest, CARE `restricted:` flag)
- [x] 63 sources / 32 cultures saved to disk with provenance frontmatter (public-domain primaries preferred; claims-only for copyrighted scholarship)
- [x] Extraction engine built on the Rust `donto-agent` lane — chunk-by-chunk, multi-pass gleaning, evidence-anchored into `donto_statement`
- [x] Diagnosed the opencode/glm-5.1 **0-fact** failure; switched to `donto-agent`/glm-4.7 + thinking-on
- [x] Established **chunking as the abundance lever** (measured ~20 → ~45 facts/KB)
- [x] Handled the GLM **429 / code-1302** throttle: paced concurrency, made the run resumable, and made **0-fact-≠-done** (no-shortcuts rule)
- [x] ~11k evidence-anchored claims across 23+ traditions ingested (run continuing)
- [x] Report published to `donto.org/reports`

**In progress / next**

- [ ] Finish extracting the remaining chunks (118/261 done) — extraction lane switched to **Hyades holo3.1 @ concurrency 2**
- [ ] Backfill the GLM-throttle-casualty traditions (Hindu, Japanese, Persian, Lakota, Kiowa, Onondaga, Navajo, Khoikhoi, Inca)
- [ ] Run the always-on **citer** to separate *stated* (anchored) from *interpreted* (hypothesis) claims
- [ ] Embed predicates + entities and run the **continuous alignment engine** to propose identity hypotheses (`Matariki ≈ Pleiades ≈ Subaru`; `electra ≈ Electra`)
- [ ] Draw `donto_argument` **contradiction edges** (Electra-vs-Merope; Norris-vs-d'Huy)
- [ ] Query-time **cross-cultural synthesis**: rank convergences by corroboration, surface contradictions
- [ ] Add longer primary texts (full book chapters) for deeper per-source abundance

## 7. Honest limits & next steps

- **The run is incomplete and uneven.** GLM-throttle casualties left some traditions (Hindu, Japanese, Lakota, Persian…) under-extracted; they are re-queuing. The final corpus will be larger and more balanced.
- **Claims are free-typed and unaligned.** The cross-cultural counts above are lexical regex over invented predicates — deliberately *not* a hand-maintained synonym table (that would be the anti-pattern). Real triangulation waits on the alignment engine folding predicates and proposing identities at query time. That is the next build.
- **No contradiction edges yet.** The Electra-vs-Merope and Norris-vs-d'Huy disagreements are present as parallel claims but not yet linked with `donto_argument` edges. Promoting them to explicit contradictions is where the Lens Engine's verifier comes in.
- **Synthesis pass to come.** With the corpus complete, the payoff is a query-time synthesis: rank the cross-cultural convergences by corroboration, surface the contradictions, and let the substrate — not a hand-written essay — say what is genuinely shared about the Seven Sisters.

The pipeline, corpus, and driver live at `/home/ajax/donto-resources/seven-sisters/`. It is resumable, CARE-respecting, and Rust-first on the donto-agent GLM lane — a second example consumer proving the substrate on contested knowledge that spans the whole of human time.
