# Help donto — run an embedding worker

**The complete guide to lending donto your spare CPU/GPU.** *(updated 2026-06-10)*
One command, fully safe (text in → vectors out; no database access; one token).
Live page: <https://donto.org/help> · Live tracker: <https://admin.donto.org/embed> · Source: <https://github.com/thomasdavis/donto-embed-worker>

---

## TL;DR — just run this

```bash
docker run --rm -e DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe ghcr.io/thomasdavis/donto-embed-worker
```

That pulls a tiny prebuilt image and runs **one worker per CPU core**. Each worker leases short
texts from `https://donto.org/embed`, turns them into 384-dimensional vectors with a small model
(`BAAI/bge-small-en-v1.5`), and submits them back. Watch your machine appear within seconds at
**<https://admin.donto.org/embed>**.

- **GPU box?** Don't use the Docker image (it ships the CPU build) — use the [GPU path in §7](#7-gpu-acceleration) (`pip install fastembed-gpu` + `EMBED_CUDA=1`): **~2,000–6,000 vectors/min**, i.e. one GPU ≈ a 50–100-core fleet.
- **More computers?** Run the same command on each — the server hands every worker *different* work, so they never overlap.
- **Stop anytime:** `Ctrl-C` / `docker stop`. Unfinished work auto-returns to the queue in ~15 minutes; nothing is lost.

If you use **Claude Code**, you can instead paste the block in [§4](#4-method-b--paste-into-claude-code) and it'll set everything up and verify it for you.

---

## Table of contents

1. [What this is & why it matters](#1-what-this-is--why-it-matters)
2. [Is it safe? (what it does and does not do)](#2-is-it-safe-what-it-does-and-does-not-do)
3. [Method A — Docker (recommended)](#3-method-a--docker-recommended)
4. [Method B — paste into Claude Code](#4-method-b--paste-into-claude-code)
5. [Method C — bare Python (no Docker)](#5-method-c--bare-python-no-docker)
6. [Method D — docker-compose / build your own](#6-method-d--docker-compose--build-your-own)
7. [GPU acceleration](#7-gpu-acceleration)
8. [Running a fleet (many machines)](#8-running-a-fleet-many-machines)
9. [Verify it's actually working](#9-verify-its-actually-working)
10. [Configuration reference (env vars)](#10-configuration-reference-env-vars)
11. [How it works under the hood](#11-how-it-works-under-the-hood)
12. [Troubleshooting](#12-troubleshooting)
13. [FAQ](#13-faq)
14. [Stopping & cleanup](#14-stopping--cleanup)
15. [Appendix A — full worker source](#appendix-a--full-worker-source)
16. [Appendix B — HTTP API reference](#appendix-b--http-api-reference)

---

## 1. What this is & why it matters

**donto** is a knowledge substrate that keeps *every* claim — even contradictory ones — anchored to
its source, and works out what's true at **query time** instead of throwing data away. A huge part of
that "query-time" intelligence runs on **vector embeddings**: small numeric fingerprints of text that
let donto:

- **fold synonymous relations together** (e.g. `rdfType` ↔ `rdf:type`, `killedBy` ↔ `assassinatedBy`) by similarity rather than a hand-maintained list;
- **spot when two records describe the same person/place/thing** (identity resolution / duplicate detection);
- **recall memories by meaning**, not just keywords.
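
To make "by similarity" concrete, here is a minimal sketch of the comparison involved. The 3-dimensional vectors and the `0.9` threshold are invented for illustration: real `bge-small` vectors have 384 dimensions, and this guide does not describe donto's actual threshold or matching logic.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d stand-ins for real 384-d embeddings (values are invented).
vectors = {
    "killedBy":       [0.90, 0.10, 0.05],
    "assassinatedBy": [0.85, 0.15, 0.10],
    "bornIn":         [0.05, 0.20, 0.95],
}

THRESHOLD = 0.9  # hypothetical cut-off, chosen only for this example

def same_relation(a, b):
    """True when two relation names sit close enough to be folded together."""
    return cosine(vectors[a], vectors[b]) >= THRESHOLD

print(same_relation("killedBy", "assassinatedBy"))  # True
print(same_relation("killedBy", "bornIn"))          # False
```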

The maths is cheap; the bottleneck is raw throughput. About **4 million** items are waiting to be
embedded: conversation chunks & claims (~1.2M, with a live BEAM-10M benchmark run waiting on them),
entities (~2.8M), and freely-minted predicates. The whole backlog is stuck on a single 4-core box
doing ~100/min. **Your spare cores (or a GPU) fix that.** Every vector you compute improves donto's
alignment, identity graph, and recall.
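
As a rough back-of-the-envelope check, the sketch below turns those figures into a time-to-clear estimate. The 4,000/min rate is an assumed mid-range value for a single GPU (this guide quotes ~2,000–6,000/min), and the calculation ignores new items arriving while the backlog drains.

```python
def eta_days(backlog, vectors_per_min):
    """Days needed to clear `backlog` items at a steady rate."""
    return backlog / vectors_per_min / 60 / 24

BACKLOG = 4_000_000  # the "~4 million" quoted above

print(round(eta_days(BACKLOG, 100), 1))    # single box at ~100/min -> 27.8
print(round(eta_days(BACKLOG, 4_000), 1))  # one GPU at an assumed 4,000/min -> 0.7
```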

A **worker** is dead simple: it asks the coordinator for a batch of text, embeds it with the
`bge-small` model, and posts the vectors back. It is *target-agnostic* — the server tells it what to
embed (entities, predicates, chunks — whatever is live), so you never configure that.

---

## 2. Is it safe? (what it does and does not do)

**Yes — it's read-only from your side: text comes in, vectors go out.**

It **does**:
- talk **only** to `https://donto.org/embed` over HTTPS;
- download a ~130 MB model (`bge-small`) once (already baked into the Docker image);
- use your CPU/GPU and ~0.3–0.5 GB RAM per worker while embedding.

It **does not**:
- touch any database — the coordinator owns every DB write; the worker never connects to Postgres;
- see private data beyond the short text snippets it must embed;
- open any inbound port, install anything outside the container / a single `pip` package, or need a GitHub or donto account;
- need anything but **one bearer token**, which can only *fetch work and submit vectors* — it cannot read or write the substrate, and it's revocable at any time.

You can stop it whenever you like; any work you'd leased but not submitted returns to the queue
automatically after ~15 minutes, and because vectors are upserted by key, nothing is ever stored twice.

---

## 3. Method A — Docker (recommended)

**Prerequisites:** Docker installed and running. Nothing else.

**1. Confirm you can reach the coordinator** (no token needed):

```bash
curl https://donto.org/embed/health
# → {"ok": true, "targets": ["memory_chunk", "predicate", "entity"]}
```

**2. Run it** (one worker per CPU core):

```bash
docker run --rm -e DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe ghcr.io/thomasdavis/donto-embed-worker
```

You'll see, per worker:

```
→ coordinator: https://donto.org/embed
✓ coordinator reachable
→ launching 8 worker(s) (model BAAI/bge-small-en-v1.5, 384d)
[w1-yourhost] -> https://donto.org/embed (target=any)
[w1-yourhost] +256 (total 256)
[w1-yourhost] +256 (total 512)
...
```

Each `+N (total …)` line is N vectors you just contributed.

**Useful variations:**

```bash
# Run in the background (detached), name it, auto-restart on reboot:
docker run -d --restart unless-stopped --name donto-embed \
  -e DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe ghcr.io/thomasdavis/donto-embed-worker
docker logs -f donto-embed        # watch it
docker stop donto-embed           # stop it

# Limit how many cores it uses (e.g. leave some free):
docker run --rm -e EMBED_WORKERS=2 -e DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe ghcr.io/thomasdavis/donto-embed-worker

# Focus a single aspect (otherwise it takes whatever's queued):
docker run --rm -e EMBED_TARGET=entity -e DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe ghcr.io/thomasdavis/donto-embed-worker

# Bigger batches per request (default 256; server caps at 1024):
docker run --rm -e EMBED_N=512 -e DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe ghcr.io/thomasdavis/donto-embed-worker
```

The image pre-bakes the model, so the first run skips the ~130 MB model download.

---

## 4. Method B — paste into Claude Code

If you use [Claude Code](https://claude.com/claude-code), open it in any folder and paste the block
below (canonical copy: <https://donto.org/help/claude.txt> — `curl -L donto.org/help/claude.txt`).
It detects a GPU, installs the right build (`fastembed-gpu` vs `fastembed`), runs the workers, and
**verifies they're really embedding** (not just "started"). Safe to paste on **many machines at
once** — same token everywhere; the coordinator splits the work.

```text
Help me run donto embedding worker(s) on this machine. I may run this exact same
prompt on SEVERAL machines — that is fine and expected: the coordinator hands every
worker a disjoint slice of work (Postgres FOR UPDATE SKIP LOCKED), machines never
overlap, and if a machine stops mid-batch its work auto-returns to the queue in
~15 minutes. Nothing is ever lost or embedded twice.

WHAT THIS IS: donto (donto.org) needs millions of short texts turned into 384-dim
vectors with the small model BAAI/bge-small-en-v1.5 — ~1.2M conversation chunks/claims
+ ~2.8M entities are queued behind one 4-core server, and a live benchmark run
(BEAM-10M) is literally waiting on these vectors. A worker just loops:
lease texts over HTTPS -> embed locally -> submit vectors back. It never touches a
database or my files, needs no inbound ports and no account; the only secret is one
bearer token; I can stop it at any time.

COORDINATOR: https://donto.org/embed
TOKEN:       80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe

STEP 1 — reachability (no token needed):
  curl -A curl/8 https://donto.org/embed/health
  Expect: {"ok": true, "targets": ["memory_chunk", "predicate", "entity"]}
  NOTE: Cloudflare blocks python-urllib's default User-Agent — everything that talks
  to the coordinator must send a curl-style UA. The worker below already does.

STEP 2 — detect hardware, choose ONE path:
  Run nvidia-smi. If it lists a GPU -> GPU PATH (one GPU ~ 50-100 CPU cores here).
  Otherwise -> CPU PATH.

STEP 3 — save this as worker.py (same file works for both paths):

import json, os, socket, time, urllib.request
URL = os.environ.get("DONTO_EMBED_URL", "https://donto.org/embed").rstrip("/")
TOKEN = os.environ["DONTO_EMBED_TOKEN"]
WID = os.environ.get("EMBED_WORKER_ID") or f"worker-{socket.gethostname()}-{os.getpid()}"
N = int(os.environ.get("EMBED_N", "256"))        # items per lease (server caps at 1024)
BS = int(os.environ.get("EMBED_BATCH", "256"))   # model batch size
TARGET = os.environ.get("EMBED_TARGET") or None  # memory_chunk|predicate|entity|unset=any
CUDA = os.environ.get("EMBED_CUDA") == "1"
IDLE = float(os.environ.get("EMBED_IDLE_SLEEP", "5"))
from fastembed import TextEmbedding
if CUDA:
    import onnxruntime as ort
    assert "CUDAExecutionProvider" in ort.get_available_providers(), (
        "CUDA provider missing - install fastembed-gpu (+ nvidia-cudnn-cu12), or unset EMBED_CUDA")
def post(path, body, timeout=300):
    req = urllib.request.Request(URL + path, data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json",
                 "User-Agent": "curl/8.5.0"}, method="POST")  # Cloudflare blocks python-urllib
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return json.loads(r.read())
_m = {}
def model_for(name):
    if name not in _m:
        kw = {"model_name": name}
        if CUDA: kw["providers"] = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        _m[name] = TextEmbedding(**kw)
    return _m[name]
_lr = 0.0
def report_err(kind, msg):  # best-effort error telemetry, max 1/min
    global _lr
    if time.time() - _lr < 60: return
    _lr = time.time()
    try: post("/report", {"worker_id": WID, "kind": kind, "message": str(msg)[:1500]}, timeout=20)
    except Exception: pass
total, win, errs = 0, [], 0
print(f"[{WID}] -> {URL} target={TARGET or 'any'} n={N} batch={BS} cuda={CUDA}", flush=True)
while True:
    try:
        batch = post("/lease", {"worker_id": WID, "n": N, "target": TARGET}).get("batch", [])
        errs = 0
    except Exception as e:
        errs += 1; print(f"[{WID}] lease error: {e}", flush=True)
        report_err("lease", e); time.sleep(min(IDLE * errs, 60)); continue
    if not batch:
        time.sleep(IDLE); continue
    by = {}
    for b in batch: by.setdefault(b["model"], []).append(b)
    items = []
    for mn, g in by.items():
        for x, v in zip(g, model_for(mn).embed([t["text"] for t in g], batch_size=BS)):
            items.append({"target": x["target"], "item_id": x["item_id"],
                          "vector": [float(z) for z in v]})
    try:
        r = post("/submit", {"worker_id": WID, "items": items})
        n = int(r.get("upserted", 0)); total += n
        now = time.time(); win.append((now, n)); win = [(t, c) for t, c in win if now - t < 60]
        print(f"[{WID}] +{n} (total {total}, ~{sum(c for _, c in win)}/min)", flush=True)
        errs = 0
    except Exception as e:
        errs += 1; print(f"[{WID}] submit error: {e}", flush=True)
        report_err("submit", e); time.sleep(min(IDLE * errs, 60))

STEP 4 — GPU PATH (skip if no GPU):
  python3 -m venv ~/.donto-embed && source ~/.donto-embed/bin/activate
  pip uninstall -y fastembed 2>/dev/null; pip install -U fastembed-gpu
  python -c "import onnxruntime as o; print(o.get_available_providers())"
    -> must include CUDAExecutionProvider. If it does NOT:
       pip install nvidia-cudnn-cu12 nvidia-cublas-cu12   and re-check;
       if still missing, tell me what you found and use the CPU PATH meanwhile.
  Run TWO processes (they share the GPU and overlap network I/O):
    export DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe
    EMBED_CUDA=1 EMBED_N=1024 EMBED_BATCH=512 EMBED_WORKER_ID=gpu1-$(hostname) \
      nohup python worker.py > ~/donto-w1.log 2>&1 &
    EMBED_CUDA=1 EMBED_N=1024 EMBED_BATCH=512 EMBED_WORKER_ID=gpu2-$(hostname) \
      nohup python worker.py > ~/donto-w2.log 2>&1 &
  If the GPU runs out of memory, halve EMBED_BATCH (the model itself is ~130 MB).

STEP 5 — CPU PATH (no GPU):
  pip install fastembed
  export DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe
  for i in $(seq $(nproc)); do
    EMBED_WORKER_ID=w$i-$(hostname) nohup python worker.py > ~/donto-w$i.log 2>&1 &
  done
  (~0.3-0.5 GB RAM per process — if the machine swaps, run fewer.)
  Zero-setup alternative (CPU only, one worker/core):
    docker run --rm -e DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe \
      ghcr.io/thomasdavis/donto-embed-worker

STEP 6 — VERIFY it is really working (do not skip — "started" is not "working"):
  1. Within ~2 min the logs print lines like "+1024 (total 4096, ~3100/min)".
     Tell me the steady ~N/min figure. (Repeated lease/submit errors are
     auto-reported to the coordinator — throttled to 1/min — so the operator can
     see them; the worker also backs off linearly (capped at 60 s), but still tell me the
     error line if it persists.)
  2. GPU path only: nvidia-smi must show the python processes holding GPU memory.
     If it doesn't, the model silently fell back to CPU — re-check providers (step 4).
  3. curl -A curl/8 -H "Authorization: Bearer 80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe" https://donto.org/embed/stats
     -> queue.done climbing / queue.pending falling. The machine also appears within
     seconds as a contributor at https://admin.donto.org/embed (vectors-done rising).
  Ballparks: ~60-150 vectors/min per CPU core; ~2,000-6,000/min per GPU.

STEP 7 — keep it running:
  Leave the workers running in the background and make them survive this terminal
  closing (nohup is already used above; tmux or a systemd --user unit is even better —
  your choice). If a lease returns an empty batch the worker naps 5 s and retries:
  the server refills its queue every 2 minutes from a multi-million backlog, so empty
  moments are normal, NOT an error. Stop anytime with: pkill -f worker.py

MORE MACHINES: run this same prompt on every machine, same token everywhere. Worker
IDs are unique automatically (hostname + pid); the coordinator splits work safely
across any number of machines.
```

---

## 5. Method C — bare Python (no Docker)

**Prerequisites:** Python 3.9+ and `pip`.

```bash
# 1. install the one dependency (CPU build)
pip install fastembed            # GPU box: pip install fastembed-gpu

# 2. grab the worker kit (worker.py + Dockerfile + compose + README)
curl -sL https://genes.apexpots.com/research/donto-embed-worker.tar.gz | tar xz
cd donto-embed-worker

# 3. point it at the coordinator + your token
export DONTO_EMBED_URL=https://donto.org/embed
export DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe

# 4. run one worker per CPU core
for i in $(seq $(nproc)); do EMBED_WORKER_ID=w$i-$(hostname) python worker.py & done

# ...or just one:
python worker.py
```

> The full `worker.py` is also in [Appendix A](#appendix-a--full-worker-source) — you can paste it into
> a file directly if you'd rather not download the kit.

To stop the background workers: `kill $(jobs -p)` (or close the terminal).
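
The shell loop in step 4 relies on `nproc`, which is not available everywhere (stock macOS lacks it, for instance). As a hedged alternative, the sketch below builds the same per-worker environments in Python using `os.cpu_count()`. The helper name `worker_envs` is hypothetical and not part of the kit; the block only prints what it would launch.

```python
import os
import socket

def worker_envs(count, host, base_env):
    """Build one environment dict per worker, each with a unique EMBED_WORKER_ID."""
    envs = []
    for i in range(1, count + 1):
        env = dict(base_env)
        env["EMBED_WORKER_ID"] = f"w{i}-{host}"  # same naming as the shell loop
        envs.append(env)
    return envs

count = os.cpu_count() or 1
envs = worker_envs(count, socket.gethostname(), os.environ)
print(f"would launch {len(envs)} worker(s), first id: {envs[0]['EMBED_WORKER_ID']}")

# To actually start them (not executed here), something like:
#   import subprocess, sys
#   procs = [subprocess.Popen([sys.executable, "worker.py"], env=e) for e in envs]
#   for p in procs:
#       p.wait()
```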

---

## 6. Method D — docker-compose / build your own

The kit ships a `Dockerfile` and `docker-compose.yml` if you'd rather build locally or scale with
compose:

```bash
curl -sL https://genes.apexpots.com/research/donto-embed-worker.tar.gz | tar xz
cd donto-embed-worker
export DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe

# one worker per core, all managed by compose:
docker compose up -d --build --scale embed-worker=$(nproc)
docker compose logs -f          # watch "+N (total ...)"
docker compose down             # stop everything
```

For a GPU host, install `fastembed-gpu` in the Dockerfile, set `EMBED_CUDA=1` and `EMBED_N=1024`, and
add `--gpus all` (or the compose `deploy.resources.reservations.devices` GPU stanza).

---

## 7. GPU acceleration

A single NVIDIA GPU does **~2,000–6,000 vectors/min** vs ~60–150/min per CPU core — one GPU box
outruns a 50–100-core fleet. Two things matter: install the **GPU build** of fastembed, and set
**`EMBED_CUDA=1`** (fastembed defaults to CPU even when the GPU build is present).

```bash
# 1. GPU build (replaces the CPU package)
python3 -m venv ~/.donto-embed && source ~/.donto-embed/bin/activate
pip uninstall -y fastembed 2>/dev/null; pip install -U fastembed-gpu

# 2. verify CUDA is actually available to onnxruntime
python -c "import onnxruntime as o; print(o.get_available_providers())"
# must list CUDAExecutionProvider; if not: pip install nvidia-cudnn-cu12 nvidia-cublas-cu12

# 3. run TWO processes (they share the GPU and overlap network I/O)
export DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe
EMBED_CUDA=1 EMBED_N=1024 EMBED_BATCH=512 EMBED_WORKER_ID=gpu1-$(hostname) nohup python worker.py > ~/donto-w1.log 2>&1 &
EMBED_CUDA=1 EMBED_N=1024 EMBED_BATCH=512 EMBED_WORKER_ID=gpu2-$(hostname) nohup python worker.py > ~/donto-w2.log 2>&1 &
```

**Verify the GPU is really in use:** `nvidia-smi` must show the python processes holding GPU memory,
and the log rate should be in the thousands/min. If `nvidia-smi` doesn't show them, the model silently
fell back to CPU — re-check step 2. If the GPU OOMs, halve `EMBED_BATCH` (the model is only ~130 MB).
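
The provider check can be sketched as a small pure function. This is an illustration, not the worker's code: `pick_providers` is a hypothetical helper, and the shipped `worker.py` uses an `assert` at startup instead. The input has the same shape as the list returned by `onnxruntime.get_available_providers()`.

```python
def pick_providers(available, want_cuda):
    """Return a provider list for fastembed, or raise if CUDA was requested but is absent."""
    if not want_cuda:
        return ["CPUExecutionProvider"]
    if "CUDAExecutionProvider" not in available:
        raise RuntimeError(
            "CUDAExecutionProvider missing: install fastembed-gpu or unset EMBED_CUDA")
    # CPU stays in the list as a fallback, as in worker.py.
    return ["CUDAExecutionProvider", "CPUExecutionProvider"]

# Example inputs shaped like onnxruntime.get_available_providers() output.
cpu_only = ["CPUExecutionProvider"]
with_gpu = ["CUDAExecutionProvider", "CPUExecutionProvider"]

print(pick_providers(with_gpu, want_cuda=True))
print(pick_providers(cpu_only, want_cuda=False))
```

Raising on a missing provider makes the failure visible at startup rather than showing up later as an unexpectedly low rate.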

> ⚠️ The prebuilt Docker image is **CPU-only** (it ships plain `fastembed`); `--gpus all` won't make
> it use the GPU. Use the pip path above on GPU hosts — or the Claude-Code paste in §4, which does
> all of this automatically.

---

## 8. Running a fleet (many machines)

Scaling is **linear and zero-coordination**. Run the exact same command on every machine, with the
same token:

```bash
docker run -d --restart unless-stopped --name donto-embed \
  -e DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe ghcr.io/thomasdavis/donto-embed-worker
```

The coordinator hands every worker a **disjoint** slice of work using a `FOR UPDATE SKIP LOCKED`
queue, so two laptops + a desktop + a GPU box can all run at once and **never** double-embed or
collide. Add or remove machines freely; if one dies mid-batch, its lease is reclaimed after ~15 min
and the work re-flows to the others.

Give each machine a recognizable name so you can tell them apart on the tracker:

```bash
docker run -d -e EMBED_WORKER_ID_PREFIX=alice-laptop- \
  -e DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe ghcr.io/thomasdavis/donto-embed-worker
```

---

## 9. Verify it's actually working

**A "launched" worker is not a "working" worker** — always confirm real progress, three ways:

**1. The worker's own output** — each worker should print `+N (total …)` lines every few seconds.
No lines, or only `lease error` / `submit error`, means something's wrong (see Troubleshooting).
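
If you have many workers, scanning logs by eye gets tedious. The sketch below is one possible way to summarize saved log text (for example from `docker logs` or the `~/donto-w*.log` files); the `summarize` helper and the sample lines are illustrative, and the regular expression assumes the `+N (total …)` format shown in this guide.

```python
import re

# Matches "[worker-id] +N (total T" with or without the trailing "~X/min" part.
PROGRESS = re.compile(r"\[(?P<wid>[^\]]+)\] \+(?P<n>\d+) \(total (?P<total>\d+)")

def summarize(log_text):
    """Return ({worker_id: latest total}, [error lines])."""
    totals, errors = {}, []
    for line in log_text.splitlines():
        m = PROGRESS.search(line)
        if m:
            totals[m["wid"]] = int(m["total"])
        elif "lease error" in line or "submit error" in line:
            errors.append(line)
    return totals, errors

sample = """\
[w1-yourhost] +256 (total 256)
[w2-yourhost] +256 (total 256, ~256/min)
[w1-yourhost] +256 (total 512)
[w2-yourhost] lease error: HTTP Error 403: Forbidden
"""

totals, errors = summarize(sample)
print(totals)       # {'w1-yourhost': 512, 'w2-yourhost': 256}
print(len(errors))  # 1
```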

**2. The live tracker** — open **<https://admin.donto.org/embed>**. Your machine appears within
seconds as a contributor with a green "live" dot, a rising **vectors done** count, in-flight count,
and last-seen time, plus per-target queue progress and throughput tiles (1m / 5m / 1h).

**3. The stats endpoint:**

```bash
curl -A curl/8 -H "Authorization: Bearer 80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe" https://donto.org/embed/stats
```

```json
{
  "predicate":    { "source": 955709, "embedded": 886836, "queue": { "pending": 49996, "leased": 64, "done": 304 } },
  "entity":       { "source": 3153432, "embedded": 317237, "queue": { "pending": 49996, "leased": 0, "done": 0 } },
  "memory_chunk": { "source": null, "embedded": null, "queue": { "pending": 100000, "leased": 0, "done": 8 } }
}
```

Over 1–2 minutes, `queue.done` should **climb** and `queue.pending` should **fall**. That's your
machine eating the backlog. (`source`/`embedded` are big totals refreshed in the background and may
briefly show `null` — that's cosmetic; the `queue` numbers are live.)
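
To check "climbing" and "falling" without eyeballing numbers, you could diff two snapshots of the stats response. This is a sketch: `queue_delta` is a hypothetical helper, the first snapshot reuses the sample values above, and the second snapshot is invented.

```python
def queue_delta(before, after):
    """Per-target change in done/pending between two /embed/stats snapshots."""
    out = {}
    for target, now in after.items():
        prev = before.get(target)
        if prev is None:
            continue  # target was not present in the earlier snapshot
        out[target] = {
            "done": now["queue"]["done"] - prev["queue"]["done"],
            "pending": now["queue"]["pending"] - prev["queue"]["pending"],
        }
    return out

# First snapshot: predicate/entity queue numbers from the sample above.
before = {
    "predicate": {"queue": {"pending": 49996, "leased": 64, "done": 304}},
    "entity": {"queue": {"pending": 49996, "leased": 0, "done": 0}},
}
# Second snapshot: invented values for illustration.
after = {
    "predicate": {"queue": {"pending": 49484, "leased": 64, "done": 816}},
    "entity": {"queue": {"pending": 49996, "leased": 0, "done": 0}},
}

for target, d in queue_delta(before, after).items():
    status = "progressing" if d["done"] > 0 else "idle"
    print(target, d, status)
```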

**Also check your machine's health:** `free -m` (or `docker stats`). Each worker holds the model
(~0.3–0.5 GB). If RAM swaps, throughput collapses — run fewer workers (see `EMBED_WORKERS`).
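
A quick way to pick a safe worker count is to take the smaller of the core count and what fits in free RAM. The sketch below uses the upper end of the ~0.3–0.5 GB per-worker figure as a conservative assumption; `max_workers` is a hypothetical helper, not something the image provides.

```python
def max_workers(cores, free_ram_gb, gb_per_worker=0.5):
    """Largest worker count that fits both the core count and free RAM (at least 1)."""
    by_ram = int(free_ram_gb // gb_per_worker)
    return max(1, min(cores, by_ram))

print(max_workers(cores=8, free_ram_gb=16))  # cores are the limit -> 8
print(max_workers(cores=16, free_ram_gb=3))  # RAM is the limit -> 6
```

The result is the kind of value you would pass as `EMBED_WORKERS`.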

---

## 10. Configuration reference (env vars)

| Variable | Default | Applies to | Meaning |
|---|---|---|---|
| `DONTO_EMBED_TOKEN` | *(required)* | all | Bearer token. Embed-only; can't read/write the substrate. |
| `DONTO_EMBED_URL` | `https://donto.org/embed` | all | Coordinator base URL. |
| `EMBED_WORKERS` | number of CPU cores | Docker image | How many worker processes to launch. Lower it to keep cores free. |
| `EMBED_N` | `256` | all | Items leased per request (server caps at **1024**). Raise on GPU (`1024`); lower on tiny boxes. |
| `EMBED_TARGET` | *(any)* | all | Restrict to one aspect: `memory_chunk`, `predicate`, or `entity`. Omit = take whatever's queued. |
| `EMBED_WORKER_ID` | `worker-<host>-<pid>` | bare Python | Explicit worker name shown on the tracker. |
| `EMBED_WORKER_ID_PREFIX` | `w` | Docker image | Prefix for auto-generated worker names (e.g. `alice-laptop-`). |
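| `EMBED_IDLE_SLEEP` | `5` | all | Seconds to wait when the queue is empty before asking again. |
| `EMBED_BATCH` | `256` | all | Model batch size (texts per inference call). Raise on GPU (`512`); halve it if the GPU runs out of memory. |
| `EMBED_CUDA` | *(unset)* | bare Python | Set to `1` to run on the GPU via `CUDAExecutionProvider` (requires `fastembed-gpu`). |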

The embedding **model is pinned** to `BAAI/bge-small-en-v1.5` (384-dim) and is dictated by the server
per batch — **never override it**, or your vectors won't match the fabric and would have to be redone.
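
For illustration, the settings in the table could be read and sanity-checked as below. This is a sketch under assumptions: the shipped worker reads the same variables but does not validate them this way, and `load_config` is a hypothetical name.

```python
VALID_TARGETS = {"memory_chunk", "predicate", "entity"}
SERVER_MAX_N = 1024  # server-side cap quoted in the table above

def load_config(env):
    """Read worker settings from a mapping such as os.environ, with basic validation."""
    if not env.get("DONTO_EMBED_TOKEN"):
        raise ValueError("DONTO_EMBED_TOKEN is required")
    target = env.get("EMBED_TARGET") or None
    if target is not None and target not in VALID_TARGETS:
        raise ValueError(f"EMBED_TARGET must be one of {sorted(VALID_TARGETS)}")
    n = int(env.get("EMBED_N", "256"))
    return {
        "url": env.get("DONTO_EMBED_URL", "https://donto.org/embed").rstrip("/"),
        "n": min(max(n, 1), SERVER_MAX_N),  # clamp locally; the server clamps too
        "target": target,
        "idle_sleep": float(env.get("EMBED_IDLE_SLEEP", "5")),
    }

cfg = load_config({"DONTO_EMBED_TOKEN": "example-token", "EMBED_N": "4096"})
print(cfg["n"])  # 1024
```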

---

## 11. How it works under the hood

```
   your machine (worker)                         the donto box (coordinator + DB)
   ─────────────────────                         ────────────────────────────────
   1. POST /embed/lease  ───────────────────────▶  pick N un-embedded items
                                                    UPDATE ... FOR UPDATE SKIP LOCKED
      ◀───────────────  [{item_id, text, model}]    (no two workers get the same item)
   2. embed text → 384-d vectors  (bge-small)
   3. POST /embed/submit ───────────────────────▶  validate dim, UPSERT vectors into
                                                    donto_*_embedding, mark items done
      ◀───────────────  {upserted: N}
   4. repeat
```

- **The worker is dumb on purpose.** It never touches the database — it only knows how to turn `text`
  into a vector. The coordinator builds the text, owns all DB writes, and decides what's next. This is
  what makes it safe to hand to anyone.
- **The work queue (`donto_embed_queue`)** uses Postgres `FOR UPDATE SKIP LOCKED`, the canonical
  no-double-work pattern. Each leased item is invisible to other workers until it's done or its lease
  goes stale (~15 min), at which point it's automatically re-offered. This is why unlimited workers
  across unlimited machines never collide, and why crashing/stopping never loses work.
- **Three targets** are live: `memory_chunk` (conversation memory), `predicate` (the freely-minted
  relation names), and `entity` (per-entity fingerprints used for identity resolution). The coordinator
  keeps the queue topped up from the backlog automatically, so workers rarely run dry.
- **Idempotent:** vectors are upserted by key, so re-embedding an item (e.g. after a reclaimed lease)
  just overwrites with the identical vector — never a duplicate or a conflict.
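
The lease, stale-reclaim, and upsert behaviour described above can be simulated in a few lines. This is a single-threaded, in-memory stand-in written for this guide, not the coordinator's implementation: the real queue lives in Postgres and gets its no-overlap guarantee from row locking, which the sketch does not model. `LeaseQueue` and its methods are hypothetical names.

```python
LEASE_TTL = 15 * 60  # seconds; the "~15 min" stale-lease window from the text

class LeaseQueue:
    """In-memory stand-in for the coordinator's work queue."""

    def __init__(self, item_ids):
        self.leases = {i: None for i in item_ids}  # item_id -> (worker, leased_at) or None
        self.done = {}                             # item_id -> vector (the upsert target)

    def lease(self, worker, n, now):
        """Hand out up to n items that are neither done nor under a fresh lease."""
        batch = []
        for item_id, lease in self.leases.items():
            if item_id in self.done:
                continue
            stale = lease is not None and now - lease[1] >= LEASE_TTL
            if lease is None or stale:
                self.leases[item_id] = (worker, now)
                batch.append(item_id)
                if len(batch) == n:
                    break
        return batch

    def submit(self, items):
        """Upsert by key: a repeated item overwrites its vector, never duplicates it."""
        for item_id, vector in items:
            self.done[item_id] = vector
        return len(items)

q = LeaseQueue(range(6))
a = q.lease("worker-a", 3, now=0)          # [0, 1, 2]
b = q.lease("worker-b", 3, now=1)          # [3, 4, 5], disjoint from a
q.submit([(i, [0.0]) for i in b])          # worker-b finishes; worker-a dies
c = q.lease("worker-c", 3, now=LEASE_TTL)  # a's stale items are re-offered
print(a, b, c)
```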

---

## 12. Troubleshooting

| Symptom | Cause & fix |
|---|---|
| **`403 Forbidden`** on every lease | Cloudflare (in front of the coordinator) blocks the default `python-urllib` user-agent. The worker here already sends a `curl/8.5.0` UA — **use it as-is, don't strip that header.** For ad-hoc `curl`, add `-A curl/8`. |
| **`401 unauthorized`** | Bad or missing token. Re-check `DONTO_EMBED_TOKEN` (no quotes, no trailing spaces). Ask Thomas for a fresh one. |
| **`health` ok but every lease errors** | Network/proxy between you and Cloudflare. Try `curl -v https://donto.org/embed/health`. Corporate proxies sometimes need `HTTPS_PROXY` set. |
| **Workers run but `queue.done` never moves / `batch: []`** | The queue is empty for the aspect you're on right now (not an error). Leave a worker running — it picks up the moment work is enqueued. Or drop `EMBED_TARGET` so it takes any aspect. |
| **Machine swaps / load spikes** | Too many workers for your RAM. Each holds the model (~0.3–0.5 GB). Lower `EMBED_WORKERS` below the core count (its default) until swapping stops. Watch `free -m`. |
| **First run is slow / "downloading model"** | Bare Python downloads `bge-small` once (~130 MB) then caches it. The Docker image pre-bakes it, so use Docker to skip this. |
| **`fastembed` install fails** | Needs a recent `pip`: `pip install -U pip` first. On GPU use `fastembed-gpu`. The Docker image avoids this entirely. |
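| **GPU not used** | Install `fastembed-gpu` (not `fastembed`) and set `EMBED_CUDA=1`; without it the worker stays on CPU. Check that `onnxruntime` lists `CUDAExecutionProvider` and that `nvidia-smi` works on the host. The prebuilt Docker image is CPU-only, so `--gpus all` alone won't help (see [§7](#7-gpu-acceleration)). |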
| **Want it to survive reboots** | Run detached with `--restart unless-stopped` (see [§3](#3-method-a--docker-recommended)). |
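
The two HTTP-status rows of the table can be expressed as a small lookup, which may be handy in a wrapper script. `diagnose` and `HINTS` are hypothetical names for this sketch; the hint text is paraphrased from the table.

```python
HINTS = {
    401: "Bad or missing token: re-check DONTO_EMBED_TOKEN.",
    403: "Likely the Cloudflare user-agent block: send a curl-style User-Agent.",
}

def diagnose(status):
    """Map an HTTP status code from the coordinator to a hint from the table above."""
    if 200 <= status < 300:
        return "OK"
    return HINTS.get(status, f"Unexpected HTTP {status}: check the worker log and the tracker.")

for code in (200, 401, 403, 524):
    print(code, diagnose(code))
```

In a worker built on `urllib`, the status code is available as the `code` attribute of `urllib.error.HTTPError`.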

---

## 13. FAQ

**Do I need a GitHub or donto account?** No. Just Docker (or Python) and the token.

**Can I break anything?** No. The worker only receives text and sends back vectors, and it never touches
the database. The token can only fetch work and submit vectors.

**How much will it cost me?** Just electricity/CPU time while it runs. No cloud charges, no API keys.

**Can I run it part-time / overnight only?** Absolutely. Start and stop whenever; work auto-returns to
the queue if you stop mid-batch. Many people just run it overnight.

**Will it slow down my machine?** It uses your cores while embedding. Run fewer workers
(`-e EMBED_WORKERS=2`) to leave headroom, or only run it when you're idle.

**Which is faster, Docker or Python?** Same speed — both run the identical worker and model. Docker is
easier (model pre-baked, one command). Python is handy if you can't install Docker.

**How do I know my contribution counted?** `queue.done` rises on `/embed/stats`, and your machine's
**vectors done** climbs on <https://admin.donto.org/embed>.

**Is the token secret?** It's a low-privilege *embed-only* token (can't read or write donto) and is
revocable, but still don't paste it into public chats — DM it. A fresh one is one message to Thomas.

**What model is it and can I change it?** `BAAI/bge-small-en-v1.5`, 384-dim. **Don't change it** — the
whole fabric uses this model; a different one produces incompatible vectors.

---

## 14. Stopping & cleanup

```bash
# Docker (foreground): Ctrl-C
# Docker (detached):
docker stop donto-embed && docker rm donto-embed
docker rmi ghcr.io/thomasdavis/donto-embed-worker   # optional: remove the image too

# Bare Python:
kill $(jobs -p)        # or just close the terminal

# docker-compose:
docker compose down
```

Stopping is always safe: any leased-but-unsubmitted work returns to the queue automatically after
~15 minutes, and nothing is double-counted.

---

## Appendix A — full worker source

`worker.py` — canonical source (standard library + `fastembed` only; CPU and GPU via `EMBED_CUDA=1`).
Same code as the §4 paste and the Docker image:

```python
import json, os, socket, time, urllib.request

# Configuration, all read from environment variables (see §10).
URL = os.environ.get("DONTO_EMBED_URL", "https://donto.org/embed").rstrip("/")
TOKEN = os.environ["DONTO_EMBED_TOKEN"]
WID = os.environ.get("EMBED_WORKER_ID") or f"worker-{socket.gethostname()}-{os.getpid()}"
N = int(os.environ.get("EMBED_N", "256"))        # items per lease (server caps at 1024)
BS = int(os.environ.get("EMBED_BATCH", "256"))   # model batch size
TARGET = os.environ.get("EMBED_TARGET") or None  # memory_chunk|predicate|entity|unset=any
CUDA = os.environ.get("EMBED_CUDA") == "1"
IDLE = float(os.environ.get("EMBED_IDLE_SLEEP", "5"))

from fastembed import TextEmbedding
if CUDA:
    # Fail at startup if CUDA was requested but onnxruntime cannot provide it.
    import onnxruntime as ort
    assert "CUDAExecutionProvider" in ort.get_available_providers(), (
        "CUDA provider missing - install fastembed-gpu (+ nvidia-cudnn-cu12), or unset EMBED_CUDA")

# POST a JSON body to the coordinator and return the decoded JSON response.
def post(path, body, timeout=300):
    req = urllib.request.Request(URL + path, data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json",
                 "User-Agent": "curl/8.5.0"}, method="POST")  # Cloudflare blocks python-urllib
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return json.loads(r.read())

# Cache of loaded models, keyed by the model name the server sends with each item.
_m = {}
def model_for(name):
    if name not in _m:
        kw = {"model_name": name}
        if CUDA: kw["providers"] = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        _m[name] = TextEmbedding(**kw)
    return _m[name]

# Time of the last error report, used to throttle reports.
_lr = 0.0
def report_err(kind, msg):  # best-effort error telemetry, max 1/min
    global _lr
    if time.time() - _lr < 60: return
    _lr = time.time()
    try: post("/report", {"worker_id": WID, "kind": kind, "message": str(msg)[:1500]}, timeout=20)
    except Exception: pass

# total = vectors submitted so far; win = (time, count) pairs from the last 60 s;
# errs = consecutive failures, which sets the retry delay.
total, win, errs = 0, [], 0
print(f"[{WID}] -> {URL} target={TARGET or 'any'} n={N} batch={BS} cuda={CUDA}", flush=True)
while True:
    # 1. Lease a batch; on failure wait IDLE * errs seconds (max 60) and retry.
    try:
        batch = post("/lease", {"worker_id": WID, "n": N, "target": TARGET}).get("batch", [])
        errs = 0
    except Exception as e:
        errs += 1; print(f"[{WID}] lease error: {e}", flush=True)
        report_err("lease", e); time.sleep(min(IDLE * errs, 60)); continue
    if not batch:
        time.sleep(IDLE); continue
    # 2. Group items by model name, embed each group, and pair vectors with item IDs.
    by = {}
    for b in batch: by.setdefault(b["model"], []).append(b)
    items = []
    for mn, g in by.items():
        for x, v in zip(g, model_for(mn).embed([t["text"] for t in g], batch_size=BS)):
            items.append({"target": x["target"], "item_id": x["item_id"],
                          "vector": [float(z) for z in v]})
    # 3. Submit the vectors. A failed batch is not retried here; its lease goes
    #    stale and the coordinator re-offers the items.
    try:
        r = post("/submit", {"worker_id": WID, "items": items})
        n = int(r.get("upserted", 0)); total += n
        now = time.time(); win.append((now, n)); win = [(t, c) for t, c in win if now - t < 60]
        print(f"[{WID}] +{n} (total {total}, ~{sum(c for _, c in win)}/min)", flush=True)
        errs = 0
    except Exception as e:
        errs += 1; print(f"[{WID}] submit error: {e}", flush=True)
        report_err("submit", e); time.sleep(min(IDLE * errs, 60))
```
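
The retry delay in the source above is `min(IDLE * errs, 60)`: it grows linearly with the number of consecutive errors and is capped at 60 seconds. The sketch below prints the resulting schedule for the default `EMBED_IDLE_SLEEP` of 5; `backoff_seconds` is a name introduced here for illustration.

```python
def backoff_seconds(consecutive_errors, idle=5.0, cap=60.0):
    """Sleep time after the Nth consecutive lease/submit error."""
    return min(idle * consecutive_errors, cap)

# 5, 10, 15, ... seconds, reaching the 60 s cap at the 12th consecutive error.
print([backoff_seconds(n) for n in range(1, 15)])
```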

---

## Appendix B — HTTP API reference

Base URL: `https://donto.org/embed`. All requests except `/health`, `/stats`, and `/reports` are `POST`
with a JSON body. All authenticated requests need the header `Authorization: Bearer <token>` **and** a
non-default `User-Agent` (e.g. `curl/8.5.0`).

| Endpoint | Method | Auth | Body | Returns |
|---|---|---|---|---|
| `/embed/health` | GET | none | — | `{"ok": true, "targets": [...]}` |
| `/embed/stats` | GET | token | — | per-target `{source, embedded, queue:{pending,leased,done}}` |
| `/embed/lease` | POST | token | `{"worker_id","n","target"?}` | `{"batch": [{"target","item_id","text","model","dim"}]}` |
| `/embed/submit` | POST | token | `{"worker_id","items":[{"target","item_id","vector"}]}` | `{"upserted": N}` |

Notes:
- `n` is clamped to ≤ 1024 server-side.
- `target` on `/lease` is optional; omit to receive any aspect.
- Vectors must be 384 floats; wrong-dim submissions are dropped (logged) and the item stays pending.
- A leased item not submitted within ~15 minutes is automatically re-offered to other workers.
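
Since wrong-dimension vectors are dropped server-side, a client can check them before submitting. The sketch below builds a `/submit` body in the shape given in the table; `build_submit_body` is a hypothetical helper, and the leased item (including its `item_id` value and type) is made up for the example.

```python
EXPECTED_DIM = 384  # vector length required by the notes above

def build_submit_body(worker_id, batch, vectors):
    """Pair leased items with their vectors, rejecting wrong-sized vectors before sending."""
    if len(batch) != len(vectors):
        raise ValueError("one vector is required per leased item")
    items = []
    for item, vec in zip(batch, vectors):
        dim = item.get("dim", EXPECTED_DIM)
        if len(vec) != dim:
            raise ValueError(f"item {item['item_id']}: got {len(vec)} floats, expected {dim}")
        items.append({"target": item["target"], "item_id": item["item_id"],
                      "vector": [float(x) for x in vec]})
    return {"worker_id": worker_id, "items": items}

# A made-up lease response with one item, in the shape of the table above.
batch = [{"target": "entity", "item_id": 42, "text": "Ada Lovelace",
          "model": "BAAI/bge-small-en-v1.5", "dim": 384}]
body = build_submit_body("my-laptop", batch, [[0.0] * 384])
print(len(body["items"][0]["vector"]))  # 384
```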

### POST `/embed/report` (worker error telemetry)

Workers automatically send throttled (1/min) error reports here, so the operator can
see fleet-side failures (Cloudflare 524s, timeouts, version issues) without access to
the contributor's machine. You can also send one manually:

```bash
curl -A curl/8 -H "Authorization: Bearer $DONTO_EMBED_TOKEN" -H "Content-Type: application/json" \
  -d '{"worker_id":"my-laptop","kind":"lease","message":"HTTP 524 from /lease"}' \
  https://donto.org/embed/report
```

### GET `/embed/reports?limit=50` (recent error reports)

Returns the latest reports (also shown on <https://admin.donto.org/embed>).

---

*Questions, a fresh token, or want to go deeper than embedding? → Thomas.
Image source & issues: <https://github.com/thomasdavis/donto-embed-worker>.
The wider story: <https://genes.apexpots.com/research/donto-alignment-contribute-2026-06-06.html>.*
