Help donto — run an embedding worker (CPU or GPU)

🤖 Claude Code — paste & go (CPU or GPU)

Paste this into Claude Code on each machine. It detects your GPU, sets up the right build, runs workers, and verifies they're really embedding. Also raw at donto.org/help/claude.txt (curl -L donto.org/help/claude.txt).

Help me run donto embedding worker(s) on this machine. I may run this exact same
prompt on SEVERAL machines — that is fine and expected: the coordinator hands every
worker a disjoint slice of work (Postgres FOR UPDATE SKIP LOCKED), machines never
overlap, and if a machine stops mid-batch its work auto-returns to the queue in
~15 minutes. Nothing is ever lost or embedded twice.

WHAT THIS IS: donto (donto.org) needs millions of short texts turned into 384-dim
vectors with the small model BAAI/bge-small-en-v1.5 — ~1.2M conversation chunks/claims
+ ~2.8M entities are queued behind one 4-core server, and a live benchmark run
(BEAM-10M) is literally waiting on these vectors. A worker just loops:
lease texts over HTTPS -> embed locally -> submit vectors back. It never touches a
database or my files, needs no inbound ports and no account; the only secret is one
bearer token; I can stop it at any time.

COORDINATOR: https://donto.org/embed
TOKEN:       80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe

STEP 1 — reachability (no token needed):
  curl -A curl/8 https://donto.org/embed/health
  Expect: {"ok": true, "targets": ["memory_chunk", "predicate", "entity"]}
  NOTE: Cloudflare blocks python-urllib's default User-Agent — everything that talks
  to the coordinator must send a curl-style UA. The worker below already does.

STEP 2 — detect hardware, choose ONE path:
  Run nvidia-smi. If it lists a GPU -> GPU PATH (one GPU ~ 50-100 CPU cores here).
  Otherwise -> CPU PATH.

STEP 3 — save this as worker.py (same file works for both paths):

import json, os, socket, time, urllib.request
URL = os.environ.get("DONTO_EMBED_URL", "https://donto.org/embed").rstrip("/")
TOKEN = os.environ["DONTO_EMBED_TOKEN"]
WID = os.environ.get("EMBED_WORKER_ID") or f"worker-{socket.gethostname()}-{os.getpid()}"
N = int(os.environ.get("EMBED_N", "256"))        # items per lease (server caps at 1024)
BS = int(os.environ.get("EMBED_BATCH", "256"))   # model batch size
TARGET = os.environ.get("EMBED_TARGET") or None  # memory_chunk|predicate|entity|unset=any
CUDA = os.environ.get("EMBED_CUDA") == "1"
IDLE = float(os.environ.get("EMBED_IDLE_SLEEP", "5"))
from fastembed import TextEmbedding
if CUDA:
    import onnxruntime as ort
    assert "CUDAExecutionProvider" in ort.get_available_providers(), (
        "CUDA provider missing - install fastembed-gpu (+ nvidia-cudnn-cu12), or unset EMBED_CUDA")
def post(path, body, timeout=300):
    req = urllib.request.Request(URL + path, data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json",
                 "User-Agent": "curl/8.5.0"}, method="POST")  # Cloudflare blocks python-urllib
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return json.loads(r.read())
_m = {}
def model_for(name):
    if name not in _m:
        kw = {"model_name": name}
        if CUDA: kw["providers"] = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        _m[name] = TextEmbedding(**kw)
    return _m[name]
_lr = 0.0
def report_err(kind, msg):  # best-effort error telemetry, max 1/min
    global _lr
    if time.time() - _lr < 60: return
    _lr = time.time()
    try: post("/report", {"worker_id": WID, "kind": kind, "message": str(msg)[:1500]}, timeout=20)
    except Exception: pass
total, win, errs = 0, [], 0
print(f"[{WID}] -> {URL} target={TARGET or 'any'} n={N} batch={BS} cuda={CUDA}", flush=True)
while True:
    try:
        batch = post("/lease", {"worker_id": WID, "n": N, "target": TARGET}).get("batch", [])
        errs = 0
    except Exception as e:
        errs += 1; print(f"[{WID}] lease error: {e}", flush=True)
        report_err("lease", e); time.sleep(min(IDLE * errs, 60)); continue
    if not batch:
        time.sleep(IDLE); continue
    by = {}
    for b in batch: by.setdefault(b["model"], []).append(b)
    items = []
    for mn, g in by.items():
        for x, v in zip(g, model_for(mn).embed([t["text"] for t in g], batch_size=BS)):
            items.append({"target": x["target"], "item_id": x["item_id"],
                          "vector": [float(z) for z in v]})
    try:
        r = post("/submit", {"worker_id": WID, "items": items})
        n = int(r.get("upserted", 0)); total += n
        now = time.time(); win.append((now, n)); win = [(t, c) for t, c in win if now - t < 60]
        print(f"[{WID}] +{n} (total {total}, ~{sum(c for _, c in win)}/min)", flush=True)
        errs = 0
    except Exception as e:
        errs += 1; print(f"[{WID}] submit error: {e}", flush=True)
        report_err("submit", e); time.sleep(min(IDLE * errs, 60))

STEP 4 — GPU PATH (skip if no GPU):
  python3 -m venv ~/.donto-embed && source ~/.donto-embed/bin/activate
  pip uninstall -y fastembed 2>/dev/null; pip install -U fastembed-gpu
  python -c "import onnxruntime as o; print(o.get_available_providers())"
    -> must include CUDAExecutionProvider. If it does NOT:
       pip install nvidia-cudnn-cu12 nvidia-cublas-cu12   and re-check;
       if still missing, tell me what you found and use the CPU PATH meanwhile.
  Run TWO processes (they share the GPU and overlap network I/O):
    export DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe
    EMBED_CUDA=1 EMBED_N=1024 EMBED_BATCH=512 EMBED_WORKER_ID=gpu1-$(hostname) \
      nohup python worker.py > ~/donto-w1.log 2>&1 &
    EMBED_CUDA=1 EMBED_N=1024 EMBED_BATCH=512 EMBED_WORKER_ID=gpu2-$(hostname) \
      nohup python worker.py > ~/donto-w2.log 2>&1 &
  If the GPU runs out of memory, halve EMBED_BATCH (the model itself is ~130 MB).

STEP 5 — CPU PATH (no GPU):
  pip install fastembed
  export DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe
  for i in $(seq $(nproc)); do
    EMBED_WORKER_ID=w$i-$(hostname) nohup python worker.py > ~/donto-w$i.log 2>&1 &
  done
  (~0.3-0.5 GB RAM per process — if the machine swaps, run fewer.)
  Zero-setup alternative (CPU only, one worker/core):
    docker run --rm -e DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe \
      ghcr.io/thomasdavis/donto-embed-worker

STEP 6 — VERIFY it is really working (do not skip — "started" is not "working"):
  1. Within ~2 min the logs print lines like "+1024 (total 4096, ~3100/min)".
     Tell me the steady ~N/min figure. (Repeated lease/submit errors are
     auto-reported to the coordinator — throttled to 1/min — so the operator can
     see them; the worker also backs off exponentially, but still tell me the
     error line if it persists.)
  2. GPU path only: nvidia-smi must show the python processes holding GPU memory.
     If it doesn't, the model silently fell back to CPU — re-check providers (step 4).
  3. curl -A curl/8 -H "Authorization: Bearer 80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe" https://donto.org/embed/stats
     -> queue.done climbing / queue.pending falling. The machine also appears within
     seconds as a contributor at https://admin.donto.org/embed (vectors-done rising).
  Ballparks: ~60-150 vectors/min per CPU core; ~2,000-6,000/min per GPU.

STEP 7 — keep it running:
  Leave the workers running in the background and make them survive this terminal
  closing (nohup is already used above; tmux or a systemd --user unit is even better —
  your choice). If a lease returns an empty batch the worker naps 5 s and retries:
  the server refills its queue every 2 minutes from a multi-million backlog, so empty
  moments are normal, NOT an error. Stop anytime with: pkill -f worker.py

MORE MACHINES: run this same prompt on every machine, same token everywhere. Worker
IDs are unique automatically (hostname + pid); the coordinator splits work safely
across any number of machines.

Text in, vectors out — never touches a database or your files
Talks only to donto.org/embed; no inbound ports, no account
Stop anytime; unfinished work auto-returns to the queue in ~15 min
Run it on many machines at once — work never overlaps (SKIP LOCKED leasing)

🐳 Docker — one command (CPU)

Nothing to install but Docker. One worker per CPU core.

docker run --rm -e DONTO_EMBED_TOKEN=80f8ffdbee931ae49b10b500263044f4674e94310ae4fefe ghcr.io/thomasdavis/donto-embed-worker

Fewer cores: -e EMBED_WORKERS=2. Detached & restart-on-reboot: -d --restart unless-stopped --name donto-embed. Stop: docker stop donto-embed.

⚡ Got an NVIDIA GPU? Use the Claude-Code paste instead — the image is CPU-only, but the paste installs fastembed-gpu and runs the same worker at ~2,000–6,000 vectors/min (vs ~100/core). One GPU ≈ a 50–100-core fleet.

FAQ

Multiple computers? Yes — run the same paste (or Docker command) on each, same token. The coordinator hands every worker a disjoint slice (FOR UPDATE SKIP LOCKED); nothing is ever embedded twice, and a machine that vanishes mid-batch has its work auto-reclaimed in ~15 min.
Will it hog my machine? CPU workers use ~0.3–0.5 GB RAM each — run fewer if it swaps. The GPU path uses ~1–2 GB of VRAM; halve EMBED_BATCH if it OOMs.
How do I know it's really working? Logs print +N (total X, ~R/min); nvidia-smi shows the python process (GPU path); your machine appears at admin.donto.org/embed. "Started" ≠ "working" — check one of these.
Can I stop it? Anytime — pkill -f worker.py, docker stop, or close the terminal. Unfinished work auto-returns to the queue. Nothing is lost.
I got a 403. Cloudflare blocks the default python-urllib user-agent — the worker already sends a curl UA; don't strip that header.
"Queue empty / nothing happening"? Momentary — it refills every 2 min. Leave the worker running; it picks work up the instant it lands.
Do I need an account? No. Just Python (or Docker) and the token.

Lend donto your spare compute 🛠️

🤖 Claude Code — paste & go (CPU or GPU)

🐳 Docker — one command (CPU)

The token

Watch your contribution land

What am I actually doing?

FAQ