multi_corpus_e2e: WORKERS_LIMIT env knob — and the embed-text-not-sample-size finding

Adds WORKERS_LIMIT env override (default 5000) so the e2e can be
re-run at different sample sizes. Tiny change; the interesting part
is the FINDING that motivated the run.

Investigation: a97881d's reality test put zero Forklift Operators in
the top-6 for "Forklift operator with OSHA-30 certification,
warehouse experience" — instead returned Production Worker / Machine
Operator / Assembler.

Hypothesis tested: maybe the 5000-row sample didn't contain
forklift operators in retrievable density.

Result: hypothesis falsified. Direct probe of workers_500k.parquet:

  All 500K rows         → 55,349 Forklift Operators (11.07%)
                       → 150,328 with "forklift" in certs
                       → 74,852 with OSHA-30 specifically
  First 5K rows         → 569 Forklift Operators (11.38%)
                       → distribution matches global, no ordering bias

So 569 forklift operators were IN the corpus the matrix indexer
searched and STILL didn't surface in top-6. That means the bottleneck
isn't sample size — it's nomic-embed-text + our embed-text template
ranking "Production Worker" / "Machine Operator" / "Assembler" as
semantically nearer to the query than literal "Forklift Operator".

The reality test exposed this faithfully. Three real follow-ups, none
in scope of this commit:

  1. Embed text design — front-loading role + certs (currently
     "Worker role: <role>" then skills then certs) might dominate
     retrieval better. Worth A/B-testing.
  2. Hybrid SQL+semantic — pre-filter by role/certs via queryd
     before semantic ranking. Not in SPEC §3.4 today; would address
     the "available" / "Chicago" gap from the candidates reality
     test (0d1553c) too.
  3. Playbook-memory boost — SPEC §3.4 component 5. When a query
     "Forklift OSHA-30" was answered with worker w-X in the past,
     boost w-X's score for similar future queries. The retrieval
     gap CAN be bridged by the learning loop without changing the
     base embedder.

Commits the env knob; the finding lives in the commit body so future
sessions don't re-run the sample-size hypothesis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
root 2026-04-29 19:26:32 -05:00
parent a97881d80c
commit 31b408882b

View File

@ -24,6 +24,7 @@ cd "$(dirname "$0")/.."
export PATH="$PATH:/usr/local/go/bin"
QUERY="${1:-Forklift operator with OSHA-30 certification, warehouse experience}"
WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
echo "[multi-corpus-e2e] Ollama not reachable on :11434 — skipping"
@ -94,8 +95,8 @@ poll_health 3218 || { echo "matrixd failed"; exit 1; }
poll_health 3110 || { echo "gateway failed"; exit 1; }
echo
echo "[multi-corpus-e2e] ingest workers (limit=5000)..."
./bin/staffing_workers -limit 5000
echo "[multi-corpus-e2e] ingest workers (limit=$WORKERS_LIMIT)..."
./bin/staffing_workers -limit "$WORKERS_LIMIT"
echo
echo "[multi-corpus-e2e] ingest candidates..."