multi_corpus_e2e: WORKERS_LIMIT env knob — and the embed-text-not-sample-size finding

Adds WORKERS_LIMIT env override (default 5000) so the e2e can be re-run at different sample sizes. Tiny change; the interesting part is the FINDING that motivated the run.

Investigation: a97881d's reality test put zero Forklift Operators in the top-6 for "Forklift operator with OSHA-30 certification, warehouse experience" and instead returned Production Worker / Machine Operator / Assembler.

Hypothesis tested: maybe the 5000-row sample didn't contain forklift operators in retrievable density.

Result: hypothesis falsified. Direct probe of workers_500k.parquet:

  All 500K rows  → 55,349 Forklift Operators (11.07%)
                 → 150,328 with "forklift" in certs
                 → 74,852 with OSHA-30 specifically
  First 5K rows  → 569 Forklift Operators (11.38%)
                 → distribution matches global, no ordering bias

So 569 forklift operators were IN the corpus the matrix indexer searched and STILL didn't surface in top-6. That means the bottleneck isn't sample size; it's nomic-embed-text + our embed-text template ranking "Production Worker" / "Machine Operator" / "Assembler" as semantically nearer to the query than literal "Forklift Operator". The reality test exposed this faithfully.

Three real follow-ups, none in scope of this commit:

1. Embed text design: front-loading role + certs (currently "Worker role: <role>" then skills then certs) might dominate retrieval better. Worth A/B-testing.

2. Hybrid SQL+semantic: pre-filter by role/certs via queryd before semantic ranking. Not in SPEC §3.4 today; would address the "available" / "Chicago" gap from the candidates reality test (0d1553c) too.

3. Playbook-memory boost: SPEC §3.4 component 5. When a query "Forklift OSHA-30" was answered with worker w-X in the past, boost w-X's score for similar future queries. The retrieval gap CAN be bridged by the learning loop without changing the base embedder.

Commits the env knob; the finding lives in the commit body so future sessions don't re-run the sample-size hypothesis.

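The density figures quoted from the probe can be re-derived with a few lines of shell. This is only an arithmetic sketch: the counts are copied from the message above, and pct is a hypothetical helper, not something in the repo.

```shell
# Share of rows with a given role, as a percentage with two decimals.
# (pct is a hypothetical helper for this check only.)
pct() { awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f", 100 * n / d }'; }

global=$(pct 55349 500000)   # Forklift Operators in all 500K rows
sample=$(pct 569 5000)       # Forklift Operators in the first 5K rows

# Forklift Operators a 5K sample would hold at exactly the global rate.
expected=$(awk 'BEGIN { printf "%.0f", 5000 * 55349 / 500000 }')

echo "global=${global}% sample=${sample}% expected_in_5k=${expected} observed_in_5k=569"
```

The sample holds slightly more Forklift Operators than the global rate predicts (569 observed against roughly 553 expected), which is consistent with the conclusion that the first 5K rows were not starved of the role.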
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:26:32 -05:00
commit 31b408882b
parent a97881d80c
1 changed file with 3 additions and 2 deletions
--- a/scripts/multi_corpus_e2e.sh
+++ b/scripts/multi_corpus_e2e.sh
@@ -24,6 +24,7 @@ cd "$(dirname "$0")/.."
 export PATH="$PATH:/usr/local/go/bin"

 QUERY="${1:-Forklift operator with OSHA-30 certification, warehouse experience}"
+WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"

 if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
  echo "[multi-corpus-e2e] Ollama not reachable on :11434 — skipping"
@@ -94,8 +95,8 @@ poll_health 3218 || { echo "matrixd failed"; exit 1; }
 poll_health 3110 || { echo "gateway failed"; exit 1; }

 echo
-echo "[multi-corpus-e2e] ingest workers (limit=5000)..."
-./bin/staffing_workers -limit 5000
+echo "[multi-corpus-e2e] ingest workers (limit=$WORKERS_LIMIT)..."
+./bin/staffing_workers -limit "$WORKERS_LIMIT"

 echo
 echo "[multi-corpus-e2e] ingest candidates..."
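
The override added in this diff relies on the shell's default-value expansion. The sketch below isolates that behaviour so it can be checked without starting the services; effective_limit is a hypothetical helper, and 50000 is an arbitrary example value rather than a recommended sample size.

```shell
# Mirrors the expansion added in the diff: a non-empty WORKERS_LIMIT from the
# environment wins, otherwise the value falls back to 5000.
# (effective_limit is a hypothetical helper, not part of the script.)
effective_limit() {
  echo "${WORKERS_LIMIT:-5000}"
}

unset WORKERS_LIMIT
default_limit=$(effective_limit)                        # no override
override_limit=$(WORKERS_LIMIT=50000 effective_limit)   # per-run override
empty_limit=$(WORKERS_LIMIT= effective_limit)           # empty counts as unset
echo "default=$default_limit override=$override_limit empty=$empty_limit"

# Real invocations would look like:
#   ./scripts/multi_corpus_e2e.sh                       # ingests 5000 rows
#   WORKERS_LIMIT=50000 ./scripts/multi_corpus_e2e.sh   # ingests 50000 rows
```

Because the expansion uses `:-` rather than `-`, an exported but empty WORKERS_LIMIT also falls back to 5000 instead of passing an empty string to -limit.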