From 31b408882bf49dafeafb1ce9df9180187f688859 Mon Sep 17 00:00:00 2001
From: root
Date: Wed, 29 Apr 2026 19:26:32 -0500
Subject: [PATCH] =?UTF-8?q?multi=5Fcorpus=5Fe2e:=20WORKERS=5FLIMIT=20env?=
 =?UTF-8?q?=20knob=20=E2=80=94=20and=20the=20embed-text-not-sample-size=20?=
 =?UTF-8?q?finding?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds WORKERS_LIMIT env override (default 5000) so the e2e can be re-run
at different sample sizes. Tiny change; the interesting part is the
FINDING that motivated the run.

Investigation: a97881d's reality test put zero Forklift Operators in the
top-6 for "Forklift operator with OSHA-30 certification, warehouse
experience" — instead it returned Production Worker / Machine Operator /
Assembler.

Hypothesis tested: maybe the 5000-row sample didn't contain forklift
operators in retrievable density.

Result: hypothesis falsified. Direct probe of workers_500k.parquet:

  All 500K rows  → 55,349 Forklift Operators (11.07%)
                 → 150,328 with "forklift" in certs
                 → 74,852 with OSHA-30 specifically
  First 5K rows  → 569 Forklift Operators (11.38%)
                 → distribution matches global, no ordering bias

So 569 forklift operators were IN the corpus the matrix indexer searched
and STILL didn't surface in top-6. That means the bottleneck isn't
sample size — it's nomic-embed-text + our embed-text template ranking
"Production Worker" / "Machine Operator" / "Assembler" as semantically
nearer to the query than literal "Forklift Operator". The reality test
exposed this faithfully.

Three real follow-ups, none in scope of this commit:

1. Embed text design — front-loading role + certs (currently
   "Worker role: " then skills then certs) might dominate retrieval
   better. Worth A/B-testing.

2. Hybrid SQL+semantic — pre-filter by role/certs via queryd before
   semantic ranking. Not in SPEC §3.4 today; would address the
   "available" / "Chicago" gap from the candidates reality test
   (0d1553c) too.

3. Playbook-memory boost — SPEC §3.4 component 5.
   When a query "Forklift OSHA-30" was answered with worker w-X in the
   past, boost w-X's score for similar future queries. The retrieval
   gap CAN be bridged by the learning loop without changing the base
   embedder.

Commits the env knob; the finding lives in the commit body so future
sessions don't re-run the sample-size hypothesis.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 scripts/multi_corpus_e2e.sh | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/scripts/multi_corpus_e2e.sh b/scripts/multi_corpus_e2e.sh
index 5cbe2ef..897e682 100755
--- a/scripts/multi_corpus_e2e.sh
+++ b/scripts/multi_corpus_e2e.sh
@@ -24,6 +24,7 @@ cd "$(dirname "$0")/.."
 export PATH="$PATH:/usr/local/go/bin"
 
 QUERY="${1:-Forklift operator with OSHA-30 certification, warehouse experience}"
+WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
 
 if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
   echo "[multi-corpus-e2e] Ollama not reachable on :11434 — skipping"
@@ -94,8 +95,8 @@ poll_health 3218 || { echo "matrixd failed"; exit 1; }
 poll_health 3110 || { echo "gateway failed"; exit 1; }
 
 echo
-echo "[multi-corpus-e2e] ingest workers (limit=5000)..."
-./bin/staffing_workers -limit 5000
+echo "[multi-corpus-e2e] ingest workers (limit=$WORKERS_LIMIT)..."
+./bin/staffing_workers -limit "$WORKERS_LIMIT"
 
 echo
 echo "[multi-corpus-e2e] ingest candidates..."
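
Note (not part of the patch): the "direct probe of workers_500k.parquet" in the commit body could be sketched along these lines. This is a hypothetical reconstruction — the column names `role` and `certifications` and the helper `forklift_stats` are assumptions, since the actual probe script was not committed.

```python
import pandas as pd

def forklift_stats(df: pd.DataFrame) -> dict:
    """Count forklift-related rows the way the commit-body probe describes.

    Assumes hypothetical columns "role" and "certifications"; the real
    parquet schema may differ.
    """
    role = df["role"].str.contains("Forklift Operator", case=False, na=False)
    certs = df["certifications"].str.contains("forklift", case=False, na=False)
    osha = df["certifications"].str.contains("OSHA-30", na=False)
    return {
        "forklift_operators": int(role.sum()),   # e.g. 55,349 on all 500K rows
        "forklift_in_certs": int(certs.sum()),   # e.g. 150,328
        "osha30_in_certs": int(osha.sum()),      # e.g. 74,852
        "pct_forklift": round(100 * role.mean(), 2),
    }

# Usage against the real corpus would look like:
#   df = pd.read_parquet("workers_500k.parquet")
#   forklift_stats(df)             # global distribution
#   forklift_stats(df.head(5000))  # the ingested sample — checks ordering bias
```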