From 31b408882bf49dafeafb1ce9df9180187f688859 Mon Sep 17 00:00:00 2001
From: root
Date: Wed, 29 Apr 2026 19:26:32 -0500
Subject: [PATCH] =?UTF-8?q?multi=5Fcorpus=5Fe2e:=20WORKERS=5FLIMIT=20env?=
 =?UTF-8?q?=20knob=20=E2=80=94=20and=20the=20embed-text-not-sample-size=20?=
 =?UTF-8?q?finding?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds WORKERS_LIMIT env override (default 5000) so the e2e can be re-run
at different sample sizes. Tiny change; the interesting part is the
FINDING that motivated the run.

Investigation: a97881d's reality test put zero Forklift Operators in the
top-6 for "Forklift operator with OSHA-30 certification, warehouse
experience" — instead it returned Production Worker / Machine Operator /
Assembler.

Hypothesis tested: maybe the 5000-row sample didn't contain forklift
operators in retrievable density.

Result: hypothesis falsified. Direct probe of workers_500k.parquet:

  All 500K rows  → 55,349 Forklift Operators (11.07%)
                 → 150,328 with "forklift" in certs
                 → 74,852 with OSHA-30 specifically
  First 5K rows  → 569 Forklift Operators (11.38%)
                 → distribution matches global, no ordering bias

So 569 forklift operators were IN the corpus the matrix indexer searched
and STILL didn't surface in top-6. That means the bottleneck isn't
sample size — it's nomic-embed-text + our embed-text template ranking
"Production Worker" / "Machine Operator" / "Assembler" as semantically
nearer to the query than literal "Forklift Operator". The reality test
exposed this faithfully.

Three real follow-ups, none in scope of this commit:

1. Embed text design — front-loading role + certs (currently
   "Worker role: " then skills then certs) might dominate retrieval
   better. Worth A/B-testing.

2. Hybrid SQL+semantic — pre-filter by role/certs via queryd before
   semantic ranking. Not in SPEC §3.4 today; would address the
   "available" / "Chicago" gap from the candidates reality test
   (0d1553c) too.

3. Playbook-memory boost — SPEC §3.4 component 5.
   When a query "Forklift OSHA-30" was answered with worker w-X in the
   past, boost w-X's score for similar future queries. The retrieval
   gap CAN be bridged by the learning loop without changing the base
   embedder.

Commits the env knob; the finding lives in the commit body so future
sessions don't re-run the sample-size hypothesis.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 scripts/multi_corpus_e2e.sh | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/scripts/multi_corpus_e2e.sh b/scripts/multi_corpus_e2e.sh
index 5cbe2ef..897e682 100755
--- a/scripts/multi_corpus_e2e.sh
+++ b/scripts/multi_corpus_e2e.sh
@@ -24,6 +24,7 @@ cd "$(dirname "$0")/.."
 export PATH="$PATH:/usr/local/go/bin"
 
 QUERY="${1:-Forklift operator with OSHA-30 certification, warehouse experience}"
+WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
 
 if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
   echo "[multi-corpus-e2e] Ollama not reachable on :11434 — skipping"
@@ -94,8 +95,8 @@ poll_health 3218 || { echo "matrixd failed"; exit 1; }
 poll_health 3110 || { echo "gateway failed"; exit 1; }
 
 echo
-echo "[multi-corpus-e2e] ingest workers (limit=5000)..."
-./bin/staffing_workers -limit 5000
+echo "[multi-corpus-e2e] ingest workers (limit=$WORKERS_LIMIT)..."
+./bin/staffing_workers -limit "$WORKERS_LIMIT"
 
 echo
 echo "[multi-corpus-e2e] ingest candidates..."
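
Note (not part of the patch): the "direct probe of workers_500k.parquet" in the commit body could be sketched along these lines. This is a hypothetical reconstruction — the column names `role` and `certifications` and the helper `forklift_stats` are assumptions, since the actual probe script was not committed.

```python
import pandas as pd

def forklift_stats(df: pd.DataFrame) -> dict:
    """Count forklift-related rows the way the commit-body probe describes.

    Assumes hypothetical columns "role" and "certifications"; the real
    parquet schema may differ.
    """
    role = df["role"].str.contains("Forklift Operator", case=False, na=False)
    certs = df["certifications"].str.contains("forklift", case=False, na=False)
    osha = df["certifications"].str.contains("OSHA-30", na=False)
    return {
        "forklift_operators": int(role.sum()),   # e.g. 55,349 on all 500K rows
        "forklift_in_certs": int(certs.sum()),   # e.g. 150,328
        "osha30_in_certs": int(osha.sum()),      # e.g. 74,852
        "pct_forklift": round(100 * role.mean(), 2),
    }

# Usage against the real corpus would look like:
#   df = pd.read_parquet("workers_500k.parquet")
#   forklift_stats(df)             # global distribution
#   forklift_stats(df.head(5000))  # the ingested sample — checks ordering bias
```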