multi_corpus_e2e: WORKERS_LIMIT env knob, and the embed-text-not-sample-size finding

Adds WORKERS_LIMIT env override (default 5000) so the e2e can be
re-run at different sample sizes. Tiny change; the interesting part
is the FINDING that motivated the run.

Investigation: a97881d's reality test put zero Forklift Operators in
the top-6 for "Forklift operator with OSHA-30 certification,
warehouse experience". It returned Production Worker / Machine
Operator / Assembler instead.

Hypothesis tested: maybe the 5000-row sample didn't contain
forklift operators in retrievable density.

Result: hypothesis falsified. Direct probe of workers_500k.parquet:

  All 500K rows  → 55,349 Forklift Operators (11.07%)
                 → 150,328 with "forklift" in certs
                 → 74,852 with OSHA-30 specifically
  First 5K rows  → 569 Forklift Operators (11.38%)
                 → distribution matches global, no ordering bias

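The percentages in the probe can be re-derived from the raw counts. A minimal sketch, assuming only the counts quoted above; `pct` is a hypothetical helper, not part of the repo:

```shell
# pct PART TOTAL: print PART as a percentage of TOTAL, two decimals.
pct() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * part / total }'
}

pct 55349 500000   # Forklift Operators in the full file
pct 569 5000       # Forklift Operators in the first 5K rows
```

The two rates differ by about 0.3 percentage points, which is consistent with the "no ordering bias" reading.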
So 569 forklift operators were IN the corpus the matrix indexer
searched and STILL didn't surface in top-6. That means the bottleneck
isn't sample size — it's nomic-embed-text + our embed-text template
ranking "Production Worker" / "Machine Operator" / "Assembler" as
semantically nearer to the query than literal "Forklift Operator".
The reality test exposed this faithfully. Three real follow-ups, none
in scope of this commit:

  1. Embed text design: front-loading role + certs (currently
     "Worker role: <role>" then skills then certs) might give role
     and cert matches more weight in retrieval. Worth A/B-testing.
  2. Hybrid SQL+semantic: pre-filter by role/certs via queryd
     before semantic ranking. Not in SPEC §3.4 today; would address
     the "available" / "Chicago" gap from the candidates reality
     test (0d1553c) too.
  3. Playbook-memory boost: SPEC §3.4 component 5. When a query
     such as "Forklift OSHA-30" was answered with worker w-X in the
     past, boost w-X's score for similar future queries. The
     retrieval gap CAN be bridged by the learning loop without
     changing the base embedder.

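Follow-up 1 could be A/B-tested by generating both embed texts for the same record. A rough sketch under stated assumptions: only the "Worker role: <role>" prefix comes from this commit message; the "Skills:" and "Certifications:" labels, the function names, and the sample record are hypothetical.

```shell
# Arguments for both helpers: ROLE SKILLS CERTS (illustrative values only).

# Variant A: ordering described as current (role, then skills, then certs).
embed_text_current() {
  printf 'Worker role: %s. Skills: %s. Certifications: %s.\n' "$1" "$2" "$3"
}

# Variant B: role and certs front-loaded, skills moved to the end.
embed_text_frontloaded() {
  printf 'Worker role: %s. Certifications: %s. Skills: %s.\n' "$1" "$3" "$2"
}

embed_text_current     "Forklift Operator" "pallet jack, loading" "OSHA-30, forklift"
embed_text_frontloaded "Forklift Operator" "pallet jack, loading" "OSHA-30, forklift"
```

Indexing the same sample once per variant and comparing the top-6 for the default query would show whether ordering alone moves Forklift Operators up.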
Commits the env knob; the finding lives in the commit body so future
sessions don't re-run the sample-size hypothesis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

parent a97881d80c
commit 31b408882b

@@ -24,6 +24,7 @@ cd "$(dirname "$0")/.."
 export PATH="$PATH:/usr/local/go/bin"
 
 QUERY="${1:-Forklift operator with OSHA-30 certification, warehouse experience}"
+WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
 
 if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
 echo "[multi-corpus-e2e] Ollama not reachable on :11434 — skipping"
@@ -94,8 +95,8 @@ poll_health 3218 || { echo "matrixd failed"; exit 1; }
 poll_health 3110 || { echo "gateway failed"; exit 1; }
 
 echo
-echo "[multi-corpus-e2e] ingest workers (limit=5000)..."
-./bin/staffing_workers -limit 5000
+echo "[multi-corpus-e2e] ingest workers (limit=$WORKERS_LIMIT)..."
+./bin/staffing_workers -limit "$WORKERS_LIMIT"
 
 echo
 echo "[multi-corpus-e2e] ingest candidates..."
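
The knob added in this diff relies on the `${VAR:-default}` expansion. The sketch below shows how it resolves, plus an integer guard that is an assumption and not part of this commit; `resolve_workers_limit` is a hypothetical helper name.

```shell
# Resolve the ingest limit the same way the diff does, then validate it.
resolve_workers_limit() {
  # Same expansion as the script: fall back to 5000 when unset or empty.
  local limit="${WORKERS_LIMIT:-5000}"
  # Extra guard (assumption, not in the commit): digits only.
  case "$limit" in
    *[!0-9]*)
      echo "WORKERS_LIMIT must be a whole number, got: $limit" >&2
      return 1
      ;;
  esac
  echo "$limit"
}

unset WORKERS_LIMIT
resolve_workers_limit                            # prints 5000
( WORKERS_LIMIT=50000; resolve_workers_limit )   # prints 50000
```

In practice the override is set on the command line in front of the e2e script invocation, for example `WORKERS_LIMIT=50000` followed by the script's path.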