LAKEHOUSE/scripts/multi_corpus_e2e.sh
root 31b408882b multi_corpus_e2e: WORKERS_LIMIT env knob — and the embed-text-not-sample-size finding
Adds WORKERS_LIMIT env override (default 5000) so the e2e can be
re-run at different sample sizes. Tiny change; the interesting part
is the FINDING that motivated the run.
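The knob itself is one line of parameter expansion; a minimal sketch of the default-vs-override behaviour (the 20000 value is just an example, not a recommended size):

```shell
# Same default-with-override pattern the script uses.
unset WORKERS_LIMIT                      # pretend no override is set
WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
echo "default: $WORKERS_LIMIT"           # prints "default: 5000"

WORKERS_LIMIT=20000                      # hypothetical larger re-run
echo "override: $WORKERS_LIMIT"          # prints "override: 20000"
```

Against the real script that looks like `WORKERS_LIMIT=20000 ./scripts/multi_corpus_e2e.sh`.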

Investigation: a97881d's reality test put zero Forklift Operators in
the top-6 for "Forklift operator with OSHA-30 certification,
warehouse experience" — instead returned Production Worker / Machine
Operator / Assembler.

Hypothesis tested: maybe the 5000-row sample didn't contain
forklift operators in retrievable density.

Result: hypothesis falsified. Direct probe of workers_500k.parquet:

  All 500K rows         → 55,349 Forklift Operators (11.07%)
                       → 150,328 with "forklift" in certs
                       → 74,852 with OSHA-30 specifically
  First 5K rows         → 569 Forklift Operators (11.38%)
                       → distribution matches global, no ordering bias
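The two rates quoted above can be recomputed directly from the raw counts (the counts themselves come from the parquet probe):

```shell
# Cross-check: global vs first-5K Forklift Operator rate.
awk 'BEGIN {
  printf "global: %.2f%%\n", 55349 / 500000 * 100   # 11.07%
  printf "sample: %.2f%%\n", 569 / 5000 * 100       # 11.38%
}'
```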

So 569 forklift operators were IN the corpus the matrix indexer
searched and STILL didn't surface in top-6. That means the bottleneck
isn't sample size — it's nomic-embed-text + our embed-text template
ranking "Production Worker" / "Machine Operator" / "Assembler" as
semantically nearer to the query than literal "Forklift Operator".

The reality test exposed this faithfully. Three real follow-ups, none
in scope of this commit:

  1. Embed text design — front-loading role + certs (currently
     "Worker role: <role>" then skills then certs) might anchor
     retrieval on the role better. Worth A/B-testing.
  2. Hybrid SQL+semantic — pre-filter by role/certs via queryd
     before semantic ranking. Not in SPEC §3.4 today; would address
     the "available" / "Chicago" gap from the candidates reality
     test (0d1553c) too.
  3. Playbook-memory boost — SPEC §3.4 component 5. When a query
     "Forklift OSHA-30" was answered with worker w-X in the past,
     boost w-X's score for similar future queries. The retrieval
     gap CAN be bridged by the learning loop without changing the
     base embedder.
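A minimal sketch of follow-up 1, the embed-text A/B. The field values and the exact template wording are hypothetical (only the "Worker role: <role>" then skills then certs ordering is from the current template); the point is purely the field order fed to the embedder:

```shell
# Hypothetical worker fields; real ones come from the parquet rows.
role="Forklift Operator"
skills="pallet jack, inventory, shipping/receiving"
certs="OSHA-30"

# Current layout: role leads, but certs trail the (often long) skills list.
current="Worker role: $role. Skills: $skills. Certs: $certs."

# Candidate layout: front-load role + certs so they dominate the vector.
frontloaded="$role, certified $certs. Skills: $skills."

printf '%s\n%s\n' "$current" "$frontloaded"
```

An A/B run would index the same rows under both templates and compare top-6 overlap on queries like the forklift one above.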

Commits the env knob; the finding lives in the commit body so future
sessions don't re-run the sample-size hypothesis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:26:32 -05:00


#!/usr/bin/env bash
# Multi-corpus reality test — first deep-field test with TWO real
# staffing corpora composed via /v1/matrix/search.
#
# Pipeline:
# - Bring up the Go stack (storaged, embedd, vectord, matrixd, gateway)
# - Ingest workers (5000 rows from workers_500k.parquet)
# - Ingest candidates (1000 rows from candidates.parquet)
# - Run a real query through /v1/matrix/search with both corpora
# - Print the merged top-k with corpus attribution
#
# Headline assertion: results include hits from BOTH corpora (the
# whole point of multi-corpus matrix retrieval).
#
# Requires: Ollama on :11434 with nomic-embed-text loaded. Skips
# (exit 0) when Ollama is absent.
#
# Usage: ./scripts/multi_corpus_e2e.sh
#        ./scripts/multi_corpus_e2e.sh "your custom query"
set -euo pipefail
cd "$(dirname "$0")/.."
export PATH="$PATH:/usr/local/go/bin"
QUERY="${1:-Forklift operator with OSHA-30 certification, warehouse experience}"
WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
  echo "[multi-corpus-e2e] Ollama not reachable on :11434 — skipping"
  exit 0
fi
echo "[multi-corpus-e2e] building binaries..."
go build -o bin/ ./cmd/storaged ./cmd/embedd ./cmd/vectord ./cmd/matrixd ./cmd/gateway \
  ./scripts/staffing_workers ./scripts/staffing_candidates
pkill -f "bin/(storaged|embedd|vectord|matrixd|gateway)" 2>/dev/null || true
sleep 0.3
PIDS=()
TMP="$(mktemp -d)"
CFG="$TMP/e2e.toml"
cleanup() {
  echo "[multi-corpus-e2e] cleanup"
  for p in "${PIDS[@]}"; do [ -n "$p" ] && kill "$p" 2>/dev/null || true; done
  rm -rf "$TMP"
}
trap cleanup EXIT INT TERM
# Ephemeral mode (vectord storaged_url=""); same rationale as
# candidates_e2e — don't pollute MinIO _vectors/ between runs.
cat > "$CFG" <<EOF
[gateway]
bind = "127.0.0.1:3110"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
ingestd_url = "http://127.0.0.1:3213"
queryd_url = "http://127.0.0.1:3214"
vectord_url = "http://127.0.0.1:3215"
embedd_url = "http://127.0.0.1:3216"
pathwayd_url = "http://127.0.0.1:3217"
matrixd_url = "http://127.0.0.1:3218"
[vectord]
bind = "127.0.0.1:3215"
storaged_url = ""
[matrixd]
bind = "127.0.0.1:3218"
embedd_url = "http://127.0.0.1:3216"
vectord_url = "http://127.0.0.1:3215"
EOF
poll_health() {
  local port="$1" deadline=$(($(date +%s) + 5))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if curl -sS --max-time 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then return 0; fi
    sleep 0.05
  done
  return 1
}
echo "[multi-corpus-e2e] launching stack..."
./bin/storaged -config "$CFG" > /tmp/storaged.log 2>&1 & PIDS+=($!)
poll_health 3211 || { echo "storaged failed"; exit 1; }
./bin/embedd -config "$CFG" > /tmp/embedd.log 2>&1 & PIDS+=($!)
poll_health 3216 || { echo "embedd failed"; exit 1; }
./bin/vectord -config "$CFG" > /tmp/vectord.log 2>&1 & PIDS+=($!)
poll_health 3215 || { echo "vectord failed"; exit 1; }
./bin/matrixd -config "$CFG" > /tmp/matrixd.log 2>&1 & PIDS+=($!)
poll_health 3218 || { echo "matrixd failed"; exit 1; }
./bin/gateway -config "$CFG" > /tmp/gateway.log 2>&1 & PIDS+=($!)
poll_health 3110 || { echo "gateway failed"; exit 1; }
echo
echo "[multi-corpus-e2e] ingest workers (limit=$WORKERS_LIMIT)..."
./bin/staffing_workers -limit "$WORKERS_LIMIT"
echo
echo "[multi-corpus-e2e] ingest candidates..."
./bin/staffing_candidates -skip-populate=false -query "$QUERY" 2>&1 | grep -v "^\[candidates\]\(matrix\|reality\)" || true
echo
echo "[multi-corpus-e2e] /matrix/corpora — confirm both registered:"
curl -sS http://127.0.0.1:3110/v1/matrix/corpora | jq -c
echo
echo "[multi-corpus-e2e] multi-corpus query: $QUERY"
RESP="$(curl -sS -X POST http://127.0.0.1:3110/v1/matrix/search \
  -H 'Content-Type: application/json' \
  -d "{\"query_text\":\"$QUERY\",\"corpora\":[\"workers\",\"candidates\"],\"k\":8,\"per_corpus_k\":6}")"
# Sanity / headline assertions
WORKER_HITS="$(echo "$RESP" | jq -r '[.results[] | select(.corpus=="workers")] | length')"
CAND_HITS="$(echo "$RESP" | jq -r '[.results[] | select(.corpus=="candidates")] | length')"
TOTAL="$(echo "$RESP" | jq -r '.results | length')"
echo
echo "[multi-corpus-e2e] merged top-$TOTAL: workers=$WORKER_HITS candidates=$CAND_HITS"
echo "$RESP" | jq -r '.results[] | " \(.corpus | .[0:1]) d=\(.distance | tostring | .[0:6]) \(.id) \(.metadata.role // .metadata.skills // "n/a")"'
if [ "$WORKER_HITS" -gt 0 ] && [ "$CAND_HITS" -gt 0 ]; then
  echo
  echo "[multi-corpus-e2e] PASS: both corpora represented in merged top-$TOTAL"
  exit 0
else
  echo
  echo "[multi-corpus-e2e] FAIL: corpus mix was workers=$WORKER_HITS candidates=$CAND_HITS"
  exit 1
fi