golangLAKEHOUSE/scripts/multi_corpus_e2e.sh
root a97881d80c workers corpus + multi-corpus reality test — matrix indexer end-to-end
Lands the second real-data corpus (workers_500k) and the first
multi-corpus reality test through /v1/matrix/search composing both
corpora live.

What's new:
  - scripts/staffing_workers/main.go — parquet driver over
    workers_500k.parquet, multi-chunk arrow handling (workers
    parquet has multiple row groups vs candidates' one). Embed text:
    role + skills + certifications + city + state + archetype +
    resume_text. IDs prefixed "w-".
  - scripts/multi_corpus_e2e.sh — first end-to-end test composing
    both corpora through the matrix indexer.

Real-data multi-corpus result (this commit):
  Query: "Forklift operator with OSHA-30 certification, warehouse
          experience"
  Corpora: workers (5000 rows) + candidates (1000 rows)
  Merged top-8: workers=6, candidates=2

  Top hits:
    w  d=0.327  w-4573  Production Worker
    w  d=0.353  w-1726  Machine Operator
    w  d=0.362  w-3806  Production Worker
    w  d=0.366  w-1000  Machine Operator
    w  d=0.374  w-1436  Assembler
    w  d=0.395  w-162   Machine Operator
    c  d=0.440  c-CAND-00727  C#,.NET,Azure
    c  d=0.446  c-CAND-00031  React,TypeScript,Node

The matrix indexer correctly chose the right domain — manufacturing/
warehouse roles in workers (correct semantic match for the staffing
query) rank ABOVE software-engineer candidates from the candidates
corpus. 0.11 gap between the worst worker (0.395) and the best
candidate (0.440) — clean distance separation.

Compared to the candidates-only e2e run from 0d1553c:
  candidates-only top: c-CAND-00727 at d=0.4404
  multi-corpus top:    w-4573 at d=0.3265 (a Production Worker)

That's the matrix indexer's whole point made visible: composing
domain-distinct corpora surfaces better matches than single-corpus
search. Without workers in the search space, the staffing query
returned software engineers (wrong domain). With workers, it
returns roles in the right ballpark.

What's still imperfect (signal for component 5 + future work):
  - No top-6 worker actually has "Forklift" or "OSHA-30" visible in
    metadata; "Production Worker" is semantically nearest in this
    sample. Likely needs a larger workers ingest (5000 from 500K)
    or skill-keyword boost.
  - Status/availability still not gated. The staffing-side
    structured filtering gap from 0d1553c persists; relevance filter
    (CODE-aware) doesn't address it.

Pipeline timings:
  workers ingest: 5000 rows / 19.2s = 260/sec end-to-end
  candidates ingest: 1000 rows / 3.1s = 322/sec
  multi-corpus query (text → embed → 2 parallel vectord → merge): 14ms

14-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix, relevance, downgrade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:22:16 -05:00

132 lines
4.5 KiB
Bash
Executable File

#!/usr/bin/env bash
# Multi-corpus reality test — first deep-field test with TWO real
# staffing corpora composed via /v1/matrix/search.
#
# Pipeline:
# - Bring up the Go stack (storaged, embedd, vectord, matrixd, gateway)
# - Ingest workers (5000 rows from workers_500k.parquet)
# - Ingest candidates (1000 rows from candidates.parquet)
# - Run a real query through /v1/matrix/search with both corpora
# - Print the merged top-k with corpus attribution
#
# Headline assertion: results include hits from BOTH corpora (the
# whole point of multi-corpus matrix retrieval).
#
# Requires: Ollama on :11434 with nomic-embed-text loaded. Skips
# (exit 0) when Ollama is absent.
#
# Usage: ./scripts/multi_corpus_e2e.sh
# ./scripts/multi_corpus_e2e.sh "your custom query"
set -euo pipefail
cd "$(dirname "$0")/.."
export PATH="$PATH:/usr/local/go/bin"
QUERY="${1:-Forklift operator with OSHA-30 certification, warehouse experience}"
if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
echo "[multi-corpus-e2e] Ollama not reachable on :11434 — skipping"
exit 0
fi
echo "[multi-corpus-e2e] building binaries..."
go build -o bin/ ./cmd/storaged ./cmd/embedd ./cmd/vectord ./cmd/matrixd ./cmd/gateway \
./scripts/staffing_workers ./scripts/staffing_candidates
pkill -f "bin/(storaged|embedd|vectord|matrixd|gateway)" 2>/dev/null || true
sleep 0.3
PIDS=()
TMP="$(mktemp -d)"
CFG="$TMP/e2e.toml"
cleanup() {
echo "[multi-corpus-e2e] cleanup"
for p in "${PIDS[@]}"; do [ -n "$p" ] && kill "$p" 2>/dev/null || true; done
rm -rf "$TMP"
}
trap cleanup EXIT INT TERM
# Ephemeral mode (vectord storaged_url=""); same rationale as
# candidates_e2e — don't pollute MinIO _vectors/ between runs.
cat > "$CFG" <<EOF
[gateway]
bind = "127.0.0.1:3110"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
ingestd_url = "http://127.0.0.1:3213"
queryd_url = "http://127.0.0.1:3214"
vectord_url = "http://127.0.0.1:3215"
embedd_url = "http://127.0.0.1:3216"
pathwayd_url = "http://127.0.0.1:3217"
matrixd_url = "http://127.0.0.1:3218"
[vectord]
bind = "127.0.0.1:3215"
storaged_url = ""
[matrixd]
bind = "127.0.0.1:3218"
embedd_url = "http://127.0.0.1:3216"
vectord_url = "http://127.0.0.1:3215"
EOF
poll_health() {
local port="$1" deadline=$(($(date +%s) + 5))
while [ "$(date +%s)" -lt "$deadline" ]; do
if curl -sS --max-time 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then return 0; fi
sleep 0.05
done
return 1
}
echo "[multi-corpus-e2e] launching stack..."
./bin/storaged -config "$CFG" > /tmp/storaged.log 2>&1 & PIDS+=($!)
poll_health 3211 || { echo "storaged failed"; exit 1; }
./bin/embedd -config "$CFG" > /tmp/embedd.log 2>&1 & PIDS+=($!)
poll_health 3216 || { echo "embedd failed"; exit 1; }
./bin/vectord -config "$CFG" > /tmp/vectord.log 2>&1 & PIDS+=($!)
poll_health 3215 || { echo "vectord failed"; exit 1; }
./bin/matrixd -config "$CFG" > /tmp/matrixd.log 2>&1 & PIDS+=($!)
poll_health 3218 || { echo "matrixd failed"; exit 1; }
./bin/gateway -config "$CFG" > /tmp/gateway.log 2>&1 & PIDS+=($!)
poll_health 3110 || { echo "gateway failed"; exit 1; }
echo
echo "[multi-corpus-e2e] ingest workers (limit=5000)..."
./bin/staffing_workers -limit 5000
echo
echo "[multi-corpus-e2e] ingest candidates..."
./bin/staffing_candidates -skip-populate=false -query "$QUERY" 2>&1 | grep -v "^\[candidates\]\(matrix\|reality\)" || true
echo
echo "[multi-corpus-e2e] /matrix/corpora — confirm both registered:"
curl -sS http://127.0.0.1:3110/v1/matrix/corpora | jq -c
echo
echo "[multi-corpus-e2e] multi-corpus query: $QUERY"
RESP="$(curl -sS -X POST http://127.0.0.1:3110/v1/matrix/search \
-H 'Content-Type: application/json' \
-d "{\"query_text\":\"$QUERY\",\"corpora\":[\"workers\",\"candidates\"],\"k\":8,\"per_corpus_k\":6}")"
# Sanity / headline assertions
WORKER_HITS="$(echo "$RESP" | jq -r '[.results[] | select(.corpus=="workers")] | length')"
CAND_HITS="$(echo "$RESP" | jq -r '[.results[] | select(.corpus=="candidates")] | length')"
TOTAL="$(echo "$RESP" | jq -r '.results | length')"
echo
echo "[multi-corpus-e2e] merged top-$TOTAL: workers=$WORKER_HITS candidates=$CAND_HITS"
echo "$RESP" | jq -r '.results[] | " \(.corpus | .[0:1]) d=\(.distance | tostring | .[0:6]) \(.id) \(.metadata.role // .metadata.skills // "n/a")"'
if [ "$WORKER_HITS" -gt 0 ] && [ "$CAND_HITS" -gt 0 ]; then
echo
echo "[multi-corpus-e2e] PASS: both corpora represented in merged top-$TOTAL"
exit 0
else
echo
echo "[multi-corpus-e2e] FAIL: corpus mix was workers=$WORKER_HITS candidates=$CAND_HITS"
exit 1
fi