golangLAKEHOUSE/scripts/candidates_e2e.sh
root 0d1553ca88 candidates corpus: first deep-field reality test on real staffing data
Lands the second staffing corpus and the first end-to-end reality test
through the full Go pipeline: parquet → corpusingest → embedd →
vectord → matrixd → gateway.

What's new:
  - scripts/staffing_candidates/main.go — parquet Source over
    candidates.parquet (1000 rows, 11 cols), single-chunk arrow-go
    pqarrow read. Embed text: "Candidate skills: <s>. Based in
    <city>, <state>. <years> years experience. Status: <status>.
    <first> <last>." IDs prefixed "c-" so multi-corpus merges
    against workers ("w-") stay unambiguous.
  - scripts/candidates_e2e.sh — first integration smoke that runs
    the full stack (storaged + embedd + vectord + matrixd + gateway),
    ingests via corpusingest, runs a real query through
    /v1/matrix/search, prints results. Ephemeral mode (vectord
    persistence disabled via custom toml) so re-runs don't pollute
    MinIO _vectors/ and break g1p_smoke's "only-one-persisted-index"
    assertion.

Real bug caught + fixed in corpusingest:
  When LogProgress > 0, the progress goroutine's only exit was
  ctx.Done(). With context.Background() in the production driver,
  Run hung forever after the pipeline finished. Added a stopProgress
  channel that close()s after wg.Wait(). Regression test
  TestRun_ProgressLoggerExits bounds Run's wall to 2s with
  LogProgress=50ms.

This is the bug the unit tests didn't catch because every prior test
set LogProgress: 0. Reality test surfaced it on first real-data
run — exactly the hyperfocus-and-find-architectural-weakness
property J framed as the reason for the Go pass.

End-to-end output (1000 candidates, query "Python AWS Docker
engineer in Chicago available now"):

  populate: scanned=1000 embedded=1000 added=1000 wall=3.5s
  matrix returned 5 hits in 26ms

The result quality is the interesting signal: top-5 had ZERO
Chicago candidates, ZERO active-status candidates, and the exact-
skill-match (Python,AWS,Docker) ranked #3 not #1. Pipeline works;
retrieval quality has real architectural limits (no structured
filtering, no relevance gate, semantic-only ranking dominated by
secondary signals like "1 year experience" and "engineer"). This
motivates SPEC §3.4 components 3 (relevance filter) and
eventually structured filtering — exactly the kind of finding the
deep field reality tests are supposed to surface before Enterprise
cutover.

12-smoke regression sweep all green. 9 corpusingest unit tests
including the new regression. vet clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:06:27 -05:00

99 lines
3.3 KiB
Bash
Executable File

#!/usr/bin/env bash
# Candidates end-to-end — first deep-field reality test.
#
# Spins up storaged + embedd + vectord + matrixd + gateway, ingests
# the 1000-candidate corpus from
# /home/profit/lakehouse/data/datasets/candidates.parquet via the
# corpusingest substrate, then runs a real staffing query through
# /v1/matrix/search and prints the top 5 hits.
#
# Requires: Ollama on :11434 with nomic-embed-text loaded. If absent,
# this script exits 0 with a "skipped" message — same contract as
# g2_smoke.
#
# Usage: ./scripts/candidates_e2e.sh
# ./scripts/candidates_e2e.sh "your custom query here"
set -euo pipefail
cd "$(dirname "$0")/.."
export PATH="$PATH:/usr/local/go/bin"
QUERY="${1:-Python AWS Docker engineer in Chicago available now}"
if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
echo "[candidates-e2e] Ollama not reachable on :11434 — skipping (matches g2_smoke contract)"
exit 0
fi
echo "[candidates-e2e] building binaries..."
go build -o bin/ ./cmd/storaged ./cmd/embedd ./cmd/vectord ./cmd/matrixd ./cmd/gateway ./scripts/staffing_candidates
pkill -f "bin/(storaged|embedd|vectord|matrixd|gateway)" 2>/dev/null || true
sleep 0.3
PIDS=()
TMP="$(mktemp -d)"
CFG="$TMP/e2e.toml"
cleanup() {
echo "[candidates-e2e] cleanup"
for p in "${PIDS[@]}"; do [ -n "$p" ] && kill "$p" 2>/dev/null || true; done
rm -rf "$TMP"
}
trap cleanup EXIT INT TERM
# Custom toml: vectord persistence disabled so the candidates index
# doesn't survive the run. Without this, re-running pollutes the
# shared MinIO `_vectors/` prefix and breaks g1p_smoke's "this is
# the only persisted index" assertion (caught 2026-04-29).
cat > "$CFG" <<EOF
[gateway]
bind = "127.0.0.1:3110"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
ingestd_url = "http://127.0.0.1:3213"
queryd_url = "http://127.0.0.1:3214"
vectord_url = "http://127.0.0.1:3215"
embedd_url = "http://127.0.0.1:3216"
pathwayd_url = "http://127.0.0.1:3217"
matrixd_url = "http://127.0.0.1:3218"
[vectord]
bind = "127.0.0.1:3215"
storaged_url = ""
[matrixd]
bind = "127.0.0.1:3218"
embedd_url = "http://127.0.0.1:3216"
vectord_url = "http://127.0.0.1:3215"
EOF
poll_health() {
local port="$1" deadline=$(($(date +%s) + 5))
while [ "$(date +%s)" -lt "$deadline" ]; do
if curl -sS --max-time 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then return 0; fi
sleep 0.05
done
return 1
}
echo "[candidates-e2e] launching stack..."
./bin/storaged -config "$CFG" > /tmp/storaged.log 2>&1 & PIDS+=($!)
poll_health 3211 || { echo "storaged failed"; tail /tmp/storaged.log; exit 1; }
./bin/embedd -config "$CFG" > /tmp/embedd.log 2>&1 & PIDS+=($!)
poll_health 3216 || { echo "embedd failed"; tail /tmp/embedd.log; exit 1; }
./bin/vectord -config "$CFG" > /tmp/vectord.log 2>&1 & PIDS+=($!)
poll_health 3215 || { echo "vectord failed"; tail /tmp/vectord.log; exit 1; }
./bin/matrixd -config "$CFG" > /tmp/matrixd.log 2>&1 & PIDS+=($!)
poll_health 3218 || { echo "matrixd failed"; tail /tmp/matrixd.log; exit 1; }
./bin/gateway -config "$CFG" > /tmp/gateway.log 2>&1 & PIDS+=($!)
poll_health 3110 || { echo "gateway failed"; tail /tmp/gateway.log; exit 1; }
echo "[candidates-e2e] stack up; running ingest + reality test query..."
echo
./bin/staffing_candidates -query "$QUERY"