Lands the second staffing corpus and the first end-to-end reality test
through the full Go pipeline: parquet → corpusingest → embedd →
vectord → matrixd → gateway.
What's new:
- scripts/staffing_candidates/main.go — parquet Source over
candidates.parquet (1000 rows, 11 cols), single-chunk arrow-go
pqarrow read. Embed text: "Candidate skills: <s>. Based in
<city>, <state>. <years> years experience. Status: <status>.
<first> <last>." IDs prefixed "c-" so multi-corpus merges
against workers ("w-") stay unambiguous.
- scripts/candidates_e2e.sh — first integration smoke that runs
the full stack (storaged + embedd + vectord + matrixd + gateway),
ingests via corpusingest, runs a real query through
/v1/matrix/search, prints results. Ephemeral mode (vectord
persistence disabled via custom toml) so re-runs don't pollute
MinIO _vectors/ and break g1p_smoke's "only-one-persisted-index"
assertion.
Real bug caught + fixed in corpusingest:
When LogProgress > 0, the progress goroutine's only exit was
ctx.Done(). With context.Background() in the production driver,
Run hung forever after the pipeline finished. Added a stopProgress
channel that close()s after wg.Wait(). Regression test
TestRun_ProgressLoggerExits bounds Run's wall to 2s with
LogProgress=50ms.
This is the bug the unit tests didn't catch because every prior test
set LogProgress: 0. Reality test surfaced it on first real-data
run — exactly the hyperfocus-and-find-architectural-weakness
property J framed as the reason for the Go pass.
End-to-end output (1000 candidates, query "Python AWS Docker
engineer in Chicago available now"):
populate: scanned=1000 embedded=1000 added=1000 wall=3.5s
matrix returned 5 hits in 26ms
The result quality is the interesting signal: top-5 had ZERO
Chicago candidates, ZERO active-status candidates, and the exact-
skill-match (Python,AWS,Docker) ranked #3 not #1. Pipeline works;
retrieval quality has real architectural limits (no structured
filtering, no relevance gate, semantic-only ranking dominated by
secondary signals like "1 year experience" and "engineer"). This
motivates SPEC §3.4 components 3 (relevance filter) and
eventually structured filtering — exactly the kind of finding the
deep field reality tests are supposed to surface before Enterprise
cutover.
12-smoke regression sweep all green. 9 corpusingest unit tests
including the new regression. vet clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
99 lines
3.3 KiB
Bash
Executable File
99 lines
3.3 KiB
Bash
Executable File
#!/usr/bin/env bash
|
|
# Candidates end-to-end — first deep-field reality test.
|
|
#
|
|
# Spins up storaged + embedd + vectord + matrixd + gateway, ingests
|
|
# the 1000-candidate corpus from
|
|
# /home/profit/lakehouse/data/datasets/candidates.parquet via the
|
|
# corpusingest substrate, then runs a real staffing query through
|
|
# /v1/matrix/search and prints the top 5 hits.
|
|
#
|
|
# Requires: Ollama on :11434 with nomic-embed-text loaded. If absent,
|
|
# this script exits 0 with a "skipped" message — same contract as
|
|
# g2_smoke.
|
|
#
|
|
# Usage: ./scripts/candidates_e2e.sh
|
|
# ./scripts/candidates_e2e.sh "your custom query here"
|
|
|
|
set -euo pipefail
|
|
cd "$(dirname "$0")/.."
|
|
|
|
export PATH="$PATH:/usr/local/go/bin"
|
|
|
|
QUERY="${1:-Python AWS Docker engineer in Chicago available now}"
|
|
|
|
if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
|
|
echo "[candidates-e2e] Ollama not reachable on :11434 — skipping (matches g2_smoke contract)"
|
|
exit 0
|
|
fi
|
|
|
|
echo "[candidates-e2e] building binaries..."
|
|
go build -o bin/ ./cmd/storaged ./cmd/embedd ./cmd/vectord ./cmd/matrixd ./cmd/gateway ./scripts/staffing_candidates
|
|
|
|
pkill -f "bin/(storaged|embedd|vectord|matrixd|gateway)" 2>/dev/null || true
|
|
sleep 0.3
|
|
|
|
PIDS=()
|
|
TMP="$(mktemp -d)"
|
|
CFG="$TMP/e2e.toml"
|
|
cleanup() {
|
|
echo "[candidates-e2e] cleanup"
|
|
for p in "${PIDS[@]}"; do [ -n "$p" ] && kill "$p" 2>/dev/null || true; done
|
|
rm -rf "$TMP"
|
|
}
|
|
trap cleanup EXIT INT TERM
|
|
|
|
# Custom toml: vectord persistence disabled so the candidates index
|
|
# doesn't survive the run. Without this, re-running pollutes the
|
|
# shared MinIO `_vectors/` prefix and breaks g1p_smoke's "this is
|
|
# the only persisted index" assertion (caught 2026-04-29).
|
|
cat > "$CFG" <<EOF
|
|
[gateway]
|
|
bind = "127.0.0.1:3110"
|
|
storaged_url = "http://127.0.0.1:3211"
|
|
catalogd_url = "http://127.0.0.1:3212"
|
|
ingestd_url = "http://127.0.0.1:3213"
|
|
queryd_url = "http://127.0.0.1:3214"
|
|
vectord_url = "http://127.0.0.1:3215"
|
|
embedd_url = "http://127.0.0.1:3216"
|
|
pathwayd_url = "http://127.0.0.1:3217"
|
|
matrixd_url = "http://127.0.0.1:3218"
|
|
|
|
[vectord]
|
|
bind = "127.0.0.1:3215"
|
|
storaged_url = ""
|
|
|
|
[matrixd]
|
|
bind = "127.0.0.1:3218"
|
|
embedd_url = "http://127.0.0.1:3216"
|
|
vectord_url = "http://127.0.0.1:3215"
|
|
EOF
|
|
|
|
poll_health() {
|
|
local port="$1" deadline=$(($(date +%s) + 5))
|
|
while [ "$(date +%s)" -lt "$deadline" ]; do
|
|
if curl -sS --max-time 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then return 0; fi
|
|
sleep 0.05
|
|
done
|
|
return 1
|
|
}
|
|
|
|
echo "[candidates-e2e] launching stack..."
|
|
./bin/storaged -config "$CFG" > /tmp/storaged.log 2>&1 & PIDS+=($!)
|
|
poll_health 3211 || { echo "storaged failed"; tail /tmp/storaged.log; exit 1; }
|
|
|
|
./bin/embedd -config "$CFG" > /tmp/embedd.log 2>&1 & PIDS+=($!)
|
|
poll_health 3216 || { echo "embedd failed"; tail /tmp/embedd.log; exit 1; }
|
|
|
|
./bin/vectord -config "$CFG" > /tmp/vectord.log 2>&1 & PIDS+=($!)
|
|
poll_health 3215 || { echo "vectord failed"; tail /tmp/vectord.log; exit 1; }
|
|
|
|
./bin/matrixd -config "$CFG" > /tmp/matrixd.log 2>&1 & PIDS+=($!)
|
|
poll_health 3218 || { echo "matrixd failed"; tail /tmp/matrixd.log; exit 1; }
|
|
|
|
./bin/gateway -config "$CFG" > /tmp/gateway.log 2>&1 & PIDS+=($!)
|
|
poll_health 3110 || { echo "gateway failed"; tail /tmp/gateway.log; exit 1; }
|
|
|
|
echo "[candidates-e2e] stack up; running ingest + reality test query..."
|
|
echo
|
|
./bin/staffing_candidates -query "$QUERY"
|