golangLAKEHOUSE/scripts/replay_smoke.sh
root 89ca72d471 materializer + replay ports + vectord substrate fix verified at scale
Two threads landing together — the doc edits interleave so they ship
in a single commit.

1. **vectord substrate fix verified at original scale** (closes the
   2026-05-01 thread). Re-ran multitier 5min @ conc=50: 132,211
   scenarios at 438/sec, 6/6 classes at 0% failure (was 4/6 pre-fix).
   Throughput dropped 1,115 → 438/sec because previously-broken
   scenarios now do real HNSW Add work — honest cost of correctness.
   The fix (i.vectors side-store + safeGraphAdd recover wrappers +
   smallIndexRebuildThreshold=32 + saveTask coalescing) holds at the
   footprint that originally surfaced the bug.

2. **Materializer port** — internal/materializer + cmd/materializer +
   scripts/materializer_smoke.sh. Ports scripts/distillation/transforms.ts
   (12 transforms) + build_evidence_index.ts (idempotency, day-partition,
   receipt). On-wire JSON shape matches TS so Bun and Go runs are
   interchangeable. 14 tests green.

3. **Replay port** — internal/replay + cmd/replay +
   scripts/replay_smoke.sh. Ports scripts/distillation/replay.ts
   (retrieve → bundle → /v1/chat → validate → log). Closes audit-FULL
   phase 7 live invocation on the Go side. Both runtimes append to the
   same data/_kb/replay_runs.jsonl (schema=replay_run.v1). 14 tests green.

Side effect on internal/distillation/types.go: EvidenceRecord gained
prompt_tokens, completion_tokens, and metadata fields to mirror the TS
shape the materializer transforms produce.

STATE_OF_PLAY refreshed to 2026-05-02; ARCHITECTURE_COMPARISON decisions
tracker moves the materializer + replay items from _open_ to DONE and
adds the substrate-fix scale verification row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 03:31:02 -05:00

78 lines
3.5 KiB
Bash
Executable File

#!/usr/bin/env bash
# replay smoke — Go port of scripts/distillation/replay.ts.
# Validates that the replay tool:
# - Builds a context bundle from a synthetic playbooks corpus
# - Runs --dry-run end-to-end without an LLM
# - Logs a row to data/_kb/replay_runs.jsonl with schema=replay_run.v1
# - Honors --no-retrieval (no bundle, empty rag_ids)
# - Exits non-zero when validation fails
set -euo pipefail
cd "$(dirname "$0")/.."
export PATH="$PATH:/usr/local/go/bin"
echo "[replay-smoke] building bin/replay..."
go build -o bin/replay ./cmd/replay
ROOT="$(mktemp -d)"
trap 'rm -rf "$ROOT"' EXIT INT TERM
mkdir -p "$ROOT/exports/rag"
cat > "$ROOT/exports/rag/playbooks.jsonl" <<'EOF'
{"id":"p1","title":"build verification","content":"verify the build, check tests pass before merge\nensure no regressions in suites","tags":["scrum"],"source_run_id":"r-1","success_score":"accepted","source_category":"scrum_review"}
{"id":"p2","title":"merge cleanup","content":"verify the build, then assert tests passed, then merge","tags":["scrum"],"source_run_id":"r-2","success_score":"accepted","source_category":"scrum_review"}
{"id":"p3","title":"partial fix","content":"verify the build, sometimes assert tests passed","tags":["scrum"],"source_run_id":"r-3","success_score":"partially_accepted","source_category":"scrum_review"}
EOF
echo "[replay-smoke] dry-run (with retrieval)"
./bin/replay -task "verify the build before merging" -dry-run -root "$ROOT" > /tmp/replay_smoke_a.txt 2>&1 || true
grep -q "retrieval: " /tmp/replay_smoke_a.txt || {
echo "missing retrieval line"; cat /tmp/replay_smoke_a.txt; exit 1;
}
grep -q "escalation_path: qwen3.5:latest" /tmp/replay_smoke_a.txt || {
echo "missing escalation_path line"; cat /tmp/replay_smoke_a.txt; exit 1;
}
LOG="$ROOT/data/_kb/replay_runs.jsonl"
[ -s "$LOG" ] || { echo "expected $LOG to be written"; exit 1; }
grep -q "replay_run.v1" "$LOG" || {
echo "schema=replay_run.v1 missing in log";
cat "$LOG";
exit 1;
}
echo "[replay-smoke] dry-run (no retrieval)"
./bin/replay -task "verify build" -dry-run -no-retrieval -root "$ROOT" > /tmp/replay_smoke_b.txt 2>&1 || true
grep -q "retrieval: DISABLED" /tmp/replay_smoke_b.txt || {
echo "expected retrieval: DISABLED";
cat /tmp/replay_smoke_b.txt;
exit 1;
}
LINES_BEFORE=$(wc -l < "$LOG")
echo "[replay-smoke] forced-fail with escalation"
# Force validation failure by putting a hedge phrase as the FIRST
# accepted sample's first verify line. extractValidationSteps walks
# corpus order, and the dry-run synthesizer surfaces the first 3 steps,
# so the hedge phrase needs to be in an early-corpus accepted sample.
cat > "$ROOT/exports/rag/playbooks.jsonl" <<'EOF'
{"id":"p9","title":"hedged step","content":"verify auth as an AI and proceed without checking","tags":["security"],"source_run_id":"r-9","success_score":"accepted","source_category":"audit"}
{"id":"p1","title":"build verification","content":"verify the build, check tests pass before merge","tags":["scrum"],"source_run_id":"r-1","success_score":"accepted","source_category":"scrum_review"}
EOF
./bin/replay -task "verify auth proceed" -dry-run -allow-escalation -root "$ROOT" > /tmp/replay_smoke_c.txt 2>&1 || true
grep -q "escalation_path: qwen3.5:latest → deepseek-v3.1:671b" /tmp/replay_smoke_c.txt || {
echo "expected escalation path to deepseek when validation fails";
cat /tmp/replay_smoke_c.txt;
exit 1;
}
LINES_AFTER=$(wc -l < "$LOG")
[ "$LINES_AFTER" -gt "$LINES_BEFORE" ] || {
echo "expected log file to grow: before=$LINES_BEFORE after=$LINES_AFTER";
exit 1;
}
echo "[replay-smoke] PASS"