root e8cf113af8 gauntlet 2026-05-02: smoke chain + per-component scrum + parity probe
Production-readiness gauntlet exploiting the dual Rust/Go
implementation as a measurement instrument.

## Phase 1 — Full smoke chain
21/21 PASS in ~60s. Substrate intact across the full service surface.

## Phase 2 — Per-component scrum (token-volume fix)
Prior wave (165KB diff): Kimi 62 tokens out, Qwen 297 → no useful
analysis. This wave splits today's commits into 4 focused bundles
(36-71KB each):
  c1 validatord (46KB) → 0 convergent / 11 distinct
  c2 vectord substrate (36KB) → 0 convergent / 10 distinct
  c3 materializer (71KB) → 0 convergent / 6 distinct (Opus emitted
                           a BLOCK then self-retracted in the same response)
  c4 replay (45KB) → 0 convergent / 10 distinct

Reviewer engagement vs prior wave: Kimi went 62 → ~250 tokens out
once bundles dropped below 60KB.

scripts/scrum_review.sh hardening:
  * Diff-size guard (warn >60KB, hard-fail >100KB,
    SCRUM_FORCE_OVERSIZE=1 override)
  * Tightened prompt — file path must appear EXACTLY as in diff
    so post-processor can grep WHERE: lines reliably
  * Auto-tally step dedupes by (reviewer, location); convergence
    counts distinct lineages (closes the prior `opus+opus+opus`
    false-convergence bug)
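The dedupe-and-convergence rule can be sketched in Go (struct and function names are hypothetical; the real logic lives in scripts/scrum_review.sh's auto-tally step):

```go
package main

import "fmt"

// Finding is one WHERE: line extracted from a reviewer's output.
// Field names are illustrative, not the script's actual schema.
type Finding struct {
	Reviewer string // model lineage, e.g. "opus", "kimi", "qwen"
	Location string // exact file:line as it appears in the diff
}

// tally dedupes by (reviewer, location) and counts a location as
// convergent only when distinct lineages flagged it; a single lineage
// repeating itself (the opus+opus+opus case) stays distinct.
func tally(findings []Finding) (convergent, distinct int) {
	seen := map[Finding]bool{}
	lineages := map[string]map[string]bool{} // location → set of reviewers
	for _, f := range findings {
		if seen[f] {
			continue // same reviewer re-flagging the same location
		}
		seen[f] = true
		if lineages[f.Location] == nil {
			lineages[f.Location] = map[string]bool{}
		}
		lineages[f.Location][f.Reviewer] = true
	}
	for _, revs := range lineages {
		if len(revs) >= 2 {
			convergent++
		} else {
			distinct++
		}
	}
	return convergent, distinct
}

func main() {
	c, d := tally([]Finding{
		{"opus", "a.go:10"}, {"opus", "a.go:10"}, // dupe, counts once
		{"kimi", "a.go:10"},                      // second lineage → convergent
		{"qwen", "b.go:5"},
	})
	fmt.Println(c, d) // prints "1 1"
}
```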

## Phase 3 — Cross-runtime validator parity probe (the headline finding)
scripts/cutover/parity/validator_parity.sh sends 6 identical
/v1/validate cases to Rust :3100 AND Go :4110, compares status+body.

Result: **6/6 status codes match · 5/6 body shapes diverge.**

Rust returns serde-tagged enum:   {"Schema":{"field":"x","reason":"y"}}
Go returns flat exported-fields:  {"Kind":"schema","Field":"x","Reason":"y"}

Both round-trip inside their own runtime; a caller swapping one for
the other would break parsing silently. Captured as a new _open_ row
in the docs/ARCHITECTURE_COMPARISON.md decisions tracker.

This is the "use the dual-implementation as a measurement instrument"
return — single-repo scrums can't catch this class of cross-runtime
drift.

## Phase 4 — Production assessment
Verdict: ship-with-known-gap. The validator wire-format gap is
documented, not a regression. The future fix is ~50 LOC on the Go side
(a custom MarshalJSON on ValidationError to match Rust's serde shape).

Persistent stack config (/tmp/lakehouse-persistent.toml) gains
validatord on :3221 + persistent-validatord binary so operators
bringing up the persistent stack get the new daemon automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:05:18 -05:00

commit 89ca72d4718fcb20ba9dcc03110e090890a0736e
Author: root <root@island37.com>
Date: Sat May 2 03:31:02 2026 -0500
materializer + replay ports + vectord substrate fix verified at scale
Three threads land together — the doc edits interleave, so they ship
in a single commit.
1. **vectord substrate fix verified at original scale** (closes the
2026-05-01 thread). Re-ran multitier 5min @ conc=50: 132,211
scenarios at 438/sec, 6/6 classes at 0% failure (was 4/6 pre-fix).
Throughput dropped 1,115 → 438/sec because previously-broken
scenarios now do real HNSW Add work — honest cost of correctness.
The fix (i.vectors side-store + safeGraphAdd recover wrappers +
smallIndexRebuildThreshold=32 + saveTask coalescing) holds at the
footprint that originally surfaced the bug.
2. **Materializer port** — internal/materializer + cmd/materializer +
scripts/materializer_smoke.sh. Ports scripts/distillation/transforms.ts
(12 transforms) + build_evidence_index.ts (idempotency, day-partition,
receipt). On-wire JSON shape matches TS so Bun and Go runs are
interchangeable. 14 tests green.
3. **Replay port** — internal/replay + cmd/replay +
scripts/replay_smoke.sh. Ports scripts/distillation/replay.ts
(retrieve → bundle → /v1/chat → validate → log). Closes audit-FULL
phase 7 live invocation on the Go side. Both runtimes append to the
same data/_kb/replay_runs.jsonl (schema=replay_run.v1). 14 tests green.
Side effect on internal/distillation/types.go: EvidenceRecord gained
prompt_tokens, completion_tokens, and metadata fields to mirror the TS
shape the materializer transforms produce.
STATE_OF_PLAY refreshed to 2026-05-02; ARCHITECTURE_COMPARISON decisions
tracker moves the materializer + replay items from _open_ to DONE and
adds the substrate-fix scale verification row.
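The safeGraphAdd recover wrapper named in item 1 can be sketched as follows; the function and type names here are illustrative, not vectord's actual API:

```go
package main

import "fmt"

// addFn stands in for the underlying HNSW graph Add; vectord's real
// index types are not part of this diff.
type addFn func(id uint64, vec []float32) error

// safeAdd wraps a graph insert so a panic inside the index (the
// failure class the substrate fix targets) is recovered into an error
// the caller can retry or route to a rebuild, instead of killing the
// daemon.
func safeAdd(add addFn, id uint64, vec []float32) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("hnsw add panicked for id=%d: %v", id, r)
		}
	}()
	return add(id, vec)
}

func main() {
	panicky := func(uint64, []float32) error { panic("corrupt neighbor list") }
	fmt.Println(safeAdd(panicky, 42, []float32{0.1}) != nil) // prints true
	ok := func(uint64, []float32) error { return nil }
	fmt.Println(safeAdd(ok, 1, nil) == nil) // prints true
}
```

The named return is what lets the deferred recover overwrite err; a plain `func(...) error` would silently drop the recovered failure.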
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/cmd/replay/main.go b/cmd/replay/main.go
new file mode 100644
index 0000000..f73d3b6
--- /dev/null
+++ b/cmd/replay/main.go
@@ -0,0 +1,87 @@
+// replay — Go-side distillation replay runner. Closes audit-FULL
+// phase 7 live invocation on the Go side. Mirrors
+// scripts/distillation/replay.ts; both runtimes append to the same
+// `data/_kb/replay_runs.jsonl` shape (schema=replay_run.v1).
+//
+// Usage:
+//
+// replay -task "rebuild evidence index"
+// replay -task "..." -allow-escalation
+// replay -task "..." -no-retrieval # baseline mode
+// replay -task "..." -dry-run # synthetic, no LLM
+// replay -task "..." -root /home/profit/lakehouse # custom repo root
+package main
+
+import (
+ "context"
+ "flag"
+ "fmt"
+ "os"
+ "strings"
+
+ "git.agentview.dev/profit/golangLAKEHOUSE/internal/replay"
+)
+
+func main() {
+ task := flag.String("task", "", "input task to replay")
+ localOnly := flag.Bool("local-only", false, "never escalate; record validation result only")
+ allowEscalation := flag.Bool("allow-escalation", false, "fall back to the bigger model when local validation fails")
+ noRetrieval := flag.Bool("no-retrieval", false, "baseline mode: skip retrieval bundle (still logs)")
+ dryRun := flag.Bool("dry-run", false, "synthesize a deterministic response — no LLM call")
+ root := flag.String("root", replay.DefaultRoot(), "lakehouse repo root (defaults to $LH_DISTILL_ROOT or cwd)")
+ gateway := flag.String("gateway", "", "override gateway URL (default: $LH_GATEWAY_URL or http://localhost:3110)")
+ localModel := flag.String("local-model", "", "override local model name")
+ escalationModel := flag.String("escalation-model", "", "override escalation model name")
+ flag.Parse()
+
+ if *task == "" {
+ fmt.Fprintln(os.Stderr, `usage: replay -task "<input>" [-local-only] [-allow-escalation] [-no-retrieval] [-dry-run]`)
+ os.Exit(2)
+ }
+
+ res, err := replay.Replay(context.Background(), replay.ReplayRequest{
+ Task: *task,
+ LocalOnly: *localOnly,
+ AllowEscalation: *allowEscalation,
+ NoRetrieval: *noRetrieval,
+ DryRun: *dryRun,
+ GatewayURL: *gateway,
+ LocalModel: *localModel,
+ EscalationModel: *escalationModel,
+ }, *root)
+ if err != nil {
+ fmt.Fprintf(os.Stderr, "replay: %v\n", err)
+ os.Exit(1)
+ }
+
+ fmt.Printf("[replay] run_id=%s\n", res.RecordedRunID)
+ if res.ContextBundle == nil {
+ fmt.Println("[replay] retrieval: DISABLED")
+ } else {
+ fmt.Printf("[replay] retrieval: %d playbooks\n", len(res.ContextBundle.RetrievedPlaybooks))
+ }
+ fmt.Printf("[replay] escalation_path: %s\n", strings.Join(res.EscalationPath, " → "))
+ fmt.Printf("[replay] model_used: %s · %dms\n", res.ModelUsed, res.DurationMs)
+ verdict := "PASS"
+ if !res.ValidationResult.Passed {
+ verdict = "FAIL"
+ }
+ suffix := ""
+ if len(res.ValidationResult.Reasons) > 0 {
+ suffix = " (" + strings.Join(res.ValidationResult.Reasons, "; ") + ")"
+ }
+ fmt.Printf("[replay] validation: %s%s\n", verdict, suffix)
+ fmt.Println()
+ fmt.Println("─── response ───")
+ body := res.ModelResponse
+ if len(body) > 1500 {
+ fmt.Println(body[:1500])
+ fmt.Printf("... [%d more chars]\n", len(body)-1500)
+ } else {
+ fmt.Println(body)
+ }
+
+ if !res.ValidationResult.Passed {
+ os.Exit(1)
+ }
+}
diff --git a/internal/replay/model.go b/internal/replay/model.go
new file mode 100644
index 0000000..cbad676
--- /dev/null
+++ b/internal/replay/model.go
@@ -0,0 +1,131 @@
+package replay
+
+import (
+ "bytes"
+ "context"
+ "encoding/json"
+ "fmt"
+ "io"
+ "net/http"
+ "strings"
+ "time"
+)
+
+// callModelResult is what the gateway round-trip returns.
+type callModelResult struct {
+ Content string
+ OK bool
+ Error string
+}
+
+// ModelCaller is the seam tests use to swap out HTTP. Production
+// supplies httpModelCaller; tests can supply scripted responses.
+type ModelCaller func(ctx context.Context, model, system, user string) callModelResult
+
+// httpModelCaller posts to ${gatewayURL}/v1/chat with provider derived
+// from model name. Mirrors replay.ts:callModel.
+func httpModelCaller(gatewayURL string) ModelCaller {
+ client := &http.Client{Timeout: 180 * time.Second}
+ return func(ctx context.Context, model, system, user string) callModelResult {
+ provider := inferProvider(model)
+ body, err := json.Marshal(map[string]any{
+ "provider": provider,
+ "model": model,
+ "messages": []map[string]string{
+ {"role": "system", "content": system},
+ {"role": "user", "content": user},
+ },
+ "max_tokens": 1500,
+ "temperature": 0.1,
+ })
+ if err != nil {
+ return callModelResult{Error: "marshal request: " + err.Error()}
+ }
+ req, err := http.NewRequestWithContext(ctx, "POST", gatewayURL+"/v1/chat", bytes.NewReader(body))
+ if err != nil {
+ return callModelResult{Error: "build request: " + err.Error()}
+ }
+ req.Header.Set("Content-Type", "application/json")
+ resp, err := client.Do(req)
+ if err != nil {
+ return callModelResult{Error: trim(err.Error(), 240)}
+ }
+ defer resp.Body.Close()
+ buf, _ := io.ReadAll(resp.Body)
+ if resp.StatusCode >= 400 {
+ return callModelResult{Error: fmt.Sprintf("HTTP %d: %s", resp.StatusCode, trim(string(buf), 240))}
+ }
+ var parsed struct {
+ Choices []struct {
+ Message struct {
+ Content string `json:"content"`
+ } `json:"message"`
+ } `json:"choices"`
+ }
+ if err := json.Unmarshal(buf, &parsed); err != nil {
+ return callModelResult{Error: "parse response: " + err.Error()}
+ }
+ content := ""
+ if len(parsed.Choices) > 0 {
+ content = parsed.Choices[0].Message.Content
+ }
+ return callModelResult{Content: content, OK: true}
+ }
+}
+
+// inferProvider picks the right /v1/chat provider for a given model
+// name. Mirrors replay.ts:callModel's branching exactly so the gateway
+// sees the same request shape regardless of caller runtime.
+//
+// "/" in name → openrouter
+// kimi-/qwen3-coder/... → ollama_cloud
+// else → ollama (local)
+func inferProvider(model string) string {
+ if strings.Contains(model, "/") {
+ return "openrouter"
+ }
+ switch {
+ case strings.HasPrefix(model, "kimi-"),
+ strings.HasPrefix(model, "qwen3-coder"),
+ strings.HasPrefix(model, "deepseek-v"),
+ strings.HasPrefix(model, "mistral-large"),
+ model == "gpt-oss:120b",
+ model == "qwen3.5:397b":
+ return "ollama_cloud"
+ }
+ return "ollama"
+}
+
+// dryRunSynthesize produces a deterministic synthetic response that
+// echoes context-bundle signals. Used by tests + dry-run mode to
+// exercise retrieval + validation without a live LLM.
+func dryRunSynthesize(task string, bundle *ContextBundle) string {
+ parts := []string{
+ "Synthetic dry-run response for task: " + trim(task, 120),
+ "",
+ }
+ if bundle != nil {
+ parts = append(parts, fmt.Sprintf(
+ "Retrieved %d playbooks; %d accepted, %d partial.",
+ len(bundle.RetrievedPlaybooks),
+ len(bundle.PriorSuccessfulOutputs),
+ len(bundle.FailurePatterns),
+ ))
+ if len(bundle.ValidationSteps) > 0 {
+ parts = append(parts, "Following validation checklist:")
+ for i, s := range bundle.ValidationSteps {
+ if i >= 3 {
+ break
+ }
+ parts = append(parts, "- "+s)
+ }
+ }
+ if len(bundle.PriorSuccessfulOutputs) > 0 {
+ parts = append(parts, "")
+ parts = append(parts, "Anchored on prior accepted: "+bundle.PriorSuccessfulOutputs[0].Title)
+ }
+ } else {
+ parts = append(parts, "No retrieval context — answering from task alone. Verify and check produced output before approving.")
+ }
+ return strings.Join(parts, "\n")
+}
diff --git a/internal/replay/prompt.go b/internal/replay/prompt.go
new file mode 100644
index 0000000..f86eee4
--- /dev/null
+++ b/internal/replay/prompt.go
@@ -0,0 +1,64 @@
+package replay
+
+import "strings"
+
+// PromptParts captures the two roles the prompt assembly produces.
+type PromptParts struct {
+ System string
+ User string
+}
+
+const systemPrompt = "You are a Lakehouse task executor. Stay grounded — only assert what you can derive from the prior successful patterns or the task itself. " +
+ "Do NOT hedge. Do NOT say 'as an AI'. Produce a concrete actionable answer. " +
+ "When prior successful outputs are provided, follow their style and format."
+
+// BuildPrompt assembles the system + user messages for a model call.
+// When bundle is nil (NoRetrieval mode), the user message is just the
+// task — same wording as replay.ts so completions stay comparable.
+func BuildPrompt(task string, bundle *ContextBundle) PromptParts {
+ if bundle == nil {
+ return PromptParts{
+ System: systemPrompt,
+ User: "Task: " + task + "\n\nProduce the answer.",
+ }
+ }
+
+ var b strings.Builder
+ if len(bundle.PriorSuccessfulOutputs) > 0 {
+ b.WriteString("## Prior successful runs on similar tasks\n\n")
+ for _, r := range bundle.PriorSuccessfulOutputs {
+ b.WriteString("### ")
+ b.WriteString(r.Title)
+ b.WriteString(" (score: ")
+ b.WriteString(r.SuccessScore)
+ b.WriteString(")\n")
+ b.WriteString(r.ContentPreview)
+ b.WriteString("\n\n")
+ }
+ }
+ if len(bundle.FailurePatterns) > 0 {
+ b.WriteString("## Patterns that produced PARTIAL results — avoid these failure modes\n\n")
+ for _, r := range bundle.FailurePatterns {
+ b.WriteString("- ")
+ b.WriteString(r.Title)
+ b.WriteString(": ")
+ b.WriteString(trim(r.ContentPreview, 160))
+ b.WriteByte('\n')
+ }
+ b.WriteByte('\n')
+ }
+ if len(bundle.ValidationSteps) > 0 {
+ b.WriteString("## Validation checklist (from accepted runs)\n")
+ for _, s := range bundle.ValidationSteps {
+ b.WriteString("- ")
+ b.WriteString(s)
+ b.WriteByte('\n')
+ }
+ b.WriteByte('\n')
+ }
+ b.WriteString("## Task\n")
+ b.WriteString(task)
+ b.WriteString("\n\nProduce the answer following the style of the prior successful runs above.")
+
+ return PromptParts{System: systemPrompt, User: b.String()}
+}
diff --git a/internal/replay/replay.go b/internal/replay/replay.go
new file mode 100644
index 0000000..3ac74e6
--- /dev/null
+++ b/internal/replay/replay.go
@@ -0,0 +1,193 @@
+package replay
+
+import (
+ "context"
+ "crypto/sha256"
+ "encoding/hex"
+ "encoding/json"
+ "fmt"
+ "os"
+ "path/filepath"
+ "time"
+)
+
+// DefaultRoot is what the CLI uses when --root isn't passed.
+func DefaultRoot() string {
+ if r := os.Getenv("LH_DISTILL_ROOT"); r != "" {
+ return r
+ }
+ if cwd, err := os.Getwd(); err == nil {
+ return cwd
+ }
+ return "/home/profit/lakehouse"
+}
+
+// Replay runs the retrieve→prompt→model→validate→log pipeline.
+// Returns a ReplayResult that's already been appended to
+// data/_kb/replay_runs.jsonl unless DryRun + the file is read-only.
+//
+// Errors here are *infrastructure* failures (corpus unreadable, log
+// write failed). A failed model call OR a failed validation gate is
+// captured in ReplayResult.ValidationResult, not returned as error —
+// callers can branch on Passed / EscalationPath.
+func Replay(ctx context.Context, opts ReplayRequest, root string) (ReplayResult, error) {
+ t0 := time.Now()
+ recordedAt := time.Now().UTC().Format(time.RFC3339Nano)
+
+ taskHash := sha256Hex(opts.Task)
+
+ corpus, err := LoadRagCorpus(root)
+ if err != nil {
+ return ReplayResult{}, fmt.Errorf("load rag corpus: %w", err)
+ }
+
+ var bundle *ContextBundle
+ if !opts.NoRetrieval {
+ bundle = BuildContextBundle(corpus, opts.Task)
+ }
+ prompt := BuildPrompt(opts.Task, bundle)
+
+ localModel := orDefault(opts.LocalModel, DefaultLocalModel)
+ escalationModel := orDefault(opts.EscalationModel, DefaultEscalationModel)
+ gatewayURL := orDefault(opts.GatewayURL, gatewayFromEnv())
+
+ caller := httpModelCaller(gatewayURL)
+ if opts.DryRun {
+ caller = dryRunCaller(opts.Task, bundle)
+ }
+
+ escalation := []string{localModel}
+ modelUsed := localModel
+ var modelResponse string
+ var validation ValidationResult
+
+ localCall := caller(ctx, localModel, prompt.System, prompt.User)
+ if localCall.OK {
+ modelResponse = localCall.Content
+ validation = ValidateResponse(modelResponse, bundle)
+ } else {
+ validation = ValidationResult{
+ Passed: false,
+ Reasons: []string{"local call failed: " + localCall.Error},
+ }
+ }
+
+ if !validation.Passed && opts.AllowEscalation && !opts.LocalOnly {
+ escalation = append(escalation, escalationModel)
+ escalCall := caller(ctx, escalationModel, prompt.System, prompt.User)
+ if escalCall.OK {
+ modelResponse = escalCall.Content
+ modelUsed = escalationModel
+ validation = ValidateResponse(modelResponse, bundle)
+ if validation.Passed {
+ validation.Reasons = append([]string{"recovered via escalation to " + escalationModel}, validation.Reasons...)
+ }
+ } else {
+ validation.Reasons = append(validation.Reasons, "escalation also failed: "+escalCall.Error)
+ }
+ }
+
+ recordedRunID := fmt.Sprintf("replay:%s:%s",
+ taskHash[:16],
+ sha256Hex(recordedAt)[:12],
+ )
+ result := ReplayResult{
+ InputTask: opts.Task,
+ TaskHash: taskHash,
+ RetrievedArtifacts: RetrievedIDs{RagIDs: ragIDs(bundle)},
+ ContextBundle: bundle,
+ ModelResponse: modelResponse,
+ ModelUsed: modelUsed,
+ EscalationPath: escalation,
+ ValidationResult: validation,
+ RecordedRunID: recordedRunID,
+ RecordedAt: recordedAt,
+ DurationMs: time.Since(t0).Milliseconds(),
+ }
+
+ if err := logReplayEvidence(root, result); err != nil {
+ // Logging failure is real — surface it. The caller still gets the
+ // in-memory result so they can inspect what happened.
+ return result, fmt.Errorf("log replay evidence: %w", err)
+ }
+ return result, nil
+}
+
+// dryRunCaller wraps dryRunSynthesize as a ModelCaller. The escalation
+// branch in Replay calls the caller a second time; for parity with TS,
+// we return the same content suffixed with [ESCALATED] so a smoke can
+// detect escalation in dry-run mode.
+func dryRunCaller(task string, bundle *ContextBundle) ModelCaller {
+ calls := 0
+ return func(_ context.Context, _ string, _ string, _ string) callModelResult {
+ calls++
+ content := dryRunSynthesize(task, bundle)
+ if calls >= 2 {
+ content += "\n\n[ESCALATED]"
+ }
+ return callModelResult{Content: content, OK: true}
+ }
+}
+
+// logReplayEvidence appends one row to data/_kb/replay_runs.jsonl.
+// model_response is truncated to 4000 chars in the persisted log to
+// keep the file lean (matches TS behavior).
+func logReplayEvidence(root string, result ReplayResult) error {
+ path := filepath.Join(root, "data", "_kb", "replay_runs.jsonl")
+ if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
+ return err
+ }
+
+ persist := struct {
+ Schema string `json:"schema"`
+ ReplayResult
+ }{
+ Schema: "replay_run.v1",
+ ReplayResult: result,
+ }
+ persist.ReplayResult.ModelResponse = trim(persist.ReplayResult.ModelResponse, 4000)
+
+ buf, err := json.Marshal(persist)
+ if err != nil {
+ return err
+ }
+ buf = append(buf, '\n')
+
+ f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
+ if err != nil {
+ return err
+ }
+ defer f.Close()
+ _, err = f.Write(buf)
+ return err
+}
+
+func ragIDs(bundle *ContextBundle) []string {
+ if bundle == nil {
+ return []string{}
+ }
+ out := make([]string, 0, len(bundle.RetrievedPlaybooks))
+ for _, p := range bundle.RetrievedPlaybooks {
+ out = append(out, p.RagID)
+ }
+ return out
+}
+
+func sha256Hex(s string) string {
+ h := sha256.Sum256([]byte(s))
+ return hex.EncodeToString(h[:])
+}
+
+func gatewayFromEnv() string {
+ if u := os.Getenv("LH_GATEWAY_URL"); u != "" {
+ return u
+ }
+ return DefaultGatewayURL
+}
+
+func orDefault(v, fallback string) string {
+ if v == "" {
+ return fallback
+ }
+ return v
+}
diff --git a/internal/replay/replay_test.go b/internal/replay/replay_test.go
new file mode 100644
index 0000000..4e1eedd
--- /dev/null
+++ b/internal/replay/replay_test.go
@@ -0,0 +1,283 @@
+package replay
+
+import (
+ "context"
+ "encoding/json"
+ "os"
+ "path/filepath"
+ "strings"
+ "testing"
+)
+
+// ─── Tokenization + retrieval primitives ───────────────────────────
+
+func TestTokenize_FiltersShortAndLowercase(t *testing.T) {
+ got := tokenize("Hello, World! Foo BAR baz x12 a")
+ want := map[string]bool{"hello": true, "world": true, "foo": true, "bar": true, "baz": true, "x12": true}
+ for k := range want {
+ if _, ok := got[k]; !ok {
+ t.Errorf("missing token %q", k)
+ }
+ }
+ if _, ok := got["a"]; ok {
+ t.Errorf("len=1 token should be filtered: a")
+ }
+}
+
+func TestJaccard_EdgeCases(t *testing.T) {
+ a := map[string]struct{}{"x": {}, "y": {}, "z": {}}
+ b := map[string]struct{}{"y": {}, "z": {}, "w": {}}
+ got := jaccard(a, b)
+ want := 2.0 / 4.0 // |A∩B|=2 (y,z); |A∪B|=4 (x,y,z,w)
+ if got != want {
+ t.Errorf("jaccard = %v, want %v", got, want)
+ }
+ if jaccard(map[string]struct{}{}, b) != 0 {
+ t.Error("empty set should produce 0")
+ }
+}
+
+// ─── Retrieval ───────────────────────────────────────────────────
+
+func TestRetrieveRag_ScoresAndCaps(t *testing.T) {
+ corpus := []RagSample{
+ {ID: "p1", Title: "validate scrum", Content: "verify the build, check tests", Tags: []string{"scrum"}, SuccessScore: "accepted"},
+ {ID: "p2", Title: "irrelevant cooking notes", Content: "boil pasta longer than ten minutes", Tags: []string{"food"}, SuccessScore: "accepted"},
+ {ID: "p3", Title: "build verification ladder", Content: "verify build steps, assert green", Tags: []string{"build"}, SuccessScore: "partially_accepted"},
+ }
+ got := retrieveRag(corpus, "verify the build assert green", 3)
+ if len(got) == 0 {
+ t.Fatal("expected at least one result")
+ }
+ for _, a := range got {
+ if a.RagID == "p2" {
+ t.Errorf("irrelevant sample p2 should not surface, got: %+v", got)
+ }
+ }
+}
+
+func TestBuildContextBundle_SplitsAcceptedAndPartial(t *testing.T) {
+ corpus := []RagSample{
+ {ID: "a1", Title: "A1", Content: "verify build assert green check tests", SuccessScore: "accepted"},
+ {ID: "p1", Title: "P1", Content: "verify build sometimes fails to assert", SuccessScore: "partially_accepted"},
+ }
+ b := BuildContextBundle(corpus, "verify build assert tests")
+ if b == nil {
+ t.Fatal("nil bundle")
+ }
+ if len(b.PriorSuccessfulOutputs) != 1 || b.PriorSuccessfulOutputs[0].RagID != "a1" {
+ t.Errorf("accepted bucket wrong: %+v", b.PriorSuccessfulOutputs)
+ }
+ if len(b.FailurePatterns) != 1 || b.FailurePatterns[0].RagID != "p1" {
+ t.Errorf("partially_accepted bucket wrong: %+v", b.FailurePatterns)
+ }
+ if len(b.ValidationSteps) == 0 {
+ t.Errorf("expected validation_steps from accepted sample, got none")
+ }
+}
+
+// ─── Prompt assembly ─────────────────────────────────────────────
+
+func TestBuildPrompt_NoBundleIsCompact(t *testing.T) {
+ p := BuildPrompt("rebuild evidence index", nil)
+ if !strings.Contains(p.User, "Task: rebuild evidence index") {
+ t.Errorf("user prompt missing task: %q", p.User)
+ }
+ if strings.Contains(p.User, "## Prior successful runs") {
+ t.Error("no-bundle prompt should not include retrieval headers")
+ }
+}
+
+func TestBuildPrompt_WithBundleIncludesAllSections(t *testing.T) {
+ bundle := &ContextBundle{
+ PriorSuccessfulOutputs: []RetrievedArtifact{{RagID: "a1", Title: "A1", ContentPreview: "verified", SuccessScore: "accepted"}},
+ FailurePatterns: []RetrievedArtifact{{RagID: "p1", Title: "P1", ContentPreview: "partial result", SuccessScore: "partially_accepted"}},
+ ValidationSteps: []string{"verify the build"},
+ }
+ p := BuildPrompt("task X", bundle)
+ for _, marker := range []string{
+ "## Prior successful runs",
+ "## Patterns that produced PARTIAL results",
+ "## Validation checklist",
+ "## Task",
+ "task X",
+ } {
+ if !strings.Contains(p.User, marker) {
+ t.Errorf("user prompt missing marker %q in:\n%s", marker, p.User)
+ }
+ }
+}
+
+// ─── Validation gate ─────────────────────────────────────────────
+
+func TestValidateResponse_FailsOnEmptyAndShort(t *testing.T) {
+ if got := ValidateResponse("", nil); got.Passed {
+ t.Error("empty should fail")
+ }
+ if got := ValidateResponse("too short", nil); got.Passed {
+ t.Error("too-short should fail")
+ }
+}
+
+func TestValidateResponse_FailsOnFiller(t *testing.T) {
+ resp := strings.Repeat("This is a real long response that meets the eighty character minimum for the gate. ", 2) +
+ " As an AI, I cannot help."
+ got := ValidateResponse(resp, nil)
+ if got.Passed {
+ t.Errorf("response with hedge phrase should fail, reasons=%v", got.Reasons)
+ }
+}
+
+func TestValidateResponse_PassesWhenChecklistOverlaps(t *testing.T) {
+ bundle := &ContextBundle{ValidationSteps: []string{"verify the build is green"}}
+ resp := "I followed the procedure and verified that the build is green and tests passed before merging the change."
+ got := ValidateResponse(resp, bundle)
+ if !got.Passed {
+ t.Errorf("expected pass, got reasons=%v", got.Reasons)
+ }
+}
+
+func TestValidateResponse_FailsWhenChecklistOrthogonal(t *testing.T) {
+ bundle := &ContextBundle{ValidationSteps: []string{"verify mango ripeness"}}
+ resp := "I followed completely unrelated steps about Quantum Tax compliance — I did not look at any fruit at all and that's the point."
+ got := ValidateResponse(resp, bundle)
+ if got.Passed {
+ t.Errorf("expected fail because no checklist token overlap, got pass")
+ }
+}
+
+// ─── End-to-end (dry-run, no LLM) ────────────────────────────────
+
+func TestReplay_DryRun_LogsResult(t *testing.T) {
+ root := t.TempDir()
+ mustWriteRagFixture(t, root, []RagSample{
+ {ID: "p1", Title: "build verification", Content: "verify the build, check tests pass before merge",
+ Tags: []string{"scrum"}, SuccessScore: "accepted", SourceRunID: "r-1"},
+ })
+
+ res, err := Replay(context.Background(), ReplayRequest{
+ Task: "verify the build before merging",
+ DryRun: true,
+ }, root)
+ if err != nil {
+ t.Fatalf("Replay: %v", err)
+ }
+ if res.RecordedRunID == "" {
+ t.Error("expected recorded_run_id")
+ }
+ if !strings.HasPrefix(res.RecordedRunID, "replay:") {
+ t.Errorf("run_id shape: %s", res.RecordedRunID)
+ }
+ if res.ContextBundle == nil {
+ t.Fatal("expected retrieval to fire by default")
+ }
+ if len(res.ContextBundle.RetrievedPlaybooks) == 0 {
+ t.Errorf("expected at least one retrieved playbook")
+ }
+
+ logPath := filepath.Join(root, "data/_kb/replay_runs.jsonl")
+ body, err := os.ReadFile(logPath)
+ if err != nil {
+ t.Fatalf("read log: %v", err)
+ }
+ var row map[string]any
+ if err := json.Unmarshal([]byte(strings.TrimSpace(string(body))), &row); err != nil {
+ t.Fatalf("parse log row: %v", err)
+ }
+ if row["schema"] != "replay_run.v1" {
+ t.Errorf("schema field: %v", row["schema"])
+ }
+}
+
+func TestReplay_NoRetrievalSkipsCorpus(t *testing.T) {
+ root := t.TempDir()
+ mustWriteRagFixture(t, root, []RagSample{
+ {ID: "p1", Title: "would match", Content: "verify build assert", SuccessScore: "accepted"},
+ })
+
+ res, err := Replay(context.Background(), ReplayRequest{
+ Task: "verify build assert",
+ DryRun: true,
+ NoRetrieval: true,
+ }, root)
+ if err != nil {
+ t.Fatalf("Replay: %v", err)
+ }
+ if res.ContextBundle != nil {
+ t.Errorf("expected nil bundle in NoRetrieval mode")
+ }
+ if len(res.RetrievedArtifacts.RagIDs) != 0 {
+ t.Errorf("expected empty rag_ids, got %v", res.RetrievedArtifacts.RagIDs)
+ }
+}
+
+func TestReplay_EscalationFiresOnFailedValidation(t *testing.T) {
+ root := t.TempDir()
+ // Trick: the dry-run synthesizer copies validation_steps verbatim
+ // into its output. If a checklist step contains a hedge phrase, the
+ // synthesized response will contain it too — triggering the
+ // filler-pattern guard in ValidateResponse and forcing escalation.
+ mustWriteRagFixture(t, root, []RagSample{
+ {ID: "p1", Title: "demo step", Content: "verify the build then i cannot proceed without approval", SuccessScore: "accepted"},
+ })
+
+ res, err := Replay(context.Background(), ReplayRequest{
+ Task: "verify the build then proceed",
+ DryRun: true,
+ AllowEscalation: true,
+ }, root)
+ if err != nil {
+ t.Fatalf("Replay: %v", err)
+ }
+ if len(res.EscalationPath) < 2 {
+ t.Errorf("expected escalation, path=%v reasons=%v", res.EscalationPath, res.ValidationResult.Reasons)
+ }
+ if !strings.Contains(res.ModelResponse, "[ESCALATED]") {
+ t.Errorf("expected escalated marker in response, got: %q", res.ModelResponse)
+ }
+}
+
+func TestReplay_NoEscalationWhenValidationPasses(t *testing.T) {
+ root := t.TempDir()
+ mustWriteRagFixture(t, root, []RagSample{
+ {ID: "p1", Title: "build verification", Content: "verify the build, check tests pass before merge",
+ Tags: []string{"scrum"}, SuccessScore: "accepted", SourceRunID: "r-1"},
+ })
+
+ res, err := Replay(context.Background(), ReplayRequest{
+ Task: "verify the build before merging",
+ DryRun: true,
+ AllowEscalation: true,
+ }, root)
+ if err != nil {
+ t.Fatalf("Replay: %v", err)
+ }
+ if len(res.EscalationPath) != 1 {
+ t.Errorf("expected single-step path on validation pass, got %v", res.EscalationPath)
+ }
+ if !res.ValidationResult.Passed {
+ t.Errorf("expected pass, got reasons=%v", res.ValidationResult.Reasons)
+ }
+}
+
+// ─── Helpers ────────────────────────────────────────────────────
+
+func mustWriteRagFixture(t *testing.T, root string, samples []RagSample) {
+ t.Helper()
+ path := filepath.Join(root, "exports/rag/playbooks.jsonl")
+ if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
+ t.Fatalf("mkdir: %v", err)
+ }
+ var buf strings.Builder
+ for _, s := range samples {
+ b, err := json.Marshal(s)
+ if err != nil {
+ t.Fatalf("marshal sample: %v", err)
+ }
+ buf.Write(b)
+ buf.WriteByte('\n')
+ }
+ if err := os.WriteFile(path, []byte(buf.String()), 0o644); err != nil {
+ t.Fatalf("write fixture: %v", err)
+ }
+}
diff --git a/internal/replay/retrieval.go b/internal/replay/retrieval.go
new file mode 100644
index 0000000..62e7575
--- /dev/null
+++ b/internal/replay/retrieval.go
@@ -0,0 +1,215 @@
+package replay
+
+import (
+ "bufio"
+ "encoding/json"
+ "os"
+ "path/filepath"
+ "regexp"
+ "sort"
+ "strings"
+)
+
+// tokenize lowercases and splits on non-[a-z0-9_] runs, keeping tokens
+// of length ≥3. Matches replay.ts so retrieval scoring is consistent
+// across runtimes.
+func tokenize(text string) map[string]struct{} {
+ out := map[string]struct{}{}
+ if text == "" {
+ return out
+ }
+ lower := strings.ToLower(text)
+ var b strings.Builder
+ flush := func() {
+ if b.Len() >= 3 {
+ out[b.String()] = struct{}{}
+ }
+ b.Reset()
+ }
+ for _, r := range lower {
+ if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '_' {
+ b.WriteRune(r)
+ } else {
+ flush()
+ }
+ }
+ flush()
+ return out
+}
+
+// jaccard returns |A ∩ B| / |A ∪ B| over token sets.
+func jaccard(a, b map[string]struct{}) float64 {
+ if len(a) == 0 || len(b) == 0 {
+ return 0
+ }
+ inter := 0
+ for t := range a {
+ if _, ok := b[t]; ok {
+ inter++
+ }
+ }
+ union := len(a) + len(b) - inter
+ if union == 0 {
+ return 0
+ }
+ return float64(inter) / float64(union)
+}
+
+// LoadRagCorpus reads `exports/rag/playbooks.jsonl` under root.
+// Returns empty slice when the file is missing — callers fall back to
+// a context-less prompt rather than failing.
+func LoadRagCorpus(root string) ([]RagSample, error) {
+ path := filepath.Join(root, "exports", "rag", "playbooks.jsonl")
+ f, err := os.Open(path)
+ if err != nil {
+ if os.IsNotExist(err) {
+ return nil, nil
+ }
+ return nil, err
+ }
+ defer f.Close()
+ var corpus []RagSample
+ sc := bufio.NewScanner(f)
+ sc.Buffer(make([]byte, 0, 1<<16), 1<<24)
+ for sc.Scan() {
+ line := sc.Bytes()
+ if len(line) == 0 {
+ continue
+ }
+ var rec RagSample
+ if err := json.Unmarshal(line, &rec); err != nil {
+ continue // malformed line — skip, matches TS behavior
+ }
+ corpus = append(corpus, rec)
+ }
+ return corpus, sc.Err()
+}
+
+// retrieveRag returns up to topK playbooks with non-zero overlap.
+// Sorted by score descending. Matches replay.ts.
+func retrieveRag(corpus []RagSample, task string, topK int) []RetrievedArtifact {
+ taskTokens := tokenize(task)
+ type scored struct {
+ rec RagSample
+ score float64
+ }
+ all := make([]scored, 0, len(corpus))
+ for _, r := range corpus {
+ text := r.Title + " " + r.Content + " " + strings.Join(r.Tags, " ")
+ all = append(all, scored{rec: r, score: jaccard(taskTokens, tokenize(text))})
+ }
+ sort.SliceStable(all, func(i, j int) bool { return all[i].score > all[j].score })
+
+ out := make([]RetrievedArtifact, 0, topK)
+ for _, s := range all {
+ if len(out) >= topK {
+ break
+ }
+ if s.score <= 0 {
+ break
+ }
+ out = append(out, RetrievedArtifact{
+ RagID: s.rec.ID,
+ SourceRunID: s.rec.SourceRunID,
+ Title: s.rec.Title,
+ ContentPreview: trim(s.rec.Content, 240),
+ SuccessScore: s.rec.SuccessScore,
+ Tags: tagsOrEmpty(s.rec.Tags),
+ Score: s.score,
+ })
+ }
+ return out
+}
+
+var validationLineRE = regexp.MustCompile(`(?i)^[-*]\s*(verify|check|assert|confirm|ensure)\b|^\s*(verify|check|assert|confirm|ensure)\s`)
+
+// extractValidationSteps pulls verify/check/assert/confirm/ensure
+// lines from accepted samples. Used as a soft-anchor in the
+// validation gate (response should touch at least one of these
+// tokens) and surfaced into the prompt.
+func extractValidationSteps(samples []RetrievedArtifact, corpus []RagSample) []string {
+ ids := map[string]struct{}{}
+ for _, s := range samples {
+ ids[s.RagID] = struct{}{}
+ }
+ var steps []string
+ for _, r := range corpus {
+ if _, ok := ids[r.ID]; !ok {
+ continue
+ }
+ for _, line := range strings.Split(r.Content, "\n") {
+ t := strings.TrimSpace(line)
+ if validationLineRE.MatchString(t) {
+ steps = append(steps, trim(t, 200))
+ if len(steps) >= 6 {
+ return steps
+ }
+ }
+ }
+ }
+ return steps
+}
+
+// BuildContextBundle assembles a ContextBundle from a corpus + task.
+// Top 8 retrieved → split by success_score → at most 3 accepted, 2
+// warnings → extract validation steps → estimate token cost.
+func BuildContextBundle(corpus []RagSample, task string) *ContextBundle {
+ top := retrieveRag(corpus, task, 8)
+ accepted := filterByScore(top, "accepted", 3)
+ warnings := filterByScore(top, "partially_accepted", 2)
+ steps := extractValidationSteps(accepted, corpus)
+
+ totalChars := 0
+ for _, r := range accepted {
+ totalChars += len(r.ContentPreview) + len(r.Title)
+ }
+ for _, r := range warnings {
+ totalChars += len(r.ContentPreview) + len(r.Title)
+ }
+ for _, s := range steps {
+ totalChars += len(s)
+ }
+ tokenEstimate := (totalChars + 3) / 4 // ceil(chars/4)
+
+ return &ContextBundle{
+ RetrievedPlaybooks: top,
+ PriorSuccessfulOutputs: accepted,
+ FailurePatterns: warnings,
+ ValidationSteps: stepsOrEmpty(steps),
+ BundleTokenEstimate: tokenEstimate,
+ }
+}
+
+func filterByScore(arts []RetrievedArtifact, score string, max int) []RetrievedArtifact {
+ out := make([]RetrievedArtifact, 0, max)
+ for _, a := range arts {
+ if a.SuccessScore == score {
+ out = append(out, a)
+ if len(out) >= max {
+ break
+ }
+ }
+ }
+ return out
+}
+
+func tagsOrEmpty(t []string) []string {
+ if t == nil {
+ return []string{}
+ }
+ return t
+}
+
+func stepsOrEmpty(s []string) []string {
+ if s == nil {
+ return []string{}
+ }
+ return s
+}
+
+func trim(s string, n int) string {
+ if len(s) <= n {
+ return s
+ }
+ return s[:n]
+}
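
For a quick sanity reference on the retrieval scoring, here is a self-contained sketch that duplicates the tokenize/jaccard pair (the copies below are illustrative, not the package API; the sample strings are made up):

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize mirrors the package logic: lowercase, split on
// non-[a-z0-9_] runs, keep tokens of length >= 3.
func tokenize(text string) map[string]struct{} {
	out := map[string]struct{}{}
	var b strings.Builder
	flush := func() {
		if b.Len() >= 3 {
			out[b.String()] = struct{}{}
		}
		b.Reset()
	}
	for _, r := range strings.ToLower(text) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '_' {
			b.WriteRune(r)
		} else {
			flush()
		}
	}
	flush()
	return out
}

// jaccard is set overlap over union, as in the package.
func jaccard(a, b map[string]struct{}) float64 {
	if len(a) == 0 || len(b) == 0 {
		return 0
	}
	inter := 0
	for t := range a {
		if _, ok := b[t]; ok {
			inter++
		}
	}
	return float64(inter) / float64(len(a)+len(b)-inter)
}

func main() {
	task := tokenize("verify the build before merging")
	doc := tokenize("Build verification: verify the build, check tests")
	// 3 shared tokens (verify, the, build) over an 8-token union.
	fmt.Printf("%.3f\n", jaccard(task, doc)) // 0.375
}
```

Note the score is pure set overlap: no TF-IDF, no stemming, so "merging" and "merge" do not match, which is the same trade-off replay.ts makes.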
diff --git a/internal/replay/types.go b/internal/replay/types.go
new file mode 100644
index 0000000..9048323
--- /dev/null
+++ b/internal/replay/types.go
@@ -0,0 +1,98 @@
+// Package replay ports scripts/distillation/replay.ts to Go.
+//
+// Replay takes a task → retrieves matching playbooks/RAG records →
+// builds a context bundle → calls a LOCAL model via the gateway's
+// /v1/chat → validates → escalates to a stronger model if needed →
+// logs the run as new evidence in `data/_kb/replay_runs.jsonl`.
+//
+// Spec invariants (carry over from replay.ts):
+// - never bypass retrieval (unless caller passes NoRetrieval)
+// - never discard provenance
+// - never allow free-form hallucinated output (validation gate)
+// - log every run as new evidence
+//
+// This is NOT training — it's runtime behavior shaping via retrieval.
+package replay
+
+// ReplayRequest mirrors the TS interface. NoRetrieval skips the
+// context bundle entirely (baseline mode for A/B tests). DryRun returns
+// a deterministic synthetic response without calling the gateway —
+// used by tests to exercise retrieval/validation without an LLM.
+type ReplayRequest struct {
+ Task string
+ LocalOnly bool
+ AllowEscalation bool
+ NoRetrieval bool
+ DryRun bool
+ GatewayURL string // overrides $LH_GATEWAY_URL
+ LocalModel string // overrides default
+ EscalationModel string // overrides default
+}
+
+// RagSample is one record in exports/rag/playbooks.jsonl.
+type RagSample struct {
+ ID string `json:"id"`
+ Title string `json:"title"`
+ Content string `json:"content"`
+ Tags []string `json:"tags"`
+ SourceRunID string `json:"source_run_id"`
+ SuccessScore string `json:"success_score"`
+ SourceCategory string `json:"source_category"`
+}
+
+// RetrievedArtifact is one playbook surfaced into a ContextBundle.
+type RetrievedArtifact struct {
+ RagID string `json:"rag_id"`
+ SourceRunID string `json:"source_run_id"`
+ Title string `json:"title"`
+ ContentPreview string `json:"content_preview"` // first 240 chars
+ SuccessScore string `json:"success_score"`
+ Tags []string `json:"tags"`
+ Score float64 `json:"score"`
+}
+
+// ContextBundle is what the prompt builder consumes. Empty bundles
+// (no retrieved playbooks) still pass through — buildPrompt downgrades
+// to a no-context prompt when both accepted and warnings are empty.
+type ContextBundle struct {
+ RetrievedPlaybooks []RetrievedArtifact `json:"retrieved_playbooks"`
+ PriorSuccessfulOutputs []RetrievedArtifact `json:"prior_successful_outputs"`
+ FailurePatterns []RetrievedArtifact `json:"failure_patterns"`
+ ValidationSteps []string `json:"validation_steps"`
+ BundleTokenEstimate int `json:"bundle_token_estimate"`
+}
+
+// ValidationResult is the deterministic gate's verdict. Reasons is
+// always non-nil so JSON consumers can iterate without a nil check.
+type ValidationResult struct {
+ Passed bool `json:"passed"`
+ Reasons []string `json:"reasons"`
+}
+
+// ReplayResult is what Replay returns. Mirrors the TS type one-to-one
+// so JSONL emitted by either runtime parses identically.
+type ReplayResult struct {
+ InputTask string `json:"input_task"`
+ TaskHash string `json:"task_hash"`
+ RetrievedArtifacts RetrievedIDs `json:"retrieved_artifacts"`
+ ContextBundle *ContextBundle `json:"context_bundle"`
+ ModelResponse string `json:"model_response"`
+ ModelUsed string `json:"model_used"`
+ EscalationPath []string `json:"escalation_path"`
+ ValidationResult ValidationResult `json:"validation_result"`
+ RecordedRunID string `json:"recorded_run_id"`
+ RecordedAt string `json:"recorded_at"`
+ DurationMs int64 `json:"duration_ms"`
+}
+
+// RetrievedIDs is the {rag_ids} envelope the TS shape uses.
+type RetrievedIDs struct {
+ RagIDs []string `json:"rag_ids"`
+}
+
+// Defaults match replay.ts. Override via env or ReplayRequest fields.
+const (
+ DefaultLocalModel = "qwen3.5:latest"
+ DefaultEscalationModel = "deepseek-v3.1:671b"
+ DefaultGatewayURL = "http://localhost:3110"
+)
diff --git a/internal/replay/validate.go b/internal/replay/validate.go
new file mode 100644
index 0000000..7fb217e
--- /dev/null
+++ b/internal/replay/validate.go
@@ -0,0 +1,66 @@
+package replay
+
+import (
+ "fmt"
+ "regexp"
+ "strings"
+)
+
+// fillerPatterns are the hedge phrases the spec rejects. Compiled once
+// at package init, since the gate runs on every replay call.
+var fillerPatterns = []*regexp.Regexp{
+ regexp.MustCompile(`(?i)as an ai`),
+ regexp.MustCompile(`(?i)i cannot`),
+ regexp.MustCompile(`(?i)i'?m sorry, but`),
+ regexp.MustCompile(`(?i)i don'?t have access`),
+ regexp.MustCompile(`(?i)i am unable to`),
+}
+
+// ValidateResponse runs the deterministic gate on a model response.
+// Empty / too-short / hedge-bearing / context-disconnected responses
+// fail. Matches replay.ts:validateResponse one-to-one.
+func ValidateResponse(response string, bundle *ContextBundle) ValidationResult {
+ trimmed := strings.TrimSpace(response)
+ var reasons []string
+
+ if len(trimmed) == 0 {
+ return ValidationResult{Passed: false, Reasons: []string{"empty response"}}
+ }
+ if len(trimmed) < 80 {
+ reasons = append(reasons, fmt.Sprintf("response too short (%d chars; min 80)", len(trimmed)))
+ }
+ for _, re := range fillerPatterns {
+ if re.MatchString(trimmed) {
+ reasons = append(reasons, fmt.Sprintf("filler/hedge phrase detected: %s", re.String()))
+ }
+ }
+ // Soft anchor: if a validation checklist was supplied, the response
+ // should share at least one token with it (≥3 chars per tokenize()).
+ if bundle != nil && len(bundle.ValidationSteps) > 0 {
+ checklistTokens := map[string]struct{}{}
+ for _, s := range bundle.ValidationSteps {
+ for t := range tokenize(s) {
+ checklistTokens[t] = struct{}{}
+ }
+ }
+ respTokens := tokenize(trimmed)
+ overlap := 0
+ for t := range checklistTokens {
+ if _, ok := respTokens[t]; ok {
+ overlap++
+ }
+ }
+ if len(checklistTokens) > 0 && overlap == 0 {
+ reasons = append(reasons, "response shares no tokens with validation checklist (may not have followed prior patterns)")
+ }
+ }
+
+ return ValidationResult{Passed: len(reasons) == 0, Reasons: reasonsOrEmpty(reasons)}
+}
+
+func reasonsOrEmpty(r []string) []string {
+ if r == nil {
+ return []string{}
+ }
+ return r
+}
diff --git a/scripts/replay_smoke.sh b/scripts/replay_smoke.sh
new file mode 100755
index 0000000..1274f2b
--- /dev/null
+++ b/scripts/replay_smoke.sh
@@ -0,0 +1,77 @@
+#!/usr/bin/env bash
+# replay smoke — exercises the Go port of scripts/distillation/replay.ts.
+# Validates that the replay tool:
+# - Builds a context bundle from a synthetic playbooks corpus
+# - Runs --dry-run end-to-end without an LLM
+# - Logs a row to data/_kb/replay_runs.jsonl with schema=replay_run.v1
+# - Honors --no-retrieval (no bundle, empty rag_ids)
+# - Exits non-zero when validation fails
+
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+export PATH="$PATH:/usr/local/go/bin"
+
+echo "[replay-smoke] building bin/replay..."
+go build -o bin/replay ./cmd/replay
+
+ROOT="$(mktemp -d)"
+trap 'rm -rf "$ROOT"' EXIT INT TERM
+
+mkdir -p "$ROOT/exports/rag"
+cat > "$ROOT/exports/rag/playbooks.jsonl" <<'EOF'
+{"id":"p1","title":"build verification","content":"verify the build, check tests pass before merge\nensure no regressions in suites","tags":["scrum"],"source_run_id":"r-1","success_score":"accepted","source_category":"scrum_review"}
+{"id":"p2","title":"merge cleanup","content":"verify the build, then assert tests passed, then merge","tags":["scrum"],"source_run_id":"r-2","success_score":"accepted","source_category":"scrum_review"}
+{"id":"p3","title":"partial fix","content":"verify the build, sometimes assert tests passed","tags":["scrum"],"source_run_id":"r-3","success_score":"partially_accepted","source_category":"scrum_review"}
+EOF
+
+echo "[replay-smoke] dry-run (with retrieval)"
+./bin/replay -task "verify the build before merging" -dry-run -root "$ROOT" > /tmp/replay_smoke_a.txt 2>&1 || true
+grep -q "retrieval: " /tmp/replay_smoke_a.txt || {
+ echo "missing retrieval line"; cat /tmp/replay_smoke_a.txt; exit 1;
+}
+grep -q "escalation_path: qwen3.5:latest" /tmp/replay_smoke_a.txt || {
+ echo "missing escalation_path line"; cat /tmp/replay_smoke_a.txt; exit 1;
+}
+
+LOG="$ROOT/data/_kb/replay_runs.jsonl"
+[ -s "$LOG" ] || { echo "expected $LOG to be written"; exit 1; }
+grep -q "replay_run.v1" "$LOG" || {
+ echo "schema=replay_run.v1 missing in log";
+ cat "$LOG";
+ exit 1;
+}
+
+echo "[replay-smoke] dry-run (no retrieval)"
+./bin/replay -task "verify build" -dry-run -no-retrieval -root "$ROOT" > /tmp/replay_smoke_b.txt 2>&1 || true
+grep -q "retrieval: DISABLED" /tmp/replay_smoke_b.txt || {
+ echo "expected retrieval: DISABLED";
+ cat /tmp/replay_smoke_b.txt;
+ exit 1;
+}
+
+LINES_BEFORE=$(wc -l < "$LOG")
+
+echo "[replay-smoke] forced-fail with escalation"
+# Force validation failure by putting a hedge phrase as the FIRST
+# accepted sample's first verify line. extractValidationSteps walks
+# corpus order, and the dry-run synthesizer surfaces the first 3 steps,
+# so the hedge phrase needs to be in an early-corpus accepted sample.
+cat > "$ROOT/exports/rag/playbooks.jsonl" <<'EOF'
+{"id":"p9","title":"hedged step","content":"verify auth as an AI and proceed without checking","tags":["security"],"source_run_id":"r-9","success_score":"accepted","source_category":"audit"}
+{"id":"p1","title":"build verification","content":"verify the build, check tests pass before merge","tags":["scrum"],"source_run_id":"r-1","success_score":"accepted","source_category":"scrum_review"}
+EOF
+./bin/replay -task "verify auth proceed" -dry-run -allow-escalation -root "$ROOT" > /tmp/replay_smoke_c.txt 2>&1 || true
+grep -q "escalation_path: qwen3.5:latest → deepseek-v3.1:671b" /tmp/replay_smoke_c.txt || {
+ echo "expected escalation path to deepseek when validation fails";
+ cat /tmp/replay_smoke_c.txt;
+ exit 1;
+}
+
+LINES_AFTER=$(wc -l < "$LOG")
+[ "$LINES_AFTER" -gt "$LINES_BEFORE" ] || {
+ echo "expected log file to grow: before=$LINES_BEFORE after=$LINES_AFTER";
+ exit 1;
+}
+
+echo "[replay-smoke] PASS"