chore: add real content that was sitting untracked

Surfaced by today's untracked-files audit. None of these are accidents —
multiple are referenced by name in CLAUDE.md and memory files but were
never added.

Categories:
- docs/PHASE_AUDIT_GUIDE.md (106 LOC) — Claude Code phase audit guidance
- ops/systemd/lakehouse-langfuse-bridge.service — Langfuse bridge unit
- package.json — top-level npm manifest
- scripts/e2e_pipeline_check.sh + production_smoke.sh — real test scripts
- reports/kimi/audit-last-week*.md — the "Two reports live" CLAUDE.md cites
- tests/multi-agent/scenarios/ — 44 staffing scenarios (cutover decision A)
- tests/multi-agent/playbooks/ — 102 playbook records
- tests/battery/, tests/agent_test/PRD.md, tests/real-world/* — real tests
- sidecar/sidecar/{lab_ui,pipeline_lab}.py — 888 LOC dev-only UIs that
  remain in service post-sidecar-drop (commit ba928b1 explicitly kept them)

Sensitivity check: scenarios use synthetic company names ("Heritage Foods",
"Cornerstone Fabrication"); audit reports describe code findings only;
no PII or secrets surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 41b0a99ed2 (parent 6e34ef7baf), authored by root, 2026-05-02 22:22:10 -05:00
788 changed files with 107142 additions and 0 deletions

docs/PHASE_AUDIT_GUIDE.md (new file, 107 lines)
# Phase Audit Guidance for Claude Code
## Purpose
This document provides the proper workflow for auditing completed phases in the Lakehouse project.
## ⚠️ Important: Do NOT Skip Steps
Each phase requires BOTH:
1. PRD spec verification (check code exists)
2. Full SCRUM execution (6 commands)
## Proper Phase Audit Workflow
### Step 1: Read PRD Specification
For each phase, read the PRD to understand what's supposed to ship:
```bash
# Read from docs/PRD.md or docs/PHASES.md
grep -A20 "Phase N:" docs/PHASES.md
```
### Step 2: Verify Code Exists
Check that each deliverable from the PRD spec has corresponding code:
```bash
# Example - check for specific implementations
grep -r "function_name" crates/*/src/
ls crates/*/src/*.rs
```
### Step 3: Run Full SCRUM (6 Commands)
In order, execute ALL of these for the phase's crates:
```bash
# 1. Build
cargo build -p <crate-name>
# 2. Test
cargo test -p <crate-name>
# 3. Clippy (if installed)
cargo clippy -p <crate-name> -- -D warnings
# 4. Format check
cargo fmt -p <crate-name> -- --check
# 5. Cargo check
cargo check -p <crate-name>
# 6. Doc check
cargo doc -p <crate-name> --no-deps
```
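The six commands above can be wrapped in a small helper so a failure stops the run at the offending step. A sketch only, not part of the documented workflow; the `CARGO` variable is parameterized purely so the helper can be dry-run:

```shell
#!/usr/bin/env bash
# scrum: run all six SCRUM commands for one crate, stopping at the first failure.
# CARGO defaults to cargo; overriding it (e.g. CARGO=echo) dry-runs the sequence.
scrum() {
  local crate="$1" cargo="${CARGO:-cargo}"
  local -a cmds=(
    "$cargo build -p $crate"
    "$cargo test -p $crate"
    "$cargo clippy -p $crate -- -D warnings"
    "$cargo fmt -p $crate -- --check"
    "$cargo check -p $crate"
    "$cargo doc -p $crate --no-deps"
  )
  local i=1
  for c in "${cmds[@]}"; do
    echo "[$i/6] $c"
    $c || { echo "SCRUM FAIL at step $i: $c" >&2; return 1; }
    i=$((i + 1))
  done
  echo "SCRUM PASS: $crate"
}
```

Per Step 4, a failure inside the loop means fixing the code and re-running the whole sequence, not just the failed step.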
### Step 4: Fix Issues
If any SCRUM command fails:
- Fix the code
- Re-run the failing command
- Re-run ALL 6 commands to verify
### Step 5: Update Phase Documentation
Only mark as ✅ after ALL 6 SCRUM commands pass:
```markdown
## Phase N: [Name] ✅
- [x] spec item 1
- [x] spec item 2
- SCRUM: build ✅ test ✅ clippy ✅ fmt ✅ check ✅ doc ✅
```
## Current Phase Status
| Phase | Status | Notes |
|-------|--------|-------|
| 0 | ✅ | Bootstrap complete |
| 1 | ✅ | Storage + Catalog |
| 2 | ✅ | Query Engine |
| 3 | ✅ | AI Integration |
| 4 | ✅ | Frontend |
| 5 | ✅ | Hardening |
| 6-42 | ✅ | See docs/PHASES.md |
## Notes from Previous Session
- Clippy and rustfmt are NOT installed on this system
- Run `rustup component add clippy rustfmt` to install
- Some crates have 0 unit tests (expected for service crates)
- 28 warnings remain in unused code paths (ui/vectord)
## Key Files
- `docs/PHASES.md` - Phase tracker with checkboxes
- `docs/PRD.md` - Full product requirements
- `docs/CONTROL_PLANE_PRD.md` - Phases 38+ specifications
- `crates/*/` - All crate implementations
## Quick Reference
```bash
# Full workspace SCRUM
cargo build --workspace
cargo test --workspace
# (clippy if installed)
cargo fmt -- --check
cargo check --workspace
cargo doc --no-deps
# Per-crate
cargo build -p <crate>
cargo test -p <crate>
cargo check -p <crate>
```

ops/systemd/lakehouse-langfuse-bridge.service (new file, 28 lines)
[Unit]
Description=Lakehouse Langfuse → observer bridge — forwards LLM trace metadata to :3800 so KB learns from cost/latency/provider deltas
Documentation=file:///home/profit/lakehouse/mcp-server/langfuse_bridge.ts
After=network.target
# No hard dependency on either Langfuse or observer — if either is down,
# the bridge retries on the next tick without crashing. That's the
# whole point of the cursor state file.
[Service]
Type=simple
WorkingDirectory=/home/profit/lakehouse
ExecStart=/home/profit/.bun/bin/bun run /home/profit/lakehouse/mcp-server/langfuse_bridge.ts
Restart=on-failure
RestartSec=30
# Credentials resolved from env. Matches how
# crates/gateway/src/v1/langfuse_trace.rs reads them so both producer
# (gateway emitter) and consumer (this bridge) share the same config.
EnvironmentFile=-/etc/lakehouse/langfuse.env
Environment=LANGFUSE_URL=http://localhost:3001
Environment=OBSERVER_URL=http://localhost:3800
Environment=LANGFUSE_POLL_MS=30000
Environment=LANGFUSE_BATCH_LIMIT=50
Environment=LANGFUSE_STATE_FILE=/var/lib/lakehouse-guard/langfuse_last_seen.json
KillSignal=SIGTERM
TimeoutStopSec=5
[Install]
WantedBy=multi-user.target
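The unit's comments lean on the cursor state file for crash-free retries. A minimal sketch of that pattern, with a hypothetical `forward_batch` standing in for the bridge's real HTTP POST to :3800 (the actual logic lives in langfuse_bridge.ts):

```shell
#!/usr/bin/env bash
# Sketch of the cursor-file retry pattern described in the unit comments.
# forward_batch is a placeholder for the bridge's POST to the observer;
# the cursor advances only after a successful forward, so a failed tick
# simply retries the same window on the next poll.
STATE_FILE="${STATE_FILE:-/tmp/langfuse_last_seen}"

forward_batch() {  # args: <since> <until>; replace with a real HTTP call
  return 0
}

tick() {
  local last_seen=0 now
  [ -f "$STATE_FILE" ] && last_seen=$(cat "$STATE_FILE")
  now=$(date +%s)
  if forward_batch "$last_seen" "$now"; then
    echo "$now" > "$STATE_FILE"   # advance cursor only on success
  fi
}
```

Because the cursor never advances on failure, neither Langfuse nor the observer being down loses data; the window is simply replayed on a later tick.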

package.json (new file, 5 lines)
{
"dependencies": {
"langfuse": "^3.38.20"
}
}

reports/kimi/ audit report (new file, 45 lines; one of the two audit-last-week*.md reports named in the commit message)
# Kimi Forensic Audit (FULL FILES) — distillation v1.0.0
**Generated:** 2026-04-27 by `kimi-for-coding` via gateway /v1/chat
**Latency:** 270.6s | **finish:** stop | **usage:** {'prompt_tokens': 66338, 'completion_tokens': 10159, 'total_tokens': 76497}
**Input:** /tmp/kimi-audit-full.md (238KB · 12 commits · 15 files · line-numbered, no truncation)
---
## Verdict
**Hold**: the substrate's TypeScript pipeline is architecturally coherent and the SFT firewall is genuine, but committed Rust tests fail to compile, drift detection hardcodes an unverified integrity assertion, and deterministic guarantees leak wall-clock time in multiple places.
## What's solid
- **Three-layer SFT contamination firewall is real.** Schema enum restricts `quality_score` to `["accepted", "partially_accepted"]` (`sft_sample.ts:13,62`), exporter constant `SFT_NEVER` blocks rejected/needs_human_review before synthesis (`export_sft.ts:51,205`), and `receipts.ts` re-reads the output to fail loud if any forbidden score leaked (`receipts.ts:231-236`).
- **Core scorer is pure and deterministic.** `scoreRecord` takes an `EvidenceRecord`, performs no I/O, no LLM calls, and uses no mutable state (`scorer.ts:1-5,257-273`).
- **Quarantine is exhaustive and observable.** Every exporter routes skips to structured `exports/quarantine/<exporter>.jsonl` with typed reasons; silent drops are impossible by construction (`quarantine.ts:1-6,14-26`).
- **Evidence provenance is mandatory on every row.** Every `EvidenceRecord` carries `source_file`, `line_offset`, `sig_hash`, and `recorded_at` (`build_evidence_index.ts:27-34`).
- **Local-first replay reduces cloud calls.** `replay.ts` defaults to a local model, augments via RAG retrieval, and only escalates on validation failure, directly supporting the cloud-call reduction claim (`replay.ts:24,349-376`).
## What's risky
1. **receipts.ts:495** hardcodes `input_hash_match: true` in drift reports while comments on lines 467-469 admit input-hash comparison is unimplemented; this is false telemetry in a forensic system.
2. **score_runs.ts:159** deduplicates scored runs by `scored.provenance.sig_hash` (the *evidence* hash), not by a composite of evidence + scorer version, so scorer logic or `SCORER_VERSION` updates are silently ignored on re-runs against existing partition files.
3. **transforms.ts:181** `auto_apply` transform falls back to `new Date().toISOString()` when `row.ts` is missing, injecting wall-clock time into the supposedly deterministic materialization layer.
4. **mode.rs:1035,1042** Rust test code assigns `Some("...".into())` and `None` to a `Vec<String>` field (`matrix_corpus`), which would fail `cargo test` compilation; this contradicts the claim that the tag is fully tested.
5. **export_sft.ts:109-133** synthesizes fake instruction templates per source stem instead of using actual historical prompts; the SFT firewall prevents category contamination but not prompt-fidelity distortion.
## Specific findings
- **mode.rs:1035** — Compile error in test helper: `matrix_corpus: Some("distilled_procedural_v1".into())` mismatches the `Vec<String>` type declared at line 172. **Rationale:** Direct struct construction in the test module uses an `Option` where a `Vec` is required, so the Rust test suite cannot compile.
- **receipts.ts:495** — Drift detection hardcodes `input_hash_match: true`. **Rationale:** The adjacent comment admits input-hash comparison is simplified and unimplemented (lines 467-469); asserting a verified match is misleading telemetry that will hide real input-side regressions.
- **score_runs.ts:159** — Scored-run dedup ignores scorer version. **Rationale:** `loadSeenHashes` and the skip logic key only on the EvidenceRecord `sig_hash`, meaning an existing scored-run file from yesterday will block updated scores even if `SCORER_VERSION` or scorer logic changed today.
- **transforms.ts:181** — Non-deterministic timestamp fallback in `auto_apply` transform. **Rationale:** `row.ts ?? new Date().toISOString()` injects wall-clock time when the source row lacks a timestamp, violating the header claim that transforms are “deterministic by construction” and breaking bit-identical reproducibility for that stream.
- **export_sft.ts:126** — Unsafe property access via `as any`. **Rationale:** `(ev as any).contractor` bypasses the `EvidenceRecord` type contract; if the property is absent the template silently emits `"<contractor>"`, degrading SFT data quality without a type error.
- **scorer.ts:30** — Environmental dependency in deterministic scorer. **Rationale:** `process.env.LH_SCORER_VERSION` means identical evidence inputs produce different `scorer_version` stamps (and different downstream receipts) depending on the runtime environment, undermining bit-identical claims.
- **replay.ts:378** — Non-deterministic run identifier. **Rationale:** `` `replay:${task_hash.slice(0, 16)}:${Date.now()}` `` makes replay evidence rows non-reproducible and risks collision under rapid successive calls.
- **export_sft.ts:109-133** — Synthetic instruction generation replaces ground-truth prompts. **Rationale:** The exporter fabricates instruction strings from metadata (e.g., hardcoded scrum review phrasing) rather than retrieving the actual historical prompt, so the resulting SFT dataset trains on reconstructed, not authentic, user instructions.
## Direction recommendation
**Pause the staffing audit and harden the substrate first.** Before building the staffing inference mode (`staffing_inference_lakehouse` in `mode.rs:54`) on top of this substrate:
1. Fix the Rust test compile errors (`mode.rs:1035,1042`) and ensure `cargo test` runs in CI.
2. Replace the hardcoded `input_hash_match: true` in drift detection (`receipts.ts:495`) with a real hash comparison or remove the field until it is implemented.
3. Change scored-run dedup (`score_runs.ts:159`) to key on a composite hash of `evidence_sig_hash + scorer_version + SCORER_VERSION` so scorer updates force re-scoring.
4. Remove the `new Date().toISOString()` fallback in `transforms.ts:181` or fail the row so determinism is preserved.
5. Audit all `as any` casts in the export layer (`export_sft.ts:126`) for type-safe alternatives.
Once those fixes land and acceptance re-runs pass, proceed to the staffing audit wave; the architecture is sound enough to support it, but the forensic guarantees must be honest before downstream teams depend on them.
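Recommendation 3's composite dedup key can be sketched at the shell level (placeholder values throughout; the real fix belongs in score_runs.ts):

```shell
# Composite dedup key: the evidence hash alone misses scorer changes, so
# fold the scorer version into the key. Values below are placeholders.
composite_key() {
  # args: <evidence_sig_hash> <scorer_version>
  printf '%s:%s' "$1" "$2" | sha256sum | cut -d' ' -f1
}

k1=$(composite_key "3f1a9c" "v1.0.0")
k2=$(composite_key "3f1a9c" "v1.0.1")
echo "$k1"
echo "$k2"
```

Because the two keys differ whenever the scorer version changes, a `SCORER_VERSION` bump forces re-scoring instead of being silently skipped by yesterday's partition file.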

reports/kimi/ audit report (new file, 36 lines; the second of the audit-last-week*.md reports)
# Kimi Forensic Audit — distillation v1.0.0 (last week)
**Generated:** 2026-04-27 by `kimi-for-coding` via gateway /v1/chat
**Latency:** 157.6s | **finish:** stop | **usage:** {'prompt_tokens': 14014, 'completion_tokens': 6356, 'total_tokens': 20370}
**Input:** /tmp/kimi-audit-input.md (56k chars · 12 commits · 6 files)
---
## Verdict
**hold** — Runtime lock-in, integration mismatches, and truncated source files in the v1.0.0 payload make the tag unshippable without rework.
## What's solid
- `scorer.ts` is a pure, deterministic function with no I/O, no LLM calls, and an explicit version stamp (`scorer.ts:22`).
- SFT export enforces defense-in-depth contamination firewalls via `SFT_NEVER` and schema validators (`export_sft.ts:49-50`; `sft_sample.ts:43-48`).
- Evidence materialization is idempotent across reruns using `sig_hash` deduplication (`build_evidence_index.ts:114-126`).
- Mode router falls back to a safe built-in default if config parsing fails (`mode.rs:208-228`).
- Quarantine writer abstraction isolates bad records instead of failing the export (`export_sft.ts`).
## What's risky
- **Integration mismatch**: `replay.ts` posts to `/v1/chat`, but the provided gateway only declares `/v1/mode` and `/v1/mode/execute` (`replay.ts:186` vs `mode.rs:13-18`), suggesting an undocumented or broken proxy contract.
- **Bun runtime lock-in**: Multiple files depend on `Bun.CryptoHasher`, which throws in Node.js (`export_sft.ts:235`; `build_evidence_index.ts:89`).
- **Unauditable files in scope**: Critical files listed in the diff—`transforms.ts`, `receipts.ts`, `quarantine.ts`, `score_runs.ts`—were not provided, so their logic is unseen.
- **Every shown implementation file is truncated**: `scorer.ts`, `export_sft.ts`, `build_evidence_index.ts`, `replay.ts`, and `mode.rs` all end mid-block, hiding error handling, receipt finalization, and gateway dispatch code.
- **Type safety escape**: `(ev as any).contractor` in SFT synthesis bypasses the schema layer (`export_sft.ts:138`).
## Specific findings
1. `scripts/distillation/scorer.ts:22` — `SCORER_VERSION` reads from `process.env`, introducing environment-dependent output drift that contradicts the file's “identical input → identical output forever” contract.
2. `scripts/distillation/export_sft.ts:138` — `(ev as any).contractor` is an unguarded `any` cast; a malformed `EvidenceRecord` will inject the string `"undefined"` or crash at runtime inside the SFT instruction template.
3. `scripts/distillation/export_sft.ts:235` — `new Bun.CryptoHasher("sha256")` is a Bun-only API; this path will fail under Node.js/Deno and makes the substrate non-portable.
4. `scripts/distillation/build_evidence_index.ts:89` — Same Bun crypto lock-in in `sha256OfFile`, fragmenting the hashing implementation (here `Bun.CryptoHasher`, elsewhere `canonicalSha256`).
5. `scripts/distillation/replay.ts:178` — Provider routing relies on fragile string heuristics (`model.includes("/")`, prefix lists); models with unexpected names will route to the wrong backend or hit the `ollama` default incorrectly.
6. `scripts/distillation/replay.ts:186` — ``fetch(`${gatewayUrl()}/v1/chat`)`` targets an endpoint absent from the provided `mode.rs` router; without the missing gateway dispatch code, this call will 404.
7. `crates/gateway/src/v1/mode.rs:141` — `deserialize_string_or_vec` uses `serde_json::Value::deserialize` against a TOML source, which is non-idiomatic and risks mis-handling TOML-specific types (datetime, inline tables) compared to a native `toml::Value`.
8. `scripts/distillation/build_evidence_index.ts:185` — `await canonicalSha256(row)` is async, yet `sha256OfFile` is sync; the mixing of sync/async crypto calls in the same module hints at inconsistent I/O boundaries.
## Direction recommendation
Keep the substrate architecture, but **do not expand staffing audit work on top of v1.0.0 until three blockers are fixed**: (1) replace `Bun.CryptoHasher` with portable WebCrypto or Node `crypto` so the build is runtime-agnostic; (2) align `replay.ts` to the actual gateway contract (`/v1/mode/execute`) or document the `/v1/chat` proxy route; and (3) eliminate `any` casts in the export path. The schema firewalls, deterministic scorer, and receipt provenance are the right foundation—rework the runtime/contract gaps rather than rebuilding from scratch.

scripts/e2e_pipeline_check.sh (new executable file, 536 lines)
#!/usr/bin/env bash
# ------------------------------------------------------------
# End-to-end pipeline verification for Lakehouse.
#
# Generates realistic staffing-style data, runs it through every
# shipped pipeline stage, asserts correctness at each step, and
# cleans up after itself.
#
# Stages exercised:
# 0. Preflight — gateway + sidecar reachability
# 1. Data generation — 1000 candidates, 200 placements, 10 resumes
# 2. CSV ingest — Phase 6.1 (via ?name= query param)
# 3. NDJSON ingest — Phase 6.2
# 4. SQL queries + joins — Phase 2, Phase 8 hot cache
# 5. Content-hash re-ingest dedup — Phase 6.4
# 6. Idempotent register — ADR-020 (same-fingerprint path)
# 7. Schema-drift rejection — ADR-020 (409 Conflict path)
# 8. Catalog dedupe no-op — ADR-020 (clean state)
# 9. Metadata enrichment — Phase 10 POST
# 10. PII auto-detection audit — Phase 10
# 11. Vector index + search — Phase 7 (documents pulled via SQL)
# 12. Cleanup + baseline verify — no-orphan guarantee
#
# Usage:
# ./scripts/e2e_pipeline_check.sh # run all stages
# SKIP_VECTOR=1 ./scripts/e2e_pipeline_check.sh # skip Ollama-bound steps
# KEEP_DATA=1 ./scripts/e2e_pipeline_check.sh # leave /tmp artifacts
#
# Exit codes:
# 0 all assertions passed
# 1 one or more assertions failed
# 2 preflight failed (service unreachable)
# ------------------------------------------------------------
set -u
set -o pipefail
GATEWAY="${GATEWAY:-http://localhost:3100}"
SIDECAR="${SIDECAR:-http://localhost:3200}"
WORKDIR="${WORKDIR:-/tmp/lakehouse_e2e}"
DATA_ROOT="${DATA_ROOT:-/home/profit/lakehouse/data}"
SKIP_VECTOR="${SKIP_VECTOR:-0}"
KEEP_DATA="${KEEP_DATA:-0}"
RUN_ID="e2e_$(date +%s)"
CAND_DS="${RUN_ID}_candidates"
PLACE_DS="${RUN_ID}_placements"
RESUME_DS="${RUN_ID}_resumes"
VEC_IDX="${RESUME_DS}_v1"
# Color names use a CC_ prefix so they can't be shadowed by single-letter
# local variables like `R` that hold curl responses elsewhere in the script.
if [[ -t 1 ]]; then
CC_GRN=$'\033[0;32m'; CC_RED=$'\033[0;31m'; CC_YLW=$'\033[1;33m'
CC_BLU=$'\033[1;34m'; CC_DIM=$'\033[2m'; CC_RST=$'\033[0m'
else
CC_GRN=''; CC_RED=''; CC_YLW=''; CC_BLU=''; CC_DIM=''; CC_RST=''
fi
PASS=0; FAIL=0; WARN=0; STARTED_AT=$(date +%s)
FAILURES=()
pass() { printf ' %s✓%s %s\n' "$CC_GRN" "$CC_RST" "$1"; PASS=$((PASS+1)); }
fail() { printf ' %s✗%s %s\n' "$CC_RED" "$CC_RST" "$1"; FAIL=$((FAIL+1)); FAILURES+=("$1"); }
warn() { printf ' %s!%s %s\n' "$CC_YLW" "$CC_RST" "$1"; WARN=$((WARN+1)); }
step() { printf '\n%s== %s ==%s\n' "$CC_BLU" "$1" "$CC_RST"; }
info() { printf ' %s%s%s\n' "$CC_DIM" "$1" "$CC_RST"; }
die() { printf '%sFATAL: %s%s\n' "$CC_RED" "$1" "$CC_RST" >&2; cleanup; exit 2; }
assert_eq() {
if [[ "$1" == "$2" ]]; then pass "$3 ($1)"; else fail "$3: got '$1', expected '$2'"; fi
}
http_code() {
local method="$1" path="$2" data="${3:-}"
if [[ -n "$data" ]]; then
curl -s -o /dev/null -w '%{http_code}' -X "$method" "$GATEWAY$path" \
-H 'Content-Type: application/json' -d "$data"
else
curl -s -o /dev/null -w '%{http_code}' -X "$method" "$GATEWAY$path"
fi
}
# query_scalar <sql> -> first column of first row as string, sentinel on empty/error
query_scalar() {
local sql="$1"
local payload
payload=$(python3 -c 'import json,sys; print(json.dumps({"sql": sys.argv[1]}))' "$sql")
curl -s -X POST "$GATEWAY/query/sql" \
-H 'Content-Type: application/json' \
-d "$payload" \
| python3 -c '
import sys, json
try:
r = json.load(sys.stdin)
except Exception:
print("__PARSE_ERROR__"); sys.exit(0)
if isinstance(r, dict) and "error" in r:
sys.stderr.write("query error: " + str(r["error"]) + "\n")
print("__ERROR__"); sys.exit(0)
rows = r.get("rows") if isinstance(r, dict) else None
if not rows:
print("__NO_ROWS__"); sys.exit(0)
row = rows[0]
print(next(iter(row.values())))
'
}
cleanup() {
[[ "$KEEP_DATA" == "1" ]] && { info "KEEP_DATA=1 — leaving $WORKDIR"; return; }
info "cleaning up test datasets for $RUN_ID"
# Catch any previous-run zombies too: any catalog entry whose name
# starts with "e2e_" is definitionally ours. Using DELETE (added for
# this script's needs) purges both the live registry and the manifest
# file atomically, so the next run doesn't trip on zombie entries
# pointing at parquets we've already rm'd.
local names
names=$(curl -s "$GATEWAY/catalog/datasets" 2>/dev/null \
| python3 -c "
import sys, json
try: ds = json.load(sys.stdin)
except Exception: sys.exit(0)
for d in ds:
if d['name'].startswith('e2e_'):
print(d['name'])
" 2>/dev/null || true)
local removed=0
for n in $names; do
curl -s -o /dev/null -X DELETE "$GATEWAY/catalog/datasets/by-name/$n" && removed=$((removed+1))
done
# Delete any stray parquet + vector artifacts we can positively
# attribute to an e2e_ prefix.
rm -f "$DATA_ROOT/datasets/"e2e_*.parquet 2>/dev/null || true
rm -f "$DATA_ROOT/vectors/"e2e_*.parquet 2>/dev/null || true
rm -rf "$WORKDIR" 2>/dev/null || true
info "deleted $removed e2e datasets (covers this run + any prior zombies)"
}
trap cleanup EXIT
# ============================================================
# 0. Preflight
# ============================================================
step "0. Preflight"
curl -sf -m 3 "$GATEWAY/health" >/dev/null 2>&1 || die "gateway not reachable at $GATEWAY"
pass "gateway /health (200)"
SIDECAR_UP=0
if curl -sf -m 3 "$SIDECAR/health" >/dev/null 2>&1; then
SIDECAR_UP=1; pass "sidecar /health (200)"
else
warn "sidecar unreachable — vector stage will be skipped"
SKIP_VECTOR=1
fi
# Purge any e2e_* zombies from prior runs (stale registry entries that
# would otherwise break DataFusion schema inference for every query).
ZOMBIES=$(curl -s "$GATEWAY/catalog/datasets" 2>/dev/null \
| python3 -c "
import sys, json
try: ds = json.load(sys.stdin)
except Exception: sys.exit(0)
for d in ds:
if d['name'].startswith('e2e_'):
print(d['name'])
" 2>/dev/null || true)
if [[ -n "$ZOMBIES" ]]; then
ZCOUNT=$(echo "$ZOMBIES" | wc -l | tr -d ' ')
for n in $ZOMBIES; do
curl -s -o /dev/null -X DELETE "$GATEWAY/catalog/datasets/by-name/$n"
done
info "pre-cleaned $ZCOUNT e2e_ zombies from prior runs"
fi
BASELINE=$(curl -s "$GATEWAY/catalog/datasets" | python3 -c 'import sys,json; print(len(json.load(sys.stdin)))')
info "baseline dataset count: $BASELINE"
# ============================================================
# 1. Generate realistic data
# ============================================================
step "1. Generate realistic staffing data"
mkdir -p "$WORKDIR"
# Seed with RUN_ID (which embeds the wall-clock timestamp) so each run
# produces different content. Otherwise the content-hash dedup from
# Phase 6.4 keys off a stale hash that lingers in the live registry
# until the next gateway restart, and subsequent runs silently dedupe.
python3 - "$WORKDIR" "$RUN_ID" <<'PYEOF'
import csv, hashlib, json, random, sys, os
workdir, run_id = sys.argv[1], sys.argv[2]
# Mix RUN_ID into the seed so content differs per run, but keep it
# deterministic within a single run. Use a stable digest rather than
# hash(), which is salted per-process in Python 3.
random.seed(int.from_bytes(hashlib.sha256(run_id.encode()).digest()[:4], 'big'))
FIRST = ['Aisha','Brandon','Carlos','Daria','Eli','Fiona','Gabriel','Hana','Ian','Julia',
'Kofi','Lena','Mateo','Nadia','Oscar','Priya','Quinn','Raj','Sofia','Tomas',
'Uma','Victor','Wendy','Xander','Yuki','Zara']
LAST = ['Adams','Brown','Chen','Davis','Evans','Fisher','Garcia','Hughes','Ibrahim','Johnson',
'Kim','Lopez','Martinez','Nguyen','Ortiz','Patel','Rossi','Singh','Thomas','Umar',
'Vargas','Williams','Xu','Young','Zhang','OConnor']
PLACES = [('Chicago','IL'),('New York','NY'),('San Francisco','CA'),('Austin','TX'),
('Seattle','WA'),('Denver','CO'),('Boston','MA'),('Atlanta','GA'),
('Miami','FL'),('Phoenix','AZ')]
SKILL_GROUPS = [
['Python','AWS','Docker'],['Java','Spring','Kubernetes'],
['React','TypeScript','Node'],['Go','PostgreSQL','gRPC'],
['Rust','DataFusion','Parquet'],['C#','.NET','Azure'],
['Ruby','Rails','Redis'],['Scala','Spark','Kafka'],
['Swift','iOS','CoreData'],['Kotlin','Android','Jetpack'],
]
STATUSES = ['active','placed','inactive','blocked']
STATUS_WEIGHTS = [60, 25, 10, 5]
with open(os.path.join(workdir, 'candidates.csv'), 'w', newline='') as f:
w = csv.DictWriter(f, fieldnames=[
'candidate_id','first_name','last_name','email','phone',
'city','state','skills','years_experience','hourly_rate_usd','status'])
w.writeheader()
for i in range(1, 1001):
fn, ln = random.choice(FIRST), random.choice(LAST)
city, state = random.choice(PLACES)
w.writerow({
'candidate_id': f'CAND-{i:05d}',
'first_name': fn, 'last_name': ln,
'email': f'{fn.lower()}.{ln.lower()}{i}@example.com',
'phone': f'({random.randint(200,999)}) {random.randint(200,999)}-{random.randint(1000,9999)}',
'city': city, 'state': state,
'skills': ','.join(random.choice(SKILL_GROUPS)),
'years_experience': random.randint(0, 20),
'hourly_rate_usd': random.randint(35, 185),
'status': random.choices(STATUSES, weights=STATUS_WEIGHTS)[0],
})
CLIENTS = ['Acme Corp','Globex','Initech','Umbrella','Wayne Enterprises',
'Stark Industries','Tyrell','Cyberdyne','Massive Dynamic','Oscorp']
with open(os.path.join(workdir, 'placements.ndjson'), 'w') as f:
for i in range(1, 201):
f.write(json.dumps({
'placement_id': f'PLACE-{i:04d}',
'candidate_id': f'CAND-{random.randint(1,1000):05d}',
'client': random.choice(CLIENTS),
'start_date': f'2026-{random.randint(1,4):02d}-{random.randint(1,28):02d}',
'weekly_hours': random.choice([20,25,30,35,40]),
'bill_rate': random.randint(80, 250),
'placement_status': random.choice(['active','completed','terminated']),
}) + '\n')
RESUMES = [
'Senior Python engineer with 8 years of cloud infrastructure experience. Expert in AWS, Docker, and distributed systems design. Led migration of monolithic legacy system to microservices.',
'Full-stack React and TypeScript developer specializing in real-time dashboards. Built financial trading interfaces. GraphQL, WebSocket, performance optimization.',
'Data engineer with deep Apache Spark and Kafka expertise. Seven years on streaming analytics pipelines processing billions of events per day. Scala and Python.',
'Embedded systems engineer with C++ and Rust experience. Worked on automotive ADAS systems and industrial IoT devices. Low-level firmware, RTOS.',
'DevOps engineer with Kubernetes and Terraform expertise. Six years at hypergrowth startups. Prometheus, Grafana, and observability tooling.',
'Machine learning engineer specializing in NLP. Built production transformer-based systems. PyTorch, Hugging Face, fine-tuning large language models.',
'iOS developer with Swift and SwiftUI. Four years building consumer apps at mid-size tech companies. Offline-first architectures and CoreData.',
'Backend Go developer focused on high-throughput APIs. Built payment processing systems handling millions of transactions. PostgreSQL, gRPC, Redis.',
'Security engineer with penetration testing and threat modeling experience. OSCP certified. Web application security, AppSec code review, SAST and DAST tooling.',
'Site reliability engineer with Linux internals and performance tuning expertise. Ten years at large-scale infrastructure. Tracing, profiling, kernel-level debugging.',
]
with open(os.path.join(workdir, 'resumes.ndjson'), 'w') as f:
for i, r in enumerate(RESUMES, 1):
f.write(json.dumps({'doc_id': f'RES-{i:03d}', 'resume_text': r}) + '\n')
PYEOF
pass "candidates.csv (1000 rows, 11 cols)"
pass "placements.ndjson (200 rows, 7 cols)"
pass "resumes.ndjson (10 rows, 2 cols)"
# ============================================================
# 2. CSV ingest
# ============================================================
step "2. CSV ingest (Phase 6.1)"
R=$(curl -s -X POST "$GATEWAY/ingest/file?name=$CAND_DS" -F "file=@$WORKDIR/candidates.csv")
echo "$R" | python3 -c 'import sys,json; json.load(sys.stdin)' 2>/dev/null \
|| { fail "ingest response was not JSON: $(echo "$R" | head -c 200)"; R='{}'; }
ROWS=$(echo "$R" | python3 -c 'import sys,json; print(json.load(sys.stdin).get("rows",-1))' 2>/dev/null)
DEDUP=$(echo "$R" | python3 -c 'import sys,json; print(json.load(sys.stdin).get("deduplicated","?"))' 2>/dev/null)
DS_NAME=$(echo "$R" | python3 -c 'import sys,json; print(json.load(sys.stdin).get("dataset_name","?"))' 2>/dev/null)
assert_eq "$DS_NAME" "$CAND_DS" "ingest respected ?name= query param"
assert_eq "$ROWS" "1000" "ingest rows"
assert_eq "$DEDUP" "False" "first upload not deduplicated"
REG_ROWS=$(curl -s "$GATEWAY/catalog/datasets/by-name/$CAND_DS" \
| python3 -c 'import sys,json; print(json.load(sys.stdin).get("row_count","null"))')
assert_eq "$REG_ROWS" "1000" "manifest row_count reflects ingest"
# ============================================================
# 3. NDJSON ingest
# ============================================================
step "3. NDJSON ingest (Phase 6.2)"
R=$(curl -s -X POST "$GATEWAY/ingest/file?name=$PLACE_DS" -F "file=@$WORKDIR/placements.ndjson")
ROWS=$(echo "$R" | python3 -c 'import sys,json; print(json.load(sys.stdin).get("rows",-1))' 2>/dev/null)
assert_eq "$ROWS" "200" "placements NDJSON ingest rows"
R=$(curl -s -X POST "$GATEWAY/ingest/file?name=$RESUME_DS" -F "file=@$WORKDIR/resumes.ndjson")
ROWS=$(echo "$R" | python3 -c 'import sys,json; print(json.load(sys.stdin).get("rows",-1))' 2>/dev/null)
assert_eq "$ROWS" "10" "resumes NDJSON ingest rows"
# ============================================================
# 4. SQL queries + JOIN + cache
# ============================================================
step "4. SQL queries (Phase 2, Phase 8)"
N=$(query_scalar "SELECT COUNT(*) FROM $CAND_DS")
assert_eq "$N" "1000" "candidates COUNT(*)"
N=$(query_scalar "SELECT COUNT(*) FROM $CAND_DS WHERE status = 'active'")
if [[ "$N" =~ ^[0-9]+$ ]] && (( N > 400 && N < 700 )); then
pass "active candidates in plausible range ($N, expect ~600)"
else
fail "active candidates count out of range: $N"
fi
N=$(query_scalar "
SELECT COUNT(DISTINCT c.candidate_id)
FROM $CAND_DS c
JOIN $PLACE_DS p ON c.candidate_id = p.candidate_id
WHERE p.placement_status = 'active'
")
if [[ "$N" =~ ^[0-9]+$ ]] && (( N > 0 && N <= 200 )); then
pass "cross-dataset JOIN with filter returns $N rows"
else
fail "JOIN returned unexpected count: $N"
fi
AVG=$(query_scalar "SELECT AVG(hourly_rate_usd) FROM $CAND_DS")
if python3 -c "import sys; v=float('$AVG'); sys.exit(0 if 100 < v < 130 else 1)" 2>/dev/null; then
pass "average hourly rate in plausible range ($AVG, expect ~110)"
else
fail "average hourly rate out of range: $AVG"
fi
CODE=$(http_code POST "/query/cache/pin" "{\"dataset\":\"$CAND_DS\"}")
assert_eq "$CODE" "200" "cache pin HTTP"
# ============================================================
# 5. Content-hash re-ingest dedup (Phase 6.4)
# ============================================================
step "5. Content-hash re-ingest dedup"
R=$(curl -s -X POST "$GATEWAY/ingest/file?name=$CAND_DS" -F "file=@$WORKDIR/candidates.csv")
DEDUP=$(echo "$R" | python3 -c 'import sys,json; print(json.load(sys.stdin).get("deduplicated","?"))' 2>/dev/null)
assert_eq "$DEDUP" "True" "re-upload same file is deduplicated"
# ============================================================
# 6. Idempotent register — same fingerprint (ADR-020)
# ============================================================
step "6. Idempotent register (ADR-020 same-fp path)"
DS=$(curl -s "$GATEWAY/catalog/datasets/by-name/$CAND_DS")
FP=$(echo "$DS" | python3 -c 'import sys,json; print(json.load(sys.stdin)["schema_fingerprint"])')
OBJS=$(echo "$DS" | python3 -c 'import sys,json; print(json.dumps(json.load(sys.stdin)["objects"]))')
ID_BEFORE=$(echo "$DS" | python3 -c 'import sys,json; print(json.load(sys.stdin)["id"])')
PAYLOAD=$(python3 -c "import json,sys; print(json.dumps({'name':sys.argv[1],'schema_fingerprint':sys.argv[2],'objects':json.loads(sys.argv[3])}))" "$CAND_DS" "$FP" "$OBJS")
CODE=$(http_code POST "/catalog/datasets" "$PAYLOAD")
assert_eq "$CODE" "201" "same-fp re-register returns 201"
ID_AFTER=$(curl -s "$GATEWAY/catalog/datasets/by-name/$CAND_DS" | python3 -c 'import sys,json; print(json.load(sys.stdin)["id"])')
assert_eq "$ID_AFTER" "$ID_BEFORE" "same DatasetId after re-register"
COUNT=$(curl -s "$GATEWAY/catalog/datasets" | python3 -c "import sys,json; print(sum(1 for d in json.load(sys.stdin) if d['name']=='$CAND_DS'))")
assert_eq "$COUNT" "1" "no duplicate manifest created"
# ============================================================
# 7. Schema-drift rejection (409)
# ============================================================
step "7. Schema-drift rejection (ADR-020 409 path)"
PAYLOAD=$(python3 -c "import json,sys; print(json.dumps({'name':sys.argv[1],'schema_fingerprint':'deadbeefnotmatching','objects':json.loads(sys.argv[2])}))" "$CAND_DS" "$OBJS")
CODE=$(http_code POST "/catalog/datasets" "$PAYLOAD")
assert_eq "$CODE" "409" "different-fp rejected with 409"
# ============================================================
# 8. Dedupe no-op on clean catalog
# ============================================================
step "8. Dedupe no-op on clean state"
R=$(curl -s -X POST "$GATEWAY/catalog/dedupe")
GROUPS=$(echo "$R" | python3 -c 'import sys,json; print(json.load(sys.stdin)["groups"])')
REMOVED=$(echo "$R" | python3 -c 'import sys,json; print(json.load(sys.stdin)["removed"])')
assert_eq "$GROUPS" "0" "dedupe groups (clean catalog)"
assert_eq "$REMOVED" "0" "dedupe removed count"
# ============================================================
# 9. Metadata enrichment (Phase 10)
# ============================================================
step "9. Metadata enrichment (Phase 10)"
CODE=$(http_code POST "/catalog/datasets/by-name/$CAND_DS/metadata" \
"{\"owner\":\"e2e-test\",\"description\":\"$RUN_ID synthetic candidates\",\"tags\":[\"test\",\"synthetic\"]}")
assert_eq "$CODE" "200" "POST metadata HTTP"
META=$(curl -s "$GATEWAY/catalog/datasets/by-name/$CAND_DS")
OWNER=$(echo "$META" | python3 -c 'import sys,json; print(json.load(sys.stdin).get("owner",""))')
assert_eq "$OWNER" "e2e-test" "owner persisted"
# ============================================================
# 10. PII auto-detection (Phase 10)
# ============================================================
step "10. PII auto-detection (Phase 10)"
PII_COLS=$(echo "$META" | python3 -c '
import sys, json
m = json.load(sys.stdin)
pii = [c["name"] for c in m.get("columns",[]) if c.get("is_pii") or (isinstance(c.get("sensitivity"),str) and c["sensitivity"].lower()=="pii")]
print(" ".join(pii) if pii else "__NONE__")')
if [[ "$PII_COLS" == *"email"* ]] && [[ "$PII_COLS" == *"phone"* ]]; then
pass "email and phone flagged as PII ($PII_COLS)"
elif [[ "$PII_COLS" == "__NONE__" ]]; then
warn "no PII flagged — auto-detection may not run on this path"
else
warn "partial PII detection: $PII_COLS"
fi
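Step 10 checks which columns come back flagged, but the detection itself happens inside the gateway. One plausible column-name heuristic, as a hedged sketch (the real detector may also inspect values, and these field names are illustrative):

```python
import re

# Hypothetical name-based heuristic; the gateway's actual PII detector is
# not part of this script and may differ.
PII_NAME = re.compile(r"(email|phone|ssn|dob|address)", re.IGNORECASE)

def flag_pii_columns(columns):
    return [c for c in columns if PII_NAME.search(c)]

print(flag_pii_columns(["doc_id", "email", "phone", "resume_text"]))  # ['email', 'phone']
```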
# ============================================================
# 11. Vector index + semantic search (Phase 7)
# ============================================================
step "11. Vector index + semantic search (Phase 7)"
if [[ "$SKIP_VECTOR" == "1" ]]; then
warn "SKIP_VECTOR=1 — skipping vector pipeline"
else
# Pull documents out of the ingested resumes dataset via SQL,
# then feed to the inline /vectors/index body. This exercises
# the query→embed integration rather than pre-canned input.
DOCS=$(curl -s -X POST "$GATEWAY/query/sql" \
-H 'Content-Type: application/json' \
-d "$(python3 -c "import json; print(json.dumps({'sql': 'SELECT doc_id, resume_text FROM $RESUME_DS'}))")" \
| python3 -c '
import sys, json
r = json.load(sys.stdin)
docs = [{"id": row["doc_id"], "text": row["resume_text"]} for row in r.get("rows", [])]
print(json.dumps(docs))')
DOC_COUNT=$(echo "$DOCS" | python3 -c 'import sys,json; print(len(json.load(sys.stdin)))')
assert_eq "$DOC_COUNT" "10" "pulled docs via SQL for embedding"
PAYLOAD=$(python3 -c "
import json, sys
print(json.dumps({
'index_name': sys.argv[1],
'source': sys.argv[2],
'documents': json.loads(sys.argv[3]),
'chunk_size': 500,
'overlap': 50,
}))" "$VEC_IDX" "$RESUME_DS" "$DOCS")
R=$(curl -s -X POST "$GATEWAY/vectors/index" -H 'Content-Type: application/json' -d "$PAYLOAD")
JOB_ID=$(echo "$R" | python3 -c 'import sys,json; d=json.load(sys.stdin); print(d.get("job_id","__NONE__"))' 2>/dev/null)
if [[ "$JOB_ID" == "__NONE__" || -z "$JOB_ID" ]]; then
fail "vector index job rejected: $(echo "$R" | head -c 200)"
else
pass "embedding job accepted (job=$JOB_ID)"
# Poll up to 90s for 10 short resumes; Ollama cold-start can be slow.
JOB_STATUS="unknown"
for _ in $(seq 1 45); do
JOB_STATUS=$(curl -s "$GATEWAY/vectors/jobs/$JOB_ID" 2>/dev/null \
| python3 -c '
import sys, json
try: print(json.load(sys.stdin).get("status","?"))
except Exception: print("?")' 2>/dev/null)
[[ "$JOB_STATUS" == "completed" || "$JOB_STATUS" == "Completed" ]] && break
[[ "$JOB_STATUS" == "failed" || "$JOB_STATUS" == "Failed" ]] && break
sleep 2
done
case "$JOB_STATUS" in
completed|Completed)
pass "embedding job completed"
R=$(curl -s -X POST "$GATEWAY/vectors/search" \
-H 'Content-Type: application/json' \
-d "{\"index_name\":\"$VEC_IDX\",\"query\":\"fine-tuning large language models\",\"k\":3}")
TOP_DOC=$(echo "$R" | python3 -c '
import sys, json
r = json.load(sys.stdin)
if r.get("results"): print(r["results"][0].get("doc_id","?"))
else: print("__NONE__")' 2>/dev/null)
if [[ "$TOP_DOC" == "RES-006" ]]; then
pass "top match is ML/NLP resume (semantically correct)"
elif [[ "$TOP_DOC" == "__NONE__" ]]; then
fail "search returned no results"
else
warn "top match is $TOP_DOC (expected RES-006 — ranking may vary)"
fi ;;
*)
fail "embedding job did not complete (status=$JOB_STATUS)" ;;
esac
fi
fi
# ============================================================
# 12. Cleanup + baseline verify
# ============================================================
step "12. Cleanup + baseline verify"
cleanup
trap - EXIT
ON_DISK=$(ls "$DATA_ROOT/_catalog/manifests"/*.json 2>/dev/null | wc -l | tr -d ' ')
info "manifest files on disk now: $ON_DISK"
DISK_ORPHANS=0
if compgen -G "$DATA_ROOT/_catalog/manifests/*.json" > /dev/null; then
DISK_ORPHANS=$(grep -l "\"$RUN_ID" "$DATA_ROOT/_catalog/manifests"/*.json 2>/dev/null | wc -l | tr -d ' ')
fi
assert_eq "$DISK_ORPHANS" "0" "no orphan manifest files on disk for $RUN_ID"
LIVE_ORPHANS=$(curl -s "$GATEWAY/catalog/datasets" \
| python3 -c "import sys,json; print(sum(1 for d in json.load(sys.stdin) if d['name'].startswith('$RUN_ID')))")
if [[ "$LIVE_ORPHANS" != "0" ]]; then
warn "$LIVE_ORPHANS entries linger in live registry (clears on gateway restart; on-disk is ground truth)"
fi
# ============================================================
# Summary
# ============================================================
ELAPSED=$(( $(date +%s) - STARTED_AT ))
printf '\n%s─── Summary ───%s\n' "$CC_BLU" "$CC_RST"
printf ' run_id: %s\n' "$RUN_ID"
printf ' elapsed: %ss\n' "$ELAPSED"
printf ' passed: %s%d%s\n' "$CC_GRN" "$PASS" "$CC_RST"
printf ' failed: %s%d%s\n' "$CC_RED" "$FAIL" "$CC_RST"
printf ' warnings: %s%d%s\n' "$CC_YLW" "$WARN" "$CC_RST"
if (( FAIL > 0 )); then
printf '\n%sfailures:%s\n' "$CC_RED" "$CC_RST"
for f in "${FAILURES[@]}"; do printf ' - %s\n' "$f"; done
exit 1
fi
exit 0

scripts/production_smoke.sh Executable file

@@ -0,0 +1,157 @@
#!/usr/bin/env bash
# Production substrate smoke — single command that verifies every
# production-critical surface end-to-end. Exits non-zero on the first
# failure so an operator can run this before:
# - Swapping workers_500k.parquet → real Chicago contractor data
# - Spinning up the Asterisk voice agent against /v1/chat
# - Running staffing inference loops via /v1/iterate
# - Wiring the assistant against the gateway
#
# Usage:
# ./scripts/production_smoke.sh
#
# Tunable via env:
# GATEWAY=http://localhost:3100 # gateway base URL
# FAIL_FAST=1 # exit on first failure (default 1)
# VERBOSE=1 # print full responses on success too
set -e
GATEWAY="${GATEWAY:-http://localhost:3100}"
FAIL_FAST="${FAIL_FAST:-1}"
VERBOSE="${VERBOSE:-0}"
PASS=0
FAIL=0
FAILURES=()
check() {
local name="$1"
local expected_status="$2"
local cmd="$3"
echo -n " [$(($PASS + $FAIL + 1))] $name ... "
local resp
resp=$(eval "$cmd" 2>&1) || true
local status="${resp%%|||*}"
local body="${resp#*|||}"
if [ "$status" = "$expected_status" ]; then
PASS=$((PASS + 1))
echo "✓ ($status)"
if [ "$VERBOSE" = "1" ]; then echo " $body" | head -3 | sed 's/^/ /'; fi
else
FAIL=$((FAIL + 1))
FAILURES+=("$name: expected $expected_status, got $status")
echo "✗ (got $status, expected $expected_status)"
echo " $body" | head -3 | sed 's/^/ /'
    # plain "if" rather than "&&" so a false test does not trip set -e
    if [ "$FAIL_FAST" = "1" ]; then print_summary; exit 1; fi
fi
}
curl_with_status() {
# Run curl, capture HTTP status + body, format as "status|||body"
local args=("$@")
curl -sS -w "\n%{http_code}" "${args[@]}" 2>&1 | awk '
{ lines[NR]=$0 }
END {
status=lines[NR]
body=""
for (i=1; i<NR; i++) body=body lines[i] (i<NR-1?"\n":"")
print status "|||" body
}
'
}
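check() later splits the `status|||body` string with `${resp%%|||*}` and `${resp#*|||}`, i.e. everything up to and after the first `|||`. The same convention expressed as a single partition, with illustrative values:

```python
# Wire format is "<http status>|||<body>"; the body may span multiple
# lines, and only the first "|||" is significant.
resp = '404|||{"detail":"not found"}\nsecond body line'
status, _, body = resp.partition("|||")
print(status)  # 404
print(body)
```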
print_summary() {
echo ""
echo "═══════════════════════════════════════════════════════════════"
echo " $PASS passed · $FAIL failed"
if [ ${#FAILURES[@]} -gt 0 ]; then
echo " failures:"
for f in "${FAILURES[@]}"; do echo " - $f"; done
fi
echo "═══════════════════════════════════════════════════════════════"
}
echo "Production substrate smoke test against $GATEWAY"
echo ""
# ─── 1. Liveness ─────────────────────────────────────────────────────
echo "▶ Liveness"
check "gateway /health" "200" \
'curl_with_status -m 5 "$GATEWAY/health"'
# ─── 2. Operational health ──────────────────────────────────────────
echo "▶ Operational state"
HEALTH_RESP=$(curl -sS -m 10 "$GATEWAY/v1/health" 2>&1) || HEALTH_RESP="{}"
WORKERS_COUNT=$(echo "$HEALTH_RESP" | python3 -c "import sys,json; print(json.load(sys.stdin).get('workers_count',0))" 2>/dev/null || echo 0)
PROVIDERS_OK=$(echo "$HEALTH_RESP" | python3 -c "import sys,json; d=json.load(sys.stdin).get('providers_configured',{}); print(sum(1 for v in d.values() if v))" 2>/dev/null || echo 0)
echo " workers_count: $WORKERS_COUNT"
echo " providers_configured (count): $PROVIDERS_OK"
if [ "$WORKERS_COUNT" -lt 1 ]; then
FAIL=$((FAIL + 1))
FAILURES+=("workers_count=0 — parquet load failed or empty")
echo " ✗ workers not loaded"
  # plain "if" rather than "&&" so a false test does not trip set -e
  if [ "$FAIL_FAST" = "1" ]; then print_summary; exit 1; fi
else
PASS=$((PASS + 1))
echo " ✓ workers loaded"
fi
# ─── 3. Truth Layer ──────────────────────────────────────────────────
echo "▶ Truth Layer"
check "/v1/context returns rules" "200" \
'curl_with_status -m 10 "$GATEWAY/v1/context"'
# ─── 4. /v1/chat (provider=ollama) ──────────────────────────────────
echo "▶ /v1/chat (provider=ollama, fast model)"
check "/v1/chat ping" "200" \
'curl_with_status -m 60 -X POST "$GATEWAY/v1/chat" \
-H "content-type: application/json" \
-d "{\"provider\":\"ollama\",\"model\":\"qwen3.5:latest\",\"messages\":[{\"role\":\"user\",\"content\":\"reply: PONG\"}],\"max_tokens\":30,\"temperature\":0,\"think\":false}"'
# ─── 5. /v1/validate (negative + positive) ──────────────────────────
echo "▶ /v1/validate"
check "phantom candidate_id → 422 Consistency" "422" \
'curl_with_status -m 10 -X POST "$GATEWAY/v1/validate" \
-H "content-type: application/json" \
-d "{\"kind\":\"fill\",\"artifact\":{\"fills\":[{\"candidate_id\":\"W-FAKE-0\",\"name\":\"Fake\"}]},\"context\":{\"target_count\":1}}"'
check "real worker (W-1) → 200 OK" "200" \
'curl_with_status -m 10 -X POST "$GATEWAY/v1/validate" \
-H "content-type: application/json" \
-d "{\"kind\":\"fill\",\"artifact\":{\"fills\":[{\"candidate_id\":\"W-1\",\"name\":\"Anyone\"}]},\"context\":{\"target_count\":1}}"'
check "SSN in body → 422 Policy" "422" \
'curl_with_status -m 10 -X POST "$GATEWAY/v1/validate" \
-H "content-type: application/json" \
-d "{\"kind\":\"email\",\"artifact\":{\"to\":\"a@b.com\",\"body\":\"Your SSN 123-45-6789 is on file.\"}}"'
# ─── 6. /v1/iterate (bounded retry loop) ───────────────────────────
# Phantom worker → expect 422 IterateFailure with history (not 200)
echo "▶ /v1/iterate (bounded retry)"
check "/v1/iterate phantom → bounded fail" "422" \
'curl_with_status -m 240 -X POST "$GATEWAY/v1/iterate" \
-H "content-type: application/json" \
-d "{\"kind\":\"fill\",\"provider\":\"ollama\",\"model\":\"qwen3.5:latest\",\"system\":\"Reply with ONLY: {\\\"fills\\\":[{\\\"candidate_id\\\":\\\"W-99999999\\\",\\\"name\\\":\\\"X\\\"}]}\",\"prompt\":\"emit it\",\"context\":{\"target_count\":1},\"max_iterations\":1,\"max_tokens\":200,\"temperature\":0}"'
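The /v1/iterate call above nests JSON inside a JSON string inside shell quotes, which is easy to get wrong by hand. An equivalent payload can be built with `json.dumps` handling both layers (field names copied from the curl call; the endpoint itself is not exercised here):

```python
import json

# Inner artifact the prompt asks the model to emit verbatim.
inner = {"fills": [{"candidate_id": "W-99999999", "name": "X"}]}
payload = {
    "kind": "fill",
    "provider": "ollama",
    "model": "qwen3.5:latest",
    "system": "Reply with ONLY: " + json.dumps(inner),
    "prompt": "emit it",
    "context": {"target_count": 1},
    "max_iterations": 1,
    "max_tokens": 200,
    "temperature": 0,
}
body = json.dumps(payload)
# Round-trip check: both JSON layers survive serialization.
decoded = json.loads(body)
assert json.loads(decoded["system"].split("Reply with ONLY: ", 1)[1]) == inner
print(decoded["kind"])  # fill
```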
# ─── 7. Doc-drift batch ─────────────────────────────────────────────
echo "▶ Doc-drift scan"
check "/vectors/playbook_memory/doc_drift/scan" "200" \
'curl_with_status -m 60 -X POST "$GATEWAY/vectors/playbook_memory/doc_drift/scan"'
# ─── 8. Usage tracking ──────────────────────────────────────────────
echo "▶ Usage tracking"
USAGE=$(curl -sS -m 10 "$GATEWAY/v1/usage" 2>&1)
USAGE_REQS=$(echo "$USAGE" | python3 -c "import sys,json; print(json.load(sys.stdin).get('requests',0))" 2>/dev/null || echo 0)
echo " usage.requests: $USAGE_REQS (should be > 0 if /v1/chat fired)"
if [ "$USAGE_REQS" -ge 1 ]; then
PASS=$((PASS + 1))
echo " ✓ /v1/usage tracking"
else
FAIL=$((FAIL + 1))
FAILURES+=("/v1/usage didn't increment after /v1/chat call")
echo " ✗ /v1/usage didn't increment"
fi
print_summary
[ $FAIL -eq 0 ] && exit 0 || exit 1

sidecar/sidecar/lab_ui.py Normal file

@@ -0,0 +1,385 @@
"""Pipeline Lab notebook UI — served as a single HTML page.
Note: innerHTML usage in this file is intentional for building the UI.
All user-supplied text is escaped through the esc() function before insertion.
The only values rendered via innerHTML are pre-formatted HTML strings with
escaped user content; no raw user input is ever injected unescaped.
"""
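For reference, the esc() helper described above (assign to `textContent`, read back `innerHTML`) encodes exactly `&`, `<`, and `>`. Python's `html.escape` with `quote=False` performs the same three substitutions; a sketch for comparison, not used by this module:

```python
from html import escape

def esc(text: str) -> str:
    # Mirrors the JS esc(): & < > are encoded, quotes are left alone,
    # matching a textContent -> innerHTML round-trip.
    return escape(str(text), quote=False)

print(esc('<script>alert("x") & more</script>'))
```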
from fastapi import APIRouter
from fastapi.responses import HTMLResponse
router = APIRouter()
def _get_lab_html() -> str:
"""Return the Pipeline Lab HTML. Separated into a function for clarity."""
# The HTML is a self-contained notebook UI.
# All user-facing text is escaped via the esc() JS function.
return r"""<!DOCTYPE html>
<html lang="en"><head>
<meta charset="UTF-8"><meta name="viewport" content="width=device-width,initial-scale=1.0">
<title>Pipeline Lab // Lakehouse</title>
<style>
:root{--bg:#08090c;--surface:rgba(14,16,22,0.9);--border:#2a2d35;--text:#e8e6e3;--text2:#7a7872;--accent:#4ade80;--gold:#e2b55a;--red:#e05252;--blue:#5b9cf5;--purple:#c084fc}
*{box-sizing:border-box;margin:0;padding:0}
body{font-family:'SF Mono','Menlo','Consolas',monospace;background:var(--bg);color:var(--text);min-height:100vh;padding:20px 28px;font-size:13px}
h1{font-size:18px;font-weight:700;margin-bottom:4px}h1 span{color:var(--accent)}
.subtitle{color:var(--text2);font-size:11px;margin-bottom:20px}
.cells{display:flex;flex-direction:column;gap:12px;max-width:1100px}
.cell{background:var(--surface);border:1px solid var(--border);border-radius:4px;overflow:hidden}
.cell.running{border-color:var(--gold)}
.cell-header{display:flex;align-items:center;gap:8px;padding:8px 12px;border-bottom:1px solid var(--border);font-size:10px;text-transform:uppercase;letter-spacing:1px;color:var(--text2)}
.cell-type{font-weight:700}
.cell-time{margin-left:auto;color:var(--text2)}
.cell-input{padding:12px;background:rgba(0,0,0,0.3)}
.cell-input textarea{width:100%;min-height:60px;background:transparent;border:none;color:var(--text);font-family:inherit;font-size:13px;resize:vertical;outline:none;line-height:1.6}
.cell-output{padding:12px;font-size:12px;line-height:1.6;white-space:pre-wrap;max-height:400px;overflow-y:auto;display:none}
.cell-output.has-data{display:block;border-top:1px solid var(--border)}
.toolbar{display:flex;gap:6px;padding:8px 12px;border-top:1px solid var(--border);flex-wrap:wrap}
.btn{font-family:inherit;font-size:10px;text-transform:uppercase;letter-spacing:0.5px;padding:5px 12px;border:1px solid var(--border);border-radius:3px;background:transparent;color:var(--text2);cursor:pointer}
.btn:hover{border-color:var(--accent);color:var(--accent)}
.btn.primary{border-color:var(--accent);color:var(--accent);background:rgba(74,222,128,0.06)}
.btn.gold{border-color:var(--gold);color:var(--gold)}
.btn.blue{border-color:var(--blue);color:var(--blue)}
.btn.purple{border-color:var(--purple);color:var(--purple)}
.btn.red{border-color:var(--red);color:var(--red)}
.top-bar{display:flex;gap:8px;margin-bottom:16px;align-items:center;flex-wrap:wrap}
.status-bar{display:flex;gap:12px;padding:8px 12px;background:var(--surface);border:1px solid var(--border);border-radius:4px;margin-bottom:16px;font-size:10px;color:var(--text2)}
.stat{display:flex;align-items:center;gap:4px}.stat b{color:var(--text)}
.result-row{display:flex;gap:8px;padding:6px 8px;border-bottom:1px solid rgba(42,45,53,0.3);align-items:center;font-size:11px}
.result-row:last-child{border-bottom:none}
.score-bar{width:60px;height:5px;background:rgba(0,0,0,0.2);border-radius:3px;overflow:hidden}
.score-fill{height:100%;border-radius:3px}
.benchmark-grid{display:grid;grid-template-columns:1fr 1fr;gap:12px;margin-top:8px}
.bench-col{background:rgba(0,0,0,0.2);border-radius:3px;padding:10px}
.bench-label{font-size:10px;text-transform:uppercase;letter-spacing:1px;margin-bottom:6px;font-weight:700}
.threshold-slider{display:flex;align-items:center;gap:8px;padding:0 12px;margin:4px 0}
.threshold-slider input[type=range]{flex:1;accent-color:var(--accent)}
.threshold-slider .val{font-weight:700;min-width:36px;text-align:right}
</style></head><body>
<h1><span>Pipeline Lab</span> // Lakehouse</h1>
<div class="subtitle">Embedding-based screening vs LLM classification &#x2014; iterative experimentation</div>
<div class="status-bar" id="status-bar">
<div class="stat"><span>Exemplars:</span> <b id="st-exemplars">0</b></div>
<div class="stat"><span>Categories:</span> <b id="st-categories">0</b></div>
<div class="stat"><span>Pipelines:</span> <b id="st-pipelines">0</b></div>
<div class="stat" style="margin-left:auto"><span>Sidecar:</span> <b id="st-health" style="color:var(--text2)">...</b></div>
</div>
<div class="top-bar">
<button class="btn primary" onclick="addCell('exemplars')">+ Exemplars</button>
<button class="btn gold" onclick="addCell('screen')">+ Screen</button>
<button class="btn blue" onclick="addCell('classify')">+ Classify</button>
<button class="btn purple" onclick="addCell('benchmark')">+ Benchmark</button>
<button class="btn" onclick="addCell('similarity')">+ Similarity</button>
<button class="btn" onclick="addCell('generate')">+ Generate</button>
<button class="btn" onclick="addCell('pipeline')">+ Pipeline</button>
<span style="flex:1"></span>
<button class="btn red" onclick="clearCells()">Clear All</button>
</div>
<div class="cells" id="cells"></div>
<script>
var BASE = '';
var cellCounter = 0;
function esc(t){var d=document.createElement('span');d.textContent=String(t);return d.innerHTML}
async function api(path, body) {
var opts = body ? {method:'POST', headers:{'Content-Type':'application/json'}, body:JSON.stringify(body)} : {};
var r = await fetch(BASE + '/lab' + path, opts);
return r.json();
}
async function refreshStatus() {
try {
var ex = await api('/exemplars');
var pl = await api('/pipelines');
var h = await fetch(BASE + '/health').then(function(r){return r.json()});
document.getElementById('st-exemplars').textContent = ex.total || 0;
document.getElementById('st-categories').textContent = Object.keys(ex.categories || {}).length;
document.getElementById('st-pipelines').textContent = (pl.pipelines || []).length;
document.getElementById('st-health').textContent = h.status || 'ok';
document.getElementById('st-health').style.color = 'var(--accent)';
} catch(e) {
document.getElementById('st-health').textContent = 'error';
document.getElementById('st-health').style.color = 'var(--red)';
}
}
function addCell(type) {
var id = 'cell-' + (++cellCounter);
var cells = document.getElementById('cells');
var cell = document.createElement('div'); cell.className = 'cell'; cell.id = id;
var colors = {exemplars:'var(--accent)',screen:'var(--gold)',classify:'var(--blue)',benchmark:'var(--purple)',similarity:'var(--text2)',generate:'var(--text2)',pipeline:'var(--accent)'};
var labels = {exemplars:'EXEMPLARS',screen:'SCREEN',classify:'CLASSIFY (LLM)',benchmark:'BENCHMARK A/B',similarity:'SIMILARITY',generate:'GENERATE',pipeline:'PIPELINE'};
var placeholders = {
exemplars:'Category: decision\n---\nWe decided to use Parquet for all storage\nThe team chose React over Vue\nArchitecture decision: microservices',
screen:'Enter texts to classify via embedding similarity (one per line):\n\nWe decided to migrate to PostgreSQL\nThe weather is nice today\nArchitecture: chose event sourcing over CRUD',
classify:'Enter texts to classify via LLM (one per line):\n\nWe decided to migrate to PostgreSQL\nThe weather is nice today',
benchmark:'Enter texts to benchmark (one per line):\n\nWe decided to use Kubernetes for orchestration\nThe new hire starts Monday\nTechnical debt: refactor the auth module\nLunch menu looks good today',
similarity:'Enter texts to compare pairwise (one per line):\n\nWe chose React for the frontend\nReact was selected as our UI framework\nThe database uses PostgreSQL',
generate:'Enter a prompt for the LLM...',
pipeline:'Pipeline name: my-extraction\n---\nscreen | threshold=0.6\nclassify\nextract | prompt=Extract the key decision and its rationale\nvalidate | dedup_threshold=0.9'
};
var color = colors[type] || 'var(--text2)';
var label = labels[type] || type.toUpperCase();
var ph = placeholders[type] || '';
// Build cell using DOM methods
var header = document.createElement('div'); header.className = 'cell-header';
var typeSpan = document.createElement('span'); typeSpan.className = 'cell-type'; typeSpan.style.color = color; typeSpan.textContent = label; header.appendChild(typeSpan);
var numSpan = document.createElement('span'); numSpan.textContent = 'Cell #' + cellCounter; header.appendChild(numSpan);
var timeSpan = document.createElement('span'); timeSpan.className = 'cell-time'; timeSpan.id = id + '-time'; header.appendChild(timeSpan);
cell.appendChild(header);
var inputDiv = document.createElement('div'); inputDiv.className = 'cell-input';
var textarea = document.createElement('textarea'); textarea.id = id + '-input'; textarea.placeholder = ph; textarea.value = ph;
inputDiv.appendChild(textarea); cell.appendChild(inputDiv);
if (type === 'screen' || type === 'benchmark') {
var slider = document.createElement('div'); slider.className = 'threshold-slider';
var slLabel = document.createElement('span'); slLabel.style.cssText = 'font-size:10px;color:var(--text2)'; slLabel.textContent = 'Threshold:'; slider.appendChild(slLabel);
var range = document.createElement('input'); range.type = 'range'; range.min = '0.3'; range.max = '0.95'; range.step = '0.05'; range.value = '0.65'; range.id = id + '-threshold';
var valSpan = document.createElement('span'); valSpan.className = 'val'; valSpan.textContent = '0.65';
range.oninput = function() { valSpan.textContent = this.value; };
slider.appendChild(range); slider.appendChild(valSpan); cell.appendChild(slider);
}
var outputDiv = document.createElement('div'); outputDiv.className = 'cell-output'; outputDiv.id = id + '-output';
cell.appendChild(outputDiv);
var tb = document.createElement('div'); tb.className = 'toolbar';
var runBtn = document.createElement('button'); runBtn.className = 'btn primary'; runBtn.textContent = 'Run';
runBtn.onclick = function() { runCell(id, type); }; tb.appendChild(runBtn);
var rmBtn = document.createElement('button'); rmBtn.className = 'btn red'; rmBtn.textContent = 'Remove';
rmBtn.onclick = function() { removeCell(id); }; tb.appendChild(rmBtn);
cell.appendChild(tb);
cells.appendChild(cell);
textarea.focus();
return id;
}
function removeCell(id) { var el = document.getElementById(id); if (el) el.remove(); }
function clearCells() { document.getElementById('cells').textContent = ''; cellCounter = 0; }
function parseLines(text) { return text.split('\n').map(function(l){return l.trim()}).filter(function(l){return l && l.charAt(0) !== '#'}); }
async function runCell(id, type) {
var cell = document.getElementById(id);
var input = document.getElementById(id+'-input').value;
var output = document.getElementById(id+'-output');
var timeEl = document.getElementById(id+'-time');
cell.classList.add('running');
output.className = 'cell-output has-data';
output.textContent = 'Running...';
try {
var t0 = performance.now();
var result;
if (type === 'exemplars') {
var parts = input.split('---');
var catLine = (parts[0] || '').trim();
var category = catLine.replace(/^category:\s*/i, '').trim().toLowerCase();
var texts = parseLines(parts.slice(1).join('\n'));
if (!category || !texts.length) { output.textContent = 'Format: Category: name\\n---\\ntext1\\ntext2'; return; }
result = await api('/exemplars', {category: category, texts: texts});
output.textContent = 'Added ' + result.added + ' exemplars to "' + result.category + '" (total: ' + result.total + ')';
output.style.color = 'var(--accent)';
refreshStatus();
}
else if (type === 'screen') {
var texts = parseLines(input);
var threshold = parseFloat((document.getElementById(id+'-threshold') || {}).value || '0.65');
result = await api('/screen', {texts: texts, threshold: threshold});
renderScreenResults(output, result, threshold);
}
else if (type === 'classify') {
var texts = parseLines(input);
result = await api('/classify', {texts: texts});
renderClassifyResults(output, result);
}
else if (type === 'benchmark') {
var texts = parseLines(input);
var threshold = parseFloat((document.getElementById(id+'-threshold') || {}).value || '0.65');
result = await api('/benchmark', {texts: texts, threshold: threshold});
renderBenchmark(output, result);
}
else if (type === 'similarity') {
var texts = parseLines(input);
result = await api('/cell', {action:'similarity', texts: texts});
renderSimilarityMatrix(output, result);
}
else if (type === 'generate') {
result = await api('/cell', {action:'generate', text: input});
output.textContent = result.text || '(empty)';
}
else if (type === 'pipeline') {
var parts = input.split('---');
var nameLine = (parts[0] || '').trim();
var pName = nameLine.replace(/^pipeline\s*name:\s*/i, '').trim();
var stageLines = parseLines(parts.slice(1).join('\n'));
var stages = stageLines.map(function(line) {
var ps = line.split('|').map(function(s){return s.trim()});
var mode = ps[0];
var config = {};
ps.slice(1).forEach(function(p) {
var kv = p.split('='); if (kv.length===2) {
var v = kv[1].trim();
config[kv[0].trim()] = isNaN(parseFloat(v)) ? v : parseFloat(v);
}
});
return {name: mode, mode: mode, config: config};
});
await api('/pipelines', {name: pName, stages: stages, description: 'Created in Pipeline Lab'});
output.textContent = 'Pipeline "' + pName + '" saved (' + stages.length + ' stages). Use the API to run it: POST /lab/pipelines/run';
output.style.color = 'var(--accent)';
refreshStatus();
}
var elapsed = Math.round(performance.now() - t0);
timeEl.textContent = elapsed + 'ms' + (result && result.time_ms ? ' (server: '+result.time_ms+'ms)' : '');
} catch(e) {
output.textContent = 'Error: ' + e.message;
output.style.color = 'var(--red)';
} finally {
cell.classList.remove('running');
}
}
function renderScreenResults(el, results, threshold) {
el.textContent = '';
results.forEach(function(r) {
var row = document.createElement('div'); row.className = 'result-row';
var cat = document.createElement('span');
cat.style.cssText = 'min-width:80px;font-weight:700;color:' + (r.above_threshold ? 'var(--accent)' : 'var(--text2)');
cat.textContent = r.best_category || 'none'; row.appendChild(cat);
var sim = document.createElement('span'); sim.style.cssText = 'min-width:50px;font-weight:700';
sim.textContent = (r.similarity * 100).toFixed(1) + '%';
sim.style.color = r.similarity >= 0.7 ? 'var(--accent)' : r.similarity >= threshold ? 'var(--gold)' : 'var(--text2)';
row.appendChild(sim);
var bar = document.createElement('div'); bar.className = 'score-bar';
var fill = document.createElement('div'); fill.className = 'score-fill';
fill.style.width = (r.similarity * 100) + '%';
fill.style.background = r.similarity >= 0.7 ? 'var(--accent)' : r.similarity >= threshold ? 'var(--gold)' : 'var(--red)';
bar.appendChild(fill); row.appendChild(bar);
var txt = document.createElement('span'); txt.style.cssText = 'flex:1;overflow:hidden;text-overflow:ellipsis;white-space:nowrap';
txt.textContent = r.text; row.appendChild(txt);
var badge = document.createElement('span');
badge.style.cssText = 'font-size:9px;padding:2px 6px;border-radius:2px;border:1px solid;' +
(r.above_threshold ? 'color:var(--accent);border-color:var(--accent)' : 'color:var(--text2);border-color:var(--border)');
badge.textContent = r.above_threshold ? 'PASS' : 'FILTERED'; row.appendChild(badge);
el.appendChild(row);
});
}
function renderClassifyResults(el, results) {
el.textContent = '';
results.forEach(function(r) {
var row = document.createElement('div'); row.className = 'result-row';
var cat = document.createElement('span'); cat.style.cssText = 'min-width:80px;font-weight:700;color:var(--blue)';
cat.textContent = r.category; row.appendChild(cat);
var conf = document.createElement('span');
conf.style.cssText = 'min-width:50px;font-size:10px;color:' + (r.confidence==='high'?'var(--accent)':r.confidence==='medium'?'var(--gold)':'var(--text2)');
conf.textContent = r.confidence; row.appendChild(conf);
var txt = document.createElement('span'); txt.style.flex = '1'; txt.textContent = r.text; row.appendChild(txt);
el.appendChild(row);
});
}
function renderBenchmark(el, result) {
el.textContent = '';
// Summary stats (using safe DOM construction)
var summary = document.createElement('div'); summary.style.cssText = 'display:flex;gap:16px;margin-bottom:12px;flex-wrap:wrap';
var stats = [
['Agreement', (result.agreement_rate*100).toFixed(1)+'%', result.agreement_rate>=0.8?'var(--accent)':'var(--gold)'],
['Speedup', result.speedup+'x', result.speedup>=2?'var(--accent)':'var(--text)'],
['Embed', result.embed_time_ms+'ms', 'var(--gold)'],
['LLM', result.llm_time_ms+'ms', 'var(--blue)'],
['Hybrid est.', result.hybrid_estimated_ms+'ms', 'var(--accent)'],
['Screened out', result.texts_screened_out+'/'+result.total_texts, 'var(--purple)']
];
stats.forEach(function(s) {
var box = document.createElement('div'); box.style.cssText = 'background:rgba(0,0,0,0.2);padding:6px 10px;border-radius:3px;text-align:center';
var lbl = document.createElement('div'); lbl.style.cssText = 'font-size:9px;color:var(--text2);text-transform:uppercase;letter-spacing:0.5px'; lbl.textContent = s[0]; box.appendChild(lbl);
var val = document.createElement('div'); val.style.cssText = 'font-size:16px;font-weight:700;color:'+s[2]; val.textContent = s[1]; box.appendChild(val);
summary.appendChild(box);
});
el.appendChild(summary);
// Side-by-side comparison
var grid = document.createElement('div'); grid.style.cssText = 'display:grid;grid-template-columns:1fr 1fr;gap:12px;margin-top:8px';
// Embed column
var leftCol = document.createElement('div'); leftCol.style.cssText = 'background:rgba(0,0,0,0.2);border-radius:3px;padding:10px';
var leftTitle = document.createElement('div'); leftTitle.style.cssText = 'font-size:10px;text-transform:uppercase;letter-spacing:1px;margin-bottom:6px;font-weight:700;color:var(--gold)';
leftTitle.textContent = 'EMBEDDING SCREENING (' + result.embed_time_ms + 'ms)'; leftCol.appendChild(leftTitle);
(result.embed_results||[]).forEach(function(r) {
var row = document.createElement('div'); row.style.cssText = 'font-size:11px;padding:3px 0;display:flex;gap:6px;align-items:center';
var c = document.createElement('span'); c.style.cssText = 'min-width:60px;font-weight:700;color:'+(r.above_threshold?'var(--accent)':'var(--text2)'); c.textContent = r.best_category||'none'; row.appendChild(c);
var s = document.createElement('span'); s.style.cssText = 'min-width:40px;color:var(--text2)'; s.textContent = (r.similarity*100).toFixed(0)+'%'; row.appendChild(s);
var t = document.createElement('span'); t.style.cssText = 'flex:1;overflow:hidden;text-overflow:ellipsis;white-space:nowrap'; t.textContent = r.text; row.appendChild(t);
leftCol.appendChild(row);
});
grid.appendChild(leftCol);
// LLM column
var rightCol = document.createElement('div'); rightCol.style.cssText = 'background:rgba(0,0,0,0.2);border-radius:3px;padding:10px';
var rightTitle = document.createElement('div'); rightTitle.style.cssText = 'font-size:10px;text-transform:uppercase;letter-spacing:1px;margin-bottom:6px;font-weight:700;color:var(--blue)';
rightTitle.textContent = 'LLM CLASSIFICATION (' + result.llm_time_ms + 'ms)'; rightCol.appendChild(rightTitle);
(result.llm_results||[]).forEach(function(r) {
var row = document.createElement('div'); row.style.cssText = 'font-size:11px;padding:3px 0;display:flex;gap:6px;align-items:center';
var c = document.createElement('span'); c.style.cssText = 'min-width:60px;font-weight:700;color:var(--blue)'; c.textContent = r.category; row.appendChild(c);
var s = document.createElement('span'); s.style.cssText = 'min-width:40px;color:'+(r.confidence==='high'?'var(--accent)':'var(--text2)'); s.textContent = r.confidence; row.appendChild(s);
var t = document.createElement('span'); t.style.cssText = 'flex:1;overflow:hidden;text-overflow:ellipsis;white-space:nowrap'; t.textContent = r.text; row.appendChild(t);
rightCol.appendChild(row);
});
grid.appendChild(rightCol);
el.appendChild(grid);
}
function renderSimilarityMatrix(el, result) {
el.textContent = '';
var matrix = result.matrix || [];
var texts = result.texts || [];
if (!matrix.length) { el.textContent = 'No results'; return; }
var tbl = document.createElement('table'); tbl.style.cssText = 'border-collapse:collapse;font-size:11px;width:100%';
var hdr = document.createElement('tr');
var corner = document.createElement('th'); hdr.appendChild(corner);
texts.forEach(function(t) {
var th = document.createElement('th'); th.style.cssText = 'padding:4px;color:var(--text2);font-size:9px;max-width:100px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap';
th.textContent = t.substring(0, 20); th.title = t; hdr.appendChild(th);
});
tbl.appendChild(hdr);
matrix.forEach(function(row, i) {
var tr = document.createElement('tr');
var td0 = document.createElement('td'); td0.style.cssText = 'padding:4px;color:var(--text2);font-size:9px;max-width:100px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap';
td0.textContent = texts[i].substring(0, 20); tr.appendChild(td0);
row.forEach(function(v, j) {
var td = document.createElement('td');
var bg = i===j ? 'rgba(74,222,128,0.1)' : v>=0.8 ? 'rgba(74,222,128,0.15)' : v>=0.6 ? 'rgba(226,181,90,0.1)' : 'transparent';
td.style.cssText = 'padding:4px;text-align:center;font-weight:'+(v>=0.7?'700':'400')+';color:'+(v>=0.8?'var(--accent)':v>=0.6?'var(--gold)':'var(--text2)')+';background:'+bg;
td.textContent = v.toFixed(2); tr.appendChild(td);
});
tbl.appendChild(tr);
});
el.appendChild(tbl);
}
refreshStatus();
</script>
</body></html>"""
@router.get("", response_class=HTMLResponse)
async def lab_page():
return _get_lab_html()
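The matrix cells in `renderSimilarityMatrix` above are colored by similarity band (the diagonal is always highlighted, `>=0.8` gets the accent color, `>=0.6` gold). The banding logic, extracted as a standalone sketch for reference — the band names here are illustrative, only the thresholds come from the JS above:

```python
def similarity_band(v: float, diagonal: bool = False) -> str:
    # Thresholds mirror renderSimilarityMatrix: the diagonal is always
    # highlighted, 0.8+ reads as a strong match, 0.6+ as a weak one.
    if diagonal:
        return "self"
    if v >= 0.8:
        return "high"
    if v >= 0.6:
        return "mid"
    return "low"

print(similarity_band(0.95), similarity_band(0.7), similarity_band(0.3))
```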


@@ -0,0 +1,503 @@
"""Pipeline Lab — iterative embedding/LLM pipeline experimentation.
Provides:
- Exemplar-based embedding classification (fast screening)
- LLM-based classification (accurate but slow)
- A/B benchmarking between the two
- Pipeline definition and execution
- Notebook-style API for interactive experimentation
"""
import json
import math
import os
import time
from pathlib import Path
from typing import Optional
from fastapi import APIRouter, HTTPException
from fastapi.responses import HTMLResponse
from pydantic import BaseModel
from .ollama import client
router = APIRouter()
EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
GEN_MODEL = os.environ.get("GEN_MODEL", "qwen2.5")
LAB_DIR = Path(os.environ.get("LAB_DIR", "./data/_pipeline_lab"))
LAB_DIR.mkdir(parents=True, exist_ok=True)
# ─── Vector math ─────────────────────────────────────────────
def cosine_similarity(a: list[float], b: list[float]) -> float:
dot = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(x * x for x in b))
if norm_a == 0 or norm_b == 0:
return 0.0
return dot / (norm_a * norm_b)
# ─── Exemplar store ──────────────────────────────────────────
# Exemplars are labeled text+embedding pairs used for classification.
# e.g. category="decision" texts=["We decided to use Parquet", "The team chose React"]
_exemplars: dict[str, list[dict]] = {} # category -> [{text, embedding}]
def _exemplar_file() -> Path:
return LAB_DIR / "exemplars.json"
def _load_exemplars():
global _exemplars
fp = _exemplar_file()
if fp.exists():
data = json.loads(fp.read_text())
_exemplars = data
return _exemplars
def _save_exemplars():
_exemplar_file().write_text(json.dumps(_exemplars, indent=2))
_load_exemplars()
# ─── Pipeline store ──────────────────────────────────────────
def _pipelines_dir() -> Path:
d = LAB_DIR / "pipelines"
d.mkdir(exist_ok=True)
return d
# ─── Embedding helper ────────────────────────────────────────
async def _embed_texts(texts: list[str], model: str = EMBED_MODEL) -> list[list[float]]:
embeddings = []
async with client() as c:
for text in texts:
resp = await c.post("/api/embed", json={"model": model, "input": text})
if resp.status_code != 200:
raise HTTPException(502, f"Ollama embed error: {resp.text}")
data = resp.json()
embeddings.extend(data.get("embeddings", []))
return embeddings
async def _generate(prompt: str, model: str = GEN_MODEL, temperature: float = 0.3) -> str:
async with client() as c:
resp = await c.post("/api/generate", json={
"model": model, "prompt": prompt, "stream": False,
"options": {"temperature": temperature, "num_predict": 1024}
})
if resp.status_code != 200:
raise HTTPException(502, f"Ollama generate error: {resp.text}")
return resp.json().get("response", "")
# ─── API: Exemplars ──────────────────────────────────────────
class ExemplarAdd(BaseModel):
category: str
texts: list[str]
class ExemplarList(BaseModel):
categories: dict[str, int] # category -> count
@router.post("/exemplars")
async def add_exemplars(req: ExemplarAdd):
"""Add labeled exemplar texts for a category. Embeddings generated automatically."""
category = req.category.strip().lower()
if not category or not req.texts:
raise HTTPException(400, "category and texts required")
embeddings = await _embed_texts(req.texts)
if category not in _exemplars:
_exemplars[category] = []
for text, emb in zip(req.texts, embeddings):
_exemplars[category].append({"text": text, "embedding": emb})
_save_exemplars()
return {"ok": True, "category": category, "added": len(req.texts),
"total": len(_exemplars[category])}
@router.get("/exemplars")
async def list_exemplars():
"""List all exemplar categories and counts."""
return {"categories": {k: len(v) for k, v in _exemplars.items()},
"total": sum(len(v) for v in _exemplars.values())}
@router.delete("/exemplars/{category}")
async def delete_exemplar_category(category: str):
if category in _exemplars:
del _exemplars[category]
_save_exemplars()
return {"ok": True}
# ─── API: Screen (embedding-based classification) ────────────
class ScreenRequest(BaseModel):
texts: list[str]
threshold: float = 0.65
top_k: int = 1
class ScreenResult(BaseModel):
text: str
best_category: str | None
similarity: float
above_threshold: bool
all_scores: dict[str, float]
@router.post("/screen", response_model=list[ScreenResult])
async def screen_texts(req: ScreenRequest):
"""Classify texts by cosine similarity to exemplar embeddings (fast path)."""
if not _exemplars:
raise HTTPException(400, "No exemplars defined. Add exemplars first.")
embeddings = await _embed_texts(req.texts)
results = []
for text, emb in zip(req.texts, embeddings):
category_scores = {}
for category, exemplar_list in _exemplars.items():
sims = [cosine_similarity(emb, ex["embedding"]) for ex in exemplar_list]
category_scores[category] = max(sims) if sims else 0.0
best_cat = max(category_scores, key=category_scores.get) if category_scores else None
best_sim = category_scores.get(best_cat, 0.0) if best_cat else 0.0
results.append(ScreenResult(
text=text[:200],
best_category=best_cat if best_sim >= req.threshold else None,
similarity=round(best_sim, 4),
above_threshold=best_sim >= req.threshold,
all_scores={k: round(v, 4) for k, v in sorted(category_scores.items(),
key=lambda x: x[1], reverse=True)},
))
return results
# ─── API: Classify (LLM-based classification) ────────────────
class ClassifyRequest(BaseModel):
texts: list[str]
categories: list[str] | None = None # if None, use exemplar category names
model: str | None = None
class ClassifyResult(BaseModel):
text: str
category: str
confidence: str
reasoning: str
@router.post("/classify", response_model=list[ClassifyResult])
async def classify_texts(req: ClassifyRequest):
"""Classify texts using LLM (slow but accurate path)."""
categories = req.categories or list(_exemplars.keys())
if not categories:
raise HTTPException(400, "No categories. Provide categories or add exemplars.")
model = req.model or GEN_MODEL
results = []
for text in req.texts:
prompt = (
f"Classify this text into exactly ONE of these categories: {', '.join(categories)}\n\n"
f"TEXT: {text[:500]}\n\n"
f"Respond with JSON: {{\"category\": \"...\", \"confidence\": \"high|medium|low\", "
f"\"reasoning\": \"one sentence\"}}"
)
raw = await _generate(prompt, model=model, temperature=0.1)
# Parse
try:
j_s, j_e = raw.find("{"), raw.rfind("}") + 1
parsed = json.loads(raw[j_s:j_e]) if j_s >= 0 and j_e > j_s else {}
except Exception:
parsed = {}
results.append(ClassifyResult(
text=text[:200],
category=parsed.get("category", "unknown"),
confidence=parsed.get("confidence", "low"),
reasoning=parsed.get("reasoning", raw[:200]),
))
return results
# ─── API: Benchmark (A/B comparison) ─────────────────────────
class BenchmarkRequest(BaseModel):
texts: list[str]
threshold: float = 0.65
model: str | None = None
class BenchmarkResult(BaseModel):
total_texts: int
# Embedding path
embed_time_ms: int
embed_results: list[dict]
# LLM path
llm_time_ms: int
llm_results: list[dict]
# Comparison
agreement_rate: float
speedup: float
texts_screened_out: int
texts_needing_llm: int
hybrid_estimated_ms: int
@router.post("/benchmark", response_model=BenchmarkResult)
async def benchmark(req: BenchmarkRequest):
"""Run same texts through embedding screening and LLM classification. Compare."""
if not _exemplars:
raise HTTPException(400, "No exemplars. Add exemplars first.")
categories = list(_exemplars.keys())
# Embedding path
t0 = time.monotonic()
embed_results = await screen_texts(ScreenRequest(
texts=req.texts, threshold=req.threshold
))
embed_ms = int((time.monotonic() - t0) * 1000)
# LLM path
t0 = time.monotonic()
llm_results = await classify_texts(ClassifyRequest(
texts=req.texts, categories=categories, model=req.model
))
llm_ms = int((time.monotonic() - t0) * 1000)
# Compare
agreements = 0
screened_out = 0
for er, lr in zip(embed_results, llm_results):
if not er.above_threshold:
screened_out += 1
if er.best_category == lr.category:
agreements += 1
needing_llm = len(req.texts) - screened_out
# Hybrid estimate: embed all + LLM only the uncertain ones
per_text_llm_ms = llm_ms / max(len(req.texts), 1)
hybrid_ms = int(embed_ms + needing_llm * per_text_llm_ms)
return BenchmarkResult(
total_texts=len(req.texts),
embed_time_ms=embed_ms,
embed_results=[r.model_dump() for r in embed_results],
llm_time_ms=llm_ms,
llm_results=[r.model_dump() for r in llm_results],
agreement_rate=round(agreements / max(len(req.texts), 1), 3),
speedup=round(llm_ms / max(hybrid_ms, 1), 2),
texts_screened_out=screened_out,
texts_needing_llm=needing_llm,
hybrid_estimated_ms=hybrid_ms,
)
# ─── API: Pipeline definition & execution ────────────────────
class PipelineStage(BaseModel):
name: str
mode: str # "screen", "classify", "extract", "validate", "custom"
config: dict = {} # stage-specific config (threshold, prompt, etc.)
class PipelineDef(BaseModel):
name: str
stages: list[PipelineStage]
description: str = ""
class PipelineRunRequest(BaseModel):
pipeline_name: str
texts: list[str]
@router.post("/pipelines")
async def save_pipeline(pipeline: PipelineDef):
"""Save a pipeline definition."""
fp = _pipelines_dir() / f"{pipeline.name}.json"
fp.write_text(pipeline.model_dump_json(indent=2))
return {"ok": True, "name": pipeline.name}
@router.get("/pipelines")
async def list_pipelines():
"""List saved pipeline definitions."""
pipelines = []
for fp in _pipelines_dir().glob("*.json"):
try:
data = json.loads(fp.read_text())
pipelines.append({"name": data["name"], "stages": len(data["stages"]),
"description": data.get("description", "")})
except Exception:
pass
return {"pipelines": pipelines}
@router.get("/pipelines/{name}")
async def get_pipeline(name: str):
fp = _pipelines_dir() / f"{name}.json"
if not fp.exists():
raise HTTPException(404, "Pipeline not found")
return json.loads(fp.read_text())
@router.post("/pipelines/run")
async def run_pipeline(req: PipelineRunRequest):
"""Execute a pipeline on a set of texts. Returns per-stage results and timing."""
fp = _pipelines_dir() / f"{req.pipeline_name}.json"
if not fp.exists():
raise HTTPException(404, f"Pipeline '{req.pipeline_name}' not found")
pipeline = json.loads(fp.read_text())
results = {"pipeline": req.pipeline_name, "stages": [], "total_ms": 0}
current_texts = req.texts[:]
for stage_def in pipeline["stages"]:
stage_name = stage_def["name"]
mode = stage_def["mode"]
config = stage_def.get("config", {})
t0 = time.monotonic()
stage_result = {"name": stage_name, "mode": mode, "input_count": len(current_texts)}
if mode == "screen":
threshold = config.get("threshold", 0.65)
screen_res = await screen_texts(ScreenRequest(
texts=current_texts, threshold=threshold
))
passed = [r for r in screen_res if r.above_threshold]
stage_result["output_count"] = len(passed)
stage_result["filtered_out"] = len(current_texts) - len(passed)
stage_result["results"] = [r.model_dump() for r in screen_res]
# Pass only above-threshold texts to the next stage. Use the original
# strings, not ScreenResult.text, which is truncated to 200 chars.
current_texts = [t for t, r in zip(current_texts, screen_res) if r.above_threshold]
elif mode == "classify":
cls_res = await classify_texts(ClassifyRequest(
texts=current_texts,
categories=config.get("categories"),
model=config.get("model"),
))
stage_result["output_count"] = len(cls_res)
stage_result["results"] = [r.model_dump() for r in cls_res]
elif mode == "extract":
extract_prompt = config.get("prompt", "Extract key information from this text:")
extractions = []
for text in current_texts:
raw = await _generate(f"{extract_prompt}\n\nTEXT: {text[:800]}")
extractions.append({"text": text[:200], "extracted": raw})
stage_result["output_count"] = len(extractions)
stage_result["results"] = extractions
elif mode == "validate":
# Embedding-based dedup: find near-duplicate results
if len(current_texts) > 1:
embs = await _embed_texts(current_texts)
dupes = []
threshold = config.get("dedup_threshold", 0.92)
for i in range(len(embs)):
for j in range(i + 1, len(embs)):
sim = cosine_similarity(embs[i], embs[j])
if sim >= threshold:
dupes.append({"i": i, "j": j, "similarity": round(sim, 4),
"text_a": current_texts[i][:100],
"text_b": current_texts[j][:100]})
stage_result["duplicates_found"] = len(dupes)
stage_result["results"] = dupes
else:
stage_result["duplicates_found"] = 0
stage_result["results"] = []
stage_result["output_count"] = len(current_texts)
else:
stage_result["error"] = f"Unknown mode: {mode}"
stage_result["output_count"] = len(current_texts)
stage_ms = int((time.monotonic() - t0) * 1000)
stage_result["time_ms"] = stage_ms
results["stages"].append(stage_result)
results["total_ms"] += stage_ms
return results
# ─── API: REPL cell (free-form eval) ─────────────────────────
class CellRequest(BaseModel):
action: str # "embed", "generate", "similarity", "screen", "classify"
text: str = ""
texts: list[str] = []
params: dict = {}
@router.post("/cell")
async def run_cell(req: CellRequest):
"""Execute a single notebook cell. Flexible entry point for ad-hoc operations."""
t0 = time.monotonic()
result = {}
if req.action == "embed":
texts = req.texts or ([req.text] if req.text else [])
embs = await _embed_texts(texts)
result = {"embeddings_count": len(embs), "dimensions": len(embs[0]) if embs else 0,
"texts": texts}
elif req.action == "generate":
text = await _generate(req.text, **{k: v for k, v in req.params.items()
if k in ("model", "temperature")})
result = {"text": text}
elif req.action == "similarity":
if len(req.texts) < 2:
raise HTTPException(400, "Need at least 2 texts for similarity")
embs = await _embed_texts(req.texts)
matrix = []
for i in range(len(embs)):
row = []
for j in range(len(embs)):
row.append(round(cosine_similarity(embs[i], embs[j]), 4))
matrix.append(row)
result = {"matrix": matrix, "texts": [t[:80] for t in req.texts]}
elif req.action == "screen":
texts = req.texts or ([req.text] if req.text else [])
threshold = req.params.get("threshold", 0.65)
res = await screen_texts(ScreenRequest(texts=texts, threshold=threshold))
result = {"results": [r.model_dump() for r in res]}
elif req.action == "classify":
texts = req.texts or ([req.text] if req.text else [])
res = await classify_texts(ClassifyRequest(texts=texts))
result = {"results": [r.model_dump() for r in res]}
else:
raise HTTPException(400, f"Unknown action: {req.action}")
result["time_ms"] = int((time.monotonic() - t0) * 1000)
return result
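The hybrid-time estimate in `/benchmark` is simple arithmetic: embed every text, then pay LLM cost only for texts the embedding screen could not settle. A standalone sketch (function name and numbers are illustrative, not part of the module):

```python
def hybrid_estimate_ms(embed_ms: int, llm_ms: int, n_texts: int, screened_out: int) -> int:
    # Mirrors the /benchmark estimate: full embedding pass for all texts,
    # plus per-text LLM cost only for texts still needing classification.
    needing_llm = n_texts - screened_out
    per_text_llm_ms = llm_ms / max(n_texts, 1)
    return int(embed_ms + needing_llm * per_text_llm_ms)

# 10 texts, 200ms embedding pass, 8000ms full LLM pass, 7 screened out:
# only 3 texts still need the LLM, so 200 + 3 * 800 = 2600ms.
print(hybrid_estimate_ms(200, 8000, 10, 7))
```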

90
tests/agent_test/PRD.md Normal file

@@ -0,0 +1,90 @@
# PRD: Chicago Permit Staffing Recommendation
## Mission
You are a staffing-intelligence assistant. Your job is to **analyze a Chicago building permit and produce a one-page staffing recommendation** for our staffing company.
The output is a markdown document that a human staffing coordinator will read in under 2 minutes to decide whether the contract is worth pursuing as a staffing fit.
## Critical rules
1. **DO NOT START WRITING THE FINAL ANALYSIS YET.**
- First, READ this PRD fully.
- Then, PLAN your approach in `note()` — what steps will you take, what tools will you call, what evidence will you need.
- Only after planning, begin executing.
2. **Never invent facts.** If you don't have evidence for a claim (from a tool call), do not make the claim. Say "no evidence available" instead.
3. **Cite your sources.** Every factual claim in the final output should reference either:
- The permit data you read (cite the permit ID)
- A matrix-retrieved chunk (cite as `[matrix:source:doc_id]`)
4. **Stay focused.** This is a one-page deliverable, not a research paper. Aim for 600-1000 words total.
## Tools available
- `list_permits(min_cost?: number, permit_type?: string)` — list permits matching filter; default returns top 5 by cost
- `read_permit(permit_id: string)` — get full details for one permit
- `query_matrix(query: string, top_k?: number)` — search the knowledge base for relevant context (contractor entities, prior permits, SEC tickers, LLM team patterns)
- `note(text: string)` — append to your working scratchpad (visible to you across iterations)
- `read_scratchpad()` — read your full scratchpad
- `done(summary: string)` — finish; pass your final markdown analysis as `summary`
## Required output structure
When you call `done(summary=...)`, the summary should contain:
```markdown
# Staffing Recommendation: Permit <ID>
## Permit Summary
[2-3 sentences: type, cost, address, scope of work]
## Contractor Profile
[What we know about the contractor(s) from matrix evidence. If no matrix hits, say so explicitly.]
## Staffing Implications
[What trades + headcount this permit implies. Ground in the work description.]
## Risk Signals
[Any matrix hits suggesting caution: debarment, prior incidents, low-quality history. If none, say so.]
## Recommendation
[Pursue / Pass / Investigate-Further, with one-sentence rationale.]
```
## Example workflow (do not copy verbatim)
1. Note your plan: "I will list 5 mid-range permits, pick one with a private contractor, read it fully, query the matrix for the contractor name, then write the recommendation."
2. Call `list_permits(min_cost=100000)` → see candidates
3. **PICK A PERMIT WITH A PRIVATE CONTRACTOR (a person's name or a private LLC), NOT a government agency** like CDOT, City of Chicago, etc. Government permits have no useful contractor profile to recommend on.
4. `read_permit(id)` → see all fields
5. Call `query_matrix("<contractor name> contractor Chicago renovation")` → see what the matrix has
6. Note any evidence found, gaps, surprises
7. Call `done(summary="<final markdown>")`
## Success criteria
- You called `done()` with a summary that follows the required structure
- Every factual claim has a source (permit ID or matrix citation)
- Total output is 600-1000 words
- You did not invent contractor names, prior incidents, or capabilities
- Plan was noted BEFORE execution started
## What "good" looks like
- Plan is concrete (which permit, which queries)
- Matrix queries are specific (contractor name + work type, not "find anything about this")
- When matrix returns nothing useful, you say so honestly
- Recommendation reflects the actual evidence, not boilerplate
## What "bad" looks like
- Skipping the plan and jumping to execution
- Making up contractor histories with no matrix evidence
- Generic recommendations that don't reference the actual permit
- Walls of text or structured padding to look thorough
## Begin
Start by acknowledging you've read this PRD and noting your plan via `note()`. Then proceed.
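The ordering rules above (plan via `note()` first, finish with `done()`) are mechanical enough to check automatically. A minimal validator sketch, assuming a transcript is a list of `(tool_name, args)` pairs — the harness's real transcript representation may differ:

```python
def check_transcript(calls: list[tuple[str, object]]) -> str:
    # Enforces the PRD's two hard ordering rules: the first tool call
    # must be note() (the plan), and the last must be done(summary=...).
    if not calls or calls[0][0] != "note":
        return "fail: no plan noted before execution"
    if calls[-1][0] != "done":
        return "fail: never called done()"
    return "ok"

print(check_transcript([("note", "plan: pick permit, query matrix"),
                        ("list_permits", {"min_cost": 100000}),
                        ("done", "# Staffing Recommendation...")]))
```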


@@ -0,0 +1,404 @@
// Compounding Stress Battery — the rigorous smoke test.
//
// Three iterations against /v1/respond, each running:
// α baseline (3 easy tasks) — should complete local-only with boost
// β drift (3 niche tasks) — forces executor miss → overseer fires
// γ impossible (2 zero-supply) — must fail honestly, no token explosion
// δ distill outcomes — writes distilled_*.jsonl + vector indexes
// ε overseer meta-review — gpt-oss:120b judges the iteration
// ζ scrum judgment — gpt-oss:120b reviews overseer proposals
//
// Iteration N+1 runs the same tasks as iteration N. We measure compounding:
// does turns_per_task drop? does overseer_called_rate drop? does
// correction_effective rise? If 3/5 metrics trend favorably, architecture
// validated; otherwise the scrum verdict points at what to fix.
//
// Fail-fast: every error bubbles. No silent catches — the run ABORTS with
// the underlying stack so we see exactly where the architecture broke.
//
// Runtime: ~60-90 min. Cloud cost: ~24-32 gpt-oss calls (well under daily cap).
import { writeFile, mkdir, readFile } from "node:fs/promises";
import { join } from "node:path";
const GATEWAY = process.env.GATEWAY_URL ?? "http://localhost:3100";
const LLM_TEAM = process.env.LLM_TEAM_URL ?? "http://localhost:5000";
const BATTERY_DIR = process.env.BATTERY_DIR
?? "/home/profit/lakehouse/data/_kb/battery";
// 10-minute timeout per /v1/respond call — cloud executor on a hard task
// can chew for a while, and we want to see real behavior, not premature aborts.
const RESPOND_TIMEOUT_MS = 10 * 60 * 1000;
const META_TIMEOUT_MS = 5 * 60 * 1000;
interface Task {
task_class: string;
operation: string;
spec: Record<string, any>;
}
interface Tasks {
phases: {
alpha_baseline: Task[];
beta_drift: Task[];
gamma_impossible: Task[];
};
models: {
executor_cloud: string;
reviewer_cloud: string;
overseer_cloud: string;
};
}
interface RunResult {
status: "ok" | "failed" | "blocked";
iterations: number;
artifact: any;
log: any[];
error?: string | null;
_elapsed_ms: number;
}
interface TaskRun {
task: Task;
phase: "alpha" | "beta" | "gamma";
result: RunResult;
}
// ─── HTTP helpers ───
async function runRespond(task: Task, models: Tasks["models"]): Promise<RunResult> {
const body = {
task_class: task.task_class,
operation: task.operation,
spec: task.spec,
executor_model: models.executor_cloud,
reviewer_model: models.reviewer_cloud,
};
const start = Date.now();
const resp = await fetch(`${GATEWAY}/v1/respond`, {
method: "POST",
headers: { "content-type": "application/json" },
body: JSON.stringify(body),
signal: AbortSignal.timeout(RESPOND_TIMEOUT_MS),
});
if (!resp.ok) {
const txt = await resp.text();
throw new Error(`/v1/respond HTTP ${resp.status}: ${txt.slice(0, 500)}`);
}
const j = (await resp.json()) as RunResult;
j._elapsed_ms = Date.now() - start;
return j;
}
async function runDistill(source: string): Promise<any[]> {
const body = { mode: "distill", prompt: "battery iteration distill", source };
const resp = await fetch(`${LLM_TEAM}/api/run?mode=distill`, {
method: "POST",
headers: { "content-type": "application/json" },
body: JSON.stringify(body),
signal: AbortSignal.timeout(META_TIMEOUT_MS),
});
if (!resp.ok) throw new Error(`distill HTTP ${resp.status}`);
const text = await resp.text();
// SSE stream — parse data: lines, return parsed event objects
const events: any[] = [];
for (const line of text.split("\n")) {
if (!line.startsWith("data: ")) continue;
try { events.push(JSON.parse(line.slice(6))); } catch { /* skip */ }
}
return events;
}
async function cloudChat(
model: string,
prompt: string,
temperature: number,
think: boolean,
): Promise<string> {
const resp = await fetch(`${GATEWAY}/v1/chat`, {
method: "POST",
headers: { "content-type": "application/json" },
body: JSON.stringify({
model,
messages: [{ role: "user", content: prompt }],
temperature,
think,
provider: "ollama_cloud",
}),
signal: AbortSignal.timeout(META_TIMEOUT_MS),
});
if (!resp.ok) {
const txt = await resp.text();
throw new Error(`/v1/chat ${model} HTTP ${resp.status}: ${txt.slice(0, 500)}`);
}
const j = await resp.json() as any;
return j.choices?.[0]?.message?.content ?? "";
}
// ─── Meta-review + scrum ───
async function overseerReview(
iterNum: number,
artifacts: any,
models: Tasks["models"],
): Promise<string> {
const prompt = `You are the OVERSEER reviewing iteration ${iterNum} of a stress battery run against Lakehouse /v1/respond.
For each task in the battery below, examine: status (ok/failed/blocked), iterations used, error signature, whether the in-loop overseer fired, total tokens.
Produce a PR-style meta-review in markdown with these sections:
## What worked
List specific tasks (by operation string) that completed correctly, and the evidence: turns_used, citations, tokens. Be concrete.
## What failed
List specific tasks that failed or needed overseer correction. Classify: was it a real failure (impossible task), a drift we should repair, or a false positive from the test?
## Proposed changes for iteration ${iterNum + 1}
At least 3 concrete architectural changes, each with:
- **Target file** (e.g. \`crates/gateway/src/execution_loop/mod.rs\`)
- **Rationale** (what the metrics show)
- **Expected impact** (which metric should move in iter ${iterNum + 1})
Be honest about weaknesses. Do NOT propose generic best practices; reference specific observations from the artifacts below.
ARTIFACTS (iteration ${iterNum}):
${JSON.stringify(artifacts, null, 2).slice(0, 30000)}`;
return cloudChat(models.overseer_cloud, prompt, 0.2, true);
}
async function scrumJudge(
iterNum: number,
review: string,
models: Tasks["models"],
): Promise<string> {
const prompt = `You are the SCRUM MASTER. The OVERSEER proposed these architectural changes for iteration ${iterNum + 1} based on iteration ${iterNum}'s results.
For each proposal, produce a verdict in markdown:
- **Proposal N**: <short name>
- **Verdict**: APPROVE | REVISE | REJECT
- **Reason**: why
- **If APPROVE**: is the expected impact realistic? what's the blast radius? is the target file correct?
- **If REVISE**: what should change about the proposal before applying?
- **If REJECT**: why is the proposal wrong or out of scope?
Final section:
## PR-ready changes
Bulleted list of only the APPROVE proposals, ready to apply.
Be rigorous. Don't rubber-stamp. If a proposal references a file that probably doesn't exist, REJECT and say so. If a proposal is a generic "improve X" without concrete plan, REVISE.
OVERSEER PROPOSED:
${review.slice(0, 15000)}`;
return cloudChat(models.overseer_cloud, prompt, 0.1, true);
}
// ─── Iteration driver ───
async function runIteration(iterNum: number, tasks: Tasks): Promise<any> {
console.log(`\n${"═".repeat(60)}`);
console.log(`▶ ITERATION ${iterNum}`);
console.log(`${"═".repeat(60)}\n`);
const iterDir = join(BATTERY_DIR, `iter_${iterNum}`);
await mkdir(iterDir, { recursive: true });
const runs: TaskRun[] = [];
for (const [phaseKey, phaseName] of [
["alpha_baseline", "alpha"],
["beta_drift", "beta"],
["gamma_impossible", "gamma"],
] as const) {
console.log(`\n── Phase ${phaseName} ──`);
for (const task of tasks.phases[phaseKey]) {
console.log(`${task.operation}`);
const result = await runRespond(task, tasks.models);
const overseerFired = (result.log ?? []).some(e => e.kind === "overseer_correction");
console.log(
` status=${result.status} turns=${result.iterations}` +
` tokens=${result.artifact?.usage?.total_tokens ?? 0}` +
` overseer=${overseerFired}` +
` elapsed=${Math.round(result._elapsed_ms / 1000)}s`
);
if (result.error) console.log(` error: ${result.error.slice(0, 200)}`);
runs.push({ task, phase: phaseName, result });
}
}
// Phase δ
console.log(`\n── Phase δ: distill outcomes_tail:20 ──`);
const distillEvents = await runDistill("outcomes_tail:20");
const distillFinal = [...distillEvents].reverse()
.find(e => e.role === "final") ?? distillEvents[distillEvents.length - 1];
const distillText = distillFinal?.text ?? JSON.stringify(distillFinal ?? {}).slice(0, 200);
console.log(` ${distillText.split("\n")[0]}`);
await writeFile(join(iterDir, "distill_output.txt"), distillText);
// Metrics
const collectPhase = (p: string) => runs.filter(r => r.phase === p);
const phaseMetrics = (p: string) => {
const ps = collectPhase(p);
if (ps.length === 0) return { count: 0 };
return {
count: ps.length,
ok: ps.filter(r => r.result.status === "ok").length,
failed: ps.filter(r => r.result.status === "failed").length,
avg_turns: ps.reduce((s, r) => s + (r.result.iterations || 0), 0) / ps.length,
total_tokens: ps.reduce((s, r) => s + (r.result.artifact?.usage?.total_tokens ?? 0), 0),
overseer_called: ps.filter(r => (r.result.log ?? []).some(e => e.kind === "overseer_correction")).length,
avg_elapsed_s: ps.reduce((s, r) => s + (r.result._elapsed_ms || 0), 0) / ps.length / 1000,
};
};
const metrics = {
iteration: iterNum,
total_tasks: runs.length,
ok_tasks: runs.filter(r => r.result.status === "ok").length,
failed_tasks: runs.filter(r => r.result.status === "failed").length,
blocked_tasks: runs.filter(r => r.result.status === "blocked").length,
total_tokens: runs.reduce((s, r) => s + (r.result.artifact?.usage?.total_tokens ?? 0), 0),
avg_turns_per_task: runs.reduce((s, r) => s + (r.result.iterations || 0), 0) / runs.length,
overseer_called_rate: runs.filter(r => (r.result.log ?? []).some(e => e.kind === "overseer_correction")).length / runs.length,
total_elapsed_s: runs.reduce((s, r) => s + (r.result._elapsed_ms || 0), 0) / 1000,
by_phase: {
alpha: phaseMetrics("alpha"),
beta: phaseMetrics("beta"),
gamma: phaseMetrics("gamma"),
},
};
console.log(`\n── Metrics ──`);
console.log(` total_tokens: ${metrics.total_tokens}`);
console.log(` avg_turns_per_task: ${metrics.avg_turns_per_task.toFixed(2)}`);
console.log(` overseer_called_rate: ${(metrics.overseer_called_rate * 100).toFixed(1)}%`);
console.log(` ok/total: ${metrics.ok_tasks}/${metrics.total_tasks}`);
await writeFile(join(iterDir, "runs.json"), JSON.stringify(runs, null, 2));
await writeFile(join(iterDir, "metrics.json"), JSON.stringify(metrics, null, 2));
// Phase ε: overseer review
console.log(`\n── Phase ε: overseer meta-review ──`);
const reviewInput = {
metrics,
task_summary: runs.map(r => ({
operation: r.task.operation,
phase: r.phase,
status: r.result.status,
iterations: r.result.iterations,
tokens: r.result.artifact?.usage?.total_tokens ?? 0,
overseer_called: (r.result.log ?? []).some(e => e.kind === "overseer_correction"),
error: r.result.error ?? null,
elapsed_s: Math.round((r.result._elapsed_ms || 0) / 1000),
})),
};
const review = await overseerReview(iterNum, reviewInput, tasks.models);
await writeFile(join(iterDir, "overseer_review.md"), review);
console.log(`${review.length} chars`);
// Phase ζ: scrum
console.log(`\n── Phase ζ: scrum judgment ──`);
const verdict = await scrumJudge(iterNum, review, tasks.models);
await writeFile(join(iterDir, "scrum_findings.md"), verdict);
console.log(`${verdict.length} chars`);
return metrics;
}
// ─── Main ───
async function main() {
const tasks = JSON.parse(
await readFile("/home/profit/lakehouse/tests/battery/tasks.json", "utf8"),
) as Tasks;
await mkdir(BATTERY_DIR, { recursive: true });
const iterations: any[] = [];
const batteryStart = Date.now();
for (let i = 1; i <= 3; i++) {
const m = await runIteration(i, tasks);
iterations.push(m);
}
const batteryElapsed = (Date.now() - batteryStart) / 1000;
// Summary
// Trend of one metric from iteration 1 to 3; "inverted" metrics improve when they decrease.
const delta = (k: string, inverted = false) => {
const vals = iterations.map((m: any) => m[k]);
if (vals.some(v => v === undefined)) return "—";
const diff = vals[2] - vals[0];
const pct = vals[0] !== 0 ? (diff / vals[0]) * 100 : 0;
const arrow = inverted ? (diff < 0 ? "↓ better" : "↑ worse") : (diff > 0 ? "↑ better" : "↓ worse");
return `${arrow} (${diff > 0 ? "+" : ""}${diff.toFixed(2)}, ${pct.toFixed(1)}%)`;
};
const rows = [
["total_tokens", "inverted", "want ↓ — fewer tokens for same work"],
["avg_turns_per_task", "inverted", "want ↓ — executor gets smarter"],
["overseer_called_rate", "inverted", "want ↓ — fewer cloud escalations"],
["ok_tasks", "normal", "want ↑ — more successes"],
["total_elapsed_s", "inverted", "want ↓ — faster iterations"],
];
let summary = `# Compounding Stress Battery — Summary\n\n`;
summary += `**Run:** ${new Date().toISOString()}\n`;
summary += `**Elapsed:** ${Math.round(batteryElapsed)}s (${(batteryElapsed/60).toFixed(1)} min)\n`;
summary += `**Models:** executor=${tasks.models.executor_cloud}, reviewer=${tasks.models.reviewer_cloud}, overseer=${tasks.models.overseer_cloud}\n\n`;
summary += `## Compounding Metrics\n\n`;
summary += `| Metric | iter 1 | iter 2 | iter 3 | Trend (1→3) | Goal |\n`;
summary += `|---|---|---|---|---|---|\n`;
for (const [key, inv, goal] of rows) {
const vals = iterations.map((m: any) => {
const v = m[key as string];
return typeof v === "number" ? v.toFixed(2) : String(v);
});
summary += `| ${key} | ${vals[0]} | ${vals[1]} | ${vals[2]} | ${delta(key, inv === "inverted")} | ${goal} |\n`;
}
summary += "\n";
// Count trending metrics
const trends = rows.map(([k, inv]) => {
const vs = iterations.map((m: any) => m[k as string]) as number[];
const improved = inv === "inverted" ? vs[2] < vs[0] : vs[2] > vs[0];
return { metric: k, improved };
});
const improvedCount = trends.filter(t => t.improved).length;
summary += `## Verdict\n\n`;
if (improvedCount >= 3) {
summary += `**✓ Architecture validated** — ${improvedCount}/${trends.length} compounding metrics improved from iteration 1 to 3.\n\n`;
} else {
summary += `**✗ Compounding NOT demonstrated** — only ${improvedCount}/${trends.length} metrics improved. See scrum_findings.md in each iter_N/ directory for the overseer's proposals and the scrum master's review of what to change.\n\n`;
}
summary += `Per-metric trend:\n`;
for (const t of trends) {
summary += `- ${t.metric}: ${t.improved ? "✓ improved" : "✗ flat or worse"}\n`;
}
summary += `\n## Artifacts\n\n`;
summary += `- \`iter_1/\`, \`iter_2/\`, \`iter_3/\` — per-iteration runs.json, metrics.json, overseer_review.md, scrum_findings.md, distill_output.txt\n`;
summary += `- \`summary.md\` — this file\n`;
await writeFile(join(BATTERY_DIR, "summary.md"), summary);
console.log(`\n${"═".repeat(60)}`);
console.log(`✓ BATTERY COMPLETE — ${Math.round(batteryElapsed)}s`);
console.log(` Summary: ${join(BATTERY_DIR, "summary.md")}`);
console.log(`${"═".repeat(60)}\n`);
console.log(summary);
}
main().catch(e => {
console.error(`\n${"═".repeat(60)}`);
console.error(`✗ BATTERY FAILED: ${e.message}`);
console.error(`${"═".repeat(60)}\n`);
if (e.stack) console.error(e.stack);
process.exit(1);
});
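The summary table's iteration-1 → iteration-3 arithmetic can be exercised on its own. This is a minimal sketch of the same computation the `delta` helper performs, not a drop-in replacement:

```typescript
// Sketch of the trend math used in the summary table above.
// "Inverted" metrics (tokens, turns, elapsed time) improve when they decrease.
function trend(vals: number[], inverted = false): string {
  const diff = vals[vals.length - 1] - vals[0];
  const pct = vals[0] !== 0 ? (diff / vals[0]) * 100 : 0;
  const arrow = inverted
    ? (diff < 0 ? "↓ better" : "↑ worse")
    : (diff > 0 ? "↑ better" : "↓ worse");
  return `${arrow} (${diff > 0 ? "+" : ""}${diff.toFixed(2)}, ${pct.toFixed(1)}%)`;
}
```

`trend([100, 90, 80], true)` reads as an improvement because the inverted metric fell from 100 to 80 across the three iterations.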

57
tests/battery/tasks.json Normal file
View File

@ -0,0 +1,57 @@
{
"description": "Compounding stress battery tasks. Each iteration runs α (baseline) + β (drift) + γ (impossible) phases. The SAME tasks repeat across iterations so we can measure compounding (turns_used, overseer_called_rate, correction_effective).",
"phases": {
"alpha_baseline": [
{
"task_class": "staffing.fill",
"operation": "fill: Warehouse Associate x3 in Columbus, OH",
"spec": { "target_role": "Warehouse Associate", "target_count": 3, "target_city": "Columbus", "target_state": "OH", "approach_hint": "hybrid search against workers_500k_v1" }
},
{
"task_class": "staffing.fill",
"operation": "fill: Forklift Operator x2 in Toledo, OH",
"spec": { "target_role": "Forklift Operator", "target_count": 2, "target_city": "Toledo", "target_state": "OH", "approach_hint": "hybrid search against workers_500k_v1" }
},
{
"task_class": "staffing.fill",
"operation": "fill: Packer x4 in Cleveland, OH",
"spec": { "target_role": "Packer", "target_count": 4, "target_city": "Cleveland", "target_state": "OH", "approach_hint": "hybrid search against workers_500k_v1" }
}
],
"beta_drift": [
{
"task_class": "staffing.fill",
"operation": "fill: Machine Operator x2 in Youngstown, OH (requires OSHA 30 + bilingual Spanish)",
"spec": { "target_role": "Machine Operator", "target_count": 2, "target_city": "Youngstown", "target_state": "OH", "approach_hint": "hybrid search against workers_500k_v1; prefer candidates with OSHA certification and Spanish" }
},
{
"task_class": "staffing.fill",
"operation": "fill: Welder x2 in Dayton, OH (AWS D1.1 certified, night shift)",
"spec": { "target_role": "Welder", "target_count": 2, "target_city": "Dayton", "target_state": "OH", "approach_hint": "hybrid search against workers_500k_v1; filter by certification and shift flexibility" }
},
{
"task_class": "staffing.fill",
"operation": "fill: Assembler x5 in Akron, OH (SMT experience, cleanroom)",
"spec": { "target_role": "Assembler", "target_count": 5, "target_city": "Akron", "target_state": "OH", "approach_hint": "hybrid search against workers_500k_v1" }
}
],
"gamma_impossible": [
{
"task_class": "staffing.fill",
"operation": "fill: Underwater Welder x2 in Toledo, OH",
"spec": { "target_role": "Underwater Welder", "target_count": 2, "target_city": "Toledo", "target_state": "OH", "approach_hint": "hybrid search against workers_500k_v1 (expected to fail — no supply)" }
},
{
"task_class": "staffing.fill",
"operation": "fill: Astronaut x1 in Springfield, OH",
"spec": { "target_role": "Astronaut", "target_count": 1, "target_city": "Springfield", "target_state": "OH", "approach_hint": "(expected to fail — out-of-domain role)" }
}
]
},
"models": {
"executor_cloud": "gpt-oss:20b",
"reviewer_cloud": "gpt-oss:20b",
"overseer_cloud": "gpt-oss:120b",
"notes": "gpt-oss:20b for hot path (faster, cheaper per call), gpt-oss:120b for meta-reviews. All cloud per 2026-04-23 'cloud modes are on' directive."
}
}
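Read off the JSON above, the file's shape can be sketched as a TypeScript interface plus a guard. The battery script's real `Tasks` type is defined elsewhere in the file, so treat this as an inferred approximation:

```typescript
// Shape of tests/battery/tasks.json as read off the document above.
interface TaskSpec {
  target_role: string;
  target_count: number;
  target_city: string;
  target_state: string;
  approach_hint: string;
}
interface BatteryTask { task_class: string; operation: string; spec: TaskSpec; }
interface Models { executor_cloud: string; reviewer_cloud: string; overseer_cloud: string; notes: string; }
interface Tasks {
  description: string;
  phases: { alpha_baseline: BatteryTask[]; beta_drift: BatteryTask[]; gamma_impossible: BatteryTask[] };
  models: Models;
}

// Minimal runtime check that a parsed document has the expected phase keys.
function isTasks(x: any): x is Tasks {
  return !!x && typeof x.description === "string"
    && ["alpha_baseline", "beta_drift", "gamma_impossible"].every(k => Array.isArray(x?.phases?.[k]))
    && typeof x?.models?.executor_cloud === "string";
}
```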

View File

@ -0,0 +1,45 @@
{
"generated_at": "2026-04-21T00:44:59.486489Z",
"runs": [
{
"label": "A(no-T3)",
"path": "tests/multi-agent/playbooks/scenario-2026-04-21T00-30-54",
"ok_events": 0,
"total_events": 5,
"total_turns": 0,
"total_gaps": 5,
"total_citations": 0,
"prior_lessons_loaded": 0
},
{
"label": "B(T3-seed)",
"path": "tests/multi-agent/playbooks/scenario-2026-04-21T00-37-04",
"ok_events": 0,
"total_events": 5,
"total_turns": 0,
"total_gaps": 5,
"total_citations": 0,
"prior_lessons_loaded": 1
},
{
"label": "C(T3-read)",
"path": "tests/multi-agent/playbooks/scenario-2026-04-21T00-39-54",
"ok_events": 0,
"total_events": 5,
"total_turns": 0,
"total_gaps": 5,
"total_citations": 0,
"prior_lessons_loaded": 2
},
{
"label": "D(T3-cloud)",
"path": "tests/multi-agent/playbooks/scenario-2026-04-21T00-43-44",
"ok_events": 0,
"total_events": 5,
"total_turns": 0,
"total_gaps": 5,
"total_citations": 0,
"prior_lessons_loaded": 3
}
]
}

View File

@ -0,0 +1,25 @@
# KB Measurement Report
Generated from 26 runs across 24 distinct signatures.
## Recommender confidence
- high: 23
- medium: 1
- low: 3
## Overall fill + citation
- Fill rate: **60/86** (69.8%)
- Avg citations per run: **1.38**
- Avg turns per run: 6.6
## Citation coverage by (role, city, state)
- Combos with ≥1 citation: 9
- Combos with ok fills but 0 citations: 31
## Item 3 decision signal
Non-zero: there are **combos that succeeded but never triggered a playbook_memory boost**. Candidates for item 3 investigation:
- Machine Operator in Indianapolis, IN: 1/1 ok, 0 cites
- Assembler in Indianapolis, IN: 2/2 ok, 0 cites
- Warehouse Associate in Indianapolis, IN: 1/1 ok, 0 cites
- Forklift Operator in Cleveland, OH: 1/1 ok, 0 cites
- Assembler in Cleveland, OH: 2/2 ok, 0 cites
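The "ok fills but 0 citations" list above is a grouping over per-run records. A hypothetical sketch of that signal, with the record shape assumed rather than taken from the measurement script:

```typescript
// Hypothetical per-run record; field names are assumptions, not the
// measurement script's real schema.
type Run = { role: string; city: string; state: string; ok: boolean; citations: number };

// Return (role, city, state) combos that filled OK at least once but
// never produced a playbook citation.
function zeroCitationCombos(runs: Run[]): string[] {
  const byCombo = new Map<string, { ok: number; cites: number }>();
  for (const r of runs) {
    const key = `${r.role} in ${r.city}, ${r.state}`;
    const agg = byCombo.get(key) ?? { ok: 0, cites: 0 };
    if (r.ok) agg.ok += 1;
    agg.cites += r.citations;
    byCombo.set(key, agg);
  }
  return [...byCombo.entries()]
    .filter(([, a]) => a.ok > 0 && a.cites === 0)
    .map(([k]) => k);
}
```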

View File

@ -0,0 +1 @@
{"at":"10:30","kind":"recurring","operation":"fill: Machine Operator x2 in Toledo, OH","fills":[{"candidate_id":"W500K-29048","name":"Raymond G. Ward","reason":"Raymond has a high availability score (0.9033) and relevant skills for the Machine Operator position in Toledo, OH. He is also endorsed by past similar fills."},{"candidate_id":"W500K-20613","name":"Pamela V. Green","reason":"Pamela has a high availability score (0.8496) and relevant skills for the Machine Operator position in Toledo, OH. She is also endorsed by past similar fills."}],"turns":5,"duration_secs":12.051,"pool_size":997,"playbook_citations":[]}

View File

@ -0,0 +1,17 @@
# Client emails — Riverfront Steel, 2026-04-21
## 10:30 recurring — Machine Operator x2
Subject: 2 Filled
Dear Riverfront Steel Team,
We are pleased to confirm that we have filled both Machine Operator roles for your Tuesday/Thursday shifts starting at 11:00 AM.
- Raymond G. Ward
- Pamela V. Green
Both candidates have high availability scores and relevant experience. Please note this is a recurring slot, and prior workers may still be available.
Best regards,
Dispatch Team Lakehouse

View File

@ -0,0 +1,45 @@
# Scenario retrospective — Riverfront Steel, 2026-04-21
Executor: `mistral:latest` · Reviewer: `qwen2.5:latest` · Draft: `qwen2.5:latest`
## Events
| At | Kind | Role / Count | Pool | Fills | Turns | Dur(s) | Cites | Gaps |
|---|---|---|---|---|---|---|---|---|
| 08:00 | baseline_fill | Warehouse Associate × 3 | - | ✗ 0 | 0 | 28.9 | 0 | 1 |
| 10:30 | recurring | Machine Operator × 2 | 997 | ✓ 2 | 5 | 12.1 | 0 | 1 |
| 12:15 | expansion | Forklift Operator × 5 | - | ✗ 0 | 0 | 20.3 | 0 | 1 |
| 14:00 | emergency | Loader × 4 | - | ✗ 0 | 0 | 35.7 | 0 | 1 |
| 15:45 | misplacement | Warehouse Associate × 1 | - | ✗ 0 | 0 | 11.5 | 0 | 1 |
## Final roster
| Worker | Booked | Role | City, ST | Status |
|---|---|---|---|---|
| undefined Raymond G. Ward | 10:30 | Machine Operator | Toledo, OH | confirmed |
| undefined Pamela V. Green | 10:30 | Machine Operator | Toledo, OH | confirmed |
## Gap signals
### drift_or_tool
- **08:00** — invalid JSON from executor: JSON Parse error: Unable to parse JSON string | raw: {"kind":"plan","steps":["Verify one candidate from the current list using sql tool for SQL verification.","Propose_done with top 3 candidates who are Warehouse Associates in Toledo, OH."]}
{"kind":"tool_call","tool":"sql","args":{"query":"SELECT worker_id, name, role, city, state, availability FROM
- **12:15** — invalid JSON from executor: JSON Parse error: Expected ']' | raw: {"kind":"plan","steps":["1. Use hybrid_search to find Forklift Operators in Toledo, OH with high engagement and communications scores who are available for work at Riverfront Steel starting at 01:00 PM. The search should prioritize workers with team/collaboration signals (engagement, communications
- **14:00** — no consensus after 14 turns
- **15:45** — invalid JSON from executor: JSON Parse error: Expected '}' | raw: {"kind":"tool_call","tool":"hybrid_search", "args":{"index_name":"workers_500k_v1","sql_filter":"LOWER(role) LIKE '%warehouse%' AND city = 'Toledo' AND state = 'OH' AND availability > 0.5 AND shift = '08:00' AND worker_id NOT IN [, ] AND worker_id NOT IN ["EXCLUDE_WORKERS_ID1", "EXCLUDE_WORKERS_ID2"
### double_book
- **10:30** — undefined Pamela V. Green already booked for 10:30
### fairness
- _cross-event_ — Raymond G. Ward (undefined) booked 2 times today
### write_through_audit
- _post-run_ — playbook_memory has 33 entries (ran 5 events, expected ≥ 1 new entries from this run)
## Narrative
- 1/5 events reached consensus.
- Final roster: 2 bookings across 1 distinct workers.
- Playbook citations across the day: 0 (proof the feedback loop fired across events).
- Dropped events: 08:00 baseline_fill, 12:15 expansion, 14:00 emergency, 15:45 misplacement.
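The double_book and fairness rows in these retrospectives amount to simple roster scans. A hypothetical sketch under an assumed booking shape (not the harness's real implementation):

```typescript
// Assumed booking record; worker_id is optional because some captured
// rosters above lack it (hence the "undefined" prefixes in the tables).
type Booking = { worker_id?: string; name: string; booked_for: string };

// Emit double_book signals for repeat (name, slot) pairs and fairness
// signals for workers booked at or above `fairnessCap` times in a day.
function gapSignals(roster: Booking[], fairnessCap = 3): string[] {
  const signals: string[] = [];
  const seen = new Set<string>();
  const bookings = new Map<string, number>();
  for (const b of roster) {
    const key = `${b.name}@${b.booked_for}`;
    if (seen.has(key)) signals.push(`double_book: ${b.name} already booked for ${b.booked_for}`);
    seen.add(key);
    bookings.set(b.name, (bookings.get(b.name) ?? 0) + 1);
  }
  for (const [name, n] of bookings) {
    if (n >= fairnessCap) signals.push(`fairness: ${name} booked ${n} times today`);
  }
  return signals;
}
```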

View File

@ -0,0 +1,118 @@
[
{
"event": {
"kind": "baseline_fill",
"at": "08:00",
"role": "Warehouse Associate",
"count": 3,
"city": "Toledo",
"state": "OH",
"shift_start": "08:00 AM",
"scenario_note": "Regular Monday morning shift, 8-hour."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 28.888,
"error": "invalid JSON from executor: JSON Parse error: Unable to parse JSON string | raw: {\"kind\":\"plan\",\"steps\":[\"Verify one candidate from the current list using sql tool for SQL verification.\",\"Propose_done with top 3 candidates who are Warehouse Associates in Toledo, OH.\"]}\n{\"kind\":\"tool_call\",\"tool\":\"sql\",\"args\":{\"query\":\"SELECT worker_id, name, role, city, state, availability FROM ",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Unable to parse JSON string | raw: {\"kind\":\"plan\",\"steps\":[\"Verify one candidate from the current list using sql tool for SQL verification.\",\"Propose_done with top 3 candidates who are Warehouse Associates in Toledo, OH.\"]}\n{\"kind\":\"tool_call\",\"tool\":\"sql\",\"args\":{\"query\":\"SELECT worker_id, name, role, city, state, availability FROM "
]
},
{
"event": {
"kind": "recurring",
"at": "10:30",
"role": "Machine Operator",
"count": 2,
"city": "Toledo",
"state": "OH",
"shift_start": "11:00 AM",
"scenario_note": "Recurring Tuesday/Thursday slot — prior workers may still be available."
},
"ok": true,
"fills": [
{
"candidate_id": "W500K-29048",
"name": "Raymond G. Ward",
"reason": "Raymond has a high availability score (0.9033) and relevant skills for the Machine Operator position in Toledo, OH. He is also endorsed by past similar fills."
},
{
"candidate_id": "W500K-20613",
"name": "Pamela V. Green",
"reason": "Pamela has a high availability score (0.8496) and relevant skills for the Machine Operator position in Toledo, OH. She is also endorsed by past similar fills."
}
],
"turns": 5,
"duration_secs": 12.051,
"gap_signals": [
"double_book: undefined Pamela V. Green already booked for 10:30"
],
"sources_first_score": 0.6692528,
"sources_last_score": 0.64494026,
"pool_size": 997,
"playbook_citations": []
},
{
"event": {
"kind": "expansion",
"at": "12:15",
"role": "Forklift Operator",
"count": 5,
"city": "Toledo",
"state": "OH",
"shift_start": "01:00 PM",
"scenario_note": "New warehouse location opening, five-worker team needed."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 20.342,
"error": "invalid JSON from executor: JSON Parse error: Expected ']' | raw: {\"kind\":\"plan\",\"steps\":[\"1. Use hybrid_search to find Forklift Operators in Toledo, OH with high engagement and communications scores who are available for work at Riverfront Steel starting at 01:00 PM. The search should prioritize workers with team/collaboration signals (engagement, communications ",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected ']' | raw: {\"kind\":\"plan\",\"steps\":[\"1. Use hybrid_search to find Forklift Operators in Toledo, OH with high engagement and communications scores who are available for work at Riverfront Steel starting at 01:00 PM. The search should prioritize workers with team/collaboration signals (engagement, communications "
]
},
{
"event": {
"kind": "emergency",
"at": "14:00",
"role": "Loader",
"count": 4,
"city": "Toledo",
"state": "OH",
"shift_start": "04:00 PM same day",
"deadline": "16:00",
"scenario_note": "Walkoff incident — replacement crew needed by 16:00 sharp."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 35.727,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
},
{
"event": {
"kind": "misplacement",
"at": "15:45",
"role": "Warehouse Associate",
"count": 1,
"city": "Toledo",
"state": "OH",
"shift_start": "remainder of 08:00 shift",
"scenario_note": "One worker from the 08:00 fill didn't show; rebuild the gap.",
"replaces_event": "08:00"
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 11.518,
"error": "invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\", \"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"LOWER(role) LIKE '%warehouse%' AND city = 'Toledo' AND state = 'OH' AND availability > 0.5 AND shift = '08:00' AND worker_id NOT IN [, ] AND worker_id NOT IN [\"EXCLUDE_WORKERS_ID1\", \"EXCLUDE_WORKERS_ID2\"",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\", \"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"LOWER(role) LIKE '%warehouse%' AND city = 'Toledo' AND state = 'OH' AND availability > 0.5 AND shift = '08:00' AND worker_id NOT IN [, ] AND worker_id NOT IN [\"EXCLUDE_WORKERS_ID1\", \"EXCLUDE_WORKERS_ID2\""
]
}
]

View File

@ -0,0 +1,18 @@
[
{
"name": "Raymond G. Ward",
"booked_for": "10:30",
"role": "Machine Operator",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Pamela V. Green",
"booked_for": "10:30",
"role": "Machine Operator",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
}
]

View File

@ -0,0 +1,11 @@
# SMS drafts — Riverfront Steel, 2026-04-21
## 10:30 recurring — Machine Operator x2 in Toledo, OH
TO: Raymond G. Ward
Confirming your Machine Operator shift at Riverfront Steel in Toledo, OH starting at 11:00 AM on Tuesday/Thursday. Still available!
---
TO: Pamela V. Green
Your Machine Operator shift at Riverfront Steel in Toledo, OH starts at 11:00 AM on Tuesday/Thursday. Confirm your availability please.

View File

@ -0,0 +1 @@
# Client emails — Riverfront Steel, 2026-04-21

View File

@ -0,0 +1 @@
# SMS drafts — Riverfront Steel, 2026-04-21

View File

@ -0,0 +1,22 @@
# Client emails — Riverfront Steel, 2026-04-21
## 12:15 expansion — Forklift Operator x5
Subject: 5 Workers Confirmed
Dear Riverfront Steel Team,
I am pleased to confirm that we have filled all five positions for Forklift Operators at your new warehouse location opening today starting at 1:00 PM. The workers are:
- Laura F. Morales
- Kyle F. Brooks
- Maria K. Cruz
- Jeffrey D. Taylor
- Charles T. Walker
All meet the criteria of being Forklift Operators in Toledo, OH.
Looking forward to a successful shift!
Best regards,
Dispatch Team Lakehouse

View File

@ -0,0 +1,26 @@
# SMS drafts — Riverfront Steel, 2026-04-21
## 12:15 expansion — Forklift Operator x5 in Toledo, OH
TO: Laura F. Morales
Confirming your shift as a Forklift Operator at Riverfront Steel in Toledo, OH starting at 1:00 PM for the new warehouse location opening. W500K-22625
---
TO: Kyle F. Brooks
You're scheduled to start your Forklift Operator shift at Riverfront Steel in Toledo, OH at 1:00 PM for the new warehouse location opening. Confirm W500K-33961
---
TO: Maria K. Cruz
Your shift as a Forklift Operator at Riverfront Steel in Toledo, OH starts at 1:00 PM for the new warehouse location opening. Please confirm your attendance. W500K-19588
---
TO: Jeffrey D. Taylor
Confirming your shift as a Forklift Operator at Riverfront Steel in Toledo, OH starting at 1:00 PM for the new warehouse location opening. W500K-37729
---
TO: Charles T. Walker
You're scheduled to start your Forklift Operator shift at Riverfront Steel in Toledo, OH at 1:00 PM for the new warehouse location opening. Confirm your attendance. W500K-17543

View File

@ -0,0 +1,2 @@
{"at":"08:00","kind":"baseline_fill","operation":"fill: Warehouse Associate x3 in Toledo, OH","fills":[{"candidate_id":"W500K-49164","name":"Christopher Y. Phillips","reason":"Reliable Warehouse Associate with availability greater than 0.5 in Toledo, OH."},{"candidate_id":"W500K-40928","name":"Janet E. Hill","reason":"Reliable Warehouse Associate with availability greater than 0.5 in Toledo, OH."},{"candidate_id":"W500K-34704","name":"Fatima U. Rivera","reason":"Reliable Warehouse Associate with availability greater than 0.5 in Toledo, OH."}],"turns":7,"duration_secs":20.128,"pool_size":770,"playbook_citations":[],"discovered_pattern":"Across 25 similar past playbooks (8 workers examined) · archetype mostly: reliable · reliability median 0.80 (range 0.660.96)"}
{"at":"10:30","kind":"recurring","operation":"fill: Machine Operator x2 in Toledo, OH","fills":[{"candidate_id":"W500K-19759","name":"Carmen Z. Rodriguez","reason":"Recurring Machine Operator in Toledo, OH with a score of 0.75, verified via sql tool."},{"candidate_id":"W500K-29298","name":"Robert W. Gonzalez","reason":"Recurring Machine Operator in Toledo, OH with a score of 0.74, not yet SQL verified but highly likely to meet requirements."}],"turns":5,"duration_secs":17.426,"pool_size":997,"playbook_citations":[],"discovered_pattern":"Across 25 similar past playbooks (8 workers examined) · archetype mostly: reliable · reliability median 0.80 (range 0.660.96)"}

View File

@ -0,0 +1,40 @@
# Client emails — Riverfront Steel, 2026-04-21
## 08:00 baseline_fill — Warehouse Associate x3
Subject: 3 Filled
Dear Riverfront Steel Staffing Team,
I am pleased to confirm that we have filled all three roles of Warehouse Associate for your Monday morning shift starting at 08:00 AM.
The workers assigned are:
- Christopher Y. Phillips
- Janet E. Hill
- Fatima U. Rivera
All three have confirmed their availability and are reliable team members.
Best regards,
Dispatch Team Lakehouse
## 10:30 recurring — Machine Operator x2
To: staffing@riverfrontsteel.example
From: dispatch@lakehouse.example
Subject: Confirmed
Dear Riverfront Steel Team,
We are pleased to confirm that we have filled both Machine Operator roles for your Tuesday/Thursday shifts starting at 11:00 AM. The workers assigned are:
- Carmen Z. Rodriguez
- Robert W. Gonzalez
Both are recurring Machine Operators in Toledo, OH with a score of 0.7.
Please note this is a recurring slot; prior workers may still be available.
Best regards,
Dispatch Team

View File

@ -0,0 +1,74 @@
# Scenario retrospective — Riverfront Steel, 2026-04-21
Executor: `mistral:latest` · Reviewer: `qwen2.5:latest` · Draft: `qwen2.5:latest`
## Events
| At | Kind | Role / Count | Pool | Fills | Turns | Dur(s) | Cites | Gaps |
|---|---|---|---|---|---|---|---|---|
| 08:00 | baseline_fill | Warehouse Associate × 3 | 770 | ✓ 3 | 7 | 20.1 | 0 | 2 |
| 10:30 | recurring | Machine Operator × 2 | 997 | ✓ 2 | 5 | 17.4 | 0 | 2 |
| 12:15 | expansion | Forklift Operator × 5 | - | ✗ 0 | 0 | 46.4 | 0 | 1 |
| 14:00 | emergency | Loader × 4 | - | ✗ 0 | 0 | 54.1 | 0 | 1 |
| 15:45 | misplacement | Warehouse Associate × 1 | - | ✗ 0 | 0 | 59.6 | 0 | 1 |
## Final roster
| Worker | Booked | Role | City, ST | Status |
|---|---|---|---|---|
| undefined Christopher Y. Phillips | 08:00 | Warehouse Associate | Toledo, OH | no_show |
| undefined Janet E. Hill | 08:00 | Warehouse Associate | Toledo, OH | confirmed |
| undefined Fatima U. Rivera | 08:00 | Warehouse Associate | Toledo, OH | confirmed |
| undefined Carmen Z. Rodriguez | 10:30 | Machine Operator | Toledo, OH | confirmed |
| undefined Robert W. Gonzalez | 10:30 | Machine Operator | Toledo, OH | confirmed |
## Gap signals
### double_book
- **08:00** — undefined Janet E. Hill already booked for 08:00
- **08:00** — undefined Fatima U. Rivera already booked for 08:00
- **10:30** — undefined Carmen Z. Rodriguez already booked for 08:00
- **10:30** — undefined Robert W. Gonzalez already booked for 08:00
### drift_or_tool
- **12:15** — invalid JSON from executor: JSON Parse error: Invalid escape character ' | raw: {"kind":"plan", "steps":["TOOL_CALL hybrid_search({'index_name':'workers_500k_v1', 'sql_filter':'LOWER(role) LIKE '%forklift%' AND city = \'Toledo\' AND state = \'OH\' AND availability > CAST(0.5 AS DOUBLE) AND reliability > CAST(0.75 AS DOUBLE)', 'question':'reliable forklift operators in Toledo, O
- **14:00** — invalid JSON from executor: JSON Parse error: Expected '}' | raw: {"kind":"tool_call","tool":"hybrid_search",
"args":{"index_name":"workers_500k_v1","sql_filter":"LOWER(role) LIKE '%loader%' AND city = 'Toledo' AND state = 'OH' AND availability > CAST(0.7 AS DOUBLE) AND worker_id NOT IN ('W500K-4321', 'W500K-8963', 'W500K-2345', 'W500K-6789', 'W500K-9876') AND wor
- **15:45** — no consensus after 14 turns
### fairness
- _cross-event_ — Christopher Y. Phillips (undefined) booked 4 times today
### write_through_audit
- _post-run_ — playbook_memory has 165 entries (ran 5 events, expected ≥ 2 new entries from this run)
## Workers touched across the week
6 distinct workers made it through to a decision. Every one is accounted for below — no-shows flagged, rebookings noted, everyone visible.
| Worker ID | Name | Events | Outcome |
|---|---|---|---|
| W500K-49164 | Christopher Y. Phillips | 08:00 baseline_fill | booked |
| W500K-40928 | Janet E. Hill | 08:00 baseline_fill | booked |
| W500K-34704 | Fatima U. Rivera | 08:00 baseline_fill | booked |
| W500K-19759 | Carmen Z. Rodriguez | 10:30 recurring | booked |
| W500K-29298 | Robert W. Gonzalez | 10:30 recurring | booked |
| undefined | Christopher Y. Phillips | 08:00 | no_show |
## Discovered patterns (meta-index)
What the system identified across semantically-similar past fills as each event ran:
- **08:00 baseline_fill** (Warehouse Associate): Across 25 similar past playbooks (8 workers examined) · archetype mostly: reliable · reliability median 0.80 (range 0.66–0.96)
- **10:30 recurring** (Machine Operator): Across 25 similar past playbooks (8 workers examined) · archetype mostly: reliable · reliability median 0.80 (range 0.66–0.96)
- **12:15 expansion** (Forklift Operator): —
- **14:00 emergency** (Loader): —
- **15:45 misplacement** (Warehouse Associate): —
## Narrative
- 2/5 events reached consensus.
- Final roster: 5 bookings across 1 distinct workers.
- Workers touched (booked, failed, or otherwise decided): 6.
- Playbook citations across the day: 0 (proof the feedback loop fired across events).
- Dropped events: 12:15 expansion, 14:00 emergency, 15:45 misplacement.

View File

@ -0,0 +1,146 @@
[
{
"event": {
"kind": "baseline_fill",
"at": "08:00",
"role": "Warehouse Associate",
"count": 3,
"city": "Toledo",
"state": "OH",
"shift_start": "08:00 AM",
"scenario_note": "Regular Monday morning shift, 8-hour."
},
"ok": true,
"fills": [
{
"candidate_id": "W500K-49164",
"name": "Christopher Y. Phillips",
"reason": "Reliable Warehouse Associate with availability greater than 0.5 in Toledo, OH."
},
{
"candidate_id": "W500K-40928",
"name": "Janet E. Hill",
"reason": "Reliable Warehouse Associate with availability greater than 0.5 in Toledo, OH."
},
{
"candidate_id": "W500K-34704",
"name": "Fatima U. Rivera",
"reason": "Reliable Warehouse Associate with availability greater than 0.5 in Toledo, OH."
}
],
"turns": 7,
"duration_secs": 20.128,
"gap_signals": [
"double_book: undefined Janet E. Hill already booked for 08:00",
"double_book: undefined Fatima U. Rivera already booked for 08:00"
],
"sources_first_score": 0.7124013,
"sources_last_score": 0.66623676,
"pool_size": 770,
"playbook_citations": [],
"discovered_pattern": "Across 25 similar past playbooks (8 workers examined) · archetype mostly: reliable · reliability median 0.80 (range 0.660.96)"
},
{
"event": {
"kind": "recurring",
"at": "10:30",
"role": "Machine Operator",
"count": 2,
"city": "Toledo",
"state": "OH",
"shift_start": "11:00 AM",
"scenario_note": "Recurring Tuesday/Thursday slot — prior workers may still be available."
},
"ok": true,
"fills": [
{
"candidate_id": "W500K-19759",
"name": "Carmen Z. Rodriguez",
"reason": "Recurring Machine Operator in Toledo, OH with a score of 0.75, verified via sql tool."
},
{
"candidate_id": "W500K-29298",
"name": "Robert W. Gonzalez",
"reason": "Recurring Machine Operator in Toledo, OH with a score of 0.74, not yet SQL verified but highly likely to meet requirements."
}
],
"turns": 5,
"duration_secs": 17.426,
"gap_signals": [
"double_book: undefined Carmen Z. Rodriguez already booked for 08:00",
"double_book: undefined Robert W. Gonzalez already booked for 08:00"
],
"sources_first_score": 0.72546995,
"sources_last_score": 0.6690281,
"pool_size": 997,
"playbook_citations": [],
"discovered_pattern": "Across 25 similar past playbooks (8 workers examined) · archetype mostly: reliable · reliability median 0.80 (range 0.660.96)"
},
{
"event": {
"kind": "expansion",
"at": "12:15",
"role": "Forklift Operator",
"count": 5,
"city": "Toledo",
"state": "OH",
"shift_start": "01:00 PM",
"scenario_note": "New warehouse location opening, five-worker team needed."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 46.391,
"error": "invalid JSON from executor: JSON Parse error: Invalid escape character ' | raw: {\"kind\":\"plan\", \"steps\":[\"TOOL_CALL hybrid_search({'index_name':'workers_500k_v1', 'sql_filter':'LOWER(role) LIKE '%forklift%' AND city = \\'Toledo\\' AND state = \\'OH\\' AND availability > CAST(0.5 AS DOUBLE) AND reliability > CAST(0.75 AS DOUBLE)', 'question':'reliable forklift operators in Toledo, O",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Invalid escape character ' | raw: {\"kind\":\"plan\", \"steps\":[\"TOOL_CALL hybrid_search({'index_name':'workers_500k_v1', 'sql_filter':'LOWER(role) LIKE '%forklift%' AND city = \\'Toledo\\' AND state = \\'OH\\' AND availability > CAST(0.5 AS DOUBLE) AND reliability > CAST(0.75 AS DOUBLE)', 'question':'reliable forklift operators in Toledo, O"
]
},
{
"event": {
"kind": "emergency",
"at": "14:00",
"role": "Loader",
"count": 4,
"city": "Toledo",
"state": "OH",
"shift_start": "04:00 PM same day",
"deadline": "16:00",
"scenario_note": "Walkoff incident — replacement crew needed by 16:00 sharp."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 54.123,
"error": "invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\n\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"LOWER(role) LIKE '%loader%' AND city = 'Toledo' AND state = 'OH' AND availability > CAST(0.7 AS DOUBLE) AND worker_id NOT IN ('W500K-4321', 'W500K-8963', 'W500K-2345', 'W500K-6789', 'W500K-9876') AND wor",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\n\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"LOWER(role) LIKE '%loader%' AND city = 'Toledo' AND state = 'OH' AND availability > CAST(0.7 AS DOUBLE) AND worker_id NOT IN ('W500K-4321', 'W500K-8963', 'W500K-2345', 'W500K-6789', 'W500K-9876') AND wor"
]
},
{
"event": {
"kind": "misplacement",
"at": "15:45",
"role": "Warehouse Associate",
"count": 1,
"city": "Toledo",
"state": "OH",
"shift_start": "remainder of 08:00 shift",
"scenario_note": "One worker from the 08:00 fill didn't show; rebuild the gap.",
"replaces_event": "08:00",
"exclude_worker_ids": [
null,
null,
null
]
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 59.593,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
}
]

View File

@ -0,0 +1,42 @@
[
{
"name": "Christopher Y. Phillips",
"booked_for": "08:00",
"role": "Warehouse Associate",
"city": "Toledo",
"state": "OH",
"status": "no_show"
},
{
"name": "Janet E. Hill",
"booked_for": "08:00",
"role": "Warehouse Associate",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Fatima U. Rivera",
"booked_for": "08:00",
"role": "Warehouse Associate",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Carmen Z. Rodriguez",
"booked_for": "10:30",
"role": "Machine Operator",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Robert W. Gonzalez",
"booked_for": "10:30",
"role": "Machine Operator",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
}
]

View File

@ -0,0 +1,26 @@
# SMS drafts — Riverfront Steel, 2026-04-21
## 08:00 baseline_fill — Warehouse Associate x3 in Toledo, OH
TO: Christopher Y. Phillips
Confirming your shift as a Warehouse Associate at Riverfront Steel in Toledo, OH starting 8 AM today.
---
TO: Janet E. Hill
Good morning! Confirming your shift as a Warehouse Associate from 8 AM onwards at our Toledo, OH location.
---
TO: Fatima U. Rivera
Morning Fatima! Just confirming your shift as a Warehouse Associate at Riverfront Steel in Toledo, OH starting at 8 AM.
## 10:30 recurring — Machine Operator x2 in Toledo, OH
TO: Carmen Z. Rodriguez
Confirming your shift as a Machine Operator at Riverfront Steel in Toledo, OH starting 11:00 AM on Tuesday/Thursday. Still available!
---
TO: Robert W. Gonzalez
Your recurring Tuesday/Thursday Machine Operator shift at Riverfront Steel in Toledo, OH starts at 11:00 AM. Confirm your availability please.


@ -0,0 +1 @@
# Client emails — Riverfront Steel, 2026-04-21


@ -0,0 +1,57 @@
# Scenario retrospective — Riverfront Steel, 2026-04-21
Executor: `mistral:latest` Reviewer: `qwen2.5:latest` Draft: `qwen2.5:latest`
## Events
| At | Kind | Role / Count | Pool | Fills | Turns | Dur(s) | Cites | Gaps |
|---|---|---|---|---|---|---|---|---|
| 08:00 | baseline_fill | Warehouse Associate × 3 | - | ✗ 0 | 0 | 63.8 | 0 | 1 |
| 10:30 | recurring | Machine Operator × 2 | - | ✗ 0 | 0 | 9.5 | 0 | 1 |
| 12:15 | expansion | Forklift Operator × 5 | - | ✗ 0 | 0 | 47.8 | 0 | 1 |
| 14:00 | emergency | Loader × 4 | - | ✗ 0 | 0 | 60.1 | 0 | 1 |
| 15:45 | misplacement | Warehouse Associate × 1 | - | ✗ 0 | 0 | 62.3 | 0 | 1 |
## Final roster
| Worker | Booked | Role | City, ST | Status |
|---|---|---|---|---|
## Gap signals
### drift_or_tool
- **08:00** — aborted — 3 consecutive drift flags
- **10:30** — invalid JSON from executor: JSON Parse error: Unterminated string | raw: {"kind":"plan","steps":["TOOL_CALL hybrid_search({'index_name':'workers_500k_v1','sql_filter':'role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5'})",
"TOOL_CALL sql({'query':'SELECT worker_id, name, role, city, state, CAST(availability AS DOUBLE) A
- **12:15** — aborted — 3 consecutive drift flags
- **14:00** — aborted — 3 consecutive drift flags
- **15:45** — invalid JSON from executor: JSON Parse error: Unterminated string | raw: {"kind": "plan", "steps": ["1.1. TOOL_CALL hybrid_search({'index_name': 'workers_500k_v1', 'sql_filter': 'role = 'Warehouse Associate' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND worker_id NOT IN (49164, 1181, 7239, 299, 30930, 33212)'})",
"2.2. TOOL_CALL sql({'qu
### write_through_audit
- _post-run_ — playbook_memory has 165 entries (ran 5 events, expected ≥ 0 new entries from this run)
## Workers touched across the week
0 distinct workers made it through to a decision. Every one is accounted for below — no-shows flagged, rebookings noted, everyone visible.
| Worker ID | Name | Events | Outcome |
|---|---|---|---|
## Discovered patterns (meta-index)
What the system identified across semantically-similar past fills as each event ran:
- **08:00 baseline_fill** (Warehouse Associate): —
- **10:30 recurring** (Machine Operator): —
- **12:15 expansion** (Forklift Operator): —
- **14:00 emergency** (Loader): —
- **15:45 misplacement** (Warehouse Associate): —
## Narrative
- 0/5 events reached consensus.
- Final roster: 0 bookings across 0 distinct workers.
- Workers touched (booked, failed, or otherwise decided): 0.
- Playbook citations across the day: 0 (proof the feedback loop fired across events).
- Dropped events: 08:00 baseline_fill, 10:30 recurring, 12:15 expansion, 14:00 emergency, 15:45 misplacement.


@ -0,0 +1,104 @@
[
{
"event": {
"kind": "baseline_fill",
"at": "08:00",
"role": "Warehouse Associate",
"count": 3,
"city": "Toledo",
"state": "OH",
"shift_start": "08:00 AM",
"scenario_note": "Regular Monday morning shift, 8-hour."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 63.815,
"error": "aborted — 3 consecutive drift flags",
"gap_signals": [
"drift_or_tool: aborted — 3 consecutive drift flags"
]
},
{
"event": {
"kind": "recurring",
"at": "10:30",
"role": "Machine Operator",
"count": 2,
"city": "Toledo",
"state": "OH",
"shift_start": "11:00 AM",
"scenario_note": "Recurring Tuesday/Thursday slot — prior workers may still be available."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 9.538,
"error": "invalid JSON from executor: JSON Parse error: Unterminated string | raw: {\"kind\":\"plan\",\"steps\":[\"TOOL_CALL hybrid_search({'index_name':'workers_500k_v1','sql_filter':'role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5'})\",\n\"TOOL_CALL sql({'query':'SELECT worker_id, name, role, city, state, CAST(availability AS DOUBLE) A",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Unterminated string | raw: {\"kind\":\"plan\",\"steps\":[\"TOOL_CALL hybrid_search({'index_name':'workers_500k_v1','sql_filter':'role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5'})\",\n\"TOOL_CALL sql({'query':'SELECT worker_id, name, role, city, state, CAST(availability AS DOUBLE) A"
]
},
{
"event": {
"kind": "expansion",
"at": "12:15",
"role": "Forklift Operator",
"count": 5,
"city": "Toledo",
"state": "OH",
"shift_start": "01:00 PM",
"scenario_note": "New warehouse location opening, five-worker team needed."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 47.797,
"error": "aborted — 3 consecutive drift flags",
"gap_signals": [
"drift_or_tool: aborted — 3 consecutive drift flags"
]
},
{
"event": {
"kind": "emergency",
"at": "14:00",
"role": "Loader",
"count": 4,
"city": "Toledo",
"state": "OH",
"shift_start": "04:00 PM same day",
"deadline": "16:00",
"scenario_note": "Walkoff incident — replacement crew needed by 16:00 sharp."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 60.115,
"error": "aborted — 3 consecutive drift flags",
"gap_signals": [
"drift_or_tool: aborted — 3 consecutive drift flags"
]
},
{
"event": {
"kind": "misplacement",
"at": "15:45",
"role": "Warehouse Associate",
"count": 1,
"city": "Toledo",
"state": "OH",
"shift_start": "remainder of 08:00 shift",
"scenario_note": "One worker from the 08:00 fill didn't show; rebuild the gap.",
"replaces_event": "08:00"
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 62.283,
"error": "invalid JSON from executor: JSON Parse error: Unterminated string | raw: {\"kind\": \"plan\", \"steps\": [\"1.1. TOOL_CALL hybrid_search({'index_name': 'workers_500k_v1', 'sql_filter': 'role = 'Warehouse Associate' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND worker_id NOT IN (49164, 1181, 7239, 299, 30930, 33212)'})\",\n\"2.2. TOOL_CALL sql({'qu",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Unterminated string | raw: {\"kind\": \"plan\", \"steps\": [\"1.1. TOOL_CALL hybrid_search({'index_name': 'workers_500k_v1', 'sql_filter': 'role = 'Warehouse Associate' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND worker_id NOT IN (49164, 1181, 7239, 299, 30930, 33212)'})\",\n\"2.2. TOOL_CALL sql({'qu"
]
}
]


@ -0,0 +1 @@
# SMS drafts — Riverfront Steel, 2026-04-21


@ -0,0 +1 @@
# Client emails — Riverfront Steel, 2026-04-21


@ -0,0 +1,55 @@
# Scenario retrospective — Riverfront Steel, 2026-04-21
Executor: `qwen2.5:latest` Reviewer: `qwen2.5:latest` Draft: `qwen2.5:latest`
## Events
| At | Kind | Role / Count | Pool | Fills | Turns | Dur(s) | Cites | Gaps |
|---|---|---|---|---|---|---|---|---|
| 08:00 | baseline_fill | Warehouse Associate × 3 | - | ✗ 0 | 0 | 6.4 | 0 | 1 |
| 10:30 | recurring | Machine Operator × 2 | - | ✗ 0 | 0 | 16.8 | 0 | 1 |
| 12:15 | expansion | Forklift Operator × 5 | - | ✗ 0 | 0 | 7.2 | 0 | 1 |
| 14:00 | emergency | Loader × 4 | - | ✗ 0 | 0 | 54.0 | 0 | 1 |
| 15:45 | misplacement | Warehouse Associate × 1 | - | ✗ 0 | 0 | 49.3 | 0 | 1 |
## Final roster
| Worker | Booked | Role | City, ST | Status |
|---|---|---|---|---|
## Gap signals
### drift_or_tool
- **08:00** — invalid JSON from executor: JSON Parse error: Expected '}' | raw: {"kind":"tool_call","tool":"sql","args":{"query":"SELECT worker_id, name FROM workers_500k_v1 WHERE role = 'Warehouse Associate' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 LIMIT 3"},"rationale":"verify top candidates via SQL query")}
- **10:30** — invalid JSON from executor: JSON Parse error: Expected '}' | raw: {"kind":"tool_call","tool":"hybrid_search","args":{"index_name":"workers_500k_v1","sql_filter":"role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND reliability >= 0.7","question":"machine operator Toledo OH high reliability","k":2}
- **12:15** — invalid JSON from executor: JSON Parse error: Expected '}' | raw: {"kind":"tool_call","tool":"sql","args":{"query":"SELECT worker_id FROM workers_500k_v1 WHERE role = 'Forklift Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND CAST(reliability AS DOUBLE) > 0.75 LIMIT 5"},"rationale":"verify top candidates via SQL query to me
- **14:00** — no consensus after 14 turns
- **15:45** — no consensus after 14 turns
### write_through_audit
- _post-run_ — playbook_memory has 165 entries (ran 5 events, expected ≥ 0 new entries from this run)
## Workers touched across the week
0 distinct workers made it through to a decision. Every one is accounted for below — no-shows flagged, rebookings noted, everyone visible.
| Worker ID | Name | Events | Outcome |
|---|---|---|---|
## Discovered patterns (meta-index)
What the system identified across semantically-similar past fills as each event ran:
- **08:00 baseline_fill** (Warehouse Associate): —
- **10:30 recurring** (Machine Operator): —
- **12:15 expansion** (Forklift Operator): —
- **14:00 emergency** (Loader): —
- **15:45 misplacement** (Warehouse Associate): —
## Narrative
- 0/5 events reached consensus.
- Final roster: 0 bookings across 0 distinct workers.
- Workers touched (booked, failed, or otherwise decided): 0.
- Playbook citations across the day: 0 (proof the feedback loop fired across events).
- Dropped events: 08:00 baseline_fill, 10:30 recurring, 12:15 expansion, 14:00 emergency, 15:45 misplacement.


@ -0,0 +1,104 @@
[
{
"event": {
"kind": "baseline_fill",
"at": "08:00",
"role": "Warehouse Associate",
"count": 3,
"city": "Toledo",
"state": "OH",
"shift_start": "08:00 AM",
"scenario_note": "Regular Monday morning shift, 8-hour."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 6.434,
"error": "invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"sql\",\"args\":{\"query\":\"SELECT worker_id, name FROM workers_500k_v1 WHERE role = 'Warehouse Associate' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 LIMIT 3\"},\"rationale\":\"verify top candidates via SQL query\")}",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"sql\",\"args\":{\"query\":\"SELECT worker_id, name FROM workers_500k_v1 WHERE role = 'Warehouse Associate' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 LIMIT 3\"},\"rationale\":\"verify top candidates via SQL query\")}"
]
},
{
"event": {
"kind": "recurring",
"at": "10:30",
"role": "Machine Operator",
"count": 2,
"city": "Toledo",
"state": "OH",
"shift_start": "11:00 AM",
"scenario_note": "Recurring Tuesday/Thursday slot — prior workers may still be available."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 16.752,
"error": "invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND reliability >= 0.7\",\"question\":\"machine operator Toledo OH high reliability\",\"k\":2}",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND reliability >= 0.7\",\"question\":\"machine operator Toledo OH high reliability\",\"k\":2}"
]
},
{
"event": {
"kind": "expansion",
"at": "12:15",
"role": "Forklift Operator",
"count": 5,
"city": "Toledo",
"state": "OH",
"shift_start": "01:00 PM",
"scenario_note": "New warehouse location opening, five-worker team needed."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 7.181,
"error": "invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"sql\",\"args\":{\"query\":\"SELECT worker_id FROM workers_500k_v1 WHERE role = 'Forklift Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND CAST(reliability AS DOUBLE) > 0.75 LIMIT 5\"},\"rationale\":\"verify top candidates via SQL query to me",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"sql\",\"args\":{\"query\":\"SELECT worker_id FROM workers_500k_v1 WHERE role = 'Forklift Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND CAST(reliability AS DOUBLE) > 0.75 LIMIT 5\"},\"rationale\":\"verify top candidates via SQL query to me"
]
},
{
"event": {
"kind": "emergency",
"at": "14:00",
"role": "Loader",
"count": 4,
"city": "Toledo",
"state": "OH",
"shift_start": "04:00 PM same day",
"deadline": "16:00",
"scenario_note": "Walkoff incident — replacement crew needed by 16:00 sharp."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 54.028,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
},
{
"event": {
"kind": "misplacement",
"at": "15:45",
"role": "Warehouse Associate",
"count": 1,
"city": "Toledo",
"state": "OH",
"shift_start": "remainder of 08:00 shift",
"scenario_note": "One worker from the 08:00 fill didn't show; rebuild the gap.",
"replaces_event": "08:00"
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 49.298,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
}
]


@ -0,0 +1 @@
# SMS drafts — Riverfront Steel, 2026-04-21


@ -0,0 +1 @@
# Client emails — Riverfront Steel, 2026-04-21


@ -0,0 +1,55 @@
# Scenario retrospective — Riverfront Steel, 2026-04-21
Executor: `mistral:latest` Reviewer: `qwen2.5:latest` Draft: `qwen2.5:latest`
## Events
| At | Kind | Role / Count | Pool | Fills | Turns | Dur(s) | Cites | Gaps |
|---|---|---|---|---|---|---|---|---|
| 08:00 | baseline_fill | Warehouse Associate × 3 | - | ✗ 0 | 0 | 47.4 | 0 | 1 |
| 10:30 | recurring | Machine Operator × 2 | - | ✗ 0 | 0 | 40.4 | 0 | 1 |
| 12:15 | expansion | Forklift Operator × 5 | - | ✗ 0 | 0 | 9.4 | 0 | 1 |
| 14:00 | emergency | Loader × 4 | - | ✗ 0 | 0 | 44.7 | 0 | 1 |
| 15:45 | misplacement | Warehouse Associate × 1 | - | ✗ 0 | 0 | 45.1 | 0 | 1 |
## Final roster
| Worker | Booked | Role | City, ST | Status |
|---|---|---|---|---|
## Gap signals
### drift_or_tool
- **08:00** — no consensus after 14 turns
- **10:30** — aborted — 3 consecutive drift flags
- **12:15** — invalid JSON from executor: JSON Parse error: Expected '}' | raw: {"kind":"tool_call","tool":"propose_done","args":{"fills":[{"candidate_id":"W500K-37736","name":"Jennifer K. Robinson","reason":"verified Toledo forklift op, reliability 0.9"}],"rationale":"one SQL-verified candidate from surfaced candidates"}
- **14:00** — aborted — 3 consecutive drift flags
- **15:45** — no consensus after 14 turns
### write_through_audit
- _post-run_ — playbook_memory has 165 entries (ran 5 events, expected ≥ 0 new entries from this run)
## Workers touched across the week
0 distinct workers made it through to a decision. Every one is accounted for below — no-shows flagged, rebookings noted, everyone visible.
| Worker ID | Name | Events | Outcome |
|---|---|---|---|
## Discovered patterns (meta-index)
What the system identified across semantically-similar past fills as each event ran:
- **08:00 baseline_fill** (Warehouse Associate): —
- **10:30 recurring** (Machine Operator): —
- **12:15 expansion** (Forklift Operator): —
- **14:00 emergency** (Loader): —
- **15:45 misplacement** (Warehouse Associate): —
## Narrative
- 0/5 events reached consensus.
- Final roster: 0 bookings across 0 distinct workers.
- Workers touched (booked, failed, or otherwise decided): 0.
- Playbook citations across the day: 0 (proof the feedback loop fired across events).
- Dropped events: 08:00 baseline_fill, 10:30 recurring, 12:15 expansion, 14:00 emergency, 15:45 misplacement.


@ -0,0 +1,104 @@
[
{
"event": {
"kind": "baseline_fill",
"at": "08:00",
"role": "Warehouse Associate",
"count": 3,
"city": "Toledo",
"state": "OH",
"shift_start": "08:00 AM",
"scenario_note": "Regular Monday morning shift, 8-hour."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 47.404,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
},
{
"event": {
"kind": "recurring",
"at": "10:30",
"role": "Machine Operator",
"count": 2,
"city": "Toledo",
"state": "OH",
"shift_start": "11:00 AM",
"scenario_note": "Recurring Tuesday/Thursday slot — prior workers may still be available."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 40.374,
"error": "aborted — 3 consecutive drift flags",
"gap_signals": [
"drift_or_tool: aborted — 3 consecutive drift flags"
]
},
{
"event": {
"kind": "expansion",
"at": "12:15",
"role": "Forklift Operator",
"count": 5,
"city": "Toledo",
"state": "OH",
"shift_start": "01:00 PM",
"scenario_note": "New warehouse location opening, five-worker team needed."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 9.414,
"error": "invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"propose_done\",\"args\":{\"fills\":[{\"candidate_id\":\"W500K-37736\",\"name\":\"Jennifer K. Robinson\",\"reason\":\"verified Toledo forklift op, reliability 0.9\"}],\"rationale\":\"one SQL-verified candidate from surfaced candidates\"}",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"propose_done\",\"args\":{\"fills\":[{\"candidate_id\":\"W500K-37736\",\"name\":\"Jennifer K. Robinson\",\"reason\":\"verified Toledo forklift op, reliability 0.9\"}],\"rationale\":\"one SQL-verified candidate from surfaced candidates\"}"
]
},
{
"event": {
"kind": "emergency",
"at": "14:00",
"role": "Loader",
"count": 4,
"city": "Toledo",
"state": "OH",
"shift_start": "04:00 PM same day",
"deadline": "16:00",
"scenario_note": "Walkoff incident — replacement crew needed by 16:00 sharp."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 44.673,
"error": "aborted — 3 consecutive drift flags",
"gap_signals": [
"drift_or_tool: aborted — 3 consecutive drift flags"
]
},
{
"event": {
"kind": "misplacement",
"at": "15:45",
"role": "Warehouse Associate",
"count": 1,
"city": "Toledo",
"state": "OH",
"shift_start": "remainder of 08:00 shift",
"scenario_note": "One worker from the 08:00 fill didn't show; rebuild the gap.",
"replaces_event": "08:00"
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 45.149,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
}
]


@ -0,0 +1 @@
# SMS drafts — Riverfront Steel, 2026-04-21


@ -0,0 +1,2 @@
{"at":"12:15","kind":"expansion","operation":"fill: Forklift Operator x5 in Toledo, OH","fills":[{"candidate_id":"W500K-37736","name":"Jennifer K. Robinson","reason":"Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."},{"candidate_id":"W500K-33961","name":"Kyle F. Brooks","reason":"Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."},{"candidate_id":"W500K-31297","name":"Jacob T. Diaz","reason":"Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."},{"candidate_id":"W500K-40884","name":"Jerry M. Jones","reason":"Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."},{"candidate_id":"W500K-37729","name":"Jeffrey D. Taylor","reason":"Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."}],"turns":7,"duration_secs":28.23,"pool_size":687,"playbook_citations":[],"discovered_pattern":"Across 25 similar past playbooks (10 workers examined) · recurring certifications: Forklift (40%), OSHA-10 (40%) · recurring skills: mill (40%) · archetype mostly: leader · reliability median 0.83 (range 0.66–0.96)"}
{"at":"14:00","kind":"emergency","operation":"fill: Loader x4 in Toledo, OH","fills":[{"candidate_id":"W500K-15305","name":"Mary R. Richardson","reason":"Verified availability score of 0.988 via SQL and ranked highest among the candidates with an availability score greater than 0.7."},{"candidate_id":"W500K-12325","name":"Raj Torres","reason":"Ranked second among the candidates with an availability score greater than 0.7."},{"candidate_id":"W500K-16975","name":"Brian X. Price","reason":"Ranked third among the candidates with an availability score greater than 0.7."},{"candidate_id":"W500K-22851","name":"Fatima X. Gutierrez","reason":"Ranked fourth among the candidates with an availability score greater than 0.7."}],"turns":6,"duration_secs":22.25,"pool_size":380,"playbook_citations":[],"discovered_pattern":"Across 25 similar past playbooks (9 workers examined) · recurring certifications: Forklift (44%) · recurring skills: mill (44%) · archetype mostly: leader · reliability median 0.80 (range 0.66–0.96)"}


@ -0,0 +1,40 @@
# Client emails — Riverfront Steel, 2026-04-21
## 12:15 expansion — Forklift Operator x5
Subject: 5 Workers Confirmed
Dear Riverfront Steel Team,
We are pleased to confirm that we have filled all five positions for Forklift Operators at your new warehouse location opening. The workers starting at 01:00 PM today are:
- Jennifer K. Robinson
- Kyle F. Brooks
- Jacob T. Diaz
- Jerry M. Jones
- Jeffrey D. Taylor
Each meets the criteria of being a Forklift Operator in Toledo, OH.
Best regards,
Dispatch Team Lakehouse
## 14:00 emergency — Loader x4
Subject: 4 Loader Workers Confirmed
Dear Riverfront Steel Team,
I am pleased to confirm that we have filled all four loader positions as requested:
- Mary R. Richardson
- Raj Torres
- Brian X. Price
- Fatima X. Gutierrez
All workers will start their shift at 04:00 PM today. Please note the walkoff incident requiring a replacement crew by 16:00 sharp.
Thank you for your trust in Lakehouse Dispatch.
Best regards,
Dispatch Team


@ -0,0 +1,85 @@
# Scenario retrospective — Riverfront Steel, 2026-04-21
Executor: `mistral:latest` Reviewer: `qwen2.5:latest` Draft: `qwen2.5:latest`
## Events
| At | Kind | Role / Count | Pool | Fills | Turns | Dur(s) | Cites | Gaps |
|---|---|---|---|---|---|---|---|---|
| 08:00 | baseline_fill | Warehouse Associate × 3 | - | ✗ 0 | 0 | 20.2 | 0 | 1 |
| 10:30 | recurring | Machine Operator × 2 | - | ✗ 0 | 0 | 47.4 | 0 | 1 |
| 12:15 | expansion | Forklift Operator × 5 | 687 | ✓ 5 | 7 | 28.2 | 0 | 4 |
| 14:00 | emergency | Loader × 4 | 380 | ✓ 4 | 6 | 22.3 | 0 | 4 |
| 15:45 | misplacement | Warehouse Associate × 1 | - | ✗ 0 | 0 | 52.5 | 0 | 1 |
## Final roster
| Worker | Booked | Role | City, ST | Status |
|---|---|---|---|---|
| undefined Jennifer K. Robinson | 12:15 | Forklift Operator | Toledo, OH | confirmed |
| undefined Kyle F. Brooks | 12:15 | Forklift Operator | Toledo, OH | confirmed |
| undefined Jacob T. Diaz | 12:15 | Forklift Operator | Toledo, OH | confirmed |
| undefined Jerry M. Jones | 12:15 | Forklift Operator | Toledo, OH | confirmed |
| undefined Jeffrey D. Taylor | 12:15 | Forklift Operator | Toledo, OH | confirmed |
| undefined Mary R. Richardson | 14:00 | Loader | Toledo, OH | confirmed |
| undefined Raj Torres | 14:00 | Loader | Toledo, OH | confirmed |
| undefined Brian X. Price | 14:00 | Loader | Toledo, OH | confirmed |
| undefined Fatima X. Gutierrez | 14:00 | Loader | Toledo, OH | confirmed |
## Gap signals
### drift_or_tool
- **08:00** — invalid JSON from executor: JSON Parse error: Invalid escape character ' | raw: {"kind":"plan","steps":["TOOL_CALL hybrid_search({'index_name':'workers_500k_v1','sql_filter':'role = \'Warehouse Associate\' AND city = \'Toledo\' AND state = \'OH\' AND CAST(availability AS DOUBLE) > 0.5','question':'reliable warehouse associate Toledo'})",
"TOOL_CALL sql({'query':'SELECT worker_i
- **10:30** — no consensus after 14 turns
- **15:45** — no consensus after 14 turns
### double_book
- **12:15** — undefined Kyle F. Brooks already booked for 12:15
- **12:15** — undefined Jacob T. Diaz already booked for 12:15
- **12:15** — undefined Jerry M. Jones already booked for 12:15
- **12:15** — undefined Jeffrey D. Taylor already booked for 12:15
- **14:00** — undefined Mary R. Richardson already booked for 12:15
- **14:00** — undefined Raj Torres already booked for 12:15
- **14:00** — undefined Brian X. Price already booked for 12:15
- **14:00** — undefined Fatima X. Gutierrez already booked for 12:15
### fairness
- _cross-event_ — Jennifer K. Robinson (undefined) booked 9 times today
### write_through_audit
- _post-run_ — playbook_memory has 167 entries (ran 5 events, expected ≥ 2 new entries from this run)
## Workers touched across the week
9 distinct workers made it through to a decision. Every one is accounted for below — no-shows flagged, rebookings noted, everyone visible.
| Worker ID | Name | Events | Outcome |
|---|---|---|---|
| W500K-37736 | Jennifer K. Robinson | 12:15 expansion | booked |
| W500K-33961 | Kyle F. Brooks | 12:15 expansion | booked |
| W500K-31297 | Jacob T. Diaz | 12:15 expansion | booked |
| W500K-40884 | Jerry M. Jones | 12:15 expansion | booked |
| W500K-37729 | Jeffrey D. Taylor | 12:15 expansion | booked |
| W500K-15305 | Mary R. Richardson | 14:00 emergency | booked |
| W500K-12325 | Raj Torres | 14:00 emergency | booked |
| W500K-16975 | Brian X. Price | 14:00 emergency | booked |
| W500K-22851 | Fatima X. Gutierrez | 14:00 emergency | booked |
## Discovered patterns (meta-index)
What the system identified across semantically-similar past fills as each event ran:
- **08:00 baseline_fill** (Warehouse Associate): —
- **10:30 recurring** (Machine Operator): —
- **12:15 expansion** (Forklift Operator): Across 25 similar past playbooks (10 workers examined) · recurring certifications: Forklift (40%), OSHA-10 (40%) · recurring skills: mill (40%) · archetype mostly: leader · reliability median 0.83 (range 0.66–0.96)
- **14:00 emergency** (Loader): Across 25 similar past playbooks (9 workers examined) · recurring certifications: Forklift (44%) · recurring skills: mill (44%) · archetype mostly: leader · reliability median 0.80 (range 0.66–0.96)
- **15:45 misplacement** (Warehouse Associate): —
## Narrative
- 2/5 events reached consensus.
- Final roster: 9 bookings across 1 distinct workers.
- Workers touched (booked, failed, or otherwise decided): 9.
- Playbook citations across the day: 0 (proof the feedback loop fired across events).
- Dropped events: 08:00 baseline_fill, 10:30 recurring, 15:45 misplacement.


@ -0,0 +1,165 @@
[
{
"event": {
"kind": "baseline_fill",
"at": "08:00",
"role": "Warehouse Associate",
"count": 3,
"city": "Toledo",
"state": "OH",
"shift_start": "08:00 AM",
"scenario_note": "Regular Monday morning shift, 8-hour."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 20.215,
"error": "invalid JSON from executor: JSON Parse error: Invalid escape character ' | raw: {\"kind\":\"plan\",\"steps\":[\"TOOL_CALL hybrid_search({'index_name':'workers_500k_v1','sql_filter':'role = \\'Warehouse Associate\\' AND city = \\'Toledo\\' AND state = \\'OH\\' AND CAST(availability AS DOUBLE) > 0.5','question':'reliable warehouse associate Toledo'})\",\n\"TOOL_CALL sql({'query':'SELECT worker_i",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Invalid escape character ' | raw: {\"kind\":\"plan\",\"steps\":[\"TOOL_CALL hybrid_search({'index_name':'workers_500k_v1','sql_filter':'role = \\'Warehouse Associate\\' AND city = \\'Toledo\\' AND state = \\'OH\\' AND CAST(availability AS DOUBLE) > 0.5','question':'reliable warehouse associate Toledo'})\",\n\"TOOL_CALL sql({'query':'SELECT worker_i"
]
},
{
"event": {
"kind": "recurring",
"at": "10:30",
"role": "Machine Operator",
"count": 2,
"city": "Toledo",
"state": "OH",
"shift_start": "11:00 AM",
"scenario_note": "Recurring Tuesday/Thursday slot — prior workers may still be available."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 47.392,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
},
{
"event": {
"kind": "expansion",
"at": "12:15",
"role": "Forklift Operator",
"count": 5,
"city": "Toledo",
"state": "OH",
"shift_start": "01:00 PM",
"scenario_note": "New warehouse location opening, five-worker team needed."
},
"ok": true,
"fills": [
{
"candidate_id": "W500K-37736",
"name": "Jennifer K. Robinson",
"reason": "Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."
},
{
"candidate_id": "W500K-33961",
"name": "Kyle F. Brooks",
"reason": "Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."
},
{
"candidate_id": "W500K-31297",
"name": "Jacob T. Diaz",
"reason": "Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."
},
{
"candidate_id": "W500K-40884",
"name": "Jerry M. Jones",
"reason": "Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."
},
{
"candidate_id": "W500K-37729",
"name": "Jeffrey D. Taylor",
"reason": "Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."
}
],
"turns": 7,
"duration_secs": 28.23,
"gap_signals": [
"double_book: undefined Kyle F. Brooks already booked for 12:15",
"double_book: undefined Jacob T. Diaz already booked for 12:15",
"double_book: undefined Jerry M. Jones already booked for 12:15",
"double_book: undefined Jeffrey D. Taylor already booked for 12:15"
],
"sources_first_score": 0.6336688,
"sources_last_score": 0.55183524,
"pool_size": 687,
"playbook_citations": [],
"discovered_pattern": "Across 25 similar past playbooks (10 workers examined) · recurring certifications: Forklift (40%), OSHA-10 (40%) · recurring skills: mill (40%) · archetype mostly: leader · reliability median 0.83 (range 0.66–0.96)"
},
{
"event": {
"kind": "emergency",
"at": "14:00",
"role": "Loader",
"count": 4,
"city": "Toledo",
"state": "OH",
"shift_start": "04:00 PM same day",
"deadline": "16:00",
"scenario_note": "Walkoff incident — replacement crew needed by 16:00 sharp."
},
"ok": true,
"fills": [
{
"candidate_id": "W500K-15305",
"name": "Mary R. Richardson",
"reason": "Verified availability score of 0.988 via SQL and ranked highest among the candidates with an availability score greater than 0.7."
},
{
"candidate_id": "W500K-12325",
"name": "Raj Torres",
"reason": "Ranked second among the candidates with an availability score greater than 0.7."
},
{
"candidate_id": "W500K-16975",
"name": "Brian X. Price",
"reason": "Ranked third among the candidates with an availability score greater than 0.7."
},
{
"candidate_id": "W500K-22851",
"name": "Fatima X. Gutierrez",
"reason": "Ranked fourth among the candidates with an availability score greater than 0.7."
}
],
"turns": 6,
"duration_secs": 22.25,
"gap_signals": [
"double_book: undefined Mary R. Richardson already booked for 12:15",
"double_book: undefined Raj Torres already booked for 12:15",
"double_book: undefined Brian X. Price already booked for 12:15",
"double_book: undefined Fatima X. Gutierrez already booked for 12:15"
],
"sources_first_score": 0.73792297,
"sources_last_score": 0.7001053,
"pool_size": 380,
"playbook_citations": [],
"discovered_pattern": "Across 25 similar past playbooks (9 workers examined) · recurring certifications: Forklift (44%) · recurring skills: mill (44%) · archetype mostly: leader · reliability median 0.80 (range 0.660.96)"
},
{
"event": {
"kind": "misplacement",
"at": "15:45",
"role": "Warehouse Associate",
"count": 1,
"city": "Toledo",
"state": "OH",
"shift_start": "remainder of 08:00 shift",
"scenario_note": "One worker from the 08:00 fill didn't show; rebuild the gap.",
"replaces_event": "08:00"
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 52.523,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
}
]


@ -0,0 +1,74 @@
[
{
"name": "Jennifer K. Robinson",
"booked_for": "12:15",
"role": "Forklift Operator",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Kyle F. Brooks",
"booked_for": "12:15",
"role": "Forklift Operator",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Jacob T. Diaz",
"booked_for": "12:15",
"role": "Forklift Operator",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Jerry M. Jones",
"booked_for": "12:15",
"role": "Forklift Operator",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Jeffrey D. Taylor",
"booked_for": "12:15",
"role": "Forklift Operator",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Mary R. Richardson",
"booked_for": "14:00",
"role": "Loader",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Raj Torres",
"booked_for": "14:00",
"role": "Loader",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Brian X. Price",
"booked_for": "14:00",
"role": "Loader",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Fatima X. Gutierrez",
"booked_for": "14:00",
"role": "Loader",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
}
]


@ -0,0 +1,46 @@
# SMS drafts — Riverfront Steel, 2026-04-21
## 12:15 expansion — Forklift Operator x5 in Toledo, OH
TO: Jennifer K. Robinson
Confirming your shift as a Forklift Operator at Riverfront Steel's new warehouse in Toledo, OH starting 1:00 PM.
---
TO: Kyle F. Brooks
Your shift as a Forklift Operator at the new Toledo, OH warehouse starts at 1:00 PM today.
---
TO: Jacob T. Diaz
Confirm your shift as a Forklift Operator at Riverfront Steel's new Toledo, OH location starting at 1:00 PM.
---
TO: Jerry M. Jones
Your shift as a Forklift Operator at the new Toledo, OH warehouse starts at 1:00 PM today.
---
TO: Jeffrey D. Taylor
Confirming your shift as a Forklift Operator at Riverfront Steel's new warehouse in Toledo, OH starting 1:00 PM.
## 14:00 emergency — Loader x4 in Toledo, OH
TO: Mary R. Richardson
Confirming your shift start at 4 PM today as a replacement. See you at Toledo, OH.
---
TO: Raj Torres
Replacing shift starting now at 4 PM. Toledo, OH.
---
TO: Brian X. Price
You're on at 4 PM replacing the crew. Toledo, OH.
---
TO: Fatima X. Gutierrez
Confirming your walkoff shift start at 4 PM today. Toledo, OH.


@ -0,0 +1,2 @@
{"after_event":"12:15","event_kind":"expansion","ok":true,"model":"gpt-oss:20b","duration_secs":10.228,"risk":"Forklift Operator JSON error","hint":"Ensure JSON is valid; test with a JSON validator; correct syntax before executing the tool call."}
{"after_event":"15:45","event_kind":"misplacement","ok":false,"model":"gpt-oss:20b","duration_secs":13.935,"hint":"(T3 unavailable)","risk":"generate returned empty text from gpt-oss:20b: {\"text\":\"\",\"model\":\"gpt-oss:20b\","}


@ -0,0 +1 @@
{"at":"08:00","kind":"baseline_fill","operation":"fill: Warehouse Associate x3 in Toledo, OH","fills":[{"candidate_id":"W500K-49164","name":"Christopher Y. Phillips","reason":"SQL verified for high availability and semantic score of 0.63, making him the top candidate."},{"candidate_id":"W500K-34704","name":"Fatima U. Rivera","reason":"Semantic score of 0.61 and skills in cold storage make her a strong candidate."},{"candidate_id":"W500K-40928","name":"Janet E. Hill","reason":"Semantic score of 0.61, RF scanner skill, and high reliability score make her a suitable candidate."}],"turns":5,"duration_secs":19.474,"pool_size":770,"playbook_citations":[],"discovered_pattern":"Across 25 similar past playbooks (6 workers examined) · recurring certifications: Forklift (67%), OSHA-10 (50%) · recurring skills: mill (50%), 6S (50%) · archetype mostly: communicator · reliability median 0.83 (range 0.750.96)"}


@ -0,0 +1,18 @@
# Client emails — Riverfront Steel, 2026-04-21
## 08:00 baseline_fill — Warehouse Associate x3
Subject: 3 Filled
Dear Riverfront Steel Team,
I am pleased to confirm that we have filled all three positions with the following Warehouse Associates:
- Christopher Y. Phillips
- Fatima U. Rivera
- Janet E. Hill
Shift starts at 08:00 AM on a regular Monday morning, 8-hour shift.
Best regards,
Dispatch Team Lakehouse


@ -0,0 +1,9 @@
# Cross-day lesson — Riverfront Steel, 2026-04-21
_Generated by `gpt-oss:20b` in 7.1s. Based on 5 events + 2 mid-day checkpoints._
Validate every JSON payload with a validator before invoking a tool; a malformed payload caused the Forklift Operator expansion to fail.
Confirm the GPT model is available and that the tool returns non-empty text; if it returns an empty string, retry or switch to a fallback model.
For recurring, expansion, and emergency events, prefetch the candidate pool and verify it meets the required count before attempting placement.
Log any tool failures immediately and update the risk mitigation plan for the next run.
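The first two lessons above can be sketched as small guards. This is a minimal illustration, not the project's actual implementation: `safe_tool_call` and `generate_with_fallback` are hypothetical helper names, and the model identifiers are taken from the run logs in this commit.

```python
import json

def safe_tool_call(raw_payload: str):
    """Parse and validate a tool-call payload before dispatching it.

    Returns the parsed dict, or None if the payload is malformed or
    not a usable tool call -- validate first, then invoke the tool.
    """
    try:
        payload = json.loads(raw_payload)
    except json.JSONDecodeError:
        return None  # malformed JSON: skip the call instead of crashing the event
    if payload.get("kind") != "tool_call" or "tool" not in payload:
        return None  # valid JSON but not the expected tool-call shape
    return payload

def generate_with_fallback(generate, prompt,
                           models=("gpt-oss:20b", "qwen2.5:latest")):
    """Try each model in turn; treat empty text as a failure and fall back."""
    for model in models:
        text = generate(model=model, prompt=prompt)
        if text and text.strip():
            return model, text
    raise RuntimeError("all models returned empty text")
```

A guard like this would have turned the "invalid JSON from executor" and "generate returned empty text" failures in these logs into recoverable retries rather than dropped events.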


@ -0,0 +1,71 @@
# Scenario retrospective — Riverfront Steel, 2026-04-21
Executor: `mistral:latest` Reviewer: `qwen2.5:latest` Draft: `qwen2.5:latest` Overview(T3): `gpt-oss:20b`
## Events
| At | Kind | Role / Count | Pool | Fills | Turns | Dur(s) | Cites | Gaps |
|---|---|---|---|---|---|---|---|---|
| 08:00 | baseline_fill | Warehouse Associate × 3 | 770 | ✓ 3 | 5 | 19.5 | 0 | 2 |
| 10:30 | recurring | Machine Operator × 2 | - | ✗ 0 | 0 | 49.0 | 0 | 1 |
| 12:15 | expansion | Forklift Operator × 5 | - | ✗ 0 | 0 | 2.8 | 0 | 1 |
| 14:00 | emergency | Loader × 4 | - | ✗ 0 | 0 | 48.9 | 0 | 1 |
| 15:45 | misplacement | Warehouse Associate × 1 | - | ✗ 0 | 0 | 47.8 | 0 | 1 |
## Final roster
| Worker | Booked | Role | City, ST | Status |
|---|---|---|---|---|
| undefined Christopher Y. Phillips | 08:00 | Warehouse Associate | Toledo, OH | no_show |
| undefined Fatima U. Rivera | 08:00 | Warehouse Associate | Toledo, OH | confirmed |
| undefined Janet E. Hill | 08:00 | Warehouse Associate | Toledo, OH | confirmed |
## Gap signals
### double_book
- **08:00** — undefined Fatima U. Rivera already booked for 08:00
- **08:00** — undefined Janet E. Hill already booked for 08:00
### drift_or_tool
- **10:30** — no consensus after 14 turns
- **12:15** — invalid JSON from executor: JSON Parse error: Expected '}' | raw: {"kind":"tool_call","tool":"hybrid_search",
"args":{"index_name":"workers_500k_v1",
"sql_filter":"role = 'Forklift Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND CAST(reliability AS DOUBLE) > 0.75 AND worker_id NOT IN (42319, 68741, 34927)",
"rationale":"Se
- **14:00** — no consensus after 14 turns
- **15:45** — no consensus after 14 turns
### fairness
- _cross-event_ — Christopher Y. Phillips (undefined) booked 2 times today
### write_through_audit
- _post-run_ — playbook_memory has 1163 entries (ran 5 events, expected ≥ 1 new entries from this run)
## Workers touched across the week
4 distinct workers made it through to a decision. Every one is accounted for below — no-shows flagged, rebookings noted, everyone visible.
| Worker ID | Name | Events | Outcome |
|---|---|---|---|
| W500K-49164 | Christopher Y. Phillips | 08:00 baseline_fill | booked |
| W500K-34704 | Fatima U. Rivera | 08:00 baseline_fill | booked |
| W500K-40928 | Janet E. Hill | 08:00 baseline_fill | booked |
| undefined | Christopher Y. Phillips | 08:00 | no_show |
## Discovered patterns (meta-index)
What the system identified across semantically-similar past fills as each event ran:
- **08:00 baseline_fill** (Warehouse Associate): Across 25 similar past playbooks (6 workers examined) · recurring certifications: Forklift (67%), OSHA-10 (50%) · recurring skills: mill (50%), 6S (50%) · archetype mostly: communicator · reliability median 0.83 (range 0.75–0.96)
- **10:30 recurring** (Machine Operator): —
- **12:15 expansion** (Forklift Operator): —
- **14:00 emergency** (Loader): —
- **15:45 misplacement** (Warehouse Associate): —
## Narrative
- 1/5 events reached consensus.
- Final roster: 3 bookings across 1 distinct workers.
- Workers touched (booked, failed, or otherwise decided): 4.
- Playbook citations across the day: 0 (proof the feedback loop fired across events).
- Dropped events: 10:30 recurring, 12:15 expansion, 14:00 emergency, 15:45 misplacement.


@ -0,0 +1,130 @@
[
{
"event": {
"kind": "baseline_fill",
"at": "08:00",
"role": "Warehouse Associate",
"count": 3,
"city": "Toledo",
"state": "OH",
"shift_start": "08:00 AM",
"scenario_note": "Regular Monday morning shift, 8-hour."
},
"ok": true,
"fills": [
{
"candidate_id": "W500K-49164",
"name": "Christopher Y. Phillips",
"reason": "SQL verified for high availability and semantic score of 0.63, making him the top candidate."
},
{
"candidate_id": "W500K-34704",
"name": "Fatima U. Rivera",
"reason": "Semantic score of 0.61 and skills in cold storage make her a strong candidate."
},
{
"candidate_id": "W500K-40928",
"name": "Janet E. Hill",
"reason": "Semantic score of 0.61, RF scanner skill, and high reliability score make her a suitable candidate."
}
],
"turns": 5,
"duration_secs": 19.474,
"gap_signals": [
"double_book: undefined Fatima U. Rivera already booked for 08:00",
"double_book: undefined Janet E. Hill already booked for 08:00"
],
"sources_first_score": 0.6233225,
"sources_last_score": 0.55385745,
"pool_size": 770,
"playbook_citations": [],
"discovered_pattern": "Across 25 similar past playbooks (6 workers examined) · recurring certifications: Forklift (67%), OSHA-10 (50%) · recurring skills: mill (50%), 6S (50%) · archetype mostly: communicator · reliability median 0.83 (range 0.750.96)"
},
{
"event": {
"kind": "recurring",
"at": "10:30",
"role": "Machine Operator",
"count": 2,
"city": "Toledo",
"state": "OH",
"shift_start": "11:00 AM",
"scenario_note": "Recurring Tuesday/Thursday slot — prior workers may still be available."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 48.986,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
},
{
"event": {
"kind": "expansion",
"at": "12:15",
"role": "Forklift Operator",
"count": 5,
"city": "Toledo",
"state": "OH",
"shift_start": "01:00 PM",
"scenario_note": "New warehouse location opening, five-worker team needed."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 2.845,
"error": "invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\n\"args\":{\"index_name\":\"workers_500k_v1\",\n\"sql_filter\":\"role = 'Forklift Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND CAST(reliability AS DOUBLE) > 0.75 AND worker_id NOT IN (42319, 68741, 34927)\",\n\"rationale\":\"Se",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\n\"args\":{\"index_name\":\"workers_500k_v1\",\n\"sql_filter\":\"role = 'Forklift Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND CAST(reliability AS DOUBLE) > 0.75 AND worker_id NOT IN (42319, 68741, 34927)\",\n\"rationale\":\"Se"
]
},
{
"event": {
"kind": "emergency",
"at": "14:00",
"role": "Loader",
"count": 4,
"city": "Toledo",
"state": "OH",
"shift_start": "04:00 PM same day",
"deadline": "16:00",
"scenario_note": "Walkoff incident — replacement crew needed by 16:00 sharp."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 48.905,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
},
{
"event": {
"kind": "misplacement",
"at": "15:45",
"role": "Warehouse Associate",
"count": 1,
"city": "Toledo",
"state": "OH",
"shift_start": "remainder of 08:00 shift",
"scenario_note": "One worker from the 08:00 fill didn't show; rebuild the gap.",
"replaces_event": "08:00",
"exclude_worker_ids": [
null,
null,
null
]
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 47.789,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
}
]


@ -0,0 +1,26 @@
[
{
"name": "Christopher Y. Phillips",
"booked_for": "08:00",
"role": "Warehouse Associate",
"city": "Toledo",
"state": "OH",
"status": "no_show"
},
{
"name": "Fatima U. Rivera",
"booked_for": "08:00",
"role": "Warehouse Associate",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Janet E. Hill",
"booked_for": "08:00",
"role": "Warehouse Associate",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
}
]


@ -0,0 +1,16 @@
# SMS drafts — Riverfront Steel, 2026-04-21
## 08:00 baseline_fill — Warehouse Associate x3 in Toledo, OH
TO: Christopher Y. Phillips
Confirming your shift as a Warehouse Associate at Riverfront Steel in Toledo, OH starting 08:00 AM today.
---
TO: Fatima U. Rivera
Your shift as a Warehouse Associate at Riverfront Steel is confirmed for 08:00 AM today.
---
TO: Janet E. Hill
Confirming your 08:00 AM shift as a Warehouse Associate at Riverfront Steel in Toledo, OH.


@ -0,0 +1,2 @@
{"after_event":"12:15","event_kind":"expansion","ok":true,"model":"gpt-oss:20b","duration_secs":10.901,"risk":"JSON parse error","hint":"Validate JSON structure, close braces, escape quotes, and test with a JSON linter before executing hybrid_search."}
{"after_event":"15:45","event_kind":"misplacement","ok":true,"model":"gpt-oss:20b","duration_secs":11.83,"risk":"JSON parsing failure in tool call","hint":"Ensure JSON syntax is correct before invoking hybrid_search for Warehouse Associate in Toledo, OH. Validate tool call structure."}


@ -0,0 +1 @@
# Client emails — Riverfront Steel, 2026-04-21


@ -0,0 +1,6 @@
# Cross-day lesson — Riverfront Steel, 2026-04-21
_Generated by `gpt-oss:20b` in 4.0s. Based on 5 events + 2 mid-day checkpoints._
Always validate the JSON payload before calling `hybrid_search`. Ensure all braces are closed, quotes are escaped, and the structure matches the expected schema—use a linter or schema validator in a sandbox first. Construct the JSON programmatically or via a template rather than embedding raw text in the tool call. This prevents parse errors that cause job failures.
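The "construct the JSON programmatically" advice above can be sketched as follows. This is a hedged illustration only: `build_hybrid_search_call` is a hypothetical helper, and the index name and filter columns are taken from the tool calls visible in these logs.

```python
import json

def build_hybrid_search_call(role, city, state, exclude_ids=(),
                             min_availability=0.5):
    """Build the hybrid_search payload as a dict and serialize it with
    json.dumps, so braces, quoting, and escaping are correct by construction
    instead of hand-assembled in raw model text."""
    sql_filter = (
        f"role = '{role}' AND city = '{city}' AND state = '{state}' "
        f"AND CAST(availability AS DOUBLE) > {min_availability}"
    )
    if exclude_ids:
        ids = ", ".join(f"'{i}'" for i in exclude_ids)
        sql_filter += f" AND worker_id NOT IN ({ids})"
    return json.dumps({
        "kind": "tool_call",
        "tool": "hybrid_search",
        "args": {"index_name": "workers_500k_v1", "sql_filter": sql_filter},
    })
```

A payload built this way always round-trips through `json.loads`, which is exactly the class of "Expected '}'" parse errors these retrospectives keep hitting. (String-interpolated SQL is kept here only to mirror the logged filters; real callers should bind parameters.)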


@ -0,0 +1,58 @@
# Scenario retrospective — Riverfront Steel, 2026-04-21
Executor: `mistral:latest` Reviewer: `qwen2.5:latest` Draft: `qwen2.5:latest` Overview(T3): `gpt-oss:20b`
## Events
| At | Kind | Role / Count | Pool | Fills | Turns | Dur(s) | Cites | Gaps |
|---|---|---|---|---|---|---|---|---|
| 08:00 | baseline_fill | Warehouse Associate × 3 | - | ✗ 0 | 0 | 33.1 | 0 | 1 |
| 10:30 | recurring | Machine Operator × 2 | - | ✗ 0 | 0 | 35.1 | 0 | 1 |
| 12:15 | expansion | Forklift Operator × 5 | - | ✗ 0 | 0 | 55.3 | 0 | 1 |
| 14:00 | emergency | Loader × 4 | - | ✗ 0 | 0 | 14.7 | 0 | 1 |
| 15:45 | misplacement | Warehouse Associate × 1 | - | ✗ 0 | 0 | 28.8 | 0 | 1 |
## Final roster
| Worker | Booked | Role | City, ST | Status |
|---|---|---|---|---|
## Gap signals
### drift_or_tool
- **08:00** — invalid JSON from executor: JSON Parse error: Expected ']' | raw: {"kind":"plan","steps":["TOOL_CALL",{"tool":"sql","args":{"query":"SELECT worker_id, name, role, city, state FROM workers_500k WHERE worker_id = '49164'"}},"TOOL_CALL",{"tool":"hybrid_search","args":{"index_name":"workers_500k_v1","sql_filter":"CAST(availability AS DOUBLE) > 0.5 AND role = 'Warehous
- **10:30** — invalid JSON from executor: JSON Parse error: Expected ']' | raw: {"kind":"plan","steps":["TOOL_CALL hybrid_search({'index_name':'workers_500k_v1','sql_filter':'role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5'})","TOOL_CALL sql({'query':'SELECT worker_id, name, role, city, state FROM workers_500k WHERE worker_i
- **12:15** — invalid JSON from executor: JSON Parse error: Expected '}' | raw: {"kind":"tool_call","tool":"hybrid_search",
"args":{"index_name":"workers_500k_v1","sql_filter":"role = 'Forklift Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND CAST(reliability AS DOUBLE) > 0.75 AND NOT worker_id IN (W500K-22375, W500K-19588, W500K-28024,
- **14:00** — aborted — 3 consecutive drift flags
- **15:45** — invalid JSON from executor: JSON Parse error: Expected '}' | raw: {"kind":"tool_call","tool":"hybrid_search",
"args":{"index_name":"workers_500k_v1","sql_filter":"role = 'Warehouse Associate' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND worker_id NOT IN (CANDIDATES SURFACED SO FAR)",
"rationale":"Find a reliable Warehouse Associa
### write_through_audit
- _post-run_ — playbook_memory has 1163 entries (ran 5 events, expected ≥ 0 new entries from this run)
## Workers touched across the week
0 distinct workers made it through to a decision. Every one is accounted for below — no-shows flagged, rebookings noted, everyone visible.
| Worker ID | Name | Events | Outcome |
|---|---|---|---|
## Discovered patterns (meta-index)
What the system identified across semantically-similar past fills as each event ran:
- **08:00 baseline_fill** (Warehouse Associate): —
- **10:30 recurring** (Machine Operator): —
- **12:15 expansion** (Forklift Operator): —
- **14:00 emergency** (Loader): —
- **15:45 misplacement** (Warehouse Associate): —
## Narrative
- 0/5 events reached consensus.
- Final roster: 0 bookings across 0 distinct workers.
- Workers touched (booked, failed, or otherwise decided): 0.
- Playbook citations across the day: 0 (proof the feedback loop fired across events).
- Dropped events: 08:00 baseline_fill, 10:30 recurring, 12:15 expansion, 14:00 emergency, 15:45 misplacement.


@ -0,0 +1,104 @@
[
{
"event": {
"kind": "baseline_fill",
"at": "08:00",
"role": "Warehouse Associate",
"count": 3,
"city": "Toledo",
"state": "OH",
"shift_start": "08:00 AM",
"scenario_note": "Regular Monday morning shift, 8-hour."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 33.137,
"error": "invalid JSON from executor: JSON Parse error: Expected ']' | raw: {\"kind\":\"plan\",\"steps\":[\"TOOL_CALL\",{\"tool\":\"sql\",\"args\":{\"query\":\"SELECT worker_id, name, role, city, state FROM workers_500k WHERE worker_id = '49164'\"}},\"TOOL_CALL\",{\"tool\":\"hybrid_search\",\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"CAST(availability AS DOUBLE) > 0.5 AND role = 'Warehous",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected ']' | raw: {\"kind\":\"plan\",\"steps\":[\"TOOL_CALL\",{\"tool\":\"sql\",\"args\":{\"query\":\"SELECT worker_id, name, role, city, state FROM workers_500k WHERE worker_id = '49164'\"}},\"TOOL_CALL\",{\"tool\":\"hybrid_search\",\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"CAST(availability AS DOUBLE) > 0.5 AND role = 'Warehous"
]
},
{
"event": {
"kind": "recurring",
"at": "10:30",
"role": "Machine Operator",
"count": 2,
"city": "Toledo",
"state": "OH",
"shift_start": "11:00 AM",
"scenario_note": "Recurring Tuesday/Thursday slot — prior workers may still be available."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 35.123,
"error": "invalid JSON from executor: JSON Parse error: Expected ']' | raw: {\"kind\":\"plan\",\"steps\":[\"TOOL_CALL hybrid_search({'index_name':'workers_500k_v1','sql_filter':'role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5'})\",\"TOOL_CALL sql({'query':'SELECT worker_id, name, role, city, state FROM workers_500k WHERE worker_i",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected ']' | raw: {\"kind\":\"plan\",\"steps\":[\"TOOL_CALL hybrid_search({'index_name':'workers_500k_v1','sql_filter':'role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5'})\",\"TOOL_CALL sql({'query':'SELECT worker_id, name, role, city, state FROM workers_500k WHERE worker_i"
]
},
{
"event": {
"kind": "expansion",
"at": "12:15",
"role": "Forklift Operator",
"count": 5,
"city": "Toledo",
"state": "OH",
"shift_start": "01:00 PM",
"scenario_note": "New warehouse location opening, five-worker team needed."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 55.269,
"error": "invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\n\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"role = 'Forklift Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND CAST(reliability AS DOUBLE) > 0.75 AND NOT worker_id IN (W500K-22375, W500K-19588, W500K-28024, ",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\n\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"role = 'Forklift Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND CAST(reliability AS DOUBLE) > 0.75 AND NOT worker_id IN (W500K-22375, W500K-19588, W500K-28024, "
]
},
{
"event": {
"kind": "emergency",
"at": "14:00",
"role": "Loader",
"count": 4,
"city": "Toledo",
"state": "OH",
"shift_start": "04:00 PM same day",
"deadline": "16:00",
"scenario_note": "Walkoff incident — replacement crew needed by 16:00 sharp."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 14.719,
"error": "aborted — 3 consecutive drift flags",
"gap_signals": [
"drift_or_tool: aborted — 3 consecutive drift flags"
]
},
{
"event": {
"kind": "misplacement",
"at": "15:45",
"role": "Warehouse Associate",
"count": 1,
"city": "Toledo",
"state": "OH",
"shift_start": "remainder of 08:00 shift",
"scenario_note": "One worker from the 08:00 fill didn't show; rebuild the gap.",
"replaces_event": "08:00"
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 28.761,
"error": "invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\n\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"role = 'Warehouse Associate' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND worker_id NOT IN (CANDIDATES SURFACED SO FAR)\",\n\"rationale\":\"Find a reliable Warehouse Associa",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\n\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"role = 'Warehouse Associate' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND worker_id NOT IN (CANDIDATES SURFACED SO FAR)\",\n\"rationale\":\"Find a reliable Warehouse Associa"
]
}
]


@ -0,0 +1 @@
# SMS drafts — Riverfront Steel, 2026-04-21


@ -0,0 +1,2 @@
{"after_event":"12:15","event_kind":"expansion","ok":false,"model":"gpt-oss:20b","duration_secs":14.223,"hint":"(T3 unavailable)","risk":"generate returned empty text from gpt-oss:20b: {\"text\":\"\",\"model\":\"gpt-oss:20b\","}
{"after_event":"15:45","event_kind":"misplacement","ok":false,"model":"gpt-oss:20b","duration_secs":15.447,"hint":"(T3 unavailable)","risk":"generate returned empty text from gpt-oss:20b: {\"text\":\"\",\"model\":\"gpt-oss:20b\","}


@ -0,0 +1 @@
{"at":"12:15","kind":"expansion","operation":"fill: Forklift Operator x5 in Toledo, OH","fills":[{"candidate_id":"W500K-22375","name":"Matthew P. Garcia","reason":"Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."},{"candidate_id":"W500K-19588","name":"Maria K. Cruz","reason":"Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."},{"candidate_id":"W500K-37736","name":"Jennifer K. Robinson","reason":"Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."},{"candidate_id":"W500K-3150","name":"Brenda Gutierrez","reason":"Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."},{"candidate_id":"W500K-28024","name":"Nancy W. Ward","reason":"Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."}],"turns":6,"duration_secs":21.998,"pool_size":687,"playbook_citations":[],"discovered_pattern":"Across 25 similar past playbooks (6 workers examined) · recurring certifications: Forklift (67%), OSHA-10 (50%) · recurring skills: mill (50%), 6S (50%) · archetype mostly: leader · reliability median 0.83 (range 0.750.96)"}


@ -0,0 +1,20 @@
# Client emails — Riverfront Steel, 2026-04-21
## 12:15 expansion — Forklift Operator x5
Subject: 5 Workers Confirmed
Dear Riverfront Steel Team,
We are pleased to confirm that we have filled all 5 positions for Forklift Operators as requested. The workers starting at 01:00 PM today are:
- Matthew P. Garcia
- Maria K. Cruz
- Jennifer K. Robinson
- Brenda Gutierrez
- Nancy W. Ward
This is in support of the new warehouse location opening, requiring a five-worker team.
Best regards,
Dispatch Team Lakehouse


@ -0,0 +1,6 @@
# Cross-day lesson — Riverfront Steel, 2026-04-21
_Generated by `gpt-oss:20b` in 14.2s. Based on 5 events + 2 mid-day checkpoints._
Before any baseline, recurring, or emergency fill, query the pool size and turn count; missing data causes the job to fail. Replicate the expansion logic that pulls pool and turns for all event types. If the GPT-OSS model is unavailable, switch to a local fallback or log a warning instead of returning empty risk text. Validate that gaps are accounted for before committing the fill to avoid single-gap failures.
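The "validate before committing the fill" step above can be sketched as a small preflight gate. This is an illustrative sketch, not the project's code: `preflight_fill` is a hypothetical name, and the inputs mirror the `pool_size` and `gap_signals` fields seen in the event records in this commit.

```python
def preflight_fill(pool_size, required_count, gap_signals):
    """Gate a fill before committing it: enough candidates in the pool
    and no unresolved gap signals. Returns (ok, reason)."""
    if pool_size is None or pool_size < required_count:
        # missing or undersized pool is exactly the failure mode the lesson names
        return False, f"pool too small: {pool_size} < {required_count}"
    if gap_signals:
        return False, f"{len(gap_signals)} unresolved gap signal(s)"
    return True, "ok"
```

Run against the 12:15 expansion in these logs (pool 687, count 5, four `double_book` signals), this gate would have blocked the commit and surfaced the double-bookings before the roster was written.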


@ -0,0 +1,76 @@
# Scenario retrospective — Riverfront Steel, 2026-04-21
Executor: `mistral:latest` Reviewer: `qwen2.5:latest` Draft: `qwen2.5:latest` Overview(T3): `gpt-oss:20b`
Prior lessons loaded into executor context: **0** (baseline — no prior T3 history)
## Events
| At | Kind | Role / Count | Pool | Fills | Turns | Dur(s) | Cites | Gaps |
|---|---|---|---|---|---|---|---|---|
| 08:00 | baseline_fill | Warehouse Associate × 3 | - | ✗ 0 | 0 | 47.6 | 0 | 1 |
| 10:30 | recurring | Machine Operator × 2 | - | ✗ 0 | 0 | 29.5 | 0 | 1 |
| 12:15 | expansion | Forklift Operator × 5 | 687 | ✓ 5 | 6 | 22.0 | 0 | 4 |
| 14:00 | emergency | Loader × 4 | - | ✗ 0 | 0 | 92.4 | 0 | 1 |
| 15:45 | misplacement | Warehouse Associate × 1 | - | ✗ 0 | 0 | 100.9 | 0 | 1 |
## Final roster
| Worker | Booked | Role | City, ST | Status |
|---|---|---|---|---|
| undefined Matthew P. Garcia | 12:15 | Forklift Operator | Toledo, OH | confirmed |
| undefined Maria K. Cruz | 12:15 | Forklift Operator | Toledo, OH | confirmed |
| undefined Jennifer K. Robinson | 12:15 | Forklift Operator | Toledo, OH | confirmed |
| undefined Brenda Gutierrez | 12:15 | Forklift Operator | Toledo, OH | confirmed |
| undefined Nancy W. Ward | 12:15 | Forklift Operator | Toledo, OH | confirmed |
## Gap signals
### drift_or_tool
- **08:00** — no consensus after 14 turns
- **10:30** — invalid JSON from executor: JSON Parse error: Expected '}' | raw: {"kind":"tool_call","tool":"hybrid_search",
"args":{"index_name":"workers_500k_v1","sql_filter":"role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND playbook_citations > 0",
"rationale":"Narrow down candidates to Machine Operators in Toledo, OH w
- **14:00** — no consensus after 14 turns
- **15:45** — no consensus after 14 turns
### double_book
- **12:15** — undefined Maria K. Cruz already booked for 12:15
- **12:15** — undefined Jennifer K. Robinson already booked for 12:15
- **12:15** — undefined Brenda Gutierrez already booked for 12:15
- **12:15** — undefined Nancy W. Ward already booked for 12:15
### fairness
- _cross-event_ — Matthew P. Garcia (undefined) booked 5 times today
### write_through_audit
- _post-run_ — playbook_memory has 1164 entries (ran 5 events, expected ≥ 1 new entries from this run)
## Workers touched across the week
5 distinct workers made it through to a decision. Every one is accounted for below — no-shows flagged, rebookings noted, everyone visible.
| Worker ID | Name | Events | Outcome |
|---|---|---|---|
| W500K-22375 | Matthew P. Garcia | 12:15 expansion | booked |
| W500K-19588 | Maria K. Cruz | 12:15 expansion | booked |
| W500K-37736 | Jennifer K. Robinson | 12:15 expansion | booked |
| W500K-3150 | Brenda Gutierrez | 12:15 expansion | booked |
| W500K-28024 | Nancy W. Ward | 12:15 expansion | booked |
## Discovered patterns (meta-index)
What the system identified across semantically-similar past fills as each event ran:
- **08:00 baseline_fill** (Warehouse Associate): —
- **10:30 recurring** (Machine Operator): —
- **12:15 expansion** (Forklift Operator): Across 25 similar past playbooks (6 workers examined) · recurring certifications: Forklift (67%), OSHA-10 (50%) · recurring skills: mill (50%), 6S (50%) · archetype mostly: leader · reliability median 0.83 (range 0.75–0.96)
- **14:00 emergency** (Loader): —
- **15:45 misplacement** (Warehouse Associate): —
## Narrative
- 1/5 events reached consensus.
- Final roster: 5 bookings across 1 distinct workers.
- Workers touched (booked, failed, or otherwise decided): 5.
- Playbook citations across the day: 0 (proof the feedback loop fired across events).
- Dropped events: 08:00 baseline_fill, 10:30 recurring, 14:00 emergency, 15:45 misplacement.


@ -0,0 +1,137 @@
[
{
"event": {
"kind": "baseline_fill",
"at": "08:00",
"role": "Warehouse Associate",
"count": 3,
"city": "Toledo",
"state": "OH",
"shift_start": "08:00 AM",
"scenario_note": "Regular Monday morning shift, 8-hour."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 47.571,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
},
{
"event": {
"kind": "recurring",
"at": "10:30",
"role": "Machine Operator",
"count": 2,
"city": "Toledo",
"state": "OH",
"shift_start": "11:00 AM",
"scenario_note": "Recurring Tuesday/Thursday slot — prior workers may still be available."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 29.546,
"error": "invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\n\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND playbook_citations > 0\",\n\"rationale\":\"Narrow down candidates to Machine Operators in Toledo, OH w",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\n\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND playbook_citations > 0\",\n\"rationale\":\"Narrow down candidates to Machine Operators in Toledo, OH w"
]
},
{
"event": {
"kind": "expansion",
"at": "12:15",
"role": "Forklift Operator",
"count": 5,
"city": "Toledo",
"state": "OH",
"shift_start": "01:00 PM",
"scenario_note": "New warehouse location opening, five-worker team needed."
},
"ok": true,
"fills": [
{
"candidate_id": "W500K-22375",
"name": "Matthew P. Garcia",
"reason": "Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."
},
{
"candidate_id": "W500K-19588",
"name": "Maria K. Cruz",
"reason": "Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."
},
{
"candidate_id": "W500K-37736",
"name": "Jennifer K. Robinson",
"reason": "Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."
},
{
"candidate_id": "W500K-3150",
"name": "Brenda Gutierrez",
"reason": "Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."
},
{
"candidate_id": "W500K-28024",
"name": "Nancy W. Ward",
"reason": "Meets the criteria of being a Forklift Operator in Toledo, OH with availability and reliability above the specified thresholds."
}
],
"turns": 6,
"duration_secs": 21.998,
"gap_signals": [
"double_book: undefined Maria K. Cruz already booked for 12:15",
"double_book: undefined Jennifer K. Robinson already booked for 12:15",
"double_book: undefined Brenda Gutierrez already booked for 12:15",
"double_book: undefined Nancy W. Ward already booked for 12:15"
],
"sources_first_score": 0.6336688,
"sources_last_score": 0.55183524,
"pool_size": 687,
"playbook_citations": [],
    "discovered_pattern": "Across 25 similar past playbooks (6 workers examined) · recurring certifications: Forklift (67%), OSHA-10 (50%) · recurring skills: mill (50%), 6S (50%) · archetype mostly: leader · reliability median 0.83 (range 0.75–0.96)"
},
{
"event": {
"kind": "emergency",
"at": "14:00",
"role": "Loader",
"count": 4,
"city": "Toledo",
"state": "OH",
"shift_start": "04:00 PM same day",
"deadline": "16:00",
"scenario_note": "Walkoff incident — replacement crew needed by 16:00 sharp."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 92.425,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
},
{
"event": {
"kind": "misplacement",
"at": "15:45",
"role": "Warehouse Associate",
"count": 1,
"city": "Toledo",
"state": "OH",
"shift_start": "remainder of 08:00 shift",
"scenario_note": "One worker from the 08:00 fill didn't show; rebuild the gap.",
"replaces_event": "08:00"
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 100.945,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
}
]

View File

@ -0,0 +1,42 @@
[
{
"name": "Matthew P. Garcia",
"booked_for": "12:15",
"role": "Forklift Operator",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Maria K. Cruz",
"booked_for": "12:15",
"role": "Forklift Operator",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Jennifer K. Robinson",
"booked_for": "12:15",
"role": "Forklift Operator",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Brenda Gutierrez",
"booked_for": "12:15",
"role": "Forklift Operator",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
},
{
"name": "Nancy W. Ward",
"booked_for": "12:15",
"role": "Forklift Operator",
"city": "Toledo",
"state": "OH",
"status": "confirmed"
}
]

View File

@ -0,0 +1,26 @@
# SMS drafts — Riverfront Steel, 2026-04-21
## 12:15 expansion — Forklift Operator x5 in Toledo, OH
TO: Matthew P. Garcia
Confirming your shift as a Forklift Operator at Riverfront Steel's new warehouse in Toledo, OH starting 1:00 PM.
---
TO: Maria K. Cruz
You're scheduled to start your shift at 1:00 PM today at our new warehouse location in Toledo, OH.
---
TO: Jennifer K. Robinson
Confirming your shift as a Forklift Operator at Riverfront Steel's new warehouse opening in Toledo, OH starting 1:00 PM.
---
TO: Brenda Gutierrez
Your shift starts at 1:00 PM today at our new warehouse location for Riverfront Steel in Toledo, OH.
---
TO: Nancy W. Ward
Confirming your shift as a Forklift Operator at Riverfront Steel's new warehouse opening in Toledo, OH starting 1:00 PM.

View File

@ -0,0 +1 @@
# Client emails — Riverfront Steel, 2026-04-21

View File

@ -0,0 +1,59 @@
# Scenario retrospective — Riverfront Steel, 2026-04-21
Executor: `mistral:latest` · Reviewer: `qwen2.5:latest` · Draft: `qwen2.5:latest` · Overview(T3): `disabled`
Prior lessons loaded into executor context: **0** (baseline — no prior T3 history)
## Events
| At | Kind | Role / Count | Pool | Fills | Turns | Dur(s) | Cites | Gaps |
|---|---|---|---|---|---|---|---|---|
| 08:00 | baseline_fill | Warehouse Associate × 3 | - | ✗ 0 | 0 | 99.1 | 0 | 1 |
| 10:30 | recurring | Machine Operator × 2 | - | ✗ 0 | 0 | 114.5 | 0 | 1 |
| 12:15 | expansion | Forklift Operator × 5 | - | ✗ 0 | 0 | 14.5 | 0 | 1 |
| 14:00 | emergency | Loader × 4 | - | ✗ 0 | 0 | 49.7 | 0 | 1 |
| 15:45 | misplacement | Warehouse Associate × 1 | - | ✗ 0 | 0 | 26.6 | 0 | 1 |
## Final roster
| Worker | Booked | Role | City, ST | Status |
|---|---|---|---|---|
## Gap signals
### drift_or_tool
- **08:00** — aborted — 3 consecutive drift flags
- **10:30** — invalid JSON from executor: JSON Parse error: Expected ']' | raw: {"kind":"plan", "steps":["TOOL_CALL hybrid_search({'index_name':'workers_500k_v1','sql_filter':'role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND playbook_citations > 0'})", "TOOL_CALL sql({'query':'SELECT worker_id, name, role, city, state FRO
- **12:15** — invalid JSON from executor: JSON Parse error: Expected '}' | raw: {"kind":"tool_call","tool":"hybrid_search",
"args":{"index_name":"workers_500k_v1","sql_filter":"role = 'Forklift Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.75 AND CAST(reliability AS DOUBLE) > 0.9",
"question":"top 5 reliable forklift operators Toledo with h
- **14:00** — no consensus after 14 turns
- **15:45** — invalid JSON from executor: JSON Parse error: Expected '}' | raw: {"kind":"tool_call","tool":"hybrid_search",
"args":{"index_name":"workers_500k_v1","sql_filter":"CAST(availability AS DOUBLE) > 0.5 AND role = 'Warehouse Associate' AND city = 'Toledo' AND state = 'OH' AND worker_id NOT IN (49164,40928,34704,5749,22587,4091,23160,5114,15482,11915,36011,17171,11061,4
### write_through_audit
- _post-run_ — playbook_memory has 1164 entries (ran 5 events, expected ≥ 0 new entries from this run)
## Workers touched across the week
0 distinct workers made it through to a decision. Every one is accounted for below — no-shows flagged, rebookings noted, everyone visible.
| Worker ID | Name | Events | Outcome |
|---|---|---|---|
## Discovered patterns (meta-index)
What the system identified across semantically-similar past fills as each event ran:
- **08:00 baseline_fill** (Warehouse Associate): —
- **10:30 recurring** (Machine Operator): —
- **12:15 expansion** (Forklift Operator): —
- **14:00 emergency** (Loader): —
- **15:45 misplacement** (Warehouse Associate): —
## Narrative
- 0/5 events reached consensus.
- Final roster: 0 bookings across 0 distinct workers.
- Workers touched (booked, failed, or otherwise decided): 0.
- Playbook citations across the day: 0 (proof the feedback loop fired across events).
- Dropped events: 08:00 baseline_fill, 10:30 recurring, 12:15 expansion, 14:00 emergency, 15:45 misplacement.

View File

@ -0,0 +1,104 @@
[
{
"event": {
"kind": "baseline_fill",
"at": "08:00",
"role": "Warehouse Associate",
"count": 3,
"city": "Toledo",
"state": "OH",
"shift_start": "08:00 AM",
"scenario_note": "Regular Monday morning shift, 8-hour."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 99.133,
"error": "aborted — 3 consecutive drift flags",
"gap_signals": [
"drift_or_tool: aborted — 3 consecutive drift flags"
]
},
{
"event": {
"kind": "recurring",
"at": "10:30",
"role": "Machine Operator",
"count": 2,
"city": "Toledo",
"state": "OH",
"shift_start": "11:00 AM",
"scenario_note": "Recurring Tuesday/Thursday slot — prior workers may still be available."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 114.512,
"error": "invalid JSON from executor: JSON Parse error: Expected ']' | raw: {\"kind\":\"plan\", \"steps\":[\"TOOL_CALL hybrid_search({'index_name':'workers_500k_v1','sql_filter':'role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND playbook_citations > 0'})\", \"TOOL_CALL sql({'query':'SELECT worker_id, name, role, city, state FRO",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected ']' | raw: {\"kind\":\"plan\", \"steps\":[\"TOOL_CALL hybrid_search({'index_name':'workers_500k_v1','sql_filter':'role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND playbook_citations > 0'})\", \"TOOL_CALL sql({'query':'SELECT worker_id, name, role, city, state FRO"
]
},
{
"event": {
"kind": "expansion",
"at": "12:15",
"role": "Forklift Operator",
"count": 5,
"city": "Toledo",
"state": "OH",
"shift_start": "01:00 PM",
"scenario_note": "New warehouse location opening, five-worker team needed."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 14.525,
"error": "invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\n\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"role = 'Forklift Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.75 AND CAST(reliability AS DOUBLE) > 0.9\",\n\"question\":\"top 5 reliable forklift operators Toledo with h",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\n\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"role = 'Forklift Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.75 AND CAST(reliability AS DOUBLE) > 0.9\",\n\"question\":\"top 5 reliable forklift operators Toledo with h"
]
},
{
"event": {
"kind": "emergency",
"at": "14:00",
"role": "Loader",
"count": 4,
"city": "Toledo",
"state": "OH",
"shift_start": "04:00 PM same day",
"deadline": "16:00",
"scenario_note": "Walkoff incident — replacement crew needed by 16:00 sharp."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 49.725,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
},
{
"event": {
"kind": "misplacement",
"at": "15:45",
"role": "Warehouse Associate",
"count": 1,
"city": "Toledo",
"state": "OH",
"shift_start": "remainder of 08:00 shift",
"scenario_note": "One worker from the 08:00 fill didn't show; rebuild the gap.",
"replaces_event": "08:00"
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 26.607,
"error": "invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\n\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"CAST(availability AS DOUBLE) > 0.5 AND role = 'Warehouse Associate' AND city = 'Toledo' AND state = 'OH' AND worker_id NOT IN (49164,40928,34704,5749,22587,4091,23160,5114,15482,11915,36011,17171,11061,4",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected '}' | raw: {\"kind\":\"tool_call\",\"tool\":\"hybrid_search\",\n\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"CAST(availability AS DOUBLE) > 0.5 AND role = 'Warehouse Associate' AND city = 'Toledo' AND state = 'OH' AND worker_id NOT IN (49164,40928,34704,5749,22587,4091,23160,5114,15482,11915,36011,17171,11061,4"
]
}
]

View File

@ -0,0 +1 @@
# SMS drafts — Riverfront Steel, 2026-04-21

View File

@ -0,0 +1,2 @@
{"after_event":"12:15","event_kind":"expansion","ok":false,"model":"gpt-oss:20b","duration_secs":14.287,"hint":"(T3 unavailable)","risk":"generate returned empty text from gpt-oss:20b: {\"text\":\"\",\"model\":\"gpt-oss:20b\","}
{"after_event":"15:45","event_kind":"misplacement","ok":true,"model":"gpt-oss:20b","duration_secs":14.587,"risk":"Forklift Operator skill gap","hint":"Verify forklift operator certification and tool compatibility for Toledo shift."}

View File

@ -0,0 +1 @@
# Client emails — Riverfront Steel, 2026-04-21

View File

@ -0,0 +1,6 @@
# Cross-day lesson — Riverfront Steel, 2026-04-21
_Generated by `gpt-oss:20b` in 6.0s. Based on 5 events + 2 mid-day checkpoints._
**
Before any event, prefetch the full pool roster and skill certification data for Toledo, OH; the missing pool data caused every shift to fail. Verify forklift operator certifications and tool compatibility ahead of time, as the misplacement risk highlighted a skill gap. Ensure the risk-generation model (gpt-oss:20b) is online or have a manual fallback; the empty response after the expansion shows a T3 unavailability that halted risk assessment. Apply these checks for baseline, recurring, expansion, emergency, and misplacement events to avoid the single-gap failure pattern.

View File

@ -0,0 +1,28 @@
[
{
"date": "2026-04-21",
"client": "Riverfront Steel",
"cities": "Toledo",
"states": "OH",
"events_total": 5,
"events_ok": 1,
"checkpoint_count": 2,
"model": "gpt-oss:20b",
"cloud": false,
    "lesson": "** \nBefore any baseline, recurring, or emergency fill, query the pool size and turn count; missing data causes the job to fail. Replicate the expansion logic that pulls pool and turns for all event types. If the GPT-OSS model is unavailable, switch to a local fallback or log a warning instead of returning empty risk text. Validate that gaps are accounted for before committing the fill to avoid single-gap failures.",
"checkpoints": [
{
"after": "12:15",
"risk": "generate returned empty text from gpt-oss:20b: {\"text\":\"\",\"model\":\"gpt-oss:20b\",",
"hint": "(T3 unavailable)"
},
{
"after": "15:45",
"risk": "generate returned empty text from gpt-oss:20b: {\"text\":\"\",\"model\":\"gpt-oss:20b\",",
"hint": "(T3 unavailable)"
}
],
"created_at": "2026-04-21T00:34:20.521Z",
"file": "2026-04-21_Riverfront_Steel_1776731660521.json"
}
]
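Both checkpoints in the summary above record `generate returned empty text from gpt-oss:20b` with the `(T3 unavailable)` hint, which matches the lesson's advice to log a warning instead of storing empty risk text. A minimal sketch of that degrade-gracefully pattern, assuming a hypothetical `generate` callable standing in for whatever client actually invokes the model:

```python
def risk_checkpoint(generate, prompt: str, model: str = "gpt-oss:20b") -> dict:
    """Run a mid-day risk checkpoint, degrading gracefully when the
    T3 model returns empty text (as both 2026-04-21 checkpoints did).

    `generate` is a stand-in for the real model client; it returns the
    generated text, which may be empty or None when T3 is unavailable.
    """
    text = (generate(prompt, model) or "").strip()
    if text:
        return {"ok": True, "model": model, "risk": text}
    # Record the failure explicitly instead of persisting an empty lesson.
    return {
        "ok": False,
        "model": model,
        "risk": f"generate returned empty text from {model}",
        "hint": "(T3 unavailable)",
    }
```

The checkpoint record stays well-formed either way, so downstream retrospectives can count failed checkpoints rather than silently skipping them.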

View File

@ -0,0 +1,60 @@
# Scenario retrospective — Riverfront Steel, 2026-04-21
Executor: `mistral:latest` · Reviewer: `qwen2.5:latest` · Draft: `qwen2.5:latest` · Overview(T3): `gpt-oss:20b`
Prior lessons loaded into executor context: **1** (from 2026-04-21)
## Events
| At | Kind | Role / Count | Pool | Fills | Turns | Dur(s) | Cites | Gaps |
|---|---|---|---|---|---|---|---|---|
| 08:00 | baseline_fill | Warehouse Associate × 3 | - | ✗ 0 | 0 | 13.9 | 0 | 1 |
| 10:30 | recurring | Machine Operator × 2 | - | ✗ 0 | 0 | 13.3 | 0 | 1 |
| 12:15 | expansion | Forklift Operator × 5 | - | ✗ 0 | 0 | 30.7 | 0 | 1 |
| 14:00 | emergency | Loader × 4 | - | ✗ 0 | 0 | 23.1 | 0 | 1 |
| 15:45 | misplacement | Warehouse Associate × 1 | - | ✗ 0 | 0 | 51.1 | 0 | 1 |
## Final roster
| Worker | Booked | Role | City, ST | Status |
|---|---|---|---|---|
## Gap signals
### drift_or_tool
- **08:00** — invalid JSON from executor: JSON Parse error: Expected ']' | raw: {"kind":"plan","steps":["TOOL_CALL","hybrid_search({'index_name':'workers_500k_v1','sql_filter':'role = 'Warehouse Associate' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5','question':'reliable warehouse associate Toledo'})","TOOL_CALL","sql({'query':'SELECT worker_id,
- **10:30** — invalid JSON from executor: JSON Parse error: Expected ']' | raw: {"kind":"plan","steps":["TOOL_CALL","tool":"hybrid_search","args":{"index_name":"workers_500k_v1","sql_filter":"role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND playbook_citations > 0"},"rationale":"Narrow the search to recurring Machine Opera
- **12:15** — invalid JSON from executor: JSON Parse error: Invalid escape character ' | raw: {
"kind": "plan",
"steps": [
"TOOL_CALL hybrid_search({'index_name': 'workers_500k_v1', 'sql_filter': 'role = \'Forklift Operator\' AND city = \'Toledo\' AND state = \'OH\' AND CAST(availability AS DOUBLE) > 0.5 AND CAST(reliability AS DOUBLE) > 0.75', 'k': 10})",
"TOOL_CALL sql({'query'
- **14:00** — aborted — 3 consecutive drift flags
- **15:45** — no consensus after 14 turns
### write_through_audit
- _post-run_ — playbook_memory has 1164 entries (ran 5 events, expected ≥ 0 new entries from this run)
## Workers touched across the week
0 distinct workers made it through to a decision. Every one is accounted for below — no-shows flagged, rebookings noted, everyone visible.
| Worker ID | Name | Events | Outcome |
|---|---|---|---|
## Discovered patterns (meta-index)
What the system identified across semantically-similar past fills as each event ran:
- **08:00 baseline_fill** (Warehouse Associate): —
- **10:30 recurring** (Machine Operator): —
- **12:15 expansion** (Forklift Operator): —
- **14:00 emergency** (Loader): —
- **15:45 misplacement** (Warehouse Associate): —
## Narrative
- 0/5 events reached consensus.
- Final roster: 0 bookings across 0 distinct workers.
- Workers touched (booked, failed, or otherwise decided): 0.
- Playbook citations across the day: 0 (proof the feedback loop fired across events).
- Dropped events: 08:00 baseline_fill, 10:30 recurring, 12:15 expansion, 14:00 emergency, 15:45 misplacement.

View File

@ -0,0 +1,104 @@
[
{
"event": {
"kind": "baseline_fill",
"at": "08:00",
"role": "Warehouse Associate",
"count": 3,
"city": "Toledo",
"state": "OH",
"shift_start": "08:00 AM",
"scenario_note": "Regular Monday morning shift, 8-hour."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 13.874,
"error": "invalid JSON from executor: JSON Parse error: Expected ']' | raw: {\"kind\":\"plan\",\"steps\":[\"TOOL_CALL\",\"hybrid_search({'index_name':'workers_500k_v1','sql_filter':'role = 'Warehouse Associate' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5','question':'reliable warehouse associate Toledo'})\",\"TOOL_CALL\",\"sql({'query':'SELECT worker_id, ",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected ']' | raw: {\"kind\":\"plan\",\"steps\":[\"TOOL_CALL\",\"hybrid_search({'index_name':'workers_500k_v1','sql_filter':'role = 'Warehouse Associate' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5','question':'reliable warehouse associate Toledo'})\",\"TOOL_CALL\",\"sql({'query':'SELECT worker_id, "
]
},
{
"event": {
"kind": "recurring",
"at": "10:30",
"role": "Machine Operator",
"count": 2,
"city": "Toledo",
"state": "OH",
"shift_start": "11:00 AM",
"scenario_note": "Recurring Tuesday/Thursday slot — prior workers may still be available."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 13.257,
"error": "invalid JSON from executor: JSON Parse error: Expected ']' | raw: {\"kind\":\"plan\",\"steps\":[\"TOOL_CALL\",\"tool\":\"hybrid_search\",\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND playbook_citations > 0\"},\"rationale\":\"Narrow the search to recurring Machine Opera",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Expected ']' | raw: {\"kind\":\"plan\",\"steps\":[\"TOOL_CALL\",\"tool\":\"hybrid_search\",\"args\":{\"index_name\":\"workers_500k_v1\",\"sql_filter\":\"role = 'Machine Operator' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5 AND playbook_citations > 0\"},\"rationale\":\"Narrow the search to recurring Machine Opera"
]
},
{
"event": {
"kind": "expansion",
"at": "12:15",
"role": "Forklift Operator",
"count": 5,
"city": "Toledo",
"state": "OH",
"shift_start": "01:00 PM",
"scenario_note": "New warehouse location opening, five-worker team needed."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 30.707,
"error": "invalid JSON from executor: JSON Parse error: Invalid escape character ' | raw: {\n \"kind\": \"plan\",\n \"steps\": [\n \"TOOL_CALL hybrid_search({'index_name': 'workers_500k_v1', 'sql_filter': 'role = \\'Forklift Operator\\' AND city = \\'Toledo\\' AND state = \\'OH\\' AND CAST(availability AS DOUBLE) > 0.5 AND CAST(reliability AS DOUBLE) > 0.75', 'k': 10})\",\n \"TOOL_CALL sql({'query'",
"gap_signals": [
"drift_or_tool: invalid JSON from executor: JSON Parse error: Invalid escape character ' | raw: {\n \"kind\": \"plan\",\n \"steps\": [\n \"TOOL_CALL hybrid_search({'index_name': 'workers_500k_v1', 'sql_filter': 'role = \\'Forklift Operator\\' AND city = \\'Toledo\\' AND state = \\'OH\\' AND CAST(availability AS DOUBLE) > 0.5 AND CAST(reliability AS DOUBLE) > 0.75', 'k': 10})\",\n \"TOOL_CALL sql({'query'"
]
},
{
"event": {
"kind": "emergency",
"at": "14:00",
"role": "Loader",
"count": 4,
"city": "Toledo",
"state": "OH",
"shift_start": "04:00 PM same day",
"deadline": "16:00",
"scenario_note": "Walkoff incident — replacement crew needed by 16:00 sharp."
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 23.148,
"error": "aborted — 3 consecutive drift flags",
"gap_signals": [
"drift_or_tool: aborted — 3 consecutive drift flags"
]
},
{
"event": {
"kind": "misplacement",
"at": "15:45",
"role": "Warehouse Associate",
"count": 1,
"city": "Toledo",
"state": "OH",
"shift_start": "remainder of 08:00 shift",
"scenario_note": "One worker from the 08:00 fill didn't show; rebuild the gap.",
"replaces_event": "08:00"
},
"ok": false,
"fills": [],
"turns": 0,
"duration_secs": 51.075,
"error": "no consensus after 14 turns",
"gap_signals": [
"drift_or_tool: no consensus after 14 turns"
]
}
]
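The 12:15 failure above ("Invalid escape character '") is reproducible outside the harness: JSON only defines the escapes `\"`, `\\`, `\/`, `\b`, `\f`, `\n`, `\r`, `\t`, and `\uXXXX`, so an executor reply containing `\'` inside a string can never parse. A minimal sketch of a lenient retry, with `parse_executor_reply` a hypothetical name (the real harness's parsing entry point is not shown in these files):

```python
import json

def parse_executor_reply(raw: str) -> dict:
    """Parse a tool-call reply emitted by the executor model.

    Models sometimes escape single quotes as \' inside JSON strings,
    which is not a legal JSON escape. Since \' can never occur in
    valid JSON, rewriting it to a bare quote is a safe repair.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return json.loads(raw.replace("\\'", "'"))

# The escape style from the failing 12:15 reply (backslash-quote
# inside the sql_filter string):
bad = '{"sql_filter": "role = \\\'Forklift Operator\\\'"}'
fixed = parse_executor_reply(bad)
```

This only covers the invalid-escape case; the `Expected '}'` and `Expected ']'` errors in the same log come from truncated replies, which no string rewrite can recover.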

View File

@ -0,0 +1 @@
# SMS drafts — Riverfront Steel, 2026-04-21

View File

@ -0,0 +1,2 @@
{"after_event":"12:15","event_kind":"expansion","ok":true,"model":"gpt-oss:20b","duration_secs":12.189,"risk":"JSON syntax error in tool calls","hint":"For the next Forklift Operator expansion, escape single quotes in SQL query or use a parameterized query; validate JSON with a linter before execution."}
{"after_event":"15:45","event_kind":"misplacement","ok":true,"model":"gpt-oss:20b","duration_secs":15.773,"risk":"Warehouse Associate JSON error","hint":"Escape quotes in SQL query; close JSON braces before sending to executor."}
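The two hints above (escape single quotes in the SQL, lint the JSON before execution) can be sketched as a small builder. Assumptions: `build_hybrid_search_call` and its quoting helper are illustrative names, and the tool-call shape is inferred from the `hybrid_search` calls logged in the failed events:

```python
import json

def build_hybrid_search_call(role: str, city: str, state: str,
                             min_avail: float = 0.5) -> str:
    """Assemble the hybrid_search tool call the executor is asked to emit.

    Single quotes inside SQL string literals are doubled ('') per the
    SQL standard, so the surrounding JSON never needs backslash escapes.
    """
    def q(value: str) -> str:
        # 'O'Neill' -> 'O''Neill': standard SQL literal escaping.
        return value.replace("'", "''")

    sql_filter = (
        f"role = '{q(role)}' AND city = '{q(city)}' AND state = '{q(state)}' "
        f"AND CAST(availability AS DOUBLE) > {min_avail}"
    )
    call = {
        "kind": "tool_call",
        "tool": "hybrid_search",
        "args": {"index_name": "workers_500k_v1", "sql_filter": sql_filter},
    }
    payload = json.dumps(call)
    json.loads(payload)  # lint before dispatch, as the checkpoint hint suggests
    return payload
```

Because `json.dumps` produces the payload and a round-trip `json.loads` checks it, the unbalanced-brace and invalid-escape failures seen across these runs cannot reach the executor.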

Some files were not shown because too many files have changed in this diff.