LAKEHOUSE/scripts/scrum_review.sh
root e8cf113af8 gauntlet 2026-05-02: smoke chain + per-component scrum + parity probe
Production-readiness gauntlet exploiting the dual Rust/Go
implementation as a measurement instrument.

## Phase 1 — Full smoke chain
21/21 PASS in ~60s. Substrate intact across the full service surface.

## Phase 2 — Per-component scrum (token-volume fix)
Prior wave (165KB diff): Kimi 62 tokens out, Qwen 297 → no useful
analysis. This wave splits today's commits into 4 focused bundles
(36-71KB each):
  c1 validatord (46KB) → 0 convergent / 11 distinct
  c2 vectord substrate (36KB) → 0 convergent / 10 distinct
  c3 materializer (71KB) → 0 convergent / 6 distinct (Opus emitted
                           a BLOCK then self-retracted in same response)
  c4 replay (45KB) → 0 convergent / 10 distinct

Reviewer engagement vs prior wave: Kimi went 62 → ~250 tokens out
once bundles dropped below 60KB.

scripts/scrum_review.sh hardening:
  * Diff-size guard (warn >60KB, hard-fail >100KB,
    SCRUM_FORCE_OVERSIZE=1 override)
  * Tightened prompt — file path must appear EXACTLY as in diff
    so post-processor can grep WHERE: lines reliably
  * Auto-tally step dedupes by (reviewer, location); convergence
    counts distinct lineages (closes the prior `opus+opus+opus`
    false-convergence bug)
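
The (reviewer, location) dedupe can be sketched in a few lines of awk
(illustrative finding lines, not real verdict data — the real tally
also tracks the highest severity per location):

```bash
# Three findings: opus flags cmd/foo/main.go:42 twice (self-repeat),
# kimi flags it once, qwen flags a different file. Dedupe by
# (location, reviewer), then count distinct reviewers per location.
printf '%s\n' \
  'opus|BLOCK|cmd/foo/main.go:42' \
  'opus|WARN|cmd/foo/main.go:42' \
  'kimi|WARN|cmd/foo/main.go:42' \
  'qwen|INFO|pkg/bar/bar.go:7' \
| awk -F'|' '{
    if (!(($3, $1) in seen)) { seen[$3, $1] = 1; hits[$3]++ }
  }
  END { for (k in hits) print k, hits[k] }' \
| sort
# cmd/foo/main.go:42 2   <- convergent (opus+kimi; opus self-repeat dropped)
# pkg/bar/bar.go:7 1     <- single-lineage catch
```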

## Phase 3 — Cross-runtime validator parity probe (the headline finding)
scripts/cutover/parity/validator_parity.sh sends 6 identical
/v1/validate cases to Rust :3100 AND Go :4110, compares status+body.

Result: **6/6 status codes match · 5/6 body shapes diverge.**

Rust returns serde-tagged enum:   {"Schema":{"field":"x","reason":"y"}}
Go returns flat exported-fields:  {"Kind":"schema","Field":"x","Reason":"y"}
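
The drift is visible without the daemons running. A minimal sketch in
the spirit of the probe (bodies inlined here for illustration; this is
not the probe's actual comparison logic):

```bash
# Same logical validation error, two wire shapes. Top-level key sets
# alone expose the drift (jq sorts keys alphabetically).
rust_body='{"Schema":{"field":"x","reason":"y"}}'
go_body='{"Kind":"schema","Field":"x","Reason":"y"}'
echo "$rust_body" | jq -c 'keys'   # ["Schema"]
echo "$go_body"   | jq -c 'keys'   # ["Field","Kind","Reason"]
[ "$(echo "$rust_body" | jq -c 'keys')" = "$(echo "$go_body" | jq -c 'keys')" ] \
  || echo "shape divergence"
```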

Both round-trip inside their own runtime; a caller swapping one for
the other would break parsing silently. Captured as new _open_ row
in docs/ARCHITECTURE_COMPARISON.md decisions tracker.

This is the "use the dual-implementation as a measurement instrument"
return — single-repo scrums can't catch this class of cross-runtime
drift.

## Phase 4 — Production assessment
Verdict: ship-with-known-gap. The validator wire-format gap is
documented, not a regression. The future fix is ~50 LOC on the Go side
(a custom MarshalJSON on ValidationError to match Rust's serde shape).
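
The shape mapping that fix must implement fits in one jq filter — a
sketch of a hypothetical caller-side shim, not the Go fix itself:

```bash
# Re-tag Go's flat error into Rust's externally-tagged serde shape:
# the capitalized Kind discriminant becomes the variant key, and the
# exported field names become lowercase payload keys.
echo '{"Kind":"schema","Field":"x","Reason":"y"}' \
| jq -c '{((.Kind[0:1] | ascii_upcase) + .Kind[1:]):
          {field: .Field, reason: .Reason}}'
# {"Schema":{"field":"x","reason":"y"}}
```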

Persistent stack config (/tmp/lakehouse-persistent.toml) gains
validatord on :3221 + persistent-validatord binary so operators
bringing up the persistent stack get the new daemon automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:05:18 -05:00

240 lines · 9.5 KiB · Bash · executable

#!/usr/bin/env bash
# Cross-lineage scrum review driver.
#
# Per feedback_cross_lineage_review.md: Opus + Kimi + Qwen3-coder
# review the same diff via /v1/chat. Convergent findings (≥2
# reviewers) = high-signal real bugs; single-reviewer = lineage
# catch.
#
# Usage:
# ./scripts/scrum_review.sh <bundle.diff> <bundle_label>
#
# Outputs verbatim verdicts to:
# reports/scrum/_evidence/<DATE>/verdicts/<bundle>_<reviewer>.md
set -euo pipefail
cd "$(dirname "$0")/.."
DIFF_FILE="${1:-}"
BUNDLE_LABEL="${2:-}"
DATE="${SCRUM_DATE:-$(date +%Y-%m-%d)}"
GATEWAY="${LH_GATEWAY:-http://127.0.0.1:3110}"
OUT_DIR="reports/scrum/_evidence/${DATE}/verdicts"
if [ -z "$DIFF_FILE" ] || [ ! -f "$DIFF_FILE" ]; then
echo "usage: $0 <diff_file> <bundle_label>"
echo "example: $0 reports/scrum/_evidence/2026-04-30/diffs/bundle_A_config_refactor.diff config_refactor"
exit 1
fi
mkdir -p "$OUT_DIR"
DIFF_BYTES=$(wc -c < "$DIFF_FILE")
DIFF_LINES=$(wc -l < "$DIFF_FILE")
echo "[scrum] $BUNDLE_LABEL$DIFF_LINES lines · $DIFF_BYTES bytes · 3 reviewers"
# Diff-size guard. Per the 2026-05-02 disposition: a 165KB bundle
# produced 0 convergent findings + 3 confabulated BLOCKs because Kimi
# and Qwen gave up at <300 output tokens (input-token spent on
# scanning, not analysis). Sweet spot per per-component runs is
# ≤60KB. SCRUM_FORCE_OVERSIZE=1 lets operators override for cases
# where splitting isn't possible.
if [ "$DIFF_BYTES" -gt 100000 ] && [ "${SCRUM_FORCE_OVERSIZE:-0}" != "1" ]; then
echo "[scrum] ABORT: diff is ${DIFF_BYTES} bytes (>100KB)."
echo " Big diffs make Kimi/Qwen give up early — split into"
echo " per-component bundles ≤60KB each, then re-run."
echo " Override (NOT recommended): SCRUM_FORCE_OVERSIZE=1"
exit 2
fi
if [ "$DIFF_BYTES" -gt 60000 ]; then
echo "[scrum] WARN: diff is ${DIFF_BYTES} bytes (>60KB) — non-Opus"
echo " lineages may produce thin output. Per-component split"
echo " is preferred. Continuing."
fi
# System prompt — same shape as the Rust auditor's review template,
# tightened per feedback_cross_lineage_review.md (lead with verdict).
SYSTEM='You are a senior code reviewer in a 3-lineage cross-review.
Your verdict feeds a convergent-finding gate (≥2 reviewers = real
bug). Be terse, evidence-based, and lead with the verdict.
For each finding, output one block. The format is STRICT — a
post-processor greps WHERE: lines across all 3 reviewers to find
convergent findings, so the file path must appear EXACTLY as it
does in the diff (e.g. `cmd/foo/main.go:42`, not `foo/main.go:42`).
SEVERITY: BLOCK | WARN | INFO
WHERE: <relative/path/from/repo/root>:<line_or_symbol>
WHAT: one-sentence description
WHY: one-sentence rationale grounded in the diff
Severity ladder:
- BLOCK = ship-blocker. Correctness bug, security flaw, broken
contract, lost test coverage, panic on real input, secret
leak. Worth blocking the PR.
- WARN = real but non-blocking. Race condition under specific
load, missing edge case, weak naming making future bugs
likely, regression risk in adjacent code.
- INFO = nit / style / better-name suggestion / dead-code remnant.
Skip the analysis preamble. Lead with the first BLOCK/WARN/INFO
block. End with a final "VERDICT:" line of "ship | ship-with-fixes
| hold" + a ≤15-word summary.
Never invent line numbers — only cite lines the diff shows.
Never repeat a file:line in two findings — combine them.'
REVIEWERS=(
"opus|opencode/claude-opus-4-7"
"kimi|openrouter/moonshotai/kimi-k2-0905"
"qwen|openrouter/qwen/qwen3-coder"
)
DIFF_CONTENT=$(cat "$DIFF_FILE")
run_review() {
local short="$1" model="$2"
local out="$OUT_DIR/${BUNDLE_LABEL}_${short}.md"
local user="Review the following diff. Bundle: $BUNDLE_LABEL.
\`\`\`diff
$DIFF_CONTENT
\`\`\`"
printf " %-6s %s ... " "$short" "$model"
local t0=$SECONDS
local status
# Build the body via temp files — both jq's --arg AND curl's
# --data run into the kernel's argv limit (~128KB) when the diff
# is large. Voice-ai full bundle was 156K and hit it twice.
# Piping through files (and using --rawfile for jq) sidesteps both.
local body_file user_file sys_file
body_file=$(mktemp); user_file=$(mktemp); sys_file=$(mktemp)
printf '%s' "$user" > "$user_file"
printf '%s' "$SYSTEM" > "$sys_file"
jq -n --arg model "$model" --rawfile sys "$sys_file" --rawfile user "$user_file" \
'{model:$model, max_tokens:4096, messages:[{role:"system",content:$sys},{role:"user",content:$user}]}' \
> "$body_file"
status=$(curl -sS -o /tmp/scrum_resp.json -w '%{http_code}' --max-time 240 \
-X POST "$GATEWAY/v1/chat" \
-H 'Content-Type: application/json' \
--data-binary "@$body_file")
rm -f "$body_file" "$user_file" "$sys_file"
local elapsed=$((SECONDS - t0))
if [ "$status" != "200" ]; then
printf "✗ HTTP %s (%ds)\n" "$status" "$elapsed"
head -c 300 /tmp/scrum_resp.json
echo
return 1
fi
local content latency tokens_in tokens_out
content=$(jq -r '.content' /tmp/scrum_resp.json)
latency=$(jq -r '.latency_ms' /tmp/scrum_resp.json)
tokens_in=$(jq -r '.input_tokens // 0' /tmp/scrum_resp.json)
tokens_out=$(jq -r '.output_tokens // 0' /tmp/scrum_resp.json)
{
echo "# Scrum review — $BUNDLE_LABEL$short ($model)"
echo
echo "**Latency:** ${latency}ms · **Tokens:** ${tokens_in} in / ${tokens_out} out · **Date:** ${DATE}"
echo
echo "---"
echo
echo "$content"
} > "$out"
printf "✓ %dms · %dt-out → %s\n" "$latency" "$tokens_out" "$out"
}
for r in "${REVIEWERS[@]}"; do
short="${r%%|*}"
model="${r#*|}"
run_review "$short" "$model" || true
done
# ─── Convergence tally ────────────────────────────────────────────
# Walk the 3 verdicts, extract WHERE: lines + their SEVERITY, dedupe
# across reviewers. Output a tally file showing what ≥2 reviewers
# flagged (real-bug signal) vs 1-reviewer (lineage catch / possibly
# confabulation).
TALLY="$OUT_DIR/${BUNDLE_LABEL}_tally.md"
{
echo "# Convergence tally — $BUNDLE_LABEL"
echo
echo "**Date:** ${DATE} · **Diff:** ${DIFF_LINES} lines / ${DIFF_BYTES} bytes"
echo
echo "## Findings by location"
echo
echo "| Reviewers | Severity | Where | Hits |"
echo "|---|---|---|---:|"
for v in "$OUT_DIR/${BUNDLE_LABEL}"_{opus,kimi,qwen}.md; do
[ -f "$v" ] || continue
short=$(basename "$v" .md | sed "s|.*${BUNDLE_LABEL}_||")
grep -E "^(SEVERITY|WHERE):" "$v" 2>/dev/null \
| awk -v r="$short" '
/^SEVERITY:/ { sev = $2; next }
/^WHERE:/ {
sub(/^WHERE: */, "")
# Drop trailing parenthetical ("(or <symbol>)") if it crept in.
# (Literal space-star, not \s — \s is a gawk extension, not POSIX awk.)
sub(/ *\(.*$/, "")
print r "|" sev "|" $0
}'
done | sort -u -t'|' -k1,1 -k3,3 \
| sort -t'|' -k3 \
| awk -F'|' '
# Aggregate by location. Dedup reviewers within a location
# (multiple findings from the same lineage at the same WHERE
# collapse to a single entry — that is reviewer self-repeat,
# not convergence). Track distinct reviewers + their highest
# severity across that location.
function rank(s) { return s == "BLOCK" ? 3 : s == "WARN" ? 2 : 1 }
function sevname(r) { return r == 3 ? "BLOCK" : r == 2 ? "WARN" : "INFO" }
{
key=$3
if (!(key in seen)) { seen[key]=""; sev_rank[key]=0 }
# split seen[key] on ";" and check if reviewer already present
present=0
n=split(seen[key], a, ";")
for (i=1;i<=n;i++) if (a[i]==$1) { present=1; break }
if (!present) {
seen[key] = seen[key] == "" ? $1 : seen[key] ";" $1
distinct_n[key]++
}
r = rank($2)
if (r > sev_rank[key]) { sev_rank[key]=r; sev_max[key]=$2 }
}
END {
for (k in distinct_n) {
# Reviewers column shows distinct lineages joined by "+"
gsub(";", "+", seen[k])
printf "%s|%s|%s|%d\n", seen[k], sev_max[k], k, distinct_n[k]
}
}
' \
| sort -t'|' -k4nr -k1 \
| awk -F'|' '{ printf "| %s | %s | `%s` | %d |\n", $1, $2, $3, $4 }'
echo
echo "(Convergent rows above are those whose Reviewers column contains a '+' — i.e. ≥2 lineages flagged the same location.)"
echo
echo "## Verdict line per reviewer"
echo
for v in "$OUT_DIR/${BUNDLE_LABEL}"_{opus,kimi,qwen}.md; do
[ -f "$v" ] || continue
short=$(basename "$v" .md | sed "s|.*${BUNDLE_LABEL}_||")
line=$(grep -E "^VERDICT:" "$v" 2>/dev/null | head -1)
echo "- **${short}**: ${line:-_no VERDICT line emitted_}"
done
} > "$TALLY"
echo "[scrum] tally → $TALLY"
# Convergent count from the tally body — count rows where the Hits
# column is ≥2 (distinct-reviewer count, after the awk dedup above).
CONV=$(awk -F'|' '$5 ~ /^ [0-9]+ $/ && ($5 + 0) >= 2 {n++} END {print n+0}' "$TALLY")
TOTAL=$(awk -F'|' '$5 ~ /^ [0-9]+ $/ {n++} END {print n+0}' "$TALLY")
# (The above scans rows of the tally table where the Hits column —
# cell 5 in `| reviewers | sev | where | hits |` — parses as int.)
# Fall back to a simpler check if the table parsing finds nothing.
if [ "$TOTAL" = "0" ]; then
TOTAL=$(grep -c "^| " "$TALLY" | awk '{print $1 - 1}') # subtract header row
CONV=$(awk -F'|' '/^\|/ && $5 ~ /[0-9]/ && ($5 + 0) >= 2 {n++} END {print n+0}' "$TALLY")
fi
echo "[scrum] $BUNDLE_LABEL: $CONV convergent / $TOTAL distinct findings"
echo "[scrum] $BUNDLE_LABEL complete"