golangLAKEHOUSE/scripts/cutover/parity/materializer_parity.sh
root b0c8a3f227 parity probes: materializer + extract_json (caught + fixed real bug)
Two new cross-runtime parity probes joining the validator probe from
the gauntlet wave. Pattern: feed identical input through Rust and Go;
diff outputs. Each probe surfaced a different signal.

## Materializer parity probe
scripts/cutover/parity/materializer_parity.sh runs Bun + Go
materializer against an identical synthetic data/_kb/ root, diffs the
resulting evidence/ JSONL byte-equivalent (modulo provenance.recorded_at).

**First run: 0/2 match.** Real finding: Go's Provenance.LineOffset
had `json:"line_offset,omitempty"` which strips the field when value
is 0. Line offset 0 is the FIRST ROW of every source file — a real
semantic value, not absent. Bun side always emits it.

Fix: drop `omitempty` on Provenance.LineOffset. Updated comment
explaining why.

**Re-run: 2/2 match.** On-wire JSON parity holds.

## extract_json parity probe
scripts/cutover/parity/extract_json_parity.sh feeds 12 fixture
strings through both runtimes' extract_json:
  - fenced ```json``` blocks
  - unfenced ``` blocks
  - bare braces with prose around
  - first-balanced-of-many
  - nested objects
  - unicode in string values
  - escaped quotes
  - empty object
  - top-level array (both return first inner object)
  - no JSON
  - depth-balanced but invalid syntax
  - trailing garbage

Substrate gate: cargo test -p gateway extract_json PASS before probe.

**Result: 12/12 match.** Algorithms genuinely equivalent.

## scripts/cutover/parity/extract_json_helper/main.go
Tiny Go binary that reads stdin, calls validator.ExtractJSON, prints
{matched, value} JSON. Counterpart to the Rust parity_extract_json
binary in golangLAKEHOUSE's sibling lakehouse repo (separate commit).

## Pattern crystallized
Every cross-runtime port should land with a parity probe. Three
probes now exist:
  - validator (5/6 wire-format gap captured 2026-05-02)
  - materializer (caught + fixed real bug 2026-05-02)
  - extract_json (12/12 match 2026-05-02)

The instrument is reusable — each new shared HTTP/CLI surface gets
a probe row added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:43:54 -05:00

181 lines
7.0 KiB
Bash
Executable File

#!/usr/bin/env bash
# materializer_parity — run Bun + Go materializer against an identical
# synthetic root, diff the resulting data/evidence/ JSONL files.
#
# This validates the parity claim from the 2026-05-02 port: "on-wire
# JSON shape matches TS so Bun and Go runs are interchangeable." Any
# divergence here is a finding the architecture comparison should
# record (precedent: 2026-05-02 validator parity probe surfaced the
# Rust serde-tagged-enum vs Go flat-struct error envelope gap).
#
# Approach:
# 1. Set up a temp ROOT with a fixed source data/_kb/distilled_facts.jsonl
# + observer_escalations.jsonl (small, every transform field exercised)
# 2. Run Bun materializer:
# bun run /home/profit/lakehouse/scripts/distillation/build_evidence_index.ts
# against TS-side ROOT (TS expects a real lakehouse repo layout)
# 3. Run Go materializer:
# ./bin/materializer -root <ROOT>
# against the same ROOT (after wiping evidence/ between runs)
# 4. Diff the output JSONL files, normalized for non-deterministic
# fields (provenance.recorded_at, ordering).
#
# Outputs: reports/cutover/gauntlet_2026-05-02/parity/materializer_parity.md
#
# Exit 0 = byte-equal (modulo timestamps); exit non-zero = drift.
#
# Env overrides:
# RUST_REPO=/home/profit/lakehouse # Rust legacy repo
# GO_BIN=./bin/materializer # Go binary (built per-call)
set -uo pipefail
cd "$(dirname "$0")/../../.."
RUST_REPO="${RUST_REPO:-/home/profit/lakehouse}"
GO_BIN="${GO_BIN:-./bin/materializer}"
OUT_DIR="reports/cutover/gauntlet_2026-05-02/parity"
mkdir -p "$OUT_DIR"
OUT="$OUT_DIR/materializer_parity.md"
# Build Go materializer fresh.
export PATH="$PATH:/usr/local/go/bin"
go build -o "$GO_BIN" ./cmd/materializer
# Locate Bun. Skip with an explicit message if it's missing (CI without bun).
if ! command -v bun >/dev/null 2>&1; then
echo "[materializer-parity] SKIP: bun not on PATH"
exit 0
fi
# Confirm Rust-side materializer is present.
TS_MAT="$RUST_REPO/scripts/distillation/build_evidence_index.ts"
if [ ! -f "$TS_MAT" ]; then
echo "[materializer-parity] SKIP: $TS_MAT not found"
exit 0
fi
ROOT="$(mktemp -d)"
trap 'rm -rf "$ROOT"' EXIT INT TERM
mkdir -p "$ROOT/data/_kb"
# Synthetic distilled_facts — exercises the simplest transform shape.
cat > "$ROOT/data/_kb/distilled_facts.jsonl" <<EOF
{"run_id":"r1","source_label":"lab-a","created_at":"2026-04-26T00:00:00Z","extractor":"qwen3.5:latest","text":"first fact"}
{"run_id":"r2","source_label":"lab-b","created_at":"2026-04-26T01:00:00Z","extractor":"qwen3.5:latest","text":"second fact"}
{"run_id":"r3","source_label":"lab-c","created_at":"2026-04-26T02:00:00Z","extractor":"qwen2.5:latest","text":"third — different extractor"}
EOF
# observer_escalations — exercises the prompt_tokens/completion_tokens
# fields we added to EvidenceRecord during the materializer port.
cat > "$ROOT/data/_kb/observer_escalations.jsonl" <<EOF
{"ts":"2026-04-26T12:00:00Z","sig_hash":"abc123","cluster_endpoint":"/v1/chat","prompt_tokens":150,"completion_tokens":75,"analysis":"escalation review notes"}
EOF
# Pin recorded_at across BOTH runs so provenance.recorded_at matches.
PINNED_TS="2026-05-02T00:00:00Z"
# ── Bun run ────────────────────────────────────────────────────────
# Bun side reads its own root. We have to copy the fixtures under the
# Bun side's expectation (it defaults to /home/profit/lakehouse), or
# override via LH_DISTILL_ROOT. But the Bun materializer's CLI reads
# `new Date().toISOString()` for recorded_at — we can't pin without
# editing the script. Workaround: capture the full output, then
# normalize provenance.recorded_at out of both sides for diff.
BUN_ROOT="$ROOT/bun_side"
mkdir -p "$BUN_ROOT/data/_kb"
cp "$ROOT/data/_kb/"* "$BUN_ROOT/data/_kb/"
echo "[materializer-parity] running Bun materializer..."
LH_DISTILL_ROOT="$BUN_ROOT" bun run "$TS_MAT" \
> /tmp/bun_mat.log 2>&1 || {
echo "[materializer-parity] bun run failed:"
tail -30 /tmp/bun_mat.log
exit 1
}
# ── Go run ─────────────────────────────────────────────────────────
GO_ROOT="$ROOT/go_side"
mkdir -p "$GO_ROOT/data/_kb"
cp "$ROOT/data/_kb/"* "$GO_ROOT/data/_kb/"
echo "[materializer-parity] running Go materializer..."
"$GO_BIN" -root "$GO_ROOT" > /tmp/go_mat.log 2>&1 || {
echo "[materializer-parity] go run failed:"
tail -30 /tmp/go_mat.log
exit 1
}
# ── Find output day-partition ──────────────────────────────────────
# Both runs use today's UTC date. Look up the partition.
TODAY="$(date -u +%Y/%m/%d)"
BUN_OUT="$BUN_ROOT/data/evidence/$TODAY"
GO_OUT="$GO_ROOT/data/evidence/$TODAY"
if [ ! -d "$BUN_OUT" ]; then
echo "[materializer-parity] no Bun output dir: $BUN_OUT"
ls -la "$BUN_ROOT/data/evidence" 2>/dev/null || true
exit 1
fi
if [ ! -d "$GO_OUT" ]; then
echo "[materializer-parity] no Go output dir: $GO_OUT"
exit 1
fi
# ── Normalize + diff per source-stem ───────────────────────────────
# Stripped fields:
# provenance.recorded_at — different per-run wall clock
#
# Sorted by sig_hash so dedup ordering can't matter.
normalize() {
jq -c -S 'del(.provenance.recorded_at)' "$1" 2>/dev/null \
| sort
}
TOTAL=0; MATCH=0; DIFF=0
DIFF_DETAIL=""
for f in "$BUN_OUT"/*.jsonl; do
stem=$(basename "$f" .jsonl)
go_f="$GO_OUT/$stem.jsonl"
TOTAL=$((TOTAL+1))
if [ ! -f "$go_f" ]; then
DIFF=$((DIFF+1))
DIFF_DETAIL="$DIFF_DETAIL"$'\n'"- $stem: present in Bun, missing in Go"
continue
fi
bun_norm=$(normalize "$f")
go_norm=$(normalize "$go_f")
if [ "$bun_norm" = "$go_norm" ]; then
MATCH=$((MATCH+1))
else
DIFF=$((DIFF+1))
# Capture a small diff for the report.
diff_block="$(diff <(echo "$bun_norm") <(echo "$go_norm") | head -40)"
DIFF_DETAIL="$DIFF_DETAIL"$'\n\n'"### $stem"$'\n''```diff'$'\n'"$diff_block"$'\n''```'
fi
done
# ── Write report ───────────────────────────────────────────────────
{
echo "# Materializer parity probe — Bun vs Go"
echo
echo "**Date:** $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "**Bun:** \`$TS_MAT\`"
echo "**Go:** \`$GO_BIN\`"
echo
echo "Identical \`data/_kb/\` source → both runtimes' materializer."
echo "Match = JSONL byte-equal after normalizing \`provenance.recorded_at\`"
echo "(per-run wall clock) + sorted line order (dedup ordering)."
echo
echo "**Tally:** $MATCH match · $DIFF diff (out of $TOTAL stems)"
if [ -n "$DIFF_DETAIL" ]; then
echo
echo "## Divergences"
echo "$DIFF_DETAIL"
else
echo
echo "_No divergences — on-wire JSON parity holds._"
fi
} > "$OUT"
echo "[parity] materializer: $MATCH match / $DIFF diff (out of $TOTAL) → $OUT"
[ "$DIFF" -eq 0 ]