golangLAKEHOUSE/scripts/cutover/parity/extract_json_parity.sh
root b0c8a3f227 parity probes: materializer + extract_json (caught + fixed real bug)
Two new cross-runtime parity probes joining the validator probe from
the gauntlet wave. Pattern: feed identical input through Rust and Go;
diff outputs. Each probe surfaced a different signal.

## Materializer parity probe
scripts/cutover/parity/materializer_parity.sh runs Bun + Go
materializer against an identical synthetic data/_kb/ root, diffs the
resulting evidence/ JSONL byte-equivalent (modulo provenance.recorded_at).

**First run: 0/2 match.** Real finding: Go's Provenance.LineOffset
had `json:"line_offset,omitempty"` which strips the field when value
is 0. Line offset 0 is the FIRST ROW of every source file — a real
semantic value, not absent. Bun side always emits it.

Fix: drop `omitempty` on Provenance.LineOffset. Updated comment
explaining why.

**Re-run: 2/2 match.** On-wire JSON parity holds.

## extract_json parity probe
scripts/cutover/parity/extract_json_parity.sh feeds 12 fixture
strings through both runtimes' extract_json:
  - fenced ```json``` blocks
  - unfenced ``` blocks
  - bare braces with prose around
  - first-balanced-of-many
  - nested objects
  - unicode in string values
  - escaped quotes
  - empty object
  - top-level array (both return first inner object)
  - no JSON
  - depth-balanced but invalid syntax
  - trailing garbage

Substrate gate: cargo test -p gateway extract_json PASS before probe.

**Result: 12/12 match.** Algorithms genuinely equivalent.

## scripts/cutover/parity/extract_json_helper/main.go
Tiny Go binary that reads stdin, calls validator.ExtractJSON, prints
{matched, value} JSON. Counterpart to the Rust parity_extract_json
binary in golangLAKEHOUSE's sibling lakehouse repo (separate commit).

## Pattern crystallized
Every cross-runtime port should land with a parity probe. Three
probes now exist:
  - validator (5/6 wire-format gap captured 2026-05-02)
  - materializer (caught + fixed real bug 2026-05-02)
  - extract_json (12/12 match 2026-05-02)

The instrument is reusable — each new shared HTTP/CLI surface gets
a probe row added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:43:54 -05:00

144 lines
5.7 KiB
Bash
Executable File

#!/usr/bin/env bash
# extract_json_parity — feed identical model-output strings through
# both Rust extract_json AND Go ExtractJSON; diff outputs.
#
# Why: the iteration loop's correctness hinges on extract_json finding
# the same JSON object in the same model output regardless of runtime.
# A divergence here means a model output that one runtime accepts and
# the other rejects (or worse, both accept but parse differently).
#
# Approach:
# 1. Run cargo test -p gateway extract_json to assert the LIVE Rust
# function still passes its own unit tests (substrate gate)
# 2. For each fixture (input, label) tuple:
# Rust: ./target/release/parity_extract_json < fixture
# Go: ./bin/parity_extract_json_go < fixture
# Compare {matched, value} JSON outputs
# 3. Emit a markdown report with per-fixture matches/diffs
#
# Outputs: reports/cutover/gauntlet_2026-05-02/parity/extract_json_parity.md
#
# Env overrides:
# RUST_REPO=/home/profit/lakehouse
# RUST_BIN=$RUST_REPO/target/release/parity_extract_json
set -uo pipefail
cd "$(dirname "$0")/../../.."
RUST_REPO="${RUST_REPO:-/home/profit/lakehouse}"
RUST_BIN="${RUST_BIN:-$RUST_REPO/target/release/parity_extract_json}"
GO_BIN="${GO_BIN:-./bin/parity_extract_json_go}"
OUT_DIR="reports/cutover/gauntlet_2026-05-02/parity"
mkdir -p "$OUT_DIR"
OUT="$OUT_DIR/extract_json_parity.md"
export PATH="$PATH:/usr/local/go/bin"
# ── Build / verify both sides ───────────────────────────────────────
if [ ! -x "$RUST_BIN" ]; then
echo "[extract-json-parity] building Rust helper..."
(cd "$RUST_REPO" && cargo build -p gateway --bin parity_extract_json --release 2>&1 | tail -3)
fi
if [ ! -x "$RUST_BIN" ]; then
echo "[extract-json-parity] SKIP: $RUST_BIN missing"
exit 0
fi
# Run live Rust unit tests (substrate gate) — ensures our helper
# matches the production extract_json behavior.
echo "[extract-json-parity] running cargo test extract_json (substrate gate)..."
(cd "$RUST_REPO" && cargo test -p gateway --release extract_json 2>&1 | tail -8) > /tmp/rust_extract_test.log
if ! grep -q "test result: ok" /tmp/rust_extract_test.log; then
echo "[extract-json-parity] live Rust tests FAILED — aborting probe"
cat /tmp/rust_extract_test.log
exit 1
fi
echo " ✓ live Rust extract_json tests PASS"
# Build Go-side helper from internal/validator.ExtractJSON.
go build -o "$GO_BIN" ./scripts/cutover/parity/extract_json_helper
# ── Fixture set ─────────────────────────────────────────────────────
# Inline as label||raw pairs. Curated to exercise every documented
# branch:
# - fenced ```json``` block
# - fenced unlabeled ``` block
# - bare-braces with stray prose
# - first-balanced-of-many
# - nested object
# - unicode in string values
# - escaped quotes
# - empty object
# - top-level array (both runtimes return first inner object)
# - no JSON at all
# - malformed JSON-shaped text (depth balanced but invalid syntax)
# - very-large input (~10KB of prose around a tiny object)
declare -a FIXTURES=(
"fenced_json_block||Here's my answer:
\`\`\`json
{\"fills\":[{\"candidate_id\":\"W-1\"}]}
\`\`\`
Done."
"fenced_unlabeled||result:
\`\`\`
{\"k\":\"v\"}
\`\`\`"
"bare_braces||Here you go: {\"fills\":[{\"candidate_id\":\"W-2\"}]}"
"first_of_many||{\"a\":1} then {\"b\":2}"
"nested||prefix {\"outer\":{\"inner\":[1,2,3]},\"x\":\"y\"} suffix"
"unicode||{\"name\":\"Café résumé\",\"emoji\":\"⭐\"}"
"escaped_quotes||{\"msg\":\"she said \\\"hello\\\"\"}"
"empty_object||{}"
"array_of_objects||[{\"a\":1},{\"b\":2}]"
"no_json||just prose, no json"
"depth_balanced_invalid||{not a key: still not}"
"trailing_garbage||{\"k\":\"v\"} and then 5} more } stuff"
)
TOTAL=0; MATCH=0; DIFF=0
DIFF_DETAIL=""
for entry in "${FIXTURES[@]}"; do
IFS='||' read -r label raw <<<"$entry"
TOTAL=$((TOTAL+1))
rust_out=$(printf '%s' "$raw" | "$RUST_BIN" 2>/dev/null || echo "RUST_ERROR")
go_out=$(printf '%s' "$raw" | "$GO_BIN" 2>/dev/null || echo "GO_ERROR")
# Normalize JSON serialization (key order) before comparing.
rust_norm=$(echo "$rust_out" | jq -cS . 2>/dev/null || echo "$rust_out")
go_norm=$(echo "$go_out" | jq -cS . 2>/dev/null || echo "$go_out")
if [ "$rust_norm" = "$go_norm" ]; then
MATCH=$((MATCH+1))
else
DIFF=$((DIFF+1))
raw_short=$(printf '%s' "$raw" | head -c 120 | tr '\n' ' ')
DIFF_DETAIL="$DIFF_DETAIL"$'\n\n'"### $label"$'\n''**Input (first 120 chars):** `'"$raw_short"'`'$'\n\n''**Rust:**'$'\n''```json'$'\n'"$rust_norm"$'\n''```'$'\n\n''**Go:**'$'\n''```json'$'\n'"$go_norm"$'\n''```'
fi
done
# ── Report ──────────────────────────────────────────────────────────
{
echo "# extract_json parity probe — Rust vs Go"
echo
echo "**Date:** $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "**Rust helper:** \`$RUST_BIN\` (links live \`gateway::v1::iterate::extract_json\`)"
echo "**Go helper:** \`$GO_BIN\` (links live \`internal/validator.ExtractJSON\`)"
echo
echo "Identical model-output strings → both runtimes' \`extract_json\`."
echo "Match = identical \`{matched, value}\` JSON output."
echo
echo "**Substrate gate:** \`cargo test -p gateway extract_json\` PASS before probe."
echo
echo "**Tally:** $MATCH match · $DIFF diff (out of $TOTAL fixtures)"
if [ -n "$DIFF_DETAIL" ]; then
echo
echo "## Divergences"
echo "$DIFF_DETAIL"
else
echo
echo "_No divergences — extract_json parity holds across all fixtures._"
fi
} > "$OUT"
echo "[parity] extract_json: $MATCH match / $DIFF diff (out of $TOTAL) → $OUT"
[ "$DIFF" -eq 0 ]