Two new cross-runtime parity probes joining the validator probe from
the gauntlet wave. Pattern: feed identical input through Rust and Go;
diff outputs. Each probe surfaced a different signal.
## Materializer parity probe
scripts/cutover/parity/materializer_parity.sh runs Bun + Go
materializer against an identical synthetic data/_kb/ root, diffs the
resulting evidence/ JSONL byte-equivalent (modulo provenance.recorded_at).
**First run: 0/2 match.** Real finding: Go's Provenance.LineOffset
had `json:"line_offset,omitempty"` which strips the field when value
is 0. Line offset 0 is the FIRST ROW of every source file — a real
semantic value, not absent. Bun side always emits it.
Fix: drop `omitempty` on Provenance.LineOffset. Updated comment
explaining why.
**Re-run: 2/2 match.** On-wire JSON parity holds.
## extract_json parity probe
scripts/cutover/parity/extract_json_parity.sh feeds 12 fixture
strings through both runtimes' extract_json:
- fenced ```json``` blocks
- unfenced ``` blocks
- bare braces with prose around
- first-balanced-of-many
- nested objects
- unicode in string values
- escaped quotes
- empty object
- top-level array (both return first inner object)
- no JSON
- depth-balanced but invalid syntax
- trailing garbage
Substrate gate: cargo test -p gateway extract_json PASS before probe.
**Result: 12/12 match.** Algorithms genuinely equivalent.
## scripts/cutover/parity/extract_json_helper/main.go
Tiny Go binary that reads stdin, calls validator.ExtractJSON, prints
{matched, value} JSON. Counterpart to the Rust parity_extract_json
binary in golangLAKEHOUSE's sibling lakehouse repo (separate commit).
## Pattern crystallized
Every cross-runtime port should land with a parity probe. Three
probes now exist:
- validator (5/6 wire-format gap captured 2026-05-02)
- materializer (caught + fixed real bug 2026-05-02)
- extract_json (12/12 match 2026-05-02)
The instrument is reusable — each new shared HTTP/CLI surface gets
a probe row added.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two threads landing together — the doc edits interleave so they ship
in a single commit.
1. **vectord substrate fix verified at original scale** (closes the
2026-05-01 thread). Re-ran multitier 5min @ conc=50: 132,211
scenarios at 438/sec, 6/6 classes at 0% failure (was 4/6 pre-fix).
Throughput dropped 1,115 → 438/sec because previously-broken
scenarios now do real HNSW Add work — honest cost of correctness.
The fix (i.vectors side-store + safeGraphAdd recover wrappers +
smallIndexRebuildThreshold=32 + saveTask coalescing) holds at the
footprint that originally surfaced the bug.
2. **Materializer port** — internal/materializer + cmd/materializer +
scripts/materializer_smoke.sh. Ports scripts/distillation/transforms.ts
(12 transforms) + build_evidence_index.ts (idempotency, day-partition,
receipt). On-wire JSON shape matches TS so Bun and Go runs are
interchangeable. 14 tests green.
3. **Replay port** — internal/replay + cmd/replay +
scripts/replay_smoke.sh. Ports scripts/distillation/replay.ts
(retrieve → bundle → /v1/chat → validate → log). Closes audit-FULL
phase 7 live invocation on the Go side. Both runtimes append to the
same data/_kb/replay_runs.jsonl (schema=replay_run.v1). 14 tests green.
Side effect on internal/distillation/types.go: EvidenceRecord gained
prompt_tokens, completion_tokens, and metadata fields to mirror the TS
shape the materializer transforms produce.
STATE_OF_PLAY refreshed to 2026-05-02; ARCHITECTURE_COMPARISON decisions
tracker moves the materializer + replay items from _open_ to DONE and
adds the substrate-fix scale verification row.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes 4 of the 5 phases the initial audit-FULL port left as
deferred. The pattern: most "deferred" phases didn't actually need
the un-ported Rust pieces — they were observer-mode by design and
just needed to read existing on-disk artifacts.
Phase 1 (schema validators) → ported via exec.Command:
Invokes `go test ./internal/distillation/...` — the Go equivalent
of Rust's `bun test auditor/schemas/distillation/`. New
GoTestModule field on AuditFullOptions controls the package
pattern; empty disables the invocation (test mode, prevents
recursion when audit-full is invoked from inside `go test`).
Phase 2 (evidence materialization) → ported as observer:
Reads data/evidence/ directly and tallies rows + tier-1 source
hits. Doesn't re-run the materializer (which is Rust-side TS).
Emits p2_evidence_rows + p2_evidence_skips metrics matching
Rust shape — drop-in audit_baselines.jsonl entries possible.
Phase 5 (run summary) → ported as observer:
Reads reports/distillation/{run_id}/summary.json + 5 stage
receipts. Validates schema_version=1, run_hash sha256, git_commit
40-char hex, all stage receipts decode as JSON. Full schema
validation (StageReceipt schema) is intentionally NOT ported —
it would require porting the TS schemas/distillation/ validators
in full; basic shape checks catch the load-bearing invariants.
Phase 7 (replay log) → ported as observer:
Reads data/_kb/replay_runs.jsonl, validates last 50 rows parse
as JSON. Skips the live-replay invocation that Rust's phase 7
also does — porting Rust replay.ts is substantial and not in
scope. The "log shape sanity" check is what audit-full actually
needs; the live invocation is a separate concern.
Phase 6 (acceptance gate) — STILL SKIPPED:
Rust acceptance.ts is a TS-only fixture harness with bun-specific
deps. Porting the fixtures (tests/fixtures/distillation/acceptance/)
+ the 22-invariant runner to Go is an ADR-worth undertaking.
Documented in the header comment.
Live-data probe (against /home/profit/lakehouse):
Skips count: 4 → 1 (only phase 6).
Required checks: 6/6 → 12/12 PASS.
New metric: p2_evidence_rows=1055, BYTE-EQUAL to the Rust
pipeline's collect.records_out from the latest summary.json.
Cross-runtime parity now extends across phases 0/1/2/3/4/5/7.
6 new tests:
- TestPhase2_EvidenceTallyFromOnDisk: row + tier-1-hit tallying
- TestPhase5_FullSummaryFlow: complete run-summary fixture passes
- TestPhase5_ShortRunHashCaught: bad run_hash fails required check
- TestPhase7_ReplayLogReadsFromDisk: row-count reporting
- TestPhase7_MalformedTailRowsCaught: structural parse failure
- TestRunAuditFull_FullFixtureFlow updated to seed evidence/ +
reports/distillation/ for the phases now wired.
Cleanup: removed local sortStrings helper (replaced with sort.Strings
now that `sort` is imported for phase 5's mtime-sort).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3-lineage scrum on 434f466..0d4f033 surfaced one convergent finding
(Opus + Kimi) and 3 Opus-only real bugs. All actioned in this
commit. Two false positives (Kimi rollback misreading, Opus stale-
comment claim) verified + rejected — both required manual control-
flow inspection to refute, matching the documented Kimi-truncation
behavior in feedback_cross_lineage_review.md.
Convergent fix — DecodeIndex lost nil-meta items:
- Envelope version bumped 1 → 2.
- New v2 field: IDs []string carries the canonical ID set
explicitly, independent of meta map's nil-vs-{} sparseness.
- DecodeIndex accepts both versions: v2 reads from env.IDs; v1
falls back to meta-key inference (with the documented
limitation that nil-meta items are invisible — preserved for
backward-compat with already-persisted indexes).
- Encode emits v2 going forward.
- 2 new regression tests:
- TestEncodeDecode_NilMetaItemsSurviveRoundTrip: items added
with nil metadata MUST survive Encode → Decode and remain
visible to IDs(). Pre-fix would have yielded IDs() == [].
- TestDecodeIndex_V1BackwardCompat: hand-crafted v1 envelope
still decodes (proves the fallback path).
Opus-only fixes:
- handleMerge: non-ErrIndexNotFound errors at h.reg.Get(name) /
h.reg.Get(req.Dest) now return 500 + log instead of falling
through with nil src/dest pointers (which would panic on the
next deref). Real bug — only the sentinel error was handled.
- internal/drift/drift.go: mathLog wrapper removed; math.Log
inlined. Wrapper added no value (math was already imported).
- internal/distillation/audit_baseline.go: BuildAuditDriftTable's
bubble sort replaced with sort.Slice. Idiomatic + shorter.
Rejected after verification:
- Kimi WARN "missing rollback on partial merge": misread the
control flow. Code at cmd/vectord/main.go:404-414 does NOT
delete from src when dest.Add fails (continue before reaching
src.Delete). Only successful Adds trigger Deletes.
- Opus INFO "TimestampUnixNano comment references missing field":
field exists at scripts/multi_coord_stress/main.go:128. Opus
saw only the diff context, not the full file.
Deferred (no fired trigger):
- Opus WARN "no per-index lock during merge": no concurrent merge
callers today (operators run merge as deliberate one-shot job).
Worth a lock if/when matrixd or chatd start auto-triggering.
Disposition: reports/scrum/_evidence/2026-05-01/verdicts/post_role_gate_v1_disposition.md.
Build + vet + tests green; 2 new regression tests + all prior tests
unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original OPEN #2 line called for "SFT export pipeline +
audit_baselines lineage." Commit 7bb432f shipped the SFT export.
This commit ports the audit_baselines half — the longitudinal
drift signal that distinguishes "metrics shifted because the world
changed" from "metrics shifted because we broke something."
Mirrors Rust scripts/distillation/audit_full.ts's substrate:
- LoadLastBaseline(path) reads the most recent entry from
data/_kb/audit_baselines.jsonl. Returns (nil, nil) on missing
file (first run), errors on truncated last line (partial-write
detection — operators don't lose drift signal silently).
- AppendBaseline(path, baseline) appends one entry as a JSON line.
Atomic at the line level via bufio + O_APPEND. Creates the
parent directory if missing.
- BuildAuditDriftTable(prior, current, threshold) computes
per-metric drift. flag values mirror Rust exactly: first_run,
ok, warn. DefaultDriftWarnThreshold = 0.20 = Rust's 20%.
- FormatAuditDriftTable renders a fixed-width text grid for
stdout dumps in audit-full runs.
Edge cases handled:
- Zero-baseline: prior=0 means no division — PctChange stays nil.
current=0 → ok (no change). current>0 → warn (zero→nonzero is
always notable, never silently fine).
- New metric in current: flagged first_run, not "0%-change".
Operators see "this is a new signal we haven't tracked before."
- Sort: stable by metric name for deterministic JSON output and
clean CI diffs.
Generic on metric name (vs Rust's pinned p2_evidence_rows etc.):
the Rust phase numbering doesn't translate to Go directly. The
AuditBaselineRustCompat constant pins the Rust names so operators
running both runtimes use the same labels, which makes drift
comparison meaningful across the two pipelines.
13 new tests covering: missing file, last-line-wins, blank-line
tolerance, malformed-line errors, append round-trip, append-to-
existing, schema validation, first-run, threshold boundary,
zero-baseline, new-metric-in-current, sort-by-metric stability,
formatter output rendering.
OPEN #2's "audit_baselines lineage" half now closed. The
distillation package surface is at parity with the Rust pipeline:
scorer, scored runs, SFT export, audit baselines all available
on the Go side.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to b216b7e (which shipped the SFT export substrate). This
commit ports the synthesis logic, completing the migration:
- SynthesizeSft(scored, ev, recordedAt, sftID) → *SftSample
Mirrors the Rust synthesizeSft byte-for-byte. Returns nil for
extraction-class records + empty-text records (same skip
semantics as Rust).
- LoadEvidenceByRunID(scoredPath, cache) reads the paired evidence
JSONL (path derived by /scored-runs/ → /evidence/ replacement).
Per-call cache so multiple scored-runs files in the same dir
don't reload the same evidence.
- buildInstruction maps source_file stem → per-class instruction
template. All 8 templates (scrum_reviews, mode_experiments,
auto_apply, audits, observer_reviews, contract_analyses,
outcomes, default) match Rust output exactly so a/b validation
between runtimes can diff JSONL byte-for-byte.
- stemFromSourceFile strips data/_kb/ prefix + .jsonl suffix.
- ExportSft now writes data/distilled/sft/sft_export.jsonl with
the synthesized samples (DryRun=true skips file write).
Per-class templates verified by 8-case sub-test:
- scrum_reviews → "Review the file '...' against the PRD..."
- mode_experiments → "Run task_class='...' for file..."
- auto_apply → "Auto-apply: emit a 6-line surgical patch..."
- audits with phase: prefix → strips to bare phase name
- observer_reviews → "Observer-review the latest attempt..."
- contract_analyses with permit: prefix → strips to permit ID
- outcomes → "Run scenario; report per-event outcome..."
- unknown source → "Source 'X' run; produce the appropriate output"
Caveat documented inline: contract_analyses uses ev.metadata.contractor
in Rust to produce "Analyze contractor 'X' for permit 'Y'" when
present. Go's EvidenceRecord doesn't carry a free-form metadata bag
yet, so we always emit the no-contractor form. Operators needing
contractor-aware instructions can extend EvidenceRecord with an
explicit Metadata field (separate ADR).
Test additions (5 new):
- TestSynthesizeSft_PerSourceClass: 8 sub-cases, one per template
- TestSynthesizeSft_RejectsExtraction: extraction-role records skipped
- TestSynthesizeSft_RejectsEmptyText: empty/whitespace text skipped
- TestSynthesizeSft_ContextAssembly: matrix + pathway + model
context string formatting matches Rust " · " join
- TestExportSft_FullPort_WritesJSONL: end-to-end fixture, asserts
output contains expected instruction + omits firewalled records
Pre-existing TestExportSft_PartialPort_FirewallFires renamed +
updated to TestExportSft_FirewallFiresBeforeEvidenceLoad — reflects
the new contract that records passing the firewall but lacking
evidence land in "not-instructable" rather than being silently
exported. Honest semantics shift documented in the test.
OPEN #2 now fully closed (was: substrate-only). The synthesis path
no longer requires the Rust pipeline to be invoked — Go-side
operators can run the full distillation export end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Substantial wave addressing all 4 prior OPEN items. Three closed in
full, one partially (the speculative half deliberately deferred).
OPEN #1 — Periodic fresh→main index merge (FULL):
- POST /v1/vectors/index/{src}/merge with {dest, clear_source}
- Idempotent on re-runs (existing-in-dest items skipped)
- internal/vectord/index.go: new Index.IDs() snapshot method +
i.ids tracker field as canonical ID set, independent of meta
map's nil-vs-{} sparseness (was a real bug — IDs() backed by meta
alone missed items added with nil metadata)
- 4 cmd-level integration tests (happy path drain+clear, dim
mismatch, dest not found, self-merge rejection) + 1 unit test
- DecodeIndex backward-compat: old envelopes restore i.ids from
meta keys (best effort; new items going forward use the tracker)
OPEN #2 — Distillation SFT export (SUBSTRATE):
- internal/distillation/sft_export.go ports the load-bearing half:
IsSftNever predicate + ListScoredRunFiles (data/scored-runs/YYYY/
MM/DD walk) + LoadScoredRunsFromFile + partial ExportSft.
- Synthesis (instruction/input/response generation) deferred to a
separate wave — too big for this session, but the substrate
makes the next wave a port-not-design exercise.
- TestSftNever_PinsExpectedSet locks the contamination firewall
set: if a future commit adds/removes from SftNever, this test
fails — forcing the change through review.
- 5 new tests; firewall fires end-to-end through the partial port.
OPEN #3 — Distribution drift via PSI (FULL):
- internal/drift/drift.go: ComputeDistributionDrift via Population
Stability Index. Standard finance/risk metric, well-defined
verdict tiers (stable < 0.10, minor 0.10–0.25, major ≥ 0.25).
- Equal-width bucketing over combined min/max so neither dist
falls outside; epsilon-clamping for empty buckets so log doesn't
blow up. Per-bucket breakdown for drilldown.
- Pairs with the existing ComputeScorerDrift: scorer drift is
categorical, distribution drift is continuous. Different shapes,
same package.
- 7 new tests covering identical-is-stable, hard-shift-is-major,
moderate-detected-not-stable, empty-inputs-safe, all-identical-
safe, bucket-counts-conserved, num-buckets-clamping.
OPEN #4 — Ops nice-to-haves (PARTIAL — wall-clock done, others
deferred):
- (a) Real-time wall-clock for stress harness: per-phase elapsed
time logged to stdout as it runs (`[stress] phase NAME starting
(T+12.3s)` + `[stress] phase NAME done — 8.5s (T+20.8s)`).
Output.PhaseTimings + Output.TotalElapsedMs in JSON.
- (b) chatd fixture-mode S3 mock + (c) liberal-paraphrase
calibration: not actioned — no fired trigger, would be
speculative. Documented as deferred-until-need rather than
ignored. Per the project's discipline ("don't add features
beyond what the task requires").
OPEN list now empty / steady-state. Future items will land as
production triggers fire.
Build + vet + tests green; 18 new tests across the 4 closures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First slice of the Rust v1.0.0 distillation substrate (e7636f2)
ported to Go per ADR-001 #4 (port LOGIC, not bit-identical
reproducibility). This commit lands the LOAD-BEARING pieces named
in project_distillation_substrate.md memory:
- The deterministic Success Scorer (8 sub-scorers + dispatch)
- The contamination firewall on SFT samples (the "non-negotiable"
spec property: rejected/needs_human_review NEVER ship to SFT)
- All on-wire types + validators for ScoredRun, SftSample,
EvidenceRecord with Provenance
Files:
internal/distillation/types.go — types + ScorerVersion + SftNever
+ ValidateScoredRun + ValidateSftSample
internal/distillation/scorer.go — ScoreRecord + 8 class scorers +
BuildScoredRun (deterministic)
internal/distillation/scorer_test.go — ~40 test cases:
- source-class dispatch (verdict / telemetry / extraction)
- scrum_review (4 attempt cases)
- observer_review (5 verdict cases)
- audit (legacy + severity, 9 cases)
- auto_apply (4 cases)
- outcomes / mode_experiment / extraction
- CONTAMINATION FIREWALL: ErrSftContamination sentinel fires
on rejected/needs_human_review, distinct from typo errors
- empty-pair guard (instruction/response trim != "")
- reasons-required ScoredRun validation
- deterministic sig_hash on identical input
- purity check (input not mutated, repeatable output)
Per the 2026-04-29 cross-lineage scrum's discipline: false-positive
findings would be dismissed inline (none in this commit). Real
findings would be addressed before merge — but this is greenfield
port code reviewed against its Rust source line-by-line, which the
test suite encodes as truth tables.
Explicitly DEFERRED to follow-up commits:
- Materialization layer (jsonl read/write, date-partitioned
storage in data/scored-runs/YYYY/MM/DD/, evidence index)
- SFT exporter (file iteration + filtering — the SCORING firewall
is here; the EXPORT firewall is the next layer)
- export_preference, export_rag (other export shapes)
- Acceptance harness (16/16 acceptance gate that locks v1.0.0)
- replay, receipts, build_evidence_index, transforms
The scorer + firewall validator are pure functions — operational
tooling layers on top without changing the deterministic logic the
downstream learning loop depends on. The Go ScorerVersion stays at
v1.0.0 to match the Rust e7636f2 baseline; bumping in the Go
materialization commit is reserved for the next scoring-rule
change, NOT the port itself.
15-smoke regression all green. vet clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>