catalogd: HTML-safe escape fix + decisions tracker entry

Per 2026-05-03 step_7_8_retention_and_parity scrum (opus WARN on
parity_subject_audit.rs:canonical_json):

Go's json.Marshal HTML-escapes < > & to \u003c \u003e \u0026 by
default. Rust's serde_json::to_vec keeps them literal. Any audit
row with these chars in any string field would silently produce
different canonical bytes across runtimes → broken HMAC chain.
Latent because no production audit field has carried <>& yet, but
realistic for purpose strings ("error & retry") or trace_id values
("<HTTP-Request-Id>").
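
A minimal sketch of the divergence described above, assuming a made-up key and a bare string rather than a real audit row. The helper names (macHex, goDefaultJSON) and the key are hypothetical and are not part of catalogd; the point is only that the default Go encoding and the literal encoding hash differently:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// macHex returns the hex-encoded HMAC-SHA256 of msg under key.
func macHex(key, msg []byte) string {
	m := hmac.New(sha256.New, key)
	m.Write(msg)
	return hex.EncodeToString(m.Sum(nil))
}

// goDefaultJSON marshals v with encoding/json defaults (HTML escaping on).
func goDefaultJSON(v any) []byte {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return b
}

func main() {
	key := []byte("demo-key") // hypothetical key, for illustration only

	goBytes := goDefaultJSON("error & retry")
	literal := []byte(`"error & retry"`) // bytes a non-escaping encoder emits

	fmt.Printf("go default: %s\n", goBytes)
	fmt.Printf("literal:    %s\n", literal)
	fmt.Println("same HMAC:", macHex(key, goBytes) == macHex(key, literal))
}
```

Both byte strings decode to the same JSON value, so ordinary round-trip tests would not notice the difference; only a comparison of the hashed bytes does.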

Fix: marshalNoEscapeHTML helper wraps json.Encoder.SetEscapeHTML(false)
+ trims trailing newline. Routed through writeCanonical for both
keys and scalar values.
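
For illustration, a standalone sketch of the same technique, under the assumption that only the standard library is involved. The name marshalLiteral is hypothetical; the helper actually shipped is marshalNoEscapeHTML in the diff for this commit:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// marshalLiteral encodes v as JSON with HTML escaping disabled and
// drops the single trailing newline that json.Encoder.Encode appends.
func marshalLiteral(v any) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(v); err != nil {
		return nil, err
	}
	return bytes.TrimSuffix(buf.Bytes(), []byte("\n")), nil
}

func main() {
	for _, s := range []string{"error & retry", "<HTTP-Request-Id>"} {
		escaped, err := json.Marshal(s)
		if err != nil {
			panic(err)
		}
		literal, err := marshalLiteral(s)
		if err != nil {
			panic(err)
		}
		fmt.Printf("json.Marshal: %s\n", escaped)
		fmt.Printf("no-escape:    %s\n", literal)
	}
}
```

The newline trim matters as much as the escape flag: Encode always terminates its output with "\n", and leaving it in the hash input would change the canonical bytes.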

Regression test: TestVerifyChain_HtmlChars_NotEscaped (purpose has &,
trace_id has <>) asserts the canonical bytes contain literal chars,
not escape sequences.

11 unit tests pass including the new one; parity probe still 6/6
byte-identical against live production audit logs.

Decisions tracker: added 2026-05-03 entry for SUBJECT_MANIFESTS_ON_CATALOGD
Steps 1-8 closure + 6th cross-runtime parity probe (was 5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
root 2026-05-03 04:29:53 -05:00
parent 262a77a52a
commit 857ca4c971
3 changed files with 62 additions and 4 deletions

@ -59,6 +59,7 @@ Don't:
| 2026-05-02 | **extract_json parity probe — 12/12 match across edge cases** | New `scripts/cutover/parity/extract_json_parity.sh` runs identical model-output strings through Rust `gateway::v1::iterate::extract_json` AND Go `validator.ExtractJSON`. 12 fixtures: fenced/unfenced blocks, nested objects, unicode, escaped quotes, top-level array, malformed JSON. Substrate gate: `cargo test -p gateway extract_json` PASS before probe. Result: **12/12 match.** Algorithms genuinely equivalent. Rust side gained `pub` on `extract_json` + new `bin/parity_extract_json` (~30 LOC). |
| 2026-05-02 | **Validator wire-format alignment — DONE** | Custom `MarshalJSON`/`UnmarshalJSON` on Go's `validator.ValidationError` emits the Rust serde-tagged-enum shape `{"Schema":{"field":"x","reason":"y"}}`. UnmarshalJSON also accepts the legacy flat shape (migration safety) and rejects unknown variants (drift guard for future Rust enum additions). 4 new pinning tests in `types_test.go`. Re-run validator parity probe: **6/6 match** (was 1/6). |
| 2026-05-02 | **Lance backend gauntlet (4-pack + root-cause fix) — DONE** | Lance crate had zero tests + no smoke when audited this morning. Shipped: (a) `sanitize_lance_err` over all 5 routes (search/doc/index/append/migrate) — missing-index now 404 not 500, no `/home/` or `/root/.cargo/` paths leaked; (b) 7 unit tests in `crates/vectord-lance` with synth Parquet helper; (c) 9-probe `scripts/lance_smoke.sh` against live `:3100`; (d) 10M re-bench (`reports/lance_10m_rebench_2026-05-02.md`) — search warm ~20ms, search cold ~46ms median. Bench surfaced doc-fetch p50 ~100ms (300x slower than ADR-019 100K projection); root-caused to lance-bench bypassing IndexMeta → warming auto-build never fired → no `doc_id` btree. **Fix shipped (commit `5d30b3d`)**: `lance_migrate` HTTP handler now auto-builds the btree inline (1.2s on 10M, +269MB), drops doc-fetch to ~5ms (20x). Live verified 9/9 smoke + post-restart doc-fetch 4-15ms. |
| 2026-05-03 | **Subject manifests + per-subject HMAC audit log — DONE on Rust + Go** | Local-first compliance substrate per `lakehouse/docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md` Steps 1-8. Rust shipped: `SubjectManifest` type + Registry CRUD (`crates/catalogd/src/registry.rs`), `SubjectAuditWriter` with HMAC-SHA256 chain + per-subject Mutex serialization + canonical-JSON via BTreeMap (`subject_audit.rs`), backfill ETL (`bin/backfill_subjects`), gateway tool dispatch + validator decorator wiring, legal-tier `/audit/subject/{id}` endpoint with constant-time-eq token + tampering detection, daily `bin/retention_sweep` (BIPA-aware, idempotent, no auto-mutation). Go shipped: identical `internal/catalogd/subject.go` reader + `VerifyChain` over RAW LINE BYTES (avoids time-precision drift), 11 unit tests. **6th cross-runtime parity probe**: `scripts/cutover/parity/subject_audit_parity.sh` — 6/6 byte-identical assertions across known-answer fixture + 5 real production audit logs. Surfaced + closed three drift classes in authoring loop: (1) Go `omitempty` stripping `trace_id:""`; (2) `time.RFC3339Nano` truncating trailing-zero nanoseconds where chrono AutoSi keeps 9 digits; (3) Go `json.Marshal` HTML-escaping `<>&` where serde keeps literal — fixed via `marshalNoEscapeHTML` + raw-bytes canonicalization. Two cross-lineage scrums caught real bugs each round (chain corruption race, schema-evolution HMAC drift, hardcoded "success" classifier, token min length, chain_root from windowed slice, tampering detection, HTML escape divergence). |
| _open_ | Decide Lance vs Parquet+HNSW for primary | Lance verified production-ready at 10M (this morning's gauntlet). HNSW at 10M doesn't fit RAM (~60GB for vectors+graph), so the comparison is between Lance and Parquet+HNSW-with-spilling. Decide once we have a 10M ingest scenario where the Parquet path is bottlenecked. |
| _open_ | Pick Go primary vs Rust primary | Both viable. Go has perf edge after today; Rust has production deploy + producer-side completeness. |

@ -21,6 +21,7 @@
package catalogd
import (
	"bytes"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
@ -203,9 +204,36 @@ func canonicalRowBytesFromStruct(row *SubjectAuditRow) ([]byte, error) {
	return canonicalRowBytesFromRaw(raw)
}
// marshalNoEscapeHTML wraps json.Encoder with HTML escaping disabled.
//
// Why: Go's json.Marshal escapes `<`, `>`, `&` to `\u003c`, `\u003e`,
// `\u0026` by default. Rust's serde_json::to_vec keeps them literal.
// Any string field containing one of those characters would produce
// different canonical bytes across runtimes → broken HMAC chain.
// (Caught 2026-05-03 by opus scrum WARN on parity_subject_audit.rs:
// canonical_json — initially undetected because no production audit
// field contained `<>&`, but realistic for purpose strings like
// "error & retry" or trace_id "<HTTP-Request-Id>".)
//
// Also strips the trailing newline json.Encoder appends — that newline
// is meaningful to JSONL consumers but is junk for hash input.
func marshalNoEscapeHTML(v any) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(v); err != nil {
		return nil, err
	}
	out := buf.Bytes()
	return bytes.TrimRight(out, "\n"), nil
}
// writeCanonical recursively writes v as canonical JSON: object keys
// sorted alphabetically, no insignificant whitespace. Arrays preserve
// element order (semantically significant per spec §3).
//
// All scalar emission goes through marshalNoEscapeHTML so the byte
// sequence matches Rust's serde_json output character-for-character.
func writeCanonical(buf *strings.Builder, v any) error {
	switch t := v.(type) {
	case map[string]any:
@ -219,7 +247,7 @@ func writeCanonical(buf *strings.Builder, v any) error {
			if i > 0 {
				buf.WriteByte(',')
			}
-			ks, err := json.Marshal(k)
+			ks, err := marshalNoEscapeHTML(k)
			if err != nil {
				return fmt.Errorf("marshal key: %w", err)
			}
@ -242,9 +270,7 @@ func writeCanonical(buf *strings.Builder, v any) error {
		}
		buf.WriteByte(']')
	default:
-		// json.Number, string, bool, nil — encoding/json renders these
-		// the same way Rust's serde_json does (compact, RFC-8259-conformant).
-		bs, err := json.Marshal(v)
+		bs, err := marshalNoEscapeHTML(v)
		if err != nil {
			return fmt.Errorf("marshal scalar: %w", err)
		}

@ -237,6 +237,37 @@ func TestKnownAnswerVector(t *testing.T) {
	}
}
// TestVerifyChain_HtmlChars_NotEscaped is the regression test for the
// 2026-05-03 opus scrum WARN: Go's json.Marshal escapes `<`, `>`, `&`
// to `\u003c`, `\u003e`, `\u0026` by default; Rust's serde_json keeps
// them literal. Audit rows with these chars in any string field would
// silently break the chain across runtimes. Fix is in writeCanonical's
// marshalNoEscapeHTML helper. This test asserts canonical bytes contain
// the literal `<`, `>`, `&` (proving the fix is in place).
func TestVerifyChain_HtmlChars_NotEscaped(t *testing.T) {
	r := mkRow("CAND-HTML", []string{"name"}, GenesisHash, "2026-05-03T12:00:00Z")
	r.Accessor.Purpose = "error & retry" // & must NOT be \u0026
	r.Accessor.TraceID = "<HTTP-Req-Id>" // < and > must NOT be \u003c / \u003e
	canon, err := canonicalRowBytesFromStruct(&r)
	if err != nil {
		t.Fatalf("canonical: %v", err)
	}
	s := string(canon)
	// FAIL if the bytes contain Go's HTML-safe \u003c / \u003e / \u0026
	// escape sequences (six raw chars each: backslash, u, 0, 0, hex, hex).
	// Those wouldn't match Rust's literal-char output and would silently
	// break the cross-runtime HMAC chain. Note: the backslash in each
	// pattern below is itself escaped, so backslash + u00xx is six literal
	// bytes, NOT a Go-source unicode escape.
	if strings.Contains(s, "\\u003c") || strings.Contains(s, "\\u003e") || strings.Contains(s, "\\u0026") {
		t.Fatalf("canonical bytes contain Go HTML-escape sequences (would diverge from Rust):\n%s", s)
	}
	// PASS only if the literal chars survived round-trip.
	if !strings.Contains(s, "\"<HTTP-Req-Id>\"") || !strings.Contains(s, "\"error & retry\"") {
		t.Fatalf("canonical bytes missing literal <>&:\n%s", s)
	}
}
// TestVerifyChain_RawBytesPreserveTimePrecision is the regression test
// for the 2026-05-03 WORKER-5 finding: when a row's nanoseconds end in
// 0, time.RFC3339Nano strips the trailing zero on re-marshal, producing