catalogd: HTML-safe escape fix + decisions tracker entry

Per 2026-05-03 step_7_8_retention_and_parity scrum (opus WARN on
parity_subject_audit.rs:canonical_json):

Go's json.Marshal HTML-escapes < > & to \u003c \u003e \u0026 by
default. Rust's serde_json::to_vec keeps them literal. Any audit
row with these chars in any string field would silently produce
different canonical bytes across runtimes → broken HMAC chain.
Latent because no production audit field has carried <>& yet, but
realistic for purpose strings ("error & retry") or trace_id values
("<HTTP-Request-Id>").

Fix: marshalNoEscapeHTML helper wraps json.Encoder.SetEscapeHTML(false)
+ trims trailing newline. Routed through writeCanonical for both
keys and scalar values.

Regression test: TestVerifyChain_HtmlChars_NotEscaped (purpose has &,
trace_id has <>) asserts the canonical bytes contain literal chars,
not escape sequences.

11 unit tests pass including the new one; parity probe still 6/6
byte-identical against live production audit logs.

Decisions tracker: added 2026-05-03 entry for SUBJECT_MANIFESTS_ON_CATALOGD
Steps 1-8 closure + 6th cross-runtime parity probe (was 5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
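
The divergence described in the commit message can be reproduced in isolation. The following is a minimal sketch, not the committed code: `marshalDefault` and `marshalLiteral` are hypothetical names chosen for illustration, and the second one only mirrors the approach the message describes (an encoder with `SetEscapeHTML(false)`, trailing newline trimmed).

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// marshalDefault uses json.Marshal, which HTML-escapes <, >, and &
// into \u003c, \u003e, and \u0026 by default.
func marshalDefault(v any) (string, error) {
	b, err := json.Marshal(v)
	return string(b), err
}

// marshalLiteral (hypothetical name) disables HTML escaping on a
// json.Encoder and trims the newline that Encode always appends.
func marshalLiteral(v any) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(v); err != nil {
		return "", err
	}
	return string(bytes.TrimRight(buf.Bytes(), "\n")), nil
}

func main() {
	for _, s := range []string{"error & retry", "<HTTP-Request-Id>"} {
		d, err := marshalDefault(s)
		if err != nil {
			panic(err)
		}
		l, err := marshalLiteral(s)
		if err != nil {
			panic(err)
		}
		// The two encodings differ in length and bytes, so any HMAC
		// computed over them differs as well.
		fmt.Printf("default: %s\nliteral: %s\n", d, l)
	}
}
```

Only the second form matches what a serializer that leaves these characters literal would emit, which is why the hashed bytes must come from it.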
parent 262a77a52a
commit 857ca4c971
@@ -59,6 +59,7 @@ Don't:
 | 2026-05-02 | **extract_json parity probe — 12/12 match across edge cases** | New `scripts/cutover/parity/extract_json_parity.sh` runs identical model-output strings through Rust `gateway::v1::iterate::extract_json` AND Go `validator.ExtractJSON`. 12 fixtures: fenced/unfenced blocks, nested objects, unicode, escaped quotes, top-level array, malformed JSON. Substrate gate: `cargo test -p gateway extract_json` PASS before probe. Result: **12/12 match.** Algorithms genuinely equivalent. Rust side gained `pub` on `extract_json` + new `bin/parity_extract_json` (~30 LOC). |
 | 2026-05-02 | **Validator wire-format alignment — DONE** | Custom `MarshalJSON`/`UnmarshalJSON` on Go's `validator.ValidationError` emits the Rust serde-tagged-enum shape `{"Schema":{"field":"x","reason":"y"}}`. UnmarshalJSON also accepts the legacy flat shape (migration safety) and rejects unknown variants (drift guard for future Rust enum additions). 4 new pinning tests in `types_test.go`. Re-run validator parity probe: **6/6 match** (was 1/6). |
 | 2026-05-02 | **Lance backend gauntlet (4-pack + root-cause fix) — DONE** | Lance crate had zero tests + no smoke when audited this morning. Shipped: (a) `sanitize_lance_err` over all 5 routes (search/doc/index/append/migrate) — missing-index now 404 not 500, no `/home/` or `/root/.cargo/` paths leaked; (b) 7 unit tests in `crates/vectord-lance` with synth Parquet helper; (c) 9-probe `scripts/lance_smoke.sh` against live `:3100`; (d) 10M re-bench (`reports/lance_10m_rebench_2026-05-02.md`) — search warm ~20ms, search cold ~46ms median. Bench surfaced doc-fetch p50 ~100ms (300x slower than ADR-019 100K projection); root-caused to lance-bench bypassing IndexMeta → warming auto-build never fired → no `doc_id` btree. **Fix shipped (commit `5d30b3d`)**: `lance_migrate` HTTP handler now auto-builds the btree inline (1.2s on 10M, +269MB), drops doc-fetch to ~5ms (20x). Live verified 9/9 smoke + post-restart doc-fetch 4-15ms. |
+| 2026-05-03 | **Subject manifests + per-subject HMAC audit log — DONE on Rust + Go** | Local-first compliance substrate per `lakehouse/docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md` Steps 1-8. Rust shipped: `SubjectManifest` type + Registry CRUD (`crates/catalogd/src/registry.rs`), `SubjectAuditWriter` with HMAC-SHA256 chain + per-subject Mutex serialization + canonical-JSON via BTreeMap (`subject_audit.rs`), backfill ETL (`bin/backfill_subjects`), gateway tool dispatch + validator decorator wiring, legal-tier `/audit/subject/{id}` endpoint with constant-time-eq token + tampering detection, daily `bin/retention_sweep` (BIPA-aware, idempotent, no auto-mutation). Go shipped: identical `internal/catalogd/subject.go` reader + `VerifyChain` over RAW LINE BYTES (avoids time-precision drift), 11 unit tests. **6th cross-runtime parity probe**: `scripts/cutover/parity/subject_audit_parity.sh` — 6/6 byte-identical assertions across known-answer fixture + 5 real production audit logs. Surfaced + closed three drift classes in authoring loop: (1) Go `omitempty` stripping `trace_id:""`; (2) `time.RFC3339Nano` truncating trailing-zero nanoseconds where chrono AutoSi keeps 9 digits; (3) Go `json.Marshal` HTML-escaping `<>&` where serde keeps literal — fixed via `marshalNoEscapeHTML` + raw-bytes canonicalization. Two cross-lineage scrums caught real bugs each round (chain corruption race, schema-evolution HMAC drift, hardcoded "success" classifier, token min length, chain_root from windowed slice, tampering detection, HTML escape divergence). |
 | _open_ | Decide Lance vs Parquet+HNSW for primary | Lance verified production-ready at 10M (this morning's gauntlet). HNSW at 10M doesn't fit RAM (~60GB for vectors+graph), so the comparison is between Lance and Parquet+HNSW-with-spilling. Decide once we have a 10M ingest scenario where the Parquet path is bottlenecked. |
 | _open_ | Pick Go primary vs Rust primary | Both viable. Go has perf edge after today; Rust has production deploy + producer-side completeness. |
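
The first drift class named in the 2026-05-03 tracker row (Go `omitempty` stripping `trace_id:""`) is easy to see in a small example. This is a sketch under assumptions: the struct names `accessorOmit` and `accessorKeep` and the field set are hypothetical and are not taken from the repository; only the `trace_id` key and the `omitempty` behaviour come from the row above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// accessorOmit (hypothetical) drops trace_id from the output when it is "".
type accessorOmit struct {
	Purpose string `json:"purpose"`
	TraceID string `json:"trace_id,omitempty"`
}

// accessorKeep (hypothetical) always emits trace_id, even when empty.
type accessorKeep struct {
	Purpose string `json:"purpose"`
	TraceID string `json:"trace_id"`
}

// encode marshals v and returns the JSON text, or "" on error.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encode(accessorOmit{Purpose: "read"})) // {"purpose":"read"}
	fmt.Println(encode(accessorKeep{Purpose: "read"})) // {"purpose":"read","trace_id":""}
}
```

A writer that always emits the key and a reader that re-marshals through an `omitempty` struct would hash different bytes for the same row, which is consistent with the row's note that `VerifyChain` works over raw line bytes.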
@@ -21,6 +21,7 @@
 package catalogd

 import (
+	"bytes"
 	"crypto/hmac"
 	"crypto/sha256"
 	"encoding/hex"
@@ -203,9 +204,36 @@ func canonicalRowBytesFromStruct(row *SubjectAuditRow) ([]byte, error) {
 	return canonicalRowBytesFromRaw(raw)
 }

+// marshalNoEscapeHTML wraps json.Encoder with HTML escaping disabled.
+//
+// Why: Go's json.Marshal escapes `<`, `>`, `&` to `\u003c`, `\u003e`,
+// `\u0026` by default. Rust's serde_json::to_vec keeps them literal.
+// Any string field containing one of those characters would produce
+// different canonical bytes across runtimes → broken HMAC chain.
+// (Caught 2026-05-03 by opus scrum WARN on parity_subject_audit.rs:
+// canonical_json — initially undetected because no production audit
+// field contained `<>&`, but realistic for purpose strings like
+// "error & retry" or trace_id "<HTTP-Request-Id>".)
+//
+// Also strips the trailing newline json.Encoder appends — that newline
+// is meaningful to JSONL consumers but is junk for hash input.
+func marshalNoEscapeHTML(v any) ([]byte, error) {
+	var buf bytes.Buffer
+	enc := json.NewEncoder(&buf)
+	enc.SetEscapeHTML(false)
+	if err := enc.Encode(v); err != nil {
+		return nil, err
+	}
+	out := buf.Bytes()
+	return bytes.TrimRight(out, "\n"), nil
+}
+
 // writeCanonical recursively writes v as canonical JSON: object keys
 // sorted alphabetically, no insignificant whitespace. Arrays preserve
 // element order (semantically significant per spec §3).
+//
+// All scalar emission goes through marshalNoEscapeHTML so the byte
+// sequence matches Rust's serde_json output character-for-character.
 func writeCanonical(buf *strings.Builder, v any) error {
 	switch t := v.(type) {
 	case map[string]any:
@@ -219,7 +247,7 @@ func writeCanonical(buf *strings.Builder, v any) error {
 			if i > 0 {
 				buf.WriteByte(',')
 			}
-			ks, err := json.Marshal(k)
+			ks, err := marshalNoEscapeHTML(k)
 			if err != nil {
 				return fmt.Errorf("marshal key: %w", err)
 			}
@@ -242,9 +270,7 @@ func writeCanonical(buf *strings.Builder, v any) error {
 		}
 		buf.WriteByte(']')
 	default:
-		// json.Number, string, bool, nil — encoding/json renders these
-		// the same way Rust's serde_json does (compact, RFC-8259-conformant).
-		bs, err := json.Marshal(v)
+		bs, err := marshalNoEscapeHTML(v)
 		if err != nil {
 			return fmt.Errorf("marshal scalar: %w", err)
 		}
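
The hunks above show only fragments of the canonical writer, so a self-contained sketch of the whole technique may help: sort object keys, keep array order, and emit every key and scalar without HTML escaping. The bodies below are assumptions filled in for illustration and are not the repository's code. `sketchMarshal`, `sketchWriteCanonical`, and `canonicalize` are hypothetical names, and decoding with `UseNumber` is an assumed way to keep number text intact.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// sketchMarshal encodes v compactly with HTML escaping off and without
// the trailing newline that json.Encoder.Encode appends.
func sketchMarshal(v any) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(v); err != nil {
		return nil, err
	}
	return bytes.TrimRight(buf.Bytes(), "\n"), nil
}

// sketchWriteCanonical writes v with object keys in sorted (byte-wise)
// order and no insignificant whitespace; arrays keep their order.
func sketchWriteCanonical(buf *strings.Builder, v any) error {
	switch t := v.(type) {
	case map[string]any:
		keys := make([]string, 0, len(t))
		for k := range t {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		buf.WriteByte('{')
		for i, k := range keys {
			if i > 0 {
				buf.WriteByte(',')
			}
			ks, err := sketchMarshal(k)
			if err != nil {
				return fmt.Errorf("marshal key: %w", err)
			}
			buf.Write(ks)
			buf.WriteByte(':')
			if err := sketchWriteCanonical(buf, t[k]); err != nil {
				return err
			}
		}
		buf.WriteByte('}')
	case []any:
		buf.WriteByte('[')
		for i, e := range t {
			if i > 0 {
				buf.WriteByte(',')
			}
			if err := sketchWriteCanonical(buf, e); err != nil {
				return err
			}
		}
		buf.WriteByte(']')
	default:
		// json.Number, string, bool, nil.
		bs, err := sketchMarshal(v)
		if err != nil {
			return fmt.Errorf("marshal scalar: %w", err)
		}
		buf.Write(bs)
	}
	return nil
}

// canonicalize decodes raw JSON (numbers kept as json.Number) and
// re-emits it in the canonical form above.
func canonicalize(raw []byte) (string, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.UseNumber()
	var v any
	if err := dec.Decode(&v); err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := sketchWriteCanonical(&sb, v); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := canonicalize([]byte(`{"b": [2, 1], "a": "x & <y>"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"a":"x & <y>","b":[2,1]}
}
```

Keys need the same treatment as values because a key containing `<`, `>`, or `&` would otherwise be escaped just like a string value.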
@@ -237,6 +237,37 @@ func TestKnownAnswerVector(t *testing.T) {
 	}
 }

+// TestVerifyChain_HtmlChars_NotEscaped is the regression test for the
+// 2026-05-03 opus scrum WARN: Go's json.Marshal escapes `<`, `>`, `&`
+// to `\u003c`, `\u003e`, `\u0026` by default; Rust's serde_json keeps
+// them literal. Audit rows with these chars in any string field would
+// silently break the chain across runtimes. Fix is in writeCanonical's
+// marshalNoEscapeHTML helper. This test asserts canonical bytes contain
+// the literal `<`, `>`, `&` (proving the fix is in place).
+func TestVerifyChain_HtmlChars_NotEscaped(t *testing.T) {
+	r := mkRow("CAND-HTML", []string{"name"}, GenesisHash, "2026-05-03T12:00:00Z")
+	r.Accessor.Purpose = "error & retry" // & must NOT be \u0026
+	r.Accessor.TraceID = "<HTTP-Req-Id>" // < and > must NOT be \u003c / \u003e
+	canon, err := canonicalRowBytesFromStruct(&r)
+	if err != nil {
+		t.Fatalf("canonical: %v", err)
+	}
+	s := string(canon)
+	// FAIL if the bytes contain Go's HTML-safe \u003c / \u003e / \u0026
+	// escape sequences (six raw chars each: backslash, u, 0, 0, hex, hex).
+	// Those wouldn't match Rust's literal-char output and would silently
+	// break the cross-runtime HMAC chain. Note: the string literals below
+	// double the backslash, so each pattern is six literal bytes
+	// (backslash + u00xx), NOT a Go-source unicode escape.
+	if strings.Contains(s, "\\u003c") || strings.Contains(s, "\\u003e") || strings.Contains(s, "\\u0026") {
+		t.Fatalf("canonical bytes contain Go HTML-escape sequences (would diverge from Rust):\n%s", s)
+	}
+	// PASS only if the literal chars survived round-trip.
+	if !strings.Contains(s, "\"<HTTP-Req-Id>\"") || !strings.Contains(s, "\"error & retry\"") {
+		t.Fatalf("canonical bytes missing literal <>&:\n%s", s)
+	}
+}
+
 // TestVerifyChain_RawBytesPreserveTimePrecision is the regression test
 // for the 2026-05-03 WORKER-5 finding: when a row's nanoseconds end in
 // 0, time.RFC3339Nano strips the trailing zero on re-marshal, producing
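
The trailing-zero behaviour of `time.RFC3339Nano` mentioned in the comment above can be shown directly. This is a sketch for illustration only: `fixedNanoLayout`, `sampleTime`, and `formatBoth` are hypothetical names, and the fixed nine-digit layout is just a contrast case, since the tracker entry says the Go side verifies over raw line bytes instead of re-formatting timestamps.

```go
package main

import (
	"fmt"
	"time"
)

// fixedNanoLayout (hypothetical) uses zeros in the fractional part, so
// Format always prints all nine digits. time.RFC3339Nano uses nines,
// which drop trailing zeros.
const fixedNanoLayout = "2006-01-02T15:04:05.000000000Z07:00"

// sampleTime returns a UTC instant whose nanosecond field ends in 0.
func sampleTime() time.Time {
	return time.Date(2026, 5, 3, 12, 0, 0, 123456780, time.UTC)
}

// formatBoth renders t with RFC3339Nano and with the fixed-width layout.
func formatBoth(t time.Time) (trimmed, fixed string) {
	return t.Format(time.RFC3339Nano), t.Format(fixedNanoLayout)
}

func main() {
	trimmed, fixed := formatBoth(sampleTime())
	fmt.Println(trimmed) // 2026-05-03T12:00:00.12345678Z
	fmt.Println(fixed)   // 2026-05-03T12:00:00.123456780Z
}
```

The two strings differ by one byte, which is enough to change an HMAC computed over a re-marshalled row.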