2 Commits

root
e38f3573ff subject manifests Steps 1-4 — fix scrum-flagged BLOCKs and WARNs
2026-05-03 cross-lineage scrum on the subjects_steps_1_to_4 wave
returned 14 distinct findings, 0 convergent. opus verdict was HOLD
with 3 BLOCKs around the audit-chain integrity. All real. Fixed:

──────────────────────────────────────────────────────────────────
BLOCK 1 — opus subject_audit.rs:172 + execution_loop.rs:391
  Concurrency race: append_line is read-modify-write; the gateway
  hook used tokio::spawn fan-out → two concurrent appends to the
  same subject both read the same prev_hash, both compute their
  HMAC from the same prev, second write silently overwrites first
  → row lost AND chain broken.

  Fix:
  - SubjectAuditWriter gains per-subject Mutex map. append() acquires
    the subject's lock for the duration of the read-modify-write.
    Different subjects still parallelize.
  - Gateway hook switches from tokio::spawn to inline await. Per-row
    cost is ~1ms (one object_store put); inline is correct AND cheap.
  - New regression test: 50 concurrent appends to the same subject,
    asserts all 50 land with intact chain.

BLOCK 2 — opus subject_audit.rs:108
  Non-deterministic canonicalization: serde_json serializes struct
  fields in declaration order. Schema evolution (adding/reordering
  fields) silently changes the bytes verify_chain hashes → chain
  breaks even when nothing was actually tampered with.

  Fix:
  - New canonical_json() free fn — recursive value rewrite to sort
    object keys alphabetically (BTreeMap projection), arrays preserve
    order, scalars pass through. Stable across struct evolution.
  - Both append() and verify_chain() now compute HMAC over canonical
    bytes, not declaration-order bytes.
  - New regression tests: alphabetical-key + array-order-preserved.

WARN — opus execution_loop.rs:401
  Audit row's `result` was hardcoded to "success" for every Ok(result),
  including payloads like {"error":"not found"}. Misleads compliance review.

  Fix:
  - New audit_result_state() free fn that inspects the payload
    top-level for error/denied/not_found/status signals (per spec
    §3.2 enum). Defaults to "success" only when no error signal.
  - 4 new tests covering each enum case + falsy-signals defense.

WARN — opus registry.rs:735
  Storage-key collision: sanitize_view_name(id) is the disk key,
  but the in-memory HashMap was keyed by raw candidate_id. Two
  distinct ids that sanitize to the same key (e.g. "CAND/1" and
  "CAND_1") would collide on disk while appearing distinct in
  memory; second put silently overwrites first; rebuild loads only
  one.

  Fix:
  - put_subject() / get_subject() / delete_subject() / rebuild()
    all key the in-memory HashMap by sanitize_view_name(id), matching
    the storage key shape.
  - Collision guard: put_subject() refuses (with clear error) when
    the sanitized key matches an EXISTING subject with a DIFFERENT
    raw candidate_id.
  - New regression test: put("CAND/1") then put("CAND_1") errors
    + first subject survives.

WARN — opus backfill_subjects.rs:189
  trim_start_matches strips REPEATED prefixes; the spec wanted
  one-shot semantics. Edge case unlikely in practice but real.

  Fix:
  - Switched to strip_prefix(&prefix).unwrap_or(&cid). One-shot.

INFO — opus subject_audit.rs:131
  Per-byte format!("{:02x}", b) allocates each iteration. Hot path
  on every append.

  Fix:
  - Replaced with const HEX lookup table + push() into preallocated
    String. Same output bytes, no per-byte allocation.

──────────────────────────────────────────────────────────────────
Test summary post-fix:
  catalogd subject_audit: 11/11 PASS (added 4 new — concurrency
                          race regression, parallel-different-subjects,
                          canonical-key sort, canonical-array order)
  catalogd registry subject: 6/6 PASS (added 1 new — collision guard)
  gateway execution_loop subject: 10/10 PASS (added 4 new —
                          audit_result_state enum coverage)

  All 27 subject-related tests green. cargo build --release clean.

The convergent-zero scrum result was misleading on its face — opus
caught real BLOCKs that kimi/qwen missed. Per
feedback_cross_lineage_review.md: opus is the load-bearing reviewer;
single-opus BLOCKs warrant manual verification, which here confirmed
all three were correct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 03:37:45 -05:00
root
bce6dfd1ee catalogd: Step 3 — backfill_subjects binary (BIPA-defensible defaults)
Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 3.
Reads a parquet source, creates one SubjectManifest per row with the
spec-defined safe defaults, persists via Registry::put_subject().

Defaults baked in (per spec §2 + §5 Step 5):
  - vertical = unknown                     (HIPAA fail-closed)
  - consent.general_pii = pending_backfill_review  (NOT inferred_existing — BIPA defense)
  - consent.biometric  = never_collected   (no biometric data backfilled)
  - retention.general_pii_until = now + 4 years
  - retention.policy = "4_year_default"

Conservative ergonomics:
  - --limit 1000 by default. --all to do the full source.
  - --dry-run for parse + count + sample without writes.
  - --concurrency 32 (bounded via tokio::sync::Semaphore).
  - Idempotent: skips subjects that already exist in catalog.
  - Progress reports every ~5% (or 5K rows, whichever smaller).

Live verification on workers_500k.parquet:
  --limit 100 dry-run:  parsed 100 rows, sampled WORKER-1..5, 0 writes ✓
  --limit 100 commit:   100 inserted, 0 failed, 100 files in
                        data/_catalog/subjects/ ✓
  --limit 100 re-run:   0 inserted, 100 skipped (idempotent) ✓

Sample manifest (data/_catalog/subjects/WORKER-1.json):
  {
    "schema": "subject_manifest.v1",
    "candidate_id": "WORKER-1",
    "status": "active",
    "vertical": "unknown",
    "consent": {
      "general_pii": {"status": "pending_backfill_review", ...},
      "biometric":   {"status": "never_collected",         ...}
    },
    "retention": {"general_pii_until": "2030-05-02T...", "policy": "4_year_default"},
    "datasets": [{"name": "workers_500k", "key_column": "worker_id", "key_value": "1"}]
  }

NOT in this commit (future steps):
  - Step 4: Wire gateway tool registry to write audit rows on every
    candidate_id returned (uses SubjectAuditWriter from Step 2)
  - Step 5: Wire validator WorkerLookup similarly
  - Step 6: /audit/subject/{id} HTTP endpoint
  - Step 7: Daily retention sweep
  - Backfill the full 500K (operator decision: --all when ready;
    note: 500K JSON files in one dir will slow startup load — may
    want SQLite/single-file backend before that scale)

Operator note: backfill is run-once. To extend to candidates table,
re-run with --dataset candidates --key-column candidate_id (no prefix
since candidate_id is already the canonical token there).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 03:22:54 -05:00