catalogd: Step 3 — backfill_subjects binary (BIPA-defensible defaults)
Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 3.
Reads a parquet source, creates one SubjectManifest per row with the
spec-defined safe defaults, persists via Registry::put_subject().
Defaults baked in (per spec §2 + §5 Step 5):
- vertical = unknown (HIPAA fail-closed)
- consent.general_pii = pending_backfill_review (NOT inferred_existing — BIPA defense)
- consent.biometric = never_collected (no biometric data backfilled)
- retention.general_pii_until = now + 4 years
- retention.policy = "4_year_default"
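The defaults above can be sketched roughly as follows. This is an illustrative std-only model: the struct and field names mirror the sample manifest, not the real SubjectManifest type, and the retention math here ignores leap days.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Illustrative only: field names mirror the sample manifest, but this
// struct is NOT the real SubjectManifest type from the spec.
#[derive(Debug, PartialEq)]
struct BackfillDefaults {
    vertical: &'static str,
    consent_general_pii: &'static str,
    consent_biometric: &'static str,
    retention_policy: &'static str,
    retention_until_secs: u64, // unix seconds; the binary stores RFC 3339 strings
}

// 4 years as seconds, ignoring leap days; the real retention math may differ.
const FOUR_YEARS_SECS: u64 = 4 * 365 * 24 * 60 * 60;

fn backfill_defaults(now: SystemTime) -> BackfillDefaults {
    let now_secs = now.duration_since(UNIX_EPOCH).unwrap().as_secs();
    BackfillDefaults {
        vertical: "unknown",                            // HIPAA fail-closed
        consent_general_pii: "pending_backfill_review", // BIPA-defensible
        consent_biometric: "never_collected",           // no biometric backfill
        retention_policy: "4_year_default",
        retention_until_secs: now_secs + FOUR_YEARS_SECS,
    }
}

fn main() {
    println!("{:?}", backfill_defaults(SystemTime::now()));
}
```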
Conservative ergonomics:
- --limit 1000 by default; pass --all to process the full source.
- --dry-run for parse + count + sample without writes.
- --concurrency 32 (bounded via tokio::sync::Semaphore).
- Idempotent: skips subjects that already exist in catalog.
- Progress reports every ~5% (or 5K rows, whichever is smaller).
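The bounded-write pattern behind --concurrency can be sketched with std primitives. The binary itself uses tokio::sync::Semaphore with async tasks; this thread-based semaphore is a self-contained stand-in showing the same shape (at most `limit` writes in flight), and all names here are illustrative.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Minimal counting semaphore; tokio::sync::Semaphore plays this role
// in the real binary.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Self { permits: Mutex::new(n), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Run `tasks` fake writes bounded to `limit` in flight; returns the
/// observed peak concurrency.
fn run_bounded(tasks: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let gauge = Arc::new(Mutex::new((0usize, 0usize))); // (current, peak)
    let handles: Vec<_> = (0..tasks)
        .map(|_| {
            let (sem, gauge) = (Arc::clone(&sem), Arc::clone(&gauge));
            thread::spawn(move || {
                sem.acquire();
                {
                    let mut g = gauge.lock().unwrap();
                    g.0 += 1;
                    g.1 = g.1.max(g.0);
                }
                thread::sleep(Duration::from_millis(1)); // stand-in for put_subject()
                gauge.lock().unwrap().0 -= 1;
                sem.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let peak = gauge.lock().unwrap().1;
    peak
}

fn main() {
    println!("peak in-flight: {}", run_bounded(200, 32));
}
```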
Live verification on workers_500k.parquet:
--limit 100 dry-run: parsed 100 rows, sampled WORKER-1..5, 0 writes ✓
--limit 100 commit: 100 inserted, 0 failed, 100 files in
data/_catalog/subjects/ ✓
--limit 100 re-run: 0 inserted, 100 skipped (idempotent) ✓
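The idempotent re-run behavior verified above reduces to a membership check against the catalog. A simplified model (the real binary consults the Registry, not an in-memory set; names here are hypothetical):

```rust
use std::collections::HashSet;

/// Simplified idempotency model: rows whose subject already exists in the
/// catalog are skipped; everything else is inserted.
fn plan_backfill(rows: &[String], existing: &HashSet<String>) -> (Vec<String>, usize) {
    let mut to_insert = Vec::new();
    let mut skipped = 0;
    for id in rows {
        if existing.contains(id) {
            skipped += 1;
        } else {
            to_insert.push(id.clone());
        }
    }
    (to_insert, skipped)
}

fn main() {
    let rows: Vec<String> = (1..=5).map(|i| format!("WORKER-{i}")).collect();
    // First run: empty catalog, everything inserts.
    let (first, skipped) = plan_backfill(&rows, &HashSet::new());
    println!("run 1: {} inserted, {} skipped", first.len(), skipped); // 5 inserted, 0 skipped
    // Second run: the catalog now holds every ID, so nothing inserts.
    let existing: HashSet<String> = first.into_iter().collect();
    let (second, skipped) = plan_backfill(&rows, &existing);
    println!("run 2: {} inserted, {} skipped", second.len(), skipped); // 0 inserted, 5 skipped
}
```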
Sample manifest (data/_catalog/subjects/WORKER-1.json):
{
"schema": "subject_manifest.v1",
"candidate_id": "WORKER-1",
"status": "active",
"vertical": "unknown",
"consent": {
"general_pii": {"status": "pending_backfill_review", ...},
"biometric": {"status": "never_collected", ...}
},
"retention": {"general_pii_until": "2030-05-02T...", "policy": "4_year_default"},
"datasets": [{"name": "workers_500k", "key_column": "worker_id", "key_value": "1"}]
}
NOT in this commit (future steps):
- Step 4: Wire gateway tool registry to write audit rows on every
candidate_id returned (uses SubjectAuditWriter from Step 2)
- Step 5: Wire validator WorkerLookup similarly
- Step 6: /audit/subject/{id} HTTP endpoint
- Step 7: Daily retention sweep
- Backfill the full 500K (operator decision: --all when ready;
note: 500K JSON files in one dir will slow startup load — may
want SQLite/single-file backend before that scale)
Operator note: backfill is run-once. To extend to candidates table,
re-run with --dataset candidates --key-column candidate_id (no prefix
since candidate_id is already the canonical token there).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>