catalogd: Step 3 — backfill_subjects binary (BIPA-defensible defaults)
Implementation of docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 3.
Reads a parquet source, creates one SubjectManifest per row with the
spec-defined safe defaults, persists via Registry::put_subject().
Defaults baked in (per spec §2 + §5 Step 5):
- vertical = unknown (HIPAA fail-closed)
- consent.general_pii = pending_backfill_review (NOT inferred_existing — BIPA defense)
- consent.biometric = never_collected (no biometric data backfilled)
- retention.general_pii_until = now + 4 years
- retention.policy = "4_year_default"
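The defaults above can be sketched roughly as follows. This is an illustrative std-only model: the struct and field names mirror the sample manifest, not the real SubjectManifest type, and the retention math here ignores leap days.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Illustrative only: field names mirror the sample manifest, but this
// struct is NOT the real SubjectManifest type from the spec.
#[derive(Debug, PartialEq)]
struct BackfillDefaults {
    vertical: &'static str,
    consent_general_pii: &'static str,
    consent_biometric: &'static str,
    retention_policy: &'static str,
    retention_until_secs: u64, // unix seconds; the binary stores RFC 3339 strings
}

// 4 years as seconds, ignoring leap days; the real retention math may differ.
const FOUR_YEARS_SECS: u64 = 4 * 365 * 24 * 60 * 60;

fn backfill_defaults(now: SystemTime) -> BackfillDefaults {
    let now_secs = now.duration_since(UNIX_EPOCH).unwrap().as_secs();
    BackfillDefaults {
        vertical: "unknown",                            // HIPAA fail-closed
        consent_general_pii: "pending_backfill_review", // BIPA-defensible
        consent_biometric: "never_collected",           // no biometric backfill
        retention_policy: "4_year_default",
        retention_until_secs: now_secs + FOUR_YEARS_SECS,
    }
}

fn main() {
    println!("{:?}", backfill_defaults(SystemTime::now()));
}
```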
Conservative ergonomics:
- --limit 1000 by default; pass --all to process the full source.
- --dry-run for parse + count + sample without writes.
- --concurrency 32 (bounded via tokio::sync::Semaphore).
- Idempotent: skips subjects that already exist in catalog.
- Progress reports every ~5% (or 5K rows, whichever is smaller).
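The bounded-write pattern behind --concurrency can be sketched with std primitives. The binary itself uses tokio::sync::Semaphore with async tasks; this thread-based semaphore is a self-contained stand-in showing the same shape (at most `limit` writes in flight), and all names here are illustrative.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Minimal counting semaphore; tokio::sync::Semaphore plays this role
// in the real binary.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Self { permits: Mutex::new(n), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Run `tasks` fake writes bounded to `limit` in flight; returns the
/// observed peak concurrency.
fn run_bounded(tasks: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let gauge = Arc::new(Mutex::new((0usize, 0usize))); // (current, peak)
    let handles: Vec<_> = (0..tasks)
        .map(|_| {
            let (sem, gauge) = (Arc::clone(&sem), Arc::clone(&gauge));
            thread::spawn(move || {
                sem.acquire();
                {
                    let mut g = gauge.lock().unwrap();
                    g.0 += 1;
                    g.1 = g.1.max(g.0);
                }
                thread::sleep(Duration::from_millis(1)); // stand-in for put_subject()
                gauge.lock().unwrap().0 -= 1;
                sem.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let peak = gauge.lock().unwrap().1;
    peak
}

fn main() {
    println!("peak in-flight: {}", run_bounded(200, 32));
}
```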
Live verification on workers_500k.parquet:
--limit 100 dry-run: parsed 100 rows, sampled WORKER-1..5, 0 writes ✓
--limit 100 commit: 100 inserted, 0 failed, 100 files in
data/_catalog/subjects/ ✓
--limit 100 re-run: 0 inserted, 100 skipped (idempotent) ✓
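The idempotent re-run behavior verified above reduces to a membership check against the catalog. A simplified model (the real binary consults the Registry, not an in-memory set; names here are hypothetical):

```rust
use std::collections::HashSet;

/// Simplified idempotency model: rows whose subject already exists in the
/// catalog are skipped; everything else is inserted.
fn plan_backfill(rows: &[String], existing: &HashSet<String>) -> (Vec<String>, usize) {
    let mut to_insert = Vec::new();
    let mut skipped = 0;
    for id in rows {
        if existing.contains(id) {
            skipped += 1;
        } else {
            to_insert.push(id.clone());
        }
    }
    (to_insert, skipped)
}

fn main() {
    let rows: Vec<String> = (1..=5).map(|i| format!("WORKER-{i}")).collect();
    // First run: empty catalog, everything inserts.
    let (first, skipped) = plan_backfill(&rows, &HashSet::new());
    println!("run 1: {} inserted, {} skipped", first.len(), skipped); // 5 inserted, 0 skipped
    // Second run: the catalog now holds every ID, so nothing inserts.
    let existing: HashSet<String> = first.into_iter().collect();
    let (second, skipped) = plan_backfill(&rows, &existing);
    println!("run 2: {} inserted, {} skipped", second.len(), skipped); // 0 inserted, 5 skipped
}
```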
Sample manifest (data/_catalog/subjects/WORKER-1.json):
{
"schema": "subject_manifest.v1",
"candidate_id": "WORKER-1",
"status": "active",
"vertical": "unknown",
"consent": {
"general_pii": {"status": "pending_backfill_review", ...},
"biometric": {"status": "never_collected", ...}
},
"retention": {"general_pii_until": "2030-05-02T...", "policy": "4_year_default"},
"datasets": [{"name": "workers_500k", "key_column": "worker_id", "key_value": "1"}]
}
NOT in this commit (future steps):
- Step 4: Wire gateway tool registry to write audit rows on every
candidate_id returned (uses SubjectAuditWriter from Step 2)
- Step 5: Wire validator WorkerLookup similarly
- Step 6: /audit/subject/{id} HTTP endpoint
- Step 7: Daily retention sweep
- Backfill the full 500K (operator decision: --all when ready;
note: 500K JSON files in one dir will slow startup load — may
want SQLite/single-file backend before that scale)
Operator note: backfill is run-once. To extend to candidates table,
re-run with --dataset candidates --key-column candidate_id (no prefix
since candidate_id is already the canonical token there).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>