Scrummed v1 across opus + kimi + gemini lineages via the new model fleet. 3/3 reviewers said 'I would NOT build v1 as written.'

4 convergent blockers, all resolved in v2:

1. Migration order wrong — backfill before validation creates dark database; if backfill bug, no production traffic catches it. v2 inserts BIPA-prereq Step 0 + shadow-write before backfill + shadow-read before cutover. 9-step migration with cryptographic attestation of completeness at quarantine.
2. Master key on disk + legal token static file = 'security theater' per all 3. v2: HashiCorp Vault Transit / AWS KMS for KEK (not sealed file). Legal token: split-secret short-lived JWT (max 24h), dual-control issuance (J + counsel both sign), revocable in <60s.
3. consent_status='inferred_existing' is BIPA prima facie violation (kimi+gemini explicit). v2 backfill uses 'pending_backfill_review'; biometric data NEVER backfilled — separate consent stream.
4. Healthcare default 'general' = HIPAA exposure window for every misclassified subject. v2 default 'unknown' with fail-closed routing (treat unknown as healthcare-equivalent until classified by manual review). Auto-escalation to healthcare on resume_text pattern match.

Plus 12 single-reviewer additions:

- mTLS mandatory between gateway↔identityd (kimi)
- External anchor for audit chain: S3 Object Lock 7-year compliance mode, hourly + on-event commits (all 3)
- Audit-log signing key separate from encryption KEK (opus)
- Field-level authorization via purpose_definitions table (kimi)
- Per-row encryption keys deferred to Phase 7 (kimi simplification)
- pii_access_log itself needs legal-tier read auth (opus)
- Synchronous cache invalidation pub/sub on RTBF (opus)
- Outbound NER pass for Langfuse defense-in-depth (opus TOCTOU)
- model_version_hash per decision row (gemini)
- /vertical minimal-disclosure endpoint (kimi HIPAA min-necessary)
- Auto-escalation healthcare on resume_text pattern (kimi)
- Rate limiting + token revocation list (opus)
- Oracle tests in audit_parity.sh (kimi SOC2 CC4.1)

Architecturally simplified per scrum:

- Per-row encryption keys deferred to Phase 7 (single DEK + HSM-wrapped KEK + ciphertext deletion is equivalent practical erasure with less complexity)
- PDF render deferred (JSON ships first)
- Training-safe export deferred (not critical path)

Estimated effort revised 8-10 → 12-15 days. Worth it — every addition was a 3/3-reviewer convergent finding.

Re-scrum recommended before implementation starts to verify v2 addresses the v1 blockers. No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Identity Service — Phase 2 Design (v2 — post-scrum revisions)
Status: Draft v2 — 2026-05-03 · Owner: J · Drafted by: working session 2026-05-03
Companion to: AUDIT_TRAIL_PRD.md, AUDIT_PHASE_1_DISCOVERY.md, AUDIT_PHASE_1_5_BIPA_AND_OUTCOMES.md
v2 history. v1 (2026-05-03 morning) was scrummed across opus + kimi + gemini lineages. 3/3 reviewers converged on 6 critical issues — wrong migration order, master key + legal token "security theater," BIPA inferred_existing prima facie violation, healthcare default in wrong direction, Merkle chain without external anchor. v2 incorporates those changes. Full scrum reviews preserved at /tmp/identity_scrum/{opus,kimi,gemini}_review.md. Diff between v1 and v2 captured in §11 change log.
Why this exists. Phase 1 + 1.5 confirmed today's substrate has no separation between candidate_id and PII. Both live in workers_500k.parquet. No per-access audit, no consent gate, no retention enforcement. This document specifies the new identity service that holds the candidate_id ↔ PII mapping, gates every PII read, audits every access, and serves as the single legal-attestable boundary between PII and the rest of the system.
Prerequisite: Phase 1.6 (BIPA pre-launch gates) MUST ship before any identityd backfill begins. Per kimi+gemini scrum, backfilling with consent_status='inferred_existing' is a BIPA §15 prima facie violation. Phase 1.6 establishes the consent template + retention schedule + employee training that turns backfill from "inferred" into "pending_backfill_review with documented escalation path."
1. Scope and non-goals
In scope
- Single source of truth for candidate_id ↔ PII mapping
- Per-PII-access audit log (who/what/when/why)
- Consent + retention metadata (BIPA + IL Day and Temporary Labor Services Act + healthcare PHI)
- Legal-tier access — short-lived JWT with split-secret dual-control issuance (NOT a static file token, per all 3 reviewers)
- Healthcare-vertical routing with fail-closed default (unknown treated as healthcare until classified, per opus+gemini)
- EU-compatible interface — fields exist, enforcement is per-subject (no system-wide flag flip needed)
- Hardware-backed master key — HashiCorp Vault Transit OR AWS KMS for v1 (NOT a sealed file, per all 3 reviewers; sealed-file v1 was "obfuscation in any defensible sense" — kimi)
- Signed-JSON audit response with PDF render path (PDF deferred; JSON ships first per opus)
- External-anchor for audit-log integrity (S3 Object Lock + signed timestamps; NOT just Postgres Merkle, per all 3 reviewers)
- mTLS or Unix Domain Socket between gateway ↔ identityd (per kimi — "port-isolated is theater without authenticated channel")
- Field-level authorization on /subjects/{id} (per-purpose field allowlists, per kimi)
- Synchronous cache invalidation hook for RTBF (per opus — even if Phase 7 is the full RTBF build)
Out of scope (for this phase)
- The /audit/subject/{id} endpoint itself (Phase 3)
- Full subject-tagging across other substrates (Phase 4)
- Right-to-be-forgotten implementation full-flow (Phase 7)
- Training-safe export (deferred from Phase 2 per opus — "not on critical path for audit-trail defense")
- BIPA pre-launch gates content (Phase 1.6 — separate doc)
Architecturally simplified from v1 (per scrum)
- Per-row encryption keys deferred to Phase 7. v2 uses a single Data Encryption Key (DEK) wrapped under HSM-backed Key Encryption Key (KEK), with per-subject ciphertext-deletion as the v2 erasure mechanism. This matches kimi's recommendation: "A single data-encryption key with HSM-backed rotation, plus per-subject deletion of ciphertext, achieves equivalent practical erasure with far less complexity." Cryptographic-erasure via per-subject keys becomes a Phase 7 enhancement.
- EU placeholder fields kept (per J 2026-05-03), but enforcement code is genuinely no-op until first eu_resident=true subject — no code paths run, no schema migrations needed when EU comes online (just per-subject field population).
- Merkle chaining narrowed (per gemini): full Merkle tree only for legal-tier and consent/erasure events. Standard gateway_lookup events get a simpler signed-HMAC chain. Both anchored externally.
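As an illustration of the simpler chain for standard events, the following is a minimal Python sketch (identityd itself is specified in Go). It assumes each row's integrity_hash is an HMAC-SHA256 over the previous link plus a canonical JSON encoding of the row; the GENESIS value, the separator, and the function names are hypothetical and not taken from this design.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed starting value for the first link


def chain_hash(key: bytes, prev_hash: str, row: dict) -> str:
    """HMAC-SHA256 over the previous link plus a canonical encoding of the row."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    msg = prev_hash.encode() + b"|" + canonical.encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def build_chain(key: bytes, rows: list[dict]) -> list[str]:
    """Return one integrity hash per row, each depending on all earlier rows."""
    hashes, prev = [], GENESIS
    for row in rows:
        prev = chain_hash(key, prev, row)
        hashes.append(prev)
    return hashes


def verify_chain(key: bytes, rows: list[dict], hashes: list[str]) -> bool:
    """Recompute the chain and compare link by link in constant time."""
    if len(rows) != len(hashes):
        return False
    prev = GENESIS
    for row, expected in zip(rows, hashes):
        prev = chain_hash(key, prev, row)
        if not hmac.compare_digest(prev, expected):
            return False
    return True
```

Because every link folds in the previous one, editing or removing an earlier row changes every later hash, which is what makes the periodically anchored value meaningful.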
2. Architectural shape
2.1 — Process model + transport
Per J 2026-05-03 + kimi scrum: separate daemon, mTLS-mandatory transport. Single Go implementation, both runtimes call it.
| Property | Value |
|---|---|
| Name | identityd |
| Port | :3225 (single port, not dual; per kimi simplification — runtime-agnostic routing) |
| Implementation | Go |
| Storage | Postgres in isolated database (per J answer, confirmed). Schema-level isolation; no shared schemas with Langfuse or other lakehouse storage. |
| Transport gateway↔identityd | mTLS mandatory. Self-signed CA managed by identityd at startup; gateway clients have their own client certs. Plain HTTP is rejected at the listener. |
| Encryption-at-rest | Single DEK wrapped under KEK in HashiCorp Vault Transit (recommended) OR AWS KMS. Master key NEVER on disk. v1 of identityd refuses to start if KEK is unreachable. |
| Audit-log signing | Separate Ed25519 keypair from the master encryption key. Signing key in Vault Transit (signing backend) OR a separate sealed-secret file with strict rotation procedure. |
| External anchor for audit chain | S3 Object Lock (compliance mode, 7-year retention) holding periodic Merkle root commitments. Daemon writes hourly + on legal-tier event. |
| Backup | Postgres standard backup; KEK backup separate (different storage, different access). Crypto-erasure model only works if these are not co-located. |
| Rate limiting | Per-token QPS + daily-volume caps. Default legal-token: 100 lookups/day + 10K daily volume; alerting at 50%. Configurable per-token. |
| Token revocation | token_revocations table checked on every legal-tier auth. Revocation propagates within 60s (TTL on cache). |
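The 60s revocation bound in the table follows from the cache TTL: a revoked token can be accepted only until the cached revocation set is next reloaded. A minimal Python sketch of that behaviour, assuming a hypothetical load_revoked callable that reads token_hash values from token_revocations (the class and parameter names are illustrative only):

```python
import time
from typing import Callable


class RevocationCache:
    """Caches the revoked token hashes and reloads them once the TTL has elapsed."""

    def __init__(
        self,
        load_revoked: Callable[[], set],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load_revoked
        self._ttl = ttl_seconds
        self._clock = clock
        self._revoked: set = set()
        self._loaded_at = None  # None means "never loaded"

    def is_revoked(self, token_hash: str) -> bool:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            # Stale or empty cache: reload before answering.
            self._revoked = set(self._load())
            self._loaded_at = now
        return token_hash in self._revoked
```

Injecting the clock keeps the TTL behaviour testable without sleeping. A shorter TTL tightens the revocation window at the cost of more reads against token_revocations.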
2.2 — Schema (Postgres DDL, v2)
-- Subject record. PII columns are ciphertext under the DEK.
CREATE TABLE subjects (
candidate_id TEXT PRIMARY KEY, -- UUID v7, NOT sequential (per kimi enumeration concern)
-- Encrypted PII fields. AES-256-GCM under DEK. NULL = not collected.
name_ct BYTEA,
email_ct BYTEA,
phone_ct BYTEA,
address_ct BYTEA,
ssn_ct BYTEA,
dob_ct BYTEA,
-- DEK version this subject was encrypted under. KEK rotates; DEK rotates per
-- KEK rotation. New DEK version = subjects re-encrypted in background sweep.
dek_version INT NOT NULL,
-- Lawful basis + consent metadata
consent_status TEXT NOT NULL CHECK (consent_status IN (
'pending_backfill_review', -- backfill default; no biometric/PHI use until reviewed
'pending_first_contact', -- new subject, awaiting consent UX
'given', -- explicit consent recorded
'withdrawn', -- subject revoked
'expired' -- consent timeout
)),
consent_version TEXT, -- references published consent template version
consent_given_at TIMESTAMPTZ,
consent_withdrawn_at TIMESTAMPTZ,
-- BIPA-specific fields (Phase 1.5 §1E + Phase 1.6 prerequisite)
biometric_consent_status TEXT NOT NULL DEFAULT 'never_collected' CHECK (biometric_consent_status IN (
'never_collected', 'pending', 'given', 'withdrawn', 'expired'
)),
biometric_retention_until TIMESTAMPTZ, -- BIPA: max 3 years from last interaction
-- Vertical detection — drives healthcare PHI routing.
-- DEFAULT 'unknown' (per opus+gemini scrum) — fail-closed routing treats
-- unknown as healthcare-equivalent until reclassified.
vertical TEXT NOT NULL DEFAULT 'unknown' CHECK (vertical IN (
'unknown', 'general', 'healthcare', 'finance', 'other'
)),
-- Per-vertical retention period. Drives the daily erasure sweep.
-- (Per opus: this was missing in v1. Required for BIPA's 3-year-from-
-- last-interaction rule which differs from generic retention.)
retention_period_days INT NOT NULL,
-- EU-placeholder fields. Enforcement is per-subject; nothing runs
-- until a row has eu_resident=true AND lawful_basis IS NOT NULL.
eu_resident BOOLEAN NOT NULL DEFAULT FALSE,
lawful_basis TEXT, -- GDPR Art. 6 basis when eu_resident=true
transfer_mechanism TEXT, -- SCC, DPF, BCR — populated when EU comes online
-- Standard audit columns
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
last_interaction TIMESTAMPTZ, -- drives retention sweep
-- Crypto-erasure state (v2: ciphertext deletion, NOT key destruction)
erased_at TIMESTAMPTZ,
erasure_reason TEXT
);
-- Append-only access log. EVERY PII read writes a row.
-- Field-level: fields_accessed records WHICH fields the caller resolved.
-- purpose enforced against an allowlist per-purpose-token.
CREATE TABLE pii_access_log (
access_id BIGSERIAL PRIMARY KEY,
candidate_id TEXT NOT NULL,
accessed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
accessor_kind TEXT NOT NULL, -- 'gateway_lookup' | 'audit_response' | 'legal_request' | 'system_resolve'
accessor_id TEXT NOT NULL, -- daemon name + caller token-hash; never raw token
purpose_token TEXT NOT NULL, -- opaque token; resolved through purpose_definitions table
fields_accessed TEXT[] NOT NULL,
request_trace_id TEXT,
chain_kind TEXT NOT NULL CHECK (chain_kind IN ('hmac', 'merkle')), -- HMAC for standard, Merkle for legal/consent/erasure
integrity_hash TEXT NOT NULL,
-- Per opus security finding: this row's existence is itself sensitive
-- ("candidate X is under legal review" leaks via purpose). Read access
-- to this table requires legal-tier auth or scoped per-subject auth.
is_legal_tier_event BOOLEAN NOT NULL DEFAULT FALSE
);
-- Purpose definitions — enforces field-level authorization (per kimi).
-- A given purpose_token can only request fields in its allowlist.
CREATE TABLE purpose_definitions (
purpose_token TEXT PRIMARY KEY, -- e.g. 'fill_validation', 'audit_subject_response'
description TEXT NOT NULL,
allowed_fields TEXT[] NOT NULL, -- e.g. ARRAY['name'] for fill_validation
auth_tier TEXT NOT NULL CHECK (auth_tier IN ('service', 'legal')),
rate_limit_qps INT,
daily_volume_cap INT
);
-- Token revocation list. Checked on every auth, cached 60s.
CREATE TABLE token_revocations (
token_hash TEXT PRIMARY KEY,
revoked_at TIMESTAMPTZ NOT NULL DEFAULT now(),
revoked_by TEXT NOT NULL,
reason TEXT
);
-- Consent template versioning. Hash references, not embedded text per row.
CREATE TABLE consent_versions (
version TEXT PRIMARY KEY,
effective_at TIMESTAMPTZ NOT NULL,
superseded_at TIMESTAMPTZ,
template_hash TEXT NOT NULL, -- SHA256 of the canonical template
template_path TEXT NOT NULL -- where the canonical text lives (e.g. data/_consent/v3-2026-04-15.md)
);
-- External anchor checkpoints. Periodically committed to S3 Object Lock.
CREATE TABLE anchor_commits (
commit_id BIGSERIAL PRIMARY KEY,
chain_kind TEXT NOT NULL, -- 'hmac' or 'merkle'
checkpoint_at TIMESTAMPTZ NOT NULL,
last_access_id BIGINT NOT NULL, -- max access_id covered by this checkpoint
root_hash TEXT NOT NULL,
s3_object_uri TEXT NOT NULL, -- s3://anchor-bucket/identityd/...
s3_lock_until TIMESTAMPTZ NOT NULL -- compliance-mode retention end
);
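To make the anchor_commits row concrete, here is a hedged Python sketch of how a checkpoint's root_hash could be derived from the integrity_hash values it covers. The leaf and node prefixes, the rule of carrying an unpaired node up unchanged, and the function names are assumptions for illustration; the design does not fix a tree construction.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> str:
    """Binary Merkle root with domain-separated leaf (0x00) and node (0x01) hashes."""
    if not leaves:
        raise ValueError("a checkpoint must cover at least one row")
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        nxt = [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # unpaired node is carried up unchanged
        level = nxt
    return "sha256:" + level[0].hex()


def checkpoint(chain_kind: str, rows: list[tuple]) -> dict:
    """rows: (access_id, integrity_hash) pairs in access_id order."""
    return {
        "chain_kind": chain_kind,
        "last_access_id": rows[-1][0],
        "root_hash": merkle_root([h.encode() for _, h in rows]),
    }
```

The returned dict mirrors three of the anchor_commits columns; the S3 object URI and lock expiry would be filled in after the commit is written to the Object Lock bucket.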
2.3 — HTTP surface (v2)
All under /v1/identity/. mTLS + token required on every endpoint.
| Method + Path | Purpose | Auth tier | Notes |
|---|---|---|---|
| POST /v1/identity/subjects | Create new subject. Body: PII fields + purpose_token. Returns: candidate_id (UUID v7). | service | Validates purpose_token allowlist matches submitted fields |
| GET /v1/identity/subjects/{candidate_id}?purpose={token}&fields={list} | Resolve PII for candidate. | service | Field-level enforcement: returned fields ⊆ purpose_token's allowed_fields. Mismatch → 403. Logs access row. |
| GET /v1/identity/subjects/{candidate_id}/vertical | Minimal-disclosure vertical lookup (per kimi HIPAA minimum-necessary). Returns only {vertical, consent_status}. | service | Used by gateway routing. Cheaper access-log row. |
| GET /v1/identity/subjects/{candidate_id}/full | Complete subject record + audit summary. | legal | Short-lived JWT only. Logged with is_legal_tier_event=true. Triggers real-time notification to designated counsel + J. |
| POST /v1/identity/subjects/{candidate_id}/consent | Record consent given/withdrawn with version. | service | |
| POST /v1/identity/subjects/{candidate_id}/erase | Crypto-erasure. Idempotent. | legal | Synchronously triggers cache invalidation hooks (per opus). |
| GET /v1/identity/access_log/{candidate_id} | Per-subject access log for audit response. | legal | |
| POST /v1/identity/auth/legal_token | Issue short-lived JWT (max 24h). Requires dual-control attestation: J's signed nonce + counsel's signed nonce, both verified against pre-registered public keys. | dual-control | Per all 3 reviewers — replaces v1's static file approach |
| GET /v1/identity/health | Liveness | none | Returns only {status: "ok"} — no version, no schema info |
| (no public training-safe-export endpoint in v2) | — | — | Deferred per opus — not on critical path |
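The field-level enforcement on the subject lookup reduces to a subset check against the purpose's allowlist. A minimal Python sketch, assuming the purpose_definitions rows have been loaded into a dict; the PURPOSES contents and the authorize_fields name are hypothetical:

```python
# Hypothetical in-memory copy of purpose_definitions: purpose_token -> allowed_fields.
PURPOSES = {
    "fill_validation": {"name"},
    "audit_subject_response": {"name", "email", "phone"},
}


def authorize_fields(purpose_token: str, requested: list, purposes: dict = PURPOSES) -> tuple:
    """Return (http_status, fields_to_return).

    Unknown purposes and any field outside the allowlist both yield 403,
    so a caller never receives a partial answer for an over-broad request.
    """
    allowed = purposes.get(purpose_token)
    if allowed is None:
        return 403, []
    if not set(requested) <= allowed:
        return 403, []
    return 200, sorted(set(requested))
```

For example, authorize_fields("fill_validation", ["name", "ssn"]) is rejected outright instead of being trimmed to ["name"], which matches the "Mismatch → 403" rule in the table.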
2.4 — Auth model (v2 — split-secret + short-lived)
Service-tier auth (gateway → identityd for routine PII resolution):
- mTLS client cert per gateway (Rust :3100 + Go :4110 each get their own cert)
- Bearer token in Authorization header (long-lived — 90d), rotated per ops runbook
Legal-tier auth (per-request, short-lived JWT, dual-control issuance):
- J holds private key A
- Designated outside counsel holds private key B
- POST /v1/identity/auth/legal_token requires:
  - HTTP body: {purpose, ttl_seconds, requested_fields, signature_a, signature_b, nonce_a, nonce_b}
  - identityd verifies both signatures against pre-registered A and B public keys
  - On success: emits a JWT signed by identityd, valid for ttl_seconds (max 24h), scoped to purpose + fields
- The JWT:
  - Cannot be issued without BOTH A and B signatures (per gemini "split-secret startup ceremony" pattern, applied per-token)
  - Carries scope (purpose + fields) inside its claims; identityd enforces at use time
  - Per-token rate limit + daily cap recorded on issuance
  - Revocation: token_hash added to token_revocations; identityd refuses on next check (60s cache TTL)
Why this matters:
- v1's "static file mode 0400" was security theater per all 3 reviewers
- A leaked legal token v1 = unbounded exfiltration window until rotation
- v2's leaked JWT = bounded by TTL (max 24h) AND revocable in <60s
- Single-actor compromise (J's machine OR counsel's machine, not both) cannot mint legal tokens
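The issuance rules above (both signatures required, TTL capped at 24h, scope carried in the claims) can be sketched as a small policy function. This Python sketch deliberately leaves out the cryptography: verify_a and verify_b are hypothetical callables standing in for signature verification against the pre-registered A and B public keys, and the returned claims would still need to be signed by identityd to become a JWT.

```python
import time
from typing import Callable, Optional

MAX_TTL_SECONDS = 24 * 60 * 60  # "max 24h"

# (message, signature) -> True if the signature is valid for that key holder.
Verifier = Callable[[bytes, bytes], bool]


def issue_legal_claims(
    req: dict,
    verify_a: Verifier,
    verify_b: Verifier,
    now: Optional[int] = None,
) -> dict:
    """Validate a legal_token request and return the claims to be signed."""
    ttl = int(req["ttl_seconds"])
    if not 0 < ttl <= MAX_TTL_SECONDS:
        raise ValueError("ttl_seconds must be between 1 and 86400")
    a_ok = verify_a(req["nonce_a"], req["signature_a"])
    b_ok = verify_b(req["nonce_b"], req["signature_b"])
    if not (a_ok and b_ok):
        # A single valid signature is never enough.
        raise PermissionError("dual-control attestation failed")
    issued_at = int(time.time()) if now is None else int(now)
    return {
        "purpose": req["purpose"],
        "fields": sorted(req["requested_fields"]),
        "iat": issued_at,
        "exp": issued_at + ttl,
    }
```

Keeping the TTL check ahead of signature verification means an over-long request is rejected even when both parties have signed it.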
3. Integration with the rest of the substrate
3.1 — Gateway changes
When gateway needs PII for a fill scenario:
- Gateway has only candidate_id (post-Phase-1.5 view-routing fix)
- Gateway calls GET /v1/identity/subjects/{candidate_id}/vertical first if it might be healthcare-vertical-sensitive routing
- Gateway calls GET /v1/identity/subjects/{candidate_id}?purpose=fill_validation&fields=name
- identityd validates: purpose_token=fill_validation allows [name] only; if request asks for [name,ssn] → 403
- identityd writes pii_access_log row, decrypts requested fields, returns
- Gateway uses fields, then explicitly zeros memory (kimi's drop-and-overwrite pattern) before returning HTTP response
- No PII caching in gateway memory beyond request lifetime
LRU embed cache (commit 150cc3b) currently keys by (model, text) where text contains PII. Re-keying it as (model, candidate_id, field_subset_hash) is a Phase 4 task. Phase 2 must still ship the synchronous cache-purge hook, so that an erase call can invalidate the re-keyed cache as soon as Phase 4 lands (opus finding).
3.2 — Cross-runtime: same identityd, both gateways call it
Both gateways call identityd over mTLS. Same endpoints, same auth model. New cross-runtime parity probe audit_parity.sh validates identical PII request through both gateways produces identical access-log rows (modulo daemon-name field).
3.3 — JSONL writer changes (subject_id top-level promotion)
Per AUDIT_PHASE_1_DISCOVERY §10/C5:
- All JSONL sinks (outcomes, sessions, overseer_corrections, observerd ops) gain top-level subject_ids: [...] field
- fills[*].name replaced with name_ref: "[REDACTED-{candidate_id}]"
- Authorized callers dereference via identityd
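A minimal Python sketch of the record rewrite described above, assuming each entry in fills already carries a candidate_id alongside its name (that record layout and the redact_record name are assumptions; the actual sink schemas are defined elsewhere):

```python
def redact_record(record: dict) -> dict:
    """Return a copy with fills[*].name replaced by name_ref and subject_ids promoted."""
    out = dict(record)
    subject_ids = []
    if "fills" in record:
        fills = []
        for fill in record["fills"]:
            fill = dict(fill)  # never mutate the caller's record
            candidate_id = fill["candidate_id"]
            fill.pop("name", None)
            fill["name_ref"] = f"[REDACTED-{candidate_id}]"
            fills.append(fill)
            if candidate_id not in subject_ids:
                subject_ids.append(candidate_id)
        out["fills"] = fills
    out["subject_ids"] = subject_ids
    return out
```

The top-level subject_ids list preserves first-seen order and removes duplicates, so downstream tooling can find every record about a subject without parsing fills.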
3.4 — Langfuse boundary redaction (defense in depth, per opus TOCTOU finding)
v1 design: per-request map of subject_id → resolved_PII, replace before Langfuse POST.
v2 adds (per opus): outbound regex/NER pass on the Langfuse payload as defense-in-depth. The model can hallucinate names not in the resolved-PII map. The outbound NER pass catches them.
Gateway constructs Langfuse payload
→ Pass 1: per-request resolved-PII map replacement
→ Pass 2: NER scan (regex for SSN/phone shapes; named-entity model for names)
→ Either pass detecting unredacted PII → drop the trace OR replace with [POSSIBLE-PII-DETECTED]
→ POST to Langfuse
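The two passes can be sketched in Python as follows. The regexes cover only US-style SSN and phone shapes, the find_names parameter is a hypothetical stand-in for the named-entity model, and this sketch takes the "replace" branch of the last step; dropping the trace would be the stricter alternative.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")
MARK = "[POSSIBLE-PII-DETECTED]"


def redact_for_langfuse(payload: str, resolved_pii: dict, find_names=lambda text: []) -> str:
    """Pass 1 replaces known PII; pass 2 masks anything that still looks like PII.

    resolved_pii maps candidate_id -> the PII string resolved for this request.
    find_names(text) is assumed to return name strings flagged by an NER model.
    """
    # Pass 1: per-request resolved-PII map replacement.
    for candidate_id, value in resolved_pii.items():
        if value:  # guard: replacing "" would insert the marker everywhere
            payload = payload.replace(value, f"[REDACTED-{candidate_id}]")
    # Pass 2: shape-based and NER-based scan for leftovers (e.g. hallucinated names).
    payload = SSN_RE.sub(MARK, payload)
    payload = PHONE_RE.sub(MARK, payload)
    for name in find_names(payload):
        if name:
            payload = payload.replace(name, MARK)
    return payload
```

Pass 2 runs on the output of pass 1, so a value already replaced by its reference is not flagged a second time.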
3.5 — Healthcare vertical routing — fail-closed default
Per opus + gemini scrum: Default is unknown → treated as healthcare for routing purposes.
When gateway receives a request involving any candidate:
- Query /vertical for the candidate
- If vertical IN ('healthcare', 'unknown') → route to on-box Ollama only (no opencode/openrouter/ollama_cloud egress)
- If vertical='general' (or other non-healthcare) → cloud routing OK
- If identityd unreachable → fail closed (refuse the request, return 503)
Reclassification path: subjects can be moved from unknown → general via explicit operator action (after manual review). Subjects flagged healthcare stay healthcare unless explicitly downgraded.
Auto-escalation (per kimi): if a subject's resume_text or call_log content is updated with healthcare-pattern matches (RN, BSN, hospital, MD, physician, etc.), vertical auto-escalates to healthcare. Never auto-de-escalates.
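The routing rule and the one-way escalation can be sketched together in Python. The route labels ("local_ollama", "cloud"), the use of None to signal an unreachable identityd, and the exact pattern list are assumptions for illustration; the real matcher would cover more terms than the handful named above.

```python
import re

# Credentials are matched case-sensitively, plain words case-insensitively.
HEALTHCARE_RE = re.compile(r"\b(?:RN|BSN|MD)\b|\b(?i:hospital|physician)\b")

LOCAL_ONLY = {"healthcare", "unknown"}


def route(vertical):
    """Return (http_status, backend). vertical=None means identityd was unreachable."""
    if vertical is None:
        return 503, None  # fail closed: refuse the request
    if vertical in LOCAL_ONLY:
        return 200, "local_ollama"  # no cloud egress
    return 200, "cloud"


def escalate(current_vertical: str, updated_text: str) -> str:
    """Escalate to healthcare on a pattern match; never de-escalate."""
    if HEALTHCARE_RE.search(updated_text):
        return "healthcare"
    return current_vertical
```

Because escalate only ever returns "healthcare" or the current value, a subject already classified as healthcare cannot be downgraded by a later text update; only the explicit operator action can do that.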
3.6 — Cache invalidation hook (Phase 2 must ship even though full RTBF is Phase 7)
When POST /v1/identity/subjects/{id}/erase fires:
- Mark subjects.erased_at and zero out ciphertext columns
- Write pii_access_log row with purpose_token='retention_expired' or 'rtbf_request'
- Synchronously publish a subject_erased event to a pub/sub channel (Redis or Postgres LISTEN/NOTIFY)
- Gateway subscribes; on event, purges any in-flight cache entries for that candidate_id
- Eventual-consistency window: <5s between erase-call and cache-flush
Without this hook, RTBF in Phase 7 would be a lie because the gateway's LRU embed cache could still hold the subject's data.
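On the gateway side, the purge is only practical if cache keys carry the candidate_id. The following Python sketch assumes the Phase 4 key shape (model, candidate_id, field_subset_hash) and shows the handler a subscriber would call on a subject_erased event; the class, the truncated hash, and the omission of the Redis or LISTEN/NOTIFY transport are all simplifications.

```python
import hashlib
from collections import OrderedDict


def field_subset_hash(fields) -> str:
    """Order-independent digest of the field names used to build the embedding."""
    return hashlib.sha256(",".join(sorted(fields)).encode()).hexdigest()[:16]


class EmbedCache:
    """Small LRU cache keyed by (model, candidate_id, field_subset_hash)."""

    def __init__(self, capacity: int = 1024) -> None:
        self._capacity = capacity
        self._items = OrderedDict()

    def put(self, model: str, candidate_id: str, fields, embedding) -> None:
        key = (model, candidate_id, field_subset_hash(fields))
        self._items[key] = embedding
        self._items.move_to_end(key)
        while len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict least recently used

    def get(self, model: str, candidate_id: str, fields):
        key = (model, candidate_id, field_subset_hash(fields))
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def on_subject_erased(self, candidate_id: str) -> int:
        """Handler for a subject_erased event; returns how many entries were purged."""
        doomed = [k for k in self._items if k[1] == candidate_id]
        for key in doomed:
            del self._items[key]
        return len(doomed)
```

With the current (model, text) keying there is no way to find a subject's entries without scanning the text itself, which is why the re-key and the purge hook belong together.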
4. Audit response shape (Phase 3 preview, v2 — adds model-version snapshot per gemini)
{
"schema": "audit.subject.v1",
"subject_token": "01926f2e-7c1b-7000-...", // UUID v7, NOT CAND-NNNNNN
"request_window": { "from": "2026-01-01", "to": "2026-05-03" },
"generated_at": "2026-05-03T12:00:00Z",
"generated_by": "identityd@hostname",
"merkle_root": "sha256:...", // root hash of legal/consent/erasure events
"external_anchor": "s3://anchor-bucket/identityd/2026-05-03T12-00.json", // S3 Object Lock URI
"signature": "ed25519:...", // separate signing key from encryption KEK
"consent": { ... },
"decisions": [
{
"ts": "2026-04-22T09:15:23Z",
"decision_kind": "fill_recommendation",
"daemon": "gateway",
"model": "kimi-k2.6",
"model_version_hash": "sha256:...", // NEW per gemini — proves what model existed AT decision time
"provider": "ollama_cloud",
"trace_id": "trace-abc",
"session_id": "session-xyz",
"input_features": { ... }, // sanitized; no protected attributes; no inferred-attribute proxies
"output": "...",
"rationale": "...",
"comparator_pool_size": 47,
"comparator_appendix_ref": "see comparator_appendix.A"
}
],
"comparator_appendix": {
"A": {
"scope": "fill scenarios in window matching role X, geo Y",
"total_pool_size": 47,
"selection_rate_by_protected_class": {
// Aggregated; NO other subjects' identifiers leak
"race": { "white": 0.33, "black": 0.31, ... },
"gender": { "man": 0.34, "woman": 0.30, ... }
},
"four_fifths_test": "passed" // or "concern: rate ratio 0.71"
}
},
"access_log": [...],
"footer": {
"completeness_attestation": "all decisions about subject_token in window per retention policy v2 are included",
"merkle_proof": "...", // proof this audit's root is in S3 anchor
"what_was_excluded": "decisions older than 4 years (retention expired) — count: 0"
}
}
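The four_fifths_test value in the comparator appendix can be derived from the aggregated selection rates alone. A short Python sketch, using the usual four-fifths rule of comparing the lowest group rate to the highest against a 0.8 threshold; the function name and the two-decimal formatting are assumptions modelled on the example strings in the response shape.

```python
def four_fifths_test(rates: dict, threshold: float = 0.8) -> str:
    """Compare the lowest group selection rate to the highest one."""
    if not rates:
        raise ValueError("no selection rates supplied")
    highest = max(rates.values())
    if highest <= 0:
        raise ValueError("highest selection rate must be positive")
    ratio = min(rates.values()) / highest
    if ratio >= threshold:
        return "passed"
    return f"concern: rate ratio {ratio:.2f}"
```

For instance, rates of 0.33 and 0.31 give a ratio of about 0.94 and pass, while 0.34 against 0.24 gives about 0.71 and is reported as a concern. Only aggregated rates enter the calculation, so no other subject's identifier is needed.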
5. Migration path (v2 — REORDERED per all 3 reviewers)
v1's order was wrong. New order, with explicit prerequisites:
| Step | Action | Prerequisite |
|---|---|---|
| 0 | Phase 1.6 BIPA pre-launch gates SHIPPED — consent template published, retention schedule public, deletion procedure documented, employee training acknowledged. | (separate work — see PHASE_1_6_BIPA_GATES.md when written) |
| 1 | Stand up identityd with synthetic test subjects only. KEK in vault, mTLS live, all endpoints serve. | Vault/KMS available; mTLS CA bootstrapped |
| 2 | Audit-parity probe (audit_parity.sh) green on synthetic data. Cross-runtime equivalence verified. | Step 1 |
| 3 | Add gateway feature flag LH_USE_IDENTITY_SERVICE. Shadow-write only — gateway writes new subjects to identityd AND continues SQL path. No reads from identityd yet. | Step 2 |
| 4 | Run shadow-write for ≥1 week. Validate access logs, encryption correctness, write-path performance under real traffic. | Step 3 |
| 5 | Backfill from workers_500k.parquet. consent_status='pending_backfill_review' for ALL existing rows. vertical='unknown' default (NOT 'general'). biometric_consent_status='never_collected' — backfill does NOT include any biometric data. | Step 0 (BIPA gates) + Step 4 |
| 6 | Human-review queue for vertical reclassification. Subjects move from unknown to general only after explicit review. Healthcare-pattern matches auto-escalate to healthcare (never auto-downgrade). | Step 5 |
| 7 | Shadow-read — gateway reads from BOTH SQL and identityd, compares, logs divergences, returns SQL result. Run ≥1 week. | Step 5 |
| 8 | Feature-flag cutover — gateway reads from identityd, falls back to SQL on error, alerts on every fallback. | Step 7 |
| 9 | Quarantine PII columns in workers_500k.parquet — only after cryptographic attestation of identityd completeness (Merkle proof: source row hash = identityd row hash for every candidate_id). Move PII columns to a different bucket; the candidate_id-only projection becomes operational. | Step 8 |
Key v2 changes from v1:
- New Step 0 (BIPA gates prerequisite) — backfill cannot start without this
- Steps 3-4 (shadow-write before backfill) — production validates the WRITE path before data lands
- Step 5 backfill consent: pending_backfill_review not inferred_existing (BIPA defense)
- Step 5 vertical default: unknown not general (HIPAA fail-closed)
- New Step 6 (human review) — vertical classification via explicit operator action
- Step 7 added (shadow-read after backfill) — catches encoding/normalization bugs before cutover
- Step 9 (quarantine) requires cryptographic attestation of completeness
Each step is its own commit, its own gate, its own rollback path.
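Step 7's shadow-read can be sketched as a wrapper that always answers from SQL and treats identityd purely as a comparison target. In this Python sketch read_sql, read_identityd, and log_divergence are hypothetical callables, and the choice to log only the names of differing fields (never their values) is an assumption made to keep PII out of the divergence log.

```python
def shadow_read(candidate_id, read_sql, read_identityd, log_divergence) -> dict:
    """Read from both stores, log any mismatch, and always return the SQL result."""
    sql_row = read_sql(candidate_id)
    try:
        shadow_row = read_identityd(candidate_id)
    except Exception as exc:  # the shadow path must never break the live path
        log_divergence(candidate_id, ["<error: %s>" % type(exc).__name__])
        return sql_row
    differing = sorted(
        field
        for field in set(sql_row) | set(shadow_row)
        if sql_row.get(field) != shadow_row.get(field)
    )
    if differing:
        log_divergence(candidate_id, differing)
    return sql_row
```

A divergence log that stays empty for the full soak week is the evidence Step 8 relies on before the cutover flag is flipped.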
6. Cross-runtime parity probe (NEW, per scrum §6 plus kimi 'oracle test' addition)
audit_parity.sh ships in Phase 5. Asserts:
- Same PII fetch through Rust gateway and Go gateway produces identical identityd access-log rows (modulo daemon name)
- Crypto-erasure of test subject through Rust gateway is honored when Go gateway tries to fetch
- Healthcare-vertical routing decision identical across both runtimes
- Oracle test (per kimi SOC2 CC4.1 finding): probe with known-good inputs against expected outputs. A bug present in BOTH implementations must not pass the parity probe.
- Discrimination-proxy phrase redaction triggers as expected on adversarial test cases
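The difference between a parity check and an oracle check is easy to show in a few lines. This Python sketch uses hypothetical implementations passed in as functions; audit_parity.sh itself would drive the two real gateways instead.

```python
def parity_and_oracle(cases, impl_a, impl_b) -> list:
    """cases: (input, expected_output) pairs with known-good expectations.

    Returns a list of (kind, input) failures, where kind is "parity"
    (the two implementations disagree) or "oracle" (at least one of
    them disagrees with the expected output).
    """
    failures = []
    for value, expected in cases:
        out_a, out_b = impl_a(value), impl_b(value)
        if out_a != out_b:
            failures.append(("parity", value))
        if out_a != expected or out_b != expected:
            failures.append(("oracle", value))
    return failures
```

If both implementations share the same bug, for example both routing an unknown vertical to the cloud, the parity comparison alone reports nothing, while the oracle comparison against the expected local-only route flags it.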
7. What this design intentionally does NOT solve
- Protected-attribute exclusion at decision time (Phase 6)
- Pathway memory trace body redaction (Phase 4)
- Retroactive Langfuse history scrub (separate runbook)
- GDPR Art. 22 right to explanation full implementation (Phase 8)
- Multi-region data residency (single-region US-Midwest by default)
- Training-safe export (deferred — not on critical path per opus)
8. Open questions — RESOLVED 2026-05-03
J confirmed all 6 v1 recommendations + scrum-driven changes:
- ✅ Master key: HashiCorp Vault Transit (recommended) OR AWS KMS. NOT a sealed file. v1's sealed-file recommendation is rejected per all 3 reviewers ("obfuscation in any defensible sense").
- ✅ Postgres isolation: identityd's own database, isolated schema.
- ✅ Vertical backfill: 'unknown' default with fail-closed routing. NOT 'general' per opus+gemini scrum.
- ✅ Legal token: Split-secret dual-control issuance (J + counsel both sign). Short-lived JWT, max 24h, revocable in <60s.
- ✅ Crypto-erasure sweep: Daily 03:00 UTC. (Note: v2 erasure mechanism is ciphertext deletion, not key destruction. Per-row keys deferred to Phase 7.)
- ✅ EU enforcement: Per-subject. Schema fields exist; nothing runs until first eu_resident=true subject.
Newly resolved per scrum
- ✅ Migration order: REORDERED per §5 above. New Step 0 (BIPA prerequisite). Shadow-write before backfill. Shadow-read before cutover.
- ✅ Audit-log external anchor: S3 Object Lock with compliance-mode 7-year retention. Hourly + on-event commits.
- ✅ Audit-log signing key: Separate Ed25519 keypair from KEK. Vault Transit signing backend OR sealed-secret with strict rotation runbook.
- ✅ mTLS gateway↔identityd: Mandatory. Self-signed CA managed by identityd at startup.
- ✅ Per-row encryption keys: Deferred to Phase 7. v2 uses single DEK with HSM-wrapped KEK + ciphertext deletion for erasure.
- ✅ Field-level authorization: purpose_definitions table enforces per-purpose field allowlists.
- ✅ Synchronous cache invalidation: pub/sub event on erase; gateway subscribes.
- ✅ Outbound NER pass for Langfuse: Defense-in-depth in addition to symbol-table replacement.
- ✅ Model version hash in audit response: Captured per decision row.
- ✅ PDF render: Deferred from Phase 2; JSON ships first.
9. Estimated implementation cost (revised v2)
| Sub-phase | Effort | Notes |
|---|---|---|
| 2A — Postgres schema + Vault/KMS integration | 1.5-2 days | Includes mTLS CA bootstrap + signing-key separation |
| 2B — identityd HTTP surface (Go) | 2-3 days | All endpoints, auth, dual-control JWT issuance, rate limiting, revocation |
| 2C — Backfill ETL (BIPA-compliant — pending_backfill_review) | 1 day | Plus Merkle attestation of completeness |
| 2D — Gateway integration (Rust + Go, mTLS, shadow-write phase) | 2-3 days | Per-tool migration, parity probe |
| 2E — JSONL writer changes (subject_id top-level promotion) | 1 day | All sinks |
| 2F — Langfuse redaction (symbol-table + NER) | 2 days | Two passes + drop-on-detect |
| 2G — Healthcare-vertical routing (fail-closed) | 0.5 day | Plus auto-escalation pattern matcher |
| 2H — Cache invalidation pub/sub hook | 0.5 day | Critical for Phase 7 RTBF |
| 2I — External anchor (S3 Object Lock) | 1 day | Hourly + on-event commits |
| 2J — Cross-runtime parity probe with oracle tests | 1 day | New probe |
| Total | ~12-15 working days | (v1 estimated 8-10; v2 added BIPA gates dependency, mTLS, dual-control JWT, NER pass, anchor, etc.) |
Bigger than v1. Worth it — every addition was a 3/3-reviewer convergent finding.
10. The four "would not build" blockers from scrum (all addressed in v2)
| # | v1 issue (3/3 reviewers) | v2 resolution |
|---|---|---|
| 1 | Migration order (backfill before validation) | §5 reordered; Step 0 BIPA gates prereq added |
| 2 | Master key on disk + legal token static file | §2.1 Vault/KMS for KEK; §2.4 split-secret short-lived JWT |
| 3 | inferred_existing BIPA prima facie violation | §5 Step 5 uses pending_backfill_review; Step 0 gates required first |
| 4 | Healthcare default general (HIPAA exposure window) | §3.5 fail-closed unknown-as-healthcare; §5 Step 5 backfill default unknown |
All three reviewers said "I would not build v1 as written." All four blockers are resolved in v2. Re-scrum recommended before implementation starts.
11. Change log (v1 → v2)
| Section | v1 | v2 |
|---|---|---|
| Master key storage | Sealed file /etc/lakehouse/identityd_master.key | HashiCorp Vault Transit / AWS KMS |
| Legal token | Static file mode 0400 | Split-secret short-lived JWT (max 24h), dual-control issuance |
| Encryption keys | Per-row keys with master wrapping | Single DEK + HSM-wrapped KEK + ciphertext deletion (per-row deferred to Phase 7) |
| Healthcare default | vertical='general' backfill | vertical='unknown' backfill, fail-closed routing |
| Migration §5 order | 1: stand up, 2: backfill, 3: feature flag, 4: cutover, 5: quarantine | 0: BIPA gates, 1: stand up, 2: parity probe, 3: shadow-write, 4: shadow-write soak, 5: backfill (pending_backfill_review), 6: human vertical review, 7: shadow-read, 8: cutover, 9: quarantine with cryptographic attestation |
| Audit-log integrity | Postgres Merkle chain only | Postgres Merkle for legal/consent/erasure + HMAC for standard + S3 Object Lock external anchor |
| Audit-log signing key | Same as encryption key | Separate Ed25519 keypair; Vault signing backend |
| Gateway↔identityd transport | Plain HTTP (implied) | mTLS mandatory |
| Field authorization | Per-endpoint only | Per-purpose-token field allowlists |
| Cache invalidation | Phase 4 / Phase 7 | Phase 2 ships pub/sub hook |
| Langfuse redaction | Symbol-table replacement only | Symbol-table + outbound NER pass + drop-on-detect |
| Audit response | No model version | model_version_hash per decision |
| PDF render | Phase 2 | Deferred — JSON ships first |
| Training-safe export | Phase 2 | Deferred — not critical path |
| Effort estimate | 8-10 days | 12-15 days |
Change log
- 2026-05-03 — v1 initial draft.
- 2026-05-03 — v2 post-scrum: 3/3 reviewer convergent findings folded in. 4 "would not build" blockers all resolved. Re-scrum before implementation recommended.