identity service: v2 — fold cross-lineage scrum findings + 4 'would not build' blocker fixes

Scrummed v1 across opus + kimi + gemini lineages via the new model fleet. 3/3 reviewers said 'I would NOT build v1 as written.'

4 convergent blockers, all resolved in v2:

1. Migration order wrong: backfill before validation creates dark database; if backfill bug, no production traffic catches it. v2 inserts BIPA-prereq Step 0 + shadow-write before backfill + shadow-read before cutover. 9-step migration with cryptographic attestation of completeness at quarantine.
2. Master key on disk + legal token static file = 'security theater' per all 3. v2: HashiCorp Vault Transit / AWS KMS for KEK (not sealed file). Legal token: split-secret short-lived JWT (max 24h), dual-control issuance (J + counsel both sign), revocable in <60s.
3. consent_status='inferred_existing' is BIPA prima facie violation (kimi+gemini explicit). v2 backfill uses 'pending_backfill_review'; biometric data NEVER backfilled (separate consent stream).
4. Healthcare default 'general' = HIPAA exposure window for every misclassified subject. v2 default 'unknown' with fail-closed routing (treat unknown as healthcare-equivalent until classified by manual review). Auto-escalation to healthcare on resume_text pattern match.

Plus 12 single-reviewer additions:

- mTLS mandatory between gateway↔identityd (kimi)
- External anchor for audit chain: S3 Object Lock 7-year compliance mode, hourly + on-event commits (all 3)
- Audit-log signing key separate from encryption KEK (opus)
- Field-level authorization via purpose_definitions table (kimi)
- Per-row encryption keys deferred to Phase 7 (kimi simplification)
- pii_access_log itself needs legal-tier read auth (opus)
- Synchronous cache invalidation pub/sub on RTBF (opus)
- Outbound NER pass for Langfuse defense-in-depth (opus TOCTOU)
- model_version_hash per decision row (gemini)
- /vertical minimal-disclosure endpoint (kimi HIPAA min-necessary)
- Auto-escalation healthcare on resume_text pattern (kimi)
- Rate limiting + token revocation list (opus)
- Oracle tests in audit_parity.sh (kimi SOC2 CC4.1)

Architecturally simplified per scrum:

- Per-row encryption keys deferred to Phase 7 (single DEK + HSM-wrapped KEK + ciphertext deletion is equivalent practical erasure with less complexity)
- PDF render deferred (JSON ships first)
- Training-safe export deferred (not critical path)

Estimated effort revised 8-10 → 12-15 days. Worth it: every addition was a 3/3-reviewer convergent finding.

Re-scrum recommended before implementation starts to verify v2 addresses the v1 blockers.

No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 01:36:07 -05:00 · 298fadce41
commit 298fadce41
parent 565ea4b32a
1 changed files with 334 additions and 218 deletions
--- a/docs/IDENTITY_SERVICE_DESIGN.md
+++ b/docs/IDENTITY_SERVICE_DESIGN.md
@@ -1,11 +1,13 @@
-# Identity Service — Phase 2 Design
+# Identity Service — Phase 2 Design (v2 — post-scrum revisions)

-**Status:** Draft — 2026-05-03 · **Owner:** J · **Drafted by:** working session 2026-05-03
+**Status:** Draft v2 — 2026-05-03 · **Owner:** J · **Drafted by:** working session 2026-05-03
 **Companion to:** [`AUDIT_TRAIL_PRD.md`](AUDIT_TRAIL_PRD.md), [`AUDIT_PHASE_1_DISCOVERY.md`](AUDIT_PHASE_1_DISCOVERY.md), [`AUDIT_PHASE_1_5_BIPA_AND_OUTCOMES.md`](AUDIT_PHASE_1_5_BIPA_AND_OUTCOMES.md)

-> **Why this exists.** Phase 1 + 1.5 confirmed that today's substrate has no separation between candidate_id (the canonical token) and PII (name, email, phone, address). Both live in `workers_500k.parquet`. There is no per-access audit. There is no consent gate. There is no retention enforcement. This document specifies the new identity service that will hold the candidate_id ↔ PII mapping, gate every PII read, audit every access, and serve as the single legal-attestable boundary between PII and the rest of the system.
->
-> **Confirmed by J 2026-05-03:** separate daemon (option A in §10.1), signed JSON with PDF render for legal export, legal-only auth credential separate from admin token.
+> **v2 history.** v1 (2026-05-03 morning) was scrummed across opus + kimi + gemini lineages. 3/3 reviewers converged on 6 critical issues — wrong migration order, master key + legal token "security theater," BIPA `inferred_existing` prima facie violation, healthcare default in wrong direction, Merkle chain without external anchor. v2 incorporates those changes. Full scrum reviews preserved at `/tmp/identity_scrum/{opus,kimi,gemini}_review.md`. Diff between v1 and v2 captured in §11 change log.
+
+> **Why this exists.** Phase 1 + 1.5 confirmed today's substrate has no separation between candidate_id and PII. Both live in `workers_500k.parquet`. No per-access audit, no consent gate, no retention enforcement. This document specifies the new identity service that holds the candidate_id ↔ PII mapping, gates every PII read, audits every access, and serves as the single legal-attestable boundary between PII and the rest of the system.
+
+> **Prerequisite:** **Phase 1.6 (BIPA pre-launch gates) MUST ship before any identityd backfill begins.** Per kimi+gemini scrum, backfilling with `consent_status='inferred_existing'` is a BIPA §15 prima facie violation. Phase 1.6 establishes the consent template + retention schedule + employee training that turns backfill from "inferred" into "pending_backfill_review with documented escalation path."

 ---

@@ -14,151 +16,208 @@
 ### In scope
 - Single source of truth for `candidate_id ↔ PII` mapping
 - Per-PII-access audit log (who/what/when/why)
- Consent + retention metadata (BIPA + Day and Temporary Labor Services Act + healthcare PHI)
- Legal-only access credential, separate from admin tokens
- Healthcare-vertical detection at gateway boundary (per J 2026-05-03 answer 10)
- EU-compatible interface (placeholder fields, lawful-basis tracking, SCC-ready slots — but NOT enforced this phase per J)
- Training-safe export interface (per J 2026-05-03 answer 11)
- Signed-JSON audit response with PDF render path
+- Consent + retention metadata (BIPA + IL Day and Temporary Labor Services Act + healthcare PHI)
+- Legal-tier access — short-lived JWT with **split-secret dual-control issuance** (NOT a static file token, per all 3 reviewers)
+- Healthcare-vertical routing with **fail-closed default** (`unknown` treated as healthcare until classified, per opus+gemini)
+- EU-compatible interface — fields exist, enforcement is per-subject (no system-wide flag flip needed)
+- Hardware-backed master key — **HashiCorp Vault Transit OR AWS KMS** for v1 (NOT a sealed file, per all 3 reviewers; sealed-file v1 was "obfuscation in any defensible sense" — kimi)
+- Signed-JSON audit response with PDF render path (PDF deferred; JSON ships first per opus)
+- External-anchor for audit-log integrity (S3 Object Lock + signed timestamps; NOT just Postgres Merkle, per all 3 reviewers)
+- mTLS or Unix Domain Socket between gateway ↔ identityd (per kimi — "port-isolated is theater without authenticated channel")
+- Field-level authorization on `/subjects/{id}` (per-purpose field allowlists, per kimi)
+- Synchronous cache invalidation hook for RTBF (per opus — even if Phase 7 is the full RTBF build)

 ### Out of scope (for this phase)
 - The `/audit/subject/{id}` endpoint itself (Phase 3)
- Subject-tagging across other substrates (Phase 4)
- Right-to-be-forgotten implementation (Phase 7)
- BIPA pre-launch gates — those are Phase 1.6, ahead of this phase
+- Full subject-tagging across other substrates (Phase 4)
+- Right-to-be-forgotten implementation full-flow (Phase 7)
+- Training-safe export (deferred from Phase 2 per opus — "not on critical path for audit-trail defense")
+- BIPA pre-launch gates content (Phase 1.6 — separate doc)
+
+### Architecturally simplified from v1 (per scrum)
+- **Per-row encryption keys** deferred to Phase 7. v2 uses a single Data Encryption Key (DEK) wrapped under HSM-backed Key Encryption Key (KEK), with per-subject **ciphertext-deletion** as the v2 erasure mechanism. This matches kimi's recommendation: "A single data-encryption key with HSM-backed rotation, plus per-subject deletion of ciphertext, achieves equivalent practical erasure with far less complexity." Cryptographic-erasure via per-subject keys becomes a Phase 7 enhancement.
+- **EU placeholder fields** kept (per J 2026-05-03), but enforcement code is genuinely no-op until first eu_resident=true subject — no code paths run, no schema migrations needed when EU comes online (just per-subject field population).
+- **Merkle chaining** narrowed (per gemini): full Merkle tree only for legal-tier and consent/erasure events. Standard `gateway_lookup` events get a simpler signed-HMAC chain. Both anchored externally.

 ---

 ## 2. Architectural shape

-### 2.1 — Process model: separate daemon
+### 2.1 — Process model + transport

-Per J's confirmation (2026-05-03), the identity service runs as its own daemon, port-isolated from the gateway. Rationale:
-
- **Single attestable boundary** for legal/audit. "All PII access flows through identityd. Show me the identityd access log" is one query, one daemon.
- **Independent restart** — a gateway crash doesn't take down identity, and an identity panic doesn't break unrelated reads.
- **Distinct credential surface** — identityd's auth model is wholly separate from gateway's. The legal-only credential exists only in identityd, not in the gateway's JWT issuer.
- **Cross-runtime parity** — both Rust and Go gateway call identityd over HTTP. There is ONE identity implementation.
+Per J 2026-05-03 + kimi scrum: separate daemon, mTLS-mandatory transport. Single Go implementation, both runtimes call it.

 | Property | Value |
 |---|---|
 | Name | `identityd` |
-| Port | `:3225` (Rust legacy line — picks a port adjacent to validatord :3221) and `:4225` (Go line) |
-| Implementation language | **Go** — single implementation, both runtimes call it via HTTP. Avoids re-implementing the audit-log writer + retention sweeper twice. |
-| Storage | Postgres (separate database from any other lakehouse storage). Deployed alongside Langfuse's Postgres or its own; either way, isolated schema with its own grants. |
-| Encryption | Per-row symmetric encryption (AES-256-GCM) of PII columns. Master key in a vault (HashiCorp Vault, AWS KMS, or a sealed-secret file at `/etc/lakehouse/identityd_master.key` for now). Keys are NEVER logged. |
-| Backup | Standard Postgres backup; keys backed up separately to different storage tier (the cryptographic-erasure model in `AUDIT_TRAIL_PRD.md` §6 only works if the encrypted-blob backup and the key-backup are not co-located). |
+| Port | `:3225` (single port, not dual; per kimi simplification — runtime-agnostic routing) |
+| Implementation | Go |
+| Storage | Postgres in **isolated database** (per J answer, confirmed). Schema-level isolation; no shared schemas with Langfuse or other lakehouse storage. |
+| Transport gateway↔identityd | **mTLS mandatory.** Self-signed CA managed by identityd at startup; gateway clients have their own client certs. Plain HTTP is rejected at the listener. |
+| Encryption-at-rest | **Single DEK wrapped under KEK in HashiCorp Vault Transit** (recommended) OR AWS KMS. Master key NEVER on disk. v1 of identityd refuses to start if KEK is unreachable. |
+| Audit-log signing | **Separate Ed25519 keypair from the master encryption key.** Signing key in Vault Transit (signing backend) OR a separate sealed-secret file with strict rotation procedure. |
+| External anchor for audit chain | **S3 Object Lock** (compliance mode, 7-year retention) holding periodic Merkle root commitments. Daemon writes hourly + on legal-tier event. |
+| Backup | Postgres standard backup; KEK backup separate (different storage, different access). Crypto-erasure model only works if these are not co-located. |
+| Rate limiting | Per-token QPS + daily-volume caps. Default legal-token: 100 lookups/day + 10K daily volume; alerting at 50%. Configurable per-token. |
+| Token revocation | `token_revocations` table checked on every legal-tier auth. Revocation propagates within 60s (TTL on cache). |

-### 2.2 — Schema (Postgres DDL sketch)
+### 2.2 — Schema (Postgres DDL, v2)

 ```sql
-- Single source of truth for the candidate_id ↔ PII mapping
-- Every PII column is stored as ciphertext; keys per row enable
-- per-subject crypto-erasure for RTBF.
+-- Subject record. PII columns are ciphertext under the DEK.
 CREATE TABLE subjects (
-    candidate_id    TEXT PRIMARY KEY,           -- canonical token, e.g. "CAND-000001"
-    -- Encrypted PII fields. Each is AES-256-GCM with subject_key_id below.
-    -- Plaintext is NEVER stored. NULL means "not collected" not "absent."
+    candidate_id    TEXT PRIMARY KEY,           -- UUID v7, NOT sequential (per kimi enumeration concern)
+    -- Encrypted PII fields. AES-256-GCM under DEK. NULL = not collected.
    name_ct         BYTEA,
    email_ct        BYTEA,
    phone_ct        BYTEA,
    address_ct      BYTEA,
    ssn_ct          BYTEA,
    dob_ct          BYTEA,
-    -- Per-subject encryption key id. Crypto-erasure path: destroy this key
-    -- and the ciphertext is unrecoverable, even with the master key.
-    subject_key_id  TEXT NOT NULL,
+    -- DEK version this subject was encrypted under. KEK rotates; DEK rotates per
+    -- KEK rotation. New DEK version = subjects re-encrypted in background sweep.
+    dek_version     INT NOT NULL,
    -- Lawful basis + consent metadata
-    consent_status            TEXT NOT NULL,    -- 'pending' | 'given' | 'withdrawn' | 'expired'
+    consent_status  TEXT NOT NULL CHECK (consent_status IN (
+        'pending_backfill_review',  -- backfill default; no biometric/PHI use until reviewed
+        'pending_first_contact',     -- new subject, awaiting consent UX
+        'given',                     -- explicit consent recorded
+        'withdrawn',                 -- subject revoked
+        'expired'                    -- consent timeout
+    )),
    consent_version           TEXT,             -- references published consent template version
    consent_given_at          TIMESTAMPTZ,
    consent_withdrawn_at      TIMESTAMPTZ,
-    -- BIPA-specific fields (per Phase 1.5 §1E)
-    biometric_consent_status  TEXT,             -- separate from general PII consent
+    -- BIPA-specific fields (Phase 1.5 §1E + Phase 1.6 prerequisite)
+    biometric_consent_status  TEXT NOT NULL DEFAULT 'never_collected' CHECK (biometric_consent_status IN (
+        'never_collected', 'pending', 'given', 'withdrawn', 'expired'
+    )),
    biometric_retention_until TIMESTAMPTZ,      -- BIPA: max 3 years from last interaction
-    -- Vertical detection — drives healthcare PHI routing (per J answer 10)
-    vertical                  TEXT,             -- 'general' | 'healthcare' | 'finance' | 'other'
-    -- EU-placeholder fields (per J answer 9 — present, not enforced)
-    eu_resident       BOOLEAN DEFAULT FALSE,
-    lawful_basis      TEXT,                     -- GDPR Art. 6 basis if eu_resident=true
+    -- Vertical detection — drives healthcare PHI routing.
+    -- DEFAULT 'unknown' (per opus+gemini scrum) — fail-closed routing treats
+    -- unknown as healthcare-equivalent until reclassified.
+    vertical                  TEXT NOT NULL DEFAULT 'unknown' CHECK (vertical IN (
+        'unknown', 'general', 'healthcare', 'finance', 'other'
+    )),
+    -- Per-vertical retention period. Drives the daily erasure sweep.
+    -- (Per opus: this was missing in v1. Required for BIPA's 3-year-from-
+    -- last-interaction rule which differs from generic retention.)
+    retention_period_days     INT NOT NULL,
+    -- EU-placeholder fields. Enforcement is per-subject; nothing runs
+    -- until a row has eu_resident=true AND lawful_basis IS NOT NULL.
+    eu_resident       BOOLEAN NOT NULL DEFAULT FALSE,
+    lawful_basis      TEXT,                     -- GDPR Art. 6 basis when eu_resident=true
    transfer_mechanism TEXT,                    -- SCC, DPF, BCR — populated when EU comes online
    -- Standard audit columns
    created_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
    last_interaction  TIMESTAMPTZ,              -- drives retention sweep
-    -- RTBF state
-    erased_at         TIMESTAMPTZ,              -- set when crypto-erasure executed
-    erasure_reason    TEXT                      -- 'rtbf_request' | 'retention_expired' | 'consent_withdrawn'
+    -- Crypto-erasure state (v2: ciphertext deletion, NOT key destruction)
+    erased_at         TIMESTAMPTZ,
+    erasure_reason    TEXT
 );

-- Append-only access audit. EVERY PII read writes a row here.
+-- Append-only access log. EVERY PII read writes a row.
+-- Field-level: fields_accessed records WHICH fields the caller resolved.
+-- purpose enforced against an allowlist per-purpose-token.
 CREATE TABLE pii_access_log (
    access_id       BIGSERIAL PRIMARY KEY,
    candidate_id    TEXT NOT NULL,
    accessed_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    accessor_kind   TEXT NOT NULL,              -- 'gateway_lookup' | 'audit_response' | 'legal_request' | 'system_resolve'
-    accessor_id     TEXT NOT NULL,              -- daemon name + caller token-hash, never raw token
-    purpose         TEXT NOT NULL,              -- 'fill_validation' | 'audit_subject_response' | 'admin' | 'legal_audit_DDDDDD'
-    fields_accessed TEXT[] NOT NULL,            -- ['name', 'email'] etc.
-    request_trace_id TEXT,                      -- ties to Langfuse trace + sessions.jsonl
-    integrity_hash  TEXT NOT NULL               -- chain hash for tamper-evidence (this row's hash includes prev row's hash)
+    accessor_id     TEXT NOT NULL,              -- daemon name + caller token-hash; never raw token
+    purpose_token   TEXT NOT NULL,              -- opaque token; resolved through purpose_definitions table
+    fields_accessed TEXT[] NOT NULL,
+    request_trace_id TEXT,
+    chain_kind      TEXT NOT NULL CHECK (chain_kind IN ('hmac', 'merkle')),  -- HMAC for standard, Merkle for legal/consent/erasure
+    integrity_hash  TEXT NOT NULL,
+    -- Per opus security finding: this row's existence is itself sensitive
+    -- ("candidate X is under legal review" leaks via purpose). Read access
+    -- to this table requires legal-tier auth or scoped per-subject auth.
+    is_legal_tier_event BOOLEAN NOT NULL DEFAULT FALSE
 );

-- Cryptographic chain for the access log — Merkle-style. Per kimi
-- single-reviewer flag: chain of custody under FRE 901.
-- Each row's integrity_hash = SHA256(prev_hash || row_payload).
-- Last hash periodically committed to a tamper-evident store
-- (could be a separate append-only file with timestamp signing).
-
-- Per-subject keys table — crypto-erasure target.
-- Destroying a row here makes the corresponding subjects.*_ct unreadable.
-CREATE TABLE subject_keys (
-    subject_key_id  TEXT PRIMARY KEY,
-    candidate_id    TEXT NOT NULL,
-    key_material    BYTEA NOT NULL,             -- AES-256 key, encrypted under master key
-    created_at      TIMESTAMPTZ NOT NULL,
-    destroyed_at    TIMESTAMPTZ,                -- crypto-erasure marker
-    destroyed_reason TEXT
+-- Purpose definitions — enforces field-level authorization (per kimi).
+-- A given purpose_token can only request fields in its allowlist.
+CREATE TABLE purpose_definitions (
+    purpose_token   TEXT PRIMARY KEY,           -- e.g. 'fill_validation', 'audit_subject_response'
+    description     TEXT NOT NULL,
+    allowed_fields  TEXT[] NOT NULL,            -- e.g. ARRAY['name'] for fill_validation
+    auth_tier       TEXT NOT NULL CHECK (auth_tier IN ('service', 'legal')),
+    rate_limit_qps  INT,
+    daily_volume_cap INT
 );

-- Consent template versioning — BIPA + GDPR + CCPA compliance evidence
+-- Token revocation list. Checked on every auth, cached 60s.
+CREATE TABLE token_revocations (
+    token_hash      TEXT PRIMARY KEY,
+    revoked_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
+    revoked_by      TEXT NOT NULL,
+    reason          TEXT
+);
+
+-- Consent template versioning. Hash references, not embedded text per row.
 CREATE TABLE consent_versions (
    version         TEXT PRIMARY KEY,
    effective_at    TIMESTAMPTZ NOT NULL,
    superseded_at   TIMESTAMPTZ,
-    template_text   TEXT NOT NULL,
-    biometric_section TEXT,                    -- BIPA-specific clause
-    healthcare_section TEXT,                   -- HIPAA-specific clause
-    eu_section      TEXT                       -- GDPR-specific clause (placeholder)
+    template_hash   TEXT NOT NULL,              -- SHA256 of the canonical template
+    template_path   TEXT NOT NULL               -- where the canonical text lives (e.g. data/_consent/v3-2026-04-15.md)
+);
+
+-- External anchor checkpoints. Periodically committed to S3 Object Lock.
+CREATE TABLE anchor_commits (
+    commit_id       BIGSERIAL PRIMARY KEY,
+    chain_kind      TEXT NOT NULL,              -- 'hmac' or 'merkle'
+    checkpoint_at   TIMESTAMPTZ NOT NULL,
+    last_access_id  BIGINT NOT NULL,            -- max access_id covered by this checkpoint
+    root_hash       TEXT NOT NULL,
+    s3_object_uri   TEXT NOT NULL,              -- s3://anchor-bucket/identityd/...
+    s3_lock_until   TIMESTAMPTZ NOT NULL        -- compliance-mode retention end
 );
 ```
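
The `integrity_hash` column for `chain_kind='hmac'` rows can be pictured with a small Go sketch. It is illustrative only and assumes each row's hash is HMAC-SHA256 over the previous row's hash plus a serialized row payload. The payload encoding, the zero-byte separator, and the names `chainHash`, `buildChain`, and `verifyChain` are all hypothetical; the design does not specify them, and the key here is a placeholder rather than the Vault-held signing key.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chainHash returns HMAC-SHA256(key, prevHash || 0x00 || payload) as hex.
// The separator byte keeps ("ab","c") and ("a","bc") from colliding.
func chainHash(key []byte, prevHash, payload string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(prevHash))
	mac.Write([]byte{0})
	mac.Write([]byte(payload))
	return hex.EncodeToString(mac.Sum(nil))
}

// buildChain returns one integrity hash per payload, each chained to the last.
func buildChain(key []byte, payloads []string) []string {
	hashes := make([]string, 0, len(payloads))
	prev := ""
	for _, p := range payloads {
		prev = chainHash(key, prev, p)
		hashes = append(hashes, prev)
	}
	return hashes
}

// verifyChain recomputes the chain and returns the index of the first row
// that does not match, or -1 when every row verifies.
func verifyChain(key []byte, payloads, hashes []string) int {
	prev := ""
	for i, p := range payloads {
		if i >= len(hashes) {
			return i
		}
		want := chainHash(key, prev, p)
		if !hmac.Equal([]byte(want), []byte(hashes[i])) {
			return i
		}
		prev = want
	}
	if len(hashes) > len(payloads) {
		return len(payloads)
	}
	return -1
}

func main() {
	key := []byte("demo-signing-key") // placeholder, not a real key source
	rows := []string{"access_id=1|name", "access_id=2|name,phone", "access_id=3|name"}
	hashes := buildChain(key, rows)
	fmt.Println("intact:", verifyChain(key, rows, hashes))

	rows[1] = "access_id=2|name" // simulate an edited row
	fmt.Println("after tampering:", verifyChain(key, rows, hashes))
}
```

Under this scheme the last hash in the chain is the natural value to record as `root_hash` in `anchor_commits` for the `hmac` chain, though the design leaves that detail open.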

-### 2.3 — HTTP surface
+### 2.3 — HTTP surface (v2)

-Identityd exposes a small HTTP surface, all under `/v1/identity/`:
+All under `/v1/identity/`. mTLS + token required on every endpoint.

-| Method + Path | Purpose | Auth |
-|---|---|---|
-| `POST /v1/identity/subjects` | Create a new subject. Body: PII fields. Returns: candidate_id (server-generated, NOT sequential — UUID v7 to avoid the kimi enumeration risk). | Gateway/admin token |
-| `GET /v1/identity/subjects/{candidate_id}` | Resolve PII for a candidate. Returns: requested fields only. EVERY call writes a `pii_access_log` row. | **Service-tier auth** — gateway can call but body indicates accessor purpose |
-| `GET /v1/identity/subjects/{candidate_id}/full` | Return the complete subject record including consent + retention metadata + audit summary. **Legal-only credential.** Used by audit-response endpoint. | **Legal-only token** — separate credential, separate rotation |
-| `POST /v1/identity/subjects/{candidate_id}/consent` | Record consent given/withdrawn, with version. | Gateway/admin token |
-| `POST /v1/identity/subjects/{candidate_id}/erase` | Crypto-erasure: destroy the subject_key_id, mark erased_at. Idempotent. | Legal-only token |
-| `GET /v1/identity/access_log/{candidate_id}` | Return the per-subject access log for audit response. | Legal-only token |
-| `POST /v1/identity/training_safe_export` | Returns identifier-stripped + name-redacted projection of subjects suitable for RAG/training. Logs a "system_resolve" access log row marking export. | Admin token + explicit purpose flag |
-| `GET /v1/identity/health` | Liveness | None |
+| Method + Path | Purpose | Auth tier | Notes |
+|---|---|---|---|
+| `POST /v1/identity/subjects` | Create new subject. Body: PII fields + `purpose_token`. Returns: candidate_id (UUID v7). | service | Validates purpose_token allowlist matches submitted fields |
+| `GET /v1/identity/subjects/{candidate_id}?purpose={token}&fields={list}` | Resolve PII for candidate. | service | **Field-level enforcement**: returned fields ⊆ purpose_token's allowed_fields. Mismatch → 403. Logs access row. |
+| `GET /v1/identity/subjects/{candidate_id}/vertical` | **Minimal-disclosure** vertical lookup (per kimi HIPAA minimum-necessary). Returns only `{vertical, consent_status}`. | service | Used by gateway routing. Cheaper access-log row. |
+| `GET /v1/identity/subjects/{candidate_id}/full` | Complete subject record + audit summary. | **legal** | Short-lived JWT only. Logged with `is_legal_tier_event=true`. Triggers real-time notification to designated counsel + J. |
+| `POST /v1/identity/subjects/{candidate_id}/consent` | Record consent given/withdrawn with version. | service | |
+| `POST /v1/identity/subjects/{candidate_id}/erase` | Crypto-erasure. Idempotent. | **legal** | Synchronously triggers cache invalidation hooks (per opus). |
+| `GET /v1/identity/access_log/{candidate_id}` | Per-subject access log for audit response. | **legal** | |
+| `POST /v1/identity/auth/legal_token` | Issue short-lived JWT (max 24h). Requires **dual-control attestation**: J's signed nonce + counsel's signed nonce, both verified against pre-registered public keys. | dual-control | Per all 3 reviewers — replaces v1's static file approach |
+| `GET /v1/identity/health` | Liveness | none | Returns only `{status: "ok"}` — no version, no schema info |
+| `(no public training-safe-export endpoint in v2)` | — | — | Deferred per opus — not on critical path |

-### 2.4 — Auth model: legal-only credential
+### 2.4 — Auth model (v2 — split-secret + short-lived)

-The legal-only credential is materially different from the admin/gateway token:
+**Service-tier auth** (gateway → identityd for routine PII resolution):
+- mTLS client cert per gateway (Rust :3100 + Go :4110 each get their own cert)
+- Bearer token in `Authorization` header (long-lived — 90d), rotated per ops runbook

- Stored at `/etc/lakehouse/identityd_legal.token` (mode 0400, owner-only)
- Loaded by identityd at startup via systemd `EnvironmentFile`
- Never logged, never returned in any API response, never crossed with gateway tokens
- Rotation: separate runbook. Triggered by counsel request OR scheduled annually.
- The token's existence is documented in the privacy policy ("legal access requires a separate operator-issued credential, audited per access").
+**Legal-tier auth** (per-request, short-lived JWT, dual-control issuance):
+- J holds private key A
+- Designated outside counsel holds private key B
+- `POST /v1/identity/auth/legal_token` requires:
+  - HTTP body: `{purpose, ttl_seconds, requested_fields, signature_a, signature_b, nonce_a, nonce_b}`
+  - identityd verifies both signatures against pre-registered A and B public keys
+  - On success: emits a JWT signed by identityd, valid for `ttl_seconds` (max 24h), scoped to `purpose`+`fields`
+- The JWT:
+  - Cannot be issued without BOTH A and B signatures (per gemini "split-secret startup ceremony" pattern, applied per-token)
+  - Carries scope (purpose + fields) inside its claims; identityd enforces at use time
+  - Per-token rate limit + daily cap recorded on issuance
+  - Revocation: token_hash added to `token_revocations`; identityd refuses on next check (60s cache TTL)

-Service-tier auth (gateway-issued) and legal-tier auth (operator-issued) are orthogonal — a request must present the legal token to hit `/full`, `/erase`, or `/access_log`. Even an admin token does not unlock those.
+**Why this matters:**
+- v1's "static file mode 0400" was security theater per all 3 reviewers
+- A leaked legal token v1 = unbounded exfiltration window until rotation
+- v2's leaked JWT = bounded by TTL (max 24h) AND revocable in <60s
+- Single-actor compromise (J's machine OR counsel's machine, not both) cannot mint legal tokens
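
The dual-control check behind `POST /v1/identity/auth/legal_token` might look roughly like the Go sketch below. It assumes Ed25519 keys for A and B and that each party signs its own raw nonce; the design does not fix the signature scheme or what exactly is signed. `LegalTokenRequest`, `authorizeLegalToken`, `demoKeypair`, and `signNonce` are hypothetical names. JWT minting, nonce replay tracking, and scope claims are left out.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"errors"
	"fmt"
	"time"
)

const maxLegalTTL = 24 * time.Hour // "max 24h" from the design

var (
	ErrDualControl = errors.New("dual-control attestation failed")
	ErrTTL         = errors.New("ttl_seconds outside (0, 24h]")
)

// LegalTokenRequest mirrors a subset of the request body fields.
type LegalTokenRequest struct {
	Purpose    string
	TTLSeconds int
	NonceA     []byte
	NonceB     []byte
	SignatureA []byte
	SignatureB []byte
}

// authorizeLegalToken returns the granted TTL only when BOTH signatures
// verify against the pre-registered public keys and the TTL is in range.
func authorizeLegalToken(req LegalTokenRequest, pubA, pubB ed25519.PublicKey) (time.Duration, error) {
	ttl := time.Duration(req.TTLSeconds) * time.Second
	if ttl <= 0 || ttl > maxLegalTTL {
		return 0, ErrTTL
	}
	if len(req.NonceA) == 0 || len(req.NonceB) == 0 {
		return 0, ErrDualControl
	}
	okA := ed25519.Verify(pubA, req.NonceA, req.SignatureA)
	okB := ed25519.Verify(pubB, req.NonceB, req.SignatureB)
	if !okA || !okB {
		return 0, ErrDualControl
	}
	return ttl, nil
}

// demoKeypair generates a throwaway keypair for the example.
func demoKeypair() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

// signNonce draws a random 32-byte nonce and signs it with priv.
func signNonce(priv ed25519.PrivateKey) (nonce, sig []byte) {
	nonce = make([]byte, 32)
	if _, err := rand.Read(nonce); err != nil {
		panic(err)
	}
	return nonce, ed25519.Sign(priv, nonce)
}

func main() {
	pubA, privA := demoKeypair() // stands in for J's key A
	pubB, privB := demoKeypair() // stands in for counsel's key B
	nonceA, sigA := signNonce(privA)
	nonceB, sigB := signNonce(privB)

	req := LegalTokenRequest{
		Purpose: "legal_audit", TTLSeconds: 3600,
		NonceA: nonceA, NonceB: nonceB, SignatureA: sigA, SignatureB: sigB,
	}
	ttl, err := authorizeLegalToken(req, pubA, pubB)
	fmt.Println("both signatures:", ttl, err)

	req.SignatureB = sigA // only one actor actually signed
	_, err = authorizeLegalToken(req, pubA, pubB)
	fmt.Println("single actor:", err)
}
```

A production version would likely bind the signatures to the purpose and requested fields as well as the nonce, so a signed nonce cannot be reused for a broader scope. That binding is not specified in the design.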

 ---

@@ -166,201 +225,258 @@ Service-tier auth (gateway-issued) and legal-tier auth (operator-issued) are ort

 ### 3.1 — Gateway changes

-When the gateway needs PII for a fill scenario (today this happens by SQL JOIN on `candidates`/`workers_500k`), the new flow is:
+When gateway needs PII for a fill scenario:

-1. Gateway has only `candidate_id` (post-§2 view-routing fix from `AUDIT_PHASE_1_DISCOVERY` §9)
-2. Gateway calls `GET /v1/identity/subjects/{candidate_id}` with `purpose=fill_validation` and `fields=[name]` (or `[name,phone]` etc.)
-3. Identityd writes pii_access_log row, decrypts the requested fields, returns them
-4. Gateway uses the fields for the validator + tool result, then immediately drops them from in-memory storage after the request completes (no caching)
+1. Gateway has only `candidate_id` (post-Phase-1.5 view-routing fix)
+2. Gateway first calls `GET /v1/identity/subjects/{candidate_id}/vertical` whenever routing may depend on the subject's vertical (healthcare-sensitive routing)
+3. Gateway calls `GET /v1/identity/subjects/{candidate_id}?purpose=fill_validation&fields=name`
+4. identityd validates: purpose_token=`fill_validation` allows `[name]` only; if request asks for `[name,ssn]` → 403
+5. identityd writes `pii_access_log` row, decrypts requested fields, returns
+6. Gateway uses fields, then **explicitly zeros memory** (kimi's drop-and-overwrite pattern) before returning HTTP response
+7. **No PII caching** in gateway memory beyond request lifetime

-**Critical:** the LRU embed cache (commit `150cc3b`) currently keys by `(model, text)` where text contains PII. Post-identity-service, the cache keying must change to `(model, candidate_id, field_subset_hash)` so the cache key itself is not PII-bearing. This is a Phase 4 task tracked separately.
+**LRU embed cache** (commit `150cc3b`) currently keys by `(model, text)` where text contains PII. Phase 4 task to re-key as `(model, candidate_id, field_subset_hash)`. **Phase 2 must add the synchronous cache-purge hook** now, so that an erasure invalidates the re-keyed cache as soon as the Phase 4 change ships (opus finding).
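
The field-level check in step 4 of the flow above comes down to a subset test. A minimal Go sketch follows, assuming the `purpose_definitions` rows are already loaded into memory; the `purposeAllowlist` map and `authorizeFields` are hypothetical stand-ins for whatever identityd actually uses.

```go
package main

import (
	"fmt"
	"net/http"
)

// purposeAllowlist stands in for rows of the purpose_definitions table:
// purpose_token -> allowed_fields.
var purposeAllowlist = map[string][]string{
	"fill_validation": {"name"},
}

// authorizeFields returns 200 when every requested field is in the
// purpose's allowlist, and 403 for an unknown purpose or any extra field.
func authorizeFields(purposeToken string, requested []string) int {
	allowed, ok := purposeAllowlist[purposeToken]
	if !ok {
		return http.StatusForbidden
	}
	allowedSet := make(map[string]struct{}, len(allowed))
	for _, f := range allowed {
		allowedSet[f] = struct{}{}
	}
	for _, f := range requested {
		if _, ok := allowedSet[f]; !ok {
			return http.StatusForbidden
		}
	}
	return http.StatusOK
}

func main() {
	fmt.Println(authorizeFields("fill_validation", []string{"name"}))        // 200
	fmt.Println(authorizeFields("fill_validation", []string{"name", "ssn"})) // 403
}
```

Rejecting the whole request instead of silently trimming it to the allowed fields matches the "Mismatch → 403" rule in the endpoint table.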

-### 3.2 — Rust legacy + Go rewrite both call the same identityd
+### 3.2 — Cross-runtime: same identityd, both gateways call it

-Both gateways (Rust :3100 and Go :4110) call identityd over HTTP. Same endpoints, same auth model. New cross-runtime parity probe `audit_parity.sh` validates that an identical PII request through both gateways produces identical identityd access-log rows (modulo daemon-name field).
+Both gateways call identityd over mTLS. Same endpoints, same auth model. New cross-runtime parity probe `audit_parity.sh` validates that an identical PII request through both gateways produces identical access-log rows (modulo the daemon-name field).

-### 3.3 — outcomes.jsonl + sessions.jsonl writer changes
+### 3.3 — JSONL writer changes (subject_id top-level promotion)

-Per `AUDIT_PHASE_1_DISCOVERY` §10/C5 (subject_id top-level promotion):
+Per `AUDIT_PHASE_1_DISCOVERY` §10/C5:
+- All JSONL sinks (outcomes, sessions, overseer_corrections, observerd ops) gain top-level `subject_ids: [...]` field
+- `fills[*].name` replaced with `name_ref: "[REDACTED-{candidate_id}]"`
+- Authorized callers dereference via identityd

- Change `outcomes.jsonl` writer to:
-  - Add top-level `subject_ids: ["CAND-000001", "CAND-000456"]` field listing every candidate referenced
-  - Strip `name` from `fills[*]` rows; replace with `name_ref: "[REDACTED-{candidate_id}]"` token
-  - Authorized callers dereference the token via identityd
+### 3.4 — Langfuse boundary redaction (defense in depth, per opus TOCTOU finding)

- Same for sessions.jsonl SessionRecord: add `subject_ids` top-level field.
+v1 design: per-request map of `subject_id → resolved_PII`, replace before Langfuse POST.

- Same for overseer_corrections.jsonl.
+**v2 adds** (per opus): outbound regex/NER pass on the Langfuse payload as defense-in-depth. The model can hallucinate names not in the resolved-PII map. The outbound NER pass catches them.

- Same for observerd ops.jsonl when written.
+```
+Gateway constructs Langfuse payload
+  → Pass 1: per-request resolved-PII map replacement
+  → Pass 2: NER scan (regex for SSN/phone shapes; named-entity model for names)
+  → Either pass detecting unredacted PII → drop the trace OR replace with [POSSIBLE-PII-DETECTED]
+  → POST to Langfuse
+```
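
The two passes above could be sketched in Go as follows. This is illustrative only: the regexes cover just one SSN shape and one US phone shape, the named-entity model for hallucinated names is not included, and `redact`, `ssnShape`, and `phoneShape` are hypothetical names. It also takes the "replace" branch where the design allows either replacing or dropping the trace.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Shape patterns for Pass 2. Deliberately narrow; real coverage would be wider.
var (
	ssnShape   = regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`)
	phoneShape = regexp.MustCompile(`\b\d{3}[-.]\d{3}[-.]\d{4}\b`)
)

// redact applies Pass 1 (resolved-PII map replacement) and then Pass 2
// (shape scan) to an outbound payload. resolved maps candidate_id to the
// PII strings the gateway resolved during this request. Overlapping values
// (one a substring of another) are not handled in this sketch.
func redact(payload string, resolved map[string][]string) string {
	for candidateID, values := range resolved {
		for _, v := range values {
			if v == "" {
				continue
			}
			payload = strings.ReplaceAll(payload, v, "[REDACTED-"+candidateID+"]")
		}
	}
	payload = ssnShape.ReplaceAllString(payload, "[POSSIBLE-PII-DETECTED]")
	payload = phoneShape.ReplaceAllString(payload, "[POSSIBLE-PII-DETECTED]")
	return payload
}

func main() {
	resolved := map[string][]string{"cand-A": {"Jane Doe"}}
	in := "Call Jane Doe at 312-555-0147, SSN 123-45-6789"
	fmt.Println(redact(in, resolved))
}
```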

-This makes every JSONL sink subject-queryable by `subject_id` directly, without grepping natural language.
+### 3.5 — Healthcare vertical routing — fail-closed default

-### 3.4 — Langfuse boundary redaction (per scrum priority C2)
+**Per opus + gemini scrum: Default is `unknown` → treated as healthcare for routing purposes.**

-Before the gateway POSTs a chat trace to Langfuse, identityd is consulted to map any PII-shaped substrings in the message array back to candidate_id tokens. Implementation:
+When gateway receives a request involving any candidate:
+1. Query `/vertical` for the candidate
+2. If `vertical IN ('healthcare', 'unknown')` → route to **on-box Ollama only** (no opencode/openrouter/ollama_cloud egress)
+3. If `vertical='general'` (or other non-healthcare) → cloud routing OK
+4. If identityd unreachable → fail closed (refuse the request, return 503)
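+
+A minimal sketch of the four rules above. `VerticalLookup` stands in for the call to identityd's `/vertical` endpoint, and `Decide`, `Route`, and `errUnreachable` are hypothetical names. Treating an empty vertical as local-only is an assumption of this sketch, not something the rules state:
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+	"net/http"
+)
+
+type Route int
+
+const (
+	RouteLocalOnly Route = iota // on-box Ollama only
+	RouteCloudOK
+)
+
+// VerticalLookup stands in for the gateway's call to identityd /vertical.
+type VerticalLookup func(candidateID string) (string, error)
+
+var errUnreachable = errors.New("identityd unreachable")
+
+// Decide returns the route and an HTTP status; any status other than 200
+// means the gateway refuses the request.
+func Decide(lookup VerticalLookup, candidateID string) (Route, int) {
+	vertical, err := lookup(candidateID)
+	if err != nil {
+		return RouteLocalOnly, http.StatusServiceUnavailable // rule 4: fail closed
+	}
+	switch vertical {
+	case "healthcare", "unknown", "": // empty treated as unknown (assumption)
+		return RouteLocalOnly, http.StatusOK
+	default: // "general" or another non-healthcare vertical
+		return RouteCloudOK, http.StatusOK
+	}
+}
+
+func main() {
+	unknown := func(string) (string, error) { return "unknown", nil }
+	route, status := Decide(unknown, "subj-1")
+	fmt.Println(route == RouteLocalOnly, status)
+
+	down := func(string) (string, error) { return "", errUnreachable }
+	_, status = Decide(down, "subj-1")
+	fmt.Println(status)
+}
+```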

- Gateway maintains a per-request map of `subject_id → temporarily_resolved_PII` for the lifetime of one request
- Before Langfuse POST, gateway iterates message content, replaces resolved PII with `[REDACTED-{candidate_id}]`
- Langfuse never sees raw names/emails/phones — it sees the tokens, which are unresolvable without a legal-tier identityd call
- For audit, legal counsel can use the token to dereference identity AND see the corresponding Langfuse trace, but Langfuse's storage is PII-free
+**Reclassification path:** subjects can be moved from `unknown` → `general` via explicit operator action (after manual review). Subjects flagged `healthcare` stay healthcare unless explicitly downgraded.

-This addresses the most-dangerous-leak finding from `AUDIT_PHASE_1_DISCOVERY` §10/C2.
+**Auto-escalation** (per kimi): if a subject's resume_text or call_log content is updated with healthcare-pattern matches (RN, BSN, hospital, MD, physician, etc.), vertical auto-escalates to `healthcare`. Never auto-de-escalates.
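+
+The one-way nature of auto-escalation can be sketched as below. `Escalate` and `healthcarePattern` are hypothetical names, and the pattern covers only the terms listed above; the real list is longer. False positives only ever escalate, which is the fail-closed direction:
+
+```go
+package main
+
+import (
+	"fmt"
+	"regexp"
+)
+
+// healthcarePattern is an illustrative subset of the healthcare terms.
+var healthcarePattern = regexp.MustCompile(`(?i)\b(RN|BSN|MD|hospital|physician)\b`)
+
+// Escalate returns the vertical after a resume_text or call_log update.
+// It can move a subject to "healthcare" but never away from it.
+func Escalate(current, updatedText string) string {
+	if current == "healthcare" {
+		return current // never auto-de-escalate
+	}
+	if healthcarePattern.MatchString(updatedText) {
+		return "healthcare"
+	}
+	return current
+}
+
+func main() {
+	fmt.Println(Escalate("general", "Licensed RN, 5 years ICU experience"))
+	fmt.Println(Escalate("healthcare", "TIG welder, aluminum"))
+	fmt.Println(Escalate("unknown", "TIG welder, aluminum"))
+}
+```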

-### 3.5 — Healthcare vertical routing (per J answer 10)
+### 3.6 — Cache invalidation hook (Phase 2 must ship even though full RTBF is Phase 7)

-When `subjects.vertical = 'healthcare'`, the gateway routing rules change:
+When `POST /v1/identity/subjects/{id}/erase` fires:
+1. Mark `subjects.erased_at` and zero out ciphertext columns
+2. Write `pii_access_log` row with `purpose_token='retention_expired'` or `'rtbf_request'`
+3. **Synchronously** publish a `subject_erased` event to a pub/sub channel (Redis or Postgres LISTEN/NOTIFY)
+4. Gateway subscribes; on event, purges any in-flight cache entries for that candidate_id
+5. Eventual-consistency window: <5s between erase-call and cache-flush

- Tool calls that touch this candidate's data MUST route to local-only models (Ollama on-box), NOT to opencode/openrouter/ollama_cloud egress
- If a healthcare-vertical request can't be served locally, it fails with HTTP 451 ("Unavailable for Legal Reasons") — better to refuse than leak PHI
- The identity service holds the routing decision; the gateway consults it on every call
- Vertical detection itself happens at ingest time (`workers_500k` row metadata) OR when first PII fetch returns vertical='healthcare'
-
-This requires a one-line addition to the gateway's chat routing in `crates/gateway/src/v1/chat.rs` + Go-side equivalent in `cmd/chatd/main.go`. Both should fail-closed: if identityd is unreachable, healthcare requests refuse.
-
-### 3.6 — Training-safe export (per J answer 11)
-
-`POST /v1/identity/training_safe_export` returns a projection of subject decision data with:
-
- `name`, `email`, `phone`, `address`, `ssn`, `dob` ALL stripped (replaced with `[REDACTED]`)
- `candidate_id` replaced with a hashed pseudonym specific to the export run (different export runs produce different pseudonyms — prevents cross-run correlation)
- Discrimination-proxy phrases (per gemini scrum) detected and `[REDACTED-PROXY]`-replaced
- Output is suitable for RAG-indexing or fine-tuning corpus building
- An audit log entry documents the export (purpose, requesting accessor, scope, fields)
-
-If a candidate later RTBFs, their pre-export decisions remain in the trained corpus BUT the link back to them is severed (the export pseudonym was random). Legal defense: "the source data was destroyed; the model retains it indistinguishably from synthetic patterns."
+Without this hook, RTBF in Phase 7 would be a lie because the gateway's LRU embed cache could still hold the subject's data.
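+
+A rough sketch of the hook's contract, using an in-process `Bus` as a stand-in for the Redis or Postgres LISTEN/NOTIFY channel. `Bus`, `Cache`, and `NewCache` are hypothetical names; a networked channel would add delivery latency that this sketch does not model:
+
+```go
+package main
+
+import (
+	"fmt"
+	"sync"
+)
+
+// Bus is an in-process stand-in for the pub/sub channel.
+type Bus struct {
+	mu   sync.Mutex
+	subs []func(subjectID string)
+}
+
+func (b *Bus) Subscribe(fn func(string)) {
+	b.mu.Lock()
+	defer b.mu.Unlock()
+	b.subs = append(b.subs, fn)
+}
+
+// Publish calls every subscriber before returning, so the erase handler
+// does not report success until each subscriber has purged.
+func (b *Bus) Publish(subjectID string) {
+	b.mu.Lock()
+	subs := append([]func(string){}, b.subs...)
+	b.mu.Unlock()
+	for _, fn := range subs {
+		fn(subjectID)
+	}
+}
+
+// Cache is a hypothetical gateway-side cache keyed by subject.
+type Cache struct {
+	mu      sync.Mutex
+	entries map[string]string
+}
+
+func NewCache() *Cache { return &Cache{entries: map[string]string{}} }
+
+func (c *Cache) Put(id, v string) {
+	c.mu.Lock()
+	defer c.mu.Unlock()
+	c.entries[id] = v
+}
+
+func (c *Cache) Get(id string) (string, bool) {
+	c.mu.Lock()
+	defer c.mu.Unlock()
+	v, ok := c.entries[id]
+	return v, ok
+}
+
+func (c *Cache) Purge(id string) {
+	c.mu.Lock()
+	defer c.mu.Unlock()
+	delete(c.entries, id)
+}
+
+func main() {
+	bus, cache := &Bus{}, NewCache()
+	bus.Subscribe(cache.Purge) // gateway subscribes at startup
+	cache.Put("subj-1", "cached embedding")
+	bus.Publish("subj-1") // step 3 of the erase flow
+	_, ok := cache.Get("subj-1")
+	fmt.Println("still cached:", ok)
+}
+```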

 ---

-## 4. Audit response: what /audit/subject/{id} returns (Phase 3 preview)
-
-Phase 3 builds the audit endpoint, but the shape it returns is dictated by what identityd can produce. Sketch:
+## 4. Audit response shape (Phase 3 preview, v2 — adds model-version snapshot per gemini)

 ```json
 {
  "schema": "audit.subject.v1",
-  "subject_token": "CAND-000001",
+  "subject_token": "01926f2e-7c1b-7000-...",     // UUID v7, NOT CAND-NNNNNN
  "request_window": { "from": "2026-01-01", "to": "2026-05-03" },
  "generated_at": "2026-05-03T12:00:00Z",
  "generated_by": "identityd@hostname",
-  "integrity_hash": "sha256:...",                       // Merkle-style chain of all decision rows
-  "signature": "ed25519:...",                            // identityd signs with its escrow key
-  "consent": {
-    "status": "given",
-    "version": "v3-2026-04-15",
-    "given_at": "2026-04-20T14:30:00Z",
-    "biometric_consent": "given",
-    "biometric_retention_until": "2029-04-20T14:30:00Z"
-  },
+  "merkle_root": "sha256:...",                    // root hash of legal/consent/erasure events
+  "external_anchor": "s3://anchor-bucket/identityd/2026-05-03T12-00.json",  // S3 Object Lock URI
+  "signature": "ed25519:...",                     // separate signing key from encryption KEK
+  "consent": { ... },
  "decisions": [
    {
      "ts": "2026-04-22T09:15:23Z",
      "decision_kind": "fill_recommendation",
      "daemon": "gateway",
      "model": "kimi-k2.6",
+      "model_version_hash": "sha256:...",         // NEW per gemini — proves what model existed AT decision time
      "provider": "ollama_cloud",
      "trace_id": "trace-abc",
      "session_id": "session-xyz",
-      "input_features": {
-        // Sanitized view of what the model saw — no protected attributes,
-        // no inferred-attribute proxies, but enough to defend the decision
-      },
-      "output": "recommended for fill_event_456",
-      "rationale": "Skills match: Welder TIG aluminum + 5+ years; geo match: Toledo OH; availability: confirmed",
+      "input_features": { ... },                  // sanitized; no protected attributes; no inferred-attribute proxies
+      "output": "...",
+      "rationale": "...",
      "comparator_pool_size": 47,
-      "comparator_pool_protected_class_distribution": "see appendix A"  // adverse-impact stats per gemini scrum
+      "comparator_appendix_ref": "see comparator_appendix.A"
    }
  ],
  "comparator_appendix": {
-    // EEOC adverse-impact statistics: for the same searches that included
-    // this subject, what was the selection rate by protected class?
-    // Aggregated; no other subjects' identifiers leak.
+    "A": {
+      "scope": "fill scenarios in window matching role X, geo Y",
+      "total_pool_size": 47,
+      "selection_rate_by_protected_class": {
+        // Aggregated; NO other subjects' identifiers leak
+        "race": { "white": 0.33, "black": 0.31, ... },
+        "gender": { "man": 0.34, "woman": 0.30, ... }
+      },
+      "four_fifths_test": "passed"  // or "concern: rate ratio 0.71"
+    }
  },
-  "access_log": [
-    // Every PII access for this subject in the request window
-    { "at": "...", "purpose": "fill_validation", "fields": ["name"], "trace_id": "..." }
-  ],
+  "access_log": [...],
  "footer": {
-    "completeness_attestation": "all decisions about subject_token in the window per retention policy v2 are included",
-    "what_was_excluded": "decisions older than 4 years (retention expired) — count: 0",
-    "format_version": "audit.subject.v1"
+    "completeness_attestation": "all decisions about subject_token in window per retention policy v2 are included",
+    "merkle_proof": "...",                        // proof this audit's root is in S3 anchor
+    "what_was_excluded": "decisions older than 4 years (retention expired) — count: 0"
  }
 }
 ```
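+
+The `four_fifths_test` field could be derived from `selection_rate_by_protected_class` roughly as follows. Under the four-fifths rule, a group whose selection rate is below 80% of the highest group's rate is treated as a sign of possible adverse impact. `FourFifths` and `Label` are hypothetical helper names, and reporting "no data" as a concern is an assumption of this sketch:
+
+```go
+package main
+
+import "fmt"
+
+// FourFifths returns the ratio of the lowest to the highest selection rate
+// and whether that ratio meets the 0.8 threshold.
+func FourFifths(rates map[string]float64) (float64, bool) {
+	first := true
+	var lo, hi float64
+	for _, r := range rates {
+		if first {
+			lo, hi, first = r, r, false
+			continue
+		}
+		if r < lo {
+			lo = r
+		}
+		if r > hi {
+			hi = r
+		}
+	}
+	if first || hi == 0 {
+		return 0, false // no data or no selections: flag for review
+	}
+	ratio := lo / hi
+	return ratio, ratio >= 0.8
+}
+
+// Label renders the result in the shape used by the audit response.
+func Label(rates map[string]float64) string {
+	ratio, ok := FourFifths(rates)
+	if ok {
+		return "passed"
+	}
+	return fmt.Sprintf("concern: rate ratio %.2f", ratio)
+}
+
+func main() {
+	fmt.Println(Label(map[string]float64{"white": 0.33, "black": 0.31}))
+	fmt.Println(Label(map[string]float64{"man": 0.40, "woman": 0.284}))
+}
+```
+
+With the sample rates from the appendix (0.33 and 0.31), the ratio is about 0.94, which clears the threshold.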

-PDF render is a downstream consumer — same JSON, different presentation layer (template + signing remains in JSON; PDF is for legal team's final delivery).
+---
+
+## 5. Migration path (v2 — REORDERED per all 3 reviewers)
+
+v1's order was wrong. New order, with explicit prerequisites:
+
+| Step | Action | Prerequisite |
+|---|---|---|
+| **0** | **Phase 1.6 BIPA pre-launch gates SHIPPED** — consent template published, retention schedule public, deletion procedure documented, employee training acknowledged. | (separate work — see PHASE_1_6_BIPA_GATES.md when written) |
+| **1** | Stand up identityd with **synthetic test subjects only**. KEK in vault, mTLS live, all endpoints serve. | Vault/KMS available; mTLS CA bootstrapped |
+| **2** | **Audit-parity probe** (`audit_parity.sh`) green on synthetic data. Cross-runtime equivalence verified. | Step 1 |
+| **3** | Add gateway feature flag `LH_USE_IDENTITY_SERVICE`. **Shadow-write only** — gateway writes new subjects to identityd AND continues SQL path. No reads from identityd yet. | Step 2 |
+| **4** | Run shadow-write for ≥1 week. Validate access logs, encryption correctness, write-path performance under real traffic. | Step 3 |
+| **5** | Backfill from `workers_500k.parquet`. **`consent_status='pending_backfill_review'`** for ALL existing rows. **`vertical='unknown'`** default (NOT 'general'). **`biometric_consent_status='never_collected'`** — backfill does NOT include any biometric data. | Step 0 (BIPA gates) + Step 4 |
+| **6** | Human-review queue for vertical reclassification. Subjects move from `unknown` to `general` only after explicit review. Healthcare-pattern matches auto-escalate to `healthcare` (never auto-downgrade). | Step 5 |
+| **7** | **Shadow-read** — gateway reads from BOTH SQL and identityd, compares, logs divergences, returns SQL result. Run ≥1 week. | Step 5 |
+| **8** | Feature-flag cutover — gateway reads from identityd, falls back to SQL on error, alerts on every fallback. | Step 7 |
+| **9** | Quarantine PII columns in workers_500k.parquet — only after **cryptographic attestation of identityd completeness** (Merkle proof: source row hash = identityd row hash for every candidate_id). Move PII columns to a different bucket; the candidate_id-only projection becomes operational. | Step 8 |
+
+**Key v2 changes from v1:**
+- New Step 0 (BIPA gates prerequisite) — backfill cannot start without this
+- Steps 3-4 (shadow-write before backfill) — production validates the WRITE path before data lands
+- Step 5 backfill consent: `pending_backfill_review` not `inferred_existing` (BIPA defense)
+- Step 5 vertical default: `unknown` not `general` (HIPAA fail-closed)
+- New Step 6 (human review) — vertical classification via explicit operator action
+- Step 7 added (shadow-read after backfill) — catches encoding/normalization bugs before cutover
+- Step 9 (quarantine) requires cryptographic attestation of completeness
+
+Each step is its own commit, its own gate, its own rollback path.
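+
+Step 7 (shadow-read) is the least obvious step to implement, so here is a minimal sketch. `Reader` and `ShadowRead` are hypothetical names, and the record is flattened to a string map for illustration. The sketch logs field names only, so that divergence logs do not become a new PII sink:
+
+```go
+package main
+
+import "fmt"
+
+// Reader is a hypothetical lookup signature shared by both paths.
+type Reader func(candidateID string) (map[string]string, error)
+
+// ShadowRead queries both stores, logs divergences, and always returns
+// the SQL result so production behavior is unchanged.
+func ShadowRead(sql, identityd Reader, id string, logf func(string, ...any)) (map[string]string, error) {
+	want, err := sql(id)
+	if err != nil {
+		return nil, err
+	}
+	got, ierr := identityd(id)
+	if ierr != nil {
+		logf("shadow-read: identityd error for %s: %v", id, ierr)
+		return want, nil
+	}
+	for field, w := range want {
+		if g, ok := got[field]; !ok || g != w {
+			logf("shadow-read: divergence for %s in field %s", id, field)
+		}
+	}
+	for field := range got {
+		if _, ok := want[field]; !ok {
+			logf("shadow-read: unexpected field %s for %s", field, id)
+		}
+	}
+	return want, nil
+}
+
+func main() {
+	sql := func(string) (map[string]string, error) {
+		return map[string]string{"name": "Jane Doe", "phone": "555-0100"}, nil
+	}
+	idd := func(string) (map[string]string, error) {
+		return map[string]string{"name": "Jane Doe", "phone": "5550100"}, nil
+	}
+	logf := func(format string, args ...any) { fmt.Printf(format+"\n", args...) }
+	rec, _ := ShadowRead(sql, idd, "subj-1", logf)
+	fmt.Println(len(rec))
+}
+```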

 ---

-## 5. Migration path from current state
+## 6. Cross-runtime parity probe (NEW, per scrum §6 plus kimi 'oracle test' addition)

-This is the single biggest implementation question — how to get from "PII in workers_500k.parquet, no identityd" to "PII in identityd, parquet has only candidate_id + non-PII columns" without breaking the live demo.
+`audit_parity.sh` ships in Phase 5. Asserts:

-### Migration strategy: parallel-write, gradual-read-cutover
-
-1. **Step 1 — Stand up identityd.** Empty database. New service, no callers yet. Health endpoint live. Rust + Go tests can call it but production paths don't.
-2. **Step 2 — Backfill from workers_500k.parquet.** One-shot ETL: read parquet, for each row, write to identityd with `consent_status='inferred_existing'` (placeholder until counsel writes the real consent backfill story), `vertical='general'` (correct for non-healthcare data; needs human review for healthcare-flagged rows).
-3. **Step 3 — Add identityd-call path to gateway behind a feature flag.** When `LH_USE_IDENTITY_SERVICE=true`, the gateway calls identityd for PII; otherwise it uses the legacy SQL path.
-4. **Step 4 — Cut over reads incrementally.** Tool registry first (highest-PII-volume path). Validate via the cross-runtime parity probe `audit_parity.sh`.
-5. **Step 5 — Quarantine PII columns in workers_500k.parquet.** Once all readers go through identityd, the parquet's PII columns become read-only and eventually moved to a different bucket. The candidate_id-only projection becomes the operational table.
-
-Each step has its own commit, its own gate, and its own rollback. Don't ship steps 2-5 in one commit.
-
---
-
-## 6. Cross-runtime parity probe (NEW)
-
-Per `AUDIT_PHASE_1_DISCOVERY` §4 ask: extend the 5 existing probes with `audit_parity.sh`. New probe asserts:
-
-1. Same PII fetch through Rust gateway (port 3100) and Go gateway (port 4110) produces identical identityd access-log rows (modulo daemon name)
-2. Crypto-erasure of a test subject through Rust gateway is honored when Go gateway tries to fetch
-3. Healthcare-vertical routing decision is identical across both runtimes
-4. Training-safe export produces byte-identical output regardless of which gateway initiated
-
-Ships as part of Phase 5 (identity service build), not phase 2.
+1. Same PII fetch through Rust gateway and Go gateway produces identical identityd access-log rows (modulo daemon name)
+2. Crypto-erasure of test subject through Rust gateway is honored when Go gateway tries to fetch
+3. Healthcare-vertical routing decision identical across both runtimes
+4. **Oracle test** (per kimi SOC2 CC4.1 finding): probe with known-good inputs against expected outputs. A bug present in BOTH implementations must not pass the parity probe.
+5. Discrimination-proxy phrase redaction triggers as expected on adversarial test cases
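+
+Assertion 4 is the subtle one: two runtimes that share a bug agree with each other, so parity alone passes. A small sketch of the distinction, where `Probe` and `Result` are hypothetical names and the string outputs stand in for whatever `audit_parity.sh` actually compares:
+
+```go
+package main
+
+import "fmt"
+
+// Result separates the two questions the probe asks.
+type Result struct {
+	Parity bool // both runtimes agree with each other
+	Oracle bool // both runtimes agree with the known-good expectation
+}
+
+// Probe compares the two runtime outputs against each other and against
+// an expected value fixed in advance.
+func Probe(rustOut, goOut, expected string) Result {
+	return Result{
+		Parity: rustOut == goOut,
+		Oracle: rustOut == expected && goOut == expected,
+	}
+}
+
+// Pass requires both checks; parity alone is not sufficient.
+func (r Result) Pass() bool { return r.Parity && r.Oracle }
+
+func main() {
+	// Both gateways share the same bug: an "unknown" subject routed to cloud.
+	r := Probe("cloud", "cloud", "local_only")
+	fmt.Println(r.Parity, r.Oracle, r.Pass())
+}
+```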

 ---

 ## 7. What this design intentionally does NOT solve

- **Does not replace existing protected-attribute exclusion at decision time.** The model still sees what the SQL returns; identityd doesn't filter that. Phase 6 of `AUDIT_TRAIL_PRD` handles boundary enforcement.
- **Does not redact pathway memory trace bodies.** That's per-trace-write redaction, separate concern. Phase 4.
- **Does not retroactively scrub Langfuse history.** Past traces still contain PII; only new traces are token-redacted. Counsel may request a one-shot historical-Langfuse purge — that's a separate runbook.
- **Does not implement "right to explanation" (GDPR Art. 22 / EU AI Act).** Audit response shows decisions; explaining the model's reasoning chain in human-readable form is Phase 8 (legal export format) or its own follow-up phase.
- **Does not handle multi-region data residency.** Single-region (US-Midwest, by default). EU-placeholder fields are present; multi-region deployment is out of scope.
+- Protected-attribute exclusion at decision time (Phase 6)
+- Pathway memory trace body redaction (Phase 4)
+- Retroactive Langfuse history scrub (separate runbook)
+- GDPR Art. 22 right to explanation full implementation (Phase 8)
+- Multi-region data residency (single-region US-Midwest by default)
+- Training-safe export (deferred — not on critical path per opus)

 ---

-## 8. Open questions for J before implementation starts
+## 8. Open questions — RESOLVED 2026-05-03

-1. **Master key location.** Vault server, KMS, or a sealed file? Sealed file is fastest to ship; vault is most defensible. Recommend sealed-file for v1 with migration path to vault. **Confirm.**
-2. **Postgres for identityd: shared with Langfuse, or its own?** Recommend its own — operational isolation. **Confirm.**
-3. **`vertical` field initial values.** Backfill all existing subjects to `'general'`? Or block backfill until each candidate's vertical is determined? **Recommend backfill-to-general + flagging procedure for unknown.**
-4. **Legal-only token issuance procedure.** Who has the authority to mint a legal token? Operator (J)? Outside counsel? Both? **Recommend J + named outside counsel, dual-control.**
-5. **Crypto-erasure timeline for retention.** Default sweep cadence: daily? Weekly? **Recommend daily.**
-6. **EU placeholder enforcement timeline.** Build the fields now; when do we turn on enforcement? **Recommend "when first EU candidate is added; until then, enforcement is no-op."**
+J resolved all 6 v1 open questions. Items 1 and 3 depart from the v1 recommendations per the scrum; the rest are confirmed, with scrum-driven additions:
+
+1. ✅ **Master key:** HashiCorp Vault Transit (recommended) OR AWS KMS. NOT a sealed file. v1's sealed-file recommendation is rejected per all 3 reviewers ("obfuscation in any defensible sense").
+2. ✅ **Postgres isolation:** identityd's own database, isolated schema.
+3. ✅ **Vertical backfill:** `'unknown'` default with fail-closed routing. NOT `'general'` per opus+gemini scrum.
+4. ✅ **Legal token:** Split-secret dual-control issuance (J + counsel both sign). Short-lived JWT, max 24h, revocable in <60s.
+5. ✅ **Crypto-erasure sweep:** Daily 03:00 UTC. (Note: v2 erasure mechanism is ciphertext deletion, not key destruction. Per-row keys deferred to Phase 7.)
+6. ✅ **EU enforcement:** Per-subject. Schema fields exist; nothing runs until first eu_resident=true subject.
+
+### Newly resolved per scrum
+
+7. ✅ **Migration order:** REORDERED per §5 above. New Step 0 (BIPA prerequisite). Shadow-write before backfill. Shadow-read before cutover.
+8. ✅ **Audit-log external anchor:** S3 Object Lock with compliance-mode 7-year retention. Hourly + on-event commits.
+9. ✅ **Audit-log signing key:** Separate Ed25519 keypair from KEK. Vault Transit signing backend OR sealed-secret with strict rotation runbook.
+10. ✅ **mTLS gateway↔identityd:** Mandatory. Self-signed CA managed by identityd at startup.
+11. ✅ **Per-row encryption keys:** Deferred to Phase 7. v2 uses single DEK with HSM-wrapped KEK + ciphertext deletion for erasure.
+12. ✅ **Field-level authorization:** purpose_definitions table enforces per-purpose field allowlists.
+13. ✅ **Synchronous cache invalidation:** pub/sub event on erase; gateway subscribes.
+14. ✅ **Outbound NER pass for Langfuse:** Defense-in-depth in addition to symbol-table replacement.
+15. ✅ **Model version hash in audit response:** Captured per decision row.
+16. ✅ **PDF render:** Deferred from Phase 2; JSON ships first.

 ---

-## 9. Estimated implementation cost
+## 9. Estimated implementation cost (revised v2)

 | Sub-phase | Effort | Notes |
 |---|---|---|
-| 2A — Postgres schema + migrations | 4-6 hours | Includes encryption helpers + key management glue |
-| 2B — identityd HTTP surface (Go) | 1-2 days | All endpoints, auth, signing key, tests |
-| 2C — Backfill ETL from workers_500k.parquet | 1 day | One-shot script + dry-run mode |
-| 2D — Gateway integration (Rust + Go, behind feature flag) | 2 days | Per-tool migration, parity probe |
-| 2E — outcomes/sessions/observer JSONL writer changes | 1 day | Subject_id top-level promotion across all sinks |
-| 2F — Langfuse redaction layer | 1-2 days | Per-request resolved-PII map + token replacement |
-| 2G — Healthcare-vertical routing | 0.5 day | Single conditional per gateway |
-| 2H — Training-safe export | 1 day | The exporter + audit logging |
-| 2I — Cross-runtime parity probe `audit_parity.sh` | 0.5 day | New probe, lands in golangLAKEHOUSE |
-| **Total** | **~8-10 working days** | Sequential; some can parallelize |
+| 2A — Postgres schema + Vault/KMS integration | 1.5-2 days | Includes mTLS CA bootstrap + signing-key separation |
+| 2B — identityd HTTP surface (Go) | 2-3 days | All endpoints, auth, dual-control JWT issuance, rate limiting, revocation |
+| 2C — Backfill ETL (BIPA-compliant — pending_backfill_review) | 1 day | Plus Merkle attestation of completeness |
+| 2D — Gateway integration (Rust + Go, mTLS, shadow-write phase) | 2-3 days | Per-tool migration, parity probe |
+| 2E — JSONL writer changes (subject_id top-level promotion) | 1 day | All sinks |
+| 2F — Langfuse redaction (symbol-table + NER) | 2 days | Two passes + drop-on-detect |
+| 2G — Healthcare-vertical routing (fail-closed) | 0.5 day | Plus auto-escalation pattern matcher |
+| 2H — Cache invalidation pub/sub hook | 0.5 day | Critical for Phase 7 RTBF |
+| 2I — External anchor (S3 Object Lock) | 1 day | Hourly + on-event commits |
+| 2J — Cross-runtime parity probe with oracle tests | 1 day | New probe |
+| **Total** | **~12-15 working days** | (v1 estimated 8-10; v2 added BIPA gates dependency, mTLS, dual-control JWT, NER pass, anchor, etc.) |

-This is the largest single phase in the audit-trail program. It's the substrate for everything downstream. Recommend doing it carefully — no half-shipped commits, each sub-phase has its own exit criterion.
+Bigger than v1. Worth it: the four blockers were 3/3-reviewer convergent findings, and the remaining additions close gaps flagged by individual reviewers.
+
+---
+
+## 10. The four "would not build" blockers from scrum (all addressed in v2)
+
+| # | v1 issue (3/3 reviewers) | v2 resolution |
+|---|---|---|
+| 1 | Migration order (backfill before validation) | §5 reordered; Step 0 BIPA gates prereq added |
+| 2 | Master key on disk + legal token static file | §2.1 Vault/KMS for KEK; §2.4 split-secret short-lived JWT |
+| 3 | `inferred_existing` BIPA prima facie violation | §5 Step 5 uses `pending_backfill_review`; Step 0 gates required first |
+| 4 | Healthcare default `general` (HIPAA exposure window) | §3.5 fail-closed `unknown`-as-healthcare; §5 Step 5 backfill default `unknown` |
+
+All three reviewers said "I would not build v1 as written." All four blockers are resolved in v2. Re-scrum recommended before implementation starts.
+
+---
+
+## 11. Change log (v1 → v2)
+
+| Section | v1 | v2 |
+|---|---|---|
+| Master key storage | Sealed file `/etc/lakehouse/identityd_master.key` | HashiCorp Vault Transit / AWS KMS |
+| Legal token | Static file mode 0400 | Split-secret short-lived JWT (max 24h), dual-control issuance |
+| Encryption keys | Per-row keys with master wrapping | Single DEK + HSM-wrapped KEK + ciphertext deletion (per-row deferred to Phase 7) |
+| Healthcare default | `vertical='general'` backfill | `vertical='unknown'` backfill, fail-closed routing |
+| Migration §5 order | 1: stand up, 2: backfill, 3: feature flag, 4: cutover, 5: quarantine | 0: BIPA gates, 1: stand up, 2: parity probe, 3: shadow-write, 4: shadow-write soak, 5: backfill (pending_backfill_review), 6: human vertical review, 7: shadow-read, 8: cutover, 9: quarantine with cryptographic attestation |
+| Audit-log integrity | Postgres Merkle chain only | Postgres Merkle for legal/consent/erasure + HMAC for standard + S3 Object Lock external anchor |
+| Audit-log signing key | Same as encryption key | Separate Ed25519 keypair; Vault signing backend |
+| Gateway↔identityd transport | Plain HTTP (implied) | mTLS mandatory |
+| Field authorization | Per-endpoint only | Per-purpose-token field allowlists |
+| Cache invalidation | Phase 4 / Phase 7 | Phase 2 ships pub/sub hook |
+| Langfuse redaction | Symbol-table replacement only | Symbol-table + outbound NER pass + drop-on-detect |
+| Audit response | No model version | model_version_hash per decision |
+| PDF render | Phase 2 | Deferred — JSON ships first |
+| Training-safe export | Phase 2 | Deferred — not critical path |
+| Effort estimate | 8-10 days | 12-15 days |

 ---

 ## Change log

- 2026-05-03 — Initial Phase 2 design draft. Incorporates J's confirmed answers (separate daemon, signed JSON+PDF, legal-only auth) plus all Phase 1 + 1.5 findings + scrum-driven priority changes.
+- 2026-05-03 — v1 initial draft.
+- 2026-05-03 — v2 post-scrum: 3/3 reviewer convergent findings folded in. 4 "would not build" blockers all resolved. Re-scrum before implementation recommended.