lakehouse/docs/IDENTITY_SERVICE_DESIGN.md
root 298fadce41 identity service: v2 — fold cross-lineage scrum findings + 4 'would not build' blocker fixes
Scrummed v1 across opus + kimi + gemini lineages via the new model
fleet. 3/3 reviewers said 'I would NOT build v1 as written.' 4
convergent blockers, all resolved in v2:

1. Migration order wrong — backfill before validation creates dark
   database; if backfill bug, no production traffic catches it.
   v2 inserts BIPA-prereq Step 0 + shadow-write before backfill +
   shadow-read before cutover. 9-step migration with cryptographic
   attestation of completeness at quarantine.

2. Master key on disk + legal token static file = 'security theater'
   per all 3. v2: HashiCorp Vault Transit / AWS KMS for KEK (not
   sealed file). Legal token: split-secret short-lived JWT (max 24h),
   dual-control issuance (J + counsel both sign), revocable in <60s.

3. consent_status='inferred_existing' is BIPA prima facie violation
   (kimi+gemini explicit). v2 backfill uses 'pending_backfill_review';
   biometric data NEVER backfilled — separate consent stream.

4. Healthcare default 'general' = HIPAA exposure window for every
   misclassified subject. v2 default 'unknown' with fail-closed
   routing (treat unknown as healthcare-equivalent until classified
   by manual review). Auto-escalation to healthcare on resume_text
   pattern match.

Plus 12 single-reviewer additions:
- mTLS mandatory between gateway↔identityd (kimi)
- External anchor for audit chain: S3 Object Lock 7-year compliance
  mode, hourly + on-event commits (all 3)
- Audit-log signing key separate from encryption KEK (opus)
- Field-level authorization via purpose_definitions table (kimi)
- Per-row encryption keys deferred to Phase 7 (kimi simplification)
- pii_access_log itself needs legal-tier read auth (opus)
- Synchronous cache invalidation pub/sub on RTBF (opus)
- Outbound NER pass for Langfuse defense-in-depth (opus TOCTOU)
- model_version_hash per decision row (gemini)
- /vertical minimal-disclosure endpoint (kimi HIPAA min-necessary)
- Auto-escalation healthcare on resume_text pattern (kimi)
- Rate limiting + token revocation list (opus)
- Oracle tests in audit_parity.sh (kimi SOC2 CC4.1)

Architecturally simplified per scrum:
- Per-row encryption keys deferred to Phase 7 (single DEK + HSM-
  wrapped KEK + ciphertext deletion is equivalent practical erasure
  with less complexity)
- PDF render deferred (JSON ships first)
- Training-safe export deferred (not critical path)

Estimated effort revised 8-10 → 12-15 days. Worth it — every
addition was a 3/3-reviewer convergent finding.

Re-scrum recommended before implementation starts to verify v2
addresses the v1 blockers.

No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 01:36:07 -05:00


Identity Service — Phase 2 Design (v2 — post-scrum revisions)

Status: Draft v2 (2026-05-03) · Owner: J · Drafted by: working session 2026-05-03
Companion to: AUDIT_TRAIL_PRD.md, AUDIT_PHASE_1_DISCOVERY.md, AUDIT_PHASE_1_5_BIPA_AND_OUTCOMES.md

v2 history. v1 (2026-05-03 morning) was scrummed across opus + kimi + gemini lineages. 3/3 reviewers converged on 6 critical issues — wrong migration order, master key + legal token "security theater," BIPA inferred_existing prima facie violation, healthcare default in wrong direction, Merkle chain without external anchor. v2 incorporates those changes. Full scrum reviews preserved at /tmp/identity_scrum/{opus,kimi,gemini}_review.md. Diff between v1 and v2 captured in §11 change log.

Why this exists. Phase 1 + 1.5 confirmed today's substrate has no separation between candidate_id and PII. Both live in workers_500k.parquet. No per-access audit, no consent gate, no retention enforcement. This document specifies the new identity service that holds the candidate_id ↔ PII mapping, gates every PII read, audits every access, and serves as the single legal-attestable boundary between PII and the rest of the system.

Prerequisite: Phase 1.6 (BIPA pre-launch gates) MUST ship before any identityd backfill begins. Per kimi+gemini scrum, backfilling with consent_status='inferred_existing' is a BIPA §15 prima facie violation. Phase 1.6 establishes the consent template + retention schedule + employee training that turns backfill from "inferred" into "pending_backfill_review with documented escalation path."


1. Scope and non-goals

In scope

  • Single source of truth for candidate_id ↔ PII mapping
  • Per-PII-access audit log (who/what/when/why)
  • Consent + retention metadata (BIPA + IL Day and Temporary Labor Services Act + healthcare PHI)
  • Legal-tier access — short-lived JWT with split-secret dual-control issuance (NOT a static file token, per all 3 reviewers)
  • Healthcare-vertical routing with fail-closed default (unknown treated as healthcare until classified, per opus+gemini)
  • EU-compatible interface — fields exist, enforcement is per-subject (no system-wide flag flip needed)
  • Hardware-backed master key: HashiCorp Vault Transit OR AWS KMS for the first identityd release (NOT a sealed file, per all 3 reviewers; kimi called design v1's sealed file "obfuscation in any defensible sense")
  • Signed-JSON audit response with PDF render path (PDF deferred; JSON ships first per opus)
  • External-anchor for audit-log integrity (S3 Object Lock + signed timestamps; NOT just Postgres Merkle, per all 3 reviewers)
  • mTLS or Unix Domain Socket between gateway ↔ identityd (per kimi — "port-isolated is theater without authenticated channel")
  • Field-level authorization on /subjects/{id} (per-purpose field allowlists, per kimi)
  • Synchronous cache invalidation hook for RTBF (per opus — even if Phase 7 is the full RTBF build)

Out of scope (for this phase)

  • The /audit/subject/{id} endpoint itself (Phase 3)
  • Full subject-tagging across other substrates (Phase 4)
  • Right-to-be-forgotten implementation full-flow (Phase 7)
  • Training-safe export (deferred from Phase 2 per opus — "not on critical path for audit-trail defense")
  • BIPA pre-launch gates content (Phase 1.6 — separate doc)

Architecturally simplified from v1 (per scrum)

  • Per-row encryption keys deferred to Phase 7. v2 uses a single Data Encryption Key (DEK) wrapped under HSM-backed Key Encryption Key (KEK), with per-subject ciphertext-deletion as the v2 erasure mechanism. This matches kimi's recommendation: "A single data-encryption key with HSM-backed rotation, plus per-subject deletion of ciphertext, achieves equivalent practical erasure with far less complexity." Cryptographic-erasure via per-subject keys becomes a Phase 7 enhancement.
  • EU placeholder fields kept (per J 2026-05-03), but enforcement code is genuinely no-op until first eu_resident=true subject — no code paths run, no schema migrations needed when EU comes online (just per-subject field population).
  • Merkle chaining narrowed (per gemini): full Merkle tree only for legal-tier and consent/erasure events. Standard gateway_lookup events get a simpler signed-HMAC chain. Both anchored externally.
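
The two integrity mechanisms named in the last bullet can be sketched as follows. This is a minimal illustration, not identityd's implementation (which is planned in Go): the function names, the all-zero genesis value, the leaf/node domain-separation bytes, and the odd-node duplication rule are all assumptions made for the example.

```python
import hashlib
import hmac


def hmac_chain(events: list[bytes], key: bytes) -> list[str]:
    """Link each event to its predecessor: h_i = HMAC(key, h_{i-1} || event_i)."""
    prev = b"\x00" * 32  # assumed genesis value for the first link
    chain = []
    for event in events:
        prev = hmac.new(key, prev + event, hashlib.sha256).digest()
        chain.append(prev.hex())
    return chain


def merkle_root(leaves: list[bytes]) -> str:
    """SHA-256 Merkle root; an odd node at any level is paired with itself."""
    if not leaves:
        return hashlib.sha256(b"").hexdigest()
    # Prefix bytes separate leaf hashes from interior-node hashes (assumed convention).
    level = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

In the chain, altering one event changes that link and every later one, which is enough for routine gateway_lookup rows. The tree additionally allows a short inclusion proof for a single legal-tier event against an externally anchored root.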

2. Architectural shape

2.1 — Process model + transport

Per J 2026-05-03 + kimi scrum: separate daemon, mTLS-mandatory transport. Single Go implementation, both runtimes call it.

| Property | Value |
| --- | --- |
| Name | identityd |
| Port | :3225 (single port, not dual; per kimi simplification, runtime-agnostic routing) |
| Implementation | Go |
| Storage | Postgres in isolated database (per J answer, confirmed). Schema-level isolation; no shared schemas with Langfuse or other lakehouse storage. |
| Transport | gateway↔identityd mTLS mandatory. Self-signed CA managed by identityd at startup; gateway clients have their own client certs. Plain HTTP is rejected at the listener. |
| Encryption-at-rest | Single DEK wrapped under KEK in HashiCorp Vault Transit (recommended) OR AWS KMS. Master key NEVER on disk. v1 of identityd refuses to start if KEK is unreachable. |
| Audit-log signing | Separate Ed25519 keypair from the master encryption key. Signing key in Vault Transit (signing backend) OR a separate sealed-secret file with strict rotation procedure. |
| External anchor for audit chain | S3 Object Lock (compliance mode, 7-year retention) holding periodic Merkle root commitments. Daemon writes hourly + on legal-tier event. |
| Backup | Postgres standard backup; KEK backup separate (different storage, different access). Crypto-erasure model only works if these are not co-located. |
| Rate limiting | Per-token QPS + daily-volume caps. Default legal-token: 100 lookups/day + 10K daily volume; alerting at 50%. Configurable per-token. |
| Token revocation | token_revocations table checked on every legal-tier auth. Revocation propagates within 60s (TTL on cache). |
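
The 60s revocation bound in the last row follows from the cache TTL: a revoked token can be accepted only until the cached revocation set is next refreshed. A minimal sketch of that behaviour, assuming a hypothetical `load_revoked` callable that returns the current token hashes from the token_revocations table:

```python
import time


class RevocationCache:
    """Caches the revoked-token set for at most `ttl` seconds."""

    def __init__(self, load_revoked, ttl=60.0, clock=time.monotonic):
        self._load = load_revoked  # hypothetical loader: returns an iterable of token hashes
        self._ttl = ttl
        self._clock = clock        # injectable so the TTL can be tested without sleeping
        self._cached = set()
        self._loaded_at = None

    def is_revoked(self, token_hash):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            # Refresh from the source of truth once the cached copy is too old.
            self._cached = set(self._load())
            self._loaded_at = now
        return token_hash in self._cached
```

A token revoked just after a refresh stays usable for up to one TTL, so the TTL is the worst-case propagation delay.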

2.2 — Schema (Postgres DDL, v2)

-- Subject record. PII columns are ciphertext under the DEK.
CREATE TABLE subjects (
    candidate_id    TEXT PRIMARY KEY,           -- UUID v7, NOT sequential (per kimi enumeration concern)
    -- Encrypted PII fields. AES-256-GCM under DEK. NULL = not collected.
    name_ct         BYTEA,
    email_ct        BYTEA,
    phone_ct        BYTEA,
    address_ct      BYTEA,
    ssn_ct          BYTEA,
    dob_ct          BYTEA,
    -- DEK version this subject was encrypted under. KEK rotates; DEK rotates per
    -- KEK rotation. New DEK version = subjects re-encrypted in background sweep.
    dek_version     INT NOT NULL,
    -- Lawful basis + consent metadata
    consent_status  TEXT NOT NULL CHECK (consent_status IN (
        'pending_backfill_review',  -- backfill default; no biometric/PHI use until reviewed
        'pending_first_contact',     -- new subject, awaiting consent UX
        'given',                     -- explicit consent recorded
        'withdrawn',                 -- subject revoked
        'expired'                    -- consent timeout
    )),
    consent_version           TEXT,             -- references published consent template version
    consent_given_at          TIMESTAMPTZ,
    consent_withdrawn_at      TIMESTAMPTZ,
    -- BIPA-specific fields (Phase 1.5 §1E + Phase 1.6 prerequisite)
    biometric_consent_status  TEXT NOT NULL DEFAULT 'never_collected' CHECK (biometric_consent_status IN (
        'never_collected', 'pending', 'given', 'withdrawn', 'expired'
    )),
    biometric_retention_until TIMESTAMPTZ,      -- BIPA: max 3 years from last interaction
    -- Vertical detection — drives healthcare PHI routing.
    -- DEFAULT 'unknown' (per opus+gemini scrum) — fail-closed routing treats
    -- unknown as healthcare-equivalent until reclassified.
    vertical                  TEXT NOT NULL DEFAULT 'unknown' CHECK (vertical IN (
        'unknown', 'general', 'healthcare', 'finance', 'other'
    )),
    -- Per-vertical retention period. Drives the daily erasure sweep.
    -- (Per opus: this was missing in v1. Required for BIPA's 3-year-from-
    -- last-interaction rule which differs from generic retention.)
    retention_period_days     INT NOT NULL,
    -- EU-placeholder fields. Enforcement is per-subject; nothing runs
    -- until a row has eu_resident=true AND lawful_basis IS NOT NULL.
    eu_resident       BOOLEAN NOT NULL DEFAULT FALSE,
    lawful_basis      TEXT,                     -- GDPR Art. 6 basis when eu_resident=true
    transfer_mechanism TEXT,                    -- SCC, DPF, BCR — populated when EU comes online
    -- Standard audit columns
    created_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
    last_interaction  TIMESTAMPTZ,              -- drives retention sweep
    -- Crypto-erasure state (v2: ciphertext deletion, NOT key destruction)
    erased_at         TIMESTAMPTZ,
    erasure_reason    TEXT
);

-- Append-only access log. EVERY PII read writes a row.
-- Field-level: fields_accessed records WHICH fields the caller resolved.
-- purpose enforced against an allowlist per-purpose-token.
CREATE TABLE pii_access_log (
    access_id       BIGSERIAL PRIMARY KEY,
    candidate_id    TEXT NOT NULL,
    accessed_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    accessor_kind   TEXT NOT NULL,              -- 'gateway_lookup' | 'audit_response' | 'legal_request' | 'system_resolve'
    accessor_id     TEXT NOT NULL,              -- daemon name + caller token-hash; never raw token
    purpose_token   TEXT NOT NULL,              -- opaque token; resolved through purpose_definitions table
    fields_accessed TEXT[] NOT NULL,
    request_trace_id TEXT,
    chain_kind      TEXT NOT NULL CHECK (chain_kind IN ('hmac', 'merkle')),  -- HMAC for standard, Merkle for legal/consent/erasure
    integrity_hash  TEXT NOT NULL,
    -- Per opus security finding: this row's existence is itself sensitive
    -- ("candidate X is under legal review" leaks via purpose). Read access
    -- to this table requires legal-tier auth or scoped per-subject auth.
    is_legal_tier_event BOOLEAN NOT NULL DEFAULT FALSE
);

-- Purpose definitions — enforces field-level authorization (per kimi).
-- A given purpose_token can only request fields in its allowlist.
CREATE TABLE purpose_definitions (
    purpose_token   TEXT PRIMARY KEY,           -- e.g. 'fill_validation', 'audit_subject_response'
    description     TEXT NOT NULL,
    allowed_fields  TEXT[] NOT NULL,            -- e.g. ARRAY['name'] for fill_validation
    auth_tier       TEXT NOT NULL CHECK (auth_tier IN ('service', 'legal')),
    rate_limit_qps  INT,
    daily_volume_cap INT
);

-- Token revocation list. Checked on every auth, cached 60s.
CREATE TABLE token_revocations (
    token_hash      TEXT PRIMARY KEY,
    revoked_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    revoked_by      TEXT NOT NULL,
    reason          TEXT
);

-- Consent template versioning. Hash references, not embedded text per row.
CREATE TABLE consent_versions (
    version         TEXT PRIMARY KEY,
    effective_at    TIMESTAMPTZ NOT NULL,
    superseded_at   TIMESTAMPTZ,
    template_hash   TEXT NOT NULL,              -- SHA256 of the canonical template
    template_path   TEXT NOT NULL               -- where the canonical text lives (e.g. data/_consent/v3-2026-04-15.md)
);

-- External anchor checkpoints. Periodically committed to S3 Object Lock.
CREATE TABLE anchor_commits (
    commit_id       BIGSERIAL PRIMARY KEY,
    chain_kind      TEXT NOT NULL,              -- 'hmac' or 'merkle'
    checkpoint_at   TIMESTAMPTZ NOT NULL,
    last_access_id  BIGINT NOT NULL,            -- max access_id covered by this checkpoint
    root_hash       TEXT NOT NULL,
    s3_object_uri   TEXT NOT NULL,              -- s3://anchor-bucket/identityd/...
    s3_lock_until   TIMESTAMPTZ NOT NULL        -- compliance-mode retention end
);

2.3 — HTTP surface (v2)

All under /v1/identity/. mTLS + token required on every endpoint.

| Method + Path | Purpose | Auth tier | Notes |
| --- | --- | --- | --- |
| POST /v1/identity/subjects | Create new subject. Body: PII fields + purpose_token. Returns: candidate_id (UUID v7). | service | Validates purpose_token allowlist matches submitted fields |
| GET /v1/identity/subjects/{candidate_id}?purpose={token}&fields={list} | Resolve PII for candidate. | service | Field-level enforcement: returned fields ⊆ purpose_token's allowed_fields. Mismatch → 403. Logs access row. |
| GET /v1/identity/subjects/{candidate_id}/vertical | Minimal-disclosure vertical lookup (per kimi HIPAA minimum-necessary). Returns only {vertical, consent_status}. | service | Used by gateway routing. Cheaper access-log row. |
| GET /v1/identity/subjects/{candidate_id}/full | Complete subject record + audit summary. | legal | Short-lived JWT only. Logged with is_legal_tier_event=true. Triggers real-time notification to designated counsel + J. |
| POST /v1/identity/subjects/{candidate_id}/consent | Record consent given/withdrawn with version. | service | |
| POST /v1/identity/subjects/{candidate_id}/erase | Crypto-erasure. Idempotent. | legal | Synchronously triggers cache invalidation hooks (per opus). |
| GET /v1/identity/access_log/{candidate_id} | Per-subject access log for audit response. | legal | |
| POST /v1/identity/auth/legal_token | Issue short-lived JWT (max 24h). Requires dual-control attestation: J's signed nonce + counsel's signed nonce, both verified against pre-registered public keys. | dual-control | Per all 3 reviewers; replaces v1's static file approach |
| GET /v1/identity/health | Liveness | none | Returns only {status: "ok"}; no version, no schema info |
| (no public training-safe-export endpoint in v2) | | | Deferred per opus; not on critical path |
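
The field-level rule on the subject-resolution endpoint reduces to a subset check against the purpose's allowlist. A minimal sketch, with an in-memory dictionary standing in for the purpose_definitions table; the `fill_validation` entry mirrors the document's example, while the `audit_subject_response` field list and the function name are assumptions:

```python
# Stand-in for rows of purpose_definitions (purpose_token -> allowed_fields).
PURPOSE_DEFINITIONS = {
    "fill_validation": {"name"},                             # from the schema comment
    "audit_subject_response": {"name", "email", "phone"},    # assumed for illustration
}


def authorize_fields(purpose_token, requested_fields):
    """Return an HTTP status: 200 if every requested field is allowed, else 403."""
    allowed = PURPOSE_DEFINITIONS.get(purpose_token)
    if allowed is None:
        return 403  # unknown purpose: deny
    if not requested_fields or not set(requested_fields) <= allowed:
        return 403  # empty request or any field outside the allowlist: deny
    return 200
```

For example, `authorize_fields("fill_validation", ["name", "ssn"])` is denied because `ssn` is outside the allowlist, even though `name` alone would be accepted.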

2.4 — Auth model (v2 — split-secret + short-lived)

Service-tier auth (gateway → identityd for routine PII resolution):

  • mTLS client cert per gateway (Rust :3100 + Go :4110 each get their own cert)
  • Bearer token in Authorization header (long-lived — 90d), rotated per ops runbook

Legal-tier auth (per-request, short-lived JWT, dual-control issuance):

  • J holds private key A
  • Designated outside counsel holds private key B
  • POST /v1/identity/auth/legal_token requires:
    • HTTP body: {purpose, ttl_seconds, requested_fields, signature_a, signature_b, nonce_a, nonce_b}
    • identityd verifies both signatures against pre-registered A and B public keys
    • On success: emits a JWT signed by identityd, valid for ttl_seconds (max 24h), scoped to purpose+fields
  • The JWT:
    • Cannot be issued without BOTH A and B signatures (per gemini "split-secret startup ceremony" pattern, applied per-token)
    • Carries scope (purpose + fields) inside its claims; identityd enforces at use time
    • Per-token rate limit + daily cap recorded on issuance
    • Revocation: token_hash added to token_revocations; identityd refuses on next check (60s cache TTL)

Why this matters:

  • v1's "static file mode 0400" was security theater per all 3 reviewers
  • A leaked v1 legal token = unbounded exfiltration window until rotation
  • A leaked v2 JWT = bounded by TTL (max 24h) AND revocable in <60s
  • Single-actor compromise (J's machine OR counsel's machine, not both) cannot mint legal tokens
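
The issuance checks described above can be sketched as a single validation step. This is an illustration under stated assumptions: `verify_a` and `verify_b` are hypothetical callables standing in for signature verification against the pre-registered A and B public keys, and the exact message each party signs is not specified in this document, so the sketch simply verifies each nonce against its signature.

```python
MAX_TTL_SECONDS = 24 * 60 * 60  # "max 24h" from the design


def validate_legal_token_request(req, verify_a, verify_b):
    """Return the scope to embed in the JWT claims, or raise on any failed check."""
    ttl = req["ttl_seconds"]
    if not isinstance(ttl, int) or not 0 < ttl <= MAX_TTL_SECONDS:
        raise ValueError("ttl_seconds must be between 1 and 86400")
    # Dual control: BOTH key holders must have signed; one valid signature is not enough.
    ok_a = verify_a(req["nonce_a"], req["signature_a"])
    ok_b = verify_b(req["nonce_b"], req["signature_b"])
    if not (ok_a and ok_b):
        raise PermissionError("both key-holder signatures are required")
    return {
        "purpose": req["purpose"],
        "fields": sorted(req["requested_fields"]),
        "ttl_seconds": ttl,
    }
```

JWT signing itself is omitted. The point of the sketch is that the TTL ceiling and the two-signature requirement are enforced before any token exists.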

3. Integration with the rest of the substrate

3.1 — Gateway changes

When gateway needs PII for a fill scenario:

  1. Gateway has only candidate_id (post-Phase-1.5 view-routing fix)
  2. Gateway calls GET /v1/identity/subjects/{candidate_id}/vertical first if it might be healthcare-vertical-sensitive routing
  3. Gateway calls GET /v1/identity/subjects/{candidate_id}?purpose=fill_validation&fields=name
  4. identityd validates: purpose_token=fill_validation allows [name] only; if request asks for [name,ssn] → 403
  5. identityd writes pii_access_log row, decrypts requested fields, returns
  6. Gateway uses fields, then explicitly zeros memory (kimi's drop-and-overwrite pattern) before returning HTTP response
  7. No PII caching in gateway memory beyond request lifetime

LRU embed cache (commit 150cc3b) currently keys by (model, text) where text contains PII. Re-keying it as (model, candidate_id, field_subset_hash) is a Phase 4 task. Phase 2 must still add the synchronous cache-purge hook, so that the Phase 2 erase endpoint invalidates the re-keyed cache as soon as Phase 4 ships (opus finding).

3.2 — Cross-runtime: same identityd, both gateways call it

Both gateways call identityd over mTLS. Same endpoints, same auth model. A new cross-runtime parity probe, audit_parity.sh, validates that an identical PII request through both gateways produces identical access-log rows (modulo daemon-name field).

3.3 — JSONL writer changes (subject_id top-level promotion)

Per AUDIT_PHASE_1_DISCOVERY §10/C5:

  • All JSONL sinks (outcomes, sessions, overseer_corrections, observerd ops) gain top-level subject_ids: [...] field
  • fills[*].name replaced with name_ref: "[REDACTED-{candidate_id}]"
  • Authorized callers dereference via identityd
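
The writer change above can be sketched as a per-record transform. The record shape is an assumption: this document only says that `fills[*].name` exists, so the sketch assumes each fill also carries a `candidate_id`; the function name is hypothetical.

```python
def promote_subject_ids(record):
    """Return a copy with names replaced by references and subject_ids at top level."""
    out = dict(record)
    fills, subject_ids = [], []
    for fill in record.get("fills", []):
        fill = dict(fill)               # copy so the caller's record is not mutated
        cid = fill["candidate_id"]      # assumed field on every fill
        fill.pop("name", None)          # drop the raw PII value
        fill["name_ref"] = f"[REDACTED-{cid}]"
        if cid not in subject_ids:
            subject_ids.append(cid)     # de-duplicate, preserving first-seen order
        fills.append(fill)
    out["fills"] = fills
    out["subject_ids"] = subject_ids
    return out
```

After the transform, a sink line can be filtered by subject with a top-level lookup, and the only way back to the name is a logged identityd call.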

3.4 — Langfuse boundary redaction (defense in depth, per opus TOCTOU finding)

v1 design: per-request map of subject_id → resolved_PII, replace before Langfuse POST.

v2 adds (per opus): outbound regex/NER pass on the Langfuse payload as defense-in-depth. The model can hallucinate names not in the resolved-PII map. The outbound NER pass catches them.

Gateway constructs Langfuse payload
  → Pass 1: per-request resolved-PII map replacement
  → Pass 2: NER scan (regex for SSN/phone shapes; named-entity model for names)
  → Either pass detecting unredacted PII → drop the trace OR replace with [POSSIBLE-PII-DETECTED]
  → POST to Langfuse
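
The two passes can be sketched on a plain-text payload. The regular expressions below are illustrative shapes for US SSNs and phone numbers, not a vetted detector, and the named-entity model for names is omitted because it needs a library outside the standard library. This sketch takes the "replace" branch; dropping the trace is the other option the design allows.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")              # assumed SSN shape
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")  # assumed phone shape


def redact_payload(text, resolved_pii):
    """resolved_pii maps candidate_id -> list of PII strings resolved during this request."""
    # Pass 1: replace every value the gateway knows it resolved.
    for candidate_id, values in resolved_pii.items():
        for value in values:
            if value:
                text = text.replace(value, f"[REDACTED-{candidate_id}]")
    # Pass 2: catch PII-shaped strings that were never in the map.
    for pattern in (SSN_RE, PHONE_RE):
        text = pattern.sub("[POSSIBLE-PII-DETECTED]", text)
    return text
```

Pass 2 is what covers the hallucination case: a number the model invented is not in `resolved_pii`, but it still has a recognisable shape.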

3.5 — Healthcare vertical routing — fail-closed default

Per opus + gemini scrum: Default is unknown → treated as healthcare for routing purposes.

When gateway receives a request involving any candidate:

  1. Query /vertical for the candidate
  2. If vertical IN ('healthcare', 'unknown') → route to on-box Ollama only (no opencode/openrouter/ollama_cloud egress)
  3. If vertical='general' (or other non-healthcare) → cloud routing OK
  4. If identityd unreachable → fail closed (refuse the request, return 503)

Reclassification path: subjects can be moved from unknown → general via explicit operator action (after manual review). Subjects flagged healthcare stay healthcare unless explicitly downgraded.

Auto-escalation (per kimi): if a subject's resume_text or call_log content is updated with healthcare-pattern matches (RN, BSN, hospital, MD, physician, etc.), vertical auto-escalates to healthcare. Never auto-de-escalates.
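
The routing rule and the one-way escalation can be sketched together. The pattern list is taken from the examples in the paragraph above and is not exhaustive; the function names and return labels are assumptions. Credential abbreviations are matched case-sensitively in this sketch to limit false positives, which is a design choice of the example rather than something this document specifies.

```python
import re

# Abbreviations are case-sensitive; the words are case-insensitive via a scoped flag.
HEALTHCARE_RE = re.compile(r"\b(?:RN|BSN|MD|(?i:hospital|physician))\b")


def route(vertical):
    """vertical is None when identityd could not be reached."""
    if vertical is None:
        return "refuse_503"      # fail closed
    if vertical in ("healthcare", "unknown"):
        return "on_box_only"     # no cloud egress
    return "cloud_ok"


def escalate(current_vertical, text):
    """Escalate to healthcare on a pattern match; never de-escalate."""
    if HEALTHCARE_RE.search(text):
        return "healthcare"
    return current_vertical
```

Because `escalate` can only return the current value or `healthcare`, a later resume edit that removes the matching words leaves the subject's classification unchanged.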

3.6 — Cache invalidation hook (Phase 2 must ship even though full RTBF is Phase 7)

When POST /v1/identity/subjects/{id}/erase fires:

  1. Mark subjects.erased_at and zero out ciphertext columns
  2. Write pii_access_log row with purpose_token='retention_expired' or 'rtbf_request'
  3. Synchronously publish a subject_erased event to a pub/sub channel (Redis or Postgres LISTEN/NOTIFY)
  4. Gateway subscribes; on event, purges any in-flight cache entries for that candidate_id
  5. Eventual-consistency window: <5s between erase-call and cache-flush

Without this hook, RTBF in Phase 7 would be a lie because the gateway's LRU embed cache could still hold the subject's data.
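
The hook can be sketched with an in-process stand-in for the pub/sub channel. In the real design the channel would be Redis or Postgres LISTEN/NOTIFY; the class and method names here are hypothetical, and the cache is keyed as (model, candidate_id, field_subset_hash) so that entries can be found by subject.

```python
import hashlib


class EmbedCache:
    """Toy cache keyed by (model, candidate_id, field_subset_hash)."""

    def __init__(self):
        self._entries = {}

    def put(self, model, candidate_id, fields, value):
        subset = hashlib.sha256(",".join(sorted(fields)).encode()).hexdigest()
        self._entries[(model, candidate_id, subset)] = value

    def purge_subject(self, candidate_id):
        """Drop every entry for one subject; returns how many were removed."""
        stale = [key for key in self._entries if key[1] == candidate_id]
        for key in stale:
            del self._entries[key]
        return len(stale)

    def __len__(self):
        return len(self._entries)


class ErasureBus:
    """In-process stand-in for the subject_erased channel."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish_subject_erased(self, candidate_id):
        # Handlers run before publish returns, so the purge is synchronous here.
        for handler in self._subscribers:
            handler(candidate_id)
```

Wiring is a single line, `bus.subscribe(cache.purge_subject)`. A purge by subject is only possible because the key contains the candidate_id, which is why the (model, text) keying has to go.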


4. Audit response shape (Phase 3 preview, v2 — adds model-version snapshot per gemini)

{
  "schema": "audit.subject.v1",
  "subject_token": "01926f2e-7c1b-7000-...",     // UUID v7, NOT CAND-NNNNNN
  "request_window": { "from": "2026-01-01", "to": "2026-05-03" },
  "generated_at": "2026-05-03T12:00:00Z",
  "generated_by": "identityd@hostname",
  "merkle_root": "sha256:...",                    // root hash of legal/consent/erasure events
  "external_anchor": "s3://anchor-bucket/identityd/2026-05-03T12-00.json",  // S3 Object Lock URI
  "signature": "ed25519:...",                     // separate signing key from encryption KEK
  "consent": { ... },
  "decisions": [
    {
      "ts": "2026-04-22T09:15:23Z",
      "decision_kind": "fill_recommendation",
      "daemon": "gateway",
      "model": "kimi-k2.6",
      "model_version_hash": "sha256:...",         // NEW per gemini — proves what model existed AT decision time
      "provider": "ollama_cloud",
      "trace_id": "trace-abc",
      "session_id": "session-xyz",
      "input_features": { ... },                  // sanitized; no protected attributes; no inferred-attribute proxies
      "output": "...",
      "rationale": "...",
      "comparator_pool_size": 47,
      "comparator_appendix_ref": "see comparator_appendix.A"
    }
  ],
  "comparator_appendix": {
    "A": {
      "scope": "fill scenarios in window matching role X, geo Y",
      "total_pool_size": 47,
      "selection_rate_by_protected_class": {
        // Aggregated; NO other subjects' identifiers leak
        "race": { "white": 0.33, "black": 0.31, ... },
        "gender": { "man": 0.34, "woman": 0.30, ... }
      },
      "four_fifths_test": "passed"  // or "concern: rate ratio 0.71"
    }
  },
  "access_log": [...],
  "footer": {
    "completeness_attestation": "all decisions about subject_token in window per retention policy v2 are included",
    "merkle_proof": "...",                        // proof this audit's root is in S3 anchor
    "what_was_excluded": "decisions older than 4 years (retention expired) — count: 0"
  }
}
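
The `four_fifths_test` field in the comparator appendix can be derived from the aggregated selection rates. Under the four-fifths rule of thumb, the lowest group selection rate should be at least 80% of the highest. A minimal sketch; the function name and the handling of an all-zero pool are assumptions:

```python
def four_fifths_test(selection_rates):
    """selection_rates maps group label -> selection rate in [0, 1]."""
    highest = max(selection_rates.values())
    if highest == 0:
        return "not applicable: no selections"  # assumed wording for an empty outcome
    ratio = min(selection_rates.values()) / highest
    if ratio >= 0.8:
        return "passed"
    return f"concern: rate ratio {ratio:.2f}"
```

With the illustrative rates from the response above (0.33 and 0.31), the ratio is 0.31 / 0.33 ≈ 0.94, which clears the 0.8 threshold.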

5. Migration path (v2 — REORDERED per all 3 reviewers)

v1's order was wrong. New order, with explicit prerequisites:

| Step | Action | Prerequisite |
| --- | --- | --- |
| 0 | Phase 1.6 BIPA pre-launch gates SHIPPED: consent template published, retention schedule public, deletion procedure documented, employee training acknowledged. | (separate work; see PHASE_1_6_BIPA_GATES.md when written) |
| 1 | Stand up identityd with synthetic test subjects only. KEK in vault, mTLS live, all endpoints serve. | Vault/KMS available; mTLS CA bootstrapped |
| 2 | Audit-parity probe (audit_parity.sh) green on synthetic data. Cross-runtime equivalence verified. | Step 1 |
| 3 | Add gateway feature flag LH_USE_IDENTITY_SERVICE. Shadow-write only: gateway writes new subjects to identityd AND continues SQL path. No reads from identityd yet. | Step 2 |
| 4 | Run shadow-write for ≥1 week. Validate access logs, encryption correctness, write-path performance under real traffic. | Step 3 |
| 5 | Backfill from workers_500k.parquet. consent_status='pending_backfill_review' for ALL existing rows. vertical='unknown' default (NOT 'general'). biometric_consent_status='never_collected'; backfill does NOT include any biometric data. | Step 0 (BIPA gates) + Step 4 |
| 6 | Human-review queue for vertical reclassification. Subjects move from unknown to general only after explicit review. Healthcare-pattern matches auto-escalate to healthcare (never auto-downgrade). | Step 5 |
| 7 | Shadow-read: gateway reads from BOTH SQL and identityd, compares, logs divergences, returns SQL result. Run ≥1 week. | Step 5 |
| 8 | Feature-flag cutover: gateway reads from identityd, falls back to SQL on error, alerts on every fallback. | Step 7 |
| 9 | Quarantine PII columns in workers_500k.parquet, only after cryptographic attestation of identityd completeness (Merkle proof: source row hash = identityd row hash for every candidate_id). Move PII columns to a different bucket; the candidate_id-only projection becomes operational. | Step 8 |
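
Step 7's shadow-read can be sketched as a wrapper that always returns the SQL result and only records disagreement. The reader and logger callables are hypothetical; the sketch logs the names of differing fields, never their values, so that the divergence log does not become a new PII store.

```python
def shadow_read(candidate_id, read_sql, read_identityd, log_divergence):
    """Return the SQL row; compare it against identityd purely as a side effect."""
    primary = read_sql(candidate_id)
    try:
        shadow = read_identityd(candidate_id)
    except Exception as exc:
        # A failing shadow path must never affect the production answer.
        log_divergence(candidate_id, f"identityd error: {type(exc).__name__}")
        return primary
    if shadow != primary:
        differing = sorted(
            key for key in set(primary) | set(shadow)
            if primary.get(key) != shadow.get(key)
        )
        log_divergence(candidate_id, "fields differ: " + ",".join(differing))
    return primary
```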

Key v2 changes from v1:

  • New Step 0 (BIPA gates prerequisite) — backfill cannot start without this
  • Steps 3-4 (shadow-write before backfill) — production validates the WRITE path before data lands
  • Step 5 backfill consent: pending_backfill_review not inferred_existing (BIPA defense)
  • Step 5 vertical default: unknown not general (HIPAA fail-closed)
  • New Step 6 (human review) — vertical classification via explicit operator action
  • Step 7 added (shadow-read after backfill) — catches encoding/normalization bugs before cutover
  • Step 9 (quarantine) requires cryptographic attestation of completeness

Each step is its own commit, its own gate, its own rollback path.
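
The equality that Step 9's attestation rests on (source row hash = identityd row hash for every candidate_id) can be sketched as below. This shows only the per-row comparison that a Merkle proof would then summarise. The canonical-JSON encoding and the function names are assumptions, and in practice the identityd side would be hashed after decryption inside the service boundary.

```python
import hashlib
import json


def row_hash(row):
    """Hash a PII row in a canonical form so field order cannot cause a false mismatch."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def attest_completeness(source_rows, identityd_rows):
    """Both arguments map candidate_id -> PII dict. Returns (ok, mismatched_ids)."""
    mismatched = sorted(
        cid for cid, row in source_rows.items()
        if cid not in identityd_rows or row_hash(row) != row_hash(identityd_rows[cid])
    )
    return (not mismatched, mismatched)
```

Rows that exist only in identityd (for example, subjects created during shadow-write) do not fail the check, since the attestation is about the source being fully covered.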


6. Cross-runtime parity probe (NEW, per scrum §6 plus kimi 'oracle test' addition)

audit_parity.sh ships in Phase 5. Asserts:

  1. Same PII fetch through Rust gateway and Go gateway produces identical identityd access-log rows (modulo daemon name)
  2. Crypto-erasure of test subject through Rust gateway is honored when Go gateway tries to fetch
  3. Healthcare-vertical routing decision identical across both runtimes
  4. Oracle test (per kimi SOC2 CC4.1 finding): probe with known-good inputs against expected outputs. A bug present in BOTH implementations must not pass the parity probe.
  5. Discrimination-proxy phrase redaction triggers as expected on adversarial test cases
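
Assertions 1 and 4 can be sketched together: the two runtimes must agree with each other and with an independently written expectation, so a bug shared by both still fails. The probe itself is a shell script; this sketch only shows the comparison logic. The set of ignored columns is an assumption: the document says "modulo daemon name", and the sketch also ignores columns that necessarily differ between two separate requests.

```python
# Columns expected to differ between two otherwise identical requests (assumed set).
VOLATILE = {"access_id", "accessed_at", "accessor_id", "request_trace_id", "integrity_hash"}


def comparable(row):
    """Strip volatile columns from a pii_access_log row."""
    return {key: value for key, value in row.items() if key not in VOLATILE}


def parity_check(rust_row, go_row, expected):
    """True only if both runtimes agree AND match the oracle's expected row."""
    rust, go = comparable(rust_row), comparable(go_row)
    return rust == go and rust == comparable(expected)
```

Plain parity (`rust == go`) would pass if both gateways logged the same wrong field list; comparing against `expected` is what catches that case.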

7. What this design intentionally does NOT solve

  • Protected-attribute exclusion at decision time (Phase 6)
  • Pathway memory trace body redaction (Phase 4)
  • Retroactive Langfuse history scrub (separate runbook)
  • GDPR Art. 22 right to explanation full implementation (Phase 8)
  • Multi-region data residency (single-region US-Midwest by default)
  • Training-safe export (deferred — not on critical path per opus)

8. Open questions — RESOLVED 2026-05-03

J confirmed all 6 v1 recommendations + scrum-driven changes:

  1. Master key: HashiCorp Vault Transit (recommended) OR AWS KMS. NOT a sealed file. v1's sealed-file recommendation is rejected per all 3 reviewers ("obfuscation in any defensible sense").
  2. Postgres isolation: identityd's own database, isolated schema.
  3. Vertical backfill: 'unknown' default with fail-closed routing. NOT 'general' per opus+gemini scrum.
  4. Legal token: Split-secret dual-control issuance (J + counsel both sign). Short-lived JWT, max 24h, revocable in <60s.
  5. Crypto-erasure sweep: Daily 03:00 UTC. (Note: v2 erasure mechanism is ciphertext deletion, not key destruction. Per-row keys deferred to Phase 7.)
  6. EU enforcement: Per-subject. Schema fields exist; nothing runs until first eu_resident=true subject.

Newly resolved per scrum

  1. Migration order: REORDERED per §5 above. New Step 0 (BIPA prerequisite). Shadow-write before backfill. Shadow-read before cutover.
  2. Audit-log external anchor: S3 Object Lock with compliance-mode 7-year retention. Hourly + on-event commits.
  3. Audit-log signing key: Separate Ed25519 keypair from KEK. Vault Transit signing backend OR sealed-secret with strict rotation runbook.
  4. mTLS gateway↔identityd: Mandatory. Self-signed CA managed by identityd at startup.
  5. Per-row encryption keys: Deferred to Phase 7. v2 uses single DEK with HSM-wrapped KEK + ciphertext deletion for erasure.
  6. Field-level authorization: purpose_definitions table enforces per-purpose field allowlists.
  7. Synchronous cache invalidation: pub/sub event on erase; gateway subscribes.
  8. Outbound NER pass for Langfuse: Defense-in-depth in addition to symbol-table replacement.
  9. Model version hash in audit response: Captured per decision row.
  10. PDF render: Deferred from Phase 2; JSON ships first.

9. Estimated implementation cost (revised v2)

| Sub-phase | Effort | Notes |
| --- | --- | --- |
| 2A: Postgres schema + Vault/KMS integration | 1.5-2 days | Includes mTLS CA bootstrap + signing-key separation |
| 2B: identityd HTTP surface (Go) | 2-3 days | All endpoints, auth, dual-control JWT issuance, rate limiting, revocation |
| 2C: Backfill ETL (BIPA-compliant, pending_backfill_review) | 1 day | Plus Merkle attestation of completeness |
| 2D: Gateway integration (Rust + Go, mTLS, shadow-write phase) | 2-3 days | Per-tool migration, parity probe |
| 2E: JSONL writer changes (subject_id top-level promotion) | 1 day | All sinks |
| 2F: Langfuse redaction (symbol-table + NER) | 2 days | Two passes + drop-on-detect |
| 2G: Healthcare-vertical routing (fail-closed) | 0.5 day | Plus auto-escalation pattern matcher |
| 2H: Cache invalidation pub/sub hook | 0.5 day | Critical for Phase 7 RTBF |
| 2I: External anchor (S3 Object Lock) | 1 day | Hourly + on-event commits |
| 2J: Cross-runtime parity probe with oracle tests | 1 day | New probe |
| Total | ~12-15 working days | v1 estimated 8-10; v2 added BIPA gates dependency, mTLS, dual-control JWT, NER pass, anchor, etc. |

Bigger than v1. Worth it: the four blockers were 3/3-reviewer convergent findings, and each remaining addition closes a gap raised by at least one reviewer.


10. The four "would not build" blockers from scrum (all addressed in v2)

| # | v1 issue (3/3 reviewers) | v2 resolution |
| --- | --- | --- |
| 1 | Migration order (backfill before validation) | §5 reordered; Step 0 BIPA gates prereq added |
| 2 | Master key on disk + legal token static file | §2.1 Vault/KMS for KEK; §2.4 split-secret short-lived JWT |
| 3 | inferred_existing BIPA prima facie violation | §5 Step 5 uses pending_backfill_review; Step 0 gates required first |
| 4 | Healthcare default general (HIPAA exposure window) | §3.5 fail-closed unknown-as-healthcare; §5 Step 5 backfill default unknown |

All three reviewers said "I would not build v1 as written." All four blockers are resolved in v2. Re-scrum recommended before implementation starts.


11. Change log (v1 → v2)

| Section | v1 | v2 |
| --- | --- | --- |
| Master key storage | Sealed file /etc/lakehouse/identityd_master.key | HashiCorp Vault Transit / AWS KMS |
| Legal token | Static file mode 0400 | Split-secret short-lived JWT (max 24h), dual-control issuance |
| Encryption keys | Per-row keys with master wrapping | Single DEK + HSM-wrapped KEK + ciphertext deletion (per-row deferred to Phase 7) |
| Healthcare default | vertical='general' backfill | vertical='unknown' backfill, fail-closed routing |
| Migration §5 order | 1: stand up, 2: backfill, 3: feature flag, 4: cutover, 5: quarantine | 0: BIPA gates, 1: stand up, 2: parity probe, 3: shadow-write, 4: shadow-write soak, 5: backfill (pending_backfill_review), 6: human vertical review, 7: shadow-read, 8: cutover, 9: quarantine with cryptographic attestation |
| Audit-log integrity | Postgres Merkle chain only | Postgres Merkle for legal/consent/erasure + HMAC for standard + S3 Object Lock external anchor |
| Audit-log signing key | Same as encryption key | Separate Ed25519 keypair; Vault signing backend |
| Gateway↔identityd transport | Plain HTTP (implied) | mTLS mandatory |
| Field authorization | Per-endpoint only | Per-purpose-token field allowlists |
| Cache invalidation | Phase 4 / Phase 7 | Phase 2 ships pub/sub hook |
| Langfuse redaction | Symbol-table replacement only | Symbol-table + outbound NER pass + drop-on-detect |
| Audit response | No model version | model_version_hash per decision |
| PDF render | Phase 2 | Deferred; JSON ships first |
| Training-safe export | Phase 2 | Deferred; not critical path |
| Effort estimate | 8-10 days | 12-15 days |

Change log

  • 2026-05-03 — v1 initial draft.
  • 2026-05-03 — v2 post-scrum: 3/3 reviewer convergent findings folded in. 4 "would not build" blockers all resolved. Re-scrum before implementation recommended.