# Identity Service — Phase 2 Design

**Status:** Draft — 2026-05-03 · **Owner:** J · **Drafted by:** working session 2026-05-03

**Companion to:** [`AUDIT_TRAIL_PRD.md`](AUDIT_TRAIL_PRD.md), [`AUDIT_PHASE_1_DISCOVERY.md`](AUDIT_PHASE_1_DISCOVERY.md), [`AUDIT_PHASE_1_5_BIPA_AND_OUTCOMES.md`](AUDIT_PHASE_1_5_BIPA_AND_OUTCOMES.md)

> **Why this exists.** Phase 1 + 1.5 confirmed that today's substrate has no separation between candidate_id (the canonical token) and PII (name, email, phone, address). Both live in `workers_500k.parquet`. There is no per-access audit. There is no consent gate. There is no retention enforcement. This document specifies the new identity service that will hold the candidate_id ↔ PII mapping, gate every PII read, audit every access, and serve as the single legal-attestable boundary between PII and the rest of the system.
>
> **Confirmed by J 2026-05-03:** separate daemon (option A in §10.1), signed JSON with PDF render for legal export, legal-only auth credential separate from admin token.

---

## 1. Scope and non-goals

### In scope

- Single source of truth for `candidate_id ↔ PII` mapping
- Per-PII-access audit log (who/what/when/why)
- Consent + retention metadata (BIPA + Day and Temporary Labor Services Act + healthcare PHI)
- Legal-only access credential, separate from admin tokens
- Healthcare-vertical detection at gateway boundary (per J 2026-05-03 answer 10)
- EU-compatible interface (placeholder fields, lawful-basis tracking, SCC-ready slots — but NOT enforced this phase per J)
- Training-safe export interface (per J 2026-05-03 answer 11)
- Signed-JSON audit response with PDF render path

### Out of scope (for this phase)

- The `/audit/subject/{id}` endpoint itself (Phase 3)
- Subject-tagging across other substrates (Phase 4)
- Right-to-be-forgotten implementation (Phase 7)
- BIPA pre-launch gates — those are Phase 1.6, ahead of this phase

---

## 2. Architectural shape

### 2.1 — Process model: separate daemon

Per J's confirmation (2026-05-03), the identity service runs as its own daemon, port-isolated from the gateway. Rationale:

- **Single attestable boundary** for legal/audit. "All PII access flows through identityd. Show me the identityd access log" is one query, one daemon.
- **Independent restart** — a gateway crash doesn't take down identity, and an identity panic doesn't break unrelated reads.
- **Distinct credential surface** — identityd's auth model is wholly separate from gateway's. The legal-only credential exists only in identityd, not in the gateway's JWT issuer.
- **Cross-runtime parity** — both Rust and Go gateway call identityd over HTTP. There is ONE identity implementation.

| Property | Value |
|---|---|
| Name | `identityd` |
| Port | `:3225` (Rust legacy line — picks a port adjacent to validatord :3221) and `:4225` (Go line) |
| Implementation language | **Go** — single implementation, both runtimes call it via HTTP. Avoids re-implementing the audit-log writer + retention sweeper twice. |
| Storage | Postgres (separate database from any other lakehouse storage). Deployed alongside Langfuse's Postgres or its own; either way, isolated schema with its own grants. |
| Encryption | Per-row symmetric encryption (AES-256-GCM) of PII columns. Master key in a vault (HashiCorp Vault, AWS KMS, or a sealed-secret file at `/etc/lakehouse/identityd_master.key` for now). Keys are NEVER logged. |
| Backup | Standard Postgres backup; keys backed up separately to different storage tier (the cryptographic-erasure model in `AUDIT_TRAIL_PRD.md` §6 only works if the encrypted-blob backup and the key-backup are not co-located). |

### 2.2 — Schema (Postgres DDL sketch)

```sql
-- Single source of truth for the candidate_id ↔ PII mapping
-- Every PII column is stored as ciphertext; keys per row enable
-- per-subject crypto-erasure for RTBF.
CREATE TABLE subjects (
    candidate_id              TEXT PRIMARY KEY,  -- canonical token, e.g. "CAND-000001"

    -- Encrypted PII fields. Each is AES-256-GCM with subject_key_id below.
    -- Plaintext is NEVER stored. NULL means "not collected" not "absent."
    name_ct                   BYTEA,
    email_ct                  BYTEA,
    phone_ct                  BYTEA,
    address_ct                BYTEA,
    ssn_ct                    BYTEA,
    dob_ct                    BYTEA,

    -- Per-subject encryption key id. Crypto-erasure path: destroy this key
    -- and the ciphertext is unrecoverable, even with the master key.
    subject_key_id            TEXT NOT NULL,

    -- Lawful basis + consent metadata
    consent_status            TEXT NOT NULL,     -- 'pending' | 'given' | 'withdrawn' | 'expired' | 'inferred_existing' (backfill placeholder, see §5)
    consent_version           TEXT,              -- references published consent template version
    consent_given_at          TIMESTAMPTZ,
    consent_withdrawn_at      TIMESTAMPTZ,

    -- BIPA-specific fields (per Phase 1.5 §1E)
    biometric_consent_status  TEXT,              -- separate from general PII consent
    biometric_retention_until TIMESTAMPTZ,       -- BIPA: max 3 years from last interaction

    -- Vertical detection — drives healthcare PHI routing (per J answer 10)
    vertical                  TEXT,              -- 'general' | 'healthcare' | 'finance' | 'other'

    -- EU-placeholder fields (per J answer 9 — present, not enforced)
    eu_resident               BOOLEAN DEFAULT FALSE,
    lawful_basis              TEXT,              -- GDPR Art. 6 basis if eu_resident=true
    transfer_mechanism        TEXT,              -- SCC, DPF, BCR — populated when EU comes online

    -- Standard audit columns
    created_at                TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at                TIMESTAMPTZ NOT NULL DEFAULT now(),
    last_interaction          TIMESTAMPTZ,       -- drives retention sweep

    -- RTBF state
    erased_at                 TIMESTAMPTZ,       -- set when crypto-erasure executed
    erasure_reason            TEXT               -- 'rtbf_request' | 'retention_expired' | 'consent_withdrawn'
);

-- Append-only access audit. EVERY PII read writes a row here.
CREATE TABLE pii_access_log (
    access_id        BIGSERIAL PRIMARY KEY,
    candidate_id     TEXT NOT NULL,
    accessed_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    accessor_kind    TEXT NOT NULL,    -- 'gateway_lookup' | 'audit_response' | 'legal_request' | 'system_resolve'
    accessor_id      TEXT NOT NULL,    -- daemon name + caller token-hash, never raw token
    purpose          TEXT NOT NULL,    -- 'fill_validation' | 'audit_subject_response' | 'admin' | 'legal_audit_DDDDDD'
    fields_accessed  TEXT[] NOT NULL,  -- ['name', 'email'] etc.
    request_trace_id TEXT,             -- ties to Langfuse trace + sessions.jsonl
    integrity_hash   TEXT NOT NULL     -- chain hash for tamper-evidence (this row's hash includes prev row's hash)
);

-- Cryptographic chain for the access log — Merkle-style. Per kimi
-- single-reviewer flag: chain of custody under FRE 901.
-- Each row's integrity_hash = SHA256(prev_hash || row_payload).
-- Last hash periodically committed to a tamper-evident store
-- (could be a separate append-only file with timestamp signing).

-- Per-subject keys table — crypto-erasure target.
-- Destroying a row here makes the corresponding subjects.*_ct unreadable.
CREATE TABLE subject_keys (
    subject_key_id   TEXT PRIMARY KEY,
    candidate_id     TEXT NOT NULL,
    key_material     BYTEA NOT NULL,   -- AES-256 key, encrypted under master key
    created_at       TIMESTAMPTZ NOT NULL,
    destroyed_at     TIMESTAMPTZ,      -- crypto-erasure marker
    destroyed_reason TEXT
);

-- Consent template versioning — BIPA + GDPR + CCPA compliance evidence
CREATE TABLE consent_versions (
    version            TEXT PRIMARY KEY,
    effective_at       TIMESTAMPTZ NOT NULL,
    superseded_at      TIMESTAMPTZ,
    template_text      TEXT NOT NULL,
    biometric_section  TEXT,           -- BIPA-specific clause
    healthcare_section TEXT,           -- HIPAA-specific clause
    eu_section         TEXT            -- GDPR-specific clause (placeholder)
);
```

### 2.3 — HTTP surface

Identityd exposes a small HTTP surface, all under `/v1/identity/`:

| Method + Path | Purpose | Auth |
|---|---|---|
| `POST /v1/identity/subjects` | Create a new subject. Body: PII fields. Returns: candidate_id (server-generated, NOT sequential — UUID v7 to avoid the kimi enumeration risk). | Gateway/admin token |
| `GET /v1/identity/subjects/{candidate_id}` | Resolve PII for a candidate. Returns: requested fields only. EVERY call writes a `pii_access_log` row. | **Service-tier auth** — gateway can call, but the request must state the accessor purpose |
| `GET /v1/identity/subjects/{candidate_id}/full` | Return the complete subject record including consent + retention metadata + audit summary. **Legal-only credential.** Used by audit-response endpoint. | **Legal-only token** — separate credential, separate rotation |
| `POST /v1/identity/subjects/{candidate_id}/consent` | Record consent given/withdrawn, with version. | Gateway/admin token |
| `POST /v1/identity/subjects/{candidate_id}/erase` | Crypto-erasure: destroy the subject_key_id, mark erased_at. Idempotent. | Legal-only token |
| `GET /v1/identity/access_log/{candidate_id}` | Return the per-subject access log for audit response. | Legal-only token |
| `POST /v1/identity/training_safe_export` | Returns identifier-stripped + name-redacted projection of subjects suitable for RAG/training. Logs a "system_resolve" access log row marking export. | Admin token + explicit purpose flag |
| `GET /v1/identity/health` | Liveness | None |

### 2.4 — Auth model: legal-only credential

The legal-only credential is materially different from the admin/gateway token:

- Stored at `/etc/lakehouse/identityd_legal.token` (mode 0400, owner-only)
- Loaded by identityd at startup via systemd `EnvironmentFile`
- Never logged, never returned in any API response, never crossed with gateway tokens
- Rotation: separate runbook. Triggered by counsel request OR scheduled annually.
- The token's existence is documented in the privacy policy ("legal access requires a separate operator-issued credential, audited per access").
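
The credential split in the §2.3 Auth column can be sketched as a small path classifier plus a constant-time token check. This is a minimal illustration, not identityd's actual code: the names `Tier`, `Credentials`, `requiredTier`, `tokenMatches`, and `authorize` are hypothetical. It also assumes the two tiers are not a hierarchy, so the legal token does not double as a service token; the design only states the reverse direction.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
	"strings"
)

// Tier names the credential class an endpoint requires (hypothetical type).
type Tier int

const (
	TierNone    Tier = iota // no credential (health check)
	TierService             // gateway/admin token
	TierLegal               // operator-issued legal-only token
)

// Credentials holds the two secrets. Loading them from disk or the
// environment is left out of this sketch.
type Credentials struct {
	ServiceToken string
	LegalToken   string
}

// requiredTier classifies a request path following the Auth column in §2.3.
func requiredTier(path string) Tier {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) < 3 || parts[0] != "v1" || parts[1] != "identity" {
		return TierService // unknown paths still require a credential
	}
	switch {
	case len(parts) == 3 && parts[2] == "health":
		return TierNone
	case parts[2] == "access_log":
		return TierLegal
	case len(parts) == 5 && parts[2] == "subjects" &&
		(parts[4] == "full" || parts[4] == "erase"):
		return TierLegal
	}
	return TierService
}

// tokenMatches compares in constant time. Hashing first gives both
// inputs the same length, so timing does not reveal the token length.
func tokenMatches(presented, expected string) bool {
	if expected == "" {
		return false // an unset secret must never authorize anything
	}
	a := sha256.Sum256([]byte(presented))
	b := sha256.Sum256([]byte(expected))
	return subtle.ConstantTimeCompare(a[:], b[:]) == 1
}

// authorize checks the presented token against exactly one secret,
// chosen by the endpoint's tier. The tiers are not a hierarchy.
func authorize(path, presented string, c Credentials) bool {
	switch requiredTier(path) {
	case TierNone:
		return true
	case TierLegal:
		return tokenMatches(presented, c.LegalToken)
	default:
		return tokenMatches(presented, c.ServiceToken)
	}
}

func main() {
	c := Credentials{ServiceToken: "svc-example", LegalToken: "legal-example"}
	fmt.Println(authorize("/v1/identity/subjects/CAND-000001/full", "svc-example", c))
	fmt.Println(authorize("/v1/identity/subjects/CAND-000001/full", "legal-example", c))
	fmt.Println(authorize("/v1/identity/subjects/CAND-000001", "svc-example", c))
}
```

Matching on path segments rather than on a suffix keeps a subject whose id happens to be `full` or `erase` from being misclassified as a legal-tier route.
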

Service-tier auth (gateway-issued) and legal-tier auth (operator-issued) are orthogonal — a request must present the legal token to hit `/full`, `/erase`, or `/access_log`. Even an admin token does not unlock those.

---

## 3. Integration with the rest of the substrate

### 3.1 — Gateway changes

When the gateway needs PII for a fill scenario (today this happens by SQL JOIN on `candidates`/`workers_500k`), the new flow is:

1. Gateway has only `candidate_id` (post-§2 view-routing fix from `AUDIT_PHASE_1_DISCOVERY` §9)
2. Gateway calls `GET /v1/identity/subjects/{candidate_id}` with `purpose=fill_validation` and `fields=[name]` (or `[name,phone]` etc.)
3. Identityd writes pii_access_log row, decrypts the requested fields, returns them
4. Gateway uses the fields for the validator + tool result, then immediately drops them from in-memory storage after the request completes (no caching)

**Critical:** the LRU embed cache (commit `150cc3b`) currently keys by `(model, text)` where text contains PII. Post-identity-service, the cache keying must change to `(model, candidate_id, field_subset_hash)` so the cache key itself is not PII-bearing. This is a Phase 4 task tracked separately.

### 3.2 — Rust legacy + Go rewrite both call the same identityd

Both gateways (Rust :3100 and Go :4110) call identityd over HTTP. Same endpoints, same auth model. New cross-runtime parity probe `audit_parity.sh` validates that an identical PII request through both gateways produces identical identityd access-log rows (modulo daemon-name field).
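
The "identical rows modulo daemon name" comparison in §3.2 could be sketched as follows. Several things here are assumptions rather than part of the design: the `AccessLogRow` type, the `"<daemon>:<token-hash>"` layout of `accessor_id`, the choice to ignore field order, and the decision to leave `access_id`, `accessed_at`, and `integrity_hash` out of the comparison because they necessarily differ between two separate calls.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
	"strings"
)

// AccessLogRow holds the pii_access_log columns this sketch compares.
// access_id, accessed_at and integrity_hash are deliberately omitted.
type AccessLogRow struct {
	CandidateID    string
	AccessorKind   string
	AccessorID     string // assumed layout: "<daemon>:<token-hash>"
	Purpose        string
	FieldsAccessed []string
}

// normalize strips the daemon-name prefix and sorts the field list,
// working on a copy so the caller's row is left untouched.
func normalize(r AccessLogRow) AccessLogRow {
	out := r
	if i := strings.Index(r.AccessorID, ":"); i >= 0 {
		out.AccessorID = r.AccessorID[i+1:]
	}
	out.FieldsAccessed = append([]string(nil), r.FieldsAccessed...)
	sort.Strings(out.FieldsAccessed)
	return out
}

// rowsMatch reports whether two rows are identical once normalized.
func rowsMatch(a, b AccessLogRow) bool {
	return reflect.DeepEqual(normalize(a), normalize(b))
}

func main() {
	rust := AccessLogRow{
		CandidateID:    "CAND-000001",
		AccessorKind:   "gateway_lookup",
		AccessorID:     "gateway-rs:abc123",
		Purpose:        "fill_validation",
		FieldsAccessed: []string{"name", "phone"},
	}
	golang := rust
	golang.AccessorID = "gateway-go:abc123"
	fmt.Println(rowsMatch(rust, golang))
}
```

If the two gateways present different service tokens, the token-hash half of `accessor_id` will differ as well, and the probe would need to normalize that too.
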
### 3.3 — outcomes.jsonl + sessions.jsonl writer changes

Per `AUDIT_PHASE_1_DISCOVERY` §10/C5 (subject_id top-level promotion):

- Change `outcomes.jsonl` writer to:
  - Add top-level `subject_ids: ["CAND-000001", "CAND-000456"]` field listing every candidate referenced
  - Strip `name` from `fills[*]` rows; replace with `name_ref: "[REDACTED-{candidate_id}]"` token
  - Authorized callers dereference the token via identityd
- Same for sessions.jsonl SessionRecord: add `subject_ids` top-level field.
- Same for overseer_corrections.jsonl.
- Same for observerd ops.jsonl when written.

This makes every JSONL sink subject-queryable by `subject_id` directly, without grepping natural language.

### 3.4 — Langfuse boundary redaction (per scrum priority C2)

Before the gateway POSTs a chat trace to Langfuse, identityd is consulted to map any PII-shaped substrings in the message array back to candidate_id tokens. Implementation:

- Gateway maintains a per-request map of `subject_id → temporarily_resolved_PII` for the lifetime of one request
- Before Langfuse POST, gateway iterates message content, replaces resolved PII with `[REDACTED-{candidate_id}]`
- Langfuse never sees raw names/emails/phones — it sees the tokens, which are unresolvable without a legal-tier identityd call
- For audit, legal counsel can use the token to dereference identity AND see the corresponding Langfuse trace, but Langfuse's storage is PII-free

This addresses the most-dangerous-leak finding from `AUDIT_PHASE_1_DISCOVERY` §10/C2.
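
The replacement pass in §3.4 can be sketched as a function over the per-request map. This is an illustration under assumptions: the `redact` function, the `resolvedValue` type, and the `map[string][]string` shape for the per-request map are hypothetical, and matching is exact-substring only.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// resolvedValue pairs one PII string with the candidate it belongs to.
type resolvedValue struct {
	candidateID string
	value       string
}

// redact replaces every resolved PII value in messages with a
// [REDACTED-{candidate_id}] token. resolved maps candidate_id to the
// plaintext values fetched from identityd during this request.
func redact(messages []string, resolved map[string][]string) []string {
	var vals []resolvedValue
	for id, fields := range resolved {
		for _, v := range fields {
			if v != "" { // an empty needle would match between every character
				vals = append(vals, resolvedValue{id, v})
			}
		}
	}
	// Longest first, so "Annabelle" is replaced before "Ann"; ties are
	// broken lexically to keep the output deterministic.
	sort.Slice(vals, func(i, j int) bool {
		if len(vals[i].value) != len(vals[j].value) {
			return len(vals[i].value) > len(vals[j].value)
		}
		if vals[i].value != vals[j].value {
			return vals[i].value < vals[j].value
		}
		return vals[i].candidateID < vals[j].candidateID
	})
	out := make([]string, len(messages))
	for i, m := range messages {
		for _, rv := range vals {
			m = strings.ReplaceAll(m, rv.value, "[REDACTED-"+rv.candidateID+"]")
		}
		out[i] = m
	}
	return out
}

func main() {
	resolved := map[string][]string{"CAND-000001": {"Jane Doe", "555-0100"}}
	msgs := []string{"Call Jane Doe at 555-0100 about the welding shift."}
	fmt.Println(redact(msgs, resolved)[0])
}
```

Exact matching only catches PII in the form identityd returned it. Case variants, partial names, or reformatted phone numbers in model output would slip through, so a production pass would likely need normalization on top of this.
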
### 3.5 — Healthcare vertical routing (per J answer 10)

When `subjects.vertical = 'healthcare'`, the gateway routing rules change:

- Tool calls that touch this candidate's data MUST route to local-only models (Ollama on-box), NOT to opencode/openrouter/ollama_cloud egress
- If a healthcare-vertical request can't be served locally, it fails with HTTP 451 ("Unavailable for Legal Reasons") — better to refuse than leak PHI
- The identity service holds the routing decision; the gateway consults it on every call
- Vertical detection itself happens at ingest time (`workers_500k` row metadata) OR when first PII fetch returns vertical='healthcare'

This requires a one-line addition to the gateway's chat routing in `crates/gateway/src/v1/chat.rs` + Go-side equivalent in `cmd/chatd/main.go`. Both should fail-closed: if identityd is unreachable, healthcare requests refuse.

### 3.6 — Training-safe export (per J answer 11)

`POST /v1/identity/training_safe_export` returns a projection of subject decision data with:

- `name`, `email`, `phone`, `address`, `ssn`, `dob` ALL stripped (replaced with `[REDACTED]`)
- `candidate_id` replaced with a hashed pseudonym specific to the export run (different export runs produce different pseudonyms — prevents cross-run correlation)
- Discrimination-proxy phrases (per gemini scrum) detected and `[REDACTED-PROXY]`-replaced
- Output is suitable for RAG-indexing or fine-tuning corpus building
- An audit log entry documents the export (purpose, requesting accessor, scope, fields)

If a candidate later RTBFs, their pre-export decisions remain in the trained corpus BUT the link back to them is severed (the export pseudonym was random). Legal defense: "the source data was destroyed; the model retains it indistinguishably from synthetic patterns."

---

## 4. Audit response: what /audit/subject/{id} returns (Phase 3 preview)

Phase 3 builds the audit endpoint, but the shape it returns is dictated by what identityd can produce.
Sketch:

```json
{
  "schema": "audit.subject.v1",
  "subject_token": "CAND-000001",
  "request_window": { "from": "2026-01-01", "to": "2026-05-03" },
  "generated_at": "2026-05-03T12:00:00Z",
  "generated_by": "identityd@hostname",
  "integrity_hash": "sha256:...",   // Merkle-style chain of all decision rows
  "signature": "ed25519:...",       // identityd signs with its escrow key

  "consent": {
    "status": "given",
    "version": "v3-2026-04-15",
    "given_at": "2026-04-20T14:30:00Z",
    "biometric_consent": "given",
    "biometric_retention_until": "2029-04-20T14:30:00Z"
  },

  "decisions": [
    {
      "ts": "2026-04-22T09:15:23Z",
      "decision_kind": "fill_recommendation",
      "daemon": "gateway",
      "model": "kimi-k2.6",
      "provider": "ollama_cloud",
      "trace_id": "trace-abc",
      "session_id": "session-xyz",
      "input_features": {
        // Sanitized view of what the model saw — no protected attributes,
        // no inferred-attribute proxies, but enough to defend the decision
      },
      "output": "recommended for fill_event_456",
      "rationale": "Skills match: Welder TIG aluminum + 5+ years; geo match: Toledo OH; availability: confirmed",
      "comparator_pool_size": 47,
      "comparator_pool_protected_class_distribution": "see appendix A"   // adverse-impact stats per gemini scrum
    }
  ],

  "comparator_appendix": {
    // EEOC adverse-impact statistics: for the same searches that included
    // this subject, what was the selection rate by protected class?
    // Aggregated; no other subjects' identifiers leak.
  },

  "access_log": [
    // Every PII access for this subject in the request window
    { "at": "...", "purpose": "fill_validation", "fields": ["name"], "trace_id": "..." }
  ],

  "footer": {
    "completeness_attestation": "all decisions about subject_token in the window per retention policy v2 are included",
    "what_was_excluded": "decisions older than 4 years (retention expired) — count: 0",
    "format_version": "audit.subject.v1"
  }
}
```

PDF render is a downstream consumer: same JSON, different presentation layer. Signing stays on the JSON; the PDF is a templated rendering for the legal team's final delivery.


---

## 5. Migration path from current state

This is the single biggest implementation question — how to get from "PII in workers_500k.parquet, no identityd" to "PII in identityd, parquet has only candidate_id + non-PII columns" without breaking the live demo.

### Migration strategy: parallel-write, gradual-read-cutover

1. **Step 1 — Stand up identityd.** Empty database. New service, no callers yet. Health endpoint live. Rust + Go tests can call it but production paths don't.
2. **Step 2 — Backfill from workers_500k.parquet.** One-shot ETL: read parquet, for each row, write to identityd with `consent_status='inferred_existing'` (placeholder until counsel writes the real consent backfill story), `vertical='general'` (correct for non-healthcare data; needs human review for healthcare-flagged rows).
3. **Step 3 — Add identityd-call path to gateway behind a feature flag.** When `LH_USE_IDENTITY_SERVICE=true`, the gateway calls identityd for PII; otherwise it uses the legacy SQL path.
4. **Step 4 — Cut over reads incrementally.** Tool registry first (highest-PII-volume path). Validate via the cross-runtime parity probe `audit_parity.sh`.
5. **Step 5 — Quarantine PII columns in workers_500k.parquet.** Once all readers go through identityd, the parquet's PII columns become read-only and are eventually moved to a different bucket. The candidate_id-only projection becomes the operational table.

Each step has its own commit, its own gate, and its own rollback. Don't ship steps 2-5 in one commit.

---

## 6. Cross-runtime parity probe (NEW)

Per `AUDIT_PHASE_1_DISCOVERY` §4 ask: extend the 5 existing probes with `audit_parity.sh`. New probe asserts:

1. Same PII fetch through Rust gateway (port 3100) and Go gateway (port 4110) produces identical identityd access-log rows (modulo daemon name)
2. Crypto-erasure of a test subject through Rust gateway is honored when Go gateway tries to fetch
3. Healthcare-vertical routing decision is identical across both runtimes
4. Training-safe export produces byte-identical output regardless of which gateway initiated

Ships as part of Phase 5 (identity service build), not Phase 2.

---

## 7. What this design intentionally does NOT solve

- **Does not replace existing protected-attribute exclusion at decision time.** The model still sees what the SQL returns; identityd doesn't filter that. Phase 6 of `AUDIT_TRAIL_PRD` handles boundary enforcement.
- **Does not redact pathway memory trace bodies.** That's per-trace-write redaction, separate concern. Phase 4.
- **Does not retroactively scrub Langfuse history.** Past traces still contain PII; only new traces are token-redacted. Counsel may request a one-shot historical-Langfuse purge — that's a separate runbook.
- **Does not implement "right to explanation" (GDPR Art. 22 / EU AI Act).** Audit response shows decisions; explaining the model's reasoning chain in human-readable form is Phase 8 (legal export format) or its own follow-up phase.
- **Does not handle multi-region data residency.** Single-region (US-Midwest, by default). EU-placeholder fields are present; multi-region deployment is out of scope.

---

## 8. Open questions for J before implementation starts

1. **Master key location.** Vault server, KMS, or a sealed file? Sealed file is fastest to ship; vault is most defensible. Recommend sealed-file for v1 with migration path to vault. **Confirm.**
2. **Postgres for identityd: shared with Langfuse, or its own?** Recommend its own — operational isolation. **Confirm.**
3. **`vertical` field initial values.** Backfill all existing subjects to `'general'`? Or block backfill until each candidate's vertical is determined? **Recommend backfill-to-general + flagging procedure for unknown.**
4. **Legal-only token issuance procedure.** Who has the authority to mint a legal token? Operator (J)? Outside counsel? Both? **Recommend J + named outside counsel, dual-control.**
5. **Crypto-erasure timeline for retention.** Default sweep cadence: daily? Weekly? **Recommend daily.**
6. **EU placeholder enforcement timeline.** Build the fields now; when do we turn on enforcement? **Recommend "when first EU candidate is added; until then, enforcement is no-op."**

---

## 9. Estimated implementation cost

| Sub-phase | Effort | Notes |
|---|---|---|
| 2A — Postgres schema + migrations | 4-6 hours | Includes encryption helpers + key management glue |
| 2B — identityd HTTP surface (Go) | 1-2 days | All endpoints, auth, signing key, tests |
| 2C — Backfill ETL from workers_500k.parquet | 1 day | One-shot script + dry-run mode |
| 2D — Gateway integration (Rust + Go, behind feature flag) | 2 days | Per-tool migration, parity probe |
| 2E — outcomes/sessions/observer JSONL writer changes | 1 day | Subject_id top-level promotion across all sinks |
| 2F — Langfuse redaction layer | 1-2 days | Per-request resolved-PII map + token replacement |
| 2G — Healthcare-vertical routing | 0.5 day | Single conditional per gateway |
| 2H — Training-safe export | 1 day | The exporter + audit logging |
| 2I — Cross-runtime parity probe `audit_parity.sh` | 0.5 day | New probe, lands in golangLAKEHOUSE |
| **Total** | **~8-10 working days** | Sequential; some can parallelize |

This is the largest single phase in the audit-trail program. It's the substrate for everything downstream. Recommend doing it carefully — no half-shipped commits, each sub-phase has its own exit criterion.

---

## Change log

- 2026-05-03 — Initial Phase 2 design draft. Incorporates J's confirmed answers (separate daemon, signed JSON+PDF, legal-only auth) plus all Phase 1 + 1.5 findings + scrum-driven priority changes.