Today's PRD-line-70 reframe (everything runs locally) means the audit-trail docs I drafted earlier this session are over-engineered for J's actual deployment model. They were sized for SaaS-tier infra (Vault/KMS/S3 Object Lock/dual-control JWT/separate Postgres) — appropriate for a multi-tenant cloud service, wrong for a single-box local install.

Adding clear deprecation headers so future sessions don't read these as authoritative and propose another 17-20 day plan involving cloud infrastructure that would re-violate PRD line 70.

What STAYS valid (preserved in headers):

- The legal use case (John Martinez worked example)
- The IL/IN jurisdictional surface (counsel checklist)
- The Phase 1 + 1.5 discovery findings (PII flow paths file:line)
- Phase 1.6 BIPA gates (when real photos arrive)

What's OVER-SCOPED (flagged in headers):

- The 9-phase implementation plan
- The identity service design (Vault/KMS/dual-control)

Future v2 of these docs needs to be sized for local single-box: a few hundred LOC of local writers + signed local audit file, not 17-20 days of distributed-systems design.

No code changes. Just doc-level guardrails for future scope drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Identity Service — Phase 2 Design (v2 — post-scrum revisions)

> **⚠ OVER-SCOPED FOR LOCAL-ONLY DEPLOYMENT — needs simpler rewrite before implementation.**
>
> 2026-05-03 evening: J reframed the system as local-only per PRD line 70 ("Everything runs locally — no cloud APIs"). This document was drafted assuming SaaS-tier infrastructure (HashiCorp Vault, AWS KMS, S3 Object Lock, dual-control JWT split-secret ceremony, mTLS CA, separate Postgres database). For J's local-only single-box deployment serving IL+IN staffing, the audit trail can be MUCH smaller: local SQLite or Postgres, local key file, local HMAC chain to an append-only JSONL.
>
> The discovery findings in `AUDIT_PHASE_1_DISCOVERY.md` and `AUDIT_PHASE_1_5_BIPA_AND_OUTCOMES.md` remain valid (PII flow paths, BIPA exposure, etc.). The PROBLEM is real. This DOC's solution shape is wrong for the deployment.
>
> Do NOT implement this document as-written. When J greenlights audit-trail work, draft a v3 that's local-only sized (~3-5 days, not 17-20).
>
> See `STATE_OF_PLAY.md` "PRD line 70 is load-bearing" entry for the binding direction.

**Status:** Draft v2 — 2026-05-03 · **Owner:** J · **Drafted by:** working session 2026-05-03
**Companion to:** [`AUDIT_TRAIL_PRD.md`](AUDIT_TRAIL_PRD.md), [`AUDIT_PHASE_1_DISCOVERY.md`](AUDIT_PHASE_1_DISCOVERY.md), [`AUDIT_PHASE_1_5_BIPA_AND_OUTCOMES.md`](AUDIT_PHASE_1_5_BIPA_AND_OUTCOMES.md)

> **v2 history.** v1 (2026-05-03 morning) was scrummed across opus + kimi + gemini lineages. 3/3 reviewers converged on 6 critical issues — wrong migration order, master key + legal token "security theater," BIPA `inferred_existing` prima facie violation, healthcare default in wrong direction, Merkle chain without external anchor. v2 incorporates those changes. Full scrum reviews preserved at `/tmp/identity_scrum/{opus,kimi,gemini}_review.md`. Diff between v1 and v2 captured in §11 change log.

> **Why this exists.** Phase 1 + 1.5 confirmed today's substrate has no separation between candidate_id and PII. Both live in `workers_500k.parquet`. No per-access audit, no consent gate, no retention enforcement. This document specifies the new identity service that holds the candidate_id ↔ PII mapping, gates every PII read, audits every access, and serves as the single legal-attestable boundary between PII and the rest of the system.

> **Prerequisite:** **Phase 1.6 (BIPA pre-launch gates) MUST ship before any identityd backfill begins.** Per kimi+gemini scrum, backfilling with `consent_status='inferred_existing'` is a BIPA §15 prima facie violation. Phase 1.6 establishes the consent template + retention schedule + employee training that turns backfill from "inferred" into "pending_backfill_review with documented escalation path."

---

## 1. Scope and non-goals

### In scope

- Single source of truth for `candidate_id ↔ PII` mapping
- Per-PII-access audit log (who/what/when/why)
- Consent + retention metadata (BIPA + IL Day and Temporary Labor Services Act + healthcare PHI)
- Legal-tier access — short-lived JWT with **split-secret dual-control issuance** (NOT a static file token, per all 3 reviewers)
- Healthcare-vertical routing with **fail-closed default** (`unknown` treated as healthcare until classified, per opus+gemini)
- EU-compatible interface — fields exist, enforcement is per-subject (no system-wide flag flip needed)
- Hardware-backed master key — **HashiCorp Vault Transit OR AWS KMS** for v1 (NOT a sealed file, per all 3 reviewers; sealed-file v1 was "obfuscation in any defensible sense" — kimi)
- Signed-JSON audit response with PDF render path (PDF deferred; JSON ships first per opus)
- External-anchor for audit-log integrity (S3 Object Lock + signed timestamps; NOT just Postgres Merkle, per all 3 reviewers)
- mTLS or Unix Domain Socket between gateway ↔ identityd (per kimi — "port-isolated is theater without authenticated channel")
- Field-level authorization on `/subjects/{id}` (per-purpose field allowlists, per kimi)
- Synchronous cache invalidation hook for RTBF (per opus — even if Phase 7 is the full RTBF build)

### Out of scope (for this phase)

- The `/audit/subject/{id}` endpoint itself (Phase 3)
- Full subject-tagging across other substrates (Phase 4)
- Right-to-be-forgotten implementation full-flow (Phase 7)
- Training-safe export (deferred from Phase 2 per opus — "not on critical path for audit-trail defense")
- BIPA pre-launch gates content (Phase 1.6 — separate doc)

### Architecturally simplified from v1 (per scrum)

- **Per-row encryption keys** deferred to Phase 7. v2 uses a single Data Encryption Key (DEK) wrapped under HSM-backed Key Encryption Key (KEK), with per-subject **ciphertext-deletion** as the v2 erasure mechanism. This matches kimi's recommendation: "A single data-encryption key with HSM-backed rotation, plus per-subject deletion of ciphertext, achieves equivalent practical erasure with far less complexity." Cryptographic-erasure via per-subject keys becomes a Phase 7 enhancement.
- **EU placeholder fields** kept (per J 2026-05-03), but enforcement code is genuinely no-op until first eu_resident=true subject — no code paths run, no schema migrations needed when EU comes online (just per-subject field population).
- **Merkle chaining** narrowed (per gemini): full Merkle tree only for legal-tier and consent/erasure events. Standard `gateway_lookup` events get a simpler signed-HMAC chain. Both anchored externally.

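The "simpler signed-HMAC chain" for standard events can be pictured as follows. This is a minimal sketch in Python (chosen for brevity; identityd itself is specified as Go), and every name in it (`GENESIS`, `chain_hash`, `append_record`, `verify_chain`) is hypothetical. It assumes each log line is a JSON object and that each line's `integrity_hash` covers the previous line's hash plus a canonical encoding of the record.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed seed value for the first record in the chain


def chain_hash(key: bytes, prev_hash: str, record: dict) -> str:
    """HMAC-SHA256 over the previous hash plus a canonical JSON encoding."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    msg = prev_hash.encode() + b"\n" + payload.encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def append_record(key: bytes, log: list, record: dict) -> None:
    """Append a JSONL line whose integrity_hash covers the prior line."""
    prev = json.loads(log[-1])["integrity_hash"] if log else GENESIS
    entry = dict(record, integrity_hash=chain_hash(key, prev, record))
    log.append(json.dumps(entry, sort_keys=True))


def verify_chain(key: bytes, log: list) -> bool:
    """Recompute every link; an edited or removed earlier line breaks the chain."""
    prev = GENESIS
    for line in log:
        entry = json.loads(line)
        claimed = entry.pop("integrity_hash")
        if not hmac.compare_digest(claimed, chain_hash(key, prev, entry)):
            return False
        prev = claimed
    return True
```

A chain like this detects edits and deletions in the middle of the log, but not truncation of the newest lines, which is one reason the design anchors checkpoints outside the log itself.
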
---

## 2. Architectural shape

### 2.1 — Process model + transport

Per J 2026-05-03 + kimi scrum: separate daemon, mTLS-mandatory transport. Single Go implementation, both runtimes call it.

| Property | Value |
|---|---|
| Name | `identityd` |
| Port | `:3225` (single port, not dual; per kimi simplification — runtime-agnostic routing) |
| Implementation | Go |
| Storage | Postgres in **isolated database** (per J answer, confirmed). Schema-level isolation; no shared schemas with Langfuse or other lakehouse storage. |
| Transport gateway↔identityd | **mTLS mandatory.** Self-signed CA managed by identityd at startup; gateway clients have their own client certs. Plain HTTP is rejected at the listener. |
| Encryption-at-rest | **Single DEK wrapped under KEK in HashiCorp Vault Transit** (recommended) OR AWS KMS. Master key NEVER on disk. v1 of identityd refuses to start if KEK is unreachable. |
| Audit-log signing | **Separate Ed25519 keypair from the master encryption key.** Signing key in Vault Transit (signing backend) OR a separate sealed-secret file with strict rotation procedure. |
| External anchor for audit chain | **S3 Object Lock** (compliance mode, 7-year retention) holding periodic Merkle root commitments. Daemon writes hourly + on legal-tier event. |
| Backup | Postgres standard backup; KEK backup separate (different storage, different access). Crypto-erasure model only works if these are not co-located. |
| Rate limiting | Per-token QPS + daily-volume caps. Default legal-token: 100 lookups/day + 10K daily volume; alerting at 50%. Configurable per-token. |
| Token revocation | `token_revocations` table checked on every legal-tier auth. Revocation propagates within 60s (TTL on cache). |

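The "periodic Merkle root commitments" in the table reduce a batch of log rows to one hash that can be written to the external anchor. The sketch below is an illustration in Python, not identityd's algorithm: `merkle_root` is a hypothetical helper, the leaf and interior prefixes (`0x00`, `0x01`) are an assumed domain-separation convention, and pairing an odd node with itself is a simplification that a production design would want to revisit.

```python
import hashlib


def _h(data: bytes) -> bytes:
    """SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list) -> str:
    """Hex root over leaf byte strings; an odd node is paired with itself."""
    if not leaves:
        return _h(b"").hex()
    level = [_h(b"\x00" + leaf) for leaf in leaves]  # leaf-level hashes
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        # Interior nodes hash the concatenation of their two children.
        level = [_h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0].hex()
```

Changing any single row changes the root, so committing only the root per checkpoint is enough to detect later edits to the rows it covers.
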
### 2.2 — Schema (Postgres DDL, v2)

```sql
-- Subject record. PII columns are ciphertext under the DEK.
CREATE TABLE subjects (
    candidate_id TEXT PRIMARY KEY, -- UUID v7, NOT sequential (per kimi enumeration concern)
    -- Encrypted PII fields. AES-256-GCM under DEK. NULL = not collected.
    name_ct BYTEA,
    email_ct BYTEA,
    phone_ct BYTEA,
    address_ct BYTEA,
    ssn_ct BYTEA,
    dob_ct BYTEA,
    -- DEK version this subject was encrypted under. KEK rotates; DEK rotates per
    -- KEK rotation. New DEK version = subjects re-encrypted in background sweep.
    dek_version INT NOT NULL,
    -- Lawful basis + consent metadata
    consent_status TEXT NOT NULL CHECK (consent_status IN (
        'pending_backfill_review', -- backfill default; no biometric/PHI use until reviewed
        'pending_first_contact', -- new subject, awaiting consent UX
        'given', -- explicit consent recorded
        'withdrawn', -- subject revoked
        'expired' -- consent timeout
    )),
    consent_version TEXT, -- references published consent template version
    consent_given_at TIMESTAMPTZ,
    consent_withdrawn_at TIMESTAMPTZ,
    -- BIPA-specific fields (Phase 1.5 §1E + Phase 1.6 prerequisite)
    biometric_consent_status TEXT NOT NULL DEFAULT 'never_collected' CHECK (biometric_consent_status IN (
        'never_collected', 'pending', 'given', 'withdrawn', 'expired'
    )),
    biometric_retention_until TIMESTAMPTZ, -- BIPA: max 3 years from last interaction
    -- Vertical detection — drives healthcare PHI routing.
    -- DEFAULT 'unknown' (per opus+gemini scrum) — fail-closed routing treats
    -- unknown as healthcare-equivalent until reclassified.
    vertical TEXT NOT NULL DEFAULT 'unknown' CHECK (vertical IN (
        'unknown', 'general', 'healthcare', 'finance', 'other'
    )),
    -- Per-vertical retention period. Drives the daily erasure sweep.
    -- (Per opus: this was missing in v1. Required for BIPA's 3-year-from-
    -- last-interaction rule which differs from generic retention.)
    retention_period_days INT NOT NULL,
    -- EU-placeholder fields. Enforcement is per-subject; nothing runs
    -- until a row has eu_resident=true AND lawful_basis IS NOT NULL.
    eu_resident BOOLEAN NOT NULL DEFAULT FALSE,
    lawful_basis TEXT, -- GDPR Art. 6 basis when eu_resident=true
    transfer_mechanism TEXT, -- SCC, DPF, BCR — populated when EU comes online
    -- Standard audit columns
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    last_interaction TIMESTAMPTZ, -- drives retention sweep
    -- Crypto-erasure state (v2: ciphertext deletion, NOT key destruction)
    erased_at TIMESTAMPTZ,
    erasure_reason TEXT
);

-- Append-only access log. EVERY PII read writes a row.
-- Field-level: fields_accessed records WHICH fields the caller resolved.
-- purpose enforced against an allowlist per-purpose-token.
CREATE TABLE pii_access_log (
    access_id BIGSERIAL PRIMARY KEY,
    candidate_id TEXT NOT NULL,
    accessed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    accessor_kind TEXT NOT NULL, -- 'gateway_lookup' | 'audit_response' | 'legal_request' | 'system_resolve'
    accessor_id TEXT NOT NULL, -- daemon name + caller token-hash; never raw token
    purpose_token TEXT NOT NULL, -- opaque token; resolved through purpose_definitions table
    fields_accessed TEXT[] NOT NULL,
    request_trace_id TEXT,
    chain_kind TEXT NOT NULL CHECK (chain_kind IN ('hmac', 'merkle')), -- HMAC for standard, Merkle for legal/consent/erasure
    integrity_hash TEXT NOT NULL,
    -- Per opus security finding: this row's existence is itself sensitive
    -- ("candidate X is under legal review" leaks via purpose). Read access
    -- to this table requires legal-tier auth or scoped per-subject auth.
    is_legal_tier_event BOOLEAN NOT NULL DEFAULT FALSE
);

-- Purpose definitions — enforces field-level authorization (per kimi).
-- A given purpose_token can only request fields in its allowlist.
CREATE TABLE purpose_definitions (
    purpose_token TEXT PRIMARY KEY, -- e.g. 'fill_validation', 'audit_subject_response'
    description TEXT NOT NULL,
    allowed_fields TEXT[] NOT NULL, -- e.g. ARRAY['name'] for fill_validation
    auth_tier TEXT NOT NULL CHECK (auth_tier IN ('service', 'legal')),
    rate_limit_qps INT,
    daily_volume_cap INT
);

-- Token revocation list. Checked on every auth, cached 60s.
CREATE TABLE token_revocations (
    token_hash TEXT PRIMARY KEY,
    revoked_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    revoked_by TEXT NOT NULL,
    reason TEXT
);

-- Consent template versioning. Hash references, not embedded text per row.
CREATE TABLE consent_versions (
    version TEXT PRIMARY KEY,
    effective_at TIMESTAMPTZ NOT NULL,
    superseded_at TIMESTAMPTZ,
    template_hash TEXT NOT NULL, -- SHA256 of the canonical template
    template_path TEXT NOT NULL -- where the canonical text lives (e.g. data/_consent/v3-2026-04-15.md)
);

-- External anchor checkpoints. Periodically committed to S3 Object Lock.
CREATE TABLE anchor_commits (
    commit_id BIGSERIAL PRIMARY KEY,
    chain_kind TEXT NOT NULL, -- 'hmac' or 'merkle'
    checkpoint_at TIMESTAMPTZ NOT NULL,
    last_access_id BIGINT NOT NULL, -- max access_id covered by this checkpoint
    root_hash TEXT NOT NULL,
    s3_object_uri TEXT NOT NULL, -- s3://anchor-bucket/identityd/...
    s3_lock_until TIMESTAMPTZ NOT NULL -- compliance-mode retention end
);
```

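The `last_interaction` and `retention_period_days` columns together decide which rows the daily erasure sweep should pick up. A minimal Python sketch of that predicate follows; `due_for_erasure` is a hypothetical name, and treating a row with no recorded interaction as "not due" (left for manual review) is an assumption, not something the schema specifies.

```python
from datetime import datetime, timedelta, timezone


def due_for_erasure(last_interaction, retention_period_days, erased_at, now):
    """True when the retention window has lapsed and the row is not yet erased."""
    if erased_at is not None or last_interaction is None:
        # Assumption: already-erased rows and rows with no recorded
        # interaction are skipped rather than erased automatically.
        return False
    return now >= last_interaction + timedelta(days=retention_period_days)


# Example: a sweep running at 03:00 UTC with a three-year retention period.
SWEEP_TIME = datetime(2026, 5, 3, 3, 0, tzinfo=timezone.utc)
THREE_YEARS = 3 * 365
```
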
### 2.3 — HTTP surface (v2)

All under `/v1/identity/`. mTLS + token required on every endpoint.

| Method + Path | Purpose | Auth tier | Notes |
|---|---|---|---|
| `POST /v1/identity/subjects` | Create new subject. Body: PII fields + `purpose_token`. Returns: candidate_id (UUID v7). | service | Validates purpose_token allowlist matches submitted fields |
| `GET /v1/identity/subjects/{candidate_id}?purpose={token}&fields={list}` | Resolve PII for candidate. | service | **Field-level enforcement**: returned fields ⊆ purpose_token's allowed_fields. Mismatch → 403. Logs access row. |
| `GET /v1/identity/subjects/{candidate_id}/vertical` | **Minimal-disclosure** vertical lookup (per kimi HIPAA minimum-necessary). Returns only `{vertical, consent_status}`. | service | Used by gateway routing. Cheaper access-log row. |
| `GET /v1/identity/subjects/{candidate_id}/full` | Complete subject record + audit summary. | **legal** | Short-lived JWT only. Logged with `is_legal_tier_event=true`. Triggers real-time notification to designated counsel + J. |
| `POST /v1/identity/subjects/{candidate_id}/consent` | Record consent given/withdrawn with version. | service | |
| `POST /v1/identity/subjects/{candidate_id}/erase` | Crypto-erasure. Idempotent. | **legal** | Synchronously triggers cache invalidation hooks (per opus). |
| `GET /v1/identity/access_log/{candidate_id}` | Per-subject access log for audit response. | **legal** | |
| `POST /v1/identity/auth/legal_token` | Issue short-lived JWT (max 24h). Requires **dual-control attestation**: J's signed nonce + counsel's signed nonce, both verified against pre-registered public keys. | dual-control | Per all 3 reviewers — replaces v1's static file approach |
| `GET /v1/identity/health` | Liveness | none | Returns only `{status: "ok"}` — no version, no schema info |
| `(no public training-safe-export endpoint in v2)` | — | — | Deferred per opus — not on critical path |

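The field-level enforcement rule (requested fields must be a subset of the purpose's `allowed_fields`, otherwise 403) is small enough to sketch directly. The Python below is illustrative only: `PURPOSES` is a hypothetical in-memory stand-in for the `purpose_definitions` table, and rejecting an empty field list is an assumption the document does not state.

```python
PURPOSES = {
    # Hypothetical row mirroring purpose_definitions for 'fill_validation'.
    "fill_validation": {"allowed_fields": {"name"}, "auth_tier": "service"},
}


def authorize(purpose_token: str, requested_fields: list) -> int:
    """Return an HTTP-style status: 200 if the request is allowed, else 403."""
    purpose = PURPOSES.get(purpose_token)
    if purpose is None or not requested_fields:
        return 403  # unknown purpose, or nothing requested (assumed invalid)
    if set(requested_fields) <= purpose["allowed_fields"]:
        return 200
    return 403
```

With this table, `authorize("fill_validation", ["name"])` is allowed, while a request for `["name", "ssn"]` under the same purpose is refused.
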
### 2.4 — Auth model (v2 — split-secret + short-lived)

**Service-tier auth** (gateway → identityd for routine PII resolution):

- mTLS client cert per gateway (Rust :3100 + Go :4110 each get their own cert)
- Bearer token in `Authorization` header (long-lived — 90d), rotated per ops runbook

**Legal-tier auth** (per-request, short-lived JWT, dual-control issuance):

- J holds private key A
- Designated outside counsel holds private key B
- `POST /v1/identity/auth/legal_token` requires:
  - HTTP body: `{purpose, ttl_seconds, requested_fields, signature_a, signature_b, nonce_a, nonce_b}`
  - identityd verifies both signatures against pre-registered A and B public keys
  - On success: emits a JWT signed by identityd, valid for `ttl_seconds` (max 24h), scoped to `purpose`+`fields`
- The JWT:
  - Cannot be issued without BOTH A and B signatures (per gemini "split-secret startup ceremony" pattern, applied per-token)
  - Carries scope (purpose + fields) inside its claims; identityd enforces at use time
  - Per-token rate limit + daily cap recorded on issuance
  - Revocation: token_hash added to `token_revocations`; identityd refuses on next check (60s cache TTL)

**Why this matters:**

- v1's "static file mode 0400" was security theater per all 3 reviewers
- A leaked v1 legal token = unbounded exfiltration window until rotation
- A leaked v2 JWT = bounded by TTL (max 24h) AND revocable in <60s
- Single-actor compromise (J's machine OR counsel's machine, not both) cannot mint legal tokens

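The "revocable in <60s" bound follows from how the revocation list is cached: a revoked token keeps working only until the cached copy expires. The Python sketch below illustrates that behavior under stated assumptions. `RevocationCache` and `load_revoked` are hypothetical names, SHA-256 is an assumed choice for the token hash, and the caller passes the clock value in so the example stays deterministic.

```python
import hashlib


class RevocationCache:
    """Caches the set of revoked token hashes for at most `ttl` seconds."""

    def __init__(self, load_revoked, ttl=60.0):
        self._load = load_revoked   # callable returning revoked token hashes
        self._ttl = ttl
        self._cached = set()
        self._loaded_at = None

    def is_revoked(self, raw_token: str, now: float) -> bool:
        # Reload when nothing is cached yet or the cached copy has expired.
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._cached = set(self._load())
            self._loaded_at = now
        # Only the hash is compared; the raw token is never stored.
        token_hash = hashlib.sha256(raw_token.encode()).hexdigest()
        return token_hash in self._cached
```

In this model a token revoked just after a reload is still accepted until the next reload, so the worst-case exposure after revocation equals the cache TTL.
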
---

## 3. Integration with the rest of the substrate

### 3.1 — Gateway changes

When gateway needs PII for a fill scenario:

1. Gateway has only `candidate_id` (post-Phase-1.5 view-routing fix)
2. Gateway calls `GET /v1/identity/subjects/{candidate_id}/vertical` first when the routing decision may be healthcare-sensitive
3. Gateway calls `GET /v1/identity/subjects/{candidate_id}?purpose=fill_validation&fields=name`
4. identityd validates: purpose_token=`fill_validation` allows `[name]` only; if request asks for `[name,ssn]` → 403
5. identityd writes `pii_access_log` row, decrypts requested fields, returns
6. Gateway uses fields, then **explicitly zeros memory** (kimi's drop-and-overwrite pattern) before returning HTTP response
7. **No PII caching** in gateway memory beyond request lifetime

**LRU embed cache** (commit `150cc3b`) currently keys by `(model, text)` where text contains PII. Phase 4 task to re-key as `(model, candidate_id, field_subset_hash)`. **Phase 2 must add the synchronous cache-purge hook** so RTBF in Phase 2 invalidates Phase 4's cache when it ships (opus finding).

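The proposed re-keying to `(model, candidate_id, field_subset_hash)` keeps PII text out of the cache key and makes per-subject purging a simple filter on the second element. A hedged Python sketch follows; `embed_cache_key` is a hypothetical helper, and hashing the sorted, comma-joined field names with SHA-256 (truncated to 16 hex characters) is an assumed way to derive `field_subset_hash`.

```python
import hashlib


def embed_cache_key(model: str, candidate_id: str, fields: list) -> tuple:
    """Build a cache key from identifiers only, with no PII text in it."""
    # Sort and de-duplicate so the same field subset always hashes the same.
    canonical = ",".join(sorted(set(fields)))
    field_subset_hash = hashlib.sha256(canonical.encode()).hexdigest()[:16]
    return (model, candidate_id, field_subset_hash)
```

Because the candidate identifier is part of the key, an erase event can drop every entry whose second element matches, without inspecting cached values.
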
### 3.2 — Cross-runtime: same identityd, both gateways call it

Both gateways call identityd over mTLS. Same endpoints, same auth model. New cross-runtime parity probe `audit_parity.sh` validates that an identical PII request through both gateways produces identical access-log rows (modulo daemon-name field).

### 3.3 — JSONL writer changes (subject_id top-level promotion)

Per `AUDIT_PHASE_1_DISCOVERY` §10/C5:

- All JSONL sinks (outcomes, sessions, overseer_corrections, observerd ops) gain top-level `subject_ids: [...]` field
- `fills[*].name` replaced with `name_ref: "[REDACTED-{candidate_id}]"`
- Authorized callers dereference via identityd

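As an illustration of the writer change, the Python sketch below rewrites one record before it is serialized to a JSONL sink. It is a sketch under assumptions: `redact_outcome` is a hypothetical function, and it assumes each entry in `fills` already carries a `candidate_id` alongside the `name` being removed.

```python
import json


def redact_outcome(record: dict) -> str:
    """Replace fills[*].name with name_ref and promote subject_ids to top level."""
    out = dict(record)
    fills, subject_ids = [], []
    for fill in record.get("fills", []):
        fill = dict(fill)                    # copy; leave the caller's data intact
        candidate_id = fill["candidate_id"]  # assumed to be present on each fill
        fill.pop("name", None)
        fill["name_ref"] = f"[REDACTED-{candidate_id}]"
        fills.append(fill)
        if candidate_id not in subject_ids:
            subject_ids.append(candidate_id)
    out["fills"] = fills
    out["subject_ids"] = subject_ids
    return json.dumps(out, sort_keys=True)
```

The returned string is one JSONL line; the name never reaches the sink, and the top-level `subject_ids` list lets later phases find every record that mentions a subject.
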
### 3.4 — Langfuse boundary redaction (defense in depth, per opus TOCTOU finding)

v1 design: per-request map of `subject_id → resolved_PII`, replace before Langfuse POST.

**v2 adds** (per opus): outbound regex/NER pass on the Langfuse payload as defense-in-depth. The model can hallucinate names not in the resolved-PII map. The outbound NER pass catches them.

```
Gateway constructs Langfuse payload
  → Pass 1: per-request resolved-PII map replacement
  → Pass 2: NER scan (regex for SSN/phone shapes; named-entity model for names)
  → Either pass detecting unredacted PII → drop the trace OR replace with [POSSIBLE-PII-DETECTED]
  → POST to Langfuse
```

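The two passes above can be sketched as follows, in Python for brevity. This covers only the regex half of pass 2; the named-entity model for names is out of scope for the sketch. `redact`, `SSN_RE`, and `PHONE_RE` are hypothetical names, and the patterns assume US-style `NNN-NN-NNNN` and `NNN-NNN-NNNN` shapes.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def redact(payload: str, resolved_pii: dict):
    """Return (text, suspicious); suspicious means pass 2 found PII-shaped text."""
    # Pass 1: replace every value resolved for this request with its token.
    for candidate_id, values in resolved_pii.items():
        for value in values:
            payload = payload.replace(value, f"[REDACTED-{candidate_id}]")
    # Pass 2: shape-based scan for anything pass 1 could not know about.
    suspicious = bool(SSN_RE.search(payload) or PHONE_RE.search(payload))
    if suspicious:
        payload = SSN_RE.sub("[POSSIBLE-PII-DETECTED]", payload)
        payload = PHONE_RE.sub("[POSSIBLE-PII-DETECTED]", payload)
    return payload, suspicious
```

A caller that prefers the "drop the trace" option can use the `suspicious` flag to skip the POST entirely instead of sending the masked text.
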
### 3.5 — Healthcare vertical routing — fail-closed default

**Per opus + gemini scrum: Default is `unknown` → treated as healthcare for routing purposes.**

When gateway receives a request involving any candidate:

1. Query `/vertical` for the candidate
2. If `vertical IN ('healthcare', 'unknown')` → route to **on-box Ollama only** (no opencode/openrouter/ollama_cloud egress)
3. If `vertical='general'` (or other non-healthcare) → cloud routing OK
4. If identityd unreachable → fail closed (refuse the request, return 503)

**Reclassification path:** subjects can be moved from `unknown` → `general` via explicit operator action (after manual review). Subjects flagged `healthcare` stay healthcare unless explicitly downgraded.

**Auto-escalation** (per kimi): if a subject's resume_text or call_log content is updated with healthcare-pattern matches (RN, BSN, hospital, MD, physician, etc.), vertical auto-escalates to `healthcare`. Never auto-de-escalates.

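The routing rules and the one-way auto-escalation can be expressed as two small functions. The Python below is a sketch, not the gateway's code: `route`, `escalate`, and the returned labels are hypothetical, `None` is an assumed signal for "identityd unreachable", and the patterns cover only the five example terms listed above.

```python
import re

# Acronyms are matched case-sensitively; ordinary words ignore case.
ACRONYM_RE = re.compile(r"\b(?:RN|BSN|MD)\b")
WORD_RE = re.compile(r"\b(?:hospital|physician)\b", re.IGNORECASE)


def route(vertical):
    """Map a vertical to a routing decision; None means identityd was unreachable."""
    if vertical is None:
        return "refuse_503"          # fail closed
    if vertical in ("healthcare", "unknown"):
        return "local_only"          # on-box model only, no egress
    return "cloud_ok"


def escalate(vertical: str, text: str) -> str:
    """Escalate to healthcare on a pattern match; never de-escalate."""
    if ACRONYM_RE.search(text) or WORD_RE.search(text):
        return "healthcare"
    return vertical
```

Since `escalate` can only return the input vertical or `healthcare`, a subject already classified as healthcare stays there regardless of later text.
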
### 3.6 — Cache invalidation hook (Phase 2 must ship even though full RTBF is Phase 7)

When `POST /v1/identity/subjects/{id}/erase` fires:

1. Mark `subjects.erased_at` and zero out ciphertext columns
2. Write `pii_access_log` row with `purpose_token='retention_expired'` or `'rtbf_request'`
3. **Synchronously** publish a `subject_erased` event to a pub/sub channel (Redis or Postgres LISTEN/NOTIFY)
4. Gateway subscribes; on event, purges any in-flight cache entries for that candidate_id
5. Eventual-consistency window: <5s between erase-call and cache-flush

Without this hook, RTBF in Phase 7 would be a lie because the gateway's LRU embed cache could still hold the subject's data.

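The erase flow above can be sketched with an in-process stand-in for the pub/sub channel. This Python example is illustrative only: `EraseBus` and `erase` are hypothetical, the bus replaces Redis or LISTEN/NOTIFY with direct callbacks, the access-log write from step 2 is omitted, and rows are modeled as dictionaries whose ciphertext columns end in `_ct`.

```python
class EraseBus:
    """In-process stand-in for the subject_erased pub/sub channel."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, candidate_id):
        # Synchronous delivery: returns only after every subscriber has run.
        for callback in self._subscribers:
            callback(candidate_id)


def erase(subjects: dict, bus: EraseBus, candidate_id: str, now: str) -> None:
    """Null the ciphertext columns, stamp erased_at once, then notify caches."""
    row = subjects[candidate_id]
    if row.get("erased_at") is None:   # idempotent: keep the first timestamp
        for column in [c for c in row if c.endswith("_ct")]:
            row[column] = None
        row["erased_at"] = now
    bus.publish(candidate_id)
```

Publishing on every call, including repeats, is a deliberate choice in this sketch: a retried erase re-purges caches that may have been repopulated in between.
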
---

## 4. Audit response shape (Phase 3 preview, v2 — adds model-version snapshot per gemini)

```json
{
  "schema": "audit.subject.v1",
  "subject_token": "01926f2e-7c1b-7000-...", // UUID v7, NOT CAND-NNNNNN
  "request_window": { "from": "2026-01-01", "to": "2026-05-03" },
  "generated_at": "2026-05-03T12:00:00Z",
  "generated_by": "identityd@hostname",
  "merkle_root": "sha256:...", // root hash of legal/consent/erasure events
  "external_anchor": "s3://anchor-bucket/identityd/2026-05-03T12-00.json", // S3 Object Lock URI
  "signature": "ed25519:...", // separate signing key from encryption KEK
  "consent": { ... },
  "decisions": [
    {
      "ts": "2026-04-22T09:15:23Z",
      "decision_kind": "fill_recommendation",
      "daemon": "gateway",
      "model": "kimi-k2.6",
      "model_version_hash": "sha256:...", // NEW per gemini — proves what model existed AT decision time
      "provider": "ollama_cloud",
      "trace_id": "trace-abc",
      "session_id": "session-xyz",
      "input_features": { ... }, // sanitized; no protected attributes; no inferred-attribute proxies
      "output": "...",
      "rationale": "...",
      "comparator_pool_size": 47,
      "comparator_appendix_ref": "see comparator_appendix.A"
    }
  ],
  "comparator_appendix": {
    "A": {
      "scope": "fill scenarios in window matching role X, geo Y",
      "total_pool_size": 47,
      "selection_rate_by_protected_class": {
        // Aggregated; NO other subjects' identifiers leak
        "race": { "white": 0.33, "black": 0.31, ... },
        "gender": { "man": 0.34, "woman": 0.30, ... }
      },
      "four_fifths_test": "passed" // or "concern: rate ratio 0.71"
    }
  },
  "access_log": [...],
  "footer": {
    "completeness_attestation": "all decisions about subject_token in window per retention policy v2 are included",
    "merkle_proof": "...", // proof this audit's root is in S3 anchor
    "what_was_excluded": "decisions older than 4 years (retention expired) — count: 0"
  }
}
```

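The `four_fifths_test` field compares the lowest group selection rate to the highest and flags a ratio below 0.8. A short Python sketch of that arithmetic follows; `four_fifths` is a hypothetical helper, and returning "passed" when no group has any selections is an assumption.

```python
def four_fifths(selection_rates: dict) -> str:
    """Compare the lowest group selection rate to the highest."""
    highest = max(selection_rates.values())
    lowest = min(selection_rates.values())
    if highest == 0:
        return "passed"  # assumption: no selections means no measurable disparity
    ratio = lowest / highest
    return "passed" if ratio >= 0.8 else f"concern: rate ratio {ratio:.2f}"
```

With the sample race rates above, 0.31 / 0.33 is roughly 0.94, so the test passes. Rates of 0.34 and 0.24 would give 0.24 / 0.34, roughly 0.71, which falls below the threshold and is reported as a concern.
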
---

## 5. Migration path (v2 — REORDERED per all 3 reviewers)

v1's order was wrong. New order, with explicit prerequisites:

| Step | Action | Prerequisite |
|---|---|---|
| **0** | **Phase 1.6 BIPA pre-launch gates SHIPPED** — consent template published, retention schedule public, deletion procedure documented, employee training acknowledged. | (separate work — see `PHASE_1_6_BIPA_GATES.md` when written) |
| **1** | Stand up identityd with **synthetic test subjects only**. KEK in vault, mTLS live, all endpoints serve. | Vault/KMS available; mTLS CA bootstrapped |
| **2** | **Audit-parity probe** (`audit_parity.sh`) green on synthetic data. Cross-runtime equivalence verified. | Step 1 |
| **3** | Add gateway feature flag `LH_USE_IDENTITY_SERVICE`. **Shadow-write only** — gateway writes new subjects to identityd AND continues SQL path. No reads from identityd yet. | Step 2 |
| **4** | Run shadow-write for ≥1 week. Validate access logs, encryption correctness, write-path performance under real traffic. | Step 3 |
| **5** | Backfill from `workers_500k.parquet`. **`consent_status='pending_backfill_review'`** for ALL existing rows. **`vertical='unknown'`** default (NOT 'general'). **`biometric_consent_status='never_collected'`** — backfill does NOT include any biometric data. | Step 0 (BIPA gates) + Step 4 |
| **6** | Human-review queue for vertical reclassification. Subjects move from `unknown` to `general` only after explicit review. Healthcare-pattern matches auto-escalate to `healthcare` (never auto-downgrade). | Step 5 |
| **7** | **Shadow-read** — gateway reads from BOTH SQL and identityd, compares, logs divergences, returns SQL result. Run ≥1 week. | Step 5 |
| **8** | Feature-flag cutover — gateway reads from identityd, falls back to SQL on error, alerts on every fallback. | Step 7 |
| **9** | Quarantine PII columns in `workers_500k.parquet` — only after **cryptographic attestation of identityd completeness** (Merkle proof: source row hash = identityd row hash for every candidate_id). Move PII columns to a different bucket; the candidate_id-only projection becomes operational. | Step 8 |

**Key v2 changes from v1:**

- New Step 0 (BIPA gates prerequisite) — backfill cannot start without this
- Steps 3-4 (shadow-write before backfill) — production validates the WRITE path before data lands
- Step 5 backfill consent: `pending_backfill_review` not `inferred_existing` (BIPA defense)
- Step 5 vertical default: `unknown` not `general` (HIPAA fail-closed)
- New Step 6 (human review) — vertical classification via explicit operator action
- Step 7 added (shadow-read after backfill) — catches encoding/normalization bugs before cutover
- Step 9 (quarantine) requires cryptographic attestation of completeness

Each step is its own commit, its own gate, its own rollback path.

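Step 7's shadow-read (read both paths, compare, log divergences, return the SQL result) might look like the Python sketch below. All names are hypothetical: `shadow_read`, `read_sql`, and `read_identityd` stand in for the gateway's real lookups. Recording only the names of divergent fields, never their values, is an assumption made to keep PII out of the divergence log.

```python
def shadow_read(candidate_id, read_sql, read_identityd, divergences: list):
    """Return the SQL result; record a divergence if identityd disagrees or fails."""
    sql_row = read_sql(candidate_id)
    try:
        identity_row = read_identityd(candidate_id)
    except Exception as exc:
        # During shadow-read, identityd failures must not break the live path.
        divergences.append({"candidate_id": candidate_id, "error": repr(exc)})
        return sql_row
    # Collect the names of fields whose values differ between the two sources.
    diff = sorted(k for k in set(sql_row) | set(identity_row)
                  if sql_row.get(k) != identity_row.get(k))
    if diff:
        divergences.append({"candidate_id": candidate_id, "fields": diff})
    return sql_row
```

A phone number stored as `312-555-0142` in one source and `3125550142` in the other would surface here as a divergence on `phone`, which is the kind of normalization bug this step is meant to catch before cutover.
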
---

## 6. Cross-runtime parity probe (NEW, per scrum §6 plus opus 'oracle test' addition)

`audit_parity.sh` ships in Phase 5. Asserts:

1. Same PII fetch through Rust gateway and Go gateway produces identical identityd access-log rows (modulo daemon name)
2. Crypto-erasure of test subject through Rust gateway is honored when Go gateway tries to fetch
3. Healthcare-vertical routing decision identical across both runtimes
4. **Oracle test** (per kimi SOC2 CC4.1 finding): probe with known-good inputs against expected outputs. A bug present in BOTH implementations must not pass the parity probe.
5. Discrimination-proxy phrase redaction triggers as expected on adversarial test cases

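Assertion 1 amounts to comparing two access-log rows after masking the columns that legitimately differ between runtimes. The Python sketch below shows one way to do that; `rows_match` and `IGNORED` are hypothetical, and the ignored set goes beyond the daemon name on the assumption that row ids, timestamps, trace ids, and chained hashes also differ between two separate requests.

```python
# Assumed set of columns that differ between two otherwise identical requests.
IGNORED = {"access_id", "accessed_at", "accessor_id",
           "request_trace_id", "integrity_hash"}


def rows_match(rust_row: dict, go_row: dict, ignored=IGNORED) -> bool:
    """Compare two pii_access_log rows on every column outside `ignored`."""
    def strip(row):
        return {k: v for k, v in row.items() if k not in ignored}
    return strip(rust_row) == strip(go_row)
```

A parity check of this kind only proves the two runtimes agree with each other, which is why the oracle test in item 4 compares against expected values as well.
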
---

## 7. What this design intentionally does NOT solve

- Protected-attribute exclusion at decision time (Phase 6)
- Pathway memory trace body redaction (Phase 4)
- Retroactive Langfuse history scrub (separate runbook)
- GDPR Art. 22 right to explanation full implementation (Phase 8)
- Multi-region data residency (single-region US-Midwest by default)
- Training-safe export (deferred — not on critical path per opus)

---

## 8. Open questions — RESOLVED 2026-05-03

J confirmed all 6 v1 recommendations + scrum-driven changes:

1. ✅ **Master key:** HashiCorp Vault Transit (recommended) OR AWS KMS. NOT a sealed file. v1's sealed-file recommendation is rejected per all 3 reviewers ("obfuscation in any defensible sense").
2. ✅ **Postgres isolation:** identityd's own database, isolated schema.
3. ✅ **Vertical backfill:** `'unknown'` default with fail-closed routing. NOT `'general'` per opus+gemini scrum.
4. ✅ **Legal token:** Split-secret dual-control issuance (J + counsel both sign). Short-lived JWT, max 24h, revocable in <60s.
5. ✅ **Crypto-erasure sweep:** Daily 03:00 UTC. (Note: v2 erasure mechanism is ciphertext deletion, not key destruction. Per-row keys deferred to Phase 7.)
6. ✅ **EU enforcement:** Per-subject. Schema fields exist; nothing runs until first eu_resident=true subject.

### Newly resolved per scrum

7. ✅ **Migration order:** REORDERED per §5 above. New Step 0 (BIPA prerequisite). Shadow-write before backfill. Shadow-read before cutover.
8. ✅ **Audit-log external anchor:** S3 Object Lock with compliance-mode 7-year retention. Hourly + on-event commits.
9. ✅ **Audit-log signing key:** Separate Ed25519 keypair from KEK. Vault Transit signing backend OR sealed-secret with strict rotation runbook.
10. ✅ **mTLS gateway↔identityd:** Mandatory. Self-signed CA managed by identityd at startup.
11. ✅ **Per-row encryption keys:** Deferred to Phase 7. v2 uses single DEK with HSM-wrapped KEK + ciphertext deletion for erasure.
12. ✅ **Field-level authorization:** purpose_definitions table enforces per-purpose field allowlists.
13. ✅ **Synchronous cache invalidation:** pub/sub event on erase; gateway subscribes.
14. ✅ **Outbound NER pass for Langfuse:** Defense-in-depth in addition to symbol-table replacement.
15. ✅ **Model version hash in audit response:** Captured per decision row.
16. ✅ **PDF render:** Deferred from Phase 2; JSON ships first.

---

## 9. Estimated implementation cost (revised v2)

| Sub-phase | Effort | Notes |
|---|---|---|
| 2A — Postgres schema + Vault/KMS integration | 1.5-2 days | Includes mTLS CA bootstrap + signing-key separation |
| 2B — identityd HTTP surface (Go) | 2-3 days | All endpoints, auth, dual-control JWT issuance, rate limiting, revocation |
| 2C — Backfill ETL (BIPA-compliant — pending_backfill_review) | 1 day | Plus Merkle attestation of completeness |
| 2D — Gateway integration (Rust + Go, mTLS, shadow-write phase) | 2-3 days | Per-tool migration, parity probe |
| 2E — JSONL writer changes (subject_id top-level promotion) | 1 day | All sinks |
| 2F — Langfuse redaction (symbol-table + NER) | 2 days | Two passes + drop-on-detect |
| 2G — Healthcare-vertical routing (fail-closed) | 0.5 day | Plus auto-escalation pattern matcher |
| 2H — Cache invalidation pub/sub hook | 0.5 day | Critical for Phase 7 RTBF |
| 2I — External anchor (S3 Object Lock) | 1 day | Hourly + on-event commits |
| 2J — Cross-runtime parity probe with oracle tests | 1 day | New probe |
| **Total** | **~12-15 working days** | (v1 estimated 8-10; v2 added BIPA gates dependency, mTLS, dual-control JWT, NER pass, anchor, etc.) |

Bigger than v1. Worth it — every addition was a 3/3-reviewer convergent finding.

---

## 10. The four "would not build" blockers from scrum (all addressed in v2)
|
|
|
|
| # | v1 issue (3/3 reviewers) | v2 resolution |
|---|---|---|
| 1 | Migration order (backfill before validation) | §5 reordered; Step 0 BIPA gates prereq added |
| 2 | Master key on disk + legal token static file | §2.1 Vault/KMS for KEK; §2.4 split-secret short-lived JWT |
| 3 | `inferred_existing` BIPA prima facie violation | §5 Step 5 uses `pending_backfill_review`; Step 0 gates required first |
| 4 | Healthcare default `general` (HIPAA exposure window) | §3.5 fail-closed `unknown`-as-healthcare; §5 Step 5 backfill default `unknown` |

All three reviewers said "I would not build v1 as written." All four blockers are resolved in v2. Re-scrum recommended before implementation starts.

---
## 11. Change log (v1 → v2)
| Section | v1 | v2 |
|---|---|---|
| Master key storage | Sealed file `/etc/lakehouse/identityd_master.key` | HashiCorp Vault Transit / AWS KMS |
| Legal token | Static file mode 0400 | Split-secret short-lived JWT (max 24h), dual-control issuance |
| Encryption keys | Per-row keys with master wrapping | Single DEK + HSM-wrapped KEK + ciphertext deletion (per-row deferred to Phase 7) |
| Healthcare default | `vertical='general'` backfill | `vertical='unknown'` backfill, fail-closed routing |
| Migration §5 order | 1: stand up, 2: backfill, 3: feature flag, 4: cutover, 5: quarantine | 0: BIPA gates, 1: stand up, 2: parity probe, 3: shadow-write, 4: shadow-write soak, 5: backfill (pending_backfill_review), 6: human vertical review, 7: shadow-read, 8: cutover, 9: quarantine with cryptographic attestation |
| Audit-log integrity | Postgres Merkle chain only | Postgres Merkle for legal/consent/erasure + HMAC for standard + S3 Object Lock external anchor |
| Audit-log signing key | Same as encryption key | Separate Ed25519 keypair; Vault signing backend |
| Gateway↔identityd transport | Plain HTTP (implied) | mTLS mandatory |
| Field authorization | Per-endpoint only | Per-purpose-token field allowlists |
| Cache invalidation | Phase 4 / Phase 7 | Phase 2 ships pub/sub hook |
| Langfuse redaction | Symbol-table replacement only | Symbol-table + outbound NER pass + drop-on-detect |
| Audit response | No model version | model_version_hash per decision |
| PDF render | Phase 2 | Deferred; JSON ships first |
| Training-safe export | Phase 2 | Deferred; not critical path |
| Effort estimate | 8-10 days | 12-15 days |

---
## 12. v3 amendments (post-second-pass scrum, 2026-05-03 evening)
v2 was re-scrummed across opus + kimi + gemini. **All three returned the verdict BUILD-WITH-CHANGES**, and all 4 v1 blockers were verified RESOLVED. The new v2 findings are tractable design fixes (not re-architecture) and are folded in below. Reviews are preserved at `/tmp/identity_scrum_v2/{opus,kimi,gemini}_review.md`.
### Convergent v2 findings (≥2 reviewers) — must fix before implementation
**v3-A1: mTLS CA root must NOT live in identityd** (opus + gemini converge). v2 said "self-signed CA managed by identityd at startup". If identityd is both the CA and an authenticated party, compromising identityd compromises the whole trust fabric. **v3 change:** the mTLS CA root lives in **Vault PKI** (or, if Vault PKI is unavailable, an offline root with identityd as a properly issued intermediate). identityd never has access to the root-CA private key.

**v3-A2: Dual-control public key registry must be tamper-evident** (opus + gemini converge). v2 said "pre-registered public keys" but didn't say where. If they live in identityd's Postgres, a DB-write compromise defeats dual-control. **v3 change:** J + counsel public keys live in **Vault KV with separate access policies** (or a signed config file at `/etc/lakehouse/identityd_dual_control.yaml` with its own dual-control rotation procedure). Public-key changes themselves require dual-control attestation. In addition, `nonce_a`/`nonce_b` become **server-issued challenges**: identityd issues a fresh nonce per request, clients sign that nonce, and a 5-min nonce cache provides replay protection.
### Single-reviewer v2 findings — also folded into v3
**v3-B1 (opus): Step 8 fallback-to-SQL needs explicit time bound.** Otherwise it becomes permanent dual-read debt. v3: the fallback is active for at most 14 days post-cutover; every fallback raises an alert; the fallback auto-disables at 14 days regardless of fallback rate.

**v3-B2 (opus): NER drop-on-detect needs alerting.** Without it there is a silent observability gap if a model regression starts hallucinating PII. v3: the drop counter is exposed as the `identityd_ner_drops_total` Prometheus metric; alert on drop_rate >0.1% over a rolling 1h.

**v3-B3 (opus): legal-tier notification transport must be specified.** v2 said "real-time notification to designated counsel + J" without saying how. v3: the notification transport is a **signed Slack webhook OR signed email** (configurable per deployment); the message body never contains candidate_id or PII, only `{event_kind, timestamp, accessor_kind, integrity_hash}`. Notification failure does NOT block the legal-tier action: the failure is logged for follow-up, but token issuance proceeds.

**v3-B4 (opus): Step 6 human review queue needs SLA + volume estimate.** With 500k backfilled rows defaulting to `unknown`, all healthcare-adjacent traffic routes to on-box Ollama until reclassified. v3 budget assumption: auto-pattern-matching pre-classifies ~80% as confidently non-healthcare; the remaining 20% (~100k) stays `unknown` pending review. At ~500/day operator throughput that is ~200 days, or roughly 7 months of queue. **Operational decision needed:** either staff the queue, accept on-box-Ollama-only routing for ~7 months, or relax pattern matching at the cost of a higher false-negative rate. Flag for J.

**v3-B5 (opus): Memory zeroing in Go is non-trivial.** v2 §3.1 step 6 says "explicitly zeros memory", but Go GC-managed strings are immutable and cannot be zeroed. v3 implementation note: the PII handling code path uses `[]byte` throughout and converts to string only at the JSON serialization boundary; an explicit `runtime.GC()` after the request completes is theater (Go won't actually zero the slice). **Acceptance:** "best-effort zero", meaning the `[]byte` slice contents are overwritten post-use. Document that this is best-effort, not cryptographic-grade scrubbing. (Rust-side: use the `zeroize` crate, which IS cryptographic-grade.)

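The "best-effort zero" acceptance in v3-B5 might be sketched as follows. `wipe` and `withPII` are hypothetical helpers; the loader stands in for whatever decrypt call produces the plaintext. As the finding says, this does not reach copies the runtime or downstream code may have made.

```go
package main

import "fmt"

// wipe overwrites b in place. Best-effort only: it cannot reach copies made
// by append/realloc, string conversions, or the JSON encoder.
func wipe(b []byte) {
	for i := range b {
		b[i] = 0
	}
}

// withPII obtains plaintext from load, hands it to use, and wipes the buffer
// on the way out, including on error or panic in use.
func withPII(load func() ([]byte, error), use func(pii []byte) error) error {
	buf, err := load()
	if err != nil {
		return err
	}
	defer wipe(buf)
	return use(buf)
}

func main() {
	var leaked []byte // kept only to show the buffer was overwritten
	err := withPII(
		func() ([]byte, error) { return []byte("Jane Example"), nil }, // stand-in for decrypt
		func(pii []byte) error {
			leaked = pii
			fmt.Printf("handling %d bytes of PII\n", len(pii))
			return nil
		},
	)
	fmt.Println("err:", err)
	fmt.Println("buffer after use:", leaked)
}
```

Keeping the plaintext behind a callback makes the wipe hard to forget, which matters more in practice than the overwrite itself.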
**v3-B6 (kimi): Service-tier purpose_definitions needs versioning + emergency revocation.** A misconfigured purpose with overly-broad `allowed_fields` is standing exfiltration authorization. v3 schema additions:

```sql
CREATE TABLE purpose_versions (
  purpose_token TEXT NOT NULL,
  version INT NOT NULL,
  effective_at TIMESTAMPTZ NOT NULL,
  superseded_at TIMESTAMPTZ,
  PRIMARY KEY (purpose_token, version)
);

CREATE TABLE purpose_revocations (
  purpose_token TEXT PRIMARY KEY,
  revoked_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  revoked_by TEXT NOT NULL,
  reason TEXT
);
```

The auth path checks `purpose_revocations` on every call (uncached, or cached for <5s), so a revocation takes effect in <60s without a code deploy.

**v3-B7 (kimi): Cache invalidation needs erasure-generation atomicity.** v2 §3.6 pub/sub is best-effort. If Redis drops the message or the gateway crashes between erase and purge, the cache holds stale PII. v3 change:
- `subjects` table gains `erasure_generation INT NOT NULL DEFAULT 0`, incremented on every erase.
- Identityd PII responses include the current generation in headers.
- Gateway cache entries are tagged with the generation they were filled at.
- Cache hits with `cached_generation < current_generation` are rejected and force a re-fetch from identityd.
- Net effect: staleness is bounded by ONE request hop rather than being eventually consistent over time.

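The generation check in v3-B7 could be sketched on the gateway side as below. `GenCache` and its methods are hypothetical, and how the gateway learns the newest generation (response headers, the pub/sub event, or both) is left as an assumption behind `Observe`.

```go
package main

import (
	"fmt"
	"sync"
)

type cachedPII struct {
	value      []byte
	generation int // erasure_generation at fill time
}

// GenCache is a hypothetical gateway-side cache keyed by subject ID.
type GenCache struct {
	mu      sync.Mutex
	entries map[string]cachedPII
	current map[string]int // newest erasure_generation observed per subject
}

func NewGenCache() *GenCache {
	return &GenCache{entries: make(map[string]cachedPII), current: make(map[string]int)}
}

// Observe records a generation reported by identityd; it never moves backwards.
func (c *GenCache) Observe(subjectID string, gen int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if gen > c.current[subjectID] {
		c.current[subjectID] = gen
	}
}

// Put stores a value tagged with the generation it was fetched at.
func (c *GenCache) Put(subjectID string, value []byte, gen int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if gen > c.current[subjectID] {
		c.current[subjectID] = gen
	}
	c.entries[subjectID] = cachedPII{value: value, generation: gen}
}

// Get rejects (and evicts) entries filled before the newest known generation.
func (c *GenCache) Get(subjectID string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[subjectID]
	if !ok {
		return nil, false
	}
	if e.generation < c.current[subjectID] {
		delete(c.entries, subjectID)
		return nil, false
	}
	return e.value, true
}

func main() {
	cache := NewGenCache()
	cache.Put("subj-1", []byte("cached PII"), 0)
	_, hit := cache.Get("subj-1")
	fmt.Println("before erase, hit:", hit)

	cache.Observe("subj-1", 1) // identityd reports an erase happened
	_, hit = cache.Get("subj-1")
	fmt.Println("after erase, hit:", hit)
}
```

The pub/sub purge from v2 remains useful as the fast path; the generation tag is what makes a dropped message harmless.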
**v3-B8 (kimi): Dual-control issuance needs cooling-off period to prevent bypass culture.** v3: a 15-minute mandatory delay between nonce submission and token issuance, with notification to BOTH parties at delay-start AND at issuance. Bypass requires both J and counsel to sign an "emergency bypass" attestation that is logged separately and escalated to a board-level reviewer.

**v3-B9 (kimi): NER must have calibrated false-negative rate.** v3: maintain a synthetic adversarial test set (50 hallucinated-PII examples + 50 legitimate non-PII). NER must achieve recall ≥99.5% on the synthetic set in CI; a production drop-rate >0.1% triggers re-calibration.

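The CI recall gate in v3-B9 reduces to a small check, sketched here with a hypothetical `PassesRecallGate` helper. Recall is taken as true positives over all PII-bearing examples, and integer arithmetic is used so the 99.5% boundary is exact.

```go
package main

import "fmt"

// PassesRecallGate reports whether recall = tp / (tp + fn) is at least 99.5%.
// Integer arithmetic avoids floating-point edge cases at the boundary.
func PassesRecallGate(truePositives, falseNegatives int) bool {
	total := truePositives + falseNegatives
	if total == 0 {
		return false // no positive examples means nothing was measured
	}
	return truePositives*1000 >= total*995
}

func main() {
	fmt.Println(PassesRecallGate(50, 0))  // true
	fmt.Println(PassesRecallGate(49, 1))  // false
	fmt.Println(PassesRecallGate(199, 1)) // true
}
```

With only 50 PII-bearing examples a single miss gives 98% recall, so the 99.5% bar is effectively a zero-miss requirement at that set size; the threshold only admits one miss once the positive set reaches about 200 examples.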
**v3-B10 (kimi): S3 Object Lock root credentials need separate AWS account.** The same ops team holding Vault admin + AWS root means limited independence. v3: the Object Lock bucket lives in a **separate AWS account** with a **write-only IAM role** issued to identityd. Root credentials for that account are held by a named external party (could be outside counsel) with quarterly attestation.

**v3-B11 (kimi): biometric_consent_status='never_collected' may not satisfy BIPA infrastructure-as-notice argument.** Plaintiffs may argue that the presence of biometric schema fields constitutes constructive notice of intent. v3: the Phase 1.6 BIPA policy doc must explicitly attest "no biometric data exists in the pre-identityd substrate" with cryptographic evidence (hash of the pre-identityd schema). Counsel must bless this attestation.

**v3-B12 (gemini): Backup retention window vs ciphertext-deletion erasure.** Phase 2 uses ciphertext deletion + single-DEK; if Postgres backups retain pre-erasure ciphertext and the DEK still exists, the data is recoverable from backup. v3: the erasure runbook states a maximum backup retention period (e.g., 30 days). RTBF status is reported to the subject as "erased now; residual in backups expires by {date}." For full crypto-erasure-on-demand, Phase 7 per-row keys are the path.
### Net v3 effort delta
v2 estimate: 12-15 days. v3 amendments add:
- mTLS CA via Vault PKI integration: +0.5 day
- Public-key registry in Vault KV: +0.5 day
- Server-issued nonces for dual-control: +0.25 day
- Step 8 fallback time bound: +0.1 day
- NER drop-rate metric + alert: +0.25 day
- Legal-tier notification transport: +0.5 day
- purpose_versions + purpose_revocations: +0.5 day
- erasure_generation cache atomicity: +0.5 day
- 15-min cooling-off: +0.25 day
- NER calibrated test set: +0.5 day
- Separate AWS account for Object Lock: +0.5 day
- BIPA infrastructure-as-notice attestation: +0.25 day (mostly doc work)
- Backup-retention erasure runbook: +0.25 day

**Total v3 delta: ~5 days (the items above sum to 4.85). Revised estimate: 17-20 days.**

The cost is real. The reason it's worth paying is what those 5 days buy: a design that 3 independent senior security architects (across 3 model lineages) all said they would build. v1 said "do not build." v3 says "build."
### What's v3-deferred vs v3-must-have
**v3 must-have (block implementation):** v3-A1, v3-A2, v3-B1, v3-B6, v3-B7, v3-B11.

**v3 should-have (ship in Phase 2 if calendar allows; otherwise Phase 5):** v3-B2, v3-B3, v3-B4 (operational decision), v3-B5 (best-effort), v3-B8, v3-B9, v3-B10, v3-B12.

If schedule pressure forces a cut, the should-have items can ship in Phase 5 (identity service build completion). The must-haves cannot, because they're integral to the security boundary.

---
## Change log
- 2026-05-03: v1 initial draft.
- 2026-05-03: v2 post-first-scrum: 3/3 reviewer convergent findings folded in. 4 "would not build" blockers all resolved. Re-scrum recommended.
- 2026-05-03: v3 post-second-scrum: 3/3 BUILD-WITH-CHANGES. v1 blockers verified resolved. 14 new v2 findings (2 convergent, 12 single-reviewer) folded into v3 amendments (§12). Re-scrum NOT required for v3 because of diminishing returns; the must-have items are concrete fixes with clear acceptance criteria.