diff --git a/docs/IDENTITY_SERVICE_DESIGN.md b/docs/IDENTITY_SERVICE_DESIGN.md
index 5d85aab..2231fab 100644
--- a/docs/IDENTITY_SERVICE_DESIGN.md
+++ b/docs/IDENTITY_SERVICE_DESIGN.md
@@ -476,7 +476,99 @@ All three reviewers said "I would not build v1 as written." All four blockers ar
 ---
 
+## 12. v3 amendments (post-second-pass scrum, 2026-05-03 evening)
+
+v2 was re-scrummed across opus + kimi + gemini. **All 3 verdicts: BUILD-WITH-CHANGES.** All 4 v1 blockers verified RESOLVED. The new v2 findings are tractable design fixes (not re-architecture) and are folded in below. Reviews preserved at `/tmp/identity_scrum_v2/{opus,kimi,gemini}_review.md`.
+
+### Convergent v2 findings (≥2 reviewers) — must fix before implementation
+
+**v3-A1 — mTLS CA root must NOT live in identityd** (opus + gemini converge). v2 said "self-signed CA managed by identityd at startup" — if identityd is both the CA and an authenticated party, compromise of identityd compromises the whole trust fabric. **v3 change:** the mTLS CA root lives in **Vault PKI** (or, if Vault PKI is unavailable, an offline root with identityd as a properly-issued intermediate). identityd never has root-CA private-key access.
+
+**v3-A2 — Dual-control public-key registry must be tamper-evident** (opus + gemini converge). v2 said "pre-registered public keys" but didn't say where. If they live in identityd's Postgres, a DB-write compromise defeats dual-control. **v3 change:** J + counsel public keys live in **Vault KV with separate access policies** (or a signed config file at `/etc/lakehouse/identityd_dual_control.yaml` with its own dual-control rotation procedure). Public-key changes themselves require dual-control attestation. Plus: `nonce_a`/`nonce_b` become **server-issued challenges** — identityd issues a fresh nonce per request, clients sign that nonce, and a 5-min nonce cache provides replay protection (see the Go sketch after v3-B4 below).
+
+### Single-reviewer v2 findings — also folded into v3
+
+**v3-B1 (opus): Step 8 fallback-to-SQL needs an explicit time bound.** Otherwise it becomes permanent dual-read debt. v3: fallback active for max 14 days post-cutover; alert on every fallback; auto-disable at 14 days regardless of fallback rate.
+
+**v3-B2 (opus): NER drop-on-detect needs alerting.** Without it, a model regression that starts hallucinating PII is a silent observability gap. v3: drop counter exposed as an `identityd_ner_drops_total` Prometheus metric; alert on drop rate >0.1% over a rolling 1h window.
+
+**v3-B3 (opus): Legal-tier notification transport must be specified.** v2 said "real-time notification to designated counsel + J" without saying how. v3: notification transport is a **signed Slack webhook OR signed email** (configurable per deployment); the message body never contains candidate_id or PII — only `{event_kind, timestamp, accessor_kind, integrity_hash}`. Notification failure does NOT block the legal-tier action; the failure is logged for follow-up and token issuance proceeds.
+
+**v3-B4 (opus): Step 6 human review queue needs an SLA + volume estimate.** With 500k backfilled rows defaulting to `unknown`, all healthcare-adjacent traffic routes to on-box Ollama until reclassified. v3 budget assumption: auto-pattern-matching pre-classifies ~80% as confidently non-healthcare; the remaining 20% (~100k rows) stays `unknown` pending review, and at ~500/day operator throughput that is ~200 days (~7 months) of queue. **Operational decision needed:** staff the queue, accept on-box-Ollama-only routing for ~7 months, or relax the pattern matching at the cost of a higher false-negative rate. Flag for J.
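+
+For concreteness, a minimal Go sketch of the v3-A2 server-issued challenge flow. Illustrative only: `ChallengeStore` and its in-process map are assumptions (a multi-replica deployment needs a shared cache), ed25519 stands in for whatever signature scheme the registry actually holds, and production key lookup reads the Vault KV registry rather than taking keys as arguments.
+
+```go
+package dualcontrol
+
+import (
+	"crypto/ed25519"
+	"crypto/rand"
+	"errors"
+	"sync"
+	"time"
+)
+
+const nonceTTL = 5 * time.Minute // replay window from v3-A2
+
+type challenge struct {
+	issuedAt time.Time
+	used     bool
+}
+
+// ChallengeStore is a hypothetical in-process nonce cache.
+type ChallengeStore struct {
+	mu     sync.Mutex
+	nonces map[[32]byte]*challenge
+}
+
+func NewChallengeStore() *ChallengeStore {
+	return &ChallengeStore{nonces: make(map[[32]byte]*challenge)}
+}
+
+// IssueChallenge mints a fresh random nonce. Clients must sign exactly
+// this value, so a signature captured off the wire cannot be replayed
+// against a later request.
+func (s *ChallengeStore) IssueChallenge() ([32]byte, error) {
+	var n [32]byte
+	if _, err := rand.Read(n[:]); err != nil {
+		return n, err
+	}
+	s.mu.Lock()
+	defer s.mu.Unlock()
+	s.nonces[n] = &challenge{issuedAt: time.Now()}
+	return n, nil
+}
+
+// VerifyDualControl checks both parties' signatures over the same nonce
+// and consumes it. jPub and counselPub come from the tamper-evident
+// registry (Vault KV), never from identityd's Postgres.
+func (s *ChallengeStore) VerifyDualControl(nonce [32]byte, sigJ, sigCounsel []byte, jPub, counselPub ed25519.PublicKey) error {
+	s.mu.Lock()
+	defer s.mu.Unlock()
+	c, ok := s.nonces[nonce]
+	switch {
+	case !ok:
+		return errors.New("unknown nonce")
+	case c.used:
+		return errors.New("nonce already consumed (replay)")
+	case time.Since(c.issuedAt) > nonceTTL:
+		return errors.New("nonce expired")
+	}
+	if !ed25519.Verify(jPub, nonce[:], sigJ) || !ed25519.Verify(counselPub, nonce[:], sigCounsel) {
+		return errors.New("signature verification failed")
+	}
+	c.used = true // consume on first success; the TTL only bounds cache growth
+	return nil
+}
+```
+
+Consuming the nonce on first successful verification is what closes the replay window; the 5-minute TTL merely bounds how long unredeemed challenges live.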
+
+**v3-B5 (opus): Memory zeroing in Go is non-trivial.** v2 §3.1 step 6 says "explicitly zeros memory" — but Go's GC-managed strings are immutable, not zeroable. v3 implementation note: the PII-handling code path uses `[]byte` throughout and converts to string only at the JSON serialization boundary; an explicit `runtime.GC()` after the request completes is theater (Go won't actually zero the slice). **Acceptance:** "best-effort zero" — overwrite the `[]byte` slice contents post-use, and document that this is best-effort, not cryptographic-grade scrubbing. (Rust side: use the `zeroize` crate, which IS cryptographic-grade.)
+
+**v3-B6 (kimi): Service-tier purpose_definitions needs versioning + emergency revocation.** A misconfigured purpose with overly-broad `allowed_fields` is a standing exfiltration authorization. v3 schema additions:
+
+```sql
+-- Append-only history; the live definition is the highest version
+-- with superseded_at IS NULL.
+CREATE TABLE purpose_versions (
+    purpose_token TEXT NOT NULL,
+    version INT NOT NULL,
+    effective_at TIMESTAMPTZ NOT NULL,
+    superseded_at TIMESTAMPTZ,
+    PRIMARY KEY (purpose_token, version)
+);
+
+-- Kill switch: a row here disables the purpose immediately.
+CREATE TABLE purpose_revocations (
+    purpose_token TEXT PRIMARY KEY,
+    revoked_at TIMESTAMPTZ NOT NULL DEFAULT now(),
+    revoked_by TEXT NOT NULL,
+    reason TEXT
+);
+```
+
+The auth path checks `purpose_revocations` on every call (uncached, OR cached <5s). Revocation takes effect in <60s without a code deploy.
+
+**v3-B7 (kimi): Cache invalidation needs erasure-generation atomicity.** v2 §3.6 pub/sub is best-effort: if Redis drops the message or the gateway crashes between erase + purge, the cache holds stale PII. v3 change (see the gateway-side sketch after v3-B11):
+- `subjects` table gains `erasure_generation INT NOT NULL DEFAULT 0`, incremented on every erase.
+- identityd PII responses include the current generation in headers.
+- Gateway cache entries are tagged with the generation they were filled at.
+- Cache hits with `cached_generation < current_generation` are rejected and force a re-fetch from identityd.
+- Net effect: staleness is bounded to ONE request hop, not eventually-consistent over time.
+
+**v3-B8 (kimi): Dual-control issuance needs a cooling-off period to prevent bypass culture.** v3: 15-minute mandatory delay between nonce submission and token issuance, with notification to BOTH parties at delay start AND at issuance. Bypass requires both J and counsel to sign an "emergency bypass" attestation that is logged separately + escalated to a board-level reviewer.
+
+**v3-B9 (kimi): NER must have a calibrated false-negative rate.** v3: maintain a synthetic adversarial test set (50 hallucinated-PII examples + 50 legitimate non-PII). NER must achieve recall ≥99.5% on the synthetic set in CI; production drop rate >0.1% triggers re-calibration.
+
+**v3-B10 (kimi): S3 Object Lock root credentials need a separate AWS account.** The same ops team holding Vault admin + AWS root = limited independence. v3: the Object Lock bucket lives in a **separate AWS account** with a **write-only IAM role** issued to identityd. Root credentials for that account are held by a named external party (could be outside counsel) with quarterly attestation.
+
+**v3-B11 (kimi): biometric_consent_status='never_collected' may not defeat the BIPA infrastructure-as-notice argument.** Plaintiffs may argue that the presence of biometric schema fields constitutes constructive notice of intent. v3: the Phase 1.6 BIPA policy doc must explicitly attest "no biometric data exists in the pre-identityd substrate" with cryptographic evidence (a hash of the pre-identityd schema). Counsel must bless this attestation.
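+
+A gateway-side Go sketch of the v3-B7 generation check. Illustrative only: the cache shape and reading the generation from a response header are assumptions, not spec. The invariant is the point — an entry filled at generation g is unusable once the subject's erasure generation exceeds g, even if the pub/sub purge message was lost.
+
+```go
+package gatewaycache
+
+import "sync"
+
+type entry struct {
+	gen  int64  // erasure_generation the entry was filled at
+	body []byte // cached identityd PII response
+}
+
+// GenerationCache is a hypothetical subject-keyed response cache.
+type GenerationCache struct {
+	mu      sync.Mutex
+	entries map[string]entry
+}
+
+func New() *GenerationCache {
+	return &GenerationCache{entries: make(map[string]entry)}
+}
+
+// Put stores a response along with the generation identityd reported
+// for the subject (e.g. via a response header).
+func (c *GenerationCache) Put(key string, gen int64, body []byte) {
+	c.mu.Lock()
+	defer c.mu.Unlock()
+	c.entries[key] = entry{gen: gen, body: body}
+}
+
+// Get returns a hit only if the cached entry is at least as new as the
+// subject's current generation. A stale entry is evicted and reported
+// as a miss, forcing a re-fetch from identityd — this is what bounds
+// staleness to one request hop.
+func (c *GenerationCache) Get(key string, currentGen int64) ([]byte, bool) {
+	c.mu.Lock()
+	defer c.mu.Unlock()
+	e, ok := c.entries[key]
+	if !ok || e.gen < currentGen {
+		delete(c.entries, key)
+		return nil, false
+	}
+	return e.body, true
+}
+```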
+
+**v3-B12 (gemini): Backup retention window vs ciphertext-deletion erasure.** Phase 2 uses ciphertext deletion + a single DEK; if Postgres backups retain pre-erasure ciphertext and the DEK still exists, the backup is recoverable. v3: the erasure runbook includes a maximum backup retention period (e.g., 30 days), and RTBF status is reported to the subject as "erased now; residual in backups expires by {date}." For full crypto-erasure-on-demand, Phase 7 per-row keys is the path.
+
+### Net v3 effort delta
+
+v2 estimate: 12-15 days. The v3 amendments add:
+- mTLS CA via Vault PKI integration: +0.5 day
+- Public-key registry in Vault KV: +0.5 day
+- Server-issued nonces for dual-control: +0.25 day
+- Step 8 fallback time bound: +0.1 day
+- NER drop-rate metric + alert: +0.25 day
+- Legal-tier notification transport: +0.5 day
+- purpose_versions + purpose_revocations: +0.5 day
+- erasure_generation cache atomicity: +0.5 day
+- 15-min cooling-off: +0.25 day
+- NER calibrated test set: +0.5 day
+- Separate AWS account for Object Lock: +0.5 day
+- BIPA infrastructure-as-notice attestation: +0.25 day (mostly doc work)
+- Backup-retention erasure runbook: +0.25 day
+
+**Total v3 delta: ~5 days. Revised estimate: 17-20 days.**
+
+The cost is real, and worth paying for what those 5 days buy: a design that 3 independent senior security architects (across 3 model lineages) all said they would build. v1 said "do not build." v3 says "build."
+
+### What's v3-deferred vs v3-must-have
+
+**v3 must-have (blocks implementation):** v3-A1, v3-A2, v3-B1, v3-B6, v3-B7, v3-B11.
+
+**v3 should-have (ship in Phase 2 if the calendar allows; otherwise Phase 5):** v3-B2, v3-B3, v3-B4 (operational decision), v3-B5 (best-effort), v3-B8, v3-B9, v3-B10, v3-B12.
+
+If schedule pressure forces a cut, the should-have items can ship in Phase 5 (identity service build completion). The must-haves cannot — they are integral to the security boundary.
+
+---
+
 ## Change log
 
 - 2026-05-03 — v1 initial draft.
-- 2026-05-03 — v2 post-scrum: 3/3 reviewer convergent findings folded in. 4 "would not build" blockers all resolved. Re-scrum before implementation recommended.
+- 2026-05-03 — v2 post-first-scrum: 3/3 reviewer convergent findings folded in. 4 "would not build" blockers all resolved. Re-scrum recommended.
+- 2026-05-03 — v3 post-second-scrum: 3/3 BUILD-WITH-CHANGES. v1 blockers verified resolved. 12 new v2 findings folded into the v3 amendments (§12). Re-scrum NOT required for v3 — diminishing returns; the must-have items are concrete fixes with clear acceptance criteria.