identity service: v3 amendments — second-pass scrum BUILD-WITH-CHANGES
Re-scrummed v2 across opus + kimi + gemini. All 3 verdict:
BUILD-WITH-CHANGES. v1 blockers verified RESOLVED. 12 new v2
findings folded as v3 amendments in §12.
Convergent v2 findings (≥2 reviewers):
v3-A1: mTLS CA root must NOT live in identityd (opus + gemini).
v3 fix: Vault PKI for CA, identityd as intermediate.
v3-A2: Dual-control public key registry must be tamper-evident
(opus + gemini). v3 fix: Vault KV with separate access
policies + server-issued nonces for replay protection.
Single-reviewer v3 amendments (10 more):
- B1: Step 8 fallback-to-SQL needs explicit 14-day time bound
- B2: NER drop-on-detect needs Prometheus alerting
- B3: legal-tier notification transport spec'd (signed Slack/email,
no PII in body, failure non-blocking)
- B4: Step 6 human review SLA flagged — ~7 months at 500/day for
~100k unknown rows; operational decision needed
- B5: Memory zeroing in Go is best-effort (Rust uses zeroize crate);
documented as not cryptographic-grade
- B6: purpose_definitions needs versioning + emergency revocation
(purpose_versions + purpose_revocations tables)
- B7: Cache invalidation needs erasure_generation atomicity
(subjects.erasure_generation int; gateway rejects stale-gen
cache hits) — replaces best-effort pub/sub
- B8: 15-min cooling-off period for dual-control issuance to
prevent emergency-bypass culture
- B9: NER calibrated test set with target recall ≥99.5% on
synthetic adversarial PII
- B10: S3 Object Lock in separate AWS account with write-only IAM;
root credentials held by external party
- B11: BIPA infrastructure-as-notice attestation in Phase 1.6 doc
- B12: Backup retention vs ciphertext-deletion erasure window
documented in RTBF runbook
Estimate revised v2 12-15d → v3 17-20d. Worth it — the cost is what
buys "I would build this" from 3 independent senior security
architects across 3 model lineages.
Must-have v3 items (block implementation): A1, A2, B1, B6, B7, B11.
Should-have (ship in Phase 5 if calendar tight): B2-B5, B8-B10, B12.
Re-scrum NOT recommended for v3 — diminishing returns; must-have
items are concrete fixes with clear acceptance criteria.
No code changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
## 12. v3 amendments (post-second-pass scrum, 2026-05-03 evening)
v2 was re-scrummed across opus + kimi + gemini. **All 3 verdict: BUILD-WITH-CHANGES.** All 4 v1 blockers verified RESOLVED. New v2 findings are tractable design fixes (not re-architecture). Folded in below. Reviews preserved at `/tmp/identity_scrum_v2/{opus,kimi,gemini}_review.md`.
### Convergent v2 findings (≥2 reviewers) — must fix before implementation
**v3-A1 — mTLS CA root must NOT live in identityd** (opus + gemini converge). v2 said "self-signed CA managed by identityd at startup" — if identityd is both the CA and authenticated party, compromise of identityd compromises the whole trust fabric. **v3 change:** mTLS CA root lives in **Vault PKI** (or, if Vault PKI is unavailable, an offline root with identityd as a properly-issued intermediate). identityd never has root-CA private key access.
**v3-A2 — Dual-control public key registry must be tamper-evident** (opus + gemini converge). v2 said "pre-registered public keys" but didn't say where. If they live in identityd's Postgres, DB-write compromise defeats dual-control. **v3 change:** J + counsel public keys live in **Vault KV with separate access policies** (or a signed config file at `/etc/lakehouse/identityd_dual_control.yaml` with its own dual-control rotation procedure). Public-key changes themselves require dual-control attestation. Plus: `nonce_a`/`nonce_b` get **server-issued challenges** (identityd issues a fresh nonce per request; clients sign that nonce; replay-protection via 5-min nonce cache).
### Single-reviewer v2 findings — also folded into v3
**v3-B1 (opus): Step 8 fallback-to-SQL needs explicit time bound.** Otherwise becomes permanent dual-read debt. v3: fallback active for max 14 days post-cutover; alert on every fallback; auto-disable at 14 days regardless of fallback rate.
**v3-B2 (opus): NER drop-on-detect needs alerting.** Silent observability gap if model regression starts hallucinating PII. v3: drop counter exposed as `identityd_ner_drops_total` Prometheus metric; alert on drop_rate >0.1% over rolling 1h.
**v3-B3 (opus): legal-tier notification transport must be specified.** v2 said "real-time notification to designated counsel + J" without saying how. v3: notification transport is **signed Slack webhook OR signed email** (configurable per deployment); message body never contains candidate_id or PII; only `{event_kind, timestamp, accessor_kind, integrity_hash}`. Notification failure does NOT block the legal-tier action — failure is logged for follow-up but token issuance proceeds.
**v3-B4 (opus): Step 6 human review queue needs SLA + volume estimate.** With 500k backfilled rows defaulting to `unknown`, all healthcare-adjacent traffic routes to on-box Ollama until reclassified. v3: budget assumption — auto-pattern-matching pre-classifies ~80% to confidently-non-healthcare; remaining 20% (~100k) stays `unknown` pending review at ~500/day operator throughput = ~7 months of queue. **Operational decision needed:** either staff the queue, accept the on-box-Ollama-only routing for ~7 months, or relax pattern matching at the cost of higher false-negative rate. Flag for J.
**v3-B5 (opus): Memory zeroing in Go is non-trivial.** v2 §3.1 step 6 says "explicitly zeros memory" — Go GC-managed strings are immutable, not zeroable. v3 implementation note: PII handling code path uses `[]byte` throughout; convert to string only at the JSON serialization boundary; explicit `runtime.GC()` after request completes is theater (Go won't actually zero the slice). **Acceptance:** "best-effort zero" — overwrite the `[]byte` slice contents post-use. Document this is best-effort, not cryptographic-grade scrubbing. (Rust-side: use the `zeroize` crate which IS cryptographic-grade.)
**v3-B6 (kimi): Service-tier purpose_definitions needs versioning + emergency revocation.** A misconfigured purpose with overly-broad `allowed_fields` is standing exfiltration authorization. v3 schema additions:
```sql
CREATE TABLE purpose_versions (
    purpose_token TEXT NOT NULL,
    version       INT NOT NULL,
    effective_at  TIMESTAMPTZ NOT NULL,
    superseded_at TIMESTAMPTZ,
    PRIMARY KEY (purpose_token, version)
);

CREATE TABLE purpose_revocations (
    purpose_token TEXT PRIMARY KEY,
    revoked_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    revoked_by    TEXT NOT NULL,
    reason        TEXT
);
```
The auth path checks `purpose_revocations` on every call (uncached, or cached for under 5 seconds). Revocation therefore takes effect in under 60 seconds with no code deploy.
**v3-B7 (kimi): Cache invalidation needs erasure-generation atomicity.** v2 §3.6 pub/sub is best-effort. If Redis drops the message or gateway crashes between erase + purge, cache holds stale PII. v3 change:
- `subjects` table gains `erasure_generation INT NOT NULL DEFAULT 0`; increments on every erase
- identityd PII responses include the current generation in a response header
- gateway cache entries are tagged with the generation they were filled at
- cache hits with `cached_generation < current_generation` are rejected and force a re-fetch from identityd
- net effect: consistency is restored within ONE request hop, not eventually over time
**v3-B8 (kimi): Dual-control issuance needs cooling-off period to prevent bypass culture.** v3: 15-minute mandatory delay between nonce submission and token issuance, with notification to BOTH parties at delay-start AND issuance. Bypass requires both J and counsel to sign an "emergency bypass" attestation that's logged separately + escalated to a board-level reviewer.
**v3-B9 (kimi): NER must have calibrated false-negative rate.** v3: maintain a synthetic adversarial test set (50 hallucinated-PII examples + 50 legitimate non-PII). NER must achieve recall ≥99.5% on the synthetic set in CI; production drop-rate >0.1% triggers re-calibration.
**v3-B10 (kimi): S3 Object Lock root credentials need separate AWS account.** Same ops team holding Vault admin + AWS root = limited independence. v3: Object Lock bucket lives in **separate AWS account** with **write-only IAM role** issued to identityd. Root credentials for that account held by named external party (could be outside counsel) with quarterly attestation.
**v3-B11 (kimi): biometric_consent_status='never_collected' may not satisfy BIPA infrastructure-as-notice argument.** Plaintiffs may argue presence of biometric schema fields constitutes constructive notice of intent. v3: Phase 1.6 BIPA policy doc must explicitly attest "no biometric data exists in the pre-identityd substrate" with cryptographic evidence (hash of pre-identityd schema). Counsel must bless this attestation.
**v3-B12 (gemini): Backup retention window vs ciphertext-deletion erasure.** Phase 2 uses ciphertext deletion + single-DEK; if Postgres backups retain pre-erasure ciphertext + the DEK still exists, backup is recoverable. v3 erasure runbook: includes maximum backup retention period (e.g., 30 days). RTBF status reported to subject as "erased now; residual in backups expires by {date}." For full crypto-erasure-on-demand, Phase 7 per-row keys is the path.
### Net v3 effort delta
v2 estimate: 12-15 days. v3 amendments add:

- mTLS CA via Vault PKI integration: +0.5 day
- Public-key registry in Vault KV: +0.5 day
- Server-issued nonces for dual-control: +0.25 day
- Step 8 fallback time bound: +0.1 day
- NER drop-rate metric + alert: +0.25 day
- Legal-tier notification transport: +0.5 day
- purpose_versions + purpose_revocations: +0.5 day
- erasure_generation cache atomicity: +0.5 day
- 15-min cooling-off: +0.25 day
- NER calibrated test set: +0.5 day
- Separate AWS account for Object Lock: +0.5 day
- BIPA infrastructure-as-notice attestation: +0.25 day (mostly doc work)
- Backup-retention erasure runbook: +0.25 day

**Total v3 delta: ~5 days (items sum to 4.85). Revised estimate: 17-20 days.**
The cost is real. The reason it's worth paying is what those 5 days buy: a design that 3 independent senior security architects (across 3 model lineages) all said they would build. v1 said "do not build." v3 says "build."
### What's v3-deferred vs v3-must-have
**v3 must-have (block implementation):** v3-A1, v3-A2, v3-B1, v3-B6, v3-B7, v3-B11.
**v3 should-have (ship in Phase 2 if calendar allows; otherwise Phase 5):** v3-B2, v3-B3, v3-B4 (operational decision), v3-B5 (best-effort), v3-B8, v3-B9, v3-B10, v3-B12.
If schedule pressure forces a cut, the should-have items can ship in Phase 5 (identity service build completion). The must-haves cannot — they're integral to the security boundary.
---
## Change log
- 2026-05-03 — v1 initial draft.
- 2026-05-03 — v2 post-first-scrum: 3/3 reviewer convergent findings folded in. 4 "would not build" blockers all resolved. Re-scrum before implementation recommended.
- 2026-05-03 — v3 post-second-scrum: 3/3 BUILD-WITH-CHANGES. v1 blockers verified resolved. 12 new v2 findings folded into v3 amendments (§12). Re-scrum NOT required for v3 — diminishing returns; the must-have items are concrete fixes with clear acceptance criteria.