From 8129ddd8834f58faefd55c2f7864460877125e95 Mon Sep 17 00:00:00 2001 From: root Date: Sun, 3 May 2026 01:39:35 -0500 Subject: [PATCH] =?UTF-8?q?identity=20service:=20v3=20amendments=20?= =?UTF-8?q?=E2=80=94=20second-pass=20scrum=20BUILD-WITH-CHANGES?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Re-scrummed v2 across opus + kimi + gemini. All 3 verdict: BUILD-WITH-CHANGES. v1 blockers verified RESOLVED. 12 new v2 findings folded as v3 amendments in §12. Convergent v2 findings (≥2 reviewers): v3-A1: mTLS CA root must NOT live in identityd (opus + gemini). v3 fix: Vault PKI for CA, identityd as intermediate. v3-A2: Dual-control public key registry must be tamper-evident (opus + gemini). v3 fix: Vault KV with separate access policies + server-issued nonces for replay protection. Single-reviewer v3 amendments (10 more): - B1: Step 8 fallback-to-SQL needs explicit 14-day time bound - B2: NER drop-on-detect needs Prometheus alerting - B3: legal-tier notification transport spec'd (signed Slack/email, no PII in body, failure non-blocking) - B4: Step 6 human review SLA flagged — ~7 months at 500/day for ~100k unknown rows; operational decision needed - B5: Memory zeroing in Go is best-effort (Rust uses zeroize crate); documented as not cryptographic-grade - B6: purpose_definitions needs versioning + emergency revocation (purpose_versions + purpose_revocations tables) - B7: Cache invalidation needs erasure_generation atomicity (subjects.erasure_generation int; gateway rejects stale-gen cache hits) — replaces best-effort pub/sub - B8: 15-min cooling-off period for dual-control issuance to prevent emergency-bypass culture - B9: NER calibrated test set with target recall ≥99.5% on synthetic adversarial PII - B10: S3 Object Lock in separate AWS account with write-only IAM; root credentials held by external party - B11: BIPA infrastructure-as-notice attestation in Phase 1.6 doc - B12: Backup retention vs ciphertext-deletion erasure window documented in RTBF runbook Estimate revised v2 12-15d → v3 17-20d. Worth it — the cost is what buys "I would build this" from 3 independent senior security architects across 3 model lineages. Must-have v3 items (block implementation): A1, A2, B1, B6, B7, B11. Should-have (ship in Phase 5 if calendar tight): B2-B5, B8-B10, B12. Re-scrum NOT recommended for v3 — diminishing returns; must-have items are concrete fixes with clear acceptance criteria. No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/IDENTITY_SERVICE_DESIGN.md | 94 ++++++++++++++++++++++++++++++++- 1 file changed, 93 insertions(+), 1 deletion(-) diff --git a/docs/IDENTITY_SERVICE_DESIGN.md b/docs/IDENTITY_SERVICE_DESIGN.md index 5d85aab..2231fab 100644 --- a/docs/IDENTITY_SERVICE_DESIGN.md +++ b/docs/IDENTITY_SERVICE_DESIGN.md @@ -476,7 +476,99 @@ All three reviewers said "I would not build v1 as written." All four blockers ar --- +## 12. v3 amendments (post-second-pass scrum, 2026-05-03 evening) + +v2 was re-scrummed across opus + kimi + gemini. **All 3 verdict: BUILD-WITH-CHANGES.** All 4 v1 blockers verified RESOLVED. New v2 findings are tractable design fixes (not re-architecture). Folded in below. Reviews preserved at `/tmp/identity_scrum_v2/{opus,kimi,gemini}_review.md`. + +### Convergent v2 findings (≥2 reviewers) — must fix before implementation + +**v3-A1 — mTLS CA root must NOT live in identityd** (opus + gemini converge). v2 said "self-signed CA managed by identityd at startup" — if identityd is both the CA and authenticated party, compromise of identityd compromises the whole trust fabric. **v3 change:** mTLS CA root lives in **Vault PKI** (or, if Vault PKI is unavailable, an offline root with identityd as a properly-issued intermediate). identityd never has root-CA private key access. + +**v3-A2 — Dual-control public key registry must be tamper-evident** (opus + gemini converge). v2 said "pre-registered public keys" but didn't say where. If they live in identityd's Postgres, DB-write compromise defeats dual-control. **v3 change:** J + counsel public keys live in **Vault KV with separate access policies** (or a signed config file at `/etc/lakehouse/identityd_dual_control.yaml` with its own dual-control rotation procedure). Public-key changes themselves require dual-control attestation. Plus: `nonce_a`/`nonce_b` get **server-issued challenges** (identityd issues a fresh nonce per request; clients sign that nonce; replay-protection via 5-min nonce cache). + +### Single-reviewer v2 findings — also folded into v3 + +**v3-B1 (opus): Step 8 fallback-to-SQL needs explicit time bound.** Otherwise becomes permanent dual-read debt. v3: fallback active for max 14 days post-cutover; alert on every fallback; auto-disable at 14 days regardless of fallback rate. + +**v3-B2 (opus): NER drop-on-detect needs alerting.** Silent observability gap if model regression starts hallucinating PII. v3: drop counter exposed as `identityd_ner_drops_total` Prometheus metric; alert on drop_rate >0.1% over rolling 1h. + +**v3-B3 (opus): legal-tier notification transport must be specified.** v2 said "real-time notification to designated counsel + J" without saying how. v3: notification transport is **signed Slack webhook OR signed email** (configurable per deployment); message body never contains candidate_id or PII; only `{event_kind, timestamp, accessor_kind, integrity_hash}`. Notification failure does NOT block the legal-tier action — failure is logged for follow-up but token issuance proceeds. + +**v3-B4 (opus): Step 6 human review queue needs SLA + volume estimate.** With 500k backfilled rows defaulting to `unknown`, all healthcare-adjacent traffic routes to on-box Ollama until reclassified. v3: budget assumption — auto-pattern-matching pre-classifies ~80% to confidently-non-healthcare; remaining 20% (~100k) stays `unknown` pending review at ~500/day operator throughput = ~7 months of queue. **Operational decision needed:** either staff the queue, accept the on-box-Ollama-only routing for ~7 months, or relax pattern matching at the cost of higher false-negative rate. Flag for J. + +**v3-B5 (opus): Memory zeroing in Go is non-trivial.** v2 §3.1 step 6 says "explicitly zeros memory" — Go GC-managed strings are immutable, not zeroable. v3 implementation note: PII handling code path uses `[]byte` throughout; convert to string only at the JSON serialization boundary; explicit `runtime.GC()` after request completes is theater (Go won't actually zero the slice). **Acceptance:** "best-effort zero" — overwrite the `[]byte` slice contents post-use. Document this is best-effort, not cryptographic-grade scrubbing. (Rust-side: use the `zeroize` crate which IS cryptographic-grade.) + +**v3-B6 (kimi): Service-tier purpose_definitions needs versioning + emergency revocation.** A misconfigured purpose with overly-broad `allowed_fields` is standing exfiltration authorization. v3 schema additions: + +```sql +CREATE TABLE purpose_versions ( + purpose_token TEXT NOT NULL, + version INT NOT NULL, + effective_at TIMESTAMPTZ NOT NULL, + superseded_at TIMESTAMPTZ, + PRIMARY KEY (purpose_token, version) +); + +CREATE TABLE purpose_revocations ( + purpose_token TEXT PRIMARY KEY, + revoked_at TIMESTAMPTZ NOT NULL DEFAULT now(), + revoked_by TEXT NOT NULL, + reason TEXT +); +``` + +Auth path checks `purpose_revocations` on every call (not cached, OR cached <5s). Revocation is <60s without code deploy. + +**v3-B7 (kimi): Cache invalidation needs erasure-generation atomicity.** v2 §3.6 pub/sub is best-effort. If Redis drops the message or gateway crashes between erase + purge, cache holds stale PII. v3 change: +- `subjects` table gains `erasure_generation INT NOT NULL DEFAULT 0`. Increments on every erase. +- Identityd PII responses include current generation in headers +- Gateway cache entries tagged with the generation they were filled at +- Cache hits with `cached_generation < current_generation` are rejected, force re-fetch from identityd +- Eventually-consistent within ONE request hop, not eventually-consistent over time + +**v3-B8 (kimi): Dual-control issuance needs cooling-off period to prevent bypass culture.** v3: 15-minute mandatory delay between nonce submission and token issuance, with notification to BOTH parties at delay-start AND issuance. Bypass requires both J and counsel to sign an "emergency bypass" attestation that's logged separately + escalated to a board-level reviewer. + +**v3-B9 (kimi): NER must have calibrated false-negative rate.** v3: maintain a synthetic adversarial test set (50 hallucinated-PII examples + 50 legitimate non-PII). NER must achieve recall ≥99.5% on the synthetic set in CI; production drop-rate >0.1% triggers re-calibration. + +**v3-B10 (kimi): S3 Object Lock root credentials need separate AWS account.** Same ops team holding Vault admin + AWS root = limited independence. v3: Object Lock bucket lives in **separate AWS account** with **write-only IAM role** issued to identityd. Root credentials for that account held by named external party (could be outside counsel) with quarterly attestation. + +**v3-B11 (kimi): biometric_consent_status='never_collected' may not satisfy BIPA infrastructure-as-notice argument.** Plaintiffs may argue presence of biometric schema fields constitutes constructive notice of intent. v3: Phase 1.6 BIPA policy doc must explicitly attest "no biometric data exists in the pre-identityd substrate" with cryptographic evidence (hash of pre-identityd schema). Counsel must bless this attestation. + +**v3-B12 (gemini): Backup retention window vs ciphertext-deletion erasure.** Phase 2 uses ciphertext deletion + single-DEK; if Postgres backups retain pre-erasure ciphertext + the DEK still exists, backup is recoverable. v3 erasure runbook: includes maximum backup retention period (e.g., 30 days). RTBF status reported to subject as "erased now; residual in backups expires by {date}." For full crypto-erasure-on-demand, Phase 7 per-row keys is the path. + +### Net v3 effort delta + +v2 estimate: 12-15 days. v3 amendments add: +- mTLS CA via Vault PKI integration: +0.5 day +- Public-key registry in Vault KV: +0.5 day +- Server-issued nonces for dual-control: +0.25 day +- Step 8 fallback time bound: +0.1 day +- NER drop-rate metric + alert: +0.25 day +- Legal-tier notification transport: +0.5 day +- purpose_versions + purpose_revocations: +0.5 day +- erasure_generation cache atomicity: +0.5 day +- 15-min cooling-off: +0.25 day +- NER calibrated test set: +0.5 day +- Separate AWS account for Object Lock: +0.5 day +- BIPA infrastructure-as-notice attestation: +0.25 day (mostly doc work) +- Backup-retention erasure runbook: +0.25 day + +**Total v3 delta: ~5 days. Revised estimate: 17-20 days.** + +The cost is real. The reason it's worth paying is what those 5 days buy: a design that 3 independent senior security architects (across 3 model lineages) all said they would build. v1 said "do not build." v3 says "build." + +### What's v3-deferred vs v3-must-have + +**v3 must-have (block implementation):** v3-A1, v3-A2, v3-B1, v3-B6, v3-B7, v3-B11. + +**v3 should-have (ship in Phase 2 if calendar allows; otherwise Phase 5):** v3-B2, v3-B3, v3-B4 (operational decision), v3-B5 (best-effort), v3-B8, v3-B9, v3-B10, v3-B12. + +If schedule pressure forces a cut, the should-have items can ship in Phase 5 (identity service build completion). The must-haves cannot — they're integral to the security boundary. + +--- + ## Change log - 2026-05-03 — v1 initial draft. -- 2026-05-03 — v2 post-scrum: 3/3 reviewer convergent findings folded in. 4 "would not build" blockers all resolved. Re-scrum before implementation recommended. +- 2026-05-03 — v2 post-first-scrum: 3/3 reviewer convergent findings folded in. 4 "would not build" blockers all resolved. Re-scrum recommended. +- 2026-05-03 — v3 post-second-scrum: 3/3 BUILD-WITH-CHANGES. v1 blockers verified resolved. 12 new v2 findings folded into v3 amendments (§12). Re-scrum NOT required for v3 — diminishing returns; the must-have items are concrete fixes with clear acceptance criteria.