identity service: v3 amendments — second-pass scrum BUILD-WITH-CHANGES
Re-scrummed v2 across opus + kimi + gemini. All 3 verdict:
BUILD-WITH-CHANGES. v1 blockers verified RESOLVED. 12 new v2
findings folded as v3 amendments in §12.
Convergent v2 findings (≥2 reviewers):
v3-A1: mTLS CA root must NOT live in identityd (opus + gemini).
v3 fix: Vault PKI for CA, identityd as intermediate.
v3-A2: Dual-control public key registry must be tamper-evident
(opus + gemini). v3 fix: Vault KV with separate access
policies + server-issued nonces for replay protection.
Single-reviewer v3 amendments (10 more):
- B1: Step 8 fallback-to-SQL needs explicit 14-day time bound
- B2: NER drop-on-detect needs Prometheus alerting
- B3: legal-tier notification transport spec'd (signed Slack/email,
no PII in body, failure non-blocking)
- B4: Step 6 human review SLA flagged — ~7 months at 500/day for
~100k unknown rows; operational decision needed
- B5: Memory zeroing in Go is best-effort (Rust uses zeroize crate);
documented as not cryptographic-grade
- B6: purpose_definitions needs versioning + emergency revocation
(purpose_versions + purpose_revocations tables)
- B7: Cache invalidation needs erasure_generation atomicity
(subjects.erasure_generation int; gateway rejects stale-gen
cache hits) — replaces best-effort pub/sub
- B8: 15-min cooling-off period for dual-control issuance to
prevent emergency-bypass culture
- B9: NER calibrated test set with target recall ≥99.5% on
synthetic adversarial PII
- B10: S3 Object Lock in separate AWS account with write-only IAM;
root credentials held by external party
- B11: BIPA infrastructure-as-notice attestation in Phase 1.6 doc
- B12: Backup retention vs ciphertext-deletion erasure window
documented in RTBF runbook
Estimate revised v2 12-15d → v3 17-20d. Worth it — the cost is what
buys "I would build this" from 3 independent senior security
architects across 3 model lineages.
Must-have v3 items (block implementation): A1, A2, B1, B6, B7, B11.
Should-have (ship in Phase 5 if calendar tight): B2-B5, B8-B10, B12.
Re-scrum NOT recommended for v3 — diminishing returns; must-have
items are concrete fixes with clear acceptance criteria.
No code changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
## 12. v3 amendments (post-second-pass scrum, 2026-05-03 evening)
v2 was re-scrummed across opus + kimi + gemini. **All 3 verdict: BUILD-WITH-CHANGES.** All 4 v1 blockers verified RESOLVED. New v2 findings are tractable design fixes (not re-architecture). Folded in below. Reviews preserved at `/tmp/identity_scrum_v2/{opus,kimi,gemini}_review.md`.
### Convergent v2 findings (≥2 reviewers) — must fix before implementation
**v3-A1 — mTLS CA root must NOT live in identityd** (opus + gemini converge). v2 said "self-signed CA managed by identityd at startup" — if identityd is both the CA and authenticated party, compromise of identityd compromises the whole trust fabric. **v3 change:** mTLS CA root lives in **Vault PKI** (or, if Vault PKI is unavailable, an offline root with identityd as a properly-issued intermediate). identityd never has root-CA private key access.
**v3-A2 — Dual-control public key registry must be tamper-evident** (opus + gemini converge). v2 said "pre-registered public keys" but didn't say where. If they live in identityd's Postgres, DB-write compromise defeats dual-control. **v3 change:** J + counsel public keys live in **Vault KV with separate access policies** (or a signed config file at `/etc/lakehouse/identityd_dual_control.yaml` with its own dual-control rotation procedure). Public-key changes themselves require dual-control attestation. Plus: `nonce_a`/`nonce_b` get **server-issued challenges** (identityd issues a fresh nonce per request; clients sign that nonce; replay-protection via 5-min nonce cache).
### Single-reviewer v2 findings — also folded into v3
**v3-B1 (opus): Step 8 fallback-to-SQL needs explicit time bound.** Otherwise becomes permanent dual-read debt. v3: fallback active for max 14 days post-cutover; alert on every fallback; auto-disable at 14 days regardless of fallback rate.
**v3-B2 (opus): NER drop-on-detect needs alerting.** Silent observability gap if model regression starts hallucinating PII. v3: drop counter exposed as `identityd_ner_drops_total` Prometheus metric; alert on drop_rate >0.1% over rolling 1h.
**v3-B3 (opus): legal-tier notification transport must be specified.** v2 said "real-time notification to designated counsel + J" without saying how. v3: notification transport is **signed Slack webhook OR signed email** (configurable per deployment); message body never contains candidate_id or PII; only `{event_kind, timestamp, accessor_kind, integrity_hash}`. Notification failure does NOT block the legal-tier action — failure is logged for follow-up but token issuance proceeds.
**v3-B4 (opus): Step 6 human review queue needs SLA + volume estimate.** With 500k backfilled rows defaulting to `unknown`, all healthcare-adjacent traffic routes to on-box Ollama until reclassified. v3: budget assumption — auto-pattern-matching pre-classifies ~80% to confidently-non-healthcare; remaining 20% (~100k) stays `unknown` pending review at ~500/day operator throughput = ~7 months of queue. **Operational decision needed:** either staff the queue, accept the on-box-Ollama-only routing for ~7 months, or relax pattern matching at the cost of higher false-negative rate. Flag for J.
**v3-B5 (opus): Memory zeroing in Go is non-trivial.** v2 §3.1 step 6 says "explicitly zeros memory" — Go GC-managed strings are immutable, not zeroable. v3 implementation note: PII handling code path uses `[]byte` throughout; convert to string only at the JSON serialization boundary; explicit `runtime.GC()` after request completes is theater (Go won't actually zero the slice). **Acceptance:** "best-effort zero" — overwrite the `[]byte` slice contents post-use. Document this is best-effort, not cryptographic-grade scrubbing. (Rust-side: use the `zeroize` crate which IS cryptographic-grade.)
**v3-B6 (kimi): Service-tier purpose_definitions needs versioning + emergency revocation.** A misconfigured purpose with overly-broad `allowed_fields` is standing exfiltration authorization. v3 schema additions:
```sql
CREATE TABLE purpose_versions (
    purpose_token TEXT NOT NULL,
    version       INT NOT NULL,
    effective_at  TIMESTAMPTZ NOT NULL,
    superseded_at TIMESTAMPTZ,
    PRIMARY KEY (purpose_token, version)
);

CREATE TABLE purpose_revocations (
    purpose_token TEXT PRIMARY KEY,
    revoked_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    revoked_by    TEXT NOT NULL,
    reason        TEXT
);
```
The auth path checks `purpose_revocations` on every call (uncached, or cached for under 5 seconds). Revocation therefore takes effect in under 60 seconds with no code deploy.
**v3-B7 (kimi): Cache invalidation needs erasure-generation atomicity.** v2 §3.6 pub/sub is best-effort. If Redis drops the message or gateway crashes between erase + purge, cache holds stale PII. v3 change:
- `subjects` table gains `erasure_generation INT NOT NULL DEFAULT 0`; increments on every erase
- identityd PII responses include the current generation in a response header
- gateway cache entries are tagged with the generation they were filled at
- cache hits with `cached_generation < current_generation` are rejected and force a re-fetch from identityd
- net effect: consistency is restored within ONE request hop, not eventually over time
**v3-B8 (kimi): Dual-control issuance needs cooling-off period to prevent bypass culture.** v3: 15-minute mandatory delay between nonce submission and token issuance, with notification to BOTH parties at delay-start AND issuance. Bypass requires both J and counsel to sign an "emergency bypass" attestation that's logged separately + escalated to a board-level reviewer.
**v3-B9 (kimi): NER must have calibrated false-negative rate.** v3: maintain a synthetic adversarial test set (50 hallucinated-PII examples + 50 legitimate non-PII). NER must achieve recall ≥99.5% on the synthetic set in CI; production drop-rate >0.1% triggers re-calibration.
**v3-B10 (kimi): S3 Object Lock root credentials need separate AWS account.** Same ops team holding Vault admin + AWS root = limited independence. v3: Object Lock bucket lives in **separate AWS account** with **write-only IAM role** issued to identityd. Root credentials for that account held by named external party (could be outside counsel) with quarterly attestation.
**v3-B11 (kimi): biometric_consent_status='never_collected' may not satisfy BIPA infrastructure-as-notice argument.** Plaintiffs may argue presence of biometric schema fields constitutes constructive notice of intent. v3: Phase 1.6 BIPA policy doc must explicitly attest "no biometric data exists in the pre-identityd substrate" with cryptographic evidence (hash of pre-identityd schema). Counsel must bless this attestation.
**v3-B12 (gemini): Backup retention window vs ciphertext-deletion erasure.** Phase 2 uses ciphertext deletion + single-DEK; if Postgres backups retain pre-erasure ciphertext + the DEK still exists, backup is recoverable. v3 erasure runbook: includes maximum backup retention period (e.g., 30 days). RTBF status reported to subject as "erased now; residual in backups expires by {date}." For full crypto-erasure-on-demand, Phase 7 per-row keys is the path.
### Net v3 effort delta
v2 estimate: 12-15 days. v3 amendments add:

- mTLS CA via Vault PKI integration: +0.5 day
- Public-key registry in Vault KV: +0.5 day
- Server-issued nonces for dual-control: +0.25 day
- Step 8 fallback time bound: +0.1 day
- NER drop-rate metric + alert: +0.25 day
- Legal-tier notification transport: +0.5 day
- purpose_versions + purpose_revocations: +0.5 day
- erasure_generation cache atomicity: +0.5 day
- 15-min cooling-off: +0.25 day
- NER calibrated test set: +0.5 day
- Separate AWS account for Object Lock: +0.5 day
- BIPA infrastructure-as-notice attestation: +0.25 day (mostly doc work)
- Backup-retention erasure runbook: +0.25 day

**Total v3 delta: ~5 days (items sum to 4.85). Revised estimate: 17-20 days.**
The cost is real. The reason it's worth paying is what those 5 days buy: a design that 3 independent senior security architects (across 3 model lineages) all said they would build. v1 said "do not build." v3 says "build."
### What's v3-deferred vs v3-must-have
**v3 must-have (block implementation):** v3-A1, v3-A2, v3-B1, v3-B6, v3-B7, v3-B11.
**v3 should-have (ship in Phase 2 if calendar allows; otherwise Phase 5):** v3-B2, v3-B3, v3-B4 (operational decision), v3-B5 (best-effort), v3-B8, v3-B9, v3-B10, v3-B12.
If schedule pressure forces a cut, the should-have items can ship in Phase 5 (identity service build completion). The must-haves cannot — they're integral to the security boundary.
---
## Change log
- 2026-05-03 — v1 initial draft.
- 2026-05-03 — v2 post-first-scrum: 3/3 reviewer convergent findings folded in. 4 "would not build" blockers all resolved. Re-scrum before implementation recommended.
- 2026-05-03 — v3 post-second-scrum: 3/3 BUILD-WITH-CHANGES. v1 blockers verified resolved. 12 new v2 findings folded into v3 amendments (§12). Re-scrum NOT required for v3 — diminishing returns; the must-have items are concrete fixes with clear acceptance criteria.