diff --git a/docs/PHASE_1_6_BIPA_GATES.md b/docs/PHASE_1_6_BIPA_GATES.md
new file mode 100644
index 0000000..2d57524
--- /dev/null
+++ b/docs/PHASE_1_6_BIPA_GATES.md
@@ -0,0 +1,240 @@
+# Phase 1.6 — BIPA Pre-Launch Gates
+
+**Status:** Draft — 2026-05-03 · **Owner:** J + outside counsel · **Companion to:** [`AUDIT_TRAIL_PRD.md`](AUDIT_TRAIL_PRD.md), [`AUDIT_PHASE_1_5_BIPA_AND_OUTCOMES.md`](AUDIT_PHASE_1_5_BIPA_AND_OUTCOMES.md), [`IDENTITY_SERVICE_DESIGN.md`](IDENTITY_SERVICE_DESIGN.md)
+
+> **Why this exists.** `IDENTITY_SERVICE_DESIGN.md` v3 §5 Step 0 names Phase 1.6 as a HARD PREREQUISITE: identityd backfill cannot start until Phase 1.6 ships. This doc specifies what Phase 1.6 contains.
+>
+> **Scope.** BIPA (740 ILCS 14) compliance gates that must be in place BEFORE the system accepts a single real candidate photo. Synthetic-data face pool can keep operating; real-photo intake CANNOT begin without these gates.
+>
+> **Authority.** This is an engineering scaffold. Sections marked `⚖ COUNSEL` need outside counsel to author the actual legally-binding text. Engineering ships the procedural gates; counsel writes the words.
+
+---
+
+## 1. The five BIPA pre-launch gates
+
+Each gate is a deliverable that must ship before real-photo intake. None is optional. Order shown is the recommended ship sequence.
+
+### Gate 1 — Public retention schedule (BIPA §15(a))
+
+**Required:** A publicly-available, written retention schedule for biometric identifiers and information.
+
+**What ships:**
+- `data/_consent/biometric_retention_schedule_v1.md` — public file
+- Linked from public privacy policy at the deployment URL
+- Specifies:
+  - Categories of biometric data collected (facial geometry derived from candidate photos, age estimate, gender classification, race classification — per Phase 1.5 deepface walk)
+  - Purpose of collection (identity matching for staffing operations)
+  - Maximum retention: BIPA §15(a) requires destruction when the initial purpose for collection has been satisfied or within 3 years of the individual's last interaction with the private entity, whichever occurs first; recommend 18-24 months as the operational ceiling (provides safety margin)
+  - Destruction procedure: per Gate 5 below
+- Versioned (this is v1; future updates supersede with a new version)
+
+**⚖ COUNSEL** — write the actual schedule. Engineering provides the operational facts; counsel writes the binding language.
+
+**Engineering acceptance:** the file is committed, the public URL renders it, and identityd's `consent_versions` table references it by hash.
+
+---
+
+### Gate 2 — Informed written consent (BIPA §15(b))
+
+**Required:** Informed, written consent BEFORE any biometric collection occurs.
+
+**What ships:**
+- `data/_consent/biometric_consent_template_v1.md` — public consent template
+- Versioned, hashed, referenced from identityd's `consent_versions` table
+- Must cover, per BIPA §15(b)(1)-(3):
+  1. That biometric identifiers/information will be collected
+  2. The specific purpose for collection (and the length of term — references Gate 1)
+  3. Receipt of a written release authorizing collection
+- Consent flow at intake:
+  - Candidate sees the disclosure on a UI surface (web form / paper / digital signature)
+  - Candidate provides explicit affirmative action (signature, click-acceptance with timestamp, etc.)
+  - Identityd records `biometric_consent_status='given'` with `consent_version` reference + `consent_given_at` timestamp
+  - **Without identityd recording 'given', no biometric data flows through deepface.**
+
+**⚖ COUNSEL** — write the consent template. Recommended content (engineering view):
+- Clear language (not just legal boilerplate)
+- Specific to facial-classification (not generic biometrics)
+- Includes withdrawal procedure
+- Includes data-subject rights enumeration
+
+**Engineering acceptance:** consent gate is enforced in code at the photo-upload endpoint; identityd refuses biometric writes when `biometric_consent_status != 'given'`; pre-existing synthetic-face pool is exempt (no consent needed because no real subject).
+
+---
+
+### Gate 3 — Photo-upload endpoint with consent enforcement
+
+**Required:** Code-level enforcement that real-photo intake checks consent before processing.
+
+**What ships:**
+
+A new endpoint (proposed: `POST /v1/identity/subjects/{candidate_id}/photo`) with the following behavior:
+
+1. Caller authenticates with service-tier token
+2. Endpoint queries identityd for `subjects.biometric_consent_status`
+3. If status ≠ `'given'` → HTTP 403 with reason `"BIPA consent required before biometric processing"`
+4. If status = `'given'`:
+   a. Photo bytes accepted, stored to a quarantined path under `data/biometric/uploads/{candidate_id}/{ts}.{ext}` (NOT `data/headshots/`)
+   b. deepface tagging runs against the photo
+   c. Classifications (gender, race, age) stored to `subjects` table fields (NEW columns — see schema additions below)
+   d. Original photo bytes encrypted under DEK + retained per Gate 1 schedule
+   e. `pii_access_log` row written with `purpose_token='biometric_collection'`
+5. Response: `{candidate_id, retention_until, consent_version}`
+
+**Schema additions to identityd `subjects`:**
+
+```sql
+ALTER TABLE subjects ADD COLUMN biometric_classifications JSONB; -- {gender, race, age} from deepface
+ALTER TABLE subjects ADD COLUMN biometric_data_path TEXT; -- quarantined path
+ALTER TABLE subjects ADD COLUMN biometric_collected_at TIMESTAMPTZ;
+ALTER TABLE subjects ADD COLUMN biometric_template_hash TEXT; -- hash of the photo bytes (for integrity, NOT for re-derivation)
+```
+
+**Engineering acceptance:**
+- Endpoint refuses uploads when consent missing (verified by integration test)
+- deepface output never lands in the synthetic-face manifest (`data/headshots/manifest.jsonl`)
+- Real-photo classifications are isolated to identityd `subjects` table — never flow to JSONL sinks
+- The `/headshots/:key` route in mcp-server REMAINS synthetic-only — does NOT serve real candidate photos to LLMs without an explicit allowance (proposed: real photos served only to authenticated staffer UI, never to model context)
+
+---
+
+### Gate 4 — Deprecate name → ethnicity inference
+
+**Required:** The hard-coded `NAMES_HISPANIC` / `SURNAMES_*` lookup tables in `mcp-server/search.html:3375-3432` (per Phase 1.5 §1B walk) get removed.
+
+**What ships:**
+- A code commit that removes:
+  - `FEMALE_NAMES`, `MALE_NAMES` constants
+  - `NAMES_HISPANIC`, `NAMES_BLACK`, `NAMES_SOUTH_ASIAN`, `NAMES_EAST_ASIAN`, `NAMES_MIDDLE_EASTERN` constants
+  - `SURNAMES_HISPANIC`, `SURNAMES_SOUTH_ASIAN`, `SURNAMES_EAST_ASIAN`, `SURNAMES_MIDDLE_EASTERN`, `SURNAMES_BLACK` constants
+  - The `genderFor()` and `guessEthnicityFromFirstName()` functions
+  - All call sites that consumed these (face-pool bucket selection)
+- Replacement strategy:
+  - For SYNTHETIC face pool routing: deterministic hash of candidate_id selects a face bucket, no demographic inference
+  - For REAL candidate photos: the candidate's actual photo IS the representation; no inference needed
+
+**Why this is BIPA + Title VII risk separately:** name-based ethnicity classification is BOTH a discriminatory feature engineering practice (Title VII) AND, when combined with photo-based attribute extraction, a "biometric information derived from a biometric identifier" pattern (BIPA broad reading). Removing the lookup tables forecloses both arguments.
+
+**Engineering acceptance:**
+- Lookup tables removed from search.html
+- Unit test asserts no protected-attribute inference functions exist in search.html or any mcp-server module
+- Face-pool routing for synthetic faces uses candidate_id hash exclusively
+- Phase 1.5 §1B finding closed
+
+---
+
+### Gate 5 — Documented destruction procedure
+
+**Required:** A written procedure for biometric data destruction at retention expiry OR consent withdrawal OR right-to-be-forgotten request.
+
+**What ships:**
+- `docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md` — operator-facing
+- Specifies:
+  - Triggers: retention expiry (per Gate 1), withdrawal, RTBF request, candidate request
+  - Procedure: identityd `POST /v1/identity/subjects/{id}/erase` (legal-tier auth)
+  - Erasure scope: `subjects.biometric_*` columns ciphertext-deleted, `biometric_data_path` files securely overwritten + unlinked, deepface classifications nulled
+  - Backup window: per `IDENTITY_SERVICE_DESIGN` v3-B12, residual exists in DB backups for 30 days max; subject is informed
+  - Witnessed: every erasure event written to `pii_access_log` with `purpose_token='biometric_erasure'` and the legal-tier JWT signature (proves authorized destruction)
+  - Reporting: monthly internal report of erasures + retention-expiry sweeps; available to counsel on request
+
+**⚖ COUNSEL** — review the runbook for legal sufficiency. Engineering writes the procedure; counsel attests that the procedure satisfies BIPA §15(a) destruction requirements.
+
+**Engineering acceptance:**
+- Runbook committed
+- `POST /v1/identity/subjects/{id}/erase` endpoint includes biometric-specific erasure path
+- Daily sweep job destroys biometric data past `biometric_retention_until` (separate from general retention sweep — biometric has stricter clock)
+- Erasure events are logged with cryptographic attestation
+
+---
+
+## 2. Cryptographic attestation: no biometric data exists pre-identityd
+
+**Per `IDENTITY_SERVICE_DESIGN` v3-B11.** Plaintiffs may argue that the EXISTENCE of biometric schema fields constitutes constructive notice of intent to collect biometric data — therefore consent should have preceded the schema. The defense: prove that no biometric data was actually collected from real candidates before identityd + the consent gate.
+
+**What ships:**
+- A one-shot script `scripts/staffing/attest_pre_identityd_biometric_state.sh` that:
+  - Queries `data/datasets/workers_500k.parquet` schema and confirms NO column named `photo`, `biometric_*`, `face_*`, `image_*` exists
+  - Greps `data/_kb/*.jsonl` and `data/_pathway_memory/state.json` for any base64-encoded image bytes (deepface output, photo blobs)
+  - Verifies `data/headshots/manifest.jsonl` rows ≤ synthetic face pool size
+  - Hashes the schema + summary; commits the hash to S3 Object Lock (per identity service v3 anchor pattern)
+- Attestation document `docs/BIPA_PRE_IDENTITYD_ATTESTATION_2026-05-XX.md` signed by J + outside counsel
+
+**This is a one-time defense artifact.** It establishes the baseline: "as of this date, no biometric data was collected from real candidates."
+
+---
+
+## 3. Employee training acknowledgment (general BIPA hygiene)
+
+**Required:** People with access to biometric data acknowledge BIPA-handling training.
+
+**What ships:**
+- `docs/policies/BIPA_HANDLING_TRAINING_v1.md` — training material covering:
+  - What constitutes biometric identifiers / information
+  - The consent + retention procedures
+  - Destruction obligations
+  - Reporting suspected exposure
+- Acknowledgment record per individual (initially: J + counsel + named operators)
+- Annual refresh
+
+**⚖ COUNSEL** — write training content. Engineering doesn't author legal-compliance training.
+
+---
+
+## 4. Phase 1.6 exit criteria (gates Phase 2 backfill)
+
+All 5 gates must be DONE before identityd backfill begins:
+
+1. ✅ Public retention schedule published + linked from privacy policy + counsel sign-off
+2. ✅ Consent template published + counsel sign-off + technical enforcement integrated
+3. ✅ Photo-upload endpoint shipped with consent enforcement + integration test green
+4. ✅ Name → ethnicity inference removed from search.html + unit test asserting absence
+5. ✅ Destruction runbook published + erasure endpoint includes biometric path + counsel sign-off
+
+PLUS:
+
+6. ✅ Cryptographic attestation that no pre-identityd biometric data exists, signed by J + counsel
+7. ✅ Employee training material published + initial acknowledgments recorded
+
+Until all 7 are checked off, **identity service backfill (Phase 2 §5 Step 5) cannot proceed.**
+
+---
+
+## 5. Effort estimate
+
+| Gate | Engineering effort | Legal effort |
+|---|---|---|
+| Gate 1 (retention schedule) | 0.5 day | counsel-dependent (typically 1-2 weeks for review) |
+| Gate 2 (consent template) | 0.5 day | counsel-dependent (typically 2-4 weeks for review and consent UX design) |
+| Gate 3 (photo-upload endpoint) | 1-2 days | review of endpoint behavior |
+| Gate 4 (deprecate name-ethnicity inference) | 0.5 day | none (engineering-only fix) |
+| Gate 5 (destruction runbook) | 1 day | counsel sign-off |
+| §2 cryptographic attestation | 0.5 day | counsel + J signature |
+| §3 employee training | 0.25 day (admin) | counsel-authored content |
+| **Total engineering** | **~4-5 days** | — |
+| **Total counsel** | — | **~3-6 weeks calendar** (review cycles) |
+
+**The calendar bottleneck is counsel, not engineering.** Engineering can stage all 5 gates ready-to-ship in a week. Counsel sign-off + consent UX rollout is the longer pole.
+
+---
+
+## 6. Open questions for J + counsel
+
+1. **Photo-upload UX:** is there an existing intake form / staffer console where photo upload would happen? Or is this new UI work?
+2. **Consent collection mechanism:** electronic signature service (DocuSign, Adobe Sign), in-app click-acceptance, paper form? Each has different evidentiary weight in litigation.
+3. **Operator list with biometric access:** who, today, would be on the named-operators list for §3 training?
+4. **Counsel for sign-off:** named outside counsel — same or different from the dual-control legal-token party in identity service?
+5. **Public privacy policy URL:** does one exist? If yes, where; if no, that's a separate Gate-1.5 deliverable.
+
+---
+
+## 7. What this PRD is NOT
+
+- Not legal advice. The `⚖ COUNSEL` markers exist because the binding text needs lawyers, not engineers.
+- Not a substitute for a DPIA / PIA. Phase 1.6 satisfies BIPA-specific gates; a Data Protection Impact Assessment is broader and may be required separately.
+- Not a SOC2 Type II deliverable. SOC2 is a parallel work stream.
+- Not the only gate before production. The full 9-phase audit-trail program continues; Phase 1.6 specifically unblocks Phase 2 (identity service implementation).
+
+---
+
+## Change log
+
+- 2026-05-03 — Initial draft. Authored after `IDENTITY_SERVICE_DESIGN` v3 §5 Step 0 named Phase 1.6 as a hard prerequisite to backfill.