lakehouse/docs/PHASE_1_6_BIPA_GATES.md
root fcd53168a0 phase 1.6: counsel handoff turnkey + seed_consent_version.sh + strict mode live
The remaining production blocker is counsel-calendar bottleneck
(review + sign-off). Engineering can't make counsel move faster,
but it CAN reduce the round-trip overhead:

(1) docs/counsel/COUNSEL_HANDOFF_EMAIL_2026-05-05.md — copy-paste
    email body J can send to outside counsel. Subject line + body
    + tarball attachment instructions + headline asks (A/B/C/D
    in priority order) + post-signature operator runbook. The
    pre-flight checklist + post-signature workflow turn what
    would have been "I'll figure out the email" into "click send."

(2) scripts/staffing/seed_consent_version.sh — turnkey
    post-signature deployment. Takes the path to a (presumably
    counsel-signed) consent template markdown, computes SHA-256,
    atomically merges into /etc/lakehouse/consent_versions.json
    (creating the file if absent, with per-seed audit metadata
    in _meta.seeded_at[]), restarts lakehouse.service, probes
    /biometric/health post-restart. Idempotent: re-running with
    the same hash is a no-op for the versions array but still
    appends a [reseed] entry to the audit metadata.
    Verified live against the eng-staged template — strict mode
    flipped clean, /biometric/health 200 post-restart.

(3) docs/PHASE_1_6_BIPA_GATES.md §6.5 — post-signature deployment
    runbook embedded in the gates doc. Three steps: counsel signs
    + commits → seed_consent_version.sh → strict-mode probe.
    Plus a "pre-counsel demo seed" subsection documenting how to
    exercise strict mode BEFORE counsel signs (using the
    eng-staged template hash) so the deployment workflow is
    proven before the legal critical path closes.

Strict mode flipped live — verified post-restart:
- /etc/lakehouse/consent_versions.json populated with the
  eng-staged template hash:
  8b09591a8dc15f59197affac48909ce943d575eee01705b42303acf3b32f5c56
- POST /biometric/subject/WORKER-1/consent with deadbeef hash:
  HTTP 400 + error="consent_version_unknown"
- POST with the known eng-staged hash: passes version check
  (then 404 subject_not_found on a ghost candidate, proving
  the gate is hash-aware not auth-broken)

The hash currently seeded is the ENG-STAGED template
(pre-counsel-signature). When counsel returns the signed text,
operator runs `seed_consent_version.sh` again with the
counsel-signed markdown — the new hash gets appended; the demo
hash stays in for backwards-compat with any consent records
collected during the pre-counsel demo period (none, today).

Production blocker is now genuinely just counsel calendar:
1. J transmits reports/counsel/counsel_packet_2026-05-05.tar.gz
   per the handoff email
2. Counsel reviews + signs (their billable time)
3. Counsel returns signed text → operator runs seed script
4. Strict mode flips to canonical hash → cutover complete

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:32:16 -05:00

Phase 1.6 — BIPA Pre-Launch Gates

Status: Draft — 2026-05-03 · Owner: J + outside counsel · Companion to: AUDIT_TRAIL_PRD.md, AUDIT_PHASE_1_5_BIPA_AND_OUTCOMES.md, IDENTITY_SERVICE_DESIGN.md

Why this exists. IDENTITY_SERVICE_DESIGN.md v3 §5 Step 0 names Phase 1.6 as a HARD PREREQUISITE: identityd backfill cannot start until Phase 1.6 ships. This doc specifies what Phase 1.6 contains.

Scope. BIPA (740 ILCS 14) compliance gates that must be in place BEFORE the system accepts a single real candidate photo. Synthetic-data face pool can keep operating; real-photo intake CANNOT begin without these gates.

Authority. This is an engineering scaffold. Sections marked ⚖ COUNSEL need outside counsel to author the actual legally-binding text. Engineering ships the procedural gates; counsel writes the words.


1. The five BIPA pre-launch gates

Each gate is a deliverable that must ship before real-photo intake. None is optional. Order shown is the recommended ship sequence.

Gate 1 — Public retention schedule (BIPA §15(a))

Required: A publicly-available, written retention schedule for biometric identifiers and information.

What ships:

  • docs/policies/consent/biometric_retention_schedule_v1.md — public file
  • Linked from public privacy policy at the deployment URL
  • Specifies:
    • Categories of biometric data collected (facial photograph for staff identification at job sites; classifications deferred per Gate 3b — see docs/specs/GATE_3B_DEEPFACE_DESIGN.md)
    • Purpose of collection (identity matching for staffing operations)
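    • Maximum retention: BIPA §15(a) requires destruction "when the initial purpose for collecting or obtaining such identifiers or information has been satisfied or within 3 years of the individual's last interaction with the private entity, whichever occurs first". Recommend 18-24 months as the operational ceiling (provides a safety margin)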
    • Destruction procedure: per Gate 5 below
  • Versioned (this is v1; future updates supersede with a new version)

⚖ COUNSEL — write the actual schedule. Engineering provides the operational facts; counsel writes the binding language.

Engineering acceptance: the file is committed, the public URL renders it, and identityd's consent_versions table references it by hash.


Gate 2 — Consent template (BIPA §15(b))

Required: Informed, written consent BEFORE any biometric collection occurs.

What ships:

  • docs/policies/consent/biometric_consent_template_v1.md — public consent template
  • Versioned, hashed, referenced from identityd's consent_versions table
  • Must satisfy BIPA §15(b)(1)-(3):
    1. Inform the subject in writing that biometric identifiers/information are being collected or stored
    2. Inform the subject in writing of the specific purpose and the length of term for collection, storage, and use (references Gate 1)
    3. Obtain a written release from the subject authorizing collection
  • Consent flow at intake:
    • Candidate sees the disclosure on a UI surface (web form / paper / digital signature)
    • Candidate provides explicit affirmative action (signature, click-acceptance with timestamp, etc.)
    • Identityd records biometric_consent_status='given' with consent_version reference + consent_given_at timestamp
    • Without identityd recording 'given', no biometric data flows through deepface.

⚖ COUNSEL — write the consent template. Recommended content (engineering view):

  • Clear language (not just legal boilerplate)
  • Specific to facial-classification (not generic biometrics)
  • Includes withdrawal procedure
  • Includes data-subject rights enumeration

Engineering acceptance: consent gate is enforced in code at the photo-upload endpoint; identityd refuses biometric writes when biometric_consent_status != 'given'; pre-existing synthetic-face pool is exempt (no consent needed because no real subject).


Gate 3 — Photo-upload endpoint with consent enforcement

Required: Code-level enforcement that real-photo intake checks consent before processing.

What ships:

An endpoint at POST /biometric/subject/{candidate_id}/photo, served locally by catalogd, with the behavior listed below. (The original v1 spec named this /v1/identity/subjects/{candidate_id}/photo under a separate identityd daemon; that daemon was collapsed into catalogd per the architecture pivot. See the IDENTITY_SERVICE_DESIGN.md deprecation header.)

  1. Caller authenticates with service-tier token
  2. Endpoint queries identityd for subjects.biometric_consent_status
  3. If status ≠ 'given' → HTTP 403 with reason "BIPA consent required before biometric processing"
  4. If status = 'given':
     a. Photo bytes accepted, stored to a quarantined path under data/biometric/uploads/{candidate_id}/{ts}.{ext} (NOT data/headshots/)
     b. deepface tagging runs against the photo
     c. Classifications (gender, race, age) — DEFERRED to Gate 3b (docs/specs/GATE_3B_DEEPFACE_DESIGN.md). BiometricCollection.classifications remains None in v1.
     d. Original photo bytes encrypted under DEK + retained per Gate 1 schedule
     e. pii_access_log row written with purpose_token='biometric_collection'
  5. Response: {candidate_id, retention_until, consent_version}
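
The consent check in steps 2 to 4a can be sketched as a pure decision function. This is a minimal illustration, not the shipped handler in crates/catalogd/src/biometric_endpoint.rs: the ConsentStatus and UploadDecision types, the decide_upload name, and the non-Given variant names are all assumptions made for the example. Only the 403 reason string and the quarantined path layout come from the steps above.

```rust
/// Consent states for a subject. Only `Given` unlocks biometric processing.
/// The variant names other than `Given` are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConsentStatus {
    NeverCollected,
    Given,
    Withdrawn,
}

/// Outcome of the consent gate for one upload attempt (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum UploadDecision {
    /// Step 3: refuse with HTTP 403 and the reason string from the spec.
    Refused { http_status: u16, reason: &'static str },
    /// Step 4a: accept and store under the quarantined upload path.
    Accepted { quarantine_path: String },
}

/// Decides whether an upload may proceed. `ts` is a caller-supplied timestamp
/// (assumed here to be Unix seconds) and `ext` is the file extension.
pub fn decide_upload(
    status: ConsentStatus,
    candidate_id: &str,
    ts: u64,
    ext: &str,
) -> UploadDecision {
    if status != ConsentStatus::Given {
        return UploadDecision::Refused {
            http_status: 403,
            reason: "BIPA consent required before biometric processing",
        };
    }
    UploadDecision::Accepted {
        // Quarantined path, deliberately NOT under data/headshots/.
        quarantine_path: format!("data/biometric/uploads/{candidate_id}/{ts}.{ext}"),
    }
}
```

Keeping the decision free of I/O makes the refusal path easy to cover in a unit test before any bytes are read from the request.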

Schema (as shipped — catalogd SubjectManifest.biometric_collection):

The original spec proposed JSONB columns on a Postgres subjects table under identityd. The shipped implementation collapses this into a per-subject JSON manifest at data/_catalog/subjects/<id>.json, with the BiometricCollection struct holding data_path, template_hash, collected_at, and classifications: Option<JSON>. See crates/catalogd/src/subject_manifest.rs for the canonical type.

// crates/catalogd/src/subject_manifest.rs (paraphrased)
// Imports below are assumed from the field types; check the source file.
use chrono::{DateTime, Utc};
use serde_json::Value;

pub struct BiometricCollection {
    pub data_path: String,                // quarantined path
    pub template_hash: String,            // SHA-256 of original bytes (integrity, NOT re-derivation)
    pub collected_at: DateTime<Utc>,
    pub classifications: Option<Value>,   // None until Gate 3b ships (deferred — see GATE_3B_DEEPFACE_DESIGN.md)
}

Engineering acceptance:

  • Endpoint refuses uploads when consent missing (verified by integration test)
  • deepface output never lands in the synthetic-face manifest (data/headshots/manifest.jsonl)
  • Real-photo classifications are isolated to identityd subjects table — never flow to JSONL sinks
  • The /headshots/:key route in mcp-server REMAINS synthetic-only — does NOT serve real candidate photos to LLMs without an explicit allowance (proposed: real photos served only to authenticated staffer UI, never to model context)

Gate 4 — Deprecate name → ethnicity inference

Required: The hard-coded NAMES_HISPANIC / SURNAMES_* lookup tables in mcp-server/search.html:3375-3432 (per Phase 1.5 §1B walk) get removed.

What ships:

  • A code commit that removes:
    • FEMALE_NAMES, MALE_NAMES constants
    • NAMES_HISPANIC, NAMES_BLACK, NAMES_SOUTH_ASIAN, NAMES_EAST_ASIAN, NAMES_MIDDLE_EASTERN constants
    • SURNAMES_HISPANIC, SURNAMES_SOUTH_ASIAN, SURNAMES_EAST_ASIAN, SURNAMES_MIDDLE_EASTERN, SURNAMES_BLACK constants
    • The genderFor() and guessEthnicityFromFirstName() functions
    • All call sites that consumed these (face-pool bucket selection)
  • Replacement strategy:
    • For SYNTHETIC face pool routing: deterministic hash of candidate_id selects a face bucket, no demographic inference
    • For REAL candidate photos: the candidate's actual photo IS the representation; no inference needed
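
The synthetic-pool replacement can be illustrated with a small sketch of hash-based bucket selection. FNV-1a is used here only because it is simple and stable across builds. The document does not say which hash the shipped routing uses, and the fnv1a_64 and face_bucket names are hypothetical.

```rust
/// 64-bit FNV-1a over the raw bytes of the input.
/// Illustrative choice of hash; the shipped routing may use a different one.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV-1a 64-bit offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV 64-bit prime
    }
    hash
}

/// Maps a candidate_id to a synthetic face bucket in `0..bucket_count`.
/// Only the opaque id is hashed: no name, gender, or ethnicity input exists.
/// Returns `None` when there are no buckets to choose from.
pub fn face_bucket(candidate_id: &str, bucket_count: u64) -> Option<u64> {
    if bucket_count == 0 {
        return None;
    }
    Some(fnv1a_64(candidate_id.as_bytes()) % bucket_count)
}
```

Because the only input is candidate_id, the same candidate always lands in the same bucket, and nothing in the signature can carry a protected attribute.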

Why this is both a BIPA risk and a Title VII risk: name-based ethnicity classification is a discriminatory feature-engineering practice (Title VII) and, when combined with photo-based attribute extraction, also fits a "biometric information derived from a biometric identifier" pattern (BIPA broad reading). Removing the lookup tables forecloses both arguments.

Engineering acceptance:

  • Lookup tables removed from search.html
  • Unit test asserts no protected-attribute inference functions exist in search.html or any mcp-server module
  • Face-pool routing for synthetic faces uses candidate_id hash exclusively
  • Phase 1.5 §1B finding closed

Gate 5 — Documented destruction procedure

Required: A written procedure for biometric data destruction at retention expiry OR consent withdrawal OR right-to-be-forgotten request.

What ships:

  • docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md — operator-facing
  • Specifies:
    • Triggers: retention expiry (per Gate 1), withdrawal, RTBF request, candidate request
    • Procedure: catalogd-local POST /biometric/subject/{id}/erase (legal-tier auth) — formerly proposed under identityd; now serves from catalogd directly
    • Erasure scope: BiometricCollection set to None on the subject manifest (drops data_path, template_hash, classifications together), quarantined photo files at data/biometric/uploads/<id>/* securely unlinked, audit row appended BEFORE photo unlink so the chain proves intent even if file delete fails
    • Backup window: per IDENTITY_SERVICE_DESIGN v3-B12, residual exists in DB backups for 30 days max; subject is informed
    • Witnessed: every erasure event written to pii_access_log with purpose_token='biometric_erasure' and the legal-tier JWT signature (proves authorized destruction)
    • Reporting: monthly internal report of erasures + retention-expiry sweeps; available to counsel on request
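
The ordering rule in the erasure scope (audit row first, photo unlink afterwards) can be sketched in memory as follows. The Subject struct, the erase_biometric function, and the closure-based unlink are hypothetical stand-ins so the example stays off the real filesystem. Clearing the manifest before the unlink is an assumption, since the runbook text only fixes the position of the audit row.

```rust
/// Hypothetical in-memory stand-in for a subject manifest.
pub struct Subject {
    pub id: String,
    /// Stand-in for `Option<BiometricCollection>`; holds the data path only.
    pub biometric_collection: Option<String>,
}

/// Erases biometric state for one subject.
/// `unlink_photos` is supplied by the caller and receives the subject id.
pub fn erase_biometric<F>(
    subject: &mut Subject,
    audit_log: &mut Vec<String>,
    unlink_photos: F,
) -> Result<(), String>
where
    F: FnOnce(&str) -> Result<(), String>,
{
    // 1. Append the audit row BEFORE touching any files, so the log proves
    //    intent even if the delete below fails.
    audit_log.push(format!(
        "purpose_token=biometric_erasure subject={}",
        subject.id
    ));

    // 2. Drop the manifest entry (assumed ordering; see the lead-in).
    subject.biometric_collection = None;

    // 3. Unlink the quarantined photos; a failure surfaces to the caller
    //    while the audit row from step 1 is already recorded.
    unlink_photos(&subject.id)
}
```

If the closure returns an error, the caller still holds an audit row recording the attempted erasure, which is the property the runbook asks for.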

⚖ COUNSEL — review the runbook for legal sufficiency. Engineering writes the procedure; counsel attests that the procedure satisfies BIPA §15(a) destruction requirements.

Engineering acceptance:

  • Runbook committed
  • POST /biometric/subject/{id}/erase endpoint includes biometric-specific erasure path (shipped 848a458 — 21 unit tests, two scopes: biometric_only / full)
  • Daily sweep job destroys biometric data past biometric_retention_until (separate from general retention sweep — biometric has stricter clock)
  • Erasure events are logged with cryptographic attestation
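
The selection step of the daily sweep might look like the sketch below. RetentionRecord and subjects_due_for_destruction are hypothetical names, and storing biometric_retention_until as Unix seconds is an assumption; the document only names the field.

```rust
/// Hypothetical per-subject retention record.
pub struct RetentionRecord {
    pub subject_id: String,
    /// Unix seconds (assumed representation). `None` means no biometric
    /// data is held for this subject.
    pub biometric_retention_until: Option<u64>,
}

/// Returns the ids of subjects whose biometric retention deadline has passed.
/// Subjects without biometric data are never selected.
pub fn subjects_due_for_destruction(records: &[RetentionRecord], now_unix: u64) -> Vec<&str> {
    records
        .iter()
        .filter(|r| matches!(r.biometric_retention_until, Some(until) if until < now_unix))
        .map(|r| r.subject_id.as_str())
        .collect()
}
```

The sweep would then pass each returned id through the same erasure path used for withdrawal, so expiry and withdrawal leave identical audit evidence.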

2. Cryptographic attestation: no biometric data exists pre-identityd

Per IDENTITY_SERVICE_DESIGN v3-B11. Plaintiffs may argue that the EXISTENCE of biometric schema fields constitutes constructive notice of intent to collect biometric data — therefore consent should have preceded the schema. The defense: prove that no biometric data was actually collected from real candidates before identityd + the consent gate.

What ships:

  • A one-shot script scripts/staffing/attest_pre_identityd_biometric_state.sh that:
    • Queries data/datasets/workers_500k.parquet schema and confirms NO column named photo, biometric_*, face_*, image_* exists
    • Greps data/_kb/*.jsonl and data/_pathway_memory/state.json for any base64-encoded image bytes (deepface output, photo blobs)
    • Verifies data/headshots/manifest.jsonl rows ≤ synthetic face pool size
    • Hashes the schema + summary; commits the hash to S3 Object Lock (per identity service v3 anchor pattern)
  • Attestation document docs/BIPA_PRE_IDENTITYD_ATTESTATION_2026-05-XX.md signed by J + outside counsel
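
The schema check in the first bullet reduces to a name filter. The sketch below assumes the column names have already been read from the parquet schema and that matching is exact and case-sensitive; the forbidden_columns name is hypothetical, and the actual script is shell, not Rust.

```rust
/// Returns every column name that the attestation treats as biometric:
/// exactly `photo`, or any name starting with `biometric_`, `face_`,
/// or `image_`. An empty result means the schema check passes.
pub fn forbidden_columns<'a>(columns: &[&'a str]) -> Vec<&'a str> {
    const PREFIXES: [&str; 3] = ["biometric_", "face_", "image_"];
    columns
        .iter()
        .copied()
        .filter(|c| *c == "photo" || PREFIXES.iter().any(|p| c.starts_with(*p)))
        .collect()
}
```

Returning the offending names rather than a boolean lets the attestation summary list exactly what was found if the check ever fails.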

This is a one-time defense artifact. It establishes the baseline: "as of this date, no biometric data was collected from real candidates."


3. Employee training acknowledgment (general BIPA hygiene)

Required: People with access to biometric data acknowledge BIPA-handling training.

What ships:

  • docs/policies/BIPA_HANDLING_TRAINING_v1.md — training material covering:
    • What constitutes biometric identifiers / information
    • The consent + retention procedures
    • Destruction obligations
    • Reporting suspected exposure
  • Acknowledgment record per individual (initially: J + counsel + named operators)
  • Annual refresh

⚖ COUNSEL — write training content. Engineering doesn't author legal-compliance training.


4. Phase 1.6 exit criteria (gates Phase 2 backfill)

All 5 gates, plus the attestation listed as item 6 below, must be DONE before identityd backfill begins. Status as of 2026-05-03 (scaffolds vs. counsel sign-off vs. shipped code):

| # | Gate | Engineering | Counsel | Status |
|---|------|-------------|---------|--------|
| 1 | Public retention schedule | scaffolded at docs/policies/consent/biometric_retention_schedule_v1.md | pending | eng-staged |
| 2 | Consent template | scaffolded at docs/policies/consent/biometric_consent_template_v1.md | pending | eng-staged |
| 3 | Photo-upload endpoint with consent enforcement | DONE — crates/catalogd/src/biometric_endpoint.rs mounted at /biometric/subject/{id}/photo, 11 unit tests, live-verified end-to-end. Gate 3b DECIDED 2026-05-05: Option C (defer classifications). BiometricCollection.classifications stays Option<JSON> = None in v1; consent + retention docs revised to match. See docs/specs/GATE_3B_DEEPFACE_DESIGN.md §6 + change log. | reviewed under Gate 2 (matching consent text) | DONE — 3a shipped, 3b deferred per design doc |
| 4 | Name → ethnicity inference removed | DONE — mcp-server/search.html:3372 removal note + mcp-server/phase_1_6_gate_4.test.ts absence test (3/3 green) | none required | DONE |
| 5 | Destruction runbook | scaffolded at docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md; erasure endpoint + verify/report scripts marked TODO | pending | eng-staged |

PLUS:

| # | Item | Engineering | Counsel | Status |
|---|------|-------------|---------|--------|
| 6 | Cryptographic attestation pre-identityd | DONE — scripts/staffing/attest_pre_identityd_biometric_state.sh + docs/attestations/BIPA_PRE_IDENTITYD_ATTESTATION_2026-05-03.md (3/3 evidence checks pass; signature lines pending) | pending signature | eng-DONE, signature-pending |
| 7 | Employee training material | scaffold deferred — Gate 5 runbook §7 acknowledgment may serve as substrate | pending | deferred |

Blocking set for Phase 2 backfill: items 1, 2, 3, 4, 5, 6 must all be DONE. Item 7 (employee training) is reduced from blocking to "deferred" because the Gate 5 destruction runbook §7 already requires operator acknowledgment before legal-tier credentials are issued — that acknowledgment is procedurally equivalent to the training-record requirement when the operator population is small (J + 1-2 named operators). If the operator population grows beyond that, item 7 re-promotes to blocking and a separate training program must be authored.

⚖ COUNSEL — confirm whether item 7 deferral is acceptable for the expected operator population size, or restore it to the blocking set.

Calendar bottleneck: Items 1, 2, 5, 6 (and #7) await counsel review of the engineering scaffolds. Gate 3 substrate is fully shipped; Gate 3b deepface classification was DECIDED on 2026-05-05 as Option C (defer) — BiometricCollection.classifications stays None in v1, consent + retention docs revised to match this narrower scope. If a future product requirement surfaces a real need for classifications, the substrate is forward-compatible (Option<JSON>) and either Option A (~1 day) or Option B (~5 days) of the design doc can be picked up then under a v2 consent template.


5. Effort estimate

| Gate | Engineering effort | Legal effort |
|------|--------------------|--------------|
| Gate 1 (retention schedule) | 0.5 day | counsel-dependent (typically 1-2 weeks for review) |
| Gate 2 (consent template) | 0.5 day | counsel-dependent (typically 2-4 weeks for review and consent UX design) |
| Gate 3 (photo-upload endpoint) | 1-2 days | review of endpoint behavior |
| Gate 4 (deprecate name-ethnicity inference) | 0.5 day | none (engineering-only fix) |
| Gate 5 (destruction runbook) | 1 day | counsel sign-off |
| §2 cryptographic attestation | 0.5 day | counsel + J signature |
| §3 employee training | 0.25 day (admin) | counsel-authored content |
| Total engineering | ~4-5 days | |
| Total counsel | | ~3-6 weeks calendar (review cycles) |

The calendar bottleneck is counsel, not engineering. Engineering can stage all 5 gates ready-to-ship in a week. Counsel sign-off + consent UX rollout is the longer pole.


6. Open questions for J + counsel

  1. Photo-upload UX: is there an existing intake form / staffer console where photo upload would happen? Or is this new UI work?
  2. Consent collection mechanism: electronic signature service (DocuSign, Adobe Sign), in-app click-acceptance, paper form? Each has different evidentiary weight in litigation.
  3. Operator list with biometric access: who, today, would be on the named-operators list for §3 training?
  4. Counsel for sign-off: named outside counsel — same or different from the dual-control legal-token party in identity service?
  5. Public privacy policy URL: does one exist? If yes, where; if no, that's a separate Gate-1.5 deliverable.

6.5. Post-signature deployment runbook

When counsel returns the countersigned consent template + retention schedule, the engineering side of "flip from permissive to strict mode" is one command:

# 1. Counsel commits their signature to §7 of the consent template
#    markdown (or J commits the signed PDF + updates §7 with counsel's
#    name + date). The markdown is the BINDING TEXT — the PDF is just
#    a rendering of it.

# 2. Hash the canonical signed text + seed the gateway allowlist.
./scripts/staffing/seed_consent_version.sh \
    docs/policies/consent/biometric_consent_template_v1.md \
    --label "v1 signed YYYY-MM-DD by [counsel name]"

# The script:
#   - computes SHA-256 of the markdown (binding text)
#   - atomically writes /etc/lakehouse/consent_versions.json with
#     the new hash + per-seed audit metadata (timestamp, label,
#     source path)
#   - restarts lakehouse.service so the gateway re-reads the
#     allowlist
#   - probes /biometric/health for clean restart

# 3. Verify strict mode is rejecting unknown hashes:
TOKEN=$(cat /etc/lakehouse/legal_audit.token)
curl -sS -X POST http://localhost:3100/biometric/subject/WORKER-1/consent \
    -H "X-Lakehouse-Legal-Token: $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"consent_version_hash":"deadbeefdeadbeef000000000000000000000000000000000000000000000000","consent_collection_method":"electronic_signature","operator_of_record":"strict_mode_probe"}'
# Expect: HTTP 400 + {"error":"consent_version_unknown", ...}

After this, the gateway is in counsel-tier strict mode:

  • Any consent grant POST whose consent_version_hash doesn't match a known signed template is refused at intake
  • Operator typos (mistyped hash) become loud failures, not silent bad records
  • Future template revisions (v2, v3, ...) require counsel re-sign AND a new seed_consent_version.sh run before being accepted — the v1 hash stays in the allowlist for already-collected subjects' audit-trail compatibility
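
The strict-mode rule in the first bullet is an allowlist membership test. The sketch below assumes the allowlist has already been loaded from /etc/lakehouse/consent_versions.json into memory and that hashes are compared as exact strings; check_consent_version is a hypothetical name, and the 400 status and consent_version_unknown error code are taken from the probe output in step 3.

```rust
/// Checks a submitted consent_version_hash against the known hashes.
/// Returns the HTTP status and error code to send when the hash is unknown.
pub fn check_consent_version(
    allowlist: &[String],
    submitted_hash: &str,
) -> Result<(), (u16, &'static str)> {
    if allowlist.iter().any(|known| known == submitted_hash) {
        Ok(())
    } else {
        // A mistyped or unseeded hash fails loudly instead of producing a
        // consent record that points at no known template.
        Err((400, "consent_version_unknown"))
    }
}
```

Older hashes stay accepted for as long as they remain in the list, which matches the audit-trail compatibility note for v1 above.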

Pre-counsel demo seed

For deployments that want to exercise strict mode BEFORE counsel signs, the same script works against the eng-staged template:

./scripts/staffing/seed_consent_version.sh \
    docs/policies/consent/biometric_consent_template_v1.md \
    --label "eng-staged demo seed (NOT counsel-signed)"

The hash entry should be replaced (rotate the demo hash out, add the counsel-signed hash) when counsel completes review. The allowlist's _meta.seeded_at[] array preserves the seed history.
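
The seed-history behavior can be modeled in a few lines. The commit description for seed_consent_version.sh states that re-running with the same hash leaves the versions array unchanged but still appends a reseed entry to the audit metadata; the sketch below reproduces only that rule. The Allowlist and SeedEntry types and their field names are hypothetical, not the actual JSON layout, and the real script is shell.

```rust
/// One audit entry per seed run (hypothetical shape of a seeded_at[] item).
pub struct SeedEntry {
    pub label: String,
    pub hash: String,
    /// True when the hash was already present, i.e. this run was a reseed.
    pub reseed: bool,
}

/// In-memory model of the allowlist file (hypothetical field names).
#[derive(Default)]
pub struct Allowlist {
    pub versions: Vec<String>,
    pub seeded_at: Vec<SeedEntry>,
}

impl Allowlist {
    /// Seeds a hash. Returns true if the versions array changed.
    /// The audit history always grows by one entry, even on a reseed.
    pub fn seed(&mut self, hash: &str, label: &str) -> bool {
        let already_known = self.versions.iter().any(|v| v == hash);
        if !already_known {
            self.versions.push(hash.to_string());
        }
        self.seeded_at.push(SeedEntry {
            label: label.to_string(),
            hash: hash.to_string(),
            reseed: already_known,
        });
        !already_known
    }
}
```

Seeding the eng-staged hash twice in this model yields one entry in versions and two entries in seeded_at, the second flagged as a reseed.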


7. What this PRD is NOT

  • Not legal advice. The ⚖ COUNSEL markers exist because the binding text needs lawyers, not engineers.
  • Not a substitute for a DPIA / PIA. Phase 1.6 satisfies BIPA-specific gates; a Data Protection Impact Assessment is broader and may be required separately.
  • Not a SOC2 Type II deliverable. SOC2 is a parallel work stream.
  • Not the only gate before production. The full 9-phase audit-trail program continues; Phase 1.6 specifically unblocks Phase 2 (identity service implementation).

Change log

  • 2026-05-05 — Reconciled with shipped state: endpoint paths corrected from the legacy identityd v1 spec (/v1/identity/subjects/*) to the catalogd-local routes that actually shipped (/biometric/subject/*). Schema block rewritten to reflect the JSON SubjectManifest.biometric_collection substrate that replaced the proposed Postgres columns. Gate 3b deepface deferral marked in-line where Disclosure 1 / Gate 3 step 4c / Gate 5 erasure scope previously assumed classifications were collected. No legal text changed; this was doc/code drift cleanup.
  • 2026-05-03 — Initial draft. Authored after IDENTITY_SERVICE_DESIGN v3 §5 Step 0 named Phase 1.6 as a hard prerequisite to backfill.