diff --git a/docs/PHASE_1_6_BIPA_GATES.md b/docs/PHASE_1_6_BIPA_GATES.md index 2d57524..6d677f2 100644 --- a/docs/PHASE_1_6_BIPA_GATES.md +++ b/docs/PHASE_1_6_BIPA_GATES.md @@ -19,7 +19,7 @@ Each gate is a deliverable that must ship before real-photo intake. None is opti **Required:** A publicly-available, written retention schedule for biometric identifiers and information. **What ships:** -- `data/_consent/biometric_retention_schedule_v1.md` — public file +- `docs/policies/consent/biometric_retention_schedule_v1.md` — public file - Linked from public privacy policy at the deployment URL - Specifies: - Categories of biometric data collected (facial geometry derived from candidate photos, age estimate, gender classification, race classification — per Phase 1.5 deepface walk) @@ -39,7 +39,7 @@ Each gate is a deliverable that must ship before real-photo intake. None is opti **Required:** Informed, written consent BEFORE any biometric collection occurs. **What ships:** -- `data/_consent/biometric_consent_template_v1.md` — public consent template +- `docs/policies/consent/biometric_consent_template_v1.md` — public consent template - Versioned, hashed, referenced from identityd's `consent_versions` table - Must disclose, per BIPA §15(b)(1)-(3): 1. That biometric identifiers/information will be collected @@ -181,20 +181,31 @@ ALTER TABLE subjects ADD COLUMN biometric_template_hash TEXT; -- hash of the ## 4. Phase 1.6 exit criteria (gates Phase 2 backfill) -All 5 gates must be DONE before identityd backfill begins: +All 5 gates must be DONE before identityd backfill begins. Status as +of 2026-05-03 — scaffolds vs. counsel sign-off vs. shipped code: -1. ✅ Public retention schedule published + linked from privacy policy + counsel sign-off -2. ✅ Consent template published + counsel sign-off + technical enforcement integrated -3. ✅ Photo-upload endpoint shipped with consent enforcement + integration test green -4. 
✅ Name → ethnicity inference removed from search.html + unit test asserting absence -5. ✅ Destruction runbook published + erasure endpoint includes biometric path + counsel sign-off +| # | Gate | Engineering | Counsel | Status | +|---|---|---|---|---| +| 1 | Public retention schedule | scaffolded at `docs/policies/consent/biometric_retention_schedule_v1.md` | pending | **eng-staged** | +| 2 | Consent template | scaffolded at `docs/policies/consent/biometric_consent_template_v1.md` | pending | **eng-staged** | +| 3 | Photo-upload endpoint with consent enforcement | NOT STARTED — depends on identityd photo intake design + deepface integration | n/a until eng | **blocked-on-design** | +| 4 | Name → ethnicity inference removed | DONE — `mcp-server/search.html:3372` removal note + `mcp-server/phase_1_6_gate_4.test.ts` absence test (3/3 green) | none required | **DONE** | +| 5 | Destruction runbook | scaffolded at `docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md`; erasure endpoint + verify/report scripts marked TODO | pending | **eng-staged** | PLUS: -6. ✅ Cryptographic attestation that no pre-identityd biometric data exists, signed by J + counsel -7. 
✅ Employee training material published + initial acknowledgments recorded +| # | Item | Engineering | Counsel | Status | +|---|---|---|---|---| +| 6 | Cryptographic attestation pre-identityd | DONE — `scripts/staffing/attest_pre_identityd_biometric_state.sh` + `docs/attestations/BIPA_PRE_IDENTITYD_ATTESTATION_2026-05-03.md` (3/3 evidence checks pass; signature lines pending) | pending signature | **eng-DONE, signature-pending** | +| 7 | Employee training material | scaffold deferred — Gate 5 runbook §7 acknowledgment may serve as substrate | pending | **deferred** | -Until all 7 are checked off, **identity service backfill (Phase 2 §5 Step 5) cannot proceed.** +Until items 1-5 + 6 are checked off, **identity service backfill (Phase 2 §5 Step 5) cannot proceed.** + +**Calendar bottleneck:** Items 1, 2, 5, 6 (and #7) await counsel +review of the engineering scaffolds. Gate 3 (photo-upload endpoint) +is the only remaining engineering work; it's deferred to its own +session because it crosses into identityd photo intake and deepface +integration scope that hasn't been designed yet. --- diff --git a/docs/attestations/BIPA_PRE_IDENTITYD_ATTESTATION_2026-05-03.md b/docs/attestations/BIPA_PRE_IDENTITYD_ATTESTATION_2026-05-03.md new file mode 100644 index 0000000..8f1d861 --- /dev/null +++ b/docs/attestations/BIPA_PRE_IDENTITYD_ATTESTATION_2026-05-03.md @@ -0,0 +1,91 @@ +# BIPA Pre-IdentityD Biometric Attestation + +**Date:** 2026-05-03 +**Spec:** docs/PHASE_1_6_BIPA_GATES.md §2 +**Generator:** scripts/staffing/attest_pre_identityd_biometric_state.sh + +## Purpose + +This is a one-time defense artifact establishing that, as of +2026-05-03, no biometric identifiers or biometric information +from real candidates have been collected, processed, or stored +by the Lakehouse system. It is intended to be signed by J +(operator of record) and outside counsel, then anchored to a +tamper-evident store (filesystem with backups + version control). 
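
One way to make the version-control anchor concrete is to commit the signed document and record the resulting commit hash alongside it. The sketch below is illustrative only — the file path, message, and identity are hypothetical, not what the generator script actually does:

```shell
#!/usr/bin/env sh
# Illustrative anchoring sketch: commit the attestation into a repo and
# record the commit hash as the tamper-evident anchor. All names here
# are hypothetical stand-ins for the real repository layout.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir -p docs/attestations
echo "3 / 3 evidence checks pass" > docs/attestations/ATTESTATION.md
git add docs/attestations/ATTESTATION.md
git -c user.email=ops@example.invalid -c user.name="operator" \
    commit -q -m "anchor: BIPA pre-identityd attestation"
anchor=$(git rev-parse HEAD)
echo "anchor commit: $anchor"
```

Because the commit id covers the file content and history, rewriting either changes the hash — which is what makes the recorded commit id usable as an anchor when the repository is also backed up off-host.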
+ +## Evidence + +## Check 1 — workers_500k.parquet schema (no biometric columns) + +**Source:** `data/datasets/workers_500k.parquet` + +**Schema columns** (18 total): + +``` +worker_id +name +role +email +phone +city +state +zip +skills +certifications +archetype +reliability +responsiveness +engagement +compliance +availability +communications +resume_text +``` + +**Schema SHA-256:** `4ba17870ce25a186a62bdfc29a3b336947dc2fba8a62c42ca249c81f41d32e30` + + - PASS: no biometric / photo / face / image column present + +## Check 2 — KB + pathway memory contain no biometric payloads + +**Sources scanned:** +- `data/_kb/*.jsonl` (knowledge base) +- `data/_pathway_memory/state.json` (pathway memory state) + +**Files scanned:** 33 +**Forbidden-pattern hits:** 0 + + - PASS: no biometric payload patterns found in scanned files + +## Check 3 — Headshots manifest is synthetic-only + +**Source:** `data/headshots/manifest.jsonl` + +**Total rows:** 1000 +**Rows tagged real/candidate_upload/photo_upload:** 0 + + - PASS: all 1000 rows are synthetic (no real-candidate uploads) + +## Summary + +**3 / 3** evidence checks pass. + + +--- + +## Attestation + +I, the undersigned, attest that the above evidence accurately +reflects the state of the Lakehouse system as of 2026-05-03. +No biometric identifiers or biometric information from real +candidates have been collected, processed, or stored prior to +the deployment of the Phase 1.6 BIPA pre-launch gates. 
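
The `Evidence SHA-256` line that follows is produced by the generator script digesting the evidence summary above. A minimal sketch of the mechanism, using an illustrative summary string rather than the real evidence block (the exact canonicalization is defined in `scripts/staffing/attest_pre_identityd_biometric_state.sh`, not here):

```shell
#!/usr/bin/env sh
# Sketch: digest an evidence summary so any reviewer can recompute it.
# The summary lines here are illustrative, not the real evidence block.
set -eu
evidence=$(mktemp)
{
  echo "Check 1 - PASS: no biometric column present"
  echo "Check 2 - PASS: no biometric payload patterns found"
  echo "Check 3 - PASS: all manifest rows are synthetic"
} > "$evidence"
digest=$(sha256sum "$evidence" | awk '{print $1}')
echo "Evidence SHA-256: $digest"
rm -f "$evidence"
```

Because the digest is computed over the evidence text, any later edit to the evidence section invalidates the recorded hash.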
+ +**Evidence SHA-256:** `230fffeb77b502717bcd7161cc74d5a3401b8722acc8d6ed3d524f93e261cd0b` + +--- + +**Operator (J):** _______________________________ Date: __________ + +**Outside counsel:** ___________________________ Date: __________ + diff --git a/docs/policies/consent/biometric_consent_template_v1.md b/docs/policies/consent/biometric_consent_template_v1.md new file mode 100644 index 0000000..5753257 --- /dev/null +++ b/docs/policies/consent/biometric_consent_template_v1.md @@ -0,0 +1,173 @@ +# Biometric Information Consent — v1 + +**Spec:** docs/PHASE_1_6_BIPA_GATES.md §1 Gate 2 (BIPA §15(b)(1)-(3)) +**Status:** Engineering scaffold — ⚖ COUNSEL must author the binding text before deployment +**Version:** v1 (initial; supersession requires a new version + new hash) + +> This is the consent template a candidate signs (electronically or +> on paper) BEFORE Lakehouse collects, stores, or processes any +> biometric identifier or biometric information from that candidate. +> +> Without an executed consent under this template (or a counsel- +> approved successor), the system MUST NOT accept a photograph from +> the candidate. Enforcement lives at the photo-upload endpoint +> (Gate 3) and at the SubjectManifest writer, which refuses biometric +> writes when `consent.biometric.status != "given"`. + +--- + +## Required disclosures (BIPA §15(b)(1)-(3)) + +The disclosures below are MANDATORY content per 740 ILCS 14/15(b). +⚖ COUNSEL — render this content into binding language appropriate +for the candidate-facing UI. Engineering provides the structural +content; counsel provides the legally-sufficient wording. + +### Disclosure 1 — Notice of collection (§15(b)(1)) + +Lakehouse will collect, store, and use my **biometric identifier** +(facial geometry derived from a photograph of me) and **biometric +information** (gender, race, and age classifications derived from +that photograph by an automated facial-classification model called +deepface). 
+ +### Disclosure 2 — Specific purpose and length of term (§15(b)(2)) + +The biometric data will be used for: + +1. Identity verification at staffing job sites +2. Internal record-keeping so coordinators can recognize me across + placements + +The biometric data will be retained for a maximum of **18 months** +from my most recent interaction with the staffing platform, after +which it will be permanently destroyed per the +[Biometric Retention Schedule v1](biometric_retention_schedule_v1.md). + +I may withdraw this consent at any time by contacting the operator +(see §3 below). Withdrawal triggers permanent destruction of my +biometric data. + +### Disclosure 3 — Written release (§15(b)(3)) + +I provide a written release authorizing Lakehouse to collect, store, +and use my biometric identifier and biometric information for the +purposes stated above and for the term stated above. + +--- + +## 1. Plain-language summary (non-binding) + +⚖ COUNSEL — the section above is the binding legal disclosure. +The summary below is provided for candidate comprehension and is +NOT a substitute for the binding disclosure. Both should appear +together in the consent UI; counsel determines whether this summary +is appropriate to include or whether a different plain-language +section is preferred. + +> **What you're agreeing to:** if you upload a photo of yourself, +> we'll keep that photo and a few descriptive labels about the photo +> (estimated age, perceived gender, perceived race) to help your +> staffing coordinator recognize you when you arrive at job sites. +> +> **How long we keep it:** at most 18 months after your last +> placement or interaction with us, then it's permanently destroyed. +> +> **What we DON'T do with it:** we don't sell it, we don't share it +> with anyone outside the staffing operation unless legally compelled, +> and we don't use it to decide what jobs to recommend to you. 
+> +> **How to take it back:** contact us (§3 below) at any time to +> withdraw your consent. We will permanently destroy your biometric +> data within 30 days of receiving your request. + +--- + +## 2. Withdrawal procedure + +I may withdraw biometric consent at any time. Withdrawal: + +- Is free of charge +- Does not affect my ability to remain on the staffing platform + (only my biometric data is removed) +- Triggers permanent destruction of all biometric data within + 30 days, per the destruction runbook +- Is recorded as an append-only audit row in my per-subject + audit log, providing me with tamper-evident proof of withdrawal + if I subsequently exercise my BIPA right of action + +⚖ COUNSEL — confirm 30 days is the right destruction SLA. Some +deployments use 7 or 14 days. The runbook (Gate 5) currently +references this template's number, so changing it here updates +both. + +--- + +## 3. Contact for withdrawal / questions + +⚖ COUNSEL — supply the candidate-facing contact channel for +biometric-consent withdrawal. Examples: a dedicated email +(`biometric-consent@`), a postal address, a +named operator. The contact must be functional from day one of +deployment. + +--- + +## 4. Consent acknowledgment + +By signing below (electronically or on paper), I acknowledge that: + +1. I have read and understood the disclosures in §1-3 above +2. I am providing this consent voluntarily and free of coercion +3. I have received a copy of this consent template (or have been + provided a means to retrieve a copy at any time) + +| Field | Value | +|---|---| +| Candidate name | _______________________________ | +| Date | __________ | +| Signature | _______________________________ | +| Consent template version | v1 (SHA-256: _generated at deployment time_) | + +--- + +## 5. 
Operational integration + +The structured fields the consent UI must capture and post to +identityd: + +```json +{ + "candidate_id": "", + "consent_version_hash": "", + "consent_given_at": "", + "consent_collection_method": "", + "consent_collection_evidence_path": "" +} +``` + +These fields write to `SubjectManifest.consent.biometric.status='given'` +and the corresponding `SubjectAuditRow` (see +`crates/catalogd/src/subject_audit.rs`). + +--- + +## 6. Versioning + +This consent template is version v1. Per Gate 1's versioning rules, +any change to the binding disclosure language requires a new version, +and existing subjects retain their original consent_version reference +unless they re-consent under the new version. + +⚖ COUNSEL — confirm whether existing consent under v1 carries forward +when the schedule is updated, or whether re-consent is required. +This affects the deployment workflow. + +--- + +## 7. Authority + +| Role | Name | Signature | Date | +|---|---|---|---| +| Operator | J | _______________ | _____ | +| Outside counsel | _____________ | _______________ | _____ | diff --git a/docs/policies/consent/biometric_retention_schedule_v1.md b/docs/policies/consent/biometric_retention_schedule_v1.md new file mode 100644 index 0000000..5c08b20 --- /dev/null +++ b/docs/policies/consent/biometric_retention_schedule_v1.md @@ -0,0 +1,150 @@ +# Biometric Data Retention Schedule — v1 + +**Spec:** docs/PHASE_1_6_BIPA_GATES.md §1 Gate 1 (BIPA §15(a)) +**Status:** Engineering scaffold — ⚖ COUNSEL must author the binding text before public publication +**Version:** v1 (initial; supersession requires a new version + new hash) + +> This is a publicly-available retention schedule for biometric identifiers +> and biometric information collected by the Lakehouse staffing platform. +> It is required by 740 ILCS 14/15(a) (the Illinois Biometric Information +> Privacy Act) before any biometric collection from real candidates begins. + +--- + +## 1. 
What this schedule governs + +This schedule applies to: + +- **Biometric identifiers** as defined in 740 ILCS 14/10: facial geometry + derived from candidate photographs. +- **Biometric information** as defined in 740 ILCS 14/10: any information + derived from a biometric identifier, including but not limited to + the gender, race, and age classifications produced by the deepface + model when applied to a candidate photograph. + +**Out of scope** (explicitly NOT biometric data under this schedule): + +- Synthetic faces from the pre-existing face pool (`data/headshots/`). + These are computer-generated portraits, not derived from any real + individual, and are not "biometric identifiers" under 740 ILCS 14/10. +- Candidate names, email addresses, phone numbers, work history, + certifications, or any other non-biometric personal information. + These are governed by the general PII retention policy referenced + in the SubjectManifest substrate (see + `docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md`). + +--- + +## 2. Categories collected + +| Category | Source | Storage location | +|---|---|---| +| Photograph (raw bytes) | Candidate upload via the consent-gated photo endpoint | Quarantined under `data/biometric/uploads//.`; encrypted at rest | +| Facial geometry classifications | deepface inference run against the photograph | `subjects.biometric_classifications` (JSONB on the identityd `subjects` row) | +| Photograph integrity hash | SHA-256 of the original bytes | `subjects.biometric_template_hash` | + +We do NOT collect raw biometric template vectors that could be used +to re-derive a face from the encoded form. The deepface output is +stored as discrete classification labels (e.g. `{"age_estimate": 32, +"gender": "...", "race": "..."}`), not as a re-identifiable embedding. + +--- + +## 3. Purpose of collection + +Photographs and the classifications derived from them are used for: + +1. 
**Identity matching during staffing operations.** When a worker + arrives at a job site, the assigned coordinator may verify identity + by comparing the on-file photograph against the person present. +2. **Internal record-keeping.** Photographs become part of the worker + record so coordinators can recognize repeat workers across multiple + placements. + +Photographs and biometric classifications are NOT used for: + +- Demographic targeting in role recommendations (Title VII / IL Human + Rights Act compliance). +- Training of any machine-learning model. +- Sharing with third parties, except as required by court order or + with the candidate's separate written consent. +- Any purpose beyond those enumerated in §3.1-3.2 above. + +--- + +## 4. Retention period + +Per 740 ILCS 14/15(a), biometric identifiers and biometric information +must be permanently destroyed when the initial purpose for collection +has been satisfied OR within **three (3) years** of the individual's +last interaction with the private entity, whichever occurs first. + +**Operational ceiling:** Lakehouse retains biometric data for a +maximum of **eighteen (18) months** from the candidate's last placement +or last system interaction, whichever is later. This is more +restrictive than the BIPA statutory ceiling and provides a safety +margin against accidental over-retention. + +The 18-month clock is enforced by the daily retention sweep +(`crates/catalogd/src/bin/retention_sweep.rs`), which checks +`SubjectManifest.consent.biometric.retention_until` on every subject +and routes overdue subjects to the destruction queue (see Gate 5 +runbook). + +⚖ COUNSEL — confirm the 18-month operational ceiling is appropriate +for the deployment posture, or specify a different number. + +--- + +## 5. 
Destruction procedure + +Per 740 ILCS 14/15(a), Lakehouse follows the **BIPA Destruction +Runbook** (`docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md`) when: + +- Retention period under §4 expires +- Candidate withdraws biometric consent under the consent template (Gate 2) +- Candidate exercises a right-to-be-forgotten request +- An identityd `POST /v1/identity/subjects/{id}/erase` is invoked under + legal-tier authentication + +Every destruction event is recorded as an append-only audit row in +the affected subject's per-subject HMAC-chained audit log (see +`crates/catalogd/src/subject_audit.rs`), providing tamper-evident +proof of compliant destruction. + +--- + +## 6. Versioning + +This schedule is version v1. Future revisions: + +- Require a new version number (v2, v3, ...). +- Are committed to the repository with a `git` history showing the + revision date. +- Are referenced by SHA-256 hash from `consent_versions` table rows + in identityd, so each subject's consent record points unambiguously + at the schedule version that was in force when consent was given. + +**v1 SHA-256:** _generated at deployment time by_ `scripts/staffing/hash_consent_v1.sh` _(to be added when this schedule is finalized by counsel)_ + +--- + +## 7. Public availability + +⚖ COUNSEL — specify the public URL where this schedule will be +published (typically the privacy policy page on the deployment site) +and the disclosure language that links candidates to it from the +intake UI. + +--- + +## 8. Authority + +This schedule is adopted under the authority of J (operator of record) +and reviewed by ⚖ COUNSEL. Effective date: **TBD pending counsel +sign-off**. 
+ +| Role | Name | Signature | Date | +|---|---|---|---| +| Operator | J | _______________ | _____ | +| Outside counsel | _____________ | _______________ | _____ | diff --git a/docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md b/docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md new file mode 100644 index 0000000..6062e8e --- /dev/null +++ b/docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md @@ -0,0 +1,228 @@ +# BIPA Biometric Data Destruction Runbook + +**Spec:** docs/PHASE_1_6_BIPA_GATES.md §1 Gate 5 (BIPA §15(a)) +**Audience:** Operators (J + named operators with legal-tier credentials) +**Status:** Engineering scaffold — ⚖ COUNSEL must review for legal sufficiency before adoption + +> This runbook tells an operator HOW to destroy biometric data when +> a destruction trigger fires. It is a procedural document, not a +> design document. The cryptographic substrate that the destruction +> writes against (per-subject HMAC audit log + tombstone manifests) +> already ships in `crates/catalogd/`. + +--- + +## 1. When this runbook fires + +Destruction is mandatory when ANY of the following occurs: + +| Trigger | Source signal | SLA | +|---|---|---| +| **Retention expiry** | Daily `retention_sweep` flags `consent.biometric.retention_until < now` | 30 days from sweep flagging | +| **Consent withdrawal** | Candidate submits withdrawal per consent template §2 | 30 days from receipt | +| **Right-to-be-forgotten request** | Candidate submits RTBF request through documented contact channel | 30 days from receipt | +| **Court-ordered erasure** | Legal counsel directs erasure via a documented order | Per court order; default 30 days | + +⚖ COUNSEL — confirm 30 days is correct for all four. Some deployments +have stricter contractual or jurisdictional clocks (CCPA: 45 days but +sooner is better; GDPR Art. 17: "without undue delay"). + +--- + +## 2. Pre-destruction checks (5 minutes) + +Before initiating destruction, the operator MUST: + +1. 
**Verify the trigger.** Cross-reference one of the four sources + above. If the trigger is a candidate-initiated request, + confirm identity per the standard PII verification procedure + (knowledge factor + possession factor; see counsel for the + threshold). + +2. **Pull the current subject record.** Hit + `GET /audit/subject/{candidate_id}` with the legal-tier token. + The response includes: + - The current `SubjectManifest` (including `consent.biometric.status`) + - The full HMAC-chained audit log + - `chain_verified: true` (if false, STOP — chain integrity issue + must be investigated before destruction) + +3. **Check for legal hold.** ⚖ COUNSEL — if a legal hold can apply + to a subject's data (litigation, regulatory inquiry, subpoena), + document the procedure for checking that no hold is in force + before erasing. + +4. **Get the second-operator sign-off.** Per BIPA defensibility, + destruction is a two-operator action (operator-of-record + one + witness). The witness records their attestation in the + destruction-event audit row (§4 below). + +--- + +## 3. Destruction procedure + +### Step 1 — Erase via identityd + +Invoke the legal-tier erasure endpoint: + +```bash +curl -sf -X POST "http://localhost:3100/v1/identity/subjects/${CANDIDATE_ID}/erase" \ + -H "Authorization: Bearer $(cat /etc/lakehouse/legal_audit.token)" \ + -H "Content-Type: application/json" \ + -d '{ + "trigger": "retention_expiry|consent_withdrawal|rtbf|court_order", + "trigger_evidence_path": "", + "operator_of_record": "", + "witness": "" + }' +``` + +⚖ ENGINEERING — `POST /v1/identity/subjects/{id}/erase` is Phase 1.6 +Gate 3 dependent. Until it ships, the manual procedure is: + +a. Set `SubjectManifest.consent.biometric.status = "withdrawn"` and + `SubjectManifest.status = "erased"` via direct registry write + (operator-of-record only). +b. 
Securely overwrite + unlink the quarantined photo path: + `shred -uvz data/biometric/uploads/${CANDIDATE_ID}/*.jpg` + (or equivalent for the configured backend). +c. NULL the deepface classification fields on the subject row. +d. Append the destruction-event audit row (Step 2 below). + +### Step 2 — Append the destruction-event audit row + +The erasure endpoint AUTOMATICALLY writes one row to the subject's +per-subject audit log: + +```json +{ + "schema": "subject_audit.v1", + "ts": "", + "candidate_id": "", + "accessor": { + "kind": "biometric_erasure", + "daemon": "identityd", + "purpose": "biometric_erasure", + "trace_id": "" + }, + "fields_accessed": ["biometric_classifications", "biometric_data_path", "biometric_template_hash"], + "result": "erased", + "prev_chain_hash": "", + "row_hmac": "" +} +``` + +The HMAC chain extends through the erasure event, so the audit +log itself is preserved as anonymous-event proof of compliant +destruction even after the underlying biometric data is gone. + +### Step 3 — Verify destruction + +Run the verification script: + +```bash +./scripts/staffing/verify_biometric_erasure.sh "${CANDIDATE_ID}" +``` + +⚖ ENGINEERING — script TODO. Acceptance: +- Subject row biometric fields are NULL +- `data/biometric/uploads/${CANDIDATE_ID}/` directory is empty +- Most recent audit log row has `result: "erased"`, `accessor.kind: "biometric_erasure"` +- Chain still verifies (`chain_verified: true`) under the legal-tier endpoint + +If any check fails: STOP, do not mark the destruction complete, +escalate to engineering. + +### Step 4 — Notify the candidate (when applicable) + +For consent-withdrawal and RTBF triggers, the operator notifies +the candidate that destruction is complete. ⚖ COUNSEL — supply +the notification template (typically email; medium and language +are counsel-determined). + +--- + +## 4. 
Backup window disclosure + +Per `IDENTITY_SERVICE_DESIGN.md` v3-B12, biometric data may persist +in encrypted system backups for up to **30 days** after destruction +(rolling backup window). The candidate must be informed of this +when destruction is requested, and the destruction-event audit row +records the backup-window expiry date so the operator knows when +the residual is fully eliminated. + +⚖ COUNSEL — confirm whether the 30-day backup window is acceptable +under BIPA. Some interpretations require backups to be addressed +within a shorter window; some accept the operational reality of +backup retention. + +--- + +## 5. Reporting cadence + +Monthly, the operator-of-record produces a destruction-events +report: + +```bash +./scripts/staffing/biometric_destruction_report.sh \ + --month "$(date +%Y-%m)" \ + --output reports/biometric/destruction_$(date +%Y_%m).md +``` + +⚖ ENGINEERING — script TODO. The report aggregates: + +- Total destruction events in the month +- Breakdown by trigger (retention / withdrawal / RTBF / court) +- Median time-to-destruction from trigger to completion +- Any failures / escalations + +The monthly report is available to outside counsel on request. +It does NOT include candidate-identifying details — only the +counts, timings, and cryptographic attestations of the events. + +--- + +## 6. Audit trail attestation + +The per-subject HMAC chain is the cryptographic substrate that +makes destructions defensible after the fact. To produce an +attestation for a specific candidate's destruction: + +1. Hit `GET /audit/subject/{candidate_id}` with legal-tier token +2. Confirm `chain_verified: true` and most-recent row has + `accessor.kind: "biometric_erasure"` +3. Cross-runtime verify: the same audit log is byte-identical + under Rust + Go (per `scripts/cutover/parity/subject_audit_parity.sh`) +4. Counsel signs an attestation referencing the audit log's + chain root hash + +The chain root hash is itself a tamper-evident anchor. 
A motivated +insider would need the HMAC signing key (held in a separate location +from the audit logs themselves, per the spec) AND the original +log to forge a clean destruction record — and the cross-runtime +parity probe would catch a forgery that touched only one runtime's +view. + +--- + +## 7. Operator acknowledgment + +Operators with legal-tier credentials acknowledge they have read, +understood, and will follow this runbook before being granted access +to the legal_audit token. + +| Operator | Date acknowledged | Signature | +|---|---|---| +| J | _____ | _______________ | +| _____ | _____ | _______________ | + +⚖ COUNSEL — adopt this acknowledgment as the substrate for §3 of +Phase 1.6 (employee training acknowledgment), or specify a separate +training program. + +--- + +## 8. Change log + +- 2026-05-03 — Initial scaffold. ⚖ COUNSEL review required before + adoption. diff --git a/mcp-server/phase_1_6_gate_4.test.ts b/mcp-server/phase_1_6_gate_4.test.ts new file mode 100644 index 0000000..2c38803 --- /dev/null +++ b/mcp-server/phase_1_6_gate_4.test.ts @@ -0,0 +1,130 @@ +// Phase 1.6 Gate 4 absence test. +// +// Spec: docs/PHASE_1_6_BIPA_GATES.md §1 Gate 4 — Engineering acceptance: +// "Unit test asserts no protected-attribute inference functions exist +// in search.html or any mcp-server module" +// +// What this guards: the FEMALE_NAMES / NAMES_HISPANIC / SURNAMES_* lookup +// tables and the genderFor() / guessEthnicityFromFirstName() / etc. +// inference functions removed 2026-05-03. Re-introduction would re-open +// (1) Title VII / IL Human Rights Act discriminatory-feature risk and +// (2) BIPA's broad-reading "biometric information derived from a biometric +// identifier" pattern when combined with deepface output. +// +// Strategy: walk every .html / .ts / .tsx / .js / .mjs file under +// mcp-server/ and grep-assert that none of them DEFINE the forbidden +// symbols. 
We deliberately allow the symbol NAMES to appear inside +// comments — search.html has a removal note that names them so future +// readers know what was excised — but we forbid actual definition +// patterns (var / const / let / function / class member / object literal). + +import { test, expect } from "bun:test"; +import { readdirSync, statSync, readFileSync } from "node:fs"; +import { join } from "node:path"; + +const FORBIDDEN_DATA_TABLES = [ + // First-name lookup tables + "FEMALE_NAMES", + "MALE_NAMES", + "NAMES_HISPANIC", + "NAMES_BLACK", + "NAMES_SOUTH_ASIAN", + "NAMES_EAST_ASIAN", + "NAMES_MIDDLE_EASTERN", + // Surname lookup tables + "SURNAMES_HISPANIC", + "SURNAMES_BLACK", + "SURNAMES_SOUTH_ASIAN", + "SURNAMES_EAST_ASIAN", + "SURNAMES_MIDDLE_EASTERN", +]; + +const FORBIDDEN_FUNCTIONS = [ + "guessGenderFromFirstName", + "guessEthnicityFromName", + "guessEthnicityFromFirstName", + "genderFor", +]; + +function* walkSource(dir: string): Generator { + for (const entry of readdirSync(dir)) { + if (entry === "node_modules" || entry === "dist" || entry.startsWith(".")) continue; + const path = join(dir, entry); + const stat = statSync(path); + if (stat.isDirectory()) { + yield* walkSource(path); + } else if (/\.(html|ts|tsx|js|mjs)$/.test(entry)) { + // Don't grep this test file itself — it lists the forbidden tokens + // by name as match targets, not as definitions. + if (path.endsWith("phase_1_6_gate_4.test.ts")) continue; + yield path; + } + } +} + +// definitionPatternsFor: returns regexes that match common DEFINITION +// forms in JS/TS/HTML embedded scripts. A bare reference inside a +// comment is intentionally NOT matched. +function definitionPatternsFor(symbol: string): RegExp[] { + return [ + // var / const / let SYMBOL = + new RegExp(`\\b(?:var|const|let)\\s+${symbol}\\b\\s*=`), + // function SYMBOL( + new RegExp(`\\bfunction\\s+${symbol}\\s*\\(`), + // SYMBOL = function( OR SYMBOL = (...) 
=> + new RegExp(`\\b${symbol}\\s*=\\s*(?:function\\s*\\(|\\([^)]*\\)\\s*=>|async\\s*(?:\\(|function))`), + // class member: SYMBOL(...) { (a method declaration) + new RegExp(`^\\s*${symbol}\\s*\\([^)]*\\)\\s*\\{`, "m"), + ]; +} + +function findOffenders(filePath: string, text: string): string[] { + const out: string[] = []; + for (const sym of [...FORBIDDEN_DATA_TABLES, ...FORBIDDEN_FUNCTIONS]) { + for (const pattern of definitionPatternsFor(sym)) { + if (pattern.test(text)) { + out.push(`${filePath}: definition of ${sym} (matched ${pattern})`); + } + } + } + return out; +} + +test("Gate 4: no protected-attribute inference DEFINITIONS in mcp-server", () => { + const root = import.meta.dir; + const offenders: string[] = []; + for (const path of walkSource(root)) { + const text = readFileSync(path, "utf8"); + offenders.push(...findOffenders(path, text)); + } + if (offenders.length > 0) { + throw new Error( + `Phase 1.6 Gate 4 violation — protected-attribute inference symbols defined in mcp-server:\n` + + offenders.map((o) => ` - ${o}`).join("\n"), + ); + } + expect(offenders.length).toBe(0); +}); + +// Sanity: confirm the test actually walks files (otherwise the absence +// assertion is vacuously true). If mcp-server ever lost its source +// tree, this would catch it. +test("Gate 4: walker actually finds source files to scan", () => { + const root = import.meta.dir; + let count = 0; + for (const _ of walkSource(root)) count++; + expect(count).toBeGreaterThan(5); // mcp-server has more than 5 source files +}); + +// Defense in depth: the regex itself must catch a synthetic positive. +// If the definition pattern ever stops matching real code, the absence +// test would silently pass on actual reintroductions. 
+test("Gate 4: regex catches a synthetic positive (defense in depth)", () => { + const synthetic = + `var NAMES_HISPANIC = ["Maria"];\n` + + `function guessEthnicityFromFirstName(name) { return "?"; }\n`; + const offenders = findOffenders("synthetic_test_input", synthetic); + expect(offenders.length).toBeGreaterThanOrEqual(2); + expect(offenders.some((o) => o.includes("NAMES_HISPANIC"))).toBe(true); + expect(offenders.some((o) => o.includes("guessEthnicityFromFirstName"))).toBe(true); +}); diff --git a/scripts/staffing/attest_pre_identityd_biometric_state.sh b/scripts/staffing/attest_pre_identityd_biometric_state.sh new file mode 100755 index 0000000..a3e9d0d --- /dev/null +++ b/scripts/staffing/attest_pre_identityd_biometric_state.sh @@ -0,0 +1,248 @@ +#!/usr/bin/env bash +# attest_pre_identityd_biometric_state — one-shot defense artifact. +# +# Specification: docs/PHASE_1_6_BIPA_GATES.md §2 (Cryptographic +# attestation that no biometric data exists pre-identityd). +# +# Why this exists: in a BIPA dispute, plaintiffs may argue that the +# EXISTENCE of biometric schema fields constitutes constructive notice +# of intent to collect. The defense: prove that no biometric data was +# actually collected from real candidates before the identity service + +# consent gate (Phase 1.6 Gates 1-3) shipped. +# +# This script produces a defensible record of: +# 1. workers_500k.parquet schema has NO column named photo / biometric_* +# / face_* / image_* +# 2. data/_kb/*.jsonl and data/_pathway_memory/state.json contain NO +# base64 image magic bytes (JPEG /9j/, PNG iVBOR), no data:image/* +# MIME prefixes, and no field-name patterns that imply biometric +# payload (photo, biometric, deepface_*) +# 3. 
data/headshots/manifest.jsonl rows are entirely synthetic — count +# matches the face_pool size, and every row's source is a synthetic +# generator (not a real candidate upload) +# +# Output: +# docs/attestations/BIPA_PRE_IDENTITYD_ATTESTATION_<DATE>.md +# — markdown attestation document with all evidence + a SHA-256 +# hash of the evidence summary. Ready for J + counsel signature. +# +# Exit codes: +# 0 — clean, attestation written, ready for signature +# 1 — evidence FAILED; attestation still written, but marked NOT READY +# FOR SIGNATURE — investigate before counsel review +# 2 — script error (missing tools, unreadable files) + +set -uo pipefail +cd "$(dirname "$0")/../.." + +DATE="${OVERRIDE_DATE:-$(date -u +%Y-%m-%d)}" +OUT_DIR="docs/attestations" +OUT="$OUT_DIR/BIPA_PRE_IDENTITYD_ATTESTATION_${DATE}.md" +mkdir -p "$OUT_DIR" + +WORKERS_PARQUET="${WORKERS_PARQUET:-data/datasets/workers_500k.parquet}" +KB_DIR="${KB_DIR:-data/_kb}" +PATHWAY_STATE="${PATHWAY_STATE:-data/_pathway_memory/state.json}" +HEADSHOTS_MANIFEST="${HEADSHOTS_MANIFEST:-data/headshots/manifest.jsonl}" + +PASS=0 +FAIL=0 +EVIDENCE=$(mktemp) + +note() { echo "$1" >> "$EVIDENCE"; } +mark_pass() { PASS=$((PASS+1)); note " - PASS: $1"; } +mark_fail() { FAIL=$((FAIL+1)); note " - FAIL: $1"; } + +# ── Check 1: workers_500k.parquet schema ──────────────────────────── +note "## Check 1 — workers_500k.parquet schema (no biometric columns)" +note "" +note "**Source:** \`$WORKERS_PARQUET\`" +note "" +if [ ! -r "$WORKERS_PARQUET" ]; then + echo "[attest] FAIL: cannot read $WORKERS_PARQUET" >&2 + rm -f "$EVIDENCE" + exit 2 +fi +SCHEMA=$(python3 -c " +import pyarrow.parquet as pq +schema = pq.read_schema('$WORKERS_PARQUET') +for f in schema: + print(f.name) +" 2>&1) +if [ $? 
-ne 0 ]; then + echo "[attest] FAIL: schema read error: $SCHEMA" >&2 + rm -f "$EVIDENCE" + exit 2 +fi +SCHEMA_HASH=$(echo "$SCHEMA" | sha256sum | awk '{print $1}') +SCHEMA_LINES=$(echo "$SCHEMA" | wc -l) +note "**Schema columns** ($SCHEMA_LINES total):" +note "" +note '```' +note "$SCHEMA" +note '```' +note "" +note "**Schema SHA-256:** \`$SCHEMA_HASH\`" +note "" + +# Forbidden column patterns (case-insensitive) +FORBIDDEN_COLS=$(echo "$SCHEMA" | grep -iE "^(photo|biometric|face|image)([_].*)?$" || true) +if [ -z "$FORBIDDEN_COLS" ]; then + mark_pass "no biometric / photo / face / image column present" +else + mark_fail "forbidden columns present: $FORBIDDEN_COLS" +fi +note "" + +# ── Check 2: KB JSONL + pathway state — no base64 image / MIME ────── +note "## Check 2 — KB + pathway memory contain no biometric payloads" +note "" +note "**Sources scanned:**" +note "- \`$KB_DIR/*.jsonl\` (knowledge base)" +note "- \`$PATHWAY_STATE\` (pathway memory state)" +note "" +SCAN_PATHS=() +if [ -d "$KB_DIR" ]; then + while IFS= read -r f; do SCAN_PATHS+=("$f"); done < <(find "$KB_DIR" -maxdepth 2 -type f -name "*.jsonl") +fi +if [ -r "$PATHWAY_STATE" ]; then + SCAN_PATHS+=("$PATHWAY_STATE") +fi + +# Forbidden patterns: +# data:image/ — explicit MIME embed +# "photo": — bare photo field +# "biometric" — field name +# "deepface_ — deepface output prefix +# /9j/[A-Za-z0-9+/=]{40,} — JPEG base64 magic + length floor (false-positive guard) +# iVBORw0KGgo[A-Za-z0-9+/=]{20,} — PNG base64 magic + length floor +PATTERN_FILE=$(mktemp) +cat > "$PATTERN_FILE" <<'PATTERNS' +data:image/ +"photo"\s*: +"biometric" +"deepface_ +/9j/[A-Za-z0-9+/=]{40,} +iVBORw0KGgo[A-Za-z0-9+/=]{20,} +PATTERNS + +HITS=0 +HIT_DETAIL=$(mktemp) +for path in ${SCAN_PATHS[@]+"${SCAN_PATHS[@]}"}; do + if grep -aHEf "$PATTERN_FILE" "$path" > "$HIT_DETAIL.tmp" 2>/dev/null; then + if [ -s "$HIT_DETAIL.tmp" ]; then + HITS=$((HITS + $(wc -l < "$HIT_DETAIL.tmp"))) + cat "$HIT_DETAIL.tmp" >> "$HIT_DETAIL" + fi + fi +done +rm -f 
"$PATTERN_FILE" "$HIT_DETAIL.tmp" + +note "**Files scanned:** ${#SCAN_PATHS[@]}" +note "**Forbidden-pattern hits:** $HITS" +note "" + +if [ "$HITS" -eq 0 ]; then + mark_pass "no biometric payload patterns found in scanned files" +else + mark_fail "$HITS forbidden-pattern hits — see detail below" + note "" + note "### Detail (first 20 hits)" + note "" + note '```' + head -20 "$HIT_DETAIL" >> "$EVIDENCE" + note '```' +fi +rm -f "$HIT_DETAIL" +note "" + +# ── Check 3: headshots manifest is synthetic-only ─────────────────── +note "## Check 3 — Headshots manifest is synthetic-only" +note "" +note "**Source:** \`$HEADSHOTS_MANIFEST\`" +note "" +if [ ! -r "$HEADSHOTS_MANIFEST" ]; then + note "**SKIP** — manifest not present (no headshot UI deployed)." + note "" + mark_pass "no headshots manifest = no headshot data exists at all" +else + TOTAL_ROWS=$(wc -l < "$HEADSHOTS_MANIFEST") + # A row is non-synthetic if it lacks the synthetic markers (source: tag, + # archetype: tag, deterministic id pattern). The Phase 1.5 walk + # established that the synthetic face pool uses generated portraits + # with archetype tags. Anything else (real candidate upload) would + # be a Phase 1.6 violation. + NON_SYNTHETIC=$(grep -cE '"source"[[:space:]]*:[[:space:]]*"(real|candidate_upload|photo_upload)"' "$HEADSHOTS_MANIFEST" 2>/dev/null) || NON_SYNTHETIC=0 + # Strip any newlines / whitespace defensively in case grep -c returned weirdly. 
+ NON_SYNTHETIC=$(printf '%s' "$NON_SYNTHETIC" | tr -d '[:space:]') + : "${NON_SYNTHETIC:=0}" + note "**Total rows:** $TOTAL_ROWS" + note "**Rows tagged real/candidate_upload/photo_upload:** $NON_SYNTHETIC" + note "" + if [ "$NON_SYNTHETIC" = "0" ]; then + mark_pass "all $TOTAL_ROWS rows are synthetic (no real-candidate uploads)" + else + mark_fail "$NON_SYNTHETIC rows tagged as non-synthetic — investigate" + fi +fi +note "" + +# ── Summary + final hash ──────────────────────────────────────────── +TOTAL=$((PASS + FAIL)) +note "## Summary" +note "" +note "**$PASS / $TOTAL** evidence checks pass." +note "" +if [ "$FAIL" -gt 0 ]; then + note "**Status: NOT READY FOR SIGNATURE** — at least one check failed. Resolve before counsel review." + note "" +fi + +# Compute the evidence hash so any modification to the attestation +# document is detectable post-signature. +EVIDENCE_HASH=$(sha256sum "$EVIDENCE" | awk '{print $1}') + +# ── Render final attestation document ─────────────────────────────── +{ + echo "# BIPA Pre-IdentityD Biometric Attestation" + echo + echo "**Date:** $DATE" + echo "**Spec:** docs/PHASE_1_6_BIPA_GATES.md §2" + echo "**Generator:** scripts/staffing/attest_pre_identityd_biometric_state.sh" + echo + echo "## Purpose" + echo + echo "This is a one-time defense artifact establishing that, as of" + echo "$DATE, no biometric identifiers or biometric information" + echo "from real candidates have been collected, processed, or stored" + echo "by the Lakehouse system. It is intended to be signed by J" + echo "(operator of record) and outside counsel, then anchored to a" + echo "tamper-evident store (filesystem with backups + version control)." + echo + echo "## Evidence" + echo + cat "$EVIDENCE" + echo + echo "---" + echo + echo "## Attestation" + echo + echo "I, the undersigned, attest that the above evidence accurately" + echo "reflects the state of the Lakehouse system as of $DATE." 
+ echo "No biometric identifiers or biometric information from real" + echo "candidates have been collected, processed, or stored prior to" + echo "the deployment of the Phase 1.6 BIPA pre-launch gates." + echo + echo "**Evidence SHA-256:** \`$EVIDENCE_HASH\`" + echo + echo "---" + echo + echo "**Operator (J):** _______________________________ Date: __________" + echo + echo "**Outside counsel:** ___________________________ Date: __________" + echo +} > "$OUT" +rm -f "$EVIDENCE" + +echo "[attest] $PASS / $TOTAL checks pass — attestation: $OUT" +echo "[attest] evidence SHA-256: $EVIDENCE_HASH" +[ "$FAIL" -eq 0 ]
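The tamper-evidence mechanism the script's final section relies on — hash the evidence body and embed the digest in the signed document, so any post-signature edit to the evidence is detectable by recomputation — can be exercised in isolation. A minimal sketch, assuming only `sha256sum` from GNU coreutils; the temp file and evidence strings are illustrative, not repo artifacts.

```shell
#!/usr/bin/env bash
# Sketch of the evidence-hash pattern: any edit to the evidence body
# after the digest was recorded changes its SHA-256.
set -euo pipefail

EVIDENCE=$(mktemp)
printf 'PASS: no biometric columns\nPASS: no payload hits\n' > "$EVIDENCE"

# Digest recorded in the signed document at generation time.
ORIGINAL_HASH=$(sha256sum "$EVIDENCE" | awk '{print $1}')

# Simulate post-signature tampering with the evidence section.
printf 'PASS: extra row quietly added later\n' >> "$EVIDENCE"
CURRENT_HASH=$(sha256sum "$EVIDENCE" | awk '{print $1}')

if [ "$ORIGINAL_HASH" != "$CURRENT_HASH" ]; then
  echo "tamper detected: evidence no longer matches signed digest"
fi
rm -f "$EVIDENCE"
```

This is the same property the `[attest] evidence SHA-256:` line gives counsel: recompute the digest over the Evidence section and compare it to the embedded value.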
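Check 2's base64-magic patterns deserve the same defense-in-depth treatment the TypeScript test gives its regexes: confirm they catch a synthetic positive, so a silent pattern regression cannot make the scan vacuously pass. A standalone sketch assuming only `grep -E` and `printf`; the fabricated JSONL row is illustrative, not real data.

```shell
#!/usr/bin/env bash
# Sketch: the JPEG base64 magic pattern from Check 2, run against a
# fabricated positive ("/9j/" magic followed by >=40 base64 chars).
set -euo pipefail

SAMPLE=$(mktemp)
# Fabricated JSONL row with an embedded base64 JPEG payload.
printf '{"photo_b64":"/9j/%s"}\n' "$(printf 'A%.0s' $(seq 1 50))" > "$SAMPLE"

if grep -qaE '/9j/[A-Za-z0-9+/=]{40,}' "$SAMPLE"; then
  echo "synthetic positive caught"
fi
rm -f "$SAMPLE"
```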