audit phase 1.5: BIPA schema audit + outcomes.jsonl content sample

Two follow-up walks per AUDIT_PHASE_1_DISCOVERY §10/C4 plus the
gemini scrum flag. Read-only; no code changes.

BIPA findings:
- scripts/staffing/tag_face_pool.py uses deepface to extract gender +
  race + age from face images. Output persists to data/headshots/
  manifest.jsonl. For synthetic faces this is fine; for real candidate
  photos this becomes a regulated biometric database (740 ILCS 14/10).
- mcp-server/index.ts:1408 ComfyUI prompt EXPLICITLY embeds protected
  attributes (age + race + gender) into model prompt — system-level
  encoding of protected-attribute features into AI workflow.
- mcp-server/search.html:3375-3432 has hard-coded FEMALE_NAMES /
  MALE_NAMES / NAMES_HISPANIC / SURNAMES_* lookup tables — name-based
  ethnicity inference. Title VII / disparate-impact risk separate
  from BIPA.
- data/headshots/manifest.jsonl is TRACKED IN GIT today (synthetic
  classifications). For real photos, this would be biometric data
  in version control — serious failure.
- No consent flow, no public retention schedule, no deletion
  procedure, no employee training documented. All required by BIPA
  §15(a)/(b) before real-photo intake.

outcomes.jsonl sample:
- 39/101 rows persist candidate names in the fills[*].name field today
- Sample names: "Carmen I. Garcia", "Jamal Z. Jones", "Jacob N. Patel"
  (synthetic but real shape)
- 0 hits for "culture fit", "communication", or other proxy phrases;
  synthetic data doesn't generate them. When real models reason about
  real candidates, they will. Append-only persistence makes
  cryptographic erasure the only RTBF path.

Recommends a new Phase 1.6 (BIPA pre-launch gates) between Phase 1.5
and Phase 2: BIPA_COMPLIANCE_POLICY.md, a consent gate at the upload
endpoint, quarantine of real-photo classifications to data/biometric/,
deprecation of the name->ethnicity lookup tables, and a unit test that
the synthetic manifest stays synthetic. 4-8 hours of design plus one
code commit.

5 open questions for J: where do real photos enter, will the deepface
tagging path stay for real photos, consent UX, retention duration
floor, and designated privacy officer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Phase 1.5 — BIPA Schema Audit + outcomes.jsonl Content Sample
**Status:** Complete — 2026-05-03 · **Companion to:** [`AUDIT_TRAIL_PRD.md`](AUDIT_TRAIL_PRD.md), [`AUDIT_PHASE_1_DISCOVERY.md`](AUDIT_PHASE_1_DISCOVERY.md)
> **Scope.** Two follow-up walks defined in `AUDIT_PHASE_1_DISCOVERY.md` §10/C4 + §10/single-reviewer (gemini "culture fit"): (1) BIPA-specific photo/video/biometric column + code path inventory, (2) actual content sample of `data/_kb/outcomes.jsonl` to confirm whether PII or discrimination-proxy reasoning lands there. Read-only walk; no code changes.
---
## §1 — BIPA walk: photo/video/biometric surface
### 1A — The face pipeline
The Rust legacy ships a synthetic face pool with deepface-extracted demographic classifications. **Even though the FACES are synthetic, the LOGIC is BIPA-relevant** — when real photos arrive (J 2026-05-03: "we will" have them), the same pipeline processes them.
| File:line | What | BIPA relevance |
|---|---|---|
| `scripts/staffing/fetch_face_pool.py` | Fetches 1000 synthetic faces from `thispersondoesnotexist.com` (StyleGAN). | Synthetic — no biometric collection at this stage. |
| `scripts/staffing/tag_face_pool.py:18-31` | Runs `deepface` on each face to extract `gender` (man / woman) + `race` (asian / black / hispanic / indian / middle_eastern / white) + `age`. Writes to `data/headshots/manifest.jsonl`. | **BIPA biometric processing.** Deepface extracts demographic classifications from face geometry. The output IS biometric information per 740 ILCS 14/10. |
| `data/headshots/manifest.jsonl` | 1000 rows, each `{id, file, gender, race, age}`. Sampled rows confirm shape: e.g. `{id: 0, file: "face_0000.jpg", gender: "woman", race: "east_asian", age: 33}`. | The persisted classifications. **For real photos, this file becomes a regulated biometric database.** |
| `mcp-server/index.ts:1308` (`/headshots/:key` route) | Serves face matched by gender × race × age intersection bucketing for a candidate. | Routing logic predicated on biometric-derived demographic attributes. |
| `mcp-server/index.ts:1367` (`/headshots/generate/:key`) | ComfyUI on-demand portrait generation. | Custom-generated image keyed to candidate. |
| `mcp-server/index.ts:1408` (ComfyUI prompt template) | `professional headshot portrait of a ${age}-year-old ${raceText} ${genderText} ${role}, ${scene}, neutral confident expression, sharp focus, photorealistic` | **Protected attributes (age + race + gender) embedded directly into the model prompt.** Even with synthetic output, this is system-level encoding of protected-attribute features into AI workflow. |
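The prompt-template finding suggests a cheap system-level guard: refuse to send any rendered prompt that still carries protected-attribute terms. A minimal sketch in Python (the real template lives in `mcp-server/index.ts`; the term list here is illustrative, not the project's actual vocabulary):

```python
import re

# Illustrative guard, NOT in the codebase: reject image-generation prompts
# that embed protected-attribute terms. The term list is a sample only.
PROTECTED_TERMS = re.compile(
    r"\b(\d{1,3}-year-old|man|woman|asian|black|hispanic|indian|"
    r"middle[_ ]eastern|white)\b",
    re.IGNORECASE,
)

def assert_prompt_is_attribute_free(prompt: str) -> str:
    """Raise if a rendered prompt encodes age, race, or gender terms."""
    hit = PROTECTED_TERMS.search(prompt)
    if hit:
        raise ValueError(f"protected-attribute term in prompt: {hit.group(0)!r}")
    return prompt
```

Note that color words like "black" and "white" make a string-level check over-broad for scene descriptions; a production guard would inspect the template variables before rendering, not the rendered string.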
### 1B — Name-based ethnicity inference (inferred-attribute risk)
`mcp-server/search.html:3375-3432` defines lookup tables that classify candidates by demographic attributes derived from their NAMES (not photos):
| Constant | What |
|---|---|
| `FEMALE_NAMES` (~200 entries) + `MALE_NAMES` | Gender inference from first name |
| `NAMES_HISPANIC`, `NAMES_BLACK`, `NAMES_SOUTH_ASIAN`, `NAMES_EAST_ASIAN`, `NAMES_MIDDLE_EASTERN` | Ethnicity inference from first name |
| `SURNAMES_HISPANIC`, `SURNAMES_SOUTH_ASIAN`, `SURNAMES_EAST_ASIAN`, `SURNAMES_MIDDLE_EASTERN`, `SURNAMES_BLACK` | Ethnicity inference from surname (more diagnostic than first name per the source comment) |
**This is hard-coded inferred-attribute classification.** Per CLAUDE.md, the rationale is "Confidence-default name resolution... so a forklift operator looks like a forklift operator" — i.e., assigning representative photos. But the code path is:
`candidate name → genderFor() / guessEthnicityFromFirstName() → face-pool bucket selection`
A discrimination plaintiff's lawyer would frame this as: "the system EXPLICITLY classifies candidates by inferred ethnicity and uses that classification in its workflow." Whether the downstream use is "just for the photo display" or "ranks candidates" doesn't matter for BIPA-style strict-liability framing — it matters for Title VII, where intent / disparate-impact tests apply.
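If the lookup tables are deprecated (per §1E below), the photo pick can be made attribute-blind. A minimal sketch, assuming an opaque candidate id and the existing 1,000-face pool; the function name is hypothetical:

```python
import hashlib

# Illustrative replacement for the name -> ethnicity -> bucket path:
# pick a placeholder face deterministically from an opaque candidate id.
# No demographic attribute, inferred or otherwise, is consulted.
def placeholder_face(candidate_id: str, pool_size: int = 1000) -> str:
    digest = hashlib.sha256(candidate_id.encode("utf-8")).hexdigest()
    index = int(digest, 16) % pool_size
    return f"face_{index:04d}.jpg"
```

Deterministic hashing keeps the same candidate on the same photo across sessions without any inference step that a plaintiff could point to.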
### 1C — Data directories holding face content
| Path | Count today | Purpose | gitignored? |
|---|---:|---|---|
| `data/headshots/` (face_*.jpg + manifest.jsonl) | 2500 entries | Synthetic face pool + biometric classifications | partial — manifest.jsonl is tracked, face_*.jpg are gitignored |
| `data/headshots_role_pool/` | 7 entries | Role-specific representative faces | gitignored |
| `data/headshots_gen/` | 4 entries | ComfyUI on-demand portrait cache | gitignored |
| `data/icons_pool/` | 2 entries | Certification + role icons | gitignored |
| `data/face_test/` | 1 entry | Likely deepface dev/test scratch | gitignored |
| `data/_imagecache/` | (not enumerated) | Image cache | gitignored |
The `manifest.jsonl` file is **tracked in git** today and contains the demographic-classification table. For synthetic faces this is fine; **for real candidate faces, this would be a biometric database in version control** — a serious BIPA + general-data-protection failure.
### 1D — Biometric-info contract risk when real photos arrive
When the system transitions from synthetic to real candidate photos (J: "we will"), the BIPA exposure becomes immediate:
1. **No written consent flow exists.** BIPA §15(b) requires "informed written consent" before biometric collection. The headshot upload path (wherever it gets built) needs a consent gate.
2. **No public retention schedule exists.** BIPA §15(a) requires a written, publicly-available retention schedule. No such document found.
3. **No deletion procedure exists.** BIPA §15(a) requires destruction "when the initial purpose for collecting or obtaining such identifiers or information has been satisfied or within 3 years of the individual's last interaction with the private entity, whichever occurs first." No such procedure exists.
4. **No employee training documented.** BIPA litigation routinely cites lack of employee training on biometric handling as evidence of negligence.
5. **The deepface library is itself a biometric processor.** Using it triggers the BIPA disclosure obligation regardless of whether the system stores the underlying face image — extracting "race" and "gender" from a face IS biometric processing under modern interpretations.
6. **Per-violation damages.** Negligent violations: $1,000 per violation per person. Intentional/reckless: $5,000 per violation per person. The IL Supreme Court has held each separate scan/extraction can constitute a separate violation (Cothron v. White Castle, 2023). Class actions are routine; recent settlements:
- Facebook (Patel v. Facebook, 2020): $650M
- TikTok (In re TikTok BIPA, 2022): $92M
- Six Flags (Rosenbach v. Six Flags, 2019): set the "no actual harm needed" precedent
- Smaller staffing companies have settled for $1M-$50M ranges.
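Under Cothron-style per-scan accrual, the arithmetic compounds quickly. An illustrative calculation (headcounts and scan counts are hypothetical):

```python
# Illustrative exposure arithmetic; all inputs are hypothetical. Under
# Cothron v. White Castle (2023), each scan can accrue a separate claim,
# so exposure scales with scans per person, not just headcount.
def bipa_exposure(people: int, scans_each: int, per_violation: int) -> int:
    return people * scans_each * per_violation

# 1,000 candidates, one deepface extraction each, negligent tier ($1,000):
floor = bipa_exposure(1_000, 1, 1_000)      # $1,000,000
# Same pool re-tagged on 3 pipeline runs, reckless tier ($5,000):
ceiling = bipa_exposure(1_000, 3, 5_000)    # $15,000,000
```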
### 1E — What this means for the architecture
Even with synthetic data, the architecture has BIPA-shaped surface that is one upload path away from being live. **Recommended pre-real-photo gates:**
1. **Disable deepface tagging on real-photo path** until consent + retention schedule are in place. Synthetic faces can keep using deepface; real faces cannot.
2. **Quarantine `data/headshots/manifest.jsonl` for real photos.** Real-photo classifications must NOT land here. Either move to a dedicated `data/biometric/` tree with separate retention or — preferable — don't extract demographic features from real photos at all.
3. **Replace the name → ethnicity lookup tables with a non-classifying default.** The current logic exists to pick "representative" photos. For real photos, the candidate's own photo IS representative — no inference needed. The lookup tables become dead code.
4. **Upload path needs consent gate.** Whatever endpoint accepts real photos must check a stored consent timestamp + version before storing the file.
5. **Retention schedule documented + auto-enforced.** Crontab / ingestd background sweep that deletes biometric data per the schedule.
These five gates are NOT in `AUDIT_TRAIL_PRD.md` §8 phase plan. **Recommend adding as Phase 1.6 (BIPA pre-launch gates)** ahead of any real-photo intake.
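Gate 4 (the one code commit) could be as small as the following sketch; the record shape, policy-version scheme, and names are assumptions, since the real upload endpoint is not yet built:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Sketch of the upload-time consent gate. All names and the record shape
# are assumptions; the real endpoint does not exist yet.
@dataclass
class ConsentRecord:
    subject_id: str
    policy_version: str       # version of BIPA_COMPLIANCE_POLICY.md signed
    signed_at: datetime

CURRENT_POLICY_VERSION = "2026-05-03"   # assumed identifier

def check_consent(record: Optional[ConsentRecord]) -> None:
    """Reject a real-photo upload unless valid written consent is on file."""
    if record is None:
        raise PermissionError("no biometric consent on file (BIPA §15(b))")
    if record.policy_version != CURRENT_POLICY_VERSION:
        raise PermissionError("consent predates current policy; re-consent required")
```

Storing the policy version alongside the timestamp means a policy revision automatically invalidates stale consent instead of silently grandfathering it.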
---
## §2 — outcomes.jsonl content sample
### 2A — File state
| Property | Value |
|---|---|
| Rows | 101 |
| Path | `data/_kb/outcomes.jsonl` |
| Rows with `name` field in fills | **39** |
| Rows with `first_name` field | 0 (the field is `name`, not `first_name` — names are pre-concatenated in this sink) |
| Sample names in fills (synthetic but representative) | `Alex Rivera`, `Carmen I. Garcia`, `Jacob N. Patel`, `Jamal Z. Jones`, `James Park`, `Maria Chen`, `Mei Y. Cooper`, `Peter E. Carter`, `Sam Torres` |
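The counts above can be reproduced with a short scan of the file. A sketch, assuming the observed `fills[*].name` shape; the helper name is ours:

```python
import json

# Count outcomes.jsonl rows whose fills[] entries carry a name field.
# Path and schema mirror the table above.
def count_name_rows(path: str = "data/_kb/outcomes.jsonl") -> int:
    hits = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            row = json.loads(line)
            fills = row.get("fills") or []
            if any("name" in fill for fill in fills):
                hits += 1
    return hits
```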
### 2B — Confirmed PII shapes in the file today
- `fills[*].name` — full candidate name (synthetic but real shape)
- `operation` — natural-language fill request (e.g. "fill: Welder x2 in Toledo, OH"). Currently doesn't carry candidate names in the operation string itself, BUT the `error` field can — model reasoning that names specific candidates lands here
- `error` — model self-correction text and validator failure messages. Sampled rows contain phrases like "executor parse failure on turn 8: no JSON object in Executor response: ..." which is technical content. Real-data scenarios where the model reasons about specific candidates would put names here.
### 2C — Discrimination-proxy phrase audit (gemini scrum flag §10/single-reviewer)
Searched for common LLM-generated discrimination proxy phrases. **0 hits in current synthetic-data sample**:
- "culture fit" — 0
- "communication" — 0
- "team chemistry" — 0
- "personality" — 0
- "soft skills" — 0
- "work ethic" — 0
- "professional attitude" — 0
**Interpretation:** the current synthetic data doesn't generate these phrases. However:
- The file is structurally capable of holding them (it persists model `error` text)
- When real models (kimi-k2.6, deepseek-v3.2 per PR #13) reason about real candidates, they WILL generate phrases like these — these are common in HR/staffing LLM output
- Once a phrase like "rejected as not a culture fit" lands in outcomes.jsonl, it's append-only persistence. Cryptographic erasure is the only RTBF path.
**Action:** add a content-redaction pass to the outcomes.jsonl writer BEFORE production data lands. Strip discrimination-proxy phrases at write time. (This is not a substitute for the model not GENERATING them — it's a defense-in-depth layer.)
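A redaction-on-write pass could look like the following sketch. The phrase list comes from the audit above; the replacement token is an assumption, and the two broadest terms ("communication", "personality") are deliberately omitted because blanket-redacting common words would mangle legitimate operational text:

```python
import re

# Defense-in-depth sketch for the outcomes.jsonl writer: strip known
# discrimination-proxy phrases at write time. Token is an assumption.
PROXY_PHRASES = [
    "culture fit", "team chemistry", "soft skills",
    "work ethic", "professional attitude",
]
_PROXY_RE = re.compile(
    "|".join(re.escape(p) for p in PROXY_PHRASES), re.IGNORECASE
)

def redact_proxies(text: str) -> str:
    return _PROXY_RE.sub("[PROXY-REDACTED]", text)
```

The writer would call `redact_proxies` on the `operation` and `error` fields before appending the row.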
### 2D — Implications for the audit trail
The outcomes.jsonl sample confirms what `AUDIT_PHASE_1_DISCOVERY.md` §1D predicted: **39 of 101 rows persist candidate names in the `fills[*].name` field today**. The shape is real even if the data is synthetic.
For the Phase 2 identity service design:
1. **The training-safe export pipeline (per J 2026-05-03 answer 11) MUST address outcomes.jsonl directly.** This file is the most-likely candidate for RAG re-indexing or fine-tuning corpus, and it carries names.
2. **Crypto-erasure target list** must include `data/_kb/outcomes.jsonl`, `data/_kb/overseer_corrections.jsonl`, `/tmp/lakehouse-validator/sessions.jsonl`, plus pathway memory state, plus Langfuse storage.
3. **Subject_id top-level promotion (per AUDIT_PHASE_1_DISCOVERY §10/C5)** can replace `fills[*].name` as the queryable field. The name itself can move into a `[REDACTED]` token until legal access dereferences via the identity service.
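Point 3 can be sketched as a pure row transform; the token format and helper name are assumptions, and token-to-name resolution would live in the future identity service:

```python
# Sketch of the §10/C5 promotion: hoist subject ids to the top level and
# replace fills[*].name with an opaque reference token. Token format and
# the name->token mapping source are assumptions.
def tokenize_row(row: dict, name_to_token: dict) -> dict:
    out = dict(row)
    fills = []
    for fill in row.get("fills", []):
        fill = dict(fill)              # copy; leave the input row untouched
        name = fill.pop("name", None)
        if name is not None:
            token = name_to_token[name]
            fill["name"] = f"[REDACTED-{token}]"
            out.setdefault("subject_ids", []).append(token)
        fills.append(fill)
    out["fills"] = fills
    return out
```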
---
## §3 — Synthesis: net new findings for Phase 2 design
| Finding | Source | Phase 2 design implication |
|---|---|---|
| **Phase 1.6 (NEW) — BIPA pre-launch gates needed before real photos accepted** | §1E above | Phase 2 identity service design must include biometric-info table with consent + retention metadata schema. |
| **Name-based ethnicity inference is a separate Title VII / disparate-impact risk** | §1B above | Phase 2 must include a "no protected-attribute inference at routing time" architectural rule. The lookup tables should be deprecated as part of the identity-service migration. |
| **deepface library use itself triggers BIPA when real photos arrive** | §1D point 5 | Phase 2 design must specify "biometric processing requires consent token presence" as a gateway boundary check. |
| **outcomes.jsonl persists names today (39/101 rows)** | §2A above | Phase 2 needs subject_id top-level promotion + name field as `[REDACTED-{token}]` reference (token resolves via identity service for authorized callers). |
| **No discrimination-proxy phrases in current synthetic data, but the sink is shape-ready for them** | §2C above | Phase 2 should specify a redaction-on-write pass for outcomes.jsonl + sessions.jsonl writers. |
| **manifest.jsonl is tracked in git today** | §1C above | If real photos arrive without changing the manifest path, biometric data lands in version control — must be addressed as part of Phase 1.6 gates. |
---
## §4 — Open questions surfaced by Phase 1.5
1. **Where will real photos enter the system?** No upload endpoint walked. Need to know the entry point before designing the consent gate.
2. **Will the deepface tagging path be preserved for real photos, or stripped?** Architectural decision — affects whether biometric extraction is in scope at all.
3. **What's the staffing client's consent UX?** Does the candidate sign consent during onboarding, separately for biometrics, etc.? Affects the consent-token design.
4. **How long do BIPA records need to persist?** BIPA's "3 years from last interaction" is a maximum, not a target — staffing-co's contract may set a shorter floor (e.g., "delete on placement+1 year"). Need staffing-co policy.
5. **Is there a Designated Privacy Officer / DPO?** Per CCPA/GDPR best practice + BIPA litigation hygiene, this role should exist before real-PII intake. Out of scope for code; flag for J.
---
## §5 — Recommended Phase 1.6 BIPA pre-launch gates (NEW)
Lands as new phase BETWEEN Phase 1.5 (this doc) and Phase 2 (identity service design). Read-only/design work; one code commit (the consent gate). Estimated 4-8 hours.
1. **Write `docs/BIPA_COMPLIANCE_POLICY.md`** — public retention schedule, written consent template, deletion procedure, employee training acknowledgment. Counsel must review before publication.
2. **Add consent gate to whatever endpoint accepts real photos** (not yet built — design with the gate from day 1, don't bolt on after).
3. **Quarantine real-photo classifications to `data/biometric/`** — separate path, separate retention, separate access control. Real-photo `manifest.jsonl` does NOT land in `data/headshots/`.
4. **Deprecate name → ethnicity lookup tables in `mcp-server/search.html`** (commit removing those constants + the call sites). Replace with "use candidate's actual photo" path or non-inferred default.
5. **Add a unit test that asserts `data/headshots/manifest.jsonl` row count matches the SYNTHETIC face count** — fails if real-photo classifications leak into the synthetic manifest.
These five fit between Phase 1.5 and Phase 2 because (a) they're MORE URGENT than Phase 2 (BIPA exposure starts the moment a real photo touches deepface), (b) they're independent of the identity service design, (c) they need counsel sign-off before they can ship anyway.
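Gate 5 is a few lines of test; a sketch, assuming the 1,000-row synthetic pool from §1A and the `face_*.jpg` naming convention:

```python
import json

SYNTHETIC_COUNT = 1000   # pool size per §1A

def check_manifest_is_synthetic(path: str = "data/headshots/manifest.jsonl",
                                expected: int = SYNTHETIC_COUNT) -> None:
    """Fail if real-photo classifications leak into the synthetic manifest."""
    with open(path, encoding="utf-8") as fh:
        rows = [json.loads(line) for line in fh if line.strip()]
    assert len(rows) == expected, f"manifest has {len(rows)} rows, expected {expected}"
    for row in rows:
        assert row["file"].startswith("face_"), f"non-synthetic entry: {row['file']}"
```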
---
## Change log
- 2026-05-03 — Phase 1.5 BIPA + outcomes content audit complete. New finding: synthetic data + deepface + name-ethnicity inference creates BIPA-shaped surface that is one upload path away from being live. Recommend Phase 1.6 (BIPA pre-launch gates) before Phase 2 design starts.