Per docs/PHASE_1_6_BIPA_GATES.md, the status table now reflects:
DONE (engineering-only, no counsel dependency):
- Gate 4: name→ethnicity inference removed from mcp-server.
Removal note in search.html:3372 + new Bun absence test
(mcp-server/phase_1_6_gate_4.test.ts) with 3 assertions:
walker actually scans files, regex catches synthetic positives,
no offending DEFINITION patterns in any .html/.ts/.js source.
3/3 pass.
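The three-assertion shape of the absence test generalizes beyond Bun; a minimal shell sketch of the same idea (the pattern string and file layout here are illustrative, not the actual Gate 4 regex or test):

```shell
#!/usr/bin/env bash
# Sketch of an "absence test": prove the scanner works, then prove the
# tree is clean. PATTERN and the file extensions are illustrative only.
set -euo pipefail

PATTERN='ethnicity_from_name'   # hypothetical offending pattern

absence_test() {                # $1 = directory to scan
  local files
  files=$(find "$1" -type f \( -name '*.html' -o -name '*.ts' -o -name '*.js' \))
  # 1. The walker actually scanned files (guards against an empty glob).
  [ -n "$files" ] || { echo "FAIL: walker found no files"; return 1; }
  # 2. The regex catches a synthetic positive (guards against a dead regex).
  echo 'ethnicity_from_name(x)' | grep -qE "$PATTERN" \
    || { echo "FAIL: regex missed synthetic positive"; return 1; }
  # 3. No offending pattern in any real source file.
  if echo "$files" | xargs grep -lE "$PATTERN"; then
    echo "FAIL: offending pattern present"; return 1
  fi
  echo "PASS: 3/3 absence assertions hold"
}
```

Assertion 1 matters as much as assertion 3: without it, an empty scan directory reports a false "clean".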
ENG-DONE, signature pending:
- §2 attestation: scripts/staffing/attest_pre_identityd_biometric_state.sh
runs three checks against the live state:
1. workers_500k.parquet schema has no biometric/photo/face/image column
2. data/_kb/*.jsonl + pathway state contain no base64 image magic
bytes (JPEG /9j/, PNG iVBOR), no data:image/* MIME prefixes,
no field-name patterns ("photo", "biometric", "deepface_*")
3. data/headshots/manifest.jsonl is entirely synthetic-tagged
3/3 evidence checks pass on the live data dir. The script generates
an attestation document for operator + counsel signature, committed at
docs/attestations/BIPA_PRE_IDENTITYD_ATTESTATION_2026-05-03.md
with SHA-256 of the evidence summary so post-signature tampering
is detectable.
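Because the evidence summary is embedded verbatim between the `## Evidence` heading and the first `---` rule (with one blank line on each side), anyone holding the signed document can recompute the hash later. A sketch of that tamper check, assuming exactly that layout:

```shell
#!/usr/bin/env bash
# Recompute the evidence hash from a rendered attestation and compare it
# to the recorded **Evidence SHA-256** line. Assumes the generator's
# layout: "## Evidence", blank, evidence bytes, blank, "---".
set -euo pipefail

verify_attestation() {          # $1 = attestation markdown file
  local recorded recomputed
  recorded=$(grep -oE '[0-9a-f]{64}' <(grep -F '**Evidence SHA-256:**' "$1"))
  # Slice out the lines between the heading and the rule, then drop the
  # one leading and one trailing blank line the generator adds.
  recomputed=$(awk '/^## Evidence$/{f=1; next} f && /^---$/{exit} f' "$1" \
      | sed '1d;$d' | sha256sum | awk '{print $1}')
  if [ "$recorded" = "$recomputed" ]; then
    echo "intact"
  else
    echo "TAMPERED"; return 1
  fi
}
```

Any post-signature edit to the embedded evidence changes the recomputed digest and trips the check.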
ENG-STAGED, awaiting counsel review:
- Gate 1 retention schedule scaffold at
docs/policies/consent/biometric_retention_schedule_v1.md (BIPA
§15(a)). Engineering facts (categories, 18-month operational
ceiling vs 3-year statutory cap, destruction procedure pointer
to Gate 5 runbook) plus ⚖ COUNSEL markers for the binding text.
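One way to keep unsigned scaffolds from slipping through: a pre-signature lint that fails while any ⚖ COUNSEL marker remains. A sketch (the marker string follows the scaffold convention above; which files to check is up to the caller):

```shell
#!/usr/bin/env bash
# Fail if any staged policy document still contains an unresolved
# counsel marker. MARKER matches the scaffold convention described above.
set -euo pipefail
MARKER='⚖ COUNSEL'

counsel_lint() {                # $@ = files to check
  if grep -HnF "$MARKER" "$@"; then
    echo "NOT READY: unresolved counsel markers above" >&2
    return 1
  fi
  echo "clean: no counsel markers remain"
}
```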
- Gate 2 consent template scaffold at
docs/policies/consent/biometric_consent_template_v1.md (BIPA
§15(b)(1)-(3)). Required disclosures + plain-language summary +
withdrawal procedure + the structured fields the consent UI must
post to identityd.
- Gate 5 destruction runbook at docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md.
Triggers, pre-destruction checks (incl. chain-verified gate via
/audit/subject/{id}), procedure (legal-tier endpoint), automatic
audit row append (subject_audit.v1 with kind=biometric_erasure),
backup-window disclosure, monthly reporting cadence, audit-trail
attestation procedure cross-referencing the cross-runtime parity
probe.
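The automatic audit append could look roughly like this; only the `subject_audit.v1` schema name and `kind=biometric_erasure` come from the runbook above — every other field name is an illustrative assumption:

```shell
#!/usr/bin/env bash
# Sketch: append an erasure row to an append-only audit log after a
# destruction run. "subject_id" and "at" are illustrative field names;
# a real implementation would JSON-escape its inputs.
set -euo pipefail

append_erasure_row() {          # $1 = subject id, $2 = audit log path
  printf '{"schema":"subject_audit.v1","kind":"biometric_erasure","subject_id":"%s","at":"%s"}\n' \
    "$1" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$2"
}
```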
BLOCKED on engineering design:
- Gate 3 photo-upload endpoint. Requires identityd photo intake
design + deepface integration scope. Deferred to its own session.
DEFERRED:
- §3 employee training material. Gate 5 runbook §7 may serve as
substrate; counsel decides whether a separate program is needed.
The calendar bottleneck is now counsel review. Engineering can stage no
further deliverables until either (a) Gate 3's design conversation
happens or (b) counsel completes review of items 1/2/5/6.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
249 lines
8.7 KiB
Bash
Executable File
#!/usr/bin/env bash
# attest_pre_identityd_biometric_state — one-shot defense artifact.
#
# Specification: docs/PHASE_1_6_BIPA_GATES.md §2 (Cryptographic
# attestation that no biometric data exists pre-identityd).
#
# Why this exists: in a BIPA dispute, plaintiffs may argue that the
# EXISTENCE of biometric schema fields constitutes constructive notice
# of intent to collect. The defense: prove that no biometric data was
# actually collected from real candidates before the identity service +
# consent gate (Phase 1.6 Gates 1-3) shipped.
#
# This script produces a defensible record of:
#   1. workers_500k.parquet schema has NO column named photo / biometric_*
#      / face_* / image_*
#   2. data/_kb/*.jsonl and data/_pathway_memory/state.json contain NO
#      base64 image magic bytes (JPEG /9j/, PNG iVBOR), no data:image/*
#      MIME prefixes, and no field-name patterns that imply biometric
#      payload (photo, biometric, deepface_*)
#   3. data/headshots/manifest.jsonl rows are entirely synthetic — count
#      matches the face_pool size, and every row's source is a synthetic
#      generator (not a real candidate upload)
#
# Output:
#   docs/attestations/BIPA_PRE_IDENTITYD_ATTESTATION_<DATE>.md
#   — markdown attestation document with all evidence + a SHA-256
#   hash of the evidence summary. Ready for J + counsel signature.
#
# Exit codes:
#   0 — clean, attestation written, ready for signature
#   1 — evidence FAILED, attestation NOT written; investigate before signing
#   2 — script error (missing tools, unreadable files)

set -uo pipefail
cd "$(dirname "$0")/../.."

DATE="${OVERRIDE_DATE:-$(date -u +%Y-%m-%d)}"
OUT_DIR="docs/attestations"
OUT="$OUT_DIR/BIPA_PRE_IDENTITYD_ATTESTATION_${DATE}.md"
mkdir -p "$OUT_DIR"

WORKERS_PARQUET="${WORKERS_PARQUET:-data/datasets/workers_500k.parquet}"
KB_DIR="${KB_DIR:-data/_kb}"
PATHWAY_STATE="${PATHWAY_STATE:-data/_pathway_memory/state.json}"
HEADSHOTS_MANIFEST="${HEADSHOTS_MANIFEST:-data/headshots/manifest.jsonl}"

PASS=0
FAIL=0
EVIDENCE=$(mktemp)

note() { echo "$1" >> "$EVIDENCE"; }
mark_pass() { PASS=$((PASS+1)); note " - PASS: $1"; }
mark_fail() { FAIL=$((FAIL+1)); note " - FAIL: $1"; }

# ── Check 1: workers_500k.parquet schema ────────────────────────────
note "## Check 1 — workers_500k.parquet schema (no biometric columns)"
note ""
note "**Source:** \`$WORKERS_PARQUET\`"
note ""
if [ ! -r "$WORKERS_PARQUET" ]; then
  echo "[attest] FAIL: cannot read $WORKERS_PARQUET" >&2
  rm -f "$EVIDENCE"
  exit 2
fi
SCHEMA=$(python3 -c "
import sys, pyarrow.parquet as pq
schema = pq.read_schema('$WORKERS_PARQUET')
for f in schema:
    print(f.name)
" 2>&1)
if [ $? -ne 0 ]; then
  echo "[attest] FAIL: schema read error: $SCHEMA" >&2
  rm -f "$EVIDENCE"
  exit 2
fi
SCHEMA_HASH=$(echo "$SCHEMA" | sha256sum | awk '{print $1}')
SCHEMA_LINES=$(echo "$SCHEMA" | wc -l)
note "**Schema columns** ($SCHEMA_LINES total):"
note ""
note '```'
note "$SCHEMA"
note '```'
note ""
note "**Schema SHA-256:** \`$SCHEMA_HASH\`"
note ""

# Forbidden column patterns (case-insensitive)
FORBIDDEN_COLS=$(echo "$SCHEMA" | grep -iE "^(photo|biometric|face|image)([_].*)?$" || true)
if [ -z "$FORBIDDEN_COLS" ]; then
  mark_pass "no biometric / photo / face / image column present"
else
  mark_fail "forbidden columns present: $FORBIDDEN_COLS"
fi
note ""

# ── Check 2: KB JSONL + pathway state — no base64 image / MIME ──────
note "## Check 2 — KB + pathway memory contain no biometric payloads"
note ""
note "**Sources scanned:**"
note "- \`$KB_DIR/*.jsonl\` (knowledge base)"
note "- \`$PATHWAY_STATE\` (pathway memory state)"
note ""
SCAN_PATHS=()
if [ -d "$KB_DIR" ]; then
  while IFS= read -r f; do SCAN_PATHS+=("$f"); done < <(find "$KB_DIR" -maxdepth 2 -type f -name "*.jsonl")
fi
if [ -r "$PATHWAY_STATE" ]; then
  SCAN_PATHS+=("$PATHWAY_STATE")
fi

# Forbidden patterns:
#   data:image/                     — explicit MIME embed
#   "photo":                        — bare photo field
#   "biometric"                     — field name
#   "deepface_                      — deepface output prefix
#   /9j/[A-Za-z0-9+/]{40,}          — JPEG base64 magic + length floor (false-positive guard)
#   iVBORw0KGgo[A-Za-z0-9+/]{20,}   — PNG base64 magic + length floor
PATTERN_FILE=$(mktemp)
cat > "$PATTERN_FILE" <<'PATTERNS'
data:image/
"photo"\s*:
"biometric"
"deepface_
/9j/[A-Za-z0-9+/=]{40,}
iVBORw0KGgo[A-Za-z0-9+/=]{20,}
PATTERNS

HITS=0
HIT_DETAIL=$(mktemp)
for path in "${SCAN_PATHS[@]}"; do
  if grep -aHEf "$PATTERN_FILE" "$path" > "$HIT_DETAIL.tmp" 2>/dev/null; then
    if [ -s "$HIT_DETAIL.tmp" ]; then
      HITS=$((HITS + $(wc -l < "$HIT_DETAIL.tmp")))
      cat "$HIT_DETAIL.tmp" >> "$HIT_DETAIL"
    fi
  fi
done
rm -f "$PATTERN_FILE" "$HIT_DETAIL.tmp"

note "**Files scanned:** ${#SCAN_PATHS[@]}"
note "**Forbidden-pattern hits:** $HITS"
note ""

if [ "$HITS" -eq 0 ]; then
  mark_pass "no biometric payload patterns found in scanned files"
else
  mark_fail "$HITS forbidden-pattern hits — see detail below"
  note ""
  note "### Detail (first 20 hits)"
  note ""
  note '```'
  head -20 "$HIT_DETAIL" >> "$EVIDENCE"
  note '```'
fi
rm -f "$HIT_DETAIL"
note ""

# ── Check 3: headshots manifest is synthetic-only ───────────────────
note "## Check 3 — Headshots manifest is synthetic-only"
note ""
note "**Source:** \`$HEADSHOTS_MANIFEST\`"
note ""
if [ ! -r "$HEADSHOTS_MANIFEST" ]; then
  note "**SKIP** — manifest not present (no headshot UI deployed)."
  note ""
  mark_pass "no headshots manifest = no headshot data exists at all"
else
  TOTAL_ROWS=$(wc -l < "$HEADSHOTS_MANIFEST")
  # A row is non-synthetic if it lacks the synthetic markers (source: tag,
  # archetype: tag, deterministic id pattern). The Phase 1.5 walk
  # established that the synthetic face pool uses generated portraits
  # with archetype tags. Anything else (real candidate upload) would
  # be a Phase 1.6 violation.
  NON_SYNTHETIC=$(grep -cE '"source"[[:space:]]*:[[:space:]]*"(real|candidate_upload|photo_upload)"' "$HEADSHOTS_MANIFEST" 2>/dev/null) || NON_SYNTHETIC=0
  # Strip any newlines / whitespace defensively in case grep -c returned weirdly.
  NON_SYNTHETIC=$(printf '%s' "$NON_SYNTHETIC" | tr -d '[:space:]')
  : "${NON_SYNTHETIC:=0}"
  note "**Total rows:** $TOTAL_ROWS"
  note "**Rows tagged real/candidate_upload/photo_upload:** $NON_SYNTHETIC"
  note ""
  if [ "$NON_SYNTHETIC" = "0" ]; then
    mark_pass "all $TOTAL_ROWS rows are synthetic (no real-candidate uploads)"
  else
    mark_fail "$NON_SYNTHETIC rows tagged as non-synthetic — investigate"
  fi
fi
note ""

# ── Summary + final hash ────────────────────────────────────────────
TOTAL=$((PASS + FAIL))
note "## Summary"
note ""
note "**$PASS / $TOTAL** evidence checks pass."
note ""
if [ "$FAIL" -gt 0 ]; then
  note "**Status: NOT READY FOR SIGNATURE** — at least one check failed. Resolve before counsel review."
  note ""
fi

# Compute the evidence hash so any modification to the attestation
# document is detectable post-signature.
EVIDENCE_HASH=$(sha256sum "$EVIDENCE" | awk '{print $1}')

# ── Render final attestation document ───────────────────────────────
{
  echo "# BIPA Pre-IdentityD Biometric Attestation"
  echo
  echo "**Date:** $DATE"
  echo "**Spec:** docs/PHASE_1_6_BIPA_GATES.md §2"
  echo "**Generator:** scripts/staffing/attest_pre_identityd_biometric_state.sh"
  echo
  echo "## Purpose"
  echo
  echo "This is a one-time defense artifact establishing that, as of"
  echo "$DATE, no biometric identifiers or biometric information"
  echo "from real candidates have been collected, processed, or stored"
  echo "by the Lakehouse system. It is intended to be signed by J"
  echo "(operator of record) and outside counsel, then anchored to a"
  echo "tamper-evident store (filesystem with backups + version control)."
  echo
  echo "## Evidence"
  echo
  cat "$EVIDENCE"
  echo
  echo "---"
  echo
  echo "## Attestation"
  echo
  echo "I, the undersigned, attest that the above evidence accurately"
  echo "reflects the state of the Lakehouse system as of $DATE."
  echo "No biometric identifiers or biometric information from real"
  echo "candidates have been collected, processed, or stored prior to"
  echo "the deployment of the Phase 1.6 BIPA pre-launch gates."
  echo
  echo "**Evidence SHA-256:** \`$EVIDENCE_HASH\`"
  echo
  echo "---"
  echo
  echo "**Operator (J):** _______________________________ Date: __________"
  echo
  echo "**Outside counsel:** ___________________________ Date: __________"
  echo
} > "$OUT"
rm -f "$EVIDENCE"

echo "[attest] $PASS / $TOTAL checks pass — attestation: $OUT"
echo "[attest] evidence SHA-256: $EVIDENCE_HASH"
[ "$FAIL" -eq 0 ]
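A usage sketch for the attestation script: `OVERRIDE_DATE` and the four path variables are the overrides the script itself reads, while the `staging/` paths are illustrative.

```shell
# Run against a staging copy of the data dir, pinning the date so the
# output filename is reproducible; exit status 0 means ready to sign.
OVERRIDE_DATE=2026-05-03 \
WORKERS_PARQUET=staging/workers_500k.parquet \
KB_DIR=staging/_kb \
PATHWAY_STATE=staging/_pathway_memory/state.json \
HEADSHOTS_MANIFEST=staging/headshots/manifest.jsonl \
  scripts/staffing/attest_pre_identityd_biometric_state.sh \
  && echo "ready for signature" \
  || echo "do not sign — investigate"
```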