lakehouse

Author	SHA1	Message	Date
root	b2c34b80b3	phase 1.6: lock Gate 3b = C, reconcile docs to shipped state, fix double-upload file leak Four threads landing together — all driven by the audit J asked for before production cutover. (1) Gate 3b DECIDED: Option C (defer classifications). `BiometricCollection.classifications` stays `Option<JSON> = None` in v1. `docs/specs/GATE_3B_DEEPFACE_DESIGN.md` status flipped from "draft / awaits product" to DECIDED. Consent template + retention schedule revised to remove all "automated facial-classification" / "deepface" language so disclosed scope matches implemented scope. (2) Endpoint-path drift reconciled across 3 docs. `PHASE_1_6_BIPA_GATES.md`, `BIPA_DESTRUCTION_RUNBOOK.md`, and `biometric_retention_schedule_v1.md` had references to legacy `/v1/identity/subjects/` paths (proposed under a separate identityd daemon, never shipped) — corrected to actual shipped routes `/biometric/subject/` (catalogd-local). Schema block in PHASE_1_6_BIPA_GATES rewritten to reflect JSON `SubjectManifest.biometric_collection` substrate (not the proposed Postgres `subjects` table). (3) New operational artifacts: - `scripts/staffing/verify_biometric_erasure.sh` — checks 4 things post-erasure (manifest cleared, uploads dir empty, audit row matches, chain verified). Smoke-tested live against WORKER-2. - `scripts/staffing/biometric_destruction_report.sh` — monthly anonymized destruction-event aggregation. Smoke-tested clean. - `scripts/staffing/bundle_counsel_packet.sh` — tarballs the counsel-review packet with per-file SHA-256 manifest. - `docs/runbooks/LEGAL_AUDIT_KEY_ROTATION.md` — formal rotation procedure operationalized after the 2026-05-05 /tmp wipe incident. - `docs/counsel/COUNSEL_REVIEW_PACKET_2026-05-05.md` — cover note bundling all eng-staged BIPA docs for counsel review with per-doc questions, sign-off checklist, recommended review sequence. (4) Double-upload file leak fixed in `crates/catalogd/src/biometric_endpoint.rs`. 
`verify_biometric_erasure.sh` smoked WORKER-2 and surfaced a stranded photo file. Investigation showed the file was 13-byte test-fixture bytes (zero PII, no biometric content); audit timeline showed two consecutive uploads followed by one erasure — the second upload had silently overwritten manifest.data_path, orphaning the first file. Patched `process_upload` to refuse a second upload with HTTP 409 + `error: "biometric_already_collected"` when `biometric_collection.is_some()` on the manifest. Operator must explicitly POST `/biometric/subject/{id}/erase` first. Tests: new `second_upload_without_erase_returns_409` (asserts 409 + manifest pointer unchanged + first file untouched on disk). Replaced `repeated_uploads_grow_the_chain` with `upload_erase_upload_grows_the_chain_cleanly` (covers the legitimate re-collection cycle: chain grows to 3 rows). Updated `content_type_with_parameters_accepted` to use 2 distinct subjects (was using 1 subject with 2 uploads to test ct parsing — would now 409). 22/22 biometric_endpoint tests + 59/59 catalogd lib tests green post-patch. Production posture: gateway needs `cargo build --release -p gateway` + `systemctl restart lakehouse.service` to pick up the new 409 in live traffic. Counsel calendar is now the only remaining blocker for first real-photo intake. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 06:19:40 -05:00
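The double-upload guard described in this commit can be illustrated with a small state sketch. This is Python rather than the Rust handler in `crates/catalogd/src/biometric_endpoint.rs`, and everything except the 409 status, the `biometric_already_collected` error string, and the `biometric_collection` / `data_path` field names is an assumption (including the 200 success status and the function signatures).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SubjectManifest:
    # None means no biometric data is currently held for this subject.
    biometric_collection: Optional[dict] = None


def process_upload(manifest: SubjectManifest, data_path: str) -> Tuple[int, dict]:
    """Refuse a second upload instead of silently overwriting data_path."""
    if manifest.biometric_collection is not None:
        return 409, {"error": "biometric_already_collected"}
    manifest.biometric_collection = {"data_path": data_path}
    return 200, {"data_path": data_path}  # success status is an assumption


def erase(manifest: SubjectManifest) -> Tuple[int, dict]:
    """Clear the collection so a legitimate re-collection can follow."""
    manifest.biometric_collection = None
    return 200, {}
```

Under these assumptions an upload, a rejected second upload, and an upload/erase/upload cycle behave like the cases the new tests describe: the rejected upload leaves the manifest pointer unchanged.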
root	c7aa607ae4	phase 1.6 BIPA: scrum-driven fixes Per 2026-05-03 phase_1_6_bipa_gates scrum (13 findings, 0 convergent). 1 BLOCK verified false positive, 4 real fixes shipped: False positive (verified): - opus BLOCK on attest:55 — claimed `set -uo pipefail` without `-e` makes the post-python3 `if [ $? -ne 0 ]` check unreachable. Verified WRONG: `X=$(false); echo $?` prints 1. Bash propagates command- substitution exit through $? on the assignment line. The check IS the python3 exit gate. Inline comment added to the script noting the false positive so future scrums don't re-flag. Real fixes: 1. opus WARN attestation:18 — schema fingerprint hashed names ONLY, missing column-type changes. A column repurposed to hold base64 photo bytes under its existing name would pass undetected. Now hashes "name<TAB>type<TAB>nullable=bool" per row. Re-run produced evidence SHA-256 1fdcc9f1... (vs old 230fffeb..., reflecting the broader fingerprint scope). 2. opus WARN gate_4_test:60 — definition regex didn't catch object-literal property forms (`const t = { FEMALE_NAMES: [...] }`) or TypeScript class fields (`class L { public NAMES_X: string[] = [] }`). Added two new patterns + a regression test (Gate 4: object-literal and class-field bypasses are caught) that exercises 5 bypass forms. 4/4 tests green; 1 minor regex tweak needed mid-fix to handle single-line class bodies. 3. kimi WARN python3-reliance — script assumed pyarrow installed and would emit a stack trace into the attestation if not. Added `python3 -c "import pyarrow"` gate at top with clean install instructions on failure. 4. opus INFO PHASE_1_6:200 — item 7 (training) silently dropped from blocking set with bare "deferred" rationale. Now explicitly states the deferral is conditional on small operator population (J + 1-2 named ops); item 7 re-promotes to blocking if population grows. ⚖ COUNSEL marker added. 
Skipped (acceptable as ⚖ COUNSEL placeholders by design): - kimi WARN consent template:30-day-SLA (counsel decides number) - kimi WARN consent template:email-placeholder (counsel supplies) - kimi WARN parquet absence (env override exists; redeployment-aware) - kimi INFO runbook manual-erasure (marked TODO when /erase ships) - qwen INFO doc path/status nits (already addressed by file moves) Tests: 4/4 Gate 4 absence test (incl. new bypass-coverage), 3/3 attestation evidence checks pass on live data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 04:43:17 -05:00
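The schema-fingerprint fix in this commit (hashing name, type, and nullability per column instead of names only) could look roughly like the sketch below. It takes plain tuples rather than a pyarrow schema, and the row separator and the lowercase rendering of the boolean are assumptions; only the `name<TAB>type<TAB>nullable=bool` row shape and the use of SHA-256 come from the commit message.

```python
import hashlib
from typing import Iterable, Tuple

Column = Tuple[str, str, bool]  # (name, type, nullable)


def schema_fingerprint(columns: Iterable[Column]) -> str:
    """SHA-256 over one 'name<TAB>type<TAB>nullable=bool' row per column."""
    rows = [
        f"{name}\t{dtype}\tnullable={str(nullable).lower()}"
        for name, dtype, nullable in columns
    ]
    return hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()


def names_only_fingerprint(columns: Iterable[Column]) -> str:
    """The older, narrower scope: column names only."""
    names = [name for name, _, _ in columns]
    return hashlib.sha256("\n".join(names).encode("utf-8")).hexdigest()
```

With this shape, a column that keeps its name but changes type (for example from `int64` to `binary`) changes the fingerprint, whereas the names-only hash would not notice.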
root	4708717f6b	phase 1.6 BIPA gates — engineering wave (4 of 7 staged) Per docs/PHASE_1_6_BIPA_GATES.md. Status table now reflects: DONE (engineering-only, no counsel dependency): - Gate 4: name→ethnicity inference removed from mcp-server. Removal note in search.html:3372 + new Bun absence test (mcp-server/phase_1_6_gate_4.test.ts) with 3 assertions: walker actually scans files, regex catches synthetic positives, no offending DEFINITION patterns in any .html/.ts/.js source. 3/3 pass. ENG-DONE, signature pending: - §2 attestation: scripts/staffing/attest_pre_identityd_biometric_state.sh runs three checks against the live state: 1. workers_500k.parquet schema has no biometric/photo/face/image col 2. data/_kb/.jsonl + pathway state contain no base64 image magic bytes (JPEG /9j/, PNG iVBOR), no data:image/ MIME prefixes, no field-name patterns ("photo", "biometric", "deepface_*") 3. data/headshots/manifest.jsonl is entirely synthetic-tagged 3/3 evidence checks pass on the live data dir. Generates a signed-by-operator+counsel attestation document committed at docs/attestations/BIPA_PRE_IDENTITYD_ATTESTATION_2026-05-03.md with SHA-256 of the evidence summary so post-signature tampering is detectable. ENG-STAGED, awaiting counsel review: - Gate 1 retention schedule scaffold at docs/policies/consent/biometric_retention_schedule_v1.md (BIPA §15(a)). Engineering facts (categories, 18-month operational ceiling vs 3-year statutory cap, destruction procedure pointer to Gate 5 runbook) plus ⚖ COUNSEL markers for the binding text. - Gate 2 consent template scaffold at docs/policies/consent/biometric_consent_template_v1.md (BIPA §15(b)(1)-(3)). Required disclosures + plain-language summary + withdrawal procedure + the structured fields the consent UI must post to identityd. - Gate 5 destruction runbook at docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md. Triggers, pre-destruction checks (incl. 
chain-verified gate via /audit/subject/{id}), procedure (legal-tier endpoint), automatic audit row append (subject_audit.v1 with kind=biometric_erasure), backup-window disclosure, monthly reporting cadence, audit-trail attestation procedure cross-referencing the cross-runtime parity probe. BLOCKED on engineering design: - Gate 3 photo-upload endpoint. Requires identityd photo intake design + deepface integration scope. Deferred to its own session. DEFERRED: - §3 employee training material. Gate 5 runbook §7 may serve as substrate; counsel decides whether a separate program is needed. Calendar bottleneck is now counsel review. Engineering can stage no further deliverables until either (a) Gate 3's design conversation happens or (b) counsel completes review of items 1/2/5/6. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 04:38:49 -05:00
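The second attestation check in this commit (scanning JSONL state for embedded image data) can be sketched as a line scanner. The marker strings `/9j/`, `iVBOR`, `data:image/` and the field-name patterns come from the commit message; the regex forms, function name, and return shape are assumptions, and the real check lives in `attest_pre_identityd_biometric_state.sh`.

```python
import re
from typing import Iterable, List, Tuple

# Label -> pattern. "/9j/" is how base64-encoded JPEG data begins,
# "iVBOR" is how base64-encoded PNG data begins.
MARKERS = {
    "jpeg_base64": re.compile(r"/9j/"),
    "png_base64": re.compile(r"iVBOR"),
    "data_uri": re.compile(r"data:image/"),
    "field_name": re.compile(r'"(photo|biometric|deepface_\w*)"\s*:'),
}


def scan_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, marker label) for every hit."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for label, pattern in MARKERS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits
```

A short marker such as `/9j/` can also occur in unrelated text, so a scanner like this is best treated as a tripwire whose hits get reviewed by hand rather than as proof of image content.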
root	4b92d1da91	demo: icon recipe pipeline + role-aware portraits + ComfyUI negative-prompt override Adds two single-source-of-truth recipe files that drive both the hot-path render server and the offline pre-render scripts: - role_scenes.ts: per-role-band scene clauses (clothing + backdrop). Forklift operators look like forklift operators instead of collapsing to interchangeable studio shots. SCENES_VERSION mixes into the headshot cache key so a coordinator tweak refreshes every matching face on next view. - icon_recipes.ts: cert / role-prop / status / hazard / empty icons with deterministic per-recipe seeds + fuzzy text resolver. ICONS_VERSION suffix on the cached file means edits don't overwrite in place — misfires are recoverable. Routes (mcp-server/index.ts): - GET /headshots/_scenes — exposes SCENES + version to the pre-render script so prompts don't drift between batch and hot-path. - GET /icons/_recipes — same idea for icons. - GET /icons/cert?text=... — resolves free-text cert names to a recipe and 302s to the rendered icon. 404 (not 500) when no recipe matches so the front-end can hang `onerror="this.remove()"`. - GET /icons/render/{category}/{slug} — cache-or-render at 256² (8 steps) for crisper edges than 512² when downsampled to 14px. ComfyUI portrait support (scripts/serve_imagegen.py): The editorial workflow had `human, person, face` baked into its negative prompt — actively sabotaging portraits. _comfyui_generate now accepts negative_prompt/cfg/sampler/scheduler overrides, and those mix into the cache key so portrait calls don't collapse into hero-shot cache hits. scripts/staffing/render_role_pool.py: pre-renders the role-aware face pool by reading SCENES from /headshots/_scenes — single source of truth verified at run time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 06:01:04 -05:00
root	1745881426	staffing: face pool fetch preserves prior tags + --shrink gate + atomic manifest write fetch_face_pool was wiping 952 hand-classified rows when re-run from a Python without deepface installed (it reset every gender to None). Now: - Loads existing manifest by id and overlays only fetch-owned fields, so gender/race/age/excluded survive a refetch. - deepface pass tags only records that don't already have a gender; deepface unavailable means "leave existing tags alone" not "reset". - New --shrink flag required to drop ids >= --count. Default refuses to shrink the pool silently. - Atomic write via tmp + os.replace so an interrupted run can't corrupt the manifest. - Dedupes duplicate id lines (root cause of the 2497-row manifest backing a 1000-face pool). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 06:01:04 -05:00
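The refetch behavior described in this commit (overlay only fetch-owned fields, dedupe by id, write atomically via a temp file and `os.replace`) can be sketched as follows. The `FETCH_OWNED` field list and both function names are hypothetical; the commit only states that gender/race/age/excluded must survive a refetch.

```python
import json
import os
import tempfile
from typing import Dict, List

# Hypothetical: the fields the fetch step is allowed to overwrite.
FETCH_OWNED = ("id", "file")


def merge_manifest(existing: List[dict], fetched: List[dict]) -> List[dict]:
    """Overlay fetch-owned fields onto existing rows, keyed by id."""
    by_id: Dict[int, dict] = {}
    for row in existing:
        by_id[row["id"]] = dict(row)  # duplicate id lines collapse to one row
    for row in fetched:
        merged = by_id.get(row["id"], {})
        for key in FETCH_OWNED:
            if key in row:
                merged[key] = row[key]
        by_id[row["id"]] = merged
    return [by_id[i] for i in sorted(by_id)]


def write_manifest_atomic(path: str, rows: List[dict]) -> None:
    """Write JSONL to a temp file in the same directory, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic when tmp and path share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Creating the temp file in the destination directory keeps both paths on one filesystem, which is what lets `os.replace` swap the manifest in a single step so an interrupted run leaves the old file intact.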
root	a3b65f314e	Synthetic face pool — 1000 StyleGAN headshots, ComfyUI hot-swap, 60x smaller thumbs Worker cards now ship a real photo per person instead of monogram tiles: - fetch_face_pool.py pulls 1000 faces from thispersondoesnotexist.com - tag_face_pool.py runs deepface for gender/race/age, excludes <22yo - manifest.jsonl: 952 servable, gender/race buckets populated - /headshots/_thumbs/ pre-resized to 384px webp (587KB -> 11KB, 60x smaller; without this Chrome's parallel-connection budget drops ~75% of tiles in a 40-card grid) - /headshots/:key gender x race x age intersection bucketing with gender-only fallback when intersection is sparse - /headshots/generate/:key ComfyUI on-demand for the contractor profile spotlight (cold ~1.5s, cached ~1ms; worker-derived djb2 seed makes faces deterministic-per-worker but unique across workers sharing the same prompt) - serve_imagegen.py _cache_key() now includes seed (was caching by prompt only -> 3 different worker seeds collapsed to 1 cached image; verified fix produces 3 distinct md5s) - confidence-default name resolution: Xavier->man+hispanic, Aisha->woman+black, etc. Every worker resolves to a bucket. End-to-end: playwright run on /?q=forklift+operators+IL -> 21/21 cards loaded, 0 broken, all 384px webp. Cache + binary pool gitignored; manifest tracked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 06:01:04 -05:00
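The `_cache_key()` fix mentioned in this commit (keying on prompt plus seed instead of prompt only) might be sketched like this. The hash function, the JSON encoding, and the parameter names are assumptions; the actual `serve_imagegen.py` implementation is not shown in the log.

```python
import hashlib
import json


def cache_key(prompt: str, seed: int, **overrides) -> str:
    """Derive a cache key from every input that changes the rendered image."""
    payload = {"prompt": prompt, "seed": seed, "overrides": overrides}
    # sort_keys makes the encoding independent of keyword order.
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def prompt_only_key(prompt: str) -> str:
    """The failure mode: different seeds collapse to one cache entry."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()
```

Under this sketch three workers sharing one prompt but carrying different seeds get three distinct keys, and any extra keyword argument (such as a negative prompt) also separates cache entries.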
root	10ed3bc630	demo: real synthetic headshots — fetch pool + serve route + UI wire Three layers shipped: 1. SCRIPT — scripts/staffing/fetch_face_pool.py Pulls N synthetic StyleGAN faces from thispersondoesnotexist.com into data/headshots/face_NNNN.jpg, writes manifest.jsonl. Idempotent: re-running skips existing files. Optional gender tagging via deepface (currently unavailable on this box; the script handles ImportError gracefully and tags everything as untagged). Fetched 198 faces with concurrency=3 in ~67s. 2. SERVER — /headshots/:key route in mcp-server/index.ts Loads manifest at first hit, caches in globalThis._faces. Hashes the key with djb2-style mixing → pool index → returns the JPG. Same key always gets the same face (deterministic). Accepts ?g=man\|woman&e=caucasian\|black\|hispanic\|south_asian\|east_asian\|middle_eastern to bias pool selection — the gender/ethnicity buckets fall back to the full pool when no tagged matches exist. Cache-Control: 86400 immutable so faces ride the browser cache after first hit. /headshots/__reload re-reads the manifest without restart. 3. UI — search.html + console.html worker cards Re-added overlay <img> on top of the monogram .av circle. img.src = /headshots/<encoded-key>?g=<hint>&e=<hint>. img.onerror removes the failed image so the monogram stays visible if the face pool isn't fetched / CDN is blocked. .av now has overflow:hidden + position:relative to clip the img to a perfect circle. Forced-confident name resolution (J: "we're CREATING the profile, created as though you truly have the information Xavier is more likely Hispanic and he's a male"): genderFor(name) — looks up MALE_NAMES + FEMALE_NAMES, falls back to a deterministic hash split so unknown names spread ~50/50. 
Sets now include cross-cultural names: Alejandro/ Andres/Mateo/Santiago/Joaquin/Cesar/Hugo/ Felipe/Gerardo/Salvador/Ramon (Hispanic), Raj/Anil/Vikram/Krishna/Pradeep (South Asian), Wei/Yi/Hiroshi/Akira/Hyun (East Asian), Demetrius/Kareem/DaQuan/Khalil (Black), Omar/Khalid/Hassan/Ahmed/Bilal (Middle Eastern). FEMALE_NAMES extended in parallel. guessEthnicityFromFirstName(name) — confident default of 'caucasian' for any name not in the cultural buckets so every worker resolves to a category the face pool can be biased toward. Order: ME → Black → Hispanic → South Asian → East Asian → Caucasian (matters where names overlap, e.g. Aisha appears in ME + Black, biases toward ME for visual fit). Both helpers also ported into console.html so the triage backfills and try-it-yourself rendering get the same hint stack. Privacy note in the script + route comments: the synthetic data uses the worker's name as the seed; production should hash worker_id (not name) to avoid leaking PII to a third-party CDN. The fetch URL itself is referenced once per pool build, not per-worker. .gitignore — added data/headshots/face_*.jpg (~100MB for 198 faces; the manifest + script are tracked). Re-running the script on a fresh checkout rebuilds the pool from scratch. Verified end-to-end via playwright on devop.live/lakehouse: forklift query → 10 worker cards 10/10 with face images (real synthetic headshots, not monograms) 0/10 broken Alejandro G. Nelson → ?g=man&e=hispanic Patricia K. Garcia → ?g=woman&e=caucasian Each name → unique face, deterministic across loads. Console triage backfills get the same treatment.	2026-04-28 06:01:04 -05:00
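The deterministic key-to-face mapping in this commit can be sketched as a hash followed by a modulo into the (optionally filtered) pool. The commit says only "djb2-style mixing", so the classic djb2 constants below are an assumption, as are the function names and the `gender` / `race` record fields; the actual route is TypeScript in `mcp-server/index.ts`.

```python
from typing import List, Optional


def djb2(key: str) -> int:
    """Classic djb2 string hash, kept to 32 bits."""
    h = 5381
    for byte in key.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def pick_face(
    key: str,
    pool: List[dict],
    gender: Optional[str] = None,
    race: Optional[str] = None,
) -> dict:
    """Same key always maps to the same face; hints only narrow the pool."""
    tagged = [
        face
        for face in pool
        if (gender is None or face.get("gender") == gender)
        and (race is None or face.get("race") == race)
    ]
    candidates = tagged or pool  # fall back to the full pool when nothing matches
    return candidates[djb2(key) % len(candidates)]
```

Because the index depends on the size of the candidate list, the same key can map to a different face after the pool is rebuilt or retagged; determinism here holds only for a fixed manifest.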
root	c3c9c2174a	staffing: B+C — safe views (candidates/workers/jobs) + workers_500k_v9 build script Decision B from reports/staffing/synthetic-data-gap-report.md §7 (plus C: client_workerskjkk.parquet typo file removed from data/datasets/ — was never tracked, no git effect). PII enforcement was UNVERIFIED in workers_500k_v8 (the corpus staffing_inference mode embeds chunks from). Verified 2026-04-27 by inspecting data/vectors/meta/workers_500k_v8.json — `source: "workers_500k"` confirms v8 was built directly from the raw table, so the LLM has been seeing names / emails / phones / resume_text for every staffing query. This commit closes the boundary at the catalog metadata layer: candidates_safe (overhauled — was failing SQL invalid 434×/day on a nonexistent `vertical` column reference, copy-pasted from job_orders): drops last_name, email, phone, hourly_rate_usd candidate_id masked (keep first 3, last 2) row_filter: status != 'blocked' workers_safe (NEW): drops name, email, phone, zip, communications, resume_text keeps role, city, state, skills, certifications, archetype, scores resume_text + communications carry verbatim PII (full names) and there is no in-view text scrubber, so they are dropped wholesale. Skills + certifications + scores carry the matching signal for staffing inference. jobs_safe (NEW): drops description (often quotes client names verbatim) client_id masked (keep first 3, last 2) bill_rate / pay_rate kept — commercial info, not PII per staffing PRD scripts/staffing/build_workers_v9.sh (NEW): POSTs /vectors/index to rebuild workers_500k_v9 from `workers_safe` rather than the raw table. Embedded text is constructed from the view projection so PII never enters the corpus by construction. 30+ minute background job — not run inline. 
After it completes, flip config/modes.toml `staffing_inference` matrix_corpus from workers_500k_v8 to workers_500k_v9 and restart gateway. Distillation v1.0.0 substrate untouched. audit-full passed clean (16/16 required) before this commit; will re-verify after.	2026-04-27 10:46:03 -05:00
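The safe-view projection in this commit (drop PII columns, mask ids by keeping the first 3 and last 2 characters) can be sketched on a single row. The fill character, the handling of short ids, and the function names are assumptions; the dropped column names are the ones listed for `candidates_safe`, and the real enforcement sits at the catalog metadata layer rather than in Python.

```python
from typing import Dict, Iterable


def mask_id(value: str, head: int = 3, tail: int = 2, fill: str = "*") -> str:
    """Keep the first `head` and last `tail` characters, mask the rest."""
    if len(value) <= head + tail:
        return fill * len(value)  # assumption: short ids are masked entirely
    return value[:head] + fill * (len(value) - head - tail) + value[-tail:]


CANDIDATES_DROP = ("last_name", "email", "phone", "hourly_rate_usd")


def candidates_safe_row(row: Dict[str, str], drop: Iterable[str] = CANDIDATES_DROP) -> Dict[str, str]:
    """Project one raw candidate row into its masked, PII-free form."""
    dropped = set(drop)
    safe = {k: v for k, v in row.items() if k not in dropped}
    if "candidate_id" in safe:
        safe["candidate_id"] = mask_id(safe["candidate_id"])
    return safe
```

A row filter such as `status != 'blocked'` would be applied before this projection; it is left out here to keep the sketch to the column-level rules.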
root	940737daa7	staffing: D — workers_500k.phone int → string fixup script Decision D from reports/staffing/synthetic-data-gap-report.md §7. Phones in workers_500k.parquet are 11-digit US numbers stored as int64 (e.g. 13122277740). Numerically fine, but breaks join keys against any other source that carries phone as string. Script casts the column to string in place, with non-destructive backup at data/datasets/workers_500k.parquet.bak-<date> before write. Idempotent: if phone is already string, exits 0 with "no-op". Safe to re-run. The .parquet itself is too large to commit (75MB) and follows project convention of staying out of git. The script makes the conversion reproducible from the source dataset.	2026-04-27 10:45:38 -05:00
root	d56f08e740	staffing: A — fill_events.parquet from 44 scenarios + 64 lessons (deterministic) Decision A from reports/staffing/synthetic-data-gap-report.md §7. Walks tests/multi-agent/scenarios/scen_.json and data/_playbook_lessons/.json, normalizes to a single fill_events.parquet at data/datasets/fill_events.parquet. One row per scenario event, lesson outcomes joined by (client, date) where the tuple matches. rows: 123 scenarios contributing: 40 events with outcome data: 62 unique (client, date) tuples: 40 Reproducibility: event_id is SHA1(client\|date\|role\|at\|city) truncated to 16 hex chars; rows sorted by event_id before write so re-runs produce bit-identical output. Verified. Pure normalization — no LLM, no new data, no distillation substrate mutation.	2026-04-27 10:45:29 -05:00
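The reproducibility scheme in this commit (a content-derived `event_id` plus a sort before writing) can be sketched as follows. The field names and the `|` separator follow the commit message; the dictionary row shape and the function names are assumptions, and the Parquet write itself is omitted.

```python
import hashlib
from typing import Dict, List


def event_id(client: str, date: str, role: str, at: str, city: str) -> str:
    """SHA1 over 'client|date|role|at|city', truncated to 16 hex chars."""
    key = "|".join([client, date, role, at, city])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]


def normalize(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Attach event_id to each row and sort so output order is input-independent."""
    out = [
        dict(r, event_id=event_id(r["client"], r["date"], r["role"], r["at"], r["city"]))
        for r in rows
    ]
    return sorted(out, key=lambda r: r["event_id"])
```

Because the id depends only on row content and the sort depends only on the id, feeding the same events in any order yields the same row sequence, which is the precondition for bit-identical re-runs.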

10 Commits