12 Commits

root
fcd53168a0 phase 1.6: counsel handoff turnkey + seed_consent_version.sh + strict mode live
The remaining production blocker is the counsel-calendar bottleneck
(review + sign-off). Engineering can't make counsel move faster,
but it CAN reduce the round-trip overhead:

(1) docs/counsel/COUNSEL_HANDOFF_EMAIL_2026-05-05.md — copy-paste
    email body J can send to outside counsel. Subject line + body
    + tarball attachment instructions + headline asks (A/B/C/D
    in priority order) + post-signature operator runbook. The
    pre-flight checklist + post-signature workflow turn what
    would have been "I'll figure out the email" into "click send."

(2) scripts/staffing/seed_consent_version.sh — turnkey
    post-signature deployment. Takes the path to a (presumably
    counsel-signed) consent template markdown, computes SHA-256,
    atomically merges into /etc/lakehouse/consent_versions.json
    (creating the file if absent, with per-seed audit metadata
    in _meta.seeded_at[]), restarts lakehouse.service, probes
    /biometric/health post-restart. Idempotent: re-running with
    the same hash is a no-op for the versions array but still
    appends a [reseed] entry to the audit metadata.
    Verified live against the eng-staged template — strict mode
    flipped clean, /biometric/health 200 post-restart. (Merge +
    idempotency semantics sketched after this list.)

(3) docs/PHASE_1_6_BIPA_GATES.md §6.5 — post-signature deployment
    runbook embedded in the gates doc. Three steps: counsel signs
    + commits → seed_consent_version.sh → strict-mode probe.
    Plus a "pre-counsel demo seed" subsection documenting how to
    exercise strict mode BEFORE counsel signs (using the
    eng-staged template hash) so the deployment workflow is
    proven before the legal critical path closes.
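
A minimal Python sketch of item (2)'s merge + idempotency semantics
(the shipped tool is a bash script; the registry key names used here —
`versions`, `_meta.seeded_at` — are illustrative assumptions, not the
real schema):

```python
#!/usr/bin/env python3
"""Illustrative re-rendering of seed_consent_version.sh's merge step."""
import hashlib, json, os, sys, tempfile
from datetime import datetime, timezone

REGISTRY = "/etc/lakehouse/consent_versions.json"

def seed(template_path: str) -> str:
    digest = hashlib.sha256(open(template_path, "rb").read()).hexdigest()

    # Create-if-absent, then merge — never rewrite the registry from scratch.
    registry = {"versions": [], "_meta": {"seeded_at": []}}
    if os.path.exists(REGISTRY):
        with open(REGISTRY) as f:
            registry = json.load(f)

    reseed = digest in registry["versions"]
    if not reseed:
        registry["versions"].append(digest)   # idempotent for the versions array
    registry["_meta"]["seeded_at"].append({   # but every run leaves an audit entry
        "hash": digest,
        "source": template_path,
        "at": datetime.now(timezone.utc).isoformat(),
        "note": "[reseed]" if reseed else "initial seed",
    })

    # Atomic replace so an interrupted run can't leave a truncated registry.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(REGISTRY))
    with os.fdopen(fd, "w") as f:
        json.dump(registry, f, indent=2)
    os.replace(tmp, REGISTRY)
    return digest

if __name__ == "__main__":
    print(seed(sys.argv[1]))
```

The real script additionally restarts lakehouse.service and probes
/biometric/health afterwards, which the sketch omits.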

Strict mode flipped live — verified post-restart:
- /etc/lakehouse/consent_versions.json populated with the
  eng-staged template hash:
  8b09591a8dc15f59197affac48909ce943d575eee01705b42303acf3b32f5c56
- POST /biometric/subject/WORKER-1/consent with deadbeef hash:
  HTTP 400 + error="consent_version_unknown"
- POST with the known eng-staged hash: passes version check
  (then 404 subject_not_found on a ghost candidate, proving
  the gate is hash-aware, not auth-broken)

The hash currently seeded is the ENG-STAGED template
(pre-counsel-signature). When counsel returns the signed text,
operator runs `seed_consent_version.sh` again with the
counsel-signed markdown — the new hash gets appended; the demo
hash stays in for backwards-compat with any consent records
collected during the pre-counsel demo period (none, today).

Production blocker is now genuinely just counsel calendar:
1. J transmits reports/counsel/counsel_packet_2026-05-05.tar.gz
   per the handoff email
2. Counsel reviews + signs (their billable time)
3. Counsel returns signed text → operator runs seed script
4. Strict mode flips to canonical hash → cutover complete

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:32:16 -05:00
root
87b034f5f9 phase 1.6: ops dashboard + consent_versions allowlist + subject timeline tool
Closes the afternoon's "all four" wave (per J's request to do all the
items in one pass instead of picking one of the options):

(1) Live demo on WORKER-100 — full lifecycle exercised end-to-end
    against the running gateway. 3 audit rows landed in correct
    order (consent_grant → biometric_collection →
    consent_withdrawal), chain_verified=true, photo on disk at
    data/biometric/uploads/WORKER-100/1778011967957907731_027b6bb1.jpg
    (180 bytes JFIF). retention_until=2026-06-04 (30d from
    withdrawal per consent template v1 §2).

(2) GET /biometric/stats — read-only aggregate over all subjects.
    Returns counts by biometric.status + subject.status, photo
    count, oldest_active_retention_until, and the last 20
    state-change events (consent_grant / collection / withdrawal /
    erasure — validator_lookup and other noise filtered out).
    Walks per-subject audit logs via the existing writer; cheap
    for 100 subjects, would want an event-stream index at 100k.
    Legal-tier auth (same posture as /audit). 4 unit tests.

(3) /biometric/dashboard mcp-server frontend. Auto-refreshes
    /biometric/stats every 15s, neo-brutalist tile layout for
    the per-status counts + retention horizon block + recent
    events table with kind badges + event-kind breakdown pills.
    sessionStorage-backed token; logout button clears state.
    DOM-built throughout (textContent + createElement) — never
    innerHTML on audit-row values, since trace_id et al. could
    in theory carry operator-supplied strings.

(4) consent_versions allowlist. BiometricEndpointState gains
    `allowed_consent_versions: Option<Arc<HashSet<String>>>`,
    loaded at startup from /etc/lakehouse/consent_versions.json
    (override via LH_CONSENT_VERSIONS_FILE). process_consent
    refuses unknown hashes with HTTP 400 consent_version_unknown
    when configured. Resolution semantics:
      - Missing file → permissive (v1 compat, warn-log)
      - Parse error → permissive (error-log; broken config
        silently going strict would be worse)
      - Empty array → strict, refuse all (deliberate freeze
        mode for "counsel hasn't signed v1 yet")
      - Populated → strict, lowercase-normalized comparison
    5 unit tests (known/unknown/case/empty/none-permissive).
    Example template at ops/consent_versions.example.json with
    a counsel-tier deployment note. (Resolution semantics sketched
    after this list.)

(5) scripts/staffing/subject_timeline.sh — operator one-shot
    pretty-print of any subject's full BIPA lifecycle. Curls
    /audit/subject/{id} with legal token; renders manifest
    summary + on-disk photo state + chronological audit chain
    with kind badges + chain verification status. Smoke-tested
    on WORKER-100 (3 rows verified).

(6) STATE_OF_PLAY.md refresh. New section "afternoon wave"
    captures all four commits (76cb5ac, 7f0f500, 68d226c, this
    one) + the live demo evidence + the v1 endpoint matrix +
    UI/CLI inventory + the production-cutover blocking set
    (counsel calendar only — eng substrate is done).
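
A minimal Python sketch of item (4)'s resolution semantics (the
shipped code is Rust in biometric_endpoint.rs; the function names and
the file's `versions` key are assumptions for illustration):

```python
import json
import logging
from typing import Optional, Set

log = logging.getLogger("consent_versions")

def load_allowed_versions(path: str) -> Optional[Set[str]]:
    """None means permissive; a set (possibly empty) means strict."""
    try:
        with open(path) as f:
            versions = json.load(f)["versions"]
    except FileNotFoundError:
        log.warning("consent_versions file missing — permissive (v1 compat)")
        return None
    except (json.JSONDecodeError, KeyError) as err:
        log.error("consent_versions unparseable (%s) — staying permissive", err)
        return None
    # An empty list is a deliberate freeze: strict mode that refuses every hash.
    return {v.lower() for v in versions}

def consent_version_allowed(allowed: Optional[Set[str]], submitted: str) -> bool:
    if allowed is None:        # not configured: accept anything (warn-logged at load)
        return True
    return submitted.lower() in allowed   # lowercase-normalized comparison

# process_consent maps a False return to HTTP 400 consent_version_unknown.
```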

Verified live post-restart:
- /audit/health + /biometric/health both 200
- /biometric/stats returns 100 subjects, 2 withdrawn (WORKER-2 from
  earlier scrum + WORKER-100 from today's demo), 1 photo on record,
  6 recent state-change events
- /biometric/intake + /biometric/withdraw + /biometric/dashboard
  all 200 on mcp-server :3700
- subject_timeline.sh on WORKER-100: chain_verified=true,
  chain_root=a47563ff937d50de…
- 88/88 catalogd lib tests + 55/55 biometric_endpoint tests green

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:27:52 -05:00
root
b2c34b80b3 phase 1.6: lock Gate 3b = C, reconcile docs to shipped state, fix double-upload file leak
Four threads landing together — all driven by the audit J asked for before
production cutover.

(1) Gate 3b DECIDED: Option C (defer classifications). `BiometricCollection.classifications`
    stays `Option<JSON> = None` in v1. `docs/specs/GATE_3B_DEEPFACE_DESIGN.md` status
    flipped from "draft / awaits product" to DECIDED. Consent template + retention
    schedule revised to remove all "automated facial-classification" / "deepface"
    language so disclosed scope matches implemented scope.

(2) Endpoint-path drift reconciled across 3 docs. `PHASE_1_6_BIPA_GATES.md`,
    `BIPA_DESTRUCTION_RUNBOOK.md`, and `biometric_retention_schedule_v1.md` had
    references to legacy `/v1/identity/subjects/*` paths (proposed under a separate
    identityd daemon, never shipped) — corrected to actual shipped routes
    `/biometric/subject/*` (catalogd-local). Schema block in PHASE_1_6_BIPA_GATES
    rewritten to reflect JSON `SubjectManifest.biometric_collection` substrate
    (not the proposed Postgres `subjects` table).

(3) New operational artifacts:
    - `scripts/staffing/verify_biometric_erasure.sh` — checks 4 things post-erasure
      (manifest cleared, uploads dir empty, audit row matches, chain verified).
      Smoke-tested live against WORKER-2.
    - `scripts/staffing/biometric_destruction_report.sh` — monthly anonymized
      destruction-event aggregation. Smoke-tested clean.
    - `scripts/staffing/bundle_counsel_packet.sh` — tarballs the counsel-review
      packet with per-file SHA-256 manifest.
    - `docs/runbooks/LEGAL_AUDIT_KEY_ROTATION.md` — formal rotation procedure
      operationalized after the 2026-05-05 /tmp wipe incident.
    - `docs/counsel/COUNSEL_REVIEW_PACKET_2026-05-05.md` — cover note bundling
      all eng-staged BIPA docs for counsel review with per-doc questions, sign-off
      checklist, recommended review sequence.

(4) Double-upload file leak fixed in `crates/catalogd/src/biometric_endpoint.rs`.
    `verify_biometric_erasure.sh` smoked WORKER-2 and surfaced a stranded photo
    file. Investigation showed the file was 13-byte test-fixture bytes (zero PII,
    no biometric content); audit timeline showed two consecutive uploads followed
    by one erasure — the second upload had silently overwritten manifest.data_path,
    orphaning the first file. Patched `process_upload` to refuse a second upload
    with HTTP 409 + `error: "biometric_already_collected"` when
    `biometric_collection.is_some()` on the manifest. Operator must explicitly
    POST `/biometric/subject/{id}/erase` first.

    Tests: new `second_upload_without_erase_returns_409` (asserts 409 + manifest
    pointer unchanged + first file untouched on disk). Replaced
    `repeated_uploads_grow_the_chain` with `upload_erase_upload_grows_the_chain_cleanly`
    (covers the legitimate re-collection cycle: chain grows to 3 rows). Updated
    `content_type_with_parameters_accepted` to use 2 distinct subjects (was
    using 1 subject with 2 uploads to test ct parsing — would now 409).

    22/22 biometric_endpoint tests + 59/59 catalogd lib tests green post-patch.
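
    A minimal Python sketch of the item (4) guard (the shipped fix is
    Rust in process_upload; the 409 + error string follow the commit
    text, the surrounding structure and the `store` helper are
    illustrative):

```python
class HttpError(Exception):
    def __init__(self, status: int, error: str):
        super().__init__(error)
        self.status, self.error = status, error

def process_upload(manifest: dict, photo_bytes: bytes, store) -> dict:
    # Refuse a second collection outright: silently overwriting data_path
    # would orphan the first file on disk (the leak this commit fixes).
    if manifest.get("biometric_collection") is not None:
        raise HttpError(409, "biometric_already_collected")

    data_path = store.write_photo(photo_bytes)   # hypothetical storage helper
    manifest["biometric_collection"] = {"data_path": data_path}
    return manifest
```

    Legitimate re-collection goes through the explicit /erase endpoint
    first, which clears biometric_collection and appends its own audit row.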

Production posture: gateway needs `cargo build --release -p gateway` +
`systemctl restart lakehouse.service` to pick up the new 409 in live traffic.

Counsel calendar is now the only remaining blocker for first real-photo intake.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 06:19:40 -05:00
root
c7aa607ae4 phase 1.6 BIPA: scrum-driven fixes
Per 2026-05-03 phase_1_6_bipa_gates scrum (13 findings, 0 convergent).
1 BLOCK verified false positive, 4 real fixes shipped:

False positive (verified):
- opus BLOCK on attest:55 — claimed `set -uo pipefail` without `-e`
  makes the post-python3 `if [ $? -ne 0 ]` check unreachable. Verified
  WRONG: `X=$(false); echo $?` prints 1. Bash propagates command-
  substitution exit through $? on the assignment line. The check IS
  the python3 exit gate. Inline comment added to the script noting
  the false positive so future scrums don't re-flag.

Real fixes:
1. opus WARN attestation:18 — schema fingerprint hashed names ONLY,
   missing column-type changes. A column repurposed to hold base64
   photo bytes under its existing name would pass undetected. Now
   hashes "name<TAB>type<TAB>nullable=bool" per row. Re-run produced
   evidence SHA-256 1fdcc9f1... (vs old 230fffeb..., reflecting the
   broader fingerprint scope). (Fingerprint hashing sketched after
   the fixes list.)

2. opus WARN gate_4_test:60 — definition regex didn't catch
   object-literal property forms (`const t = { FEMALE_NAMES: [...] }`)
   or TypeScript class fields (`class L { public NAMES_X: string[] = [] }`).
   Added two new patterns + a regression test
   (Gate 4: object-literal and class-field bypasses are caught) that
   exercises 5 bypass forms. 4/4 tests green; 1 minor regex tweak
   needed mid-fix to handle single-line class bodies.

3. kimi WARN python3-reliance — script assumed pyarrow installed and
   would emit a stack trace into the attestation if not. Added
   `python3 -c "import pyarrow"` gate at top with clean install
   instructions on failure.

4. opus INFO PHASE_1_6:200 — item 7 (training) silently dropped from
   blocking set with bare "deferred" rationale. Now explicitly states
   the deferral is conditional on small operator population (J + 1-2
   named ops); item 7 re-promotes to blocking if population grows.
   ⚖ COUNSEL marker added.
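
A minimal Python sketch of fix 1's broadened fingerprint (assumes
pyarrow; the per-row format follows the commit text, everything else is
illustrative):

```python
import hashlib
import pyarrow.parquet as pq

def schema_fingerprint(parquet_path: str) -> str:
    """Hash name<TAB>type<TAB>nullable=bool per column, not names alone,
    so a column repurposed under its existing name still changes the digest."""
    schema = pq.read_schema(parquet_path)
    digest = hashlib.sha256()
    for field in schema:
        digest.update(f"{field.name}\t{field.type}\tnullable={field.nullable}\n".encode())
    return digest.hexdigest()
```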

Skipped (acceptable as ⚖ COUNSEL placeholders by design):
- kimi WARN consent template:30-day-SLA (counsel decides number)
- kimi WARN consent template:email-placeholder (counsel supplies)
- kimi WARN parquet absence (env override exists; redeployment-aware)
- kimi INFO runbook manual-erasure (marked TODO when /erase ships)
- qwen INFO doc path/status nits (already addressed by file moves)

Tests: 4/4 Gate 4 absence tests (incl. the new bypass coverage), 3/3
attestation evidence checks pass on live data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:43:17 -05:00
root
4708717f6b phase 1.6 BIPA gates — engineering wave (4 of 7 staged)
Per docs/PHASE_1_6_BIPA_GATES.md. Status table now reflects:

  DONE (engineering-only, no counsel dependency):
  - Gate 4: name→ethnicity inference removed from mcp-server.
    Removal note in search.html:3372 + new Bun absence test
    (mcp-server/phase_1_6_gate_4.test.ts) with 3 assertions:
    walker actually scans files, regex catches synthetic positives,
    no offending DEFINITION patterns in any .html/.ts/.js source.
    3/3 pass.

  ENG-DONE, signature pending:
  - §2 attestation: scripts/staffing/attest_pre_identityd_biometric_state.sh
    runs three checks against the live state:
      1. workers_500k.parquet schema has no biometric/photo/face/image col
      2. data/_kb/*.jsonl + pathway state contain no base64 image magic
         bytes (JPEG /9j/, PNG iVBOR), no data:image/* MIME prefixes,
         no field-name patterns ("photo", "biometric", "deepface_*")
      3. data/headshots/manifest.jsonl is entirely synthetic-tagged
    3/3 evidence checks pass on the live data dir. Generates a
    signed-by-operator+counsel attestation document committed at
    docs/attestations/BIPA_PRE_IDENTITYD_ATTESTATION_2026-05-03.md
    with SHA-256 of the evidence summary so post-signature tampering
    is detectable. (Check 2's magic-byte scan is sketched below,
    after the status listing.)

  ENG-STAGED, awaiting counsel review:
  - Gate 1 retention schedule scaffold at
    docs/policies/consent/biometric_retention_schedule_v1.md (BIPA
    §15(a)). Engineering facts (categories, 18-month operational
    ceiling vs 3-year statutory cap, destruction procedure pointer
    to Gate 5 runbook) plus ⚖ COUNSEL markers for the binding text.
  - Gate 2 consent template scaffold at
    docs/policies/consent/biometric_consent_template_v1.md (BIPA
    §15(b)(1)-(3)). Required disclosures + plain-language summary +
    withdrawal procedure + the structured fields the consent UI must
    post to identityd.
  - Gate 5 destruction runbook at docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md.
    Triggers, pre-destruction checks (incl. chain-verified gate via
    /audit/subject/{id}), procedure (legal-tier endpoint), automatic
    audit row append (subject_audit.v1 with kind=biometric_erasure),
    backup-window disclosure, monthly reporting cadence, audit-trail
    attestation procedure cross-referencing the cross-runtime parity
    probe.

  BLOCKED on engineering design:
  - Gate 3 photo-upload endpoint. Requires identityd photo intake
    design + deepface integration scope. Deferred to its own session.

  DEFERRED:
  - §3 employee training material. Gate 5 runbook §7 may serve as
    substrate; counsel decides whether a separate program is needed.
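
A minimal Python sketch of §2 check 2 (the base64 image magic bytes,
data:image/* MIME prefixes, and field-name patterns follow the commit
text; the walker itself is illustrative):

```python
import re
from pathlib import Path

# Base64 of JPEG bytes (FF D8 FF) starts "/9j/"; of PNG (89 50 4E 47) starts "iVBOR".
BASE64_IMAGE_MAGIC = ("/9j/", "iVBOR")
MIME_PREFIX = "data:image/"
FIELD_NAME_PATTERN = re.compile(r'"(photo|biometric|deepface_\w*)"\s*:', re.IGNORECASE)

def scan_jsonl_dir(kb_dir: str) -> list[str]:
    """Return human-readable findings; an empty list means the check passes."""
    findings = []
    for path in Path(kb_dir).glob("*.jsonl"):
        for lineno, line in enumerate(path.read_text(errors="replace").splitlines(), 1):
            if any(magic in line for magic in BASE64_IMAGE_MAGIC):
                findings.append(f"{path}:{lineno}: base64 image magic bytes")
            if MIME_PREFIX in line:
                findings.append(f"{path}:{lineno}: data:image/* MIME prefix")
            if FIELD_NAME_PATTERN.search(line):
                findings.append(f"{path}:{lineno}: biometric field-name pattern")
    return findings
```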

Calendar bottleneck is now counsel review. Engineering can stage no
further deliverables until either (a) Gate 3's design conversation
happens or (b) counsel completes review of items 1/2/5/6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:38:49 -05:00
root
4b92d1da91 demo: icon recipe pipeline + role-aware portraits + ComfyUI negative-prompt override
Adds two single-source-of-truth recipe files that drive both the
hot-path render server and the offline pre-render scripts:

- role_scenes.ts: per-role-band scene clauses (clothing + backdrop).
  Forklift operators look like forklift operators instead of
  collapsing to interchangeable studio shots. SCENES_VERSION mixes
  into the headshot cache key so a coordinator tweak refreshes every
  matching face on next view.
- icon_recipes.ts: cert / role-prop / status / hazard / empty icons
  with deterministic per-recipe seeds + fuzzy text resolver.
  ICONS_VERSION suffix on the cached file means edits don't
  overwrite in place — misfires are recoverable.

Routes (mcp-server/index.ts):
- GET /headshots/_scenes — exposes SCENES + version to the
  pre-render script so prompts don't drift between batch and hot-path.
- GET /icons/_recipes — same idea for icons.
- GET /icons/cert?text=... — resolves free-text cert names to a
  recipe and 302s to the rendered icon. 404 (not 500) when no recipe
  matches so the front-end can hang `onerror="this.remove()"`.
- GET /icons/render/{category}/{slug} — cache-or-render at 256² (8
  steps) for crisper edges than 512² when downsampled to 14px.

ComfyUI portrait support (scripts/serve_imagegen.py):
The editorial workflow had `human, person, face` baked into its
negative prompt — actively sabotaging portraits. _comfyui_generate
now accepts negative_prompt/cfg/sampler/scheduler overrides, and
those mix into the cache key so portrait calls don't collapse into
hero-shot cache hits.
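
A minimal Python sketch of the cache-key discipline both halves rely
on — every knob that changes the output pixels (scene version, seed,
negative-prompt/cfg/sampler/scheduler overrides) must mix into the key;
names and defaults here are illustrative:

```python
import hashlib
import json

SCENES_VERSION = "v3"   # illustrative; bumped when a coordinator edits a scene clause

def cache_key(prompt: str, seed: int, negative_prompt: str = "",
              cfg: float = 7.0, sampler: str = "euler",
              scheduler: str = "normal") -> str:
    """If an override doesn't reach the key, a portrait call collapses
    into a stale hero-shot cache hit — the bug fixed in serve_imagegen.py."""
    payload = json.dumps({
        "prompt": prompt,
        "seed": seed,
        "negative_prompt": negative_prompt,
        "cfg": cfg,
        "sampler": sampler,
        "scheduler": scheduler,
        "scenes_version": SCENES_VERSION,
    }, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```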

scripts/staffing/render_role_pool.py: pre-renders the role-aware
face pool by reading SCENES from /headshots/_scenes — single source
of truth verified at run time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 06:01:04 -05:00
root
1745881426 staffing: face pool fetch preserves prior tags + --shrink gate + atomic manifest write
fetch_face_pool was wiping 952 hand-classified rows when re-run from
a Python environment without deepface installed (it reset every gender
to None). Now:
Now:

- Loads existing manifest by id and overlays only fetch-owned fields,
  so gender/race/age/excluded survive a refetch.
- deepface pass tags only records that don't already have a gender;
  deepface unavailable means "leave existing tags alone" not "reset".
- New --shrink flag required to drop ids >= --count. Default refuses
  to shrink the pool silently.
- Atomic write via tmp + os.replace so an interrupted run can't
  corrupt the manifest.
- Dedupes duplicate id lines (root cause of the 2497-row manifest
  backing a 1000-face pool).
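
A minimal Python sketch of the merge-and-overlay behavior plus the
atomic write (field names follow the commit text; helper names are
illustrative):

```python
import json
import os
import tempfile

FETCH_OWNED = {"id", "path", "source", "fetched_at"}   # only these get overwritten

def merge_manifest(existing: list[dict], fetched: list[dict]) -> list[dict]:
    """Overlay fetch-owned fields onto existing rows keyed by id, so
    hand-classified gender/race/age/excluded survive a refetch.
    Duplicate id lines collapse to one row (last occurrence wins)."""
    by_id = {row["id"]: row for row in existing}        # dedupe on load
    for row in fetched:
        merged = by_id.setdefault(row["id"], {})
        merged.update({k: v for k, v in row.items() if k in FETCH_OWNED})
    return [by_id[i] for i in sorted(by_id)]

def write_manifest(rows: list[dict], manifest_path: str) -> None:
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(manifest_path) or ".")
    with os.fdopen(fd, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    os.replace(tmp, manifest_path)   # an interrupted run can't corrupt the manifest
```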

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 06:01:04 -05:00
root
a3b65f314e Synthetic face pool — 1000 StyleGAN headshots, ComfyUI hot-swap, 60x smaller thumbs
Worker cards now ship a real photo per person instead of monogram tiles:

  - fetch_face_pool.py pulls 1000 faces from thispersondoesnotexist.com
  - tag_face_pool.py runs deepface for gender/race/age, excludes <22yo
  - manifest.jsonl: 952 servable, gender/race buckets populated
  - /headshots/_thumbs/ pre-resized to 384px webp (587KB -> 11KB,
    60x smaller; without this Chrome's parallel-connection budget
    drops ~75% of tiles in a 40-card grid)
  - /headshots/:key gender x race x age intersection bucketing with
    gender-only fallback when the intersection is sparse (fallback
    logic sketched after this list)
  - /headshots/generate/:key ComfyUI on-demand for the contractor
    profile spotlight (cold ~1.5s, cached ~1ms; worker-derived
    djb2 seed makes faces deterministic-per-worker but unique
    across workers sharing the same prompt)
  - serve_imagegen.py _cache_key() now includes seed (was caching
    by prompt only -> 3 different worker seeds collapsed to 1
    cached image; verified fix produces 3 distinct md5s)
  - confidence-default name resolution: Xavier->man+hispanic,
    Aisha->woman+black, etc. Every worker resolves to a bucket.
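
A minimal Python sketch of the bucketing fallback referenced in the
/headshots/:key bullet (the shipped route is TypeScript; the age band
is omitted here for brevity and the row shape is illustrative):

```python
def pick_bucket(manifest: list[dict], gender: str, race: str) -> list[dict]:
    """Prefer the gender x race intersection; fall back to gender-only,
    then to the whole servable pool, so a sparse intersection never 404s."""
    servable = [r for r in manifest if not r.get("excluded")]
    both = [r for r in servable if r.get("gender") == gender and r.get("race") == race]
    if both:
        return both
    gender_only = [r for r in servable if r.get("gender") == gender]
    return gender_only or servable
```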

End-to-end: playwright run on /?q=forklift+operators+IL -> 21/21
cards loaded, 0 broken, all 384px webp.

Cache + binary pool gitignored; manifest tracked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 06:01:04 -05:00
root
10ed3bc630 demo: real synthetic headshots — fetch pool + serve route + UI wire
Three layers shipped:

1. SCRIPT — scripts/staffing/fetch_face_pool.py
   Pulls N synthetic StyleGAN faces from thispersondoesnotexist.com
   into data/headshots/face_NNNN.jpg, writes manifest.jsonl. Idempotent:
   re-running skips existing files. Optional gender tagging via deepface
   (currently unavailable on this box; the script handles ImportError
   gracefully and tags everything as untagged). Fetched 198 faces with
   concurrency=3 in ~67s.

2. SERVER — /headshots/:key route in mcp-server/index.ts
   Loads manifest at first hit, caches in globalThis._faces. Hashes the
   key with djb2-style mixing → pool index → returns the JPG (mapping
   sketched below, after layer 3). Same key always gets the same face
   (deterministic). Accepts
   ?g=man|woman&e=caucasian|black|hispanic|south_asian|east_asian|middle_eastern
   to bias pool selection — the gender/ethnicity buckets fall back to
   the full pool when no tagged matches exist. Cache-Control:
   86400 immutable so faces ride the browser cache after first hit.
   /headshots/__reload re-reads the manifest without restart.

3. UI — search.html + console.html worker cards
   Re-added overlay <img> on top of the monogram .av circle. img.src
   = /headshots/<encoded-key>?g=<hint>&e=<hint>. img.onerror removes
   the failed image so the monogram stays visible if the face pool
   isn't fetched / CDN is blocked. .av now has overflow:hidden +
   position:relative to clip the img to a perfect circle.
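
A minimal Python sketch of layer 2's key → face mapping (the shipped
route is TypeScript in mcp-server/index.ts; this is the classic djb2
shape the commit describes, with an illustrative 32-bit clamp):

```python
def djb2(key: str) -> int:
    h = 5381
    for ch in key:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF   # keep the hash in 32 bits
    return h

def face_for_key(key: str, pool: list[str]) -> str:
    """The same key always maps to the same pool index, so a worker's face
    is stable across page loads without storing any per-worker state."""
    return pool[djb2(key) % len(pool)]
```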

Forced-confident name resolution (J: "we're CREATING the profile,
created as though you truly have the information Xavier is more
likely Hispanic and he's a male"):

   genderFor(name)        — looks up MALE_NAMES + FEMALE_NAMES,
                            falls back to a deterministic hash split
                            so unknown names spread ~50/50. Sets now
                            include cross-cultural names: Alejandro/
                            Andres/Mateo/Santiago/Joaquin/Cesar/Hugo/
                            Felipe/Gerardo/Salvador/Ramon (Hispanic),
                            Raj/Anil/Vikram/Krishna/Pradeep (South
                            Asian), Wei/Yi/Hiroshi/Akira/Hyun (East
                            Asian), Demetrius/Kareem/DaQuan/Khalil
                            (Black), Omar/Khalid/Hassan/Ahmed/Bilal
                            (Middle Eastern). FEMALE_NAMES extended
                            in parallel.

   guessEthnicityFromFirstName(name)
                          — confident default of 'caucasian' for any
                            name not in the cultural buckets so every
                            worker resolves to a category the face
                            pool can be biased toward. Order: ME → Black
                            → Hispanic → South Asian → East Asian →
                            Caucasian (matters where names overlap,
                            e.g. Aisha appears in ME + Black, biases
                            toward ME for visual fit).

   Both helpers also ported into console.html so the triage backfills
   and try-it-yourself rendering get the same hint stack.
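
A minimal Python sketch of the two helpers' resolution order and the
deterministic ~50/50 fallback (name sets abbreviated; the shipped
helpers live in search.html / console.html):

```python
MALE_NAMES = {"alejandro", "raj", "wei", "omar", "xavier"}     # abbreviated
FEMALE_NAMES = {"aisha", "patricia", "maria"}                  # abbreviated
ETHNICITY_BUCKETS = [   # order matters where names overlap (e.g. Aisha: ME wins)
    ("middle_eastern", {"omar", "khalid", "aisha"}),
    ("black", {"demetrius", "kareem", "aisha"}),
    ("hispanic", {"alejandro", "xavier", "maria"}),
    ("south_asian", {"raj", "anil"}),
    ("east_asian", {"wei", "hiroshi"}),
]

def gender_for(name: str) -> str:
    first = name.split()[0].lower()
    if first in MALE_NAMES:
        return "man"
    if first in FEMALE_NAMES:
        return "woman"
    # Deterministic hash split: unknown names spread ~50/50 but never flicker.
    return "man" if sum(map(ord, first)) % 2 == 0 else "woman"

def guess_ethnicity_from_first_name(name: str) -> str:
    first = name.split()[0].lower()
    for bucket, names in ETHNICITY_BUCKETS:
        if first in names:
            return bucket
    return "caucasian"   # confident default so every worker lands in a bucket
```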

Privacy note in the script + route comments: the synthetic data uses
the worker's name as the seed; production should hash worker_id (not
name) to avoid leaking PII to a third-party CDN. The fetch URL itself
is referenced once per pool build, not per-worker.

.gitignore — added data/headshots/face_*.jpg (~100MB for 198 faces;
the manifest + script are tracked). Re-running the script on a fresh
checkout rebuilds the pool from scratch.

Verified end-to-end via playwright on devop.live/lakehouse:
   forklift query → 10 worker cards
   10/10 with face images (real synthetic headshots, not monograms)
   0/10 broken
   Alejandro G. Nelson  → ?g=man&e=hispanic
   Patricia K. Garcia    → ?g=woman&e=caucasian
   Each name → unique face, deterministic across loads.
   Console triage backfills get the same treatment.
2026-04-28 06:01:04 -05:00
root
c3c9c2174a staffing: B+C — safe views (candidates/workers/jobs) + workers_500k_v9 build script
Decision B from reports/staffing/synthetic-data-gap-report.md §7
(plus C: client_workerskjkk.parquet typo file removed from
data/datasets/ — was never tracked, no git effect).

PII enforcement was UNVERIFIED in workers_500k_v8 (the corpus that
staffing_inference mode embeds chunks from). Verified 2026-04-27 by
inspecting data/vectors/meta/workers_500k_v8.json — `source:
"workers_500k"` confirms v8 was built directly from the raw table, so
the LLM has been seeing names / emails / phones / resume_text for every
staffing query.

This commit closes the boundary at the catalog metadata layer:

candidates_safe (overhauled — was failing 434×/day with invalid SQL on a
nonexistent `vertical` column reference copy-pasted from job_orders):
  drops last_name, email, phone, hourly_rate_usd
  candidate_id masked (keep first 3, last 2)
  row_filter: status != 'blocked'

workers_safe (NEW):
  drops name, email, phone, zip, communications, resume_text
  keeps role, city, state, skills, certifications, archetype, scores
  resume_text + communications carry verbatim PII (full names) and
  there is no in-view text scrubber, so they are dropped wholesale.
  Skills + certifications + scores carry the matching signal for
  staffing inference.

jobs_safe (NEW):
  drops description (often quotes client names verbatim)
  client_id masked (keep first 3, last 2)
  bill_rate / pay_rate kept — commercial info, not PII per staffing PRD
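
A minimal Python sketch of the id-masking rule the safe views apply
(keep first 3 + last 2 characters; the real masking happens in the
catalog view layer, and the sample id is hypothetical):

```python
def mask_id(value: str, keep_head: int = 3, keep_tail: int = 2) -> str:
    """e.g. 'CAND-048812' -> 'CAN******12'. Values too short to mask
    meaningfully are starred out entirely rather than leaked."""
    if len(value) <= keep_head + keep_tail:
        return "*" * len(value)
    return value[:keep_head] + "*" * (len(value) - keep_head - keep_tail) + value[-keep_tail:]
```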

scripts/staffing/build_workers_v9.sh (NEW):
  POSTs /vectors/index to rebuild workers_500k_v9 from `workers_safe`
  rather than the raw table. Embedded text is constructed from the
  view projection so PII never enters the corpus by construction.
  30+ minute background job — not run inline. After it completes,
  flip config/modes.toml `staffing_inference` matrix_corpus from
  workers_500k_v8 to workers_500k_v9 and restart gateway.

Distillation v1.0.0 substrate untouched. audit-full passed clean
(16/16 required) before this commit; will re-verify after.
2026-04-27 10:46:03 -05:00
root
940737daa7 staffing: D — workers_500k.phone int → string fixup script
Decision D from reports/staffing/synthetic-data-gap-report.md §7.

Phones in workers_500k.parquet are 11-digit US numbers stored as int64
(e.g. 13122277740). Numerically fine, but breaks join keys against any
other source that carries phone as string. Script casts the column to
string in place, with non-destructive backup at
data/datasets/workers_500k.parquet.bak-<date> before write.

Idempotent: if phone is already string, exits 0 with "no-op". Safe to
re-run.
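
A minimal Python sketch of the conversion (pyarrow-based; the backup
filename pattern follows the commit text, the rest is illustrative):

```python
import datetime
import shutil
import pyarrow as pa
import pyarrow.parquet as pq

def fixup_phone_column(path: str = "data/datasets/workers_500k.parquet") -> str:
    table = pq.read_table(path)
    if pa.types.is_string(table.schema.field("phone").type):
        return "no-op: phone is already string"        # idempotent, safe to re-run

    # Non-destructive backup before the in-place rewrite.
    backup = f"{path}.bak-{datetime.date.today().isoformat()}"
    shutil.copy2(path, backup)

    idx = table.schema.get_field_index("phone")
    table = table.set_column(idx, "phone", table.column("phone").cast(pa.string()))
    pq.write_table(table, path)
    return f"converted; backup at {backup}"
```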

The .parquet itself is too large to commit (75MB) and follows the project
convention of staying out of git. The script makes the conversion
reproducible from the source dataset.
2026-04-27 10:45:38 -05:00
root
d56f08e740 staffing: A — fill_events.parquet from 44 scenarios + 64 lessons (deterministic)
Decision A from reports/staffing/synthetic-data-gap-report.md §7.

Walks tests/multi-agent/scenarios/scen_*.json and
data/_playbook_lessons/*.json, normalizes to a single fill_events.parquet
at data/datasets/fill_events.parquet. One row per scenario event,
lesson outcomes joined by (client, date) where the tuple matches.

  rows: 123
  scenarios contributing: 40
  events with outcome data: 62
  unique (client, date) tuples: 40

Reproducibility: event_id is SHA1(client|date|role|at|city) truncated to
16 hex chars; rows sorted by event_id before write so re-runs produce
bit-identical output. Verified.
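
A minimal Python sketch of the reproducibility recipe (SHA-1 over the
pipe-joined tuple, truncated to 16 hex chars, rows sorted before write);
row field names follow the commit text:

```python
import hashlib

def event_id(client: str, date: str, role: str, at: str, city: str) -> str:
    raw = "|".join([client, date, role, at, city])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:16]

def finalize(rows: list[dict]) -> list[dict]:
    """Stamp each row's event_id and sort by it, so re-runs over the same
    scenarios and lessons produce bit-identical parquet output."""
    for row in rows:
        row["event_id"] = event_id(row["client"], row["date"], row["role"],
                                   row["at"], row["city"])
    return sorted(rows, key=lambda r: r["event_id"])
```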

Pure normalization — no LLM, no new data, no distillation substrate
mutation.
2026-04-27 10:45:29 -05:00