12 Commits
fcd53168a0
phase 1.6: counsel handoff turnkey + seed_consent_version.sh + strict mode live
The remaining production blocker is the counsel-calendar bottleneck
(review + sign-off). Engineering can't make counsel move faster,
but it CAN reduce the round-trip overhead:
(1) docs/counsel/COUNSEL_HANDOFF_EMAIL_2026-05-05.md — copy-paste
email body J can send to outside counsel. Subject line + body
+ tarball attachment instructions + headline asks (A/B/C/D
in priority order) + post-signature operator runbook. The
pre-flight checklist + post-signature workflow turn what
would have been "I'll figure out the email" into "click send."
(2) scripts/staffing/seed_consent_version.sh — turnkey
post-signature deployment. Takes the path to a (presumably
counsel-signed) consent template markdown, computes SHA-256,
atomically merges into /etc/lakehouse/consent_versions.json
(creating the file if absent, with per-seed audit metadata
in _meta.seeded_at[]), restarts lakehouse.service, probes
/biometric/health post-restart. Idempotent: re-running with
the same hash is a no-op for the versions array but still
appends a [reseed] entry to the audit metadata.
Verified live against the eng-staged template — strict mode
flipped clean, /biometric/health 200 post-restart.
(3) docs/PHASE_1_6_BIPA_GATES.md §6.5 — post-signature deployment
runbook embedded in the gates doc. Three steps: counsel signs
+ commits → seed_consent_version.sh → strict-mode probe.
Plus a "pre-counsel demo seed" subsection documenting how to
exercise strict mode BEFORE counsel signs (using the
eng-staged template hash) so the deployment workflow is
proven before the legal critical path closes.
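A minimal Python sketch of the merge semantics item (2) describes (the shipped tool is a bash script; the registry field names and timestamp format here are illustrative assumptions, not the script's exact output):

```python
import hashlib
import json
import os
import tempfile
import time

def seed_consent_version(template_path, registry_path):
    """Sketch of seed_consent_version.sh's semantics: add the template's
    SHA-256 to the versions array (no-op if already present), but always
    append an audit entry to _meta.seeded_at. Atomic write throughout."""
    with open(template_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    # Create the registry if absent.
    if os.path.exists(registry_path):
        with open(registry_path) as f:
            reg = json.load(f)
    else:
        reg = {"versions": [], "_meta": {"seeded_at": []}}

    reseed = digest in reg["versions"]
    if not reseed:
        reg["versions"].append(digest)
    reg["_meta"]["seeded_at"].append({
        "hash": digest,
        "at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "note": "[reseed]" if reseed else "initial",
    })

    # tmp + os.replace so a crash mid-write can't corrupt the registry.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(registry_path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(reg, f, indent=2)
    os.replace(tmp, registry_path)
    return digest, reseed
```

Re-running with the same template leaves the versions array untouched while the audit trail still records the reseed attempt.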
Strict mode flipped live — verified post-restart:
- /etc/lakehouse/consent_versions.json populated with the
eng-staged template hash:
8b09591a8dc15f59197affac48909ce943d575eee01705b42303acf3b32f5c56
- POST /biometric/subject/WORKER-1/consent with deadbeef hash:
HTTP 400 + error="consent_version_unknown"
- POST with the known eng-staged hash: passes version check
(then 404 subject_not_found on a ghost candidate, proving
the gate is hash-aware not auth-broken)
The hash currently seeded is the ENG-STAGED template
(pre-counsel-signature). When counsel returns the signed text,
operator runs `seed_consent_version.sh` again with the
counsel-signed markdown — the new hash gets appended; the demo
hash stays in for backwards-compat with any consent records
collected during the pre-counsel demo period (none, today).
Production blocker is now genuinely just counsel calendar:
1. J transmits reports/counsel/counsel_packet_2026-05-05.tar.gz
per the handoff email
2. Counsel reviews + signs (their billable time)
3. Counsel returns signed text → operator runs seed script
4. Strict mode flips to canonical hash → cutover complete
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
87b034f5f9
phase 1.6: ops dashboard + consent_versions allowlist + subject timeline tool
Closes the afternoon's "all four" wave (per J's request to do all the
items in one pass instead of picking one of the options):
(1) Live demo on WORKER-100 — full lifecycle exercised end-to-end
against the running gateway. 3 audit rows landed in correct
order (consent_grant → biometric_collection →
consent_withdrawal), chain_verified=true, photo on disk at
data/biometric/uploads/WORKER-100/1778011967957907731_027b6bb1.jpg
(180 bytes JFIF). retention_until=2026-06-04 (30d from
withdrawal per consent template v1 §2).
(2) GET /biometric/stats — read-only aggregate over all subjects.
Returns counts by biometric.status + subject.status, photo
count, oldest_active_retention_until, and the last 20
state-change events (consent_grant / collection / withdrawal /
erasure — validator_lookup and other noise filtered out).
Walks per-subject audit logs via the existing writer; cheap
for 100 subjects, would want an event-stream index at 100k.
Legal-tier auth (same posture as /audit). 4 unit tests.
(3) /biometric/dashboard mcp-server frontend. Auto-refreshes
/biometric/stats every 15s, neo-brutalist tile layout for
the per-status counts + retention horizon block + recent
events table with kind badges + event-kind breakdown pills.
sessionStorage-backed token; logout button clears state.
DOM-built throughout (textContent + createElement) — never
innerHTML on audit-row values, since trace_id et al. could
in theory carry operator-supplied strings.
(4) consent_versions allowlist. BiometricEndpointState gains
`allowed_consent_versions: Option<Arc<HashSet<String>>>`,
loaded at startup from /etc/lakehouse/consent_versions.json
(override via LH_CONSENT_VERSIONS_FILE). process_consent
refuses unknown hashes with HTTP 400 consent_version_unknown
when configured. Resolution semantics:
- Missing file → permissive (v1 compat, warn-log)
- Parse error → permissive (error-log; broken config
silently going strict would be worse)
- Empty array → strict, refuse all (deliberate freeze
mode for "counsel hasn't signed v1 yet")
- Populated → strict, lowercase-normalized comparison
5 unit tests (known/unknown/case/empty/none-permissive).
Example template at ops/consent_versions.example.json with
a counsel-tier deployment note.
(5) scripts/staffing/subject_timeline.sh — operator one-shot
pretty-print of any subject's full BIPA lifecycle. Curls
/audit/subject/{id} with legal token; renders manifest
summary + on-disk photo state + chronological audit chain
with kind badges + chain verification status. Smoke-tested
on WORKER-100 (3 rows verified).
(6) STATE_OF_PLAY.md refresh. New section "afternoon wave"
captures all four commits (76cb5ac, 7f0f500, 68d226c, this
one) + the live demo evidence + the v1 endpoint matrix +
UI/CLI inventory + the production-cutover blocking set
(counsel calendar only — eng substrate is done).
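The allowlist resolution semantics in item (4) can be sketched in a few lines of Python (the shipped code is Rust in BiometricEndpointState; this is a behavioral model, not the implementation):

```python
import json

def load_allowed_versions(file_text):
    """Resolution semantics for the consent_versions allowlist.
    Returns None for permissive mode, or a set of allowed hashes
    (an empty set means strict refuse-all). `file_text` is the raw
    file contents, or None when the file is missing."""
    if file_text is None:
        return None            # missing file -> permissive (v1 compat)
    try:
        versions = json.loads(file_text)["versions"]
    except Exception:
        return None            # parse error -> permissive, error-logged
    return {v.lower() for v in versions}  # empty array -> strict freeze

def consent_version_allowed(allowed, candidate):
    if allowed is None:
        return True                        # permissive mode
    return candidate.lower() in allowed    # strict, case-insensitive
```

The deliberate asymmetry: a broken config file degrades to permissive (with loud logging), while an explicitly empty array is the strict freeze mode.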
Verified live post-restart:
- /audit/health + /biometric/health both 200
- /biometric/stats returns 100 subjects, 2 withdrawn (WORKER-2 from
earlier scrum + WORKER-100 from today's demo), 1 photo on record,
6 recent state-change events
- /biometric/intake + /biometric/withdraw + /biometric/dashboard
all 200 on mcp-server :3700
- subject_timeline.sh on WORKER-100: chain_verified=true,
chain_root=a47563ff937d50de…
- 88/88 catalogd lib tests + 55/55 biometric_endpoint tests green
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b2c34b80b3 |
phase 1.6: lock Gate 3b = C, reconcile docs to shipped state, fix double-upload file leak
Four threads landing together — all driven by the audit J asked for before
production cutover.
(1) Gate 3b DECIDED: Option C (defer classifications). `BiometricCollection.classifications`
stays `Option<JSON> = None` in v1. `docs/specs/GATE_3B_DEEPFACE_DESIGN.md` status
flipped from "draft / awaits product" to DECIDED. Consent template + retention
schedule revised to remove all "automated facial-classification" / "deepface"
language so disclosed scope matches implemented scope.
(2) Endpoint-path drift reconciled across 3 docs. `PHASE_1_6_BIPA_GATES.md`,
`BIPA_DESTRUCTION_RUNBOOK.md`, and `biometric_retention_schedule_v1.md` had
references to legacy `/v1/identity/subjects/*` paths (proposed under a separate
identityd daemon, never shipped) — corrected to actual shipped routes
`/biometric/subject/*` (catalogd-local). Schema block in PHASE_1_6_BIPA_GATES
rewritten to reflect JSON `SubjectManifest.biometric_collection` substrate
(not the proposed Postgres `subjects` table).
(3) New operational artifacts:
- `scripts/staffing/verify_biometric_erasure.sh` — checks 4 things post-erasure
(manifest cleared, uploads dir empty, audit row matches, chain verified).
Smoke-tested live against WORKER-2.
- `scripts/staffing/biometric_destruction_report.sh` — monthly anonymized
destruction-event aggregation. Smoke-tested clean.
- `scripts/staffing/bundle_counsel_packet.sh` — tarballs the counsel-review
packet with per-file SHA-256 manifest.
- `docs/runbooks/LEGAL_AUDIT_KEY_ROTATION.md` — formal rotation procedure
operationalized after the 2026-05-05 /tmp wipe incident.
- `docs/counsel/COUNSEL_REVIEW_PACKET_2026-05-05.md` — cover note bundling
all eng-staged BIPA docs for counsel review with per-doc questions, sign-off
checklist, recommended review sequence.
(4) Double-upload file leak fixed in `crates/catalogd/src/biometric_endpoint.rs`.
`verify_biometric_erasure.sh` smoked WORKER-2 and surfaced a stranded photo
file. Investigation showed the file was 13-byte test-fixture bytes (zero PII,
no biometric content); audit timeline showed two consecutive uploads followed
by one erasure — the second upload had silently overwritten manifest.data_path,
orphaning the first file. Patched `process_upload` to refuse a second upload
with HTTP 409 + `error: "biometric_already_collected"` when
`biometric_collection.is_some()` on the manifest. Operator must explicitly
POST `/biometric/subject/{id}/erase` first.
Tests: new `second_upload_without_erase_returns_409` (asserts 409 + manifest
pointer unchanged + first file untouched on disk). Replaced
`repeated_uploads_grow_the_chain` with `upload_erase_upload_grows_the_chain_cleanly`
(covers the legitimate re-collection cycle: chain grows to 3 rows). Updated
`content_type_with_parameters_accepted` to use 2 distinct subjects (was
using 1 subject with 2 uploads to test ct parsing — would now 409).
22/22 biometric_endpoint tests + 59/59 catalogd lib tests green post-patch.
Production posture: gateway needs `cargo build --release -p gateway` +
`systemctl restart lakehouse.service` to pick up the new 409 in live traffic.
Counsel calendar is now the only remaining blocker for first real-photo intake.
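The 409 guard in item (4) reduces to a small state machine; a hedged Python sketch (the real code is Rust in biometric_endpoint.rs, and field names here are simplified):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SubjectManifest:
    biometric_collection: Optional[dict] = None

def process_upload(manifest, data_path):
    """Sketch of the double-upload guard: a second upload without an
    intervening erase is refused with 409, so the first file's
    manifest pointer is never silently overwritten and orphaned."""
    if manifest.biometric_collection is not None:
        return 409, {"error": "biometric_already_collected"}
    manifest.biometric_collection = {"data_path": data_path}
    return 200, {"data_path": data_path}

def process_erase(manifest):
    """Explicit erase clears the pointer, permitting re-collection."""
    manifest.biometric_collection = None
    return 200, {}
```

Upload, erase, upload is the legitimate re-collection cycle; upload, upload is the leak path this commit closes.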
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c7aa607ae4 |
phase 1.6 BIPA: scrum-driven fixes
Per 2026-05-03 phase_1_6_bipa_gates scrum (13 findings, 0 convergent).
1 BLOCK verified false positive, 4 real fixes shipped:
False positive (verified):
- opus BLOCK on attest:55 — claimed `set -uo pipefail` without `-e`
makes the post-python3 `if [ $? -ne 0 ]` check unreachable. Verified
WRONG: `X=$(false); echo $?` prints 1. Bash propagates command-
substitution exit through $? on the assignment line. The check IS
the python3 exit gate. Inline comment added to the script noting
the false positive so future scrums don't re-flag.
Real fixes:
1. opus WARN attestation:18 — schema fingerprint hashed names ONLY,
missing column-type changes. A column repurposed to hold base64
photo bytes under its existing name would pass undetected. Now
hashes "name<TAB>type<TAB>nullable=bool" per row. Re-run produced
evidence SHA-256 1fdcc9f1... (vs old 230fffeb..., reflecting the
broader fingerprint scope).
2. opus WARN gate_4_test:60 — definition regex didn't catch
object-literal property forms (`const t = { FEMALE_NAMES: [...] }`)
or TypeScript class fields (`class L { public NAMES_X: string[] = [] }`).
Added two new patterns + a regression test
(Gate 4: object-literal and class-field bypasses are caught) that
exercises 5 bypass forms. 4/4 tests green; 1 minor regex tweak
needed mid-fix to handle single-line class bodies.
3. kimi WARN python3-reliance — script assumed pyarrow installed and
would emit a stack trace into the attestation if not. Added
`python3 -c "import pyarrow"` gate at top with clean install
instructions on failure.
4. opus INFO PHASE_1_6:200 — item 7 (training) silently dropped from
blocking set with bare "deferred" rationale. Now explicitly states
the deferral is conditional on small operator population (J + 1-2
named ops); item 7 re-promotes to blocking if population grows.
⚖ COUNSEL marker added.
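Fix 1's broadened fingerprint can be sketched as follows (the shipped check lives in the attestation script; the exact row format beyond "name, type, nullable per row" is assumed here):

```python
import hashlib

def schema_fingerprint(columns):
    """Fingerprint covering name, type, and nullability per column,
    so repurposing a column's type under its existing name changes
    the hash. `columns` is a list of (name, type, nullable) tuples."""
    rows = [f"{name}\t{typ}\tnullable={nullable}".lower()
            for name, typ, nullable in columns]
    return hashlib.sha256("\n".join(rows).encode()).hexdigest()
```

Under the old names-only scheme, changing `phone` from int64 to a base64 blob column would have produced an identical fingerprint; here it does not.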
Skipped (acceptable as ⚖ COUNSEL placeholders by design):
- kimi WARN consent template:30-day-SLA (counsel decides number)
- kimi WARN consent template:email-placeholder (counsel supplies)
- kimi WARN parquet absence (env override exists; redeployment-aware)
- kimi INFO runbook manual-erasure (marked TODO when /erase ships)
- qwen INFO doc path/status nits (already addressed by file moves)
Tests: 4/4 Gate 4 absence test (incl. new bypass-coverage), 3/3
attestation evidence checks pass on live data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4708717f6b |
phase 1.6 BIPA gates — engineering wave (4 of 7 staged)
Per docs/PHASE_1_6_BIPA_GATES.md. Status table now reflects:
DONE (engineering-only, no counsel dependency):
- Gate 4: name→ethnicity inference removed from mcp-server.
Removal note in search.html:3372 + new Bun absence test
(mcp-server/phase_1_6_gate_4.test.ts) with 3 assertions:
walker actually scans files, regex catches synthetic positives,
no offending DEFINITION patterns in any .html/.ts/.js source.
3/3 pass.
ENG-DONE, signature pending:
- §2 attestation: scripts/staffing/attest_pre_identityd_biometric_state.sh
runs three checks against the live state:
1. workers_500k.parquet schema has no biometric/photo/face/image col
2. data/_kb/*.jsonl + pathway state contain no base64 image magic
bytes (JPEG /9j/, PNG iVBOR), no data:image/* MIME prefixes,
no field-name patterns ("photo", "biometric", "deepface_*")
3. data/headshots/manifest.jsonl is entirely synthetic-tagged
3/3 evidence checks pass on the live data dir. Generates a
signed-by-operator+counsel attestation document committed at
docs/attestations/BIPA_PRE_IDENTITYD_ATTESTATION_2026-05-03.md
with SHA-256 of the evidence summary so post-signature tampering
is detectable.
ENG-STAGED, awaiting counsel review:
- Gate 1 retention schedule scaffold at
docs/policies/consent/biometric_retention_schedule_v1.md (BIPA
§15(a)). Engineering facts (categories, 18-month operational
ceiling vs 3-year statutory cap, destruction procedure pointer
to Gate 5 runbook) plus ⚖ COUNSEL markers for the binding text.
- Gate 2 consent template scaffold at
docs/policies/consent/biometric_consent_template_v1.md (BIPA
§15(b)(1)-(3)). Required disclosures + plain-language summary +
withdrawal procedure + the structured fields the consent UI must
post to identityd.
- Gate 5 destruction runbook at docs/runbooks/BIPA_DESTRUCTION_RUNBOOK.md.
Triggers, pre-destruction checks (incl. chain-verified gate via
/audit/subject/{id}), procedure (legal-tier endpoint), automatic
audit row append (subject_audit.v1 with kind=biometric_erasure),
backup-window disclosure, monthly reporting cadence, audit-trail
attestation procedure cross-referencing the cross-runtime parity
probe.
BLOCKED on engineering design:
- Gate 3 photo-upload endpoint. Requires identityd photo intake
design + deepface integration scope. Deferred to its own session.
DEFERRED:
- §3 employee training material. Gate 5 runbook §7 may serve as
substrate; counsel decides whether a separate program is needed.
Calendar bottleneck is now counsel review. Engineering can stage no
further deliverables until either (a) Gate 3's design conversation
happens or (b) counsel completes review of items 1/2/5/6.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4b92d1da91 |
demo: icon recipe pipeline + role-aware portraits + ComfyUI negative-prompt override
Adds two single-source-of-truth recipe files that drive both the
hot-path render server and the offline pre-render scripts:
- role_scenes.ts: per-role-band scene clauses (clothing + backdrop).
Forklift operators look like forklift operators instead of
collapsing to interchangeable studio shots. SCENES_VERSION mixes
into the headshot cache key so a coordinator tweak refreshes every
matching face on next view.
- icon_recipes.ts: cert / role-prop / status / hazard / empty icons
with deterministic per-recipe seeds + fuzzy text resolver.
ICONS_VERSION suffix on the cached file means edits don't
overwrite in place — misfires are recoverable.
Routes (mcp-server/index.ts):
- GET /headshots/_scenes — exposes SCENES + version to the
pre-render script so prompts don't drift between batch and hot-path.
- GET /icons/_recipes — same idea for icons.
- GET /icons/cert?text=... — resolves free-text cert names to a
recipe and 302s to the rendered icon. 404 (not 500) when no recipe
matches so the front-end can hang `onerror="this.remove()"`.
- GET /icons/render/{category}/{slug} — cache-or-render at 256² (8
steps) for crisper edges than 512² when downsampled to 14px.
ComfyUI portrait support (scripts/serve_imagegen.py):
The editorial workflow had `human, person, face` baked into its
negative prompt — actively sabotaging portraits. _comfyui_generate
now accepts negative_prompt/cfg/sampler/scheduler overrides, and
those mix into the cache key so portrait calls don't collapse into
hero-shot cache hits.
scripts/staffing/render_role_pool.py: pre-renders the role-aware
face pool by reading SCENES from /headshots/_scenes — single source
of truth verified at run time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1745881426 |
staffing: face pool fetch preserves prior tags + --shrink gate + atomic manifest write
fetch_face_pool was wiping 952 hand-classified rows when re-run from a
Python without deepface installed (it reset every gender to None). Now:
- Loads existing manifest by id and overlays only fetch-owned fields,
  so gender/race/age/excluded survive a refetch.
- deepface pass tags only records that don't already have a gender;
  deepface unavailable means "leave existing tags alone", not "reset".
- New --shrink flag required to drop ids >= --count. Default refuses
  to shrink the pool silently.
- Atomic write via tmp + os.replace so an interrupted run can't
  corrupt the manifest.
- Dedupes duplicate id lines (root cause of the 2497-row manifest
  backing a 1000-face pool).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
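The overlay-merge behavior can be sketched in Python (the set of fetch-owned fields and the row shape are illustrative assumptions; the real script owns its own field list):

```python
def merge_manifest(existing_rows, fetched_rows, count, shrink=False):
    """Sketch of the refetch-safe merge: the fetch updates only the
    fields it owns, so hand-classified tags (gender/race/age/excluded)
    survive. Duplicate id lines are deduped, and rows with id >= count
    are dropped only when shrink is explicitly requested."""
    FETCH_OWNED = {"id", "path", "source_url"}  # assumed field list
    by_id = {}
    for row in existing_rows:          # dedupe duplicate id lines
        by_id[row["id"]] = dict(row)
    for row in fetched_rows:
        merged = by_id.get(row["id"], {})
        merged.update({k: v for k, v in row.items() if k in FETCH_OWNED})
        by_id[row["id"]] = merged
    if not shrink:                     # default: never silently shrink
        return list(by_id.values())
    return [r for r in by_id.values() if r["id"] < count]
```

A refetch whose rows carry no gender field leaves existing gender tags intact, which is exactly the failure mode being fixed.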
a3b65f314e |
Synthetic face pool — 1000 StyleGAN headshots, ComfyUI hot-swap, 60x smaller thumbs
Worker cards now ship a real photo per person instead of monogram tiles:
- fetch_face_pool.py pulls 1000 faces from thispersondoesnotexist.com
- tag_face_pool.py runs deepface for gender/race/age, excludes <22yo
- manifest.jsonl: 952 servable, gender/race buckets populated
- /headshots/_thumbs/ pre-resized to 384px webp (587KB -> 11KB,
60x smaller; without this Chrome's parallel-connection budget
drops ~75% of tiles in a 40-card grid)
- /headshots/:key gender x race x age intersection bucketing with
gender-only fallback when intersection is sparse
- /headshots/generate/:key ComfyUI on-demand for the contractor
profile spotlight (cold ~1.5s, cached ~1ms; worker-derived
djb2 seed makes faces deterministic-per-worker but unique
across workers sharing the same prompt)
- serve_imagegen.py _cache_key() now includes seed (was caching
by prompt only -> 3 different worker seeds collapsed to 1
cached image; verified fix produces 3 distinct md5s)
- confidence-default name resolution: Xavier->man+hispanic,
Aisha->woman+black, etc. Every worker resolves to a bucket.
End-to-end: playwright run on /?q=forklift+operators+IL -> 21/21
cards loaded, 0 broken, all 384px webp.
Cache + binary pool gitignored; manifest tracked.
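The _cache_key() fix reduces to mixing the seed into the digest; a hedged sketch (the real serve_imagegen.py key may include more parameters, e.g. the negative-prompt override from the later commit):

```python
import hashlib

def cache_key(prompt, seed):
    """Cache key that mixes the seed into the digest. Keyed on prompt
    alone, three workers sharing one prompt collapse to one cached
    image; with the seed mixed in, each worker gets a distinct entry."""
    return hashlib.md5(f"{prompt}|seed={seed}".encode()).hexdigest()
```

This mirrors the verified fix: three different worker seeds now produce three distinct md5s.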
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
10ed3bc630 |
demo: real synthetic headshots — fetch pool + serve route + UI wire
Three layers shipped:
1. SCRIPT — scripts/staffing/fetch_face_pool.py
Pulls N synthetic StyleGAN faces from thispersondoesnotexist.com
into data/headshots/face_NNNN.jpg, writes manifest.jsonl. Idempotent:
re-running skips existing files. Optional gender tagging via deepface
(currently unavailable on this box; the script handles ImportError
gracefully and tags everything as untagged). Fetched 198 faces with
concurrency=3 in ~67s.
2. SERVER — /headshots/:key route in mcp-server/index.ts
Loads manifest at first hit, caches in globalThis._faces. Hashes the
key with djb2-style mixing → pool index → returns the JPG. Same
key always gets the same face (deterministic). Accepts
?g=man|woman&e=caucasian|black|hispanic|south_asian|east_asian|middle_eastern
to bias pool selection — the gender/ethnicity buckets fall back to
the full pool when no tagged matches exist. Cache-Control:
86400 immutable so faces ride the browser cache after first hit.
/headshots/__reload re-reads the manifest without restart.
3. UI — search.html + console.html worker cards
Re-added overlay <img> on top of the monogram .av circle. img.src
= /headshots/<encoded-key>?g=<hint>&e=<hint>. img.onerror removes
the failed image so the monogram stays visible if the face pool
isn't fetched / CDN is blocked. .av now has overflow:hidden +
position:relative to clip the img to a perfect circle.
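The deterministic key-to-face mapping in layer 2 can be sketched as follows (the server does "djb2-style mixing" in TypeScript; the exact mixing may differ from classic djb2 shown here):

```python
def djb2(s):
    """Classic djb2 string hash, masked to 32 bits."""
    h = 5381
    for ch in s:
        h = ((h * 33) + ord(ch)) & 0xFFFFFFFF
    return h

def face_for(key, pool):
    """Hash the key into a pool index: the same key always lands on
    the same face, so a worker's photo is stable across page loads."""
    return pool[djb2(key) % len(pool)]
```

Determinism is the point: no per-worker state is stored, yet every render of the same card shows the same face.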
Forced-confident name resolution (J: "we're CREATING the profile,
created as though you truly have the information Xavier is more
likely Hispanic and he's a male"):
genderFor(name) — looks up MALE_NAMES + FEMALE_NAMES,
falls back to a deterministic hash split
so unknown names spread ~50/50. Sets now
include cross-cultural names: Alejandro/
Andres/Mateo/Santiago/Joaquin/Cesar/Hugo/
Felipe/Gerardo/Salvador/Ramon (Hispanic),
Raj/Anil/Vikram/Krishna/Pradeep (South
Asian), Wei/Yi/Hiroshi/Akira/Hyun (East
Asian), Demetrius/Kareem/DaQuan/Khalil
(Black), Omar/Khalid/Hassan/Ahmed/Bilal
(Middle Eastern). FEMALE_NAMES extended
in parallel.
guessEthnicityFromFirstName(name)
— confident default of 'caucasian' for any
name not in the cultural buckets so every
worker resolves to a category the face
pool can be biased toward. Order: ME → Black
→ Hispanic → South Asian → East Asian →
Caucasian (matters where names overlap,
e.g. Aisha appears in ME + Black, biases
toward ME for visual fit).
Both helpers also ported into console.html so the triage backfills
and try-it-yourself rendering get the same hint stack.
Privacy note in the script + route comments: the synthetic data uses
the worker's name as the seed; production should hash worker_id (not
name) to avoid leaking PII to a third-party CDN. The fetch URL itself
is referenced once per pool build, not per-worker.
.gitignore — added data/headshots/face_*.jpg (~100MB for 198 faces;
the manifest + script are tracked). Re-running the script on a fresh
checkout rebuilds the pool from scratch.
Verified end-to-end via playwright on devop.live/lakehouse:
forklift query → 10 worker cards
10/10 with face images (real synthetic headshots, not monograms)
0/10 broken
Alejandro G. Nelson → ?g=man&e=hispanic
Patricia K. Garcia → ?g=woman&e=caucasian
Each name → unique face, deterministic across loads.
Console triage backfills get the same treatment.
c3c9c2174a |
staffing: B+C — safe views (candidates/workers/jobs) + workers_500k_v9 build script
[CI: some checks failed — lakehouse/auditor raised 9 blocking issues:
claim not backed — "Verified live (current synthetic data):"]

Decision B from reports/staffing/synthetic-data-gap-report.md §7 (plus
C: client_workerskjkk.parquet typo file removed from data/datasets/ —
was never tracked, no git effect).

PII enforcement was UNVERIFIED in workers_500k_v8 (the corpus
staffing_inference mode embeds chunks from). Verified 2026-04-27 by
inspecting data/vectors/meta/workers_500k_v8.json — `source:
"workers_500k"` confirms v8 was built directly from the raw table, so
the LLM has been seeing names / emails / phones / resume_text for
every staffing query. This commit closes the boundary at the catalog
metadata layer:

candidates_safe (overhauled — was failing SQL invalid 434×/day on a
nonexistent `vertical` column reference, copy-pasted from job_orders):
- drops last_name, email, phone, hourly_rate_usd
- candidate_id masked (keep first 3, last 2)
- row_filter: status != 'blocked'

workers_safe (NEW):
- drops name, email, phone, zip, communications, resume_text
- keeps role, city, state, skills, certifications, archetype, scores
- resume_text + communications carry verbatim PII (full names) and
  there is no in-view text scrubber, so they are dropped wholesale.
  Skills + certifications + scores carry the matching signal for
  staffing inference.

jobs_safe (NEW):
- drops description (often quotes client names verbatim)
- client_id masked (keep first 3, last 2)
- bill_rate / pay_rate kept — commercial info, not PII per staffing PRD

scripts/staffing/build_workers_v9.sh (NEW): POSTs /vectors/index to
rebuild workers_500k_v9 from `workers_safe` rather than the raw table.
Embedded text is constructed from the view projection, so PII never
enters the corpus by construction. 30+ minute background job — not run
inline. After it completes, flip config/modes.toml `staffing_inference`
matrix_corpus from workers_500k_v8 to workers_500k_v9 and restart
gateway.

Distillation v1.0.0 substrate untouched. audit-full passed clean
(16/16 required) before this commit; will re-verify after.
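The keep-first-3-last-2 masking used for candidate_id and client_id can be sketched in Python (the shipped masking lives in the view layer; the short-id behavior shown here is an assumption):

```python
def mask_id(value, keep_head=3, keep_tail=2):
    """Mask an identifier: keep the first 3 and last 2 characters,
    star the middle. Ids too short to mask meaningfully are fully
    starred rather than leaked (assumed behavior)."""
    if len(value) <= keep_head + keep_tail:
        return "*" * len(value)
    middle = "*" * (len(value) - keep_head - keep_tail)
    return value[:keep_head] + middle + value[-keep_tail:]
```

The head/tail residue keeps ids joinable by eye for operators while removing the full key from the safe views.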
940737daa7 |
staffing: D — workers_500k.phone int → string fixup script
Decision D from reports/staffing/synthetic-data-gap-report.md §7.

Phones in workers_500k.parquet are 11-digit US numbers stored as int64
(e.g. 13122277740). Numerically fine, but this breaks join keys against
any other source that carries phone as string.

The script casts the column to string in place, with a non-destructive
backup at data/datasets/workers_500k.parquet.bak-<date> before the
write. Idempotent: if phone is already string, it exits 0 with "no-op".
Safe to re-run.

The .parquet itself is too large to commit (75MB) and follows project
convention of staying out of git. The script makes the conversion
reproducible from the source dataset.
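The idempotent-cast logic, independent of the parquet plumbing, looks roughly like this (pure-Python sketch over row dicts; the real script operates on the parquet column):

```python
def cast_phone_to_string(rows):
    """Idempotent sketch of the fixup: int64 phones become strings.
    If every phone is already a string the pass is a no-op, mirroring
    the script's exit-0 "no-op" behavior."""
    if all(isinstance(r["phone"], str) for r in rows):
        return rows, "no-op"
    return [{**r, "phone": str(r["phone"])} for r in rows], "converted"
```

Running the conversion twice yields the same data, which is what makes the script safe to re-run blindly.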
d56f08e740 |
staffing: A — fill_events.parquet from 44 scenarios + 64 lessons (deterministic)
Decision A from reports/staffing/synthetic-data-gap-report.md §7. Walks tests/multi-agent/scenarios/scen_*.json and data/_playbook_lessons/*.json, normalizes to a single fill_events.parquet at data/datasets/fill_events.parquet. One row per scenario event, lesson outcomes joined by (client, date) where the tuple matches. rows: 123 scenarios contributing: 40 events with outcome data: 62 unique (client, date) tuples: 40 Reproducibility: event_id is SHA1(client|date|role|at|city) truncated to 16 hex chars; rows sorted by event_id before write so re-runs produce bit-identical output. Verified. Pure normalization — no LLM, no new data, no distillation substrate mutation. |