Decision D from reports/staffing/synthetic-data-gap-report.md §7.
Phones in workers_500k.parquet are 11-digit US numbers stored as int64
(e.g. 13122277740). The values are numerically valid, but an int64 key
does not match a string key, which breaks joins against any other source
that carries phone as a string. The script casts the column to string in
place and writes a non-destructive backup to
data/datasets/workers_500k.parquet.bak-<date> before the write.
Idempotent: if phone is already string, exits 0 with "no-op". Safe to
re-run.
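
A minimal in-memory sketch of the idempotent cast and the backup naming,
for illustration only. It is not the script itself: the real script works
on the parquet file, while this uses plain dicts. The helper names
cast_phones and backup_path are hypothetical, and rendering <date> as ISO
YYYY-MM-DD is an assumption.

```python
from datetime import date
from pathlib import Path


def backup_path(path: Path, today: date) -> Path:
    # Assumption: <date> is rendered as ISO YYYY-MM-DD; the real script may differ.
    return path.with_name(f"{path.name}.bak-{today.isoformat()}")


def cast_phones(rows: list[dict]) -> tuple[list[dict], bool]:
    """Return (rows, changed); changed is False when every phone is already a str."""
    if all(isinstance(r["phone"], str) for r in rows):
        return rows, False  # the "no-op" path: nothing to convert
    # Build new dicts so the caller's input is left untouched.
    return [{**r, "phone": str(r["phone"])} for r in rows], True


rows = [{"phone": 13122277740}]
converted, changed = cast_phones(rows)
again, changed_again = cast_phones(converted)  # second run changes nothing
bak = backup_path(Path("data/datasets/workers_500k.parquet"), date(2024, 1, 2))
```

The second call returns changed_again == False, which mirrors the "no-op"
exit on re-run.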
The .parquet itself is too large to commit (75 MB) and stays out of git
per project convention. The script makes the conversion reproducible
from the source dataset.

Decision A from reports/staffing/synthetic-data-gap-report.md §7.
Walks tests/multi-agent/scenarios/scen_*.json and
data/_playbook_lessons/*.json and normalizes them into a single file,
data/datasets/fill_events.parquet. One row per scenario event, with
lesson outcomes joined on (client, date) where the tuple matches.
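
A minimal sketch of that join, assuming a left join that keeps every
scenario event and leaves the outcome empty when no lesson matches. The
field name "outcome", the helper join_outcomes, and the sample values are
hypothetical. It also assumes at most one lesson per (client, date).

```python
def join_outcomes(events: list[dict], lessons: list[dict]) -> list[dict]:
    """Attach a lesson outcome to each event by (client, date); None if unmatched."""
    # Assumption: one lesson per (client, date); a later duplicate would overwrite.
    by_key = {(l["client"], l["date"]): l["outcome"] for l in lessons}
    return [
        {**e, "outcome": by_key.get((e["client"], e["date"]))}
        for e in events
    ]


events = [
    {"client": "acme", "date": "2024-01-05", "role": "welder"},
    {"client": "acme", "date": "2024-01-05", "role": "loader"},
    {"client": "zenith", "date": "2024-01-06", "role": "picker"},
]
lessons = [{"client": "acme", "date": "2024-01-05", "outcome": "filled"}]

joined = join_outcomes(events, lessons)
```

Every event survives the join, so the row count equals the event count
and only the matched rows carry outcome data.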
rows: 123
scenarios contributing: 40
events with outcome data: 62
unique (client, date) tuples: 40
Reproducibility: event_id is SHA1(client|date|role|at|city) truncated to
16 hex chars; rows sorted by event_id before write so re-runs produce
bit-identical output. Verified.
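
A minimal sketch of the ID and ordering scheme, assuming the five fields
are joined with a literal "|" and UTF-8 encoded before hashing, and that
truncation keeps the first 16 hex chars. The function name and sample
values are hypothetical.

```python
import hashlib


def event_id(client: str, date: str, role: str, at: str, city: str) -> str:
    # Assumption: fields joined with "|" and UTF-8 encoded before hashing.
    key = "|".join([client, date, role, at, city])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]


def with_ids_sorted(events: list[dict]) -> list[dict]:
    """Add event_id to each event and sort by it, so input order does not matter."""
    rows = [
        {**e, "event_id": event_id(e["client"], e["date"], e["role"], e["at"], e["city"])}
        for e in events
    ]
    return sorted(rows, key=lambda r: r["event_id"])


sample = [
    {"client": "acme", "date": "2024-01-05", "role": "welder", "at": "08:00", "city": "Chicago"},
    {"client": "zenith", "date": "2024-01-06", "role": "picker", "at": "09:30", "city": "Joliet"},
]
forward = with_ids_sorted(sample)
backward = with_ids_sorted(list(reversed(sample)))
```

Because the ID depends only on row content and the sort depends only on
the ID, forward and backward come out identical.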
Pure normalization: no LLM, no new data, no mutation of the distillation
substrate.