lakehouse/auditor 9 blocking issues: cloud: claim not backed — "Verified live (current synthetic data):"
Decision B from reports/staffing/synthetic-data-gap-report.md §7
(plus C: client_workerskjkk.parquet typo file removed from
data/datasets/ — was never tracked, no git effect).
PII enforcement was UNVERIFIED in workers_500k_v8 (the corpus that
staffing_inference mode embeds chunks from). Checked 2026-04-27 by
inspecting data/vectors/meta/workers_500k_v8.json: `source:
"workers_500k"` confirms v8 was built directly from the raw table, so
the LLM has been seeing names / emails / phones / resume_text for every
staffing query.
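The metadata check above can be sketched as follows. This is a minimal
illustration, assuming the meta file is JSON with a top-level `source`
key as quoted; the function name and the set of raw table names are
hypothetical, not part of the repo.

```python
import json

# Raw tables that must never feed a corpus directly (assumed set; only
# workers_500k is named in this commit).
RAW_TABLES = frozenset({"workers_500k"})


def corpus_source_is_raw(meta_text, raw_tables=RAW_TABLES):
    """Return True if the corpus metadata names a raw table as its source."""
    meta = json.loads(meta_text)
    return meta.get("source") in raw_tables


# Only the `source` field is known from the commit; other fields are omitted.
v8_meta = '{"source": "workers_500k"}'
v9_meta = '{"source": "workers_safe"}'
print(corpus_source_is_raw(v8_meta), corpus_source_is_raw(v9_meta))
```

A check like this could run against every file under data/vectors/meta/
before a corpus is wired into a mode.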
This commit closes the boundary at the catalog metadata layer:
candidates_safe (overhauled; was failing with invalid SQL 434×/day on a
reference to a nonexistent `vertical` column copy-pasted from job_orders):
drops last_name, email, phone, hourly_rate_usd
candidate_id masked (keep first 3, last 2)
row_filter: status != 'blocked'
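For illustration, the "keep first 3, last 2" mask could behave like the
sketch below. The fill character, the handling of short values, and the
function name are assumptions; the catalog's actual mask implementation
is not shown in this commit.

```python
def mask_keep_ends(value, head=3, tail=2, fill="*"):
    """Mask the middle of a string, keeping `head` leading and `tail` trailing chars.

    Assumption: values too short to mask safely are masked entirely, so a
    short ID never passes through unchanged.
    """
    if len(value) <= head + tail:
        return fill * len(value)
    middle = fill * (len(value) - head - tail)
    return value[:head] + middle + value[-tail:]


# Hypothetical candidate_id format, for illustration only.
print(mask_keep_ends("CAND-004217"))  # CAN******17
```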
workers_safe (NEW):
drops name, email, phone, zip, communications, resume_text
keeps role, city, state, skills, certifications, archetype, scores
resume_text + communications carry verbatim PII (full names) and
there is no in-view text scrubber, so they are dropped wholesale.
Skills + certifications + scores carry the matching signal for
staffing inference.
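A guard along these lines could assert that a view's column list stays
disjoint from the dropped columns. This is a sketch only: the PII set is
copied from the "drops" line above, `leaked_columns` is a hypothetical
helper, and the score columns are left out because their names are not
given here.

```python
# Columns the workers_safe view drops, per this commit.
WORKERS_PII = frozenset(
    {"name", "email", "phone", "zip", "communications", "resume_text"}
)


def leaked_columns(view_columns, pii=WORKERS_PII):
    """Return the sorted PII columns that a view definition still exposes."""
    return sorted(pii & set(view_columns))


# Kept columns from the commit (score columns omitted).
safe_view = ["role", "city", "state", "skills", "certifications", "archetype"]
print(leaked_columns(safe_view))                   # []
print(leaked_columns(safe_view + ["email", "name"]))  # ['email', 'name']
```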
jobs_safe (NEW):
drops description (often quotes client names verbatim)
client_id masked (keep first 3, last 2)
bill_rate / pay_rate kept — commercial info, not PII per staffing PRD
scripts/staffing/build_workers_v9.sh (NEW):
POSTs /vectors/index to rebuild workers_500k_v9 from `workers_safe`
rather than the raw table. Embedded text is constructed from the
view projection so PII never enters the corpus by construction.
30+ minute background job — not run inline. After it completes,
flip config/modes.toml `staffing_inference` matrix_corpus from
workers_500k_v8 to workers_500k_v9 and restart gateway.
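"By construction" can be pictured with the sketch below, which builds
the embedded text from an allow-list of view columns and then runs a
crude email check as a post-build sanity test. It is not the contents of
build_workers_v9.sh: the function names, the text layout, the regex, and
the sample row are all hypothetical, and score columns are omitted
because their names are not given in this commit.

```python
import re

# Allow-list taken from the workers_safe "keeps" line (scores omitted).
SAFE_FIELDS = ("role", "city", "state", "skills", "certifications", "archetype")

# Deliberately simple email pattern; a sanity check, not a full PII scanner.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def build_embed_text(row):
    """Build the text to embed from allow-listed fields only."""
    parts = []
    for field in SAFE_FIELDS:
        value = row.get(field)
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        parts.append(f"{field}: {value}")
    return "\n".join(parts)


def contains_email(text):
    """Return True if the text contains something shaped like an email."""
    return EMAIL_RE.search(text) is not None


# Hypothetical raw row; name and email must not survive into the text.
row = {
    "name": "Jane Example",
    "email": "jane@example.com",
    "role": "forklift operator",
    "city": "Dayton",
    "state": "OH",
    "skills": ["pallet jack", "inventory"],
    "certifications": ["OSHA 10"],
    "archetype": "steady",
}
text = build_embed_text(row)
print(text)
```

Because the text is assembled from the allow-list rather than by
removing fields from a raw record, a new PII column added upstream stays
out of the corpus unless someone adds it to the list explicitly.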
Distillation v1.0.0 substrate untouched. audit-full passed clean
(16/16 required) before this commit; will re-verify after.