profit/lakehouse

Commit Graph


Author

SHA1

Message

Date

root

c3c9c2174a

staffing: B+C — safe views (candidates/workers/jobs) + workers_500k_v9 build script

lakehouse/auditor 9 blocking issues: cloud: claim not backed — "Verified live (current synthetic data):"

Decision B from reports/staffing/synthetic-data-gap-report.md §7
(plus C: client_workerskjkk.parquet typo file removed from
data/datasets/ — was never tracked, no git effect).

PII enforcement was UNVERIFIED in workers_500k_v8 (the corpus that
staffing_inference mode embeds chunks from). Inspecting
data/vectors/meta/workers_500k_v8.json on 2026-04-27 showed `source:
"workers_500k"`, which confirms v8 was built directly from the raw table,
so the LLM has been seeing names / emails / phones / resume_text for every
staffing query.

This commit closes the boundary at the catalog metadata layer:

candidates_safe (overhauled; was failing as invalid SQL 434×/day because
of a reference to a nonexistent `vertical` column, copy-pasted from
job_orders):
  drops last_name, email, phone, hourly_rate_usd
  candidate_id masked (keep first 3, last 2)
  row_filter: status != 'blocked'

workers_safe (NEW):
  drops name, email, phone, zip, communications, resume_text
  keeps role, city, state, skills, certifications, archetype, scores
  resume_text + communications carry verbatim PII (full names) and
  there is no in-view text scrubber, so they are dropped wholesale.
  Skills + certifications + scores carry the matching signal for
  staffing inference.

jobs_safe (NEW):
  drops description (often quotes client names verbatim)
  client_id masked (keep first 3, last 2)
  bill_rate / pay_rate kept — commercial info, not PII per staffing PRD
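
The "keep first 3, last 2" masking and the candidates_safe projection described above can be sketched in a few lines. This is only an illustration: the commit applies these rules at the catalog metadata layer, and the helper names `mask_id` and `candidates_safe_row`, the `*` fill character, and the handling of short IDs are assumptions, not the project's code.

```python
# Hypothetical sketch of the candidates_safe rules; not the project's implementation.
DROPPED_COLUMNS = {"last_name", "email", "phone", "hourly_rate_usd"}


def mask_id(value, keep_head=3, keep_tail=2, fill="*"):
    """Keep the first `keep_head` and last `keep_tail` characters, mask the rest."""
    text = str(value)
    if len(text) <= keep_head + keep_tail:
        # Assumption: an ID too short to hide anything is masked entirely.
        return fill * len(text)
    hidden = len(text) - keep_head - keep_tail
    return text[:keep_head] + fill * hidden + text[len(text) - keep_tail:]


def candidates_safe_row(row):
    """Return the safe projection of one row, or None if the row is filtered out."""
    status = row.get("status")
    # SQL `status != 'blocked'` also excludes rows whose status is NULL.
    if status is None or status == "blocked":
        return None
    safe = {k: v for k, v in row.items() if k not in DROPPED_COLUMNS}
    safe["candidate_id"] = mask_id(row["candidate_id"])
    return safe
```

The same `mask_id` shape would apply to `client_id` in jobs_safe.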

scripts/staffing/build_workers_v9.sh (NEW):
  POSTs /vectors/index to rebuild workers_500k_v9 from `workers_safe`
  rather than the raw table. Embedded text is constructed from the
  view projection so PII never enters the corpus by construction.
  30+ minute background job — not run inline. After it completes,
  flip config/modes.toml `staffing_inference` matrix_corpus from
  workers_500k_v8 to workers_500k_v9 and restart gateway.
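
One way to read "PII never enters the corpus by construction" is that the embedded text is assembled from an allow-list of view columns rather than by stripping fields from the raw row. The sketch below shows that idea only; `build_embed_text` and the `column: value` text layout are hypothetical, and the request body that the script POSTs to /vectors/index is not shown because the commit does not describe it.

```python
# Hypothetical sketch: build embedding text from the workers_safe columns only.
SAFE_COLUMNS = ("role", "city", "state", "skills", "certifications",
                "archetype", "scores")


def build_embed_text(row):
    """Assemble text from allow-listed columns; anything else is never read."""
    parts = []
    for column in SAFE_COLUMNS:
        value = row.get(column)
        if value is None or value == "":
            continue
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(item) for item in value)
        parts.append(f"{column}: {value}")
    return "\n".join(parts)
```

Because the function iterates over `SAFE_COLUMNS` instead of the row's keys, a new PII column added to the raw table later would stay out of the text unless someone adds it to the allow-list.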

Distillation v1.0.0 substrate untouched. audit-full passed clean
(16/16 required) before this commit; will re-verify after.

2026-04-27 10:46:03 -05:00

root

940737daa7

staffing: D — workers_500k.phone int → string fixup script

Decision D from reports/staffing/synthetic-data-gap-report.md §7.

Phones in workers_500k.parquet are 11-digit US numbers stored as int64
(e.g. 13122277740). Numerically fine, but breaks join keys against any
other source that carries phone as string. Script casts the column to
string in place, with non-destructive backup at
data/datasets/workers_500k.parquet.bak-<date> before write.

Idempotent: if phone is already string, exits 0 with "no-op". Safe to
re-run.
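
The idempotency rule above (cast only when the column is not already string, otherwise report a no-op) can be sketched on an in-memory column. This is a simplified illustration under stated assumptions: the real script works on the parquet file and writes a backup first, while `cast_phone_column` here is a hypothetical helper that takes a plain list of values.

```python
# Hypothetical sketch of the idempotent int -> string cast; not the actual script.
def cast_phone_column(values):
    """Return (new_values, changed); missing values (None) are left as None."""
    already_string = all(v is None or isinstance(v, str) for v in values)
    if already_string:
        # Nothing to do: the caller can print "no-op" and exit 0.
        return list(values), False
    return [None if v is None else str(v) for v in values], True
```

Running the function on its own output returns `changed == False`, which is the property that makes a re-run safe.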

The .parquet itself is too large to commit (75MB) and follows project
convention of staying out of git. The script makes the conversion
reproducible from the source dataset.

2026-04-27 10:45:38 -05:00

root

d56f08e740

staffing: A — fill_events.parquet from 44 scenarios + 64 lessons (deterministic)

Decision A from reports/staffing/synthetic-data-gap-report.md §7.

Walks tests/multi-agent/scenarios/scen_*.json and
data/_playbook_lessons/*.json, normalizes to a single fill_events.parquet
at data/datasets/fill_events.parquet. One row per scenario event,
lesson outcomes joined by (client, date) where the tuple matches.

  rows: 123
  scenarios contributing: 40
  events with outcome data: 62
  unique (client, date) tuples: 40

Reproducibility: event_id is SHA1(client|date|role|at|city) truncated to
16 hex chars; rows sorted by event_id before write so re-runs produce
bit-identical output. Verified.
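
The event_id scheme described above can be sketched with the standard library. The field order and the 16-character truncation come from the commit message; the UTF-8 encoding, the function names, and the dict-based row shape are assumptions made for illustration.

```python
import hashlib


def event_id(client, date, role, at, city):
    """SHA1 of 'client|date|role|at|city', truncated to 16 hex characters."""
    key = "|".join([client, date, role, at, city])
    # Assumption: the key is encoded as UTF-8 before hashing.
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]


def with_sorted_ids(events):
    """Attach event_id to each event and sort by it so output order is stable."""
    rows = [
        dict(e, event_id=event_id(e["client"], e["date"], e["role"], e["at"], e["city"]))
        for e in events
    ]
    return sorted(rows, key=lambda row: row["event_id"])
```

Because the ID depends only on the five fields and the rows are sorted by it, the input order of the scenario files does not affect the written output.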

Pure normalization — no LLM, no new data, no distillation substrate
mutation.

2026-04-27 10:45:29 -05:00

3 Commits