Spec mandates these two docs before any staffing audit runner ships:
docs/recon/staffing-lakehouse-distillation-recon.md
reports/staffing/synthetic-data-gap-report.md
NO distillation core touched. Distillation v1.0.0 (commit e7636f2,
tag distillation-v1.0.0) remains the stable substrate. Staffing
work is consumer-only.
Recon findings (12 sections, ~5KB):
- Existing staffing schemas in crates/validator/staffing/* are scaffolds
(FillValidator schema-shape only; worker-existence/status/geo TODOs)
- Synthetic data spans 6+ shapes across 9 parquet files
(~610k worker-shape rows + 1k candidate-shape rows)
- PII detection lives in shared/pii.rs but enforcement at query
time is unverified — the LLM may have been seeing raw PII via
workers_500k_v8 vector corpus
- 44 scenarios + 64 playbook_lessons = 108 RAG candidates
- No structured fill-event log exists; scenarios+lessons are
retrospective, not queryable per-event records
- workers_500k.phone is int (should be string — leading-zero loss)
- client_workerskjkk.parquet is a typo file (160 rows, sibling of
client_workersi.parquet)
- PRD §158 claims Phase 19 closed playbook write-only gap — unverified
Gap report findings (9 sections, ~6KB):
- 4 BLOCKING gaps requiring J decisions before audit ships:
A. Generate fill_events.parquet from scenarios + lessons?
B. Build views/{candidates,workers,jobs}_safe with PII masking?
C. Delete client_workerskjkk.parquet typo file?
D. Fix workers_500k.phone type (int → string)?
- 5 SOFT gaps the audit can run with (will be reported as findings)
- 3 NON-gaps (data sufficient as-is)
- Recommendation: NO new synthetic data needed; only normalization
of what already exists, contingent on J approval of A-D
Up-front commitments:
- Distillation v1.0.0 substrate untouched (verified by audit-full
running clean before+after each staffing change)
- All synthetic-data modifications via deterministic scripts under
scripts/staffing/, never hand-edit
- Every staffing artifact carries canonical sha256 provenance back
to source parquet/scenario/lesson
- _safe views are the source of truth for LLM-facing text; raw
parquets never directly fed into corpus builds
Phase 1 unblocks AFTER J reviews both docs and approves audit scope
+ the 4 gap-fix decisions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Staffing Synthetic Data — Gap Report
Date: 2026-04-27
Status: read-only inventory; no data generated
Spec: J's "Lakehouse Staffing Integration" prompt
Companion: docs/recon/staffing-lakehouse-distillation-recon.md
This is the up-front gap report the spec mandates BEFORE any audit runner is built or any synthetic data is generated. It enumerates every staffing parquet on disk, tallies fields, flags PII status, and reports whether the data is fit for the audit it's meant to validate.
The headline finding: the synthetic data is broad but inconsistent. Three distinct worker schemas exist across six files; PII is raw (not masked); audit usefulness is high for some streams (workers_500k, scenarios) and low for others (sparse_workers, new_candidates). No new data should be generated until the inconsistencies are resolved or explicitly accepted as test fixtures.
1. Record counts + entity types
| Stream | Path | Rows | Entity | Notes |
|---|---|---|---|---|
| candidates | data/datasets/candidates.parquet | 1,000 | candidate | recruiter-side ATS-style records |
| job_orders | data/datasets/job_orders.parquet | 15,000 | job_order | client-side req records |
| workers_500k | data/datasets/workers_500k.parquet | 500,000 | worker | full population with scores + resume + comms |
| workers_100k | data/datasets/workers_100k.parquet | 100,000 | worker | scaled-down sibling |
| ethereal_workers | data/datasets/ethereal_workers.parquet | 10,000 | worker | scenario-friendly subset |
| client_workersi | data/datasets/client_workersi.parquet | 160 | worker | client "approved roster" view, simpler shape |
| client_workerskjkk | data/datasets/client_workerskjkk.parquet | 160 | worker | typo-named sibling of above |
| sparse_workers | data/datasets/sparse_workers.parquet | 200 | worker (sparse) | edge-case fixture |
| new_candidates | data/datasets/new_candidates.parquet | 3 | candidate | demo / smoke-test data |
| scenarios | tests/multi-agent/scenarios/*.json | 44 files | scenario | per-day client fill plans |
| lessons | data/_playbook_lessons/*.json | 64 files | lesson | post-run retrospectives |
Worker-shape total on disk: ~610k rows across 5 files (sparse_workers' 200 rows are a different shape). Candidate-shape: ~1k.
2. Schema-by-schema field inventory
candidates.parquet (1,000 rows)
candidate_id (string, "CAND-NNNNN") — present
first_name (string) — present, raw PII
last_name (string) — present, raw PII
email (string) — present, raw PII
phone (string, formatted "(NNN) NNN-NNNN") — present, raw PII
city, state — present
skills (string, CSV) — present
years_experience (int) — present
hourly_rate_usd (int) — present, financial
status (string) — present (sample: "placed"; full enum unknown)
Missing fields a real ATS would have: created_at, last_contact, recruiter_id, source (referral/website/cold), placement_count, blacklisted_clients. None of these block the audit but they limit what staffing-PRD-drift can verify.
job_orders.parquet (15,000 rows)
job_order_id (string, "JO-NNNNNN") — present
client_id (string, "CLI-NNNNN") — present
title (string) — present
vertical (string) — present
bill_rate, pay_rate (float) — present, financial
status (string) — present (sample: "closed")
city, state, zip — present
description (string) — present, generated text
Missing fields: created_at, target_count, filled_count, start_date, end_date, requirements (skills array). The description field embeds these informally ("Requires: ...", "6+ years exp", "$34.97/hr"). Parsing them into structured fields is what the audit needs to verify.
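A sketch of the kind of parser that verification would exercise, assuming the sampled conventions ("Requires: ...", "N+ years exp", "$NN.NN/hr") hold across the corpus; the field names and regexes here are illustrative, not the audit's actual code:

```python
import re

def parse_description(desc: str) -> dict:
    """Extract the informally embedded job fields from a description string.

    Patterns are guesses at the generator's conventions; the real audit
    must confirm them against the actual job_orders corpus.
    """
    out = {}
    m = re.search(r"Requires:\s*([^.]+)", desc)
    if m:
        out["required_skills"] = [s.strip() for s in m.group(1).split(",")]
    m = re.search(r"(\d+)\+\s*years?\s*exp", desc)
    if m:
        out["min_years_experience"] = int(m.group(1))
    m = re.search(r"\$(\d+(?:\.\d{2})?)/hr", desc)
    if m:
        out["rate_usd_per_hr"] = float(m.group(1))
    return out
```

Anything the regexes miss simply stays unparsed, which is itself a useful audit signal (how much of the description text is recoverable into structured fields).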
workers_500k.parquet / workers_100k / ethereal_workers (same schema)
worker_id (int, sequential) — present
name (string) — present, raw PII
role (string) — present
email (string) — present, raw PII
phone (int, no formatting) — present, raw PII (also wrong type — should be string; the int cast loses leading zeros)
city, state, zip — present
skills (string, CSV in single column) — present
certifications (string, CSV) — present
archetype (string, enum, sample: "flexible") — present, full enum unknown
reliability, responsiveness, engagement, compliance, availability (float 0-1) — present
communications (string, multi-msg with " | " separator) — present
resume_text (string) — present
Missing: created_at, last_active, geo_radius_mi, certifications_expiry. The 5 personality scores are the matchmaking signal.
client_workersi / client_workerskjkk (160 rows each, simpler shape)
worker_id, name, role, city, state, email, phone, skills, certifications, availability, reliability, archetype
Six fields fewer than workers_500k: missing responsiveness, engagement, compliance, communications, resume_text, zip. Plus phone is stored here as string vs int in workers_500k.
sparse_workers.parquet (200 rows, completely different shape)
name, phone, role, city, state, notes
No worker_id, no scores, no email, no skills/certifications/archetype. This is a recruiter-shorthand fixture — useful for testing "missing-fields graceful degradation" but NOT a staffing source.
new_candidates.parquet (3 rows, candidate-shape)
name, phone, email, city, state, skills, years
Missing the candidate_id that exists in candidates.parquet. Tiny + smoke-test only.
3. PII / tokenization status
| Stream | PII fields | Masked? | Risk if LLM sees this |
|---|---|---|---|
| candidates | first_name, last_name, email, phone | ❌ raw | Names are real-shape; emails are firstname.lastnameN@example.com (clearly fake); phones are realistic-looking — could fool a model into citing them as real |
| workers_500k | name, email, phone | ❌ raw | Same risk — but at 500k scale, retrieval-time exposure is the more relevant concern |
| client_workers* | name, email, phone | ❌ raw | Same |
| sparse_workers | name, phone | ❌ raw | Same |
| new_candidates | name, email, phone | ❌ raw | Same |
| job_orders | (none — client_id is opaque) | n/a | low risk; description text doesn't leak PII |
| scenarios | (worker names sometimes appear in lesson text) | ❌ inline | "Susan X. Ruiz double-booked" — verbatim names in lesson markdown |
| lessons | worker names embedded in lesson field | ❌ inline | same |
Critical: crates/shared/src/pii.rs::detect_sensitivity recognizes email, phone, ssn as PII. catalogd::service.rs:264 carries column_redactions: HashMap<String, Redaction>. But enforcement at query time is unverified. Whether retrieval through staffing_inference_lakehouse mode actually applies the mask — and whether the workers_500k_v8 vector corpus was built with masked text or raw — is the staffing audit's first deterministic check.
The synthetic email convention (first.lastN@example.com) is fake-recognizable to humans but a model trained to extract emails would still extract them as if real. Until either (a) the catalog masks them at query time or (b) a _safe view replaces PII with hashed tokens before vectorization, the LLM has plausibly been seeing PII for every staffing query.
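A minimal sketch of option (b), hash-tokenizing PII before vectorization. The salt, token format, and regexes are assumptions for illustration, not the pii.rs rules:

```python
import hashlib
import re

# Hypothetical salt; a real build would pin this in config so the same
# raw value always maps to the same token across corpus rebuilds.
SALT = "staffing-safe-v1"

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")

def hash_token(value: str, kind: str) -> str:
    # Deterministic: equal inputs yield equal tokens, so joins still work.
    digest = hashlib.sha256(f"{SALT}:{kind}:{value}".encode()).hexdigest()[:12]
    return f"<{kind}:{digest}>"

def mask_text(text: str) -> str:
    """Replace raw emails/phones with hashed tokens before vectorization."""
    text = EMAIL_RE.sub(lambda m: hash_token(m.group(0), "email"), text)
    text = PHONE_RE.sub(lambda m: hash_token(m.group(0), "phone"), text)
    return text
```

Because the tokens are deterministic, two documents mentioning the same worker still co-retrieve, which is the property a _safe view needs to preserve.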
4. Search usefulness (as a corpus)
| Stream | Searchable | Rich enough for retrieval | Notes |
|---|---|---|---|
| workers_500k | ✓ | High | resume_text + comms = good RAG. archetype + 5 scores = good filtering signal |
| ethereal_workers | ✓ | High | same shape as 500k, smaller test slice |
| candidates | ✓ | Medium | skills as CSV string (not array — tokenize before search). No resume text |
| job_orders | ✓ | Medium | description carries requirements informally. No structured required_skills array |
| client_workers* | ✓ | Low | no resume, no scores beyond reliability/availability |
| sparse_workers | minimal | Low | useful for "graceful degradation" tests only |
| new_candidates | n/a | Trivial | 3 rows |
workers_500k_v8 vector corpus exists — it's the staffing-mode-runner's matrix corpus. Whether its content was sourced from the masked catalog view or raw parquet is the build-time question for the audit.
5. Audit usefulness
| Stream | Audit value |
|---|---|
| scenarios | High — 44 fully-specified fill plans with timestamps, roles, counts, geo. Deterministic acceptance fixture material |
| lessons | High — 64 retrospectives with events_total/events_ok ratios. The closest thing to a fill-success ledger |
| outcomes.jsonl | High — already consumed by Phase 2 distillation transforms |
| candidates | Medium — status field is the verdict but enum is implicit |
| job_orders | Medium — status: closed count vs target_count (missing field) is the obvious metric, blocked by schema gap |
| workers_500k | Medium — archetype + scores enable per-worker reliability checks but no "did this worker get filled" signal lives here |
| client_workers* | Low — no temporal or status fields |
| sparse_workers | Low — fixture data |
| new_candidates | None — too few rows |
6. Concrete gap list (what's missing)
Blocking gaps (must fix or accept before audit ships)
- No structured fill-event log. Scenarios + lessons describe fills retrospectively but no row-per-event ledger exists. The audit's "candidate/job matching integrity" check needs this. Decision needed: generate a synthetic fill_events.parquet from the 44 scenarios + 64 lessons via deterministic script, OR scope the audit to "best-effort post-hoc reconstruction". Recommend the former — same scenarios + lessons unmodified, just normalized into a queryable shape.
- PII masking enforcement unverified. Cannot ship a staffing audit that claims "PII boundaries respected" until we can prove the LLM-facing path masks. Decision needed: add `views/candidates_safe.sql` and `views/workers_safe.sql` (hash-masked) and rebuild `workers_500k_v9` from the safe view. OR: add a runtime check that asserts the LLM's prompt never contains PII regex matches. Recommend both — view at corpus-build time, runtime check as defense-in-depth.
- `client_workerskjkk.parquet` typo file. Obviously not authoritative; either delete or rename. Decision: remove from canonical list; add a startup gate that errors on unrecognized parquet names in `data/datasets/`.
- `workers_500k.phone` is `int`, should be `string`. Leading-zero loss is a real bug. Affects email/phone joins. Decision: fixup script + new schema version, OR document and accept (test data only).
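For the phone-type gap, the fixup script's core could be as small as this. It assumes every synthetic phone was originally a fixed 10-digit string, which the real script must verify against the data generator before running:

```python
def phone_int_to_string(phone: int, width: int = 10) -> str:
    """Zero-pad an int-typed phone back to a fixed-width digit string.

    ASSUMPTION: all synthetic phones were originally `width` digits; any
    value wider than that indicates the assumption is wrong, so fail loudly.
    """
    if phone < 0:
        raise ValueError("phone must be non-negative")
    digits = str(phone)
    if len(digits) > width:
        raise ValueError(f"{phone} is wider than {width} digits")
    return digits.zfill(width)
```

Failing loudly on over-width values keeps the script deterministic: it either converts the whole column or reports exactly which rows violate the width assumption.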
Soft gaps (audit can run; results will reflect the gap)
- Missing `created_at`/`last_active` timestamps on every entity — staffing recency rules can't fire.
- No `target_count`/`filled_count` on job_orders — fill-rate metric requires parsing description.
- `candidates.status` enum undocumented — can audit count distribution but can't claim "all expected statuses present".
- `archetype` enum undocumented — same.
- No worker→candidate join key. They're plausibly the SAME population in different shapes; the audit will assume distinct unless documented otherwise.
Non-gaps (sufficient as-is)
- 500k workers is plenty for retrieval-quality testing.
- 44 scenarios + 64 lessons is enough for staffing_answers RAG corpus building.
- PII detection rules in `pii.rs` are sufficient — the gap is enforcement, not classification.
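What enforcement (as opposed to classification) could look like as a runtime defense-in-depth check. The regexes below only mirror the classes pii.rs is reported to detect (email, phone, ssn) and are not the crate's actual rules:

```python
import re

# Illustrative patterns, NOT the pii.rs implementation.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def assert_no_pii(prompt: str) -> None:
    """Fail fast if raw PII reaches an LLM-facing prompt."""
    hits = {kind: pat.findall(prompt)
            for kind, pat in PII_PATTERNS.items() if pat.search(prompt)}
    if hits:
        raise AssertionError(f"raw PII reached the prompt: {hits}")
```

Run at prompt-assembly time, this catches leaks even when a corpus was accidentally built from raw parquet, which is exactly the failure mode the masking view cannot guard against on its own.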
7. Whether more synthetic data is needed
Short answer: no, not for the initial staffing audit.
The existing data is enough to:
- Run schema validity checks (Phase 1 of staffing audit)
- Audit PII enforcement (Phase 2)
- Build a staffing_answers RAG corpus from scenarios + lessons (Phase 3)
- Run replay against synthetic FillRequest payloads (Phase 4 — uses Phase 7 distillation infra)
- Detect PRD drift between docs/PRD.md §32 claims and the actual code (Phase 5)
The data is NOT enough to:
- Validate end-to-end fill rates without synthesizing a fill_events ledger from scenarios + lessons (gap #1 above)
- Test the "system gets smarter over time" Phase 19 claim — would need a longitudinal replay sweep, which is post-audit work
Recommended decision tree (J to confirm):
A. Generate fill_events.parquet (deterministic script over scenarios + lessons)?
YES → adds 44 × ~5 rows = ~220 events; audit can run candidate/job matching integrity
NO → audit reports "blocked: no fill-event ledger" and exits with that finding
B. Build views/{candidates,workers,jobs}_safe with PII hash-masked?
YES → corpus rebuilds from safe views; audit can prove PII boundary respected
NO → audit reports "blocked: cannot prove PII masking; LLM may have seen PII"
C. Delete client_workerskjkk.parquet typo file?
YES → cleaner inventory; reduces audit surface
NO → audit flags as anomaly
D. Fix workers_500k.phone type (int → string)?
YES → join keys work
NO → audit reports as known data quality issue
If J approves A + B + C + D, no genuinely new synthetic data needed — only normalization of what already exists.
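If decision A is approved, the deterministic script could flatten scenarios roughly like this. Every field name here (scenario_id, fills, role, worker_id, ts) is hypothetical until mapped onto the real scenario JSON schema:

```python
import hashlib
import json

def fill_events_from_scenario(scenario: dict) -> list:
    """Flatten one scenario's fill plan into row-per-event records.

    ASSUMED SCHEMA: scenario = {"scenario_id": ..., "fills": [...]}; the
    real script must map whatever the actual scenario files contain.
    """
    events = []
    for i, fill in enumerate(scenario.get("fills", [])):
        events.append({
            "event_id": f"{scenario['scenario_id']}-{i:03d}",
            "scenario_id": scenario["scenario_id"],
            "role": fill.get("role"),
            "worker_id": fill.get("worker_id"),
            "ts": fill.get("ts"),
            # Canonical provenance back to the exact source record.
            "source_sha256": hashlib.sha256(
                json.dumps(fill, sort_keys=True).encode()
            ).hexdigest(),
        })
    return events
```

Because the source JSON is unmodified and the hashing is canonical (sorted keys), re-running the script always reproduces the same ledger, which is what makes the fill_events.parquet auditable rather than new synthetic data.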
8. Up-front commitments before code
- The staffing audit, when it ships, will NOT modify the distillation v1.0.0 substrate. Verified by `audit-full` running clean before+after.
- Synthetic data modifications (gap #1 fill_events generation, gap #2 safe views, gap #3 typo deletion, gap #4 phone fixup) happen via deterministic scripts under `scripts/staffing/`, never by hand-edit.
- Every new staffing-side artifact (RAG corpus, audit report, fill_events ledger) carries provenance back to its source parquet/scenario/lesson via canonical sha256 — same pattern as distillation Phase 1.
- PII handling: the `_safe` views are the source of truth for any LLM-facing text; raw parquets stay on disk but are never the corpus build input.
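The sha256 provenance commitment could be implemented along these lines; the record shape is a sketch, with the distillation Phase 1 pattern remaining the authoritative format:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream a file through sha256; the canonical provenance hash."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def provenance_record(artifact: Path, sources: list) -> dict:
    """Provenance entry tying one staffing artifact to its source files.

    The record shape is illustrative; only the sha256-of-source principle
    is taken from the commitments above.
    """
    return {
        "artifact": str(artifact),
        "artifact_sha256": sha256_file(artifact),
        "sources": [{"path": str(s), "sha256": sha256_file(s)} for s in sources],
    }
```

Streaming in 1 MiB chunks keeps the hash cheap even for the 500k-row parquet, and the record can be emitted as a sidecar JSON next to each artifact.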
9. Phase 1 readiness checklist
- Recon doc exists (`docs/recon/staffing-lakehouse-distillation-recon.md`)
- Gap report exists (this file)
- J approves the 4 gap-fix decisions (A/B/C/D in §7)
- J approves the audit scope (which checks ship in v1)
Implementation begins only after J's review of both docs.