lakehouse/reports/staffing/synthetic-data-gap-report.md
staffing: recon + synthetic-data gap report (Phase 0, no implementation)
Spec mandates these two docs before any staffing audit runner ships:
  docs/recon/staffing-lakehouse-distillation-recon.md
  reports/staffing/synthetic-data-gap-report.md

NO distillation core touched. Distillation v1.0.0 (commit e7636f2,
tag distillation-v1.0.0) remains the stable substrate. Staffing
work is consumer-only.

Recon findings (12 sections, ~5KB):
  - Existing staffing schemas in crates/validator/staffing/* are scaffolds
    (FillValidator schema-shape only; worker-existence/status/geo TODOs)
  - Synthetic data spans 6+ shapes across 9 parquet files
    (~610k worker-shape rows + ~1k candidate-shape rows)
  - PII detection lives in shared/pii.rs but enforcement at query
    time is unverified — the LLM may have been seeing raw PII via
    workers_500k_v8 vector corpus
  - 44 scenarios + 64 playbook_lessons = ~108 RAG candidates
  - No structured fill-event log exists; scenarios+lessons are
    retrospective, not queryable per-event records
  - workers_500k.phone is int (should be string — leading-zero loss)
  - client_workerskjkk.parquet is a typo file (160 rows, sibling of
    client_workersi.parquet)
  - PRD §158 claims Phase 19 closed playbook write-only gap — unverified

Gap report findings (9 sections, ~6KB):
  - 4 BLOCKING gaps requiring J decisions before audit ships:
    A. Generate fill_events.parquet from scenarios + lessons?
    B. Build views/{candidates,workers,jobs}_safe with PII masking?
    C. Delete client_workerskjkk.parquet typo file?
    D. Fix workers_500k.phone type (int → string)?
  - 5 SOFT gaps the audit can run with (will be reported as findings)
  - 3 NON-gaps (data sufficient as-is)
  - Recommendation: NO new synthetic data needed; only normalization
    of what already exists, contingent on J approval of A-D

Up-front commitments:
  - Distillation v1.0.0 substrate untouched (verified by audit-full
    running clean before+after each staffing change)
  - All synthetic-data modifications via deterministic scripts under
    scripts/staffing/, never hand-edit
  - Every staffing artifact carries canonical sha256 provenance back
    to source parquet/scenario/lesson
  - _safe views are the source of truth for LLM-facing text; raw
    parquets never directly fed into corpus builds

Phase 1 unblocks AFTER J reviews both docs and approves audit scope
+ the 4 gap-fix decisions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 00:02:47 -05:00


# Staffing Synthetic Data — Gap Report

- Date: 2026-04-27
- Status: read-only inventory; no data generated
- Spec: J's "Lakehouse Staffing Integration" prompt
- Companion: docs/recon/staffing-lakehouse-distillation-recon.md

This is the up-front gap report the spec mandates BEFORE any audit runner is built or any synthetic data is generated. It enumerates every staffing parquet on disk, tallies fields, flags PII status, and reports whether the data is fit for the audit it's meant to validate.

The headline finding: the synthetic data is broad but inconsistent. Three distinct worker schemas exist across six files; PII is raw (not masked); audit usefulness is high for some streams (workers_500k, scenarios) and low for others (sparse_workers, new_candidates). No new data should be generated until the inconsistencies are resolved or explicitly accepted as test fixtures.


## 1. Record counts + entity types

| Stream | Path | Rows | Entity | Notes |
|---|---|---|---|---|
| candidates | data/datasets/candidates.parquet | 1,000 | candidate | recruiter-side ATS-style records |
| job_orders | data/datasets/job_orders.parquet | 15,000 | job_order | client-side req records |
| workers_500k | data/datasets/workers_500k.parquet | 500,000 | worker | full population with scores + resume + comms |
| workers_100k | data/datasets/workers_100k.parquet | 100,000 | worker | scaled-down sibling |
| ethereal_workers | data/datasets/ethereal_workers.parquet | 10,000 | worker | scenario-friendly subset |
| client_workersi | data/datasets/client_workersi.parquet | 160 | worker | client "approved roster" view, simpler shape |
| client_workerskjkk | data/datasets/client_workerskjkk.parquet | 160 | worker | typo-named sibling of above |
| sparse_workers | data/datasets/sparse_workers.parquet | 200 | worker (sparse) | edge-case fixture |
| new_candidates | data/datasets/new_candidates.parquet | 3 | candidate | demo / smoke-test data |
| scenarios | tests/multi-agent/scenarios/*.json | 44 files | scenario | per-day client fill plans |
| lessons | data/_playbook_lessons/*.json | 64 files | lesson | post-run retrospectives |

Worker-shape total on disk: ~610k rows across 6 files. Candidate-shape: ~1k.


## 2. Schema-by-schema field inventory

### candidates.parquet (1,000 rows)

- candidate_id (string, "CAND-NNNNN") — present
- first_name (string) — present, raw PII
- last_name (string) — present, raw PII
- email (string) — present, raw PII
- phone (string, formatted "(NNN) NNN-NNNN") — present, raw PII
- city, state — present
- skills (string, CSV) — present
- years_experience (int) — present
- hourly_rate_usd (int) — present, financial
- status (string) — present (sample: "placed"; full enum unknown)

Missing fields a real ATS would have: created_at, last_contact, recruiter_id, source (referral/website/cold), placement_count, blacklisted_clients. None of these block the audit but they limit what staffing-PRD-drift can verify.

### job_orders.parquet (15,000 rows)

- job_order_id (string, "JO-NNNNNN") — present
- client_id (string, "CLI-NNNNN") — present
- title (string) — present
- vertical (string) — present
- bill_rate, pay_rate (float) — present, financial
- status (string) — present (sample: "closed")
- city, state, zip — present
- description (string) — present, generated text

Missing fields: created_at, target_count, filled_count, start_date, end_date, requirements (skills array). The description field embeds these informally ("Requires: ...", "6+ years exp", "$34.97/hr"). Parsing them into structured fields is what the audit needs to verify.
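
Since those structured fields exist only informally inside description, any fill-rate or requirements check has to recover them from free text first. Below is a minimal parser sketch, assuming the quoted conventions ("Requires: ...", "6+ years exp", "$NN.NN/hr") hold; the output field names are hypothetical, not part of any existing schema:

```python
import re

def parse_description(desc: str) -> dict:
    """Best-effort recovery of structured fields from a job_orders.description.

    Assumes the informal conventions seen in samples. Anything not found is
    left as None so the audit can count recovery failures per row.
    """
    out = {"required_skills": None, "min_years_exp": None, "rate_usd_hr": None}
    m = re.search(r"Requires:\s*([^.]+)", desc)
    if m:
        out["required_skills"] = [s.strip() for s in m.group(1).split(",")]
    m = re.search(r"(\d+)\+\s*years?\s*exp", desc)
    if m:
        out["min_years_exp"] = int(m.group(1))
    m = re.search(r"\$(\d+(?:\.\d{2})?)/hr", desc)
    if m:
        out["rate_usd_hr"] = float(m.group(1))
    return out
```

The None-on-miss behavior matters: the audit's metric would be "what fraction of 15,000 descriptions parse cleanly", not a hard failure on the first malformed row.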

### workers_500k.parquet / workers_100k / ethereal_workers (same schema)

- worker_id (int, sequential) — present
- name (string) — present, raw PII
- role (string) — present
- email (string) — present, raw PII
- phone (int, no formatting) — present, raw PII (also wrong type — should be string; the int representation drops leading zeros)
- city, state, zip — present
- skills (string, CSV in single column) — present
- certifications (string, CSV) — present
- archetype (string, enum, sample: "flexible") — present, full enum unknown
- reliability, responsiveness, engagement, compliance, availability (float 0-1) — present
- communications (string, multi-msg with " | " separator) — present
- resume_text (string) — present

Missing: created_at, last_active, geo_radius_mi, certifications_expiry. The 5 personality scores are the matchmaking signal.

### client_workersi / client_workerskjkk (160 rows each, simpler shape)

worker_id, name, role, city, state, email, phone, skills, certifications, availability, reliability, archetype

Six fields fewer than workers_500k: responsiveness, engagement, compliance, communications, resume_text, and zip are missing. Plus phone is here a string vs int in workers_500k.

### sparse_workers.parquet (200 rows, completely different shape)

name, phone, role, city, state, notes

No worker_id, no scores, no email, no skills/certifications/archetype. This is a recruiter-shorthand fixture — useful for testing "missing-fields graceful degradation" but NOT a staffing source.

### new_candidates.parquet (3 rows, candidate-shape)

name, phone, email, city, state, skills, years

Missing the candidate_id that exists in candidates.parquet. Tiny + smoke-test only.


## 3. PII / tokenization status

| Stream | PII fields | Masked? | Risk if LLM sees this |
|---|---|---|---|
| candidates | first_name, last_name, email, phone | raw | Names are real-shape; emails are firstname.lastnameN@example.com (clearly fake); phones are realistic-looking — could fool a model into citing them as real |
| workers_500k | name, email, phone | raw | Same risk — but at 500k scale, retrieval-time exposure is the more relevant concern |
| client_workers* | name, email, phone | raw | Same |
| sparse_workers | name, phone | raw | Same |
| new_candidates | name, email, phone | raw | Same |
| job_orders | (none — client_id is opaque) | n/a | low risk; description text doesn't leak PII |
| scenarios | worker names sometimes appear in lesson text | inline | "Susan X. Ruiz double-booked" — verbatim names in lesson markdown |
| lessons | worker names embedded in lesson field | inline | same |

Critical: crates/shared/src/pii.rs::detect_sensitivity recognizes email, phone, ssn as PII. catalogd::service.rs:264 carries column_redactions: HashMap<String, Redaction>. But enforcement at query time is unverified. Whether retrieval through staffing_inference_lakehouse mode actually applies the mask — and whether the workers_500k_v8 vector corpus was built with masked text or raw — is the staffing audit's first deterministic check.

The synthetic email convention (first.lastN@example.com) is fake-recognizable to humans but a model trained to extract emails would still extract them as if real. Until either (a) the catalog masks them at query time or (b) a _safe view replaces PII with hashed tokens before vectorization, the LLM has plausibly been seeing PII for every staffing query.
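
What a _safe view would do to each row can be sketched in a few lines. This is an illustrative stand-alone transform, not the actual view definition; the salt value, token format, and function names are assumptions:

```python
import hashlib

def mask(value: str, field: str, salt: str = "staffing-v1") -> str:
    """Replace a raw PII value with a stable, non-reversible token.

    The same (field, value) pair always maps to the same token, so joins and
    dedup still work across _safe views, but raw text never reaches the
    corpus build. Salt choice and rotation policy are open decisions.
    """
    digest = hashlib.sha256(f"{salt}:{field}:{value}".encode()).hexdigest()
    return f"{field.upper()}_{digest[:12]}"

def mask_worker(row: dict) -> dict:
    """Hypothetical per-row transform for a workers_safe view."""
    safe = dict(row)
    for field in ("name", "email", "phone"):
        if safe.get(field) is not None:
            safe[field] = mask(str(safe[field]), field)
    return safe
```

Because the token is deterministic per (field, value), it doubles as a join/dedup key across the _safe views, which is what keeps option B from breaking downstream matching.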


## 4. Search usefulness (as a corpus)

| Stream | Searchable / rich enough for retrieval? | Notes |
|---|---|---|
| workers_500k | High | resume_text + comms = good RAG. archetype + 5 scores = good filtering signal |
| ethereal_workers | High | same shape as 500k, smaller test slice |
| candidates | Medium | skills as CSV string (not array — tokenize before search). No resume text |
| job_orders | Medium | description carries requirements informally. No structured required_skills array |
| client_workers* | Low | no resume, no scores beyond reliability/availability |
| sparse_workers | minimal / Low | useful for "graceful degradation" tests only |
| new_candidates | n/a | Trivial (3 rows) |

workers_500k_v8 vector corpus exists — it's the staffing-mode-runner's matrix corpus. Whether its content was sourced from the masked catalog view or raw parquet is the build-time question for the audit.


## 5. Audit usefulness

| Stream | Audit value |
|---|---|
| scenarios | High — 44 fully-specified fill plans with timestamps, roles, counts, geo. Deterministic acceptance fixture material |
| lessons | High — 64 retrospectives with events_total/events_ok ratios. The closest thing to a fill-success ledger |
| outcomes.jsonl | High — already consumed by Phase 2 distillation transforms |
| candidates | Medium — status field is the verdict but enum is implicit |
| job_orders | Medium — status: closed count vs target_count (missing field) is the obvious metric, blocked by schema gap |
| workers_500k | Medium — archetype + scores enable per-worker reliability checks but no "did this worker get filled" signal lives here |
| client_workers* | Low — no temporal or status fields |
| sparse_workers | Low — fixture data |
| new_candidates | None — too few rows |

## 6. Concrete gap list (what's missing)

### Blocking gaps (must fix or accept before audit ships)

  1. No structured fill-event log. Scenarios + lessons describe fills retrospectively but no row-per-event ledger exists. The audit's "candidate/job matching integrity" check needs this. Decision needed: generate a synthetic fill_events.parquet from the 44 scenarios + 64 lessons via deterministic script, OR scope the audit to "best-effort post-hoc reconstruction". Recommend the former — same scenarios + lessons unmodified, just normalized into a queryable shape.

  2. PII masking enforcement unverified. Cannot ship a staffing audit that claims "PII boundaries respected" until we can prove the LLM-facing path masks. Decision needed: add views/candidates_safe.sql, views/workers_safe.sql (hash-masked) and rebuild workers_500k_v9 from the safe view. OR: add a runtime check that asserts the LLM's prompt never contains PII regex matches. Recommend both — view at corpus-build time, runtime check as defense-in-depth.

  3. client_workerskjkk.parquet typo file. Obviously not authoritative; either delete or rename. Decision: remove from canonical list; add a startup gate that errors on unrecognized parquet names in data/datasets/.

  4. workers_500k.phone is int, should be string. Leading-zero loss is a real bug. Affects email/phone joins. Decision: fixup script + new schema version, OR document and accept (test data only).
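
Gap 4's fixup is mechanical; here is a sketch of the normalization, assuming fixed-width 10-digit numbers (an assumption the real script would have to verify before padding, since over-long values mean the width guess is wrong):

```python
def fix_phone(phone_int: int, width: int = 10) -> str:
    """Restore an int-typed phone column value to a zero-padded string.

    Storing phones as ints silently drops leading zeros: 0125550199
    round-trips as 125550199. Left-padding back to a fixed width is only
    safe if every number really had `width` digits, so reject values that
    are too long rather than guess.
    """
    s = str(phone_int)
    if len(s) > width:
        raise ValueError(f"phone {phone_int} longer than {width} digits")
    return s.zfill(width)
```

The raise-on-too-long behavior is the point: a fixup script should surface rows that contradict its assumptions as findings, not paper over them.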

### Soft gaps (audit can run; results will reflect the gap)

  1. Missing created_at / last_active timestamps on every entity — staffing recency rules can't fire.
  2. No target_count / filled_count on job_orders — fill-rate metric requires parsing description.
  3. candidates.status enum undocumented — can audit count distribution but can't claim "all expected statuses present".
  4. archetype enum undocumented — same.
  5. No worker→candidate join key. They're plausibly the SAME population in different shapes; the audit will assume distinct unless documented otherwise.

### Non-gaps (sufficient as-is)

  1. 500k workers is plenty for retrieval-quality testing.
  2. 44 scenarios + 64 lessons is enough for staffing_answers RAG corpus building.
  3. PII detection rules in pii.rs are sufficient — the gap is enforcement, not classification.
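
The enforcement check that gap 2 recommends as defense-in-depth can be a small regex gate over the final prompt text. A sketch with illustrative patterns only; the authoritative classification lives in crates/shared/src/pii.rs and is not reproduced here:

```python
import re

# Illustrative patterns, loosely modeled on the categories pii.rs is said
# to recognize (email, phone, ssn). Not the actual pii.rs rules.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}|\b\d{10}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def assert_no_pii(prompt: str) -> None:
    """Raise before the prompt leaves the process if any PII pattern matches."""
    hits = [kind for kind, pat in PII_PATTERNS.items() if pat.search(prompt)]
    if hits:
        raise ValueError(f"PII leaked into LLM prompt: {hits}")
```

Run at prompt-assembly time, this is the runtime half of the recommended belt-and-suspenders; the _safe views remain the primary control at corpus-build time.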

## 7. Whether more synthetic data is needed

Short answer: no, not for the initial staffing audit.

The existing data is enough to:

- Run schema validity checks (Phase 1 of staffing audit)
- Audit PII enforcement (Phase 2)
- Build a staffing_answers RAG corpus from scenarios + lessons (Phase 3)
- Run replay against synthetic FillRequest payloads (Phase 4 — uses Phase 7 distillation infra)
- Detect PRD drift between docs/PRD.md §32 claims and the actual code (Phase 5)

The data is NOT enough to:

- Validate end-to-end fill rates without synthesizing a fill_events ledger from scenarios + lessons (gap #1 above)
- Test the "system gets smarter over time" Phase 19 claim — would need a longitudinal replay sweep, which is post-audit work

Recommended decision tree (J to confirm):

A. Generate fill_events.parquet (deterministic script over scenarios + lessons)?
   - YES → adds 44 × ~5 rows = ~220 events; audit can run candidate/job matching integrity
   - NO → audit reports "blocked: no fill-event ledger" and exits with that finding

B. Build views/{candidates,workers,jobs}_safe with PII hash-masked?
   - YES → corpus rebuilds from safe views; audit can prove PII boundary respected
   - NO → audit reports "blocked: cannot prove PII masking; LLM may have seen PII"

C. Delete client_workerskjkk.parquet typo file?
   - YES → cleaner inventory; reduces audit surface
   - NO → audit flags as anomaly

D. Fix workers_500k.phone type (int → string)?
   - YES → join keys work
   - NO → audit reports as known data quality issue

If J approves A + B + C + D, no genuinely new synthetic data needed — only normalization of what already exists.
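
Option A's deterministic script is essentially a flatten over the scenario JSON with provenance attached. A sketch under assumed field names (scenario_id, day, fills, role, count) — the real scenario schema may differ, and the function name is hypothetical:

```python
import hashlib
import json

def fill_events_from_scenario(scenario: dict) -> list:
    """Flatten one scenario's fill plan into row-per-event records.

    The field names here are assumptions about the scenario JSON shape,
    not verified against the real files. Each row carries sha256
    provenance back to its source scenario, per the up-front commitments.
    """
    src = hashlib.sha256(
        json.dumps(scenario, sort_keys=True).encode()
    ).hexdigest()
    rows = []
    for i, fill in enumerate(scenario.get("fills", [])):
        rows.append({
            "event_id": f"{scenario['scenario_id']}-{i:04d}",
            "day": scenario.get("day"),
            "role": fill.get("role"),
            "count": fill.get("count", 1),
            "source_sha256": src,
        })
    return rows
```

Canonicalizing with sort_keys=True before hashing is what makes the script deterministic across runs: same scenario file in, same event_ids and provenance out.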


## 8. Up-front commitments before code

  1. The staffing audit, when it ships, will NOT modify the distillation v1.0.0 substrate. Verified by audit-full running clean before+after.
  2. Synthetic data modifications (gap #1 fill_events generation, gap #2 safe views, gap #3 typo deletion, gap #4 phone fixup) happen via deterministic scripts under scripts/staffing/, never by hand-edit.
  3. Every new staffing-side artifact (RAG corpus, audit report, fill_events ledger) carries provenance back to its source parquet/scenario/lesson via canonical sha256 — same pattern as distillation Phase 1.
  4. PII handling: the _safe views are the source of truth for any LLM-facing text; raw parquets stay on disk but are never the corpus build input.
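
Commitment 3's provenance stamp can be sketched as follows; the returned metadata shape is illustrative, not the actual distillation Phase 1 format:

```python
import hashlib
from pathlib import Path

def provenance(path: str) -> dict:
    """Canonical sha256 of a source artifact, streamed in 1 MiB chunks so
    the 500k-row parquets don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return {"source": Path(path).name, "sha256": h.hexdigest()}
```

Every derived artifact (corpus shard, fill_events row batch, audit report) would embed one of these records per input, so any finding can be traced to the exact bytes it was computed from.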

## 9. Phase 1 readiness checklist

- Recon doc exists (docs/recon/staffing-lakehouse-distillation-recon.md)
- Gap report exists (this file)
- J approves the 4 gap-fix decisions (A/B/C/D in §7)
- J approves the audit scope (which checks ship in v1)

Implementation begins only after J's review of both docs.