# Staffing Synthetic Data — Gap Report
**Date:** 2026-04-27
**Status:** read-only inventory; no data generated
**Spec:** J's "Lakehouse Staffing Integration" prompt
**Companion:** `docs/recon/staffing-lakehouse-distillation-recon.md`
This is the up-front gap report the spec mandates BEFORE any audit runner is built or any synthetic data is generated. It enumerates every staffing parquet on disk, tallies fields, flags PII status, and reports whether the data is **fit for the audit it's meant to validate**.
The headline finding: **the synthetic data is broad but inconsistent**. Three distinct worker schemas exist across six files; PII is raw (not masked); audit usefulness is high for some streams (workers_500k, scenarios) and low for others (sparse_workers, new_candidates). **No new data should be generated until the inconsistencies are resolved or explicitly accepted as test fixtures.**
---
## 1. Record counts + entity types
| Stream | Path | Rows | Entity | Notes |
|---|---|---|---|---|
| candidates | `data/datasets/candidates.parquet` | 1,000 | candidate | recruiter-side ATS-style records |
| job_orders | `data/datasets/job_orders.parquet` | 15,000 | job_order | client-side req records |
| workers_500k | `data/datasets/workers_500k.parquet` | 500,000 | worker | full population with scores + resume + comms |
| workers_100k | `data/datasets/workers_100k.parquet` | 100,000 | worker | scaled-down sibling |
| ethereal_workers | `data/datasets/ethereal_workers.parquet` | 10,000 | worker | scenario-friendly subset |
| client_workersi | `data/datasets/client_workersi.parquet` | 160 | worker | client "approved roster" view, simpler shape |
| client_workerskjkk | `data/datasets/client_workerskjkk.parquet` | 160 | worker | typo-named sibling of above |
| sparse_workers | `data/datasets/sparse_workers.parquet` | 200 | worker (sparse) | edge-case fixture |
| new_candidates | `data/datasets/new_candidates.parquet` | 3 | candidate | demo / smoke-test data |
| scenarios | `tests/multi-agent/scenarios/*.json` | 44 files | scenario | per-day client fill plans |
| lessons | `data/_playbook_lessons/*.json` | 64 files | lesson | post-run retrospectives |
**Worker-shape total on disk: ~610k rows across 6 files (including the 200-row sparse fixture). Candidate-shape: ~1k.**
---
## 2. Schema-by-schema field inventory
### candidates.parquet (1,000 rows)
```
candidate_id (string, "CAND-NNNNN") — present
first_name (string) — present, raw PII
last_name (string) — present, raw PII
email (string) — present, raw PII
phone (string, formatted "(NNN) NNN-NNNN") — present, raw PII
city, state — present
skills (string, CSV) — present
years_experience (int) — present
hourly_rate_usd (int) — present, financial
status (string) — present (sample: "placed"; full enum unknown)
```
Missing fields a real ATS would have: `created_at`, `last_contact`, `recruiter_id`, `source` (referral/website/cold), `placement_count`, `blacklisted_clients`. None of these block the audit but they limit what staffing-PRD-drift can verify.
### job_orders.parquet (15,000 rows)
```
job_order_id (string, "JO-NNNNNN") — present
client_id (string, "CLI-NNNNN") — present
title (string) — present
vertical (string) — present
bill_rate, pay_rate (float) — present, financial
status (string) — present (sample: "closed")
city, state, zip — present
description (string) — present, generated text
```
Missing fields: `created_at`, `target_count`, `filled_count`, `start_date`, `end_date`, `requirements (skills array)`. The `description` field embeds these informally ("Requires: ...", "6+ years exp", "$34.97/hr"). Parsing them into structured fields is what the audit needs to verify.
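Extracting those informal fields is the kind of check the audit would run. A minimal sketch, assuming the pattern shapes match the quoted samples ("Requires: ...", "6+ years exp", "$34.97/hr"); the real generated text may vary, so parse misses should be reported rather than silently dropped:

```python
import re

def parse_description(desc: str) -> dict:
    """Best-effort extraction of the structured fields the schema lacks.

    The regex shapes are assumptions based on the samples quoted above;
    they are not a guaranteed grammar of the generated text.
    """
    out = {"required_skills": [], "min_years": None, "rate_usd": None}
    m = re.search(r"Requires:\s*([^.]+)", desc)
    if m:
        out["required_skills"] = [s.strip() for s in m.group(1).split(",")]
    m = re.search(r"(\d+)\+\s*years?\s*exp", desc)
    if m:
        out["min_years"] = int(m.group(1))
    m = re.search(r"\$(\d+(?:\.\d{2})?)/hr", desc)
    if m:
        out["rate_usd"] = float(m.group(1))
    return out
```

Any description that yields all-empty output would count toward soft gap #6 (fill-rate metric blocked on parsing).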
### workers_500k.parquet / workers_100k / ethereal_workers (same schema)
```
worker_id (int, sequential) — present
name (string) — present, raw PII
role (string) — present
email (string) — present, raw PII
phone (int, no formatting) — present, raw PII (also wrong type — should be string; leading zeros are lost when stored as int)
city, state, zip — present
skills (string, CSV in single column) — present
certifications (string, CSV) — present
archetype (string, enum, sample: "flexible") — present, full enum unknown
reliability, responsiveness, engagement, compliance, availability (float 0-1) — present
communications (string, multi-msg with " | " separator) — present
resume_text (string) — present
```
Missing: `created_at`, `last_active`, `geo_radius_mi`, `certifications_expiry`. The 5 personality scores are the matchmaking signal.
### client_workersi / client_workerskjkk (160 rows each, simpler shape)
```
worker_id, name, role, city, state, email, phone, skills, certifications, availability, reliability, archetype
```
**6 fields fewer than workers_500k**: missing `responsiveness`, `engagement`, `compliance`, `communications`, `resume_text`, and `zip`. Also, `phone` is a string here vs an int in workers_500k.
### sparse_workers.parquet (200 rows, completely different shape)
```
name, phone, role, city, state, notes
```
**No worker_id, no scores, no email, no skills/certifications/archetype.** This is a recruiter-shorthand fixture — useful for testing "missing-fields graceful degradation" but NOT a staffing source.
### new_candidates.parquet (3 rows, candidate-shape)
```
name, phone, email, city, state, skills, years
```
**Missing the `candidate_id`** that exists in candidates.parquet. Tiny + smoke-test only.
---
## 3. PII / tokenization status
| Stream | PII fields | Masked? | Risk if LLM sees this |
|---|---|---|---|
| candidates | first_name, last_name, email, phone | ❌ raw | Names are real-shape; emails are `firstname.lastnameN@example.com` (clearly fake); phones are realistic-looking — could fool a model into citing them as real |
| workers_500k | name, email, phone | ❌ raw | Same risk — but at 500k scale, retrieval-time exposure is the more relevant concern |
| client_workers* | name, email, phone | ❌ raw | Same |
| sparse_workers | name, phone | ❌ raw | Same |
| new_candidates | name, email, phone | ❌ raw | Same |
| job_orders | (none — client_id is opaque) | n/a | low risk; description text doesn't leak PII |
| scenarios | (worker names sometimes appear in lesson text) | ❌ inline | "Susan X. Ruiz double-booked" — verbatim names in lesson markdown |
| lessons | worker names embedded in `lesson` field | ❌ inline | same |
**Critical:** `crates/shared/src/pii.rs::detect_sensitivity` recognizes `email`, `phone`, `ssn` as PII. `catalogd::service.rs:264` carries `column_redactions: HashMap<String, Redaction>`. **But enforcement at query time is unverified.** Whether retrieval through `staffing_inference_lakehouse` mode actually applies the mask — and whether the workers_500k_v8 vector corpus was built with masked text or raw — is the staffing audit's first deterministic check.
The synthetic email convention (`first.lastN@example.com`) is fake-recognizable to humans but a model trained to extract emails would still extract them as if real. Until either (a) the catalog masks them at query time or (b) a `_safe` view replaces PII with hashed tokens before vectorization, **the LLM has plausibly been seeing PII for every staffing query**.
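The runtime half of that check can be a simple fail-closed scan over LLM-facing text. A sketch, assuming the three categories `pii.rs` classifies (email, phone, ssn); the regex shapes here are illustrative stand-ins, not the authoritative patterns from `crates/shared/src/pii.rs`:

```python
import re

# Assumed patterns mirroring the categories pii.rs recognizes;
# the authoritative classification lives in crates/shared/src/pii.rs.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\(?\b\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def assert_no_pii(prompt: str) -> None:
    """Raise if any LLM-facing text contains a PII-shaped match."""
    hits = {k: p.findall(prompt) for k, p in PII_PATTERNS.items() if p.search(prompt)}
    if hits:
        raise ValueError(f"PII leaked into LLM-facing text: {hits}")
```

This is the defense-in-depth layer; the `_safe` views at corpus-build time remain the primary control.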
---
## 4. Search usefulness (as a corpus)
| Stream | Searchable | Rich enough for retrieval | Notes |
|---|---|---|---|
| workers_500k | ✓ | **High** | resume_text + comms = good RAG. archetype + 5 scores = good filtering signal |
| ethereal_workers | ✓ | High | same shape as 500k, smaller test slice |
| candidates | ✓ | Medium | skills as CSV string (not array — tokenize before search). No resume text |
| job_orders | ✓ | Medium | description carries requirements informally. No structured `required_skills` array |
| client_workers* | ✓ | Low | no resume, no scores beyond reliability/availability |
| sparse_workers | minimal | Low | useful for "graceful degradation" tests only |
| new_candidates | n/a | Trivial | 3 rows |
**`workers_500k_v8` vector corpus exists** — it's the staffing-mode-runner's matrix corpus. Whether its content was sourced from the masked catalog view or raw parquet is the build-time question for the audit.
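The CSV-string `skills` columns flagged in the table need tokenizing before they are useful as a retrieval filter. A minimal normalization sketch; lowercasing and order-preserving deduplication are assumptions about what the corpus builder should do, not what it currently does:

```python
def tokenize_skills(csv_field: str) -> list[str]:
    """Split the single-column CSV skills string into normalized tokens."""
    seen: dict[str, None] = {}
    for raw in csv_field.split(","):
        tok = raw.strip().lower()
        if tok:
            seen[tok] = None  # dict preserves first-seen order, drops duplicates
    return list(seen)
```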
---
## 5. Audit usefulness
| Stream | Audit value |
|---|---|
| scenarios | **High** — 44 fully-specified fill plans with timestamps, roles, counts, geo. Deterministic acceptance fixture material |
| lessons | High — 64 retrospectives with `events_total`/`events_ok` ratios. The closest thing to a fill-success ledger |
| outcomes.jsonl | High — already consumed by Phase 2 distillation transforms |
| candidates | Medium — `status` field is the verdict but enum is implicit |
| job_orders | Medium — `status: closed` count vs `target_count` (missing field) is the obvious metric, blocked by schema gap |
| workers_500k | Medium — `archetype` + scores enable per-worker reliability checks but no "did this worker get filled" signal lives here |
| client_workers* | Low — no temporal or status fields |
| sparse_workers | Low — fixture data |
| new_candidates | None — too few rows |
---
## 6. Concrete gap list (what's missing)
### Blocking gaps (must fix or accept before audit ships)
1. **No structured fill-event log.** Scenarios + lessons describe fills retrospectively but no row-per-event ledger exists. The audit's "candidate/job matching integrity" check needs this. **Decision needed:** generate a synthetic fill_events.parquet from the 44 scenarios + 64 lessons via deterministic script, OR scope the audit to "best-effort post-hoc reconstruction". Recommend the former — same scenarios + lessons unmodified, just normalized into a queryable shape.
2. **PII masking enforcement unverified.** Cannot ship a staffing audit that claims "PII boundaries respected" until we can prove the LLM-facing path masks. **Decision needed:** add `views/candidates_safe.sql`, `views/workers_safe.sql` (hash-masked) and rebuild `workers_500k_v9` from the safe view. OR: add a runtime check that asserts the LLM's prompt never contains PII regex matches. Recommend both — view at corpus-build time, runtime check as defense-in-depth.
3. **`client_workerskjkk.parquet` typo file.** Obviously not authoritative; either delete or rename. **Decision:** remove from canonical list; add a startup gate that errors on unrecognized parquet names in `data/datasets/`.
4. **`workers_500k.phone` is `int`, should be `string`.** Leading-zero loss is a real bug. Affects email/phone joins. **Decision:** fixup script + new schema version, OR document and accept (test data only).
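The deterministic script recommended for gap #1 would flatten each scenario into row-per-event records with the sha256 provenance the commitments require. A sketch under loud assumptions: the scenario field names (`scenario_id`, `day`, `fills`, `role`, `count`) are hypothetical, since the real shape is defined by the 44 scenario JSONs:

```python
import hashlib
import json

def fill_events_from_scenario(scenario: dict, source_path: str) -> list[dict]:
    """Flatten one scenario's fill plan into row-per-event records.

    Field names here are hypothetical placeholders; the actual shape
    comes from tests/multi-agent/scenarios/*.json. Every row carries
    sha256 provenance back to its source document.
    """
    digest = hashlib.sha256(
        json.dumps(scenario, sort_keys=True).encode()
    ).hexdigest()
    rows = []
    for fill in scenario.get("fills", []):
        for i in range(fill.get("count", 1)):
            rows.append({
                "event_id": f'{scenario["scenario_id"]}-{fill["role"]}-{i}',
                "day": scenario["day"],
                "role": fill["role"],
                "source": source_path,
                "source_sha256": digest,
            })
    return rows
```

The same pass over the 64 lessons would append outcome rows; writing the result to parquet (e.g. via pyarrow) is the only I/O the script needs.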
### Soft gaps (audit can run; results will reflect the gap)
5. Missing `created_at` / `last_active` timestamps on every entity — staffing recency rules can't fire.
6. No `target_count` / `filled_count` on job_orders — fill-rate metric requires parsing description.
7. `candidates.status` enum undocumented — can audit count distribution but can't claim "all expected statuses present".
8. `archetype` enum undocumented — same.
9. No worker→candidate join key. They're plausibly the SAME population in different shapes; the audit will assume distinct unless documented otherwise.
### Non-gaps (sufficient as-is)
10. 500k workers is plenty for retrieval-quality testing.
11. 44 scenarios + 64 lessons is enough for staffing_answers RAG corpus building.
12. PII detection rules in `pii.rs` are sufficient — the gap is enforcement, not classification.
---
## 7. Whether more synthetic data is needed
**Short answer: no, not for the initial staffing audit.**
The existing data is enough to:
- Run schema validity checks (Phase 1 of staffing audit)
- Audit PII enforcement (Phase 2)
- Build a staffing_answers RAG corpus from scenarios + lessons (Phase 3)
- Run replay against synthetic FillRequest payloads (Phase 4 — uses Phase 7 distillation infra)
- Detect PRD drift between docs/PRD.md §32 claims and the actual code (Phase 5)
The data is **NOT enough** to:
- Validate end-to-end fill rates without synthesizing a fill_events ledger from scenarios + lessons (gap #1 above)
- Test the "system gets smarter over time" Phase 19 claim — would need a longitudinal replay sweep, which is post-audit work
**Recommended decision tree (J to confirm):**
```
A. Generate fill_events.parquet (deterministic script over scenarios + lessons)?
   YES → adds 44 × ~5 rows = ~220 events; audit can run candidate/job matching integrity
   NO  → audit reports "blocked: no fill-event ledger" and exits with that finding
B. Build views/{candidates,workers,jobs}_safe with PII hash-masked?
   YES → corpus rebuilds from safe views; audit can prove PII boundary respected
   NO  → audit reports "blocked: cannot prove PII masking; LLM may have seen PII"
C. Delete client_workerskjkk.parquet typo file?
   YES → cleaner inventory; reduces audit surface
   NO  → audit flags as anomaly
D. Fix workers_500k.phone type (int → string)?
   YES → join keys work
   NO  → audit reports as known data quality issue
```
If J approves A + B + C + D, **no genuinely new synthetic data needed** — only normalization of what already exists.
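If D is approved, the phone fixup is a one-function deterministic script. A sketch assuming the generator emitted 10-digit NANP-style numbers (an assumption about the synthetic data, not a verified fact); anything wider than 10 digits is passed through so the script can flag it instead of truncating:

```python
def phone_int_to_string(phone: int, width: int = 10) -> str:
    """Restore the string form of a phone stored as int.

    Zero-padding to `width` digits assumes 10-digit NANP-style
    numbers; over-wide values are returned unchanged for flagging.
    """
    s = str(phone)
    return s.zfill(width) if len(s) <= width else s
```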
---
## 8. Up-front commitments before code
1. The staffing audit, when it ships, will **NOT modify** the distillation v1.0.0 substrate. Verified by `audit-full` running clean before+after.
2. Synthetic data **modifications** (gap #1 fill_events generation, gap #2 safe views, gap #3 typo deletion, gap #4 phone fixup) happen via deterministic scripts under `scripts/staffing/`, never by hand-edit.
3. Every new staffing-side artifact (RAG corpus, audit report, fill_events ledger) carries provenance back to its source parquet/scenario/lesson via canonical sha256 — same pattern as distillation Phase 1.
4. PII handling: the `_safe` views are the source of truth for any LLM-facing text; raw parquets stay on disk but are never the corpus build input.
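The hash-masking the `_safe` views rely on can be sketched as a stable, salted token function; the SQL views would call an equivalent UDF. Salt management and the token format are assumptions here, not decided design:

```python
import hashlib

def mask_pii(value: str, salt: str, prefix: str) -> str:
    """Replace a PII value with a stable, non-reversible token.

    Same input + salt always yields the same token, so joins across
    _safe views still line up; the salt prevents rainbow-table lookup
    of the synthetic values. Salt storage is out of scope here.
    """
    digest = hashlib.sha256((salt + value).encode()).hexdigest()[:12]
    return f"{prefix}_{digest}"
```

Determinism is the point: rebuilding `workers_500k_v9` from the safe view twice must produce byte-identical tokens for the provenance chain to hold.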
---
## 9. Phase 1 readiness checklist
- [x] Recon doc exists (`docs/recon/staffing-lakehouse-distillation-recon.md`)
- [x] Gap report exists (this file)
- [ ] J approves the 4 gap-fix decisions (A/B/C/D in §7)
- [ ] J approves the audit scope (which checks ship in v1)
Implementation begins **only after** J's review of both docs.