staffing: recon + synthetic-data gap report (Phase 0, no implementation)

Spec mandates these two docs before any staffing audit runner ships:
  docs/recon/staffing-lakehouse-distillation-recon.md
  reports/staffing/synthetic-data-gap-report.md

NO distillation core touched. Distillation v1.0.0 (commit e7636f2,
tag distillation-v1.0.0) remains the stable substrate. Staffing
work is consumer-only.

Recon findings (12 sections, ~5KB):
  - Existing staffing schemas in crates/validator/staffing/* are scaffolds
    (FillValidator schema-shape only; worker-existence/status/geo TODOs)
  - Synthetic data spans 6+ shapes across 9 parquet files
    (~610k worker-shape rows + 1k candidate-shape rows)
  - PII detection lives in shared/pii.rs but enforcement at query
    time is unverified — the LLM may have been seeing raw PII via
    workers_500k_v8 vector corpus
  - 44 scenarios + 64 playbook_lessons = 108 RAG candidates
  - No structured fill-event log exists; scenarios+lessons are
    retrospective, not queryable per-event records
  - workers_500k.phone is int (should be string — leading-zero loss)
  - client_workerskjkk.parquet is a typo file (160 rows, sibling of
    client_workersi.parquet)
  - PRD §158 claims Phase 19 closed playbook write-only gap — unverified

Gap report findings (9 sections, ~6KB):
  - 4 BLOCKING gaps requiring J decisions before audit ships:
    A. Generate fill_events.parquet from scenarios + lessons?
    B. Build views/{candidates,workers,jobs}_safe with PII masking?
    C. Delete client_workerskjkk.parquet typo file?
    D. Fix workers_500k.phone type (int → string)?
  - 5 SOFT gaps the audit can run with (will be reported as findings)
  - 3 NON-gaps (data sufficient as-is)
  - Recommendation: NO new synthetic data needed; only normalization
    of what already exists, contingent on J approval of A-D

Up-front commitments:
  - Distillation v1.0.0 substrate untouched (verified by audit-full
    running clean before+after each staffing change)
  - All synthetic-data modifications via deterministic scripts under
    scripts/staffing/, never hand-edit
  - Every staffing artifact carries canonical sha256 provenance back
    to source parquet/scenario/lesson
  - _safe views are the source of truth for LLM-facing text; raw
    parquets never directly fed into corpus builds

Phase 1 unblocks AFTER J reviews both docs and approves audit scope
+ the 4 gap-fix decisions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
root 2026-04-27 00:02:47 -05:00
parent e7636f202b
commit d11632a6fa
2 changed files with 498 additions and 0 deletions

# Staffing Lakehouse × Distillation Substrate — Recon
**Date:** 2026-04-27
**Status:** Phase 0 (read-only inventory — no implementation yet)
**Spec:** J's "Lakehouse Staffing Integration" prompt
**Distillation tag (consumer of):** `distillation-v1.0.0` (commit `e7636f2`)
This document inventories the staffing surface in the Lakehouse repo and identifies where the distillation substrate (Phases 0-8) should attach as a *consumer*. **No distillation core mutation — staffing builds on top.**
The headline finding: **staffing has substantial existing infrastructure but is undocumented as a system.** Validators are scaffolds, scenarios are test fixtures, synthetic data spans 6+ shapes with overlapping intent, and there's no unified staffing audit. The integration work is orchestration over what already exists, not greenfield.
---
## 1. Existing staffing schemas
### Rust validators (`crates/validator/src/staffing/`)
| File | Shape | Status |
|---|---|---|
| `mod.rs` | trait + module wiring | scaffold complete |
| `fill.rs::FillValidator` | validates `{fills: [{candidate_id, name}]}` against Artifact::FillProposal | schema check live; worker-existence + status + geo checks are TODO (commented in source) |
| `playbook.rs::PlaybookValidator` | validates Artifact::Playbook (operation format, endorsed_names cap, fingerprint) | schema-shape only; no semantic content check |
| `email.rs` | email-domain validation | scaffold |
### Profiles (`crates/shared/src/profiles/`)
| File | Purpose |
|---|---|
| `execution.rs` | execution profile (model routing per task class) |
| `memory.rs` | MemoryProfile (Phase 19 playbook boost ceiling, history cap, doc stale window, auto-retire) |
| `observer.rs` | Observer profile (failure cluster size, alert cooldown, ring size, langfuse forward) |
| `retrieval.rs` | RetrievalProfile (top_k, rerank_top_k, freshness cutoff, boost_playbook_memory, enforce_sensitivity_gates) |
These are **typed**, but whether they're enforced at runtime is unverified; auditing that is part of Phase 1 work.
### PII (`crates/shared/src/pii.rs`)
`detect_sensitivity(column_name)` → maps column names to sensitivity classes (`Pii`, `Financial`, `Public`). Verified by tests:
- `email`, `contact_email`, `ssn` → Pii
- `salary`, `bill_rate` → Financial
`catalogd::service.rs:264` carries `column_redactions: HashMap<String, Redaction>` per dataset. The catalog declares these redactions, but the audit needs to confirm masking is actually applied at query time.
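The column-name mapping described here can be mirrored in the audit's own tooling. A minimal TypeScript sketch, assuming only the columns documented in this recon (the actual `detect_sensitivity` signature and column list live in `pii.rs` and may differ):

```typescript
// Illustrative mirror of the column-name → sensitivity mapping
// described for crates/shared/src/pii.rs::detect_sensitivity.
// Column sets are taken from this document, not the Rust source.
type Sensitivity = "Pii" | "Financial" | "Public";

const PII_COLUMNS = new Set(["email", "contact_email", "ssn", "phone"]);
const FINANCIAL_COLUMNS = new Set(["salary", "bill_rate", "pay_rate"]);

function detectSensitivity(column: string): Sensitivity {
  const c = column.toLowerCase();
  if (PII_COLUMNS.has(c)) return "Pii";
  if (FINANCIAL_COLUMNS.has(c)) return "Financial";
  return "Public";
}
```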
---
## 2. Synthetic data inventory
| File | Rows | Shape | Status assessment |
|---|---|---|---|
| `data/datasets/candidates.parquet` | 1,000 | candidate_id, first_name, last_name, email, phone, city, state, skills, years_experience, hourly_rate_usd, status | **Has PII (raw email + phone)**. CAND-* IDs. status field: `placed`, `unknown others`. Compact + realistic. |
| `data/datasets/job_orders.parquet` | 15,000 | job_order_id, client_id, title, vertical, bill_rate, pay_rate, status, city, state, zip, description | JO-* IDs, CLI-* clients. Verticals: Admin, Manufacturing(?), etc. Realistic shape. **No candidate-fill linkage table observed.** |
| `data/datasets/workers_500k.parquet` | 500,000 | worker_id (int), name, role, email, phone, city, state, zip, skills (CSV string), certifications, archetype, reliability/responsiveness/engagement/compliance/availability (0-1 floats), communications (multi-msg string), resume_text | **Largest + richest source.** Has PII. archetype enum (flexible/?). 4-axis personality scores. Resume text + comm log = good RAG/SFT material. |
| `data/datasets/workers_100k.parquet` | 100,000 | (presumed same as 500k) | scaled-down sibling |
| `data/datasets/ethereal_workers.parquet` | 10,000 | same as workers_500k schema | scenario-friendly subset |
| `data/datasets/client_workersi.parquet` | 160 | worker_id, name, role, city, state, email, phone, skills, certifications, availability, reliability, archetype | **Different shape** (no scores beyond reliability+availability, no resume_text). Probably client-side "approved roster" — the worker pool a client has historically used. |
| `data/datasets/client_workerskjkk.parquet` | (similar) | (same as above) | typo-named sibling — gap to clean up |
| `data/datasets/sparse_workers.parquet` | 200 | name, phone, role, city, state, notes | **Different shape** — no IDs, no scores, just contact + notes. Looks like edge-case test data (sparse field coverage). |
| `data/datasets/new_candidates.parquet` | 3 | name, phone, email, city, state, skills, years | Demo / smoke-test data. Tiny. |
**Total worker-shape rows on disk: ~610k** across 5 files. Schema fragmentation (3 distinct shapes) is a real issue — see gap report.
### Scenarios (`tests/multi-agent/scenarios/`)
44 JSON files covering specific staffing days. Sample shape (Heritage Foods Indianapolis 2026-04-23):
```json
{ "client": "Heritage Foods", "date": "2026-04-23", "events": [
{ "kind": "baseline_fill", "at": "10:30", "role": "Machine Operator", "count": 2,
"city": "Indianapolis", "state": "IN", "shift_start": "10:30 AM" },
{ "kind": "recurring", "at": "10:30", "role": "Receiving Clerk", "count": 1, ... }
]}
```
Event kinds observed: `baseline_fill`, `recurring`. Cities span Indianapolis, Cincinnati, Madison, Toledo, Detroit, Columbus, etc. — Midwestern + Eastern US.
### Playbook lessons (`data/_playbook_lessons/`)
64 JSON files. Sample shape (Heritage Foods 2026-04-21):
```json
{ "date": "...", "client": "...", "cities": "...", "states": "...",
"events_total": 5, "events_ok": 3, "checkpoint_count": 2,
"model": "gpt-oss:20b", "cloud": false,
"lesson": "<long markdown analysis>",
"checkpoints": [{ "after": "09:30", "risk": "...", "hint": "..." }, ...] }
```
These are **post-run retrospectives** — the staffing ops loop wrote them after each scenario completed. Goldmine for RAG.
---
## 3. Ingestion paths + storage layout
### Object storage / Parquet
- `data/datasets/*.parquet` is the disk-resident store. Treated as input by `ingestd` (CSV/JSON/PDF/Postgres/MySQL ingest in `crates/ingestd`).
- **No catalog manifests observed for the staffing parquets** (none under `data/_catalog/manifests/` matching candidate/worker/job names). The datasets exist on disk but may not be registered with `catalogd` — gap.
### MariaDB
- `crates/queryd/src/context.rs` has a "candidates_safe" view referenced by recent code (failed at boot when schema mismatched, see prior memory `feedback_endpoint_probe_discipline.md`).
- Schema for the views isn't visible from grep — needs DB inspection.
### Vector indexes (`data/vectors/`)
- `workers_500k_v8.parquet` — vector corpus matched by `staffing_inference_lakehouse` mode in `config/modes.toml`
- `ethereal_workers_v1.parquet` — alt corpus
- `entity_brief_v1.parquet` — Chicago-permit-style entity briefs (different domain but same indexer)
- `chicago_permits_v1.parquet` — separate but uses same machinery
### KB streams that touch staffing
- `data/_kb/contract_analyses.jsonl` — contractor + permit analyses (related but not staffing per se)
- `data/_kb/staffers.jsonl` — 1.5K, small, not yet inspected
- `data/_kb/outcomes.jsonl` — scenario outcomes log (used by Phase 2 transforms in distillation)
- `data/_playbook_memory/state.json` — Phase 19 playbook memory state
---
## 4. Search / indexing logic
### Staffing-aware mode runner
`config/modes.toml` defines `staffing_inference` task class:
```toml
preferred_mode = "staffing_inference_lakehouse"
default_model = "openai/gpt-oss-120b:free"
matrix_corpus = "workers_500k_v8"
```
The mode runner (Phase 5+ work in this session) composes:
- `EnrichmentFlags { include_file_content, include_bug_fingerprints, include_matrix_chunks, use_relevance_filter, framing: Staffing }`
- Pulls top-K from `workers_500k_v8` corpus
- `FRAMING_STAFFING` system prompt instructs: "only recommend candidates whose names appear in the matrix data; do NOT fabricate workers"
### Pass 4 staffing harness
`scripts/mode_pass4_staffing.ts` ships synthetic FillRequest payloads through the runner. Each request is a JSON `{city, state, role, count, deadline, notes?}` posted as `file_content` (the runner's input shape). Validation: did the model surface real worker_ids from the corpus, or fabricate.
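That validation reduces to a set-membership check over proposed worker_ids. A minimal sketch; the interface and function names are illustrative, not the harness's actual API:

```typescript
// Sketch of the Pass 4 fabrication check: every worker_id the model
// proposes must exist in the retrieval corpus. Any id outside the
// corpus is a fabricated worker.
interface FillProposal {
  worker_id: number;
  name: string;
}

function findFabricated(
  proposals: FillProposal[],
  corpusIds: Set<number>
): FillProposal[] {
  return proposals.filter((p) => !corpusIds.has(p.worker_id));
}
```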
### What's missing
- **No "candidate matching" deterministic scorer** beyond mode-runner LLM. Staffing audit should add: given a job_order, can we score worker fit deterministically (skills overlap, geo distance, status filter) BEFORE asking the LLM? Currently the LLM does both retrieval and scoring.
- **No indexed link table between candidates.parquet and workers_500k.parquet.** They look like the SAME population in different shapes — the workers_500k has the scores + resume + comms, candidates has the basic contact + status + hourly rate. If they're meant to be different populations, the join key is unclear; if they're the same, there's redundancy.
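A deterministic pre-LLM scorer of the kind described could combine skills overlap, a geo proxy, and a status filter. The weights and field names below are illustrative assumptions, not a spec'd formula:

```typescript
// Illustrative deterministic fit score, computed before any LLM call:
// skills overlap + same-state geo proxy + availability, with a hard
// status filter. Weights are arbitrary placeholders.
interface Worker {
  skills: string;       // CSV string, as in workers_500k
  state: string;
  availability: number; // 0-1 float, as in workers_500k
  status?: string;
}

interface JobOrder {
  required_skills: string[]; // hypothetical structured field (see gap report)
  state: string;
}

function fitScore(w: Worker, jo: JobOrder): number {
  if (w.status === "inactive") return 0; // hard status filter
  const have = new Set(w.skills.split(",").map((s) => s.trim().toLowerCase()));
  const need = jo.required_skills.map((s) => s.toLowerCase());
  const overlap = need.filter((s) => have.has(s)).length / Math.max(need.length, 1);
  const geo = w.state === jo.state ? 1 : 0; // crude geo proxy
  return 0.6 * overlap + 0.2 * geo + 0.2 * w.availability;
}
```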
---
## 5. Audit / event tables
**No staffing-specific audit/event log observed.** Searched for `audit_event`, `outcome_event`, `fill_event` patterns in `crates/` — zero hits. The closest existing infrastructure:
- `data/_kb/outcomes.jsonl` — per-run scenario outcomes (used by distillation transforms)
- `data/_observer/ops.jsonl` — observer ring buffer (general-purpose, not staffing)
- `data/_playbook_lessons/*.json` — post-run lessons (retrospective, not audit)
**Gap:** staffing fills happen, scenarios complete, but **no schema-backed event log** captures: which worker_ids were proposed, accepted, filled, rejected, with what timing, against which job_order. The closest record is in scenarios + playbook_lessons but those are unstructured + per-scenario, not a queryable log.
---
## 6. PII / tokenization boundaries
### Detection
`crates/shared/src/pii.rs::detect_sensitivity` recognizes: `email`, `contact_email`, `ssn`, `phone` → Pii. `salary`, `bill_rate`, `pay_rate` → Financial.
### Enforcement
`catalogd::service.rs` carries per-dataset `column_redactions: HashMap<String, Redaction>` — but enforcement at query time wasn't visible from initial grep. Auditing whether masking actually happens when `staffing_inference_lakehouse` retrieves from `workers_500k_v8` is in scope.
### Risk
Raw email + phone live in `workers_500k.parquet` and `candidates.parquet`. If the LLM mode runner retrieves chunks and the catalog hasn't masked them, **the LLM sees PII**. Spec says "do not expose raw PII to AI" — auditing this is non-negotiable for the staffing integration.
---
## 7. PRD docs
- `docs/PRD.md` — main PRD. §32 names staffing as the reference implementation. §158 explicitly notes Phase 19 playbook learning was originally write-only, claims it's now closed — **verify**.
- `docs/CONTROL_PLANE_PRD.md` — long-horizon vision (2026-04-22 pivot)
PRD references staffing throughout but doesn't itemize a "staffing PRD checklist" the way the auditor's pr_audit mode expects per-PR claims. Drift detection between PRD claims and code reality is exactly the auditor's job — running it on the PRD as input rather than a PR diff is a configuration shift, not new code.
---
## 8. Where distillation outputs should attach
The Phase 0-8 distillation substrate is **already feeding the staffing surface in two places**:
1. **`staffing_inference_lakehouse` mode → `workers_500k_v8` matrix corpus.** This is read-only consumption; no change needed.
2. **`pr_audit` mode → `lakehouse_answers_v1` corpus.** Generic; not staffing-specific.
**What's missing for staffing:**
a. **Staffing-specific RAG corpus**: `staffing_answers_v1` built from playbook_lessons + scored scenarios. Same builder pattern as `lakehouse_answers_v1` (commit `0844206`'s `scripts/build_answers_corpus.ts`); just point at staffing inputs.
b. **Staffing audit task class**: a `staffing_audit` mode in `config/modes.toml`, paralleling the auditor's `pr_audit` work. Reads PRD claims + scenario outcomes, asks "do we ship what the PRD claims for staffing?"
c. **Staffing acceptance fixture** — same shape as `tests/fixtures/distillation/acceptance/` but with synthetic candidate + job_order + scenario + lesson rows. Pins staffing invariants: PII masked, candidates valid, scenarios reproducible.
d. **Staffing replay tasks** — drop sample fill requests through `./scripts/distill replay` to see if the local model proposes real worker_ids vs fabricates.
**Implementation approach (deferred until gap report + J approval):**
```
scripts/staffing/
  audit.ts            # ./scripts/staffing audit — single entry
  build_answers.ts    # build_staffing_answers_v1 from lessons + scenarios
  build_corpus_v9.ts  # rebuild workers_500k_v9 with PII masking applied
  acceptance.ts       # staffing-specific 22-invariant gate
tests/fixtures/staffing/
  candidates_sample.parquet
  job_orders_sample.parquet
  scenario_sample.json
  lesson_sample.json
reports/staffing/
  staffing-audit-report.md
  staffing-prd-drift-report.md
  staffing-search-quality-report.md
  staffing-synthetic-data-report.md
```
**ALL of the above is consumer-side.** The distillation pipeline's `scripts/distillation/`, `auditor/schemas/distillation/`, and Phase 0-8 commits are NOT touched.
---
## 9. Risks identified during recon
1. **Synthetic data shape fragmentation** — 3 distinct worker schemas across 5 files. If staffing audit assumes one shape and the system uses another, audits will silently miss.
2. **PII enforcement unverified.** Catalog has a redaction primitive; whether it's wired to mode-runner retrieval is the audit's first deterministic check.
3. **No structured staffing audit log.** Lessons + outcomes are retrospective summaries, not per-event records. Without per-event records, deterministic checks like "every worker proposed by the LLM exists in workers_500k" can't run on historical scenarios.
4. **Validator scaffolds.** `FillValidator::validate` does schema-shape only — the worker-existence/status/geo TODOs in the source are exactly the deterministic gates the staffing audit needs to run. Wiring them is consumer work, not distillation work.
5. **Fragile PRD ↔ code linkage.** PRD §158 claims Phase 19 closed the playbook write-only gap; no audit verifies. The staffing-prd-drift-report should run an inference-style claim verification against PRD claims, not unlike the auditor's pr_audit but with PRD as the source.
6. **`workers_500k_v8` is the embedded corpus the LLM sees.** If it carries PII without masking, the LLM has been seeing PII. Auditing the corpus content (not just the SQL views) is required.
7. **64 playbook_lessons + 44 scenarios = 108 RAG candidates.** Plenty for a staffing_answers corpus, but PII filtering must apply before vectorization. Currently lessons may contain worker names ("Susan X. Ruiz double-booked").
---
## 10. Recommended integration points (where consumer code attaches)
1. **Staffing audit script** at `scripts/staffing/audit.ts` reads from existing distillation outputs:
- `data/scored-runs/` (filter to task_id starting `permit:` or `scenario:`)
- `exports/quarantine/*.jsonl` (any staffing-specific quarantines)
- `reports/distillation/<latest>/summary.json` (cross-reference)
2. **Reuse Phase 5 receipts harness** — staffing audit writes a `StageReceipt` matching the existing schema, with a new `stage` value (extend the enum to `"staffing-audit"` only after schema-version bump if needed; otherwise use the existing reserved `"index"` slot or just write a parallel manifest under `reports/staffing/`).
3. **Reuse Phase 1 schemas** — RagSample, SftSample, PreferenceSample work for staffing data without modification. The `tags` array can carry `task:staffing.fill` to keep the corpus self-tagged.
4. **Reuse Phase 7 replay**: `./scripts/distill replay --task "fill 2 welders in Toledo OH"` already works; just feed it from synthetic FillRequest payloads.
5. **Reuse Phase 8 audit-full** — its drift baseline tracks distillation metrics; staffing audit gets its OWN baseline file at `data/_kb/staffing_audit_baselines.jsonl`.
6. **Schema invariants for staffing**:
- every candidate_id in candidates.parquet appears in workers_500k.parquet OR is documented as "candidate-distinct-from-worker"
- every status value in candidates.parquet is in a known enum
- every email in workers/candidates is masked when it reaches the LLM (audit by inspecting prompt traces in Langfuse)
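The third invariant can also be checked mechanically by scanning LLM-facing text for email-shaped tokens. A minimal sketch; the regex is a rough approximation, not a full RFC 5322 matcher:

```typescript
// Sketch of the "no raw email reaches the LLM" invariant: scan any
// LLM-facing text (prompt traces, corpus chunks) for email-shaped
// tokens. A simple pattern, intentionally loose rather than complete.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;

function containsRawEmail(llmFacingText: string): boolean {
  return EMAIL_RE.test(llmFacingText);
}
```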
---
## 11. What this document is NOT
- Not a green-light to start staffing audit implementation. The spec is explicit: synthetic-data gap report next, THEN J reviews, THEN code.
- Not an audit itself. This is the inventory — the audit's first run will surface findings.
- Not a redesign of staffing data shapes. The fragmentation is documented for the gap report; reshape decisions are J's call, not this recon's.
- Not a modification of the distillation v1.0.0 substrate. Per spec: "DO NOT modify the completed distillation pipeline unless a blocking integration bug is found."
---
## 12. Phase 1 readiness checklist
Before staffing implementation starts, the following must be true:
- [x] Recon doc exists (this file)
- [ ] Synthetic-data gap report exists (next)
- [ ] J reviews both before any code change
- [ ] J approves audit scope + first invariants
Phase 1 is unblocked only after the gap report is reviewed.

# Staffing Synthetic Data — Gap Report
**Date:** 2026-04-27
**Status:** read-only inventory; no data generated
**Spec:** J's "Lakehouse Staffing Integration" prompt
**Companion:** `docs/recon/staffing-lakehouse-distillation-recon.md`
This is the up-front gap report the spec mandates BEFORE any audit runner is built or any synthetic data is generated. It enumerates every staffing parquet on disk, tallies fields, flags PII status, and reports whether the data is **fit for the audit it's meant to validate**.
The headline finding: **the synthetic data is broad but inconsistent**. Three distinct worker schemas exist across five files; PII is raw (not masked); audit usefulness is high for some streams (workers_500k, scenarios) and low for others (sparse_workers, new_candidates). **No new data should be generated until the inconsistencies are resolved or explicitly accepted as test fixtures.**
---
## 1. Record counts + entity types
| Stream | Path | Rows | Entity | Notes |
|---|---|---|---|---|
| candidates | `data/datasets/candidates.parquet` | 1,000 | candidate | recruiter-side ATS-style records |
| job_orders | `data/datasets/job_orders.parquet` | 15,000 | job_order | client-side req records |
| workers_500k | `data/datasets/workers_500k.parquet` | 500,000 | worker | full population with scores + resume + comms |
| workers_100k | `data/datasets/workers_100k.parquet` | 100,000 | worker | scaled-down sibling |
| ethereal_workers | `data/datasets/ethereal_workers.parquet` | 10,000 | worker | scenario-friendly subset |
| client_workersi | `data/datasets/client_workersi.parquet` | 160 | worker | client "approved roster" view, simpler shape |
| client_workerskjkk | `data/datasets/client_workerskjkk.parquet` | 160 | worker | typo-named sibling of above |
| sparse_workers | `data/datasets/sparse_workers.parquet` | 200 | worker (sparse) | edge-case fixture |
| new_candidates | `data/datasets/new_candidates.parquet` | 3 | candidate | demo / smoke-test data |
| scenarios | `tests/multi-agent/scenarios/*.json` | 44 files | scenario | per-day client fill plans |
| lessons | `data/_playbook_lessons/*.json` | 64 files | lesson | post-run retrospectives |
**Worker-shape total on disk: ~610k rows across 5 files. Candidate-shape: ~1k.**
---
## 2. Schema-by-schema field inventory
### candidates.parquet (1,000 rows)
```
candidate_id (string, "CAND-NNNNN") — present
first_name (string) — present, raw PII
last_name (string) — present, raw PII
email (string) — present, raw PII
phone (string, formatted "(NNN) NNN-NNNN") — present, raw PII
city, state — present
skills (string, CSV) — present
years_experience (int) — present
hourly_rate_usd (int) — present, financial
status (string) — present (sample: "placed"; full enum unknown)
```
Missing fields a real ATS would have: `created_at`, `last_contact`, `recruiter_id`, `source` (referral/website/cold), `placement_count`, `blacklisted_clients`. None of these block the audit but they limit what staffing-PRD-drift can verify.
### job_orders.parquet (15,000 rows)
```
job_order_id (string, "JO-NNNNNN") — present
client_id (string, "CLI-NNNNN") — present
title (string) — present
vertical (string) — present
bill_rate, pay_rate (float) — present, financial
status (string) — present (sample: "closed")
city, state, zip — present
description (string) — present, generated text
```
Missing fields: `created_at`, `target_count`, `filled_count`, `start_date`, `end_date`, `requirements (skills array)`. The `description` field embeds these informally ("Requires: ...", "6+ years exp", "$34.97/hr"). Parsing them into structured fields is what the audit needs to verify.
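Under the informal conventions quoted above ("Requires: ...", "6+ years exp", "$34.97/hr"), a sketch parser might look like this; real descriptions may deviate from these patterns, which is exactly what the audit would surface:

```typescript
// Sketch parser for the informal job_orders.description conventions
// quoted above. Assumes the exact "Requires:", "N+ years exp", and
// "$NN.NN/hr" patterns; anything else is left unparsed.
interface ParsedReq {
  skills: string[];
  min_years?: number;
  rate_usd?: number;
}

function parseDescription(desc: string): ParsedReq {
  const out: ParsedReq = { skills: [] };
  const req = desc.match(/Requires:\s*([^.]+)/);
  if (req) out.skills = req[1].split(",").map((s) => s.trim());
  const yrs = desc.match(/(\d+)\+\s*years?\s*exp/);
  if (yrs) out.min_years = parseInt(yrs[1], 10);
  const rate = desc.match(/\$(\d+(?:\.\d+)?)\/hr/);
  if (rate) out.rate_usd = parseFloat(rate[1]);
  return out;
}
```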
### workers_500k.parquet / workers_100k / ethereal_workers (same schema)
```
worker_id (int, sequential) — present
name (string) — present, raw PII
role (string) — present
email (string) — present, raw PII
phone (int, no formatting) — present, raw PII (also wrong type — should be string; int storage drops leading zeros)
city, state, zip — present
skills (string, CSV in single column) — present
certifications (string, CSV) — present
archetype (string, enum, sample: "flexible") — present, full enum unknown
reliability, responsiveness, engagement, compliance, availability (float 0-1) — present
communications (string, multi-msg with " | " separator) — present
resume_text (string) — present
```
Missing: `created_at`, `last_active`, `geo_radius_mi`, `certifications_expiry`. The 5 personality scores are the matchmaking signal.
### client_workersi / client_workerskjkk (160 rows each, simpler shape)
```
worker_id, name, role, city, state, email, phone, skills, certifications, availability, reliability, archetype
```
**Six fields missing relative to workers_500k**: `responsiveness`, `engagement`, `compliance`, `communications`, `resume_text`, `zip`. Also, `phone` is a string here vs int in workers_500k.
### sparse_workers.parquet (200 rows, completely different shape)
```
name, phone, role, city, state, notes
```
**No worker_id, no scores, no email, no skills/certifications/archetype.** This is a recruiter-shorthand fixture — useful for testing "missing-fields graceful degradation" but NOT a staffing source.
### new_candidates.parquet (3 rows, candidate-shape)
```
name, phone, email, city, state, skills, years
```
**Missing the `candidate_id`** that exists in candidates.parquet. Tiny + smoke-test only.
---
## 3. PII / tokenization status
| Stream | PII fields | Masked? | Risk if LLM sees this |
|---|---|---|---|
| candidates | first_name, last_name, email, phone | ❌ raw | Names are real-shape; emails are `firstname.lastnameN@example.com` (clearly fake); phones are realistic-looking — could fool a model into citing them as real |
| workers_500k | name, email, phone | ❌ raw | Same risk — but at 500k scale, retrieval-time exposure is the more relevant concern |
| client_workers* | name, email, phone | ❌ raw | Same |
| sparse_workers | name, phone | ❌ raw | Same |
| new_candidates | name, email, phone | ❌ raw | Same |
| job_orders | (none — client_id is opaque) | n/a | low risk; description text doesn't leak PII |
| scenarios | (worker names sometimes appear in lesson text) | ❌ inline | "Susan X. Ruiz double-booked" — verbatim names in lesson markdown |
| lessons | worker names embedded in `lesson` field | ❌ inline | same |
**Critical:** `crates/shared/src/pii.rs::detect_sensitivity` recognizes `email`, `phone`, `ssn` as PII. `catalogd::service.rs:264` carries `column_redactions: HashMap<String, Redaction>`. **But enforcement at query time is unverified.** Whether retrieval through `staffing_inference_lakehouse` mode actually applies the mask — and whether the workers_500k_v8 vector corpus was built with masked text or raw — is the staffing audit's first deterministic check.
The synthetic email convention (`first.lastN@example.com`) is fake-recognizable to humans but a model trained to extract emails would still extract them as if real. Until either (a) the catalog masks them at query time or (b) a `_safe` view replaces PII with hashed tokens before vectorization, **the LLM has plausibly been seeing PII for every staffing query**.
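The proposed `_safe`-view masking could hash PII values into stable tokens, so joins still work but raw values never reach corpus builds. A sketch using Node's `crypto`; the `pii:` token format and salt are assumptions, not a settled design:

```typescript
import { createHash } from "node:crypto";

// Sketch of _safe-view masking: replace each PII value with a
// deterministic salted-hash token. Same input → same token, so
// equality joins across views survive; raw values do not.
function maskPii(value: string, salt = "staffing-v1"): string {
  const digest = createHash("sha256").update(salt + ":" + value).digest("hex");
  return "pii:" + digest.slice(0, 12);
}

function toSafeRow(
  row: Record<string, string>,
  piiCols: string[]
): Record<string, string> {
  const safe = { ...row };
  for (const col of piiCols) if (col in safe) safe[col] = maskPii(safe[col]);
  return safe;
}
```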
---
## 4. Search usefulness (as a corpus)
| Stream | Searchable | Rich enough for retrieval | Notes |
|---|---|---|---|
| workers_500k | ✓ | **High** | resume_text + comms = good RAG. archetype + 5 scores = good filtering signal |
| ethereal_workers | ✓ | High | same shape as 500k, smaller test slice |
| candidates | ✓ | Medium | skills as CSV string (not array — tokenize before search). No resume text |
| job_orders | ✓ | Medium | description carries requirements informally. No structured `required_skills` array |
| client_workers* | ✓ | Low | no resume, no scores beyond reliability/availability |
| sparse_workers | minimal | Low | useful for "graceful degradation" tests only |
| new_candidates | n/a | Trivial | 3 rows |
**`workers_500k_v8` vector corpus exists** — it's the staffing-mode-runner's matrix corpus. Whether its content was sourced from the masked catalog view or raw parquet is the build-time question for the audit.
---
## 5. Audit usefulness
| Stream | Audit value |
|---|---|
| scenarios | **High** — 44 fully-specified fill plans with timestamps, roles, counts, geo. Deterministic acceptance fixture material |
| lessons | High — 64 retrospectives with `events_total`/`events_ok` ratios. The closest thing to a fill-success ledger |
| outcomes.jsonl | High — already consumed by Phase 2 distillation transforms |
| candidates | Medium — `status` field is the verdict but enum is implicit |
| job_orders | Medium — `status: closed` count vs `target_count` (missing field) is the obvious metric, blocked by schema gap |
| workers_500k | Medium — `archetype` + scores enable per-worker reliability checks but no "did this worker get filled" signal lives here |
| client_workers* | Low — no temporal or status fields |
| sparse_workers | Low — fixture data |
| new_candidates | None — too few rows |
---
## 6. Concrete gap list (what's missing)
### Blocking gaps (must fix or accept before audit ships)
1. **No structured fill-event log.** Scenarios + lessons describe fills retrospectively but no row-per-event ledger exists. The audit's "candidate/job matching integrity" check needs this. **Decision needed:** generate a synthetic fill_events.parquet from the 44 scenarios + 64 lessons via deterministic script, OR scope the audit to "best-effort post-hoc reconstruction". Recommend the former — same scenarios + lessons unmodified, just normalized into a queryable shape.
2. **PII masking enforcement unverified.** Cannot ship a staffing audit that claims "PII boundaries respected" until we can prove the LLM-facing path masks. **Decision needed:** add `views/candidates_safe.sql`, `views/workers_safe.sql` (hash-masked) and rebuild `workers_500k_v9` from the safe view. OR: add a runtime check that asserts the LLM's prompt never contains PII regex matches. Recommend both — view at corpus-build time, runtime check as defense-in-depth.
3. **`client_workerskjkk.parquet` typo file.** Obviously not authoritative; either delete or rename. **Decision:** remove from canonical list; add a startup gate that errors on unrecognized parquet names in `data/datasets/`.
4. **`workers_500k.phone` is `int`, should be `string`.** Leading-zero loss is a real bug. Affects email/phone joins. **Decision:** fixup script + new schema version, OR document and accept (test data only).
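Gap 1's normalization is a deterministic flatten of the scenario shape inventoried in the recon. A sketch; the output columns are a proposal, not a settled schema:

```typescript
// Sketch of the gap-1 normalization: flatten scenario JSON (shape as
// sampled in the recon doc) into row-per-event fill records suitable
// for a fill_events.parquet ledger.
interface ScenarioEvent {
  kind: string;  // "baseline_fill" | "recurring" observed so far
  at: string;
  role: string;
  count: number;
  city: string;
  state: string;
}

interface Scenario {
  client: string;
  date: string;
  events: ScenarioEvent[];
}

interface FillEventRow extends ScenarioEvent {
  client: string;
  date: string;
}

function toFillEvents(s: Scenario): FillEventRow[] {
  return s.events.map((e) => ({ client: s.client, date: s.date, ...e }));
}
```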
### Soft gaps (audit can run; results will reflect the gap)
5. Missing `created_at` / `last_active` timestamps on every entity — staffing recency rules can't fire.
6. No `target_count` / `filled_count` on job_orders — fill-rate metric requires parsing description.
7. `candidates.status` enum undocumented — can audit count distribution but can't claim "all expected statuses present".
8. `archetype` enum undocumented — same.
9. No worker→candidate join key. They're plausibly the SAME population in different shapes; the audit will assume distinct unless documented otherwise.
### Non-gaps (sufficient as-is)
10. 500k workers is plenty for retrieval-quality testing.
11. 44 scenarios + 64 lessons is enough for staffing_answers RAG corpus building.
12. PII detection rules in `pii.rs` are sufficient — the gap is enforcement, not classification.
---
## 7. Whether more synthetic data is needed
**Short answer: no, not for the initial staffing audit.**
The existing data is enough to:
- Run schema validity checks (Phase 1 of staffing audit)
- Audit PII enforcement (Phase 2)
- Build a staffing_answers RAG corpus from scenarios + lessons (Phase 3)
- Run replay against synthetic FillRequest payloads (Phase 4 — uses Phase 7 distillation infra)
- Detect PRD drift between docs/PRD.md §32 claims and the actual code (Phase 5)
The data is **NOT enough** to:
- Validate end-to-end fill rates without synthesizing a fill_events ledger from scenarios + lessons (gap #1 above)
- Test the "system gets smarter over time" Phase 19 claim — would need a longitudinal replay sweep, which is post-audit work
**Recommended decision tree (J to confirm):**
```
A. Generate fill_events.parquet (deterministic script over scenarios + lessons)?
   YES → adds 44 × ~5 rows = ~220 events; audit can run candidate/job matching integrity
   NO  → audit reports "blocked: no fill-event ledger" and exits with that finding
B. Build views/{candidates,workers,jobs}_safe with PII hash-masked?
   YES → corpus rebuilds from safe views; audit can prove PII boundary respected
   NO  → audit reports "blocked: cannot prove PII masking; LLM may have seen PII"
C. Delete client_workerskjkk.parquet typo file?
   YES → cleaner inventory; reduces audit surface
   NO  → audit flags as anomaly
D. Fix workers_500k.phone type (int → string)?
   YES → join keys work
   NO  → audit reports as known data quality issue
```
If J approves A + B + C + D, **no genuinely new synthetic data needed** — only normalization of what already exists.
---
## 8. Up-front commitments before code
1. The staffing audit, when it ships, will **NOT modify** the distillation v1.0.0 substrate. Verified by `audit-full` running clean before+after.
2. Synthetic data **modifications** (gap #1 fill_events generation, gap #2 safe views, gap #3 typo deletion, gap #4 phone fixup) happen via deterministic scripts under `scripts/staffing/`, never by hand-edit.
3. Every new staffing-side artifact (RAG corpus, audit report, fill_events ledger) carries provenance back to its source parquet/scenario/lesson via canonical sha256 — same pattern as distillation Phase 1.
4. PII handling: the `_safe` views are the source of truth for any LLM-facing text; raw parquets stay on disk but are never the corpus build input.
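Commitment 3's provenance pattern can be sketched as a sha256 digest over the source bytes, attached to every derived artifact. The record shape below is illustrative, not the distillation Phase 1 schema itself:

```typescript
import { createHash } from "node:crypto";

// Sketch of commitment 3: every derived staffing artifact records a
// canonical sha256 of its source, so any report or corpus row can be
// traced back to the exact parquet/scenario/lesson bytes it came from.
interface Provenance {
  source_path: string;
  source_sha256: string;
}

function provenanceFor(sourcePath: string, sourceBytes: string): Provenance {
  return {
    source_path: sourcePath,
    source_sha256: createHash("sha256").update(sourceBytes).digest("hex"),
  };
}
```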
---
## 9. Phase 1 readiness checklist
- [x] Recon doc exists (`docs/recon/staffing-lakehouse-distillation-recon.md`)
- [x] Gap report exists (this file)
- [ ] J approves the 4 gap-fix decisions (A/B/C/D in §7)
- [ ] J approves the audit scope (which checks ship in v1)
Implementation begins **only after** J's review of both docs.