# Staffing Lakehouse × Distillation Substrate: Recon

**Date:** 2026-04-27
**Status:** Phase 0 (read-only inventory, no implementation yet)
**Spec:** J's "Lakehouse Staffing Integration" prompt
**Distillation tag (consumer of):** `distillation-v1.0.0` (commit `e7636f2`)

This document inventories the staffing surface in the Lakehouse repo and identifies where the distillation substrate (Phases 0-8) should attach as a *consumer*. **No distillation core mutation: staffing builds on top.**

The headline finding: **staffing has substantial existing infrastructure but is undocumented as a system.** Validators are scaffolds, scenarios are test fixtures, synthetic data spans 6+ shapes with overlapping intent, and there's no unified staffing audit. The integration work is orchestration over what already exists, not greenfield.

---

## 1. Existing staffing schemas

### Rust validators (`crates/validator/src/staffing/`)

| File | Shape | Status |
|---|---|---|
| `mod.rs` | trait + module wiring | scaffold complete |
| `fill.rs::FillValidator` | validates `{fills: [{candidate_id, name}]}` against Artifact::FillProposal | schema check live; worker-existence + status + geo checks are TODO (commented in source) |
| `playbook.rs::PlaybookValidator` | validates Artifact::Playbook (operation format, endorsed_names cap, fingerprint) | schema-shape only; no semantic content check |
| `email.rs` | email-domain validation | scaffold |

### Profiles (`crates/shared/src/profiles/`)

| File | Purpose |
|---|---|
| `execution.rs` | execution profile (model routing per task class) |
| `memory.rs` | MemoryProfile (Phase 19 playbook boost ceiling, history cap, doc stale window, auto-retire) |
| `observer.rs` | Observer profile (failure cluster size, alert cooldown, ring size, langfuse forward) |
| `retrieval.rs` | RetrievalProfile (top_k, rerank_top_k, freshness cutoff, boost_playbook_memory, enforce_sensitivity_gates) |

These are **typed**, but auditing whether they're enforced at runtime is part
of Phase 1 work.

### PII (`crates/shared/src/pii.rs`)

`detect_sensitivity(column_name)` → maps column names to sensitivity classes (`Pii`, `Financial`, `Public`). Verified by tests:

- `email`, `contact_email`, `ssn` → Pii
- `salary`, `bill_rate` → Financial

`catalogd::service.rs:264` carries `column_redactions: HashMap` per dataset. The catalog is meant to enforce these redactions, but the audit needs to confirm masking is actually applied at query time.

---

## 2. Synthetic data inventory

| File | Rows | Shape | Status assessment |
|---|---|---|---|
| `data/datasets/candidates.parquet` | 1,000 | candidate_id, first_name, last_name, email, phone, city, state, skills, years_experience, hourly_rate_usd, status | **Has PII (raw email + phone)**. CAND-* IDs. Status values: `placed`, plus others not yet enumerated. Compact + realistic. |
| `data/datasets/job_orders.parquet` | 15,000 | job_order_id, client_id, title, vertical, bill_rate, pay_rate, status, city, state, zip, description | JO-* IDs, CLI-* clients. Verticals: Admin, Manufacturing(?), etc. Realistic shape. **No candidate-fill linkage table observed.** |
| `data/datasets/workers_500k.parquet` | 500,000 | worker_id (int), name, role, email, phone, city, state, zip, skills (CSV string), certifications, archetype, reliability/responsiveness/engagement/compliance/availability (0-1 floats), communications (multi-msg string), resume_text | **Largest + richest source.** Has PII. archetype enum (flexible/?). Five 0-1 personality scores. Resume text + comm log = good RAG/SFT material. |
| `data/datasets/workers_100k.parquet` | 100,000 | (presumed same as 500k) | scaled-down sibling |
| `data/datasets/ethereal_workers.parquet` | 10,000 | same as workers_500k schema | scenario-friendly subset |
| `data/datasets/client_workersi.parquet` | 160 | worker_id, name, role, city, state, email, phone, skills, certifications, availability, reliability, archetype | **Different shape** (no scores beyond reliability+availability, no resume_text). Probably a client-side "approved roster": the worker pool a client has historically used. |
| `data/datasets/client_workerskjkk.parquet` | (similar) | (same as above) | typo-named sibling; gap to clean up |
| `data/datasets/sparse_workers.parquet` | 200 | name, phone, role, city, state, notes | **Different shape**: no IDs, no scores, just contact + notes. Looks like edge-case test data (sparse field coverage). |
| `data/datasets/new_candidates.parquet` | 3 | name, phone, email, city, state, skills, years | Demo / smoke-test data. Tiny. |

**Total worker-shape rows on disk: ~610k** across the 5 worker files with recorded row counts (500,000 + 100,000 + 10,000 + 160 + 200). Schema fragmentation (3 distinct shapes) is a real issue; see the gap report.

### Scenarios (`tests/multi-agent/scenarios/`)

44 JSON files covering specific staffing days. Sample shape (Heritage Foods Indianapolis 2026-04-23):

```json
{
  "client": "Heritage Foods",
  "date": "2026-04-23",
  "events": [
    { "kind": "baseline_fill", "at": "10:30", "role": "Machine Operator", "count": 2,
      "city": "Indianapolis", "state": "IN", "shift_start": "10:30 AM" },
    { "kind": "recurring", "at": "10:30", "role": "Receiving Clerk", "count": 1, ... }
  ]
}
```

Event kinds observed: `baseline_fill`, `recurring`. Cities span Indianapolis, Cincinnati, Madison, Toledo, Detroit, Columbus, etc. (Midwestern + Eastern US).

### Playbook lessons (`data/_playbook_lessons/`)

64 JSON files. Sample shape (Heritage Foods 2026-04-21):

```json
{
  "date": "...", "client": "...", "cities": "...", "states": "...",
  "events_total": 5, "events_ok": 3, "checkpoint_count": 2,
  "model": "gpt-oss:20b", "cloud": false,
  "lesson": "",
  "checkpoints": [{ "after": "09:30", "risk": "...", "hint": "..." }, ...]
}
```

These are **post-run retrospectives**: the staffing ops loop wrote them after each scenario completed. Goldmine for RAG.

---

## 3. Ingestion paths + storage layout

### Object storage / Parquet

- `data/datasets/*.parquet` is the disk-resident store.
  Treated as input by `ingestd` (CSV/JSON/PDF/Postgres/MySQL ingest in `crates/ingestd`).
- **No catalog manifests observed for the staffing parquets** (none under `data/_catalog/manifests/` matching candidate/worker/job names). The datasets exist on disk but may not be registered with `catalogd`. This is a gap.

### MariaDB

- `crates/queryd/src/context.rs` has a "candidates_safe" view referenced by recent code (it failed at boot when the schema mismatched; see prior memory `feedback_endpoint_probe_discipline.md`).
- Schema for the views isn't visible from grep and needs DB inspection.

### Vector indexes (`data/vectors/`)

- `workers_500k_v8.parquet`: vector corpus matched by `staffing_inference_lakehouse` mode in `config/modes.toml`
- `ethereal_workers_v1.parquet`: alt corpus
- `entity_brief_v1.parquet`: Chicago-permit-style entity briefs (different domain but same indexer)
- `chicago_permits_v1.parquet`: separate but uses same machinery

### KB streams that touch staffing

- `data/_kb/contract_analyses.jsonl`: contractor + permit analyses (related but not staffing per se)
- `data/_kb/staffers.jsonl`: 1.5K, small, not yet inspected
- `data/_kb/outcomes.jsonl`: scenario outcomes log (used by Phase 2 transforms in distillation)
- `data/_playbook_memory/state.json`: Phase 19 playbook memory state

---

## 4. Search / indexing logic

### Staffing-aware mode runner

`config/modes.toml` defines the `staffing_inference` task class:

```toml
preferred_mode = "staffing_inference_lakehouse"
default_model = "openai/gpt-oss-120b:free"
matrix_corpus = "workers_500k_v8"
```

The mode runner (Phase 5+ work in this session) composes:

- `EnrichmentFlags { include_file_content, include_bug_fingerprints, include_matrix_chunks, use_relevance_filter, framing: Staffing }`
- Pulls top-K from the `workers_500k_v8` corpus
- `FRAMING_STAFFING` system prompt instructs: "only recommend candidates whose names appear in the matrix data; do NOT fabricate workers"

### Pass 4 staffing harness

`scripts/mode_pass4_staffing.ts` ships synthetic FillRequest payloads through the runner. Each request is a JSON `{city, state, role, count, deadline, notes?}` posted as `file_content` (the runner's input shape). Validation checks whether the model surfaced real worker_ids from the corpus or fabricated them.

### What's missing

- **No deterministic "candidate matching" scorer** beyond the mode-runner LLM. The staffing audit should add one: given a job_order, can we score worker fit deterministically (skills overlap, geo distance, status filter) BEFORE asking the LLM? Currently the LLM does both retrieval and scoring.
- **No indexed link table between candidates.parquet and workers_500k.parquet.** They look like the SAME population in different shapes: workers_500k has the scores + resume + comms, candidates has the basic contact + status + hourly rate. If they're meant to be different populations, the join key is unclear; if they're the same, there's redundancy.

---

## 5. Audit / event tables

**No staffing-specific audit/event log observed.** Searched for `audit_event`, `outcome_event`, `fill_event` patterns in `crates/`: zero hits.
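
To make concrete what a schema-backed staffing event log could hold, here is a minimal TypeScript sketch. Everything in it is an assumption for illustration: the `FillEvent` type, its field names, the four `kind` values, and the `validateFillEvent` / `parseFillLog` helpers do not exist in the repo. It only relies on facts from the inventory above (integer `worker_id`, `JO-*` job order IDs).

```typescript
// Hypothetical record shape; none of these names exist in the repo today.
type FillEventKind = "proposed" | "accepted" | "filled" | "rejected";

interface FillEvent {
  kind: FillEventKind;
  worker_id: number;    // workers_500k.parquet uses an integer worker_id
  job_order_id: string; // JO-* identifier from job_orders.parquet
  at: string;           // timestamp of the transition (ISO-8601 assumed)
}

const KINDS: readonly string[] = ["proposed", "accepted", "filled", "rejected"];

// Returns the names of the fields that failed validation.
// An empty array means the record is well-formed.
function validateFillEvent(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) return ["not an object"];
  const rec = raw as Record<string, unknown>;
  const kind = rec["kind"];
  const workerId = rec["worker_id"];
  const jobOrderId = rec["job_order_id"];
  const at = rec["at"];

  const problems: string[] = [];
  if (typeof kind !== "string" || !KINDS.includes(kind)) problems.push("kind");
  if (typeof workerId !== "number" || !Number.isInteger(workerId)) problems.push("worker_id");
  if (typeof jobOrderId !== "string" || !jobOrderId.startsWith("JO-")) problems.push("job_order_id");
  if (typeof at !== "string" || Number.isNaN(Date.parse(at))) problems.push("at");
  return problems;
}

// Parses a JSONL blob, keeping valid events and counting rejected lines.
function parseFillLog(jsonl: string): { events: FillEvent[]; rejected: number } {
  const events: FillEvent[] = [];
  let rejected = 0;
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines
    try {
      const raw: unknown = JSON.parse(line);
      if (validateFillEvent(raw).length === 0) events.push(raw as FillEvent);
      else rejected++;
    } catch {
      rejected++; // malformed JSON counts as a rejected line
    }
  }
  return { events, rejected };
}
```

A log in this shape would be queryable per worker and per job order, which the per-scenario lesson files are not. The actual field set is a decision for the gap report and J's review.
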
The closest existing infrastructure:

- `data/_kb/outcomes.jsonl`: per-run scenario outcomes (used by distillation transforms)
- `data/_observer/ops.jsonl`: observer ring buffer (general-purpose, not staffing)
- `data/_playbook_lessons/*.json`: post-run lessons (retrospective, not audit)

**Gap:** staffing fills happen, scenarios complete, but **no schema-backed event log** captures which worker_ids were proposed, accepted, filled, or rejected, with what timing, against which job_order. The closest record is in scenarios + playbook_lessons, but those are unstructured + per-scenario, not a queryable log.

---

## 6. PII / tokenization boundaries

### Detection

`crates/shared/src/pii.rs::detect_sensitivity` recognizes: `email`, `contact_email`, `ssn`, `phone` → Pii. `salary`, `bill_rate`, `pay_rate` → Financial.

### Enforcement

`catalogd::service.rs` carries per-dataset `column_redactions: HashMap`, but enforcement at query time wasn't visible from initial grep. Auditing whether masking actually happens when `staffing_inference_lakehouse` retrieves from `workers_500k_v8` is in scope.

### Risk

Raw email + phone live in `workers_500k.parquet` and `candidates.parquet`. If the LLM mode runner retrieves chunks and the catalog hasn't masked them, **the LLM sees PII**. Spec says "do not expose raw PII to AI", so auditing this is non-negotiable for the staffing integration.

---

## 7. PRD docs

- `docs/PRD.md`: main PRD. §32 names staffing as the reference implementation. §158 explicitly notes Phase 19 playbook learning was originally write-only and claims it's now closed (**verify**).
- `docs/CONTROL_PLANE_PRD.md`: long-horizon vision (2026-04-22 pivot)

PRD references staffing throughout but doesn't itemize a "staffing PRD checklist" the way the auditor's pr_audit mode expects per-PR claims. Drift detection between PRD claims and code reality is exactly the auditor's job; running it on the PRD as input rather than a PR diff is a configuration shift, not new code.

---

## 8. Where distillation outputs should attach

The Phase 0-8 distillation substrate is **already feeding the staffing surface in two places**:

1. **`staffing_inference_lakehouse` mode → `workers_500k_v8` matrix corpus.** This is read-only consumption; no change needed.
2. **`pr_audit` mode → `lakehouse_answers_v1` corpus.** Generic; not staffing-specific.

**What's missing for staffing:**

a. **Staffing-specific RAG corpus**: `staffing_answers_v1` built from playbook_lessons + scored scenarios. Same builder pattern as `lakehouse_answers_v1` (commit `0844206`'s `scripts/build_answers_corpus.ts`); just point at staffing inputs.

b. **Staffing audit task class**: `staffing_audit` mode in `config/modes.toml`, paralleling the auditor's `pr_audit` work. Reads PRD claims + scenario outcomes, asks "do we ship what the PRD claims for staffing?"

c. **Staffing acceptance fixture**: same shape as `tests/fixtures/distillation/acceptance/` but with synthetic candidate + job_order + scenario + lesson rows. Pins staffing invariants: PII masked, candidates valid, scenarios reproducible.

d. **Staffing replay tasks**: drop sample fill requests through `./scripts/distill replay` to see if the local model proposes real worker_ids vs fabricates.

**Implementation approach (deferred until gap report + J approval):**

```
scripts/staffing/
  audit.ts             # ./scripts/staffing audit (single entry)
  build_answers.ts     # build_staffing_answers_v1 from lessons + scenarios
  build_corpus_v9.ts   # rebuild workers_500k_v9 with PII masking applied
  acceptance.ts        # staffing-specific 22-invariant gate

tests/fixtures/staffing/
  candidates_sample.parquet
  job_orders_sample.parquet
  scenario_sample.json
  lesson_sample.json

reports/staffing/
  staffing-audit-report.md
  staffing-prd-drift-report.md
  staffing-search-quality-report.md
  staffing-synthetic-data-report.md
```

**ALL of the above is consumer-side.** The distillation pipeline's `scripts/distillation/`, `auditor/schemas/distillation/`, and Phase 0-8 commits are NOT touched.

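
As an illustration of the "PII masking applied" step that `build_corpus_v9.ts` is meant to perform, the following is a rough TypeScript sketch, not the planned implementation. The `maskText` and `maskRow` helpers, the placeholder tokens, and both regular expressions are assumptions made for this example. Only the Pii column names (`email`, `contact_email`, `ssn`, `phone`) come from `detect_sensitivity` as described in section 6. The regexes are deliberately simple and would need tuning against real `resume_text` and `communications` content.

```typescript
// Illustrative patterns only: a simple email matcher and a US-style phone matcher.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE_RE = /(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/g;

// Column names that section 6 lists as Pii for detect_sensitivity.
const PII_COLUMNS = new Set(["email", "contact_email", "ssn", "phone"]);

// Replaces emails and phone numbers embedded in free text with placeholder tokens.
function maskText(text: string): string {
  return text.replace(EMAIL_RE, "[EMAIL]").replace(PHONE_RE, "[PHONE]");
}

type Row = Record<string, string | number>;

// Masks one row before it is chunked and embedded:
// Pii columns are fully redacted, other string columns are scrubbed for
// embedded contact details, and numeric columns pass through unchanged.
function maskRow(row: Row): Row {
  const out: Row = {};
  for (const [col, value] of Object.entries(row)) {
    if (PII_COLUMNS.has(col)) out[col] = "[REDACTED]";
    else if (typeof value === "string") out[col] = maskText(value);
    else out[col] = value;
  }
  return out;
}

// Example: a worker-shaped row with PII both in columns and in free text.
const masked = maskRow({
  worker_id: 7,
  email: "jane@example.com",
  resume_text: "Forklift certified. Call 317-555-0134 or jane@example.com.",
});
console.log(masked);
```

Masking both by column and by free-text pattern matters here because `resume_text` and `communications` can carry contact details that a column-level redaction map would never see. Worker names in free text are not covered by this sketch.
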

---

## 9. Risks identified during recon

1. **Synthetic data shape fragmentation.** 3 distinct worker schemas across 5 files. If the staffing audit assumes one shape and the system uses another, audits will silently miss.
2. **PII enforcement unverified.** Catalog has a redaction primitive; whether it's wired to mode-runner retrieval is the audit's first deterministic check.
3. **No structured staffing audit log.** Lessons + outcomes are retrospective summaries, not per-event records. Without per-event records, deterministic checks like "every worker proposed by the LLM exists in workers_500k" can't run on historical scenarios.
4. **Validator scaffolds.** `FillValidator::validate` does schema-shape only. The worker-existence/status/geo TODOs in the source are exactly the deterministic gates the staffing audit needs to run. Wiring them is consumer work, not distillation work.
5. **Fragile PRD ↔ code linkage.** PRD §158 claims Phase 19 closed the playbook write-only gap; no audit verifies this. The staffing-prd-drift-report should run an inference-style claim verification against PRD claims, similar to the auditor's pr_audit but with the PRD as the source.
6. **`workers_500k_v8` is the embedded corpus the LLM sees.** If it carries PII without masking, the LLM has been seeing PII. Auditing the corpus content (not just the SQL views) is required.
7. **64 playbook_lessons + 44 scenarios = 108 RAG candidates.** Plenty for a staffing_answers corpus, but PII filtering must apply before vectorization. Currently lessons may contain worker names ("Susan X. Ruiz double-booked").

---

## 10. Recommended integration points (where consumer code attaches)

1. **Staffing audit script** at `scripts/staffing/audit.ts` reads from existing distillation outputs:
   - `data/scored-runs/` (filter to task_id starting `permit:` or `scenario:`)
   - `exports/quarantine/*.jsonl` (any staffing-specific quarantines)
   - `reports/distillation//summary.json` (cross-reference)
2. **Reuse Phase 5 receipts harness.** The staffing audit writes a `StageReceipt` matching the existing schema, with a new `stage` value (extend the enum to `"staffing-audit"` only after a schema-version bump if needed; otherwise use the existing reserved `"index"` slot or just write a parallel manifest under `reports/staffing/`).
3. **Reuse Phase 1 schemas.** RagSample, SftSample, PreferenceSample work for staffing data without modification. The `tags` array can carry `task:staffing.fill` to keep the corpus self-tagged.
4. **Reuse Phase 7 replay.** `./scripts/distill replay --task "fill 2 welders in Toledo OH"` already works; just feed it from synthetic FillRequest payloads.
5. **Reuse Phase 8 audit-full.** Its drift baseline tracks distillation metrics; the staffing audit gets its OWN baseline file at `data/_kb/staffing_audit_baselines.jsonl`.
6. **Schema invariants for staffing**:
   - every candidate_id in candidates.parquet appears in workers_500k.parquet OR is documented as "candidate-distinct-from-worker"
   - every status value in candidates.parquet is in a known enum
   - every email in workers/candidates is masked when it reaches the LLM (audit by inspecting prompt traces in Langfuse)

---

## 11. What this document is NOT

- Not a green-light to start staffing audit implementation. The spec is explicit: synthetic-data gap report next, THEN J reviews, THEN code.
- Not an audit itself. This is the inventory; the audit's first run will surface findings.
- Not a redesign of staffing data shapes. The fragmentation is documented for the gap report; reshape decisions are J's call, not this recon's.
- Not a modification of the distillation v1.0.0 substrate. Per spec: "DO NOT modify the completed distillation pipeline unless a blocking integration bug is found."

---

## 12. Phase 1 readiness checklist

Before staffing implementation starts, the following must be true:

- [x] Recon doc exists (this file)
- [ ] Synthetic-data gap report exists (next)
- [ ] J reviews both before any code change
- [ ] J approves audit scope + first invariants

Phase 1 is unblocked only after the gap report is reviewed.