
# Staffing Lakehouse × Distillation Substrate — Recon

**Date:** 2026-04-27
**Status:** Phase 0 (read-only inventory — no implementation yet)
**Spec:** J's "Lakehouse Staffing Integration" prompt
**Distillation tag (consumer of):** `distillation-v1.0.0` (commit `e7636f2`)

This document inventories the staffing surface in the Lakehouse repo and identifies where the distillation substrate (Phases 0-8) should attach as a *consumer*. **No distillation core mutation — staffing builds on top.**

The headline finding: **staffing has substantial existing infrastructure but is undocumented as a system.** Validators are scaffolds, scenarios are test fixtures, synthetic data spans 6+ shapes with overlapping intent, and there's no unified staffing audit. The integration work is orchestration over what already exists, not greenfield.

---

## 1. Existing staffing schemas

### Rust validators (`crates/validator/src/staffing/`)

| File | Shape | Status |
|---|---|---|
| `mod.rs` | trait + module wiring | scaffold complete |
| `fill.rs::FillValidator` | validates `{fills: [{candidate_id, name}]}` against Artifact::FillProposal | schema check live; worker-existence + status + geo checks are TODO (commented in source) |
| `playbook.rs::PlaybookValidator` | validates Artifact::Playbook (operation format, endorsed_names cap, fingerprint) | schema-shape only; no semantic content check |
| `email.rs` | email-domain validation | scaffold |

### Profiles (`crates/shared/src/profiles/`)

| File | Purpose |
|---|---|
| `execution.rs` | execution profile (model routing per task class) |
| `memory.rs` | MemoryProfile (Phase 19 playbook boost ceiling, history cap, doc stale window, auto-retire) |
| `observer.rs` | Observer profile (failure cluster size, alert cooldown, ring size, langfuse forward) |
| `retrieval.rs` | RetrievalProfile (top_k, rerank_top_k, freshness cutoff, boost_playbook_memory, enforce_sensitivity_gates) |

These profiles are **typed**, but whether they are enforced at runtime is unverified; auditing that is part of Phase 1 work.

### PII (`crates/shared/src/pii.rs`)

`detect_sensitivity(column_name)` → maps column names to sensitivity classes (`Pii`, `Financial`, `Public`). Verified by tests:
- `email`, `contact_email`, `ssn` → Pii
- `salary`, `bill_rate` → Financial

`catalogd::service.rs:264` carries `column_redactions: HashMap<String, Redaction>` per dataset. The catalog declares redactions, but the audit needs to confirm masking is actually applied at query time.

---

## 2. Synthetic data inventory

| File | Rows | Shape | Status assessment |
|---|---|---|---|
| `data/datasets/candidates.parquet` | 1,000 | candidate_id, first_name, last_name, email, phone, city, state, skills, years_experience, hourly_rate_usd, status | **Has PII (raw email + phone)**. CAND-* IDs. status field: `placed` plus other values not yet enumerated. Compact + realistic. |
| `data/datasets/job_orders.parquet` | 15,000 | job_order_id, client_id, title, vertical, bill_rate, pay_rate, status, city, state, zip, description | JO-* IDs, CLI-* clients. Verticals: Admin, Manufacturing(?), etc. Realistic shape. **No candidate-fill linkage table observed.** |
| `data/datasets/workers_500k.parquet` | 500,000 | worker_id (int), name, role, email, phone, city, state, zip, skills (CSV string), certifications, archetype, reliability/responsiveness/engagement/compliance/availability (0-1 floats), communications (multi-msg string), resume_text | **Largest + richest source.** Has PII. archetype enum (flexible/?). 5-axis personality scores. Resume text + comm log = good RAG/SFT material. |
| `data/datasets/workers_100k.parquet` | 100,000 | (presumed same as 500k) | scaled-down sibling |
| `data/datasets/ethereal_workers.parquet` | 10,000 | same as workers_500k schema | scenario-friendly subset |
| `data/datasets/client_workersi.parquet` | 160 | worker_id, name, role, city, state, email, phone, skills, certifications, availability, reliability, archetype | **Different shape** (no scores beyond reliability+availability, no resume_text). Probably client-side "approved roster" — the worker pool a client has historically used. |
| `data/datasets/client_workerskjkk.parquet` | (similar) | (same as above) | typo-named sibling — gap to clean up |
| `data/datasets/sparse_workers.parquet` | 200 | name, phone, role, city, state, notes | **Different shape** — no IDs, no scores, just contact + notes. Looks like edge-case test data (sparse field coverage). |
| `data/datasets/new_candidates.parquet` | 3 | name, phone, email, city, state, skills, years | Demo / smoke-test data. Tiny. |

**Total worker-shape rows on disk: ~610k** across 5 files (500,000 + 100,000 + 10,000 + 160 + 200, per the row counts above). Schema fragmentation (3 distinct shapes) is a real issue; see the gap report.
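The shape count can be reproduced mechanically by fingerprinting each file's column list. A minimal sketch, assuming the column lists have already been read from the Parquet files elsewhere; `shapeSignature` and `groupByShape` are hypothetical helper names, and the column lists are transcribed from the table above rather than read from disk.

```typescript
// Column lists transcribed from the inventory table. Reading them from the
// Parquet files themselves is assumed to happen elsewhere.
const fullWorkerColumns = [
  "worker_id", "name", "role", "email", "phone", "city", "state", "zip",
  "skills", "certifications", "archetype", "reliability", "responsiveness",
  "engagement", "compliance", "availability", "communications", "resume_text",
];
const rosterColumns = [
  "worker_id", "name", "role", "city", "state", "email", "phone", "skills",
  "certifications", "availability", "reliability", "archetype",
];
const columnsByFile: Record<string, string[]> = {
  "workers_500k.parquet": fullWorkerColumns,
  "ethereal_workers.parquet": fullWorkerColumns,
  "client_workersi.parquet": rosterColumns,
  "client_workerskjkk.parquet": rosterColumns,
  "sparse_workers.parquet": ["name", "phone", "role", "city", "state", "notes"],
};

// Order-insensitive fingerprint of a column list.
function shapeSignature(columns: string[]): string {
  return columns.slice().sort().join(",");
}

// Group file names by identical column fingerprint.
function groupByShape(files: Record<string, string[]>): string[][] {
  const groups: Record<string, string[]> = {};
  Object.keys(files).forEach((file) => {
    const sig = shapeSignature(files[file] || []);
    const bucket = groups[sig] || [];
    bucket.push(file);
    groups[sig] = bucket;
  });
  return Object.keys(groups).map((sig) => groups[sig] || []);
}

const workerShapes = groupByShape(columnsByFile);
console.log(`${workerShapes.length} distinct shapes`, workerShapes);
```

Sorting before joining keeps the fingerprint stable when two files carry the same columns in a different order, which is the case for the roster files versus the full worker schema's shared fields.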

### Scenarios (`tests/multi-agent/scenarios/`)

44 JSON files covering specific staffing days. Sample shape (Heritage Foods Indianapolis 2026-04-23):
```json
{ "client": "Heritage Foods", "date": "2026-04-23", "events": [
  { "kind": "baseline_fill", "at": "10:30", "role": "Machine Operator", "count": 2,
    "city": "Indianapolis", "state": "IN", "shift_start": "10:30 AM" },
  { "kind": "recurring", "at": "10:30", "role": "Receiving Clerk", "count": 1, ... }
]}
```

Event kinds observed: `baseline_fill`, `recurring`. Cities span Indianapolis, Cincinnati, Madison, Toledo, Detroit, Columbus, etc. — Midwestern + Eastern US.
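A small typed reader makes it easy to notice when a scenario file uses an event kind outside the two observed so far. This is a sketch only: the interfaces type just the fields visible in the sample (the elided fields stay untyped), and `unexpectedKinds` and `headcountByRole` are hypothetical helper names.

```typescript
// Event kinds observed so far in tests/multi-agent/scenarios/.
const OBSERVED_KINDS = ["baseline_fill", "recurring"];

// Only the fields visible in the sample are typed; location fields are
// optional because the second sample event is elided in the source.
interface ScenarioEvent {
  kind: string;
  at: string;
  role: string;
  count: number;
  city?: string;
  state?: string;
  shift_start?: string;
}

interface Scenario {
  client: string;
  date: string;
  events: ScenarioEvent[];
}

// Return each event kind in the scenario that is not in the observed list.
function unexpectedKinds(scenario: Scenario): string[] {
  const found: string[] = [];
  scenario.events.forEach((ev) => {
    if (OBSERVED_KINDS.indexOf(ev.kind) === -1 && found.indexOf(ev.kind) === -1) {
      found.push(ev.kind);
    }
  });
  return found;
}

// Sum the requested headcount per role.
function headcountByRole(scenario: Scenario): Record<string, number> {
  const totals: Record<string, number> = {};
  scenario.events.forEach((ev) => {
    totals[ev.role] = (totals[ev.role] || 0) + ev.count;
  });
  return totals;
}

const heritageSample: Scenario = {
  client: "Heritage Foods",
  date: "2026-04-23",
  events: [
    { kind: "baseline_fill", at: "10:30", role: "Machine Operator", count: 2,
      city: "Indianapolis", state: "IN", shift_start: "10:30 AM" },
    { kind: "recurring", at: "10:30", role: "Receiving Clerk", count: 1 },
  ],
};
console.log(unexpectedKinds(heritageSample), headcountByRole(heritageSample));
```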

### Playbook lessons (`data/_playbook_lessons/`)

64 JSON files. Sample shape (Heritage Foods 2026-04-21):
```json
{ "date": "...", "client": "...", "cities": "...", "states": "...",
  "events_total": 5, "events_ok": 3, "checkpoint_count": 2,
  "model": "gpt-oss:20b", "cloud": false,
  "lesson": "<long markdown analysis>",
  "checkpoints": [{ "after": "09:30", "risk": "...", "hint": "..." }, ...] }
```

These are **post-run retrospectives** — the staffing ops loop wrote them after each scenario completed. Goldmine for RAG.

---

## 3. Ingestion paths + storage layout

### Object storage / Parquet
- `data/datasets/*.parquet` is the disk-resident store. Treated as input by `ingestd` (CSV/JSON/PDF/Postgres/MySQL ingest in `crates/ingestd`).
- **No catalog manifests observed for the staffing parquets** (none under `data/_catalog/manifests/` matching candidate/worker/job names). The datasets exist on disk but may not be registered with `catalogd` — gap.

### MariaDB
- `crates/queryd/src/context.rs` has a "candidates_safe" view referenced by recent code (failed at boot when schema mismatched, see prior memory `feedback_endpoint_probe_discipline.md`).
- Schema for the views isn't visible from grep — needs DB inspection.

### Vector indexes (`data/vectors/`)
- `workers_500k_v8.parquet` — vector corpus matched by `staffing_inference_lakehouse` mode in `config/modes.toml`
- `ethereal_workers_v1.parquet` — alt corpus
- `entity_brief_v1.parquet` — Chicago-permit-style entity briefs (different domain but same indexer)
- `chicago_permits_v1.parquet` — separate but uses same machinery

### KB streams that touch staffing
- `data/_kb/contract_analyses.jsonl` — contractor + permit analyses (related but not staffing per se)
- `data/_kb/staffers.jsonl` — 1.5K, small, not yet inspected
- `data/_kb/outcomes.jsonl` — scenario outcomes log (used by Phase 2 transforms in distillation)
- `data/_playbook_memory/state.json` — Phase 19 playbook memory state

---

## 4. Search / indexing logic

### Staffing-aware mode runner
`config/modes.toml` defines `staffing_inference` task class:
```toml
preferred_mode = "staffing_inference_lakehouse"
default_model = "openai/gpt-oss-120b:free"
matrix_corpus = "workers_500k_v8"
```

The mode runner (Phase 5+ work in this session) composes:
- `EnrichmentFlags { include_file_content, include_bug_fingerprints, include_matrix_chunks, use_relevance_filter, framing: Staffing }`
- Pulls top-K from `workers_500k_v8` corpus
- `FRAMING_STAFFING` system prompt instructs: "only recommend candidates whose names appear in the matrix data; do NOT fabricate workers"

### Pass 4 staffing harness
`scripts/mode_pass4_staffing.ts` ships synthetic FillRequest payloads through the runner. Each request is a JSON `{city, state, role, count, deadline, notes?}` posted as `file_content` (the runner's input shape). The validation question: did the model surface real worker_ids from the corpus, or did it fabricate them?
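The fabrication check itself is a set-membership test. A minimal sketch, not the harness's actual code: `checkProposedIds` and `FabricationReport` are hypothetical names, and extracting the proposed worker_ids from the model's response is assumed to happen upstream.

```typescript
interface FabricationReport {
  real: number[];        // proposed ids that exist in the corpus
  fabricated: number[];  // proposed ids with no corpus match
}

// Split proposed worker_ids into those present in the corpus and those not.
function checkProposedIds(proposed: number[], corpusIds: number[]): FabricationReport {
  // Plain-object lookup keyed by the stringified id.
  const known: Record<string, boolean> = {};
  corpusIds.forEach((id) => { known[String(id)] = true; });

  const report: FabricationReport = { real: [], fabricated: [] };
  proposed.forEach((id) => {
    if (known[String(id)] === true) {
      report.real.push(id);
    } else {
      report.fabricated.push(id);
    }
  });
  return report;
}

// Usage with made-up ids: 999999 is not in the corpus list.
const sampleReport = checkProposedIds([101, 999999, 205], [101, 205, 307]);
console.log(sampleReport);
```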

### What's missing
- **No "candidate matching" deterministic scorer** beyond mode-runner LLM. Staffing audit should add: given a job_order, can we score worker fit deterministically (skills overlap, geo distance, status filter) BEFORE asking the LLM? Currently the LLM does both retrieval and scoring.
- **No indexed link table between candidates.parquet and workers_500k.parquet.** They look like the SAME population in different shapes — the workers_500k has the scores + resume + comms, candidates has the basic contact + status + hourly rate. If they're meant to be different populations, the join key is unclear; if they're the same, there's redundancy.
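A deterministic pre-scorer along the lines described in the first bullet could look like the sketch below. Everything here is an assumption for illustration: the 0.7/0.3 weights, the city/state match standing in for geo distance (the datasets list no coordinates), and an availability threshold standing in for the status filter (workers_500k has no status column). `fitScore` and `rankWorkers` are hypothetical names.

```typescript
interface WorkerRow {
  worker_id: number;
  skills: string;        // CSV string, as in workers_500k.parquet
  city: string;
  state: string;
  availability: number;  // 0-1 float
}

interface FitRequest {
  requiredSkills: string[];  // assumption: derived from the job order upstream
  city: string;
  state: string;
  minAvailability: number;   // assumption: stand-in for a status filter
}

function parseSkills(csv: string): string[] {
  return csv.split(",").map((s) => s.trim().toLowerCase()).filter((s) => s.length > 0);
}

// Fraction of required skills the worker covers (0 when nothing is required).
function skillCoverage(required: string[], workerSkills: string[]): number {
  if (required.length === 0) return 0;
  const hits = required.filter((r) => workerSkills.indexOf(r.toLowerCase()) !== -1).length;
  return hits / required.length;
}

// Crude geo proxy: same city and state = 1, same state only = 0.5, else 0.
function geoScore(req: FitRequest, w: WorkerRow): number {
  if (w.state !== req.state) return 0;
  return w.city === req.city ? 1 : 0.5;
}

// Returns null when the worker is filtered out, otherwise a 0-1 score.
function fitScore(req: FitRequest, w: WorkerRow): number | null {
  if (w.availability < req.minAvailability) return null;
  return 0.7 * skillCoverage(req.requiredSkills, parseSkills(w.skills)) + 0.3 * geoScore(req, w);
}

// Score every worker, drop filtered ones, and sort best-first.
function rankWorkers(req: FitRequest, workers: WorkerRow[]): { worker_id: number; score: number }[] {
  const scored: { worker_id: number; score: number }[] = [];
  workers.forEach((w) => {
    const score = fitScore(req, w);
    if (score !== null) scored.push({ worker_id: w.worker_id, score });
  });
  return scored.sort((a, b) => b.score - a.score);
}
```

Running a scorer like this before the LLM call would let the audit compare the model's picks against a reproducible ranking instead of relying on the model for both retrieval and scoring.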

---

## 5. Audit / event tables

**No staffing-specific audit/event log observed.** Searched for `audit_event`, `outcome_event`, `fill_event` patterns in `crates/` — zero hits. The closest existing infrastructure:
- `data/_kb/outcomes.jsonl` — per-run scenario outcomes (used by distillation transforms)
- `data/_observer/ops.jsonl` — observer ring buffer (general-purpose, not staffing)
- `data/_playbook_lessons/*.json` — post-run lessons (retrospective, not audit)

**Gap:** staffing fills happen, scenarios complete, but **no schema-backed event log** captures: which worker_ids were proposed, accepted, filled, rejected, with what timing, against which job_order. The closest record is in scenarios + playbook_lessons but those are unstructured + per-scenario, not a queryable log.
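One possible shape for such an event record is sketched below. No such schema exists in the repo today; the field names, the `FillAction` values (taken from the proposed/accepted/filled/rejected list above), and the JSONL helpers are all hypothetical.

```typescript
type FillAction = "proposed" | "accepted" | "filled" | "rejected";

interface StaffingFillEvent {
  at: string;            // ISO-8601 timestamp of the event
  job_order_id: string;  // JO-* id from job_orders.parquet
  worker_id: number;     // integer id from the workers corpus
  action: FillAction;
  scenario?: string;     // optional pointer back to the scenario file
}

const FILL_ACTIONS: string[] = ["proposed", "accepted", "filled", "rejected"];

// Runtime guard for rows read back from a JSONL log.
function isFillEvent(value: unknown): value is StaffingFillEvent {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const action = v["action"];
  return typeof v["at"] === "string"
    && typeof v["job_order_id"] === "string"
    && typeof v["worker_id"] === "number"
    && typeof action === "string"
    && FILL_ACTIONS.indexOf(action) !== -1;
}

// One JSON object per line, suitable for appending to a log file.
function toJsonl(events: StaffingFillEvent[]): string {
  return events.map((e) => JSON.stringify(e)).join("\n");
}

// Parse a JSONL string, keeping only rows that pass the guard.
function parseJsonl(text: string): StaffingFillEvent[] {
  const out: StaffingFillEvent[] = [];
  text.split("\n").forEach((line) => {
    if (line.trim().length === 0) return;
    const row: unknown = JSON.parse(line);
    if (isFillEvent(row)) out.push(row);
  });
  return out;
}
```

With per-event rows like these, a question such as "which proposed workers were later rejected for a given job order" becomes a filter over the log rather than a read of free-text lessons.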

---

## 6. PII / tokenization boundaries

### Detection
`crates/shared/src/pii.rs::detect_sensitivity` recognizes: `email`, `contact_email`, `ssn`, `phone` → Pii. `salary`, `bill_rate`, `pay_rate` → Financial.

### Enforcement
`catalogd::service.rs` carries per-dataset `column_redactions: HashMap<String, Redaction>` — but enforcement at query time wasn't visible from initial grep. Auditing whether masking actually happens when `staffing_inference_lakehouse` retrieves from `workers_500k_v8` is in scope.

### Risk
Raw email + phone live in `workers_500k.parquet` and `candidates.parquet`. If the LLM mode runner retrieves chunks and the catalog hasn't masked them, **the LLM sees PII**. Spec says "do not expose raw PII to AI" — auditing this is non-negotiable for the staffing integration.
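A first deterministic check for this risk is to scan whatever text reaches the model for raw email addresses and phone numbers. The sketch below assumes the prompt or retrieved chunk is available as a plain string; the regexes are deliberately simple approximations (US-style phone numbers only, and any 10-digit run will also match), and `findRawPii` is a hypothetical helper, not part of `pii.rs`.

```typescript
// Approximate patterns; tuned for recall over precision.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const US_PHONE_RE = /(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/g;

interface PiiHits {
  emails: string[];
  phones: string[];
}

// Collect every raw email and phone-like substring in the text.
function findRawPii(text: string): PiiHits {
  return {
    emails: text.match(EMAIL_RE) || [],
    phones: text.match(US_PHONE_RE) || [],
  };
}

// True when the text contains no email or phone-like substrings.
function isPiiFree(text: string): boolean {
  const hits = findRawPii(text);
  return hits.emails.length === 0 && hits.phones.length === 0;
}

// Usage with made-up contact details.
const rawChunk = "Reach Jane at jane.doe@example.com or (317) 555-0100";
const maskedChunk = "Reach worker 4821 at [EMAIL] or [PHONE]";
console.log(findRawPii(rawChunk), isPiiFree(maskedChunk));
```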

---

## 7. PRD docs

- `docs/PRD.md` — main PRD. §32 names staffing as the reference implementation. §158 explicitly notes Phase 19 playbook learning was originally write-only, claims it's now closed — **verify**.
- `docs/CONTROL_PLANE_PRD.md` — long-horizon vision (2026-04-22 pivot)

PRD references staffing throughout but doesn't itemize a "staffing PRD checklist" the way the auditor's pr_audit mode expects per-PR claims. Drift detection between PRD claims and code reality is exactly the auditor's job — running it on the PRD as input rather than a PR diff is a configuration shift, not new code.

---

## 8. Where distillation outputs should attach

The Phase 0-8 distillation substrate is **already feeding the staffing surface in two places**:

1. **`staffing_inference_lakehouse` mode → `workers_500k_v8` matrix corpus.** This is read-only consumption; no change needed.
2. **`pr_audit` mode → `lakehouse_answers_v1` corpus.** Generic; not staffing-specific.

**What's missing for staffing:**

a. **Staffing-specific RAG corpus** — `staffing_answers_v1` built from playbook_lessons + scored scenarios. Same builder pattern as `lakehouse_answers_v1` (commit `0844206`'s `scripts/build_answers_corpus.ts`); just point at staffing inputs.

b. **Staffing audit task class** — `staffing_audit` mode in `config/modes.toml`, paralleling the auditor's `pr_audit` work. Reads PRD claims + scenario outcomes, asks "do we ship what the PRD claims for staffing?"

c. **Staffing acceptance fixture** — same shape as `tests/fixtures/distillation/acceptance/` but with synthetic candidate + job_order + scenario + lesson rows. Pins staffing invariants: PII masked, candidates valid, scenarios reproducible.

d. **Staffing replay tasks** — drop sample fill requests through `./scripts/distill replay` to see if the local model proposes real worker_ids vs fabricates.

**Implementation approach (deferred until gap report + J approval):**

```
scripts/staffing/
  audit.ts              # ./scripts/staffing audit — single entry
  build_answers.ts      # build_staffing_answers_v1 from lessons + scenarios
  build_corpus_v9.ts    # rebuild workers_500k_v9 with PII masking applied
  acceptance.ts         # staffing-specific 22-invariant gate

tests/fixtures/staffing/
  candidates_sample.parquet
  job_orders_sample.parquet
  scenario_sample.json
  lesson_sample.json

reports/staffing/
  staffing-audit-report.md
  staffing-prd-drift-report.md
  staffing-search-quality-report.md
  staffing-synthetic-data-report.md
```

**ALL of the above is consumer-side.** The distillation pipeline's `scripts/distillation/`, `auditor/schemas/distillation/`, and Phase 0-8 commits are NOT touched.

---

## 9. Risks identified during recon

1. **Synthetic data shape fragmentation.** 3 distinct worker schemas across 5 files. If the staffing audit assumes one shape and the system uses another, the audit will silently skip the records it cannot parse.
2. **PII enforcement unverified.** Catalog has a redaction primitive; whether it's wired to mode-runner retrieval is the audit's first deterministic check.
3. **No structured staffing audit log.** Lessons + outcomes are retrospective summaries, not per-event records. Without per-event records, deterministic checks like "every worker proposed by the LLM exists in workers_500k" can't run on historical scenarios.
4. **Validator scaffolds.** `FillValidator::validate` does schema-shape only — the worker-existence/status/geo TODOs in the source are exactly the deterministic gates the staffing audit needs to run. Wiring them is consumer work, not distillation work.
5. **Fragile PRD ↔ code linkage.** PRD §158 claims Phase 19 closed the playbook write-only gap; no audit verifies this. The staffing-prd-drift-report should run inference-style claim verification against PRD claims, similar to the auditor's pr_audit but with the PRD as the source instead of a PR diff.
6. **`workers_500k_v8` is the embedded corpus the LLM sees.** If it carries PII without masking, the LLM has been seeing PII. Auditing the corpus content (not just the SQL views) is required.
7. **64 playbook_lessons + 44 scenarios = 108 RAG candidates.** Plenty for a staffing_answers corpus, but PII filtering must apply before vectorization. Currently lessons may contain worker names ("Susan X. Ruiz double-booked").

---

## 10. Recommended integration points (where consumer code attaches)

1. **Staffing audit script** at `scripts/staffing/audit.ts` reads from existing distillation outputs:
   - `data/scored-runs/` (filter to task_id starting `permit:` or `scenario:`)
   - `exports/quarantine/*.jsonl` (any staffing-specific quarantines)
   - `reports/distillation/<latest>/summary.json` (cross-reference)

2. **Reuse Phase 5 receipts harness**: the staffing audit writes a `StageReceipt` matching the existing schema. There are three options for recording the stage: (a) extend the `stage` enum with `"staffing-audit"`, only after a schema-version bump if one is needed; (b) reuse the existing reserved `"index"` slot; or (c) write a parallel manifest under `reports/staffing/` instead.

3. **Reuse Phase 1 schemas** — RagSample, SftSample, PreferenceSample work for staffing data without modification. The `tags` array can carry `task:staffing.fill` to keep the corpus self-tagged.

4. **Reuse Phase 7 replay** — `./scripts/distill replay --task "fill 2 welders in Toledo OH"` already works; just feed it from synthetic FillRequest payloads.

5. **Reuse Phase 8 audit-full** — its drift baseline tracks distillation metrics; staffing audit gets its OWN baseline file at `data/_kb/staffing_audit_baselines.jsonl`.

6. **Schema invariants for staffing**:
   - every candidate_id in candidates.parquet appears in workers_500k.parquet OR is documented as "candidate-distinct-from-worker"
   - every status value in candidates.parquet is in a known enum
   - every email in workers/candidates is masked when it reaches the LLM (audit by inspecting prompt traces in Langfuse)
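The first two invariants reduce to simple list checks once the rows are in memory. The sketch below is illustrative only: `KNOWN_STATUSES` is a placeholder (only `placed` is confirmed by the inventory), and the `linkTable` from candidate_id to worker_id is hypothetical, since no such link table exists yet.

```typescript
interface CandidateRow {
  candidate_id: string;  // CAND-* id
  status: string;
}

// Placeholder enum: only "placed" is confirmed; the rest are assumptions.
const KNOWN_STATUSES = ["placed", "available", "inactive"];

// Invariant: every status value is in the known enum.
function statusViolations(rows: CandidateRow[], known: string[]): CandidateRow[] {
  return rows.filter((r) => known.indexOf(r.status) === -1);
}

// Invariant: every candidate is linked to a worker OR documented as distinct.
// linkTable maps candidate_id -> worker_id (hypothetical; not in the repo).
function unlinkedCandidates(
  rows: CandidateRow[],
  linkTable: Record<string, number>,
  documentedDistinct: string[],
): string[] {
  return rows
    .map((r) => r.candidate_id)
    .filter((id) =>
      !Object.prototype.hasOwnProperty.call(linkTable, id)
      && documentedDistinct.indexOf(id) === -1);
}

// Usage with made-up rows.
const sampleCandidates: CandidateRow[] = [
  { candidate_id: "CAND-1", status: "placed" },
  { candidate_id: "CAND-2", status: "ghosted" },
  { candidate_id: "CAND-3", status: "available" },
];
console.log(statusViolations(sampleCandidates, KNOWN_STATUSES));
console.log(unlinkedCandidates(sampleCandidates, { "CAND-1": 1001 }, ["CAND-3"]));
```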

---

## 11. What this document is NOT

- Not a green-light to start staffing audit implementation. The spec is explicit: synthetic-data gap report next, THEN J reviews, THEN code.
- Not an audit itself. This is the inventory — the audit's first run will surface findings.
- Not a redesign of staffing data shapes. The fragmentation is documented for the gap report; reshape decisions are J's call, not this recon's.
- Not a modification of the distillation v1.0.0 substrate. Per spec: "DO NOT modify the completed distillation pipeline unless a blocking integration bug is found."

---

## 12. Phase 1 readiness checklist

Before staffing implementation starts, the following must be true:

- [x] Recon doc exists (this file)
- [ ] Synthetic-data gap report exists (next)
- [ ] J reviews both before any code change
- [ ] J approves audit scope + first invariants

Phase 1 is unblocked only after J has reviewed the gap report and approved the audit scope and first invariants.