Spec mandates these two docs before any staffing audit runner ships:
docs/recon/staffing-lakehouse-distillation-recon.md
reports/staffing/synthetic-data-gap-report.md
NO distillation core touched. Distillation v1.0.0 (commit e7636f2,
tag distillation-v1.0.0) remains the stable substrate. Staffing
work is consumer-only.
Recon findings (12 sections, ~5KB):
- Existing staffing schemas in crates/validator/staffing/* are scaffolds
(FillValidator schema-shape only; worker-existence/status/geo TODOs)
- Synthetic data spans 6+ shapes across 9 parquet files
(~625k worker-shape rows + 1k candidate-shape rows)
- PII detection lives in shared/pii.rs but enforcement at query
time is unverified — the LLM may have been seeing raw PII via
workers_500k_v8 vector corpus
- 44 scenarios + 64 playbook_lessons = ~108 RAG candidates
- No structured fill-event log exists; scenarios+lessons are
retrospective, not queryable per-event records
- workers_500k.phone is int (should be string — leading-zero loss)
- client_workerskjkk.parquet is a typo file (160 rows, sibling of
client_workersi.parquet)
- PRD §158 claims Phase 19 closed playbook write-only gap — unverified
Gap report findings (9 sections, ~6KB):
- 4 BLOCKING gaps requiring J decisions before audit ships:
A. Generate fill_events.parquet from scenarios + lessons?
B. Build views/{candidates,workers,jobs}_safe with PII masking?
C. Delete client_workerskjkk.parquet typo file?
D. Fix workers_500k.phone type (int → string)?
- 5 SOFT gaps the audit can run with (will be reported as findings)
- 3 NON-gaps (data sufficient as-is)
- Recommendation: NO new synthetic data needed; only normalization
of what already exists, contingent on J approval of A-D
Up-front commitments:
- Distillation v1.0.0 substrate untouched (verified by audit-full
running clean before+after each staffing change)
- All synthetic-data modifications via deterministic scripts under
scripts/staffing/, never hand-edit
- Every staffing artifact carries canonical sha256 provenance back
to source parquet/scenario/lesson
- _safe views are the source of truth for LLM-facing text; raw
parquets never directly fed into corpus builds
Phase 1 unblocks AFTER J reviews both docs and approves audit scope
+ the 4 gap-fix decisions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Staffing Lakehouse × Distillation Substrate — Recon
Date: 2026-04-27
Status: Phase 0 (read-only inventory — no implementation yet)
Spec: J's "Lakehouse Staffing Integration" prompt
Distillation tag (consumer of): distillation-v1.0.0 (commit e7636f2)
This document inventories the staffing surface in the Lakehouse repo and identifies where the distillation substrate (Phases 0-8) should attach as a consumer. No distillation core mutation — staffing builds on top.
The headline finding: staffing has substantial existing infrastructure but is undocumented as a system. Validators are scaffolds, scenarios are test fixtures, synthetic data spans 6+ shapes with overlapping intent, and there's no unified staffing audit. The integration work is orchestration over what already exists, not greenfield.
1. Existing staffing schemas
Rust validators (crates/validator/src/staffing/)
| File | Shape | Status |
|---|---|---|
| mod.rs | trait + module wiring | scaffold complete |
| fill.rs::FillValidator | validates {fills: [{candidate_id, name}]} against Artifact::FillProposal | schema check live; worker-existence + status + geo checks are TODO (commented in source) |
| playbook.rs::PlaybookValidator | validates Artifact::Playbook (operation format, endorsed_names cap, fingerprint) | schema-shape only; no semantic content check |
| email.rs | email-domain validation | scaffold |
Profiles (crates/shared/src/profiles/)
| File | Purpose |
|---|---|
| execution.rs | execution profile (model routing per task class) |
| memory.rs | MemoryProfile (Phase 19 playbook boost ceiling, history cap, doc stale window, auto-retire) |
| observer.rs | Observer profile (failure cluster size, alert cooldown, ring size, langfuse forward) |
| retrieval.rs | RetrievalProfile (top_k, rerank_top_k, freshness cutoff, boost_playbook_memory, enforce_sensitivity_gates) |
These are typed, but verifying whether they are enforced at runtime is part of Phase 1 work.
PII (crates/shared/src/pii.rs)
detect_sensitivity(column_name) → maps column names to sensitivity classes (Pii, Financial, Public). Verified by tests:
email, contact_email, ssn → Pii; salary, bill_rate → Financial
catalogd::service.rs:264 carries column_redactions: HashMap<String, Redaction> per dataset. Catalog enforces, but the audit needs to confirm masking is actually applied at query time.
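The audit's query-time check can be sketched in TypeScript before the Rust wiring exists. This is a minimal sketch, assuming the column-name → sensitivity mapping described above for pii.rs; the function names `detectSensitivity` and `maskRow` are hypothetical, not existing APIs.

```typescript
// Sketch of column-name → sensitivity classification, mirroring the
// behavior this recon attributes to crates/shared/src/pii.rs.
type Sensitivity = "Pii" | "Financial" | "Public";

const PII_COLUMNS = new Set(["email", "contact_email", "ssn", "phone"]);
const FINANCIAL_COLUMNS = new Set(["salary", "bill_rate", "pay_rate"]);

function detectSensitivity(column: string): Sensitivity {
  if (PII_COLUMNS.has(column)) return "Pii";
  if (FINANCIAL_COLUMNS.has(column)) return "Financial";
  return "Public";
}

// The query-time assertion the audit needs: any row about to reach the
// LLM must have its Pii columns masked, with public columns untouched.
function maskRow(row: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [col, val] of Object.entries(row)) {
    out[col] = detectSensitivity(col) === "Pii" ? "[REDACTED]" : val;
  }
  return out;
}
```

The deterministic check is then: run `maskRow` over a sample of rows retrieved via the mode runner's path and diff against what actually reached the prompt trace.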
2. Synthetic data inventory
| File | Rows | Shape | Status assessment |
|---|---|---|---|
| data/datasets/candidates.parquet | 1,000 | candidate_id, first_name, last_name, email, phone, city, state, skills, years_experience, hourly_rate_usd, status | Has PII (raw email + phone). CAND-* IDs. status field: placed plus unknown others. Compact + realistic. |
| data/datasets/job_orders.parquet | 15,000 | job_order_id, client_id, title, vertical, bill_rate, pay_rate, status, city, state, zip, description | JO-* IDs, CLI-* clients. Verticals: Admin, Manufacturing(?), etc. Realistic shape. No candidate-fill linkage table observed. |
| data/datasets/workers_500k.parquet | 500,000 | worker_id (int), name, role, email, phone, city, state, zip, skills (CSV string), certifications, archetype, reliability/responsiveness/engagement/compliance/availability (0-1 floats), communications (multi-msg string), resume_text | Largest + richest source. Has PII. archetype enum (flexible/?). 4-axis personality scores. Resume text + comm log = good RAG/SFT material. |
| data/datasets/workers_100k.parquet | 100,000 | (presumed same as 500k) | scaled-down sibling |
| data/datasets/ethereal_workers.parquet | 10,000 | same as workers_500k schema | scenario-friendly subset |
| data/datasets/client_workersi.parquet | 160 | worker_id, name, role, city, state, email, phone, skills, certifications, availability, reliability, archetype | Different shape (no scores beyond reliability + availability, no resume_text). Probably a client-side "approved roster" — the worker pool a client has historically used. |
| data/datasets/client_workerskjkk.parquet | (similar) | (same as above) | typo-named sibling — gap to clean up |
| data/datasets/sparse_workers.parquet | 200 | name, phone, role, city, state, notes | Different shape — no IDs, no scores, just contact + notes. Looks like edge-case test data (sparse field coverage). |
| data/datasets/new_candidates.parquet | 3 | name, phone, email, city, state, skills, years | Demo / smoke-test data. Tiny. |
Total worker-shape rows on disk: ~625k across 5 files. Schema fragmentation (3 distinct shapes) is a real issue — see gap report.
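A normalization layer over the three worker shapes might look like the following sketch. The `CanonicalWorker` shape, field defaults, and function names are assumptions for illustration (the actual shape is a gap-report decision); the phone handling shows why gap D matters: once a leading zero is lost to an int column, stringifying cannot recover it.

```typescript
// Hypothetical canonical worker record unifying the three observed shapes.
interface CanonicalWorker {
  worker_id: string | null; // sparse_workers rows carry no ID
  name: string;
  phone: string;            // always string; int storage loses leading zeros
  city: string;
  state: string;
  source_file: string;      // provenance back to the source parquet
}

function normalizePhone(raw: string | number): string {
  // workers_500k stores phone as int; converting stops FURTHER loss,
  // but a leading zero already dropped by the int type is unrecoverable.
  return typeof raw === "number" ? String(raw) : raw;
}

function normalizeWorker(
  row: Record<string, unknown>,
  sourceFile: string,
): CanonicalWorker {
  return {
    worker_id: row.worker_id != null ? String(row.worker_id) : null,
    name: String(row.name ?? ""),
    phone: normalizePhone((row.phone as string | number) ?? ""),
    city: String(row.city ?? ""),
    state: String(row.state ?? ""),
    source_file: sourceFile,
  };
}
```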
Scenarios (tests/multi-agent/scenarios/)
44 JSON files covering specific staffing days. Sample shape (Heritage Foods Indianapolis 2026-04-23):
{ "client": "Heritage Foods", "date": "2026-04-23", "events": [
{ "kind": "baseline_fill", "at": "10:30", "role": "Machine Operator", "count": 2,
"city": "Indianapolis", "state": "IN", "shift_start": "10:30 AM" },
{ "kind": "recurring", "at": "10:30", "role": "Receiving Clerk", "count": 1, ... }
]}
Event kinds observed: baseline_fill, recurring. Cities span Indianapolis, Cincinnati, Madison, Toledo, Detroit, Columbus, etc. — Midwestern + Eastern US.
Playbook lessons (data/_playbook_lessons/)
64 JSON files. Sample shape (Heritage Foods 2026-04-21):
{ "date": "...", "client": "...", "cities": "...", "states": "...",
"events_total": 5, "events_ok": 3, "checkpoint_count": 2,
"model": "gpt-oss:20b", "cloud": false,
"lesson": "<long markdown analysis>",
"checkpoints": [{ "after": "09:30", "risk": "...", "hint": "..." }, ...] }
These are post-run retrospectives — the staffing ops loop wrote them after each scenario completed. Goldmine for RAG.
3. Ingestion paths + storage layout
Object storage / Parquet
- data/datasets/*.parquet is the disk-resident store. Treated as input by ingestd (CSV/JSON/PDF/Postgres/MySQL ingest in crates/ingestd).
- No catalog manifests observed for the staffing parquets (none under data/_catalog/manifests/ matching candidate/worker/job names). The datasets exist on disk but may not be registered with catalogd — gap.
MariaDB
- crates/queryd/src/context.rs has a "candidates_safe" view referenced by recent code (failed at boot when schema mismatched; see prior memory feedback_endpoint_probe_discipline.md).
- Schema for the views isn't visible from grep — needs DB inspection.
Vector indexes (data/vectors/)
- workers_500k_v8.parquet — vector corpus matched by staffing_inference_lakehouse mode in config/modes.toml
- ethereal_workers_v1.parquet — alt corpus
- entity_brief_v1.parquet — Chicago-permit-style entity briefs (different domain but same indexer)
- chicago_permits_v1.parquet — separate but uses same machinery
KB streams that touch staffing
- data/_kb/contract_analyses.jsonl — contractor + permit analyses (related but not staffing per se)
- data/_kb/staffers.jsonl — 1.5K, small, not yet inspected
- data/_kb/outcomes.jsonl — scenario outcomes log (used by Phase 2 transforms in distillation)
- data/_playbook_memory/state.json — Phase 19 playbook memory state
4. Search / indexing logic
Staffing-aware mode runner
config/modes.toml defines staffing_inference task class:
preferred_mode = "staffing_inference_lakehouse"
default_model = "openai/gpt-oss-120b:free"
matrix_corpus = "workers_500k_v8"
The mode runner (Phase 5+ work in this session) composes:
- EnrichmentFlags { include_file_content, include_bug_fingerprints, include_matrix_chunks, use_relevance_filter, framing: Staffing }
- Pulls top-K from the workers_500k_v8 corpus
- FRAMING_STAFFING system prompt instructs: "only recommend candidates whose names appear in the matrix data; do NOT fabricate workers"
Pass 4 staffing harness
scripts/mode_pass4_staffing.ts ships synthetic FillRequest payloads through the runner. Each request is a JSON {city, state, role, count, deadline, notes?} posted as file_content (the runner's input shape). Validation checks whether the model surfaced real worker_ids from the corpus or fabricated them.
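The fabrication check described above is deterministic once the corpus's worker_ids are loaded. A minimal sketch, assuming the request shape quoted above; the `checkFabrication` helper is hypothetical, not part of mode_pass4_staffing.ts.

```typescript
// Request shape matching the pass-4 harness payload described above.
interface FillRequest {
  city: string;
  state: string;
  role: string;
  count: number;
  deadline: string;
  notes?: string;
}

// Partition model-proposed worker_ids into "exists in corpus" vs
// "fabricated" — the harness's core pass/fail signal.
function checkFabrication(
  proposedIds: string[],
  corpusIds: Set<string>,
): { real: string[]; fabricated: string[] } {
  const real: string[] = [];
  const fabricated: string[] = [];
  for (const id of proposedIds) {
    (corpusIds.has(id) ? real : fabricated).push(id);
  }
  return { real, fabricated };
}
```

A run fails fast when `fabricated` is non-empty, giving a hard gate that does not depend on any LLM judgment.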
What's missing
- No "candidate matching" deterministic scorer beyond mode-runner LLM. Staffing audit should add: given a job_order, can we score worker fit deterministically (skills overlap, geo distance, status filter) BEFORE asking the LLM? Currently the LLM does both retrieval and scoring.
- No indexed link table between candidates.parquet and workers_500k.parquet. They look like the SAME population in different shapes — the workers_500k has the scores + resume + comms, candidates has the basic contact + status + hourly rate. If they're meant to be different populations, the join key is unclear; if they're the same, there's redundancy.
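The deterministic scorer the first bullet above calls for could be as small as the following sketch. The field names, same-state geo proxy, and 0.7/0.3 weights are illustrative assumptions, not an existing API; a real scorer would use zip-level distance and the actual status enum.

```typescript
// Hypothetical worker/job shapes, trimmed to what the scorer needs.
interface Worker {
  worker_id: string;
  skills: string[];
  state: string;
  availability: number; // 0-1 float, used here as a crude status filter
}
interface JobOrder {
  title: string;
  required_skills: string[];
  state: string;
}

// Deterministic fit score in [0, 1]: skills overlap, weighted with a
// same-state geo proxy; unavailable workers are filtered to zero.
function scoreFit(w: Worker, j: JobOrder): number {
  if (w.availability <= 0) return 0;
  const req = new Set(j.required_skills.map((s) => s.toLowerCase()));
  const overlap = w.skills.filter((s) => req.has(s.toLowerCase())).length;
  const skillScore = req.size > 0 ? overlap / req.size : 0;
  const geoScore = w.state === j.state ? 1 : 0;
  return 0.7 * skillScore + 0.3 * geoScore;
}
```

Running this BEFORE the LLM turns the mode runner into a reranker over a deterministic shortlist instead of doing retrieval and scoring in one opaque step.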
5. Audit / event tables
No staffing-specific audit/event log observed. Searched for audit_event, outcome_event, fill_event patterns in crates/ — zero hits. The closest existing infrastructure:
- data/_kb/outcomes.jsonl — per-run scenario outcomes (used by distillation transforms)
- data/_observer/ops.jsonl — observer ring buffer (general-purpose, not staffing)
- data/_playbook_lessons/*.json — post-run lessons (retrospective, not audit)
Gap: staffing fills happen, scenarios complete, but no schema-backed event log captures: which worker_ids were proposed, accepted, filled, rejected, with what timing, against which job_order. The closest record is in scenarios + playbook_lessons but those are unstructured + per-scenario, not a queryable log.
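If gap A is approved, the derivation is mechanical: each scenario event becomes one row of a structured log. A sketch under assumptions — the `FillEvent` field names and deterministic `event_id` scheme are illustrative, drawn from the scenario shape in section 2, not an existing schema.

```typescript
// Event shape as observed in tests/multi-agent/scenarios/ JSON files.
interface ScenarioEvent {
  kind: string;
  at: string;
  role: string;
  count: number;
  city: string;
  state: string;
}

// Hypothetical per-event record for a fill_events table.
interface FillEvent {
  event_id: string; // deterministic: client + date + index, replay-stable
  client: string;
  date: string;
  kind: string;
  role: string;
  requested_count: number;
  city: string;
  state: string;
}

function deriveFillEvents(
  client: string,
  date: string,
  events: ScenarioEvent[],
): FillEvent[] {
  return events.map((e, i) => ({
    event_id: `${client}:${date}:${i}`,
    client,
    date,
    kind: e.kind,
    role: e.role,
    requested_count: e.count,
    city: e.city,
    state: e.state,
  }));
}
```

Proposed/accepted/rejected worker_ids would still need to come from run traces; this derivation only recovers the request side, which is what the historical scenarios actually contain.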
6. PII / tokenization boundaries
Detection
crates/shared/src/pii.rs::detect_sensitivity recognizes: email, contact_email, ssn, phone → Pii. salary, bill_rate, pay_rate → Financial.
Enforcement
catalogd::service.rs carries per-dataset column_redactions: HashMap<String, Redaction> — but enforcement at query time wasn't visible from initial grep. Auditing whether masking actually happens when staffing_inference_lakehouse retrieves from workers_500k_v8 is in scope.
Risk
Raw email + phone live in workers_500k.parquet and candidates.parquet. If the LLM mode runner retrieves chunks and the catalog hasn't masked them, the LLM sees PII. Spec says "do not expose raw PII to AI" — auditing this is non-negotiable for the staffing integration.
7. PRD docs
- docs/PRD.md — main PRD. §32 names staffing as the reference implementation. §158 explicitly notes Phase 19 playbook learning was originally write-only and claims it's now closed — verify.
- docs/CONTROL_PLANE_PRD.md — long-horizon vision (2026-04-22 pivot)
PRD references staffing throughout but doesn't itemize a "staffing PRD checklist" the way the auditor's pr_audit mode expects per-PR claims. Drift detection between PRD claims and code reality is exactly the auditor's job — running it on the PRD as input rather than a PR diff is a configuration shift, not new code.
8. Where distillation outputs should attach
The Phase 0-8 distillation substrate is already feeding the staffing surface in two places:
- staffing_inference_lakehouse mode → workers_500k_v8 matrix corpus. This is read-only consumption; no change needed.
- pr_audit mode → lakehouse_answers_v1 corpus. Generic; not staffing-specific.
What's missing for staffing:
a. Staffing-specific RAG corpus — staffing_answers_v1 built from playbook_lessons + scored scenarios. Same builder pattern as lakehouse_answers_v1 (commit 0844206's scripts/build_answers_corpus.ts); just point at staffing inputs.
b. Staffing audit task class — staffing_audit mode in config/modes.toml, paralleling the auditor's pr_audit work. Reads PRD claims + scenario outcomes, asks "do we ship what the PRD claims for staffing?"
c. Staffing acceptance fixture — same shape as tests/fixtures/distillation/acceptance/ but with synthetic candidate + job_order + scenario + lesson rows. Pins staffing invariants: PII masked, candidates valid, scenarios reproducible.
d. Staffing replay tasks — drop sample fill requests through ./scripts/distill replay to see if the local model proposes real worker_ids vs fabricates.
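For item a, the mapping from a playbook lesson to a corpus sample is a one-liner per record. A sketch under assumptions: the `RagSample` fields here are a stand-in for the real Phase 1 schema, and `lessonToRagSample` is hypothetical; PII filtering (worker names inside lesson text) must still run before vectorization, as section 9 notes.

```typescript
// Lesson shape as observed in data/_playbook_lessons/*.json (section 2).
interface Lesson {
  date: string;
  client: string;
  lesson: string;      // long markdown analysis
  events_total: number;
  events_ok: number;
}

// Stand-in for the Phase 1 RagSample schema; real fields may differ.
interface RagSample {
  id: string;
  text: string;
  tags: string[];
}

function lessonToRagSample(l: Lesson): RagSample {
  return {
    id: `lesson:${l.client}:${l.date}`,
    text: l.lesson,
    tags: ["task:staffing.fill", `events_ok:${l.events_ok}/${l.events_total}`],
  };
}
```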
Implementation approach (deferred until gap report + J approval):
scripts/staffing/
audit.ts # ./scripts/staffing audit — single entry
build_answers.ts # build_staffing_answers_v1 from lessons + scenarios
build_corpus_v9.ts # rebuild workers_500k_v9 with PII masking applied
acceptance.ts # staffing-specific 22-invariant gate
tests/fixtures/staffing/
candidates_sample.parquet
job_orders_sample.parquet
scenario_sample.json
lesson_sample.json
reports/staffing/
staffing-audit-report.md
staffing-prd-drift-report.md
staffing-search-quality-report.md
staffing-synthetic-data-report.md
ALL of the above is consumer-side. The distillation pipeline's scripts/distillation/, auditor/schemas/distillation/, and Phase 0-8 commits are NOT touched.
9. Risks identified during recon
- Synthetic data shape fragmentation — 3 distinct worker schemas across 5 files. If the staffing audit assumes one shape and the system uses another, audits will silently miss issues.
- PII enforcement unverified. Catalog has a redaction primitive; whether it's wired to mode-runner retrieval is the audit's first deterministic check.
- No structured staffing audit log. Lessons + outcomes are retrospective summaries, not per-event records. Without per-event records, deterministic checks like "every worker proposed by the LLM exists in workers_500k" can't run on historical scenarios.
- Validator scaffolds. FillValidator::validate does schema-shape only — the worker-existence/status/geo TODOs in the source are exactly the deterministic gates the staffing audit needs to run. Wiring them is consumer work, not distillation work.
- Fragile PRD ↔ code linkage. PRD §158 claims Phase 19 closed the playbook write-only gap; no audit verifies it. The staffing-prd-drift-report should run inference-style claim verification against PRD claims, not unlike the auditor's pr_audit but with the PRD as the source.
- workers_500k_v8 is the embedded corpus the LLM sees. If it carries PII without masking, the LLM has been seeing PII. Auditing the corpus content (not just the SQL views) is required.
- 64 playbook_lessons + 44 scenarios = ~108 RAG candidates. Plenty for a staffing_answers corpus, but PII filtering must apply before vectorization. Currently lessons may contain worker names ("Susan X. Ruiz double-booked").
10. Recommended integration points (where consumer code attaches)
- Staffing audit script at scripts/staffing/audit.ts reads from existing distillation outputs:
  - data/scored-runs/ (filter to task_id starting permit: or scenario:)
  - exports/quarantine/*.jsonl (any staffing-specific quarantines)
  - reports/distillation/<latest>/summary.json (cross-reference)
- Reuse Phase 5 receipts harness — staffing audit writes a StageReceipt matching the existing schema, with a new stage value (extend the enum to "staffing-audit" only after a schema-version bump if needed; otherwise use the existing reserved "index" slot, or just write a parallel manifest under reports/staffing/).
- Reuse Phase 1 schemas — RagSample, SftSample, PreferenceSample work for staffing data without modification. The tags array can carry task:staffing.fill to keep the corpus self-tagged.
- Reuse Phase 7 replay — ./scripts/distill replay --task "fill 2 welders in Toledo OH" already works; just feed it from synthetic FillRequest payloads.
- Reuse Phase 8 audit-full — its drift baseline tracks distillation metrics; staffing audit gets its OWN baseline file at data/_kb/staffing_audit_baselines.jsonl.
Schema invariants for staffing:
- every candidate_id in candidates.parquet appears in workers_500k.parquet OR is documented as "candidate-distinct-from-worker"
- every status value in candidates.parquet is in a known enum
- every email in workers/candidates is masked when it reaches the LLM (audit by inspecting prompt traces in Langfuse)
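The first two invariants above reduce to set membership and can run without any LLM. A sketch with a hypothetical `checkInvariants` helper; the "documented as candidate-distinct-from-worker" exemption is modeled as an explicit allowlist, which is an assumption about how J would record that decision.

```typescript
// Deterministic membership checks for the first two staffing invariants.
// Returns human-readable findings rather than throwing, so the audit
// can report all violations in one pass.
function checkInvariants(
  candidateIds: Set<string>,
  workerIds: Set<string>,
  documentedDistinct: Set<string>, // candidates allowed to miss workers_500k
  statuses: string[],
  knownStatuses: Set<string>,
): string[] {
  const findings: string[] = [];
  for (const id of candidateIds) {
    if (!workerIds.has(id) && !documentedDistinct.has(id)) {
      findings.push(`candidate ${id} absent from workers and undocumented`);
    }
  }
  for (const s of statuses) {
    if (!knownStatuses.has(s)) {
      findings.push(`unknown candidate status: ${s}`);
    }
  }
  return findings;
}
```

The third invariant (masking at the LLM boundary) cannot be a pure set check; it needs prompt-trace inspection as noted above.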
11. What this document is NOT
- Not a green-light to start staffing audit implementation. The spec is explicit: synthetic-data gap report next, THEN J reviews, THEN code.
- Not an audit itself. This is the inventory — the audit's first run will surface findings.
- Not a redesign of staffing data shapes. The fragmentation is documented for the gap report; reshape decisions are J's call, not this recon's.
- Not a modification of the distillation v1.0.0 substrate. Per spec: "DO NOT modify the completed distillation pipeline unless a blocking integration bug is found."
12. Phase 1 readiness checklist
Before staffing implementation starts, the following must be true:
- Recon doc exists (this file)
- Synthetic-data gap report exists (next)
- J reviews both before any code change
- J approves audit scope + first invariants
Phase 1 is unblocked only after the gap report is reviewed.