Some checks failed
lakehouse/auditor 13 blocking issues: cloud: claim not backed — "Phase 8 done-criteria (per spec):"
Spec mandates these two docs before any staffing audit runner ships:
docs/recon/staffing-lakehouse-distillation-recon.md
reports/staffing/synthetic-data-gap-report.md
NO distillation core touched. Distillation v1.0.0 (commit e7636f2,
tag distillation-v1.0.0) remains the stable substrate. Staffing
work is consumer-only.
Recon findings (12 sections, ~5KB):
- Existing staffing schemas in crates/validator/staffing/* are scaffolds
(FillValidator schema-shape only; worker-existence/status/geo TODOs)
- Synthetic data spans 6+ shapes across 9 parquet files
(~625k worker-shape rows + 1k candidate-shape rows)
- PII detection lives in shared/pii.rs but enforcement at query
time is unverified — the LLM may have been seeing raw PII via
workers_500k_v8 vector corpus
- 44 scenarios + 64 playbook_lessons = ~108 RAG candidates
- No structured fill-event log exists; scenarios+lessons are
retrospective, not queryable per-event records
- workers_500k.phone is int (should be string — leading-zero loss)
- client_workerskjkk.parquet is a typo file (160 rows, sibling of
client_workersi.parquet)
- PRD §158 claims Phase 19 closed playbook write-only gap — unverified
Gap report findings (9 sections, ~6KB):
- 4 BLOCKING gaps requiring J decisions before audit ships:
A. Generate fill_events.parquet from scenarios + lessons?
B. Build views/{candidates,workers,jobs}_safe with PII masking?
C. Delete client_workerskjkk.parquet typo file?
D. Fix workers_500k.phone type (int → string)?
- 5 SOFT gaps the audit can run with (will be reported as findings)
- 3 NON-gaps (data sufficient as-is)
- Recommendation: NO new synthetic data needed; only normalization
of what already exists, contingent on J approval of A-D
Up-front commitments:
- Distillation v1.0.0 substrate untouched (verified by audit-full
running clean before+after each staffing change)
- All synthetic-data modifications via deterministic scripts under
scripts/staffing/, never hand-edit
- Every staffing artifact carries canonical sha256 provenance back
to source parquet/scenario/lesson
- _safe views are the source of truth for LLM-facing text; raw
parquets never directly fed into corpus builds
Phase 1 unblocks AFTER J reviews both docs and approves audit scope
+ the 4 gap-fix decisions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
267 lines
17 KiB
Markdown
# Staffing Lakehouse × Distillation Substrate: Recon

**Date:** 2026-04-27
**Status:** Phase 0 (read-only inventory; no implementation yet)
**Spec:** J's "Lakehouse Staffing Integration" prompt
**Distillation tag (consumer of):** `distillation-v1.0.0` (commit `e7636f2`)

This document inventories the staffing surface in the Lakehouse repo and identifies where the distillation substrate (Phases 0-8) should attach as a *consumer*. **No distillation core mutation: staffing builds on top.**

The headline finding: **staffing has substantial existing infrastructure but is undocumented as a system.** Validators are scaffolds, scenarios are test fixtures, synthetic data spans 6+ shapes with overlapping intent, and there's no unified staffing audit. The integration work is orchestration over what already exists, not greenfield.

---
## 1. Existing staffing schemas

### Rust validators (`crates/validator/src/staffing/`)

| File | Shape | Status |
|---|---|---|
| `mod.rs` | trait + module wiring | scaffold complete |
| `fill.rs::FillValidator` | validates `{fills: [{candidate_id, name}]}` against Artifact::FillProposal | schema check live; worker-existence + status + geo checks are TODO (commented in source) |
| `playbook.rs::PlaybookValidator` | validates Artifact::Playbook (operation format, endorsed_names cap, fingerprint) | schema-shape only; no semantic content check |
| `email.rs` | email-domain validation | scaffold |

### Profiles (`crates/shared/src/profiles/`)

| File | Purpose |
|---|---|
| `execution.rs` | execution profile (model routing per task class) |
| `memory.rs` | MemoryProfile (Phase 19 playbook boost ceiling, history cap, doc stale window, auto-retire) |
| `observer.rs` | Observer profile (failure cluster size, alert cooldown, ring size, langfuse forward) |
| `retrieval.rs` | RetrievalProfile (top_k, rerank_top_k, freshness cutoff, boost_playbook_memory, enforce_sensitivity_gates) |

These are **typed** but auditing whether they're enforced at runtime is part of Phase 1 work.
### PII (`crates/shared/src/pii.rs`)

`detect_sensitivity(column_name)` maps column names to sensitivity classes (`Pii`, `Financial`, `Public`). Verified by tests:

- `email`, `contact_email`, `ssn` → Pii
- `salary`, `bill_rate` → Financial

`catalogd::service.rs:264` carries `column_redactions: HashMap<String, Redaction>` per dataset. The catalog holds the redaction config, but the audit needs to confirm masking is actually applied at query time.

---
## 2. Synthetic data inventory

| File | Rows | Shape | Status assessment |
|---|---|---|---|
| `data/datasets/candidates.parquet` | 1,000 | candidate_id, first_name, last_name, email, phone, city, state, skills, years_experience, hourly_rate_usd, status | **Has PII (raw email + phone)**. CAND-* IDs. status field: `placed`, plus other values not yet enumerated. Compact + realistic. |
| `data/datasets/job_orders.parquet` | 15,000 | job_order_id, client_id, title, vertical, bill_rate, pay_rate, status, city, state, zip, description | JO-* IDs, CLI-* clients. Verticals: Admin, Manufacturing(?), etc. Realistic shape. **No candidate-fill linkage table observed.** |
| `data/datasets/workers_500k.parquet` | 500,000 | worker_id (int), name, role, email, phone, city, state, zip, skills (CSV string), certifications, archetype, reliability/responsiveness/engagement/compliance/availability (0-1 floats), communications (multi-msg string), resume_text | **Largest + richest source.** Has PII. archetype enum (flexible/?). 4-axis personality scores. Resume text + comm log = good RAG/SFT material. |
| `data/datasets/workers_100k.parquet` | 100,000 | (presumed same as 500k) | scaled-down sibling |
| `data/datasets/ethereal_workers.parquet` | 10,000 | same as workers_500k schema | scenario-friendly subset |
| `data/datasets/client_workersi.parquet` | 160 | worker_id, name, role, city, state, email, phone, skills, certifications, availability, reliability, archetype | **Different shape** (no scores beyond reliability+availability, no resume_text). Probably client-side "approved roster": the worker pool a client has historically used. |
| `data/datasets/client_workerskjkk.parquet` | (similar) | (same as above) | typo-named sibling; gap to clean up |
| `data/datasets/sparse_workers.parquet` | 200 | name, phone, role, city, state, notes | **Different shape**: no IDs, no scores, just contact + notes. Looks like edge-case test data (sparse field coverage). |
| `data/datasets/new_candidates.parquet` | 3 | name, phone, email, city, state, skills, years | Demo / smoke-test data. Tiny. |

**Total worker-shape rows on disk: ~625k** across 5 files. Schema fragmentation (3 distinct shapes) is a real issue; see gap report.
### Scenarios (`tests/multi-agent/scenarios/`)

44 JSON files covering specific staffing days. Sample shape (Heritage Foods Indianapolis 2026-04-23):

```json
{ "client": "Heritage Foods", "date": "2026-04-23", "events": [
  { "kind": "baseline_fill", "at": "10:30", "role": "Machine Operator", "count": 2,
    "city": "Indianapolis", "state": "IN", "shift_start": "10:30 AM" },
  { "kind": "recurring", "at": "10:30", "role": "Receiving Clerk", "count": 1, ... }
]}
```

Event kinds observed: `baseline_fill`, `recurring`. Cities span Indianapolis, Cincinnati, Madison, Toledo, Detroit, Columbus, etc. (Midwestern + Eastern US).
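
The sample shape above can be modeled on the consumer side roughly as follows. This is a minimal sketch: the `Scenario` and `ScenarioEvent` interfaces and the `headcountByKind` helper are hypothetical names inferred from the single sample shown, and real scenario files may carry additional fields (the sample elides some with `...`).

```typescript
// Hypothetical types inferred from the one sample scenario; not the repo's own definitions.
interface ScenarioEvent {
  kind: string; // e.g. "baseline_fill", "recurring"
  at: string; // "HH:MM"
  role: string;
  count: number;
  city?: string;
  state?: string;
}

interface Scenario {
  client: string;
  date: string; // "YYYY-MM-DD"
  events: ScenarioEvent[];
}

// Sum the requested headcount per event kind across a set of scenarios.
function headcountByKind(scenarios: Scenario[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const scenario of scenarios) {
    for (const event of scenario.events) {
      totals.set(event.kind, (totals.get(event.kind) ?? 0) + event.count);
    }
  }
  return totals;
}

// Sample data mirrors the fields visible in the excerpt above.
const sampleScenario: Scenario = {
  client: "Heritage Foods",
  date: "2026-04-23",
  events: [
    { kind: "baseline_fill", at: "10:30", role: "Machine Operator", count: 2, city: "Indianapolis", state: "IN" },
    { kind: "recurring", at: "10:30", role: "Receiving Clerk", count: 1 },
  ],
};

const headcount = headcountByKind([sampleScenario]);
headcount.forEach((count, kind) => console.log(kind, count));
```

A tally like this is one cheap way to confirm that the 44 files really contain only the two event kinds observed so far.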
### Playbook lessons (`data/_playbook_lessons/`)

64 JSON files. Sample shape (Heritage Foods 2026-04-21):

```json
{ "date": "...", "client": "...", "cities": "...", "states": "...",
  "events_total": 5, "events_ok": 3, "checkpoint_count": 2,
  "model": "gpt-oss:20b", "cloud": false,
  "lesson": "<long markdown analysis>",
  "checkpoints": [{ "after": "09:30", "risk": "...", "hint": "..." }, ...] }
```

These are **post-run retrospectives**: the staffing ops loop wrote them after each scenario completed. Goldmine for RAG.

---
## 3. Ingestion paths + storage layout

### Object storage / Parquet

- `data/datasets/*.parquet` is the disk-resident store. Treated as input by `ingestd` (CSV/JSON/PDF/Postgres/MySQL ingest in `crates/ingestd`).
- **No catalog manifests observed for the staffing parquets** (none under `data/_catalog/manifests/` matching candidate/worker/job names). The datasets exist on disk but may not be registered with `catalogd`, which is a gap.

### MariaDB

- `crates/queryd/src/context.rs` has a "candidates_safe" view referenced by recent code (failed at boot when schema mismatched, see prior memory `feedback_endpoint_probe_discipline.md`).
- Schema for the views isn't visible from grep; it needs DB inspection.

### Vector indexes (`data/vectors/`)

- `workers_500k_v8.parquet`: vector corpus matched by `staffing_inference_lakehouse` mode in `config/modes.toml`
- `ethereal_workers_v1.parquet`: alt corpus
- `entity_brief_v1.parquet`: Chicago-permit-style entity briefs (different domain but same indexer)
- `chicago_permits_v1.parquet`: separate but uses same machinery

### KB streams that touch staffing

- `data/_kb/contract_analyses.jsonl`: contractor + permit analyses (related but not staffing per se)
- `data/_kb/staffers.jsonl`: 1.5K, small, not yet inspected
- `data/_kb/outcomes.jsonl`: scenario outcomes log (used by Phase 2 transforms in distillation)
- `data/_playbook_memory/state.json`: Phase 19 playbook memory state

---
## 4. Search / indexing logic

### Staffing-aware mode runner

`config/modes.toml` defines the `staffing_inference` task class:

```toml
preferred_mode = "staffing_inference_lakehouse"
default_model = "openai/gpt-oss-120b:free"
matrix_corpus = "workers_500k_v8"
```

The mode runner (Phase 5+ work in this session) composes:

- `EnrichmentFlags { include_file_content, include_bug_fingerprints, include_matrix_chunks, use_relevance_filter, framing: Staffing }`
- Pulls top-K from `workers_500k_v8` corpus
- `FRAMING_STAFFING` system prompt instructs: "only recommend candidates whose names appear in the matrix data; do NOT fabricate workers"

### Pass 4 staffing harness

`scripts/mode_pass4_staffing.ts` ships synthetic FillRequest payloads through the runner. Each request is a JSON `{city, state, role, count, deadline, notes?}` posted as `file_content` (the runner's input shape). Validation: did the model surface real worker_ids from the corpus, or fabricate them?
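
The validation question in the harness description can be answered deterministically once the model's proposals are parsed. The sketch below is illustrative only: `ProposedFill`, `checkFabrication`, and the sample IDs and names are assumptions, not the harness's actual types, and it assumes the corpus worker IDs fit in an in-memory set.

```typescript
// Hypothetical shape for one proposal parsed from model output.
interface ProposedFill {
  worker_id: number;
  name: string;
}

interface FabricationReport {
  real: ProposedFill[]; // worker_id found in the corpus
  fabricated: ProposedFill[]; // worker_id absent from the corpus
}

// Split proposals by whether their worker_id exists in the corpus ID set.
function checkFabrication(proposed: ProposedFill[], corpusIds: Set<number>): FabricationReport {
  const report: FabricationReport = { real: [], fabricated: [] };
  for (const fill of proposed) {
    if (corpusIds.has(fill.worker_id)) {
      report.real.push(fill);
    } else {
      report.fabricated.push(fill);
    }
  }
  return report;
}

// Illustrative data only; these IDs and names are made up for the example.
const corpusIds = new Set<number>([101, 102, 103]);
const fabricationReport = checkFabrication(
  [
    { worker_id: 101, name: "Worker A" },
    { worker_id: 999, name: "Worker Z" },
  ],
  corpusIds,
);
console.log(fabricationReport.fabricated.map((fill) => fill.worker_id));
```

Checking IDs rather than names avoids false matches when two workers share a name, though it only works if the model is asked to return `worker_id` values.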
### What's missing

- **No "candidate matching" deterministic scorer** beyond mode-runner LLM. Staffing audit should add: given a job_order, can we score worker fit deterministically (skills overlap, geo distance, status filter) BEFORE asking the LLM? Currently the LLM does both retrieval and scoring.
- **No indexed link table between candidates.parquet and workers_500k.parquet.** They look like the SAME population in different shapes: workers_500k has the scores + resume + comms, candidates has the basic contact + status + hourly rate. If they're meant to be different populations, the join key is unclear; if they're the same, there's redundancy.
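
A deterministic pre-LLM scorer of the kind described in the first bullet could look roughly like this. Everything here is an assumption made for illustration: the `WorkerRow` and `JobNeed` shapes, the `required_skills` field (job_orders.parquet lists no explicit skills column), a `status` value on the worker, the 0.7/0.3 weights, and the city/state match used as a coarse stand-in for geo distance.

```typescript
// Hypothetical row shapes; field names beyond those listed in the inventory are assumptions.
interface WorkerRow {
  worker_id: number;
  skills: string; // CSV string, as in workers_500k.parquet
  city: string;
  state: string;
  status: string;
}

interface JobNeed {
  required_skills: string[];
  city: string;
  state: string;
}

// Normalize a CSV skills string into a lowercase set.
function parseSkills(csv: string): Set<string> {
  const skills = new Set<string>();
  for (const raw of csv.split(",")) {
    const skill = raw.trim().toLowerCase();
    if (skill.length > 0) skills.add(skill);
  }
  return skills;
}

// Returns null when the status filter excludes the worker, otherwise a score in [0, 1].
function scoreFit(worker: WorkerRow, job: JobNeed, eligibleStatuses: Set<string>): number | null {
  if (!eligibleStatuses.has(worker.status)) return null;

  const have = parseSkills(worker.skills);
  let matched = 0;
  for (const need of job.required_skills) {
    if (have.has(need.trim().toLowerCase())) matched += 1;
  }
  const skillScore = job.required_skills.length === 0 ? 0 : matched / job.required_skills.length;

  // Coarse geo proxy (no coordinates assumed): same city 1, same state 0.5, otherwise 0.
  const sameState = worker.state === job.state;
  const geoScore = sameState && worker.city === job.city ? 1 : sameState ? 0.5 : 0;

  // Illustrative weights, not tuned values.
  return 0.7 * skillScore + 0.3 * geoScore;
}

const job: JobNeed = { required_skills: ["welding", "blueprint reading"], city: "Toledo", state: "OH" };
const eligible = new Set<string>(["available"]); // hypothetical status value
const fitScore = scoreFit(
  { worker_id: 1, skills: "Welding, Forklift", city: "Toledo", state: "OH", status: "available" },
  job,
  eligible,
);
const filteredOut = scoreFit(
  { worker_id: 2, skills: "Welding", city: "Toledo", state: "OH", status: "placed" },
  job,
  eligible,
);
console.log(fitScore, filteredOut);
```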
---

## 5. Audit / event tables

**No staffing-specific audit/event log observed.** Searched for `audit_event`, `outcome_event`, `fill_event` patterns in `crates/`: zero hits. The closest existing infrastructure:

- `data/_kb/outcomes.jsonl`: per-run scenario outcomes (used by distillation transforms)
- `data/_observer/ops.jsonl`: observer ring buffer (general-purpose, not staffing)
- `data/_playbook_lessons/*.json`: post-run lessons (retrospective, not audit)

**Gap:** staffing fills happen, scenarios complete, but **no schema-backed event log** captures: which worker_ids were proposed, accepted, filled, rejected, with what timing, against which job_order. The closest record is in scenarios + playbook_lessons but those are unstructured + per-scenario, not a queryable log.

---
## 6. PII / tokenization boundaries

### Detection

`crates/shared/src/pii.rs::detect_sensitivity` recognizes: `email`, `contact_email`, `ssn`, `phone` → Pii. `salary`, `bill_rate`, `pay_rate` → Financial.
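
For audit scripts written in TypeScript, the mapping listed above could be mirrored roughly as follows. This is a sketch under assumptions: it uses exact, case-insensitive column-name matching, which may differ from how the Rust `detect_sensitivity` actually matches names, and `detectSensitivity` and the two column sets are hypothetical names.

```typescript
type Sensitivity = "Pii" | "Financial" | "Public";

// Column names taken from the detection list above; exact matching is an assumption.
const PII_COLUMNS = new Set<string>(["email", "contact_email", "ssn", "phone"]);
const FINANCIAL_COLUMNS = new Set<string>(["salary", "bill_rate", "pay_rate"]);

// Classify one column name; anything unrecognized falls through to Public.
function detectSensitivity(columnName: string): Sensitivity {
  const name = columnName.trim().toLowerCase();
  if (PII_COLUMNS.has(name)) return "Pii";
  if (FINANCIAL_COLUMNS.has(name)) return "Financial";
  return "Public";
}

// Classify a few columns that appear in the synthetic data inventory.
const columns = ["email", "phone", "city", "bill_rate"];
const classified = columns.map((column) => ({ column, sensitivity: detectSensitivity(column) }));
console.log(classified);
```

Running a mirror like this over every parquet schema gives the audit a list of columns that must be masked before any text reaches a corpus build.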
### Enforcement

`catalogd::service.rs` carries per-dataset `column_redactions: HashMap<String, Redaction>`, but enforcement at query time wasn't visible from initial grep. Auditing whether masking actually happens when `staffing_inference_lakehouse` retrieves from `workers_500k_v8` is in scope.
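
One way to audit enforcement is to scan the text that retrieval actually returns for raw contact details. The sketch below is a heuristic, not the catalog's masking logic: `findRawPii` and both regular expressions are assumptions, the phone pattern targets US-style numbers only, and it can also match digit runs inside longer numeric strings.

```typescript
// Heuristic patterns (assumptions): common email syntax and US-style phone numbers.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE_RE = /(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/g;

interface PiiHits {
  emails: string[];
  phones: string[];
}

// Return every raw email and phone-like string found in a retrieved chunk.
function findRawPii(text: string): PiiHits {
  const emails: string[] = text.match(EMAIL_RE) ?? [];
  const phones: string[] = text.match(PHONE_RE) ?? [];
  return { emails, phones };
}

// Illustrative chunks; the address and number are placeholders, not real data.
const rawChunk = "Contact jane.doe@example.com or (317) 555-0142 about the shift.";
const maskedChunk = "Contact [EMAIL] or [PHONE] about the shift.";

const rawHits = findRawPii(rawChunk);
const maskedHits = findRawPii(maskedChunk);
console.log(rawHits, maskedHits);
```

A retrieved chunk with any hits is evidence that masking was not applied on that path; zero hits is weaker evidence, since the patterns are not exhaustive.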
### Risk

Raw email + phone live in `workers_500k.parquet` and `candidates.parquet`. If the LLM mode runner retrieves chunks and the catalog hasn't masked them, **the LLM sees PII**. Spec says "do not expose raw PII to AI"; auditing this is non-negotiable for the staffing integration.

---
## 7. PRD docs

- `docs/PRD.md`: main PRD. §32 names staffing as the reference implementation. §158 explicitly notes Phase 19 playbook learning was originally write-only and claims the gap is now closed; **verify**.
- `docs/CONTROL_PLANE_PRD.md`: long-horizon vision (2026-04-22 pivot)

PRD references staffing throughout but doesn't itemize a "staffing PRD checklist" the way the auditor's pr_audit mode expects per-PR claims. Drift detection between PRD claims and code reality is exactly the auditor's job; running it on the PRD as input rather than a PR diff is a configuration shift, not new code.

---
## 8. Where distillation outputs should attach

The Phase 0-8 distillation substrate is **already feeding the staffing surface in two places**:

1. **`staffing_inference_lakehouse` mode → `workers_500k_v8` matrix corpus.** This is read-only consumption; no change needed.
2. **`pr_audit` mode → `lakehouse_answers_v1` corpus.** Generic; not staffing-specific.

**What's missing for staffing:**

a. **Staffing-specific RAG corpus**: `staffing_answers_v1` built from playbook_lessons + scored scenarios. Same builder pattern as `lakehouse_answers_v1` (commit `0844206`'s `scripts/build_answers_corpus.ts`); just point at staffing inputs.

b. **Staffing audit task class**: `staffing_audit` mode in `config/modes.toml`, paralleling the auditor's `pr_audit` work. Reads PRD claims + scenario outcomes, asks "do we ship what the PRD claims for staffing?"

c. **Staffing acceptance fixture**: same shape as `tests/fixtures/distillation/acceptance/` but with synthetic candidate + job_order + scenario + lesson rows. Pins staffing invariants: PII masked, candidates valid, scenarios reproducible.

d. **Staffing replay tasks**: drop sample fill requests through `./scripts/distill replay` to see if the local model proposes real worker_ids vs fabricates.

**Implementation approach (deferred until gap report + J approval):**

```
scripts/staffing/
  audit.ts              # ./scripts/staffing audit (single entry)
  build_answers.ts      # build_staffing_answers_v1 from lessons + scenarios
  build_corpus_v9.ts    # rebuild workers_500k_v9 with PII masking applied
  acceptance.ts         # staffing-specific 22-invariant gate

tests/fixtures/staffing/
  candidates_sample.parquet
  job_orders_sample.parquet
  scenario_sample.json
  lesson_sample.json

reports/staffing/
  staffing-audit-report.md
  staffing-prd-drift-report.md
  staffing-search-quality-report.md
  staffing-synthetic-data-report.md
```

**ALL of the above is consumer-side.** The distillation pipeline's `scripts/distillation/`, `auditor/schemas/distillation/`, and Phase 0-8 commits are NOT touched.

---
## 9. Risks identified during recon

1. **Synthetic data shape fragmentation.** 3 distinct worker schemas across 5 files. If the staffing audit assumes one shape and the system uses another, the audit will silently miss findings.
2. **PII enforcement unverified.** Catalog has a redaction primitive; whether it's wired to mode-runner retrieval is the audit's first deterministic check.
3. **No structured staffing audit log.** Lessons + outcomes are retrospective summaries, not per-event records. Without per-event records, deterministic checks like "every worker proposed by the LLM exists in workers_500k" can't run on historical scenarios.
4. **Validator scaffolds.** `FillValidator::validate` does schema-shape only; the worker-existence/status/geo TODOs in the source are exactly the deterministic gates the staffing audit needs to run. Wiring them is consumer work, not distillation work.
5. **Fragile PRD ↔ code linkage.** PRD §158 claims Phase 19 closed the playbook write-only gap; no audit verifies this. The staffing-prd-drift-report should run an inference-style claim verification against PRD claims, similar to the auditor's pr_audit but with the PRD as the source.
6. **`workers_500k_v8` is the embedded corpus the LLM sees.** If it carries PII without masking, the LLM has been seeing PII. Auditing the corpus content (not just the SQL views) is required.
7. **64 playbook_lessons + 44 scenarios = ~108 RAG candidates.** Plenty for a staffing_answers corpus, but PII filtering must apply before vectorization. Currently lessons may contain worker names ("Susan X. Ruiz double-booked").
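
To make risk 3 concrete, a per-event record and the historical check it would enable might look like this. The `FillEvent` shape, the `FillStage` values, and both helper functions are hypothetical: no such log exists in the repo today, and the field names are guesses based on the fields the recon says are missing (worker, stage, timing, job_order).

```typescript
type FillStage = "proposed" | "accepted" | "filled" | "rejected";

// Hypothetical per-event record; one row per worker per stage transition.
interface FillEvent {
  job_order_id: string;
  worker_id: number;
  stage: FillStage;
  at: string; // ISO 8601 timestamp
}

// Count events per stage, e.g. to compare proposals against completed fills.
function countByStage(events: FillEvent[]): Record<FillStage, number> {
  const counts: Record<FillStage, number> = { proposed: 0, accepted: 0, filled: 0, rejected: 0 };
  for (const event of events) counts[event.stage] += 1;
  return counts;
}

// Distinct worker_ids that were proposed but are absent from the known worker set.
function proposedUnknownWorkers(events: FillEvent[], knownWorkerIds: Set<number>): number[] {
  const unknown: number[] = [];
  for (const event of events) {
    const isUnknown = event.stage === "proposed" && !knownWorkerIds.has(event.worker_id);
    if (isUnknown && unknown.indexOf(event.worker_id) === -1) unknown.push(event.worker_id);
  }
  return unknown;
}

// Illustrative events; IDs and timestamps are made up for the example.
const fillEvents: FillEvent[] = [
  { job_order_id: "JO-0001", worker_id: 101, stage: "proposed", at: "2026-04-23T10:30:00Z" },
  { job_order_id: "JO-0001", worker_id: 999, stage: "proposed", at: "2026-04-23T10:30:05Z" },
  { job_order_id: "JO-0001", worker_id: 101, stage: "filled", at: "2026-04-23T11:00:00Z" },
];
const stageCounts = countByStage(fillEvents);
const unknownProposed = proposedUnknownWorkers(fillEvents, new Set<number>([101, 102]));
console.log(stageCounts, unknownProposed);
```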
---

## 10. Recommended integration points (where consumer code attaches)

1. **Staffing audit script** at `scripts/staffing/audit.ts` reads from existing distillation outputs:
   - `data/scored-runs/` (filter to task_id starting `permit:` or `scenario:`)
   - `exports/quarantine/*.jsonl` (any staffing-specific quarantines)
   - `reports/distillation/<latest>/summary.json` (cross-reference)

2. **Reuse Phase 5 receipts harness**: staffing audit writes a `StageReceipt` matching the existing schema, with a new `stage` value (extend the enum to `"staffing-audit"` only after schema-version bump if needed; otherwise use the existing reserved `"index"` slot or just write a parallel manifest under `reports/staffing/`).

3. **Reuse Phase 1 schemas**: RagSample, SftSample, PreferenceSample work for staffing data without modification. The `tags` array can carry `task:staffing.fill` to keep the corpus self-tagged.

4. **Reuse Phase 7 replay**: `./scripts/distill replay --task "fill 2 welders in Toledo OH"` already works; just feed it from synthetic FillRequest payloads.

5. **Reuse Phase 8 audit-full**: its drift baseline tracks distillation metrics; staffing audit gets its OWN baseline file at `data/_kb/staffing_audit_baselines.jsonl`.

6. **Schema invariants for staffing**:
   - every candidate_id in candidates.parquet appears in workers_500k.parquet OR is documented as "candidate-distinct-from-worker"
   - every status value in candidates.parquet is in a known enum
   - every email in workers/candidates is masked when it reaches the LLM (audit by inspecting prompt traces in Langfuse)
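
The first two invariants in item 6 can be expressed as a small deterministic check. This is a sketch with assumed inputs: `CandidateRow`, `InvariantInputs`, and `checkCandidateInvariants` are hypothetical names, the status values other than `placed` are placeholders, and `linkedToWorker` presumes some link table exists even though the join key between the two files is still unclear.

```typescript
// Hypothetical minimal view of a candidates.parquet row.
interface CandidateRow {
  candidate_id: string;
  status: string;
}

interface InvariantInputs {
  knownStatuses: Set<string>; // the agreed status enum
  linkedToWorker: Set<string>; // candidate_ids with a resolved worker match
  documentedDistinct: Set<string>; // candidate_ids documented as candidate-distinct-from-worker
}

// Return one human-readable violation per failed invariant per row.
function checkCandidateInvariants(rows: CandidateRow[], inputs: InvariantInputs): string[] {
  const violations: string[] = [];
  for (const row of rows) {
    if (!inputs.knownStatuses.has(row.status)) {
      violations.push(`${row.candidate_id}: unknown status "${row.status}"`);
    }
    const linked = inputs.linkedToWorker.has(row.candidate_id);
    const distinct = inputs.documentedDistinct.has(row.candidate_id);
    if (!linked && !distinct) {
      violations.push(`${row.candidate_id}: no worker link and not documented as distinct`);
    }
  }
  return violations;
}

// Illustrative rows; IDs and the "available" status are placeholders.
const violations = checkCandidateInvariants(
  [
    { candidate_id: "CAND-1", status: "placed" },
    { candidate_id: "CAND-2", status: "mystery" },
    { candidate_id: "CAND-3", status: "available" },
  ],
  {
    knownStatuses: new Set<string>(["placed", "available"]),
    linkedToWorker: new Set<string>(["CAND-1"]),
    documentedDistinct: new Set<string>(["CAND-2"]),
  },
);
console.log(violations);
```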
---

## 11. What this document is NOT

- Not a green-light to start staffing audit implementation. The spec is explicit: synthetic-data gap report next, THEN J reviews, THEN code.
- Not an audit itself. This is the inventory; the audit's first run will surface findings.
- Not a redesign of staffing data shapes. The fragmentation is documented for the gap report; reshape decisions are J's call, not this recon's.
- Not a modification of the distillation v1.0.0 substrate. Per spec: "DO NOT modify the completed distillation pipeline unless a blocking integration bug is found."

---
## 12. Phase 1 readiness checklist

Before staffing implementation starts, the following must be true:

- [x] Recon doc exists (this file)
- [ ] Synthetic-data gap report exists (next)
- [ ] J reviews both before any code change
- [ ] J approves audit scope + first invariants

Phase 1 is unblocked only after the gap report is reviewed.