lakehouse/docs/recon/staffing-lakehouse-distillation-recon.md


Staffing Lakehouse × Distillation Substrate — Recon

Date: 2026-04-27
Status: Phase 0 (read-only inventory — no implementation yet)
Spec: J's "Lakehouse Staffing Integration" prompt
Distillation tag (consumer of): distillation-v1.0.0 (commit e7636f2)

This document inventories the staffing surface in the Lakehouse repo and identifies where the distillation substrate (Phases 0-8) should attach as a consumer. No distillation core mutation — staffing builds on top.

The headline finding: staffing has substantial existing infrastructure but is undocumented as a system. Validators are scaffolds, scenarios are test fixtures, synthetic data spans 6+ shapes with overlapping intent, and there's no unified staffing audit. The integration work is orchestration over what already exists, not greenfield.


1. Existing staffing schemas

Rust validators (crates/validator/src/staffing/)

| File | Shape | Status |
| --- | --- | --- |
| mod.rs | trait + module wiring | scaffold complete |
| fill.rs::FillValidator | validates {fills: [{candidate_id, name}]} against Artifact::FillProposal | schema check live; worker-existence + status + geo checks are TODO (commented in source) |
| playbook.rs::PlaybookValidator | validates Artifact::Playbook (operation format, endorsed_names cap, fingerprint) | schema-shape only; no semantic content check |
| email.rs | email-domain validation | scaffold |

Profiles (crates/shared/src/profiles/)

| File | Purpose |
| --- | --- |
| execution.rs | execution profile (model routing per task class) |
| memory.rs | MemoryProfile (Phase 19 playbook boost ceiling, history cap, doc stale window, auto-retire) |
| observer.rs | Observer profile (failure cluster size, alert cooldown, ring size, langfuse forward) |
| retrieval.rs | RetrievalProfile (top_k, rerank_top_k, freshness cutoff, boost_playbook_memory, enforce_sensitivity_gates) |

These profiles are typed; whether they are actually enforced at runtime is a Phase 1 audit question.

PII (crates/shared/src/pii.rs)

detect_sensitivity(column_name) → maps column names to sensitivity classes (Pii, Financial, Public). Verified by tests:

  • email, contact_email, ssn → Pii
  • salary, bill_rate → Financial

catalogd::service.rs:264 carries column_redactions: HashMap<String, Redaction> per dataset. The catalog holds the redaction primitive, but the audit needs to confirm masking is actually applied at query time.
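
For audit scripts that need the same classification on the TypeScript side, a minimal mirror of that mapping; the function below is a hypothetical sketch that follows the column lists in this document, not the real pii.rs API.

```ts
// Hypothetical audit-side mirror of detect_sensitivity (crates/shared/src/pii.rs).
// Column names and classes follow the recon notes; this is a sketch, not the real API.
type Sensitivity = "Pii" | "Financial" | "Public";

const PII_COLUMNS = new Set(["email", "contact_email", "ssn", "phone"]);
const FINANCIAL_COLUMNS = new Set(["salary", "bill_rate", "pay_rate"]);

function detectSensitivity(column: string): Sensitivity {
  const c = column.toLowerCase();
  if (PII_COLUMNS.has(c)) return "Pii";
  if (FINANCIAL_COLUMNS.has(c)) return "Financial";
  return "Public";
}

// Example: detectSensitivity("contact_email") === "Pii"
```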


2. Synthetic data inventory

| File | Rows | Shape | Status assessment |
| --- | --- | --- | --- |
| data/datasets/candidates.parquet | 1,000 | candidate_id, first_name, last_name, email, phone, city, state, skills, years_experience, hourly_rate_usd, status | Has PII (raw email + phone). CAND-* IDs. status field: placed, unknown others. Compact + realistic. |
| data/datasets/job_orders.parquet | 15,000 | job_order_id, client_id, title, vertical, bill_rate, pay_rate, status, city, state, zip, description | JO-* IDs, CLI-* clients. Verticals: Admin, Manufacturing(?), etc. Realistic shape. No candidate-fill linkage table observed. |
| data/datasets/workers_500k.parquet | 500,000 | worker_id (int), name, role, email, phone, city, state, zip, skills (CSV string), certifications, archetype, reliability/responsiveness/engagement/compliance/availability (0-1 floats), communications (multi-msg string), resume_text | Largest + richest source. Has PII. archetype enum (flexible/?). 4-axis personality scores. Resume text + comm log = good RAG/SFT material. |
| data/datasets/workers_100k.parquet | 100,000 | (presumed same as 500k) | scaled-down sibling |
| data/datasets/ethereal_workers.parquet | 10,000 | same as workers_500k schema | scenario-friendly subset |
| data/datasets/client_workersi.parquet | 160 | worker_id, name, role, city, state, email, phone, skills, certifications, availability, reliability, archetype | Different shape (no scores beyond reliability + availability, no resume_text). Probably a client-side "approved roster": the worker pool a client has historically used. |
| data/datasets/client_workerskjkk.parquet | (similar) | (same as above) | typo-named sibling — gap to clean up |
| data/datasets/sparse_workers.parquet | 200 | name, phone, role, city, state, notes | Different shape: no IDs, no scores, just contact + notes. Looks like edge-case test data (sparse field coverage). |
| data/datasets/new_candidates.parquet | 3 | name, phone, email, city, state, skills, years | Demo / smoke-test data. Tiny. |

Total worker-shape rows on disk: ~625k across 5 files. Schema fragmentation (3 distinct shapes) is a real issue — see gap report.

Scenarios (tests/multi-agent/scenarios/)

44 JSON files covering specific staffing days. Sample shape (Heritage Foods Indianapolis 2026-04-23):

{ "client": "Heritage Foods", "date": "2026-04-23", "events": [
  { "kind": "baseline_fill", "at": "10:30", "role": "Machine Operator", "count": 2,
    "city": "Indianapolis", "state": "IN", "shift_start": "10:30 AM" },
  { "kind": "recurring", "at": "10:30", "role": "Receiving Clerk", "count": 1, ... }
]}

Event kinds observed: baseline_fill, recurring. Cities span Indianapolis, Cincinnati, Madison, Toledo, Detroit, Columbus, etc. — Midwestern + Eastern US.

Playbook lessons (data/_playbook_lessons/)

64 JSON files. Sample shape (Heritage Foods 2026-04-21):

{ "date": "...", "client": "...", "cities": "...", "states": "...",
  "events_total": 5, "events_ok": 3, "checkpoint_count": 2,
  "model": "gpt-oss:20b", "cloud": false,
  "lesson": "<long markdown analysis>",
  "checkpoints": [{ "after": "09:30", "risk": "...", "hint": "..." }, ...] }

These are post-run retrospectives — the staffing ops loop wrote them after each scenario completed. Goldmine for RAG.


3. Ingestion paths + storage layout

Object storage / Parquet

  • data/datasets/*.parquet is the disk-resident store. Treated as input by ingestd (CSV/JSON/PDF/Postgres/MySQL ingest in crates/ingestd).
  • No catalog manifests observed for the staffing parquets (none under data/_catalog/manifests/ matching candidate/worker/job names). The datasets exist on disk but may not be registered with catalogd — gap.

MariaDB

  • crates/queryd/src/context.rs references a "candidates_safe" view in recent code (it failed at boot when the schema mismatched; see the prior memory feedback_endpoint_probe_discipline.md).
  • The schema for these views isn't visible from grep; it needs DB inspection.

Vector indexes (data/vectors/)

  • workers_500k_v8.parquet — vector corpus matched by staffing_inference_lakehouse mode in config/modes.toml
  • ethereal_workers_v1.parquet — alt corpus
  • entity_brief_v1.parquet — Chicago-permit-style entity briefs (different domain but same indexer)
  • chicago_permits_v1.parquet — separate but uses same machinery

KB streams that touch staffing

  • data/_kb/contract_analyses.jsonl — contractor + permit analyses (related but not staffing per se)
  • data/_kb/staffers.jsonl — 1.5K, small, not yet inspected
  • data/_kb/outcomes.jsonl — scenario outcomes log (used by Phase 2 transforms in distillation)
  • data/_playbook_memory/state.json — Phase 19 playbook memory state

4. Search / indexing logic

Staffing-aware mode runner

config/modes.toml defines staffing_inference task class:

preferred_mode = "staffing_inference_lakehouse"
default_model = "openai/gpt-oss-120b:free"
matrix_corpus = "workers_500k_v8"

The mode runner (Phase 5+ work in this session) composes:

  • EnrichmentFlags { include_file_content, include_bug_fingerprints, include_matrix_chunks, use_relevance_filter, framing: Staffing }
  • Pulls top-K from workers_500k_v8 corpus
  • FRAMING_STAFFING system prompt instructs: "only recommend candidates whose names appear in the matrix data; do NOT fabricate workers"

Pass 4 staffing harness

scripts/mode_pass4_staffing.ts ships synthetic FillRequest payloads through the runner. Each request is a JSON {city, state, role, count, deadline, notes?} posted as file_content (the runner's input shape). Validation checks whether the model surfaced real worker_ids from the corpus or fabricated them.
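
For reference, a sketch of that payload shape in TypeScript; only the field list comes from the description above, while the interface name and sample values are illustrative.

```ts
// Sketch of the synthetic FillRequest payload the harness ships through the runner.
// Field list follows the JSON shape described above; names and values are illustrative.
interface FillRequest {
  city: string;
  state: string;
  role: string;
  count: number;
  deadline: string;   // e.g. an ISO timestamp or shift start
  notes?: string;
}

const sample: FillRequest = {
  city: "Toledo",
  state: "OH",
  role: "Welder",
  count: 2,
  deadline: "2026-04-28T06:00:00-05:00",
};

// The runner consumes the request as file_content (its generic input shape).
const runnerInput = { file_content: JSON.stringify(sample) };
```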

What's missing

  • No "candidate matching" deterministic scorer beyond mode-runner LLM. Staffing audit should add: given a job_order, can we score worker fit deterministically (skills overlap, geo distance, status filter) BEFORE asking the LLM? Currently the LLM does both retrieval and scoring.
  • No indexed link table between candidates.parquet and workers_500k.parquet. They look like the SAME population in different shapes — the workers_500k has the scores + resume + comms, candidates has the basic contact + status + hourly rate. If they're meant to be different populations, the join key is unclear; if they're the same, there's redundancy.
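
A minimal sketch of what such a pre-LLM scorer could look like, assuming in-memory rows; the field names mirror the dataset shapes in section 2, while the weights, the geo proxy, and the "active" status value are illustrative assumptions, not existing code.

```ts
// Sketch of a deterministic worker-fit scorer run before any LLM call.
// Field names follow the worker/job_order shapes inventoried in section 2;
// weights and the status filter value are illustrative assumptions.
interface JobOrder { title: string; city: string; state: string; skills: string[]; }
interface Worker {
  worker_id: number; name: string; status?: string;
  city: string; state: string; skills: string[];
  reliability: number; availability: number;
}

function fitScore(job: JobOrder, w: Worker): number {
  if (w.status && w.status !== "active") return 0;          // status filter (assumed enum value)
  const jobSkills = new Set(job.skills.map((s) => s.toLowerCase()));
  const overlap = w.skills.filter((s) => jobSkills.has(s.toLowerCase())).length;
  const skillScore = job.skills.length ? overlap / job.skills.length : 0;
  const geoScore = w.state === job.state ? (w.city === job.city ? 1 : 0.5) : 0; // crude geo proxy
  return 0.5 * skillScore + 0.3 * geoScore + 0.1 * w.reliability + 0.1 * w.availability;
}

// Rank the corpus deterministically, then hand only the top-K to the LLM.
function topCandidates(job: JobOrder, workers: Worker[], k = 10): Worker[] {
  return workers
    .map((w) => ({ w, s: fitScore(job, w) }))
    .filter((x) => x.s > 0)
    .sort((a, b) => b.s - a.s)
    .slice(0, k)
    .map((x) => x.w);
}
```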

5. Audit / event tables

No staffing-specific audit/event log observed. Searched for audit_event, outcome_event, fill_event patterns in crates/ — zero hits. The closest existing infrastructure:

  • data/_kb/outcomes.jsonl — per-run scenario outcomes (used by distillation transforms)
  • data/_observer/ops.jsonl — observer ring buffer (general-purpose, not staffing)
  • data/_playbook_lessons/*.json — post-run lessons (retrospective, not audit)

Gap: staffing fills happen and scenarios complete, but no schema-backed event log captures which worker_ids were proposed, accepted, filled, or rejected, with what timing, against which job_order. The closest record lives in scenarios + playbook_lessons, but those are unstructured and per-scenario, not a queryable log. One possible record shape is sketched below.
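
A hedged sketch of what one per-event record could look like, in TypeScript; the type name, field names, and enum values are proposals derived from this gap, not an existing schema.

```ts
// Proposed (not existing) shape for a structured fill-event record, one row per
// proposal/decision, derived from the gap described above.
type FillEventKind = "proposed" | "accepted" | "rejected" | "filled";

interface FillEvent {
  event_id: string;        // stable id, e.g. hash of (scenario, job_order, worker, kind, at)
  scenario_id?: string;    // source scenario file, if the event is backfilled from one
  job_order_id: string;    // JO-* id from job_orders.parquet
  worker_id: number;       // should exist in workers_500k.parquet
  kind: FillEventKind;
  at: string;              // ISO timestamp
  source: "llm" | "human" | "backfill";
  source_sha256?: string;  // hash back to the originating scenario/lesson file
}
```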


6. PII / tokenization boundaries

Detection

crates/shared/src/pii.rs::detect_sensitivity recognizes: email, contact_email, ssn, phone → Pii. salary, bill_rate, pay_rate → Financial.

Enforcement

catalogd::service.rs carries per-dataset column_redactions: HashMap<String, Redaction> — but enforcement at query time wasn't visible from initial grep. Auditing whether masking actually happens when staffing_inference_lakehouse retrieves from workers_500k_v8 is in scope.

Risk

Raw email + phone live in workers_500k.parquet and candidates.parquet. If the LLM mode runner retrieves chunks and the catalog hasn't masked them, the LLM sees PII. Spec says "do not expose raw PII to AI" — auditing this is non-negotiable for the staffing integration.
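
The deterministic version of this check is straightforward: scan whatever chunk text actually reaches the model for raw email/phone patterns. A minimal sketch follows, assuming the chunks have already been loaded from the corpus (parquet loading is elided) and using illustrative regexes.

```ts
// Sketch: scan corpus text destined for the LLM for raw PII patterns.
// Loading chunks from workers_500k_v8 is elided; only the check itself is shown,
// and the regexes are illustrative.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;
const PHONE_RE = /\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b/;

interface PiiHit { index: number; kind: "email" | "phone"; snippet: string; }

function scanChunksForPii(chunks: string[]): PiiHit[] {
  const hits: PiiHit[] = [];
  chunks.forEach((text, index) => {
    const email = text.match(EMAIL_RE);
    if (email) hits.push({ index, kind: "email", snippet: email[0] });
    const phone = text.match(PHONE_RE);
    if (phone) hits.push({ index, kind: "phone", snippet: phone[0] });
  });
  return hits; // any hit means masking is not applied upstream of the corpus
}
```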


7. PRD docs

  • docs/PRD.md — main PRD. §32 names staffing as the reference implementation. §158 explicitly notes Phase 19 playbook learning was originally write-only and claims the gap is now closed; this needs verification.
  • docs/CONTROL_PLANE_PRD.md — long-horizon vision (2026-04-22 pivot)

PRD references staffing throughout but doesn't itemize a "staffing PRD checklist" the way the auditor's pr_audit mode expects per-PR claims. Drift detection between PRD claims and code reality is exactly the auditor's job — running it on the PRD as input rather than a PR diff is a configuration shift, not new code.


8. Where distillation outputs should attach

The Phase 0-8 distillation substrate is already feeding the staffing surface in two places:

  1. staffing_inference_lakehouse mode → workers_500k_v8 matrix corpus. This is read-only consumption; no change needed.
  2. pr_audit mode → lakehouse_answers_v1 corpus. Generic; not staffing-specific.

What's missing for staffing:

a. Staffing-specific RAG corpus: staffing_answers_v1, built from playbook_lessons + scored scenarios. Same builder pattern as lakehouse_answers_v1 (commit 0844206's scripts/build_answers_corpus.ts); just point it at staffing inputs. A builder sketch follows item d.

b. Staffing audit task class: a staffing_audit mode in config/modes.toml, paralleling the auditor's pr_audit work. It reads PRD claims + scenario outcomes and asks "do we ship what the PRD claims for staffing?"

c. Staffing acceptance fixture — same shape as tests/fixtures/distillation/acceptance/ but with synthetic candidate + job_order + scenario + lesson rows. Pins staffing invariants: PII masked, candidates valid, scenarios reproducible.

d. Staffing replay tasks — drop sample fill requests through ./scripts/distill replay to see if the local model proposes real worker_ids vs fabricates.
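
A hedged sketch of the lesson-to-corpus transform from (a), assuming the lesson JSON shape shown in section 2; the output field names are placeholders standing in for the real RagSample schema, and PII filtering (risk 7 below) is assumed to run before vectorization.

```ts
// Sketch of turning playbook lessons (shape from section 2) into RAG corpus records.
// Input fields follow the sampled lesson JSON (only fields used here are typed);
// output field names are placeholders for the real RagSample schema.
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

interface Lesson {
  date: string; client: string; events_total: number; events_ok: number;
  model: string; lesson: string;
}

interface StaffingAnswer {
  id: string;
  text: string;
  tags: string[];          // e.g. ["task:staffing.fill"]
  source_path: string;     // provenance back to the lesson file
}

function buildStaffingAnswers(lessonsDir: string): StaffingAnswer[] {
  return readdirSync(lessonsDir)
    .filter((f) => f.endsWith(".json"))
    .map((f) => {
      const path = join(lessonsDir, f);
      const l = JSON.parse(readFileSync(path, "utf8")) as Lesson;
      return {
        id: `lesson:${l.client}:${l.date}`,
        text: `${l.client} ${l.date}: ${l.events_ok}/${l.events_total} events ok.\n${l.lesson}`,
        tags: ["task:staffing.fill"],
        source_path: path,
      };
    });
}
```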

Implementation approach (deferred until gap report + J approval):

scripts/staffing/
  audit.ts              # ./scripts/staffing audit — single entry
  build_answers.ts      # build_staffing_answers_v1 from lessons + scenarios
  build_corpus_v9.ts    # rebuild workers_500k_v9 with PII masking applied
  acceptance.ts         # staffing-specific 22-invariant gate

tests/fixtures/staffing/
  candidates_sample.parquet
  job_orders_sample.parquet
  scenario_sample.json
  lesson_sample.json

reports/staffing/
  staffing-audit-report.md
  staffing-prd-drift-report.md
  staffing-search-quality-report.md
  staffing-synthetic-data-report.md

ALL of the above is consumer-side. The distillation pipeline's scripts/distillation/, auditor/schemas/distillation/, and Phase 0-8 commits are NOT touched.


9. Risks identified during recon

  1. Synthetic data shape fragmentation — 3 distinct worker schemas across 5 files. If the staffing audit assumes one shape and the system uses another, checks will silently skip data they should cover.
  2. PII enforcement unverified. Catalog has a redaction primitive; whether it's wired to mode-runner retrieval is the audit's first deterministic check.
  3. No structured staffing audit log. Lessons + outcomes are retrospective summaries, not per-event records. Without per-event records, deterministic checks like "every worker proposed by the LLM exists in workers_500k" can't run on historical scenarios.
  4. Validator scaffolds. FillValidator::validate does schema-shape only — the worker-existence/status/geo TODOs in the source are exactly the deterministic gates the staffing audit needs to run. Wiring them is consumer work, not distillation work.
  5. Fragile PRD ↔ code linkage. PRD §158 claims Phase 19 closed the playbook write-only gap; no audit verifies. The staffing-prd-drift-report should run an inference-style claim verification against PRD claims, not unlike the auditor's pr_audit but with PRD as the source.
  6. workers_500k_v8 is the embedded corpus the LLM sees. If it carries PII without masking, the LLM has been seeing PII. Auditing the corpus content (not just the SQL views) is required.
  7. 64 playbook_lessons + 44 scenarios = ~108 RAG candidates. Plenty for a staffing_answers corpus, but PII filtering must apply before vectorization. Currently lessons may contain worker names ("Susan X. Ruiz double-booked").

10. Reuse plan (consuming the distillation substrate)

  1. Staffing audit script at scripts/staffing/audit.ts reads from existing distillation outputs:

    • data/scored-runs/ (filter to task_id starting permit: or scenario:)
    • exports/quarantine/*.jsonl (any staffing-specific quarantines)
    • reports/distillation/<latest>/summary.json (cross-reference)
  2. Reuse Phase 5 receipts harness — staffing audit writes a StageReceipt matching the existing schema, with a new stage value (extend the enum to "staffing-audit" only after schema-version bump if needed; otherwise use the existing reserved "index" slot or just write a parallel manifest under reports/staffing/).

  3. Reuse Phase 1 schemas — RagSample, SftSample, PreferenceSample work for staffing data without modification. The tags array can carry task:staffing.fill to keep the corpus self-tagged.

  4. Reuse Phase 7 replay: ./scripts/distill replay --task "fill 2 welders in Toledo OH" already works; just feed it synthetic FillRequest payloads.

  5. Reuse Phase 8 audit-full — its drift baseline tracks distillation metrics; staffing audit gets its OWN baseline file at data/_kb/staffing_audit_baselines.jsonl.

  6. Schema invariants for staffing (a check sketch follows this list):

    • every candidate_id in candidates.parquet appears in workers_500k.parquet OR is documented as "candidate-distinct-from-worker"
    • every status value in candidates.parquet is in a known enum
    • every email in workers/candidates is masked when it reaches the LLM (audit by inspecting prompt traces in Langfuse)
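
A sketch of the first two invariants as plain checks over already-loaded rows; parquet loading is elided, the allowed status set is an assumption to confirm against the data, and the third invariant (masking) is the PII scan sketched in section 6.

```ts
// Sketch of the first two invariants as pure functions over already-loaded rows.
// Parquet loading is elided; the allowed status values are an assumption, not a
// confirmed enum from the data.
interface CandidateRow { candidate_id: string; status: string; }

const KNOWN_STATUSES = new Set(["placed", "available", "inactive"]); // assumed enum

function checkCandidateWorkerLinkage(
  candidates: CandidateRow[],
  workerIds: Set<number | string>,
  documentedDistinct: Set<string>,   // candidate_ids documented as candidate-distinct-from-worker
): string[] {
  return candidates
    .filter((c) => !workerIds.has(c.candidate_id) && !documentedDistinct.has(c.candidate_id))
    .map((c) => `candidate ${c.candidate_id} has no worker row and no distinct-population note`);
}

function checkStatusEnum(candidates: CandidateRow[]): string[] {
  return candidates
    .filter((c) => !KNOWN_STATUSES.has(c.status))
    .map((c) => `candidate ${c.candidate_id} has unknown status "${c.status}"`);
}
```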

11. What this document is NOT

  • Not a green-light to start staffing audit implementation. The spec is explicit: synthetic-data gap report next, THEN J reviews, THEN code.
  • Not an audit itself. This is the inventory — the audit's first run will surface findings.
  • Not a redesign of staffing data shapes. The fragmentation is documented for the gap report; reshape decisions are J's call, not this recon's.
  • Not a modification of the distillation v1.0.0 substrate. Per spec: "DO NOT modify the completed distillation pipeline unless a blocking integration bug is found."

12. Phase 1 readiness checklist

Before staffing implementation starts, the following must be true:

  • Recon doc exists (this file)
  • Synthetic-data gap report exists (next)
  • J reviews both before any code change
  • J approves audit scope + first invariants

Phase 1 is unblocked only after the gap report is reviewed.