root ebd9ab7c77
Some checks failed
lakehouse/auditor 13 blocking issues: cloud: claim not backed — "Verified end-to-end:"
validator: Phase 43 v3 — production WorkerLookup backed by workers_500k.parquet
Closes the Phase 43 v2 loose end. The validator scaffolds (FillValidator,
EmailValidator) take Arc<dyn WorkerLookup> at construction; this commit
ships the parquet-snapshot impl that production code wires in.

Schema mapping (workers_500k.parquet → WorkerRecord):
  worker_id (int64)     → candidate_id = "W-{id}"   (matches what the
                                                     staffing executor
                                                     emits)
  name (string)         → name (already concatenated upstream)
  role (string)         → role
  city, state (string)  → city, state
  availability (double) → status: "active" if >0 else "inactive"

Workers_500k has no `status` column; we derive from `availability`
since 0.0 means vacationing/suspended/etc in this dataset's
convention. Once Track A.B's `_safe` view ships with proper status,
flip the loader to read it directly — schema mapping is in one
function (load_workers_parquet), so the swap is trivial.

In-memory snapshot model:
- Loads all 500K rows at startup → ~75MB resident
- Sync .find() — no per-call I/O on the validation hot path
- Refresh = call load_workers_parquet again to rebuild
- Caller-driven refresh (no auto-watch) — operators pick the cadence

Why workers_500k and not candidates.parquet:
candidates.parquet has the right shape (string candidate_id, status,
first/last_name) but lacks `role` — and the staffing executor matches
the W-* convention from workers_500k_v8 corpus. So the production
data path goes through workers_500k. The schema mismatch between the
two parquets is documented in `reports/staffing/synthetic-data-gap-
report.md` (gap A); resolution is operator's call.

Errors are typed (LookupLoadError):
- Open: file not found / permission
- Parse: invalid parquet
- MissingColumn: schema doesn't have required field
- BadRow: row missing worker_id or name
Schema check happens before iteration, so a wrong-shape file fails
loud immediately rather than silently building an empty lookup.

Verification:
  cargo build -p validator                       compiles
  cargo test  -p validator                       33 pass / 0 fail
                                                 (was 31; +2 for parquet)
  load_real_workers_500k smoke test              passes against the
                                                 live 500K-row file:
                                                 W-1 resolves, status +
                                                 role + city/state all
                                                 populated.

Phase 43 v3 part 2 (next):
- /v1/validate gateway endpoint that takes a JSON artifact + dispatches
  to FillValidator/EmailValidator/PlaybookValidator with a shared
  WorkerLookup loaded from the parquet at gateway startup.
- That closes the "any caller can validate" surface; execution-loop
  wiring (Phase 43 PRD's "generate → validate → correct → retry")
  becomes a thin wrapper on top of /v1/validate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 07:36:40 -05:00

16 lines
422 B
TOML

[package]
name = "validator"
version = "0.1.0"
edition = "2024"
[dependencies]
serde = { workspace = true }
serde_json = { workspace = true }
thiserror = { workspace = true }
tokio = { workspace = true }
tracing = { workspace = true }
# Parquet loader for ParquetWorkerLookup (Phase 43 v3 — production
# WorkerLookup backed by workers_500k.parquet snapshot).
arrow = { workspace = true }
parquet = { workspace = true }