Some checks failed
lakehouse/auditor 13 blocking issues: cloud: claim not backed — "Verified end-to-end:"
Closes the Phase 43 v2 loose end. The validator scaffolds (FillValidator,
EmailValidator) take Arc<dyn WorkerLookup> at construction; this commit
ships the parquet-snapshot impl that production code wires in.
Schema mapping (workers_500k.parquet → WorkerRecord):
worker_id (int64) → candidate_id = "W-{id}" (matches what the
staffing executor
emits)
name (string) → name (already concatenated upstream)
role (string) → role
city, state (string) → city, state
availability (double) → status: "active" if >0 else "inactive"
Workers_500k has no `status` column; we derive from `availability`
since 0.0 means vacationing/suspended/etc in this dataset's
convention. Once Track A.B's `_safe` view ships with proper status,
flip the loader to read it directly — schema mapping is in one
function (load_workers_parquet), so the swap is trivial.
In-memory snapshot model:
- Loads all 500K rows at startup → ~75MB resident
- Sync .find() — no per-call I/O on the validation hot path
- Refresh = call load_workers_parquet again to rebuild
- Caller-driven refresh (no auto-watch) — operators pick the cadence
Why workers_500k and not candidates.parquet:
candidates.parquet has the right shape (string candidate_id, status,
first/last_name) but lacks `role` — and the staffing executor matches
the W-* convention from workers_500k_v8 corpus. So the production
data path goes through workers_500k. The schema mismatch between the
two parquets is documented in `reports/staffing/synthetic-data-gap-
report.md` (gap A); resolution is operator's call.
Errors are typed (LookupLoadError):
- Open: file not found / permission
- Parse: invalid parquet
- MissingColumn: schema doesn't have required field
- BadRow: row missing worker_id or name
Schema check happens before iteration, so a wrong-shape file fails
loud immediately rather than silently building an empty lookup.
Verification:
cargo build -p validator compiles
cargo test -p validator 33 pass / 0 fail
(was 31; +2 for parquet)
load_real_workers_500k smoke test passes against the
live 500K-row file:
W-1 resolves, status +
role + city/state all
populated.
Phase 43 v3 part 2 (next):
- /v1/validate gateway endpoint that takes a JSON artifact + dispatches
to FillValidator/EmailValidator/PlaybookValidator with a shared
WorkerLookup loaded from the parquet at gateway startup.
- That closes the "any caller can validate" surface; execution-loop
wiring (Phase 43 PRD's "generate → validate → correct → retry")
becomes a thin wrapper on top of /v1/validate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
16 lines
422 B
TOML
16 lines
422 B
TOML
[package]
|
|
name = "validator"
|
|
version = "0.1.0"
|
|
edition = "2024"
|
|
|
|
[dependencies]
|
|
serde = { workspace = true }
|
|
serde_json = { workspace = true }
|
|
thiserror = { workspace = true }
|
|
tokio = { workspace = true }
|
|
tracing = { workspace = true }
|
|
# Parquet loader for ParquetWorkerLookup (Phase 43 v3 — production
|
|
# WorkerLookup backed by workers_500k.parquet snapshot).
|
|
arrow = { workspace = true }
|
|
parquet = { workspace = true }
|