lakehouse

History

root d11632a6fa

lakehouse/auditor 13 blocking issues: cloud: claim not backed — "Phase 8 done-criteria (per spec):"

staffing: recon + synthetic-data gap report (Phase 0, no implementation)

Spec mandates these two docs before any staffing audit runner ships:
  docs/recon/staffing-lakehouse-distillation-recon.md
  reports/staffing/synthetic-data-gap-report.md

NO distillation core touched. Distillation v1.0.0 (commit e7636f2,
tag distillation-v1.0.0) remains the stable substrate. Staffing
work is consumer-only.

Recon findings (12 sections, ~5KB):
  - Existing staffing schemas in crates/validator/staffing/* are scaffolds
    (FillValidator schema-shape only; worker-existence/status/geo TODOs)
  - Synthetic data spans 6+ shapes across 9 parquet files
    (~625k worker-shape rows + 1k candidate-shape rows)
  - PII detection lives in shared/pii.rs but enforcement at query
    time is unverified — the LLM may have been seeing raw PII via
    workers_500k_v8 vector corpus
  - 44 scenarios + 64 playbook_lessons = ~108 RAG candidates
  - No structured fill-event log exists; scenarios+lessons are
    retrospective, not queryable per-event records
  - workers_500k.phone is int (should be string — leading-zero loss)
  - client_workerskjkk.parquet is a typo file (160 rows, sibling of
    client_workersi.parquet)
  - PRD §158 claims Phase 19 closed playbook write-only gap — unverified

Gap report findings (9 sections, ~6KB):
  - 4 BLOCKING gaps requiring J decisions before audit ships:
    A. Generate fill_events.parquet from scenarios + lessons?
    B. Build views/{candidates,workers,jobs}_safe with PII masking?
    C. Delete client_workerskjkk.parquet typo file?
    D. Fix workers_500k.phone type (int → string)?
  - 5 SOFT gaps the audit can run with (will be reported as findings)
  - 3 NON-gaps (data sufficient as-is)
  - Recommendation: NO new synthetic data needed; only normalization
    of what already exists, contingent on J approval of A-D

Up-front commitments:
  - Distillation v1.0.0 substrate untouched (verified by audit-full
    running clean before+after each staffing change)
  - All synthetic-data modifications via deterministic scripts under
    scripts/staffing/, never hand-edit
  - Every staffing artifact carries canonical sha256 provenance back
    to source parquet/scenario/lesson
  - _safe views are the source of truth for LLM-facing text; raw
    parquets never directly fed into corpus builds

Phase 1 unblocks AFTER J reviews both docs and approves audit scope
+ the 4 gap-fix decisions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-27 00:02:47 -05:00

synthetic-data-gap-report.md

staffing: recon + synthetic-data gap report (Phase 0, no implementation)

2026-04-27 00:02:47 -05:00