profit/lakehouse

Commit Graph

Author: root

SHA1: d11632a6fa

staffing: recon + synthetic-data gap report (Phase 0, no implementation)

Some checks failed. lakehouse/auditor: 13 blocking issues; cloud: claim not backed: "Phase 8 done-criteria (per spec):"

Spec mandates these two docs before any staffing audit runner ships:
  docs/recon/staffing-lakehouse-distillation-recon.md
  reports/staffing/synthetic-data-gap-report.md

NO distillation core touched. Distillation v1.0.0 (commit e7636f2,
tag distillation-v1.0.0) remains the stable substrate. Staffing
work is consumer-only.

Recon findings (12 sections, ~5KB):
  - Existing staffing schemas in crates/validator/staffing/* are scaffolds
    (FillValidator schema-shape only; worker-existence/status/geo TODOs)
  - Synthetic data spans 6+ shapes across 9 parquet files
    (~625k worker-shape rows + 1k candidate-shape rows)
  - PII detection lives in shared/pii.rs but enforcement at query
    time is unverified — the LLM may have been seeing raw PII via
    workers_500k_v8 vector corpus
  - 44 scenarios + 64 playbook_lessons = ~108 RAG candidates
  - No structured fill-event log exists; scenarios+lessons are
    retrospective, not queryable per-event records
  - workers_500k.phone is int (should be string — leading-zero loss)
  - client_workerskjkk.parquet is a typo file (160 rows, sibling of
    client_workersi.parquet)
  - PRD §158 claims Phase 19 closed playbook write-only gap — unverified
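
The leading-zero finding for workers_500k.phone can be illustrated with a minimal sketch. The sample value and both helper names are hypothetical, and the fixed 10-digit width is an assumption for illustration only; the actual data may not have a uniform width, in which case the lost zeros cannot be recovered reliably.

```python
def roundtrip_as_int(phone: str) -> str:
    """Simulate storing a phone number in an integer column and reading it back."""
    return str(int(phone))


def restore_width(phone_int: int, width: int = 10) -> str:
    """Zero-pad back to a fixed width.

    Only valid under the ASSUMPTION that every number has exactly `width` digits.
    """
    return str(phone_int).zfill(width)


original = "0123456789"             # made-up sample value, not from the dataset
stored = roundtrip_as_int(original)
print(stored)                       # 123456789 (leading zero is gone)
print(restore_width(int(stored)))   # 0123456789 (recovered only because width is known)
```

Storing the column as a string from the start avoids needing any width assumption at all.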

Gap report findings (9 sections, ~6KB):
  - 4 BLOCKING gaps requiring J decisions before audit ships:
    A. Generate fill_events.parquet from scenarios + lessons?
    B. Build views/{candidates,workers,jobs}_safe with PII masking?
    C. Delete client_workerskjkk.parquet typo file?
    D. Fix workers_500k.phone type (int → string)?
  - 5 SOFT gaps the audit can run with (will be reported as findings)
  - 3 NON-gaps (data sufficient as-is)
  - Recommendation: NO new synthetic data needed; only normalization
    of what already exists, contingent on J approval of A-D
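
Decision B (PII-masked _safe views) could look roughly like the sketch below. This is not the project's implementation: the field names (name, phone, email), the redaction token, and the keep-last-two-digits policy are all assumptions chosen for illustration, and shared/pii.rs may work quite differently.

```python
def mask_phone(phone: str, keep_last: int = 2) -> str:
    """Replace all but the last `keep_last` digits with '*', keeping punctuation."""
    total = sum(ch.isdigit() for ch in phone)
    seen = 0
    out = []
    for ch in phone:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen > total - keep_last else "*")
        else:
            out.append(ch)
    return "".join(out)


def safe_row(row: dict) -> dict:
    """Return a copy of `row` with hypothetical PII fields masked or redacted."""
    safe = dict(row)
    if "phone" in safe:
        safe["phone"] = mask_phone(str(safe["phone"]))
    for field in ("name", "email"):  # assumed PII field names
        if field in safe:
            safe[field] = "[REDACTED]"
    return safe


# Made-up record for illustration only.
print(safe_row({"name": "A. Worker", "phone": "555-123-4567", "city": "Austin"}))
```

Returning a copy rather than mutating the input keeps the raw row intact, so the masking step stays a pure transformation from raw data to the LLM-facing view.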

Up-front commitments:
  - Distillation v1.0.0 substrate untouched (verified by audit-full
    running clean before+after each staffing change)
  - All synthetic-data modifications via deterministic scripts under
    scripts/staffing/, never hand-edited
  - Every staffing artifact carries canonical sha256 provenance back
    to source parquet/scenario/lesson
  - _safe views are the source of truth for LLM-facing text; raw
    parquets never directly fed into corpus builds
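
One way to read the "canonical sha256 provenance" commitment is sketched below. The commit does not define "canonical"; this sketch assumes it means hashing a JSON serialization with sorted keys and fixed separators, so the same source record always yields the same digest. The function names, the provenance field layout, and the example path are hypothetical.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record via a deterministic JSON encoding (ASSUMED canonical form)."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def with_provenance(artifact: dict, source_path: str, source_record: dict) -> dict:
    """Attach a provenance stamp pointing back to the source record."""
    return {
        **artifact,
        "provenance": {
            "source": source_path,
            "sha256": canonical_sha256(source_record),
        },
    }


# Made-up source record and path, for illustration only.
source = {"scenario_id": 1, "role": "forklift operator"}
stamped = with_provenance({"event": "fill"}, "scenarios/example.json", source)
print(stamped["provenance"]["sha256"])
```

Because keys are sorted before hashing, two records with the same content but different key order produce the same digest, which is what makes the stamp reproducible by a deterministic script.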

Phase 1 unblocks AFTER J reviews both docs and approves audit scope
+ the 4 gap-fix decisions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Date: 2026-04-27 00:02:47 -05:00

1 Commits