lakehouse/auditor 9 blocking issues: cloud: claim not backed — "Verified live (current synthetic data):"
Decision B from reports/staffing/synthetic-data-gap-report.md §7
(plus C: client_workerskjkk.parquet typo file removed from
data/datasets/ — was never tracked, no git effect).
PII enforcement was UNVERIFIED in workers_500k_v8 (the corpus
staffing_inference mode embeds chunks from). Verified 2026-04-27 by
inspecting data/vectors/meta/workers_500k_v8.json — `source:
"workers_500k"` confirms v8 was built directly from the raw table, so
the LLM has been seeing names / emails / phones / resume_text for every
staffing query.
This commit closes the boundary at the catalog metadata layer:
candidates_safe (overhauled — was failing SQL invalid 434×/day on a
nonexistent `vertical` column reference, copy-pasted from job_orders):
drops last_name, email, phone, hourly_rate_usd
candidate_id masked (keep first 3, last 2)
row_filter: status != 'blocked'
workers_safe (NEW):
drops name, email, phone, zip, communications, resume_text
keeps role, city, state, skills, certifications, archetype, scores
resume_text + communications carry verbatim PII (full names) and
there is no in-view text scrubber, so they are dropped wholesale.
Skills + certifications + scores carry the matching signal for
staffing inference.
jobs_safe (NEW):
drops description (often quotes client names verbatim)
client_id masked (keep first 3, last 2)
bill_rate / pay_rate kept — commercial info, not PII per staffing PRD
scripts/staffing/build_workers_v9.sh (NEW):
POSTs /vectors/index to rebuild workers_500k_v9 from `workers_safe`
rather than the raw table. Embedded text is constructed from the
view projection so PII never enters the corpus by construction.
30+ minute background job — not run inline. After it completes,
flip config/modes.toml `staffing_inference` matrix_corpus from
workers_500k_v8 to workers_500k_v9 and restart gateway.
Distillation v1.0.0 substrate untouched. audit-full passed clean
(16/16 required) before this commit; will re-verify after.
10,000 staffing worker profiles from profit/ethereal repo. Flattened
JSON → CSV → Parquet. Indexed on HNSW (9.5s) + Lance IVF_PQ (7.2s).
SQL hybrid verified: forklift operators in IL with reliability > 0.8
returned exact matches. Vector search alone missed the state filter —
confirms the hybrid SQL+vector routing need from quality eval.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RAG pipeline now includes a cross-encoder rerank step between retrieval
and generation. The LLM re-sorts top-K results by relevance before
they become context. Falls back to original order if model output is
unparseable (~5% with 7B models). Also improved the generation prompt
to be domain-aware ("staffing database") and request specific citations.
Fixed 4 catalog manifests with bucket="data" (pre-federation leftover)
that poisoned the entire DataFusion query context on startup. The
"users", "lab_trials", "meta_runs", and "new_candidates" datasets
now correctly reference bucket="primary". This bug was surfaced by
the quality evaluation pipeline — wouldn't have been found by
structural tests alone.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
clean_sql now uses 3 strategies in priority order:
1. Extract from ```sql...``` markdown blocks
2. Find first SELECT/WITH/INSERT statement in text
3. Strip leading "sql" keyword fallback
Tested against 5 real model output patterns:
- Clean SQL ✓
- "sql" prefixed ✓
- Markdown fenced ✓
- Explanation before ```sql block ✓
- Explanation with SELECT buried in text ✓
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- POST /ingest/postgres/tables — list all tables in a database
- POST /ingest/postgres/import — import table → Parquet → catalog → queryable
- Auto type mapping: int2/4/8 → Int, float4/8 → Float64, bool → Boolean,
text/varchar/jsonb/timestamp → Utf8 (safe default per ADR-010)
- Auto PII detection + lineage on import
- Empty password support for trust auth
- Tested: imported lab_trials (40 rows, 10 cols) and threat_intel (20 rows, 30 cols)
from local knowledge_base Postgres database — immediately queryable
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Drop CSV/JSON/PDF/text into ./inbox → auto-detected → Parquet → queryable
- Polls every 10 seconds (configurable)
- Processed files moved to ./inbox/processed/
- Failed files moved to ./inbox/failed/
- Dedup: same file dropped twice = no-op
- Watcher starts automatically on gateway boot
- Tested: CSV dropped → queryable in <15s
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 4 parallel pipelines on i9 + A4000 via Ollama
- Previous single-pipeline: 43/sec, 39min for 100K
- Supervisor: 67.6/sec, 22min for 100K
- Previous 100K attempt failed at 97K (no retry) — supervisor handles this
- Checkpointing every 1000 chunks for crash recovery
- Round-robin retry on batch failure (3 attempts)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- JobTracker: create/update/complete/fail jobs with progress tracking
- POST /vectors/index now returns immediately with job_id (HTTP 202)
- Embedding runs in tokio::spawn background task
- GET /vectors/jobs/{id} returns live progress (chunks embedded, rate, ETA)
- GET /vectors/jobs lists all jobs
- Progress logged every 100 batches with chunks/sec and ETA
- 100K embedding job running successfully at 44 chunks/sec
- System stays responsive during embedding (queries in 23ms)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- WorkspaceManager: create/get/list workspaces with daily/weekly/monthly/pinned tiers
- Saved searches: agent stores SQL queries in workspace context
- Shortlist: tag candidates/records to a workspace with notes
- Activity log: track calls, emails, updates per workspace per agent
- Instant handoff: transfer workspace ownership with full history
Zero data copy — just a pointer swap, receiving agent sees everything
- Persistence: workspaces stored as JSON in object storage, rebuilt on startup
- Endpoints: /workspaces/create, /{id}, /{id}/handoff, /{id}/search,
/{id}/shortlist, /{id}/activity
- Tested: Sarah creates workspace, saves searches, shortlists 3 candidates,
logs activity, hands off to Mike who continues seamlessly
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- MemCache: LRU in-memory cache for hot datasets (configurable max, default 16GB)
Pin/evict/stats endpoints: POST /query/cache/pin, /cache/evict, GET /cache/stats
- Delta store: append-only delta Parquet files for row-level updates
Write deltas without rewriting base files, merge at query time
- Compaction: POST /query/compact merges deltas into base Parquet
- Query engine: checks cache first, falls back to Parquet, merges deltas
- Benchmarked on 2.47M rows:
1M row JOIN: 854ms cold → 96ms hot (8.9x speedup)
100K filter: 62ms cold → 21ms hot (3x speedup)
1.1M rows cached in 408MB RAM
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ui: Dioxus WASM app with dataset sidebar, SQL editor (Ctrl+Enter), results table
- ui: dynamic API base URL (same-origin for nginx, port-based for local dev)
- gateway: CORS enabled for cross-origin requests
- nginx: lakehouse.devop.live proxies UI (:3300) + API (:3100) on same origin
- justfile: ui-build, ui-serve, sidecar, up commands added
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>