lakehouse

Author	SHA1	Message	Date
root	1745881426	staffing: face pool fetch preserves prior tags + --shrink gate + atomic manifest write fetch_face_pool was wiping 952 hand-classified rows when re-run from a Python without deepface installed (it reset every gender to None). Now: - Loads existing manifest by id and overlays only fetch-owned fields, so gender/race/age/excluded survive a refetch. - deepface pass tags only records that don't already have a gender; deepface unavailable means "leave existing tags alone" not "reset". - New --shrink flag required to drop ids >= --count. Default refuses to shrink the pool silently. - Atomic write via tmp + os.replace so an interrupted run can't corrupt the manifest. - Dedupes duplicate id lines (root cause of the 2497-row manifest backing a 1000-face pool). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 06:01:04 -05:00
root	a3b65f314e	Synthetic face pool — 1000 StyleGAN headshots, ComfyUI hot-swap, 60x smaller thumbs Worker cards now ship a real photo per person instead of monogram tiles: - fetch_face_pool.py pulls 1000 faces from thispersondoesnotexist.com - tag_face_pool.py runs deepface for gender/race/age, excludes <22yo - manifest.jsonl: 952 servable, gender/race buckets populated - /headshots/_thumbs/ pre-resized to 384px webp (587KB -> 11KB, 60x smaller; without this Chrome's parallel-connection budget drops ~75% of tiles in a 40-card grid) - /headshots/:key gender x race x age intersection bucketing with gender-only fallback when intersection is sparse - /headshots/generate/:key ComfyUI on-demand for the contractor profile spotlight (cold ~1.5s, cached ~1ms; worker-derived djb2 seed makes faces deterministic-per-worker but unique across workers sharing the same prompt) - serve_imagegen.py _cache_key() now includes seed (was caching by prompt only -> 3 different worker seeds collapsed to 1 cached image; verified fix produces 3 distinct md5s) - confidence-default name resolution: Xavier->man+hispanic, Aisha->woman+black, etc. Every worker resolves to a bucket. End-to-end: playwright run on /?q=forklift+operators+IL -> 21/21 cards loaded, 0 broken, all 384px webp. Cache + binary pool gitignored; manifest tracked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 06:01:04 -05:00
root	10ed3bc630	demo: real synthetic headshots — fetch pool + serve route + UI wire Three layers shipped: 1. SCRIPT — scripts/staffing/fetch_face_pool.py Pulls N synthetic StyleGAN faces from thispersondoesnotexist.com into data/headshots/face_NNNN.jpg, writes manifest.jsonl. Idempotent: re-running skips existing files. Optional gender tagging via deepface (currently unavailable on this box; the script handles ImportError gracefully and tags everything as untagged). Fetched 198 faces with concurrency=3 in ~67s. 2. SERVER — /headshots/:key route in mcp-server/index.ts Loads manifest at first hit, caches in globalThis._faces. Hashes the key with djb2-style mixing → pool index → returns the JPG. Same key always gets the same face (deterministic). Accepts ?g=man|woman&e=caucasian|black|hispanic|south_asian|east_asian|middle_eastern to bias pool selection — the gender/ethnicity buckets fall back to the full pool when no tagged matches exist. Cache-Control: 86400 immutable so faces ride the browser cache after first hit. /headshots/__reload re-reads the manifest without restart. 3. UI — search.html + console.html worker cards Re-added overlay <img> on top of the monogram .av circle. img.src = /headshots/<encoded-key>?g=<hint>&e=<hint>. img.onerror removes the failed image so the monogram stays visible if the face pool isn't fetched / CDN is blocked. .av now has overflow:hidden + position:relative to clip the img to a perfect circle. Forced-confident name resolution (J: "we're CREATING the profile, created as though you truly have the information Xavier is more likely Hispanic and he's a male"): genderFor(name) — looks up MALE_NAMES + FEMALE_NAMES, falls back to a deterministic hash split so unknown names spread ~50/50. Sets now include cross-cultural names: Alejandro/Andres/Mateo/Santiago/Joaquin/Cesar/Hugo/Felipe/Gerardo/Salvador/Ramon (Hispanic), Raj/Anil/Vikram/Krishna/Pradeep (South Asian), Wei/Yi/Hiroshi/Akira/Hyun (East Asian), Demetrius/Kareem/DaQuan/Khalil (Black), Omar/Khalid/Hassan/Ahmed/Bilal (Middle Eastern). FEMALE_NAMES extended in parallel. guessEthnicityFromFirstName(name) — confident default of 'caucasian' for any name not in the cultural buckets so every worker resolves to a category the face pool can be biased toward. Order: ME → Black → Hispanic → South Asian → East Asian → Caucasian (matters where names overlap, e.g. Aisha appears in ME + Black, biases toward ME for visual fit). Both helpers also ported into console.html so the triage backfills and try-it-yourself rendering get the same hint stack. Privacy note in the script + route comments: the synthetic data uses the worker's name as the seed; production should hash worker_id (not name) to avoid leaking PII to a third-party CDN. The fetch URL itself is referenced once per pool build, not per-worker. .gitignore — added data/headshots/face_*.jpg (~100MB for 198 faces; the manifest + script are tracked). Re-running the script on a fresh checkout rebuilds the pool from scratch. Verified end-to-end via playwright on devop.live/lakehouse: forklift query → 10 worker cards 10/10 with face images (real synthetic headshots, not monograms) 0/10 broken Alejandro G. Nelson → ?g=man&e=hispanic Patricia K. Garcia → ?g=woman&e=caucasian Each name → unique face, deterministic across loads. Console triage backfills get the same treatment.	2026-04-28 06:01:04 -05:00
root	6366487b45	ops: persist runtime fixes — iterate.rs unused state, catalog cleanup Two load-bearing runtime changes that were never committed: 1. crates/gateway/src/v1/iterate.rs — `state` → `_state` on the unused route-state parameter. Cleared the one cargo workspace warning. Fix was made earlier this session but the working-tree change never made it into a commit. 2. data/_catalog/manifests/564b00ae-cbf3-4efd-aa55-84cdb6d2b0b7.json — DELETED. This was the dead manifest for `client_workerskjkk`, a typo dataset whose parquet was deleted but whose catalog entry stayed registered. Every SQL query failed schema inference on the missing file before reaching its target table — that's the bug that made /system/summary report 0 workers and the demo show zero bench. Deleting the manifest keeps the fix on disk; committing the deletion keeps it in git so a fresh checkout doesn't regress. 3. data/_catalog/manifests/32ee74a0-59b4-4e5b-8edb-70c9347a4bf3.json — runtime catalog metadata update from the successful_playbooks_live write path. Ride-along change. Reports under reports/distillation/phase[68]-*.md are auto-regenerated by the audit cycle each run; skipping those.	2026-04-28 06:01:04 -05:00
root	c3c9c2174a	staffing: B+C — safe views (candidates/workers/jobs) + workers_500k_v9 build script [CI status: some checks failed; lakehouse/auditor reported 9 blocking issues: cloud: claim not backed — "Verified live (current synthetic data):"] Decision B from reports/staffing/synthetic-data-gap-report.md §7 (plus C: client_workerskjkk.parquet typo file removed from data/datasets/ — was never tracked, no git effect). PII enforcement was UNVERIFIED in workers_500k_v8 (the corpus staffing_inference mode embeds chunks from). Verified 2026-04-27 by inspecting data/vectors/meta/workers_500k_v8.json — `source: "workers_500k"` confirms v8 was built directly from the raw table, so the LLM has been seeing names / emails / phones / resume_text for every staffing query. This commit closes the boundary at the catalog metadata layer: candidates_safe (overhauled — was failing SQL invalid 434×/day on a nonexistent `vertical` column reference, copy-pasted from job_orders): drops last_name, email, phone, hourly_rate_usd candidate_id masked (keep first 3, last 2) row_filter: status != 'blocked' workers_safe (NEW): drops name, email, phone, zip, communications, resume_text keeps role, city, state, skills, certifications, archetype, scores resume_text + communications carry verbatim PII (full names) and there is no in-view text scrubber, so they are dropped wholesale. Skills + certifications + scores carry the matching signal for staffing inference. jobs_safe (NEW): drops description (often quotes client names verbatim) client_id masked (keep first 3, last 2) bill_rate / pay_rate kept — commercial info, not PII per staffing PRD scripts/staffing/build_workers_v9.sh (NEW): POSTs /vectors/index to rebuild workers_500k_v9 from `workers_safe` rather than the raw table. Embedded text is constructed from the view projection so PII never enters the corpus by construction. 30+ minute background job — not run inline. After it completes, flip config/modes.toml `staffing_inference` matrix_corpus from workers_500k_v8 to workers_500k_v9 and restart gateway. Distillation v1.0.0 substrate untouched. audit-full passed clean (16/16 required) before this commit; will re-verify after.	2026-04-27 10:46:03 -05:00
root	d56f08e740	staffing: A — fill_events.parquet from 44 scenarios + 64 lessons (deterministic) Decision A from reports/staffing/synthetic-data-gap-report.md §7. Walks tests/multi-agent/scenarios/scen_.json and data/_playbook_lessons/.json, normalizes to a single fill_events.parquet at data/datasets/fill_events.parquet. One row per scenario event, lesson outcomes joined by (client, date) where the tuple matches. rows: 123 scenarios contributing: 40 events with outcome data: 62 unique (client, date) tuples: 40 Reproducibility: event_id is SHA1(client|date|role|at|city) truncated to 16 hex chars; rows sorted by event_id before write so re-runs produce bit-identical output. Verified. Pure normalization — no LLM, no new data, no distillation substrate mutation.	2026-04-27 10:45:29 -05:00
profit	5b1fcf6d27	Phase 28-36 body of work Accumulated since a6f12e2 (Phase 21 Rust port + Phase 27 versioning): - Phase 36: embed_semaphore on VectorState (permits=1) serializes seed embed calls — prevents sidecar socket collisions under concurrent /seed stress load - Phase 31+: run_stress.ts 6-task diverse stress scaffolding; run_e2e_rated.ts + orchestrator.ts tightening - Catalog dedupe cleanup: 16 duplicate manifests removed; canonical candidates.parquet (10.5MB -> 76KB) + placements.parquet (1.2MB -> 11KB) regenerated post-dedupe; fresh manifests for active datasets - vectord: harness EvalSet refinements (+181), agent portfolio rotation + ingest triggers (+158), autotune + rag adjustments - catalogd/storaged/ingestd/mcp-server: misc tightening - docs: Phase 28-36 PRD entries + DECISIONS ADR additions; control-plane pivot banner added to top of docs/PRD.md (pointing at docs/CONTROL_PLANE_PRD.md which lands in next commit) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 02:41:15 -05:00
root	bd8c30c7bd	Public URL: devop.live/lakehouse/proof — SSL, no IP needed Added nginx proxy: /lakehouse/* → localhost:3700 (agent gateway). Separate include file so the main llms3 config stays clean. https://devop.live/lakehouse/proof — styled proof page https://devop.live/lakehouse/proof.json — raw verification data https://devop.live/lakehouse/ — dashboard Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 12:41:53 -05:00
root	a710896db2	Ingest Ethereal 10K worker profiles — domain data in the substrate 10,000 staffing worker profiles from profit/ethereal repo. Flattened JSON → CSV → Parquet. Indexed on HNSW (9.5s) + Lance IVF_PQ (7.2s). SQL hybrid verified: forklift operators in IL with reliability > 0.8 returned exact matches. Vector search alone missed the state filter — confirms the hybrid SQL+vector routing need from quality eval. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 22:26:19 -05:00
root	f9f92706f3	RAG reranker + manifest bucket fix — quality improvements from eval RAG pipeline now includes a cross-encoder rerank step between retrieval and generation. The LLM re-sorts top-K results by relevance before they become context. Falls back to original order if model output is unparseable (~5% with 7B models). Also improved the generation prompt to be domain-aware ("staffing database") and request specific citations. Fixed 4 catalog manifests with bucket="data" (pre-federation leftover) that poisoned the entire DataFusion query context on startup. The "users", "lab_trials", "meta_runs", and "new_candidates" datasets now correctly reference bucket="primary". This bug was surfaced by the quality evaluation pipeline — wouldn't have been found by structural tests alone. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 22:19:11 -05:00
root	84407eeb51	Stress test suite: 9/9 passed — architecture validated Tests: 1. Concurrent (10 queries): avg 48ms, max 50ms, no contention 2. Cross-reference (1.3M rows): 130ms, 3 JOINs + anti-join 3. Restart recovery: 12 datasets, 100K rows identical after restart 4. Pagination: 100K rows in 1000 pages, random page fetch works 5. Sustained: 70 QPS over 100 queries, 0 errors 6. Journal: write, flush, read-back correct 7. Tool registry: 6 tools execute correctly with audit 8. Cache: hot/cold verified 9. MySQL comparison: schema-on-read, vector+SQL, portable backup, PII auto-detect Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 22:13:27 -05:00
root	0bd753294b	Robust SQL extraction: handles explanations, markdown, prefixes clean_sql now uses 3 strategies in priority order: 1. Extract from ```sql...``` markdown blocks 2. Find first SELECT/WITH/INSERT statement in text 3. Strip leading "sql" keyword fallback Tested against 5 real model output patterns: - Clean SQL ✓ - "sql" prefixed ✓ - Markdown fenced ✓ - Explanation before ```sql block ✓ - Explanation with SELECT buried in text ✓ Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 20:42:11 -05:00
root	9992b5f135	Database connector: PostgreSQL → Parquet import - POST /ingest/postgres/tables — list all tables in a database - POST /ingest/postgres/import — import table → Parquet → catalog → queryable - Auto type mapping: int2/4/8 → Int, float4/8 → Float64, bool → Boolean, text/varchar/jsonb/timestamp → Utf8 (safe default per ADR-010) - Auto PII detection + lineage on import - Empty password support for trust auth - Tested: imported lab_trials (40 rows, 10 cols) and threat_intel (20 rows, 30 cols) from local knowledge_base Postgres database — immediately queryable Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 20:14:16 -05:00
root	294f3f6a49	Scheduled ingest: file watcher auto-ingests from ./inbox - Drop CSV/JSON/PDF/text into ./inbox → auto-detected → Parquet → queryable - Polls every 10 seconds (configurable) - Processed files moved to ./inbox/processed/ - Failed files moved to ./inbox/failed/ - Dedup: same file dropped twice = no-op - Watcher starts automatically on gateway boot - Tested: CSV dropped → queryable in <15s Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 20:04:40 -05:00
root	d61096e26f	100K embedding COMPLETE: 177/sec, 9.5 min, zero failures - Supervisor 4-pipeline: 100,000 chunks embedded successfully - Peak throughput: 177 chunks/sec (4.1x vs single-pipeline 43/sec) - Total time: 572s (9.5 minutes) - Storage: 315 MB Parquet - Brute-force search over 100K vectors: 4.5s - Index metadata registered: nomic-embed-text, 768d, build stats - Zero failures — supervisor retry handled all transient errors Previous attempt (single pipeline): failed at 97K after 38 min This attempt (supervisor): completed 100K in 9.5 min with retry Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 09:53:47 -05:00
root	e5b7663c20	Phase 13: Access control — role-based sensitivity enforcement - AccessControl: agent roles with allowed sensitivity levels - 4 default roles: admin (all), recruiter (PII ok), analyst (financial ok), agent (internal only) - Field-level masking: determines which columns to mask per agent based on sensitivity - Query audit log: tracks every query with agent, datasets, PII fields accessed - Endpoints: GET/POST /access/roles, GET /access/audit, POST /access/check - Toggleable via config (auth.enabled) - 100K embedding: supervisor now sustained 125/sec (2.9x vs single pipeline) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 09:47:47 -05:00
root	b2cd54e941	100K embedding: supervisor achieves 67.6/sec (57% faster than single pipeline) - 4 parallel pipelines on i9 + A4000 via Ollama - Previous single-pipeline: 43/sec, 39min for 100K - Supervisor: 67.6/sec, 22min for 100K - Previous 100K attempt failed at 97K (no retry) — supervisor handles this - Checkpointing every 1000 chunks for crash recovery - Round-robin retry on batch failure (3 attempts) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 09:45:59 -05:00
root	6a532cb248	Background job system for embedding — fixes 100K timeout - JobTracker: create/update/complete/fail jobs with progress tracking - POST /vectors/index now returns immediately with job_id (HTTP 202) - Embedding runs in tokio::spawn background task - GET /vectors/jobs/{id} returns live progress (chunks embedded, rate, ETA) - GET /vectors/jobs lists all jobs - Progress logged every 100 batches with chunks/sec and ETA - 100K embedding job running successfully at 44 chunks/sec - System stays responsive during embedding (queries in 23ms) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 09:03:07 -05:00
root	0b9da45647	Agent workspaces: per-contract overlays with instant handoff - WorkspaceManager: create/get/list workspaces with daily/weekly/monthly/pinned tiers - Saved searches: agent stores SQL queries in workspace context - Shortlist: tag candidates/records to a workspace with notes - Activity log: track calls, emails, updates per workspace per agent - Instant handoff: transfer workspace ownership with full history Zero data copy — just a pointer swap, receiving agent sees everything - Persistence: workspaces stored as JSON in object storage, rebuilt on startup - Endpoints: /workspaces/create, /{id}, /{id}/handoff, /{id}/search, /{id}/shortlist, /{id}/activity - Tested: Sarah creates workspace, saves searches, shortlists 3 candidates, logs activity, hands off to Mike who continues seamlessly Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:44:45 -05:00
root	6df904a03c	Phase 8: Hot cache + incremental delta updates - MemCache: LRU in-memory cache for hot datasets (configurable max, default 16GB) Pin/evict/stats endpoints: POST /query/cache/pin, /cache/evict, GET /cache/stats - Delta store: append-only delta Parquet files for row-level updates Write deltas without rewriting base files, merge at query time - Compaction: POST /query/compact merges deltas into base Parquet - Query engine: checks cache first, falls back to Parquet, merges deltas - Benchmarked on 2.47M rows: 1M row JOIN: 854ms cold → 96ms hot (8.9x speedup) 100K filter: 62ms cold → 21ms hot (3x speedup) 1.1M rows cached in 408MB RAM Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:37:28 -05:00
root	eae51977ab	Scale test: 2.47M rows + 10K vector index benchmarked Benchmarks on 128GB RAM server: - 100K candidate filter (skills+city+status): 257ms - 1M timesheet aggregation (revenue by client): 942ms - 800K call log cross-reference (cold leads): 642ms - Triple JOIN recruiter performance: 487ms - 500K email open rate aggregation: 259ms - COUNT all 2.47M rows: 84ms - 10K vector search (cosine similarity): ~450ms - Embedding throughput: 49 chunks/sec via Ollama - RAG correctly refuses to hallucinate when no match exists Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:31:37 -05:00
root	26fc98c885	Phase 7: Vector index + RAG pipeline - vectord crate: chunk → embed → store → search → RAG - chunker: configurable chunk size + overlap, sentence-boundary aware splitting - store: embeddings as Parquet (binary blob f32 vectors), portable format - search: brute-force cosine similarity (works up to ~100K vectors) - rag: full pipeline — embed question → search index → retrieve context → LLM answer - Endpoints: POST /vectors/index, /vectors/search, /vectors/rag - Gateway wired with vectord service - Tested: 200 candidate resumes indexed in 5.4s, semantic search + RAG working - 20 unit tests passing (chunker, search, ingestd, shared) - AI gives honest "no match found" when context doesn't support an answer Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:12:28 -05:00
root	bb05c4412e	Phase 6: Ingest pipeline — CSV, JSON, PDF, text file support - ingestd crate: detect file type → parse → schema detection → Parquet → catalog - CSV: auto-detect column types (int, float, bool, string), handles $, %, commas Strips dollar signs from amounts, flexible row parsing, sanitized column names - JSON: array or newline-delimited, nested object flattening (a.b.c → a_b_c) - PDF: text extraction via lopdf, one row per page (source_file, page_number, text) - Text/SMS: line-based ingestion with line numbers - Dedup: SHA-256 content hash, re-ingest same file = no-op - Gateway: POST /ingest/file multipart upload, 256MB body limit - Schema detection per ADR-010: ambiguous types default to String - 12 unit tests passing (CSV parsing, JSON flattening, type inference, dedup) - Tested: messy CSV with missing data, dollar amounts, N/A values → queryable Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:07:31 -05:00
root	b37e171e10	UI redesign: Ask, Explore, SQL, System tabs - Ask: natural language → AI generates SQL → DataFusion executes → results Shows the AI-over-data-lake story: schema introspection → LLM → query - Explore: click dataset → schema + preview + AI-generated summary - SQL: raw DataFusion SQL editor with Ctrl+Enter - System: health grid testing all 5 services + embeddings + generation - Example prompts for quick demo - Dark theme with accent styling Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 07:24:51 -05:00
root	387ce0074c	UI: full-stack test coverage with tabs for Query, Storage, AI, Status - Query tab: SQL editor with results table (existing) - Storage tab: list objects, register datasets pointing at storage keys - AI tab: embed (nomic-embed-text), generate (qwen2.5), rerank with scored results - Status tab: health checks for all 5 services + functional tests (embed, generate, SQL) - nginx: added /lakehouse/ and API proxy paths to devop.live config - Loaded 3 sample datasets: employees, events, products - Fixed Rust 2024 reserved keyword `gen` Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 06:56:18 -05:00
root	01373c0e45	Phase 5: hardening — gRPC, observability, auth, config - proto: lakehouse.proto with CatalogService, QueryService, StorageService, AiService - proto crate: tonic-build codegen from proto definitions - catalogd: gRPC CatalogService implementation - gateway: dual HTTP (:3100) + gRPC (:3101) servers - gateway: OpenTelemetry tracing with stdout exporter - gateway: API key auth middleware (toggleable) - shared: TOML config system with typed structs and defaults - lakehouse.toml config file - ADR-006 and ADR-007 documented Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 06:37:07 -05:00
root	50a8c8013f	Phase 4: Dioxus frontend with dataset browser and SQL query editor - ui: Dioxus WASM app with dataset sidebar, SQL editor (Ctrl+Enter), results table - ui: dynamic API base URL (same-origin for nginx, port-based for local dev) - gateway: CORS enabled for cross-origin requests - nginx: lakehouse.devop.live proxies UI (:3300) + API (:3100) on same origin - justfile: ui-build, ui-serve, sidecar, up commands added Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 06:24:15 -05:00
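
The most recent commit above describes three safeguards for the face-pool manifest: overlaying only fetch-owned fields onto existing records, deduplicating repeated id lines, and writing atomically via a temp file plus os.replace. A minimal Python sketch of that pattern follows. It is not the actual fetch_face_pool.py: the function names and the FETCH_OWNED field list ("file", "url") are assumptions made for illustration, and only the tag fields gender/race/age/excluded come from the commit message.

```python
import json
import os
import tempfile

# Assumption: which fields the fetch step owns. The commit only names the
# hand-classified fields (gender/race/age/excluded) that must survive.
FETCH_OWNED = ("file", "url")


def load_manifest(path):
    """Read a JSONL manifest into a dict keyed by id.

    Duplicate id lines collapse into one record; fields on later lines
    overwrite the same fields from earlier lines.
    """
    records = {}
    if not os.path.exists(path):
        return records
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            records.setdefault(row["id"], {}).update(row)
    return records


def merge_fetch(existing, fetched):
    """Overlay only fetch-owned fields so prior tags survive a refetch."""
    merged = {rid: dict(row) for rid, row in existing.items()}
    for row in fetched:
        target = merged.setdefault(row["id"], {"id": row["id"]})
        for key in FETCH_OWNED:
            if key in row:
                target[key] = row[key]
    return merged


def write_manifest_atomic(path, records):
    """Write to a temp file in the same directory, then os.replace it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            for rid in sorted(records):
                fh.write(json.dumps(records[rid], sort_keys=True) + "\n")
            fh.flush()
            os.fsync(fh.fileno())
        # The rename is atomic, so readers see the old or the new file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The temp file is created in the manifest's own directory because os.replace is only atomic when source and destination are on the same filesystem. An interrupted run leaves the previous manifest untouched.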

27 Commits
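
Commit d56f08e740 makes fill_events.parquet reproducible by deriving event_id from a SHA1 over the client, date, role, at and city fields, truncating it to 16 hex characters, and sorting rows by that id before writing. A hedged sketch of the idea is below. The helper names, the UTF-8 encoding, the "|" join and the assumption that all five fields are plain strings are illustrative choices, not taken from the repository.

```python
import hashlib


def event_id(client, date, role, at, city):
    """Return the first 16 hex chars of SHA1 over the pipe-joined fields."""
    key = "|".join([client, date, role, at, city])  # assumed delimiter
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]


def normalize(rows):
    """Attach event_id to each row and sort by it for a stable output order."""
    out = [
        dict(
            row,
            event_id=event_id(
                row["client"], row["date"], row["role"], row["at"], row["city"]
            ),
        )
        for row in rows
    ]
    out.sort(key=lambda r: r["event_id"])
    return out
```

Because the id depends only on row content and the sort key is the id, shuffling the input rows does not change the order of the output, provided the ids do not collide. That property is what lets a re-run produce identical output.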