lakehouse

Author	SHA1	Message	Date
root	6287558493	Push/daemon presence: background digest + /alerts settings page Converts the app from 'dashboard you visit' to 'system that finds you.' Critical for the phone-first staffing shop that won't open a URL — the system reaches out when something matters. Daemon: - Starts once per Bun process (guarded via globalThis sentinel) - Default interval 15 min (configurable, min 1, max 1440) - On each cycle, buildDigest() compares current state against prior snapshot persisted in mcp-server/data/notification_state.json - Events detected: - risk_escalation: role moved to tight or critical (was ok/watch) - deadline_approaching: staffing window falls within warn window (default 7 days) AND deadline date differs from prior - memory_growth: playbook_memory entries grew by >= 5 since last run Channels (all opt-out individually via config): - console: always on, logged to journalctl -u lakehouse-agent - file: always on, appends JSONL to mcp-server/data/notifications.jsonl - webhook: optional, POSTs {text, digest} to configured URL (Slack incoming-webhook / Discord webhook / any custom endpoint) Digest format (human-readable, fits in a Slack message): LAKEHOUSE DIGEST — 2026-04-20 23:24 3 staffing deadlines within window: • Production Worker — 2d to 2026-04-23 · demand 724 • Maintenance Tech — 4d to 2026-04-25 · demand 32 • Electrician — 5d to 2026-04-26 · demand 34 +779 new playbooks (total 779, 2204 endorsed names) snapshot: 0 critical · 0 tight · $275,599,326 pipeline /alerts page: - Current status table (daemon state, interval, webhook, last run) - Config form: enable toggle, interval, deadline warn window, webhook URL + label (saved to data/notification_config.json) - 'Fire a test digest now' button — force a cycle without waiting - Recent digests panel shows the last 10 dispatches with full text End-to-end verified live: - Daemon armed successfully on startup - First-run digest dispatched to console + file in <1s - Events detected correctly: 3 deadlines within 7 
days from real Chicago permit data; 779 playbook entries surfaced as memory growth - Digest text format is Slack-pastable - Dispatch records appear in /alerts recent list TDZ caveat: startAlertsDaemon() invocation moved to end of module so all const/let in the alerts block evaluate before daemon reads them. Previously failed with 'Cannot access X before initialization' when the call lived near the top of the file. Nav added to all 6 pages: Dashboard · Walkthrough · Architecture · Spec · Onboard · Alerts.	2026-04-20 18:24:48 -05:00
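The snapshot comparison described in the daemon commit above can be sketched as follows. This is a minimal illustration, not the actual buildDigest(): the `Snapshot`, `RoleState`, and `DigestEvent` shapes and the `detectEvents` name are hypothetical. It also assumes that a role with no prior record counts as previously "ok", and that a missing prior snapshot means a prior playbook count of 0 (which would be consistent with the first-run "+779 new playbooks" line).

```typescript
// Hypothetical shapes: the commit does not show the real snapshot types.
type Risk = "ok" | "watch" | "tight" | "critical";

interface RoleState {
  role: string;
  risk: Risk;
  deadline: string; // ISO date such as "2026-04-23"
}

interface Snapshot {
  roles: RoleState[];
  playbookCount: number;
}

type DigestEvent =
  | { kind: "risk_escalation"; role: string; from: Risk; to: Risk }
  | { kind: "deadline_approaching"; role: string; daysLeft: number }
  | { kind: "memory_growth"; added: number; total: number };

const DAY_MS = 24 * 60 * 60 * 1000;

const isCalm = (r: Risk): boolean => r === "ok" || r === "watch";

function detectEvents(
  prior: Snapshot | null,
  current: Snapshot,
  now: Date,
  warnDays: number = 7,
  growthThreshold: number = 5,
): DigestEvent[] {
  const events: DigestEvent[] = [];
  const priorByRole = new Map<string, RoleState>();
  for (const r of prior?.roles ?? []) priorByRole.set(r.role, r);

  for (const r of current.roles) {
    const before = priorByRole.get(r.role);
    // Assumption: a role with no prior record is treated as previously "ok".
    const priorRisk: Risk = before?.risk ?? "ok";
    // risk_escalation: was ok/watch, now tight/critical.
    if (isCalm(priorRisk) && !isCalm(r.risk)) {
      events.push({ kind: "risk_escalation", role: r.role, from: priorRisk, to: r.risk });
    }
    // deadline_approaching: inside the warn window AND the date changed.
    const daysLeft = Math.floor((Date.parse(r.deadline) - now.getTime()) / DAY_MS);
    const inWindow = daysLeft >= 0 && daysLeft <= warnDays;
    if (inWindow && before?.deadline !== r.deadline) {
      events.push({ kind: "deadline_approaching", role: r.role, daysLeft });
    }
  }

  // memory_growth: playbook entries grew by at least the threshold.
  const added = current.playbookCount - (prior?.playbookCount ?? 0);
  if (added >= growthThreshold) {
    events.push({ kind: "memory_growth", added, total: current.playbookCount });
  }
  return events;
}
```

Requiring the deadline date to differ from the prior snapshot is what keeps the same deadline from being re-announced on every 15-minute cycle.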
root	23eb04a145	Onboarding wizard — ingest any staffing CSV in 3 steps New /onboard page. Client-facing wizard for getting real data into the system without engineering help. Flow: 1. Drop a CSV (or click 'Use the sample as my data' — ships a 25-row realistic staffing roster under /samples/staffing_roster_sample.csv) 2. Browser parses client-side. Columns auto-typed (text/int/decimal/ date). PII flagged by name hint AND content regex (emails, phones). First rows previewed. Read-only — nothing written yet. 3. Name the dataset (lowercase+underscores). Commit. 4. Post-commit: dataset is live. Shows 4 next steps the operator can take (SQL query, vector index, dashboard search, playbook training). Backend: - /onboard serves onboard.html - /samples/.csv serves CSV files from mcp-server/samples/ with filename validation (only [a-zA-Z0-9_-.]+.csv, prevents path traversal) - /onboard/ingest forwards multipart/form-data to gateway /ingest/file preserving the boundary. The generic /api/ passthrough breaks multipart because it reads as text and forwards as JSON; this route uses arrayBuffer + original Content-Type. Verified end-to-end: upload sample roster (25 rows, 12 columns) → parse in browser → show columns + PII flags + preview → commit → gateway writes Parquet, registers in catalog → immediately queryable: SELECT * FROM onboard_demo2 LIMIT 3 → Sarah Johnson, Forklift Operator, Chicago, IL, 0.92 Round-trip <1 second. Nav updated on all pages to link Onboard. Shipped with a sample CSV so the full flow is demonstrable without real client data. When a real client shows up, same path — they upload their CSV. No engineering ticket, no code change, no schema pre-definition. Security: sample filename regex prevents path traversal. CSV parse is client-side pure JS (no DOM injection). Commit uses existing /ingest/file validation (schema fingerprint, PII server-side, content-hash dedup).	2026-04-20 18:13:56 -05:00
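The client-side column typing, PII flagging, and sample filename check from the onboarding commit above might look roughly like this. It is a sketch under stated assumptions: every function name, the exact regexes, the name-hint list (`email|phone|ssn`), and the 10 to 15 digit phone heuristic are illustrative and do not come from the commit.

```typescript
type ColumnType = "int" | "decimal" | "date" | "text";

// Assumed pattern: no slashes are allowed, so "../" cannot appear in a name.
const SAMPLE_NAME = /^[a-zA-Z0-9_.-]+\.csv$/;

function isSafeSampleName(name: string): boolean {
  return SAMPLE_NAME.test(name);
}

// Pick the narrowest type that every non-empty value satisfies.
function inferType(values: string[]): ColumnType {
  const filled = values.map((v) => v.trim()).filter((v) => v !== "");
  if (filled.length === 0) return "text";
  if (filled.every((v) => /^-?\d+$/.test(v))) return "int";
  if (filled.every((v) => /^-?\d+(\.\d+)?$/.test(v))) return "decimal";
  if (filled.every((v) => /^\d{4}-\d{2}-\d{2}$/.test(v))) return "date";
  return "text";
}

// Hypothetical hint list and content patterns.
const PII_NAME_HINT = /(email|phone|ssn)/i;
const EMAIL = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function looksLikePhone(v: string): boolean {
  if (!/^\+?[\d\s().-]+$/.test(v)) return false;
  const digits = v.replace(/\D/g, "").length;
  // 10 to 15 digits keeps ISO dates (8 digits) from being flagged.
  return digits >= 10 && digits <= 15;
}

// Flag by column-name hint OR by what the cell contents look like.
function flagsPii(columnName: string, values: string[]): boolean {
  if (PII_NAME_HINT.test(columnName)) return true;
  return values.some((v) => EMAIL.test(v.trim()) || looksLikePhone(v.trim()));
}
```

Checking content as well as the column name catches columns with uninformative headers, such as a `contact` column that holds email addresses.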
root	468798c9ac	/spec: technical specification — 11-chapter README-equivalent J's ask: explain the full architecture so someone reading a README can dispute it or recreate it. The repo isn't public yet; this page IS the spec until it is. Ch1 Repository layout — 13 crates + tests/multi-agent + docs + data, with owned responsibility and file path per crate. Ch2 Data ingest pipeline (8 steps) — sources (file/inbox/DB/cron), parse+normalize with ADR-010 conservative typing, PII auto-tag, dedup, Parquet write, catalog register with fingerprint gate, mark embeddings stale, queryable immediately. Ch3 Measurement & indexing — row count / fingerprint / owner / sensitivity / freshness / lineage per dataset. HNSW vs Lance tradeoff table with measured numbers (ADR-019). Autotune loop. Per-profile scoping (Phase 17). Ch4 Contract inference from external signal — Chicago permit feed → role mapping → worker count heuristic → timeline → hybrid search with boost → pattern discovery → rendered card. All pre-computed before staffer opens UI. Ch5 What a CRM can't do — 11-row comparison table of capabilities. Ch6 How it gets better over time — three paths: - Phase 19 playbook boost (full math) - Pattern discovery meta-index - Autotune agent Ch7 Scale story: 20 staffers, 300 contracts, midday +20/+1M surge - Async gateway + per-staffer profile isolation + client blacklists - 7-step surge handling flow (ingest, stale-mark, incremental refresh, degradation, hot-swap, autotune re-enter) - Known pain points: Ollama inference serial, RAM ceiling ~5M on HNSW (mitigated by Lance), VRAM 1-2 models sequential, playbook_memory unbounded. Ch8 Error surfaces & recovery — 10-row table covering ingest schema conflicts, bucket failures, ghost names, dual-agent drift, empty searches, Ollama down, gateway restart, schema fingerprint divergence. Every failure has a named surface and recovery path. Ch9 Per-staffer context — active profile, workspace, client blacklist, audit trail, daily summary. 
How 20 staffers don't see the same UI. Ch10 Day in the life — 07:00 housekeeping → 07:30 refresh → 08:00 staffer opens → 08:15 drill down → 08:30 Call click → 09:00 second staffer shares memory → 12:30 surge → 14:00 no-show → 15:00 new embeddings live → 17:00 retrospective → 22:00 overnight trials. Ch11 Known limits & non-goals — deferred (rate/margin, push, confidence calibration, neural re-ranker, pm compaction, call_log cross-ref) and explicitly out-of-scope (cloud, ACID, streaming, CRM replace, proprietary formats, hard multi-tenant). Also: nav updated on /dashboard, /console, /proof to link /spec. Every architectural claim in the spec cites either a code path, an ADR number, or a phase reference so someone skeptical can target the specific artifact.	2026-04-20 17:56:18 -05:00
root	76bfa2c8d7	/proof: explain the dual-agent recursive architecture with citations Previous page was numeric claims without explanations — 'sub-100ms SQL', '500K vectors in 341ms' etc. Accurate but undefendable without math, code paths, and ADR references. Expanded to 8 chapters: Ch1 — Live receipts (unchanged: real gateway tests, pass/fail, timing) Ch2 — Architecture. 13-crate diagram with per-crate responsibility table and file paths. gateway → catalogd/queryd/vectord/ingestd + aibridge → object_store. References ADRs 1-20. Ch3 — Dual-agent recursive consensus loop (NEW) - Role specialization (executor=optimist, reviewer=pessimist) - Parallel orchestration via Promise.all - Recursive: sealed playbooks feed playbook_memory → next query - Termination math: sealed | tool-error abort | drift abort | turn-cap abort — every path dumps forensic log - File refs: tests/multi-agent/agent.ts, orchestrator.ts, scenario.ts, run_e2e_rated.ts Ch4 — Playbook memory feedback loop (NEW) - PlaybookEntry shape with embedding - Full boost math: similarity * base_weight * decay * penalty / n_workers, capped at MAX_BOOST_PER_WORKER - Temporal decay (e^-age/30, 30d half-life) - Negative signal (0.5^failures) - Why k=200: narrow cosine discrimination in nomic-embed-text - Evidence: compounding test 0 → 0.250 cap in 3 seeds - persist_sql write-through - Pattern discovery (Path 2 meta-index) - File: crates/vectord/src/playbook_memory.rs Ch5 — ADR citations for each key choice ADR-001, 008, 012, 015, 019, 020 + Phase 19 design note Ch6 — Live scale data (unchanged: pulled from /proof.json) Ch7 — Reproduction recipes: curl for health, sql, hybrid with boost, patterns, pm stats, and the full dual-agent scenario run Ch8 — Honest limits (unchanged: synthetic workers_500k, 1K candidates misaligned to call_log, 7B model imperfection, no rate/margin) Every architectural claim now cites either the code path (crates/.../src/file.rs::fn_name) or the ADR (docs/DECISIONS.md). 
Someone disputing the system has specific targets to attack. Mechanism unchanged: /proof serves mcp-server/proof.html via Bun.file. /proof.json still returns the live test data the page consumes client-side.	2026-04-20 17:49:08 -05:00
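The boost formula quoted in Ch4 of the commit above can be written out as a short sketch. The real implementation lives in crates/vectord/src/playbook_memory.rs and is not shown here, so several things below are assumptions: the `PlaybookHit` shape and function names are hypothetical, the cap value 0.25 is inferred from the "0 → 0.250 cap" evidence line, and the cap is applied to the per-worker sum across playbooks. The decay is taken literally as e^(-age/30); read that way, the weight halves after about 20.8 days (30 · ln 2), so 30 days acts as the time constant.

```typescript
// Assumed value, inferred from the "0 → 0.250 cap" evidence line.
const MAX_BOOST_PER_WORKER = 0.25;
const DECAY_DAYS = 30;

// Hypothetical shape for one playbook that endorses a given worker.
interface PlaybookHit {
  similarity: number; // cosine similarity between query and playbook
  baseWeight: number;
  ageDays: number;
  failures: number; // negative signals recorded against the playbook
  nWorkers: number; // workers sharing this playbook's credit
}

// similarity * base_weight * decay * penalty / n_workers
function contribution(hit: PlaybookHit): number {
  const decay = Math.exp(-hit.ageDays / DECAY_DAYS);
  const penalty = Math.pow(0.5, hit.failures);
  return (hit.similarity * hit.baseWeight * decay * penalty) / hit.nWorkers;
}

// Assumption: contributions accumulate per worker and the sum is capped.
function workerBoost(hits: PlaybookHit[]): number {
  const total = hits.reduce((sum, h) => sum + contribution(h), 0);
  return Math.min(total, MAX_BOOST_PER_WORKER);
}
```

Under these assumptions, three fresh playbooks each contributing 0.2 would sum to 0.6 and be clamped to 0.25, which is one way a compounding test could reach the cap within a few seeds.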

4 Commits