lakehouse

Author	SHA1	Message	Date
root	3bc82833ac	Update PRD + PHASES.md — reflect 8-commit 2026-04-17 push PRD status line: "Phases 0-18 shipped; hybrid operational; scheduled ingest live; PDF OCR live; entering horizon items." PHASES.md: federation L2 items marked complete, Phase 16.2 (autotune agent), Phase 17 VRAM gate, MySQL connector, Phase 18 (hybrid Lance), scheduled ingest, PDF OCR all documented with dates and measurements. Stats updated: 52+ unit tests, 13 crates, 19 ADRs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 20:54:05 -05:00
root	76f6fba5de	Phase B: Lance pilot — hybrid decision with measured benchmark Standalone benchmark crate `crates/lance-bench` running Lance 4.0 against our Parquet+HNSW at 100K × 768d (resumes_100k_v2) measured 8 dimensions. Results (see docs/ADR-019-vector-storage.md for full scorecard): Cold load: Parquet 0.17s vs Lance 0.13s (tie — not ≥2× threshold) Disk size: 330.3 MB vs 330.4 MB (tie) Search p50: 873us vs 2229us (Parquet 2.55× faster) Search p95: 1413us vs 4998us (Parquet 3.54× faster) Index build: 230s (ec=80) vs 16s (IVF_PQ) (Lance 14× faster) Random access: 35ms (scan) vs 311us (Lance 112× faster) Append 10K rows: full rewrite vs 0.08s/+31MB (Lance structural win) Decision (ADR-019): hybrid, not migrate-or-reject. - Parquet+HNSW stays primary — our HNSW at ec=80 es=30 recall=1.00 is 2.55× faster than Lance IVF_PQ at 100K in-RAM scale - Lance joins as second backend per-profile for workloads where it wins architecturally: random row access (RAG text fetch), append-heavy pipelines (Phase C), hot-swap generations (Phase 16, 14× faster builds), and indexes past the ~5M RAM ceiling - Phase 17 ModelProfile gets vector_backend: Parquet \| Lance field - Ceiling table in PRD updated — 5M ceiling now says "switch to Lance" instead of "migrate" since Lance runs alongside, not instead of Isolation: lance-bench is a standalone workspace crate with its own dep tree (Lance pulls DataFusion 52 + Arrow 57 incompatible with main stack DataFusion 47 + Arrow 55). Kept off the critical path until API is stable enough to promote into vectord::lance_store. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 02:37:11 -05:00
root	dbe00d018f	Federation foundation + HNSW trial system + Postgres streaming + PRD reframe Four shipped features and a PRD realignment, all measured end-to-end: HNSW trial system (Phase 15 horizon item → complete) - vectord: EmbeddingCache, harness (eval sets + brute-force ground truth), TrialJournal, parameterized HnswConfig on build_index_with_config - /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best, /hnsw/evals/{name}/autogen, /hnsw/cache/stats - Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us at 100% recall@10. ec=80 es=30 locked as HnswConfig::default() - Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s, 80/30 = 1.00 recall in 230s Catalog manifest repair - catalogd: resync_from_parquet reads parquet footers to restore row_count and columns on drifted manifests - POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing - All 7 staffing tables recovered to PRD-matching 2,469,278 rows Federation foundation (ADR-017) - shared::secrets: SecretsProvider trait + FileSecretsProvider (reads /etc/lakehouse/secrets.toml, enforces 0600 perms) - storaged::registry::BucketRegistry — multi-bucket resolution with rescue_bucket read fallback and reachability probing - storaged::error_journal — bucket op failures visible in one HTTP call - storaged::append_log — write-once batched append pattern (fixes the RMW anti-pattern llms3.com calls out; errors and trial journals both use it) - /storage/buckets, /storage/errors, /storage/bucket-health, /storage/errors/{flush,compact} - Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with X-Lakehouse-Rescue-Used observability headers on fallback Postgres streaming ingest - ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination into ArrowWriter, lineage redacts password - POST /ingest/db — verified against live knowledge_base.team_runs (586 rows × 13 cols, 6 batches, 196ms end-to-end) PRD realignment (2026-04-16) - Dual use case: staffing analytics + local LLM knowledge substrate - Removed "multi-tenancy (single-owner system)" from non-goals - Added invariants 8-11: indexes hot-swappable, per-reader profiles, trials-as-data, operational failures findable in one HTTP call - New phases 16 (hot-swap generations), 17 (model profiles + dataset bindings), 18 (Lance vs Parquet+sidecar evaluation) - Known ceilings table documents the 5M vector wall and escape hatches - ADR-017 (federation), ADR-018 (append-log pattern) added - EXECUTION_PLAN.md sequences phases B-E with success gates and decision rules Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 01:50:05 -05:00
root	354c9c4a04	PRD v3: future-proofing roadmap — event journal, rich catalog, tool registry Phases 9-15 designed based on "future regret" analysis: - Phase 9: Event journal (append-only mutation history — can't retrofit) - Phase 10: Rich catalog v2 (ownership, sensitivity, lineage, freshness) - Phase 11: Embedding versioning (model-proof vector layer) - Phase 12: Tool registry (governed agent actions via MCP) - Phase 13: Security & access control (field-level, row-level, audit) - Phase 14: Schema evolution with AI migration rules - Phase 15+: Federated query, DB connectors, OCR, fine-tuned models 8 design principles: store truth openly, describe richly, never destroy evidence, secure centrally, expose through tools, version everything, unstructured first-class, separate storage/compute/intelligence. ADR-012 through ADR-016 documenting key future-proofing decisions. Updated benchmarks: 2.47M rows, hot cache 9.8x speedup. Updated operating rules: cheap-now/expensive-later built first. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:57:29 -05:00
root	6740a017c7	PRD v2: production roadmap with ingest, vector search, hot cache phases - Phase 6: Ingest pipeline (CSV/JSON → schema detect → Parquet → catalog) - Phase 7: Vector index + RAG (embed → HNSW → semantic search → LLM answer) - Phase 8: Hot cache + incremental updates (MemTable, delta files, merge-on-read) - ADR-008 through ADR-011: embeddings as Parquet, delta files not Delta Lake, schema defaults to string, not a CRM replacement - Staffing company reference dataset (286K rows, 7 tables) - Honest risk assessment: vector search at scale and incremental updates are hard Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 07:54:24 -05:00
root	a52ca841c6	Phase 0: bootstrap Rust workspace - Cargo workspace with 6 crates: shared, storaged, catalogd, queryd, aibridge, gateway - shared: types (DatasetId, ObjectRef, SchemaFingerprint, DatasetManifest) + error enum - gateway: Axum HTTP entrypoint with nested service routers + tracing - All services expose /health stubs - justfile with build/test/run recipes - PRD, phase tracker, and ADR docs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 04:59:05 -05:00

6 Commits