lakehouse/docs/PHASES.md
root 8282842eaf Sync memory + phases: all 15 phases marked complete
PHASES.md and project memory updated to reflect actual build state.
Phases 11-14 were built but trackers weren't updated.

Final stats: 11 crates, 30 tests, 16 ADRs, 2.47M rows, 100K vectors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 19:34:15 -05:00

5.3 KiB

Phase Tracker

Phase 0: Bootstrap

  • Cargo workspace with all crate stubs compiling
  • shared crate: error types, ObjectRef, DatasetId
  • gateway with Axum: GET /health → 200
  • tracing + tracing-subscriber wired in gateway
  • justfile with build, test, run recipes
  • docs committed to git

Phase 1: Storage + Catalog

  • storaged: object_store backend init (LocalFileSystem)
  • storaged: Axum endpoints (PUT/GET/DELETE/LIST)
  • shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
  • catalogd/registry.rs: in-memory index + manifest persistence
  • catalogd service: POST/GET /datasets + by-name
  • gateway routes wired

Phase 2: Query Engine

  • queryd: SessionContext + object_store config
  • queryd: ListingTable from catalog ObjectRefs
  • queryd service: POST /query/sql → JSON
  • queryd → catalogd wiring
  • gateway routes /query

Phase 3: AI Integration

  • Python sidecar: FastAPI + Ollama (embed/generate/rerank)
  • Dockerfile for sidecar
  • aibridge/client.rs: HTTP client
  • aibridge service: Axum proxy endpoints
  • Model config via env vars

Phase 4: Frontend

  • Dioxus scaffold, WASM build
  • Ask tab: natural language → AI SQL → results
  • Explore tab: dataset browser + AI summary
  • SQL tab: raw DataFusion editor
  • System tab: health checks for all services

Phase 5: Hardening

  • Proto definitions (lakehouse.proto)
  • Internal gRPC: CatalogService on :3101
  • OpenTelemetry tracing: stdout exporter
  • Auth middleware: X-API-Key (toggleable)
  • Config-driven startup: lakehouse.toml

Phase 6: Ingest Pipeline

  • CSV ingest with auto schema detection
  • JSON ingest (array + newline-delimited, nested flattening)
  • PDF text extraction (lopdf)
  • Text/SMS file ingest
  • Content hash dedup (SHA-256)
  • POST /ingest/file multipart upload
  • 12 unit tests

Phase 7: Vector Index + RAG

  • chunker: configurable size + overlap, sentence-boundary aware
  • store: embeddings as Parquet (binary f32 vectors)
  • search: brute-force cosine similarity
  • rag: embed → search → retrieve → LLM answer with citations
  • POST /vectors/index, /search, /rag
  • Background job system with progress tracking
  • Dual-pipeline supervisor with checkpointing + retry
  • 100K embeddings: 177/sec on A4000, zero failures
  • 6 unit tests

Phase 8: Hot Cache + Incremental Updates

  • MemTable hot cache: LRU, configurable max (16GB)
  • POST /query/cache/pin, /cache/evict, GET /cache/stats
  • Delta store: append-only delta Parquet files
  • Merge-on-read: queries combine base + deltas
  • Compaction: POST /query/compact
  • Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)

Phase 8.5: Agent Workspaces

  • WorkspaceManager with daily/weekly/monthly/pinned tiers
  • Saved searches, shortlists, activity logs per workspace
  • Instant zero-copy handoff between agents
  • Persistence to object storage, rebuild on startup

Phase 9: Event Journal

  • journald crate: append-only mutation log
  • Event schema: entity, field, old/new value, actor, source, workspace
  • In-memory buffer with auto-flush to Parquet
  • GET /journal/history/{entity_id}, /recent, /stats
  • POST /journal/event, /update, /flush

Phase 10: Rich Catalog v2

  • DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
  • PII auto-detection: email, phone, SSN, salary, address, medical
  • Column-level metadata with sensitivity flags
  • Lineage tracking: source_system → ingest_job → dataset
  • PATCH /catalog/datasets/by-name/{name}/metadata
  • Backward compatible (serde default)

Phase 11: Embedding Versioning

  • IndexRegistry: model_name, model_version, dimensions per index
  • Index metadata persisted as JSON, rebuilt on startup
  • GET /vectors/indexes — list all (filter by source/model)
  • GET /vectors/indexes/{name} — metadata
  • Background jobs auto-register metadata on completion

Phase 12: Tool Registry

  • 6 built-in staffing tools (search_candidates, get_candidate, revenue_by_client, recruiter_performance, cold_leads, open_jobs)
  • Parameter validation + SQL template substitution
  • Permission levels: read / write / admin
  • Full audit trail per invocation
  • GET /tools, GET /tools/{name}, POST /tools/{name}/call, GET /tools/audit

Phase 13: Security & Access Control

  • Role-based access: admin, recruiter, analyst, agent
  • Field-level sensitivity enforcement
  • Column masking determination per agent
  • Query audit logging
  • GET/POST /access/roles, GET /access/audit, POST /access/check

Phase 14: Schema Evolution

  • Schema diff detection (added, removed, type changed, renamed)
  • Fuzzy rename detection (shared word parts)
  • Auto-generated migration rules with confidence scores
  • AI migration prompt builder for complex cases
  • 5 unit tests

Phase 15+: Horizon

  • HNSW vector index (100K search: 4.5s → <50ms)
  • Federated multi-bucket query
  • Database connector ingest (Postgres/MySQL)
  • PDF OCR (Tesseract)
  • Scheduled ingest (cron)
  • Fine-tuned domain models
  • Multi-node query distribution

30 unit tests | 11 crates | 16 ADRs | 2.47M rows | 100K vectors | All built 2026-03-27