# Phase Tracker

## Phase 0: Bootstrap ✅
- [x] Cargo workspace with all crate stubs compiling
- [x] `shared` crate: error types, ObjectRef, DatasetId
- [x] `gateway` with Axum: GET /health → 200
- [x] tracing + tracing-subscriber wired in gateway
- [x] justfile with build, test, run recipes
- [x] docs committed to git

## Phase 1: Storage + Catalog ✅
- [x] storaged: object_store backend init (LocalFileSystem)
- [x] storaged: Axum endpoints (PUT/GET/DELETE/LIST)
- [x] shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
- [x] catalogd/registry.rs: in-memory index + manifest persistence
- [x] catalogd service: POST/GET /datasets + by-name
- [x] gateway routes wired

## Phase 2: Query Engine ✅
- [x] queryd: SessionContext + object_store config
- [x] queryd: ListingTable from catalog ObjectRefs
- [x] queryd service: POST /query/sql → JSON
- [x] queryd → catalogd wiring
- [x] gateway routes /query

## Phase 3: AI Integration ✅
- [x] Python sidecar: FastAPI + Ollama (embed/generate/rerank)
- [x] Dockerfile for sidecar
- [x] aibridge/client.rs: HTTP client
- [x] aibridge service: Axum proxy endpoints
- [x] Model config via env vars

## Phase 4: Frontend ✅
- [x] Dioxus scaffold, WASM build
- [x] Ask tab: natural language → AI SQL → results
- [x] Explore tab: dataset browser + AI summary
- [x] SQL tab: raw DataFusion editor
- [x] System tab: health checks for all services

## Phase 5: Hardening ✅
- [x] Proto definitions (lakehouse.proto)
- [x] Internal gRPC: CatalogService on :3101
- [x] OpenTelemetry tracing: stdout exporter
- [x] Auth middleware: X-API-Key (toggleable)
- [x] Config-driven startup: lakehouse.toml

## Phase 6: Ingest Pipeline ✅
- [x] CSV ingest with auto schema detection
- [x] JSON ingest (array + newline-delimited, nested flattening)
- [x] PDF text extraction (lopdf)
- [x] Text/SMS file ingest
- [x] Content hash dedup (SHA-256)
- [x] POST /ingest/file multipart upload
- [x] 12 unit tests

## Phase 7: Vector Index + RAG ✅
- [x] chunker: configurable size + overlap, sentence-boundary aware
- [x] store: embeddings as Parquet (binary f32 vectors)
- [x] search: brute-force cosine similarity
- [x] rag: embed → search → retrieve → LLM answer with citations
- [x] POST /vectors/index, /search, /rag
- [x] Background job system with progress tracking
- [x] Dual-pipeline supervisor with checkpointing + retry
- [x] 6 unit tests

## Phase 8: Hot Cache + Incremental Updates ✅
- [x] MemTable hot cache: LRU, configurable max (16GB)
- [x] POST /query/cache/pin, /cache/evict, GET /cache/stats
- [x] Delta store: append-only delta Parquet files
- [x] Merge-on-read: queries combine base + deltas
- [x] Compaction: POST /query/compact
- [x] Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)

## Phase 8.5: Agent Workspaces ✅
- [x] WorkspaceManager with daily/weekly/monthly/pinned tiers
- [x] Saved searches, shortlists, activity logs per workspace
- [x] Instant zero-copy handoff between agents
- [x] Persistence to object storage, rebuild on startup

## Phase 9: Event Journal ✅
- [x] journald crate: append-only mutation log
- [x] Event schema: entity, field, old/new value, actor, source, workspace
- [x] In-memory buffer with auto-flush to Parquet
- [x] GET /journal/history/{entity_id}, /recent, /stats
- [x] POST /journal/event, /update, /flush

## Phase 10: Rich Catalog v2 ✅
- [x] DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
- [x] PII auto-detection: email, phone, SSN, salary, address, medical
- [x] Column-level metadata with sensitivity flags
- [x] Lineage tracking: source_system → ingest_job → dataset
- [x] PATCH /catalog/datasets/by-name/{name}/metadata
- [x] Backward compatible (serde default)
- [x] 25 unit tests total

## Phase 11: Embedding Versioning ⬜
- [ ] Vector index metadata: model_name, model_version, dimensions
- [ ] Multi-version indexes coexist
- [ ] Incremental re-embed on model upgrade
- [ ] A/B search comparison

## Phase 12: Tool Registry ⬜
- [ ] Named business actions with parameter validation
- [ ] Read vs write tool permissions
- [ ] Audit logging per tool invocation
- [ ] MCP-compatible interface
- [ ] Rate limiting per agent/tool

## Phase 13: Security & Access Control ⬜
- [ ] Field-level sensitivity enforcement
- [ ] Row-level access policies
- [ ] Column masking
- [ ] Query audit log
- [ ] Policy-as-code (TOML/YAML)

## Phase 14: Schema Evolution ⬜
- [ ] Schema diff detection
- [ ] AI-generated migration rules
- [ ] Migration preview before apply
- [ ] Versioned schemas in catalog

## Phase 15+: Horizon ⬜
- [ ] Federated multi-bucket query
- [ ] Database connector ingest (Postgres/MySQL)
- [ ] PDF OCR (Tesseract)
- [ ] Scheduled ingest (cron)
- [ ] Fine-tuned domain models
- [ ] Multi-node query distribution
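
For reference, the Phase 7 item "search: brute-force cosine similarity" can be illustrated with a minimal, std-only Rust sketch. This is not the project's actual code: the function names `cosine_similarity` and `top_k`, and the `(String, Vec<f32>)` corpus layout, are assumptions made for illustration only. The tracker says only that embeddings are stored as f32 vectors in Parquet.

```rust
/// Cosine similarity between two vectors.
/// Returns None when the lengths differ, the vectors are empty,
/// or either vector has zero magnitude (the ratio is undefined).
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Brute-force top-k search: score every stored vector against the query,
/// sort by descending similarity, and keep the first k results.
/// Vectors that cannot be scored (e.g. dimension mismatch) are skipped.
/// `corpus` is a hypothetical in-memory list of (chunk id, embedding) pairs.
fn top_k(query: &[f32], corpus: &[(String, Vec<f32>)], k: usize) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = corpus
        .iter()
        .filter_map(|(id, v)| cosine_similarity(query, v).map(|s| (id.clone(), s)))
        .collect();
    // total_cmp gives a total order over f32, so the sort cannot panic on NaN.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

Every query scans the whole corpus, so cost grows linearly with the number of stored chunks. That is the usual trade-off of a brute-force search: nothing to build or maintain, at the price of per-query work.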