diff --git a/docs/PHASES.md b/docs/PHASES.md
index b6c8f47..4fc4ce2 100644
--- a/docs/PHASES.md
+++ b/docs/PHASES.md
@@ -1,58 +1,128 @@
 # Phase Tracker
-## Phase 0: Bootstrap
-- [x] 0.1 — Cargo workspace with all crate stubs compiling
-- [x] 0.2 — `shared` crate: error types, ObjectRef, DatasetId
-- [x] 0.3 — `gateway` with Axum: GET /health → 200
-- [x] 0.4 — tracing + tracing-subscriber wired in gateway
-- [x] 0.5 — justfile with build, test, run recipes
-- [x] 0.6 — docs committed to git
+## Phase 0: Bootstrap ✅
+- [x] Cargo workspace with all crate stubs compiling
+- [x] `shared` crate: error types, ObjectRef, DatasetId
+- [x] `gateway` with Axum: GET /health → 200
+- [x] tracing + tracing-subscriber wired in gateway
+- [x] justfile with build, test, run recipes
+- [x] docs committed to git
-**Gate: PASSED** — All crates compile. Gateway runs. Logs emit. Docs committed.
+## Phase 1: Storage + Catalog ✅
+- [x] storaged: object_store backend init (LocalFileSystem)
+- [x] storaged: Axum endpoints (PUT/GET/DELETE/LIST)
+- [x] shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
+- [x] catalogd/registry.rs: in-memory index + manifest persistence
+- [x] catalogd service: POST/GET /datasets + by-name
+- [x] gateway routes wired
-## Phase 1: Storage + Catalog
-- [x] 1.1 — storaged: object_store backend init (LocalFileSystem)
-- [x] 1.2 — storaged: Axum endpoints (PUT/GET/DELETE/LIST /objects/{key})
-- [x] 1.3 — shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
-- [x] 1.4 — catalogd/registry.rs: in-memory index + manifest persistence to object storage
-- [x] 1.5 — catalogd/schema.rs: schema fingerprinting (merged into shared/arrow_helpers.rs)
-- [x] 1.6 — catalogd service: POST/GET /datasets + GET /datasets/by-name/{name}
-- [x] 1.7 — gateway routes to storaged + catalogd with shared state
+## Phase 2: Query Engine ✅
+- [x] queryd: SessionContext + object_store config
+- [x] queryd: ListingTable from catalog ObjectRefs
+- [x] queryd service: POST /query/sql → JSON
+- [x] queryd → catalogd wiring
+- [x] gateway routes /query
-**Gate: PASSED** — PUT object → register dataset → list → get by name. All via gateway HTTP.
+## Phase 3: AI Integration ✅
+- [x] Python sidecar: FastAPI + Ollama (embed/generate/rerank)
+- [x] Dockerfile for sidecar
+- [x] aibridge/client.rs: HTTP client
+- [x] aibridge service: Axum proxy endpoints
+- [x] Model config via env vars
-## Phase 2: Query Engine
-- [x] 2.1 — queryd: SessionContext + object_store config (custom scheme to avoid path doubling)
-- [x] 2.2 — queryd: ListingTable from catalog ObjectRefs with schema inference
-- [x] 2.3 — queryd service: POST /query/sql → JSON (columns + rows + row_count)
-- [x] 2.4 — queryd → catalogd wiring (reads dataset list, registers as tables)
-- [x] 2.5 — gateway routes /query with QueryEngine state
+## Phase 4: Frontend ✅
+- [x] Dioxus scaffold, WASM build
+- [x] Ask tab: natural language → AI SQL → results
+- [x] Explore tab: dataset browser + AI summary
+- [x] SQL tab: raw DataFusion editor
+- [x] System tab: health checks for all services
-**Gate: PASSED** — SELECT *, WHERE/ORDER BY, COUNT/AVG all return correct results via catalog.
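The schema fingerprinting merged into shared/arrow_helpers.rs (old item 1.5 above) can be sketched std-only. This is an illustrative stand-in, not the crate's real code: the actual helper fingerprints Arrow schemas and would use a stable hash, whereas here plain `(name, type)` string pairs and `DefaultHasher` keep the example self-contained.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash the ordered (name, type) pairs so any rename, reorder,
/// or type change flips the fingerprint. Illustrative only.
fn schema_fingerprint(fields: &[(&str, &str)]) -> u64 {
    let mut h = DefaultHasher::new();
    for (name, dtype) in fields {
        name.hash(&mut h);
        dtype.hash(&mut h);
    }
    h.finish()
}

fn main() {
    let a = schema_fingerprint(&[("id", "Int64"), ("name", "Utf8")]);
    let b = schema_fingerprint(&[("id", "Int64"), ("name", "Utf8")]);
    let c = schema_fingerprint(&[("id", "Int32"), ("name", "Utf8")]);
    assert_eq!(a, b); // identical schemas agree
    assert_ne!(a, c); // a single type change is detected
    println!("fingerprint: {a:x}");
}
```

A production fingerprint would hash canonical Arrow field metadata with a cryptographic digest so values stay stable across processes and releases.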
+## Phase 5: Hardening ✅
+- [x] Proto definitions (lakehouse.proto)
+- [x] Internal gRPC: CatalogService on :3101
+- [x] OpenTelemetry tracing: stdout exporter
+- [x] Auth middleware: X-API-Key (toggleable)
+- [x] Config-driven startup: lakehouse.toml
-## Phase 3: AI Integration
-- [x] 3.1 — Python sidecar: FastAPI + Ollama (embed/generate/rerank) — real models, no mocks
-- [x] 3.2 — Dockerfile for sidecar
-- [x] 3.3 — aibridge/client.rs: reqwest HTTP client with 120s timeout
-- [x] 3.4 — aibridge service: Axum proxy endpoints (POST /ai/embed, /ai/generate, /ai/rerank)
-- [x] 3.5 — Model config via env vars (EMBED_MODEL, GEN_MODEL, RERANK_MODEL, SIDECAR_URL)
+## Phase 6: Ingest Pipeline ✅
+- [x] CSV ingest with auto schema detection
+- [x] JSON ingest (array + newline-delimited, nested flattening)
+- [x] PDF text extraction (lopdf)
+- [x] Text/SMS file ingest
+- [x] Content hash dedup (SHA-256)
+- [x] POST /ingest/file multipart upload
+- [x] 12 unit tests
-**Gate: PASSED** — Gateway → aibridge → sidecar → Ollama → real 768d embeddings + generation.
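Phase 6's content-hash dedup can be sketched in a few lines. Hedged heavily: the real pipeline hashes file bytes with SHA-256 (a crate dependency), so std's `DefaultHasher` stands in here purely to keep the sketch self-contained, and the `Dedup`/`admit` names are hypothetical.

```rust
use std::collections::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the pipeline's SHA-256 content hash.
fn content_hash(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

struct Dedup {
    seen: HashSet<u64>,
}

impl Dedup {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true if the payload is new and should be ingested;
    /// a re-uploaded file hashes to the same value and is skipped.
    fn admit(&mut self, bytes: &[u8]) -> bool {
        self.seen.insert(content_hash(bytes))
    }
}

fn main() {
    let mut d = Dedup::new();
    assert!(d.admit(b"row-1,alice"));
    assert!(!d.admit(b"row-1,alice")); // duplicate upload rejected
    assert!(d.admit(b"row-2,bob"));
}
```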
+## Phase 7: Vector Index + RAG ✅
+- [x] chunker: configurable size + overlap, sentence-boundary aware
+- [x] store: embeddings as Parquet (binary f32 vectors)
+- [x] search: brute-force cosine similarity
+- [x] rag: embed → search → retrieve → LLM answer with citations
+- [x] POST /vectors/index, /search, /rag
+- [x] Background job system with progress tracking
+- [x] Dual-pipeline supervisor with checkpointing + retry
+- [x] 6 unit tests
-## Phase 4: Frontend
-- [x] 4.1 — Dioxus scaffold, WASM build (dx build --platform web)
-- [x] 4.2 — Dataset browser (sidebar, click to select, refresh)
-- [x] 4.3 — Query editor + results table (Ctrl+Enter to run, column types, row count)
-- [x] 4.4 — Error display + loading states
-- [x] 4.5 — Nginx proxy (lakehouse.devop.live), same-origin API detection
+## Phase 8: Hot Cache + Incremental Updates ✅
+- [x] MemTable hot cache: LRU, configurable max (16GB)
+- [x] POST /query/cache/pin, /cache/evict, GET /cache/stats
+- [x] Delta store: append-only delta Parquet files
+- [x] Merge-on-read: queries combine base + deltas
+- [x] Compaction: POST /query/compact
+- [x] Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)
-**Gate: PASSED** — Browse datasets and query from browser at lakehouse.devop.live.
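The brute-force cosine search listed under Phase 7's `search` module is small enough to sketch in full. Function names here are illustrative, not the crate's API: score every stored vector against the query, sort descending, keep the top k.

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Brute force: rank every (chunk_id, embedding) pair, keep the best k.
fn top_k(query: &[f32], index: &[(usize, Vec<f32>)], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = index
        .iter()
        .map(|(id, v)| (*id, cosine(query, v)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored
}

fn main() {
    let index = vec![
        (0, vec![1.0, 0.0]),
        (1, vec![0.0, 1.0]),
        (2, vec![0.7, 0.7]),
    ];
    let hits = top_k(&[1.0, 0.0], &index, 2);
    assert_eq!(hits[0].0, 0); // exact match ranks first
    println!("{hits:?}");
}
```

Brute force is O(n·d) per query, which is fine at this scale; an ANN index only becomes worth the complexity once the vector count grows well past what a linear scan handles in acceptable latency.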
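Phase 8's merge-on-read can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the delta store's real code: rows are reduced to `(id, value)` pairs, the base Parquet file and the delta files are plain slices, and a delta row with a matching id overrides the base row.

```rust
use std::collections::HashMap;

/// Overlay delta rows on base rows by key; later deltas win.
/// Readers see updates without the base file being rewritten.
fn merge_on_read(base: &[(u64, &str)], deltas: &[(u64, &str)]) -> Vec<(u64, String)> {
    let mut merged: HashMap<u64, String> =
        base.iter().map(|(k, v)| (*k, v.to_string())).collect();
    for (k, v) in deltas {
        merged.insert(*k, v.to_string()); // delta overrides base
    }
    let mut rows: Vec<_> = merged.into_iter().collect();
    rows.sort_by_key(|(k, _)| *k);
    rows
}

fn main() {
    let base = [(1, "alice"), (2, "bob")];
    let deltas = [(2, "bobby"), (3, "carol")];
    let rows = merge_on_read(&base, &deltas);
    assert_eq!(
        rows,
        vec![(1, "alice".into()), (2, "bobby".into()), (3, "carol".into())]
    );
}
```

Compaction (POST /query/compact) is the inverse operation: fold the accumulated deltas into a new base file so reads stop paying the merge cost.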
+## Phase 8.5: Agent Workspaces ✅
+- [x] WorkspaceManager with daily/weekly/monthly/pinned tiers
+- [x] Saved searches, shortlists, activity logs per workspace
+- [x] Instant zero-copy handoff between agents
+- [x] Persistence to object storage, rebuild on startup
-## Phase 5: Hardening
-- [x] 5.1 — Proto definitions (lakehouse.proto: CatalogService, QueryService, StorageService, AiService)
-- [x] 5.2 — Internal gRPC: CatalogService on :3101, proto crate with tonic codegen
-- [x] 5.3 — OpenTelemetry tracing: stdout exporter, configurable via lakehouse.toml
-- [x] 5.4 — Auth middleware: X-API-Key header check, toggleable via config
-- [x] 5.5 — Config-driven startup: lakehouse.toml (gateway, storage, catalog, sidecar, ai, auth, observability)
+## Phase 9: Event Journal ✅
+- [x] journald crate: append-only mutation log
+- [x] Event schema: entity, field, old/new value, actor, source, workspace
+- [x] In-memory buffer with auto-flush to Parquet
+- [x] GET /journal/history/{entity_id}, /recent, /stats
+- [x] POST /journal/event, /update, /flush
-**Gate: PASSED** — gRPC on :3101, OTel traces, auth ready, system starts from repo + lakehouse.toml.
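The shape of Phase 9's append-only mutation log can be sketched std-only. Field names follow the event schema listed above; everything else (the `Journal` type, the buffering and flush-to-Parquet machinery, which is elided) is illustrative.

```rust
#[derive(Debug, Clone)]
struct Event {
    entity: String,
    field: String,
    old: Option<String>, // None for the first write
    new: String,
    actor: String,
}

#[derive(Default)]
struct Journal {
    log: Vec<Event>,
}

impl Journal {
    /// Append-only: events are never updated or deleted.
    fn record(&mut self, e: Event) {
        self.log.push(e);
    }

    /// Full mutation history for one entity, oldest first
    /// (what GET /journal/history/{entity_id} would serve).
    fn history(&self, entity: &str) -> Vec<&Event> {
        self.log.iter().filter(|e| e.entity == entity).collect()
    }
}

fn main() {
    let mut j = Journal::default();
    j.record(Event {
        entity: "deal-7".into(), field: "stage".into(),
        old: None, new: "open".into(), actor: "agent-a".into(),
    });
    j.record(Event {
        entity: "deal-7".into(), field: "stage".into(),
        old: Some("open".into()), new: "won".into(), actor: "agent-b".into(),
    });
    assert_eq!(j.history("deal-7").len(), 2);
}
```

Recording old and new values on every event is what lets any field be replayed to any point in time without snapshots.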
+## Phase 10: Rich Catalog v2 ✅
+- [x] DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
+- [x] PII auto-detection: email, phone, SSN, salary, address, medical
+- [x] Column-level metadata with sensitivity flags
+- [x] Lineage tracking: source_system → ingest_job → dataset
+- [x] PATCH /catalog/datasets/by-name/{name}/metadata
+- [x] Backward compatible (serde default)
+- [x] 25 unit tests total
+
+## Phase 11: Embedding Versioning ⬜
+- [ ] Vector index metadata: model_name, model_version, dimensions
+- [ ] Multi-version indexes coexist
+- [ ] Incremental re-embed on model upgrade
+- [ ] A/B search comparison
+
+## Phase 12: Tool Registry ⬜
+- [ ] Named business actions with parameter validation
+- [ ] Read vs write tool permissions
+- [ ] Audit logging per tool invocation
+- [ ] MCP-compatible interface
+- [ ] Rate limiting per agent/tool
+
+## Phase 13: Security & Access Control ⬜
+- [ ] Field-level sensitivity enforcement
+- [ ] Row-level access policies
+- [ ] Column masking
+- [ ] Query audit log
+- [ ] Policy-as-code (TOML/YAML)
+
+## Phase 14: Schema Evolution ⬜
+- [ ] Schema diff detection
+- [ ] AI-generated migration rules
+- [ ] Migration preview before apply
+- [ ] Versioned schemas in catalog
+
+## Phase 15+: Horizon ⬜
+- [ ] Federated multi-bucket query
+- [ ] Database connector ingest (Postgres/MySQL)
+- [ ] PDF OCR (Tesseract)
+- [ ] Scheduled ingest (cron)
+- [ ] Fine-tuned domain models
+- [ ] Multi-node query distribution
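Phase 10's PII auto-detection can be illustrated with a deliberately naive sketch: flag a column when its name or sampled values match simple heuristics. Everything here is hypothetical; the real detector covers email, phone, SSN, salary, address, and medical fields with much richer rules than these.

```rust
// Naive email heuristic: an '@' whose domain part contains a dot.
fn looks_like_email(v: &str) -> bool {
    v.contains('@') && v.rsplit('@').next().map_or(false, |d| d.contains('.'))
}

/// Check the column name first, then fall back to sampled values.
/// Returned tag feeds the column-level sensitivity flags.
fn detect_sensitivity(column: &str, samples: &[&str]) -> Option<&'static str> {
    let name = column.to_ascii_lowercase();
    if name.contains("email") || samples.iter().any(|s| looks_like_email(s)) {
        return Some("email");
    }
    if name.contains("ssn") || name.contains("salary") {
        return Some("restricted");
    }
    None
}

fn main() {
    assert_eq!(detect_sensitivity("contact", &["bob@example.com"]), Some("email"));
    assert_eq!(detect_sensitivity("base_salary", &["90000"]), Some("restricted"));
    assert_eq!(detect_sensitivity("city", &["Oslo"]), None);
}
```

Running detection at ingest and storing the result in the DatasetManifest is what lets Phase 13's enforcement (masking, field-level policies) act on flags that already exist rather than rescanning data at query time.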