PHASES.md and project memory updated to reflect actual build state. Phases 11-14 were built but trackers weren't updated. Final stats: 11 crates, 30 tests, 16 ADRs, 2.47M rows, 100K vectors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Phase Tracker

## Phase 0: Bootstrap ✅
- [x] Cargo workspace with all crate stubs compiling
- [x] `shared` crate: error types, ObjectRef, DatasetId
- [x] `gateway` with Axum: GET /health → 200
- [x] tracing + tracing-subscriber wired in gateway
- [x] justfile with build, test, run recipes
- [x] docs committed to git

## Phase 1: Storage + Catalog ✅
- [x] storaged: object_store backend init (LocalFileSystem)
- [x] storaged: Axum endpoints (PUT/GET/DELETE/LIST)
- [x] shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
- [x] catalogd/registry.rs: in-memory index + manifest persistence
- [x] catalogd service: POST/GET /datasets + by-name
- [x] gateway routes wired

## Phase 2: Query Engine ✅
- [x] queryd: SessionContext + object_store config
- [x] queryd: ListingTable from catalog ObjectRefs
- [x] queryd service: POST /query/sql → JSON
- [x] queryd → catalogd wiring
- [x] gateway routes /query

## Phase 3: AI Integration ✅
- [x] Python sidecar: FastAPI + Ollama (embed/generate/rerank)
- [x] Dockerfile for sidecar
- [x] aibridge/client.rs: HTTP client
- [x] aibridge service: Axum proxy endpoints
- [x] Model config via env vars

## Phase 4: Frontend ✅
- [x] Dioxus scaffold, WASM build
- [x] Ask tab: natural language → AI SQL → results
- [x] Explore tab: dataset browser + AI summary
- [x] SQL tab: raw DataFusion editor
- [x] System tab: health checks for all services

## Phase 5: Hardening ✅
- [x] Proto definitions (lakehouse.proto)
- [x] Internal gRPC: CatalogService on :3101
- [x] OpenTelemetry tracing: stdout exporter
- [x] Auth middleware: X-API-Key (toggleable)
- [x] Config-driven startup: lakehouse.toml

## Phase 6: Ingest Pipeline ✅
- [x] CSV ingest with auto schema detection
- [x] JSON ingest (array + newline-delimited, nested flattening)
- [x] PDF text extraction (lopdf)
- [x] Text/SMS file ingest
- [x] Content hash dedup (SHA-256)
- [x] POST /ingest/file multipart upload
- [x] 12 unit tests

## Phase 7: Vector Index + RAG ✅
- [x] chunker: configurable size + overlap, sentence-boundary aware
- [x] store: embeddings as Parquet (binary f32 vectors)
- [x] search: brute-force cosine similarity
- [x] rag: embed → search → retrieve → LLM answer with citations
- [x] POST /vectors/index, /search, /rag
- [x] Background job system with progress tracking
- [x] Dual-pipeline supervisor with checkpointing + retry
- [x] 100K embeddings: 177/sec on A4000, zero failures
- [x] 6 unit tests

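For reference, the Phase 7 "brute-force cosine similarity" item amounts to scoring every stored vector against the query and keeping the best few. The sketch below is illustrative only and is not taken from the project's `search` module: the names `cosine_similarity` and `top_k` are hypothetical, and vectors are assumed to be already decoded from Parquet into plain `f32` slices.

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns 0.0 when either vector has zero norm (an assumed convention).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must share a dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Scores every stored vector against the query (a full linear scan)
/// and returns the `k` best `(index, score)` pairs, best first.
fn top_k(query: &[f32], vectors: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine_similarity(query, v)))
        .collect();
    // Sort descending by score; total_cmp gives a total order over f32.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

Each query touches every vector, so cost grows linearly with index size, which is why an approximate index such as HNSW appears under the Horizon items.
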
## Phase 8: Hot Cache + Incremental Updates ✅
- [x] MemTable hot cache: LRU, configurable max (16GB)
- [x] POST /query/cache/pin, /cache/evict, GET /cache/stats
- [x] Delta store: append-only delta Parquet files
- [x] Merge-on-read: queries combine base + deltas
- [x] Compaction: POST /query/compact
- [x] Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)

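The Phase 8 merge-on-read and compaction items can be pictured with the sketch below. It is a simplification under stated assumptions, not the project's delta store: rows are assumed to carry a unique `id`, deltas are assumed to be applied oldest to newest with last-write-wins semantics, and `Row` and `merge_on_read` are hypothetical names. The tracker does not say whether deltas are pure appends or upserts.

```rust
use std::collections::BTreeMap;

/// Hypothetical row shape: a key plus one payload column.
#[derive(Clone, Debug, PartialEq)]
struct Row {
    id: u64,
    value: String,
}

/// Builds the query-time view: start from the base rows, then apply each
/// delta batch in order so that later writes replace earlier ones.
fn merge_on_read(base: &[Row], deltas: &[Vec<Row>]) -> Vec<Row> {
    let mut merged: BTreeMap<u64, Row> =
        base.iter().cloned().map(|row| (row.id, row)).collect();
    for delta in deltas {
        for row in delta {
            // Insert new ids, overwrite existing ones (assumed upsert semantics).
            merged.insert(row.id, row.clone());
        }
    }
    // Rows come back ordered by id because BTreeMap iterates in key order.
    merged.into_values().collect()
}
```

Under the same assumptions, compaction would write the result of `merge_on_read` out as the new base and discard the delta batches it consumed.
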
## Phase 8.5: Agent Workspaces ✅
- [x] WorkspaceManager with daily/weekly/monthly/pinned tiers
- [x] Saved searches, shortlists, activity logs per workspace
- [x] Instant zero-copy handoff between agents
- [x] Persistence to object storage, rebuild on startup

## Phase 9: Event Journal ✅
- [x] journald crate: append-only mutation log
- [x] Event schema: entity, field, old/new value, actor, source, workspace
- [x] In-memory buffer with auto-flush to Parquet
- [x] GET /journal/history/{entity_id}, /recent, /stats
- [x] POST /journal/event, /update, /flush

## Phase 10: Rich Catalog v2 ✅
- [x] DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
- [x] PII auto-detection: email, phone, SSN, salary, address, medical
- [x] Column-level metadata with sensitivity flags
- [x] Lineage tracking: source_system → ingest_job → dataset
- [x] PATCH /catalog/datasets/by-name/{name}/metadata
- [x] Backward compatible (serde default)

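One way the Phase 10 PII auto-detection could work is by matching column names against the six listed categories. The tracker does not say whether detection inspects names, values, or both, so the name-based rule table below is purely an assumption, and `PiiKind`, `RULES`, and `detect_pii` are hypothetical names.

```rust
/// The six PII categories listed in the tracker.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PiiKind {
    Email,
    Phone,
    Ssn,
    Salary,
    Address,
    Medical,
}

/// Assumed rule table: substring to look for in a column name, and the
/// category it implies. Order matters: the first match wins, so
/// "email_address" is classified as Email rather than Address.
const RULES: &[(&str, PiiKind)] = &[
    ("email", PiiKind::Email),
    ("phone", PiiKind::Phone),
    ("ssn", PiiKind::Ssn),
    ("salary", PiiKind::Salary),
    ("address", PiiKind::Address),
    ("medical", PiiKind::Medical),
];

/// Returns the first PII category whose keyword appears in the
/// lowercased column name, or None if no rule matches.
fn detect_pii(column: &str) -> Option<PiiKind> {
    let name = column.to_lowercase();
    RULES
        .iter()
        .find(|(needle, _)| name.contains(*needle))
        .map(|(_, kind)| *kind)
}
```

A name-only heuristic like this misses PII in generically named columns, so a value-sampling pass would be a natural complement.
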
## Phase 11: Embedding Versioning ✅
- [x] IndexRegistry: model_name, model_version, dimensions per index
- [x] Index metadata persisted as JSON, rebuilt on startup
- [x] GET /vectors/indexes: list all (filter by source/model)
- [x] GET /vectors/indexes/{name}: metadata
- [x] Background jobs auto-register metadata on completion

## Phase 12: Tool Registry ✅
- [x] 6 built-in staffing tools (search_candidates, get_candidate, revenue_by_client, recruiter_performance, cold_leads, open_jobs)
- [x] Parameter validation + SQL template substitution
- [x] Permission levels: read / write / admin
- [x] Full audit trail per invocation
- [x] GET /tools, GET /tools/{name}, POST /tools/{name}/call, GET /tools/audit

## Phase 13: Security & Access Control ✅
- [x] Role-based access: admin, recruiter, analyst, agent
- [x] Field-level sensitivity enforcement
- [x] Column masking determination per agent
- [x] Query audit logging
- [x] GET/POST /access/roles, GET /access/audit, POST /access/check

## Phase 14: Schema Evolution ✅
- [x] Schema diff detection (added, removed, type changed, renamed)
- [x] Fuzzy rename detection (shared word parts)
- [x] Auto-generated migration rules with confidence scores
- [x] AI migration prompt builder for complex cases
- [x] 5 unit tests

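The Phase 14 schema diff and fuzzy rename items might look roughly like the sketch below. It is not the project's implementation: schemas are assumed to be simple name-to-type maps, "shared word parts" is interpreted as Jaccard overlap of the `_`-separated parts of two column names, and `Change`, `word_parts`, `name_similarity`, `diff_schemas`, and `rename_threshold` are all hypothetical names.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, PartialEq)]
enum Change {
    Added(String),
    Removed(String),
    TypeChanged { column: String, from: String, to: String },
    Renamed { from: String, to: String, confidence: f32 },
}

/// Splits a column name into lowercase word parts on `_`.
fn word_parts(name: &str) -> BTreeSet<String> {
    name.split('_')
        .filter(|part| !part.is_empty())
        .map(|part| part.to_lowercase())
        .collect()
}

/// Jaccard overlap of word parts: shared parts / all distinct parts.
fn name_similarity(a: &str, b: &str) -> f32 {
    let (parts_a, parts_b) = (word_parts(a), word_parts(b));
    let shared = parts_a.intersection(&parts_b).count() as f32;
    let total = parts_a.union(&parts_b).count() as f32;
    if total == 0.0 { 0.0 } else { shared / total }
}

/// Diffs two schemas (column name -> type name). A new column is reported
/// as a rename of the most similar dropped column when the similarity
/// reaches `rename_threshold`; that similarity doubles as the confidence.
fn diff_schemas(
    old: &BTreeMap<String, String>,
    new: &BTreeMap<String, String>,
    rename_threshold: f32,
) -> Vec<Change> {
    let mut changes = Vec::new();
    let mut removed: Vec<&String> = old.keys().filter(|c| !new.contains_key(*c)).collect();
    let added: Vec<&String> = new.keys().filter(|c| !old.contains_key(*c)).collect();

    // Columns present on both sides: report type changes only.
    for (column, old_type) in old {
        if let Some(new_type) = new.get(column) {
            if new_type != old_type {
                changes.push(Change::TypeChanged {
                    column: column.clone(),
                    from: old_type.clone(),
                    to: new_type.clone(),
                });
            }
        }
    }

    // Pair each added column with its best still-unmatched removed column.
    for new_col in added {
        let best = removed
            .iter()
            .enumerate()
            .map(|(i, old_col)| (i, name_similarity(old_col, new_col)))
            .max_by(|a, b| a.1.total_cmp(&b.1));
        match best {
            Some((i, score)) if score >= rename_threshold => {
                let old_col = removed.remove(i);
                changes.push(Change::Renamed {
                    from: old_col.clone(),
                    to: new_col.clone(),
                    confidence: score,
                });
            }
            _ => changes.push(Change::Added(new_col.clone())),
        }
    }

    // Whatever was not claimed by a rename is a plain removal.
    changes.extend(removed.into_iter().map(|c| Change::Removed(c.clone())));
    changes
}
```

With a threshold of 0.5, `candidate_name` and `candidate_full_name` share two of three distinct parts, so the pair would be reported as a rename with confidence of about 0.67.
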
## Phase 15+: Horizon ⬜
- [ ] HNSW vector index (100K search: 4.5s → <50ms)
- [ ] Federated multi-bucket query
- [ ] Database connector ingest (Postgres/MySQL)
- [ ] PDF OCR (Tesseract)
- [ ] Scheduled ingest (cron)
- [ ] Fine-tuned domain models
- [ ] Multi-node query distribution

---

**30 unit tests | 11 crates | 16 ADRs | 2.47M rows | 100K vectors | All built 2026-03-27**