- /read-mem skill: reads PRD, phases, decisions, checks live services
- Updated PHASES.md with all 15 phases tracked
- Updated project_lakehouse.md memory with full context
- Updated CLAUDE.md with project reference
- Skill at ~/.claude/skills/read-mem/ and project level
- Triggers on: "read mem", "project status", "where were we", "catch me up"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4.7 KiB
Phase Tracker
Phase 0: Bootstrap ✅
- Cargo workspace with all crate stubs compiling
- shared crate: error types, ObjectRef, DatasetId
- gateway with Axum: GET /health → 200
- tracing + tracing-subscriber wired in gateway
- justfile with build, test, run recipes
- docs committed to git
Phase 1: Storage + Catalog ✅
- storaged: object_store backend init (LocalFileSystem)
- storaged: Axum endpoints (PUT/GET/DELETE/LIST)
- shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
- catalogd/registry.rs: in-memory index + manifest persistence
- catalogd service: POST/GET /datasets + by-name
- gateway routes wired
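
As an illustration of the schema fingerprinting item above, here is a minimal sketch using only the Rust standard library. It assumes a fingerprint is a deterministic hash over ordered (field name, type name) pairs. The function names and the choice of FNV-1a are assumptions for illustration; the tracker does not say which hash `shared/arrow_helpers.rs` uses.

```rust
/// Fold bytes into a running 64-bit FNV-1a hash.
/// FNV-1a is an assumption here, picked because it is deterministic
/// across runs and needs no external crate.
fn fnv1a(bytes: &[u8], mut h: u64) -> u64 {
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hypothetical fingerprint over ordered (field name, type name) pairs.
/// A zero byte separates the parts so ("ab", "c") and ("a", "bc") differ.
fn schema_fingerprint(fields: &[(&str, &str)]) -> u64 {
    let mut h = 0xcbf29ce484222325_u64; // FNV-1a 64-bit offset basis
    for (name, ty) in fields {
        h = fnv1a(name.as_bytes(), h);
        h = fnv1a(&[0], h);
        h = fnv1a(ty.as_bytes(), h);
        h = fnv1a(&[0], h);
    }
    h
}
```

Two schemas with the same fields in the same order produce the same value, so a catalog could compare fingerprints instead of full schemas. In this sketch field order is significant.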
Phase 2: Query Engine ✅
- queryd: SessionContext + object_store config
- queryd: ListingTable from catalog ObjectRefs
- queryd service: POST /query/sql → JSON
- queryd → catalogd wiring
- gateway routes /query
Phase 3: AI Integration ✅
- Python sidecar: FastAPI + Ollama (embed/generate/rerank)
- Dockerfile for sidecar
- aibridge/client.rs: HTTP client
- aibridge service: Axum proxy endpoints
- Model config via env vars
Phase 4: Frontend ✅
- Dioxus scaffold, WASM build
- Ask tab: natural language → AI SQL → results
- Explore tab: dataset browser + AI summary
- SQL tab: raw DataFusion editor
- System tab: health checks for all services
Phase 5: Hardening ✅
- Proto definitions (lakehouse.proto)
- Internal gRPC: CatalogService on :3101
- OpenTelemetry tracing: stdout exporter
- Auth middleware: X-API-Key (toggleable)
- Config-driven startup: lakehouse.toml
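
The toggleable X-API-Key check listed above could look roughly like the following. This is a sketch of the decision logic only, not the gateway's actual Axum middleware. `AuthConfig`, its fields, and `check_api_key` are hypothetical names.

```rust
/// Outcome of the API-key check.
#[derive(Debug, PartialEq)]
enum AuthDecision {
    Allow,
    Reject,
}

/// Hypothetical auth settings; the real keys in lakehouse.toml may differ.
struct AuthConfig {
    enabled: bool,
    api_key: String,
}

/// Decide whether a request may proceed, given the value of its
/// X-API-Key header (None when the header is absent).
fn check_api_key(cfg: &AuthConfig, header: Option<&str>) -> AuthDecision {
    // When auth is toggled off, every request passes.
    if !cfg.enabled {
        return AuthDecision::Allow;
    }
    match header {
        Some(key) if key == cfg.api_key => AuthDecision::Allow,
        _ => AuthDecision::Reject,
    }
}
```

A production version would likely compare keys in constant time; the plain `==` here keeps the sketch short.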
Phase 6: Ingest Pipeline ✅
- CSV ingest with auto schema detection
- JSON ingest (array + newline-delimited, nested flattening)
- PDF text extraction (lopdf)
- Text/SMS file ingest
- Content hash dedup (SHA-256)
- POST /ingest/file multipart upload
- 12 unit tests
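
To make the auto schema detection item concrete, here is a minimal sketch of one common approach: infer a type per cell, then widen per column (Int → Float → Text). It assumes a header row and plain comma splitting with no quoted fields. `ColType`, `infer_value`, `widen`, and `infer_schema` are illustrative names, not the ingest crate's API.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Int,
    Float,
    Text,
}

/// Classify a single cell by trying the narrowest type first.
fn infer_value(v: &str) -> ColType {
    let v = v.trim();
    if v.parse::<i64>().is_ok() {
        ColType::Int
    } else if v.parse::<f64>().is_ok() {
        ColType::Float
    } else {
        ColType::Text
    }
}

/// Combine two observed types into the narrowest type that fits both.
fn widen(a: ColType, b: ColType) -> ColType {
    use ColType::*;
    match (a, b) {
        (Int, Int) => Int,
        (Int, Float) | (Float, Int) | (Float, Float) => Float,
        _ => Text,
    }
}

/// Infer (column name, type) pairs from CSV text with a header row.
/// Empty cells are skipped; a column with no values defaults to Text.
fn infer_schema(csv: &str) -> Vec<(String, ColType)> {
    let mut lines = csv.lines().filter(|l| !l.trim().is_empty());
    let header: Vec<String> = match lines.next() {
        Some(h) => h.split(',').map(|s| s.trim().to_string()).collect(),
        None => return Vec::new(),
    };
    let mut types: Vec<Option<ColType>> = vec![None; header.len()];
    for line in lines {
        for (i, cell) in line.split(',').enumerate().take(header.len()) {
            if cell.trim().is_empty() {
                continue;
            }
            let t = infer_value(cell);
            types[i] = Some(match types[i] {
                Some(prev) => widen(prev, t),
                None => t,
            });
        }
    }
    header
        .into_iter()
        .zip(types)
        .map(|(name, t)| (name, t.unwrap_or(ColType::Text)))
        .collect()
}
```

For the input `id,price,name` / `1,2.5,widget` / `2,3,gadget`, the `price` column widens to Float because it contains both `2.5` and `3`.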
Phase 7: Vector Index + RAG ✅
- chunker: configurable size + overlap, sentence-boundary aware
- store: embeddings as Parquet (binary f32 vectors)
- search: brute-force cosine similarity
- rag: embed → search → retrieve → LLM answer with citations
- POST /vectors/index, /search, /rag
- Background job system with progress tracking
- Dual-pipeline supervisor with checkpointing + retry
- 6 unit tests
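
The brute-force cosine similarity search above amounts to scoring every stored vector against the query and keeping the best k. A minimal sketch, assuming vectors are already loaded into memory as `Vec<f32>` (the names `cosine` and `top_k` are illustrative):

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Score every stored vector against the query and return the k
/// highest-scoring (index, score) pairs, best first.
fn top_k(query: &[f32], vectors: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine(query, v)))
        .collect();
    // Sort descending by score; total_cmp gives a total order over f32.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

Each query costs O(n · d) for n vectors of dimension d, which is the usual trade-off of brute force: exact results and no index to maintain, at the price of a full scan.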
Phase 8: Hot Cache + Incremental Updates ✅
- MemTable hot cache: LRU, configurable max (16GB)
- POST /query/cache/pin, /cache/evict, GET /cache/stats
- Delta store: append-only delta Parquet files
- Merge-on-read: queries combine base + deltas
- Compaction: POST /query/compact
- Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)
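
The delta store, merge-on-read, and compaction items above fit together roughly as follows. This sketch uses in-memory vectors as stand-ins for the base and delta Parquet files. The `Dataset` type and its methods are hypothetical.

```rust
/// A row is reduced to (id, payload) for illustration.
type Row = (u64, String);

/// In-memory stand-in: `base` plays the base Parquet file and each
/// entry of `deltas` plays one append-only delta file.
struct Dataset {
    base: Vec<Row>,
    deltas: Vec<Vec<Row>>,
}

impl Dataset {
    /// Append-only write: new rows land in a fresh delta, base is untouched.
    fn append(&mut self, rows: Vec<Row>) {
        self.deltas.push(rows);
    }

    /// Merge-on-read: a scan sees base rows followed by all delta rows.
    fn scan(&self) -> impl Iterator<Item = &Row> + '_ {
        self.base.iter().chain(self.deltas.iter().flatten())
    }

    /// Compaction: fold every delta into the base and clear the delta list.
    fn compact(&mut self) {
        for delta in self.deltas.drain(..) {
            self.base.extend(delta);
        }
    }
}
```

A scan returns the same rows before and after `compact`; compaction only changes how many files a query has to combine.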
Phase 8.5: Agent Workspaces ✅
- WorkspaceManager with daily/weekly/monthly/pinned tiers
- Saved searches, shortlists, activity logs per workspace
- Instant zero-copy handoff between agents
- Persistence to object storage, rebuild on startup
Phase 9: Event Journal ✅
- journald crate: append-only mutation log
- Event schema: entity, field, old/new value, actor, source, workspace
- In-memory buffer with auto-flush to Parquet
- GET /journal/history/{entity_id}, /recent, /stats
- POST /journal/event, /update, /flush
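
A minimal sketch of the in-memory buffer with auto-flush, using the event fields listed above. The flush threshold, the `Journal` type, and the use of a `Vec` of batches in place of Parquet files are assumptions for illustration.

```rust
/// One mutation record, following the event schema listed above.
#[derive(Debug, Clone, PartialEq)]
struct Event {
    entity: String,
    field: String,
    old_value: Option<String>,
    new_value: Option<String>,
    actor: String,
    source: String,
    workspace: String,
}

/// Hypothetical journal: `flushed_batches` stands in for Parquet files.
struct Journal {
    buffer: Vec<Event>,
    flush_threshold: usize,
    flushed_batches: Vec<Vec<Event>>,
}

impl Journal {
    fn new(flush_threshold: usize) -> Self {
        Journal { buffer: Vec::new(), flush_threshold, flushed_batches: Vec::new() }
    }

    /// Append an event; flush automatically once the buffer is full.
    fn record(&mut self, event: Event) {
        self.buffer.push(event);
        if self.buffer.len() >= self.flush_threshold {
            self.flush();
        }
    }

    /// Move buffered events into a new immutable batch.
    fn flush(&mut self) {
        if self.buffer.is_empty() {
            return;
        }
        let batch = std::mem::take(&mut self.buffer);
        self.flushed_batches.push(batch);
    }

    /// History for one entity: flushed events first, then buffered ones.
    fn history(&self, entity: &str) -> Vec<&Event> {
        self.flushed_batches
            .iter()
            .flatten()
            .chain(self.buffer.iter())
            .filter(|e| e.entity == entity)
            .collect()
    }
}
```

Because the log is append-only, `history` has to read both flushed batches and the live buffer to return a complete view.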
Phase 10: Rich Catalog v2 ✅
- DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
- PII auto-detection: email, phone, SSN, salary, address, medical
- Column-level metadata with sensitivity flags
- Lineage tracking: source_system → ingest_job → dataset
- PATCH /catalog/datasets/by-name/{name}/metadata
- Backward compatible (serde default)
- 25 unit tests total
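
One simple way to approach PII auto-detection is matching on column names. The sketch below assumes that approach. The tracker does not say whether the catalog inspects names, values, or both, and the keyword list and `detect_pii` function are hypothetical.

```rust
/// PII categories named in the tracker.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PiiKind {
    Email,
    Phone,
    Ssn,
    Salary,
    Address,
    Medical,
}

/// Guess a PII category from a column name via case-insensitive
/// substring matching. The first matching rule wins, so a column named
/// "email_address" is classified as Email.
fn detect_pii(column: &str) -> Option<PiiKind> {
    // Keyword list is an assumption for illustration.
    const RULES: &[(&str, PiiKind)] = &[
        ("email", PiiKind::Email),
        ("phone", PiiKind::Phone),
        ("ssn", PiiKind::Ssn),
        ("salary", PiiKind::Salary),
        ("address", PiiKind::Address),
        ("diagnosis", PiiKind::Medical),
        ("medical", PiiKind::Medical),
    ];
    let name = column.to_lowercase();
    RULES
        .iter()
        .find(|(needle, _)| name.contains(*needle))
        .map(|(_, kind)| *kind)
}
```

A detected category could then set the column's sensitivity flag in the dataset manifest. Name-based matching misses PII stored under neutral column names, so value sampling would be a natural complement.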
Phase 11: Embedding Versioning ⬜
- Vector index metadata: model_name, model_version, dimensions
- Multi-version indexes coexist
- Incremental re-embed on model upgrade
- A/B search comparison
Phase 12: Tool Registry ⬜
- Named business actions with parameter validation
- Read vs write tool permissions
- Audit logging per tool invocation
- MCP-compatible interface
- Rate limiting per agent/tool
Phase 13: Security & Access Control ⬜
- Field-level sensitivity enforcement
- Row-level access policies
- Column masking
- Query audit log
- Policy-as-code (TOML/YAML)
Phase 14: Schema Evolution ⬜
- Schema diff detection
- AI-generated migration rules
- Migration preview before apply
- Versioned schemas in catalog
Phase 15+: Horizon ⬜
- Federated multi-bucket query
- Database connector ingest (Postgres/MySQL)
- PDF OCR (Tesseract)
- Scheduled ingest (cron)
- Fine-tuned domain models
- Multi-node query distribution