Add read-mem skill + comprehensive project memory

- /read-mem skill: reads PRD, phases, decisions, checks live services
- Updated PHASES.md with all 15 phases tracked
- Updated project_lakehouse.md memory with full context
- Updated CLAUDE.md with project reference
- Skill at ~/.claude/skills/read-mem/ and project level
- Triggers on: "read mem", "project status", "where were we", "catch me up"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:23:01 -05:00 · 6d49f81ebf
commit 6d49f81ebf
parent 9e53caaec3
1 changed file with 115 additions and 45 deletions
--- a/docs/PHASES.md
+++ b/docs/PHASES.md
@@ -1,58 +1,128 @@
 # Phase Tracker

-## Phase 0: Bootstrap
-- [x] 0.1 — Cargo workspace with all crate stubs compiling
-- [x] 0.2 — `shared` crate: error types, ObjectRef, DatasetId
-- [x] 0.3 — `gateway` with Axum: GET /health → 200
-- [x] 0.4 — tracing + tracing-subscriber wired in gateway
-- [x] 0.5 — justfile with build, test, run recipes
-- [x] 0.6 — docs committed to git
+## Phase 0: Bootstrap ✅
+- [x] Cargo workspace with all crate stubs compiling
+- [x] `shared` crate: error types, ObjectRef, DatasetId
+- [x] `gateway` with Axum: GET /health → 200
+- [x] tracing + tracing-subscriber wired in gateway
+- [x] justfile with build, test, run recipes
+- [x] docs committed to git

-**Gate: PASSED** — All crates compile. Gateway runs. Logs emit. Docs committed.
+## Phase 1: Storage + Catalog ✅
+- [x] storaged: object_store backend init (LocalFileSystem)
+- [x] storaged: Axum endpoints (PUT/GET/DELETE/LIST)
+- [x] shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
+- [x] catalogd/registry.rs: in-memory index + manifest persistence
+- [x] catalogd service: POST/GET /datasets + by-name
+- [x] gateway routes wired

-## Phase 1: Storage + Catalog
-- [x] 1.1 — storaged: object_store backend init (LocalFileSystem)
-- [x] 1.2 — storaged: Axum endpoints (PUT/GET/DELETE/LIST /objects/{key})
-- [x] 1.3 — shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
-- [x] 1.4 — catalogd/registry.rs: in-memory index + manifest persistence to object storage
-- [x] 1.5 — catalogd/schema.rs: schema fingerprinting (merged into shared/arrow_helpers.rs)
-- [x] 1.6 — catalogd service: POST/GET /datasets + GET /datasets/by-name/{name}
-- [x] 1.7 — gateway routes to storaged + catalogd with shared state
+## Phase 2: Query Engine ✅
+- [x] queryd: SessionContext + object_store config
+- [x] queryd: ListingTable from catalog ObjectRefs
+- [x] queryd service: POST /query/sql → JSON
+- [x] queryd → catalogd wiring
+- [x] gateway routes /query

-**Gate: PASSED** — PUT object → register dataset → list → get by name. All via gateway HTTP.
+## Phase 3: AI Integration ✅
+- [x] Python sidecar: FastAPI + Ollama (embed/generate/rerank)
+- [x] Dockerfile for sidecar
+- [x] aibridge/client.rs: HTTP client
+- [x] aibridge service: Axum proxy endpoints
+- [x] Model config via env vars

-## Phase 2: Query Engine
-- [x] 2.1 — queryd: SessionContext + object_store config (custom scheme to avoid path doubling)
-- [x] 2.2 — queryd: ListingTable from catalog ObjectRefs with schema inference
-- [x] 2.3 — queryd service: POST /query/sql → JSON (columns + rows + row_count)
-- [x] 2.4 — queryd → catalogd wiring (reads dataset list, registers as tables)
-- [x] 2.5 — gateway routes /query with QueryEngine state
+## Phase 4: Frontend ✅
+- [x] Dioxus scaffold, WASM build
+- [x] Ask tab: natural language → AI SQL → results
+- [x] Explore tab: dataset browser + AI summary
+- [x] SQL tab: raw DataFusion editor
+- [x] System tab: health checks for all services

-**Gate: PASSED** — SELECT *, WHERE/ORDER BY, COUNT/AVG all return correct results via catalog.
+## Phase 5: Hardening ✅
+- [x] Proto definitions (lakehouse.proto)
+- [x] Internal gRPC: CatalogService on :3101
+- [x] OpenTelemetry tracing: stdout exporter
+- [x] Auth middleware: X-API-Key (toggleable)
+- [x] Config-driven startup: lakehouse.toml

-## Phase 3: AI Integration
-- [x] 3.1 — Python sidecar: FastAPI + Ollama (embed/generate/rerank) — real models, no mocks
-- [x] 3.2 — Dockerfile for sidecar
-- [x] 3.3 — aibridge/client.rs: reqwest HTTP client with 120s timeout
-- [x] 3.4 — aibridge service: Axum proxy endpoints (POST /ai/embed, /ai/generate, /ai/rerank)
-- [x] 3.5 — Model config via env vars (EMBED_MODEL, GEN_MODEL, RERANK_MODEL, SIDECAR_URL)
+## Phase 6: Ingest Pipeline ✅
+- [x] CSV ingest with auto schema detection
+- [x] JSON ingest (array + newline-delimited, nested flattening)
+- [x] PDF text extraction (lopdf)
+- [x] Text/SMS file ingest
+- [x] Content hash dedup (SHA-256)
+- [x] POST /ingest/file multipart upload
+- [x] 12 unit tests

-**Gate: PASSED** — Gateway → aibridge → sidecar → Ollama → real 768d embeddings + generation.
+## Phase 7: Vector Index + RAG ✅
+- [x] chunker: configurable size + overlap, sentence-boundary aware
+- [x] store: embeddings as Parquet (binary f32 vectors)
+- [x] search: brute-force cosine similarity
+- [x] rag: embed → search → retrieve → LLM answer with citations
+- [x] POST /vectors/index, /search, /rag
+- [x] Background job system with progress tracking
+- [x] Dual-pipeline supervisor with checkpointing + retry
+- [x] 6 unit tests

-## Phase 4: Frontend
-- [x] 4.1 — Dioxus scaffold, WASM build (dx build --platform web)
-- [x] 4.2 — Dataset browser (sidebar, click to select, refresh)
-- [x] 4.3 — Query editor + results table (Ctrl+Enter to run, column types, row count)
-- [x] 4.4 — Error display + loading states
-- [x] 4.5 — Nginx proxy (lakehouse.devop.live), same-origin API detection
+## Phase 8: Hot Cache + Incremental Updates ✅
+- [x] MemTable hot cache: LRU, configurable max (16GB)
+- [x] POST /query/cache/pin, /cache/evict, GET /cache/stats
+- [x] Delta store: append-only delta Parquet files
+- [x] Merge-on-read: queries combine base + deltas
+- [x] Compaction: POST /query/compact
+- [x] Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)

-**Gate: PASSED** — Browse datasets and query from browser at lakehouse.devop.live.
+## Phase 8.5: Agent Workspaces ✅
+- [x] WorkspaceManager with daily/weekly/monthly/pinned tiers
+- [x] Saved searches, shortlists, activity logs per workspace
+- [x] Instant zero-copy handoff between agents
+- [x] Persistence to object storage, rebuild on startup

-## Phase 5: Hardening
-- [x] 5.1 — Proto definitions (lakehouse.proto: CatalogService, QueryService, StorageService, AiService)
-- [x] 5.2 — Internal gRPC: CatalogService on :3101, proto crate with tonic codegen
-- [x] 5.3 — OpenTelemetry tracing: stdout exporter, configurable via lakehouse.toml
-- [x] 5.4 — Auth middleware: X-API-Key header check, toggleable via config
-- [x] 5.5 — Config-driven startup: lakehouse.toml (gateway, storage, catalog, sidecar, ai, auth, observability)
+## Phase 9: Event Journal ✅
+- [x] journald crate: append-only mutation log
+- [x] Event schema: entity, field, old/new value, actor, source, workspace
+- [x] In-memory buffer with auto-flush to Parquet
+- [x] GET /journal/history/{entity_id}, /recent, /stats
+- [x] POST /journal/event, /update, /flush

-**Gate: PASSED** — gRPC on :3101, OTel traces, auth ready, system starts from repo + lakehouse.toml.
+## Phase 10: Rich Catalog v2 ✅
+- [x] DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
+- [x] PII auto-detection: email, phone, SSN, salary, address, medical
+- [x] Column-level metadata with sensitivity flags
+- [x] Lineage tracking: source_system → ingest_job → dataset
+- [x] PATCH /catalog/datasets/by-name/{name}/metadata
+- [x] Backward compatible (serde default)
+- [x] 25 unit tests total
+
+## Phase 11: Embedding Versioning ⬜
+- [ ] Vector index metadata: model_name, model_version, dimensions
+- [ ] Multi-version indexes coexist
+- [ ] Incremental re-embed on model upgrade
+- [ ] A/B search comparison
+
+## Phase 12: Tool Registry ⬜
+- [ ] Named business actions with parameter validation
+- [ ] Read vs write tool permissions
+- [ ] Audit logging per tool invocation
+- [ ] MCP-compatible interface
+- [ ] Rate limiting per agent/tool
+
+## Phase 13: Security & Access Control ⬜
+- [ ] Field-level sensitivity enforcement
+- [ ] Row-level access policies
+- [ ] Column masking
+- [ ] Query audit log
+- [ ] Policy-as-code (TOML/YAML)
+
+## Phase 14: Schema Evolution ⬜
+- [ ] Schema diff detection
+- [ ] AI-generated migration rules
+- [ ] Migration preview before apply
+- [ ] Versioned schemas in catalog
+
+## Phase 15+: Horizon ⬜
+- [ ] Federated multi-bucket query
+- [ ] Database connector ingest (Postgres/MySQL)
+- [ ] PDF OCR (Tesseract)
+- [ ] Scheduled ingest (cron)
+- [ ] Fine-tuned domain models
+- [ ] Multi-node query distribution