lakehouse/docs/PHASES.md
root 97a376482c Phase C: Decoupled embedding refresh
Implements the llms3.com-inspired pattern: embeddings refresh
asynchronously, decoupled from transactional row writes. New rows arrive;
ingest marks the vector index stale; a later refresh embeds only the
delta (doc_ids not already in the index).

Schema additions (DatasetManifest):
- last_embedded_at: Option<DateTime> - when the index was last refreshed
- embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh
- embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled
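A minimal sketch of how these additions could sit on the manifest, assuming chrono + serde; the field names come from the list above, while the derives and the cron payload shape are assumptions:

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

/// The three policies named above; carrying the cron expression on
/// Scheduled is an assumption about the payload shape.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub enum RefreshPolicy {
    Manual,
    OnAppend,
    Scheduled(String), // cron expression; declared intent only in the MVP
}

/// Only the Phase C fields are shown; the rest of DatasetManifest is elided.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct DatasetManifest {
    // ...existing catalog fields elided...
    /// When the vector index was last refreshed for this dataset.
    pub last_embedded_at: Option<DateTime<Utc>>,
    /// Set when data is written to an already-embedded dataset, cleared on refresh.
    pub embedding_stale_since: Option<DateTime<Utc>>,
    /// Manual, OnAppend, or Scheduled; intent only, no runtime trigger yet.
    pub embedding_refresh_policy: Option<RefreshPolicy>,
}
```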

Ingest paths (pipeline::ingest_file + pg_stream) call
registry.mark_embeddings_stale after writing. No-op if the dataset has
never been embedded — stale semantics only kick in once last_embedded_at
is set.
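Roughly what that guard looks like; a hypothetical sketch with stand-in types (the real Registry holds full manifests and persists them, and whether repeated writes keep the first stale timestamp is an assumption here):

```rust
use std::collections::HashMap;
use chrono::{DateTime, Utc};

/// Minimal stand-ins for the catalog types; illustrative only.
struct ManifestStub {
    last_embedded_at: Option<DateTime<Utc>>,
    embedding_stale_since: Option<DateTime<Utc>>,
}

struct Registry {
    datasets: HashMap<String, ManifestStub>,
}

impl Registry {
    /// Called by the ingest paths after a write.
    pub fn mark_embeddings_stale(&mut self, dataset: &str) {
        if let Some(m) = self.datasets.get_mut(dataset) {
            // Never embedded: nothing to invalidate, so the call is a no-op.
            if m.last_embedded_at.is_none() {
                return;
            }
            // Keep the first stale timestamp so operators can see how long
            // the index has been behind (assumed behavior).
            if m.embedding_stale_since.is_none() {
                m.embedding_stale_since = Some(Utc::now());
            }
        }
    }
}
```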

Refresh pipeline (vectord::refresh::refresh_index):
- Reads the dataset Parquet, extracts (doc_id, text) pairs
- Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas)
- Loads existing embeddings via EmbeddingCache (empty on first-time build)
- Filters to rows whose doc_id is NOT in the existing set
- Chunks (chunker::chunk_column), embeds via Ollama (batches of 32),
  writes combined index, clears stale flag
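The heart of the delta step is a set difference on doc_ids before any embedding work happens; a simplified sketch (the real path goes through EmbeddingCache and chunker::chunk_column, elided here):

```rust
use std::collections::HashSet;

/// (doc_id, text) pair pulled out of the dataset Parquet.
struct DocRow {
    doc_id: String,
    text: String,
}

/// Keep only rows whose doc_id is not already present in the index.
/// This is why a refresh with no new rows is near-instant, and why the
/// 54-row re-ingest below only embeds the 34 new docs.
fn delta_rows(all_rows: Vec<DocRow>, existing_ids: &HashSet<String>) -> Vec<DocRow> {
    all_rows
        .into_iter()
        .filter(|row| !existing_ids.contains(&row.doc_id))
        .collect()
}

// Downstream, the surviving rows are chunked and embedded in batches of 32,
// then written back out alongside the existing vectors as one combined index.
```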

Endpoints:
- POST /vectors/refresh/{dataset_name} - body {index_name, id_column,
  text_column, chunk_size?, overlap?}
- GET /vectors/stale - lists datasets whose embedding_stale_since is set
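For reference, the refresh body above maps onto a small request struct; a serde sketch (whether the optional chunking knobs fall back to chunker defaults is an assumption):

```rust
use serde::Deserialize;

/// Body for POST /vectors/refresh/{dataset_name}; sketch only.
#[derive(Debug, Deserialize)]
struct RefreshRequest {
    index_name: String,
    id_column: String,
    text_column: String,
    /// Optional overrides; presumably default to the chunker's settings.
    chunk_size: Option<usize>,
    overlap: Option<usize>,
}
```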

End-to-end verified on threat_intel (knowledge_base.threat_intel):
- Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s,
  last_embedded_at set
- Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check)
- Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set
- /vectors/stale surfaces threat_intel with timestamps + policy
- Delta refresh: 34 new docs embedded in 970ms (6x faster than full
  re-embed); stale_cleared = true

Not in MVP scope:
- UPDATE semantics (same doc_id, different content) - would need
  per-row content hashing
- OnAppend policy auto-trigger - just declares intent; actual scheduler
  deferred
- Scheduler runtime - the Scheduled(cron) variant declares the intent so
  operators can see which datasets expect a scheduled refresh, but the
  cron runner itself is separate

Per ADR-019: when a profile switches to vector_backend=Lance, this
refresh path benefits — Lance's native append replaces our "read all +
rewrite" Parquet rebuild pattern. Current MVP works well enough at
~500-5K rows to validate the architecture; Lance unblocks the 5M+ case.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 03:00:43 -05:00


# Phase Tracker
## Phase 0: Bootstrap ✅
- [x] Cargo workspace with all crate stubs compiling
- [x] `shared` crate: error types, ObjectRef, DatasetId
- [x] `gateway` with Axum: GET /health → 200
- [x] tracing + tracing-subscriber wired in gateway
- [x] justfile with build, test, run recipes
- [x] docs committed to git
## Phase 1: Storage + Catalog ✅
- [x] storaged: object_store backend init (LocalFileSystem)
- [x] storaged: Axum endpoints (PUT/GET/DELETE/LIST)
- [x] shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
- [x] catalogd/registry.rs: in-memory index + manifest persistence
- [x] catalogd service: POST/GET /datasets + by-name
- [x] gateway routes wired
## Phase 2: Query Engine ✅
- [x] queryd: SessionContext + object_store config
- [x] queryd: ListingTable from catalog ObjectRefs
- [x] queryd service: POST /query/sql → JSON
- [x] queryd → catalogd wiring
- [x] gateway routes /query
## Phase 3: AI Integration ✅
- [x] Python sidecar: FastAPI + Ollama (embed/generate/rerank)
- [x] Dockerfile for sidecar
- [x] aibridge/client.rs: HTTP client
- [x] aibridge service: Axum proxy endpoints
- [x] Model config via env vars
## Phase 4: Frontend ✅
- [x] Dioxus scaffold, WASM build
- [x] Ask tab: natural language → AI SQL → results
- [x] Explore tab: dataset browser + AI summary
- [x] SQL tab: raw DataFusion editor
- [x] System tab: health checks for all services
## Phase 5: Hardening ✅
- [x] Proto definitions (lakehouse.proto)
- [x] Internal gRPC: CatalogService on :3101
- [x] OpenTelemetry tracing: stdout exporter
- [x] Auth middleware: X-API-Key (toggleable)
- [x] Config-driven startup: lakehouse.toml
## Phase 6: Ingest Pipeline ✅
- [x] CSV ingest with auto schema detection
- [x] JSON ingest (array + newline-delimited, nested flattening)
- [x] PDF text extraction (lopdf)
- [x] Text/SMS file ingest
- [x] Content hash dedup (SHA-256)
- [x] POST /ingest/file multipart upload
- [x] 12 unit tests
## Phase 7: Vector Index + RAG ✅
- [x] chunker: configurable size + overlap, sentence-boundary aware
- [x] store: embeddings as Parquet (binary f32 vectors)
- [x] search: brute-force cosine similarity (sketched after this list)
- [x] rag: embed → search → retrieve → LLM answer with citations
- [x] POST /vectors/index, /search, /rag
- [x] Background job system with progress tracking
- [x] Dual-pipeline supervisor with checkpointing + retry
- [x] 100K embeddings: 177/sec on A4000, zero failures
- [x] 6 unit tests
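The brute-force search in this phase is a linear scan scored by cosine similarity; a self-contained sketch of that scoring (the Parquet-backed vector store is not shown):

```rust
use std::cmp::Ordering;

/// Cosine similarity between two equal-length f32 vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Score every stored vector against the query and keep the top_k matches.
fn brute_force_search(
    query: &[f32],
    stored: &[(String, Vec<f32>)],
    top_k: usize,
) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = stored
        .iter()
        .map(|(id, vec)| (id.clone(), cosine(query, vec)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(Ordering::Equal));
    scored.truncate(top_k);
    scored
}
```

This linear scan is the 44-54ms baseline at 100K vectors that the Phase 15+ HNSW trials are measured against.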
## Phase 8: Hot Cache + Incremental Updates ✅
- [x] MemTable hot cache: LRU, configurable max (16GB)
- [x] POST /query/cache/pin, /cache/evict, GET /cache/stats
- [x] Delta store: append-only delta Parquet files
- [x] Merge-on-read: queries combine base + deltas
- [x] Compaction: POST /query/compact
- [x] Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)
## Phase 8.5: Agent Workspaces ✅
- [x] WorkspaceManager with daily/weekly/monthly/pinned tiers
- [x] Saved searches, shortlists, activity logs per workspace
- [x] Instant zero-copy handoff between agents
- [x] Persistence to object storage, rebuild on startup
## Phase 9: Event Journal ✅
- [x] journald crate: append-only mutation log
- [x] Event schema: entity, field, old/new value, actor, source, workspace
- [x] In-memory buffer with auto-flush to Parquet
- [x] GET /journal/history/{entity_id}, /recent, /stats
- [x] POST /journal/event, /update, /flush
## Phase 10: Rich Catalog v2 ✅
- [x] DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
- [x] PII auto-detection: email, phone, SSN, salary, address, medical (sketch after this list)
- [x] Column-level metadata with sensitivity flags
- [x] Lineage tracking: source_system → ingest_job → dataset
- [x] PATCH /catalog/datasets/by-name/{name}/metadata
- [x] Backward compatible (serde default)
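PII auto-detection boils down to column-name and value pattern checks; a rough sketch with the regex crate covering three of the categories (the production rules, including salary, address, and medical, are broader, and these patterns are illustrative only):

```rust
use regex::Regex;

/// Categories this phase flags; the sketch covers three of them.
#[derive(Debug, PartialEq)]
enum PiiKind {
    Email,
    Phone,
    Ssn,
}

/// Guess a PII kind from a column name plus a sample value.
fn detect_pii(column: &str, sample: &str) -> Option<PiiKind> {
    let name = column.to_lowercase();
    let email = Regex::new(r"^[^@\s]+@[^@\s]+\.[^@\s]+$").unwrap();
    let ssn = Regex::new(r"^\d{3}-\d{2}-\d{4}$").unwrap();
    let phone = Regex::new(r"^\+?[0-9() -]{7,}$").unwrap();

    if name.contains("email") || email.is_match(sample) {
        Some(PiiKind::Email)
    } else if name.contains("ssn") || ssn.is_match(sample) {
        Some(PiiKind::Ssn)
    } else if name.contains("phone") || phone.is_match(sample) {
        Some(PiiKind::Phone)
    } else {
        None
    }
}
```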
## Phase 11: Embedding Versioning ✅
- [x] IndexRegistry: model_name, model_version, dimensions per index
- [x] Index metadata persisted as JSON, rebuilt on startup
- [x] GET /vectors/indexes — list all (filter by source/model)
- [x] GET /vectors/indexes/{name} — metadata
- [x] Background jobs auto-register metadata on completion
## Phase 12: Tool Registry ✅
- [x] 6 built-in staffing tools (search_candidates, get_candidate, revenue_by_client, recruiter_performance, cold_leads, open_jobs)
- [x] Parameter validation + SQL template substitution
- [x] Permission levels: read / write / admin
- [x] Full audit trail per invocation
- [x] GET /tools, GET /tools/{name}, POST /tools/{name}/call, GET /tools/audit
## Phase 13: Security & Access Control ✅
- [x] Role-based access: admin, recruiter, analyst, agent
- [x] Field-level sensitivity enforcement
- [x] Column masking determination per agent
- [x] Query audit logging
- [x] GET/POST /access/roles, GET /access/audit, POST /access/check
## Phase 14: Schema Evolution ✅
- [x] Schema diff detection (added, removed, type changed, renamed)
- [x] Fuzzy rename detection (shared word parts; sketched after this list)
- [x] Auto-generated migration rules with confidence scores
- [x] AI migration prompt builder for complex cases
- [x] 5 unit tests
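Fuzzy rename detection by shared word parts can be sketched as token overlap between the removed and added column names; splitting on underscores and scoring by overlap are assumptions about how the real rule works:

```rust
use std::collections::HashSet;

/// Split a snake_case column name into lowercase word parts.
fn parts(name: &str) -> HashSet<String> {
    name.to_lowercase()
        .split('_')
        .filter(|p| !p.is_empty())
        .map(|p| p.to_string())
        .collect()
}

/// Confidence that `removed` was renamed to `added`, based on shared parts.
fn rename_confidence(removed: &str, added: &str) -> f32 {
    let a = parts(removed);
    let b = parts(added);
    if a.is_empty() || b.is_empty() {
        return 0.0;
    }
    let shared = a.intersection(&b).count() as f32;
    let total = a.union(&b).count() as f32;
    shared / total
}

// e.g. rename_confidence("candidate_email", "email_address") > 0.0, so that
// pair would surface as a rename candidate with a non-zero confidence score.
```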
## Phase 15+: Horizon
- [x] HNSW vector index with iteration-friendly trial system (2026-04-16)
- `HnswStore.build_index_with_config` — parameterized ef_construction, ef_search, seed
- `EmbeddingCache` — pins 100K vectors in memory, shared across trials
- `harness::EvalSet` — named query sets with brute-force ground truth
- `TrialJournal` — append-only JSONL at `_hnsw_trials/{index}.jsonl`
- Endpoints: `/vectors/hnsw/trial`, `/hnsw/trials/{idx}`, `/hnsw/trials/{idx}/best?metric={recall|latency|pareto}`, `/hnsw/evals`, `/hnsw/evals/{name}/autogen`, `/hnsw/cache/stats`
- Measured on 100K resumes: **brute-force 44-54ms → HNSW 509us-1830us**, recall 0.92-1.00 depending on `ef_construction`. Sweet spot: ec=80 es=30 → p50=873us recall=1.00 — locked in as `HnswConfig::default()`
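The locked-in default could be captured as a config struct like this; the field names follow the trial parameters above, while the struct shape and the seed default are assumptions:

```rust
/// Tunables exercised by build_index_with_config trials.
#[derive(Clone, Copy, Debug)]
pub struct HnswConfig {
    pub ef_construction: usize,
    pub ef_search: usize,
    pub seed: u64,
}

impl Default for HnswConfig {
    /// Sweet spot from the 100K-resume sweep: ec=80, es=30 gave
    /// p50 ~873us at recall 1.00.
    fn default() -> Self {
        Self { ef_construction: 80, ef_search: 30, seed: 0 }
    }
}
```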
- [x] Catalog manifest repair — `POST /catalog/resync-missing` restores row_count and columns from parquet footers (2026-04-16). All 7 staffing tables recovered to PRD-matching 2.47M rows.
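Recovering row_count from a Parquet footer needs only the footer metadata, no data pages; a sketch with the parquet crate (column recovery from the footer schema and the catalog write-back are elided):

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

/// Read the row count recorded in a Parquet file's footer without
/// scanning any data pages.
fn footer_row_count(path: &str) -> parquet::errors::Result<i64> {
    let file = File::open(path)?;
    let reader = SerializedFileReader::new(file)?;
    Ok(reader.metadata().file_metadata().num_rows())
}
```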
- [~] Federated multi-bucket query — **foundation complete 2026-04-16**, see ADR-017
- [x] `StorageConfig.buckets` + `rescue_bucket` + `profile_root` config shape
- [x] `SecretsProvider` trait + `FileSecretsProvider` (reads /etc/lakehouse/secrets.toml, checks 0600 perms)
- [x] `storaged::BucketRegistry` — multi-backend, rescue-aware, reachability probes
- [x] `storaged::error_journal::ErrorJournal` — append-only JSONL at `primary://_errors/bucket_errors.jsonl`
- [x] Endpoints: `GET /storage/buckets`, `GET /storage/errors`, `GET /storage/bucket-health`
- [x] Bucket-aware I/O: `PUT/GET /storage/buckets/{bucket}/objects/{*key}` with rescue fallback + `X-Lakehouse-Rescue-Used` observability headers (fallback sketched after this list)
- [x] Backward compat: empty `[[storage.buckets]]` synthesizes a `primary` from legacy `root`
- [x] Three-bucket test (primary + rescue + testing) verified: normal reads, rescue fallback with headers, hard-fail missing, write to unknown bucket 503, error journal + health summary
- [ ] `X-Lakehouse-Bucket` header middleware on ingest/query/catalog endpoints
- [ ] Catalog migration: set `bucket = "primary"` on every legacy ObjectRef
- [ ] `queryd` registers every bucket with DataFusion for cross-bucket SQL
- [ ] Profile hot-load endpoints: `POST /profile/{user}/activate|deactivate`
- [ ] `vectord` bucket-scoped paths (trial journals, eval sets per-bucket)
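The rescue fallback on reads is a try-primary-then-rescue pattern; a rough sketch against the object_store crate, with the ErrorJournal write and the header plumbing elided (the function shape here is an assumption, not the BucketRegistry API):

```rust
use std::sync::Arc;

use object_store::{path::Path, ObjectStore};

/// Read `key` from the requested bucket, falling back to the rescue bucket.
/// Returns the bytes plus a flag the handler can surface as
/// X-Lakehouse-Rescue-Used.
async fn get_with_rescue(
    primary: Arc<dyn ObjectStore>,
    rescue: Arc<dyn ObjectStore>,
    key: &Path,
) -> object_store::Result<(bytes::Bytes, bool)> {
    match primary.get(key).await {
        Ok(result) => Ok((result.bytes().await?, false)),
        Err(primary_err) => {
            // Primary failed: record the error (journal write elided) and
            // try the rescue bucket before hard-failing with the original error.
            match rescue.get(key).await {
                Ok(result) => Ok((result.bytes().await?, true)),
                Err(_) => Err(primary_err),
            }
        }
    }
}
```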
- [x] Database connector ingest (Postgres first) — 2026-04-16
- `pg_stream::stream_table_to_parquet` — ORDER BY + LIMIT/OFFSET pagination, configurable batch_size (paging loop sketched after this list)
- `parse_dsn` — postgresql:// and postgres:// URL scheme, user/password/host/port/db
- `POST /ingest/db` endpoint: `{dsn, table, dataset_name?, batch_size?, order_by?, limit?}` → streams to Parquet, registers in catalog with PII detection + redacted-password lineage
- Existing `POST /ingest/postgres/import` (structured config) preserved alongside
- 4 DSN-parser unit tests + live end-to-end test against `knowledge_base.team_runs` (586 rows, 13 cols, 6 batches, 196ms)
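The paging loop behind stream_table_to_parquet is plain ORDER BY + LIMIT/OFFSET; a sketch with tokio-postgres that stops on the first short page (the Arrow/Parquet conversion and DSN handling are left out, and identifiers are assumed pre-validated):

```rust
use tokio_postgres::Client;

/// Page through `table` in a stable order, handing each page to `write_batch`.
/// Returns the total number of rows streamed.
async fn stream_pages(
    client: &Client,
    table: &str,
    order_by: &str,
    batch_size: i64,
    mut write_batch: impl FnMut(&[tokio_postgres::Row]),
) -> Result<u64, tokio_postgres::Error> {
    let mut offset: i64 = 0;
    let mut total: u64 = 0;
    loop {
        // Stable ordering makes LIMIT/OFFSET pages non-overlapping.
        let sql = format!(
            "SELECT * FROM {table} ORDER BY {order_by} LIMIT {batch_size} OFFSET {offset}"
        );
        let rows = client.query(sql.as_str(), &[]).await?;
        if rows.is_empty() {
            break;
        }
        total += rows.len() as u64;
        write_batch(rows.as_slice());
        if (rows.len() as i64) < batch_size {
            break; // short page: nothing left to fetch
        }
        offset += batch_size;
    }
    Ok(total)
}
```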
- [x] Phase B: Lance storage evaluation — 2026-04-16
- `crates/lance-bench` standalone pilot (Lance 4.0) avoids DataFusion/Arrow version conflict with main stack
- 8-dimension benchmark on resumes_100k_v2 — see docs/ADR-019-vector-storage.md for scorecard
- Decision: hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance added as per-profile second backend for random access (112× faster), append (0.08s vs full rewrite), hot-swap (14× faster index builds), and scale past 5M RAM ceiling.
- [x] Phase C: Decoupled embedding refresh — 2026-04-16
- `DatasetManifest`: `last_embedded_at`, `embedding_stale_since`, `embedding_refresh_policy` (Manual | OnAppend | Scheduled)
- `Registry::mark_embeddings_stale` / `clear_embeddings_stale` / `stale_datasets`
- Ingest paths (CSV pipeline + Postgres streaming) auto-mark-stale when writing to an already-embedded dataset
- `vectord::refresh::refresh_index` — reads dataset, diffs doc_ids vs existing embeddings, embeds only new rows, writes combined index, clears stale
- `POST /vectors/refresh/{dataset}` + `GET /vectors/stale`
- Id columns accept `Utf8`, `Int32`, `Int64`
- End-to-end on threat_intel: initial 20-row embed 2.1s; re-ingest to 54 rows auto-marks stale; delta refresh embeds only 34 new in 970ms (6× faster than full re-embed); stale cleared
- [ ] Database connector ingest (MySQL; Postgres landed above)
- [ ] PDF OCR (Tesseract)
- [ ] Scheduled ingest (cron)
- [ ] Fine-tuned domain models
- [ ] Multi-node query distribution
---
**30 unit tests | 11 crates | 16 ADRs | 2.47M rows | 100K vectors | All built 2026-03-27**
**HNSW trial system: 2026-04-16**