lakehouse/docs/PHASES.md
root 24f1249a62 Federation layer 2: header routing + cross-bucket SQL
Three pieces of the multi-bucket federation made real:

1. Catalog migration (POST /catalog/migrate-buckets)
   - One-shot normalizer for ObjectRef.bucket field
   - Empty -> "primary"; legacy "data"/"local" -> "primary"
   - Idempotent; re-running on canonical state is no-op
   - Ran on existing catalog: 12 refs renamed from "data", 2 already
     "primary", all 14 now canonical

2. X-Lakehouse-Bucket header middleware on ingest
   - resolve_bucket() helper extracts header, returns
     (bucket_name, store) or 404 with valid bucket list
   - ingest_file and ingest_db_stream now route writes per-request
   - Defaults to "primary" when header absent
   - pipeline::ingest_file_to_bucket records the actual bucket on the
     ObjectRef so catalog stays the source of truth for "where does this
     data live"
   - Verified: ingest with X-Lakehouse-Bucket: testing lands in
     data/_testing/, ingest without header lands in data/, bad header
     returns 404 with hint

3. queryd registers every bucket with DataFusion
   - QueryEngine now holds Arc<BucketRegistry> instead of single store
   - build_context iterates all buckets, registers each as a separate
     ObjectStore under URL scheme "lakehouse-{bucket}://"
   - ListingTable URLs include the per-object bucket scheme so
     DataFusion routes scans automatically based on ObjectRef.bucket
   - Profile bucket names like "profile:user" sanitized to
     "lakehouse-profile-user" since URL host segments can't contain ":"
   - Tolerant of duplicate manifest entries (pre-existing
     pipeline::ingest_file behavior creates a fresh dataset id per
     ingest); duplicates skipped with debug log
   - Backward compat: legacy "lakehouse://data/" URL still registered
     pointing at primary

Success gate: cross-bucket CROSS JOIN
  SELECT p.name, p.role, a.species
  FROM people_test p          (bucket: testing)
  CROSS JOIN animals a        (bucket: primary)
  LIMIT 5
returns rows correctly. DataFusion routed each scan to its bucket's
ObjectStore based on the URL scheme.

No regressions: SELECT COUNT(*) FROM candidates still returns 100000
from the primary bucket.

Deferred to Phase 17:
- POST /profile/{user}/activate (HNSW hot-load on profile switch)
- vectord storage paths becoming bucket-scoped (trial journals,
  eval sets per-profile)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 08:52:32 -05:00

175 lines
9.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Phase Tracker
## Phase 0: Bootstrap ✅
- [x] Cargo workspace with all crate stubs compiling
- [x] `shared` crate: error types, ObjectRef, DatasetId
- [x] `gateway` with Axum: GET /health → 200
- [x] tracing + tracing-subscriber wired in gateway
- [x] justfile with build, test, run recipes
- [x] docs committed to git
## Phase 1: Storage + Catalog ✅
- [x] storaged: object_store backend init (LocalFileSystem)
- [x] storaged: Axum endpoints (PUT/GET/DELETE/LIST)
- [x] shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
- [x] catalogd/registry.rs: in-memory index + manifest persistence
- [x] catalogd service: POST/GET /datasets + by-name
- [x] gateway routes wired
## Phase 2: Query Engine ✅
- [x] queryd: SessionContext + object_store config
- [x] queryd: ListingTable from catalog ObjectRefs
- [x] queryd service: POST /query/sql → JSON
- [x] queryd → catalogd wiring
- [x] gateway routes /query
## Phase 3: AI Integration ✅
- [x] Python sidecar: FastAPI + Ollama (embed/generate/rerank)
- [x] Dockerfile for sidecar
- [x] aibridge/client.rs: HTTP client
- [x] aibridge service: Axum proxy endpoints
- [x] Model config via env vars
## Phase 4: Frontend ✅
- [x] Dioxus scaffold, WASM build
- [x] Ask tab: natural language → AI SQL → results
- [x] Explore tab: dataset browser + AI summary
- [x] SQL tab: raw DataFusion editor
- [x] System tab: health checks for all services
## Phase 5: Hardening ✅
- [x] Proto definitions (lakehouse.proto)
- [x] Internal gRPC: CatalogService on :3101
- [x] OpenTelemetry tracing: stdout exporter
- [x] Auth middleware: X-API-Key (toggleable)
- [x] Config-driven startup: lakehouse.toml
## Phase 6: Ingest Pipeline ✅
- [x] CSV ingest with auto schema detection
- [x] JSON ingest (array + newline-delimited, nested flattening)
- [x] PDF text extraction (lopdf)
- [x] Text/SMS file ingest
- [x] Content hash dedup (SHA-256)
- [x] POST /ingest/file multipart upload
- [x] 12 unit tests
## Phase 7: Vector Index + RAG ✅
- [x] chunker: configurable size + overlap, sentence-boundary aware
- [x] store: embeddings as Parquet (binary f32 vectors)
- [x] search: brute-force cosine similarity
- [x] rag: embed → search → retrieve → LLM answer with citations
- [x] POST /vectors/index, /search, /rag
- [x] Background job system with progress tracking
- [x] Dual-pipeline supervisor with checkpointing + retry
- [x] 100K embeddings: 177/sec on A4000, zero failures
- [x] 6 unit tests
## Phase 8: Hot Cache + Incremental Updates ✅
- [x] MemTable hot cache: LRU, configurable max (16GB)
- [x] POST /query/cache/pin, /cache/evict, GET /cache/stats
- [x] Delta store: append-only delta Parquet files
- [x] Merge-on-read: queries combine base + deltas
- [x] Compaction: POST /query/compact
- [x] Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)
## Phase 8.5: Agent Workspaces ✅
- [x] WorkspaceManager with daily/weekly/monthly/pinned tiers
- [x] Saved searches, shortlists, activity logs per workspace
- [x] Instant zero-copy handoff between agents
- [x] Persistence to object storage, rebuild on startup
## Phase 9: Event Journal ✅
- [x] journald crate: append-only mutation log
- [x] Event schema: entity, field, old/new value, actor, source, workspace
- [x] In-memory buffer with auto-flush to Parquet
- [x] GET /journal/history/{entity_id}, /recent, /stats
- [x] POST /journal/event, /update, /flush
## Phase 10: Rich Catalog v2 ✅
- [x] DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
- [x] PII auto-detection: email, phone, SSN, salary, address, medical
- [x] Column-level metadata with sensitivity flags
- [x] Lineage tracking: source_system → ingest_job → dataset
- [x] PATCH /catalog/datasets/by-name/{name}/metadata
- [x] Backward compatible (serde default)
## Phase 11: Embedding Versioning ✅
- [x] IndexRegistry: model_name, model_version, dimensions per index
- [x] Index metadata persisted as JSON, rebuilt on startup
- [x] GET /vectors/indexes — list all (filter by source/model)
- [x] GET /vectors/indexes/{name} — metadata
- [x] Background jobs auto-register metadata on completion
## Phase 12: Tool Registry ✅
- [x] 6 built-in staffing tools (search_candidates, get_candidate, revenue_by_client, recruiter_performance, cold_leads, open_jobs)
- [x] Parameter validation + SQL template substitution
- [x] Permission levels: read / write / admin
- [x] Full audit trail per invocation
- [x] GET /tools, GET /tools/{name}, POST /tools/{name}/call, GET /tools/audit
## Phase 13: Security & Access Control ✅
- [x] Role-based access: admin, recruiter, analyst, agent
- [x] Field-level sensitivity enforcement
- [x] Column masking determination per agent
- [x] Query audit logging
- [x] GET/POST /access/roles, GET /access/audit, POST /access/check
## Phase 14: Schema Evolution ✅
- [x] Schema diff detection (added, removed, type changed, renamed)
- [x] Fuzzy rename detection (shared word parts)
- [x] Auto-generated migration rules with confidence scores
- [x] AI migration prompt builder for complex cases
- [x] 5 unit tests
## Phase 15+: Horizon
- [x] HNSW vector index with iteration-friendly trial system (2026-04-16)
- `HnswStore.build_index_with_config` — parameterized ef_construction, ef_search, seed
- `EmbeddingCache` — pins 100K vectors in memory, shared across trials
- `harness::EvalSet` — named query sets with brute-force ground truth
- `TrialJournal` — append-only JSONL at `_hnsw_trials/{index}.jsonl`
- Endpoints: `/vectors/hnsw/trial`, `/hnsw/trials/{idx}`, `/hnsw/trials/{idx}/best?metric={recall|latency|pareto}`, `/hnsw/evals`, `/hnsw/evals/{name}/autogen`, `/hnsw/cache/stats`
- Measured on 100K resumes: **brute-force 44-54ms → HNSW 509us-1830us**, recall 0.92-1.00 depending on `ef_construction`. Sweet spot: ec=80 es=30 → p50=873us recall=1.00 — locked in as `HnswConfig::default()`
- [x] Catalog manifest repair — `POST /catalog/resync-missing` restores row_count and columns from parquet footers (2026-04-16). All 7 staffing tables recovered to PRD-matching 2.47M rows.
- [~] Federated multi-bucket query — **foundation complete 2026-04-16**, see ADR-017
- [x] `StorageConfig.buckets` + `rescue_bucket` + `profile_root` config shape
- [x] `SecretsProvider` trait + `FileSecretsProvider` (reads /etc/lakehouse/secrets.toml, checks 0600 perms)
- [x] `storaged::BucketRegistry` — multi-backend, rescue-aware, reachability probes
- [x] `storaged::error_journal::ErrorJournal` — append-only JSONL at `primary://_errors/bucket_errors.jsonl`
- [x] Endpoints: `GET /storage/buckets`, `GET /storage/errors`, `GET /storage/bucket-health`
- [x] Bucket-aware I/O: `PUT/GET /storage/buckets/{bucket}/objects/{*key}` with rescue fallback + `X-Lakehouse-Rescue-Used` observability headers
- [x] Backward compat: empty `[[storage.buckets]]` synthesizes a `primary` from legacy `root`
- [x] Three-bucket test (primary + rescue + testing) verified: normal reads, rescue fallback with headers, hard-fail missing, write to unknown bucket 503, error journal + health summary
- [x] `X-Lakehouse-Bucket` header middleware on ingest endpoints (2026-04-16)
- [x] Catalog migration: `POST /catalog/migrate-buckets` stamps `bucket = "primary"` on legacy refs (12 renamed, 14 total now canonical)
- [x] `queryd` registers every bucket with DataFusion for cross-bucket SQL — verified with people_test (testing) × animals (primary) CROSS JOIN
- [ ] Profile hot-load endpoints: `POST /profile/{user}/activate|deactivate` (deferred to Phase 17)
- [ ] `vectord` bucket-scoped paths (trial journals, eval sets per-bucket) (deferred to Phase 17)
- [x] Database connector ingest (Postgres first) — 2026-04-16
- `pg_stream::stream_table_to_parquet` — ORDER BY + LIMIT/OFFSET pagination, configurable batch_size
- `parse_dsn` — postgresql:// and postgres:// URL scheme, user/password/host/port/db
- `POST /ingest/db` endpoint: `{dsn, table, dataset_name?, batch_size?, order_by?, limit?}` → streams to Parquet, registers in catalog with PII detection + redacted-password lineage
- Existing `POST /ingest/postgres/import` (structured config) preserved alongside
- 4 DSN-parser unit tests + live end-to-end test against `knowledge_base.team_runs` (586 rows, 13 cols, 6 batches, 196ms)
- [x] Phase B: Lance storage evaluation — 2026-04-16
- `crates/lance-bench` standalone pilot (Lance 4.0) avoids DataFusion/Arrow version conflict with main stack
- 8-dimension benchmark on resumes_100k_v2 — see docs/ADR-019-vector-storage.md for scorecard
- Decision: hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance added as per-profile second backend for random access (112× faster), append (0.08s vs full rewrite), hot-swap (14× faster index builds), and scale past 5M RAM ceiling.
- [x] Phase C: Decoupled embedding refresh — 2026-04-16
- `DatasetManifest`: `last_embedded_at`, `embedding_stale_since`, `embedding_refresh_policy` (Manual | OnAppend | Scheduled)
- `Registry::mark_embeddings_stale` / `clear_embeddings_stale` / `stale_datasets`
- Ingest paths (CSV pipeline + Postgres streaming) auto-mark-stale when writing to an already-embedded dataset
- `vectord::refresh::refresh_index` — reads dataset, diffs doc_ids vs existing embeddings, embeds only new rows, writes combined index, clears stale
- `POST /vectors/refresh/{dataset}` + `GET /vectors/stale`
- Id columns accept `Utf8`, `Int32`, `Int64`
- End-to-end on threat_intel: initial 20-row embed 2.1s; re-ingest to 54 rows auto-marks stale; delta refresh embeds only 34 new in 970ms (6× faster than full re-embed); stale cleared
- [ ] Database connector ingest (Postgres/MySQL)
- [ ] PDF OCR (Tesseract)
- [ ] Scheduled ingest (cron)
- [ ] Fine-tuned domain models
- [ ] Multi-node query distribution
---
**30 unit tests | 11 crates | 16 ADRs | 2.47M rows | 100K vectors | All built 2026-03-27**
**HNSW trial system: 2026-04-16**