# Phase Tracker

## Phase 0: Bootstrap ✅
- [x] Cargo workspace with all crate stubs compiling
- [x] `shared` crate: error types, ObjectRef, DatasetId
- [x] `gateway` with Axum: GET /health → 200
- [x] tracing + tracing-subscriber wired in gateway
- [x] justfile with build, test, run recipes
- [x] docs committed to git

## Phase 1: Storage + Catalog ✅
- [x] storaged: object_store backend init (LocalFileSystem)
- [x] storaged: Axum endpoints (PUT/GET/DELETE/LIST)
- [x] shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
- [x] catalogd/registry.rs: in-memory index + manifest persistence
- [x] catalogd service: POST/GET /datasets + by-name
- [x] gateway routes wired

## Phase 2: Query Engine ✅
- [x] queryd: SessionContext + object_store config
- [x] queryd: ListingTable from catalog ObjectRefs
- [x] queryd service: POST /query/sql → JSON
- [x] queryd → catalogd wiring
- [x] gateway routes /query

## Phase 3: AI Integration ✅
- [x] Python sidecar: FastAPI + Ollama (embed/generate/rerank)
- [x] Dockerfile for sidecar
- [x] aibridge/client.rs: HTTP client
- [x] aibridge service: Axum proxy endpoints
- [x] Model config via env vars

## Phase 4: Frontend ✅
- [x] Dioxus scaffold, WASM build
- [x] Ask tab: natural language → AI SQL → results
- [x] Explore tab: dataset browser + AI summary
- [x] SQL tab: raw DataFusion editor
- [x] System tab: health checks for all services

## Phase 5: Hardening ✅
- [x] Proto definitions (lakehouse.proto)
- [x] Internal gRPC: CatalogService on :3101
- [x] OpenTelemetry tracing: stdout exporter
- [x] Auth middleware: X-API-Key (toggleable)
- [x] Config-driven startup: lakehouse.toml

## Phase 6: Ingest Pipeline ✅
- [x] CSV ingest with auto schema detection
- [x] JSON ingest (array + newline-delimited, nested flattening)
- [x] PDF text extraction (lopdf)
- [x] Text/SMS file ingest
- [x] Content hash dedup (SHA-256)
- [x] POST /ingest/file multipart upload
- [x] 12 unit tests

## Phase 7: Vector Index + RAG ✅
- [x] chunker: configurable size + overlap, sentence-boundary aware
- [x] store: embeddings as Parquet (binary f32 vectors)
- [x] search: brute-force cosine similarity
- [x] rag: embed → search → retrieve → LLM answer with citations
- [x] POST /vectors/index, /search, /rag
- [x] Background job system with progress tracking
- [x] Dual-pipeline supervisor with checkpointing + retry
- [x] 100K embeddings: 177/sec on A4000, zero failures
- [x] 6 unit tests

## Phase 8: Hot Cache + Incremental Updates ✅
- [x] MemTable hot cache: LRU, configurable max (16GB)
- [x] POST /query/cache/pin, /cache/evict, GET /cache/stats
- [x] Delta store: append-only delta Parquet files
- [x] Merge-on-read: queries combine base + deltas
- [x] Compaction: POST /query/compact
- [x] Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)

## Phase 8.5: Agent Workspaces ✅
- [x] WorkspaceManager with daily/weekly/monthly/pinned tiers
- [x] Saved searches, shortlists, activity logs per workspace
- [x] Instant zero-copy handoff between agents
- [x] Persistence to object storage, rebuild on startup

## Phase 9: Event Journal ✅
- [x] journald crate: append-only mutation log
- [x] Event schema: entity, field, old/new value, actor, source, workspace
- [x] In-memory buffer with auto-flush to Parquet
- [x] GET /journal/history/{entity_id}, /recent, /stats
- [x] POST /journal/event, /update, /flush

## Phase 10: Rich Catalog v2 ✅
- [x] DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
- [x] PII auto-detection: email, phone, SSN, salary, address, medical
- [x] Column-level metadata with sensitivity flags
- [x] Lineage tracking: source_system → ingest_job → dataset
- [x] PATCH /catalog/datasets/by-name/{name}/metadata
- [x] Backward compatible (serde default)

## Phase 11: Embedding Versioning ✅
- [x] IndexRegistry: model_name, model_version, dimensions per index
- [x] Index metadata persisted as JSON, rebuilt on startup
- [x] GET /vectors/indexes — list all (filter by source/model)
- [x] GET /vectors/indexes/{name} — metadata
- [x] Background jobs auto-register metadata on completion

## Phase 12: Tool Registry ✅
- [x] 6 built-in staffing tools (search_candidates, get_candidate, revenue_by_client, recruiter_performance, cold_leads, open_jobs)
- [x] Parameter validation + SQL template substitution
- [x] Permission levels: read / write / admin
- [x] Full audit trail per invocation
- [x] GET /tools, GET /tools/{name}, POST /tools/{name}/call, GET /tools/audit

## Phase 13: Security & Access Control ✅
- [x] Role-based access: admin, recruiter, analyst, agent
- [x] Field-level sensitivity enforcement
- [x] Column masking determination per agent
- [x] Query audit logging
- [x] GET/POST /access/roles, GET /access/audit, POST /access/check

## Phase 14: Schema Evolution ✅
- [x] Schema diff detection (added, removed, type changed, renamed)
- [x] Fuzzy rename detection (shared word parts)
- [x] Auto-generated migration rules with confidence scores
- [x] AI migration prompt builder for complex cases
- [x] 5 unit tests

## Phase 15+: Horizon
- [x] HNSW vector index with iteration-friendly trial system (2026-04-16)
  - `HnswStore.build_index_with_config` — parameterized ef_construction, ef_search, seed
  - `EmbeddingCache` — pins 100K vectors in memory, shared across trials
  - `harness::EvalSet` — named query sets with brute-force ground truth
  - `TrialJournal` — append-only JSONL at `_hnsw_trials/{index}.jsonl`
  - Endpoints: `/vectors/hnsw/trial`, `/hnsw/trials/{idx}`, `/hnsw/trials/{idx}/best?metric={recall|latency|pareto}`, `/hnsw/evals`, `/hnsw/evals/{name}/autogen`, `/hnsw/cache/stats`
  - Measured on 100K resumes: **brute-force 44-54ms → HNSW 509us-1830us**, recall 0.92-1.00 depending on `ef_construction`. Sweet spot: ec=80 es=30 → p50=873us recall=1.00 — locked in as `HnswConfig::default()`
- [x] Catalog manifest repair — `POST /catalog/resync-missing` restores row_count and columns from parquet footers (2026-04-16).
  All 7 staffing tables recovered to PRD-matching 2.47M rows.
- [~] Federated multi-bucket query — **foundation complete 2026-04-16**, see ADR-017
  - [x] `StorageConfig.buckets` + `rescue_bucket` + `profile_root` config shape
  - [x] `SecretsProvider` trait + `FileSecretsProvider` (reads /etc/lakehouse/secrets.toml, checks 0600 perms)
  - [x] `storaged::BucketRegistry` — multi-backend, rescue-aware, reachability probes
  - [x] `storaged::error_journal::ErrorJournal` — append-only JSONL at `primary://_errors/bucket_errors.jsonl`
  - [x] Endpoints: `GET /storage/buckets`, `GET /storage/errors`, `GET /storage/bucket-health`
  - [x] Bucket-aware I/O: `PUT/GET /storage/buckets/{bucket}/objects/{*key}` with rescue fallback + `X-Lakehouse-Rescue-Used` observability headers
  - [x] Backward compat: empty `[[storage.buckets]]` synthesizes a `primary` from legacy `root`
  - [x] Three-bucket test (primary + rescue + testing) verified: normal reads, rescue fallback with headers, hard-fail missing, write to unknown bucket 503, error journal + health summary
  - [x] `X-Lakehouse-Bucket` header middleware on ingest endpoints (2026-04-16)
  - [x] Catalog migration: `POST /catalog/migrate-buckets` stamps `bucket = "primary"` on legacy refs (12 renamed, 14 total now canonical)
  - [x] `queryd` registers every bucket with DataFusion for cross-bucket SQL — verified with people_test (testing) × animals (primary) CROSS JOIN
  - [x] Profile hot-load endpoints: bucket auto-provisioning on `POST /vectors/profile/{id}/activate` (2026-04-17)
  - [x] `vectord` bucket-scoped paths: TrialJournal + PromotionRegistry resolve per-index via IndexMeta.bucket (2026-04-17)
  - [x] Runtime bucket lifecycle: `POST /storage/buckets` (provision) + `DELETE /storage/buckets/{name}` (unregister, refuses primary/rescue) (2026-04-17)
  - [x] ModelProfile.bucket field — per-profile artifact isolation (2026-04-17)
- [x] Database connector ingest (Postgres first) — 2026-04-16
  - `pg_stream::stream_table_to_parquet` — ORDER BY + LIMIT/OFFSET pagination, configurable batch_size
  - `parse_dsn` — postgresql:// and postgres:// URL scheme, user/password/host/port/db
  - `POST /ingest/db` endpoint: `{dsn, table, dataset_name?, batch_size?, order_by?, limit?}` → streams to Parquet, registers in catalog with PII detection + redacted-password lineage
  - Existing `POST /ingest/postgres/import` (structured config) preserved alongside
  - 4 DSN-parser unit tests + live end-to-end test against `knowledge_base.team_runs` (586 rows, 13 cols, 6 batches, 196ms)
- [x] Phase B: Lance storage evaluation — 2026-04-16
  - `crates/lance-bench` standalone pilot (Lance 4.0) avoids DataFusion/Arrow version conflict with main stack
  - 8-dimension benchmark on resumes_100k_v2 — see docs/ADR-019-vector-storage.md for scorecard
  - Decision: hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance added as per-profile second backend for random access (112× faster), append (0.08s vs full rewrite), hot-swap (14× faster index builds), and scale past 5M RAM ceiling.
- [x] Phase E.2 — Compaction integrates tombstones (physical deletion) — 2026-04-16
  - `delta::compact` accepts `tombstones: &[Tombstone]` param, filters rows at merge time via arrow `filter_record_batch`
  - CompactResult gains `tombstones_applied` + `rows_dropped_by_tombstones`
  - Atomic write: ArrowWriter → single Parquet file (fixes latent bug where concatenated Parquet byte streams produced garbage — footer-only-first-segment visible), verify-parse before overwrite, temp_key staging, delete delta files AFTER base write succeeds
  - Snappy compression on output matches ingest defaults (avoids 3× size inflation on every compact)
  - `TombstoneStore::clear` drops all batch files for a dataset; called by queryd after successful compact
  - Query engine exposes `catalog()` accessor so service handler can reach the tombstone store
  - E2E verified on candidates (100K rows): tombstone 3 IDs → compact → 99,997 rows physically in parquet, tombstones empty, IDs gone from `__raw__candidates` too; file size 10.59 MB → 10.72 MB (proportional to data, not inflated)
- [x] Phase 16: Hot-swap generations + autotune agent — 2026-04-16
  - `vectord::promotion::PromotionRegistry` — per-index current config + history at `_hnsw_promotions/{index}.json`, cap 50 history entries
  - Endpoints: `POST /vectors/hnsw/promote/{index}/{trial_id}`, `POST /vectors/hnsw/rollback/{index}`, `GET /vectors/hnsw/promoted/{index}`
  - `vectord::autotune::run_autotune` — grid of trials (configurable or default 5 configs), Pareto winner selection (max recall, then min p50), min_recall safety gate (default 0.9), config bounds (ec ∈ [10,400], es ∈ [10,200])
  - `POST /vectors/hnsw/autotune` — runs the full loop synchronously, journals every trial, auto-promotes winner
  - `activate_profile` uses `promotion_registry.config_or(..., profile_default)` so newly-promoted configs flow automatically into next activation
  - End-to-end: autogen harness for threat_intel_v1 (10 queries), autotune ran 5 trials (all recall=1.00, p50 64-68us), promoted ec=20 es=30 at recall=1.0 p50=64us as winner. Manual promote of ec=80 es=30 pushed autotune pick onto history. Rollback restored autotune winner. Second rollback cleared to None. Re-promote + restart verified persistence. Activation after promotion logged "building HNSW ef_construction=80 ef_search=30 seed=42" — config flowed through correctly.
- [x] Phase 17: Model profiles + scoped search — 2026-04-16
  - `shared::types::ModelProfile` — { id, ollama_name, description, bound_datasets, hnsw_config, embed_model, created_at, created_by }
  - `shared::types::ProfileHnswConfig` — mirror of vectord's HnswConfig to avoid cross-crate dep cycle (defaults ec=80 es=30 matching Phase 15 winner)
  - `Registry::{put_profile, get_profile, list_profiles, delete_profile}` persisted at `_catalog/profiles/{id}.json`, validates bindings exist (raw dataset OR AiView)
  - Endpoints: `POST/GET /catalog/profiles`, `GET/DELETE /catalog/profiles/{id}`
  - `POST /vectors/profile/{id}/activate` — warms EmbeddingCache + builds HNSW with profile's config for every bound dataset's vector index; reports warmed indexes + failures + duration
  - `POST /vectors/profile/{id}/search` — rejects 403 if requested index's source isn't in profile.bound_datasets; falls through to HNSW if warm, brute-force otherwise
  - Fixed refresh to register new index metadata (was silently no-op for first-time indexes)
  - End-to-end: security-analyst profile bound to threat_intel → activate warms 54 vectors in 156ms → within-scope HNSW search works (0.625 score); out-of-scope search for candidates returns 403 with allowed bindings listed
- [x] Phase E: Soft deletes (tombstones) — 2026-04-16
  - `shared::types::Tombstone` — { dataset, row_key_column, row_key_value, deleted_at, actor, reason }
  - `catalogd::tombstones::TombstoneStore` per-dataset append-log at `_catalog/tombstones/{dataset}/`, flush_threshold=1 + explicit flush so every tombstone is durable on return (compliance requirement)
  - All tombstones for a dataset must share the same `row_key_column` (validated at write — query filter is built as a single WHERE NOT IN clause)
  - `Registry::add_tombstone / list_tombstones`
  - Endpoint: `POST /catalog/datasets/by-name/{name}/tombstone` accepting `{row_key_column, row_key_values[], actor, reason}`; companion `GET` lists active tombstones
  - `queryd::context::build_context` wraps tombstoned tables: raw goes to `__raw__{name}`, public name becomes a DataFusion view with `WHERE CAST(col AS VARCHAR) NOT IN (...)` filter
  - End-to-end on candidates: tombstone 3 IDs, COUNT drops 100,000 → 99,997, specific WHERE returns empty, AiView candidates_safe transitively excludes them too, restart preserves all tombstones
  - Limits / not in MVP: physical compaction (Phase 8 doesn't yet read tombstones during merge); journal integration (tombstones don't yet emit Phase 9 mutation events — covered by audit fields on the tombstone itself)
- [x] Phase D: AI-safe views — 2026-04-16
  - `shared::types::AiView` — name, base_dataset, columns whitelist, optional row_filter, column_redactions
  - `shared::types::Redaction` — Null | Hash | Mask { keep_prefix, keep_suffix }
  - `Registry::put_view / get_view / list_views / delete_view` persisted to `_catalog/views/{name}.json`
  - `queryd::context` registers each view as a DataFusion view with the safe projection + filter + redactions baked into the SELECT
  - Endpoints: `POST/GET /catalog/views`, `GET/DELETE /catalog/views/{name}`
  - End-to-end on candidates: `candidates_safe` view exposes 8 of 15 columns, masks `candidate_id` (CAN******01), filters out `status='blocked'`. `SELECT * FROM candidates_safe` returns whitelist only; `SELECT email FROM candidates_safe` fails. View survives restart.
  - Capability surface — raw `candidates` still accessible by name; Phase 13 access control is the layer that enforces who can query what
- [x] Phase C: Decoupled embedding refresh — 2026-04-16
  - `DatasetManifest`: `last_embedded_at`, `embedding_stale_since`, `embedding_refresh_policy` (Manual | OnAppend | Scheduled)
  - `Registry::mark_embeddings_stale` / `clear_embeddings_stale` / `stale_datasets`
  - Ingest paths (CSV pipeline + Postgres streaming) auto-mark-stale when writing to an already-embedded dataset
  - `vectord::refresh::refresh_index` — reads dataset, diffs doc_ids vs existing embeddings, embeds only new rows, writes combined index, clears stale
  - `POST /vectors/refresh/{dataset}` + `GET /vectors/stale`
  - Id columns accept `Utf8`, `Int32`, `Int64`
  - End-to-end on threat_intel: initial 20-row embed 2.1s; re-ingest to 54 rows auto-marks stale; delta refresh embeds only 34 new in 970ms (6× faster than full re-embed); stale cleared
- [x] Phase 16.2/16.5: Background autotune agent + ingest-triggered re-trials — 2026-04-17
  - `vectord::agent` — ε-greedy proposer, rate-limited, cooldown-gated, tokio background task
  - Ingest paths push `DatasetAppended` triggers to agent queue
  - Endpoints: `GET /vectors/agent/status`, `POST /vectors/agent/stop`, `POST /vectors/agent/enqueue/{idx}`
  - `[agent]` config section in lakehouse.toml (enabled, cycle_interval, cooldown, min_recall, max_trials/hr)
  - 3 unit tests
- [x] Phase 17 VRAM gate: Two-profile sequential swap — 2026-04-17
  - Sidecar: `POST /admin/unload` (keep_alive=0), `POST /admin/preload`, `GET /admin/vram` (nvidia-smi + Ollama /api/ps)
  - `AiClient::unload_model / preload_model / vram_snapshot`
  - `VectorState.active_profile` singleton — activate swaps models, deactivate unloads
  - Verified: staffing-recruiter (qwen2.5) ↔ docs-assistant (mistral) — only one model in VRAM at a time
- [x] MySQL streaming connector — 2026-04-17
  - `my_stream.rs` mirrors pg_stream: DSN parsing, OFFSET pagination, Arrow type mapping, Parquet streaming
  - `POST /ingest/mysql` with PII detection, lineage, agent trigger
  - Verified end-to-end on live MariaDB (10 rows, 9 columns, round-tripped all types)
  - 6 DSN + type-mapping unit tests
- [x] Phase 18 hybrid: vectord-lance production crate — 2026-04-17
  - Firewall crate (Arrow 57 / Lance 4, separate from main Arrow 55 / DF 47 stack)
  - Public API: migrate_from_parquet, build_index (IVF_PQ), search, get_by_doc_id, append, build_scalar_index, stats
  - `lance_backend::LanceRegistry` resolves bucket → URI per index
  - `VectorBackend { Parquet | Lance }` enum on ModelProfile + IndexMeta
  - 8 HTTP endpoints under `/vectors/lance/*` (migrate, index, search, doc, append, stats, scalar-index, recall)
  - Profile-driven routing: `POST /vectors/profile/{id}/search` auto-routes to Lance when profile.vector_backend=lance
  - Auto-migrate + auto-index on activation
  - Measured on real 100K × 768d: migrate 0.57s, IVF_PQ build 16.2s (14× faster than HNSW 230s), search 23ms, append 100 rows 3.3ms, doc_id fetch 3.5ms (with scalar btree)
  - IVF_PQ recall@10 = 0.805 (HNSW = 1.000) — measured via `/vectors/lance/recall/{idx}` harness
- [x] Phase E.3: Scheduled ingest — 2026-04-17
  - `ingestd::schedule` module: ScheduleDef, ScheduleStore (JSON at `_schedules/{id}.json`), Scheduler tokio task
  - Supports MySQL + Postgres sources on interval triggers (Cron variant defined, parsing stubbed)
  - 6 CRUD endpoints under `/ingest/schedules/*` + run-now manual trigger
  - Full catalog integration: PII, lineage, mark-stale, agent trigger
  - 6 unit tests
- [x] PDF OCR via Tesseract — 2026-04-17
  - Two-tier: lopdf text extraction → Tesseract 5.5 fallback for scanned/image PDFs
  - Extracts embedded XObject /Image streams, shells to tesseract --oem 3 --psm 6
  - Same schema (source_file, page_number, text_content) — downstream unchanged
- [ ] Fine-tuned domain models
- [ ] Multi-node query distribution

---

**52+ unit tests | 13 crates | 19 ADRs | 2.47M rows | 100K vectors | Hybrid Parquet+HNSW ⊕ Lance**

**Latest: 2026-04-17 — 8 commits shipping Phase 16.2 through Phase 18**
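
The Phase 16 autotune entry describes winner selection as "max recall, then min p50" behind a `min_recall` safety gate (default 0.9). A minimal sketch of that ordering follows. It is not the `vectord::autotune` code: `TrialResult` and `pick_winner` are hypothetical names introduced here for illustration, and only the ordering rule and the gate come from the tracker.

```rust
/// Hypothetical trial record; the field names are illustrative and are not
/// taken from vectord's actual types.
#[derive(Debug, Clone, PartialEq)]
pub struct TrialResult {
    pub ef_construction: u32,
    pub ef_search: u32,
    pub recall: f64,
    pub p50_us: f64,
}

/// Returns the eligible trial with the highest recall, breaking ties by the
/// lowest p50 latency. Trials below `min_recall` are never eligible, so the
/// result is `None` when nothing passes the gate.
pub fn pick_winner(trials: &[TrialResult], min_recall: f64) -> Option<&TrialResult> {
    trials
        .iter()
        // Safety gate: a NaN recall also fails this comparison and is dropped.
        .filter(|t| t.recall >= min_recall)
        .max_by(|a, b| {
            a.recall
                .total_cmp(&b.recall)
                // Equal recall: the lower p50 must rank higher, so the
                // latency comparison is reversed.
                .then_with(|| b.p50_us.total_cmp(&a.p50_us))
        })
}
```

Calling `pick_winner(&trials, 0.9)` applies the documented default gate. Returning `Option` leaves it to the caller to decide what happens when no trial qualifies, for example keeping the currently promoted config.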