lakehouse/docs/PHASES.md
root 3bc82833ac Update PRD + PHASES.md — reflect 8-commit 2026-04-17 push
PRD status line: "Phases 0-18 shipped; hybrid operational; scheduled
ingest live; PDF OCR live; entering horizon items."

PHASES.md: federation L2 items marked complete, Phase 16.2 (autotune
agent), Phase 17 VRAM gate, MySQL connector, Phase 18 (hybrid Lance),
scheduled ingest, PDF OCR all documented with dates and measurements.

Stats updated: 52+ unit tests, 13 crates, 19 ADRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 20:54:05 -05:00

Phase Tracker

Phase 0: Bootstrap

  • Cargo workspace with all crate stubs compiling
  • shared crate: error types, ObjectRef, DatasetId
  • gateway with Axum: GET /health → 200
  • tracing + tracing-subscriber wired in gateway
  • justfile with build, test, run recipes
  • docs committed to git

Phase 1: Storage + Catalog

  • storaged: object_store backend init (LocalFileSystem)
  • storaged: Axum endpoints (PUT/GET/DELETE/LIST)
  • shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
  • catalogd/registry.rs: in-memory index + manifest persistence
  • catalogd service: POST/GET /datasets + by-name
  • gateway routes wired

Phase 2: Query Engine

  • queryd: SessionContext + object_store config
  • queryd: ListingTable from catalog ObjectRefs
  • queryd service: POST /query/sql → JSON
  • queryd → catalogd wiring
  • gateway routes /query

Phase 3: AI Integration

  • Python sidecar: FastAPI + Ollama (embed/generate/rerank)
  • Dockerfile for sidecar
  • aibridge/client.rs: HTTP client
  • aibridge service: Axum proxy endpoints
  • Model config via env vars

Phase 4: Frontend

  • Dioxus scaffold, WASM build
  • Ask tab: natural language → AI SQL → results
  • Explore tab: dataset browser + AI summary
  • SQL tab: raw DataFusion editor
  • System tab: health checks for all services

Phase 5: Hardening

  • Proto definitions (lakehouse.proto)
  • Internal gRPC: CatalogService on :3101
  • OpenTelemetry tracing: stdout exporter
  • Auth middleware: X-API-Key (toggleable)
  • Config-driven startup: lakehouse.toml

Phase 6: Ingest Pipeline

  • CSV ingest with auto schema detection
  • JSON ingest (array + newline-delimited, nested flattening)
  • PDF text extraction (lopdf)
  • Text/SMS file ingest
  • Content hash dedup (SHA-256)
  • POST /ingest/file multipart upload
  • 12 unit tests

Phase 7: Vector Index + RAG

  • chunker: configurable size + overlap, sentence-boundary aware
  • store: embeddings as Parquet (binary f32 vectors)
  • search: brute-force cosine similarity
  • rag: embed → search → retrieve → LLM answer with citations
  • POST /vectors/index, /search, /rag
  • Background job system with progress tracking
  • Dual-pipeline supervisor with checkpointing + retry
  • 100K embeddings: 177/sec on A4000, zero failures
  • 6 unit tests
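
The brute-force search step above can be sketched as a full scan that scores every stored vector against the query. This is a minimal illustration, not the project's code: the function names `cosine` and `brute_force_search` and the in-memory `Vec<Vec<f32>>` layout are assumptions (the tracker only says embeddings are stored as Parquet binary f32 vectors).

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns 0.0 when either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Hypothetical helper: score every vector and keep the `k` best
/// as (index, score) pairs, highest score first.
fn brute_force_search(query: &[f32], vectors: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine(query, v)))
        .collect();
    // total_cmp gives a total order on f32, so NaN scores cannot panic the sort.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

The scan is linear in the number of vectors, which is consistent with the later move to HNSW once the index reached 100K entries.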

Phase 8: Hot Cache + Incremental Updates

  • MemTable hot cache: LRU, configurable max (16GB)
  • POST /query/cache/pin, /cache/evict, GET /cache/stats
  • Delta store: append-only delta Parquet files
  • Merge-on-read: queries combine base + deltas
  • Compaction: POST /query/compact
  • Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)
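
A size-bounded LRU like the hot cache above could look roughly like the following. This is a sketch under assumptions: the `HotCache` type, its fields, and byte-based accounting are hypothetical, and it tracks only sizes rather than holding real MemTable data.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical size-bounded LRU: tracks pinned dataset sizes only.
struct HotCache {
    max_bytes: usize,
    used_bytes: usize,
    entries: HashMap<String, usize>, // dataset name -> size in bytes
    order: VecDeque<String>,         // front = least recently used
}

impl HotCache {
    fn new(max_bytes: usize) -> Self {
        HotCache { max_bytes, used_bytes: 0, entries: HashMap::new(), order: VecDeque::new() }
    }

    /// Mark a dataset as recently used (called on every cache hit).
    fn touch(&mut self, name: &str) {
        if let Some(pos) = self.order.iter().position(|n| n == name) {
            let n = self.order.remove(pos).unwrap();
            self.order.push_back(n);
        }
    }

    /// Pin a dataset, evicting least-recently-used entries until it fits.
    /// Returns the evicted names. An entry larger than the whole budget is
    /// still admitted after everything else has been evicted.
    fn pin(&mut self, name: &str, bytes: usize) -> Vec<String> {
        let mut evicted = Vec::new();
        // Re-pinning replaces the old entry instead of double counting it.
        if let Some(old) = self.entries.remove(name) {
            self.used_bytes -= old;
            self.order.retain(|n| n != name);
        }
        while self.used_bytes + bytes > self.max_bytes {
            match self.order.pop_front() {
                Some(victim) => {
                    self.used_bytes -= self.entries.remove(&victim).unwrap_or(0);
                    evicted.push(victim);
                }
                None => break,
            }
        }
        self.entries.insert(name.to_string(), bytes);
        self.order.push_back(name.to_string());
        self.used_bytes += bytes;
        evicted
    }
}
```

The linear scan in `touch` is acceptable for a handful of pinned datasets; a cache with many entries would want an indexed structure instead.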

Phase 8.5: Agent Workspaces

  • WorkspaceManager with daily/weekly/monthly/pinned tiers
  • Saved searches, shortlists, activity logs per workspace
  • Instant zero-copy handoff between agents
  • Persistence to object storage, rebuild on startup

Phase 9: Event Journal

  • journald crate: append-only mutation log
  • Event schema: entity, field, old/new value, actor, source, workspace
  • In-memory buffer with auto-flush to Parquet
  • GET /journal/history/{entity_id}, /recent, /stats
  • POST /journal/event, /update, /flush

Phase 10: Rich Catalog v2

  • DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
  • PII auto-detection: email, phone, SSN, salary, address, medical
  • Column-level metadata with sensitivity flags
  • Lineage tracking: source_system → ingest_job → dataset
  • PATCH /catalog/datasets/by-name/{name}/metadata
  • Backward compatible (serde default)
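
One plausible way to implement the PII auto-detection above is keyword matching on column names. The tracker does not say whether detection is name-based or content-based, so treat this as an assumption; `PiiKind` and `detect_pii` are hypothetical names.

```rust
/// The six PII categories listed in the tracker.
#[derive(Debug, PartialEq, Clone, Copy)]
enum PiiKind {
    Email,
    Phone,
    Ssn,
    Salary,
    Address,
    Medical,
}

/// Hypothetical name-based detector: returns the first category whose
/// keyword appears in the lowercased column name.
fn detect_pii(column: &str) -> Option<PiiKind> {
    const RULES: &[(&str, PiiKind)] = &[
        ("email", PiiKind::Email),
        ("phone", PiiKind::Phone),
        ("ssn", PiiKind::Ssn),
        ("salary", PiiKind::Salary),
        ("address", PiiKind::Address),
        ("medical", PiiKind::Medical),
    ];
    let name = column.to_lowercase();
    RULES
        .iter()
        .find(|(kw, _)| name.contains(*kw))
        .map(|(_, kind)| *kind)
}
```

Rule order matters: a column such as `email_address` is classified as `Email` because that rule is checked first.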

Phase 11: Embedding Versioning

  • IndexRegistry: model_name, model_version, dimensions per index
  • Index metadata persisted as JSON, rebuilt on startup
  • GET /vectors/indexes — list all (filter by source/model)
  • GET /vectors/indexes/{name} — metadata
  • Background jobs auto-register metadata on completion

Phase 12: Tool Registry

  • 6 built-in staffing tools (search_candidates, get_candidate, revenue_by_client, recruiter_performance, cold_leads, open_jobs)
  • Parameter validation + SQL template substitution
  • Permission levels: read / write / admin
  • Full audit trail per invocation
  • GET /tools, GET /tools/{name}, POST /tools/{name}/call, GET /tools/audit
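
Parameter validation plus SQL template substitution might be sketched as below. The `{name}` placeholder syntax and the `render_template` function are assumptions for illustration; the tracker does not describe the template format. Every value is emitted as a quoted SQL string literal.

```rust
use std::collections::HashMap;

/// Hypothetical renderer: replaces each `{param}` placeholder with a quoted
/// SQL string literal, failing if a placeholder has no matching argument.
fn render_template(template: &str, args: &HashMap<String, String>) -> Result<String, String> {
    let mut out = String::new();
    let mut rest = template;
    // Single pass over the template, so substituted values are never re-scanned.
    while let Some(start) = rest.find('{') {
        let end = rest[start..].find('}').ok_or("unclosed placeholder")? + start;
        let name = &rest[start + 1..end];
        let value = args
            .get(name)
            .ok_or_else(|| format!("missing parameter: {name}"))?;
        out.push_str(&rest[..start]);
        out.push('\'');
        // Double embedded single quotes so the value stays inside the literal.
        out.push_str(&value.replace('\'', "''"));
        out.push('\'');
        rest = &rest[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

Quoting by hand is shown only to keep the sketch dependency-free; bound parameters are the safer choice where the query engine supports them.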

Phase 13: Security & Access Control

  • Role-based access: admin, recruiter, analyst, agent
  • Field-level sensitivity enforcement
  • Per-agent column masking decisions
  • Query audit logging
  • GET/POST /access/roles, GET /access/audit, POST /access/check

Phase 14: Schema Evolution

  • Schema diff detection (added, removed, type changed, renamed)
  • Fuzzy rename detection (shared word parts)
  • Auto-generated migration rules with confidence scores
  • AI migration prompt builder for complex cases
  • 5 unit tests
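
The diff categories above (added, removed, type changed, renamed) can be illustrated with a small comparison of two name-to-type maps. This is a sketch, not the shipped algorithm: `Change`, `shared_parts`, and `diff_schemas` are hypothetical names, and scoring renames by Jaccard overlap of underscore-separated word parts is one assumed reading of "shared word parts".

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, PartialEq)]
enum Change {
    Added(String),
    Removed(String),
    TypeChanged { column: String, from: String, to: String },
    Renamed { from: String, to: String, confidence: f32 },
}

/// Jaccard overlap of the underscore-separated word parts of two names.
fn shared_parts(a: &str, b: &str) -> f32 {
    let pa: BTreeSet<&str> = a.split('_').collect();
    let pb: BTreeSet<&str> = b.split('_').collect();
    let inter = pa.intersection(&pb).count() as f32;
    let union = pa.union(&pb).count() as f32;
    inter / union
}

/// Compare two schemas given as column name -> type name.
fn diff_schemas(old: &BTreeMap<String, String>, new: &BTreeMap<String, String>) -> Vec<Change> {
    let mut changes = Vec::new();
    let mut added: Vec<&String> = new.keys().filter(|k| !old.contains_key(*k)).collect();
    for (col, old_ty) in old {
        match new.get(col) {
            Some(new_ty) if new_ty != old_ty => changes.push(Change::TypeChanged {
                column: col.clone(),
                from: old_ty.clone(),
                to: new_ty.clone(),
            }),
            Some(_) => {}
            None => {
                // A vanished column is a rename candidate if some new column
                // has the same type and shares at least one word part.
                let mut best: Option<(usize, f32)> = None;
                for (i, cand) in added.iter().enumerate() {
                    let score = shared_parts(col.as_str(), cand.as_str());
                    let same_type = new.get(*cand) == Some(old_ty);
                    if same_type && score > 0.0 && best.map_or(true, |(_, s)| score > s) {
                        best = Some((i, score));
                    }
                }
                match best {
                    Some((i, confidence)) => {
                        let to = added.remove(i).clone();
                        changes.push(Change::Renamed { from: col.clone(), to, confidence });
                    }
                    None => changes.push(Change::Removed(col.clone())),
                }
            }
        }
    }
    // Whatever was not claimed by a rename is a genuinely new column.
    changes.extend(added.into_iter().map(|c| Change::Added(c.clone())));
    changes
}
```

The confidence value is what a migration rule could carry forward, so low-scoring renames can be routed to review rather than applied automatically.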

Phase 15+: Horizon

  • HNSW vector index with iteration-friendly trial system (2026-04-16)
    • HnswStore.build_index_with_config — parameterized ef_construction, ef_search, seed
    • EmbeddingCache — pins 100K vectors in memory, shared across trials
    • harness::EvalSet — named query sets with brute-force ground truth
    • TrialJournal — append-only JSONL at _hnsw_trials/{index}.jsonl
    • Endpoints: /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best?metric={recall|latency|pareto}, /hnsw/evals, /hnsw/evals/{name}/autogen, /hnsw/cache/stats
    • Measured on 100K resumes: brute-force 44-54ms → HNSW 509us-1830us, recall 0.92-1.00 depending on ef_construction. Sweet spot: ec=80 es=30 → p50=873us recall=1.00 — locked in as HnswConfig::default()
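
The recall figures above compare HNSW results with brute-force ground truth. A minimal sketch of recall@k, assuming results are lists of integer document IDs (the `recall_at_k` name and the `u32` ID type are hypothetical):

```rust
use std::collections::HashSet;

/// Fraction of the top-`k` ground-truth IDs that also appear in the
/// top-`k` approximate results. Defined as 1.0 here when the ground
/// truth is empty, since there is nothing to miss.
fn recall_at_k(truth: &[u32], approx: &[u32], k: usize) -> f64 {
    let truth_k: HashSet<u32> = truth.iter().take(k).copied().collect();
    if truth_k.is_empty() {
        return 1.0;
    }
    let approx_k: HashSet<u32> = approx.iter().take(k).copied().collect();
    let hits = truth_k.intersection(&approx_k).count();
    hits as f64 / truth_k.len() as f64
}
```

Averaging this value over a named query set gives one recall number per trial, which is the kind of figure a trial journal can record next to latency.
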
  • Catalog manifest repair — POST /catalog/resync-missing restores row_count and columns from parquet footers (2026-04-16). All 7 staffing tables recovered to PRD-matching 2.47M rows.
  • [~] Federated multi-bucket query — foundation complete 2026-04-16, see ADR-017
    • StorageConfig.buckets + rescue_bucket + profile_root config shape
    • SecretsProvider trait + FileSecretsProvider (reads /etc/lakehouse/secrets.toml, checks 0600 perms)
    • storaged::BucketRegistry — multi-backend, rescue-aware, reachability probes
    • storaged::error_journal::ErrorJournal — append-only JSONL at primary://_errors/bucket_errors.jsonl
    • Endpoints: GET /storage/buckets, GET /storage/errors, GET /storage/bucket-health
    • Bucket-aware I/O: PUT/GET /storage/buckets/{bucket}/objects/{*key} with rescue fallback + X-Lakehouse-Rescue-Used observability headers
    • Backward compat: empty [[storage.buckets]] synthesizes a primary from legacy root
    • Three-bucket test (primary + rescue + testing) verified: normal reads, rescue fallback with headers, hard-fail missing, write to unknown bucket 503, error journal + health summary
    • X-Lakehouse-Bucket header middleware on ingest endpoints (2026-04-16)
    • Catalog migration: POST /catalog/migrate-buckets stamps bucket = "primary" on legacy refs (12 renamed, 14 total now canonical)
    • queryd registers every bucket with DataFusion for cross-bucket SQL — verified with people_test (testing) × animals (primary) CROSS JOIN
    • Profile hot-load endpoints: bucket auto-provisioning on POST /vectors/profile/{id}/activate (2026-04-17)
    • vectord bucket-scoped paths: TrialJournal + PromotionRegistry resolve per-index via IndexMeta.bucket (2026-04-17)
    • Runtime bucket lifecycle: POST /storage/buckets (provision) + DELETE /storage/buckets/{name} (unregister, refuses primary/rescue) (2026-04-17)
    • ModelProfile.bucket field — per-profile artifact isolation (2026-04-17)
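
The rescue-fallback read path described in the three-bucket test might behave like the sketch below. Everything here is an assumption for illustration: buckets are modelled as in-memory maps, and `ReadOutcome`, `ReadError`, and `read_with_rescue` are hypothetical names. The `rescue_used` flag stands in for the `X-Lakehouse-Rescue-Used` header.

```rust
use std::collections::HashMap;

/// In-memory stand-in for a bucket: object key -> bytes.
type Bucket = HashMap<String, Vec<u8>>;

#[derive(Debug, PartialEq)]
enum ReadError {
    UnknownBucket,
    NotFound,
}

#[derive(Debug, PartialEq)]
struct ReadOutcome {
    bytes: Vec<u8>,
    rescue_used: bool, // would drive the X-Lakehouse-Rescue-Used header
}

/// Read `key` from `bucket`; if it is absent there, try the rescue bucket.
/// Hard-fails when neither has the object.
fn read_with_rescue(
    buckets: &HashMap<String, Bucket>,
    rescue: &str,
    bucket: &str,
    key: &str,
) -> Result<ReadOutcome, ReadError> {
    let primary = buckets.get(bucket).ok_or(ReadError::UnknownBucket)?;
    if let Some(bytes) = primary.get(key) {
        return Ok(ReadOutcome { bytes: bytes.clone(), rescue_used: false });
    }
    buckets
        .get(rescue)
        .and_then(|b| b.get(key))
        .map(|bytes| ReadOutcome { bytes: bytes.clone(), rescue_used: true })
        .ok_or(ReadError::NotFound)
}
```
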
  • Database connector ingest (Postgres first) — 2026-04-16
    • pg_stream::stream_table_to_parquet — ORDER BY + LIMIT/OFFSET pagination, configurable batch_size
    • parse_dsn — postgresql:// and postgres:// URL scheme, user/password/host/port/db
    • POST /ingest/db endpoint: {dsn, table, dataset_name?, batch_size?, order_by?, limit?} → streams to Parquet, registers in catalog with PII detection + redacted-password lineage
    • Existing POST /ingest/postgres/import (structured config) preserved alongside
    • 4 DSN-parser unit tests + live end-to-end test against knowledge_base.team_runs (586 rows, 13 cols, 6 batches, 196ms)
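
ORDER BY + LIMIT/OFFSET pagination amounts to issuing one query per batch. The sketch below generates those queries for a known row count; `page_queries` is a hypothetical helper, and a real streamer would more likely loop until a short batch comes back rather than know the total up front.

```rust
/// Hypothetical helper: one paginated SELECT per batch.
/// Identifiers are interpolated as-is, so callers must validate
/// `table` and `order_by` before using this with untrusted input.
fn page_queries(table: &str, order_by: &str, total_rows: u64, batch_size: u64) -> Vec<String> {
    assert!(batch_size > 0, "batch_size must be positive");
    (0..total_rows)
        .step_by(batch_size as usize)
        .map(|offset| {
            format!("SELECT * FROM {table} ORDER BY {order_by} LIMIT {batch_size} OFFSET {offset}")
        })
        .collect()
}
```

With an assumed `batch_size` of 100 (the tracker does not state it), 586 rows need six pages, which matches the batch count in the live test. A stable `ORDER BY` column matters here: without it, rows can shift between pages and be skipped or duplicated.
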
  • Phase B: Lance storage evaluation — 2026-04-16
    • crates/lance-bench standalone pilot (Lance 4.0) avoids DataFusion/Arrow version conflict with main stack
    • 8-dimension benchmark on resumes_100k_v2 — see docs/ADR-019-vector-storage.md for scorecard
    • Decision: hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance added as per-profile second backend for random access (112× faster), append (0.08s vs full rewrite), hot-swap (14× faster index builds), and scale past 5M RAM ceiling.
  • Phase E.2 — Compaction integrates tombstones (physical deletion) — 2026-04-16
    • delta::compact accepts tombstones: &[Tombstone] param, filters rows at merge time via arrow filter_record_batch
    • CompactResult gains tombstones_applied + rows_dropped_by_tombstones
    • Atomic write: ArrowWriter → single Parquet file, temp_key staging, verify-parse before overwrite, delta files deleted only AFTER the base write succeeds. Fixes a latent bug where concatenated Parquet byte streams produced garbage (only the first segment's footer was visible)
    • Snappy compression on output matches ingest defaults (avoids 3× size inflation on every compact)
    • TombstoneStore::clear drops all batch files for a dataset; called by queryd after successful compact
    • Query engine exposes catalog() accessor so service handler can reach the tombstone store
    • E2E verified on candidates (100K rows): tombstone 3 IDs → compact → 99,997 rows physically in parquet, tombstones empty, IDs gone from __raw__candidates too; file size 10.59 MB → 10.72 MB (proportional to data, not inflated)
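
The merge-then-filter logic of tombstone-aware compaction can be shown on plain in-memory rows. This sketch leaves out Arrow and Parquet entirely; `Row`, the simplified `compact` signature, and the reading of `tombstones_applied` as the number of distinct tombstoned keys are assumptions. Only the `CompactResult` field names come from the tracker.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq)]
struct Row {
    key: String,
    payload: String,
}

#[derive(Debug, PartialEq)]
struct CompactResult {
    rows: Vec<Row>,
    tombstones_applied: usize,
    rows_dropped_by_tombstones: usize,
}

/// Merge base and delta rows, then physically drop tombstoned keys.
fn compact(base: Vec<Row>, deltas: Vec<Vec<Row>>, tombstones: &[String]) -> CompactResult {
    let dead: HashSet<&str> = tombstones.iter().map(String::as_str).collect();
    let merged: Vec<Row> = base.into_iter().chain(deltas.into_iter().flatten()).collect();
    let before = merged.len();
    let rows: Vec<Row> = merged
        .into_iter()
        .filter(|r| !dead.contains(r.key.as_str()))
        .collect();
    CompactResult {
        rows_dropped_by_tombstones: before - rows.len(),
        tombstones_applied: dead.len(),
        rows,
    }
}
```

Clearing the tombstone store only after the compacted file is written keeps the deletion durable if the compaction fails partway.
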
  • Phase 16: Hot-swap generations + autotune agent — 2026-04-16
    • vectord::promotion::PromotionRegistry — per-index current config + history at _hnsw_promotions/{index}.json, cap 50 history entries
    • Endpoints: POST /vectors/hnsw/promote/{index}/{trial_id}, POST /vectors/hnsw/rollback/{index}, GET /vectors/hnsw/promoted/{index}
    • vectord::autotune::run_autotune — grid of trials (configurable or default 5 configs), Pareto winner selection (max recall, then min p50), min_recall safety gate (default 0.9), config bounds (ec ∈ [10,400], es ∈ [10,200])
    • POST /vectors/hnsw/autotune — runs the full loop synchronously, journals every trial, auto-promotes winner
    • activate_profile uses promotion_registry.config_or(..., profile_default) so newly-promoted configs flow automatically into next activation
    • End-to-end: autogen harness for threat_intel_v1 (10 queries), autotune ran 5 trials (all recall=1.00, p50 64-68us), promoted ec=20 es=30 at recall=1.0 p50=64us as winner. Manual promote of ec=80 es=30 pushed autotune pick onto history. Rollback restored autotune winner. Second rollback cleared to None. Re-promote + restart verified persistence. Activation after promotion logged "building HNSW ef_construction=80 ef_search=30 seed=42" — config flowed through correctly.
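
The winner selection rule above (max recall, then min p50, behind a min_recall gate and config bounds) can be sketched as a single comparator. The `Trial` struct and `pick_winner` function are hypothetical; the bounds and the ordering are taken from the tracker.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Trial {
    ef_construction: u32,
    ef_search: u32,
    recall: f64,
    p50_us: f64,
}

/// Pick the trial with the highest recall, breaking ties by lowest p50.
/// Trials below `min_recall` or outside the config bounds are ignored.
/// Returns None when no trial passes the gate, so nothing gets promoted.
fn pick_winner(trials: &[Trial], min_recall: f64) -> Option<Trial> {
    trials
        .iter()
        .copied()
        .filter(|t| t.recall >= min_recall)
        .filter(|t| (10..=400).contains(&t.ef_construction) && (10..=200).contains(&t.ef_search))
        // "Smallest" under this ordering = highest recall, then lowest latency.
        .min_by(|a, b| b.recall.total_cmp(&a.recall).then(a.p50_us.total_cmp(&b.p50_us)))
}
```

When every trial reaches the same recall, as in the threat_intel_v1 run, the choice comes down to p50 latency alone.
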
  • Phase 17: Model profiles + scoped search — 2026-04-16
    • shared::types::ModelProfile — { id, ollama_name, description, bound_datasets, hnsw_config, embed_model, created_at, created_by }
    • shared::types::ProfileHnswConfig — mirror of vectord's HnswConfig to avoid cross-crate dep cycle (defaults ec=80 es=30 matching Phase 15 winner)
    • Registry::{put_profile, get_profile, list_profiles, delete_profile} persisted at _catalog/profiles/{id}.json, validates bindings exist (raw dataset OR AiView)
    • Endpoints: POST/GET /catalog/profiles, GET/DELETE /catalog/profiles/{id}
    • POST /vectors/profile/{id}/activate — warms EmbeddingCache + builds HNSW with profile's config for every bound dataset's vector index; reports warmed indexes + failures + duration
    • POST /vectors/profile/{id}/search — rejects 403 if requested index's source isn't in profile.bound_datasets; falls through to HNSW if warm, brute-force otherwise
    • Fixed refresh to register new index metadata (previously a silent no-op for first-time indexes)
    • End-to-end: security-analyst profile bound to threat_intel → activate warms 54 vectors in 156ms → within-scope HNSW search works (0.625 score); out-of-scope search for candidates returns 403 with allowed bindings listed
  • Phase E: Soft deletes (tombstones) — 2026-04-16
    • shared::types::Tombstone — { dataset, row_key_column, row_key_value, deleted_at, actor, reason }
    • catalogd::tombstones::TombstoneStore per-dataset append-log at _catalog/tombstones/{dataset}/, flush_threshold=1 + explicit flush so every tombstone is durable on return (compliance requirement)
    • All tombstones for a dataset must share the same row_key_column (validated at write — query filter is built as a single WHERE NOT IN clause)
    • Registry::add_tombstone / list_tombstones
    • Endpoint: POST /catalog/datasets/by-name/{name}/tombstone accepting {row_key_column, row_key_values[], actor, reason}; companion GET lists active tombstones
    • queryd::context::build_context wraps tombstoned tables: raw goes to __raw__{name}, public name becomes a DataFusion view with WHERE CAST(col AS VARCHAR) NOT IN (...) filter
    • End-to-end on candidates: tombstone 3 IDs, COUNT drops 100,000 → 99,997, specific WHERE returns empty, AiView candidates_safe transitively excludes them too, restart preserves all tombstones
    • Limits / not in MVP: physical compaction (at MVP, Phase 8 compaction did not read tombstones during merge; since delivered in Phase E.2); journal integration (tombstones don't yet emit Phase 9 mutation events; the audit fields on the tombstone itself cover this for now)
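
The tombstone filter described above can be pictured as generated SQL. This is only a sketch: `tombstone_view_sql` is a hypothetical helper, and the real code may build the view through DataFusion's DataFrame API rather than SQL text. The `__raw__{name}` naming and the `CAST(col AS VARCHAR) NOT IN (...)` shape come from the tracker.

```rust
/// Hypothetical generator for the public view over a tombstoned table.
/// Identifiers are interpolated as-is and must be validated by the caller.
fn tombstone_view_sql(name: &str, key_column: &str, deleted: &[&str]) -> String {
    if deleted.is_empty() {
        // `NOT IN ()` is not valid SQL, so an empty tombstone set means no filter.
        return format!("CREATE VIEW {name} AS SELECT * FROM __raw__{name}");
    }
    let list = deleted
        .iter()
        .map(|v| format!("'{}'", v.replace('\'', "''")))
        .collect::<Vec<_>>()
        .join(", ");
    format!(
        "CREATE VIEW {name} AS SELECT * FROM __raw__{name} \
         WHERE CAST({key_column} AS VARCHAR) NOT IN ({list})"
    )
}
```

A single `NOT IN` list is also why all tombstones for a dataset must share one `row_key_column`.
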
  • Phase D: AI-safe views — 2026-04-16
    • shared::types::AiView — name, base_dataset, columns whitelist, optional row_filter, column_redactions
    • shared::types::Redaction — Null | Hash | Mask { keep_prefix, keep_suffix }
    • Registry::put_view / get_view / list_views / delete_view persisted to _catalog/views/{name}.json
    • queryd::context registers each view as a DataFusion view with the safe projection + filter + redactions baked into the SELECT
    • Endpoints: POST/GET /catalog/views, GET/DELETE /catalog/views/{name}
    • End-to-end on candidates: candidates_safe view exposes 8 of 15 columns, masks candidate_id (CAN******01), filters out status='blocked'. SELECT * FROM candidates_safe returns whitelist only; SELECT email FROM candidates_safe fails. View survives restart.
    • Capability surface — raw candidates still accessible by name; Phase 13 access control is the layer that enforces who can query what
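
The three redaction modes can be sketched as a function over a single string value. The enum shape mirrors `shared::types::Redaction` as listed above; the `redact` function, the fully-masked fallback for short values, and the use of std's `DefaultHasher` are assumptions (that hasher is not cryptographic and is used here only as a stand-in).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

enum Redaction {
    Null,
    Hash,
    Mask { keep_prefix: usize, keep_suffix: usize },
}

/// Apply a redaction to one value. `None` represents SQL NULL.
fn redact(rule: &Redaction, value: &str) -> Option<String> {
    match rule {
        Redaction::Null => None,
        Redaction::Hash => {
            // Stand-in digest; a real deployment would want a keyed or
            // cryptographic hash instead.
            let mut h = DefaultHasher::new();
            value.hash(&mut h);
            Some(format!("{:016x}", h.finish()))
        }
        Redaction::Mask { keep_prefix, keep_suffix } => {
            let chars: Vec<char> = value.chars().collect();
            let n = chars.len();
            // Too short to keep both ends without revealing everything.
            if keep_prefix + keep_suffix >= n {
                return Some("*".repeat(n));
            }
            let mut out = String::new();
            out.extend(&chars[..*keep_prefix]);
            out.push_str(&"*".repeat(n - keep_prefix - keep_suffix));
            out.extend(&chars[n - keep_suffix..]);
            Some(out)
        }
    }
}
```

With `keep_prefix: 3` and `keep_suffix: 2`, an 11-character ID such as the made-up `CAN00004201` comes out as `CAN******01`, the shape shown in the candidates_safe example.
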
  • Phase C: Decoupled embedding refresh — 2026-04-16
    • DatasetManifest: last_embedded_at, embedding_stale_since, embedding_refresh_policy (Manual | OnAppend | Scheduled)
    • Registry::mark_embeddings_stale / clear_embeddings_stale / stale_datasets
    • Ingest paths (CSV pipeline + Postgres streaming) auto-mark-stale when writing to an already-embedded dataset
    • vectord::refresh::refresh_index — reads dataset, diffs doc_ids vs existing embeddings, embeds only new rows, writes combined index, clears stale
    • POST /vectors/refresh/{dataset} + GET /vectors/stale
    • Id columns accept Utf8, Int32, Int64
    • End-to-end on threat_intel: initial 20-row embed 2.1s; re-ingest to 54 rows auto-marks stale; delta refresh embeds only 34 new in 970ms (6× faster than full re-embed); stale cleared
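
The core of the delta refresh is a set difference between the dataset's doc_ids and the ones already embedded. A minimal sketch, with the hypothetical helper name `rows_needing_embedding` and string IDs assumed for simplicity:

```rust
use std::collections::HashSet;

/// Return the dataset IDs that have no embedding yet, in dataset order.
fn rows_needing_embedding<'a>(
    dataset_ids: &'a [String],
    embedded_ids: &HashSet<String>,
) -> Vec<&'a String> {
    dataset_ids
        .iter()
        .filter(|id| !embedded_ids.contains(*id))
        .collect()
}
```

For the threat_intel case above, 54 dataset rows against 20 existing embeddings leaves 34 rows to embed, which is where the saving over a full re-embed comes from.
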
  • Phase 16.2/16.5: Background autotune agent + ingest-triggered re-trials — 2026-04-17
    • vectord::agent — ε-greedy proposer, rate-limited, cooldown-gated, tokio background task
    • Ingest paths push DatasetAppended triggers to agent queue
    • Endpoints: GET /vectors/agent/status, POST /vectors/agent/stop, POST /vectors/agent/enqueue/{idx}
    • [agent] config section in lakehouse.toml (enabled, cycle_interval, cooldown, min_recall, max_trials/hr)
    • 3 unit tests
  • Phase 17 VRAM gate: Two-profile sequential swap — 2026-04-17
    • Sidecar: POST /admin/unload (keep_alive=0), POST /admin/preload, GET /admin/vram (nvidia-smi + Ollama /api/ps)
    • AiClient::unload_model / preload_model / vram_snapshot
    • VectorState.active_profile singleton — activate swaps models, deactivate unloads
    • Verified: staffing-recruiter (qwen2.5) ↔ docs-assistant (mistral) — only one model in VRAM at a time
  • MySQL streaming connector — 2026-04-17
    • my_stream.rs mirrors pg_stream: DSN parsing, OFFSET pagination, Arrow type mapping, Parquet streaming
    • POST /ingest/mysql with PII detection, lineage, agent trigger
    • Verified end-to-end on live MariaDB (10 rows, 9 columns, round-tripped all types)
    • 6 DSN + type-mapping unit tests
  • Phase 18 hybrid: vectord-lance production crate — 2026-04-17
    • Firewall crate (Arrow 57 / Lance 4, separate from main Arrow 55 / DF 47 stack)
    • Public API: migrate_from_parquet, build_index (IVF_PQ), search, get_by_doc_id, append, build_scalar_index, stats
    • lance_backend::LanceRegistry resolves bucket → URI per index
    • VectorBackend { Parquet | Lance } enum on ModelProfile + IndexMeta
    • 8 HTTP endpoints under /vectors/lance/* (migrate, index, search, doc, append, stats, scalar-index, recall)
    • Profile-driven routing: POST /vectors/profile/{id}/search auto-routes to Lance when profile.vector_backend=lance
    • Auto-migrate + auto-index on activation
    • Measured on real 100K × 768d: migrate 0.57s, IVF_PQ build 16.2s (14× faster than HNSW 230s), search 23ms, append 100 rows 3.3ms, doc_id fetch 3.5ms (with scalar btree)
    • IVF_PQ recall@10 = 0.805 (HNSW = 1.000) — measured via /vectors/lance/recall/{idx} harness
  • Phase E.3: Scheduled ingest — 2026-04-17
    • ingestd::schedule module: ScheduleDef, ScheduleStore (JSON at _schedules/{id}.json), Scheduler tokio task
    • Supports MySQL + Postgres sources on interval triggers (Cron variant defined, parsing stubbed)
    • 6 CRUD endpoints under /ingest/schedules/* + run-now manual trigger
    • Full catalog integration: PII, lineage, mark-stale, agent trigger
    • 6 unit tests
  • PDF OCR via Tesseract — 2026-04-17
    • Two-tier: lopdf text extraction → Tesseract 5.5 fallback for scanned/image PDFs
    • Extracts embedded XObject /Image streams, shells to tesseract --oem 3 --psm 6
    • Same schema (source_file, page_number, text_content) — downstream unchanged
  • Fine-tuned domain models
  • Multi-node query distribution

52+ unit tests | 13 crates | 19 ADRs | 2.47M rows | 100K vectors | Hybrid Parquet+HNSW ⊕ Lance
Latest: 2026-04-17, 8 commits shipping Phase 16.2 through Phase 18