lakehouse/docs/PHASES.md
root 09fd446c8d Phase D: AI-safe views — capability-surface projections over base data
Implements the llms3.com "AI-safe views" pattern: a named projection
that exposes only whitelisted columns, with optional row filter and
per-column redactions. AI agents (or Phase 13 roles) bind to the view;
queries through the view cannot reach PII, even with hand-written SQL.

Schema (shared::types):
- AiView { name, base_dataset, columns: Vec<String>, row_filter,
           column_redactions: HashMap<String, Redaction>, ... }
- Redaction enum: Null | Hash | Mask { keep_prefix, keep_suffix }

Catalog (catalogd::registry):
- put_view validates base dataset exists + columns non-empty
- Persists JSON at _catalog/views/{name}.json (sanitized name)
- rebuild() loads views alongside dataset manifests on startup

Query layer (queryd::context):
- build_context registers every AiView as a DataFusion view object
- Constructed SELECT applies whitelist projection, WHERE filter, and
  redaction expressions per column
  - Mask: substr(prefix) + repeat('*', mid_len) + substr(suffix)
  - Hash: digest(value, 'sha256')
  - Null: CAST(NULL AS VARCHAR) AS col
- DataFusion handles JOINs/aggregates over the view natively — it's a
  real view, not a query rewrite

HTTP (catalogd::service):
- POST /catalog/views (create)
- GET  /catalog/views (list)
- GET  /catalog/views/{name} (full def)
- DELETE /catalog/views/{name}

End-to-end test on candidates (100K rows, 15 columns):

  candidates_safe view:
    columns: candidate_id, first_name, city, state, vertical,
             skills, years_experience, status
    row_filter: status != 'blocked'
    redaction: candidate_id mask(prefix=3, suffix=2)

  SELECT * FROM candidates_safe LIMIT 5
    -> 8 columns only, candidate_id shown as "CAN******01"
       (PII fields email/phone/last_name absent from result)

  SELECT email FROM candidates_safe
    -> fails (column not in projection)

  SELECT email FROM candidates
    -> succeeds (raw table still accessible by name —
       Phase 13 access control is the gate, not the view itself)

Survives restart — view definitions reload from object storage.

Limits / not in MVP:
- View CANNOT shadow base table by name (DataFusion treats them as
  separate identifiers; access control must restrict raw-table access)
- row_filter is treated as trusted SQL — operators must validate
  before persisting; only authenticated admin path should call put_view
- Redaction expressions assume column is castable to VARCHAR; numeric
  redactions could be misleading (a Hash on Int64 returns a hex string
  that won't equi-join with another hash on the same value type)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 09:16:44 -05:00


# Phase Tracker
## Phase 0: Bootstrap ✅
- [x] Cargo workspace with all crate stubs compiling
- [x] `shared` crate: error types, ObjectRef, DatasetId
- [x] `gateway` with Axum: GET /health → 200
- [x] tracing + tracing-subscriber wired in gateway
- [x] justfile with build, test, run recipes
- [x] docs committed to git
## Phase 1: Storage + Catalog ✅
- [x] storaged: object_store backend init (LocalFileSystem)
- [x] storaged: Axum endpoints (PUT/GET/DELETE/LIST)
- [x] shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
- [x] catalogd/registry.rs: in-memory index + manifest persistence
- [x] catalogd service: POST/GET /datasets + by-name
- [x] gateway routes wired
## Phase 2: Query Engine ✅
- [x] queryd: SessionContext + object_store config
- [x] queryd: ListingTable from catalog ObjectRefs
- [x] queryd service: POST /query/sql → JSON
- [x] queryd → catalogd wiring
- [x] gateway routes /query
## Phase 3: AI Integration ✅
- [x] Python sidecar: FastAPI + Ollama (embed/generate/rerank)
- [x] Dockerfile for sidecar
- [x] aibridge/client.rs: HTTP client
- [x] aibridge service: Axum proxy endpoints
- [x] Model config via env vars
## Phase 4: Frontend ✅
- [x] Dioxus scaffold, WASM build
- [x] Ask tab: natural language → AI SQL → results
- [x] Explore tab: dataset browser + AI summary
- [x] SQL tab: raw DataFusion editor
- [x] System tab: health checks for all services
## Phase 5: Hardening ✅
- [x] Proto definitions (lakehouse.proto)
- [x] Internal gRPC: CatalogService on :3101
- [x] OpenTelemetry tracing: stdout exporter
- [x] Auth middleware: X-API-Key (toggleable)
- [x] Config-driven startup: lakehouse.toml
## Phase 6: Ingest Pipeline ✅
- [x] CSV ingest with auto schema detection
- [x] JSON ingest (array + newline-delimited, nested flattening)
- [x] PDF text extraction (lopdf)
- [x] Text/SMS file ingest
- [x] Content hash dedup (SHA-256)
- [x] POST /ingest/file multipart upload
- [x] 12 unit tests
## Phase 7: Vector Index + RAG ✅
- [x] chunker: configurable size + overlap, sentence-boundary aware
- [x] store: embeddings as Parquet (binary f32 vectors)
- [x] search: brute-force cosine similarity
- [x] rag: embed → search → retrieve → LLM answer with citations
- [x] POST /vectors/index, /search, /rag
- [x] Background job system with progress tracking
- [x] Dual-pipeline supervisor with checkpointing + retry
- [x] 100K embeddings: 177/sec on A4000, zero failures
- [x] 6 unit tests
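
A minimal sketch of the brute-force cosine search listed above, assuming embeddings are held in memory as plain `Vec<f32>` rows. The names `cosine_similarity` and `top_k` are illustrative and are not taken from the actual crate.

```rust
/// Cosine similarity between two vectors.
/// Returns 0.0 when the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Brute-force search: score every stored vector against the query and
/// return the best `k` as (row index, score) pairs, highest score first.
fn top_k(query: &[f32], corpus: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = corpus
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine_similarity(query, v)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

Every query touches every row, so cost grows linearly with the corpus. That is the baseline the HNSW trials under Phase 15+ are measured against.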
## Phase 8: Hot Cache + Incremental Updates ✅
- [x] MemTable hot cache: LRU, configurable max (16GB)
- [x] POST /query/cache/pin, /cache/evict, GET /cache/stats
- [x] Delta store: append-only delta Parquet files
- [x] Merge-on-read: queries combine base + deltas
- [x] Compaction: POST /query/compact
- [x] Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)
## Phase 8.5: Agent Workspaces ✅
- [x] WorkspaceManager with daily/weekly/monthly/pinned tiers
- [x] Saved searches, shortlists, activity logs per workspace
- [x] Instant zero-copy handoff between agents
- [x] Persistence to object storage, rebuild on startup
## Phase 9: Event Journal ✅
- [x] journald crate: append-only mutation log
- [x] Event schema: entity, field, old/new value, actor, source, workspace
- [x] In-memory buffer with auto-flush to Parquet
- [x] GET /journal/history/{entity_id}, /recent, /stats
- [x] POST /journal/event, /update, /flush
## Phase 10: Rich Catalog v2 ✅
- [x] DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
- [x] PII auto-detection: email, phone, SSN, salary, address, medical
- [x] Column-level metadata with sensitivity flags
- [x] Lineage tracking: source_system → ingest_job → dataset
- [x] PATCH /catalog/datasets/by-name/{name}/metadata
- [x] Backward compatible (serde default)
## Phase 11: Embedding Versioning ✅
- [x] IndexRegistry: model_name, model_version, dimensions per index
- [x] Index metadata persisted as JSON, rebuilt on startup
- [x] GET /vectors/indexes — list all (filter by source/model)
- [x] GET /vectors/indexes/{name} — metadata
- [x] Background jobs auto-register metadata on completion
## Phase 12: Tool Registry ✅
- [x] 6 built-in staffing tools (search_candidates, get_candidate, revenue_by_client, recruiter_performance, cold_leads, open_jobs)
- [x] Parameter validation + SQL template substitution
- [x] Permission levels: read / write / admin
- [x] Full audit trail per invocation
- [x] GET /tools, GET /tools/{name}, POST /tools/{name}/call, GET /tools/audit
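
The parameter validation and SQL template substitution steps might look roughly like the sketch below. The `{name}` placeholder syntax, the `ToolError` type, the allowed character set, and `render_sql` itself are all assumptions made for illustration. The tracker does not describe the real template format.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum ToolError {
    MissingParam(String),
    InvalidParam(String),
}

/// Render `{name}` placeholders as quoted SQL string literals.
/// Every declared parameter must be present and non-empty, may only use a
/// conservative character set, and has its single quotes doubled.
/// Braces are not in the allowed set, so a substituted value can never
/// introduce a new placeholder.
fn render_sql(
    template: &str,
    required: &[&str],
    params: &HashMap<String, String>,
) -> Result<String, ToolError> {
    let mut sql = template.to_string();
    for name in required {
        let value = params
            .get(*name)
            .ok_or_else(|| ToolError::MissingParam(name.to_string()))?;
        let allowed = |c: char| c.is_alphanumeric() || " _-.@'".contains(c);
        if value.is_empty() || !value.chars().all(allowed) {
            return Err(ToolError::InvalidParam(name.to_string()));
        }
        let literal = format!("'{}'", value.replace('\'', "''"));
        sql = sql.replace(&format!("{{{}}}", name), &literal);
    }
    Ok(sql)
}
```

Rejecting unexpected characters before substitution keeps the template author, not the caller, in control of the query shape. A production version would more likely bind parameters than splice strings.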
## Phase 13: Security & Access Control ✅
- [x] Role-based access: admin, recruiter, analyst, agent
- [x] Field-level sensitivity enforcement
- [x] Column masking determination per agent
- [x] Query audit logging
- [x] GET/POST /access/roles, GET /access/audit, POST /access/check
## Phase 14: Schema Evolution ✅
- [x] Schema diff detection (added, removed, type changed, renamed)
- [x] Fuzzy rename detection (shared word parts)
- [x] Auto-generated migration rules with confidence scores
- [x] AI migration prompt builder for complex cases
- [x] 5 unit tests
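
One way to read "fuzzy rename detection (shared word parts)" with confidence scores is a Jaccard overlap on the word parts of column names. The sketch below assumes that reading. The scoring formula, the separator set, and the function names are hypothetical and may differ from the implementation.

```rust
use std::collections::HashSet;

/// Split a column name into lowercase word parts on `_`, `-`, and spaces.
fn word_parts(name: &str) -> HashSet<String> {
    name.split(|c: char| c == '_' || c == '-' || c == ' ')
        .filter(|p| !p.is_empty())
        .map(|p| p.to_lowercase())
        .collect()
}

/// Jaccard overlap of word parts, used here as a stand-in confidence score.
fn rename_confidence(old: &str, new: &str) -> f64 {
    let (a, b) = (word_parts(old), word_parts(new));
    let union = a.union(&b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(&b).count() as f64 / union as f64
}

/// Pair each removed column with its best-scoring added column and keep
/// only the pairs whose score is at or above `threshold`.
fn detect_renames(
    removed: &[&str],
    added: &[&str],
    threshold: f64,
) -> Vec<(String, String, f64)> {
    let mut out = Vec::new();
    for old in removed {
        let best = added
            .iter()
            .map(|new| (*new, rename_confidence(old, new)))
            .max_by(|x, y| x.1.total_cmp(&y.1));
        if let Some((new, score)) = best {
            if score >= threshold {
                out.push((old.to_string(), new.to_string(), score));
            }
        }
    }
    out
}
```

Pairs that fall below the threshold would be the natural candidates for the AI migration prompt builder.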
## Phase 15+: Horizon
- [x] HNSW vector index with iteration-friendly trial system (2026-04-16)
  - `HnswStore.build_index_with_config` — parameterized ef_construction, ef_search, seed
  - `EmbeddingCache` — pins 100K vectors in memory, shared across trials
  - `harness::EvalSet` — named query sets with brute-force ground truth
  - `TrialJournal` — append-only JSONL at `_hnsw_trials/{index}.jsonl`
  - Endpoints: `/vectors/hnsw/trial`, `/hnsw/trials/{idx}`, `/hnsw/trials/{idx}/best?metric={recall|latency|pareto}`, `/hnsw/evals`, `/hnsw/evals/{name}/autogen`, `/hnsw/cache/stats`
  - Measured on 100K resumes: **brute-force 44-54ms → HNSW 509us-1830us**, recall 0.92-1.00 depending on `ef_construction`. Sweet spot: ec=80 es=30 → p50=873us recall=1.00 — locked in as `HnswConfig::default()`
- [x] Catalog manifest repair — `POST /catalog/resync-missing` restores row_count and columns from parquet footers (2026-04-16). All 7 staffing tables recovered to PRD-matching 2.47M rows.
- [~] Federated multi-bucket query — **foundation complete 2026-04-16**, see ADR-017
  - [x] `StorageConfig.buckets` + `rescue_bucket` + `profile_root` config shape
  - [x] `SecretsProvider` trait + `FileSecretsProvider` (reads /etc/lakehouse/secrets.toml, checks 0600 perms)
  - [x] `storaged::BucketRegistry` — multi-backend, rescue-aware, reachability probes
  - [x] `storaged::error_journal::ErrorJournal` — append-only JSONL at `primary://_errors/bucket_errors.jsonl`
  - [x] Endpoints: `GET /storage/buckets`, `GET /storage/errors`, `GET /storage/bucket-health`
  - [x] Bucket-aware I/O: `PUT/GET /storage/buckets/{bucket}/objects/{*key}` with rescue fallback + `X-Lakehouse-Rescue-Used` observability headers
  - [x] Backward compat: empty `[[storage.buckets]]` synthesizes a `primary` from legacy `root`
  - [x] Three-bucket test (primary + rescue + testing) verified: normal reads, rescue fallback with headers, hard-fail missing, write to unknown bucket 503, error journal + health summary
  - [x] `X-Lakehouse-Bucket` header middleware on ingest endpoints (2026-04-16)
  - [x] Catalog migration: `POST /catalog/migrate-buckets` stamps `bucket = "primary"` on legacy refs (12 renamed, 14 total now canonical)
  - [x] `queryd` registers every bucket with DataFusion for cross-bucket SQL — verified with people_test (testing) × animals (primary) CROSS JOIN
  - [ ] Profile hot-load endpoints: `POST /profile/{user}/activate|deactivate` (deferred to Phase 17)
  - [ ] `vectord` bucket-scoped paths (trial journals, eval sets per-bucket) (deferred to Phase 17)
- [x] Database connector ingest (Postgres first) — 2026-04-16
  - `pg_stream::stream_table_to_parquet` — ORDER BY + LIMIT/OFFSET pagination, configurable batch_size
  - `parse_dsn` — postgresql:// and postgres:// URL scheme, user/password/host/port/db
  - `POST /ingest/db` endpoint: `{dsn, table, dataset_name?, batch_size?, order_by?, limit?}` → streams to Parquet, registers in catalog with PII detection + redacted-password lineage
  - Existing `POST /ingest/postgres/import` (structured config) preserved alongside
  - 4 DSN-parser unit tests + live end-to-end test against `knowledge_base.team_runs` (586 rows, 13 cols, 6 batches, 196ms)
- [x] Phase B: Lance storage evaluation — 2026-04-16
  - `crates/lance-bench` standalone pilot (Lance 4.0) avoids DataFusion/Arrow version conflict with main stack
  - 8-dimension benchmark on resumes_100k_v2 — see docs/ADR-019-vector-storage.md for scorecard
  - Decision: hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance added as per-profile second backend for random access (112× faster), append (0.08s vs full rewrite), hot-swap (14× faster index builds), and scale past 5M RAM ceiling.
- [x] Phase D: AI-safe views — 2026-04-16
  - `shared::types::AiView` — name, base_dataset, columns whitelist, optional row_filter, column_redactions
  - `shared::types::Redaction` — Null | Hash | Mask { keep_prefix, keep_suffix }
  - `Registry::put_view / get_view / list_views / delete_view` persisted to `_catalog/views/{name}.json`
  - `queryd::context` registers each view as a DataFusion view with the safe projection + filter + redactions baked into the SELECT
  - Endpoints: `POST/GET /catalog/views`, `GET/DELETE /catalog/views/{name}`
  - End-to-end on candidates: `candidates_safe` view exposes 8 of 15 columns, masks `candidate_id` (CAN******01), filters out `status='blocked'`. `SELECT * FROM candidates_safe` returns whitelist only; `SELECT email FROM candidates_safe` fails. View survives restart.
  - Capability surface — raw `candidates` still accessible by name; Phase 13 access control is the layer that enforces who can query what
- [x] Phase C: Decoupled embedding refresh — 2026-04-16
  - `DatasetManifest`: `last_embedded_at`, `embedding_stale_since`, `embedding_refresh_policy` (Manual | OnAppend | Scheduled)
  - `Registry::mark_embeddings_stale` / `clear_embeddings_stale` / `stale_datasets`
  - Ingest paths (CSV pipeline + Postgres streaming) auto-mark-stale when writing to an already-embedded dataset
  - `vectord::refresh::refresh_index` — reads dataset, diffs doc_ids vs existing embeddings, embeds only new rows, writes combined index, clears stale
  - `POST /vectors/refresh/{dataset}` + `GET /vectors/stale`
  - Id columns accept `Utf8`, `Int32`, `Int64`
  - End-to-end on threat_intel: initial 20-row embed 2.1s; re-ingest to 54 rows auto-marks stale; delta refresh embeds only 34 new in 970ms (6× faster than full re-embed); stale cleared
- [ ] Database connector ingest: MySQL (Postgres shipped 2026-04-16, see above)
- [ ] PDF OCR (Tesseract)
- [ ] Scheduled ingest (cron)
- [ ] Fine-tuned domain models
- [ ] Multi-node query distribution
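
For the Phase D AI-safe views entry above, the following is a hedged sketch of how a view definition could be turned into the SELECT that gets registered. `mask` is a plain-Rust reference for the `Mask` semantics and reproduces the `CAN******01` result for an 11-character id with prefix=3, suffix=2. `column_expr` and `view_select` are hypothetical helpers. The SQL uses the DataFusion scalar functions `left`, `right`, `repeat`, `character_length`, `concat`, and `digest`, but the exact expressions `queryd` emits are not shown in this document. The handling of values too short to have a middle is also an assumption.

```rust
use std::collections::HashMap;

/// Mirrors the `Redaction` variants named in the Phase D entry.
enum Redaction {
    Null,
    Hash,
    Mask { keep_prefix: usize, keep_suffix: usize },
}

/// Keep the first `keep_prefix` and last `keep_suffix` characters and star
/// out the middle. Values with no middle are fully starred (assumption).
fn mask(value: &str, keep_prefix: usize, keep_suffix: usize) -> String {
    let chars: Vec<char> = value.chars().collect();
    if chars.len() <= keep_prefix + keep_suffix {
        return "*".repeat(chars.len());
    }
    let mid = chars.len() - keep_prefix - keep_suffix;
    let prefix: String = chars[..keep_prefix].iter().collect();
    let suffix: String = chars[chars.len() - keep_suffix..].iter().collect();
    format!("{}{}{}", prefix, "*".repeat(mid), suffix)
}

/// SQL expression for one whitelisted column, with its redaction if any.
fn column_expr(col: &str, redaction: Option<&Redaction>) -> String {
    match redaction {
        None => col.to_string(),
        Some(Redaction::Null) => format!("CAST(NULL AS VARCHAR) AS {}", col),
        Some(Redaction::Hash) => format!("digest({}, 'sha256') AS {}", col, col),
        Some(Redaction::Mask { keep_prefix: p, keep_suffix: s }) => format!(
            "concat(left({c}, {p}), repeat('*', character_length({c}) - {p} - {s}), right({c}, {s})) AS {c}",
            c = col,
            p = p,
            s = s
        ),
    }
}

/// Whitelist projection plus optional row filter over the base dataset.
/// `row_filter` is spliced in verbatim, matching the "trusted SQL" caveat.
fn view_select(
    base: &str,
    columns: &[&str],
    row_filter: Option<&str>,
    redactions: &HashMap<String, Redaction>,
) -> String {
    let exprs: Vec<String> = columns
        .iter()
        .map(|c| column_expr(c, redactions.get(*c)))
        .collect();
    let mut sql = format!("SELECT {} FROM {}", exprs.join(", "), base);
    if let Some(filter) = row_filter {
        sql.push_str(&format!(" WHERE {}", filter));
    }
    sql
}
```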
---
**30 unit tests | 11 crates | 16 ADRs | 2.47M rows | 100K vectors | All built 2026-03-27**
**HNSW trial system: 2026-04-16**