lakehouse/docs/PHASES.md
root 09fd446c8d Phase D: AI-safe views — capability-surface projections over base data
Implements the llms3.com "AI-safe views" pattern: a named projection
that exposes only whitelisted columns, with optional row filter and
per-column redactions. AI agents (or Phase 13 roles) bind to the view;
through it they can never accidentally see PII, even if they write raw
SQL against it.

Schema (shared::types):
- AiView { name, base_dataset, columns: Vec<String>, row_filter,
           column_redactions: HashMap<String, Redaction>, ... }
- Redaction enum: Null | Hash | Mask { keep_prefix, keep_suffix }
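
In Rust terms the two types are roughly the following — a sketch
reconstructed from the description above; the serde derives and doc
comments are assumptions, and the "..." fields stay elided:

    use std::collections::HashMap;
    use serde::{Deserialize, Serialize};

    /// Sketch of the view definition; the "..." fields are elided here.
    #[derive(Debug, Clone, Serialize, Deserialize)]
    pub struct AiView {
        pub name: String,
        pub base_dataset: String,
        /// Whitelist: only these columns exist as far as the view is concerned.
        pub columns: Vec<String>,
        /// Optional SQL predicate applied as the view's WHERE clause.
        pub row_filter: Option<String>,
        /// Redactions applied on top of the projection, keyed by column.
        pub column_redactions: HashMap<String, Redaction>,
    }

    #[derive(Debug, Clone, Serialize, Deserialize)]
    pub enum Redaction {
        Null,                                            // CAST(NULL AS VARCHAR)
        Hash,                                            // sha256 hex digest
        Mask { keep_prefix: usize, keep_suffix: usize }, // keep edges, star the middle
    }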

Catalog (catalogd::registry):
- put_view validates base dataset exists + columns non-empty
- Persists JSON at _catalog/views/{name}.json (sanitized name)
- rebuild() loads views alongside dataset manifests on startup

Query layer (queryd::context):
- build_context registers every AiView as a DataFusion view object
- Constructed SELECT applies whitelist projection, WHERE filter, and
  redaction expressions per column
  - Mask: substr(prefix) + repeat('*', mid_len) + substr(suffix)
  - Hash: digest(value, 'sha256')
  - Null: CAST(NULL AS VARCHAR) AS col
- DataFusion handles JOINs/aggregates over the view natively — it's a
  real view, not a query rewrite
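
A minimal sketch of how that SELECT can be assembled from an AiView.
The SQL functions (substr, repeat, char_length, digest, CAST NULL) are
the ones named above; the builder itself is illustrative, not the
actual queryd::context code:

    // Illustrative: assembles the SELECT that backs one AI-safe view.
    fn view_sql(v: &AiView) -> String {
        let cols: Vec<String> = v.columns.iter().map(|c| {
            match v.column_redactions.get(c) {
                Some(Redaction::Null) => format!("CAST(NULL AS VARCHAR) AS {c}"),
                Some(Redaction::Hash) => format!("digest({c}, 'sha256') AS {c}"),
                Some(Redaction::Mask { keep_prefix: p, keep_suffix: s }) => {
                    let cast = format!("CAST({c} AS VARCHAR)");
                    // prefix + '*' padding for the middle + suffix
                    format!(
                        "concat(substr({cast}, 1, {p}), \
                         repeat('*', char_length({cast}) - {p} - {s}), \
                         substr({cast}, char_length({cast}) - {s} + 1)) AS {c}"
                    )
                }
                None => c.clone(),
            }
        }).collect();
        let mut sql = format!("SELECT {} FROM {}", cols.join(", "), v.base_dataset);
        if let Some(f) = &v.row_filter {
            sql.push_str(&format!(" WHERE {f}"));
        }
        sql
    }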

HTTP (catalogd::service):
- POST /catalog/views (create)
- GET  /catalog/views (list)
- GET  /catalog/views/{name} (full def)
- DELETE /catalog/views/{name}

End-to-end test on candidates (100K rows, 15 columns):

  candidates_safe view:
    columns: candidate_id, first_name, city, state, vertical,
             skills, years_experience, status
    row_filter: status != 'blocked'
    redaction: candidate_id mask(prefix=3, suffix=2)

  SELECT * FROM candidates_safe LIMIT 5
    -> 8 columns only, candidate_id shown as "CAN******01"
       (PII fields email/phone/last_name absent from result)

  SELECT email FROM candidates_safe
    -> fails (column not in projection)

  SELECT email FROM candidates
    -> succeeds (raw table still accessible by name —
       Phase 13 access control is the gate, not the view itself)

Survives restart — view definitions reload from object storage.

Limits / not in MVP:
- View CANNOT shadow base table by name (DataFusion treats them as
  separate identifiers; access control must restrict raw-table access)
- row_filter is treated as trusted SQL — operators must validate it
  before persisting, and only the authenticated admin path should call
  put_view
- Redaction expressions assume the column is castable to VARCHAR; numeric
  redactions can be misleading (a Hash on an Int64 column digests its
  VARCHAR cast, and the resulting hex string won't equi-join with a hash
  of the same value stored under a different type or formatting)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Phase Tracker

Phase 0: Bootstrap

  • Cargo workspace with all crate stubs compiling
  • shared crate: error types, ObjectRef, DatasetId
  • gateway with Axum: GET /health → 200
  • tracing + tracing-subscriber wired in gateway
  • justfile with build, test, run recipes
  • docs committed to git

Phase 1: Storage + Catalog

  • storaged: object_store backend init (LocalFileSystem)
  • storaged: Axum endpoints (PUT/GET/DELETE/LIST)
  • shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
  • catalogd/registry.rs: in-memory index + manifest persistence
  • catalogd service: POST/GET /datasets + by-name
  • gateway routes wired
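
For reference, the RecordBatch → Parquet direction with the arrow and
parquet crates looks roughly like this (a generic sketch, not the
actual arrow_helpers.rs; the path and schema are made up):

    use std::sync::Arc;
    use arrow::array::{ArrayRef, Int64Array};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;

    fn write_batch() -> Result<(), Box<dyn std::error::Error>> {
        let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(Int64Array::from(vec![1_i64, 2, 3])) as ArrayRef],
        )?;
        let file = std::fs::File::create("batch.parquet")?;
        let mut writer = ArrowWriter::try_new(file, schema, None)?;
        writer.write(&batch)?;
        writer.close()?; // finalizes row groups and writes the footer
        Ok(())
    }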

Phase 2: Query Engine

  • queryd: SessionContext + object_store config
  • queryd: ListingTable from catalog ObjectRefs
  • queryd service: POST /query/sql → JSON
  • queryd → catalogd wiring
  • gateway routes /query
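
The DataFusion side reduces to a few lines. A minimal sketch, with a
local path standing in for a catalog ObjectRef:

    use datafusion::prelude::*;

    async fn query_dataset() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();
        // queryd resolves the path from a catalog ObjectRef; a local file here.
        ctx.register_parquet("candidates", "data/candidates.parquet",
                             ParquetReadOptions::default()).await?;
        let df = ctx.sql("SELECT count(*) FROM candidates").await?;
        df.show().await?; // prints the result batches to stdout
        Ok(())
    }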

Phase 3: AI Integration

  • Python sidecar: FastAPI + Ollama (embed/generate/rerank)
  • Dockerfile for sidecar
  • aibridge/client.rs: HTTP client
  • aibridge service: Axum proxy endpoints
  • Model config via env vars

Phase 4: Frontend

  • Dioxus scaffold, WASM build
  • Ask tab: natural language → AI SQL → results
  • Explore tab: dataset browser + AI summary
  • SQL tab: raw DataFusion editor
  • System tab: health checks for all services

Phase 5: Hardening

  • Proto definitions (lakehouse.proto)
  • Internal gRPC: CatalogService on :3101
  • OpenTelemetry tracing: stdout exporter
  • Auth middleware: X-API-Key (toggleable)
  • Config-driven startup: lakehouse.toml
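
The X-API-Key check above is a natural fit for Axum's from_fn
middleware. A sketch against axum 0.7 with a hard-coded key — the real
middleware reads its key from config and can be toggled off:

    use axum::{
        extract::Request,
        http::StatusCode,
        middleware::{self, Next},
        response::Response,
        routing::get,
        Router,
    };

    // Hypothetical fixed key; the real check is config-driven + toggleable.
    async fn require_api_key(req: Request, next: Next) -> Result<Response, StatusCode> {
        match req.headers().get("x-api-key").and_then(|v| v.to_str().ok()) {
            Some("secret") => Ok(next.run(req).await),
            _ => Err(StatusCode::UNAUTHORIZED),
        }
    }

    fn app() -> Router {
        Router::new()
            .route("/health", get(|| async { "ok" }))
            .layer(middleware::from_fn(require_api_key))
    }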

Phase 6: Ingest Pipeline

  • CSV ingest with auto schema detection
  • JSON ingest (array + newline-delimited, nested flattening)
  • PDF text extraction (lopdf)
  • Text/SMS file ingest
  • Content hash dedup (SHA-256)
  • POST /ingest/file multipart upload
  • 12 unit tests
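
The content-hash dedup is small enough to sketch in full (sha2 crate
assumed; in-memory only — an illustrative sketch, not the shipped
ingest code):

    use sha2::{Digest, Sha256};
    use std::collections::HashSet;

    /// Identical bytes ingest once: SHA-256 the content, remember the hash.
    #[derive(Default)]
    struct Dedup {
        seen: HashSet<[u8; 32]>,
    }

    impl Dedup {
        /// Returns true the first time these exact bytes are seen.
        fn first_time(&mut self, bytes: &[u8]) -> bool {
            let hash: [u8; 32] = Sha256::digest(bytes).into();
            self.seen.insert(hash)
        }
    }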

Phase 7: Vector Index + RAG

  • chunker: configurable size + overlap, sentence-boundary aware
  • store: embeddings as Parquet (binary f32 vectors)
  • search: brute-force cosine similarity
  • rag: embed → search → retrieve → LLM answer with citations
  • POST /vectors/index, /search, /rag
  • Background job system with progress tracking
  • Dual-pipeline supervisor with checkpointing + retry
  • 100K embeddings: 177/sec on A4000, zero failures
  • 6 unit tests
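
The brute-force scorer is the baseline the Phase 15+ HNSW work is
measured against. A minimal sketch:

    /// Cosine similarity of two f32 vectors (assumes equal length).
    fn cosine(a: &[f32], b: &[f32]) -> f32 {
        let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let nb: f32 = b.iter().map(|y| y * y).sum::<f32>().sqrt();
        dot / (na * nb)
    }

    /// Brute force: score every stored vector, keep the best k.
    fn top_k<'a>(query: &[f32], store: &'a [(String, Vec<f32>)], k: usize) -> Vec<(&'a str, f32)> {
        let mut scored: Vec<(&str, f32)> = store
            .iter()
            .map(|(id, vec)| (id.as_str(), cosine(query, vec)))
            .collect();
        scored.sort_by(|x, y| y.1.total_cmp(&x.1));
        scored.truncate(k);
        scored
    }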

Phase 8: Hot Cache + Incremental Updates

  • MemTable hot cache: LRU, configurable max (16GB)
  • POST /query/cache/pin, /cache/evict, GET /cache/stats
  • Delta store: append-only delta Parquet files
  • Merge-on-read: queries combine base + deltas
  • Compaction: POST /query/compact
  • Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)
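
A sketch of the merge-on-read shape — base and deltas registered
separately, unified behind a view. Illustrative only; how queryd
actually combines them at scan time may differ:

    use datafusion::prelude::*;

    /// Merge-on-read sketch: queries see base + deltas as one table;
    /// compaction later folds the deltas back into a single base file.
    async fn register_merged(ctx: &SessionContext) -> datafusion::error::Result<()> {
        ctx.register_parquet("events_base", "data/events/base.parquet",
                             ParquetReadOptions::default()).await?;
        // Append-only delta files written since the last compaction
        // (same schema as the base, so UNION ALL lines up).
        ctx.register_parquet("events_delta", "data/events/deltas/",
                             ParquetReadOptions::default()).await?;
        ctx.sql("CREATE VIEW events AS \
                 SELECT * FROM events_base UNION ALL SELECT * FROM events_delta")
            .await?;
        Ok(())
    }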

Phase 8.5: Agent Workspaces

  • WorkspaceManager with daily/weekly/monthly/pinned tiers
  • Saved searches, shortlists, activity logs per workspace
  • Instant zero-copy handoff between agents
  • Persistence to object storage, rebuild on startup

Phase 9: Event Journal

  • journald crate: append-only mutation log
  • Event schema: entity, field, old/new value, actor, source, workspace
  • In-memory buffer with auto-flush to Parquet
  • GET /journal/history/{entity_id}, /recent, /stats
  • POST /journal/event, /update, /flush
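
One journal entry as a struct, mirroring the field list above (the
timestamp field and the concrete types are assumptions):

    use serde::{Deserialize, Serialize};

    /// One mutation record in the append-only log.
    #[derive(Debug, Serialize, Deserialize)]
    pub struct JournalEvent {
        pub entity_id: String,
        pub field: String,
        pub old_value: Option<String>,
        pub new_value: Option<String>,
        pub actor: String,             // user or agent that made the change
        pub source: String,            // service or pipeline that emitted it
        pub workspace: Option<String>,
        pub at: i64,                   // assumed epoch-millis timestamp
    }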

Phase 10: Rich Catalog v2

  • DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
  • PII auto-detection: email, phone, SSN, salary, address, medical
  • Column-level metadata with sensitivity flags
  • Lineage tracking: source_system → ingest_job → dataset
  • PATCH /catalog/datasets/by-name/{name}/metadata
  • Backward compatible (serde default)
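
PII auto-detection combines column-name and sample-value signals. A
minimal sketch of the email case — the regex and threshold are
illustrative, not the shipped detector:

    use regex::Regex;

    /// Email case only; the real detector also covers phone, SSN,
    /// salary, address, and medical fields.
    fn looks_like_email_column(name: &str, samples: &[&str]) -> bool {
        let email = Regex::new(r"^[^@\s]+@[^@\s]+\.[^@\s]+$").unwrap();
        let name_hit = name.to_lowercase().contains("email");
        let value_hits = samples.iter().filter(|s| email.is_match(s)).count();
        // Illustrative threshold: flag if the name matches or if a
        // majority of sampled values look like addresses.
        name_hit || (!samples.is_empty() && value_hits * 2 > samples.len())
    }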

Phase 11: Embedding Versioning

  • IndexRegistry: model_name, model_version, dimensions per index
  • Index metadata persisted as JSON, rebuilt on startup
  • GET /vectors/indexes — list all (filter by source/model)
  • GET /vectors/indexes/{name} — metadata
  • Background jobs auto-register metadata on completion

Phase 12: Tool Registry

  • 6 built-in staffing tools (search_candidates, get_candidate, revenue_by_client, recruiter_performance, cold_leads, open_jobs)
  • Parameter validation + SQL template substitution
  • Permission levels: read / write / admin
  • Full audit trail per invocation
  • GET /tools, GET /tools/{name}, POST /tools/{name}/call, GET /tools/audit
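
A sketch of validated template substitution. Illustrative only: the
shipped registry's validation is richer, and real code should prefer
bound parameters over string substitution where the engine allows it:

    use std::collections::HashMap;

    /// Substitute only declared, named parameters into a read-only SQL
    /// template; reject values that could break out of the literal.
    fn render_tool_sql(
        template: &str, // e.g. "SELECT * FROM candidates WHERE state = '{state}'"
        declared: &[&str],
        args: &HashMap<String, String>,
    ) -> Result<String, String> {
        let mut sql = template.to_string();
        for name in declared {
            let value = args.get(*name).ok_or(format!("missing param {name}"))?;
            if value.contains('\'') || value.contains(';') {
                return Err(format!("unsafe value for {name}"));
            }
            sql = sql.replace(&format!("{{{name}}}"), value);
        }
        Ok(sql)
    }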

Phase 13: Security & Access Control

  • Role-based access: admin, recruiter, analyst, agent
  • Field-level sensitivity enforcement
  • Column masking determination per agent
  • Query audit logging
  • GET/POST /access/roles, GET /access/audit, POST /access/check

Phase 14: Schema Evolution

  • Schema diff detection (added, removed, type changed, renamed)
  • Fuzzy rename detection (shared word parts)
  • Auto-generated migration rules with confidence scores
  • AI migration prompt builder for complex cases
  • 5 unit tests
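
The added/removed/type-changed core of the diff fits in a few lines
(column types as plain strings for brevity; fuzzy rename detection
layers on top by pairing an Added with a Removed that share word parts):

    use std::collections::HashMap;

    #[derive(Debug, PartialEq)]
    enum Change {
        Added(String),
        Removed(String),
        TypeChanged { column: String, from: String, to: String },
    }

    /// Diff two schemas given as column-name → type-name maps.
    fn diff(old: &HashMap<String, String>, new: &HashMap<String, String>) -> Vec<Change> {
        let mut out = Vec::new();
        for (col, ty) in new {
            match old.get(col) {
                None => out.push(Change::Added(col.clone())),
                Some(old_ty) if old_ty != ty => out.push(Change::TypeChanged {
                    column: col.clone(),
                    from: old_ty.clone(),
                    to: ty.clone(),
                }),
                _ => {}
            }
        }
        for col in old.keys() {
            if !new.contains_key(col) {
                out.push(Change::Removed(col.clone()));
            }
        }
        out
    }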

Phase 15+: Horizon

  • HNSW vector index with iteration-friendly trial system (2026-04-16)
    • HnswStore.build_index_with_config — parameterized ef_construction, ef_search, seed
    • EmbeddingCache — pins 100K vectors in memory, shared across trials
    • harness::EvalSet — named query sets with brute-force ground truth
    • TrialJournal — append-only JSONL at _hnsw_trials/{index}.jsonl
    • Endpoints: /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best?metric={recall|latency|pareto}, /hnsw/evals, /hnsw/evals/{name}/autogen, /hnsw/cache/stats
    • Measured on 100K resumes: brute-force 44-54ms → HNSW 509us-1830us, recall 0.92-1.00 depending on ef_construction. Sweet spot: ec=80 es=30 → p50=873us recall=1.00 — locked in as HnswConfig::default()
  • Catalog manifest repair — POST /catalog/resync-missing restores row_count and columns from parquet footers (2026-04-16). All 7 staffing tables recovered to PRD-matching 2.47M rows.
  • [~] Federated multi-bucket query — foundation complete 2026-04-16, see ADR-017
    • StorageConfig.buckets + rescue_bucket + profile_root config shape
    • SecretsProvider trait + FileSecretsProvider (reads /etc/lakehouse/secrets.toml, checks 0600 perms)
    • storaged::BucketRegistry — multi-backend, rescue-aware, reachability probes
    • storaged::error_journal::ErrorJournal — append-only JSONL at primary://_errors/bucket_errors.jsonl
    • Endpoints: GET /storage/buckets, GET /storage/errors, GET /storage/bucket-health
    • Bucket-aware I/O: PUT/GET /storage/buckets/{bucket}/objects/{*key} with rescue fallback + X-Lakehouse-Rescue-Used observability headers
    • Backward compat: empty [[storage.buckets]] synthesizes a primary from legacy root
    • Three-bucket test (primary + rescue + testing) verified: normal reads, rescue fallback with headers, hard-fail missing, write to unknown bucket 503, error journal + health summary
    • X-Lakehouse-Bucket header middleware on ingest endpoints (2026-04-16)
    • Catalog migration: POST /catalog/migrate-buckets stamps bucket = "primary" on legacy refs (12 renamed, 14 total now canonical)
    • queryd registers every bucket with DataFusion for cross-bucket SQL — verified with people_test (testing) × animals (primary) CROSS JOIN
    • Profile hot-load endpoints: POST /profile/{user}/activate|deactivate (deferred to Phase 17)
    • vectord bucket-scoped paths (trial journals, eval sets per-bucket) (deferred to Phase 17)
  • Database connector ingest (Postgres first) — 2026-04-16
    • pg_stream::stream_table_to_parquet — ORDER BY + LIMIT/OFFSET pagination, configurable batch_size
    • parse_dsn — postgresql:// and postgres:// URL schemes, user/password/host/port/db (see the sketch after this list)
    • POST /ingest/db endpoint: {dsn, table, dataset_name?, batch_size?, order_by?, limit?} → streams to Parquet, registers in catalog with PII detection + redacted-password lineage
    • Existing POST /ingest/postgres/import (structured config) preserved alongside
    • 4 DSN-parser unit tests + live end-to-end test against knowledge_base.team_runs (586 rows, 13 cols, 6 batches, 196ms)
  • Phase B: Lance storage evaluation — 2026-04-16
    • crates/lance-bench standalone pilot (Lance 4.0) avoids DataFusion/Arrow version conflict with main stack
    • 8-dimension benchmark on resumes_100k_v2 — see docs/ADR-019-vector-storage.md for scorecard
    • Decision: hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance added as per-profile second backend for random access (112× faster), append (0.08s vs full rewrite), hot-swap (14× faster index builds), and scale past 5M RAM ceiling.
  • Phase D: AI-safe views — 2026-04-16
    • shared::types::AiView — name, base_dataset, columns whitelist, optional row_filter, column_redactions
    • shared::types::Redaction — Null | Hash | Mask { keep_prefix, keep_suffix }
    • Registry::put_view / get_view / list_views / delete_view persisted to _catalog/views/{name}.json
    • queryd::context registers each view as a DataFusion view with the safe projection + filter + redactions baked into the SELECT
    • Endpoints: POST/GET /catalog/views, GET/DELETE /catalog/views/{name}
    • End-to-end on candidates: candidates_safe view exposes 8 of 15 columns, masks candidate_id (CAN******01), filters out status='blocked'. SELECT * FROM candidates_safe returns whitelist only; SELECT email FROM candidates_safe fails. View survives restart.
    • Capability surface — raw candidates still accessible by name; Phase 13 access control is the layer that enforces who can query what
  • Phase C: Decoupled embedding refresh — 2026-04-16
    • DatasetManifest: last_embedded_at, embedding_stale_since, embedding_refresh_policy (Manual | OnAppend | Scheduled)
    • Registry::mark_embeddings_stale / clear_embeddings_stale / stale_datasets
    • Ingest paths (CSV pipeline + Postgres streaming) auto-mark-stale when writing to an already-embedded dataset
    • vectord::refresh::refresh_index — reads dataset, diffs doc_ids vs existing embeddings, embeds only new rows, writes combined index, clears stale
    • POST /vectors/refresh/{dataset} + GET /vectors/stale
    • Id columns accept Utf8, Int32, Int64
    • End-to-end on threat_intel: initial 20-row embed 2.1s; re-ingest to 54 rows auto-marks stale; delta refresh embeds only 34 new in 970ms (6× faster than full re-embed); stale cleared
  • Database connector ingest (MySQL; Postgres shipped above)
  • PDF OCR (Tesseract)
  • Scheduled ingest (cron)
  • Fine-tuned domain models
  • Multi-node query distribution
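
A sketch of the shape parse_dsn can take, referenced from the Postgres
connector item above (illustrative; the real parser and its error
reporting live in the ingest crate):

    /// Parsed connection pieces for postgresql:// and postgres:// DSNs.
    #[derive(Debug, PartialEq)]
    struct Dsn {
        user: String,
        password: Option<String>,
        host: String,
        port: u16,
        db: String,
    }

    fn parse_dsn(dsn: &str) -> Option<Dsn> {
        // postgresql://user:pass@host:port/db  (postgres:// accepted too)
        let rest = dsn
            .strip_prefix("postgresql://")
            .or_else(|| dsn.strip_prefix("postgres://"))?;
        let (creds, tail) = rest.split_once('@')?;
        let (hostport, db) = tail.split_once('/')?;
        let (user, password) = match creds.split_once(':') {
            Some((u, p)) => (u.to_string(), Some(p.to_string())),
            None => (creds.to_string(), None),
        };
        let (host, port) = match hostport.split_once(':') {
            Some((h, p)) => (h.to_string(), p.parse().ok()?),
            None => (hostport.to_string(), 5432), // default Postgres port
        };
        Some(Dsn { user, password, host, port, db: db.to_string() })
    }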

30 unit tests | 11 crates | 16 ADRs | 2.47M rows | 100K vectors | All built 2026-03-27; HNSW trial system 2026-04-16