Implements PRD invariant 9 ("every reader gets its own profile") and
completes the multi-model substrate vision. Local models (or agents)
bind to a named set of datasets; activation pre-loads their vector
indexes into memory; search enforces scope.
Schema (shared::types):
- ModelProfile { id, ollama_name, description, bound_datasets,
hnsw_config, embed_model, created_at, created_by }
- ProfileHnswConfig mirrors vectord::trial::HnswConfig to avoid a
cross-crate dep cycle. Default (ec=80, es=30) matches the Phase 15
trial winner.
- bound_datasets can reference raw dataset names OR AiView names
(both register as DataFusion tables with the same name, so mixing
raw tables and PII-redacted views composes naturally)
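A minimal Rust sketch of the schema above, for orientation only. The field names come from the list; the field types (`String`, `Vec<String>`, `usize`) are assumptions, the real structs likely carry serde derives, and the sketch omits the `seed` parameter and any others that the mirrored `HnswConfig` may have.

```rust
/// Mirror of the vector crate's HNSW settings (types assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct ProfileHnswConfig {
    pub ef_construction: usize,
    pub ef_search: usize,
}

impl Default for ProfileHnswConfig {
    fn default() -> Self {
        // ec=80, es=30: the Phase 15 trial winner named in the text.
        Self { ef_construction: 80, ef_search: 30 }
    }
}

/// One profile per reader; `bound_datasets` may name raw datasets or AiViews.
#[derive(Debug, Clone, PartialEq)]
pub struct ModelProfile {
    pub id: String,
    pub ollama_name: String,
    pub description: String,
    pub bound_datasets: Vec<String>,
    pub hnsw_config: ProfileHnswConfig,
    pub embed_model: String,
    pub created_at: String, // timestamp representation is an assumption
    pub created_by: String,
}
```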
Catalog (catalogd::registry):
- put_profile validates id is a slug (alphanumeric + -_ only) and
every binding resolves to an existing dataset or view
- Persistence at _catalog/profiles/{id}.json, loaded on rebuild
- get_profile / list_profiles / delete_profile
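The two put_profile checks can be sketched as follows. This is an illustration, not the registry code: `is_slug` and `validate_profile` are hypothetical names, restricting the slug to ASCII and rejecting an empty id are assumptions, and the set of known names stands in for the catalog's datasets and views.

```rust
use std::collections::HashSet;

/// True when `id` is non-empty and uses only ASCII alphanumerics, '-' and '_'.
pub fn is_slug(id: &str) -> bool {
    !id.is_empty()
        && id
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}

/// Reject a profile whose id is not a slug or whose bindings do not resolve.
pub fn validate_profile(
    id: &str,
    bound_datasets: &[String],
    known_names: &HashSet<String>, // dataset and view names together
) -> Result<(), String> {
    if !is_slug(id) {
        return Err(format!("profile id '{}' is not a slug", id));
    }
    for binding in bound_datasets {
        if !known_names.contains(binding) {
            return Err(format!("binding '{}' is not a known dataset or view", binding));
        }
    }
    Ok(())
}
```

Because raw datasets and views share one namespace here, a profile can mix both kinds of binding without a separate code path.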
HTTP endpoints:
- POST /catalog/profiles (create/update)
- GET /catalog/profiles (list)
- GET/DELETE /catalog/profiles/{id}
- POST /vectors/profile/{id}/activate (HNSW hot-load)
- POST /vectors/profile/{id}/search (scope-enforced)
Activation (vectord::service::activate_profile):
- For each bound dataset, find vector indexes with matching source
- Pre-load embeddings into EmbeddingCache
- Build HNSW with profile's config
- Report warmed indexes + per-binding failures + duration
- Failures on individual bindings don't abort activation ("substrate
keeps working" per ADR-017)
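The "collect failures, keep going" behaviour described above might look like the sketch below. `ActivationReport`, `activate`, and the `warm` callback are hypothetical names; the callback stands in for the cache pre-load and HNSW build for one bound dataset.

```rust
use std::time::{Duration, Instant};

/// What an activation call reports back (shape assumed from the list above).
pub struct ActivationReport {
    pub warmed: Vec<String>,             // index names that are now hot
    pub failures: Vec<(String, String)>, // (binding, error message)
    pub duration: Duration,
}

/// Warm every binding; a failure is recorded and the loop continues.
pub fn activate<F>(bound_datasets: &[String], mut warm: F) -> ActivationReport
where
    F: FnMut(&str) -> Result<Vec<String>, String>,
{
    let start = Instant::now();
    let mut warmed = Vec::new();
    let mut failures = Vec::new();
    for dataset in bound_datasets {
        match warm(dataset) {
            Ok(mut indexes) => warmed.append(&mut indexes),
            Err(err) => failures.push((dataset.clone(), err)),
        }
    }
    ActivationReport { warmed, failures, duration: start.elapsed() }
}
```

Returning the failures alongside the warmed indexes lets the caller see a partially activated profile instead of an all-or-nothing error.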
Scoped search (vectord::service::profile_scoped_search):
- Look up profile, verify index.source ∈ profile.bound_datasets
- Returns 403 with allowed bindings list if out-of-scope
- Uses HNSW if index is warm, brute-force cosine otherwise (graceful
degradation — no "must activate first" friction)
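As a sketch of the decision described above (scope check first, then HNSW or brute force), assuming a hypothetical `plan_search` helper. The error wording here is illustrative rather than the service's exact message.

```rust
#[derive(Debug, PartialEq)]
pub enum SearchMethod {
    Hnsw,
    BruteForce,
}

/// Decide how to serve a search, or refuse it with an HTTP-style status.
pub fn plan_search(
    profile_id: &str,
    bound_datasets: &[String],
    index_source: &str,
    index_is_warm: bool,
) -> Result<SearchMethod, (u16, String)> {
    // Scope enforcement: the index's source must be one of the bindings.
    if !bound_datasets.iter().any(|b| b == index_source) {
        let message = format!(
            "profile '{}' is not bound to '{}'; allowed bindings: {:?}",
            profile_id, index_source, bound_datasets
        );
        return Err((403, message));
    }
    // Graceful degradation: a cold index is still searchable, just slower.
    Ok(if index_is_warm {
        SearchMethod::Hnsw
    } else {
        SearchMethod::BruteForce
    })
}
```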
Bug fix surfaced during testing: vectord::refresh::try_update_index_meta
was a no-op for first-time indexes, so threat_intel_v1 and
kb_team_runs_v1 (both built via refresh after Phase C shipped) didn't
show up in the index registry. Now it auto-infers the source from the
index name convention (`{source}_vN`) and registers new metadata with
reasonable defaults.
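The `{source}_vN` inference can be sketched like this. `infer_source` is a hypothetical name, and treating N as one or more ASCII digits is an assumption about the convention.

```rust
/// Infer the source dataset from an index name of the form `{source}_vN`.
/// Returns None when the name does not end in `_v` followed by digits.
pub fn infer_source(index_name: &str) -> Option<String> {
    let (source, version) = index_name.rsplit_once("_v")?;
    let version_ok = !version.is_empty() && version.chars().all(|c| c.is_ascii_digit());
    if source.is_empty() || !version_ok {
        return None;
    }
    Some(source.to_string())
}
```

The inference only makes sense as a fallback for first-time indexes: an index such as resumes_100k_v2 has source=candidates, so already-registered metadata should take precedence over the name.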
End-to-end verified:
- Created security-analyst profile bound to [threat_intel]
- POST /vectors/profile/security-analyst/activate → warmed
threat_intel_v1 (54 vectors) in 156ms, HNSW built
- Within-scope search: method=hnsw, returned relevant IP indicators
- Out-of-scope: tried to search resumes_100k_v2 (source=candidates)
→ 403 "profile 'security-analyst' is not bound to 'candidates' —
allowed bindings: ["threat_intel"]"
- staffing-recruiter profile created bound to candidates + placements;
search without activation fell through to brute_force (graceful)
Deferred (Phase 17 followups):
- VRAM-aware activation (unload-then-load via Ollama keep_alive=0):
Ollama already handles this, so we don't need to reinvent it
- Model-identity in audit trail — Phase 13 has role-based audit;
adding model_id is ~20 LOC when we want it
- Profile bucket pre-load (profile:user bucket mount) — Phase 17.5
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Phase Tracker
## Phase 0: Bootstrap ✅
- [x] Cargo workspace with all crate stubs compiling
- [x] `shared` crate: error types, ObjectRef, DatasetId
- [x] `gateway` with Axum: GET /health → 200
- [x] tracing + tracing-subscriber wired in gateway
- [x] justfile with build, test, run recipes
- [x] docs committed to git

## Phase 1: Storage + Catalog ✅
- [x] storaged: object_store backend init (LocalFileSystem)
- [x] storaged: Axum endpoints (PUT/GET/DELETE/LIST)
- [x] shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
- [x] catalogd/registry.rs: in-memory index + manifest persistence
- [x] catalogd service: POST/GET /datasets + by-name
- [x] gateway routes wired

## Phase 2: Query Engine ✅
- [x] queryd: SessionContext + object_store config
- [x] queryd: ListingTable from catalog ObjectRefs
- [x] queryd service: POST /query/sql → JSON
- [x] queryd → catalogd wiring
- [x] gateway routes /query

## Phase 3: AI Integration ✅
- [x] Python sidecar: FastAPI + Ollama (embed/generate/rerank)
- [x] Dockerfile for sidecar
- [x] aibridge/client.rs: HTTP client
- [x] aibridge service: Axum proxy endpoints
- [x] Model config via env vars

## Phase 4: Frontend ✅
- [x] Dioxus scaffold, WASM build
- [x] Ask tab: natural language → AI SQL → results
- [x] Explore tab: dataset browser + AI summary
- [x] SQL tab: raw DataFusion editor
- [x] System tab: health checks for all services

## Phase 5: Hardening ✅
- [x] Proto definitions (lakehouse.proto)
- [x] Internal gRPC: CatalogService on :3101
- [x] OpenTelemetry tracing: stdout exporter
- [x] Auth middleware: X-API-Key (toggleable)
- [x] Config-driven startup: lakehouse.toml

## Phase 6: Ingest Pipeline ✅
- [x] CSV ingest with auto schema detection
- [x] JSON ingest (array + newline-delimited, nested flattening)
- [x] PDF text extraction (lopdf)
- [x] Text/SMS file ingest
- [x] Content hash dedup (SHA-256)
- [x] POST /ingest/file multipart upload
- [x] 12 unit tests

## Phase 7: Vector Index + RAG ✅
- [x] chunker: configurable size + overlap, sentence-boundary aware
- [x] store: embeddings as Parquet (binary f32 vectors)
- [x] search: brute-force cosine similarity
- [x] rag: embed → search → retrieve → LLM answer with citations
- [x] POST /vectors/index, /search, /rag
- [x] Background job system with progress tracking
- [x] Dual-pipeline supervisor with checkpointing + retry
- [x] 100K embeddings: 177/sec on A4000, zero failures
- [x] 6 unit tests

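For readers unfamiliar with the brute-force search listed in Phase 7, here is a minimal sketch of cosine-similarity top-k over in-memory f32 vectors. The names `cosine` and `top_k` are illustrative, and returning 0.0 for mismatched lengths or zero vectors is an assumption, not the crate's documented behaviour.

```rust
/// Cosine similarity of two vectors; 0.0 for mismatched or zero-length input.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Score every stored vector against the query and keep the k best.
pub fn top_k(query: &[f32], vectors: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine(query, v)))
        .collect();
    // Highest similarity first; total_cmp gives a total order over f32.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

Every query touches every vector, which is why this path is simple and exact but scales linearly with index size.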
## Phase 8: Hot Cache + Incremental Updates ✅
- [x] MemTable hot cache: LRU, configurable max (16GB)
- [x] POST /query/cache/pin, /cache/evict, GET /cache/stats
- [x] Delta store: append-only delta Parquet files
- [x] Merge-on-read: queries combine base + deltas
- [x] Compaction: POST /query/compact
- [x] Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)

## Phase 8.5: Agent Workspaces ✅
- [x] WorkspaceManager with daily/weekly/monthly/pinned tiers
- [x] Saved searches, shortlists, activity logs per workspace
- [x] Instant zero-copy handoff between agents
- [x] Persistence to object storage, rebuild on startup

## Phase 9: Event Journal ✅
- [x] journald crate: append-only mutation log
- [x] Event schema: entity, field, old/new value, actor, source, workspace
- [x] In-memory buffer with auto-flush to Parquet
- [x] GET /journal/history/{entity_id}, /recent, /stats
- [x] POST /journal/event, /update, /flush

## Phase 10: Rich Catalog v2 ✅
- [x] DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
- [x] PII auto-detection: email, phone, SSN, salary, address, medical
- [x] Column-level metadata with sensitivity flags
- [x] Lineage tracking: source_system → ingest_job → dataset
- [x] PATCH /catalog/datasets/by-name/{name}/metadata
- [x] Backward compatible (serde default)

## Phase 11: Embedding Versioning ✅
- [x] IndexRegistry: model_name, model_version, dimensions per index
- [x] Index metadata persisted as JSON, rebuilt on startup
- [x] GET /vectors/indexes: list all (filter by source/model)
- [x] GET /vectors/indexes/{name}: metadata
- [x] Background jobs auto-register metadata on completion

## Phase 12: Tool Registry ✅
- [x] 6 built-in staffing tools (search_candidates, get_candidate, revenue_by_client, recruiter_performance, cold_leads, open_jobs)
- [x] Parameter validation + SQL template substitution
- [x] Permission levels: read / write / admin
- [x] Full audit trail per invocation
- [x] GET /tools, GET /tools/{name}, POST /tools/{name}/call, GET /tools/audit

## Phase 13: Security & Access Control ✅
- [x] Role-based access: admin, recruiter, analyst, agent
- [x] Field-level sensitivity enforcement
- [x] Column masking determination per agent
- [x] Query audit logging
- [x] GET/POST /access/roles, GET /access/audit, POST /access/check

## Phase 14: Schema Evolution ✅
- [x] Schema diff detection (added, removed, type changed, renamed)
- [x] Fuzzy rename detection (shared word parts)
- [x] Auto-generated migration rules with confidence scores
- [x] AI migration prompt builder for complex cases
- [x] 5 unit tests

## Phase 15+: Horizon
- [x] HNSW vector index with iteration-friendly trial system (2026-04-16)
  - `HnswStore.build_index_with_config`: parameterized ef_construction, ef_search, seed
  - `EmbeddingCache`: pins 100K vectors in memory, shared across trials
  - `harness::EvalSet`: named query sets with brute-force ground truth
  - `TrialJournal`: append-only JSONL at `_hnsw_trials/{index}.jsonl`
  - Endpoints: `/vectors/hnsw/trial`, `/hnsw/trials/{idx}`, `/hnsw/trials/{idx}/best?metric={recall|latency|pareto}`, `/hnsw/evals`, `/hnsw/evals/{name}/autogen`, `/hnsw/cache/stats`
  - Measured on 100K resumes: **brute-force 44-54ms → HNSW 509us-1830us**, recall 0.92-1.00 depending on `ef_construction`. Sweet spot: ec=80 es=30 → p50=873us recall=1.00, locked in as `HnswConfig::default()`
- [x] Catalog manifest repair: `POST /catalog/resync-missing` restores row_count and columns from parquet footers (2026-04-16). All 7 staffing tables recovered to PRD-matching 2.47M rows.
- [~] Federated multi-bucket query: **foundation complete 2026-04-16**, see ADR-017
  - [x] `StorageConfig.buckets` + `rescue_bucket` + `profile_root` config shape
  - [x] `SecretsProvider` trait + `FileSecretsProvider` (reads /etc/lakehouse/secrets.toml, checks 0600 perms)
  - [x] `storaged::BucketRegistry`: multi-backend, rescue-aware, reachability probes
  - [x] `storaged::error_journal::ErrorJournal`: append-only JSONL at `primary://_errors/bucket_errors.jsonl`
  - [x] Endpoints: `GET /storage/buckets`, `GET /storage/errors`, `GET /storage/bucket-health`
  - [x] Bucket-aware I/O: `PUT/GET /storage/buckets/{bucket}/objects/{*key}` with rescue fallback + `X-Lakehouse-Rescue-Used` observability headers
  - [x] Backward compat: empty `[[storage.buckets]]` synthesizes a `primary` from legacy `root`
  - [x] Three-bucket test (primary + rescue + testing) verified: normal reads, rescue fallback with headers, hard-fail missing, write to unknown bucket 503, error journal + health summary
  - [x] `X-Lakehouse-Bucket` header middleware on ingest endpoints (2026-04-16)
  - [x] Catalog migration: `POST /catalog/migrate-buckets` stamps `bucket = "primary"` on legacy refs (12 renamed, 14 total now canonical)
  - [x] `queryd` registers every bucket with DataFusion for cross-bucket SQL; verified with people_test (testing) × animals (primary) CROSS JOIN
  - [ ] Profile hot-load endpoints: `POST /profile/{user}/activate|deactivate` (deferred to Phase 17)
  - [ ] `vectord` bucket-scoped paths (trial journals, eval sets per-bucket) (deferred to Phase 17)
- [x] Database connector ingest, Postgres first (2026-04-16)
  - `pg_stream::stream_table_to_parquet`: ORDER BY + LIMIT/OFFSET pagination, configurable batch_size
  - `parse_dsn`: postgresql:// and postgres:// URL scheme, user/password/host/port/db
  - `POST /ingest/db` endpoint: `{dsn, table, dataset_name?, batch_size?, order_by?, limit?}` → streams to Parquet, registers in catalog with PII detection + redacted-password lineage
  - Existing `POST /ingest/postgres/import` (structured config) preserved alongside
  - 4 DSN-parser unit tests + live end-to-end test against `knowledge_base.team_runs` (586 rows, 13 cols, 6 batches, 196ms)
- [x] Phase B: Lance storage evaluation (2026-04-16)
  - `crates/lance-bench` standalone pilot (Lance 4.0) avoids DataFusion/Arrow version conflict with main stack
  - 8-dimension benchmark on resumes_100k_v2; see docs/ADR-019-vector-storage.md for scorecard
  - Decision: hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance added as per-profile second backend for random access (112× faster), append (0.08s vs full rewrite), hot-swap (14× faster index builds), and scale past 5M RAM ceiling.
- [x] Phase 17: Model profiles + scoped search (2026-04-16)
  - `shared::types::ModelProfile`: { id, ollama_name, description, bound_datasets, hnsw_config, embed_model, created_at, created_by }
  - `shared::types::ProfileHnswConfig`: mirror of vectord's HnswConfig to avoid cross-crate dep cycle (defaults ec=80 es=30 matching Phase 15 winner)
  - `Registry::{put_profile, get_profile, list_profiles, delete_profile}` persisted at `_catalog/profiles/{id}.json`, validates bindings exist (raw dataset OR AiView)
  - Endpoints: `POST/GET /catalog/profiles`, `GET/DELETE /catalog/profiles/{id}`
  - `POST /vectors/profile/{id}/activate`: warms EmbeddingCache + builds HNSW with profile's config for every bound dataset's vector index; reports warmed indexes + failures + duration
  - `POST /vectors/profile/{id}/search`: rejects with 403 if the requested index's source isn't in profile.bound_datasets; uses HNSW if the index is warm, brute-force otherwise
  - Fixed refresh to register new index metadata (was silently a no-op for first-time indexes)
  - End-to-end: security-analyst profile bound to threat_intel → activate warms 54 vectors in 156ms → within-scope HNSW search works (0.625 score); out-of-scope search for candidates returns 403 with allowed bindings listed
- [x] Phase E: Soft deletes (tombstones) (2026-04-16)
  - `shared::types::Tombstone`: { dataset, row_key_column, row_key_value, deleted_at, actor, reason }
  - `catalogd::tombstones::TombstoneStore` per-dataset append-log at `_catalog/tombstones/{dataset}/`, flush_threshold=1 + explicit flush so every tombstone is durable on return (compliance requirement)
  - All tombstones for a dataset must share the same `row_key_column` (validated at write, because the query filter is built as a single WHERE NOT IN clause)
  - `Registry::add_tombstone / list_tombstones`
  - Endpoint: `POST /catalog/datasets/by-name/{name}/tombstone` accepting `{row_key_column, row_key_values[], actor, reason}`; companion `GET` lists active tombstones
  - `queryd::context::build_context` wraps tombstoned tables: raw goes to `__raw__{name}`, public name becomes a DataFusion view with `WHERE CAST(col AS VARCHAR) NOT IN (...)` filter
  - End-to-end on candidates: tombstone 3 IDs, COUNT drops 100,000 → 99,997, specific WHERE returns empty, AiView candidates_safe transitively excludes them too, restart preserves all tombstones
  - Limits / not in MVP: physical compaction (Phase 8 doesn't yet read tombstones during merge); journal integration (tombstones don't yet emit Phase 9 mutation events, but the audit fields on the tombstone itself cover this)
- [x] Phase D: AI-safe views (2026-04-16)
  - `shared::types::AiView`: name, base_dataset, columns whitelist, optional row_filter, column_redactions
  - `shared::types::Redaction`: Null | Hash | Mask { keep_prefix, keep_suffix }
  - `Registry::put_view / get_view / list_views / delete_view` persisted to `_catalog/views/{name}.json`
  - `queryd::context` registers each view as a DataFusion view with the safe projection + filter + redactions baked into the SELECT
  - Endpoints: `POST/GET /catalog/views`, `GET/DELETE /catalog/views/{name}`
  - End-to-end on candidates: `candidates_safe` view exposes 8 of 15 columns, masks `candidate_id` (CAN******01), filters out `status='blocked'`. `SELECT * FROM candidates_safe` returns whitelist only; `SELECT email FROM candidates_safe` fails. View survives restart.
  - Capability surface: raw `candidates` still accessible by name; Phase 13 access control is the layer that enforces who can query what
- [x] Phase C: Decoupled embedding refresh (2026-04-16)
  - `DatasetManifest`: `last_embedded_at`, `embedding_stale_since`, `embedding_refresh_policy` (Manual | OnAppend | Scheduled)
  - `Registry::mark_embeddings_stale` / `clear_embeddings_stale` / `stale_datasets`
  - Ingest paths (CSV pipeline + Postgres streaming) auto-mark-stale when writing to an already-embedded dataset
  - `vectord::refresh::refresh_index`: reads dataset, diffs doc_ids vs existing embeddings, embeds only new rows, writes combined index, clears stale
  - `POST /vectors/refresh/{dataset}` + `GET /vectors/stale`
  - Id columns accept `Utf8`, `Int32`, `Int64`
  - End-to-end on threat_intel: initial 20-row embed 2.1s; re-ingest to 54 rows auto-marks stale; delta refresh embeds only 34 new in 970ms (6× faster than full re-embed); stale cleared
- [ ] Database connector ingest (Postgres/MySQL)
- [ ] PDF OCR (Tesseract)
- [ ] Scheduled ingest (cron)
- [ ] Fine-tuned domain models
- [ ] Multi-node query distribution
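The Phase E tombstone filter in the list above (raw table at `__raw__{name}`, public name as a view with a single `NOT IN` clause) could be generated roughly as follows. This is a sketch under assumptions: `tombstone_view_sql` and `sql_literal` are hypothetical helpers, the identifier quoting is illustrative, and string escaping by doubling single quotes is the standard SQL convention rather than something the tracker states. With an empty tombstone list the view simply selects everything from the raw table.

```rust
/// Quote a value as a SQL string literal, doubling embedded single quotes.
fn sql_literal(value: &str) -> String {
    format!("'{}'", value.replace('\'', "''"))
}

/// Build the SELECT behind the public view of a tombstoned dataset.
pub fn tombstone_view_sql(name: &str, key_column: &str, deleted_keys: &[String]) -> String {
    let raw = format!("__raw__{}", name);
    if deleted_keys.is_empty() {
        return format!("SELECT * FROM \"{}\"", raw);
    }
    // All tombstones share one key column, so one NOT IN list is enough.
    let list = deleted_keys
        .iter()
        .map(|k| sql_literal(k))
        .collect::<Vec<_>>()
        .join(", ");
    format!(
        "SELECT * FROM \"{}\" WHERE CAST(\"{}\" AS VARCHAR) NOT IN ({})",
        raw, key_column, list
    )
}
```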
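The Phase D `Mask { keep_prefix, keep_suffix }` redaction in the list above can be illustrated with a small function. `mask` is a hypothetical name, and masking the whole value when it is too short to keep both ends is an assumption. An 11-character id with keep_prefix=3 and keep_suffix=2 yields the `CAN******01` shape quoted in the tracker; the input id in the comment is made up.

```rust
/// Keep the first `keep_prefix` and last `keep_suffix` characters and
/// replace everything in between with '*'.
pub fn mask(value: &str, keep_prefix: usize, keep_suffix: usize) -> String {
    let chars: Vec<char> = value.chars().collect();
    let n = chars.len();
    if keep_prefix + keep_suffix >= n {
        // Nothing would be hidden, so hide everything (assumed policy).
        return "*".repeat(n);
    }
    let mut out = String::with_capacity(value.len());
    out.extend(chars[..keep_prefix].iter());
    out.extend(std::iter::repeat('*').take(n - keep_prefix - keep_suffix));
    out.extend(chars[n - keep_suffix..].iter());
    out
}

// Example with a made-up id: mask("CAN12345601", 3, 2) gives "CAN******01".
```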
---
**30 unit tests | 11 crates | 16 ADRs | 2.47M rows | 100K vectors | All built 2026-03-27**
**HNSW trial system: 2026-04-16**