Implements the llms3.com-inspired pattern: embeddings refresh
asynchronously, decoupled from transactional row writes. New rows arrive,
ingest marks the vector index stale, a later refresh embeds only the
delta (doc_ids not already in the index).
Schema additions (DatasetManifest):
- last_embedded_at: Option<DateTime> - when the index was last refreshed
- embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh
- embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled
Ingest paths (pipeline::ingest_file + pg_stream) call
registry.mark_embeddings_stale after writing. No-op if the dataset has
never been embedded — stale semantics only kick in once last_embedded_at
is set.
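The stale-flag rule described above can be sketched as follows. This is a minimal illustration, not the registry code: `EmbeddingState`, `mark_stale`, and `clear_stale` are hypothetical names, `SystemTime` stands in for the manifest's `DateTime`, and keeping the earliest stale timestamp across repeated writes is an assumption.

```rust
use std::time::SystemTime;

/// Refresh policy variants as named in the schema notes.
#[derive(Debug, Clone, PartialEq)]
pub enum RefreshPolicy {
    Manual,
    OnAppend,
    /// Cron expression; it only declares intent, nothing here schedules it.
    Scheduled(String),
}

/// Only the embedding-related manifest fields (hypothetical struct).
#[derive(Debug, Default)]
pub struct EmbeddingState {
    pub last_embedded_at: Option<SystemTime>,
    pub embedding_stale_since: Option<SystemTime>,
    pub embedding_refresh_policy: Option<RefreshPolicy>,
}

impl EmbeddingState {
    /// Called after an ingest write. Returns true if the dataset is stale afterwards.
    pub fn mark_stale(&mut self, now: SystemTime) -> bool {
        if self.last_embedded_at.is_none() {
            // Never embedded: there is no index that could be out of date.
            return false;
        }
        // Assumption: repeated writes keep the earliest stale timestamp.
        if self.embedding_stale_since.is_none() {
            self.embedding_stale_since = Some(now);
        }
        true
    }

    /// Called when a refresh finishes: record the refresh time, drop the flag.
    pub fn clear_stale(&mut self, now: SystemTime) {
        self.last_embedded_at = Some(now);
        self.embedding_stale_since = None;
    }
}
```

Under these assumptions a dataset that has never been refreshed ignores `mark_stale`, and the first `clear_stale` is what arms the flag for later writes.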
Refresh pipeline (vectord::refresh::refresh_index):
- Reads the dataset Parquet, extracts (doc_id, text) pairs
- Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas)
- Loads existing embeddings via EmbeddingCache (empty on first-time build)
- Filters to rows whose doc_id is NOT in the existing set
- Chunks (chunker::chunk_column), embeds via Ollama (batches of 32),
writes combined index, clears stale flag
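The delta step of the refresh pipeline above might look roughly like this. It is a sketch on plain Rust collections rather than Arrow arrays: `IdValue`, `delta_rows`, and `batches` are hypothetical names, and normalizing every id to a string before comparison is an assumption.

```rust
use std::collections::HashSet;

/// Id column value; the refresh accepts Utf8, Int32 and Int64 ids.
#[derive(Debug, Clone)]
pub enum IdValue {
    Utf8(String),
    Int32(i32),
    Int64(i64),
}

impl IdValue {
    /// Assumption: ids are compared as strings once normalized.
    pub fn to_doc_id(&self) -> String {
        match self {
            IdValue::Utf8(s) => s.clone(),
            IdValue::Int32(n) => n.to_string(),
            IdValue::Int64(n) => n.to_string(),
        }
    }
}

/// Keep only rows whose doc_id is absent from the existing index.
pub fn delta_rows(
    rows: &[(IdValue, String)],
    existing: &HashSet<String>,
) -> Vec<(String, String)> {
    rows.iter()
        .map(|(id, text)| (id.to_doc_id(), text.clone()))
        .filter(|(doc_id, _)| !existing.contains(doc_id))
        .collect()
}

/// Split the delta into embedding batches (the notes above use 32 per batch).
pub fn batches(delta: &[(String, String)], batch_size: usize) -> Vec<&[(String, String)]> {
    // Guard against a zero batch size, which would make `chunks` panic.
    delta.chunks(batch_size.max(1)).collect()
}
```

With 54 rows and 20 ids already indexed, `delta_rows` yields 34 rows, which `batches(&delta, 32)` splits into one batch of 32 and one of 2. An idempotent second refresh produces an empty delta, so no embedding calls are needed.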
Endpoints:
- POST /vectors/refresh/{dataset_name} - body {index_name, id_column,
text_column, chunk_size?, overlap?}
- GET /vectors/stale - lists datasets whose embedding_stale_since is set
End-to-end verified on threat_intel (knowledge_base.threat_intel):
- Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s,
last_embedded_at set
- Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check)
- Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set
- /vectors/stale surfaces threat_intel with timestamps + policy
- Delta refresh: 34 new docs embedded in 970ms (6x faster than full
re-embed); stale_cleared = true
Not in MVP scope:
- UPDATE semantics (same doc_id, different content) - would need
per-row content hashing
- OnAppend policy auto-trigger - just declares intent; actual scheduler
deferred
- Scheduler runtime - the Scheduled(cron) variant declares the intent so
operators can see which datasets expect what, but the cron itself is
separate
Per ADR-019: when a profile switches to vector_backend=Lance, this
refresh path benefits — Lance's native append replaces our "read all +
rewrite" Parquet rebuild pattern. Current MVP works well enough at
~500-5K rows to validate the architecture; Lance unblocks the 5M+ case.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase Tracker
Phase 0: Bootstrap ✅
- Cargo workspace with all crate stubs compiling
- shared crate: error types, ObjectRef, DatasetId
- gateway with Axum: GET /health → 200
- tracing + tracing-subscriber wired in gateway
- justfile with build, test, run recipes
- docs committed to git
Phase 1: Storage + Catalog ✅
- storaged: object_store backend init (LocalFileSystem)
- storaged: Axum endpoints (PUT/GET/DELETE/LIST)
- shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
- catalogd/registry.rs: in-memory index + manifest persistence
- catalogd service: POST/GET /datasets + by-name
- gateway routes wired
Phase 2: Query Engine ✅
- queryd: SessionContext + object_store config
- queryd: ListingTable from catalog ObjectRefs
- queryd service: POST /query/sql → JSON
- queryd → catalogd wiring
- gateway routes /query
Phase 3: AI Integration ✅
- Python sidecar: FastAPI + Ollama (embed/generate/rerank)
- Dockerfile for sidecar
- aibridge/client.rs: HTTP client
- aibridge service: Axum proxy endpoints
- Model config via env vars
Phase 4: Frontend ✅
- Dioxus scaffold, WASM build
- Ask tab: natural language → AI SQL → results
- Explore tab: dataset browser + AI summary
- SQL tab: raw DataFusion editor
- System tab: health checks for all services
Phase 5: Hardening ✅
- Proto definitions (lakehouse.proto)
- Internal gRPC: CatalogService on :3101
- OpenTelemetry tracing: stdout exporter
- Auth middleware: X-API-Key (toggleable)
- Config-driven startup: lakehouse.toml
Phase 6: Ingest Pipeline ✅
- CSV ingest with auto schema detection
- JSON ingest (array + newline-delimited, nested flattening)
- PDF text extraction (lopdf)
- Text/SMS file ingest
- Content hash dedup (SHA-256)
- POST /ingest/file multipart upload
- 12 unit tests
Phase 7: Vector Index + RAG ✅
- chunker: configurable size + overlap, sentence-boundary aware
- store: embeddings as Parquet (binary f32 vectors)
- search: brute-force cosine similarity
- rag: embed → search → retrieve → LLM answer with citations
- POST /vectors/index, /search, /rag
- Background job system with progress tracking
- Dual-pipeline supervisor with checkpointing + retry
- 100K embeddings: 177/sec on A4000, zero failures
- 6 unit tests
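The brute-force cosine search listed above can be illustrated with a short sketch. The function names `cosine` and `top_k` are hypothetical, vectors are held as plain `Vec<f32>` instead of being read from Parquet, and treating a zero-norm vector as similarity 0.0 is an assumption.

```rust
/// Cosine similarity; returns 0.0 when either vector has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Score every stored vector against the query and return the best k
/// as (index, score) pairs, highest score first.
pub fn top_k(query: &[f32], vectors: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine(query, v)))
        .collect();
    // total_cmp gives a total order on f32, so NaN cannot break the sort.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

Every query scans all vectors, so cost grows linearly with index size. That is the baseline the HNSW work under Phase 15+ is measured against.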
Phase 8: Hot Cache + Incremental Updates ✅
- MemTable hot cache: LRU, configurable max (16GB)
- POST /query/cache/pin, /cache/evict, GET /cache/stats
- Delta store: append-only delta Parquet files
- Merge-on-read: queries combine base + deltas
- Compaction: POST /query/compact
- Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)
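Merge-on-read and compaction, as listed above, can be sketched with in-memory vectors standing in for Parquet files. `DeltaTable` and its methods are hypothetical names, and rows are reduced to `i64` values purely for illustration.

```rust
/// A table as one base batch plus append-only delta batches.
/// Rows are plain i64 here; the real store holds Parquet files.
#[derive(Debug, Default)]
pub struct DeltaTable {
    base: Vec<i64>,
    deltas: Vec<Vec<i64>>,
}

impl DeltaTable {
    /// Appending never touches the base; it only adds a new delta batch.
    pub fn append(&mut self, batch: Vec<i64>) {
        self.deltas.push(batch);
    }

    /// Merge-on-read: a scan sees the base followed by every delta, in order.
    pub fn scan(&self) -> impl Iterator<Item = &i64> + '_ {
        self.base.iter().chain(self.deltas.iter().flatten())
    }

    /// Compaction folds all deltas into the base and leaves no deltas behind.
    pub fn compact(&mut self) {
        for batch in self.deltas.drain(..) {
            self.base.extend(batch);
        }
    }

    pub fn delta_count(&self) -> usize {
        self.deltas.len()
    }
}
```

A scan returns the same rows before and after `compact`. Only the number of batches it has to walk changes, which is the trade the compaction endpoint makes.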
Phase 8.5: Agent Workspaces ✅
- WorkspaceManager with daily/weekly/monthly/pinned tiers
- Saved searches, shortlists, activity logs per workspace
- Instant zero-copy handoff between agents
- Persistence to object storage, rebuild on startup
Phase 9: Event Journal ✅
- journald crate: append-only mutation log
- Event schema: entity, field, old/new value, actor, source, workspace
- In-memory buffer with auto-flush to Parquet
- GET /journal/history/{entity_id}, /recent, /stats
- POST /journal/event, /update, /flush
Phase 10: Rich Catalog v2 ✅
- DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
- PII auto-detection: email, phone, SSN, salary, address, medical
- Column-level metadata with sensitivity flags
- Lineage tracking: source_system → ingest_job → dataset
- PATCH /catalog/datasets/by-name/{name}/metadata
- Backward compatible (serde default)
Phase 11: Embedding Versioning ✅
- IndexRegistry: model_name, model_version, dimensions per index
- Index metadata persisted as JSON, rebuilt on startup
- GET /vectors/indexes — list all (filter by source/model)
- GET /vectors/indexes/{name} — metadata
- Background jobs auto-register metadata on completion
Phase 12: Tool Registry ✅
- 6 built-in staffing tools (search_candidates, get_candidate, revenue_by_client, recruiter_performance, cold_leads, open_jobs)
- Parameter validation + SQL template substitution
- Permission levels: read / write / admin
- Full audit trail per invocation
- GET /tools, GET /tools/{name}, POST /tools/{name}/call, GET /tools/audit
Phase 13: Security & Access Control ✅
- Role-based access: admin, recruiter, analyst, agent
- Field-level sensitivity enforcement
- Column masking determination per agent
- Query audit logging
- GET/POST /access/roles, GET /access/audit, POST /access/check
Phase 14: Schema Evolution ✅
- Schema diff detection (added, removed, type changed, renamed)
- Fuzzy rename detection (shared word parts)
- Auto-generated migration rules with confidence scores
- AI migration prompt builder for complex cases
- 5 unit tests
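One way to read "fuzzy rename detection (shared word parts)" is a set-overlap score between the word parts of a removed column and an added column. The sketch below assumes that reading: the Jaccard score, the `_`/`-` split, the threshold, and the names `rename_confidence` and `detect_renames` are all illustrative choices, not the tracker's actual rules.

```rust
use std::collections::HashSet;

/// Split a column name into lowercase word parts on '_' and '-'.
fn parts(name: &str) -> HashSet<String> {
    name.split(|c: char| c == '_' || c == '-')
        .filter(|p| !p.is_empty())
        .map(|p| p.to_lowercase())
        .collect()
}

/// Confidence in [0, 1]: shared parts over all distinct parts (Jaccard index).
pub fn rename_confidence(old: &str, new: &str) -> f64 {
    let (a, b) = (parts(old), parts(new));
    let union = a.union(&b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(&b).count() as f64 / union as f64
}

/// For each removed column, pick the added column with the best score
/// at or above the threshold. Returns (old, new, confidence) triples.
pub fn detect_renames(
    removed: &[&str],
    added: &[&str],
    threshold: f64,
) -> Vec<(String, String, f64)> {
    let mut out = Vec::new();
    for old in removed {
        let best = added
            .iter()
            .map(|new| (*new, rename_confidence(old, new)))
            .filter(|(_, score)| *score >= threshold)
            .max_by(|x, y| x.1.total_cmp(&y.1));
        if let Some((new, score)) = best {
            out.push((old.to_string(), new.to_string(), score));
        }
    }
    out
}
```

For example, `cust_email` and `customer_email` share one of three distinct parts, giving a confidence of about 0.33. Whether that should count as a rename depends on the threshold chosen.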
Phase 15+: Horizon
- HNSW vector index with iteration-friendly trial system (2026-04-16)
  - HnswStore.build_index_with_config: parameterized ef_construction, ef_search, seed
  - EmbeddingCache: pins 100K vectors in memory, shared across trials
  - harness::EvalSet: named query sets with brute-force ground truth
  - TrialJournal: append-only JSONL at _hnsw_trials/{index}.jsonl
  - Endpoints: /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best?metric={recall|latency|pareto}, /hnsw/evals, /hnsw/evals/{name}/autogen, /hnsw/cache/stats
  - Measured on 100K resumes: brute-force 44-54ms → HNSW 509us-1830us, recall 0.92-1.00 depending on ef_construction. Sweet spot: ec=80 es=30 → p50=873us recall=1.00, locked in as HnswConfig::default()
- Catalog manifest repair: POST /catalog/resync-missing restores row_count and columns from parquet footers (2026-04-16). All 7 staffing tables recovered to PRD-matching 2.47M rows.
- [~] Federated multi-bucket query: foundation complete 2026-04-16, see ADR-017
  - StorageConfig.buckets + rescue_bucket + profile_root config shape
  - SecretsProvider trait + FileSecretsProvider (reads /etc/lakehouse/secrets.toml, checks 0600 perms)
  - storaged::BucketRegistry: multi-backend, rescue-aware, reachability probes
  - storaged::error_journal::ErrorJournal: append-only JSONL at primary://_errors/bucket_errors.jsonl
  - Endpoints: GET /storage/buckets, GET /storage/errors, GET /storage/bucket-health
  - Bucket-aware I/O: PUT/GET /storage/buckets/{bucket}/objects/{*key} with rescue fallback + X-Lakehouse-Rescue-Used observability headers
  - Backward compat: empty [[storage.buckets]] synthesizes a primary from legacy root
  - Three-bucket test (primary + rescue + testing) verified: normal reads, rescue fallback with headers, hard-fail missing, write to unknown bucket 503, error journal + health summary
  - X-Lakehouse-Bucket header middleware on ingest/query/catalog endpoints
  - Catalog migration: set bucket = "primary" on every legacy ObjectRef
  - queryd registers every bucket with DataFusion for cross-bucket SQL
  - Profile hot-load endpoints: POST /profile/{user}/activate|deactivate
  - vectord bucket-scoped paths (trial journals, eval sets per-bucket)
- Database connector ingest (Postgres first) (2026-04-16)
  - pg_stream::stream_table_to_parquet: ORDER BY + LIMIT/OFFSET pagination, configurable batch_size
  - parse_dsn: postgresql:// and postgres:// URL scheme, user/password/host/port/db
  - POST /ingest/db endpoint: {dsn, table, dataset_name?, batch_size?, order_by?, limit?} → streams to Parquet, registers in catalog with PII detection + redacted-password lineage
  - Existing POST /ingest/postgres/import (structured config) preserved alongside
  - 4 DSN-parser unit tests + live end-to-end test against knowledge_base.team_runs (586 rows, 13 cols, 6 batches, 196ms)
- Phase B: Lance storage evaluation (2026-04-16)
  - crates/lance-bench standalone pilot (Lance 4.0) avoids DataFusion/Arrow version conflict with main stack
  - 8-dimension benchmark on resumes_100k_v2: see docs/ADR-019-vector-storage.md for scorecard
- Decision: hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance added as per-profile second backend for random access (112× faster), append (0.08s vs full rewrite), hot-swap (14× faster index builds), and scale past 5M RAM ceiling.
- Phase C: Decoupled embedding refresh (2026-04-16)
  - DatasetManifest: last_embedded_at, embedding_stale_since, embedding_refresh_policy (Manual | OnAppend | Scheduled)
  - Registry::mark_embeddings_stale / clear_embeddings_stale / stale_datasets
  - Ingest paths (CSV pipeline + Postgres streaming) auto-mark-stale when writing to an already-embedded dataset
  - vectord::refresh::refresh_index: reads dataset, diffs doc_ids vs existing embeddings, embeds only new rows, writes combined index, clears stale
  - POST /vectors/refresh/{dataset} + GET /vectors/stale
  - Id columns accept Utf8, Int32, Int64
  - End-to-end on threat_intel: initial 20-row embed 2.1s; re-ingest to 54 rows auto-marks stale; delta refresh embeds only 34 new in 970ms (6× faster than full re-embed); stale cleared
- Database connector ingest (Postgres/MySQL)
- PDF OCR (Tesseract)
- Scheduled ingest (cron)
- Fine-tuned domain models
- Multi-node query distribution
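The parse_dsn item in the Postgres connector entry above, together with its redacted-password lineage, could be approximated like this. It is a sketch under stated assumptions, not the pg_stream implementation: the `Dsn` struct and `redacted` method are hypothetical, the port falls back to the standard Postgres default of 5432, and percent-encoding, query parameters, and IPv6 hosts are not handled.

```rust
/// Connection parts pulled from a postgres:// or postgresql:// URL.
#[derive(Debug, PartialEq)]
pub struct Dsn {
    pub user: String,
    pub password: Option<String>,
    pub host: String,
    pub port: u16,
    pub database: String,
}

pub fn parse_dsn(dsn: &str) -> Result<Dsn, String> {
    // Accept both URL schemes named in the tracker.
    let rest = dsn
        .strip_prefix("postgresql://")
        .or_else(|| dsn.strip_prefix("postgres://"))
        .ok_or("unsupported scheme")?;
    let (authority, database) = rest.split_once('/').ok_or("missing database")?;
    // Split on the last '@' so the host part is taken from the end.
    let (creds, hostport) = authority.rsplit_once('@').ok_or("missing user")?;
    let (user, password) = match creds.split_once(':') {
        Some((u, p)) => (u, Some(p.to_string())),
        None => (creds, None),
    };
    let (host, port) = match hostport.rsplit_once(':') {
        Some((h, p)) => (h, p.parse::<u16>().map_err(|e| e.to_string())?),
        None => (hostport, 5432), // default Postgres port
    };
    if user.is_empty() || host.is_empty() || database.is_empty() {
        return Err("empty user, host, or database".to_string());
    }
    Ok(Dsn {
        user: user.to_string(),
        password,
        host: host.to_string(),
        port,
        database: database.to_string(),
    })
}

impl Dsn {
    /// Form suitable for lineage records: the password is never written out.
    pub fn redacted(&self) -> String {
        let pw = if self.password.is_some() { ":***" } else { "" };
        format!(
            "postgresql://{}{}@{}:{}/{}",
            self.user, pw, self.host, self.port, self.database
        )
    }
}
```

Keeping redaction next to the parser means the catalog only ever receives the output of `redacted()`, so a raw password has no path into lineage metadata.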
30 unit tests | 11 crates | 16 ADRs | 2.47M rows | 100K vectors | All built 2026-03-27 | HNSW trial system: 2026-04-16