diff --git a/docs/PHASES.md b/docs/PHASES.md
index 4f9fcca..b562c5c 100644
--- a/docs/PHASES.md
+++ b/docs/PHASES.md
@@ -142,8 +142,10 @@
   - [x] `X-Lakehouse-Bucket` header middleware on ingest endpoints (2026-04-16)
   - [x] Catalog migration: `POST /catalog/migrate-buckets` stamps `bucket = "primary"` on legacy refs (12 renamed, 14 total now canonical)
   - [x] `queryd` registers every bucket with DataFusion for cross-bucket SQL — verified with people_test (testing) × animals (primary) CROSS JOIN
-  - [ ] Profile hot-load endpoints: `POST /profile/{user}/activate|deactivate` (deferred to Phase 17)
-  - [ ] `vectord` bucket-scoped paths (trial journals, eval sets per-bucket) (deferred to Phase 17)
+  - [x] Profile hot-load endpoints: bucket auto-provisioning on `POST /vectors/profile/{id}/activate` (2026-04-17)
+  - [x] `vectord` bucket-scoped paths: TrialJournal + PromotionRegistry resolve per-index via IndexMeta.bucket (2026-04-17)
+  - [x] Runtime bucket lifecycle: `POST /storage/buckets` (provision) + `DELETE /storage/buckets/{name}` (unregister, refuses primary/rescue) (2026-04-17)
+  - [x] ModelProfile.bucket field — per-profile artifact isolation (2026-04-17)
 - [x] Database connector ingest (Postgres first) — 2026-04-16
   - `pg_stream::stream_table_to_parquet` — ORDER BY + LIMIT/OFFSET pagination, configurable batch_size
   - `parse_dsn` — postgresql:// and postgres:// URL scheme, user/password/host/port/db
@@ -203,13 +205,46 @@
   - `POST /vectors/refresh/{dataset}` + `GET /vectors/stale`
   - Id columns accept `Utf8`, `Int32`, `Int64`
   - End-to-end on threat_intel: initial 20-row embed 2.1s; re-ingest to 54 rows auto-marks stale; delta refresh embeds only 34 new in 970ms (6× faster than full re-embed); stale cleared
-- [ ] Database connector ingest (Postgres/MySQL)
-- [ ] PDF OCR (Tesseract)
-- [ ] Scheduled ingest (cron)
+- [x] Phase 16.2/16.5: Background autotune agent + ingest-triggered re-trials — 2026-04-17
+  - `vectord::agent` — ε-greedy proposer, rate-limited, cooldown-gated, tokio background task
+  - Ingest paths push `DatasetAppended` triggers to agent queue
+  - Endpoints: `GET /vectors/agent/status`, `POST /vectors/agent/stop`, `POST /vectors/agent/enqueue/{idx}`
+  - `[agent]` config section in lakehouse.toml (enabled, cycle_interval, cooldown, min_recall, max_trials/hr)
+  - 3 unit tests
+- [x] Phase 17 VRAM gate: Two-profile sequential swap — 2026-04-17
+  - Sidecar: `POST /admin/unload` (keep_alive=0), `POST /admin/preload`, `GET /admin/vram` (nvidia-smi + Ollama /api/ps)
+  - `AiClient::unload_model / preload_model / vram_snapshot`
+  - `VectorState.active_profile` singleton — activate swaps models, deactivate unloads
+  - Verified: staffing-recruiter (qwen2.5) ↔ docs-assistant (mistral) — only one model in VRAM at a time
+- [x] MySQL streaming connector — 2026-04-17
+  - `my_stream.rs` mirrors pg_stream: DSN parsing, OFFSET pagination, Arrow type mapping, Parquet streaming
+  - `POST /ingest/mysql` with PII detection, lineage, agent trigger
+  - Verified end-to-end on live MariaDB (10 rows, 9 columns, round-tripped all types)
+  - 6 DSN + type-mapping unit tests
+- [x] Phase 18 hybrid: vectord-lance production crate — 2026-04-17
+  - Firewall crate (Arrow 57 / Lance 4, separate from main Arrow 55 / DF 47 stack)
+  - Public API: migrate_from_parquet, build_index (IVF_PQ), search, get_by_doc_id, append, build_scalar_index, stats
+  - `lance_backend::LanceRegistry` resolves bucket → URI per index
+  - `VectorBackend { Parquet | Lance }` enum on ModelProfile + IndexMeta
+  - 8 HTTP endpoints under `/vectors/lance/*` (migrate, index, search, doc, append, stats, scalar-index, recall)
+  - Profile-driven routing: `POST /vectors/profile/{id}/search` auto-routes to Lance when profile.vector_backend=lance
+  - Auto-migrate + auto-index on activation
+  - Measured on real 100K × 768d: migrate 0.57s, IVF_PQ build 16.2s (14× faster than HNSW 230s), search 23ms, append 100 rows 3.3ms, doc_id fetch 3.5ms (with scalar btree)
+  - IVF_PQ recall@10 = 0.805 (HNSW = 1.000) — measured via `/vectors/lance/recall/{idx}` harness
+- [x] Phase E.3: Scheduled ingest — 2026-04-17
+  - `ingestd::schedule` module: ScheduleDef, ScheduleStore (JSON at `_schedules/{id}.json`), Scheduler tokio task
+  - Supports MySQL + Postgres sources on interval triggers (Cron variant defined, parsing stubbed)
+  - 6 CRUD endpoints under `/ingest/schedules/*` + run-now manual trigger
+  - Full catalog integration: PII, lineage, mark-stale, agent trigger
+  - 6 unit tests
+- [x] PDF OCR via Tesseract — 2026-04-17
+  - Two-tier: lopdf text extraction → Tesseract 5.5 fallback for scanned/image PDFs
+  - Extracts embedded XObject /Image streams, shells to tesseract --oem 3 --psm 6
+  - Same schema (source_file, page_number, text_content) — downstream unchanged
 - [ ] Fine-tuned domain models
 - [ ] Multi-node query distribution
 
 ---
 
-**30 unit tests | 11 crates | 16 ADRs | 2.47M rows | 100K vectors | All built 2026-03-27**
-**HNSW trial system: 2026-04-16**
+**52+ unit tests | 13 crates | 19 ADRs | 2.47M rows | 100K vectors | Hybrid Parquet+HNSW ⊕ Lance**
+**Latest: 2026-04-17 — 8 commits shipping Phase 16.2 through Phase 18**
diff --git a/docs/PRD.md b/docs/PRD.md
index ec2928d..ad747a2 100644
--- a/docs/PRD.md
+++ b/docs/PRD.md
@@ -1,8 +1,8 @@
 # PRD: Lakehouse — Rust-First Substrate for Versioned Knowledge Stores
 
-**Status:** Active — Phase 0-14 complete; federation foundation + HNSW trial system shipped 2026-04-16; entering Phase 16 (hot-swap + model profiles)
+**Status:** Active — Phases 0-18 shipped; hybrid Parquet+HNSW ⊕ Lance operational; scheduled ingest live; PDF OCR live; entering horizon items
 **Created:** 2026-03-27
-**Last reframed:** 2026-04-16 — from "staffing analytics platform" to "dual-use knowledge substrate" (see §Problem below)
+**Last reframed:** 2026-04-17 — hybrid architecture proven end-to-end on 100K vectors (see §Phase 18 + ADR-019)
 **Owner:** J
 
 ---
@@ -297,9 +297,10 @@ Per-contract overlays with daily/weekly/monthly tiers and instant handoff.
 - [x] HNSW vector index with trial system (shipped 2026-04-16)
 - [x] Federation foundation — ADR-017 (shipped 2026-04-16)
 - [x] Database connector ingest — Postgres batch with streaming (shipped 2026-04-16)
-- [ ] Federation layer 2 — X-Lakehouse-Bucket middleware, catalog migration, cross-bucket SQL in queryd
-- [ ] PDF OCR for scanned documents (Tesseract integration)
-- [ ] Scheduled ingest (cron-based file watching, S3 event triggers)
+- [x] Federation layer 2 — runtime bucket lifecycle, per-index bucket scoping, profile bucket auto-provisioning (shipped 2026-04-17)
+- [x] MySQL streaming connector — mirrors Postgres path, verified on live MariaDB (shipped 2026-04-17)
+- [x] PDF OCR for scanned documents — Tesseract 5.5 fallback when lopdf yields no text (shipped 2026-04-17)
+- [x] Scheduled ingest — interval-based per-source schedules with CRUD + run-now + auto-trigger agent (shipped 2026-04-17)
 - [ ] Multi-node query distribution (DataFusion supports this architecturally)
 
 ### Phase 16: Hot-Swap Index Generations