Update PRD + PHASES.md — reflect 8-commit 2026-04-17 push

PRD status line: "Phases 0-18 shipped; hybrid operational; scheduled
ingest live; PDF OCR live; entering horizon items."

PHASES.md: federation L2 items marked complete, Phase 16.2 (autotune
agent), Phase 17 VRAM gate, MySQL connector, Phase 18 (hybrid Lance),
scheduled ingest, PDF OCR all documented with dates and measurements.

Stats updated: 52+ unit tests, 13 crates, 19 ADRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
root 2026-04-16 20:54:05 -05:00
parent fd4b6836ae
commit 3bc82833ac
2 changed files with 48 additions and 12 deletions

@@ -142,8 +142,10 @@
- [x] `X-Lakehouse-Bucket` header middleware on ingest endpoints (2026-04-16)
- [x] Catalog migration: `POST /catalog/migrate-buckets` stamps `bucket = "primary"` on legacy refs (12 renamed, 14 total now canonical)
- [x] `queryd` registers every bucket with DataFusion for cross-bucket SQL — verified with people_test (testing) × animals (primary) CROSS JOIN
-- [ ] Profile hot-load endpoints: `POST /profile/{user}/activate|deactivate` (deferred to Phase 17)
-- [ ] `vectord` bucket-scoped paths (trial journals, eval sets per-bucket) (deferred to Phase 17)
+- [x] Profile hot-load endpoints: bucket auto-provisioning on `POST /vectors/profile/{id}/activate` (2026-04-17)
+- [x] `vectord` bucket-scoped paths: TrialJournal + PromotionRegistry resolve per-index via IndexMeta.bucket (2026-04-17)
+- [x] Runtime bucket lifecycle: `POST /storage/buckets` (provision) + `DELETE /storage/buckets/{name}` (unregister, refuses primary/rescue) (2026-04-17)
+- [x] ModelProfile.bucket field — per-profile artifact isolation (2026-04-17)
- [x] Database connector ingest (Postgres first) — 2026-04-16
- `pg_stream::stream_table_to_parquet` — ORDER BY + LIMIT/OFFSET pagination, configurable batch_size
- `parse_dsn` — postgresql:// and postgres:// URL scheme, user/password/host/port/db
@@ -203,13 +205,46 @@
- `POST /vectors/refresh/{dataset}` + `GET /vectors/stale`
- Id columns accept `Utf8`, `Int32`, `Int64`
- End-to-end on threat_intel: initial 20-row embed 2.1s; re-ingest to 54 rows auto-marks stale; delta refresh embeds only 34 new in 970ms (6× faster than full re-embed); stale cleared
-- [ ] Database connector ingest (Postgres/MySQL)
-- [ ] PDF OCR (Tesseract)
-- [ ] Scheduled ingest (cron)
+- [x] Phase 16.2/16.5: Background autotune agent + ingest-triggered re-trials — 2026-04-17
+- `vectord::agent` — ε-greedy proposer, rate-limited, cooldown-gated, tokio background task
+- Ingest paths push `DatasetAppended` triggers to agent queue
+- Endpoints: `GET /vectors/agent/status`, `POST /vectors/agent/stop`, `POST /vectors/agent/enqueue/{idx}`
+- `[agent]` config section in lakehouse.toml (enabled, cycle_interval, cooldown, min_recall, max_trials/hr)
+- 3 unit tests
+- [x] Phase 17 VRAM gate: Two-profile sequential swap — 2026-04-17
+- Sidecar: `POST /admin/unload` (keep_alive=0), `POST /admin/preload`, `GET /admin/vram` (nvidia-smi + Ollama /api/ps)
+- `AiClient::unload_model / preload_model / vram_snapshot`
+- `VectorState.active_profile` singleton — activate swaps models, deactivate unloads
+- Verified: staffing-recruiter (qwen2.5) ↔ docs-assistant (mistral) — only one model in VRAM at a time
+- [x] MySQL streaming connector — 2026-04-17
+- `my_stream.rs` mirrors pg_stream: DSN parsing, OFFSET pagination, Arrow type mapping, Parquet streaming
+- `POST /ingest/mysql` with PII detection, lineage, agent trigger
+- Verified end-to-end on live MariaDB (10 rows, 9 columns, round-tripped all types)
+- 6 DSN + type-mapping unit tests
+- [x] Phase 18 hybrid: vectord-lance production crate — 2026-04-17
+- Firewall crate (Arrow 57 / Lance 4, separate from main Arrow 55 / DF 47 stack)
+- Public API: migrate_from_parquet, build_index (IVF_PQ), search, get_by_doc_id, append, build_scalar_index, stats
+- `lance_backend::LanceRegistry` resolves bucket → URI per index
+- `VectorBackend { Parquet | Lance }` enum on ModelProfile + IndexMeta
+- 8 HTTP endpoints under `/vectors/lance/*` (migrate, index, search, doc, append, stats, scalar-index, recall)
+- Profile-driven routing: `POST /vectors/profile/{id}/search` auto-routes to Lance when profile.vector_backend=lance
+- Auto-migrate + auto-index on activation
+- Measured on real 100K × 768d: migrate 0.57s, IVF_PQ build 16.2s (14× faster than HNSW 230s), search 23ms, append 100 rows 3.3ms, doc_id fetch 3.5ms (with scalar btree)
+- IVF_PQ recall@10 = 0.805 (HNSW = 1.000) — measured via `/vectors/lance/recall/{idx}` harness
+- [x] Phase E.3: Scheduled ingest — 2026-04-17
+- `ingestd::schedule` module: ScheduleDef, ScheduleStore (JSON at `_schedules/{id}.json`), Scheduler tokio task
+- Supports MySQL + Postgres sources on interval triggers (Cron variant defined, parsing stubbed)
+- 6 CRUD endpoints under `/ingest/schedules/*` + run-now manual trigger
+- Full catalog integration: PII, lineage, mark-stale, agent trigger
+- 6 unit tests
+- [x] PDF OCR via Tesseract — 2026-04-17
+- Two-tier: lopdf text extraction → Tesseract 5.5 fallback for scanned/image PDFs
+- Extracts embedded XObject /Image streams, shells to tesseract --oem 3 --psm 6
+- Same schema (source_file, page_number, text_content) — downstream unchanged
- [ ] Fine-tuned domain models
- [ ] Multi-node query distribution
---
-**30 unit tests | 11 crates | 16 ADRs | 2.47M rows | 100K vectors | All built 2026-03-27**
-**HNSW trial system: 2026-04-16**
+**52+ unit tests | 13 crates | 19 ADRs | 2.47M rows | 100K vectors | Hybrid Parquet+HNSW ⊕ Lance**
+**Latest: 2026-04-17 — 8 commits shipping Phase 16.2 through Phase 18**

@@ -1,8 +1,8 @@
# PRD: Lakehouse — Rust-First Substrate for Versioned Knowledge Stores
-**Status:** Active — Phase 0-14 complete; federation foundation + HNSW trial system shipped 2026-04-16; entering Phase 16 (hot-swap + model profiles)
+**Status:** Active — Phases 0-18 shipped; hybrid Parquet+HNSW ⊕ Lance operational; scheduled ingest live; PDF OCR live; entering horizon items
**Created:** 2026-03-27
-**Last reframed:** 2026-04-16 — from "staffing analytics platform" to "dual-use knowledge substrate" (see §Problem below)
+**Last reframed:** 2026-04-17 — hybrid architecture proven end-to-end on 100K vectors (see §Phase 18 + ADR-019)
**Owner:** J
---
@@ -297,9 +297,10 @@ Per-contract overlays with daily/weekly/monthly tiers and instant handoff.
- [x] HNSW vector index with trial system (shipped 2026-04-16)
- [x] Federation foundation — ADR-017 (shipped 2026-04-16)
- [x] Database connector ingest — Postgres batch with streaming (shipped 2026-04-16)
-- [ ] Federation layer 2 — X-Lakehouse-Bucket middleware, catalog migration, cross-bucket SQL in queryd
-- [ ] PDF OCR for scanned documents (Tesseract integration)
-- [ ] Scheduled ingest (cron-based file watching, S3 event triggers)
+- [x] Federation layer 2 — runtime bucket lifecycle, per-index bucket scoping, profile bucket auto-provisioning (shipped 2026-04-17)
+- [x] MySQL streaming connector — mirrors Postgres path, verified on live MariaDB (shipped 2026-04-17)
+- [x] PDF OCR for scanned documents — Tesseract 5.5 fallback when lopdf yields no text (shipped 2026-04-17)
+- [x] Scheduled ingest — interval-based per-source schedules with CRUD + run-now + auto-trigger agent (shipped 2026-04-17)
- [ ] Multi-node query distribution (DataFusion supports this architecturally)
### Phase 16: Hot-Swap Index Generations