lakehouse

Author	SHA1	Message	Date
root	ac7c996596	sweep up scrum WARNs — model const, stale config, temp_path entropy, smoke gate Four findings deferred from the 2026-05-02 scrum, all 1-5 line fixes: W1 (kimi WARN @ scrum_master_pipeline.ts:1143) — `gemini-3-flash-preview` hardcoded twice in MAP and REDUCE phases. Extracted TREE_SPLIT_MODEL + TREE_SPLIT_PROVIDER constants near the existing config block. Diverging the two would break tree-split coherence (per-shard digests must come from the same model the reducer collapses). W2 (qwen WARN @ providers.toml:30) — stale `kimi-k2:1t` reference in operator-facing comments after PR #13 noted it's upstream-broken. Reframed as historical context ("was X here pre-2026-05-03 — that model is broken") so future operators don't paste-route from the comment. W3 (opus WARN @ vectord-lance/src/lib.rs:622) — temp_path() entropy was only pid+nanos, which collide under tokio scheduling when multiple tests in the same cargo process create temp dirs back-to-back. Added per-process AtomicU64 sequence counter — guarantees uniqueness regardless of clock. W4 (opus INFO @ scripts/lance_smoke.sh:38) — `|| echo '{}'` swallowed curl transport failures (gateway down, network broken, timeout), surfacing as misleading "no method field" jq errors at the next probe. Now captures $? separately, gates a "curl reachable" probe, and only falls back to empty body for the dependent jq parse. Smoke went 9 → 10 probes. Verified: vectord-lance 7/7 tests PASS, gateway cargo check clean, lance_smoke.sh 10/10 PASS against live gateway. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 00:11:59 -05:00
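The W3 fix above (pid + nanos + a per-process sequence counter) can be sketched as follows. This is a minimal illustration, not the code in vectord-lance/src/lib.rs: the `label` parameter, the `TEMP_SEQ` name, and the directory-name layout are assumptions.

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical per-process counter: every call gets a distinct value,
// so two paths differ even if pid and the clock reading are identical.
static TEMP_SEQ: AtomicU64 = AtomicU64::new(0);

// Build a unique temp directory path (the directory itself is not created here).
fn temp_path(label: &str) -> PathBuf {
    let pid = std::process::id();
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    // fetch_add is atomic, so concurrent callers never observe the same value.
    let seq = TEMP_SEQ.fetch_add(1, Ordering::Relaxed);
    std::env::temp_dir().join(format!("{label}-{pid}-{nanos}-{seq}"))
}
```

Two back-to-back calls such as `temp_path("t")` and `temp_path("t")` differ in the trailing sequence number even when the clock does not advance between them.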
root	7594725c25	lance backend: 4-pack — bug fix + smoke + tests + 10M re-bench Surfaced by the 2026-05-02 audit (vectord-lance + lance-bench + glue existed and worked but had no tests, no smoke, leaked server paths on missing-index search, and the ADR-019 10M re-bench was deferred). ## 1. Fix: missing-index search returned 500 + leaked filesystem path Pre-fix: $ POST /vectors/lance/search/no-such-index HTTP 500 Dataset at path home/profit/lakehouse/data/lance/no-such-index was not found: Not found: home/profit/lakehouse/data/lance/no-such-index/_versions, /root/.cargo/registry/src/index.crates.io-...-1949cf8c.../lance-table-4.0.0/src/io/commit.rs:364:26, ... Post-fix: HTTP 404 lance dataset not found: no-such-index Added `sanitize_lance_err()` in crates/vectord/src/service.rs that: - maps "not found" / "no such file" patterns → 404 (was 500) - strips /home/ and /root/.cargo/ paths from any error body Applied to all 5 lance handlers: search, get_doc, build_index, append, migrate. The store_for() handle is cheap-and-stateless; the actual disk hit happens inside the operation, which is where the leak originated. ## 2. scripts/lance_smoke.sh — first regression gate 9-probe smoke against the live HTTP surface. Exercises only read paths (no state mutation in CI). Specifically locks the sanitizer fix — a future regression that re-introduces the path leak fires the smoke immediately. 9/9 PASS against the live :3100 today. ## 3. 
Unit tests on vectord-lance/src/lib.rs (was: zero tests) 7 tests covering the public LanceVectorStore API: - fresh_store_reports_no_state — handle is lazy - migrate_then_count_and_fetch — Parquet → Lance round-trip - get_by_doc_id_missing_returns_none — Ok(None) vs Err contract that lets the HTTP handler return 404 cleanly - append_grows_count_and_new_rows_fetchable — ADR-019's structural-difference claim verified at the unit level - append_dim_mismatch_errors — guards against silently breaking search by accepting inconsistent-dim rows - search_returns_nearest — exact-vector match → top-1 - stats_reports_post_migrate_state — locks the field shape 7/7 PASS. cargo test -p vectord-lance --lib green. ## 4. 10M re-bench (deferred from ADR-019) reports/lance_10m_rebench_2026-05-02.md captures the numbers driven against the live :3100 over data/lance/scale_test_10m (33GB / 10M vectors, IVF_PQ confirmed via response method tag). Headline: Search cold (10 diverse queries): median ~32ms, mean ~46ms Search warm (5x same query): ~20ms p50 Doc fetch (5x same id): ~100ms p50 Search latency at 10M is acceptable for batch / async workloads, too slow for sub-10ms voice/recommendation paths. ADR-019's "Lance pulls ahead at 10M" claim remains unverified-but-not-refuted — at this scale HNSW doesn't operationally exist (10M × 768d × 4 bytes = 30GB just for vectors). Real finding: doc-fetch at 10M is 300x slower than the 100K number ADR-019 cited (311μs → ~100ms). Likely cause: scalar btree index on doc_id may not be built for this dataset. Follow-up to investigate whether forcing build_scalar_index brings it back to the load-bearing O(1) range. Captured in the report. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 20:06:56 -05:00
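The sanitizer behaviour described in section 1 of this commit (not-found patterns map to 404, server paths are stripped from every other error body) might look roughly like the sketch below. The signature, the `(status, body)` return shape, and the `<path>` placeholder are assumptions for illustration; only the function name and the two rules come from the commit message.

```rust
// Hypothetical shape: returns (HTTP status, response body) for a raw Lance error.
fn sanitize_lance_err(index: &str, raw: &str) -> (u16, String) {
    let lower = raw.to_lowercase();
    // Rule 1: "not found" / "no such file" becomes a clean 404 with no path.
    if lower.contains("not found") || lower.contains("no such file") {
        return (404, format!("lance dataset not found: {index}"));
    }
    // Rule 2: any token carrying a server path is replaced wholesale.
    let cleaned: Vec<String> = raw
        .split_whitespace()
        .map(|tok| {
            if tok.contains("/home/") || tok.contains("/root/.cargo/") {
                "<path>".to_string()
            } else {
                tok.to_string()
            }
        })
        .collect();
    (500, cleaned.join(" "))
}
```

Replacing whole whitespace-delimited tokens is deliberately coarse: it may drop trailing punctuation attached to a path, which is acceptable when the goal is to never echo a filesystem location to the client.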
profit	5b1fcf6d27	Phase 28-36 body of work Accumulated since a6f12e2 (Phase 21 Rust port + Phase 27 versioning): - Phase 36: embed_semaphore on VectorState (permits=1) serializes seed embed calls — prevents sidecar socket collisions under concurrent /seed stress load - Phase 31+: run_stress.ts 6-task diverse stress scaffolding; run_e2e_rated.ts + orchestrator.ts tightening - Catalog dedupe cleanup: 16 duplicate manifests removed; canonical candidates.parquet (10.5MB -> 76KB) + placements.parquet (1.2MB -> 11KB) regenerated post-dedupe; fresh manifests for active datasets - vectord: harness EvalSet refinements (+181), agent portfolio rotation + ingest triggers (+158), autotune + rag adjustments - catalogd/storaged/ingestd/mcp-server: misc tightening - docs: Phase 28-36 PRD entries + DECISIONS ADR additions; control-plane pivot banner added to top of docs/PRD.md (pointing at docs/CONTROL_PLANE_PRD.md which lands in next commit) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 02:41:15 -05:00
root	9e6002c4d4	S3 backend for Lance — hybrid operates on real MinIO object storage Enabled lance feature "aws" for S3-compatible storage via opendal. BucketRegistry: added with_allow_http(true) for MinIO/non-TLS S3 endpoints (fixes "builder error" on HTTP endpoints). lakehouse.toml gains [[storage.buckets]] name="s3:lakehouse" with S3 backend config. lance_backend.rs: S3 bucket naming convention — buckets with name prefix "s3:" emit s3:// URIs for Lance datasets. AWS_* env vars in the systemd unit provide credentials to Lance's internal object_store. Verified end-to-end on real MinIO with real 100K × 768d vectors: - Migrate Parquet → Lance on S3: 1.7s (vs 0.57s local) - Build IVF_PQ: 16.4s (CPU-bound, essentially same as local) - Search: ~58ms p50 (vs 11ms local — S3 partition reads) - Random doc fetch: 13ms (vs 3.5ms local) - Recall@10: 0.835 (randomized IVF_PQ, consistent with local 0.805) - Total S3 footprint: 637 MiB (vectors + index + lance metadata) The "public storage" claim from the PRD is now proven: the hybrid Parquet+HNSW ⊕ Lance architecture works on S3-compatible object storage, not just local filesystem. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 21:09:42 -05:00
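The bucket naming convention in the commit above (a bucket name prefixed with "s3:" yields an s3:// dataset URI) can be illustrated with a small sketch. The function name, the `local_root` parameter, and the `lance/{index}` sub-path are assumptions; the actual resolver in lance_backend.rs may lay paths out differently.

```rust
// Hypothetical resolver: bucket name + index name -> Lance dataset URI.
fn lance_dataset_uri(bucket_name: &str, local_root: &str, index: &str) -> String {
    match bucket_name.strip_prefix("s3:") {
        // "s3:lakehouse" -> "s3://lakehouse/lance/<index>"
        Some(s3_bucket) => format!("s3://{s3_bucket}/lance/{index}"),
        // Any other bucket is treated as a local directory root.
        None => format!("{local_root}/lance/{index}"),
    }
}
```

Under these assumptions, `lance_dataset_uri("s3:lakehouse", "data", "resumes_100k_v2")` produces an s3:// URI, while a bucket named `primary` resolves under the local root.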
root	59e72fa566	Scalar btree index on doc_id + auto-build during Lance activation LanceVectorStore gains build_scalar_index(column) and has_scalar_index(column). Exposed as POST /vectors/lance/scalar-index/{index}/{column}. activate_profile auto-builds the doc_id btree alongside the IVF_PQ vector index when activating a Lance-backed profile — operators get both indexes without extra API calls. stats() now reports has_doc_id_index alongside has_vector_index. Measured on resumes_100k_v2 (100K × 768d): random doc_id fetch improved from ~5.4ms to ~3.5ms (35% faster). Btree build: 19ms, +2.7 MB on disk. The remaining ~3ms is vector column materialization, not index lookup — to close further would need a projection-only fetch that skips the 768-float vector for text-only RAG retrieval. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 20:49:17 -05:00
root	0d037cfac1	Phases 16.2 + L2 + 17 VRAM gate + MySQL + 18 Lance hybrid milestone Five threads of work landing as one milestone — all individually verified end-to-end against real data, full release build clean, 46 unit tests pass. ## Phase 16.2 / 16.5 — autotune agent + ingest triggers `vectord::agent` is a long-running tokio task that watches the trial journal and autonomously proposes + runs new HNSW configs. Distinct from `autotune::run_autotune` (synchronous one-shot grid). Triggered on POST /vectors/agent/enqueue/{idx} or by the periodic wake; ingest paths now push DatasetAppended events when an index's source dataset gets re-ingested. Rate-limited (max_trials_per_hour) and cooldown-gated so it can't saturate Ollama under live load. The proposer is ε-greedy around the current champion: with prob 0.25 sample random from full bounds, otherwise perturb champion ± small delta on both axes. Dedup against history. Deterministic — RNG seeded from history.len() so the same journal state proposes the same next config (helps offline replay debugging). `[agent]` config section in lakehouse.toml; opt-in via enabled=true. ## Federation Layer 2 — runtime bucket lifecycle + per-index scoping `BucketRegistry.buckets` moved to `std::sync::RwLock<HashMap>` so buckets can be added/removed after startup. POST /storage/buckets provisions at runtime; DELETE /storage/buckets/{name} unregisters (refuses primary/rescue with 403). Local-backend buckets get their root directory auto-created. `IndexMeta.bucket` (default "primary" via serde) records each index's home bucket. `TrialJournal` and `PromotionRegistry` now hold Arc<BucketRegistry> + IndexRegistry; they resolve target store per-index via IndexMeta.bucket. PromotionRegistry::list_all scans every bucket and dedups by index_name. Pre-federation indexes keep working unchanged — they just default to primary. `ModelProfile.bucket: Option<String>` declares per-profile artifact home. 
POST /vectors/profile/{id}/activate auto-provisions the profile's bucket under storage.profile_root if not yet registered. EvalSets stay primary-only for now — noted gap, low-risk to extend later with the same resolver pattern. ## Phase 17 — VRAM-aware two-profile gate Sidecar gains POST /admin/unload (Ollama keep_alive=0 trick — forces immediate VRAM release), POST /admin/preload (keep_alive=5m with empty prompt, takes the slot warm), and GET /admin/vram (combines nvidia-smi snapshot with Ollama /api/ps). Exposed via aibridge as unload_model / preload_model / vram_snapshot. `VectorState.active_profile` is the GPU-slot singleton — Arc<RwLock<Option<ActiveProfileSlot>>>. activate_profile checks for a previous profile with a different ollama_name and unloads it before preloading the new one; same-model reactivations skip the unload (Ollama no-ops). New routes: POST /vectors/profile/{id}/deactivate (unload + clear slot), GET /vectors/profile/active. Verified live: staffing-recruiter (qwen2.5) → docs-assistant (mistral) swap freed qwen2.5 from VRAM and loaded mistral. nomic-embed-text persists across swaps because both profiles use it — free optimization that fell out of the design. Scoped search correctly 403s cross-profile in both directions. ## MySQL streaming connector `crates/ingestd/src/my_stream.rs` mirrors pg_stream.rs for MySQL. Pure-rust `mysql_async` driver (default-features=false to avoid C deps). Same OFFSET pagination, same Parquet-streaming write shape. Type mapping per ADR-010: int/bigint → Int32/Int64, decimal/float → Float64, tinyint(1)/bool → Boolean, everything else → Utf8 with fallback parsers for date/time/json/uuid via Display. POST /ingest/mysql parallel to /ingest/db. Same PII auto-detection, same lineage capture (source_system="mysql"), same agent-trigger hook. `redact_dsn` generalized — was hardcoded to "postgresql://" length, now works for any scheme://user:pass@host/path URL (latent PII leak fix for MySQL DSNs). 
Verified live against MariaDB on localhost: 10 rows × 9 columns of test data round-tripped through datatypes int/varchar/decimal/tinyint/datetime/text. PII detection auto-flagged name + email. Aggregation queries through DataFusion match the source values exactly. ## Phase 18 — Hybrid Parquet+HNSW ⊕ Lance backend (ADR-019) `vectord-lance` is a new firewall crate. Lance pulls Arrow 57 and DataFusion 52 — incompatible with the rest of the workspace's Arrow 55 / DataFusion 47. The firewall isolates that dep tree: public API uses only std types (Vec<f32>, Vec<String>, Hit, Row, Stats), so no Arrow types cross the crate boundary and nothing propagates to vectord. The ADR-019 path that didn't ship until now. `vectord::lance_backend::LanceRegistry` lazy-creates a LanceVectorStore per index, resolving bucket → URI via the conventional local-bucket layout. `IndexMeta.vector_backend` and `ModelProfile.vector_backend` carry the choice (default Parquet so existing indexes unchanged). Six routes under /vectors/lance/: - migrate/{idx}: convert binary-blob Parquet → Lance FixedSizeList - index/{idx}: build IVF_PQ - search/{idx}: vector search (embed via sidecar) - doc/{idx}/{doc_id}: random row fetch - append/{idx}: native fragment append - stats/{idx}: row count + index presence Verified live on the real resumes_100k_v2 corpus (100K × 768d): - Migrate: 0.57s - Build IVF_PQ index: 16.2s (matches ADR-019 bench; 14× faster than HNSW's 230s for the same data) - Search end-to-end (Ollama embed + Lance scan): 23-53ms - Random doc_id fetch: 5-7ms (filter scan; faster than Parquet's ~35ms full-file scan, slower than the bench's 311us positional take — would close that gap with a scalar btree on doc_id) - Append 100 rows: 3.3ms / +320KB on disk vs Parquet's required full ~330MB rewrite — the structural win - Index survives append; both backends coexist cleanly ## Known follow-ups not in this milestone - ModelProfile.vector_backend doesn't yet auto-route /vectors/profile/{id}/search to 
Lance; callers go through /vectors/lance/* directly - Scalar btree on doc_id (closes the 5-7ms → ~300us gap) - vectord-lance built default-features=false → no S3 yet - IVF_PQ recall not measured (ADR-019 caveat) — needs a Lance-aware variant of the eval harness - Watcher-path ingest doesn't push agent triggers (HTTP paths do) - EvalSets still primary-only (federation gap) - No PATCH endpoint to move an existing index between buckets - The pre-existing storaged::append_log doctest fails to compile (malformed `{prefix}/` parses as code fence) — pre-existing bug, left for a focused fix 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 20:24:46 -05:00
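The `redact_dsn` generalization mentioned in the MySQL connector section (masking credentials for any scheme://user:pass@host/path URL instead of assuming the length of "postgresql://") could be approached as in this sketch. It is an assumption-laden illustration, not the code in ingestd: the `***` mask and the exact return format are invented here, and passwords containing an unencoded `/` are not handled.

```rust
// Hypothetical scheme-agnostic redaction: drop everything between "://" and "@".
fn redact_dsn(dsn: &str) -> String {
    // Locate the scheme separator rather than assuming a fixed prefix length.
    let Some(scheme_end) = dsn.find("://") else {
        return dsn.to_string();
    };
    let authority_start = scheme_end + 3;
    let rest = &dsn[authority_start..];
    // The authority section ends at the first '/' (or at end of string).
    let authority_end = rest.find('/').unwrap_or(rest.len());
    let authority = &rest[..authority_end];
    match authority.rfind('@') {
        // Keep scheme and host/path, mask the user:pass part.
        Some(at) => format!("{}***@{}", &dsn[..authority_start], &rest[at + 1..]),
        // No credentials present: nothing to redact.
        None => dsn.to_string(),
    }
}
```

Searching for `@` only inside the authority section avoids misreading an `@` that appears later in the path or query string as a credential separator.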

6 Commits