root 0d037cfac1 Phases 16.2 + L2 + 17 VRAM gate + MySQL + 18 Lance hybrid milestone
Five threads of work landing as one milestone — all individually
verified end-to-end against real data, full release build clean,
46 unit tests pass.

## Phase 16.2 / 16.5 — autotune agent + ingest triggers

`vectord::agent` is a long-running tokio task that watches the trial
journal and autonomously proposes + runs new HNSW configs. Distinct
from `autotune::run_autotune` (synchronous one-shot grid). Triggered
on POST /vectors/agent/enqueue/{idx} or by the periodic wake; ingest
paths now push DatasetAppended events when an index's source dataset
gets re-ingested. Rate-limited (max_trials_per_hour) and cooldown-
gated so it can't saturate Ollama under live load.

The proposer is ε-greedy around the current champion: with prob
0.25 it samples at random from the full bounds, otherwise it
perturbs the champion by a small ± delta on both axes. Dedup
against history. Deterministic: the RNG is seeded from
history.len(), so the same journal state proposes the same next
config (helps offline replay debugging).
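
A minimal sketch of that loop (rand 0.8-style API); `Config`, the
bounds, and the delta sizes are illustrative, not the actual
`vectord::agent` types:

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

#[derive(Clone, Copy, PartialEq)]
struct Config { m: u32, ef_construction: u32 }

const LO: Config = Config { m: 4, ef_construction: 32 };
const HI: Config = Config { m: 64, ef_construction: 512 };

fn propose(champion: Config, history: &[Config]) -> Config {
    // Seed from history length: same journal state, same proposal.
    let mut rng = StdRng::seed_from_u64(history.len() as u64);
    loop {
        let cand = if rng.gen_bool(0.25) {
            // Explore: uniform sample over the full bounds.
            Config {
                m: rng.gen_range(LO.m..=HI.m),
                ef_construction: rng.gen_range(LO.ef_construction..=HI.ef_construction),
            }
        } else {
            // Exploit: champion ± small delta on both axes, clamped.
            Config {
                m: (champion.m as i64 + rng.gen_range(-4i64..=4))
                    .clamp(LO.m as i64, HI.m as i64) as u32,
                ef_construction: (champion.ef_construction as i64 + rng.gen_range(-32i64..=32))
                    .clamp(LO.ef_construction as i64, HI.ef_construction as i64) as u32,
            }
        };
        if !history.contains(&cand) {
            return cand; // dedup against history
        }
    }
}
```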

`[agent]` config section in lakehouse.toml; opt-in via enabled=true.
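
Roughly the shape that section deserializes into; `enabled` and
`max_trials_per_hour` are from this commit, the cooldown field name
and the default values are guesses:

```rust
use serde::Deserialize;

/// The `[agent]` table in lakehouse.toml.
#[derive(Deserialize)]
#[serde(default)]
struct AgentConfig {
    enabled: bool,            // opt-in; the agent task doesn't spawn otherwise
    max_trials_per_hour: u32, // rate limit so trials can't saturate Ollama
    cooldown_secs: u64,       // minimum gap between trials (name assumed)
}

impl Default for AgentConfig {
    fn default() -> Self {
        Self { enabled: false, max_trials_per_hour: 4, cooldown_secs: 300 }
    }
}
```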

## Federation Layer 2 — runtime bucket lifecycle + per-index scoping

`BucketRegistry.buckets` moved to `std::sync::RwLock<HashMap>` so
buckets can be added/removed after startup. POST /storage/buckets
provisions at runtime; DELETE /storage/buckets/{name} unregisters
(refuses primary/rescue with 403). Local-backend buckets get their
root directory auto-created.
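
The lifecycle core, sketched; real method names and error plumbing
differ:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

#[derive(Clone)]
struct ObjectStoreHandle; // stand-in for the real per-bucket store

struct BucketRegistry {
    // RwLock so buckets can change after startup; resolution reads
    // vastly outnumber lifecycle writes.
    buckets: RwLock<HashMap<String, ObjectStoreHandle>>,
}

impl BucketRegistry {
    fn register(&self, name: String, store: ObjectStoreHandle) {
        self.buckets.write().unwrap().insert(name, store);
    }

    fn unregister(&self, name: &str) -> Result<(), &'static str> {
        if name == "primary" || name == "rescue" {
            return Err("protected bucket"); // surfaced as 403
        }
        self.buckets.write().unwrap().remove(name);
        Ok(())
    }
}
```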

`IndexMeta.bucket` (default "primary" via serde) records each index's
home bucket. `TrialJournal` and `PromotionRegistry` now hold
Arc<BucketRegistry> + IndexRegistry; they resolve target store per-
index via IndexMeta.bucket. PromotionRegistry::list_all scans every
bucket and dedups by index_name. Pre-federation indexes keep working
unchanged — they just default to primary.
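
Resolution then hangs off the serde default, extending the registry
sketch above (the `bucket` field shape is per this commit, helper
names are illustrative):

```rust
#[derive(serde::Deserialize)]
struct IndexMeta {
    name: String,
    // Missing in pre-federation metadata => deserializes to
    // "primary", so old indexes keep working unchanged.
    #[serde(default = "default_bucket")]
    bucket: String,
}

fn default_bucket() -> String { "primary".into() }

impl BucketRegistry {
    fn store_for(&self, meta: &IndexMeta) -> Option<ObjectStoreHandle> {
        self.buckets.read().unwrap().get(&meta.bucket).cloned()
    }
}
```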

`ModelProfile.bucket: Option<String>` declares per-profile artifact
home. POST /vectors/profile/{id}/activate auto-provisions the
profile's bucket under storage.profile_root if not yet registered.

EvalSets stay primary-only for now — noted gap, low-risk to extend
later with the same resolver pattern.

## Phase 17 — VRAM-aware two-profile gate

Sidecar gains POST /admin/unload (Ollama keep_alive=0 trick — forces
immediate VRAM release), POST /admin/preload (keep_alive=5m with
empty prompt, takes the slot warm), and GET /admin/vram (combines
nvidia-smi snapshot with Ollama /api/ps). Exposed via aibridge as
unload_model / preload_model / vram_snapshot.
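
The aibridge side is thin HTTP glue over those routes; roughly,
assuming a reqwest-style client (the sidecar base URL and error
shape are illustrative):

```rust
async fn unload_model(sidecar: &str, model: &str) -> anyhow::Result<()> {
    // POST /admin/unload with the model name; sidecar relays keep_alive=0.
    let resp = reqwest::Client::new()
        .post(format!("{sidecar}/admin/unload"))
        .json(&serde_json::json!({ "model": model }))
        .send()
        .await?;
    anyhow::ensure!(resp.status().is_success(), "unload failed: {}", resp.status());
    Ok(())
}
```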

`VectorState.active_profile` is the GPU-slot singleton —
Arc<RwLock<Option<ActiveProfileSlot>>>. activate_profile checks for
a previous profile with a different ollama_name and unloads it
before preloading the new one; same-model reactivations skip the
unload (Ollama no-ops). New routes: POST /vectors/profile/{id}/
deactivate (unload + clear slot), GET /vectors/profile/active.
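
The swap discipline, condensed (a sketch; `unload_model` /
`preload_model` as in the aibridge sketch above, with a hypothetical
sidecar URL, and the real activate_profile also persists the slot
and maps errors to HTTP):

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

#[derive(Clone)]
struct ActiveProfileSlot { profile_id: String, ollama_name: String }

async fn activate_profile(
    slot: &Arc<RwLock<Option<ActiveProfileSlot>>>,
    next: ActiveProfileSlot,
) -> anyhow::Result<()> {
    let mut guard = slot.write().await;
    if let Some(prev) = guard.as_ref() {
        // Same model: skip the unload, Ollama no-ops on re-preload.
        if prev.ollama_name != next.ollama_name {
            unload_model("http://sidecar", &prev.ollama_name).await?; // keep_alive=0
        }
    }
    preload_model("http://sidecar", &next.ollama_name).await?; // keep_alive=5m
    *guard = Some(next);
    Ok(())
}
```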

Verified live: staffing-recruiter (qwen2.5) → docs-assistant
(mistral) swap freed qwen2.5 from VRAM and loaded mistral. nomic-
embed-text persists across swaps because both profiles use it —
free optimization that fell out of the design. Scoped search
correctly 403s cross-profile in both directions.

## MySQL streaming connector

`crates/ingestd/src/my_stream.rs` mirrors pg_stream.rs for MySQL.
Pure-Rust `mysql_async` driver (default-features=false to avoid C
deps). Same OFFSET pagination, same Parquet-streaming write shape.
Type mapping per ADR-010: int/bigint → Int32/Int64, decimal/float
→ Float64, tinyint(1)/bool → Boolean, everything else → Utf8 with
fallback parsers for date/time/json/uuid via Display.
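
The mapping as a match (sketch; the real code works from
`mysql_async` column metadata, and the grouping of the narrower int
types is a guess beyond ADR-010's listed examples):

```rust
use arrow_schema::DataType;

/// MySQL column type name -> Arrow type, per ADR-010.
fn arrow_type(mysql_ty: &str, is_tinyint1: bool) -> DataType {
    match mysql_ty {
        "tinyint" if is_tinyint1 => DataType::Boolean, // tinyint(1) == bool
        "bool" | "boolean" => DataType::Boolean,
        "tinyint" | "smallint" | "mediumint" | "int" => DataType::Int32,
        "bigint" => DataType::Int64,
        "decimal" | "float" | "double" => DataType::Float64,
        // date/time/json/uuid and the rest: Utf8, rendered via Display.
        _ => DataType::Utf8,
    }
}
```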

POST /ingest/mysql parallel to /ingest/db. Same PII auto-detection,
same lineage capture (source_system="mysql"), same agent-trigger
hook. `redact_dsn` generalized — was hardcoded to "postgresql://"
length, now works for any scheme://user:pass@host/path URL (latent
PII leak fix for MySQL DSNs).
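
The generalized redaction, roughly (a sketch; edge cases like '@'
inside passwords are glossed over):

```rust
/// Redact the password in any scheme://user:pass@host/path DSN.
/// Previously assumed a fixed "postgresql://" prefix length.
fn redact_dsn(dsn: &str) -> String {
    let Some(scheme_end) = dsn.find("://") else { return dsn.to_string() };
    let rest = &dsn[scheme_end + 3..];
    // Credentials end at the first '@' before the host.
    let Some(at) = rest.find('@') else { return dsn.to_string() };
    match rest[..at].find(':') {
        Some(colon) => format!(
            "{}://{}:***@{}",
            &dsn[..scheme_end],
            &rest[..colon],
            &rest[at + 1..],
        ),
        None => dsn.to_string(), // user only, no password to redact
    }
}
```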

Verified live against MariaDB on localhost: 10 rows × 9 columns of
test data round-tripped through datatypes int/varchar/decimal/
tinyint/datetime/text. PII detection auto-flagged name + email.
Aggregation queries through DataFusion match the source values
exactly.

## Phase 18 — Hybrid Parquet+HNSW ⊕ Lance backend (ADR-019)

`vectord-lance` is a new firewall crate. Lance pulls Arrow 57 and
DataFusion 52 — incompatible with the rest of the workspace's
Arrow 55 / DataFusion 47. The firewall isolates that dep tree:
public API uses only std types (Vec<f32>, Vec<String>, Hit, Row,
*Stats), so no Arrow types cross the crate boundary and nothing
propagates to vectord. The ADR-019 path that didn't ship until now.
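
The boundary in miniature; everything beyond the Hit/Row/*Stats
naming is illustrative:

```rust
// vectord-lance public surface: std types only. Arrow 57 / DataFusion 52
// stay private here and never meet vectord's Arrow 55 / DataFusion 47.
pub struct Hit { pub doc_id: String, pub score: f32 }
pub struct Row { pub doc_id: String, pub fields: Vec<(String, String)> }
pub struct IndexStats { pub rows: u64, pub has_index: bool }

pub struct LanceVectorStore { /* lance::Dataset lives here, privately */ }

impl LanceVectorStore {
    pub async fn search(&self, _query: Vec<f32>, _k: usize) -> Vec<Hit> {
        // RecordBatches are flattened to Hit before crossing the boundary.
        todo!()
    }
}
```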

`vectord::lance_backend::LanceRegistry` lazy-creates a
LanceVectorStore per index, resolving bucket → URI via the
conventional local-bucket layout. `IndexMeta.vector_backend` and
`ModelProfile.vector_backend` carry the choice (default Parquet so
existing indexes unchanged).
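
Lazy creation per index, approximately (URI layout and locking are
assumptions; `LanceVectorStore` stubbed from the boundary sketch
above):

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::Mutex;

struct LanceVectorStore;
impl LanceVectorStore {
    async fn open_or_create(_uri: &str) -> Self { Self } // stand-in
}

struct LanceRegistry {
    stores: Mutex<HashMap<String, Arc<LanceVectorStore>>>,
}

impl LanceRegistry {
    async fn get_or_create(&self, index: &str, bucket_root: &str) -> Arc<LanceVectorStore> {
        let mut stores = self.stores.lock().await;
        if let Some(s) = stores.get(index) {
            return s.clone();
        }
        // Conventional local-bucket layout (assumed): <root>/lance/<index>.lance
        let store = Arc::new(
            LanceVectorStore::open_or_create(&format!("{bucket_root}/lance/{index}.lance")).await,
        );
        stores.insert(index.to_string(), store.clone());
        store
    }
}
```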

Six routes under /vectors/lance/*:
- migrate/{idx}: convert binary-blob Parquet → Lance FixedSizeList
- index/{idx}: build IVF_PQ
- search/{idx}: vector search (embed via sidecar)
- doc/{idx}/{doc_id}: random row fetch
- append/{idx}: native fragment append
- stats/{idx}: row count + index presence

Verified live on the real resumes_100k_v2 corpus (100K × 768d):
- Migrate: 0.57s
- Build IVF_PQ index: 16.2s (matches ADR-019 bench; 14× faster than
  HNSW's 230s for the same data)
- Search end-to-end (Ollama embed + Lance scan): 23-53ms
- Random doc_id fetch: 5-7ms (filter scan; faster than Parquet's
  ~35ms full-file scan, slower than the bench's 311us positional
  take — would close that gap with a scalar btree on doc_id)
- Append 100 rows: 3.3ms / +320KB on disk vs Parquet's required
  full ~330MB rewrite — the structural win
- Index survives append; both backends coexist cleanly

## Known follow-ups not in this milestone

- ModelProfile.vector_backend doesn't yet auto-route /vectors/profile/
  {id}/search to Lance; callers go through /vectors/lance/* directly
- Scalar btree on doc_id (closes the 5-7ms → ~300us gap)
- vectord-lance built default-features=false → no S3 yet
- IVF_PQ recall not measured (ADR-019 caveat) — needs a Lance-aware
  variant of the eval harness
- Watcher-path ingest doesn't push agent triggers (HTTP paths do)
- EvalSets still primary-only (federation gap)
- No PATCH endpoint to move an existing index between buckets
- The storaged::append_log doctest fails to compile (malformed
  `{prefix}/` parses as a code fence); pre-existing bug, left for
  a focused fix

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


"""Admin endpoints: model lifecycle + VRAM introspection.
Phase 17 / Phase C — the VRAM-aware profile swap story. Ollama loads
models lazily and unloads after a TTL (default 5min). For predictable
swaps between model profiles, we need explicit control:
- unload: send keep_alive=0 to force immediate unload
- preload: send keep_alive=5m with empty prompt to proactively load
- ps: query Ollama's /api/ps to see what's currently loaded
All three are thin wrappers over Ollama's own API. No state held here.
"""
import asyncio
import shutil
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
from .ollama import client
router = APIRouter()
class ModelRequest(BaseModel):
model: str
@router.post("/unload")
async def unload(req: ModelRequest):
"""Force Ollama to unload a model immediately (keep_alive=0 trick)."""
async with client() as c:
resp = await c.post(
"/api/generate",
json={"model": req.model, "prompt": "", "keep_alive": 0, "stream": False},
)
# Ollama returns 200 even on unload; anything else is abnormal.
if resp.status_code not in (200,):
raise HTTPException(502, f"Ollama unload error: {resp.text}")
return {"unloaded": req.model}
@router.post("/preload")
async def preload(req: ModelRequest):
"""Force Ollama to load a model into VRAM and keep it there."""
async with client() as c:
resp = await c.post(
"/api/generate",
json={
"model": req.model,
# "" can confuse some models; a single token is safer and still ~instant.
"prompt": " ",
"keep_alive": "5m",
"stream": False,
"options": {"num_predict": 1},
},
)
if resp.status_code != 200:
raise HTTPException(502, f"Ollama preload error: {resp.text}")
data = resp.json()
return {
"preloaded": req.model,
"load_duration_ns": data.get("load_duration"),
"total_duration_ns": data.get("total_duration"),
}
@router.get("/ps")
async def list_loaded():
"""What models does Ollama currently have in VRAM?"""
async with client() as c:
resp = await c.get("/api/ps")
if resp.status_code != 200:
raise HTTPException(502, f"Ollama ps error: {resp.text}")
data = resp.json()
# Flatten Ollama's fields into something small + useful.
models = [
{
"name": m.get("name"),
"size_bytes": m.get("size"),
"size_vram_bytes": m.get("size_vram"),
"expires_at": m.get("expires_at"),
}
for m in data.get("models", [])
]
return {"models": models}
@router.get("/vram")
async def vram_summary():
"""Combined: nvidia-smi VRAM + Ollama loaded models.
Shells out to nvidia-smi; if it's not on PATH, returns just the
Ollama view. Intentionally async-via-to_thread so the blocking
subprocess doesn't stall the event loop.
"""
gpu = None
if shutil.which("nvidia-smi"):
gpu = await asyncio.to_thread(_nvidia_smi_snapshot)
async with client() as c:
resp = await c.get("/api/ps")
loaded = resp.json().get("models", []) if resp.status_code == 200 else []
return {
"gpu": gpu,
"ollama_loaded": [
{
"name": m.get("name"),
"size_vram_mib": (m.get("size_vram", 0) or 0) // (1024 * 1024),
"expires_at": m.get("expires_at"),
}
for m in loaded
],
}
def _nvidia_smi_snapshot():
"""One-shot nvidia-smi poll. Returns None on failure."""
import subprocess
try:
out = subprocess.check_output(
[
"nvidia-smi",
"--query-gpu=memory.used,memory.total,utilization.gpu,name",
"--format=csv,noheader,nounits",
],
timeout=2,
).decode().strip()
used_mib, total_mib, util_pct, name = [s.strip() for s in out.split(",")]
return {
"name": name,
"used_mib": int(used_mib),
"total_mib": int(total_mib),
"utilization_pct": int(util_pct),
}
except Exception:
return None