Five threads of work land as one milestone: each individually
verified end-to-end against real data, with a clean full release
build and all 46 unit tests passing.
## Phase 16.2 / 16.5 — autotune agent + ingest triggers
`vectord::agent` is a long-running tokio task that watches the trial
journal and autonomously proposes + runs new HNSW configs. Distinct
from `autotune::run_autotune` (the synchronous one-shot grid search).
Triggered by POST /vectors/agent/enqueue/{idx} or by the periodic wake; ingest
paths now push DatasetAppended events when an index's source dataset
gets re-ingested. Rate-limited (max_trials_per_hour) and cooldown-
gated so it can't saturate Ollama under live load.
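A minimal sketch of that gating idea; only `max_trials_per_hour` is named in this milestone, the other fields are illustrative:

```rust
// Sketch only: admit an agent trial behind the hourly rate limit and
// the cooldown. Everything except max_trials_per_hour is assumed.
use std::time::{Duration, Instant};

struct AgentGate {
    max_trials_per_hour: u32,
    cooldown: Duration,
    window_start: Instant,
    trials_in_window: u32,
    last_trial_at: Option<Instant>,
}

impl AgentGate {
    /// Returns true when a trial may run now.
    fn admit(&mut self, now: Instant) -> bool {
        // Roll the hourly window forward once it elapses.
        if now.duration_since(self.window_start) >= Duration::from_secs(3600) {
            self.window_start = now;
            self.trials_in_window = 0;
        }
        let cooled = self
            .last_trial_at
            .map_or(true, |t| now.duration_since(t) >= self.cooldown);
        if cooled && self.trials_in_window < self.max_trials_per_hour {
            self.trials_in_window += 1;
            self.last_trial_at = Some(now);
            true
        } else {
            false
        }
    }
}
```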
The proposer is ε-greedy around the current champion: with probability
0.25 it samples uniformly from the full bounds, otherwise it perturbs
the champion by a small ± delta on both axes. Proposals are deduped
against history. It is deterministic: the RNG is seeded from
history.len(), so the same journal state proposes the same next
config (which helps offline replay debugging).
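A sketch of that proposer, assuming the two axes are HNSW's m and ef_construction and with illustrative bounds and delta sizes:

```rust
// ε-greedy proposer sketch; bounds and perturbation sizes are illustrative.
use rand::{rngs::StdRng, Rng, SeedableRng};

#[derive(Clone, Copy, PartialEq, Eq)]
struct HnswConfig {
    m: u32,
    ef_construction: u32,
}

fn propose(champion: HnswConfig, history: &[HnswConfig]) -> Option<HnswConfig> {
    // Seeding from history.len() makes the proposal a pure function of
    // journal state, so an offline replay walks the same sequence.
    let mut rng = StdRng::seed_from_u64(history.len() as u64);
    for _ in 0..64 {
        let cand = if rng.gen::<f64>() < 0.25 {
            // Explore: uniform sample over the full bounds.
            HnswConfig {
                m: rng.gen_range(4..=64),
                ef_construction: rng.gen_range(32..=512),
            }
        } else {
            // Exploit: small ± perturbation of the champion on both axes.
            HnswConfig {
                m: champion.m.saturating_add_signed(rng.gen_range(-4..=4)).max(4),
                ef_construction: champion
                    .ef_construction
                    .saturating_add_signed(rng.gen_range(-32..=32))
                    .max(32),
            }
        };
        // Dedup against history before accepting.
        if !history.contains(&cand) {
            return Some(cand);
        }
    }
    None
}
```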
A new `[agent]` section in lakehouse.toml configures it; opt-in via `enabled = true`.
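A plausible shape for that section as a serde struct; only `enabled` and `max_trials_per_hour` come from this milestone, the other field names and all default values are assumptions:

```rust
// Illustrative [agent] config struct; fields beyond enabled and
// max_trials_per_hour are assumed, as are the defaults.
use serde::Deserialize;

#[derive(Deserialize)]
#[serde(default)]
struct AgentConfig {
    /// Opt-in: the agent stays dormant unless enabled = true.
    enabled: bool,
    /// Hard cap on autonomous trials per rolling hour.
    max_trials_per_hour: u32,
    /// Cooldown between trials, in seconds (assumed name).
    cooldown_secs: u64,
    /// Seconds between periodic wakes (assumed name).
    wake_interval_secs: u64,
}

impl Default for AgentConfig {
    fn default() -> Self {
        Self {
            enabled: false,
            max_trials_per_hour: 4,
            cooldown_secs: 900,
            wake_interval_secs: 300,
        }
    }
}
```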
## Federation Layer 2 — runtime bucket lifecycle + per-index scoping
`BucketRegistry.buckets` moved to `std::sync::RwLock<HashMap>` so
buckets can be added/removed after startup. POST /storage/buckets
provisions at runtime; DELETE /storage/buckets/{name} unregisters
(refuses primary/rescue with 403). Local-backend buckets get their
root directory auto-created.
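A sketch of the registry shape; the `Bucket` type and method names are stand-ins, not the shipped vectord API:

```rust
// Runtime-mutable bucket registry sketch.
use std::collections::HashMap;
use std::sync::RwLock;

struct Bucket; // stand-in for the real store handle

#[derive(Default)]
struct BucketRegistry {
    buckets: RwLock<HashMap<String, Bucket>>,
}

impl BucketRegistry {
    /// Backs POST /storage/buckets: register a bucket after startup.
    fn provision(&self, name: String, bucket: Bucket) {
        self.buckets.write().unwrap().insert(name, bucket);
    }

    /// Backs DELETE /storage/buckets/{name}: the built-in buckets are
    /// refused (the route answers 403).
    fn unregister(&self, name: &str) -> Result<(), &'static str> {
        if name == "primary" || name == "rescue" {
            return Err("cannot remove primary/rescue");
        }
        self.buckets.write().unwrap().remove(name);
        Ok(())
    }
}
```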
`IndexMeta.bucket` (default "primary" via serde) records each index's
home bucket. `TrialJournal` and `PromotionRegistry` now hold
Arc<BucketRegistry> + IndexRegistry; they resolve the target store per
index via IndexMeta.bucket. PromotionRegistry::list_all scans every
bucket and dedups by index_name. Pre-federation indexes keep working
unchanged; they just default to primary.
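A sketch of the serde default that keeps pre-federation metadata readable; the struct shape is illustrative:

```rust
// IndexMeta.bucket with a serde default.
use serde::{Deserialize, Serialize};

fn default_bucket() -> String {
    "primary".to_string()
}

#[derive(Serialize, Deserialize)]
struct IndexMeta {
    name: String,
    /// Home bucket. Old metadata files lack the field entirely and
    /// deserialize to "primary", so pre-federation indexes resolve
    /// to the primary store unchanged.
    #[serde(default = "default_bucket")]
    bucket: String,
}
```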
`ModelProfile.bucket: Option<String>` declares a per-profile artifact
home. POST /vectors/profile/{id}/activate auto-provisions the
profile's bucket under storage.profile_root if it isn't yet registered.
EvalSets stay primary-only for now; a noted gap, low-risk to extend
later with the same resolver pattern.
## Phase 17 — VRAM-aware two-profile gate
Sidecar gains POST /admin/unload (the Ollama keep_alive=0 trick,
forcing immediate VRAM release), POST /admin/preload (keep_alive=5m
with a minimal prompt, taking the slot warm), and GET /admin/vram
(combining an nvidia-smi snapshot with Ollama /api/ps). Exposed via
aibridge as unload_model / preload_model / vram_snapshot; the full
sidecar module is reproduced in the appendix below.
`VectorState.active_profile` is the GPU-slot singleton, an
Arc<RwLock<Option<ActiveProfileSlot>>>. activate_profile checks for
a previous profile with a different ollama_name and unloads it
before preloading the new one; same-model reactivations skip the
unload (Ollama no-ops). New routes: POST
/vectors/profile/{id}/deactivate (unload + clear slot), GET
/vectors/profile/active.
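A sketch of the swap; the slot fields and lock flavor are illustrative, and unload_model / preload_model stand in for the aibridge calls:

```rust
// GPU-slot swap sketch around activate_profile.
use std::sync::Arc;
use tokio::sync::RwLock;

#[derive(Clone)]
struct ActiveProfileSlot {
    profile_id: String,
    ollama_name: String,
}

async fn activate_profile(
    slot: Arc<RwLock<Option<ActiveProfileSlot>>>,
    next: ActiveProfileSlot,
) {
    let mut guard = slot.write().await;
    match guard.as_ref() {
        // Same underlying model: skip the unload; Ollama no-ops anyway.
        Some(prev) if prev.ollama_name == next.ollama_name => {}
        // A different model holds the slot: evict it before warming.
        Some(prev) => {
            unload_model(&prev.ollama_name).await;
            preload_model(&next.ollama_name).await;
        }
        // Empty slot: just warm the new model.
        None => preload_model(&next.ollama_name).await,
    }
    *guard = Some(next);
}

async fn unload_model(_m: &str) { /* aibridge → sidecar keep_alive=0 */ }
async fn preload_model(_m: &str) { /* aibridge → sidecar keep_alive=5m */ }
```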
Verified live: a staffing-recruiter (qwen2.5) → docs-assistant
(mistral) swap freed qwen2.5 from VRAM and loaded mistral.
nomic-embed-text persists across swaps because both profiles use it,
a free optimization that fell out of the design. Scoped search
correctly returns 403 cross-profile in both directions.
## MySQL streaming connector
`crates/ingestd/src/my_stream.rs` mirrors pg_stream.rs for MySQL.
It uses the pure-Rust `mysql_async` driver (default-features=false to
avoid C deps), with the same OFFSET pagination and the same
Parquet-streaming write shape.
Type mapping per ADR-010: int/bigint → Int32/Int64, decimal/float
→ Float64, tinyint(1)/bool → Boolean, everything else → Utf8 with
fallback parsers for date/time/json/uuid via Display.
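A sketch of that mapping against mysql_async's wire-protocol column types; the match arms are a simplified illustration of the ADR-010 table, not the shipped code:

```rust
// ADR-010 type mapping sketch: MySQL wire types → Arrow DataType.
use arrow_schema::DataType;
use mysql_async::consts::ColumnType;

fn arrow_type(col: ColumnType, column_length: u32) -> DataType {
    match col {
        // tinyint(1) is MySQL's bool convention; wider tinyints fall through.
        ColumnType::MYSQL_TYPE_TINY if column_length == 1 => DataType::Boolean,
        ColumnType::MYSQL_TYPE_LONG => DataType::Int32,
        ColumnType::MYSQL_TYPE_LONGLONG => DataType::Int64,
        ColumnType::MYSQL_TYPE_NEWDECIMAL
        | ColumnType::MYSQL_TYPE_FLOAT
        | ColumnType::MYSQL_TYPE_DOUBLE => DataType::Float64,
        // date/time/json/uuid and everything else: Utf8 via Display.
        _ => DataType::Utf8,
    }
}
```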
POST /ingest/mysql parallel to /ingest/db. Same PII auto-detection,
same lineage capture (source_system="mysql"), same agent-trigger
hook. `redact_dsn` generalized — was hardcoded to "postgresql://"
length, now works for any scheme://user:pass@host/path URL (latent
PII leak fix for MySQL DSNs).
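A sketch of the generalized redactor; the shipped helper may differ, but the shape of the fix is scheme-agnostic parsing instead of a hardcoded prefix length:

```rust
// Scheme-agnostic DSN redaction sketch: mask the password in any
// scheme://user:pass@host/path URL.
fn redact_dsn(dsn: &str) -> String {
    let Some(scheme_end) = dsn.find("://") else {
        return dsn.to_string(); // not a URL-shaped DSN
    };
    let rest = &dsn[scheme_end + 3..];
    // Credentials, if present, end at the last '@' before the host.
    let Some(at) = rest.rfind('@') else {
        return dsn.to_string(); // no credentials to redact
    };
    let user = rest[..at].split(':').next().unwrap_or("");
    format!("{}://{}:***@{}", &dsn[..scheme_end], user, &rest[at + 1..])
}

#[test]
fn redacts_mysql_and_postgres_alike() {
    assert_eq!(
        redact_dsn("mysql://app:s3cret@localhost:3306/db"),
        "mysql://app:***@localhost:3306/db"
    );
    assert_eq!(
        redact_dsn("postgresql://app:s3cret@db.internal/app"),
        "postgresql://app:***@db.internal/app"
    );
}
```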
Verified live against MariaDB on localhost: 10 rows × 9 columns of
test data round-tripped through datatypes int/varchar/decimal/
tinyint/datetime/text. PII detection auto-flagged name + email.
Aggregation queries through DataFusion match the source values
exactly.
## Phase 18 — Hybrid Parquet+HNSW ⊕ Lance backend (ADR-019)
`vectord-lance` is a new firewall crate. Lance pulls Arrow 57 and
DataFusion 52, incompatible with the rest of the workspace's
Arrow 55 / DataFusion 47. The firewall isolates that dependency tree:
the public API uses only std types (Vec<f32>, Vec<String>, Hit, Row,
*Stats), so no Arrow types cross the crate boundary and nothing
propagates to vectord. This is the ADR-019 path that hadn't shipped
until now.
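A sketch of what that boundary looks like from vectord's side; names are illustrative:

```rust
// vectord-lance public-surface sketch: only std types cross the
// boundary, so Arrow 57 / DataFusion 52 stay private to the crate.
pub struct Hit {
    pub doc_id: String,
    pub distance: f32,
}

pub struct LanceVectorStore {
    uri: String,
}

impl LanceVectorStore {
    pub fn open(uri: &str) -> Self {
        Self { uri: uri.to_string() }
    }

    /// Vec<f32> in, Vec<Hit> out: no Arrow types in the signature.
    pub fn search(&self, _query: Vec<f32>, _k: usize) -> Vec<Hit> {
        // Internally this would run an Arrow 57 query against the
        // Lance dataset at self.uri; elided here.
        unimplemented!()
    }
}
```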
`vectord::lance_backend::LanceRegistry` lazy-creates a
LanceVectorStore per index, resolving bucket → URI via the
conventional local-bucket layout. `IndexMeta.vector_backend` and
`ModelProfile.vector_backend` carry the choice (defaulting to Parquet,
so existing indexes are unchanged).
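A sketch of the lazy per-index creation; the locking shape and the URI convention shown are assumptions:

```rust
// LanceRegistry sketch: create each index's store on first use,
// resolving bucket root → dataset URI by convention.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

struct LanceVectorStore {
    uri: String,
}

#[derive(Default)]
struct LanceRegistry {
    stores: Mutex<HashMap<String, Arc<LanceVectorStore>>>,
}

impl LanceRegistry {
    fn get_or_create(&self, index: &str, bucket_root: &str) -> Arc<LanceVectorStore> {
        self.stores
            .lock()
            .unwrap()
            .entry(index.to_string())
            .or_insert_with(|| {
                // Assumed local-bucket layout: <bucket_root>/<index>.lance
                Arc::new(LanceVectorStore {
                    uri: format!("{bucket_root}/{index}.lance"),
                })
            })
            .clone()
    }
}
```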
Six routes under /vectors/lance/*:
- migrate/{idx}: convert binary-blob Parquet → Lance FixedSizeList
- index/{idx}: build IVF_PQ
- search/{idx}: vector search (embed via sidecar)
- doc/{idx}/{doc_id}: random row fetch
- append/{idx}: native fragment append
- stats/{idx}: row count + index presence
Verified live on the real resumes_100k_v2 corpus (100K × 768d):
- Migrate: 0.57s
- Build IVF_PQ index: 16.2s (matches ADR-019 bench; 14× faster than
HNSW's 230s for the same data)
- Search end-to-end (Ollama embed + Lance scan): 23-53ms
- Random doc_id fetch: 5-7ms (filter scan; faster than Parquet's
  ~35ms full-file scan, slower than the bench's 311us positional
  take; a scalar btree on doc_id would close that gap)
- Append 100 rows: 3.3ms / +320KB on disk vs Parquet's required
  full ~330MB rewrite; the structural win
- Index survives append; both backends coexist cleanly
## Known follow-ups not in this milestone
- ModelProfile.vector_backend doesn't yet auto-route /vectors/profile/
{id}/search to Lance; callers go through /vectors/lance/* directly
- Scalar btree on doc_id (closes the 5-7ms → ~300us gap)
- vectord-lance built default-features=false → no S3 yet
- IVF_PQ recall not measured (ADR-019 caveat) — needs a Lance-aware
variant of the eval harness
- Watcher-path ingest doesn't push agent triggers (HTTP paths do)
- EvalSets still primary-only (federation gap)
- No PATCH endpoint to move an existing index between buckets
- The pre-existing storaged::append_log doctest still fails to compile
  (a malformed `{prefix}/` parses as a code fence); left for a
  focused fix
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
"""Admin endpoints: model lifecycle + VRAM introspection.
|
|
|
|
Phase 17 / Phase C — the VRAM-aware profile swap story. Ollama loads
|
|
models lazily and unloads after a TTL (default 5min). For predictable
|
|
swaps between model profiles, we need explicit control:
|
|
|
|
- unload: send keep_alive=0 to force immediate unload
|
|
- preload: send keep_alive=5m with empty prompt to proactively load
|
|
- ps: query Ollama's /api/ps to see what's currently loaded
|
|
|
|
All three are thin wrappers over Ollama's own API. No state held here.
|
|
"""
|
|
|
|
import asyncio
|
|
import shutil
|
|
from fastapi import APIRouter, HTTPException
|
|
from pydantic import BaseModel
|
|
|
|
from .ollama import client
|
|
|
|
router = APIRouter()
|
|
|
|
|
|
class ModelRequest(BaseModel):
|
|
model: str
|
|
|
|
|
|
@router.post("/unload")
|
|
async def unload(req: ModelRequest):
|
|
"""Force Ollama to unload a model immediately (keep_alive=0 trick)."""
|
|
async with client() as c:
|
|
resp = await c.post(
|
|
"/api/generate",
|
|
json={"model": req.model, "prompt": "", "keep_alive": 0, "stream": False},
|
|
)
|
|
# Ollama returns 200 even on unload; anything else is abnormal.
|
|
if resp.status_code not in (200,):
|
|
raise HTTPException(502, f"Ollama unload error: {resp.text}")
|
|
return {"unloaded": req.model}
|
|
|
|
|
|
@router.post("/preload")
|
|
async def preload(req: ModelRequest):
|
|
"""Force Ollama to load a model into VRAM and keep it there."""
|
|
async with client() as c:
|
|
resp = await c.post(
|
|
"/api/generate",
|
|
json={
|
|
"model": req.model,
|
|
# "" can confuse some models; a single token is safer and still ~instant.
|
|
"prompt": " ",
|
|
"keep_alive": "5m",
|
|
"stream": False,
|
|
"options": {"num_predict": 1},
|
|
},
|
|
)
|
|
if resp.status_code != 200:
|
|
raise HTTPException(502, f"Ollama preload error: {resp.text}")
|
|
data = resp.json()
|
|
return {
|
|
"preloaded": req.model,
|
|
"load_duration_ns": data.get("load_duration"),
|
|
"total_duration_ns": data.get("total_duration"),
|
|
}
|
|
|
|
|
|
@router.get("/ps")
|
|
async def list_loaded():
|
|
"""What models does Ollama currently have in VRAM?"""
|
|
async with client() as c:
|
|
resp = await c.get("/api/ps")
|
|
if resp.status_code != 200:
|
|
raise HTTPException(502, f"Ollama ps error: {resp.text}")
|
|
data = resp.json()
|
|
# Flatten Ollama's fields into something small + useful.
|
|
models = [
|
|
{
|
|
"name": m.get("name"),
|
|
"size_bytes": m.get("size"),
|
|
"size_vram_bytes": m.get("size_vram"),
|
|
"expires_at": m.get("expires_at"),
|
|
}
|
|
for m in data.get("models", [])
|
|
]
|
|
return {"models": models}
|
|
|
|
|
|
@router.get("/vram")
|
|
async def vram_summary():
|
|
"""Combined: nvidia-smi VRAM + Ollama loaded models.
|
|
|
|
Shells out to nvidia-smi; if it's not on PATH, returns just the
|
|
Ollama view. Intentionally async-via-to_thread so the blocking
|
|
subprocess doesn't stall the event loop.
|
|
"""
|
|
gpu = None
|
|
if shutil.which("nvidia-smi"):
|
|
gpu = await asyncio.to_thread(_nvidia_smi_snapshot)
|
|
|
|
async with client() as c:
|
|
resp = await c.get("/api/ps")
|
|
loaded = resp.json().get("models", []) if resp.status_code == 200 else []
|
|
|
|
return {
|
|
"gpu": gpu,
|
|
"ollama_loaded": [
|
|
{
|
|
"name": m.get("name"),
|
|
"size_vram_mib": (m.get("size_vram", 0) or 0) // (1024 * 1024),
|
|
"expires_at": m.get("expires_at"),
|
|
}
|
|
for m in loaded
|
|
],
|
|
}
|
|
|
|
|
|
def _nvidia_smi_snapshot():
|
|
"""One-shot nvidia-smi poll. Returns None on failure."""
|
|
import subprocess
|
|
try:
|
|
out = subprocess.check_output(
|
|
[
|
|
"nvidia-smi",
|
|
"--query-gpu=memory.used,memory.total,utilization.gpu,name",
|
|
"--format=csv,noheader,nounits",
|
|
],
|
|
timeout=2,
|
|
).decode().strip()
|
|
used_mib, total_mib, util_pct, name = [s.strip() for s in out.split(",")]
|
|
return {
|
|
"name": name,
|
|
"used_mib": int(used_mib),
|
|
"total_mib": int(total_mib),
|
|
"utilization_pct": int(util_pct),
|
|
}
|
|
except Exception:
|
|
return None
|