The "drop Python sidecar from Rust aibridge" item from the architecture_comparison decisions tracker. Universal-win cleanup — removes 1 process + 1 runtime + 1 hop from every embed/generate request, with no behavior change. ## What was on the hot path before gateway → AiClient → http://:3200 (FastAPI sidecar) ├── embed.py → http://:11434 (Ollama) ├── generate.py → http://:11434 ├── rerank.py → http://:11434 (loops generate) └── admin.py → http://:11434 (/api/ps + nvidia-smi) The sidecar's hot-path code (~120 LOC across embed.py / generate.py / rerank.py / admin.py) was pure pass-through: each route translated its request body to Ollama's wire format and returned Ollama's response in a sidecar envelope. Zero logic, one full HTTP hop of overhead. ## What's on the hot path now gateway → AiClient → http://:11434 (Ollama directly) Inline rewrites in crates/aibridge/src/client.rs: - embed_uncached: per-text loop to /api/embed; computes dimension from response[0].length (matches the sidecar's prior shape) - generate (direct path): translates GenerateRequest → /api/generate (model, prompt, stream:false, options:{temperature, num_predict}, system, think); maps response → GenerateResponse using Ollama's field names (response, prompt_eval_count, eval_count) - rerank: per-doc loop with the same score-prompt the sidecar used; parses leading number, clamps 0-10, sorts desc - unload_model: /api/generate with prompt:"", keep_alive:0 - preload_model: /api/generate with prompt:" ", keep_alive:"5m", num_predict:1 - vram_snapshot: GET /api/ps + std::process::Command nvidia-smi; same envelope shape as the sidecar's /admin/vram so callers keep parsing - health: GET /api/version, wrapped in a sidecar-shaped envelope ({status, ollama_url, ollama_version}) Public AiClient API is unchanged — Request/Response types untouched. Callers (gateway routes, vectord, etc.) require zero updates. ## Config changes - crates/shared/src/config.rs: default_sidecar_url() bumps to :11434. The TOML field stays `[sidecar].url` for migration compat (operators with existing configs don't need to rename anything). - lakehouse.toml + config/providers.toml: bumped to localhost:11434 with comments explaining the 2026-05-02 transition. ## What stays Python sidecar/sidecar/lab_ui.py (385 LOC) + pipeline_lab.py (503 LOC) are dev-mode Streamlit-shape UIs for prompt experimentation. Not on the runtime hot path; continue running for ad-hoc work. The embed/generate/rerank/admin routes inside sidecar can be retired, but operators who want to keep the sidecar process running for the lab UI face no breakage — those routes still call Ollama and work. ## Verification - cargo check --workspace: clean - cargo test -p aibridge --lib: 32/32 PASS - Live smoke against test gateway on :3199 with new config: /ai/embed → 768-dim vector for "forklift operator" ✓ /v1/chat → provider=ollama, model=qwen2.5:latest, content=OK ✓ - nvidia-smi parsing tested via std::process::Command path - Live `lakehouse.service` (port :3100) NOT yet restarted — deploy step is operator-driven (sudo systemctl restart lakehouse.service) ## Architecture comparison update (Captured separately in golangLAKEHOUSE/docs/ARCHITECTURE_COMPARISON.md decisions tracker.) The "drop Python sidecar" line moves from _open_ to DONE. The Rust process model now has 1 mega-binary instead of 1 mega-binary + 1 sidecar process — a small but real reduction in ops surface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lakehouse.toml (91 lines, 2.7 KiB)
# Lakehouse Configuration

[gateway]
host = "0.0.0.0"
port = 3100

[storage]
root = "./data"
profile_root = "./data/_profiles"
rescue_bucket = "rescue"

[[storage.buckets]]
name = "primary"
backend = "local"
root = "./data"

[[storage.buckets]]
name = "rescue"
backend = "local"
root = "./data/_rescue"

[[storage.buckets]]
name = "testing"
backend = "local"
root = "./data/_testing"

# S3 bucket via MinIO. The name "s3:lakehouse" is the convention
# lance_backend.rs uses to emit s3:// URIs for Lance datasets.
# Credentials resolved via environment (AWS_ACCESS_KEY_ID etc) or
# the secrets provider.
[[storage.buckets]]
name = "s3:lakehouse"
backend = "s3"
bucket = "lakehouse"
endpoint = "http://localhost:9000"
region = "us-east-1"
secret_ref = "minio-lakehouse"

[catalog]
# Manifests persisted to object storage under this prefix
manifest_prefix = "_catalog/manifests"

[query]
# max_rows_per_query = 10000

[sidecar]
# Post-2026-05-02: AiClient talks directly to Ollama; the Python
# sidecar's hot-path role (~120 LOC of pure Ollama wrappers) was
# retired. Field name kept for migration compat — value now points
# at Ollama on :11434. Lab UI + pipeline_lab Python remains as a
# dev-only tool, NOT on this URL.
url = "http://localhost:11434"

[ai]
embed_model = "nomic-embed-text"
# Local-tier defaults bumped 2026-04-30: qwen3.5:latest is the
# stronger local rung in the 5-loop substrate (per
# project_small_model_pipeline_vision.md). Same JSON-clean property
# as qwen2.5, more capacity. Ollama still serves both — bump back
# in this file if a workload regressed.
gen_model = "qwen3.5:latest"
rerank_model = "qwen3.5:latest"

[auth]
enabled = false
# api_key = "changeme"

[observability]
# Export traces to stdout (set to "otlp" for OpenTelemetry collector)
exporter = "stdout"
service_name = "lakehouse"

[agent]
# Phase 16.2 — background autotune agent. Opt-in: set enabled = true to
# let the agent continuously propose + trial HNSW configs and auto-promote
# winners. Defaults are conservative so it stays out of the way of live
# search traffic on shared Ollama.
enabled = true
cycle_interval_secs = 120          # periodic wake if no triggers
cooldown_between_trials_secs = 10  # min gap between trials
min_recall = 0.9                   # never promote below this
max_trials_per_hour = 20           # hard budget cap

# Model roster — available for profile hot-swap
# qwen3.5:latest: stronger local rung — JSON-clean, 8K+ context,
#   default for gen_model and rerank_model
# qwen3: 8.2B, 40K context, thinking+tools, best for reasoning tasks
# qwen2.5: 7B, 8K context, fast — kept loaded for the 2026-04 era
#   comparison runs; new defaults use qwen3.5:latest
# nomic-embed-text: 137M, embedding-only, used by all profiles
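For reference, a minimal sketch of how the `[sidecar]` migration-compat default described above can be wired up with serde. default_sidecar_url is the name the PR text gives; the struct name and surrounding shape are illustrative stand-ins for the real crates/shared/src/config.rs items:

```rust
// Sketch only: illustrates the compat shim described in the PR text.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct SidecarConfig {
    // Old TOML field name kept so existing operator configs parse
    // unchanged; the value is now an Ollama base URL.
    #[serde(default = "default_sidecar_url")]
    pub url: String,
}

// Post-2026-05-02 default: AiClient talks to Ollama directly.
fn default_sidecar_url() -> String {
    "http://localhost:11434".to_string()
}
```

Under this shape, an operator's existing `[sidecar]` table keeps deserializing, and a missing url falls back to Ollama on :11434.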