lakehouse/lakehouse.toml
root 9e6002c4d4 S3 backend for Lance — hybrid operates on real MinIO object storage
Enabled the lance "aws" feature for S3-compatible storage via opendal.
BucketRegistry: added with_allow_http(true) for MinIO and other non-TLS
S3 endpoints, fixing the "builder error" raised on plain-HTTP endpoints.
lakehouse.toml gains a [[storage.buckets]] entry name = "s3:lakehouse"
with the S3 backend config.
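
A minimal sketch of the allow-http decision (hypothetical helper name; the actual fix wires the flag into the registry's S3 client builder):

```rust
/// Returns true when an S3 endpoint is plain HTTP, as with a local
/// MinIO at http://localhost:9000. Such endpoints must be explicitly
/// permitted via an allow-http option, otherwise the client builder
/// rejects them (the "builder error" this change fixes).
fn needs_allow_http(endpoint: &str) -> bool {
    endpoint.starts_with("http://")
}

fn main() {
    // MinIO dev endpoint: requires allow_http(true).
    assert!(needs_allow_http("http://localhost:9000"));
    // TLS endpoint: no override needed.
    assert!(!needs_allow_http("https://s3.us-east-1.amazonaws.com"));
}
```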

lance_backend.rs: S3 bucket naming convention — buckets whose names
carry the "s3:" prefix emit s3:// URIs for Lance datasets. AWS_* env
vars in the systemd unit supply credentials to Lance's internal
object_store.
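
The naming convention can be sketched as follows (hypothetical helper name; the real mapping lives in lance_backend.rs, and the exact key layout under the bucket is an assumption here):

```rust
/// Resolve the URI for a Lance dataset from a registry bucket name.
/// Bucket names prefixed "s3:" map to s3:// URIs; anything else is
/// treated as a local-filesystem bucket rooted at `root`.
fn lance_dataset_uri(bucket_name: &str, root: &str, table: &str) -> String {
    match bucket_name.strip_prefix("s3:") {
        Some(s3_bucket) => format!("s3://{}/{}.lance", s3_bucket, table),
        None => format!("{}/{}.lance", root, table),
    }
}

fn main() {
    assert_eq!(
        lance_dataset_uri("s3:lakehouse", "", "docs"),
        "s3://lakehouse/docs.lance"
    );
    assert_eq!(
        lance_dataset_uri("primary", "./data", "docs"),
        "./data/docs.lance"
    );
}
```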

Verified end-to-end on real MinIO with a real 100K × 768-d vector set:
  - Migrate Parquet → Lance on S3: 1.7s (vs 0.57s local)
  - Build IVF_PQ: 16.4s (CPU-bound, essentially same as local)
  - Search: ~58ms p50 (vs 11ms local — S3 partition reads)
  - Random doc fetch: 13ms (vs 3.5ms local)
  - Recall@10: 0.835 (randomized IVF_PQ, consistent with local 0.805)
  - Total S3 footprint: 637 MiB (vectors + index + lance metadata)

The "public storage" claim from the PRD is now proven: the hybrid
Parquet+HNSW ⊕ Lance architecture works on S3-compatible object
storage, not just local filesystem.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 21:09:42 -05:00


# Lakehouse Configuration
[gateway]
host = "0.0.0.0"
port = 3100

[storage]
root = "./data"
profile_root = "./data/_profiles"
rescue_bucket = "rescue"

[[storage.buckets]]
name = "primary"
backend = "local"
root = "./data"

[[storage.buckets]]
name = "rescue"
backend = "local"
root = "./data/_rescue"

[[storage.buckets]]
name = "testing"
backend = "local"
root = "./data/_testing"

# S3 bucket via MinIO. The name "s3:lakehouse" is the convention
# lance_backend.rs uses to emit s3:// URIs for Lance datasets.
# Credentials resolved via environment (AWS_ACCESS_KEY_ID etc) or
# the secrets provider.
[[storage.buckets]]
name = "s3:lakehouse"
backend = "s3"
bucket = "lakehouse"
endpoint = "http://localhost:9000"
region = "us-east-1"
secret_ref = "minio-lakehouse"

[catalog]
# Manifests persisted to object storage under this prefix
manifest_prefix = "_catalog/manifests"

[query]
# max_rows_per_query = 10000

[sidecar]
url = "http://localhost:3200"

[ai]
embed_model = "nomic-embed-text"
gen_model = "qwen2.5"
rerank_model = "qwen2.5"

[auth]
enabled = false
# api_key = "changeme"

[observability]
# Export traces to stdout (set to "otlp" for OpenTelemetry collector)
exporter = "stdout"
service_name = "lakehouse"

[agent]
# Phase 16.2 — background autotune agent. Opt-in: set enabled = true to
# let the agent continuously propose + trial HNSW configs and auto-promote
# winners. Defaults are conservative so it stays out of the way of live
# search traffic on shared Ollama.
enabled = true
cycle_interval_secs = 120          # periodic wake if no triggers
cooldown_between_trials_secs = 10  # min gap between trials
min_recall = 0.9                   # never promote below this
max_trials_per_hour = 20           # hard budget cap
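
A sketch of how the [agent] budget knobs might gate trial starts (hypothetical types; field names mirror the TOML keys above, and the sliding-window logic is an assumption, not the agent's actual implementation):

```rust
/// Sliding-window trial budget: enforces cooldown_between_trials_secs
/// and max_trials_per_hour before a new autotune trial may start.
struct TrialBudget {
    cooldown_secs: u64,
    max_trials_per_hour: usize,
    recent: Vec<u64>, // trial start times (unix seconds), ascending
}

impl TrialBudget {
    fn may_start_trial(&mut self, now: u64) -> bool {
        // Keep only trials started within the last hour.
        self.recent.retain(|&t| now.saturating_sub(t) < 3600);
        // Cooldown: enough time since the most recent trial?
        let cooled = self
            .recent
            .last()
            .map_or(true, |&last| now.saturating_sub(last) >= self.cooldown_secs);
        cooled && self.recent.len() < self.max_trials_per_hour
    }

    fn record_trial(&mut self, now: u64) {
        self.recent.push(now);
    }
}

fn main() {
    let mut budget = TrialBudget {
        cooldown_secs: 10,       // cooldown_between_trials_secs
        max_trials_per_hour: 20, // hard budget cap
        recent: vec![],
    };
    assert!(budget.may_start_trial(0));
    budget.record_trial(0);
    assert!(!budget.may_start_trial(5)); // still in cooldown
    assert!(budget.may_start_trial(10)); // cooldown elapsed
}
```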