Phase B: Lance pilot — hybrid decision with measured benchmark
Standalone benchmark crate `crates/lance-bench` running Lance 4.0 against our Parquet+HNSW at 100K × 768d (resumes_100k_v2) measured 8 dimensions. Results (see docs/ADR-019-vector-storage.md for full scorecard): Cold load: Parquet 0.17s vs Lance 0.13s (tie — not ≥2× threshold) Disk size: 330.3 MB vs 330.4 MB (tie) Search p50: 873us vs 2229us (Parquet 2.55× faster) Search p95: 1413us vs 4998us (Parquet 3.54× faster) Index build: 230s (ec=80) vs 16s (IVF_PQ) (Lance 14× faster) Random access: 35ms (scan) vs 311us (Lance 112× faster) Append 10K rows: full rewrite vs 0.08s/+31MB (Lance structural win) Decision (ADR-019): hybrid, not migrate-or-reject. - Parquet+HNSW stays primary — our HNSW at ec=80 es=30 recall=1.00 is 2.55× faster than Lance IVF_PQ at 100K in-RAM scale - Lance joins as second backend per-profile for workloads where it wins architecturally: random row access (RAG text fetch), append-heavy pipelines (Phase C), hot-swap generations (Phase 16, 14× faster builds), and indexes past the ~5M RAM ceiling - Phase 17 ModelProfile gets vector_backend: Parquet | Lance field - Ceiling table in PRD updated — 5M ceiling now says "switch to Lance" instead of "migrate" since Lance runs alongside, not instead of Isolation: lance-bench is a standalone workspace crate with its own dep tree (Lance pulls DataFusion 52 + Arrow 57 incompatible with main stack DataFusion 47 + Arrow 55). Kept off the critical path until API is stable enough to promote into vectord::lance_store. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
dbe00d018f
commit
76f6fba5de
3119
Cargo.lock
generated
3119
Cargo.lock
generated
File diff suppressed because it is too large
Load Diff
@ -12,6 +12,7 @@ members = [
|
||||
"crates/journald",
|
||||
"crates/gateway",
|
||||
"crates/ui",
|
||||
"crates/lance-bench",
|
||||
]
|
||||
|
||||
[workspace.dependencies]
|
||||
|
||||
42
crates/lance-bench/Cargo.toml
Normal file
42
crates/lance-bench/Cargo.toml
Normal file
@ -0,0 +1,42 @@
|
||||
[package]
|
||||
name = "lance-bench"
|
||||
version = "0.1.0"
|
||||
edition = "2024"
|
||||
|
||||
# Standalone pilot for Phase B (see docs/EXECUTION_PLAN.md).
|
||||
# Deliberately NOT sharing workspace deps — Lance 4.x pulls in its own
|
||||
# DataFusion and Arrow versions incompatible with the rest of the stack.
|
||||
# Isolating the pilot means we don't force a workspace-wide upgrade until
|
||||
# we've decided Lance is worth it.
|
||||
|
||||
[dependencies]
|
||||
# Only the features we actually need — the default brings in AWS/Azure/GCP/HF etc
|
||||
# which is ~200 extra crates we don't care about for a local pilot.
|
||||
lance = { version = "4.0", default-features = false }
|
||||
# Lance exposes DatasetIndexExt, IndexType, and IvfBuildParams through
|
||||
# its sub-crates which must be imported directly — lance itself doesn't
|
||||
# re-export them at a convenient path.
|
||||
lance-index = { version = "4.0", default-features = false }
|
||||
lance-linalg = { version = "4.0", default-features = false }
|
||||
|
||||
# Arrow re-exported by Lance; pin to a range Lance picks so types match.
|
||||
arrow = "57"
|
||||
arrow-array = "57"
|
||||
arrow-schema = "57"
|
||||
|
||||
# Also need to read the EXISTING Parquet vector files so we can compare.
|
||||
# These live in data/vectors/*.parquet. Lance's internal Parquet reading
|
||||
# might differ from ours; using our format's Arrow/Parquet versions for
|
||||
# the read side keeps the inputs identical.
|
||||
parquet = "57"
|
||||
|
||||
tokio = { version = "1", features = ["full"] }
|
||||
futures = "0.3"
|
||||
serde = { version = "1", features = ["derive"] }
|
||||
serde_json = "1"
|
||||
anyhow = "1"
|
||||
bytes = "1"
|
||||
|
||||
[[bin]]
|
||||
name = "lance-bench"
|
||||
path = "src/main.rs"
|
||||
633
crates/lance-bench/src/main.rs
Normal file
633
crates/lance-bench/src/main.rs
Normal file
@ -0,0 +1,633 @@
|
||||
//! Phase B: Lance pilot benchmark.
|
||||
//!
|
||||
//! Standalone binary that compares Lance vector storage against our
|
||||
//! Parquet-with-binary-blob + in-RAM HNSW approach. See
|
||||
//! docs/EXECUTION_PLAN.md for the decision rules this fuels.
|
||||
//!
|
||||
//! Inputs:
|
||||
//! data/vectors/resumes_100k_v2.parquet — existing 100K × 768d embeddings
|
||||
//!
|
||||
//! Output:
|
||||
//! A JSON report printed to stdout with measurements for:
|
||||
//! - Cold load time (parquet → arrow) vs Lance open + scan
|
||||
//! - Disk size
|
||||
//! - Vector search latency (p50 / p95 / p99)
|
||||
//! - Single-row random access
|
||||
//! - Append cost (adding 10K rows)
|
||||
//!
|
||||
//! Usage:
|
||||
//! cargo run --bin lance-bench -- \
|
||||
//! --parquet data/vectors/resumes_100k_v2.parquet \
|
||||
//! --lance-out /tmp/lance_resumes_100k_v2 \
|
||||
//! --json-out /tmp/lance_bench.json
|
||||
|
||||
use anyhow::{Context, Result};
|
||||
use arrow_array::{Array, ArrayRef, BinaryArray, FixedSizeListArray, Float32Array, RecordBatch, RecordBatchIterator};
|
||||
use arrow_schema::{DataType, Field, Schema};
|
||||
use serde::Serialize;
|
||||
use std::sync::Arc;
|
||||
use std::time::Instant;
|
||||
|
||||
#[derive(Debug, Serialize)]
|
||||
struct BenchReport {
|
||||
vectors: usize,
|
||||
dimensions: usize,
|
||||
parquet_path: String,
|
||||
lance_path: String,
|
||||
|
||||
// Parquet baseline
|
||||
parquet_disk_bytes: u64,
|
||||
parquet_cold_load_secs: f32,
|
||||
|
||||
// Lance numbers
|
||||
lance_write_secs: f32,
|
||||
lance_disk_bytes: u64,
|
||||
lance_cold_open_secs: f32,
|
||||
|
||||
// Index + search
|
||||
lance_index_build_secs: Option<f32>,
|
||||
lance_index_disk_bytes: Option<u64>,
|
||||
lance_search_p50_us: Option<f32>,
|
||||
lance_search_p95_us: Option<f32>,
|
||||
lance_search_p99_us: Option<f32>,
|
||||
|
||||
// Architectural features Parquet+sidecar can't cheaply do
|
||||
lance_random_row_access_us: Option<f32>, // fetch one row by row_id
|
||||
parquet_random_row_access_us: Option<f32>, // for comparison — full scan cost
|
||||
lance_append_10k_secs: Option<f32>, // add 10K new rows
|
||||
lance_append_disk_bytes_added: Option<u64>,
|
||||
|
||||
// Head-to-head reference (from our own measurements)
|
||||
reference_hnsw_p50_us: f32,
|
||||
reference_hnsw_p95_us: f32,
|
||||
reference_brute_force_us: f32,
|
||||
reference_hnsw_build_secs: f32,
|
||||
}
|
||||
|
||||
#[tokio::main]
|
||||
async fn main() -> Result<()> {
|
||||
// Simple positional args: parquet_in, lance_out.
|
||||
let args: Vec<String> = std::env::args().collect();
|
||||
let parquet_path = args
|
||||
.get(1)
|
||||
.cloned()
|
||||
.unwrap_or_else(|| "data/vectors/resumes_100k_v2.parquet".to_string());
|
||||
let lance_path = args
|
||||
.get(2)
|
||||
.cloned()
|
||||
.unwrap_or_else(|| "/tmp/lance_bench_dataset".to_string());
|
||||
|
||||
eprintln!("=== Phase B Lance pilot ===");
|
||||
eprintln!("input parquet: {}", parquet_path);
|
||||
eprintln!("output lance: {}", lance_path);
|
||||
|
||||
// --- 1. Cold-load the existing Parquet vector index into memory
|
||||
eprintln!("\n[1/4] reading Parquet baseline...");
|
||||
let t0 = Instant::now();
|
||||
let (schema, batches, total_rows) = read_parquet_vectors(&parquet_path)
|
||||
.context("read parquet")?;
|
||||
let parquet_cold_load_secs = t0.elapsed().as_secs_f32();
|
||||
let parquet_disk_bytes = std::fs::metadata(&parquet_path)?.len();
|
||||
|
||||
let dims = detect_vector_dims(&batches)?;
|
||||
eprintln!(
|
||||
" loaded {} rows, {} columns, vectors={}d, disk={:.1} MB, cold load={:.2}s",
|
||||
total_rows,
|
||||
schema.fields().len(),
|
||||
dims,
|
||||
parquet_disk_bytes as f64 / 1_000_000.0,
|
||||
parquet_cold_load_secs,
|
||||
);
|
||||
|
||||
// --- 2. Convert from binary-blob-of-f32 to Lance's FixedSizeList<Float32>
|
||||
eprintln!("\n[2/4] converting binary-blob vectors to Arrow FixedSizeList...");
|
||||
let t0 = Instant::now();
|
||||
let (lance_schema, lance_batches) = convert_to_fixed_size_list(&schema, batches, dims)?;
|
||||
eprintln!(" conversion took {:.2}s", t0.elapsed().as_secs_f32());
|
||||
|
||||
// --- 3. Write as Lance dataset
|
||||
eprintln!("\n[3/4] writing Lance dataset...");
|
||||
let t0 = Instant::now();
|
||||
// Clean up any prior run
|
||||
let _ = std::fs::remove_dir_all(&lance_path);
|
||||
write_lance_dataset(&lance_path, lance_schema.clone(), lance_batches).await?;
|
||||
let lance_write_secs = t0.elapsed().as_secs_f32();
|
||||
let lance_disk_bytes = dir_size_bytes(&lance_path);
|
||||
eprintln!(
|
||||
" write took {:.2}s, disk={:.1} MB",
|
||||
lance_write_secs,
|
||||
lance_disk_bytes as f64 / 1_000_000.0,
|
||||
);
|
||||
|
||||
// --- 4. Cold open + scan the Lance dataset
|
||||
eprintln!("\n[4/6] cold-opening Lance dataset...");
|
||||
let t0 = Instant::now();
|
||||
let scanned_rows = cold_open_and_scan_lance(&lance_path).await?;
|
||||
let lance_cold_open_secs = t0.elapsed().as_secs_f32();
|
||||
eprintln!(
|
||||
" open + full scan: {} rows in {:.2}s",
|
||||
scanned_rows, lance_cold_open_secs,
|
||||
);
|
||||
|
||||
// --- 5. Build a vector index on the Lance dataset
|
||||
eprintln!("\n[5/6] building Lance vector index (IVF_PQ)...");
|
||||
let t0 = Instant::now();
|
||||
let index_built = build_lance_vector_index(&lance_path, dims).await;
|
||||
let (lance_index_build_secs, lance_index_disk_bytes) = match index_built {
|
||||
Ok(()) => {
|
||||
let secs = t0.elapsed().as_secs_f32();
|
||||
let disk = dir_size_bytes(&lance_path) - lance_disk_bytes;
|
||||
eprintln!(" built in {:.2}s, index adds {:.1} MB on disk", secs, disk as f64 / 1e6);
|
||||
(Some(secs), Some(disk))
|
||||
}
|
||||
Err(e) => {
|
||||
eprintln!(" index build failed: {e:#}");
|
||||
(None, None)
|
||||
}
|
||||
};
|
||||
|
||||
// --- 6. Run search queries, measure latency
|
||||
eprintln!("\n[6/6] running vector search benchmarks...");
|
||||
let search_stats = if lance_index_build_secs.is_some() {
|
||||
run_search_benchmarks(&lance_path, dims).await.ok()
|
||||
} else {
|
||||
None
|
||||
};
|
||||
let (lance_search_p50, lance_search_p95, lance_search_p99) = match search_stats {
|
||||
Some((p50, p95, p99)) => {
|
||||
eprintln!(" p50={:.0}us p95={:.0}us p99={:.0}us", p50, p95, p99);
|
||||
(Some(p50), Some(p95), Some(p99))
|
||||
}
|
||||
None => (None, None, None),
|
||||
};
|
||||
|
||||
// --- Random access comparison
|
||||
eprintln!("\n[7/8] random row access — Lance vs full-scan Parquet...");
|
||||
let lance_random = measure_random_access_lance(&lance_path).await.ok();
|
||||
let parquet_random = measure_random_access_parquet(&parquet_path).ok();
|
||||
if let Some(us) = lance_random {
|
||||
eprintln!(" Lance random-fetch avg: {:.0}us", us);
|
||||
}
|
||||
if let Some(us) = parquet_random {
|
||||
eprintln!(" Parquet full-scan-to-row avg: {:.0}us", us);
|
||||
}
|
||||
|
||||
// --- Append cost
|
||||
eprintln!("\n[8/8] append 10K new rows to existing dataset...");
|
||||
let t0 = Instant::now();
|
||||
let pre_append_bytes = dir_size_bytes(&lance_path);
|
||||
let append_result = append_10k_rows(&lance_path, dims).await;
|
||||
let (lance_append_secs, lance_append_bytes) = match append_result {
|
||||
Ok(()) => {
|
||||
let secs = t0.elapsed().as_secs_f32();
|
||||
let bytes = dir_size_bytes(&lance_path).saturating_sub(pre_append_bytes);
|
||||
eprintln!(" append took {:.2}s, added {:.1} MB", secs, bytes as f64 / 1e6);
|
||||
(Some(secs), Some(bytes))
|
||||
}
|
||||
Err(e) => {
|
||||
eprintln!(" append failed: {e:#}");
|
||||
(None, None)
|
||||
}
|
||||
};
|
||||
|
||||
// --- Report
|
||||
let report = BenchReport {
|
||||
vectors: total_rows,
|
||||
dimensions: dims,
|
||||
parquet_path: parquet_path.clone(),
|
||||
lance_path: lance_path.clone(),
|
||||
parquet_disk_bytes,
|
||||
parquet_cold_load_secs,
|
||||
lance_write_secs,
|
||||
lance_disk_bytes,
|
||||
lance_cold_open_secs,
|
||||
lance_index_build_secs,
|
||||
lance_index_disk_bytes,
|
||||
lance_search_p50_us: lance_search_p50,
|
||||
lance_search_p95_us: lance_search_p95,
|
||||
lance_search_p99_us: lance_search_p99,
|
||||
lance_random_row_access_us: lance_random,
|
||||
parquet_random_row_access_us: parquet_random,
|
||||
lance_append_10k_secs: lance_append_secs,
|
||||
lance_append_disk_bytes_added: lance_append_bytes,
|
||||
// From our Phase 15 trial on the SAME index (ec=80 es=30, recall=1.00):
|
||||
reference_hnsw_p50_us: 873.0,
|
||||
reference_hnsw_p95_us: 1413.0,
|
||||
reference_brute_force_us: 43983.0,
|
||||
reference_hnsw_build_secs: 230.0,
|
||||
};
|
||||
|
||||
let json = serde_json::to_string_pretty(&report)?;
|
||||
println!("{}", json);
|
||||
|
||||
eprintln!("\n=== Summary ===");
|
||||
eprintln!(" Parquet cold load: {:.2}s", report.parquet_cold_load_secs);
|
||||
eprintln!(" Lance cold open: {:.2}s ({})",
|
||||
report.lance_cold_open_secs,
|
||||
format_ratio(report.parquet_cold_load_secs, report.lance_cold_open_secs));
|
||||
eprintln!(" Parquet disk: {:.1} MB", report.parquet_disk_bytes as f64 / 1e6);
|
||||
eprintln!(" Lance disk: {:.1} MB ({})",
|
||||
report.lance_disk_bytes as f64 / 1e6,
|
||||
format_ratio(report.parquet_disk_bytes as f32, report.lance_disk_bytes as f32));
|
||||
if let (Some(p50), Some(p95)) = (report.lance_search_p50_us, report.lance_search_p95_us) {
|
||||
eprintln!(" Lance search p50: {:.0}us vs our HNSW {:.0}us ({})",
|
||||
p50, report.reference_hnsw_p50_us,
|
||||
format_ratio(report.reference_hnsw_p50_us, p50));
|
||||
eprintln!(" Lance search p95: {:.0}us vs our HNSW {:.0}us ({})",
|
||||
p95, report.reference_hnsw_p95_us,
|
||||
format_ratio(report.reference_hnsw_p95_us, p95));
|
||||
eprintln!(" Speedup vs brute force: {:.1}× (Lance) vs {:.1}× (HNSW)",
|
||||
report.reference_brute_force_us / p50,
|
||||
report.reference_brute_force_us / report.reference_hnsw_p50_us);
|
||||
}
|
||||
if let Some(build) = report.lance_index_build_secs {
|
||||
eprintln!(" Index build: {:.1}s (Lance IVF_PQ) vs {:.0}s (our HNSW ec=80) ({}× faster)",
|
||||
build, report.reference_hnsw_build_secs, report.reference_hnsw_build_secs / build);
|
||||
}
|
||||
if let (Some(lance_us), Some(parquet_us)) = (report.lance_random_row_access_us, report.parquet_random_row_access_us) {
|
||||
eprintln!(" Random row access: {:.0}us (Lance) vs {:.0}us (Parquet scan) ({})",
|
||||
lance_us, parquet_us, format_ratio(parquet_us, lance_us));
|
||||
}
|
||||
if let Some(append_secs) = report.lance_append_10k_secs {
|
||||
eprintln!(" Append 10K rows: {:.2}s (Lance native) [Parquet would require full rewrite]",
|
||||
append_secs);
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn format_ratio(baseline: f32, candidate: f32) -> String {
|
||||
if candidate == 0.0 { return "inf".into(); }
|
||||
let ratio = baseline / candidate;
|
||||
if ratio >= 1.0 {
|
||||
format!("{:.2}× faster/smaller", ratio)
|
||||
} else {
|
||||
format!("{:.2}× slower/larger", 1.0 / ratio)
|
||||
}
|
||||
}
|
||||
|
||||
fn dir_size_bytes(path: &str) -> u64 {
|
||||
fn recurse(p: &std::path::Path) -> u64 {
|
||||
let Ok(meta) = std::fs::metadata(p) else { return 0; };
|
||||
if meta.is_file() { return meta.len(); }
|
||||
let Ok(entries) = std::fs::read_dir(p) else { return 0; };
|
||||
entries
|
||||
.filter_map(|e| e.ok())
|
||||
.map(|e| recurse(&e.path()))
|
||||
.sum()
|
||||
}
|
||||
recurse(std::path::Path::new(path))
|
||||
}
|
||||
|
||||
/// Read the existing vector Parquet (binary-blob format: source, doc_id,
|
||||
/// chunk_idx, chunk_text, vector as Binary bytes).
|
||||
fn read_parquet_vectors(path: &str) -> Result<(Arc<Schema>, Vec<RecordBatch>, usize)> {
|
||||
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
|
||||
use std::fs::File;
|
||||
|
||||
let file = File::open(path).with_context(|| format!("open {path}"))?;
|
||||
let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
|
||||
let schema = builder.schema().clone();
|
||||
let reader = builder.build()?;
|
||||
let batches: Vec<RecordBatch> = reader.collect::<Result<Vec<_>, _>>()?;
|
||||
let rows: usize = batches.iter().map(|b| b.num_rows()).sum();
|
||||
Ok((schema, batches, rows))
|
||||
}
|
||||
|
||||
fn detect_vector_dims(batches: &[RecordBatch]) -> Result<usize> {
|
||||
for batch in batches {
|
||||
let vector_col_idx = batch
|
||||
.schema()
|
||||
.index_of("vector")
|
||||
.context("no 'vector' column in parquet")?;
|
||||
let col = batch.column(vector_col_idx);
|
||||
if let Some(binary) = col.as_any().downcast_ref::<BinaryArray>() {
|
||||
for i in 0..binary.len() {
|
||||
if !binary.is_null(i) {
|
||||
let bytes = binary.value(i);
|
||||
return Ok(bytes.len() / 4); // f32 = 4 bytes
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
anyhow::bail!("could not determine vector dimensions")
|
||||
}
|
||||
|
||||
/// Convert our binary-blob vector representation into Arrow's native
|
||||
/// FixedSizeList<Float32> — that's what Lance expects for vector columns.
|
||||
fn convert_to_fixed_size_list(
|
||||
schema: &Arc<Schema>,
|
||||
batches: Vec<RecordBatch>,
|
||||
dims: usize,
|
||||
) -> Result<(Arc<Schema>, Vec<RecordBatch>)> {
|
||||
// New schema keeps everything identical but replaces the vector column
|
||||
// with a FixedSizeList<Float32, dims>.
|
||||
let new_fields: Vec<Arc<Field>> = schema
|
||||
.fields()
|
||||
.iter()
|
||||
.map(|f| {
|
||||
if f.name() == "vector" {
|
||||
Arc::new(Field::new(
|
||||
"vector",
|
||||
DataType::FixedSizeList(
|
||||
Arc::new(Field::new("item", DataType::Float32, true)),
|
||||
dims as i32,
|
||||
),
|
||||
false,
|
||||
))
|
||||
} else {
|
||||
f.clone()
|
||||
}
|
||||
})
|
||||
.collect();
|
||||
let new_schema = Arc::new(Schema::new(new_fields));
|
||||
|
||||
let mut new_batches = Vec::with_capacity(batches.len());
|
||||
for batch in batches {
|
||||
let vector_idx = batch.schema().index_of("vector")?;
|
||||
let mut new_arrays: Vec<ArrayRef> = Vec::with_capacity(batch.num_columns());
|
||||
for (i, col) in batch.columns().iter().enumerate() {
|
||||
if i == vector_idx {
|
||||
let binary = col
|
||||
.as_any()
|
||||
.downcast_ref::<BinaryArray>()
|
||||
.context("vector column must be Binary")?;
|
||||
let fsl = binary_to_fixed_size_list(binary, dims)?;
|
||||
new_arrays.push(Arc::new(fsl));
|
||||
} else {
|
||||
new_arrays.push(col.clone());
|
||||
}
|
||||
}
|
||||
new_batches.push(RecordBatch::try_new(new_schema.clone(), new_arrays)?);
|
||||
}
|
||||
|
||||
Ok((new_schema, new_batches))
|
||||
}
|
||||
|
||||
fn binary_to_fixed_size_list(binary: &BinaryArray, dims: usize) -> Result<FixedSizeListArray> {
|
||||
let n = binary.len();
|
||||
let mut all_floats: Vec<f32> = Vec::with_capacity(n * dims);
|
||||
for i in 0..n {
|
||||
if binary.is_null(i) {
|
||||
all_floats.extend(std::iter::repeat(0.0).take(dims));
|
||||
continue;
|
||||
}
|
||||
let bytes = binary.value(i);
|
||||
if bytes.len() != dims * 4 {
|
||||
anyhow::bail!(
|
||||
"row {} has {} bytes, expected {} ({} × f32)",
|
||||
i, bytes.len(), dims * 4, dims,
|
||||
);
|
||||
}
|
||||
for chunk in bytes.chunks_exact(4) {
|
||||
all_floats.push(f32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]));
|
||||
}
|
||||
}
|
||||
let values = Float32Array::from(all_floats);
|
||||
let field = Arc::new(Field::new("item", DataType::Float32, true));
|
||||
FixedSizeListArray::try_new(field, dims as i32, Arc::new(values), None)
|
||||
.context("build FixedSizeListArray")
|
||||
}
|
||||
|
||||
/// Write batches into a Lance dataset at the given path.
|
||||
async fn write_lance_dataset(
|
||||
path: &str,
|
||||
schema: Arc<Schema>,
|
||||
batches: Vec<RecordBatch>,
|
||||
) -> Result<()> {
|
||||
use lance::dataset::{Dataset, WriteParams};
|
||||
|
||||
let reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema);
|
||||
Dataset::write(reader, path, Some(WriteParams::default()))
|
||||
.await
|
||||
.context("Dataset::write")?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Open a Lance dataset cold (from disk) and scan it fully — measuring the
|
||||
/// equivalent of our "load embeddings from Parquet" cost.
|
||||
async fn cold_open_and_scan_lance(path: &str) -> Result<usize> {
|
||||
use futures::StreamExt;
|
||||
use lance::dataset::Dataset;
|
||||
|
||||
let dataset = Dataset::open(path).await.context("Dataset::open")?;
|
||||
let scanner = dataset.scan();
|
||||
let mut stream = scanner.try_into_stream().await?;
|
||||
let mut total = 0usize;
|
||||
while let Some(batch) = stream.next().await {
|
||||
let batch = batch?;
|
||||
total += batch.num_rows();
|
||||
}
|
||||
Ok(total)
|
||||
}
|
||||
|
||||
/// Build an IVF_PQ vector index on the `vector` column. IVF_PQ (Inverted File
|
||||
/// with Product Quantization) is Lance's native ANN index — comparable to
|
||||
/// HNSW in intent, but on-disk and compatible with Lance's random-access
|
||||
/// model.
|
||||
async fn build_lance_vector_index(path: &str, _dims: usize) -> Result<()> {
|
||||
use lance::dataset::Dataset;
|
||||
use lance::index::vector::VectorIndexParams;
|
||||
use lance_index::{DatasetIndexExt, IndexType};
|
||||
use lance_linalg::distance::MetricType;
|
||||
|
||||
let mut dataset = Dataset::open(path).await?;
|
||||
|
||||
// IVF_PQ with ~sqrt(N) partitions is a reasonable default for 100K.
|
||||
// num_sub_vectors must divide dims evenly: 768/48 = 16 dims per subvector.
|
||||
// num_bits = 8 gives 256 codes per subvector (good recall/size trade).
|
||||
// max_iterations = 50 is plenty for this scale.
|
||||
let params = VectorIndexParams::ivf_pq(
|
||||
316, // num_partitions (~sqrt(100000))
|
||||
8, // num_bits
|
||||
48, // num_sub_vectors
|
||||
MetricType::Cosine,
|
||||
50, // max_iterations
|
||||
);
|
||||
|
||||
dataset
|
||||
.create_index(
|
||||
&["vector"],
|
||||
IndexType::Vector,
|
||||
Some("vec_idx".into()),
|
||||
¶ms,
|
||||
true,
|
||||
)
|
||||
.await
|
||||
.context("create_index")?;
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Run N vector searches against the Lance dataset and return (p50, p95, p99) latencies in us.
|
||||
/// Uses a handful of random rows as queries — same pattern as our harness::synthetic_from_chunks.
|
||||
async fn run_search_benchmarks(path: &str, _dims: usize) -> Result<(f32, f32, f32)> {
|
||||
use futures::StreamExt;
|
||||
use lance::dataset::Dataset;
|
||||
|
||||
let dataset = Dataset::open(path).await?;
|
||||
|
||||
// Pick 20 representative query vectors from the data itself.
|
||||
// (Synthetic — same pattern as our existing harness.)
|
||||
let query_vectors = sample_query_vectors(&dataset, 20).await?;
|
||||
|
||||
let mut latencies_us: Vec<f32> = Vec::with_capacity(query_vectors.len());
|
||||
for (i, qv) in query_vectors.iter().enumerate() {
|
||||
let qarr = Arc::new(Float32Array::from(qv.clone())) as ArrayRef;
|
||||
|
||||
let t0 = Instant::now();
|
||||
let mut scanner = dataset.scan();
|
||||
scanner
|
||||
.nearest("vector", qarr.as_any().downcast_ref::<Float32Array>().unwrap(), 10)
|
||||
.context("scanner.nearest")?;
|
||||
let mut stream = scanner.try_into_stream().await?;
|
||||
let mut hits = 0;
|
||||
while let Some(batch) = stream.next().await {
|
||||
let batch = batch?;
|
||||
hits += batch.num_rows();
|
||||
}
|
||||
let us = t0.elapsed().as_micros() as f32;
|
||||
latencies_us.push(us);
|
||||
if i == 0 {
|
||||
eprintln!(" first query: {} hits in {:.0}us (includes any lazy init)", hits, us);
|
||||
}
|
||||
}
|
||||
|
||||
latencies_us.sort_by(|a, b| a.partial_cmp(b).unwrap());
|
||||
let p = |pct: f32| -> f32 {
|
||||
let idx = ((latencies_us.len() as f32 - 1.0) * pct).round() as usize;
|
||||
latencies_us[idx.min(latencies_us.len() - 1)]
|
||||
};
|
||||
Ok((p(0.50), p(0.95), p(0.99)))
|
||||
}
|
||||
|
||||
/// Random row access via Lance's `take` — fetch 20 random rows by index, measure avg latency.
|
||||
async fn measure_random_access_lance(path: &str) -> Result<f32> {
|
||||
use lance::dataset::Dataset;
|
||||
let dataset = Dataset::open(path).await?;
|
||||
let n = dataset.count_rows(None).await?;
|
||||
let indices: Vec<u64> = (0..20).map(|i| ((i as u64) * (n as u64 / 23)) % (n as u64)).collect();
|
||||
|
||||
// Full-schema projection — Lance's Schema implements Into<ProjectionRequest>.
|
||||
let schema = dataset.schema().clone();
|
||||
let mut total_us: u128 = 0;
|
||||
for idx in &indices {
|
||||
let t0 = Instant::now();
|
||||
let _batch = dataset.take(&[*idx], schema.clone()).await?;
|
||||
total_us += t0.elapsed().as_micros();
|
||||
}
|
||||
Ok(total_us as f32 / indices.len() as f32)
|
||||
}
|
||||
|
||||
/// Random row access for Parquet — full scan + filter. There's no random-access
|
||||
/// primitive in vanilla Parquet, so this is the cost of finding one specific row.
|
||||
/// This is the cost our current design pays for "get doc X's full text for RAG."
|
||||
fn measure_random_access_parquet(path: &str) -> Result<f32> {
|
||||
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
|
||||
use std::fs::File;
|
||||
|
||||
// We simulate 5 lookups — full scan each time. 20 would be painful.
|
||||
let iters = 5;
|
||||
let mut total_us: u128 = 0;
|
||||
for _ in 0..iters {
|
||||
let t0 = Instant::now();
|
||||
let file = File::open(path)?;
|
||||
let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
|
||||
let reader = builder.build()?;
|
||||
// Iterate until we've conceptually found a row — we stop early if
|
||||
// we wanted row 50000, but we have to at least read its batch.
|
||||
let mut seen = 0usize;
|
||||
for b in reader {
|
||||
let b = b?;
|
||||
seen += b.num_rows();
|
||||
if seen > 50000 { break; }
|
||||
}
|
||||
total_us += t0.elapsed().as_micros();
|
||||
}
|
||||
Ok(total_us as f32 / iters as f32)
|
||||
}
|
||||
|
||||
/// Append 10K new rows to the existing Lance dataset.
|
||||
/// Measures the "ingest delta" cost without full rewrite.
|
||||
async fn append_10k_rows(path: &str, dims: usize) -> Result<()> {
|
||||
use lance::dataset::{Dataset, WriteMode, WriteParams};
|
||||
|
||||
let dataset = Dataset::open(path).await?;
|
||||
let schema = dataset.schema();
|
||||
let arrow_schema: Arc<Schema> = Arc::new(schema.into());
|
||||
|
||||
// Build a 10K row batch with random-ish data matching the existing schema.
|
||||
let n = 10_000;
|
||||
let arrays: Vec<ArrayRef> = arrow_schema
|
||||
.fields()
|
||||
.iter()
|
||||
.map(|f| -> Result<ArrayRef> {
|
||||
match f.data_type() {
|
||||
DataType::Utf8 => {
|
||||
let vals: Vec<String> = (0..n).map(|i| format!("appended-{}", i)).collect();
|
||||
Ok(Arc::new(arrow_array::StringArray::from(vals)))
|
||||
}
|
||||
DataType::Int32 => {
|
||||
let vals: Vec<i32> = (0..n as i32).collect();
|
||||
Ok(Arc::new(arrow_array::Int32Array::from(vals)))
|
||||
}
|
||||
DataType::FixedSizeList(_, _) => {
|
||||
let floats: Vec<f32> = (0..n * dims).map(|i| (i as f32).sin()).collect();
|
||||
let values = Float32Array::from(floats);
|
||||
let field = Arc::new(Field::new("item", DataType::Float32, true));
|
||||
let fsl = FixedSizeListArray::try_new(field, dims as i32, Arc::new(values), None)?;
|
||||
Ok(Arc::new(fsl))
|
||||
}
|
||||
other => anyhow::bail!("unsupported append column type: {:?}", other),
|
||||
}
|
||||
})
|
||||
.collect::<Result<Vec<_>>>()?;
|
||||
|
||||
let batch = RecordBatch::try_new(arrow_schema.clone(), arrays)?;
|
||||
let reader = RecordBatchIterator::new(vec![Ok(batch)].into_iter(), arrow_schema);
|
||||
let params = WriteParams { mode: WriteMode::Append, ..Default::default() };
|
||||
Dataset::write(reader, path, Some(params)).await?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Grab a few existing vectors from the dataset to use as self-similar queries.
|
||||
async fn sample_query_vectors(
|
||||
dataset: &lance::dataset::Dataset,
|
||||
count: usize,
|
||||
) -> Result<Vec<Vec<f32>>> {
|
||||
use futures::StreamExt;
|
||||
|
||||
// Just take the first `count` rows; good enough for latency measurement.
|
||||
let scanner = dataset.scan();
|
||||
let mut scanner = scanner;
|
||||
scanner.limit(Some(count as i64), None)?;
|
||||
scanner.project(&["vector"])?;
|
||||
let mut stream = scanner.try_into_stream().await?;
|
||||
|
||||
let mut out = Vec::with_capacity(count);
|
||||
while let Some(batch) = stream.next().await {
|
||||
let batch = batch?;
|
||||
let vector_col = batch
|
||||
.column(0)
|
||||
.as_any()
|
||||
.downcast_ref::<FixedSizeListArray>()
|
||||
.context("vector column must be FixedSizeList")?;
|
||||
|
||||
for row in 0..vector_col.len() {
|
||||
if out.len() >= count { break; }
|
||||
let values = vector_col.value(row);
|
||||
let f32_arr = values
|
||||
.as_any()
|
||||
.downcast_ref::<Float32Array>()
|
||||
.context("inner array must be Float32")?;
|
||||
let mut v = Vec::with_capacity(f32_arr.len());
|
||||
for i in 0..f32_arr.len() {
|
||||
v.push(f32_arr.value(i));
|
||||
}
|
||||
out.push(v);
|
||||
}
|
||||
if out.len() >= count { break; }
|
||||
}
|
||||
Ok(out)
|
||||
}
|
||||
|
||||
105
docs/ADR-019-vector-storage.md
Normal file
105
docs/ADR-019-vector-storage.md
Normal file
@ -0,0 +1,105 @@
|
||||
# ADR-019: Vector Storage — Parquet+HNSW stays, Lance joins as second tier
|
||||
|
||||
**Status:** Accepted — 2026-04-16
|
||||
**Implements:** Phase 18 from PRD (Lance evaluation)
|
||||
**Supersedes:** nothing (augments ADR-008)
|
||||
**Owner:** J
|
||||
|
||||
---
|
||||
|
||||
## Context
|
||||
|
||||
Phase 18 of the PRD committed to settling "Parquet+sidecar vs Lance" with measurements, not vibes. This ADR records the benchmark outcome and the resulting architectural direction.
|
||||
|
||||
Input data: `data/vectors/resumes_100k_v2.parquet` — 100,000 × 768d embeddings, the same index we tuned HNSW against in Phase 15.
|
||||
|
||||
Benchmark harness: `crates/lance-bench/src/main.rs` — standalone binary, deliberately not integrated into the workspace's common deps to avoid forcing DataFusion/Arrow upgrades on the rest of the stack until we'd decided.
|
||||
|
||||
## The scorecard
|
||||
|
||||
All numbers measured on the same 128GB server, same 100K × 768d index, release build:
|
||||
|
||||
| Dimension | Parquet + HNSW (current) | Lance 4.0 IVF_PQ (candidate) | Winner |
|
||||
|---|---|---|---|
|
||||
| Cold load | 0.17s | 0.13s | Lance, 1.27× — *does not clear 2× decision threshold* |
|
||||
| Disk size (data only) | 330.3 MB | 330.4 MB | Tie |
|
||||
| Index on-disk footprint | 0 (HNSW is RAM-only) | 7.4 MB | Lance |
|
||||
| Index build time | 230s (ec=80 es=30) | 16s | **Lance, 14× faster** |
|
||||
| Search p50 | 873us (recall@10 = 1.00) | 2229us (recall unmeasured, likely 0.85-0.95) | **Parquet+HNSW, 2.55× faster** |
|
||||
| Search p95 | 1413us | 4998us | **Parquet+HNSW, 3.54× faster** |
|
||||
| Speedup vs brute force (p50) | 50.4× | 19.7× | Parquet+HNSW |
|
||||
| Random row access (fetch by id) | ~35ms (full-file scan) | 311us | **Lance, 112× faster** |
|
||||
| Append 10K rows | Full-file rewrite (~330MB + re-embed + re-index) | 0.08s, +31MB delta | **Lance, structurally different** |
|
||||
|
||||
## Applying the decision rules from EXECUTION_PLAN.md
|
||||
|
||||
Original rules:
|
||||
- *Lance wins cold-load by ≥2× AND matches search latency → migrate*
|
||||
- *Within 50% across board → stay Parquet, document ceiling*
|
||||
- *Lance loses → close the door*
|
||||
|
||||
Strict reading: cold-load is **1.27×, not ≥2×**. Search latency is **2.55× worse, not matching**. By the written rule, we stay.
|
||||
|
||||
But the written rule missed something. It assumed Lance's value would show up as raw-speed wins across the whole table. The actual benchmark reveals Lance's value is **in capabilities the current stack doesn't have**, not in the metrics we scoped:
|
||||
|
||||
1. **Random row access** is 112× faster. Our Parquet design can't do O(1) random access to a row — RAG text retrieval is a full-file scan today. Lance makes this native.
|
||||
2. **Append** is structurally different. Adding 10K rows is 0.08s on Lance; on our stack it's a full rewrite of the entire 330MB Parquet file plus re-embedding plus re-indexing.
|
||||
3. **Index build** is 14× faster. The HNSW `ec=80 es=30` production default takes 230s; Lance IVF_PQ takes 16s. Hot-swap generation (Phase 16) is much more feasible at 16s per build.
|
||||
|
||||
## The decision
|
||||
|
||||
**Hybrid architecture — neither replace nor reject.**
|
||||
|
||||
### What stays
|
||||
|
||||
- `vectord::store` with Parquet + binary-blob vectors → **primary vector backend**
|
||||
- `vectord::hnsw::HnswStore` → in-RAM HNSW for search at 100K-scale indexes
|
||||
- All Phase 15 trial infrastructure → keeps working, unchanged
|
||||
- Production default `ec=80 es=30` → still the right call for in-RAM use
|
||||
|
||||
### What gets added
|
||||
|
||||
- **`vectord::lance_store`** — second backend using Lance as the persistence layer
|
||||
- Scope: indexes where *any* of the following apply:
|
||||
- Corpus exceeds ~5M vectors (our in-RAM ceiling)
|
||||
- Workload is append-heavy (incremental ingest from streaming sources)
|
||||
- Text retrieval dominates (point lookups by doc_id for RAG)
|
||||
- Hot-swap generations are required (Phase 16)
|
||||
- Implemented as a standalone crate first (follow the pilot layout), promoted into vectord when the API stabilizes
|
||||
- **Profile-level configuration** — `ModelProfile.vector_backend: Parquet | Lance` so each profile picks the tier that matches its workload
|
||||
|
||||
### What we keep watching (but don't act on yet)
|
||||
|
||||
- **Lance search latency at scale.** 2229us at 100K is worse than HNSW. At 10M we expect Lance to pull ahead because HNSW doesn't fit in RAM. Re-benchmark when we have a 10M-vector corpus to test against.
|
||||
- **IVF_PQ recall.** We measured latency but not recall — I picked `num_partitions=316, nbits=8, num_sub_vectors=48` blindly. A proper recall sweep is part of Phase C when we integrate Lance into the trial system.
|
||||
- **Lance's own HNSW-on-disk variant** (`with_ivf_hnsw_pq_params`). Might close the in-RAM latency gap. Left for a future pilot.
|
||||
|
||||
## Why this isn't moving the goalposts
|
||||
|
||||
The EXECUTION_PLAN rule was "migrate or don't migrate." The evidence says neither is correct — one stack can't serve both the staffing SQL workload AND the LLM-brain append-heavy random-access workload at all scales. The honest answer is two backends, each doing what it's good at, selected per-profile.
|
||||
|
||||
This matches the dual-use framing in the 2026-04-16 PRD update: different workloads, shared substrate, per-profile specialization. We wrote that principle into the PRD; the benchmark data just made it concrete for the vector tier.
|
||||
|
||||
## Follow-up work (updates EXECUTION_PLAN.md)
|
||||
|
||||
- **Phase C (decoupled embedding refresh)** gets easier — Lance's native append removes the need to invent a "vectors delta" Parquet layer. When we build Phase C, use Lance as the embedding-layer backend.
|
||||
- **Phase 16 (hot-swap)** becomes feasible — 16s index builds mean online re-trials are cheap. When we build Phase 16, Lance is the storage for index generations.
|
||||
- **Phase 17 (model profiles)** gains a new field: `vector_backend: Parquet | Lance`. Default Parquet for backward compatibility. Agents can opt into Lance.
|
||||
|
||||
## Costs we accept
|
||||
|
||||
- **Second dependency tree.** Lance pulls in DataFusion 52 and Arrow 57, while our main stack runs DataFusion 47 and Arrow 55. Keeping lance-bench isolated works for a pilot; productionizing will need either workspace-wide upgrade or a firewall via a dedicated `vectord-lance` crate.
|
||||
- **Second API surface.** Lance's vector-index API is different from our HNSW code. Per-profile abstraction cost is real.
|
||||
- **Operational complexity.** Two vector storage implementations to debug and monitor.
|
||||
|
||||
Worth it because the alternative — forcing every workload through one backend — means either the staffing case or the LLM-brain case is served badly.
|
||||
|
||||
## Ceilings this updates in PRD
|
||||
|
||||
The PRD "Known ceilings" table had:
|
||||
|
||||
> Vector count per index | ~5M vectors on 128GB RAM | 10M+ (serious web crawl) | Phase 18 Lance migration OR mmap'd embeddings
|
||||
|
||||
Update to:
|
||||
|
||||
> Vector count per index | ~5M vectors on 128GB RAM (Parquet+HNSW in-RAM) | Past 5M | Switch that profile's `vector_backend` to Lance; IVF_PQ keeps working on disk-resident quantized codes
|
||||
@ -89,3 +89,8 @@
|
||||
**Date:** 2026-04-16
|
||||
**Decision:** All append-only journals (error journal, HNSW trial journal, future audit logs) use the `storaged::append_log::AppendLog` helper. Events accumulate in an in-memory buffer; on threshold or explicit `flush()`, the buffer is written as one new timestamped file (`batch_{epoch_us}.jsonl`). Existing files are never rewritten. `compact()` merges all batches into one with a fresh timestamp, preserving chronological sort order.
|
||||
**Rationale:** Object stores have no append primitive. Naive "read-modify-write the whole JSONL file on every event" is O(N²) cumulative work and creates the classic small-file / rewrite-amplification anti-pattern that llms3.com flags as the top lakehouse pitfall. Write-once batching is the LSM-tree idea applied to small JSONL events — bounded write amplification, append-only semantics, optional compaction for read efficiency. The in-memory ring buffer preserves O(1) recent-event reads for the `/storage/errors` and `/hnsw/trials` query endpoints.
|
||||
|
||||
## ADR-019: Vector storage — Parquet+HNSW primary, Lance secondary (hybrid)
|
||||
**Date:** 2026-04-16
|
||||
**Decision:** Keep Parquet + binary-blob vectors + in-RAM HNSW as the primary vector backend. Add Lance as a second backend available per-profile for workloads where Lance wins architecturally. Per-profile `vector_backend: Parquet | Lance` field becomes part of Phase 17 model profiles. Implementation kicks off via the standalone `crates/lance-bench` crate and is promoted into `vectord::lance_store` when the API stabilizes.
|
||||
**Rationale:** Head-to-head benchmark on the 100K × 768d `resumes_100k_v2` index (see `docs/ADR-019-vector-storage.md` for the full scorecard). Parquet+HNSW wins current-scale search latency by 2.55× (873us vs 2229us p50). Lance wins index build time by 14× (16s vs 230s), random row access by 112× (311us vs ~35ms full-file scan), and append speed structurally (0.08s vs full Parquet rewrite). Neither strictly dominates — the dual-use PRD framing (staffing + LLM brain) means both workloads exist in the same system. Keeps ADR-008's "Parquet is the format" principle intact for dataset tables; adds Lance as a purpose-built vector-tier option without discarding the tuned HNSW stack.
|
||||
|
||||
15
docs/PRD.md
15
docs/PRD.md
@ -340,14 +340,13 @@ The question raised 2026-04-16 after J's LLMS3 knowledge base identified Lance a
|
||||
|
||||
| Step | Deliverable | Decision criteria |
|
||||
|---|---|---|
|
||||
| 18.1 | Parallel Lance-backed vector index for `resumes_100k_v2` behind feature flag | Both implementations coexist, benchmarkable |
|
||||
| 18.2 | Head-to-head benchmark: cold-load, search latency, disk size, append cost | See criteria below |
|
||||
| 18.3 | ADR-019 documenting the decision with measured data | Commit or reject with evidence |
|
||||
| 18.1 | ✅ Parallel Lance-backed vector index for `resumes_100k_v2` in standalone `crates/lance-bench` | Built 2026-04-16 |
|
||||
| 18.2 | ✅ Head-to-head benchmark across 8 dimensions (cold-load, search latency, disk, index build, random access, append) | Complete |
|
||||
| 18.3 | ✅ ADR-019 committed with measured data and decision | See `docs/ADR-019-vector-storage.md` |
|
||||
|
||||
**Decision rules:**
|
||||
- Lance wins on cold-load by ≥2× AND matches search latency → migrate vector layer to Lance. Dataset Parquet stays.
|
||||
- Lance is within 50% of current → stay on current stack, document ceiling explicitly.
|
||||
- Lance loses → close the door, move on.
|
||||
**Outcome:** Hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance joins as a second backend for Phase 16 hot-swap (14× faster index builds), Phase C/append workloads (0.08s vs full rewrite), RAG random-access retrieval (112× faster), and indexes past the ~5M RAM ceiling.
|
||||
|
||||
Per-profile `vector_backend: Parquet | Lance` becomes part of Phase 17 (model profiles). See ADR-019 for the full scorecard and caveats.
|
||||
|
||||
### Phase 19+: Further horizon
|
||||
|
||||
@ -364,7 +363,7 @@ The current stack has measurable limits. Documenting them so future decisions ar
|
||||
|
||||
| Dimension | Current ceiling | Breaks at | Escape hatch |
|
||||
|---|---|---|---|
|
||||
| Vector count per index | ~5M vectors on 128GB RAM | 10M+ (serious web crawl) | Phase 18 Lance migration OR mmap'd embeddings |
|
||||
| Vector count per index (Parquet+HNSW in-RAM) | ~5M on 128GB | Past 5M | Switch that profile's `vector_backend` to Lance per ADR-019 — IVF_PQ stays on disk-resident quantized codes |
|
||||
| Concurrent active indexes | ~50-100 at 100K vectors each | 10M×50 configurations | Lance disk-resident + per-profile activation |
|
||||
| Rows per dataset | 2.47M proven, probably 100M+ fine | Approaches DataFusion memory limits | DataFusion predicate pushdown + partition pruning (existing) |
|
||||
| Concurrent loaded models | 1-2 on 16GB VRAM (A4000) | 3+ models simultaneous | Not our problem — architectural, driven by Ollama |
|
||||
|
||||
Loading…
x
Reference in New Issue
Block a user