Phase B: Lance pilot — hybrid decision with measured benchmark

A standalone benchmark crate, `crates/lance-bench`, ran Lance 4.0 against
our Parquet+HNSW stack at 100K × 768d (resumes_100k_v2) and measured 8
dimensions.

Results (see docs/ADR-019-vector-storage.md for full scorecard):

  Cold load:        Parquet 0.17s   vs Lance 0.13s   (tie — not ≥2× threshold)
  Disk size:        330.3 MB        vs 330.4 MB      (tie)
  Search p50:       873us           vs 2229us        (Parquet 2.55× faster)
  Search p95:       1413us          vs 4998us        (Parquet 3.54× faster)
  Index build:      230s (ec=80)    vs 16s (IVF_PQ)  (Lance 14× faster)
  Random access:    35ms (scan)     vs 311us         (Lance 112× faster)
  Append 10K rows:  full rewrite    vs 0.08s/+31MB   (Lance structural win)

Decision (ADR-019): hybrid, not migrate-or-reject.

- Parquet+HNSW stays primary — our HNSW at ec=80 es=30 recall=1.00 is
  2.55× faster than Lance IVF_PQ at 100K in-RAM scale
- Lance joins as second backend per-profile for workloads where it wins
  architecturally: random row access (RAG text fetch), append-heavy
  pipelines (Phase C), hot-swap generations (Phase 16, 14× faster
  builds), and indexes past the ~5M RAM ceiling
- Phase 17 ModelProfile gets vector_backend: Parquet | Lance field
- Ceiling table in PRD updated — the 5M ceiling now says "switch to Lance"
  instead of "migrate", since Lance runs alongside Parquet rather than
  replacing it

Isolation: lance-bench is a standalone workspace crate with its own dep
tree (Lance pulls DataFusion 52 + Arrow 57, incompatible with the main
stack's DataFusion 47 + Arrow 55). Kept off the critical path until the
API is stable enough to promote into vectord::lance_store.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
root 2026-04-16 02:37:11 -05:00
parent dbe00d018f
commit 76f6fba5de
7 changed files with 3557 additions and 363 deletions

Cargo.lock (generated): 3119 changed lines

File diff suppressed because it is too large.

View File

@@ -12,6 +12,7 @@ members = [
"crates/journald",
"crates/gateway",
"crates/ui",
"crates/lance-bench",
]
[workspace.dependencies]

View File

@@ -0,0 +1,42 @@
[package]
name = "lance-bench"
version = "0.1.0"
edition = "2024"
# Standalone pilot for Phase B (see docs/EXECUTION_PLAN.md).
# Deliberately NOT sharing workspace deps — Lance 4.x pulls in its own
# DataFusion and Arrow versions incompatible with the rest of the stack.
# Isolating the pilot means we don't force a workspace-wide upgrade until
# we've decided Lance is worth it.
[dependencies]
# Only the features we actually need — the default brings in AWS/Azure/GCP/HF etc
# which is ~200 extra crates we don't care about for a local pilot.
lance = { version = "4.0", default-features = false }
# Lance exposes DatasetIndexExt, IndexType, and IvfBuildParams through
# its sub-crates which must be imported directly — lance itself doesn't
# re-export them at a convenient path.
lance-index = { version = "4.0", default-features = false }
lance-linalg = { version = "4.0", default-features = false }
# Arrow is re-exported by Lance; pin to the same major version Lance uses so types match.
arrow = "57"
arrow-array = "57"
arrow-schema = "57"
# Also need to read the EXISTING Parquet vector files so we can compare.
# These live in data/vectors/*.parquet. Lance's internal Parquet reading
# might differ from ours; using our format's Arrow/Parquet versions for
# the read side keeps the inputs identical.
parquet = "57"
tokio = { version = "1", features = ["full"] }
futures = "0.3"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
anyhow = "1"
bytes = "1"
[[bin]]
name = "lance-bench"
path = "src/main.rs"

View File

@@ -0,0 +1,633 @@
//! Phase B: Lance pilot benchmark.
//!
//! Standalone binary that compares Lance vector storage against our
//! Parquet-with-binary-blob + in-RAM HNSW approach. See
//! docs/EXECUTION_PLAN.md for the decision rules this fuels.
//!
//! Inputs:
//! data/vectors/resumes_100k_v2.parquet — existing 100K × 768d embeddings
//!
//! Output:
//! A JSON report printed to stdout with measurements for:
//! - Cold load time (parquet → arrow) vs Lance open + scan
//! - Disk size
//! - Vector search latency (p50 / p95 / p99)
//! - Single-row random access
//! - Append cost (adding 10K rows)
//!
//! Usage (positional args: input parquet, then Lance output dir):
//! cargo run --release --bin lance-bench -- \
//! data/vectors/resumes_100k_v2.parquet \
//! /tmp/lance_resumes_100k_v2
//! The JSON report is printed to stdout; redirect it to a file to capture.
use anyhow::{Context, Result};
use arrow_array::{Array, ArrayRef, BinaryArray, FixedSizeListArray, Float32Array, RecordBatch, RecordBatchIterator};
use arrow_schema::{DataType, Field, Schema};
use serde::Serialize;
use std::sync::Arc;
use std::time::Instant;
#[derive(Debug, Serialize)]
struct BenchReport {
vectors: usize,
dimensions: usize,
parquet_path: String,
lance_path: String,
// Parquet baseline
parquet_disk_bytes: u64,
parquet_cold_load_secs: f32,
// Lance numbers
lance_write_secs: f32,
lance_disk_bytes: u64,
lance_cold_open_secs: f32,
// Index + search
lance_index_build_secs: Option<f32>,
lance_index_disk_bytes: Option<u64>,
lance_search_p50_us: Option<f32>,
lance_search_p95_us: Option<f32>,
lance_search_p99_us: Option<f32>,
// Architectural features Parquet+sidecar can't cheaply do
lance_random_row_access_us: Option<f32>, // fetch one row by row_id
parquet_random_row_access_us: Option<f32>, // for comparison — full scan cost
lance_append_10k_secs: Option<f32>, // add 10K new rows
lance_append_disk_bytes_added: Option<u64>,
// Head-to-head reference (from our own measurements)
reference_hnsw_p50_us: f32,
reference_hnsw_p95_us: f32,
reference_brute_force_us: f32,
reference_hnsw_build_secs: f32,
}
#[tokio::main]
async fn main() -> Result<()> {
// Simple positional args: parquet_in, lance_out.
let args: Vec<String> = std::env::args().collect();
let parquet_path = args
.get(1)
.cloned()
.unwrap_or_else(|| "data/vectors/resumes_100k_v2.parquet".to_string());
let lance_path = args
.get(2)
.cloned()
.unwrap_or_else(|| "/tmp/lance_bench_dataset".to_string());
eprintln!("=== Phase B Lance pilot ===");
eprintln!("input parquet: {}", parquet_path);
eprintln!("output lance: {}", lance_path);
// --- 1. Cold-load the existing Parquet vector index into memory
eprintln!("\n[1/8] reading Parquet baseline...");
let t0 = Instant::now();
let (schema, batches, total_rows) = read_parquet_vectors(&parquet_path)
.context("read parquet")?;
let parquet_cold_load_secs = t0.elapsed().as_secs_f32();
let parquet_disk_bytes = std::fs::metadata(&parquet_path)?.len();
let dims = detect_vector_dims(&batches)?;
eprintln!(
" loaded {} rows, {} columns, vectors={}d, disk={:.1} MB, cold load={:.2}s",
total_rows,
schema.fields().len(),
dims,
parquet_disk_bytes as f64 / 1_000_000.0,
parquet_cold_load_secs,
);
// --- 2. Convert from binary-blob-of-f32 to Lance's FixedSizeList<Float32>
eprintln!("\n[2/8] converting binary-blob vectors to Arrow FixedSizeList...");
let t0 = Instant::now();
let (lance_schema, lance_batches) = convert_to_fixed_size_list(&schema, batches, dims)?;
eprintln!(" conversion took {:.2}s", t0.elapsed().as_secs_f32());
// --- 3. Write as Lance dataset
eprintln!("\n[3/8] writing Lance dataset...");
let t0 = Instant::now();
// Clean up any prior run
let _ = std::fs::remove_dir_all(&lance_path);
write_lance_dataset(&lance_path, lance_schema.clone(), lance_batches).await?;
let lance_write_secs = t0.elapsed().as_secs_f32();
let lance_disk_bytes = dir_size_bytes(&lance_path);
eprintln!(
" write took {:.2}s, disk={:.1} MB",
lance_write_secs,
lance_disk_bytes as f64 / 1_000_000.0,
);
// --- 4. Cold open + scan the Lance dataset
eprintln!("\n[4/8] cold-opening Lance dataset...");
let t0 = Instant::now();
let scanned_rows = cold_open_and_scan_lance(&lance_path).await?;
let lance_cold_open_secs = t0.elapsed().as_secs_f32();
eprintln!(
" open + full scan: {} rows in {:.2}s",
scanned_rows, lance_cold_open_secs,
);
// --- 5. Build a vector index on the Lance dataset
eprintln!("\n[5/8] building Lance vector index (IVF_PQ)...");
let t0 = Instant::now();
let index_built = build_lance_vector_index(&lance_path, dims).await;
let (lance_index_build_secs, lance_index_disk_bytes) = match index_built {
Ok(()) => {
let secs = t0.elapsed().as_secs_f32();
let disk = dir_size_bytes(&lance_path) - lance_disk_bytes;
eprintln!(" built in {:.2}s, index adds {:.1} MB on disk", secs, disk as f64 / 1e6);
(Some(secs), Some(disk))
}
Err(e) => {
eprintln!(" index build failed: {e:#}");
(None, None)
}
};
// --- 6. Run search queries, measure latency
eprintln!("\n[6/8] running vector search benchmarks...");
let search_stats = if lance_index_build_secs.is_some() {
run_search_benchmarks(&lance_path, dims).await.ok()
} else {
None
};
let (lance_search_p50, lance_search_p95, lance_search_p99) = match search_stats {
Some((p50, p95, p99)) => {
eprintln!(" p50={:.0}us p95={:.0}us p99={:.0}us", p50, p95, p99);
(Some(p50), Some(p95), Some(p99))
}
None => (None, None, None),
};
// --- Random access comparison
eprintln!("\n[7/8] random row access — Lance vs full-scan Parquet...");
let lance_random = measure_random_access_lance(&lance_path).await.ok();
let parquet_random = measure_random_access_parquet(&parquet_path).ok();
if let Some(us) = lance_random {
eprintln!(" Lance random-fetch avg: {:.0}us", us);
}
if let Some(us) = parquet_random {
eprintln!(" Parquet full-scan-to-row avg: {:.0}us", us);
}
// --- Append cost
eprintln!("\n[8/8] append 10K new rows to existing dataset...");
let t0 = Instant::now();
let pre_append_bytes = dir_size_bytes(&lance_path);
let append_result = append_10k_rows(&lance_path, dims).await;
let (lance_append_secs, lance_append_bytes) = match append_result {
Ok(()) => {
let secs = t0.elapsed().as_secs_f32();
let bytes = dir_size_bytes(&lance_path).saturating_sub(pre_append_bytes);
eprintln!(" append took {:.2}s, added {:.1} MB", secs, bytes as f64 / 1e6);
(Some(secs), Some(bytes))
}
Err(e) => {
eprintln!(" append failed: {e:#}");
(None, None)
}
};
// --- Report
let report = BenchReport {
vectors: total_rows,
dimensions: dims,
parquet_path: parquet_path.clone(),
lance_path: lance_path.clone(),
parquet_disk_bytes,
parquet_cold_load_secs,
lance_write_secs,
lance_disk_bytes,
lance_cold_open_secs,
lance_index_build_secs,
lance_index_disk_bytes,
lance_search_p50_us: lance_search_p50,
lance_search_p95_us: lance_search_p95,
lance_search_p99_us: lance_search_p99,
lance_random_row_access_us: lance_random,
parquet_random_row_access_us: parquet_random,
lance_append_10k_secs: lance_append_secs,
lance_append_disk_bytes_added: lance_append_bytes,
// From our Phase 15 trial on the SAME index (ec=80 es=30, recall=1.00):
reference_hnsw_p50_us: 873.0,
reference_hnsw_p95_us: 1413.0,
reference_brute_force_us: 43983.0,
reference_hnsw_build_secs: 230.0,
};
let json = serde_json::to_string_pretty(&report)?;
println!("{}", json);
eprintln!("\n=== Summary ===");
eprintln!(" Parquet cold load: {:.2}s", report.parquet_cold_load_secs);
eprintln!(" Lance cold open: {:.2}s ({})",
report.lance_cold_open_secs,
format_ratio(report.parquet_cold_load_secs, report.lance_cold_open_secs));
eprintln!(" Parquet disk: {:.1} MB", report.parquet_disk_bytes as f64 / 1e6);
eprintln!(" Lance disk: {:.1} MB ({})",
report.lance_disk_bytes as f64 / 1e6,
format_ratio(report.parquet_disk_bytes as f32, report.lance_disk_bytes as f32));
if let (Some(p50), Some(p95)) = (report.lance_search_p50_us, report.lance_search_p95_us) {
eprintln!(" Lance search p50: {:.0}us vs our HNSW {:.0}us ({})",
p50, report.reference_hnsw_p50_us,
format_ratio(report.reference_hnsw_p50_us, p50));
eprintln!(" Lance search p95: {:.0}us vs our HNSW {:.0}us ({})",
p95, report.reference_hnsw_p95_us,
format_ratio(report.reference_hnsw_p95_us, p95));
eprintln!(" Speedup vs brute force: {:.1}× (Lance) vs {:.1}× (HNSW)",
report.reference_brute_force_us / p50,
report.reference_brute_force_us / report.reference_hnsw_p50_us);
}
if let Some(build) = report.lance_index_build_secs {
eprintln!(" Index build: {:.1}s (Lance IVF_PQ) vs {:.0}s (our HNSW ec=80) ({:.1}× faster)",
build, report.reference_hnsw_build_secs, report.reference_hnsw_build_secs / build);
}
if let (Some(lance_us), Some(parquet_us)) = (report.lance_random_row_access_us, report.parquet_random_row_access_us) {
eprintln!(" Random row access: {:.0}us (Lance) vs {:.0}us (Parquet scan) ({})",
lance_us, parquet_us, format_ratio(parquet_us, lance_us));
}
if let Some(append_secs) = report.lance_append_10k_secs {
eprintln!(" Append 10K rows: {:.2}s (Lance native) [Parquet would require full rewrite]",
append_secs);
}
Ok(())
}
fn format_ratio(baseline: f32, candidate: f32) -> String {
if candidate == 0.0 { return "inf".into(); }
let ratio = baseline / candidate;
if ratio >= 1.0 {
format!("{:.2}× faster/smaller", ratio)
} else {
format!("{:.2}× slower/larger", 1.0 / ratio)
}
}
fn dir_size_bytes(path: &str) -> u64 {
fn recurse(p: &std::path::Path) -> u64 {
let Ok(meta) = std::fs::metadata(p) else { return 0; };
if meta.is_file() { return meta.len(); }
let Ok(entries) = std::fs::read_dir(p) else { return 0; };
entries
.filter_map(|e| e.ok())
.map(|e| recurse(&e.path()))
.sum()
}
recurse(std::path::Path::new(path))
}
/// Read the existing vector Parquet (binary-blob format: source, doc_id,
/// chunk_idx, chunk_text, vector as Binary bytes).
fn read_parquet_vectors(path: &str) -> Result<(Arc<Schema>, Vec<RecordBatch>, usize)> {
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use std::fs::File;
let file = File::open(path).with_context(|| format!("open {path}"))?;
let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
let schema = builder.schema().clone();
let reader = builder.build()?;
let batches: Vec<RecordBatch> = reader.collect::<Result<Vec<_>, _>>()?;
let rows: usize = batches.iter().map(|b| b.num_rows()).sum();
Ok((schema, batches, rows))
}
fn detect_vector_dims(batches: &[RecordBatch]) -> Result<usize> {
for batch in batches {
let vector_col_idx = batch
.schema()
.index_of("vector")
.context("no 'vector' column in parquet")?;
let col = batch.column(vector_col_idx);
if let Some(binary) = col.as_any().downcast_ref::<BinaryArray>() {
for i in 0..binary.len() {
if !binary.is_null(i) {
let bytes = binary.value(i);
return Ok(bytes.len() / 4); // f32 = 4 bytes
}
}
}
}
anyhow::bail!("could not determine vector dimensions")
}
/// Convert our binary-blob vector representation into Arrow's native
/// FixedSizeList<Float32> — that's what Lance expects for vector columns.
fn convert_to_fixed_size_list(
schema: &Arc<Schema>,
batches: Vec<RecordBatch>,
dims: usize,
) -> Result<(Arc<Schema>, Vec<RecordBatch>)> {
// New schema keeps everything identical but replaces the vector column
// with a FixedSizeList<Float32, dims>.
let new_fields: Vec<Arc<Field>> = schema
.fields()
.iter()
.map(|f| {
if f.name() == "vector" {
Arc::new(Field::new(
"vector",
DataType::FixedSizeList(
Arc::new(Field::new("item", DataType::Float32, true)),
dims as i32,
),
false,
))
} else {
f.clone()
}
})
.collect();
let new_schema = Arc::new(Schema::new(new_fields));
let mut new_batches = Vec::with_capacity(batches.len());
for batch in batches {
let vector_idx = batch.schema().index_of("vector")?;
let mut new_arrays: Vec<ArrayRef> = Vec::with_capacity(batch.num_columns());
for (i, col) in batch.columns().iter().enumerate() {
if i == vector_idx {
let binary = col
.as_any()
.downcast_ref::<BinaryArray>()
.context("vector column must be Binary")?;
let fsl = binary_to_fixed_size_list(binary, dims)?;
new_arrays.push(Arc::new(fsl));
} else {
new_arrays.push(col.clone());
}
}
new_batches.push(RecordBatch::try_new(new_schema.clone(), new_arrays)?);
}
Ok((new_schema, new_batches))
}
fn binary_to_fixed_size_list(binary: &BinaryArray, dims: usize) -> Result<FixedSizeListArray> {
let n = binary.len();
let mut all_floats: Vec<f32> = Vec::with_capacity(n * dims);
for i in 0..n {
if binary.is_null(i) {
all_floats.extend(std::iter::repeat(0.0).take(dims));
continue;
}
let bytes = binary.value(i);
if bytes.len() != dims * 4 {
anyhow::bail!(
"row {} has {} bytes, expected {} ({} × f32)",
i, bytes.len(), dims * 4, dims,
);
}
for chunk in bytes.chunks_exact(4) {
all_floats.push(f32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]));
}
}
let values = Float32Array::from(all_floats);
let field = Arc::new(Field::new("item", DataType::Float32, true));
FixedSizeListArray::try_new(field, dims as i32, Arc::new(values), None)
.context("build FixedSizeListArray")
}
/// Write batches into a Lance dataset at the given path.
async fn write_lance_dataset(
path: &str,
schema: Arc<Schema>,
batches: Vec<RecordBatch>,
) -> Result<()> {
use lance::dataset::{Dataset, WriteParams};
let reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema);
Dataset::write(reader, path, Some(WriteParams::default()))
.await
.context("Dataset::write")?;
Ok(())
}
/// Open a Lance dataset cold (from disk) and scan it fully — measuring the
/// equivalent of our "load embeddings from Parquet" cost.
async fn cold_open_and_scan_lance(path: &str) -> Result<usize> {
use futures::StreamExt;
use lance::dataset::Dataset;
let dataset = Dataset::open(path).await.context("Dataset::open")?;
let scanner = dataset.scan();
let mut stream = scanner.try_into_stream().await?;
let mut total = 0usize;
while let Some(batch) = stream.next().await {
let batch = batch?;
total += batch.num_rows();
}
Ok(total)
}
/// Build an IVF_PQ vector index on the `vector` column. IVF_PQ (Inverted File
/// with Product Quantization) is Lance's native ANN index — comparable to
/// HNSW in intent, but on-disk and compatible with Lance's random-access
/// model.
async fn build_lance_vector_index(path: &str, _dims: usize) -> Result<()> {
use lance::dataset::Dataset;
use lance::index::vector::VectorIndexParams;
use lance_index::{DatasetIndexExt, IndexType};
use lance_linalg::distance::MetricType;
let mut dataset = Dataset::open(path).await?;
// IVF_PQ with ~sqrt(N) partitions is a reasonable default for 100K.
// num_sub_vectors must divide dims evenly: 768/48 = 16 dims per subvector.
// num_bits = 8 gives 256 codes per subvector (good recall/size trade).
// max_iterations = 50 is plenty for this scale.
let params = VectorIndexParams::ivf_pq(
316, // num_partitions (~sqrt(100000))
8, // num_bits
48, // num_sub_vectors
MetricType::Cosine,
50, // max_iterations
);
dataset
.create_index(
&["vector"],
IndexType::Vector,
Some("vec_idx".into()),
&params,
true,
)
.await
.context("create_index")?;
Ok(())
}
/// Run N vector searches against the Lance dataset and return (p50, p95, p99) latencies in us.
/// Uses a handful of random rows as queries — same pattern as our harness::synthetic_from_chunks.
async fn run_search_benchmarks(path: &str, _dims: usize) -> Result<(f32, f32, f32)> {
use futures::StreamExt;
use lance::dataset::Dataset;
let dataset = Dataset::open(path).await?;
// Pick 20 representative query vectors from the data itself.
// (Synthetic — same pattern as our existing harness.)
let query_vectors = sample_query_vectors(&dataset, 20).await?;
let mut latencies_us: Vec<f32> = Vec::with_capacity(query_vectors.len());
for (i, qv) in query_vectors.iter().enumerate() {
// Scanner::nearest takes &Float32Array directly — no ArrayRef roundtrip needed.
let qarr = Float32Array::from(qv.clone());
let t0 = Instant::now();
let mut scanner = dataset.scan();
scanner
.nearest("vector", &qarr, 10)
.context("scanner.nearest")?;
let mut stream = scanner.try_into_stream().await?;
let mut hits = 0;
while let Some(batch) = stream.next().await {
let batch = batch?;
hits += batch.num_rows();
}
let us = t0.elapsed().as_micros() as f32;
latencies_us.push(us);
if i == 0 {
eprintln!(" first query: {} hits in {:.0}us (includes any lazy init)", hits, us);
}
}
latencies_us.sort_by(|a, b| a.partial_cmp(b).unwrap());
let p = |pct: f32| -> f32 {
let idx = ((latencies_us.len() as f32 - 1.0) * pct).round() as usize;
latencies_us[idx.min(latencies_us.len() - 1)]
};
Ok((p(0.50), p(0.95), p(0.99)))
}
/// Random row access via Lance's `take` — fetch 20 random rows by index, measure avg latency.
async fn measure_random_access_lance(path: &str) -> Result<f32> {
use lance::dataset::Dataset;
let dataset = Dataset::open(path).await?;
let n = dataset.count_rows(None).await?;
let indices: Vec<u64> = (0..20).map(|i| ((i as u64) * (n as u64 / 23)) % (n as u64)).collect();
// Full-schema projection — Lance's Schema implements Into<ProjectionRequest>.
let schema = dataset.schema().clone();
let mut total_us: u128 = 0;
for idx in &indices {
let t0 = Instant::now();
let _batch = dataset.take(&[*idx], schema.clone()).await?;
total_us += t0.elapsed().as_micros();
}
Ok(total_us as f32 / indices.len() as f32)
}
/// Random row access for Parquet — full scan + filter. There's no random-access
/// primitive in vanilla Parquet, so this is the cost of finding one specific row.
/// This is the cost our current design pays for "get doc X's full text for RAG."
fn measure_random_access_parquet(path: &str) -> Result<f32> {
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use std::fs::File;
// We simulate 5 lookups — full scan each time. 20 would be painful.
let iters = 5;
let mut total_us: u128 = 0;
for _ in 0..iters {
let t0 = Instant::now();
let file = File::open(path)?;
let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
let reader = builder.build()?;
// Simulate fetching one specific row (row 50000 here): read batches in
// order until we've passed it. We stop early, but we still pay for every
// batch before that point; vanilla Parquet has no random-access shortcut.
let mut seen = 0usize;
for b in reader {
let b = b?;
seen += b.num_rows();
if seen > 50000 { break; }
}
total_us += t0.elapsed().as_micros();
}
Ok(total_us as f32 / iters as f32)
}
/// Append 10K new rows to the existing Lance dataset.
/// Measures the "ingest delta" cost without full rewrite.
async fn append_10k_rows(path: &str, dims: usize) -> Result<()> {
use lance::dataset::{Dataset, WriteMode, WriteParams};
let dataset = Dataset::open(path).await?;
let schema = dataset.schema();
let arrow_schema: Arc<Schema> = Arc::new(schema.into());
// Build a 10K row batch with random-ish data matching the existing schema.
let n = 10_000;
let arrays: Vec<ArrayRef> = arrow_schema
.fields()
.iter()
.map(|f| -> Result<ArrayRef> {
match f.data_type() {
DataType::Utf8 => {
let vals: Vec<String> = (0..n).map(|i| format!("appended-{}", i)).collect();
Ok(Arc::new(arrow_array::StringArray::from(vals)))
}
DataType::Int32 => {
let vals: Vec<i32> = (0..n as i32).collect();
Ok(Arc::new(arrow_array::Int32Array::from(vals)))
}
DataType::FixedSizeList(_, _) => {
let floats: Vec<f32> = (0..n * dims).map(|i| (i as f32).sin()).collect();
let values = Float32Array::from(floats);
let field = Arc::new(Field::new("item", DataType::Float32, true));
let fsl = FixedSizeListArray::try_new(field, dims as i32, Arc::new(values), None)?;
Ok(Arc::new(fsl))
}
other => anyhow::bail!("unsupported append column type: {:?}", other),
}
})
.collect::<Result<Vec<_>>>()?;
let batch = RecordBatch::try_new(arrow_schema.clone(), arrays)?;
let reader = RecordBatchIterator::new(vec![Ok(batch)].into_iter(), arrow_schema);
let params = WriteParams { mode: WriteMode::Append, ..Default::default() };
Dataset::write(reader, path, Some(params)).await?;
Ok(())
}
/// Grab a few existing vectors from the dataset to use as self-similar queries.
async fn sample_query_vectors(
dataset: &lance::dataset::Dataset,
count: usize,
) -> Result<Vec<Vec<f32>>> {
use futures::StreamExt;
// Just take the first `count` rows; good enough for latency measurement.
let mut scanner = dataset.scan();
scanner.limit(Some(count as i64), None)?;
scanner.project(&["vector"])?;
let mut stream = scanner.try_into_stream().await?;
let mut out = Vec::with_capacity(count);
while let Some(batch) = stream.next().await {
let batch = batch?;
let vector_col = batch
.column(0)
.as_any()
.downcast_ref::<FixedSizeListArray>()
.context("vector column must be FixedSizeList")?;
for row in 0..vector_col.len() {
if out.len() >= count { break; }
let values = vector_col.value(row);
let f32_arr = values
.as_any()
.downcast_ref::<Float32Array>()
.context("inner array must be Float32")?;
let mut v = Vec::with_capacity(f32_arr.len());
for i in 0..f32_arr.len() {
v.push(f32_arr.value(i));
}
out.push(v);
}
if out.len() >= count { break; }
}
Ok(out)
}

View File

@@ -0,0 +1,105 @@
# ADR-019: Vector Storage — Parquet+HNSW stays, Lance joins as second tier
**Status:** Accepted — 2026-04-16
**Implements:** Phase 18 from PRD (Lance evaluation)
**Supersedes:** nothing (augments ADR-008)
**Owner:** J
---
## Context
Phase 18 of the PRD committed to settling "Parquet+sidecar vs Lance" with measurements, not vibes. This ADR records the benchmark outcome and the resulting architectural direction.
Input data: `data/vectors/resumes_100k_v2.parquet` — 100,000 × 768d embeddings, the same index we tuned HNSW against in Phase 15.
Benchmark harness: `crates/lance-bench/src/main.rs` — standalone binary, deliberately not integrated into the workspace's common deps to avoid forcing DataFusion/Arrow upgrades on the rest of the stack until we'd decided.
## The scorecard
All numbers measured on the same 128GB server, same 100K × 768d index, release build:
| Dimension | Parquet + HNSW (current) | Lance 4.0 IVF_PQ (candidate) | Winner |
|---|---|---|---|
| Cold load | 0.17s | 0.13s | Lance, 1.27× (*does not clear 2× decision threshold*) |
| Disk size (data only) | 330.3 MB | 330.4 MB | Tie |
| Index on-disk footprint | 0 (HNSW is RAM-only) | 7.4 MB | Lance |
| Index build time | 230s (ec=80 es=30) | 16s | **Lance, 14× faster** |
| Search p50 | 873us (recall@10 = 1.00) | 2229us (recall unmeasured, likely 0.85-0.95) | **Parquet+HNSW, 2.55× faster** |
| Search p95 | 1413us | 4998us | **Parquet+HNSW, 3.54× faster** |
| Speedup vs brute force (p50) | 50.4× | 19.7× | Parquet+HNSW |
| Random row access (fetch by id) | ~35ms (full-file scan) | 311us | **Lance, 112× faster** |
| Append 10K rows | Full-file rewrite (~330MB + re-embed + re-index) | 0.08s, +31MB delta | **Lance, structurally different** |
## Applying the decision rules from EXECUTION_PLAN.md
Original rules:
- *Lance wins cold-load by ≥2× AND matches search latency → migrate*
- *Within 50% across board → stay Parquet, document ceiling*
- *Lance loses → close the door*
Strict reading: cold-load is **1.27×, not ≥2×**. Search latency is **2.55× worse, not matching**. By the written rule, we stay.
But the written rule missed something. It assumed Lance's value would show up as raw-speed wins across the whole table. The actual benchmark reveals Lance's value is **in capabilities the current stack doesn't have**, not in the metrics we scoped:
1. **Random row access** is 112× faster. Our Parquet design can't do O(1) random access to a row — RAG text retrieval is a full-file scan today. Lance makes this native.
2. **Append** is structurally different. Adding 10K rows is 0.08s on Lance; on our stack it's a full rewrite of the entire 330MB Parquet file plus re-embedding plus re-indexing.
3. **Index build** is 14× faster. The HNSW `ec=80 es=30` production default takes 230s; Lance IVF_PQ takes 16s. Hot-swap generation (Phase 16) is much more feasible at 16s per build.
## The decision
**Hybrid architecture — neither replace nor reject.**
### What stays
- `vectord::store` with Parquet + binary-blob vectors → **primary vector backend**
- `vectord::hnsw::HnswStore` → in-RAM HNSW for search at 100K-scale indexes
- All Phase 15 trial infrastructure → keeps working, unchanged
- Production default `ec=80 es=30` → still the right call for in-RAM use
### What gets added
- **`vectord::lance_store`** — second backend using Lance as the persistence layer
- Scope: indexes where *any* of the following apply:
- Corpus exceeds ~5M vectors (our in-RAM ceiling)
- Workload is append-heavy (incremental ingest from streaming sources)
- Text retrieval dominates (point lookups by doc_id for RAG)
- Hot-swap generations are required (Phase 16)
- Implemented as a standalone crate first (follow the pilot layout), promoted into vectord when the API stabilizes
- **Profile-level configuration:** `ModelProfile.vector_backend: Parquet | Lance` so each profile picks the tier that matches its workload
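To make the per-profile selection concrete, here is a minimal sketch of the shape this could take. The `VectorBackend` enum, the `from_config` helper, and the cut-down `ModelProfile` struct are illustrative assumptions, not the committed Phase 17 API:

```rust
/// Hypothetical per-profile backend selector (illustrative, not final API).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum VectorBackend {
    /// Parquet + binary-blob vectors + in-RAM HNSW (current primary).
    #[default]
    Parquet,
    /// Lance dataset with IVF_PQ: append-heavy, random-access, >5M vectors.
    Lance,
}

impl VectorBackend {
    /// Parse a profile config value; unknown values fall back to the
    /// default so existing profiles keep working unchanged.
    fn from_config(s: &str) -> Self {
        match s.to_ascii_lowercase().as_str() {
            "lance" => VectorBackend::Lance,
            _ => VectorBackend::Parquet,
        }
    }
}

/// Cut-down stand-in for the Phase 17 ModelProfile.
struct ModelProfile {
    name: &'static str,
    vector_backend: VectorBackend,
}

fn main() {
    // A profile that never mentions the field keeps the Parquet default.
    let staffing = ModelProfile { name: "staffing", vector_backend: VectorBackend::default() };
    // An append-heavy / RAG profile opts into Lance explicitly.
    let brain = ModelProfile { name: "llm-brain", vector_backend: VectorBackend::from_config("lance") };
    assert_eq!(staffing.vector_backend, VectorBackend::Parquet);
    assert_eq!(brain.vector_backend, VectorBackend::Lance);
    println!("{}: {:?}, {}: {:?}", staffing.name, staffing.vector_backend, brain.name, brain.vector_backend);
}
```

Defaulting unknown values to Parquet keeps the field backward compatible: profiles written before the field existed deserialize to the current behavior.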
### What we keep watching (but don't act on yet)
- **Lance search latency at scale.** 2229us at 100K is worse than HNSW. At 10M we expect Lance to pull ahead because HNSW doesn't fit in RAM. Re-benchmark when we have a 10M-vector corpus to test against.
- **IVF_PQ recall.** We measured latency but not recall — I picked `num_partitions=316, num_bits=8, num_sub_vectors=48` blindly. A proper recall sweep is part of Phase C when we integrate Lance into the trial system.
- **Lance's own HNSW-on-disk variant** (`with_ivf_hnsw_pq_params`). Might close the in-RAM latency gap. Left for a future pilot.
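The recall sweep flagged above amounts to comparing ANN results against brute-force ground truth per query. A self-contained sketch of that measurement — the cosine helper, toy corpus, and simulated ANN result are stand-ins, not the lance-bench harness:

```rust
// Cosine distance over raw f32 slices; stand-in for the harness helper.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb).max(f32::EPSILON)
}

/// Exact top-k by brute force — the ground truth the ANN result is scored against.
fn brute_force_top_k(query: &[f32], corpus: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut ids: Vec<usize> = (0..corpus.len()).collect();
    ids.sort_by(|&i, &j| {
        cosine_distance(query, &corpus[i])
            .partial_cmp(&cosine_distance(query, &corpus[j]))
            .unwrap()
    });
    ids.truncate(k);
    ids
}

/// recall@k = |ann ∩ truth| / k.
fn recall_at_k(ann: &[usize], truth: &[usize]) -> f32 {
    let hits = ann.iter().filter(|id| truth.contains(id)).count();
    hits as f32 / truth.len() as f32
}

fn main() {
    // Tiny synthetic corpus; real sweep would use resumes_100k_v2 vectors.
    let corpus: Vec<Vec<f32>> = (0..100)
        .map(|i| vec![(i as f32).sin(), (i as f32).cos(), (i as f32 * 0.1).sin()])
        .collect();
    let query = corpus[7].clone();
    let truth = brute_force_top_k(&query, &corpus, 10);
    // Pretend the ANN index returned the ground truth minus one neighbor.
    let ann: Vec<usize> = truth[..9].to_vec();
    println!("recall@10 = {:.2}", recall_at_k(&ann, &truth)); // prints "recall@10 = 0.90"
}
```

Run per IVF_PQ parameter combination (partitions, sub-vectors, nprobes) and average over a query sample to get the sweep.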
## Why this isn't moving the goalposts
The EXECUTION_PLAN rule was "migrate or don't migrate." The evidence says neither is correct — one stack can't serve both the staffing SQL workload AND the LLM-brain append-heavy random-access workload at all scales. The honest answer is two backends, each doing what it's good at, selected per-profile.
This matches the dual-use framing in the 2026-04-16 PRD update: different workloads, shared substrate, per-profile specialization. We wrote that principle into the PRD; the benchmark data just made it concrete for the vector tier.
## Follow-up work (updates EXECUTION_PLAN.md)
- **Phase C (decoupled embedding refresh)** gets easier — Lance's native append removes the need to invent a "vectors delta" Parquet layer. When we build Phase C, use Lance as the embedding-layer backend.
- **Phase 16 (hot-swap)** becomes feasible — 16s index builds mean online re-trials are cheap. When we build Phase 16, Lance is the storage for index generations.
- **Phase 17 (model profiles)** gains a new field: `vector_backend: Parquet | Lance`. Default Parquet for backward compatibility. Agents can opt into Lance.
## Costs we accept
- **Second dependency tree.** Lance pulls in DataFusion 52 and Arrow 57, while our main stack runs DataFusion 47 and Arrow 55. Keeping lance-bench isolated works for a pilot; productionizing will need either workspace-wide upgrade or a firewall via a dedicated `vectord-lance` crate.
- **Second API surface.** Lance's vector-index API is different from our HNSW code. Per-profile abstraction cost is real.
- **Operational complexity.** Two vector storage implementations to debug and monitor.
Worth it because the alternative — forcing every workload through one backend — means either the staffing case or the LLM-brain case is served badly.
## Ceilings this updates in PRD
The PRD "Known ceilings" table had:
> Vector count per index | ~5M vectors on 128GB RAM | 10M+ (serious web crawl) | Phase 18 Lance migration OR mmap'd embeddings
Update to:
> Vector count per index | ~5M vectors on 128GB RAM (Parquet+HNSW in-RAM) | Past 5M | Switch that profile's `vector_backend` to Lance; IVF_PQ keeps working on disk-resident quantized codes

View File

@@ -89,3 +89,8 @@
**Date:** 2026-04-16
**Decision:** All append-only journals (error journal, HNSW trial journal, future audit logs) use the `storaged::append_log::AppendLog` helper. Events accumulate in an in-memory buffer; on threshold or explicit `flush()`, the buffer is written as one new timestamped file (`batch_{epoch_us}.jsonl`). Existing files are never rewritten. `compact()` merges all batches into one with a fresh timestamp, preserving chronological sort order.
**Rationale:** Object stores have no append primitive. Naive "read-modify-write the whole JSONL file on every event" is O(N²) cumulative work and creates the classic small-file / rewrite-amplification anti-pattern that llms3.com flags as the top lakehouse pitfall. Write-once batching is the LSM-tree idea applied to small JSONL events — bounded write amplification, append-only semantics, optional compaction for read efficiency. The in-memory ring buffer preserves O(1) recent-event reads for the `/storage/errors` and `/hnsw/trials` query endpoints.
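The buffer/flush/compact cycle described above can be sketched in a few dozen lines. This is a minimal illustration of write-once batching, not the real `storaged::append_log::AppendLog` — its buffering, error handling, and naming policy differ, and the paths and threshold here are made up:

```rust
use std::fs;
use std::path::PathBuf;
use std::time::{SystemTime, UNIX_EPOCH};

struct AppendLog {
    dir: PathBuf,
    buffer: Vec<String>, // pre-serialized JSONL lines
    threshold: usize,
}

impl AppendLog {
    fn new(dir: PathBuf, threshold: usize) -> std::io::Result<Self> {
        fs::create_dir_all(&dir)?;
        Ok(Self { dir, buffer: Vec::new(), threshold })
    }

    /// Buffer an event; spill to a new batch file once the threshold is hit.
    fn append(&mut self, line: String) -> std::io::Result<()> {
        self.buffer.push(line);
        if self.buffer.len() >= self.threshold {
            self.flush()?;
        }
        Ok(())
    }

    /// Write the buffer as ONE new timestamped file; existing files are
    /// never rewritten — this is what bounds write amplification.
    fn flush(&mut self) -> std::io::Result<()> {
        if self.buffer.is_empty() {
            return Ok(());
        }
        let epoch_us = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_micros();
        let path = self.dir.join(format!("batch_{epoch_us}.jsonl"));
        fs::write(&path, self.buffer.join("\n") + "\n")?;
        self.buffer.clear();
        Ok(())
    }

    /// Merge all batches into one fresh file; batch_{epoch_us} names sort
    /// chronologically, so concatenation preserves event order.
    fn compact(&self) -> std::io::Result<()> {
        let mut batches: Vec<PathBuf> = fs::read_dir(&self.dir)?
            .filter_map(|e| e.ok().map(|e| e.path()))
            .collect();
        batches.sort();
        let merged: String = batches
            .iter()
            .map(|p| fs::read_to_string(p))
            .collect::<std::io::Result<String>>()?;
        let epoch_us = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_micros();
        fs::write(self.dir.join(format!("batch_{epoch_us}.jsonl")), merged)?;
        for p in batches {
            fs::remove_file(p)?;
        }
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("append_log_sketch");
    let _ = fs::remove_dir_all(&dir);
    let mut log = AppendLog::new(dir.clone(), 2)?;
    log.append(r#"{"event":"a"}"#.into())?; // buffered
    log.append(r#"{"event":"b"}"#.into())?; // threshold hit, batch file 1
    log.append(r#"{"event":"c"}"#.into())?;
    log.flush()?; // explicit flush, batch file 2
    log.compact()?; // merge into a single fresh batch
    println!("files after compact: {}", fs::read_dir(&dir)?.count());
    Ok(())
}
```

Each flush costs one object-store PUT regardless of event count, which is the LSM-style trade described in the rationale.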
## ADR-019: Vector storage — Parquet+HNSW primary, Lance secondary (hybrid)
**Date:** 2026-04-16
**Decision:** Keep Parquet + binary-blob vectors + in-RAM HNSW as the primary vector backend. Add Lance as a second backend available per-profile for workloads where Lance wins architecturally. Per-profile `vector_backend: Parquet | Lance` field becomes part of Phase 17 model profiles. Implementation kicks off via the standalone `crates/lance-bench` crate and is promoted into `vectord::lance_store` when the API stabilizes.
**Rationale:** Head-to-head benchmark on the 100K × 768d `resumes_100k_v2` index (see `docs/ADR-019-vector-storage.md` for the full scorecard). Parquet+HNSW wins current-scale search latency by 2.55× (873us vs 2229us p50). Lance wins index build time by 14× (16s vs 230s), random row access by 112× (311us vs ~35ms full-file scan), and append speed structurally (0.08s vs full Parquet rewrite). Neither strictly dominates — the dual-use PRD framing (staffing + LLM brain) means both workloads exist in the same system. Keeps ADR-008's "Parquet is the format" principle intact for dataset tables; adds Lance as a purpose-built vector-tier option without discarding the tuned HNSW stack.

View File

@@ -340,14 +340,13 @@ The question raised 2026-04-16 after J's LLMS3 knowledge base identified Lance a
| Step | Deliverable | Decision criteria |
|---|---|---|
| 18.1 | Parallel Lance-backed vector index for `resumes_100k_v2` behind feature flag | Both implementations coexist, benchmarkable |
| 18.2 | Head-to-head benchmark: cold-load, search latency, disk size, append cost | See criteria below |
| 18.3 | ADR-019 documenting the decision with measured data | Commit or reject with evidence |
| 18.1 | ✅ Parallel Lance-backed vector index for `resumes_100k_v2` in standalone `crates/lance-bench` | Built 2026-04-16 |
| 18.2 | ✅ Head-to-head benchmark across 8 dimensions (cold-load, search latency, disk, index build, random access, append) | Complete |
| 18.3 | ✅ ADR-019 committed with measured data and decision | See `docs/ADR-019-vector-storage.md` |
**Decision rules:**
- Lance wins on cold-load by ≥2× AND matches search latency → migrate vector layer to Lance. Dataset Parquet stays.
- Lance is within 50% of current → stay on current stack, document ceiling explicitly.
- Lance loses → close the door, move on.
**Outcome:** Hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance joins as a second backend for Phase 16 hot-swap (14× faster index builds), Phase C/append workloads (0.08s vs full rewrite), RAG random-access retrieval (112× faster), and indexes past the ~5M RAM ceiling.
Per-profile `vector_backend: Parquet | Lance` becomes part of Phase 17 (model profiles). See ADR-019 for the full scorecard and caveats.
### Phase 19+: Further horizon
@@ -364,7 +363,7 @@ The current stack has measurable limits. Documenting them so future decisions ar
| Dimension | Current ceiling | Breaks at | Escape hatch |
|---|---|---|---|
| Vector count per index | ~5M vectors on 128GB RAM | 10M+ (serious web crawl) | Phase 18 Lance migration OR mmap'd embeddings |
| Vector count per index (Parquet+HNSW in-RAM) | ~5M on 128GB | Past 5M | Switch that profile's `vector_backend` to Lance per ADR-019 — IVF_PQ stays on disk-resident quantized codes |
| Concurrent active indexes | ~50-100 at 100K vectors each | 10M×50 configurations | Lance disk-resident + per-profile activation |
| Rows per dataset | 2.47M proven, probably 100M+ fine | Approaches DataFusion memory limits | DataFusion predicate pushdown + partition pruning (existing) |
| Concurrent loaded models | 1-2 on 16GB VRAM (A4000) | 3+ models simultaneous | Not our problem — architectural, driven by Ollama |