Standalone benchmark crate `crates/lance-bench` running Lance 4.0 against our Parquet+HNSW at 100K × 768d (resumes_100k_v2) measured 8 dimensions. Results (see docs/ADR-019-vector-storage.md for full scorecard): Cold load: Parquet 0.17s vs Lance 0.13s (tie — not ≥2× threshold) Disk size: 330.3 MB vs 330.4 MB (tie) Search p50: 873us vs 2229us (Parquet 2.55× faster) Search p95: 1413us vs 4998us (Parquet 3.54× faster) Index build: 230s (ec=80) vs 16s (IVF_PQ) (Lance 14× faster) Random access: 35ms (scan) vs 311us (Lance 112× faster) Append 10K rows: full rewrite vs 0.08s/+31MB (Lance structural win) Decision (ADR-019): hybrid, not migrate-or-reject. - Parquet+HNSW stays primary — our HNSW at ec=80 es=30 recall=1.00 is 2.55× faster than Lance IVF_PQ at 100K in-RAM scale - Lance joins as second backend per-profile for workloads where it wins architecturally: random row access (RAG text fetch), append-heavy pipelines (Phase C), hot-swap generations (Phase 16, 14× faster builds), and indexes past the ~5M RAM ceiling - Phase 17 ModelProfile gets vector_backend: Parquet | Lance field - Ceiling table in PRD updated — 5M ceiling now says "switch to Lance" instead of "migrate" since Lance runs alongside, not instead of Isolation: lance-bench is a standalone workspace crate with its own dep tree (Lance pulls DataFusion 52 + Arrow 57 incompatible with main stack DataFusion 47 + Arrow 55). Kept off the critical path until API is stable enough to promote into vectord::lance_store. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
43 lines
1.5 KiB
TOML
43 lines
1.5 KiB
TOML
[package]
|
|
name = "lance-bench"
|
|
version = "0.1.0"
|
|
edition = "2024"
|
|
|
|
# Standalone pilot for Phase B (see docs/EXECUTION_PLAN.md).
|
|
# Deliberately NOT sharing workspace deps — Lance 4.x pulls in its own
|
|
# DataFusion and Arrow versions incompatible with the rest of the stack.
|
|
# Isolating the pilot means we don't force a workspace-wide upgrade until
|
|
# we've decided Lance is worth it.
|
|
|
|
[dependencies]
|
|
# Only the features we actually need — the default brings in AWS/Azure/GCP/HF etc
|
|
# which is ~200 extra crates we don't care about for a local pilot.
|
|
lance = { version = "4.0", default-features = false }
|
|
# Lance exposes DatasetIndexExt, IndexType, and IvfBuildParams through
|
|
# its sub-crates which must be imported directly — lance itself doesn't
|
|
# re-export them at a convenient path.
|
|
lance-index = { version = "4.0", default-features = false }
|
|
lance-linalg = { version = "4.0", default-features = false }
|
|
|
|
# Arrow re-exported by Lance; pin to a range Lance picks so types match.
|
|
arrow = "57"
|
|
arrow-array = "57"
|
|
arrow-schema = "57"
|
|
|
|
# Also need to read the EXISTING Parquet vector files so we can compare.
|
|
# These live in data/vectors/*.parquet. Lance's internal Parquet reading
|
|
# might differ from ours; using our format's Arrow/Parquet versions for
|
|
# the read side keeps the inputs identical.
|
|
parquet = "57"
|
|
|
|
tokio = { version = "1", features = ["full"] }
|
|
futures = "0.3"
|
|
serde = { version = "1", features = ["derive"] }
|
|
serde_json = "1"
|
|
anyhow = "1"
|
|
bytes = "1"
|
|
|
|
[[bin]]
|
|
name = "lance-bench"
|
|
path = "src/main.rs"
|