docs(comparison): close Lance backend deferral, reframe as Lance-vs-Parquet+HNSW

Lance went from "deferred until corpus exceeds 5M rows" to verified production-ready at 10M in a single wave (lakehouse repo commits 7594725 + 5d30b3d). Captures the 4-pack (sanitizer + tests + smoke + bench) + the root-cause fix (auto-build doc_id btree in migrate handler). Reframes the strategic question: HNSW at 10M doesn't fit RAM, so the real choice is Lance vs Parquet+HNSW-with-spilling, deferred until we have a workload where the Parquet path is the bottleneck.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 22:10:46 -05:00
commit 916923440a
parent b314ed1c94
1 changed file with 3 additions and 2 deletions
--- a/docs/ARCHITECTURE_COMPARISON.md
+++ b/docs/ARCHITECTURE_COMPARISON.md
@@ -4,7 +4,7 @@
 > comparison.
 > **Owner**: J. Update this when either side ships a fix that changes
 > the table values, or when a new architectural axis surfaces.
-> **Last meaningful refresh**: 2026-05-01 (post-Rust-cache + Go-validator-port)
+> **Last meaningful refresh**: 2026-05-02 (post-Lance-gauntlet + observability parity)

 This document compares the two parallel implementations of the lakehouse
 substrate — Rust at `/home/profit/lakehouse/` (production today), Go at
@@ -58,7 +58,8 @@ Don't:
 | 2026-05-02 | **Materializer parity probe — caught + fixed real bug** | New `scripts/cutover/parity/materializer_parity.sh` runs Bun + Go materializer on identical synthetic root, diffs output JSONL. Result on first run: **0/2 match** — Go's `Provenance.LineOffset` had `json:",omitempty"` and stripped the field on first-row records (line_offset=0 is semantically meaningful, not absent). 1-line fix (drop `omitempty` + comment explaining why). Re-run: **2/2 match**. Real cross-runtime gap surfaced + closed in same wave. |
 | 2026-05-02 | **extract_json parity probe — 12/12 match across edge cases** | New `scripts/cutover/parity/extract_json_parity.sh` runs identical model-output strings through Rust `gateway::v1::iterate::extract_json` AND Go `validator.ExtractJSON`. 12 fixtures: fenced/unfenced blocks, nested objects, unicode, escaped quotes, top-level array, malformed JSON. Substrate gate: `cargo test -p gateway extract_json` PASS before probe. Result: **12/12 match.** Algorithms genuinely equivalent. Rust side gained `pub` on `extract_json` + new `bin/parity_extract_json` (~30 LOC). |
 | 2026-05-02 | **Validator wire-format alignment — DONE** | Custom `MarshalJSON`/`UnmarshalJSON` on Go's `validator.ValidationError` emits the Rust serde-tagged-enum shape `{"Schema":{"field":"x","reason":"y"}}`. UnmarshalJSON also accepts the legacy flat shape (migration safety) and rejects unknown variants (drift guard for future Rust enum additions). 4 new pinning tests in `types_test.go`. Re-run validator parity probe: **6/6 match** (was 1/6). |
-| _open_ | Decide on Lance vector backend | Defer until corpus exceeds ~5M rows. |
+| 2026-05-02 | **Lance backend gauntlet (4-pack + root-cause fix) — DONE** | Lance crate had zero tests + no smoke when audited this morning. Shipped: (a) `sanitize_lance_err` over all 5 routes (search/doc/index/append/migrate) — missing-index now 404 not 500, no `/home/` or `/root/.cargo/` paths leaked; (b) 7 unit tests in `crates/vectord-lance` with synth Parquet helper; (c) 9-probe `scripts/lance_smoke.sh` against live `:3100`; (d) 10M re-bench (`reports/lance_10m_rebench_2026-05-02.md`) — search warm ~20ms, search cold ~46ms median. Bench surfaced doc-fetch p50 ~100ms (300x slower than ADR-019 100K projection); root-caused to lance-bench bypassing IndexMeta → warming auto-build never fired → no `doc_id` btree. **Fix shipped (commit `5d30b3d`)**: `lance_migrate` HTTP handler now auto-builds the btree inline (1.2s on 10M, +269MB), drops doc-fetch to ~5ms (20x). Live verified 9/9 smoke + post-restart doc-fetch 4-15ms. |
+| _open_ | Decide Lance vs Parquet+HNSW for primary | Lance verified production-ready at 10M (this morning's gauntlet). HNSW at 10M doesn't fit RAM (~60GB for vectors+graph), so the comparison is between Lance and Parquet+HNSW-with-spilling. Decide once we have a 10M ingest scenario where the Parquet path is bottlenecked. |
 | _open_ | Pick Go primary vs Rust primary | Both viable. Go has perf edge after today; Rust has production deploy + producer-side completeness. |

 ---
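
The `omitempty` pitfall noted in the materializer parity row (a zero `line_offset` being stripped from first-row records) can be reproduced in isolation. The sketch below is illustrative only: the struct names, the `source` field, and the snake_case JSON keys are assumptions, not the actual `Provenance` definition from the Go materializer.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// provenanceOmit is a hypothetical stand-in for the buggy shape:
// omitempty drops an int64 field whenever its value is 0.
type provenanceOmit struct {
	Source     string `json:"source"`
	LineOffset int64  `json:"line_offset,omitempty"`
}

// provenanceKeep is the same hypothetical struct without omitempty,
// so a zero offset is always serialized.
type provenanceKeep struct {
	Source     string `json:"source"`
	LineOffset int64  `json:"line_offset"`
}

// mustJSON marshals v and panics on error, to keep the demo short.
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// First-row record: offset 0 is meaningful, not absent.
	fmt.Println(mustJSON(provenanceOmit{Source: "a.jsonl", LineOffset: 0}))
	// prints {"source":"a.jsonl"}
	fmt.Println(mustJSON(provenanceKeep{Source: "a.jsonl", LineOffset: 0}))
	// prints {"source":"a.jsonl","line_offset":0}
}
```

A byte-level diff of JSONL output, like the one the parity probe performs, would flag the first record as a mismatch even though every non-zero offset round-trips identically.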