profit/lakehouse

Commit Graph

Author	SHA1	Message	Date

root

7bb66f08c3

lance: scrum-driven sanitizer + smoke-gate fixes (opus 2026-05-02 BLOCK)

lakehouse/auditor 9 blocking issues: cloud: claim not backed — "Verified live (post-restart): scale_test_10m doc-fetch 4-15ms across"

Cross-lineage scrum on the lance wave (4 bundles, 33 distinct findings)
surfaced 1 real BLOCK and 2 real WARNs from opus that the kimi/qwen
lineages missed. Per feedback_cross_lineage_review.md, opus is the
load-bearing reviewer; cross-lineage convergence is noise unless verified.

BLOCK fix — sanitize_lance_err path-stripping was unsound:
  err.split("/home/").next().unwrap_or(&err)
returns Some("") when err STARTS with "/home/", erasing the entire
message. Replaced truncation with redact_paths() — a hand-rolled scanner
that walks the input once, replacing path-shaped substrings with
[REDACTED] while preserving surrounding error context. Catches:
- absolute paths under /root/.cargo, /home, /var, /tmp, /etc, /usr, /opt
- relative variants (Lance occasionally strips leading slash —
  observed live "Dataset at path home/profit/lakehouse/data/lance/x
  was not found")
- multiple occurrences in one error
- preserves quote/comma/whitespace terminators
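
The failure mode and the replacement strategy described above can be sketched as follows. This is a minimal illustration, not the code in crates/vectord/src/service.rs: `truncate_at_home`, `redact_paths_sketch`, `ROOTS`, and the word-boundary rule are hypothetical names and assumptions. Only the split expression, the list of roots, and the terminator set come from the message.

```rust
/// The unsound truncation quoted above: keeps only the text before the
/// first "/home/", which is empty when the error starts with "/home/".
fn truncate_at_home(err: &str) -> &str {
    err.split("/home/").next().unwrap_or(err)
}

/// Path roots named in the message; the leading slash is handled separately
/// so relative variants ("home/profit/...") are caught too.
const ROOTS: [&str; 7] = ["root/.cargo", "home", "var", "tmp", "etc", "usr", "opt"];

/// Characters that end a path: whitespace, quotes, commas.
fn is_terminator(c: char) -> bool {
    c.is_whitespace() || matches!(c, '"' | '\'' | ',')
}

/// True if `rest` begins with a known root followed by '/',
/// with or without a leading slash.
fn starts_path(rest: &str) -> bool {
    let rest = rest.strip_prefix('/').unwrap_or(rest);
    ROOTS
        .iter()
        .any(|root| rest.strip_prefix(*root).map_or(false, |tail| tail.starts_with('/')))
}

/// Single-pass scanner: replaces each path-shaped substring with [REDACTED]
/// and copies everything else, including the terminator, unchanged.
fn redact_paths_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    // Assumption: a path may only start where the previous character is not
    // part of a word or of another path.
    let mut at_boundary = true;
    while i < input.len() {
        let rest = &input[i..];
        if at_boundary && starts_path(rest) {
            // Skip to the next terminator (or end of input).
            let end = rest.find(is_terminator).unwrap_or(rest.len());
            out.push_str("[REDACTED]");
            i += end;
            at_boundary = false;
        } else {
            let c = rest.chars().next().unwrap();
            out.push(c);
            i += c.len_utf8();
            at_boundary = !(c.is_alphanumeric() || matches!(c, '/' | '.' | '_' | '-'));
        }
    }
    out
}
```

In this sketch, `truncate_at_home` returns an empty string for any input that begins with "/home/", while `redact_paths_sketch` keeps the surrounding words and swaps out only the path itself.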

WARN fix #1 — is_not_found heuristic was too broad:
  lower.contains("not found")
caught real 500s like "column not found", "field not found in schema".
Narrowed to require dataset-shape phrasing AND exclude the
column/field/schema patterns explicitly.
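
A narrowed check of this kind might look like the sketch below. The function name and the exact phrases it matches ("dataset" together with "not found", or "no such file") are assumptions for illustration; the message only states that dataset-shaped phrasing is required and that column/field/schema patterns are excluded.

```rust
/// Hypothetical narrowed heuristic: a 404 candidate must look like a
/// missing dataset and must not mention column/field/schema.
fn is_not_found_sketch(msg: &str) -> bool {
    let lower = msg.to_lowercase();
    // Assumed "dataset-shape" phrasing.
    let dataset_shaped = (lower.contains("dataset") && lower.contains("not found"))
        || lower.contains("no such file");
    // Explicit exclusions named in the message.
    let schema_shaped = ["column", "field", "schema"]
        .iter()
        .any(|kw| lower.contains(*kw));
    dataset_shaped && !schema_shaped
}
```

Under these assumptions, "column not found" and "field not found in schema" stay on the 500 path, while "Dataset at path x was not found" is classified as not found.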

WARN fix #2 — lance_smoke.sh `grep -qvE` was an unsound regression gate.
  bash -c "echo '$BODY' | grep -qvE 'pat'"
With -v -q, exits 0 if ANY line lacks the pattern — so a multi-line
body with one leak line + any clean line FALSE-PASSES. Replaced with
the correct "pattern absent" form: `! grep -qE 'pat'`. Also expanded
the pattern set (added /var/, /tmp/) since the scrum surfaced these
as additional leak vectors.
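
The difference between the two gate forms can be modeled in a few lines. This is a Rust model of the exit-status logic only, not the shell script itself: the function names are hypothetical, and plain substring matching stands in for the extended regex.

```rust
/// Models `grep -qvE pat`: succeeds if ANY line does not match.
fn inverted_quiet_passes(body: &str, pat: &str) -> bool {
    body.lines().any(|line| !line.contains(pat))
}

/// Models `! grep -qE pat`: succeeds only if NO line matches.
fn pattern_absent_passes(body: &str, pat: &str) -> bool {
    !body.lines().any(|line| line.contains(pat))
}
```

For a two-line body with one leaking line and one clean line, the first form passes (the false pass described above) and the second form fails, which is the behavior a leak gate needs.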

Also unblocks pre-existing pathway_memory test compile error (stale
PathwayTrace init missing 6 Mem0-versioning fields added in 6ac7f61).
Tests filled in with sensible defaults — needed to run sanitize_tests.

10/10 new sanitize tests pass. Smoke 9/9 PASS against rebuilt+restarted
gateway. Live missing-index probe now returns:
  "lance dataset not found: no-such-11205" + HTTP 404
(was: leaked absolute paths + HTTP 500 → leaked absolute and relative
paths post-first-fix → clean message + 404 now.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-02 23:34:54 -05:00

root

7594725c25

lance backend: 4-pack — bug fix + smoke + tests + 10M re-bench

lakehouse/auditor 12 blocking issues: cloud: claim not backed — "Verified end-to-end against persistent Go stack on :4110:"

Surfaced by the 2026-05-02 audit (vectord-lance + lance-bench + glue
existed and worked but had no tests, no smoke, leaked server paths
on missing-index search, and the ADR-019 10M re-bench was deferred).

## 1. Fix: missing-index search returned 500 + leaked filesystem path

Pre-fix:
  $ POST /vectors/lance/search/no-such-index
  HTTP 500
  Dataset at path home/profit/lakehouse/data/lance/no-such-index was
  not found: Not found: home/profit/lakehouse/data/lance/no-such-index/
  _versions, /root/.cargo/registry/src/index.crates.io-...-1949cf8c.../
  lance-table-4.0.0/src/io/commit.rs:364:26, ...

Post-fix:
  HTTP 404
  lance dataset not found: no-such-index

Added `sanitize_lance_err()` in crates/vectord/src/service.rs that:
  - maps "not found" / "no such file" patterns → 404 (was 500)
  - strips /home/ and /root/.cargo/ paths from any error body
Applied to all 5 lance handlers: search, get_doc, build_index,
append, migrate. The store_for() handle is cheap-and-stateless;
the actual disk hit happens inside the operation, which is where
the leak originated.
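
The status mapping described above can be sketched roughly as follows. The function name, signature, and the generic 500 body are assumptions for illustration; only the 404 mapping for "not found" / "no such file" and the post-fix message shape are taken from the text.

```rust
/// Hypothetical handler-facing helper: turns a raw Lance error into an
/// HTTP status and a body that contains no filesystem paths.
fn sanitize_sketch(index: &str, raw: &str) -> (u16, String) {
    let lower = raw.to_lowercase();
    if lower.contains("not found") || lower.contains("no such file") {
        // Build the body from the index name only, never from the raw error.
        (404, format!("lance dataset not found: {}", index))
    } else {
        // Assumed placeholder body for all other failures.
        (500, "lance operation failed".to_string())
    }
}
```

Building the 404 body from the caller-supplied index name, rather than from the raw error, is one way to guarantee that no path can reach the response.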

## 2. scripts/lance_smoke.sh — first regression gate

9-probe smoke against the live HTTP surface. Exercises only read
paths (no state mutation in CI). Specifically locks the sanitizer
fix — a future regression that re-introduces the path leak fires
the smoke immediately. 9/9 PASS against the live :3100 today.

## 3. Unit tests on vectord-lance/src/lib.rs (was: zero tests)

7 tests covering the public LanceVectorStore API:
  - fresh_store_reports_no_state — handle is lazy
  - migrate_then_count_and_fetch — Parquet → Lance round-trip
  - get_by_doc_id_missing_returns_none — Ok(None) vs Err contract
    that lets the HTTP handler return 404 cleanly
  - append_grows_count_and_new_rows_fetchable — ADR-019's
    structural-difference claim verified at the unit level
  - append_dim_mismatch_errors — guards against silently breaking
    search by accepting inconsistent-dim rows
  - search_returns_nearest — exact-vector match → top-1
  - stats_reports_post_migrate_state — locks the field shape

7/7 PASS. cargo test -p vectord-lance --lib green.
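
The Ok(None) vs Err contract mentioned for get_by_doc_id can be illustrated with a small mapping. The helper name and the status codes for the hit and error cases are assumptions; the text only states that Ok(None) lets the HTTP handler return 404 cleanly.

```rust
/// Hypothetical mapping from a fetch result to an HTTP status:
/// a missing document is a normal outcome, not an error.
fn status_for_fetch<T, E>(result: &Result<Option<T>, E>) -> u16 {
    match result {
        Ok(Some(_)) => 200, // document found
        Ok(None) => 404,    // lookup succeeded, nothing there
        Err(_) => 500,      // the lookup itself failed
    }
}
```

Keeping "absent" in the Ok branch means the handler never has to inspect error strings to decide between 404 and 500 for a missing document.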

## 4. 10M re-bench (deferred from ADR-019)

reports/lance_10m_rebench_2026-05-02.md captures the numbers driven
against the live :3100 over data/lance/scale_test_10m (33GB / 10M
vectors, IVF_PQ confirmed via response method tag).

Headline:
  Search cold (10 diverse queries):   median ~32ms, mean ~46ms
  Search warm (5x same query):        ~20ms p50
  Doc fetch (5x same id):             ~100ms p50

Search latency at 10M is acceptable for batch / async workloads but
too slow for sub-10ms voice/recommendation paths. ADR-019's "Lance
pulls ahead at 10M" claim remains unverified but not refuted: at
this scale HNSW is not operationally viable (10M × 768d × 4 bytes ≈
30GB just for vectors).

Real finding: doc-fetch at 10M is roughly 300x slower than the 100K
number ADR-019 cited (311μs → ~100ms). Likely cause: the scalar btree
index on doc_id may not be built for this dataset. Follow-up:
investigate whether forcing build_scalar_index brings doc-fetch back
to the 100K-scale latency range. Captured in the report.
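
The two back-of-envelope figures in this section (raw vector size and the doc-fetch slowdown) can be checked with a short calculation. The function names are hypothetical; the inputs are the numbers quoted above.

```rust
/// Raw storage for `n` vectors of `dim` components at `bytes_per` bytes each.
fn vector_bytes(n: u64, dim: u64, bytes_per: u64) -> u64 {
    n * dim * bytes_per
}

/// How many times slower `observed_us` is than `baseline_us`.
fn slowdown(baseline_us: f64, observed_us: f64) -> f64 {
    observed_us / baseline_us
}
```

With 10M vectors, 768 dimensions, and 4-byte floats, `vector_bytes` gives 30,720,000,000 bytes, about 30.7 GB. With a 311μs baseline and ~100ms observed (100,000μs), `slowdown` gives a factor of about 321, consistent with the rounded 300x.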

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-02 20:06:56 -05:00

2 Commits