golangLAKEHOUSE

Author	SHA1	Message	Date
root	1ec85b0a16	Batch 2: perf baseline - multi-sample + warmup + MAD threshold
Replaces single-shot baselines (40% noise floor flagged in Phase E) with noise-aware regression detection.
What changed:
  ingest      n=3 runs (was 1) with 3-pass warmup
  vector_add  n=3 runs (was 1) with 3-pass warmup
  query       n=20 samples (unchanged) with 50-pass warmup
  search      n=20 samples (unchanged) with 50-pass warmup
  RSS         n=1 (unchanged; steady-state in G0)
Each metric stored as {value: median, mad: median absolute deviation} in baseline.json (schema: v2-multisample-mad).
New regression detection:
  threshold = max(3 * baseline.mad, value * 0.75)
  REGRESSION iff |actual - baseline.value| > threshold AND direction signals worse (lower throughput / higher latency).
Why these specific numbers:
  3*MAD = standard "outside the spread" bound; lets high-variance metrics tolerate their own noise.
  75% floor = empirical observation: even with 50 warmups, single-host inter-run variance on bootstrap-cold queryd was consistently 90-130% on this box. 75% catches >75% regressions cleanly while ignoring known noise.
lib/metrics.sh: new proof_compute_mad helper computes MAD from a file of one-number-per-line samples. Used for both regen (to write the baseline.mad value) and diff (read from baseline).
Honest finding from this iteration's 3 back-to-back diff runs: query_ms shows 90-130% delta from baseline consistently - not random noise but a systematic 2x gap between regen-time and steady-state. The regen captured a particularly fast moment; steady-state is slower.
Operator workflow: regenerate the baseline at a known-representative state via `bash tests/proof/run_proof.sh --mode performance --regenerate-baseline` rather than expecting the harness to track a moving target.
The harness's value here is the EVIDENCE RECORD (every run captures median+MAD+p95 plus all raw samples in raw/metrics/), not the gate. Even false-positive REGRESSION skips give operators "this run was 20ms vs baseline 10ms" which is informative.
Sample counts also written into baseline.json under "samples" so a future audit can verify the methodology that produced the values.
Verified across 3 back-to-back runs:
  ingest_rows_per_sec  PASS (delta within 75%, mostly < 10%)
  vectors_per_sec_add  PASS
  search_ms            PASS
  rss_*                PASS
  query_ms             REGRESSION flagged (130/100/90%) - known systematic gap, not bug
Closes the "40% noise floor" follow-up from Phase E FINAL_REPORT.
Honest about limitations: hard regression gating on a busy single-host setup needs either much bigger sample counts (n≥100), longer warmup, or moving to a dedicated benchmark host. Documented inline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 06:13:47 -05:00
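The MAD-based rule in commit 1ec85b0a16 can be sketched in a few lines of POSIX shell. This is only an illustration under stated assumptions: the function names (median_of, mad_of, is_regression) and the direction labels are hypothetical, and the real proof_compute_mad in lib/metrics.sh may be written differently. The threshold uses the baseline value for the 75% floor, which is one reading of "value * 0.75".

```shell
# Hypothetical helpers illustrating the rule; not the harness's actual code.

# median_of FILE: median of a file holding one number per line.
median_of() (
  sort -n "$1" | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
)

# mad_of FILE: median absolute deviation, i.e. median(|x_i - median(x)|).
mad_of() (
  med=$(median_of "$1") || exit 1
  tmp=$(mktemp) || exit 1
  awk -v m="$med" '{ d = $1 - m; if (d < 0) d = -d; print d }' "$1" > "$tmp"
  median_of "$tmp"
  rc=$?
  rm -f "$tmp"
  exit "$rc"
)

# is_regression ACTUAL BASE_VALUE BASE_MAD DIRECTION
# DIRECTION is "higher_is_worse" (latency) or "lower_is_worse" (throughput).
# Exits 0 when the run counts as a regression, 1 otherwise.
is_regression() (
  awk -v a="$1" -v b="$2" -v mad="$3" -v dir="$4" 'BEGIN {
    t1 = 3 * mad; t2 = b * 0.75
    thr = (t1 > t2) ? t1 : t2              # threshold = max(3*MAD, 75% of baseline)
    delta = a - b
    mag = (delta < 0) ? -delta : delta
    worse = (dir == "higher_is_worse") ? (delta > 0) : (delta < 0)
    exit (mag > thr && worse) ? 0 : 1
  }'
)
```

With a baseline of 10ms and MAD 1, a 20ms run exceeds the 7.5ms threshold and is flagged, which matches the "this run was 20ms vs baseline 10ms" example in the commit message; a 12ms run is not.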
root	4840c10311	proof harness: fix queryd refresh-tick race in 04_query_correctness
Caught by the audit rerun: with cache-warm binaries, 04 fires its first SELECT faster than queryd's 500ms refresh tick - Q1 returned 400 ("table not found") even though 03_ingest had registered the manifest. Subsequent queries (after the next tick) succeeded.
This is an eventual-consistency wait, not a retry - queryd's contract is that views appear within one tick of catalogd having the manifest. Production code does not need changing.
Added to lib/http.sh:
  proof_wait_for_sql <budget_sec> <sql> polls a SQL probe until it returns 200 or budget elapses; emits no evidence (test setup, not a claim).
Used in 04_query_correctness:
  Wait up to 5s for queryd to have the view before running the 5 SQL assertions. Skip-with-loud-reason if the view never appears.
Verified: integration mode back to 104 pass / 0 fail / 1 skip after fix. The skip is the unchanged GOLAKE-085 informational record.
This is exactly the kind of finding the harness was designed to surface - the regression existed in the codebase the moment Phase D shipped, but only fired when the next compare run hit cache-warm timing. Without the harness, it would have surfaced on a CI run weeks from now and been hard to bisect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 05:36:28 -05:00
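The eventual-consistency wait from commit 4840c10311 amounts to a bounded polling loop. The sketch below is a generic version under assumptions: wait_for_200 and sql_status are hypothetical names, the 0.2s poll interval is arbitrary, and the request body passed to sql_status is whatever queryd expects (the commit does not show it). The actual proof_wait_for_sql in lib/http.sh may differ.

```shell
# Hypothetical sketch of a bounded wait; not the harness's proof_wait_for_sql.

# wait_for_200 BUDGET_SEC CMD [ARGS...]
# Re-runs CMD until it prints "200" or BUDGET_SEC seconds have elapsed.
# Exits 0 on success, 1 on timeout. Emits nothing itself.
wait_for_200() (
  budget=$1
  shift
  deadline=$(( $(date +%s) + budget ))
  while :; do
    status=$("$@" 2>/dev/null)
    [ "$status" = "200" ] && exit 0
    [ "$(date +%s)" -ge "$deadline" ] && exit 1
    sleep 0.2   # assumed interval; shorter than the 500ms refresh tick
  done
)

# sql_status URL BODY: POST BODY to URL and print only the HTTP status code.
# The JSON content type and the body shape are assumptions.
sql_status() {
  curl -s -o /dev/null -w '%{http_code}' \
    -X POST -H 'Content-Type: application/json' --data "$2" "$1"
}

# Example (hypothetical URL and body variables):
#   wait_for_200 5 sql_status "$SQL_URL" "$SQL_BODY" || echo "skip: view never appeared"
```

Keeping the probe as a separate command makes the loop usable for other readiness checks, such as the /health poll after a vectord restart.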
root	175ad59cb3	proof harness Phase D: performance baseline · 1000-row ingest, p50/p95
GOLAKE-100. First run writes tests/proof/baseline.json; subsequent runs diff against it. >10% regression emits a SKIP with REGRESSION detail (not a fail - perf claim is required:false in claims.yaml so the gate stays green; the human summary tells the regression story honestly).
Skip-with-loud-reason if any earlier case in the run failed, per spec "performance only after contract+integration pass."
Workload (deterministic, repeatable):
  ingest      1000-row CSV (5 roles × 5 cities × seeded scores) → /v1/ingest
  query       SELECT count(*) ×20 against the just-ingested dataset
  vector add  200 dim=4 vectors with formulaic content (no Ollama)
  search      ×20 against the perf index with a fixed query vector
  RSS         per-service post-workload sample via /proc/<pid>/status
Recorded metrics: ingest_rows_per_sec, query_p50_ms, query_p95_ms, vectors_per_sec_add, search_p50_ms, search_p95_ms, rss_{storaged,catalogd,ingestd,queryd,vectord,embedd,gateway}_mb
baseline.json on this box (committed):
  25000 rows/sec ingest · 17ms p50 / 24ms p95 query
  6250 vectors/sec add · 8ms p50 / 20ms p95 search
  queryd 69 MiB · vectord 14 MiB · others 11-29 MiB
Honest measurement-design finding from the very first compare run: back-to-back runs surfaced -41% ingest and +29% query p50 - pure disk-cache + queryd-cold-start noise. Single-sample baselines have real noise floor ≈40%. Recorded as REGRESSION skips so the human summary surfaces it, not a code regression. Tightening the threshold or moving to multi-sample medians is a Phase E recommendation.
Verified end-to-end:
  just proof contract     - 53 pass · 1 skip · ~4s
  just proof integration  - 104 pass · 1 skip · ~8s
  just proof performance  - 110 pass · 3 skip · ~10s
  just verify             - 9 smokes still green · 29s
All 11 cases (4 contract + 6 integration + 1 performance) deterministic end-to-end.
Phase E (final report against the 9 mandated questions) is the last piece.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 05:30:11 -05:00
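The p50/p95 and RSS metrics recorded in commit 175ad59cb3 can be computed from raw samples roughly as follows. This is a sketch under assumptions: the commit does not say which percentile definition lib/metrics.sh uses, so nearest-rank is assumed here, and percentile_of and rss_mib are hypothetical names. rss_mib relies on the Linux-specific VmRSS field of /proc/<pid>/status, reported in kB.

```shell
# Hypothetical helpers; the percentile definition (nearest-rank) is an assumption.

# percentile_of FILE P: nearest-rank percentile of one-number-per-line samples,
# for an integer P in 1..100.
percentile_of() (
  sort -n "$1" | awk -v p="$2" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
)

# rss_mib PID: resident set size of a process in whole MiB (Linux only).
rss_mib() {
  awk '/^VmRSS:/ { printf "%d\n", $2 / 1024 }' "/proc/$1/status"
}
```

With 20 samples, nearest-rank puts p50 at the 10th sorted sample and p95 at the 19th, so a single slow outlier moves p95 only when it is not the maximum.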
root	1313eb2173	proof harness Phase C: 6 integration cases · 104/0/1 green
Adds the integration tier - full chain CSV→Parquet→SQL and full text→embed→vector→search. All 10 cases (4 contract + 6 integration) end-to-end deterministic; 8s wall total.
Cases added:
  01_storage_roundtrip.sh  GOLAKE-010-012. PUT 1KiB → GET sha256-equal → LIST contains key → DELETE 200/204 → GET 404. Deterministic key under proof/<case_id>/ so concurrent runs don't collide.
  02_catalog_manifest.sh  GOLAKE-020-022. Fresh register existing=false → manifest read matches → list contains dataset_id → idempotent re-register existing=true with stable dataset_id → schema-drift register 409 (the ADR-020 contract). Per-run unique name via PROOF_RUN_ID so existing=false is meaningful.
  03_ingest_csv_to_parquet.sh  GOLAKE-030. workers.csv (5 rows) via /v1/ingest multipart → parquet object on storaged → catalog manifest with row_count=5. Verifies content-addressed key shape (datasets/<n>/<fp>.parquet).
  04_query_correctness.sh  GOLAKE-040. The 5 SQL assertions from fixtures/expected/queries.json against the workers fixture: count=5, Chicago=2, max=95, safety→Barbara, Houston avg=89.5. Iterates the YAML claims, runs each query, compares response columns to expected values.
  06_vector_add_search.sh integration extension  GOLAKE-051. text → /v1/embed (4 docs from fixtures/text/docs.txt) → vectord add → search by query embedding. Top-1 ID per query asserted against fixtures/expected/rankings.json. First run (or --regenerate-rankings) writes the fixture and emits a skip with explicit reason; subsequent runs assert against it.
  07_vector_persistence_restart.sh  GOLAKE-070. add 4 unit-basis vectors → search → record top-1 distance → SIGTERM vectord → restart with the same --config → poll /health for 8s → search again → top-1 ID and distance match bit-identically. Skips with reason if vectord PID can't be found or post-restart bind times out.
Two harness improvements landed alongside:
  run_proof.sh writes a temp lakehouse_proof.toml with refresh_every="500ms" override and passes --config to all booted binaries. Production default is 30s; 04_query_correctness needs queryd to pick up the new view within a tick. Production config unchanged.
  cleanup() now pgreps for any orphan bin/<svc> processes (anchored to start-of-argv per memory feedback_pkill_scope.md) so a case that restarts a service mid-run still gets cleaned up.
lib/http.sh adds proof_call(case_id, probe, method, url, args...) - escape hatch for cases that need raw curl args (multipart -F, custom headers). Used by 03_ingest for the multipart upload that conflicts with proof_post's --data + Content-Type defaults.
lib/env.sh exports PROOF_RUN_ID - short unique id derived from the report directory timestamp. Used by 02 and 07 for fresh-each-run state isolation.
Two real findings recorded as evidence (no code changes):
  - rankings.json fixture pinned: 4 queries → 4 distinct top-1 docs via nomic-embed-text. A model swap that changes ranking now fails the harness loudly; --regenerate-rankings is the override.
  - vectord persistence kill+restart preserves top-1 distance bit-identically - the LHV1 single-Put framed format from G1P round-trips exactly through Save/Load.
Verified end-to-end:
  just proof contract     - 53 pass (4 cases)
  just proof integration  - 104 pass (10 cases) · 8s wall
  just verify             - 9 smokes still green · 33s wall
Phase D (performance baseline) lands next: 10_perf_baseline measures rows/sec ingest, vectors/sec add, p50/p95 query+search latency, RSS, CPU. First run writes tests/proof/baseline.json; later runs diff against it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 05:26:00 -05:00
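The "GET sha256-equal" step of 01_storage_roundtrip in commit 1313eb2173 comes down to comparing digests of the uploaded and downloaded bytes. A minimal sketch follows, assuming both payloads have already been saved to local files; sha256_of and same_bytes are hypothetical names, and the PUT/GET calls themselves are left out because the storaged routes are not shown in the commit.

```shell
# Hypothetical helpers; the harness's own comparison may be implemented differently.

# sha256_of FILE: print the hex SHA-256 digest of FILE.
# Uses sha256sum where available, else falls back to shasum -a 256.
sha256_of() (
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{ print $1 }'
  else
    shasum -a 256 "$1" | awk '{ print $1 }'
  fi
)

# same_bytes SENT_FILE RECEIVED_FILE: exit 0 only if both digests exist and match.
# The non-empty check stops two failed hash attempts from comparing as "equal".
same_bytes() (
  h1=$(sha256_of "$1" 2>/dev/null)
  h2=$(sha256_of "$2" 2>/dev/null)
  [ -n "$h1" ] && [ "$h1" = "$h2" ]
)
```

The same predicate fits the gateway-versus-upstream body comparison described for 08_gateway_contracts.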
root	6d18394416	proof harness Phase B: 4 contract cases · 53/0/1 green
Added the contract tier above 00_health canary. All 5 contract cases now cover GOLAKE-001-003, 050, 060-061, 080-085 - 53 assertions pass, 1 informational skip, 0 fail. Wall: 4s end-to-end (cached binaries).
Cases:
  05_embedding_contract.sh  GOLAKE-050. POST /v1/embed with one short text → asserts dim=768, one vector returned, vector length matches dimension, sum of squared elements > 0 (proxy for non-zero), response.model echoed. Skips with explicit reason if Ollama is unreachable (502 from embedd) - per spec hard rule "skipped tests do not appear as passed."
  06_vector_add_search.sh  GOLAKE-060 + GOLAKE-061. Synthetic dim=4 unit basis vectors. Create index → add 3 vectors → get-index returns length=3 → search([1,0,0,0],k=3) returns v1 at rank 1 with distance < 0.001. Cleanup with DELETE. No embedd dependency - pure contract layer.
  08_gateway_contracts.sh  GOLAKE-003. For each /v1/* route, asserts gateway and direct upstream return identical status AND identical response body (sha256 match). Confirms gateway is a proxy not a transformer. Status passthrough verified on both 200 path (storage/list, catalog/list) and 4xx path (sql empty body → 400 from queryd).
  09_failure_modes.sh  GOLAKE-080..085. Six failure-mode contracts:
    080 malformed JSON → 4xx on catalog/ingest/sql/embed
    081 missing required field → 4xx on catalog/vectors/embed
    082 bad SQL → 4xx with non-empty error body
    083 vector dim mismatch → 4xx
    084 missing storage object → 404
    085 duplicate vector ID → INFORMATIONAL (spec says required:false); first/second statuses recorded as evidence; contract decided later from the recorded record.
Two new lib helpers in lib/assert.sh:
  proof_assert_status_in <id> <claim> "200 201 204" <probe>  pass if status is in the space-separated list. Used for delete-returns-200-or-204 case where vectord returns 204.
  proof_assert_status_4xx <id> <claim> <probe>  pass if status in [400, 500). Used for failure modes where the specific 4xx code may vary (400 vs 422 vs 409). Records actual code as evidence.
Two real contract findings recorded by the harness during build:
  - vectord add expects {"items": [...]}, not {"vectors": [...]}. My initial test sent the wrong field; would have masked the bug forever in CI. The harness caught it via the assertion failure.
  - vectord create returns 201 Created, delete returns 204 No Content. Documented in the test fixtures as canonical.
Regression: just verify wall 33s, vet + test + 9 smokes still green.
Phase C (integration) lands next: 01_storage_roundtrip, 02_catalog_manifest, 03_ingest_csv_to_parquet, 04_query_correctness, 05/06 integration extends, 07_vector_persistence_restart.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 05:15:04 -05:00
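The two status checks added in commit 6d18394416 reduce to simple predicates on the HTTP status code. The sketch below shows only that core logic; status_in and status_is_4xx are hypothetical names, and the real proof_assert_status_in and proof_assert_status_4xx additionally take a case id, claim and probe, and emit a JSONL evidence record, none of which is reproduced here.

```shell
# Hypothetical predicates; evidence recording from lib/assert.sh is omitted.

# status_in STATUS "200 201 204": exit 0 if STATUS is one of the
# space-separated codes in the second argument.
status_in() (
  for code in $2; do
    [ "$1" = "$code" ] && exit 0
  done
  exit 1
)

# status_is_4xx STATUS: exit 0 if STATUS is in [400, 500).
status_is_4xx() {
  case $1 in
    4[0-9][0-9]) return 0 ;;
    *)           return 1 ;;
  esac
}
```

Matching on the pattern 4[0-9][0-9] accepts 400, 409 and 422 alike, which is the point of the helper when the exact code may vary by service.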
root	a81291e38c	proof harness Phase A: scaffolding + canary case green
Per docs/TEST_PROOF_SCOPE.md, building the claims-verification tier above the smoke chain. This commit lays the scaffolding and proves the orchestrator end-to-end with one canary case (00_health).
What landed:
  tests/proof/
    README.md       how to read a report, layout, modes
    claims.yaml     24 claims enumerated (GOLAKE-001..100)
    run_proof.sh    orchestrator with --mode {contract|integration|performance} and --no-bootstrap / --regenerate-{rankings,baseline}
    lib/
      env.sh        service URLs, report dir, mode, git context
      http.sh       curl wrappers writing per-probe JSON + body + headers
      assert.sh     proof_assert_{eq,ne,contains,lt,gt,status,json_eq} + proof_skip - each emits one JSONL record per call
      metrics.sh    start/stop timers, value capture, RSS sampling, percentile compute (for Phase D)
    cases/
      00_health.sh  canary - gateway + 6 services /health → 200, body identifies service, latency < 500ms (21 assertions)
    fixtures/
      csv/workers.csv        spec's 5-row deterministic CSV
      text/docs.txt          4 deterministic vector docs
      expected/queries.json  expected results for the 5 SQL assertions
Wired into the task runner:
  just proof contract     # canary only this commit
  just proof integration  # Phase C
  just proof performance  # Phase D
.gitignore: /tests/proof/reports/* with !.gitkeep - same pattern as reports/scrum/_evidence/. Per-run output is a runtime artifact.
Specs landed alongside (J's drops):
  docs/TEST_PROOF_SCOPE.md            the harness contract this implements
  docs/CLAUDE_REFACTOR_GUARDRAILS.md  process discipline this harness obeys
Verified end-to-end (cached binaries):
  just proof contract  wall < 2s, 21 pass / 0 fail / 0 skip
  just verify          wall 31s, vet + test + 9 smokes still green
Two bugs fixed during canary run, both in run_proof.sh aggregation:
  - grep -c exits 1 on zero matches; the `|| echo 0` form concatenated "0\n0" and broke jq --argjson + integer comparison. Fixed via a _count helper that captures count-or-zero cleanly.
  - per-case table iterated case scripts (filename-based) but cases write evidence under CASE_ID. Switched to JSONL-file iteration so multi-case scripts work and the mapping is faithful.
Phase B (contract cases) lands next: 05_embedding, 06_vector_add, 08_gateway_contracts, 09_failure_modes. Each sourcing the same lib helpers and writing to the same report shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 05:08:51 -05:00
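The grep -c pitfall described in commit a81291e38c is easy to reproduce: on zero matches grep -c prints "0" and exits 1, so appending `|| echo 0` yields two lines. The sketch below shows one way to get count-or-zero; count_matches is a hypothetical name and the harness's actual _count helper is not shown in the commit, so it may be written differently.

```shell
# Hypothetical sketch of a count-or-zero helper; not the harness's _count.

# count_matches PATTERN FILE: print the number of matching lines, always as a
# single integer.
count_matches() (
  n=$(grep -c -- "$1" "$2" 2>/dev/null)
  # grep -c already prints 0 when nothing matches (exit status 1), so its exit
  # status is ignored. n is empty only when grep could not read the file
  # (exit status 2); default that case to 0.
  echo "${n:-0}"
)

# The buggy form, for comparison:
#   n=$(grep -c "$pattern" "$file" || echo 0)   # yields "0" twice on zero matches
```

Because the output is always one integer, it can be passed to jq --argjson or used in a numeric test without further cleanup.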

6 Commits