Adds the integration tier: the full CSV→Parquet→SQL chain and the full
text→embed→vector→search pipeline. All 10 cases (4 contract + 6 integration)
run end-to-end deterministically; 8s wall total.
Cases added:
01_storage_roundtrip.sh
GOLAKE-010-012. PUT 1KiB → GET sha256-equal → LIST contains key
→ DELETE 200/204 → GET 404. Deterministic key under
proof/<case_id>/ so concurrent runs don't collide.
02_catalog_manifest.sh
GOLAKE-020-022. Fresh register existing=false → manifest read
matches → list contains dataset_id → idempotent re-register
existing=true with stable dataset_id → schema-drift register
409 (the ADR-020 contract). Per-run unique name via
PROOF_RUN_ID so existing=false is meaningful.
03_ingest_csv_to_parquet.sh
GOLAKE-030. workers.csv (5 rows) via /v1/ingest multipart →
parquet object on storaged → catalog manifest with row_count=5.
Verifies content-addressed key shape (datasets/<n>/<fp>.parquet).
04_query_correctness.sh
GOLAKE-040. The 5 SQL assertions from fixtures/expected/queries.json
against the workers fixture: count=5, Chicago=2, max=95,
safety→Barbara, Houston avg=89.5. Iterates the YAML claims, runs
each query, compares response columns to expected values.
06_vector_add_search.sh (integration extension)
GOLAKE-051. text → /v1/embed (4 docs from fixtures/text/docs.txt)
→ vectord add → search by query embedding. Top-1 ID per query
asserted against fixtures/expected/rankings.json. First run (or
--regenerate-rankings) writes the fixture and emits a skip with
explicit reason; subsequent runs assert against it.
07_vector_persistence_restart.sh
GOLAKE-070. add 4 unit-basis vectors → search → record top-1
distance → SIGTERM vectord → restart with the same --config →
poll /health for 8s → search again → top-1 ID and distance match
bit-identically. Skips with reason if vectord PID can't be found
or post-restart bind times out.
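The restart step in 07 hinges on a bounded health poll. A minimal sketch of that loop, with the /health probe stubbed out (function and variable names here are illustrative, not the harness's actual API):

```shell
#!/usr/bin/env bash
# Sketch of the bounded health poll used after restarting a service: retry a
# probe until it succeeds or the deadline passes, then let the caller turn a
# timeout into a skip-with-reason. The probe is a stub; the real case curls
# vectord's /health endpoint.
set -u

poll_until_healthy() {   # poll_until_healthy <timeout_s> <probe-cmd...>
  local timeout_s="$1"; shift
  local deadline=$(( $(date +%s) + timeout_s ))
  while [ "$(date +%s)" -le "$deadline" ]; do
    if "$@"; then return 0; fi
    sleep 0.2
  done
  return 1
}

# Stub probe: reports healthy on the third attempt.
attempts=0
fake_health() { attempts=$((attempts + 1)); [ "$attempts" -ge 3 ]; }

poll_until_healthy 8 fake_health && echo "healthy after ${attempts} attempts"
```

In the real case the probe would be a curl invocation against vectord's health URL, and a timeout return maps to the skip-with-reason path described above.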
Several harness improvements landed alongside:
run_proof.sh writes a temp lakehouse_proof.toml with
refresh_every="500ms" override and passes --config to all booted
binaries. Production default is 30s; 04_query_correctness needs
queryd to pick up the new view within a tick. Production config
unchanged.
cleanup() now pgreps for any orphan bin/<svc> processes (anchored
to start-of-argv per memory feedback_pkill_scope.md) so a case
that restarts a service mid-run still gets cleaned up.
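The value of the anchoring can be checked without spawning processes by testing the pattern against sample argv strings (grep -E stands in for pgrep -f, which matches the same kind of extended regex against the full command line; the service name is illustrative):

```shell
#!/usr/bin/env bash
# Why the cleanup pattern is anchored to start-of-argv: an unanchored
# "bin/storaged" also matches any process that merely mentions the path in
# its arguments (an editor, a pager, a grep). The anchored form matches only
# command lines that start with the binary path.
set -euo pipefail

anchored='^bin/storaged( |$)'
matches() { printf '%s' "$1" | grep -qE "$2"; }

matches "bin/storaged --config /tmp/proof.toml" "$anchored" && echo "match: the service itself"
matches "vim bin/storaged"                      "$anchored" || echo "no match: editor argv"
matches "bin/storaged-helper"                   "$anchored" || echo "no match: similar name"
```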
lib/http.sh adds proof_call(case_id, probe, method, url, args...)
— escape hatch for cases that need raw curl args (multipart -F,
custom headers). Used by 03_ingest for the multipart upload that
conflicts with proof_post's --data + Content-Type defaults.
lib/env.sh exports PROOF_RUN_ID — short unique id derived from the
report directory timestamp. Used by 02 and 07 for fresh-each-run
state isolation.
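One plausible shape for that derivation (the real one lives in lib/env.sh; this sketch assumes the report directory name ends in a timestamp, which is not confirmed by this change):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of deriving a short per-run id from a timestamped report
# directory name. The only properties cases 02 and 07 need are: unique per run,
# stable within a run, safe to embed in dataset names and storage keys.
set -euo pipefail

derive_run_id() {   # derive_run_id <report_dir>
  local ts="${1##*/}"                           # last path component
  printf '%s' "$ts" | tr -cd '0-9' | tail -c 8  # keep digits, take the last 8
}

PROOF_RUN_ID="$(derive_run_id "reports/2024-06-01T12-30-45")"
echo "dataset name for case 02: workers_${PROOF_RUN_ID}"
```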
Two real findings recorded as evidence (no code changes):
- rankings.json fixture pinned: 4 queries → 4 distinct top-1 docs
via nomic-embed-text. A model swap that changes ranking now
fails the harness loudly; --regenerate-rankings is the override.
- vectord persistence kill+restart preserves top-1 distance
bit-identically — the LHV1 single-Put framed format from
G1P round-trips exactly through Save/Load.
Verified end-to-end:
just proof contract — 53 pass (4 cases)
just proof integration — 104 pass (10 cases) · 8s wall
just verify — 9 smokes still green · 33s wall
Phase D (performance baseline) lands next: 10_perf_baseline measures
rows/sec ingest, vectors/sec add, p50/p95 query+search latency, RSS,
CPU. First run writes tests/proof/baseline.json; later runs diff
against it.
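For the p50/p95 numbers, one workable approach is the nearest-rank percentile over the latency_ms values the harness already records per probe. A sketch with fake samples (the helper name and the nearest-rank method are assumptions, not necessarily what 10_perf_baseline will use):

```shell
#!/usr/bin/env bash
# Nearest-rank percentile over one-number-per-line samples: sort, then take
# the value at rank ceil(p/100 * N). Samples here are fake latencies in ms.
set -euo pipefail

percentile() {   # percentile <p> <file-with-one-number-per-line>
  local p="$1" file="$2"
  sort -n "$file" | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) { print "NaN"; exit 1 }
      r = int((p / 100) * NR + 0.999999)   # ceil(p/100 * N)
      if (r < 1) r = 1
      print v[r]
    }'
}

samples="$(mktemp)"
printf '%s\n' 12 15 11 90 14 13 200 16 12 15 > "$samples"
echo "p50=$(percentile 50 "$samples")ms p95=$(percentile 95 "$samples")ms"
```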
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
126 lines
4.4 KiB
Bash
#!/usr/bin/env bash
# lib/http.sh — curl wrappers that capture latency, status, body.
#
# Each request emits a JSON file under raw/http/<case_id>/<probe>.json
# describing the round-trip. Cases consume the JSON via assert.sh.
#
# Why JSON files instead of bash variables: gives the final report a
# diffable, replayable record. Future runs can compare on disk without
# re-executing the case.
#
# Functions:
#   proof_get    <case_id> <probe_name> <url> [extra-curl-args...]
#   proof_post   <case_id> <probe_name> <url> <content-type> <body> [extra-curl-args...]
#   proof_put    <case_id> <probe_name> <url> <content-type> <body|@file> [extra-curl-args...]
#   proof_delete <case_id> <probe_name> <url> [extra-curl-args...]
#
# Returns 0 always (capture is independent of HTTP outcome).
# Stores result at: $PROOF_REPORT_DIR/raw/http/<case_id>/<probe>.json
# Stores body at:   $PROOF_REPORT_DIR/raw/http/<case_id>/<probe>.body

_proof_http_emit() {
  local case_id="$1" probe="$2" method="$3" url="$4" status="$5" latency_ms="$6" body_path="$7" headers_path="$8"
  local dir="${PROOF_REPORT_DIR}/raw/http/${case_id}"
  mkdir -p "$dir"
  local body_sha=""
  if [ -s "$body_path" ]; then
    body_sha="$(sha256sum "$body_path" | awk '{print $1}')"
  fi
  cat > "${dir}/${probe}.json" <<JSON
{
  "case_id": "${case_id}",
  "probe": "${probe}",
  "method": "${method}",
  "url": "${url}",
  "status": ${status},
  "latency_ms": ${latency_ms},
  "body_path": "raw/http/${case_id}/${probe}.body",
  "body_sha256": "${body_sha}",
  "headers_path": "raw/http/${case_id}/${probe}.headers"
}
JSON
}

# Internal common runner — populates a temp body+headers file, times
# the request, emits the per-probe JSON, prints the body to stdout for
# convenience (cases can capture or discard).
_proof_http_run() {
  local case_id="$1" probe="$2" method="$3" url="$4"; shift 4
  local dir="${PROOF_REPORT_DIR}/raw/http/${case_id}"
  mkdir -p "$dir"
  local body_path="${dir}/${probe}.body"
  local headers_path="${dir}/${probe}.headers"
  local start_ms end_ms
  start_ms=$(date +%s%3N)
  local status
  # curl prints "000" for %{http_code} on a transport failure and exits
  # non-zero; normalize that to 0 so the emitted "status" stays a valid JSON
  # number. (A `|| echo 0` fallback would append a second line on failure,
  # since curl has already printed "000" to stdout.)
  status=$(curl -sS -X "$method" -o "$body_path" -D "$headers_path" -w "%{http_code}" "$@" "$url" 2>/dev/null) || true
  if [ -z "$status" ] || [ "$status" = "000" ]; then
    status=0
  fi
  end_ms=$(date +%s%3N)
  local latency_ms=$((end_ms - start_ms))
  _proof_http_emit "$case_id" "$probe" "$method" "$url" "$status" "$latency_ms" "$body_path" "$headers_path"
  cat "$body_path" 2>/dev/null || true   # keep the documented always-0 return
}

proof_get() {
  local case_id="$1" probe="$2" url="$3"; shift 3
  _proof_http_run "$case_id" "$probe" GET "$url" "$@"
}

proof_post() {
  local case_id="$1" probe="$2" url="$3" content_type="$4" body="$5"; shift 5
  _proof_http_run "$case_id" "$probe" POST "$url" \
    -H "Content-Type: ${content_type}" \
    --data "$body" \
    "$@"
}

# proof_put accepts either an inline body or an @-prefixed file path
# (curl --upload-file semantics for streaming).
proof_put() {
  local case_id="$1" probe="$2" url="$3" content_type="$4" body="$5"; shift 5
  if [[ "$body" == @* ]]; then
    local file="${body#@}"
    _proof_http_run "$case_id" "$probe" PUT "$url" \
      -H "Content-Type: ${content_type}" \
      --upload-file "$file" \
      "$@"
  else
    _proof_http_run "$case_id" "$probe" PUT "$url" \
      -H "Content-Type: ${content_type}" \
      --data "$body" \
      "$@"
  fi
}

proof_delete() {
  local case_id="$1" probe="$2" url="$3"; shift 3
  _proof_http_run "$case_id" "$probe" DELETE "$url" "$@"
}

# proof_call: escape hatch for cases that need full control of curl
# args — multipart uploads (-F), custom headers, --form-string, etc.
# proof_post / proof_put add a Content-Type header and --data body
# that conflict with -F multipart, so use this for those cases.
#
#   proof_call <case_id> <probe> <method> <url> [curl-args...]
#
# Example multipart POST (note: don't name the file variable PATH,
# which would shadow the shell's search path):
#   proof_call "$CASE_ID" "ingest" POST "$URL" -F "file=@${CSV_FILE}"
proof_call() {
  local case_id="$1" probe="$2" method="$3" url="$4"; shift 4
  _proof_http_run "$case_id" "$probe" "$method" "$url" "$@"
}

# Helper accessors — read the per-probe JSON.
proof_status_of() {
  local case_id="$1" probe="$2"
  jq -r '.status' "${PROOF_REPORT_DIR}/raw/http/${case_id}/${probe}.json"
}

proof_body_of() {
  local case_id="$1" probe="$2"
  cat "${PROOF_REPORT_DIR}/raw/http/${case_id}/${probe}.body"
}

proof_latency_of() {
  local case_id="$1" probe="$2"
  jq -r '.latency_ms' "${PROOF_REPORT_DIR}/raw/http/${case_id}/${probe}.json"
}