root 1313eb2173 proof harness Phase C: 6 integration cases · 104/0/1 green
Adds the integration tier — the full CSV→Parquet→SQL chain and the full
text→embed→vector→search chain. All 10 cases (4 contract + 6 integration)
run end-to-end deterministically; 8s wall total.

Cases added:
  01_storage_roundtrip.sh
    GOLAKE-010-012. PUT 1KiB → GET sha256-equal → LIST contains key
    → DELETE 200/204 → GET 404. Deterministic key under
    proof/<case_id>/ so concurrent runs don't collide.
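
    A rough sketch of that round-trip in raw curl terms (the case itself goes
    through lib/http.sh; STORAGED_URL, the /v1/objects route and the list
    query param are assumptions here, not the recorded API):

      key="proof/01_storage_roundtrip/payload.bin"
      printf '%01024d' 0 > /tmp/payload.bin                    # deterministic 1 KiB, no unseeded randomness
      curl -sf -X PUT --data-binary @/tmp/payload.bin "$STORAGED_URL/v1/objects/$key"
      curl -sf "$STORAGED_URL/v1/objects/$key" -o /tmp/back.bin
      [ "$(sha256sum < /tmp/payload.bin)" = "$(sha256sum < /tmp/back.bin)" ]    # byte-identical
      curl -sf "$STORAGED_URL/v1/objects?prefix=proof/01_storage_roundtrip/" | grep -q "$key"
      curl -sf -X DELETE "$STORAGED_URL/v1/objects/$key"
      curl -s -o /dev/null -w '%{http_code}' "$STORAGED_URL/v1/objects/$key" | grep -qx 404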

  02_catalog_manifest.sh
    GOLAKE-020-022. Fresh register existing=false → manifest read
    matches → list contains dataset_id → idempotent re-register
    existing=true with stable dataset_id → schema-drift register
    409 (the ADR-020 contract). Per-run unique name via
    PROOF_RUN_ID so existing=false is meaningful.
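
    A sketch of that contract in curl/jq terms (CATALOGD_URL, the /v1/datasets
    route and the body shape are assumptions; the existing flag, stable
    dataset_id and the 409 are the claims):

      name="proof_catalog_${PROOF_RUN_ID}"
      body='{"name":"'$name'","schema":{"fields":[{"name":"id","type":"int64"}]}}'
      curl -s -H 'Content-Type: application/json' -d "$body" "$CATALOGD_URL/v1/datasets" \
        | jq -e '.existing == false'                           # fresh register
      curl -s -H 'Content-Type: application/json' -d "$body" "$CATALOGD_URL/v1/datasets" \
        | jq -e '.existing == true'                            # idempotent re-register, same dataset_id
      drift='{"name":"'$name'","schema":{"fields":[{"name":"id","type":"string"}]}}'
      curl -s -o /dev/null -w '%{http_code}' -H 'Content-Type: application/json' \
        -d "$drift" "$CATALOGD_URL/v1/datasets" | grep -qx 409  # ADR-020: schema drift rejected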

  03_ingest_csv_to_parquet.sh
    GOLAKE-030. workers.csv (5 rows) via /v1/ingest multipart →
    parquet object on storaged → catalog manifest with row_count=5.
    Verifies content-addressed key shape (datasets/<n>/<fp>.parquet).
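
    Roughly what the case drives (the ingest service URL, multipart field name
    and response fields are assumptions; row_count=5 and the key shape are the
    claims):

      curl -s -F "file=@tests/proof/fixtures/csv/workers.csv" \
           "$INGESTD_URL/v1/ingest" > /tmp/ingest.json
      jq -e '.row_count == 5' /tmp/ingest.json
      jq -r '.object_key' /tmp/ingest.json | grep -Eq '^datasets/[^/]+/[^/]+\.parquet$'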

  04_query_correctness.sh
    GOLAKE-040. The 5 SQL assertions from fixtures/expected/queries.json
    against the workers fixture: count=5, Chicago=2, max=95,
    safety→Barbara, Houston avg=89.5. Iterates the YAML claims, runs
    each query, compares response columns to expected values.
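
    One way that comparison loop can look (the queries.json shape, QUERYD_URL
    and the /v1/query route are assumptions):

      jq -c '.[]' tests/proof/fixtures/expected/queries.json | while read -r q; do
        sql=$(jq -r '.sql' <<<"$q")
        want=$(jq -c '.expected' <<<"$q")
        got=$(curl -s -H 'Content-Type: application/json' \
                -d "$(jq -n --arg sql "$sql" '{sql: $sql}')" "$QUERYD_URL/v1/query" | jq -c '.rows')
        [ "$got" = "$want" ] || echo "mismatch: $sql"
      done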

  06_vector_add_search.sh (integration extension)
    GOLAKE-051. text → /v1/embed (4 docs from fixtures/text/docs.txt)
    → vectord add → search by query embedding. Top-1 ID per query
    asserted against fixtures/expected/rankings.json. First run (or
    --regenerate-rankings) writes the fixture and emits a skip with
    explicit reason; subsequent runs assert against it.
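
    Pin-or-assert per query looks roughly like this ($search_body is the query
    embedding built earlier in the case; the skip helper, the regenerate flag
    plumbing and the rankings.json shape are assumptions):

      rankings=tests/proof/fixtures/expected/rankings.json
      top1=$(curl -s -H 'Content-Type: application/json' -d "$search_body" \
               "$VECTORD_URL/v1/search" | jq -r '.results[0].id')
      if [ ! -f "$rankings" ] || [ "${REGENERATE_RANKINGS:-0}" = "1" ]; then
        jq -n --arg id "$top1" '{top1: $id}' > "$rankings"     # first run: write the fixture, emit a skip
      else
        [ "$top1" = "$(jq -r '.top1' "$rankings")" ]           # later runs: a ranking change fails loudly
      fi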

  07_vector_persistence_restart.sh
    GOLAKE-070. add 4 unit-basis vectors → search → record top-1
    distance → SIGTERM vectord → restart with the same --config →
    poll /health for 8s → search again → top-1 ID and distance match
    bit-identically. Skips with reason if vectord PID can't be found
    or post-restart bind times out.
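
    The restart choreography, roughly (VECTORD_URL, the search route/body, the
    skip helper and PROOF_CONFIG are assumptions; the pgrep is anchored to
    start-of-argv, matching the cleanup() convention noted below):

      before=$(curl -s -H 'Content-Type: application/json' -d "$search_body" \
                 "$VECTORD_URL/v1/search" | jq -c '.results[0]')
      pid=$(pgrep -o -f '^bin/vectord') || skip "vectord PID not found"
      kill -TERM "$pid"
      bin/vectord --config "$PROOF_CONFIG" &
      for _ in $(seq 1 16); do curl -sf "$VECTORD_URL/health" >/dev/null && break; sleep 0.5; done  # ≤8s
      after=$(curl -s -H 'Content-Type: application/json' -d "$search_body" \
                "$VECTORD_URL/v1/search" | jq -c '.results[0]')
      [ "$before" = "$after" ]                                 # top-1 id and distance bit-identical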

Four harness improvements landed alongside:

  run_proof.sh writes a temp lakehouse_proof.toml with
  refresh_every="500ms" override and passes --config to all booted
  binaries. Production default is 30s; 04_query_correctness needs
  queryd to pick up the new view within a tick. Production config
  unchanged.
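
  The generated override is tiny; roughly (where the key sits inside the TOML
  is an assumption beyond refresh_every itself):

    printf 'refresh_every = "500ms"\n' > "$tmp/lakehouse_proof.toml"   # production default: 30s
    bin/queryd --config "$tmp/lakehouse_proof.toml" &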

  cleanup() now pgreps for any orphan bin/<svc> processes (anchored
  to start-of-argv per memory feedback_pkill_scope.md) so a case
  that restarts a service mid-run still gets cleaned up.
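
  For example (service names illustrative; the real cleanup() iterates
  whatever it booted):

    pkill -TERM -f '^bin/storaged' 2>/dev/null || true   # anchored: never matches an unrelated
    pkill -TERM -f '^bin/vectord'  2>/dev/null || true   # process that mentions bin/<svc> in its argv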

  lib/http.sh adds proof_call(case_id, probe, method, url, args...)
  — escape hatch for cases that need raw curl args (multipart -F,
  custom headers). Used by 03_ingest for the multipart upload that
  conflicts with proof_post's --data + Content-Type defaults.
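
  Typical call from 03_ingest (fixture path per the README layout; the form
  field name and ingest URL variable are assumptions):

    proof_call 03_ingest_csv_to_parquet ingest_multipart POST "$INGESTD_URL/v1/ingest" \
      -F "file=@tests/proof/fixtures/csv/workers.csv"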

  lib/env.sh exports PROOF_RUN_ID — short unique id derived from the
  report directory timestamp. Used by 02 and 07 for fresh-each-run
  state isolation.
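
  One plausible derivation, assuming a PROOF_REPORT_DIR variable that points at
  the per-run report directory (the actual lib/env.sh logic may differ):

    PROOF_RUN_ID="${PROOF_REPORT_DIR##*proof-}"   # e.g. 20260429-052600Z, stable for the whole run
    export PROOF_RUN_ID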

Two real findings recorded as evidence (no code changes):
  - rankings.json fixture pinned: 4 queries → 4 distinct top-1 docs
    via nomic-embed-text. A model swap that changes ranking now
    fails the harness loudly; --regenerate-rankings is the override.
  - vectord persistence kill+restart preserves top-1 distance
    bit-identically — the LHV1 single-Put framed format from
    G1P round-trips exactly through Save/Load.

Verified end-to-end:
  just proof contract       — 53 pass (4 cases)
  just proof integration    — 104 pass (10 cases) · 8s wall
  just verify               — 9 smokes still green · 33s wall

Phase D (performance baseline) lands next: 10_perf_baseline measures
rows/sec ingest, vectors/sec add, p50/p95 query+search latency, RSS,
CPU. First run writes tests/proof/baseline.json; later runs diff
against it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 05:26:00 -05:00

tests/proof — claims-verification harness

Per docs/TEST_PROOF_SCOPE.md. The 9 smokes prove that the system runs; this harness proves that the system does what it claims to do.

Why this exists

Smokes verify that services boot, talk, and pass deterministic round-trips. They do not verify:

  • contract drift (a route silently changes its response shape)
  • semantic correctness (the SQL query says what we claim it says)
  • failure-mode discipline (a malformed request returns 4xx, not silent 200)
  • performance regressions (vectors/sec drops 30% on a refactor)

The proof harness produces evidence, not just pass/fail. Each case writes input/output hashes, latencies, status codes, log paths, git SHA → a future auditor can re-run + diff.
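
A single assertion line might look roughly like this (field names are illustrative, not the harness's exact schema):

  {"case_id":"01_storage_roundtrip","assertion":"get_sha256_matches_put","result":"pass","http_status":200,"latency_ms":3.1,"git_sha":"1313eb2173"}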

Layout

tests/proof/
  README.md            ← you are here
  claims.yaml          ← enumeration of every claim, with id + type + routes
  run_proof.sh         ← orchestrator (--mode contract|integration|performance)
  lib/
    env.sh             ← service URLs, report dir, mode, git context
    http.sh            ← curl wrappers (latency + status + body capture)
    assert.sh          ← structured assertions writing JSONL evidence
    metrics.sh         ← rss/cpu/timing capture for performance mode
  cases/
    00_health.sh
    01_storage_roundtrip.sh
    …
    10_perf_baseline.sh
  fixtures/
    csv/workers.csv         ← canonical 5-row fixture (sha-pinned)
    text/docs.txt           ← 4 deterministic vector docs
    expected/queries.json   ← expected results for the 5 SQL assertions
    expected/rankings.json  ← stored top-K rankings for vector search
  reports/
    proof-YYYYMMDD-HHMMSSZ/   ← per-run; gitignored
      summary.md
      summary.json
      raw/
        context.json    ← git_sha, hostname, timestamp, mode
        cases/<id>.jsonl  ← one JSONL line per assertion
        http/<id>/*.{json,body,headers}
        logs/<svc>.log  ← captured stdout+stderr from booted services
        metrics/<id>.jsonl

Modes

just proof contract       # APIs, schemas, status codes; no big data; ~30s
just proof integration    # full chain CSV→storaged→…→queryd, text→embedd→vectord
just proof performance    # measurements only; runs after contract+integration

The just recipes wrap tests/proof/run_proof.sh with --mode <X>. Use the script directly for advanced flags (--no-bootstrap, --regenerate-rankings, --regenerate-baseline).
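
For example (the flag combinations shown are illustrative):

  tests/proof/run_proof.sh --mode integration --no-bootstrap
  tests/proof/run_proof.sh --mode integration --regenerate-rankings
  tests/proof/run_proof.sh --mode performance --regenerate-baseline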

Hard rules (from TEST_PROOF_SCOPE.md)

  • Don't claim performance without before/after metrics
  • Detect Ollama unavailability; mark embedding tests skipped or degraded with explanation
  • Skipped tests do not appear as passed
  • No silent ignore of missing services
  • No external cloud dependencies
  • No "HTTP 200" assertions unless the claim is health-only
  • No random data without a seed

How to read a report

After just proof integration:

  1. Open tests/proof/reports/proof-<ts>/summary.md for the human view.
  2. summary.json is the machine-readable counterpart.
  3. To investigate a single failed assertion:
    • find its case_id in summary.md
    • read raw/cases/<case_id>.jsonl (each line is one assertion)
    • cross-reference raw/http/<case_id>/<probe>.{json,body,headers} for the underlying HTTP round-trip

Every record cites the git SHA at run time; a clean re-run of the same SHA against the same fixtures must produce identical evidence (modulo timestamps + non-deterministic embedding noise).
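
A quick way to pull only the failing assertions out of a run (the result field name is an assumption about the JSONL schema):

  jq -c 'select(.result != "pass")' tests/proof/reports/proof-<ts>/raw/cases/*.jsonl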

Reading order for new contributors

  1. docs/TEST_PROOF_SCOPE.md — the spec this harness implements.
  2. docs/CLAUDE_REFACTOR_GUARDRAILS.md — process discipline this harness must obey when extended.
  3. tests/proof/claims.yaml — what's claimed.
  4. tests/proof/cases/00_health.sh — canonical case shape; copy-paste to add new cases.