
root 4840c10311 proof harness: fix queryd refresh-tick race in 04_query_correctness

Caught by the audit rerun: with cache-warm binaries, 04 fires its
first SELECT faster than queryd's 500ms refresh tick — Q1 returned
400 ("table not found") even though 03_ingest had registered the
manifest. Subsequent queries (after the next tick) succeeded.

This is an eventual-consistency wait, not a retry — queryd's
contract is that views appear within one tick of catalogd having the
manifest. Production code does not need changing.

Added to lib/http.sh:
  proof_wait_for_sql <budget_sec> <sql>
    polls a SQL probe until it returns 200 or budget elapses; emits
    no evidence (test setup, not a claim).

Used in 04_query_correctness:
  Wait up to 5s for queryd to have the view before running the 5
  SQL assertions. Skip-with-loud-reason if the view never appears.
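
A minimal sketch of what such a wait helper could look like, not the actual lib/http.sh code. Only the `proof_wait_for_sql <budget_sec> <sql>` signature and the poll-until-200 behavior come from the message above; the `QUERYD_URL` variable, the `/sql` route, the request body shape, the `sql_status` helper, and the `PROOF_POLL_INTERVAL` knob are assumptions for illustration.

```shell
# Hypothetical sketch of an eventual-consistency wait; not the real lib/http.sh.

# sql_status <sql>: print the HTTP status code of one SQL probe.
# QUERYD_URL and the /sql route are assumed names, not taken from the source.
sql_status() {
  curl -s -o /dev/null -w '%{http_code}' \
    -X POST --data-binary "$1" "${QUERYD_URL:-http://127.0.0.1:8080}/sql" || true
}

# proof_wait_for_sql <budget_sec> <sql>
# Polls until the probe returns 200 (exit 0) or the budget elapses (exit 1).
# Writes no evidence: this is test setup, not a claim.
proof_wait_for_sql() {
  local budget_sec sql deadline
  budget_sec=$1
  sql=$2
  deadline=$(( $(date +%s) + budget_sec ))
  while :; do
    [ "$(sql_status "$sql")" = "200" ] && return 0
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "${PROOF_POLL_INTERVAL:-0.1}"
  done
}
```

A case would call it as `proof_wait_for_sql 5 "<probe sql>"` and record a skip with an explicit reason when it returns non-zero, rather than running assertions against a view that never appeared.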

Verified: integration mode back to 104 pass / 0 fail / 1 skip after
fix. The skip is the unchanged GOLAKE-085 informational record.

This is exactly the kind of finding the harness was designed to
surface — the regression existed in the codebase the moment Phase D
shipped, but only fired when the next compare run hit cache-warm
timing. Without the harness, it would have surfaced on a CI run
weeks from now and been hard to bisect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 05:36:28 -05:00

Name | Last commit | Date
cases | proof harness: fix queryd refresh-tick race in 04_query_correctness | 2026-04-29 05:36:28 -05:00
fixtures | proof harness Phase C: 6 integration cases · 104/0/1 green | 2026-04-29 05:26:00 -05:00
lib | proof harness: fix queryd refresh-tick race in 04_query_correctness | 2026-04-29 05:36:28 -05:00
reports | proof harness Phase A: scaffolding + canary case green | 2026-04-29 05:08:51 -05:00
baseline.json | proof harness Phase D: performance baseline · 1000-row ingest, p50/p95 | 2026-04-29 05:30:11 -05:00
claims.yaml | proof harness Phase A: scaffolding + canary case green | 2026-04-29 05:08:51 -05:00
FINAL_REPORT.md | proof harness Phase E: FINAL_REPORT.md answers the 9 mandated questions | 2026-04-29 05:32:56 -05:00
README.md | proof harness Phase A: scaffolding + canary case green | 2026-04-29 05:08:51 -05:00
run_proof.sh | proof harness Phase C: 6 integration cases · 104/0/1 green | 2026-04-29 05:26:00 -05:00

README.md

tests/proof — claims-verification harness

Per docs/TEST_PROOF_SCOPE.md. The 9 smokes prove that the system runs; this harness proves that the system delivers on the claims made about it.

Why this exists

Smokes verify that services boot, talk, and pass deterministic round-trips. They do not verify:

contract drift (a route silently changes its response shape)
semantic correctness (the SQL query returns what we claim it returns)
failure-mode discipline (a malformed request returns 4xx, not silent 200)
performance regressions (vectors/sec drops 30% on a refactor)

The proof harness produces evidence, not just a pass/fail verdict. Each case writes input/output hashes, latencies, status codes, log paths, and the git SHA, so that a future auditor can re-run the case and diff the results.
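
As an illustration of what one evidence record might contain, here is a minimal sketch. The function names and JSON field names are assumptions, not the schema that lib/assert.sh actually writes, and values are not JSON-escaped in this sketch.

```shell
# Hypothetical sketch of one evidence record; field names are assumptions,
# not the harness's actual schema.

# sha256_of <file>: print the hex SHA-256 digest of a file.
sha256_of() {
  sha256sum "$1" | cut -d' ' -f1
}

# emit_evidence <case_id> <input_file> <output_file> <status_code> <latency_ms> <git_sha>
# Prints a single JSON line describing one probe.
emit_evidence() {
  printf '{"case_id":"%s","input_sha256":"%s","output_sha256":"%s","status_code":%s,"latency_ms":%s,"git_sha":"%s"}\n' \
    "$1" "$(sha256_of "$2")" "$(sha256_of "$3")" "$4" "$5" "$6"
}
```

Hashing inputs and outputs, rather than storing only a verdict, is what lets two runs be compared after the fact.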

Layout

tests/proof/
  README.md            ← you are here
  claims.yaml          ← enumeration of every claim, with id + type + routes
  run_proof.sh         ← orchestrator (--mode contract|integration|performance)
  lib/
    env.sh             ← service URLs, report dir, mode, git context
    http.sh            ← curl wrappers (latency + status + body capture)
    assert.sh          ← structured assertions writing JSONL evidence
    metrics.sh         ← rss/cpu/timing capture for performance mode
  cases/
    00_health.sh
    01_storage_roundtrip.sh
    …
    10_perf_baseline.sh
  fixtures/
    csv/workers.csv         ← canonical 5-row fixture (sha-pinned)
    text/docs.txt           ← 4 deterministic vector docs
    expected/queries.json   ← expected results for the 5 SQL assertions
    expected/rankings.json  ← stored top-K rankings for vector search
  reports/
    proof-YYYYMMDD-HHMMSSZ/   ← per-run; gitignored
      summary.md
      summary.json
      raw/
        context.json    ← git_sha, hostname, timestamp, mode
        cases/<id>.jsonl  ← one JSONL line per assertion
        http/<id>/*.{json,body,headers}
        logs/<svc>.log  ← captured stdout+stderr from booted services
        metrics/<id>.jsonl
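
A sketch of how a per-run report directory and its context.json could be created, under stated assumptions. The directory name pattern and the four context fields come from the layout above; the `init_report_dir` name and the exact JSON key spellings are hypothetical.

```shell
# Hypothetical sketch: create a per-run report directory and its context.json.

# init_report_dir <reports_root> <mode>: create the tree, print the run directory.
init_report_dir() {
  local root mode ts run_dir git_sha
  root=$1
  mode=$2
  ts=$(date -u +%Y%m%d-%H%M%SZ)
  run_dir="$root/proof-$ts"
  mkdir -p "$run_dir/raw/cases" "$run_dir/raw/http" \
           "$run_dir/raw/logs" "$run_dir/raw/metrics"
  # Fall back to "unknown" when not inside a git checkout.
  git_sha=$(git rev-parse --verify HEAD 2>/dev/null) || git_sha=unknown
  printf '{"git_sha":"%s","hostname":"%s","timestamp":"%s","mode":"%s"}\n' \
    "$git_sha" "$(uname -n)" "$ts" "$mode" > "$run_dir/raw/context.json"
  echo "$run_dir"
}
```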

Modes

just proof contract       # APIs, schemas, status codes; no big data; ~30s
just proof integration    # full chain CSV→storaged→…→queryd, text→embedd→vectord
just proof performance    # measurements only; runs after contract+integration

The just recipes wrap tests/proof/run_proof.sh with --mode <X>. Use the script directly for advanced flags (--no-bootstrap, --regenerate-rankings, --regenerate-baseline).
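
The mode handling could be sketched as follows. This is an illustrative parser, not the one in run_proof.sh: the `parse_mode` name and the exit code are assumptions, and the other flags are simply skipped here.

```shell
# Hypothetical sketch of --mode validation; not the real run_proof.sh parser.

# parse_mode "$@": print the validated mode, or return 2 on a bad or missing mode.
parse_mode() {
  local mode
  mode=""
  while [ $# -gt 0 ]; do
    case "$1" in
      --mode)
        [ $# -ge 2 ] || { echo "--mode needs a value" >&2; return 2; }
        mode=$2
        shift 2
        ;;
      *)
        shift   # other flags are ignored in this sketch
        ;;
    esac
  done
  case "$mode" in
    contract|integration|performance) echo "$mode" ;;
    *) echo "unknown or missing mode: '$mode'" >&2; return 2 ;;
  esac
}
```

A direct invocation would then look like `tests/proof/run_proof.sh --mode integration --no-bootstrap`.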

Hard rules (from TEST_PROOF_SCOPE.md)

Don't claim performance without before/after metrics
Detect Ollama unavailability; mark embedding tests skipped or degraded with explanation
Skipped tests do not appear as passed
No silent ignore of missing services
No external cloud dependencies
No "HTTP 200" assertions unless the claim is health-only
No random data without a seed

How to read a report

After just proof integration:

Open tests/proof/reports/proof-<ts>/summary.md for the human view.
summary.json is the machine-readable counterpart.
To investigate a single failed assertion:
- find its case_id in summary.md
- read raw/cases/<case_id>.jsonl (each line is one assertion)
- cross-reference raw/http/<case_id>/<probe>.{json,body,headers} for the underlying HTTP round-trip
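
The lookup steps above could be scripted roughly like this. The sketch assumes each JSONL line is compact JSON carrying a `"status":"fail"` field on failure; that field name and both function names are hypothetical.

```shell
# Hypothetical sketch: list failing assertions and the HTTP artifacts for a case.

# failed_assertions <report_dir> <case_id>: print the failing JSONL lines, if any.
failed_assertions() {
  grep '"status":"fail"' "$1/raw/cases/$2.jsonl" || true
}

# http_artifacts <report_dir> <case_id>: list captured round-trip files for the case.
http_artifacts() {
  ls "$1/raw/http/$2" 2>/dev/null || true
}
```

If jq is installed, `jq -c 'select(.status == "fail")'` on the same file would be a more robust filter than grep, under the same assumption about the field name.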

Every record cites the git SHA at run time; a clean re-run of the same SHA against the same fixtures must produce identical evidence, apart from timestamps and non-deterministic embedding noise.
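
One way to compare two runs while ignoring timestamps is sketched below. It assumes timestamps live in a string field named `ts`, which is a hypothetical name, and it does not attempt to tolerate embedding noise.

```shell
# Hypothetical sketch: compare evidence from two runs, ignoring timestamps.

# normalize_evidence <jsonl_file>: mask the value of the assumed "ts" field.
normalize_evidence() {
  sed -E 's/"ts":"[^"]*"/"ts":"<masked>"/g' "$1"
}

# evidence_matches <a.jsonl> <b.jsonl>: exit 0 when equal after normalization.
evidence_matches() {
  local a b
  a=$(normalize_evidence "$1")
  b=$(normalize_evidence "$2")
  [ "$a" = "$b" ]
}
```

Masking the value, instead of deleting the field, keeps each line valid JSON so the normalized files can still be inspected with other tools.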

Reading order for new contributors

docs/TEST_PROOF_SCOPE.md — the spec this harness implements.
docs/CLAUDE_REFACTOR_GUARDRAILS.md — process discipline this harness must obey when extended.
tests/proof/claims.yaml — what's claimed.
tests/proof/cases/00_health.sh — canonical case shape; copy-paste to add new cases.