root 175ad59cb3 proof harness Phase D: performance baseline · 1000-row ingest, p50/p95
GOLAKE-100. First run writes tests/proof/baseline.json; subsequent
runs diff against it. A >10% regression emits a SKIP with REGRESSION
detail, not a fail: the perf claim is required:false in claims.yaml,
so the gate stays green while the human summary tells the regression
story honestly. Skips with a loud reason if any earlier case in the
run failed, per the spec: "performance only after
contract+integration pass."
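
A minimal sketch of the gate logic, assuming baseline.json is a flat
map of numeric metrics and jq is on the box (compare_metric and the
symmetric ±10% check are illustrative; per-metric direction-awareness
is elided):

  # Compare one current metric against the committed baseline;
  # drift beyond 10% becomes a SKIP with REGRESSION detail, not a FAIL.
  compare_metric() {
    local name="$1" current="$2" base delta
    base=$(jq -r --arg k "$name" '.[$k]' tests/proof/baseline.json)
    delta=$(awk -v b="$base" -v c="$current" \
      'BEGIN { printf "%.1f", (c - b) / b * 100 }')
    if awk -v d="$delta" 'BEGIN { exit (d < -10 || d > 10) ? 0 : 1 }'; then
      echo "SKIP $name REGRESSION base=$base current=$current delta=${delta}%"
    else
      echo "PASS $name base=$base current=$current delta=${delta}%"
    fi
  }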

Workload (deterministic, repeatable):
  ingest      1000-row CSV (5 roles × 5 cities × seeded scores) → /v1/ingest
  query       SELECT count(*) ×20 against the just-ingested dataset
  vector add  200 dim=4 vectors with formulaic content (no Ollama)
  search      ×20 against the perf index with a fixed query vector
  RSS         per-service post-workload sample via /proc/<pid>/status
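
The RSS probe can be as small as reading VmRSS out of
/proc/<pid>/status; a sketch (the pgrep lookup and the MiB rounding
are illustrative):

  # Post-workload RSS for one service, in MiB.
  rss_mb() {
    local pid
    pid=$(pgrep -o -x "$1") || return 1   # oldest exact-name match
    awk '/^VmRSS:/ { printf "%d\n", $2 / 1024 }' "/proc/$pid/status"
  }
  rss_mb queryd    # e.g. 69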

Recorded metrics:
  ingest_rows_per_sec, query_p50_ms, query_p95_ms,
  vectors_per_sec_add, search_p50_ms, search_p95_ms,
  rss_{storaged,catalogd,ingestd,queryd,vectord,embedd,gateway}_mb

baseline.json on this box (committed):
  25000 rows/sec ingest · 17ms p50 / 24ms p95 query
  6250 vectors/sec add  ·  8ms p50 / 20ms p95 search
  queryd 69 MiB · vectord 14 MiB · others 11-29 MiB
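
The exact file layout is an implementation detail; an illustrative
shape consistent with the metric names above (a flat map is assumed;
the committed file may also carry run context):

  {
    "ingest_rows_per_sec": 25000,
    "query_p50_ms": 17,
    "query_p95_ms": 24,
    "vectors_per_sec_add": 6250,
    "search_p50_ms": 8,
    "search_p95_ms": 20,
    "rss_queryd_mb": 69,
    "rss_vectord_mb": 14
  }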

Honest measurement-design finding from the very first compare run:
back-to-back runs surfaced -41% ingest and +29% query p50, which is
pure disk-cache and queryd-cold-start noise, not a code regression.
Single-sample baselines have a real noise floor of roughly 40%.
Recorded as REGRESSION skips so the human summary surfaces it.
Tightening the threshold or moving to multi-sample medians is a
Phase E recommendation.
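
A multi-sample median is cheap to bolt on; a sketch (the sample count
and the measure_query_p50_ms probe are hypothetical):

  # Median of N samples instead of trusting a single run.
  median_of() {
    local n="$1"; shift
    for _ in $(seq "$n"); do "$@"; done | sort -n | awk '
      { v[NR] = $1 }
      END {
        if (NR % 2) print v[(NR + 1) / 2]
        else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
      }'
  }
  median_of 5 measure_query_p50_ms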

Verified end-to-end:
  just proof contract       —  53 pass  · 1 skip · ~4s
  just proof integration    — 104 pass  · 1 skip · ~8s
  just proof performance    — 110 pass  · 3 skip · ~10s
  just verify               —  9 smokes still green · 29s

All 11 cases (4 contract + 6 integration + 1 performance) deterministic
end-to-end. Phase E (final report against the 9 mandated questions)
is the last piece.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 05:30:11 -05:00
..

tests/proof — claims-verification harness

Per docs/TEST_PROOF_SCOPE.md. The 9 smokes prove that the system runs; this harness proves that the system actually does what it claims to do.

Why this exists

Smokes verify that services boot, talk, and pass deterministic round-trips. They do not verify:

  • contract drift (a route silently changes its response shape)
  • semantic correctness (the SQL query says what we claim it says)
  • failure-mode discipline (a malformed request returns 4xx, not a silent 200; sketched below)
  • performance regressions (vectors/sec drops 30% on a refactor)
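
For instance, a failure-mode probe is a handful of lines (the
GATEWAY_URL variable and the garbled payload are illustrative, not
the harness's actual fixtures):

  # Malformed ingest must be rejected loudly, never swallowed as a 200.
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    -X POST "$GATEWAY_URL/v1/ingest" \
    -H 'Content-Type: text/csv' \
    --data-binary 'this is not csv at all')
  case "$status" in
    4??) echo "PASS malformed ingest rejected with $status" ;;
    *)   echo "FAIL expected 4xx, got $status" ;;
  esac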

The proof harness produces evidence, not bare pass/fail verdicts. Each case writes input/output hashes, latencies, status codes, log paths, and the git SHA → a future auditor can re-run + diff.
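
An assertion record might look like this (field names are
illustrative; assert.sh defines the real schema):

  {"case_id":"01_storage_roundtrip","assertion":"row_count_matches","ok":true,"status":200,"latency_ms":12,"git_sha":"175ad59cb3"}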

Layout

tests/proof/
  README.md            ← you are here
  claims.yaml          ← enumeration of every claim, with id + type + routes
  run_proof.sh         ← orchestrator (--mode contract|integration|performance)
  lib/
    env.sh             ← service URLs, report dir, mode, git context
    http.sh            ← curl wrappers (latency + status + body capture)
    assert.sh          ← structured assertions writing JSONL evidence
    metrics.sh         ← rss/cpu/timing capture for performance mode
  cases/
    00_health.sh
    01_storage_roundtrip.sh
    …
    10_perf_baseline.sh
  fixtures/
    csv/workers.csv         ← canonical 5-row fixture (sha-pinned)
    text/docs.txt           ← 4 deterministic vector docs
    expected/queries.json   ← expected results for the 5 SQL assertions
    expected/rankings.json  ← stored top-K rankings for vector search
  reports/
    proof-YYYYMMDD-HHMMSSZ/   ← per-run; gitignored
      summary.md
      summary.json
      raw/
        context.json    ← git_sha, hostname, timestamp, mode
        cases/<id>.jsonl  ← one JSONL line per assertion
        http/<id>/*.{json,body,headers}
        logs/<svc>.log  ← captured stdout+stderr from booted services
        metrics/<id>.jsonl

Modes

just proof contract       # APIs, schemas, status codes; no big data; ~30s
just proof integration    # full chain CSV→storaged→…→queryd, text→embedd→vectord
just proof performance    # measurements only; runs after contract+integration

The just recipes wrap tests/proof/run_proof.sh with --mode <X>. Use the script directly for advanced flags (--no-bootstrap, --regenerate-rankings, --regenerate-baseline).
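
For example, to re-measure and overwrite the committed baseline on a
new box:

  tests/proof/run_proof.sh --mode performance --regenerate-baseline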

Hard rules (from TEST_PROOF_SCOPE.md)

  • Don't claim performance without before/after metrics
  • Detect Ollama unavailability; mark embedding tests skipped or degraded, with an explanation (sketch after this list)
  • Skipped tests do not appear as passed
  • No silent ignore of missing services
  • No external cloud dependencies
  • No "HTTP 200" assertions unless the claim is health-only
  • No random data without a seed
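
The Ollama check is just a reachability probe before any embedding
case runs; a sketch (the OLLAMA_URL variable and the default
localhost:11434 endpoint are assumptions; see lib/env.sh):

  # Skip embedding cases loudly when Ollama is unreachable.
  url="${OLLAMA_URL:-http://localhost:11434}"
  if ! curl -sf --max-time 2 "$url" >/dev/null; then
    echo "SKIP embedding cases: Ollama unreachable at $url"
    exit 0
  fi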

How to read a report

After just proof integration:

  1. Open tests/proof/reports/proof-<ts>/summary.md for the human view.
  2. summary.json is the machine-readable counterpart.
  3. To investigate a single failed assertion:
    • find its case_id in summary.md
    • read raw/cases/<case_id>.jsonl (each line is one assertion; filter example below)
    • cross-reference raw/http/<case_id>/<probe>.{json,body,headers} for the underlying HTTP round-trip
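
A quick filter for the failures, assuming each record carries an ok
boolean as in the sample above:

  jq -c 'select(.ok == false)' reports/proof-<ts>/raw/cases/<case_id>.jsonl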

Every record cites the git SHA at run time; a clean re-run of the same SHA against the same fixtures must produce identical evidence (modulo timestamps + non-deterministic embedding noise).

Reading order for new contributors

  1. docs/TEST_PROOF_SCOPE.md — the spec this harness implements.
  2. docs/CLAUDE_REFACTOR_GUARDRAILS.md — process discipline this harness must obey when extended.
  3. tests/proof/claims.yaml — what's claimed.
  4. tests/proof/cases/00_health.sh — canonical case shape; copy-paste to add new cases (minimal sketch below).
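
A new case is one small script; a minimal sketch in the shape of
00_health.sh (http_get, assert_eq, and the /healthz route are
assumptions drawn from the lib/ descriptions above, not the real
helper names):

  #!/usr/bin/env bash
  # cases/NN_my_claim.sh: one claims.yaml id per case
  set -euo pipefail
  source "$(dirname "$0")/../lib/env.sh"     # service URLs, report dir
  source "$(dirname "$0")/../lib/http.sh"    # latency/status/body capture
  source "$(dirname "$0")/../lib/assert.sh"  # JSONL evidence writers

  http_get gateway /healthz                  # hypothetical wrapper
  assert_eq "gateway_health_status" "$HTTP_STATUS" 200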