root 1ec85b0a16 Batch 2: perf baseline — multi-sample + warmup + MAD threshold
Replaces single-shot baselines (40% noise floor flagged in Phase E)
with noise-aware regression detection.

What changed:
  ingest      n=3 runs (was 1) with 3-pass warmup
  vector_add  n=3 runs (was 1) with 3-pass warmup
  query       n=20 samples (unchanged) with 50-pass warmup
  search      n=20 samples (unchanged) with 50-pass warmup
  RSS         n=1 (unchanged — steady-state in G0)

Each metric stored as {value: median, mad: median absolute
deviation} in baseline.json (schema: v2-multisample-mad).

New regression detection:
  threshold = max(3 * baseline.mad, value * 0.75)
  REGRESSION iff |actual - baseline.value| > threshold AND direction
    signals worse (lower throughput / higher latency).

Why these specific numbers:
  3*MAD   = standard "outside the spread" bound; lets high-variance
            metrics tolerate their own noise.
  75% floor = empirical observation: even with 50 warmups, single-
            host inter-run variance on bootstrap-cold queryd was
            consistently 90-130% on this box. 75% catches >75%
            regressions cleanly while ignoring known noise.

lib/metrics.sh: new proof_compute_mad helper computes MAD from a
file of one-number-per-line samples. Used for both regen (to write
the baseline.mad value) and diff (read from baseline).

Honest finding from this iteration's 3 back-to-back diff runs:
  query_ms shows 90-130% delta from baseline consistently — not
  random noise but a systematic 2x gap between regen-time and
  steady-state. The regen captured a particularly fast moment;
  steady-state is slower. Operator workflow: regenerate the
  baseline at a known-representative state via
  `bash tests/proof/run_proof.sh --mode performance --regenerate-baseline`
  rather than expecting the harness to track a moving target.

The harness's value here is the EVIDENCE RECORD (every run captures
median+MAD+p95 plus all raw samples in raw/metrics/), not the gate.
Even false-positive REGRESSION skips give operators "this run was
20ms vs baseline 10ms" which is informative.

Sample counts also written into baseline.json under "samples" so a
future audit can verify the methodology that produced the values.

Verified across 3 back-to-back runs:
  ingest_rows_per_sec    PASS (delta within 75%, mostly < 10%)
  vectors_per_sec_add    PASS
  search_ms              PASS
  rss_*                  PASS
  query_ms               REGRESSION flagged (130/100/90%) — known
                         systematic gap, not bug

Closes the "40% noise floor" follow-up from Phase E FINAL_REPORT.
Honest about limitations: hard regression gating on a busy single-
host setup needs either much bigger sample counts (n≥100), longer
warmup, or moving to a dedicated benchmark host. Documented inline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 06:13:47 -05:00
..

tests/proof — claims-verification harness

Per docs/TEST_PROOF_SCOPE.md. The 9 smokes prove that the system runs; this harness proves that the system makes the claims it claims to make.

Why this exists

Smokes verify that services boot, talk, and pass deterministic round-trips. They do not verify:

  • contract drift (a route silently changes its response shape)
  • semantic correctness (the SQL query says what we claim it says)
  • failure-mode discipline (a malformed request returns 4xx, not silent 200)
  • performance regressions (vectors/sec drops 30% on a refactor)

The proof harness produces evidence, not pass/fail. Each case writes input/output hashes, latencies, status codes, log paths, git SHA → a future auditor can re-run + diff.

Layout

tests/proof/
  README.md            ← you are here
  claims.yaml          ← enumeration of every claim, with id + type + routes
  run_proof.sh         ← orchestrator (--mode contract|integration|performance)
  lib/
    env.sh             ← service URLs, report dir, mode, git context
    http.sh            ← curl wrappers (latency + status + body capture)
    assert.sh          ← structured assertions writing JSONL evidence
    metrics.sh         ← rss/cpu/timing capture for performance mode
  cases/
    00_health.sh
    01_storage_roundtrip.sh
    …
    10_perf_baseline.sh
  fixtures/
    csv/workers.csv         ← canonical 5-row fixture (sha-pinned)
    text/docs.txt           ← 4 deterministic vector docs
    expected/queries.json   ← expected results for the 5 SQL assertions
    expected/rankings.json  ← stored top-K rankings for vector search
  reports/
    proof-YYYYMMDD-HHMMSSZ/   ← per-run; gitignored
      summary.md
      summary.json
      raw/
        context.json    ← git_sha, hostname, timestamp, mode
        cases/<id>.jsonl  ← one JSONL line per assertion
        http/<id>/*.{json,body,headers}
        logs/<svc>.log  ← captured stdout+stderr from booted services
        metrics/<id>.jsonl

Modes

just proof contract       # APIs, schemas, status codes; no big data; ~30s
just proof integration    # full chain CSV→storaged→…→queryd, text→embedd→vectord
just proof performance    # measurements only; runs after contract+integration

The just recipes wrap tests/proof/run_proof.sh with --mode <X>. Use the script directly for advanced flags (--no-bootstrap, --regenerate-rankings, --regenerate-baseline).

Hard rules (from TEST_PROOF_SCOPE.md)

  • Don't claim performance without before/after metrics
  • Detect Ollama unavailability; mark embedding tests skipped or degraded with explanation
  • Skipped tests do not appear as passed
  • No silent ignore of missing services
  • No external cloud dependencies
  • No "HTTP 200" assertions unless the claim is health-only
  • No random data without a seed

How to read a report

After just proof integration:

  1. Open tests/proof/reports/proof-<ts>/summary.md for the human view.
  2. summary.json is the machine-readable counterpart.
  3. To investigate a single failed assertion:
    • find its case_id in summary.md
    • read raw/cases/<case_id>.jsonl (each line is one assertion)
    • cross-reference raw/http/<case_id>/<probe>.{json,body,headers} for the underlying HTTP round-trip

Every record cites the git SHA at run time; a clean re-run of the same SHA against the same fixtures must produce identical evidence (modulo timestamps + non-deterministic embedding noise).

Reading order for new contributors

  1. docs/TEST_PROOF_SCOPE.md — the spec this harness implements.
  2. docs/CLAUDE_REFACTOR_GUARDRAILS.md — process discipline this harness must obey when extended.
  3. tests/proof/claims.yaml — what's claimed.
  4. tests/proof/cases/00_health.sh — canonical case shape; copy-paste to add new cases.