golangLAKEHOUSE

profit/golangLAKEHOUSE

Fork 0

Commit Graph

Author	SHA1	Message	Date
root	1ec85b0a16	Batch 2: perf baseline — multi-sample + warmup + MAD threshold Replaces single-shot baselines (40% noise floor flagged in Phase E) with noise-aware regression detection. What changed: ingest n=3 runs (was 1) with 3-pass warmup vector_add n=3 runs (was 1) with 3-pass warmup query n=20 samples (unchanged) with 50-pass warmup search n=20 samples (unchanged) with 50-pass warmup RSS n=1 (unchanged — steady-state in G0) Each metric stored as {value: median, mad: median absolute deviation} in baseline.json (schema: v2-multisample-mad). New regression detection: threshold = max(3 * baseline.mad, value * 0.75) REGRESSION iff \|actual - baseline.value\| > threshold AND direction signals worse (lower throughput / higher latency). Why these specific numbers: 3MAD = standard "outside the spread" bound; lets high-variance metrics tolerate their own noise. 75% floor = empirical observation: even with 50 warmups, single- host inter-run variance on bootstrap-cold queryd was consistently 90-130% on this box. 75% catches >75% regressions cleanly while ignoring known noise. lib/metrics.sh: new proof_compute_mad helper computes MAD from a file of one-number-per-line samples. Used for both regen (to write the baseline.mad value) and diff (read from baseline). Honest finding from this iteration's 3 back-to-back diff runs: query_ms shows 90-130% delta from baseline consistently — not random noise but a systematic 2x gap between regen-time and steady-state. The regen captured a particularly fast moment; steady-state is slower. Operator workflow: regenerate the baseline at a known-representative state via `bash tests/proof/run_proof.sh --mode performance --regenerate-baseline` rather than expecting the harness to track a moving target. The harness's value here is the EVIDENCE RECORD (every run captures median+MAD+p95 plus all raw samples in raw/metrics/), not the gate. Even false-positive REGRESSION skips give operators "this run was 20ms vs baseline 10ms" which is informative. Sample counts also written into baseline.json under "samples" so a future audit can verify the methodology that produced the values. Verified across 3 back-to-back runs: ingest_rows_per_sec PASS (delta within 75%, mostly < 10%) vectors_per_sec_add PASS search_ms PASS rss_ PASS query_ms REGRESSION flagged (130/100/90%) — known systematic gap, not bug Closes the "40% noise floor" follow-up from Phase E FINAL_REPORT. Honest about limitations: hard regression gating on a busy single- host setup needs either much bigger sample counts (n≥100), longer warmup, or moving to a dedicated benchmark host. Documented inline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 06:13:47 -05:00
root	a81291e38c	proof harness Phase A: scaffolding + canary case green Per docs/TEST_PROOF_SCOPE.md, building the claims-verification tier above the smoke chain. This commit lays the scaffolding and proves the orchestrator end-to-end with one canary case (00_health). What landed: tests/proof/ README.md how to read a report, layout, modes claims.yaml 24 claims enumerated (GOLAKE-001..100) run_proof.sh orchestrator with --mode {contract\|integration\|performance} and --no-bootstrap / --regenerate-{rankings,baseline} lib/ env.sh service URLs, report dir, mode, git context http.sh curl wrappers writing per-probe JSON + body + headers assert.sh proof_assert_{eq,ne,contains,lt,gt,status,json_eq} + proof_skip — each emits one JSONL record per call metrics.sh start/stop timers, value capture, RSS sampling, percentile compute (for Phase D) cases/ 00_health.sh canary — gateway + 6 services /health → 200, body identifies service, latency < 500ms (21 assertions) fixtures/ csv/workers.csv spec's 5-row deterministic CSV text/docs.txt 4 deterministic vector docs expected/queries.json expected results for the 5 SQL assertions Wired into the task runner: just proof contract # canary only this commit just proof integration # Phase C just proof performance # Phase D .gitignore: /tests/proof/reports/* with !.gitkeep — same pattern as reports/scrum/_evidence/. Per-run output is a runtime artifact. Specs landed alongside (J's drops): docs/TEST_PROOF_SCOPE.md the harness contract this implements docs/CLAUDE_REFACTOR_GUARDRAILS.md process discipline this harness obeys Verified end-to-end (cached binaries): just proof contract wall < 2s, 21 pass / 0 fail / 0 skip just verify wall 31s, vet + test + 9 smokes still green Two bugs fixed during canary run, both in run_proof.sh aggregation: - grep -c exits 1 on zero matches; the `\|\| echo 0` form concatenated "0\n0" and broke jq --argjson + integer comparison. Fixed via a _count helper that captures count-or-zero cleanly. - per-case table iterated case scripts (filename-based) but cases write evidence under CASE_ID. Switched to JSONL-file iteration so multi-case scripts work and the mapping is faithful. Phase B (contract cases) lands next: 05_embedding, 06_vector_add, 08_gateway_contracts, 09_failure_modes. Each sourcing the same lib helpers and writing to the same report shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 05:08:51 -05:00

Author

SHA1

Message

Date

root

1ec85b0a16

Batch 2: perf baseline — multi-sample + warmup + MAD threshold

Replaces single-shot baselines (40% noise floor flagged in Phase E)
with noise-aware regression detection.

What changed:
  ingest      n=3 runs (was 1) with 3-pass warmup
  vector_add  n=3 runs (was 1) with 3-pass warmup
  query       n=20 samples (unchanged) with 50-pass warmup
  search      n=20 samples (unchanged) with 50-pass warmup
  RSS         n=1 (unchanged — steady-state in G0)

Each metric stored as {value: median, mad: median absolute
deviation} in baseline.json (schema: v2-multisample-mad).

New regression detection:
  threshold = max(3 * baseline.mad, value * 0.75)
  REGRESSION iff |actual - baseline.value| > threshold AND direction
    signals worse (lower throughput / higher latency).

Why these specific numbers:
  3*MAD   = standard "outside the spread" bound; lets high-variance
            metrics tolerate their own noise.
  75% floor = empirical observation: even with 50 warmups, single-
            host inter-run variance on bootstrap-cold queryd was
            consistently 90-130% on this box. 75% catches >75%
            regressions cleanly while ignoring known noise.

lib/metrics.sh: new proof_compute_mad helper computes MAD from a
file of one-number-per-line samples. Used for both regen (to write
the baseline.mad value) and diff (read from baseline).

Honest finding from this iteration's 3 back-to-back diff runs:
  query_ms shows 90-130% delta from baseline consistently — not
  random noise but a systematic 2x gap between regen-time and
  steady-state. The regen captured a particularly fast moment;
  steady-state is slower. Operator workflow: regenerate the
  baseline at a known-representative state via
  `bash tests/proof/run_proof.sh --mode performance --regenerate-baseline`
  rather than expecting the harness to track a moving target.

The harness's value here is the EVIDENCE RECORD (every run captures
median+MAD+p95 plus all raw samples in raw/metrics/), not the gate.
Even false-positive REGRESSION skips give operators "this run was
20ms vs baseline 10ms" which is informative.

Sample counts also written into baseline.json under "samples" so a
future audit can verify the methodology that produced the values.

Verified across 3 back-to-back runs:
  ingest_rows_per_sec    PASS (delta within 75%, mostly < 10%)
  vectors_per_sec_add    PASS
  search_ms              PASS
  rss_*                  PASS
  query_ms               REGRESSION flagged (130/100/90%) — known
                         systematic gap, not bug

Closes the "40% noise floor" follow-up from Phase E FINAL_REPORT.
Honest about limitations: hard regression gating on a busy single-
host setup needs either much bigger sample counts (n≥100), longer
warmup, or moving to a dedicated benchmark host. Documented inline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 06:13:47 -05:00

root

a81291e38c

proof harness Phase A: scaffolding + canary case green

Per docs/TEST_PROOF_SCOPE.md, building the claims-verification tier
above the smoke chain. This commit lays the scaffolding and proves
the orchestrator end-to-end with one canary case (00_health).

What landed:

  tests/proof/
    README.md             how to read a report, layout, modes
    claims.yaml           24 claims enumerated (GOLAKE-001..100)
    run_proof.sh          orchestrator with --mode {contract|integration|performance}
                          and --no-bootstrap / --regenerate-{rankings,baseline}
    lib/
      env.sh              service URLs, report dir, mode, git context
      http.sh             curl wrappers writing per-probe JSON + body + headers
      assert.sh           proof_assert_{eq,ne,contains,lt,gt,status,json_eq} +
                          proof_skip — each emits one JSONL record per call
      metrics.sh          start/stop timers, value capture, RSS sampling,
                          percentile compute (for Phase D)
    cases/
      00_health.sh        canary — gateway + 6 services /health → 200,
                          body identifies service, latency < 500ms (21 assertions)
    fixtures/
      csv/workers.csv     spec's 5-row deterministic CSV
      text/docs.txt       4 deterministic vector docs
      expected/queries.json  expected results for the 5 SQL assertions

Wired into the task runner:

  just proof contract       # canary only this commit
  just proof integration    # Phase C
  just proof performance    # Phase D

.gitignore: /tests/proof/reports/* with !.gitkeep — same pattern as
reports/scrum/_evidence/. Per-run output is a runtime artifact.

Specs landed alongside (J's drops):
  docs/TEST_PROOF_SCOPE.md           the harness contract this implements
  docs/CLAUDE_REFACTOR_GUARDRAILS.md process discipline this harness obeys

Verified end-to-end (cached binaries):
  just proof contract        wall < 2s, 21 pass / 0 fail / 0 skip
  just verify                wall 31s, vet + test + 9 smokes still green

Two bugs fixed during canary run, both in run_proof.sh aggregation:
- grep -c exits 1 on zero matches; the `|| echo 0` form concatenated
  "0\n0" and broke jq --argjson + integer comparison. Fixed via a
  _count helper that captures count-or-zero cleanly.
- per-case table iterated case scripts (filename-based) but cases
  write evidence under CASE_ID. Switched to JSONL-file iteration so
  multi-case scripts work and the mapping is faithful.

Phase B (contract cases) lands next: 05_embedding, 06_vector_add,
08_gateway_contracts, 09_failure_modes. Each sourcing the same lib
helpers and writing to the same report shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 05:08:51 -05:00

2 Commits