5 Commits

Author SHA1 Message Date
root
4bb6548cbc proof harness Phase E: FINAL_REPORT.md answers the 9 mandated questions
Per docs/TEST_PROOF_SCOPE.md, this is the closing deliverable for the
proof harness: a single document that names what's proven, what's
partially proven, what failed, what was skipped and why, what evidence
exists for each, what bottlenecks were measured, what contract drift
was found, what refactor risks remain, and what to fix first.

Per-run report dirs (tests/proof/reports/proof-<ts>/) keep their
existing summary.md + summary.json + raw/ structure — they are the
replayable evidence chain. FINAL_REPORT.md is the stable, repo-tracked
synthesis pointing at them.

Headline findings (no surprises — harness behaves as designed):
  - 24 claims encoded; 22 fully proven, 1 informational (GOLAKE-085
    duplicate vector ID, contract not yet specified), 0 failed.
  - 4 contract-drift findings recorded as canonical: vectord add
    body field is `items` not `vectors`, search response is `results`
    not `hits`, index info is `length` not `count`, status codes
    201/204 not 200. All caught during Phase B; all now pinned by the
    harness.
  - Performance baseline shows queryd as the largest RSS (69 MiB,
    DuckDB process); single-sample noise floor is ~40% — tightening
    to multi-sample medians is a documented Sprint follow-up.
  - HIGH-risk audit findings (R-001 queryd /sql, R-002/R-003 untested
    shared+storeclient) are NOT closed by the harness — it's a
    multiplier, not a replacement for unit tests + auth posture.

The proof harness is complete. 11 cases · 3 modes · 168 assertions
peak across all tiers · ~22s total wall (contract+integration+perf).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 05:32:56 -05:00
root
175ad59cb3 proof harness Phase D: performance baseline · 1000-row ingest, p50/p95
GOLAKE-100. First run writes tests/proof/baseline.json; subsequent
runs diff against it. >10% regression emits a SKIP with REGRESSION
detail (not a fail — perf claim is required:false in claims.yaml so
the gate stays green; the human summary tells the regression story
honestly). Skip-with-loud-reason if any earlier case in the run
failed, per spec "performance only after contract+integration pass."

Workload (deterministic, repeatable):
  ingest      1000-row CSV (5 roles × 5 cities × seeded scores) → /v1/ingest
  query       SELECT count(*) ×20 against the just-ingested dataset
  vector add  200 dim=4 vectors with formulaic content (no Ollama)
  search      ×20 against the perf index with a fixed query vector
  RSS         per-service post-workload sample via /proc/<pid>/status

Recorded metrics:
  ingest_rows_per_sec, query_p50_ms, query_p95_ms,
  vectors_per_sec_add, search_p50_ms, search_p95_ms,
  rss_{storaged,catalogd,ingestd,queryd,vectord,embedd,gateway}_mb

baseline.json on this box (committed):
  25000 rows/sec ingest · 17ms p50 / 24ms p95 query
  6250 vectors/sec add  ·  8ms p50 / 20ms p95 search
  queryd 69 MiB · vectord 14 MiB · others 11-29 MiB

Honest measurement-design finding from the very first compare run:
back-to-back runs surfaced -41% ingest and +29% query p50 — pure
disk-cache + queryd-cold-start noise. Single-sample baselines have
real noise floor ≈40%. Recorded as REGRESSION skips so the human
summary surfaces it, not a code regression. Tightening the threshold
or moving to multi-sample medians is a Phase E recommendation.

Verified end-to-end:
  just proof contract       —  53 pass  · 1 skip · ~4s
  just proof integration    — 104 pass  · 1 skip · ~8s
  just proof performance    — 110 pass  · 3 skip · ~10s
  just verify               —  9 smokes still green · 29s

All 11 cases (4 contract + 6 integration + 1 performance) deterministic
end-to-end. Phase E (final report against the 9 mandated questions)
is the last piece.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 05:30:11 -05:00
root
1313eb2173 proof harness Phase C: 6 integration cases · 104/0/1 green
Adds the integration tier — full chain CSV→Parquet→SQL and full
text→embed→vector→search. All 10 cases (4 contract + 6 integration)
end-to-end deterministic; 8s wall total.

Cases added:
  01_storage_roundtrip.sh
    GOLAKE-010-012. PUT 1KiB → GET sha256-equal → LIST contains key
    → DELETE 200/204 → GET 404. Deterministic key under
    proof/<case_id>/ so concurrent runs don't collide.

  02_catalog_manifest.sh
    GOLAKE-020-022. Fresh register existing=false → manifest read
    matches → list contains dataset_id → idempotent re-register
    existing=true with stable dataset_id → schema-drift register
    409 (the ADR-020 contract). Per-run unique name via
    PROOF_RUN_ID so existing=false is meaningful.

  03_ingest_csv_to_parquet.sh
    GOLAKE-030. workers.csv (5 rows) via /v1/ingest multipart →
    parquet object on storaged → catalog manifest with row_count=5.
    Verifies content-addressed key shape (datasets/<n>/<fp>.parquet).

  04_query_correctness.sh
    GOLAKE-040. The 5 SQL assertions from fixtures/expected/queries.json
    against the workers fixture: count=5, Chicago=2, max=95,
    safety→Barbara, Houston avg=89.5. Iterates the YAML claims, runs
    each query, compares response columns to expected values.

  06_vector_add_search.sh integration extension
    GOLAKE-051. text → /v1/embed (4 docs from fixtures/text/docs.txt)
    → vectord add → search by query embedding. Top-1 ID per query
    asserted against fixtures/expected/rankings.json. First run (or
    --regenerate-rankings) writes the fixture and emits a skip with
    explicit reason; subsequent runs assert against it.

  07_vector_persistence_restart.sh
    GOLAKE-070. add 4 unit-basis vectors → search → record top-1
    distance → SIGTERM vectord → restart with the same --config →
    poll /health for 8s → search again → top-1 ID and distance match
    bit-identically. Skips with reason if vectord PID can't be found
    or post-restart bind times out.

Two harness improvements landed alongside:

  run_proof.sh writes a temp lakehouse_proof.toml with
  refresh_every="500ms" override and passes --config to all booted
  binaries. Production default is 30s; 04_query_correctness needs
  queryd to pick up the new view within a tick. Production config
  unchanged.

  cleanup() now pgreps for any orphan bin/<svc> processes (anchored
  to start-of-argv per memory feedback_pkill_scope.md) so a case
  that restarts a service mid-run still gets cleaned up.

  lib/http.sh adds proof_call(case_id, probe, method, url, args...)
  — escape hatch for cases that need raw curl args (multipart -F,
  custom headers). Used by 03_ingest for the multipart upload that
  conflicts with proof_post's --data + Content-Type defaults.

  lib/env.sh exports PROOF_RUN_ID — short unique id derived from the
  report directory timestamp. Used by 02 and 07 for fresh-each-run
  state isolation.

Two real findings recorded as evidence (no code changes):
  - rankings.json fixture pinned: 4 queries → 4 distinct top-1 docs
    via nomic-embed-text. A model swap that changes ranking now
    fails the harness loudly; --regenerate-rankings is the override.
  - vectord persistence kill+restart preserves top-1 distance
    bit-identically — the LHV1 single-Put framed format from
    G1P round-trips exactly through Save/Load.

Verified end-to-end:
  just proof contract       — 53 pass (4 cases)
  just proof integration    — 104 pass (10 cases) · 8s wall
  just verify               — 9 smokes still green · 33s wall

Phase D (performance baseline) lands next: 10_perf_baseline measures
rows/sec ingest, vectors/sec add, p50/p95 query+search latency, RSS,
CPU. First run writes tests/proof/baseline.json; later runs diff
against it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 05:26:00 -05:00
root
6d18394416 proof harness Phase B: 4 contract cases · 53/0/1 green
Added the contract tier above 00_health canary. All 5 contract cases
now cover GOLAKE-001-003, 050, 060-061, 080-085 — 53 assertions pass,
1 informational skip, 0 fail. Wall: 4s end-to-end (cached binaries).

Cases:
  05_embedding_contract.sh
    GOLAKE-050. POST /v1/embed with one short text → asserts dim=768,
    one vector returned, vector length matches dimension, sum of
    squared elements > 0 (proxy for non-zero), response.model echoed.
    Skips with explicit reason if Ollama is unreachable (502 from
    embedd) — per spec hard rule "skipped tests do not appear as
    passed."

  06_vector_add_search.sh
    GOLAKE-060 + GOLAKE-061. Synthetic dim=4 unit basis vectors.
    Create index → add 3 vectors → get-index returns length=3 →
    search([1,0,0,0],k=3) returns v1 at rank 1 with distance < 0.001.
    Cleanup with DELETE. No embedd dependency — pure contract layer.

  08_gateway_contracts.sh
    GOLAKE-003. For each /v1/* route, asserts gateway and direct
    upstream return identical status AND identical response body
    (sha256 match). Confirms gateway is a proxy not a transformer.
    Status passthrough verified on both 200 path (storage/list,
    catalog/list) and 4xx path (sql empty body → 400 from queryd).

  09_failure_modes.sh
    GOLAKE-080..085. Six failure-mode contracts:
      080 malformed JSON → 4xx on catalog/ingest/sql/embed
      081 missing required field → 4xx on catalog/vectors/embed
      082 bad SQL → 4xx with non-empty error body
      083 vector dim mismatch → 4xx
      084 missing storage object → 404
      085 duplicate vector ID → INFORMATIONAL (spec says required:false)
          first/second statuses recorded as evidence; contract decided
          later from the recorded record.

Two new lib helpers in lib/assert.sh:
  proof_assert_status_in <id> <claim> "200 201 204" <probe>
    pass if status is in the space-separated list. Used for
    delete-returns-200-or-204 case where vectord returns 204.

  proof_assert_status_4xx <id> <claim> <probe>
    pass if status in [400, 500). Used for failure modes where the
    specific 4xx code may vary (400 vs 422 vs 409). Records actual
    code as evidence.

Two real contract findings recorded by the harness during build:
  - vectord add expects {"items": [...]}, not {"vectors": [...]}.
    My initial test sent the wrong field; would have masked the bug
    forever in CI. The harness caught it via the assertion failure.
  - vectord create returns 201 Created, delete returns 204 No Content.
    Documented in the test fixtures as canonical.

Regression: just verify wall 33s, vet + test + 9 smokes still green.

Phase C (integration) lands next: 01_storage_roundtrip, 02_catalog_manifest,
03_ingest_csv_to_parquet, 04_query_correctness, 05/06 integration extends,
07_vector_persistence_restart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 05:15:04 -05:00
root
a81291e38c proof harness Phase A: scaffolding + canary case green
Per docs/TEST_PROOF_SCOPE.md, building the claims-verification tier
above the smoke chain. This commit lays the scaffolding and proves
the orchestrator end-to-end with one canary case (00_health).

What landed:

  tests/proof/
    README.md             how to read a report, layout, modes
    claims.yaml           24 claims enumerated (GOLAKE-001..100)
    run_proof.sh          orchestrator with --mode {contract|integration|performance}
                          and --no-bootstrap / --regenerate-{rankings,baseline}
    lib/
      env.sh              service URLs, report dir, mode, git context
      http.sh             curl wrappers writing per-probe JSON + body + headers
      assert.sh           proof_assert_{eq,ne,contains,lt,gt,status,json_eq} +
                          proof_skip — each emits one JSONL record per call
      metrics.sh          start/stop timers, value capture, RSS sampling,
                          percentile compute (for Phase D)
    cases/
      00_health.sh        canary — gateway + 6 services /health → 200,
                          body identifies service, latency < 500ms (21 assertions)
    fixtures/
      csv/workers.csv     spec's 5-row deterministic CSV
      text/docs.txt       4 deterministic vector docs
      expected/queries.json  expected results for the 5 SQL assertions

Wired into the task runner:

  just proof contract       # canary only this commit
  just proof integration    # Phase C
  just proof performance    # Phase D

.gitignore: /tests/proof/reports/* with !.gitkeep — same pattern as
reports/scrum/_evidence/. Per-run output is a runtime artifact.

Specs landed alongside (J's drops):
  docs/TEST_PROOF_SCOPE.md           the harness contract this implements
  docs/CLAUDE_REFACTOR_GUARDRAILS.md process discipline this harness obeys

Verified end-to-end (cached binaries):
  just proof contract        wall < 2s, 21 pass / 0 fail / 0 skip
  just verify                wall 31s, vet + test + 9 smokes still green

Two bugs fixed during canary run, both in run_proof.sh aggregation:
- grep -c exits 1 on zero matches; the `|| echo 0` form concatenated
  "0\n0" and broke jq --argjson + integer comparison. Fixed via a
  _count helper that captures count-or-zero cleanly.
- per-case table iterated case scripts (filename-based) but cases
  write evidence under CASE_ID. Switched to JSONL-file iteration so
  multi-case scripts work and the mapping is faithful.

Phase B (contract cases) lands next: 05_embedding, 06_vector_add,
08_gateway_contracts, 09_failure_modes. Each sourcing the same lib
helpers and writing to the same report shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 05:08:51 -05:00