root a81291e38c proof harness Phase A: scaffolding + canary case green
Per docs/TEST_PROOF_SCOPE.md, building the claims-verification tier
above the smoke chain. This commit lays the scaffolding and proves
the orchestrator end-to-end with one canary case (00_health).

What landed:

  tests/proof/
    README.md             how to read a report, layout, modes
    claims.yaml           24 claims enumerated (GOLAKE-001..100)
    run_proof.sh          orchestrator with --mode {contract|integration|performance}
                          and --no-bootstrap / --regenerate-{rankings,baseline}
    lib/
      env.sh              service URLs, report dir, mode, git context
      http.sh             curl wrappers writing per-probe JSON + body + headers
      assert.sh           proof_assert_{eq,ne,contains,lt,gt,status,json_eq} +
                          proof_skip — each emits one JSONL record per call
      metrics.sh          start/stop timers, value capture, RSS sampling,
                          percentile compute (for Phase D)
    cases/
      00_health.sh        canary — gateway + 6 services /health → 200,
                          body identifies service, latency < 500ms (21 assertions)
    fixtures/
      csv/workers.csv     spec's 5-row deterministic CSV
      text/docs.txt       4 deterministic vector docs
      expected/queries.json  expected results for the 5 SQL assertions

Wired into the task runner:

  just proof contract       # canary only this commit
  just proof integration    # Phase C
  just proof performance    # Phase D

.gitignore: /tests/proof/reports/* with !.gitkeep — same pattern as
reports/scrum/_evidence/. Per-run output is a runtime artifact.

Specs landed alongside (J's drops):
  docs/TEST_PROOF_SCOPE.md           the harness contract this implements
  docs/CLAUDE_REFACTOR_GUARDRAILS.md process discipline this harness obeys

Verified end-to-end (cached binaries):
  just proof contract        wall < 2s, 21 pass / 0 fail / 0 skip
  just verify                wall 31s, vet + test + 9 smokes still green

Two bugs fixed during canary run, both in run_proof.sh aggregation:
- grep -c exits 1 on zero matches while still printing 0; the
  `|| echo 0` fallback therefore concatenated "0\n0" and broke
  jq --argjson + integer comparison. Fixed via a _count helper
  that captures count-or-zero cleanly (sketch below).
- per-case table iterated case scripts (filename-based) but cases
  write evidence under CASE_ID. Switched to JSONL-file iteration so
  multi-case scripts work and the mapping is faithful.
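
A minimal sketch of the _count shape (helper name from this commit;
exact body illustrative):

    _count() {
      # grep -c still prints 0 when nothing matches but exits 1;
      # swallow the exit status so the captured value is one clean
      # integer instead of "0\n0".
      local n
      n=$(grep -c -- "$1" "$2") || true
      printf '%s\n' "${n:-0}"
    }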

Phase B (contract cases) lands next: 05_embedding, 06_vector_add,
08_gateway_contracts, 09_failure_modes. Each sourcing the same lib
helpers and writing to the same report shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 05:08:51 -05:00


Create `docs/TEST_PROOF_SCOPE.md`.
Purpose: design a serious proof harness for the Go lakehouse refactor.
You are not writing production features yet. You are designing and implementing a claims-verification test suite that proves or disproves what this system currently claims.
## System Claims To Prove
The Go lakehouse claims:
1. Gateway-fronted services work as a coherent system.
2. CSV data can ingest into Parquet.
3. Catalog manifests remain consistent.
4. DuckDB query path returns correct results.
5. Embedding path works through Ollama or configured embedding backend.
6. Vector add/search works.
7. Vector persistence survives restart.
8. Service contracts are stable.
9. Refactor preserved behavior.
10. Performance claims are measurable, not vibes.
## Required Output
Create a proof harness under:
```text
tests/proof/
  README.md
  claims.yaml
  run_proof.sh
  lib/
    env.sh
    http.sh
    assert.sh
    metrics.sh
  cases/
    00_health.sh
    01_storage_roundtrip.sh
    02_catalog_manifest.sh
    03_ingest_csv_to_parquet.sh
    04_query_correctness.sh
    05_embedding_contract.sh
    06_vector_add_search.sh
    07_vector_persistence_restart.sh
    08_gateway_contracts.sh
    09_failure_modes.sh
    10_perf_baseline.sh
  fixtures/
    csv/
    expected/
  reports/
    .gitkeep
```
## Test Design Requirements

Each test must produce evidence, not just pass/fail.

For every case, record (see the sketch after this list):

- claim tested
- service routes called
- input fixture hash
- output hash
- expected result
- actual result
- pass/fail
- latency
- status codes
- logs location
- timestamp
- git commit hash
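
A minimal sketch, with hypothetical names (`proof_record`, `REPORT_DIR`, `CASE_ID`), of an assertion helper emitting one JSONL evidence record per call covering a subset of these fields:

```bash
# Sketch only: helper name, variables, and field subset are illustrative.
proof_record() {  # usage: proof_record <claim> <route> <expected> <actual> <status> <latency_ms> <result>
  jq -cn \
    --arg claim "$1" --arg route "$2" \
    --arg expected "$3" --arg actual "$4" \
    --argjson status "$5" --argjson latency_ms "$6" \
    --arg result "$7" \
    --arg commit "$(git rev-parse HEAD)" \
    --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    '{claim: $claim, route: $route, expected: $expected, actual: $actual,
      status: $status, latency_ms: $latency_ms, result: $result,
      git_commit: $commit, timestamp: $ts}' \
    >> "$REPORT_DIR/raw/$CASE_ID.jsonl"
}
```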
Write results to `tests/proof/reports/proof-YYYYMMDD-HHMMSS/` (creation sketched below). Each run must produce:

```text
summary.md
summary.json
raw/
http/
logs/
outputs/
metrics/
```
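
A minimal sketch of how env.sh might derive the per-run report directory (the variable name `REPORT_DIR` is an assumption; the naming pattern is from the spec above):

```bash
# One timestamped directory per run, matching the proof-YYYYMMDD-HHMMSS pattern.
REPORT_DIR="tests/proof/reports/proof-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$REPORT_DIR"/{raw,http,logs,outputs,metrics}
```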
## Claims File

Create `tests/proof/claims.yaml`. Each claim should have:

```yaml
id: GOLAKE-001
name: Gateway health routes respond
type: contract
services:
  - gateway
routes:
  - GET /health
evidence:
  - status_code
  - response_body
  - latency_ms
required: true
```
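
A minimal sketch, assuming yq v4 is on PATH and claims.yaml holds a top-level YAML sequence of such claim entries (both assumptions), of how run_proof.sh might enumerate required claim IDs:

```bash
# Illustrative only: depends on yq v4 and a top-level list layout.
yq e '.[] | select(.required == true) | .id' tests/proof/claims.yaml
```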
Include claims for:

- gateway health
- each service health
- storage put/get/list/delete if supported
- catalog create/read/update/list if supported
- ingest job creation
- Parquet output existence
- query correctness against known CSV fixture
- embedding vector dimension
- vector add/search nearest-neighbor correctness
- vector restart persistence
- invalid request rejection
- missing object behavior
- duplicate vector ID behavior
- malformed CSV behavior
- unavailable downstream service behavior
- latency baseline
- throughput baseline
## Fixtures

Create deterministic fixtures. Minimum CSV fixture:

```csv
id,name,role,city,score
1,Ada,welder,Chicago,91
2,Grace,electrician,Detroit,88
3,Linus,operator,Chicago,77
4,Ken,pipefitter,Houston,84
5,Barbara,safety,Houston,95
```
Expected query assertions (see the sketch after this list):

- count rows = 5
- city Chicago = 2
- max score = 95
- role safety belongs to Barbara
- Houston average score = 89.5
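
A minimal sketch of how 04_query_correctness.sh could check the row count; the `$GATEWAY_URL/query` route, payload shape, and `proof_assert_eq` argument order are assumptions, not the shipped contract:

```bash
# Route, response shape, and assert signature are illustrative.
body=$(curl -sS -X POST "$GATEWAY_URL/query" \
  -H 'Content-Type: application/json' \
  -d '{"sql": "SELECT count(*) AS n FROM workers"}')
actual=$(jq -r '.rows[0].n' <<<"$body")
proof_assert_eq "row count over workers.csv" "5" "$actual"
```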
For vector tests, use deterministic text fixtures:

- doc-001: industrial staffing for welders in Chicago
- doc-002: safety compliance for warehouse crews
- doc-003: electrical contractors assigned to Detroit
- doc-004: pipefitters and heavy equipment operators in Houston

Search assertions should verify that semantically related queries return expected top candidates where embeddings are enabled.

If embeddings are not deterministic enough, support a contract-only mode that verifies (see the sketch after this list):

- vector dimension
- non-empty vector
- add succeeds
- search returns known inserted IDs
- persistence survives restart
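
A minimal sketch of the contract-only dimension check; the `$EMBEDD_URL/embed` route and `.vector` response field are hypothetical, and only the check pattern is the point:

```bash
# Endpoint and response shape are assumptions.
dim=$(curl -sS -X POST "$EMBEDD_URL/embed" \
  -H 'Content-Type: application/json' \
  -d '{"text": "industrial staffing for welders in Chicago"}' \
  | jq '.vector | length')
proof_assert_gt "embedding dimension" "$dim" 0
```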
## Modes

Support three modes:

```bash
tests/proof/run_proof.sh --mode contract
tests/proof/run_proof.sh --mode integration
tests/proof/run_proof.sh --mode performance
```

### Contract mode

Fast. No massive data. Verifies APIs, schemas, status codes, basic correctness.

### Integration mode

Runs the full gateway → service chain. Must prove:

- CSV fixture → storaged → ingestd → catalogd → queryd
- text fixture → embedd → vectord → search
### Performance mode

Measures baseline only. Do not fake claims. Record (see the percentile sketch after this list):

- rows ingested/sec
- vectors added/sec
- p50/p95 query latency
- p50/p95 vector search latency
- memory usage if available
- CPU usage if available
- service restart time if available
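
A minimal sketch, assuming latencies are collected one value (in ms) per line, of a nearest-rank p50/p95 computation metrics.sh could use (the metrics file path is illustrative):

```bash
# percentile <p> <file>: nearest-rank percentile over one latency per line.
percentile() {
  sort -n "$2" | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      i = int(p / 100 * NR + 0.999999)   # nearest-rank: ceil(p% of N)
      if (i < 1) i = 1
      print v[i]
    }'
}

percentile 50 "$REPORT_DIR/metrics/query_latency_ms.txt"   # p50
percentile 95 "$REPORT_DIR/metrics/query_latency_ms.txt"   # p95
```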
## Failure-Mode Tests

Add tests proving the system fails cleanly. Required (first item sketched after this list):

- malformed JSON
- missing required field
- invalid vector dimension
- missing object
- bad SQL query
- duplicate vector ID
- downstream service unavailable if easy to simulate
- restart before persistence load completes if relevant

Do not hide failures behind retries unless the system explicitly documents retry behavior.
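
A minimal sketch of the malformed-JSON case, with a hypothetical `$GATEWAY_URL/vectors` route (`proof_assert_status` is named in the harness; its argument order here is illustrative):

```bash
# Deliberately truncated JSON body; expect a clean 400, not a hang or a 500.
status=$(curl -sS -o /dev/null -w '%{http_code}' \
  -X POST "$GATEWAY_URL/vectors" \
  -H 'Content-Type: application/json' \
  --data '{"id": "v1", "vector": [0.1, 0.2')
proof_assert_status "malformed JSON rejected" "$status" 400
```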
## Hard Rules

- Do not add production features unless needed to expose testable behavior.
- Do not change public route contracts without documenting it.
- Do not write tests that merely check "HTTP 200" unless the claim is health-only.
- Do not use random data unless seeded and recorded.
- Do not make performance claims without before/after metrics.
- Do not assume Ollama is available; detect it and mark embedding tests skipped or degraded with explanation (see the sketch after this list).
- Do not let skipped tests appear as passed.
- Do not silently ignore missing services.
- Do not make the proof harness depend on external cloud services.
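
A minimal sketch of the Ollama detection, assuming the default local port and an illustrative `proof_skip` signature (`/api/tags` is Ollama's standard listing endpoint, used here as a liveness probe):

```bash
# If Ollama is unreachable, record an explicit skip; never report a pass.
if curl -sf "${OLLAMA_URL:-http://localhost:11434}/api/tags" >/dev/null; then
  run_embedding_checks   # hypothetical: the case's real assertions
else
  proof_skip "embedding claims" "Ollama unreachable; marked skipped, not passed"
fi
```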
## Final Deliverables

After implementation, produce:

- `tests/proof/README.md`
- `tests/proof/claims.yaml`
- `tests/proof/run_proof.sh`
- `tests/proof/cases/*.sh`
- `tests/proof/reports/<latest>/summary.md`
- `tests/proof/reports/<latest>/summary.json`
## Final Report Must Answer

At the end, write a clear report:

- Which claims are proven?
- Which claims are partially proven?
- Which claims failed?
- Which claims were skipped and why?
- What evidence supports each claim?
- What bottlenecks were measured?
- What contract drift was found?
- What refactor risks remain?
- What should be fixed first?
## Execution Plan

1. Inspect the repo.
2. Produce a short implementation plan.
3. Build the proof harness.
4. Run contract mode.
5. Run integration mode if services can be started.
6. Run performance mode only if contract and integration pass.

Do not declare success without evidence files.