Per docs/TEST_PROOF_SCOPE.md, building the claims-verification tier
above the smoke chain. This commit lays the scaffolding and proves
the orchestrator end-to-end with one canary case (00_health).
What landed:
tests/proof/
  README.md                how to read a report, layout, modes
  claims.yaml              24 claims enumerated (GOLAKE-001..100)
  run_proof.sh             orchestrator with --mode {contract|integration|performance}
                           and --no-bootstrap / --regenerate-{rankings,baseline}
  lib/
    env.sh                 service URLs, report dir, mode, git context
    http.sh                curl wrappers writing per-probe JSON + body + headers
    assert.sh              proof_assert_{eq,ne,contains,lt,gt,status,json_eq} +
                           proof_skip — each emits one JSONL record per call
    metrics.sh             start/stop timers, value capture, RSS sampling,
                           percentile compute (for Phase D)
  cases/
    00_health.sh           canary — gateway + 6 services /health → 200,
                           body identifies service, latency < 500ms (21 assertions)
  fixtures/
    csv/workers.csv        spec's 5-row deterministic CSV
    text/docs.txt          4 deterministic vector docs
    expected/queries.json  expected results for the 5 SQL assertions
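
For orientation, a rough sketch of how a case script consumes these helpers. The
assert helpers are the real ones listed above; `proof_http_get`, `GATEWAY_URL`,
the JSON field names, and the argument order are illustrative placeholders, not
the actual API.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Load shared config, HTTP wrappers, and assert helpers.
source "$(dirname "$0")/../lib/env.sh"
source "$(dirname "$0")/../lib/http.sh"
source "$(dirname "$0")/../lib/assert.sh"

CASE_ID="00_health"

# Probe the gateway health route; the wrapper is assumed to return a small
# JSON blob describing status, body/header paths, and latency.
probe="$(proof_http_get "$GATEWAY_URL/health")"     # illustrative wrapper name
status="$(jq -r '.status' <<<"$probe")"
latency_ms="$(jq -r '.latency_ms' <<<"$probe")"

# Each assert call appends one JSONL evidence record for this CASE_ID.
proof_assert_status "gateway /health returns 200" "$status" 200
proof_assert_lt "gateway /health responds in under 500ms" "$latency_ms" 500
```
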
Wired into the task runner:
just proof contract # canary only this commit
just proof integration # Phase C
just proof performance # Phase D
.gitignore: /tests/proof/reports/* with !.gitkeep — same pattern as
reports/scrum/_evidence/. Per-run output is a runtime artifact.
Specs landed alongside (J's drops):
  docs/TEST_PROOF_SCOPE.md            the harness contract this implements
  docs/CLAUDE_REFACTOR_GUARDRAILS.md  process discipline this harness obeys
Verified end-to-end (cached binaries):
  just proof contract   wall < 2s, 21 pass / 0 fail / 0 skip
  just verify           wall 31s, vet + test + 9 smokes still green
Two bugs fixed during canary run, both in run_proof.sh aggregation:
- grep -c exits 1 on zero matches; the `|| echo 0` form concatenated
"0\n0" and broke jq --argjson + integer comparison. Fixed via a
_count helper that captures count-or-zero cleanly.
- per-case table iterated case scripts (filename-based) but cases
write evidence under CASE_ID. Switched to JSONL-file iteration so
multi-case scripts work and the mapping is faithful.
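
Roughly, the broken form and the shape of the fix (variable names and patterns
illustrative):

```bash
# Broken form: grep -c prints "0" AND exits 1 when nothing matches, so the
# fallback appended a second zero and jq --argjson saw "0\n0".
#   pass_count=$(grep -c '"result":"pass"' "$case_jsonl" || echo 0)

# Shape of the fix: capture grep's own count and ignore its exit status.
_count() {
  local n
  n="$(grep -c -- "$1" "$2" 2>/dev/null)" || true
  printf '%s\n' "${n:-0}"
}
# pass_count=$(_count '"result":"pass"' "$case_jsonl")
```
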
Phase B (contract cases) lands next: 05_embedding, 06_vector_add,
08_gateway_contracts, 09_failure_modes. Each sources the same lib
helpers and writes to the same report shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Create `docs/TEST_PROOF_SCOPE.md`.

Purpose: design a serious proof harness for the Go lakehouse refactor.

You are not writing production features yet. You are designing and implementing a claims-verification test suite that proves or disproves what this system currently claims.

## System Claims To Prove

The Go lakehouse claims:

1. Gateway-fronted services work as a coherent system.
2. CSV data can ingest into Parquet.
3. Catalog manifests remain consistent.
4. DuckDB query path returns correct results.
5. Embedding path works through Ollama or configured embedding backend.
6. Vector add/search works.
7. Vector persistence survives restart.
8. Service contracts are stable.
9. Refactor preserved behavior.
10. Performance claims are measurable, not vibes.

## Required Output

Create a proof harness under:

```text
tests/proof/
  README.md
  claims.yaml
  run_proof.sh
  lib/
    env.sh
    http.sh
    assert.sh
    metrics.sh
  cases/
    00_health.sh
    01_storage_roundtrip.sh
    02_catalog_manifest.sh
    03_ingest_csv_to_parquet.sh
    04_query_correctness.sh
    05_embedding_contract.sh
    06_vector_add_search.sh
    07_vector_persistence_restart.sh
    08_gateway_contracts.sh
    09_failure_modes.sh
    10_perf_baseline.sh
  fixtures/
    csv/
    expected/
  reports/
    .gitkeep
```

## Test Design Requirements

Each test must produce evidence, not just pass/fail.

For every case, record:

- claim tested
- service routes called
- input fixture hash
- output hash
- expected result
- actual result
- pass/fail
- latency
- status codes
- logs location
- timestamp
- git commit hash

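As a sketch of what one such record might look like, assuming a jq-based
harness; the key names, `$REPORT_DIR`, and the surrounding `$status` /
`$latency_ms` variables are illustrative, not a required schema:

```bash
# Append one JSONL evidence record per assertion.
jq -cn \
  --arg claim    "GOLAKE-001" \
  --arg routes   "GET /health" \
  --arg fixture_sha256 "$(sha256sum tests/proof/fixtures/csv/workers.csv | cut -d' ' -f1)" \
  --arg expected "200" \
  --arg actual   "$status" \
  --arg result   "pass" \
  --argjson latency_ms "$latency_ms" \
  --argjson status_code "$status" \
  --arg logs     "raw/logs/00_health.log" \
  --arg ts       "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --arg commit   "$(git rev-parse HEAD)" \
  '$ARGS.named' >> "$REPORT_DIR/00_health.jsonl"
```
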
Write results to:

    tests/proof/reports/proof-YYYYMMDD-HHMMSS/

Each run must produce:

- summary.md
- summary.json
- raw/
- http/
- logs/
- outputs/
- metrics/

## Claims File

Create `tests/proof/claims.yaml`.

Each claim should have:

    id: GOLAKE-001
    name: Gateway health routes respond
    type: contract
    services:
      - gateway
    routes:
      - GET /health
    evidence:
      - status_code
      - response_body
      - latency_ms
    required: true

Include claims for:

- gateway health
- each service health
- storage put/get/list/delete if supported
- catalog create/read/update/list if supported
- ingest job creation
- Parquet output existence
- query correctness against known CSV fixture
- embedding vector dimension
- vector add/search nearest-neighbor correctness
- vector restart persistence
- invalid request rejection
- missing object behavior
- duplicate vector ID behavior
- malformed CSV behavior
- unavailable downstream service behavior
- latency baseline
- throughput baseline

## Fixtures

Create deterministic fixtures.

Minimum CSV fixture:

    id,name,role,city,score
    1,Ada,welder,Chicago,91
    2,Grace,electrician,Detroit,88
    3,Linus,operator,Chicago,77
    4,Ken,pipefitter,Houston,84
    5,Barbara,safety,Houston,95

Expected query assertions:

- count rows = 5
- city Chicago = 2
- max score = 95
- role safety belongs to Barbara
- Houston average score = 89.5

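For reference, a sketch of those five assertions as DuckDB SQL run directly
against the CSV fixture; the real case should go through the queryd path,
and the `duckdb` invocation and file paths here are only illustrative:

```bash
# Run each expected assertion against the fixture with the DuckDB CLI.
csv=tests/proof/fixtures/csv/workers.csv
q() { duckdb -csv -noheader :memory: "$1"; }

q "SELECT count(*)   FROM read_csv_auto('$csv')"                          # expect 5
q "SELECT count(*)   FROM read_csv_auto('$csv') WHERE city = 'Chicago'"   # expect 2
q "SELECT max(score) FROM read_csv_auto('$csv')"                          # expect 95
q "SELECT name       FROM read_csv_auto('$csv') WHERE role = 'safety'"    # expect Barbara
q "SELECT avg(score) FROM read_csv_auto('$csv') WHERE city = 'Houston'"   # expect 89.5
```
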
For vector tests, use deterministic text fixtures:

- doc-001: industrial staffing for welders in Chicago
- doc-002: safety compliance for warehouse crews
- doc-003: electrical contractors assigned to Detroit
- doc-004: pipefitters and heavy equipment operators in Houston

Search assertions should verify that semantically related queries return expected top candidates where embeddings are enabled.

If embeddings are not deterministic enough, support a contract-only mode that verifies (see the sketch after this list):

- vector dimension
- non-empty vector
- add succeeds
- search returns known inserted IDs
- persistence survives restart

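A minimal sketch of what contract-only verification could look like, assuming
hypothetical gateway routes (`/embed`, `/vectors`, `/vectors/search`) and
payload fields; substitute whatever routes and schemas the services actually
expose:

```bash
# Embed one fixture doc and check the contract, not the exact values.
emb="$(curl -sf -X POST "$GATEWAY_URL/embed" \
        -H 'Content-Type: application/json' \
        -d '{"text":"industrial staffing for welders in Chicago"}')"

dim="$(jq '.embedding | length' <<<"$emb")"
proof_assert_gt "embedding has a non-zero dimension" "$dim" 0

# Add the vector, then confirm a related query returns the inserted ID.
curl -sf -X POST "$GATEWAY_URL/vectors" \
  -H 'Content-Type: application/json' \
  -d "$(jq -c '{id: "doc-001", embedding: .embedding}' <<<"$emb")" >/dev/null

hits="$(curl -sf -X POST "$GATEWAY_URL/vectors/search" \
         -H 'Content-Type: application/json' \
         -d '{"query":"welders in Chicago","top_k":4}')"
proof_assert_contains "search returns the inserted ID" "$hits" "doc-001"
```
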
## Modes

Support three modes:

    tests/proof/run_proof.sh --mode contract
    tests/proof/run_proof.sh --mode integration
    tests/proof/run_proof.sh --mode performance

### Contract mode

Fast. No massive data. Verifies APIs, schemas, status codes, basic correctness.

### Integration mode

Runs full gateway → service chain.

Must prove:

- CSV fixture → storaged → ingestd → catalogd → queryd
- text fixture → embedd → vectord → search

### Performance mode

Measures baseline only. Do not fake claims.

Record:

- rows ingested/sec
- vectors added/sec
- p50/p95 query latency
- p50/p95 vector search latency
- memory usage if available
- CPU usage if available
- service restart time if available

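A nearest-rank percentile sketch of the kind of helper metrics.sh needs for
the p50/p95 items above; the function name and file layout are illustrative:

```bash
# Nearest-rank percentile over one latency sample per line (milliseconds).
proof_percentile() {   # usage: proof_percentile 95 < latencies.txt
  local p="$1"
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) { print "NaN"; exit }
      idx = int((p / 100) * NR + 0.999999)   # ceil(p% of N), nearest-rank
      if (idx < 1) idx = 1
      print v[idx]
    }'
}

# Example: p50/p95 query latency from a metrics capture.
#   proof_percentile 50 < "$REPORT_DIR/metrics/query_latency_ms.txt"
#   proof_percentile 95 < "$REPORT_DIR/metrics/query_latency_ms.txt"
```
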
## Failure-Mode Tests

Add tests proving the system fails cleanly.

Required:

- malformed JSON
- missing required field
- invalid vector dimension
- missing object
- bad SQL query
- duplicate vector ID
- downstream service unavailable if easy to simulate
- restart before persistence load completes if relevant

Do not hide failures behind retries unless the system explicitly documents retry behavior.

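A sketch of one such probe from the list above: malformed JSON should be
rejected with a 4xx and a structured error, not a 200. The route, payload,
temp path, and assert argument order are illustrative:

```bash
# Send deliberately truncated JSON and capture status + body separately.
body=/tmp/malformed_resp.json
status="$(curl -s -o "$body" -w '%{http_code}' \
  -X POST "$GATEWAY_URL/vectors" \
  -H 'Content-Type: application/json' \
  --data '{"id": "doc-001", "embedding": [0.1,')"   # truncated on purpose

proof_assert_eq       "malformed JSON is rejected with 400" "$status" 400
proof_assert_contains "error body names the problem"        "$(cat "$body")" "error"
```
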
## Hard Rules

- Do not add production features unless needed to expose testable behavior.
- Do not change public route contracts without documenting it.
- Do not write tests that merely check “HTTP 200” unless the claim is health-only.
- Do not use random data unless seeded and recorded.
- Do not make performance claims without before/after metrics.
- Do not assume Ollama is available; detect it and mark embedding tests skipped or degraded with explanation (see the sketch after these rules).
- Do not let skipped tests appear as passed.
- Do not silently ignore missing services.
- Do not make the proof harness depend on external cloud services.

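One possible way to honor the Ollama rule above, assuming Ollama's default
`/api/tags` listing endpoint; `OLLAMA_URL` and the `proof_skip` argument order
are illustrative:

```bash
# Detect the embedding backend; record an explicit skip (never a pass) when absent.
if ! curl -sf "${OLLAMA_URL:-http://localhost:11434}/api/tags" >/dev/null; then
  proof_skip "embedding claims" "Ollama not reachable; embedding cases skipped, not passed"
  exit 0
fi
```
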
## Final Deliverables

After implementation, produce:

- tests/proof/README.md
- tests/proof/claims.yaml
- tests/proof/run_proof.sh
- tests/proof/cases/*.sh
- tests/proof/reports/<latest>/summary.md
- tests/proof/reports/<latest>/summary.json

## Final Report Must Answer

At the end, write a clear report:

- Which claims are proven?
- Which claims are partially proven?
- Which claims failed?
- Which claims were skipped and why?
- What evidence supports each claim?
- What bottlenecks were measured?
- What contract drift was found?
- What refactor risks remain?
- What should be fixed first?

## Execution Plan

1. Inspect the repo.
2. Produce a short implementation plan.
3. Build the proof harness.
4. Run contract mode.
5. Run integration mode if services can be started.
6. Run performance mode only if contract and integration pass.

Do not declare success without evidence files.