Create `docs/TEST_PROOF_SCOPE.md`. Purpose: design a serious proof harness for the Go lakehouse refactor. You are not writing production features yet. You are designing and implementing a claims-verification test suite that proves or disproves what this system currently claims.

## System Claims To Prove

The Go lakehouse claims:

1. Gateway-fronted services work as a coherent system.
2. CSV data can be ingested into Parquet.
3. Catalog manifests remain consistent.
4. The DuckDB query path returns correct results.
5. The embedding path works through Ollama or the configured embedding backend.
6. Vector add/search works.
7. Vector persistence survives restart.
8. Service contracts are stable.
9. The refactor preserved behavior.
10. Performance claims are measurable, not vibes.

## Required Output

Create a proof harness under:

```text
tests/proof/
  README.md
  claims.yaml
  run_proof.sh
  lib/
    env.sh
    http.sh
    assert.sh
    metrics.sh
  cases/
    00_health.sh
    01_storage_roundtrip.sh
    02_catalog_manifest.sh
    03_ingest_csv_to_parquet.sh
    04_query_correctness.sh
    05_embedding_contract.sh
    06_vector_add_search.sh
    07_vector_persistence_restart.sh
    08_gateway_contracts.sh
    09_failure_modes.sh
    10_perf_baseline.sh
  fixtures/
    csv/
    expected/
  reports/
    .gitkeep
```

## Test Design Requirements

Each test must produce evidence, not just pass/fail. For every case, record:

- claim tested
- service routes called
- input fixture hash
- output hash
- expected result
- actual result
- pass/fail
- latency
- status codes
- logs location
- timestamp
- git commit hash

Write results to `tests/proof/reports/proof-YYYYMMDD-HHMMSS/`. Each run must produce:

- summary.md
- summary.json
- raw/
- http/
- logs/
- outputs/
- metrics/

## Claims File

Create `tests/proof/claims.yaml`. Each claim should have:

```yaml
id: GOLAKE-001
name: Gateway health routes respond
type: contract
services:
  - gateway
routes:
  - GET /health
evidence:
  - status_code
  - response_body
  - latency_ms
required: true
```

Include claims for:

- gateway health
- each service health
- storage put/get/list/delete, if supported
- catalog create/read/update/list, if supported
- ingest job creation
- Parquet output existence
- query correctness against a known CSV fixture
- embedding vector dimension
- vector add/search nearest-neighbor correctness
- vector restart persistence
- invalid request rejection
- missing object behavior
- duplicate vector ID behavior
- malformed CSV behavior
- unavailable downstream service behavior
- latency baseline
- throughput baseline

## Fixtures

Create deterministic fixtures. Minimum CSV fixture:

```csv
id,name,role,city,score
1,Ada,welder,Chicago,91
2,Grace,electrician,Detroit,88
3,Linus,operator,Chicago,77
4,Ken,pipefitter,Houston,84
5,Barbara,safety,Houston,95
```

Expected query assertions:

- row count = 5
- rows with city Chicago = 2
- max score = 95
- the role "safety" belongs to Barbara
- average score for Houston = 89.5

For vector tests, use deterministic text fixtures:

- doc-001: industrial staffing for welders in Chicago
- doc-002: safety compliance for warehouse crews
- doc-003: electrical contractors assigned to Detroit
- doc-004: pipefitters and heavy equipment operators in Houston

Search assertions should verify that semantically related queries return the expected top candidates where embeddings are enabled. If embeddings are not deterministic enough, support a contract-only mode that verifies:

- vector dimension is non-empty
- vector add succeeds
- search returns known inserted IDs
- persistence survives restart

## Modes

Support three modes:

```bash
tests/proof/run_proof.sh --mode contract
tests/proof/run_proof.sh --mode integration
tests/proof/run_proof.sh --mode performance
```
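As a reference point only, here is a minimal sketch of how `run_proof.sh` might dispatch these modes. The mapping of case files to modes, the `CASES_DIR`/`REPORT_DIR` names, and the convention of passing the report directory as the first argument to each case are assumptions, not requirements of this spec.

```bash
#!/usr/bin/env bash
# Sketch of run_proof.sh mode dispatch. The case-to-mode mapping and the
# variable names below are illustrative assumptions.
set -euo pipefail

MODE="contract"
while [[ $# -gt 0 ]]; do
  case "$1" in
    --mode) MODE="$2"; shift 2 ;;
    *) echo "unknown argument: $1" >&2; exit 2 ;;
  esac
done

HARNESS_DIR="$(cd "$(dirname "$0")" && pwd)"
CASES_DIR="$HARNESS_DIR/cases"
REPORT_DIR="$HARNESS_DIR/reports/proof-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$REPORT_DIR"/{raw,http,logs,outputs,metrics}

case "$MODE" in
  contract)    cases=("$CASES_DIR"/0[0-8]_*.sh) ;;          # fast API/schema/correctness checks
  integration) cases=("$CASES_DIR"/0*_*.sh) ;;              # full gateway -> service chain
  performance) cases=("$CASES_DIR"/10_perf_baseline.sh) ;;  # baseline metrics only
  *) echo "unknown mode: $MODE" >&2; exit 2 ;;
esac

failed=0
for case_script in "${cases[@]}"; do
  bash "$case_script" "$REPORT_DIR" || failed=1   # each case writes its own evidence record
done
exit "$failed"
```

The real script also has to aggregate the per-case evidence into `summary.md` and `summary.json` for the run directory.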
### Contract mode

Fast, with no massive data. Verifies APIs, schemas, status codes, and basic correctness.

### Integration mode

Runs the full gateway → service chain. Must prove:

- CSV fixture → storaged → ingestd → catalogd → queryd
- text fixture → embedd → vectord → search

### Performance mode

Measures a baseline only. Do not fake claims. Record:

- rows ingested/sec
- vectors added/sec
- p50/p95 query latency
- p50/p95 vector search latency
- memory usage, if available
- CPU usage, if available
- service restart time, if available

## Failure-Mode Tests

Add tests proving the system fails cleanly. Required:

- malformed JSON
- missing required field
- invalid vector dimension
- missing object
- bad SQL query
- duplicate vector ID
- downstream service unavailable, if easy to simulate
- restart before the persistence load completes, if relevant

Do not hide failures behind retries unless the system explicitly documents retry behavior.

## Hard Rules

- Do not add production features unless needed to expose testable behavior.
- Do not change public route contracts without documenting it.
- Do not write tests that merely check "HTTP 200" unless the claim is health-only.
- Do not use random data unless it is seeded and recorded.
- Do not make performance claims without before/after metrics.
- Do not assume Ollama is available; detect it and mark embedding tests skipped or degraded, with an explanation.
- Do not let skipped tests appear as passed.
- Do not silently ignore missing services.
- Do not make the proof harness depend on external cloud services.

## Final Deliverables

After implementation, produce:

- tests/proof/README.md
- tests/proof/claims.yaml
- tests/proof/run_proof.sh
- tests/proof/cases/*.sh
- tests/proof/reports/proof-YYYYMMDD-HHMMSS/summary.md
- tests/proof/reports/proof-YYYYMMDD-HHMMSS/summary.json

## Final Report Must Answer

At the end, write a clear report:

- Which claims are proven?
- Which claims are partially proven?
- Which claims failed?
- Which claims were skipped, and why?
- What evidence supports each claim?
- What bottlenecks were measured?
- What contract drift was found?
- What refactor risks remain?
- What should be fixed first?

## Execution Plan

1. Inspect the repo.
2. Produce a short implementation plan.
3. Build the proof harness.
4. Run contract mode.
5. Run integration mode, if services can be started.
6. Run performance mode, only if contract and integration pass.

Do not declare success without evidence files.
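## Appendix: Case Sketch

For orientation only, a minimal sketch of the shape a contract-mode case could take, following the report-directory calling convention sketched above. `GATEWAY_URL`, the `evidence.jsonl` file name, and the exact JSON fields are assumptions; a real case sources the shared helpers in `lib/` and records the full evidence list from Test Design Requirements.

```bash
#!/usr/bin/env bash
# cases/00_health.sh -- illustrative sketch only. GATEWAY_URL and the
# evidence.jsonl layout are assumptions; a real case would source
# lib/env.sh, lib/http.sh, and lib/assert.sh.
set -euo pipefail
REPORT_DIR="$1"

CLAIM="GOLAKE-001"                                   # claim tested (see claims.yaml)
URL="${GATEWAY_URL:-http://localhost:8080}/health"   # route under test: GET /health
BODY="$REPORT_DIR/http/${CLAIM}_health_body.json"

# Single request; capture status code and latency without retries.
metrics=$(curl -s -o "$BODY" -w '%{http_code} %{time_total}' --max-time 5 "$URL" || true)
status=${metrics%% *}
latency_s=${metrics##* }

if [[ "$status" == "200" ]]; then result="pass"; else result="fail"; fi

# One evidence record per case; summary.json is aggregated from these records.
printf '{"claim":"%s","route":"GET /health","status_code":"%s","latency_s":"%s","result":"%s","output_hash":"%s","commit":"%s","timestamp":"%s"}\n' \
  "$CLAIM" "$status" "$latency_s" "$result" \
  "$(sha256sum "$BODY" 2>/dev/null | cut -d' ' -f1)" \
  "$(git rev-parse HEAD 2>/dev/null || echo unknown)" \
  "$(date -u +%FT%TZ)" >> "$REPORT_DIR/raw/evidence.jsonl"

[[ "$result" == "pass" ]]
```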