Stress-tests the role gate with 40 queries (10 fill_events rows × 4
styles): need, client_first, looking, shorthand. Each row's role +
client + city stays the same; only the surface phrasing changes.
real_003 (original extractor) confirmed the shorthand-vs-shorthand
failure mode: a CNC Operator shorthand recording leaked w-2404 onto a
Forklift Operator shorthand query within the same Beacon Freight
Detroit cluster. Both the record and the query had an empty role (the
extractor returns "" for shorthand because there's no separator between
role and city), so the gate was disabled, the distance check passed,
and the bleed fired.
Fix: extended extractRoleFromNeed to handle client_first
("{client} needs N {role} in...") and looking ("Looking for N
{role} at...") patterns. Shorthand left intentionally unmatched —
"Forklift Operator Detroit" is shape-indistinguishable from
"Forklift" + "Operator Detroit" without an LLM extractor or known-
cities lookup.
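For reference, the extended extraction is along these lines (an
illustrative Go sketch, not the shipped code; the regexes are guesses
shaped by the patterns above):

    package playbooklift

    import (
        "regexp"
        "strings"
    )

    // Illustrative only. Handles the three parseable styles; shorthand
    // deliberately falls through to "".
    var (
        needRe        = regexp.MustCompile(`(?i)^need \d+ (.+?)s? in `)
        clientFirstRe = regexp.MustCompile(`(?i)^.+? needs \d+ (.+?)s? in `)
        lookingRe     = regexp.MustCompile(`(?i)^looking for \d+ (.+?) at `)
    )

    func extractRole(q string) string {
        for _, re := range []*regexp.Regexp{needRe, clientFirstRe, lookingRe} {
            if m := re.FindStringSubmatch(q); m != nil {
                return strings.TrimSpace(m[1])
            }
        }
        return "" // shorthand ("Forklift Operator Detroit") stays unmatched
    }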
real_003b (extended extractor) verifies the bleed is closed across all 4
styles for this dataset. Forklift Operator queries keep w-2136 (the
cold-pass-correct match) regardless of which style the query arrives
in. Same-role boosts now fire correctly across styles: a CNC
Operator recording made in `looking` style boosts the CNC need-form
query.
scripts/cutover/gen_real_queries.go: added -styles flag with values
need|client_first|looking|shorthand|all (default need preserves
real_001/002 behavior). The 40-query stress file is
tests/reality/real_coord_queries_v2.txt.
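Regenerating it presumably mirrors the real_001 repro (exact
invocation not recorded here):

    go run scripts/cutover/gen_real_queries.go -styles all -limit 10 > tests/reality/real_coord_queries_v2.txt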
scripts/playbook_lift/main_test.go: 10 sub-tests lock the four
documented patterns + shorthand limitation + lift-suite-style
queries (no clean role, returns empty as expected).
Aggregate metrics:
- real_003 (original): disc=7, lift=7, boost=14, meanΔ=-0.108
- real_003b (extended): disc=11, lift=10, boost=31, meanΔ=-0.202
The growth reflects more LEGITIMATE same-role same-cluster transfer
firing across styles, not bleed (verified by per-cluster bleed
table — Forklift Operator queries unchanged across all 4 styles).
Known limitation documented in real_003_findings.md: same-cluster,
same-role queries in shorthand still embed close enough that a
shorthand recording could bleed onto a different-role shorthand
query if both record + query strip role. Closing this requires
LLM extraction or known-cities lookup at record + query time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First retrieval probe with non-synthetic query distribution. Pulls
N rows from /home/profit/lakehouse/data/datasets/fill_events.parquet
(real-shape demand data) and translates each to the natural language
a coordinator would type: "Need {count} {role}s in {city} {state}
starting at {at} for {client}".
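A minimal Go sketch of that row-to-query translation (struct and field
names are assumptions; the real driver reads the columns from the
parquet file):

    package realqueries

    import "fmt"

    // FillEvent mirrors the handful of fill_events columns the generator
    // needs; field names here are illustrative.
    type FillEvent struct {
        Count  int
        Role   string
        City   string
        State  string
        At     string
        Client string
    }

    // toCoordinatorQuery renders the template from the commit message:
    // "Need {count} {role}s in {city} {state} starting at {at} for {client}"
    func toCoordinatorQuery(e FillEvent) string {
        return fmt.Sprintf("Need %d %ss in %s %s starting at %s for %s",
            e.Count, e.Role, e.City, e.State, e.At, e.Client)
    }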
Headline: 8/10 cold-pass top-1 = judge-best on real distribution.
Substrate works on queries it was never trained for. v2-moe + workers
corpus carry the load.
Surfaced finding (the real value of running this): same-client+city
queries cluster, and Shape A's distance boost bleeds across roles
within the cluster. Q#2 (Forklift @ Beacon Freight Detroit) records
e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and
Q#10 (CNC Operator same client+city) inherit e-6193 at warm top-1
even though:
- Neither query has its own recorded playbook.
- Neither warm pass triggers a Shape B inject (boosted=0).
- The roles are different staffing categories.
Q#10 specifically demoted the cold-pass-correct w-3759 (judge rating
4 at rank 0) in favor of a worker the judge had approved for a
different role on a different query.
Why the lift suite missed it: synthetic queries use 7 disjoint
scenario buckets (forklift+OSHA+WI / CDL+IL / etc.). Real demand
clusters on (client, city). The cluster doesn't exist in the
synthetic distribution.
Why the judge gate doesn't catch it: the gate (5a3364f) is
per-injection at record time. After approval the worker rides Shape A
distance boosts on all later same-cluster queries with no second
gate call.
This becomes the new OPEN #1. Fix candidate: role-scoped playbook
corpus metadata + a Shape A boost gate on role match. Cheap; needs no
new judge calls.
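A rough sketch of that candidate (types, field names, and the boost
factor are hypothetical; the point is the role check guarding the
Shape A boost):

    package playbook

    import "strings"

    // PlaybookEntry sketches the proposed role-scoped metadata.
    type PlaybookEntry struct {
        WorkerID string
        Role     string
    }

    // applyShapeABoost only boosts when both roles are known and match;
    // cross-role entries pass through un-boosted. (Empty-role handling
    // is a separate question.) The boost factor is illustrative.
    func applyShapeABoost(e PlaybookEntry, queryRole string, dist float64) float64 {
        if e.Role != "" && queryRole != "" && strings.EqualFold(e.Role, queryRole) {
            return dist * 0.8
        }
        return dist
    }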
Files:
- scripts/cutover/gen_real_queries.go: parquet → coordinator NL
- tests/reality/real_coord_queries.txt: 10 generated queries
- reports/reality-tests/playbook_lift_real_001.md: harness output
- reports/reality-tests/real_001_findings.md: the reading
Repro:
go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 had two known gaps: (1) the 3 contracts had zero shared role
names, so same-role-across-contracts Jaccard was vacuous (n=0); (2)
the verbatim handover at 100% was the trivial case, not the hard
learning test (paraphrased queries against another coord's playbook).
Both fixed in this commit.
Contract redesign — all 3 contracts now share warehouse worker /
admin assistant / heavy equipment operator roles, plus a unique
specialist per contract (industrial electrician / bilingual safety
coord / drone surveyor — the "specialist not on the standard roster"
case from J's spec). Counts and skill mixes vary per region.
New driver phase 4b — paraphrase handover. Bob runs qwen2.5-paraphrased
versions of Alice's contract queries against Alice's playbook
namespace. Tests whether institutional memory propagates across
coordinators AND across natural wording variation that Bob would
introduce when running Alice's contract.
Run #002 result (5K workers + 10K ethereal_workers, 4 demand × 3
coords + paraphrase handover):
Diversity (the question J asked: locking or cycling?):
Same-role-across-contracts Jaccard = 0.119 (n=9)
→ 88% of workers DIFFER across regions for the same role name.
Milwaukee warehouse vs Indianapolis warehouse vs Chicago
warehouse pull mostly distinct top-K from the same population.
The system locks into geo+cert+skill context, not cycling.
Different-roles-same-contract Jaccard = 0.004 (n=18)
→ role-specific retrieval works (unchanged from Phase 1).
Determinism: Jaccard = 1.000 (n=12) — unchanged.
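For reference, the Jaccard reported here is plain set overlap over the
top-K IDs; a minimal Go sketch (assuming top-K results are ID slices):

    package diversity

    // jaccard computes |A ∩ B| / |A ∪ B| over two top-K ID lists
    // (duplicates, if any, are collapsed).
    func jaccard(a, b []string) float64 {
        sa, sb := map[string]bool{}, map[string]bool{}
        for _, id := range a {
            sa[id] = true
        }
        for _, id := range b {
            sb[id] = true
        }
        inter := 0
        for id := range sa {
            if sb[id] {
                inter++
            }
        }
        union := len(sa) + len(sb) - inter
        if union == 0 {
            return 0
        }
        return float64(inter) / float64(union)
    }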
Learning:
Verbatim handover 4/4 = 100% (trivial case, expected)
Paraphrase handover 4/4 = 100% (HARD case — passes!)
Of those 4 paraphrase recoveries:
- 2 used boost (Alice's recording was already in Bob's
paraphrase top-K; ApplyPlaybookBoost re-ranked to top-1)
- 2 used Shape B inject (recording wasn't in Bob's
paraphrase top-K; InjectPlaybookMisses brought it in)
The boost/inject mix is healthy — both paths are used and both
produce correct top-1s. Multi-coord institutional memory propagation
is empirically working under wording variation.
Sample warehouse worker top-1s across contracts (proves diversity):
alice / Milwaukee → w-713
bob / Indianapolis → e-8447
carol / Chicago → e-7145
Three different workers from the same 15K-person population,
selected on geo+cert+skill context.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three coordinators (alice / bob / carol) with three contracts
(Milwaukee distribution / Indianapolis manufacturing / Chicago
construction). 7-phase scenario runner: baseline → surge → merge →
handover → split → reissue → analysis. Each coord has a separate
playbook namespace (playbook_{name}) so institutional memory stays
isolated by default but transferable on demand.
Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints,
and Langfuse tracing — those are Phase 2/3.
Run #001 (52 events, 4 queries × 3 coords × 2 demand flavors):
Diversity:
Different-roles-same-contract Jaccard = 0.004 (n=18)
→ role-specific retrieval is working perfectly. Different
roles within one contract pull totally different worker
pools. System is NOT cycling; locks into per-role retrieval.
Same-role-across-contracts Jaccard = N/A (n=0)
→ TEST-DESIGN ISSUE: the 3 contracts use distinct role names
per industry (warehouse worker / production worker / general
laborer), so no exact-name overlaps exist. Phase 2 should
either share at least one role across contracts OR add a
skill-based diversity metric.
Determinism: Jaccard = 1.000 (n=12)
→ HNSW + Ollama retrieval is fully deterministic on identical
query text. coder/hnsw + nomic-embed-text are stable.
Learning: handover hit rate = 4/4 = 100%
→ Bob inherits Alice's recordings perfectly when Bob runs
identical queries against Alice's playbook namespace. CAVEAT:
this tests the trivial verbatim case, not paraphrase handover.
The harder test (bob runs paraphrased queries with alice's
playbook) is Phase 2 work.
Per-event capture in JSON: every matrix.search response is logged
with phase / coordinator / contract / role / query / top-K IDs +
distances + per-corpus counts + boosted/injected counts. Reviewable
via:
jq '.events[] | select(.phase == "merge")'
jq '.events[] | select(.coordinator == "alice")'
jq '.events[] | select(.role == "warehouse worker")'
Notable finding from the per-event capture: carol's "general laborer"
and "crane operator" queries both surface w-1009 as top-1, with crane
at distance 0.098 (very tight) and general laborer at 0.297. The
system found a worker who legitimately covers both roles — realistic
for small construction crews.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 5-loop substrate's load-bearing gate is verified — playbook +
matrix indexer give the results we're looking for. Per the report's
rubric, lift ≥ 50% of discoveries means matrix is doing real work;
7/8 = 87.5% blew through that.
Harness was structurally hiding bugs behind a 5-daemon stripped boot.
Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:
1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd
rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark) —
wrong domain for staffing queries; replaced with ethereal_workers
(10K rows, real staffing schema, "e-" id prefix to avoid collision
with workers' "w-"). staffing_workers driver gains -index-name +
-id-prefix flags so the same binary serves both corpora
7. local_judge was using qwen3.5:latest (a vision-SSM 256K-ctx build)
   running ~30s per judge call against the lift loop; reverted to
   qwen2.5:latest (~1s/call, 30× faster; lift results held)
The contract drifts above (fixes 1 and 3) are now locked into a
cmd/<bin>/main_test.go so future drift fires in `go test`, not in a
reality run. R-005 closed:
- cmd/matrixd/main_test.go (new) — playbook record drift detector +
score bounds + 6 routes mounted
- cmd/queryd/main_test.go — wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new) — 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new) — 4 routes + invalid-op + unknown-mode
`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.
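The pinned drift tests are roughly of this shape (an illustrative
sketch, not the actual test bodies; the struct and its json tag stand
in for the contract being locked):

    package main

    import (
        "bytes"
        "encoding/json"
        "testing"
    )

    // searchRequest stands in for the driver-side request struct; the
    // json tag is the wire contract being pinned.
    type searchRequest struct {
        QueryText string `json:"query_text"`
    }

    // TestSearchRequestFieldName fails loudly if the wire field ever
    // drifts back to "query" (or anything other than "query_text").
    func TestSearchRequestFieldName(t *testing.T) {
        b, err := json.Marshal(searchRequest{QueryText: "forklift operator detroit"})
        if err != nil {
            t.Fatal(err)
        }
        if !bytes.Contains(b, []byte(`"query_text"`)) {
            t.Fatalf("matrixd expects query_text; driver marshals %s", b)
        }
    }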
Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
Queries 21 (staffing-domain, 7 categories)
Discoveries 8 (judge ≠ cosine top-1)
Lifts 7/8 (87.5%)
Boosts triggered 9
Mean Δ distance -0.053 (warm closer than cold)
OOD honesty dental/RN/SWE rated 1, no fake matches
Cross-corpus boosts confirmed (e- ↔ w- swaps in lifts)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
rates top-K → record playbook entry pointing at the highest-rated
result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
ranking shift. Lift = real if recorded answer becomes top-1.
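In Go terms the per-query loop is roughly the following (names and
signatures are stand-ins for the actual matrix.search, judge, and
playbook-record calls):

    package lift

    // Hit, SearchResp, and Result sketch the shapes the driver works with.
    type Hit struct {
        ID       string
        Distance float64
    }

    type SearchResp struct{ TopK []Hit }

    type Result struct {
        Discovery bool    // judge-best differs from cold top-1
        Lift      bool    // recorded answer became warm top-1
        DeltaDist float64 // warm minus cold top-1 distance
    }

    // Stubs standing in for the real HTTP calls; all names are assumptions.
    func matrixSearch(q string, usePlaybook bool) SearchResp {
        return SearchResp{TopK: []Hit{{ID: "w-0000", Distance: 0.5}}}
    }
    func judgeBest(q string, topK []Hit) Hit { return topK[0] }
    func recordPlaybook(q string, best Hit)  {}

    func runQuery(q string) Result {
        cold := matrixSearch(q, false) // pass 1: use_playbook=false
        best := judgeBest(q, cold.TopK)
        recordPlaybook(q, best) // may not be cold top-1 by distance (the discovery)
        warm := matrixSearch(q, true) // pass 2: use_playbook=true
        return Result{
            Discovery: best.ID != cold.TopK[0].ID,
            Lift:      warm.TopK[0].ID == best.ID,
            DeltaDist: warm.TopK[0].Distance - cold.TopK[0].Distance,
        }
    }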
Files:
- scripts/playbook_lift/main.go: driver (391 LoC)
- scripts/playbook_lift.sh: stack bring-up + report gen
- tests/reality/playbook_lift_queries.txt: query corpus (5 placeholders;
  J writes the real 20+)
- reports/reality-tests/README.md: framework + interpretation
- .gitignore: track reports/reality-tests/ but ignore per-run JSON
  evidence
This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.
Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces single-shot baselines (40% noise floor flagged in Phase E)
with noise-aware regression detection.
What changed:
ingest n=3 runs (was 1) with 3-pass warmup
vector_add n=3 runs (was 1) with 3-pass warmup
query n=20 samples (unchanged) with 50-pass warmup
search n=20 samples (unchanged) with 50-pass warmup
RSS n=1 (unchanged — steady-state in G0)
Each metric stored as {value: median, mad: median absolute
deviation} in baseline.json (schema: v2-multisample-mad).
New regression detection:
threshold = max(3 * baseline.mad, baseline.value * 0.75)
REGRESSION iff |actual - baseline.value| > threshold AND the direction
signals worse (lower throughput / higher latency).
Why these specific numbers:
3*MAD = standard "outside the spread" bound; lets high-variance
metrics tolerate their own noise.
75% floor = empirical observation: even with 50 warmups, single-
host inter-run variance on bootstrap-cold queryd was
consistently 90-130% on this box. 75% catches >75%
regressions cleanly while ignoring known noise.
lib/metrics.sh: new proof_compute_mad helper computes MAD from a
file of one-number-per-line samples. Used for both regen (to write
the baseline.mad value) and diff (read from baseline).
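The same math in Go, for clarity (the shipped helper is bash in
lib/metrics.sh; this is a sketch of the computation, not the helper):

    package perf

    import (
        "math"
        "sort"
    )

    // median of a non-empty sample slice.
    func median(xs []float64) float64 {
        s := append([]float64(nil), xs...)
        sort.Float64s(s)
        n := len(s)
        if n%2 == 1 {
            return s[n/2]
        }
        return (s[n/2-1] + s[n/2]) / 2
    }

    // mad is the median absolute deviation around the median.
    func mad(xs []float64) float64 {
        m := median(xs)
        dev := make([]float64, len(xs))
        for i, x := range xs {
            dev[i] = math.Abs(x - m)
        }
        return median(dev)
    }

    // isRegression applies threshold = max(3*MAD, 0.75*baseline) and
    // flags only when the delta exceeds it in the "worse" direction.
    func isRegression(actual, baseValue, baseMAD float64, higherIsWorse bool) bool {
        threshold := math.Max(3*baseMAD, 0.75*baseValue)
        delta := actual - baseValue
        worse := (higherIsWorse && delta > 0) || (!higherIsWorse && delta < 0)
        return worse && math.Abs(delta) > threshold
    }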
Honest finding from this iteration's 3 back-to-back diff runs:
query_ms shows 90-130% delta from baseline consistently — not
random noise but a systematic 2x gap between regen-time and
steady-state. The regen captured a particularly fast moment;
steady-state is slower. Operator workflow: regenerate the
baseline at a known-representative state via
`bash tests/proof/run_proof.sh --mode performance --regenerate-baseline`
rather than expecting the harness to track a moving target.
The harness's value here is the EVIDENCE RECORD (every run captures
median+MAD+p95 plus all raw samples in raw/metrics/), not the gate.
Even false-positive REGRESSION skips give operators "this run was
20ms vs baseline 10ms" which is informative.
Sample counts also written into baseline.json under "samples" so a
future audit can verify the methodology that produced the values.
Verified across 3 back-to-back runs:
ingest_rows_per_sec PASS (delta within 75%, mostly < 10%)
vectors_per_sec_add PASS
search_ms PASS
rss_* PASS
query_ms REGRESSION flagged (130/100/90%) — known
systematic gap, not a bug
Closes the "40% noise floor" follow-up from Phase E FINAL_REPORT.
Honest about limitations: hard regression gating on a busy single-
host setup needs either much bigger sample counts (n≥100), longer
warmup, or moving to a dedicated benchmark host. Documented inline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Caught by the audit rerun: with cache-warm binaries, 04 fires its
first SELECT faster than queryd's 500ms refresh tick — Q1 returned
400 ("table not found") even though 03_ingest had registered the
manifest. Subsequent queries (after the next tick) succeeded.
The fix is an eventual-consistency wait, not a retry: queryd's
contract is that views appear within one tick of catalogd having the
manifest. Production code does not need changing.
Added to lib/http.sh:
proof_wait_for_sql <budget_sec> <sql>
polls a SQL probe until it returns 200 or budget elapses; emits
no evidence (test setup, not a claim).
Used in 04_query_correctness:
Wait up to 5s for queryd to have the view before running the 5
SQL assertions. Skip-with-loud-reason if the view never appears.
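The helper itself is shell; the same loop in Go terms, for
illustration (the URL is a parameter because the real endpoint wiring
isn't shown here; the {"sql": ...} field matches queryd's contract):

    package probe

    import (
        "bytes"
        "net/http"
        "strconv"
        "time"
    )

    // waitForSQL polls the SQL endpoint until it returns 200 or the
    // budget elapses. Test setup only; no evidence is recorded.
    func waitForSQL(url string, budget time.Duration, sql string) bool {
        deadline := time.Now().Add(budget)
        body := []byte(`{"sql": ` + strconv.Quote(sql) + `}`)
        for time.Now().Before(deadline) {
            resp, err := http.Post(url, "application/json", bytes.NewReader(body))
            if err == nil {
                resp.Body.Close()
                if resp.StatusCode == http.StatusOK {
                    return true
                }
            }
            time.Sleep(250 * time.Millisecond)
        }
        return false
    }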
Verified: integration mode back to 104 pass / 0 fail / 1 skip after
fix. The skip is the unchanged GOLAKE-085 informational record.
This is exactly the kind of finding the harness was designed to
surface — the regression existed in the codebase the moment Phase D
shipped, but only fired when the next compare run hit cache-warm
timing. Without the harness, it would have surfaced on a CI run
weeks from now and been hard to bisect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per docs/TEST_PROOF_SCOPE.md, this is the closing deliverable for the
proof harness: a single document that names what's proven, what's
partially proven, what failed, what was skipped and why, what evidence
exists for each, what bottlenecks were measured, what contract drift
was found, what refactor risks remain, and what to fix first.
Per-run report dirs (tests/proof/reports/proof-<ts>/) keep their
existing summary.md + summary.json + raw/ structure — they are the
replayable evidence chain. FINAL_REPORT.md is the stable, repo-tracked
synthesis pointing at them.
Headline findings (no surprises — harness behaves as designed):
- 24 claims encoded; 22 fully proven, 1 informational (GOLAKE-085
duplicate vector ID, contract not yet specified), 0 failed.
- 4 contract-drift findings recorded as canonical: vectord add
body field is `items` not `vectors`, search response is `results`
not `hits`, index info is `length` not `count`, status codes
201/204 not 200. All caught during Phase B; all now pinned by the
harness.
- Performance baseline shows queryd as the largest RSS (69 MiB,
DuckDB process); single-sample noise floor is ~40% — tightening
to multi-sample medians is a documented Sprint follow-up.
- HIGH-risk audit findings (R-001 queryd /sql, R-002/R-003 untested
shared+storeclient) are NOT closed by the harness — it's a
multiplier, not a replacement for unit tests + auth posture.
The proof harness is complete. 11 cases · 3 modes · 168 assertions
peak across all tiers · ~22s total wall (contract+integration+perf).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GOLAKE-100. First run writes tests/proof/baseline.json; subsequent
runs diff against it. A >10% regression emits a SKIP with REGRESSION
detail (not a fail: the perf claim is required:false in claims.yaml so
the gate stays green; the human summary tells the regression story
honestly). Skip-with-loud-reason if any earlier case in the run
failed, per spec "performance only after contract+integration pass."
Workload (deterministic, repeatable):
ingest 1000-row CSV (5 roles × 5 cities × seeded scores) → /v1/ingest
query SELECT count(*) ×20 against the just-ingested dataset
vector add 200 dim=4 vectors with formulaic content (no Ollama)
search ×20 against the perf index with a fixed query vector
RSS per-service post-workload sample via /proc/<pid>/status
Recorded metrics:
ingest_rows_per_sec, query_p50_ms, query_p95_ms,
vectors_per_sec_add, search_p50_ms, search_p95_ms,
rss_{storaged,catalogd,ingestd,queryd,vectord,embedd,gateway}_mb
baseline.json on this box (committed):
25000 rows/sec ingest · 17ms p50 / 24ms p95 query
6250 vectors/sec add · 8ms p50 / 20ms p95 search
queryd 69 MiB · vectord 14 MiB · others 11-29 MiB
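For reference, p50/p95 are percentile cuts over the n=20 samples; a
nearest-rank sketch in Go (one common definition; the actual
computation lives in lib/metrics.sh and may differ):

    package perf

    import (
        "math"
        "sort"
    )

    // percentile returns the nearest-rank p-th percentile (0 < p <= 100).
    func percentile(samples []float64, p float64) float64 {
        s := append([]float64(nil), samples...)
        sort.Float64s(s)
        rank := int(math.Ceil(p/100*float64(len(s)))) - 1
        if rank < 0 {
            rank = 0
        }
        return s[rank]
    }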
Honest measurement-design finding from the very first compare run:
back-to-back runs surfaced -41% ingest and +29% query p50 — pure
disk-cache + queryd-cold-start noise. Single-sample baselines have a
real noise floor of ≈40%. Recorded as REGRESSION skips so the human
summary surfaces it as noise rather than a code regression. Tightening
the threshold
or moving to multi-sample medians is a Phase E recommendation.
Verified end-to-end:
just proof contract — 53 pass · 1 skip · ~4s
just proof integration — 104 pass · 1 skip · ~8s
just proof performance — 110 pass · 3 skip · ~10s
just verify — 9 smokes still green · 29s
All 11 cases (4 contract + 6 integration + 1 performance) deterministic
end-to-end. Phase E (final report against the 9 mandated questions)
is the last piece.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the integration tier — full chain CSV→Parquet→SQL and full
text→embed→vector→search. All 10 cases (4 contract + 6 integration)
end-to-end deterministic; 8s wall total.
Cases added:
01_storage_roundtrip.sh
GOLAKE-010-012. PUT 1KiB → GET sha256-equal → LIST contains key
→ DELETE 200/204 → GET 404. Deterministic key under
proof/<case_id>/ so concurrent runs don't collide.
02_catalog_manifest.sh
GOLAKE-020-022. Fresh register existing=false → manifest read
matches → list contains dataset_id → idempotent re-register
existing=true with stable dataset_id → schema-drift register
409 (the ADR-020 contract). Per-run unique name via
PROOF_RUN_ID so existing=false is meaningful.
03_ingest_csv_to_parquet.sh
GOLAKE-030. workers.csv (5 rows) via /v1/ingest multipart →
parquet object on storaged → catalog manifest with row_count=5.
Verifies content-addressed key shape (datasets/<n>/<fp>.parquet).
04_query_correctness.sh
GOLAKE-040. The 5 SQL assertions from fixtures/expected/queries.json
against the workers fixture: count=5, Chicago=2, max=95,
safety→Barbara, Houston avg=89.5. Iterates the YAML claims, runs
each query, compares response columns to expected values.
06_vector_add_search.sh integration extension
GOLAKE-051. text → /v1/embed (4 docs from fixtures/text/docs.txt)
→ vectord add → search by query embedding. Top-1 ID per query
asserted against fixtures/expected/rankings.json. First run (or
--regenerate-rankings) writes the fixture and emits a skip with
explicit reason; subsequent runs assert against it.
07_vector_persistence_restart.sh
GOLAKE-070. add 4 unit-basis vectors → search → record top-1
distance → SIGTERM vectord → restart with the same --config →
poll /health for 8s → search again → top-1 ID and distance match
bit-identically. Skips with reason if vectord PID can't be found
or post-restart bind times out.
Two harness improvements landed alongside:
run_proof.sh writes a temp lakehouse_proof.toml with
refresh_every="500ms" override and passes --config to all booted
binaries. Production default is 30s; 04_query_correctness needs
queryd to pick up the new view within a tick. Production config
unchanged.
cleanup() now pgreps for any orphan bin/<svc> processes (anchored
to start-of-argv per memory feedback_pkill_scope.md) so a case
that restarts a service mid-run still gets cleaned up.
lib/http.sh adds proof_call(case_id, probe, method, url, args...)
— escape hatch for cases that need raw curl args (multipart -F,
custom headers). Used by 03_ingest for the multipart upload that
conflicts with proof_post's --data + Content-Type defaults.
lib/env.sh exports PROOF_RUN_ID — short unique id derived from the
report directory timestamp. Used by 02 and 07 for fresh-each-run
state isolation.
Two real findings recorded as evidence (no code changes):
- rankings.json fixture pinned: 4 queries → 4 distinct top-1 docs
via nomic-embed-text. A model swap that changes ranking now
fails the harness loudly; --regenerate-rankings is the override.
- vectord persistence kill+restart preserves top-1 distance
bit-identically — the LHV1 single-Put framed format from
G1P round-trips exactly through Save/Load.
Verified end-to-end:
just proof contract — 53 pass (4 cases)
just proof integration — 104 pass (10 cases) · 8s wall
just verify — 9 smokes still green · 33s wall
Phase D (performance baseline) lands next: 10_perf_baseline measures
rows/sec ingest, vectors/sec add, p50/p95 query+search latency, RSS,
CPU. First run writes tests/proof/baseline.json; later runs diff
against it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Added the contract tier above the 00_health canary. All 5 contract cases
now cover GOLAKE-001-003, 050, 060-061, 080-085 — 53 assertions pass,
1 informational skip, 0 fail. Wall: 4s end-to-end (cached binaries).
Cases:
05_embedding_contract.sh
GOLAKE-050. POST /v1/embed with one short text → asserts dim=768,
one vector returned, vector length matches dimension, sum of
squared elements > 0 (proxy for non-zero), response.model echoed.
Skips with explicit reason if Ollama is unreachable (502 from
embedd) — per spec hard rule "skipped tests do not appear as
passed."
06_vector_add_search.sh
GOLAKE-060 + GOLAKE-061. Synthetic dim=4 unit basis vectors.
Create index → add 3 vectors → get-index returns length=3 →
search([1,0,0,0],k=3) returns v1 at rank 1 with distance < 0.001.
Cleanup with DELETE. No embedd dependency — pure contract layer.
08_gateway_contracts.sh
GOLAKE-003. For each /v1/* route, asserts gateway and direct
upstream return identical status AND identical response body
(sha256 match). Confirms the gateway is a proxy, not a transformer.
Status passthrough verified on both 200 path (storage/list,
catalog/list) and 4xx path (sql empty body → 400 from queryd).
09_failure_modes.sh
GOLAKE-080..085. Six failure-mode contracts:
080 malformed JSON → 4xx on catalog/ingest/sql/embed
081 missing required field → 4xx on catalog/vectors/embed
082 bad SQL → 4xx with non-empty error body
083 vector dim mismatch → 4xx
084 missing storage object → 404
085 duplicate vector ID → INFORMATIONAL (spec says required:false);
first/second statuses recorded as evidence, with the contract to be
decided later from that record.
Two new lib helpers in lib/assert.sh:
proof_assert_status_in <id> <claim> "200 201 204" <probe>
pass if status is in the space-separated list. Used for
delete-returns-200-or-204 case where vectord returns 204.
proof_assert_status_4xx <id> <claim> <probe>
pass if status in [400, 500). Used for failure modes where the
specific 4xx code may vary (400 vs 422 vs 409). Records actual
code as evidence.
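In Go terms the two checks are just the following (illustrative; the
shipped helpers are bash in lib/assert.sh):

    package assert

    import (
        "strconv"
        "strings"
    )

    // statusIn reports whether code appears in a space-separated allow
    // list, e.g. statusIn(204, "200 201 204").
    func statusIn(code int, allowed string) bool {
        for _, f := range strings.Fields(allowed) {
            if n, err := strconv.Atoi(f); err == nil && n == code {
                return true
            }
        }
        return false
    }

    // status4xx matches any client-error code in [400, 500).
    func status4xx(code int) bool { return code >= 400 && code < 500 }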
Two real contract findings recorded by the harness during build:
- vectord add expects {"items": [...]}, not {"vectors": [...]}.
My initial test sent the wrong field, which would have masked the
mismatch forever in CI; the harness caught it via the assertion failure.
- vectord create returns 201 Created, delete returns 204 No Content.
Documented in the test fixtures as canonical.
Regression check: just verify wall 33s; vet + test + 9 smokes still green.
Phase C (integration) lands next: 01_storage_roundtrip, 02_catalog_manifest,
03_ingest_csv_to_parquet, 04_query_correctness, the 05/06 integration extensions,
07_vector_persistence_restart.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per docs/TEST_PROOF_SCOPE.md, building the claims-verification tier
above the smoke chain. This commit lays the scaffolding and proves
the orchestrator end-to-end with one canary case (00_health).
What landed:
tests/proof/
README.md how to read a report, layout, modes
claims.yaml 24 claims enumerated (GOLAKE-001..100)
run_proof.sh orchestrator with --mode {contract|integration|performance}
and --no-bootstrap / --regenerate-{rankings,baseline}
lib/
env.sh service URLs, report dir, mode, git context
http.sh curl wrappers writing per-probe JSON + body + headers
assert.sh proof_assert_{eq,ne,contains,lt,gt,status,json_eq} +
proof_skip — each emits one JSONL record per call
metrics.sh start/stop timers, value capture, RSS sampling,
percentile compute (for Phase D)
cases/
00_health.sh canary — gateway + 6 services /health → 200,
body identifies service, latency < 500ms (21 assertions)
fixtures/
csv/workers.csv spec's 5-row deterministic CSV
text/docs.txt 4 deterministic vector docs
expected/queries.json expected results for the 5 SQL assertions
Wired into the task runner:
just proof contract # canary only this commit
just proof integration # Phase C
just proof performance # Phase D
.gitignore: /tests/proof/reports/* with !.gitkeep — same pattern as
reports/scrum/_evidence/. Per-run output is a runtime artifact.
Specs landed alongside (J's drops):
docs/TEST_PROOF_SCOPE.md the harness contract this implements
docs/CLAUDE_REFACTOR_GUARDRAILS.md process discipline this harness obeys
Verified end-to-end (cached binaries):
just proof contract wall < 2s, 21 pass / 0 fail / 0 skip
just verify wall 31s, vet + test + 9 smokes still green
Two bugs fixed during canary run, both in run_proof.sh aggregation:
- grep -c prints 0 AND exits 1 on zero matches; the `|| echo 0`
fallback therefore concatenated "0\n0" and broke jq --argjson +
integer comparison. Fixed via a _count helper that captures
count-or-zero cleanly.
- the per-case table iterated over case scripts (filename-based), but
cases write evidence under CASE_ID. Switched to JSONL-file iteration
so multi-case scripts work and the mapping is faithful.
Phase B (contract cases) lands next: 05_embedding, 06_vector_add,
08_gateway_contracts, 09_failure_modes. Each sources the same lib
helpers and writes to the same report shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>