Bash driver wrapping /v1/chat for Opus + Kimi + Qwen3-coder review
runs. Used today to scrum the 4-phase wave (1,624 LoC of chatd +
config-refactor + Rust cleanup) and caught 2 BLOCKs + 2 WARNs.
Usage:
./scripts/scrum_review.sh <bundle.diff> <bundle_label>
Output: reports/scrum/_evidence/<DATE>/verdicts/<bundle>_<reviewer>.md
verbatim, per the evidence-only convention. Per-reviewer latency +
token counts captured in the report header.
System prompt enforces the BLOCK/WARN/INFO + WHERE/WHAT/WHY shape
per feedback_cross_lineage_review.md — leads with verdict, no
preamble (Kimi tends to spend tokens thinking otherwise).
Reviewer fleet matches project_golang_lakehouse.md "Scrum routing":
- opencode/claude-opus-4-7
- openrouter/moonshotai/kimi-k2-0905
- openrouter/qwen/qwen3-coder
This is the first dogfood of chatd as the scrum vehicle — eats its
own /v1/chat dispatcher.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3-lineage scrum (Opus 4.7 / Kimi K2.6 / Qwen3-coder) on today's wave
landed 4 real findings (2 BLOCK + 2 WARN) and 2 INFO touch-ups.
Verbatim verdicts + disposition table at:
reports/scrum/_evidence/2026-04-30/
B-1 (BLOCK Opus + INFO Kimi convergent) — ResolveKey API:
collapse from 3-arg (envVar, envFileName, envFilePath) to 2-arg
(envVar, envFilePath). Pre-fix every chatd caller passed the env
var name twice; if operator renamed *_key_env in lakehouse.toml
while keeping the canonical KEY= line in the .env file, fallback
silently missed.
B-2 (WARN Opus + WARN Kimi convergent) — handleProviders probe:
drop the synthesize-then-Resolve probe; look up by name directly
via Registry.Available(name). Prior probe synthesized "<name>/probe"
model strings and routed through Resolve, fragile to any future
routing rule (e.g. cloud-suffix special case).
B-3 (BLOCK Opus single — verified by trace + end-to-end probe) —
OllamaCloud.Chat StripPrefix used "cloud" but registry routes
"ollama_cloud/<m>". Result: upstream got the prefixed model name
and 400'd. Smoke missed it because chatd_smoke runs without
ollama_cloud registered. Now strips the right prefix; new
TestOllamaCloud_StripsCorrectPrefix locks both prefix + suffix
cases. Verified live: ollama_cloud/deepseek-v3.2 round-trips
cleanly through the real ollama.com endpoint.
B-4 (WARN Opus single) — Ollama finishReason: read done_reason
field instead of inferring from done bool alone. Newer Ollama
reports done=true with done_reason="length" on truncation; the
prior code mapped that to "stop" and lost the truncation signal
the playbook_lift judge needs to retry. New
TestFinishReasonFromOllama_PrefersDoneReason covers the fallback
ladder.
INFOs:
- B-5: replace hand-rolled insertion sort in Registry.Names with
sort.Strings (Opus called the "avoid sort import" comment a
false economy — correct).
- A-1: clarify the playbook_lift.sh comment around -judge "" arg
passing (Opus noted the comment said "env priority" but didn't
reflect that the empty arg also passes through the Go driver's
resolution chain).
False positives dismissed (3, documented in disposition.md):
- Kimi: TestMaybeDowngrade_WithConfigList wrong assertion (test IS
correct per design — model excluded from weak list = strong = downgrade)
- Qwen: nil-deref claim (defensive code already handles nil)
- Opus: qwen3.5:latest doesn't exist on Ollama hub (true on the
public hub but local install has it)
just verify: PASS. chatd_smoke 6/6 PASS. New regression tests:
3 (B-2, B-3, B-4 each get a focused test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
new cmd/chatd on :3220 routes /v1/chat to the right provider based
on model-name prefix or :cloud suffix. closes the architectural gap
named in lakehouse.toml [models]: tiers map to model IDs, but until
phase 4 there was no service that could actually CALL those models
from go.
routing rules (registry.Resolve):
ollama/<m> → local Ollama (prefix stripped)
ollama_cloud/<m> → Ollama Cloud
<m>:cloud → Ollama Cloud (suffix variant — kimi-k2.6:cloud)
openrouter/<v>/<m> → OpenRouter (prefix stripped, OpenAI-compat)
opencode/<m> → OpenCode unified Zen+Go
kimi/<m> → Kimi For Coding (api.kimi.com/coding/v1)
bare names → local Ollama (default)
provider implementations:
- internal/chat/types.go Provider interface, Request/Response, errors
- internal/chat/registry.go prefix + :cloud suffix dispatch
- internal/chat/ollama.go local Ollama via /api/chat (think=false default)
- internal/chat/ollama_cloud.go Ollama Cloud via /api/generate (Bearer auth)
- internal/chat/openai_compat.go shared OpenAI Chat Completions for the
OpenRouter/OpenCode/Kimi family
- internal/chat/builder.go BuildRegistry from BuilderInput;
ResolveKey reads env then .env file fallback
config:
- ChatdConfig in internal/shared/config.go with bind, ollama_url,
per-provider key env names + .env fallback paths, timeout
- Gateway gains chatd_url + /v1/chat + /v1/chat/* routes
- lakehouse.toml [chatd] block with /etc/lakehouse/<provider>.env defaults
tests (19 in internal/chat):
- registry: prefix + :cloud + errors + telemetry + provider listing
- ollama: happy path + prefix strip + format=json + 500 mapping +
flatten_messages
- openai_compat: happy path + format=json + 429 mapping + zero-choices
think=false default in ollama + ollama_cloud — local hot path skips
reasoning, low-budget callers (the playbook_lift judge at max_tokens=10)
get direct answers instead of empty content + done_reason=length.
proven via chatd_smoke acceptance.
acceptance gate: scripts/chatd_smoke.sh — 6/6 PASS:
1. /v1/chat/providers lists exactly registered providers (1 in dev mode)
2. bare model → ollama default with content + token counts + latency
3. explicit ollama/<m> → prefix stripped at upstream
4. <m>:cloud without ollama_cloud registered → 404 (no silent fall-through)
5. unknown/<m> → falls through to default → upstream 502 (no prefix rewrite)
6. missing model field → 400
just verify: PASS (vet + 30 packages × short tests + 9 smokes).
chatd_smoke is a domain smoke (not in just verify, mirrors matrix /
observer / pathway pattern).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
migrate the reality-test harness's judge-model default from a
hardcoded "qwen3.5:latest" string to cfg.Models.LocalJudge.
resolution priority: explicit -judge flag > $JUDGE_MODEL env >
cfg.Models.LocalJudge from lakehouse.toml > hardcoded fallback.
bumping the judge for run #N+1 now means editing one line in
lakehouse.toml [models].local_judge — no Go file or shell script
edits required.
changes:
- scripts/playbook_lift/main.go: -config flag added, judge default
flips to "" so resolution chain runs. Imports internal/shared for
config loader.
- scripts/playbook_lift.sh: JUDGE_MODEL no longer defaulted in bash;
EFFECTIVE_JUDGE resolved by mirror-of-the-Go-chain (env > config
grep > qwen3.5:latest fallback). Used for the Ollama presence
check + report header. Pre-flight grep avoids requiring jq just
to read the toml.
- reports/reality-tests/README.md: documents the 4-step priority
chain.
verified all 4 paths produce the expected judge:
- config (no env): qwen3.5:latest (from lakehouse.toml)
- env override: env wins
- flag override: flag wins over env
- missing config: DefaultConfig fallback still gives qwen3.5:latest
just verify PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
migrate the strong-model auto-downgrade gate from a hardcoded weak
list to cfg.Models.WeakModels. backward compatible: existing API
preserved, callers that don't migrate keep using DefaultWeakModels.
changes:
- internal/matrix/downgrade.go: split IsWeakModel into rule-based
base (`:free` suffix/infix) + literal-list lookup. New
IsWeakModelInList(model, list) takes the config-supplied list.
DowngradeInput grows a WeakModels field; nil falls back to
DefaultWeakModels (preserves pre-phase-2 behavior).
- internal/workflow/modes.go: add MatrixDowngradeWithWeakList(list)
factory mirroring MatrixSearch's pattern. Plain MatrixDowngrade
kept for backward compat.
- cmd/matrixd/main.go: handlers struct holds weakModels populated
from cfg.Models.WeakModels at startup; handleDowngrade threads it
into every DowngradeInput.
- cmd/observerd/main.go: registerBuiltinModes accepts weakModels
and uses the factory variant. observerd reads cfg.Models.WeakModels
in main().
end-to-end verified: downgrade + matrix + observer + workflow smokes
all pass. Existing TestMaybeDowngrade_TruthTable + TestIsWeakModel
unchanged (backward compat). Two new tests cover the config path:
- TestIsWeakModelInList — covers rule + literal + empty + nil
- TestMaybeDowngrade_WithConfigList — verifies cfg list overrides
default
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codifies the small-model-pipeline tiering (per project_small_model_pipeline_vision.md)
in lakehouse.toml [models] section. Tier names map to actual model
IDs; bumping a model means editing one line, not hunting through code.
Tier philosophy:
- local_* : on-box Ollama. Inner-loop hot path. Repeated calls.
- cloud_* : Ollama Cloud (Pro plan). Larger context, fail-up tier.
- frontier_* : OpenRouter / OpenCode. Rate-limited, billed per call.
weak_models is the codified "local-hot-path eligible" list — phase 2
will migrate matrix.downgrade to read it instead of hardcoding.
Defaults reflect 2026-04-29 architecture: qwen3.5:latest as local
(stronger than qwen2.5, same JSON-clean property), kimi-k2.6 as cloud
judge (kimi-k2:1t still upstream-broken), opus-4-7 + kimi-k2-0905 as
frontier review/arch via OpenRouter, opencode/claude-opus-4-7 as
frontier_free leveraging the OpenCode subscription.
3 new tests in internal/shared/config_test.go:
- TestDefaultConfig_ModelsTier — locks tier defaults
- TestModelsConfig_IsWeak — weak-bypass list
- TestLoadConfig_ModelsTOMLRoundTrip — override semantics
just verify PASS (g2 had one flake on first run — Ollama transfer
truncation; clean on retry, unrelated to this change).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
rates top-K → record playbook entry pointing at the highest-rated
result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
ranking shift. Lift = real if recorded answer becomes top-1.
Files:
- scripts/playbook_lift/main.go driver (391 LoC)
- scripts/playbook_lift.sh stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt query corpus (5 placeholders;
J writes real 20+)
- reports/reality-tests/README.md framework + interpretation
- .gitignore track reports/reality-tests/
but ignore per-run JSON evidence
This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.
Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 small fixes from the §3.8 scrum2 review wave:
- workflow.stringifyValue now JSON-marshals maps/slices instead of
fmt.Sprint %v (Opus+Kimi convergent: LLM modes were getting Go's
map[k:v] syntax, which is unparseable as JSON context).
- workflow.detectCycle removed — duplicate of topoSort that discarded
the useful node ID. Validate() now calls topoSort directly and
returns its wrapped ErrCycle.
- observer.SourceWorkflow named constant — was an implicit string
cast (observer.Source("workflow")) at the cmd/observerd handler.
- Unused context imports + dead silencer comments removed across
workflow/modes.go and observerd/main.go.
- Unused store parameter dropped from registerBuiltinModes (reserved
comment removed; can be re-added when a mode actually needs it).
just verify still PASS — these are pure cleanup, no behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audited stash-clean c7e3124 (30 commits past rerun-1 4840c10).
3 HIGH risks closed (R-002 internal/shared, R-003 internal/storeclient,
R-008 queryd/db.go). 3 advanced to partial (R-001 via fail-loud-bind +
opt-in auth, R-006 via g2_smoke_fixtures, R-007 via ADR-003 auth.go).
Biggest move: Agent Memory Correctness 4 → 9 — pathway Mem0 ops
(ADD/UPDATE/REVISE/RETIRE/HISTORY) all tested, including cycle-detection
and retired-trace-exclusion. Sprint 2 acceptance criteria are now
verified code, not design-bar work.
Two new findings:
- F1 (MED): cmd/{matrixd,observerd,pathwayd}/main_test.go absent —
reopens R-005 against new daemons.
- F2 (LOW): scripts/staffing_*/main.go flag-defaults reach
/home/profit/lakehouse/data/...
Evidence under reports/scrum/_evidence/rerun2/ (local; per
.gitkeep convention).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
distillation.score, drift.scorer)
Lands the workflow.Mode adapters for the §3.4 components + the
distillation scorer + drift quantifier. Workflows can now compose
real measurement capabilities; the substrate's parallel
capabilities become composable Lego bricks (per the prior commit's
closing insight).
Modes registered (in observerd's registerBuiltinModes):
Pure-function wrappers (no I/O):
- matrix.relevance → matrix.FilterChunks
- matrix.downgrade → matrix.MaybeDowngrade
- distillation.score → distillation.ScoreRecord
- drift.scorer → drift.ComputeScorerDrift
HTTP-backed:
- matrix.search → POST matrixd /matrix/search
(registered only when matrixd_url is set)
Fixture (kept from §3.8 first slice):
- fixture.echo, fixture.upper
internal/workflow/modes.go:
Each mode follows the same glue pattern: marshal generic input
through a typed struct (free schema validation + clear error
messages), call the underlying capability, return a generic
output map. Roundtrip-via-JSON gives us schema validation
without writing custom field-by-field coercion.
internal/workflow/modes_test.go (10 tests, all PASS):
- matrix.relevance filters adjacency pollution (Connector kept,
catalogd::Registry dropped — same headline as the relevance
smoke, run through the workflow mode)
- matrix.downgrade flips lakehouse→isolation on strong model;
keeps lakehouse on weak (qwen3.5:latest); errors on missing
fields
- distillation.score rates scrum_review attempt_1 as accepted;
rejects empty record
- drift.scorer reports zero drift on matched inputs; errors on
empty inputs slice
- matrix.search HTTP flow round-trips through httptest fake
matrixd; non-OK status surfaces a clear error
scripts/workflow_smoke.sh (5 assertions PASS, was 4):
New assertion #5: real-mode chain
matrix.downgrade (lakehouse + grok-4.1-fast → isolation)
→ distillation.score (scrum_review attempt_1 → accepted)
Proves §3.4 components compose through the workflow runner with
no fixture intermediation. Both nodes ran successfully, runner
recorded provenance, status=succeeded.
Mode listing assertion now expects 7 modes (5 real + 2 fixture)
instead of just the fixtures.
17-smoke regression all green. SPEC §3.8 acceptance gate G3.8.D
("Mode catalog dispatches matrix.search invocation to the matrixd
backend without going through HTTP") still pending — current path
goes through HTTP for matrix.search, which is the cleaner service-
mesh shape but slower than direct in-process. In-process dispatch
when matrixd is co-resident is a future optimization.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the 2026-04-29 scope-discipline pause: the wave shipped four
pieces beyond SPEC §3.4 component scope, and one architectural
pattern surfaced (Archon-style multi-pass workflow runner) that's
the observer's natural growth path. Document them as port targets
so the next scrum review has authoritative SPEC components.
§3.5 — Drift quantification (loop 5 of the PRD)
Names the SCORER drift work shipped in be65f85 + the deferred
shapes (PLAYBOOK drift, EMBEDDING drift, AUDIT BASELINE drift).
Acceptance gates G3.5.A–B.
§3.6 — Staffing-side structured filter
Names the metadata-filter MVP shipped in b199093 + the deferred
pre-retrieval SQL gate via queryd. Acceptance gates G3.6.A–C.
§3.7 — Operational rating wiring
Names the bulk playbook-record endpoint shipped in 6392772 + the
deferred UI shim, negative-feedback path, and time-decay.
Acceptance gates G3.7.A–B.
§3.8 — Observer-KB workflow runner (Archon-style multi-pass) —
PORT TARGET, not yet started
Documents the architecture J was working on across the Rust
observer-kb branch (10 commits ahead of main, never merged) and
the local Archon mod (committed 2026-04-29 as 3f2afc8 in
/home/profit/external/Archon, not pushed to coleam00/Archon).
The pattern: multi-pass mode chain (extract → validator →
hallucination → consensus → redteam → pipeline → render) where
each pass is a deterministic measurement. The observer is the
natural home — workflows ARE observation patterns whose every
step is recorded. Five components in dependency order: workflow
definition (YAML), node executor (DAG runner), provenance
recording (ObservedOps), mode catalog (matrix.search,
distillation.score, drift.scorer, llm.chat), HTTP surface
(/v1/observer/workflow/run).
Reference materials on the system (preserved, not lost):
- /home/profit/lakehouse/.archon/workflows/lakehouse-architect-review.yaml
(Rust main, 69919d9) — 3-node Archon-via-Lakehouse proof
- /home/profit/external/Archon dev branch — upstream engine
with local pi/provider.ts mod (3f2afc8) for Lakehouse routing
- Rust observer-kb branch — apps/observer-kb/docs/PRD.md +
Python prototypes proven on real ChatGPT/Claude PDF data
Acceptance gates G3.8.A–D. Estimated effort: L.
PRD updated with "Observer as system resource (clarified
2026-04-29)" section pointing at §3.8 as the architectural growth
path. The bare-bones observerd in bc9ab93 is the substrate; the
workflow runner is what makes it the "objective measurement engine"
the small-model pipeline needs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
POST /v1/matrix/playbooks/bulk accepts an array of playbook entries
and records each independently — failures per-entry don't abort the
batch. Designed for two operational use cases:
1. Backfilling historical placement data into the playbook
substrate (the Rust system has 4,701 fill operations recorded
with embeddings; that data deserves to feed the Go learning
loop without a 4,701-call procedural script).
2. Batched click-tracking from a session's worth of coordinator
interactions, posted once at idle rather than per-click.
Per-entry response shape: {index, playbook_id} on success or
{index, error} on failure. Caller can inspect failures without
diffing.
Smoke (scripts/playbook_smoke.sh, new assertion #4):
Bulk POST 3 entries: 2 valid (alpha→widget-a, bravo→widget-b) +
1 invalid (empty query_text). Verifies recorded=2, failed=1,
the 2 valid ones get playbook_ids back, and the invalid one
surfaces its validation error in-line.
Single-record /matrix/playbooks/record from 06e7152 still works
unchanged; bulk is additive. The corpus field can be set per-
entry or once at the batch level (entry-level wins on collision).
Per the small-model autonomous pipeline framing: this is the
"the playbook gets denser with each iteration" mechanism. Click
tracking → bulk POST → playbook entries → future similar queries
get those answers boosted via the existing /matrix/search
use_playbook path. The learning loop now has both inflows wired
(single + bulk) — what remains is the demo UI shim that calls
/feedback on result interaction (deferred — no Go demo UI yet).
15-smoke regression all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses the reality-test gap surfaced by the candidates and
multi-corpus e2e runs (0d1553c, a97881d): semantic-only retrieval
can't gate by status / state / availability. SearchRequest now
takes an optional MetadataFilter map; results whose metadata
doesn't match every key are dropped before top-K truncation.
Filter value semantics:
string|number|bool → exact equality (JSON-canonical, so 1 ≡ 1.0)
[]any → OR within key (any element matching wins)
AND across keys: every filter key must match.
Missing key in metadata = drop. Malformed metadata = drop. Filter
absent or empty = pass through (zero overhead).
The response now reports MetadataFilterDropped so callers can see
how aggressive the filter was without re-querying.
Caveat (also captured in code comment): this is POST-retrieval, not
PRE-filtering via SQL. Aggressive filters can shrink the result set
below K; caller should bump PerCorpusK to compensate. A queryd-
backed pre-filter is a future commit; this lands the user-visible
fix today.
Tests:
- 7 unit tests (internal/matrix/filter_test.go) covering: nil/
empty filter pass-through, missing-metadata always-fails,
single-value exact match (incl. numeric 5 ≡ 5.0), AND across
keys, OR within list, bool match, malformed JSON metadata
- matrix_smoke.sh: new assertion #7 — filter
label∈{"a near","b near"} drops the 4 mid/far entries from the
6-entry pool, keeping exactly 2 (one per corpus, both with the
matching label). Dropped count surfaces in the response.
15-smoke regression all green. vet clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PRD's 5-loop substrate names "drift" as loop 5: quantify when
historical decisions stop matching current reality. Distinct from
the rating+distillation loop because drift is MEASUREMENT, not
LEARNING. The learning loop says "this match worked, remember it";
the drift loop says "this 4-month-old playbook entry — does it
still match what the substrate would surface today?"
First-shipped drift shape: SCORER drift. When the deterministic
scorer's ScorerVersion bumps, historical ScoredRuns may no longer
match what the current scorer produces on the same EvidenceRecord.
internal/drift/drift.go:
- ScorerDriftInput — (EvidenceRecord, persisted_category) pair
- ScorerDriftEntry — one mismatch with current reasons attached
- CategoryShift — (from, to, count) cell in the shift matrix
- ScorerDriftReport — summary + sorted shift matrix + optional entries
- ComputeScorerDrift(inputs, includeEntries) — pure function;
re-runs ScoreRecord over each input and reports mismatches
Why this matters: without a drift quantifier, a scorer-rule change
silently invalidates the historical training data feeding the
learning loop. With drift quantification, a rule change surfaces
a concrete number ("847 of 4701 historical ScoredRuns now
disagree") that triggers a re-score-and-retrain cycle rather than
letting the substrate quietly rot.
Tests (6/6 PASS):
- No-drift: all 3 inputs match → 100% matched
- Shift detected: 5 inputs, 3 drift cases, drift_rate=0.6,
shift matrix shows accepted→partially_accepted x3
- Multiple shifts sorted by count desc
- includeEntries=false skips the per-mismatch list
- Empty input → all-zero report (no division-by-zero)
- ScorerVersion stamped on every report
Future drift shapes (deferred to follow-ups, named in package doc):
- PLAYBOOK drift: re-run playbook queries through current
matrix-search; recorded answer not in top-K = drift
- EMBEDDING drift: KS-test on vector distribution at T1 vs T2
- AUDIT BASELINE drift: matches Rust audit_baselines.jsonl
longitudinal signal
Pure compute. Materialization layer (read scored-runs jsonl + their
matching evidence jsonl + feed into ComputeScorerDrift) lands with
the distillation materialization commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First slice of the Rust v1.0.0 distillation substrate (e7636f2)
ported to Go per ADR-001 #4 (port LOGIC, not bit-identical
reproducibility). This commit lands the LOAD-BEARING pieces named
in project_distillation_substrate.md memory:
- The deterministic Success Scorer (8 sub-scorers + dispatch)
- The contamination firewall on SFT samples (the "non-negotiable"
spec property: rejected/needs_human_review NEVER ship to SFT)
- All on-wire types + validators for ScoredRun, SftSample,
EvidenceRecord with Provenance
Files:
internal/distillation/types.go — types + ScorerVersion + SftNever
+ ValidateScoredRun + ValidateSftSample
internal/distillation/scorer.go — ScoreRecord + 8 class scorers +
BuildScoredRun (deterministic)
internal/distillation/scorer_test.go — ~40 test cases:
- source-class dispatch (verdict / telemetry / extraction)
- scrum_review (4 attempt cases)
- observer_review (5 verdict cases)
- audit (legacy + severity, 9 cases)
- auto_apply (4 cases)
- outcomes / mode_experiment / extraction
- CONTAMINATION FIREWALL: ErrSftContamination sentinel fires
on rejected/needs_human_review, distinct from typo errors
- empty-pair guard (instruction/response trim != "")
- reasons-required ScoredRun validation
- deterministic sig_hash on identical input
- purity check (input not mutated, repeatable output)
Per the 2026-04-29 cross-lineage scrum's discipline: false-positive
findings would be dismissed inline (none in this commit). Real
findings would be addressed before merge — but this is greenfield
port code reviewed against its Rust source line-by-line, which the
test suite encodes as truth tables.
Explicitly DEFERRED to follow-up commits:
- Materialization layer (jsonl read/write, date-partitioned
storage in data/scored-runs/YYYY/MM/DD/, evidence index)
- SFT exporter (file iteration + filtering — the SCORING firewall
is here; the EXPORT firewall is the next layer)
- export_preference, export_rag (other export shapes)
- Acceptance harness (16/16 acceptance gate that locks v1.0.0)
- replay, receipts, build_evidence_index, transforms
The scorer + firewall validator are pure functions — operational
tooling layers on top without changing the deterministic logic the
downstream learning loop depends on. The Go ScorerVersion stays at
v1.0.0 to match the Rust e7636f2 baseline; bumping in the Go
materialization commit is reserved for the next scoring-rule
change, NOT the port itself.
15-smoke regression all green. vet clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workers driver embed text reverted to V0 after testing 3 variants
on the "Forklift operator with OSHA-30 certification, warehouse
experience" reality-test query against 5000 workers (which contains
569 actual Forklift Operators per the 31b4088 probe).
V0 (current, restored): "Worker role: <role>. Skills: ...
Certifications: ... <resume_text>"
→ 6 workers in top-8, 0 Forklift Ops,
top distance 0.327, top role
"Production Worker"
V4a (role-doubled): "<role>. <role> with <skills>. ..."
drop archetype + resume_text
→ 6 workers in top-8, 0 Forklift Ops,
top distance 0.254, top role
"Production Worker"
V4b (resume-only): just the resume_text natural-language
sentence, no structured prefix
→ 4 workers in top-8 (WORSE mix —
software-engineer candidates filled
the displaced slots), 0 Forklift Ops,
top distance 0.379
Conclusion: all three variants surface Production Workers / Machine
Operators / Line Leads ABOVE Forklift Operators for this query.
The 569 actual Forklift Operators in the 5000-row sample don't
appear in any top-8. Embed-text design isn't the bottleneck —
nomic-embed-text 137M's geometry doesn't separate "Forklift
Operator" from "Production Worker" / "Machine Operator" / "Line
Lead" in this query's neighborhood.
Real fixes belong elsewhere:
- Hybrid SQL+semantic (B): pre-filter by role/certs via queryd
before semantic ranking. Addresses the gap directly.
- Different embedding model: mxbai-embed-large or a staffing-
fine-tuned model. Costs an Ollama model swap + re-embedding.
- Playbook boost (component 5, already shipped): record
successful Forklift placements; future queries surface those
workers via similarity. Compounds with use.
V0 restored because it has the best worker/candidate mix in top-8
(6 vs 4 in V4b), preserving the multi-corpus reality-test signal
quality even if the role match is imperfect. Comments updated to
record the experiment so future sessions don't relitigate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-lineage scrum review on the 12 commits of this session
(afbb506..06e7152) via Rust gateway :3100 with Opus + Kimi +
Qwen3-coder. Results:
Real findings landed:
1. Opus BLOCK — vectord BatchAdd intra-batch duplicates panic
coder/hnsw's "node not added" length-invariant. Fixed with
last-write-wins dedup inside BatchAdd before the pre-pass.
Regression test TestBatchAdd_IntraBatchDedup added.
2. Opus + Kimi convergent WARN — strings.Contains(err.Error(),
"status 404") was brittle string-matching to detect cold-
start playbook state. Fixed: ErrCorpusNotFound sentinel
returned by searchCorpus on HTTP 404; fetchPlaybookHits
uses errors.Is.
3. Opus WARN — corpusingest.Run returned nil on total batch
failure, masking broken pipelines as "empty corpora." Fixed:
Stats.FailedBatches counter, ErrPartialFailure sentinel
returned when nonzero. New regression test
TestRun_NonzeroFailedBatchesReturnsError.
4. Opus WARN — dead var _ = io.EOF in staffing_500k/main.go
was justified by a fictional comment. Removed.
Drivers (staffing_500k, staffing_candidates, staffing_workers)
updated to handle ErrPartialFailure gracefully — print warn, keep
running queries — rather than fatal'ing on transient hiccups
while still surfacing the failure clearly in the output.
Documented (no code change):
- Opus WARN: matrixd /matrix/downgrade reads
LH_FORCE_FULL_ENRICHMENT from process env when body omits
it. Comment now explains the opinionated default and points
callers wanting deterministic behavior to pass the field
explicitly.
False positives dismissed (caught and verified, NOT acted on):
A. Kimi BLOCK on errors.Is + wrapped error in cmd/matrixd:223.
Verified false: Search wraps with %w (fmt.Errorf("%w: %v",
ErrEmbed, err)), so errors.Is matches the chain correctly.
B. Kimi INFO "BatchAdd has no unit tests." Verified false:
batch_bench_test.go has BenchmarkBatchAdd; the new dedup
test TestBatchAdd_IntraBatchDedup adds another.
C. Opus BLOCK on missing finite/zero-norm pre-validation in
cmd/vectord:280-291. Verified false: line 272 already calls
vectord.ValidateVector before BatchAdd, so finite + zero-
norm IS checked. Pre-validation is exhaustive.
D. Opus WARN on relevance.go tokenRe (Opus self-corrected
mid-finding when realizing leading char counts toward token
length).
Qwen3-coder returned NO FINDINGS — known issue with very long
diffs through the OpenRouter free tier; lineage rotation worked
as designed (Opus + Kimi between them caught everything Qwen
would have).
15-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix, relevance, downgrade, playbook).
Unit tests all green (corpusingest +1, vectord +1).
Per feedback_cross_lineage_review.md: convergent finding #2 (404
detection) is the highest-signal one — both Opus and Kimi
flagged it independently. The other Opus findings stand on
single-reviewer signal but each one verified against the actual
code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes SPEC §3.4. The matrix indexer is now a learning meta-index per
feedback_meta_index_vision.md — every successful (query → answer)
pair recorded via /matrix/playbooks/record boosts that answer for
future similar queries.
This is the architectural piece that lifts vectord from "static
hybrid search" to the meta-index J originally framed in Phase 19 of
the Rust system.
What's new:
- internal/matrix/playbook.go — PlaybookEntry, PlaybookHit,
ApplyPlaybookBoost. Pure-function boost math:
distance' = distance * (1 - 0.5 * score)
Score 0 = no boost (factor 1.0); score 1 = halve distance
(factor 0.5). Capped at 0.5 deliberately so a single high-
confidence playbook can't dominate the base ranking forever
(runaway-feedback-loop guard).
- Retriever.Record(entry, corpus) — embeds query_text, ensures
playbook corpus exists (idempotent), upserts via deterministic
sha256-derived ID (last score wins on re-record of same triple).
- Retriever.Search extended with UsePlaybook + PlaybookCorpus +
PlaybookTopK + PlaybookMaxDistance. Reuses the query vector —
no extra embed call. Missing-corpus 404 = no-op (cold-start
state before any Record call), not an error.
- POST /v1/matrix/playbooks/record (matrixd) — caller submits
{query_text, answer_id, answer_corpus, score, tags?}; gets
{playbook_id} back.
Storage: a vectord index named "playbook_memory" (configurable per
request) with embed(query_text) as the vector and the
PlaybookEntry JSON as metadata. Just another corpus — observable
from /vectors/index, persistable through G1P, etc.
Match key for boost: (AnswerID, AnswerCorpus). Cross-corpus ID
collisions don't false-match — verified by
TestApplyPlaybookBoost_CorpusAttributionRespected.
End-to-end smoke (scripts/playbook_smoke.sh, all assertions PASS):
- Baseline search: widget-c at distance 0.6566 (rank 3)
- Record playbook: query → widget-c, score=1.0
- Re-search with use_playbook=true:
widget-c distance: 0.3283 (rank 2)
ratio: 0.5 EXACTLY (matches boost math precisely)
playbook_boosted: 1
- widget-c jumped from #3 to #2 — learning loop visible
Tests:
- 8 unit tests in internal/matrix/playbook_test.go covering
Validate, BoostFactor (5 cases), the no-boost identity, the
boost-moves-result-up scenario, highest-score wins on duplicate
matches, cross-corpus attribution, JSON round-trip, and
rejection of empty metadata
- scripts/playbook_smoke.sh integration test (3 assertions PASS)
15-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix, relevance, downgrade, playbook).
SPEC §3.4 NOW COMPLETE: 5 of 5 components shipped. The matrix
indexer's port is done as a substrate; remaining work is operational
(rating signal sources, telemetry, eventual structured filtering for
staffing data — none in §3.4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds WORKERS_LIMIT env override (default 5000) so the e2e can be
re-run at different sample sizes. Tiny change; the interesting part
is the FINDING that motivated the run.
Investigation: a97881d's reality test put zero Forklift Operators in
the top-6 for "Forklift operator with OSHA-30 certification,
warehouse experience" — instead returned Production Worker / Machine
Operator / Assembler.
Hypothesis tested: maybe the 5000-row sample didn't contain
forklift operators in retrievable density.
Result: hypothesis falsified. Direct probe of workers_500k.parquet:
All 500K rows → 55,349 Forklift Operators (11.07%)
→ 150,328 with "forklift" in certs
→ 74,852 with OSHA-30 specifically
First 5K rows → 569 Forklift Operators (11.38%)
→ distribution matches global, no ordering bias
So 569 forklift operators were IN the corpus the matrix indexer
searched and STILL didn't surface in top-6. That means the bottleneck
isn't sample size — it's nomic-embed-text + our embed-text template
ranking "Production Worker" / "Machine Operator" / "Assembler" as
semantically nearer to the query than literal "Forklift Operator".
The reality test exposed this faithfully. Three real follow-ups, none
in scope of this commit:
1. Embed text design — front-loading role + certs (currently
"Worker role: <role>" then skills then certs) might dominate
retrieval better. Worth A/B-testing.
2. Hybrid SQL+semantic — pre-filter by role/certs via queryd
before semantic ranking. Not in SPEC §3.4 today; would address
the "available" / "Chicago" gap from the candidates reality
test (0d1553c) too.
3. Playbook-memory boost — SPEC §3.4 component 5. When a query
"Forklift OSHA-30" was answered with worker w-X in the past,
boost w-X's score for similar future queries. The retrieval
gap CAN be bridged by the learning loop without changing the
base embedder.
Commits the env knob; the finding lives in the commit body so future
sessions don't re-run the sample-size hypothesis.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the second real-data corpus (workers_500k) and the first
multi-corpus reality test through /v1/matrix/search composing both
corpora live.
What's new:
- scripts/staffing_workers/main.go — parquet driver over
workers_500k.parquet, multi-chunk arrow handling (workers
parquet has multiple row groups vs candidates' one). Embed text:
role + skills + certifications + city + state + archetype +
resume_text. IDs prefixed "w-".
- scripts/multi_corpus_e2e.sh — first end-to-end test composing
both corpora through the matrix indexer.
Real-data multi-corpus result (this commit):
Query: "Forklift operator with OSHA-30 certification, warehouse
experience"
Corpora: workers (5000 rows) + candidates (1000 rows)
Merged top-8: workers=6, candidates=2
Top hits:
w d=0.327 w-4573 Production Worker
w d=0.353 w-1726 Machine Operator
w d=0.362 w-3806 Production Worker
w d=0.366 w-1000 Machine Operator
w d=0.374 w-1436 Assembler
w d=0.395 w-162 Machine Operator
c d=0.440 c-CAND-00727 C#,.NET,Azure
c d=0.446 c-CAND-00031 React,TypeScript,Node
The matrix indexer correctly chose the right domain — manufacturing/
warehouse roles in workers (correct semantic match for the staffing
query) rank ABOVE software-engineer candidates from the candidates
corpus. 0.11 gap between the worst worker (0.395) and the best
candidate (0.440) — clean distance separation.
Compared to the candidates-only e2e run from 0d1553c:
candidates-only top: c-CAND-00727 at d=0.4404
multi-corpus top: w-4573 at d=0.3265 (a Production Worker)
That's the matrix indexer's whole point made visible: composing
domain-distinct corpora surfaces better matches than single-corpus
search. Without workers in the search space, the staffing query
returned software engineers (wrong domain). With workers, it
returns roles in the right ballpark.
What's still imperfect (signal for component 5 + future work):
- No top-6 worker actually has "Forklift" or "OSHA-30" visible in
metadata; "Production Worker" is semantically nearest in this
sample. Likely needs a larger workers ingest (5000 from 500K)
or skill-keyword boost.
- Status/availability still not gated. The staffing-side
structured filtering gap from 0d1553c persists; relevance filter
(CODE-aware) doesn't address it.
Pipeline timings:
workers ingest: 5000 rows / 19.2s = 260/sec end-to-end
candidates ingest: 1000 rows / 3.1s = 322/sec
multi-corpus query (text → embed → 2 parallel vectord → merge): 14ms
14-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix, relevance, downgrade).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Faithful port of mcp-server/relevance.ts (Rust observer's adjacency-
pollution filter). Same 5-signal scoring, same default threshold 0.3.
Adds POST /v1/matrix/relevance endpoint via matrixd.
Scoring signals (additive, can sign-flip):
path_match +1.0 chunk source/doc_id encodes focus.path
filename_match +0.6 chunk text mentions focus's filename
defined_match +0.6 chunk text mentions focus.defined_symbols
token_overlap +0.4 jaccard of non-stopword tokens
prefix_match +0.3 chunk source shares first-2-segment prefix
import_penalty -0.5 mentions ONLY imported symbols, no defined ones
What this does and doesn't do:
- DOES filter code-aware corpora (eventually lakehouse_arch_v1,
lakehouse_symbols_v1, scrum_findings_v1) — drops chunks about
code the focus file IMPORTS rather than DEFINES, the
"adjacency pollution" pattern that makes a reviewer LLM
hallucinate imported-crate internals as belonging to the focus
- DOES NOT meaningfully filter staffing data — the candidates
reality test 2026-04-29 had "exact skill match buried at #3"
which is a different problem (semantic-only ranking dominated
by secondary text). Staffing needs structured filtering
(status gates, location gates) that lives outside this
package — future work, not in SPEC §3.4 yet
Headline smoke assertion: focus = crates/queryd/src/db.go which
defines Connector and imports catalogd::Registry. The filter
scores:
Connector chunk: +0.68 (defined_match fires, kept)
Registry chunk: -0.46 (import_only penalty fires, dropped)
unrelated junk: 0.00 (no signals, dropped)
That's a 1.14-point gap between what we ARE and what we IMPORT —
the entire purpose of the filter.
Tests:
- 9 unit tests in internal/matrix/relevance_test.go covering
Tokenize, Jaccard, ExtractDefinedSymbols (Rust + TS),
ExtractImportedSymbols, FilePrefix, ScoreRelevance per-signal,
FilterChunks threshold splitting, and the headline
AdjacencyPollutionScenario
- scripts/relevance_smoke.sh integration smoke (3 assertions PASS):
adjacency-pollution scenario, empty-chunks 400, threshold honored
13-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix, relevance).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the second staffing corpus and the first end-to-end reality test
through the full Go pipeline: parquet → corpusingest → embedd →
vectord → matrixd → gateway.
What's new:
- scripts/staffing_candidates/main.go — parquet Source over
candidates.parquet (1000 rows, 11 cols), single-chunk arrow-go
pqarrow read. Embed text: "Candidate skills: <s>. Based in
<city>, <state>. <years> years experience. Status: <status>.
<first> <last>." IDs prefixed "c-" so multi-corpus merges
against workers ("w-") stay unambiguous.
- scripts/candidates_e2e.sh — first integration smoke that runs
the full stack (storaged + embedd + vectord + matrixd + gateway),
ingests via corpusingest, runs a real query through
/v1/matrix/search, prints results. Ephemeral mode (vectord
persistence disabled via custom toml) so re-runs don't pollute
MinIO _vectors/ and break g1p_smoke's "only-one-persisted-index"
assertion.
Real bug caught + fixed in corpusingest:
When LogProgress > 0, the progress goroutine's only exit was
ctx.Done(). With context.Background() in the production driver,
Run hung forever after the pipeline finished. Added a stopProgress
channel that close()s after wg.Wait(). Regression test
TestRun_ProgressLoggerExits bounds Run's wall to 2s with
LogProgress=50ms.
This is the bug the unit tests didn't catch because every prior test
set LogProgress: 0. Reality test surfaced it on first real-data
run — exactly the hyperfocus-and-find-architectural-weakness
property J framed as the reason for the Go pass.
End-to-end output (1000 candidates, query "Python AWS Docker
engineer in Chicago available now"):
populate: scanned=1000 embedded=1000 added=1000 wall=3.5s
matrix returned 5 hits in 26ms
The result quality is the interesting signal: top-5 had ZERO
Chicago candidates, ZERO active-status candidates, and the exact-
skill-match (Python,AWS,Docker) ranked #3 not #1. Pipeline works;
retrieval quality has real architectural limits (no structured
filtering, no relevance gate, semantic-only ranking dominated by
secondary signals like "1 year experience" and "engineer"). This
motivates SPEC §3.4 components 3 (relevance filter) and
eventually structured filtering — exactly the kind of finding the
deep field reality tests are supposed to surface before Enterprise
cutover.
12-smoke regression sweep all green. 9 corpusingest unit tests
including the new regression. vet clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generalizes the staffing_500k driver's embed-and-push loop into
internal/corpusingest. Per docs/SPEC.md §3.4 component 1 (corpus
builders): adding a new staffing/code/playbook corpus is now one
Source impl + one main.go calling Run, not 200 lines of pipeline
copy-paste.
API:
type Source interface { Next() (Row, error) }
func Run(ctx, Config, Source) (Stats, error)
Library owns:
- Index lifecycle (create, optional drop-existing, idempotent
reuse on 409)
- Parallel embed dispatcher (configurable workers + batch size)
- Vectord push batching
- Progress logging + Stats reporting
- Partial-failure semantics (log + continue per-batch errors;
operator decides on re-run via Stats.Embedded vs Scanned delta)
Per-corpus driver owns: source parsing + column→Row mapping +
post-ingest validation queries.
Refactor scripts/staffing_500k/main.go to use it. Driver is now
~190 lines (was 339), with the embed/add plumbing replaced by one
Run call. -drop flag added so callers can opt out of the destructive
DELETE-first behavior (default still true to keep the 500K test
clean-recall semantics).
Unit tests (internal/corpusingest/ingest_test.go, 8/8 PASS):
- Pipeline shape: 50 rows / 16 batch → 4 embed + 4 add calls,
every ID added exactly once, vectors at correct dimension
- DropExisting fires DELETE
- 409 on create → reuse existing index
- Limit stops early
- Empty Text rows skipped (counted as scanned, not added)
- Required IndexName + Dimension validation
- Context cancel stops mid-pipeline
Real bug caught and fixed by the test suite: if embedd ever returns
fewer vectors than texts in the request (degraded backend), the
addBatch loop would panic with index-out-of-range. Worker now
length-checks the response and logs+skips on mismatch.
12-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix). vet clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the matrix indexer's first piece per docs/SPEC.md §3.4:
multi-corpus retrieve+merge with corpus attribution per result.
Future components (relevance filter, downgrade gate, learning-loop
integration) layer on top of this surface.
Architecture:
- internal/matrix/retrieve.go — Retriever takes (query, corpora,
k, per_corpus_k), parallel-fans across vectord indexes, merges
by distance ascending, preserves corpus origin per hit
- cmd/matrixd — HTTP service on :3217, fronts /v1/matrix/*
- gateway proxy + [matrixd] config + lakehouse.toml entry
- Either query_text (matrix calls embedd) or query_vector
(caller pre-embedded) — vector takes precedence if both set
Error policy: fail-loud on any corpus error. Silent partial returns
would lie about coverage, defeating the matrix's whole purpose.
Bubbles vectord errors as 502 (upstream), validation as 400.
Smoke (scripts/matrix_smoke.sh, 6 assertions PASS first try):
- /matrix/corpora lists indexes
- Multi-corpus search returns hits from BOTH corpora
- Top hit is the globally-closest across all corpora
(b-near beats a-near at distance 0.05 vs 0.1 — proves merge)
- Metadata round-trips through the merge
- Distances ascending in result list
- Negative paths: empty corpora → 400, missing corpus → 502,
no query → 400
12-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a "Product vision" section before the Direction-pivot section.
Captures the framing J flagged 2026-04-29: the Go refactor is not the
goal. The goal is a small-model-driven autonomous pipeline that gets
better with each run, with frontier models in audit/oversight, not
the hot path.
Five loops named explicitly:
1. Knowledge pathway (pathway memory + matrix indexer)
2. Execution (small models on focused context)
3. Observer (refines configs that got the model to a good pathway)
4. Rating + distillation (outcomes fold back into the playbook)
5. Drift (measure when the playbook stops matching reality)
Triage / human-in-loop named as the system's job, not an escape
hatch. The gate: "playbook + matrix indexer must give the results
we're looking for" — single load-bearing acceptance criterion.
Why Go after Rust: second-language pass surfaces architectural
weaknesses Rust hid; the pipeline must work AS A PIPELINE, not as
crates that interact. Maps existing Rust components (✓ pathway, ✓
matrix, ✓ observer, ✓ distillation, ✓ auditor; partial: drift,
rating gate, triage).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds matrix indexer as its own row in the §1 component table and a
new §3.4 with port plan. Distinct from vectord (substrate); lives at
internal/matrix/ + gateway /v1/matrix/*.
Five components in dependency order: corpus builders → multi-corpus
retrieve+merge → relevance filter → strong-model downgrade gate →
learning-loop integration.
Locks in the framing J flagged 2026-04-29: in Rust the matrix indexer
was emergent across mode.rs + build_*_corpus.ts + observer /relevance,
and earlier port-planning reduced it to "we have vectord." The SPEC
now names it explicitly so the port preserves the multi-corpus
retrieval shape AND the learning loop, not just the HNSW substrate.
Sharding-by-id was investigated as a throughput fix and rejected —
corpus-as-shard at the matrix layer is the existing retrieval shape
and parallelizes Adds for free.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the per-item Add loop in the HTTP handler with one call to
Index.BatchAdd, which acquires the write-lock once and pushes the
whole batch through coder/hnsw's variadic Graph.Add. Pre-validation
stays in the handler so per-item error messages keep their item-index
precision.
Microbench (internal/vectord/batch_bench_test.go) at d=768 cosine:
N=16 SingleAdd 283µs/op → BatchAdd 170µs/op 1.66×
N=128 SingleAdd 7.9ms/op → BatchAdd 7.5ms/op 1.05×
N=1024 SingleAdd 87.5ms/op → BatchAdd 83.4ms/op 1.05×
Win is biggest at staffing-driver batch sizes (N=16) where
per-call lock + validation overhead is a meaningful fraction. At
larger N the inner HNSW neighborhood search per insert dominates,
which is the load-bearing finding for Option B (sharded indexes):
the throughput ceiling lives inside the library, not at the lock,
so sharding to N parallel Graphs is the only path to true
concurrent-Add throughput.
g1, g1p, g2 smokes all PASS post-change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Network-callable Mem0-style trace memory at :3217, fronted by gateway
/v1/pathway/*. Closes the ADR-004 wire-up: store substrate landed in
2a6234f, this lands the HTTP surface + [pathwayd] config + acceptance
gate.
Smoke proves the architecturally distinctive properties: Revise →
History walks the predecessor chain backward (audit trail), Retire
excludes from Search default but stays Get-able, AddIdempotent bumps
replay_count without replacing — and all survive kill+restart via
JSONL log replay.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes Sprint 2 design-bar work (audit reports/scrum/sprint-backlog.md):
S2.1 — ADR-004 documents the pathway-memory data model
S2.2 — pathway port lands with deterministic fixture corpus
and full test coverage on day one
S2.3 — retired traces are excluded from retrieval (test
passes; would fail without the filter)
Mem0-style operations: Add / AddIdempotent / Update / Revise /
Retire / Get / History / Search. Each operation is a method on
Store; persistence is JSONL append-only with corruption recovery
on Replay.
internal/pathway/types.go Trace + event + SearchFilter + sentinel errors
internal/pathway/store.go in-memory state + RWMutex + ops
internal/pathway/persistor.go JSONL append-only log with replay
internal/pathway/store_test.go 20 test funcs covering all 7
Sprint 2 claim rows + concurrency
internal/pathway/persistor_test.go 6 test funcs covering missing-
file, corruption recovery, long-line
handling, parent-dir auto-create,
apply-error skip behavior
Sprint 2 claim coverage row-by-row:
ADD TestAdd_AssignsUIDAndTimestamps + TestAdd_RejectsInvalidJSON
UPDATE TestUpdate_ReplacesContentSameUID + Update_MissingUID_Errors
REVISE TestRevise_LinksToPredecessorViaHistory +
TestRevise_PredecessorMissing_Errors +
TestRevise_ChainOfThree_BackwardWalk
RETIRE TestRetire_ExcludedFromSearch +
TestRetire_StillAccessibleViaGet +
TestRetire_StillAccessibleViaHistory
HISTORY/cycle TestHistory_CycleDetected (injected via internal map),
TestHistory_PredecessorMissing_TruncatesChain,
TestHistory_UnknownUID_ErrorsClean
REPLAY/dup TestAddIdempotent_IncrementsReplayCount (locks the
"replay preserves original content" rule per ADR-004)
CORRUPTION TestPersistor_CorruptedLines_Skipped +
TestPersistor_ApplyError_Skipped
ROUND-TRIP TestPersistor_RoundTrip locks the full Save → fresh
Store → Load → Stats-match contract
Two real bugs caught during testing:
- Add returned the same *Trace stored in the map, so callers
holding a reference saw later mutations. Fixed: clone before
return (matches Get's contract). Same fix in AddIdempotent
+ Revise.
- Test typo: {"v":different} isn't valid JSON; AddIdempotent's
json.Valid rejected it as ErrInvalidContent. Test fixed to
use {"v":"different"}; the validation behavior is correct.
Skipped this commit (next):
- cmd/pathwayd HTTP binary
- gateway routing for /v1/pathway/*
- end-to-end smoke
These add the wire surface; the substrate ships first so the
wire layer can be a pure proxy in the next commit.
Verified:
go test -count=1 ./internal/pathway/ — 26 tests green
just verify — vet + test + 9 smokes 34s
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the "needs heavy integration smoke" follow-up from the
ADR-002 commit (423a381). Until now the per-prefix PUT cap was
verified only by unit tests + commits' theory; this smoke runs
the actual cap path with real bytes.
Three assertions, ~2s wall:
1. PUT 300 MiB to _vectors/<key> → 200 (cap raised to 4 GiB
for the vectord persistence prefix).
2. PUT same 300 MiB to datasets/<key> → 413 (default 256 MiB
cap still protects routine traffic).
3. GET _vectors/<key> → sha256 round-trips (no truncation
between cap-raise and S3 multipart streaming).
scripts/storaged_cap_smoke.sh
Builds storaged + gateway, boots them, generates 300 MiB
deterministic /dev/zero payload (sha stable across runs),
runs the 3 assertions, cleans up the keys + processes via trap.
/dev/zero generation chosen over yes/head pipe — pipefail
catches the SIGPIPE from yes when head closes early.
just smoke-storaged-cap
Wrapper recipe. Outside the main `just verify` chain because
300 MiB payload generation + transfer is MB-heavy. Run after
meaningful storaged or vectord-persistence changes.
Verified:
bash scripts/storaged_cap_smoke.sh — 3/3 PASS · 2s wall
just verify — vet + test + 9 smokes still 33s
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the auth posture from ADR-003 (commit 0d18ffa). Two
independent layers — Bearer token (constant-time compare via
crypto/subtle) and IP allowlist (CIDR set) — composed in shared.Run
so every binary inherits the same gate without per-binary wiring.
Together with the bind-gate from commit 6af0520, this mechanically
closes audit risks R-001 + R-007:
- non-loopback bind without auth.token = startup refuse
- non-loopback bind WITH auth.token + override env = allowed
- loopback bind = all gates open (G0 dev unchanged)
internal/shared/auth.go (NEW)
RequireAuth(cfg AuthConfig) returns chi-compatible middleware.
Empty Token + empty AllowedIPs → pass-through (G0 dev mode).
Token-only → 401 Bearer mismatch.
AllowedIPs-only → 403 source IP not in CIDR set.
Both → both gates apply.
/health bypasses both layers (load-balancer / liveness probes
shouldn't carry tokens).
CIDR parsing pre-runs at boot; bare IP (no /N) treated as /32 (or
/128 for IPv6). Invalid entries log warn and drop, fail-loud-but-
not-fatal so a typo doesn't kill the binary.
Token comparison: subtle.ConstantTimeCompare on the full
"Bearer <token>" wire-format string. Length-mismatch returns 0
(per stdlib spec), so wrong-length tokens reject without timing
leak. Pre-encoded comparison slice stored in the middleware
closure — one allocation per request.
Source-IP extraction prefers net.SplitHostPort fallback to
RemoteAddr-as-is for httptest compatibility. X-Forwarded-For
support is a follow-up when a trusted proxy fronts the gateway
(config knob TBD per ADR-003 §"Future").
internal/shared/server.go
Run signature: gained AuthConfig parameter (4th arg).
/health stays mounted on the outer router (public).
Registered routes go inside chi.Group with RequireAuth applied —
empty config = transparent group.
Added requireAuthOnNonLoopback startup check: non-loopback bind
with empty Token = refuse to start (cites R-001 + R-007 by name).
internal/shared/config.go
AuthConfig type added with TOML tags. Fields: Token, AllowedIPs.
Composed into Config under [auth].
cmd/<svc>/main.go × 7 (catalogd, embedd, gateway, ingestd, queryd,
storaged, vectord, mcpd is unaffected — stdio doesn't bind a port)
Each call site adds cfg.Auth as the 4th arg to shared.Run. No
other changes — middleware applies via shared.Run uniformly.
internal/shared/auth_test.go (12 test funcs)
Empty config pass-through, missing-token 401, wrong-token 401,
correct-token 200, raw-token-without-Bearer-prefix 401, /health
always public, IP allowlist allow + reject, bare IP /32, both
layers when both configured, invalid CIDR drop-with-warn, RemoteAddr
shape extraction. The constant-time comparison is verified by
inspection (comments in auth.go) plus the existence of the
passthrough test (length-mismatch case).
Verified:
go test -count=1 ./internal/shared/ — all green (was 21, now 33 funcs)
just verify — vet + test + 9 smokes 33s
just proof contract — 53/0/1 unchanged
Smokes + proof harness keep working without any token configuration:
default Auth is empty struct → middleware is no-op → existing tests
pass unchanged. To exercise the gate, operators set [auth].token in
lakehouse.toml (or, per the "future" note in the ADR, via env var).
Closes audit findings:
R-001 HIGH — fully mechanically closed (was: partial via bind gate)
R-007 MED — fully mechanically closed (was: design-only ADR-003)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New cmd/mcpd binary using github.com/modelcontextprotocol/go-sdk
v1.5.0 over stdio transport. Exposes Lakehouse capabilities as MCP
tools: list_datasets, get_manifest, query_sql, embed_text,
search_vectors. Each tool proxies to the gateway via HTTP.
Replaces the MCP-tool subset of the Rust system's Bun mcp-server.ts
(the audit's "split this 2520-line empire" finding from R-005). HTTP
demo routes (the staffing co-pilot UI at /api/intelligence/*,
/headshots/*, etc.) stay Bun until G5 cutover — those are demo-
specific and depend on matrix-indexer signals not yet ported.
Architecture:
cmd/mcpd/main.go (235 LoC)
main() reads --gateway flag, builds server via buildServer(),
runs on StdioTransport. Each tool's args is a typed struct with
jsonschema tags (the SDK's canonical pattern); reflection
generates the JSON Schema automatically.
gatewayClient: thin HTTP wrapper over the configured gateway URL.
30s per-request timeout. 16 MiB tool-response cap. Non-2xx
surfaces as IsError CallToolResult (NOT as transport error) so
the LLM caller sees the error text and can decide how to react.
proxy() handles GET + POST + JSON body uniformly. errorResult()
+ jsonResult() helpers normalize CallToolResult shape.
cmd/mcpd/main_test.go (13 test funcs)
Tests the full MCP wire end-to-end without a subprocess: spin
up a fake gateway via httptest, build the MCP server pointed at
it, connect a client via in-memory transports (NewInMemoryTransports),
call each tool. Each tool gets:
- happy path (gateway returns 200 → tool returns content)
- input validation (missing required fields → IsError)
- upstream error (gateway 4xx → tool returns IsError)
Plus TestListTools verifies all 5 tools register; TestGatewayUnreachable
verifies network-level failures surface as IsError, not panics.
Setup for Claude Desktop / Code documented in README:
{
"mcpServers": {
"lakehouse": {
"command": "/path/to/bin/mcpd",
"args": ["--gateway", "http://127.0.0.1:3110"]
}
}
}
Verified:
go test -count=1 ./cmd/mcpd/ — 13/13 green
just verify — vet + test + 9 smokes 35s
Out of scope for this commit:
- Resources (mcp.AddResource): not needed yet; tools cover the
interactive surface. Add when an LLM-side use case shows up.
- Prompts (mcp.AddPrompt): same.
- Streamable transports (HTTP, SSE): stdio is the universal one;
streamable can be added with srv.Run(ctx, &mcp.StreamableHTTPHandler{})
swap if a daemon-mode deploy makes sense.
- mcpd inside the daemon-supervised stack: it's stdio-only and
spawned by the MCP client, not run as a service. Adding a
daemon-mode (HTTP transport on a port) is a follow-up if MCP
consumers want long-lived sessions.
This is a tool-surface only port. The Bun mcp-server.ts also serves
HTTP demo routes (/api/catalog/datasets, /intelligence/*, /headshots/*)
that depend on the matrix-indexer signals from the Rust system; those
stay Bun until G5 cutover when the staffing co-pilot service ports
to Go.
Direct deps added:
github.com/modelcontextprotocol/go-sdk v1.5.0
Transitive (resolved by go mod tidy):
github.com/google/jsonschema-go v0.4.2
github.com/yosida95/uritemplate/v3 v3.0.2
golang.org/x/oauth2 v0.35.0
github.com/segmentio/encoding v0.5.4
github.com/golang-jwt/jwt/v5 v5.3.1
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds CachedProvider wrapping the embedding Provider with a thread-safe
LRU keyed on (effective_model, sha256(text)) → []float32. Repeat
queries return the stored vector without round-tripping to Ollama.
Why this matters: the staffing 500K test (memory project_golang_lakehouse)
documented that the staffing co-pilot replays many of the same query
texts ("forklift driver IL", "welder Chicago", "warehouse safety", etc).
Each repeat paid the ~50ms Ollama round-trip. Cached repeats now serve
in <1µs (LRU lookup + sha256 of input).
Memory budget: ~3 KiB per entry at d=768. Default 10K entries ≈ 30 MiB.
Configurable via [embedd].cache_size; 0 disables (pass-through mode).
Per-text caching, not per-batch — a batch with mixed hits/misses only
fetches the misses upstream, then merges the result preserving caller
input order. Three-text batch with one miss = one upstream call for
that one text instead of three.
Implementation:
internal/embed/cached.go (NEW, 150 LoC)
CachedProvider implements Provider; uses hashicorp/golang-lru/v2.
Key shape: "<model>:<sha256-hex>". Empty model resolves to
defaultModel (request-derived) for the key — NOT res.Model
(upstream-derived), so future requests with same input shape
hit the same key. Caught by TestCachedProvider_EmptyModelResolvesToDefault.
Atomic hit/miss counters + Stats() + HitRate() + Len().
internal/embed/cached_test.go (NEW, 12 test funcs)
Pass-through-when-zero, hit-on-repeat, mixed-batch only fetches
misses, model-key isolation, empty-model resolves to default,
LRU eviction at cap, error propagation, all-hits synthesized
without upstream call, hit-rate accumulation, empty-texts
rejected, concurrent-safe (50 goroutines × 100 calls), key
stability + distinctness.
internal/shared/config.go
EmbeddConfig.CacheSize (toml: cache_size). Default 10000.
cmd/embedd/main.go
Wraps Ollama Provider with CachedProvider on startup. Adds
/embed/stats endpoint exposing hits / misses / hit_rate / size.
Operators check the rate to confirm the cache is working
(high rate = good) or sized wrong (low rate + many misses on a
workload that should have repeats).
cmd/embedd/main_test.go
Stats endpoint tests — disabled mode shape, enabled mode tracks
hits + misses across repeat calls.
One real bug caught by my own test:
Initial implementation cached under res.Model (upstream-resolved)
rather than effectiveModel (request-resolved). A request with
model="" caching under "test-model" (Ollama's default), then a
request with model="the-default" (our config default) missing
the cache. Fix: always use the request-derived effectiveModel
for keys; that's the predictable side. Locked by
TestCachedProvider_EmptyModelResolvesToDefault.
Verified:
go test -count=1 ./internal/embed/ — all 12 cached tests + 6 ollama tests green
go test -count=1 ./cmd/embedd/ — stats endpoint tests green
just verify — vet + test + 9 smokes 33s
Production benefit:
~50ms Ollama round-trip → <1µs cache lookup for cached entries.
At 10K-entry default + ~30% repeat rate (typical staffing co-pilot
workload), saves several seconds per staffer-query session.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds cmd/fake_ollama, a minimal Ollama-API-compatible fake that
implements just enough surface for embedd to drive end-to-end
without a real Ollama install:
GET /api/tags — fixed model list including nomic-embed-text
POST /api/embeddings — deterministic dim-D vector from sha256(prompt)
GET /health — for the smoke's poll_health helper
Same prompt → bit-identical vector across runs, machines, and CI
nodes. Vectors are NOT semantically meaningful; the fake validates
the embed CONTRACT (dimension echo, response shape, status codes,
deterministic round-trip), not real semantic ranking. Real ranking
still requires real Ollama and lives in scripts/g2_smoke.sh + the
integration tier of the proof harness.
scripts/g2_smoke_fixtures.sh — full chain smoke against the fake:
- Build fake_ollama + embedd + vectord + gateway
- Start fake on :11435 (distinct from real Ollama at :11434)
- Generate temp lakehouse.toml with provider_url override
- Boot embedd/vectord/gateway with --config <override>
- 4 assertions: dim=768, deterministic same-text, different-text
divergence, bad-model → 4xx/5xx (fake 404 → embedd 502)
- Trap-cleanup tears down all 4 binaries + tmp config
Wired into the task runner:
just smoke-g2-fixtures
Closes R-006 partially:
- Embed half: ✓ — CI / fresh-clone reviewers without Ollama can
now run the embed contract smoke
- Storage half: deferred — mocking S3 protocol is non-trivial
(multipart, signed URLs, etc.) and MinIO itself is lightweight
enough to install via Docker in any CI environment. Documented
as Sprint 0 follow-up if a CI system without Docker shows up.
What this DOESN'T cover:
- Real semantic similarity (use scripts/g2_smoke.sh + real Ollama)
- Real Ollama API quirks (timeouts, version-specific shapes,
/api/embed batch endpoint that newer versions support)
Verified:
bash scripts/g2_smoke_fixtures.sh — 4/4 assertions PASS, ~3s wall
just verify — vet + test + 9 smokes still green
Doesn't replace the existing g2_smoke.sh (which still requires real
Ollama and exercises the actual embed semantics). Adds an alternate
mode for portability.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds main_test.go for each of the 6 cmd binaries that lacked them
(storaged already had main_test.go; that's where the pattern came
from). Each test file focuses on the cmd-specific surface — route
mounts, body caps, decode/validation paths — without re-testing
internal package logic that's covered elsewhere.
cmd/catalogd/main_test.go — 6 funcs
TestRoutesMounted: chi.Walk asserts /catalog/{register,manifest/*,list}
TestHandleRegister_BodyTooLarge: 5 MiB body → 4xx
TestHandleRegister_MalformedJSON: 400
TestHandleRegister_EmptyName_400: ErrEmptyName surfaces as 400
TestHandleGetManifest_404 + TestHandleList_EmptyShape
cmd/embedd/main_test.go — 8 funcs
stubProvider implements embed.Provider deterministically
TestRoutesMounted, MalformedJSON_400, EmptyTextRejected_400 (per
scrum O-W3), UpstreamError_502 (provider error → 502, not 500),
HappyPath_ProviderEcho, BodyTooLarge (4xx range), TestItoa
(covers the no-strconv helper)
cmd/gateway/main_test.go — 4 funcs
TestMustParseUpstream_HappyPaths: 3 valid URLs
TestMustParseUpstream_FailureExits: re-execs the test binary in a
subprocess with env flag (standard pattern for testing os.Exit
callers); subprocess invokes mustParseUpstream("127.0.0.1:3211")
[missing scheme]; expects exit non-zero. Same pattern for garbage.
TestUpstreamConfigKeys_DocumentedShape: locks the 6 _url keys
cmd/ingestd/main_test.go — 7 funcs
Stubs both storaged and catalogd via httptest.Server so the cmd
layer can be exercised without bringing the full chain up.
TestHandleIngest_MissingNameQueryParam: 400 with "name" in body
TestHandleIngest_MalformedMultipart: 400
TestHandleIngest_MissingFormFile: 400 (valid multipart, wrong field)
TestHandleIngest_BodyTooLarge: 4xx
TestEscapeKeyPath: 6-case URL-escape table (apostrophe, space, etc.)
TestParquetKeyPath_Format: locks the datasets/<n>/<fp>.parquet shape
per scrum C-DRIFT (any rename breaks idempotent re-ingest)
cmd/queryd/main_test.go — 6 funcs
Tests pre-DB paths (decode, body cap, empty SQL); db.QueryContext
itself needs DuckDB so it's covered by GOLAKE-040 in the proof
harness, not unit tests. handlers.db = nil here is intentional.
TestHandleSQL_EmptySQL_400: 3 cases (empty, whitespace, mixed-WS)
TestMaxSQLBodyBytes_Reasonable: locks the 64 KiB constant in a
sane range so a refactor can't blow it open
TestPrimaryBucket_Constant: locks "primary" — secrets lookup uses
this; rename = silent secret-resolution failure at boot
cmd/vectord/main_test.go — 14 funcs
All 6 routes verified mounted. handlers.persist = nil = pure
in-memory mode; persistence is GOLAKE-070 in the proof harness.
Coverage of every error branch in handleCreate/Add/Search/Delete:
missing index → 404, dim mismatch → 400, empty items → 400,
empty id → 400, malformed JSON → 400, body too large → 4xx,
happy create → 201, happy list → 200.
One real finding caught during writing:
Body-cap rejection is sometimes 413 (typed MaxBytesError survives
unwrap) and sometimes 400 (decoder wraps it as a generic decode
error). Both are valid client-error contracts; the contract isn't
"exactly 413" but "fails loud as 4xx, never silent 200 or 5xx."
Tests assert 4xx range. The proof harness's
proof_assert_status_4xx already had this shape — just bringing
the unit tests in line with it.
Verified:
go test -count=1 -short ./cmd/... — all 7 packages green
just verify — vet + test + 9 smokes 35s
Closes audit risk R-005 (6/7 cmd/main.go untested). Combined with
the proof harness's wiring coverage, every cmd-level handler now
has both unit-test and integration-test coverage of the wiring
layer. R-005 → CLOSED.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces single-shot baselines (40% noise floor flagged in Phase E)
with noise-aware regression detection.
What changed:
ingest n=3 runs (was 1) with 3-pass warmup
vector_add n=3 runs (was 1) with 3-pass warmup
query n=20 samples (unchanged) with 50-pass warmup
search n=20 samples (unchanged) with 50-pass warmup
RSS n=1 (unchanged — steady-state in G0)
Each metric stored as {value: median, mad: median absolute
deviation} in baseline.json (schema: v2-multisample-mad).
New regression detection:
threshold = max(3 * baseline.mad, value * 0.75)
REGRESSION iff |actual - baseline.value| > threshold AND direction
signals worse (lower throughput / higher latency).
Why these specific numbers:
3*MAD = standard "outside the spread" bound; lets high-variance
metrics tolerate their own noise.
75% floor = empirical observation: even with 50 warmups, single-
host inter-run variance on bootstrap-cold queryd was
consistently 90-130% on this box. 75% catches >75%
regressions cleanly while ignoring known noise.
lib/metrics.sh: new proof_compute_mad helper computes MAD from a
file of one-number-per-line samples. Used for both regen (to write
the baseline.mad value) and diff (read from baseline).
Honest finding from this iteration's 3 back-to-back diff runs:
query_ms shows 90-130% delta from baseline consistently — not
random noise but a systematic 2x gap between regen-time and
steady-state. The regen captured a particularly fast moment;
steady-state is slower. Operator workflow: regenerate the
baseline at a known-representative state via
`bash tests/proof/run_proof.sh --mode performance --regenerate-baseline`
rather than expecting the harness to track a moving target.
The harness's value here is the EVIDENCE RECORD (every run captures
median+MAD+p95 plus all raw samples in raw/metrics/), not the gate.
Even false-positive REGRESSION skips give operators "this run was
20ms vs baseline 10ms" which is informative.
Sample counts also written into baseline.json under "samples" so a
future audit can verify the methodology that produced the values.
Verified across 3 back-to-back runs:
ingest_rows_per_sec PASS (delta within 75%, mostly < 10%)
vectors_per_sec_add PASS
search_ms PASS
rss_* PASS
query_ms REGRESSION flagged (130/100/90%) — known
systematic gap, not bug
Closes the "40% noise floor" follow-up from Phase E FINAL_REPORT.
Honest about limitations: hard regression gating on a busy single-
host setup needs either much bigger sample counts (n≥100), longer
warmup, or moving to a dedicated benchmark host. Documented inline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Locks in the auth model that R-001 + R-007 will be retrofitted
against. Doc-only — wiring deferred to Sprint 1 when the first
non-loopback binding is needed.
Decision: Bearer token (from secrets-go.toml [auth] section) + IP
allowlist (CIDR list). Both layers required when auth is on; empty
token = G0 dev no-op. /health exempt.
Implementation shape (when it lands):
- internal/shared/auth.go middleware: one chi r.Use line per binary
- shared.Run gates: refuses non-loopback bind without configured token
- subtle.ConstantTimeCompare for token equality (timing-safe)
Alternatives considered + rejected:
mTLS — too heavy for single-machine inter-service traffic
JWT — buys nothing over Bearer without external IdP
IP-only — one stolen IP entry = full access; no defense depth
OAuth2 — no external IdP commitment in G0-G3 timeline
What this doesn't do:
- Doesn't implement (code lands Sprint 1)
- Doesn't break G0 dev (empty token = middleware no-op)
- Doesn't address gateway→end-user auth (different ADR shape)
Closes the design-decision blocker for R-001 and R-007. Wiring
ticket: Sprint 1 backlog story S1.2.
Also lifts ADR-002 (storaged per-prefix PUT cap) into the doc —
it was implemented in 423a381 but not yet recorded as an ADR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the documented 500K-test gap (memory project_golang_lakehouse:
"storaged 256 MiB PUT cap blocks single-file LHV1 persistence above
~150K vectors at d=768"). Vectord persistence under "_vectors/" now
gets a 4 GiB cap; everything else (parquets, manifests, ingest)
keeps the 256 MiB default.
Why per-prefix and not "raise globally":
- 256 MiB cap is a real DoS protection — runaway clients can't
drain the daemon. Raising it for ALL traffic would expand the
attack surface for routine paths that have no need.
- Per-prefix preserves existing protection while opening the one
documented production-scale path.
Why not split LHV1 across multiple keys (the alternative):
- G1P shipped a single-Put framed format SPECIFICALLY to eliminate
the torn-write class (memory: "Single Put eliminates the torn-
write class that the 3-way convergent scrum finding identified").
- Multi-key LHV1 would re-introduce the half-saved-state failure
mode we just paid to fix. Streaming via existing manager.Uploader
is the better architectural answer.
Why not bump the cap operationally via env/config:
- Future operator-driven cap can drop in cleanly via the
maxPutBytesFor function. Started with hardcoded 4 GiB to keep
this commit small; config knob is a follow-up if production
workloads diverge from the documented 500K-vector ceiling.
manager.Uploader is already streaming-multipart on the outbound
S3 side; the inbound MaxBytesReader cap is a safety gate, not a
memory bottleneck. So raising it for vectord just lets the
existing streaming path actually flow, without introducing new
memory pressure (4-slot semaphore × 4 GiB worst case = 16 GiB
only if all slots simultaneously max out — vanishingly unlikely).
Implementation:
cmd/storaged/main.go:
new constant maxPutBytesVectors = 4 GiB (covers >700K vectors @ d=768)
new constant vectorsPrefix = "_vectors/" (synced with vectord.VectorPrefix)
new function maxPutBytesFor(key) → cap-by-prefix
handlePut: ContentLength check + MaxBytesReader use the per-key cap
cmd/storaged/main_test.go (3 new test funcs):
TestMaxPutBytesFor: 7 cases incl. nested prefix, substring-but-not-
prefix, empty key, parquet/manifest paths.
TestVectorPrefixSyncWithVectord: regression test that asserts
vectorsPrefix == vectord.VectorPrefix. A future rename surfaces
here instead of silently bypassing the larger cap.
TestVectorCapAccommodates500KStaffingTest: bounds the cap above
the documented production workload (~700 MiB conservative).
Verified:
go test ./cmd/storaged/ — all green (was 1 func, now 4)
just verify — 9 smokes still pass · 32s wall
just proof contract — 53/0/1 unchanged
Out of scope for this commit (deserves its own):
- Heavy integration smoke: 200K dim=768 synthetic vectors → ~700
MiB LHV1 → kill+restart vectord → recall=1. ~5-10 min wall;
follow-up if you want production-scale persistence verified
end-to-end. Unit tests + existing g1p_smoke cover the wiring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
shared.Run now refuses to bind a non-loopback address unless the
LH_<SERVICE>_ALLOW_NONLOOPBACK=1 env is set. Single change covers
all 7 binaries via the existing Run call site; no per-binary
wiring needed.
Closes the accidental-0.0.0.0 deploy attack surface for R-001:
queryd /sql is RCE-equivalent off loopback (DuckDB has filesystem
read + COPY TO + read_text), but the gate applies to every binary
uniformly so the same posture covers vectord (mutation routes),
catalogd (manifest writes), and the others.
What passes the gate:
127.0.0.1:port, 127.x.y.z:port (full /8), [::1]:port,
localhost:port, OR explicit env LH_<SVC>_ALLOW_NONLOOPBACK=1
What fail-louds:
0.0.0.0:port, [::]:port, :port (all interfaces),
any non-loopback IP, any non-localhost hostname,
unparseable shapes ("", "no port", garbage)
Override env is strict equality "1" — typos like "true"/"yes" do NOT
trigger it, so a future operator can't accidentally expose by typing
the wrong value. Override fires log a structured warn so the choice
is auditable in production.
Error message cites the env name AND R-001 by name so operators see
the fix path without grepping:
"refusing non-loopback bind \"0.0.0.0:3214\" for \"queryd\"
(set LH_QUERYD_ALLOW_NONLOOPBACK=1 to override; see audit R-001)"
internal/shared/bind.go — requireLoopbackOrOverride + isLoopbackAddr
internal/shared/bind_test.go — 7 test funcs incl. table-driven
IPv4/IPv6/hostname coverage and
per-service env isolation
internal/shared/server.go — 1-line gate in Run before listen
Verified:
go test -short ./internal/shared/ — all green (was 14 funcs, now 21)
just verify — vet + test + 9 smokes still 33s
Doesn't address R-001's full attack surface (any reachable port can
issue arbitrary SQL); ADR-003 + Bearer-token middleware is the
follow-up. This commit makes the implicit "localhost-only is the auth
layer" guarantee explicit and un-bypassable without explicit env.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit-driven follow-up to the Rust scrum review on the 3 untested
HIGH-risk packages. Both the audit (reports/scrum/risk-register.md)
and the scrum (tests/real-world/runs/scrum_mojxb5bw/) independently
flagged these files as the highest-leverage missing test coverage.
internal/shared/server_test.go — 8 test funcs
newListener: valid addr, invalid addr (non-numeric port, port
out of range, port-already-in-use surfacing as net.OpError).
Empty-addr-is-valid: documents the net.Listen quirk that "" binds
an OS-picked port — future readers don't need to relitigate.
HealthResponse marshal: JSON shape stable, round-trip clean.
/health handler reconstructed via httptest.Server: status 200,
Content-Type application/json, body fields stable.
RegisterRoutes callback: contract verified (callback is invoked
with a real chi.Router, mounted route reachable end-to-end).
Run bind-failure surface: synchronous error, not a goroutine swallow
— the contract Run depends on per the race-safe-startup comment.
internal/shared/config_test.go — 6 test funcs
DefaultConfig G0 port pinning: every binary's default bind locked
in (3110/3211-3216) so a refactor can't silently flip a port.
LoadConfig empty path: returns DefaultConfig, no error.
LoadConfig missing file: returns DefaultConfig, logs warn (the warn
line shows up in test output, captured-but-not-asserted).
LoadConfig valid TOML: partial overrides land, unspecified sections
keep defaults (TOML decoder leave-alone behavior).
LoadConfig invalid TOML: returns wrapped 'parse config' error.
LoadConfig unreadable file: skipped under root (root reads 0000);
captures the read-error wrap path for non-root contexts.
internal/storeclient/client_test.go — 14 test funcs
safeKey table-driven: plain segments, single slash, empty, trailing
slash, space (→ %20), apostrophe (→ %27), unicode (→ %C3%A9),
deep nesting. Locks URL-escape contract per scrum suggestion.
recordingServer helper backs Put/Get/Delete/List against
httptest.Server: verifies method, path, body bytes round-trip.
ErrKeyNotFound on 404 (errors.Is round-trip).
Non-OK status wraps body preview into the error chain.
Delete accepts both 200 and 204 (S3 vs compatible-store quirk).
List parses JSON shape and surfaces query-string prefix.
Context cancellation propagates through Put as context.Canceled.
internal/queryd/db_test.go — 5 test funcs (with subtests)
sqlEscape table-driven: 8 cases including empty, all-quotes,
nested apostrophes (the case from the scrum suggestion).
redactCreds table-driven: 6 cases — both keys, single keys,
empty, multi-occurrence, placeholder-collision (lossy but safe).
buildBootstrap statement order: INSTALL → LOAD → CREATE SECRET.
buildBootstrap endpoint schemes: http strips + USE_SSL false,
https keeps SSL true, no-scheme defaults SSL true (prod ambient).
buildBootstrap URL_STYLE: 'path' vs 'vhost' branch.
buildBootstrap escapes credential quotes: future SSO-token-with-
apostrophe doesn't break out of the SQL string literal — the
belt holds when the suspenders snap.
Real finding caught by my own test:
net.Listen("tcp", "") succeeds (OS-picked port) — captured as
TestNewListener_EmptyAddrIsValid so the quirk is documented.
Verified:
go test -short ./... — every internal/ package now has tests
(no more 'no test files' lines for shared/storeclient).
just verify — vet + test + 9 smokes green in 33s.
just proof contract — 53/0/1 green (no harness regression).
Closes:
R-002 internal/shared zero tests HIGH
R-003 internal/storeclient zero tests HIGH
R-008 queryd/db.go untested MED (sqlEscape, redactCreds,
CREATE SECRET formation)
Composite scrum score should move from 43 → ~46 / 60 — the three
HIGH/MED risks closed, internal/shared and internal/storeclient
become "tested + load-bearing" instead of "untested + load-bearing."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-runs the SCRUM.md framework against HEAD (4840c10) to score the
delta from the audit baseline at 91edd43. Composite +8.
Scoring deltas:
Reproducibility 7 → 9 (just verify, just doctor, pre-push hook)
Test Coverage 6 → 8 (168 proof harness assertions; Go-test
gaps in shared/storeclient remain)
Trust Boundary 7 → 7 (no code change; R-001/R-007 open)
Memory Correctness 3 → 4 (vectord persistence proven; Mem0
pathway/playbook still not ported)
Deployment Readiness 4 → 5 (just doctor; REPLICATION/systemd open)
Maintainability 8 → 8 (spine unchanged; harness obeys
CLAUDE_REFACTOR_GUARDRAILS)
Risk register changes:
R-004 (smokes not gated) CLOSED — just verify + pre-push hook
R-005 (cmd/main.go untested) partial — proof harness covers wiring
R-012 (empty tests/ dir) CLOSED — populated by harness
R-001/R-002/R-003/R-006/R-007/R-008/R-009/R-010 unchanged
Sprint 0 progress:
S0.1 just doctor DONE
S0.3 just verify + pre-push DONE
S0.6 tests/ dir cleanup DONE
S0.2 just smoke-fixtures open
S0.4 cmd/main_test × 6 partial (harness coverage; go-test gap)
S0.5 shared/storeclient tests open (HIGH risks still unaddressed)
New finding from this rerun (worth recording):
Queryd refresh-tick race in 04_query_correctness — cache-warm
binaries fire SELECTs faster than queryd's 500ms refresh tick.
Caught by integration mode going 104/0/1 → 102/1/1, fixed at
4840c10 with proof_wait_for_sql helper. Exactly the failure-mode
the harness was designed to catch.
Original 5 audit reports preserved as immutable history at
91edd43; this file documents the delta only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Caught by the audit rerun: with cache-warm binaries, 04 fires its
first SELECT faster than queryd's 500ms refresh tick — Q1 returned
400 ("table not found") even though 03_ingest had registered the
manifest. Subsequent queries (after the next tick) succeeded.
This is an eventual-consistency wait, not a retry — queryd's
contract is that views appear within one tick of catalogd having the
manifest. Production code does not need changing.
Added to lib/http.sh:
proof_wait_for_sql <budget_sec> <sql>
polls a SQL probe until it returns 200 or budget elapses; emits
no evidence (test setup, not a claim).
Used in 04_query_correctness:
Wait up to 5s for queryd to have the view before running the 5
SQL assertions. Skip-with-loud-reason if the view never appears.
Verified: integration mode back to 104 pass / 0 fail / 1 skip after
fix. The skip is the unchanged GOLAKE-085 informational record.
This is exactly the kind of finding the harness was designed to
surface — the regression existed in the codebase the moment Phase D
shipped, but only fired when the next compare run hit cache-warm
timing. Without the harness, it would have surfaced on a CI run
weeks from now and been hard to bisect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per docs/TEST_PROOF_SCOPE.md, this is the closing deliverable for the
proof harness: a single document that names what's proven, what's
partially proven, what failed, what was skipped and why, what evidence
exists for each, what bottlenecks were measured, what contract drift
was found, what refactor risks remain, and what to fix first.
Per-run report dirs (tests/proof/reports/proof-<ts>/) keep their
existing summary.md + summary.json + raw/ structure — they are the
replayable evidence chain. FINAL_REPORT.md is the stable, repo-tracked
synthesis pointing at them.
Headline findings (no surprises — harness behaves as designed):
- 24 claims encoded; 22 fully proven, 1 informational (GOLAKE-085
duplicate vector ID, contract not yet specified), 0 failed.
- 4 contract-drift findings recorded as canonical: vectord add
body field is `items` not `vectors`, search response is `results`
not `hits`, index info is `length` not `count`, status codes
201/204 not 200. All caught during Phase B; all now pinned by the
harness.
- Performance baseline shows queryd as the largest RSS (69 MiB,
DuckDB process); single-sample noise floor is ~40% — tightening
to multi-sample medians is a documented Sprint follow-up.
- HIGH-risk audit findings (R-001 queryd /sql, R-002/R-003 untested
shared+storeclient) are NOT closed by the harness — it's a
multiplier, not a replacement for unit tests + auth posture.
The proof harness is complete. 11 cases · 3 modes · 168 assertions
peak across all tiers · ~22s total wall (contract+integration+perf).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GOLAKE-100. First run writes tests/proof/baseline.json; subsequent
runs diff against it. >10% regression emits a SKIP with REGRESSION
detail (not a fail — perf claim is required:false in claims.yaml so
the gate stays green; the human summary tells the regression story
honestly). Skip-with-loud-reason if any earlier case in the run
failed, per spec "performance only after contract+integration pass."
Workload (deterministic, repeatable):
ingest 1000-row CSV (5 roles × 5 cities × seeded scores) → /v1/ingest
query SELECT count(*) ×20 against the just-ingested dataset
vector add 200 dim=4 vectors with formulaic content (no Ollama)
search ×20 against the perf index with a fixed query vector
RSS per-service post-workload sample via /proc/<pid>/status
Recorded metrics:
ingest_rows_per_sec, query_p50_ms, query_p95_ms,
vectors_per_sec_add, search_p50_ms, search_p95_ms,
rss_{storaged,catalogd,ingestd,queryd,vectord,embedd,gateway}_mb
baseline.json on this box (committed):
25000 rows/sec ingest · 17ms p50 / 24ms p95 query
6250 vectors/sec add · 8ms p50 / 20ms p95 search
queryd 69 MiB · vectord 14 MiB · others 11-29 MiB
Honest measurement-design finding from the very first compare run:
back-to-back runs surfaced -41% ingest and +29% query p50 — pure
disk-cache + queryd-cold-start noise. Single-sample baselines have
real noise floor ≈40%. Recorded as REGRESSION skips so the human
summary surfaces it, not a code regression. Tightening the threshold
or moving to multi-sample medians is a Phase E recommendation.
Verified end-to-end:
just proof contract — 53 pass · 1 skip · ~4s
just proof integration — 104 pass · 1 skip · ~8s
just proof performance — 110 pass · 3 skip · ~10s
just verify — 9 smokes still green · 29s
All 11 cases (4 contract + 6 integration + 1 performance) deterministic
end-to-end. Phase E (final report against the 9 mandated questions)
is the last piece.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the integration tier — full chain CSV→Parquet→SQL and full
text→embed→vector→search. All 10 cases (4 contract + 6 integration)
end-to-end deterministic; 8s wall total.
Cases added:
01_storage_roundtrip.sh
GOLAKE-010-012. PUT 1KiB → GET sha256-equal → LIST contains key
→ DELETE 200/204 → GET 404. Deterministic key under
proof/<case_id>/ so concurrent runs don't collide.
02_catalog_manifest.sh
GOLAKE-020-022. Fresh register existing=false → manifest read
matches → list contains dataset_id → idempotent re-register
existing=true with stable dataset_id → schema-drift register
409 (the ADR-020 contract). Per-run unique name via
PROOF_RUN_ID so existing=false is meaningful.
03_ingest_csv_to_parquet.sh
GOLAKE-030. workers.csv (5 rows) via /v1/ingest multipart →
parquet object on storaged → catalog manifest with row_count=5.
Verifies content-addressed key shape (datasets/<n>/<fp>.parquet).
04_query_correctness.sh
GOLAKE-040. The 5 SQL assertions from fixtures/expected/queries.json
against the workers fixture: count=5, Chicago=2, max=95,
safety→Barbara, Houston avg=89.5. Iterates the YAML claims, runs
each query, compares response columns to expected values.
06_vector_add_search.sh integration extension
GOLAKE-051. text → /v1/embed (4 docs from fixtures/text/docs.txt)
→ vectord add → search by query embedding. Top-1 ID per query
asserted against fixtures/expected/rankings.json. First run (or
--regenerate-rankings) writes the fixture and emits a skip with
explicit reason; subsequent runs assert against it.
07_vector_persistence_restart.sh
GOLAKE-070. add 4 unit-basis vectors → search → record top-1
distance → SIGTERM vectord → restart with the same --config →
poll /health for 8s → search again → top-1 ID and distance match
bit-identically. Skips with reason if vectord PID can't be found
or post-restart bind times out.
Two harness improvements landed alongside:
run_proof.sh writes a temp lakehouse_proof.toml with
refresh_every="500ms" override and passes --config to all booted
binaries. Production default is 30s; 04_query_correctness needs
queryd to pick up the new view within a tick. Production config
unchanged.
cleanup() now pgreps for any orphan bin/<svc> processes (anchored
to start-of-argv per memory feedback_pkill_scope.md) so a case
that restarts a service mid-run still gets cleaned up.
lib/http.sh adds proof_call(case_id, probe, method, url, args...)
— escape hatch for cases that need raw curl args (multipart -F,
custom headers). Used by 03_ingest for the multipart upload that
conflicts with proof_post's --data + Content-Type defaults.
lib/env.sh exports PROOF_RUN_ID — short unique id derived from the
report directory timestamp. Used by 02 and 07 for fresh-each-run
state isolation.
Two real findings recorded as evidence (no code changes):
- rankings.json fixture pinned: 4 queries → 4 distinct top-1 docs
via nomic-embed-text. A model swap that changes ranking now
fails the harness loudly; --regenerate-rankings is the override.
- vectord persistence kill+restart preserves top-1 distance
bit-identically — the LHV1 single-Put framed format from
G1P round-trips exactly through Save/Load.
Verified end-to-end:
just proof contract — 53 pass (4 cases)
just proof integration — 104 pass (10 cases) · 8s wall
just verify — 9 smokes still green · 33s wall
Phase D (performance baseline) lands next: 10_perf_baseline measures
rows/sec ingest, vectors/sec add, p50/p95 query+search latency, RSS,
CPU. First run writes tests/proof/baseline.json; later runs diff
against it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Added the contract tier above 00_health canary. All 5 contract cases
now cover GOLAKE-001-003, 050, 060-061, 080-085 — 53 assertions pass,
1 informational skip, 0 fail. Wall: 4s end-to-end (cached binaries).
Cases:
05_embedding_contract.sh
GOLAKE-050. POST /v1/embed with one short text → asserts dim=768,
one vector returned, vector length matches dimension, sum of
squared elements > 0 (proxy for non-zero), response.model echoed.
Skips with explicit reason if Ollama is unreachable (502 from
embedd) — per spec hard rule "skipped tests do not appear as
passed."
06_vector_add_search.sh
GOLAKE-060 + GOLAKE-061. Synthetic dim=4 unit basis vectors.
Create index → add 3 vectors → get-index returns length=3 →
search([1,0,0,0],k=3) returns v1 at rank 1 with distance < 0.001.
Cleanup with DELETE. No embedd dependency — pure contract layer.
08_gateway_contracts.sh
GOLAKE-003. For each /v1/* route, asserts gateway and direct
upstream return identical status AND identical response body
(sha256 match). Confirms gateway is a proxy not a transformer.
Status passthrough verified on both 200 path (storage/list,
catalog/list) and 4xx path (sql empty body → 400 from queryd).
09_failure_modes.sh
GOLAKE-080..085. Six failure-mode contracts:
080 malformed JSON → 4xx on catalog/ingest/sql/embed
081 missing required field → 4xx on catalog/vectors/embed
082 bad SQL → 4xx with non-empty error body
083 vector dim mismatch → 4xx
084 missing storage object → 404
085 duplicate vector ID → INFORMATIONAL (spec says required:false)
first/second statuses recorded as evidence; contract decided
later from the recorded record.
Two new lib helpers in lib/assert.sh:
proof_assert_status_in <id> <claim> "200 201 204" <probe>
pass if status is in the space-separated list. Used for
delete-returns-200-or-204 case where vectord returns 204.
proof_assert_status_4xx <id> <claim> <probe>
pass if status in [400, 500). Used for failure modes where the
specific 4xx code may vary (400 vs 422 vs 409). Records actual
code as evidence.
Two real contract findings recorded by the harness during build:
- vectord add expects {"items": [...]}, not {"vectors": [...]}.
My initial test sent the wrong field; would have masked the bug
forever in CI. The harness caught it via the assertion failure.
- vectord create returns 201 Created, delete returns 204 No Content.
Documented in the test fixtures as canonical.
Regression: just verify wall 33s, vet + test + 9 smokes still green.
Phase C (integration) lands next: 01_storage_roundtrip, 02_catalog_manifest,
03_ingest_csv_to_parquet, 04_query_correctness, 05/06 integration extends,
07_vector_persistence_restart.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per docs/TEST_PROOF_SCOPE.md, building the claims-verification tier
above the smoke chain. This commit lays the scaffolding and proves
the orchestrator end-to-end with one canary case (00_health).
What landed:
tests/proof/
README.md how to read a report, layout, modes
claims.yaml 24 claims enumerated (GOLAKE-001..100)
run_proof.sh orchestrator with --mode {contract|integration|performance}
and --no-bootstrap / --regenerate-{rankings,baseline}
lib/
env.sh service URLs, report dir, mode, git context
http.sh curl wrappers writing per-probe JSON + body + headers
assert.sh proof_assert_{eq,ne,contains,lt,gt,status,json_eq} +
proof_skip — each emits one JSONL record per call
metrics.sh start/stop timers, value capture, RSS sampling,
percentile compute (for Phase D)
cases/
00_health.sh canary — gateway + 6 services /health → 200,
body identifies service, latency < 500ms (21 assertions)
fixtures/
csv/workers.csv spec's 5-row deterministic CSV
text/docs.txt 4 deterministic vector docs
expected/queries.json expected results for the 5 SQL assertions
Wired into the task runner:
just proof contract # canary only this commit
just proof integration # Phase C
just proof performance # Phase D
.gitignore: /tests/proof/reports/* with !.gitkeep — same pattern as
reports/scrum/_evidence/. Per-run output is a runtime artifact.
Specs landed alongside (J's drops):
docs/TEST_PROOF_SCOPE.md the harness contract this implements
docs/CLAUDE_REFACTOR_GUARDRAILS.md process discipline this harness obeys
Verified end-to-end (cached binaries):
just proof contract wall < 2s, 21 pass / 0 fail / 0 skip
just verify wall 31s, vet + test + 9 smokes still green
Two bugs fixed during canary run, both in run_proof.sh aggregation:
- grep -c exits 1 on zero matches; the `|| echo 0` form concatenated
"0\n0" and broke jq --argjson + integer comparison. Fixed via a
_count helper that captures count-or-zero cleanly.
- per-case table iterated case scripts (filename-based) but cases
write evidence under CASE_ID. Switched to JSONL-file iteration so
multi-case scripts work and the mapping is faithful.
Phase B (contract cases) lands next: 05_embedding, 06_vector_add,
08_gateway_contracts, 09_failure_modes. Each sourcing the same lib
helpers and writing to the same report shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>