The 5-loop substrate's load-bearing gate is verified — playbook +
matrix indexer give the results we're looking for. Per the report's
rubric, lift ≥ 50% of discoveries means matrix is doing real work;
7/8 = 87.5% blew through that.
Harness was structurally hiding bugs behind a 5-daemon stripped boot.
Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:
1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd
rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark) —
wrong domain for staffing queries; replaced with ethereal_workers
(10K rows, real staffing schema, "e-" id prefix to avoid collision
with workers' "w-"). staffing_workers driver gains -index-name +
-id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running
~30s per judge call against the lift loop; reverted to
qwen2.5:latest (~1s/call, 30× faster, held lift theory)
Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go
so future drift fires in `go test`, not in a reality run. R-005 closed:
- cmd/matrixd/main_test.go (new) — playbook record drift detector +
score bounds + 6 routes mounted
- cmd/queryd/main_test.go — wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new) — 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new) — 4 routes + invalid-op + unknown-mode
`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.
Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
Queries 21 (staffing-domain, 7 categories)
Discoveries 8 (judge ≠ cosine top-1)
Lifts 7/8 (87.5%)
Boosts triggered 9
Mean Δ distance -0.053 (warm closer than cold)
OOD honesty dental/RN/SWE rated 1, no fake matches
Cross-corpus boosts confirmed (e- ↔ w- swaps in lifts)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the auth posture from ADR-003 (commit 0d18ffa). Two
independent layers — Bearer token (constant-time compare via
crypto/subtle) and IP allowlist (CIDR set) — composed in shared.Run
so every binary inherits the same gate without per-binary wiring.
Together with the bind-gate from commit 6af0520, this mechanically
closes audit risks R-001 + R-007:
- non-loopback bind without auth.token = startup refuse
- non-loopback bind WITH auth.token + override env = allowed
- loopback bind = all gates open (G0 dev unchanged)
internal/shared/auth.go (NEW)
RequireAuth(cfg AuthConfig) returns chi-compatible middleware.
Empty Token + empty AllowedIPs → pass-through (G0 dev mode).
Token-only → 401 Bearer mismatch.
AllowedIPs-only → 403 source IP not in CIDR set.
Both → both gates apply.
/health bypasses both layers (load-balancer / liveness probes
shouldn't carry tokens).
CIDR parsing pre-runs at boot; bare IP (no /N) treated as /32 (or
/128 for IPv6). Invalid entries log warn and drop, fail-loud-but-
not-fatal so a typo doesn't kill the binary.
Token comparison: subtle.ConstantTimeCompare on the full
"Bearer <token>" wire-format string. Length-mismatch returns 0
(per stdlib spec), so wrong-length tokens reject without timing
leak. Pre-encoded comparison slice stored in the middleware
closure — one allocation per request.
Source-IP extraction prefers net.SplitHostPort fallback to
RemoteAddr-as-is for httptest compatibility. X-Forwarded-For
support is a follow-up when a trusted proxy fronts the gateway
(config knob TBD per ADR-003 §"Future").
internal/shared/server.go
Run signature: gained AuthConfig parameter (4th arg).
/health stays mounted on the outer router (public).
Registered routes go inside chi.Group with RequireAuth applied —
empty config = transparent group.
Added requireAuthOnNonLoopback startup check: non-loopback bind
with empty Token = refuse to start (cites R-001 + R-007 by name).
internal/shared/config.go
AuthConfig type added with TOML tags. Fields: Token, AllowedIPs.
Composed into Config under [auth].
cmd/<svc>/main.go × 7 (catalogd, embedd, gateway, ingestd, queryd,
storaged, vectord, mcpd is unaffected — stdio doesn't bind a port)
Each call site adds cfg.Auth as the 4th arg to shared.Run. No
other changes — middleware applies via shared.Run uniformly.
internal/shared/auth_test.go (12 test funcs)
Empty config pass-through, missing-token 401, wrong-token 401,
correct-token 200, raw-token-without-Bearer-prefix 401, /health
always public, IP allowlist allow + reject, bare IP /32, both
layers when both configured, invalid CIDR drop-with-warn, RemoteAddr
shape extraction. The constant-time comparison is verified by
inspection (comments in auth.go) plus the existence of the
passthrough test (length-mismatch case).
Verified:
go test -count=1 ./internal/shared/ — all green (was 21, now 33 funcs)
just verify — vet + test + 9 smokes 33s
just proof contract — 53/0/1 unchanged
Smokes + proof harness keep working without any token configuration:
default Auth is empty struct → middleware is no-op → existing tests
pass unchanged. To exercise the gate, operators set [auth].token in
lakehouse.toml (or, per the "future" note in the ADR, via env var).
Closes audit findings:
R-001 HIGH — fully mechanically closed (was: partial via bind gate)
R-007 MED — fully mechanically closed (was: design-only ADR-003)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds main_test.go for each of the 6 cmd binaries that lacked them
(storaged already had main_test.go; that's where the pattern came
from). Each test file focuses on the cmd-specific surface — route
mounts, body caps, decode/validation paths — without re-testing
internal package logic that's covered elsewhere.
cmd/catalogd/main_test.go — 6 funcs
TestRoutesMounted: chi.Walk asserts /catalog/{register,manifest/*,list}
TestHandleRegister_BodyTooLarge: 5 MiB body → 4xx
TestHandleRegister_MalformedJSON: 400
TestHandleRegister_EmptyName_400: ErrEmptyName surfaces as 400
TestHandleGetManifest_404 + TestHandleList_EmptyShape
cmd/embedd/main_test.go — 8 funcs
stubProvider implements embed.Provider deterministically
TestRoutesMounted, MalformedJSON_400, EmptyTextRejected_400 (per
scrum O-W3), UpstreamError_502 (provider error → 502, not 500),
HappyPath_ProviderEcho, BodyTooLarge (4xx range), TestItoa
(covers the no-strconv helper)
cmd/gateway/main_test.go — 4 funcs
TestMustParseUpstream_HappyPaths: 3 valid URLs
TestMustParseUpstream_FailureExits: re-execs the test binary in a
subprocess with env flag (standard pattern for testing os.Exit
callers); subprocess invokes mustParseUpstream("127.0.0.1:3211")
[missing scheme]; expects exit non-zero. Same pattern for garbage.
TestUpstreamConfigKeys_DocumentedShape: locks the 6 _url keys
cmd/ingestd/main_test.go — 7 funcs
Stubs both storaged and catalogd via httptest.Server so the cmd
layer can be exercised without bringing the full chain up.
TestHandleIngest_MissingNameQueryParam: 400 with "name" in body
TestHandleIngest_MalformedMultipart: 400
TestHandleIngest_MissingFormFile: 400 (valid multipart, wrong field)
TestHandleIngest_BodyTooLarge: 4xx
TestEscapeKeyPath: 6-case URL-escape table (apostrophe, space, etc.)
TestParquetKeyPath_Format: locks the datasets/<n>/<fp>.parquet shape
per scrum C-DRIFT (any rename breaks idempotent re-ingest)
cmd/queryd/main_test.go — 6 funcs
Tests pre-DB paths (decode, body cap, empty SQL); db.QueryContext
itself needs DuckDB so it's covered by GOLAKE-040 in the proof
harness, not unit tests. handlers.db = nil here is intentional.
TestHandleSQL_EmptySQL_400: 3 cases (empty, whitespace, mixed-WS)
TestMaxSQLBodyBytes_Reasonable: locks the 64 KiB constant in a
sane range so a refactor can't blow it open
TestPrimaryBucket_Constant: locks "primary" — secrets lookup uses
this; rename = silent secret-resolution failure at boot
cmd/vectord/main_test.go — 14 funcs
All 6 routes verified mounted. handlers.persist = nil = pure
in-memory mode; persistence is GOLAKE-070 in the proof harness.
Coverage of every error branch in handleCreate/Add/Search/Delete:
missing index → 404, dim mismatch → 400, empty items → 400,
empty id → 400, malformed JSON → 400, body too large → 4xx,
happy create → 201, happy list → 200.
One real finding caught during writing:
Body-cap rejection is sometimes 413 (typed MaxBytesError survives
unwrap) and sometimes 400 (decoder wraps it as a generic decode
error). Both are valid client-error contracts; the contract isn't
"exactly 413" but "fails loud as 4xx, never silent 200 or 5xx."
Tests assert 4xx range. The proof harness's
proof_assert_status_4xx already had this shape — just bringing
the unit tests in line with it.
Verified:
go test -count=1 -short ./cmd/... — all 7 packages green
just verify — vet + test + 9 smokes 35s
Closes audit risk R-005 (6/7 cmd/main.go untested). Combined with
the proof harness's wiring coverage, every cmd-level handler now
has both unit-test and integration-test coverage of the wiring
layer. R-005 → CLOSED.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase G0 Day 5 ships queryd: in-memory DuckDB with custom Connector
that runs INSTALL httpfs / LOAD httpfs / CREATE OR REPLACE SECRET
(TYPE S3) on every new connection, sourced from SecretsProvider +
shared.S3Config. SetMaxOpenConns(1) so registrar's CREATE VIEWs and
handler's SELECTs serialize through one connection (avoids cross-
connection MVCC visibility edge cases).
Registrar.Refresh reads catalogd /catalog/list, runs CREATE OR
REPLACE VIEW "name" AS SELECT * FROM read_parquet('s3://bucket/key')
per manifest, drops views for removed manifests, skips on unchanged
updated_at (the implicit etag). Drop pass runs BEFORE create pass so
a poison manifest can't block other manifest refreshes (post-scrum
C1 fix).
POST /sql with JSON body {"sql":"…"} returns
{"columns":[{"name":"id","type":"BIGINT"},…], "rows":[[…]],
"row_count":N}. []byte → string conversion so VARCHAR rows
JSON-encode as text. 30s default refresh ticker, configurable via
[queryd].refresh_every.
Cross-lineage scrum on shipped code:
- Opus 4.7 (opencode): 1 BLOCK + 4 WARN + 4 INFO
- Kimi K2-0905 (openrouter): 2 BLOCK + 2 WARN + 1 INFO
- Qwen3-coder (openrouter): 2 BLOCK + 1 WARN + 1 INFO
Fixed (4):
C1 (Opus + Kimi convergent): Refresh aborts on first per-view error
→ drop pass first, collect errors, errors.Join. Poison manifest
no longer blocks the rest of the catalog from re-syncing.
B-CTX (Opus BLOCK): bootstrap closure captured OpenDB's ctx →
cancelled-ctx silently fails every reconnect. context.Background()
inside closure; passed ctx only for initial Ping.
B-LEAK (Kimi BLOCK): firstLine(stmt) truncated CREATE SECRET to 80
chars but those 80 chars contained KEY_ID + SECRET prefix → log
aggregator captures credentials. Stable per-statement labels +
redactCreds() filter on wrapped DuckDB errors.
JSON-ERR (Opus WARN): swallowed json.Encode error → silent
truncated 200 on unsupported column types. slog.Warn the failure.
Dismissed (4 false positives):
Qwen BLOCK "bootstrap not transactional" — DuckDB DDL is auto-commit
Qwen BLOCK "MaxBytesReader after Decode" — false, applied before
Kimi BLOCK "concurrent Refresh + user SELECT deadlock" — not a
deadlock, just serialization, by design with 10s timeout retry
Kimi WARN "dropView leaves r.known inconsistent" — current code
returns before the delete; the entry persists for retry
Critical reviewer behavior: 1 convergent BLOCK between Opus + Kimi
on the per-view error blocking, plus two independent single-reviewer
BLOCKs (B-CTX, B-LEAK) that smoke could never have caught. The
B-LEAK fix uses defense-in-depth: never pass SQL into the error
path AND redact known cred values from DuckDB's own error message.
DuckDB cgo path: github.com/duckdb/duckdb-go/v2 v2.10502.0 (per
ADR-001 §1) on Go 1.25 + arrow-go. Smoke 6/6 PASS after every
fix round.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>