Adapts the docs/SCRUM.md framework (originally written for the matrix-agent-validated repo) to the Go rewrite.

Five deliverables:
- golang-lakehouse-scrum-test.md: top-line + scoring + verdict
- risk-register.md: 12 findings, R-001..R-012
- claim-coverage-table.md: claim/test/risk for Sprint 2
- sprint-backlog.md: 5 sprints, ~2 weeks of work
- acceptance-gates.md: DoD as runnable commands

Every claim cites file:line, command output, or "missing evidence." Smoke chain ran clean (33s wall, all 9 PASS) and is captured in reports/scrum/_evidence/smoke_chain.log (gitignored; runtime artifact).

Scoring:

| Dimension | Score | Notes |
|---|---|---|
| Reproducibility | 7/10 | 9 smokes deterministic, no just/CI gate |
| Test Coverage | 6/10 | internal/ packages tested, 6/7 cmd/ aren't |
| Trust Boundary | 7/10 | escapes ok, zero auth, /sql is RCE-eq off-loopback |
| Memory Correctness | 3/10 | pathway/playbook/observer not yet ported |
| Deployment Readiness | 4/10 | no REPLICATION, no env template, no systemd |
| Maintainability | 8/10 | no god-files, 7 lean binaries, ADRs current |

Top risks:
- R-001 HIGH: queryd /sql + DuckDB + non-loopback bind = RCE-equivalent
- R-002 HIGH: internal/shared (server.go + config.go) zero tests
- R-003 HIGH: internal/storeclient zero tests, used by 2 services
- R-004 MED: 9-smoke chain green but not gated (no justfile/hook)

The audit is the work; refactors come after. Sprint 0 owns coverage + CI gating; Sprint 1 owns trust-boundary decisions; Sprints 2-3 are mostly design-bar work for unbuilt agent components.

.gitignore exception: /reports/* + !/reports/scrum/ keeps reports/ a runtime-artifact directory while exposing reports/scrum/ as tracked documentation. Mirrors the pattern future audit passes will land in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
golangLAKEHOUSE — Risk Register
Severity-ranked findings from the 2026-04-29 scrum audit. Each row cites file:line or command output per SCRUM.md's "no vibes" rule. Severity uses HIGH (likely + impactful) / MED (one of those) / LOW (latent or mitigated). Risk IDs are stable — sprint-backlog.md and acceptance-gates.md reference them by ID.
HIGH severity
R-001 — queryd POST /sql accepts arbitrary SQL; localhost binding is sole guardrail
- Where: `cmd/queryd/main.go:142` registers `r.Post("/sql", h.handleSQL)`. `cmd/queryd/main.go:181` passes `req.SQL` directly to `db.QueryContext`. No allowlist, no statement-type check, no rate limit.
- Why this is HIGH: DuckDB is not a sandbox. `COPY ... TO '/tmp/x'` writes the host filesystem. `read_csv('s3://...')` reads any S3 object the configured creds can reach. `read_text('/etc/passwd')` reads local files. Anything that can reach `:3214` can exfil anything queryd's process can read.
- Today's mitigation: every binary binds `127.0.0.1` by default (`internal/shared/config.go:132-160`). Network-layer is the only auth layer.
- What breaks the mitigation: any future deploy that binds non-loopback (Docker port-publish, K8s pod IP, accidental `0.0.0.0`) opens RCE-equivalent access. There is no second line of defense.
- Recommended fix: Sprint 1. Decide the auth posture (Bearer token, mTLS, IP allow-list) and add middleware. Document the design risk in `docs/SECURITY.md`. Until middleware lands: assert in `cmd/queryd/main.go` startup that bind starts with `127.` and `os.Exit(1)` otherwise, so the process fails loudly rather than silently exposing `/sql`.
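The interim startup assertion could look roughly like the sketch below. It is not the code in `cmd/queryd/main.go`: `requireLoopback` is a hypothetical helper, the bind string is hard-coded for illustration, and instead of the literal `127.` prefix check it uses `net.IP.IsLoopback`, which also accepts `::1`.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// requireLoopback (hypothetical helper) returns an error unless addr
// ("host:port") names a loopback IP literal. Empty hosts and hostnames are
// rejected on purpose: ":3214" binds every interface, and a hostname can
// resolve to anything.
func requireLoopback(addr string) error {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return fmt.Errorf("bind %q: %w", addr, err)
	}
	ip := net.ParseIP(host)
	if ip == nil || !ip.IsLoopback() {
		return fmt.Errorf("bind %q is not a loopback address; refusing to expose /sql", addr)
	}
	return nil
}

func main() {
	// Assumption: in queryd this value would come from the loaded config.
	bind := "127.0.0.1:3214"
	if err := requireLoopback(bind); err != nil {
		fmt.Fprintln(os.Stderr, "queryd:", err)
		os.Exit(1)
	}
	fmt.Println("bind ok:", bind)
}
```

Rejecting anything that is not an IP literal is stricter than a prefix match, which seems appropriate while the network layer is the only guardrail.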
R-002 — internal/shared (server factory + config) has zero tests
- Where: `internal/shared/server.go` (server.go: 0 tests, src=2: `server.go` + `config.go`). Confirmed by `ls internal/shared/*_test.go` returning empty.
- Why HIGH: `server.go` contains the shared chi factory + race-free `net.Listen()` + graceful shutdown that every binary depends on. `config.go` contains the TOML loader that every binary calls in `main()`. A regression here breaks all 7 binaries silently, and the only thing that catches it today is the 9-smoke chain at the integration layer.
- Recommended fix: Sprint 0. Add `internal/shared/server_test.go` (table-test bind-error surfacing, graceful-shutdown ordering, /health response shape) and `config_test.go` (TOML round-trip, missing-file warn behavior, default values).
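As an illustration of the bind-error table test, here is a minimal standalone sketch. `tryBind` is a hypothetical stand-in for whatever the shared server factory does around `net.Listen()`; the real factory's signature is not shown in this register, so the cases would need adapting.

```go
package main

import (
	"fmt"
	"net"
)

// tryBind (hypothetical) binds addr synchronously and releases it again.
// Returning the net.Listen error directly is the behavior a factory test
// would pin: the caller sees the failure before any goroutine starts serving.
func tryBind(addr string) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	return ln.Close()
}

func main() {
	// Hold one ephemeral port open so the "already held" case is a real conflict.
	held, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer held.Close()

	cases := []struct {
		name    string
		addr    string
		wantErr bool
	}{
		{"free ephemeral port", "127.0.0.1:0", false},
		{"port already held", held.Addr().String(), true},
		{"malformed port", "127.0.0.1:notaport", true},
	}
	for _, c := range cases {
		err := tryBind(c.addr)
		status := "PASS"
		if (err != nil) != c.wantErr {
			status = "FAIL"
		}
		fmt.Printf("%s: %s\n", status, c.name)
	}
}
```

In the repo the same table would presumably live in `server_test.go` and use `t.Run` per case.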
R-003 — internal/storeclient has zero tests
- Where: `internal/storeclient/client.go` (src=1, test=0). Used by `catalogd` (`store_client.go` originally; extracted to shared package per memory `4205ecd`) and `vectord` (G1P persistence). Two services depend on it directly.
- Why HIGH: This client owns the keep-alive pool, body-drain semantics, and the retry/timeout policy for storaged calls. The ADR-020 idempotency contract on catalogd partially relies on this client's error semantics. Untested + load-bearing = silent correctness risk.
- Recommended fix: Sprint 0. Add `client_test.go` covering the keep-alive drain path (the comment in `internal/catalogclient/client.go` cites this as a known footgun), 4xx vs 5xx classification, body-cap enforcement on response.
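The three behaviors named in the fix can be exercised against an `httptest` server. The sketch below is not the storeclient code: `classify`, `fetch`, the 1 MiB cap, and the "4xx permanent, 5xx retryable" policy are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

const maxBody = 1 << 20 // hypothetical 1 MiB response cap

// classify maps a status code to an assumed retry policy:
// 2xx is success, 4xx is permanent, 5xx is retryable.
func classify(status int) string {
	switch {
	case status >= 200 && status < 300:
		return "ok"
	case status >= 400 && status < 500:
		return "permanent"
	case status >= 500:
		return "retryable"
	default:
		return "unexpected"
	}
}

// fetch reads at most maxBody bytes, then drains the remainder before the
// deferred Close. net/http only reuses a keep-alive connection when the
// body has been read to EOF and closed.
func fetch(c *http.Client, url string) (string, []byte, error) {
	resp, err := c.Get(url)
	if err != nil {
		return "", nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxBody))
	_, _ = io.Copy(io.Discard, resp.Body) // drain anything past the cap
	return classify(resp.StatusCode), body, err
}

func main() {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.URL.Path {
		case "/missing":
			http.Error(w, "no such object", http.StatusNotFound)
		case "/flaky":
			http.Error(w, "try again", http.StatusServiceUnavailable)
		default:
			fmt.Fprint(w, "payload")
		}
	}))
	defer srv.Close()

	for _, p := range []string{"/ok", "/missing", "/flaky"} {
		class, body, err := fetch(srv.Client(), srv.URL+p)
		fmt.Println(p, class, len(body), err)
	}
}
```

A real test would also want an upper bound on the drain itself, so a misbehaving server cannot hold the caller indefinitely.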
MEDIUM severity
R-004 — Smokes are documentation, not a CI gate
- Where: `README.md:60` shows `for s in scripts/{...}_smoke.sh; do ...; done` as the run instruction. No `justfile`, no `Makefile`, no `.github/workflows/`, no `.git/hooks/pre-push`. Confirmed by `ls justfile Makefile .github`: all "No such file."
- Why MED: the smokes are deterministic and fast (33s wall for the full chain; see `_evidence/smoke_chain.log`). The discipline of running them is purely human at the moment. A future commit that breaks `d4` will pass review unless the reviewer happens to run the chain.
- Recommended fix: Sprint 0. `justfile` with `verify` (full chain) + `smoke <day>` (single) + `doctor` (deps probe) + `fmt` / `vet` / `test` shortcuts. Pre-push hook calls `just verify` and aborts on non-zero exit.
R-005 — 6 of 7 cmd/<bin>/main.go files are untested
- Where: only `cmd/storaged/main_test.go` exists. The other six binaries' wiring layers (route registration, handler chaining, error-mapping middleware, request-body decoding) are integration-tested only via shell smokes.
- Why MED: wiring bugs don't show up in `go test` and don't show up in `go vet`. They show up at smoke time, which is a slower feedback loop than per-package unit tests would give. `cmd/queryd/main.go:142` is the highest-priority candidate for cmd-level tests because the `handleSQL` body-decode + cap path is the entry point for R-001 and runs without unit-test coverage today.
- Recommended fix: Sprint 0. Pattern-match `cmd/storaged/main_test.go`'s shape across the other 6 binaries. Test scope per binary: routes registered, body-cap rejection (request entity too large), schema-validation rejection (400 on bad JSON), happy-path with mocked dependency.
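To make the per-binary test scope concrete, the sketch below drives a handler through `httptest` for the body-cap, bad-JSON, and happy-path cases. The handler here is a hypothetical stand-in, not the `handleSQL` in `cmd/queryd/main.go`; the 1 KiB cap, the `sqlRequest` shape, and the helper names are assumptions.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

const maxSQLBody = 1024 // hypothetical request-body cap

// sqlRequest is an assumed request shape for illustration.
type sqlRequest struct {
	SQL string `json:"sql"`
}

// handleSQL (stand-in) caps the body, decodes JSON, and maps decode
// failures to 413 or 400. It never executes the SQL.
func handleSQL(w http.ResponseWriter, r *http.Request) {
	r.Body = http.MaxBytesReader(w, r.Body, maxSQLBody)
	var req sqlRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		var tooBig *http.MaxBytesError // available since Go 1.19
		if errors.As(err, &tooBig) {
			http.Error(w, "body too large", http.StatusRequestEntityTooLarge)
			return
		}
		http.Error(w, "bad JSON", http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// post sends body to the handler in-process and returns the status code.
func post(body string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/sql", strings.NewReader(body))
	handleSQL(rec, req)
	return rec.Code
}

// oversizedBody builds valid JSON that is larger than maxSQLBody.
func oversizedBody() string {
	return `{"sql":"` + strings.Repeat("x", 2*maxSQLBody) + `"}`
}

func main() {
	fmt.Println("happy path:", post(`{"sql":"SELECT 1"}`))
	fmt.Println("truncated JSON:", post(`{"sql":`))
	fmt.Println("oversized body:", post(oversizedBody()))
}
```

Because the recorder and request are built in-process, tests of this shape need no listening port and no running dependency.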
R-006 — Smokes hit real MinIO + Ollama; no fixture-only path
- Where: `g2_smoke.sh:14` requires Ollama at `:11434` with `nomic-embed-text` loaded. `d2_smoke.sh` requires MinIO at `:9000` with bucket `lakehouse-go-primary`. Confirmed in `README.md:67-71` ("Cold-start dependencies").
- Why MED: any CI runner without these services cannot run the smoke chain. Fresh-clone reviewers cannot run it. Any downtime or version drift in MinIO / Ollama produces flaky CI.
- Recommended fix: Sprint 0. Define `embed.Provider` and `storage.Bucket` mock implementations behind the existing interfaces (`internal/embed/embed.go:20`, `internal/storaged/bucket.go`). Add `just smoke-fixtures` that points the binaries at the fakes via env vars. Real-MinIO / real-Ollama smokes become the "hardware-in-the-loop" tier.
R-007 — Zero auth middleware on 22 public routes
- Where: `grep -rn 'Authorization\|Bearer'` returns zero matches outside test files. Routes inventoried: vectord (6), storaged (4), catalogd (3), queryd (1), ingestd (1), embedd (1), gateway (proxies all upstream), plus `/health` on every binary.
- Why MED: localhost-only binding is the sole guardrail (R-001 covers the worst case). Non-localhost deploy = open admin panel. The header design ("Authorization: Bearer ..." vs "X-API-Key" vs mTLS cert subject) needs to be decided once and then applied uniformly across all 22 routes; retrofitting route by route is more painful than deciding upfront.
- Recommended fix: Sprint 1. Write ADR-003 picking the auth model. Most likely choice: Bearer token + IP allow-list, with token loaded from `secrets-go.toml`. Add `internal/shared/auth.go` middleware so adding it to a new binary is one chi `r.Use()` line.
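If the ADR lands on a Bearer token, the middleware could be as small as the sketch below. It uses only the standard library and the `func(http.Handler) http.Handler` shape that chi's `r.Use()` accepts. `requireBearer` and `probe` are hypothetical names, the token is hard-coded rather than loaded from `secrets-go.toml`, and the IP allow-list half is not shown.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// requireBearer (hypothetical) rejects any request whose Authorization
// header is not exactly "Bearer <token>". An empty configured token
// rejects everything instead of silently disabling auth.
func requireBearer(token string) func(http.Handler) http.Handler {
	want := []byte(token)
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			got, ok := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
			// Constant-time compare avoids leaking the token through timing.
			if !ok || len(want) == 0 || subtle.ConstantTimeCompare([]byte(got), want) != 1 {
				w.Header().Set("WWW-Authenticate", "Bearer")
				http.Error(w, "unauthorized", http.StatusUnauthorized)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}

// probe wraps a trivial 200 handler and returns the status for a given header.
func probe(token, header string) int {
	h := requireBearer(token)(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	req := httptest.NewRequest(http.MethodGet, "/sql", nil)
	if header != "" {
		req.Header.Set("Authorization", header)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("correct token:", probe("s3cret", "Bearer s3cret"))
	fmt.Println("wrong token:", probe("s3cret", "Bearer nope"))
	fmt.Println("no header:", probe("s3cret", ""))
}
```

Whether `/health` should bypass the middleware is a decision for the ADR; this sketch protects every route it wraps.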
R-008 — internal/queryd/db.go (DuckDB connector + CREATE SECRET site) untested
- Where: `internal/queryd/db.go` is referenced via `func (h *handlers) handleSQL` and contains `sqlEscape` (line 122), `redactCreds` (line 132), and the `CREATE SECRET ... '%s'` formation (line 102). `internal/queryd/registrar_test.go` exists, but no `db_test.go`.
- Why MED: `sqlEscape` is one bug away from a credential leak via the SQL error chain. `redactCreds` is the only layer between a bad SECRET creation and S3 keys ending up in slog output. Both deserve unit tests with adversarial inputs (single-quote in key, embedded SECRET token, etc.).
- Recommended fix: Sprint 0. Add `db_test.go` with: `sqlEscape` round-trip on adversarial strings; `redactCreds` exhaustive case for empty / partial / multiple-occurrence credential values; `bootstrapStatements` order assertion (INSTALL → LOAD → CREATE SECRET).
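The adversarial cases could be tabled along the lines below. The function bodies are assumptions: this register only states that `sqlEscape` doubles `'`, and the real `redactCreds` signature is not shown, so the variadic form and the `[REDACTED]` placeholder are illustrative. `unescape` is a hypothetical test helper.

```go
package main

import (
	"fmt"
	"strings"
)

// sqlEscape doubles single quotes (assumed implementation).
func sqlEscape(s string) string {
	return strings.ReplaceAll(s, "'", "''")
}

// unescape (test helper) reverses sqlEscape for the round-trip check.
func unescape(s string) string {
	return strings.ReplaceAll(s, "''", "'")
}

// redactCreds (assumed signature) replaces every occurrence of each
// non-empty secret. Empty secrets are skipped because ReplaceAll with an
// empty pattern would insert the placeholder between every character.
func redactCreds(msg string, secrets ...string) string {
	for _, s := range secrets {
		if s == "" {
			continue
		}
		msg = strings.ReplaceAll(msg, s, "[REDACTED]")
	}
	return msg
}

func main() {
	inputs := []string{"plain", "o'brien", "'''", "key'; DROP SECRET s; --", ""}
	for _, in := range inputs {
		fmt.Printf("round-trip %q: %v\n", in, unescape(sqlEscape(in)) == in)
	}

	msg := "CREATE SECRET failed: KEY_ID 'AKIAEXAMPLE' SECRET 'abc' near 'abc'"
	fmt.Println(redactCreds(msg, "AKIAEXAMPLE", "abc", ""))
}
```

One case this simple version does not handle is a secret that is a substring of another; replacing the longer value first would be one way to cover it.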
LOW severity
R-009 — registrar.go:153 uses fmt.Sprintf for view DDL
- Where: `internal/queryd/registrar.go:153`: `sql := fmt.Sprintf("CREATE OR REPLACE VIEW %s AS SELECT * FROM %s", quoteIdent(m.Name), fromExpr)`.
- Why LOW: `m.Name` comes from catalogd's manifest (server-controlled) and is wrapped with `quoteIdent` (line 172, doubles `"`). `fromExpr` is built from S3 URLs which are themselves wrapped with `'` and escaped via `sqlEscape` (line 145, doubles `'`). DuckDB doesn't accept `?` placeholders for DDL, so `fmt.Sprintf` is unavoidable here. Inputs are not user-controlled at the SQL boundary; they came from a registration API call but the dataset name was already vetted by catalogd.
- Recommended fix: no code change; currently correct. Note as a "design risk to remember" if catalogd ever loosens validation on dataset names. Add a regression test that asserts a manifest with `name: 'foo"; DROP TABLE x; --'` produces a quoted-but-non-executing view name.
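One way to phrase "quoted-but-non-executing" as an assertion is to scan the quoted name the way a SQL lexer reads a double-quoted identifier and check that the identifier consumes the whole string. In the sketch below, `quoteIdent` is an assumed implementation based only on the "doubles `"`" note above, and `identEnd` is a hypothetical helper, not DuckDB's lexer.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent wraps name in double quotes and doubles embedded quotes
// (assumed implementation).
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// identEnd (hypothetical) returns the index just past the closing quote of
// the quoted identifier that starts at s[0], or -1 if none terminates.
func identEnd(s string) int {
	if len(s) == 0 || s[0] != '"' {
		return -1
	}
	for i := 1; i < len(s); i++ {
		if s[i] != '"' {
			continue
		}
		if i+1 < len(s) && s[i+1] == '"' {
			i++ // doubled quote: still inside the identifier
			continue
		}
		return i + 1
	}
	return -1
}

func main() {
	hostile := `foo"; DROP TABLE x; --`
	q := quoteIdent(hostile)
	fmt.Println(q)
	// If the identifier spans the whole string, nothing after it can execute.
	fmt.Println("fully contained:", identEnd(q) == len(q))
}
```

A test against a live DuckDB connection would be the stronger check; this version only guards the quoting logic itself.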
R-010 — No CORS posture on any binding
- Where: `grep -rni 'Access-Control'` returns zero hits in source. Confirmed.
- Why LOW: all binaries bind 127.0.0.1; no browser is making cross-origin requests today; the future HTMX UI will be same-origin via gateway.
- Recommended fix: none until a non-localhost binding is needed. When it is needed (Sprint 4 or later), the decision belongs in the same ADR as auth posture (R-007): same blast radius, same review.
R-011 — g2_smoke.sh:79 exact-match on nomic-embed-text model name
- Where: `scripts/g2_smoke.sh:79`: `[ "$MODEL" = "nomic-embed-text" ]`.
- Why LOW: if the operator swaps to `nomic-embed-text-v2-moe` (which is also loaded on this box), the smoke fails loudly. The dimension and recall checks would still likely pass; only the literal model-name assertion fails. That's the right failure mode (not silent acceptance), so this is more of an annotation than a finding.
- Recommended fix: none; keep the assertion strict. If the swap is intentional, the operator updates the smoke alongside the swap. That's the discipline.
R-012 — tests/ directory exists but is empty
- Where: `ls tests/` returns only `.` and `..`. Listed in `README.md:90` ("Layout") but uncited in any code path.
- Why LOW: dead directory, harmless, but suggests an older plan (Rust-style integration test convention) that didn't carry over.
- Recommended fix: either remove the directory or claim it for the fixture-mode smoke story (R-006). Pick one in Sprint 0.
Risk-to-sprint mapping
| Risk | Severity | Sprint |
|---|---|---|
| R-001 queryd /sql RCE-eq via DuckDB | HIGH | 1 |
| R-002 internal/shared untested | HIGH | 0 |
| R-003 internal/storeclient untested | HIGH | 0 |
| R-004 smokes not gated | MED | 0 |
| R-005 6/7 cmd/main.go untested | MED | 0 |
| R-006 no fixture-only smokes | MED | 0 |
| R-007 zero auth on 22 routes | MED | 1 |
| R-008 queryd/db.go untested | MED | 0 |
| R-009 registrar.go fmt.Sprintf | LOW | — (note only) |
| R-010 no CORS posture | LOW | 1 (with R-007) |
| R-011 g2 smoke model assertion | LOW | — (correct as-is) |
| R-012 empty tests/ dir | LOW | 0 |
Sprint 0 owns the test-coverage and CI-gate work (R-002, R-003, R-004, R-005, R-006, R-008, R-012). Sprint 1 owns the trust-boundary decisions (R-001, R-007, R-010). Sprints 2-4 are design-bar work for unbuilt components.