golangLAKEHOUSE/reports/scrum/risk-register.md
root 91edd43164 scrum audit: 5 reports under reports/scrum/ · score 35/60
Adapts docs/SCRUM.md framework (originally written for the
matrix-agent-validated repo) to the Go rewrite. Five deliverables:

  golang-lakehouse-scrum-test.md  top-line + scoring + verdict
  risk-register.md                12 findings, R-001..R-012
  claim-coverage-table.md         claim/test/risk for Sprint 2
  sprint-backlog.md               5 sprints, ~2 weeks of work
  acceptance-gates.md             DoD as runnable commands

Every claim cites file:line, command output, or "missing evidence."
Smoke chain ran clean (33s wall, all 9 PASS) and is captured in
reports/scrum/_evidence/smoke_chain.log (gitignored — runtime artifact).

Scoring:
  Reproducibility       7/10  9 smokes deterministic, no just/CI gate
  Test Coverage         6/10  internal/ packages tested, 6/7 cmd/ aren't
  Trust Boundary        7/10  escapes ok, zero auth, /sql is RCE-eq off-loopback
  Memory Correctness    3/10  pathway/playbook/observer not yet ported
  Deployment Readiness  4/10  no REPLICATION, no env template, no systemd
  Maintainability       8/10  no god-files, 7 lean binaries, ADRs current

Top three risks:
  R-001 HIGH  queryd /sql + DuckDB + non-loopback bind = RCE-equivalent
  R-002 HIGH  internal/shared (server.go + config.go) zero tests
  R-003 HIGH  internal/storeclient zero tests, used by 2 services
  R-004 MED   9-smoke chain green but not gated (no justfile/hook)

The audit is the work; refactors come after. Sprint 0 owns coverage
+ CI gating; Sprint 1 owns trust-boundary decisions; Sprints 2-3 are
mostly design-bar work for unbuilt agent components.

.gitignore exception: /reports/* + !/reports/scrum/ keeps reports/
a runtime-artifact directory while exposing reports/scrum/ as
tracked documentation. Mirrors the pattern future audit passes will
land in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 04:51:47 -05:00

11 KiB

golangLAKEHOUSE — Risk Register

Severity-ranked findings from the 2026-04-29 scrum audit. Each row cites file:line or command output per SCRUM.md's "no vibes" rule. Severity uses HIGH (likely + impactful) / MED (one of those) / LOW (latent or mitigated). Risk IDs are stable — sprint-backlog.md and acceptance-gates.md reference them by ID.


HIGH severity

R-001 — queryd POST /sql accepts arbitrary SQL; localhost binding is sole guardrail

  • Where: cmd/queryd/main.go:142 registers r.Post("/sql", h.handleSQL). cmd/queryd/main.go:181 passes req.SQL directly to db.QueryContext. No allowlist, no statement-type check, no rate limit.
  • Why this is HIGH: DuckDB is not a sandbox. COPY ... TO '/tmp/x' writes the host filesystem. read_csv('s3://...') reads any S3 object the configured creds can reach. read_text('/etc/passwd') reads local files. Anything that can reach :3214 can exfil anything queryd's process can read.
  • Today's mitigation: every binary binds 127.0.0.1 by default (internal/shared/config.go:132-160). Network-layer is the only auth layer.
  • What breaks the mitigation: any future deploy that binds non-loopback (Docker port-publish, K8s pod IP, accidental 0.0.0.0) opens RCE-equivalent access. There is no second line of defense.
  • Recommended fix: Sprint 1 — decide the auth posture (Bearer token, mTLS, IP allow-list) and add middleware. Document the design risk in docs/SECURITY.md. Until middleware lands: assert in cmd/queryd/main.go startup that bind starts with 127. and os.Exit(1) otherwise — fail-loud rather than silent expose.

R-002 — internal/shared (server factory + config) has zero tests

  • Where: internal/shared/server.go (server.go: 0 tests, src=2 — server.go + config.go). Confirmed by ls internal/shared/*_test.go returning empty.
  • Why HIGH: server.go contains the shared chi factory + race-free net.Listen() + graceful shutdown that every binary depends on. config.go contains the TOML loader that every binary calls in main(). A regression here breaks all 7 binaries silently — and the only thing that catches it today is the 9-smoke chain at the integration layer.
  • Recommended fix: Sprint 0 — add internal/shared/server_test.go (table-test bind-error surfacing, graceful-shutdown ordering, /health response shape) and config_test.go (TOML round-trip, missing-file warn behavior, default values).

R-003 — internal/storeclient has zero tests

  • Where: internal/storeclient/client.go (src=1, test=0). Used by catalogd (store_client.go originally; extracted to shared package per memory 4205ecd) and vectord (G1P persistence). Two services depend on it directly.
  • Why HIGH: This client owns the keep-alive pool, body-drain semantics, and the retry/timeout policy for storaged calls. The ADR-020 idempotency contract on catalogd partially relies on this client's error semantics. Untested + load-bearing = silent correctness risk.
  • Recommended fix: Sprint 0 — add client_test.go covering the keep-alive drain path (the comment in internal/catalogclient/client.go cites this as a known footgun), 4xx vs 5xx classification, body-cap enforcement on response.

MEDIUM severity

R-004 — Smokes are documentation, not a CI gate

  • Where: README.md:60 shows for s in scripts/{...}_smoke.sh; do ...; done as the run instruction. No justfile, no Makefile, no .github/workflows/, no .git/hooks/pre-push. Confirmed by ls justfile Makefile .github — all "No such file."
  • Why MED: the smokes are deterministic and fast (33s wall for the full chain — _evidence/smoke_chain.log). The discipline of running them is purely human at the moment. A future commit that breaks d4 will pass review unless the reviewer happens to run the chain.
  • Recommended fix: Sprint 0 — justfile with verify (full chain) + smoke <day> (single) + doctor (deps probe) + fmt/vet/test shortcuts. Pre-push hook calls just verify and aborts on non-zero exit.

R-005 — 6 of 7 cmd/<bin>/main.go files are untested

  • Where: only cmd/storaged/main_test.go exists. The other six binaries' wiring layers (route registration, handler chaining, error-mapping middleware, request-body decoding) are integration-tested only via shell smokes.
  • Why MED: wiring bugs don't show up in go test and don't show up in go vet. They show up at smoke time, which is a slower feedback loop than per-package unit tests would give. cmd/queryd/main.go:142 is the highest-priority candidate for cmd-level tests because the handleSQL body-decode + cap path is the entry point for R-001 and runs without unit-test coverage today.
  • Recommended fix: Sprint 0 — pattern-match cmd/storaged/main_test.go's shape across the other 6 binaries. Test scope per binary: routes registered, body-cap rejection (request entity too large), schema-validation rejection (400 on bad JSON), happy-path with mocked dependency.

R-006 — Smokes hit real MinIO + Ollama; no fixture-only path

  • Where: g2_smoke.sh:14 requires Ollama at :11434 with nomic-embed-text loaded. d2_smoke.sh requires MinIO at :9000 with bucket lakehouse-go-primary. Confirmed in README.md:67-71 ("Cold-start dependencies").
  • Why MED: any CI runner without these services cannot run the smoke chain. Fresh-clone reviewers cannot run it. Any downtime or version drift in MinIO / Ollama produces flaky CI.
  • Recommended fix: Sprint 0 — define embed.Provider and storage.Bucket mock implementations behind the existing interfaces (internal/embed/embed.go:20, internal/storaged/bucket.go). Add just smoke-fixtures that points the binaries at the fakes via env vars. Real-MinIO / real-Ollama smokes become the "hardware-in-the-loop" tier.

R-007 — Zero auth middleware on 22 public routes

  • Where: grep -rn 'Authorization\|Bearer' returns zero matches outside test files. Routes inventoried: vectord (6), storaged (4), catalogd (3), queryd (1), ingestd (1), embedd (1), gateway (proxies all upstream), plus /health on every binary.
  • Why MED: localhost-only binding is the sole guardrail (R-001 covers the worst case). Non-localhost deploy = open admin panel. The header design ("Authorization: Bearer ..." vs "X-API-Key" vs mTLS cert subject) needs to be decided once and then applied uniformly across all 22 routes — retrofit is more painful per-route than upfront.
  • Recommended fix: Sprint 1 — write ADR-003 picking the auth model. Most likely choice: Bearer token + IP allow-list, with token loaded from secrets-go.toml. Add internal/shared/auth.go middleware so adding it to a new binary is one chi r.Use() line.

R-008 — internal/queryd/db.go (DuckDB connector + CREATE SECRET site) untested

  • Where: internal/queryd/db.go is referenced via func (h *handlers) handleSQL and contains sqlEscape (line 122), redactCreds (line 132), and the CREATE SECRET ... '%s' formation (line 102). internal/queryd/registrar_test.go exists, but no db_test.go.
  • Why MED: sqlEscape correctness is one bug from a credential-leak via SQL error chain. redactCreds correctness is the only layer between a bad SECRET creation and S3 keys ending up in slog output. Both deserve unit tests with adversarial inputs (single-quote in key, embedded SECRET token, etc.).
  • Recommended fix: Sprint 0 — add db_test.go with: sqlEscape round-trip on adversarial strings; redactCreds exhaustive case for empty / partial / multiple-occurrence credential values; bootstrapStatements order assertion (INSTALL → LOAD → CREATE SECRET).

LOW severity

R-009 — registrar.go:153 uses fmt.Sprintf for view DDL

  • Where: internal/queryd/registrar.go:153sql := fmt.Sprintf("CREATE OR REPLACE VIEW %s AS SELECT * FROM %s", quoteIdent(m.Name), fromExpr).
  • Why LOW: m.Name comes from catalogd's manifest (server-controlled), is wrapped with quoteIdent (line 172, doubles "). fromExpr is built from S3 URLs which are themselves wrapped with ' and escaped via sqlEscape (line 145, doubles '). DuckDB doesn't accept ? placeholders for DDL, so fmt.Sprintf is unavoidable here. Inputs are not user-controlled at the SQL boundary; they came from a registration API call but the dataset name was already vetted by catalogd.
  • Recommended fix: none — currently correct. Note as a "design risk to remember" if catalogd ever loosens validation on dataset names. Add a regression test that asserts a manifest with name: 'foo"; DROP TABLE x; --' produces a quoted-but-non-executing view name.

R-010 — No CORS posture on any binding

  • Where: grep -rni 'Access-Control' returns zero hits in source. Confirmed.
  • Why LOW: all binaries bind 127.0.0.1; no browser is making cross-origin requests today; the future HTMX UI will be same-origin via gateway.
  • Recommended fix: none until a non-localhost binding is needed. When it is needed (Sprint 4 or later), the decision belongs in the same ADR as auth posture (R-007) — same blast radius, same review.

R-011 — g2_smoke.sh:79 exact-match on nomic-embed-text model name

  • Where: scripts/g2_smoke.sh:79[ "$MODEL" = "nomic-embed-text" ].
  • Why LOW: if the operator swaps to nomic-embed-text-v2-moe (which is also loaded on this box), the smoke fails loudly — the dimension and recall would still likely pass; only the literal model-name assertion fails. That's the right failure mode (not silent acceptance), so this is more of an annotation than a finding.
  • Recommended fix: none — keep the assertion strict. If the swap is intentional, the operator updates the smoke alongside the swap. That's the discipline.

R-012 — tests/ directory exists but is empty

  • Where: ls tests/ returns only . and ... Listed in README.md:90 ("Layout") but uncited in any code path.
  • Why LOW: dead directory, harmless, but suggests an older plan (Rust-style integration test convention) that didn't carry over.
  • Recommended fix: either remove the directory or claim it for the fixture-mode smoke story (R-006). Pick one in Sprint 0.

Risk-to-sprint mapping

Risk Severity Sprint
R-001 queryd /sql RCE-eq via DuckDB HIGH 1
R-002 internal/shared untested HIGH 0
R-003 internal/storeclient untested HIGH 0
R-004 smokes not gated MED 0
R-005 6/7 cmd/main.go untested MED 0
R-006 no fixture-only smokes MED 0
R-007 zero auth on 22 routes MED 1
R-008 queryd/db.go untested MED 0
R-009 registrar.go fmt.Sprintf LOW — (note only)
R-010 no CORS posture LOW 1 (with R-007)
R-011 g2 smoke model assertion LOW — (correct as-is)
R-012 empty tests/ dir LOW 0

Sprint 0 owns the test-coverage and CI-gate work (R-002, R-003, R-004, R-005, R-006, R-008, R-012). Sprint 1 owns the trust-boundary decisions (R-001, R-007, R-010). Sprint 2-4 are design-bar work for unbuilt components.