golangLAKEHOUSE/reports/scrum/risk-register.md
root 91edd43164 scrum audit: 5 reports under reports/scrum/ · score 35/60
Adapts docs/SCRUM.md framework (originally written for the
matrix-agent-validated repo) to the Go rewrite. Five deliverables:

  golang-lakehouse-scrum-test.md  top-line + scoring + verdict
  risk-register.md                12 findings, R-001..R-012
  claim-coverage-table.md         claim/test/risk for Sprint 2
  sprint-backlog.md               5 sprints, ~2 weeks of work
  acceptance-gates.md             DoD as runnable commands

Every claim cites file:line, command output, or "missing evidence."
Smoke chain ran clean (33s wall, all 9 PASS) and is captured in
reports/scrum/_evidence/smoke_chain.log (gitignored — runtime artifact).

Scoring:
  Reproducibility       7/10  9 smokes deterministic, no just/CI gate
  Test Coverage         6/10  internal/ packages tested, 6/7 cmd/ aren't
  Trust Boundary        7/10  escapes ok, zero auth, /sql is RCE-eq off-loopback
  Memory Correctness    3/10  pathway/playbook/observer not yet ported
  Deployment Readiness  4/10  no REPLICATION, no env template, no systemd
  Maintainability       8/10  no god-files, 7 lean binaries, ADRs current

Top three risks:
  R-001 HIGH  queryd /sql + DuckDB + non-loopback bind = RCE-equivalent
  R-002 HIGH  internal/shared (server.go + config.go) zero tests
  R-003 HIGH  internal/storeclient zero tests, used by 2 services
  R-004 MED   9-smoke chain green but not gated (no justfile/hook)

The audit is the work; refactors come after. Sprint 0 owns coverage
+ CI gating; Sprint 1 owns trust-boundary decisions; Sprints 2-3 are
mostly design-bar work for unbuilt agent components.

.gitignore exception: /reports/* + !/reports/scrum/ keeps reports/
a runtime-artifact directory while exposing reports/scrum/ as
tracked documentation. Mirrors the pattern future audit passes will
land in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 04:51:47 -05:00

111 lines
11 KiB
Markdown

# golangLAKEHOUSE — Risk Register
Severity-ranked findings from the 2026-04-29 scrum audit. Each row cites `file:line` or command output per SCRUM.md's "no vibes" rule. Severity uses HIGH (likely + impactful) / MED (one of those) / LOW (latent or mitigated). Risk IDs are stable — `sprint-backlog.md` and `acceptance-gates.md` reference them by ID.
---
## HIGH severity
### R-001 — `queryd POST /sql` accepts arbitrary SQL; localhost binding is sole guardrail
- **Where:** `cmd/queryd/main.go:142` registers `r.Post("/sql", h.handleSQL)`. `cmd/queryd/main.go:181` passes `req.SQL` directly to `db.QueryContext`. No allowlist, no statement-type check, no rate limit.
- **Why this is HIGH:** DuckDB is not a sandbox. `COPY ... TO '/tmp/x'` writes the host filesystem. `read_csv('s3://...')` reads any S3 object the configured creds can reach. `read_text('/etc/passwd')` reads local files. Anything that can reach `:3214` can exfil anything queryd's process can read.
- **Today's mitigation:** every binary binds `127.0.0.1` by default (`internal/shared/config.go:132-160`). Network-layer is the only auth layer.
- **What breaks the mitigation:** any future deploy that binds non-loopback (Docker port-publish, K8s pod IP, accidental `0.0.0.0`) opens RCE-equivalent access. There is no second line of defense.
- **Recommended fix:** Sprint 1 — decide the auth posture (Bearer token, mTLS, IP allow-list) and add middleware. Document the design risk in `docs/SECURITY.md`. Until middleware lands: assert in `cmd/queryd/main.go` startup that bind starts with `127.` and `os.Exit(1)` otherwise — fail-loud rather than silent expose.
### R-002 — `internal/shared` (server factory + config) has zero tests
- **Where:** `internal/shared/server.go` (server.go: 0 tests, src=2 — `server.go` + `config.go`). Confirmed by `ls internal/shared/*_test.go` returning empty.
- **Why HIGH:** `server.go` contains the shared chi factory + race-free `net.Listen()` + graceful shutdown that every binary depends on. `config.go` contains the TOML loader that every binary calls in `main()`. A regression here breaks all 7 binaries silently — and the only thing that catches it today is the 9-smoke chain at the integration layer.
- **Recommended fix:** Sprint 0 — add `internal/shared/server_test.go` (table-test bind-error surfacing, graceful-shutdown ordering, /health response shape) and `config_test.go` (TOML round-trip, missing-file warn behavior, default values).
### R-003 — `internal/storeclient` has zero tests
- **Where:** `internal/storeclient/client.go` (src=1, test=0). Used by `catalogd` (`store_client.go` originally; extracted to shared package per memory `4205ecd`) and `vectord` (G1P persistence). Two services depend on it directly.
- **Why HIGH:** This client owns the keep-alive pool, body-drain semantics, and the retry/timeout policy for storaged calls. The ADR-020 idempotency contract on catalogd partially relies on this client's error semantics. Untested + load-bearing = silent correctness risk.
- **Recommended fix:** Sprint 0 — add `client_test.go` covering the keep-alive drain path (the comment in `internal/catalogclient/client.go` cites this as a known footgun), 4xx vs 5xx classification, body-cap enforcement on response.
---
## MEDIUM severity
### R-004 — Smokes are documentation, not a CI gate
- **Where:** `README.md:60` shows `for s in scripts/{...}_smoke.sh; do ...; done` as the run instruction. No `justfile`, no `Makefile`, no `.github/workflows/`, no `.git/hooks/pre-push`. Confirmed by `ls justfile Makefile .github` — all "No such file."
- **Why MED:** the smokes are *deterministic and fast* (33s wall for the full chain — `_evidence/smoke_chain.log`). The discipline of running them is purely human at the moment. A future commit that breaks `d4` will pass review unless the reviewer happens to run the chain.
- **Recommended fix:** Sprint 0 — `justfile` with `verify` (full chain) + `smoke <day>` (single) + `doctor` (deps probe) + `fmt`/`vet`/`test` shortcuts. Pre-push hook calls `just verify` and aborts on non-zero exit.
### R-005 — 6 of 7 `cmd/<bin>/main.go` files are untested
- **Where:** only `cmd/storaged/main_test.go` exists. The other six binaries' wiring layers (route registration, handler chaining, error-mapping middleware, request-body decoding) are integration-tested only via shell smokes.
- **Why MED:** wiring bugs don't show up in `go test` and don't show up in `go vet`. They show up at smoke time, which is a slower feedback loop than per-package unit tests would give. `cmd/queryd/main.go:142` is the highest-priority candidate for cmd-level tests because the `handleSQL` body-decode + cap path is the entry point for R-001 and runs without unit-test coverage today.
- **Recommended fix:** Sprint 0 — pattern-match `cmd/storaged/main_test.go`'s shape across the other 6 binaries. Test scope per binary: routes registered, body-cap rejection (request entity too large), schema-validation rejection (400 on bad JSON), happy-path with mocked dependency.
### R-006 — Smokes hit real MinIO + Ollama; no fixture-only path
- **Where:** `g2_smoke.sh:14` requires Ollama at `:11434` with `nomic-embed-text` loaded. `d2_smoke.sh` requires MinIO at `:9000` with bucket `lakehouse-go-primary`. Confirmed in `README.md:67-71` ("Cold-start dependencies").
- **Why MED:** any CI runner without these services cannot run the smoke chain. Fresh-clone reviewers cannot run it. Any downtime or version drift in MinIO / Ollama produces flaky CI.
- **Recommended fix:** Sprint 0 — define `embed.Provider` and `storage.Bucket` mock implementations behind the existing interfaces (`internal/embed/embed.go:20`, `internal/storaged/bucket.go`). Add `just smoke-fixtures` that points the binaries at the fakes via env vars. Real-MinIO / real-Ollama smokes become the "hardware-in-the-loop" tier.
### R-007 — Zero auth middleware on 22 public routes
- **Where:** `grep -rn 'Authorization\|Bearer'` returns zero matches outside test files. Routes inventoried: vectord (6), storaged (4), catalogd (3), queryd (1), ingestd (1), embedd (1), gateway (proxies all upstream), plus `/health` on every binary.
- **Why MED:** localhost-only binding is the sole guardrail (R-001 covers the worst case). Non-localhost deploy = open admin panel. The header design ("Authorization: Bearer ..." vs "X-API-Key" vs mTLS cert subject) needs to be decided once and then applied uniformly across all 22 routes — retrofit is more painful per-route than upfront.
- **Recommended fix:** Sprint 1 — write ADR-003 picking the auth model. Most likely choice: Bearer token + IP allow-list, with token loaded from `secrets-go.toml`. Add `internal/shared/auth.go` middleware so adding it to a new binary is one chi `r.Use()` line.
### R-008 — `internal/queryd/db.go` (DuckDB connector + `CREATE SECRET` site) untested
- **Where:** `internal/queryd/db.go` is referenced via `func (h *handlers) handleSQL` and contains `sqlEscape` (line 122), `redactCreds` (line 132), and the `CREATE SECRET ... '%s'` formation (line 102). `internal/queryd/registrar_test.go` exists, but no `db_test.go`.
- **Why MED:** `sqlEscape` correctness is one bug from a credential-leak via SQL error chain. `redactCreds` correctness is the *only* layer between a bad SECRET creation and S3 keys ending up in slog output. Both deserve unit tests with adversarial inputs (single-quote in key, embedded SECRET token, etc.).
- **Recommended fix:** Sprint 0 — add `db_test.go` with: `sqlEscape` round-trip on adversarial strings; `redactCreds` exhaustive case for empty / partial / multiple-occurrence credential values; `bootstrapStatements` order assertion (INSTALL → LOAD → CREATE SECRET).
---
## LOW severity
### R-009 — `registrar.go:153` uses `fmt.Sprintf` for view DDL
- **Where:** `internal/queryd/registrar.go:153``sql := fmt.Sprintf("CREATE OR REPLACE VIEW %s AS SELECT * FROM %s", quoteIdent(m.Name), fromExpr)`.
- **Why LOW:** `m.Name` comes from catalogd's manifest (server-controlled), is wrapped with `quoteIdent` (line 172, doubles `"`). `fromExpr` is built from S3 URLs which are themselves wrapped with `'` and escaped via `sqlEscape` (line 145, doubles `'`). DuckDB doesn't accept `?` placeholders for DDL, so `fmt.Sprintf` is unavoidable here. Inputs are not user-controlled at the SQL boundary; they came from a registration API call but the dataset name was already vetted by catalogd.
- **Recommended fix:** none — currently correct. Note as a "design risk to remember" if catalogd ever loosens validation on dataset names. Add a regression test that asserts a manifest with `name: 'foo"; DROP TABLE x; --'` produces a quoted-but-non-executing view name.
### R-010 — No CORS posture on any binding
- **Where:** `grep -rni 'Access-Control'` returns zero hits in source. Confirmed.
- **Why LOW:** all binaries bind 127.0.0.1; no browser is making cross-origin requests today; the future HTMX UI will be same-origin via gateway.
- **Recommended fix:** none until a non-localhost binding is needed. When it is needed (Sprint 4 or later), the decision belongs in the same ADR as auth posture (R-007) — same blast radius, same review.
### R-011 — `g2_smoke.sh:79` exact-match on `nomic-embed-text` model name
- **Where:** `scripts/g2_smoke.sh:79``[ "$MODEL" = "nomic-embed-text" ]`.
- **Why LOW:** if the operator swaps to `nomic-embed-text-v2-moe` (which is also loaded on this box), the smoke fails *loudly* — the dimension and recall would still likely pass; only the literal model-name assertion fails. That's the right failure mode (not silent acceptance), so this is more of an annotation than a finding.
- **Recommended fix:** none — keep the assertion strict. If the swap is intentional, the operator updates the smoke alongside the swap. That's the discipline.
### R-012 — `tests/` directory exists but is empty
- **Where:** `ls tests/` returns only `.` and `..`. Listed in `README.md:90` ("Layout") but uncited in any code path.
- **Why LOW:** dead directory, harmless, but suggests an older plan (Rust-style integration test convention) that didn't carry over.
- **Recommended fix:** either remove the directory or claim it for the fixture-mode smoke story (R-006). Pick one in Sprint 0.
---
## Risk-to-sprint mapping
| Risk | Severity | Sprint |
|---|---|---|
| R-001 queryd /sql RCE-eq via DuckDB | HIGH | 1 |
| R-002 internal/shared untested | HIGH | 0 |
| R-003 internal/storeclient untested | HIGH | 0 |
| R-004 smokes not gated | MED | 0 |
| R-005 6/7 cmd/main.go untested | MED | 0 |
| R-006 no fixture-only smokes | MED | 0 |
| R-007 zero auth on 22 routes | MED | 1 |
| R-008 queryd/db.go untested | MED | 0 |
| R-009 registrar.go fmt.Sprintf | LOW | — (note only) |
| R-010 no CORS posture | LOW | 1 (with R-007) |
| R-011 g2 smoke model assertion | LOW | — (correct as-is) |
| R-012 empty tests/ dir | LOW | 0 |
Sprint 0 owns the test-coverage and CI-gate work (R-002, R-003, R-004, R-005, R-006, R-008, R-012). Sprint 1 owns the trust-boundary decisions (R-001, R-007, R-010). Sprint 2-4 are design-bar work for unbuilt components.