Adapts docs/SCRUM.md framework (originally written for the matrix-agent-validated repo) to the Go rewrite. Five deliverables: golang-lakehouse-scrum-test.md top-line + scoring + verdict risk-register.md 12 findings, R-001..R-012 claim-coverage-table.md claim/test/risk for Sprint 2 sprint-backlog.md 5 sprints, ~2 weeks of work acceptance-gates.md DoD as runnable commands Every claim cites file:line, command output, or "missing evidence." Smoke chain ran clean (33s wall, all 9 PASS) and is captured in reports/scrum/_evidence/smoke_chain.log (gitignored — runtime artifact). Scoring: Reproducibility 7/10 9 smokes deterministic, no just/CI gate Test Coverage 6/10 internal/ packages tested, 6/7 cmd/ aren't Trust Boundary 7/10 escapes ok, zero auth, /sql is RCE-eq off-loopback Memory Correctness 3/10 pathway/playbook/observer not yet ported Deployment Readiness 4/10 no REPLICATION, no env template, no systemd Maintainability 8/10 no god-files, 7 lean binaries, ADRs current Top three risks: R-001 HIGH queryd /sql + DuckDB + non-loopback bind = RCE-equivalent R-002 HIGH internal/shared (server.go + config.go) zero tests R-003 HIGH internal/storeclient zero tests, used by 2 services R-004 MED 9-smoke chain green but not gated (no justfile/hook) The audit is the work; refactors come after. Sprint 0 owns coverage + CI gating; Sprint 1 owns trust-boundary decisions; Sprints 2-3 are mostly design-bar work for unbuilt agent components. .gitignore exception: /reports/* + !/reports/scrum/ keeps reports/ a runtime-artifact directory while exposing reports/scrum/ as tracked documentation. Mirrors the pattern future audit passes will land in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
111 lines
11 KiB
Markdown
111 lines
11 KiB
Markdown
# golangLAKEHOUSE — Risk Register
|
|
|
|
Severity-ranked findings from the 2026-04-29 scrum audit. Each row cites `file:line` or command output per SCRUM.md's "no vibes" rule. Severity uses HIGH (likely + impactful) / MED (one of those) / LOW (latent or mitigated). Risk IDs are stable — `sprint-backlog.md` and `acceptance-gates.md` reference them by ID.
|
|
|
|
---
|
|
|
|
## HIGH severity
|
|
|
|
### R-001 — `queryd POST /sql` accepts arbitrary SQL; localhost binding is sole guardrail
|
|
|
|
- **Where:** `cmd/queryd/main.go:142` registers `r.Post("/sql", h.handleSQL)`. `cmd/queryd/main.go:181` passes `req.SQL` directly to `db.QueryContext`. No allowlist, no statement-type check, no rate limit.
|
|
- **Why this is HIGH:** DuckDB is not a sandbox. `COPY ... TO '/tmp/x'` writes the host filesystem. `read_csv('s3://...')` reads any S3 object the configured creds can reach. `read_text('/etc/passwd')` reads local files. Anything that can reach `:3214` can exfil anything queryd's process can read.
|
|
- **Today's mitigation:** every binary binds `127.0.0.1` by default (`internal/shared/config.go:132-160`). Network-layer is the only auth layer.
|
|
- **What breaks the mitigation:** any future deploy that binds non-loopback (Docker port-publish, K8s pod IP, accidental `0.0.0.0`) opens RCE-equivalent access. There is no second line of defense.
|
|
- **Recommended fix:** Sprint 1 — decide the auth posture (Bearer token, mTLS, IP allow-list) and add middleware. Document the design risk in `docs/SECURITY.md`. Until middleware lands: assert in `cmd/queryd/main.go` startup that bind starts with `127.` and `os.Exit(1)` otherwise — fail-loud rather than silent expose.
|
|
|
|
### R-002 — `internal/shared` (server factory + config) has zero tests
|
|
|
|
- **Where:** `internal/shared/server.go` (server.go: 0 tests, src=2 — `server.go` + `config.go`). Confirmed by `ls internal/shared/*_test.go` returning empty.
|
|
- **Why HIGH:** `server.go` contains the shared chi factory + race-free `net.Listen()` + graceful shutdown that every binary depends on. `config.go` contains the TOML loader that every binary calls in `main()`. A regression here breaks all 7 binaries silently — and the only thing that catches it today is the 9-smoke chain at the integration layer.
|
|
- **Recommended fix:** Sprint 0 — add `internal/shared/server_test.go` (table-test bind-error surfacing, graceful-shutdown ordering, /health response shape) and `config_test.go` (TOML round-trip, missing-file warn behavior, default values).
|
|
|
|
### R-003 — `internal/storeclient` has zero tests
|
|
|
|
- **Where:** `internal/storeclient/client.go` (src=1, test=0). Used by `catalogd` (`store_client.go` originally; extracted to shared package per memory `4205ecd`) and `vectord` (G1P persistence). Two services depend on it directly.
|
|
- **Why HIGH:** This client owns the keep-alive pool, body-drain semantics, and the retry/timeout policy for storaged calls. The ADR-020 idempotency contract on catalogd partially relies on this client's error semantics. Untested + load-bearing = silent correctness risk.
|
|
- **Recommended fix:** Sprint 0 — add `client_test.go` covering the keep-alive drain path (the comment in `internal/catalogclient/client.go` cites this as a known footgun), 4xx vs 5xx classification, body-cap enforcement on response.
|
|
|
|
---
|
|
|
|
## MEDIUM severity
|
|
|
|
### R-004 — Smokes are documentation, not a CI gate
|
|
|
|
- **Where:** `README.md:60` shows `for s in scripts/{...}_smoke.sh; do ...; done` as the run instruction. No `justfile`, no `Makefile`, no `.github/workflows/`, no `.git/hooks/pre-push`. Confirmed by `ls justfile Makefile .github` — all "No such file."
|
|
- **Why MED:** the smokes are *deterministic and fast* (33s wall for the full chain — `_evidence/smoke_chain.log`). The discipline of running them is purely human at the moment. A future commit that breaks `d4` will pass review unless the reviewer happens to run the chain.
|
|
- **Recommended fix:** Sprint 0 — `justfile` with `verify` (full chain) + `smoke <day>` (single) + `doctor` (deps probe) + `fmt`/`vet`/`test` shortcuts. Pre-push hook calls `just verify` and aborts on non-zero exit.
|
|
|
|
### R-005 — 6 of 7 `cmd/<bin>/main.go` files are untested
|
|
|
|
- **Where:** only `cmd/storaged/main_test.go` exists. The other six binaries' wiring layers (route registration, handler chaining, error-mapping middleware, request-body decoding) are integration-tested only via shell smokes.
|
|
- **Why MED:** wiring bugs don't show up in `go test` and don't show up in `go vet`. They show up at smoke time, which is a slower feedback loop than per-package unit tests would give. `cmd/queryd/main.go:142` is the highest-priority candidate for cmd-level tests because the `handleSQL` body-decode + cap path is the entry point for R-001 and runs without unit-test coverage today.
|
|
- **Recommended fix:** Sprint 0 — pattern-match `cmd/storaged/main_test.go`'s shape across the other 6 binaries. Test scope per binary: routes registered, body-cap rejection (request entity too large), schema-validation rejection (400 on bad JSON), happy-path with mocked dependency.
|
|
|
|
### R-006 — Smokes hit real MinIO + Ollama; no fixture-only path
|
|
|
|
- **Where:** `g2_smoke.sh:14` requires Ollama at `:11434` with `nomic-embed-text` loaded. `d2_smoke.sh` requires MinIO at `:9000` with bucket `lakehouse-go-primary`. Confirmed in `README.md:67-71` ("Cold-start dependencies").
|
|
- **Why MED:** any CI runner without these services cannot run the smoke chain. Fresh-clone reviewers cannot run it. Any downtime or version drift in MinIO / Ollama produces flaky CI.
|
|
- **Recommended fix:** Sprint 0 — define `embed.Provider` and `storage.Bucket` mock implementations behind the existing interfaces (`internal/embed/embed.go:20`, `internal/storaged/bucket.go`). Add `just smoke-fixtures` that points the binaries at the fakes via env vars. Real-MinIO / real-Ollama smokes become the "hardware-in-the-loop" tier.
|
|
|
|
### R-007 — Zero auth middleware on 22 public routes
|
|
|
|
- **Where:** `grep -rn 'Authorization\|Bearer'` returns zero matches outside test files. Routes inventoried: vectord (6), storaged (4), catalogd (3), queryd (1), ingestd (1), embedd (1), gateway (proxies all upstream), plus `/health` on every binary.
|
|
- **Why MED:** localhost-only binding is the sole guardrail (R-001 covers the worst case). Non-localhost deploy = open admin panel. The header design ("Authorization: Bearer ..." vs "X-API-Key" vs mTLS cert subject) needs to be decided once and then applied uniformly across all 22 routes — retrofit is more painful per-route than upfront.
|
|
- **Recommended fix:** Sprint 1 — write ADR-003 picking the auth model. Most likely choice: Bearer token + IP allow-list, with token loaded from `secrets-go.toml`. Add `internal/shared/auth.go` middleware so adding it to a new binary is one chi `r.Use()` line.
|
|
|
|
### R-008 — `internal/queryd/db.go` (DuckDB connector + `CREATE SECRET` site) untested
|
|
|
|
- **Where:** `internal/queryd/db.go` is referenced via `func (h *handlers) handleSQL` and contains `sqlEscape` (line 122), `redactCreds` (line 132), and the `CREATE SECRET ... '%s'` formation (line 102). `internal/queryd/registrar_test.go` exists, but no `db_test.go`.
|
|
- **Why MED:** `sqlEscape` correctness is one bug from a credential-leak via SQL error chain. `redactCreds` correctness is the *only* layer between a bad SECRET creation and S3 keys ending up in slog output. Both deserve unit tests with adversarial inputs (single-quote in key, embedded SECRET token, etc.).
|
|
- **Recommended fix:** Sprint 0 — add `db_test.go` with: `sqlEscape` round-trip on adversarial strings; `redactCreds` exhaustive case for empty / partial / multiple-occurrence credential values; `bootstrapStatements` order assertion (INSTALL → LOAD → CREATE SECRET).
|
|
|
|
---
|
|
|
|
## LOW severity
|
|
|
|
### R-009 — `registrar.go:153` uses `fmt.Sprintf` for view DDL
|
|
|
|
- **Where:** `internal/queryd/registrar.go:153` — `sql := fmt.Sprintf("CREATE OR REPLACE VIEW %s AS SELECT * FROM %s", quoteIdent(m.Name), fromExpr)`.
|
|
- **Why LOW:** `m.Name` comes from catalogd's manifest (server-controlled), is wrapped with `quoteIdent` (line 172, doubles `"`). `fromExpr` is built from S3 URLs which are themselves wrapped with `'` and escaped via `sqlEscape` (line 145, doubles `'`). DuckDB doesn't accept `?` placeholders for DDL, so `fmt.Sprintf` is unavoidable here. Inputs are not user-controlled at the SQL boundary; they came from a registration API call but the dataset name was already vetted by catalogd.
|
|
- **Recommended fix:** none — currently correct. Note as a "design risk to remember" if catalogd ever loosens validation on dataset names. Add a regression test that asserts a manifest with `name: 'foo"; DROP TABLE x; --'` produces a quoted-but-non-executing view name.
|
|
|
|
### R-010 — No CORS posture on any binding
|
|
|
|
- **Where:** `grep -rni 'Access-Control'` returns zero hits in source. Confirmed.
|
|
- **Why LOW:** all binaries bind 127.0.0.1; no browser is making cross-origin requests today; the future HTMX UI will be same-origin via gateway.
|
|
- **Recommended fix:** none until a non-localhost binding is needed. When it is needed (Sprint 4 or later), the decision belongs in the same ADR as auth posture (R-007) — same blast radius, same review.
|
|
|
|
### R-011 — `g2_smoke.sh:79` exact-match on `nomic-embed-text` model name
|
|
|
|
- **Where:** `scripts/g2_smoke.sh:79` — `[ "$MODEL" = "nomic-embed-text" ]`.
|
|
- **Why LOW:** if the operator swaps to `nomic-embed-text-v2-moe` (which is also loaded on this box), the smoke fails *loudly* — the dimension and recall would still likely pass; only the literal model-name assertion fails. That's the right failure mode (not silent acceptance), so this is more of an annotation than a finding.
|
|
- **Recommended fix:** none — keep the assertion strict. If the swap is intentional, the operator updates the smoke alongside the swap. That's the discipline.
|
|
|
|
### R-012 — `tests/` directory exists but is empty
|
|
|
|
- **Where:** `ls tests/` returns only `.` and `..`. Listed in `README.md:90` ("Layout") but uncited in any code path.
|
|
- **Why LOW:** dead directory, harmless, but suggests an older plan (Rust-style integration test convention) that didn't carry over.
|
|
- **Recommended fix:** either remove the directory or claim it for the fixture-mode smoke story (R-006). Pick one in Sprint 0.
|
|
|
|
---
|
|
|
|
## Risk-to-sprint mapping
|
|
|
|
| Risk | Severity | Sprint |
|
|
|---|---|---|
|
|
| R-001 queryd /sql RCE-eq via DuckDB | HIGH | 1 |
|
|
| R-002 internal/shared untested | HIGH | 0 |
|
|
| R-003 internal/storeclient untested | HIGH | 0 |
|
|
| R-004 smokes not gated | MED | 0 |
|
|
| R-005 6/7 cmd/main.go untested | MED | 0 |
|
|
| R-006 no fixture-only smokes | MED | 0 |
|
|
| R-007 zero auth on 22 routes | MED | 1 |
|
|
| R-008 queryd/db.go untested | MED | 0 |
|
|
| R-009 registrar.go fmt.Sprintf | LOW | — (note only) |
|
|
| R-010 no CORS posture | LOW | 1 (with R-007) |
|
|
| R-011 g2 smoke model assertion | LOW | — (correct as-is) |
|
|
| R-012 empty tests/ dir | LOW | 0 |
|
|
|
|
Sprint 0 owns the test-coverage and CI-gate work (R-002, R-003, R-004, R-005, R-006, R-008, R-012). Sprint 1 owns the trust-boundary decisions (R-001, R-007, R-010). Sprint 2-4 are design-bar work for unbuilt components.
|