Adapts docs/SCRUM.md framework (originally written for the matrix-agent-validated repo) to the Go rewrite.

Five deliverables:

- golang-lakehouse-scrum-test.md: top-line + scoring + verdict
- risk-register.md: 12 findings, R-001..R-012
- claim-coverage-table.md: claim/test/risk for Sprint 2
- sprint-backlog.md: 5 sprints, ~2 weeks of work
- acceptance-gates.md: DoD as runnable commands

Every claim cites file:line, command output, or "missing evidence." Smoke chain ran clean (33s wall, all 9 PASS) and is captured in reports/scrum/_evidence/smoke_chain.log (gitignored; runtime artifact).

Scoring:

- Reproducibility 7/10: 9 smokes deterministic, no just/CI gate
- Test Coverage 6/10: internal/ packages tested, 6/7 cmd/ aren't
- Trust Boundary 7/10: escapes ok, zero auth, /sql is RCE-eq off-loopback
- Memory Correctness 3/10: pathway/playbook/observer not yet ported
- Deployment Readiness 4/10: no REPLICATION, no env template, no systemd
- Maintainability 8/10: no god-files, 7 lean binaries, ADRs current

Top risks:

- R-001 HIGH: queryd /sql + DuckDB + non-loopback bind = RCE-equivalent
- R-002 HIGH: internal/shared (server.go + config.go) zero tests
- R-003 HIGH: internal/storeclient zero tests, used by 2 services
- R-004 MED: 9-smoke chain green but not gated (no justfile/hook)

The audit is the work; refactors come after. Sprint 0 owns coverage + CI gating; Sprint 1 owns trust-boundary decisions; Sprints 2-3 are mostly design-bar work for unbuilt agent components.

.gitignore exception: /reports/* + !/reports/scrum/ keeps reports/ a runtime-artifact directory while exposing reports/scrum/ as tracked documentation. Mirrors the pattern future audit passes will land in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# golangLAKEHOUSE — Scrum Hardening Audit
**Audit date:** 2026-04-29
**Auditor:** Claude (Opus 4.7, 1M context)
**Repo state:** `main @ 1f700e7` — clean working tree, 6,587 LoC of Go across 7 binaries + 11 internal packages
**Methodology:** Adapted from `docs/SCRUM.md` (originally written for `matrix-agent-validated`).
**Sibling reports:** `risk-register.md` · `claim-coverage-table.md` · `sprint-backlog.md` · `acceptance-gates.md`

---
## Verdict (one paragraph)
The Go rewrite is structurally clean and substantially more disciplined than the Rust system this audit's framework was originally designed against. The five concerns from the upstream verdict are mostly non-issues here: no raw SQL is interpolated from request bodies (there is one server-side `fmt.Sprintf` site, properly escaped, at `internal/queryd/registrar.go:153`), although queryd's `POST /sql` still accepts arbitrary SQL by design (R-001); no hardcoded `/home/profit` (`grep` returns zero `*.go` matches); the 7-binary split forecloses any 2,520-line god-file; smokes are deterministic and pass in 33 seconds wall-time end-to-end. **The real gaps are different ones:** no `just verify` / Makefile / CI-gate wiring (smokes are documentation-only), no fixture-only test path (every smoke hits real MinIO + Ollama), 6 of 7 `cmd/<bin>/main.go` files are untested, two load-bearing internal packages (`internal/shared`, `internal/storeclient`) have zero tests, and the Mem0 / pathway / playbook / observer surfaces from the upstream system are simply **not yet ported**, so Sprints 2-3 are design-bar work, not bug-hunt work. **Top single fix:** wire the 9-smoke chain into a `just verify` recipe and a pre-push hook before any new feature lands. It is the cheapest, highest-leverage hardening move available.

---
## Scoring
Each dimension is rated 0-10, with evidence cited. Evidence files live in `reports/scrum/_evidence/`.
| Dimension | Score | Evidence |
|---|---|---|
| **Reproducibility** | **7 / 10** | All 9 smokes pass clean in 33s wall (`_evidence/smoke_chain.log`); `go vet ./...` exit=0; `go test -short ./...` exit=0; `README.md` lists deps. **−3** for: no `just verify`, no Makefile, no `.github/workflows`, no `just doctor`, no fixture-only smoke path (every smoke hits real MinIO + Ollama). |
| **Test Coverage** | **6 / 10** | 13 `*_test.go` files, ~77 test functions, every `internal/` impl package has at least one test, vectord has 18 test funcs across index + persistor. **−4** for: 6 of 7 `cmd/<bin>/main.go` untested (only `cmd/storaged/main_test.go` exists); `internal/shared` and `internal/storeclient` have zero tests; `internal/queryd/db.go` (DuckDB connector + `sqlEscape` + `CREATE SECRET` site) untested; integration coverage lives in shell smokes, not Go tests. |
| **Trust Boundary Safety** | **7 / 10** | One `fmt.Sprintf` SQL site (`internal/queryd/registrar.go:153`) properly uses `quoteIdent` (line 172, doubles `"`) + `sqlEscape` (`internal/queryd/db.go:122`, doubles `'`); zero `os/exec` invocations (`grep` clean); zero hardcoded `/home/profit` paths in `*.go`; every public POST capped via `MaxBytesReader` (`cmd/{catalogd:87,queryd:165,ingestd:110,vectord:334,embedd:71,storaged:215}`); `redactCreds` (`internal/queryd/db.go:132`) scrubs S3 keys from error chain. **−3** for: zero auth middleware on any of the 22 routes, queryd `POST /sql` accepts arbitrary SQL by design (R-001), no CORS posture (no Access-Control headers anywhere), localhost-binding is the sole guardrail. |
| **Agent Memory Correctness** | **3 / 10** (design-bar; not built) | Vectord HNSW exists with 13 index tests + 5 persistor tests; round-trip verified by `g1p_smoke.sh` (kill+restart preserves state, post-restart search returns dist=0). **−7** because: no Mem0-style ADD/UPDATE/REVISE/RETIRE/HISTORY semantics (vectord is an unversioned HNSW index, not a versioned memory); no pathway memory; no playbook memory; no observer; no cycle-safety; no retired-trace exclusion test (concept doesn't exist yet). Score reflects "not yet ported"; the design bars belong in Sprint 2. |
| **Deployment Readiness** | **4 / 10** | `lakehouse.toml` present with sane defaults; `secrets-go.toml` path is flag-overridable (`cmd/storaged/main.go:35`); 9 smokes self-bootstrap services with trap-cleanup. **−6** for: no `REPLICATION.md`, no `.env.example`, no `*.service` systemd units in repo, no `Dockerfile`, no `just doctor` to surface missing deps, no `--version` flag on binaries, no readiness-check separate from `/health` liveness. |
| **Maintainability** | **8 / 10** | Every binary 111-354 LoC (no god-files); `docs/{PRD,SPEC,DECISIONS,PHASE_G0_KICKOFF,RESEARCH_LOG}.md` document direction + ratified ADRs; ADR-020 idempotency contract is enforced by smoke (`d3_smoke.sh`: rehydrate-across-restart preserves dataset_id); `docs/PHASE_G0_KICKOFF.md` is the day-by-day record + scrum disposition. **−2** for: no `CONTRIBUTING.md`, no per-handler godoc convention enforced, two load-bearing packages without tests means refactor risk is concentrated. |

**Composite: 35 / 60 (strong G0/G1/G2 substrate, weak operational scaffolding, large design-bar surface for unbuilt agent components).**
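The escaping pair credited in the Trust Boundary Safety row can be sketched as follows. This is a minimal illustration, not the code at `internal/queryd/registrar.go:153`: the signatures of `quoteIdent` and `sqlEscape`, the `buildViewSQL` helper, and the `CREATE OR REPLACE VIEW ... read_parquet` statement shape are all assumptions. The only behavior taken from the audit is that one helper doubles `"` and the other doubles `'`.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent wraps a SQL identifier in double quotes and doubles any embedded
// double quote. Assumption: the audit only states that it "doubles \"".
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// sqlEscape doubles single quotes so a value can sit inside a '...' literal.
// Assumption: the caller supplies the surrounding quotes.
func sqlEscape(value string) string {
	return strings.ReplaceAll(value, `'`, `''`)
}

// buildViewSQL is a hypothetical stand-in for the single fmt.Sprintf site.
func buildViewSQL(view, path string) string {
	return fmt.Sprintf("CREATE OR REPLACE VIEW %s AS SELECT * FROM read_parquet('%s')",
		quoteIdent(view), sqlEscape(path))
}

func main() {
	// Both inputs contain the quote character that would otherwise end the token early.
	fmt.Println(buildViewSQL(`sales"2024`, "s3://bucket/o'brien.parquet"))
}
```

Doubling the delimiter keeps hostile input inside the quoted token, which is why a site like this can be called "properly escaped" even though it builds SQL with string formatting.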
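The `MaxBytesReader` cap noted in the Trust Boundary Safety row follows a standard `net/http` pattern. The sketch below assumes a 1 KiB limit, a hypothetical `/ingest` route, and helper names (`capped`, `echoLen`, `postStatus`) invented for illustration; the real per-service limits are not stated in this report.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// maxBody is an illustrative limit, not a value taken from the audited services.
const maxBody = 1 << 10 // 1 KiB

// capped wraps a handler so its request body cannot exceed maxBody bytes.
func capped(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, maxBody)
		next(w, r)
	}
}

// echoLen reads the whole body and reports its size, or answers 413 when the cap trips.
func echoLen(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	var tooBig *http.MaxBytesError // available since Go 1.19
	if errors.As(err, &tooBig) {
		http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
		return
	}
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	fmt.Fprintf(w, "%d bytes\n", len(body))
}

// postStatus sends an n-byte POST through the capped handler and returns the status code.
func postStatus(n int) int {
	req := httptest.NewRequest(http.MethodPost, "/ingest", strings.NewReader(strings.Repeat("x", n)))
	rec := httptest.NewRecorder()
	capped(echoLen)(rec, req)
	return rec.Code
}

func main() {
	// A 512-byte body fits under the cap; a 4096-byte body does not.
	fmt.Println(postStatus(512), postStatus(4096))
}
```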
---
## Methodology
This audit follows SCRUM.md's "no vibes" rule. Every claim above and in the sibling reports is backed by one of:
1. **Verbatim command output**: the Go equivalents of the upstream `cargo` checks (`go vet`, `go test`, `go build`), all 9 smokes, and full-chain wall times. Captured in `_evidence/smoke_chain.log`.
2. **`grep`-able file:line citations**: every code claim points at a specific line; readers can verify with `git show <sha>:<path>` or `sed -n '<line>p' <path>`.
3. **Absence as evidence**: `ls justfile` fails, `find . -name "*.service"` returns nothing, and `grep -rn "/home/profit" --include="*.go"` returns nothing. These are recorded as cited absences, not implied.

What was NOT inspected (out of scope this round):
- Performance characteristics under load (the 500K staffing test is captured in `docs/PHASE_G0_KICKOFF.md` and the head commit message — not re-run here).
- Cross-binary failure cascades (a deliberate Sprint 1 follow-up — kill storaged mid-PUT and inspect catalogd state, etc.).
- Supply-chain audit of the 9 direct + ~70 transitive dependencies in `go.sum`.
---
## Top recommendations (ordered by leverage / cost)
1. **`justfile` + pre-push hook** wrapping the 9-smoke chain. ~30 min. Closes the biggest Sprint 0 gap and ratchets every future PR.
2. **Tests for `internal/shared` and `internal/storeclient`.** ~1 hr. Two packages, every binary depends on them, zero coverage today. Highest "silent break" risk per line of code.
3. **ADR-002: observer fail-safe semantics.** Doc only, ~30 min. Locks in `degraded` / `cycle` default before observer is ported, so the upstream `verdict:"accept"` anti-pattern can't recur.
4. **Auth posture decision** for non-localhost binding. Doc only, ~30 min. Today's posture (127.0.0.1 + zero auth) is fine for G0; deciding token-vs-mTLS-vs-IP-allowlist now means it's not retrofitted under fire.
5. **Fixture-mode smokes** (`MockS3Storage` + `MockEmbedProvider` interfaces). ~3 hr. Decouples CI from MinIO + Ollama, makes the chain run in any CI box.
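Recommendation 3 is doc-only, but the fail-safe idea it wants ADR-002 to lock in can be illustrated in a few lines. The sketch below is hypothetical: the observer is not ported, and the `Verdict` type, its constants, and `observe` are invented here. It shows one way to make `degraded` the zero value so that a missing or failed decision cannot surface as `accept`. What `cycle` should mean is left to the ADR; it appears only as a named value.

```go
package main

import (
	"errors"
	"fmt"
)

// Verdict is a hypothetical observer outcome. Its zero value is Degraded, so a
// forgotten assignment or an early return never reads as Accept.
type Verdict int

const (
	Degraded Verdict = iota // fail-safe default
	Cycle                   // semantics left to ADR-002; listed for completeness
	Accept
)

// String maps unknown values to "degraded" as well, keeping the failure mode safe.
func (v Verdict) String() string {
	switch v {
	case Accept:
		return "accept"
	case Cycle:
		return "cycle"
	default:
		return "degraded"
	}
}

// observe runs a check and maps any error or negative result to the fail-safe verdict.
func observe(check func() (bool, error)) Verdict {
	ok, err := check()
	if err != nil || !ok {
		return Degraded
	}
	return Accept
}

func main() {
	var unset Verdict
	fmt.Println(unset) // zero value
	fmt.Println(observe(func() (bool, error) { return false, errors.New("observer unreachable") }))
	fmt.Println(observe(func() (bool, error) { return true, nil }))
}
```

The design choice worth recording in the ADR is that `accept` has to be earned explicitly; every other path, including an unreachable observer, falls through to the safe default.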
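For recommendation 4, a startup check along these lines could make today's localhost-only posture explicit while the auth decision is pending. It is a sketch under assumptions: `isLoopbackBind` is a hypothetical helper, port 8080 is illustrative, and nothing here is taken from the queryd source. It is a guard on the bind address, not an auth mechanism.

```go
package main

import (
	"fmt"
	"net"
)

// isLoopbackBind reports whether a listen address such as "127.0.0.1:8080"
// is reachable only from the local machine.
func isLoopbackBind(addr string) (bool, error) {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false, fmt.Errorf("parse bind address %q: %w", addr, err)
	}
	if host == "localhost" {
		return true, nil
	}
	ip := net.ParseIP(host)
	// An empty host (":8080") or 0.0.0.0 listens on every interface, so it is not loopback.
	return ip != nil && ip.IsLoopback(), nil
}

func main() {
	for _, addr := range []string{"127.0.0.1:8080", "[::1]:8080", ":8080", "0.0.0.0:8080"} {
		ok, err := isLoopbackBind(addr)
		fmt.Println(addr, ok, err)
	}
}
```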
Risk register (`risk-register.md`) carries the full prioritized list. Sprint backlog (`sprint-backlog.md`) groups them into shipping units with acceptance criteria.
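Recommendation 5 names `MockS3Storage` and `MockEmbedProvider`, but the interfaces they would satisfy are not specified in this audit. The sketch below assumes hypothetical `ObjectStore` and `EmbedProvider` seams (the real `internal/storeclient` API may differ) to show the shape of a fixture that needs neither MinIO nor Ollama.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// ObjectStore and EmbedProvider are hypothetical seams, not the repo's real interfaces.
type ObjectStore interface {
	Put(ctx context.Context, key string, data []byte) error
	Get(ctx context.Context, key string) ([]byte, error)
}

type EmbedProvider interface {
	Embed(ctx context.Context, text string) ([]float32, error)
}

var ErrNotFound = errors.New("object not found")

// MockS3Storage keeps objects in memory so tests need no MinIO.
type MockS3Storage struct {
	mu      sync.Mutex
	objects map[string][]byte
}

func NewMockS3Storage() *MockS3Storage {
	return &MockS3Storage{objects: make(map[string][]byte)}
}

func (m *MockS3Storage) Put(_ context.Context, key string, data []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.objects[key] = append([]byte(nil), data...) // copy so callers cannot mutate stored bytes
	return nil
}

func (m *MockS3Storage) Get(_ context.Context, key string) ([]byte, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	data, ok := m.objects[key]
	if !ok {
		return nil, ErrNotFound
	}
	return append([]byte(nil), data...), nil
}

// MockEmbedProvider returns a deterministic vector so tests need no Ollama.
type MockEmbedProvider struct{ Dim int }

func (m MockEmbedProvider) Embed(_ context.Context, text string) ([]float32, error) {
	if m.Dim <= 0 {
		return nil, errors.New("embedding dimension must be positive")
	}
	vec := make([]float32, m.Dim)
	for i := 0; i < len(text); i++ {
		vec[i%m.Dim] += float32(text[i]) / 255 // fold bytes into fixed-size buckets
	}
	return vec, nil
}

// Compile-time checks that the mocks satisfy the assumed interfaces.
var (
	_ ObjectStore   = (*MockS3Storage)(nil)
	_ EmbedProvider = MockEmbedProvider{}
)

// roundTrip writes then reads one object through the interface, as a smoke step would.
func roundTrip(store ObjectStore, key, value string) (string, error) {
	ctx := context.Background()
	if err := store.Put(ctx, key, []byte(value)); err != nil {
		return "", err
	}
	data, err := store.Get(ctx, key)
	return string(data), err
}

func main() {
	got, err := roundTrip(NewMockS3Storage(), "datasets/a.parquet", "rows")
	fmt.Println(got, err)

	vec, err := MockEmbedProvider{Dim: 4}.Embed(context.Background(), "hello")
	fmt.Println(len(vec), err)
}
```

Because the mocks sit behind interfaces, the same calling code could later be pointed at the real MinIO-backed and Ollama-backed implementations without changing the callers.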
---
## What this audit does NOT recommend
- **Do not refactor the 7-binary split.** It already addresses the upstream "2,520-line mcp-server.ts" lesson structurally; touching it now is churn.
- **Do not introduce auth before deciding the deployment model.** Adding bearer-token middleware preemptively will get rewritten when mTLS or IP-allowlist wins.
- **Do not "rebuild pathway memory in Go" to score Sprint 2 higher.** That's a real engineering project, not a Sprint-scoped fix; the 3/10 reflects honest current state and the design bars in Sprint 2 backlog stories are the right shape.
- **Do not rewrite the 9 smokes as Go integration tests yet.** Bash + curl is currently the right tool — small, transparent, easy to debug. Migrate only when fixture-mode is in place and you're paying observably for the bash dependency.