root 91edd43164 scrum audit: 5 reports under reports/scrum/ · score 35/60

Adapts docs/SCRUM.md framework (originally written for the
matrix-agent-validated repo) to the Go rewrite. Five deliverables:

  golang-lakehouse-scrum-test.md  top-line + scoring + verdict
  risk-register.md                12 findings, R-001..R-012
  claim-coverage-table.md         claim/test/risk for Sprint 2
  sprint-backlog.md               5 sprints, ~2 weeks of work
  acceptance-gates.md             DoD as runnable commands

Every claim cites file:line, command output, or "missing evidence."
Smoke chain ran clean (33s wall, all 9 PASS) and is captured in
reports/scrum/_evidence/smoke_chain.log (gitignored — runtime artifact).

Scoring:
  Reproducibility       7/10  9 smokes deterministic, no just/CI gate
  Test Coverage         6/10  internal/ packages tested, 6/7 cmd/ aren't
  Trust Boundary        7/10  escapes ok, zero auth, /sql is RCE-eq off-loopback
  Memory Correctness    3/10  pathway/playbook/observer not yet ported
  Deployment Readiness  4/10  no REPLICATION, no env template, no systemd
  Maintainability       8/10  no god-files, 7 lean binaries, ADRs current

Top risks:
  R-001 HIGH  queryd /sql + DuckDB + non-loopback bind = RCE-equivalent
  R-002 HIGH  internal/shared (server.go + config.go) zero tests
  R-003 HIGH  internal/storeclient zero tests, used by 2 services
  R-004 MED   9-smoke chain green but not gated (no justfile/hook)

The audit is the work; refactors come after. Sprint 0 owns coverage
+ CI gating; Sprint 1 owns trust-boundary decisions; Sprints 2-3 are
mostly design-bar work for unbuilt agent components.

.gitignore exception: /reports/* + !/reports/scrum/ keeps reports/
a runtime-artifact directory while exposing reports/scrum/ as
tracked documentation. Mirrors the pattern future audit passes will
land in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 04:51:47 -05:00


golangLAKEHOUSE — Scrum Hardening Audit

Audit date: 2026-04-29
Auditor: Claude (Opus 4.7, 1M context)
Repo state: main @ 1f700e7 (clean working tree), 6,587 LoC of Go across 7 binaries + 11 internal packages
Methodology: Adapted from docs/SCRUM.md (originally written for matrix-agent-validated).
Sibling reports: risk-register.md · claim-coverage-table.md · sprint-backlog.md · acceptance-gates.md

Verdict (one paragraph)

The Go rewrite is structurally clean and substantially more disciplined than the Rust system this audit's framework was originally designed against. The five concerns from the upstream verdict are mostly non-issues here: no raw SQL from request bodies (one server-side fmt.Sprintf site, properly escaped — internal/queryd/registrar.go:153); no hardcoded /home/profit (grep returns zero *.go matches); the 7-binary split forecloses any 2,520-line god-file; smokes are deterministic and pass in 33 seconds wall-time end-to-end. The real gaps are different ones: no just verify / Makefile / CI-gate wiring (smokes are documentation-only), no fixture-only test path (every smoke hits real MinIO + Ollama), 6 of 7 cmd/<bin>/main.go files are untested, two load-bearing internal packages (internal/shared, internal/storeclient) have zero tests, and the Mem0 / pathway / playbook / observer surfaces from the upstream system are simply not yet ported — meaning Sprints 2-3 are design-bar work, not bug-hunt work. Top single fix: wire the 9-smoke chain into a just verify and pre-push hook before any new feature lands. Cheapest, highest-leverage hardening move available.
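As a rough illustration of what "properly escaped" means for that single fmt.Sprintf site, here is a minimal sketch of the two helpers the Trust Boundary row names. Only the doubling behaviour comes from this report (`quoteIdent` doubles `"`, `sqlEscape` doubles `'`). The surrounding quotes added by `quoteIdent`, the `buildViewSQL` helper, and the statement shape are assumptions for illustration, not the code at internal/queryd/registrar.go:153.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent doubles embedded double quotes so a name cannot close the
// identifier early. Wrapping the result in quotes is an assumption here.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// sqlEscape doubles single quotes for use inside a '...' string literal.
func sqlEscape(s string) string {
	return strings.ReplaceAll(s, "'", "''")
}

// buildViewSQL is a hypothetical statement builder; the real statement
// shape in the repo is not shown in this report.
func buildViewSQL(view, path string) string {
	return fmt.Sprintf(
		"CREATE OR REPLACE VIEW %s AS SELECT * FROM read_parquet('%s')",
		quoteIdent(view), sqlEscape(path),
	)
}

func main() {
	// Both inputs contain the quote character that would otherwise
	// terminate their respective SQL tokens.
	fmt.Println(buildViewSQL(`sales"2026`, "s3://bucket/o'brien.parquet"))
}
```

Escaping of this kind only protects server-built statements. It does nothing for `POST /sql`, which accepts arbitrary SQL by design (R-001).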

Scoring

Each dimension rated 0-10 with evidence cited. Evidence files live in reports/scrum/_evidence/.

Dimension	Score	Evidence
Reproducibility	7 / 10	All 9 smokes pass clean in 33s wall (`_evidence/smoke_chain.log`); `go vet ./...` exit=0; `go test -short ./...` exit=0; `README.md` lists deps. −3 for: no `just verify`, no Makefile, no `.github/workflows`, no `just doctor`, no fixture-only smoke path (every smoke hits real MinIO + Ollama).
Test Coverage	6 / 10	13 `*_test.go` files, ~77 test functions, every `internal/` impl package has at least one test, vectord has 18 test funcs across index + persistor. −4 for: 6 of 7 `cmd/<bin>/main.go` untested (only `cmd/storaged/main_test.go` exists); `internal/shared` and `internal/storeclient` have zero tests; `internal/queryd/db.go` (DuckDB connector + `sqlEscape` + `CREATE SECRET` site) untested; integration coverage lives in shell smokes, not Go tests.
Trust Boundary Safety	7 / 10	One `fmt.Sprintf` SQL site (`internal/queryd/registrar.go:153`) properly uses `quoteIdent` (line 172, doubles `"`) + `sqlEscape` (`internal/queryd/db.go:122`, doubles `'`); zero `os/exec` invocations (`grep` clean); zero hardcoded `/home/profit` paths in `*.go`; every public POST capped via `MaxBytesReader` (`cmd/{catalogd:87,queryd:165,ingestd:110,vectord:334,embedd:71,storaged:215}`); `redactCreds` (`internal/queryd/db.go:132`) scrubs S3 keys from error chain. −3 for: zero auth middleware on any of the 22 routes, queryd `POST /sql` accepts arbitrary SQL by design (R-001), no CORS posture (no Access-Control headers anywhere), localhost-binding is the sole guardrail.
Agent Memory Correctness	3 / 10 (design-bar; not built)	Vectord HNSW exists with 13 index tests + 5 persistor tests; round-trip verified by `g1p_smoke.sh` (kill+restart preserves state, post-restart search returns dist=0). −7 because: no Mem0-style ADD/UPDATE/REVISE/RETIRE/HISTORY semantics — vectord is an unversioned HNSW index, not a versioned memory; no pathway memory; no playbook memory; no observer; no cycle-safety; no retired-trace exclusion test (concept doesn't exist yet). Score reflects "not yet ported" — the design bars belong in Sprint 2.
Deployment Readiness	4 / 10	`lakehouse.toml` present with sane defaults; `secrets-go.toml` path is flag-overridable (`cmd/storaged/main.go:35`); 9 smokes self-bootstrap services with trap-cleanup. −6 for: no `REPLICATION.md`, no `.env.example`, no `*.service` systemd units in repo, no `Dockerfile`, no `just doctor` to surface missing deps, no `--version` flag on binaries, no readiness-check separate from `/health` liveness.
Maintainability	8 / 10	Every binary 111-354 LoC (no god-files); `docs/{PRD,SPEC,DECISIONS,PHASE_G0_KICKOFF,RESEARCH_LOG}.md` document direction + ratified ADRs; ADR-020 idempotency contract is enforced by smoke (`d3_smoke.sh` — rehydrate-across-restart preserves dataset_id); `docs/PHASE_G0_KICKOFF.md` is the day-by-day record + scrum disposition. −2 for: no `CONTRIBUTING.md`, no per-handler godoc convention enforced, two load-bearing packages without tests means refactor risk is concentrated.

Composite: 35 / 60 — strong G0/G1/G2 substrate, weak operational scaffolding, large design-bar surface for unbuilt agent components.

Methodology

Followed SCRUM.md's "no vibes" rule. Every claim above and in sibling reports is backed by:

Verbatim command output — cargo equivalents (go vet, go test, go build), all 9 smokes, full chain wall-times. Captured in _evidence/smoke_chain.log.
grep-able file:line citations — every code claim points at a specific line; readers can verify by git show <sha>:<path> or sed -n '<line>p' <path>.
Absence as evidence — ls justfile failure, find . -name "*.service" empty, grep -rn "/home/profit" --include="*.go" empty. Recorded as cited absences, not implied.

What was NOT inspected (out of scope this round):

Performance characteristics under load (the 500K staffing test is captured in docs/PHASE_G0_KICKOFF.md and the head commit message — not re-run here).
Cross-binary failure cascades (a deliberate Sprint 1 follow-up — kill storaged mid-PUT and inspect catalogd state, etc.).
Supply-chain audit of the 9 direct + ~70 transitive dependencies in go.sum.

Top recommendations (ordered by leverage / cost)

1. justfile + pre-push hook wrapping the 9-smoke chain. ~30 min. Closes the biggest Sprint 0 gap and ratchets every future PR.
2. Tests for internal/shared and internal/storeclient. ~1 hr. Two packages, every binary depends on them, zero coverage today. Highest "silent break" risk per line of code.
3. ADR-002: observer fail-safe semantics. Doc only, ~30 min. Locks in degraded / cycle default before observer is ported, so the upstream verdict:"accept" anti-pattern can't recur.
4. Auth posture decision for non-localhost binding. Doc only, ~30 min. Today's posture (127.0.0.1 + zero auth) is fine for G0; deciding token-vs-mTLS-vs-IP-allowlist now means it's not retrofitted under fire.
5. Fixture-mode smokes (MockS3Storage + MockEmbedProvider interfaces). ~3 hr. Decouples CI from MinIO + Ollama, makes the chain run in any CI box.

Risk register (risk-register.md) carries the full prioritized list. Sprint backlog (sprint-backlog.md) groups them into shipping units with acceptance criteria.

Do not refactor the 7-binary split. It already addresses the upstream "2,520-line mcp-server.ts" lesson structurally; touching it now is churn.
Do not introduce auth before deciding the deployment model. Adding bearer-token middleware preemptively will get rewritten when mTLS or IP-allowlist wins.
Do not "rebuild pathway memory in Go" to score Sprint 2 higher. That's a real engineering project, not a Sprint-scoped fix; the 3/10 reflects honest current state and the design bars in Sprint 2 backlog stories are the right shape.
Do not rewrite the 9 smokes as Go integration tests yet. Bash + curl is currently the right tool — small, transparent, easy to debug. Migrate only when fixture-mode is in place and you're paying observably for the bash dependency.
