golangLAKEHOUSE/reports/scrum/sprint-backlog.md
root 91edd43164 scrum audit: 5 reports under reports/scrum/ · score 35/60
Adapts docs/SCRUM.md framework (originally written for the
matrix-agent-validated repo) to the Go rewrite. Five deliverables:

  golang-lakehouse-scrum-test.md  top-line + scoring + verdict
  risk-register.md                12 findings, R-001..R-012
  claim-coverage-table.md         claim/test/risk for Sprint 2
  sprint-backlog.md               5 sprints, ~2 weeks of work
  acceptance-gates.md             DoD as runnable commands

Every claim cites file:line, command output, or "missing evidence."
Smoke chain ran clean (33s wall, all 9 PASS) and is captured in
reports/scrum/_evidence/smoke_chain.log (gitignored — runtime artifact).

Scoring:
  Reproducibility       7/10  9 smokes deterministic, no just/CI gate
  Test Coverage         6/10  internal/ packages tested, 6/7 cmd/ aren't
  Trust Boundary        7/10  escapes ok, zero auth, /sql is RCE-eq off-loopback
  Memory Correctness    3/10  pathway/playbook/observer not yet ported
  Deployment Readiness  4/10  no REPLICATION, no env template, no systemd
  Maintainability       8/10  no god-files, 7 lean binaries, ADRs current

Top risks:
  R-001 HIGH  queryd /sql + DuckDB + non-loopback bind = RCE-equivalent
  R-002 HIGH  internal/shared (server.go + config.go) zero tests
  R-003 HIGH  internal/storeclient zero tests, used by 2 services
  R-004 MED   9-smoke chain green but not gated (no justfile/hook)

The audit is the work; refactors come after. Sprint 0 owns coverage
+ CI gating; Sprint 1 owns trust-boundary decisions; Sprints 2-3 are
mostly design-bar work for unbuilt agent components.

.gitignore exception: /reports/* + !/reports/scrum/ keeps reports/
a runtime-artifact directory while exposing reports/scrum/ as
tracked documentation. Mirrors the pattern future audit passes will
land in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 04:51:47 -05:00

# golangLAKEHOUSE — Sprint Backlog
Five sprints adapted from SCRUM.md's framework. Each sprint has a goal, user stories, and acceptance criteria. Risk IDs reference `risk-register.md`. Acceptance-of-done details live in `acceptance-gates.md`.
The audit is the work of *this* turn; these sprints are the next turns. Order matters — Sprint 0 unblocks the rest by making the substrate provably runnable on a clean box.
---
## Sprint 0 — Reproducibility Gate
**Goal:** make the repo provably runnable, with structural protection against silent regressions in the load-bearing-but-untested layers.
**Risks closed:** R-002, R-003, R-004, R-005, R-006, R-008, R-012.
### Stories
- **S0.1** — As an operator, I can run **one command** and know exactly which dependencies are missing or wrong-versioned.
- Concrete: `just doctor` checks Go ≥1.25, gcc, MinIO at `:9000`, Ollama at `:11434` with `nomic-embed-text` loaded, `secrets-go.toml` present + readable. With the `--json` flag, output is structured JSON. Non-zero exit on any missing dep.
- **S0.2** — As an operator, I can run a **minimal fixture test** without MinIO or Ollama.
- Concrete: `just smoke-fixtures` runs against in-process fakes (`MockS3Storage` + `MockEmbedProvider`). Smokes split into two tiers: `*_smoke.sh` (real services, slow) vs `*_smoke_fixtures.sh` (fakes, runs anywhere).
- **S0.3** — As an operator, I can verify the whole substrate with one command, and I cannot push a regression past it.
- Concrete: `just verify` runs `go vet` + `go test` + the 9-smoke chain. `.git/hooks/pre-push` calls `just verify` and aborts on non-zero exit. Failure output is structured.
- **S0.4** — As a reviewer, I can read coverage at a glance and see where wiring layers lack tests.
- Concrete: `cmd/<bin>/main_test.go` exists for all 7 binaries (today: only `storaged`). Each tests routes registered, body-cap rejection, schema-validation rejection, happy-path with mocked dependency.
- **S0.5** — Load-bearing internal packages have unit-test coverage proportional to their blast radius.
- Concrete: `internal/shared/{server,config}_test.go` exist (R-002). `internal/storeclient/client_test.go` exists (R-003). `internal/queryd/db_test.go` exists with adversarial `sqlEscape` + exhaustive `redactCreds` cases (R-008).
- **S0.6** — Empty `tests/` directory either claimed or removed.
- Concrete: pick. If claimed for fixture-mode wiring (S0.2), document its purpose in README. If not, delete in the same commit as S0.1.
### Acceptance
- `just --list` shows `verify`, `smoke-fixtures`, `doctor`, plus shortcuts for `fmt`/`vet`/`test`/`smoke <day>`.
- `just verify` exits 0 on a clean clone with deps present.
- `just smoke-fixtures` exits 0 on a clean clone with **no MinIO and no Ollama**.
- Pre-push hook present at `.git/hooks/pre-push`, executable, calls `just verify`.
- `go test ./...` shows non-empty test count for every package in `internal/` (no more `[no test files]` lines for shared/storeclient).
- Test count for cmd/ binaries: 7/7 (today: 1/7).
- Failure output structured: any `just doctor` failure prints JSON describing what's missing, no claim of success.
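The structured-failure requirement can be sketched as a small report type. The field names (`ok`, `checks`, `detail`) and the hard-coded probe results below are assumptions for illustration; the real `just doctor` output shape is not defined in this backlog.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// check is one dependency probe result. Field names are assumptions.
type check struct {
	Name   string `json:"name"`
	OK     bool   `json:"ok"`
	Detail string `json:"detail,omitempty"`
}

// report is the top-level document; OK is true only if every check passed.
type report struct {
	OK     bool    `json:"ok"`
	Checks []check `json:"checks"`
}

// buildReport folds individual results into one report.
func buildReport(checks []check) report {
	r := report{OK: true, Checks: checks}
	for _, c := range checks {
		if !c.OK {
			r.OK = false
		}
	}
	return r
}

// exitCode maps the report to a process exit status: non-zero on any failure.
func exitCode(r report) int {
	if r.OK {
		return 0
	}
	return 1
}

func main() {
	// Hard-coded results stand in for real probes (Go version, MinIO, Ollama...).
	r := buildReport([]check{
		{Name: "go", OK: true},
		{Name: "minio", OK: false, Detail: "nothing listening on :9000"},
	})
	out, err := json.MarshalIndent(r, "", "  ")
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(string(out))
	// A real command would call os.Exit(exitCode(r)) here.
	fmt.Println("exit code:", exitCode(r))
}
```

Deriving the exit code from the same struct that is printed keeps the JSON and the exit status from disagreeing, which is what "no claim of success" asks for.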
### Estimate
- S0.1 doctor: ~1 hr
- S0.2 fixture-mode: ~3 hr (interface plumbing + fakes + new smokes)
- S0.3 verify + hook: ~30 min
- S0.4 cmd-level tests: ~3 hr (6 binaries × ~30 min)
- S0.5 internal tests: ~3 hr
- S0.6 tests/ dir: ~5 min
Total: ~1.5 days focused. Single bundled PR with one commit per story.
---
## Sprint 1 — Trust Boundary Gate
**Goal:** prevent agent trust collapse. Ensure the SQL surface is not RCE-equivalent when a service is accidentally bound to a non-loopback address. Decide auth posture once and apply uniformly.
**Risks closed:** R-001, R-007, R-009 (regression test only), R-010.
### Stories
- **S1.1** — As an operator, I cannot accidentally expose `POST /sql` to the network.
- Concrete: `cmd/queryd/main.go` startup asserts bind starts with `127.` or `[::1]`. If env `LH_QUERYD_ALLOW_NONLOOPBACK=1` is set, log a warning and continue. Otherwise `os.Exit(1)`. Same gate added to vectord, storaged, ingestd until S1.2 lands.
- **S1.2** — As an operator, I have one configurable auth posture across all 7 binaries.
- Concrete: ADR-003 picks Bearer-token + IP allow-list (or an alternative, to be decided in the ADR). `internal/shared/auth.go` provides middleware; each `cmd/<bin>/main.go` adds `r.Use(authMiddleware)` in one line. Token sourced from `secrets-go.toml`'s new `[auth].token` field. Empty token = local-mode (no auth, only loopback bind allowed: `127.` or `[::1]`, as in S1.1).
- **S1.3** — As an operator, every public endpoint validates schema on input.
- Concrete: each handler decoding a JSON body has explicit struct tags + missing-field detection. Unknown fields rejected (`json.Decoder.DisallowUnknownFields`). Empty-required-field rejected with structured 400. Today's coverage is partial; this story closes it uniformly.
- **S1.4** — As a reviewer, I have a regression test against SQL injection in dataset names.
- Concrete: `internal/queryd/registrar_test.go` gains a test where catalogd returns a manifest with `name: 'foo"; DROP TABLE x; --'`. The test asserts `quoteIdent` quoting prevents the DROP from executing — view name is `"foo""; DROP TABLE x; --"` which is a single quoted identifier (R-009 latent guard).
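The quoting that S1.4 relies on is the standard SQL rule for delimited identifiers: wrap in double quotes and double every embedded double quote. A minimal sketch of that rule follows; the repo's own `quoteIdent` may differ in detail, and this version is only an assumption about its behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent wraps name in double quotes and doubles any embedded double
// quote, so the whole string is parsed as a single identifier.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

func main() {
	hostile := `foo"; DROP TABLE x; --`
	fmt.Println(quoteIdent(hostile)) // "foo""; DROP TABLE x; --"
	fmt.Println("CREATE VIEW " + quoteIdent(hostile) + " AS SELECT 1")
}
```

The regression test would assert on the quoted form and, ideally, also run the generated statement against a throwaway database to confirm that table `x` survives.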
### Acceptance
- All 7 binaries fail-loud on non-loopback bind without explicit override env.
- ADR-003 in `docs/DECISIONS.md` documents the auth model with rationale.
- Auth middleware is one `r.Use()` line per binary; adding it to a new binary takes one import.
- Every JSON-decoding handler uses `DisallowUnknownFields` + missing-required-field rejection.
- R-009 regression test passes; assertion would fail if `quoteIdent` is removed.
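The fail-loud bind gate can be sketched as a pure function plus an env lookup. Note that this sketch parses the host as an IP instead of prefix-matching on `127.` as S1.1 words it; that is a swapped-in technique, chosen because a prefix test would also accept a hostname that merely begins with `127.`. The function names and the example ports are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// isLoopbackBind reports whether addr (host:port) names a loopback IP.
// An empty host (":8080") binds every interface, so it is rejected.
func isLoopbackBind(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// allowBind applies the gate: loopback always passes; anything else passes
// only with the explicit override, and then a warning should be logged.
func allowBind(addr string, override bool) (ok, warn bool) {
	if isLoopbackBind(addr) {
		return true, false
	}
	return override, override
}

func main() {
	override := os.Getenv("LH_QUERYD_ALLOW_NONLOOPBACK") == "1"
	for _, addr := range []string{"127.0.0.1:8080", "[::1]:8080", "0.0.0.0:8080", ":8080"} {
		ok, warn := allowBind(addr, override)
		fmt.Printf("%-16s allowed=%v warn=%v\n", addr, ok, warn)
	}
	// A real binary would call os.Exit(1) when ok is false.
}
```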
### Estimate
~2 days focused. ADR-003 is the gating decision; once written, S1.1 + S1.2 are mechanical.
---
## Sprint 2 — Memory Correctness Gate
**Goal:** prove pathway / playbook memory cannot poison itself, with the test fixture covering Mem0 semantics on day one. This sprint is **design-bar work** for components that haven't been ported from Rust yet: the memory layer will still not exist when Sprint 1 ends.
**Risks closed:** all DESIGN-BAR rows in `claim-coverage-table.md` for Mem0 + persistence-at-scale.
### Stories
- **S2.1** — As an architect, I have an ADR fixing the pathway-memory data model in Go before code lands.
- Concrete: ADR-004 documents trace shape, history-chain rules, retire semantics, replay-count rules. Cites the Rust `pathway_memory` crate as reference but does NOT carry forward the 88-trace state per ADR-001 (clean start ratified).
- **S2.2** — As a developer, the pathway-memory port lands with a deterministic fixture corpus and full test coverage on day one.
- Concrete: `tests/fixtures/pathway/` has known-shape JSON entries covering ADD / UPDATE / REVISE / RETIRE / HISTORY / cycle-attempt / replay-duplicate / corrupted-row. New `internal/pathway/` package implements the data model. Test count: ≥7 functions in `pathway_test.go`, one per fixture row.
- **S2.3** — As a developer, retired traces are excluded from retrieval — and the test would fail without the exclusion.
- Concrete: integration test does ADD → RETIRE → SEARCH → assert returned set excludes the retired UID. Removing the retirement filter must turn this test red.
- **S2.4** — As an architect, vectord persistence works above 256 MiB single-key (the gap from the 500K staffing test).
- Concrete: either bump storaged's `MaxBytesReader` for vector-content paths, or split LHV1 across N fixed-size keys with a manifest pointer, or add multipart upload to storaged. Decision in ADR-005. Smoke variant `g1p_scale_smoke.sh` ingests 200K vectors @ d=768 + asserts kill-restart preserves state at that size.
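The ADD → RETIRE → SEARCH check in S2.3 can be sketched against an in-memory stand-in. Everything here (`trace`, `store`, substring matching as "search") is hypothetical; the real data model is what ADR-004 will fix.

```go
package main

import (
	"fmt"
	"strings"
)

// trace is a hypothetical, pared-down pathway-memory entry.
type trace struct {
	UID     string
	Text    string
	Retired bool
}

// store is an in-memory stand-in for the pathway package.
type store struct {
	traces map[string]*trace
	order  []string // insertion order keeps results deterministic
}

func newStore() *store { return &store{traces: map[string]*trace{}} }

// Add inserts or overwrites a trace.
func (s *store) Add(uid, text string) {
	if _, exists := s.traces[uid]; !exists {
		s.order = append(s.order, uid)
	}
	s.traces[uid] = &trace{UID: uid, Text: text}
}

// Retire marks a trace as retired; it reports whether the UID existed.
func (s *store) Retire(uid string) bool {
	t, ok := s.traces[uid]
	if ok {
		t.Retired = true
	}
	return ok
}

// Search returns UIDs whose text contains q, skipping retired traces.
func (s *store) Search(q string) []string {
	var out []string
	for _, uid := range s.order {
		t := s.traces[uid]
		if t.Retired {
			continue // the filter the regression test guards
		}
		if strings.Contains(t.Text, q) {
			out = append(out, uid)
		}
	}
	return out
}

func main() {
	s := newStore()
	s.Add("t1", "welder shift coverage")
	s.Add("t2", "welder overtime rules")
	s.Retire("t1")
	fmt.Println(s.Search("welder")) // [t2]
}
```

Deleting the `continue` branch makes `t1` reappear in the result, which is exactly the failure the story requires the test to detect.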
### Acceptance
- ADR-004 and ADR-005 in `docs/DECISIONS.md`.
- `internal/pathway/` package with ≥7 covering tests; `go test ./internal/pathway/` passes.
- Retire-exclusion regression test passes; would fail if filter logic removed.
- `g1p_scale_smoke.sh` passes at 200K vectors.
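Why 200K vectors at d=768 exercises the 256 MiB limit is a matter of arithmetic. The sketch below assumes 4-byte float32 components and ignores any LHV1 header or index overhead (both assumptions, since the on-disk format is not described here), and treats 256 MiB as the fixed chunk size for the split-key option.

```go
package main

import "fmt"

const (
	bytesPerFloat32 = 4         // assumption: vectors are stored as float32
	capBytes        = 256 << 20 // 256 MiB single-key limit from S2.4
)

// rawVectorBytes is the payload size for n vectors of dimension dim.
func rawVectorBytes(n, dim int64) int64 { return n * dim * bytesPerFloat32 }

// keysNeeded is the number of fixed-size chunks required (ceiling division).
func keysNeeded(total, chunk int64) int64 { return (total + chunk - 1) / chunk }

func main() {
	total := rawVectorBytes(200_000, 768)
	fmt.Printf("%d bytes (%.1f MiB)\n", total, float64(total)/(1<<20)) // 614400000 bytes (585.9 MiB)
	fmt.Println("keys at 256 MiB each:", keysNeeded(total, capBytes))  // keys at 256 MiB each: 3
}
```

Under those assumptions the raw payload alone is more than twice the cap, so the smoke at 200K vectors cannot pass without one of the three options ADR-005 has to choose between.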
### Estimate
~1 week. ADR-004 is the design anchor; the test fixtures derive from it.
---
## Sprint 3 — Agent Loop Reality Gate
**Goal:** prove the full agent loop works across an actual workflow. End-to-end deterministic: search → verify → observer review → playbook seal → second-run retrieval surfaces the prior playbook.
**Risks closed:** all DESIGN-BAR rows for observer + playbook seal + agent loop closure. The Rust system's crash-loop bug, caused by calling `r.json()` on a text/plain response (memory `54689d5`), gets a regression test.
### Stories
- **S3.1** — As an architect, ADR-002 fixes observer fail-safe semantics before observer is ported.
- Concrete: doc-only. Default verdict = `cycle`, `degraded: true` on internal error. Explicit `LH_OBSERVER_FAIL_OPEN=1` env to opt into fail-open in dev only. Reference the Rust mcp-server's `verdict: "accept"` on observer error as the anti-pattern being designed away.
- **S3.2** — As a developer, the observer port ships with tests covering the four states (accept / reject / cycle / degraded).
- Concrete: `internal/observer/` package + `cmd/observerd` binary. Test fixture: hallucinated claim → reject; valid claim with SQL truth → accept; SQL truth unreachable → degraded+cycle (NEVER accept).
- **S3.3** — As a developer, playbook seal + second-run retrieval is a single end-to-end smoke.
- Concrete: `agent_loop_smoke.sh` does ingest → search → verify → observer review → seal → second-run retrieval. Assertions: second run surfaces prior playbook UID; report includes input hash, output hash, verdict, and memory-mutation receipt.
- **S3.4** — As a reviewer, the Rust health-endpoint content-type bug cannot recur.
- Concrete: regression test that consumes `/health` from each of the 7 binaries via the gateway and asserts: response is text/plain, body matches `<service> ok` pattern, never silently parses as JSON.
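The assertion in S3.4 can be sketched as a check on an `*http.Response`. The regular expression for `<service> ok` and the helper names are assumptions; the in-process recorder stands in for a real call through the gateway.

```go
package main

import (
	"fmt"
	"io"
	"mime"
	"net/http"
	"net/http/httptest"
	"regexp"
)

// healthBody is an assumed pattern for "<service> ok", allowing a trailing newline.
var healthBody = regexp.MustCompile(`^[a-z]+ ok\s*$`)

// checkHealth verifies the response is text/plain with a "<service> ok" body.
// It never attempts to parse the body as JSON.
func checkHealth(resp *http.Response) error {
	defer resp.Body.Close()
	raw := resp.Header.Get("Content-Type")
	mt, _, err := mime.ParseMediaType(raw)
	if err != nil || mt != "text/plain" {
		return fmt.Errorf("want text/plain, got %q", raw)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	if !healthBody.Match(body) {
		return fmt.Errorf("unexpected body %q", body)
	}
	return nil
}

// fakeResponse builds a response in-process, standing in for a real service.
func fakeResponse(contentType, body string) *http.Response {
	rec := httptest.NewRecorder()
	rec.Header().Set("Content-Type", contentType)
	rec.WriteString(body)
	return rec.Result()
}

func main() {
	fmt.Println(checkHealth(fakeResponse("text/plain; charset=utf-8", "storaged ok")))
	fmt.Println(checkHealth(fakeResponse("application/json", `{"status":"ok"}`)))
}
```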
### Acceptance
- ADR-002 in `docs/DECISIONS.md`.
- `internal/observer/` with ≥4 covering tests.
- `agent_loop_smoke.sh` passes deterministically; tagged report includes input/output hashes + verdict + receipt.
- `health_contenttype_test.go` exists, would fail if any binary regresses to JSON.
### Estimate
~1 week. ADR-002 is short; observer port is the bulk; agent-loop wiring is real engineering.
---
## Sprint 4 — Deployment Gate
**Goal:** turn deployment from tribal-knowledge into executable validation. Fresh box → green smoke chain in one command.
**Risks closed:** R-006 (cloud-only Provider), all deployment-readiness gaps (no REPLICATION, no env template, no systemd, no doctor).
### Stories
- **S4.1** — As an operator on a fresh Debian box, `just doctor` tells me exactly what to install.
- Concrete: structured JSON output describing each missing dep with the `apt install` / `curl ... | tar` command to fix it. Cross-checked against `README.md` "Cold-start dependencies" — single source of truth.
- **S4.2** — As an operator, `REPLICATION.md` is executable, not narrative.
- Concrete: every step in `REPLICATION.md` is either a copy-pasteable command block or a reference to a `just <target>` invocation. Validation steps from the upstream `REPLICATION.md` (health checks, embed probe, vector probe, agent test) become `just smoke-replication`.
- **S4.3** — As an operator, I have an env template for `secrets-go.toml`.
- Concrete: `secrets-go.toml.example` in repo with all required keys + comments documenting each. `just doctor` checks for unfilled placeholder values.
- **S4.4** — As an operator, systemd units in repo wire each binary cleanly.
- Concrete: `deploy/systemd/{gateway,storaged,catalogd,ingestd,queryd,vectord,embedd}.service` with `After=`, `Restart=on-failure`, `MemoryMax=`, environment loading. `just install-systemd` symlinks them.
- **S4.5** — As an operator deploying to AWS S3 instead of MinIO, no code changes are required.
- Concrete: `just smoke-aws-s3` variant that points the bucket config at real S3. Existing smokes pass against real S3 (validates the aws-sdk-go-v2 path).
### Acceptance
- `just doctor` on fresh Debian 13 box reports actionable JSON with install commands.
- `just smoke-replication` succeeds on first run after `just doctor` shows green.
- `secrets-go.toml.example` present with documented keys.
- 7 systemd unit files in `deploy/systemd/`; `systemctl status lakehouse-go-*` shows green after install.
- `just smoke-aws-s3` succeeds against a real bucket (manual: requires AWS creds).
### Estimate
~3 days focused. S4.4 + S4.5 are most of the time.
---
## Cross-sprint dependencies
```
Sprint 0 ─────────────────────────────────────► (unblocks all)
   ├─► Sprint 1 ───► Sprint 2 ───► Sprint 3 ───► Sprint 4
   │      │             │             │
   │      ▼             ▼             ▼
   └──── auth ADR ── memory ADR ── observer ADR
```
- Sprint 0 is the gate. None of the others should ship without `just verify` reliably catching regressions.
- Sprint 1 should land before Sprint 2 because R-001 (queryd /sql) is HIGH severity and the fix is mostly mechanical.
- Sprint 2 / 3 are real engineering; estimates are floors, not ceilings.
- Sprint 4 can land in parallel with Sprint 2/3 — its stories don't depend on the agent-loop port.