
# golangLAKEHOUSE — Sprint Backlog

Five sprints adapted from SCRUM.md's framework. Each sprint has a goal, user stories, and acceptance criteria. Risk IDs reference `risk-register.md`. Acceptance-of-done details live in `acceptance-gates.md`.

The audit is the work of *this* turn; these sprints are the next turns. Order matters — Sprint 0 unblocks the rest by making the substrate provably runnable on a clean box.

---

## Sprint 0 — Reproducibility Gate

**Goal:** make the repo provably runnable, with structural protection against silent regressions in the load-bearing-but-untested layers.

**Risks closed:** R-002, R-003, R-004, R-005, R-006, R-008, R-012.

### Stories

- **S0.1** — As an operator, I can run **one command** and know exactly which dependencies are missing or wrong-versioned.
  - Concrete: `just doctor` checks Go ≥1.25, gcc, MinIO at `:9000`, Ollama at `:11434` with `nomic-embed-text` loaded, `secrets-go.toml` present + readable. Output is structured JSON on `--json` flag. Non-zero exit on any missing dep.

- **S0.2** — As an operator, I can run a **minimal fixture test** without MinIO or Ollama.
  - Concrete: `just smoke-fixtures` runs against in-process fakes (`MockS3Storage` + `MockEmbedProvider`). Smokes split into two tiers: `*_smoke.sh` (real services, slow) vs `*_smoke_fixtures.sh` (fakes, runs anywhere).

- **S0.3** — As an operator, I can verify the whole substrate with one command, and I cannot push a regression past it.
  - Concrete: `just verify` runs `go vet` + `go test` + the 9-smoke chain. `.git/hooks/pre-push` calls `just verify` and aborts on non-zero exit. Failure output is structured.

- **S0.4** — As a reviewer, I can read coverage at a glance and see where wiring layers lack tests.
  - Concrete: `cmd/<bin>/main_test.go` exists for all 7 binaries (today: only `storaged`). Each covers route registration, body-cap rejection, schema-validation rejection, and a happy path with a mocked dependency.

- **S0.5** — Load-bearing internal packages have unit-test coverage proportional to their blast radius.
  - Concrete: `internal/shared/{server,config}_test.go` exist (R-002). `internal/storeclient/client_test.go` exists (R-003). `internal/queryd/db_test.go` exists with adversarial `sqlEscape` + exhaustive `redactCreds` cases (R-008).

- **S0.6** — Empty `tests/` directory either claimed or removed.
  - Concrete: pick one. If the directory is claimed for fixture-mode wiring (S0.2), document its purpose in the README. If not, delete it in the same commit as S0.1.
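
The fixture tier in S0.2 depends on the storage layer sitting behind an interface that an in-process fake can satisfy. A minimal sketch of that shape follows. The `ObjectStore` interface, its method set, and `ErrNotFound` are assumptions made for illustration; only the `MockS3Storage` name comes from the story, and the repo's real interface may look different.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ObjectStore is a hypothetical minimal storage interface.
// The real interface in the repo may expose different methods.
type ObjectStore interface {
	Put(key string, data []byte) error
	Get(key string) ([]byte, error)
}

// ErrNotFound is a hypothetical sentinel for missing keys.
var ErrNotFound = errors.New("object not found")

// MockS3Storage keeps objects in a map so smokes can run without MinIO.
type MockS3Storage struct {
	mu      sync.Mutex
	objects map[string][]byte
}

// NewMockS3Storage returns an empty in-memory store.
func NewMockS3Storage() *MockS3Storage {
	return &MockS3Storage{objects: map[string][]byte{}}
}

// Put stores a copy of data so later caller mutations do not leak in.
func (m *MockS3Storage) Put(key string, data []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.objects[key] = append([]byte(nil), data...)
	return nil
}

// Get returns a copy of the stored bytes, or ErrNotFound.
func (m *MockS3Storage) Get(key string) ([]byte, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	data, ok := m.objects[key]
	if !ok {
		return nil, ErrNotFound
	}
	return append([]byte(nil), data...), nil
}

func main() {
	// Code under test takes the interface, so the fake drops in unchanged.
	var store ObjectStore = NewMockS3Storage()
	if err := store.Put("datasets/fixture.bin", []byte("fixture")); err != nil {
		fmt.Println("put failed:", err)
		return
	}
	got, err := store.Get("datasets/fixture.bin")
	fmt.Println(string(got), err)
}
```

A `MockEmbedProvider` would follow the same pattern: a small interface for the embed call plus a deterministic in-process implementation.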

### Acceptance

- `just --list` shows `verify`, `smoke-fixtures`, `doctor`, plus shortcuts for `fmt`/`vet`/`test`/`smoke <day>`.
- `just verify` exits 0 on a clean clone with deps present.
- `just smoke-fixtures` exits 0 on a clean clone with **no MinIO and no Ollama**.
- Pre-push hook present at `.git/hooks/pre-push`, executable, calls `just verify`.
- `go test ./...` shows non-empty test count for every package in `internal/` (no more `[no test files]` lines for shared/storeclient).
- Test count for cmd/ binaries: 7/7 (today: 1/7).
- Failure output is structured: any `just doctor` failure prints JSON describing what is missing and never reports success.
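
As an illustration of the structured-failure criterion, here is a hedged sketch of what the core of a doctor command could look like. It covers only TCP reachability for MinIO and Ollama and readability of `secrets-go.toml`; the Go version, gcc, and model-loaded checks from S0.1 are left out. The `Check` and `Report` types and their JSON field names are assumptions, not an agreed output schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"os"
	"time"
)

// Check is one dependency probe result (hypothetical field names).
type Check struct {
	Name   string `json:"name"`
	OK     bool   `json:"ok"`
	Detail string `json:"detail,omitempty"`
}

// Report is the structured output; OK is false if any check failed.
type Report struct {
	OK     bool    `json:"ok"`
	Checks []Check `json:"checks"`
}

// probeTCP reports whether anything accepts connections at addr.
func probeTCP(name, addr string) Check {
	conn, err := net.DialTimeout("tcp", addr, 500*time.Millisecond)
	if err != nil {
		return Check{Name: name, Detail: err.Error()}
	}
	conn.Close()
	return Check{Name: name, OK: true}
}

// probeFile reports whether path exists and can be opened for reading.
func probeFile(name, path string) Check {
	f, err := os.Open(path)
	if err != nil {
		return Check{Name: name, Detail: err.Error()}
	}
	f.Close()
	return Check{Name: name, OK: true}
}

// summarize marks the whole report failed if any single check failed.
func summarize(checks []Check) Report {
	r := Report{OK: true, Checks: checks}
	for _, c := range checks {
		if !c.OK {
			r.OK = false
		}
	}
	return r
}

func main() {
	r := summarize([]Check{
		probeTCP("minio", "127.0.0.1:9000"),
		probeTCP("ollama", "127.0.0.1:11434"),
		probeFile("secrets", "secrets-go.toml"),
	})
	out, _ := json.MarshalIndent(r, "", "  ")
	fmt.Println(string(out))
	if !r.OK {
		os.Exit(1) // non-zero exit on any missing dependency
	}
}
```

Keeping `summarize` separate from the probes means the pass/fail rule can be unit-tested without any services running.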

### Estimate

- S0.1 doctor: ~1 hr
- S0.2 fixture-mode: ~3 hr (interface plumbing + fakes + new smokes)
- S0.3 verify + hook: ~30 min
- S0.4 cmd-level tests: ~3 hr (6 binaries × ~30 min)
- S0.5 internal tests: ~3 hr
- S0.6 tests/ dir: ~5 min

Total: ~1.5 days focused. Single bundled PR with one commit per story.

---

## Sprint 1 — Trust Boundary Gate

**Goal:** prevent agent trust collapse. Ensure that an accidental non-localhost bind does not make the SQL surface RCE-equivalent. Decide auth posture once and apply uniformly.

**Risks closed:** R-001, R-007, R-009 (regression test only), R-010.

### Stories

- **S1.1** — As an operator, I cannot accidentally expose `POST /sql` to the network.
  - Concrete: `cmd/queryd/main.go` startup asserts bind starts with `127.` or `[::1]`. If env `LH_QUERYD_ALLOW_NONLOOPBACK=1` is set, log a warning and continue. Otherwise `os.Exit(1)`. Same gate added to vectord, storaged, ingestd until S1.2 lands.

- **S1.2** — As an operator, I have one configurable auth posture across all 7 binaries.
  - Concrete: ADR-003 picks Bearer-token + IP allow-list (or alternative — decide in the ADR). `internal/shared/auth.go` provides middleware; each `cmd/<bin>/main.go` adds `r.Use(authMiddleware)` in one line. Token sourced from `secrets-go.toml`'s new `[auth].token` field. Empty token = local-mode (no auth, only `127.` bind allowed).

- **S1.3** — As an operator, every public endpoint validates schema on input.
  - Concrete: each handler decoding a JSON body has explicit struct tags + missing-field detection. Unknown fields rejected (`json.Decoder.DisallowUnknownFields`). Empty-required-field rejected with structured 400. Today's coverage is partial; this story closes it uniformly.

- **S1.4** — As a reviewer, I have a regression test against SQL injection in dataset names.
  - Concrete: `internal/queryd/registrar_test.go` gains a test where catalogd returns a manifest with `name: 'foo"; DROP TABLE x; --'`. The test asserts `quoteIdent` quoting prevents the DROP from executing — view name is `"foo""; DROP TABLE x; --"` which is a single quoted identifier (R-009 latent guard).
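
The quoting rule that S1.4 relies on can be sketched as follows. This is an illustrative reimplementation under the standard SQL rule (wrap in double quotes, double any embedded double quote); the repo's actual `quoteIdent` may differ in detail. `singleIdentifier` is a hypothetical helper that a regression test could use to assert the result is one quoted identifier.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent wraps name in double quotes and doubles embedded double quotes.
// Sketch only; the repo's implementation is the source of truth.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// singleIdentifier reports whether s is exactly one quoted identifier:
// it opens and closes with a quote, and every interior quote is doubled.
func singleIdentifier(s string) bool {
	if len(s) < 2 || s[0] != '"' || s[len(s)-1] != '"' {
		return false
	}
	inner := s[1 : len(s)-1]
	for i := 0; i < len(inner); i++ {
		if inner[i] != '"' {
			continue
		}
		if i+1 >= len(inner) || inner[i+1] != '"' {
			return false // a lone quote would end the identifier early
		}
		i++ // skip the second quote of the doubled pair
	}
	return true
}

func main() {
	hostile := `foo"; DROP TABLE x; --`
	quoted := quoteIdent(hostile)
	fmt.Println(quoted, singleIdentifier(quoted))

	// Naive concatenation without escaping fails the same check.
	naive := `"` + hostile + `"`
	fmt.Println(naive, singleIdentifier(naive))
}
```

The second case is what makes the test load-bearing: without the escaping step the assertion fails.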

### Acceptance

- All 7 binaries fail-loud on non-loopback bind without explicit override env.
- ADR-003 in `docs/DECISIONS.md` documents the auth model with rationale.
- Auth middleware is one `r.Use()` line per binary; adding it to a new binary takes one import.
- Every JSON-decoding handler uses `DisallowUnknownFields` + missing-required-field rejection.
- R-009 regression test passes; assertion would fail if `quoteIdent` is removed.
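
For the decoding criterion, a minimal sketch of a strict handler is shown below. The `SQLRequest` struct, the `query` field, the 1 MiB body cap, and the error body shape are assumptions for illustration. `DisallowUnknownFields` and `http.MaxBytesReader` are standard-library calls.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// SQLRequest is a hypothetical request body for POST /sql.
type SQLRequest struct {
	Query string `json:"query"`
}

// writeError emits a structured JSON error with the given status.
func writeError(w http.ResponseWriter, code int, msg string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	json.NewEncoder(w).Encode(map[string]string{"error": msg})
}

// handleSQL rejects unknown fields and a missing or empty required field.
func handleSQL(w http.ResponseWriter, r *http.Request) {
	var req SQLRequest
	dec := json.NewDecoder(http.MaxBytesReader(w, r.Body, 1<<20)) // assumed cap
	dec.DisallowUnknownFields()
	if err := dec.Decode(&req); err != nil {
		writeError(w, http.StatusBadRequest, err.Error())
		return
	}
	if strings.TrimSpace(req.Query) == "" {
		writeError(w, http.StatusBadRequest, "missing required field: query")
		return
	}
	w.WriteHeader(http.StatusNoContent) // real handler would run the query here
}

// statusFor posts body to the handler in-process and returns the status code.
func statusFor(body string) int {
	req := httptest.NewRequest(http.MethodPost, "/sql", strings.NewReader(body))
	rec := httptest.NewRecorder()
	handleSQL(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(`{"query":"select 1"}`))
	fmt.Println(statusFor(`{"query":"select 1","extra":true}`))
	fmt.Println(statusFor(`{}`))
}
```

The same three cases (valid body, unknown field, missing field) map directly onto the per-binary `main_test.go` checks from S0.4.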

### Estimate

~2 days focused. ADR-003 is the gating decision; once written, S1.1 + S1.2 are mechanical.
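
To show how mechanical S1.1 is, here is a sketch of the startup gate. Note one deliberate substitution: S1.1 describes a string-prefix check (`127.` or `[::1]`), while this sketch parses the address and asks the standard library whether the IP is loopback, which also rejects hostnames that merely begin with `127.`. The `checkBind` helper and the example address are hypothetical; the env variable name comes from S1.1.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// isLoopbackBind reports whether addr ("host:port") binds only to a loopback IP.
// An empty host (":8080") or a hostname is treated as non-loopback.
func isLoopbackBind(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// checkBind returns an error for a non-loopback bind unless override is "1".
// warn is true when the override was needed and used.
func checkBind(addr, override string) (warn bool, err error) {
	if isLoopbackBind(addr) {
		return false, nil
	}
	if override == "1" {
		return true, nil
	}
	return false, fmt.Errorf("refusing non-loopback bind %q; set LH_QUERYD_ALLOW_NONLOOPBACK=1 to override", addr)
}

func main() {
	addr := "127.0.0.1:8080" // hypothetical; the real value would come from config
	warn, err := checkBind(addr, os.Getenv("LH_QUERYD_ALLOW_NONLOOPBACK"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if warn {
		fmt.Fprintln(os.Stderr, "warning: binding to non-loopback address", addr)
	}
	fmt.Println("bind ok:", addr)
}
```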

---

## Sprint 2 — Memory Correctness Gate

**Goal:** prove pathway / playbook memory cannot poison itself, with the test fixture covering Mem0 semantics on day one. This sprint is **design-bar work** for components that haven't been ported from Rust yet: the memory layer will still not exist when Sprint 1 ends.

**Risks closed:** all DESIGN-BAR rows in `claim-coverage-table.md` for Mem0 + persistence-at-scale.

### Stories

- **S2.1** — As an architect, I have an ADR fixing the pathway-memory data model in Go before code lands.
  - Concrete: ADR-004 documents trace shape, history-chain rules, retire semantics, replay-count rules. Cites the Rust `pathway_memory` crate as reference but does NOT carry forward the 88-trace state per ADR-001 (clean start ratified).

- **S2.2** — As a developer, the pathway-memory port lands with a deterministic fixture corpus and full test coverage on day one.
  - Concrete: `tests/fixtures/pathway/` has known-shape JSON entries covering ADD / UPDATE / REVISE / RETIRE / HISTORY / cycle-attempt / replay-duplicate / corrupted-row. New `internal/pathway/` package implements the data model. Test count: ≥7 functions in `pathway_test.go`, one per fixture row.

- **S2.3** — As a developer, retired traces are excluded from retrieval — and the test would fail without the exclusion.
  - Concrete: integration test does ADD → RETIRE → SEARCH → assert returned set excludes the retired UID. Removing the retirement filter must turn this test red.

- **S2.4** — As an architect, vectord persistence works above 256 MiB single-key (the gap from the 500K staffing test).
  - Concrete: either bump storaged's `MaxBytesReader` for vector-content paths, or split LHV1 across N fixed-size keys with a manifest pointer, or add multipart upload to storaged. Decision in ADR-005. Smoke variant `g1p_scale_smoke.sh` ingests 200K vectors @ d=768 + asserts kill-restart preserves state at that size.
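
A quick back-of-envelope check shows why the 200K smoke in S2.4 exercises the limit. The sketch assumes 4-byte float32 components and counts raw vector payload only; the actual LHV1 encoding, headers, and IDs are not specified here and would add to the total.

```go
package main

import "fmt"

// maxSingleKeyBytes is the 256 MiB single-key ceiling named in S2.4.
const maxSingleKeyBytes int64 = 256 << 20

// rawVectorBytes returns the payload size of n vectors of dimension d,
// assuming float32 components and ignoring headers and IDs.
func rawVectorBytes(n, d int64) int64 {
	return n * d * 4
}

// keysNeeded is the ceiling of total/chunk: how many fixed-size keys
// a split layout would need.
func keysNeeded(total, chunk int64) int64 {
	return (total + chunk - 1) / chunk
}

func main() {
	total := rawVectorBytes(200_000, 768)
	fmt.Printf("raw payload: %d bytes (%.1f MiB)\n", total, float64(total)/(1<<20))
	fmt.Printf("exceeds single key: %v\n", total > maxSingleKeyBytes)
	fmt.Printf("keys needed at 256 MiB each: %d\n", keysNeeded(total, maxSingleKeyBytes))
}
```

Under these assumptions the payload is 614,400,000 bytes (about 586 MiB), more than twice the ceiling, so a split layout would need three 256 MiB keys.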

### Acceptance

- ADR-004 and ADR-005 in `docs/DECISIONS.md`.
- `internal/pathway/` package with ≥7 covering tests; `go test ./internal/pathway/` passes.
- Retire-exclusion regression test passes; would fail if filter logic removed.
- `g1p_scale_smoke.sh` passes at 200K vectors.
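
The retire-exclusion criterion can be illustrated with a toy in-memory store. Everything here (`Trace`, `Store`, substring search) is a hypothetical stand-in; ADR-004 will define the real data model and retrieval is unlikely to be substring matching. The point is the shape of the ADD, RETIRE, SEARCH sequence and the single filter line the test pins down.

```go
package main

import (
	"fmt"
	"strings"
)

// Trace is a hypothetical pathway-memory entry.
type Trace struct {
	UID     string
	Text    string
	Retired bool
}

// Store is an in-memory stand-in for the pathway store.
type Store struct {
	traces []*Trace
}

// Add appends a new, active trace.
func (s *Store) Add(uid, text string) {
	s.traces = append(s.traces, &Trace{UID: uid, Text: text})
}

// Retire marks a trace retired and reports whether the UID was found.
func (s *Store) Retire(uid string) bool {
	for _, t := range s.traces {
		if t.UID == uid {
			t.Retired = true
			return true
		}
	}
	return false
}

// Search returns UIDs of active traces whose text contains query,
// in insertion order.
func (s *Store) Search(query string) []string {
	var hits []string
	for _, t := range s.traces {
		if t.Retired {
			continue // the filter the regression test must keep honest
		}
		if strings.Contains(t.Text, query) {
			hits = append(hits, t.UID)
		}
	}
	return hits
}

func main() {
	s := &Store{}
	s.Add("t1", "forklift operator shift plan")
	s.Add("t2", "forklift maintenance plan")
	s.Retire("t1")
	fmt.Println(s.Search("forklift"))
}
```

Deleting the `continue` branch makes the retired UID reappear in results, which is exactly the condition under which the regression test should go red.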

### Estimate

~1 week. ADR-004 is the design anchor; the test fixtures derive from it.

---

## Sprint 3 — Agent Loop Reality Gate

**Goal:** prove the full agent loop works across an actual workflow. End-to-end deterministic: search → verify → observer review → playbook seal → second-run retrieval surfaces the prior playbook.

**Risks closed:** all DESIGN-BAR rows for observer + playbook seal + agent loop closure. The Rust system's crash-loop bug, caused by calling `r.json()` on a text/plain response (memory `54689d5`), gets a regression test.

### Stories

- **S3.1** — As an architect, ADR-002 fixes observer fail-safe semantics before observer is ported.
  - Concrete: doc-only. On internal error the verdict defaults to `cycle` with `degraded: true`. Setting `LH_OBSERVER_FAIL_OPEN=1` explicitly opts into fail-open, for dev only. Reference the Rust mcp-server's `verdict: "accept"` on observer error as the anti-pattern being designed away.

- **S3.2** — As a developer, the observer port ships with tests covering the four states (accept / reject / cycle / degraded).
  - Concrete: `internal/observer/` package + `cmd/observerd` binary. Test fixture: hallucinated claim → reject; valid claim with SQL truth → accept; SQL truth unreachable → degraded+cycle (NEVER accept).

- **S3.3** — As a developer, playbook seal + second-run retrieval is a single end-to-end smoke.
  - Concrete: `agent_loop_smoke.sh` does ingest → search → verify → observer review → seal → second-run retrieval. Assertions: second run surfaces prior playbook UID; report includes input hash, output hash, verdict, and memory-mutation receipt.

- **S3.4** — As a reviewer, the Rust health-endpoint content-type bug cannot recur.
  - Concrete: regression test that consumes `/health` from each of the 7 binaries via the gateway and asserts: response is text/plain, body matches `<service> ok` pattern, never silently parses as JSON.
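
The health contract in S3.4 can be expressed as a small checker. This sketch validates a single `*http.Response`; a real test would loop over the 7 binaries via the gateway. `checkHealthResponse` and `respond` are hypothetical helper names, and the exact-match body comparison is one reading of the `<service> ok` pattern.

```go
package main

import (
	"fmt"
	"io"
	"mime"
	"net/http"
	"net/http/httptest"
	"strings"
)

// checkHealthResponse verifies the plain-text health contract for one service.
func checkHealthResponse(resp *http.Response, service string) error {
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("want 200, got %d", resp.StatusCode)
	}
	mediaType, _, err := mime.ParseMediaType(resp.Header.Get("Content-Type"))
	if err != nil {
		return fmt.Errorf("bad Content-Type: %w", err)
	}
	if mediaType != "text/plain" {
		return fmt.Errorf("want text/plain, got %s", mediaType)
	}
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1024))
	if err != nil {
		return err
	}
	want := service + " ok"
	if got := strings.TrimSpace(string(body)); got != want {
		return fmt.Errorf("want body %q, got %q", want, got)
	}
	return nil
}

// respond builds an in-process response so the check runs without sockets.
func respond(contentType, body string) *http.Response {
	rec := httptest.NewRecorder()
	rec.Header().Set("Content-Type", contentType)
	rec.WriteString(body)
	return rec.Result()
}

func main() {
	ok := respond("text/plain; charset=utf-8", "storaged ok\n")
	fmt.Println("plain:", checkHealthResponse(ok, "storaged"))

	regressed := respond("application/json", `{"status":"ok"}`)
	fmt.Println("json:", checkHealthResponse(regressed, "storaged"))
}
```

Checking the media type explicitly, rather than attempting a JSON parse and seeing whether it fails, is what keeps the client side from repeating the Rust bug.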

### Acceptance

- ADR-002 in `docs/DECISIONS.md`.
- `internal/observer/` with ≥4 covering tests.
- `agent_loop_smoke.sh` passes deterministically; tagged report includes input/output hashes + verdict + receipt.
- `health_contenttype_test.go` exists, would fail if any binary regresses to JSON.

### Estimate

~1 week. ADR-002 is short; observer port is the bulk; agent-loop wiring is real engineering.
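
Because ADR-002 is short, its core rule fits in a few lines. The sketch below is one possible reading of S3.1 and S3.2, not the ADR itself. In particular, what fail-open returns (here `accept` with `degraded: true`) is an assumption, as are the `Review` type, the `decide` signature, and the sentinel error.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Verdict is the observer's decision.
type Verdict string

const (
	Accept Verdict = "accept"
	Reject Verdict = "reject"
	Cycle  Verdict = "cycle"
)

// Review pairs a verdict with the degraded flag (hypothetical shape).
type Review struct {
	Verdict  Verdict
	Degraded bool
}

// errTruthUnreachable is a hypothetical sentinel for a failed SQL truth lookup.
var errTruthUnreachable = errors.New("sql truth unreachable")

// decide applies fail-safe semantics: an internal error never yields a
// clean accept. failOpen is the dev-only escape hatch.
func decide(claimHolds bool, truthErr error, failOpen bool) Review {
	if truthErr != nil {
		if failOpen {
			// Assumed dev-only behavior; still flagged as degraded.
			return Review{Verdict: Accept, Degraded: true}
		}
		return Review{Verdict: Cycle, Degraded: true}
	}
	if claimHolds {
		return Review{Verdict: Accept}
	}
	return Review{Verdict: Reject}
}

func main() {
	failOpen := os.Getenv("LH_OBSERVER_FAIL_OPEN") == "1"
	fmt.Printf("%+v\n", decide(true, nil, failOpen))
	fmt.Printf("%+v\n", decide(false, nil, failOpen))
	fmt.Printf("%+v\n", decide(true, errTruthUnreachable, failOpen))
}
```

The three calls in `main` correspond to the S3.2 fixture cases: valid claim, hallucinated claim, and unreachable SQL truth.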

---

## Sprint 4 — Deployment Gate

**Goal:** turn deployment from tribal knowledge into executable validation. Fresh box → green smoke chain in one command.

**Risks closed:** R-006 (cloud-only Provider), all deployment-readiness gaps (no REPLICATION, no env template, no systemd, no doctor).

### Stories

- **S4.1** — As an operator on a fresh Debian box, `just doctor` tells me exactly what to install.
  - Concrete: structured JSON output describing each missing dep with the `apt install` / `curl ... | tar` command to fix it. Cross-checked against `README.md` "Cold-start dependencies" — single source of truth.

- **S4.2** — As an operator, `REPLICATION.md` is executable, not narrative.
  - Concrete: every step in `REPLICATION.md` is either a copy-pasteable command block or a reference to a `just <target>` invocation. Validation steps from the upstream `REPLICATION.md` (health checks, embed probe, vector probe, agent test) become `just smoke-replication`.

- **S4.3** — As an operator, I have an env template for `secrets-go.toml`.
  - Concrete: `secrets-go.toml.example` in repo with all required keys + comments documenting each. `just doctor` checks for unfilled placeholder values.

- **S4.4** — As an operator, systemd units in repo wire each binary cleanly.
  - Concrete: `deploy/systemd/{gateway,storaged,catalogd,ingestd,queryd,vectord,embedd}.service` with `After=`, `Restart=on-failure`, `MemoryMax=`, environment loading. `just install-systemd` symlinks them.

- **S4.5** — As an operator deploying to AWS S3 instead of MinIO, no code changes are required.
  - Concrete: `just smoke-aws-s3` variant that points the bucket config at real S3. Existing smokes pass against real S3 (validates the aws-sdk-go-v2 path).

### Acceptance

- `just doctor` on fresh Debian 13 box reports actionable JSON with install commands.
- `just smoke-replication` succeeds on first run after `just doctor` shows green.
- `secrets-go.toml.example` present with documented keys.
- 7 systemd unit files in `deploy/systemd/`; `systemctl status lakehouse-go-*` shows green after install.
- `just smoke-aws-s3` succeeds against a real bucket (manual: requires AWS creds).

### Estimate

~3 days focused. S4.4 + S4.5 are most of the time.

---

## Cross-sprint dependencies

```
Sprint 0 ─────────────────────────────────────► (unblocks all)
   │
   ├─► Sprint 1 ───► Sprint 2 ───► Sprint 3 ───► Sprint 4
   │       │              │              │
   │       ▼              ▼              ▼
   └──── auth ADR ── memory ADR ── observer ADR
```

- Sprint 0 is the gate. None of the others should ship without `just verify` reliably catching regressions.
- Sprint 1 should land before Sprint 2 because R-001 (queryd /sql) is HIGH severity and the fix is mostly mechanical.
- Sprints 2 and 3 are real engineering; their estimates are floors, not ceilings.
- Sprint 4 can land in parallel with Sprint 2/3 — its stories don't depend on the agent-loop port.