docs: Phase G0 kickoff plan + scrum-style independent review
7-day day-by-day plan for the smallest end-to-end ingest+query path in Go:
D0 ops setup → D1 skeleton + chi + /health × 5 binaries → D2 storaged S3 →
D3 catalogd Parquet manifests → D4 ingestd CSV→Parquet → D5 queryd DuckDB →
D6 gate-day end-to-end → D7 cleanup + retro.

Plan was reviewed by opencode/claude-opus-4-7 via the gateway (the same
path the production overseer correction loop uses post-G0). 9 findings
(2 BLOCK + 5 WARN + 2 INFO):

- 2 BLOCK fixed inline:
  - cgo build dependency surfaced on D0, not D5
  - DuckDB CREATE SECRET (S3) plumbed from SecretsProvider on D5.1
- 4 of 5 WARN fixed inline:
  - storaged binds 127.0.0.1 only + 2 GiB body cap
  - queryd uses TTL-cached views + etag invalidation, not refresh-per-call
  - gateway reverse-proxy stubbed on D1.10 (501), promoted on D6
  - ADR stubs go in at the start of D4/D5, finalized on D7
- 1 WARN deferred (orphan GC on two-phase write — punted to G2)
- 1 WARN accepted with note (shared-server.go refactor — G1+ follow-up)
- 2 INFO fixed inline (go mod tidy timing, ADR-after-the-fact inversion)

Disposition table appended to the doc itself for auditability — matches
the human_overrides.jsonl pattern from the Rust auditor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs/PHASE_G0_KICKOFF.md (new file, 244 lines):
# Phase G0 Kickoff Plan

**Goal:** the smallest end-to-end ingest+query path in Go. Upload a CSV
via `POST /ingest`, query it via `POST /sql`, get rows back. No
vector, no profile, no UI yet.

**Estimated duration:** 1 engineer-week (5 working days + gate day +
cleanup). The plan is calibrated for solo work; cut it by ~40% with two
engineers working in parallel on storaged/catalogd vs ingestd/queryd.

**Cutoff for G0:** the closing acceptance gate (Day 6) passes
end-to-end against `workers_500k.csv` as the test fixture.

---

## Day 0 — One-time setup

Done by an operator with sudo on the dev box. ~15 minutes.

| # | Step | Verify |
|---|---|---|
| 0.1 | Install Go 1.23+: `curl -L https://go.dev/dl/go1.23.linux-amd64.tar.gz \| sudo tar -C /usr/local -xz` | `go version` shows 1.23+ |
| 0.2 | Add `/usr/local/go/bin` to `PATH` (in `~/.bashrc`) | a new shell sees `go` |
| 0.3 | Install the cgo toolchain: `apt install build-essential` | `gcc --version` works |
| 0.4 | Clone the repo: `git clone https://git.agentview.dev/profit/golangLAKEHOUSE.git` | `cd golangLAKEHOUSE && go version` works from inside |
| 0.5 | Bring up MinIO locally (or point at an existing instance) | `mc ls local/` lists buckets, or whatever the dev S3 is |
| 0.6 | Verify the DuckDB cgo path: `go install github.com/duckdb/duckdb-go/v2@latest` against an empty module — succeeds on Linux x86_64/arm64 via the statically linked duckdb-go-bindings; on unsupported platforms this surfaces the problem here, not on Day 5 (a runnable variant is sketched below) | install exits 0 |

**Day 0 acceptance:** `go version` shows 1.23+, `gcc --version` works,
MinIO is reachable on `localhost:9000`, and the cgo smoke install above
succeeded. (`go mod tidy` is intentionally NOT run here — there are no
imports yet; verification moves to D1.)
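
If the bare `go install` feels too indirect, a throwaway program that
opens a real connection exercises the full cgo link-and-run path. A
minimal sketch, assuming the import path pinned in D0.6 and the
`duckdb` driver name that this package's go-duckdb lineage registers:

```go
// cmd/smoke/main.go — illustrative D0.6 cgo smoke test, not committed code.
package main

import (
	"database/sql"
	"log"

	_ "github.com/duckdb/duckdb-go/v2" // cgo: forces the native link step
)

func main() {
	db, err := sql.Open("duckdb", "") // empty DSN = in-memory database
	if err != nil {
		log.Fatalf("open: %v", err)
	}
	defer db.Close()

	var one int
	if err := db.QueryRow("SELECT 1").Scan(&one); err != nil {
		log.Fatalf("query: %v", err)
	}
	log.Printf("duckdb cgo path OK (SELECT 1 = %d)", one)
}
```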

---

## Day 1 — Skeleton + chi + /health × 5 binaries

**Goal:** five binaries build, each binds to its port, and `/health`
returns `{"status":"ok","service":"<name>"}`.

| # | File | What |
|---|---|---|
| 1.1 | `internal/shared/server.go` | chi router factory, slog setup, graceful shutdown via `signal.NotifyContext` (sketched below) |
| 1.2 | `internal/shared/config.go` | TOML loader using `pelletier/go-toml/v2`, default + override pattern |
| 1.3 | `cmd/gateway/main.go` | port 3100, `/health` |
| 1.4 | `cmd/storaged/main.go` | port 3201, `/health` |
| 1.5 | `cmd/catalogd/main.go` | port 3202, `/health` |
| 1.6 | `cmd/ingestd/main.go` | port 3203, `/health` |
| 1.7 | `cmd/queryd/main.go` | port 3204, `/health` |
| 1.8 | `lakehouse.toml` | bind addresses, log level — sample committed |
| 1.9 | `Makefile` | `build`, `run-gateway`, etc. — convenience targets |
| 1.10 | `cmd/gateway/main.go` adds STUB routes `POST /v1/ingest` and `POST /v1/sql` returning `501 Not Implemented` with the header `X-Lakehouse-Stub: g0`. Real reverse-proxy wiring lands on Day 6, but the routes exist from D1 so D6 is just a behavior change, not new endpoints. | `curl -X POST :3100/v1/ingest` returns 501 with the stub header |

**Acceptance D1:** `go mod tidy` populates `go.sum` cleanly; `go build
./cmd/...` exits 0; with each binary running in a separate terminal,
`curl :3100/health` through `:3204/health` all return `200 OK` with
the expected JSON; the gateway's stub `/v1/*` routes return 501.

**Dependencies pulled:** `go-chi/chi/v5`, `pelletier/go-toml/v2`.
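
A minimal sketch of what 1.1 and 1.10 could look like, assuming
`go-chi/chi/v5`; the helper names (`NewRouter`, `Run`,
`StubNotImplemented`) are illustrative, not committed API:

```go
// internal/shared/server.go — illustrative sketch of the D1.1 factory.
package shared

import (
	"context"
	"log/slog"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/go-chi/chi/v5"
)

// NewRouter returns a chi router with the common /health endpoint.
func NewRouter(service string) *chi.Mux {
	r := chi.NewRouter()
	r.Get("/health", func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"status":"ok","service":"` + service + `"}`))
	})
	return r
}

// StubNotImplemented is the D1.10 placeholder for /v1/ingest and /v1/sql.
func StubNotImplemented(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("X-Lakehouse-Stub", "g0")
	w.WriteHeader(http.StatusNotImplemented)
}

// Run serves until SIGINT/SIGTERM, then shuts down gracefully.
func Run(addr string, h http.Handler) error {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	ctx, stop := signal.NotifyContext(context.Background(),
		os.Interrupt, syscall.SIGTERM)
	defer stop()

	srv := &http.Server{Addr: addr, Handler: h}
	errCh := make(chan error, 1)
	go func() { errCh <- srv.ListenAndServe() }()
	logger.Info("listening", "addr", addr)

	select {
	case err := <-errCh:
		return err // bind failure, etc.
	case <-ctx.Done():
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		return srv.Shutdown(shutdownCtx)
	}
}
```

Each `cmd/*/main.go` then reduces to building its routes on
`NewRouter("<service>")` and calling `Run`; the gateway additionally
mounts `StubNotImplemented` on `/v1/ingest` and `/v1/sql` until D6.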

---

## Day 2 — storaged: S3 GET/PUT/LIST

**Goal:** put a file, get it back, list it.

| # | File | What |
|---|---|---|
| 2.1 | `internal/storaged/bucket.go` | `aws-sdk-go-v2/service/s3` wrapper — `Get`, `Put`, `List`, `Delete` |
| 2.2 | `internal/storaged/registry.go` | `BucketRegistry` skeleton (per Rust ADR-017) — single bucket only in G0; multi-bucket lands in G2 |
| 2.3 | `internal/secrets/provider.go` | `SecretsProvider` interface + `FileSecretsProvider` reading `/etc/lakehouse/secrets.toml` |
| 2.4 | `cmd/storaged/main.go` | wire routes — `GET /storage/get/{key}`, `PUT /storage/put/{key}`, `GET /storage/list?prefix=...`. Bind to `127.0.0.1:3201` only (G0 is dev-only, no auth). Apply `http.MaxBytesReader` with a 2 GiB cap on PUT to bound memory and reject runaway uploads (sketched below) |

**Acceptance D2:** `curl -T sample.csv 127.0.0.1:3201/storage/put/test/sample.csv`
returns 200; `curl 127.0.0.1:3201/storage/get/test/sample.csv` echoes
the file bytes; `curl 127.0.0.1:3201/storage/list?prefix=test/` lists
`sample.csv`; a PUT exceeding 2 GiB returns `413 Payload Too Large`.
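
A sketch of the 2.4 body cap with illustrative names. One note the
acceptance implies: keys like `test/sample.csv` contain slashes, which
a chi `{key}` segment won't span, so the sketch matches a wildcard.
`http.MaxBytesReader` surfaces the overflow as `*http.MaxBytesError`,
which is what maps to the 413 above:

```go
// internal/storaged-style sketch; the injected put callback stands in
// for the bucket.go wrapper from 2.1.
package storaged

import (
	"errors"
	"io"
	"net/http"

	"github.com/go-chi/chi/v5"
)

const maxPutBytes = 2 << 30 // 2 GiB, per D2.4

// PutHandler bounds the request body before streaming it to S3.
func PutHandler(put func(key string, body io.Reader) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		key := chi.URLParam(r, "*") // wildcard route so keys may contain slashes
		r.Body = http.MaxBytesReader(w, r.Body, maxPutBytes)
		if err := put(key, r.Body); err != nil {
			var tooBig *http.MaxBytesError
			if errors.As(err, &tooBig) { // reader tripped the 2 GiB cap
				http.Error(w, "payload too large", http.StatusRequestEntityTooLarge)
				return
			}
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```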

**Open question:** the error journal (Rust ADR-018 append-log pattern) —
defer to G2 with multi-bucket federation, or wire it now? The plan says
defer; revisit if errors surface during D3-D5.

---

## Day 3 — catalogd: Parquet manifests

**Goal:** register a dataset, persist it to storaged, restart, and the
manifest is still visible.

| # | File | What |
|---|---|---|
| 3.1 | `internal/catalogd/manifest.go` | Parquet read/write using `arrow-go/v18/parquet/pqarrow`. Schema: `dataset_id`, `name`, `schema_fingerprint`, `objects`, `created_at`, `updated_at`, `row_count` |
| 3.2 | `internal/catalogd/registry.go` | in-memory index (`map[name]Manifest`), rehydrated on startup from `primary://_catalog/manifests/*.parquet` |
| 3.3 | `cmd/catalogd/main.go` | wire routes — `POST /catalog/register` (idempotent by name + fingerprint per Rust ADR-020, sketched below), `GET /catalog/manifest/{name}`, `GET /catalog/list` |
| 3.4 | `internal/catalogd/store_client.go` | thin HTTP client to `cmd/storaged` — round-trips manifest Parquets |

**Acceptance D3:** register a dataset, see it in `/catalog/list`,
restart catalogd, and `/catalog/list` still shows it. Re-registering
the same name + same fingerprint → 200 with the same `dataset_id`;
a different fingerprint → 409 Conflict.
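
The idempotency rule in 3.3 reduces to a compare-on-fingerprint under
the registry lock (the same serialization the §Risks table leans on).
A sketch with illustrative types; the plan doesn't pin the status for a
first-time register, so 200 is shown:

```go
// internal/catalogd-style sketch of the ADR-020 idempotent register.
package catalogd

import (
	"net/http"
	"sync"
)

// Manifest mirrors the D3.1 schema; only the fields the idempotency
// check needs are shown.
type Manifest struct {
	DatasetID         string
	Name              string
	SchemaFingerprint string
}

// Registry is the D3.2 in-memory index. The mutex also serializes
// registers, per the §Risks mitigation.
type Registry struct {
	mu     sync.Mutex
	byName map[string]Manifest
}

// Register returns the manifest to serve and the HTTP status for the
// route: 200 for a no-op re-register, 409 on fingerprint drift.
func (reg *Registry) Register(m Manifest) (Manifest, int) {
	reg.mu.Lock()
	defer reg.mu.Unlock()
	if reg.byName == nil {
		reg.byName = map[string]Manifest{}
	}
	if existing, ok := reg.byName[m.Name]; ok {
		if existing.SchemaFingerprint == m.SchemaFingerprint {
			return existing, http.StatusOK // idempotent: same dataset_id back
		}
		return existing, http.StatusConflict // same name, different schema
	}
	reg.byName[m.Name] = m
	return m, http.StatusOK // first registration (status not pinned by the plan)
}
```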

---

## Day 4 — ingestd: CSV → Parquet → catalog

**Goal:** `POST /ingest` with a CSV file produces a Parquet in
storaged plus a manifest in catalogd.

| # | File | What |
|---|---|---|
| 4.1 | `internal/ingestd/schema.go` | infer the Arrow schema from the CSV header + first-N-row sampling; ADR-010 default-to-string on ambiguity (sketched below) |
| 4.2 | `internal/ingestd/csv.go` | stream CSV → `array.RecordBatch` → Parquet (arrow-go pqarrow writer) |
| 4.3 | `cmd/ingestd/main.go` | route `POST /ingest` — multipart form file → schema inference → write Parquet → call catalogd to register |

**Acceptance D4:** `curl -F file=@workers_500k.csv :3203/ingest?name=workers_500k`
returns 200 with the registered manifest; `aws s3 ls` (or `mc ls`)
shows the Parquet under `primary://datasets/workers_500k/`;
`curl :3202/catalog/manifest/workers_500k` returns the manifest with
`row_count=500000`.
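
ADR-010's default-to-string rule from 4.1, sketched with stdlib parsing
plus arrow-go field types. The `arrow-go/v18` import path follows the
plan's pin; `inferColumn` and `InferSchema` are illustrative names:

```go
// internal/ingestd-style sketch: promote int64 → float64 → string,
// defaulting to string on any ambiguity.
package ingestd

import (
	"strconv"

	"github.com/apache/arrow-go/v18/arrow"
)

// inferColumn returns the narrowest Arrow type every sampled value fits.
func inferColumn(samples []string) arrow.DataType {
	isInt, isFloat := true, true
	for _, s := range samples {
		if s == "" {
			continue // empty cells don't vote
		}
		if _, err := strconv.ParseInt(s, 10, 64); err != nil {
			isInt = false
		}
		if _, err := strconv.ParseFloat(s, 64); err != nil {
			isFloat = false
		}
	}
	switch {
	case isInt:
		return arrow.PrimitiveTypes.Int64
	case isFloat:
		return arrow.PrimitiveTypes.Float64
	default:
		return arrow.BinaryTypes.String // ADR-010: default on ambiguity
	}
}

// InferSchema builds a schema from the header plus per-column samples
// (first N rows, transposed into columns by the caller).
func InferSchema(header []string, columns [][]string) *arrow.Schema {
	fields := make([]arrow.Field, len(header))
	for i, name := range header {
		fields[i] = arrow.Field{Name: name, Type: inferColumn(columns[i]), Nullable: true}
	}
	return arrow.NewSchema(fields, nil)
}
```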

---

## Day 5 — queryd: DuckDB SELECT

**Goal:** SQL queries over Parquet datasets.

| # | File | What |
|---|---|---|
| 5.1 | `internal/queryd/db.go` | `database/sql` connection to `github.com/duckdb/duckdb-go/v2` (cgo). Ensures the DuckDB Parquet + httpfs extensions are loaded; on connection open, executes `CREATE SECRET` (TYPE S3) populated from `internal/secrets/provider.go` so `read_parquet('s3://...')` against MinIO authenticates per session (sketched below) |
| 5.2 | `internal/queryd/registrar.go` | reads catalogd `/catalog/list` and registers each dataset as a DuckDB view: `CREATE VIEW workers_500k AS SELECT * FROM read_parquet('s3://...')` |
| 5.3 | `cmd/queryd/main.go` | route `POST /sql` (JSON body `{"sql": "..."}`). View refresh strategy: cache views with a TTL (default 30s) and invalidate on `If-None-Match` against catalogd's manifest etag (sketched after the D5 acceptance). **Don't** re-CREATE on every request — the Opus review flagged that as the perf cliff during D6 timing capture |
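
A sketch of the 5.1 bootstrap, assuming the driver registers as
`duckdb` and using DuckDB's documented `CREATE SECRET` options for
MinIO (`ENDPOINT`, `URL_STYLE 'path'`, `USE_SSL false`); the `S3Creds`
shape stands in for whatever the D2.3 `SecretsProvider` returns:

```go
// internal/queryd/db.go-style sketch, not committed code.
package queryd

import (
	"database/sql"
	"fmt"

	_ "github.com/duckdb/duckdb-go/v2" // registers the "duckdb" driver (assumed)
)

// S3Creds stands in for whatever the SecretsProvider returns.
type S3Creds struct{ KeyID, Secret, Endpoint string }

// OpenDB opens an in-memory DuckDB and provisions S3 auth per session.
func OpenDB(creds S3Creds) (*sql.DB, error) {
	db, err := sql.Open("duckdb", "")
	if err != nil {
		return nil, err
	}
	// httpfs carries the s3:// filesystem; Parquet support is typically
	// built in, but an explicit LOAD would be harmless. Secret values
	// are interpolated into the DDL here (dev-only, per the D2 posture).
	bootstrap := []string{
		"INSTALL httpfs",
		"LOAD httpfs",
		fmt.Sprintf(
			"CREATE SECRET minio (TYPE S3, KEY_ID '%s', SECRET '%s', ENDPOINT '%s', URL_STYLE 'path', USE_SSL false)",
			creds.KeyID, creds.Secret, creds.Endpoint),
	}
	for _, stmt := range bootstrap {
		if _, err := db.Exec(stmt); err != nil {
			db.Close()
			return nil, fmt.Errorf("duckdb bootstrap: %w", err)
		}
	}
	return db, nil
}
```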

**Acceptance D5:** after Day 4 ingestion,
`curl -X POST -d '{"sql":"SELECT count(*) FROM workers_500k"}' :3204/sql`
returns `[{"count_star()":500000}]`, and `SELECT role, count(*) FROM
workers_500k WHERE state='IL' GROUP BY role` returns the expected rows.
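
The 5.3 refresh policy as a sketch: a TTL window that short-circuits
entirely, and an etag revalidation that only rebuilds views when
catalogd's manifest actually changed. Names are illustrative:

```go
// internal/queryd/registrar.go-style sketch of the D5.3 view cache.
package queryd

import (
	"sync"
	"time"
)

type viewCache struct {
	mu        sync.Mutex
	ttl       time.Duration // default 30s per the plan
	etag      string        // last catalogd manifest etag seen
	refreshed time.Time
}

// ensureFresh re-runs CREATE OR REPLACE VIEW only when the TTL expired
// AND catalogd reports a new etag — never once per /sql request.
func (c *viewCache) ensureFresh(fetchEtag func() (string, error), rebuild func() error) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.refreshed) < c.ttl {
		return nil // inside the TTL: trust the cached views
	}
	etag, err := fetchEtag() // cheap conditional GET (If-None-Match)
	if err != nil {
		return err
	}
	if etag == c.etag {
		c.refreshed = time.Now() // unchanged: just extend the window
		return nil
	}
	if err := rebuild(); err != nil { // CREATE OR REPLACE VIEW per dataset
		return err
	}
	c.etag, c.refreshed = etag, time.Now()
	return nil
}
```

On a `/sql` request the handler calls `ensureFresh` first, then runs the
user query; inside the 30s window that costs a mutex grab and a clock
check, not a catalog round trip.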

---

## Day 6 — Gate day: end-to-end via gateway

**Goal:** the closing G0 acceptance gate passes.

| # | What |
|---|---|
| 6.1 | Promote the `cmd/gateway/main.go` `/v1/ingest` + `/v1/sql` D1.10 stubs (501) to real reverse proxies via `httputil.NewSingleHostReverseProxy` to ingestd / queryd (sketch below). Multipart forwarding for `/v1/ingest` is the riskiest hop — verify form parts pass through with the file body intact |
| 6.2 | Smoke script `scripts/g0_smoke.sh`: spin up MinIO + the 5 services, ingest, query, assert the row count |
| 6.3 | Run the smoke script against `workers_500k.csv` end-to-end |
| 6.4 | Capture timing — total ingest + query latency, file size, peak memory |

**Closing G0 acceptance:** `scripts/g0_smoke.sh` exits 0. Numbers are
recorded in `docs/G0_BASELINE.md` for future regression comparison.
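
A sketch of the 6.1 promotion. The table names
`NewSingleHostReverseProxy`; the sketch uses the equivalent
`httputil.ReverseProxy` with the Go 1.20+ `Rewrite` hook so the
`/v1/ingest` → `/ingest` path mapping is explicit (the backend paths
are assumptions from the D4/D5 route tables). `httputil` streams the
body through untouched, which is why the multipart hop should still be
verified with a real file:

```go
// cmd/gateway-style sketch: swap the D1.10 stubs for real proxies.
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"

	"github.com/go-chi/chi/v5"
)

// proxyTo returns a handler that forwards to target, rewriting the path.
func proxyTo(target, backendPath string) http.HandlerFunc {
	u, err := url.Parse(target)
	if err != nil {
		panic(err) // static dev config; fail loudly at startup
	}
	p := &httputil.ReverseProxy{
		Rewrite: func(pr *httputil.ProxyRequest) {
			pr.SetURL(u)                  // scheme + host of the backend
			pr.Out.URL.Path = backendPath // e.g. /v1/ingest → /ingest
			pr.SetXForwarded()            // standard forwarding headers
		},
	}
	return p.ServeHTTP
}

func wireProxies(r *chi.Mux) {
	r.Post("/v1/ingest", proxyTo("http://127.0.0.1:3203", "/ingest")) // ingestd
	r.Post("/v1/sql", proxyTo("http://127.0.0.1:3204", "/sql"))       // queryd
}
```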

---

## Day 7 — Cleanup + retro

| # | What |
|---|---|
| 7.1 | Update SPEC §4 G0 with what actually shipped vs what was planned (deviations, surprises) |
| 7.2 | Write `docs/G0_BASELINE.md` — measured perf numbers + comparison hooks for G1+ |
| 7.3 | **Finalize** the ADRs that were stubbed *before* their decisions landed — ADR stubs go in at the start of D4 (arrow-go version pin, schema inference policy) and D5 (DuckDB extension load order, S3 secret provisioning, view-refresh TTL) so reviewers can object in-flight; D7 just commits them after real code has exercised the calls |
| 7.4 | Tag the commit `phase-g0-complete` |
| 7.5 | Open follow-up issues for anything punted (error journal, multi-bucket, profile system, two-phase-write orphan GC, shared-server.go refactor for cgo-handle services) |

---

## Risks tracked across the week

| Risk | Where | Mitigation |
|---|---|---|
| cgo build fails on the dev box | D5 | D0.3 verifies `gcc` is present; if cgo specifically breaks, fall back to running DuckDB as an external process (SPEC §3.1 option B) |
| arrow-go pqarrow schema mismatch with CSV inference | D4 | sample 1k rows for type inference, default to string per ADR-010, and log whenever defaulting |
| DuckDB can't read S3 Parquet directly | D5 | load the `httpfs` extension explicitly; if it fails, copy the Parquet to a local temp file before querying (slow but correct) |
| `/catalog/register` race between the ingestd writer and catalogd reader | D3-D4 | same write-lock-across-storage-write pattern as Rust ADR-020 — serialize registers; throughput is fine at low ingest QPS |
| `workers_500k.csv` schema drifts vs the Rust era | D4 | the plan calls for inferring fresh, not porting the Rust schema; if staffer-domain features break in G3+, revisit |

---

## Out of scope for G0 (deferred to later phases)

- Vector indexing — Phase G1
- Multi-bucket / federation — Phase G2
- Profile system — Phase G2
- Hot-swap atomicity — Phase G2
- Pathway memory — Phase G3
- Distillation pipeline — Phase G3
- MCP server / observer / auditor — Phase G4
- HTMX UI — Phase G5
- TLS, auth — explicit non-goal until G2 (single-bucket, no-auth dev)

---

## Open questions before Day 1

1. **MinIO instance** — reuse the existing one at `localhost:9000`
   that lakehouse uses (shared dev box), or stand up a fresh one with
   a separate bucket prefix?
2. **`/etc/lakehouse/secrets.toml`** — share the lakehouse repo's
   secrets file, or create `/etc/golangLAKEHOUSE/secrets.toml`?
3. **Workers CSV source** — derive from `workers_500k.parquet`
   (round-trip back to CSV), or use `workers_500k_v9.csv` if it exists?

These are ops calls, not architecture. Answer them when D0 is being
executed.

---

## Self-review — independent pass via gateway overseer

Reviewer: `opencode/claude-opus-4-7` via `localhost:3100/v1/chat`
(the same path the production overseer correction loop uses post-G0
in the Rust era). Run on the original draft, before any of the inline
fixes above were applied. Findings are dispositioned below.

### BLOCK — both real, both fixed inline

| # | Finding | Disposition | Fix location |
|---|---|---|---|
| B1 | `apt install build-essential` alone won't satisfy the cgo link step for `duckdb-go/v2` | **Fixed** — D0.6 now runs a smoke `go install` against an empty module to flush platform issues on D0, not D5 | D0.6 |
| B2 | The DuckDB session needs S3 credentials (`CREATE SECRET`) plumbed from the SecretsProvider; "load httpfs" alone leaves auth unwired | **Fixed** — D5.1 now calls `CREATE SECRET (TYPE S3, ...)` on connection open, populated from `internal/secrets/provider.go` | D5.1 |

### WARN — 4 of 5 fixed inline; 1 deferred

| # | Finding | Disposition | Fix location |
|---|---|---|---|
| W1 | The two-phase write (storaged → catalogd register) leaves orphan Parquets on partial failure; no GC story | **Deferred** — punted to G2 alongside multi-bucket + the error journal; tracked in §Risks and the D7.5 follow-up | D7.5 |
| W2 | "Refresh views on each `/sql` call" will be the D6 perf cliff | **Fixed** — D5.3 now uses TTL-cached views with etag invalidation against catalogd | D5.3 |
| W3 | A shared `internal/shared/server.go` factory across heterogeneous binaries (HTTP ingress vs cgo-DB-holder) couples graceful-shutdown semantics that will need unwinding later | **Accepted with note** — G0 keeps the simple shared factory; the refactor is explicitly listed as a G1+ follow-up | D7.5 |
| W4 | storaged PUT/GET on a TCP port with no auth and no body cap is a footgun | **Fixed** — D2.4 now binds `127.0.0.1` only and applies a 2 GiB `MaxBytesReader` cap | D2.4 |
| W5 | A gateway reverse proxy introduced cold on D6 gate day compresses risk into the deadline | **Fixed** — D1.10 now stubs the routes returning 501; D6.1 just promotes them to real proxies | D1.10 + D6.1 |

### INFO — both fixed inline

| # | Finding | Disposition | Fix location |
|---|---|---|---|
| I1 | `go mod tidy` before any imports exist is a trivially-true verification | **Fixed** — D0.6 was repurposed for the cgo smoke; tidy verification moved to the D1 acceptance | D0.6 + D1 acceptance |
| I2 | Filing ADRs *after* the work is done inverts the usual pattern | **Fixed** — D7.3 reframed: ADR stubs go in at the start of D4/D5 so reviewers can object in-flight; D7.3 just finalizes them | D7.3 |

### Net change

7 of 9 findings produced inline plan edits; 2 were deferred to post-G0
follow-up issues (W1 orphan GC, W3 shared-server refactor) with the
deferral itself documented. No findings were dismissed as confabulation.