golangLAKEHOUSE/docs/PHASE_G0_KICKOFF.md
Claw ed3ccf7c53 docs: Phase G0 kickoff plan + scrum-style independent review
7-day day-by-day plan for the smallest end-to-end ingest+query path
in Go: D0 ops setup → D1 skeleton + chi + /health × 5 binaries → D2
storaged S3 → D3 catalogd Parquet manifests → D4 ingestd CSV→Parquet
→ D5 queryd DuckDB → D6 gate-day end-to-end → D7 cleanup + retro.

Plan was reviewed by opencode/claude-opus-4-7 via the gateway
(same path the production overseer correction loop uses post-G0).
9 findings (2 BLOCK + 5 WARN + 2 INFO):

- 2 BLOCK fixed inline:
  - cgo build dependency surfaced on D0 not D5
  - DuckDB CREATE SECRET (S3) plumbed from SecretsProvider on D5.1
- 3 of 5 WARN fixed inline:
  - storaged binds 127.0.0.1 only + 2 GiB body cap
  - queryd uses TTL-cached views + etag invalidation, not refresh-per-call
  - gateway reverse-proxy stubbed on D1.10 (501), promoted on D6
- 1 WARN deferred (orphan GC on two-phase write — punted to G2)
- 1 WARN accepted with note (shared-server.go refactor — G1+ follow-up)
- 2 INFO fixed inline (go mod tidy timing, ADR-after-fact inversion)

Disposition table appended to the doc itself for auditability —
matches the human_overrides.jsonl pattern from the Rust auditor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 06:47:15 -05:00


# Phase G0 Kickoff Plan
**Goal:** smallest end-to-end ingest+query path in Go. Upload a CSV
via `POST /ingest`, query it via `POST /sql`, get rows back. No
vector, no profile, no UI yet.
**Estimated duration:** 1 engineer-week (5 working days + gate day +
cleanup). Plan is calibrated for solo work; cut by ~40% with two
engineers in parallel on storaged/catalogd vs ingestd/queryd.
**Cutoff for G0:** the closing acceptance gate (Day 6) passes
end-to-end against `workers_500k.csv` as the test fixture.
---
## Day 0 — One-time setup
Done by an operator with sudo on the dev box. ~15 minutes.
| # | Step | Verify |
|---|---|---|
| 0.1 | Install Go 1.23+: `curl -L https://go.dev/dl/go1.23.0.linux-amd64.tar.gz \| sudo tar -C /usr/local -xzf -` | `go version` shows 1.23+ |
| 0.2 | Add `/usr/local/go/bin` to `PATH` (in `~/.bashrc`) | new shell sees `go` |
| 0.3 | Install cgo toolchain: `apt install build-essential` | `gcc --version` works |
| 0.4 | Clone repo: `git clone https://git.agentview.dev/profit/golangLAKEHOUSE.git` | `cd golangLAKEHOUSE && go version` from inside |
| 0.5 | Bring up MinIO locally (or point at existing) | `mc ls local/` lists buckets, or whatever the dev S3 is |
| 0.6 | Verify DuckDB cgo path: in a scratch module (`go mod init cgosmoke`), `go get github.com/duckdb/duckdb-go/v2@latest`, then `go build` a `main.go` that blank-imports the driver (`go install` alone won't compile a library package) — succeeds on Linux x86_64/arm64 via static-linked duckdb-go-bindings; on unsupported platforms surface this here, not on Day 5 | build exits 0 |
**Day 0 acceptance:** `go version` shows 1.23+, `gcc --version` works,
MinIO reachable on `localhost:9000`, the cgo smoke build above
succeeded. (`go mod tidy` is intentionally NOT run here — no imports
yet; verification moves to D1.)
---
## Day 1 — Skeleton + chi + /health × 5 binaries
**Goal:** five binaries build, each binds to its port, `/health`
returns `{"status":"ok","service":"<name>"}`.
| # | File | What |
|---|---|---|
| 1.1 | `internal/shared/server.go` | chi router factory, slog setup, graceful shutdown via `signal.NotifyContext` |
| 1.2 | `internal/shared/config.go` | TOML loader using `pelletier/go-toml/v2`, default + override pattern |
| 1.3 | `cmd/gateway/main.go` | port 3100, `/health` |
| 1.4 | `cmd/storaged/main.go` | port 3201, `/health` |
| 1.5 | `cmd/catalogd/main.go` | port 3202, `/health` |
| 1.6 | `cmd/ingestd/main.go` | port 3203, `/health` |
| 1.7 | `cmd/queryd/main.go` | port 3204, `/health` |
| 1.8 | `lakehouse.toml` | bind addresses, log level — sample committed |
| 1.9 | `Makefile` | `build`, `run-gateway`, etc. — convenience |
| 1.10 | `cmd/gateway/main.go` adds STUB routes `POST /v1/ingest` and `POST /v1/sql` returning `501 Not Implemented` with a header `X-Lakehouse-Stub: g0`. Real reverse-proxy wiring lands on Day 6, but the routes exist from D1 so D6 is just behavior change, not new endpoints. | `curl -X POST :3100/v1/ingest` returns 501 with the stub header |
**Acceptance D1:** `go mod tidy` populates `go.sum` cleanly; `go build
./cmd/...` exits 0; running each binary in a separate terminal,
`curl :3100/health` through `:3204/health` all return `200 OK` with
the expected JSON; gateway's stub `/v1/*` routes return 501.
**Dependencies pulled:** `go-chi/chi/v5`, `pelletier/go-toml/v2`.
---
## Day 2 — storaged: S3 GET/PUT/LIST
**Goal:** put a file, get it back, list it.
| # | File | What |
|---|---|---|
| 2.1 | `internal/storaged/bucket.go` | `aws-sdk-go-v2/service/s3` wrapper — `Get`, `Put`, `List`, `Delete` |
| 2.2 | `internal/storaged/registry.go` | `BucketRegistry` skeleton (per Rust ADR-017) — single bucket only in G0; multi-bucket lands in G2 |
| 2.3 | `internal/secrets/provider.go` | `SecretsProvider` interface + `FileSecretsProvider` reading `/etc/lakehouse/secrets.toml` |
| 2.4 | `cmd/storaged/main.go` | wire routes — `GET /storage/get/{key}`, `PUT /storage/put/{key}`, `GET /storage/list?prefix=...`. Bind to `127.0.0.1:3201` only (G0 is dev-only, no auth). Apply `http.MaxBytesReader` with a 2 GiB cap on PUT to bound memory + reject runaway uploads |
**Acceptance D2:** `curl -T sample.csv 127.0.0.1:3201/storage/put/test/sample.csv`
returns 200; `curl 127.0.0.1:3201/storage/get/test/sample.csv` echoes
the file bytes; `curl 127.0.0.1:3201/storage/list?prefix=test/` lists
`sample.csv`; PUT exceeding 2 GiB returns `413 Payload Too Large`.
**Open question:** error journal (Rust ADR-018 append-log pattern) —
defer to G2 with multi-bucket federation, or wire it now? Plan says
defer; revisit if errors surface during D3-D5.
---
## Day 3 — catalogd: Parquet manifests
**Goal:** register a dataset, persist to storaged, restart, manifest
still visible.
| # | File | What |
|---|---|---|
| 3.1 | `internal/catalogd/manifest.go` | Parquet read/write using `arrow-go/v18/parquet/pqarrow`. Schema: `dataset_id`, `name`, `schema_fingerprint`, `objects`, `created_at`, `updated_at`, `row_count` |
| 3.2 | `internal/catalogd/registry.go` | In-memory index (map[name]Manifest), rehydrated on startup from `primary://_catalog/manifests/*.parquet` |
| 3.3 | `cmd/catalogd/main.go` | wire routes — `POST /catalog/register` (idempotent by name + fingerprint per Rust ADR-020), `GET /catalog/manifest/{name}`, `GET /catalog/list` |
| 3.4 | `internal/catalogd/store_client.go` | thin HTTP client to `cmd/storaged` — round-trips manifest Parquets |
**Acceptance D3:** register a dataset, see it in `/catalog/list`,
restart catalogd, `/catalog/list` still shows it. Re-register same
name + same fingerprint → 200, same `dataset_id`. Different
fingerprint → 409 Conflict.
---
## Day 4 — ingestd: CSV → Parquet → catalog
**Goal:** `POST /ingest` with a CSV file produces a Parquet in
storaged + a manifest in catalogd.
| # | File | What |
|---|---|---|
| 4.1 | `internal/ingestd/schema.go` | infer Arrow schema from CSV header + first-N-row sampling. ADR-010 default-to-string on ambiguity |
| 4.2 | `internal/ingestd/csv.go` | stream CSV → `arrow.Record` batches → Parquet (arrow-go pqarrow writer) |
| 4.3 | `cmd/ingestd/main.go` | route `POST /ingest` — multipart form file → schema infer → write Parquet → call catalogd to register |
**Acceptance D4:** `curl -F file=@workers_500k.csv :3203/ingest?name=workers_500k`
returns 200 with the registered manifest; `aws s3 ls` (or `mc ls`)
shows the Parquet under `primary://datasets/workers_500k/`;
`curl :3202/catalog/manifest/workers_500k` returns the manifest with
`row_count=500000`.
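D4.1's ADR-010 policy amounts to "promote a column only if every sampled value parses, else fall back to string"; a dependency-free sketch (the real path emits Arrow types rather than these string kinds, and logs when it defaults):

```go
package main

import (
	"fmt"
	"strconv"
)

// inferKind applies the ADR-010 policy over a sampled column: pick the
// narrowest type every sampled value parses as, else default to string.
func inferKind(samples []string) string {
	if len(samples) == 0 {
		return "string" // nothing to sample → safest default
	}
	allInt, allFloat := true, true
	for _, v := range samples {
		if _, err := strconv.ParseInt(v, 10, 64); err != nil {
			allInt = false
		}
		if _, err := strconv.ParseFloat(v, 64); err != nil {
			allFloat = false
		}
	}
	switch {
	case allInt:
		return "int64"
	case allFloat:
		return "float64"
	default:
		return "string" // ambiguity defaults to string per ADR-010
	}
}

func main() {
	fmt.Println(inferKind([]string{"1", "2", "3"}))    // int64
	fmt.Println(inferKind([]string{"1", "2.5"}))       // float64
	fmt.Println(inferKind([]string{"1", "IL", "2.5"})) // string
}
```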
---
## Day 5 — queryd: DuckDB SELECT
**Goal:** SQL queries over Parquet datasets.
| # | File | What |
|---|---|---|
| 5.1 | `internal/queryd/db.go` | `database/sql` connection to `github.com/duckdb/duckdb-go/v2` (cgo). Ensures DuckDB extensions Parquet + httpfs are loaded; on connection open, executes `CREATE SECRET` (TYPE S3) populated from `internal/secrets/provider.go` so `read_parquet('s3://...')` against MinIO authenticates per session |
| 5.2 | `internal/queryd/registrar.go` | reads catalogd `/catalog/list`, registers each dataset as a DuckDB view: `CREATE VIEW workers_500k AS SELECT * FROM read_parquet('s3://...')` |
| 5.3 | `cmd/queryd/main.go` | route `POST /sql` (JSON body `{"sql": "..."}`). View refresh strategy: cache views with a TTL (default 30s) + invalidate on `If-None-Match` against catalogd's manifest etag. **Don't** re-CREATE on every request — Opus review flagged that as the perf cliff during D6 timing capture |
**Acceptance D5:** after Day 4 ingestion,
`curl -X POST -d '{"sql":"SELECT count(*) FROM workers_500k"}' :3204/sql`
returns `[{"count_star()":500000}]`. A `SELECT role, count(*) FROM
workers_500k WHERE state='IL' GROUP BY role` returns expected rows.
---
## Day 6 — Gate day: end-to-end via gateway
**Goal:** the closing G0 acceptance gate passes.
| # | What |
|---|---|
| 6.1 | Promote `cmd/gateway/main.go` `/v1/ingest` + `/v1/sql` from D1.10 stubs (501) to real reverse-proxies via `httputil.NewSingleHostReverseProxy` to ingestd / queryd. Multipart forwarding for `/v1/ingest` is the riskiest hop — verify form parts pass through with the file body intact |
| 6.2 | Smoke script `scripts/g0_smoke.sh`: spin up MinIO + 5 services, ingest, query, assert row count |
| 6.3 | Run smoke against `workers_500k.csv` end-to-end |
| 6.4 | Capture timing — total ingest + query latency, file size, peak memory |
**Closing G0 acceptance:** `scripts/g0_smoke.sh` exits 0. Numbers
recorded in `docs/G0_BASELINE.md` for future regression comparison.
---
## Day 7 — Cleanup + retro
| # | What |
|---|---|
| 7.1 | Update SPEC §4 G0 with what actually shipped vs planned (deviations, surprises) |
| 7.2 | Write `docs/G0_BASELINE.md` — measured perf numbers + comparison hooks for G1+ |
| 7.3 | **Finalize** ADRs that were stubbed *before* their decisions landed — ADR stubs go in at the start of D4 (arrow-go version pin, schema inference policy) and D5 (DuckDB extension load order, S3 secret provisioning, view-refresh TTL) so reviewers can object in-flight; D7 just commits them after running real code against the calls |
| 7.4 | Tag the commit `phase-g0-complete` |
| 7.5 | Open follow-up issues for anything punted (error journal, multi-bucket, profile system, two-phase-write orphan GC, shared-server.go refactor for cgo-handle services) |
---
## Risks tracked across the week
| Risk | Where | Mitigation |
|---|---|---|
| cgo build fails on the dev box | D5 | D0.3 verifies `gcc` present; if cgo specifically breaks, fall back to DuckDB external process (SPEC §3.1 option B) |
| arrow-go pqarrow schema mismatch with CSV inference | D4 | Sample 1k rows for type inference, default to String per ADR-010, log when defaulting |
| DuckDB can't read S3 Parquet directly | D5 | Load `httpfs` extension explicitly; if it fails, copy Parquet to a local temp file before query (slow but correct) |
| `/catalog/register` race between ingestd writer and catalogd reader | D3-D4 | Same write-lock-across-storage-write pattern as Rust ADR-020 — serialize registers; throughput is OK at low ingest QPS |
| `workers_500k.csv` schema drifts vs Rust era | D4 | Plan calls for inferring fresh, not porting Rust schema. If staffer-domain features break in G3+, revisit |
---
## Out of scope for G0 (deferred to later phases)
- Vector indexing — Phase G1
- Multi-bucket / federation — Phase G2
- Profile system — Phase G2
- Hot-swap atomicity — Phase G2
- Pathway memory — Phase G3
- Distillation pipeline — Phase G3
- MCP server / observer / auditor — Phase G4
- HTMX UI — Phase G5
- TLS, auth — explicit non-goal until G2 (single-bucket no-auth dev)
---
## Open questions before Day 1
1. **MinIO instance** — reuse the existing one at `localhost:9000`
that lakehouse uses (shared dev box) or stand up a fresh one with
a separate bucket prefix?
2. **`/etc/lakehouse/secrets.toml`** — share the lakehouse repo's
secrets file or create `/etc/golangLAKEHOUSE/secrets.toml`?
3. **Workers CSV source** — derive from `workers_500k.parquet` (round-
trip back to CSV) or use `workers_500k_v9.csv` if it exists?
These are ops calls, not architecture. Answer when D0 is being
executed.
---
## Self-review — independent pass via gateway overseer
Reviewer: `opencode/claude-opus-4-7` via `localhost:3100/v1/chat`
(the same path the production overseer correction loop uses post-G0
in the Rust era). Run on the original draft before any of the inline
fixes above were applied. Findings dispositioned below.
### BLOCK — both real, both fixed inline
| # | Finding | Disposition | Fix location |
|---|---|---|---|
| B1 | `apt install build-essential` alone won't satisfy the cgo link step for `duckdb-go/v2` | **Fixed** — D0.6 now runs a smoke cgo build in a scratch module on D0 to flush platform issues here, not on D5 | D0.6 |
| B2 | DuckDB session needs S3 credentials (`CREATE SECRET`) plumbed from SecretsProvider; "load httpfs" alone leaves auth unwired | **Fixed** — D5.1 now calls `CREATE SECRET (TYPE S3, ...)` on connection open, populated from `internal/secrets/provider.go` | D5.1 |
### WARN — 3 of 5 fixed inline; 1 deferred, 1 accepted with note
| # | Finding | Disposition | Fix location |
|---|---|---|---|
| W1 | Two-phase write (storaged → catalogd register) leaves orphan Parquets on partial failure; no GC story | **Deferred** — punted to G2 alongside multi-bucket + error journal; tracked in §Risks and D7.5 follow-up | D7.5 |
| W2 | "Refresh views on each `/sql` call" will be the D6 perf cliff | **Fixed** — D5.3 now uses TTL-cached views with etag invalidation against catalogd | D5.3 |
| W3 | Shared `internal/shared/server.go` factory across heterogeneous binaries (HTTP ingress vs cgo-DB-holder) couples graceful-shutdown semantics that will need unwinding later | **Accepted with note** — G0 keeps the simple shared factory; refactor explicitly listed as a G1+ follow-up | D7.5 |
| W4 | storaged PUT/GET on a TCP port with no auth + no body cap is a footgun | **Fixed** — D2.4 now binds `127.0.0.1` only and applies a 2 GiB `MaxBytesReader` cap | D2.4 |
| W5 | Gateway reverse-proxy introduced cold on D6 gate day compresses risk into the deadline | **Fixed** — D1.10 now stubs the routes returning 501; D6.1 just promotes them to real proxies | D1.10 + D6.1 |
### INFO — both fixed inline
| # | Finding | Disposition | Fix location |
|---|---|---|---|
| I1 | `go mod tidy` before any imports is a trivial-true verification | **Fixed** — D0.6 re-purposed for the cgo smoke; tidy verification moved to D1 acceptance | D0.6 + D1 acceptance |
| I2 | Filing ADRs *after* the work is done inverts the usual pattern | **Fixed** — D7.3 reframed: ADR stubs go in at the start of D4/D5 so reviewers can object in-flight; D7.3 just finalizes them | D7.3 |
### Net change
7 of 9 findings produced inline plan edits; the remaining 2 did not
change the plan (W1 orphan GC deferred to G2, W3 shared-server
refactor accepted with a G1+ follow-up note), with both dispositions
documented above. No findings dismissed as confabulation.