docs: Phase G0 kickoff plan + scrum-style independent review
7-day day-by-day plan for the smallest end-to-end ingest+query path in Go:
D0 ops setup → D1 skeleton + chi + /health × 5 binaries → D2 storaged S3 →
D3 catalogd Parquet manifests → D4 ingestd CSV→Parquet → D5 queryd DuckDB →
D6 gate-day end-to-end → D7 cleanup + retro.

Plan was reviewed by opencode/claude-opus-4-7 via the gateway (the same
path the production overseer correction loop uses post-G0). 9 findings
(2 BLOCK + 5 WARN + 2 INFO):

- 2 BLOCK fixed inline:
  - cgo build dependency surfaced on D0, not D5
  - DuckDB CREATE SECRET (S3) plumbed from SecretsProvider on D5.1
- 4 of 5 WARN fixed inline:
  - storaged binds 127.0.0.1 only + 2 GiB body cap
  - queryd uses TTL-cached views + etag invalidation, not refresh-per-call
  - gateway reverse-proxy stubbed on D1.10 (501), promoted on D6
  - ADR stubs go in at the start of D4/D5, finalized on D7
- 1 WARN deferred (orphan GC on two-phase write — punted to G2)
- 1 WARN accepted with note (shared-server.go refactor — G1+ follow-up)
- 2 INFO fixed inline (go mod tidy timing, ADR-after-the-fact inversion)

Disposition table appended to the doc itself for auditability — matches
the human_overrides.jsonl pattern from the Rust auditor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs/PHASE_G0_KICKOFF.md (new file, 244 lines):
# Phase G0 Kickoff Plan

**Goal:** the smallest end-to-end ingest+query path in Go. Upload a CSV
via `POST /ingest`, query it via `POST /sql`, get rows back. No
vector, no profile, no UI yet.

**Estimated duration:** 1 engineer-week (5 working days + gate day +
cleanup). The plan is calibrated for solo work; cut it by ~40% with two
engineers working in parallel on storaged/catalogd vs ingestd/queryd.

**Cutoff for G0:** the closing acceptance gate (Day 6) passes
end-to-end against `workers_500k.csv` as the test fixture.

---

## Day 0 — One-time setup

Done by an operator with sudo on the dev box. ~15 minutes.

| # | Step | Verify |
|---|---|---|
| 0.1 | Install Go 1.23+: `curl -L https://go.dev/dl/go1.23.linux-amd64.tar.gz \| sudo tar -C /usr/local -xz` | `go version` shows 1.23+ |
| 0.2 | Add `/usr/local/go/bin` to `PATH` (in `~/.bashrc`) | a new shell sees `go` |
| 0.3 | Install the cgo toolchain: `apt install build-essential` | `gcc --version` works |
| 0.4 | Clone the repo: `git clone https://git.agentview.dev/profit/golangLAKEHOUSE.git` | `cd golangLAKEHOUSE && go version` works from inside |
| 0.5 | Bring up MinIO locally (or point at an existing instance) | `mc ls local/` lists buckets, or whatever the dev S3 is |
| 0.6 | Verify the DuckDB cgo path: `go install github.com/duckdb/duckdb-go/v2@latest` against an empty module — succeeds on Linux x86_64/arm64 via the statically linked duckdb-go-bindings; on unsupported platforms this surfaces the problem here, not on Day 5 (a runnable variant is sketched below) | install exits 0 |

**Day 0 acceptance:** `go version` shows 1.23+, `gcc --version` works,
MinIO is reachable on `localhost:9000`, and the cgo smoke install above
succeeded. (`go mod tidy` is intentionally NOT run here — there are no
imports yet; verification moves to D1.)
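
If the bare `go install` feels too indirect, a throwaway program that
opens a real connection exercises the full cgo link-and-run path. A
minimal sketch, assuming the import path pinned in D0.6 and the
`duckdb` driver name that this package's go-duckdb lineage registers:

```go
// cmd/smoke/main.go — illustrative D0.6 cgo smoke test, not committed code.
package main

import (
	"database/sql"
	"log"

	_ "github.com/duckdb/duckdb-go/v2" // cgo: forces the native link step
)

func main() {
	db, err := sql.Open("duckdb", "") // empty DSN = in-memory database
	if err != nil {
		log.Fatalf("open: %v", err)
	}
	defer db.Close()

	var one int
	if err := db.QueryRow("SELECT 1").Scan(&one); err != nil {
		log.Fatalf("query: %v", err)
	}
	log.Printf("duckdb cgo path OK (SELECT 1 = %d)", one)
}
```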

---

## Day 1 — Skeleton + chi + /health × 5 binaries

**Goal:** five binaries build, each binds to its port, and `/health`
returns `{"status":"ok","service":"<name>"}`.

| # | File | What |
|---|---|---|
| 1.1 | `internal/shared/server.go` | chi router factory, slog setup, graceful shutdown via `signal.NotifyContext` (sketched below) |
| 1.2 | `internal/shared/config.go` | TOML loader using `pelletier/go-toml/v2`, default + override pattern |
| 1.3 | `cmd/gateway/main.go` | port 3100, `/health` |
| 1.4 | `cmd/storaged/main.go` | port 3201, `/health` |
| 1.5 | `cmd/catalogd/main.go` | port 3202, `/health` |
| 1.6 | `cmd/ingestd/main.go` | port 3203, `/health` |
| 1.7 | `cmd/queryd/main.go` | port 3204, `/health` |
| 1.8 | `lakehouse.toml` | bind addresses, log level — sample committed |
| 1.9 | `Makefile` | `build`, `run-gateway`, etc. — convenience targets |
| 1.10 | `cmd/gateway/main.go` adds STUB routes `POST /v1/ingest` and `POST /v1/sql` returning `501 Not Implemented` with the header `X-Lakehouse-Stub: g0`. Real reverse-proxy wiring lands on Day 6, but the routes exist from D1 so D6 is just a behavior change, not new endpoints. | `curl -X POST :3100/v1/ingest` returns 501 with the stub header |

**Acceptance D1:** `go mod tidy` populates `go.sum` cleanly; `go build
./cmd/...` exits 0; with each binary running in a separate terminal,
`curl :3100/health` through `:3204/health` all return `200 OK` with
the expected JSON; the gateway's stub `/v1/*` routes return 501.

**Dependencies pulled:** `go-chi/chi/v5`, `pelletier/go-toml/v2`.
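
A minimal sketch of what 1.1 and 1.10 could look like, assuming
`go-chi/chi/v5`; the helper names (`NewRouter`, `Run`,
`StubNotImplemented`) are illustrative, not committed API:

```go
// internal/shared/server.go — illustrative sketch of the D1.1 factory.
package shared

import (
	"context"
	"log/slog"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/go-chi/chi/v5"
)

// NewRouter returns a chi router with the common /health endpoint.
func NewRouter(service string) *chi.Mux {
	r := chi.NewRouter()
	r.Get("/health", func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"status":"ok","service":"` + service + `"}`))
	})
	return r
}

// StubNotImplemented is the D1.10 placeholder for /v1/ingest and /v1/sql.
func StubNotImplemented(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("X-Lakehouse-Stub", "g0")
	w.WriteHeader(http.StatusNotImplemented)
}

// Run serves until SIGINT/SIGTERM, then shuts down gracefully.
func Run(addr string, h http.Handler) error {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	ctx, stop := signal.NotifyContext(context.Background(),
		os.Interrupt, syscall.SIGTERM)
	defer stop()

	srv := &http.Server{Addr: addr, Handler: h}
	errCh := make(chan error, 1)
	go func() { errCh <- srv.ListenAndServe() }()
	logger.Info("listening", "addr", addr)

	select {
	case err := <-errCh:
		return err // bind failure, etc.
	case <-ctx.Done():
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		return srv.Shutdown(shutdownCtx)
	}
}
```

Each `cmd/*/main.go` then reduces to building its routes on
`NewRouter("<service>")` and calling `Run`; the gateway additionally
mounts `StubNotImplemented` on `/v1/ingest` and `/v1/sql` until D6.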

---

## Day 2 — storaged: S3 GET/PUT/LIST

**Goal:** put a file, get it back, list it.

| # | File | What |
|---|---|---|
| 2.1 | `internal/storaged/bucket.go` | `aws-sdk-go-v2/service/s3` wrapper — `Get`, `Put`, `List`, `Delete` |
| 2.2 | `internal/storaged/registry.go` | `BucketRegistry` skeleton (per Rust ADR-017) — single bucket only in G0; multi-bucket lands in G2 |
| 2.3 | `internal/secrets/provider.go` | `SecretsProvider` interface + `FileSecretsProvider` reading `/etc/lakehouse/secrets.toml` |
| 2.4 | `cmd/storaged/main.go` | wire routes — `GET /storage/get/{key}`, `PUT /storage/put/{key}`, `GET /storage/list?prefix=...`. Bind to `127.0.0.1:3201` only (G0 is dev-only, no auth). Apply `http.MaxBytesReader` with a 2 GiB cap on PUT to bound memory and reject runaway uploads (sketched below) |

**Acceptance D2:** `curl -T sample.csv 127.0.0.1:3201/storage/put/test/sample.csv`
returns 200; `curl 127.0.0.1:3201/storage/get/test/sample.csv` echoes
the file bytes; `curl 127.0.0.1:3201/storage/list?prefix=test/` lists
`sample.csv`; a PUT exceeding 2 GiB returns `413 Payload Too Large`.
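
A sketch of the 2.4 body cap with illustrative names. One note the
acceptance implies: keys like `test/sample.csv` contain slashes, which
a chi `{key}` segment won't span, so the sketch matches a wildcard.
`http.MaxBytesReader` surfaces the overflow as `*http.MaxBytesError`,
which is what maps to the 413 above:

```go
// internal/storaged-style sketch; the injected put callback stands in
// for the bucket.go wrapper from 2.1.
package storaged

import (
	"errors"
	"io"
	"net/http"

	"github.com/go-chi/chi/v5"
)

const maxPutBytes = 2 << 30 // 2 GiB, per D2.4

// PutHandler bounds the request body before streaming it to S3.
func PutHandler(put func(key string, body io.Reader) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		key := chi.URLParam(r, "*") // wildcard route so keys may contain slashes
		r.Body = http.MaxBytesReader(w, r.Body, maxPutBytes)
		if err := put(key, r.Body); err != nil {
			var tooBig *http.MaxBytesError
			if errors.As(err, &tooBig) { // reader tripped the 2 GiB cap
				http.Error(w, "payload too large", http.StatusRequestEntityTooLarge)
				return
			}
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```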

**Open question:** the error journal (Rust ADR-018 append-log pattern) —
defer to G2 with multi-bucket federation, or wire it now? The plan says
defer; revisit if errors surface during D3-D5.

---

## Day 3 — catalogd: Parquet manifests

**Goal:** register a dataset, persist it to storaged, restart, and the
manifest is still visible.

| # | File | What |
|---|---|---|
| 3.1 | `internal/catalogd/manifest.go` | Parquet read/write using `arrow-go/v18/parquet/pqarrow`. Schema: `dataset_id`, `name`, `schema_fingerprint`, `objects`, `created_at`, `updated_at`, `row_count` |
| 3.2 | `internal/catalogd/registry.go` | in-memory index (`map[name]Manifest`), rehydrated on startup from `primary://_catalog/manifests/*.parquet` |
| 3.3 | `cmd/catalogd/main.go` | wire routes — `POST /catalog/register` (idempotent by name + fingerprint per Rust ADR-020, sketched below), `GET /catalog/manifest/{name}`, `GET /catalog/list` |
| 3.4 | `internal/catalogd/store_client.go` | thin HTTP client to `cmd/storaged` — round-trips manifest Parquets |

**Acceptance D3:** register a dataset, see it in `/catalog/list`,
restart catalogd, and `/catalog/list` still shows it. Re-registering
the same name + same fingerprint → 200 with the same `dataset_id`;
a different fingerprint → 409 Conflict.
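
The idempotency rule in 3.3 reduces to a compare-on-fingerprint under
the registry lock (the same serialization the §Risks table leans on).
A sketch with illustrative types; the plan doesn't pin the status for a
first-time register, so 200 is shown:

```go
// internal/catalogd-style sketch of the ADR-020 idempotent register.
package catalogd

import (
	"net/http"
	"sync"
)

// Manifest mirrors the D3.1 schema; only the fields the idempotency
// check needs are shown.
type Manifest struct {
	DatasetID         string
	Name              string
	SchemaFingerprint string
}

// Registry is the D3.2 in-memory index. The mutex also serializes
// registers, per the §Risks mitigation.
type Registry struct {
	mu     sync.Mutex
	byName map[string]Manifest
}

// Register returns the manifest to serve and the HTTP status for the
// route: 200 for a no-op re-register, 409 on fingerprint drift.
func (reg *Registry) Register(m Manifest) (Manifest, int) {
	reg.mu.Lock()
	defer reg.mu.Unlock()
	if reg.byName == nil {
		reg.byName = map[string]Manifest{}
	}
	if existing, ok := reg.byName[m.Name]; ok {
		if existing.SchemaFingerprint == m.SchemaFingerprint {
			return existing, http.StatusOK // idempotent: same dataset_id back
		}
		return existing, http.StatusConflict // same name, different schema
	}
	reg.byName[m.Name] = m
	return m, http.StatusOK // first registration (status not pinned by the plan)
}
```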

---

## Day 4 — ingestd: CSV → Parquet → catalog

**Goal:** `POST /ingest` with a CSV file produces a Parquet in
storaged plus a manifest in catalogd.

| # | File | What |
|---|---|---|
| 4.1 | `internal/ingestd/schema.go` | infer the Arrow schema from the CSV header + first-N-row sampling; ADR-010 default-to-string on ambiguity (sketched below) |
| 4.2 | `internal/ingestd/csv.go` | stream CSV → `array.RecordBatch` → Parquet (arrow-go pqarrow writer) |
| 4.3 | `cmd/ingestd/main.go` | route `POST /ingest` — multipart form file → schema inference → write Parquet → call catalogd to register |

**Acceptance D4:** `curl -F file=@workers_500k.csv :3203/ingest?name=workers_500k`
returns 200 with the registered manifest; `aws s3 ls` (or `mc ls`)
shows the Parquet under `primary://datasets/workers_500k/`;
`curl :3202/catalog/manifest/workers_500k` returns the manifest with
`row_count=500000`.
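
ADR-010's default-to-string rule from 4.1, sketched with stdlib parsing
plus arrow-go field types. The `arrow-go/v18` import path follows the
plan's pin; `inferColumn` and `InferSchema` are illustrative names:

```go
// internal/ingestd-style sketch: promote int64 → float64 → string,
// defaulting to string on any ambiguity.
package ingestd

import (
	"strconv"

	"github.com/apache/arrow-go/v18/arrow"
)

// inferColumn returns the narrowest Arrow type every sampled value fits.
func inferColumn(samples []string) arrow.DataType {
	isInt, isFloat := true, true
	for _, s := range samples {
		if s == "" {
			continue // empty cells don't vote
		}
		if _, err := strconv.ParseInt(s, 10, 64); err != nil {
			isInt = false
		}
		if _, err := strconv.ParseFloat(s, 64); err != nil {
			isFloat = false
		}
	}
	switch {
	case isInt:
		return arrow.PrimitiveTypes.Int64
	case isFloat:
		return arrow.PrimitiveTypes.Float64
	default:
		return arrow.BinaryTypes.String // ADR-010: default on ambiguity
	}
}

// InferSchema builds a schema from the header plus per-column samples
// (first N rows, transposed into columns by the caller).
func InferSchema(header []string, columns [][]string) *arrow.Schema {
	fields := make([]arrow.Field, len(header))
	for i, name := range header {
		fields[i] = arrow.Field{Name: name, Type: inferColumn(columns[i]), Nullable: true}
	}
	return arrow.NewSchema(fields, nil)
}
```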

---

## Day 5 — queryd: DuckDB SELECT

**Goal:** SQL queries over Parquet datasets.

| # | File | What |
|---|---|---|
| 5.1 | `internal/queryd/db.go` | `database/sql` connection to `github.com/duckdb/duckdb-go/v2` (cgo). Ensures the DuckDB Parquet + httpfs extensions are loaded; on connection open, executes `CREATE SECRET` (TYPE S3) populated from `internal/secrets/provider.go` so `read_parquet('s3://...')` against MinIO authenticates per session (sketched below) |
| 5.2 | `internal/queryd/registrar.go` | reads catalogd `/catalog/list` and registers each dataset as a DuckDB view: `CREATE VIEW workers_500k AS SELECT * FROM read_parquet('s3://...')` |
| 5.3 | `cmd/queryd/main.go` | route `POST /sql` (JSON body `{"sql": "..."}`). View refresh strategy: cache views with a TTL (default 30s) and invalidate on `If-None-Match` against catalogd's manifest etag (sketched after the D5 acceptance). **Don't** re-CREATE on every request — the Opus review flagged that as the perf cliff during D6 timing capture |
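
A sketch of the 5.1 bootstrap, assuming the driver registers as
`duckdb` and using DuckDB's documented `CREATE SECRET` options for
MinIO (`ENDPOINT`, `URL_STYLE 'path'`, `USE_SSL false`); the `S3Creds`
shape stands in for whatever the D2.3 `SecretsProvider` returns:

```go
// internal/queryd/db.go-style sketch, not committed code.
package queryd

import (
	"database/sql"
	"fmt"

	_ "github.com/duckdb/duckdb-go/v2" // registers the "duckdb" driver (assumed)
)

// S3Creds stands in for whatever the SecretsProvider returns.
type S3Creds struct{ KeyID, Secret, Endpoint string }

// OpenDB opens an in-memory DuckDB and provisions S3 auth per session.
func OpenDB(creds S3Creds) (*sql.DB, error) {
	db, err := sql.Open("duckdb", "")
	if err != nil {
		return nil, err
	}
	// httpfs carries the s3:// filesystem; Parquet support is typically
	// built in, but an explicit LOAD would be harmless. Secret values
	// are interpolated into the DDL here (dev-only, per the D2 posture).
	bootstrap := []string{
		"INSTALL httpfs",
		"LOAD httpfs",
		fmt.Sprintf(
			"CREATE SECRET minio (TYPE S3, KEY_ID '%s', SECRET '%s', ENDPOINT '%s', URL_STYLE 'path', USE_SSL false)",
			creds.KeyID, creds.Secret, creds.Endpoint),
	}
	for _, stmt := range bootstrap {
		if _, err := db.Exec(stmt); err != nil {
			db.Close()
			return nil, fmt.Errorf("duckdb bootstrap: %w", err)
		}
	}
	return db, nil
}
```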

**Acceptance D5:** after Day 4 ingestion,
`curl -X POST -d '{"sql":"SELECT count(*) FROM workers_500k"}' :3204/sql`
returns `[{"count_star()":500000}]`, and `SELECT role, count(*) FROM
workers_500k WHERE state='IL' GROUP BY role` returns the expected rows.
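
The 5.3 refresh policy as a sketch: a TTL window that short-circuits
entirely, and an etag revalidation that only rebuilds views when
catalogd's manifest actually changed. Names are illustrative:

```go
// internal/queryd/registrar.go-style sketch of the D5.3 view cache.
package queryd

import (
	"sync"
	"time"
)

type viewCache struct {
	mu        sync.Mutex
	ttl       time.Duration // default 30s per the plan
	etag      string        // last catalogd manifest etag seen
	refreshed time.Time
}

// ensureFresh re-runs CREATE OR REPLACE VIEW only when the TTL expired
// AND catalogd reports a new etag — never once per /sql request.
func (c *viewCache) ensureFresh(fetchEtag func() (string, error), rebuild func() error) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.refreshed) < c.ttl {
		return nil // inside the TTL: trust the cached views
	}
	etag, err := fetchEtag() // cheap conditional GET (If-None-Match)
	if err != nil {
		return err
	}
	if etag == c.etag {
		c.refreshed = time.Now() // unchanged: just extend the window
		return nil
	}
	if err := rebuild(); err != nil { // CREATE OR REPLACE VIEW per dataset
		return err
	}
	c.etag, c.refreshed = etag, time.Now()
	return nil
}
```

On a `/sql` request the handler calls `ensureFresh` first, then runs the
user query; inside the 30s window that costs a mutex grab and a clock
check, not a catalog round trip.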

---

## Day 6 — Gate day: end-to-end via gateway

**Goal:** the closing G0 acceptance gate passes.

| # | What |
|---|---|
| 6.1 | Promote the `cmd/gateway/main.go` `/v1/ingest` + `/v1/sql` D1.10 stubs (501) to real reverse proxies via `httputil.NewSingleHostReverseProxy` to ingestd / queryd (sketch below). Multipart forwarding for `/v1/ingest` is the riskiest hop — verify form parts pass through with the file body intact |
| 6.2 | Smoke script `scripts/g0_smoke.sh`: spin up MinIO + the 5 services, ingest, query, assert the row count |
| 6.3 | Run the smoke script against `workers_500k.csv` end-to-end |
| 6.4 | Capture timing — total ingest + query latency, file size, peak memory |

**Closing G0 acceptance:** `scripts/g0_smoke.sh` exits 0. Numbers are
recorded in `docs/G0_BASELINE.md` for future regression comparison.
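
A sketch of the 6.1 promotion. The table names
`NewSingleHostReverseProxy`; the sketch uses the equivalent
`httputil.ReverseProxy` with the Go 1.20+ `Rewrite` hook so the
`/v1/ingest` → `/ingest` path mapping is explicit (the backend paths
are assumptions from the D4/D5 route tables). `httputil` streams the
body through untouched, which is why the multipart hop should still be
verified with a real file:

```go
// cmd/gateway-style sketch: swap the D1.10 stubs for real proxies.
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"

	"github.com/go-chi/chi/v5"
)

// proxyTo returns a handler that forwards to target, rewriting the path.
func proxyTo(target, backendPath string) http.HandlerFunc {
	u, err := url.Parse(target)
	if err != nil {
		panic(err) // static dev config; fail loudly at startup
	}
	p := &httputil.ReverseProxy{
		Rewrite: func(pr *httputil.ProxyRequest) {
			pr.SetURL(u)                  // scheme + host of the backend
			pr.Out.URL.Path = backendPath // e.g. /v1/ingest → /ingest
			pr.SetXForwarded()            // standard forwarding headers
		},
	}
	return p.ServeHTTP
}

func wireProxies(r *chi.Mux) {
	r.Post("/v1/ingest", proxyTo("http://127.0.0.1:3203", "/ingest")) // ingestd
	r.Post("/v1/sql", proxyTo("http://127.0.0.1:3204", "/sql"))       // queryd
}
```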

---

## Day 7 — Cleanup + retro

| # | What |
|---|---|
| 7.1 | Update SPEC §4 G0 with what actually shipped vs what was planned (deviations, surprises) |
| 7.2 | Write `docs/G0_BASELINE.md` — measured perf numbers + comparison hooks for G1+ |
| 7.3 | **Finalize** the ADRs that were stubbed *before* their decisions landed — ADR stubs go in at the start of D4 (arrow-go version pin, schema inference policy) and D5 (DuckDB extension load order, S3 secret provisioning, view-refresh TTL) so reviewers can object in-flight; D7 just commits them after real code has exercised the calls |
| 7.4 | Tag the commit `phase-g0-complete` |
| 7.5 | Open follow-up issues for anything punted (error journal, multi-bucket, profile system, two-phase-write orphan GC, shared-server.go refactor for cgo-handle services) |

---

## Risks tracked across the week

| Risk | Where | Mitigation |
|---|---|---|
| cgo build fails on the dev box | D5 | D0.3 verifies `gcc` is present; if cgo specifically breaks, fall back to running DuckDB as an external process (SPEC §3.1 option B) |
| arrow-go pqarrow schema mismatch with CSV inference | D4 | sample 1k rows for type inference, default to string per ADR-010, and log whenever defaulting |
| DuckDB can't read S3 Parquet directly | D5 | load the `httpfs` extension explicitly; if it fails, copy the Parquet to a local temp file before querying (slow but correct) |
| `/catalog/register` race between the ingestd writer and catalogd reader | D3-D4 | same write-lock-across-storage-write pattern as Rust ADR-020 — serialize registers; throughput is fine at low ingest QPS |
| `workers_500k.csv` schema drifts vs the Rust era | D4 | the plan calls for inferring fresh, not porting the Rust schema; if staffer-domain features break in G3+, revisit |

---

## Out of scope for G0 (deferred to later phases)

- Vector indexing — Phase G1
- Multi-bucket / federation — Phase G2
- Profile system — Phase G2
- Hot-swap atomicity — Phase G2
- Pathway memory — Phase G3
- Distillation pipeline — Phase G3
- MCP server / observer / auditor — Phase G4
- HTMX UI — Phase G5
- TLS, auth — explicit non-goal until G2 (single-bucket, no-auth dev)

---

## Open questions before Day 1

1. **MinIO instance** — reuse the existing one at `localhost:9000`
   that lakehouse uses (shared dev box), or stand up a fresh one with
   a separate bucket prefix?
2. **`/etc/lakehouse/secrets.toml`** — share the lakehouse repo's
   secrets file, or create `/etc/golangLAKEHOUSE/secrets.toml`?
3. **Workers CSV source** — derive from `workers_500k.parquet`
   (round-trip back to CSV), or use `workers_500k_v9.csv` if it exists?

These are ops calls, not architecture. Answer them when D0 is being
executed.

---

## Self-review — independent pass via gateway overseer

Reviewer: `opencode/claude-opus-4-7` via `localhost:3100/v1/chat`
(the same path the production overseer correction loop uses post-G0
in the Rust era). Run on the original draft, before any of the inline
fixes above were applied. Findings are dispositioned below.

### BLOCK — both real, both fixed inline

| # | Finding | Disposition | Fix location |
|---|---|---|---|
| B1 | `apt install build-essential` alone won't satisfy the cgo link step for `duckdb-go/v2` | **Fixed** — D0.6 now runs a smoke `go install` against an empty module to flush platform issues on D0, not D5 | D0.6 |
| B2 | The DuckDB session needs S3 credentials (`CREATE SECRET`) plumbed from the SecretsProvider; "load httpfs" alone leaves auth unwired | **Fixed** — D5.1 now calls `CREATE SECRET (TYPE S3, ...)` on connection open, populated from `internal/secrets/provider.go` | D5.1 |

### WARN — 4 of 5 fixed inline; 1 deferred

| # | Finding | Disposition | Fix location |
|---|---|---|---|
| W1 | The two-phase write (storaged → catalogd register) leaves orphan Parquets on partial failure; no GC story | **Deferred** — punted to G2 alongside multi-bucket + the error journal; tracked in §Risks and the D7.5 follow-up | D7.5 |
| W2 | "Refresh views on each `/sql` call" will be the D6 perf cliff | **Fixed** — D5.3 now uses TTL-cached views with etag invalidation against catalogd | D5.3 |
| W3 | A shared `internal/shared/server.go` factory across heterogeneous binaries (HTTP ingress vs cgo-DB-holder) couples graceful-shutdown semantics that will need unwinding later | **Accepted with note** — G0 keeps the simple shared factory; the refactor is explicitly listed as a G1+ follow-up | D7.5 |
| W4 | storaged PUT/GET on a TCP port with no auth and no body cap is a footgun | **Fixed** — D2.4 now binds `127.0.0.1` only and applies a 2 GiB `MaxBytesReader` cap | D2.4 |
| W5 | A gateway reverse proxy introduced cold on D6 gate day compresses risk into the deadline | **Fixed** — D1.10 now stubs the routes returning 501; D6.1 just promotes them to real proxies | D1.10 + D6.1 |

### INFO — both fixed inline

| # | Finding | Disposition | Fix location |
|---|---|---|---|
| I1 | `go mod tidy` before any imports exist is a trivially-true verification | **Fixed** — D0.6 was repurposed for the cgo smoke; tidy verification moved to the D1 acceptance | D0.6 + D1 acceptance |
| I2 | Filing ADRs *after* the work is done inverts the usual pattern | **Fixed** — D7.3 reframed: ADR stubs go in at the start of D4/D5 so reviewers can object in-flight; D7.3 just finalizes them | D7.3 |

### Net change

7 of 9 findings produced inline plan edits; 2 were deferred to post-G0
follow-up issues (W1 orphan GC, W3 shared-server refactor) with the
deferral itself documented. No findings were dismissed as confabulation.