Phase G0 Day 5 ships queryd: in-memory DuckDB with custom Connector
that runs INSTALL httpfs / LOAD httpfs / CREATE OR REPLACE SECRET
(TYPE S3) on every new connection, sourced from SecretsProvider +
shared.S3Config. SetMaxOpenConns(1) so registrar's CREATE VIEWs and
handler's SELECTs serialize through one connection (avoids cross-
connection MVCC visibility edge cases).
Registrar.Refresh reads catalogd /catalog/list, runs CREATE OR
REPLACE VIEW "name" AS SELECT * FROM read_parquet('s3://bucket/key')
per manifest, drops views for removed manifests, skips on unchanged
updated_at (the implicit etag). Drop pass runs BEFORE create pass so
a poison manifest can't block other manifest refreshes (post-scrum
C1 fix).
POST /sql with JSON body {"sql":"…"} returns
{"columns":[{"name":"id","type":"BIGINT"},…], "rows":[[…]],
"row_count":N}. []byte → string conversion so VARCHAR rows
JSON-encode as text. 30s default refresh ticker, configurable via
[queryd].refresh_every.
Cross-lineage scrum on shipped code:
- Opus 4.7 (opencode): 1 BLOCK + 4 WARN + 4 INFO
- Kimi K2-0905 (openrouter): 2 BLOCK + 2 WARN + 1 INFO
- Qwen3-coder (openrouter): 2 BLOCK + 1 WARN + 1 INFO
Fixed (4):
C1 (Opus + Kimi convergent): Refresh aborts on first per-view error
→ drop pass first, collect errors, errors.Join. Poison manifest
no longer blocks the rest of the catalog from re-syncing.
B-CTX (Opus BLOCK): bootstrap closure captured OpenDB's ctx →
cancelled-ctx silently fails every reconnect. context.Background()
inside closure; passed ctx only for initial Ping.
B-LEAK (Kimi BLOCK): firstLine(stmt) truncated CREATE SECRET to 80
chars but those 80 chars contained KEY_ID + SECRET prefix → log
aggregator captures credentials. Stable per-statement labels +
redactCreds() filter on wrapped DuckDB errors.
JSON-ERR (Opus WARN): swallowed json.Encode error → silent
truncated 200 on unsupported column types. slog.Warn the failure.
Dismissed (4 false positives):
Qwen BLOCK "bootstrap not transactional" — DuckDB DDL is auto-commit
Qwen BLOCK "MaxBytesReader after Decode" — false, applied before
Kimi BLOCK "concurrent Refresh + user SELECT deadlock" — not a
deadlock, just serialization, by design with 10s timeout retry
Kimi WARN "dropView leaves r.known inconsistent" — current code
returns before the delete; the entry persists for retry
Critical reviewer behavior: 1 convergent BLOCK between Opus + Kimi
on the per-view error blocking, plus two independent single-reviewer
BLOCKs (B-CTX, B-LEAK) that smoke could never have caught. The
B-LEAK fix uses defense-in-depth: never pass SQL into the error
path AND redact known cred values from DuckDB's own error message.
DuckDB cgo path: github.com/duckdb/duckdb-go/v2 v2.10502.0 (per
ADR-001 §1) on Go 1.25 + arrow-go. Smoke 6/6 PASS after every
fix round.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase G0 Kickoff Plan
Goal: smallest end-to-end ingest+query path in Go. Upload a CSV
via POST /ingest, query it via POST /sql, get rows back. No
vector, no profile, no UI yet.
Estimated duration: 1 engineer-week (5 working days + gate day + cleanup). Plan is calibrated for solo work; cut by ~40% with two engineers in parallel on storaged/catalogd vs ingestd/queryd.
Cutoff for G0: the closing acceptance gate (Day 6) passes
end-to-end against workers_500k.csv as the test fixture.
Day 0 — One-time setup
Done by an operator with sudo on the dev box. ~15 minutes.
| # | Step | Verify |
|---|---|---|
| 0.1 | Install Go 1.23+: curl -L https://go.dev/dl/go1.23.linux-amd64.tar.gz \| sudo tar -C /usr/local -xz | go version shows 1.23+ |
| 0.2 | Add /usr/local/go/bin to PATH (in ~/.bashrc) | new shell sees go |
| 0.3 | Install cgo toolchain: apt install build-essential | gcc --version works |
| 0.4 | Clone repo: git clone https://git.agentview.dev/profit/golangLAKEHOUSE.git | cd golangLAKEHOUSE && go version from inside |
| 0.5 | Bring up MinIO locally (or point at existing) | mc ls local/ lists buckets, or whatever the dev S3 is |
| 0.6 | Verify the DuckDB cgo path with a real compile-and-run smoke (NOT go install pkg@latest — that requires a main package and would fail on the duckdb library before exercising cgo). Steps: mkdir /tmp/duckdb-smoke && cd /tmp/duckdb-smoke && go mod init smoke && go get github.com/duckdb/duckdb-go/v2 && cat > main.go <<<'package main; import (_ "github.com/duckdb/duckdb-go/v2"; "database/sql"); func main(){db,_:=sql.Open("duckdb","");db.Ping();db.Close()}' && go run main.go — proves the cgo linker chain + static-linked duckdb-go-bindings work on this platform | exits 0, no link errors |
Day 0 acceptance: go version shows 1.23+, gcc --version works,
MinIO reachable on localhost:9000, the cgo compile-and-run smoke
above succeeded. (go mod tidy is intentionally NOT run here — no
imports yet; verification moves to D1.)
Day 1 — Skeleton + chi + /health × 5 binaries
Goal: five binaries build, each binds to its port, /health
returns {"status":"ok","service":"<name>"}.
| # | File | What |
|---|---|---|
| 1.1 | internal/shared/server.go | chi router factory, slog setup, graceful shutdown via signal.NotifyContext |
| 1.2 | internal/shared/config.go | TOML loader using pelletier/go-toml/v2, default + override pattern |
| 1.3 | cmd/gateway/main.go | port 3110, /health |
| 1.4 | cmd/storaged/main.go | port 3211, /health |
| 1.5 | cmd/catalogd/main.go | port 3212, /health |
| 1.6 | cmd/ingestd/main.go | port 3213, /health |
| 1.7 | cmd/queryd/main.go | port 3214, /health |
| 1.8 | lakehouse.toml | bind addresses, log level — sample committed |
| 1.9 | Makefile | build, run-gateway, etc. — convenience |
| 1.10 | cmd/gateway/main.go adds STUB routes POST /v1/ingest and POST /v1/sql returning 501 Not Implemented with a header X-Lakehouse-Stub: g0. Real reverse-proxy wiring lands on Day 6, but the routes exist from D1 so D6 is just a behavior change, not new endpoints. | curl -X POST :3110/v1/ingest returns 501 with the stub header |
Port shift note: the original SPEC said 3100/3201–3204; the D1 runtime
caught a collision — the live Rust lakehouse owns 3100. Go dev ports
shifted to 3110+ (3110, 3211–3214) so both systems can run concurrently
during the migration. The G5 cutover flips the gateway back to 3100 when
Rust retires.
Acceptance D1: go mod tidy populates go.sum cleanly; go build ./cmd/... exits 0; running each binary in a separate terminal,
curl /health on each of the shifted ports (:3110, :3211–:3214) returns 200 OK with
the expected JSON; gateway's stub /v1/* routes return 501.
Dependencies pulled: go-chi/chi/v5, pelletier/go-toml/v2.
Day 2 — storaged: S3 GET/PUT/LIST
Goal: put a file, get it back, list it.
| # | File | What |
|---|---|---|
| 2.1 | internal/storaged/bucket.go | aws-sdk-go-v2/service/s3 wrapper — Get, Put, List, Delete |
| 2.2 | internal/storaged/registry.go | BucketRegistry skeleton (per Rust ADR-017) — single bucket only in G0; multi-bucket lands in G2 |
| 2.3 | internal/secrets/provider.go | SecretsProvider interface + FileSecretsProvider reading /etc/lakehouse/secrets.toml |
| 2.4 | cmd/storaged/main.go | wire routes — GET /storage/get/{key}, PUT /storage/put/{key}, GET /storage/list?prefix=.... Bind to 127.0.0.1:3211 only (G0 is dev-only, no auth). Apply http.MaxBytesReader with a 256 MiB per-request cap (reduced from 2 GiB per Qwen3-coder Q1 — 2 GiB × N concurrent = blow the box) + a buffered semaphore capping concurrent in-flight PUTs at 4 (see the sketch after this table). PUTs exceeding the cap → 413; PUTs blocked on the semaphore → 503 with Retry-After: 5 |
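The D2.4 cap + semaphore combination is small enough to sketch. A minimal Go version, assuming hypothetical names (putHandler, putSem, maxPutBytes) and simplified key extraction — the shipped handler also validates keys and streams through the S3 uploader:

```go
package storaged

import (
	"io"
	"net/http"
)

const maxPutBytes = 256 << 20 // 256 MiB per-request cap

type putHandler struct {
	putSem chan struct{} // buffered semaphore, e.g. make(chan struct{}, 4)
	store  interface{ Put(key string, r io.Reader) error }
}

func (h *putHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// Honest clients are rejected up front from Content-Length.
	if r.ContentLength > maxPutBytes {
		http.Error(w, "body exceeds 256 MiB cap", http.StatusRequestEntityTooLarge)
		return
	}
	// Non-blocking try-acquire: a 5th concurrent PUT gets 503 + Retry-After.
	select {
	case h.putSem <- struct{}{}:
		defer func() { <-h.putSem }()
	default:
		w.Header().Set("Retry-After", "5")
		http.Error(w, "too many concurrent uploads", http.StatusServiceUnavailable)
		return
	}
	// Hard cap for chunked or lying Content-Length bodies.
	r.Body = http.MaxBytesReader(w, r.Body, maxPutBytes)
	if err := h.store.Put(r.URL.Path, r.Body); err != nil { // key extraction simplified
		http.Error(w, "put failed", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write([]byte(`{"status":"ok"}`))
}
```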
Acceptance D2: curl -T sample.csv 127.0.0.1:3211/storage/put/test/sample.csv
returns 200; curl 127.0.0.1:3211/storage/get/test/sample.csv echoes
the file bytes; curl 127.0.0.1:3211/storage/list?prefix=test/ lists
sample.csv; a PUT exceeding the 256 MiB cap returns 413 Payload Too Large.
Open question: error journal (Rust ADR-018 append-log pattern) — defer to G2 with multi-bucket federation, or wire it now? Plan says defer; revisit if errors surface during D3-D5.
Day 3 — catalogd: Parquet manifests
Goal: register a dataset, persist to storaged, restart, manifest still visible.
| # | File | What |
|---|---|---|
| 3.1 | internal/catalogd/manifest.go | Parquet read/write using arrow-go/v18/parquet/pqarrow. Schema: dataset_id, name, schema_fingerprint, objects, created_at, updated_at, row_count |
| 3.2 | internal/catalogd/registry.go | In-memory index (map[name]Manifest), rehydrated on startup from primary://_catalog/manifests/*.parquet |
| 3.3 | cmd/catalogd/main.go | wire routes — POST /catalog/register (idempotent by name + fingerprint per Rust ADR-020), GET /catalog/manifest/{name}, GET /catalog/list |
| 3.4 | internal/catalogd/store_client.go | thin HTTP client to cmd/storaged — round-trips manifest Parquets |
Acceptance D3: register a dataset, see it in /catalog/list,
restart catalogd, /catalog/list still shows it. Re-register same
name + same fingerprint → 200, same dataset_id. Different
fingerprint → 409 Conflict.
Day 4 — ingestd: CSV → Parquet → catalog
Goal: POST /ingest with a CSV file produces a Parquet in
storaged + a manifest in catalogd.
| # | File | What |
|---|---|---|
| 4.1 | internal/ingestd/schema.go | infer Arrow schema from CSV header + first-N-row sampling. ADR-010 default-to-string on ambiguity |
| 4.2 | internal/ingestd/csv.go | stream CSV → array.RecordBatch → Parquet (arrow-go pqarrow writer) |
| 4.3 | cmd/ingestd/main.go | route POST /ingest — multipart form file → schema infer → write Parquet → call catalogd to register |
Acceptance D4: curl -F file=@workers_500k.csv :3213/ingest?name=workers_500k
returns 200 with the registered manifest; aws s3 ls (or mc ls)
shows the Parquet under primary://datasets/workers_500k/;
curl :3212/catalog/manifest/workers_500k returns the manifest with
row_count=500000.
Day 5 — queryd: DuckDB SELECT
Goal: SQL queries over Parquet datasets.
| # | File | What |
|---|---|---|
| 5.1 | internal/queryd/db.go | database/sql connection to github.com/duckdb/duckdb-go/v2 (cgo). Ensures the DuckDB Parquet + httpfs extensions are loaded; on connection open, executes CREATE SECRET (TYPE S3) populated from internal/secrets/provider.go so read_parquet('s3://...') against MinIO authenticates per session |
| 5.2 | internal/queryd/registrar.go | reads catalogd /catalog/list, registers each dataset as a DuckDB view: CREATE VIEW workers_500k AS SELECT * FROM read_parquet('s3://...') |
| 5.3 | cmd/queryd/main.go | route POST /sql (JSON body {"sql": "..."}). View refresh strategy: cache views with a TTL (default 30s) + invalidate on If-None-Match against catalogd's manifest etag. Don't re-CREATE on every request — Opus review flagged that as the perf cliff during D6 timing capture |
Acceptance D5: after Day 4 ingestion,
curl -X POST -d '{"sql":"SELECT count(*) FROM workers_500k"}' :3214/sql
returns [{"count_star()":500000}]. SELECT role, count(*) FROM workers_500k WHERE state='IL' GROUP BY role returns the expected rows.
Day 6 — Gate day: end-to-end via gateway
Goal: the closing G0 acceptance gate passes.
| # | What |
|---|---|
| 6.1 | Promote cmd/gateway/main.go /v1/ingest + /v1/sql from the D1.10 stubs (501) to real reverse proxies. httputil.NewSingleHostReverseProxy preserves the inbound path by default, so the proxy must use a custom Director (or Rewrite) that strips the /v1 prefix before forwarding — otherwise the call lands on ingestd:3213/v1/ingest, which doesn't exist (ingestd serves /ingest, queryd serves /sql); see the sketch after this table. Multipart forwarding for /v1/ingest is the riskiest hop — verify form parts pass through with the file body intact |
| 6.2 | Smoke script scripts/g0_smoke.sh: spin up MinIO + 5 services, ingest, query, assert row count |
| 6.3 | Run smoke against workers_500k.csv end-to-end |
| 6.4 | Capture timing — total ingest + query latency, file size, peak memory |
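A minimal sketch of the D6.1 path rewrite, using the Go 1.20+ Rewrite hook (equivalent to the custom Director the table mentions); the backend URL, package, and function name are assumptions for illustration:

```go
package gateway

import (
	"net/http/httputil"
	"net/url"
	"strings"
)

// newIngestProxy forwards /v1/ingest to ingestd's /ingest; the multipart
// body is streamed through untouched by ReverseProxy.
func newIngestProxy() *httputil.ReverseProxy {
	target, _ := url.Parse("http://127.0.0.1:3213") // ingestd dev port (assumed)
	return &httputil.ReverseProxy{
		Rewrite: func(pr *httputil.ProxyRequest) {
			// Strip the /v1 prefix first, then point at the backend.
			pr.Out.URL.Path = strings.TrimPrefix(pr.Out.URL.Path, "/v1")
			pr.SetURL(target)
			pr.SetXForwarded()
		},
	}
}
```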
Closing G0 acceptance: scripts/g0_smoke.sh exits 0. Numbers
recorded in docs/G0_BASELINE.md for future regression comparison.
Day 7 — Cleanup + retro
| # | What |
|---|---|
| 7.1 | Update SPEC §4 G0 with what actually shipped vs planned (deviations, surprises) |
| 7.2 | Write docs/G0_BASELINE.md — measured perf numbers + comparison hooks for G1+ |
| 7.3 | Finalize ADRs that were stubbed before their decisions landed — ADR stubs go in at the start of D4 (arrow-go version pin, schema inference policy) and D5 (DuckDB extension load order, S3 secret provisioning, view-refresh TTL) so reviewers can object in-flight; D7 just commits them after running real code against the calls |
| 7.4 | Tag the commit phase-g0-complete |
| 7.5 | Open follow-up issues for anything punted (error journal, multi-bucket, profile system, two-phase-write orphan GC, shared-server.go refactor for cgo-handle services) |
Risks tracked across the week
| Risk | Where | Mitigation |
|---|---|---|
| cgo build fails on the dev box | D5 | D0.3 verifies gcc present; if cgo specifically breaks, fall back to DuckDB external process (SPEC §3.1 option B) |
| arrow-go pqarrow schema mismatch with CSV inference | D4 | Sample 1k rows for type inference, default to String per ADR-010, log when defaulting |
| DuckDB can't read S3 Parquet directly | D5 | Load httpfs extension explicitly; if it fails, copy Parquet to a local temp file before query (slow but correct) |
| /catalog/register race between ingestd writer and catalogd reader | D3-D4 | Same write-lock-across-storage-write pattern as Rust ADR-020 — serialize registers; throughput is OK at low ingest QPS |
| workers_500k.csv schema drifts vs Rust era | D4 | Plan calls for inferring fresh, not porting the Rust schema. If staffer-domain features break in G3+, revisit |
Out of scope for G0 (deferred to later phases)
- Vector indexing — Phase G1
- Multi-bucket / federation — Phase G2
- Profile system — Phase G2
- Hot-swap atomicity — Phase G2
- Pathway memory — Phase G3
- Distillation pipeline — Phase G3
- MCP server / observer / auditor — Phase G4
- HTMX UI — Phase G5
- TLS, auth — explicit non-goal until G2 (single-bucket no-auth dev)
Open questions before Day 1
- MinIO instance — reuse the existing one at localhost:9000 that lakehouse uses (shared dev box) or stand up a fresh one with a separate bucket prefix?
- /etc/lakehouse/secrets.toml — share the lakehouse repo's secrets file or create /etc/golangLAKEHOUSE/secrets.toml?
- Workers CSV source — derive from workers_500k.parquet (round-trip back to CSV) or use workers_500k_v9.csv if it exists?
These are ops calls, not architecture. Answer when D0 is being executed.
Self-review — independent pass via gateway overseer
Reviewer: opencode/claude-opus-4-7 via localhost:3100/v1/chat
(the same path the production overseer correction loop uses post-G0
in the Rust era). Run on the original draft before any of the inline
fixes above were applied. Findings dispositioned below.
BLOCK — both real, both fixed inline
| # | Finding | Disposition | Fix location |
|---|---|---|---|
| B1 | apt install build-essential alone won't satisfy the cgo link step for duckdb-go/v2 | Fixed — D0.6 now runs a smoke go install against an empty module on D0 to flush platform issues here, not on D5 | D0.6 |
| B2 | DuckDB session needs S3 credentials (CREATE SECRET) plumbed from SecretsProvider; "load httpfs" alone leaves auth unwired | Fixed — D5.1 now calls CREATE SECRET (TYPE S3, ...) on connection open, populated from internal/secrets/provider.go | D5.1 |
WARN — 4 of 5 fixed inline; 1 deferred
| # | Finding | Disposition | Fix location |
|---|---|---|---|
| W1 | Two-phase write (storaged → catalogd register) leaves orphan Parquets on partial failure; no GC story | Deferred — punted to G2 alongside multi-bucket + error journal; tracked in §Risks and D7.5 follow-up | D7.5 |
| W2 | "Refresh views on each /sql call" will be the D6 perf cliff |
Fixed — D5.3 now uses TTL-cached views with etag invalidation against catalogd | D5.3 |
| W3 | Shared internal/shared/server.go factory across heterogeneous binaries (HTTP ingress vs cgo-DB-holder) couples graceful-shutdown semantics that will need unwinding later |
Accepted with note — G0 keeps the simple shared factory; refactor explicitly listed as a G1+ follow-up | D7.5 |
| W4 | storaged PUT/GET on a TCP port with no auth + no body cap is a footgun | Fixed — D2.4 now binds 127.0.0.1 only and applies a 2 GiB MaxBytesReader cap |
D2.4 |
| W5 | Gateway reverse-proxy introduced cold on D6 gate day compresses risk into the deadline | Fixed — D1.10 now stubs the routes returning 501; D6.1 just promotes them to real proxies | D1.10 + D6.1 |
INFO — both fixed inline
| # | Finding | Disposition | Fix location |
|---|---|---|---|
| I1 | go mod tidy before any imports is a trivial-true verification | Fixed — D0.6 re-purposed for the cgo smoke; tidy verification moved to D1 acceptance | D0.6 + D1 acceptance |
| I2 | Filing ADRs after the work is done inverts the usual pattern | Fixed — D7.3 reframed: ADR stubs go in at the start of D4/D5 so reviewers can object in-flight; D7.3 just finalizes them | D7.3 |
Net change (Opus pass)
7 of 9 findings produced inline plan edits; 2 deferred to post-G0 follow-up issues (W1 orphan GC, W3 shared-server refactor) with the deferral itself documented. No findings dismissed as confabulation.
Self-review — second pass via Kimi K2.6 (cross-lineage)
Reviewer: opencode/kimi-k2.6 via the same gateway path. Run on the
post-Opus-fix doc to surface what a different model lineage catches.
Per the Rust-era cross-lineage rotation pattern, Kimi tends to ground
on textual specifics where Opus surfaces architectural shape — both
lenses are useful.
Kimi's output was discursive (pre-format thinking rather than the requested BLOCK/WARN/INFO bullets), but two concrete catches landed — both BLOCKs that Opus missed and that the Opus-pass fixes themselves introduced.
BLOCK — both real, both fixed inline
| # | Finding | Disposition | Fix location |
|---|---|---|---|
| K1 | Opus's BLOCK B1 fix used go install github.com/duckdb/duckdb-go/v2@latest to verify cgo, but go install pkg@version requires a main package — duckdb-go/v2 is a library, so the command fails with "not a main package" before the cgo linker chain is exercised. The verification can pass on a broken-cgo box. | Fixed — D0.6 now creates a temporary module + a 5-line main.go that imports duckdb-go/v2 and calls sql.Open("duckdb",""), then go runs it — actually compiles + executes with cgo | D0.6 |
| K2 | Path mismatch: D1.10 stubs /v1/ingest and /v1/sql on the gateway, but D4 has ingestd serving /ingest and D5 has queryd serving /sql. D6.1's httputil.NewSingleHostReverseProxy preserves the inbound path by default, so the proxy would forward to ingestd:3203/v1/ingest which doesn't exist. Smoke would fail on D6 with a 404 on the backend, not the gateway. | Fixed — D6.1 now specifies a custom Director that strips the /v1 prefix before forwarding | D6.1 |
WARN — none extracted (Kimi response truncated at this section)
The remaining response stream considered D2.4 MaxBytesReader semantics
and D3 manifest registry concurrency, but cut off before producing
structured findings. A third pass could be run if more lineage
coverage is wanted — for now, two BLOCKs fixed is the cross-lineage
delta.
Net change (Kimi pass)
2 BLOCKs landed, both fixed inline. Both were introduced by Opus-pass fixes — illustrating exactly why cross-lineage rotation matters: one model's review of the original is not the same as a different model's review of the post-fix version. The Rust auditor's Kimi/Haiku/Opus rotation captures this dynamic; today's two-pass doc review reproduces it on a much smaller scale.
Self-review — third pass via Qwen3-coder:480b (cross-lineage)
Reviewer: qwen3-coder:480b via ollama_cloud (Ollama Pro). Largest
single-model open-weights coder in the fleet. Run on the post-Kimi-fix
doc.
Output style note
Qwen3-coder's response was 80% approval of prior fixes (citing B1/B2/K1/K2/I2 as "NEW BLOCKs/INFOs" — actually a misuse of the NEW label, but useful as triangulation: three independent lineages now agree the prior fixes are correct). Only 2 genuinely new WARNs.
WARN — 1 fixed inline, 1 deferred
| # | Finding | Disposition | Fix location |
|---|---|---|---|
| Q1 | D2.4 MaxBytesReader at 2 GiB still permits memory exhaustion under concurrent uploads — N concurrent × 2 GiB blows the box | Fixed — D2.4 cap reduced to 256 MiB per request, plus a 4-slot semaphore on concurrent in-flight PUTs (503 + Retry-After when full) | D2.4 |
| Q2 | D5.3 view refresh has TTL + etag invalidation but no batching, so bursty queries against many datasets can re-issue catalog reads redundantly | Deferred — minor under G0's single-tenant load; revisit in G2 alongside the multi-bucket / profile work that creates more catalog churn | D7.5 |
Net change (Qwen pass)
1 fixed inline, 1 deferred. The mislabeled "approval" output was net positive — it's the closest thing to a triangulation signal we get without a fourth model. With Opus + Kimi + Qwen all confirming the prior fixes hold, the plan is unusually well-cross-checked for a greenfield kickoff.
Cumulative disposition across all 3 lineages
| Pass | Reviewer | New findings | Fixed | Deferred |
|---|---|---|---|---|
| 1 | opencode/claude-opus-4-7 | 9 | 7 | 2 |
| 2 | opencode/kimi-k2.6 | 2 | 2 | 0 |
| 3 | qwen3-coder:480b (ollama_cloud) | 2 | 1 | 1 |
| 4 | runtime smoke (D1 actual launch) | 1 | 1 | 0 |
| Total | — | 14 | 11 | 3 |
The 4th pass — runtime smoke — caught the only thing three doc-review lineages couldn't: port 3100 was already occupied by the live Rust lakehouse. Documented and dispositioned same pattern as the model findings.
Three deferrals, all to G2: orphan GC on two-phase write (W1), shared-server refactor for cgo-handle services (W3), batched view refresh (Q2). Tracked in D7.5.
D1 — actual run results (2026-04-28)
Phase G0 Day 1 executed end-to-end. Output of scripts/d1_smoke.sh:
[d1-smoke] building...
[d1-smoke] launching...
[d1-smoke] /health probes:
✓ gateway (:3110) → {"status":"ok","service":"gateway"}
✓ storaged (:3211) → {"status":"ok","service":"storaged"}
✓ catalogd (:3212) → {"status":"ok","service":"catalogd"}
✓ ingestd (:3213) → {"status":"ok","service":"ingestd"}
✓ queryd (:3214) → {"status":"ok","service":"queryd"}
[d1-smoke] gateway 501 stub probes:
✓ POST /v1/ingest → 501 + X-Lakehouse-Stub: g0
✓ POST /v1/sql → 501 + X-Lakehouse-Stub: g0
[d1-smoke] D1 acceptance gate: PASSED
What landed:
- internal/shared/server.go — chi factory, slog JSON logging, /health, graceful shutdown via signal.NotifyContext (a minimal sketch of the factory follows this list)
- internal/shared/config.go — TOML loader with DefaultConfig() and env-override-via-flag pattern (-config flag)
- cmd/{gateway,storaged,catalogd,ingestd,queryd}/main.go — five binaries, each ~30 lines, all using the shared factory
- lakehouse.toml — G0 dev config with the shifted 3110+ ports
- scripts/d1_smoke.sh — repeatable smoke test, exits 0 on PASS
- bin/{gateway,storaged,catalogd,ingestd,queryd} — built binaries (~9.7 MB each, no debug stripping)
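A minimal sketch of the chi-factory shape in internal/shared/server.go, assuming the hypothetical constructor name NewRouter; the shipped version also wires slog middleware, config, and graceful shutdown (the Run shape is sketched in the D1 scrum section below):

```go
package shared

import (
	"encoding/json"
	"net/http"

	"github.com/go-chi/chi/v5"
)

// NewRouter returns a chi router with the common /health endpoint every
// G0 binary exposes.
func NewRouter(service string) chi.Router {
	r := chi.NewRouter()
	r.Get("/health", func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]string{"status": "ok", "service": service})
	})
	return r
}
```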
What G0 didn't need but I added anyway (intentional):
- Gateway already has the D1.10 stub routes wired; D6.1 is just swap-the-handler.
- TOML config supports an S3 section even though storaged doesn't consume it until D2 — saves a config schema bump on D2.
Next: D2 — storaged's actual S3 GET/PUT/LIST routes against MinIO.
D1 — code scrum review (3-lineage parallel pass)
After D1 shipped, the actual code (not docs) was reviewed by all three lineages running in parallel — the cross-lineage discipline applied to the implementation, not just the plan.
| Pass | Reviewer | Latency | Findings |
|---|---|---|---|
| 1 | opencode/claude-opus-4-7 | 24.9s | 2 BLOCK + 4 WARN + 2 INFO = 8 |
| 2 | opencode/kimi-k2.6 | 18.3s | discursive output; 1 effective WARN extracted |
| 3 | qwen3-coder:480b (ollama_cloud) | 62.0s | 2 BLOCK + 2 WARN + 1 INFO = 5 |
Convergent findings (caught by ≥2 reviewers — high confidence)
| # | Severity | Finding | Reviewers | Disposition |
|---|---|---|---|---|
| C1 | BLOCK | Run() errCh/select race: a fast bind error can be silently dropped | Opus + Qwen | Fixed — net.Listen() is now called synchronously before the goroutine; bind errors return as Run's return value before any select (see the sketch below) |
| C2 | BLOCK | sleep 0.5 in d1_smoke.sh races bind on cold-start boxes | Opus + Qwen | Fixed — replaced with a poll_health() loop, 5s budget per service, 50ms poll interval |
| C3 | WARN | LoadConfig silent fallback when the file is missing | Opus + Qwen | Fixed — emits slog.Warn with path + hint when a path was given but the file is absent |
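The C1 fix is easiest to see in code. A minimal sketch, assuming hypothetical names — the point is that net.Listen runs synchronously, so a bind failure returns before any goroutine or select is involved:

```go
package shared

import (
	"context"
	"net"
	"net/http"
	"time"
)

// Run binds synchronously, then serves until ctx is cancelled.
func Run(ctx context.Context, addr string, h http.Handler) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return err // bind error surfaces before any goroutine or select
	}

	srv := &http.Server{Handler: h}
	errCh := make(chan error, 1)
	go func() { errCh <- srv.Serve(ln) }()

	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		return srv.Shutdown(shutdownCtx)
	}
}
```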
Single-reviewer findings (lineage-specific catches)
| # | Severity | Finding | Reviewer | Disposition |
|---|---|---|---|---|
| S1 | WARN | slog.SetDefault mutates global state from a library function | Kimi | Fixed — Run() no longer calls SetDefault; logger is constructed locally and passed to middleware |
| S2 | WARN | os.IsNotExist → errors.Is(err, fs.ErrNotExist) idiom | Opus | Fixed |
| S3 | WARN | secrets in lakehouse.toml — no .gitignore / env path | Opus | Already planned — D2.3 introduces SecretsProvider; D1 fields are empty strings |
| S4 | INFO | 5x near-identical cmd/*/main.go — could share a Main() helper | Opus | Accepted — defer until D2 surfaces the real per-service config consumption pattern |
| S5 | INFO | /health logs at Info — k8s liveness probes will dominate volume | Opus | Accepted — defer post-G0; G0 isn't on k8s yet |
| S6 | WARN | smoke double-curl per route doubles load + creates state window | Opus | Fixed — single curl -i parsed once for both code + headers |
Second-pass review (post-fix code, Opus only)
After applying the first-round fixes, a second-pass Opus review
caught 1 self-downgraded BLOCK + 4 WARN + 2 INFO. Of those, only one
was a real new actionable: head -1 on curl -i is fragile against
1xx interim lines. Fixed — switched to awk '/^HTTP\//{code=$2} END{print code}'
which picks the last status line (robust to 100 Continue etc).
The other 5 second-pass findings were:
- Theoretical (a clean-exit-without-Shutdown path that doesn't trigger today)
- Aspirational (a defensive defer ln.Close() for hypothetical future code)
- Caught downstream (poll_health not identity-checking — but the follow-up probe IS identity-checked)
- Stylistic (config.go's slog.Warn lands on the global default — but config loads BEFORE the service-level logger exists; accepting)
- Aspirational (the newListener "for testability" indirection isn't needed yet)
Cumulative D1 disposition
- Pass 1 (3 lineages parallel): 14 distinct findings (3 convergent + 11 single-reviewer; some overlap with prior doc-review findings)
- Pass 2 (Opus second-pass on post-fix code): 7 findings, 1 actionable fixed, 6 dispositioned
- Total fixed in D1 code review: 7 · accepted-with-rationale: 5 · deferred: 0
- Build clean (go build ./cmd/... exits 0), go vet ./... clean, smoke PASS after every fix round.
D1 ships harder than it would have without the scrum: bind-error handling is now race-free, smoke is deterministic, log volume on /health is acknowledged, secrets handling has a flagged path forward.
D2 — actual run results (2026-04-28 evening)
Phase G0 Day 2 executed end-to-end. Output of scripts/d2_smoke.sh:
[d2-smoke] PUT round-trip:
✓ PUT d2-smoke/<ts>.bin → 200
[d2-smoke] GET echoes bytes:
✓ GET d2-smoke/<ts>.bin → bytes match
[d2-smoke] LIST includes key:
✓ LIST prefix=d2-smoke/ → contains d2-smoke/<ts>.bin
[d2-smoke] DELETE then GET → 404:
✓ DELETE then GET → 404
[d2-smoke] 256 MiB cap → 413:
✓ PUT 257 MiB → 413
[d2-smoke] semaphore: 5th concurrent PUT → 503 + Retry-After:5
✓ 5th concurrent PUT → 503 + Retry-After: 5
[d2-smoke] D2 acceptance gate: PASSED
What landed:
- internal/secrets/provider.go — Provider interface, FileProvider reading /etc/lakehouse/secrets-go.toml with inline-fallback for G0 dev convenience, StaticProvider test helper
- internal/storaged/bucket.go — aws-sdk-go-v2/service/s3 wrapper: Get/Put/List/Delete; manager.Uploader for multipart-streaming PUT; MaxListResults=10_000 cap with a ...truncated... sentinel
- internal/storaged/registry.go — BucketRegistry (single bucket in G0; G2 multi-bucket federation extends this)
- cmd/storaged/main.go — verb paths GET/PUT/LIST/DELETE, strict validateKey (rejects empty, >1024B, NUL, leading /, .., CR/LF/tab; sketched after this list), Content-Length up-front 413, MaxBytesReader 256 MiB body cap, 4-slot non-blocking semaphore (503 + Retry-After: 5)
- scripts/d2_smoke.sh — 6 acceptance probes; uses curl -T --limit-rate for true streaming uploads (the --data-binary @- first attempt buffered client-side, so the semaphore never engaged)
- Per-package unit tests for provider + registry + validateKey
- New bucket lakehouse-go-primary on the existing MinIO at :9000, isolated from the Rust lakehouse bucket during coexistence
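The validateKey policy is compact enough to sketch in full. A minimal version, assuming the helper name matches the shipped one in cmd/storaged/main.go (package name chosen for the sketch):

```go
package storaged

import (
	"errors"
	"strings"
)

var errBadKey = errors.New("invalid object key")

// validateKey rejects empty keys, keys over 1024 bytes, NUL/CR/LF/tab,
// a leading slash, and any ".." path component.
func validateKey(key string) error {
	switch {
	case key == "", len(key) > 1024:
		return errBadKey
	case strings.ContainsAny(key, "\x00\r\n\t"):
		return errBadKey
	case strings.HasPrefix(key, "/"):
		return errBadKey
	}
	for _, seg := range strings.Split(key, "/") {
		if seg == ".." {
			return errBadKey
		}
	}
	return nil
}
```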
Out of spec but added: DELETE was exposed at the HTTP layer (the
kickoff D2.4 listed only GET/PUT/LIST in routes; bucket.Delete was
in the wrapper). J approved option A (DELETE exposed) for symmetry + smoke ergonomics.
validateKey policy: J approved the strict stance — leading /,
.. components, and CR/LF/tab control characters all rejected at the
HTTP boundary. Costs ~5 lines, propagates safety to every downstream
consumer.
Next: D3 — catalogd Parquet manifests with idempotent register.
D2 — code scrum review (3-lineage parallel pass)
After D2 shipped, the actual code went through the same 3-lineage parallel scrum as D1.
| Pass | Reviewer | Latency | Findings |
|---|---|---|---|
| 1 | opencode/claude-opus-4-7 | ~30s | 1 BLOCK + 3 WARN + 3 INFO = 7 |
| 2 | openrouter/qwen/qwen3-coder | ~32s | 2 BLOCK + 1 WARN + 1 INFO = 4 |
| 3a | opencode/kimi-k2.6 (max_tokens=4096) | 18s | discursive — finish_reason=length, no structured output (model spent its budget thinking, never reached BLOCK/WARN format). opencode rejected max_tokens>4096 without stream=true |
| 3b | kimi/kimi-k2-turbo (direct adapter) | 124s | empty content, finish_reason=length (8192 reasoning tokens, no surfaced output) |
| 3c | openrouter/moonshotai/kimi-k2-0905 | 33s | 1 BLOCK + 2 WARN + 1 INFO = 4 (used as the K-lineage representative) |
The Kimi route shopping (3a → 3c) was a process finding worth
recording: opencode caps non-streaming Kimi calls at 4096 output
tokens, the direct kimi.com adapter consumed 8192 tokens of
reasoning but surfaced empty content, and openrouter's
moonshotai/kimi-k2-0905 route delivered structured output in
~33s. For future scrums on Kimi, default to
openrouter/moonshotai/kimi-k2-0905.
Convergent findings (≥2 reviewers — high confidence)
| # | Severity | Finding | Reviewers | Disposition |
|---|---|---|---|---|
| C1 | BLOCK | buildRegistry defer cancel() cancels the ctx that the AWS SDK was loaded with — fine today (static creds, sync call) but breaks future credential refresh chains (EC2 IMDS, SSO, AssumeRole) | Opus + Kimi | Fixed — switched to context.Background() for SDK construction. Per-request lifetimes flow through r.Context() |
| C2 | WARN→BLOCK | *http.MaxBytesError may not unwrap through manager.Uploader's multipart goroutines for bodies >5 MiB; smoke covers the 257 MiB single-PutObject path only — bodies in the multipart range that exceed 256 MiB could surface as 500 instead of 413 | Opus (WARN) + Kimi (BLOCK) | Fixed — added a Content-Length up-front 413 (deterministic for honest clients) + a string-suffix fallback ("http: request body too large") on the Put error path for chunked / lying-CL cases (see the sketch below) |
| C3 | INFO→WARN | Bucket.List accumulates unbounded results into a slice; a stray prefix="" against a large bucket OOMs the daemon | Opus (INFO) + Kimi (WARN) | Fixed — MaxListResults=10_000 cap; truncated responses append a sentinel ObjectInfo{Key: "...truncated..."} so callers see they didn't get everything |
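The C2 disposition reduces to a small helper. A sketch, assuming the hypothetical name isBodyTooLarge; errors.As on *http.MaxBytesError is the primary check and the string match is the documented fallback:

```go
package storaged

import (
	"errors"
	"net/http"
	"strings"
)

// isBodyTooLarge maps an oversize-body error to the 413 path.
func isBodyTooLarge(err error) bool {
	var mbe *http.MaxBytesError
	if errors.As(err, &mbe) {
		return true // the cap error survived unwrapping
	}
	// Fallback for errors re-wrapped by the upload manager's goroutines in
	// a way that drops the concrete type: match the stable stdlib message.
	return err != nil && strings.Contains(err.Error(), "http: request body too large")
}
```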
Single-reviewer findings (lineage-specific catches)
| # | Severity | Finding | Reviewer | Disposition |
|---|---|---|---|---|
| S1 | WARN | PUT 200 response missing Content-Type: application/json — Go's content-sniffing won't catch the JSON prefix, clients may see text/plain | Opus | Fixed — explicit header set before WriteHeader |
| S2 | WARN | Bucket.Get body close ordering is fragile — currently correct but the contract is implicit | Opus | Accepted — current order is correct (early returns on errors where body is nil); no fix required |
| S3 | INFO | extractKey uses the chi * wildcard — works correctly, just noting the empty-key path is covered by validateKey | Opus | Accepted — already covered |
| S4 | INFO | FileProvider.mu sync.RWMutex is unused given the immutable-after-construction design | Opus | Accepted — drop the mutex when SIGHUP reload lands in G1 (S6 below); keep it for now as a placeholder |
| S5 | INFO | Bucket.Delete doesn't translate not-found to ErrKeyNotFound | (Kimi noted, not flagged) | Accepted — S3 DeleteObject is idempotent by spec; non-error on a missing key is the correct behavior |
| S6 | INFO | FileProvider never reloads (no SIGHUP handler) | Kimi | Deferred to G1 — reload-on-SIGHUP is on the G1 list per the internal/shared/server.go opening comment |
| S7 | INFO | Error messages on no-creds-found could specify which source (file vs fallback) was missing | Qwen | Deferred — minor debug ergonomics, no production impact |
| F1 | BLOCK | "Body close happens after semaphore release" | Qwen | Dismissed — false. Defer order is LIFO; r.Body.Close() was registered AFTER <-h.putSem, so it fires FIRST. Body closes before slot frees |
| F2 | BLOCK | "ObjectInfo fields can be nil-dereferenced in List" | Qwen | Dismissed — false. bucket.go:147-160 checks if o.Key != nil etc. before dereferencing every field |
| F3 | WARN | "MaxBytesReader not applied to semaphore-protected path" | Qwen | Dismissed — false. Semaphore is non-blocking try-acquire (select { default }); there is no waiting state where pre-cap MaxBytesReader matters |
Cumulative D2 disposition
- 3-lineage parallel pass: 15 distinct findings (3 convergent + 7 single-reviewer real + 3 false + 2 absorbed by other findings)
- Fixed: 5 (3 convergent + 2 single-reviewer Opus)
- Accepted-with-rationale: 5
- Deferred to G1+: 2
- Dismissed (false positives): 3 (all from Qwen — its strength on D1 doesn't reproduce on D2 code; the lineage caught real C1/C2 issues there but misread defer order + nil-checks here)
- Build clean, vet clean, all tests pass, smoke 6/6 PASS after every fix round.
The Kimi/Opus convergence on C1 (ctx cancel) is the highest-signal finding of the round: two completely different lineages flagged the same architectural footgun. C2 (MaxBytesReader unwrap) was the most consequential — the smoke test would have stayed green while production multi-MiB uploads silently returned 500 on oversize. C3 was a latent OOM that's a 5-line fix.
The Qwen lineage delivered three false BLOCKs on D2 — different from its D1 contribution where it caught two real BLOCKs that Opus missed. Lineage rotation is real; on a given PR, one lineage may be the only one finding bugs and another may be confidently wrong. The convergence filter (≥2 reviewers) is the right gate.
D3 — actual run results (2026-04-28 evening)
Phase G0 Day 3 executed end-to-end. Output of scripts/d3_smoke.sh:
[d3-smoke] POST /catalog/register (fresh):
✓ fresh register → existing=false, dataset_id=200a05a8-...
[d3-smoke] GET /catalog/manifest/<name>:
✓ manifest dataset_id matches
[d3-smoke] GET /catalog/list (1 entry):
✓ list count=1
[d3-smoke] restart catalogd → rehydrate from Parquet:
✓ rehydrated dataset_id matches across restart
[d3-smoke] re-register (same name + same fingerprint) → existing=true:
✓ existing=true, same dataset_id, objects replaced (count=2)
[d3-smoke] re-register (different fingerprint) → 409:
✓ different fingerprint → 409 Conflict
[d3-smoke] D3 acceptance gate: PASSED
What landed:
- internal/catalogd/manifest.go — Arrow Parquet codec for the Manifest type (dataset_id / name / schema_fingerprint / objects-as-list-of-struct / created_at_ns / updated_at_ns / row_count). One row per Parquet file. DatasetIDForName derives a deterministic UUIDv5 from the name (namespace a8f3c1d2-4e5b-5a6c-9d8e-7f0a1b2c3d4e); the same name on any box yields the same dataset_id.
- internal/catalogd/registry.go — in-memory map[name]*Manifest with the ADR-020 contract: same name+fingerprint reuses the dataset_id, replaces objects, bumps updated_at; different fingerprint → ErrFingerprintConflict. A single mutex serializes Register across persistence to close the check→insert TOCTOU. Rehydrate runs storaged I/O OUTSIDE the lock, then swaps in the new map under the lock.
- internal/catalogd/store_client.go — HTTP client over storaged's GET/PUT/DELETE/list. safeKey URL-escapes path segments while preserving /. Error paths drain the body before close to keep the keep-alive pool healthy.
- cmd/catalogd/main.go — POST /catalog/register, GET /catalog/manifest/{name}, GET /catalog/list. Sentinel-error detection (errors.Is(err, ErrEmptyName)) for 400s. 4 MiB body cap on register payloads.
- New [catalogd] config: bind + storaged_url. Default http://127.0.0.1:3211.
- scripts/d3_smoke.sh — 6 acceptance probes including rehydrate-across-restart (the load-bearing ADR-020 contract test).
- arrow-go/v18 v18.6.0 + google/uuid v1.6.0 added; Go 1.24 → 1.25 forced by arrow-go's minimum.
UUIDv5 vs Rust v4: the Go rewrite picks deterministic-from-name. Same
dataset name on any box converges to the same dataset_id; rehydrate
after disk loss can't silently issue new IDs and break downstream
cross-references. Rust's surrogate-UUIDv4 is what the prior system
used; the divergence is intentional and documented at
internal/catalogd/manifest.go:34.
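The derivation itself is a one-liner with google/uuid, whose NewSHA1 produces version-5 UUIDs. A sketch using the namespace and helper name quoted above (surrounding details simplified):

```go
package catalogd

import "github.com/google/uuid"

// Namespace quoted from the manifest.go bullet above.
var datasetNamespace = uuid.MustParse("a8f3c1d2-4e5b-5a6c-9d8e-7f0a1b2c3d4e")

// DatasetIDForName maps the same name to the same dataset_id on every box.
func DatasetIDForName(name string) uuid.UUID {
	return uuid.NewSHA1(datasetNamespace, []byte(name)) // SHA1-based = UUIDv5
}
```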
Next: D4 — ingestd CSV → Parquet → catalogd register loop.
D3 — code scrum review (3-lineage parallel pass)
After D3 shipped, the actual code went through the same 3-lineage parallel scrum as D1/D2.
| Pass | Reviewer | Latency | Findings |
|---|---|---|---|
| 1 | opencode/claude-opus-4-7 | ~30s | 1 BLOCK + 5 WARN + 3 INFO = 9 |
| 2 | openrouter/moonshotai/kimi-k2-0905 | ~30s | 2 BLOCK + 2 WARN + 1 INFO = 5 |
| 3 | openrouter/qwen/qwen3-coder | ~25s | 2 BLOCK + 2 WARN + 2 INFO = 6 |
Total: 20 distinct findings (some convergent across reviewers).
Kimi route shopping from D2 paid off — openrouter/moonshotai/kimi-k2-0905
delivered structured output on first attempt, no max_tokens cap, no
empty-content reasoning trap.
Convergent findings (≥2 reviewers — high confidence)
| # | Severity | Finding | Reviewers | Disposition |
|---|---|---|---|---|
| C1 | BLOCK×3 | Decode indexes listArr.Offsets()[0]/[1] directly — fragile under array slicing / non-zero offset; panics on malformed Parquet (multi-row reader chunks, single-offset corrupt files) | Opus + Kimi + Qwen | Fixed — switched to listArr.ValueOffsets(0) (the Arrow API that accounts for array offset) + a bounds check on start/end against structArr.Len() |
| C2 | WARN×2 | Rehydrate holds the registry mutex across network I/O to storaged — a slow storaged blocks all Register/Get/List; future re-sync endpoints stall | Opus + Kimi | Fixed — list/get/decode happen outside the lock; the completed map is swapped into r.byKey under a brief lock |
Single-reviewer findings (lineage-specific catches)
| # | Severity | Finding | Reviewer | Disposition |
|---|---|---|---|---|
| S1 | WARN | Idempotent re-register mutates the in-memory manifest BEFORE persist succeeds — split-brain on storaged failure (in-memory advances, disk holds old, restart silently rolls back) | Opus | Fixed — build candidate copy of prior, persist candidate, swap into r.byKey only on persist success |
| S2 | WARN | cmd/catalogd detects empty-input via strings.Contains(err.Error(), "empty name") — brittle, anyone reformatting the registry error silently demotes a 400 to a 500 | Opus | Fixed — exported ErrEmptyName / ErrEmptyFingerprint sentinels; the HTTP layer uses errors.Is |
| S3 | WARN | Get/List cp := *m aliases the Objects slice + RowCount pointer — caller mutation corrupts registry state under the lock-free read | Opus | Fixed — cloneManifest deep-copies Objects + dereferences RowCount into a fresh *int64 |
| S4 | WARN | error paths in store_client preview the body but don't drain before close → HTTP/1.1 keep-alive pool reuse breaks → slow socket leak | Qwen | Fixed — a drainAndClose helper reads up to 64 KiB before close on every defer |
| S5 | INFO×3 | name-validation regex / per-call deadline / Snappy compression on manifest writes | Opus | Deferred — small, no risk if skipped |
| S6 | WARN | Rehydrate aborts on the first decode error → partial state | Kimi | Accepted — fail-loud > silent-partial is the design; cmd/catalogd os.Exit(1)s on rehydrate error so a corrupt manifest is operator-visible at startup |
| S7 | WARN | store_client.List ignores pagination — if storaged gains continuation tokens, the client silently drops everything past page 1 | Opus | Accepted — storaged caps lists at MaxListResults=10_000 with a ...truncated... sentinel; no continuation token in the wire format yet (deferred to D7.5+ alongside multi-bucket federation) |
| F1 | BLOCK | Decode crashes on empty Parquet (NumRows==0) — Value(0) panics | Kimi | Dismissed — already handled. tbl.NumRows() != 1 returns an error before any column access; NumRows==0 also fails the subsequent rr.Next() check |
| F2 | INFO | safeKey double-escapes slashes; should use url.PathEscape(key) directly | Kimi | Dismissed — url.PathEscape("a/b") returns "a%2Fb". Splitting on / first is necessary to preserve the path separator while escaping segment content. The current code is correct; Kimi's suggested fix would break storaged's chi wildcard match |
| F3 | INFO | rb.NewRecord() error unchecked | Qwen | Dismissed — false signature. RecordBuilder.NewRecord() returns arrow.Record only; there is no error to check |
Cumulative D3 disposition
- 3-lineage parallel pass: 20 distinct findings
- Fixed: 6 (2 convergent + 4 single-reviewer)
- Accepted-with-rationale: 5 (3 INFO + 2 WARN with deferred-by-design rationale)
- Dismissed (false positives): 3 (1 Kimi BLOCK on already-handled empty case, 1 Kimi INFO on safeKey, 1 Qwen INFO on Arrow API signature)
- Build clean, vet clean, all tests pass, smoke 6/6 PASS after every fix round.
The C1 triple-confirmation (Opus + Kimi + Qwen all flagging the same
list-offset indexing) is the strongest convergence signal across the
three days. All three correctly diagnosed the underlying issue
(direct Offsets() indexing is fragile under array slicing); Opus
named the right Arrow API (ValueOffsets(0)), Kimi and Qwen named
the bounds-check shape. Fix uses Opus's API + the others' bounds
discipline together.
S1 (split-brain on persist failure) was the most consequential finding — D3's smoke would have stayed green through every test case because none of them simulate storaged-down mid-register. Opus alone caught it; the fix is a 4-line candidate-then-swap pattern but the class of bug (in-memory advances ahead of disk) is exactly the distributed-systems hazard ADR-020 was added to prevent at a different layer.
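A minimal sketch of that candidate-then-swap shape, with types trimmed to what the pattern needs and locking/fingerprint checks elided; field names (byKey, persist) are assumptions:

```go
package catalogd

import (
	"fmt"
	"time"
)

// Manifest is trimmed to the fields the pattern needs.
type Manifest struct {
	Name      string
	Objects   []string
	UpdatedAt time.Time
}

type Registry struct {
	byKey   map[string]*Manifest
	persist func(*Manifest) error // storaged round-trip, injected for the sketch
}

// Register persists a candidate manifest first and only then swaps it into
// the in-memory map, so a storaged failure can't leave memory ahead of disk.
func (r *Registry) Register(name string, objects []string) error {
	candidate := &Manifest{
		Name:      name,
		Objects:   append([]string(nil), objects...),
		UpdatedAt: time.Now(),
	}
	if err := r.persist(candidate); err != nil {
		return fmt.Errorf("persist manifest: %w", err) // r.byKey unchanged
	}
	r.byKey[name] = candidate // swap in only after the durable write
	return nil
}
```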
D4 — actual run results (2026-04-28 evening)
Phase G0 Day 4 executed end-to-end. Output of scripts/d4_smoke.sh:
[d4-smoke] POST /ingest?name=d4_workers (5 rows, 5 cols):
✓ ingest fresh → row_count=5, existing=false, key=datasets/d4_workers/247165...parquet
[d4-smoke] mc shows the parquet on MinIO:
✓ <fp>.parquet present in lakehouse-go-primary/datasets/d4_workers/
[d4-smoke] catalogd manifest matches:
✓ manifest row_count=5, fp matches, 1 object at datasets/d4_workers/<fp>.parquet
[d4-smoke] re-ingest same CSV → existing=true:
✓ idempotent re-ingest: existing=true, same dataset_id, same fingerprint
[d4-smoke] schema-drift CSV → 409:
✓ schema drift → 409 Conflict
[d4-smoke] D4 acceptance gate: PASSED
What landed:
- internal/ingestd/schema.go — Arrow schema inference per ADR-010 ("default to string on ambiguity"). Detects int64, float64, bool, string. Empty cells → nullable flag. Fingerprint() is a deterministic SHA-256 over (name, type) tuples in header order using ASCII record/unit separators (0x1e/0x1f) so column names with commas don't collide (sketched after this list). Nullability is intentionally NOT in the fingerprint — gaining/losing nulls isn't a schema change.
- internal/ingestd/csv.go — single-pass CSV → Arrow → Parquet. Buffers the first DefaultSampleRows=1000 rows for inference, then replays them + streams the rest into the pqarrow writer in DefaultBatchSize=8192-row record batches. Snappy compression on the parquet output. Empty cells → null on numeric/bool, empty string on string columns. Per scrum C-WCLOSE: deferred guarded w.Close() so error paths flush + free buffered column data.
- internal/ingestd/catalog_client.go — symmetric in shape with internal/catalogd/store_client.go. POST /catalog/register, drain-on-error keep-alive hygiene, ErrFingerprintConflict sentinel for the 409 path.
- cmd/ingestd/main.go — POST /ingest?name=X with multipart form. Pipeline: parse multipart → IngestCSV → PUT parquet to storaged at datasets/<name>/<fp_hex>.parquet (content-addressed per scrum C-DRIFT) → catalogd /catalog/register. 256 MiB body cap matches storaged's PUT cap. 5-minute upstream timeout for large ingests.
- New [ingestd] config block: bind + storaged_url + catalogd_url
- scripts/d4_smoke.sh — 6 acceptance probes. Generates a CSV that exercises every inference path (clean int, string, ADR-010 fallback on salary with one N/A, mixed-case bool literals, float64).
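The fingerprint construction referenced in the schema.go bullet above is easy to sketch. A minimal version, with the column type abbreviated to a plain string rather than an Arrow type (an assumption for the sketch):

```go
package ingestd

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

type column struct {
	Name string
	Type string // e.g. "int64", "float64", "bool", "utf8"
}

// Fingerprint is deterministic across boxes: the same header + inferred
// types always yield the same hex digest. Nullability is intentionally
// excluded; ASCII separators keep commas in column names from colliding.
func Fingerprint(cols []column) string {
	var b strings.Builder
	for i, c := range cols {
		if i > 0 {
			b.WriteByte(0x1e) // record separator between columns
		}
		b.WriteString(c.Name)
		b.WriteByte(0x1f) // unit separator between name and type
		b.WriteString(c.Type)
	}
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}
```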
Content-addressed parquet keys: datasets/<name>/<fp_hex>.parquet.
Per scrum C-DRIFT (Opus WARN): the prior data.parquet shape meant
a schema-drift ingest would PUT-overwrite the live parquet BEFORE
catalogd's 409 fired, leaving storaged inconsistent with the catalog.
Fingerprint-keyed paths mean drift attempts write to a different file
that becomes an orphan; the live data is never corrupted. Same-fp
re-ingest still overwrites the same key (idempotent).
Next: D5 — queryd DuckDB SELECT over registered datasets.
D4 — code scrum review (3-lineage parallel pass)
| Pass | Reviewer | Latency | Findings |
|---|---|---|---|
| 1 | opencode/claude-opus-4-7 | ~25s | 4 WARN + 3 INFO (+ 2 BLOCKs Opus self-retracted in-flight after re-reading) |
| 2 | openrouter/moonshotai/kimi-k2-0905 | ~25s | 1 BLOCK + 2 WARN + 1 INFO |
| 3 | openrouter/qwen/qwen3-coder | ~30s | 2 BLOCK + 2 WARN + 2 INFO |
Total: 16 distinct findings. Notably, Opus retracted two of its own BLOCKs after re-reading the code — the first time we've seen self-correction in the scrum stream. The actual reviewable output was 4 WARN + 3 INFO from Opus, and the analysis itself was visible.
Convergent findings (≥2 reviewers — high confidence)
No real convergent findings. Kimi and Qwen both flagged a
"RecordBuilder leak on early error mid-stream" but on close reading
the existing code is correct: the deferred rb.Release() at the
outer scope captures whatever value rb holds at function exit, and
the in-loop rb.Release() runs before rb is reassigned to a new
builder. No leak, regardless of where errors occur. Both reviewers
landed on the same wrong intuition by different paths.
Single-reviewer findings — applied
| # | Severity | Finding | Reviewer | Disposition |
|---|---|---|---|---|
| C-DRIFT | WARN | PUT-before-register on a fixed datasets/<name>/data.parquet key means a schema-drift ingest overwrites the good parquet before catalogd's 409 fires. After: storaged holds the new (wrong-schema) parquet, the manifest still has the old fingerprint, queryd reads stale-schema data | Opus | Fixed — content-addressed key shape datasets/<name>/<fp_hex>.parquet. Drift writes to a different file (an orphan); the live parquet is never overwritten by a non-matching schema |
| C-WCLOSE | WARN | pqarrow.NewFileWriter is not Closed on error paths — buffered column data + OS resources leak per failed ingest | Opus | Fixed — wClosed flag + deferred guarded close (sketched below). The success-path explicit Close still runs and is the only place the close error surfaces; the defer fires only on error returns |
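The C-WCLOSE pattern, sketched with io.WriteCloser standing in for the pqarrow writer (the names wClosed and writeParquet are assumptions):

```go
package ingestd

import "io"

// writeParquet closes the writer exactly once: the deferred guard fires
// only on error returns, while the success path keeps the explicit Close
// whose error is the one that matters.
func writeParquet(w io.WriteCloser, writeRows func(io.Writer) error) error {
	wClosed := false
	defer func() {
		if !wClosed {
			w.Close() // error-path cleanup: flush/free buffered column data
		}
	}()

	if err := writeRows(w); err != nil {
		return err // the deferred guard closes the writer
	}

	wClosed = true
	return w.Close() // success path: any Close error surfaces here
}
```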
Single-reviewer findings — accepted with rationale
| # | Severity | Finding | Reviewer | Disposition |
|---|---|---|---|---|
| O-WARN1 | WARN | Schema sample is 1000 rows; ambiguous values past row 1000 fail the entire ingest with 400 instead of widening the inference | Opus | Accepted — design call. G0's 256 MiB cap caps reasonable-CSVs well below the boundary. Long-term fix is two-pass infer or downgrade-to-string-on-failure; out of scope for D4 |
| O-WARN2 | WARN | The string-match on "http: request body too large" is paranoia; trust errors.As only | Opus | Accepted — errors.As IS the load-bearing check; the string-match is a documented safety net we keep across the codebase (also in storaged D2, also called out then) |
| K-WARN2 | WARN | Multipart parse buffers the entire upload, then IngestCSV reads it again — double in-memory cost | Kimi | Accepted — known G0 limitation. ParseMultipartForm with the 64<<20 threshold spills to a temp file above 64 MiB so the doubling only hits below that; the 256 MiB upload cap means peak is bounded |
Single-reviewer findings — deferred
| # | Severity | Finding | Reviewer | Disposition |
|---|---|---|---|---|
| O-INFO×3 | INFO | 0x1e/0x1f separator validation, body-close ordering, FieldsPerRecord=-1 swallowing CSV truncation | Opus | Deferred — small, no risk if skipped |
| K-INFO×1 | INFO | Fingerprint ignores nullability — already documented intentional choice | Kimi | Deferred |
| Q-INFO×2 | INFO | Lowercase isBoolLiteral micro-opt; align multipart-parse limit with body cap | Qwen | Deferred |
Dismissed (false positives)
| # | Severity | Finding | Reviewer | Why dismissed |
|---|---|---|---|---|
| F1 | Q-BLOCK | csv.Reader needs LazyQuotes=true for multi-line quoted fields | Qwen | False — LazyQuotes is for unescaped quotes WITHIN fields. Go's csv.Reader correctly handles RFC 4180 multi-line quoted fields by default; smoke would have surfaced this if true |
| F2 | Q-BLOCK | row[i] OOB panic on inconsistent rows | Qwen | False — already bounds-checked at schema.go:73 (if i >= len(row) { continue }) and csv.go:201 (if i < len(row) { cell = row[i] }) |
| F3 | K-BLOCK | Type assertion panic if pqarrow reorders fields | Kimi | Speculative — pqarrow doesn't reorder schema; no real path |
| F4 | K-WARN + Q-WARN×2 | "RecordBuilder leak on early error mid-stream" (false convergent) | Kimi + Qwen | Both wrong, by different theories. The outer defer rb.Release() captures the current value of rb at function exit; the in-loop rb.Release() runs before reassignment. No path leaks |
Cumulative D4 disposition
- 3-lineage parallel pass: 16 distinct findings
- Fixed: 2 (both Opus single-reviewer; no real convergent findings this round)
- Accepted-with-rationale: 3
- Deferred: 6 (mostly INFO)
- Dismissed (false positives): 5 (2 Qwen BLOCKs + 1 Kimi BLOCK + 1 false convergent leak finding)
- Build clean, vet clean, all tests pass, smoke 6/6 PASS after every fix round.
The D4 scrum produced fewer real findings than D3 (2 vs 6) — both real findings were Opus-only, both were WARN-level architectural hazards that smoke wouldn't catch (PUT-then-register order failure on schema drift; resource leak on writer error path). Kimi and Qwen delivered a false convergent finding on a non-issue, plus several hallucinated BLOCKs (LazyQuotes for multi-line, OOB on bounds-checked indexing). Opus's self-retraction of two BLOCKs in-flight is a healthier reviewer behavior than confidently-wrong dismissals would have been.
The C-DRIFT fix (content-addressed parquet keys) is the kind of finding that's invisible to integration tests because the smoke's "schema drift → 409" assertion was passing in the corrupted-state world too — the 409 fires correctly; what was wrong was the live data getting overwritten before the 409 fired. Opus reading the PUT-then-register order and noticing that PUT mutates state before the validation check is exactly the kind of architectural code-read the scrum exists for.
Pre-D5 prep (commit 4205ecd)
Between D4 and D5, J asked what would bite D5 if not fixed first.
The answer was the CatalogClient package location: it lived in
internal/ingestd/ but D5's queryd needed the same shape, and
having queryd import from ingestd would have inverted the data-flow
direction (ingestd is upstream of queryd).
Extracted to internal/catalogclient/ as a shared package, added
the List(ctx) method queryd needs for view registration, kept
the wire format unchanged. ingestd switched its import; all four
existing smokes (D1-D4) passed unchanged.
DuckDB cgo path re-verified at this point with the official
github.com/duckdb/duckdb-go/v2 v2.10502.0 (per ADR-001 §1) on
Go 1.25 + arrow-go: standalone sql.Open("duckdb","") →
db.Ping() → SELECT 'pong' round-trip succeeded.
D5 — actual run results (2026-04-29)
Phase G0 Day 5 executed end-to-end. Output of scripts/d5_smoke.sh:
[d5-smoke] ingest 5-row CSV via D4 path:
✓ ingest row_count=5
[d5-smoke] launching queryd (initial Refresh picks up d5_workers)...
[d5-smoke] POST /sql SELECT count(*) FROM d5_workers:
✓ count(*)=5
[d5-smoke] POST /sql SELECT * FROM d5_workers LIMIT 3:
✓ rows[0] = (id=1, name=Alice), columns=[id, name, salary]
[d5-smoke] schema-drift ingest 409s; existing view still queries:
✓ drift → 409
✓ post-drift count(*)=5 (view unchanged)
[d5-smoke] error path: SELECT FROM nonexistent → 400:
✓ unknown table → 400
[d5-smoke] D5 acceptance gate: PASSED
What landed:
- internal/queryd/db.go — OpenDB(ctx, s3, prov, bucket) returns a *sql.DB backed by an in-memory DuckDB. Custom Connector with a bootstrapper that runs INSTALL httpfs / LOAD httpfs / CREATE OR REPLACE SECRET (TYPE S3, KEY_ID …, ENDPOINT …, URL_STYLE 'path', USE_SSL false) on every new physical connection (a stdlib-only sketch of the shape follows this list). SetMaxOpenConns(1) so registrar's CREATE VIEWs and the handler's user SQL serialize through one connection (avoids cross-connection visibility edge cases).
- internal/queryd/registrar.go — Registrar.Refresh(ctx) reads the catalog list, runs CREATE OR REPLACE VIEW "name" AS SELECT * FROM read_parquet('s3://bucket/key') per manifest, drops views for removed manifests, skips on unchanged updated_at. CatalogLister + Execer interfaces enable unit tests with no DuckDB / no real catalogd.
- cmd/queryd/main.go — POST /sql with JSON body {"sql":"…"} → executes via the *sql.DB → response shape {"columns":[{"name","type"},...], "rows":[[...]], "row_count":N}. []byte → string conversion so VARCHAR rows JSON-encode as text. Initial Refresh on startup, then a TTL ticker (30s default, configurable via [queryd].refresh_every) calling Refresh in a goroutine cancellable via the process ctx.
- New [queryd] config block: bind + catalogd_url + secrets_path + refresh_every.
- scripts/d5_smoke.sh — 6 acceptance probes including post-drift query stability (view unchanged after the schema-drift 409).
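A stdlib-only sketch of the bootstrap shape, since the exact duckdb-go connector API isn't reproduced here: wrap any driver.Connector and run the INSTALL / LOAD / CREATE SECRET statements on every new physical connection. Names are assumptions; the B-CTX and B-LEAK points from the scrum below are marked inline:

```go
package queryd

import (
	"context"
	"database/sql/driver"
	"errors"
	"fmt"
)

type bootstrapConnector struct {
	inner driver.Connector
	stmts []string // "INSTALL httpfs", "LOAD httpfs", "CREATE OR REPLACE SECRET ..."
}

func (c *bootstrapConnector) Connect(_ context.Context) (driver.Conn, error) {
	// B-CTX shape: use Background(), not the caller's ctx, so a cancelled
	// OpenDB ctx can't silently poison later reconnects.
	conn, err := c.inner.Connect(context.Background())
	if err != nil {
		return nil, err
	}
	ex, ok := conn.(driver.ExecerContext)
	if !ok {
		conn.Close()
		return nil, errors.New("driver does not support ExecerContext")
	}
	for i, stmt := range c.stmts {
		if _, err := ex.ExecContext(context.Background(), stmt, nil); err != nil {
			conn.Close()
			// B-LEAK shape: a stable label, never the SQL text itself.
			return nil, fmt.Errorf("bootstrap statement %d: %w", i, err)
		}
	}
	return conn, nil
}

func (c *bootstrapConnector) Driver() driver.Driver { return c.inner.Driver() }
```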
The D5 smoke launches queryd LAST — after the D4 ingest has
registered the dataset — so the initial Refresh picks it up
without waiting for the TTL. This matches what real ops looks like:
queryd starts after data is loaded, picks up the catalog at that
moment, then drifts in lockstep via the ticker.
Critical architectural choices:
- SetMaxOpenConns(1) — DuckDB views live in the in-memory catalog of a single instance; with one connection, registrar's CREATE VIEWs and the handler's SELECTs are visibly synchronized without MVCC-timing edge cases. Lift in G2+ if intra-process query concurrency wins matter.
- Always-quote view identifiers — CREATE OR REPLACE VIEW "name" with internal " doubled. Catalogd accepts user-supplied names that wouldn't pass SQL bare-identifier rules; quoting makes them unambiguous.
- updated_at as the implicit etag — no separate ETag header in catalogd. The manifest's timestamp bumps on every same-fp re-register; the registrar tracks a time.Time per name and skips the CREATE when unchanged (Opus's "perf cliff" warning from the D5 plan). The refresh loop itself is sketched after this list.
Next: D6 — gateway as a reverse proxy in front of all four backing services. Last day of G0. After that: G1+ (gRPC, Lance bench, vector indices, etc).
D5 — code scrum review (3-lineage parallel pass)
| Pass | Reviewer | Latency | Findings |
|---|---|---|---|
| 1 | opencode/claude-opus-4-7 | ~28s | 1 BLOCK + 4 WARN + 4 INFO = 9 (1 self-retracted) |
| 2 | openrouter/moonshotai/kimi-k2-0905 | ~25s | 2 BLOCK + 2 WARN + 1 INFO = 5 |
| 3 | openrouter/qwen/qwen3-coder | ~20s | 2 BLOCK + 1 WARN + 1 INFO = 4 |
Total: 18 distinct findings. Strongest scrum yet on convergence quality — Opus's BLOCK on ctx capture + Kimi's BLOCK on credential leakage in error messages were both deep architectural reads that smoke could not have caught.
Convergent findings (≥2 reviewers — high confidence)
| # | Severity | Finding | Reviewers | Disposition |
|---|---|---|---|---|
| C1 | WARN×2 | Refresh aborts mid-loop on the first per-view error → a poison manifest blocks all subsequent drops/updates → stale views stick around forever (Opus); a partial dropView failure would block other manifest refreshes (Kimi) | Opus + Kimi | Fixed — the drop pass runs BEFORE the create pass; per-iteration errors are collected and Refresh continues; the final return uses errors.Join to surface every failure |
Single-reviewer findings — applied
| # | Severity | Finding | Reviewer | Disposition |
|---|---|---|---|---|
| B-CTX | Opus BLOCK | The bootstrap closure captures the ctx passed to OpenDB. Today masked by SetMaxOpenConns(1) + a long-lived procCtx; a future short-lived ctx + reconnect would silently fail every reconnect's bootstrap with a cancelled ctx | Opus | Fixed — bootstrap explicitly uses context.Background(); the passed ctx is only used for the initial PingContext |
| B-LEAK | Kimi BLOCK | The prior firstLine(stmt) truncation to 80 chars STILL contained both KEY_ID '...' AND the start of SECRET '...' — log aggregators would capture credentials | Kimi | Fixed — replaced firstLine(stmt) with stable per-statement labels ("install httpfs" / "load httpfs" / "create secret") so no SQL reaches the error path. The wrapped err.Error() is filtered through redactCreds to scrub any secret values DuckDB itself might echo |
| JSON-ERR | Opus WARN | _ = json.NewEncoder(w).Encode(resp) swallows mid-encode failures; the client sees a truncated 200 with no log signal — the first column type DuckDB can't JSON-encode (e.g. INTERVAL) lands here silently | Opus | Fixed — log via slog.Warn with the underlying error |
Single-reviewer findings — accepted with rationale
| # | Severity | Finding | Reviewer | Disposition |
|---|---|---|---|---|
| O-W3 | Opus WARN | DuckDB exotic types (Decimal, INTERVAL, etc.) may not JSON-encode well — dest := []any with []byte→string conversion is the only type-aware step today | Opus | Accepted — a type-aware row converter is post-G0; G0 column types in the test fixtures are int64/varchar/double/bool. The new JSON-ERR fix makes any encoding failure operator-visible in the meantime |
| K-W1 | Kimi WARN | No exponential backoff on repeated refresh failures | Kimi | Accepted — G0 single-tenant; 30s ticker is fine. Backoff adds complexity that smoke can't validate |
| O-INFO×3 | Opus INFO | USE_SSL casing cosmetic; initial-refresh log doesn't match runRefreshLoop's four-field format | Opus | Accepted / fixed cleanup — the var _ = io.Discard placeholder removed; the rest deferred |
| Q-W | Qwen WARN | Object key validation | Qwen | Accepted — catalogd's contract; storaged validateKey is the actual gate |
| Q-I | Qwen INFO | Hardcoded refresh timeout | Qwen | Accepted — separate config knob is over-engineering for G0 |
Dismissed (false positives)
| # | Severity | Finding | Reviewer | Why dismissed |
|---|---|---|---|---|
| F1 | Q-BLOCK | "Bootstrap not transactional" | Qwen | False — DuckDB DDL is auto-commit + idempotent; CREATE OR REPLACE handles re-run cleanly |
| F2 | Q-BLOCK | "MaxBytesReader applied AFTER json.NewDecoder reads" | Qwen | False — cmd/queryd/main.go sets r.Body = http.MaxBytesReader(...) BEFORE the decode call, by line ordering |
| F3 | K-BLOCK | "Concurrent Refresh + user query deadlock on single connection" | Kimi | False — not a deadlock, just serialization. With Refresh's 10s ctx-timeout, slow SELECT causes Refresh to skip a tick and retry. Design accept |
| F4 | K-W2 | "dropView failure leaves r.known inconsistent" | Kimi | False — current code returns BEFORE the delete(r.known, name) on dropView error, so the entry persists and next refresh retries (correct behavior). Misread of the sequence |
Cumulative D5 disposition
- 3-lineage parallel pass: 18 distinct findings
- Fixed: 4 (1 convergent + 3 single-reviewer real)
- Accepted-with-rationale: 5
- Deferred / cleanup: 3 (1 INFO removal, 2 INFOs deferred)
- Dismissed (false positives): 4
- Build clean, vet clean, all tests pass, smoke 6/6 PASS after every fix round.
The B-LEAK finding (Kimi BLOCK) is worth a pause: the prior code
explicitly tried to be careful about credentials by truncating the
SQL to 80 chars in error messages, but the truncation window still
captured both KEY_ID '<value>' and the prefix of
SECRET '<value>' because of where they appear in the statement.
The defense-in-depth answer was to never pass the SQL into the
error path at all — labels are static, the error chain wraps a
credential-redacted version of DuckDB's own error message. Two
layers of "don't leak" instead of one fragile truncation.
The B-CTX finding (Opus BLOCK) is the kind of latent issue that
would cost a future debugger hours: the connector callback runs on
EVERY new physical connection, but SetMaxOpenConns(1) masks the
issue today. A future change to allow concurrent connections would
silently break bootstrap on every reconnect with no clear error
signal. Catching it during the scrum saved a Friday-afternoon-
debugging incident months from now.