golangLAKEHOUSE/docs/PHASE_G0_KICKOFF.md
root b1d52306ad G0 D6: gateway reverse proxy fronting all 4 backing services · 2 scrum fixes · G0 COMPLETE
Last day of Phase G0. Gateway promotes the D1 stub endpoints into
real reverse-proxies on :3110 fronting storaged + catalogd + ingestd
+ queryd. /v1 prefix lives at the edge — internal services route on
/storage, /catalog, /ingest, /sql, with the prefix stripped by a
custom Director per Kimi K2's D1-plan finding.

Routes:
  /v1/storage/*  → storaged
  /v1/catalog/*  → catalogd
  /v1/ingest     → ingestd
  /v1/sql        → queryd

Acceptance smoke 6/6 PASS — every assertion goes through :3110, none
direct to backing services. Full ingest → storage → catalog → query
round-trip verified end-to-end. The smoke's "rows[0].name=Alice"
assertion is the architectural payoff: five binaries, six HTTP
routes, one round-trip through one edge.

Cross-lineage scrum on shipped code:
  - Opus 4.7 (opencode):                      1 BLOCK + 2 WARN + 2 INFO
  - Kimi K2-0905 (openrouter):                1 BLOCK + 3 WARN + 1 INFO (3 false positives, all from one wrong TrimPrefix theory)
  - Qwen3-coder (openrouter):                 5 completion tokens — "No BLOCKs."

Fixed (2, both Opus single-reviewer):
  O-BLOCK: Director path stripping fails if upstream URL has a
    non-empty path. The default Director's singleJoiningSlash runs
    BEFORE the custom code, so an upstream like http://host/api
    produces /api/v1/storage/... after the join — then TrimPrefix("/v1")
    is a no-op because the string starts with /api. Fix: strip /v1
    BEFORE calling origDirector. New TestProxy_SubPathUpstream regression
    locks this in. Today: bare-host URLs only, dormant — but moving
    gateway behind a sub-path in prod would have silently 404'd.
  O-WARN2: url.Parse is permissive — typo "127.0.0.1:3211" (no scheme)
    parses fine, produces empty Host, every request 502s. mustParseUpstream
    fail-fast at startup with a clear message naming the offending
    config field.
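  A minimal sketch of the O-WARN2 shape (helper name and message text are
  illustrative, not the shipped code): url.Parse accepts scheme-less strings,
  so the startup check insists on both a scheme and a host and names the
  offending config field.

```go
package gateway

import (
	"log"
	"net/url"
)

// mustParseUpstream refuses to start the gateway on a config value that
// url.Parse would "accept" but that cannot be proxied to (missing scheme
// or host), naming the config field in the failure message.
func mustParseUpstream(field, raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil || u.Scheme == "" || u.Host == "" {
		log.Fatalf("config %s: %q is not a usable upstream URL (want e.g. http://127.0.0.1:3211): %v",
			field, raw, err)
	}
	return u
}
```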

Dismissed (3, all Kimi, same false TrimPrefix theory):
  K-BLOCK "TrimPrefix loops forever on //v1storage" — false, single
    check-and-trim, no loop
  K-WARN "no upper bound on repeated // removal" — same false theory
  K-WARN "goroutines leak if upstream parse fails while binaries
    running" — confused scope; binaries are separate OS processes
    launched by the smoke script

D1 smoke updated (post-D6): the 501 stub probes are gone (gateway no
longer stubs /v1/ingest and /v1/sql). Replaced with proxy probes that
verify gateway forwards malformed requests to ingestd and queryd. Launch
order changed from parallel to dep-ordered (storaged → catalogd →
ingestd → queryd → gateway) since catalogd's rehydrate now needs
storaged, queryd's initial Refresh needs catalogd.

All six G0 smokes (D1 through D6) PASS end-to-end after every fix
round. Phase G0 substrate is complete: 5 binaries, 6 routes, 25 fixes
applied across 6 days from cross-lineage review.

G1+ next: gRPC adapters, Lance/HNSW vector indices, Go MCP SDK port,
distillation rebuild, observer + Langfuse integration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:21:54 -05:00


Phase G0 Kickoff Plan

Goal: smallest end-to-end ingest+query path in Go. Upload a CSV via POST /ingest, query it via POST /sql, get rows back. No vector, no profile, no UI yet.

Estimated duration: 1 engineer-week (5 working days + gate day + cleanup). Plan is calibrated for solo work; cut by ~40% with two engineers in parallel on storaged/catalogd vs ingestd/queryd.

Cutoff for G0: the closing acceptance gate (Day 6) passes end-to-end against workers_500k.csv as the test fixture.


Day 0 — One-time setup

Done by an operator with sudo on the dev box. ~15 minutes.

# Step Verify
0.1 Install Go 1.23+: curl -L https://go.dev/dl/go1.23.linux-amd64.tar.gz | sudo tar -C /usr/local -xz go version shows 1.23+
0.2 Add /usr/local/go/bin to PATH (in ~/.bashrc) new shell sees go
0.3 Install cgo toolchain: apt install build-essential gcc --version works
0.4 Clone repo: git clone https://git.agentview.dev/profit/golangLAKEHOUSE.git cd golangLAKEHOUSE && go version from inside
0.5 Bring up MinIO locally (or point at existing) mc ls local/ lists buckets, or whatever the dev S3 is
0.6 Verify DuckDB cgo path with a real compile-and-run smoke (NOT go install pkg@latest — that requires a main package and would fail on the duckdb library before exercising cgo). Steps: mkdir /tmp/duckdb-smoke && cd /tmp/duckdb-smoke && go mod init smoke && go get github.com/duckdb/duckdb-go/v2 && cat > main.go <<<'package main; import (_ "github.com/duckdb/duckdb-go/v2"; "database/sql"); func main(){db,_:=sql.Open("duckdb","");db.Ping();db.Close()}' && go run main.go — proves cgo linker chain + static-linked duckdb-go-bindings work on this platform exits 0, no link errors

Day 0 acceptance: go version shows 1.23+, gcc --version works, MinIO reachable on localhost:9000, the cgo smoke install above succeeded. (go mod tidy is intentionally NOT run here — no imports yet; verification moves to D1.)
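For readability, the same D0.6 smoke program written out (equivalent to the one-line main.go above; the import path and driver name are the ones the step already uses):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/duckdb/duckdb-go/v2" // registers the "duckdb" driver (cgo)
)

func main() {
	db, err := sql.Open("duckdb", "") // in-memory database
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := db.Ping(); err != nil { // forces a real connection, links libduckdb
		log.Fatal(err)
	}
}
```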


Day 1 — Skeleton + chi + /health × 5 binaries

Goal: five binaries build, each binds to its port, /health returns {"status":"ok","service":"<name>"}.

# File What
1.1 internal/shared/server.go chi router factory, slog setup, graceful shutdown via signal.NotifyContext
1.2 internal/shared/config.go TOML loader using pelletier/go-toml/v2, default + override pattern
1.3 cmd/gateway/main.go port 3110, /health
1.4 cmd/storaged/main.go port 3211, /health
1.5 cmd/catalogd/main.go port 3212, /health
1.6 cmd/ingestd/main.go port 3213, /health
1.7 cmd/queryd/main.go port 3214, /health

1.8 lakehouse.toml bind addresses, log level — sample committed
1.9 Makefile build, run-gateway, etc. — convenience
1.10 cmd/gateway/main.go adds STUB routes POST /v1/ingest and POST /v1/sql returning 501 Not Implemented with a header X-Lakehouse-Stub: g0. Real reverse-proxy wiring lands on Day 6, but the routes exist from D1 so D6 is just behavior change, not new endpoints. Verify: curl -X POST :3100/v1/ingest returns 501 with the stub header

Port shift note: Original SPEC said 3100/3201-3204; D1 runtime caught a collision — the live Rust lakehouse owns 3100. Shifted Go dev ports to 3110+ so both systems can run concurrently during the migration. G5 cutover flips gateway back to 3100 when Rust retires.

Acceptance D1: go mod tidy populates go.sum cleanly; go build ./cmd/... exits 0; running each binary in a separate terminal, curl :3100/health through :3204/health all return 200 OK with the expected JSON; gateway's stub /v1/* routes return 501.

Dependencies pulled: go-chi/chi/v5, pelletier/go-toml/v2.
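A minimal sketch of the D1.1 shared factory, assuming an illustrative Run(service, bind) signature rather than the shipped API: chi router with /health, slog, and graceful shutdown driven by signal.NotifyContext. Listen() runs synchronously so bind errors surface as the return value (the D1 scrum C1 fix, recorded later in this doc).

```go
package shared

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"net"
	"net/http"
	"os/signal"
	"syscall"
	"time"

	"github.com/go-chi/chi/v5"
)

func Run(service, bind string) error {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	r := chi.NewRouter()
	r.Get("/health", func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprintf(w, `{"status":"ok","service":%q}`, service)
	})

	ln, err := net.Listen("tcp", bind) // synchronous: a bad bind fails here, not in a goroutine
	if err != nil {
		return err
	}
	srv := &http.Server{Handler: r}

	errCh := make(chan error, 1)
	go func() { errCh <- srv.Serve(ln) }()

	select {
	case <-ctx.Done(): // SIGINT/SIGTERM: drain in-flight requests
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		return srv.Shutdown(shutdownCtx)
	case err := <-errCh:
		if errors.Is(err, http.ErrServerClosed) {
			return nil
		}
		slog.Error("serve failed", "err", err)
		return err
	}
}
```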


Day 2 — storaged: S3 GET/PUT/LIST

Goal: put a file, get it back, list it.

# File What
2.1 internal/storaged/bucket.go aws-sdk-go-v2/service/s3 wrapper — Get, Put, List, Delete
2.2 internal/storaged/registry.go BucketRegistry skeleton (per Rust ADR-017) — single bucket only in G0; multi-bucket lands in G2
2.3 internal/secrets/provider.go SecretsProvider interface + FileSecretsProvider reading /etc/lakehouse/secrets.toml
2.4 cmd/storaged/main.go wire routes — GET /storage/get/{key}, PUT /storage/put/{key}, GET /storage/list?prefix=.... Bind to 127.0.0.1:3201 only (G0 is dev-only, no auth). Apply http.MaxBytesReader with a 256 MiB per-request cap (reduced from 2 GiB per Qwen3-coder Q1 — 2 GiB × N concurrent = blow the box) + a buffered semaphore capping concurrent in-flight PUTs at 4. PUTs exceeding the cap → 413; PUTs blocked on the semaphore → 503 with Retry-After: 5

Acceptance D2: curl -T sample.csv 127.0.0.1:3201/storage/put/test/sample.csv returns 200; curl 127.0.0.1:3201/storage/get/test/sample.csv echoes the file bytes; curl 127.0.0.1:3201/storage/list?prefix=test/ lists sample.csv; PUT exceeding the 256 MiB cap returns 413 Payload Too Large.

Open question: error journal (Rust ADR-018 append-log pattern) — defer to G2 with multi-bucket federation, or wire it now? Plan says defer; revisit if errors surface during D3-D5.
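A sketch of the D2.4 guards, assuming a handler struct with a 4-slot buffered channel as the semaphore (field and type names are illustrative):

```go
package storaged

import "net/http"

const maxPutBytes = 256 << 20 // 256 MiB per-request cap (post-Q1)

type putHandler struct {
	putSem chan struct{} // make(chan struct{}, 4): in-flight PUT slots
	// ... S3 client etc. elided
}

func (h *putHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	select {
	case h.putSem <- struct{}{}: // non-blocking try-acquire of a slot
		defer func() { <-h.putSem }()
	default:
		w.Header().Set("Retry-After", "5")
		http.Error(w, "too many concurrent uploads", http.StatusServiceUnavailable)
		return
	}

	if r.ContentLength > maxPutBytes { // deterministic 413 for honest clients
		http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
		return
	}
	r.Body = http.MaxBytesReader(w, r.Body, maxPutBytes) // hard cap either way

	// ... stream r.Body to S3; an oversize chunked body surfaces as
	// *http.MaxBytesError from the read path and is also mapped to 413.
}
```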


Day 3 — catalogd: Parquet manifests

Goal: register a dataset, persist to storaged, restart, manifest still visible.

# File What
3.1 internal/catalogd/manifest.go Parquet read/write using arrow-go/v18/parquet/pqarrow. Schema: dataset_id, name, schema_fingerprint, objects, created_at, updated_at, row_count
3.2 internal/catalogd/registry.go In-memory index (map[name]Manifest), rehydrated on startup from primary://_catalog/manifests/*.parquet
3.3 cmd/catalogd/main.go wire routes — POST /catalog/register (idempotent by name + fingerprint per Rust ADR-020), GET /catalog/manifest/{name}, GET /catalog/list
3.4 internal/catalogd/store_client.go thin HTTP client to cmd/storaged — round-trips manifest Parquets

Acceptance D3: register a dataset, see it in /catalog/list, restart catalogd, /catalog/list still shows it. Re-register same name + same fingerprint → 200, same dataset_id. Different fingerprint → 409 Conflict.
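A compressed sketch of that register contract (types and field names simplified, not the shipped code): fresh name creates an entry, same name + same fingerprint reuses the dataset_id and replaces the objects, different fingerprint is the conflict the HTTP layer maps to 409.

```go
package catalogd

import (
	"errors"
	"sync"
	"time"
)

var ErrFingerprintConflict = errors.New("schema fingerprint conflict")

type Manifest struct {
	DatasetID         string
	Name              string
	SchemaFingerprint string
	Objects           []string
	UpdatedAt         time.Time
}

type Registry struct {
	mu     sync.Mutex
	byName map[string]*Manifest
}

// Register implements the ADR-020 idempotency contract.
func (r *Registry) Register(m Manifest) (Manifest, bool, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	prior, ok := r.byName[m.Name]
	if !ok {
		m.UpdatedAt = time.Now()
		r.byName[m.Name] = &m
		return m, false, nil // existing=false
	}
	if prior.SchemaFingerprint != m.SchemaFingerprint {
		return Manifest{}, false, ErrFingerprintConflict // → 409 Conflict
	}
	prior.Objects = m.Objects    // replace objects
	prior.UpdatedAt = time.Now() // bump updated_at
	return *prior, true, nil     // existing=true, same dataset_id
}
```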


Day 4 — ingestd: CSV → Parquet → catalog

Goal: POST /ingest with a CSV file produces a Parquet in storaged + a manifest in catalogd.

# File What
4.1 internal/ingestd/schema.go infer Arrow schema from CSV header + first-N-row sampling. ADR-010 default-to-string on ambiguity
4.2 internal/ingestd/csv.go stream CSV → array.RecordBatch → Parquet (arrow-go pqarrow writer)
4.3 cmd/ingestd/main.go route POST /ingest — multipart form file → schema infer → write Parquet → call catalogd to register

Acceptance D4: curl -F file=@workers_500k.csv :3203/ingest?name=workers_500k returns 200 with the registered manifest; aws s3 ls (or mc ls) shows the Parquet under primary://datasets/workers_500k/; curl :3202/catalog/manifest/workers_500k returns the manifest with row_count=500000.


Day 5 — queryd: DuckDB SELECT

Goal: SQL queries over Parquet datasets.

# File What
5.1 internal/queryd/db.go database/sql connection to github.com/duckdb/duckdb-go/v2 (cgo). Ensures DuckDB extensions Parquet + httpfs are loaded; on connection open, executes CREATE SECRET (TYPE S3) populated from internal/secrets/provider.go so read_parquet('s3://...') against MinIO authenticates per session
5.2 internal/queryd/registrar.go reads catalogd /catalog/list, registers each dataset as a DuckDB view: CREATE VIEW workers_500k AS SELECT * FROM read_parquet('s3://...')
5.3 cmd/queryd/main.go route POST /sql (JSON body {"sql": "..."}). View refresh strategy: cache views with a TTL (default 30s) + invalidate on If-None-Match against catalogd's manifest etag. Don't re-CREATE on every request — Opus review flagged that as the perf cliff during D6 timing capture

Acceptance D5: after Day 4 ingestion, curl -X POST -d '{"sql":"SELECT count(*) FROM workers_500k"}' :3204/sql returns [{"count_star()":500000}]. A SELECT role, count(*) FROM workers_500k WHERE state='IL' GROUP BY role returns expected rows.


Day 6 — Gate day: end-to-end via gateway

Goal: the closing G0 acceptance gate passes.

# What
6.1 Promote cmd/gateway/main.go /v1/ingest + /v1/sql from D1.10 stubs (501) to real reverse-proxies. httputil.NewSingleHostReverseProxy preserves the inbound path by default, so the proxy must use a custom Director (or Rewrite) that strips the /v1 prefix before forwarding — otherwise the call lands on ingestd:3203/v1/ingest which doesn't exist (ingestd serves /ingest, queryd serves /sql). Multipart forwarding for /v1/ingest is the riskiest hop — verify form parts pass through with the file body intact
6.2 Smoke script scripts/g0_smoke.sh: spin up MinIO + 5 services, ingest, query, assert row count
6.3 Run smoke against workers_500k.csv end-to-end
6.4 Capture timing — total ingest + query latency, file size, peak memory

Closing G0 acceptance: scripts/g0_smoke.sh exits 0. Numbers recorded in docs/G0_BASELINE.md for future regression comparison.


Day 7 — Cleanup + retro

# What
7.1 Update SPEC §4 G0 with what actually shipped vs planned (deviations, surprises)
7.2 Write docs/G0_BASELINE.md — measured perf numbers + comparison hooks for G1+
7.3 Finalize ADRs that were stubbed before their decisions landed — ADR stubs go in at the start of D4 (arrow-go version pin, schema inference policy) and D5 (DuckDB extension load order, S3 secret provisioning, view-refresh TTL) so reviewers can object in-flight; D7 just commits them after running real code against the calls
7.4 Tag the commit phase-g0-complete
7.5 Open follow-up issues for anything punted (error journal, multi-bucket, profile system, two-phase-write orphan GC, shared-server.go refactor for cgo-handle services)

Risks tracked across the week

Risk Where Mitigation
cgo build fails on the dev box D5 D0.3 verifies gcc present; if cgo specifically breaks, fall back to DuckDB external process (SPEC §3.1 option B)
arrow-go pqarrow schema mismatch with CSV inference D4 Sample 1k rows for type inference, default to String per ADR-010, log when defaulting
DuckDB can't read S3 Parquet directly D5 Load httpfs extension explicitly; if it fails, copy Parquet to a local temp file before query (slow but correct)
/catalog/register race between ingestd writer and catalogd reader D3-D4 Same write-lock-across-storage-write pattern as Rust ADR-020 — serialize registers; throughput is OK at low ingest QPS
workers_500k.csv schema drifts vs Rust era D4 Plan calls for inferring fresh, not porting Rust schema. If staffer-domain features break in G3+, revisit

Out of scope for G0 (deferred to later phases)

  • Vector indexing — Phase G1
  • Multi-bucket / federation — Phase G2
  • Profile system — Phase G2
  • Hot-swap atomicity — Phase G2
  • Pathway memory — Phase G3
  • Distillation pipeline — Phase G3
  • MCP server / observer / auditor — Phase G4
  • HTMX UI — Phase G5
  • TLS, auth — explicit non-goal until G2 (single-bucket no-auth dev)

Open questions before Day 1

  1. MinIO instance — reuse the existing one at localhost:9000 that lakehouse uses (shared dev box) or stand up a fresh one with a separate bucket prefix?
  2. /etc/lakehouse/secrets.toml — share the lakehouse repo's secrets file or create /etc/golangLAKEHOUSE/secrets.toml?
  3. Workers CSV source — derive from workers_500k.parquet (round-trip back to CSV) or use workers_500k_v9.csv if it exists?

These are ops calls, not architecture. Answer when D0 is being executed.


Self-review — independent pass via gateway overseer

Reviewer: opencode/claude-opus-4-7 via localhost:3100/v1/chat (the same path the production overseer correction loop uses post-G0 in the Rust era). Run on the original draft before any of the inline fixes above were applied. Findings dispositioned below.

BLOCK — both real, both fixed inline

# Finding Disposition Fix location
B1 apt install build-essential alone won't satisfy the cgo link step for duckdb-go/v2 Fixed — D0.6 now runs a smoke go install against an empty module on D0 to flush platform issues here, not on D5 D0.6
B2 DuckDB session needs S3 credentials (CREATE SECRET) plumbed from SecretsProvider; "load httpfs" alone leaves auth unwired Fixed — D5.1 now calls CREATE SECRET (TYPE S3, ...) on connection open, populated from internal/secrets/provider.go D5.1

WARN — 4 of 5 fixed inline; 1 deferred

# Finding Disposition Fix location
W1 Two-phase write (storaged → catalogd register) leaves orphan Parquets on partial failure; no GC story Deferred — punted to G2 alongside multi-bucket + error journal; tracked in §Risks and D7.5 follow-up D7.5
W2 "Refresh views on each /sql call" will be the D6 perf cliff Fixed — D5.3 now uses TTL-cached views with etag invalidation against catalogd D5.3
W3 Shared internal/shared/server.go factory across heterogeneous binaries (HTTP ingress vs cgo-DB-holder) couples graceful-shutdown semantics that will need unwinding later Accepted with note — G0 keeps the simple shared factory; refactor explicitly listed as a G1+ follow-up D7.5
W4 storaged PUT/GET on a TCP port with no auth + no body cap is a footgun Fixed — D2.4 now binds 127.0.0.1 only and applies a 2 GiB MaxBytesReader cap D2.4
W5 Gateway reverse-proxy introduced cold on D6 gate day compresses risk into the deadline Fixed — D1.10 now stubs the routes returning 501; D6.1 just promotes them to real proxies D1.10 + D6.1

INFO — both fixed inline

# Finding Disposition Fix location
I1 go mod tidy before any imports is a trivial-true verification Fixed — D0.6 re-purposed for the cgo smoke; tidy verification moved to D1 acceptance D0.6 + D1 acceptance
I2 Filing ADRs after the work is done inverts the usual pattern Fixed — D7.3 reframed: ADR stubs go in at the start of D4/D5 so reviewers can object in-flight; D7.3 just finalizes them D7.3

Net change (Opus pass)

7 of 9 findings produced inline plan edits; 2 deferred to post-G0 follow-up issues (W1 orphan GC, W3 shared-server refactor) with the deferral itself documented. No findings dismissed as confabulation.


Self-review — second pass via Kimi K2.6 (cross-lineage)

Reviewer: opencode/kimi-k2.6 via the same gateway path. Run on the post-Opus-fix doc to surface what a different model lineage catches. Per the Rust-era cross-lineage rotation pattern, Kimi tends to ground on textual specifics where Opus surfaces architectural shape — both lenses are useful.

Kimi's output was discursive (pre-format thinking rather than the requested BLOCK/WARN/INFO bullets), but two concrete catches landed — both BLOCKs that Opus missed and that the Opus-pass fixes themselves introduced.

BLOCK — both real, both fixed inline

# Finding Disposition Fix location
K1 Opus's BLOCK B1 fix used go install github.com/duckdb/duckdb-go/v2@latest to verify cgo, but go install pkg@version requires a main package — duckdb-go/v2 is a library, so the command fails with "not a main package" before the cgo linker chain is exercised. The verification can pass on a broken-cgo box. Fixed — D0.6 now creates a temporary module + a 5-line main.go that imports duckdb-go/v2 and calls sql.Open("duckdb",""), then go runs it — actually compiles + executes with cgo D0.6
K2 Path mismatch: D1.10 stubs /v1/ingest and /v1/sql on the gateway, but D4 has ingestd serving /ingest and D5 has queryd serving /sql. D6.1's httputil.NewSingleHostReverseProxy preserves the inbound path by default, so the proxy would forward to ingestd:3203/v1/ingest which doesn't exist. Smoke would fail on D6 with a 404 on the backend, not the gateway. Fixed — D6.1 now specifies a custom Director that strips the /v1 prefix before forwarding D6.1

WARN — none extracted (Kimi response truncated at this section)

The remaining response stream considered D2.4 MaxBytesReader semantics and D3 manifest registry concurrency, but cut off before producing structured findings. A third pass could be run if more lineage coverage is wanted — for now, two BLOCKs fixed is the cross-lineage delta.

Net change (Kimi pass)

2 BLOCKs landed, both fixed inline. Both were introduced by Opus-pass fixes — illustrating exactly why cross-lineage rotation matters: one model's review of the original is not the same as a different model's review of the post-fix version. The Rust auditor's Kimi/Haiku/Opus rotation captures this dynamic; today's two-pass doc review reproduces it on a much smaller scale.


Self-review — third pass via Qwen3-coder:480b (cross-lineage)

Reviewer: qwen3-coder:480b via ollama_cloud (Ollama Pro). Largest single-model open-weights coder in the fleet. Run on the post-Kimi-fix doc.

Output style note

Qwen3-coder's response was 80% approval of prior fixes (citing B1/B2/K1/K2/I2 as "NEW BLOCKs/INFOs" — actually a misuse of the NEW label, but useful as triangulation: three independent lineages now agree the prior fixes are correct). Only 2 genuinely new WARNs.

WARN — 1 fixed inline, 1 deferred

# Finding Disposition Fix location
Q1 D2.4 MaxBytesReader at 2 GiB still permits memory exhaustion under concurrent uploads — N concurrent × 2 GiB blows the box Fixed — D2.4 cap reduced to 256 MiB per request, plus a 4-slot semaphore on concurrent in-flight PUTs (503 + Retry-After when full) D2.4
Q2 D5.3 view refresh has TTL + etag invalidation but no batching, so bursty queries against many datasets can re-issue catalog reads redundantly Deferred — minor under G0's single-tenant load; revisit in G2 alongside the multi-bucket / profile work that creates more catalog churn D7.5

Net change (Qwen pass)

1 fixed inline, 1 deferred. The mislabeled "approval" output was net positive — it's the closest thing to a triangulation signal we get without a fourth model. With Opus + Kimi + Qwen all confirming the prior fixes hold, the plan is unusually well-cross-checked for a greenfield kickoff.

Cumulative disposition across all 3 lineages

Pass Reviewer New findings Fixed Deferred
1 opencode/claude-opus-4-7 9 7 2
2 opencode/kimi-k2.6 2 2 0
3 qwen3-coder:480b (ollama_cloud) 2 1 1
4 runtime smoke (D1 actual launch) 1 1 0
Total 14 11 3

The 4th pass — runtime smoke — caught the only thing three doc-review lineages couldn't: port 3100 was already occupied by the live Rust lakehouse. Documented and dispositioned the same way as the model findings.

Three deferrals, all to G2: orphan GC on two-phase write (W1), shared-server refactor for cgo-handle services (W3), batched view refresh (Q2). Tracked in D7.5.


D1 — actual run results (2026-04-28)

Phase G0 Day 1 executed end-to-end. Output of scripts/d1_smoke.sh:

[d1-smoke] building...
[d1-smoke] launching...
[d1-smoke] /health probes:
  ✓ gateway (:3110) → {"status":"ok","service":"gateway"}
  ✓ storaged (:3211) → {"status":"ok","service":"storaged"}
  ✓ catalogd (:3212) → {"status":"ok","service":"catalogd"}
  ✓ ingestd (:3213) → {"status":"ok","service":"ingestd"}
  ✓ queryd (:3214) → {"status":"ok","service":"queryd"}
[d1-smoke] gateway 501 stub probes:
  ✓ POST /v1/ingest → 501 + X-Lakehouse-Stub: g0
  ✓ POST /v1/sql → 501 + X-Lakehouse-Stub: g0
[d1-smoke] D1 acceptance gate: PASSED

What landed:

  • internal/shared/server.go — chi factory, slog JSON logging, /health, graceful shutdown via signal.NotifyContext
  • internal/shared/config.go — TOML loader with DefaultConfig() and env-override-via-flag pattern (-config flag)
  • cmd/{gateway,storaged,catalogd,ingestd,queryd}/main.go — five binaries, each ~30 lines, all using the shared factory
  • lakehouse.toml — G0 dev config with the shifted 3110+ ports
  • scripts/d1_smoke.sh — repeatable smoke test, exits 0 on PASS
  • bin/{gateway,storaged,catalogd,ingestd,queryd} — built binaries (~9.7 MB each, no debug stripping)

What G0 didn't need but I added anyway (intentional):

  • Gateway already has the D1.10 stub routes wired; D6.1 is just swap-the-handler.
  • TOML config supports an S3 section even though storaged doesn't consume it until D2 — saves a config schema bump on D2.

Next: D2 — storaged's actual S3 GET/PUT/LIST routes against MinIO.


D1 — code scrum review (3-lineage parallel pass)

After D1 shipped, the actual code (not docs) was reviewed by all three lineages running in parallel — the cross-lineage discipline applied to the implementation, not just the plan.

Pass Reviewer Latency Findings
1 opencode/claude-opus-4-7 24.9s 2 BLOCK + 4 WARN + 2 INFO = 8
2 opencode/kimi-k2.6 18.3s discursive output; 1 effective WARN extracted
3 qwen3-coder:480b (ollama_cloud) 62.0s 2 BLOCK + 2 WARN + 1 INFO = 5

Convergent findings (caught by ≥2 reviewers — high confidence)

# Severity Finding Reviewers Disposition
C1 BLOCK Run() errCh/select race: fast bind error can be silently dropped Opus + Qwen Fixed — net.Listen() is now called synchronously before the goroutine; bind errors return as Run's return value before any select
C2 BLOCK sleep 0.5 in d1_smoke.sh races bind on cold-start boxes Opus + Qwen Fixed — replaced with poll_health() loop, 5s budget per service, 50ms poll interval
C3 WARN LoadConfig silent fallback when file missing Opus + Qwen Fixed — emits slog.Warn with path + hint when path was given but file absent

Single-reviewer findings (lineage-specific catches)

# Severity Finding Reviewer Disposition
S1 WARN slog.SetDefault mutates global state from a library function Kimi Fixed — Run() no longer calls SetDefault; logger is constructed locally and passed to middleware
S2 WARN os.IsNotExist → errors.Is(err, fs.ErrNotExist) idiom Opus Fixed
S3 WARN secrets in lakehouse.toml — no .gitignore / env path Opus Already planned — D2.3 introduces SecretsProvider; D1 fields are empty strings
S4 INFO 5x near-identical cmd/*/main.go — could share a Main() helper Opus Accepted — defer until D2 surfaces the real per-service config consumption pattern
S5 INFO /health logs at Info — k8s liveness probes will dominate volume Opus Accepted — defer post-G0; G0 isn't on k8s yet
S6 WARN smoke double-curl per route doubles load + creates state window Opus Fixed — single curl -i parsed once for both code + headers

Second-pass review (post-fix code, Opus only)

After applying the first-round fixes, a second-pass Opus review caught 1 self-downgraded BLOCK + 4 WARN + 2 INFO. Of those, only one was a real new actionable: head -1 on curl -i is fragile against 1xx interim lines. Fixed — switched to awk '/^HTTP\//{code=$2} END{print code}' which picks the last status line (robust to 100 Continue etc).

The other 5 second-pass findings were:

  • Theoretical (clean-exit-without-Shutdown path that doesn't trigger today)
  • Aspirational (defensive defer ln.Close() for hypothetical future code)
  • Caught downstream (poll_health not identity-checking — but the followup probe IS identity-checked)
  • Stylistic (config.go's slog.Warn lands on the global default — but config loads BEFORE the service-level logger exists; accepting)
  • Aspirational (newListener "for testability" indirection isn't needed yet)

Cumulative D1 disposition

  • Pass 1 (3 lineages parallel): 14 distinct findings (3 convergent + 11 single-reviewer; some overlap with prior doc-review findings)
  • Pass 2 (Opus second-pass on post-fix code): 7 findings, 1 actionable fixed, 6 dispositioned
  • Total fixed in D1 code review: 7 · accepted-with-rationale: 5 · deferred: 0
  • Build clean (go build ./cmd/... exit 0), go vet ./... clean, smoke PASS after every fix round.

D1 ships harder than it would have without the scrum: bind-error handling is now race-free, smoke is deterministic, log volume on /health is acknowledged, secrets handling has a flagged path forward.


D2 — actual run results (2026-04-28 evening)

Phase G0 Day 2 executed end-to-end. Output of scripts/d2_smoke.sh:

[d2-smoke] PUT round-trip:
  ✓ PUT d2-smoke/<ts>.bin → 200
[d2-smoke] GET echoes bytes:
  ✓ GET d2-smoke/<ts>.bin → bytes match
[d2-smoke] LIST includes key:
  ✓ LIST prefix=d2-smoke/ → contains d2-smoke/<ts>.bin
[d2-smoke] DELETE then GET → 404:
  ✓ DELETE then GET → 404
[d2-smoke] 256 MiB cap → 413:
  ✓ PUT 257 MiB → 413
[d2-smoke] semaphore: 5th concurrent PUT → 503 + Retry-After:5
  ✓ 5th concurrent PUT → 503 + Retry-After: 5
[d2-smoke] D2 acceptance gate: PASSED

What landed:

  • internal/secrets/provider.go — Provider interface, FileProvider reading /etc/lakehouse/secrets-go.toml with inline-fallback for G0 dev convenience, StaticProvider test helper
  • internal/storaged/bucket.go — aws-sdk-go-v2/service/s3 wrapper: Get/Put/List/Delete; manager.Uploader for multipart-streaming PUT; MaxListResults=10_000 cap with ...truncated... sentinel
  • internal/storaged/registry.go — BucketRegistry (single bucket in G0; G2 multi-bucket federation extends this)
  • cmd/storaged/main.go — verb-paths GET/PUT/LIST/DELETE, strict validateKey (rejects empty, >1024B, NUL, leading-/, .., CR/LF/tab), Content-Length up-front 413, MaxBytesReader 256 MiB body cap, 4-slot non-blocking semaphore (503+Retry-After:5)
  • scripts/d2_smoke.sh — 6 acceptance probes; uses curl -T --limit-rate for true streaming uploads (--data-binary @- first attempt buffered client-side, semaphore never engaged)
  • Per-package unit tests for provider + registry + validateKey
  • New bucket lakehouse-go-primary on the existing MinIO at :9000, isolated from the Rust lakehouse bucket during coexistence

Out of spec but added: DELETE was exposed at the HTTP layer (the kickoff D2.4 listed only GET/PUT/LIST in routes; bucket.Delete was in the wrapper). J approved option A (DELETE exposed) for symmetry + smoke ergonomics.

validateKey policy: J approved the strict stance — leading /, .. components, and CR/LF/tab control characters all rejected at the HTTP boundary. Costs ~5 lines, propagates safety to every downstream consumer.
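A sketch of that policy (error text and exact limits are illustrative; the shipped code may differ):

```go
package storaged

import (
	"errors"
	"strings"
)

// validateKey rejects keys at the HTTP boundary so every downstream consumer
// only ever sees clean object keys.
func validateKey(key string) error {
	switch {
	case key == "":
		return errors.New("empty key")
	case len(key) > 1024:
		return errors.New("key exceeds 1024 bytes")
	case strings.ContainsAny(key, "\x00\r\n\t"):
		return errors.New("key contains control characters")
	case strings.HasPrefix(key, "/"):
		return errors.New("key must not start with '/'")
	}
	for _, seg := range strings.Split(key, "/") {
		if seg == ".." {
			return errors.New("key must not contain '..' components")
		}
	}
	return nil
}
```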

Next: D3 — catalogd Parquet manifests with idempotent register.


D2 — code scrum review (3-lineage parallel pass)

After D2 shipped, the actual code went through the same 3-lineage parallel scrum as D1.

Pass Reviewer Latency Findings
1 opencode/claude-opus-4-7 ~30s 1 BLOCK + 3 WARN + 3 INFO = 7
2 openrouter/qwen/qwen3-coder ~32s 2 BLOCK + 1 WARN + 1 INFO = 4
3a opencode/kimi-k2.6 (max_tokens=4096) 18s discursive — finish_reason=length, no structured output (model spent budget thinking, never reached BLOCK/WARN format). opencode rejected max_tokens>4096 without stream=true
3b kimi/kimi-k2-turbo (direct adapter) 124s empty content, finish_reason=length (8192 reasoning tokens, no surfaced output)
3c openrouter/moonshotai/kimi-k2-0905 33s 1 BLOCK + 2 WARN + 1 INFO = 4 (used as the K-lineage representative)

The Kimi route shopping (3a → 3c) was a process finding worth recording: opencode caps non-streaming Kimi calls at 4096 output tokens, the direct kimi.com adapter consumed 8192 tokens of reasoning but surfaced empty content, and openrouter's moonshotai/kimi-k2-0905 route delivered structured output in ~33s. For future scrums on Kimi, default to openrouter/moonshotai/kimi-k2-0905.

Convergent findings (≥2 reviewers — high confidence)

# Severity Finding Reviewers Disposition
C1 BLOCK buildRegistry defer cancel() cancels the ctx that the AWS SDK was loaded with — fine today (static creds, sync call) but breaks future credential refresh chains (EC2 IMDS, SSO, AssumeRole) Opus + Kimi Fixed — switched to context.Background() for SDK construction. Per-request lifetimes flow through r.Context()
C2 WARN→BLOCK *http.MaxBytesError may not unwrap through manager.Uploader's multipart goroutines for bodies >5 MiB; smoke covers 257 MiB single-PutObject path only — bodies in the multipart range that exceed 256 MiB could surface as 500 instead of 413 Opus (WARN) + Kimi (BLOCK) Fixed — added Content-Length up-front 413 (deterministic for honest clients) + string-suffix fallback ("http: request body too large") on the Put error path for chunked / lying-CL cases
C3 INFO→WARN Bucket.List accumulates unbounded results into a slice; stray prefix="" against a large bucket OOMs the daemon Opus (INFO) + Kimi (WARN) Fixed — MaxListResults=10_000 cap; truncated responses append a sentinel ObjectInfo{Key: "...truncated..."} so callers see they didn't get everything

Single-reviewer findings (lineage-specific catches)

# Severity Finding Reviewer Disposition
S1 WARN PUT 200 response missing Content-Type: application/json — Go's content-sniffing won't catch the JSON prefix, clients may see text/plain Opus Fixed — explicit header set before WriteHeader
S2 WARN Bucket.Get body close ordering fragile — currently correct but the contract is implicit Opus Accepted — current order is correct (early returns on errors where body is nil); no fix required
S3 INFO extractKey uses chi * wildcard — works correctly, just noting empty-key path is covered by validateKey Opus Accepted — already covered
S4 INFO FileProvider.mu sync.RWMutex is unused given the immutable-after-construction design Opus Accepted — drop the mutex when SIGHUP reload lands in G1 (S6 below); keep it for now as a placeholder
S5 INFO Bucket.Delete doesn't translate not-found to ErrKeyNotFound (Kimi noted, not flagged) Accepted — S3 DeleteObject is idempotent by spec; non-error on missing key is the correct behavior
S6 INFO FileProvider never reloads (no SIGHUP handler) Kimi Deferred to G1 — reload-on-SIGHUP is on the G1 list per internal/shared/server.go opening comment
S7 INFO Error messages on no-creds-found could specify which source (file vs fallback) was missing Qwen Deferred — minor debug ergonomics, no production impact
F1 BLOCK "Body close happens after semaphore release" Qwen Dismissed — false. Defer order is LIFO; r.Body.Close() was registered AFTER <-h.putSem, so it fires FIRST. Body closes before slot frees
F2 BLOCK "ObjectInfo fields can be nil-dereferenced in List" Qwen Dismissed — false. bucket.go:147-160 checks if o.Key != nil etc. before dereferencing every field
F3 WARN "MaxBytesReader not applied to semaphore-protected path" Qwen Dismissed — false. Semaphore is non-blocking try-acquire (select { default }); there is no waiting state where pre-cap MaxBytesReader matters

Cumulative D2 disposition

  • 3-lineage parallel pass: 15 distinct findings (3 convergent + 7 single-reviewer real + 3 false + 2 absorbed by other findings)
  • Fixed: 5 (3 convergent + 2 single-reviewer Opus)
  • Accepted-with-rationale: 5
  • Deferred to G1+: 2
  • Dismissed (false positives): 3 (all from Qwen — its strength on D1 doesn't reproduce on D2 code; the lineage caught real C1/C2 issues there but misread defer order + nil-checks here)
  • Build clean, vet clean, all tests pass, smoke 6/6 PASS after every fix round.

The Kimi/Opus convergence on C1 (ctx cancel) is the highest-signal finding of the round: two completely different lineages flagged the same architectural footgun. C2 (MaxBytesReader unwrap) was the most consequential — the smoke test would have stayed green while production multi-MiB uploads silently returned 500 on oversize. C3 was a latent OOM that's a 5-line fix.

The Qwen lineage delivered three false BLOCKs on D2 — different from its D1 contribution where it caught two real BLOCKs that Opus missed. Lineage rotation is real; on a given PR, one lineage may be the only one finding bugs and another may be confidently wrong. The convergence filter (≥2 reviewers) is the right gate.
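A sketch of the C2 disposition on the error path, assuming a helper of this shape (the name is illustrative): errors.As is the load-bearing check, the suffix match is the documented safety net for oversize chunked bodies whose error surfaces re-wrapped from inside manager.Uploader's goroutines.

```go
package storaged

import (
	"errors"
	"net/http"
	"strings"
)

// isBodyTooLarge maps an upload error onto the 413 response path.
func isBodyTooLarge(err error) bool {
	var mbe *http.MaxBytesError
	if errors.As(err, &mbe) {
		return true
	}
	return err != nil && strings.HasSuffix(err.Error(), "http: request body too large")
}
```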


D3 — actual run results (2026-04-28 evening)

Phase G0 Day 3 executed end-to-end. Output of scripts/d3_smoke.sh:

[d3-smoke] POST /catalog/register (fresh):
  ✓ fresh register → existing=false, dataset_id=200a05a8-...
[d3-smoke] GET /catalog/manifest/<name>:
  ✓ manifest dataset_id matches
[d3-smoke] GET /catalog/list (1 entry):
  ✓ list count=1
[d3-smoke] restart catalogd → rehydrate from Parquet:
  ✓ rehydrated dataset_id matches across restart
[d3-smoke] re-register (same name + same fingerprint) → existing=true:
  ✓ existing=true, same dataset_id, objects replaced (count=2)
[d3-smoke] re-register (different fingerprint) → 409:
  ✓ different fingerprint → 409 Conflict
[d3-smoke] D3 acceptance gate: PASSED

What landed:

  • internal/catalogd/manifest.go — Arrow Parquet codec for the Manifest type (dataset_id/name/schema_fingerprint/objects-as-list-of-struct/created_at_ns/updated_at_ns/row_count). One row per Parquet file. DatasetIDForName derives a deterministic UUIDv5 from name (namespace a8f3c1d2-4e5b-5a6c-9d8e-7f0a1b2c3d4e); same name on any box yields the same dataset_id.
  • internal/catalogd/registry.go — in-memory map[name]*Manifest with the ADR-020 contract: same name+fingerprint reuses dataset_id, replaces objects, bumps updated_at; different fingerprint → ErrFingerprintConflict. Single mutex serializes Register across persistence to close the check→insert TOCTOU. Rehydrate runs storaged I/O OUTSIDE the lock, swaps in the new map under the lock.
  • internal/catalogd/store_client.go — HTTP client over storaged's GET/PUT/DELETE/list. safeKey URL-escapes path segments while preserving /. Error paths drain body before close to keep the keep-alive pool healthy.
  • cmd/catalogd/main.go — POST /catalog/register, GET /catalog/manifest/{name}, GET /catalog/list. Sentinel-error detection (errors.Is(err, ErrEmptyName)) for 400s. 4 MiB body cap on register payloads.
  • New [catalogd] config: bind + storaged_url. Default http://127.0.0.1:3211.
  • scripts/d3_smoke.sh — 6 acceptance probes including rehydrate-across-restart (the load-bearing ADR-020 contract test).
  • arrow-go/v18 v18.6.0 + google/uuid v1.6.0 added; Go 1.24 → 1.25 forced by arrow-go's minimum.

UUIDv5 vs Rust v4: the Go rewrite picks deterministic-from-name. Same dataset name on any box converges to the same dataset_id; rehydrate after disk loss can't silently issue new IDs and break downstream cross-references. Rust's surrogate-UUIDv4 is what the prior system used; the divergence is intentional and documented at internal/catalogd/manifest.go:34.
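A sketch of that derivation using google/uuid's NewSHA1 (which yields a version-5 UUID); the namespace constant is the one quoted above:

```go
package catalogd

import "github.com/google/uuid"

var datasetNamespace = uuid.MustParse("a8f3c1d2-4e5b-5a6c-9d8e-7f0a1b2c3d4e")

// DatasetIDForName yields the same UUID for the same dataset name on any box,
// so rehydration after disk loss cannot silently mint new IDs.
func DatasetIDForName(name string) uuid.UUID {
	return uuid.NewSHA1(datasetNamespace, []byte(name))
}
```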

Next: D4 — ingestd CSV → Parquet → catalogd register loop.


D3 — code scrum review (3-lineage parallel pass)

After D3 shipped, the actual code went through the same 3-lineage parallel scrum as D1/D2.

Pass Reviewer Latency Findings
1 opencode/claude-opus-4-7 ~30s 1 BLOCK + 5 WARN + 3 INFO = 9
2 openrouter/moonshotai/kimi-k2-0905 ~30s 2 BLOCK + 2 WARN + 1 INFO = 5
3 openrouter/qwen/qwen3-coder ~25s 2 BLOCK + 2 WARN + 2 INFO = 6

Total: 20 distinct findings (some convergent across reviewers). Kimi route shopping from D2 paid off — openrouter/moonshotai/kimi-k2-0905 delivered structured output on first attempt, no max_tokens cap, no empty-content reasoning trap.

Convergent findings (≥2 reviewers — high confidence)

# Severity Finding Reviewers Disposition
C1 BLOCK×3 Decode indexes listArr.Offsets()[0]/[1] directly — fragile under array slicing/non-zero offset; panics on malformed Parquet (multi-row reader chunks, single-offset corrupt files) Opus + Kimi + Qwen Fixed — switched to listArr.ValueOffsets(0) (Arrow API that accounts for array offset) + bounds check on start/end against structArr.Len()
C2 WARN×2 Rehydrate holds the registry mutex across network I/O to storaged — slow storaged blocks all Register/Get/List; future re-sync endpoints stall Opus + Kimi Fixed — list/get/decode happen outside the lock; the completed map is swapped into r.byKey under a brief lock

Single-reviewer findings (lineage-specific catches)

# Severity Finding Reviewer Disposition
S1 WARN Idempotent re-register mutates the in-memory manifest BEFORE persist succeeds — split-brain on storaged failure (in-memory advances, disk holds old, restart silently rolls back) Opus Fixed — build candidate copy of prior, persist candidate, swap into r.byKey only on persist success
S2 WARN cmd/catalogd detects empty-input via strings.Contains(err.Error(), "empty name") — brittle, anyone reformatting the registry error silently demotes a 400 to a 500 Opus Fixed — exported ErrEmptyName / ErrEmptyFingerprint sentinels, HTTP layer uses errors.Is
S3 WARN Get/List cp := *m aliases the Objects slice + RowCount pointer — caller mutation corrupts registry state under the lock-free read Opus Fixed — cloneManifest deep-copies Objects + dereferences RowCount into a fresh *int64
S4 WARN error paths in store_client preview body but don't drain before close → HTTP/1.1 keep-alive pool reuse breaks → slow socket leak Qwen Fixed — drainAndClose helper reads up to 64 KiB before close on every defer
S5 INFO×3 name-validation regex / per-call deadline / Snappy compression on manifest writes Opus Deferred — small, no risk if skipped
S6 WARN Rehydrate aborts on first decode error → partial state Kimi Accepted — fail-loud > silent-partial is the design; cmd/catalogd os.Exit(1)s on rehydrate error so a corrupt manifest is operator-visible at startup
S7 WARN store_client.List ignores pagination — if storaged gains continuation tokens, the client silently drops everything past page 1 Opus Accepted — storaged caps lists at MaxListResults=10_000 with a ...truncated... sentinel; no continuation token in the wire format yet (deferred to D7.5+ alongside multi-bucket federation)
F1 BLOCK Decode crashes on empty Parquet (NumRows==0) — Value(0) panics Kimi Dismissed — already handled. tbl.NumRows() != 1 returns error before any column access; NumRows==0 also fails the subsequent rr.Next() check
F2 INFO safeKey double-escapes slashes; should use url.PathEscape(key) directly Kimi Dismissed — url.PathEscape("a/b") returns "a%2Fb". Splitting on / first is necessary to preserve the path separator while escaping segment content. The current code is correct; Kimi's suggested fix would break storaged's chi wildcard match
F3 INFO rb.NewRecord() error unchecked Qwen Dismissed — false signature. RecordBuilder.NewRecord() returns arrow.Record only; no error to check

Cumulative D3 disposition

  • 3-lineage parallel pass: 20 distinct findings
  • Fixed: 6 (2 convergent + 4 single-reviewer)
  • Accepted-with-rationale: 5 (3 INFO + 2 WARN with deferred-by-design rationale)
  • Dismissed (false positives): 3 (1 Kimi BLOCK on already-handled empty case, 1 Kimi INFO on safeKey, 1 Qwen INFO on Arrow API signature)
  • Build clean, vet clean, all tests pass, smoke 6/6 PASS after every fix round.

The C1 triple-confirmation (Opus + Kimi + Qwen all flagging the same list-offset indexing) is the strongest convergence signal across the three days. All three correctly diagnosed the underlying issue (direct Offsets() indexing is fragile under array slicing); Opus named the right Arrow API (ValueOffsets(0)), Kimi and Qwen named the bounds-check shape. Fix uses Opus's API + the others' bounds discipline together.

S1 (split-brain on persist failure) was the most consequential finding — D3's smoke would have stayed green through every test case because none of them simulate storaged-down mid-register. Opus alone caught it; the fix is a 4-line candidate-then-swap pattern but the class of bug (in-memory advances ahead of disk) is exactly the distributed-systems hazard ADR-020 was added to prevent at a different layer.


D4 — actual run results (2026-04-28 evening)

Phase G0 Day 4 executed end-to-end. Output of scripts/d4_smoke.sh:

[d4-smoke] POST /ingest?name=d4_workers (5 rows, 5 cols):
  ✓ ingest fresh → row_count=5, existing=false, key=datasets/d4_workers/247165...parquet
[d4-smoke] mc shows the parquet on MinIO:
  ✓ <fp>.parquet present in lakehouse-go-primary/datasets/d4_workers/
[d4-smoke] catalogd manifest matches:
  ✓ manifest row_count=5, fp matches, 1 object at datasets/d4_workers/<fp>.parquet
[d4-smoke] re-ingest same CSV → existing=true:
  ✓ idempotent re-ingest: existing=true, same dataset_id, same fingerprint
[d4-smoke] schema-drift CSV → 409:
  ✓ schema drift → 409 Conflict
[d4-smoke] D4 acceptance gate: PASSED

What landed:

  • internal/ingestd/schema.go — Arrow schema inference per ADR-010 ("default to string on ambiguity"). Detects int64, float64, bool, string. Empty cells → nullable flag. Fingerprint() is a deterministic SHA-256 over (name, type) tuples in header order using ASCII record/unit separators (0x1e/0x1f) so column names with commas don't collide. Nullability intentionally NOT in the fingerprint — gaining/losing nulls isn't a schema change.
  • internal/ingestd/csv.go — single-pass CSV → Arrow → Parquet. Buffers the first DefaultSampleRows=1000 rows for inference, then replays them + streams the rest into the pqarrow writer in DefaultBatchSize=8192-row record batches. Snappy compression on the parquet output. Empty cells → null on numeric/bool, empty string on string columns. Per scrum C-WCLOSE: deferred guarded w.Close() so error paths flush + free buffered column data.
  • internal/ingestd/catalog_client.go — symmetric in shape with internal/catalogd/store_client.go. POST /catalog/register, drain-on-error keep-alive hygiene, ErrFingerprintConflict sentinel for the 409 path.
  • cmd/ingestd/main.go — POST /ingest?name=X with multipart form. Pipeline: parse multipart → IngestCSV → PUT parquet to storaged at datasets/<name>/<fp_hex>.parquet (content-addressed per scrum C-DRIFT) → catalogd /catalog/register. 256 MiB body cap matches storaged's PUT cap. 5-minute upstream timeout for large ingests.
  • New [ingestd] config block: bind + storaged_url + catalogd_url
  • scripts/d4_smoke.sh — 6 acceptance probes. Generates a CSV that exercises every inference path (clean int, string, ADR-010 fallback on salary with one N/A, mixed-case bool literals, float64).

Content-addressed parquet keys: datasets/<name>/<fp_hex>.parquet. Per scrum C-DRIFT (Opus WARN): the prior data.parquet shape meant a schema-drift ingest would PUT-overwrite the live parquet BEFORE catalogd's 409 fired, leaving storaged inconsistent with the catalog. Fingerprint-keyed paths mean drift attempts write to a different file that becomes an orphan; the live data is never corrupted. Same-fp re-ingest still overwrites the same key (idempotent).
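A sketch of the fingerprint and key scheme described above (the column type abstraction is simplified): SHA-256 over (name, type) pairs in header order, joined with the ASCII unit/record separators so commas in column names cannot collide, with the hex digest becoming the content-addressed key. Nullability is deliberately left out, per the D4 notes.

```go
package ingestd

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

type column struct{ name, typ string } // e.g. {"salary", "float64"}

func fingerprint(cols []column) string {
	h := sha256.New()
	for _, c := range cols {
		h.Write([]byte(c.name))
		h.Write([]byte{0x1f}) // unit separator between name and type
		h.Write([]byte(c.typ))
		h.Write([]byte{0x1e}) // record separator between columns
	}
	return hex.EncodeToString(h.Sum(nil))
}

// parquetKey yields the content-addressed key: a schema-drift ingest lands in
// a different file instead of overwriting the live parquet.
func parquetKey(dataset string, cols []column) string {
	return fmt.Sprintf("datasets/%s/%s.parquet", dataset, fingerprint(cols))
}
```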

Next: D5 — queryd DuckDB SELECT over registered datasets.


D4 — code scrum review (3-lineage parallel pass)

Pass Reviewer Latency Findings
1 opencode/claude-opus-4-7 ~25s 4 WARN + 3 INFO (+ 2 BLOCKs Opus self-retracted in-flight after re-reading)
2 openrouter/moonshotai/kimi-k2-0905 ~25s 1 BLOCK + 2 WARN + 1 INFO
3 openrouter/qwen/qwen3-coder ~30s 2 BLOCK + 2 WARN + 2 INFO

Total: 16 distinct findings. Notably, Opus retracted two of its own BLOCKs after re-reading the code — the first time we've seen self-correction in the scrum stream. The actual reviewable output was 4 WARN + 3 INFO from Opus, and the analysis itself was visible.

Convergent findings (≥2 reviewers — high confidence)

No real convergent findings. Kimi and Qwen both flagged a "RecordBuilder leak on early error mid-stream" but on close reading the existing code is correct: the deferred rb.Release() at the outer scope captures whatever value rb holds at function exit, and the in-loop rb.Release() runs before rb is reassigned to a new builder. No leak, regardless of where errors occur. Both reviewers landed on the same wrong intuition by different paths.

Single-reviewer findings — applied

# Severity Finding Reviewer Disposition
C-DRIFT WARN PUT-before-register on a fixed datasets/<name>/data.parquet key means a schema-drift ingest overwrites the good parquet before catalogd's 409 fires. After: storaged holds the new (wrong-schema) parquet, manifest still has the old fingerprint, queryd reads stale-schema data Opus Fixed — content-addressed key shape datasets/<name>/<fp_hex>.parquet. Drift writes to a different file (orphan); live parquet is never overwritten by a non-matching schema
C-WCLOSE WARN pqarrow.NewFileWriter not Closed on error paths — buffered column data + OS resources leak per failed ingest Opus FixedwClosed flag + deferred guarded close. Success-path explicit Close still runs and is the only place the close error surfaces; defer fires only on error returns
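A sketch of the C-WCLOSE pattern, assuming arrow-go v18's pqarrow.FileWriter API and with writer construction elided: the deferred close fires only on error returns, so buffered column data is released on every path, while the success path keeps the one Close whose error is surfaced.

```go
package ingestd

import (
	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

func writeBatches(w *pqarrow.FileWriter, batches []arrow.Record) (err error) {
	closed := false
	defer func() {
		if !closed {
			_ = w.Close() // error-path cleanup; the original error wins
		}
	}()

	for _, rec := range batches {
		if err := w.Write(rec); err != nil {
			return err // deferred close flushes/frees the writer's buffers
		}
	}

	closed = true
	return w.Close() // success path: this Close error is the one that matters
}
```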

Single-reviewer findings — accepted with rationale

# Severity Finding Reviewer Disposition
O-WARN1 WARN Schema sample is 1000 rows; ambiguous values past row 1000 fail the entire ingest with 400 instead of widening the inference Opus Accepted — design call. G0's 256 MiB cap caps reasonable-CSVs well below the boundary. Long-term fix is two-pass infer or downgrade-to-string-on-failure; out of scope for D4
O-WARN2 WARN String-match on "http: request body too large" is paranoia; trust errors.As only Opus Acceptederrors.As IS the load-bearing check; the string-match is a documented safety net we keep across the codebase (also in storaged D2, also called out then)
K-WARN2 WARN Multipart parse buffers entire upload, then IngestCSV reads it again — double in-memory cost Kimi Accepted — known G0 limitation. ParseMultipartForm with the 64<<20 threshold spills to temp file above 64 MiB so the doubling only hits below that; 256 MiB upload cap means peak is bounded

Single-reviewer findings — deferred

# Severity Finding Reviewer Disposition
O-INFO×3 INFO 0x1e/0x1f separator validation, body-close ordering, FieldsPerRecord=-1 swallowing CSV truncation Opus Deferred — small, no risk if skipped
K-INFO×1 INFO Fingerprint ignores nullability — already documented intentional choice Kimi Deferred
Q-INFO×2 INFO Lowercase isBoolLiteral micro-opt; align multipart-parse limit with body cap Qwen Deferred

Dismissed (false positives)

# Severity Finding Reviewer Why dismissed
F1 Q-BLOCK csv.Reader needs LazyQuotes=true for multi-line quoted fields Qwen False — LazyQuotes is for unescaped quotes WITHIN fields. Go's csv.Reader correctly handles RFC 4180 multi-line quoted fields by default; smoke would have surfaced this if true
F2 Q-BLOCK row[i] OOB panic on inconsistent rows Qwen False — already bounds-checked at schema.go:73 (if i >= len(row) { continue }) and csv.go:201 (if i < len(row) { cell = row[i] })
F3 K-BLOCK Type assertion panic if pqarrow reorders fields Kimi Speculative — pqarrow doesn't reorder schema; no real path
F4 K-WARN + Q-WARN×2 "RecordBuilder leak on early error mid-stream" (false convergent) Kimi + Qwen Both wrong, by different theories. The outer defer rb.Release() captures the current value of rb at function exit; the in-loop rb.Release() runs before reassignment. No path leaks

Cumulative D4 disposition

  • 3-lineage parallel pass: 16 distinct findings
  • Fixed: 2 (both Opus single-reviewer; no real convergent findings this round)
  • Accepted-with-rationale: 3
  • Deferred: 6 (mostly INFO)
  • Dismissed (false positives): 5 (2 Qwen BLOCKs + 1 Kimi BLOCK + 1 false convergent leak finding)
  • Build clean, vet clean, all tests pass, smoke 6/6 PASS after every fix round.

The D4 scrum produced fewer real findings than D3 (2 vs 6) — both real findings were Opus-only, both were WARN-level architectural hazards that smoke wouldn't catch (PUT-then-register order failure on schema drift; resource leak on writer error path). Kimi and Qwen delivered a false convergent finding on a non-issue, plus several hallucinated BLOCKs (LazyQuotes for multi-line, OOB on bounds-checked indexing). Opus's self-retraction of two BLOCKs in-flight is a healthier reviewer behavior than confidently-wrong dismissals would have been.

The C-DRIFT fix (content-addressed parquet keys) is the kind of finding that's invisible to integration tests because the smoke's "schema drift → 409" assertion was passing in the corrupted-state world too — the 409 fires correctly; what was wrong was the live data getting overwritten before the 409 fired. Opus reading the PUT-then-register order and noticing that PUT mutates state before the validation check is exactly the kind of architectural code-read the scrum exists for.


Pre-D5 prep (commit 4205ecd)

Between D4 and D5, J asked what would bite D5 if not fixed first. The answer was the CatalogClient package location: it lived in internal/ingestd/ but D5's queryd needed the same shape, and having queryd import from ingestd would have inverted the data-flow direction (ingestd is upstream of queryd).

Extracted to internal/catalogclient/ as a shared package, added the List(ctx) method queryd needs for view registration, kept the wire format unchanged. ingestd switched its import; all four existing smokes (D1-D4) passed unchanged.

DuckDB cgo path re-verified at this point with the official github.com/duckdb/duckdb-go/v2 v2.10502.0 (per ADR-001 §1) on Go 1.25 + arrow-go: standalone sql.Open("duckdb","") → db.Ping() → SELECT 'pong' round-trip succeeded.


D5 — actual run results (2026-04-29)

Phase G0 Day 5 executed end-to-end. Output of scripts/d5_smoke.sh:

[d5-smoke] ingest 5-row CSV via D4 path:
  ✓ ingest row_count=5
[d5-smoke] launching queryd (initial Refresh picks up d5_workers)...
[d5-smoke] POST /sql SELECT count(*) FROM d5_workers:
  ✓ count(*)=5
[d5-smoke] POST /sql SELECT * FROM d5_workers LIMIT 3:
  ✓ rows[0] = (id=1, name=Alice), columns=[id, name, salary]
[d5-smoke] schema-drift ingest 409s; existing view still queries:
  ✓ drift → 409
  ✓ post-drift count(*)=5 (view unchanged)
[d5-smoke] error path: SELECT FROM nonexistent → 400:
  ✓ unknown table → 400
[d5-smoke] D5 acceptance gate: PASSED

What landed:

  • internal/queryd/db.go — OpenDB(ctx, s3, prov, bucket) returns a *sql.DB backed by an in-memory DuckDB. Custom Connector with a bootstrapper that runs INSTALL httpfs / LOAD httpfs / CREATE OR REPLACE SECRET (TYPE S3, KEY_ID …, ENDPOINT …, URL_STYLE 'path', USE_SSL false) on every new physical connection. SetMaxOpenConns(1) so registrar's CREATE VIEWs and handler's user SQL serialize through one connection (avoids cross-connection visibility edge cases).
  • internal/queryd/registrar.go — Registrar.Refresh(ctx) reads catalog list, runs CREATE OR REPLACE VIEW "name" AS SELECT * FROM read_parquet('s3://bucket/key') per manifest, drops views for removed manifests, skips on unchanged updated_at. CatalogLister + Execer interfaces enable unit tests with no DuckDB / no real catalogd.
  • cmd/queryd/main.go — POST /sql with JSON body {"sql":"…"} → executes via *sql.DB → response shape {"columns":[{"name","type"},...], "rows":[[...]], "row_count":N}. []byte → string conversion so VARCHAR rows JSON-encode as text. Initial Refresh on startup, then a TTL ticker (30s default, configurable via [queryd].refresh_every) calling Refresh in a goroutine cancellable via process ctx.
  • New [queryd] config block: bind + catalogd_url + secrets_path + refresh_every.
  • scripts/d5_smoke.sh — 6 acceptance probes including post-drift query stability (view unchanged after schema-drift 409).

The D5 smoke launches queryd LAST — after the D4 ingest has registered the dataset — so the initial Refresh picks it up without waiting for the TTL. This matches what real ops looks like: queryd starts after data is loaded, picks up the catalog at that moment, then drifts in lockstep via the ticker.

Critical architectural choices:

  1. SetMaxOpenConns(1) — DuckDB views live in the in-memory catalog of a single instance; with one connection, registrar's CREATE VIEWs and handler's SELECTs are visibly synchronized without MVCC-timing edge cases. Lift in G2+ if intra-process query concurrency wins matter.
  2. Always-quote view identifiers — CREATE OR REPLACE VIEW "name" with internal " doubled. Catalogd accepts user-supplied names that wouldn't pass SQL bare-identifier rules; quoting makes them unambiguous.
  3. updated_at as the implicit etag — no separate ETag header in catalogd. The manifest's timestamp bumps on every same-fp re-register; registrar tracks time.Time per name and skips CREATE when unchanged (Opus's "perf cliff" warning from D5 plan).
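A sketch of the per-view refresh shape implied by choices 2 and 3, assuming a simplified Registrar (the shipped code works through the CatalogLister/Execer interfaces and also handles dropped views and error collection):

```go
package queryd

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
	"time"
)

type viewSource struct {
	Name      string
	ObjectKey string // e.g. datasets/d5_workers/<fp>.parquet
	UpdatedAt time.Time
}

type Registrar struct {
	db     *sql.DB
	bucket string
	known  map[string]time.Time // updated_at per view: the implicit etag
}

// refreshOne creates or replaces one view, skipping work when the manifest's
// updated_at hasn't moved, and always quoting the user-supplied dataset name.
func (r *Registrar) refreshOne(ctx context.Context, m viewSource) error {
	if prev, ok := r.known[m.Name]; ok && prev.Equal(m.UpdatedAt) {
		return nil // unchanged manifest: no CREATE on this tick
	}
	stmt := fmt.Sprintf(`CREATE OR REPLACE VIEW %s AS SELECT * FROM read_parquet(%s)`,
		quoteIdent(m.Name), quoteLiteral("s3://"+r.bucket+"/"+m.ObjectKey))
	if _, err := r.db.ExecContext(ctx, stmt); err != nil {
		return fmt.Errorf("create view %q: %w", m.Name, err)
	}
	r.known[m.Name] = m.UpdatedAt
	return nil
}

// quoteIdent doubles embedded double-quotes so any catalog name is a legal,
// unambiguous SQL identifier.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

func quoteLiteral(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}
```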

Next: D6 — gateway as a reverse proxy in front of all four backing services. Last day of G0. After that: G1+ (gRPC, Lance bench, vector indices, etc).


D5 — code scrum review (3-lineage parallel pass)

Pass Reviewer Latency Findings
1 opencode/claude-opus-4-7 ~28s 1 BLOCK + 4 WARN + 4 INFO = 9 (1 self-retracted)
2 openrouter/moonshotai/kimi-k2-0905 ~25s 2 BLOCK + 2 WARN + 1 INFO = 5
3 openrouter/qwen/qwen3-coder ~20s 2 BLOCK + 1 WARN + 1 INFO = 4

Total: 18 distinct findings. Strongest scrum yet on convergence quality — Opus's BLOCK on ctx capture + Kimi's BLOCK on credential leakage in error messages were both deep architectural reads that smoke could not have caught.

Convergent findings (≥2 reviewers — high confidence)

# Severity Finding Reviewers Disposition
C1 WARN×2 Refresh aborts mid-loop on first per-view error → poison manifest blocks all subsequent drops/updates → stale views stick around forever (Opus); a partial dropView failure would block other manifest refreshes (Kimi) Opus + Kimi Fixed — drop pass runs BEFORE create pass; per-iteration errors are collected and Refresh continues; final return uses errors.Join to surface every failure

Single-reviewer findings — applied

# Severity Finding Reviewer Disposition
B-CTX Opus BLOCK Bootstrap closure captures the ctx passed to OpenDB. Today masked by SetMaxOpenConns(1) + long-lived procCtx; future short-lived ctx + reconnect would silently fail every reconnect's bootstrap with a cancelled ctx Opus Fixed — bootstrap explicitly uses context.Background(); passed ctx is only used for the initial PingContext
B-LEAK Kimi BLOCK The prior firstLine(stmt) truncation to 80 chars STILL contained both KEY_ID '...' AND the start of SECRET '...' — log aggregators would capture credentials Kimi Fixed — replaced firstLine(stmt) with stable per-statement labels ("install httpfs" / "load httpfs" / "create secret") so no SQL reaches the error path. Wrapped err.Error() is filtered through redactCreds to scrub any secret values DuckDB itself might echo
JSON-ERR Opus WARN _ = json.NewEncoder(w).Encode(resp) swallows mid-encode failures; client sees truncated 200 with no log signal — first column type DuckDB can't JSON-encode (e.g. INTERVAL) lands here silently Opus Fixed — log via slog.Warn with the underlying error

Single-reviewer findings — accepted with rationale

# Severity Finding Reviewer Disposition
O-W3 Opus WARN DuckDB exotic types (Decimal, INTERVAL, etc.) may not JSON-encode well — dest := []any with []byte→string conversion is the only type-aware step today Opus Accepted — type-aware row converter is post-G0; G0 column types in test fixtures are int64/varchar/double/bool. The new JSON-ERR fix makes any encoding failure operator-visible in the meantime
K-W1 Kimi WARN No exponential backoff on repeated refresh failures Kimi Accepted — G0 single-tenant; 30s ticker is fine. Backoff adds complexity that smoke can't validate
O-INFO×3 Opus INFO USE_SSL casing cosmetic; initial-refresh log doesn't match runRefreshLoop's four-field format; stray var _ = io.Discard placeholder Opus Accepted / cleaned up — the io.Discard placeholder was removed; the other two deferred
Q-W Qwen WARN Object key validation Qwen Accepted — catalogd's contract; storaged validateKey is the actual gate
Q-I Qwen INFO Hardcoded refresh timeout Qwen Accepted — separate config knob is over-engineering for G0

Dismissed (false positives)

# Severity Finding Reviewer Why dismissed
F1 Q-BLOCK "Bootstrap not transactional" Qwen False — DuckDB DDL is auto-commit + idempotent; CREATE OR REPLACE handles re-run cleanly
F2 Q-BLOCK "MaxBytesReader applied AFTER json.NewDecoder reads" Qwen False — cmd/queryd/main.go sets r.Body = http.MaxBytesReader(...) BEFORE the decode call, by line ordering
F3 K-BLOCK "Concurrent Refresh + user query deadlock on single connection" Kimi False — not a deadlock, just serialization. With Refresh's 10s ctx-timeout, slow SELECT causes Refresh to skip a tick and retry. Design accept
F4 K-W2 "dropView failure leaves r.known inconsistent" Kimi False — current code returns BEFORE the delete(r.known, name) on dropView error, so the entry persists and next refresh retries (correct behavior). Misread of the sequence

Cumulative D5 disposition

  • 3-lineage parallel pass: 18 distinct findings
  • Fixed: 4 (1 convergent + 3 single-reviewer real)
  • Accepted-with-rationale: 5
  • Deferred / cleanup: 3 (1 INFO removal, 2 INFOs deferred)
  • Dismissed (false positives): 4
  • Build clean, vet clean, all tests pass, smoke 6/6 PASS after every fix round.

The B-LEAK finding (Kimi BLOCK) is worth a pause: the prior code explicitly tried to be careful about credentials by truncating the SQL to 80 chars in error messages, but the truncation window still captured both KEY_ID '<value>' and the prefix of SECRET '<value>' because of where they appear in the statement. The defense-in-depth answer was to never pass the SQL into the error path at all — labels are static, the error chain wraps a credential-redacted version of DuckDB's own error message. Two layers of "don't leak" instead of one fragile truncation.
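
A minimal sketch of that two-layer shape, with illustrative names (bootstrapStmt, redactCreds) and simplified statement text rather than the exact queryd code:

```go
package queryd

import (
	"database/sql"
	"fmt"
	"strings"
)

type bootstrapStmt struct {
	label string // static, credential-free label used in errors and logs
	sql   string // full statement text, never placed in an error message
}

func bootstrap(db *sql.DB, keyID, secret string) error {
	stmts := []bootstrapStmt{
		{"install httpfs", "INSTALL httpfs"},
		{"load httpfs", "LOAD httpfs"},
		{"create secret", fmt.Sprintf(
			"CREATE OR REPLACE SECRET s3 (TYPE S3, KEY_ID '%s', SECRET '%s')", keyID, secret)},
	}
	for _, s := range stmts {
		if _, err := db.Exec(s.sql); err != nil {
			// Layer 1: only the static label names the failing statement.
			// Layer 2: scrub any secret material DuckDB itself echoes back.
			return fmt.Errorf("bootstrap %s: %s", s.label, redactCreds(err.Error(), keyID, secret))
		}
	}
	return nil
}

// redactCreds replaces any occurrence of the given secret values in msg.
func redactCreds(msg string, secrets ...string) string {
	for _, s := range secrets {
		if s != "" {
			msg = strings.ReplaceAll(msg, s, "[redacted]")
		}
	}
	return msg
}
```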

The B-CTX finding (Opus BLOCK) is the kind of latent issue that would cost a future debugger hours: the connector callback runs on EVERY new physical connection, but SetMaxOpenConns(1) masks the issue today. A future change to allow concurrent connections would silently break bootstrap on every reconnect with no clear error signal. Catching it during the scrum saved a Friday-afternoon debugging incident months from now.
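
A driver-agnostic sketch of the fixed shape — OpenDB, bootstrapConn, and the openWithInit hook are illustrative stand-ins for the actual connector wiring:

```go
package queryd

import (
	"context"
	"database/sql"
)

// bootstrapConn runs per-connection setup (INSTALL/LOAD httpfs, CREATE SECRET …);
// body trimmed to one statement for the sketch.
func bootstrapConn(ctx context.Context, conn *sql.Conn) error {
	_, err := conn.ExecContext(ctx, "LOAD httpfs")
	return err
}

// OpenDB registers bootstrapConn through openWithInit, a stand-in for the
// driver's per-connection init hook. The callback fires on every new physical
// connection, so it must not capture the caller's ctx — a cancelled ctx would
// silently fail every future reconnect's bootstrap.
func OpenDB(ctx context.Context, dsn string,
	openWithInit func(dsn string, init func(*sql.Conn) error) (*sql.DB, error),
) (*sql.DB, error) {
	init := func(conn *sql.Conn) error {
		return bootstrapConn(context.Background(), conn) // not the captured ctx
	}
	db, err := openWithInit(dsn, init)
	if err != nil {
		return nil, err
	}
	// The caller's ctx bounds only the initial liveness check.
	if err := db.PingContext(ctx); err != nil {
		db.Close()
		return nil, err
	}
	return db, nil
}
```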


D6 — actual run results (2026-04-29) — G0 COMPLETE

Phase G0 Day 6 (last day) executed end-to-end. Output of scripts/d6_smoke.sh:

[d6-smoke] /v1/ingest?name=d6_workers (gateway → ingestd):
  ✓ ingest row_count=3, content-addressed key
[d6-smoke] /v1/catalog/list (gateway → catalogd):
  ✓ catalog count=1
[d6-smoke] /v1/storage/list (gateway → storaged):
  ✓ storage list returned 1 object(s)
[d6-smoke] /v1/sql SELECT count(*) (gateway → queryd):
  ✓ count(*)=3
[d6-smoke] /v1/sql with row data (full round-trip):
  ✓ rows[0].name=Alice (full ingest → storage → catalog → query through gateway)
[d6-smoke] /v1/unknown → 404:
  ✓ unknown route → 404
[d6-smoke] D6 acceptance gate: PASSED

All six G0 smokes (D1 through D6) PASS end-to-end.

What landed:

  • internal/gateway/proxy.go — NewProxyHandler(upstream) returns an http.Handler that wraps httputil.NewSingleHostReverseProxy with a Director that strips the /v1 prefix BEFORE the default Director's singleJoiningSlash runs (post-scrum O-BLOCK). Query string preserved; Host header rewritten to upstream.
  • cmd/gateway/main.go — replaces the D1 stub handlers. Wires /v1/storage/* → storaged, /v1/catalog/* → catalogd, /v1/ingest → ingestd, /v1/sql → queryd. mustParseUpstream fail-fast on missing scheme/host (post-scrum O-WARN2).
  • New [gateway] config block — bind + 4 upstream URLs, one per backing service.
  • scripts/d6_smoke.sh — 6 acceptance probes, every assertion through :3110 (gateway), none direct to backing services.
  • scripts/d1_smoke.sh updated — the 501 stub probes are gone (stubs replaced by real proxies). Replaced with proxy probes that verify gateway forwards to ingestd and queryd. Launch order changed from parallel to dep-ordered (storaged → catalogd → ingestd → queryd → gateway) since catalogd's rehydrate now needs storaged, and queryd's initial Refresh needs catalogd.

The architectural payoff of the whole G0 phase is the D6 smoke's "rows[0].name=Alice" assertion: a CSV is uploaded to gateway (/v1/ingest), gateway forwards to ingestd, ingestd PUTs the parquet through storaged at a content-addressed key, registers with catalogd, returns. Then a SELECT through gateway (/v1/sql) is forwarded to queryd, which had picked up the dataset on its initial Refresh against catalogd, points at the parquet via DuckDB's httpfs talking directly to MinIO with credentials from SecretsProvider, returns the rows. Five binaries, six HTTP routes, one round-trip. The /v1 prefix lives at the edge; internal services don't see it.

Next: G1+ work — gRPC adapters alongside the HTTP routes, Lance/HNSW vector indices for the staffing search use case, MCP server port (the existing Bun mcp-server on the Rust system goes away once the Go MCP SDK port lands), distillation rebuild on the new substrate, observer + Langfuse integration.


D6 — code scrum review (3-lineage parallel pass)

Pass Reviewer Latency Findings
1 opencode/claude-opus-4-7 ~25s 1 BLOCK + 2 WARN + 2 INFO = 5
2 openrouter/moonshotai/kimi-k2-0905 ~22s 1 BLOCK + 3 WARN + 1 INFO = 5
3 openrouter/qwen/qwen3-coder ~20s "No BLOCKs." (5 completion tokens total)

Total: 10 distinct findings. Smaller round than D5 — D6 is the smallest code surface (~50 lines of Go in the new package, ~50 lines of changes to cmd/gateway). Qwen returned a literal "No BLOCKs." which is fine — the empty-finding answer is a valid review outcome the system prompt explicitly allows.

Convergent findings (≥2 reviewers — high confidence)

No real convergent findings. Kimi and Opus both flagged something in the path-stripping logic, but Kimi's diagnosis ("TrimPrefix loops forever on //v1storage") was a misread of strings.TrimPrefix semantics — that function performs a single check-and-trim, not a loop. Opus's diagnosis (default Director's join order) was the real issue.

Single-reviewer findings — applied

# Severity Finding Reviewer Disposition
O-BLOCK BLOCK Director path stripping fails if upstream URL has a non-empty path. The default Director's singleJoiningSlash(target.Path, req.URL.Path) runs BEFORE the custom code; with target.Path="/api", the joined path is /api/v1/storage/...; my TrimPrefix(..., "/v1") is then a no-op. Today: bare-host URLs only, dormant. The moment anyone runs gateway behind a sub-path, every request silently 404s Opus Fixed — strip /v1 BEFORE calling origDirector, so the join sees the already-clean path. New regression test TestProxy_SubPathUpstream verifies forward path is /api/storage/... not /api/v1/storage/...
O-WARN2 WARN url.Parse is permissive — a typo like 127.0.0.1:3211 (missing scheme) parses fine but produces empty Host, so all requests 502 at first traffic instead of failing fast at startup Opus Fixed — mustParseUpstream validates Scheme != "" and Host != "", exits with a clear message naming the offending config field
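
A minimal sketch of the fail-fast check — the exact message wording and the config field name (storaged_url) are illustrative:

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// mustParseUpstream rejects values that url.Parse accepts but that could never
// serve traffic: an empty Scheme or Host (typical of a scheme-less typo) would
// otherwise surface only as 502s at first request.
func mustParseUpstream(field, raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil || u.Scheme == "" || u.Host == "" {
		fmt.Fprintf(os.Stderr,
			"gateway: [gateway].%s: invalid upstream URL %q (want scheme://host:port)\n",
			field, raw)
		os.Exit(1)
	}
	return u
}

func main() {
	storaged := mustParseUpstream("storaged_url", "http://127.0.0.1:3211")
	fmt.Println("storaged upstream:", storaged.String())
}
```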

Single-reviewer findings — accepted with rationale

# Severity Finding Reviewer Disposition
O-WARN1 WARN Bare /v1 (no trailing path) → TrimPrefix yields "" → upstream sees empty path Opus Accepted — chi's /v1/storage/* etc. won't match bare /v1, so it returns 404 at chi before reaching the proxy. Also singleJoiningSlash("","") returns "/" which is benign — no malformed request reaches the upstream
O-INFO×2 INFO No proxy transport timeout; 4 near-identical proxy construction blocks Opus Deferred — both are real but post-G0. Transport timeout becomes load-bearing when traffic isn't single-tenant; helper extraction is cosmetic
K-INFO INFO Parse calls in slice loop Kimi Deferred — same cosmetic concern

Dismissed (false positives)

# Severity Finding Reviewer Why dismissed
F1 K-BLOCK "TrimPrefix on //v1storage loops forever" Kimi False — strings.TrimPrefix performs a single check-and-trim and returns. There is no loop. On "//v1storage" the "/v1" prefix doesn't even match (the string starts with "//"), so the input comes back unchanged — either way, nothing trims repeatedly
F2 K-WARN "No upper bound on repeated // removal" Kimi Same false theory as F1
F3 K-WARN "Goroutines leak if upstream parse fails while binaries are already running" Kimi Confused about scope. The other binaries are separate OS processes launched by d1_smoke.sh, not goroutines inside cmd/gateway/main. gateway's os.Exit(1) doesn't affect them

Cumulative D6 disposition

  • 3-lineage parallel pass: 10 distinct findings
  • Fixed: 2 (both Opus single-reviewer)
  • Accepted-with-rationale: 4
  • Dismissed (false positives): 3 (all Kimi, same theory ×3)
  • Qwen contribution: 5 completion tokens ("No BLOCKs.") — net-zero
  • Build clean, vet clean, all tests pass, all 6 G0 smokes PASS after every fix round.

The O-BLOCK fix is the kind of thing smoke can never catch on its own: the smoke runs against bare-host upstreams (http://127.0.0.1:3211), which have empty target.Path, so singleJoiningSlash produces /v1/storage/... and the strip works. Move gateway behind a sub-path in production (e.g. https://api.example.com/lakehouse/...) and the strip silently no-ops. The new TestProxy_SubPathUpstream regression locks the strip-before-join order in.
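
A minimal sketch of the strip-before-join ordering; the shipped internal/gateway/proxy.go may differ in details such as RawPath handling:

```go
package gateway

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// NewProxyHandler returns a reverse proxy that strips the /v1 edge prefix
// BEFORE the default Director joins target.Path and req.URL.Path. Stripping
// after the join no-ops whenever the upstream URL carries its own path
// (http://host/api → /api/v1/storage/... no longer starts with /v1).
func NewProxyHandler(upstream *url.URL) http.Handler {
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	origDirector := proxy.Director
	proxy.Director = func(req *http.Request) {
		// 1. Strip the edge prefix while the path is still the client's path.
		//    Bare /v1 never reaches this handler — chi's /v1/...* patterns
		//    don't match it (O-WARN1).
		req.URL.Path = strings.TrimPrefix(req.URL.Path, "/v1")
		// 2. Let the default Director do scheme/host rewrite, path joining,
		//    and query-string preservation.
		origDirector(req)
		// Internal services see /storage, /catalog, /ingest, /sql — never /v1.
		req.Host = upstream.Host
	}
	return proxy
}
```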

Kimi's three false BLOCKs/WARNs all stem from one wrong intuition about strings.TrimPrefix semantics. This is the second time across G0 (D2 had a similar false convergent on RecordBuilder lifetime) where Kimi delivered confidently-incorrect findings that Opus or direct code-tracing dismissed. The convergence filter (≥2 reviewers) worked as designed — Kimi's BLOCK didn't have a second reviewer backing it, so it stayed in the dismissed column.
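
The single-trim semantics are trivially checkable; a two-line demonstration:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// One check-and-trim, then return — nothing iterates.
	fmt.Println(strings.TrimPrefix("/v1/storage/x", "/v1")) // /storage/x
	// The prefix doesn't even match here ("//v1storage" starts with "//"),
	// so the input comes back unchanged — in neither case is there a loop.
	fmt.Println(strings.TrimPrefix("//v1storage", "/v1")) // //v1storage
}
```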


G0 retrospective — six days, six smokes, five binaries

Phase G0 (substrate ship) shipped in six days, 2026-04-28 → 2026-04-29. Every day produced one binary's worth of functionality plus a 3-lineage scrum review on the shipped code:

Day Component Smoke Scrum fixes Cumulative commit
D1 5 binary skeletons + chi + /health 6/6 7 1142f54ad2ec1a
D2 storaged S3 GET/PUT/LIST/DELETE 6/6 4 8cfcdb8
D3 catalogd Parquet manifests + ADR-020 idempotency 6/6 6 66a704c
D4 ingestd CSV → Parquet → catalogd register 6/6 2 c1e4113
Pre-D5 CatalogClient extraction to internal/catalogclient (smokes 1-4 still pass) 4205ecd
D5 queryd DuckDB SELECT over Parquet via httpfs 6/6 4 9e9e4c2
D6 gateway reverse proxy fronting all 4 services 6/6 2 (this commit)

Total: 25 distinct fixes applied across six days from cross-lineage review (with another ~15 deferred-with-rationale and ~12 dismissed false positives). Smoke acceptance gate passed at every fix round on every day. Every day produced both functioning code AND structured documentation (this file + project memory).

Real architectural choices that proved out:

  1. Content-addressed parquet keys (D4) — schema-drift attempts write to a different file, leaving the live parquet uncorrupted
  2. ADR-020 idempotency contract (D3) — same name + same fingerprint = same dataset_id; smoke proves rehydrate-across-restart preserves identity
  3. UUIDv5 dataset_id (D3) — diverges from Rust's v4 surrogate; same name on any box converges to the same ID after disk loss (see the sketch after this list)
  4. /v1 prefix at the edge (D6) — internal services route on /storage, /catalog, etc.; gateway strips on the way in
  5. SetMaxOpenConns(1) on DuckDB (D5) — registrar's CREATE VIEWs and handler's SELECTs serialize through one connection (avoids cross-connection MVCC visibility edge cases for G0)
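
A minimal sketch of choice 3 using github.com/google/uuid — the namespace value and helper name are illustrative, not the actual catalogd code:

```go
package catalog

import "github.com/google/uuid"

// datasetNamespace is whatever fixed namespace UUID the project standardizes
// on; the specific choice only matters in that it never changes.
var datasetNamespace = uuid.NameSpaceURL

// DatasetID derives a version-5 (name-based, SHA-1) UUID from the dataset
// name. The same name yields the same ID on every box, with or without prior
// state on disk — which is what lets identity survive disk loss, unlike a
// random v4 surrogate.
func DatasetID(name string) uuid.UUID {
	return uuid.NewSHA1(datasetNamespace, []byte(name))
}
```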

Next: G1+. The substrate is now in place to build gRPC adapters, vector indices (Lance/HNSW), the Go MCP SDK port, distillation rebuild, and observer/Langfuse integration on top.