Phase G0 Day 4 ships ingestd: multipart CSV upload, Arrow schema
inference per ADR-010 (default-to-string on ambiguity), single-pass
streaming CSV → Parquet via pqarrow batched writer (Snappy compressed,
8192 rows per batch), PUT to storaged at content-addressed key
datasets/<name>/<fp_hex>.parquet, register manifest with catalogd.
Acceptance smoke 6/6 PASS including idempotent re-ingest (proves
inference is deterministic — same CSV always produces same fingerprint)
and schema-drift → 409 (proves catalogd's gate fires on ingest traffic).
Schema fingerprint is SHA-256 over (name, type) tuples in header order
using ASCII record/unit separators (0x1e/0x1f) so column names with
commas can't collide. Nullability intentionally NOT in the fingerprint
— a column gaining nulls isn't a schema change.
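A minimal sketch of the fingerprint described above. The Column type, the Fingerprint name, and the exact separator placement (unit separator between name and type, record separator after each tuple) are assumptions for illustration; the shipped ingestd code may lay the bytes out differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Column is a hypothetical (name, type) pair; nullability is deliberately absent.
type Column struct {
	Name string
	Type string
}

const (
	recordSep = 0x1e // ASCII record separator, assumed to terminate each tuple
	unitSep   = 0x1f // ASCII unit separator, assumed to split name from type
)

// Fingerprint hashes the columns in header order and returns lowercase hex.
// Control-character separators keep a name like "a,b" from colliding with
// two columns named "a" and "b".
func Fingerprint(cols []Column) string {
	h := sha256.New()
	for _, c := range cols {
		h.Write([]byte(c.Name))
		h.Write([]byte{unitSep})
		h.Write([]byte(c.Type))
		h.Write([]byte{recordSep})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	cols := []Column{{"id", "int64"}, {"name", "string"}}
	swapped := []Column{{"name", "string"}, {"id", "int64"}}
	fmt.Println(Fingerprint(cols))
	fmt.Println(Fingerprint(swapped)) // header order is part of the hash
}
```

Because the input is only the ordered (name, type) list, re-ingesting the same CSV yields the same hex string as long as inference is deterministic.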
Cross-lineage scrum on shipped code:
- Opus 4.7 (opencode): 4 WARN + 3 INFO (after 2 self-retracted BLOCKs)
- Kimi K2-0905 (openrouter): 1 BLOCK + 2 WARN + 1 INFO
- Qwen3-coder (openrouter): 2 BLOCK + 2 WARN + 2 INFO
Fixed (2, both Opus single-reviewer):
- C-DRIFT: PUT-then-register on fixed datasets/<name>/data.parquet
  meant a schema-drift ingest overwrote the live parquet BEFORE
  catalogd's 409 fired → storaged inconsistent with manifest.
  Fix: content-addressed key datasets/<name>/<fp_hex>.parquet.
  Drift writes to a different file (orphan in G2 GC scope); the
  live data is never corrupted.
- C-WCLOSE: the writer returned by pqarrow.NewFileWriter was not
  closed on error paths, leaking buffered column data + OS
  resources per failed ingest.
  Fix: deferred guarded close with wClosed flag.
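The guarded-close pattern named in C-WCLOSE can be sketched as follows. The closer interface, fakeWriter, and convert are hypothetical stand-ins (the real code holds a pqarrow file writer); only the wClosed flag name comes from the fix itself.

```go
package main

import (
	"errors"
	"fmt"
)

// closer stands in for the Parquet file writer (hypothetical interface).
type closer interface{ Close() error }

// fakeWriter counts Close calls so both paths can be observed.
type fakeWriter struct{ closes int }

func (f *fakeWriter) Close() error { f.closes++; return nil }

// convert is a hypothetical ingest step that may fail before the
// writer has been closed explicitly.
func convert(w closer, fail bool) error {
	wClosed := false
	defer func() {
		if !wClosed {
			// Error path: release buffers and OS resources. The
			// original error is what the caller needs, so the
			// Close error is dropped here.
			_ = w.Close()
		}
	}()

	if fail {
		return errors.New("bad row")
	}

	// Success path: close explicitly so a flush error is reported,
	// and set the flag first so the deferred func never closes twice.
	wClosed = true
	return w.Close()
}

func main() {
	ok, bad := &fakeWriter{}, &fakeWriter{}
	fmt.Println(convert(ok, false), ok.closes)
	fmt.Println(convert(bad, true), bad.closes)
}
```

Either way the writer is closed exactly once per ingest attempt.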
Dismissed (5, all false positives):
- Qwen BLOCK "csv.Reader needs LazyQuotes=true for multi-line": false,
  Go csv handles RFC 4180 multi-line quoted fields by default.
- Qwen BLOCK "row[i] OOB": already bounds-checked at schema.go:73
  and csv.go:201.
- Kimi BLOCK "type assertion panic if pqarrow reorders fields":
  speculative, no real path.
- Kimi WARN + Qwen WARN×2 "RecordBuilder leak on early error":
  false convergent. Outer defer rb.Release() captures the current
  builder; in-loop release runs before reassignment. No leak.
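The LazyQuotes dismissal is easy to check in isolation. A small sketch using only the standard library; the parse helper and the sample input are made up for illustration:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parse reads all records with a default csv.Reader.
// LazyQuotes is left at its zero value (false).
func parse(input string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	return r.ReadAll()
}

func main() {
	// The second record has a quoted field that spans two lines.
	input := "id,note\n1,\"line one\nline two\"\n2,plain\n"
	rows, err := parse(input)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%d records, note=%q\n", len(rows), rows[1][1])
}
```

LazyQuotes only relaxes handling of stray quote characters; it is not needed for a well-formed quoted field that contains a newline.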
Deferred (6 INFO + accepted-with-rationale on 3 WARN): sample
boundary type mismatch (G0 cap bounds peak), string-match
paranoia on http.MaxBytesError, multipart double-buffer (G2 spool-
to-disk), separator validation, body close ordering, etc.
The D4 scrum produced fewer real findings than D3 (2 vs 6). Both
were architectural hazards that smoke wouldn't catch. For C-DRIFT,
the smoke's "schema drift → 409" assertion passed even in the
corrupted-state world: the 409 fired correctly, but the PUT had
already mutated the live parquet before the validation check ran.
Opus's reading of the PUT-then-register ordering is exactly the kind
of architectural insight the cross-lineage scrum is designed to surface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
golangLAKEHOUSE
Go reimplementation of the Lakehouse — a versioned knowledge substrate for staffing analytics + local AI workloads.
Status
Pre-Phase G0. Documents seeded; Go module declared; implementation
has not started. See docs/PRD.md for direction and docs/SPEC.md
for the component-by-component port plan.
Phase G0 prerequisites (must be done before any code lands)
- Install Go 1.23+ on the dev box. Not currently present at
  /usr/local/go or elsewhere on the build machine. Standard install:
      curl -L https://go.dev/dl/go1.23.0.linux-amd64.tar.gz | sudo tar -C /usr/local -xz
      echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc
- Ensure cgo toolchain is present (gcc + libc-dev), required by
  the DuckDB binding per ADR-001 §1.1.
  apt install build-essential on Debian-based systems.
- Initialize the dependency tree with
  go mod tidy once cmd/gateway/main.go declares its first imports.
Layout
docs/ Direction + spec + ADRs
cmd/ (forthcoming) main packages — one per service
internal/ (forthcoming) shared packages
web/ (forthcoming) HTMX templates + static
scripts/ (forthcoming) cold-start, smoke, distill
tests/ (forthcoming) golden files, integration tests
Reading order
- docs/PRD.md: what we're building and why
- docs/SPEC.md: how, per-component
- docs/DECISIONS.md: ADRs, starting with ADR-001 (foundational)
- docs/RUST_PATHWAY_MEMORY_NOTE.md: historical reference for the Rust era's pathway memory state (not migrated)
Predecessor
The Rust Lakehouse this rewrite supersedes lives at
git.agentview.dev/profit/lakehouse. It remains the live system until
this Go implementation reaches feature parity (per docs/SPEC.md §7).