docs: README + PHASE_G0_KICKOFF reflect post-G0 state (G1, G1P, G2)

README was stuck on "Pre-Phase G0, implementation has not started"
while we shipped through G2. Updated to reflect the current 7-binary
service inventory, the 9 acceptance smokes, the cold-start deps
(MinIO bucket, Ollama with nomic-embed-text, secrets-go.toml).

PHASE_G0_KICKOFF gains a "Post-G0 work" pointer at the end —
brief table mapping each G1+/G2 commit to its smoke + scrum-fix
count. Full per-day detail stays in commit messages and the
project memory file.

No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
root 2026-04-29 01:45:59 -05:00
parent 9ee7fc5550
commit 0cb29cda15
2 changed files with 118 additions and 28 deletions

@@ -1,49 +1,96 @@
# golangLAKEHOUSE # golangLAKEHOUSE
Go reimplementation of the Lakehouse — a versioned knowledge substrate Go reimplementation of the Lakehouse — a versioned knowledge
for staffing analytics + local AI workloads. substrate for staffing analytics + local AI workloads.
## Status ## Status
**Pre-Phase G0.** Documents seeded; Go module declared; implementation **Phase G0 complete + G1/G1P/G2 shipped.** Five binaries plus a
has not started. See `docs/PRD.md` for direction and `docs/SPEC.md` sixth (vectord) and a seventh (embedd) on top, fronted by a
for the component-by-component port plan. single gateway. Acceptance smokes green for D1-D6 + G1 + G1P + G2.
### Phase G0 prerequisites (must be done before any code lands) End-to-end staffing co-pilot pipeline functional through the
gateway:
1. **Install Go 1.23+ on the dev box.** Not currently present at
`/usr/local/go` or elsewhere on the build machine. Standard install:
``` ```
curl -L https://go.dev/dl/go1.23.linux-amd64.tar.gz | sudo tar -C /usr/local -xz text → /v1/embed → /v1/vectors/index/<name>/add
echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc text → /v1/embed → /v1/vectors/index/<name>/search → top-K hits
``` ```
2. **Ensure cgo toolchain is present** (gcc + libc-dev) — required by
the DuckDB binding per ADR-001 §1.1. `apt install build-essential` Plus the SQL path:
on Debian-based systems. ```
3. **Initialize the dependency tree** with `go mod tidy` once CSV → /v1/ingest (parses, writes Parquet via storaged, registers
`cmd/gateway/main.go` declares its first imports. manifest with catalogd)
SQL → /v1/sql (DuckDB over the registered Parquets via httpfs)
```
See `docs/PHASE_G0_KICKOFF.md` for the day-by-day record (D1-D6 +
real-scale validation + G1/G1P/G2 pointer at the bottom).
## Service inventory
| Bin | Port | Role |
|---|---|---|
| `gateway` | 3110 | Reverse proxy fronting all backing services |
| `storaged` | 3211 | Object I/O over S3 (MinIO in dev) |
| `catalogd` | 3212 | Parquet manifest registry, ADR-020 idempotency |
| `ingestd` | 3213 | CSV → Parquet → register loop |
| `queryd` | 3214 | DuckDB SELECT over registered Parquets via httpfs |
| `vectord` | 3215 | HNSW vector search (+ optional persistence to storaged) |
| `embedd` | 3216 | Text → vector via Ollama (default `nomic-embed-text` 768-d) |
## Acceptance smokes
```
scripts/d1_smoke.sh # 5-binary skeleton + chi /health + gateway proxy probes
scripts/d2_smoke.sh # storaged GET/PUT/LIST/DELETE + 256 MiB cap + concurrency cap
scripts/d3_smoke.sh # catalogd register/manifest/list + rehydrate-across-restart
scripts/d4_smoke.sh # ingestd CSV → Parquet round-trip + schema-drift 409
scripts/d5_smoke.sh # queryd DuckDB SELECT through httpfs over MinIO
scripts/d6_smoke.sh # full ingest → query through gateway only
scripts/g1_smoke.sh # vectord HNSW recall + dim mismatch + duplicate-create 409
scripts/g1p_smoke.sh # vectord state survives kill+restart via storaged
scripts/g2_smoke.sh # embed → vectord add → search round-trip
```
Run them all in any order:
```
for s in scripts/{d1,d2,d3,d4,d5,d6,g1,g1p,g2}_smoke.sh; do "$s" || break; done
```
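The loop above stops at the first failing smoke. When a full pass/fail picture is more useful, a small wrapper can run every script and summarize the results. The following is a minimal sketch, not part of the repository: `run_smokes` is a hypothetical helper name, and it assumes each smoke script signals failure through a non-zero exit status.

```shell
# run_smokes SCRIPT...
# Hypothetical helper (not in the repo): runs every script in the order
# given, prints PASS/FAIL per script, and succeeds only if all passed.
run_smokes() {
  local failures=0 s
  for s in "$@"; do
    if "$s"; then
      echo "PASS $s"
    else
      echo "FAIL $s"
      failures=$((failures + 1))
    fi
  done
  # Exit status 0 only when no script failed.
  [ "$failures" -eq 0 ]
}
```

Usage, under the same assumptions (brace expansion requires bash): `run_smokes scripts/{d1,d2,d3,d4,d5,d6,g1,g1p,g2}_smoke.sh`.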
## Cold-start dependencies
- Go 1.25+ at `/usr/local/go/bin` (arrow-go pulled the 1.25 floor)
- `gcc` + `libc-dev` for the DuckDB cgo binding (ADR-001 §1.1)
- MinIO running on `:9000` with bucket `lakehouse-go-primary`
- Ollama running on `:11434` with `nomic-embed-text` loaded (G2)
- `/etc/lakehouse/secrets-go.toml` with `[s3.primary]` credentials
(storaged + queryd both read this)
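The dependencies listed above can be probed before running any smoke. The sketch below is illustrative only: the function names are hypothetical, the hosts are assumed to be `127.0.0.1`, and the port checks rely on bash's `/dev/tcp` redirection. It only confirms that something is listening on each port; it does not verify that the `lakehouse-go-primary` bucket exists or that `nomic-embed-text` is loaded.

```shell
# Hypothetical preflight helpers (not part of the repository).

# check_cmd NAME: succeed if NAME resolves on PATH.
check_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# check_port HOST PORT: succeed if a TCP connection opens.
# Uses bash's /dev/tcp; the subshell keeps fd 3 from leaking.
check_port() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# check_file PATH: succeed if PATH exists and is readable.
check_file() {
  [ -r "$1" ]
}

# preflight: report every missing dependency, fail if any is missing.
preflight() {
  local failed=0
  check_cmd go  || { echo "missing: go"; failed=1; }
  check_cmd gcc || { echo "missing: gcc"; failed=1; }
  check_port 127.0.0.1 9000  || { echo "unreachable: MinIO on :9000"; failed=1; }
  check_port 127.0.0.1 11434 || { echo "unreachable: Ollama on :11434"; failed=1; }
  check_file /etc/lakehouse/secrets-go.toml \
    || { echo "unreadable: /etc/lakehouse/secrets-go.toml"; failed=1; }
  [ "$failed" -eq 0 ]
}
```

Calling `preflight` before the smoke loop gives one list of everything missing rather than a failure partway through the first script that needs it.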
## Layout ## Layout
``` ```
docs/ Direction + spec + ADRs docs/ Direction + spec + ADRs + day-by-day
cmd/ (forthcoming) main packages — one per service cmd/ One main package per binary
internal/ (forthcoming) shared packages internal/ Shared packages — storeclient, catalogclient,
web/ (forthcoming) HTMX templates + static secrets, shared, embed, gateway, plus
scripts/ (forthcoming) cold-start, smoke, distill per-service implementation packages
tests/ (forthcoming) golden files, integration tests scripts/ Smokes + ancillary tooling
``` ```
## Reading order ## Reading order
1. `docs/PRD.md` — what we're building and why 1. `docs/PRD.md` — what we're building and why
2. `docs/SPEC.md` — how, per-component 2. `docs/SPEC.md` — how, per-component
3. `docs/DECISIONS.md` — ADRs, starting with ADR-001 (foundational) 3. `docs/DECISIONS.md` — ADRs (ADR-001 foundational)
4. `docs/RUST_PATHWAY_MEMORY_NOTE.md` — historical reference for the 4. `docs/PHASE_G0_KICKOFF.md` — day-by-day from D1 through G2
Rust era's pathway memory state (not migrated) 5. `docs/RUST_PATHWAY_MEMORY_NOTE.md` — historical reference for the
Rust era's pathway memory (not migrated, by ADR-001 #5)
## Predecessor ## Predecessor
The Rust Lakehouse this rewrite supersedes lives at The Rust Lakehouse this rewrite supersedes lives at
`git.agentview.dev/profit/lakehouse`. It remains the live system until `git.agentview.dev/profit/lakehouse`. It remains the live system
this Go implementation reaches feature parity (per `docs/SPEC.md` §7). serving `devop.live/lakehouse/` until this Go implementation reaches
feature parity per `docs/SPEC.md` §7. Then Rust enters
maintenance-only mode.

@@ -1273,3 +1273,46 @@ G0's substrate handles real production-scale data with one config
knob bumped. No correctness issues, no OOMs, no silent type knob bumped. No correctness issues, no OOMs, no silent type
errors. Query latency is fast enough for ad-hoc analytics. errors. Query latency is fast enough for ad-hoc analytics.
The substrate is ready for G1+ work to build on top. The substrate is ready for G1+ work to build on top.
---
## Post-G0 work (pointer, not detail)
After the G0 substrate was validated at real scale, work continued on top.
Each piece has its own commit + scrum-review record; this section
is a pointer into the git log, not a full retro.
| Phase | Component | Commit | Smoke | Scrum fixes |
|---|---|---|---|---|
| G1 | vectord — HNSW vector search via `coder/hnsw` | `b8c072c` | g1_smoke 7/7 | 6 |
| G1P | vectord persistence to storaged (single-file `LHV1` framing) | `8b92518` | g1p_smoke 8/8 | 3 (incl. 3-way convergent torn-write fix) |
| G2 | embedd — text → vector via Ollama (default `nomic-embed-text` 768-d) | `9ee7fc5` | g2_smoke 5/5 | 2 |
After G2, the **end-to-end staffing co-pilot pipeline** is functional
through the gateway:
```
text → /v1/embed → /v1/vectors/index/<name>/add
text → /v1/embed → /v1/vectors/index/<name>/search → top-K hits
```
The `g2_smoke.sh` end-to-end assertion proves it: a CSV-style staffing
text round-trips through embed → vectord → search at distance ≈ 0
(float32 precision noise on identical unit vectors).
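As a worked illustration of why the distance is approximately, rather than exactly, zero: the sketch below assumes the index uses cosine distance on L2-normalized vectors, which the text above does not state explicitly.

```latex
% Assumption: cosine distance on L2-normalized (unit) vectors.
d(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
d(u, u) = 1 - \frac{\lVert u \rVert^{2}}{\lVert u \rVert^{2}} = 0 .
% In float32 the computed dot product of a unit vector with itself is
% 1 + \delta, where |\delta| is on the order of a small multiple of
% machine epsilon \varepsilon = 2^{-23} \approx 1.19 \times 10^{-7},
% so the reported distance is |\delta| instead of exactly 0.
```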
The post-G0 service inventory is now 7 binaries + 1 shared library
package:
- `gateway` (:3110) — reverse proxy
- `storaged` (:3211) — S3 I/O
- `catalogd` (:3212) — Parquet manifests
- `ingestd` (:3213) — CSV → Parquet → register
- `queryd` (:3214) — DuckDB SELECT over Parquet via httpfs
- `vectord` (:3215) — HNSW vector search + optional persistence
- `embedd` (:3216) — text → vector via Ollama
Plus shared packages: `internal/storeclient`, `internal/catalogclient`,
`internal/secrets`, `internal/shared`.
**Smokes (deterministic, run-in-any-order)**: d1, d2, d3, d4, d5, d6,
g1, g1p, g2 — 9 acceptance gates, all PASS. `scripts/g1_smoke.toml`
disables vectord persistence specifically for the in-memory API smoke
to avoid rehydrate-from-storaged contamination.