docs: README + PHASE_G0_KICKOFF reflect post-G0 state (G1, G1P, G2)

README was stuck on "Pre-Phase G0, implementation has not started"
while we shipped through G2. Updated to reflect the current 7-binary
service inventory, the 9 acceptance smokes, the cold-start deps
(MinIO bucket, Ollama with nomic-embed-text, secrets-go.toml).

PHASE_G0_KICKOFF gains a "Post-G0 work" pointer at the end —
brief table mapping each G1+/G2 commit to its smoke + scrum-fix
count. Full per-day detail stays in commit messages and the
project memory file.

No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
root 2026-04-29 01:45:59 -05:00
parent 9ee7fc5550
commit 0cb29cda15
2 changed files with 118 additions and 28 deletions

README.md (103 changed lines)

@@ -1,49 +1,96 @@
# golangLAKEHOUSE
Go reimplementation of the Lakehouse — a versioned knowledge substrate
for staffing analytics + local AI workloads.
Go reimplementation of the Lakehouse — a versioned knowledge
substrate for staffing analytics + local AI workloads.
## Status
**Pre-Phase G0.** Documents seeded; Go module declared; implementation
has not started. See `docs/PRD.md` for direction and `docs/SPEC.md`
for the component-by-component port plan.
**Phase G0 complete + G1/G1P/G2 shipped.** Seven binaries: the
five-binary G0 skeleton plus a sixth (vectord) and a seventh (embedd),
with the gateway fronting the backing services. Acceptance smokes
green for D1-D6 + G1 + G1P + G2.
### Phase G0 prerequisites (must be done before any code lands)
End-to-end staffing co-pilot pipeline functional through the
gateway:
1. **Install Go 1.23+ on the dev box.** Not currently present at
`/usr/local/go` or elsewhere on the build machine. Standard install:
```
curl -L https://go.dev/dl/go1.23.0.linux-amd64.tar.gz | sudo tar -C /usr/local -xz
echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc
```
2. **Ensure cgo toolchain is present** (gcc + libc-dev) — required by
the DuckDB binding per ADR-001 §1.1. `apt install build-essential`
on Debian-based systems.
3. **Initialize the dependency tree** with `go mod tidy` once
`cmd/gateway/main.go` declares its first imports.
```
text → /v1/embed → /v1/vectors/index/<name>/add
text → /v1/embed → /v1/vectors/index/<name>/search → top-K hits
```
Plus the SQL path:
```
CSV → /v1/ingest (parses, writes Parquet via storaged, registers
manifest with catalogd)
SQL → /v1/sql (DuckDB over the registered Parquets via httpfs)
```
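As a rough illustration of the embedding path above, the sketch below builds the gateway URLs a client would call. Only the route shapes and the gateway port come from this README; the `localhost` host and the `embedURL` and `indexURL` helpers are hypothetical names invented for the example, and request methods and bodies are deliberately left out because they are not documented here.

```go
package main

import (
	"fmt"
	"net/url"
)

// gatewayBase is an assumption: the gateway on its documented port (3110),
// reached on the local host.
const gatewayBase = "http://localhost:3110"

// embedURL returns the embedding endpoint shown in the pipeline.
func embedURL() string {
	return gatewayBase + "/v1/embed"
}

// indexURL returns the add or search endpoint for a named vector index.
// The name is path-escaped so an unusual index name cannot alter the route.
func indexURL(name, op string) (string, error) {
	if op != "add" && op != "search" {
		return "", fmt.Errorf("unsupported op %q", op)
	}
	return gatewayBase + "/v1/vectors/index/" + url.PathEscape(name) + "/" + op, nil
}

func main() {
	fmt.Println(embedURL())
	for _, op := range []string{"add", "search"} {
		// "workers" is a made-up index name used only for illustration.
		u, err := indexURL("workers", op)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(u)
	}
}
```

Escaping the index name is a defensive choice for the sketch; whether the real services restrict index names is not stated in this document.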
See `docs/PHASE_G0_KICKOFF.md` for the day-by-day record (D1-D6 +
real-scale validation + G1/G1P/G2 pointer at the bottom).
## Service inventory
| Bin | Port | Role |
|---|---|---|
| `gateway` | 3110 | Reverse proxy fronting all backing services |
| `storaged` | 3211 | Object I/O over S3 (MinIO in dev) |
| `catalogd` | 3212 | Parquet manifest registry, ADR-020 idempotency |
| `ingestd` | 3213 | CSV → Parquet → register loop |
| `queryd` | 3214 | DuckDB SELECT over registered Parquets via httpfs |
| `vectord` | 3215 | HNSW vector search (+ optional persistence to storaged) |
| `embedd` | 3216 | Text → vector via Ollama (default `nomic-embed-text` 768-d) |
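For tooling that needs the inventory programmatically, the table can be encoded as plain Go data. This is a minimal sketch: the names and ports are copied from the table, while the `service` type, the `healthURL` helper, the `localhost` host, and the assumption that every binary answers `GET /health` on its own port are illustrative and not confirmed by this README.

```go
package main

import "fmt"

// service mirrors one row of the inventory table above.
type service struct {
	Name string
	Port int
}

// inventory lists the seven binaries and their documented ports.
var inventory = []service{
	{"gateway", 3110},
	{"storaged", 3211},
	{"catalogd", 3212},
	{"ingestd", 3213},
	{"queryd", 3214},
	{"vectord", 3215},
	{"embedd", 3216},
}

// healthURL builds a probe URL for a service.
// Assumption: each binary serves GET /health on its own port.
func healthURL(host string, s service) string {
	return fmt.Sprintf("http://%s:%d/health", host, s.Port)
}

func main() {
	for _, s := range inventory {
		fmt.Printf("%-9s %s\n", s.Name, healthURL("localhost", s))
	}
}
```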
## Acceptance smokes
```
scripts/d1_smoke.sh # 5-binary skeleton + chi /health + gateway proxy probes
scripts/d2_smoke.sh # storaged GET/PUT/LIST/DELETE + 256 MiB cap + concurrency cap
scripts/d3_smoke.sh # catalogd register/manifest/list + rehydrate-across-restart
scripts/d4_smoke.sh # ingestd CSV → Parquet round-trip + schema-drift 409
scripts/d5_smoke.sh # queryd DuckDB SELECT through httpfs over MinIO
scripts/d6_smoke.sh # full ingest → query through gateway only
scripts/g1_smoke.sh # vectord HNSW recall + dim mismatch + duplicate-create 409
scripts/g1p_smoke.sh # vectord state survives kill+restart via storaged
scripts/g2_smoke.sh # embed → vectord add → search round-trip
```
Run them all (the smokes are order-independent; the loop stops at the first failure):
```
for s in scripts/{d1,d2,d3,d4,d5,d6,g1,g1p,g2}_smoke.sh; do "$s" || break; done
```
## Cold-start dependencies
- Go 1.25+ at `/usr/local/go/bin` (arrow-go raised the minimum Go version to 1.25)
- `gcc` + `libc-dev` for the DuckDB cgo binding (ADR-001 §1.1)
- MinIO running on `:9000` with bucket `lakehouse-go-primary`
- Ollama running on `:11434` with `nomic-embed-text` loaded (G2)
- `/etc/lakehouse/secrets-go.toml` with `[s3.primary]` credentials
(storaged + queryd both read this)
## Layout
```
docs/ Direction + spec + ADRs
cmd/ (forthcoming) main packages — one per service
internal/ (forthcoming) shared packages
web/ (forthcoming) HTMX templates + static
scripts/ (forthcoming) cold-start, smoke, distill
tests/ (forthcoming) golden files, integration tests
docs/ Direction + spec + ADRs + day-by-day
cmd/ One main package per binary
internal/ Shared packages — storeclient, catalogclient,
secrets, shared, embed, gateway, plus
per-service implementation packages
scripts/ Smokes + ancillary tooling
```
## Reading order
1. `docs/PRD.md` — what we're building and why
2. `docs/SPEC.md` — how, per-component
3. `docs/DECISIONS.md` — ADRs, starting with ADR-001 (foundational)
4. `docs/RUST_PATHWAY_MEMORY_NOTE.md` — historical reference for the
Rust era's pathway memory state (not migrated)
3. `docs/DECISIONS.md` — ADRs (ADR-001 foundational)
4. `docs/PHASE_G0_KICKOFF.md` — day-by-day from D1 through G2
5. `docs/RUST_PATHWAY_MEMORY_NOTE.md` — historical reference for the
Rust era's pathway memory (not migrated, by ADR-001 #5)
## Predecessor
The Rust Lakehouse this rewrite supersedes lives at
`git.agentview.dev/profit/lakehouse`. It remains the live system until
this Go implementation reaches feature parity (per `docs/SPEC.md` §7).
`git.agentview.dev/profit/lakehouse`. It remains the live system
serving `devop.live/lakehouse/` until this Go implementation reaches
feature parity per `docs/SPEC.md` §7. Then Rust enters
maintenance-only mode.


@@ -1273,3 +1273,46 @@ G0's substrate handles real production-scale data with one config
knob bumped. No correctness issues, no OOMs, no silent type
errors. Query latency is fast enough for ad-hoc analytics.
The substrate is ready for G1+ work to build on top.
---
## Post-G0 work (pointer, not detail)
After the G0 substrate was validated at real scale, work continued on
top of it. Each piece has its own commit and scrum-review record; this
section is a pointer into the git log, not a full retro.
| Phase | Component | Commit | Smoke | Scrum fixes |
|---|---|---|---|---|
| G1 | vectord — HNSW vector search via `coder/hnsw` | `b8c072c` | g1_smoke 7/7 | 6 |
| G1P | vectord persistence to storaged (single-file `LHV1` framing) | `8b92518` | g1p_smoke 8/8 | 3 (incl. 3-way convergent torn-write fix) |
| G2 | embedd — text → vector via Ollama (default `nomic-embed-text` 768-d) | `9ee7fc5` | g2_smoke 5/5 | 2 |
After G2, the **end-to-end staffing co-pilot pipeline** is functional
through the gateway:
```
text → /v1/embed → /v1/vectors/index/<name>/add
text → /v1/embed → /v1/vectors/index/<name>/search → top-K hits
```
The `g2_smoke.sh` end-to-end assertion proves it: a CSV-style staffing
text round-trips through embed → vectord → search at distance ≈ 0
(float32 precision noise on identical unit vectors).
The post-G0 service inventory is now 7 binaries plus a set of shared
library packages:
- `gateway` (:3110) — reverse proxy
- `storaged` (:3211) — S3 I/O
- `catalogd` (:3212) — Parquet manifests
- `ingestd` (:3213) — CSV → Parquet → register
- `queryd` (:3214) — DuckDB SELECT over Parquet via httpfs
- `vectord` (:3215) — HNSW vector search + optional persistence
- `embedd` (:3216) — text → vector via Ollama
Plus shared packages: `internal/storeclient`, `internal/catalogclient`,
`internal/secrets`, `internal/shared`.
**Smokes (deterministic, run-in-any-order)**: d1, d2, d3, d4, d5, d6,
g1, g1p, g2 — 9 acceptance gates, all PASS. `scripts/g1_smoke.toml`
disables vectord persistence specifically for the in-memory API smoke
to avoid rehydrate-from-storaged contamination.