diff --git a/README.md b/README.md
index 1584aa4..f1f3bc4 100644
--- a/README.md
+++ b/README.md
@@ -1,49 +1,96 @@
 # golangLAKEHOUSE
 
-Go reimplementation of the Lakehouse — a versioned knowledge substrate
-for staffing analytics + local AI workloads.
+Go reimplementation of the Lakehouse — a versioned knowledge
+substrate for staffing analytics + local AI workloads.
 
 ## Status
 
-**Pre-Phase G0.** Documents seeded; Go module declared; implementation
-has not started. See `docs/PRD.md` for direction and `docs/SPEC.md`
-for the component-by-component port plan.
+**Phase G0 complete + G1/G1P/G2 shipped.** Six binaries plus a
+seventh (vectord) and an eighth (embedd) on top, fronted by a
+single gateway. Acceptance smokes green for D1-D6 + G1 + G1P + G2.
 
-### Phase G0 prerequisites (must be done before any code lands)
+End-to-end staffing co-pilot pipeline functional through the
+gateway:
 
-1. **Install Go 1.23+ on the dev box.** Not currently present at
-   `/usr/local/go` or elsewhere on the build machine. Standard install:
-   ```
-   curl -L https://go.dev/dl/go1.23.linux-amd64.tar.gz | sudo tar -C /usr/local -xz
-   echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc
-   ```
-2. **Ensure cgo toolchain is present** (gcc + libc-dev) — required by
-   the DuckDB binding per ADR-001 §1.1. `apt install build-essential`
-   on Debian-based systems.
-3. **Initialize the dependency tree** with `go mod tidy` once
-   `cmd/gateway/main.go` declares its first imports.
+```
+text → /v1/embed → /v1/vectors/index//add
+text → /v1/embed → /v1/vectors/index//search → top-K hits
+```
+
+Plus the SQL path:
+```
+CSV → /v1/ingest  (parses, writes Parquet via storaged, registers
+                   manifest with catalogd)
+SQL → /v1/sql     (DuckDB over the registered Parquets via httpfs)
+```
+
+See `docs/PHASE_G0_KICKOFF.md` for the day-by-day record (D1-D6 +
+real-scale validation + G1/G1P/G2 pointer at the bottom).
+
+## Service inventory
+
+| Bin | Port | Role |
+|---|---|---|
+| `gateway` | 3110 | Reverse proxy fronting all backing services |
+| `storaged` | 3211 | Object I/O over S3 (MinIO in dev) |
+| `catalogd` | 3212 | Parquet manifest registry, ADR-020 idempotency |
+| `ingestd` | 3213 | CSV → Parquet → register loop |
+| `queryd` | 3214 | DuckDB SELECT over registered Parquets via httpfs |
+| `vectord` | 3215 | HNSW vector search (+ optional persistence to storaged) |
+| `embedd` | 3216 | Text → vector via Ollama (default `nomic-embed-text` 768-d) |
+
+## Acceptance smokes
+
+```
+scripts/d1_smoke.sh   # 5-binary skeleton + chi /health + gateway proxy probes
+scripts/d2_smoke.sh   # storaged GET/PUT/LIST/DELETE + 256 MiB cap + concurrency cap
+scripts/d3_smoke.sh   # catalogd register/manifest/list + rehydrate-across-restart
+scripts/d4_smoke.sh   # ingestd CSV → Parquet round-trip + schema-drift 409
+scripts/d5_smoke.sh   # queryd DuckDB SELECT through httpfs over MinIO
+scripts/d6_smoke.sh   # full ingest → query through gateway only
+scripts/g1_smoke.sh   # vectord HNSW recall + dim mismatch + duplicate-create 409
+scripts/g1p_smoke.sh  # vectord state survives kill+restart via storaged
+scripts/g2_smoke.sh   # embed → vectord add → search round-trip
+```
+
+Run them all in any order:
+```
+for s in scripts/{d1,d2,d3,d4,d5,d6,g1,g1p,g2}_smoke.sh; do "$s" || break; done
+```
+
+## Cold-start dependencies
+
+- Go 1.25+ at `/usr/local/go/bin` (arrow-go pulled the 1.25 floor)
+- `gcc` + `libc-dev` for the DuckDB cgo binding (ADR-001 §1.1)
+- MinIO running on `:9000` with bucket `lakehouse-go-primary`
+- Ollama running on `:11434` with `nomic-embed-text` loaded (G2)
+- `/etc/lakehouse/secrets-go.toml` with `[s3.primary]` credentials
+  (storaged + queryd both read this)
 
 ## Layout
 
 ```
-docs/         Direction + spec + ADRs
-cmd/          (forthcoming) main packages — one per service
-internal/     (forthcoming) shared packages
-web/          (forthcoming) HTMX templates + static
-scripts/      (forthcoming) cold-start, smoke, distill
-tests/        (forthcoming) golden files, integration tests
+docs/         Direction + spec + ADRs + day-by-day
+cmd/          One main package per binary
+internal/     Shared packages — storeclient, catalogclient,
+              secrets, shared, embed, gateway, plus
+              per-service implementation packages
+scripts/      Smokes + ancillary tooling
 ```
 
 ## Reading order
 
 1. `docs/PRD.md` — what we're building and why
 2. `docs/SPEC.md` — how, per-component
-3. `docs/DECISIONS.md` — ADRs, starting with ADR-001 (foundational)
-4. `docs/RUST_PATHWAY_MEMORY_NOTE.md` — historical reference for the
-   Rust era's pathway memory state (not migrated)
+3. `docs/DECISIONS.md` — ADRs (ADR-001 foundational)
+4. `docs/PHASE_G0_KICKOFF.md` — day-by-day from D1 through G2
+5. `docs/RUST_PATHWAY_MEMORY_NOTE.md` — historical reference for the
+   Rust era's pathway memory (not migrated, by ADR-001 #5)
 
 ## Predecessor
 
 The Rust Lakehouse this rewrite supersedes lives at
-`git.agentview.dev/profit/lakehouse`. It remains the live system until
-this Go implementation reaches feature parity (per `docs/SPEC.md` §7).
+`git.agentview.dev/profit/lakehouse`. It remains the live system
+serving `devop.live/lakehouse/` until this Go implementation reaches
+feature parity per `docs/SPEC.md` §7. Then Rust enters
+maintenance-only mode.
diff --git a/docs/PHASE_G0_KICKOFF.md b/docs/PHASE_G0_KICKOFF.md
index e9b3ee9..edcb6e0 100644
--- a/docs/PHASE_G0_KICKOFF.md
+++ b/docs/PHASE_G0_KICKOFF.md
@@ -1273,3 +1273,46 @@ G0's substrate handles real production-scale data with one config knob
 bumped. No correctness issues, no OOMs, no silent type errors. Query
 latency is fast enough for ad-hoc analytics. The substrate is ready
 for G1+ work to build on top.
+
+---
+
+## Post-G0 work (pointer, not detail)
+
+After G0 substrate validated at real scale, work continued on top.
+Each piece has its own commit + scrum-review record; this section
+is a pointer into the git log, not a full retro.
+
+| Phase | Component | Commit | Smoke | Scrum fixes |
+|---|---|---|---|---|
+| G1 | vectord — HNSW vector search via `coder/hnsw` | `b8c072c` | g1_smoke 7/7 | 6 |
+| G1P | vectord persistence to storaged (single-file `LHV1` framing) | `8b92518` | g1p_smoke 8/8 | 3 (incl. 3-way convergent torn-write fix) |
+| G2 | embedd — text → vector via Ollama (default `nomic-embed-text` 768-d) | `9ee7fc5` | g2_smoke 5/5 | 2 |
+
+After G2, the **end-to-end staffing co-pilot pipeline** is functional
+through gateway:
+```
+text → /v1/embed → /v1/vectors/index//add
+text → /v1/embed → /v1/vectors/index//search → top-K hits
+```
+
+The `g2_smoke.sh` end-to-end assertion proves it: a CSV-style staffing
+text round-trips through embed → vectord → search at distance ≈ 0
+(float32 precision noise on identical unit vectors).
+
+The post-G0 service inventory is now 7 binaries + 1 shared library
+package:
+- `gateway` (:3110) — reverse proxy
+- `storaged` (:3211) — S3 I/O
+- `catalogd` (:3212) — Parquet manifests
+- `ingestd` (:3213) — CSV → Parquet → register
+- `queryd` (:3214) — DuckDB SELECT over Parquet via httpfs
+- `vectord` (:3215) — HNSW vector search + optional persistence
+- `embedd` (:3216) — text → vector via Ollama
+
+Plus shared packages: `internal/storeclient`, `internal/catalogclient`,
+`internal/secrets`, `internal/shared`.
+
+**Smokes (deterministic, run-in-any-order)**: d1, d2, d3, d4, d5, d6,
+g1, g1p, g2 — 9 acceptance gates, all PASS. `scripts/g1_smoke.toml`
+disables vectord persistence specifically for the in-memory API smoke
+to avoid rehydrate-from-storaged contamination.
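
Reviewer note on the `g2_smoke.sh` claim that identical unit vectors come back at distance ≈ 0 rather than exactly 0: the Go sketch below reproduces that float32 noise in isolation. It is only an illustration under stated assumptions. The diff does not say which metric vectord uses, so cosine distance (1 minus the dot product of unit vectors) is assumed here; the vector is synthetic rather than a real `nomic-embed-text` embedding; and the helper names `unitVector` and `cosineDistance` and the `1e-4` tolerance are hypothetical, not taken from the repository.

```go
package main

import (
	"fmt"
	"math"
)

// unitVector returns a deterministic dim-dimensional vector scaled to unit
// L2 norm. It is a synthetic stand-in for an embedding (assumption: real
// vectors would come from embedd, which this sketch does not call).
func unitVector(dim int) []float32 {
	v := make([]float32, dim)
	var sumSq float64
	for i := range v {
		v[i] = float32(math.Sin(float64(i + 1))) // arbitrary reproducible values
		sumSq += float64(v[i]) * float64(v[i])
	}
	norm := math.Sqrt(sumSq)
	for i := range v {
		v[i] = float32(float64(v[i]) / norm)
	}
	return v
}

// cosineDistance returns 1 - dot(a, b), accumulating in float32.
// It assumes both inputs are unit length and have the same dimension.
// Cosine distance is an assumed metric, not confirmed by the diff.
func cosineDistance(a, b []float32) float32 {
	var dot float32
	for i := range a {
		dot += a[i] * b[i]
	}
	return 1 - dot
}

func main() {
	v := unitVector(768) // 768-d, matching the default model size in the diff
	d := cosineDistance(v, v)

	const tol = 1e-4 // hypothetical tolerance, not taken from g2_smoke.sh
	fmt.Printf("distance=%g within tolerance=%t\n", d, d > -tol && d < tol)
}
```

The printed distance is typically a tiny nonzero value, because the normalized components and the running sum are each rounded to float32. That is why an assertion of this kind is better written as a tolerance check than as an equality check.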