# Lakehouse-Go — Replication Runbook
How to deploy Lakehouse-Go onto a fresh Linux host. Mirrors the layout
the dev box uses; covers prereqs, secrets, systemd units, validation.
## Prereqs
The host needs these build tools and external services in place BEFORE
the Lakehouse daemons can usefully start. None are managed by
Lakehouse-Go's own units; they're operator infrastructure.
| Service | Purpose | Reachability |
|---|---|---|
| **Go 1.25+** | builds the binaries | `go version` returns ≥ 1.25 |
| **gcc** | DuckDB cgo (queryd) | `gcc --version` |
| **MinIO** (or AWS S3) | storaged backing store | `curl http://localhost:9000/minio/health/live` returns 200; bucket `lakehouse-go-primary` exists |
| **Ollama** | embedd + chatd LLM dispatch | `curl http://localhost:11434/api/tags` returns 200 and lists `nomic-embed-text-v2-moe` (or whatever `[embedd].default_model` names) |
| **Langfuse** *(optional)* | trace + span observability | `curl http://localhost:3001/api/public/health` returns 200 |
| **PostgreSQL** *(optional)* | only if Langfuse is wanted | bundled with the Langfuse docker compose |
Bind ports the daemons use (G0 dev defaults; shifted by 10 from the
Rust legacy on 3100/3201-3204 so both stacks coexist):
| Daemon | Port |
|---|---:|
| gateway | 3110 |
| storaged | 3211 |
| catalogd | 3212 |
| ingestd | 3213 |
| queryd | 3214 |
| vectord | 3215 |
| embedd | 3216 |
| pathwayd | 3217 |
| matrixd | 3218 |
| observerd | 3219 |
| chatd | 3220 |
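
For scripts that address a daemon by name, the table above can be captured in a small lookup. This is a minimal sketch: `lakehouse_port` is a hypothetical helper, not something the repo ships, and the ports are the G0 dev defaults from the table, so adjust them if `lakehouse.toml` overrides any.

```bash
# Hypothetical helper: map a daemon name to its G0 dev default port.
# Ports are copied from the table above.
lakehouse_port() {
  case "$1" in
    gateway)   echo 3110 ;;
    storaged)  echo 3211 ;;
    catalogd)  echo 3212 ;;
    ingestd)   echo 3213 ;;
    queryd)    echo 3214 ;;
    vectord)   echo 3215 ;;
    embedd)    echo 3216 ;;
    pathwayd)  echo 3217 ;;
    matrixd)   echo 3218 ;;
    observerd) echo 3219 ;;
    chatd)     echo 3220 ;;
    *) echo "unknown daemon: $1" >&2; return 1 ;;
  esac
}
```

Example use: `curl -sS "http://127.0.0.1:$(lakehouse_port queryd)/health"`.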
## Bootstrap
### 1. User + directories
```bash
sudo useradd --system --no-create-home --shell /usr/sbin/nologin lakehouse
sudo mkdir -p /var/lib/lakehouse/{pathway,observer} /var/log/lakehouse \
  /usr/local/bin/lakehouse /etc/lakehouse
sudo chown -R lakehouse:lakehouse /var/lib/lakehouse /var/log/lakehouse
```
### 2. Build + install binaries
From a clone of the repo:
```bash
git clone https://git.agentview.dev/profit/golangLAKEHOUSE.git
cd golangLAKEHOUSE
just verify # vet + tests + 9 core smokes — ~31s
go build -o bin/ ./cmd/... # 11 binaries land in ./bin/
sudo cp bin/{gateway,storaged,catalogd,ingestd,queryd,vectord,embedd,pathwayd,observerd,matrixd,chatd} /usr/local/bin/lakehouse/
sudo chmod 755 /usr/local/bin/lakehouse/*
```
### 3. Config + secrets
```bash
# Main config — edit ports/URLs/model tier as needed
sudo cp lakehouse.toml /etc/lakehouse/lakehouse.toml
# S3 credentials — fill in real keys
sudo cp deploy/etc-lakehouse/secrets-go.toml.example /etc/lakehouse/secrets-go.toml
sudo chown root:lakehouse /etc/lakehouse/secrets-go.toml
sudo chmod 0640 /etc/lakehouse/secrets-go.toml
sudo $EDITOR /etc/lakehouse/secrets-go.toml # set [s3.primary] keys
# Auth token — required ONLY if any daemon binds non-loopback
sudo cp deploy/etc-lakehouse/auth.env.example /etc/lakehouse/auth.env
sudo chown root:lakehouse /etc/lakehouse/auth.env
sudo chmod 0640 /etc/lakehouse/auth.env
# For non-loopback deploys, set:
# AUTH_TOKEN=<generate via `openssl rand -hex 32`>
sudo $EDITOR /etc/lakehouse/auth.env
# Optional: Langfuse traces. When set, every authenticated HTTP
# request to every daemon emits a trace + span (production
# observability per OPEN item #2 closure). Missing file = no
# traces, no warnings.
sudo cp deploy/etc-lakehouse/langfuse.env.example /etc/lakehouse/langfuse.env
sudo chown root:lakehouse /etc/lakehouse/langfuse.env
sudo chmod 0640 /etc/lakehouse/langfuse.env
sudo $EDITOR /etc/lakehouse/langfuse.env # set URL + PUBLIC_KEY + SECRET_KEY
# Optional: chatd cloud provider keys, one file per provider
# (each is its own EnvironmentFile so rotations don't restart all chatd)
for provider in ollama_cloud openrouter opencode kimi; do
  echo "${provider^^}_API_KEY=" | sudo tee "/etc/lakehouse/$provider.env" > /dev/null
  sudo chown root:lakehouse "/etc/lakehouse/$provider.env"
  sudo chmod 0640 "/etc/lakehouse/$provider.env"
done
sudo $EDITOR /etc/lakehouse/openrouter.env # etc per provider you need
```
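
Since every secrets file above is expected to end up at mode 0640, a quick pre-flight check can catch a missed `chmod`. The sketch below assumes GNU `stat` (as on a typical Linux host); `check_mode` is a hypothetical name, not part of the repo.

```bash
# Hypothetical pre-flight check: succeed only if FILE has the expected
# octal mode. `stat -c '%a'` prints the mode without a leading zero,
# so 0640 is passed as 640.
check_mode() {
  local file="$1" expected="$2" actual
  actual=$(stat -c '%a' "$file") || return 2
  if [ "$actual" != "$expected" ]; then
    echo "$file: mode $actual, expected $expected" >&2
    return 1
  fi
}
```

For example: `for f in /etc/lakehouse/secrets-go.toml /etc/lakehouse/*.env; do check_mode "$f" 640; done`.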
### 4. systemd units
```bash
sudo cp deploy/systemd/*.service deploy/systemd/*.target /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable lakehouse-go.target
sudo systemctl start lakehouse-go.target
```
### 5. Validation
```bash
# All 11 daemons should be active
systemctl status 'lakehouse-*.service' --no-pager | grep -E "Active|●"
# Health endpoints respond on each port
for port in 3110 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220; do
  printf "%5d: " "$port"
  curl -sS --max-time 2 "http://127.0.0.1:$port/health" || echo "FAIL"
done
# Through the gateway: all chatd providers register (cloud keys present)
curl -sS http://127.0.0.1:3110/v1/chat/providers | jq
# End-to-end: ingest a tiny CSV → queryd SELECT
printf 'id,name,role\n1,Alice,Forklift Operator\n' > /tmp/probe.csv
curl -sS -F "file=@/tmp/probe.csv" "http://127.0.0.1:3110/v1/ingest?name=probe"
curl -sS -X POST http://127.0.0.1:3110/v1/sql \
  -H 'content-type: application/json' \
  -d '{"sql":"SELECT COUNT(*) FROM probe"}' | jq
```
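
When the health loop runs from automation, an exit status is more useful than eyeballing the output. The following is a sketch under that assumption: `health_sweep` and `probe_health` are hypothetical names, and the `PROBE` override exists only so the loop can be exercised without live daemons.

```bash
# Default probe: succeed only if /health answers with a non-error status.
probe_health() {
  curl -fsS --max-time 2 "http://127.0.0.1:$1/health" > /dev/null
}

# Probe every port given as an argument, print one line per port, and
# return non-zero if any probe failed. Set PROBE to substitute the probe.
health_sweep() {
  local failed=0 port
  for port in "$@"; do
    if "${PROBE:-probe_health}" "$port"; then
      echo "$port ok"
    else
      echo "$port FAIL"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

Example use: `health_sweep 3110 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 || echo "at least one daemon is down"`.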
## Auth posture
Per ADR-006:
- **Loopback-only deploy** (every daemon binds 127.0.0.1): no auth needed. Empty `AUTH_TOKEN` is fine. Network is the boundary.
- **Non-loopback deploy** (gateway exposed beyond loopback, daemons internal-private): set `AUTH_TOKEN` in `/etc/lakehouse/auth.env`. The mechanical gate at startup refuses to bind without one.
- **Multi-host deploy** (gateway + daemons on separate machines): set `AUTH_TOKEN` *and* `[auth].allowed_ips` in lakehouse.toml to the gateway's address. Both layers gate.
- **TLS**: terminate at nginx/Caddy in front of the gateway. The Go daemons speak HTTP; in-process TLS is explicitly out of scope per ADR-006 Decision 6.6.
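
The first two postures boil down to one decision at startup. The sketch below illustrates that decision in shell; it is NOT the daemons' actual code, `is_loopback` and `auth_gate` are hypothetical names, and the `[auth].allowed_ips` layer is left out.

```bash
# Treat 127.0.0.0/8, ::1, and "localhost" as loopback binds.
is_loopback() {
  case "$1" in
    127.*|localhost|::1) return 0 ;;
    *) return 1 ;;
  esac
}

# auth_gate <bind-host> <token>: succeed if the bind would be allowed.
auth_gate() {
  if is_loopback "$1"; then
    return 0   # loopback-only: the network is the boundary
  fi
  if [ -n "$2" ]; then
    return 0   # non-loopback, but a token is set
  fi
  echo "refuse non-loopback bind without auth.token" >&2
  return 1
}
```

For example, `auth_gate 0.0.0.0 ""` fails, while `auth_gate 127.0.0.1 ""` and `auth_gate 0.0.0.0 "$AUTH_TOKEN"` (with a non-empty token) succeed.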
## Token rotation
Per ADR-006 Decision 6.5 — dual-token window:
```bash
# 1. Generate new token
NEW=$(openssl rand -hex 32)
# 2. Add as secondary, keep old as primary (the sed is a no-op unless
#    auth.env already contains an AUTH_SECONDARY_TOKEN= line)
sudo sed -i "s|^AUTH_SECONDARY_TOKEN=.*|AUTH_SECONDARY_TOKEN=$NEW|" /etc/lakehouse/auth.env
sudo systemctl restart lakehouse-go.target
# 3. Update every caller to use NEW token
# 4. Promote: NEW becomes primary, secondary clears
sudo sed -i "s|^AUTH_TOKEN=.*|AUTH_TOKEN=$NEW|" /etc/lakehouse/auth.env
sudo sed -i "s|^AUTH_SECONDARY_TOKEN=.*|AUTH_SECONDARY_TOKEN=|" /etc/lakehouse/auth.env
sudo systemctl restart lakehouse-go.target
```
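
The `sed` substitutions above only take effect when the key is already present in `auth.env`. If that cannot be assumed, a replace-or-append helper is one option. This is a sketch: `set_env_var` is a hypothetical name, it relies on GNU `sed -i`, and it assumes values contain no `|`, `&`, or `\` (true for hex tokens from `openssl rand -hex`).

```bash
# Hypothetical helper: set KEY=VALUE in an env file, replacing the line
# when the key exists and appending it otherwise.
set_env_var() {
  local file="$1" key="$2" value="$3"
  if grep -q "^${key}=" "$file"; then
    sed -i "s|^${key}=.*|${key}=${value}|" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}
```

Because `/etc/lakehouse/auth.env` is root-owned, the helper has to run in a root shell, for example `set_env_var /etc/lakehouse/auth.env AUTH_SECONDARY_TOKEN "$NEW"`.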
## Docker / docker-compose deploy (alternative to systemd)
The single-image `Dockerfile` carries all 11 daemons; `docker-compose.yml`
runs one container per daemon with the same dependency graph as the
systemd units. Useful when the host doesn't have systemd (Mac dev
boxes, remote VMs without root) or when you want all of Lakehouse-Go
isolated to a private docker network.
```bash
# Build the image (multi-stage; ~3 min on first build, ~30s with
# cached go module download).
docker build -t lakehouse-go:latest .
# Place config + secrets next to docker-compose.yml. The compose file
# bind-mounts these into every container at /etc/lakehouse/.
# lakehouse.toml is already in the repo root; edit it in place if needed
cp deploy/etc-lakehouse/secrets-go.toml.example secrets-go.toml
chmod 0600 secrets-go.toml
cp deploy/etc-lakehouse/auth.env.example auth.env
chmod 0600 auth.env
# Per-provider chatd keys (each its own file so missing == provider
# unregistered, NOT chatd startup failure):
for p in ollama_cloud openrouter opencode kimi; do
  echo "${p^^}_API_KEY=" > "$p.env"
  chmod 0600 "$p.env"
done
# $EDITOR each file to fill in real values...
# Bring up the stack.
docker compose up -d
docker compose ps # all 11 services Healthy
docker compose logs -f gateway
# Validate via the gateway like the systemd path.
curl -sS http://127.0.0.1:3110/v1/chat/providers | jq
# Tear down.
docker compose down
# State volume (pathway/observer JSONLs) survives `down`. To wipe:
docker compose down -v
```
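
Before `docker compose up`, it can help to confirm that the files created in the steps above are actually sitting next to `docker-compose.yml`. A minimal sketch, assuming the file list is exactly the one this section creates; `missing_compose_files` is a hypothetical name.

```bash
# Hypothetical pre-flight: print each expected config file that is
# missing from DIR (default: current directory); non-zero if any is.
missing_compose_files() {
  local dir="${1:-.}" f missing=0
  for f in lakehouse.toml secrets-go.toml auth.env \
           ollama_cloud.env openrouter.env opencode.env kimi.env; do
    if [ ! -f "$dir/$f" ]; then
      echo "$f"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Example use: `missing_compose_files . && docker compose up -d`.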
### Key docker-vs-systemd differences
| Concern | systemd | docker-compose |
|---|---|---|
| Process supervision | systemd | tini + docker daemon |
| Logs | journald | `docker logs` (or routed to a sink via logging driver) |
| Restarts on failure | `Restart=on-failure` | `restart: unless-stopped` |
| File ownership | `User=lakehouse` (uid varies) | `user: 999:999` (uid is fixed in the image) |
| Reaches MinIO/Ollama | host network | host's address from inside the bridge network — typically `host.docker.internal` (Mac/Win) or `172.17.0.1` (Linux). Set `[s3].endpoint` + `[embedd].provider_url` accordingly. |
| Backup target | `/var/lib/lakehouse/` on host | the `lakehouse-state` named volume; bind to a host path via the commented-out `driver_opts` in compose if needed |
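
The "Reaches MinIO/Ollama" row can be turned into a small helper when generating `lakehouse.toml` for containers. This is only a sketch of the typical case from the table: `host_addr_for` is a hypothetical name, and a custom bridge network may use a different gateway address than `172.17.0.1`.

```bash
# Hypothetical helper: pick the address containers use to reach services
# on the host. Takes the kernel name as printed by `uname -s`.
host_addr_for() {
  case "$1" in
    Linux) echo "172.17.0.1" ;;            # typical docker0 bridge gateway
    *)     echo "host.docker.internal" ;;  # Docker Desktop (Mac/Win)
  esac
}
```

Example use: `echo "http://$(host_addr_for "$(uname -s)"):9000"` to build the value for `[s3].endpoint`.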
## Logs
systemd routes everything to journald with per-daemon SyslogIdentifier:
```bash
journalctl -u lakehouse-gateway.service -f
journalctl -u 'lakehouse-*.service' --since '5 min ago'
```
## Stopping
```bash
sudo systemctl stop lakehouse-go.target # cascades to all 11 daemons
```
## Backup / state preservation
| Path | What | Backup priority |
|---|---|---|
| `/var/lib/lakehouse/pathway/state.jsonl` | Mem0 trace store (append-only) | high |
| `/var/lib/lakehouse/observer/ops.jsonl` | observer ring's persistor backup | medium |
| MinIO `lakehouse-go-primary` bucket | parquets, vector LHV1 indexes, catalog manifests | high |
| `/etc/lakehouse/lakehouse.toml` | service config | medium |
| `/etc/lakehouse/secrets-go.toml` + `*.env` | secrets | high (in your secrets manager, not on disk) |
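
For the two on-disk state files, a timestamped tarball is one simple way to take a snapshot. The sketch below is an assumption about how an operator might do it, not a shipped tool: `backup_state` and the `STATE_DIR` override are hypothetical, and the MinIO bucket still needs its own backup.

```bash
# Hypothetical helper: archive the pathway and observer state files into
# DEST as a timestamped tarball and print the archive path.
# STATE_DIR defaults to the systemd layout described above.
backup_state() {
  local state_dir="${STATE_DIR:-/var/lib/lakehouse}" dest="$1"
  local out="$dest/lakehouse-state-$(date +%Y%m%dT%H%M%S).tar.gz"
  tar -czf "$out" -C "$state_dir" pathway/state.jsonl observer/ops.jsonl &&
    echo "$out"
}
```

Example use: `sudo mkdir -p /var/backups/lakehouse`, then run `backup_state /var/backups/lakehouse` from a shell that can read the state directory.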
## Troubleshooting
**Daemon refuses to start with "refuse non-loopback bind without auth.token"**
ADR-006 6.1 mechanical gate. Set `AUTH_TOKEN` in `/etc/lakehouse/auth.env` or bind back to loopback.
**Daemon refuses to start with "refusing non-loopback bind ... see audit R-001"**
The previous loopback-bind gate. For dev: `LH_<NAME>_ALLOW_NONLOOPBACK=1` overrides. For prod: set `AUTH_TOKEN` AND keep the override (or move to loopback + reverse-proxy).
**catalogd 500 / NoSuchBucket**
storaged is pointing at a bucket that doesn't exist. Either create the bucket in MinIO or fix `[s3].bucket` in lakehouse.toml.
**embedd 502 on /v1/embed**
Ollama is not running, or the model named by `[embedd].default_model` has not been pulled. Run `ollama list` to verify and `ollama pull nomic-embed-text-v2-moe` to fetch it.
**chatd `/v1/chat/providers` shows `false` for cloud providers**
The provider's env file is missing or empty. Check `/etc/lakehouse/<provider>.env`.
**queryd unable to read parquet**
Check `[queryd].secrets_path` points at the right secrets-go.toml AND the file's owner+mode allow the lakehouse user to read.
## Related docs
- `STATE_OF_PLAY.md` — verified-working snapshot
- `docs/DECISIONS.md` — all ADRs, especially ADR-003 (auth substrate) + ADR-006 (auth posture)
- `docs/SPEC.md` §1 — component table