golangLAKEHOUSE/REPLICATION.md
root a59ef5b930 Sprint 4 deployment artifacts: 11 systemd units + REPLICATION.md + env templates
Builds on ADR-006 to ship the operator-facing bits Sprint 4 was
blocked on. Single-host deploy is now a documented procedure.

deploy/systemd/ (12 files):
- 11 .service units, one per daemon. Each follows the same template:
  Type=simple, User=lakehouse, hardening (NoNewPrivileges,
  ProtectSystem=strict, ProtectHome, PrivateTmp, ReadWritePaths
  scoped to /var/lib/lakehouse + /var/log/lakehouse), JSON to
  journald with per-daemon SyslogIdentifier, EnvironmentFile=- on
  /etc/lakehouse/auth.env.
- Dependency graph baked in via After=/Requires=:
    storaged → standalone (only network-online)
    catalogd → Requires storaged
    ingestd → Requires storaged + catalogd
    queryd → Requires catalogd
    matrixd → Requires embedd + vectord
    gateway → Wants every other daemon (Wants= not Requires=
              so a single upstream restart doesn't cascade-restart
              the gateway)
    pathwayd / observerd / vectord / embedd / chatd → standalone
- chatd unit reads 4 cloud-provider EnvironmentFile=s
  (ollama_cloud / openrouter / opencode / kimi) — each is its own
  file so per-provider key rotation doesn't restart the others.
- lakehouse-go.target: convenience aggregator. Operators
  systemctl start/stop/enable lakehouse-go.target instead of
  managing 11 daemons individually. Per-daemon WantedBy=
  this target.

deploy/etc-lakehouse/ (2 templates):
- auth.env.example: AUTH_TOKEN per ADR-006 6.2 + rotation playbook
  comments. The committed file is empty — operators copy + fill in.
- secrets-go.toml.example: [s3.primary] template with
  REPLACE_ME placeholders. Multi-bucket G2 example commented.

REPLICATION.md (top-level):
- Operator runbook from fresh box → 11 daemons running.
- Prereqs (Go 1.25+, gcc, MinIO, Ollama, optionally Langfuse +
  Postgres for Langfuse) with reachability checks.
- Bind ports table (3110–3220, shifted by 10 from Rust legacy).
- Bootstrap: useradd → build → install → config → secrets →
  systemd → validation.
- Auth posture matrix (loopback / non-loopback / multi-host / TLS).
- Token rotation procedure inline (ADR-006 Decision 6.5).
- Logs (journalctl), backup paths, troubleshooting matrix.

Validation: systemd-analyze verify passed on all 11 .service files
(only "not executable" warnings, expected since binaries don't live
at /usr/local/bin/lakehouse/ until step 2 of bootstrap runs).

Sprint 4 is now operator-ready. Next: Dockerfile + multi-stage
build for container deploys (separate concern; deploy targets
either systemd OR docker, not both).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:54:49 -05:00

# Lakehouse-Go — Replication Runbook
How to deploy Lakehouse-Go onto a fresh Linux host. Mirrors the layout
the dev box uses; covers prereqs, secrets, systemd units, validation.
## Prereqs
The host needs these external services reachable BEFORE the Lakehouse
daemons can usefully start. None are managed by Lakehouse-Go's own
units; they're operator infrastructure.
| Service | Purpose | Reachability |
|---|---|---|
| **Go 1.25+** | builds the binaries | `go version` returns ≥ 1.25 |
| **gcc** | DuckDB cgo (queryd) | `gcc --version` |
| **MinIO** (or AWS S3) | storaged backing store | `curl http://localhost:9000/minio/health/live` returns 200; bucket `lakehouse-go-primary` exists |
| **Ollama** | embedd + chatd LLM dispatch | `curl http://localhost:11434/api/tags` returns 200 with `nomic-embed-text-v2-moe` (or whatever `[embedd].default_model` names) loaded |
| **Langfuse** *(optional)* | trace + span observability | `curl http://localhost:3001/api/public/health` returns 200 |
| **PostgreSQL** *(optional)* | only if Langfuse is wanted | bundled with the Langfuse docker compose |
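
The reachability column above can be run as one preflight pass. The following is a minimal sketch, not part of the repo: the function names `go_version_ok`, `http_200`, and `preflight` are hypothetical, and the URLs are the dev defaults from the table. It does not verify that the `lakehouse-go-primary` bucket exists, and the model check is a plain substring match on the `/api/tags` response.

```bash
# Returns 0 when a `go version` output line names Go 1.25 or newer.
go_version_ok() {
  local ver major minor
  ver=$(printf '%s\n' "$1" | sed -n 's/^go version go\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p')
  [ -n "$ver" ] || return 1
  major=${ver%%.*}
  minor=${ver#*.}
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 25 ]; }
}

# Returns 0 when the URL answers with HTTP 200 within 3 seconds.
http_200() {
  [ "$(curl -s -o /dev/null --max-time 3 -w '%{http_code}' "$1")" = "200" ]
}

# Runs every required check from the table; returns non-zero if any failed.
preflight() {
  local failed=0
  go_version_ok "$(go version 2>/dev/null)" || { echo "FAIL: Go 1.25+ not found"; failed=1; }
  gcc --version >/dev/null 2>&1 || { echo "FAIL: gcc not found"; failed=1; }
  http_200 http://localhost:9000/minio/health/live || { echo "FAIL: MinIO not reachable"; failed=1; }
  http_200 http://localhost:11434/api/tags || { echo "FAIL: Ollama not reachable"; failed=1; }
  # Substring match only; adjust the name if [embedd].default_model differs.
  curl -s --max-time 3 http://localhost:11434/api/tags | grep -q 'nomic-embed-text-v2-moe' \
    || { echo "FAIL: embedding model not listed by Ollama"; failed=1; }
  return "$failed"
}
```

Calling `preflight` prints one `FAIL:` line per missing prerequisite and prints nothing when all required services are reachable.
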
Bind ports the daemons use (G0 dev defaults; shifted by 10 from the
Rust legacy on 3100/3201–3204 so both stacks coexist):
| Daemon | Port |
|---|---:|
| gateway | 3110 |
| storaged | 3211 |
| catalogd | 3212 |
| ingestd | 3213 |
| queryd | 3214 |
| vectord | 3215 |
| embedd | 3216 |
| pathwayd | 3217 |
| matrixd | 3218 |
| observerd | 3219 |
| chatd | 3220 |
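
Before the first start it can help to confirm that nothing else is already listening on these ports. A minimal sketch, assuming `ss` from iproute2 is available; `LAKEHOUSE_PORTS` and `busy_ports` are hypothetical names, not something the repo provides.

```bash
# Default bind ports from the table above.
LAKEHOUSE_PORTS="3110 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220"

# Reads `ss -ltnH` output on stdin and prints each Lakehouse port that
# already has a listener. Column 4 is the local address:port.
busy_ports() {
  awk -v ports="$LAKEHOUSE_PORTS" '
    BEGIN { n = split(ports, p, " "); for (i = 1; i <= n; i++) want[p[i]] = 1 }
    { m = split($4, a, ":"); if (a[m] in want) print a[m] }
  ' | sort -un
}
```

Run it as `ss -ltnH | busy_ports`. Empty output means all eleven ports are free; any printed port needs either the conflicting process stopped or the matching port changed in `lakehouse.toml`.
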
## Bootstrap
### 1. User + directories
```bash
sudo useradd --system --no-create-home --shell /usr/sbin/nologin lakehouse
sudo mkdir -p /var/lib/lakehouse/{pathway,observer} /var/log/lakehouse \
  /usr/local/bin/lakehouse /etc/lakehouse
sudo chown -R lakehouse:lakehouse /var/lib/lakehouse /var/log/lakehouse
```
### 2. Build + install binaries
From a clone of the repo:
```bash
git clone https://git.agentview.dev/profit/golangLAKEHOUSE.git
cd golangLAKEHOUSE
just verify # vet + tests + 9 core smokes — ~31s
go build -o bin/ ./cmd/... # 11 binaries land in ./bin/
sudo cp bin/{gateway,storaged,catalogd,ingestd,queryd,vectord,embedd,pathwayd,observerd,matrixd,chatd} /usr/local/bin/lakehouse/
sudo chmod 755 /usr/local/bin/lakehouse/*
```
### 3. Config + secrets
```bash
# Main config — edit ports/URLs/model tier as needed
sudo cp lakehouse.toml /etc/lakehouse/lakehouse.toml
# S3 credentials — fill in real keys
sudo cp deploy/etc-lakehouse/secrets-go.toml.example /etc/lakehouse/secrets-go.toml
sudo chown root:lakehouse /etc/lakehouse/secrets-go.toml
sudo chmod 0640 /etc/lakehouse/secrets-go.toml
sudo $EDITOR /etc/lakehouse/secrets-go.toml # set [s3.primary] keys
# Auth token — required ONLY if any daemon binds non-loopback
sudo cp deploy/etc-lakehouse/auth.env.example /etc/lakehouse/auth.env
sudo chown root:lakehouse /etc/lakehouse/auth.env
sudo chmod 0640 /etc/lakehouse/auth.env
# For non-loopback deploys, set:
# AUTH_TOKEN=<generate via `openssl rand -hex 32`>
sudo $EDITOR /etc/lakehouse/auth.env
# Optional: chatd cloud provider keys, one file per provider
# (each is its own EnvironmentFile so rotations don't restart all chatd)
for provider in ollama_cloud openrouter opencode kimi; do
  echo "${provider^^}_API_KEY=" | sudo tee "/etc/lakehouse/$provider.env" > /dev/null
  sudo chown root:lakehouse "/etc/lakehouse/$provider.env"
  sudo chmod 0640 "/etc/lakehouse/$provider.env"
done
sudo $EDITOR /etc/lakehouse/openrouter.env # etc per provider you need
```
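The troubleshooting section traces one class of failure back to the owner and mode of these files, so it is worth confirming them once after this step. A minimal sketch, assuming GNU `stat` (standard on Linux hosts); `check_perms` and `check_lakehouse_secrets` are hypothetical helper names.

```bash
# Returns 0 when FILE has exactly the expected owner:group and octal mode.
# Relies on GNU `stat -c`.
check_perms() {
  local file="$1" want_owner="$2" want_mode="$3" got
  got=$(stat -c '%U:%G %a' "$file") || return 1
  if [ "$got" != "$want_owner $want_mode" ]; then
    echo "BAD: $file is '$got', expected '$want_owner $want_mode'"
    return 1
  fi
}

# Checks every secrets file created in this step against root:lakehouse 0640.
check_lakehouse_secrets() {
  local f rc=0
  for f in /etc/lakehouse/secrets-go.toml /etc/lakehouse/*.env; do
    check_perms "$f" root:lakehouse 640 || rc=1
  done
  return "$rc"
}
```

`check_lakehouse_secrets` prints one `BAD:` line per file whose owner or mode differs and is silent otherwise.
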
### 4. systemd units
```bash
sudo cp deploy/systemd/*.service deploy/systemd/*.target /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable lakehouse-go.target
sudo systemctl start lakehouse-go.target
```
### 5. Validation
```bash
# All 11 daemons should be active
systemctl status 'lakehouse-*.service' --no-pager | grep -E "Active|●"
# Health endpoints respond on each port
for port in 3110 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220; do
  printf "%5d: " "$port"
  curl -fsS --max-time 2 "http://127.0.0.1:$port/health" || echo "FAIL"
done
# Through the gateway: all chatd providers register (cloud keys present)
curl -sS http://127.0.0.1:3110/v1/chat/providers | jq
# End-to-end: ingest a tiny CSV, then read it back with a queryd SELECT
printf 'id,name,role\n1,Alice,Forklift Operator\n' > /tmp/probe.csv
curl -sS -F "file=@/tmp/probe.csv" "http://127.0.0.1:3110/v1/ingest?name=probe"
curl -sS -X POST http://127.0.0.1:3110/v1/sql \
  -H 'content-type: application/json' \
  -d '{"sql":"SELECT COUNT(*) FROM probe"}' | jq
```
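`systemctl start` can return before every daemon is accepting connections, so a health probe fired immediately afterwards may fail spuriously. A minimal sketch of a retrying probe, assuming the `/health` endpoints and ports used above; `retry`, `health_ok`, and `wait_all_healthy` are hypothetical names, and the 10 attempts with a 1 second pause are arbitrary.

```bash
# Runs a command up to TRIES times, sleeping DELAY seconds between attempts.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  local tries="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    "$@" && return 0
    i=$((i + 1))
    [ "$i" -le "$tries" ] && sleep "$delay"
  done
  return 1
}

# Succeeds when the daemon on PORT answers /health with a 2xx status.
health_ok() {
  curl -fsS --max-time 2 -o /dev/null "http://127.0.0.1:$1/health" 2>/dev/null
}

# Waits for every daemon in turn; stops at the first one that never comes up.
wait_all_healthy() {
  local port
  for port in 3110 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220; do
    retry 10 1 health_ok "$port" || { echo "port $port: not healthy"; return 1; }
  done
}
```

Run `wait_all_healthy` right after starting the target; on failure, the printed port maps back to a daemon through the bind ports table.
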
## Auth posture
Per ADR-006:
- **Loopback-only deploy** (every daemon binds 127.0.0.1): no auth needed. Empty `AUTH_TOKEN` is fine. Network is the boundary.
- **Non-loopback deploy** (gateway exposed beyond loopback, daemons internal-private): set `AUTH_TOKEN` in `/etc/lakehouse/auth.env`. The mechanical gate at startup refuses to bind without one.
- **Multi-host deploy** (gateway + daemons on separate machines): set `AUTH_TOKEN` *and* `[auth].allowed_ips` in lakehouse.toml to the gateway's address. Both layers gate.
- **TLS**: terminate at nginx/Caddy in front of the gateway. The Go daemons speak HTTP; in-process TLS is explicitly out of scope per ADR-006 Decision 6.6.
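
The first two rows of this matrix reduce to one rule: a loopback bind needs no token, any other bind does. The sketch below only restates that rule for use in a pre-deploy check. It is not the daemons' actual startup gate, the function names are hypothetical, and the address test is a simple pattern match rather than a full address parser.

```bash
# Returns 0 when a bind address looks like loopback
# (127.x.x.x, ::1, or localhost, with or without a port).
is_loopback() {
  case "$1" in
    127.[0-9]*|::1|\[::1\]*|localhost|localhost:*) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the posture implied by a bind address and an AUTH_TOKEN value;
# returns 1 for the combination the runbook says is refused at startup.
auth_posture() {
  local addr="$1" token="$2"
  if is_loopback "$addr"; then
    echo "ok: loopback bind, AUTH_TOKEN optional"
  elif [ -n "$token" ]; then
    echo "ok: non-loopback bind with AUTH_TOKEN set"
  else
    echo "refused: non-loopback bind needs AUTH_TOKEN"
    return 1
  fi
}
```

For example, `auth_posture 0.0.0.0:3110 ""` reports the refused case, while `auth_posture 127.0.0.1:3110 ""` reports that the token is optional.
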
## Token rotation
Per ADR-006 Decision 6.5 — dual-token window:
```bash
# 1. Generate new token
NEW=$(openssl rand -hex 32)
# 2. Add as secondary, keep old as primary
sudo sed -i "s|^AUTH_SECONDARY_TOKEN=.*|AUTH_SECONDARY_TOKEN=$NEW|" /etc/lakehouse/auth.env
sudo systemctl restart lakehouse-go.target
# 3. Update every caller to use NEW token
# 4. Promote: NEW becomes primary, secondary clears
sudo sed -i "s|^AUTH_TOKEN=.*|AUTH_TOKEN=$NEW|" /etc/lakehouse/auth.env
sudo sed -i "s|^AUTH_SECONDARY_TOKEN=.*|AUTH_SECONDARY_TOKEN=|" /etc/lakehouse/auth.env
sudo systemctl restart lakehouse-go.target
```
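One caveat with the `sed -i` calls above: they only rewrite a line that already exists. If `auth.env` has no `AUTH_SECONDARY_TOKEN=` line (for example, a file started from an empty template), step 2 changes nothing and reports no error. A minimal sketch of a set-or-append helper, assuming GNU `sed` and values that contain no `|` character (hex tokens from `openssl rand -hex 32` qualify); `set_env_var` is a hypothetical name.

```bash
# Sets KEY=VALUE in FILE: rewrites the line if KEY is already present,
# appends a new line otherwise. Must run as a user that can write FILE.
set_env_var() {
  local file="$1" key="$2" value="$3"
  if grep -q "^${key}=" "$file"; then
    sed -i "s|^${key}=.*|${key}=${value}|" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}
```

Run as root, `set_env_var /etc/lakehouse/auth.env AUTH_SECONDARY_TOKEN "$NEW"` can stand in for the step 2 `sed` line, and the same helper covers both lines of step 4.
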
## Logs
systemd routes everything to journald with per-daemon SyslogIdentifier:
```bash
journalctl -u lakehouse-gateway.service -f
journalctl -u 'lakehouse-*.service' --since '5 min ago'
```
## Stopping
```bash
sudo systemctl stop lakehouse-go.target # cascades to all 11 daemons
```
## Backup / state preservation
| Path | What | Backup priority |
|---|---|---|
| `/var/lib/lakehouse/pathway/state.jsonl` | Mem0 trace store (append-only) | high |
| `/var/lib/lakehouse/observer/ops.jsonl` | observer ring's persistor backup | medium |
| MinIO `lakehouse-go-primary` bucket | parquets, vector LHV1 indexes, catalog manifests | high |
| `/etc/lakehouse/lakehouse.toml` | service config | medium |
| `/etc/lakehouse/secrets-go.toml` + `*.env` | secrets | high (in your secrets manager, not on disk) |
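
The on-disk rows of this table can be captured in a single archive. A minimal sketch with a hypothetical `backup_state` helper: it deliberately leaves out the secrets files (the table places those in a secrets manager) and the MinIO bucket, which needs its own object-store backup. The first argument is the filesystem root to read from, normally `/`.

```bash
# Archives the pathway state, observer ops log, and main config found
# under ROOT into the gzip-compressed tarball OUT. Paths are stored
# relative to ROOT so the archive can be unpacked anywhere.
backup_state() {
  local root="$1" out="$2"
  tar -czf "$out" -C "$root" \
    var/lib/lakehouse/pathway/state.jsonl \
    var/lib/lakehouse/observer/ops.jsonl \
    etc/lakehouse/lakehouse.toml
}
```

As root, `backup_state / "/root/lakehouse-$(date +%F).tar.gz"` writes a dated archive; the destination path is only an example. Because `state.jsonl` is append-only, an archive taken while the daemons are running may end on a partially written last line.
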
## Troubleshooting
**Daemon refuses to start with "refuse non-loopback bind without auth.token"**
ADR-006 6.1 mechanical gate. Set `AUTH_TOKEN` in `/etc/lakehouse/auth.env` or bind back to loopback.
**Daemon refuses to start with "refusing non-loopback bind ... see audit R-001"**
The previous loopback-bind gate. For dev: `LH_<NAME>_ALLOW_NONLOOPBACK=1` overrides. For prod: set `AUTH_TOKEN` AND keep the override (or move to loopback + reverse-proxy).
**catalogd 500 / NoSuchBucket**
storaged is pointing at a bucket that doesn't exist. Either create the bucket in MinIO or fix `[s3].bucket` in lakehouse.toml.
**embedd 502 on /v1/embed**
Ollama is not running, or `[embedd].default_model` is not loaded. Run `ollama list` to verify and `ollama pull nomic-embed-text-v2-moe` to load it.
**chatd `/v1/chat/providers` shows `false` for cloud providers**
The provider's env file is missing or empty. Check `/etc/lakehouse/<provider>.env`.
**queryd unable to read parquet**
Check `[queryd].secrets_path` points at the right secrets-go.toml AND the file's owner+mode allow the lakehouse user to read.
## Related docs
- `STATE_OF_PLAY.md` — verified-working snapshot
- `docs/DECISIONS.md` — all ADRs, especially ADR-003 (auth substrate) + ADR-006 (auth posture)
- `docs/SPEC.md` §1 — component table