golangLAKEHOUSE/REPLICATION.md
root a59ef5b930 Sprint 4 deployment artifacts: 11 systemd units + REPLICATION.md + env templates
Builds on ADR-006 to ship the operator-facing bits Sprint 4 was
blocked on. Single-host deploy is now a documented procedure.

deploy/systemd/ (12 files):
- 11 .service units, one per daemon. Each follows the same template:
  Type=simple, User=lakehouse, hardening (NoNewPrivileges,
  ProtectSystem=strict, ProtectHome, PrivateTmp, ReadWritePaths
  scoped to /var/lib/lakehouse + /var/log/lakehouse), JSON to
  journald with per-daemon SyslogIdentifier, EnvironmentFile=- on
  /etc/lakehouse/auth.env.
- Dependency graph baked in via After=/Requires=:
    storaged → standalone (only network-online)
    catalogd → Requires storaged
    ingestd → Requires storaged + catalogd
    queryd → Requires catalogd
    matrixd → Requires embedd + vectord
    gateway → Wants every other daemon (Wants= not Requires=
              so a single upstream restart doesn't cascade-restart
              the gateway)
    pathwayd / observerd / vectord / embedd / chatd → standalone
- chatd unit reads 4 cloud-provider EnvironmentFile=s
  (ollama_cloud / openrouter / opencode / kimi) — each is its own
  file so per-provider key rotation doesn't restart the others.
- lakehouse-go.target: convenience aggregator. Operators
  systemctl start/stop/enable lakehouse-go.target instead of
  managing 11 daemons individually. Per-daemon WantedBy=
  this target.

deploy/etc-lakehouse/ (2 templates):
- auth.env.example: AUTH_TOKEN per ADR-006 6.2 + rotation playbook
  comments. The committed file is empty — operators copy + fill in.
- secrets-go.toml.example: [s3.primary] template with
  REPLACE_ME placeholders. Multi-bucket G2 example commented.

REPLICATION.md (top-level):
- Operator runbook from fresh box → 11 daemons running.
- Prereqs (Go 1.25+, gcc, MinIO, Ollama, optionally Langfuse +
  Postgres for Langfuse) with reachability checks.
- Bind ports table (3110–3220, shifted by 10 from Rust legacy).
- Bootstrap: useradd → build → install → config → secrets →
  systemd → validation.
- Auth posture matrix (loopback / non-loopback / multi-host / TLS).
- Token rotation procedure inline (ADR-006 Decision 6.5).
- Logs (journalctl), backup paths, troubleshooting matrix.

Validation: systemd-analyze verify passed on all 11 .service files
(only "not executable" warnings, expected since binaries don't live
at /usr/local/bin/lakehouse/ until step 2 of bootstrap runs).

Sprint 4 is now operator-ready. Next: Dockerfile + multi-stage
build for container deploys (separate concern; deploy targets
either systemd OR docker, not both).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:54:49 -05:00


Lakehouse-Go — Replication Runbook

How to deploy Lakehouse-Go onto a fresh Linux host. Mirrors the layout the dev box uses; covers prereqs, secrets, systemd units, validation.

Prereqs

The host needs these build tools and external services available BEFORE the Lakehouse daemons can usefully start. None are managed by Lakehouse-Go's own units; they're operator infrastructure.

| Service | Purpose | Reachability |
|---|---|---|
| Go 1.25+ | builds the binaries | `go version` returns ≥ 1.25 |
| gcc | DuckDB cgo (queryd) | `gcc --version` |
| MinIO (or AWS S3) | storaged backing store | `curl http://localhost:9000/minio/health/live` returns 200; bucket `lakehouse-go-primary` exists |
| Ollama | embedd + chatd LLM dispatch | `curl http://localhost:11434/api/tags` returns 200 with `nomic-embed-text-v2-moe` (or whatever `[embedd].default_model` names) loaded |
| Langfuse (optional) | trace + span observability | `curl http://localhost:3001/api/public/health` returns 200 |
| PostgreSQL (optional) | only if Langfuse is wanted | bundled with the Langfuse docker compose |
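The reachability column can be turned into a preflight script. The following is a minimal sketch, not part of the repo: the function names (`version_ge`, `go_version_from`, `http_ok`, `preflight`) are hypothetical, and it covers only the Go, gcc, MinIO, and Ollama rows. It does not confirm that the bucket exists or that the embedding model is loaded.

```shell
#!/bin/sh
# Preflight sketch for the prerequisites table. Nothing runs until preflight is called.

# version_ge HAVE WANT: succeeds when dotted version HAVE >= WANT (major.minor only).
version_ge() {
  have_major=${1%%.*}; have_rest=${1#*.}; have_minor=${have_rest%%.*}
  want_major=${2%%.*}; want_rest=${2#*.}; want_minor=${want_rest%%.*}
  [ "$have_major" -gt "$want_major" ] && return 0
  [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]
}

# go_version_from LINE: extracts the numeric version from `go version` output.
go_version_from() {
  printf '%s\n' "$1" | sed -n 's/^go version go\([0-9][0-9.]*\).*/\1/p'
}

# http_ok URL: succeeds when URL answers with a 2xx status within 2 seconds.
http_ok() { curl -fsS --max-time 2 -o /dev/null "$1"; }

# preflight: prints one line per check and returns non-zero if any check failed.
preflight() {
  status=0
  v=$(go_version_from "$(go version 2>/dev/null)")
  if [ -n "$v" ] && version_ge "$v" 1.25; then echo "ok   go $v"; else echo "FAIL go >= 1.25"; status=1; fi
  if gcc --version >/dev/null 2>&1; then echo "ok   gcc"; else echo "FAIL gcc"; status=1; fi
  for url in http://localhost:9000/minio/health/live http://localhost:11434/api/tags; do
    if http_ok "$url"; then echo "ok   $url"; else echo "FAIL $url"; status=1; fi
  done
  return "$status"
}
```

Source the file and call `preflight` on the target host before moving on to Bootstrap; a non-zero return means at least one prerequisite is missing.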

Bind ports the daemons use (G0 dev defaults; shifted by 10 from the Rust legacy on 3100/3201–3204 so both stacks coexist):

| Daemon | Port |
|---|---|
| gateway | 3110 |
| storaged | 3211 |
| catalogd | 3212 |
| ingestd | 3213 |
| queryd | 3214 |
| vectord | 3215 |
| embedd | 3216 |
| pathwayd | 3217 |
| matrixd | 3218 |
| observerd | 3219 |
| chatd | 3220 |
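Before the first start it can help to confirm that nothing else already listens on these ports. The sketch below is an illustration only: `busy_ports` and `LAKEHOUSE_PORTS` are hypothetical names, and it assumes the listener list comes from `ss -Hltn` (iproute2), where the fourth column is the local address and port.

```shell
#!/bin/sh
# Default ports from the table above.
LAKEHOUSE_PORTS="3110 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220"

# busy_ports: reads `ss -Hltn` output on stdin and prints each Lakehouse port
# that already has a listener (each port at most once).
busy_ports() {
  awk -v ports="$LAKEHOUSE_PORTS" '
    BEGIN { n = split(ports, p, " "); for (i = 1; i <= n; i++) want[p[i]] = 1 }
    {
      k = split($4, a, ":")          # port is the last colon-separated field
      port = a[k]
      if ((port in want) && !seen[port]++) print port
    }
  '
}
```

Usage: `ss -Hltn | busy_ports`. Empty output means all eleven ports are free.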

Bootstrap

1. User + directories

sudo useradd --system --no-create-home --shell /usr/sbin/nologin lakehouse
sudo mkdir -p /var/lib/lakehouse/{pathway,observer} /var/log/lakehouse \
              /usr/local/bin/lakehouse /etc/lakehouse
sudo chown -R lakehouse:lakehouse /var/lib/lakehouse /var/log/lakehouse

2. Build + install binaries

From a clone of the repo:

git clone https://git.agentview.dev/profit/golangLAKEHOUSE.git
cd golangLAKEHOUSE
just verify    # vet + tests + 9 core smokes — ~31s
go build -o bin/ ./cmd/...   # 11 binaries land in ./bin/
sudo cp bin/{gateway,storaged,catalogd,ingestd,queryd,vectord,embedd,pathwayd,observerd,matrixd,chatd} /usr/local/bin/lakehouse/
sudo chmod 755 /usr/local/bin/lakehouse/*
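A quick way to confirm the install step copied everything is to check for each of the 11 binaries by name. This is a sketch with a hypothetical helper, `missing_binaries`; the daemon list is the one used in the `cp` command above.

```shell
#!/bin/sh
DAEMONS="gateway storaged catalogd ingestd queryd vectord embedd pathwayd observerd matrixd chatd"

# missing_binaries DIR: prints each daemon that has no executable regular file
# in DIR, and returns non-zero if any are missing.
missing_binaries() {
  missing=0
  for d in $DAEMONS; do
    if [ ! -f "$1/$d" ] || [ ! -x "$1/$d" ]; then
      echo "$d"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Usage: `missing_binaries /usr/local/bin/lakehouse && echo "all 11 installed"`.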

3. Config + secrets

# Main config — edit ports/URLs/model tier as needed
sudo cp lakehouse.toml /etc/lakehouse/lakehouse.toml

# S3 credentials — fill in real keys
sudo cp deploy/etc-lakehouse/secrets-go.toml.example /etc/lakehouse/secrets-go.toml
sudo chown root:lakehouse /etc/lakehouse/secrets-go.toml
sudo chmod 0640 /etc/lakehouse/secrets-go.toml
sudo $EDITOR /etc/lakehouse/secrets-go.toml  # set [s3.primary] keys

# Auth token — required ONLY if any daemon binds non-loopback
sudo cp deploy/etc-lakehouse/auth.env.example /etc/lakehouse/auth.env
sudo chown root:lakehouse /etc/lakehouse/auth.env
sudo chmod 0640 /etc/lakehouse/auth.env
# For non-loopback deploys, set:
#   AUTH_TOKEN=<generate via `openssl rand -hex 32`>
sudo $EDITOR /etc/lakehouse/auth.env

# Optional: chatd cloud provider keys, one file per provider
# (each provider has its own EnvironmentFile, so one key can be rotated without touching the others' files)
for provider in ollama_cloud openrouter opencode kimi; do
  echo "${provider^^}_API_KEY=" | sudo tee /etc/lakehouse/$provider.env > /dev/null
  sudo chown root:lakehouse /etc/lakehouse/$provider.env
  sudo chmod 0640 /etc/lakehouse/$provider.env
done
sudo $EDITOR /etc/lakehouse/openrouter.env  # etc per provider you need
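Since every secrets file above is meant to end up at mode 0640, a small audit can catch a file that was created with looser permissions. The helpers below (`mode_is`, `audit_secret_modes`) are hypothetical and assume GNU `stat`; they check mode bits only, so ownership (`root:lakehouse`) still needs a separate look with `ls -l`.

```shell
#!/bin/sh
# mode_is FILE MODE: succeeds when FILE's octal permission bits equal MODE (GNU stat).
mode_is() { [ "$(stat -c '%a' "$1" 2>/dev/null)" = "$2" ]; }

# audit_secret_modes DIR: prints every secrets file in DIR whose mode is not 640
# and returns non-zero if any such file was found.
audit_secret_modes() {
  bad=0
  for f in "$1"/secrets-go.toml "$1"/*.env; do
    [ -e "$f" ] || continue          # skip unmatched globs and absent files
    if ! mode_is "$f" 640; then
      echo "$f"
      bad=1
    fi
  done
  [ "$bad" -eq 0 ]
}
```

Usage: `audit_secret_modes /etc/lakehouse || echo "fix the modes listed above"`.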

4. systemd units

sudo cp deploy/systemd/*.service deploy/systemd/*.target /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable lakehouse-go.target
sudo systemctl start lakehouse-go.target
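systemd works out the start order itself, but when debugging a failed start it can be useful to see the order the Requires= relationships imply (storaged before catalogd, catalogd before ingestd and queryd, embedd and vectord before matrixd, as listed in the commit notes). The sketch below feeds those edges to `tsort`; the function names are hypothetical, and the standalone daemons and the gateway's Wants= links are left out on purpose.

```shell
#!/bin/sh
# requires_edges: one "dependency dependent" pair per line.
requires_edges() {
  cat <<'EOF'
storaged catalogd
storaged ingestd
catalogd ingestd
catalogd queryd
embedd matrixd
vectord matrixd
EOF
}

# start_order: prints daemons so that every dependency precedes its dependents.
start_order() { requires_edges | tsort; }

# comes_before A B: succeeds when A is listed before B in start_order.
comes_before() {
  start_order | awk -v a="$1" -v b="$2" '
    $0 == a { ia = NR }
    $0 == b { ib = NR }
    END { exit !(ia && ib && ia < ib) }
  '
}
```

Running `start_order` prints one valid ordering; `tsort` may pick any order that respects the edges, so compare relative positions rather than exact lines.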

5. Validation

# All 11 daemons should be active
systemctl status 'lakehouse-*.service' --no-pager | grep -E "Active|●"

# Health endpoints respond on each port
for port in 3110 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220; do
  printf "%5d: " "$port"
  curl -fsS --max-time 2 "http://127.0.0.1:$port/health" || echo "FAIL"
done

# Through the gateway: all chatd providers register (cloud keys present)
curl -sS http://127.0.0.1:3110/v1/chat/providers | jq

# End-to-end: ingest a tiny CSV → queryd SELECT
echo -e "id,name,role\n1,Alice,Forklift Operator" > /tmp/probe.csv
curl -sS -F "file=@/tmp/probe.csv" "http://127.0.0.1:3110/v1/ingest?name=probe"
curl -sS -X POST http://127.0.0.1:3110/v1/sql \
  -H 'content-type: application/json' \
  -d '{"sql":"SELECT COUNT(*) FROM probe"}' | jq
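For use in CI or a post-deploy hook, the port loop can be wrapped so that it reports daemon names and returns an exit status. This is a sketch under assumptions: `port_of`, `probe`, and `health_report` are hypothetical helpers, and `probe` treats any 2xx answer on `/health` as healthy.

```shell
#!/bin/sh
# port_of DAEMON: prints the default port from the bind-ports table.
port_of() {
  case "$1" in
    gateway)   echo 3110 ;;
    storaged)  echo 3211 ;;
    catalogd)  echo 3212 ;;
    ingestd)   echo 3213 ;;
    queryd)    echo 3214 ;;
    vectord)   echo 3215 ;;
    embedd)    echo 3216 ;;
    pathwayd)  echo 3217 ;;
    matrixd)   echo 3218 ;;
    observerd) echo 3219 ;;
    chatd)     echo 3220 ;;
    *) return 1 ;;
  esac
}

# probe PORT: succeeds when /health on that port answers with a 2xx status.
probe() { curl -fsS --max-time 2 -o /dev/null "http://127.0.0.1:$1/health"; }

# health_report: prints one line per daemon; returns non-zero if any probe failed.
health_report() {
  failed=0
  for d in gateway storaged catalogd ingestd queryd vectord embedd pathwayd matrixd observerd chatd; do
    if probe "$(port_of "$d")"; then echo "ok   $d"; else echo "FAIL $d"; failed=1; fi
  done
  return "$failed"
}
```

Usage: `health_report || journalctl -u 'lakehouse-*.service' --since '5 min ago'`.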

Auth posture

Per ADR-006:

  • Loopback-only deploy (every daemon binds 127.0.0.1): no auth needed. Empty AUTH_TOKEN is fine. Network is the boundary.
  • Non-loopback deploy (gateway exposed beyond loopback, daemons internal-private): set AUTH_TOKEN in /etc/lakehouse/auth.env. The mechanical gate at startup refuses to bind without one.
  • Multi-host deploy (gateway + daemons on separate machines): set AUTH_TOKEN and [auth].allowed_ips in lakehouse.toml to the gateway's address. Both layers gate.
  • TLS: terminate at nginx/Caddy in front of the gateway. The Go daemons speak HTTP; in-process TLS is explicitly out of scope per ADR-006 Decision 6.6.
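The first two bullets reduce to one rule: a loopback bind is always allowed, and any other bind needs a non-empty token. The sketch below restates that rule as a shell predicate for checking a planned config. It is not the daemons' actual gate: `is_loopback` and `bind_allowed` are hypothetical, and the loopback test is a simple pattern match on the host part.

```shell
#!/bin/sh
# is_loopback HOST: succeeds for 127.x.x.x, ::1 (bare or bracketed) and "localhost".
is_loopback() {
  case "$1" in
    127.*|::1|"[::1]"|localhost) return 0 ;;
    *) return 1 ;;
  esac
}

# bind_allowed ADDR TOKEN: ADDR is host:port. Succeeds when the host is loopback
# or TOKEN is non-empty. An empty host (":3110") counts as non-loopback.
bind_allowed() {
  host=${1%:*}
  is_loopback "$host" || [ -n "$2" ]
}
```

For example, `bind_allowed 0.0.0.0:3110 "$AUTH_TOKEN"` fails while `AUTH_TOKEN` is empty, which corresponds to the case where the startup gate refuses to bind.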

Token rotation

Per ADR-006 Decision 6.5 — dual-token window:

# 1. Generate new token
NEW=$(openssl rand -hex 32)

# 2. Add as secondary, keep old as primary
sudo sed -i "s|^AUTH_SECONDARY_TOKEN=.*|AUTH_SECONDARY_TOKEN=$NEW|" /etc/lakehouse/auth.env
sudo systemctl restart lakehouse-go.target

# 3. Update every caller to use NEW token
# 4. Promote: NEW becomes primary, secondary clears
sudo sed -i "s|^AUTH_TOKEN=.*|AUTH_TOKEN=$NEW|" /etc/lakehouse/auth.env
sudo sed -i "s|^AUTH_SECONDARY_TOKEN=.*|AUTH_SECONDARY_TOKEN=|" /etc/lakehouse/auth.env
sudo systemctl restart lakehouse-go.target
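One caveat with the `sed -i` lines above: `sed` only rewrites a line that already exists, so if auth.env has no `AUTH_SECONDARY_TOKEN=` line the command leaves the file unchanged and reports no error. A replace-or-append helper avoids that. The sketch below uses a hypothetical function, `set_env_var`, and assumes GNU `sed` and values without `|` or `&` characters (true for hex tokens from `openssl rand -hex 32`).

```shell
#!/bin/sh
# set_env_var FILE KEY VALUE: rewrites the KEY= line in FILE, or appends
# KEY=VALUE when no such line exists yet.
set_env_var() {
  if grep -q "^$2=" "$1" 2>/dev/null; then
    sed -i "s|^$2=.*|$2=$3|" "$1"
  else
    printf '%s=%s\n' "$2" "$3" >> "$1"
  fi
}
```

Because auth.env is owned by root, run it from a root shell, for example `set_env_var /etc/lakehouse/auth.env AUTH_SECONDARY_TOKEN "$NEW"`, then restart as in step 2.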

Logs

systemd routes everything to journald with per-daemon SyslogIdentifier:

journalctl -u lakehouse-gateway.service -f
journalctl -u 'lakehouse-*.service' --since '5 min ago'

Stopping

sudo systemctl stop lakehouse-go.target  # stops all 11 daemons, provided each unit declares PartOf=lakehouse-go.target

Backup / state preservation

| Path | What | Backup priority |
|---|---|---|
| /var/lib/lakehouse/pathway/state.jsonl | Mem0 trace store (append-only) | high |
| /var/lib/lakehouse/observer/ops.jsonl | observer ring's persistor backup | medium |
| MinIO lakehouse-go-primary bucket | parquets, vector LHV1 indexes, catalog manifests | high |
| /etc/lakehouse/lakehouse.toml | service config | medium |
| /etc/lakehouse/secrets-go.toml + *.env | secrets | high (in your secrets manager, not on disk) |
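The two on-disk state files can be captured with a timestamped archive. The sketch below is one possible approach, not a procedure from the repo: `backup_state` is a hypothetical helper, the destination directory is the operator's choice, and the MinIO bucket needs its own tooling. Because state.jsonl is append-only and may be written during the copy, `tar` can warn that a file changed while being read.

```shell
#!/bin/sh
# backup_state SRC_ROOT DEST_DIR: archives pathway/state.jsonl and
# observer/ops.jsonl (relative to SRC_ROOT) into
# DEST_DIR/lakehouse-state-<UTC timestamp>.tar.gz and prints the archive path.
backup_state() {
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  archive="$2/lakehouse-state-$stamp.tar.gz"
  tar -czf "$archive" -C "$1" pathway/state.jsonl observer/ops.jsonl && echo "$archive"
}
```

Usage, with a placeholder destination: `backup_state /var/lib/lakehouse /path/to/backups`.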

Troubleshooting

Daemon refuses to start with "refuse non-loopback bind without auth.token": ADR-006 6.1 mechanical gate. Set AUTH_TOKEN in /etc/lakehouse/auth.env or bind back to loopback.

Daemon refuses to start with "refusing non-loopback bind ... see audit R-001": The previous loopback-bind gate. For dev: LH_<NAME>_ALLOW_NONLOOPBACK=1 overrides. For prod: set AUTH_TOKEN AND keep the override (or move to loopback + reverse-proxy).

catalogd 500 / NoSuchBucket: storaged is pointing at a bucket that doesn't exist. Either create the bucket in MinIO or fix [s3].bucket in lakehouse.toml.

embedd 502 on /v1/embed: Ollama is not running OR [embedd].default_model is not loaded. Run `ollama list` to verify; run `ollama pull nomic-embed-text-v2-moe` to load.

chatd /v1/chat/providers shows false for cloud providers: The provider's env file is missing or empty. Check /etc/lakehouse/<provider>.env.

queryd unable to read parquet: Check that [queryd].secrets_path points at the right secrets-go.toml AND that the file's owner + mode allow the lakehouse user to read it.

See also

  • STATE_OF_PLAY.md — verified-working snapshot
  • docs/DECISIONS.md — all ADRs, especially ADR-003 (auth substrate) + ADR-006 (auth posture)
  • docs/SPEC.md §1 — component table