golangLAKEHOUSE/scripts/cutover/start_go_stack.sh
root 54b2e7db76 start_go_stack.sh: document smoke-vs-persistent-stack pkill conflict
Caught immediately after the prior commit pushed: pre-push smokes
killed 7 of 11 persistent Go daemons because the smokes' anchored
`pkill -f "bin/(name)$"` teardown matches ANY process named
`bin/<daemon>`, not just the smokes' own children.

Documented in the script header as a KNOWN CONSTRAINT with a
workaround (re-run start_go_stack.sh after every push) and a
proper-fix sketch (give the persistent stack a different binary
name via build tag or symlink). Proper fix deferred until trigger
fires — operators living through this once will know to want it.
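
To see why a rename would hide a daemon from the smokes, the teardown pattern can be checked against sample command lines without killing anything. This is an illustrative sketch only: the `persist-` prefix and the `smoke_would_kill` helper are hypothetical, and the regex is assembled from the seven daemon names the smokes are reported to hit. Bash's `=~` uses POSIX extended regular expressions, the same family `pkill` accepts.

```bash
#!/usr/bin/env bash
# Hypothetical check, not part of the repo: would the smoke teardown
# pattern match a given command line?

# Pattern shape from the smokes' teardown, filled in with the seven
# daemon names the smokes are reported to kill.
SMOKE_RE='bin/(gateway|storaged|catalogd|ingestd|queryd|embedd|vectord)$'

# smoke_would_kill CMDLINE: exit 0 if the pattern matches CMDLINE.
smoke_would_kill() {
  [[ "$1" =~ $SMOKE_RE ]]
}

for cmd in "./bin/storaged" "./bin/persist-storaged" "./persist-bin/storaged"; do
  if smoke_would_kill "$cmd"; then
    echo "match    $cmd"
  else
    echo "no match $cmd"
  fi
done
```

The third sample shows that the rename has to change the file name itself: a directory whose name merely ends in `bin` still contains the substring `bin/storaged`, so it still matches.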

Persistent stack restored (all 11 healthy as of this commit).
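
A claim like "all 11 healthy" can be checked mechanically. The following is a minimal sketch, assuming each daemon answers `/health` on the port that start_go_stack.sh assigns it. `STACK`, `probe`, and `down_daemons` are hypothetical names, not part of the repo.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of start_go_stack.sh.

# name:port pairs, copied from the `start` calls in start_go_stack.sh.
STACK=(storaged:3211 catalogd:3212 ingestd:3213 queryd:3214
       vectord:3215 embedd:3216 pathwayd:3217 matrixd:3218
       observerd:3219 chatd:3220 gateway:3110)

# probe PORT: succeeds when the daemon's /health endpoint answers
# within one second.
probe() {
  curl -sSf -m 1 "http://127.0.0.1:$1/health" >/dev/null 2>&1
}

# down_daemons: print the name of every daemon whose probe fails,
# one per line. Prints nothing when the whole stack is healthy.
down_daemons() {
  local entry
  for entry in "${STACK[@]}"; do
    # ${entry##*:} is the port, ${entry%%:*} is the name.
    probe "${entry##*:}" || echo "${entry%%:*}"
  done
}
```

Calling `down_daemons` after a push lists the daemons the smokes took out. An empty result means all 11 answered; otherwise re-running start_go_stack.sh brings the stack back.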

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:56:52 -05:00


#!/usr/bin/env bash
# scripts/cutover/start_go_stack.sh
#
# Bring up the full Go stack persistently — alongside the live Rust
# gateway on :3100. All Go daemons land on the parallel port range
# :3110 + :3211-:3220 so there's no port collision.
#
# Unlike playbook_lift.sh's transient harness boot (which kills the
# stack on exit), this script starts every daemon detached via nohup
# + disown. Operators run it once at boot or after a restart; the
# stack stays up until the daemons are killed (e.g. by an anchored
# `pkill -f`) or the host reboots.
#
# Logs land in /tmp/gostack-logs/<bin>.log (one per daemon).
#
# Used to bring up the persistent stack 2026-05-01 — the first time
# the Go side has run as long-running daemons rather than per-harness
# transient processes.
#
# KNOWN CONSTRAINT: the pre-push smoke chain (`just verify` →
# scripts/{d,g}*_smoke.sh) tears down with an anchored
# `pkill -f "bin/(name)$"`, the same by-name approach this script
# uses. That pattern matches any process named bin/<daemon>, so it
# hits the persistent daemons as well as the smokes' own children.
# Pushing while the persistent stack is up kills 7 of 11 daemons
# (gateway, storaged, catalogd, ingestd, queryd, embedd, vectord);
# the smokes don't touch pathwayd, observerd, matrixd, or chatd.
#
# Workaround: re-run this script after every push.
#
# Proper fix: give the persistent stack a different binary name
# (e.g. via build tags or a wrapper symlink) so the smoke-side
# pkill doesn't see it. Deferred until the trigger fires, i.e.
# until an operator gets bitten.
set -euo pipefail
cd "$(dirname "$0")/../.."
if [ ! -d bin ]; then
  echo "[gostack] bin/ missing — run 'just build' first" >&2
  exit 1
fi
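# Ensure no leftover from a transient harness run. Anchored pattern
# per feedback_pkill_scope; never bare `bin/`. `pkill -f` matches the
# full command line, so `( |$)` keeps the name anchored whether or
# not arguments (e.g. `-config lakehouse.toml`) follow it.
echo "[gostack] killing any stale Go daemons (anchored pkill)"
pkill -f "bin/(storaged|catalogd|ingestd|queryd|embedd|vectord|pathwayd|observerd|matrixd|gateway)( |$)" 2>/dev/null || true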
sleep 0.5
mkdir -p /tmp/gostack-logs
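# start BIN PORT: launch ./bin/BIN detached, then wait for its
# /health endpoint on PORT to answer.
start() {
  local bin="$1"
  local port="$2"
  local log="/tmp/gostack-logs/$bin.log"
  # nohup + disown detach the daemon so it outlives this shell.
  nohup ./bin/"$bin" -config lakehouse.toml > "$log" 2>&1 & disown
  # Poll /health: up to 50 attempts, 0.1 s apart.
  for _ in $(seq 1 50); do
    if curl -sSf -m 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then
      echo " $bin :$port up (log: $log)"
      return 0
    fi
    sleep 0.1
  done
  echo " $bin :$port FAILED — log tail:"
  tail -20 "$log"
  return 1
}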
echo "[gostack] starting in dependency order"
start storaged 3211
start catalogd 3212
start ingestd 3213
start queryd 3214
start embedd 3216
start vectord 3215
start pathwayd 3217
start observerd 3219
start matrixd 3218
start gateway 3110
# chatd is started independently — its provider key files come from
# /etc/lakehouse/{ollama_cloud,openrouter,opencode,kimi}.env; if
# chatd is already up (long-running from a prior session) we don't
# touch it.
if ! curl -sSf -m 1 http://127.0.0.1:3220/health >/dev/null 2>&1; then
  echo "[gostack] chatd :3220 not up; starting"
  start chatd 3220
else
  echo " chatd :3220 already up (skipping)"
fi
echo
echo "[gostack] ready · sweep:"
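# `|| true`: under `set -euo pipefail`, a failed curl (or a SIGPIPE
# from `head` closing early) would otherwise abort the sweep.
for p in 3110 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220; do
  curl -sSf -m 1 "http://127.0.0.1:$p/health" 2>/dev/null | head -c 80 || true
  echo
done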