Per /home/profit/lakehouse/docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 8.
Go side reads SubjectManifest + verifies the HMAC chain on per-subject
audit JSONL files using the IDENTICAL canonical-JSON + HMAC-SHA256
algorithm as crates/catalogd/src/subject_audit.rs. A Rust-written chain
now verifies under Go and vice versa.
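A minimal sketch of the per-row step, assuming the chain feeds the previous
row's HMAC hex plus the current row's canonical-JSON bytes into HMAC-SHA256
(the authoritative layout is crates/catalogd/src/subject_audit.rs, mirrored
by internal/catalogd/subject.go):

```go
package catalogd

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
)

// computeRowHMAC sketches the chained per-row MAC. The message layout
// (previous HMAC hex, then canonical row bytes) is an assumption for
// illustration; the shipped code follows the Rust implementation exactly.
func computeRowHMAC(key []byte, prevHex string, canonical []byte) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(prevHex)) // assumed: previous row's HMAC chains forward
	mac.Write(canonical)       // canonical-JSON bytes of the current row
	return hex.EncodeToString(mac.Sum(nil))
}
```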
Files:
- internal/catalogd/subject.go
SubjectManifest, SubjectAuditRow, AuditAccessor, AuditLogEntry
LoadSubjectManifest, LoadKeyFile (32-byte minimum, matches Rust)
ReadAuditLog, VerifyChain
canonicalRowBytesFromRaw (production), canonicalRowBytesFromStruct (tests)
computeRowHMAC, CanonicalAndHmac (parity helper)
- internal/catalogd/subject_test.go (10 unit tests)
- scripts/cutover/parity/subject_audit_helper/main.go
CLI helper mirroring crates/catalogd/src/bin/parity_subject_audit.rs
- scripts/cutover/parity/subject_audit_parity.sh
Two-phase probe: known-answer + every real audit log
Two real bugs caught + fixed by the probe authoring loop:
1. omitempty on AuditAccessor.TraceID stripped the field when empty,
producing different canonical bytes than Rust (which always writes
the field). Removed omitempty. Rust + Go now produce identical
bytes for rows with trace_id="" (the common production case).
2. time.RFC3339Nano strips trailing zeros from nanoseconds, producing
"...46143921" where Rust's chrono AutoSi produces "...461439210".
Hashing through the parsed-then-re-marshaled struct breaks the
chain on any row whose nanos end in 0. Fixed by canonicalizing
from the RAW LINE BYTES (preserves the original timestamp string
byte-for-byte). Test TestVerifyChain_RawBytesPreserveTimePrecision
regression-locks this with a hand-crafted nanos=461439210 row.
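A sketch of the raw-line-bytes idea; the field name "ts" is illustrative, the
point is that the producer's timestamp string is reused byte-for-byte instead
of being parsed into time.Time and re-marshaled:

```go
package catalogd

import "encoding/json"

// canonicalTimestamp shows why hashing from the raw line works: the value
// under the (hypothetical) "ts" key is kept as the original bytes, so a
// producer timestamp ending in "...461439210Z" never collapses to
// "...46143921Z" the way a time.Time round-trip through RFC3339Nano would.
func canonicalTimestamp(line []byte) (json.RawMessage, error) {
	var row map[string]json.RawMessage
	if err := json.Unmarshal(line, &row); err != nil {
		return nil, err
	}
	return row["ts"], nil // original bytes, trailing zeros intact
}
```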
Live verification (6 / 6 byte-identical assertions):
- Phase 1 known-answer: canonical bytes (266) + HMAC match
- Phase 2 real logs: WORKER-1..5 audit JSONL all verify under both
runtimes with identical (count, tip, verified, error) output
Report: reports/cutover/gauntlet_2026-05-02/parity/subject_audit_parity.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today's sidecar drop (lakehouse ba928b1) changed Rust's embed
transport from gateway → sidecar → Ollama (2 hops) to gateway →
Ollama directly. Go's embedd has always been direct. A drift here
would mean: same query, different vector → different HNSW top-K →
different staffing recommendations. This probe is the regression
gate for that surface.
Fixtures cover staffing-domain shapes (forklift, welder, OSHA,
dental, CNC) plus stress shapes (unicode "Café résumé ⭐ 你好",
single char "x", 200-word long fixture).
Match metric: cosine similarity ≥ 0.99999. Byte-equal isn't
expected — Go round-trips through []float32 internally while Rust
stays at Vec<f64>, so JSON serialization introduces small float
drift. What matters operationally is vector direction (HNSW uses
cosine distance), and both runtimes preserve it when calling the
same Ollama with the same model.
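For reference, the metric itself is plain cosine similarity over the two
runtimes' vectors; a minimal sketch, independent of how the probe script
computes it:

```go
package parity

import "math"

// cosine returns the cosine similarity of two equal-length vectors; the
// probe treats a fixture as matching when this is >= 0.99999.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}
```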
Result: **8/8 fixtures match** including the long + unicode cases.
Sidecar drop didn't disturb the embed surface. The probe also
forces both endpoints to use `nomic-embed-text` so the v1-vs-v2-moe
default difference doesn't pollute the comparison.
5th cross-runtime parity probe joining the family:
- validator_parity (6/6)
- extract_json_parity (12/12)
- session_log_parity (4/4)
- materializer_parity (2/2)
- embed_parity (8/8) — this commit
Cumulative: 32/32 parity assertions across 5 probes covering
HTTP shape (validator, embed), CLI output (materializer), unit
behavior (extract_json), and persisted shape (session_log).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to lakehouse commit 57bde63 (Rust gateway gains
trace-id propagation + coordinator session JSONL). The
cross-runtime parity probe is the regression gate that prevents
silent schema drift between the two runtimes.
scripts/cutover/parity/session_log_parity.sh:
- 4 fixtures (accepted_grounded, max_iter_exhausted, infra_error,
unicode_in_prompt) feed identical input to both helpers
- jq -e validity gate + non-trivial-equal guard prevents the
"both sides fail identically → spurious match" failure mode
(caught one IFS='||' bug during initial authoring — recorded
in the script comment)
- normalize() strips timestamp + daemon (legitimate per-producer
differences); everything else must be byte-equal
- Result: 4/4 fixtures match, including unicode
scripts/cutover/parity/session_log_helper/main.go:
- Tiny stdin/stdout Go helper that round-trips a fixture
through validator.SessionRecord serde
- Counterpart to crates/gateway/src/bin/parity_session_log.rs
docs/ARCHITECTURE_COMPARISON.md decisions tracker:
- "Rust observability parity" row added (DONE 2026-05-02)
- Cross-runtime probe documented as reusable gate
STATE_OF_PLAY refreshed.
Both observability pieces (trace-id propagation, session JSONL)
now exist on both runtimes. Operators who point Rust gateway and
Go validatord at the same session-log path get a unified
longitudinal stream queryable via DuckDB.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the second half of J's 2026-05-02 multi-call observability
concern. Trace-id propagation (commit d6d2fdf) gave us the *live*
view in Langfuse; this gives us the *longitudinal* view for ad-hoc
DuckDB queries over thousands of sessions:
"show me every session where the model produced a real candidate
without ever needing a retry"
"find sessions where validation rejected three times in a row"
"first-shot success rate per model — did we feed it enough corpus?"
## What's in
internal/validator/session_log.go:
- SessionRecord type (schema=session.iterate.v1)
- SessionLogger writer — mutex-guarded append, best-effort posture,
nil-safe (NewSessionLogger("") = nil = no-op on Append)
- BuildSessionRecord helper — assembles a row from any
iterate response/failure/infra-error combination, callable from
other daemons that wrap iterate (cross-daemon shared schema)
- 7 unit tests including concurrent-append safety + the three
code paths (success / max_iter_exhausted / infra_error)
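A minimal sketch of that writer posture, with illustrative field names (the
shipped type lives in internal/validator/session_log.go):

```go
package validator

import (
	"encoding/json"
	"os"
	"sync"
)

// SessionLogger sketch: NewSessionLogger("") returns nil, and Append on a
// nil receiver is a no-op, so callers never branch on "is logging enabled".
type SessionLogger struct {
	mu   sync.Mutex
	path string
}

func NewSessionLogger(path string) *SessionLogger {
	if path == "" {
		return nil // disabled
	}
	return &SessionLogger{path: path}
}

func (l *SessionLogger) Append(rec any) {
	if l == nil {
		return // logging disabled
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	f, err := os.OpenFile(l.path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return // best-effort posture: never fail the request on logging errors
	}
	defer f.Close()
	_ = json.NewEncoder(f).Encode(rec) // one JSONL row per call
}
```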
cmd/validatord/main.go:
- handlers.sessionLog field + wiring from cfg.Validatord.SessionLogPath
- Iterate handler: build + append a SessionRecord on every call
- rosterCheckFor("fill") closure stamps grounded_in_roster — the
load-bearing forensic property J flagged ("we can never
hallucinate available staff members to contracts")
internal/shared/config.go + lakehouse.toml:
- [validatord].session_log_path field; empty = disabled
- Production: /var/lib/lakehouse/validator/sessions.jsonl
scripts/validatord_smoke.sh:
- Adds a probe verifying validatord announces session log path on
startup. Smoke is now 6/6 (was 5/5).
docs/SESSION_LOG.md:
- Schema reference + 5 worked DuckDB query examples including the
"alarm" query (sessions where grounded_in_roster=false on an
accepted fill — should always be empty; if not, something is
bypassing FillValidator).
## What this is NOT
This is NOT a duplicate of replay_runs.jsonl. They're siblings:
- replay_runs.jsonl: replay tool's per-task retrieval+model output
- sessions.jsonl: validatord's per-iterate full retry chain +
grounded-in-roster verdict
A single coordinator session can produce rows in both streams; the
session_id (= Langfuse trace_id) is the join key.
## Layered observability now in place
Live view:  Langfuse trace tree (X-Lakehouse-Trace-Id propagation);
            `iterate.attempt[N]` spans with prompt/raw/verdict
Offline:    coordinator_sessions.jsonl (this commit);
            DuckDB-queryable, longitudinal forensics
Hard gate:  FillValidator + WorkerLookup (existing);
            phantom IDs structurally rejected, never reach the
            session log's grounded_in_roster=true bucket
Per the architecture invariant in STATE_OF_PLAY's DO NOT RELITIGATE
section — these layers are wired; future work targets the data, not
the wiring.
## Verification
- internal/validator: 7 new tests (session_log_test.go) — all PASS
- cmd/validatord: 3 new integration tests covering the success,
failure, and grounded=false paths — all PASS
- validatord_smoke.sh: 6/6 PASS through gateway :3110
- Full go test ./... green across 33 packages
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the 2026-05-02 parity finding: validator_parity probe found
5/6 body shapes diverging because Go emitted {"Kind":"...","Field":"...","Reason":"..."}
while Rust emits the externally-tagged-enum {"Schema":{"field":"...","reason":"..."}}.
A caller parsing the error envelope would break silently in cutover.
## Changes
internal/validator/types.go:
- Custom MarshalJSON emits the Rust shape:
Schema: {"Schema": {"field":"x","reason":"y"}}
Completeness: {"Completeness":{"reason":"y"}}
Consistency: {"Consistency": {"reason":"y"}}
Policy: {"Policy": {"reason":"y"}}
- Custom UnmarshalJSON accepts BOTH the new Rust shape AND the legacy
flat shape (migration safety for any persisted error rows).
- Unknown variants (e.g. a future Rust addition Go hasn't learned)
surface as an Unmarshal error, not a silent default.
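A hedged sketch of the marshal side; the struct fields are inferred from the
legacy flat shape, and the shipped types.go also carries the dual-shape
UnmarshalJSON described above:

```go
package validator

import (
	"encoding/json"
	"fmt"
)

// ValidationError fields as implied by the legacy flat wire shape; the
// shipped type lives in internal/validator/types.go.
type ValidationError struct {
	Kind, Field, Reason string
}

// MarshalJSON emits the externally-tagged shape Rust's serde produces,
// e.g. {"Schema":{"field":"x","reason":"y"}} or {"Policy":{"reason":"y"}}.
func (e ValidationError) MarshalJSON() ([]byte, error) {
	switch e.Kind {
	case "schema":
		return json.Marshal(map[string]any{"Schema": map[string]string{"field": e.Field, "reason": e.Reason}})
	case "completeness":
		return json.Marshal(map[string]any{"Completeness": map[string]string{"reason": e.Reason}})
	case "consistency":
		return json.Marshal(map[string]any{"Consistency": map[string]string{"reason": e.Reason}})
	case "policy":
		return json.Marshal(map[string]any{"Policy": map[string]string{"reason": e.Reason}})
	}
	return nil, fmt.Errorf("unknown validation error kind %q", e.Kind)
}
```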
internal/validator/types_test.go:
- 4 pinning tests anchor the wire format. Failing them = wire-format
drift; the parity probe is the secondary line of defense.
scripts/validatord_smoke.sh:
- Updated probes to read the new variant-name shape (jq keys[0],
.Schema.field) instead of legacy .Kind/.Field.
## Verification
- internal/validator unit tests: PASS (4 new + all existing).
- cmd/validatord HTTP tests: PASS (UnmarshalJSON falls through to flat
shape so existing tests reading ValidationError still work).
- validatord_smoke.sh: 5/5 PASS through gateway :3110.
- validator parity probe re-run: **6/6 match** (was 1/6).
## Pattern
Per architecture_comparison's "use the dual-implementation as a
measurement instrument" thesis: a parity probe surfaced this gap;
50 LOC of MarshalJSON closed it; 4 pinning tests prevent regression;
the probe is the longitudinal gate. Cutover-friendly direction (Go
matches Rust) chosen because Rust is the existing production
contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new cross-runtime parity probes joining the validator probe from
the gauntlet wave. Pattern: feed identical input through Rust and Go;
diff outputs. Each probe surfaced a different signal.
## Materializer parity probe
scripts/cutover/parity/materializer_parity.sh runs the Bun + Go
materializers against an identical synthetic data/_kb/ root and diffs
the resulting evidence/ JSONL for byte-equivalence (modulo
provenance.recorded_at).
**First run: 0/2 match.** Real finding: Go's Provenance.LineOffset
had `json:"line_offset,omitempty"` which strips the field when value
is 0. Line offset 0 is the FIRST ROW of every source file — a real
semantic value, not absent. Bun side always emits it.
Fix: drop `omitempty` on Provenance.LineOffset. Updated comment
explaining why.
**Re-run: 2/2 match.** On-wire JSON parity holds.
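The mechanism behind the finding, shown on a reduced struct (illustrative,
not the shipped type): encoding/json's omitempty drops an int field at its
zero value, so line_offset 0 vanished while the Bun side always emits it.

```go
package materializer

// With the old tag, a zero LineOffset was omitted entirely from the marshaled
// JSON; without omitempty it marshals as "line_offset":0, matching the Bun
// output for the first row of every source file.
type Provenance struct {
	LineOffset int `json:"line_offset"` // was `json:"line_offset,omitempty"`
}
```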
## extract_json parity probe
scripts/cutover/parity/extract_json_parity.sh feeds 12 fixture
strings through both runtimes' extract_json:
- fenced ```json``` blocks
- unfenced ``` blocks
- bare braces with prose around
- first-balanced-of-many
- nested objects
- unicode in string values
- escaped quotes
- empty object
- top-level array (both return first inner object)
- no JSON
- depth-balanced but invalid syntax
- trailing garbage
Substrate gate: cargo test -p gateway extract_json PASS before probe.
**Result: 12/12 match.** Algorithms genuinely equivalent.
## scripts/cutover/parity/extract_json_helper/main.go
Tiny Go binary that reads stdin, calls validator.ExtractJSON, prints
{matched, value} JSON. Counterpart to the Rust parity_extract_json
binary in golangLAKEHOUSE's sibling repo, lakehouse (separate commit).
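A sketch of the helper's shape (import path and the ExtractJSON return
signature are assumptions made for illustration):

```go
package main

import (
	"encoding/json"
	"io"
	"os"

	"github.com/example/golangLAKEHOUSE/internal/validator" // import path assumed
)

func main() {
	in, _ := io.ReadAll(os.Stdin)
	// Return shape assumed for illustration: extracted value plus whether a
	// JSON object was found at all.
	value, matched := validator.ExtractJSON(string(in))
	_ = json.NewEncoder(os.Stdout).Encode(map[string]any{
		"matched": matched,
		"value":   value,
	})
}
```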
## Pattern crystallized
Every cross-runtime port should land with a parity probe. Three
probes now exist:
- validator (5/6 wire-format gap captured 2026-05-02)
- materializer (caught + fixed real bug 2026-05-02)
- extract_json (12/12 match 2026-05-02)
The instrument is reusable — each new shared HTTP/CLI surface gets
a probe row added.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Production-readiness gauntlet exploiting the dual Rust/Go
implementation as a measurement instrument.
## Phase 1 — Full smoke chain
21/21 PASS in ~60s. Substrate intact across the full service surface.
## Phase 2 — Per-component scrum (token-volume fix)
Prior wave (165KB diff): Kimi 62 tokens out, Qwen 297 → no useful
analysis. This wave splits today's commits into 4 focused bundles
(36-71KB each):
c1 validatord (46KB) → 0 convergent / 11 distinct
c2 vectord substrate (36KB) → 0 convergent / 10 distinct
c3 materializer (71KB) → 0 convergent / 6 distinct (Opus emitted
a BLOCK then self-retracted in same response)
c4 replay (45KB) → 0 convergent / 10 distinct
Reviewer engagement vs prior wave: Kimi went 62 → ~250 tokens out
once bundles dropped below 60KB.
scripts/scrum_review.sh hardening:
* Diff-size guard (warn >60KB, hard-fail >100KB,
SCRUM_FORCE_OVERSIZE=1 override)
* Tightened prompt — file path must appear EXACTLY as in diff
so post-processor can grep WHERE: lines reliably
* Auto-tally step dedupes by (reviewer, location); convergence
counts distinct lineages (closes the prior `opus+opus+opus`
false-convergence bug)
## Phase 3 — Cross-runtime validator parity probe (the headline finding)
scripts/cutover/parity/validator_parity.sh sends 6 identical
/v1/validate cases to Rust :3100 AND Go :4110, compares status+body.
Result: **6/6 status codes match · 5/6 body shapes diverge.**
Rust returns serde-tagged enum: {"Schema":{"field":"x","reason":"y"}}
Go returns flat exported-fields: {"Kind":"schema","Field":"x","Reason":"y"}
Both round-trip inside their own runtime; a caller swapping one for
the other would break parsing silently. Captured as new _open_ row
in docs/ARCHITECTURE_COMPARISON.md decisions tracker.
This is the "use the dual-implementation as a measurement instrument"
return — single-repo scrums can't catch this class of cross-runtime
drift.
## Phase 4 — Production assessment
ship-with-known-gap. Validator wire-format gap is documented, not
regressed. ~50 LOC future fix on Go side (custom MarshalJSON on
ValidationError to match Rust's serde shape).
Persistent stack config (/tmp/lakehouse-persistent.toml) gains
validatord on :3221 + persistent-validatord binary so operators
bringing up the persistent stack get the new daemon automatically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three threads landing together — the doc edits interleave so they ship
in a single commit.
1. **vectord substrate fix verified at original scale** (closes the
2026-05-01 thread). Re-ran multitier 5min @ conc=50: 132,211
scenarios at 438/sec, 6/6 classes at 0% failure (was 4/6 pre-fix).
Throughput dropped 1,115 → 438/sec because previously-broken
scenarios now do real HNSW Add work — honest cost of correctness.
The fix (i.vectors side-store + safeGraphAdd recover wrappers +
smallIndexRebuildThreshold=32 + saveTask coalescing) holds at the
footprint that originally surfaced the bug.
2. **Materializer port** — internal/materializer + cmd/materializer +
scripts/materializer_smoke.sh. Ports scripts/distillation/transforms.ts
(12 transforms) + build_evidence_index.ts (idempotency, day-partition,
receipt). On-wire JSON shape matches TS so Bun and Go runs are
interchangeable. 14 tests green.
3. **Replay port** — internal/replay + cmd/replay +
scripts/replay_smoke.sh. Ports scripts/distillation/replay.ts
(retrieve → bundle → /v1/chat → validate → log). Closes audit-FULL
phase 7 live invocation on the Go side. Both runtimes append to the
same data/_kb/replay_runs.jsonl (schema=replay_run.v1). 14 tests green.
Side effect on internal/distillation/types.go: EvidenceRecord gained
prompt_tokens, completion_tokens, and metadata fields to mirror the TS
shape the materializer transforms produce.
STATE_OF_PLAY refreshed to 2026-05-02; ARCHITECTURE_COMPARISON decisions
tracker moves the materializer + replay items from _open_ to DONE and
adds the substrate-fix scale verification row.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
J asked for a much more sophisticated test using the 100k corpus from
the Rust legacy database. This commit ships:
scripts/cutover/multitier/main.go — 6-scenario harness with weighted
random selection per goroutine. Mixes search, email/SMS/fill
validators (in-process via internal/validator), profile swap with
ExcludeIDs, repeat-cache exercise, and playbook record/replay.
Scenarios + weights (cumulative scenario fractions):
35% cold_search_email — search + email outreach + EmailValidator
15% surge_fill_validate — search + fill proposal + FillValidator + record
15% profile_swap — original search + ExcludeIDs swap + no-overlap check
15% repeat_cache — same query × 5 (cache effectiveness)
10% sms_validate — SMS draft (≤160 chars, phone for SSN-FP guard)
10% playbook_record_replay — cold → record → warm w/ use_playbook=true
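A sketch of the per-goroutine weighted pick over the scenario mix above
(assuming a cumulative-fraction walk over a goroutine-local rand source;
names illustrative):

```go
package stress

import "math/rand"

var scenarios = []struct {
	name   string
	weight float64
}{
	{"cold_search_email", 0.35},
	{"surge_fill_validate", 0.15},
	{"profile_swap", 0.15},
	{"repeat_cache", 0.15},
	{"sms_validate", 0.10},
	{"playbook_record_replay", 0.10},
}

// pickScenario walks the cumulative weights; r is the goroutine-local source.
func pickScenario(r *rand.Rand) string {
	x := r.Float64()
	cum := 0.0
	for _, s := range scenarios {
		cum += s.weight
		if x < cum {
			return s.name
		}
	}
	return scenarios[len(scenarios)-1].name // guard against float round-off
}
```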
Test results (5-min sustained, conc=50, 100k workers indexed):
TOTAL 335,257 scenarios @ 1,115/sec
cold_search_email 117k @ 0.0% fail · p50 2.2ms · p99 8.6ms
surge_fill_validate 50k @ 98.8% fail (substrate bug below)
profile_swap 50k @ 0.0% fail · p50 4.5ms · ExcludeIDs verified
repeat_cache 50k × 5 = 252k searches @ 0.0% fail · p50 11.7ms
sms_validate 33k @ 0.0% fail · phone-pattern guard works
playbook_record_replay 33k @ 96.8% fail (substrate bug below)
Total successful workflows: ~250k+
Validator integration verified at load:
150,930 EmailValidator passes across cold_search_email + sms_validate
35 + 1,061 successful FillValidator + playbook_record (where the bug
didn't fire)
zero false positives on the SSN-pattern guard against phone numbers
Resource footprint at 100k:
vectord 1.23GB RSS (linear with 100k vectors)
matrixd 26MB, 75% CPU (1-core saturated at conc=50)
Total across 11 daemons: 1.7GB
Compare to Rust at 14.9GB — ~10× less even at 100k.
SUBSTRATE BUG SURFACED: coder/hnsw v0.6.1 nil-deref in
layerNode.search at graph.go:95. Triggers on /v1/matrix/playbooks/record
under sustained writes to the small playbook_memory index. Both Add
and Search paths can panic.
Workaround applied (this commit) in internal/vectord/index.go
BatchAdd: recover() guard converts panic to error; daemon stays up
instead of crashing the request handler.
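A sketch of that guard pattern (name and shape illustrative; the shipped
guard sits around the hnsw Add call in internal/vectord/index.go BatchAdd):

```go
package vectord

import "fmt"

// guardedAdd converts a panic from the underlying hnsw Add call into an
// error so the request handler fails the one request instead of taking
// the daemon down.
func guardedAdd(add func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("hnsw add panicked: %v", r)
		}
	}()
	add()
	return nil
}
```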
Operator recovery procedure (also documented in the report):
curl -X DELETE http://localhost:4215/vectors/index/playbook_memory
Next record recreates the index fresh.
Real fix DEFERRED — open in docs/ARCHITECTURE_COMPARISON.md
Decisions tracker. Three options:
a) upstream patch to coder/hnsw
b) custom small-index Add path that always rebuilds when len < threshold
c) alternate store for playbook_memory (Lance? in-memory map?)
Evidence: reports/cutover/multitier_100k.md (full methodology +
results + repro + bug analysis). docs/ARCHITECTURE_COMPARISON.md
Decisions tracker updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sustained-traffic load test against the cutover slice. Three runs,
zero correctness errors across 101,770 total requests. Substrate
holds up under concurrent load — matrix gate, vectord HNSW,
embedd cache, gateway proxy all hold. This was the load test's
primary question; latency numbers are secondary.
scripts/cutover/loadgen — focused Go load generator. 6-query
rotating body mix (Forklift/CNC/Warehouse/Picker/Loader/Shipping).
Configurable URL/concurrency/duration. Reports per-status-code
counts + p50/p95/p99 latencies + JSON summary on stderr.
Three runs:
baseline (Bun → Go, conc=1, 10s):
4,085 req · 408 RPS · p50 1.3ms · p99 32ms · max 215ms
sustained (Bun → Go, conc=10, 30s):
14,527 req · 484 RPS · p50 4.6ms · p99 92ms · max 372ms
direct (→ Go, conc=10, 30s):
83,158 req · 2,772 RPS · p50 2.5ms · p99 8.5ms · max 16ms
Critical findings:
1. ZERO correctness errors across 101k requests. No 5xx, no
transport errors, no panics. Concurrency-safety verified across
matrix gate / vectord / gateway / embedd cache.
2. Direct-to-Go is production-grade. 2,772 RPS at p99 8.5ms on a
single host, no scaling cliff at concurrency=10.
3. Bun frontend is the bottleneck. -82% RPS, +982% p99 vs direct.
Single-process JS event loop queueing under concurrent
requests — known Bun proxy-mode characteristic. The substrate
itself isn't the limiter.
4. For staffing-domain demand levels (<1 RPS typical per
coordinator), Bun-fronted 484 RPS has 480× headroom. No
urgency to optimize Bun out of the data path. If/when
concurrent demand grows orders of magnitude, the path is
nginx → Go direct for hot endpoints, skip Bun.
Substrate is now load-tested and verified production-ready.
What this load test does NOT cover (documented in
g5_load_test.md): cold-cache embed, larger corpus, mixed
read/write, multi-host, full 5-loop traffic with judge gate
calls. Each is its own probe shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier push exposed the gap in the previous 2-layer isolation:
smokes still failed because they tried to bind :3211-:3220, ports the
persistent stack already held. Smoke catalogd's bind failure went
undetected because poll_health on :3212 got a healthy response from the
persistent catalogd, so the smoke proceeded against the wrong backend
with the wrong bucket expectations.
Fix: persistent stack now uses :4110 + :4211-:4219 via additional
sed in the temp toml (bind addresses + upstream URLs). Smoke
harnesses keep :3110 + :3211-:3219. Both reach the SAME chatd at
:3220 because chatd is read-mostly (no state to clobber) and
operators don't want to maintain two LLM provider key sets.
Three isolation layers now in effect:
1. Binary names (bin/persistent-* via symlinks)
2. MinIO buckets (lakehouse-go-persistent vs lakehouse-go-primary)
3. Port range (:4xxx vs :3xxx, with shared chatd on :3220)
Verified pre-push:
- 11 persistent ports listening on :4xxx + :3220
- 0 smoke ports listening on :3110-:3219 (free for smokes)
Pushed while persistent stack live — first cross-isolation test
(no port collision, no bucket collision, no name collision).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2026-05-01 persistent-stack milestone exposed two collision
modes between the long-running Go stack and the pre-push smoke
harness:
1. PKILL COLLISION: smoke teardown uses anchored
`pkill -f "bin/(storaged|...|gateway)$"`. Same-named persistent
processes match → smokes kill 7 of 11 persistent daemons.
2. MINIO STATE COLLISION: persistent stack writes
`_vectors/workers.lhv1` to the shared lakehouse-go-primary
bucket. Smoke vectord rehydrates from same bucket → sees both
smoke-owned and persistent-owned indexes → assertion failures.
Both fixed in this commit by adding two isolation layers:
LAYER 1 — distinct binary names via symlink:
bin/persistent-<daemon> → bin/<daemon>
Persistent stack runs as ./bin/persistent-gateway etc.
Smoke pattern `bin/(name)$` matches `bin/gateway$` but NOT
`bin/persistent-gateway$` (regex group requires bin/ followed
immediately by a daemon name; "bin/p..." doesn't qualify).
Cmdline lookup verified: 7 persistent procs, 0 match smoke pkill.
LAYER 2 — separate MinIO bucket via temp config:
Persistent stack writes to lakehouse-go-persistent (configurable
via $LH_PERSISTENT_BUCKET). Temp toml at /tmp/lakehouse-persistent.toml
inherits everything from lakehouse.toml except [s3].bucket which
is sed-replaced. Bucket auto-created via mc if missing.
Verified: workers.lhv1 lands in persistent bucket; primary
bucket _vectors/ stays empty.
Net effect: the persistent stack should survive `git push` (which
runs smokes that rehydrate vectord from primary bucket and pkill
their own bin/<name>$ daemons). This commit is the first push test
WITH the persistent stack live.
Caveat: bin/persistent-* symlinks are gitignored already (/bin/ is
in .gitignore wholesale), so the symlinks need to be created on
each fresh checkout — which start_go_stack.sh does idempotently.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Caught immediately after the prior commit pushed: pre-push smokes
killed 7 of 11 persistent Go daemons because the smokes' anchored
`pkill -f "bin/(name)$"` teardown matches ANY process named
`bin/<daemon>`, not just the smokes' own children.
Documented in the script header as a KNOWN CONSTRAINT with a
workaround (re-run start_go_stack.sh after every push) and a
proper-fix sketch (give the persistent stack a different binary
name via build tag or symlink). Proper fix deferred until trigger
fires — operators living through this once will know to want it.
Persistent stack restored (all 11 healthy as of this commit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
J's "let's go" instruction: leave OPEN list behind, push the Go
substrate forward into actual deployment shape. This commit marks
the first time the Go side has run as long-running daemons rather
than per-harness transient processes, and the first time the
shared cross-runtime longitudinal log has carried a Go-emitted
entry alongside the Rust ones.
What landed:
scripts/cutover/start_go_stack.sh — the persistent-stack runbook.
Brings up all 11 daemons (storaged → catalogd → ingestd → queryd
→ embedd → vectord → pathwayd → observerd → matrixd → gateway,
plus chatd-if-not-already-up) in dependency order via nohup +
disown. Anchored pkill per feedback_pkill_scope (never bare
"bin/"). Logs land in /tmp/gostack-logs/<bin>.log, one per daemon.
Verified live state:
- All 11 services healthy on :3110 + :3211-:3220
- gateway → embedd proxy returns nomic-embed-text-v2-moe vectors
- chatd reports 5/5 providers loaded
- No port collision with Rust gateway on :3100
- Daemons stay up after exit of the start script (production shape,
not harness-transient)
audit_baselines.jsonl crosses the runtime boundary:
- 7 Rust-emitted entries (last: ca7375ea 2026-04-27)
- 1 Go-emitted entry (ee2a40c 2026-05-01T07:53:54Z) appended via
./bin/audit_full -append-baseline
- Same envelope shape, same metric set, same drift comparator
semantics — operators running either runtime grow the same log
What this DOES prove:
- Substrate parity at deployment shape (not just unit tests)
- Cross-runtime artifact write-side compatibility (was previously
proven on read side via audit_baselines roundtrip)
- The deploy machinery works end-to-end for the persistent case
What this does NOT prove (still ahead):
- Real coordinator traffic against the Go stack (no nginx flip yet;
devop.live/lakehouse/ still serves through Rust)
- Go-side production materializer (Phase 2 is observer-only)
- Replay tool parity (Phase 7 is observer-only)
- The 5-loop product gate against actual humans
reports/cutover/SUMMARY.md now logs three new rows:
- audit-FULL with 12/12 phases ported
- First Go-emitted audit_baselines entry
- Persistent Go stack live
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same shape of proof as embed_parity.sh for the embed endpoint:
take the just-shipped Go port (ca142b9) and validate it against
the actual production data the Rust legacy emits, not just unit-
test fixtures. Locks the cross-runtime parity that operators
running mixed pipelines depend on.
scripts/cutover/audit_baselines_validate.go:
- Reads /home/profit/lakehouse/data/_kb/audit_baselines.jsonl
- Parses every entry via the Go AuditBaseline struct
- Round-trips the last entry: encode → decode → field-by-field
equality check (catches any silently-dropped JSON keys)
- Calls LoadLastBaseline against the live file (proves the public
API works on real shapes, not just inline parsing)
- Computes BuildAuditDriftTable(first → last) — full-window
lineage drift over the captured baselines
Live-data probe results (reports/cutover/audit_baselines_roundtrip.md):
- 7 entries parse without error
- Round-trip is byte-equal on every metric + every header field
- Drift table fires the expected verdicts:
- p2_evidence_rows 12→82 (+583%) → warn (above 20% threshold)
- p3_accepted/partial/rejected/human 0→non-zero → warn (the
zero-baseline edge case TestBuildAuditDriftTable_ZeroBaseline
was designed to lock — verified now firing on real history)
- p4_* metrics +0% → ok (stable across the window)
What this does NOT prove (documented in the report): the Go-side
audit-FULL pipeline that PRODUCES baselines doesn't exist yet.
Only the load/append/drift substrate is ported. Operators running
audit-full from Go would still need a metric-collection pass —
that's a separate port deliberately not in this wave.
reports/cutover/SUMMARY.md gains a new row alongside the embed
parity entries; cutover-prep verification log keeps the
discipline of "verified against real data, not just fixtures."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Substantial wave addressing all 4 prior OPEN items. Three closed in
full, one partially (the speculative half deliberately deferred).
OPEN #1 — Periodic fresh→main index merge (FULL):
- POST /v1/vectors/index/{src}/merge with {dest, clear_source}
- Idempotent on re-runs (existing-in-dest items skipped)
- internal/vectord/index.go: new Index.IDs() snapshot method +
i.ids tracker field as canonical ID set, independent of meta
map's nil-vs-{} sparseness (was a real bug — IDs() backed by meta
alone missed items added with nil metadata)
- 4 cmd-level integration tests (happy path drain+clear, dim
mismatch, dest not found, self-merge rejection) + 1 unit test
- DecodeIndex backward-compat: old envelopes restore i.ids from
meta keys (best effort; new items going forward use the tracker)
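A sketch of the canonical-ID-set idea (names illustrative): membership lives
in its own set, so items added with nil metadata still appear in IDs()
instead of the set being derived from the meta map:

```go
package vectord

import "sync"

type Index struct {
	mu   sync.RWMutex
	ids  map[string]struct{}          // canonical membership, independent of meta
	meta map[string]map[string]string // may hold nil for items added without metadata
}

// IDs returns a snapshot of every item ID, including those whose metadata
// entry is nil (the case the meta-backed version missed).
func (i *Index) IDs() []string {
	i.mu.RLock()
	defer i.mu.RUnlock()
	out := make([]string, 0, len(i.ids))
	for id := range i.ids {
		out = append(out, id)
	}
	return out
}
```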
OPEN #2 — Distillation SFT export (SUBSTRATE):
- internal/distillation/sft_export.go ports the load-bearing half:
IsSftNever predicate + ListScoredRunFiles (data/scored-runs/YYYY/
MM/DD walk) + LoadScoredRunsFromFile + partial ExportSft.
- Synthesis (instruction/input/response generation) deferred to a
separate wave — too big for this session, but the substrate
makes the next wave a port-not-design exercise.
- TestSftNever_PinsExpectedSet locks the contamination firewall
set: if a future commit adds/removes from SftNever, this test
fails — forcing the change through review.
- 5 new tests; firewall fires end-to-end through the partial port.
OPEN #3 — Distribution drift via PSI (FULL):
- internal/drift/drift.go: ComputeDistributionDrift via Population
Stability Index. Standard finance/risk metric, well-defined
verdict tiers (stable < 0.10, minor 0.10–0.25, major ≥ 0.25).
- Equal-width bucketing over combined min/max so neither dist
falls outside; epsilon-clamping for empty buckets so log doesn't
blow up. Per-bucket breakdown for drilldown.
- Pairs with the existing ComputeScorerDrift: scorer drift is
categorical, distribution drift is continuous. Different shapes,
same package.
- 7 new tests covering identical-is-stable, hard-shift-is-major,
moderate-detected-not-stable, empty-inputs-safe, all-identical-
safe, bucket-counts-conserved, num-buckets-clamping.
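A minimal sketch of the PSI step over two already-bucketed distributions,
assuming per-bucket fractions that each sum to 1 (the shipped
ComputeDistributionDrift also performs the equal-width bucketing described
above; epsilon value illustrative):

```go
package drift

import "math"

// psi computes the Population Stability Index over per-bucket fractions.
// Empty buckets are clamped to eps so the log term stays finite.
// Verdict tiers used above: stable < 0.10, minor 0.10-0.25, major >= 0.25.
func psi(expected, actual []float64) float64 {
	const eps = 1e-6
	var sum float64
	for i := range expected {
		e := math.Max(expected[i], eps)
		a := math.Max(actual[i], eps)
		sum += (a - e) * math.Log(a/e)
	}
	return sum
}
```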
OPEN #4 — Ops nice-to-haves (PARTIAL — wall-clock done, others
deferred):
- (a) Real-time wall-clock for stress harness: per-phase elapsed
time logged to stdout as it runs (`[stress] phase NAME starting
(T+12.3s)` + `[stress] phase NAME done — 8.5s (T+20.8s)`).
Output.PhaseTimings + Output.TotalElapsedMs in JSON.
- (b) chatd fixture-mode S3 mock + (c) liberal-paraphrase
calibration: not actioned — no fired trigger, would be
speculative. Documented as deferred-until-need rather than
ignored. Per the project's discipline ("don't add features
beyond what the task requires").
OPEN list now empty / steady-state. Future items will land as
production triggers fire.
Build + vet + tests green; 18 new tests across the 4 closures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real wire-up gap discovered post-scrum: Demand.Role was already
extracted at every call site in multi_coord_stress (44 occurrences,
both contract-driven and LLM-parsed inbox-triggered paths), but
neither matrixSearch nor playbookRecord accepted role in their
signatures. Cross-role gate (real_001..real_004 work) was bypassed
for the entire multi-coord harness — recordings and queries went
through with empty role, gate fell back to lenient behavior.
Fix:
- matrixSearchReq gains query_role field
- matrixSearch signature: (..., query, role string, ...)
- tracedSearch wrapper gains role param + emits it in span input
metadata for Langfuse visibility
- playbookRecord signature: (..., query, role, ...) — body emits
role only when non-empty (preserves backward compat at API)
- 14 call sites updated:
contract-driven Demand loops → d.Role
LLM-parsed inbox path → parsed.Role (qwen2.5 already extracts it)
swap path (warehouseDemand) → warehouseDemand.Role
reissue path → ev.Role (captured at original event time)
fresh-verify (resume snippet, no role concept) → ""
Build clean, vet clean, all tests pass. Cross-role gate now fires
end-to-end across the multi-coord harness — matches the playbook_lift
harness's coverage from the original real_001 fix.
This closes the symmetric gap to scripts/playbook_lift's existing
wire-through. Both production-shape harnesses now exercise the role
gate; future reality tests automatically inherit the protection.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
real_003 left a known-weak hole: shorthand-style queries
("{count} {role} {city} {state} ...") have no separator between
role and city, so a regex can't reliably extract — leaving the
cross-role gate disabled when both record AND query are shorthand.
This commit adds a roleExtractor with regex-first + LLM fallback:
- Regex first (fast, deterministic) — handles need + client_first +
looking from real_003b. ~75% of styles, no LLM cost paid.
- LLM fallback when regex returns empty AND model is configured —
Ollama-shape /api/chat with format=json, schema-tight prompt,
temperature 0. ~1-3s on local qwen2.5.
- Per-process cache — paraphrase + rejudge passes reuse the same
query 4× per run; cache prevents 4× LLM cost.
- Off-by-default — opt-in via -llm-role-extract flag (CLI) and
LLM_ROLE_EXTRACT=1 env var (harness wrapper). real_003b shipping
config unchanged unless explicitly enabled.
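A sketch of the extractor's control flow under those flags (the regex pass
and LLM call are injected as function values here for brevity; the shipped
code lives in scripts/playbook_lift):

```go
package playbooklift

// extractRoleRegex stands in for the shipped extractRoleFromNeed regex pass.
var extractRoleRegex = func(q string) string { return "" }

type roleExtractor struct {
	llm   func(q string) (string, error) // Ollama-shape /api/chat fallback; nil = off
	cache map[string]string              // per-process cache across paraphrase/rejudge passes
}

func (x *roleExtractor) extract(q string) string {
	if role := extractRoleRegex(q); role != "" {
		return role // fast deterministic path, no LLM cost (~75% of styles)
	}
	if x == nil || x.llm == nil {
		return "" // nil receiver or opt-out: regex only, cross-role gate stays disabled
	}
	if cached, ok := x.cache[q]; ok {
		return cached // paraphrase + rejudge passes reuse the same query
	}
	role, err := x.llm(q)
	if err != nil {
		return "" // best-effort: a failed extraction just leaves the role empty
	}
	if x.cache == nil {
		x.cache = map[string]string{}
	}
	x.cache[q] = role
	return role
}
```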
8 new tests in scripts/playbook_lift/main_test.go:
- TestRoleExtractor_RegexFirst: LLM not called when regex matches
- TestRoleExtractor_LLMFallback: shorthand goes to LLM
- TestRoleExtractor_LLMOffLeavesEmpty: opt-in default preserved
- TestRoleExtractor_Cache: 3 calls = 1 LLM hit
- TestRoleExtractor_NilSafe: nil receiver runs regex only
- TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths
- TestRoleExtractor_ClosesCrossRoleShorthandBleed: synthetic
witness for the real_003 scenario — both record + query are
shorthand, regex returns "" for both, LLM produces DIFFERENT
role tokens for CNC vs Forklift, so matrix gate's cross-role
rejection (locked separately in
TestInjectPlaybookMisses_RoleGateRejectsCrossRole) fires
correctly. This is the load-bearing verification.
Reality test real_004 ran the same 40-query stress as real_003 with
LLM extraction on. Cross-style same-role boosts fired correctly
across all 4 styles for Loaders + Packers + Shipping Clerk clusters
(including shorthand → other-style transfer). No cross-role bleed
observed. The reality test alone can't be a clean "with vs without"
comparison (HNSW build is non-deterministic across runs, and
real_004 stochastics didn't trigger a shorthand recording at all),
which is why the unit-test witness exists.
Production note (in real_004_findings.md): LLM extraction is for
reality-test coverage of arbitrary query shapes. Production should
extract role at INGEST time (when the inbox parser already runs an
LLM) and pass already-resolved role through requests — same shape
as multi_coord_stress's existing Demand{Role: ...} model. The hot
path should never need the harness extractor's per-query LLM cost.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stress-tests the role gate with 40 queries (10 fill_events rows × 4
styles): need, client_first, looking, shorthand. Each row's role +
client + city stays the same; only the surface phrasing changes.
real_003 (original extractor) confirmed the shorthand-vs-shorthand
failure mode: CNC Operator shorthand recording leaked w-2404 onto
Forklift Operator shorthand query within the same Beacon Freight
Detroit cluster. Both record + query had empty role (extractor
returns "" for shorthand because there's no separator between role
and city), gate disabled, distance check passed, bleed fired.
Fix: extended extractRoleFromNeed to handle client_first
("{client} needs N {role} in...") and looking ("Looking for N
{role} at...") patterns. Shorthand left intentionally unmatched —
"Forklift Operator Detroit" is shape-indistinguishable from
"Forklift" + "Operator Detroit" without an LLM extractor or known-
cities lookup.
real_003b (extended extractor) verifies bleed closed across all 4
styles for this dataset. Forklift Operator queries keep w-2136 (the
cold-pass-correct match) regardless of which style the query came
in. Same-role boosts now fire correctly across styles — a CNC
Operator recording made in `looking` style boosts the CNC need-form
query.
scripts/cutover/gen_real_queries.go: added -styles flag with values
need|client_first|looking|shorthand|all (default need preserves
real_001/002 behavior). Tests/reality/real_coord_queries_v2.txt is
the 40-query stress file.
scripts/playbook_lift/main_test.go: 10 sub-tests lock the four
documented patterns + shorthand limitation + lift-suite-style
queries (no clean role, returns empty as expected).
Aggregate metrics:
- real_003 (original): disc=7, lift=7, boost=14, meanΔ=-0.108
- real_003b (extended): disc=11, lift=10, boost=31, meanΔ=-0.202
The growth reflects more LEGITIMATE same-role same-cluster transfer
firing across styles, not bleed (verified by per-cluster bleed
table — Forklift Operator queries unchanged across all 4 styles).
Known limitation documented in real_003_findings.md: same-cluster,
same-role queries in shorthand still embed close enough that a
shorthand recording could bleed onto a different-role shorthand
query if both record + query strip role. Closing this requires
LLM extraction or known-cities lookup at record + query time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
real_001 surfaced same-client+city queries bleeding across roles:
Q#2 (Forklift Operator @ Beacon Freight Detroit) recorded e-6193
in the playbook corpus. Q#5 (Pickers same client+city) and Q#10
(CNC Operator same client+city) embedded within 0.13-0.18 cosine of
Q#2's query — well inside the 0.20 inject threshold — so e-6193
injected on both, demoting the cold-pass-correct workers.
Root cause: the inject distance threshold isn't tight enough on
the same-client+city cluster. Cosine collapses queries that share
city + client + count-token + time-token regardless of role. The
existing judge gate is per-injection at record time and doesn't
fire at retrieve time.
Fix: structural role gate in front of both Shape A boost and
Shape B inject. PlaybookEntry gains Role; SearchRequest gains
QueryRole. When both are non-empty and differ under roleEqual's
case+plural normalization, the entry is rejected before BoostFactor
or judge-gate logic runs.
Backward-compat: empty role on either side disables the gate —
preserves behavior for the lift suite's free-form multi-constraint
queries that have no clean single role. Caller-supplied (not
inferred), so existing recordings unaffected.
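A sketch of the gate predicate and the normalization it leans on (plural
rules shown here are illustrative; the exact rules live in
internal/matrix/playbook.go's roleEqual):

```go
package matrix

import "strings"

// normalizeRole folds case and trims a trailing plural; illustrative of the
// shipped roleEqual normalization, not a byte-exact copy.
func normalizeRole(r string) string {
	r = strings.ToLower(strings.TrimSpace(r))
	r = strings.TrimSuffix(r, "es")
	r = strings.TrimSuffix(r, "s")
	return r
}

// roleGateRejects returns true only when BOTH sides carry a role and the
// normalized roles differ; an empty role on either side disables the gate
// (backward compat for free-form queries with no clean single role).
func roleGateRejects(entryRole, queryRole string) bool {
	if entryRole == "" || queryRole == "" {
		return false
	}
	return normalizeRole(entryRole) != normalizeRole(queryRole)
}
```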
Wire-through:
- internal/matrix/playbook.go: Role field, NewPlaybookEntryWithRole,
roleEqual helper with plural+case normalization
- internal/matrix/retrieve.go: QueryRole on SearchRequest, threaded
to both ApplyPlaybookBoost + InjectPlaybookMisses
- cmd/matrixd/main.go: role on POST /matrix/playbooks/record + bulk
- scripts/playbook_lift/main.go: extractRoleFromNeed regex pulls
role from "Need N {role}{s} in" queries (the fill_events shape);
free-form queries fall back to empty (gate disabled)
Tests (5 new):
- TestInjectPlaybookMisses_RoleGateRejectsCrossRole: exact Q#10
scenario (distance 0.135, recorded "Forklift Operator", query
"CNC Operator") — locks the bleed at unit level
- TestInjectPlaybookMisses_RoleGateAllowsSameRole: Forklift Operator
recording fires on Forklift Operators query (plural normalization)
- TestInjectPlaybookMisses_RoleGateBackwardCompat: empty Role on
either side = gate disabled, preserves current behavior
- TestApplyPlaybookBoost_RoleGateRejectsCrossRole: Shape A defense
in depth — boost doesn't fire on cross-role even when answer is
in cold top-K
- TestRoleEqual_PluralAndCase: case + -s + -es plural normalization
Verification (real_002, same query set as real_001):
- Q#5 Pickers @ Beacon Freight: e-6193 → e-8499 (no bleed)
- Q#10 CNC Operator @ Beacon Freight: e-6193 → w-2404 (no bleed)
- Discoveries + lifts unchanged at 2 each (same-role lift still fires)
- Mean Δdist tightens from -0.127 to -0.040 (boosts no longer
pulling distances through the floor on cross-role mismatches)
Findings: reports/reality-tests/real_002_findings.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First retrieval probe with non-synthetic query distribution. Pulls
N rows from /home/profit/lakehouse/data/datasets/fill_events.parquet
(real-shape demand data) and translates each to the natural language
a coordinator would type: "Need {count} {role}s in {city} {state}
starting at {at} for {client}".
Headline: 8/10 cold-pass top-1 = judge-best on real distribution.
Substrate works on queries it was never trained for. v2-moe + workers
corpus carry the load.
Surfaced finding (the real value of running this): same-client+city
queries cluster, and Shape A's distance boost bleeds across roles
within the cluster. Q#2 (Forklift @ Beacon Freight Detroit) records
e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and
Q#10 (CNC Operator same client+city) inherit e-6193 at warm top-1
even though:
- Neither query has its own recorded playbook.
- Neither warm pass triggers a Shape B inject (boosted=0).
- The roles are different staffing categories.
Q#10 specifically demoted the cold-pass-correct w-3759 (judge rating
4 at rank 0) for a worker who was approved by the judge for a
different role on a different query.
Why the lift suite missed it: synthetic queries use 7 disjoint
scenario buckets (forklift+OSHA+WI / CDL+IL / etc.). Real demand
clusters on (client, city). The cluster doesn't exist in the
synthetic distribution.
Why the judge gate doesn't catch it: the gate (5a3364f) is
per-injection at record time. After approval the worker rides Shape A
distance boosts on all later same-cluster queries with no second
gate call.
Becomes new OPEN #1. Fix candidate: role-scoped playbook corpus
metadata + Shape A boost gate on role match. Cheap; doesn't need
new judge calls.
Files:
- scripts/cutover/gen_real_queries.go: parquet → coordinator NL
- tests/reality/real_coord_queries.txt: 10 generated queries
- reports/reality-tests/playbook_lift_real_001.md: harness output
- reports/reality-tests/real_001_findings.md: the reading
Repro:
go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First concrete cutover artifact: scripts/cutover/embed_parity.sh
brings up Go embedd + gateway alongside the live Rust gateway,
hits both /ai/embed and /v1/embed with the same forced model, and
emits a per-date verdict report under reports/cutover/.
Why embed first: the parity invariant is one math identity (cosine
sim of vectors against same input). Retrieve has thousands of edge
cases. If embed parity holds, all downstream vector consumers
inherit confidence; if it doesn't, we catch it in 30s instead of
after a flip.
Verdict 2026-04-30: 5/5 samples cosine=1.000000 with model forced
to nomic-embed-text (v1). Same with nomic-embed-text-v2-moe (both
Ollamas have it loaded). Math is provably equivalent across the
gateway plumbing.
Drift catalog (reports/cutover/SUMMARY.md):
- URL: Rust /ai/embed vs Go /v1/embed
- Wire: Rust {embeddings, dimensions} (plural) vs Go {vectors,
dimension} (singular). Wire-format adapter is the only real
cutover work for this endpoint.
- L2 norm: Rust unit vectors (~1.0); Go raw Ollama (~20-23). Same
direction (cos=1.0); harmless under cosine-distance HNSW (which
is Go vectord's default), but worth fixing in internal/embed/
before extending to euclidean indexes.
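A sketch of the unit-normalization that would close that gap on the Go side
(placement in internal/embed/ is the suggestion above, not shipped code):

```go
package embed

import "math"

// l2Normalize scales v to unit length in place, matching the Rust side's
// output norm (~1.0) instead of Ollama's raw magnitudes (~20-23).
func l2Normalize(v []float64) {
	var sum float64
	for _, x := range v {
		sum += x * x
	}
	n := math.Sqrt(sum)
	if n == 0 {
		return
	}
	for i := range v {
		v[i] /= n
	}
}
```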
reports/cutover/ now tracked (joined the scrum/ + reality-tests/
exemptions in .gitignore).
Next probe: /v1/matrix/retrieve ↔ Rust /vectors/hybrid for the
real user-facing retrieve path. Embed parity gives that probe a
clean foundation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`jq --arg` and `curl --data-binary @-` were both being fed from
stdin/argv-bound shell buffers. Diffs >~128KB blow past the kernel's
argv limit even when piped via stdin (because we still build `body` as
a shell variable first, then feed it to curl). The voice-ai full bundle
was 156K and hit it.
Switch to writing user/system/body to mktemp files, jq reads via
--rawfile, curl reads via @file. Same on-the-wire shape, no argv
involvement. Cleanup with rm at the end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-lineage scrum on bundle 87cbd10..f971e64 (3,652 lines)
produced 4 actionable findings, all defensive hardening.
1. (Opus WARN) internal/langfuse/client.go:queue
Synchronous Flush at maxBatch threshold blocked the calling
goroutine for the full 5s HTTP timeout when Langfuse hiccupped,
defeating the "best-effort, never blocks calling path" contract
in the package doc. Now fire-and-forget via goroutine.
2. (Opus + Kimi convergent) cmd/observerd/main.go:handleInbox
- Free-form priority string was accepted; "nonsense" passed
through unchecked. Now closed enum: urgent|high|medium|low (+
empty defaults to medium). Tested: TestInbox_RejectsBadPriority.
- No size cap on body, only emptiness check; multi-MB payloads
would bloat observer's ring + JSONL. Now 8 KiB cap returns 413.
Tested: TestInbox_RejectsOversizedBody.
- Subject/sender/tag concatenated into InputSummary without
newline stripping; embedded \n could corrupt JSONL line-based
parsers. New sanitizeInboxField strips \r\n + caps at 256 chars
before interpolation.
3. (Opus INFO) scripts/multi_coord_stress/main.go
Removed dead `must[T]` generic — tracedSearch took over the
fail-fast role for matrix searches, so the helper became unused.
4. (Opus INFO) scripts/multi_coord_stress/main.go:Event
`JudgeRating int` collapsed "judge errored" and "judge said
unrated" both to 0. Changed to *int — nil = errored, 1-5 =
verdict. judgeInboxResult still returns 0 on error; caller
gates on > 0 before assigning.
Dismissed (with rationale):
- Opus WARN ExcludeIDs ordering: verified by code read — filter
applies after sort + before top-K truncation as documented;
no slot waste possible.
- Opus INFO 10 prior-run reports contradict #011: those are
point-in-time snapshots; intentional history.
- Kimi INFO Langfuse error suppression: design intent (best-effort
per package doc).
- Kimi INFO contract schema validation: defer until contract count
grows enough to make hand-edit drift a real risk.
- Kimi INFO paraphrase prompt duplicated across lift + multi_coord:
defer (lift to internal/paraphrase/ when a third consumer appears).
- Qwen HOLD: single-line, no actionable finding.
go test ./cmd/observerd ./internal/langfuse all green; multi_coord
driver builds clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-push hook caught the regression — the smoke hardcoded
MODEL = "nomic-embed-text" and the bump to nomic-embed-text-v2-moe
in 4da32ad failed the gate.
Fix: glob-match the family prefix (nomic-embed-text*). Both v1 and
v2-moe are 768d drop-ins; the property the smoke is locking is
dim + distinct-vectors, not the exact model variant. Operators
swap the variant in lakehouse.toml without needing to touch the
smoke.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1c-only tracing (commit 7e6431e) was the proof-of-concept.
This commit threads tracing through every phase: baseline / fresh-
resume / inbox burst / surge / swap / merge / handover (verbatim +
paraphrase) / split / reissue. Each phase is a parent span; each
matrix.search / LLM call inside is a child span.
Refactor:
- One run-level trace is created at driver startup.
- New startPhase(name, hour, meta) helper emits a phase span as a
child of the run trace; subsequent emitSpan calls nest under it.
- New tracedSearch(spanName, query, corpora, ...) wraps matrixSearch
with span emission. Every search call site replaced with this so
the input/output JSON (query, corpora, k, playbook, exclude_n →
top-K ids, top1 distance, boost/inject counts) lands in Langfuse.
- Phase 4b's paraphrase generation also emits llm.paraphrase spans.
- Phase 1c's existing inline span emission converted to use the new
helpers (no more inboxTraceID variable).
Run #011 result: trace landed at http://localhost:3001 with 111
observations attached. Span breakdown:
phase.* parents: 9 (one per phase that ran)
matrix.search.baseline: 10
matrix.search.fresh_verify: 3 (top-1 confirmed for all 3 fresh)
observerd.inbox.record: 6
llm.parse_demand: 6
matrix.search.inbox: 6
llm.judge_top1: 6
matrix.search.surge: 12
matrix.search.swap_orig: 1
matrix.search.swap_replace: 1
matrix.search.merge: 6
matrix.search.handover_verbatim: 4
llm.paraphrase: 4
matrix.search.handover_paraphrase: 4
matrix.search.split: 4
matrix.search.reissue: 12
matrix.search.reissue_retrieval_only: 12
─────────────
Total: 111
Browse: http://localhost:3001 → Traces → "multi_coord_stress run"
Each phase is a collapsible section showing per-call timing and
input/output JSON. Operators can drill into any single retrieval
to see exactly what query was issued and what came back.
All other metrics held: diversity 0.026, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4, fresh-resume 3/3
at top-1 (two-tier index), 200-worker swap Jaccard 0.000.
This is the FULL TEST J asked for — every action in the run
visible in Langfuse, full input/output drilldown.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runs #003-#009 surfaced the same finding: fresh workers added
mid-run to the main 'workers' vectord index (5K items) were reliably
*absorbed* (HTTP 200) but failed to *surface* in semantic queries
even with content-matching prompts. Distances on the verify queries
sat at 0.25-0.65 against existing workers; fresh items were beyond
top-K. Better embedder (v2-moe) didn't help — distances got TIGHTER
on existing items, pushing fresh items further out of reach.
Root cause: coder/hnsw incremental adds to a populated graph land
in poorly-connected regions and disappear from search traversal.
Known property of HNSW post-build adds; not a bug.
Fix: two-tier index pattern (canonical NRT search architecture).
Fresh content goes to a small "hot" corpus (fresh_workers); main
queries include it in the corpora list and merge results. Hot corpus
has no recall crowding because it's tiny; periodic batch job (post-
G3) merges it into the main index.
Implementation:
- ensureFreshIndex(hc, gw, name, dim) — idempotent POST
/v1/vectors/index. 409 from re-create treated as "already there."
- ingestFreshWorker now takes idx parameter so callers can target
fresh_workers instead of workers.
- multi_coord_stress phase 1b creates fresh_workers index + ingests
3 fresh workers there + searches verifyCorpora=[workers,
ethereal_workers, fresh_workers].
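A sketch of the idempotent create (route from the bullet above, request
field names assumed):

```go
package stress

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// ensureFreshIndex POSTs the create and treats 409 Conflict as "already
// there", so re-running the harness never fails on the hot index existing.
func ensureFreshIndex(hc *http.Client, gw, name string, dim int) error {
	body, _ := json.Marshal(map[string]any{"name": name, "dimension": dim}) // field names assumed
	resp, err := hc.Post(gw+"/v1/vectors/index", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	switch {
	case resp.StatusCode == http.StatusConflict:
		return nil // idempotent: index already exists
	case resp.StatusCode >= 300:
		return fmt.Errorf("create index %s: %s", name, resp.Status)
	}
	return nil
}
```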
Run #010 result:
fresh-001 (Senior tower crane rigger NCCCO Chicago)
top-1: fresh-001 from fresh_workers, distance 0.143
fresh-002 (Bilingual Spanish/English OSHA trainer Indianapolis)
top-1: fresh-002 from fresh_workers, distance 0.146
fresh-003 (FAA Part 107 drone surveyor Chicago)
top-1: fresh-003 from fresh_workers, distance 0.129
3/3 fresh workers surface at top-1 — the absorption-but-not-
findable issue from runs #003-#009 is closed.
All other metrics held: diversity 0.007, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4, swap Jaccard 0.000,
inbox burst all 6 events accepted + traced to Langfuse.
This is the final structural fix for the multi-coord stress
suite. Phase 3 is feature-complete.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Rust side has Langfuse tracing already (gateway/v1/langfuse_trace.rs);
this commit lands Go-side parity so the multi-coord stress harness can
emit traces visible at http://localhost:3001.
internal/langfuse/client.go:
- Minimal Trace + Span + Flush API mirroring what the Rust emitter
uses. Auth: Basic over public_key:secret_key.
- Best-effort posture: errors are slog.Warn'd, never block calling
paths. Same fail-open as observerd's persistor (ADR-005 Decision
5.1) — observability is a witness, not a gate.
- Events buffered until 50, then auto-flushed; explicit Flush() at
process exit.
- Each Trace/Span returns its id so callers can build hierarchies.
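A sketch of the buffering posture (batch size per the bullet above; the HTTP
ingestion POST is stubbed out, the real client is internal/langfuse/client.go):

```go
package langfuse

import (
	"log/slog"
	"sync"
)

const maxBatch = 50

type Client struct {
	mu     sync.Mutex
	events []any
}

// enqueue buffers an event and flushes once maxBatch is reached; Flush()
// at process exit drains whatever is left. Errors are logged, never
// returned, so a Langfuse outage can't fail the calling path.
func (c *Client) enqueue(ev any) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.events = append(c.events, ev)
	if len(c.events) >= maxBatch {
		c.flushLocked()
	}
}

func (c *Client) flushLocked() {
	if len(c.events) == 0 {
		return
	}
	batch := c.events
	c.events = nil
	if err := c.post(batch); err != nil { // HTTP ingestion POST, elided here
		slog.Warn("langfuse flush failed", "err", err)
	}
}

func (c *Client) post(batch []any) error { return nil } // stub for the sketch
```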
multi_coord_stress driver wiring:
- New --langfuse-env flag (default /etc/lakehouse/langfuse.env).
Empty / missing / unparseable file → skip tracing with a logged
warning; run still proceeds.
- Phase 1c (inbox burst) now emits one parent trace + 4 spans per
inbox event:
1. observerd.inbox.record (post to /v1/observer/inbox)
2. llm.parse_demand (qwen2.5 → structured fields)
3. matrix.search (parsed query → top-K)
4. llm.judge_top1 (rate top-1 vs original body)
Each span carries input/output JSON + start/end times so the
Langfuse UI shows a full waterfall per event.
Run #009 result:
Trace landed: "multi_coord_stress phase 1c inbox burst"
Observations attached: 24 (= 6 events × 4 spans)
Tags: stress, phase-1c, inbox
Browseable at http://localhost:3001 by tag query.
Other harness metrics: diversity 0.016, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4 — all unchanged
by the tracing addition (best-effort post in parallel).
Phase 1c is the proof-of-concept; future commits can wrap other
phases (baseline / merge / handover / split) in traces too. Once
that's done, the entire stress run becomes scrubbable in Langfuse
without grepping the events JSON.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run #007 surfaced a tradeoff: LLM-parsed inbox queries produce much
tighter cosine distances (0.05-0.10 in three cases) but lose the
"system has no good match" signal that high-distance results give.
A coordinator UI showing only distance can't tell wrong-domain
matches apart from real ones.
Fix: judge re-rates top-1 against the ORIGINAL inbox body (not the
LLM-parsed query). Coordinators see both:
- distance: how close was retrieval in vector space
- rating: does this person actually fit the original ask
The pair tells the honest story.
Run #008 result on the 6 inbox events:
Demand                  Top-1    Distance  Rating  Reading
─────────────────────────────────────────────────────────────
Forklift Cleveland      w-3573   0.29      4       Strong
Production Indy         e-1764   0.41      3       Adjacent
Crane Chicago           e-7798   0.23      1       TIGHT BUT WRONG
Bilingual safety Indy   w-3918   0.05      5       Perfect
Drone Chicago           e-1058   0.06      5       Perfect (verify e-1058)
Warehouse Milwaukee     w-460    0.32      4       Strong
The crane-Chicago case is the architectural-honesty signal at work:
distance 0.23 says "tight match" but the judge, reading the original
body, says rating 1. A coordinator seeing only distance would ship the
wrong worker; a coordinator seeing distance+rating sees the
disagreement and escalates.
Net distribution: 5/6 rated 3+ (acceptable→perfect), 1/6 rated 1
(irrelevant despite tight cosine). The substrate-honesty signal is
recovered without losing the LLM-parse quality wins.
Cost: 6 extra judge calls (~9s on qwen2.5). Production amortizes
when judge runs only on top-1 of high-priority inbox events; the
search-cost-vs-quality tradeoff lives in the priority gate.
Implementation:
- New JudgeRating int field on Event (omitempty so non-judged
events stay clean in JSON)
- New judgeInboxResult helper, reusing the same prompt structure as
playbook_lift's judgeRate. The two could share an internal package
if a third judge consumer appears.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaced the hard-coded DemandQuery on inbox events with an actual
LLM call: each email/SMS body is parsed by qwen2.5 (format=json,
schema-anchored) into structured {role, count, location, certs,
skills, shift}. The driver then composes a query string from those
fields and runs matrix.search.
This is the real-product flow that the Phase 3 stress test was
asking for: real bodies → real LLM parsing → real search. Before
this commit, the DemandQuery was my hand-crafted string, which
made the inbox phase trivial.
Run #007 result vs #006 (same bodies, parser swapped):
All 6 inbox events parsed cleanly — qwen2.5 nailed:
"Need 50 forklift operators in Cleveland OH for Monday day
shift. OSHA-30 + active forklift cert required."
→ {role:"forklift operator", count:50, location:"Cleveland, OH",
certs:["OSHA-30","active forklift cert"], skills:[], shift:"day"}
The other 5 were similarly faithful ("indy" stayed as "indy", count
defaulted to 1 when unspecified, no hallucinated fields).
LLM-parsed queries produced TIGHTER matches than hard-coded:
Demand #006 dist #007 dist Δ
Crane Chicago 0.499 0.093 -82%
Drone Chicago 0.707 0.073 -90%
Bilingual safety 0.240 0.048 -80%
Forklift Cleveland 0.330 0.273 -17%
Production Indy 0.260 0.399 +53%
Warehouse Milwaukee 0.458 0.420 -8%
Three matches landed at distance < 0.10 — verbatim-replay-tight
territory. Structured queries embed sharper than conversational
hand-crafted strings.
Other metrics unchanged: diversity 0.000, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4.
Tradeoff worth flagging: the drone-Chicago case dropped from
distance 0.71 (clear "we don't have one") to 0.07 (confident match
returned). The OOD honesty signal weakens when LLM-parsed structure
makes any closest-neighbor look tight. Future Phase 4 work: judge
re-rates the top match before surfacing, so coordinators see "your
demand was for X but the closest match scored 2/5" rather than just
the worker ID + distance.
Substrate cost: +6 LLM calls per inbox burst (~9s on qwen2.5).
Production would amortize via a small dedicated parser model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 ask: real-world inbox-style event injection during the stress
test. Coordinators in production receive emails + SMS that trigger
contract responses; the substrate has to RECORD these signals AND
react with a search using the embedded demand. This commit lands the
endpoint and exercises it end-to-end in the stress harness.
observerd surface:
- New POST /observer/inbox route — accepts {type, sender, subject,
body, priority, tag} and records as ObservedOp with
Source=SourceInbox. Type must be email|sms; body required;
priority defaults to medium. The handler ONLY records — downstream
triggers (search, ingest, etc.) are the caller's concern, recorded
separately. Keeps the witness role pure. (Handler sketched after this list.)
- New observer.SourceInbox = "inbox" alongside SourceMCP /
SourceScenario / SourceWorkflow.
- Three contract tests on the new route (happy path / bad type / empty
body), router-mount test extended, all green.
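A minimal sketch of the record-only handler; the request fields and validation rules come from the route description above, while ObservedOp's shape and the store callback are approximations:

```go
import (
	"encoding/json"
	"net/http"
)

// Approximations of the observer types; only SourceInbox = "inbox" is literal.
type ObservedOp struct{ Source, Kind, Detail string }

const SourceInbox = "inbox"

func handleInbox(record func(ObservedOp) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var req struct {
			Type, Sender, Subject, Body, Priority, Tag string
		}
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "bad json", http.StatusBadRequest)
			return
		}
		if req.Type != "email" && req.Type != "sms" {
			http.Error(w, "type must be email|sms", http.StatusBadRequest)
			return
		}
		if req.Body == "" {
			http.Error(w, "body required", http.StatusBadRequest)
			return
		}
		if req.Priority == "" {
			req.Priority = "medium"
		}
		// Record only: downstream searches are the caller's concern.
		if err := record(ObservedOp{Source: SourceInbox, Kind: req.Type, Detail: req.Body}); err != nil {
			http.Error(w, "record failed", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```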
Stress harness phase 1c (Hour 9):
- 6 inbox events fire in priority order (urgent → high → medium):
2 urgent emails (forklift Cleveland, production Indianapolis)
1 high email (crane Chicago)
1 high sms (bilingual safety Indianapolis)
1 medium sms (drone Chicago)
1 medium email (warehouse Milwaukee FYI)
- Each event:
1. POSTs to /v1/observer/inbox (recorded by observerd)
2. Triggers matrix.search using a parsed demand (the demand
extraction is hard-coded for now; production needs a small
LLM to parse from body)
3. Captures both as events in the run JSON
Run #006 result (with v2-moe embedder + all phases including inbox):
Diversity:
Same-role-across-contracts Jaccard = 0.000 (n=9)
Different-roles-same-contract Jaccard = 0.046 (n=18)
Determinism: 1.000
Verbatim handover: 4/4 (100%)
Paraphrase handover: 4/4 (100%)
Inbox burst:
6/6 events accepted by observerd (200 status, all recorded)
6/6 triggered searches produced distinct top-1 worker IDs
distance distribution: 0.24 (Indy production) → 0.71 (Chicago
drone surveyor — honest stretch since drones aren't in the
5K-worker corpus, system surfaces closest neighbor at high
distance rather than fabricating)
The drone-Chicago case is the architectural-honesty signal: when
the demand asks for a specialist NOT in the roster, the system
returns the closest semantic neighbor with a distance that flags
"this is a stretch." Coordinators reading distances see "we don't
have a great match here" rather than a confident wrong answer.
Total events captured: 67 (was 61 pre-inbox).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Local Ollama has three embedding models loaded:
nomic-embed-text:latest 137M 768d (previous default)
nomic-embed-text-v2-moe:latest 475M 768d (this commit's default)
qwen3-embedding:latest 7.6B 4096d (would require dim change)
v2-moe is a drop-in upgrade — same 768 dim, 3.5× more params, MoE
architecture. The workers index doesn't need rebuilding; future ingests
simply embed with the stronger model.
Run #005 result on the multi-coord stress suite:
Diversity (same-role-across-contracts): 0.080 → 0.000 (n=9)
→ MoE is more discriminating: zero worker overlap across
Milwaukee / Indianapolis / Chicago for shared role names.
The geo + cert + skill context fully separates worker pools.
Different-roles-same-contract: 0.013 → 0.036 (still ~96% diff)
Determinism: 1.000 (unchanged)
Verbatim handover: 4/4 (100%)
Paraphrase handover: 4/4 (100%)
200-worker swap: Jaccard 0.000 (unchanged — still perfect)
Fresh-resume verify: STILL doesn't surface fresh workers in top-8.
With v2-moe, distances increased (top-1 = 0.43–0.65 vs v1's 0.25–0.39)
— the embedder is MORE discriminating, but the fresh worker's
vector still doesn't outrank the 8th-best existing worker. Now
suspected to be an HNSW post-build add issue (coder/hnsw incremental
adds can land in hard-to-reach graph regions), not an embedder problem.
A better embedder didn't fix it; it needs a different strategy: full
index rebuild after fresh adds, or
explicit playbook-layer score boost for fresh workers, or
hybrid (keyword + semantic) retrieval. Phase 3 investigation.
Cost: ingest is 3–5× slower (workers 20s→100s; ethereal 35s→112s).
Acceptable for the quality jump on diversity. Real production, which
ingests incrementally, won't pay this once-per-deploy cost.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three Phase 2 additions land in this commit:
1. matrix.SearchRequest gains ExcludeIDs ([]string) — filters specific
worker IDs out of results post-retrieval, AND skips them at the
playbook boost+inject step (so excluded answers can't sneak back
via Shape B). Real-world driver: coordinator placed N workers,
client asks for replacements, system needs alternatives, not the
same N. Threaded through retrieve.go after merge but before
metadata filter so excluded IDs don't waste post-filter top-K slots
(filter sketched after this list).
2. New harness phase 2b: 200-worker swap simulation. Captures the
top-K from alpha's warehouse query, then re-issues with
exclude_ids=<placed>. Result Jaccard(orig, swap) measures whether
the substrate finds genuine alternatives.
3. New harness phase 1b: fresh-resume mid-run injection. Three new
workers ingested via /v1/embed + /v1/vectors/index/workers/add,
then verified findable via semantic queries matching resume content.
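A sketch of the exclusion from item 1; the Result shape and exact hook points are approximations, the placement (after merge, before the metadata filter, and again at boost+inject) is from the description above:

```go
type Result struct {
	ID       string
	Distance float64
}

// excludeSet turns the request's ExcludeIDs into a constant-time lookup.
func excludeSet(ids []string) map[string]struct{} {
	set := make(map[string]struct{}, len(ids))
	for _, id := range ids {
		set[id] = struct{}{}
	}
	return set
}

// filterExcluded drops excluded worker IDs so they never consume
// post-filter top-K slots; the boost/inject step checks the same set.
func filterExcluded(results []Result, excluded map[string]struct{}) []Result {
	kept := results[:0]
	for _, r := range results {
		if _, skip := excluded[r.ID]; !skip {
			kept = append(kept, r)
		}
	}
	return kept
}
```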
Plus Hour labels on every event (operational narrative: 0/6/12/18/
24/30/36/42/48) and a refactor of captureEvent to take hour as a
param.
Run #003 + #004 results (5K workers + 10K ethereal):
Diversity (#004):
Same-role-across-contracts Jaccard = 0.080 (n=9)
Different-roles-same-contract Jaccard = 0.013 (n=18)
Determinism: 1.000 (#004 unchanged)
Verbatim handover: 4/4 = 100%
Paraphrase handover: 4/4 = 100%
Phase 2b — 200-worker swap (Jaccard 0.000):
8 originally-placed workers fully replaced by 8 alternatives.
ExcludeIDs substrate change works end-to-end — boost AND inject
both honor the exclusion, so excluded workers don't return via
the playbook either.
Phase 1b — fresh-resume injection: REAL PRODUCT FINDING.
Substrate ABSORPTION is fine — 3 /v1/vectors/index/workers/add
calls at 200 status, 3 vectors persisted. But none of the 3
fresh workers surfaced in top-8 even with semantic queries
matching their resume content (e.g. "Senior tower crane rigger
NCCCO Chicago" vs fresh-001's resume "Senior rigger with 12
years tower-crane signaling..." NCCCO + Chicago).
Top-1 came from existing workers at distance ~0.25; fresh
workers' distances must be > 0.25, pushing them past rank 8.
Cause: dense retrieval at 5000+ workers means many existing
profiles cluster near any specific query in cosine space;
nomic-embed-text (137M) introduces enough noise that a
fresh worker doesn't reliably outrank them just because the
text content overlaps.
Workarounds (Phase 3 work): (a) hybrid retrieval (keyword +
semantic), (b) playbook-layer score boost for fresh adds,
(c) larger embedder. Documented in run #004 report.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 had two known gaps: (1) the 3 contracts had zero shared role
names, so same-role-across-contracts Jaccard was vacuous (n=0); (2)
the verbatim handover at 100% was the trivial case, not the hard
learning test (paraphrased queries against another coord's playbook).
Both fixed in this commit.
Contract redesign — all 3 contracts now share warehouse worker /
admin assistant / heavy equipment operator roles, plus a unique
specialist per contract (industrial electrician / bilingual safety
coord / drone surveyor — the "specialist not on the standard roster"
case from J's spec). Counts and skill mixes vary per region.
New driver phase 4b — paraphrase handover. Bob runs qwen2.5-paraphrased
versions of Alice's contract queries against Alice's playbook
namespace. Tests whether institutional memory propagates across
coordinators AND across natural wording variation that Bob would
introduce when running Alice's contract.
Run #002 result (5K workers + 10K ethereal_workers, 4 demand × 3
coords + paraphrase handover):
Diversity (the question J asked: locking or cycling?):
Same-role-across-contracts Jaccard = 0.119 (n=9)
→ 88% of workers DIFFER across regions for the same role name.
Milwaukee warehouse vs Indianapolis warehouse vs Chicago
warehouse pull mostly distinct top-K from the same population.
The system locks into geo+cert+skill context, not cycling.
Different-roles-same-contract Jaccard = 0.004 (n=18)
→ role-specific retrieval works (unchanged from Phase 1).
Determinism: Jaccard = 1.000 (n=12) — unchanged.
Learning:
Verbatim handover 4/4 = 100% (trivial case, expected)
Paraphrase handover 4/4 = 100% (HARD case — passes!)
Of those 4 paraphrase recoveries:
- 2 used boost (Alice's recording was already in Bob's
paraphrase top-K; ApplyPlaybookBoost re-ranked to top-1)
- 2 used Shape B inject (recording wasn't in Bob's
paraphrase top-K; InjectPlaybookMisses brought it in)
The boost/inject mix is healthy — both paths are used and both
produce correct top-1s. Multi-coord institutional memory propagation
is empirically working under wording variation.
Sample warehouse worker top-1s across contracts (proves diversity):
alice / Milwaukee → w-713
bob / Indianapolis → e-8447
carol / Chicago → e-7145
Three different workers from the same 15K-person population,
selected on geo+cert+skill context.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three coordinators (alice / bob / carol) with three contracts
(Milwaukee distribution / Indianapolis manufacturing / Chicago
construction). 7-phase scenario runner: baseline → surge → merge →
handover → split → reissue → analysis. Each coord has a separate
playbook namespace (playbook_{name}) so institutional memory stays
isolated by default but transferable on demand.
Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints,
and Langfuse tracing — those are Phase 2/3.
Run #001 (52 events, 4 queries × 3 coords × 2 demand flavors):
Diversity:
Different-roles-same-contract Jaccard = 0.004 (n=18)
→ role-specific retrieval is working perfectly. Different
roles within one contract pull totally different worker
pools. System is NOT cycling; locks into per-role retrieval.
Same-role-across-contracts Jaccard = N/A (n=0)
→ TEST-DESIGN ISSUE: the 3 contracts use distinct role names
per industry (warehouse worker / production worker / general
laborer), so no exact-name overlaps exist. Phase 2 should
either share at least one role across contracts OR add a
skill-based diversity metric.
Determinism: Jaccard = 1.000 (n=12)
→ HNSW + Ollama retrieval is fully deterministic on identical
query text. coder/hnsw + nomic-embed-text are stable.
Learning: handover hit rate = 4/4 = 100%
→ Bob inherits Alice's recordings perfectly when bob runs
identical queries with alice's playbook namespace. CAVEAT:
this tests the trivial verbatim case, not paraphrase handover.
The harder test (bob runs paraphrased queries with alice's
playbook) is Phase 2 work.
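The diversity and determinism figures above are Jaccard overlaps between top-K worker-ID sets; a minimal sketch of the computation (the helper name is ours, not the harness's):

```go
// jaccard returns |A ∩ B| / |A ∪ B| over two top-K ID lists.
// 1.0 = identical sets (determinism), 0.0 = fully disjoint (diversity).
func jaccard(a, b []string) float64 {
	inA := map[string]bool{}
	for _, id := range a {
		inA[id] = true
	}
	union, inter := len(inA), 0
	seen := map[string]bool{}
	for _, id := range b {
		if seen[id] {
			continue
		}
		seen[id] = true
		if inA[id] {
			inter++
		} else {
			union++
		}
	}
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}
```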
Per-event capture in JSON: every matrix.search response is logged
with phase / coordinator / contract / role / query / top-K IDs +
distances + per-corpus counts + boosted/injected counts. Reviewable
via:
jq '.events[] | select(.phase == "merge")'
jq '.events[] | select(.coordinator == "alice")'
jq '.events[] | select(.role == "warehouse worker")'
Notable finding from per-event: carol's "general laborer" and "crane
operator" queries both surface w-1009 as top-1, with crane operator
at distance 0.098 (very tight) and general laborer at 0.297. The
system found a worker who legitimately covers both roles — realistic
for small construction crews.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rank-based "lift" metric (warm-top-1 == cold-judge-best) doesn't
distinguish "Shape B surfaced a strictly-better answer" from "Shape B
shuffled ranks but quality is unchanged" from "Shape B replaced a good
answer with a wrong one." This commit adds Pass 4: judge warm top-1
with the same prompt as cold ratings, then bucket the comparison.
Implementation:
- New --with-rejudge driver flag (default off).
- New WITH_REJUDGE harness env (default 1, on for prod runs).
- queryRun gains WarmTop1Metadata (cached during Pass 2 for the
rejudge call) + WarmTop1Rating *int (nil-distinguishable; nil = no
rejudge, 0..5 = rating).
- summary gains RejudgeAttempted, QualityLifted, QualityNeutral,
QualityRegressed (counts of warm-rating > / == / < cold-rating).
- Markdown headline gains a Quality block when rejudge ran.
- ~21 extra judge calls (~30s on qwen2.5).
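The Pass 4 bucketing, in sketch form; WarmTop1Rating and the summary counters are this commit's names, the cold-rating field name is an assumption:

```go
type queryRun struct {
	ColdTop1Rating int  // assumed name for the Pass 1 judge rating
	WarmTop1Rating *int // nil = rejudge didn't run, 0..5 = rating
}

// bucketRejudge fills RejudgeAttempted / QualityLifted / QualityNeutral /
// QualityRegressed from the per-query ratings.
func bucketRejudge(runs []queryRun) (attempted, lifted, neutral, regressed int) {
	for _, q := range runs {
		if q.WarmTop1Rating == nil {
			continue
		}
		attempted++
		switch {
		case *q.WarmTop1Rating > q.ColdTop1Rating:
			lifted++
		case *q.WarmTop1Rating < q.ColdTop1Rating:
			regressed++
		default:
			neutral++
		}
	}
	return
}
```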
Run #005 result (split inject threshold 0.20 + paraphrase + rejudge):
Quality lifted 5 / 21 (24%) — 3× +2 rating, 2× +1 rating
Quality neutral 13 / 21 (62%) — includes OOD queries holding 1
Quality regressed 3 / 21 (14%)
Net rating delta +3 across 21 queries (+0.14 average)
The 5 lifts were all rating-2 cold replaced with rating-3 or rating-4
warm — Shape B took mediocre matches and substituted substantively
better ones. Two of the 3 regressions were small (-1 each); the third was -3.
Q11 is the cautionary tale: cold top-1 "production line worker"
(rating 4) got replaced by Q1's recorded "forklift OSHA-30 operator"
e-5729 (rating 1). Adjacent-domain cross-pollination — production
worker and forklift operator embed within 0.20 cosine because both
are warehouse-adjacent staffing queries, even though the judge
correctly distinguishes them. The split-threshold defense (0.5 boost
/ 0.20 inject) catches OOD cross-pollination (Q19/Q20/Q21 all stayed
neutral at rating 1) but not adjacent-domain cross-pollination.
Net product verdict: working, net-positive on quality, but the worst
case (Q11 4→1) is customer-visible and warrants a tighter inject
threshold OR an additional gate beyond cosine distance. Filed in
STATE_OF_PLAY OPEN as a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The v0 boost-only stance documented in internal/matrix/playbook.go:22-27
("the boost only re-ranks results that ALREADY surfaced from the regular
retrieval") couldn't promote recorded answers that dropped out of a
paraphrase's top-K. playbook_lift_002 surfaced exactly that gap: 0/2
paraphrase recoveries because the recorded answers weren't in regular
retrieval at all (rank=-1).
Shape B: when warm-pass retrieval doesn't surface a playbook hit's
answer, inject a synthetic Result for it directly. Distance =
playbook_hit_distance × BoostFactor — same formula as the boost path so
injections land in comparable distance space. Caller re-sorts +
truncates after both boost and inject have run.
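A sketch of the injection; InjectPlaybookMisses and the distance formula are from this commit, the Result/hit shapes are approximations:

```go
type Result struct {
	ID       string
	Distance float64
}

type PlaybookHit struct {
	AnswerID string
	Distance float64 // distance of the recorded query's vector to the new query
}

// InjectPlaybookMisses appends a synthetic Result for every playbook hit whose
// answer didn't surface in warm retrieval. Injected distance uses the same
// hit_distance × BoostFactor formula as the boost path so both land in
// comparable space; the caller re-sorts and truncates to top-K afterwards.
func InjectPlaybookMisses(results []Result, hits []PlaybookHit, boostFactor float64) []Result {
	present := make(map[string]bool, len(results))
	for _, r := range results {
		present[r.ID] = true
	}
	for _, h := range hits {
		if present[h.AnswerID] {
			continue // already surfaced; boost alone handles re-ranking
		}
		present[h.AnswerID] = true // dedupe multi-hit same answer
		results = append(results, Result{ID: h.AnswerID, Distance: h.Distance * boostFactor})
	}
	return results
}
```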
Result on playbook_lift_003 (Shape B + paraphrase pass):
Verbatim discovery 6
Verbatim lift 2 / 6
**Paraphrase top-1** **6 / 6**
Paraphrase any-rank in K 6 / 6
Mean Δ top-1 distance -0.1637 (warm closer than cold)
Every paraphrase the judge generated landed the v1-recorded answer at
top-1 of the new query's results. The learning property holds — cosine
on embed(paraphrase) finds the recorded query's vector within
DefaultPlaybookMaxDistance (0.5), and Shape B injects the answer.
Verbatim lift dropped from v1's 7/8 because Shape B cross-pollinates
recorded answers across queries. w-4435 (Q2's recording) appears as
warm top-1 for several other queries because their embeddings are
within the playbook hit threshold of "OSHA-30 forklift Wisconsin." This
is a feature, not a bug — the matrix layer's purpose is to share
knowledge across queries — but the lift metric only counts "warm top-1
== cold judge best," so cross-pollinated lifts don't register. A v3
metric would re-judge warm pass to measure true judge improvement.
Tests:
- TestInjectPlaybookMisses_AddsMissingAnswers — primary claim
- TestInjectPlaybookMisses_SkipsAnswersAlreadyPresent — no double-inject
- TestInjectPlaybookMisses_DedupesPerAnswer — multi-hit same answer
- TestInjectPlaybookMisses_EmptyHits — fast-path no-op
Driver fix: ParaphraseRecordedRank int → *int. The `omitempty` int
silently dropped rank=0 (top-1, the WANTED value) from JSON, making the
v003 report show "null" instead of "0" for every successful recovery.
Pointer keeps nil/rank-0 distinguishable.
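The omitempty pitfall in miniature (field names follow this commit; the program is illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
)

type row struct {
	// rank 0 (top-1) is the int zero value, so omitempty silently drops it:
	BadRank int `json:"paraphrase_recorded_rank_bad,omitempty"`
	// a pointer keeps "not measured" (nil, omitted) distinct from rank 0:
	GoodRank *int `json:"paraphrase_recorded_rank,omitempty"`
}

func main() {
	zero := 0
	b, _ := json.Marshal(row{BadRank: 0, GoodRank: &zero})
	fmt.Println(string(b)) // {"paraphrase_recorded_rank":0}: BadRank vanished
}
```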
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in Pass 3 to the lift driver: for each query whose Pass 1
recorded a playbook, ask the judge to rephrase the query, then re-query
with playbook=true and check whether the recorded answer surfaces in
top-K. This is the test the v1 report's caveat #3 explicitly flagged
as the actual learning-property gate (not the cheap verbatim case).
Implementation:
- New flag --with-paraphrase on the driver (default off).
- New WITH_PARAPHRASE env in the harness (default 1, on for prod runs).
- New paraphrase_* fields on queryRun + summary, with a `// 0` fallback
  in jq so re-rendering verbatim-only evidence stays clean.
- generateParaphrase() calls the same judge model with format=json and
a tight schema; temperature=0.5 for variance without domain drift.
- Markdown report adds a paraphrase per-query table (only when the
pass ran) and an honesty caveat about judge-also-rephrases coupling.
Run #002 result (reports/reality-tests/playbook_lift_002.{json,md}):
Verbatim lift 2/2 (100% — Q7 + Q13, both stable from v1)
Paraphrase top-1 0/2
Paraphrase any-rank in K 0/2
Both paraphrases dropped the recorded answer OUT of top-K entirely
(rank=-1). This isn't a paraphrase-quality problem — qwen2.5's outputs
preserved intent ("Hazmat-certified warehouse worker comfortable with
cold storage" → "Warehouse worker with Hazmat certification and
experience in cold storage"). It's the v0 boost-only stance documented
in internal/matrix/playbook.go:22-27: the boost only re-ranks results
that ALREADY surfaced from regular retrieval. If paraphrase's cosine
retrieval doesn't include the recorded answer in top-K, no boost can
promote it.
The "Shape B" upgrade mentioned in the playbook.go comment — inject
playbook hits directly even when they weren't in the top-K — is what
would close this gap. The reality test surfaced exactly the gap the
docs warned about. Worth filing as the next product gate.
Run-to-run variance also visible: v1 had 8 discoveries, v2 had 2.
HNSW insertion order + judge variance both contribute. Stability of
Q7 and Q13 across both runs (lifted in v1 AND v2) is the most reliable
signal in the dataset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-lineage scrum on b2e45f7 produced 1 convergent + 3 single-reviewer
findings worth fixing. All apply.
1. (Opus WARN + Qwen INFO convergent) scripts/playbook_lift.sh: replace
sleep 2.5 in SQL probe with active polling up to 5s. refresh_every=1s
is a lower bound; under load the manifest may not be visible in a
fixed sleep, which would 4xx the probe and abort the reality run.
2. (Opus INFO) scripts/playbook_lift.sh: report template glued
"env JUDGE_MODEL" + value as "env JUDGE_MODELqwen2.5:latest" with no
separator. Replaced two :+/:- substitution chains with a single
JUDGE_SOURCE variable computed once at the top of the harness.
3. (Opus INFO) scripts/staffing_workers/main.go: -id-prefix "" silently
allowed, defeating the flag's purpose (cross-corpus collision
prevention). Now log.Fatal at startup with an explicit hint.
4. (Opus WARN) cmd/{pathwayd,observerd}/main_test.go: newTestRouter
returned http.Handler then re-cast to chi.Router for chi.Walk.
Returning chi.Router directly satisfies http.Handler AND avoids an
assertion that would panic if future middleware wraps the router.
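The shape of the fix, for reference (the route is a placeholder):

```go
import (
	"net/http"

	"github.com/go-chi/chi/v5"
)

// Returning chi.Router directly: it already satisfies http.Handler, and
// chi.Walk accepts it without a type assertion that could panic once
// middleware wraps the router.
func newTestRouter() chi.Router {
	r := chi.NewRouter()
	r.Get("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return r
}

var _ http.Handler = newTestRouter() // compile-time proof, no cast needed
```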
Dismissed (with rationale):
- Kimi INFO hardcoded MinIO endpoint: harness is local-by-design.
- Kimi WARN matrixd accepts 502/500: documented; real retriever needs
real upstreams the test doesn't spin up.
- Qwen INFO queryd strings.Contains: brittle but very low risk; switching
  to a typed-error path would add coupling without adding signal.
go test ./cmd/{matrixd,queryd,pathwayd,observerd} all green.
Verdicts at reports/scrum/_evidence/2026-04-30/verdicts/lift_001_*.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 5-loop substrate's load-bearing gate is verified — playbook +
matrix indexer give the results we're looking for. Per the report's
rubric, lift ≥ 50% of discoveries means matrix is doing real work;
7/8 = 87.5% blew through that.
Harness was structurally hiding bugs behind a 5-daemon stripped boot.
Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:
1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd
rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark) —
wrong domain for staffing queries; replaced with ethereal_workers
(10K rows, real staffing schema, "e-" id prefix to avoid collision
with workers' "w-"). staffing_workers driver gains -index-name +
-id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build that runs
   ~30s per judge call in the lift loop; reverted to qwen2.5:latest
   (~1s/call, 30× faster; the lift finding held)
Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go
so future drift fires in `go test`, not in a reality run. R-005 closed:
- cmd/matrixd/main_test.go (new) — playbook record drift detector +
score bounds + 6 routes mounted
- cmd/queryd/main_test.go — wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new) — 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new) — 4 routes + invalid-op + unknown-mode
`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.
Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
Queries 21 (staffing-domain, 7 categories)
Discoveries 8 (judge ≠ cosine top-1)
Lifts 7/8 (87.5%)
Boosts triggered 9
Mean Δ distance -0.053 (warm closer than cold)
OOD honesty dental/RN/SWE rated 1, no fake matches
Cross-corpus boosts confirmed (e- ↔ w- swaps in lifts)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4-bundle review (128KB diff) hit "Argument list too long" when
curl --data was passed the body as a literal arg. Pipe via stdin
with --data-binary @- instead. Lifts the practical bundle size from
~30KB to whatever fits in process memory.
Caught while running the harness scrum on golangLAKEHOUSE today —
the bigger Phase A+B harness diff (4566 lines) tripped it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bash driver wrapping /v1/chat for Opus + Kimi + Qwen3-coder review
runs. Used today to scrum the 4-phase wave (1,624 LoC of chatd +
config-refactor + Rust cleanup) and caught 2 BLOCKs + 2 WARNs.
Usage:
./scripts/scrum_review.sh <bundle.diff> <bundle_label>
Output: reports/scrum/_evidence/<DATE>/verdicts/<bundle>_<reviewer>.md
verbatim, per the evidence-only convention. Per-reviewer latency +
token counts captured in the report header.
System prompt enforces the BLOCK/WARN/INFO + WHERE/WHAT/WHY shape
per feedback_cross_lineage_review.md — leads with verdict, no
preamble (otherwise Kimi tends to spend tokens thinking).
Reviewer fleet matches project_golang_lakehouse.md "Scrum routing":
- opencode/claude-opus-4-7
- openrouter/moonshotai/kimi-k2-0905
- openrouter/qwen/qwen3-coder
This is the first dogfood of chatd as the scrum vehicle — eats its
own /v1/chat dispatcher.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3-lineage scrum (Opus 4.7 / Kimi K2.6 / Qwen3-coder) on today's wave
landed 4 real findings (2 BLOCK + 2 WARN) and 2 INFO touch-ups.
Verbatim verdicts + disposition table at:
reports/scrum/_evidence/2026-04-30/
B-1 (BLOCK Opus + INFO Kimi convergent) — ResolveKey API:
collapse from 3-arg (envVar, envFileName, envFilePath) to 2-arg
(envVar, envFilePath). Pre-fix every chatd caller passed the env
var name twice; if operator renamed *_key_env in lakehouse.toml
while keeping the canonical KEY= line in the .env file, fallback
silently missed.
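Post-fix shape of the resolver; only the 2-arg signature and the env-then-file order come from the finding, the .env parsing (single KEY=value line) is an assumption:

```go
import (
	"fmt"
	"os"
	"strings"
)

// ResolveKey prefers the process env var and falls back to the provider's
// .env file, so a renamed *_key_env in lakehouse.toml can no longer strand
// the canonical KEY= line in the file.
func ResolveKey(envVar, envFilePath string) (string, error) {
	if v := os.Getenv(envVar); v != "" {
		return v, nil
	}
	data, err := os.ReadFile(envFilePath)
	if err != nil {
		return "", fmt.Errorf("%s unset and %s unreadable: %w", envVar, envFilePath, err)
	}
	for _, line := range strings.Split(string(data), "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if _, v, ok := strings.Cut(line, "="); ok {
			return strings.Trim(v, `"`), nil
		}
	}
	return "", fmt.Errorf("no key for %s in %s", envVar, envFilePath)
}
```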
B-2 (WARN Opus + WARN Kimi convergent) — handleProviders probe:
drop the synthesize-then-Resolve probe; look up by name directly
via Registry.Available(name). Prior probe synthesized "<name>/probe"
model strings and routed through Resolve, fragile to any future
routing rule (e.g. cloud-suffix special case).
B-3 (BLOCK Opus single — verified by trace + end-to-end probe) —
OllamaCloud.Chat StripPrefix used "cloud" but registry routes
"ollama_cloud/<m>". Result: upstream got the prefixed model name
and 400'd. Smoke missed it because chatd_smoke runs without
ollama_cloud registered. Now strips the right prefix; new
TestOllamaCloud_StripsCorrectPrefix locks both prefix + suffix
cases. Verified live: ollama_cloud/deepseek-v3.2 round-trips
cleanly through the real ollama.com endpoint.
B-4 (WARN Opus single) — Ollama finishReason: read done_reason
field instead of inferring from done bool alone. Newer Ollama
reports done=true with done_reason="length" on truncation; the
prior code mapped that to "stop" and lost the truncation signal
the playbook_lift judge needs to retry. New
TestFinishReasonFromOllama_PrefersDoneReason covers the fallback
ladder.
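The fallback ladder in sketch form; done / done_reason are Ollama's response fields, the function shape approximates the fix:

```go
type ollamaChunk struct {
	Done       bool   `json:"done"`
	DoneReason string `json:"done_reason"`
}

// finishReasonFromOllama prefers the explicit done_reason ("stop", "length",
// ...) and only infers from the done bool when the field is absent, so
// truncation ("length") is no longer misreported as "stop".
func finishReasonFromOllama(c ollamaChunk) string {
	if c.DoneReason != "" {
		return c.DoneReason
	}
	if c.Done {
		return "stop"
	}
	return ""
}
```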
INFOs:
- B-5: replace hand-rolled insertion sort in Registry.Names with
sort.Strings (Opus called the "avoid sort import" comment a
false economy — correct).
- A-1: clarify the playbook_lift.sh comment around -judge "" arg
passing (Opus noted the comment said "env priority" but didn't
reflect that the empty arg also passes through the Go driver's
resolution chain).
False positives dismissed (3, documented in disposition.md):
- Kimi: TestMaybeDowngrade_WithConfigList wrong assertion (test IS
correct per design — model excluded from weak list = strong = downgrade)
- Qwen: nil-deref claim (defensive code already handles nil)
- Opus: qwen3.5:latest doesn't exist on Ollama hub (true on the
public hub but local install has it)
just verify: PASS. chatd_smoke 6/6 PASS. New regression tests:
3 (B-2, B-3, B-4 each get a focused test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
new cmd/chatd on :3220 routes /v1/chat to the right provider based
on model-name prefix or :cloud suffix. closes the architectural gap
named in lakehouse.toml [models]: tiers map to model IDs, but until
phase 4 there was no service that could actually CALL those models
from Go.
routing rules (registry.Resolve; dispatch sketched after this list):
ollama/<m> → local Ollama (prefix stripped)
ollama_cloud/<m> → Ollama Cloud
<m>:cloud → Ollama Cloud (suffix variant — kimi-k2.6:cloud)
openrouter/<v>/<m> → OpenRouter (prefix stripped, OpenAI-compat)
opencode/<m> → OpenCode unified Zen+Go
kimi/<m> → Kimi For Coding (api.kimi.com/coding/v1)
bare names → local Ollama (default)
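A condensed sketch of that dispatch; the provider names follow the table, while Provider, Registry, and the get helper are approximations of the real types in internal/chat:

```go
import (
	"fmt"
	"strings"
)

type Provider interface{} // stands in for chat.Provider

type Registry struct{ providers map[string]Provider }

func (r *Registry) get(name, model string) (Provider, string, error) {
	p, ok := r.providers[name]
	if !ok {
		return nil, "", fmt.Errorf("provider %q not registered", name)
	}
	return p, model, nil
}

// Resolve maps an incoming model string to (provider, upstream model name).
func (r *Registry) Resolve(model string) (Provider, string, error) {
	switch {
	case strings.HasPrefix(model, "ollama/"):
		return r.get("ollama", strings.TrimPrefix(model, "ollama/"))
	case strings.HasPrefix(model, "ollama_cloud/"):
		return r.get("ollama_cloud", strings.TrimPrefix(model, "ollama_cloud/"))
	case strings.HasSuffix(model, ":cloud"):
		return r.get("ollama_cloud", strings.TrimSuffix(model, ":cloud"))
	case strings.HasPrefix(model, "openrouter/"):
		return r.get("openrouter", strings.TrimPrefix(model, "openrouter/"))
	case strings.HasPrefix(model, "opencode/"):
		return r.get("opencode", strings.TrimPrefix(model, "opencode/"))
	case strings.HasPrefix(model, "kimi/"):
		return r.get("kimi", strings.TrimPrefix(model, "kimi/"))
	default:
		return r.get("ollama", model) // bare or unknown-prefixed names: local Ollama, untouched
	}
}
```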
provider implementations:
- internal/chat/types.go Provider interface, Request/Response, errors
- internal/chat/registry.go prefix + :cloud suffix dispatch
- internal/chat/ollama.go local Ollama via /api/chat (think=false default)
- internal/chat/ollama_cloud.go Ollama Cloud via /api/generate (Bearer auth)
- internal/chat/openai_compat.go shared OpenAI Chat Completions for the
OpenRouter/OpenCode/Kimi family
- internal/chat/builder.go BuildRegistry from BuilderInput;
ResolveKey reads env then .env file fallback
config:
- ChatdConfig in internal/shared/config.go with bind, ollama_url,
per-provider key env names + .env fallback paths, timeout
- Gateway gains chatd_url + /v1/chat + /v1/chat/* routes
- lakehouse.toml [chatd] block with /etc/lakehouse/<provider>.env defaults
tests (19 in internal/chat):
- registry: prefix + :cloud + errors + telemetry + provider listing
- ollama: happy path + prefix strip + format=json + 500 mapping +
flatten_messages
- openai_compat: happy path + format=json + 429 mapping + zero-choices
think=false default in ollama + ollama_cloud — local hot path skips
reasoning, low-budget callers (the playbook_lift judge at max_tokens=10)
get direct answers instead of empty content + done_reason=length.
proven via chatd_smoke acceptance.
acceptance gate: scripts/chatd_smoke.sh — 6/6 PASS:
1. /v1/chat/providers lists exactly registered providers (1 in dev mode)
2. bare model → ollama default with content + token counts + latency
3. explicit ollama/<m> → prefix stripped at upstream
4. <m>:cloud without ollama_cloud registered → 404 (no silent fall-through)
5. unknown/<m> → falls through to default → upstream 502 (no prefix rewrite)
6. missing model field → 400
just verify: PASS (vet + 30 packages × short tests + 9 smokes).
chatd_smoke is a domain smoke (not in just verify, mirrors matrix /
observer / pathway pattern).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
migrate the reality-test harness's judge-model default from a
hardcoded "qwen3.5:latest" string to cfg.Models.LocalJudge.
resolution priority: explicit -judge flag > $JUDGE_MODEL env >
cfg.Models.LocalJudge from lakehouse.toml > hardcoded fallback.
bumping the judge for run #N+1 now means editing one line in
lakehouse.toml [models].local_judge — no Go file or shell script
edits required.
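The Go side of that chain, roughly (the function name is ours; the fallback literal matches the DefaultConfig behaviour described below):

```go
import "os"

// resolveJudge: explicit -judge flag > $JUDGE_MODEL > [models].local_judge
// from lakehouse.toml > hardcoded fallback.
func resolveJudge(flagValue, cfgLocalJudge string) string {
	if flagValue != "" {
		return flagValue
	}
	if env := os.Getenv("JUDGE_MODEL"); env != "" {
		return env
	}
	if cfgLocalJudge != "" {
		return cfgLocalJudge
	}
	return "qwen3.5:latest" // DefaultConfig fallback
}
```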
changes:
- scripts/playbook_lift/main.go: -config flag added, judge default
flips to "" so resolution chain runs. Imports internal/shared for
config loader.
- scripts/playbook_lift.sh: JUDGE_MODEL no longer defaulted in bash;
EFFECTIVE_JUDGE resolved by mirror-of-the-Go-chain (env > config
grep > qwen3.5:latest fallback). Used for the Ollama presence
check + report header. Pre-flight grep avoids requiring jq just
to read the toml.
- reports/reality-tests/README.md: documents the 4-step priority
chain.
verified all 4 paths produce the expected judge:
- config (no env): qwen3.5:latest (from lakehouse.toml)
- env override: env wins
- flag override: flag wins over env
- missing config: DefaultConfig fallback still gives qwen3.5:latest
just verify PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
rates top-K → record playbook entry pointing at the highest-rated
result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
ranking shift. Lift = real if recorded answer becomes top-1.
Files:
- scripts/playbook_lift/main.go driver (391 LoC)
- scripts/playbook_lift.sh stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt query corpus (5 placeholders;
J writes real 20+)
- reports/reality-tests/README.md framework + interpretation
- .gitignore: track reports/reality-tests/ but ignore per-run JSON evidence
This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.
Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
distillation.score, drift.scorer)
Lands the workflow.Mode adapters for the §3.4 components + the
distillation scorer + drift quantifier. Workflows can now compose
real measurement capabilities; the substrate's parallel
capabilities become composable Lego bricks (per the prior commit's
closing insight).
Modes registered (in observerd's registerBuiltinModes):
Pure-function wrappers (no I/O):
- matrix.relevance → matrix.FilterChunks
- matrix.downgrade → matrix.MaybeDowngrade
- distillation.score → distillation.ScoreRecord
- drift.scorer → drift.ComputeScorerDrift
HTTP-backed:
- matrix.search → POST matrixd /matrix/search
(registered only when matrixd_url is set)
Fixture (kept from §3.8 first slice):
- fixture.echo, fixture.upper
internal/workflow/modes.go:
Each mode follows the same glue pattern: marshal generic input
through a typed struct (free schema validation + clear error
messages), call the underlying capability, return a generic
output map. Roundtrip-via-JSON gives us schema validation
without writing custom field-by-field coercion.
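The glue pattern in sketch form, using matrix.downgrade as the example; the generic map-in/map-out shape and the roundtrip-via-JSON validation are the point, field names and the stubbed capability are approximations:

```go
import (
	"encoding/json"
	"fmt"
)

// downgradeInput is the typed view of the generic node input; unmarshalling
// the map through it gives schema validation and clear error messages.
type downgradeInput struct {
	Track string `json:"track"`
	Model string `json:"model"`
}

// maybeDowngrade stands in for matrix.MaybeDowngrade (the real logic lives
// in internal/matrix); here it just echoes the track.
func maybeDowngrade(track, model string) string { _ = model; return track }

// matrixDowngradeMode: generic input → typed struct → capability → generic output.
func matrixDowngradeMode(in map[string]any) (map[string]any, error) {
	raw, err := json.Marshal(in)
	if err != nil {
		return nil, err
	}
	var typed downgradeInput
	if err := json.Unmarshal(raw, &typed); err != nil {
		return nil, fmt.Errorf("matrix.downgrade: bad input: %w", err)
	}
	if typed.Track == "" || typed.Model == "" {
		return nil, fmt.Errorf("matrix.downgrade: track and model are required")
	}
	return map[string]any{"track": maybeDowngrade(typed.Track, typed.Model)}, nil
}
```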
internal/workflow/modes_test.go (10 tests, all PASS):
- matrix.relevance filters adjacency pollution (Connector kept,
catalogd::Registry dropped — same headline as the relevance
smoke, run through the workflow mode)
- matrix.downgrade flips lakehouse→isolation on strong model;
keeps lakehouse on weak (qwen3.5:latest); errors on missing
fields
- distillation.score rates scrum_review attempt_1 as accepted;
rejects empty record
- drift.scorer reports zero drift on matched inputs; errors on
empty inputs slice
- matrix.search HTTP flow round-trips through httptest fake
matrixd; non-OK status surfaces a clear error
scripts/workflow_smoke.sh (5 assertions PASS, was 4):
New assertion #5: real-mode chain
matrix.downgrade (lakehouse + grok-4.1-fast → isolation)
→ distillation.score (scrum_review attempt_1 → accepted)
Proves §3.4 components compose through the workflow runner with
no fixture intermediation. Both nodes ran successfully, runner
recorded provenance, status=succeeded.
Mode listing assertion now expects 7 modes (5 real + 2 fixture)
instead of just the fixtures.
17-smoke regression all green. SPEC §3.8 acceptance gate G3.8.D
("Mode catalog dispatches matrix.search invocation to the matrixd
backend without going through HTTP") still pending — current path
goes through HTTP for matrix.search, which is the cleaner service-
mesh shape but slower than direct in-process. In-process dispatch
when matrixd is co-resident is a future optimization.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>