root 7e6431e4fd langfuse: Go-side client + Phase 1c instrumentation

The Rust side has Langfuse tracing already (gateway/v1/langfuse_trace.rs);
this commit lands Go-side parity so the multi-coord stress harness can
emit traces visible at http://localhost:3001.

internal/langfuse/client.go:
- Minimal Trace + Span + Flush API mirroring what the Rust emitter
  uses. Auth: Basic over public_key:secret_key.
- Best-effort posture: errors are slog.Warn'd, never block calling
  paths. Same fail-open as observerd's persistor (ADR-005 Decision
  5.1) — observability is a witness, not a gate.
- Events are buffered and auto-flushed once 50 have accumulated; an
  explicit Flush() drains the remainder at process exit.
- Each Trace/Span returns its id so callers can build hierarchies.
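A minimal sketch of the buffering and fail-open behaviour described above. It is not the contents of internal/langfuse/client.go: the type and function names are hypothetical, and the batch shape (POST to /api/public/ingestion with "trace-create" / "span-create" events) is an assumption about the Langfuse ingestion API that should be checked against the Langfuse docs. Whether the real client retries failed batches is not stated; this sketch simply drops them.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
	"net/http"
	"sync"
	"time"
)

// event is one entry in an ingestion batch (shape assumed, see lead-in).
type event struct {
	ID        string         `json:"id"`
	Type      string         `json:"type"`
	Timestamp string         `json:"timestamp"`
	Body      map[string]any `json:"body"`
}

// Client buffers events and posts them in batches. Hypothetical API.
type Client struct {
	host, publicKey, secretKey string
	httpc                      *http.Client
	mu                         sync.Mutex
	buf                        []event
	seq                        int
}

const flushThreshold = 50 // auto-flush once this many events are buffered

func NewClient(host, publicKey, secretKey string) *Client {
	return &Client{host: host, publicKey: publicKey, secretKey: secretKey,
		httpc: &http.Client{Timeout: 2 * time.Second}}
}

// enqueue buffers one event and returns the id callers use to build hierarchies.
func (c *Client) enqueue(typ string, body map[string]any) string {
	now := time.Now().UTC()
	c.mu.Lock()
	c.seq++
	id := fmt.Sprintf("%s-%d-%d", typ, now.UnixNano(), c.seq)
	body["id"] = id
	c.buf = append(c.buf, event{ID: "evt-" + id, Type: typ,
		Timestamp: now.Format(time.RFC3339Nano), Body: body})
	full := len(c.buf) >= flushThreshold
	c.mu.Unlock()
	if full {
		c.Flush()
	}
	return id
}

func (c *Client) Trace(name string, tags []string) string {
	return c.enqueue("trace-create", map[string]any{"name": name, "tags": tags})
}

func (c *Client) Span(traceID, name string, start, end time.Time, input, output any) string {
	return c.enqueue("span-create", map[string]any{
		"traceId": traceID, "name": name, "input": input, "output": output,
		"startTime": start.UTC().Format(time.RFC3339Nano),
		"endTime":   end.UTC().Format(time.RFC3339Nano),
	})
}

// Flush posts the buffered batch. Errors are logged and swallowed (fail-open).
func (c *Client) Flush() {
	c.mu.Lock()
	batch := c.buf
	c.buf = nil
	c.mu.Unlock()
	if len(batch) == 0 {
		return
	}
	payload, err := json.Marshal(map[string]any{"batch": batch})
	if err != nil {
		slog.Warn("langfuse: marshal failed", "err", err)
		return
	}
	req, err := http.NewRequest(http.MethodPost, c.host+"/api/public/ingestion", bytes.NewReader(payload))
	if err != nil {
		slog.Warn("langfuse: bad request", "err", err)
		return
	}
	req.SetBasicAuth(c.publicKey, c.secretKey) // Basic over public_key:secret_key
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.httpc.Do(req)
	if err != nil {
		slog.Warn("langfuse: post failed, batch dropped", "err", err)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		slog.Warn("langfuse: unexpected status", "status", resp.StatusCode)
	}
}

func main() {
	c := NewClient("http://localhost:3001", "pk-hypothetical", "sk-hypothetical")
	start := time.Now()
	traceID := c.Trace("multi_coord_stress phase 1c inbox burst", []string{"stress", "phase-1c", "inbox"})
	c.Span(traceID, "matrix.search", start, time.Now(), map[string]any{"query": "example"}, map[string]any{"k": 8})
	c.Flush() // with no server listening this only logs a warning
}
```

The mutex is released before the HTTP call so a slow or unreachable Langfuse host never holds up callers that are only enqueueing.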

multi_coord_stress driver wiring:
- New --langfuse-env flag (default /etc/lakehouse/langfuse.env).
  Empty / missing / unparseable file → skip tracing with a logged
  warning; run still proceeds.
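The skip-on-bad-config behaviour could look roughly like the sketch below. The key names LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY are an assumption (the commit does not say which keys the env file uses), and all function names are hypothetical; only the flag name, its default path, and the localhost:3001 URL come from this commit.

```go
package main

import (
	"flag"
	"fmt"
	"log/slog"
	"os"
	"strings"
)

type langfuseConfig struct{ Host, PublicKey, SecretKey string }

// parseEnv reads KEY=VALUE lines, skipping blanks, comments and malformed lines.
func parseEnv(content string) map[string]string {
	out := map[string]string{}
	for _, line := range strings.Split(content, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		k, v, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		out[strings.TrimSpace(k)] = strings.Trim(strings.TrimSpace(v), `"'`)
	}
	return out
}

// configFromEnv fails when either key is absent (key names are assumed).
func configFromEnv(content string) (langfuseConfig, error) {
	kv := parseEnv(content)
	cfg := langfuseConfig{
		Host:      kv["LANGFUSE_HOST"],
		PublicKey: kv["LANGFUSE_PUBLIC_KEY"],
		SecretKey: kv["LANGFUSE_SECRET_KEY"],
	}
	if cfg.PublicKey == "" || cfg.SecretKey == "" {
		return langfuseConfig{}, fmt.Errorf("missing public or secret key")
	}
	if cfg.Host == "" {
		cfg.Host = "http://localhost:3001"
	}
	return cfg, nil
}

// loadLangfuse returns ok=false (and logs a warning) instead of failing the run.
func loadLangfuse(path string) (langfuseConfig, bool) {
	raw, err := os.ReadFile(path)
	if err != nil {
		slog.Warn("langfuse: env file unreadable, tracing skipped", "path", path, "err", err)
		return langfuseConfig{}, false
	}
	cfg, err := configFromEnv(string(raw))
	if err != nil {
		slog.Warn("langfuse: env file unusable, tracing skipped", "path", path, "err", err)
		return langfuseConfig{}, false
	}
	return cfg, true
}

func main() {
	envPath := flag.String("langfuse-env", "/etc/lakehouse/langfuse.env", "path to Langfuse env file")
	flag.Parse()
	if cfg, ok := loadLangfuse(*envPath); ok {
		fmt.Println("tracing enabled against", cfg.Host)
	} else {
		fmt.Println("tracing disabled; run proceeds")
	}
}
```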
- Phase 1c (inbox burst) now emits one parent trace + 4 spans per
  inbox event:
    1. observerd.inbox.record  (post to /v1/observer/inbox)
    2. llm.parse_demand        (qwen2.5 → structured fields)
    3. matrix.search           (parsed query → top-K)
    4. llm.judge_top1          (rate top-1 vs original body)
  Each span carries input/output JSON + start/end times so the
  Langfuse UI shows a full waterfall per event.
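As an illustration of the four-span shape, here is a sketch that times each step and records it under a shared trace id. The Recorder type and step helper are hypothetical stand-ins for the real client, and the step bodies are placeholders rather than the driver's actual HTTP and LLM calls; only the four span names come from this commit.

```go
package main

import (
	"fmt"
	"time"
)

// Span mirrors the fields the commit says each span carries.
type Span struct {
	TraceID, Name string
	Start, End    time.Time
	Input, Output any
}

// Recorder is an in-memory stand-in for the Langfuse client (hypothetical).
type Recorder struct{ Spans []Span }

// step runs fn, measures it, and records one span under traceID.
func (r *Recorder) step(traceID, name string, input any, fn func() any) any {
	start := time.Now()
	out := fn()
	r.Spans = append(r.Spans, Span{TraceID: traceID, Name: name,
		Start: start, End: time.Now(), Input: input, Output: out})
	return out
}

// traceInboxEvent emits the four spans for one inbox event, feeding each
// step's output into the next step's input. All outputs are placeholders.
func traceInboxEvent(r *Recorder, traceID, body string) {
	r.step(traceID, "observerd.inbox.record", body, func() any { return "recorded" })
	parsed := r.step(traceID, "llm.parse_demand", body, func() any {
		return map[string]string{"role": "placeholder"}
	})
	hits := r.step(traceID, "matrix.search", parsed, func() any {
		return []string{"worker-a", "worker-b"}
	})
	r.step(traceID, "llm.judge_top1", hits, func() any { return "placeholder rating" })
}

func main() {
	r := &Recorder{}
	for i := 0; i < 6; i++ {
		traceInboxEvent(r, "trace-1", fmt.Sprintf("inbox body %d", i))
	}
	fmt.Println(len(r.Spans)) // 6 events x 4 spans = 24
}
```

Recording start and end around each call is what gives the waterfall its bar widths; chaining outputs to inputs keeps the per-span JSON readable as a pipeline.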

Run #009 result:
  Trace landed: "multi_coord_stress phase 1c inbox burst"
  Observations attached: 24 (= 6 events × 4 spans)
  Tags: stress, phase-1c, inbox
  Browseable at http://localhost:3001 by tag query.

Other harness metrics: diversity 0.016, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4. All are unchanged
by the tracing addition, since trace posts are best-effort and run
in parallel with the harness.

Phase 1c is the proof-of-concept; future commits can wrap other
phases (baseline / merge / handover / split) in traces too. Once
that's done, the entire stress run becomes scrubbable in Langfuse
without grepping the events JSON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-30 16:25:03 -05:00

3.7 KiB


Multi-Coordinator Stress Test — Run 009

Generated: 2026-04-30T21:23:59.011167722Z
Coordinators: alice / bob / carol (each with own playbook namespace: playbook_alice / playbook_bob / playbook_carol)
Contracts: alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
Corpora: workers,ethereal_workers
K per query: 8
Total events captured: 67
Evidence: reports/reality-tests/multi_coord_stress_009.json

Diversity — is the system locking into scenarios or cycling?

Metric	Mean Jaccard	n pairs	Interpretation
Same role across different contracts	0.015873015873015872	9	Lower = more diverse (different region/cert mix → different workers)
Different roles within same contract	0.015343915343915345	18	Should be near-zero (different roles = different worker pools)

Healthy ranges:

Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
Different roles same contract: < 0.10 means role-specific retrieval is working.
If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
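For reference, the Jaccard score behind these numbers is |A ∩ B| / |A ∪ B| over two top-K worker-ID sets. The sketch below shows that computation and the thresholds listed above; the function names, the "borderline" label, and the choice to score two empty sets as 0 are assumptions, not the harness's actual code.

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two lists of worker IDs.
// Two empty lists score 0 here (an assumed convention).
func jaccard(a, b []string) float64 {
	setA := make(map[string]bool, len(a))
	for _, id := range a {
		setA[id] = true
	}
	setB := make(map[string]bool, len(b))
	for _, id := range b {
		setB[id] = true
	}
	inter := 0
	for id := range setB {
		if setA[id] {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

// meanJaccard averages the score over a list of top-K pairs.
func meanJaccard(pairs [][2][]string) float64 {
	if len(pairs) == 0 {
		return 0
	}
	sum := 0.0
	for _, p := range pairs {
		sum += jaccard(p[0], p[1])
	}
	return sum / float64(len(pairs))
}

// diversityVerdict applies the healthy ranges from the report.
func diversityVerdict(sameRoleAcrossContracts, diffRolesSameContract float64) string {
	switch {
	case sameRoleAcrossContracts > 0.50 || diffRolesSameContract > 0.50:
		return "cycling"
	case sameRoleAcrossContracts < 0.30 && diffRolesSameContract < 0.10:
		return "healthy"
	default:
		return "borderline" // label is ours; the report names no middle band
	}
}

func main() {
	a := []string{"w1", "w2", "w3", "w4"}
	b := []string{"w4", "w5", "w6", "w7"}
	fmt.Printf("%.3f\n", jaccard(a, b)) // 1 shared of 7 distinct: prints 0.143
	fmt.Println(diversityVerdict(0.015873015873015872, 0.015343915343915345)) // prints healthy
}
```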

Determinism — same query reissued, top-K stability

Metric	Value
Mean Jaccard on retrieval-only reissue	1
Number of reissue pairs	12

Interpretation:

≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
0.80 – 0.95: Some HNSW or embed variance, acceptable.
< 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

Metric	Value
Verbatim handover queries run	4
Alice's recorded answer at Bob's top-1 (verbatim)	4
Alice's recorded answer in Bob's top-K (verbatim)	4
Verbatim handover hit rate (top-1)	1
Paraphrase handover queries run	4
Alice's recorded answer at Bob's top-1 (paraphrase)	4
Alice's recorded answer in Bob's top-K (paraphrase)	4
Paraphrase handover hit rate (top-1)	1

Interpretation:

Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
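A sketch of how the top-1 and top-K hit rates in the table could be computed from per-query results. The handoverResult type and hitRates function are hypothetical; the harness's real data model is not shown in this report.

```go
package main

import "fmt"

// handoverResult pairs Alice's recorded answer with Bob's ranked top-K.
type handoverResult struct {
	RecordedID string
	TopK       []string
}

// hitRates returns the fraction of queries where the recorded answer is
// ranked first (top1) and the fraction where it appears anywhere (topK).
func hitRates(results []handoverResult) (top1, topK float64) {
	if len(results) == 0 {
		return 0, 0
	}
	var first, anywhere int
	for _, r := range results {
		for i, id := range r.TopK {
			if id == r.RecordedID {
				anywhere++
				if i == 0 {
					first++
				}
				break
			}
		}
	}
	n := float64(len(results))
	return float64(first) / n, float64(anywhere) / n
}

func main() {
	results := []handoverResult{
		{RecordedID: "w1", TopK: []string{"w1", "w2", "w3"}}, // hit at top-1
		{RecordedID: "w9", TopK: []string{"w4", "w5", "w9"}}, // hit, but only in top-K
	}
	top1, topK := hitRates(results)
	fmt.Printf("%.2f %.2f\n", top1, topK) // prints 0.50 1.00
}
```

In Run 009 both passes would score 1.00 on each measure, since all 4 verbatim and all 4 paraphrase queries placed Alice's answer at top-1.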

Per-event capture

All matrix.search responses live in the JSON: top-K with worker IDs, distances, and per-corpus counts. Example filters by phase, coordinator, and role:

jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_009.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_009.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_009.json
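The same filters can be expressed in Go when jq is not at hand. This is a sketch only: the struct covers just the fields the jq filters above reference, the type of id is left open because the report does not state it, and the embedded sample is made-up data, not content from multi_coord_stress_009.json.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type hit struct {
	ID any `json:"id"` // string or number; the report does not say which
}

// stressEvent holds only the fields used by the jq examples.
type stressEvent struct {
	Phase       string `json:"phase"`
	Coordinator string `json:"coordinator"`
	Role        string `json:"role"`
	Contract    string `json:"contract"`
	TopK        []hit  `json:"top_k"`
}

type report struct {
	Events []stressEvent `json:"events"`
}

// filterEvents decodes the report and keeps events for which keep returns true.
func filterEvents(raw []byte, keep func(stressEvent) bool) ([]stressEvent, error) {
	var r report
	if err := json.Unmarshal(raw, &r); err != nil {
		return nil, err
	}
	var out []stressEvent
	for _, e := range r.Events {
		if keep(e) {
			out = append(out, e)
		}
	}
	return out, nil
}

// sample is illustrative data shaped like the fields above.
const sample = `{"events":[
 {"phase":"baseline","coordinator":"alice","role":"warehouse worker",
  "contract":"alpha_milwaukee_distribution","top_k":[{"id":"w-1"},{"id":"w-2"}]},
 {"phase":"merge","coordinator":"bob","role":"warehouse worker",
  "contract":"beta_indianapolis_manufacturing","top_k":[{"id":"w-3"}]}
]}`

func main() {
	merged, err := filterEvents([]byte(sample), func(e stressEvent) bool {
		return e.Phase == "merge"
	})
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, e := range merged {
		fmt.Println(e.Coordinator, e.Contract, len(e.TopK))
	}
}
```

To run it against a real report, read the file with os.ReadFile and pass the bytes in place of the sample.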

What's NOT in this run (Phase 1 deliberately defers)

48-hour clock. Events fire as discrete steps, not on a timeline.
Email / SMS ingest. No endpoints exist on the Go side yet.
New-resume injection mid-run. The corpus is fixed at the start.
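Langfuse traces beyond Phase 1c. Only the inbox burst is instrumented on the Go side so far; the baseline / merge / handover / split phases still need wiring.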

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
