root 5d49967833 multi_coord_stress: full Langfuse coverage — every phase + every call

Phase 1c-only tracing (commit 7e6431e) was the proof-of-concept.
This commit threads tracing through every phase: baseline / fresh-
resume / inbox burst / surge / swap / merge / handover (verbatim +
paraphrase) / split / reissue. Each phase is a parent span; each
matrix.search / LLM call inside is a child span.

Refactor:
- One run-level trace is created at driver startup.
- New startPhase(name, hour, meta) helper emits a phase span as a
  child of the run trace; subsequent emitSpan calls nest under it.
- New tracedSearch(spanName, query, corpora, ...) wraps matrixSearch
  with span emission. Every search call site replaced with this so
  the input/output JSON (query, corpora, k, playbook, exclude_n →
  top-K ids, top1 distance, boost/inject counts) lands in Langfuse.
- Phase 4b's paraphrase generation also emits llm.paraphrase spans.
- Phase 1c's existing inline span emission converted to use the new
  helpers (no more inboxTraceID variable).

Run #011 result: trace landed at http://localhost:3001 with 111
observations attached. Span breakdown:
  phase.* parents:         9 (one per phase that ran)
  matrix.search.baseline:  10
  matrix.search.fresh_verify: 3 (top-1 confirmed for all 3 fresh)
  observerd.inbox.record:  6
  llm.parse_demand:        6
  matrix.search.inbox:     6
  llm.judge_top1:          6
  matrix.search.surge:     12
  matrix.search.swap_orig: 1
  matrix.search.swap_replace: 1
  matrix.search.merge:     6
  matrix.search.handover_verbatim: 4
  llm.paraphrase:          4
  matrix.search.handover_paraphrase: 4
  matrix.search.split:     4
  matrix.search.reissue:   12
  matrix.search.reissue_retrieval_only: 12
  ─────────────
  Total:                   111

Browse: http://localhost:3001 → Traces → "multi_coord_stress run"
Each phase is a collapsible section showing per-call timing and
input/output JSON. Operators can drill into any single retrieval
to see exactly what query was issued and what came back.

All other metrics held: diversity 0.026, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4, fresh-resume 3/3
at top-1 (two-tier index), 200-worker swap Jaccard 0.000.

This is the FULL TEST J asked for — every action in the run
visible in Langfuse, full input/output drilldown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-30 16:43:32 -05:00

3.7 KiB

Raw Blame History

Multi-Coordinator Stress Test — Run 011

Generated: 2026-04-30T21:41:26.801002955Z Coordinators: alice / bob / carol (each with own playbook namespace: playbook_alice / playbook_bob / playbook_carol) Contracts: alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction Corpora: workers,ethereal_workers K per query: 8 Total events captured: 67 Evidence: reports/reality-tests/multi_coord_stress_011.json

Diversity — is the system locking into scenarios or cycling?

Metric	Mean Jaccard	n pairs	Interpretation
Same role across different contracts	0.025641025641025644	9	Lower = more diverse (different region/cert mix → different workers)
Different roles within same contract	0.06996336996336996	18	Should be near-zero (different roles = different worker pools)

Healthy ranges:

Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
Different roles same contract: < 0.10 means role-specific retrieval is working.
If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.

Determinism — same query reissued, top-K stability

Metric	Value
Mean Jaccard on retrieval-only reissue	1
Number of reissue pairs	12

Interpretation:

≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
0.80 – 0.95: Some HNSW or embed variance, acceptable.
< 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

Metric	Value
Verbatim handover queries run	4
Alice's recorded answer at Bob's top-1 (verbatim)	4
Alice's recorded answer in Bob's top-K (verbatim)	4
Verbatim handover hit rate (top-1)	1
Paraphrase handover queries run	4
Alice's recorded answer at Bob's top-1 (paraphrase)	4
Alice's recorded answer in Bob's top-K (paraphrase)	4
Paraphrase handover hit rate (top-1)	1

Interpretation:

Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.

Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:

jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_011.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_011.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_011.json

What's NOT in this run (Phase 1 deliberately defers)

48-hour clock. Events fire as discrete steps, not on a timeline.
Email / SMS ingest. No endpoints exist on the Go side yet.
New-resume injection mid-run. The corpus is fixed at the start.
Langfuse traces. Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.

3.7 KiB Raw Blame History Unescape Escape