root 5d49967833 multi_coord_stress: full Langfuse coverage — every phase + every call
Phase 1c-only tracing (commit 7e6431e) was the proof-of-concept.
This commit threads tracing through every phase: baseline / fresh-
resume / inbox burst / surge / swap / merge / handover (verbatim +
paraphrase) / split / reissue. Each phase is a parent span; each
matrix.search / LLM call inside is a child span.

Refactor:
- One run-level trace is created at driver startup.
- New startPhase(name, hour, meta) helper emits a phase span as a
  child of the run trace; subsequent emitSpan calls nest under it.
- New tracedSearch(spanName, query, corpora, ...) wraps matrixSearch
  with span emission. Every search call site replaced with this so
  the input/output JSON (query, corpora, k, playbook, exclude_n →
  top-K ids, top1 distance, boost/inject counts) lands in Langfuse.
- Phase 4b's paraphrase generation also emits llm.paraphrase spans.
- Phase 1c's existing inline span emission converted to use the new
  helpers (no more inboxTraceID variable).

Run #011 result: trace landed at http://localhost:3001 with 111
observations attached. Span breakdown:
  phase.* parents:         9 (one per phase that ran)
  matrix.search.baseline:  10
  matrix.search.fresh_verify: 3 (top-1 confirmed for all 3 fresh)
  observerd.inbox.record:  6
  llm.parse_demand:        6
  matrix.search.inbox:     6
  llm.judge_top1:          6
  matrix.search.surge:     12
  matrix.search.swap_orig: 1
  matrix.search.swap_replace: 1
  matrix.search.merge:     6
  matrix.search.handover_verbatim: 4
  llm.paraphrase:          4
  matrix.search.handover_paraphrase: 4
  matrix.search.split:     4
  matrix.search.reissue:   12
  matrix.search.reissue_retrieval_only: 12
  ─────────────
  Total:                   111

Browse: http://localhost:3001 → Traces → "multi_coord_stress run"
Each phase is a collapsible section showing per-call timing and
input/output JSON. Operators can drill into any single retrieval
to see exactly what query was issued and what came back.

All other metrics held: diversity 0.026, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4, fresh-resume 3/3
at top-1 (two-tier index), 200-worker swap Jaccard 0.000.

This is the FULL TEST J asked for — every action in the run
visible in Langfuse, full input/output drilldown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:43:32 -05:00

reports/reality-tests — does the 5-loop substrate actually work?

Reality tests measure product outcomes, not substrate health. The 21 smokes prove the system runs; the proof harness proves the system delivers on the claims it makes; reality tests answer one question: does the small-model pipeline + matrix indexer + playbook give measurably better results than raw cosine?

This is the gate from project_small_model_pipeline_vision.md: "the playbook + matrix indexer must give the results we're looking for." Single load-bearing criterion. Throughput, scaling, code elegance are secondary.


What lives here

Each reality test is a numbered run that produces:

  • <test>_<NNN>.json — raw structured evidence (per-query data, summary metrics)
  • <test>_<NNN>.md — human-readable report with headline metrics, per-query table, honesty caveats, next moves

Runs are append-only. Earlier runs stay in tree as historical baseline.


Test catalog

playbook_lift_<NNN> — does the playbook actually lift the right answer?

Driver: scripts/playbook_lift.sh, bin/playbook_lift
Queries: tests/reality/playbook_lift_queries.txt
Pipeline: cold pass → LLM judge → playbook record → warm pass → measure ranking shift.

The headline question: when the LLM judge finds a better answer than cosine top-1, can the playbook boost it to top-1 on the next run? If yes, the learning loop closes; if no, the matrix layer + playbook is infrastructure for a thesis that doesn't pay rent.

See the run reports for honesty caveats — chiefly that the LLM judge IS the ground-truth proxy.
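
The ranking-shift measurement at the end of the pipeline can be sketched as follows. The type and field names are illustrative assumptions, not the driver's actual schema. The sketch counts a query as a discovery when the judge prefers something other than cold top-1, and as a lift when that preferred result is top-1 on the warm pass.

```go
package main

// QueryOutcome holds the three IDs compared for one query.
// Field names are hypothetical, chosen for this sketch only.
type QueryOutcome struct {
	ColdTop1  string // cosine top-1 on the cold pass
	JudgeBest string // result the LLM judge preferred on the cold pass
	WarmTop1  string // top-1 on the warm pass, after the playbook record
}

// Discovery reports whether the judge preferred something other than the
// cold top-1. An empty JudgeBest (no verdict) is not a discovery.
func (q QueryOutcome) Discovery() bool {
	return q.JudgeBest != "" && q.JudgeBest != q.ColdTop1
}

// Lift reports whether a discovery was promoted to top-1 on the warm pass.
func (q QueryOutcome) Lift() bool {
	return q.Discovery() && q.WarmTop1 == q.JudgeBest
}

// Tally counts discoveries and lifts across a run.
func Tally(outcomes []QueryOutcome) (discoveries, lifts int) {
	for _, q := range outcomes {
		if q.Discovery() {
			discoveries++
		}
		if q.Lift() {
			lifts++
		}
	}
	return discoveries, lifts
}
```

Queries where the judge agrees with cold top-1 contribute to neither count, so they do not dilute the lift rate.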


Running a reality test

# Defaults: judge resolved from lakehouse.toml [models].local_judge,
# workers limit 5000, run id 001
./scripts/playbook_lift.sh

# Re-run with a different judge to check inter-judge agreement
# (env JUDGE_MODEL overrides the config tier)
JUDGE_MODEL=qwen3:latest RUN_ID=002 ./scripts/playbook_lift.sh

# Smaller scale for fast iteration
WORKERS_LIMIT=1000 K=5 RUN_ID=dev ./scripts/playbook_lift.sh

Judge resolution priority (Phase 3, 2026-04-29):

  1. -judge flag on the Go driver (explicit override)
  2. JUDGE_MODEL env var (operator override)
  3. lakehouse.toml [models].local_judge (default)
  4. Hardcoded qwen3.5:latest (last-resort fallback if config missing)

This means model bumps land in lakehouse.toml, not in this script or the Go driver. Bumping local_judge to a stronger local model (e.g. when qwen4 ships) takes one line.
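
The priority order above amounts to "first non-empty value wins". A minimal Go sketch, assuming the three inputs have already been read (the -judge flag value, the JUDGE_MODEL env var, and [models].local_judge from lakehouse.toml); ResolveJudge and its source labels are hypothetical names, not the driver's actual API:

```go
package main

// fallbackJudge is the last-resort model named in the priority list.
const fallbackJudge = "qwen3.5:latest"

// ResolveJudge returns the judge model and a label for where it came from.
// Candidates are checked in priority order; empty strings mean "not set".
func ResolveJudge(flagVal, envVal, configVal string) (model, source string) {
	switch {
	case flagVal != "":
		return flagVal, "flag"
	case envVal != "":
		return envVal, "env"
	case configVal != "":
		return configVal, "config"
	default:
		return fallbackJudge, "fallback"
	}
}
```

Returning the source alongside the model makes it easy to log which tier decided the judge for a given run.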

Requires: Ollama on :11434 with nomic-embed-text + the resolved judge model loaded. Skips cleanly (exit 0) if Ollama is absent.


Interpreting results

Three lift-rate bands matter on the playbook_lift tests:

Lift rate (lifts / discoveries)   Verdict
≥ 50%                             Loop closes: playbook is doing real work, move to paraphrase queries
20-50%                            Lift exists but inconsistent: investigate boost math (score × 0.5) or judge variance
< 20%                             Loop is not pulling its weight: diagnose before adding more components

A separate concern: discovery rate (cold judge-best ≠ cold top-1). If discovery is itself rare (< 30% of queries), cosine is already close to optimal on this query distribution and the matrix+playbook layer has little headroom. That's not necessarily a bug — but it means the value gate has to come from somewhere else (multi-corpus retrieval, domain-specific tags, drift signal).
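
The lift-rate bands and the discovery-rate check can be expressed as a small Go sketch. Function names and verdict strings are illustrative assumptions. The sketch treats exactly 50% as "loop closes" and exactly 20% as the middle band, which is one reading of the table's boundaries.

```go
package main

// LiftRate returns lifts/discoveries. ok is false when there were no
// discoveries, in which case the rate is undefined.
func LiftRate(lifts, discoveries int) (rate float64, ok bool) {
	if discoveries <= 0 {
		return 0, false
	}
	return float64(lifts) / float64(discoveries), true
}

// Verdict maps a lift rate to one of the three bands.
func Verdict(rate float64) string {
	switch {
	case rate >= 0.5:
		return "loop closes"
	case rate >= 0.2:
		return "lift exists but inconsistent"
	default:
		return "loop is not pulling its weight"
	}
}

// LowHeadroom reports whether fewer than 30% of queries were discoveries.
// Integer arithmetic avoids floating-point edge cases at the boundary.
// It returns false when there are no queries to judge.
func LowHeadroom(discoveries, queries int) bool {
	return queries > 0 && discoveries*10 < queries*3
}
```

A run with low headroom can still show a high lift rate on very few discoveries, so the two numbers are best read together.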


What this is not

  • Not a benchmark. No comparison against external systems; only internal cold-vs-warm.
  • Not a regression gate. Each run is a snapshot. Scores will drift with corpus changes, judge updates, and playbook math tuning. Don't wire the just verify command to demand a minimum lift.
  • Not human-validated. The LLM judge is the ground-truth proxy. Sample 5-10 verdicts manually per run to confirm the judge isn't pathological.
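
For the manual spot-check, one way to pick which verdicts to review is a seeded random sample over the per-query entries. This is a sketch under assumptions: SampleIndices is a hypothetical helper, and it returns indices into whatever per-query list the run's JSON holds.

```go
package main

import "math/rand"

// SampleIndices returns up to n distinct indices in [0, total), chosen
// pseudo-randomly. The same seed yields the same sample, so a reviewer can
// reproduce which verdicts were checked.
func SampleIndices(total, n int, seed int64) []int {
	if n > total {
		n = total
	}
	if n <= 0 {
		return nil
	}
	r := rand.New(rand.NewSource(seed))
	return r.Perm(total)[:n]
}
```

Recording the seed next to the manual review notes keeps the sample auditable.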