Compare commits

13 Commits

Author SHA1 Message Date
root
f971e64745 g2_smoke: accept nomic-embed-text* family members as default
Pre-push hook caught the regression — the smoke hardcoded
MODEL = "nomic-embed-text" and the bump to nomic-embed-text-v2-moe
in 4da32ad failed the gate.

Fix: glob-match the family prefix (nomic-embed-text*). Both v1 and
v2-moe are 768d drop-ins; the property the smoke is locking is
dim + distinct-vectors, not the exact model variant. Operators
swap the variant in lakehouse.toml without needing to touch the
smoke.
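
A minimal sketch of the family match described above, written in Go for illustration. The smoke's actual language and helper names are not shown in this commit, so `inFamily` and the use of `path.Match` are assumptions.

```go
package main

import (
	"fmt"
	"path"
)

// familyPattern is the glob named in the commit message. Evaluating it
// with path.Match is an assumption of this sketch.
const familyPattern = "nomic-embed-text*"

// inFamily reports whether model belongs to the nomic-embed-text family.
func inFamily(model string) bool {
	ok, err := path.Match(familyPattern, model)
	return err == nil && ok
}

func main() {
	for _, m := range []string{"nomic-embed-text", "nomic-embed-text-v2-moe", "qwen3-embedding"} {
		fmt.Printf("%-26s in family: %v\n", m, inFamily(m))
	}
}
```

The gate would still assert dim and distinct vectors separately; the pattern only decides which model names are acceptable defaults.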

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:37:20 -05:00
root
db2e57402e STATE_OF_PLAY: capture multi-coord stress wave (Phase 1-3 verified)
Anchor was last touched at v4 split-threshold; since then the
multi-coord stress harness landed end-to-end across 11 commits.
Future sessions reading this file need to see the verified state,
not derive from git log.

Major additions:
- New "Multi-coordinator stress test (Phase 1 → 3)" section in
  VERIFIED WORKING. 12-row capability table covering per-coord
  playbook isolation, diversity metrics, paraphrase handover,
  ExcludeIDs swap, fresh-resume two-tier, inbox endpoints, LLM
  demand parsing, judge re-rating, Langfuse tracing.
- Substrate-gains list under that section: ExcludeIDs on
  SearchRequest, observer.SourceInbox + /observer/inbox,
  internal/langfuse client, embedd default bumped to v2-moe,
  two-tier fresh_workers index pattern.
- Last-verified bumped to 16:42 CDT on the run #011 anchor.

DO NOT RELITIGATE expanded with six new locks:
1. Boost / inject use SEPARATE thresholds (0.5 / 0.20)
2. Multi-coord product theory is empirically VALIDATED
3. Fresh content uses two-tier indexing (fresh_workers)
4. embedd.default_model = nomic-embed-text-v2-moe (don't downgrade)
5. Inbox flow: parse + search + judge + trace
6. Langfuse Go-side client lives at internal/langfuse/

OPEN list refresh:
- Removed: re-judge metric (shipped as b13b5cd), adjacent-query as
  separate item (folded into a single "judge-approves-before-inject"
  follow-up), liberal-paraphrase (kept).
- Added: real-time 48-hour clock, wider Langfuse instrumentation,
  periodic fresh→main merge job.

RECENT VERIFIED WAVE table extended with 11 commits (b13b5cd..5d49967).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:30:04 -05:00
root
5d49967833 multi_coord_stress: full Langfuse coverage — every phase + every call
Phase 1c-only tracing (commit 7e6431e) was the proof-of-concept.
This commit threads tracing through every phase: baseline / fresh-
resume / inbox burst / surge / swap / merge / handover (verbatim +
paraphrase) / split / reissue. Each phase is a parent span; each
matrix.search / LLM call inside is a child span.

Refactor:
- One run-level trace is created at driver startup.
- New startPhase(name, hour, meta) helper emits a phase span as a
  child of the run trace; subsequent emitSpan calls nest under it.
- New tracedSearch(spanName, query, corpora, ...) wraps matrixSearch
  with span emission. Every search call site replaced with this so
  the input/output JSON (query, corpora, k, playbook, exclude_n →
  top-K ids, top1 distance, boost/inject counts) lands in Langfuse.
- Phase 4b's paraphrase generation also emits llm.paraphrase spans.
- Phase 1c's existing inline span emission converted to use the new
  helpers (no more inboxTraceID variable).
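
The run → phase → call nesting from the refactor can be sketched with an in-memory recorder. The helper names `startPhase` and `tracedSearch` come from the list above; the `tracer` and `span` types, the id format, and the simplified signatures are hypothetical stand-ins for the real driver code.

```go
package main

import "fmt"

// span is a hypothetical in-memory stand-in for a Langfuse observation.
type span struct {
	ID, ParentID, Name string
}

// tracer records spans in memory; a real client would post them to Langfuse.
type tracer struct {
	runID   string // run-level trace created at driver startup
	phaseID string // current phase span; later spans nest under it
	spans   []span
	next    int
}

func newTracer(runID string) *tracer { return &tracer{runID: runID} }

// add records one span under parent and returns its id.
func (t *tracer) add(parent, name string) string {
	t.next++
	id := fmt.Sprintf("span-%d", t.next)
	t.spans = append(t.spans, span{ID: id, ParentID: parent, Name: name})
	return id
}

// startPhase emits a phase span as a child of the run trace.
func (t *tracer) startPhase(name string) string {
	t.phaseID = t.add(t.runID, "phase."+name)
	return t.phaseID
}

// tracedSearch runs search and emits a child span under the current phase.
func (t *tracer) tracedSearch(spanName, query string, search func(string) []string) []string {
	ids := search(query)
	t.add(t.phaseID, spanName)
	return ids
}

func main() {
	t := newTracer("run-trace") // hypothetical trace id
	t.startPhase("baseline")
	fake := func(string) []string { return []string{"w-1", "w-2"} }
	t.tracedSearch("matrix.search.baseline", "warehouse worker Milwaukee", fake)
	for _, s := range t.spans {
		fmt.Printf("%s parent=%s name=%s\n", s.ID, s.ParentID, s.Name)
	}
}
```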

Run #011 result: trace landed at http://localhost:3001 with 111
observations attached. Span breakdown:
  phase.* parents:         9 (one per phase that ran)
  matrix.search.baseline:  10
  matrix.search.fresh_verify: 3 (top-1 confirmed for all 3 fresh)
  observerd.inbox.record:  6
  llm.parse_demand:        6
  matrix.search.inbox:     6
  llm.judge_top1:          6
  matrix.search.surge:     12
  matrix.search.swap_orig: 1
  matrix.search.swap_replace: 1
  matrix.search.merge:     6
  matrix.search.handover_verbatim: 4
  llm.paraphrase:          4
  matrix.search.handover_paraphrase: 4
  matrix.search.split:     4
  matrix.search.reissue:   12
  matrix.search.reissue_retrieval_only: 12
  ─────────────
  Total:                   111

Browse: http://localhost:3001 → Traces → "multi_coord_stress run"
Each phase is a collapsible section showing per-call timing and
input/output JSON. Operators can drill into any single retrieval
to see exactly what query was issued and what came back.

All other metrics held: diversity 0.026, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4, fresh-resume 3/3
at top-1 (two-tier index), 200-worker swap Jaccard 0.000.

This is the FULL TEST J asked for — every action in the run
visible in Langfuse, full input/output drilldown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:43:32 -05:00
root
08a086779b multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1
Runs #003-#009 surfaced the same finding: fresh workers added
mid-run to the main 'workers' vectord index (5K items) reliably
*absorbed* (HTTP 200) but failed to *surface* in semantic queries
even with content-matching prompts. Distances on the verify queries
sat at 0.25-0.65 against existing workers; fresh items were beyond
top-K. Better embedder (v2-moe) didn't help: top-1 distances on
existing items rose (0.25-0.39 → 0.43-0.65), yet fresh items still fell outside top-K.

Root cause: coder/hnsw incremental adds to a populated graph land
in poorly-connected regions and disappear from search traversal.
Known property of HNSW post-build adds; not a bug.

Fix: two-tier index pattern (canonical NRT search architecture).
Fresh content goes to a small "hot" corpus (fresh_workers); main
queries include it in the corpora list and merge results. Hot corpus
has no recall crowding because it's tiny; periodic batch job (post-
G3) merges it into the main index.
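
As a rough illustration of the query-time merge, the sketch below combines hits from the main and hot corpora and keeps the k closest. The `hit` type and `mergeTopK` are hypothetical names; the w-* distances are illustrative, while the fresh-001 distance is the run #010 value.

```go
package main

import (
	"fmt"
	"sort"
)

// hit is a hypothetical search result shape.
type hit struct {
	ID       string
	Corpus   string
	Distance float64
}

// mergeTopK concatenates per-corpus hits and keeps the k closest by distance.
func mergeTopK(k int, perCorpus ...[]hit) []hit {
	var all []hit
	for _, hs := range perCorpus {
		all = append(all, hs...)
	}
	sort.SliceStable(all, func(i, j int) bool { return all[i].Distance < all[j].Distance })
	if len(all) > k {
		all = all[:k]
	}
	return all
}

func main() {
	mainHits := []hit{
		{ID: "w-100", Corpus: "workers", Distance: 0.25}, // illustrative
		{ID: "w-200", Corpus: "workers", Distance: 0.31}, // illustrative
	}
	freshHits := []hit{
		{ID: "fresh-001", Corpus: "fresh_workers", Distance: 0.143},
	}
	for i, h := range mergeTopK(2, mainHits, freshHits) {
		fmt.Printf("rank %d: %s (%s) %.3f\n", i+1, h.ID, h.Corpus, h.Distance)
	}
}
```

Because the hot corpus is searched on its own, a fresh item only has to beat the main index on distance at merge time; it never depends on graph connectivity inside the large index.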

Implementation:
- ensureFreshIndex(hc, gw, name, dim) — idempotent POST
  /v1/vectors/index. 409 from re-create treated as "already there."
- ingestFreshWorker now takes idx parameter so callers can target
  fresh_workers instead of workers.
- multi_coord_stress phase 1b creates fresh_workers index + ingests
  3 fresh workers there + searches verifyCorpora=[workers,
  ethereal_workers, fresh_workers].
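
A hedged sketch of the idempotent create: POST to /v1/vectors/index and treat 409 as success. The request body shape (`name`, `dim`) and the fake gateway are assumptions for illustration; only the route and the 409 handling come from the notes above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// ensureIndex creates an index and treats 409 Conflict as "already there".
// The JSON body shape is an assumption of this sketch.
func ensureIndex(hc *http.Client, gateway, name string, dim int) error {
	body, err := json.Marshal(map[string]any{"name": name, "dim": dim})
	if err != nil {
		return err
	}
	resp, err := hc.Post(gateway+"/v1/vectors/index", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	switch {
	case resp.StatusCode >= 200 && resp.StatusCode < 300:
		return nil
	case resp.StatusCode == http.StatusConflict:
		return nil // index already exists; re-create is a no-op
	default:
		return fmt.Errorf("create index %q: unexpected status %d", name, resp.StatusCode)
	}
}

// fakeGateway answers 201 on the first create and 409 afterwards.
func fakeGateway() *httptest.Server {
	var mu sync.Mutex
	created := false
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		if created {
			w.WriteHeader(http.StatusConflict)
			return
		}
		created = true
		w.WriteHeader(http.StatusCreated)
	}))
}

func main() {
	srv := fakeGateway()
	defer srv.Close()
	fmt.Println("first call: ", ensureIndex(srv.Client(), srv.URL, "fresh_workers", 768))
	fmt.Println("second call:", ensureIndex(srv.Client(), srv.URL, "fresh_workers", 768))
}
```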

Run #010 result:
  fresh-001 (Senior tower crane rigger NCCCO Chicago)
    top-1: fresh-001 from fresh_workers, distance 0.143
  fresh-002 (Bilingual Spanish/English OSHA trainer Indianapolis)
    top-1: fresh-002 from fresh_workers, distance 0.146
  fresh-003 (FAA Part 107 drone surveyor Chicago)
    top-1: fresh-003 from fresh_workers, distance 0.129

3/3 fresh workers surface at top-1 — the absorption-but-not-
findable issue from runs #003-#009 is closed.

All other metrics held: diversity 0.007, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4, swap Jaccard 0.000,
inbox burst all 6 events accepted + traced to Langfuse.

This is the final structural fix for the multi-coord stress
suite. Phase 3 is feature-complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:31:45 -05:00
root
7e6431e4fd langfuse: Go-side client + Phase 1c instrumentation
The Rust side has Langfuse tracing already (gateway/v1/langfuse_trace.rs);
this commit lands Go-side parity so the multi-coord stress harness can
emit traces visible at http://localhost:3001.

internal/langfuse/client.go:
- Minimal Trace + Span + Flush API mirroring what the Rust emitter
  uses. Auth: Basic over public_key:secret_key.
- Best-effort posture: errors are slog.Warn'd, never block calling
  paths. Same fail-open as observerd's persistor (ADR-005 Decision
  5.1) — observability is a witness, not a gate.
- Events buffered until 50, then auto-flushed; explicit Flush() at
  process exit.
- Each Trace/Span returns its id so callers can build hierarchies.

multi_coord_stress driver wiring:
- New --langfuse-env flag (default /etc/lakehouse/langfuse.env).
  Empty / missing / unparseable file → skip tracing with a logged
  warning; run still proceeds.
- Phase 1c (inbox burst) now emits one parent trace + 4 spans per
  inbox event:
    1. observerd.inbox.record  (post to /v1/observer/inbox)
    2. llm.parse_demand        (qwen2.5 → structured fields)
    3. matrix.search           (parsed query → top-K)
    4. llm.judge_top1          (rate top-1 vs original body)
  Each span carries input/output JSON + start/end times so the
  Langfuse UI shows a full waterfall per event.

Run #009 result:
  Trace landed: "multi_coord_stress phase 1c inbox burst"
  Observations attached: 24 (= 6 events × 4 spans)
  Tags: stress, phase-1c, inbox
  Browseable at http://localhost:3001 by tag query.

Other harness metrics: diversity 0.016, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4 — all unchanged
by the tracing addition (best-effort post in parallel).

Phase 1c is the proof-of-concept; future commits can wrap other
phases (baseline / merge / handover / split) in traces too. Once
that's done, the entire stress run becomes scrubbable in Langfuse
without grepping the events JSON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:25:03 -05:00
root
ce940f4a14 multi_coord_stress: judge re-rates inbox top-1 — recovers honesty signal
Run #007 surfaced a tradeoff: LLM-parsed inbox queries produce much
tighter cosine distances (0.05-0.10 in three cases) but lose the
"system has no good match" signal that high-distance results give.
A coordinator UI showing only distance can't tell wrong-domain
matches apart from real ones.

Fix: judge re-rates top-1 against the ORIGINAL inbox body (not the
LLM-parsed query). Coordinators see both:
  - distance: how close was retrieval in vector space
  - rating:   does this person actually fit the original ask
The pair tells the honest story.

Run #008 result on the 6 inbox events:

  Demand                Top-1     Distance  Rating  Reading
  ─────────────────────────────────────────────────────────────
  Forklift Cleveland    w-3573    0.29      4       Strong
  Production Indy       e-1764    0.41      3       Adjacent
  Crane Chicago         e-7798    0.23      1       TIGHT BUT WRONG
  Bilingual safety Indy w-3918    0.05      5       Perfect
  Drone Chicago         e-1058    0.06      5       Perfect (verify e-1058)
  Warehouse Milwaukee   w-460     0.32      4       Strong

The crane-Chicago case is the architectural-honesty signal at work:
distance 0.23 says "tight match" but the judge, reading the original
body, gives rating 1. A coordinator seeing only distance would ship the
wrong worker; a coordinator seeing distance+rating sees the
disagreement and escalates.
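
One way a UI could act on the distance/rating pair is sketched below. The cutoffs `tightDistance` and `minGoodRating` are hypothetical values chosen for illustration; this commit does not define an escalation rule. The two sample rows are taken from the run #008 table.

```go
package main

import "fmt"

// Hypothetical cutoffs for illustration only.
const (
	tightDistance = 0.25 // at or below this, retrieval looks confident
	minGoodRating = 3    // judge ratings below this mean a poor fit
)

type inboxResult struct {
	Demand   string
	Top1     string
	Distance float64
	Rating   int // 1..5 from the judge, rated against the original body
}

// needsEscalation flags a tight vector match that the judge rejects.
func needsEscalation(r inboxResult) bool {
	return r.Distance <= tightDistance && r.Rating < minGoodRating
}

func main() {
	results := []inboxResult{
		{Demand: "Crane Chicago", Top1: "e-7798", Distance: 0.23, Rating: 1},
		{Demand: "Bilingual safety Indy", Top1: "w-3918", Distance: 0.05, Rating: 5},
	}
	for _, r := range results {
		fmt.Printf("%-22s %s dist=%.2f rating=%d escalate=%v\n",
			r.Demand, r.Top1, r.Distance, r.Rating, needsEscalation(r))
	}
}
```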

Net distribution: 5/6 rated 3+ (acceptable→perfect), 1/6 rated 1
(irrelevant despite tight cosine). The substrate-honesty signal is
recovered without losing the LLM-parse quality wins.

Cost: 6 extra judge calls (~9s on qwen2.5). Production amortizes
when judge runs only on top-1 of high-priority inbox events; the
search-cost-vs-quality tradeoff lives in the priority gate.

Implementation:
- New JudgeRating int field on Event (omitempty so non-judged
  events stay clean in JSON)
- New judgeInboxResult helper, reusing the same prompt structure as
  playbook_lift's judgeRate. The two could share an internal package
  if a third judge consumer appears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:16:49 -05:00
root
186d209aae multi_coord_stress: LLM-parsed inbox demands (qwen2.5)
Replaced the hard-coded DemandQuery on inbox events with an actual
LLM call: each email/SMS body is parsed by qwen2.5 (format=json,
schema-anchored) into structured {role, count, location, certs,
skills, shift}. The driver then composes a query string from those
fields and runs matrix.search.
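
A minimal sketch of the parse-then-compose step, assuming the LLM returns JSON with the six fields named above. The `demand` struct, `parseDemand`, and the field ordering inside `composeQuery` are assumptions; the driver's actual composition rule is not shown in this commit.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// demand mirrors the structured fields the parser is asked to return.
type demand struct {
	Role     string   `json:"role"`
	Count    int      `json:"count"`
	Location string   `json:"location"`
	Certs    []string `json:"certs"`
	Skills   []string `json:"skills"`
	Shift    string   `json:"shift"`
}

// parseDemand decodes the model's JSON reply.
func parseDemand(raw string) (demand, error) {
	var d demand
	err := json.Unmarshal([]byte(raw), &d)
	return d, err
}

// composeQuery joins the non-empty fields into one search string.
// The ordering (role, location, certs, skills, shift) is an assumption.
func composeQuery(d demand) string {
	parts := []string{d.Role, d.Location}
	parts = append(parts, d.Certs...)
	parts = append(parts, d.Skills...)
	if d.Shift != "" {
		parts = append(parts, d.Shift+" shift")
	}
	var kept []string
	for _, p := range parts {
		if p = strings.TrimSpace(p); p != "" {
			kept = append(kept, p)
		}
	}
	return strings.Join(kept, " ")
}

func main() {
	raw := `{"role":"forklift operator","count":50,"location":"Cleveland, OH",
	         "certs":["OSHA-30","active forklift cert"],"skills":[],"shift":"day"}`
	d, err := parseDemand(raw)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(composeQuery(d))
}
```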

This is the real-product flow that the Phase 3 stress test was
asking for: real bodies → real LLM parsing → real search. Before
this commit, the DemandQuery was my hand-crafted string, which
made the inbox phase trivial.

Run #007 result vs #006 (same bodies, parser swapped):

  All 6 inbox events parsed cleanly — qwen2.5 nailed:
    "Need 50 forklift operators in Cleveland OH for Monday day
     shift. OSHA-30 + active forklift cert required."
    → {role:"forklift operator", count:50, location:"Cleveland, OH",
       certs:["OSHA-30","active forklift cert"], skills:[], shift:"day"}
    Other 5 similarly faithful (indy stayed as "indy", count
    defaulted to 1 when unspecified, no hallucinated fields).

  LLM-parsed queries produced TIGHTER matches than hard-coded:
    Demand              #006 dist  #007 dist  Δ
    Crane Chicago       0.499      0.093      -82%
    Drone Chicago       0.707      0.073      -90%
    Bilingual safety    0.240      0.048      -80%
    Forklift Cleveland  0.330      0.273      -17%
    Production Indy     0.260      0.399      +53%
    Warehouse Milwaukee 0.458      0.420       -8%

  Three matches landed at distance < 0.10 — verbatim-replay-tight
  territory. Structured queries embed sharper than conversational
  hand-crafted strings.

  Other metrics unchanged: diversity 0.000, determinism 1.000,
  verbatim handover 4/4, paraphrase handover 4/4.

Tradeoff worth flagging: the drone-Chicago case dropped from
distance 0.71 (clear "we don't have one") to 0.07 (confident match
returned). The OOD honesty signal weakens when LLM-parsed structure
makes any closest-neighbor look tight. Future Phase 4 work: judge
re-rates the top match before surfacing, so coordinators see "your
demand was for X but the closest match scored 2/5" rather than just
the worker ID + distance.

Substrate cost: +6 LLM calls per inbox burst (~9s on qwen2.5).
Production would amortize via a small dedicated parser model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:51:19 -05:00
root
e7fc63b216 observerd: /observer/inbox + multi-coord stress phase 1c (priority-ordered events)
Phase 3 ask: real-world inbox-style event injection during the stress
test. Coordinators in production receive emails + SMS that trigger
contract responses; the substrate has to RECORD these signals AND
react with a search using the embedded demand. This commit lands the
endpoint and exercises it end-to-end in the stress harness.

observerd surface:
- New POST /observer/inbox route — accepts {type, sender, subject,
  body, priority, tag} and records as ObservedOp with
  Source=SourceInbox. Type must be email|sms; body required;
  priority defaults to medium. The handler ONLY records — downstream
  triggers (search, ingest, etc.) are the caller's concern, recorded
  separately. Keeps the witness role pure.
- New observer.SourceInbox = "inbox" alongside SourceMCP /
  SourceScenario / SourceWorkflow.
- Three contract tests on the new route (happy path / bad type / empty
  body), router-mount test extended, all green.
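
The three contract cases (happy path / bad type / empty body) correspond to validation along these lines. This is a sketch, not the handler itself: `inboxRequest` mirrors the fields listed above, while `normalizeInbox` and the error values are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// inboxRequest mirrors the fields accepted by the route.
type inboxRequest struct {
	Type     string `json:"type"`
	Sender   string `json:"sender"`
	Subject  string `json:"subject"`
	Body     string `json:"body"`
	Priority string `json:"priority"`
	Tag      string `json:"tag"`
}

var (
	errBadType   = errors.New("type must be email or sms")
	errEmptyBody = errors.New("body is required")
)

// normalizeInbox validates the request and applies the priority default.
func normalizeInbox(req inboxRequest) (inboxRequest, error) {
	if req.Type != "email" && req.Type != "sms" {
		return req, errBadType
	}
	if req.Body == "" {
		return req, errEmptyBody
	}
	if req.Priority == "" {
		req.Priority = "medium"
	}
	return req, nil
}

func main() {
	ok, err := normalizeInbox(inboxRequest{Type: "sms", Body: "Need a drone surveyor in Chicago"})
	fmt.Println(ok.Priority, err)

	_, err = normalizeInbox(inboxRequest{Type: "fax", Body: "hello"})
	fmt.Println(err)

	_, err = normalizeInbox(inboxRequest{Type: "email"})
	fmt.Println(err)
}
```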

Stress harness phase 1c (Hour 9):
- 6 inbox events fire in priority order (urgent → high → medium):
    2 urgent emails (forklift Cleveland, production Indianapolis)
    1 high email (crane Chicago)
    1 high sms (bilingual safety Indianapolis)
    1 medium sms (drone Chicago)
    1 medium email (warehouse Milwaukee FYI)
- Each event:
    1. POSTs to /v1/observer/inbox (recorded by observerd)
    2. Triggers matrix.search using a parsed demand (the demand
       extraction is hard-coded for now; production needs a small
       LLM to parse from body)
    3. Captures both as events in the run JSON

Run #006 result (with v2-moe embedder + all phases including inbox):

  Diversity:
    Same-role-across-contracts Jaccard = 0.000 (n=9)
    Different-roles-same-contract Jaccard = 0.046 (n=18)
  Determinism: 1.000
  Verbatim handover: 4/4 (100%)
  Paraphrase handover: 4/4 (100%)
  Inbox burst:
    6/6 events accepted by observerd (200 status, all recorded)
    6/6 triggered searches produced distinct top-1 worker IDs
    distance distribution: 0.24 (Indy production) → 0.71 (Chicago
    drone surveyor — honest stretch since drones aren't in the
    5K-worker corpus, system surfaces closest neighbor at high
    distance rather than fabricating)

The drone-Chicago case is the architectural-honesty signal: when
the demand asks for a specialist NOT in the roster, the system
returns the closest semantic neighbor with a distance that flags
"this is a stretch." Coordinators reading distances see "we don't
have a great match here" rather than a confident wrong answer.

Total events captured: 67 (was 61 pre-inbox).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:34:36 -05:00
root
4da32ad102 embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in)
Local Ollama has three embedding models loaded:
  nomic-embed-text:latest        137M  768d  (previous default)
  nomic-embed-text-v2-moe:latest 475M  768d  (this commit's default)
  qwen3-embedding:latest         7.6B  4096d (would require dim change)

v2-moe is a drop-in upgrade — same 768 dim, 3.5× more params, MoE
architecture. Workers index doesn't need rebuilding, just future ingests
embed with the stronger model.

Run #005 result on the multi-coord stress suite:

  Diversity (same-role-across-contracts): 0.080 → 0.000 (n=9)
    → MoE is more discriminating: zero worker overlap across
      Milwaukee / Indianapolis / Chicago for shared role names.
      The geo + cert + skill context fully separates worker pools.
  Different-roles-same-contract: 0.013 → 0.036 (still ~96% diff)
  Determinism: 1.000 (unchanged)
  Verbatim handover: 4/4 (100%)
  Paraphrase handover: 4/4 (100%)

  200-worker swap: Jaccard 0.000 (unchanged — still perfect)

  Fresh-resume verify: STILL doesn't surface fresh workers in top-8.
    With v2-moe, distances increased (top-1 = 0.43–0.65 vs v1's 0.25–0.39)
    — the embedder is MORE discriminating, but the fresh worker's
    vector still doesn't outrank the 8th-best existing worker. This is
    now suspected to be an HNSW post-build add issue rather than an
    embedder problem (coder/hnsw incremental adds can land in
    hard-to-reach graph regions). Better embedder didn't fix it; needs a
    different strategy: full index rebuild after fresh adds, or
    explicit playbook-layer score boost for fresh workers, or
    hybrid (keyword + semantic) retrieval. Phase 3 investigation.

Cost: ingest is ~3-5× slower (workers 20s→100s; ethereal 35s→112s).
Acceptable for the quality jump on diversity. Real production with
incremental ingest won't pay this once-per-deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:26:52 -05:00
root
84a32f0d29 multi-coord stress Phase 2: ExcludeIDs + fresh-resume + 200-worker swap
Three Phase 2 additions land in this commit:

1. matrix.SearchRequest gains ExcludeIDs ([]string) — filters specific
   worker IDs out of results post-retrieval, AND skips them at the
   playbook boost+inject step (so excluded answers can't sneak back
   via Shape B). Real-world driver: coordinator placed N workers,
   client asks for replacements, system needs alternatives, not the
   same N. Threaded through retrieve.go after merge but before
   metadata filter so excluded IDs don't waste post-filter top-K slots.

2. New harness phase 2b: 200-worker swap simulation. Captures the
   top-K from alpha's warehouse query, then re-issues with
   exclude_ids=<placed>. Result Jaccard(orig, swap) measures whether
   the substrate finds genuine alternatives.

3. New harness phase 1b: fresh-resume mid-run injection. Three new
   workers ingested via /v1/embed + /v1/vectors/index/workers/add,
   then verified findable via semantic queries matching resume content.
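
The post-retrieval half of item 1 can be sketched as a set filter over candidates. The `result` type and `filterExcluded` are hypothetical names; in the real code the same exclusion set would also be consulted at the boost and inject steps, which this sketch does not model.

```go
package main

import "fmt"

// result is a hypothetical search hit.
type result struct {
	ID       string
	Distance float64
}

// filterExcluded drops results whose ID is in exclude, preserving order.
func filterExcluded(results []result, exclude []string) []result {
	if len(exclude) == 0 {
		return results
	}
	skip := make(map[string]struct{}, len(exclude))
	for _, id := range exclude {
		skip[id] = struct{}{}
	}
	kept := make([]result, 0, len(results))
	for _, r := range results {
		if _, drop := skip[r.ID]; !drop {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	// Illustrative candidates, already sorted by distance.
	candidates := []result{
		{ID: "w-1", Distance: 0.10},
		{ID: "w-2", Distance: 0.12},
		{ID: "w-3", Distance: 0.15},
		{ID: "w-4", Distance: 0.18},
	}
	placed := []string{"w-1", "w-2"} // workers already placed on the contract
	for _, r := range filterExcluded(candidates, placed) {
		fmt.Printf("%s %.2f\n", r.ID, r.Distance)
	}
}
```

Filtering before the top-K cut is what keeps excluded IDs from consuming result slots.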

Plus Hour labels on every event (operational narrative: 0/6/12/18/
24/30/36/42/48) and a refactor of captureEvent to take hour as a
param.

Run #003 + #004 results (5K workers + 10K ethereal):

  Diversity (#004):
    Same-role-across-contracts Jaccard = 0.080 (n=9)
    Different-roles-same-contract Jaccard = 0.013 (n=18)
  Determinism: 1.000 (#004 unchanged)
  Verbatim handover:  4/4 = 100%
  Paraphrase handover: 4/4 = 100%

  Phase 2b — 200-worker swap (Jaccard 0.000):
    8 originally-placed workers fully replaced by 8 alternatives.
    ExcludeIDs substrate change works end-to-end — boost AND inject
    both honor the exclusion, so excluded workers don't return via
    the playbook either.

  Phase 1b — fresh-resume injection: REAL PRODUCT FINDING.
    Substrate ABSORPTION is fine — 3 /v1/vectors/index/workers/add
    calls at 200 status, 3 vectors persisted. But none of the 3
    fresh workers surfaced in top-8 even with semantic queries
    matching their resume content (e.g. "Senior tower crane rigger
    NCCCO Chicago" vs fresh-001's resume "Senior rigger with 12
    years tower-crane signaling..." NCCCO + Chicago).
    Top-1 came from existing workers at distance ~0.25; fresh
    workers' distances must be > 0.25, pushing them past rank 8.
    Cause: dense retrieval at 5000+ workers means many existing
    profiles cluster near any specific query in cosine space;
    nomic-embed-text (137M) introduces enough noise that a
    fresh worker doesn't reliably outrank them just because the
    text content overlaps.
    Workarounds (Phase 3 work): (a) hybrid retrieval (keyword +
    semantic), (b) playbook-layer score boost for fresh adds,
    (c) larger embedder. Documented in run #004 report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:19:29 -05:00
root
0fa42a0cc3 multi-coord stress Phase 1.5: shared-role contracts + paraphrase handover
Phase 1 had two known gaps: (1) the 3 contracts had zero shared role
names, so same-role-across-contracts Jaccard was vacuous (n=0); (2)
the verbatim handover at 100% was the trivial case, not the hard
learning test (paraphrased queries against another coord's playbook).

Both fixed in this commit.

Contract redesign — all 3 contracts now share warehouse worker /
admin assistant / heavy equipment operator roles, plus a unique
specialist per contract (industrial electrician / bilingual safety
coord / drone surveyor — the "specialist not on the standard roster"
case from J's spec). Counts and skill mixes vary per region.

New driver phase 4b — paraphrase handover. Bob runs qwen2.5-paraphrased
versions of Alice's contract queries against Alice's playbook
namespace. Tests whether institutional memory propagates across
coordinators AND across natural wording variation that Bob would
introduce when running Alice's contract.

Run #002 result (5K workers + 10K ethereal_workers, 4 demand × 3
coords + paraphrase handover):

  Diversity (the question J asked: locking or cycling?):
    Same-role-across-contracts Jaccard = 0.119 (n=9)
      → 88% of workers DIFFER across regions for the same role name.
        Milwaukee warehouse vs Indianapolis warehouse vs Chicago
        warehouse pull mostly distinct top-K from the same population.
        The system locks into geo+cert+skill context, not cycling.
    Different-roles-same-contract Jaccard = 0.004 (n=18)
      → role-specific retrieval works (unchanged from Phase 1).

  Determinism: Jaccard = 1.000 (n=12) — unchanged.

  Learning:
    Verbatim handover  4/4 = 100%  (trivial case, expected)
    Paraphrase handover 4/4 = 100% (HARD case — passes!)
      Of those 4 paraphrase recoveries:
        - 2 used boost (Alice's recording was already in Bob's
          paraphrase top-K; ApplyPlaybookBoost re-ranked to top-1)
        - 2 used Shape B inject (recording wasn't in Bob's
          paraphrase top-K; InjectPlaybookMisses brought it in)

The boost/inject mix is healthy — both paths are used and both
produce correct top-1s. Multi-coord institutional memory propagation
is empirically working under wording variation.
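
The boost/inject split can be read as a two-branch decision per recording. The sketch below assumes the v4 thresholds (0.5 boost / 0.20 inject) and assumes the distance compared is query-to-recorded-query cosine distance with inclusive bounds; `decide` is a hypothetical name, not ApplyPlaybookBoost or InjectPlaybookMisses themselves.

```go
package main

import "fmt"

// Thresholds from the v4 split; inclusive comparison is an assumption.
const (
	boostThreshold  = 0.5
	injectThreshold = 0.20
)

type action string

const (
	actionNone   action = "none"
	actionBoost  action = "boost"
	actionInject action = "inject"
)

// decide picks the playbook action for one recording.
// dist is the distance between the live query and the recorded query;
// inTopK says whether the recorded answer is already in the results.
func decide(dist float64, inTopK bool) action {
	switch {
	case inTopK && dist <= boostThreshold:
		return actionBoost // re-rank an existing result
	case !inTopK && dist <= injectThreshold:
		return actionInject // bring the recorded answer in
	default:
		return actionNone
	}
}

func main() {
	fmt.Println(decide(0.35, true))  // paraphrase, answer already retrieved
	fmt.Println(decide(0.15, false)) // close paraphrase, answer missing
	fmt.Println(decide(0.35, false)) // too far to inject
}
```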

Sample warehouse worker top-1s across contracts (proves diversity):
  alice / Milwaukee     → w-713
  bob   / Indianapolis  → e-8447
  carol / Chicago       → e-7145
Three different workers from the same 15K-person population,
selected on geo+cert+skill context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:03:16 -05:00
root
61c7b55e48 multi-coord stress harness — Phase 1 of 48-hour mock
Three coordinators (alice / bob / carol) with three contracts
(Milwaukee distribution / Indianapolis manufacturing / Chicago
construction). 7-phase scenario runner: baseline → surge → merge →
handover → split → reissue → analysis. Each coord has a separate
playbook namespace (playbook_{name}) so institutional memory stays
isolated by default but transferable on demand.

Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints,
and Langfuse tracing — those are Phase 2/3.

Run #001 (52 events, 4 queries × 3 coords × 2 demand flavors):

  Diversity:
    Different-roles-same-contract Jaccard = 0.004 (n=18)
      → role-specific retrieval is working perfectly. Different
        roles within one contract pull totally different worker
        pools. System is NOT cycling; locks into per-role retrieval.
    Same-role-across-contracts Jaccard = N/A (n=0)
      → TEST-DESIGN ISSUE: the 3 contracts use distinct role names
        per industry (warehouse worker / production worker / general
        laborer), so no exact-name overlaps exist. Phase 2 should
        either share at least one role across contracts OR add a
        skill-based diversity metric.

  Determinism: Jaccard = 1.000 (n=12)
    → HNSW + Ollama retrieval is fully deterministic on identical
      query text. coder/hnsw + nomic-embed-text are stable.

  Learning: handover hit rate = 4/4 = 100%
    → Bob inherits Alice's recordings perfectly when bob runs
      identical queries with alice's playbook namespace. CAVEAT:
      this tests the trivial verbatim case, not paraphrase handover.
      The harder test (bob runs paraphrased queries with alice's
      playbook) is Phase 2 work.
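
For reference, the Jaccard figure used by the diversity and determinism metrics is intersection size over union size of two top-K ID sets. A small sketch follows; treating two empty sets as identical (1.0) is an assumption of the sketch, and the worker IDs are illustrative.

```go
package main

import "fmt"

// toSet builds a membership set from a list of IDs.
func toSet(ids []string) map[string]struct{} {
	s := make(map[string]struct{}, len(ids))
	for _, id := range ids {
		s[id] = struct{}{}
	}
	return s
}

// jaccard returns |A ∩ B| / |A ∪ B| for two ID lists.
// Two empty lists are treated as identical (1.0) by convention.
func jaccard(a, b []string) float64 {
	setA, setB := toSet(a), toSet(b)
	if len(setA) == 0 && len(setB) == 0 {
		return 1
	}
	inter := 0
	for id := range setA {
		if _, ok := setB[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

func main() {
	run1 := []string{"w-1", "w-2", "w-3"}
	run2 := []string{"w-1", "w-2", "w-3"}
	other := []string{"w-2", "w-3", "w-4"}
	fmt.Printf("identical top-K: %.3f\n", jaccard(run1, run2))
	fmt.Printf("partial overlap: %.3f\n", jaccard(run1, other))
}
```

A value of 1.000 on reissued queries means the top-K sets matched exactly; values near 0 on the diversity metrics mean the compared queries pulled almost disjoint worker pools.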

Per-event capture in JSON: every matrix.search response is logged
with phase / coordinator / contract / role / query / top-K IDs +
distances + per-corpus counts + boosted/injected counts. Reviewable
via:
  jq '.events[] | select(.phase == "merge")'
  jq '.events[] | select(.coordinator == "alice")'
  jq '.events[] | select(.role == "warehouse worker")'

Notable finding from per-event: carol's "general laborer" and "crane
operator" queries both surface w-1009 as top-1, with crane operator
at distance 0.098 (very tight) and general laborer at 0.297. The
system found a worker who legitimately covers both roles — realistic
for small construction crews.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:55:29 -05:00
root
b13b5cd7a1 playbook_lift v4 metric: warm-top-1 re-judge — quality lift +24%/-14%
The rank-based "lift" metric (warm-top-1 == cold-judge-best) doesn't
distinguish "Shape B surfaced a strictly-better answer" from "Shape B
shuffled ranks but quality is unchanged" from "Shape B replaced a good
answer with a wrong one." This commit adds Pass 4: judge warm top-1
with the same prompt as cold ratings, then bucket the comparison.

Implementation:
- New --with-rejudge driver flag (default off).
- New WITH_REJUDGE harness env (default 1, on for prod runs).
- queryRun gains WarmTop1Metadata (cached during Pass 2 for the
  rejudge call) + WarmTop1Rating *int (nil-distinguishable; nil = no
  rejudge, 0..5 = rating).
- summary gains RejudgeAttempted, QualityLifted, QualityNeutral,
  QualityRegressed (counts of warm-rating > / == / < cold-rating).
- Markdown headline gains a Quality block when rejudge ran.
- ~21 extra judge calls (~30s on qwen2.5).
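
The bucketing can be sketched as below. `WarmTop1Rating` and the four summary counters are named in the list above; `ColdRating`, `NetDelta`, `rated`, and `summarize` are hypothetical names added for the illustration. The sample deltas follow the run #005 distribution, with placeholder ratings where the message gives only the delta.

```go
package main

import "fmt"

type queryRun struct {
	ColdRating     int  // hypothetical field: judge rating of cold top-1
	WarmTop1Rating *int // nil = no rejudge ran for this query
}

type summary struct {
	RejudgeAttempted int
	QualityLifted    int
	QualityNeutral   int
	QualityRegressed int
	NetDelta         int // hypothetical: sum of (warm - cold)
}

// summarize buckets each rejudged query by warm vs cold rating.
func summarize(runs []queryRun) summary {
	var s summary
	for _, r := range runs {
		if r.WarmTop1Rating == nil {
			continue // not rejudged; excluded from every bucket
		}
		s.RejudgeAttempted++
		delta := *r.WarmTop1Rating - r.ColdRating
		s.NetDelta += delta
		switch {
		case delta > 0:
			s.QualityLifted++
		case delta < 0:
			s.QualityRegressed++
		default:
			s.QualityNeutral++
		}
	}
	return s
}

// rated builds a rejudged query run.
func rated(cold, warm int) queryRun {
	return queryRun{ColdRating: cold, WarmTop1Rating: &warm}
}

func main() {
	var runs []queryRun
	for i := 0; i < 3; i++ {
		runs = append(runs, rated(2, 4)) // +2 lifts
	}
	for i := 0; i < 2; i++ {
		runs = append(runs, rated(2, 3)) // +1 lifts
	}
	for i := 0; i < 13; i++ {
		runs = append(runs, rated(1, 1)) // neutral; ratings are placeholders
	}
	// Regressions of -3, -1, -1; the -1 cold ratings are placeholders.
	runs = append(runs, rated(4, 1), rated(3, 2), rated(3, 2))
	fmt.Printf("%+v\n", summarize(runs))
}
```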

Run #005 result (split inject threshold 0.20 + paraphrase + rejudge):

  Quality lifted     5 / 21  (24%)  — 3× +2 rating, 2× +1 rating
  Quality neutral   13 / 21  (62%)  — includes OOD queries holding 1
  Quality regressed  3 / 21  (14%)
  Net rating delta  +3 across 21 queries (+0.14 average)

The 5 lifts were all rating-2 cold replaced with rating-3 or rating-4
warm — Shape B took mediocre matches and substituted substantively
better ones. The 3 regressions were small (-1, -1, -3).

Q11 is the cautionary tale: cold top-1 "production line worker"
(rating 4) got replaced by Q1's recorded "forklift OSHA-30 operator"
e-5729 (rating 1). Adjacent-domain cross-pollination — production
worker and forklift operator embed within 0.20 cosine because both
are warehouse-adjacent staffing queries, even though the judge
correctly distinguishes them. The split-threshold defense (0.5 boost
/ 0.20 inject) catches OOD cross-pollination (Q19/Q20/Q21 all stayed
neutral at rating 1) but not adjacent-domain cross-pollination.

Net product verdict: working, net-positive on quality, but the worst
case (Q11 4→1) is customer-visible and warrants a tighter inject
threshold OR an additional gate beyond cosine distance. Filed in
STATE_OF_PLAY OPEN as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:42:04 -05:00
27 changed files with 3263 additions and 21 deletions

@@ -1,7 +1,7 @@
# STATE OF PLAY — Lakehouse-Go
-**Last verified:** 2026-04-30 ~07:25 CDT
-**Verified by:** live probes + `just verify` PASS + reality tests #001/#002/#003/#004 (v4 with split inject threshold: 6/8 verbatim lift + 6/8 paraphrase recovery + zero OOD cross-pollination), not memory.
+**Last verified:** 2026-04-30 ~16:42 CDT
+**Verified by:** live probes + `just verify` PASS + multi-coord stress run #011 (full 9-phase scenario, 67 captured events, 1 Langfuse trace + 111 child observations covering every phase + every external call), not memory.
> **Read this FIRST.** When the user says "we're working on lakehouse," default to the Go rewrite (this repo); the Rust legacy at `/home/profit/lakehouse/` is maintenance-only. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.
@@ -114,6 +114,34 @@ Evidence: `reports/reality-tests/playbook_lift_{001,002,003}.{json,md}`. Per the
**v3 → v4 is the configuration evolution.** v3 ran with one threshold (0.5) for both boost and inject — paraphrase recovery hit 6/6 but a single strong recording (w-4435 for "OSHA-30 forklift Wisconsin") cross-pollinated to OOD queries (dental hygienist / RN / software engineer) because their text vectors fell within 0.5 cosine. v4 split: boost stays at 0.5 (safe — only re-ranks existing results), inject tightens to 0.20 (true paraphrase territory). Result: OOD cross-pollination eliminated, paraphrase recovery 6/8, verbatim lift 6/8. The two paraphrases that missed (Q9, Q15) were liberally rephrased — drift > 0.20 cosine.
### Multi-coordinator stress test (Phase 1 → 3) — VERIFIED end-to-end
Reality test #2 catalog. New harness `scripts/multi_coord_stress.{sh,go}` simulates 3 coordinators (alice/bob/carol) handling 3 distinct contracts (Milwaukee distribution / Indianapolis manufacturing / Chicago construction), each with their own playbook namespace. 9-phase operational narrative across simulated Hours 0→48: baseline → fresh-resume injection → inbox burst → mid-day surge → 200-worker swap → contract merge → handover (verbatim + paraphrase) → split → reissue.
| Capability | Verified | Where |
|---|---|---|
| Per-coordinator playbook isolation | ✓ | `playbook_alice` / `playbook_bob` / `playbook_carol` corpora |
| Same-role-across-contracts diversity | Jaccard 0.026 (n=9) — 97% workers differ per region | Phase 1 baseline |
| Different-roles-same-contract diversity | Jaccard 0.070 (n=18) — 93% differ per role | Phase 1 baseline |
| HNSW retrieval determinism | Jaccard 1.000 (n=12) | Phase 6 reissue |
| Verbatim handover (Bob runs Alice's queries with Alice's playbook) | 4/4 | Phase 4 |
| Paraphrase handover (Bob runs qwen2.5-paraphrased queries) | 4/4 | Phase 4b |
| 200-worker swap with `ExcludeIDs` | Jaccard 0.000 — 8/8 placed workers fully replaced | Phase 2b |
| Fresh-resume injection (two-tier `fresh_workers` index) | 3/3 fresh workers at top-1 | Phase 1b |
| Inbox endpoint `/v1/observer/inbox` (email + SMS, priority weighting) | 6/6 events recorded | Phase 1c |
| LLM demand parsing (qwen2.5 format=json on inbox bodies) | 6/6 parsed cleanly into structured `{role, count, location, certs, skills, shift}` | Phase 1c |
| Judge re-rates inbox top-1 against ORIGINAL body | catches tight-distance-but-wrong (Q3 crane case: dist 0.23 → rating 1) | Phase 1c |
| Langfuse Go-side tracing | 111 observations on a single run trace, browseable at http://localhost:3001 | Run #011 |
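
Most rows above report a Jaccard overlap between two top-K result sets. A minimal sketch of that metric, assuming each result set is compared as a set of worker IDs; the function name `jaccard`, the sample IDs, and the empty-set convention are illustrative and not taken from the harness:

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two top-K ID lists.
// Assumption for this sketch: two empty lists count as identical (1.0).
func jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	if len(setA) == 0 && len(setB) == 0 {
		return 1.0
	}
	inter := 0
	for id := range setA {
		if _, ok := setB[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

func main() {
	before := []string{"w-1", "w-2", "w-3", "w-4"}
	after := []string{"w-3", "w-4", "w-5", "w-6"}
	fmt.Printf("%.3f\n", jaccard(before, after))  // 2 shared IDs out of 6 distinct
	fmt.Printf("%.3f\n", jaccard(before, before)) // identical sets
}
```

Read this way, 1.000 on the reissue row means every reissued query returned the same ID set, and 0.000 on the swap row means no placed worker reappeared.
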
**Substrate gains added by this wave:**
- `internal/matrix/playbook.go` Shape B + split inject threshold (commit `67d1957` from earlier wave; verified in multi-coord too)
- `internal/matrix/retrieve.go` `ExcludeIDs` field on `SearchRequest` — filters worker IDs at retrieve, boost, AND inject (so excluded answers can't sneak back via playbook). Real-world driver: coordinator placed N workers, client asks for replacements.
- `internal/observer/types.go` `SourceInbox` taxonomy alongside `SourceMCP / SourceScenario / SourceWorkflow`
- `cmd/observerd` `POST /observer/inbox` route — accepts `{type, sender, subject, body, priority, tag}` and records as `ObservedOp`. Type must be `email` or `sms`; body required; priority defaults to medium.
- `internal/langfuse/client.go` — minimal Go-side Trace+Span client, best-effort posture (logs on error, never blocks calling path; same fail-open semantics as ADR-005 Decision 5.1).
- `embedd default_model` bumped from `nomic-embed-text` (137M) → `nomic-embed-text-v2-moe` (475M, MoE, drop-in 768d). Same-role-across-contracts Jaccard overlap went from 0.080 → 0.000 with the upgrade (lower overlap = more diverse results).
- Two-tier index pattern: fresh content goes to `fresh_workers` (a small "hot" corpus); main queries include it in the corpora list. Solves the HNSW post-build add recall issue (incremental adds to a populated graph land in poorly-connected regions and disappear from search). Canonical NRT pattern; Lucene works the same way.
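
The two-tier pattern in the last bullet comes down to searching both corpora and merging the hits by distance. A rough sketch under stated assumptions: the `hit` type, `mergeTopK`, and the sample IDs are hypothetical, and the real retriever's types are only partly visible in this diff:

```go
package main

import (
	"fmt"
	"sort"
)

// hit is a simplified stand-in for one search result (hypothetical type).
type hit struct {
	ID       string
	Corpus   string
	Distance float32
}

// mergeTopK concatenates per-corpus hits, sorts them by ascending distance,
// and keeps at most k. A fresh worker in the small hot corpus therefore
// competes on equal footing with hits from the main index.
func mergeTopK(k int, perCorpus ...[]hit) []hit {
	var all []hit
	for _, hits := range perCorpus {
		all = append(all, hits...)
	}
	sort.SliceStable(all, func(i, j int) bool { return all[i].Distance < all[j].Distance })
	if len(all) > k {
		all = all[:k]
	}
	return all
}

func main() {
	mainHits := []hit{
		{ID: "w-101", Corpus: "workers", Distance: 0.31},
		{ID: "w-102", Corpus: "workers", Distance: 0.36},
	}
	freshHits := []hit{{ID: "fresh-7", Corpus: "fresh_workers", Distance: 0.12}}
	for _, h := range mergeTopK(2, mainHits, freshHits) {
		fmt.Printf("%s (%s) %.2f\n", h.ID, h.Corpus, h.Distance)
	}
}
```
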
### Harness expansion (2026-04-30 ~05:30 CDT)
`scripts/playbook_lift.sh` rewritten from a 5-daemon stripped harness to the **full 10-daemon prod-realistic stack** (chatd stays up independently). The 5-daemon version was structurally hiding bugs; expanding the daemon set surfaced 7 distinct fixes:
@@ -171,10 +199,16 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
- The matrix indexer **5/5 components** are shipped. Don't propose to "build the matrix indexer" — it's done.
- The 5-loop substrate's load-bearing gate is **PASSED**. v3 (`154a72e`) showed 6/6 paraphrase recovery via Shape B. Don't propose to "test if the playbook learns" — it does.
- **Shape B is the playbook stance now.** When `use_playbook=true`, both `ApplyPlaybookBoost` (re-rank in place) AND `InjectPlaybookMisses` (insert recorded answers not in regular top-K) run. Don't revert to boost-only; v2 proved that path can't recover paraphrase queries.
- **Boost / inject use SEPARATE thresholds.** Boost stays at `DefaultPlaybookMaxDistance = 0.5` (safe — only re-ranks results already in retrieval). Inject uses tighter `DefaultPlaybookMaxInjectDistance = 0.20` (Shape B forces results in, so loose match cross-pollinates wrong-domain). Don't merge them.
- **Multi-coord product theory is empirically VALIDATED** by run #011 (Phase 3). Per-coordinator playbook namespaces (`playbook_alice` etc.) with cross-coordinator handover (Bob takes Alice's contract using `playbook_corpus=playbook_alice`) work end-to-end including paraphrase handover. Don't propose to "test if multi-coord works" — it does.
- **Fresh content uses two-tier indexing.** Fresh resumes go to `fresh_workers` corpus, not the main `workers` index. coder/hnsw incremental adds to a populated 5K+ graph have recall issues; the small hot index has no such crowding. Periodic merge (post-G3) consolidates fresh→main. Canonical NRT pattern.
- **`embedd.default_model = "nomic-embed-text-v2-moe"`** (475M MoE, 768d). Don't downgrade to `nomic-embed-text` (137M) "for speed": same-role-across-contracts Jaccard overlap is 0.080 on the smaller model versus 0.000 on v2-moe (lower overlap = more diverse results). Cost (~5× slower ingest) is acceptable for once-per-deploy work.
- **Inbox flow: parse + search + judge + trace.** `/v1/observer/inbox` records the body; coordinator/driver parses via LLM (qwen2.5 format=json), runs `matrix.search` on the parsed query, judges top-1 against the ORIGINAL body, emits Langfuse spans through it all. Don't replace the judge re-rate with a distance-only gate — tight distance + low rating is the load-bearing honesty signal.
- `local_judge = "qwen2.5:latest"` for the lift loop — qwen3.5:latest is a vision-SSM 256K-ctx build that ran 30s/judge. Don't bump to qwen3.5 again "for cleanliness." Different model in the same tier has different cost geometry.
- `qwen3.5:latest` IS available locally on this box. Opus's hub-only knowledge is a known-stale signal; the chatd_smoke uses it daily.
- `temperature` is **omitted** for Anthropic 4.7 (handled by `Request.Temperature *float64`); don't re-add it.
- chatd-smoke runs with **all cloud providers disabled** intentionally so the suite doesn't depend on API keys; that's why it can't catch B-3-class bugs (those need a fake-server fixture, see Sprint 0 follow-up).
- **Langfuse Go-side client lives at `internal/langfuse/`** with best-effort fail-open posture. URL+creds from `/etc/lakehouse/langfuse.env`. Don't propose to "wire Langfuse on Go side" — it's wired; multi_coord_stress is the proof.
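
The split-threshold lock above can be illustrated with a small decision function. This is a sketch, not the code in `internal/matrix/playbook.go`: the two constants carry the values stated in the list, while `playbookAction`, `decide`, and the choice of `<=` over `<` are assumptions made for illustration:

```go
package main

import "fmt"

// Threshold values as stated in the list above.
const (
	DefaultPlaybookMaxDistance       float32 = 0.5
	DefaultPlaybookMaxInjectDistance float32 = 0.20
)

// playbookAction is a hypothetical enum used only in this sketch.
type playbookAction int

const (
	actionSkip playbookAction = iota
	actionBoost
	actionInject
)

// decide picks what to do with one playbook recording. dist is the cosine
// distance between the query and the recorded query; inRetrieval reports
// whether the recorded answer is already in the regular top-K.
func decide(dist float32, inRetrieval bool) playbookAction {
	switch {
	case inRetrieval && dist <= DefaultPlaybookMaxDistance:
		return actionBoost // only re-ranks a result that retrieval already found
	case !inRetrieval && dist <= DefaultPlaybookMaxInjectDistance:
		return actionInject // forces a result in, so the gate is tighter
	default:
		return actionSkip
	}
}

func main() {
	names := map[playbookAction]string{actionSkip: "skip", actionBoost: "boost", actionInject: "inject"}
	fmt.Println(names[decide(0.35, true)])  // loose match, already retrieved
	fmt.Println(names[decide(0.35, false)]) // loose match, not retrieved
	fmt.Println(names[decide(0.15, false)]) // tight match, not retrieved
}
```

A recording at distance 0.35 may still re-rank a worker that retrieval already returned, but it is too loose to force that worker into a result set where it was absent.
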
---
@@ -182,9 +216,11 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
| Item | What | When to act |
|---|---|---|
| **Reality test v4: re-judge warm results** | Current lift metric counts "warm top-1 == cold judge best." Shape B cross-pollinates recorded answers across queries (run #003 dropped verbatim-lift 7→2 because w-4435 surfaced as warm top-1 for several queries whose cold-judge-best was a different worker). True quality lift would re-judge warm results and measure judge agreement. ~30 min — adds a 4th pass to the driver. | Next time we want to measure Shape B's actual quality, not rank-of-original-judge-best. |
| **Adjacent-query cross-pollination** | After v4's split threshold, OOD cross-pollination is gone but Q6 ("Forklift-certified loader") ↔ Q7 ("Hazmat-certified warehouse worker, cold storage") still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Likely correct (genuinely similar staffing queries), but worth verifying with the v4 re-judge metric — if the judge agrees both directions are good matches, accept; if not, tighten further (e.g. 0.15) or add a same-query-only mode. | Co-decision with v4 re-judge. |
| **Liberal-paraphrase recovery loss** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt (more conservative wording variation) or a per-pair `paraphrase_max_drift` measurement. Cosmetic vs. real depends on whether realistic coordinator queries drift like qwen2.5's rephrases do. | When real coordinator queries are available for a calibration run. |
| **Real-time 48-hour clock** | Multi-coord stress fires phases as discrete steps with simulated-hour labels (0/6/12/18/24/30/36/42/48). A real-time variant would space events on actual wall-clock with `time.Sleep`, simulating the rhythm of a coordinator workday. Cosmetic; doesn't change product behavior — but lets the harness mimic the cadence at which inbox events would arrive in production. ~30 min. | When stress test starts being used to capture realistic per-hour throughput numbers. |
| **Wider Langfuse instrumentation across daemons** | multi_coord_stress traces every external call, but the daemons themselves (matrixd, observerd, chatd) don't yet emit traces from their own request handlers. Adding `internal/langfuse/middleware.go` would auto-emit a span per HTTP request, giving production-traffic visibility for free. | When production traffic starts hitting the Go gateway. |
| **Periodic fresh→main index merge** | Two-tier `fresh_workers` pattern is verified working but no scheduled job consolidates fresh→main. Fresh corpus grows monotonically; eventually has its own recall issues. A daily cron that ingests the fresh corpus' contents into the main `workers` index + drops fresh would solve it. | When fresh_workers grows past ~500 items. |
| **Adjacent-query cross-pollination (lift suite Q6↔Q7)** | After lift v4's split threshold, OOD cross-pollination is gone but Q6 / Q7 still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Multi-coord run #008 inbox-judge re-rating proved the judge can distinguish — gating injection on "judge approves before injecting" closes this. ~1 hr. | When playbook injection quality matters more than retrieval throughput. |
| **Liberal-paraphrase recovery loss (lift suite Q9, Q15)** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt or a per-pair `paraphrase_max_drift` measurement. | When real coordinator queries are available for a calibration run. |
| **Sprint 4 — deployment** | No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/<bin>.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. |
| **ADR-006 — auth posture for non-loopback deploy** | Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr. | Required before any Go binary binds non-loopback in prod. |
| **chatd fixture-mode storage half** | `g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. |
@@ -213,6 +249,17 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
| `154a72e` | matrix: Shape B (`InjectPlaybookMisses`) — 6/6 paraphrase recovery in run #003 |
| `94fc3b6` | STATE_OF_PLAY: capture v3 reality test + Shape B + cross-pollination |
| `67d1957` | matrix: split boost/inject thresholds (0.5 / 0.20) — kills cross-pollination, run #004 6/8+6/8 |
| `b13b5cd` | playbook_lift v4 metric: warm-top-1 re-judge → quality lift +24%/-14% (5 lifted / 13 neutral / 3 regressed) |
| `61c7b55` | multi-coord stress harness (Phase 1) — 3 coords / 3 contracts / 7-phase scenario |
| `0fa42a0` | multi-coord stress Phase 1.5 — shared-role contracts + paraphrase handover |
| `84a32f0` | multi-coord stress Phase 2 — `ExcludeIDs`, 200-worker swap, fresh-resume |
| `4da32ad` | embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in) |
| `e7fc63b` | observerd `/observer/inbox` + multi-coord stress phase 1c (priority-ordered events) |
| `186d209` | multi_coord_stress: LLM-parsed inbox demands (qwen2.5 format=json) |
| `ce940f4` | multi_coord_stress: judge re-rates inbox top-1 against original body — recovers OOD honesty signal |
| `7e6431e` | langfuse: Go-side client + Phase 1c instrumentation |
| `08a0867` | multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1 (3/3) |
| `5d49967` | multi_coord_stress: full Langfuse coverage — every phase + every call (111 observations) |
Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).


@@ -93,6 +93,62 @@ func (h *handlers) register(r chi.Router) {
r.Post("/observer/event", h.handleEvent)
r.Post("/observer/workflow/run", h.handleWorkflowRun)
r.Get("/observer/workflow/modes", h.handleWorkflowModes)
r.Post("/observer/inbox", h.handleInbox)
}
// inboxMessage is the POST /observer/inbox body — an incoming
// real-world signal (email or SMS) that a coordinator would receive
// and act on. The handler only RECORDS it as an ObservedOp; whether
// to trigger a downstream matrix.search or workflow is the caller's
// concern. Keeps observer's witness role pure.
type inboxMessage struct {
Type string `json:"type"` // "email" | "sms"
Sender string `json:"sender"`
Subject string `json:"subject,omitempty"`
Body string `json:"body"`
Priority string `json:"priority"` // "urgent" | "high" | "medium" | "low"
Tag string `json:"tag,omitempty"`
}
func (h *handlers) handleInbox(w http.ResponseWriter, r *http.Request) {
var msg inboxMessage
if !decodeJSON(w, r, &msg) {
return
}
if msg.Type != "email" && msg.Type != "sms" {
http.Error(w, "type must be 'email' or 'sms'", http.StatusBadRequest)
return
}
if strings.TrimSpace(msg.Body) == "" {
http.Error(w, "body required", http.StatusBadRequest)
return
}
if msg.Priority == "" {
msg.Priority = "medium"
}
op := observer.ObservedOp{
Endpoint: "/observer/inbox/" + msg.Type,
InputSummary: fmt.Sprintf("from=%s priority=%s tag=%s subject=%s", msg.Sender, msg.Priority, msg.Tag, msg.Subject),
OutputSummary: msg.Body,
Source: observer.SourceInbox,
Success: true,
}
if err := h.store.Record(op); err != nil {
if errors.Is(err, observer.ErrInvalidOp) {
http.Error(w, err.Error(), http.StatusBadRequest)
return
}
slog.Error("observer record inbox", "err", err)
http.Error(w, "internal", http.StatusInternalServerError)
return
}
stats := h.store.Stats()
writeJSON(w, http.StatusOK, map[string]any{
"accepted": true,
"type": msg.Type,
"priority": msg.Priority,
"ring_size": stats.Total,
})
}
func (h *handlers) handleStats(w http.ResponseWriter, _ *http.Request) {
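
For reference, a caller-side sketch of building a request that satisfies the validation in `handleInbox`. The struct mirrors the JSON field names from the diff above; `newInboxRequest`, the base URL, and the sample message are hypothetical, and the request is built but not sent:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// inboxMessage mirrors the JSON fields accepted by POST /observer/inbox.
type inboxMessage struct {
	Type     string `json:"type"` // must be "email" or "sms"
	Sender   string `json:"sender"`
	Subject  string `json:"subject,omitempty"`
	Body     string `json:"body"`     // required, must not be blank
	Priority string `json:"priority"` // empty string is defaulted to "medium" by the handler
	Tag      string `json:"tag,omitempty"`
}

// newInboxRequest builds (but does not send) the POST request.
// baseURL is a placeholder; the real observerd address is not shown in this diff.
func newInboxRequest(baseURL string, msg inboxMessage) (*http.Request, error) {
	payload, err := json.Marshal(msg)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/observer/inbox", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newInboxRequest("http://127.0.0.1:8080", inboxMessage{
		Type:   "sms",
		Sender: "+15550100",
		Body:   "Need 3 welders tomorrow, night shift.",
	})
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.Path)
}
```

Because the handler only records the message, a caller that wants a search to follow has to issue it separately.
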


@@ -4,6 +4,7 @@ import (
"bytes"
"net/http"
"net/http/httptest"
"strings"
"testing"
"time"
@@ -38,6 +39,7 @@ func TestRoutesMounted(t *testing.T) {
"POST /observer/event": false,
"POST /observer/workflow/run": false,
"GET /observer/workflow/modes": false,
"POST /observer/inbox": false,
}
_ = chi.Walk(r, func(method, route string, _ http.Handler, _ ...func(http.Handler) http.Handler) error {
key := method + " " + route
@@ -165,6 +167,51 @@ func TestWorkflowRun_AllProvenanceRecordedPostRun(t *testing.T) {
}
}
// TestInbox_AcceptsValidEmail locks the happy-path contract for the
// /observer/inbox route — accepts an email message with required
// fields, records as ObservedOp, returns 200 with ring-size.
func TestInbox_AcceptsValidEmail(t *testing.T) {
r := newTestRouter(t)
body := []byte(`{"type":"email","sender":"client@northstar.com","subject":"URGENT: 50 forklift ops","body":"Need 50 forklift operators in Cleveland OH for next week. Day shift.","priority":"urgent","tag":"alpha-surge"}`)
req := httptest.NewRequest("POST", "/observer/inbox", bytes.NewReader(body))
req.Header.Set("Content-Type", "application/json")
w := httptest.NewRecorder()
r.ServeHTTP(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected 200, got %d (body=%s)", w.Code, w.Body.String())
}
if !strings.Contains(w.Body.String(), `"accepted":true`) {
t.Errorf("expected accepted=true, got %s", w.Body.String())
}
}
// TestInbox_RejectsBadType locks the validation: type must be
// "email" or "sms", anything else is 400.
func TestInbox_RejectsBadType(t *testing.T) {
r := newTestRouter(t)
body := []byte(`{"type":"smoke-signal","sender":"x","body":"y","priority":"high"}`)
req := httptest.NewRequest("POST", "/observer/inbox", bytes.NewReader(body))
req.Header.Set("Content-Type", "application/json")
w := httptest.NewRecorder()
r.ServeHTTP(w, req)
if w.Code != http.StatusBadRequest {
t.Errorf("expected 400 on bad type, got %d", w.Code)
}
}
// TestInbox_RejectsEmptyBody locks the body-required invariant.
func TestInbox_RejectsEmptyBody(t *testing.T) {
r := newTestRouter(t)
body := []byte(`{"type":"email","sender":"x","body":"","priority":"high"}`)
req := httptest.NewRequest("POST", "/observer/inbox", bytes.NewReader(body))
req.Header.Set("Content-Type", "application/json")
w := httptest.NewRecorder()
r.ServeHTTP(w, req)
if w.Code != http.StatusBadRequest {
t.Errorf("expected 400 on empty body, got %d", w.Code)
}
}
// TestWorkflowRun_UnknownMode locks the 400 path on workflow definitions
// that reference modes not registered with the runner. The harness's
// reality test runs depend on this so an unknown-mode misconfiguration

217
internal/langfuse/client.go Normal file

@@ -0,0 +1,217 @@
// Package langfuse is a minimal Go-side client for the Langfuse v2
// ingestion API. Mirrors the surface area we need from the Rust
// crates/gateway/src/v1/langfuse_trace.rs emitter — Trace + Span,
// nothing else yet (no scores, no observations, no datasets).
//
// Auth is Basic over public_key:secret_key. URL + creds come from
// /etc/lakehouse/langfuse.env in production; tests can pass any URL.
//
// Best-effort transport: errors are logged but don't fail the calling
// path. Lakehouse's internal services should never go down because
// Langfuse is unreachable.
package langfuse
import (
"bytes"
"context"
"crypto/rand"
"encoding/base64"
"encoding/hex"
"encoding/json"
"fmt"
"log/slog"
"net/http"
"sync"
"time"
)
// Client posts traces + spans to Langfuse's ingestion endpoint.
// Events are buffered and flushed in batches. Always call Flush
// before exit; Close also flushes.
type Client struct {
url string
auth string // pre-encoded "Basic ..."
hc *http.Client
mu sync.Mutex
pending []event
maxBatch int
}
// New constructs a Client. URL like "http://localhost:3001"; creds
// from langfuse.env. nil hc → uses default with 5s timeout.
func New(url, publicKey, secretKey string, hc *http.Client) *Client {
if hc == nil {
hc = &http.Client{Timeout: 5 * time.Second}
}
auth := "Basic " + base64.StdEncoding.EncodeToString([]byte(publicKey+":"+secretKey))
return &Client{
url: url,
auth: auth,
hc: hc,
maxBatch: 50,
}
}
// NewID returns a hex string suitable as a trace/span id. Langfuse
// accepts arbitrary strings; a 16-byte random hex is unambiguous.
func NewID() string {
b := make([]byte, 16)
_, _ = rand.Read(b)
return hex.EncodeToString(b)
}
// event is one Langfuse ingestion envelope. Body shape varies by
// type (trace-create vs span-create); we use map[string]any to
// keep the wire shape declarative.
type event struct {
ID string `json:"id"`
Type string `json:"type"` // "trace-create" | "span-create"
Timestamp string `json:"timestamp"`
Body map[string]any `json:"body"`
}
// TraceInput is what callers fill in when starting a trace.
type TraceInput struct {
Name string
UserID string
Input any
Metadata map[string]any
Tags []string
}
// Trace records a top-level trace. Returns the trace id so callers
// can attach spans. Best-effort: errors are logged and the trace
// id is still returned so callers don't need error-handling for the
// common case.
func (c *Client) Trace(ctx context.Context, t TraceInput) string {
id := NewID()
body := map[string]any{
"id": id,
"name": t.Name,
}
if t.UserID != "" {
body["userId"] = t.UserID
}
if t.Input != nil {
body["input"] = t.Input
}
if t.Metadata != nil {
body["metadata"] = t.Metadata
}
if len(t.Tags) > 0 {
body["tags"] = t.Tags
}
c.queue(event{
ID: NewID(),
Type: "trace-create",
Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
Body: body,
})
return id
}
// SpanInput is what callers fill in when recording a span.
type SpanInput struct {
TraceID string
ParentID string // optional — for nested spans
Name string
Input any
Output any
Metadata map[string]any
StartTime time.Time
EndTime time.Time
StatusCode int // 0 = success, anything else = error code
Level string // "DEBUG" | "DEFAULT" | "WARNING" | "ERROR"
}
// Span records one span attached to a trace. Returns the span id.
func (c *Client) Span(ctx context.Context, s SpanInput) string {
id := NewID()
body := map[string]any{
"id": id,
"traceId": s.TraceID,
"name": s.Name,
}
if s.ParentID != "" {
body["parentObservationId"] = s.ParentID
}
if s.Input != nil {
body["input"] = s.Input
}
if s.Output != nil {
body["output"] = s.Output
}
if s.Metadata != nil {
body["metadata"] = s.Metadata
}
if !s.StartTime.IsZero() {
body["startTime"] = s.StartTime.UTC().Format(time.RFC3339Nano)
}
if !s.EndTime.IsZero() {
body["endTime"] = s.EndTime.UTC().Format(time.RFC3339Nano)
}
if s.Level != "" {
body["level"] = s.Level
}
if s.StatusCode != 0 {
body["statusMessage"] = fmt.Sprintf("status_code=%d", s.StatusCode)
}
c.queue(event{
ID: NewID(),
Type: "span-create",
Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
Body: body,
})
return id
}
func (c *Client) queue(e event) {
c.mu.Lock()
c.pending = append(c.pending, e)
shouldFlush := len(c.pending) >= c.maxBatch
c.mu.Unlock()
if shouldFlush {
_ = c.Flush(context.Background())
}
}
// Flush sends all queued events in one batch. Best-effort: returns
// the error but also logs; callers can ignore.
func (c *Client) Flush(ctx context.Context) error {
c.mu.Lock()
if len(c.pending) == 0 {
c.mu.Unlock()
return nil
}
batch := c.pending
c.pending = nil
c.mu.Unlock()
body, err := json.Marshal(map[string]any{"batch": batch})
if err != nil {
slog.Warn("langfuse: marshal batch", "err", err, "n", len(batch))
return err
}
req, err := http.NewRequestWithContext(ctx, "POST", c.url+"/api/public/ingestion", bytes.NewReader(body))
if err != nil {
return err
}
req.Header.Set("Authorization", c.auth)
req.Header.Set("Content-Type", "application/json")
resp, err := c.hc.Do(req)
if err != nil {
slog.Warn("langfuse: post", "err", err, "n", len(batch))
return err
}
defer resp.Body.Close()
if resp.StatusCode/100 != 2 && resp.StatusCode != 207 {
slog.Warn("langfuse: non-2xx", "status", resp.StatusCode, "n", len(batch))
return fmt.Errorf("langfuse ingestion: HTTP %d", resp.StatusCode)
}
return nil
}
// Close flushes any remaining events. Idempotent.
func (c *Client) Close() error {
return c.Flush(context.Background())
}
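
The buffering in `queue` and `Flush` follows a common pattern: detach the pending slice while holding the mutex, then do the slow send after releasing it. A self-contained sketch of just that pattern, where `batcher` and its `send` callback are hypothetical stand-ins for the `Client` and its HTTP POST:

```go
package main

import (
	"fmt"
	"sync"
)

// batcher is a hypothetical stand-in for the Client's buffering fields.
type batcher struct {
	mu      sync.Mutex
	pending []string
	send    func(batch []string) error // stand-in for the HTTP POST
}

// add appends one item under the lock.
func (b *batcher) add(item string) {
	b.mu.Lock()
	b.pending = append(b.pending, item)
	b.mu.Unlock()
}

// flush detaches the pending slice under the lock, then sends without
// holding it, so concurrent add calls are never blocked on network I/O.
func (b *batcher) flush() error {
	b.mu.Lock()
	batch := b.pending
	b.pending = nil
	b.mu.Unlock()
	if len(batch) == 0 {
		return nil
	}
	return b.send(batch)
}

func main() {
	b := &batcher{send: func(batch []string) error {
		fmt.Printf("sending %d events\n", len(batch))
		return nil
	}}
	b.add("trace-create")
	b.add("span-create")
	if err := b.flush(); err != nil {
		fmt.Println("flush failed:", err)
	}
}
```

As in the client above, a batch whose send fails is not re-queued, which matches the best-effort posture: telemetry may be lost, but the calling path is never held up by retries.
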


@@ -85,6 +85,15 @@ type SearchRequest struct {
PlaybookMaxDistance float64 `json:"playbook_max_distance,omitempty"`
PlaybookMaxInjectDistance float64 `json:"playbook_max_inject_distance,omitempty"`
MetadataFilter map[string]any `json:"metadata_filter,omitempty"`
// ExcludeIDs filters out specific worker IDs post-retrieval.
// Real-world driver: a coordinator places 200 workers at a
// contract, then mid-day the client asks for a different set —
// the next query should NOT return the already-placed workers.
// Filter runs after merge but before metadata filter, so an
// excluded ID never wastes a slot in the post-filter top-K.
// Also applies to playbook boost + Shape B inject — excluded
// answers are skipped at injection time.
ExcludeIDs []string `json:"exclude_ids,omitempty"`
}
// SearchResponse wraps the merged results plus per-corpus return
@@ -204,6 +213,25 @@ func (r *Retriever) Search(ctx context.Context, req SearchRequest) (*SearchRespo
return allHits[i].Distance < allHits[j].Distance
})
// ExcludeIDs filter — applied first so excluded IDs don't waste
// a slot in the post-filter top-K. Real-world driver: coordinator
// has placed N workers at a contract; mid-day the client asks for
// alternatives, so this query passes ExcludeIDs=<placed_ids> and
// gets back fresh candidates instead of the same N.
if len(req.ExcludeIDs) > 0 {
excludeSet := make(map[string]bool, len(req.ExcludeIDs))
for _, id := range req.ExcludeIDs {
excludeSet[id] = true
}
kept := make([]Result, 0, len(allHits))
for _, h := range allHits {
if !excludeSet[h.ID] {
kept = append(kept, h)
}
}
allHits = kept
}
// Metadata filter (component B — staffing-side structured gate).
// Applied BEFORE top-K truncation so the filter doesn't accidentally
// reduce coverage further. Caller can request larger PerCorpusK to
@@ -239,6 +267,23 @@ func (r *Retriever) Search(ctx context.Context, req SearchRequest) (*SearchRespo
if err != nil {
slog.Warn("matrix: playbook lookup failed; skipping boost+inject", "err", err)
} else if len(hits) > 0 {
// Filter playbook hits to honor ExcludeIDs — without this,
// an excluded answer in a playbook recording would re-enter
// the result set via Shape B inject, defeating the swap
// semantics that the exclude list exists to enforce.
if len(req.ExcludeIDs) > 0 {
excludeSet := make(map[string]bool, len(req.ExcludeIDs))
for _, id := range req.ExcludeIDs {
excludeSet[id] = true
}
keptHits := make([]PlaybookHit, 0, len(hits))
for _, h := range hits {
if !excludeSet[h.Entry.AnswerID] {
keptHits = append(keptHits, h)
}
}
hits = keptHits
}
resp.PlaybookBoosted = ApplyPlaybookBoost(resp.Results, hits)
maxInjectDist := float32(req.PlaybookMaxInjectDistance)
if maxInjectDist <= 0 {
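
Both hunks above build the same exclude set and apply the same filter, once over retrieval results and once over playbook hits. One possible way to share that logic is sketched below, assuming Go 1.18+ generics. `toSet`, `filterExcluded`, and the simplified `result` / `playbookHit` types are hypothetical; the real `PlaybookHit` reaches the ID via `h.Entry.AnswerID`:

```go
package main

import "fmt"

// toSet turns an ID list into a lookup set.
func toSet(ids []string) map[string]struct{} {
	set := make(map[string]struct{}, len(ids))
	for _, id := range ids {
		set[id] = struct{}{}
	}
	return set
}

// filterExcluded keeps the items whose key is not in exclude.
// key extracts the worker ID from whatever item type is being filtered.
func filterExcluded[T any](items []T, exclude map[string]struct{}, key func(T) string) []T {
	if len(exclude) == 0 {
		return items
	}
	kept := make([]T, 0, len(items))
	for _, it := range items {
		if _, skip := exclude[key(it)]; !skip {
			kept = append(kept, it)
		}
	}
	return kept
}

// Simplified stand-ins for the retriever's result and playbook-hit types.
type result struct{ ID string }
type playbookHit struct{ AnswerID string }

func main() {
	exclude := toSet([]string{"w-2"})
	results := []result{{ID: "w-1"}, {ID: "w-2"}, {ID: "w-3"}}
	hits := []playbookHit{{AnswerID: "w-2"}, {AnswerID: "w-9"}}

	keptResults := filterExcluded(results, exclude, func(r result) string { return r.ID })
	keptHits := filterExcluded(hits, exclude, func(h playbookHit) string { return h.AnswerID })
	fmt.Println(len(keptResults), len(keptHits))
}
```

Building the set once per request and passing it to both call sites would also avoid constructing it twice.
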


@@ -41,6 +41,12 @@ const (
// the workflow handler was casting a string literal to Source,
// which worked coincidentally but left the taxonomy implicit.
SourceWorkflow Source = "workflow"
// SourceInbox tags ObservedOps emitted by /observer/inbox — incoming
// real-world signals (email, SMS) that a coordinator would receive
// and act on. The handler only RECORDS the message; downstream
// triggers (e.g. matrix.search on the parsed demand) are the
// caller's concern, recorded separately.
SourceInbox Source = "inbox"
)
// ObservedOp is one entry in the observer's ring buffer (and JSONL


@@ -43,7 +43,7 @@ bind = "127.0.0.1:3216"
# G2: Ollama local. G3+ may swap in OpenAI/Voyage by changing
# this URL + the wire format inside the provider.
provider_url = "http://localhost:11434"
default_model = "nomic-embed-text"
default_model = "nomic-embed-text-v2-moe"
[queryd]
bind = "127.0.0.1:3214"
@@ -129,7 +129,7 @@ level = "info"
[models]
# Tier 1 — local hot path
local_fast = "qwen3.5:latest"
local_embed = "nomic-embed-text"
local_embed = "nomic-embed-text-v2-moe" # 475M MoE, drop-in upgrade from 137M v1 — verified 2026-04-30 same 768-dim
# local_judge stays on qwen2.5:latest — qwen3.5:latest is a vision-SSM
# build with 256K context that runs ~30s per judge call against the
# playbook_lift loop (verified 2026-04-30). qwen2.5:latest at ~1s/call


@@ -0,0 +1,77 @@
# Multi-Coordinator Stress Test — Run 001
**Generated:** 2026-04-30T12:54:09.621556469Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 52
**Evidence:** `reports/reality-tests/multi_coord_stress_001.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 0 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.003703703703703704 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable: reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 | 4 |
| Alice's recorded answer in Bob's top-K | 4 |
| **Handover hit rate (top-1)** | **1** |
**Interpretation:**
- Hit rate ≥ 0.5: handover is meaningful — the second coordinator inherits the first's institutional memory.
- Hit rate ≈ 0.0: playbook namespace isolation is working but the playbook itself isn't transferable, OR Bob's queries don't match Alice's recordings closely enough.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_001.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_001.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_001.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.


@@ -0,0 +1,82 @@
# Multi-Coordinator Stress Test — Run 002
**Generated:** 2026-04-30T13:02:13.570393819Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 56
**Evidence:** `reports/reality-tests/multi_coord_stress_002.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.11900691900691901 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.003703703703703704 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable: reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_002.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_002.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_002.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.


@ -0,0 +1,82 @@
# Multi-Coordinator Stress Test — Run 003
**Generated:** 2026-04-30T13:13:44.35966865Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 61
**Evidence:** `reports/reality-tests/multi_coord_stress_003.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.03068783068783069 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_003.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_003.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_003.json
```
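The same filters can be run from Python when the results feed further analysis. This is a sketch under the assumption that the evidence file has a top-level `events` array whose entries carry the `phase`, `coordinator`, `role`, `contract`, and `top_k[].id` fields that the jq filters reference. The inline sample, its worker IDs, and the second role name are hypothetical stand-ins for the real file.

```python
import json

# Hypothetical two-event sample; only the field names come from the jq filters.
SAMPLE = """
{"events": [
  {"phase": "baseline", "coordinator": "alice", "role": "warehouse worker",
   "contract": "alpha_milwaukee_distribution",
   "top_k": [{"id": "w-001"}, {"id": "w-002"}]},
  {"phase": "merge", "coordinator": "bob", "role": "forklift operator",
   "contract": "beta_indianapolis_manufacturing",
   "top_k": [{"id": "w-003"}]}
]}
"""

def load_events(text):
    """Parse report JSON text and return its list of events."""
    return json.loads(text)["events"]

def select(events, **criteria):
    """Keep events whose fields equal every keyword criterion."""
    return [
        event for event in events
        if all(event.get(key) == value for key, value in criteria.items())
    ]

def top_k_ids(event):
    """Extract the ranked worker IDs from one event."""
    return [hit["id"] for hit in event.get("top_k", [])]

events = load_events(SAMPLE)
for event in select(events, role="warehouse worker"):
    print(event["phase"], event["contract"], top_k_ids(event))
```

To run it against a real report, read the evidence file's text and pass it to `load_events` in place of `SAMPLE`.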
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.


@ -0,0 +1,82 @@
# Multi-Coordinator Stress Test — Run 004
**Generated:** 2026-04-30T13:17:03.577877974Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 61
**Evidence:** `reports/reality-tests/multi_coord_stress_004.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.08013468013468013 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.012820512820512822 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_004.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_004.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_004.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.


@ -0,0 +1,82 @@
# Multi-Coordinator Stress Test — Run 005
**Generated:** 2026-04-30T13:25:15.497712275Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 61
**Evidence:** `reports/reality-tests/multi_coord_stress_005.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.03610093610093609 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_005.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_005.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_005.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.


@ -0,0 +1,82 @@
# Multi-Coordinator Stress Test — Run 006
**Generated:** 2026-04-30T13:33:24.568124731Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_006.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.04603174603174603 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_006.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_006.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_006.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.


@ -0,0 +1,82 @@
# Multi-Coordinator Stress Test — Run 007
**Generated:** 2026-04-30T19:50:04.791000091Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_007.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.026455026455026454 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_007.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_007.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_007.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.


@ -0,0 +1,82 @@
# Multi-Coordinator Stress Test — Run 008
**Generated:** 2026-04-30T21:15:37.045817146Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_008.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.04126984126984126 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_008.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_008.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_008.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.


@ -0,0 +1,82 @@
# Multi-Coordinator Stress Test — Run 009
**Generated:** 2026-04-30T21:23:59.011167722Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_009.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.015873015873015872 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.015343915343915345 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_009.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_009.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_009.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.


@ -0,0 +1,82 @@
# Multi-Coordinator Stress Test — Run 010
**Generated:** 2026-04-30T21:30:38.434794788Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_010.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.007407407407407408 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.026455026455026454 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_010.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_010.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_010.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.


@ -0,0 +1,82 @@
# Multi-Coordinator Stress Test — Run 011
**Generated:** 2026-04-30T21:41:26.801002955Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_011.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.025641025641025644 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.06996336996336996 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
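The mean Jaccard figures above compare top-K worker ID lists pairwise. A minimal Go sketch of that calculation, assuming each top-K is treated as an unordered set of IDs; `jaccard`, `meanJaccard`, and the worker IDs are hypothetical and not taken from the driver:

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two top-K ID lists, treating each
// list as a set. Two empty lists are defined here as identical (1.0).
func jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	if len(setA) == 0 && len(setB) == 0 {
		return 1.0
	}
	inter := 0
	for id := range setA {
		if _, ok := setB[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

// meanJaccard averages jaccard over a list of top-K pairs.
func meanJaccard(pairs [][2][]string) float64 {
	if len(pairs) == 0 {
		return 0
	}
	sum := 0.0
	for _, p := range pairs {
		sum += jaccard(p[0], p[1])
	}
	return sum / float64(len(pairs))
}

func main() {
	// Hypothetical worker IDs, not taken from the run.
	milwaukee := []string{"w-1", "w-2", "w-3", "w-4"}
	chicago := []string{"w-4", "w-5", "w-6", "w-7"}
	// One shared ID out of seven distinct IDs.
	fmt.Printf("%.4f\n", jaccard(milwaukee, chicago))
}
```

For scale: with K = 8, two lists that share exactly one worker score 1/15 ≈ 0.067.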
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
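The top-1 and top-K rows in the table differ only in where the recorded answer is allowed to appear. A short Go sketch of that counting, assuming one recorded ID per handover query; `handoverResult`, `hitCounts`, `hitRate`, and the IDs are hypothetical names for illustration:

```go
package main

import "fmt"

// handoverResult is a hypothetical per-query record: the ID Alice recorded
// and the ranked top-K Bob got back for the same (or paraphrased) query.
type handoverResult struct {
	RecordedID string
	TopK       []string
}

// hitCounts returns how many recorded answers landed at rank 0 and how many
// appeared anywhere in the top-K.
func hitCounts(results []handoverResult) (top1, topK int) {
	for _, r := range results {
		for rank, id := range r.TopK {
			if id != r.RecordedID {
				continue
			}
			topK++
			if rank == 0 {
				top1++
			}
			break
		}
	}
	return top1, topK
}

// hitRate guards against division by zero when no queries ran.
func hitRate(hits, run int) float64 {
	if run == 0 {
		return 0
	}
	return float64(hits) / float64(run)
}

func main() {
	results := []handoverResult{
		{RecordedID: "w-10", TopK: []string{"w-10", "w-11"}}, // top-1 hit
		{RecordedID: "w-20", TopK: []string{"w-21", "w-20"}}, // top-K only
		{RecordedID: "w-30", TopK: []string{"w-31", "w-32"}}, // miss
	}
	top1, topK := hitCounts(results)
	fmt.Printf("top1=%d topK=%d rate=%.2f\n", top1, topK, hitRate(top1, len(results)))
}
```

In this run both counts equal the number of queries (4 of 4), which is why both rates render as 1.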
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_011.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_011.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_011.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.


@ -0,0 +1,120 @@
# Playbook-Lift Reality Test — Run 005
**Generated:** 2026-04-30T12:40:48.475901847Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/playbook_lift_queries.txt` (21 executed)
**K per pass:** 10
**Paraphrase pass:** ENABLED
**Re-judge pass:** ENABLED
**Evidence:** `reports/reality-tests/playbook_lift_005.json`
---
## Headline
| Metric | Value |
|---|---:|
| Total queries run | 21 |
| Cold-pass discoveries (judge-best ≠ top-1) | 7 |
| Warm-pass lifts (recorded playbook → top-1) | 5 |
| No change (judge-best already top-1, no playbook needed) | 16 |
| Playbook boosts triggered (warm pass) | 9 |
| Mean Δ top-1 distance (warm − cold) | -0.076170966 |
| **Paraphrase pass — recorded answer at rank 0 (top-1)** | **5 / 7** |
| Paraphrase pass — recorded answer at any rank in top-K | 5 / 7 |
| **Quality lift** (warm top-1 rating > cold top-1 rating) | **5 / 21** |
| Quality neutral (warm top-1 rating = cold top-1 rating) | 13 / 21 |
| Quality regressed (warm top-1 rating < cold top-1 rating) | 3 / 21 |
**Verbatim lift rate:** 5 of 7 discoveries became top-1 after warm pass.
---
## Per-query results
| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehouse experience, day sh | e-5670 | 2/4 | ✓ e-5729 | e-5729 | 0 | **YES** |
| 2 | OSHA-30 certified forklift operator in Wisconsin, cold stora | e-6293 | 7/3 | — | w-1566 | 8 | no |
| 3 | Production worker with confined-space cert and hazmat traini | w-602 | 0/2 | — | w-3575 | 1 | no |
| 4 | CDL Class A driver, clean record, willing to do regional 4-d | w-3854 | 0/1 | — | w-3854 | 0 | no |
| 5 | Warehouse lead with current OSHA-30 certification, NOT OSHA- | w-1807 | 6/3 | — | w-1807 | 6 | no |
| 6 | Forklift-certified loader, certification must be active, dis | w-1807 | 3/4 | ✓ w-205 | w-4257 | 1 | no |
| 7 | Hazmat-certified warehouse worker comfortable with cold stor | e-4910 | 2/4 | ✓ w-4257 | w-205 | 1 | no |
| 8 | Bilingual production worker with team-lead experience and tr | w-4988 | 0/4 | — | w-4988 | 0 | no |
| 9 | Inventory specialist with confined-space cert and compliance | w-388 | 3/4 | ✓ w-3575 | w-3575 | 0 | **YES** |
| 10 | Warehouse worker who can run inventory cycles and lead a sma | e-3011 | 0/4 | — | e-3011 | 0 | no |
| 11 | Production line worker comfortable filling in as line superv | w-1387 | 0/4 | — | e-5729 | 1 | no |
| 12 | Customer service rep willing to cross-train into dispatch or | w-1451 | 0/2 | — | w-1451 | 0 | no |
| 13 | Reliable production line lead with strong attendance and lea | e-7360 | 5/4 | ✓ w-2886 | w-2886 | 0 | **YES** |
| 14 | Highly responsive forklift operator available for last-minut | e-6108 | 5/4 | ✓ w-1566 | w-1566 | 0 | **YES** |
| 15 | Engaged warehouse associate with strong safety compliance re | e-2743 | 2/4 | ✓ w-49 | w-49 | 0 | **YES** |
| 16 | CDL-A driver based in IL or WI, willing to run regional 4-da | w-2486 | 5/2 | — | w-2486 | 5 | no |
| 17 | Bilingual customer service rep in Indianapolis or Cincinnati | e-9749 | 9/2 | — | e-9749 | 9 | no |
| 18 | Production supervisor open to Midwest relocation for permane | w-379 | 6/3 | — | w-379 | 6 | no |
| 19 | Dental hygienist with three years experience, Indianapolis a | e-6772 | 0/1 | — | w-3575 | 1 | no |
| 20 | Registered nurse with ICU experience, willing to take per-di | w-379 | 0/1 | — | w-379 | 0 | no |
| 21 | Software engineer with React and TypeScript, three years exp | w-1773 | 0/1 | — | w-1773 | 0 | no |
---
## Paraphrase pass — does the playbook help similar-but-different queries?
For each query whose Pass 1 cold pass recorded a playbook entry, the
judge model rephrased the query, and the rephrased version was sent
through warm matrix.search. The recorded answer ID's rank in those
results tests whether cosine on the embedded paraphrase finds the
recorded query's vector.
| # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift |
|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehous | Seeking forklift operator certified in OSHA-30, looking for | e-5729 | e-5729 | 0 | **YES** |
| 6 | Forklift-certified loader, certification | Loader requiring active forklift certification, this must no | w-205 | w-205 | 0 | **YES** |
| 7 | Hazmat-certified warehouse worker comfor | Warehouse worker with Hazmat certification and experience in | w-4257 | w-4257 | 0 | **YES** |
| 9 | Inventory specialist with confined-space | Specialist in inventory management requiring certified confi | w-3575 | w-49 | -1 | no |
| 13 | Reliable production line lead with stron | Experienced production line supervisor with excellent punctu | w-2886 | w-2886 | 0 | **YES** |
| 14 | Highly responsive forklift operator avai | Available forklift operator ready for urgent shift coverage | w-1566 | w-1566 | 0 | **YES** |
| 15 | Engaged warehouse associate with strong | Warehouse associate dedicated to engagement and boasting a r | w-49 | w-984 | -1 | no |
---
## Honesty caveats
1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
the lift number is meaningless. To validate the judge itself, sample 5–10
verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
`distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
case — same query, recorded playbook, expected boost. The paraphrase
pass (when enabled) is the actual learning property: similar-but-different
queries hitting a recorded playbook. Compare verbatim and paraphrase
lift rates — paraphrase should be lower (semantic-distance gates some
playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
results land in one corpus, the matrix layer's purpose isn't being tested.
Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from
env JUDGE_MODEL=qwen2.5:latest.
Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates
relevance also rephrases queries. A judge that's bad at rating staffing
queries is probably also bad at rephrasing them. Worth sanity-checking
a sample of `paraphrase_query` values in the JSON before trusting the
paraphrase lift number.
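Caveat 2 can be checked with a few lines of arithmetic. A minimal Go sketch of the quoted formula, assuming the cold top-1 itself receives no boost; `boost`, `promotes`, and the distances are illustrative, not values from this run:

```go
package main

import "fmt"

// boost applies the playbook formula quoted in caveat 2:
// distance' = distance * (1 - 0.5*score).
func boost(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

// promotes reports whether a boosted candidate overtakes the cold top-1.
// It assumes the cold top-1 is not boosted; ties are treated as no promotion.
func promotes(candidateDist, coldTop1Dist, score float64) bool {
	return boost(candidateDist, score) < coldTop1Dist
}

func main() {
	coldTop1 := 0.30 // illustrative distance
	for _, cand := range []float64{0.50, 0.70} {
		fmt.Printf("candidate %.2f -> %.2f promoted=%v\n",
			cand, boost(cand, 1.0), promotes(cand, coldTop1, 1.0))
	}
}
```

With a score of 1.0 the candidate at 0.50 is halved to 0.25 and overtakes 0.30, while the candidate at 0.70 lands at 0.35 and stays behind: anything beyond 2× the cold top-1 distance cannot be promoted.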
## Next moves
- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why: judge variance, distance gap too
wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
already close to optimal on this query distribution. Either the corpus
is too narrow or the queries are too easy.


@ -76,8 +76,14 @@ DIM="$(echo "$RESP" | jq -r '.dimension')"
N="$(echo "$RESP" | jq -r '.vectors | length')"
MODEL="$(echo "$RESP" | jq -r '.model')"
SAME="$(echo "$RESP" | jq -r '.vectors[0][0] == .vectors[1][0]')"
-if [ "$DIM" = "768" ] && [ "$N" = "2" ] && [ "$MODEL" = "nomic-embed-text" ] && [ "$SAME" = "false" ]; then
-echo " ✓ dim=768, model=nomic-embed-text, 2 distinct vectors"
+# Accept any nomic-embed-text* family member as the default — v1
+# (137M, 768d) and v2-moe (475M MoE, 768d) are both supported drop-ins.
+# The smoke locks the dimension + the distinct-vectors property, NOT
+# the exact model name (operators bump the model in lakehouse.toml
+# without changing this smoke).
+case "$MODEL" in nomic-embed-text*) MODEL_OK=1 ;; *) MODEL_OK=0 ;; esac
+if [ "$DIM" = "768" ] && [ "$N" = "2" ] && [ "$MODEL_OK" = "1" ] && [ "$SAME" = "false" ]; then
+echo " ✓ dim=768, model=$MODEL, 2 distinct vectors"
else
echo " ✗ resp: dim=$DIM n=$N model=$MODEL same=$SAME"; FAILED=1
fi
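The `case` glob in the hunk above amounts to a prefix test. A small Go sketch of the same acceptance rule, for readers more at home in the Go services; `inModelFamily` is a hypothetical helper and the third model name is only an example of a non-member:

```go
package main

import (
	"fmt"
	"strings"
)

// inModelFamily reports whether model starts with the family prefix,
// matching the shell glob nomic-embed-text*.
func inModelFamily(model string) bool {
	return strings.HasPrefix(model, "nomic-embed-text")
}

func main() {
	for _, m := range []string{"nomic-embed-text", "nomic-embed-text-v2-moe", "mxbai-embed-large"} {
		fmt.Printf("%s -> %v\n", m, inModelFamily(m))
	}
}
```

A prefix test also accepts any future name that happens to start with `nomic-embed-text`, which is one reason the smoke still pins the dimension separately.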

scripts/multi_coord_stress.sh (executable file, 282 lines)

@ -0,0 +1,282 @@
#!/usr/bin/env bash
# Multi-coordinator stress harness — Phase 1 of the 48-hour mock.
#
# Three coordinators (Alice / Bob / Carol) own three distinct contracts
# (Milwaukee distribution, Indianapolis manufacturing, Chicago
# construction). The driver fires phases:
# 1. baseline — each coord runs their contract's role queries
# 2. surge — each contract's demand doubles (URGENT phrasing)
# 3. merge — alpha + beta combined under alice
# 4. handover — bob takes alpha, USING alice's playbook namespace
# 5. split — alpha surge re-distributed across all 3 coords
# 6. reissue — non-determinism check: same baselines reissued
# 7. analysis — diversity + determinism + learning metrics
#
# Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints,
# and Langfuse wiring — those are Phase 2/3.
#
# Usage:
# ./scripts/multi_coord_stress.sh # run #001
# RUN_ID=002 ./scripts/multi_coord_stress.sh
# K=12 ./scripts/multi_coord_stress.sh
set -euo pipefail
cd "$(dirname "$0")/.."
export PATH="$PATH:/usr/local/go/bin"
RUN_ID="${RUN_ID:-001}"
WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
ETHEREAL_LIMIT="${ETHEREAL_LIMIT:-0}"
CORPORA="${CORPORA:-workers,ethereal_workers}"
K="${K:-8}"
OUT_JSON="reports/reality-tests/multi_coord_stress_${RUN_ID}.json"
OUT_MD="reports/reality-tests/multi_coord_stress_${RUN_ID}.md"
if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
echo "[stress] Ollama not reachable on :11434 — skipping (need it for embeddings)"
exit 0
fi
echo "[stress] building binaries..."
go build -o bin/ ./cmd/storaged ./cmd/catalogd ./cmd/ingestd ./cmd/queryd \
./cmd/embedd ./cmd/vectord ./cmd/pathwayd ./cmd/observerd \
./cmd/matrixd ./cmd/gateway \
./scripts/staffing_workers ./scripts/multi_coord_stress
pkill -f "bin/(storaged|catalogd|ingestd|queryd|embedd|vectord|pathwayd|observerd|matrixd|gateway)" 2>/dev/null || true
sleep 0.3
PIDS=()
TMP="$(mktemp -d)"
CFG="$TMP/stress.toml"
cleanup() {
echo "[stress] cleanup"
for p in "${PIDS[@]:-}"; do [ -n "${p:-}" ] && kill "$p" 2>/dev/null || true; done
rm -rf "$TMP"
}
trap cleanup EXIT INT TERM
cat > "$CFG" <<EOF
[s3]
endpoint = "http://localhost:9000"
region = "us-east-1"
bucket = "lakehouse-go-primary"
use_path_style = true
[gateway]
bind = "127.0.0.1:3110"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
ingestd_url = "http://127.0.0.1:3213"
queryd_url = "http://127.0.0.1:3214"
vectord_url = "http://127.0.0.1:3215"
embedd_url = "http://127.0.0.1:3216"
pathwayd_url = "http://127.0.0.1:3217"
matrixd_url = "http://127.0.0.1:3218"
observerd_url = "http://127.0.0.1:3219"
[storaged]
bind = "127.0.0.1:3211"
[catalogd]
bind = "127.0.0.1:3212"
storaged_url = "http://127.0.0.1:3211"
[ingestd]
bind = "127.0.0.1:3213"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
max_ingest_bytes = 268435456
[queryd]
bind = "127.0.0.1:3214"
catalogd_url = "http://127.0.0.1:3212"
secrets_path = "/etc/lakehouse/secrets-go.toml"
refresh_every = "1s"
[embedd]
bind = "127.0.0.1:3216"
provider_url = "http://localhost:11434"
default_model = "nomic-embed-text-v2-moe"
[vectord]
bind = "127.0.0.1:3215"
storaged_url = ""
[pathwayd]
bind = "127.0.0.1:3217"
persist_path = ""
[observerd]
bind = "127.0.0.1:3219"
persist_path = ""
[matrixd]
bind = "127.0.0.1:3218"
embedd_url = "http://127.0.0.1:3216"
vectord_url = "http://127.0.0.1:3215"
EOF
poll_health() {
local port="$1" deadline=$(($(date +%s) + 5))
while [ "$(date +%s)" -lt "$deadline" ]; do
if curl -sS --max-time 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then return 0; fi
sleep 0.05
done
return 1
}
echo "[stress] launching stack..."
./bin/storaged -config "$CFG" > /tmp/stress_storaged.log 2>&1 & PIDS+=($!); poll_health 3211 || { echo "storaged failed"; exit 1; }
./bin/catalogd -config "$CFG" > /tmp/stress_catalogd.log 2>&1 & PIDS+=($!); poll_health 3212 || { echo "catalogd failed"; exit 1; }
./bin/ingestd -config "$CFG" > /tmp/stress_ingestd.log 2>&1 & PIDS+=($!); poll_health 3213 || { echo "ingestd failed"; exit 1; }
./bin/queryd -config "$CFG" > /tmp/stress_queryd.log 2>&1 & PIDS+=($!); poll_health 3214 || { echo "queryd failed"; exit 1; }
./bin/embedd -config "$CFG" > /tmp/stress_embedd.log 2>&1 & PIDS+=($!); poll_health 3216 || { echo "embedd failed"; exit 1; }
./bin/vectord -config "$CFG" > /tmp/stress_vectord.log 2>&1 & PIDS+=($!); poll_health 3215 || { echo "vectord failed"; exit 1; }
./bin/pathwayd -config "$CFG" > /tmp/stress_pathwayd.log 2>&1 & PIDS+=($!); poll_health 3217 || { echo "pathwayd failed"; exit 1; }
./bin/observerd -config "$CFG" > /tmp/stress_observerd.log 2>&1 & PIDS+=($!); poll_health 3219 || { echo "observerd failed"; exit 1; }
./bin/matrixd -config "$CFG" > /tmp/stress_matrixd.log 2>&1 & PIDS+=($!); poll_health 3218 || { echo "matrixd failed"; exit 1; }
./bin/gateway -config "$CFG" > /tmp/stress_gateway.log 2>&1 & PIDS+=($!); poll_health 3110 || { echo "gateway failed"; exit 1; }
echo
echo "[stress] ingest workers (limit=$WORKERS_LIMIT) into 'workers' corpus..."
./bin/staffing_workers -limit "$WORKERS_LIMIT"
echo
echo "[stress] ingest ethereal_workers (limit=$ETHEREAL_LIMIT, 0=all) into 'ethereal_workers' corpus..."
./bin/staffing_workers \
-parquet "/home/profit/lakehouse/data/datasets/ethereal_workers.parquet" \
-index-name ethereal_workers \
-id-prefix "e-" \
-limit "$ETHEREAL_LIMIT"
echo
echo "[stress] running multi-coord stress driver..."
EXTRA_FLAGS=""
if [ "${WITH_PARAPHRASE_HANDOVER:-1}" = "1" ]; then
EXTRA_FLAGS="$EXTRA_FLAGS -with-paraphrase-handover"
fi
./bin/multi_coord_stress \
-gateway "http://127.0.0.1:3110" \
-contracts tests/reality/contracts \
-corpora "$CORPORA" \
-k "$K" \
-out "$OUT_JSON" \
-ollama "http://localhost:11434" \
-judge "${JUDGE_MODEL:-qwen2.5:latest}" \
$EXTRA_FLAGS
echo
echo "[stress] generating markdown report → $OUT_MD"
# Render compact markdown from the JSON. Same shape as the lift harness
# reports so reviewers can compare format.
total=$(jq -r '.events | length' "$OUT_JSON")
gen_at=$(jq -r '.generated_at' "$OUT_JSON")
div_role=$(jq -r '.diversity.same_role_across_contracts_mean_jaccard' "$OUT_JSON")
div_role_n=$(jq -r '.diversity.num_pairs_same_role_across_contracts' "$OUT_JSON")
div_xrole=$(jq -r '.diversity.different_roles_same_contract_mean_jaccard' "$OUT_JSON")
div_xrole_n=$(jq -r '.diversity.num_pairs_different_roles_same_contract' "$OUT_JSON")
det_jacc=$(jq -r '.determinism.mean_jaccard' "$OUT_JSON")
det_n=$(jq -r '.determinism.num_reissued_pairs' "$OUT_JSON")
hand_run=$(jq -r '.learning.handover_queries_run' "$OUT_JSON")
hand_top1=$(jq -r '.learning.recorded_answers_top1_count' "$OUT_JSON")
hand_topk=$(jq -r '.learning.recorded_answers_topk_count' "$OUT_JSON")
hand_rate=$(jq -r '.learning.handover_hit_rate' "$OUT_JSON")
ph_run=$(jq -r '.learning.paraphrase_handover_run // 0' "$OUT_JSON")
ph_top1=$(jq -r '.learning.paraphrase_top1_count // 0' "$OUT_JSON")
ph_topk=$(jq -r '.learning.paraphrase_topk_count // 0' "$OUT_JSON")
ph_rate=$(jq -r '.learning.paraphrase_handover_hit_rate // 0' "$OUT_JSON")
cat > "$OUT_MD" <<MDEOF
# Multi-Coordinator Stress Test — Run ${RUN_ID}
**Generated:** ${gen_at}
**Coordinators:** alice / bob / carol (each with own playbook namespace: \`playbook_alice\` / \`playbook_bob\` / \`playbook_carol\`)
**Contracts:** $(jq -r '.contracts | join(" / ")' "$OUT_JSON")
**Corpora:** \`${CORPORA}\`
**K per query:** ${K}
**Total events captured:** ${total}
**Evidence:** \`${OUT_JSON}\`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | ${div_role} | ${div_role_n} | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | ${div_xrole} | ${div_xrole_n} | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | ${det_jacc} |
| Number of reissue pairs | ${det_n} |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | ${hand_run} |
| Alice's recorded answer at Bob's top-1 (verbatim) | ${hand_top1} |
| Alice's recorded answer in Bob's top-K (verbatim) | ${hand_topk} |
| **Verbatim handover hit rate (top-1)** | **${hand_rate}** |
| Paraphrase handover queries run | ${ph_run} |
| Alice's recorded answer at Bob's top-1 (paraphrase) | ${ph_top1} |
| Alice's recorded answer in Bob's top-K (paraphrase) | ${ph_topk} |
| **Paraphrase handover hit rate (top-1)** | **${ph_rate}** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
\`\`\`bash
jq '.events[] | select(.phase == "merge")' ${OUT_JSON}
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' ${OUT_JSON}
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' ${OUT_JSON}
\`\`\`
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
MDEOF
echo
echo "[stress] DONE"
echo "[stress] evidence: $OUT_JSON"
echo "[stress] report: $OUT_MD"

File diff suppressed because it is too large.


@ -52,6 +52,11 @@ CONFIG_PATH="${CONFIG_PATH:-lakehouse.toml}"
# actual learning-property test (does cosine on paraphrase find the
# recorded entry?). Set WITH_PARAPHRASE=0 for a faster verbatim-only run.
WITH_PARAPHRASE="${WITH_PARAPHRASE:-1}"
# WITH_REJUDGE=1 (default) adds a Pass 4 — judge warm top-1 to measure
# quality lift (warm rating vs cold rating). Catches cases where Shape B
# surfaces a different-but-equally-good answer (which the rank-based
# lift metric misses). +21 judge calls (~30s on qwen2.5).
WITH_REJUDGE="${WITH_REJUDGE:-1}"
OUT_JSON="reports/reality-tests/playbook_lift_${RUN_ID}.json"
OUT_MD="reports/reality-tests/playbook_lift_${RUN_ID}.md"
@ -156,7 +161,7 @@ refresh_every = "1s"
[embedd]
bind = "127.0.0.1:3216"
provider_url = "http://localhost:11434"
-default_model = "nomic-embed-text"
+default_model = "nomic-embed-text-v2-moe"
[vectord]
bind = "127.0.0.1:3215"
@ -271,9 +276,12 @@ echo "[lift] running driver — judge=$EFFECTIVE_JUDGE · queries=$QUERIES_FILE
# and runs its own resolution chain (env → config → fallback). When
# JUDGE_MODEL IS set, the explicit -judge wins inside the Go driver
# regardless of what its env-lookup would find — flag wins by design.
-PARAPHRASE_FLAG=""
+EXTRA_FLAGS=""
if [ "$WITH_PARAPHRASE" = "1" ]; then
-PARAPHRASE_FLAG="-with-paraphrase"
+EXTRA_FLAGS="$EXTRA_FLAGS -with-paraphrase"
fi
+if [ "$WITH_REJUDGE" = "1" ]; then
+EXTRA_FLAGS="$EXTRA_FLAGS -with-rejudge"
+fi
./bin/playbook_lift \
-config "$CONFIG_PATH" \
@ -284,7 +292,7 @@ fi
-judge "$JUDGE_MODEL" \
-k "$K" \
-out "$OUT_JSON" \
-$PARAPHRASE_FLAG
+$EXTRA_FLAGS
echo
echo "[lift] generating markdown report → $OUT_MD"
@ -302,6 +310,10 @@ generate_md() {
p_attempted=$(jq -r '.summary.paraphrase_attempted // 0' "$json")
p_top1=$(jq -r '.summary.paraphrase_top1_lifts // 0' "$json")
p_anyrank=$(jq -r '.summary.paraphrase_any_rank_hits // 0' "$json")
rj_attempted=$(jq -r '.summary.rejudge_attempted // 0' "$json")
q_lifted=$(jq -r '.summary.quality_lifted // 0' "$json")
q_neutral=$(jq -r '.summary.quality_neutral // 0' "$json")
q_regressed=$(jq -r '.summary.quality_regressed // 0' "$json")
# Only emit the paraphrase block when --with-paraphrase actually ran
# (i.e. .summary.paraphrase_attempted > 0). For verbatim-only runs we
@ -312,6 +324,13 @@ generate_md() {
| Paraphrase pass — recorded answer at any rank in top-K | ${p_anyrank} / ${p_attempted} |"
fi
rj_block=""
if [ "$rj_attempted" != "0" ] && [ "$rj_attempted" != "null" ]; then
rj_block="| **Quality lift** (warm top-1 rating > cold top-1 rating) | **${q_lifted} / ${rj_attempted}** |
| Quality neutral (warm top-1 rating = cold top-1 rating) | ${q_neutral} / ${rj_attempted} |
| Quality regressed (warm top-1 rating < cold top-1 rating) | ${q_regressed} / ${rj_attempted} |"
fi
cat > "$md" <<MDEOF
# Playbook-Lift Reality Test — Run ${RUN_ID}
@ -322,6 +341,7 @@ generate_md() {
**Queries:** \`${QUERIES_FILE}\` (${total} executed)
**K per pass:** ${K}
**Paraphrase pass:** $([ "$WITH_PARAPHRASE" = "1" ] && echo "ENABLED" || echo "disabled")
**Re-judge pass:** $([ "$WITH_REJUDGE" = "1" ] && echo "ENABLED" || echo "disabled")
**Evidence:** \`${OUT_JSON}\`
---
@ -337,6 +357,7 @@ generate_md() {
| Playbook boosts triggered (warm pass) | ${boosted} |
| Mean Δ top-1 distance (warm − cold) | ${mean_delta} |
${p_block}
${rj_block}
**Verbatim lift rate:** ${lift} of ${discovery} discoveries became top-1 after warm pass.


@ -75,12 +75,19 @@ type queryRun struct {
PlaybookRecorded bool `json:"playbook_recorded"`
PlaybookID string `json:"playbook_target_id,omitempty"`
-WarmTop1ID string `json:"warm_top1_id"`
-WarmTop1Distance float32 `json:"warm_top1_distance"`
-WarmBoostedCount int `json:"warm_boosted_count"`
-WarmJudgeBestRank int `json:"warm_judge_best_rank"`
+WarmTop1ID string `json:"warm_top1_id"`
+WarmTop1Distance float32 `json:"warm_top1_distance"`
+WarmBoostedCount int `json:"warm_boosted_count"`
+WarmJudgeBestRank int `json:"warm_judge_best_rank"` // rank of cold judge-best in warm — NOT the warm pass's own judge-best
+WarmTop1Metadata json.RawMessage `json:"-"` // cached for Pass 4 rejudge; not emitted
-Lift bool `json:"lift"` // judge-best was below top-1 cold, but top-1 warm
+// WarmTop1Rating: only populated when --with-rejudge. Compare to
+// ColdRatings[0] (== cold top-1 rating) to measure quality lift.
+// *int so absence (no rejudge pass) and a 0-rating verdict are
+// distinguishable.
+WarmTop1Rating *int `json:"warm_top1_rating,omitempty"`
+Lift bool `json:"lift"` // judge-best was below top-1 cold, but top-1 warm
// Paraphrase pass — only populated when --with-paraphrase. Tests
// the playbook's actual learning property: does a recorded entry
@ -114,6 +121,17 @@ type summary struct {
ParaphraseTop1Lifts int `json:"paraphrase_top1_lifts,omitempty"` // recorded answer surfaced at rank 0
ParaphraseAnyRankHits int `json:"paraphrase_any_rank_hits,omitempty"` // recorded answer surfaced at any rank in top-K
// Re-judge pass aggregates — only populated when --with-rejudge.
// Measures QUALITY lift (warm top-1 rating vs cold top-1 rating)
// rather than rank-of-cold-judge-best lift. The latter conflates
// "warm surfaced a different but equally-good result" with "warm
// shuffled ranks but the answer was the same"; quality lift
// disambiguates them.
RejudgeAttempted int `json:"rejudge_attempted,omitempty"` // queries that ran the rejudge pass
QualityLifted int `json:"quality_lifted,omitempty"` // warm-top-1 rating > cold-top-1 rating
QualityNeutral int `json:"quality_neutral,omitempty"` // ratings equal (could be same or different item)
QualityRegressed int `json:"quality_regressed,omitempty"` // warm-top-1 rating < cold-top-1 rating
GeneratedAt time.Time `json:"generated_at"`
}
@ -128,6 +146,7 @@ func main() {
k := flag.Int("k", 10, "top-k from matrix.search per pass")
out := flag.String("out", "reports/reality-tests/playbook_lift_001.json", "output JSONL path")
withParaphrase := flag.Bool("with-paraphrase", false, "after warm pass, generate a paraphrase via the judge model and re-query with playbook=true to test the learning property")
withRejudge := flag.Bool("with-rejudge", false, "after warm pass, judge warm top-1 to measure QUALITY lift (vs cold top-1 rating), not just rank-of-cold-judge-best")
flag.Parse()
// Judge resolution priority: explicit flag > $JUDGE_MODEL env >
@ -225,6 +244,7 @@ func main() {
}
runs[i].WarmTop1ID = resp.Results[0].ID
runs[i].WarmTop1Distance = resp.Results[0].Distance
runs[i].WarmTop1Metadata = resp.Results[0].Metadata // cache for Pass 4 rejudge
runs[i].WarmBoostedCount = resp.PlaybookBoosted
playbookBoostedTotal += resp.PlaybookBoosted
@ -304,6 +324,47 @@ func main() {
}
}
// Pass 4 (warm-rejudge) — opt-in via --with-rejudge. Judge warm
// top-1 against the same prompt as cold ratings, then compare to
// cold top-1 rating. This measures QUALITY lift (did the playbook
// produce a better candidate?) rather than just rank-of-cold-judge-
// best lift (did the recorded answer move to top-1, even if cold's
// top-1 was already good?). See STATE_OF_PLAY OPEN — added because
// run #003's verbatim 2/6 didn't tell us whether Shape B was
// surfacing better OR same-quality alternatives.
rejudgeAttempted := 0
qualityLifted := 0
qualityNeutral := 0
qualityRegressed := 0
if *withRejudge {
log.Printf("[lift] warm-rejudge pass: measuring quality lift (warm top-1 rating vs cold top-1 rating)")
for i := range runs {
if runs[i].WarmTop1ID == "" || len(runs[i].WarmTop1Metadata) == 0 {
continue // warm pass didn't complete for this query
}
rejudgeAttempted++
result := matrixResult{
ID: runs[i].WarmTop1ID,
Distance: runs[i].WarmTop1Distance,
Metadata: runs[i].WarmTop1Metadata,
}
warmRating := judgeRate(hc, *ollama, *judge, runs[i].Query, result)
runs[i].WarmTop1Rating = &warmRating
coldRating := 0
if len(runs[i].ColdRatings) > 0 {
coldRating = runs[i].ColdRatings[0]
}
switch {
case warmRating > coldRating:
qualityLifted++
case warmRating < coldRating:
qualityRegressed++
default:
qualityNeutral++
}
}
}
sum := summary{
Total: len(runs),
WithDiscovery: withDiscovery,
@ -314,6 +375,10 @@ func main() {
ParaphraseAttempted: paraphraseAttempted,
ParaphraseTop1Lifts: paraphraseTop1Lifts,
ParaphraseAnyRankHits: paraphraseAnyRankHits,
RejudgeAttempted: rejudgeAttempted,
QualityLifted: qualityLifted,
QualityNeutral: qualityNeutral,
QualityRegressed: qualityRegressed,
GeneratedAt: time.Now().UTC(),
}
if len(runs) > 0 {
@ -323,11 +388,11 @@ func main() {
if err := writeJSON(*out, runs, sum); err != nil {
log.Fatalf("write %s: %v", *out, err)
}
-if *withParaphrase {
-log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f · paraphrase=%d/%d→top1, %d/%d→anyrank",
+if *withParaphrase || *withRejudge {
+log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f · paraphrase=%d/%d→top1 · quality=lifted%d/neutral%d/regressed%d",
sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance,
sum.ParaphraseTop1Lifts, sum.ParaphraseAttempted,
-sum.ParaphraseAnyRankHits, sum.ParaphraseAttempted)
+sum.QualityLifted, sum.QualityNeutral, sum.QualityRegressed)
} else {
log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f",
sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance)


@ -0,0 +1,12 @@
{
"name": "alpha_milwaukee_distribution",
"client": "Northstar Logistics",
"location": "Milwaukee, WI metro",
"shift": "day",
"demand": [
{"role": "warehouse worker", "count": 200, "skills": ["pallet jack", "inventory"], "certs": ["OSHA-30"]},
{"role": "admin assistant", "count": 3, "skills": ["scheduling", "data entry"], "certs": []},
{"role": "heavy equipment operator", "count": 2, "skills": ["forklift", "bobcat"], "certs": ["OSHA-30", "forklift cert"]},
{"role": "industrial electrician", "count": 1, "skills": ["high voltage", "PLC"], "certs": ["journeyman"], "in_roster": false}
]
}


@ -0,0 +1,12 @@
{
"name": "beta_indianapolis_manufacturing",
"client": "Crossroads Manufacturing",
"location": "Indianapolis, IN metro",
"shift": "swing",
"demand": [
{"role": "warehouse worker", "count": 150, "skills": ["assembly", "machine operation"], "certs": ["OSHA-10"]},
{"role": "admin assistant", "count": 4, "skills": ["scheduling", "documentation", "spanish"], "certs": []},
{"role": "heavy equipment operator", "count": 3, "skills": ["forklift", "pallet jack", "cold storage"], "certs": ["OSHA-30", "forklift cert"]},
{"role": "bilingual safety coordinator", "count": 1, "skills": ["spanish", "english", "training"], "certs": ["OSHA trainer"], "in_roster": false}
]
}


@ -0,0 +1,12 @@
{
"name": "gamma_chicago_construction",
"client": "Loop Construction Group",
"location": "Chicago, IL metro",
"shift": "early-day",
"demand": [
{"role": "warehouse worker", "count": 80, "skills": ["framing", "rigging", "concrete"], "certs": ["OSHA-10"]},
{"role": "admin assistant", "count": 1, "skills": ["scheduling", "blueprint reading"], "certs": []},
{"role": "heavy equipment operator", "count": 2, "skills": ["mobile crane", "rigging signals", "bobcat"], "certs": ["NCCCO crane cert"]},
{"role": "drone surveyor", "count": 1, "skills": ["UAV piloting", "GIS", "site mapping"], "certs": ["FAA Part 107"], "in_roster": false}
]
}