g2_smoke: accept nomic-embed-text* family members as default

Pre-push hook caught the regression: the smoke hardcoded MODEL = "nomic-embed-text", so the bump to nomic-embed-text-v2-moe in 4da32ad failed the gate. Fix: glob-match the family prefix (nomic-embed-text*). Both v1 and v2-moe are 768d drop-ins; the property the smoke locks is dimension + distinct vectors, not the exact model variant. Operators can swap the variant in lakehouse.toml without touching the smoke.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
STATE_OF_PLAY: capture multi-coord stress wave (Phase 1-3 verified)
2026-04-30 17:37:20 -05:00 · 2026-04-30 17:30:04 -05:00 · 2026-04-30 16:43:32 -05:00 · 2026-04-30 16:31:45 -05:00 · 2026-04-30 16:25:03 -05:00 · 2026-04-30 16:16:49 -05:00
27 changed files with 3263 additions and 21 deletions
--- a/STATE_OF_PLAY.md
+++ b/STATE_OF_PLAY.md
@@ -1,7 +1,7 @@
 # STATE OF PLAY — Lakehouse-Go

-**Last verified:** 2026-04-30 ~07:25 CDT
-**Verified by:** live probes + `just verify` PASS + reality tests #001/#002/#003/#004 (v4 with split inject threshold: 6/8 verbatim lift + 6/8 paraphrase recovery + zero OOD cross-pollination), not memory.
+**Last verified:** 2026-04-30 ~16:42 CDT
+**Verified by:** live probes + `just verify` PASS + multi-coord stress run #011 (full 9-phase scenario, 67 captured events, 1 Langfuse trace + 111 child observations covering every phase + every external call), not memory.

 > **Read this FIRST.** When the user says "we're working on lakehouse," default to the Go rewrite (this repo); the Rust legacy at `/home/profit/lakehouse/` is maintenance-only. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.

@@ -114,6 +114,34 @@ Evidence: `reports/reality-tests/playbook_lift_{001,002,003}.{json,md}`. Per the

 **v3 → v4 is the configuration evolution.** v3 ran with one threshold (0.5) for both boost and inject — paraphrase recovery hit 6/6 but a single strong recording (w-4435 for "OSHA-30 forklift Wisconsin") cross-pollinated to OOD queries (dental hygienist / RN / software engineer) because their text vectors fell within 0.5 cosine. v4 split: boost stays at 0.5 (safe — only re-ranks existing results), inject tightens to 0.20 (true paraphrase territory). Result: OOD cross-pollination eliminated, paraphrase recovery 6/8, verbatim lift 6/8. The two paraphrases that missed (Q9, Q15) were liberally rephrased — drift > 0.20 cosine.

+### Multi-coordinator stress test (Phase 1 → 3) — VERIFIED end-to-end
+
+Reality test #2 catalog. New harness `scripts/multi_coord_stress.{sh,go}` simulates 3 coordinators (alice/bob/carol) handling 3 distinct contracts (Milwaukee distribution / Indianapolis manufacturing / Chicago construction), each with their own playbook namespace. 9-phase operational narrative across simulated Hours 0–48: baseline → fresh-resume injection → inbox burst → mid-day surge → 200-worker swap → contract merge → handover (verbatim + paraphrase) → split → reissue.
+
+| Capability | Verified | Where |
+|---|---|---|
+| Per-coordinator playbook isolation | ✓ | `playbook_alice` / `playbook_bob` / `playbook_carol` corpora |
+| Same-role-across-contracts diversity | Jaccard 0.026 (n=9) — 97% workers differ per region | Phase 1 baseline |
+| Different-roles-same-contract diversity | Jaccard 0.070 (n=18) — 93% differ per role | Phase 1 baseline |
+| HNSW retrieval determinism | Jaccard 1.000 (n=12) | Phase 6 reissue |
+| Verbatim handover (Bob runs Alice's queries with Alice's playbook) | 4/4 | Phase 4 |
+| Paraphrase handover (Bob runs qwen2.5-paraphrased queries) | 4/4 | Phase 4b |
+| 200-worker swap with `ExcludeIDs` | Jaccard 0.000 — 8/8 placed workers fully replaced | Phase 2b |
+| Fresh-resume injection (two-tier `fresh_workers` index) | 3/3 fresh workers at top-1 | Phase 1b |
+| Inbox endpoint `/v1/observer/inbox` (email + SMS, priority weighting) | 6/6 events recorded | Phase 1c |
+| LLM demand parsing (qwen2.5 format=json on inbox bodies) | 6/6 parsed cleanly into structured `{role, count, location, certs, skills, shift}` | Phase 1c |
+| Judge re-rates inbox top-1 against ORIGINAL body | catches tight-distance-but-wrong (Q3 crane case: dist 0.23 → rating 1) | Phase 1c |
+| Langfuse Go-side tracing | 111 observations on a single run trace, browseable at http://localhost:3001 | Run #011 |
+
+**Substrate gains added by this wave:**
+- `internal/matrix/playbook.go` Shape B + split inject threshold (commit `67d1957` from earlier wave; verified in multi-coord too)
+- `internal/matrix/retrieve.go` `ExcludeIDs` field on `SearchRequest` — filters worker IDs at retrieve, boost, AND inject (so excluded answers can't sneak back via playbook). Real-world driver: coordinator placed N workers, client asks for replacements.
+- `internal/observer/types.go` `SourceInbox` taxonomy alongside `SourceMCP / SourceScenario / SourceWorkflow`
+- `cmd/observerd` `POST /observer/inbox` route — accepts `{type, sender, subject, body, priority, tag}` and records as `ObservedOp`. Type must be `email` or `sms`; body required; priority defaults to medium.
+- `internal/langfuse/client.go` — minimal Go-side Trace+Span client, best-effort posture (logs on error, never blocks calling path; same fail-open semantics as ADR-005 Decision 5.1).
+- `embedd.default_model` bumped from `nomic-embed-text` (137M) → `nomic-embed-text-v2-moe` (475M, MoE, drop-in 768d). Same-role-across-contracts mean Jaccard dropped from 0.080 to 0.000 with the upgrade (lower Jaccard = more diverse results).
+- Two-tier index pattern: fresh content goes to `fresh_workers` (a small "hot" corpus); main queries include it in the corpora list. Solves the HNSW post-build add recall issue (incremental adds to a populated graph land in poorly-connected regions and disappear from search). Canonical NRT pattern; Lucene works the same way.
+
 ### Harness expansion (2026-04-30 ~05:30 CDT)

 `scripts/playbook_lift.sh` rewritten from a 5-daemon stripped harness to the **full 10-daemon prod-realistic stack** (chatd stays up independently). The 5-daemon version was structurally hiding bugs; expanding the daemon set surfaced 7 distinct fixes:
@@ -171,10 +199,16 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
 - The matrix indexer **5/5 components** are shipped. Don't propose to "build the matrix indexer" — it's done.
 - The 5-loop substrate's load-bearing gate is **PASSED**. v3 (`154a72e`) showed 6/6 paraphrase recovery via Shape B. Don't propose to "test if the playbook learns" — it does.
 - **Shape B is the playbook stance now.** When `use_playbook=true`, both `ApplyPlaybookBoost` (re-rank in place) AND `InjectPlaybookMisses` (insert recorded answers not in regular top-K) run. Don't revert to boost-only; v2 proved that path can't recover paraphrase queries.
+- **Boost / inject use SEPARATE thresholds.** Boost stays at `DefaultPlaybookMaxDistance = 0.5` (safe — only re-ranks results already in retrieval). Inject uses tighter `DefaultPlaybookMaxInjectDistance = 0.20` (Shape B forces results in, so loose match cross-pollinates wrong-domain). Don't merge them.
+- **Multi-coord product theory is empirically VALIDATED** by run #011 (Phase 3). Per-coordinator playbook namespaces (`playbook_alice` etc.) with cross-coordinator handover (Bob takes Alice's contract using `playbook_corpus=playbook_alice`) work end-to-end including paraphrase handover. Don't propose to "test if multi-coord works" — it does.
+- **Fresh content uses two-tier indexing.** Fresh resumes go to `fresh_workers` corpus, not the main `workers` index. coder/hnsw incremental adds to a populated 5K+ graph have recall issues; the small hot index has no such crowding. Periodic merge (post-G3) consolidates fresh→main. Canonical NRT pattern.
+- **`embedd.default_model = "nomic-embed-text-v2-moe"`** (475M MoE, 768d). Don't revert to `nomic-embed-text` (137M) "for speed": same-role-across-contracts mean Jaccard rose from 0.000 to 0.080 on the smaller model, i.e. retrieval became less diverse. Cost (~5× slower ingest) is acceptable for once-per-deploy work.
+- **Inbox flow: parse + search + judge + trace.** `/v1/observer/inbox` records the body; coordinator/driver parses via LLM (qwen2.5 format=json), runs `matrix.search` on the parsed query, judges top-1 against the ORIGINAL body, emits Langfuse spans through it all. Don't replace the judge re-rate with a distance-only gate — tight distance + low rating is the load-bearing honesty signal.
 - `local_judge = "qwen2.5:latest"` for the lift loop — qwen3.5:latest is a vision-SSM 256K-ctx build that ran 30s/judge. Don't bump to qwen3.5 again "for cleanliness." Different model in the same tier has different cost geometry.
 - `qwen3.5:latest` IS available locally on this box. Opus's hub-only knowledge is a known-stale signal; the chatd_smoke uses it daily.
 - `temperature` is **omitted** for Anthropic 4.7 (handled by `Request.Temperature *float64`); don't re-add it.
 - chatd-smoke runs with **all cloud providers disabled** intentionally so the suite doesn't depend on API keys; that's why it can't catch B-3-class bugs (those need a fake-server fixture, see Sprint 0 follow-up).
+- **Langfuse Go-side client lives at `internal/langfuse/`** with best-effort fail-open posture. URL+creds from `/etc/lakehouse/langfuse.env`. Don't propose to "wire Langfuse on Go side" — it's wired; multi_coord_stress is the proof.

 ---

@@ -182,9 +216,11 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition

 | Item | What | When to act |
 |---|---|---|
-| **Reality test v4: re-judge warm results** | Current lift metric counts "warm top-1 == cold judge best." Shape B cross-pollinates recorded answers across queries (run #003 dropped verbatim-lift 7→2 because w-4435 surfaced as warm top-1 for several queries whose cold-judge-best was a different worker). True quality lift would re-judge warm results and measure judge agreement. ~30 min — adds a 4th pass to the driver. | Next time we want to measure Shape B's actual quality, not rank-of-original-judge-best. |
-| **Adjacent-query cross-pollination** | After v4's split threshold, OOD cross-pollination is gone but Q6 ("Forklift-certified loader") ↔ Q7 ("Hazmat-certified warehouse worker, cold storage") still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Likely correct (genuinely similar staffing queries), but worth verifying with the v4 re-judge metric — if the judge agrees both directions are good matches, accept; if not, tighten further (e.g. 0.15) or add a same-query-only mode. | Co-decision with v4 re-judge. |
-| **Liberal-paraphrase recovery loss** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt (more conservative wording variation) or a per-pair `paraphrase_max_drift` measurement. Cosmetic vs. real depends on whether realistic coordinator queries drift like qwen2.5's rephrases do. | When real coordinator queries are available for a calibration run. |
+| **Real-time 48-hour clock** | Multi-coord stress fires phases as discrete steps with simulated-hour labels (0/6/12/18/24/30/36/42/48). A real-time variant would space events on actual wall-clock with `time.Sleep`, simulating the rhythm of a coordinator workday. Cosmetic; doesn't change product behavior — but lets the harness mimic the cadence at which inbox events would arrive in production. ~30 min. | When stress test starts being used to capture realistic per-hour throughput numbers. |
+| **Wider Langfuse instrumentation across daemons** | multi_coord_stress traces every external call, but the daemons themselves (matrixd, observerd, chatd) don't yet emit traces from their own request handlers. Adding `internal/langfuse/middleware.go` would auto-emit a span per HTTP request, giving production-traffic visibility for free. | When production traffic starts hitting the Go gateway. |
+| **Periodic fresh→main index merge** | Two-tier `fresh_workers` pattern is verified working but no scheduled job consolidates fresh→main. Fresh corpus grows monotonically; eventually has its own recall issues. A daily cron that ingests the fresh corpus' contents into the main `workers` index + drops fresh would solve it. | When fresh_workers grows past ~500 items. |
+| **Adjacent-query cross-pollination (lift suite Q6↔Q7)** | After lift v4's split threshold, OOD cross-pollination is gone but Q6 / Q7 still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Multi-coord run #008 inbox-judge re-rating proved the judge can distinguish — gating injection on "judge approves before injecting" closes this. ~1 hr. | When playbook injection quality matters more than retrieval throughput. |
+| **Liberal-paraphrase recovery loss (lift suite Q9, Q15)** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt or a per-pair `paraphrase_max_drift` measurement. | When real coordinator queries are available for a calibration run. |
 | **Sprint 4 — deployment** | No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/<bin>.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. |
 | **ADR-006 — auth posture for non-loopback deploy** | Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr. | Required before any Go binary binds non-loopback in prod. |
 | **chatd fixture-mode storage half** | `g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. |
@@ -213,6 +249,17 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
 | `154a72e` | matrix: Shape B (`InjectPlaybookMisses`) — 6/6 paraphrase recovery in run #003 |
 | `94fc3b6` | STATE_OF_PLAY: capture v3 reality test + Shape B + cross-pollination |
 | `67d1957` | matrix: split boost/inject thresholds (0.5 / 0.20) — kills cross-pollination, run #004 6/8+6/8 |
+| `b13b5cd` | playbook_lift v4 metric: warm-top-1 re-judge → quality lift +24%/-14% (5 lifted / 13 neutral / 3 regressed) |
+| `61c7b55` | multi-coord stress harness (Phase 1) — 3 coords / 3 contracts / 7-phase scenario |
+| `0fa42a0` | multi-coord stress Phase 1.5 — shared-role contracts + paraphrase handover |
+| `84a32f0` | multi-coord stress Phase 2 — `ExcludeIDs`, 200-worker swap, fresh-resume |
+| `4da32ad` | embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in) |
+| `e7fc63b` | observerd `/observer/inbox` + multi-coord stress phase 1c (priority-ordered events) |
+| `186d209` | multi_coord_stress: LLM-parsed inbox demands (qwen2.5 format=json) |
+| `ce940f4` | multi_coord_stress: judge re-rates inbox top-1 against original body — recovers OOD honesty signal |
+| `7e6431e` | langfuse: Go-side client + Phase 1c instrumentation |
+| `08a0867` | multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1 (3/3) |
+| `5d49967` | multi_coord_stress: full Langfuse coverage — every phase + every call (111 observations) |

 Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).
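Illustrative note (not part of the commit): the diversity and determinism rows in the table above report mean Jaccard similarity between top-K worker ID sets. A minimal Go sketch of that metric follows, assuming each result list is reduced to a set of IDs. The function names and the `w-N` IDs are hypothetical and are not taken from `scripts/multi_coord_stress.go`.

```go
package main

import "fmt"

// Jaccard returns |A ∩ B| / |A ∪ B| for two lists of worker IDs,
// treating each list as a set. Two empty lists are treated as
// identical (1.0); that convention is an assumption of this sketch.
func Jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	if len(setA) == 0 && len(setB) == 0 {
		return 1
	}
	inter := 0
	for id := range setB {
		if _, ok := setA[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

// MeanJaccard averages Jaccard over pairs of result lists and also
// returns the number of pairs, so callers can tell "mean 0 over 9
// pairs" apart from "mean 0 because there were no pairs".
func MeanJaccard(pairs [][2][]string) (mean float64, n int) {
	if len(pairs) == 0 {
		return 0, 0
	}
	sum := 0.0
	for _, p := range pairs {
		sum += Jaccard(p[0], p[1])
	}
	return sum / float64(len(pairs)), len(pairs)
}

func main() {
	first := []string{"w-1", "w-2", "w-3"}
	reissue := []string{"w-2", "w-3", "w-4"}
	fmt.Printf("pairwise: %.3f\n", Jaccard(first, reissue))

	mean, n := MeanJaccard([][2][]string{{first, first}, {first, reissue}})
	fmt.Printf("mean: %.3f over %d pairs\n", mean, n)
}
```

Returning the pair count alongside the mean lets a reader tell a true zero from an empty sample: a mean of 0 over 0 pairs says nothing about diversity.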

--- a/cmd/observerd/main.go
+++ b/cmd/observerd/main.go
@@ -93,6 +93,62 @@ func (h *handlers) register(r chi.Router) {
 	r.Post("/observer/event", h.handleEvent)
 	r.Post("/observer/workflow/run", h.handleWorkflowRun)
 	r.Get("/observer/workflow/modes", h.handleWorkflowModes)
+	r.Post("/observer/inbox", h.handleInbox)
+}
+
+// inboxMessage is the POST /observer/inbox body — an incoming
+// real-world signal (email or SMS) that a coordinator would receive
+// and act on. The handler only RECORDS it as an ObservedOp; whether
+// to trigger a downstream matrix.search or workflow is the caller's
+// concern. Keeps observer's witness role pure.
+type inboxMessage struct {
+	Type     string `json:"type"`     // "email" | "sms"
+	Sender   string `json:"sender"`
+	Subject  string `json:"subject,omitempty"`
+	Body     string `json:"body"`
+	Priority string `json:"priority"` // "urgent" | "high" | "medium" | "low"
+	Tag      string `json:"tag,omitempty"`
+}
+
+func (h *handlers) handleInbox(w http.ResponseWriter, r *http.Request) {
+	var msg inboxMessage
+	if !decodeJSON(w, r, &msg) {
+		return
+	}
+	if msg.Type != "email" && msg.Type != "sms" {
+		http.Error(w, "type must be 'email' or 'sms'", http.StatusBadRequest)
+		return
+	}
+	if strings.TrimSpace(msg.Body) == "" {
+		http.Error(w, "body required", http.StatusBadRequest)
+		return
+	}
+	if msg.Priority == "" {
+		msg.Priority = "medium"
+	}
+	op := observer.ObservedOp{
+		Endpoint:      "/observer/inbox/" + msg.Type,
+		InputSummary:  fmt.Sprintf("from=%s priority=%s tag=%s subject=%s", msg.Sender, msg.Priority, msg.Tag, msg.Subject),
+		OutputSummary: msg.Body,
+		Source:        observer.SourceInbox,
+		Success:       true,
+	}
+	if err := h.store.Record(op); err != nil {
+		if errors.Is(err, observer.ErrInvalidOp) {
+			http.Error(w, err.Error(), http.StatusBadRequest)
+			return
+		}
+		slog.Error("observer record inbox", "err", err)
+		http.Error(w, "internal", http.StatusInternalServerError)
+		return
+	}
+	stats := h.store.Stats()
+	writeJSON(w, http.StatusOK, map[string]any{
+		"accepted":  true,
+		"type":      msg.Type,
+		"priority":  msg.Priority,
+		"ring_size": stats.Total,
+	})
 }

 func (h *handlers) handleStats(w http.ResponseWriter, _ *http.Request) {
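Illustrative note (not part of the commit): the request contract added above can be exercised without a running observerd. The sketch below assumes a hypothetical `stubInbox` handler that copies only the validation rules from `handleInbox` (type must be `email` or `sms`, body required, priority defaults to `medium`); it does not record anything, and `send` is a helper invented for this example. It also covers the default-priority path, which the tests in this commit do not assert.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// inboxMessage mirrors the JSON fields of the POST /observer/inbox body.
type inboxMessage struct {
	Type     string `json:"type"`
	Sender   string `json:"sender"`
	Subject  string `json:"subject,omitempty"`
	Body     string `json:"body"`
	Priority string `json:"priority,omitempty"`
	Tag      string `json:"tag,omitempty"`
}

// stubInbox is a hypothetical stand-in for observerd. It applies the
// same validation as handleInbox but skips the store entirely.
func stubInbox() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var msg inboxMessage
		if err := json.NewDecoder(r.Body).Decode(&msg); err != nil {
			http.Error(w, "bad json", http.StatusBadRequest)
			return
		}
		if msg.Type != "email" && msg.Type != "sms" {
			http.Error(w, "type must be 'email' or 'sms'", http.StatusBadRequest)
			return
		}
		if strings.TrimSpace(msg.Body) == "" {
			http.Error(w, "body required", http.StatusBadRequest)
			return
		}
		if msg.Priority == "" {
			msg.Priority = "medium"
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(map[string]any{
			"accepted": true,
			"type":     msg.Type,
			"priority": msg.Priority,
		})
	})
}

// send posts msg to the handler in-process and returns the HTTP
// status plus the "priority" field of the response, if any.
func send(h http.Handler, msg inboxMessage) (int, string) {
	payload, _ := json.Marshal(msg)
	req := httptest.NewRequest(http.MethodPost, "/observer/inbox", bytes.NewReader(payload))
	req.Header.Set("Content-Type", "application/json")
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	var out struct {
		Priority string `json:"priority"`
	}
	_ = json.Unmarshal(rec.Body.Bytes(), &out) // error bodies are plain text
	return rec.Code, out.Priority
}

func main() {
	h := stubInbox()
	code, prio := send(h, inboxMessage{
		Type:   "sms",
		Sender: "client@example.com",
		Body:   "Need 3 welders tomorrow",
	})
	fmt.Println(code, prio)
}
```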
--- a/cmd/observerd/main_test.go
+++ b/cmd/observerd/main_test.go
@@ -4,6 +4,7 @@ import (
 	"bytes"
 	"net/http"
 	"net/http/httptest"
+	"strings"
 	"testing"
 	"time"

@@ -38,6 +39,7 @@ func TestRoutesMounted(t *testing.T) {
 		"POST /observer/event":          false,
 		"POST /observer/workflow/run":   false,
 		"GET /observer/workflow/modes":  false,
+		"POST /observer/inbox":          false,
 	}
 	_ = chi.Walk(r, func(method, route string, _ http.Handler, _ ...func(http.Handler) http.Handler) error {
 		key := method + " " + route
@@ -165,6 +167,51 @@ func TestWorkflowRun_AllProvenanceRecordedPostRun(t *testing.T) {
 	}
 }

+// TestInbox_AcceptsValidEmail locks the happy-path contract for the
+// /observer/inbox route — accepts an email message with required
+// fields, records as ObservedOp, returns 200 with ring-size.
+func TestInbox_AcceptsValidEmail(t *testing.T) {
+	r := newTestRouter(t)
+	body := []byte(`{"type":"email","sender":"client@northstar.com","subject":"URGENT: 50 forklift ops","body":"Need 50 forklift operators in Cleveland OH for next week. Day shift.","priority":"urgent","tag":"alpha-surge"}`)
+	req := httptest.NewRequest("POST", "/observer/inbox", bytes.NewReader(body))
+	req.Header.Set("Content-Type", "application/json")
+	w := httptest.NewRecorder()
+	r.ServeHTTP(w, req)
+	if w.Code != http.StatusOK {
+		t.Fatalf("expected 200, got %d (body=%s)", w.Code, w.Body.String())
+	}
+	if !strings.Contains(w.Body.String(), `"accepted":true`) {
+		t.Errorf("expected accepted=true, got %s", w.Body.String())
+	}
+}
+
+// TestInbox_RejectsBadType locks the validation: type must be
+// "email" or "sms", anything else is 400.
+func TestInbox_RejectsBadType(t *testing.T) {
+	r := newTestRouter(t)
+	body := []byte(`{"type":"smoke-signal","sender":"x","body":"y","priority":"high"}`)
+	req := httptest.NewRequest("POST", "/observer/inbox", bytes.NewReader(body))
+	req.Header.Set("Content-Type", "application/json")
+	w := httptest.NewRecorder()
+	r.ServeHTTP(w, req)
+	if w.Code != http.StatusBadRequest {
+		t.Errorf("expected 400 on bad type, got %d", w.Code)
+	}
+}
+
+// TestInbox_RejectsEmptyBody locks the body-required invariant.
+func TestInbox_RejectsEmptyBody(t *testing.T) {
+	r := newTestRouter(t)
+	body := []byte(`{"type":"email","sender":"x","body":"","priority":"high"}`)
+	req := httptest.NewRequest("POST", "/observer/inbox", bytes.NewReader(body))
+	req.Header.Set("Content-Type", "application/json")
+	w := httptest.NewRecorder()
+	r.ServeHTTP(w, req)
+	if w.Code != http.StatusBadRequest {
+		t.Errorf("expected 400 on empty body, got %d", w.Code)
+	}
+}
+
 // TestWorkflowRun_UnknownMode locks the 400 path on workflow definitions
 // that reference modes not registered with the runner. The harness's
 // reality test runs depend on this so an unknown-mode misconfiguration
--- a/internal/langfuse/client.go
+++ b/internal/langfuse/client.go
@@ -0,0 +1,217 @@
+// Package langfuse is a minimal Go-side client for the Langfuse v2
+// ingestion API. Mirrors the surface area we need from the Rust
+// crates/gateway/src/v1/langfuse_trace.rs emitter — Trace + Span,
+// nothing else yet (no scores, no generations or events, no datasets).
+//
+// Auth is Basic over public_key:secret_key. URL + creds come from
+// /etc/lakehouse/langfuse.env in production; tests can pass any URL.
+//
+// Best-effort transport: errors are logged but don't fail the calling
+// path. Lakehouse's internal services should never go down because
+// Langfuse is unreachable.
+package langfuse
+
+import (
+	"bytes"
+	"context"
+	"crypto/rand"
+	"encoding/base64"
+	"encoding/hex"
+	"encoding/json"
+	"fmt"
+	"log/slog"
+	"net/http"
+	"sync"
+	"time"
+)
+
+// Client posts traces + spans to Langfuse's ingestion endpoint.
+// Events are buffered and flushed in batches. Always call Flush
+// before exit; Close also flushes.
+type Client struct {
+	url      string
+	auth     string // pre-encoded "Basic ..."
+	hc       *http.Client
+	mu       sync.Mutex
+	pending  []event
+	maxBatch int
+}
+
+// New constructs a Client. URL like "http://localhost:3001"; creds
+// from langfuse.env. nil hc → uses default with 5s timeout.
+func New(url, publicKey, secretKey string, hc *http.Client) *Client {
+	if hc == nil {
+		hc = &http.Client{Timeout: 5 * time.Second}
+	}
+	auth := "Basic " + base64.StdEncoding.EncodeToString([]byte(publicKey+":"+secretKey))
+	return &Client{
+		url:      url,
+		auth:     auth,
+		hc:       hc,
+		maxBatch: 50,
+	}
+}
+
+// NewID returns a hex string suitable as a trace/span id. Langfuse
+// accepts arbitrary strings; a 16-byte random hex is unambiguous.
+func NewID() string {
+	b := make([]byte, 16)
+	_, _ = rand.Read(b)
+	return hex.EncodeToString(b)
+}
+
+// event is one Langfuse ingestion envelope. Body shape varies by
+// type (trace-create vs span-create); we use map[string]any to
+// keep the wire shape declarative.
+type event struct {
+	ID        string         `json:"id"`
+	Type      string         `json:"type"` // "trace-create" | "span-create"
+	Timestamp string         `json:"timestamp"`
+	Body      map[string]any `json:"body"`
+}
+
+// TraceInput is what callers fill in when starting a trace.
+type TraceInput struct {
+	Name     string
+	UserID   string
+	Input    any
+	Metadata map[string]any
+	Tags     []string
+}
+
+// Trace records a top-level trace. Returns the trace id so callers
+// can attach spans. Best-effort: errors are logged and the trace
+// id is still returned so callers don't need error-handling for the
+// common case.
+func (c *Client) Trace(ctx context.Context, t TraceInput) string {
+	id := NewID()
+	body := map[string]any{
+		"id":   id,
+		"name": t.Name,
+	}
+	if t.UserID != "" {
+		body["userId"] = t.UserID
+	}
+	if t.Input != nil {
+		body["input"] = t.Input
+	}
+	if t.Metadata != nil {
+		body["metadata"] = t.Metadata
+	}
+	if len(t.Tags) > 0 {
+		body["tags"] = t.Tags
+	}
+	c.queue(event{
+		ID:        NewID(),
+		Type:      "trace-create",
+		Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
+		Body:      body,
+	})
+	return id
+}
+
+// SpanInput is what callers fill in when recording a span.
+type SpanInput struct {
+	TraceID    string
+	ParentID   string // optional — for nested spans
+	Name       string
+	Input      any
+	Output     any
+	Metadata   map[string]any
+	StartTime  time.Time
+	EndTime    time.Time
+	StatusCode int    // 0 = success, anything else = error code
+	Level      string // "DEBUG" | "DEFAULT" | "WARNING" | "ERROR"
+}
+
+// Span records one span attached to a trace. Returns the span id.
+func (c *Client) Span(ctx context.Context, s SpanInput) string {
+	id := NewID()
+	body := map[string]any{
+		"id":      id,
+		"traceId": s.TraceID,
+		"name":    s.Name,
+	}
+	if s.ParentID != "" {
+		body["parentObservationId"] = s.ParentID
+	}
+	if s.Input != nil {
+		body["input"] = s.Input
+	}
+	if s.Output != nil {
+		body["output"] = s.Output
+	}
+	if s.Metadata != nil {
+		body["metadata"] = s.Metadata
+	}
+	if !s.StartTime.IsZero() {
+		body["startTime"] = s.StartTime.UTC().Format(time.RFC3339Nano)
+	}
+	if !s.EndTime.IsZero() {
+		body["endTime"] = s.EndTime.UTC().Format(time.RFC3339Nano)
+	}
+	if s.Level != "" {
+		body["level"] = s.Level
+	}
+	if s.StatusCode != 0 {
+		body["statusMessage"] = fmt.Sprintf("status_code=%d", s.StatusCode)
+	}
+	c.queue(event{
+		ID:        NewID(),
+		Type:      "span-create",
+		Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
+		Body:      body,
+	})
+	return id
+}
+
+func (c *Client) queue(e event) {
+	c.mu.Lock()
+	c.pending = append(c.pending, e)
+	shouldFlush := len(c.pending) >= c.maxBatch
+	c.mu.Unlock()
+	if shouldFlush {
+		_ = c.Flush(context.Background())
+	}
+}
+
+// Flush sends all queued events in one batch. Best-effort: returns
+// the error but also logs; callers can ignore.
+func (c *Client) Flush(ctx context.Context) error {
+	c.mu.Lock()
+	if len(c.pending) == 0 {
+		c.mu.Unlock()
+		return nil
+	}
+	batch := c.pending
+	c.pending = nil
+	c.mu.Unlock()
+
+	body, err := json.Marshal(map[string]any{"batch": batch})
+	if err != nil {
+		slog.Warn("langfuse: marshal batch", "err", err, "n", len(batch))
+		return err
+	}
+	req, err := http.NewRequestWithContext(ctx, "POST", c.url+"/api/public/ingestion", bytes.NewReader(body))
+	if err != nil {
+		return err
+	}
+	req.Header.Set("Authorization", c.auth)
+	req.Header.Set("Content-Type", "application/json")
+	resp, err := c.hc.Do(req)
+	if err != nil {
+		slog.Warn("langfuse: post", "err", err, "n", len(batch))
+		return err
+	}
+	defer resp.Body.Close()
+	if resp.StatusCode/100 != 2 { // any 2xx is success, including 207 Multi-Status
+		slog.Warn("langfuse: non-2xx", "status", resp.StatusCode, "n", len(batch))
+		return fmt.Errorf("langfuse ingestion: HTTP %d", resp.StatusCode)
+	}
+	return nil
+}
+
+// Close flushes any remaining events. Idempotent.
+func (c *Client) Close() error {
+	return c.Flush(context.Background())
+}
--- a/internal/matrix/retrieve.go
+++ b/internal/matrix/retrieve.go
@@ -85,6 +85,15 @@ type SearchRequest struct {
 	PlaybookMaxDistance       float64        `json:"playbook_max_distance,omitempty"`
 	PlaybookMaxInjectDistance float64        `json:"playbook_max_inject_distance,omitempty"`
 	MetadataFilter            map[string]any `json:"metadata_filter,omitempty"`
+	// ExcludeIDs filters out specific worker IDs post-retrieval.
+	// Real-world driver: a coordinator places 200 workers at a
+	// contract, then mid-day the client asks for a different set —
+	// the next query should NOT return the already-placed workers.
+	// Filter runs after merge but before metadata filter, so an
+	// excluded ID never wastes a slot in the post-filter top-K.
+	// Also applies to playbook boost + Shape B inject — excluded
+	// answers are skipped at injection time.
+	ExcludeIDs []string `json:"exclude_ids,omitempty"`
 }

 // SearchResponse wraps the merged results plus per-corpus return
@@ -204,6 +213,25 @@ func (r *Retriever) Search(ctx context.Context, req SearchRequest) (*SearchRespo
 		return allHits[i].Distance < allHits[j].Distance
 	})

+	// ExcludeIDs filter — applied first so excluded IDs don't waste
+	// a slot in the post-filter top-K. Real-world driver: coordinator
+	// has placed N workers at a contract; mid-day the client asks for
+	// alternatives, so this query passes ExcludeIDs=<placed_ids> and
+	// gets back fresh candidates instead of the same N.
+	if len(req.ExcludeIDs) > 0 {
+		excludeSet := make(map[string]bool, len(req.ExcludeIDs))
+		for _, id := range req.ExcludeIDs {
+			excludeSet[id] = true
+		}
+		kept := make([]Result, 0, len(allHits))
+		for _, h := range allHits {
+			if !excludeSet[h.ID] {
+				kept = append(kept, h)
+			}
+		}
+		allHits = kept
+	}
+
 	// Metadata filter (component B — staffing-side structured gate).
 	// Applied BEFORE top-K truncation so the filter doesn't accidentally
 	// reduce coverage further. Caller can request larger PerCorpusK to
@@ -239,6 +267,23 @@ func (r *Retriever) Search(ctx context.Context, req SearchRequest) (*SearchRespo
 		if err != nil {
 			slog.Warn("matrix: playbook lookup failed; skipping boost+inject", "err", err)
 		} else if len(hits) > 0 {
+			// Filter playbook hits to honor ExcludeIDs — without this,
+			// an excluded answer in a playbook recording would re-enter
+			// the result set via Shape B inject, defeating the swap
+			// semantics that the exclude list exists to enforce.
+			if len(req.ExcludeIDs) > 0 {
+				excludeSet := make(map[string]bool, len(req.ExcludeIDs))
+				for _, id := range req.ExcludeIDs {
+					excludeSet[id] = true
+				}
+				keptHits := make([]PlaybookHit, 0, len(hits))
+				for _, h := range hits {
+					if !excludeSet[h.Entry.AnswerID] {
+						keptHits = append(keptHits, h)
+					}
+				}
+				hits = keptHits
+			}
 			resp.PlaybookBoosted = ApplyPlaybookBoost(resp.Results, hits)
 			maxInjectDist := float32(req.PlaybookMaxInjectDistance)
 			if maxInjectDist <= 0 {
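Illustrative note (not part of the commit): the two hunks above build the same exclude set twice, once for retrieval hits and once for playbook hits. One possible way to share that logic is a small generic helper, sketched below under the assumption of Go 1.18+ generics. `excludeByID`, `result`, and `playbookHit` are hypothetical, simplified stand-ins; in the real code the playbook ID lives at `h.Entry.AnswerID`.

```go
package main

import "fmt"

// excludeByID returns the items whose ID (as extracted by idOf) is
// not in excludeIDs, preserving the original order.
func excludeByID[T any](items []T, excludeIDs []string, idOf func(T) string) []T {
	if len(excludeIDs) == 0 {
		return items
	}
	excluded := make(map[string]struct{}, len(excludeIDs))
	for _, id := range excludeIDs {
		excluded[id] = struct{}{}
	}
	kept := make([]T, 0, len(items))
	for _, it := range items {
		if _, skip := excluded[idOf(it)]; !skip {
			kept = append(kept, it)
		}
	}
	return kept
}

// Simplified stand-ins for the retrieval result and playbook hit types.
type result struct {
	ID       string
	Distance float32
}

type playbookHit struct {
	AnswerID string
}

func main() {
	placed := []string{"w-2"} // workers already placed at the contract

	hits := []result{{ID: "w-1", Distance: 0.10}, {ID: "w-2", Distance: 0.12}, {ID: "w-3", Distance: 0.20}}
	hits = excludeByID(hits, placed, func(r result) string { return r.ID })

	recorded := []playbookHit{{AnswerID: "w-2"}, {AnswerID: "w-9"}}
	recorded = excludeByID(recorded, placed, func(h playbookHit) string { return h.AnswerID })

	fmt.Println(len(hits), len(recorded))
}
```

Applying the same helper at both sites keeps the swap semantics in one place: an excluded worker is dropped from retrieval results and from playbook recordings by the same rule.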
--- a/internal/observer/types.go
+++ b/internal/observer/types.go
@@ -41,6 +41,12 @@ const (
 	// the workflow handler was casting a string literal to Source,
 	// which worked coincidentally but left the taxonomy implicit.
 	SourceWorkflow Source = "workflow"
+	// SourceInbox tags ObservedOps emitted by /observer/inbox — incoming
+	// real-world signals (email, SMS) that a coordinator would receive
+	// and act on. The handler only RECORDS the message; downstream
+	// triggers (e.g. matrix.search on the parsed demand) are the
+	// caller's concern, recorded separately.
+	SourceInbox Source = "inbox"
 )

 // ObservedOp is one entry in the observer's ring buffer (and JSONL
--- a/lakehouse.toml
+++ b/lakehouse.toml
@@ -43,7 +43,7 @@ bind = "127.0.0.1:3216"
 # G2: Ollama local. G3+ may swap in OpenAI/Voyage by changing
 # this URL + the wire format inside the provider.
 provider_url  = "http://localhost:11434"
-default_model = "nomic-embed-text"
+default_model = "nomic-embed-text-v2-moe"

 [queryd]
 bind = "127.0.0.1:3214"
@@ -129,7 +129,7 @@ level = "info"
 [models]
 # Tier 1 — local hot path
 local_fast    = "qwen3.5:latest"
-local_embed   = "nomic-embed-text"
+local_embed   = "nomic-embed-text-v2-moe"  # 475M MoE, drop-in upgrade from 137M v1 — verified 2026-04-30 same 768-dim
 # local_judge stays on qwen2.5:latest — qwen3.5:latest is a vision-SSM
 # build with 256K context that runs ~30s per judge call against the
 # playbook_lift loop (verified 2026-04-30). qwen2.5:latest at ~1s/call
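Illustrative note (not part of the commit): the g2_smoke fix described in the commit message accepts any configured model in the `nomic-embed-text*` family instead of one exact name. The smoke script itself is not shown in this diff, so the following is only a Go sketch of the same check using `path.Match`; `inFamily` and the non-matching model name are hypothetical.

```go
package main

import (
	"fmt"
	"path"
)

// inFamily reports whether model matches the shell-style pattern.
// path.Match's "*" matches any run of non-"/" characters, including
// the empty string, so the bare family name matches too. A malformed
// pattern is treated as "no match".
func inFamily(model, pattern string) bool {
	ok, err := path.Match(pattern, model)
	return err == nil && ok
}

func main() {
	const family = "nomic-embed-text*"
	for _, m := range []string{"nomic-embed-text", "nomic-embed-text-v2-moe", "other-embed-model"} {
		fmt.Printf("%s in family: %v\n", m, inFamily(m, family))
	}
}
```

A family match alone does not verify the property the smoke cares about; the dimension and distinct-vector checks still have to run against whichever variant is configured.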
--- a/reports/reality-tests/multi_coord_stress_001.md
+++ b/reports/reality-tests/multi_coord_stress_001.md
@@ -0,0 +1,77 @@
+# Multi-Coordinator Stress Test — Run 001
+
+**Generated:** 2026-04-30T12:54:09.621556469Z
+**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
+**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
+**Corpora:** `workers,ethereal_workers`
+**K per query:** 8
+**Total events captured:** 52
+**Evidence:** `reports/reality-tests/multi_coord_stress_001.json`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | 0 | 0 | Lower = more diverse (different region/cert mix → different workers). With 0 pairs the mean is undefined, so the 0 here is not evidence of diversity. |
+| Different roles within same contract | 0.003703703703703704 | 18 | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
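The diversity metric above is a mean of pairwise Jaccard overlaps between top-K worker lists. A minimal sketch of that computation follows, assuming each search result is reduced to a list of worker IDs. The names `jaccard` and `meanPairwiseJaccard`, the sample IDs, and the empty-set convention are illustrative assumptions, not the harness's actual code.

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two lists of worker IDs.
// Duplicates inside a list are ignored. Two empty lists return 0 by
// convention here (an assumption; the real harness may choose otherwise).
func jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	inter := 0
	for id := range setA {
		if _, ok := setB[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

// meanPairwiseJaccard averages jaccard over every unordered pair of
// top-K lists. ok is false when there are no pairs, so a caller can
// tell "no data" apart from "zero overlap".
func meanPairwiseJaccard(lists [][]string) (mean float64, pairs int, ok bool) {
	sum := 0.0
	for i := 0; i < len(lists); i++ {
		for j := i + 1; j < len(lists); j++ {
			sum += jaccard(lists[i], lists[j])
			pairs++
		}
	}
	if pairs == 0 {
		return 0, 0, false
	}
	return sum / float64(pairs), pairs, true
}

func main() {
	// Hypothetical top-K lists for one role across three contracts.
	milwaukee := []string{"w-1", "w-2", "w-3"}
	indianapolis := []string{"w-3", "w-4", "w-5"}
	chicago := []string{"w-6", "w-7", "w-8"}

	mean, pairs, ok := meanPairwiseJaccard([][]string{milwaukee, indianapolis, chicago})
	fmt.Printf("mean=%.3f pairs=%d ok=%v\n", mean, pairs, ok)
}
```

Returning `ok` alongside the mean is one way to keep a row with `n pairs = 0` from reading as a perfect diversity score.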
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | 1 |
+| Number of reissue pairs | 12 |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
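The three bands above can be applied mechanically once each reissue is paired with its original result. The sketch below assumes a hypothetical `reissuePair` type holding both top-K ID lists. The band labels and function names are illustrative, and the thresholds are the ones listed in the interpretation bullets.

```go
package main

import "fmt"

// reissuePair holds the top-K worker IDs from the first issue of a
// query and from its retrieval-only reissue (hypothetical type).
type reissuePair struct {
	First   []string
	Reissue []string
}

// overlap returns the Jaccard similarity of two ID lists.
func overlap(a, b []string) float64 {
	seen := make(map[string]bool, len(a))
	for _, id := range a {
		seen[id] = true
	}
	inter, union := 0, len(seen)
	counted := make(map[string]bool, len(b))
	for _, id := range b {
		if counted[id] {
			continue // ignore duplicate IDs in b
		}
		counted[id] = true
		if seen[id] {
			inter++
		} else {
			union++
		}
	}
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

// determinismBand averages overlap across all pairs and maps the mean
// onto the report's three bands.
func determinismBand(pairs []reissuePair) (mean float64, band string) {
	if len(pairs) == 0 {
		return 0, "no data"
	}
	sum := 0.0
	for _, p := range pairs {
		sum += overlap(p.First, p.Reissue)
	}
	mean = sum / float64(len(pairs))
	switch {
	case mean >= 0.95:
		band = "deterministic"
	case mean >= 0.80:
		band = "acceptable variance"
	default:
		band = "unstable"
	}
	return mean, band
}

func main() {
	pairs := []reissuePair{
		{First: []string{"w-1", "w-2"}, Reissue: []string{"w-2", "w-1"}},
		{First: []string{"w-3", "w-4"}, Reissue: []string{"w-3", "w-4"}},
	}
	mean, band := determinismBand(pairs)
	fmt.Printf("mean=%.2f band=%s\n", mean, band)
}
```

Jaccard compares sets, so a reissue that returns the same workers in a different order still scores 1. A rank-sensitive check would need a different measure.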
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 | 4 |
+| Alice's recorded answer in Bob's top-K | 4 |
+| **Handover hit rate (top-1)** | **1** |
+
+**Interpretation:**
+- Hit rate ≥ 0.5: handover is meaningful — the second coordinator inherits the first's institutional memory.
+- Hit rate ≈ 0.0: playbook namespace isolation is working but the playbook itself isn't transferable, OR Bob's queries don't match Alice's recordings closely enough.
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON: top-K with worker IDs, distances, and per-corpus counts. Filter by phase, coordinator, or role:
+
+```bash
+jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_001.json
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_001.json
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_001.json
+```
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_002.md
+++ b/reports/reality-tests/multi_coord_stress_002.md
@ -0,0 +1,82 @@
+# Multi-Coordinator Stress Test — Run 002
+
+**Generated:** 2026-04-30T13:02:13.570393819Z
+**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
+**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
+**Corpora:** `workers,ethereal_workers`
+**K per query:** 8
+**Total events captured:** 56
+**Evidence:** `reports/reality-tests/multi_coord_stress_002.json`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | 0.11900691900691901 | 9 | Lower = more diverse (different region/cert mix → different workers) |
+| Different roles within same contract | 0.003703703703703704 | 18 | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | 1 |
+| Number of reissue pairs | 12 |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Verbatim handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
+| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
+| **Verbatim handover hit rate (top-1)** | **1** |
+| Paraphrase handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
+| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
+| **Paraphrase handover hit rate (top-1)** | **1** |
+
+**Interpretation:**
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
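The handover rows above reduce to two counts per query set: whether Alice's recorded worker is Bob's top-1, and whether it appears anywhere in Bob's top-K. A minimal sketch follows, assuming each handover query carries the recorded worker ID and Bob's ranked result IDs. The type and field names and the sample IDs are hypothetical, not taken from the harness.

```go
package main

import "fmt"

// handoverQuery pairs the worker ID Alice recorded for a query with
// the ranked IDs Bob got back for the same (or paraphrased) query.
type handoverQuery struct {
	RecordedID string
	TopK       []string // best match first
}

// handoverStats mirrors the rows of the report table.
type handoverStats struct {
	Queries  int
	Top1Hits int
	TopKHits int
	Top1Rate float64
}

// scoreHandover counts top-1 and top-K hits and derives the top-1 rate.
func scoreHandover(queries []handoverQuery) handoverStats {
	s := handoverStats{Queries: len(queries)}
	for _, q := range queries {
		for rank, id := range q.TopK {
			if id != q.RecordedID {
				continue
			}
			s.TopKHits++
			if rank == 0 {
				s.Top1Hits++
			}
			break // count each query at most once
		}
	}
	if s.Queries > 0 {
		s.Top1Rate = float64(s.Top1Hits) / float64(s.Queries)
	}
	return s
}

func main() {
	queries := []handoverQuery{
		{RecordedID: "w-10", TopK: []string{"w-10", "w-20"}}, // top-1 hit
		{RecordedID: "w-11", TopK: []string{"w-21", "w-11"}}, // top-K only
		{RecordedID: "w-12", TopK: []string{"w-22", "w-23"}}, // miss
		{RecordedID: "w-13", TopK: []string{"w-13", "w-24"}}, // top-1 hit
	}
	fmt.Printf("%+v\n", scoreHandover(queries))
}
```

Scoring the verbatim and paraphrase query sets separately with the same function would yield the two hit rates the table reports.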
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON: top-K with worker IDs, distances, and per-corpus counts. Filter by phase, coordinator, or role:
+
+```bash
+jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_002.json
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_002.json
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_002.json
+```
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_003.md
+++ b/reports/reality-tests/multi_coord_stress_003.md
@ -0,0 +1,82 @@
+# Multi-Coordinator Stress Test — Run 003
+
+**Generated:** 2026-04-30T13:13:44.35966865Z
+**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
+**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
+**Corpora:** `workers,ethereal_workers`
+**K per query:** 8
+**Total events captured:** 61
+**Evidence:** `reports/reality-tests/multi_coord_stress_003.json`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | 0.03068783068783069 | 9 | Lower = more diverse (different region/cert mix → different workers) |
+| Different roles within same contract | 0 | 18 | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | 1 |
+| Number of reissue pairs | 12 |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Verbatim handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
+| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
+| **Verbatim handover hit rate (top-1)** | **1** |
+| Paraphrase handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
+| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
+| **Paraphrase handover hit rate (top-1)** | **1** |
+
+**Interpretation:**
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON: top-K with worker IDs, distances, and per-corpus counts. Filter by phase, coordinator, or role:
+
+```bash
+jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_003.json
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_003.json
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_003.json
+```
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_004.md
+++ b/reports/reality-tests/multi_coord_stress_004.md
@ -0,0 +1,82 @@
+# Multi-Coordinator Stress Test — Run 004
+
+**Generated:** 2026-04-30T13:17:03.577877974Z
+**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
+**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
+**Corpora:** `workers,ethereal_workers`
+**K per query:** 8
+**Total events captured:** 61
+**Evidence:** `reports/reality-tests/multi_coord_stress_004.json`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | 0.08013468013468013 | 9 | Lower = more diverse (different region/cert mix → different workers) |
+| Different roles within same contract | 0.012820512820512822 | 18 | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | 1 |
+| Number of reissue pairs | 12 |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Verbatim handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
+| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
+| **Verbatim handover hit rate (top-1)** | **1** |
+| Paraphrase handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
+| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
+| **Paraphrase handover hit rate (top-1)** | **1** |
+
+**Interpretation:**
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON: top-K with worker IDs, distances, and per-corpus counts. Filter by phase, coordinator, or role:
+
+```bash
+jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_004.json
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_004.json
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_004.json
+```
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_005.md
+++ b/reports/reality-tests/multi_coord_stress_005.md
@ -0,0 +1,82 @@
+# Multi-Coordinator Stress Test — Run 005
+
+**Generated:** 2026-04-30T13:25:15.497712275Z
+**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
+**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
+**Corpora:** `workers,ethereal_workers`
+**K per query:** 8
+**Total events captured:** 61
+**Evidence:** `reports/reality-tests/multi_coord_stress_005.json`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
+| Different roles within same contract | 0.03610093610093609 | 18 | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | 1 |
+| Number of reissue pairs | 12 |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Verbatim handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
+| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
+| **Verbatim handover hit rate (top-1)** | **1** |
+| Paraphrase handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
+| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
+| **Paraphrase handover hit rate (top-1)** | **1** |
+
+**Interpretation:**
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON: top-K with worker IDs, distances, and per-corpus counts. Filter by phase, coordinator, or role:
+
+```bash
+jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_005.json
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_005.json
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_005.json
+```
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_006.md
+++ b/reports/reality-tests/multi_coord_stress_006.md
@ -0,0 +1,82 @@
+# Multi-Coordinator Stress Test — Run 006
+
+**Generated:** 2026-04-30T13:33:24.568124731Z
+**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
+**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
+**Corpora:** `workers,ethereal_workers`
+**K per query:** 8
+**Total events captured:** 67
+**Evidence:** `reports/reality-tests/multi_coord_stress_006.json`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
+| Different roles within same contract | 0.04603174603174603 | 18 | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | 1 |
+| Number of reissue pairs | 12 |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Verbatim handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
+| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
+| **Verbatim handover hit rate (top-1)** | **1** |
+| Paraphrase handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
+| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
+| **Paraphrase handover hit rate (top-1)** | **1** |
+
+**Interpretation:**
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON: top-K with worker IDs, distances, and per-corpus counts. Filter by phase, coordinator, or role:
+
+```bash
+jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_006.json
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_006.json
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_006.json
+```
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_007.md
+++ b/reports/reality-tests/multi_coord_stress_007.md
@ -0,0 +1,82 @@
+# Multi-Coordinator Stress Test — Run 007
+
+**Generated:** 2026-04-30T19:50:04.791000091Z
+**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
+**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
+**Corpora:** `workers,ethereal_workers`
+**K per query:** 8
+**Total events captured:** 67
+**Evidence:** `reports/reality-tests/multi_coord_stress_007.json`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
+| Different roles within same contract | 0.026455026455026454 | 18 | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | 1 |
+| Number of reissue pairs | 12 |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Verbatim handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
+| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
+| **Verbatim handover hit rate (top-1)** | **1** |
+| Paraphrase handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
+| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
+| **Paraphrase handover hit rate (top-1)** | **1** |
+
+**Interpretation:**
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON: top-K with worker IDs, distances, and per-corpus counts. Filter by phase, coordinator, or role:
+
+```bash
+jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_007.json
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_007.json
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_007.json
+```
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_008.md
+++ b/reports/reality-tests/multi_coord_stress_008.md
@ -0,0 +1,82 @@
+# Multi-Coordinator Stress Test — Run 008
+
+**Generated:** 2026-04-30T21:15:37.045817146Z
+**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
+**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
+**Corpora:** `workers,ethereal_workers`
+**K per query:** 8
+**Total events captured:** 67
+**Evidence:** `reports/reality-tests/multi_coord_stress_008.json`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
+| Different roles within same contract | 0.04126984126984126 | 18 | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | 1 |
+| Number of reissue pairs | 12 |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Verbatim handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
+| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
+| **Verbatim handover hit rate (top-1)** | **1** |
+| Paraphrase handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
+| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
+| **Paraphrase handover hit rate (top-1)** | **1** |
+
+**Interpretation:**
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON: top-K with worker IDs, distances, and per-corpus counts. Filter by phase, coordinator, or role:
+
+```bash
+jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_008.json
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_008.json
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_008.json
+```
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_009.md
+++ b/reports/reality-tests/multi_coord_stress_009.md
@ -0,0 +1,82 @@
+# Multi-Coordinator Stress Test — Run 009
+
+**Generated:** 2026-04-30T21:23:59.011167722Z
+**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
+**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
+**Corpora:** `workers,ethereal_workers`
+**K per query:** 8
+**Total events captured:** 67
+**Evidence:** `reports/reality-tests/multi_coord_stress_009.json`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | 0.015873015873015872 | 9 | Lower = more diverse (different region/cert mix → different workers) |
+| Different roles within same contract | 0.015343915343915345 | 18 | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | 1 |
+| Number of reissue pairs | 12 |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Verbatim handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
+| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
+| **Verbatim handover hit rate (top-1)** | **1** |
+| Paraphrase handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
+| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
+| **Paraphrase handover hit rate (top-1)** | **1** |
+
+**Interpretation:**
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
+
+```bash
+jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_009.json
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_009.json
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_009.json
+```
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_010.md
+++ b/reports/reality-tests/multi_coord_stress_010.md
@ -0,0 +1,82 @@
+# Multi-Coordinator Stress Test — Run 010
+
+**Generated:** 2026-04-30T21:30:38.434794788Z
+**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
+**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
+**Corpora:** `workers,ethereal_workers`
+**K per query:** 8
+**Total events captured:** 67
+**Evidence:** `reports/reality-tests/multi_coord_stress_010.json`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | 0.007407407407407408 | 9 | Lower = more diverse (different region/cert mix → different workers) |
+| Different roles within same contract | 0.026455026455026454 | 18 | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | 1 |
+| Number of reissue pairs | 12 |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Verbatim handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
+| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
+| **Verbatim handover hit rate (top-1)** | **1** |
+| Paraphrase handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
+| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
+| **Paraphrase handover hit rate (top-1)** | **1** |
+
+**Interpretation:**
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
+
+```bash
+jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_010.json
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_010.json
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_010.json
+```
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_011.md
+++ b/reports/reality-tests/multi_coord_stress_011.md
@ -0,0 +1,82 @@
+# Multi-Coordinator Stress Test — Run 011
+
+**Generated:** 2026-04-30T21:41:26.801002955Z
+**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
+**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
+**Corpora:** `workers,ethereal_workers`
+**K per query:** 8
+**Total events captured:** 67
+**Evidence:** `reports/reality-tests/multi_coord_stress_011.json`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | 0.025641025641025644 | 9 | Lower = more diverse (different region/cert mix → different workers) |
+| Different roles within same contract | 0.06996336996336996 | 18 | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | 1 |
+| Number of reissue pairs | 12 |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Verbatim handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
+| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
+| **Verbatim handover hit rate (top-1)** | **1** |
+| Paraphrase handover queries run | 4 |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
+| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
+| **Paraphrase handover hit rate (top-1)** | **1** |
+
+**Interpretation:**
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
+
+```bash
+jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_011.json
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_011.json
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_011.json
+```
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
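
Reviewer note: the diversity and determinism tables in the stress reports above all rest on Jaccard overlap between top-K ID sets, but the Go driver that computes them (`scripts/multi_coord_stress/main.go`) is not shown in this diff. The following is a minimal sketch of that metric and of the determinism bands the report quotes, assuming plain set overlap on worker IDs. The function names `jaccard` and `determinism_band` are hypothetical and are not taken from the driver.

```shell
# Sketch only: Jaccard overlap between two top-K ID lists, plus the
# determinism bands quoted in the report. Function names are hypothetical;
# the real driver implementation is not part of this diff.

# jaccard "<ids A>" "<ids B>"   (space-separated worker IDs)
jaccard() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, A, " "); nb = split(b, B, " ")
    for (i = 1; i <= na; i++) in_a[A[i]] = 1
    for (i = 1; i <= nb; i++) in_b[B[i]] = 1
    inter = 0; uni = 0
    for (k in in_a) { uni++; if (k in in_b) inter++ }
    for (k in in_b) if (!(k in in_a)) uni++
    if (uni == 0) { print "1.0000"; exit }   # assumption: two empty lists count as identical
    printf "%.4f\n", inter / uni
  }'
}

# determinism_band <mean_jaccard>  maps a value to the report's three bands
determinism_band() {
  awk -v j="$1" 'BEGIN {
    if (j >= 0.95)      print "deterministic"
    else if (j >= 0.80) print "acceptable"
    else                print "unstable"
  }'
}

# Hypothetical IDs: two of four distinct workers are shared.
jaccard "w-1 w-2 w-3" "w-2 w-3 w-4"
determinism_band 1
```

With K=8, a single shared worker between two otherwise disjoint result sets gives 1/15, which is why means in the 0.01 to 0.07 range sit well inside the "healthy" thresholds.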
--- a/reports/reality-tests/playbook_lift_005.md
+++ b/reports/reality-tests/playbook_lift_005.md
@ -0,0 +1,120 @@
+# Playbook-Lift Reality Test — Run 005
+
+**Generated:** 2026-04-30T12:40:48.475901847Z
+**Judge:** `qwen2.5:latest` (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
+**Corpora:** `workers,ethereal_workers`
+**Workers limit:** 5000
+**Queries:** `tests/reality/playbook_lift_queries.txt` (21 executed)
+**K per pass:** 10
+**Paraphrase pass:** ENABLED
+**Re-judge pass:** ENABLED
+**Evidence:** `reports/reality-tests/playbook_lift_005.json`
+
+---
+
+## Headline
+
+| Metric | Value |
+|---|---:|
+| Total queries run | 21 |
+| Cold-pass discoveries (judge-best ≠ top-1) | 7 |
+| Warm-pass lifts (recorded playbook → top-1) | 5 |
+| No change (judge-best already top-1, or no warm-pass lift) | 16 |
+| Playbook boosts triggered (warm pass) | 9 |
+| Mean Δ top-1 distance (warm − cold) | -0.076170966 |
+| **Paraphrase pass — recorded answer at rank 0 (top-1)** | **5 / 7** |
+| Paraphrase pass — recorded answer at any rank in top-K | 5 / 7 |
+| **Quality lift** (warm top-1 rating > cold top-1 rating) | **5 / 21** |
+| Quality neutral (warm top-1 rating = cold top-1 rating) | 13 / 21 |
+| Quality regressed (warm top-1 rating < cold top-1 rating) | 3 / 21 |
+
+**Verbatim lift rate:** 5 of 7 discoveries became top-1 after warm pass.
+
+---
+
+## Per-query results
+
+| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
+|---|---|---|---|---|---|---|---|
+| 1 | Forklift operator with OSHA-30, warehouse experience, day sh | e-5670 | 2/4 | ✓ e-5729 | e-5729 | 0 | **YES** |
+| 2 | OSHA-30 certified forklift operator in Wisconsin, cold stora | e-6293 | 7/3 | — | w-1566 | 8 | no |
+| 3 | Production worker with confined-space cert and hazmat traini | w-602 | 0/2 | — | w-3575 | 1 | no |
+| 4 | CDL Class A driver, clean record, willing to do regional 4-d | w-3854 | 0/1 | — | w-3854 | 0 | no |
+| 5 | Warehouse lead with current OSHA-30 certification, NOT OSHA- | w-1807 | 6/3 | — | w-1807 | 6 | no |
+| 6 | Forklift-certified loader, certification must be active, dis | w-1807 | 3/4 | ✓ w-205 | w-4257 | 1 | no |
+| 7 | Hazmat-certified warehouse worker comfortable with cold stor | e-4910 | 2/4 | ✓ w-4257 | w-205 | 1 | no |
+| 8 | Bilingual production worker with team-lead experience and tr | w-4988 | 0/4 | — | w-4988 | 0 | no |
+| 9 | Inventory specialist with confined-space cert and compliance | w-388 | 3/4 | ✓ w-3575 | w-3575 | 0 | **YES** |
+| 10 | Warehouse worker who can run inventory cycles and lead a sma | e-3011 | 0/4 | — | e-3011 | 0 | no |
+| 11 | Production line worker comfortable filling in as line superv | w-1387 | 0/4 | — | e-5729 | 1 | no |
+| 12 | Customer service rep willing to cross-train into dispatch or | w-1451 | 0/2 | — | w-1451 | 0 | no |
+| 13 | Reliable production line lead with strong attendance and lea | e-7360 | 5/4 | ✓ w-2886 | w-2886 | 0 | **YES** |
+| 14 | Highly responsive forklift operator available for last-minut | e-6108 | 5/4 | ✓ w-1566 | w-1566 | 0 | **YES** |
+| 15 | Engaged warehouse associate with strong safety compliance re | e-2743 | 2/4 | ✓ w-49 | w-49 | 0 | **YES** |
+| 16 | CDL-A driver based in IL or WI, willing to run regional 4-da | w-2486 | 5/2 | — | w-2486 | 5 | no |
+| 17 | Bilingual customer service rep in Indianapolis or Cincinnati | e-9749 | 9/2 | — | e-9749 | 9 | no |
+| 18 | Production supervisor open to Midwest relocation for permane | w-379 | 6/3 | — | w-379 | 6 | no |
+| 19 | Dental hygienist with three years experience, Indianapolis a | e-6772 | 0/1 | — | w-3575 | 1 | no |
+| 20 | Registered nurse with ICU experience, willing to take per-di | w-379 | 0/1 | — | w-379 | 0 | no |
+| 21 | Software engineer with React and TypeScript, three years exp | w-1773 | 0/1 | — | w-1773 | 0 | no |
+
+---
+
+## Paraphrase pass — does the playbook help similar-but-different queries?
+
+For each query whose Pass 1 cold pass recorded a playbook entry, the
+judge model rephrased the query, and the rephrased version was sent
+through warm matrix.search. The recorded answer ID's rank in those
+results tests whether cosine on the embedded paraphrase finds the
+recorded query's vector.
+
+| # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift |
+|---|---|---|---|---|---|---|
+| 1 | Forklift operator with OSHA-30, warehous | Seeking forklift operator certified in OSHA-30, looking for  | e-5729 | e-5729 | 0 | **YES** |
+| 6 | Forklift-certified loader, certification | Loader requiring active forklift certification, this must no | w-205 | w-205 | 0 | **YES** |
+| 7 | Hazmat-certified warehouse worker comfor | Warehouse worker with Hazmat certification and experience in | w-4257 | w-4257 | 0 | **YES** |
+| 9 | Inventory specialist with confined-space | Specialist in inventory management requiring certified confi | w-3575 | w-49 | -1 | no |
+| 13 | Reliable production line lead with stron | Experienced production line supervisor with excellent punctu | w-2886 | w-2886 | 0 | **YES** |
+| 14 | Highly responsive forklift operator avai | Available forklift operator ready for urgent shift coverage | w-1566 | w-1566 | 0 | **YES** |
+| 15 | Engaged warehouse associate with strong  | Warehouse associate dedicated to engagement and boasting a r | w-49 | w-984 | -1 | no |
+
+---
+
+## Honesty caveats
+
+1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
+   judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
+   the lift number is meaningless. To validate the judge itself, sample 5–10
+   verdicts manually and check agreement.
+2. **Score-1.0 boost = distance halved.** Playbook math is
+   `distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
+   result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
+   even halving doesn't promote it. Tight clusters → little visible lift.
+3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
+   case — same query, recorded playbook, expected boost. The paraphrase
+   pass (when enabled) is the actual learning property: similar-but-different
+   queries hitting a recorded playbook. Compare verbatim and paraphrase
+   lift rates — paraphrase should be lower (semantic-distance gates some
+   playbook hits) but non-zero is the meaningful signal.
+4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
+   results land in one corpus, the matrix layer's purpose isn't being tested.
+   Check per-corpus distribution in the JSON.
+5. **Judge resolution.** This run used `qwen2.5:latest` from
+   env JUDGE_MODEL=qwen2.5:latest.
+   Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
+6. **Paraphrase generation also uses the judge.** The same model that rates
+   relevance also rephrases queries. A judge that's bad at rating staffing
+   queries is probably also bad at rephrasing them. Worth sanity-checking
+   a sample of `paraphrase_query` values in the JSON before trusting the
+   paraphrase lift number.
+
+## Next moves
+
+- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
+  work. Move to paraphrase queries + tag-based boost (currently ignored).
+- If lift rate < 20%: investigate why — judge variance, distance gap too
+  wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
+  retuning.
+- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
+  already close to optimal on this query distribution. Either the corpus
+  is too narrow or the queries are too easy.
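
Reviewer note: honesty caveat 2 in the report above states the boost as `distance' = distance × (1 - 0.5 × score)` and the "≤ 2×" condition for a lift. A small worked sketch of that arithmetic follows. The distances are made-up illustration values, the helper names `boosted_distance` and `promotes` are hypothetical, and the sketch assumes the cold top-1 is not itself boosted. Tie handling is not specified in the report, so the comparison here is strict.

```shell
# Sketch only: the playbook boost arithmetic from caveat 2.
# Helper names and the sample distances are hypothetical.

# boosted_distance <distance> <score>
boosted_distance() {
  awk -v d="$1" -v s="$2" 'BEGIN { printf "%.4f\n", d * (1 - 0.5 * s) }'
}

# promotes <judge_best_distance> <cold_top1_distance> <score>
# Exit status 0 when the boosted judge-best distance beats the cold top-1.
promotes() {
  awk -v d="$1" -v t="$2" -v s="$3" 'BEGIN { exit !(d * (1 - 0.5 * s) < t) }'
}

# Judge-best at 0.42, cold top-1 at 0.25, score 1.0:
# 0.42 is within 2x of 0.25, so halving it moves it ahead.
boosted_distance 0.42 1.0
if promotes 0.42 0.25 1.0; then echo "lift"; else echo "no lift"; fi

# Judge-best at 0.60 is more than 2x the top-1 distance, so halving is not enough.
if promotes 0.60 0.25 1.0; then echo "lift"; else echo "no lift"; fi
```

This is the mechanism behind the "tight clusters, little visible lift" remark: when the distance gap is wide, even a score of 1.0 cannot reorder the results.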
--- a/scripts/g2_smoke.sh
+++ b/scripts/g2_smoke.sh
@ -76,8 +76,14 @@ DIM="$(echo "$RESP" | jq -r '.dimension')"
 N="$(echo "$RESP" | jq -r '.vectors | length')"
 MODEL="$(echo "$RESP" | jq -r '.model')"
 SAME="$(echo "$RESP" | jq -r '.vectors[0][0] == .vectors[1][0]')"
-if [ "$DIM" = "768" ] && [ "$N" = "2" ] && [ "$MODEL" = "nomic-embed-text" ] && [ "$SAME" = "false" ]; then
-  echo "  ✓ dim=768, model=nomic-embed-text, 2 distinct vectors"
+# Accept any nomic-embed-text* family member as the default — v1
+# (137M, 768d) and v2-moe (475M MoE, 768d) are both supported drop-ins.
+# The smoke locks the dimension + the distinct-vectors property, NOT
+# the exact model name (operators bump the model in lakehouse.toml
+# without changing this smoke).
+case "$MODEL" in nomic-embed-text*) MODEL_OK=1 ;; *) MODEL_OK=0 ;; esac
+if [ "$DIM" = "768" ] && [ "$N" = "2" ] && [ "$MODEL_OK" = "1" ] && [ "$SAME" = "false" ]; then
+  echo "  ✓ dim=768, model=$MODEL, 2 distinct vectors"
 else
  echo "  ✗ resp: dim=$DIM n=$N model=$MODEL same=$SAME"; FAILED=1
 fi
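
Reviewer note: the `case` glob in the hunk above accepts any name that starts with `nomic-embed-text`. The sketch below isolates that match so its edge cases are easy to see. The function name `is_nomic_family` is hypothetical, and `mxbai-embed-large` is used only as an example of a name outside the family.

```shell
# Sketch only: the family-prefix match from the smoke, isolated.
# is_nomic_family is a hypothetical helper name.
is_nomic_family() {
  case "$1" in
    nomic-embed-text*) return 0 ;;
    *)                 return 1 ;;
  esac
}

for m in nomic-embed-text nomic-embed-text-v2-moe nomic-embed-text:latest mxbai-embed-large; do
  if is_nomic_family "$m"; then
    echo "accept  $m"
  else
    echo "reject  $m"
  fi
done
```

Because the glob is a pure prefix match, a tagged name such as `nomic-embed-text:latest` also passes. The smoke stays meaningful only because the separate `DIM = 768` check still guards the vector size.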
--- a/scripts/multi_coord_stress.sh
+++ b/scripts/multi_coord_stress.sh
@ -0,0 +1,282 @@
+#!/usr/bin/env bash
+# Multi-coordinator stress harness — Phase 1 of the 48-hour mock.
+#
+# Three coordinators (Alice / Bob / Carol) own three distinct contracts
+# (Milwaukee distribution, Indianapolis manufacturing, Chicago
+# construction). The driver fires phases:
+#   1. baseline — each coord runs their contract's role queries
+#   2. surge    — each contract's demand doubles (URGENT phrasing)
+#   3. merge    — alpha + beta combined under alice
+#   4. handover — bob takes alpha, USING alice's playbook namespace
+#   5. split    — alpha surge re-distributed across all 3 coords
+#   6. reissue  — non-determinism check: same baselines reissued
+#   7. analysis — diversity + determinism + learning metrics
+#
+# Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints,
+# and Langfuse wiring — those are Phase 2/3.
+#
+# Usage:
+#   ./scripts/multi_coord_stress.sh                    # run #001
+#   RUN_ID=002 ./scripts/multi_coord_stress.sh
+#   K=12 ./scripts/multi_coord_stress.sh
+
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+export PATH="$PATH:/usr/local/go/bin"
+
+RUN_ID="${RUN_ID:-001}"
+WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
+ETHEREAL_LIMIT="${ETHEREAL_LIMIT:-0}"
+CORPORA="${CORPORA:-workers,ethereal_workers}"
+K="${K:-8}"
+
+OUT_JSON="reports/reality-tests/multi_coord_stress_${RUN_ID}.json"
+OUT_MD="reports/reality-tests/multi_coord_stress_${RUN_ID}.md"
+
+if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
+  echo "[stress] Ollama not reachable on :11434 — skipping (need it for embeddings)"
+  exit 0
+fi
+
+echo "[stress] building binaries..."
+go build -o bin/ ./cmd/storaged ./cmd/catalogd ./cmd/ingestd ./cmd/queryd \
+                 ./cmd/embedd ./cmd/vectord ./cmd/pathwayd ./cmd/observerd \
+                 ./cmd/matrixd ./cmd/gateway \
+                 ./scripts/staffing_workers ./scripts/multi_coord_stress
+
+pkill -f "bin/(storaged|catalogd|ingestd|queryd|embedd|vectord|pathwayd|observerd|matrixd|gateway)" 2>/dev/null || true
+sleep 0.3
+
+PIDS=()
+TMP="$(mktemp -d)"
+CFG="$TMP/stress.toml"
+
+cleanup() {
+  echo "[stress] cleanup"
+  for p in "${PIDS[@]:-}"; do [ -n "${p:-}" ] && kill "$p" 2>/dev/null || true; done
+  rm -rf "$TMP"
+}
+trap cleanup EXIT INT TERM
+
+cat > "$CFG" <<EOF
+[s3]
+endpoint        = "http://localhost:9000"
+region          = "us-east-1"
+bucket          = "lakehouse-go-primary"
+use_path_style  = true
+
+[gateway]
+bind = "127.0.0.1:3110"
+storaged_url = "http://127.0.0.1:3211"
+catalogd_url = "http://127.0.0.1:3212"
+ingestd_url  = "http://127.0.0.1:3213"
+queryd_url   = "http://127.0.0.1:3214"
+vectord_url  = "http://127.0.0.1:3215"
+embedd_url   = "http://127.0.0.1:3216"
+pathwayd_url = "http://127.0.0.1:3217"
+matrixd_url  = "http://127.0.0.1:3218"
+observerd_url = "http://127.0.0.1:3219"
+
+[storaged]
+bind = "127.0.0.1:3211"
+
+[catalogd]
+bind = "127.0.0.1:3212"
+storaged_url = "http://127.0.0.1:3211"
+
+[ingestd]
+bind = "127.0.0.1:3213"
+storaged_url = "http://127.0.0.1:3211"
+catalogd_url = "http://127.0.0.1:3212"
+max_ingest_bytes = 268435456
+
+[queryd]
+bind = "127.0.0.1:3214"
+catalogd_url = "http://127.0.0.1:3212"
+secrets_path = "/etc/lakehouse/secrets-go.toml"
+refresh_every = "1s"
+
+[embedd]
+bind = "127.0.0.1:3216"
+provider_url  = "http://localhost:11434"
+default_model = "nomic-embed-text-v2-moe"
+
+[vectord]
+bind = "127.0.0.1:3215"
+storaged_url = ""
+
+[pathwayd]
+bind = "127.0.0.1:3217"
+persist_path = ""
+
+[observerd]
+bind = "127.0.0.1:3219"
+persist_path = ""
+
+[matrixd]
+bind = "127.0.0.1:3218"
+embedd_url  = "http://127.0.0.1:3216"
+vectord_url = "http://127.0.0.1:3215"
+EOF
+
+poll_health() {
+  local port="$1" deadline=$(($(date +%s) + 5))
+  while [ "$(date +%s)" -lt "$deadline" ]; do
+    if curl -sS --max-time 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then return 0; fi
+    sleep 0.05
+  done
+  return 1
+}
+
+echo "[stress] launching stack..."
+./bin/storaged  -config "$CFG" > /tmp/stress_storaged.log  2>&1 & PIDS+=($!); poll_health 3211 || { echo "storaged failed"; exit 1; }
+./bin/catalogd  -config "$CFG" > /tmp/stress_catalogd.log  2>&1 & PIDS+=($!); poll_health 3212 || { echo "catalogd failed"; exit 1; }
+./bin/ingestd   -config "$CFG" > /tmp/stress_ingestd.log   2>&1 & PIDS+=($!); poll_health 3213 || { echo "ingestd failed"; exit 1; }
+./bin/queryd    -config "$CFG" > /tmp/stress_queryd.log    2>&1 & PIDS+=($!); poll_health 3214 || { echo "queryd failed"; exit 1; }
+./bin/embedd    -config "$CFG" > /tmp/stress_embedd.log    2>&1 & PIDS+=($!); poll_health 3216 || { echo "embedd failed"; exit 1; }
+./bin/vectord   -config "$CFG" > /tmp/stress_vectord.log   2>&1 & PIDS+=($!); poll_health 3215 || { echo "vectord failed"; exit 1; }
+./bin/pathwayd  -config "$CFG" > /tmp/stress_pathwayd.log  2>&1 & PIDS+=($!); poll_health 3217 || { echo "pathwayd failed"; exit 1; }
+./bin/observerd -config "$CFG" > /tmp/stress_observerd.log 2>&1 & PIDS+=($!); poll_health 3219 || { echo "observerd failed"; exit 1; }
+./bin/matrixd   -config "$CFG" > /tmp/stress_matrixd.log   2>&1 & PIDS+=($!); poll_health 3218 || { echo "matrixd failed"; exit 1; }
+./bin/gateway   -config "$CFG" > /tmp/stress_gateway.log   2>&1 & PIDS+=($!); poll_health 3110 || { echo "gateway failed"; exit 1; }
+
+echo
+echo "[stress] ingest workers (limit=$WORKERS_LIMIT) into 'workers' corpus..."
+./bin/staffing_workers -limit "$WORKERS_LIMIT"
+
+echo
+echo "[stress] ingest ethereal_workers (limit=$ETHEREAL_LIMIT, 0=all) into 'ethereal_workers' corpus..."
+./bin/staffing_workers \
+  -parquet "/home/profit/lakehouse/data/datasets/ethereal_workers.parquet" \
+  -index-name ethereal_workers \
+  -id-prefix "e-" \
+  -limit "$ETHEREAL_LIMIT"
+
+echo
+echo "[stress] running multi-coord stress driver..."
+EXTRA_FLAGS=""
+if [ "${WITH_PARAPHRASE_HANDOVER:-1}" = "1" ]; then
+  EXTRA_FLAGS="$EXTRA_FLAGS -with-paraphrase-handover"
+fi
+./bin/multi_coord_stress \
+  -gateway "http://127.0.0.1:3110" \
+  -contracts tests/reality/contracts \
+  -corpora "$CORPORA" \
+  -k "$K" \
+  -out "$OUT_JSON" \
+  -ollama  "http://localhost:11434" \
+  -judge   "${JUDGE_MODEL:-qwen2.5:latest}" \
+  $EXTRA_FLAGS
+
+echo
+echo "[stress] generating markdown report → $OUT_MD"
+
+# Render compact markdown from the JSON. Same shape as the lift harness
+# reports so reviewers can compare format.
+total=$(jq -r '.events | length' "$OUT_JSON")
+gen_at=$(jq -r '.generated_at' "$OUT_JSON")
+div_role=$(jq -r '.diversity.same_role_across_contracts_mean_jaccard' "$OUT_JSON")
+div_role_n=$(jq -r '.diversity.num_pairs_same_role_across_contracts' "$OUT_JSON")
+div_xrole=$(jq -r '.diversity.different_roles_same_contract_mean_jaccard' "$OUT_JSON")
+div_xrole_n=$(jq -r '.diversity.num_pairs_different_roles_same_contract' "$OUT_JSON")
+det_jacc=$(jq -r '.determinism.mean_jaccard' "$OUT_JSON")
+det_n=$(jq -r '.determinism.num_reissued_pairs' "$OUT_JSON")
+hand_run=$(jq -r '.learning.handover_queries_run' "$OUT_JSON")
+hand_top1=$(jq -r '.learning.recorded_answers_top1_count' "$OUT_JSON")
+hand_topk=$(jq -r '.learning.recorded_answers_topk_count' "$OUT_JSON")
+hand_rate=$(jq -r '.learning.handover_hit_rate' "$OUT_JSON")
+ph_run=$(jq -r '.learning.paraphrase_handover_run // 0' "$OUT_JSON")
+ph_top1=$(jq -r '.learning.paraphrase_top1_count // 0' "$OUT_JSON")
+ph_topk=$(jq -r '.learning.paraphrase_topk_count // 0' "$OUT_JSON")
+ph_rate=$(jq -r '.learning.paraphrase_handover_hit_rate // 0' "$OUT_JSON")
+
+cat > "$OUT_MD" <<MDEOF
+# Multi-Coordinator Stress Test — Run ${RUN_ID}
+
+**Generated:** ${gen_at}
+**Coordinators:** alice / bob / carol (each with own playbook namespace: \`playbook_alice\` / \`playbook_bob\` / \`playbook_carol\`)
+**Contracts:** $(jq -r '.contracts | join(" / ")' "$OUT_JSON")
+**Corpora:** \`${CORPORA}\`
+**K per query:** ${K}
+**Total events captured:** ${total}
+**Evidence:** \`${OUT_JSON}\`
+
+---
+
+## Diversity — is the system locking into scenarios or cycling?
+
+| Metric | Mean Jaccard | n pairs | Interpretation |
+|---|---:|---:|---|
+| Same role across different contracts | ${div_role} | ${div_role_n} | Lower = more diverse (different region/cert mix → different workers) |
+| Different roles within same contract | ${div_xrole} | ${div_xrole_n} | Should be near-zero (different roles = different worker pools) |
+
+**Healthy ranges:**
+- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
+- Different roles same contract: < 0.10 means role-specific retrieval is working.
+- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
+
+---
+
+## Determinism — same query reissued, top-K stability
+
+| Metric | Value |
+|---|---:|
+| Mean Jaccard on retrieval-only reissue | ${det_jacc} |
+| Number of reissue pairs | ${det_n} |
+
+**Interpretation:**
+- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
+- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
+- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
+
+---
+
+## Learning — handover hit rate
+
+Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
+
+| Metric | Value |
+|---|---:|
+| Verbatim handover queries run | ${hand_run} |
+| Alice's recorded answer at Bob's top-1 (verbatim) | ${hand_top1} |
+| Alice's recorded answer in Bob's top-K (verbatim) | ${hand_topk} |
+| **Verbatim handover hit rate (top-1)** | **${hand_rate}** |
+| Paraphrase handover queries run | ${ph_run} |
+| Alice's recorded answer at Bob's top-1 (paraphrase) | ${ph_top1} |
+| Alice's recorded answer in Bob's top-K (paraphrase) | ${ph_topk} |
+| **Paraphrase handover hit rate (top-1)** | **${ph_rate}** |
+
+**Interpretation:**
+- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
+- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
+- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
+
+---
+
+## Per-event capture
+
+All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
+
+\`\`\`bash
+jq '.events[] | select(.phase == "merge")' ${OUT_JSON}
+jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' ${OUT_JSON}
+jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' ${OUT_JSON}
+\`\`\`
+
+---
+
+## What's NOT in this run (Phase 1 deliberately defers)
+
+- **48-hour clock.** Events fire as discrete steps, not on a timeline.
+- **Email / SMS ingest.** No endpoints exist on the Go side yet.
+- **New-resume injection mid-run.** The corpus is fixed at the start.
+- **Langfuse traces.** Need Go-side wiring.
+
+These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
+MDEOF
+
+echo
+echo "[stress] DONE"
+echo "[stress]   evidence:  $OUT_JSON"
+echo "[stress]   report:    $OUT_MD"
--- a/scripts/multi_coord_stress/main.go
+++ b/scripts/multi_coord_stress/main.go
--- a/scripts/playbook_lift.sh
+++ b/scripts/playbook_lift.sh
@ -52,6 +52,11 @@ CONFIG_PATH="${CONFIG_PATH:-lakehouse.toml}"
 # actual learning-property test (does cosine on paraphrase find the
 # recorded entry?). Set WITH_PARAPHRASE=0 for a faster verbatim-only run.
 WITH_PARAPHRASE="${WITH_PARAPHRASE:-1}"
+# WITH_REJUDGE=1 (default) adds a Pass 4 — judge warm top-1 to measure
+# quality lift (warm rating vs cold rating). Catches cases where Shape B
+# surfaces a different-but-equally-good answer (which the rank-based
+# lift metric misses). +21 judge calls (~30s on qwen2.5).
+WITH_REJUDGE="${WITH_REJUDGE:-1}"

 OUT_JSON="reports/reality-tests/playbook_lift_${RUN_ID}.json"
 OUT_MD="reports/reality-tests/playbook_lift_${RUN_ID}.md"
@ -156,7 +161,7 @@ refresh_every = "1s"
 [embedd]
 bind = "127.0.0.1:3216"
 provider_url  = "http://localhost:11434"
-default_model = "nomic-embed-text"
+default_model = "nomic-embed-text-v2-moe"

 [vectord]
 bind = "127.0.0.1:3215"
@ -271,9 +276,12 @@ echo "[lift] running driver — judge=$EFFECTIVE_JUDGE · queries=$QUERIES_FILE
 # and runs its own resolution chain (env → config → fallback). When
 # JUDGE_MODEL IS set, the explicit -judge wins inside the Go driver
 # regardless of what its env-lookup would find — flag wins by design.
-PARAPHRASE_FLAG=""
+EXTRA_FLAGS=""
 if [ "$WITH_PARAPHRASE" = "1" ]; then
-  PARAPHRASE_FLAG="-with-paraphrase"
+  EXTRA_FLAGS="$EXTRA_FLAGS -with-paraphrase"
+fi
+if [ "$WITH_REJUDGE" = "1" ]; then
+  EXTRA_FLAGS="$EXTRA_FLAGS -with-rejudge"
 fi
 ./bin/playbook_lift \
  -config  "$CONFIG_PATH" \
@ -284,7 +292,7 @@ fi
  -judge   "$JUDGE_MODEL" \
  -k       "$K" \
  -out     "$OUT_JSON" \
-  $PARAPHRASE_FLAG
+  $EXTRA_FLAGS

 echo
 echo "[lift] generating markdown report → $OUT_MD"
@ -302,6 +310,10 @@ generate_md() {
  p_attempted=$(jq -r '.summary.paraphrase_attempted // 0' "$json")
  p_top1=$(jq -r '.summary.paraphrase_top1_lifts // 0' "$json")
  p_anyrank=$(jq -r '.summary.paraphrase_any_rank_hits // 0' "$json")
+  rj_attempted=$(jq -r '.summary.rejudge_attempted // 0' "$json")
+  q_lifted=$(jq -r '.summary.quality_lifted // 0' "$json")
+  q_neutral=$(jq -r '.summary.quality_neutral // 0' "$json")
+  q_regressed=$(jq -r '.summary.quality_regressed // 0' "$json")

  # Only emit the paraphrase block when --with-paraphrase actually ran
  # (i.e. .summary.paraphrase_attempted > 0). For verbatim-only runs we
@ -312,6 +324,13 @@ generate_md() {
 | Paraphrase pass — recorded answer at any rank in top-K | ${p_anyrank} / ${p_attempted} |"
  fi

+  rj_block=""
+  if [ "$rj_attempted" != "0" ] && [ "$rj_attempted" != "null" ]; then
+    rj_block="| **Quality lift** (warm top-1 rating > cold top-1 rating) | **${q_lifted} / ${rj_attempted}** |
+| Quality neutral (warm top-1 rating = cold top-1 rating) | ${q_neutral} / ${rj_attempted} |
+| Quality regressed (warm top-1 rating < cold top-1 rating) | ${q_regressed} / ${rj_attempted} |"
+  fi
+
  cat > "$md" <<MDEOF
 # Playbook-Lift Reality Test — Run ${RUN_ID}

@ -322,6 +341,7 @@ generate_md() {
 **Queries:** \`${QUERIES_FILE}\` (${total} executed)
 **K per pass:** ${K}
 **Paraphrase pass:** $([ "$WITH_PARAPHRASE" = "1" ] && echo "ENABLED" || echo "disabled")
+**Re-judge pass:** $([ "$WITH_REJUDGE" = "1" ] && echo "ENABLED" || echo "disabled")
 **Evidence:** \`${OUT_JSON}\`

 ---
@ -337,6 +357,7 @@ generate_md() {
 | Playbook boosts triggered (warm pass) | ${boosted} |
 | Mean Δ top-1 distance (warm − cold) | ${mean_delta} |
 ${p_block}
+${rj_block}

 **Verbatim lift rate:** ${lift} of ${discovery} discoveries became top-1 after warm pass.

--- a/scripts/playbook_lift/main.go
+++ b/scripts/playbook_lift/main.go
@ -75,12 +75,19 @@ type queryRun struct {
 	PlaybookRecorded bool   `json:"playbook_recorded"`
 	PlaybookID       string `json:"playbook_target_id,omitempty"`

-	WarmTop1ID       string  `json:"warm_top1_id"`
-	WarmTop1Distance float32 `json:"warm_top1_distance"`
-	WarmBoostedCount int     `json:"warm_boosted_count"`
-	WarmJudgeBestRank int    `json:"warm_judge_best_rank"`
+	WarmTop1ID        string          `json:"warm_top1_id"`
+	WarmTop1Distance  float32         `json:"warm_top1_distance"`
+	WarmBoostedCount  int             `json:"warm_boosted_count"`
+	WarmJudgeBestRank int             `json:"warm_judge_best_rank"` // rank of cold judge-best in warm, NOT the warm pass's own judge-best
+	WarmTop1Metadata  json.RawMessage `json:"-"`                    // cached for Pass 4 rejudge; not emitted

-	Lift bool   `json:"lift"`            // judge-best was below top-1 cold, but top-1 warm
+	// WarmTop1Rating: only populated when --with-rejudge. Compare to
+	// ColdRatings[0] (== cold top-1 rating) to measure quality lift.
+	// *int so absence (no rejudge pass) and a 0-rating verdict are
+	// distinguishable.
+	WarmTop1Rating *int `json:"warm_top1_rating,omitempty"`
+
+	Lift bool `json:"lift"` // judge-best was below top-1 cold, but top-1 warm

 	// Paraphrase pass — only populated when --with-paraphrase. Tests
 	// the playbook's actual learning property: does a recorded entry
@ -114,6 +121,17 @@ type summary struct {
 	ParaphraseTop1Lifts   int `json:"paraphrase_top1_lifts,omitempty"`  // recorded answer surfaced at rank 0
 	ParaphraseAnyRankHits int `json:"paraphrase_any_rank_hits,omitempty"` // recorded answer surfaced at any rank in top-K

+	// Re-judge pass aggregates — only populated when --with-rejudge.
+	// Measures QUALITY lift (warm top-1 rating vs cold top-1 rating)
+	// rather than rank-of-cold-judge-best lift. The latter conflates
+	// "warm surfaced a different but equally-good result" with "warm
+	// shuffled ranks but the answer was the same"; quality lift
+	// disambiguates them.
+	RejudgeAttempted int `json:"rejudge_attempted,omitempty"` // queries that ran the rejudge pass
+	QualityLifted    int `json:"quality_lifted,omitempty"`    // warm-top-1 rating > cold-top-1 rating
+	QualityNeutral   int `json:"quality_neutral,omitempty"`   // ratings equal (could be same or different item)
+	QualityRegressed int `json:"quality_regressed,omitempty"` // warm-top-1 rating < cold-top-1 rating
+
 	GeneratedAt time.Time `json:"generated_at"`
 }

@ -128,6 +146,7 @@ func main() {
 	k := flag.Int("k", 10, "top-k from matrix.search per pass")
 	out := flag.String("out", "reports/reality-tests/playbook_lift_001.json", "output JSONL path")
 	withParaphrase := flag.Bool("with-paraphrase", false, "after warm pass, generate a paraphrase via the judge model and re-query with playbook=true to test the learning property")
+	withRejudge := flag.Bool("with-rejudge", false, "after warm pass, judge warm top-1 to measure QUALITY lift (vs cold top-1 rating), not just rank-of-cold-judge-best")
 	flag.Parse()

 	// Judge resolution priority: explicit flag > $JUDGE_MODEL env >
@ -225,6 +244,7 @@ func main() {
 		}
 		runs[i].WarmTop1ID = resp.Results[0].ID
 		runs[i].WarmTop1Distance = resp.Results[0].Distance
+		runs[i].WarmTop1Metadata = resp.Results[0].Metadata // cache for Pass 4 rejudge
 		runs[i].WarmBoostedCount = resp.PlaybookBoosted
 		playbookBoostedTotal += resp.PlaybookBoosted

@ -304,6 +324,47 @@ func main() {
 		}
 	}

+	// Pass 4 (warm-rejudge) — opt-in via --with-rejudge. Judge warm
+	// top-1 against the same prompt as cold ratings, then compare to
+	// cold top-1 rating. This measures QUALITY lift (did the playbook
+	// produce a better candidate?) rather than just rank-of-cold-judge-
+	// best lift (did the recorded answer move to top-1, even if cold's
+	// top-1 was already good?). See STATE_OF_PLAY OPEN — added because
+	// run #003's verbatim 2/6 didn't tell us whether Shape B was
+	// surfacing better OR same-quality alternatives.
+	rejudgeAttempted := 0
+	qualityLifted := 0
+	qualityNeutral := 0
+	qualityRegressed := 0
+	if *withRejudge {
+		log.Printf("[lift] warm-rejudge pass: measuring quality lift (warm top-1 rating vs cold top-1 rating)")
+		for i := range runs {
+			if runs[i].WarmTop1ID == "" || len(runs[i].WarmTop1Metadata) == 0 {
+				continue // warm pass didn't complete for this query
+			}
+			rejudgeAttempted++
+			result := matrixResult{
+				ID:       runs[i].WarmTop1ID,
+				Distance: runs[i].WarmTop1Distance,
+				Metadata: runs[i].WarmTop1Metadata,
+			}
+			warmRating := judgeRate(hc, *ollama, *judge, runs[i].Query, result)
+			runs[i].WarmTop1Rating = &warmRating
+			coldRating := 0
+			if len(runs[i].ColdRatings) > 0 {
+				coldRating = runs[i].ColdRatings[0]
+			}
+			switch {
+			case warmRating > coldRating:
+				qualityLifted++
+			case warmRating < coldRating:
+				qualityRegressed++
+			default:
+				qualityNeutral++
+			}
+		}
+	}
+
 	sum := summary{
 		Total:                 len(runs),
 		WithDiscovery:         withDiscovery,
@ -314,6 +375,10 @@ func main() {
 		ParaphraseAttempted:   paraphraseAttempted,
 		ParaphraseTop1Lifts:   paraphraseTop1Lifts,
 		ParaphraseAnyRankHits: paraphraseAnyRankHits,
+		RejudgeAttempted:      rejudgeAttempted,
+		QualityLifted:         qualityLifted,
+		QualityNeutral:        qualityNeutral,
+		QualityRegressed:      qualityRegressed,
 		GeneratedAt:           time.Now().UTC(),
 	}
 	if len(runs) > 0 {
@ -323,11 +388,11 @@ func main() {
 	if err := writeJSON(*out, runs, sum); err != nil {
 		log.Fatalf("write %s: %v", *out, err)
 	}
-	if *withParaphrase {
-		log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f · paraphrase=%d/%d→top1, %d/%d→anyrank",
+	if *withParaphrase || *withRejudge {
+		log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f · paraphrase=%d/%d→top1 · quality=lifted%d/neutral%d/regressed%d",
 			sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance,
 			sum.ParaphraseTop1Lifts, sum.ParaphraseAttempted,
-			sum.ParaphraseAnyRankHits, sum.ParaphraseAttempted)
+			sum.QualityLifted, sum.QualityNeutral, sum.QualityRegressed)
 	} else {
 		log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f",
 			sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance)
--- a/tests/reality/contracts/contract_alpha.json
+++ b/tests/reality/contracts/contract_alpha.json
@ -0,0 +1,12 @@
+{
+  "name": "alpha_milwaukee_distribution",
+  "client": "Northstar Logistics",
+  "location": "Milwaukee, WI metro",
+  "shift": "day",
+  "demand": [
+    {"role": "warehouse worker", "count": 200, "skills": ["pallet jack", "inventory"], "certs": ["OSHA-30"]},
+    {"role": "admin assistant", "count": 3, "skills": ["scheduling", "data entry"], "certs": []},
+    {"role": "heavy equipment operator", "count": 2, "skills": ["forklift", "bobcat"], "certs": ["OSHA-30", "forklift cert"]},
+    {"role": "industrial electrician", "count": 1, "skills": ["high voltage", "PLC"], "certs": ["journeyman"], "in_roster": false}
+  ]
+}
--- a/tests/reality/contracts/contract_beta.json
+++ b/tests/reality/contracts/contract_beta.json
@ -0,0 +1,12 @@
+{
+  "name": "beta_indianapolis_manufacturing",
+  "client": "Crossroads Manufacturing",
+  "location": "Indianapolis, IN metro",
+  "shift": "swing",
+  "demand": [
+    {"role": "warehouse worker", "count": 150, "skills": ["assembly", "machine operation"], "certs": ["OSHA-10"]},
+    {"role": "admin assistant", "count": 4, "skills": ["scheduling", "documentation", "spanish"], "certs": []},
+    {"role": "heavy equipment operator", "count": 3, "skills": ["forklift", "pallet jack", "cold storage"], "certs": ["OSHA-30", "forklift cert"]},
+    {"role": "bilingual safety coordinator", "count": 1, "skills": ["spanish", "english", "training"], "certs": ["OSHA trainer"], "in_roster": false}
+  ]
+}
--- a/tests/reality/contracts/contract_gamma.json
+++ b/tests/reality/contracts/contract_gamma.json
@ -0,0 +1,12 @@
+{
+  "name": "gamma_chicago_construction",
+  "client": "Loop Construction Group",
+  "location": "Chicago, IL metro",
+  "shift": "early-day",
+  "demand": [
+    {"role": "warehouse worker", "count": 80, "skills": ["framing", "rigging", "concrete"], "certs": ["OSHA-10"]},
+    {"role": "admin assistant", "count": 1, "skills": ["scheduling", "blueprint reading"], "certs": []},
+    {"role": "heavy equipment operator", "count": 2, "skills": ["mobile crane", "rigging signals", "bobcat"], "certs": ["NCCCO crane cert"]},
+    {"role": "drone surveyor", "count": 1, "skills": ["UAV piloting", "GIS", "site mapping"], "certs": ["FAA Part 107"], "in_roster": false}
+  ]
+}