27 changed files with 21 additions and 3263 deletions
--- a/STATE_OF_PLAY.md
+++ b/STATE_OF_PLAY.md
@@ -1,7 +1,7 @@
 # STATE OF PLAY — Lakehouse-Go

-**Last verified:** 2026-04-30 ~16:42 CDT
-**Verified by:** live probes + `just verify` PASS + multi-coord stress run #011 (full 9-phase scenario, 67 captured events, 1 Langfuse trace + 111 child observations covering every phase + every external call), not memory.
+**Last verified:** 2026-04-30 ~07:25 CDT
+**Verified by:** live probes + `just verify` PASS + reality tests #001/#002/#003/#004 (v4 with split inject threshold: 6/8 verbatim lift + 6/8 paraphrase recovery + zero OOD cross-pollination), not memory.

 > **Read this FIRST.** When the user says "we're working on lakehouse," default to the Go rewrite (this repo); the Rust legacy at `/home/profit/lakehouse/` is maintenance-only. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.

@@ -114,34 +114,6 @@ Evidence: `reports/reality-tests/playbook_lift_{001,002,003}.{json,md}`. Per the

 **v3 → v4 is the configuration evolution.** v3 ran with one threshold (0.5) for both boost and inject — paraphrase recovery hit 6/6 but a single strong recording (w-4435 for "OSHA-30 forklift Wisconsin") cross-pollinated to OOD queries (dental hygienist / RN / software engineer) because their text vectors fell within 0.5 cosine. v4 split: boost stays at 0.5 (safe — only re-ranks existing results), inject tightens to 0.20 (true paraphrase territory). Result: OOD cross-pollination eliminated, paraphrase recovery 6/8, verbatim lift 6/8. The two paraphrases that missed (Q9, Q15) were liberally rephrased — drift > 0.20 cosine.

-### Multi-coordinator stress test (Phase 1 → 3) — VERIFIED end-to-end
-
-Reality test #2 catalog. New harness `scripts/multi_coord_stress.{sh,go}` simulates 3 coordinators (alice/bob/carol) handling 3 distinct contracts (Milwaukee distribution / Indianapolis manufacturing / Chicago construction), each with their own playbook namespace. 9-phase operational narrative across simulated Hours 0–48: baseline → fresh-resume injection → inbox burst → mid-day surge → 200-worker swap → contract merge → handover (verbatim + paraphrase) → split → reissue.
-
-| Capability | Verified | Where |
-|---|---|---|
-| Per-coordinator playbook isolation | ✓ | `playbook_alice` / `playbook_bob` / `playbook_carol` corpora |
-| Same-role-across-contracts diversity | Jaccard 0.026 (n=9) — 97% workers differ per region | Phase 1 baseline |
-| Different-roles-same-contract diversity | Jaccard 0.070 (n=18) — 93% differ per role | Phase 1 baseline |
-| HNSW retrieval determinism | Jaccard 1.000 (n=12) | Phase 6 reissue |
-| Verbatim handover (Bob runs Alice's queries with Alice's playbook) | 4/4 | Phase 4 |
-| Paraphrase handover (Bob runs qwen2.5-paraphrased queries) | 4/4 | Phase 4b |
-| 200-worker swap with `ExcludeIDs` | Jaccard 0.000 — 8/8 placed workers fully replaced | Phase 2b |
-| Fresh-resume injection (two-tier `fresh_workers` index) | 3/3 fresh workers at top-1 | Phase 1b |
-| Inbox endpoint `/v1/observer/inbox` (email + SMS, priority weighting) | 6/6 events recorded | Phase 1c |
-| LLM demand parsing (qwen2.5 format=json on inbox bodies) | 6/6 parsed cleanly into structured `{role, count, location, certs, skills, shift}` | Phase 1c |
-| Judge re-rates inbox top-1 against ORIGINAL body | catches tight-distance-but-wrong (Q3 crane case: dist 0.23 → rating 1) | Phase 1c |
-| Langfuse Go-side tracing | 111 observations on a single run trace, browseable at http://localhost:3001 | Run #011 |
-
-**Substrate gains added by this wave:**
-- `internal/matrix/playbook.go` Shape B + split inject threshold (commit `67d1957` from earlier wave; verified in multi-coord too)
-- `internal/matrix/retrieve.go` `ExcludeIDs` field on `SearchRequest` — filters worker IDs at retrieve, boost, AND inject (so excluded answers can't sneak back via playbook). Real-world driver: coordinator placed N workers, client asks for replacements.
-- `internal/observer/types.go` `SourceInbox` taxonomy alongside `SourceMCP / SourceScenario / SourceWorkflow`
-- `cmd/observerd` `POST /observer/inbox` route — accepts `{type, sender, subject, body, priority, tag}` and records as `ObservedOp`. Type must be `email` or `sms`; body required; priority defaults to medium.
-- `internal/langfuse/client.go` — minimal Go-side Trace+Span client, best-effort posture (logs on error, never blocks calling path; same fail-open semantics as ADR-005 Decision 5.1).
-- `embedd default_model` bumped from `nomic-embed-text` (137M) → `nomic-embed-text-v2-moe` (475M, MoE, drop-in 768d). Same-role-across-contracts diversity went from 0.080 → 0.000 with the upgrade.
-- Two-tier index pattern: fresh content goes to `fresh_workers` (a small "hot" corpus); main queries include it in the corpora list. Solves the HNSW post-build add recall issue (incremental adds to a populated graph land in poorly-connected regions and disappear from search). Canonical NRT pattern; Lucene works the same way.
-
 ### Harness expansion (2026-04-30 ~05:30 CDT)

 `scripts/playbook_lift.sh` rewritten from a 5-daemon stripped harness to the **full 10-daemon prod-realistic stack** (chatd stays up independently). The 5-daemon version was structurally hiding bugs; expanding the daemon set surfaced 7 distinct fixes:
@@ -199,16 +171,10 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
 - The matrix indexer **5/5 components** are shipped. Don't propose to "build the matrix indexer" — it's done.
 - The 5-loop substrate's load-bearing gate is **PASSED**. v3 (`154a72e`) showed 6/6 paraphrase recovery via Shape B. Don't propose to "test if the playbook learns" — it does.
 - **Shape B is the playbook stance now.** When `use_playbook=true`, both `ApplyPlaybookBoost` (re-rank in place) AND `InjectPlaybookMisses` (insert recorded answers not in regular top-K) run. Don't revert to boost-only; v2 proved that path can't recover paraphrase queries.
-- **Boost / inject use SEPARATE thresholds.** Boost stays at `DefaultPlaybookMaxDistance = 0.5` (safe — only re-ranks results already in retrieval). Inject uses tighter `DefaultPlaybookMaxInjectDistance = 0.20` (Shape B forces results in, so loose match cross-pollinates wrong-domain). Don't merge them.
-- **Multi-coord product theory is empirically VALIDATED** by run #011 (Phase 3). Per-coordinator playbook namespaces (`playbook_alice` etc.) with cross-coordinator handover (Bob takes Alice's contract using `playbook_corpus=playbook_alice`) work end-to-end including paraphrase handover. Don't propose to "test if multi-coord works" — it does.
-- **Fresh content uses two-tier indexing.** Fresh resumes go to `fresh_workers` corpus, not the main `workers` index. coder/hnsw incremental adds to a populated 5K+ graph have recall issues; the small hot index has no such crowding. Periodic merge (post-G3) consolidates fresh→main. Canonical NRT pattern.
-- **`embedd.default_model = "nomic-embed-text-v2-moe"`** (475M MoE, 768d). Don't bump to `nomic-embed-text` (137M) "for speed" — diversity scores fell from 0.000 → 0.080 same-role-across-contracts on the smaller model. Cost (~5× slower ingest) is acceptable for once-per-deploy work.
-- **Inbox flow: parse + search + judge + trace.** `/v1/observer/inbox` records the body; coordinator/driver parses via LLM (qwen2.5 format=json), runs `matrix.search` on the parsed query, judges top-1 against the ORIGINAL body, emits Langfuse spans through it all. Don't replace the judge re-rate with a distance-only gate — tight distance + low rating is the load-bearing honesty signal.
 - `local_judge = "qwen2.5:latest"` for the lift loop — qwen3.5:latest is a vision-SSM 256K-ctx build that ran 30s/judge. Don't bump to qwen3.5 again "for cleanliness." Different model in the same tier has different cost geometry.
 - `qwen3.5:latest` IS available locally on this box. Opus's hub-only knowledge is a known-stale signal; the chatd_smoke uses it daily.
 - `temperature` is **omitted** for Anthropic 4.7 (handled by `Request.Temperature *float64`); don't re-add it.
 - chatd-smoke runs with **all cloud providers disabled** intentionally so the suite doesn't depend on API keys; that's why it can't catch B-3-class bugs (those need a fake-server fixture, see Sprint 0 follow-up).
-- **Langfuse Go-side client lives at `internal/langfuse/`** with best-effort fail-open posture. URL+creds from `/etc/lakehouse/langfuse.env`. Don't propose to "wire Langfuse on Go side" — it's wired; multi_coord_stress is the proof.

 ---

@@ -216,11 +182,9 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition

 | Item | What | When to act |
 |---|---|---|
-| **Real-time 48-hour clock** | Multi-coord stress fires phases as discrete steps with simulated-hour labels (0/6/12/18/24/30/36/42/48). A real-time variant would space events on actual wall-clock with `time.Sleep`, simulating the rhythm of a coordinator workday. Cosmetic; doesn't change product behavior — but lets the harness mimic the cadence at which inbox events would arrive in production. ~30 min. | When stress test starts being used to capture realistic per-hour throughput numbers. |
-| **Wider Langfuse instrumentation across daemons** | multi_coord_stress traces every external call, but the daemons themselves (matrixd, observerd, chatd) don't yet emit traces from their own request handlers. Adding `internal/langfuse/middleware.go` would auto-emit a span per HTTP request, giving production-traffic visibility for free. | When production traffic starts hitting the Go gateway. |
-| **Periodic fresh→main index merge** | Two-tier `fresh_workers` pattern is verified working but no scheduled job consolidates fresh→main. Fresh corpus grows monotonically; eventually has its own recall issues. A daily cron that ingests the fresh corpus' contents into the main `workers` index + drops fresh would solve it. | When fresh_workers grows past ~500 items. |
-| **Adjacent-query cross-pollination (lift suite Q6↔Q7)** | After lift v4's split threshold, OOD cross-pollination is gone but Q6 / Q7 still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Multi-coord run #008 inbox-judge re-rating proved the judge can distinguish — gating injection on "judge approves before injecting" closes this. ~1 hr. | When playbook injection quality matters more than retrieval throughput. |
-| **Liberal-paraphrase recovery loss (lift suite Q9, Q15)** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt or a per-pair `paraphrase_max_drift` measurement. | When real coordinator queries are available for a calibration run. |
+| **Reality test v4: re-judge warm results** | Current lift metric counts "warm top-1 == cold judge best." Shape B cross-pollinates recorded answers across queries (run #003 dropped verbatim-lift 7→2 because w-4435 surfaced as warm top-1 for several queries whose cold-judge-best was a different worker). True quality lift would re-judge warm results and measure judge agreement. ~30 min — adds a 4th pass to the driver. | Next time we want to measure Shape B's actual quality, not rank-of-original-judge-best. |
+| **Adjacent-query cross-pollination** | After v4's split threshold, OOD cross-pollination is gone but Q6 ("Forklift-certified loader") ↔ Q7 ("Hazmat-certified warehouse worker, cold storage") still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Likely correct (genuinely similar staffing queries), but worth verifying with the v4 re-judge metric — if the judge agrees both directions are good matches, accept; if not, tighten further (e.g. 0.15) or add a same-query-only mode. | Co-decision with v4 re-judge. |
+| **Liberal-paraphrase recovery loss** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt (more conservative wording variation) or a per-pair `paraphrase_max_drift` measurement. Cosmetic vs. real depends on whether realistic coordinator queries drift like qwen2.5's rephrases do. | When real coordinator queries are available for a calibration run. |
 | **Sprint 4 — deployment** | No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/<bin>.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. |
 | **ADR-006 — auth posture for non-loopback deploy** | Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr. | Required before any Go binary binds non-loopback in prod. |
 | **chatd fixture-mode storage half** | `g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. |
@@ -249,17 +213,6 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
 | `154a72e` | matrix: Shape B (`InjectPlaybookMisses`) — 6/6 paraphrase recovery in run #003 |
 | `94fc3b6` | STATE_OF_PLAY: capture v3 reality test + Shape B + cross-pollination |
 | `67d1957` | matrix: split boost/inject thresholds (0.5 / 0.20) — kills cross-pollination, run #004 6/8+6/8 |
-| `b13b5cd` | playbook_lift v4 metric: warm-top-1 re-judge → quality lift +24%/-14% (5 lifted / 13 neutral / 3 regressed) |
-| `61c7b55` | multi-coord stress harness (Phase 1) — 3 coords / 3 contracts / 7-phase scenario |
-| `0fa42a0` | multi-coord stress Phase 1.5 — shared-role contracts + paraphrase handover |
-| `84a32f0` | multi-coord stress Phase 2 — `ExcludeIDs`, 200-worker swap, fresh-resume |
-| `4da32ad` | embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in) |
-| `e7fc63b` | observerd `/observer/inbox` + multi-coord stress phase 1c (priority-ordered events) |
-| `186d209` | multi_coord_stress: LLM-parsed inbox demands (qwen2.5 format=json) |
-| `ce940f4` | multi_coord_stress: judge re-rates inbox top-1 against original body — recovers OOD honesty signal |
-| `7e6431e` | langfuse: Go-side client + Phase 1c instrumentation |
-| `08a0867` | multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1 (3/3) |
-| `5d49967` | multi_coord_stress: full Langfuse coverage — every phase + every call (111 observations) |

 Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).

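Review note on the split boost/inject thresholds described in the STATE_OF_PLAY.md hunks above: the rule reduces to two independent distance checks. The sketch below is illustrative only. The values 0.5 and 0.20 come from the document; the names `boostMaxDistance`, `injectMaxDistance`, `playbookAction`, and `decide` are hypothetical, and treating the bounds as inclusive is an assumption, since `internal/matrix/playbook.go` is not part of this diff.

```go
package main

import "fmt"

// Threshold values quoted in STATE_OF_PLAY.md. The constant names here are
// hypothetical stand-ins, not the identifiers used in internal/matrix.
const (
	boostMaxDistance  = 0.5  // boost only re-ranks hits already retrieved
	injectMaxDistance = 0.20 // inject forces a recorded answer into the results
)

// playbookAction is a hypothetical type: what a playbook recording at a
// given cosine distance from the query is allowed to do.
type playbookAction struct {
	Boost  bool
	Inject bool
}

// decide applies the two thresholds independently.
// Assumption: both bounds are inclusive.
func decide(distance float64) playbookAction {
	return playbookAction{
		Boost:  distance <= boostMaxDistance,
		Inject: distance <= injectMaxDistance,
	}
}

func main() {
	for _, d := range []float64{0.12, 0.35, 0.70} {
		a := decide(d)
		fmt.Printf("distance=%.2f boost=%t inject=%t\n", d, a.Boost, a.Inject)
	}
}
```

Under these assumptions a recording at distance 0.35 can still re-rank an existing hit but cannot be injected, which is the band where v3's single 0.5 threshold allowed out-of-domain cross-pollination.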
--- a/cmd/observerd/main.go
+++ b/cmd/observerd/main.go
@@ -93,62 +93,6 @@ func (h *handlers) register(r chi.Router) {
 	r.Post("/observer/event", h.handleEvent)
 	r.Post("/observer/workflow/run", h.handleWorkflowRun)
 	r.Get("/observer/workflow/modes", h.handleWorkflowModes)
-	r.Post("/observer/inbox", h.handleInbox)
-}
-
-// inboxMessage is the POST /observer/inbox body — an incoming
-// real-world signal (email or SMS) that a coordinator would receive
-// and act on. The handler only RECORDS it as an ObservedOp; whether
-// to trigger a downstream matrix.search or workflow is the caller's
-// concern. Keeps observer's witness role pure.
-type inboxMessage struct {
-	Type     string `json:"type"`     // "email" | "sms"
-	Sender   string `json:"sender"`
-	Subject  string `json:"subject,omitempty"`
-	Body     string `json:"body"`
-	Priority string `json:"priority"` // "urgent" | "high" | "medium" | "low"
-	Tag      string `json:"tag,omitempty"`
-}
-
-func (h *handlers) handleInbox(w http.ResponseWriter, r *http.Request) {
-	var msg inboxMessage
-	if !decodeJSON(w, r, &msg) {
-		return
-	}
-	if msg.Type != "email" && msg.Type != "sms" {
-		http.Error(w, "type must be 'email' or 'sms'", http.StatusBadRequest)
-		return
-	}
-	if strings.TrimSpace(msg.Body) == "" {
-		http.Error(w, "body required", http.StatusBadRequest)
-		return
-	}
-	if msg.Priority == "" {
-		msg.Priority = "medium"
-	}
-	op := observer.ObservedOp{
-		Endpoint:      "/observer/inbox/" + msg.Type,
-		InputSummary:  fmt.Sprintf("from=%s priority=%s tag=%s subject=%s", msg.Sender, msg.Priority, msg.Tag, msg.Subject),
-		OutputSummary: msg.Body,
-		Source:        observer.SourceInbox,
-		Success:       true,
-	}
-	if err := h.store.Record(op); err != nil {
-		if errors.Is(err, observer.ErrInvalidOp) {
-			http.Error(w, err.Error(), http.StatusBadRequest)
-			return
-		}
-		slog.Error("observer record inbox", "err", err)
-		http.Error(w, "internal", http.StatusInternalServerError)
-		return
-	}
-	stats := h.store.Stats()
-	writeJSON(w, http.StatusOK, map[string]any{
-		"accepted":  true,
-		"type":      msg.Type,
-		"priority":  msg.Priority,
-		"ring_size": stats.Total,
-	})
 }

 func (h *handlers) handleStats(w http.ResponseWriter, _ *http.Request) {
--- a/cmd/observerd/main_test.go
+++ b/cmd/observerd/main_test.go
@@ -4,7 +4,6 @@ import (
 	"bytes"
 	"net/http"
 	"net/http/httptest"
-	"strings"
 	"testing"
 	"time"

@@ -39,7 +38,6 @@ func TestRoutesMounted(t *testing.T) {
 		"POST /observer/event":          false,
 		"POST /observer/workflow/run":   false,
 		"GET /observer/workflow/modes":  false,
-		"POST /observer/inbox":          false,
 	}
 	_ = chi.Walk(r, func(method, route string, _ http.Handler, _ ...func(http.Handler) http.Handler) error {
 		key := method + " " + route
@@ -167,51 +165,6 @@ func TestWorkflowRun_AllProvenanceRecordedPostRun(t *testing.T) {
 	}
 }

-// TestInbox_AcceptsValidEmail locks the happy-path contract for the
-// /observer/inbox route — accepts an email message with required
-// fields, records as ObservedOp, returns 200 with ring-size.
-func TestInbox_AcceptsValidEmail(t *testing.T) {
-	r := newTestRouter(t)
-	body := []byte(`{"type":"email","sender":"client@northstar.com","subject":"URGENT: 50 forklift ops","body":"Need 50 forklift operators in Cleveland OH for next week. Day shift.","priority":"urgent","tag":"alpha-surge"}`)
-	req := httptest.NewRequest("POST", "/observer/inbox", bytes.NewReader(body))
-	req.Header.Set("Content-Type", "application/json")
-	w := httptest.NewRecorder()
-	r.ServeHTTP(w, req)
-	if w.Code != http.StatusOK {
-		t.Fatalf("expected 200, got %d (body=%s)", w.Code, w.Body.String())
-	}
-	if !strings.Contains(w.Body.String(), `"accepted":true`) {
-		t.Errorf("expected accepted=true, got %s", w.Body.String())
-	}
-}
-
-// TestInbox_RejectsBadType locks the validation: type must be
-// "email" or "sms", anything else is 400.
-func TestInbox_RejectsBadType(t *testing.T) {
-	r := newTestRouter(t)
-	body := []byte(`{"type":"smoke-signal","sender":"x","body":"y","priority":"high"}`)
-	req := httptest.NewRequest("POST", "/observer/inbox", bytes.NewReader(body))
-	req.Header.Set("Content-Type", "application/json")
-	w := httptest.NewRecorder()
-	r.ServeHTTP(w, req)
-	if w.Code != http.StatusBadRequest {
-		t.Errorf("expected 400 on bad type, got %d", w.Code)
-	}
-}
-
-// TestInbox_RejectsEmptyBody locks the body-required invariant.
-func TestInbox_RejectsEmptyBody(t *testing.T) {
-	r := newTestRouter(t)
-	body := []byte(`{"type":"email","sender":"x","body":"","priority":"high"}`)
-	req := httptest.NewRequest("POST", "/observer/inbox", bytes.NewReader(body))
-	req.Header.Set("Content-Type", "application/json")
-	w := httptest.NewRecorder()
-	r.ServeHTTP(w, req)
-	if w.Code != http.StatusBadRequest {
-		t.Errorf("expected 400 on empty body, got %d", w.Code)
-	}
-}
-
 // TestWorkflowRun_UnknownMode locks the 400 path on workflow definitions
 // that reference modes not registered with the runner. The harness's
 // reality test runs depend on this so an unknown-mode misconfiguration
--- a/internal/langfuse/client.go
+++ b/internal/langfuse/client.go
@@ -1,217 +0,0 @@
-// Package langfuse is a minimal Go-side client for the Langfuse v2
-// ingestion API. Mirrors the surface area we need from the Rust
-// crates/gateway/src/v1/langfuse_trace.rs emitter — Trace + Span,
-// nothing else yet (no scores, no observations, no datasets).
-//
-// Auth is Basic over public_key:secret_key. URL + creds come from
-// /etc/lakehouse/langfuse.env in production; tests can pass any URL.
-//
-// Best-effort transport: errors are logged but don't fail the calling
-// path. Lakehouse's internal services should never go down because
-// Langfuse is unreachable.
-package langfuse
-
-import (
-	"bytes"
-	"context"
-	"crypto/rand"
-	"encoding/base64"
-	"encoding/hex"
-	"encoding/json"
-	"fmt"
-	"log/slog"
-	"net/http"
-	"sync"
-	"time"
-)
-
-// Client posts traces + spans to Langfuse's ingestion endpoint.
-// Events are buffered and flushed in batches. Always call Flush
-// before exit; Close also flushes.
-type Client struct {
-	url       string
-	auth      string // pre-encoded "Basic ..."
-	hc        *http.Client
-	mu        sync.Mutex
-	pending   []event
-	maxBatch  int
-}
-
-// New constructs a Client. URL like "http://localhost:3001"; creds
-// from langfuse.env. nil hc → uses default with 5s timeout.
-func New(url, publicKey, secretKey string, hc *http.Client) *Client {
-	if hc == nil {
-		hc = &http.Client{Timeout: 5 * time.Second}
-	}
-	auth := "Basic " + base64.StdEncoding.EncodeToString([]byte(publicKey+":"+secretKey))
-	return &Client{
-		url:      url,
-		auth:     auth,
-		hc:       hc,
-		maxBatch: 50,
-	}
-}
-
-// NewID returns a hex string suitable as a trace/span id. Langfuse
-// accepts arbitrary strings; a 16-byte random hex is unambiguous.
-func NewID() string {
-	b := make([]byte, 16)
-	_, _ = rand.Read(b)
-	return hex.EncodeToString(b)
-}
-
-// event is one Langfuse ingestion envelope. Body shape varies by
-// type (trace-create vs span-create); we use map[string]any to
-// keep the wire shape declarative.
-type event struct {
-	ID        string         `json:"id"`
-	Type      string         `json:"type"` // "trace-create" | "span-create"
-	Timestamp string         `json:"timestamp"`
-	Body      map[string]any `json:"body"`
-}
-
-// TraceInput is what callers fill in when starting a trace.
-type TraceInput struct {
-	Name     string
-	UserID   string
-	Input    any
-	Metadata map[string]any
-	Tags     []string
-}
-
-// Trace records a top-level trace. Returns the trace id so callers
-// can attach spans. Best-effort: errors are logged and the trace
-// id is still returned so callers don't need error-handling for the
-// common case.
-func (c *Client) Trace(ctx context.Context, t TraceInput) string {
-	id := NewID()
-	body := map[string]any{
-		"id":   id,
-		"name": t.Name,
-	}
-	if t.UserID != "" {
-		body["userId"] = t.UserID
-	}
-	if t.Input != nil {
-		body["input"] = t.Input
-	}
-	if t.Metadata != nil {
-		body["metadata"] = t.Metadata
-	}
-	if len(t.Tags) > 0 {
-		body["tags"] = t.Tags
-	}
-	c.queue(event{
-		ID:        NewID(),
-		Type:      "trace-create",
-		Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
-		Body:      body,
-	})
-	return id
-}
-
-// SpanInput is what callers fill in when recording a span.
-type SpanInput struct {
-	TraceID    string
-	ParentID   string // optional — for nested spans
-	Name       string
-	Input      any
-	Output     any
-	Metadata   map[string]any
-	StartTime  time.Time
-	EndTime    time.Time
-	StatusCode int    // 0 = success, anything else = error code
-	Level      string // "DEBUG" | "DEFAULT" | "WARNING" | "ERROR"
-}
-
-// Span records one span attached to a trace. Returns the span id.
-func (c *Client) Span(ctx context.Context, s SpanInput) string {
-	id := NewID()
-	body := map[string]any{
-		"id":      id,
-		"traceId": s.TraceID,
-		"name":    s.Name,
-	}
-	if s.ParentID != "" {
-		body["parentObservationId"] = s.ParentID
-	}
-	if s.Input != nil {
-		body["input"] = s.Input
-	}
-	if s.Output != nil {
-		body["output"] = s.Output
-	}
-	if s.Metadata != nil {
-		body["metadata"] = s.Metadata
-	}
-	if !s.StartTime.IsZero() {
-		body["startTime"] = s.StartTime.UTC().Format(time.RFC3339Nano)
-	}
-	if !s.EndTime.IsZero() {
-		body["endTime"] = s.EndTime.UTC().Format(time.RFC3339Nano)
-	}
-	if s.Level != "" {
-		body["level"] = s.Level
-	}
-	if s.StatusCode != 0 {
-		body["statusMessage"] = fmt.Sprintf("status_code=%d", s.StatusCode)
-	}
-	c.queue(event{
-		ID:        NewID(),
-		Type:      "span-create",
-		Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
-		Body:      body,
-	})
-	return id
-}
-
-func (c *Client) queue(e event) {
-	c.mu.Lock()
-	c.pending = append(c.pending, e)
-	shouldFlush := len(c.pending) >= c.maxBatch
-	c.mu.Unlock()
-	if shouldFlush {
-		_ = c.Flush(context.Background())
-	}
-}
-
-// Flush sends all queued events in one batch. Best-effort: returns
-// the error but also logs; callers can ignore.
-func (c *Client) Flush(ctx context.Context) error {
-	c.mu.Lock()
-	if len(c.pending) == 0 {
-		c.mu.Unlock()
-		return nil
-	}
-	batch := c.pending
-	c.pending = nil
-	c.mu.Unlock()
-
-	body, err := json.Marshal(map[string]any{"batch": batch})
-	if err != nil {
-		slog.Warn("langfuse: marshal batch", "err", err, "n", len(batch))
-		return err
-	}
-	req, err := http.NewRequestWithContext(ctx, "POST", c.url+"/api/public/ingestion", bytes.NewReader(body))
-	if err != nil {
-		return err
-	}
-	req.Header.Set("Authorization", c.auth)
-	req.Header.Set("Content-Type", "application/json")
-	resp, err := c.hc.Do(req)
-	if err != nil {
-		slog.Warn("langfuse: post", "err", err, "n", len(batch))
-		return err
-	}
-	defer resp.Body.Close()
-	if resp.StatusCode/100 != 2 && resp.StatusCode != 207 {
-		slog.Warn("langfuse: non-2xx", "status", resp.StatusCode, "n", len(batch))
-		return fmt.Errorf("langfuse ingestion: HTTP %d", resp.StatusCode)
-	}
-	return nil
-}
-
-// Close flushes any remaining events. Idempotent.
-func (c *Client) Close() error {
-	return c.Flush(context.Background())
-}
--- a/internal/matrix/retrieve.go
+++ b/internal/matrix/retrieve.go
@@ -85,15 +85,6 @@ type SearchRequest struct {
 	PlaybookMaxDistance       float64        `json:"playbook_max_distance,omitempty"`
 	PlaybookMaxInjectDistance float64        `json:"playbook_max_inject_distance,omitempty"`
 	MetadataFilter            map[string]any `json:"metadata_filter,omitempty"`
-	// ExcludeIDs filters out specific worker IDs post-retrieval.
-	// Real-world driver: a coordinator places 200 workers at a
-	// contract, then mid-day the client asks for a different set —
-	// the next query should NOT return the already-placed workers.
-	// Filter runs after merge but before metadata filter, so an
-	// excluded ID never wastes a slot in the post-filter top-K.
-	// Also applies to playbook boost + Shape B inject — excluded
-	// answers are skipped at injection time.
-	ExcludeIDs []string `json:"exclude_ids,omitempty"`
 }

 // SearchResponse wraps the merged results plus per-corpus return
@@ -213,25 +204,6 @@ func (r *Retriever) Search(ctx context.Context, req SearchRequest) (*SearchRespo
 		return allHits[i].Distance < allHits[j].Distance
 	})

-	// ExcludeIDs filter — applied first so excluded IDs don't waste
-	// a slot in the post-filter top-K. Real-world driver: coordinator
-	// has placed N workers at a contract; mid-day the client asks for
-	// alternatives, so this query passes ExcludeIDs=<placed_ids> and
-	// gets back fresh candidates instead of the same N.
-	if len(req.ExcludeIDs) > 0 {
-		excludeSet := make(map[string]bool, len(req.ExcludeIDs))
-		for _, id := range req.ExcludeIDs {
-			excludeSet[id] = true
-		}
-		kept := make([]Result, 0, len(allHits))
-		for _, h := range allHits {
-			if !excludeSet[h.ID] {
-				kept = append(kept, h)
-			}
-		}
-		allHits = kept
-	}
-
 	// Metadata filter (component B — staffing-side structured gate).
 	// Applied BEFORE top-K truncation so the filter doesn't accidentally
 	// reduce coverage further. Caller can request larger PerCorpusK to
@@ -267,23 +239,6 @@ func (r *Retriever) Search(ctx context.Context, req SearchRequest) (*SearchRespo
 		if err != nil {
 			slog.Warn("matrix: playbook lookup failed; skipping boost+inject", "err", err)
 		} else if len(hits) > 0 {
-			// Filter playbook hits to honor ExcludeIDs — without this,
-			// an excluded answer in a playbook recording would re-enter
-			// the result set via Shape B inject, defeating the swap
-			// semantics that the exclude list exists to enforce.
-			if len(req.ExcludeIDs) > 0 {
-				excludeSet := make(map[string]bool, len(req.ExcludeIDs))
-				for _, id := range req.ExcludeIDs {
-					excludeSet[id] = true
-				}
-				keptHits := make([]PlaybookHit, 0, len(hits))
-				for _, h := range hits {
-					if !excludeSet[h.Entry.AnswerID] {
-						keptHits = append(keptHits, h)
-					}
-				}
-				hits = keptHits
-			}
 			resp.PlaybookBoosted = ApplyPlaybookBoost(resp.Results, hits)
 			maxInjectDist := float32(req.PlaybookMaxInjectDistance)
 			if maxInjectDist <= 0 {
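Review note on the removed `ExcludeIDs` hunks above: the same exclude-set construction appeared twice, once over merged results and once over playbook hits. If the field is ever reintroduced, one possible shape is a single generic helper shared by both call sites. This is only a sketch: `filterExcluded` and the `result` type are hypothetical names, not identifiers from `internal/matrix`.

```go
package main

import "fmt"

// result is a hypothetical stand-in for the retriever's result type.
type result struct {
	ID       string
	Distance float32
}

// filterExcluded returns the items whose ID (as reported by idOf) does not
// appear in exclude. Order is preserved. With an empty exclude list the
// input slice is returned unchanged.
func filterExcluded[T any](items []T, exclude []string, idOf func(T) string) []T {
	if len(exclude) == 0 {
		return items
	}
	skip := make(map[string]struct{}, len(exclude))
	for _, id := range exclude {
		skip[id] = struct{}{}
	}
	kept := make([]T, 0, len(items))
	for _, it := range items {
		if _, found := skip[idOf(it)]; !found {
			kept = append(kept, it)
		}
	}
	return kept
}

func main() {
	hits := []result{{"w-1", 0.11}, {"w-2", 0.14}, {"w-3", 0.19}}
	kept := filterExcluded(hits, []string{"w-2"}, func(r result) string { return r.ID })
	for _, r := range kept {
		fmt.Println(r.ID)
	}
}
```

The accessor argument lets the same helper serve a second element type, such as playbook hits keyed by their answer ID, without duplicating the loop.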
--- a/internal/observer/types.go
+++ b/internal/observer/types.go
@@ -41,12 +41,6 @@ const (
 	// the workflow handler was casting a string literal to Source,
 	// which worked coincidentally but left the taxonomy implicit.
 	SourceWorkflow Source = "workflow"
-	// SourceInbox tags ObservedOps emitted by /observer/inbox — incoming
-	// real-world signals (email, SMS) that a coordinator would receive
-	// and act on. The handler only RECORDS the message; downstream
-	// triggers (e.g. matrix.search on the parsed demand) are the
-	// caller's concern, recorded separately.
-	SourceInbox Source = "inbox"
 )

 // ObservedOp is one entry in the observer's ring buffer (and JSONL
--- a/lakehouse.toml
+++ b/lakehouse.toml
@@ -43,7 +43,7 @@ bind = "127.0.0.1:3216"
 # G2: Ollama local. G3+ may swap in OpenAI/Voyage by changing
 # this URL + the wire format inside the provider.
 provider_url  = "http://localhost:11434"
-default_model = "nomic-embed-text-v2-moe"
+default_model = "nomic-embed-text"

 [queryd]
 bind = "127.0.0.1:3214"
@@ -129,7 +129,7 @@ level = "info"
 [models]
 # Tier 1 — local hot path
 local_fast    = "qwen3.5:latest"
-local_embed   = "nomic-embed-text-v2-moe"  # 475M MoE, drop-in upgrade from 137M v1 — verified 2026-04-30 same 768-dim
+local_embed   = "nomic-embed-text"
 # local_judge stays on qwen2.5:latest — qwen3.5:latest is a vision-SSM
 # build with 256K context that runs ~30s per judge call against the
 # playbook_lift loop (verified 2026-04-30). qwen2.5:latest at ~1s/call
--- a/reports/reality-tests/multi_coord_stress_001.md
+++ b/reports/reality-tests/multi_coord_stress_001.md
@@ -1,77 +0,0 @@
-# Multi-Coordinator Stress Test — Run 001
-
-**Generated:** 2026-04-30T12:54:09.621556469Z
-**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
-**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
-**Corpora:** `workers,ethereal_workers`
-**K per query:** 8
-**Total events captured:** 52
-**Evidence:** `reports/reality-tests/multi_coord_stress_001.json`
-
----
-
-## Diversity — is the system locking into scenarios or cycling?
-
-| Metric | Mean Jaccard | n pairs | Interpretation |
-|---|---:|---:|---|
-| Same role across different contracts | 0 | 0 | Lower = more diverse (different region/cert mix → different workers) |
-| Different roles within same contract | 0.003703703703703704 | 18 | Should be near-zero (different roles = different worker pools) |
-
-**Healthy ranges:**
-- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
-- Different roles same contract: < 0.10 means role-specific retrieval is working.
-- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
-
----
-
-## Determinism — same query reissued, top-K stability
-
-| Metric | Value |
-|---|---:|
-| Mean Jaccard on retrieval-only reissue | 1 |
-| Number of reissue pairs | 12 |
-
-**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
-
---
-
-## Learning — handover hit rate
-
-Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
-
-| Metric | Value |
-|---|---:|
-| Handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 | 4 |
-| Alice's recorded answer in Bob's top-K | 4 |
-| **Handover hit rate (top-1)** | **1** |
-
-**Interpretation:**
- Hit rate ≥ 0.5: handover is meaningful — the second coordinator inherits the first's institutional memory.
- Hit rate ≈ 0.0: playbook namespace isolation is working but the playbook itself isn't transferable, OR Bob's queries don't match Alice's recordings closely enough.
-
---
-
-## Per-event capture
-
-All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
-
-```bash
-jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_001.json
-jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_001.json
-jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_001.json
-```
-
---
-
-## What's NOT in this run (Phase 1 deliberately defers)
-
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
-
-These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_002.md
+++ b/reports/reality-tests/multi_coord_stress_002.md
@ -1,82 +0,0 @@
-# Multi-Coordinator Stress Test — Run 002
-
-**Generated:** 2026-04-30T13:02:13.570393819Z
-**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
-**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
-**Corpora:** `workers,ethereal_workers`
-**K per query:** 8
-**Total events captured:** 56
-**Evidence:** `reports/reality-tests/multi_coord_stress_002.json`
-
---
-
-## Diversity — is the system locking into scenarios or cycling?
-
-| Metric | Mean Jaccard | n pairs | Interpretation |
-|---|---:|---:|---|
-| Same role across different contracts | 0.11900691900691901 | 9 | Lower = more diverse (different region/cert mix → different workers) |
-| Different roles within same contract | 0.003703703703703704 | 18 | Should be near-zero (different roles = different worker pools) |
-
-**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
-
---
-
-## Determinism — same query reissued, top-K stability
-
-| Metric | Value |
-|---|---:|
-| Mean Jaccard on retrieval-only reissue | 1 |
-| Number of reissue pairs | 12 |
-
-**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
-
---
-
-## Learning — handover hit rate
-
-Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
-
-| Metric | Value |
-|---|---:|
-| Verbatim handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
-| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
-| **Verbatim handover hit rate (top-1)** | **1** |
-| Paraphrase handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
-| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
-| **Paraphrase handover hit rate (top-1)** | **1** |
-
-**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
-
---
-
-## Per-event capture
-
-All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
-
-```bash
-jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_002.json
-jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_002.json
-jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_002.json
-```
-
---
-
-## What's NOT in this run (Phase 1 deliberately defers)
-
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
-
-These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_003.md
+++ b/reports/reality-tests/multi_coord_stress_003.md
@ -1,82 +0,0 @@
-# Multi-Coordinator Stress Test — Run 003
-
-**Generated:** 2026-04-30T13:13:44.35966865Z
-**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
-**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
-**Corpora:** `workers,ethereal_workers`
-**K per query:** 8
-**Total events captured:** 61
-**Evidence:** `reports/reality-tests/multi_coord_stress_003.json`
-
---
-
-## Diversity — is the system locking into scenarios or cycling?
-
-| Metric | Mean Jaccard | n pairs | Interpretation |
-|---|---:|---:|---|
-| Same role across different contracts | 0.03068783068783069 | 9 | Lower = more diverse (different region/cert mix → different workers) |
-| Different roles within same contract | 0 | 18 | Should be near-zero (different roles = different worker pools) |
-
-**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
-
---
-
-## Determinism — same query reissued, top-K stability
-
-| Metric | Value |
-|---|---:|
-| Mean Jaccard on retrieval-only reissue | 1 |
-| Number of reissue pairs | 12 |
-
-**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
-
---
-
-## Learning — handover hit rate
-
-Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
-
-| Metric | Value |
-|---|---:|
-| Verbatim handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
-| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
-| **Verbatim handover hit rate (top-1)** | **1** |
-| Paraphrase handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
-| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
-| **Paraphrase handover hit rate (top-1)** | **1** |
-
-**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
-
---
-
-## Per-event capture
-
-All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
-
-```bash
-jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_003.json
-jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_003.json
-jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_003.json
-```
-
---
-
-## What's NOT in this run (Phase 1 deliberately defers)
-
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
-
-These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_004.md
+++ b/reports/reality-tests/multi_coord_stress_004.md
@ -1,82 +0,0 @@
-# Multi-Coordinator Stress Test — Run 004
-
-**Generated:** 2026-04-30T13:17:03.577877974Z
-**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
-**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
-**Corpora:** `workers,ethereal_workers`
-**K per query:** 8
-**Total events captured:** 61
-**Evidence:** `reports/reality-tests/multi_coord_stress_004.json`
-
---
-
-## Diversity — is the system locking into scenarios or cycling?
-
-| Metric | Mean Jaccard | n pairs | Interpretation |
-|---|---:|---:|---|
-| Same role across different contracts | 0.08013468013468013 | 9 | Lower = more diverse (different region/cert mix → different workers) |
-| Different roles within same contract | 0.012820512820512822 | 18 | Should be near-zero (different roles = different worker pools) |
-
-**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
-
---
-
-## Determinism — same query reissued, top-K stability
-
-| Metric | Value |
-|---|---:|
-| Mean Jaccard on retrieval-only reissue | 1 |
-| Number of reissue pairs | 12 |
-
-**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
-
---
-
-## Learning — handover hit rate
-
-Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
-
-| Metric | Value |
-|---|---:|
-| Verbatim handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
-| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
-| **Verbatim handover hit rate (top-1)** | **1** |
-| Paraphrase handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
-| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
-| **Paraphrase handover hit rate (top-1)** | **1** |
-
-**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
-
---
-
-## Per-event capture
-
-All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
-
-```bash
-jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_004.json
-jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_004.json
-jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_004.json
-```
-
---
-
-## What's NOT in this run (Phase 1 deliberately defers)
-
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
-
-These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_005.md
+++ b/reports/reality-tests/multi_coord_stress_005.md
@ -1,82 +0,0 @@
-# Multi-Coordinator Stress Test — Run 005
-
-**Generated:** 2026-04-30T13:25:15.497712275Z
-**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
-**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
-**Corpora:** `workers,ethereal_workers`
-**K per query:** 8
-**Total events captured:** 61
-**Evidence:** `reports/reality-tests/multi_coord_stress_005.json`
-
---
-
-## Diversity — is the system locking into scenarios or cycling?
-
-| Metric | Mean Jaccard | n pairs | Interpretation |
-|---|---:|---:|---|
-| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
-| Different roles within same contract | 0.03610093610093609 | 18 | Should be near-zero (different roles = different worker pools) |
-
-**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
-
---
-
-## Determinism — same query reissued, top-K stability
-
-| Metric | Value |
-|---|---:|
-| Mean Jaccard on retrieval-only reissue | 1 |
-| Number of reissue pairs | 12 |
-
-**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
-
---
-
-## Learning — handover hit rate
-
-Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
-
-| Metric | Value |
-|---|---:|
-| Verbatim handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
-| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
-| **Verbatim handover hit rate (top-1)** | **1** |
-| Paraphrase handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
-| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
-| **Paraphrase handover hit rate (top-1)** | **1** |
-
-**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
-
---
-
-## Per-event capture
-
-All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
-
-```bash
-jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_005.json
-jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_005.json
-jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_005.json
-```
-
---
-
-## What's NOT in this run (Phase 1 deliberately defers)
-
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
-
-These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_006.md
+++ b/reports/reality-tests/multi_coord_stress_006.md
@ -1,82 +0,0 @@
-# Multi-Coordinator Stress Test — Run 006
-
-**Generated:** 2026-04-30T13:33:24.568124731Z
-**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
-**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
-**Corpora:** `workers,ethereal_workers`
-**K per query:** 8
-**Total events captured:** 67
-**Evidence:** `reports/reality-tests/multi_coord_stress_006.json`
-
---
-
-## Diversity — is the system locking into scenarios or cycling?
-
-| Metric | Mean Jaccard | n pairs | Interpretation |
-|---|---:|---:|---|
-| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
-| Different roles within same contract | 0.04603174603174603 | 18 | Should be near-zero (different roles = different worker pools) |
-
-**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
-
---
-
-## Determinism — same query reissued, top-K stability
-
-| Metric | Value |
-|---|---:|
-| Mean Jaccard on retrieval-only reissue | 1 |
-| Number of reissue pairs | 12 |
-
-**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
-
---
-
-## Learning — handover hit rate
-
-Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
-
-| Metric | Value |
-|---|---:|
-| Verbatim handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
-| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
-| **Verbatim handover hit rate (top-1)** | **1** |
-| Paraphrase handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
-| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
-| **Paraphrase handover hit rate (top-1)** | **1** |
-
-**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
-
---
-
-## Per-event capture
-
-All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
-
-```bash
-jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_006.json
-jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_006.json
-jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_006.json
-```
-
---
-
-## What's NOT in this run (Phase 1 deliberately defers)
-
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
-
-These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_007.md
+++ b/reports/reality-tests/multi_coord_stress_007.md
@ -1,82 +0,0 @@
-# Multi-Coordinator Stress Test — Run 007
-
-**Generated:** 2026-04-30T19:50:04.791000091Z
-**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
-**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
-**Corpora:** `workers,ethereal_workers`
-**K per query:** 8
-**Total events captured:** 67
-**Evidence:** `reports/reality-tests/multi_coord_stress_007.json`
-
---
-
-## Diversity — is the system locking into scenarios or cycling?
-
-| Metric | Mean Jaccard | n pairs | Interpretation |
-|---|---:|---:|---|
-| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
-| Different roles within same contract | 0.026455026455026454 | 18 | Should be near-zero (different roles = different worker pools) |
-
-**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
-
---
-
-## Determinism — same query reissued, top-K stability
-
-| Metric | Value |
-|---|---:|
-| Mean Jaccard on retrieval-only reissue | 1 |
-| Number of reissue pairs | 12 |
-
-**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
-
---
-
-## Learning — handover hit rate
-
-Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
-
-| Metric | Value |
-|---|---:|
-| Verbatim handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
-| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
-| **Verbatim handover hit rate (top-1)** | **1** |
-| Paraphrase handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
-| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
-| **Paraphrase handover hit rate (top-1)** | **1** |
-
-**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
-
---
-
-## Per-event capture
-
-All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
-
-```bash
-jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_007.json
-jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_007.json
-jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_007.json
-```
-
---
-
-## What's NOT in this run (Phase 1 deliberately defers)
-
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
-
-These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_008.md
+++ b/reports/reality-tests/multi_coord_stress_008.md
@ -1,82 +0,0 @@
-# Multi-Coordinator Stress Test — Run 008
-
-**Generated:** 2026-04-30T21:15:37.045817146Z
-**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
-**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
-**Corpora:** `workers,ethereal_workers`
-**K per query:** 8
-**Total events captured:** 67
-**Evidence:** `reports/reality-tests/multi_coord_stress_008.json`
-
---
-
-## Diversity — is the system locking into scenarios or cycling?
-
-| Metric | Mean Jaccard | n pairs | Interpretation |
-|---|---:|---:|---|
-| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
-| Different roles within same contract | 0.04126984126984126 | 18 | Should be near-zero (different roles = different worker pools) |
-
-**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
-
---
-
-## Determinism — same query reissued, top-K stability
-
-| Metric | Value |
-|---|---:|
-| Mean Jaccard on retrieval-only reissue | 1 |
-| Number of reissue pairs | 12 |
-
-**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
-
---
-
-## Learning — handover hit rate
-
-Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
-
-| Metric | Value |
-|---|---:|
-| Verbatim handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
-| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
-| **Verbatim handover hit rate (top-1)** | **1** |
-| Paraphrase handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
-| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
-| **Paraphrase handover hit rate (top-1)** | **1** |
-
-**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
-
---
-
-## Per-event capture
-
-All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
-
-```bash
-jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_008.json
-jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_008.json
-jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_008.json
-```
-
---
-
-## What's NOT in this run (Phase 1 deliberately defers)
-
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
-
-These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_009.md
+++ b/reports/reality-tests/multi_coord_stress_009.md
@ -1,82 +0,0 @@
-# Multi-Coordinator Stress Test — Run 009
-
-**Generated:** 2026-04-30T21:23:59.011167722Z
-**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
-**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
-**Corpora:** `workers,ethereal_workers`
-**K per query:** 8
-**Total events captured:** 67
-**Evidence:** `reports/reality-tests/multi_coord_stress_009.json`
-
---
-
-## Diversity — is the system locking into scenarios or cycling?
-
-| Metric | Mean Jaccard | n pairs | Interpretation |
-|---|---:|---:|---|
-| Same role across different contracts | 0.015873015873015872 | 9 | Lower = more diverse (different region/cert mix → different workers) |
-| Different roles within same contract | 0.015343915343915345 | 18 | Should be near-zero (different roles = different worker pools) |
-
-**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
-
---
-
-## Determinism — same query reissued, top-K stability
-
-| Metric | Value |
-|---|---:|
-| Mean Jaccard on retrieval-only reissue | 1 |
-| Number of reissue pairs | 12 |
-
-**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
-
---
-
-## Learning — handover hit rate
-
-Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
-
-| Metric | Value |
-|---|---:|
-| Verbatim handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
-| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
-| **Verbatim handover hit rate (top-1)** | **1** |
-| Paraphrase handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
-| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
-| **Paraphrase handover hit rate (top-1)** | **1** |
-
-**Interpretation:**
-- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
-- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
-- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
-
----
-
-## Per-event capture
-
-All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
-
-```bash
-jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_009.json
-jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_009.json
-jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_009.json
-```
-
----
-
-## What's NOT in this run (Phase 1 deliberately defers)
-
-- **48-hour clock.** Events fire as discrete steps, not on a timeline.
-- **Email / SMS ingest.** No endpoints exist on the Go side yet.
-- **New-resume injection mid-run.** The corpus is fixed at the start.
-- **Langfuse traces.** Need Go-side wiring.
-
-These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_010.md
+++ b/reports/reality-tests/multi_coord_stress_010.md
@@ -1,82 +0,0 @@
-# Multi-Coordinator Stress Test — Run 010
-
-**Generated:** 2026-04-30T21:30:38.434794788Z
-**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
-**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
-**Corpora:** `workers,ethereal_workers`
-**K per query:** 8
-**Total events captured:** 67
-**Evidence:** `reports/reality-tests/multi_coord_stress_010.json`
-
----
-
-## Diversity — is the system locking into scenarios or cycling?
-
-| Metric | Mean Jaccard | n pairs | Interpretation |
-|---|---:|---:|---|
-| Same role across different contracts | 0.007407407407407408 | 9 | Lower = more diverse (different region/cert mix → different workers) |
-| Different roles within same contract | 0.026455026455026454 | 18 | Should be near-zero (different roles = different worker pools) |
-
-**Healthy ranges:**
-- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
-- Different roles same contract: < 0.10 means role-specific retrieval is working.
-- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
-
----
-
-## Determinism — same query reissued, top-K stability
-
-| Metric | Value |
-|---|---:|
-| Mean Jaccard on retrieval-only reissue | 1 |
-| Number of reissue pairs | 12 |
-
-**Interpretation:**
-- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
-- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
-- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
-
----
-
-## Learning — handover hit rate
-
-Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
-
-| Metric | Value |
-|---|---:|
-| Verbatim handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
-| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
-| **Verbatim handover hit rate (top-1)** | **1** |
-| Paraphrase handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
-| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
-| **Paraphrase handover hit rate (top-1)** | **1** |
-
-**Interpretation:**
-- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
-- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
-- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
-
----
-
-## Per-event capture
-
-All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
-
-```bash
-jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_010.json
-jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_010.json
-jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_010.json
-```
-
----
-
-## What's NOT in this run (Phase 1 deliberately defers)
-
-- **48-hour clock.** Events fire as discrete steps, not on a timeline.
-- **Email / SMS ingest.** No endpoints exist on the Go side yet.
-- **New-resume injection mid-run.** The corpus is fixed at the start.
-- **Langfuse traces.** Need Go-side wiring.
-
-These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
--- a/reports/reality-tests/multi_coord_stress_011.md
+++ b/reports/reality-tests/multi_coord_stress_011.md
@@ -1,82 +0,0 @@
-# Multi-Coordinator Stress Test — Run 011
-
-**Generated:** 2026-04-30T21:41:26.801002955Z
-**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
-**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
-**Corpora:** `workers,ethereal_workers`
-**K per query:** 8
-**Total events captured:** 67
-**Evidence:** `reports/reality-tests/multi_coord_stress_011.json`
-
----
-
-## Diversity — is the system locking into scenarios or cycling?
-
-| Metric | Mean Jaccard | n pairs | Interpretation |
-|---|---:|---:|---|
-| Same role across different contracts | 0.025641025641025644 | 9 | Lower = more diverse (different region/cert mix → different workers) |
-| Different roles within same contract | 0.06996336996336996 | 18 | Should be near-zero (different roles = different worker pools) |
-
-**Healthy ranges:**
-- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
-- Different roles same contract: < 0.10 means role-specific retrieval is working.
-- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
-
----
-
-## Determinism — same query reissued, top-K stability
-
-| Metric | Value |
-|---|---:|
-| Mean Jaccard on retrieval-only reissue | 1 |
-| Number of reissue pairs | 12 |
-
-**Interpretation:**
-- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
-- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
-- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
-
----
-
-## Learning — handover hit rate
-
-Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
-
-| Metric | Value |
-|---|---:|
-| Verbatim handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
-| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
-| **Verbatim handover hit rate (top-1)** | **1** |
-| Paraphrase handover queries run | 4 |
-| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
-| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
-| **Paraphrase handover hit rate (top-1)** | **1** |
-
-**Interpretation:**
-- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
-- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
-- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
-
----
-
-## Per-event capture
-
-All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
-
-```bash
-jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_011.json
-jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_011.json
-jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_011.json
-```
-
----
-
-## What's NOT in this run (Phase 1 deliberately defers)
-
-- **48-hour clock.** Events fire as discrete steps, not on a timeline.
-- **Email / SMS ingest.** No endpoints exist on the Go side yet.
-- **New-resume injection mid-run.** The corpus is fixed at the start.
-- **Langfuse traces.** Need Go-side wiring.
-
-These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
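The Diversity and Determinism tables in the removed stress-test reports give mean Jaccard similarity between top-K result sets. The harness code that computes them (`scripts/multi_coord_stress/main.go`) is not shown in this diff, so the following is only a minimal Go sketch, assuming each result list is compared as a set of worker IDs. The function name `jaccard`, the empty-set convention, and the sample IDs are hypothetical.

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two top-K ID lists, treating each
// list as a set. Two empty lists are scored as identical (1.0); that
// convention is an assumption, not something taken from the harness.
func jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	if len(setA) == 0 && len(setB) == 0 {
		return 1.0
	}
	inter := 0
	for id := range setB {
		if _, ok := setA[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

func main() {
	first := []string{"w-1", "w-2", "w-3", "w-4"}
	reissue := []string{"w-1", "w-2", "w-3", "w-4"}
	other := []string{"w-4", "w-5", "w-6", "w-7"}

	// Identical top-K on reissue: the determinism case.
	fmt.Println(jaccard(first, reissue))
	// One shared ID out of seven distinct IDs: the diversity case.
	fmt.Println(jaccard(first, other))
}
```

Under this reading, a determinism score of 1 means every reissued query returned exactly the same set of IDs, and diversity scores near zero mean the compared queries shared almost no workers.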
--- a/reports/reality-tests/playbook_lift_005.md
+++ b/reports/reality-tests/playbook_lift_005.md
@@ -1,120 +0,0 @@
-# Playbook-Lift Reality Test — Run 005
-
-**Generated:** 2026-04-30T12:40:48.475901847Z
-**Judge:** `qwen2.5:latest` (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
-**Corpora:** `workers,ethereal_workers`
-**Workers limit:** 5000
-**Queries:** `tests/reality/playbook_lift_queries.txt` (21 executed)
-**K per pass:** 10
-**Paraphrase pass:** ENABLED
-**Re-judge pass:** ENABLED
-**Evidence:** `reports/reality-tests/playbook_lift_005.json`
-
----
-
-## Headline
-
-| Metric | Value |
-|---|---:|
-| Total queries run | 21 |
-| Cold-pass discoveries (judge-best ≠ top-1) | 7 |
-| Warm-pass lifts (recorded playbook → top-1) | 5 |
-| No change (judge-best already top-1, no playbook needed) | 16 |
-| Playbook boosts triggered (warm pass) | 9 |
-| Mean Δ top-1 distance (warm − cold) | -0.076170966 |
-| **Paraphrase pass — recorded answer at rank 0 (top-1)** | **5 / 7** |
-| Paraphrase pass — recorded answer at any rank in top-K | 5 / 7 |
-| **Quality lift** (warm top-1 rating > cold top-1 rating) | **5 / 21** |
-| Quality neutral (warm top-1 rating = cold top-1 rating) | 13 / 21 |
-| Quality regressed (warm top-1 rating < cold top-1 rating) | 3 / 21 |
-
-**Verbatim lift rate:** 5 of 7 discoveries became top-1 after warm pass.
-
----
-
-## Per-query results
-
-| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
-|---|---|---|---|---|---|---|---|
-| 1 | Forklift operator with OSHA-30, warehouse experience, day sh | e-5670 | 2/4 | ✓ e-5729 | e-5729 | 0 | **YES** |
-| 2 | OSHA-30 certified forklift operator in Wisconsin, cold stora | e-6293 | 7/3 | — | w-1566 | 8 | no |
-| 3 | Production worker with confined-space cert and hazmat traini | w-602 | 0/2 | — | w-3575 | 1 | no |
-| 4 | CDL Class A driver, clean record, willing to do regional 4-d | w-3854 | 0/1 | — | w-3854 | 0 | no |
-| 5 | Warehouse lead with current OSHA-30 certification, NOT OSHA- | w-1807 | 6/3 | — | w-1807 | 6 | no |
-| 6 | Forklift-certified loader, certification must be active, dis | w-1807 | 3/4 | ✓ w-205 | w-4257 | 1 | no |
-| 7 | Hazmat-certified warehouse worker comfortable with cold stor | e-4910 | 2/4 | ✓ w-4257 | w-205 | 1 | no |
-| 8 | Bilingual production worker with team-lead experience and tr | w-4988 | 0/4 | — | w-4988 | 0 | no |
-| 9 | Inventory specialist with confined-space cert and compliance | w-388 | 3/4 | ✓ w-3575 | w-3575 | 0 | **YES** |
-| 10 | Warehouse worker who can run inventory cycles and lead a sma | e-3011 | 0/4 | — | e-3011 | 0 | no |
-| 11 | Production line worker comfortable filling in as line superv | w-1387 | 0/4 | — | e-5729 | 1 | no |
-| 12 | Customer service rep willing to cross-train into dispatch or | w-1451 | 0/2 | — | w-1451 | 0 | no |
-| 13 | Reliable production line lead with strong attendance and lea | e-7360 | 5/4 | ✓ w-2886 | w-2886 | 0 | **YES** |
-| 14 | Highly responsive forklift operator available for last-minut | e-6108 | 5/4 | ✓ w-1566 | w-1566 | 0 | **YES** |
-| 15 | Engaged warehouse associate with strong safety compliance re | e-2743 | 2/4 | ✓ w-49 | w-49 | 0 | **YES** |
-| 16 | CDL-A driver based in IL or WI, willing to run regional 4-da | w-2486 | 5/2 | — | w-2486 | 5 | no |
-| 17 | Bilingual customer service rep in Indianapolis or Cincinnati | e-9749 | 9/2 | — | e-9749 | 9 | no |
-| 18 | Production supervisor open to Midwest relocation for permane | w-379 | 6/3 | — | w-379 | 6 | no |
-| 19 | Dental hygienist with three years experience, Indianapolis a | e-6772 | 0/1 | — | w-3575 | 1 | no |
-| 20 | Registered nurse with ICU experience, willing to take per-di | w-379 | 0/1 | — | w-379 | 0 | no |
-| 21 | Software engineer with React and TypeScript, three years exp | w-1773 | 0/1 | — | w-1773 | 0 | no |
-
----
-
-## Paraphrase pass — does the playbook help similar-but-different queries?
-
-For each query whose Pass 1 cold pass recorded a playbook entry, the
-judge model rephrased the query, and the rephrased version was sent
-through warm matrix.search. The recorded answer ID's rank in those
-results tests whether cosine on the embedded paraphrase finds the
-recorded query's vector.
-
-| # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift |
-|---|---|---|---|---|---|---|
-| 1 | Forklift operator with OSHA-30, warehous | Seeking forklift operator certified in OSHA-30, looking for  | e-5729 | e-5729 | 0 | **YES** |
-| 6 | Forklift-certified loader, certification | Loader requiring active forklift certification, this must no | w-205 | w-205 | 0 | **YES** |
-| 7 | Hazmat-certified warehouse worker comfor | Warehouse worker with Hazmat certification and experience in | w-4257 | w-4257 | 0 | **YES** |
-| 9 | Inventory specialist with confined-space | Specialist in inventory management requiring certified confi | w-3575 | w-49 | -1 | no |
-| 13 | Reliable production line lead with stron | Experienced production line supervisor with excellent punctu | w-2886 | w-2886 | 0 | **YES** |
-| 14 | Highly responsive forklift operator avai | Available forklift operator ready for urgent shift coverage | w-1566 | w-1566 | 0 | **YES** |
-| 15 | Engaged warehouse associate with strong  | Warehouse associate dedicated to engagement and boasting a r | w-49 | w-984 | -1 | no |
-
----
-
-## Honesty caveats
-
-1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
-   judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
-   the lift number is meaningless. To validate the judge itself, sample 5–10
-   verdicts manually and check agreement.
-2. **Score-1.0 boost = distance halved.** Playbook math is
-   `distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
-   result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
-   even halving doesn't promote it. Tight clusters → little visible lift.
-3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
-   case — same query, recorded playbook, expected boost. The paraphrase
-   pass (when enabled) is the actual learning property: similar-but-different
-   queries hitting a recorded playbook. Compare verbatim and paraphrase
-   lift rates — paraphrase should be lower (semantic-distance gates some
-   playbook hits) but non-zero is the meaningful signal.
-4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
-   results land in one corpus, the matrix layer's purpose isn't being tested.
-   Check per-corpus distribution in the JSON.
-5. **Judge resolution.** This run used `qwen2.5:latest` from
-   env JUDGE_MODEL=qwen2.5:latest.
-   Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
-6. **Paraphrase generation also uses the judge.** The same model that rates
-   relevance also rephrases queries. A judge that's bad at rating staffing
-   queries is probably also bad at rephrasing them. Worth sanity-checking
-   a sample of `paraphrase_query` values in the JSON before trusting the
-   paraphrase lift number.
-
-## Next moves
-
-- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
-  work. Move to paraphrase queries + tag-based boost (currently ignored).
-- If lift rate < 20%: investigate why — judge variance, distance gap too
-  wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
-  retuning.
-- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
-  already close to optimal on this query distribution. Either the corpus
-  is too narrow or the queries are too easy.
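Caveat 2 in the removed report gives the playbook math as `distance' = distance × (1 - 0.5 × score)`. A minimal Go sketch of that arithmetic follows, assuming a score in [0, 1] and that only the judge-best result receives the boost. The function name `boosted` and the sample distances are hypothetical and are not taken from the matrixd code.

```go
package main

import "fmt"

// boosted applies the formula quoted in the report:
// distance' = distance × (1 - 0.5 × score).
func boosted(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

func main() {
	coldTop1 := 0.30 // hypothetical distance of the unboosted cold top-1

	// Judge-best within 2× of the cold top-1: halving promotes it.
	near := boosted(0.50, 1.0)
	fmt.Println(near < coldTop1)

	// Judge-best beyond 2× of the cold top-1: halving is not enough.
	far := boosted(0.70, 1.0)
	fmt.Println(far < coldTop1)
}
```

With a score of 1.0 the distance is halved, which is why the caveat puts the promotion limit at twice the cold top-1 distance.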
--- a/scripts/g2_smoke.sh
+++ b/scripts/g2_smoke.sh
@@ -76,14 +76,8 @@ DIM="$(echo "$RESP" | jq -r '.dimension')"
 N="$(echo "$RESP" | jq -r '.vectors | length')"
 MODEL="$(echo "$RESP" | jq -r '.model')"
 SAME="$(echo "$RESP" | jq -r '.vectors[0][0] == .vectors[1][0]')"
-# Accept any nomic-embed-text* family member as the default — v1
-# (137M, 768d) and v2-moe (475M MoE, 768d) are both supported drop-ins.
-# The smoke locks the dimension + the distinct-vectors property, NOT
-# the exact model name (operators bump the model in lakehouse.toml
-# without changing this smoke).
-case "$MODEL" in nomic-embed-text*) MODEL_OK=1 ;; *) MODEL_OK=0 ;; esac
-if [ "$DIM" = "768" ] && [ "$N" = "2" ] && [ "$MODEL_OK" = "1" ] && [ "$SAME" = "false" ]; then
-  echo "  ✓ dim=768, model=$MODEL, 2 distinct vectors"
+if [ "$DIM" = "768" ] && [ "$N" = "2" ] && [ "$MODEL" = "nomic-embed-text" ] && [ "$SAME" = "false" ]; then
+  echo "  ✓ dim=768, model=nomic-embed-text, 2 distinct vectors"
 else
  echo "  ✗ resp: dim=$DIM n=$N model=$MODEL same=$SAME"; FAILED=1
 fi
--- a/scripts/multi_coord_stress.sh
+++ b/scripts/multi_coord_stress.sh
@@ -1,282 +0,0 @@
-#!/usr/bin/env bash
-# Multi-coordinator stress harness — Phase 1 of the 48-hour mock.
-#
-# Three coordinators (Alice / Bob / Carol) own three distinct contracts
-# (Milwaukee distribution, Indianapolis manufacturing, Chicago
-# construction). The driver fires phases:
-#   1. baseline — each coord runs their contract's role queries
-#   2. surge    — each contract's demand doubles (URGENT phrasing)
-#   3. merge    — alpha + beta combined under alice
-#   4. handover — bob takes alpha, USING alice's playbook namespace
-#   5. split    — alpha surge re-distributed across all 3 coords
-#   6. reissue  — non-determinism check: same baselines reissued
-#   7. analysis — diversity + determinism + learning metrics
-#
-# Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints,
-# and Langfuse wiring — those are Phase 2/3.
-#
-# Usage:
-#   ./scripts/multi_coord_stress.sh                    # run #001
-#   RUN_ID=002 ./scripts/multi_coord_stress.sh
-#   K=12 ./scripts/multi_coord_stress.sh
-
-set -euo pipefail
-cd "$(dirname "$0")/.."
-
-export PATH="$PATH:/usr/local/go/bin"
-
-RUN_ID="${RUN_ID:-001}"
-WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
-ETHEREAL_LIMIT="${ETHEREAL_LIMIT:-0}"
-CORPORA="${CORPORA:-workers,ethereal_workers}"
-K="${K:-8}"
-
-OUT_JSON="reports/reality-tests/multi_coord_stress_${RUN_ID}.json"
-OUT_MD="reports/reality-tests/multi_coord_stress_${RUN_ID}.md"
-
-if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
-  echo "[stress] Ollama not reachable on :11434 — skipping (need it for embeddings)"
-  exit 0
-fi
-
-echo "[stress] building binaries..."
-go build -o bin/ ./cmd/storaged ./cmd/catalogd ./cmd/ingestd ./cmd/queryd \
-                 ./cmd/embedd ./cmd/vectord ./cmd/pathwayd ./cmd/observerd \
-                 ./cmd/matrixd ./cmd/gateway \
-                 ./scripts/staffing_workers ./scripts/multi_coord_stress
-
-pkill -f "bin/(storaged|catalogd|ingestd|queryd|embedd|vectord|pathwayd|observerd|matrixd|gateway)" 2>/dev/null || true
-sleep 0.3
-
-PIDS=()
-TMP="$(mktemp -d)"
-CFG="$TMP/stress.toml"
-
-cleanup() {
-  echo "[stress] cleanup"
-  for p in "${PIDS[@]:-}"; do [ -n "${p:-}" ] && kill "$p" 2>/dev/null || true; done
-  rm -rf "$TMP"
-}
-trap cleanup EXIT INT TERM
-
-cat > "$CFG" <<EOF
-[s3]
-endpoint        = "http://localhost:9000"
-region          = "us-east-1"
-bucket          = "lakehouse-go-primary"
-use_path_style  = true
-
-[gateway]
-bind = "127.0.0.1:3110"
-storaged_url = "http://127.0.0.1:3211"
-catalogd_url = "http://127.0.0.1:3212"
-ingestd_url  = "http://127.0.0.1:3213"
-queryd_url   = "http://127.0.0.1:3214"
-vectord_url  = "http://127.0.0.1:3215"
-embedd_url   = "http://127.0.0.1:3216"
-pathwayd_url = "http://127.0.0.1:3217"
-matrixd_url  = "http://127.0.0.1:3218"
-observerd_url = "http://127.0.0.1:3219"
-
-[storaged]
-bind = "127.0.0.1:3211"
-
-[catalogd]
-bind = "127.0.0.1:3212"
-storaged_url = "http://127.0.0.1:3211"
-
-[ingestd]
-bind = "127.0.0.1:3213"
-storaged_url = "http://127.0.0.1:3211"
-catalogd_url = "http://127.0.0.1:3212"
-max_ingest_bytes = 268435456
-
-[queryd]
-bind = "127.0.0.1:3214"
-catalogd_url = "http://127.0.0.1:3212"
-secrets_path = "/etc/lakehouse/secrets-go.toml"
-refresh_every = "1s"
-
-[embedd]
-bind = "127.0.0.1:3216"
-provider_url  = "http://localhost:11434"
-default_model = "nomic-embed-text-v2-moe"
-
-[vectord]
-bind = "127.0.0.1:3215"
-storaged_url = ""
-
-[pathwayd]
-bind = "127.0.0.1:3217"
-persist_path = ""
-
-[observerd]
-bind = "127.0.0.1:3219"
-persist_path = ""
-
-[matrixd]
-bind = "127.0.0.1:3218"
-embedd_url  = "http://127.0.0.1:3216"
-vectord_url = "http://127.0.0.1:3215"
-EOF
-
-poll_health() {
-  local port="$1" deadline=$(($(date +%s) + 5))
-  while [ "$(date +%s)" -lt "$deadline" ]; do
-    if curl -sS --max-time 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then return 0; fi
-    sleep 0.05
-  done
-  return 1
-}
-
-echo "[stress] launching stack..."
-./bin/storaged  -config "$CFG" > /tmp/stress_storaged.log  2>&1 & PIDS+=($!); poll_health 3211 || { echo "storaged failed"; exit 1; }
-./bin/catalogd  -config "$CFG" > /tmp/stress_catalogd.log  2>&1 & PIDS+=($!); poll_health 3212 || { echo "catalogd failed"; exit 1; }
-./bin/ingestd   -config "$CFG" > /tmp/stress_ingestd.log   2>&1 & PIDS+=($!); poll_health 3213 || { echo "ingestd failed"; exit 1; }
-./bin/queryd    -config "$CFG" > /tmp/stress_queryd.log    2>&1 & PIDS+=($!); poll_health 3214 || { echo "queryd failed"; exit 1; }
-./bin/embedd    -config "$CFG" > /tmp/stress_embedd.log    2>&1 & PIDS+=($!); poll_health 3216 || { echo "embedd failed"; exit 1; }
-./bin/vectord   -config "$CFG" > /tmp/stress_vectord.log   2>&1 & PIDS+=($!); poll_health 3215 || { echo "vectord failed"; exit 1; }
-./bin/pathwayd  -config "$CFG" > /tmp/stress_pathwayd.log  2>&1 & PIDS+=($!); poll_health 3217 || { echo "pathwayd failed"; exit 1; }
-./bin/observerd -config "$CFG" > /tmp/stress_observerd.log 2>&1 & PIDS+=($!); poll_health 3219 || { echo "observerd failed"; exit 1; }
-./bin/matrixd   -config "$CFG" > /tmp/stress_matrixd.log   2>&1 & PIDS+=($!); poll_health 3218 || { echo "matrixd failed"; exit 1; }
-./bin/gateway   -config "$CFG" > /tmp/stress_gateway.log   2>&1 & PIDS+=($!); poll_health 3110 || { echo "gateway failed"; exit 1; }
-
-echo
-echo "[stress] ingest workers (limit=$WORKERS_LIMIT) into 'workers' corpus..."
-./bin/staffing_workers -limit "$WORKERS_LIMIT"
-
-echo
-echo "[stress] ingest ethereal_workers (limit=$ETHEREAL_LIMIT, 0=all) into 'ethereal_workers' corpus..."
-./bin/staffing_workers \
-  -parquet "/home/profit/lakehouse/data/datasets/ethereal_workers.parquet" \
-  -index-name ethereal_workers \
-  -id-prefix "e-" \
-  -limit "$ETHEREAL_LIMIT"
-
-echo
-echo "[stress] running multi-coord stress driver..."
-EXTRA_FLAGS=""
-if [ "${WITH_PARAPHRASE_HANDOVER:-1}" = "1" ]; then
-  EXTRA_FLAGS="$EXTRA_FLAGS -with-paraphrase-handover"
-fi
-./bin/multi_coord_stress \
-  -gateway "http://127.0.0.1:3110" \
-  -contracts tests/reality/contracts \
-  -corpora "$CORPORA" \
-  -k "$K" \
-  -out "$OUT_JSON" \
-  -ollama  "http://localhost:11434" \
-  -judge   "${JUDGE_MODEL:-qwen2.5:latest}" \
-  $EXTRA_FLAGS
-
-echo
-echo "[stress] generating markdown report → $OUT_MD"
-
-# Render compact markdown from the JSON. Same shape as the lift harness
-# reports so reviewers can compare format.
-total=$(jq -r '.events | length' "$OUT_JSON")
-gen_at=$(jq -r '.generated_at' "$OUT_JSON")
-div_role=$(jq -r '.diversity.same_role_across_contracts_mean_jaccard' "$OUT_JSON")
-div_role_n=$(jq -r '.diversity.num_pairs_same_role_across_contracts' "$OUT_JSON")
-div_xrole=$(jq -r '.diversity.different_roles_same_contract_mean_jaccard' "$OUT_JSON")
-div_xrole_n=$(jq -r '.diversity.num_pairs_different_roles_same_contract' "$OUT_JSON")
-det_jacc=$(jq -r '.determinism.mean_jaccard' "$OUT_JSON")
-det_n=$(jq -r '.determinism.num_reissued_pairs' "$OUT_JSON")
-hand_run=$(jq -r '.learning.handover_queries_run' "$OUT_JSON")
-hand_top1=$(jq -r '.learning.recorded_answers_top1_count' "$OUT_JSON")
-hand_topk=$(jq -r '.learning.recorded_answers_topk_count' "$OUT_JSON")
-hand_rate=$(jq -r '.learning.handover_hit_rate' "$OUT_JSON")
-ph_run=$(jq -r '.learning.paraphrase_handover_run // 0' "$OUT_JSON")
-ph_top1=$(jq -r '.learning.paraphrase_top1_count // 0' "$OUT_JSON")
-ph_topk=$(jq -r '.learning.paraphrase_topk_count // 0' "$OUT_JSON")
-ph_rate=$(jq -r '.learning.paraphrase_handover_hit_rate // 0' "$OUT_JSON")
-
-cat > "$OUT_MD" <<MDEOF
-# Multi-Coordinator Stress Test — Run ${RUN_ID}
-
-**Generated:** ${gen_at}
-**Coordinators:** alice / bob / carol (each with own playbook namespace: \`playbook_alice\` / \`playbook_bob\` / \`playbook_carol\`)
-**Contracts:** $(jq -r '.contracts | join(" / ")' "$OUT_JSON")
-**Corpora:** \`${CORPORA}\`
-**K per query:** ${K}
-**Total events captured:** ${total}
-**Evidence:** \`${OUT_JSON}\`
-
----
-
-## Diversity — is the system locking into scenarios or cycling?
-
-| Metric | Mean Jaccard | n pairs | Interpretation |
-|---|---:|---:|---|
-| Same role across different contracts | ${div_role} | ${div_role_n} | Lower = more diverse (different region/cert mix → different workers) |
-| Different roles within same contract | ${div_xrole} | ${div_xrole_n} | Should be near-zero (different roles = different worker pools) |
-
-**Healthy ranges:**
-- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
-- Different roles same contract: < 0.10 means role-specific retrieval is working.
-- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
-
----
-
-## Determinism — same query reissued, top-K stability
-
-| Metric | Value |
-|---|---:|
-| Mean Jaccard on retrieval-only reissue | ${det_jacc} |
-| Number of reissue pairs | ${det_n} |
-
-**Interpretation:**
-- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
-- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
-- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
-
----
-
-## Learning — handover hit rate
-
-Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
-
-| Metric | Value |
-|---|---:|
-| Verbatim handover queries run | ${hand_run} |
-| Alice's recorded answer at Bob's top-1 (verbatim) | ${hand_top1} |
-| Alice's recorded answer in Bob's top-K (verbatim) | ${hand_topk} |
-| **Verbatim handover hit rate (top-1)** | **${hand_rate}** |
-| Paraphrase handover queries run | ${ph_run} |
-| Alice's recorded answer at Bob's top-1 (paraphrase) | ${ph_top1} |
-| Alice's recorded answer in Bob's top-K (paraphrase) | ${ph_topk} |
-| **Paraphrase handover hit rate (top-1)** | **${ph_rate}** |
-
-**Interpretation:**
-- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
-- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
-- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
-
----
-
-## Per-event capture
-
-All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
-
-\`\`\`bash
-jq '.events[] | select(.phase == "merge")' ${OUT_JSON}
-jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' ${OUT_JSON}
-jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' ${OUT_JSON}
-\`\`\`
-
----
-
-## What's NOT in this run (Phase 1 deliberately defers)
-
-- **48-hour clock.** Events fire as discrete steps, not on a timeline.
-- **Email / SMS ingest.** No endpoints exist on the Go side yet.
-- **New-resume injection mid-run.** The corpus is fixed at the start.
-- **Langfuse traces.** Need Go-side wiring.
-
-These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
-MDEOF
-
-echo
-echo "[stress] DONE"
-echo "[stress]   evidence:  $OUT_JSON"
-echo "[stress]   report:    $OUT_MD"
--- a/scripts/multi_coord_stress/main.go
+++ b/scripts/multi_coord_stress/main.go
--- a/scripts/playbook_lift.sh
+++ b/scripts/playbook_lift.sh
@@ -52,11 +52,6 @@ CONFIG_PATH="${CONFIG_PATH:-lakehouse.toml}"
 # actual learning-property test (does cosine on paraphrase find the
 # recorded entry?). Set WITH_PARAPHRASE=0 for a faster verbatim-only run.
 WITH_PARAPHRASE="${WITH_PARAPHRASE:-1}"
-# WITH_REJUDGE=1 (default) adds a Pass 4 — judge warm top-1 to measure
-# quality lift (warm rating vs cold rating). Catches cases where Shape B
-# surfaces a different-but-equally-good answer (which the rank-based
-# lift metric misses). +21 judge calls (~30s on qwen2.5).
-WITH_REJUDGE="${WITH_REJUDGE:-1}"

 OUT_JSON="reports/reality-tests/playbook_lift_${RUN_ID}.json"
 OUT_MD="reports/reality-tests/playbook_lift_${RUN_ID}.md"
@@ -161,7 +156,7 @@ refresh_every = "1s"
 [embedd]
 bind = "127.0.0.1:3216"
 provider_url  = "http://localhost:11434"
-default_model = "nomic-embed-text-v2-moe"
+default_model = "nomic-embed-text"

 [vectord]
 bind = "127.0.0.1:3215"
@@ -276,12 +271,9 @@ echo "[lift] running driver — judge=$EFFECTIVE_JUDGE · queries=$QUERIES_FILE
 # and runs its own resolution chain (env → config → fallback). When
 # JUDGE_MODEL IS set, the explicit -judge wins inside the Go driver
 # regardless of what its env-lookup would find — flag wins by design.
-EXTRA_FLAGS=""
+PARAPHRASE_FLAG=""
 if [ "$WITH_PARAPHRASE" = "1" ]; then
-  EXTRA_FLAGS="$EXTRA_FLAGS -with-paraphrase"
-fi
-if [ "$WITH_REJUDGE" = "1" ]; then
-  EXTRA_FLAGS="$EXTRA_FLAGS -with-rejudge"
+  PARAPHRASE_FLAG="-with-paraphrase"
 fi
 ./bin/playbook_lift \
  -config  "$CONFIG_PATH" \
@@ -292,7 +284,7 @@ fi
  -judge   "$JUDGE_MODEL" \
  -k       "$K" \
  -out     "$OUT_JSON" \
-  $EXTRA_FLAGS
+  $PARAPHRASE_FLAG

 echo
 echo "[lift] generating markdown report → $OUT_MD"
@@ -310,10 +302,6 @@ generate_md() {
  p_attempted=$(jq -r '.summary.paraphrase_attempted // 0' "$json")
  p_top1=$(jq -r '.summary.paraphrase_top1_lifts // 0' "$json")
  p_anyrank=$(jq -r '.summary.paraphrase_any_rank_hits // 0' "$json")
-  rj_attempted=$(jq -r '.summary.rejudge_attempted // 0' "$json")
-  q_lifted=$(jq -r '.summary.quality_lifted // 0' "$json")
-  q_neutral=$(jq -r '.summary.quality_neutral // 0' "$json")
-  q_regressed=$(jq -r '.summary.quality_regressed // 0' "$json")

  # Only emit the paraphrase block when --with-paraphrase actually ran
  # (i.e. .summary.paraphrase_attempted > 0). For verbatim-only runs we
@@ -324,13 +312,6 @@ generate_md() {
 | Paraphrase pass — recorded answer at any rank in top-K | ${p_anyrank} / ${p_attempted} |"
  fi

-  rj_block=""
-  if [ "$rj_attempted" != "0" ] && [ "$rj_attempted" != "null" ]; then
-    rj_block="| **Quality lift** (warm top-1 rating > cold top-1 rating) | **${q_lifted} / ${rj_attempted}** |
-| Quality neutral (warm top-1 rating = cold top-1 rating) | ${q_neutral} / ${rj_attempted} |
-| Quality regressed (warm top-1 rating < cold top-1 rating) | ${q_regressed} / ${rj_attempted} |"
-  fi
-
  cat > "$md" <<MDEOF
 # Playbook-Lift Reality Test — Run ${RUN_ID}

@@ -341,7 +322,6 @@ generate_md() {
 **Queries:** \`${QUERIES_FILE}\` (${total} executed)
 **K per pass:** ${K}
 **Paraphrase pass:** $([ "$WITH_PARAPHRASE" = "1" ] && echo "ENABLED" || echo "disabled")
-**Re-judge pass:** $([ "$WITH_REJUDGE" = "1" ] && echo "ENABLED" || echo "disabled")
 **Evidence:** \`${OUT_JSON}\`

 ---
@@ -357,7 +337,6 @@ generate_md() {
 | Playbook boosts triggered (warm pass) | ${boosted} |
 | Mean Δ top-1 distance (warm − cold) | ${mean_delta} |
 ${p_block}
-${rj_block}

 **Verbatim lift rate:** ${lift} of ${discovery} discoveries became top-1 after warm pass.

--- a/scripts/playbook_lift/main.go
+++ b/scripts/playbook_lift/main.go
@@ -75,19 +75,12 @@ type queryRun struct {
 	PlaybookRecorded bool   `json:"playbook_recorded"`
 	PlaybookID       string `json:"playbook_target_id,omitempty"`

-	WarmTop1ID       string          `json:"warm_top1_id"`
-	WarmTop1Distance float32         `json:"warm_top1_distance"`
-	WarmBoostedCount int             `json:"warm_boosted_count"`
-	WarmJudgeBestRank int            `json:"warm_judge_best_rank"` // rank of cold judge-best in warm — NOT the warm pass's own judge-best
-	WarmTop1Metadata json.RawMessage `json:"-"`                    // cached for Pass 4 rejudge; not emitted
+	WarmTop1ID       string  `json:"warm_top1_id"`
+	WarmTop1Distance float32 `json:"warm_top1_distance"`
+	WarmBoostedCount int     `json:"warm_boosted_count"`
+	WarmJudgeBestRank int    `json:"warm_judge_best_rank"`

-	// WarmTop1Rating: only populated when --with-rejudge. Compare to
-	// ColdRatings[0] (== cold top-1 rating) to measure quality lift.
-	// *int so absence (no rejudge pass) and a 0-rating verdict are
-	// distinguishable.
-	WarmTop1Rating *int `json:"warm_top1_rating,omitempty"`
-
-	Lift bool `json:"lift"` // judge-best was below top-1 cold, but top-1 warm
+	Lift bool   `json:"lift"`            // judge-best was below top-1 cold, but top-1 warm

 	// Paraphrase pass — only populated when --with-paraphrase. Tests
 	// the playbook's actual learning property: does a recorded entry
@@ -121,17 +114,6 @@ type summary struct {
 	ParaphraseTop1Lifts   int `json:"paraphrase_top1_lifts,omitempty"`  // recorded answer surfaced at rank 0
 	ParaphraseAnyRankHits int `json:"paraphrase_any_rank_hits,omitempty"` // recorded answer surfaced at any rank in top-K

-	// Re-judge pass aggregates — only populated when --with-rejudge.
-	// Measures QUALITY lift (warm top-1 rating vs cold top-1 rating)
-	// rather than rank-of-cold-judge-best lift. The latter conflates
-	// "warm surfaced a different but equally-good result" with "warm
-	// shuffled ranks but the answer was the same"; quality lift
-	// disambiguates them.
-	RejudgeAttempted   int `json:"rejudge_attempted,omitempty"`   // queries that ran the rejudge pass
-	QualityLifted      int `json:"quality_lifted,omitempty"`      // warm-top-1 rating > cold-top-1 rating
-	QualityNeutral     int `json:"quality_neutral,omitempty"`     // ratings equal (could be same or different item)
-	QualityRegressed   int `json:"quality_regressed,omitempty"`   // warm-top-1 rating < cold-top-1 rating
-
 	GeneratedAt time.Time `json:"generated_at"`
 }

@@ -146,7 +128,6 @@ func main() {
 	k := flag.Int("k", 10, "top-k from matrix.search per pass")
 	out := flag.String("out", "reports/reality-tests/playbook_lift_001.json", "output JSONL path")
 	withParaphrase := flag.Bool("with-paraphrase", false, "after warm pass, generate a paraphrase via the judge model and re-query with playbook=true to test the learning property")
-	withRejudge := flag.Bool("with-rejudge", false, "after warm pass, judge warm top-1 to measure QUALITY lift (vs cold top-1 rating), not just rank-of-cold-judge-best")
 	flag.Parse()

 	// Judge resolution priority: explicit flag > $JUDGE_MODEL env >
@@ -244,7 +225,6 @@ func main() {
 		}
 		runs[i].WarmTop1ID = resp.Results[0].ID
 		runs[i].WarmTop1Distance = resp.Results[0].Distance
-		runs[i].WarmTop1Metadata = resp.Results[0].Metadata // cache for Pass 4 rejudge
 		runs[i].WarmBoostedCount = resp.PlaybookBoosted
 		playbookBoostedTotal += resp.PlaybookBoosted

@@ -324,47 +304,6 @@ func main() {
 		}
 	}

-	// Pass 4 (warm-rejudge) — opt-in via --with-rejudge. Judge warm
-	// top-1 against the same prompt as cold ratings, then compare to
-	// cold top-1 rating. This measures QUALITY lift (did the playbook
-	// produce a better candidate?) rather than just rank-of-cold-judge-
-	// best lift (did the recorded answer move to top-1, even if cold's
-	// top-1 was already good?). See STATE_OF_PLAY OPEN — added because
-	// run #003's verbatim 2/6 didn't tell us whether Shape B was
-	// surfacing better OR same-quality alternatives.
-	rejudgeAttempted := 0
-	qualityLifted := 0
-	qualityNeutral := 0
-	qualityRegressed := 0
-	if *withRejudge {
-		log.Printf("[lift] warm-rejudge pass: measuring quality lift (warm top-1 rating vs cold top-1 rating)")
-		for i := range runs {
-			if runs[i].WarmTop1ID == "" || len(runs[i].WarmTop1Metadata) == 0 {
-				continue // warm pass didn't complete for this query
-			}
-			rejudgeAttempted++
-			result := matrixResult{
-				ID:       runs[i].WarmTop1ID,
-				Distance: runs[i].WarmTop1Distance,
-				Metadata: runs[i].WarmTop1Metadata,
-			}
-			warmRating := judgeRate(hc, *ollama, *judge, runs[i].Query, result)
-			runs[i].WarmTop1Rating = &warmRating
-			coldRating := 0
-			if len(runs[i].ColdRatings) > 0 {
-				coldRating = runs[i].ColdRatings[0]
-			}
-			switch {
-			case warmRating > coldRating:
-				qualityLifted++
-			case warmRating < coldRating:
-				qualityRegressed++
-			default:
-				qualityNeutral++
-			}
-		}
-	}
-
 	sum := summary{
 		Total:                 len(runs),
 		WithDiscovery:         withDiscovery,
@@ -375,10 +314,6 @@ func main() {
 		ParaphraseAttempted:   paraphraseAttempted,
 		ParaphraseTop1Lifts:   paraphraseTop1Lifts,
 		ParaphraseAnyRankHits: paraphraseAnyRankHits,
-		RejudgeAttempted:      rejudgeAttempted,
-		QualityLifted:         qualityLifted,
-		QualityNeutral:        qualityNeutral,
-		QualityRegressed:      qualityRegressed,
 		GeneratedAt:           time.Now().UTC(),
 	}
 	if len(runs) > 0 {
@@ -388,11 +323,11 @@ func main() {
 	if err := writeJSON(*out, runs, sum); err != nil {
 		log.Fatalf("write %s: %v", *out, err)
 	}
-	if *withParaphrase || *withRejudge {
-		log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f · paraphrase=%d/%d→top1 · quality=lifted%d/neutral%d/regressed%d",
+	if *withParaphrase {
+		log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f · paraphrase=%d/%d→top1, %d/%d→anyrank",
 			sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance,
 			sum.ParaphraseTop1Lifts, sum.ParaphraseAttempted,
-			sum.QualityLifted, sum.QualityNeutral, sum.QualityRegressed)
+			sum.ParaphraseAnyRankHits, sum.ParaphraseAttempted)
 	} else {
 		log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f",
 			sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance)
--- a/tests/reality/contracts/contract_alpha.json
+++ b/tests/reality/contracts/contract_alpha.json
@@ -1,12 +0,0 @@
-{
-  "name": "alpha_milwaukee_distribution",
-  "client": "Northstar Logistics",
-  "location": "Milwaukee, WI metro",
-  "shift": "day",
-  "demand": [
-    {"role": "warehouse worker", "count": 200, "skills": ["pallet jack", "inventory"], "certs": ["OSHA-30"]},
-    {"role": "admin assistant", "count": 3, "skills": ["scheduling", "data entry"], "certs": []},
-    {"role": "heavy equipment operator", "count": 2, "skills": ["forklift", "bobcat"], "certs": ["OSHA-30", "forklift cert"]},
-    {"role": "industrial electrician", "count": 1, "skills": ["high voltage", "PLC"], "certs": ["journeyman"], "in_roster": false}
-  ]
-}
--- a/tests/reality/contracts/contract_beta.json
+++ b/tests/reality/contracts/contract_beta.json
@@ -1,12 +0,0 @@
-{
-  "name": "beta_indianapolis_manufacturing",
-  "client": "Crossroads Manufacturing",
-  "location": "Indianapolis, IN metro",
-  "shift": "swing",
-  "demand": [
-    {"role": "warehouse worker", "count": 150, "skills": ["assembly", "machine operation"], "certs": ["OSHA-10"]},
-    {"role": "admin assistant", "count": 4, "skills": ["scheduling", "documentation", "spanish"], "certs": []},
-    {"role": "heavy equipment operator", "count": 3, "skills": ["forklift", "pallet jack", "cold storage"], "certs": ["OSHA-30", "forklift cert"]},
-    {"role": "bilingual safety coordinator", "count": 1, "skills": ["spanish", "english", "training"], "certs": ["OSHA trainer"], "in_roster": false}
-  ]
-}
--- a/tests/reality/contracts/contract_gamma.json
+++ b/tests/reality/contracts/contract_gamma.json
@@ -1,12 +0,0 @@
-{
-  "name": "gamma_chicago_construction",
-  "client": "Loop Construction Group",
-  "location": "Chicago, IL metro",
-  "shift": "early-day",
-  "demand": [
-    {"role": "warehouse worker", "count": 80, "skills": ["framing", "rigging", "concrete"], "certs": ["OSHA-10"]},
-    {"role": "admin assistant", "count": 1, "skills": ["scheduling", "blueprint reading"], "certs": []},
-    {"role": "heavy equipment operator", "count": 2, "skills": ["mobile crane", "rigging signals", "bobcat"], "certs": ["NCCCO crane cert"]},
-    {"role": "drone surveyor", "count": 1, "skills": ["UAV piloting", "GIS", "site mapping"], "certs": ["FAA Part 107"], "in_roster": false}
-  ]
-}