Compare commits


No commits in common. "f971e647456557e835e3ab6a97d372024573112a" and "87cbd10090aa83ca8aa626404dff5553e7323cb7" have entirely different histories.

27 changed files with 21 additions and 3263 deletions


@ -1,7 +1,7 @@
# STATE OF PLAY — Lakehouse-Go
**Last verified:** 2026-04-30 ~16:42 CDT
**Verified by:** live probes + `just verify` PASS + multi-coord stress run #011 (full 9-phase scenario, 67 captured events, 1 Langfuse trace + 111 child observations covering every phase + every external call), not memory.
**Last verified:** 2026-04-30 ~07:25 CDT
**Verified by:** live probes + `just verify` PASS + reality tests #001/#002/#003/#004 (v4 with split inject threshold: 6/8 verbatim lift + 6/8 paraphrase recovery + zero OOD cross-pollination), not memory.
> **Read this FIRST.** When the user says "we're working on lakehouse," default to the Go rewrite (this repo); the Rust legacy at `/home/profit/lakehouse/` is maintenance-only. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.
@ -114,34 +114,6 @@ Evidence: `reports/reality-tests/playbook_lift_{001,002,003}.{json,md}`. Per the
**v3 → v4 is the configuration evolution.** v3 ran with one threshold (0.5) for both boost and inject — paraphrase recovery hit 6/6 but a single strong recording (w-4435 for "OSHA-30 forklift Wisconsin") cross-pollinated to OOD queries (dental hygienist / RN / software engineer) because their text vectors fell within 0.5 cosine. v4 split: boost stays at 0.5 (safe — only re-ranks existing results), inject tightens to 0.20 (true paraphrase territory). Result: OOD cross-pollination eliminated, paraphrase recovery 6/8, verbatim lift 6/8. The two paraphrases that missed (Q9, Q15) were liberally rephrased — drift > 0.20 cosine.
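The split can be sketched as a small decision function. This is a minimal illustration, not the code in `internal/matrix/playbook.go`: the `playbookAction` helper, its return labels, the constant names, and the inclusive `<=` comparison are all assumptions. Only the two threshold values (0.5 for boost, 0.20 for inject) come from the paragraph above.

```go
package main

import "fmt"

// Threshold values from the v4 description; the names are illustrative.
const (
	maxBoostDistance  = 0.5  // boost only re-ranks hits already retrieved
	maxInjectDistance = 0.20 // inject forces a recorded answer into the results
)

// playbookAction is a hypothetical helper. Given the cosine distance between
// the incoming query and a recorded playbook query, and whether the recorded
// answer is already in the retrieved top-K, it reports what the split policy
// would allow.
func playbookAction(distance float64, alreadyRetrieved bool) string {
	switch {
	case alreadyRetrieved && distance <= maxBoostDistance:
		return "boost"
	case !alreadyRetrieved && distance <= maxInjectDistance:
		return "inject"
	default:
		return "skip"
	}
}

func main() {
	fmt.Println(playbookAction(0.35, true))  // already in top-K, within the boost threshold
	fmt.Println(playbookAction(0.35, false)) // not in top-K, too far to inject
	fmt.Println(playbookAction(0.12, false)) // not in top-K, close enough to inject
}
```

Under these assumptions, a recording at distance 0.35 can still re-rank a worker that retrieval already found, but it can no longer force that worker into an unrelated query's results, which is the cross-pollination case v3 exhibited.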
### Multi-coordinator stress test (Phase 1 → 3) — VERIFIED end-to-end
Reality test #2 catalog. New harness `scripts/multi_coord_stress.{sh,go}` simulates 3 coordinators (alice/bob/carol) handling 3 distinct contracts (Milwaukee distribution / Indianapolis manufacturing / Chicago construction), each with their own playbook namespace. 9-phase operational narrative across simulated Hours 0-48: baseline → fresh-resume injection → inbox burst → mid-day surge → 200-worker swap → contract merge → handover (verbatim + paraphrase) → split → reissue.
| Capability | Verified | Where |
|---|---|---|
| Per-coordinator playbook isolation | ✓ | `playbook_alice` / `playbook_bob` / `playbook_carol` corpora |
| Same-role-across-contracts diversity | Jaccard 0.026 (n=9) — 97% workers differ per region | Phase 1 baseline |
| Different-roles-same-contract diversity | Jaccard 0.070 (n=18) — 93% differ per role | Phase 1 baseline |
| HNSW retrieval determinism | Jaccard 1.000 (n=12) | Phase 6 reissue |
| Verbatim handover (Bob runs Alice's queries with Alice's playbook) | 4/4 | Phase 4 |
| Paraphrase handover (Bob runs qwen2.5-paraphrased queries) | 4/4 | Phase 4b |
| 200-worker swap with `ExcludeIDs` | Jaccard 0.000 — 8/8 placed workers fully replaced | Phase 2b |
| Fresh-resume injection (two-tier `fresh_workers` index) | 3/3 fresh workers at top-1 | Phase 1b |
| Inbox endpoint `/v1/observer/inbox` (email + SMS, priority weighting) | 6/6 events recorded | Phase 1c |
| LLM demand parsing (qwen2.5 format=json on inbox bodies) | 6/6 parsed cleanly into structured `{role, count, location, certs, skills, shift}` | Phase 1c |
| Judge re-rates inbox top-1 against ORIGINAL body | catches tight-distance-but-wrong (Q3 crane case: dist 0.23 → rating 1) | Phase 1c |
| Langfuse Go-side tracing | 111 observations on a single run trace, browseable at http://localhost:3001 | Run #011 |
**Substrate gains added by this wave:**
- `internal/matrix/playbook.go` Shape B + split inject threshold (commit `67d1957` from earlier wave; verified in multi-coord too)
- `internal/matrix/retrieve.go` `ExcludeIDs` field on `SearchRequest` — filters worker IDs at retrieve, boost, AND inject (so excluded answers can't sneak back via playbook). Real-world driver: coordinator placed N workers, client asks for replacements.
- `internal/observer/types.go` `SourceInbox` taxonomy alongside `SourceMCP / SourceScenario / SourceWorkflow`
- `cmd/observerd` `POST /observer/inbox` route — accepts `{type, sender, subject, body, priority, tag}` and records as `ObservedOp`. Type must be `email` or `sms`; body required; priority defaults to medium.
- `internal/langfuse/client.go` — minimal Go-side Trace+Span client, best-effort posture (logs on error, never blocks calling path; same fail-open semantics as ADR-005 Decision 5.1).
- `embedd default_model` bumped from `nomic-embed-text` (137M) → `nomic-embed-text-v2-moe` (475M, MoE, drop-in 768d). Same-role-across-contracts diversity went from 0.080 → 0.000 with the upgrade.
- Two-tier index pattern: fresh content goes to `fresh_workers` (a small "hot" corpus); main queries include it in the corpora list. Solves the HNSW post-build add recall issue (incremental adds to a populated graph land in poorly-connected regions and disappear from search). Canonical NRT pattern; Lucene works the same way.
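The two-tier pattern in the last bullet amounts to querying both corpora and merging by distance. The sketch below is a hedged illustration only: the `hit` type, the `mergeTopK` helper, and the sample IDs and distances are hypothetical and are not taken from `internal/matrix/retrieve.go`.

```go
package main

import (
	"fmt"
	"sort"
)

// hit is a hypothetical search result: a worker ID and its distance to the query.
type hit struct {
	ID       string
	Distance float64
}

// mergeTopK concatenates per-corpus result lists, sorts them by ascending
// distance, and keeps at most k entries.
func mergeTopK(k int, perCorpus ...[]hit) []hit {
	var all []hit
	for _, hits := range perCorpus {
		all = append(all, hits...)
	}
	sort.SliceStable(all, func(i, j int) bool {
		return all[i].Distance < all[j].Distance
	})
	if len(all) > k {
		all = all[:k]
	}
	return all
}

func main() {
	// Results from the large main index and the small hot index (made-up values).
	mainHits := []hit{{ID: "w-1001", Distance: 0.31}, {ID: "w-1002", Distance: 0.42}}
	freshHits := []hit{{ID: "fresh-1", Distance: 0.18}}

	for _, h := range mergeTopK(2, mainHits, freshHits) {
		fmt.Println(h.ID, h.Distance)
	}
}
```

Because the fresh corpus is searched as its own small graph, a newly added worker competes on distance alone instead of depending on how well it was linked into the large graph.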
### Harness expansion (2026-04-30 ~05:30 CDT)
`scripts/playbook_lift.sh` rewritten from a 5-daemon stripped harness to the **full 10-daemon prod-realistic stack** (chatd stays up independently). The 5-daemon version was structurally hiding bugs; expanding the daemon set surfaced 7 distinct fixes:
@ -199,16 +171,10 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
- The matrix indexer **5/5 components** are shipped. Don't propose to "build the matrix indexer" — it's done.
- The 5-loop substrate's load-bearing gate is **PASSED**. v3 (`154a72e`) showed 6/6 paraphrase recovery via Shape B. Don't propose to "test if the playbook learns" — it does.
- **Shape B is the playbook stance now.** When `use_playbook=true`, both `ApplyPlaybookBoost` (re-rank in place) AND `InjectPlaybookMisses` (insert recorded answers not in regular top-K) run. Don't revert to boost-only; v2 proved that path can't recover paraphrase queries.
- **Boost / inject use SEPARATE thresholds.** Boost stays at `DefaultPlaybookMaxDistance = 0.5` (safe — only re-ranks results already in retrieval). Inject uses tighter `DefaultPlaybookMaxInjectDistance = 0.20` (Shape B forces results in, so loose match cross-pollinates wrong-domain). Don't merge them.
- **Multi-coord product theory is empirically VALIDATED** by run #011 (Phase 3). Per-coordinator playbook namespaces (`playbook_alice` etc.) with cross-coordinator handover (Bob takes Alice's contract using `playbook_corpus=playbook_alice`) work end-to-end including paraphrase handover. Don't propose to "test if multi-coord works" — it does.
- **Fresh content uses two-tier indexing.** Fresh resumes go to `fresh_workers` corpus, not the main `workers` index. coder/hnsw incremental adds to a populated 5K+ graph have recall issues; the small hot index has no such crowding. Periodic merge (post-G3) consolidates fresh→main. Canonical NRT pattern.
- **`embedd.default_model = "nomic-embed-text-v2-moe"`** (475M MoE, 768d). Don't revert to `nomic-embed-text` (137M) "for speed": same-role-across-contracts Jaccard overlap rose from 0.000 to 0.080 on the smaller model, i.e. results were less diverse. Cost (~5× slower ingest) is acceptable for once-per-deploy work.
- **Inbox flow: parse + search + judge + trace.** `/v1/observer/inbox` records the body; coordinator/driver parses via LLM (qwen2.5 format=json), runs `matrix.search` on the parsed query, judges top-1 against the ORIGINAL body, emits Langfuse spans through it all. Don't replace the judge re-rate with a distance-only gate — tight distance + low rating is the load-bearing honesty signal.
- `local_judge = "qwen2.5:latest"` for the lift loop — qwen3.5:latest is a vision-SSM 256K-ctx build that ran 30s/judge. Don't bump to qwen3.5 again "for cleanliness." Different model in the same tier has different cost geometry.
- `qwen3.5:latest` IS available locally on this box. Opus's hub-only knowledge is a known-stale signal; the chatd_smoke uses it daily.
- `temperature` is **omitted** for Anthropic 4.7 (handled by `Request.Temperature *float64`); don't re-add it.
- chatd-smoke runs with **all cloud providers disabled** intentionally so the suite doesn't depend on API keys; that's why it can't catch B-3-class bugs (those need a fake-server fixture, see Sprint 0 follow-up).
- **Langfuse Go-side client lives at `internal/langfuse/`** with best-effort fail-open posture. URL+creds from `/etc/lakehouse/langfuse.env`. Don't propose to "wire Langfuse on Go side" — it's wired; multi_coord_stress is the proof.
---
@ -216,11 +182,9 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
| Item | What | When to act |
|---|---|---|
| **Real-time 48-hour clock** | Multi-coord stress fires phases as discrete steps with simulated-hour labels (0/6/12/18/24/30/36/42/48). A real-time variant would space events on actual wall-clock with `time.Sleep`, simulating the rhythm of a coordinator workday. Cosmetic; doesn't change product behavior — but lets the harness mimic the cadence at which inbox events would arrive in production. ~30 min. | When stress test starts being used to capture realistic per-hour throughput numbers. |
| **Wider Langfuse instrumentation across daemons** | multi_coord_stress traces every external call, but the daemons themselves (matrixd, observerd, chatd) don't yet emit traces from their own request handlers. Adding `internal/langfuse/middleware.go` would auto-emit a span per HTTP request, giving production-traffic visibility for free. | When production traffic starts hitting the Go gateway. |
| **Periodic fresh→main index merge** | Two-tier `fresh_workers` pattern is verified working but no scheduled job consolidates fresh→main. Fresh corpus grows monotonically; eventually has its own recall issues. A daily cron that ingests the fresh corpus' contents into the main `workers` index + drops fresh would solve it. | When fresh_workers grows past ~500 items. |
| **Adjacent-query cross-pollination (lift suite Q6↔Q7)** | After lift v4's split threshold, OOD cross-pollination is gone but Q6 / Q7 still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Multi-coord run #008 inbox-judge re-rating proved the judge can distinguish — gating injection on "judge approves before injecting" closes this. ~1 hr. | When playbook injection quality matters more than retrieval throughput. |
| **Liberal-paraphrase recovery loss (lift suite Q9, Q15)** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt or a per-pair `paraphrase_max_drift` measurement. | When real coordinator queries are available for a calibration run. |
| **Reality test v4: re-judge warm results** | Current lift metric counts "warm top-1 == cold judge best." Shape B cross-pollinates recorded answers across queries (run #003 dropped verbatim-lift 7→2 because w-4435 surfaced as warm top-1 for several queries whose cold-judge-best was a different worker). True quality lift would re-judge warm results and measure judge agreement. ~30 min — adds a 4th pass to the driver. | Next time we want to measure Shape B's actual quality, not rank-of-original-judge-best. |
| **Adjacent-query cross-pollination** | After v4's split threshold, OOD cross-pollination is gone but Q6 ("Forklift-certified loader") ↔ Q7 ("Hazmat-certified warehouse worker, cold storage") still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Likely correct (genuinely similar staffing queries), but worth verifying with the v4 re-judge metric — if the judge agrees both directions are good matches, accept; if not, tighten further (e.g. 0.15) or add a same-query-only mode. | Co-decision with v4 re-judge. |
| **Liberal-paraphrase recovery loss** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt (more conservative wording variation) or a per-pair `paraphrase_max_drift` measurement. Cosmetic vs. real depends on whether realistic coordinator queries drift like qwen2.5's rephrases do. | When real coordinator queries are available for a calibration run. |
| **Sprint 4 — deployment** | No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/<bin>.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. |
| **ADR-006 — auth posture for non-loopback deploy** | Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr. | Required before any Go binary binds non-loopback in prod. |
| **chatd fixture-mode storage half** | `g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. |
@ -249,17 +213,6 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
| `154a72e` | matrix: Shape B (`InjectPlaybookMisses`) — 6/6 paraphrase recovery in run #003 |
| `94fc3b6` | STATE_OF_PLAY: capture v3 reality test + Shape B + cross-pollination |
| `67d1957` | matrix: split boost/inject thresholds (0.5 / 0.20) — kills cross-pollination, run #004 6/8+6/8 |
| `b13b5cd` | playbook_lift v4 metric: warm-top-1 re-judge → quality lift +24%/-14% (5 lifted / 13 neutral / 3 regressed) |
| `61c7b55` | multi-coord stress harness (Phase 1) — 3 coords / 3 contracts / 7-phase scenario |
| `0fa42a0` | multi-coord stress Phase 1.5 — shared-role contracts + paraphrase handover |
| `84a32f0` | multi-coord stress Phase 2 — `ExcludeIDs`, 200-worker swap, fresh-resume |
| `4da32ad` | embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in) |
| `e7fc63b` | observerd `/observer/inbox` + multi-coord stress phase 1c (priority-ordered events) |
| `186d209` | multi_coord_stress: LLM-parsed inbox demands (qwen2.5 format=json) |
| `ce940f4` | multi_coord_stress: judge re-rates inbox top-1 against original body — recovers OOD honesty signal |
| `7e6431e` | langfuse: Go-side client + Phase 1c instrumentation |
| `08a0867` | multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1 (3/3) |
| `5d49967` | multi_coord_stress: full Langfuse coverage — every phase + every call (111 observations) |
Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).


@ -93,62 +93,6 @@ func (h *handlers) register(r chi.Router) {
r.Post("/observer/event", h.handleEvent)
r.Post("/observer/workflow/run", h.handleWorkflowRun)
r.Get("/observer/workflow/modes", h.handleWorkflowModes)
r.Post("/observer/inbox", h.handleInbox)
}
// inboxMessage is the POST /observer/inbox body — an incoming
// real-world signal (email or SMS) that a coordinator would receive
// and act on. The handler only RECORDS it as an ObservedOp; whether
// to trigger a downstream matrix.search or workflow is the caller's
// concern. Keeps observer's witness role pure.
type inboxMessage struct {
Type string `json:"type"` // "email" | "sms"
Sender string `json:"sender"`
Subject string `json:"subject,omitempty"`
Body string `json:"body"`
Priority string `json:"priority"` // "urgent" | "high" | "medium" | "low"
Tag string `json:"tag,omitempty"`
}
func (h *handlers) handleInbox(w http.ResponseWriter, r *http.Request) {
var msg inboxMessage
if !decodeJSON(w, r, &msg) {
return
}
if msg.Type != "email" && msg.Type != "sms" {
http.Error(w, "type must be 'email' or 'sms'", http.StatusBadRequest)
return
}
if strings.TrimSpace(msg.Body) == "" {
http.Error(w, "body required", http.StatusBadRequest)
return
}
if msg.Priority == "" {
msg.Priority = "medium"
}
op := observer.ObservedOp{
Endpoint: "/observer/inbox/" + msg.Type,
InputSummary: fmt.Sprintf("from=%s priority=%s tag=%s subject=%s", msg.Sender, msg.Priority, msg.Tag, msg.Subject),
OutputSummary: msg.Body,
Source: observer.SourceInbox,
Success: true,
}
if err := h.store.Record(op); err != nil {
if errors.Is(err, observer.ErrInvalidOp) {
http.Error(w, err.Error(), http.StatusBadRequest)
return
}
slog.Error("observer record inbox", "err", err)
http.Error(w, "internal", http.StatusInternalServerError)
return
}
stats := h.store.Stats()
writeJSON(w, http.StatusOK, map[string]any{
"accepted": true,
"type": msg.Type,
"priority": msg.Priority,
"ring_size": stats.Total,
})
}
func (h *handlers) handleStats(w http.ResponseWriter, _ *http.Request) {


@ -4,7 +4,6 @@ import (
"bytes"
"net/http"
"net/http/httptest"
"strings"
"testing"
"time"
@ -39,7 +38,6 @@ func TestRoutesMounted(t *testing.T) {
"POST /observer/event": false,
"POST /observer/workflow/run": false,
"GET /observer/workflow/modes": false,
"POST /observer/inbox": false,
}
_ = chi.Walk(r, func(method, route string, _ http.Handler, _ ...func(http.Handler) http.Handler) error {
key := method + " " + route
@ -167,51 +165,6 @@ func TestWorkflowRun_AllProvenanceRecordedPostRun(t *testing.T) {
}
}
// TestInbox_AcceptsValidEmail locks the happy-path contract for the
// /observer/inbox route — accepts an email message with required
// fields, records as ObservedOp, returns 200 with ring-size.
func TestInbox_AcceptsValidEmail(t *testing.T) {
r := newTestRouter(t)
body := []byte(`{"type":"email","sender":"client@northstar.com","subject":"URGENT: 50 forklift ops","body":"Need 50 forklift operators in Cleveland OH for next week. Day shift.","priority":"urgent","tag":"alpha-surge"}`)
req := httptest.NewRequest("POST", "/observer/inbox", bytes.NewReader(body))
req.Header.Set("Content-Type", "application/json")
w := httptest.NewRecorder()
r.ServeHTTP(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected 200, got %d (body=%s)", w.Code, w.Body.String())
}
if !strings.Contains(w.Body.String(), `"accepted":true`) {
t.Errorf("expected accepted=true, got %s", w.Body.String())
}
}
// TestInbox_RejectsBadType locks the validation: type must be
// "email" or "sms", anything else is 400.
func TestInbox_RejectsBadType(t *testing.T) {
r := newTestRouter(t)
body := []byte(`{"type":"smoke-signal","sender":"x","body":"y","priority":"high"}`)
req := httptest.NewRequest("POST", "/observer/inbox", bytes.NewReader(body))
req.Header.Set("Content-Type", "application/json")
w := httptest.NewRecorder()
r.ServeHTTP(w, req)
if w.Code != http.StatusBadRequest {
t.Errorf("expected 400 on bad type, got %d", w.Code)
}
}
// TestInbox_RejectsEmptyBody locks the body-required invariant.
func TestInbox_RejectsEmptyBody(t *testing.T) {
r := newTestRouter(t)
body := []byte(`{"type":"email","sender":"x","body":"","priority":"high"}`)
req := httptest.NewRequest("POST", "/observer/inbox", bytes.NewReader(body))
req.Header.Set("Content-Type", "application/json")
w := httptest.NewRecorder()
r.ServeHTTP(w, req)
if w.Code != http.StatusBadRequest {
t.Errorf("expected 400 on empty body, got %d", w.Code)
}
}
// TestWorkflowRun_UnknownMode locks the 400 path on workflow definitions
// that reference modes not registered with the runner. The harness's
// reality test runs depend on this so an unknown-mode misconfiguration


@ -1,217 +0,0 @@
// Package langfuse is a minimal Go-side client for the Langfuse v2
// ingestion API. Mirrors the surface area we need from the Rust
// crates/gateway/src/v1/langfuse_trace.rs emitter — Trace + Span,
// nothing else yet (no scores, no observations, no datasets).
//
// Auth is Basic over public_key:secret_key. URL + creds come from
// /etc/lakehouse/langfuse.env in production; tests can pass any URL.
//
// Best-effort transport: errors are logged but don't fail the calling
// path. Lakehouse's internal services should never go down because
// Langfuse is unreachable.
package langfuse
import (
"bytes"
"context"
"crypto/rand"
"encoding/base64"
"encoding/hex"
"encoding/json"
"fmt"
"log/slog"
"net/http"
"sync"
"time"
)
// Client posts traces + spans to Langfuse's ingestion endpoint.
// Events are buffered and flushed in batches. Always call Flush
// before exit; Close also flushes.
type Client struct {
url string
auth string // pre-encoded "Basic ..."
hc *http.Client
mu sync.Mutex
pending []event
maxBatch int
}
// New constructs a Client. URL like "http://localhost:3001"; creds
// from langfuse.env. nil hc → uses default with 5s timeout.
func New(url, publicKey, secretKey string, hc *http.Client) *Client {
if hc == nil {
hc = &http.Client{Timeout: 5 * time.Second}
}
auth := "Basic " + base64.StdEncoding.EncodeToString([]byte(publicKey+":"+secretKey))
return &Client{
url: url,
auth: auth,
hc: hc,
maxBatch: 50,
}
}
// NewID returns a hex string suitable as a trace/span id. Langfuse
// accepts arbitrary strings; a 16-byte random hex is unambiguous.
func NewID() string {
b := make([]byte, 16)
_, _ = rand.Read(b)
return hex.EncodeToString(b)
}
// event is one Langfuse ingestion envelope. Body shape varies by
// type (trace-create vs span-create); we use map[string]any to
// keep the wire shape declarative.
type event struct {
ID string `json:"id"`
Type string `json:"type"` // "trace-create" | "span-create"
Timestamp string `json:"timestamp"`
Body map[string]any `json:"body"`
}
// TraceInput is what callers fill in when starting a trace.
type TraceInput struct {
Name string
UserID string
Input any
Metadata map[string]any
Tags []string
}
// Trace records a top-level trace. Returns the trace id so callers
// can attach spans. Best-effort: errors are logged and the trace
// id is still returned so callers don't need error-handling for the
// common case.
func (c *Client) Trace(ctx context.Context, t TraceInput) string {
id := NewID()
body := map[string]any{
"id": id,
"name": t.Name,
}
if t.UserID != "" {
body["userId"] = t.UserID
}
if t.Input != nil {
body["input"] = t.Input
}
if t.Metadata != nil {
body["metadata"] = t.Metadata
}
if len(t.Tags) > 0 {
body["tags"] = t.Tags
}
c.queue(event{
ID: NewID(),
Type: "trace-create",
Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
Body: body,
})
return id
}
// SpanInput is what callers fill in when recording a span.
type SpanInput struct {
TraceID string
ParentID string // optional — for nested spans
Name string
Input any
Output any
Metadata map[string]any
StartTime time.Time
EndTime time.Time
StatusCode int // 0 = success, anything else = error code
Level string // "DEBUG" | "DEFAULT" | "WARNING" | "ERROR"
}
// Span records one span attached to a trace. Returns the span id.
func (c *Client) Span(ctx context.Context, s SpanInput) string {
id := NewID()
body := map[string]any{
"id": id,
"traceId": s.TraceID,
"name": s.Name,
}
if s.ParentID != "" {
body["parentObservationId"] = s.ParentID
}
if s.Input != nil {
body["input"] = s.Input
}
if s.Output != nil {
body["output"] = s.Output
}
if s.Metadata != nil {
body["metadata"] = s.Metadata
}
if !s.StartTime.IsZero() {
body["startTime"] = s.StartTime.UTC().Format(time.RFC3339Nano)
}
if !s.EndTime.IsZero() {
body["endTime"] = s.EndTime.UTC().Format(time.RFC3339Nano)
}
if s.Level != "" {
body["level"] = s.Level
}
if s.StatusCode != 0 {
body["statusMessage"] = fmt.Sprintf("status_code=%d", s.StatusCode)
}
c.queue(event{
ID: NewID(),
Type: "span-create",
Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
Body: body,
})
return id
}
func (c *Client) queue(e event) {
c.mu.Lock()
c.pending = append(c.pending, e)
shouldFlush := len(c.pending) >= c.maxBatch
c.mu.Unlock()
if shouldFlush {
_ = c.Flush(context.Background())
}
}
// Flush sends all queued events in one batch. Best-effort: returns
// the error but also logs; callers can ignore.
func (c *Client) Flush(ctx context.Context) error {
c.mu.Lock()
if len(c.pending) == 0 {
c.mu.Unlock()
return nil
}
batch := c.pending
c.pending = nil
c.mu.Unlock()
body, err := json.Marshal(map[string]any{"batch": batch})
if err != nil {
slog.Warn("langfuse: marshal batch", "err", err, "n", len(batch))
return err
}
req, err := http.NewRequestWithContext(ctx, "POST", c.url+"/api/public/ingestion", bytes.NewReader(body))
if err != nil {
return err
}
req.Header.Set("Authorization", c.auth)
req.Header.Set("Content-Type", "application/json")
resp, err := c.hc.Do(req)
if err != nil {
slog.Warn("langfuse: post", "err", err, "n", len(batch))
return err
}
defer resp.Body.Close()
if resp.StatusCode/100 != 2 && resp.StatusCode != 207 {
slog.Warn("langfuse: non-2xx", "status", resp.StatusCode, "n", len(batch))
return fmt.Errorf("langfuse ingestion: HTTP %d", resp.StatusCode)
}
return nil
}
// Close flushes any remaining events. Idempotent.
func (c *Client) Close() error {
return c.Flush(context.Background())
}


@ -85,15 +85,6 @@ type SearchRequest struct {
PlaybookMaxDistance float64 `json:"playbook_max_distance,omitempty"`
PlaybookMaxInjectDistance float64 `json:"playbook_max_inject_distance,omitempty"`
MetadataFilter map[string]any `json:"metadata_filter,omitempty"`
// ExcludeIDs filters out specific worker IDs post-retrieval.
// Real-world driver: a coordinator places 200 workers at a
// contract, then mid-day the client asks for a different set —
// the next query should NOT return the already-placed workers.
// Filter runs after merge but before metadata filter, so an
// excluded ID never wastes a slot in the post-filter top-K.
// Also applies to playbook boost + Shape B inject — excluded
// answers are skipped at injection time.
ExcludeIDs []string `json:"exclude_ids,omitempty"`
}
// SearchResponse wraps the merged results plus per-corpus return
@ -213,25 +204,6 @@ func (r *Retriever) Search(ctx context.Context, req SearchRequest) (*SearchRespo
return allHits[i].Distance < allHits[j].Distance
})
// ExcludeIDs filter — applied first so excluded IDs don't waste
// a slot in the post-filter top-K. Real-world driver: coordinator
// has placed N workers at a contract; mid-day the client asks for
// alternatives, so this query passes ExcludeIDs=<placed_ids> and
// gets back fresh candidates instead of the same N.
if len(req.ExcludeIDs) > 0 {
excludeSet := make(map[string]bool, len(req.ExcludeIDs))
for _, id := range req.ExcludeIDs {
excludeSet[id] = true
}
kept := make([]Result, 0, len(allHits))
for _, h := range allHits {
if !excludeSet[h.ID] {
kept = append(kept, h)
}
}
allHits = kept
}
// Metadata filter (component B — staffing-side structured gate).
// Applied BEFORE top-K truncation so the filter doesn't accidentally
// reduce coverage further. Caller can request larger PerCorpusK to
@ -267,23 +239,6 @@ func (r *Retriever) Search(ctx context.Context, req SearchRequest) (*SearchRespo
if err != nil {
slog.Warn("matrix: playbook lookup failed; skipping boost+inject", "err", err)
} else if len(hits) > 0 {
// Filter playbook hits to honor ExcludeIDs — without this,
// an excluded answer in a playbook recording would re-enter
// the result set via Shape B inject, defeating the swap
// semantics that the exclude list exists to enforce.
if len(req.ExcludeIDs) > 0 {
excludeSet := make(map[string]bool, len(req.ExcludeIDs))
for _, id := range req.ExcludeIDs {
excludeSet[id] = true
}
keptHits := make([]PlaybookHit, 0, len(hits))
for _, h := range hits {
if !excludeSet[h.Entry.AnswerID] {
keptHits = append(keptHits, h)
}
}
hits = keptHits
}
resp.PlaybookBoosted = ApplyPlaybookBoost(resp.Results, hits)
maxInjectDist := float32(req.PlaybookMaxInjectDistance)
if maxInjectDist <= 0 {


@ -41,12 +41,6 @@ const (
// the workflow handler was casting a string literal to Source,
// which worked coincidentally but left the taxonomy implicit.
SourceWorkflow Source = "workflow"
// SourceInbox tags ObservedOps emitted by /observer/inbox — incoming
// real-world signals (email, SMS) that a coordinator would receive
// and act on. The handler only RECORDS the message; downstream
// triggers (e.g. matrix.search on the parsed demand) are the
// caller's concern, recorded separately.
SourceInbox Source = "inbox"
)
// ObservedOp is one entry in the observer's ring buffer (and JSONL


@ -43,7 +43,7 @@ bind = "127.0.0.1:3216"
# G2: Ollama local. G3+ may swap in OpenAI/Voyage by changing
# this URL + the wire format inside the provider.
provider_url = "http://localhost:11434"
default_model = "nomic-embed-text-v2-moe"
default_model = "nomic-embed-text"
[queryd]
bind = "127.0.0.1:3214"
@ -129,7 +129,7 @@ level = "info"
[models]
# Tier 1 — local hot path
local_fast = "qwen3.5:latest"
local_embed = "nomic-embed-text-v2-moe" # 475M MoE, drop-in upgrade from 137M v1 — verified 2026-04-30 same 768-dim
local_embed = "nomic-embed-text"
# local_judge stays on qwen2.5:latest — qwen3.5:latest is a vision-SSM
# build with 256K context that runs ~30s per judge call against the
# playbook_lift loop (verified 2026-04-30). qwen2.5:latest at ~1s/call


@ -1,77 +0,0 @@
# Multi-Coordinator Stress Test — Run 001
**Generated:** 2026-04-30T12:54:09.621556469Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 52
**Evidence:** `reports/reality-tests/multi_coord_stress_001.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 0 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.003703703703703704 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
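For reference, the Jaccard figures in the table above compare two top-K lists as sets of worker IDs. A minimal sketch of that computation follows. The `jaccard` function is an illustration rather than the harness's own code, and returning 0 for two empty lists is an assumed convention.

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two lists of worker IDs, treating
// each list as a set. Two empty lists yield 0 (an assumed convention).
func jaccard(a, b []string) float64 {
	setA := make(map[string]bool, len(a))
	for _, id := range a {
		setA[id] = true
	}
	setB := make(map[string]bool, len(b))
	for _, id := range b {
		setB[id] = true
	}
	inter := 0
	for id := range setB {
		if setA[id] {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	// Hypothetical top-3 lists for the same role on two contracts.
	alpha := []string{"w-1", "w-2", "w-3"}
	beta := []string{"w-2", "w-3", "w-4"}
	fmt.Println(jaccard(alpha, beta)) // 2 shared of 4 distinct IDs
}
```

With K = 8, two lists that share exactly one worker score 1/15 ≈ 0.067. The reported mean of 0.0037 over 18 pairs is what one would get if a single pair shared one worker and the rest shared none (0.067 / 18), though the report does not break the mean down.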
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80-0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 | 4 |
| Alice's recorded answer in Bob's top-K | 4 |
| **Handover hit rate (top-1)** | **1** |
**Interpretation:**
- Hit rate ≥ 0.5: handover is meaningful — the second coordinator inherits the first's institutional memory.
- Hit rate ≈ 0.0: playbook namespace isolation is working but the playbook itself isn't transferable, OR Bob's queries don't match Alice's recordings closely enough.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_001.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_001.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_001.json
```
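The same filter can be done from Go when the events feed into further analysis. A minimal sketch: the field names (`events`, `phase`, `coordinator`, `role`, `contract`, `top_k`, `id`) come from the jq queries above, but the struct layout, the `eventsInPhase` helper, and the inline sample are assumptions, and the real JSON may carry more fields:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hit holds only the ID of a top-K entry; other fields are ignored.
type hit struct {
	ID string `json:"id"`
}

// event mirrors the fields the jq queries select on.
type event struct {
	Phase       string `json:"phase"`
	Coordinator string `json:"coordinator"`
	Role        string `json:"role"`
	Contract    string `json:"contract"`
	TopK        []hit  `json:"top_k"`
}

type report struct {
	Events []event `json:"events"`
}

// eventsInPhase is the Go counterpart of
// jq '.events[] | select(.phase == "<phase>")'.
func eventsInPhase(raw []byte, phase string) ([]event, error) {
	var r report
	if err := json.Unmarshal(raw, &r); err != nil {
		return nil, err
	}
	var out []event
	for _, e := range r.Events {
		if e.Phase == phase {
			out = append(out, e)
		}
	}
	return out, nil
}

// sample is invented data in the assumed shape, not taken from a real run.
const sample = `{"events":[
 {"phase":"baseline","coordinator":"alice","role":"warehouse worker","contract":"alpha_milwaukee_distribution","top_k":[{"id":"w-1"}]},
 {"phase":"merge","coordinator":"bob","role":"warehouse worker","contract":"beta_indianapolis_manufacturing","top_k":[{"id":"w-2"},{"id":"w-3"}]}
]}`

func main() {
	evs, err := eventsInPhase([]byte(sample), "merge")
	if err != nil {
		panic(err)
	}
	for _, e := range evs {
		fmt.Println(e.Coordinator, e.Contract, len(e.TopK))
	}
}
```

Against a real run, the `sample` bytes would be replaced with the contents of the report file read via `os.ReadFile`.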
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.

@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 002
**Generated:** 2026-04-30T13:02:13.570393819Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 56
**Evidence:** `reports/reality-tests/multi_coord_stress_002.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.11900691900691901 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.003703703703703704 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
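The healthy ranges can be applied mechanically to the two means in the table. A sketch, assuming a single helper with a per-metric "healthy below" bound; `diversityHealth` is a hypothetical name, and "borderline" is a label chosen here for the gap between the healthy bound and 0.50, which the report leaves unnamed:

```go
package main

import "fmt"

// diversityHealth classifies a mean Jaccard value against the ranges above.
// healthyBelow is 0.30 for same-role-across-contracts and 0.10 for
// different-roles-same-contract.
func diversityHealth(mean, healthyBelow float64) string {
	switch {
	case mean > 0.50:
		return "cycling"
	case mean < healthyBelow:
		return "healthy"
	default:
		return "borderline" // assumed label; not defined by the report
	}
}

func main() {
	// Values taken from this run's diversity table.
	fmt.Println(diversityHealth(0.11900691900691901, 0.30))  // prints healthy
	fmt.Println(diversityHealth(0.003703703703703704, 0.10)) // prints healthy
}
```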
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
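The three bands translate directly into a threshold check, which is handy if a run should fail automatically when determinism degrades. A small sketch; `stabilityBand` and its return labels are hypothetical, not part of the harness:

```go
package main

import "fmt"

// stabilityBand maps a mean reissue Jaccard onto the three bands listed above.
func stabilityBand(meanJaccard float64) string {
	switch {
	case meanJaccard >= 0.95:
		return "deterministic"
	case meanJaccard >= 0.80:
		return "acceptable variance"
	default:
		return "unstable"
	}
}

func main() {
	// 1.0 is the value reported for this run's 12 reissue pairs;
	// the other two inputs are illustrative.
	fmt.Println(stabilityBand(1.0))  // prints deterministic
	fmt.Println(stabilityBand(0.87)) // prints acceptable variance
	fmt.Println(stabilityBand(0.42)) // prints unstable
}
```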
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_002.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_002.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_002.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.

@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 003
**Generated:** 2026-04-30T13:13:44.35966865Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 61
**Evidence:** `reports/reality-tests/multi_coord_stress_003.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.03068783068783069 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
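The "drift past the inject threshold" failure mode can be pictured as a distance gate between the embedding of Bob's paraphrase and the embedding of Alice's recorded query. A sketch under several assumptions: the gate is a cosine distance, the 0.20 cut-off matches the inject threshold described for the playbook_lift tests, and `cosineDistance` / `shouldInject` are hypothetical names; the toy 2-d vectors stand in for real embeddings:

```go
package main

import (
	"fmt"
	"math"
)

// cosineDistance returns 1 - cosine similarity. Mismatched or zero-norm
// vectors are treated as maximally distant (1.0) to keep the gate closed.
func cosineDistance(a, b []float64) float64 {
	if len(a) != len(b) {
		return 1
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 1
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}

// shouldInject reports whether a recorded answer is close enough to the
// incoming query to be injected into the results.
func shouldInject(query, recorded []float64, maxDistance float64) bool {
	return cosineDistance(query, recorded) <= maxDistance
}

func main() {
	recorded := []float64{1, 0}
	closeParaphrase := []float64{0.9, 0.1}   // small drift
	liberalParaphrase := []float64{0.5, 0.9} // large drift
	fmt.Println(shouldInject(closeParaphrase, recorded, 0.20))   // prints true
	fmt.Println(shouldInject(liberalParaphrase, recorded, 0.20)) // prints false
}
```

A paraphrase hit rate near 0.0 would correspond to most of Bob's queries landing in the second case.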
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_003.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_003.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_003.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.

@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 004
**Generated:** 2026-04-30T13:17:03.577877974Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 61
**Evidence:** `reports/reality-tests/multi_coord_stress_004.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.08013468013468013 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.012820512820512822 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_004.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_004.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_004.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.

@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 005
**Generated:** 2026-04-30T13:25:15.497712275Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 61
**Evidence:** `reports/reality-tests/multi_coord_stress_005.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.03610093610093609 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_005.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_005.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_005.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.

@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 006
**Generated:** 2026-04-30T13:33:24.568124731Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_006.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.04603174603174603 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_006.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_006.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_006.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.

@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 007
**Generated:** 2026-04-30T19:50:04.791000091Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_007.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.026455026455026454 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_007.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_007.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_007.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.

@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 008
**Generated:** 2026-04-30T21:15:37.045817146Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_008.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.04126984126984126 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_008.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_008.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_008.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.

@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 009
**Generated:** 2026-04-30T21:23:59.011167722Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_009.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.015873015873015872 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.015343915343915345 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_009.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_009.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_009.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.

@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 010
**Generated:** 2026-04-30T21:30:38.434794788Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_010.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.007407407407407408 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.026455026455026454 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_010.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_010.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_010.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.

@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 011
**Generated:** 2026-04-30T21:41:26.801002955Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_011.json`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.025641025641025644 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.06996336996336996 | 18 | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable; reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
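The handover counts and hit rate in the table above can be derived from the rank at which the recorded answer lands for each query. The sketch below shows one way to do that, assuming 0-based ranks and a -1 sentinel for "not in top-K". The `hitStats` type and `summarize` function are hypothetical names, not the driver's own.

```go
package main

import "fmt"

// hitStats summarizes where the recorded answer landed across handover queries.
type hitStats struct {
	Run      int     // queries executed
	Top1     int     // recorded answer at rank 0
	TopK     int     // recorded answer anywhere in the returned top-K
	Top1Rate float64 // Top1 / Run, or 0 when no queries ran
}

// summarize takes one rank per query: 0-based position of the recorded
// answer, or -1 when it is absent from the top-K (an assumed convention).
func summarize(ranks []int) hitStats {
	s := hitStats{Run: len(ranks)}
	for _, r := range ranks {
		if r >= 0 {
			s.TopK++
		}
		if r == 0 {
			s.Top1++
		}
	}
	if s.Run > 0 {
		s.Top1Rate = float64(s.Top1) / float64(s.Run)
	}
	return s
}

func main() {
	// Four handover queries, all landing the recorded answer at rank 0.
	fmt.Printf("%+v\n", summarize([]int{0, 0, 0, 0}))
	// A mixed case: one hit below top-1, one miss.
	fmt.Printf("%+v\n", summarize([]int{0, 3, -1, 0}))
}
```

Keeping the top-1 and top-K counts separate makes it visible when a recording surfaces but is not promoted all the way to rank 0.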
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_011.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_011.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_011.json
```
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.

@ -1,120 +0,0 @@
# Playbook-Lift Reality Test — Run 005
**Generated:** 2026-04-30T12:40:48.475901847Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/playbook_lift_queries.txt` (21 executed)
**K per pass:** 10
**Paraphrase pass:** ENABLED
**Re-judge pass:** ENABLED
**Evidence:** `reports/reality-tests/playbook_lift_005.json`
---
## Headline
| Metric | Value |
|---|---:|
| Total queries run | 21 |
| Cold-pass discoveries (judge-best ≠ top-1) | 7 |
| Warm-pass lifts (recorded playbook → top-1) | 5 |
| No change (judge-best already top-1, no playbook needed) | 16 |
| Playbook boosts triggered (warm pass) | 9 |
| Mean Δ top-1 distance (warm − cold) | -0.076170966 |
| **Paraphrase pass — recorded answer at rank 0 (top-1)** | **5 / 7** |
| Paraphrase pass — recorded answer at any rank in top-K | 5 / 7 |
| **Quality lift** (warm top-1 rating > cold top-1 rating) | **5 / 21** |
| Quality neutral (warm top-1 rating = cold top-1 rating) | 13 / 21 |
| Quality regressed (warm top-1 rating < cold top-1 rating) | 3 / 21 |
**Verbatim lift rate:** 5 of 7 discoveries became top-1 after warm pass.
---
## Per-query results
| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehouse experience, day sh | e-5670 | 2/4 | ✓ e-5729 | e-5729 | 0 | **YES** |
| 2 | OSHA-30 certified forklift operator in Wisconsin, cold stora | e-6293 | 7/3 | — | w-1566 | 8 | no |
| 3 | Production worker with confined-space cert and hazmat traini | w-602 | 0/2 | — | w-3575 | 1 | no |
| 4 | CDL Class A driver, clean record, willing to do regional 4-d | w-3854 | 0/1 | — | w-3854 | 0 | no |
| 5 | Warehouse lead with current OSHA-30 certification, NOT OSHA- | w-1807 | 6/3 | — | w-1807 | 6 | no |
| 6 | Forklift-certified loader, certification must be active, dis | w-1807 | 3/4 | ✓ w-205 | w-4257 | 1 | no |
| 7 | Hazmat-certified warehouse worker comfortable with cold stor | e-4910 | 2/4 | ✓ w-4257 | w-205 | 1 | no |
| 8 | Bilingual production worker with team-lead experience and tr | w-4988 | 0/4 | — | w-4988 | 0 | no |
| 9 | Inventory specialist with confined-space cert and compliance | w-388 | 3/4 | ✓ w-3575 | w-3575 | 0 | **YES** |
| 10 | Warehouse worker who can run inventory cycles and lead a sma | e-3011 | 0/4 | — | e-3011 | 0 | no |
| 11 | Production line worker comfortable filling in as line superv | w-1387 | 0/4 | — | e-5729 | 1 | no |
| 12 | Customer service rep willing to cross-train into dispatch or | w-1451 | 0/2 | — | w-1451 | 0 | no |
| 13 | Reliable production line lead with strong attendance and lea | e-7360 | 5/4 | ✓ w-2886 | w-2886 | 0 | **YES** |
| 14 | Highly responsive forklift operator available for last-minut | e-6108 | 5/4 | ✓ w-1566 | w-1566 | 0 | **YES** |
| 15 | Engaged warehouse associate with strong safety compliance re | e-2743 | 2/4 | ✓ w-49 | w-49 | 0 | **YES** |
| 16 | CDL-A driver based in IL or WI, willing to run regional 4-da | w-2486 | 5/2 | — | w-2486 | 5 | no |
| 17 | Bilingual customer service rep in Indianapolis or Cincinnati | e-9749 | 9/2 | — | e-9749 | 9 | no |
| 18 | Production supervisor open to Midwest relocation for permane | w-379 | 6/3 | — | w-379 | 6 | no |
| 19 | Dental hygienist with three years experience, Indianapolis a | e-6772 | 0/1 | — | w-3575 | 1 | no |
| 20 | Registered nurse with ICU experience, willing to take per-di | w-379 | 0/1 | — | w-379 | 0 | no |
| 21 | Software engineer with React and TypeScript, three years exp | w-1773 | 0/1 | — | w-1773 | 0 | no |
---
## Paraphrase pass — does the playbook help similar-but-different queries?
For each query whose Pass 1 cold pass recorded a playbook entry, the
judge model rephrased the query, and the rephrased version was sent
through warm matrix.search. The recorded answer ID's rank in those
results tests whether cosine on the embedded paraphrase finds the
recorded query's vector.
| # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift |
|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehous | Seeking forklift operator certified in OSHA-30, looking for | e-5729 | e-5729 | 0 | **YES** |
| 6 | Forklift-certified loader, certification | Loader requiring active forklift certification, this must no | w-205 | w-205 | 0 | **YES** |
| 7 | Hazmat-certified warehouse worker comfor | Warehouse worker with Hazmat certification and experience in | w-4257 | w-4257 | 0 | **YES** |
| 9 | Inventory specialist with confined-space | Specialist in inventory management requiring certified confi | w-3575 | w-49 | -1 | no |
| 13 | Reliable production line lead with stron | Experienced production line supervisor with excellent punctu | w-2886 | w-2886 | 0 | **YES** |
| 14 | Highly responsive forklift operator avai | Available forklift operator ready for urgent shift coverage | w-1566 | w-1566 | 0 | **YES** |
| 15 | Engaged warehouse associate with strong | Warehouse associate dedicated to engagement and boasting a r | w-49 | w-984 | -1 | no |
---
## Honesty caveats
1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
the lift number is meaningless. To validate the judge itself, sample 5–10
verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
`distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
case — same query, recorded playbook, expected boost. The paraphrase
pass (when enabled) is the actual learning property: similar-but-different
queries hitting a recorded playbook. Compare verbatim and paraphrase
lift rates — paraphrase should be lower (semantic-distance gates some
playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
results land in one corpus, the matrix layer's purpose isn't being tested.
Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from
env JUDGE_MODEL=qwen2.5:latest.
Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates
relevance also rephrases queries. A judge that's bad at rating staffing
queries is probably also bad at rephrasing them. Worth sanity-checking
a sample of `paraphrase_query` values in the JSON before trusting the
paraphrase lift number.
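Caveat 2 can be made concrete with a small worked example of the stated formula `distance' = distance × (1 - 0.5 × score)`. This is a sketch only. It assumes the cold top-1 has no playbook entry and so stays unboosted, and it uses a strict less-than comparison for promotion; how the real system breaks ties is not stated in this report. The function names and sample distances are made up for illustration.

```go
package main

import "fmt"

// boostedDistance applies distance' = distance * (1 - 0.5 * score).
func boostedDistance(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

// promotes reports whether a boosted candidate ends up closer than the
// cold top-1. Assumption: the cold top-1 itself is not boosted.
func promotes(candidateDist, top1Dist, score float64) bool {
	return boostedDistance(candidateDist, score) < top1Dist
}

func main() {
	top1 := 0.30 // hypothetical cold top-1 distance
	for _, cand := range []float64{0.45, 0.59, 0.61} {
		fmt.Printf("candidate=%.2f boosted=%.3f promoted=%v\n",
			cand, boostedDistance(cand, 1.0), promotes(cand, top1, 1.0))
	}
}
```

With a score of 1.0 the candidate's distance is halved, so a candidate sitting at more than twice the top-1 distance (0.61 against 0.30 here) still loses after the boost.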
## Next moves
- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why: judge variance, distance gap too
wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
already close to optimal on this query distribution. Either the corpus
is too narrow or the queries are too easy.

@ -76,14 +76,8 @@ DIM="$(echo "$RESP" | jq -r '.dimension')"
N="$(echo "$RESP" | jq -r '.vectors | length')"
MODEL="$(echo "$RESP" | jq -r '.model')"
SAME="$(echo "$RESP" | jq -r '.vectors[0][0] == .vectors[1][0]')"
# Accept any nomic-embed-text* family member as the default — v1
# (137M, 768d) and v2-moe (475M MoE, 768d) are both supported drop-ins.
# The smoke locks the dimension + the distinct-vectors property, NOT
# the exact model name (operators bump the model in lakehouse.toml
# without changing this smoke).
case "$MODEL" in nomic-embed-text*) MODEL_OK=1 ;; *) MODEL_OK=0 ;; esac
if [ "$DIM" = "768" ] && [ "$N" = "2" ] && [ "$MODEL_OK" = "1" ] && [ "$SAME" = "false" ]; then
echo " ✓ dim=768, model=$MODEL, 2 distinct vectors"
if [ "$DIM" = "768" ] && [ "$N" = "2" ] && [ "$MODEL" = "nomic-embed-text" ] && [ "$SAME" = "false" ]; then
echo " ✓ dim=768, model=nomic-embed-text, 2 distinct vectors"
else
echo " ✗ resp: dim=$DIM n=$N model=$MODEL same=$SAME"; FAILED=1
fi

@ -1,282 +0,0 @@
#!/usr/bin/env bash
# Multi-coordinator stress harness — Phase 1 of the 48-hour mock.
#
# Three coordinators (Alice / Bob / Carol) own three distinct contracts
# (Milwaukee distribution, Indianapolis manufacturing, Chicago
# construction). The driver fires phases:
# 1. baseline — each coord runs their contract's role queries
# 2. surge — each contract's demand doubles (URGENT phrasing)
# 3. merge — alpha + beta combined under alice
# 4. handover — bob takes alpha, USING alice's playbook namespace
# 5. split — alpha surge re-distributed across all 3 coords
# 6. reissue — non-determinism check: same baselines reissued
# 7. analysis — diversity + determinism + learning metrics
#
# Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints,
# and Langfuse wiring — those are Phase 2/3.
#
# Usage:
# ./scripts/multi_coord_stress.sh # run #001
# RUN_ID=002 ./scripts/multi_coord_stress.sh
# K=12 ./scripts/multi_coord_stress.sh
set -euo pipefail
cd "$(dirname "$0")/.."
export PATH="$PATH:/usr/local/go/bin"
RUN_ID="${RUN_ID:-001}"
WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
ETHEREAL_LIMIT="${ETHEREAL_LIMIT:-0}"
CORPORA="${CORPORA:-workers,ethereal_workers}"
K="${K:-8}"
OUT_JSON="reports/reality-tests/multi_coord_stress_${RUN_ID}.json"
OUT_MD="reports/reality-tests/multi_coord_stress_${RUN_ID}.md"
if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
echo "[stress] Ollama not reachable on :11434 — skipping (need it for embeddings)"
exit 0
fi
echo "[stress] building binaries..."
go build -o bin/ ./cmd/storaged ./cmd/catalogd ./cmd/ingestd ./cmd/queryd \
./cmd/embedd ./cmd/vectord ./cmd/pathwayd ./cmd/observerd \
./cmd/matrixd ./cmd/gateway \
./scripts/staffing_workers ./scripts/multi_coord_stress
pkill -f "bin/(storaged|catalogd|ingestd|queryd|embedd|vectord|pathwayd|observerd|matrixd|gateway)" 2>/dev/null || true
sleep 0.3
PIDS=()
TMP="$(mktemp -d)"
CFG="$TMP/stress.toml"
cleanup() {
  echo "[stress] cleanup"
  for p in "${PIDS[@]:-}"; do [ -n "${p:-}" ] && kill "$p" 2>/dev/null || true; done
  rm -rf "$TMP"
}
trap cleanup EXIT INT TERM
cat > "$CFG" <<EOF
[s3]
endpoint = "http://localhost:9000"
region = "us-east-1"
bucket = "lakehouse-go-primary"
use_path_style = true
[gateway]
bind = "127.0.0.1:3110"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
ingestd_url = "http://127.0.0.1:3213"
queryd_url = "http://127.0.0.1:3214"
vectord_url = "http://127.0.0.1:3215"
embedd_url = "http://127.0.0.1:3216"
pathwayd_url = "http://127.0.0.1:3217"
matrixd_url = "http://127.0.0.1:3218"
observerd_url = "http://127.0.0.1:3219"
[storaged]
bind = "127.0.0.1:3211"
[catalogd]
bind = "127.0.0.1:3212"
storaged_url = "http://127.0.0.1:3211"
[ingestd]
bind = "127.0.0.1:3213"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
max_ingest_bytes = 268435456
[queryd]
bind = "127.0.0.1:3214"
catalogd_url = "http://127.0.0.1:3212"
secrets_path = "/etc/lakehouse/secrets-go.toml"
refresh_every = "1s"
[embedd]
bind = "127.0.0.1:3216"
provider_url = "http://localhost:11434"
default_model = "nomic-embed-text-v2-moe"
[vectord]
bind = "127.0.0.1:3215"
storaged_url = ""
[pathwayd]
bind = "127.0.0.1:3217"
persist_path = ""
[observerd]
bind = "127.0.0.1:3219"
persist_path = ""
[matrixd]
bind = "127.0.0.1:3218"
embedd_url = "http://127.0.0.1:3216"
vectord_url = "http://127.0.0.1:3215"
EOF
# Poll a service's /health endpoint until it answers or ~5 seconds elapse.
poll_health() {
  local port="$1" deadline=$(($(date +%s) + 5))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if curl -sS --max-time 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then return 0; fi
    sleep 0.05
  done
  return 1
}
echo "[stress] launching stack..."
./bin/storaged -config "$CFG" > /tmp/stress_storaged.log 2>&1 & PIDS+=($!); poll_health 3211 || { echo "storaged failed"; exit 1; }
./bin/catalogd -config "$CFG" > /tmp/stress_catalogd.log 2>&1 & PIDS+=($!); poll_health 3212 || { echo "catalogd failed"; exit 1; }
./bin/ingestd -config "$CFG" > /tmp/stress_ingestd.log 2>&1 & PIDS+=($!); poll_health 3213 || { echo "ingestd failed"; exit 1; }
./bin/queryd -config "$CFG" > /tmp/stress_queryd.log 2>&1 & PIDS+=($!); poll_health 3214 || { echo "queryd failed"; exit 1; }
./bin/embedd -config "$CFG" > /tmp/stress_embedd.log 2>&1 & PIDS+=($!); poll_health 3216 || { echo "embedd failed"; exit 1; }
./bin/vectord -config "$CFG" > /tmp/stress_vectord.log 2>&1 & PIDS+=($!); poll_health 3215 || { echo "vectord failed"; exit 1; }
./bin/pathwayd -config "$CFG" > /tmp/stress_pathwayd.log 2>&1 & PIDS+=($!); poll_health 3217 || { echo "pathwayd failed"; exit 1; }
./bin/observerd -config "$CFG" > /tmp/stress_observerd.log 2>&1 & PIDS+=($!); poll_health 3219 || { echo "observerd failed"; exit 1; }
./bin/matrixd -config "$CFG" > /tmp/stress_matrixd.log 2>&1 & PIDS+=($!); poll_health 3218 || { echo "matrixd failed"; exit 1; }
./bin/gateway -config "$CFG" > /tmp/stress_gateway.log 2>&1 & PIDS+=($!); poll_health 3110 || { echo "gateway failed"; exit 1; }
echo
echo "[stress] ingest workers (limit=$WORKERS_LIMIT) into 'workers' corpus..."
./bin/staffing_workers -limit "$WORKERS_LIMIT"
echo
echo "[stress] ingest ethereal_workers (limit=$ETHEREAL_LIMIT, 0=all) into 'ethereal_workers' corpus..."
./bin/staffing_workers \
-parquet "/home/profit/lakehouse/data/datasets/ethereal_workers.parquet" \
-index-name ethereal_workers \
-id-prefix "e-" \
-limit "$ETHEREAL_LIMIT"
echo
echo "[stress] running multi-coord stress driver..."
EXTRA_FLAGS=""
if [ "${WITH_PARAPHRASE_HANDOVER:-1}" = "1" ]; then
EXTRA_FLAGS="$EXTRA_FLAGS -with-paraphrase-handover"
fi
./bin/multi_coord_stress \
-gateway "http://127.0.0.1:3110" \
-contracts tests/reality/contracts \
-corpora "$CORPORA" \
-k "$K" \
-out "$OUT_JSON" \
-ollama "http://localhost:11434" \
-judge "${JUDGE_MODEL:-qwen2.5:latest}" \
$EXTRA_FLAGS
echo
echo "[stress] generating markdown report → $OUT_MD"
# Render compact markdown from the JSON. Same shape as the lift harness
# reports so reviewers can compare format.
total=$(jq -r '.events | length' "$OUT_JSON")
gen_at=$(jq -r '.generated_at' "$OUT_JSON")
div_role=$(jq -r '.diversity.same_role_across_contracts_mean_jaccard' "$OUT_JSON")
div_role_n=$(jq -r '.diversity.num_pairs_same_role_across_contracts' "$OUT_JSON")
div_xrole=$(jq -r '.diversity.different_roles_same_contract_mean_jaccard' "$OUT_JSON")
div_xrole_n=$(jq -r '.diversity.num_pairs_different_roles_same_contract' "$OUT_JSON")
det_jacc=$(jq -r '.determinism.mean_jaccard' "$OUT_JSON")
det_n=$(jq -r '.determinism.num_reissued_pairs' "$OUT_JSON")
hand_run=$(jq -r '.learning.handover_queries_run' "$OUT_JSON")
hand_top1=$(jq -r '.learning.recorded_answers_top1_count' "$OUT_JSON")
hand_topk=$(jq -r '.learning.recorded_answers_topk_count' "$OUT_JSON")
hand_rate=$(jq -r '.learning.handover_hit_rate' "$OUT_JSON")
ph_run=$(jq -r '.learning.paraphrase_handover_run // 0' "$OUT_JSON")
ph_top1=$(jq -r '.learning.paraphrase_top1_count // 0' "$OUT_JSON")
ph_topk=$(jq -r '.learning.paraphrase_topk_count // 0' "$OUT_JSON")
ph_rate=$(jq -r '.learning.paraphrase_handover_hit_rate // 0' "$OUT_JSON")
cat > "$OUT_MD" <<MDEOF
# Multi-Coordinator Stress Test — Run ${RUN_ID}
**Generated:** ${gen_at}
**Coordinators:** alice / bob / carol (each with own playbook namespace: \`playbook_alice\` / \`playbook_bob\` / \`playbook_carol\`)
**Contracts:** $(jq -r '.contracts | join(" / ")' "$OUT_JSON")
**Corpora:** \`${CORPORA}\`
**K per query:** ${K}
**Total events captured:** ${total}
**Evidence:** \`${OUT_JSON}\`
---
## Diversity — is the system locking into scenarios or cycling?
| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | ${div_role} | ${div_role_n} | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | ${div_xrole} | ${div_xrole_n} | Should be near-zero (different roles = different worker pools) |
**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
---
## Determinism — same query reissued, top-K stability
| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | ${det_jacc} |
| Number of reissue pairs | ${det_n} |
**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80–0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
---
## Learning — handover hit rate
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
| Metric | Value |
|---|---:|
| Verbatim handover queries run | ${hand_run} |
| Alice's recorded answer at Bob's top-1 (verbatim) | ${hand_top1} |
| Alice's recorded answer in Bob's top-K (verbatim) | ${hand_topk} |
| **Verbatim handover hit rate (top-1)** | **${hand_rate}** |
| Paraphrase handover queries run | ${ph_run} |
| Alice's recorded answer at Bob's top-1 (paraphrase) | ${ph_top1} |
| Alice's recorded answer in Bob's top-K (paraphrase) | ${ph_topk} |
| **Paraphrase handover hit rate (top-1)** | **${ph_rate}** |
**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
---
## Per-event capture
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
\`\`\`bash
jq '.events[] | select(.phase == "merge")' ${OUT_JSON}
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' ${OUT_JSON}
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' ${OUT_JSON}
\`\`\`
---
## What's NOT in this run (Phase 1 deliberately defers)
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
MDEOF
echo
echo "[stress] DONE"
echo "[stress] evidence: $OUT_JSON"
echo "[stress] report: $OUT_MD"

File diff suppressed because it is too large.

@ -52,11 +52,6 @@ CONFIG_PATH="${CONFIG_PATH:-lakehouse.toml}"
# actual learning-property test (does cosine on paraphrase find the
# recorded entry?). Set WITH_PARAPHRASE=0 for a faster verbatim-only run.
WITH_PARAPHRASE="${WITH_PARAPHRASE:-1}"
# WITH_REJUDGE=1 (default) adds a Pass 4 — judge warm top-1 to measure
# quality lift (warm rating vs cold rating). Catches cases where Shape B
# surfaces a different-but-equally-good answer (which the rank-based
# lift metric misses). +21 judge calls (~30s on qwen2.5).
WITH_REJUDGE="${WITH_REJUDGE:-1}"
OUT_JSON="reports/reality-tests/playbook_lift_${RUN_ID}.json"
OUT_MD="reports/reality-tests/playbook_lift_${RUN_ID}.md"
@ -161,7 +156,7 @@ refresh_every = "1s"
[embedd]
bind = "127.0.0.1:3216"
provider_url = "http://localhost:11434"
default_model = "nomic-embed-text-v2-moe"
default_model = "nomic-embed-text"
[vectord]
bind = "127.0.0.1:3215"
@ -276,12 +271,9 @@ echo "[lift] running driver — judge=$EFFECTIVE_JUDGE · queries=$QUERIES_FILE
# and runs its own resolution chain (env → config → fallback). When
# JUDGE_MODEL IS set, the explicit -judge wins inside the Go driver
# regardless of what its env-lookup would find — flag wins by design.
EXTRA_FLAGS=""
PARAPHRASE_FLAG=""
if [ "$WITH_PARAPHRASE" = "1" ]; then
EXTRA_FLAGS="$EXTRA_FLAGS -with-paraphrase"
fi
if [ "$WITH_REJUDGE" = "1" ]; then
EXTRA_FLAGS="$EXTRA_FLAGS -with-rejudge"
PARAPHRASE_FLAG="-with-paraphrase"
fi
./bin/playbook_lift \
-config "$CONFIG_PATH" \
@ -292,7 +284,7 @@ fi
-judge "$JUDGE_MODEL" \
-k "$K" \
-out "$OUT_JSON" \
$EXTRA_FLAGS
$PARAPHRASE_FLAG
echo
echo "[lift] generating markdown report → $OUT_MD"
@ -310,10 +302,6 @@ generate_md() {
p_attempted=$(jq -r '.summary.paraphrase_attempted // 0' "$json")
p_top1=$(jq -r '.summary.paraphrase_top1_lifts // 0' "$json")
p_anyrank=$(jq -r '.summary.paraphrase_any_rank_hits // 0' "$json")
rj_attempted=$(jq -r '.summary.rejudge_attempted // 0' "$json")
q_lifted=$(jq -r '.summary.quality_lifted // 0' "$json")
q_neutral=$(jq -r '.summary.quality_neutral // 0' "$json")
q_regressed=$(jq -r '.summary.quality_regressed // 0' "$json")
# Only emit the paraphrase block when --with-paraphrase actually ran
# (i.e. .summary.paraphrase_attempted > 0). For verbatim-only runs we
@ -324,13 +312,6 @@ generate_md() {
| Paraphrase pass — recorded answer at any rank in top-K | ${p_anyrank} / ${p_attempted} |"
fi
rj_block=""
if [ "$rj_attempted" != "0" ] && [ "$rj_attempted" != "null" ]; then
rj_block="| **Quality lift** (warm top-1 rating > cold top-1 rating) | **${q_lifted} / ${rj_attempted}** |
| Quality neutral (warm top-1 rating = cold top-1 rating) | ${q_neutral} / ${rj_attempted} |
| Quality regressed (warm top-1 rating < cold top-1 rating) | ${q_regressed} / ${rj_attempted} |"
fi
cat > "$md" <<MDEOF
# Playbook-Lift Reality Test — Run ${RUN_ID}
@ -341,7 +322,6 @@ generate_md() {
**Queries:** \`${QUERIES_FILE}\` (${total} executed)
**K per pass:** ${K}
**Paraphrase pass:** $([ "$WITH_PARAPHRASE" = "1" ] && echo "ENABLED" || echo "disabled")
**Re-judge pass:** $([ "$WITH_REJUDGE" = "1" ] && echo "ENABLED" || echo "disabled")
**Evidence:** \`${OUT_JSON}\`
---
@ -357,7 +337,6 @@ generate_md() {
| Playbook boosts triggered (warm pass) | ${boosted} |
| Mean Δ top-1 distance (warm − cold) | ${mean_delta} |
${p_block}
${rj_block}
**Verbatim lift rate:** ${lift} of ${discovery} discoveries became top-1 after warm pass.

@ -75,19 +75,12 @@ type queryRun struct {
PlaybookRecorded bool `json:"playbook_recorded"`
PlaybookID string `json:"playbook_target_id,omitempty"`
WarmTop1ID string `json:"warm_top1_id"`
WarmTop1Distance float32 `json:"warm_top1_distance"`
WarmBoostedCount int `json:"warm_boosted_count"`
WarmJudgeBestRank int `json:"warm_judge_best_rank"` // rank of cold judge-best in warm — NOT the warm pass's own judge-best
WarmTop1Metadata json.RawMessage `json:"-"` // cached for Pass 4 rejudge; not emitted
WarmTop1ID string `json:"warm_top1_id"`
WarmTop1Distance float32 `json:"warm_top1_distance"`
WarmBoostedCount int `json:"warm_boosted_count"`
WarmJudgeBestRank int `json:"warm_judge_best_rank"`
// WarmTop1Rating: only populated when --with-rejudge. Compare to
// ColdRatings[0] (== cold top-1 rating) to measure quality lift.
// *int so absence (no rejudge pass) and a 0-rating verdict are
// distinguishable.
WarmTop1Rating *int `json:"warm_top1_rating,omitempty"`
Lift bool `json:"lift"` // judge-best was below top-1 cold, but top-1 warm
Lift bool `json:"lift"` // judge-best was below top-1 cold, but top-1 warm
// Paraphrase pass — only populated when --with-paraphrase. Tests
// the playbook's actual learning property: does a recorded entry
@ -121,17 +114,6 @@ type summary struct {
ParaphraseTop1Lifts int `json:"paraphrase_top1_lifts,omitempty"` // recorded answer surfaced at rank 0
ParaphraseAnyRankHits int `json:"paraphrase_any_rank_hits,omitempty"` // recorded answer surfaced at any rank in top-K
// Re-judge pass aggregates — only populated when --with-rejudge.
// Measures QUALITY lift (warm top-1 rating vs cold top-1 rating)
// rather than rank-of-cold-judge-best lift. The latter conflates
// "warm surfaced a different but equally-good result" with "warm
// shuffled ranks but the answer was the same"; quality lift
// disambiguates them.
RejudgeAttempted int `json:"rejudge_attempted,omitempty"` // queries that ran the rejudge pass
QualityLifted int `json:"quality_lifted,omitempty"` // warm-top-1 rating > cold-top-1 rating
QualityNeutral int `json:"quality_neutral,omitempty"` // ratings equal (could be same or different item)
QualityRegressed int `json:"quality_regressed,omitempty"` // warm-top-1 rating < cold-top-1 rating
GeneratedAt time.Time `json:"generated_at"`
}
@ -146,7 +128,6 @@ func main() {
k := flag.Int("k", 10, "top-k from matrix.search per pass")
out := flag.String("out", "reports/reality-tests/playbook_lift_001.json", "output JSONL path")
withParaphrase := flag.Bool("with-paraphrase", false, "after warm pass, generate a paraphrase via the judge model and re-query with playbook=true to test the learning property")
withRejudge := flag.Bool("with-rejudge", false, "after warm pass, judge warm top-1 to measure QUALITY lift (vs cold top-1 rating), not just rank-of-cold-judge-best")
flag.Parse()
// Judge resolution priority: explicit flag > $JUDGE_MODEL env >
@ -244,7 +225,6 @@ func main() {
}
runs[i].WarmTop1ID = resp.Results[0].ID
runs[i].WarmTop1Distance = resp.Results[0].Distance
runs[i].WarmTop1Metadata = resp.Results[0].Metadata // cache for Pass 4 rejudge
runs[i].WarmBoostedCount = resp.PlaybookBoosted
playbookBoostedTotal += resp.PlaybookBoosted
@ -324,47 +304,6 @@ func main() {
}
}
// Pass 4 (warm-rejudge) — opt-in via --with-rejudge. Judge warm
// top-1 against the same prompt as cold ratings, then compare to
// cold top-1 rating. This measures QUALITY lift (did the playbook
// produce a better candidate?) rather than just rank-of-cold-judge-
// best lift (did the recorded answer move to top-1, even if cold's
// top-1 was already good?). See STATE_OF_PLAY OPEN — added because
// run #003's verbatim 2/6 didn't tell us whether Shape B was
// surfacing better OR same-quality alternatives.
rejudgeAttempted := 0
qualityLifted := 0
qualityNeutral := 0
qualityRegressed := 0
if *withRejudge {
log.Printf("[lift] warm-rejudge pass: measuring quality lift (warm top-1 rating vs cold top-1 rating)")
for i := range runs {
if runs[i].WarmTop1ID == "" || len(runs[i].WarmTop1Metadata) == 0 {
continue // warm pass didn't complete for this query
}
rejudgeAttempted++
result := matrixResult{
ID: runs[i].WarmTop1ID,
Distance: runs[i].WarmTop1Distance,
Metadata: runs[i].WarmTop1Metadata,
}
warmRating := judgeRate(hc, *ollama, *judge, runs[i].Query, result)
runs[i].WarmTop1Rating = &warmRating
coldRating := 0
if len(runs[i].ColdRatings) > 0 {
coldRating = runs[i].ColdRatings[0]
}
switch {
case warmRating > coldRating:
qualityLifted++
case warmRating < coldRating:
qualityRegressed++
default:
qualityNeutral++
}
}
}
sum := summary{
Total: len(runs),
WithDiscovery: withDiscovery,
@ -375,10 +314,6 @@ func main() {
ParaphraseAttempted: paraphraseAttempted,
ParaphraseTop1Lifts: paraphraseTop1Lifts,
ParaphraseAnyRankHits: paraphraseAnyRankHits,
RejudgeAttempted: rejudgeAttempted,
QualityLifted: qualityLifted,
QualityNeutral: qualityNeutral,
QualityRegressed: qualityRegressed,
GeneratedAt: time.Now().UTC(),
}
if len(runs) > 0 {
@ -388,11 +323,11 @@ func main() {
if err := writeJSON(*out, runs, sum); err != nil {
log.Fatalf("write %s: %v", *out, err)
}
if *withParaphrase || *withRejudge {
log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f · paraphrase=%d/%d→top1 · quality=lifted%d/neutral%d/regressed%d",
if *withParaphrase {
log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f · paraphrase=%d/%d→top1, %d/%d→anyrank",
sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance,
sum.ParaphraseTop1Lifts, sum.ParaphraseAttempted,
sum.QualityLifted, sum.QualityNeutral, sum.QualityRegressed)
sum.ParaphraseAnyRankHits, sum.ParaphraseAttempted)
} else {
log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f",
sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance)


@ -1,12 +0,0 @@
{
"name": "alpha_milwaukee_distribution",
"client": "Northstar Logistics",
"location": "Milwaukee, WI metro",
"shift": "day",
"demand": [
{"role": "warehouse worker", "count": 200, "skills": ["pallet jack", "inventory"], "certs": ["OSHA-30"]},
{"role": "admin assistant", "count": 3, "skills": ["scheduling", "data entry"], "certs": []},
{"role": "heavy equipment operator", "count": 2, "skills": ["forklift", "bobcat"], "certs": ["OSHA-30", "forklift cert"]},
{"role": "industrial electrician", "count": 1, "skills": ["high voltage", "PLC"], "certs": ["journeyman"], "in_roster": false}
]
}


@ -1,12 +0,0 @@
{
"name": "beta_indianapolis_manufacturing",
"client": "Crossroads Manufacturing",
"location": "Indianapolis, IN metro",
"shift": "swing",
"demand": [
{"role": "warehouse worker", "count": 150, "skills": ["assembly", "machine operation"], "certs": ["OSHA-10"]},
{"role": "admin assistant", "count": 4, "skills": ["scheduling", "documentation", "spanish"], "certs": []},
{"role": "heavy equipment operator", "count": 3, "skills": ["forklift", "pallet jack", "cold storage"], "certs": ["OSHA-30", "forklift cert"]},
{"role": "bilingual safety coordinator", "count": 1, "skills": ["spanish", "english", "training"], "certs": ["OSHA trainer"], "in_roster": false}
]
}


@ -1,12 +0,0 @@
{
"name": "gamma_chicago_construction",
"client": "Loop Construction Group",
"location": "Chicago, IL metro",
"shift": "early-day",
"demand": [
{"role": "warehouse worker", "count": 80, "skills": ["framing", "rigging", "concrete"], "certs": ["OSHA-10"]},
{"role": "admin assistant", "count": 1, "skills": ["scheduling", "blueprint reading"], "certs": []},
{"role": "heavy equipment operator", "count": 2, "skills": ["mobile crane", "rigging signals", "bobcat"], "certs": ["NCCCO crane cert"]},
{"role": "drone surveyor", "count": 1, "skills": ["UAV piloting", "GIS", "site mapping"], "certs": ["FAA Part 107"], "in_roster": false}
]
}
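
The three contract fixtures above share one shape, so a reader consuming them from Go could decode them as follows. This is a sketch under stated assumptions: the type and function names (`contract`, `demandLine`, `parseContract`, `totalHeadcount`, `outOfRosterRoles`) are hypothetical, only the JSON keys come from the fixtures, and treating an absent `in_roster` key as "in roster" is a guess about the intended semantics. The sample is a trimmed copy of the gamma fixture.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// demandLine models one entry of the "demand" array.
type demandLine struct {
	Role   string   `json:"role"`
	Count  int      `json:"count"`
	Skills []string `json:"skills"`
	Certs  []string `json:"certs"`
	// Pointer so an absent key can be told apart from an explicit false.
	InRoster *bool `json:"in_roster,omitempty"`
}

// contract models one fixture file.
type contract struct {
	Name     string       `json:"name"`
	Client   string       `json:"client"`
	Location string       `json:"location"`
	Shift    string       `json:"shift"`
	Demand   []demandLine `json:"demand"`
}

// parseContract decodes a single fixture document.
func parseContract(raw []byte) (contract, error) {
	var c contract
	err := json.Unmarshal(raw, &c)
	return c, err
}

// totalHeadcount sums the requested count across all demand lines.
func totalHeadcount(c contract) int {
	total := 0
	for _, d := range c.Demand {
		total += d.Count
	}
	return total
}

// outOfRosterRoles lists roles explicitly flagged "in_roster": false.
// Assumption: a missing flag means the role is in the roster.
func outOfRosterRoles(c contract) []string {
	var roles []string
	for _, d := range c.Demand {
		if d.InRoster != nil && !*d.InRoster {
			roles = append(roles, d.Role)
		}
	}
	return roles
}

// sample is a two-line excerpt of the gamma fixture, not the full file.
const sample = `{
  "name": "gamma_chicago_construction",
  "client": "Loop Construction Group",
  "location": "Chicago, IL metro",
  "shift": "early-day",
  "demand": [
    {"role": "heavy equipment operator", "count": 2, "skills": ["mobile crane", "rigging signals", "bobcat"], "certs": ["NCCCO crane cert"]},
    {"role": "drone surveyor", "count": 1, "skills": ["UAV piloting", "GIS", "site mapping"], "certs": ["FAA Part 107"], "in_roster": false}
  ]
}`

func main() {
	c, err := parseContract([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(c.Name, totalHeadcount(c), outOfRosterRoles(c))
}
```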