Compare commits
No commits in common. "f971e647456557e835e3ab6a97d372024573112a" and "87cbd10090aa83ca8aa626404dff5553e7323cb7" have entirely different histories.
@ -1,7 +1,7 @@
# STATE OF PLAY — Lakehouse-Go
**Last verified:** 2026-04-30 ~16:42 CDT
**Verified by:** live probes + `just verify` PASS + multi-coord stress run #011 (full 9-phase scenario, 67 captured events, 1 Langfuse trace + 111 child observations covering every phase + every external call), not memory.
**Last verified:** 2026-04-30 ~07:25 CDT
**Verified by:** live probes + `just verify` PASS + reality tests #001/#002/#003/#004 (v4 with split inject threshold: 6/8 verbatim lift + 6/8 paraphrase recovery + zero OOD cross-pollination), not memory.
> **Read this FIRST.** When the user says "we're working on lakehouse," default to the Go rewrite (this repo); the Rust legacy at `/home/profit/lakehouse/` is maintenance-only. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.
@ -114,34 +114,6 @@ Evidence: `reports/reality-tests/playbook_lift_{001,002,003}.{json,md}`. Per the
**v3 → v4 is the configuration evolution.** v3 ran with one threshold (0.5) for both boost and inject — paraphrase recovery hit 6/6 but a single strong recording (w-4435 for "OSHA-30 forklift Wisconsin") cross-pollinated to OOD queries (dental hygienist / RN / software engineer) because their text vectors fell within 0.5 cosine. v4 split: boost stays at 0.5 (safe — only re-ranks existing results), inject tightens to 0.20 (true paraphrase territory). Result: OOD cross-pollination eliminated, paraphrase recovery 6/8, verbatim lift 6/8. The two paraphrases that missed (Q9, Q15) were liberally rephrased — drift > 0.20 cosine.
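
A minimal sketch of the split, assuming cosine distance where smaller means closer. The two threshold values come from the paragraph above; `playbookAction`, its return strings, and the inclusive comparisons are hypothetical and only illustrate the decision, not the repo's actual code path.

```go
package main

import "fmt"

// Threshold values taken from the text: boost stays loose, inject is tight.
const (
	maxBoostDistance  = 0.5  // boost only re-ranks results that retrieval already returned
	maxInjectDistance = 0.20 // inject forces a result in, so it must be a near-paraphrase
)

// playbookAction is a hypothetical helper. Given the cosine distance between
// the incoming query and a recorded playbook query, and whether the recorded
// answer is already in the retrieved top-K, it reports what the playbook may do.
func playbookAction(distance float64, alreadyRetrieved bool) string {
	switch {
	case alreadyRetrieved && distance <= maxBoostDistance:
		return "boost"
	case !alreadyRetrieved && distance <= maxInjectDistance:
		return "inject"
	default:
		return "skip"
	}
}

func main() {
	fmt.Println(playbookAction(0.35, true))  // boost
	fmt.Println(playbookAction(0.35, false)) // skip
	fmt.Println(playbookAction(0.15, false)) // inject
}
```

The middle case is the v3 failure mode: at distance 0.35 a recording that retrieval did not return is no longer forced into the result set.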
### Multi-coordinator stress test (Phase 1 → 3) — VERIFIED end-to-end
Reality test #2 catalog. New harness `scripts/multi_coord_stress.{sh,go}` simulates 3 coordinators (alice/bob/carol) handling 3 distinct contracts (Milwaukee distribution / Indianapolis manufacturing / Chicago construction), each with their own playbook namespace. 9-phase operational narrative across simulated Hours 0–48: baseline → fresh-resume injection → inbox burst → mid-day surge → 200-worker swap → contract merge → handover (verbatim + paraphrase) → split → reissue.
| Capability | Verified | Where |
|---|---|---|
| Per-coordinator playbook isolation | ✓ | `playbook_alice` / `playbook_bob` / `playbook_carol` corpora |
| Same-role-across-contracts diversity | Jaccard 0.026 (n=9) — 97% workers differ per region | Phase 1 baseline |
| Different-roles-same-contract diversity | Jaccard 0.070 (n=18) — 93% differ per role | Phase 1 baseline |
| HNSW retrieval determinism | Jaccard 1.000 (n=12) | Phase 6 reissue |
| Verbatim handover (Bob runs Alice's queries with Alice's playbook) | 4/4 | Phase 4 |
| Paraphrase handover (Bob runs qwen2.5-paraphrased queries) | 4/4 | Phase 4b |
| 200-worker swap with `ExcludeIDs` | Jaccard 0.000 — 8/8 placed workers fully replaced | Phase 2b |
| Fresh-resume injection (two-tier `fresh_workers` index) | 3/3 fresh workers at top-1 | Phase 1b |
| Inbox endpoint `/v1/observer/inbox` (email + SMS, priority weighting) | 6/6 events recorded | Phase 1c |
| LLM demand parsing (qwen2.5 format=json on inbox bodies) | 6/6 parsed cleanly into structured `{role, count, location, certs, skills, shift}` | Phase 1c |
| Judge re-rates inbox top-1 against ORIGINAL body | catches tight-distance-but-wrong (Q3 crane case: dist 0.23 → rating 1) | Phase 1c |
| Langfuse Go-side tracing | 111 observations on a single run trace, browseable at http://localhost:3001 | Run #011 |
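
To make the "LLM demand parsing" row concrete, here is a hedged sketch of decoding a model's JSON reply into the `{role, count, location, certs, skills, shift}` shape. Only the field names come from the table; the Go types, the `parseDemand` helper, its validation rule, and the sample JSON are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Demand mirrors the field names listed in the table above.
// The Go types are assumptions, not the harness's actual struct.
type Demand struct {
	Role     string   `json:"role"`
	Count    int      `json:"count"`
	Location string   `json:"location"`
	Certs    []string `json:"certs"`
	Skills   []string `json:"skills"`
	Shift    string   `json:"shift"`
}

// parseDemand decodes a model reply and rejects replies that lack a role
// or a positive headcount (a hypothetical minimum for running a search).
func parseDemand(raw string) (Demand, error) {
	var d Demand
	if err := json.Unmarshal([]byte(raw), &d); err != nil {
		return Demand{}, fmt.Errorf("demand: invalid JSON: %w", err)
	}
	if d.Role == "" || d.Count <= 0 {
		return Demand{}, fmt.Errorf("demand: role and positive count required")
	}
	return d, nil
}

func main() {
	raw := `{"role":"forklift operator","count":50,"location":"Cleveland OH","certs":[],"skills":[],"shift":"day"}`
	d, err := parseDemand(raw)
	fmt.Println(d.Role, d.Count, d.Shift, err)
}
```

Rejecting incomplete replies before search keeps a malformed parse from turning into a confident-looking but meaningless top-1.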
**Substrate gains added by this wave:**
- `internal/matrix/playbook.go` Shape B + split inject threshold (commit `67d1957` from earlier wave; verified in multi-coord too)
- `internal/matrix/retrieve.go` `ExcludeIDs` field on `SearchRequest` — filters worker IDs at retrieve, boost, AND inject (so excluded answers can't sneak back via playbook). Real-world driver: coordinator placed N workers, client asks for replacements.
- `internal/observer/types.go` `SourceInbox` taxonomy alongside `SourceMCP / SourceScenario / SourceWorkflow`
- `cmd/observerd` `POST /observer/inbox` route — accepts `{type, sender, subject, body, priority, tag}` and records as `ObservedOp`. Type must be `email` or `sms`; body required; priority defaults to medium.
- `internal/langfuse/client.go` — minimal Go-side Trace+Span client, best-effort posture (logs on error, never blocks calling path; same fail-open semantics as ADR-005 Decision 5.1).
- `embedd default_model` bumped from `nomic-embed-text` (137M) → `nomic-embed-text-v2-moe` (475M, MoE, drop-in 768d). Same-role-across-contracts diversity went from 0.080 → 0.000 with the upgrade.
- Two-tier index pattern: fresh content goes to `fresh_workers` (a small "hot" corpus); main queries include it in the corpora list. Solves the HNSW post-build add recall issue (incremental adds to a populated graph land in poorly-connected regions and disappear from search). Canonical NRT pattern; Lucene works the same way.
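
The two-tier pattern in the last bullet can be sketched as a query-time merge. This is an illustration under assumptions: `Hit` and `mergeTiers` are hypothetical names, and the real retriever's merge logic is not shown in this file.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a hypothetical search result: a worker ID and its distance to the query.
type Hit struct {
	ID       string
	Distance float64
}

// mergeTiers combines hits from the main corpus and the small fresh corpus,
// keeps the closest hit per ID, and returns the k nearest overall.
func mergeTiers(mainHits, freshHits []Hit, k int) []Hit {
	best := make(map[string]Hit, len(mainHits)+len(freshHits))
	for _, tier := range [][]Hit{mainHits, freshHits} {
		for _, h := range tier {
			if cur, ok := best[h.ID]; !ok || h.Distance < cur.Distance {
				best[h.ID] = h
			}
		}
	}
	merged := make([]Hit, 0, len(best))
	for _, h := range best {
		merged = append(merged, h)
	}
	sort.Slice(merged, func(i, j int) bool {
		if merged[i].Distance != merged[j].Distance {
			return merged[i].Distance < merged[j].Distance
		}
		return merged[i].ID < merged[j].ID // deterministic order on ties
	})
	if k >= 0 && len(merged) > k {
		merged = merged[:k]
	}
	return merged
}

func main() {
	mainHits := []Hit{{"w-1", 0.30}, {"w-2", 0.45}}
	freshHits := []Hit{{"w-9", 0.10}}
	fmt.Println(mergeTiers(mainHits, freshHits, 2)) // [{w-9 0.1} {w-1 0.3}]
}
```

Because the fresh tier is searched as its own small index, a newly added worker competes on distance alone and does not depend on how well it was linked into the large graph.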
### Harness expansion (2026-04-30 ~05:30 CDT)
`scripts/playbook_lift.sh` rewritten from a 5-daemon stripped harness to the **full 10-daemon prod-realistic stack** (chatd stays up independently). The 5-daemon version was structurally hiding bugs; expanding the daemon set surfaced 7 distinct fixes:
@ -199,16 +171,10 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
- The matrix indexer **5/5 components** are shipped. Don't propose to "build the matrix indexer" — it's done.
- The 5-loop substrate's load-bearing gate is **PASSED**. v3 (`154a72e`) showed 6/6 paraphrase recovery via Shape B. Don't propose to "test if the playbook learns" — it does.
- **Shape B is the playbook stance now.** When `use_playbook=true`, both `ApplyPlaybookBoost` (re-rank in place) AND `InjectPlaybookMisses` (insert recorded answers not in regular top-K) run. Don't revert to boost-only; v2 proved that path can't recover paraphrase queries.
- **Boost / inject use SEPARATE thresholds.** Boost stays at `DefaultPlaybookMaxDistance = 0.5` (safe — only re-ranks results already in retrieval). Inject uses tighter `DefaultPlaybookMaxInjectDistance = 0.20` (Shape B forces results in, so loose match cross-pollinates wrong-domain). Don't merge them.
- **Multi-coord product theory is empirically VALIDATED** by run #011 (Phase 3). Per-coordinator playbook namespaces (`playbook_alice` etc.) with cross-coordinator handover (Bob takes Alice's contract using `playbook_corpus=playbook_alice`) work end-to-end including paraphrase handover. Don't propose to "test if multi-coord works" — it does.
- **Fresh content uses two-tier indexing.** Fresh resumes go to `fresh_workers` corpus, not the main `workers` index. coder/hnsw incremental adds to a populated 5K+ graph have recall issues; the small hot index has no such crowding. Periodic merge (post-G3) consolidates fresh→main. Canonical NRT pattern.
- **`embedd.default_model = "nomic-embed-text-v2-moe"`** (475M MoE, 768d). Don't bump to `nomic-embed-text` (137M) "for speed" — diversity scores fell from 0.000 → 0.080 same-role-across-contracts on the smaller model. Cost (~5× slower ingest) is acceptable for once-per-deploy work.
- **Inbox flow: parse + search + judge + trace.** `/v1/observer/inbox` records the body; coordinator/driver parses via LLM (qwen2.5 format=json), runs `matrix.search` on the parsed query, judges top-1 against the ORIGINAL body, emits Langfuse spans through it all. Don't replace the judge re-rate with a distance-only gate — tight distance + low rating is the load-bearing honesty signal.
- `local_judge = "qwen2.5:latest"` for the lift loop — qwen3.5:latest is a vision-SSM 256K-ctx build that ran 30s/judge. Don't bump to qwen3.5 again "for cleanliness." Different model in the same tier has different cost geometry.
- `qwen3.5:latest` IS available locally on this box. Opus's hub-only knowledge is a known-stale signal; the chatd_smoke uses it daily.
- `temperature` is **omitted** for Anthropic 4.7 (handled by `Request.Temperature *float64`); don't re-add it.
- chatd-smoke runs with **all cloud providers disabled** intentionally so the suite doesn't depend on API keys; that's why it can't catch B-3-class bugs (those need a fake-server fixture, see Sprint 0 follow-up).
- **Langfuse Go-side client lives at `internal/langfuse/`** with best-effort fail-open posture. URL+creds from `/etc/lakehouse/langfuse.env`. Don't propose to "wire Langfuse on Go side" — it's wired; multi_coord_stress is the proof.
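
The `temperature` bullet relies on a pointer field so that "not set" and "set to zero" stay distinguishable. A minimal sketch, assuming a JSON tag of `temperature,omitempty`; the `Request` struct here is a trimmed-down stand-in, and only the `Temperature *float64` type comes from the text.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request is a hypothetical, reduced chat request. The JSON tags are assumptions.
type Request struct {
	Model       string   `json:"model"`
	Temperature *float64 `json:"temperature,omitempty"`
}

// encode marshals a request and returns the wire form as a string.
func encode(r Request) string {
	b, err := json.Marshal(r)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	zero := 0.0
	fmt.Println(encode(Request{Model: "m"}))                     // {"model":"m"}
	fmt.Println(encode(Request{Model: "m", Temperature: &zero})) // {"model":"m","temperature":0}
}
```

With a plain `float64` and `omitempty`, an explicit 0 would be dropped as well; the pointer lets a nil value mean "leave the field out of the request entirely".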
---
@ -216,11 +182,9 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
| Item | What | When to act |
|---|---|---|
| **Real-time 48-hour clock** | Multi-coord stress fires phases as discrete steps with simulated-hour labels (0/6/12/18/24/30/36/42/48). A real-time variant would space events on actual wall-clock with `time.Sleep`, simulating the rhythm of a coordinator workday. Cosmetic; doesn't change product behavior — but lets the harness mimic the cadence at which inbox events would arrive in production. ~30 min. | When stress test starts being used to capture realistic per-hour throughput numbers. |
| **Wider Langfuse instrumentation across daemons** | multi_coord_stress traces every external call, but the daemons themselves (matrixd, observerd, chatd) don't yet emit traces from their own request handlers. Adding `internal/langfuse/middleware.go` would auto-emit a span per HTTP request, giving production-traffic visibility for free. | When production traffic starts hitting the Go gateway. |
| **Periodic fresh→main index merge** | Two-tier `fresh_workers` pattern is verified working but no scheduled job consolidates fresh→main. Fresh corpus grows monotonically; eventually has its own recall issues. A daily cron that ingests the fresh corpus' contents into the main `workers` index + drops fresh would solve it. | When fresh_workers grows past ~500 items. |
| **Adjacent-query cross-pollination (lift suite Q6↔Q7)** | After lift v4's split threshold, OOD cross-pollination is gone but Q6 / Q7 still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Multi-coord run #008 inbox-judge re-rating proved the judge can distinguish — gating injection on "judge approves before injecting" closes this. ~1 hr. | When playbook injection quality matters more than retrieval throughput. |
| **Liberal-paraphrase recovery loss (lift suite Q9, Q15)** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt or a per-pair `paraphrase_max_drift` measurement. | When real coordinator queries are available for a calibration run. |
| **Reality test v4: re-judge warm results** | Current lift metric counts "warm top-1 == cold judge best." Shape B cross-pollinates recorded answers across queries (run #003 dropped verbatim-lift 7→2 because w-4435 surfaced as warm top-1 for several queries whose cold-judge-best was a different worker). True quality lift would re-judge warm results and measure judge agreement. ~30 min — adds a 4th pass to the driver. | Next time we want to measure Shape B's actual quality, not rank-of-original-judge-best. |
| **Adjacent-query cross-pollination** | After v4's split threshold, OOD cross-pollination is gone but Q6 ("Forklift-certified loader") ↔ Q7 ("Hazmat-certified warehouse worker, cold storage") still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Likely correct (genuinely similar staffing queries), but worth verifying with the v4 re-judge metric — if the judge agrees both directions are good matches, accept; if not, tighten further (e.g. 0.15) or add a same-query-only mode. | Co-decision with v4 re-judge. |
| **Liberal-paraphrase recovery loss** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt (more conservative wording variation) or a per-pair `paraphrase_max_drift` measurement. Cosmetic vs. real depends on whether realistic coordinator queries drift like qwen2.5's rephrases do. | When real coordinator queries are available for a calibration run. |
| **Sprint 4 — deployment** | No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/<bin>.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. |
| **ADR-006 — auth posture for non-loopback deploy** | Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr. | Required before any Go binary binds non-loopback in prod. |
| **chatd fixture-mode storage half** | `g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. |
@ -249,17 +213,6 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
| `154a72e` | matrix: Shape B (`InjectPlaybookMisses`) — 6/6 paraphrase recovery in run #003 |
| `94fc3b6` | STATE_OF_PLAY: capture v3 reality test + Shape B + cross-pollination |
| `67d1957` | matrix: split boost/inject thresholds (0.5 / 0.20) — kills cross-pollination, run #004 6/8+6/8 |
| `b13b5cd` | playbook_lift v4 metric: warm-top-1 re-judge → quality lift +24%/-14% (5 lifted / 13 neutral / 3 regressed) |
| `61c7b55` | multi-coord stress harness (Phase 1) — 3 coords / 3 contracts / 7-phase scenario |
| `0fa42a0` | multi-coord stress Phase 1.5 — shared-role contracts + paraphrase handover |
| `84a32f0` | multi-coord stress Phase 2 — `ExcludeIDs`, 200-worker swap, fresh-resume |
| `4da32ad` | embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in) |
| `e7fc63b` | observerd `/observer/inbox` + multi-coord stress phase 1c (priority-ordered events) |
| `186d209` | multi_coord_stress: LLM-parsed inbox demands (qwen2.5 format=json) |
| `ce940f4` | multi_coord_stress: judge re-rates inbox top-1 against original body — recovers OOD honesty signal |
| `7e6431e` | langfuse: Go-side client + Phase 1c instrumentation |
| `08a0867` | multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1 (3/3) |
| `5d49967` | multi_coord_stress: full Langfuse coverage — every phase + every call (111 observations) |
Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).
@ -93,62 +93,6 @@ func (h *handlers) register(r chi.Router) {
	r.Post("/observer/event", h.handleEvent)
	r.Post("/observer/workflow/run", h.handleWorkflowRun)
	r.Get("/observer/workflow/modes", h.handleWorkflowModes)
	r.Post("/observer/inbox", h.handleInbox)
}

// inboxMessage is the POST /observer/inbox body: an incoming
// real-world signal (email or SMS) that a coordinator would receive
// and act on. The handler only RECORDS it as an ObservedOp; whether
// to trigger a downstream matrix.search or workflow is the caller's
// concern. Keeps observer's witness role pure.
type inboxMessage struct {
	Type     string `json:"type"` // "email" | "sms"
	Sender   string `json:"sender"`
	Subject  string `json:"subject,omitempty"`
	Body     string `json:"body"`
	Priority string `json:"priority"` // "urgent" | "high" | "medium" | "low"
	Tag      string `json:"tag,omitempty"`
}

func (h *handlers) handleInbox(w http.ResponseWriter, r *http.Request) {
	var msg inboxMessage
	if !decodeJSON(w, r, &msg) {
		return
	}
	if msg.Type != "email" && msg.Type != "sms" {
		http.Error(w, "type must be 'email' or 'sms'", http.StatusBadRequest)
		return
	}
	if strings.TrimSpace(msg.Body) == "" {
		http.Error(w, "body required", http.StatusBadRequest)
		return
	}
	if msg.Priority == "" {
		msg.Priority = "medium"
	}
	op := observer.ObservedOp{
		Endpoint:      "/observer/inbox/" + msg.Type,
		InputSummary:  fmt.Sprintf("from=%s priority=%s tag=%s subject=%s", msg.Sender, msg.Priority, msg.Tag, msg.Subject),
		OutputSummary: msg.Body,
		Source:        observer.SourceInbox,
		Success:       true,
	}
	if err := h.store.Record(op); err != nil {
		if errors.Is(err, observer.ErrInvalidOp) {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		slog.Error("observer record inbox", "err", err)
		http.Error(w, "internal", http.StatusInternalServerError)
		return
	}
	stats := h.store.Stats()
	writeJSON(w, http.StatusOK, map[string]any{
		"accepted":  true,
		"type":      msg.Type,
		"priority":  msg.Priority,
		"ring_size": stats.Total,
	})
}

func (h *handlers) handleStats(w http.ResponseWriter, _ *http.Request) {
@ -4,7 +4,6 @@ import (
	"bytes"
	"net/http"
	"net/http/httptest"
	"strings"
	"testing"
	"time"
@ -39,7 +38,6 @@ func TestRoutesMounted(t *testing.T) {
		"POST /observer/event": false,
		"POST /observer/workflow/run": false,
		"GET /observer/workflow/modes": false,
		"POST /observer/inbox": false,
	}
	_ = chi.Walk(r, func(method, route string, _ http.Handler, _ ...func(http.Handler) http.Handler) error {
		key := method + " " + route
@ -167,51 +165,6 @@ func TestWorkflowRun_AllProvenanceRecordedPostRun(t *testing.T) {
	}
}

// TestInbox_AcceptsValidEmail locks the happy-path contract for the
// /observer/inbox route: accepts an email message with required
// fields, records as ObservedOp, returns 200 with ring-size.
func TestInbox_AcceptsValidEmail(t *testing.T) {
	r := newTestRouter(t)
	body := []byte(`{"type":"email","sender":"client@northstar.com","subject":"URGENT: 50 forklift ops","body":"Need 50 forklift operators in Cleveland OH for next week. Day shift.","priority":"urgent","tag":"alpha-surge"}`)
	req := httptest.NewRequest("POST", "/observer/inbox", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	w := httptest.NewRecorder()
	r.ServeHTTP(w, req)
	if w.Code != http.StatusOK {
		t.Fatalf("expected 200, got %d (body=%s)", w.Code, w.Body.String())
	}
	if !strings.Contains(w.Body.String(), `"accepted":true`) {
		t.Errorf("expected accepted=true, got %s", w.Body.String())
	}
}

// TestInbox_RejectsBadType locks the validation: type must be
// "email" or "sms", anything else is 400.
func TestInbox_RejectsBadType(t *testing.T) {
	r := newTestRouter(t)
	body := []byte(`{"type":"smoke-signal","sender":"x","body":"y","priority":"high"}`)
	req := httptest.NewRequest("POST", "/observer/inbox", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	w := httptest.NewRecorder()
	r.ServeHTTP(w, req)
	if w.Code != http.StatusBadRequest {
		t.Errorf("expected 400 on bad type, got %d", w.Code)
	}
}

// TestInbox_RejectsEmptyBody locks the body-required invariant.
func TestInbox_RejectsEmptyBody(t *testing.T) {
	r := newTestRouter(t)
	body := []byte(`{"type":"email","sender":"x","body":"","priority":"high"}`)
	req := httptest.NewRequest("POST", "/observer/inbox", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	w := httptest.NewRecorder()
	r.ServeHTTP(w, req)
	if w.Code != http.StatusBadRequest {
		t.Errorf("expected 400 on empty body, got %d", w.Code)
	}
}

// TestWorkflowRun_UnknownMode locks the 400 path on workflow definitions
// that reference modes not registered with the runner. The harness's
// reality test runs depend on this so an unknown-mode misconfiguration
@ -1,217 +0,0 @@
// Package langfuse is a minimal Go-side client for the Langfuse v2
// ingestion API. Mirrors the surface area we need from the Rust
// crates/gateway/src/v1/langfuse_trace.rs emitter: Trace + Span,
// nothing else yet (no scores, no observations, no datasets).
//
// Auth is Basic over public_key:secret_key. URL + creds come from
// /etc/lakehouse/langfuse.env in production; tests can pass any URL.
//
// Best-effort transport: errors are logged but don't fail the calling
// path. Lakehouse's internal services should never go down because
// Langfuse is unreachable.
package langfuse

import (
	"bytes"
	"context"
	"crypto/rand"
	"encoding/base64"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"log/slog"
	"net/http"
	"sync"
	"time"
)

// Client posts traces + spans to Langfuse's ingestion endpoint.
// Events are buffered and flushed in batches. Always call Flush
// before exit; Close also flushes.
type Client struct {
	url      string
	auth     string // pre-encoded "Basic ..."
	hc       *http.Client
	mu       sync.Mutex
	pending  []event
	maxBatch int
}

// New constructs a Client. URL like "http://localhost:3001"; creds
// from langfuse.env. nil hc → uses default with 5s timeout.
func New(url, publicKey, secretKey string, hc *http.Client) *Client {
	if hc == nil {
		hc = &http.Client{Timeout: 5 * time.Second}
	}
	auth := "Basic " + base64.StdEncoding.EncodeToString([]byte(publicKey+":"+secretKey))
	return &Client{
		url:      url,
		auth:     auth,
		hc:       hc,
		maxBatch: 50,
	}
}

// NewID returns a hex string suitable as a trace/span id. Langfuse
// accepts arbitrary strings; a 16-byte random hex is unambiguous.
func NewID() string {
	b := make([]byte, 16)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

// event is one Langfuse ingestion envelope. Body shape varies by
// type (trace-create vs span-create); we use map[string]any to
// keep the wire shape declarative.
type event struct {
	ID        string         `json:"id"`
	Type      string         `json:"type"` // "trace-create" | "span-create"
	Timestamp string         `json:"timestamp"`
	Body      map[string]any `json:"body"`
}

// TraceInput is what callers fill in when starting a trace.
type TraceInput struct {
	Name     string
	UserID   string
	Input    any
	Metadata map[string]any
	Tags     []string
}

// Trace records a top-level trace. Returns the trace id so callers
// can attach spans. Best-effort: errors are logged and the trace
// id is still returned so callers don't need error-handling for the
// common case.
func (c *Client) Trace(ctx context.Context, t TraceInput) string {
	id := NewID()
	body := map[string]any{
		"id":   id,
		"name": t.Name,
	}
	if t.UserID != "" {
		body["userId"] = t.UserID
	}
	if t.Input != nil {
		body["input"] = t.Input
	}
	if t.Metadata != nil {
		body["metadata"] = t.Metadata
	}
	if len(t.Tags) > 0 {
		body["tags"] = t.Tags
	}
	c.queue(event{
		ID:        NewID(),
		Type:      "trace-create",
		Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
		Body:      body,
	})
	return id
}

// SpanInput is what callers fill in when recording a span.
type SpanInput struct {
	TraceID    string
	ParentID   string // optional, for nested spans
	Name       string
	Input      any
	Output     any
	Metadata   map[string]any
	StartTime  time.Time
	EndTime    time.Time
	StatusCode int    // 0 = success, anything else = error code
	Level      string // "DEBUG" | "DEFAULT" | "WARNING" | "ERROR"
}

// Span records one span attached to a trace. Returns the span id.
func (c *Client) Span(ctx context.Context, s SpanInput) string {
	id := NewID()
	body := map[string]any{
		"id":      id,
		"traceId": s.TraceID,
		"name":    s.Name,
	}
	if s.ParentID != "" {
		body["parentObservationId"] = s.ParentID
	}
	if s.Input != nil {
		body["input"] = s.Input
	}
	if s.Output != nil {
		body["output"] = s.Output
	}
	if s.Metadata != nil {
		body["metadata"] = s.Metadata
	}
	if !s.StartTime.IsZero() {
		body["startTime"] = s.StartTime.UTC().Format(time.RFC3339Nano)
	}
	if !s.EndTime.IsZero() {
		body["endTime"] = s.EndTime.UTC().Format(time.RFC3339Nano)
	}
	if s.Level != "" {
		body["level"] = s.Level
	}
	if s.StatusCode != 0 {
		body["statusMessage"] = fmt.Sprintf("status_code=%d", s.StatusCode)
	}
	c.queue(event{
		ID:        NewID(),
		Type:      "span-create",
		Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
		Body:      body,
	})
	return id
}

func (c *Client) queue(e event) {
	c.mu.Lock()
	c.pending = append(c.pending, e)
	shouldFlush := len(c.pending) >= c.maxBatch
	c.mu.Unlock()
	if shouldFlush {
		_ = c.Flush(context.Background())
	}
}

// Flush sends all queued events in one batch. Best-effort: returns
// the error but also logs; callers can ignore.
func (c *Client) Flush(ctx context.Context) error {
	c.mu.Lock()
	if len(c.pending) == 0 {
		c.mu.Unlock()
		return nil
	}
	batch := c.pending
	c.pending = nil
	c.mu.Unlock()

	body, err := json.Marshal(map[string]any{"batch": batch})
	if err != nil {
		slog.Warn("langfuse: marshal batch", "err", err, "n", len(batch))
		return err
	}
	req, err := http.NewRequestWithContext(ctx, "POST", c.url+"/api/public/ingestion", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", c.auth)
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.hc.Do(req)
	if err != nil {
		slog.Warn("langfuse: post", "err", err, "n", len(batch))
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 && resp.StatusCode != 207 {
		slog.Warn("langfuse: non-2xx", "status", resp.StatusCode, "n", len(batch))
		return fmt.Errorf("langfuse ingestion: HTTP %d", resp.StatusCode)
	}
	return nil
}

// Close flushes any remaining events. Idempotent.
func (c *Client) Close() error {
	return c.Flush(context.Background())
}
@ -85,15 +85,6 @@ type SearchRequest struct {
	PlaybookMaxDistance float64 `json:"playbook_max_distance,omitempty"`
	PlaybookMaxInjectDistance float64 `json:"playbook_max_inject_distance,omitempty"`
	MetadataFilter map[string]any `json:"metadata_filter,omitempty"`
	// ExcludeIDs filters out specific worker IDs post-retrieval.
	// Real-world driver: a coordinator places 200 workers at a
	// contract, then mid-day the client asks for a different set, so
	// the next query should NOT return the already-placed workers.
	// Filter runs after merge but before metadata filter, so an
	// excluded ID never wastes a slot in the post-filter top-K.
	// Also applies to playbook boost + Shape B inject: excluded
	// answers are skipped at injection time.
	ExcludeIDs []string `json:"exclude_ids,omitempty"`
}

// SearchResponse wraps the merged results plus per-corpus return
@ -213,25 +204,6 @@ func (r *Retriever) Search(ctx context.Context, req SearchRequest) (*SearchRespo
		return allHits[i].Distance < allHits[j].Distance
	})

	// ExcludeIDs filter: applied first so excluded IDs don't waste
	// a slot in the post-filter top-K. Real-world driver: coordinator
	// has placed N workers at a contract; mid-day the client asks for
	// alternatives, so this query passes ExcludeIDs=<placed_ids> and
	// gets back fresh candidates instead of the same N.
	if len(req.ExcludeIDs) > 0 {
		excludeSet := make(map[string]bool, len(req.ExcludeIDs))
		for _, id := range req.ExcludeIDs {
			excludeSet[id] = true
		}
		kept := make([]Result, 0, len(allHits))
		for _, h := range allHits {
			if !excludeSet[h.ID] {
				kept = append(kept, h)
			}
		}
		allHits = kept
	}

	// Metadata filter (component B: staffing-side structured gate).
	// Applied BEFORE top-K truncation so the filter doesn't accidentally
	// reduce coverage further. Caller can request larger PerCorpusK to
@ -267,23 +239,6 @@ func (r *Retriever) Search(ctx context.Context, req SearchRequest) (*SearchRespo
		if err != nil {
			slog.Warn("matrix: playbook lookup failed; skipping boost+inject", "err", err)
		} else if len(hits) > 0 {
			// Filter playbook hits to honor ExcludeIDs. Without this,
			// an excluded answer in a playbook recording would re-enter
			// the result set via Shape B inject, defeating the swap
			// semantics that the exclude list exists to enforce.
			if len(req.ExcludeIDs) > 0 {
				excludeSet := make(map[string]bool, len(req.ExcludeIDs))
				for _, id := range req.ExcludeIDs {
					excludeSet[id] = true
				}
				keptHits := make([]PlaybookHit, 0, len(hits))
				for _, h := range hits {
					if !excludeSet[h.Entry.AnswerID] {
						keptHits = append(keptHits, h)
					}
				}
				hits = keptHits
			}
			resp.PlaybookBoosted = ApplyPlaybookBoost(resp.Results, hits)
			maxInjectDist := float32(req.PlaybookMaxInjectDistance)
			if maxInjectDist <= 0 {
@ -41,12 +41,6 @@ const (
	// the workflow handler was casting a string literal to Source,
	// which worked coincidentally but left the taxonomy implicit.
	SourceWorkflow Source = "workflow"
	// SourceInbox tags ObservedOps emitted by /observer/inbox: incoming
	// real-world signals (email, SMS) that a coordinator would receive
	// and act on. The handler only RECORDS the message; downstream
	// triggers (e.g. matrix.search on the parsed demand) are the
	// caller's concern, recorded separately.
	SourceInbox Source = "inbox"
)

// ObservedOp is one entry in the observer's ring buffer (and JSONL
@ -43,7 +43,7 @@ bind = "127.0.0.1:3216"
# G2: Ollama local. G3+ may swap in OpenAI/Voyage by changing
|
||||
# this URL + the wire format inside the provider.
|
||||
provider_url = "http://localhost:11434"
|
||||
default_model = "nomic-embed-text-v2-moe"
|
||||
default_model = "nomic-embed-text"
|
||||
|
||||
[queryd]
|
||||
bind = "127.0.0.1:3214"
|
||||
@ -129,7 +129,7 @@ level = "info"
[models]
# Tier 1 — local hot path
local_fast = "qwen3.5:latest"
local_embed = "nomic-embed-text-v2-moe" # 475M MoE, drop-in upgrade from 137M v1 — verified 2026-04-30 same 768-dim
local_embed = "nomic-embed-text"
# local_judge stays on qwen2.5:latest — qwen3.5:latest is a vision-SSM
# build with 256K context that runs ~30s per judge call against the
# playbook_lift loop (verified 2026-04-30). qwen2.5:latest at ~1s/call
@ -1,77 +0,0 @@
# Multi-Coordinator Stress Test — Run 001

**Generated:** 2026-04-30T12:54:09.621556469Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 52
**Evidence:** `reports/reality-tests/multi_coord_stress_001.json`

---

## Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 0 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.003703703703703704 | 18 | Should be near-zero (different roles = different worker pools) |

**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
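
The Jaccard figures in the table above compare the top-K worker ID sets returned by two searches. The following is a minimal Go sketch of that calculation, not the harness's actual code: the function names, the worker IDs, and the conventions for empty inputs (two empty lists count as identical; zero pairs yield a mean of 0, which is consistent with the `0 | 0` row above) are all assumptions made for illustration.

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two top-K ID lists.
// Assumption: two empty lists are treated as identical (1.0).
func jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	if len(setA) == 0 && len(setB) == 0 {
		return 1
	}
	inter := 0
	for id := range setA {
		if _, ok := setB[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

// meanJaccard averages jaccard over every pair of result lists.
// Assumption: with no pairs to compare, the mean is reported as 0.
func meanJaccard(pairs [][2][]string) float64 {
	if len(pairs) == 0 {
		return 0
	}
	sum := 0.0
	for _, p := range pairs {
		sum += jaccard(p[0], p[1])
	}
	return sum / float64(len(pairs))
}

func main() {
	// Hypothetical top-K results for the same role on two contracts.
	milwaukee := []string{"w-1", "w-2", "w-3"}
	chicago := []string{"w-2", "w-3", "w-4"}
	fmt.Println(jaccard(milwaukee, chicago)) // 2 shared IDs out of 4 distinct
}
```

In this example the two lists share two of four distinct IDs, so the score is 0.5, right at the edge of the "cycling" range described above.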

---

## Determinism — same query reissued, top-K stability

| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |

**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

---

## Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
|---|---:|
| Handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 | 4 |
| Alice's recorded answer in Bob's top-K | 4 |
| **Handover hit rate (top-1)** | **1** |

**Interpretation:**
- Hit rate ≥ 0.5: handover is meaningful — the second coordinator inherits the first's institutional memory.
- Hit rate ≈ 0.0: playbook namespace isolation is working but the playbook itself isn't transferable, OR Bob's queries don't match Alice's recordings closely enough.

---

## Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:

```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_001.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_001.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_001.json
```

---

## What's NOT in this run (Phase 1 deliberately defers)

- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 002

**Generated:** 2026-04-30T13:02:13.570393819Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 56
**Evidence:** `reports/reality-tests/multi_coord_stress_002.json`

---

## Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.11900691900691901 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.003703703703703704 | 18 | Should be near-zero (different roles = different worker pools) |

**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.

---

## Determinism — same query reissued, top-K stability

| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |

**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

---

## Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |

**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
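
The top-1 and top-K hit rates in the table above can be derived from per-query results along the following lines. This is a hedged sketch rather than the harness's implementation: the `handoverQuery` type, its field names, and the sample worker IDs are hypothetical, introduced only to show the arithmetic.

```go
package main

import "fmt"

// handoverQuery is a hypothetical record: the answer ID Alice recorded,
// and the ranked IDs Bob's search returned for the matching query.
type handoverQuery struct {
	RecordedID string
	BobTopK    []string
}

// hitRates returns the fraction of queries where the recorded answer
// appears at rank 0 (top-1) and anywhere in the list (top-K).
func hitRates(qs []handoverQuery) (top1, topK float64) {
	if len(qs) == 0 {
		return 0, 0
	}
	var n1, nK int
	for _, q := range qs {
		for rank, id := range q.BobTopK {
			if id != q.RecordedID {
				continue
			}
			nK++
			if rank == 0 {
				n1++
			}
			break // count each query at most once
		}
	}
	total := float64(len(qs))
	return float64(n1) / total, float64(nK) / total
}

func main() {
	// Four hypothetical handover queries: two top-1 hits, one hit at
	// rank 2, and one miss.
	qs := []handoverQuery{
		{"w-10", []string{"w-10", "w-11"}},
		{"w-20", []string{"w-21", "w-20"}},
		{"w-30", []string{"w-31", "w-32"}},
		{"w-40", []string{"w-40"}},
	}
	top1, topK := hitRates(qs)
	fmt.Println(top1, topK)
}
```

Running the same function separately over the verbatim and paraphrase query sets would give the two bolded rates reported in the table.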

---

## Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:

```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_002.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_002.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_002.json
```

---

## What's NOT in this run (Phase 1 deliberately defers)

- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 003

**Generated:** 2026-04-30T13:13:44.35966865Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 61
**Evidence:** `reports/reality-tests/multi_coord_stress_003.json`

---

## Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.03068783068783069 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0 | 18 | Should be near-zero (different roles = different worker pools) |

**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.

---

## Determinism — same query reissued, top-K stability

| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |

**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
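
The three interpretation bands above map a mean reissue Jaccard onto a verdict. A small Go sketch of that mapping follows; the function name and the label strings are assumptions chosen for illustration, while the 0.95 and 0.80 cut-offs come from the list above.

```go
package main

import "fmt"

// stabilityBand classifies a mean reissue Jaccard score using the
// report's thresholds. The returned labels are illustrative only.
func stabilityBand(meanJaccard float64) string {
	switch {
	case meanJaccard >= 0.95:
		return "deterministic"
	case meanJaccard >= 0.80:
		return "acceptable variance"
	default:
		return "unstable"
	}
}

func main() {
	for _, v := range []float64{1, 0.9, 0.5} {
		fmt.Printf("%.2f -> %s\n", v, stabilityBand(v))
	}
}
```

A check like this could gate a run automatically, for example by failing when the band is "unstable", though whether the harness does so is not stated in the report.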

---

## Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |

**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.

---

## Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:

```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_003.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_003.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_003.json
```

---

## What's NOT in this run (Phase 1 deliberately defers)

- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 004

**Generated:** 2026-04-30T13:17:03.577877974Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 61
**Evidence:** `reports/reality-tests/multi_coord_stress_004.json`

---

## Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.08013468013468013 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.012820512820512822 | 18 | Should be near-zero (different roles = different worker pools) |

**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.

---

## Determinism — same query reissued, top-K stability

| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |

**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

---

## Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |

**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.

---

## Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:

```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_004.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_004.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_004.json
```

---

## What's NOT in this run (Phase 1 deliberately defers)

- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 005

**Generated:** 2026-04-30T13:25:15.497712275Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 61
**Evidence:** `reports/reality-tests/multi_coord_stress_005.json`

---

## Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.03610093610093609 | 18 | Should be near-zero (different roles = different worker pools) |

**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.

---

## Determinism — same query reissued, top-K stability

| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |

**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

---

## Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |

**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.

---

## Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:

```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_005.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_005.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_005.json
```

---

## What's NOT in this run (Phase 1 deliberately defers)

- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 006

**Generated:** 2026-04-30T13:33:24.568124731Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_006.json`

---

## Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.04603174603174603 | 18 | Should be near-zero (different roles = different worker pools) |

**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.

---

## Determinism — same query reissued, top-K stability

| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |

**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

---

## Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |

**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.

---

## Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:

```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_006.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_006.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_006.json
```

---

## What's NOT in this run (Phase 1 deliberately defers)

- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 007

**Generated:** 2026-04-30T19:50:04.791000091Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_007.json`

---

## Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.026455026455026454 | 18 | Should be near-zero (different roles = different worker pools) |

**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.

---

## Determinism — same query reissued, top-K stability

| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |

**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

---

## Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |

**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.

---

## Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:

```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_007.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_007.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_007.json
```

---

## What's NOT in this run (Phase 1 deliberately defers)

- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 008

**Generated:** 2026-04-30T21:15:37.045817146Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_008.json`

---

## Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.04126984126984126 | 18 | Should be near-zero (different roles = different worker pools) |

**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
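
The Jaccard figures in this report compare two top-K worker-ID sets as |A ∩ B| / |A ∪ B|. The sketch below shows that ratio in bash, assuming each set is passed as a space-separated string of IDs. The `jaccard` helper and the sample IDs are hypothetical and not part of the harness; how the Go driver actually computes the metric is not shown in this report.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the harness): Jaccard similarity of two
# top-K ID sets, each given as a space-separated string of worker IDs.
jaccard() {
  local inter union
  # Count IDs present in both sets; grep -c . ignores empty lines.
  inter=$(comm -12 <(tr -s ' ' '\n' <<<"$1" | sort -u) \
                   <(tr -s ' ' '\n' <<<"$2" | sort -u) | grep -c . || true)
  # Count distinct IDs across both sets.
  union=$(tr -s ' ' '\n' <<<"$1 $2" | sort -u | grep -c . || true)
  # Two empty sets are reported as 0 to avoid dividing by zero.
  LC_ALL=C awk -v i="$inter" -v u="$union" \
    'BEGIN { if (u == 0) print 0; else printf "%.4f\n", i / u }'
}

# Illustrative IDs only: 2 shared workers out of 6 distinct ones.
jaccard "w-1 w-2 w-3 w-4" "w-3 w-4 w-5 w-6"
```

Under this definition, two result sets with no workers in common score 0 and identical sets score 1, which is how the diversity and determinism rows should be read.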

---

## Determinism — same query reissued, top-K stability

| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |

**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

---

## Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |

**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
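
The hit-rate rows reduce to a rank lookup per handover query: find where Alice's recorded answer lands in Bob's top-K, then divide the number of rank-0 hits by the number of queries run. The sketch below illustrates that arithmetic in bash. The `rank_of` and `hit_rate` helpers and the sample IDs are hypothetical, and returning -1 for a missing answer is an assumption made for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the harness).

# rank_of <recorded_id> <top-K ids...>: 0-based rank of the recorded answer,
# or -1 when it is absent from the list.
rank_of() {
  local want="$1" i=0 id
  shift
  for id in "$@"; do
    if [ "$id" = "$want" ]; then
      echo "$i"
      return 0
    fi
    i=$((i + 1))
  done
  printf '%d\n' -1
}

# hit_rate <hits> <queries_run>: fraction of queries that hit, 0 if none ran.
hit_rate() {
  LC_ALL=C awk -v h="$1" -v n="$2" \
    'BEGIN { if (n == 0) print 0; else printf "%.2f\n", h / n }'
}

# Illustrative data: the recorded answer first, then Bob's top-K for that query.
hits=0 run=0
while read -r recorded topk; do
  run=$((run + 1))
  # $topk is left unquoted on purpose so it splits into separate IDs.
  if [ "$(rank_of "$recorded" $topk)" = "0" ]; then
    hits=$((hits + 1))
  fi
done <<'DATA'
w-11 w-11 w-20 w-31
w-12 w-40 w-12 w-33
DATA
echo "top-1 hit rate: $(hit_rate "$hits" "$run")"
```

A top-K variant would count any rank other than -1 as a hit, which is why the top-K count can never be lower than the top-1 count.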

---

## Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:

```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_008.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_008.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_008.json
```

---

## What's NOT in this run (Phase 1 deliberately defers)

- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 009

**Generated:** 2026-04-30T21:23:59.011167722Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_009.json`

---

## Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.015873015873015872 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.015343915343915345 | 18 | Should be near-zero (different roles = different worker pools) |

**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.

---

## Determinism — same query reissued, top-K stability

| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |

**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

---

## Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |

**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.

---

## Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:

```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_009.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_009.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_009.json
```

---

## What's NOT in this run (Phase 1 deliberately defers)

- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 010

**Generated:** 2026-04-30T21:30:38.434794788Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_010.json`

---

## Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.007407407407407408 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.026455026455026454 | 18 | Should be near-zero (different roles = different worker pools) |

**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.

---

## Determinism — same query reissued, top-K stability

| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |

**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

---

## Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |

**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.

---

## Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:

```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_010.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_010.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_010.json
```

---

## What's NOT in this run (Phase 1 deliberately defers)

- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
@ -1,82 +0,0 @@
# Multi-Coordinator Stress Test — Run 011

**Generated:** 2026-04-30T21:41:26.801002955Z
**Coordinators:** alice / bob / carol (each with own playbook namespace: `playbook_alice` / `playbook_bob` / `playbook_carol`)
**Contracts:** alpha_milwaukee_distribution / beta_indianapolis_manufacturing / gamma_chicago_construction
**Corpora:** `workers,ethereal_workers`
**K per query:** 8
**Total events captured:** 67
**Evidence:** `reports/reality-tests/multi_coord_stress_011.json`

---

## Diversity — is the system locking into scenarios or cycling?

| Metric | Mean Jaccard | n pairs | Interpretation |
|---|---:|---:|---|
| Same role across different contracts | 0.025641025641025644 | 9 | Lower = more diverse (different region/cert mix → different workers) |
| Different roles within same contract | 0.06996336996336996 | 18 | Should be near-zero (different roles = different worker pools) |

**Healthy ranges:**
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
- Different roles same contract: < 0.10 means role-specific retrieval is working.
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.

---

## Determinism — same query reissued, top-K stability

| Metric | Value |
|---|---:|
| Mean Jaccard on retrieval-only reissue | 1 |
| Number of reissue pairs | 12 |

**Interpretation:**
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).

---

## Learning — handover hit rate

Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?

| Metric | Value |
|---|---:|
| Verbatim handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (verbatim) | 4 |
| Alice's recorded answer in Bob's top-K (verbatim) | 4 |
| **Verbatim handover hit rate (top-1)** | **1** |
| Paraphrase handover queries run | 4 |
| Alice's recorded answer at Bob's top-1 (paraphrase) | 4 |
| Alice's recorded answer in Bob's top-K (paraphrase) | 4 |
| **Paraphrase handover hit rate (top-1)** | **1** |

**Interpretation:**
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.

---

## Per-event capture

All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:

```bash
jq '.events[] | select(.phase == "merge")' reports/reality-tests/multi_coord_stress_011.json
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' reports/reality-tests/multi_coord_stress_011.json
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' reports/reality-tests/multi_coord_stress_011.json
```

---

## What's NOT in this run (Phase 1 deliberately defers)

- **48-hour clock.** Events fire as discrete steps, not on a timeline.
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
- **New-resume injection mid-run.** The corpus is fixed at the start.
- **Langfuse traces.** Need Go-side wiring.

These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
@ -1,120 +0,0 @@
# Playbook-Lift Reality Test — Run 005

**Generated:** 2026-04-30T12:40:48.475901847Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/playbook_lift_queries.txt` (21 executed)
**K per pass:** 10
**Paraphrase pass:** ENABLED
**Re-judge pass:** ENABLED
**Evidence:** `reports/reality-tests/playbook_lift_005.json`

---

## Headline

| Metric | Value |
|---|---:|
| Total queries run | 21 |
| Cold-pass discoveries (judge-best ≠ top-1) | 7 |
| Warm-pass lifts (recorded playbook → top-1) | 5 |
| No change (judge-best already top-1, no playbook needed) | 16 |
| Playbook boosts triggered (warm pass) | 9 |
| Mean Δ top-1 distance (warm − cold) | -0.076170966 |
| **Paraphrase pass — recorded answer at rank 0 (top-1)** | **5 / 7** |
| Paraphrase pass — recorded answer at any rank in top-K | 5 / 7 |
| **Quality lift** (warm top-1 rating > cold top-1 rating) | **5 / 21** |
| Quality neutral (warm top-1 rating = cold top-1 rating) | 13 / 21 |
| Quality regressed (warm top-1 rating < cold top-1 rating) | 3 / 21 |

**Verbatim lift rate:** 5 of 7 discoveries became top-1 after warm pass.

---

## Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehouse experience, day sh | e-5670 | 2/4 | ✓ e-5729 | e-5729 | 0 | **YES** |
| 2 | OSHA-30 certified forklift operator in Wisconsin, cold stora | e-6293 | 7/3 | — | w-1566 | 8 | no |
| 3 | Production worker with confined-space cert and hazmat traini | w-602 | 0/2 | — | w-3575 | 1 | no |
| 4 | CDL Class A driver, clean record, willing to do regional 4-d | w-3854 | 0/1 | — | w-3854 | 0 | no |
| 5 | Warehouse lead with current OSHA-30 certification, NOT OSHA- | w-1807 | 6/3 | — | w-1807 | 6 | no |
| 6 | Forklift-certified loader, certification must be active, dis | w-1807 | 3/4 | ✓ w-205 | w-4257 | 1 | no |
| 7 | Hazmat-certified warehouse worker comfortable with cold stor | e-4910 | 2/4 | ✓ w-4257 | w-205 | 1 | no |
| 8 | Bilingual production worker with team-lead experience and tr | w-4988 | 0/4 | — | w-4988 | 0 | no |
| 9 | Inventory specialist with confined-space cert and compliance | w-388 | 3/4 | ✓ w-3575 | w-3575 | 0 | **YES** |
| 10 | Warehouse worker who can run inventory cycles and lead a sma | e-3011 | 0/4 | — | e-3011 | 0 | no |
| 11 | Production line worker comfortable filling in as line superv | w-1387 | 0/4 | — | e-5729 | 1 | no |
| 12 | Customer service rep willing to cross-train into dispatch or | w-1451 | 0/2 | — | w-1451 | 0 | no |
| 13 | Reliable production line lead with strong attendance and lea | e-7360 | 5/4 | ✓ w-2886 | w-2886 | 0 | **YES** |
| 14 | Highly responsive forklift operator available for last-minut | e-6108 | 5/4 | ✓ w-1566 | w-1566 | 0 | **YES** |
| 15 | Engaged warehouse associate with strong safety compliance re | e-2743 | 2/4 | ✓ w-49 | w-49 | 0 | **YES** |
| 16 | CDL-A driver based in IL or WI, willing to run regional 4-da | w-2486 | 5/2 | — | w-2486 | 5 | no |
| 17 | Bilingual customer service rep in Indianapolis or Cincinnati | e-9749 | 9/2 | — | e-9749 | 9 | no |
| 18 | Production supervisor open to Midwest relocation for permane | w-379 | 6/3 | — | w-379 | 6 | no |
| 19 | Dental hygienist with three years experience, Indianapolis a | e-6772 | 0/1 | — | w-3575 | 1 | no |
| 20 | Registered nurse with ICU experience, willing to take per-di | w-379 | 0/1 | — | w-379 | 0 | no |
| 21 | Software engineer with React and TypeScript, three years exp | w-1773 | 0/1 | — | w-1773 | 0 | no |

---

## Paraphrase pass — does the playbook help similar-but-different queries?

For each query whose Pass 1 cold pass recorded a playbook entry, the
judge model rephrased the query, and the rephrased version was sent
through warm matrix.search. The recorded answer ID's rank in those
results tests whether cosine on the embedded paraphrase finds the
recorded query's vector.

| # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift |
|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehous | Seeking forklift operator certified in OSHA-30, looking for | e-5729 | e-5729 | 0 | **YES** |
| 6 | Forklift-certified loader, certification | Loader requiring active forklift certification, this must no | w-205 | w-205 | 0 | **YES** |
| 7 | Hazmat-certified warehouse worker comfor | Warehouse worker with Hazmat certification and experience in | w-4257 | w-4257 | 0 | **YES** |
| 9 | Inventory specialist with confined-space | Specialist in inventory management requiring certified confi | w-3575 | w-49 | -1 | no |
| 13 | Reliable production line lead with stron | Experienced production line supervisor with excellent punctu | w-2886 | w-2886 | 0 | **YES** |
| 14 | Highly responsive forklift operator avai | Available forklift operator ready for urgent shift coverage | w-1566 | w-1566 | 0 | **YES** |
| 15 | Engaged warehouse associate with strong | Warehouse associate dedicated to engagement and boasting a r | w-49 | w-984 | -1 | no |

---

## Honesty caveats

1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
   judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
   the lift number is meaningless. To validate the judge itself, sample 5–10
   verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
   `distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
   result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
   even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
   case — same query, recorded playbook, expected boost. The paraphrase
   pass (when enabled) is the actual learning property: similar-but-different
   queries hitting a recorded playbook. Compare verbatim and paraphrase
   lift rates — paraphrase should be lower (semantic-distance gates some
   playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
   results land in one corpus, the matrix layer's purpose isn't being tested.
   Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from
   env JUDGE_MODEL=qwen2.5:latest.
   Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates
   relevance also rephrases queries. A judge that's bad at rating staffing
   queries is probably also bad at rephrasing them. Worth sanity-checking
   a sample of `paraphrase_query` values in the JSON before trusting the
   paraphrase lift number.
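
Caveat 2 can be checked numerically. The sketch below applies the stated formula `distance' = distance × (1 - 0.5 × score)` and tests whether a boosted judge-best result would overtake the cold top-1. It assumes the cold top-1 itself receives no boost and treats an exact tie as no lift; both are assumptions made for this example. The `boosted` and `can_lift` helpers and the sample distances are hypothetical, not part of the harness.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the harness).

# boosted <distance> <score>: apply distance' = distance * (1 - 0.5 * score).
boosted() {
  LC_ALL=C awk -v d="$1" -v s="$2" 'BEGIN { printf "%.4f\n", d * (1 - 0.5 * s) }'
}

# can_lift <judge_best_distance> <cold_top1_distance> <score>: succeeds when
# the boosted judge-best distance drops below the (unboosted) cold top-1.
can_lift() {
  LC_ALL=C awk -v b="$1" -v t="$2" -v s="$3" \
    'BEGIN { exit !(b * (1 - 0.5 * s) < t) }'
}

# Illustrative distances: cold top-1 at 0.25, judge-best candidates at 0.40 and 0.60.
for d in 0.40 0.60; do
  if can_lift "$d" 0.25 1.0; then
    echo "judge-best at $d -> $(boosted "$d" 1.0): promoted past 0.25"
  else
    echo "judge-best at $d -> $(boosted "$d" 1.0): still behind the cold top-1"
  fi
done
```

With a score of 1.0 the candidate at 0.40 is within 2× of the top-1 distance and gets promoted, while the one at 0.60 is not, which is the "tight clusters, little visible lift" effect in reverse: the boost only matters when the gap is less than a factor of two.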

## Next moves

- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
  work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why — judge variance, distance gap too
  wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
  retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
  already close to optimal on this query distribution. Either the corpus
  is too narrow or the queries are too easy.
@ -76,14 +76,8 @@ DIM="$(echo "$RESP" | jq -r '.dimension')"
 N="$(echo "$RESP" | jq -r '.vectors | length')"
 MODEL="$(echo "$RESP" | jq -r '.model')"
 SAME="$(echo "$RESP" | jq -r '.vectors[0][0] == .vectors[1][0]')"
-# Accept any nomic-embed-text* family member as the default — v1
-# (137M, 768d) and v2-moe (475M MoE, 768d) are both supported drop-ins.
-# The smoke locks the dimension + the distinct-vectors property, NOT
-# the exact model name (operators bump the model in lakehouse.toml
-# without changing this smoke).
-case "$MODEL" in nomic-embed-text*) MODEL_OK=1 ;; *) MODEL_OK=0 ;; esac
-if [ "$DIM" = "768" ] && [ "$N" = "2" ] && [ "$MODEL_OK" = "1" ] && [ "$SAME" = "false" ]; then
-  echo " ✓ dim=768, model=$MODEL, 2 distinct vectors"
+if [ "$DIM" = "768" ] && [ "$N" = "2" ] && [ "$MODEL" = "nomic-embed-text" ] && [ "$SAME" = "false" ]; then
+  echo " ✓ dim=768, model=nomic-embed-text, 2 distinct vectors"
 else
   echo " ✗ resp: dim=$DIM n=$N model=$MODEL same=$SAME"; FAILED=1
 fi
@ -1,282 +0,0 @@
#!/usr/bin/env bash
# Multi-coordinator stress harness — Phase 1 of the 48-hour mock.
#
# Three coordinators (Alice / Bob / Carol) own three distinct contracts
# (Milwaukee distribution, Indianapolis manufacturing, Chicago
# construction). The driver fires phases:
#   1. baseline — each coord runs their contract's role queries
#   2. surge — each contract's demand doubles (URGENT phrasing)
#   3. merge — alpha + beta combined under alice
#   4. handover — bob takes alpha, USING alice's playbook namespace
#   5. split — alpha surge re-distributed across all 3 coords
#   6. reissue — non-determinism check: same baselines reissued
#   7. analysis — diversity + determinism + learning metrics
#
# Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints,
# and Langfuse wiring — those are Phase 2/3.
#
# Usage:
#   ./scripts/multi_coord_stress.sh # run #001
#   RUN_ID=002 ./scripts/multi_coord_stress.sh
#   K=12 ./scripts/multi_coord_stress.sh

set -euo pipefail
cd "$(dirname "$0")/.."

export PATH="$PATH:/usr/local/go/bin"

RUN_ID="${RUN_ID:-001}"
WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
ETHEREAL_LIMIT="${ETHEREAL_LIMIT:-0}"
CORPORA="${CORPORA:-workers,ethereal_workers}"
K="${K:-8}"

OUT_JSON="reports/reality-tests/multi_coord_stress_${RUN_ID}.json"
OUT_MD="reports/reality-tests/multi_coord_stress_${RUN_ID}.md"

if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
  echo "[stress] Ollama not reachable on :11434 — skipping (need it for embeddings)"
  exit 0
fi

echo "[stress] building binaries..."
go build -o bin/ ./cmd/storaged ./cmd/catalogd ./cmd/ingestd ./cmd/queryd \
  ./cmd/embedd ./cmd/vectord ./cmd/pathwayd ./cmd/observerd \
  ./cmd/matrixd ./cmd/gateway \
  ./scripts/staffing_workers ./scripts/multi_coord_stress

pkill -f "bin/(storaged|catalogd|ingestd|queryd|embedd|vectord|pathwayd|observerd|matrixd|gateway)" 2>/dev/null || true
sleep 0.3

PIDS=()
TMP="$(mktemp -d)"
CFG="$TMP/stress.toml"

cleanup() {
  echo "[stress] cleanup"
  for p in "${PIDS[@]:-}"; do [ -n "${p:-}" ] && kill "$p" 2>/dev/null || true; done
  rm -rf "$TMP"
}
trap cleanup EXIT INT TERM

cat > "$CFG" <<EOF
[s3]
endpoint = "http://localhost:9000"
region = "us-east-1"
bucket = "lakehouse-go-primary"
use_path_style = true

[gateway]
bind = "127.0.0.1:3110"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
ingestd_url = "http://127.0.0.1:3213"
queryd_url = "http://127.0.0.1:3214"
vectord_url = "http://127.0.0.1:3215"
embedd_url = "http://127.0.0.1:3216"
pathwayd_url = "http://127.0.0.1:3217"
matrixd_url = "http://127.0.0.1:3218"
observerd_url = "http://127.0.0.1:3219"

[storaged]
bind = "127.0.0.1:3211"

[catalogd]
bind = "127.0.0.1:3212"
storaged_url = "http://127.0.0.1:3211"

[ingestd]
bind = "127.0.0.1:3213"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
max_ingest_bytes = 268435456

[queryd]
bind = "127.0.0.1:3214"
catalogd_url = "http://127.0.0.1:3212"
secrets_path = "/etc/lakehouse/secrets-go.toml"
refresh_every = "1s"

[embedd]
bind = "127.0.0.1:3216"
provider_url = "http://localhost:11434"
default_model = "nomic-embed-text-v2-moe"

[vectord]
bind = "127.0.0.1:3215"
storaged_url = ""

[pathwayd]
bind = "127.0.0.1:3217"
persist_path = ""

[observerd]
bind = "127.0.0.1:3219"
persist_path = ""

[matrixd]
bind = "127.0.0.1:3218"
embedd_url = "http://127.0.0.1:3216"
vectord_url = "http://127.0.0.1:3215"
EOF

poll_health() {
  local port="$1" deadline=$(($(date +%s) + 5))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if curl -sS --max-time 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then return 0; fi
    sleep 0.05
  done
  return 1
}

echo "[stress] launching stack..."
./bin/storaged -config "$CFG" > /tmp/stress_storaged.log 2>&1 & PIDS+=($!); poll_health 3211 || { echo "storaged failed"; exit 1; }
./bin/catalogd -config "$CFG" > /tmp/stress_catalogd.log 2>&1 & PIDS+=($!); poll_health 3212 || { echo "catalogd failed"; exit 1; }
./bin/ingestd -config "$CFG" > /tmp/stress_ingestd.log 2>&1 & PIDS+=($!); poll_health 3213 || { echo "ingestd failed"; exit 1; }
./bin/queryd -config "$CFG" > /tmp/stress_queryd.log 2>&1 & PIDS+=($!); poll_health 3214 || { echo "queryd failed"; exit 1; }
./bin/embedd -config "$CFG" > /tmp/stress_embedd.log 2>&1 & PIDS+=($!); poll_health 3216 || { echo "embedd failed"; exit 1; }
./bin/vectord -config "$CFG" > /tmp/stress_vectord.log 2>&1 & PIDS+=($!); poll_health 3215 || { echo "vectord failed"; exit 1; }
./bin/pathwayd -config "$CFG" > /tmp/stress_pathwayd.log 2>&1 & PIDS+=($!); poll_health 3217 || { echo "pathwayd failed"; exit 1; }
./bin/observerd -config "$CFG" > /tmp/stress_observerd.log 2>&1 & PIDS+=($!); poll_health 3219 || { echo "observerd failed"; exit 1; }
./bin/matrixd -config "$CFG" > /tmp/stress_matrixd.log 2>&1 & PIDS+=($!); poll_health 3218 || { echo "matrixd failed"; exit 1; }
./bin/gateway -config "$CFG" > /tmp/stress_gateway.log 2>&1 & PIDS+=($!); poll_health 3110 || { echo "gateway failed"; exit 1; }

echo
echo "[stress] ingest workers (limit=$WORKERS_LIMIT) into 'workers' corpus..."
./bin/staffing_workers -limit "$WORKERS_LIMIT"

echo
echo "[stress] ingest ethereal_workers (limit=$ETHEREAL_LIMIT, 0=all) into 'ethereal_workers' corpus..."
./bin/staffing_workers \
  -parquet "/home/profit/lakehouse/data/datasets/ethereal_workers.parquet" \
  -index-name ethereal_workers \
  -id-prefix "e-" \
  -limit "$ETHEREAL_LIMIT"

echo
echo "[stress] running multi-coord stress driver..."
EXTRA_FLAGS=""
if [ "${WITH_PARAPHRASE_HANDOVER:-1}" = "1" ]; then
  EXTRA_FLAGS="$EXTRA_FLAGS -with-paraphrase-handover"
fi
# EXTRA_FLAGS is left unquoted below on purpose so it word-splits into
# separate flags (or into nothing when it is empty).
./bin/multi_coord_stress \
  -gateway "http://127.0.0.1:3110" \
  -contracts tests/reality/contracts \
  -corpora "$CORPORA" \
  -k "$K" \
  -out "$OUT_JSON" \
  -ollama "http://localhost:11434" \
  -judge "${JUDGE_MODEL:-qwen2.5:latest}" \
  $EXTRA_FLAGS

echo
|
||||
echo "[stress] generating markdown report → $OUT_MD"
|
||||
|
||||
# Render compact markdown from the JSON. Same shape as the lift harness
|
||||
# reports so reviewers can compare format.
|
||||
total=$(jq -r '.events | length' "$OUT_JSON")
|
||||
gen_at=$(jq -r '.generated_at' "$OUT_JSON")
|
||||
div_role=$(jq -r '.diversity.same_role_across_contracts_mean_jaccard' "$OUT_JSON")
|
||||
div_role_n=$(jq -r '.diversity.num_pairs_same_role_across_contracts' "$OUT_JSON")
|
||||
div_xrole=$(jq -r '.diversity.different_roles_same_contract_mean_jaccard' "$OUT_JSON")
|
||||
div_xrole_n=$(jq -r '.diversity.num_pairs_different_roles_same_contract' "$OUT_JSON")
|
||||
det_jacc=$(jq -r '.determinism.mean_jaccard' "$OUT_JSON")
|
||||
det_n=$(jq -r '.determinism.num_reissued_pairs' "$OUT_JSON")
|
||||
hand_run=$(jq -r '.learning.handover_queries_run' "$OUT_JSON")
|
||||
hand_top1=$(jq -r '.learning.recorded_answers_top1_count' "$OUT_JSON")
|
||||
hand_topk=$(jq -r '.learning.recorded_answers_topk_count' "$OUT_JSON")
|
||||
hand_rate=$(jq -r '.learning.handover_hit_rate' "$OUT_JSON")
|
||||
ph_run=$(jq -r '.learning.paraphrase_handover_run // 0' "$OUT_JSON")
|
||||
ph_top1=$(jq -r '.learning.paraphrase_top1_count // 0' "$OUT_JSON")
|
||||
ph_topk=$(jq -r '.learning.paraphrase_topk_count // 0' "$OUT_JSON")
|
||||
ph_rate=$(jq -r '.learning.paraphrase_handover_hit_rate // 0' "$OUT_JSON")
|
||||
|
||||
cat > "$OUT_MD" <<MDEOF
|
||||
# Multi-Coordinator Stress Test — Run ${RUN_ID}
|
||||
|
||||
**Generated:** ${gen_at}
|
||||
**Coordinators:** alice / bob / carol (each with own playbook namespace: \`playbook_alice\` / \`playbook_bob\` / \`playbook_carol\`)
|
||||
**Contracts:** $(jq -r '.contracts | join(" / ")' "$OUT_JSON")
|
||||
**Corpora:** \`${CORPORA}\`
|
||||
**K per query:** ${K}
|
||||
**Total events captured:** ${total}
|
||||
**Evidence:** \`${OUT_JSON}\`
|
||||
|
||||
---
|
||||
|
||||
## Diversity — is the system locking into scenarios or cycling?
|
||||
|
||||
| Metric | Mean Jaccard | n pairs | Interpretation |
|
||||
|---|---:|---:|---|
|
||||
| Same role across different contracts | ${div_role} | ${div_role_n} | Lower = more diverse (different region/cert mix → different workers) |
|
||||
| Different roles within same contract | ${div_xrole} | ${div_xrole_n} | Should be near-zero (different roles = different worker pools) |
|
||||
|
||||
**Healthy ranges:**
|
||||
- Same role across contracts: < 0.30 means the system is genuinely picking different workers per region/contract.
|
||||
- Different roles same contract: < 0.10 means role-specific retrieval is working.
|
||||
- If either is > 0.50, the system is "cycling" the same handful of workers regardless of query intent.
|
||||
|
||||
---
|
||||
|
||||
## Determinism — same query reissued, top-K stability
|
||||
|
||||
| Metric | Value |
|
||||
|---|---:|
|
||||
| Mean Jaccard on retrieval-only reissue | ${det_jacc} |
|
||||
| Number of reissue pairs | ${det_n} |
|
||||
|
||||
**Interpretation:**
|
||||
- ≥ 0.95: HNSW retrieval is highly deterministic; reissues land on near-identical top-K. Good — system locks into a stable view of "best workers for this query."
|
||||
- 0.80 – 0.95: Some HNSW or embed variance, acceptable.
|
||||
- < 0.80: Retrieval is unstable — reissues see substantially different results, suggesting either embed nondeterminism (Ollama returning slightly different vectors) or vectord nondeterminism (HNSW insertion order affecting recall).
|
||||
|
||||
---
|
||||
|
||||
## Learning — handover hit rate
|
||||
|
||||
Bob takes Alice's contract using Alice's playbook namespace. Did Alice's recorded answers surface in Bob's results?
|
||||
|
||||
| Metric | Value |
|
||||
|---|---:|
|
||||
| Verbatim handover queries run | ${hand_run} |
|
||||
| Alice's recorded answer at Bob's top-1 (verbatim) | ${hand_top1} |
|
||||
| Alice's recorded answer in Bob's top-K (verbatim) | ${hand_topk} |
|
||||
| **Verbatim handover hit rate (top-1)** | **${hand_rate}** |
|
||||
| Paraphrase handover queries run | ${ph_run} |
|
||||
| Alice's recorded answer at Bob's top-1 (paraphrase) | ${ph_top1} |
|
||||
| Alice's recorded answer in Bob's top-K (paraphrase) | ${ph_topk} |
|
||||
| **Paraphrase handover hit rate (top-1)** | **${ph_rate}** |
|
||||
|
||||
**Interpretation:**
|
||||
- Verbatim hit rate ≈ 1.0: trivial case — Bob runs identical queries; should always hit.
|
||||
- Paraphrase hit rate ≥ 0.5: institutional memory survives wording change — the harder learning property.
|
||||
- Paraphrase hit rate ≈ 0.0: Bob's paraphrases drift past the inject threshold, so Alice's recordings don't activate. Same caveat as the playbook_lift paraphrase pass.
|
||||
|
||||
---
|
||||
|
||||
## Per-event capture
|
||||
|
||||
All matrix.search responses live in the JSON — top-K with worker IDs, distances, and per-corpus counts. Search by phase:
|
||||
|
||||
\`\`\`bash
|
||||
jq '.events[] | select(.phase == "merge")' ${OUT_JSON}
|
||||
jq '.events[] | select(.coordinator == "alice" and .phase == "baseline")' ${OUT_JSON}
|
||||
jq '.events[] | select(.role == "warehouse worker") | {phase, contract, top_k_ids: [.top_k[].id]}' ${OUT_JSON}
|
||||
\`\`\`
|
||||
|
||||
---
|
||||
|
||||
## What's NOT in this run (Phase 1 deliberately defers)
|
||||
|
||||
- **48-hour clock.** Events fire as discrete steps, not on a timeline.
|
||||
- **Email / SMS ingest.** No endpoints exist on the Go side yet.
|
||||
- **New-resume injection mid-run.** The corpus is fixed at the start.
|
||||
- **Langfuse traces.** Need Go-side wiring.
|
||||
|
||||
These are Phase 2/3. The Phase 1 substrate is what the time-based runner will mount on top of.
|
||||
MDEOF
|
||||
|
||||
echo
|
||||
echo "[stress] DONE"
|
||||
echo "[stress] evidence: $OUT_JSON"
|
||||
echo "[stress] report: $OUT_MD"
|
||||
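The report template above interprets "Mean Jaccard" values for the diversity and determinism sections. The sketch below shows one way such a score could be computed, assuming the metric is plain set Jaccard over top-K worker IDs. The driver's real implementation sits in the suppressed diff and may differ; `jaccard` and `meanJaccard` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two top-K ID lists, treating each
// list as a set. Returning 1.0 for two empty lists is an assumption made
// for this sketch, not something the source specifies.
func jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	if len(setA) == 0 && len(setB) == 0 {
		return 1.0
	}
	inter := 0
	for id := range setA {
		if _, ok := setB[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

// meanJaccard averages jaccard over a list of (first, second) top-K pairs.
// It returns 0 when there are no pairs.
func meanJaccard(pairs [][2][]string) float64 {
	if len(pairs) == 0 {
		return 0
	}
	sum := 0.0
	for _, p := range pairs {
		sum += jaccard(p[0], p[1])
	}
	return sum / float64(len(pairs))
}

func main() {
	first := []string{"w-1", "w-2", "w-3"}
	reissue := []string{"w-2", "w-3", "w-4"}
	// Two shared IDs out of four distinct IDs.
	fmt.Printf("%.2f\n", jaccard(first, reissue))
}
```

Under this definition a reissued query that returns the same IDs in a different order still scores 1.0, so the determinism number reflects set membership, not rank stability.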
@ -52,11 +52,6 @@ CONFIG_PATH="${CONFIG_PATH:-lakehouse.toml}"
|
||||
# actual learning-property test (does cosine on paraphrase find the
|
||||
# recorded entry?). Set WITH_PARAPHRASE=0 for a faster verbatim-only run.
|
||||
WITH_PARAPHRASE="${WITH_PARAPHRASE:-1}"
|
||||
# WITH_REJUDGE=1 (default) adds a Pass 4 — judge warm top-1 to measure
|
||||
# quality lift (warm rating vs cold rating). Catches cases where Shape B
|
||||
# surfaces a different-but-equally-good answer (which the rank-based
|
||||
# lift metric misses). +21 judge calls (~30s on qwen2.5).
|
||||
WITH_REJUDGE="${WITH_REJUDGE:-1}"
|
||||
|
||||
OUT_JSON="reports/reality-tests/playbook_lift_${RUN_ID}.json"
|
||||
OUT_MD="reports/reality-tests/playbook_lift_${RUN_ID}.md"
|
||||
@ -161,7 +156,7 @@ refresh_every = "1s"
|
||||
[embedd]
|
||||
bind = "127.0.0.1:3216"
|
||||
provider_url = "http://localhost:11434"
|
||||
default_model = "nomic-embed-text-v2-moe"
|
||||
default_model = "nomic-embed-text"
|
||||
|
||||
[vectord]
|
||||
bind = "127.0.0.1:3215"
|
||||
@ -276,12 +271,9 @@ echo "[lift] running driver — judge=$EFFECTIVE_JUDGE · queries=$QUERIES_FILE
|
||||
# and runs its own resolution chain (env → config → fallback). When
|
||||
# JUDGE_MODEL IS set, the explicit -judge wins inside the Go driver
|
||||
# regardless of what its env-lookup would find — flag wins by design.
|
||||
EXTRA_FLAGS=""
|
||||
PARAPHRASE_FLAG=""
|
||||
if [ "$WITH_PARAPHRASE" = "1" ]; then
|
||||
EXTRA_FLAGS="$EXTRA_FLAGS -with-paraphrase"
|
||||
fi
|
||||
if [ "$WITH_REJUDGE" = "1" ]; then
|
||||
EXTRA_FLAGS="$EXTRA_FLAGS -with-rejudge"
|
||||
PARAPHRASE_FLAG="-with-paraphrase"
|
||||
fi
|
||||
./bin/playbook_lift \
|
||||
-config "$CONFIG_PATH" \
|
||||
@ -292,7 +284,7 @@ fi
|
||||
-judge "$JUDGE_MODEL" \
|
||||
-k "$K" \
|
||||
-out "$OUT_JSON" \
|
||||
$EXTRA_FLAGS
|
||||
$PARAPHRASE_FLAG
|
||||
|
||||
echo
|
||||
echo "[lift] generating markdown report → $OUT_MD"
|
||||
@ -310,10 +302,6 @@ generate_md() {
|
||||
p_attempted=$(jq -r '.summary.paraphrase_attempted // 0' "$json")
|
||||
p_top1=$(jq -r '.summary.paraphrase_top1_lifts // 0' "$json")
|
||||
p_anyrank=$(jq -r '.summary.paraphrase_any_rank_hits // 0' "$json")
|
||||
rj_attempted=$(jq -r '.summary.rejudge_attempted // 0' "$json")
|
||||
q_lifted=$(jq -r '.summary.quality_lifted // 0' "$json")
|
||||
q_neutral=$(jq -r '.summary.quality_neutral // 0' "$json")
|
||||
q_regressed=$(jq -r '.summary.quality_regressed // 0' "$json")
|
||||
|
||||
# Only emit the paraphrase block when --with-paraphrase actually ran
|
||||
# (i.e. .summary.paraphrase_attempted > 0). For verbatim-only runs we
|
||||
@ -324,13 +312,6 @@ generate_md() {
|
||||
| Paraphrase pass — recorded answer at any rank in top-K | ${p_anyrank} / ${p_attempted} |"
|
||||
fi
|
||||
|
||||
rj_block=""
|
||||
if [ "$rj_attempted" != "0" ] && [ "$rj_attempted" != "null" ]; then
|
||||
rj_block="| **Quality lift** (warm top-1 rating > cold top-1 rating) | **${q_lifted} / ${rj_attempted}** |
|
||||
| Quality neutral (warm top-1 rating = cold top-1 rating) | ${q_neutral} / ${rj_attempted} |
|
||||
| Quality regressed (warm top-1 rating < cold top-1 rating) | ${q_regressed} / ${rj_attempted} |"
|
||||
fi
|
||||
|
||||
cat > "$md" <<MDEOF
|
||||
# Playbook-Lift Reality Test — Run ${RUN_ID}
|
||||
|
||||
@ -341,7 +322,6 @@ generate_md() {
|
||||
**Queries:** \`${QUERIES_FILE}\` (${total} executed)
|
||||
**K per pass:** ${K}
|
||||
**Paraphrase pass:** $([ "$WITH_PARAPHRASE" = "1" ] && echo "ENABLED" || echo "disabled")
|
||||
**Re-judge pass:** $([ "$WITH_REJUDGE" = "1" ] && echo "ENABLED" || echo "disabled")
|
||||
**Evidence:** \`${OUT_JSON}\`
|
||||
|
||||
---
|
||||
@ -357,7 +337,6 @@ generate_md() {
|
||||
| Playbook boosts triggered (warm pass) | ${boosted} |
|
||||
| Mean Δ top-1 distance (warm − cold) | ${mean_delta} |
|
||||
${p_block}
|
||||
${rj_block}
|
||||
|
||||
**Verbatim lift rate:** ${lift} of ${discovery} discoveries became top-1 after warm pass.
|
||||
|
||||
|
||||
@ -75,19 +75,12 @@ type queryRun struct {
|
||||
PlaybookRecorded bool `json:"playbook_recorded"`
|
||||
PlaybookID string `json:"playbook_target_id,omitempty"`
|
||||
|
||||
WarmTop1ID string `json:"warm_top1_id"`
|
||||
WarmTop1Distance float32 `json:"warm_top1_distance"`
|
||||
WarmBoostedCount int `json:"warm_boosted_count"`
|
||||
WarmJudgeBestRank int `json:"warm_judge_best_rank"` // rank of cold judge-best in warm — NOT the warm pass's own judge-best
|
||||
WarmTop1Metadata json.RawMessage `json:"-"` // cached for Pass 4 rejudge; not emitted
|
||||
WarmTop1ID string `json:"warm_top1_id"`
|
||||
WarmTop1Distance float32 `json:"warm_top1_distance"`
|
||||
WarmBoostedCount int `json:"warm_boosted_count"`
|
||||
WarmJudgeBestRank int `json:"warm_judge_best_rank"`
|
||||
|
||||
// WarmTop1Rating: only populated when --with-rejudge. Compare to
|
||||
// ColdRatings[0] (== cold top-1 rating) to measure quality lift.
|
||||
// *int so absence (no rejudge pass) and a 0-rating verdict are
|
||||
// distinguishable.
|
||||
WarmTop1Rating *int `json:"warm_top1_rating,omitempty"`
|
||||
|
||||
Lift bool `json:"lift"` // judge-best was below top-1 cold, but top-1 warm
|
||||
Lift bool `json:"lift"` // judge-best was below top-1 cold, but top-1 warm
|
||||
|
||||
// Paraphrase pass — only populated when --with-paraphrase. Tests
|
||||
// the playbook's actual learning property: does a recorded entry
|
||||
@ -121,17 +114,6 @@ type summary struct {
|
||||
ParaphraseTop1Lifts int `json:"paraphrase_top1_lifts,omitempty"` // recorded answer surfaced at rank 0
|
||||
ParaphraseAnyRankHits int `json:"paraphrase_any_rank_hits,omitempty"` // recorded answer surfaced at any rank in top-K
|
||||
|
||||
// Re-judge pass aggregates — only populated when --with-rejudge.
|
||||
// Measures QUALITY lift (warm top-1 rating vs cold top-1 rating)
|
||||
// rather than rank-of-cold-judge-best lift. The latter conflates
|
||||
// "warm surfaced a different but equally-good result" with "warm
|
||||
// shuffled ranks but the answer was the same"; quality lift
|
||||
// disambiguates them.
|
||||
RejudgeAttempted int `json:"rejudge_attempted,omitempty"` // queries that ran the rejudge pass
|
||||
QualityLifted int `json:"quality_lifted,omitempty"` // warm-top-1 rating > cold-top-1 rating
|
||||
QualityNeutral int `json:"quality_neutral,omitempty"` // ratings equal (could be same or different item)
|
||||
QualityRegressed int `json:"quality_regressed,omitempty"` // warm-top-1 rating < cold-top-1 rating
|
||||
|
||||
GeneratedAt time.Time `json:"generated_at"`
|
||||
}
|
||||
|
||||
@ -146,7 +128,6 @@ func main() {
|
||||
k := flag.Int("k", 10, "top-k from matrix.search per pass")
|
||||
out := flag.String("out", "reports/reality-tests/playbook_lift_001.json", "output JSON path")
|
||||
withParaphrase := flag.Bool("with-paraphrase", false, "after warm pass, generate a paraphrase via the judge model and re-query with playbook=true to test the learning property")
|
||||
withRejudge := flag.Bool("with-rejudge", false, "after warm pass, judge warm top-1 to measure QUALITY lift (vs cold top-1 rating), not just rank-of-cold-judge-best")
|
||||
flag.Parse()
|
||||
|
||||
// Judge resolution priority: explicit flag > $JUDGE_MODEL env >
|
||||
@ -244,7 +225,6 @@ func main() {
|
||||
}
|
||||
runs[i].WarmTop1ID = resp.Results[0].ID
|
||||
runs[i].WarmTop1Distance = resp.Results[0].Distance
|
||||
runs[i].WarmTop1Metadata = resp.Results[0].Metadata // cache for Pass 4 rejudge
|
||||
runs[i].WarmBoostedCount = resp.PlaybookBoosted
|
||||
playbookBoostedTotal += resp.PlaybookBoosted
|
||||
|
||||
@ -324,47 +304,6 @@ func main() {
|
||||
}
|
||||
}
|
||||
|
||||
// Pass 4 (warm-rejudge) — opt-in via --with-rejudge. Judge warm
|
||||
// top-1 against the same prompt as cold ratings, then compare to
|
||||
// cold top-1 rating. This measures QUALITY lift (did the playbook
|
||||
// produce a better candidate?) rather than just rank-of-cold-judge-
|
||||
// best lift (did the recorded answer move to top-1, even if cold's
|
||||
// top-1 was already good?). See STATE_OF_PLAY OPEN — added because
|
||||
// run #003's verbatim 2/6 didn't tell us whether Shape B was
|
||||
// surfacing better OR same-quality alternatives.
|
||||
rejudgeAttempted := 0
|
||||
qualityLifted := 0
|
||||
qualityNeutral := 0
|
||||
qualityRegressed := 0
|
||||
if *withRejudge {
|
||||
log.Printf("[lift] warm-rejudge pass: measuring quality lift (warm top-1 rating vs cold top-1 rating)")
|
||||
for i := range runs {
|
||||
if runs[i].WarmTop1ID == "" || len(runs[i].WarmTop1Metadata) == 0 {
|
||||
continue // warm pass didn't complete for this query
|
||||
}
|
||||
rejudgeAttempted++
|
||||
result := matrixResult{
|
||||
ID: runs[i].WarmTop1ID,
|
||||
Distance: runs[i].WarmTop1Distance,
|
||||
Metadata: runs[i].WarmTop1Metadata,
|
||||
}
|
||||
warmRating := judgeRate(hc, *ollama, *judge, runs[i].Query, result)
|
||||
runs[i].WarmTop1Rating = &warmRating
|
||||
coldRating := 0
|
||||
if len(runs[i].ColdRatings) > 0 {
|
||||
coldRating = runs[i].ColdRatings[0]
|
||||
}
|
||||
switch {
|
||||
case warmRating > coldRating:
|
||||
qualityLifted++
|
||||
case warmRating < coldRating:
|
||||
qualityRegressed++
|
||||
default:
|
||||
qualityNeutral++
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
sum := summary{
|
||||
Total: len(runs),
|
||||
WithDiscovery: withDiscovery,
|
||||
@ -375,10 +314,6 @@ func main() {
|
||||
ParaphraseAttempted: paraphraseAttempted,
|
||||
ParaphraseTop1Lifts: paraphraseTop1Lifts,
|
||||
ParaphraseAnyRankHits: paraphraseAnyRankHits,
|
||||
RejudgeAttempted: rejudgeAttempted,
|
||||
QualityLifted: qualityLifted,
|
||||
QualityNeutral: qualityNeutral,
|
||||
QualityRegressed: qualityRegressed,
|
||||
GeneratedAt: time.Now().UTC(),
|
||||
}
|
||||
if len(runs) > 0 {
|
||||
@ -388,11 +323,11 @@ func main() {
|
||||
if err := writeJSON(*out, runs, sum); err != nil {
|
||||
log.Fatalf("write %s: %v", *out, err)
|
||||
}
|
||||
if *withParaphrase || *withRejudge {
|
||||
log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f · paraphrase=%d/%d→top1 · quality=lifted%d/neutral%d/regressed%d",
|
||||
if *withParaphrase {
|
||||
log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f · paraphrase=%d/%d→top1, %d/%d→anyrank",
|
||||
sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance,
|
||||
sum.ParaphraseTop1Lifts, sum.ParaphraseAttempted,
|
||||
sum.QualityLifted, sum.QualityNeutral, sum.QualityRegressed)
|
||||
sum.ParaphraseAnyRankHits, sum.ParaphraseAttempted)
|
||||
} else {
|
||||
log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f",
|
||||
sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance)
|
||||
|
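The re-judge pass in the hunks above compares the warm top-1 rating with the cold top-1 rating and uses a `*int` so that "no re-judge ran" and "rated 0" stay distinguishable. The same decision can be isolated as a small pure function. This is a sketch only: `classifyQuality` and the `outcome` type are hypothetical names, while the fallback of treating a missing cold rating as 0 mirrors the diffed code.

```go
package main

import "fmt"

// outcome labels the result of comparing warm and cold top-1 ratings.
type outcome string

const (
	lifted    outcome = "lifted"
	neutral   outcome = "neutral"
	regressed outcome = "regressed"
	skipped   outcome = "skipped" // no re-judge rating was recorded
)

// classifyQuality compares the warm top-1 rating against the cold top-1
// rating (coldRatings[0]). A nil warm pointer means the re-judge pass did
// not run for this query. An empty coldRatings slice falls back to 0.
func classifyQuality(warm *int, coldRatings []int) outcome {
	if warm == nil {
		return skipped
	}
	cold := 0
	if len(coldRatings) > 0 {
		cold = coldRatings[0]
	}
	switch {
	case *warm > cold:
		return lifted
	case *warm < cold:
		return regressed
	default:
		return neutral
	}
}

func main() {
	warm := 4
	fmt.Println(classifyQuality(&warm, []int{3, 5, 2}))
	fmt.Println(classifyQuality(nil, []int{3}))
}
```

Keeping the comparison in one function makes the three counters easy to unit-test without a live judge model.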
||||
@ -1,12 +0,0 @@
{
  "name": "alpha_milwaukee_distribution",
  "client": "Northstar Logistics",
  "location": "Milwaukee, WI metro",
  "shift": "day",
  "demand": [
    {"role": "warehouse worker", "count": 200, "skills": ["pallet jack", "inventory"], "certs": ["OSHA-30"]},
    {"role": "admin assistant", "count": 3, "skills": ["scheduling", "data entry"], "certs": []},
    {"role": "heavy equipment operator", "count": 2, "skills": ["forklift", "bobcat"], "certs": ["OSHA-30", "forklift cert"]},
    {"role": "industrial electrician", "count": 1, "skills": ["high voltage", "PLC"], "certs": ["journeyman"], "in_roster": false}
  ]
}
@ -1,12 +0,0 @@
{
  "name": "beta_indianapolis_manufacturing",
  "client": "Crossroads Manufacturing",
  "location": "Indianapolis, IN metro",
  "shift": "swing",
  "demand": [
    {"role": "warehouse worker", "count": 150, "skills": ["assembly", "machine operation"], "certs": ["OSHA-10"]},
    {"role": "admin assistant", "count": 4, "skills": ["scheduling", "documentation", "spanish"], "certs": []},
    {"role": "heavy equipment operator", "count": 3, "skills": ["forklift", "pallet jack", "cold storage"], "certs": ["OSHA-30", "forklift cert"]},
    {"role": "bilingual safety coordinator", "count": 1, "skills": ["spanish", "english", "training"], "certs": ["OSHA trainer"], "in_roster": false}
  ]
}
@ -1,12 +0,0 @@
{
  "name": "gamma_chicago_construction",
  "client": "Loop Construction Group",
  "location": "Chicago, IL metro",
  "shift": "early-day",
  "demand": [
    {"role": "warehouse worker", "count": 80, "skills": ["framing", "rigging", "concrete"], "certs": ["OSHA-10"]},
    {"role": "admin assistant", "count": 1, "skills": ["scheduling", "blueprint reading"], "certs": []},
    {"role": "heavy equipment operator", "count": 2, "skills": ["mobile crane", "rigging signals", "bobcat"], "certs": ["NCCCO crane cert"]},
    {"role": "drone surveyor", "count": 1, "skills": ["UAV piloting", "GIS", "site mapping"], "certs": ["FAA Part 107"], "in_roster": false}
  ]
}
|
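The three contract files above share one shape: top-level metadata plus a `demand` array whose entries carry an optional `in_roster` key. The sketch below shows how such a file might be decoded in Go. The struct names, the `inRoster` helper, and the rule that a missing `in_roster` means true are all assumptions made for illustration; this diff does not show how the driver actually interprets the key.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// demand is one staffing line in a contract. InRoster is a pointer so that
// an absent key can be told apart from an explicit false.
type demand struct {
	Role     string   `json:"role"`
	Count    int      `json:"count"`
	Skills   []string `json:"skills"`
	Certs    []string `json:"certs"`
	InRoster *bool    `json:"in_roster,omitempty"`
}

// contract mirrors the top-level fields of the contract JSON files.
type contract struct {
	Name     string   `json:"name"`
	Client   string   `json:"client"`
	Location string   `json:"location"`
	Shift    string   `json:"shift"`
	Demand   []demand `json:"demand"`
}

// inRoster treats a missing key as true (assumption for this sketch).
func (d demand) inRoster() bool {
	return d.InRoster == nil || *d.InRoster
}

// parseContract decodes one contract document.
func parseContract(raw []byte) (contract, error) {
	var c contract
	err := json.Unmarshal(raw, &c)
	return c, err
}

func main() {
	raw := []byte(`{
	  "name": "gamma_chicago_construction",
	  "shift": "early-day",
	  "demand": [
	    {"role": "admin assistant", "count": 1, "skills": ["scheduling"], "certs": []},
	    {"role": "drone surveyor", "count": 1, "skills": ["GIS"], "certs": ["FAA Part 107"], "in_roster": false}
	  ]
	}`)
	c, err := parseContract(raw)
	if err != nil {
		panic(err)
	}
	for _, d := range c.Demand {
		fmt.Printf("%s count=%d in_roster=%v\n", d.Role, d.Count, d.inRoster())
	}
}
```

Using `*bool` here follows the same reasoning as the `*int` rating field in the Go hunks: absence and the zero value mean different things and should not collapse into one.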