golangLAKEHOUSE

Author	SHA1	Message	Date
root	6c93a38093	scrum multi_coord_phase3: 4 fixes from cross-lineage review	2026-04-30 17:42:07 -05:00

Cross-lineage scrum on bundle 87cbd10..f971e64 (3,652 lines) produced 4 actionable findings, all defensive hardening.

1. (Opus WARN) internal/langfuse/client.go:queue
   Synchronous Flush at maxBatch threshold blocked the calling goroutine for the full 5s HTTP timeout when Langfuse hiccupped, defeating the "best-effort, never blocks calling path" contract in the package doc. Now fire-and-forget via goroutine.
2. (Opus + Kimi convergent) cmd/observerd/main.go:handleInbox
   - Free-form priority string was accepted; "nonsense" passed through unchecked. Now closed enum: urgent|high|medium|low (+ empty defaults to medium). Tested: TestInbox_RejectsBadPriority.
   - No size cap on body, only emptiness check; multi-MB payloads would bloat observer's ring + JSONL. Now 8 KiB cap returns 413. Tested: TestInbox_RejectsOversizedBody.
   - Subject/sender/tag concatenated into InputSummary without newline stripping; embedded \n could corrupt JSONL line-based parsers. New sanitizeInboxField strips \r\n + caps at 256 chars before interpolation.
3. (Opus INFO) scripts/multi_coord_stress/main.go
   Removed dead `must[T]` generic: tracedSearch took over the fail-fast role for matrix searches, so the helper became unused.
4. (Opus INFO) scripts/multi_coord_stress/main.go:Event
   `JudgeRating int` collapsed "judge errored" and "judge said unrated" both to 0. Changed to *int: nil = errored, 1-5 = verdict. judgeInboxResult still returns 0 on error; caller gates on > 0 before assigning.

Dismissed (with rationale):
- Opus WARN ExcludeIDs ordering: verified by code read; filter applies after sort + before top-K truncation as documented; no slot waste possible.
- Opus INFO 10 prior-run reports contradict #011: those are point-in-time snapshots; intentional history.
- Kimi INFO Langfuse error suppression: design intent (best-effort per package doc).
- Kimi INFO contract schema validation: defer until contract count grows enough to make hand-edit drift a real risk.
- Kimi INFO paraphrase prompt duplicated across lift + multi_coord: defer (lift to internal/paraphrase/ when a third consumer appears).
- Qwen HOLD: single-line, no actionable finding.

go test ./cmd/observerd ./internal/langfuse all green; multi_coord driver builds clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
root	e7fc63b216	observerd: /observer/inbox + multi-coord stress phase 1c (priority-ordered events)	2026-04-30 08:34:36 -05:00

Phase 3 ask: real-world inbox-style event injection during the stress test. Coordinators in production receive emails + SMS that trigger contract responses; the substrate has to RECORD these signals AND react with a search using the embedded demand. This commit lands the endpoint and exercises it end-to-end in the stress harness.

observerd surface:
- New POST /observer/inbox route: accepts {type, sender, subject, body, priority, tag} and records as ObservedOp with Source=SourceInbox. Type must be email|sms; body required; priority defaults to medium. The handler ONLY records; downstream triggers (search, ingest, etc.) are the caller's concern, recorded separately. Keeps the witness role pure.
- New observer.SourceInbox = "inbox" alongside SourceMCP / SourceScenario / SourceWorkflow.
- Three contract tests on the new route (happy path / bad type / empty body), router-mount test extended, all green.

Stress harness phase 1c (Hour 9):
- 6 inbox events fire in priority order (urgent → high → medium):
    2 urgent emails (forklift Cleveland, production Indianapolis)
    1 high email (crane Chicago)
    1 high sms (bilingual safety Indianapolis)
    1 medium sms (drone Chicago)
    1 medium email (warehouse Milwaukee FYI)
- Each event:
    1. POSTs to /v1/observer/inbox (recorded by observerd)
    2. Triggers matrix.search using a parsed demand (the demand extraction is hard-coded for now; production needs a small LLM to parse from body)
    3. Captures both as events in the run JSON

Run #006 result (with v2-moe embedder + all phases including inbox):
  Diversity:
    Same-role-across-contracts Jaccard = 0.000 (n=9)
    Different-roles-same-contract Jaccard = 0.046 (n=18)
  Determinism: 1.000
  Verbatim handover: 4/4 (100%)
  Paraphrase handover: 4/4 (100%)
  Inbox burst:
    6/6 events accepted by observerd (200 status, all recorded)
    6/6 triggered searches produced distinct top-1 worker IDs
    distance distribution: 0.24 (Indy production) → 0.71 (Chicago drone surveyor; honest stretch since drones aren't in the 5K-worker corpus, system surfaces closest neighbor at high distance rather than fabricating)

The drone-Chicago case is the architectural-honesty signal: when the demand asks for a specialist NOT in the roster, the system returns the closest semantic neighbor with a distance that flags "this is a stretch." Coordinators reading distances see "we don't have a great match here" rather than a confident wrong answer.

Total events captured: 67 (was 61 pre-inbox).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
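
The Jaccard figures reported for Run #006 measure overlap between result sets. The sketch below shows the standard definition, |A ∩ B| / |A ∪ B|, applied to two lists of worker IDs. It assumes the harness compares sets of returned IDs, which the commit message does not spell out; the function name jaccard and the sample IDs are hypothetical.

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two ID lists treated as sets.
// Two empty sets yield 0 here; that convention is a choice made for this
// sketch, not something taken from the harness.
func jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	if len(setA) == 0 && len(setB) == 0 {
		return 0
	}
	inter := 0
	for id := range setA {
		if _, ok := setB[id]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

func main() {
	// Hypothetical top-K results for two searches.
	forklift := []string{"w-101", "w-102", "w-103"}
	crane := []string{"w-102", "w-103", "w-104"}
	fmt.Printf("%.3f\n", jaccard(forklift, crane))
}
```

A value near 0, as in the reported run, means the compared searches returned almost entirely different workers.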
root	9ce067bd9d	observerd: test that locks ADR-005 Decision 5.3	2026-04-30 06:35:41 -05:00

TestWorkflowRun_AllProvenanceRecordedPostRun proves that handleWorkflowRun records ObservedOps only AFTER runner.Run returns, not interleaved with node execution. The test pauses inside a node via a controlled channel, samples observer.Store mid-run (must be 0), unblocks, then samples again (must be N).

If a future commit adds per-node streaming (e.g. runner.NodeHook firing as each node finishes), n1's record would appear before the unblock and the first assertion fires. This is intentional test-as-spec lock.

Closing the streaming gap is deferred per the ADR ("acceptable for short workflows; streaming callback is the right shape when workflows get longer"), but if someone later adds the streaming callback without updating the ADR, this test catches it in `go test` instead of leaving the doc and code drifted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
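
The pause-and-sample pattern described in commit 9ce067bd9d can be illustrated with a stripped-down stand-in. Everything here is hypothetical (store, runWorkflow, midRunAndFinalCounts); the real test exercises handleWorkflowRun and observer.Store, whose APIs are not shown in this log. The sketch only demonstrates how a channel-gated node lets a test observe the store while the run is still in flight.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical stand-in for the observer's record store.
type store struct {
	mu  sync.Mutex
	ops []string
}

func (s *store) record(op string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ops = append(s.ops, op)
}

func (s *store) count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.ops)
}

// runWorkflow runs every node first and records provenance only afterwards,
// mirroring the "record after Run returns" behavior the test locks in.
func runWorkflow(s *store, nodes []func() string) {
	results := make([]string, 0, len(nodes))
	for _, node := range nodes {
		results = append(results, node())
	}
	for _, r := range results {
		s.record(r)
	}
}

// midRunAndFinalCounts blocks the second node on a channel, samples the
// store while the run is paused, then releases the node and samples again.
func midRunAndFinalCounts() (mid, final int) {
	s := &store{}
	entered := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})

	nodes := []func() string{
		func() string { return "n1" },
		func() string {
			close(entered) // signal: n1 has finished, n2 is running
			<-release      // hold the run open until the sample is taken
			return "n2"
		},
	}

	go func() {
		runWorkflow(s, nodes)
		close(done)
	}()

	<-entered
	mid = s.count() // a streaming implementation would already show n1 here
	close(release)
	<-done
	return mid, s.count()
}

func main() {
	mid, final := midRunAndFinalCounts()
	fmt.Println(mid, final)
}
```

If runWorkflow recorded each node as it finished, the mid-run sample would be 1 instead of 0, which is the drift the commit's test is designed to catch.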
root	6c02c905c8	scrum lift_001: 4 fixes from cross-lineage review	2026-04-30 06:27:24 -05:00

Cross-lineage scrum on b2e45f7 produced 1 convergent + 3 single-reviewer findings worth fixing. All apply.

1. (Opus WARN + Qwen INFO convergent) scripts/playbook_lift.sh: replace sleep 2.5 in SQL probe with active polling up to 5s. refresh_every=1s is a lower bound; under load the manifest may not be visible in a fixed sleep, which would 4xx the probe and abort the reality run.
2. (Opus INFO) scripts/playbook_lift.sh: report template glued "env JUDGE_MODEL" + value as "env JUDGE_MODELqwen2.5:latest" with no separator. Replaced two :+/:- substitution chains with a single JUDGE_SOURCE variable computed once at the top of the harness.
3. (Opus INFO) scripts/staffing_workers/main.go: -id-prefix "" silently allowed, defeating the flag's purpose (cross-corpus collision prevent). Now log.Fatal at startup with explicit hint.
4. (Opus WARN) cmd/{pathwayd,observerd}/main_test.go: newTestRouter returned http.Handler then re-cast to chi.Router for chi.Walk. Returning chi.Router directly satisfies http.Handler AND avoids an assertion that would panic if future middleware wraps the router.

Dismissed (with rationale):
- Kimi INFO hardcoded MinIO endpoint: harness is local-by-design.
- Kimi WARN matrixd accepts 502/500: documented; real retriever needs real upstreams the test doesn't spin up.
- Qwen INFO queryd string.Contains: brittle but very low risk; restating through typed-error path would couple without adding signal.

go test ./cmd/{matrixd,queryd,pathwayd,observerd} all green.
Verdicts at reports/scrum/_evidence/2026-04-30/verdicts/lift_001_*.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
root	b2e45f7f26	playbook_lift: harness expansion + reality test #001 (7/8 lift, 87.5%)	2026-04-30 06:22:21 -05:00

The 5-loop substrate's load-bearing gate is verified: playbook + matrix indexer give the results we're looking for. Per the report's rubric, lift ≥ 50% of discoveries means matrix is doing real work; 7/8 = 87.5% blew through that.

Harness was structurally hiding bugs behind a 5-daemon stripped boot. Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:

1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark), the wrong domain for staffing queries; replaced with ethereal_workers (10K rows, real staffing schema, "e-" id prefix to avoid collision with workers' "w-"). staffing_workers driver gains -index-name + -id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running ~30s per judge call against the lift loop; reverted to qwen2.5:latest (~1s/call, 30× faster, held lift theory)

Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go so future drift fires in `go test`, not in a reality run. R-005 closed:
- cmd/matrixd/main_test.go (new): playbook record drift detector + score bounds + 6 routes mounted
- cmd/queryd/main_test.go: wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new): 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new): 4 routes + invalid-op + unknown-mode

`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.

Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
  Queries            21 (staffing-domain, 7 categories)
  Discoveries        8 (judge ≠ cosine top-1)
  Lifts              7/8 (87.5%)
  Boosts triggered   9
  Mean Δ distance    -0.053 (warm closer than cold)
  OOD honesty        dental/RN/SWE rated 1, no fake matches
  Cross-corpus       boosts confirmed (e- ↔ w- swaps in lifts)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
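
The field-name drifts in commit b2e45f7f26 ({"query": ...} versus {"query_text": ...}) are the kind of mismatch a strict decoder can reject. The sketch below shows one way to do that with encoding/json; it is not the repo's drift detector, whose implementation this log does not show. searchRequest and decodeStrict are hypothetical names, and only the query_text field name comes from the commit message.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// searchRequest is a hypothetical request shape using the corrected field.
type searchRequest struct {
	QueryText string `json:"query_text"`
}

// decodeStrict rejects unknown fields (such as the old "query" key) and
// requires query_text to be non-empty.
func decodeStrict(body string) (searchRequest, error) {
	var req searchRequest
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&req); err != nil {
		return req, err
	}
	if req.QueryText == "" {
		return req, errors.New("query_text is required")
	}
	return req, nil
}

func main() {
	_, err := decodeStrict(`{"query": "forklift operator"}`)
	fmt.Println("old field name rejected:", err != nil)

	req, err := decodeStrict(`{"query_text": "forklift operator"}`)
	fmt.Println(req.QueryText, err)
}
```

Without DisallowUnknownFields, the old payload would decode silently into an empty QueryText, so a check of this kind in a unit test surfaces the drift before a reality run does.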

5 Commits