50 Commits

Author SHA1 Message Date
root
c0a55b1182 parity reports: regenerated 2026-05-03 morning verification
All 6 probes re-run post-restart for today's verification:
  validator(6/6) + extract_json(12/12) + session_log(4/4) +
  materializer(2/2) + embed(8/8) + subject_audit(6/6) = 38/38
2026-05-03 05:27:14 -05:00
root
262a77a52a subject-audit parity (Step 8) — Go reader + cross-runtime probe
Per /home/profit/lakehouse/docs/specs/SUBJECT_MANIFESTS_ON_CATALOGD.md §5 Step 8.

Go side reads SubjectManifest + verifies HMAC chain on per-subject
audit JSONL files using IDENTICAL canonical-JSON + HMAC-SHA256 algorithm
to crates/catalogd/src/subject_audit.rs. A Rust-written chain now
verifies under Go and vice versa.

Files:
  - internal/catalogd/subject.go
      SubjectManifest, SubjectAuditRow, AuditAccessor, AuditLogEntry
      LoadSubjectManifest, LoadKeyFile (32-byte minimum, matches Rust)
      ReadAuditLog, VerifyChain
      canonicalRowBytesFromRaw (production), canonicalRowBytesFromStruct (tests)
      computeRowHMAC, CanonicalAndHmac (parity helper)
  - internal/catalogd/subject_test.go (10 unit tests)
  - scripts/cutover/parity/subject_audit_helper/main.go
      CLI helper mirroring crates/catalogd/src/bin/parity_subject_audit.rs
  - scripts/cutover/parity/subject_audit_parity.sh
      Two-phase probe: known-answer + every real audit log

Two real bugs caught + fixed by the probe authoring loop:

1. omitempty on AuditAccessor.TraceID stripped the field when empty,
   producing different canonical bytes than Rust (which always writes
   the field). Removed omitempty. Rust + Go now produce identical
   bytes for rows with trace_id="" (the common production case).

2. time.RFC3339Nano strips trailing zeros from nanoseconds, producing
   "...46143921" where Rust's chrono (SecondsFormat::AutoSi) produces
   "...461439210".
   Hashing through the parsed-then-re-marshaled struct breaks the
   chain on any row whose nanos end in 0. Fixed by canonicalizing
   from the RAW LINE BYTES (preserves the original timestamp string
   byte-for-byte). Test TestVerifyChain_RawBytesPreserveTimePrecision
   regression-locks this with a hand-crafted nanos=461439210 row.
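
The raw-bytes fix is the load-bearing detail. Below is a minimal Go
sketch of that approach, assuming a prev-MAC-then-row-bytes chaining
rule and a per-row expected MAC supplied as hex; the real subject.go
layout and key handling are not reproduced here.

```go
// Hedged sketch: verify an HMAC chain from the RAW JSONL line bytes rather
// than from parsed-then-re-marshaled structs, so timestamp strings like
// "...461439210" survive byte-for-byte. Chaining rule and expected-MAC
// plumbing are assumptions for illustration only.
package main

import (
	"bufio"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

func verifyChain(path string, key []byte, expected []string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	prev := make([]byte, 0)
	sc := bufio.NewScanner(f)
	for i := 0; sc.Scan(); i++ {
		raw := sc.Bytes() // raw line bytes, never re-marshaled
		mac := hmac.New(sha256.New, key)
		mac.Write(prev) // chain to the previous row's MAC (assumed rule)
		mac.Write(raw)
		sum := mac.Sum(nil)
		if i < len(expected) && !hmac.Equal(sum, mustHex(expected[i])) {
			return fmt.Errorf("row %d: chain mismatch", i)
		}
		prev = sum
	}
	return sc.Err()
}

func mustHex(s string) []byte { b, _ := hex.DecodeString(s); return b }

func main() {
	// Hypothetical usage: path, key (32 bytes, per the LoadKeyFile minimum
	// noted above), and expected MAC list are placeholders.
	err := verifyChain("audit.jsonl", []byte("0123456789abcdef0123456789abcdef"), nil)
	fmt.Println("verify:", err)
}
```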

Live verification (6/6 byte-identical assertions):
  - Phase 1 known-answer: canonical bytes (266) + HMAC match
  - Phase 2 real logs: WORKER-1..5 audit JSONL all verify under both
    runtimes with identical (count, tip, verified, error) output

Report: reports/cutover/gauntlet_2026-05-02/parity/subject_audit_parity.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:17:15 -05:00
root
5d3996b51d STATE_OF_PLAY: Rust is not maintenance-only as of 2026-05-02
Frames the Rust system accurately — it's receiving parity work +
infrastructure (Lance gauntlet, sidecar drop, observability parity),
not just security fixes. Points readers at lakehouse/STATE_OF_PLAY.md
+ docs/ARCHITECTURE_COMPARISON.md for the cross-runtime view.

Also commits today's parity probe report regenerations (5/5 still
32/32 post-Lance work).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 22:24:14 -05:00
root
b314ed1c94 parity: /v1/embed cross-runtime probe (5th probe, 8/8 cosine match)
Today's sidecar drop (lakehouse ba928b1) changed Rust's embed
transport from gateway → sidecar → Ollama (2 hops) to gateway →
Ollama directly. Go's embedd has always been direct. A drift here
would mean: same query, different vector → different HNSW top-K →
different staffing recommendations. This probe is the regression
gate for that surface.

Fixtures cover staffing-domain shapes (forklift, welder, OSHA,
dental, CNC) plus stress shapes (unicode "Café résumé  你好",
single char "x", 200-word long fixture).

Match metric: cosine similarity ≥ 0.99999. Byte-equal isn't
expected — Go round-trips through []float32 internally while Rust
stays at Vec<f64>, so JSON serialization introduces small float
drift. What matters operationally is vector direction (HNSW uses
cosine distance), and both runtimes preserve it when calling the
same Ollama with the same model.
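
A minimal sketch of the cosine gate, assuming plain float64 slices on
both sides; the 0.99999 threshold is the one quoted above, everything
else below is illustrative.

```go
// Minimal cosine-similarity gate: direction match is what matters under
// cosine-distance HNSW, so small float32 round-trip noise still passes.
package main

import (
	"fmt"
	"math"
)

func cosine(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	rust := []float64{0.12, -0.34, 0.56}
	goVec := []float64{0.12000001, -0.33999999, 0.56000002}
	sim := cosine(rust, goVec)
	fmt.Printf("cosine=%.8f pass=%v\n", sim, sim >= 0.99999)
}
```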

Result: **8/8 fixtures match** including the long + unicode cases.
Sidecar drop didn't disturb the embed surface. The probe also
forces both endpoints to use `nomic-embed-text` so the v1-vs-v2-moe
default difference doesn't pollute the comparison.

5th cross-runtime parity probe joining the family:
  - validator_parity (6/6)
  - extract_json_parity (12/12)
  - session_log_parity (4/4)
  - materializer_parity (2/2)
  - embed_parity (8/8) — this commit

Cumulative: 32/32 parity assertions across 5 probes covering
HTTP shape (validator, embed), CLI output (materializer), unit
behavior (extract_json), and persisted shape (session_log).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 06:28:40 -05:00
root
a21a34b057 docs: close 2 cross-runtime parity gaps + document unified log
Companion to lakehouse 98b6647. Architecture comparison decisions
tracker now captures:

  - Go validatord direct header read (fixes 6847bbc): closes the
    case where Langfuse-off middleware passthrough silently dropped
    forwarded X-Lakehouse-Trace-Id
  - Rust IterateResponse trace_id echo (fixes 98b6647): closes the
    asymmetry where Go's response carried the join key and Rust's
    didn't
  - Unified longitudinal log demonstrated end-to-end: both daemons
    co-writing /tmp/lakehouse-validator/sessions.jsonl, distinct
    daemon tags, one DuckDB query covers both

24/24 parity assertions (validator 6/6, extract_json 12/12,
session_log 4/4, materializer 2/2) hold against live :3100 + :4110.
Both runtimes deployed with today's full stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 06:25:21 -05:00
root
1263720497 validatord: always populate session_id (fallback when Langfuse off)
Surfaced during the 2026-05-02 deploy + reality wave: the persistent
Go stack runs without LANGFUSE_URL/PUBLIC_KEY/SECRET_KEY env, so
shared.langfuseMiddleware operates as a passthrough — never minting
a trace id, never stashing it on the request context. Result:
session_id was empty on every JSONL row, breaking correlation across
the longitudinal log + replay_runs.jsonl + future Langfuse traces.

The fix: validatord falls back to a locally-generated time-ordered
hex id when both the X-Lakehouse-Trace-Id header AND the middleware
context are empty. Same shape Langfuse accepts, so a future deploy
that turns Langfuse on doesn't break correlation — already-emitted
session_ids stay valid as Langfuse trace ids.
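
A hedged sketch of the fallback: the shipped validatord id layout isn't
spelled out beyond the sample value below, so the nanosecond-prefix plus
random-suffix construction here is an assumption, not the actual code.

```go
// Hedged sketch of a "time-ordered hex id" fallback used only when both the
// X-Lakehouse-Trace-Id header and the middleware context are empty.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

func fallbackSessionID() string {
	// Time-ordered prefix keeps lexicographic order ~ chronological order.
	prefix := fmt.Sprintf("%016x", time.Now().UnixNano())
	suffix := make([]byte, 4)
	_, _ = rand.Read(suffix)
	return prefix + "-" + hex.EncodeToString(suffix)
}

func main() {
	headerID := ""  // X-Lakehouse-Trace-Id absent
	contextID := "" // middleware ran as a passthrough, minted nothing
	id := headerID
	if id == "" {
		id = contextID
	}
	if id == "" {
		id = fallbackSessionID() // only when BOTH sources are empty
	}
	fmt.Println("session_id:", id)
}
```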

Verified post-deploy by driving 9 /v1/iterate sessions through the
persistent stack at :4110:
  - 6 accepted on iter 0 (qwen2.5:latest first-shot 75%)
  - 2 max_iter_exhausted (no_json on prose-y prompts)
  - 1 infra_error (chatd cold-start probe timed out at 5s)

Latest row's session_id: "18abbabdc2306a83-c2306aa9" (was: "")

Probe re-runs (validator_parity, session_log_parity) included as
post-deploy artifacts; both 6/6 + 4/4 with the freshly-restarted
persistent gateway+validatord binaries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 06:03:43 -05:00
root
fa4e1b4e16 parity: session_log probe + Rust observability parity recorded
Companion to lakehouse commit 57bde63 (Rust gateway gains
trace-id propagation + coordinator session JSONL). The
cross-runtime parity probe is the regression gate that prevents
silent schema drift between the two runtimes.

scripts/cutover/parity/session_log_parity.sh:
  - 4 fixtures (accepted_grounded, max_iter_exhausted, infra_error,
    unicode_in_prompt) feed identical input to both helpers
  - jq -e validity gate + non-trivial-equal guard prevents the
    "both sides fail identically → spurious match" failure mode
    (caught one IFS='||' bug during initial authoring — recorded
    in the script comment)
  - normalize() strips timestamp + daemon (legitimate per-producer
    differences); everything else must be byte-equal
  - Result: 4/4 fixtures match, including unicode

scripts/cutover/parity/session_log_helper/main.go:
  - Tiny stdin/stdout Go helper that round-trips a fixture
    through validator.SessionRecord serde
  - Counterpart to crates/gateway/src/bin/parity_session_log.rs

docs/ARCHITECTURE_COMPARISON.md decisions tracker:
  - "Rust observability parity" row added (DONE 2026-05-02)
  - Cross-runtime probe documented as reusable gate

STATE_OF_PLAY refreshed.

Both observability pieces (trace-id propagation, session JSONL)
now exist on both runtimes. Operators who point Rust gateway and
Go validatord at the same session-log path get a unified
longitudinal stream queryable via DuckDB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 05:39:49 -05:00
root
7d6636b33e validator: align ValidationError JSON to Rust serde shape (6/6 parity)
Closes the 2026-05-02 parity finding: validator_parity probe found
5/6 body shapes diverging because Go emitted {"Kind":"...","Field":"...","Reason":"..."}
while Rust emits the externally-tagged-enum {"Schema":{"field":"...","reason":"..."}}.
A caller parsing the error envelope would break silently in cutover.

## Changes

internal/validator/types.go:
- Custom MarshalJSON emits the Rust shape:
    Schema:       {"Schema":      {"field":"x","reason":"y"}}
    Completeness: {"Completeness":{"reason":"y"}}
    Consistency:  {"Consistency": {"reason":"y"}}
    Policy:       {"Policy":      {"reason":"y"}}
- Custom UnmarshalJSON accepts BOTH the new Rust shape AND the legacy
  flat shape (migration safety for any persisted error rows).
- Unknown variants (e.g. a future Rust addition Go hasn't learned)
  surface as an Unmarshal error, not a silent default.
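
A minimal sketch of the MarshalJSON direction described above, assuming
a flat Kind/Field/Reason struct; the real types.go may carry the
variants differently.

```go
// Sketch: emit Rust's externally-tagged enum shape, e.g.
// {"Schema":{"field":"x","reason":"y"}}. Field names mirror the commit
// message; this is not the actual internal/validator/types.go code.
package main

import (
	"encoding/json"
	"fmt"
)

type ValidationError struct {
	Kind   string // "schema" | "completeness" | "consistency" | "policy"
	Field  string
	Reason string
}

func (e ValidationError) MarshalJSON() ([]byte, error) {
	inner := map[string]string{"reason": e.Reason}
	if e.Kind == "schema" {
		inner["field"] = e.Field // only the Schema variant carries a field
	}
	variant := map[string]string{
		"schema":       "Schema",
		"completeness": "Completeness",
		"consistency":  "Consistency",
		"policy":       "Policy",
	}[e.Kind]
	if variant == "" {
		return nil, fmt.Errorf("unknown ValidationError kind %q", e.Kind)
	}
	return json.Marshal(map[string]map[string]string{variant: inner})
}

func main() {
	out, _ := json.Marshal(ValidationError{Kind: "schema", Field: "x", Reason: "y"})
	fmt.Println(string(out)) // {"Schema":{"field":"x","reason":"y"}}
}
```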

internal/validator/types_test.go:
- 4 pinning tests anchor the wire format. Failing them = wire-format
  drift; the parity probe is the secondary line of defense.

scripts/validatord_smoke.sh:
- Updated probes to read the new variant-name shape (jq keys[0],
  .Schema.field) instead of legacy .Kind/.Field.

## Verification

- internal/validator unit tests: PASS (4 new + all existing).
- cmd/validatord HTTP tests: PASS (UnmarshalJSON falls through to flat
  shape so existing tests reading ValidationError still work).
- validatord_smoke.sh: 5/5 PASS through gateway :3110.
- validator parity probe re-run: **6/6 match** (was 1/6).

## Pattern

Per architecture_comparison's "use the dual-implementation as a
measurement instrument" thesis: a parity probe surfaced this gap;
50 LOC of MarshalJSON closed it; 4 pinning tests prevent regression;
the probe is the longitudinal gate. Cutover-friendly direction (Go
matches Rust) chosen because Rust is the existing production
contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:49:28 -05:00
root
b0c8a3f227 parity probes: materializer + extract_json (caught + fixed real bug)
Two new cross-runtime parity probes joining the validator probe from
the gauntlet wave. Pattern: feed identical input through Rust and Go;
diff outputs. Each probe surfaced a different signal.

## Materializer parity probe
scripts/cutover/parity/materializer_parity.sh runs the Bun + Go
materializers against an identical synthetic data/_kb/ root and diffs the
resulting evidence/ JSONL for byte-equivalence (modulo provenance.recorded_at).

**First run: 0/2 match.** Real finding: Go's Provenance.LineOffset
had `json:"line_offset,omitempty"` which strips the field when value
is 0. Line offset 0 is the FIRST ROW of every source file — a real
semantic value, not absent. Bun side always emits it.

Fix: drop `omitempty` on Provenance.LineOffset. Updated comment
explaining why.

**Re-run: 2/2 match.** On-wire JSON parity holds.
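
A trimmed stand-in (not the real Provenance struct) showing why
omitempty erases a zero line_offset and why dropping it restores parity:

```go
// With omitempty, line_offset 0 (the first row of every source file)
// vanishes from the JSON, so the Go output can never match the Bun output.
package main

import (
	"encoding/json"
	"fmt"
)

type withOmit struct {
	LineOffset int `json:"line_offset,omitempty"`
}

type withoutOmit struct {
	LineOffset int `json:"line_offset"` // always emitted, matches Bun
}

func main() {
	a, _ := json.Marshal(withOmit{LineOffset: 0})
	b, _ := json.Marshal(withoutOmit{LineOffset: 0})
	fmt.Println(string(a)) // {}
	fmt.Println(string(b)) // {"line_offset":0}
}
```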

## extract_json parity probe
scripts/cutover/parity/extract_json_parity.sh feeds 12 fixture
strings through both runtimes' extract_json:
  - fenced ```json``` blocks
  - unfenced ``` blocks
  - bare braces with prose around
  - first-balanced-of-many
  - nested objects
  - unicode in string values
  - escaped quotes
  - empty object
  - top-level array (both return first inner object)
  - no JSON
  - depth-balanced but invalid syntax
  - trailing garbage

Substrate gate: cargo test -p gateway extract_json PASS before probe.

**Result: 12/12 match.** Algorithms genuinely equivalent.

## scripts/cutover/parity/extract_json_helper/main.go
Tiny Go binary that reads stdin, calls validator.ExtractJSON, prints
{matched, value} JSON. Counterpart to the Rust parity_extract_json
binary, which lives in the sibling lakehouse repo (separate commit).
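
For orientation, a sketch of the helper's stdin → {matched, value}
shape. validator.ExtractJSON's actual signature isn't shown in this
commit, so extractFirstObject below is a crude stand-in (first balanced
braces, no string-escape handling), not the shared algorithm.

```go
// Hedged sketch of a parity helper: read stdin, attempt extraction, print a
// {matched, value} verdict the probe script can diff across runtimes.
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

func extractFirstObject(s string) (string, bool) {
	start := strings.IndexByte(s, '{')
	if start < 0 {
		return "", false
	}
	depth := 0
	for i := start; i < len(s); i++ {
		switch s[i] {
		case '{':
			depth++
		case '}':
			depth--
			if depth == 0 {
				candidate := s[start : i+1]
				if json.Valid([]byte(candidate)) {
					return candidate, true
				}
				return "", false
			}
		}
	}
	return "", false
}

func main() {
	in, _ := io.ReadAll(os.Stdin)
	value, matched := extractFirstObject(string(in))
	out, _ := json.Marshal(map[string]any{"matched": matched, "value": value})
	fmt.Println(string(out))
}
```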

## Pattern crystallized
Every cross-runtime port should land with a parity probe. Three
probes now exist:
  - validator (5/6 wire-format gap captured 2026-05-02)
  - materializer (caught + fixed real bug 2026-05-02)
  - extract_json (12/12 match 2026-05-02)

The instrument is reusable — each new shared HTTP/CLI surface gets
a probe row added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:43:54 -05:00
root
e8cf113af8 gauntlet 2026-05-02: smoke chain + per-component scrum + parity probe
Production-readiness gauntlet exploiting the dual Rust/Go
implementation as a measurement instrument.

## Phase 1 — Full smoke chain
21/21 PASS in ~60s. Substrate intact across the full service surface.

## Phase 2 — Per-component scrum (token-volume fix)
Prior wave (165KB diff): Kimi 62 tokens out, Qwen 297 → no useful
analysis. This wave splits today's commits into 4 focused bundles
(36-71KB each):
  c1 validatord (46KB) → 0 convergent / 11 distinct
  c2 vectord substrate (36KB) → 0 convergent / 10 distinct
  c3 materializer (71KB) → 0 convergent / 6 distinct (Opus emitted
                           a BLOCK then self-retracted in same response)
  c4 replay (45KB) → 0 convergent / 10 distinct

Reviewer engagement vs prior wave: Kimi went 62 → ~250 tokens out
once bundles dropped below 60KB.

scripts/scrum_review.sh hardening:
  * Diff-size guard (warn >60KB, hard-fail >100KB,
    SCRUM_FORCE_OVERSIZE=1 override)
  * Tightened prompt — file path must appear EXACTLY as in diff
    so post-processor can grep WHERE: lines reliably
  * Auto-tally step dedupes by (reviewer, location); convergence
    counts distinct lineages (closes the prior `opus+opus+opus`
    false-convergence bug)

## Phase 3 — Cross-runtime validator parity probe (the headline finding)
scripts/cutover/parity/validator_parity.sh sends 6 identical
/v1/validate cases to Rust :3100 AND Go :4110, compares status+body.

Result: **6/6 status codes match · 5/6 body shapes diverge.**

Rust returns serde-tagged enum:   {"Schema":{"field":"x","reason":"y"}}
Go returns flat exported-fields:  {"Kind":"schema","Field":"x","Reason":"y"}

Both round-trip inside their own runtime; a caller swapping one for
the other would break parsing silently. Captured as new _open_ row
in docs/ARCHITECTURE_COMPARISON.md decisions tracker.

This is the "use the dual-implementation as a measurement instrument"
return — single-repo scrums can't catch this class of cross-runtime
drift.

## Phase 4 — Production assessment
ship-with-known-gap. Validator wire-format gap is documented, not
regressed. ~50 LOC future fix on Go side (custom MarshalJSON on
ValidationError to match Rust's serde shape).

Persistent stack config (/tmp/lakehouse-persistent.toml) gains
validatord on :3221 + persistent-validatord binary so operators
bringing up the persistent stack get the new daemon automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:05:18 -05:00
root
09299a27b7 scrum 2026-05-02: materializer+replay+vectord — ship-with-fixes
Cross-lineage review of 89ca72d (Opus + Kimi + Qwen3-coder).

Convergent findings (≥2 reviewers): NONE.

- Kimi BLOCK (materializer main.go exits 0 on validation fail):
  confabulation. Code does os.Exit(1) at lines 65-66.
- Qwen BLOCK (saveTask race condition): confabulation. All access
  to inflight/pending is under s.mu.
- Qwen WARN (saveAfter nil deref): confabulation. Explicit
  `if h.persist == nil { return }` guard at line 184.
- Opus BLOCK (TestSaveTask_Coalesces): self-withdrawn in same
  response.

Opus WARNs actioned:
- Detached docstring on TestAdd_SmallIndex_ConcurrentDistinctIDs —
  attached.
- isoDatePartition fallback comment — clarified as defense-in-depth
  (MaterializeAll guards upstream; branch unreachable through public
  surface).

Disposition + verdicts in reports/scrum/_evidence/2026-05-02/.

Pattern matches feedback_cross_lineage_review.md: only Opus emits
BLOCK-class findings worth verifying; Kimi/Qwen single-reviewer
BLOCKs failed trace verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 03:35:12 -05:00
root
89ca72d471 materializer + replay ports + vectord substrate fix verified at scale
Two threads landing together — the doc edits interleave so they ship
in a single commit.

1. **vectord substrate fix verified at original scale** (closes the
   2026-05-01 thread). Re-ran multitier 5min @ conc=50: 132,211
   scenarios at 438/sec, 6/6 classes at 0% failure (was 4/6 pre-fix).
   Throughput dropped 1,115 → 438/sec because previously-broken
   scenarios now do real HNSW Add work — honest cost of correctness.
   The fix (i.vectors side-store + safeGraphAdd recover wrappers +
   smallIndexRebuildThreshold=32 + saveTask coalescing) holds at the
   footprint that originally surfaced the bug.

2. **Materializer port** — internal/materializer + cmd/materializer +
   scripts/materializer_smoke.sh. Ports scripts/distillation/transforms.ts
   (12 transforms) + build_evidence_index.ts (idempotency, day-partition,
   receipt). On-wire JSON shape matches TS so Bun and Go runs are
   interchangeable. 14 tests green.

3. **Replay port** — internal/replay + cmd/replay +
   scripts/replay_smoke.sh. Ports scripts/distillation/replay.ts
   (retrieve → bundle → /v1/chat → validate → log). Closes audit-FULL
   phase 7 live invocation on the Go side. Both runtimes append to the
   same data/_kb/replay_runs.jsonl (schema=replay_run.v1). 14 tests green.

Side effect on internal/distillation/types.go: EvidenceRecord gained
prompt_tokens, completion_tokens, and metadata fields to mirror the TS
shape the materializer transforms produce.

STATE_OF_PLAY refreshed to 2026-05-02; ARCHITECTURE_COMPARISON decisions
tracker moves the materializer + replay items from _open_ to DONE and
adds the substrate-fix scale verification row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 03:31:02 -05:00
root
277884b5eb multitier_100k: 335k scenarios @ 1,115/sec against 100k corpus, 4/6 at 0% fail
J asked for a much more sophisticated test using the 100k corpus from
the Rust legacy database. This commit ships:

scripts/cutover/multitier/main.go — 6-scenario harness with weighted
random selection per goroutine. Mixes search, email/SMS/fill
validators (in-process via internal/validator), profile swap with
ExcludeIDs, repeat-cache exercise, and playbook record/replay.

Scenarios + weights (cumulative scenario fractions):
  35% cold_search_email      — search + email outreach + EmailValidator
  15% surge_fill_validate    — search + fill proposal + FillValidator + record
  15% profile_swap           — original search + ExcludeIDs swap + no-overlap check
  15% repeat_cache           — same query × 5 (cache effectiveness)
  10% sms_validate           — SMS draft (≤160 chars, phone for SSN-FP guard)
  10% playbook_record_replay — cold → record → warm w/ use_playbook=true

Test results (5-min sustained, conc=50, 100k workers indexed):
  TOTAL 335,257 scenarios @ 1,115/sec
  cold_search_email     117k @ 0.0% fail · p50 2.2ms · p99 8.6ms
  surge_fill_validate    50k @ 98.8% fail (substrate bug below)
  profile_swap           50k @ 0.0% fail · p50 4.5ms · ExcludeIDs verified
  repeat_cache           50k × 5 = 252k searches @ 0.0% fail · p50 11.7ms
  sms_validate           33k @ 0.0% fail · phone-pattern guard works
  playbook_record_replay 33k @ 96.8% fail (substrate bug below)
  Total successful workflows: ~250k+

Validator integration verified at load:
  150,930 EmailValidator passes across cold_search_email + sms_validate
  35 FillValidator + 1,061 playbook_record successes (the requests where
    the substrate bug didn't fire)
  zero false positives on the SSN-pattern guard against phone numbers

Resource footprint at 100k:
  vectord 1.23GB RSS (linear with 100k vectors)
  matrixd 26MB, 75% CPU (1-core saturated at conc=50)
  Total across 11 daemons: 1.7GB
  Compare to Rust at 14.9GB — ~10× less even at 100k.

SUBSTRATE BUG SURFACED: coder/hnsw v0.6.1 nil-deref in
layerNode.search at graph.go:95. Triggers on /v1/matrix/playbooks/record
under sustained writes to the small playbook_memory index. Both Add
and Search paths can panic.

Workaround applied (this commit) in internal/vectord/index.go
BatchAdd: recover() guard converts panic to error; daemon stays up
instead of crashing the request handler.
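
A sketch of the workaround pattern, with a placeholder standing in for
the coder/hnsw call; the actual BatchAdd / safeGraphAdd wrappers differ
in detail.

```go
// Wrap the library call so a panic inside it becomes an error the request
// handler can return, keeping the daemon up.
package main

import (
	"errors"
	"fmt"
)

func addToGraph(id string, vec []float32) {
	// Placeholder for the upstream call; panics to simulate the bug.
	panic("nil pointer dereference in layerNode.search")
}

func safeAdd(id string, vec []float32) (err error) {
	defer func() {
		if r := recover(); r != nil {
			// Convert the panic into an error instead of crashing the daemon.
			err = errors.New(fmt.Sprint("hnsw add panicked: ", r))
		}
	}()
	addToGraph(id, vec)
	return nil
}

func main() {
	if err := safeAdd("w-1", []float32{0.1, 0.2}); err != nil {
		fmt.Println("request failed, daemon still up:", err)
	}
}
```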

Operator recovery procedure (also documented in the report):
  curl -X DELETE http://localhost:4215/vectors/index/playbook_memory
Next record recreates the index fresh.

Real fix DEFERRED — open in docs/ARCHITECTURE_COMPARISON.md
Decisions tracker. Three options:
  a) upstream patch to coder/hnsw
  b) custom small-index Add path that always rebuilds when len < threshold
  c) alternate store for playbook_memory (Lance? in-memory map?)

Evidence: reports/cutover/multitier_100k.md (full methodology +
results + repro + bug analysis). docs/ARCHITECTURE_COMPARISON.md
Decisions tracker updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 06:28:50 -05:00
root
3a2823c02f g5 cutover: bigger load test — 5.87M req, 0 errors, 370MB RSS
Larger-scale follow-up to the original load test. Three axis
expansions: corpus 200→5K workers, body variety 6→200 distinct
queries, concurrency sweep 10/50/100/200, plus mixed
embed+search workload.

Concurrency sweep on /v1/matrix/search direct (3 min each):
  conc=10:  486,733 req  · 2,704 RPS · p50 2.19ms · p99 6.7ms
  conc=50:  1,148,543 req · 6,381 RPS · p50 7.08ms · p99 20ms
  conc=100: 1,253,389 req · 6,963 RPS · p50 13.34ms · p99 37ms
  conc=200: 1,460,676 req · 8,114 RPS · p50 23.45ms · p99 56ms

Mixed embed+search at 60 conc each, 90s:
  /v1/embed: 1,127,854 req · 12,531 RPS · p50 3.31ms · p99 14.6ms
  /v1/matrix/search: 392,229 req · 4,358 RPS · p50 12.68ms · p99 33.8ms

TOTAL: 5,869,424 requests across ~13.5 minutes. ZERO errors.

Resource footprint during peak load:
  matrixd  105% CPU, 33MB RSS (bottleneck — pegs 1 core)
  vectord   39% CPU, 82MB RSS
  gateway   44% CPU, 41MB RSS
  embedd    30% CPU, 67MB RSS
  Total RSS across 11 daemons: ~370MB

Compare to Rust gateway under similar load: 14.9GB RSS, 374% CPU.
Go uses ~40x less memory + spreads load across daemons rather
than packing into one mega-process.

Saturation analysis:
- conc 10→50: +135% RPS (linear-ish scaling)
- conc 50→100: +9% RPS (saturation begins)
- conc 100→200: +17% RPS (matrixd 1-core pegged)

Headroom paths if production exceeds current demand:
1. Run multiple matrixd instances behind a load balancer.
   Substrate is stateless (recordings via storaged), horizontal
   scale is straightforward.
2. Profile matrixd's per-request work (role-gate + judge-eligibility
   + result merge).
3. Skip Bun for hot endpoints (direct nginx → Go = 5.7x previously
   measured).

Evidence: reports/cutover/g5_load_test_big.md (full tables +
methodology + repro script).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 05:18:00 -05:00
root
2a974d6dea docs: ARCHITECTURE_COMPARISON.md as living source file
Per J's request: move the parallel-runtime comparison from
reports/cutover/ (where it lived as cutover-prep evidence) into
docs/ as the source-of-truth file. J will keep updating it as
fixes ship on either side.

Restructured for living-document use:
- Status header (last refresh date, owner, update triggers)
- 'How to update this doc' section with explicit dos and don'ts
- Decisions tracker at top — actioned items with commit refs
  + open backlog with LOC estimates
- Each comparison section now has 'Last verified' columns where
  numbers are time-sensitive
- Change log section at bottom for one-line entries on every
  meaningful refresh

The original at reports/cutover/architecture_comparison.md gains
a 'THIS IS A SNAPSHOT' header pointing at the docs/ source. Kept
as historical record but no longer the place to update.

Sister pointer file in /home/profit/lakehouse/docs/ARCHITECTURE_COMPARISON.md
so the doc is reachable from either repo side. That file explicitly
says the source lives in golangLAKEHOUSE and warns against
authoritative content in the pointer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 04:56:20 -05:00
root
b3ad14832d architecture_comparison: Rust vs Go lakehouse — weaknesses, strengths, abstracts to address
J asked for the comparison before locking in primary line. This
report documents what's actually structurally different vs
implementation-level different, and what to do about each.

Key findings:

1. Python sidecar is the single biggest architectural lever
   - Rust: gateway → HTTP → Python sidecar :3200 → HTTP → Ollama
   - Go:   gateway → HTTP → embedd → HTTP → Ollama (no Python)
   - Sidecar adds zero compute over Ollama (just pydantic + httpx)
   - 63× perf gap (8,119 vs 128 RPS) driven by sidecar + cache absence

2. Process model: Rust 1 mega-binary (14.9G RSS), Go 11 daemons
   - Rust: simpler ops at small scale, panic blast radius = whole system
   - Go: per-daemon scale + crash isolation, more config surface

3. Code volume: Go 15,128 lines vs Rust 35,447 + 1,237 sidecar
   - Go is 43% the size doing similar work
   - Gap concentrated in vectord (Rust 11k lines, Go 804 — Lance + benchmarking)

4. Distillation pipeline asymmetry
   - Audit/observation: BOTH sides parallel-mature
   - Production: Rust-only (materializer + replay + RAG/pref export)
   - Go can READ everything but can't PRODUCE evidence

5. Production validators (FillValidator/EmailValidator/'/v1/validate')
   - Rust has them (1,286 lines, 12 tests each)
   - Go doesn't — matrix gate covers role bleed but not structural validation

Cross-cutting abstracts to address regardless of which wins:
- Drop Python sidecar from Rust (call Ollama directly)
- Add LRU embed cache to Rust aibridge
- Port materializer + replay + validators to Go
- Pin shared JSONL schemas as canonical (both runtimes consume same spec)
- Decide on Lance backend (defer until corpus > 5M rows)

If keeping Go primary: port materializer first, validators second,
skip Lance. If keeping Rust primary: drop Python + add cache,
port chatd 5-provider dispatcher + cross-role gate from Go.

Bottom line: substrate is parallel-mature on observation; producer
side is Rust-only; performance structurally favors Go ~60× on warm
workloads; operations favors Go on isolation; production deployment
favors Rust today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 04:34:24 -05:00
root
c164a3da96 g5 cutover: production load test — 0 errors / 101k req · Go direct = 2,772 RPS
Sustained-traffic load test against the cutover slice. Three runs,
zero correctness errors across 101,770 total requests. Substrate
holds up under concurrent load — matrix gate, vectord HNSW,
embedd cache, gateway proxy all hold. This was the load test's
primary question; latency numbers are secondary.

scripts/cutover/loadgen — focused Go load generator. 6-query
rotating body mix (Forklift/CNC/Warehouse/Picker/Loader/Shipping).
Configurable URL/concurrency/duration. Reports per-status-code
counts + p50/p95/p99 latencies + JSON summary on stderr.

Three runs:

  baseline (Bun → Go, conc=1, 10s):
    4,085 req · 408 RPS · p50 1.3ms · p99 32ms · max 215ms

  sustained (Bun → Go, conc=10, 30s):
    14,527 req · 484 RPS · p50 4.6ms · p99 92ms · max 372ms

  direct (→ Go, conc=10, 30s):
    83,158 req · 2,772 RPS · p50 2.5ms · p99 8.5ms · max 16ms

Critical findings:

1. ZERO correctness errors across 101k requests. No 5xx, no
   transport errors, no panics. Concurrency-safety verified across
   matrix gate / vectord / gateway / embedd cache.

2. Direct-to-Go is production-grade. 2,772 RPS at p99 8.5ms on a
   single host, no scaling cliff at concurrency=10.

3. Bun frontend is the bottleneck. -82% RPS, +982% p99 vs direct.
   Single-process JS event loop queueing under concurrent
   requests — known Bun proxy-mode characteristic. The substrate
   itself isn't the limiter.

4. For staffing-domain demand levels (<1 RPS typical per
   coordinator), Bun-fronted 484 RPS has 480× headroom. No
   urgency to optimize Bun out of the data path. If/when
   concurrent demand grows orders of magnitude, the path is
   nginx → Go direct for hot endpoints, skip Bun.

Substrate is now load-tested and verified production-ready.

What this load test does NOT cover (documented in
g5_load_test.md): cold-cache embed, larger corpus, mixed
read/write, multi-host, full 5-loop traffic with judge gate
calls. Each is its own probe shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 04:20:41 -05:00
root
6507dff26d g5 cutover: first 5-loop end-to-end through Bun frontend
Companion to c522ace (cutover slice live). That commit proved
infrastructure (Bun /_go/* → Go gateway). This commit proves the
SUBSTRATE'S CORE LEARNING BEHAVIOR through the same path.

Two tests against persistent Go stack on :4110 with the 200-worker
corpus, all traffic via Bun frontend on :3700:

TEST 1: same-role boost fires with exact math
  Q1: Need 3 Forklift Operators in Aurora IL for Parallel Machining
  query_role: "Forklift Operator"

  cold (use_playbook=false):
    rank=0 id=w-43  dist=0.4449  Brian Ramirez, Springfield IL

  POST /_go/v1/matrix/playbooks/record:
    query_text=Q1, role=Forklift Operator, answer_id=w-43, score=1.0
    → playbook_id=pb-1126c52bd106df6b

  warm (use_playbook=true):
    rank=0 id=w-43  dist=0.2224  ← halved
    boosted=1, injected=0

  Math check: BoostFactor = 1 - 0.5*score = 0.5 (for score=1.0).
  Expected warm_dist = 0.4449 * 0.5 = 0.22245.
  Observed: 0.2224. 4-decimal exact through 3 HTTP hops.

TEST 2: cross-role gate prevents bleed
  Q2: Need 1 CNC Operator in Detroit MI for Beacon Freight
  query_role: "CNC Operator"
  use_playbook: true (Forklift recording from Test 1 in playbook corpus)

  result:
    rank=0 id=w-175  Kevin Ruiz       (Machine Operator, Detroit MI)
    rank=2 id=w-102  Laura Long       (Forklift Operator, Cleveland OH)
    boosted=0, injected=0  ← role gate fired correctly

  w-102 (Forklift Operator) appears at rank 2 organically via
  cosine retrieval — but boosted=0 confirms the Forklift PLAYBOOK
  did NOT influence this query. Surgical: gate suppresses
  playbook-driven boosts from cross-role recordings, leaves
  organic retrieval untouched.

What this confirms about the substrate:
1. Learning works — single recording → measurable, math-exact boost
2. Bleed protection works — role gate (real_001 fix) holds through
   cutover slice
3. Math holds across HTTP hops — Bun → gateway → matrixd → vectord
   with no drift
4. Substrate works through real production-shape framing — CORS,
   content-type, body forwarding, all transparent

The substrate's reason-for-being (5-loop learning) is now
demonstrably executing on persistent daemons under
production-shape frontend traffic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 04:14:21 -05:00
root
c522acec8b g5 cutover slice live — first real Bun-frontend traffic to Go substrate
J said "let's go" → "next" (option 3): actual flip via Bun
mcp-server. Done. Real Bun-frontend traffic now reaches the Go
substrate via /_go/* on Bun :3700, routed to the persistent Go
gateway at :4110.

Companion change in /home/profit/lakehouse (Rust legacy):
  mcp-server/index.ts: new /_go/* pass-through, opt-in via
  GO_LAKEHOUSE_URL env var. Off-by-default (returns 503 on
  /_go/* with rationale). Existing /api/* (Rust gateway) path
  unchanged. Committed locally on the demo/post-pr11 branch.

System config:
  /etc/systemd/system/lakehouse-agent.service.d/go-cutover.conf
  adds Environment=GO_LAKEHOUSE_URL=http://127.0.0.1:4110 to
  the systemd-managed Bun service. Reversible via systemctl
  revert lakehouse-agent.

Live verification (operator curl through Bun frontend):
- /_go/health: gateway responds {"status":"ok","service":"gateway"}
- /_go/v1/embed: nomic-embed-text-v2-moe vectors, dim=768
- /_go/v1/matrix/search vs persistent 200-worker corpus:
    rank=0 id=w-43  Brian Ramirez   (Forklift Operator, Springfield IL)
    rank=1 id=w-102 Laura Long      (Forklift Operator, Cleveland   OH)
    rank=2 id=w-101 Terrence Gray   (Forklift Operator, Champaign   IL)
    3/3 role match, top-1 in IL exactly
- /api/health: lakehouse ok (Rust path unchanged — control verified)

What this is NOT:
- Not an nginx flip — devop.live/lakehouse/* still goes through
  /api/* → Rust :3100. /_go/* is parallel slice for opt-in.
- Not a tool-level cutover — each /_go/<path> is a manual choice;
  no automatic mapping of Rust paths to Go equivalents.
- Not a transformation layer — caller sends Go-shaped requests
  (e.g. /_go/v1/embed expects {texts, model}, not {text}).

Three cutover unit properties verified:
- ADDITIVE: zero modification to any existing Bun tool
- REVERSIBLE: unset GO_LAKEHOUSE_URL → /_go/* → 503
- ISOLATED: Rust gateway state unaffected (different port,
  different binary, different MinIO bucket)

This is the cutover slice operators can use to validate Go-side
handlers under realistic frontend conditions before any
production-traffic flip. Next step (deferred): pick a specific
mcp-server tool to optionally route through Go with response-
shape adapter — that's a product-visible flip rather than this
infrastructure-visible slice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 03:45:41 -05:00
root
77a3dcf266 cutover: first end-to-end coordinator query against persistent Go stack
Three real-shape demand queries against the long-running 11-daemon
stack with 500 workers ingested from workers_500k.parquet (real
production data). Substrate is producing useful answers:

Q1 (Forklift @ Aurora IL): 5/5 role match, top 3 in IL, dist 0.44-0.46
Q2 (CNC @ Detroit MI):     top-1 in Detroit MI exactly, role pulls
                           Machine Operator (semantic neighbor)
Q3 (Warehouse @ Indianapolis IN): top-1 in Indianapolis IN, 5/5 role
                                  match, dist 0.42-0.54

This is the FIRST end-to-end coordinator-shape query against the
persistent Go stack — every prior reality test (real_001..real_005)
ran through harness-transient stacks that died on exit. This one
ran against daemons that have been up for minutes and stayed up
through retrieval.

Geo is load-bearing: top-1 city/state matched in 3/3 queries.
Embedder treats geography as a primary feature.

Q2's CNC→Machine Operator gap exposes the playbook learning loop's
purpose: judge would rate this ~3/5; the first time a coordinator
approves a Machine Operator for a CNC Operator query, that
recording starts shifting substrate behavior. That's the loop
we've been building toward — the persistent stack is now the
substrate that loop will run on.

Evidence: reports/cutover/persistent_stack_first_query.md (full
top-K tables + read on each query).

What this does NOT prove:
- Production-volume load (3 queries, 500 workers)
- Concurrent latency
- Full 5-loop substrate (this exercised retrieval only; no
  playbook recordings exist on the persistent stack yet)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 03:10:09 -05:00
root
09904d5222 cutover: persistent Go stack milestone — first long-running deployment + first Go-emitted audit_baselines entry
J's "let's go" instruction: leave OPEN list behind, push the Go
substrate forward into actual deployment shape. This commit marks
the first time the Go side has run as long-running daemons rather
than per-harness transient processes, and the first time the
shared cross-runtime longitudinal log has carried a Go-emitted
entry alongside the Rust ones.

What landed:

scripts/cutover/start_go_stack.sh — the persistent-stack runbook.
Brings up all 11 daemons (storaged → catalogd → ingestd → queryd
→ embedd → vectord → pathwayd → observerd → matrixd → gateway,
plus chatd-if-not-already-up) in dependency order via nohup +
disown. Anchored pkill per feedback_pkill_scope (never bare
"bin/"). Logs land in /tmp/gostack-logs/<bin>.log, one per daemon.

Verified live state:
- All 11 services healthy on :3110 + :3211-:3220
- gateway → embedd proxy returns nomic-embed-text-v2-moe vectors
- chatd reports 5/5 providers loaded
- No port collision with Rust gateway on :3100
- Daemons stay up after exit of the start script (production shape,
  not harness-transient)

audit_baselines.jsonl crosses the runtime boundary:
- 7 Rust-emitted entries (last: ca7375ea 2026-04-27)
- 1 Go-emitted entry (ee2a40c 2026-05-01T07:53:54Z) appended via
  ./bin/audit_full -append-baseline
- Same envelope shape, same metric set, same drift comparator
  semantics — operators running either runtime grow the same log

What this DOES prove:
- Substrate parity at deployment shape (not just unit tests)
- Cross-runtime artifact write-side compatibility (was previously
  proven on read side via audit_baselines roundtrip)
- The deploy machinery works end-to-end for the persistent case

What this does NOT prove (still ahead):
- Real coordinator traffic against the Go stack (no nginx flip yet;
  devop.live/lakehouse/ still serves through Rust)
- Go-side production materializer (Phase 2 is observer-only)
- Replay tool parity (Phase 7 is observer-only)
- The 5-loop product gate against actual humans

reports/cutover/SUMMARY.md now logs three new rows:
- audit-FULL with 12/12 phases ported
- First Go-emitted audit_baselines entry
- Persistent Go stack live

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:55:29 -05:00
root
ee2a40c505 audit-FULL: port phases 1/2/5/7 — only acceptance.ts (TS-only) remains skipped
Closes 4 of the 5 phases the initial audit-FULL port left as
deferred. The pattern: most "deferred" phases didn't actually need
the un-ported Rust pieces — they were observer-mode by design and
just needed to read existing on-disk artifacts.

Phase 1 (schema validators) → ported via exec.Command:
  Invokes `go test ./internal/distillation/...` — the Go equivalent
  of Rust's `bun test auditor/schemas/distillation/`. New
  GoTestModule field on AuditFullOptions controls the package
  pattern; empty disables the invocation (test mode, prevents
  recursion when audit-full is invoked from inside `go test`).
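
A sketch of that exec.Command shape: GoTestModule is the field named
above; the surrounding function and option plumbing are assumed for
illustration.

```go
// Hedged sketch of Phase 1's invocation: run `go test` on the configured
// package pattern, skip entirely when the pattern is empty (test mode).
package main

import (
	"fmt"
	"os/exec"
)

type AuditFullOptions struct {
	Root         string
	GoTestModule string // e.g. "./internal/distillation/..."; empty = skip
}

func runPhase1(opts AuditFullOptions) (pass bool, detail string) {
	if opts.GoTestModule == "" {
		// Avoid recursing when audit-full itself runs under `go test`.
		return true, "phase 1 skipped (no GoTestModule)"
	}
	cmd := exec.Command("go", "test", opts.GoTestModule)
	cmd.Dir = opts.Root
	out, err := cmd.CombinedOutput()
	if err != nil {
		return false, string(out)
	}
	return true, string(out)
}

func main() {
	ok, detail := runPhase1(AuditFullOptions{Root: ".", GoTestModule: ""})
	fmt.Println(ok, detail)
}
```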

Phase 2 (evidence materialization) → ported as observer:
  Reads data/evidence/ directly and tallies rows + tier-1 source
  hits. Doesn't re-run the materializer (which is Rust-side TS).
  Emits p2_evidence_rows + p2_evidence_skips metrics matching
  Rust shape — drop-in audit_baselines.jsonl entries possible.

Phase 5 (run summary) → ported as observer:
  Reads reports/distillation/{run_id}/summary.json + 5 stage
  receipts. Validates schema_version=1, run_hash sha256, git_commit
  40-char hex, all stage receipts decode as JSON. Full schema
  validation (StageReceipt schema) is intentionally NOT ported —
  it would require porting the TS schemas/distillation/ validators
  in full; basic shape checks catch the load-bearing invariants.

Phase 7 (replay log) → ported as observer:
  Reads data/_kb/replay_runs.jsonl, validates last 50 rows parse
  as JSON. Skips the live-replay invocation that Rust's phase 7
  also does — porting Rust replay.ts is substantial and not in
  scope. The "log shape sanity" check is what audit-full actually
  needs; the live invocation is a separate concern.

Phase 6 (acceptance gate) — STILL SKIPPED:
  Rust acceptance.ts is a TS-only fixture harness with bun-specific
  deps. Porting the fixtures (tests/fixtures/distillation/acceptance/)
  + the 22-invariant runner to Go is an ADR-worth undertaking.
  Documented in the header comment.

Live-data probe (against /home/profit/lakehouse):
  Skips count: 4 → 1 (only phase 6).
  Required checks: 6/6 → 12/12 PASS.
  New metric: p2_evidence_rows=1055, BYTE-EQUAL to the Rust
  pipeline's collect.records_out from the latest summary.json.
  Cross-runtime parity now extends across phases 0/1/2/3/4/5/7.

6 new tests:
- TestPhase2_EvidenceTallyFromOnDisk: row + tier-1-hit tallying
- TestPhase5_FullSummaryFlow: complete run-summary fixture passes
- TestPhase5_ShortRunHashCaught: bad run_hash fails required check
- TestPhase7_ReplayLogReadsFromDisk: row-count reporting
- TestPhase7_MalformedTailRowsCaught: structural parse failure
- TestRunAuditFull_FullFixtureFlow updated to seed evidence/ +
  reports/distillation/ for the phases now wired.

Cleanup: removed local sortStrings helper (replaced with sort.Strings
now that `sort` is imported for phase 5's mtime-sort).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:35:13 -05:00
root
55b8c76a8c distillation: audit-FULL pipeline port (phases 0/3/4) — cross-runtime metric parity verified
Ports the metric-collection passes from scripts/distillation/audit_full.ts.
The substrate that PRODUCES audit_baselines.jsonl entries — the
half OPEN #2 left as "deferred to next wave" after the read/write
substrate landed in ca142b9.

Phase coverage:
  Phase 0 (file presence)             ported
  Phase 1 (schema validators)         skipped (Go's `go test` covers it)
  Phase 2 (materializer dry-run)      deferred (Go materializer not yet ported)
  Phase 3 (scored-runs distribution)  ported
  Phase 4 (contamination firewall)    ported
  Phase 5 (receipts validation)       deferred (Go run-summary JSON not yet emitted)
  Phase 6 (replay sanity)             deferred (Go replay tool not ported)
  Phase 7 (run summary lineage)       deferred (same)

Cross-runtime parity verified end-to-end:
  Go-side audit-full against /home/profit/lakehouse produced
  metrics IDENTICAL to the last Rust-emitted audit_baselines.jsonl
  entry. All 8 ported metrics match byte-for-byte:
    p3_accepted=386, p3_partial=132, p3_rejected=57, p3_human=480,
    p4_sft_rows=353, p4_rag_rows=448, p4_pref_pairs=83, p4_total_quarantined=1325
  6/6 required checks pass on live data.

Components:
- internal/distillation/audit_full.go: PhaseCheck struct (mirrors
  Rust shape), PhaseCheckReport aggregation, RunAuditFull
  orchestrator, auditPhase0/3/4 implementations, FormatAuditFullReport
  Markdown writer.
- cmd/audit_full/main.go: CLI binary with -root, -out, -json,
  -append-baseline flags. Operators run "./bin/audit_full
  -append-baseline" to grow the longitudinal log alongside the
  Rust pipeline (entries are interchangeable — same envelope shape).
- 6 new tests: empty-root failure handling, full-fixture clean PASS
  (locks all 8 metrics + all 6 required checks), SFT firewall
  contamination detection, preference self-pair detection, sig_hash
  regex correctness (rejects wrong-length + uppercase), Markdown
  formatter smoke.

Live-data probe captured at reports/cutover/audit_full_go_vs_rust.md
(linked from reports/cutover/SUMMARY.md). Same shape as the
audit_baselines round-trip evidence — both Go-side ports of the
distillation surface are now validated against real Rust data, not
just fixtures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 01:30:23 -05:00
root
0d4f033b34 audit_baselines: round-trip validation against live Rust data
Same shape of proof as embed_parity.sh for the embed endpoint:
take the just-shipped Go port (ca142b9) and validate it against
the actual production data the Rust legacy emits, not just unit-
test fixtures. Locks the cross-runtime parity that operators
running mixed pipelines depend on.

scripts/cutover/audit_baselines_validate.go:
- Reads /home/profit/lakehouse/data/_kb/audit_baselines.jsonl
- Parses every entry via the Go AuditBaseline struct
- Round-trips the last entry: encode → decode → field-by-field
  equality check (catches any silently-dropped JSON keys)
- Calls LoadLastBaseline against the live file (proves the public
  API works on real shapes, not just inline parsing)
- Computes BuildAuditDriftTable(first → last) — full-window
  lineage drift over the captured baselines
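
A sketch of the round-trip idea from the list above, assuming a
stand-in struct far smaller than the real AuditBaseline and a
key-presence comparison that the script may implement differently.

```go
// Decode an entry into the Go struct, re-encode, and flag any top-level key
// present in the original JSON but absent after re-encoding: that is a key
// the struct would silently drop.
package main

import (
	"encoding/json"
	"fmt"
)

type auditBaseline struct {
	RunID      string `json:"run_id"`
	P3Accepted int    `json:"p3_accepted"`
}

func droppedKeys(raw []byte) ([]string, error) {
	var entry auditBaseline
	if err := json.Unmarshal(raw, &entry); err != nil {
		return nil, err
	}
	reencoded, err := json.Marshal(entry)
	if err != nil {
		return nil, err
	}
	var orig, round map[string]any
	if err := json.Unmarshal(raw, &orig); err != nil {
		return nil, err
	}
	if err := json.Unmarshal(reencoded, &round); err != nil {
		return nil, err
	}
	var missing []string
	for k := range orig {
		if _, ok := round[k]; !ok {
			missing = append(missing, k)
		}
	}
	return missing, nil
}

func main() {
	missing, err := droppedKeys([]byte(`{"run_id":"r1","p3_accepted":386,"p4_sft_rows":353}`))
	fmt.Println(missing, err) // [p4_sft_rows] <nil> (the stand-in struct lacks that field)
}
```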

Live-data probe results (reports/cutover/audit_baselines_roundtrip.md):
- 7 entries parse without error
- Round-trip is byte-equal on every metric + every header field
- Drift table fires the expected verdicts:
  - p2_evidence_rows 12→82 (+583%) → warn (above 20% threshold)
  - p3_accepted/partial/rejected/human 0→non-zero → warn (the
    zero-baseline edge case TestBuildAuditDriftTable_ZeroBaseline
    was designed to lock — verified now firing on real history)
  - p4_* metrics +0% → ok (stable across the window)

What this does NOT prove (documented in the report): the Go-side
audit-FULL pipeline that PRODUCES baselines doesn't exist yet.
Only the load/append/drift substrate is ported. Operators running
audit-full from Go would still need a metric-collection pass —
that's a separate port deliberately not in this wave.

reports/cutover/SUMMARY.md gains a new row alongside the embed
parity entries; cutover-prep verification log keeps the
discipline of "verified against real data, not just fixtures."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:20:18 -05:00
root
cca32344f3 reality_test real_005: negation probe — substrate gap is correctly out-of-scope
5 explicit-negation queries ("Need Forklift Operators in Aurora IL,
NOT in Detroit", "excluding Cornerstone Fabrication roster", etc.)
through the standard playbook_lift harness. Goal: characterize
whether the substrate has negation handling or silently treats
"NOT X" as "X".

Headline: substrate has zero negation handling. Cosine over dense
embeddings treats "NOT in Detroit" essentially the same as "in Detroit"
plus noise — there is no logical-quantifier representation in the
embedding space. This is a structural property of dense embeddings,
not a substrate bug.

Per-query observations:
- Q1 (Aurora IL, NOT Detroit): all top-10 rated 1-2/5 by judge
- Q2 (NOT Beacon Freight): top-1 rated 4/5 — accidentally OK
  because role+city signal pulled non-Beacon worker naturally
- Q3 (excluding Cornerstone): unanimous 1/5 across top-10
- Q4 (NOT Detroit-area): all top-10 rated 1-2/5
- Q5 (exclude Heritage Foods): top-1 rated 4/5 — accidentally OK

The judge IS the safety net: when retrieval can't honor the
constraint, the judge refuses to approve any result. That's the
honesty signal — `discovery=0` for the run aggregates it.

No code change. The architectural answer for production is:
- UI surfaces an "exclude" affordance that populates ExcludeIDs
  (already supported, added in multi-coord stress 200-worker swap)
- Coordinators don't type natural-language negation — they click
- Substrate's role: surface honesty signal (judge ratings) + don't
  pretend to honor unparseable constraints

Adding NL-negation handling at the substrate level would be product
debt — it would let coordinators type sloppier queries that
silently fail when the LLM extractor misses a phrasing. Don't ship
until production traffic demonstrates demand for it.

Findings: reports/reality-tests/real_005_findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:06:06 -05:00
root
0331288641 playbook_lift: LLM-based role extractor closes shorthand bleed (real_004)
real_003 left a known-weak hole: shorthand-style queries
("{count} {role} {city} {state} ...") have no separator between
role and city, so a regex can't reliably extract — leaving the
cross-role gate disabled when both record AND query are shorthand.

This commit adds a roleExtractor with regex-first + LLM fallback:

- Regex first (fast, deterministic) — handles need + client_first +
  looking from real_003b. ~75% of styles, no LLM cost paid.
- LLM fallback when regex returns empty AND model is configured —
  Ollama-shape /api/chat with format=json, schema-tight prompt,
  temperature 0. ~1-3s on local qwen2.5.
- Per-process cache — paraphrase + rejudge passes reuse the same
  query 4× per run; cache prevents 4× LLM cost.
- Off-by-default — opt-in via -llm-role-extract flag (CLI) and
  LLM_ROLE_EXTRACT=1 env var (harness wrapper). real_003b shipping
  config unchanged unless explicitly enabled.
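
A hedged sketch of the LLM-fallback call shape described in the list
above (Ollama /api/chat, format=json, temperature 0); prompt wording,
caching, and error handling in the real extractor are not reproduced.

```go
// Sketch of an Ollama /api/chat role-extraction call that asks for a single
// {"role": "..."} object back. Function name and prompt are illustrative.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func extractRoleViaLLM(baseURL, model, query string) (string, error) {
	reqBody, _ := json.Marshal(map[string]any{
		"model":   model,
		"stream":  false,
		"format":  "json",
		"options": map[string]any{"temperature": 0},
		"messages": []map[string]string{{
			"role":    "user",
			"content": `Extract the job role from this staffing query. Reply as {"role":"..."}. Query: ` + query,
		}},
	})
	resp, err := http.Post(baseURL+"/api/chat", "application/json", bytes.NewReader(reqBody))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var out struct {
		Message struct {
			Content string `json:"content"`
		} `json:"message"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	var parsed struct {
		Role string `json:"role"`
	}
	if err := json.Unmarshal([]byte(out.Message.Content), &parsed); err != nil {
		return "", err
	}
	return parsed.Role, nil
}

func main() {
	role, err := extractRoleViaLLM("http://localhost:11434", "qwen2.5:latest",
		"3 Forklift Operator Aurora IL tomorrow")
	fmt.Println(role, err)
}
```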

8 new tests in scripts/playbook_lift/main_test.go:
- TestRoleExtractor_RegexFirst: LLM not called when regex matches
- TestRoleExtractor_LLMFallback: shorthand goes to LLM
- TestRoleExtractor_LLMOffLeavesEmpty: opt-in default preserved
- TestRoleExtractor_Cache: 3 calls = 1 LLM hit
- TestRoleExtractor_NilSafe: nil receiver runs regex only
- TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths
- TestRoleExtractor_ClosesCrossRoleShorthandBleed: synthetic
  witness for the real_003 scenario — both record + query are
  shorthand, regex returns "" for both, LLM produces DIFFERENT
  role tokens for CNC vs Forklift, so matrix gate's cross-role
  rejection (locked separately in
  TestInjectPlaybookMisses_RoleGateRejectsCrossRole) fires
  correctly. This is the load-bearing verification.

Reality test real_004 ran the same 40-query stress as real_003 with
LLM extraction on. Cross-style same-role boosts fired correctly
across all 4 styles for Loaders + Packers + Shipping Clerk clusters
(including shorthand → other-style transfer). No cross-role bleed
observed. The reality test alone can't be a clean "with vs without"
comparison (HNSW build is non-deterministic across runs, and
real_004 stochastics didn't trigger a shorthand recording at all),
which is why the unit-test witness exists.

Production note (in real_004_findings.md): LLM extraction is for
reality-test coverage of arbitrary query shapes. Production should
extract role at INGEST time (when the inbox parser already runs an
LLM) and pass already-resolved role through requests — same shape
as multi_coord_stress's existing Demand{Role: ...} model. The hot
path should never need the harness extractor's per-query LLM cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:51:27 -05:00
root
3263254f1c reality_test real_003: 40-query paraphrase stress + extractor extension
Stress-tests the role gate with 40 queries (10 fill_events rows × 4
styles): need, client_first, looking, shorthand. Each row's role +
client + city stays the same; only the surface phrasing changes.

real_003 (original extractor) confirmed the shorthand-vs-shorthand
failure mode: CNC Operator shorthand recording leaked w-2404 onto
Forklift Operator shorthand query within the same Beacon Freight
Detroit cluster. Both record + query had empty role (extractor
returns "" for shorthand because there's no separator between role
and city), gate disabled, distance check passed, bleed fired.

Fix: extended extractRoleFromNeed to handle client_first
("{client} needs N {role} in...") and looking ("Looking for N
{role} at...") patterns. Shorthand left intentionally unmatched —
"Forklift Operator Detroit" is shape-indistinguishable from
"Forklift" + "Operator Detroit" without an LLM extractor or known-
cities lookup.

real_003b (extended extractor) verifies bleed closed across all 4
styles for this dataset. Forklift Operator queries keep w-2136 (the
cold-pass-correct match) regardless of which style the query came
in. Same-role boosts now fire correctly across styles — a CNC
Operator recording made in `looking` style boosts the CNC need-form
query.

scripts/cutover/gen_real_queries.go: added -styles flag with values
need|client_first|looking|shorthand|all (default need preserves
real_001/002 behavior). Tests/reality/real_coord_queries_v2.txt is
the 40-query stress file.

scripts/playbook_lift/main_test.go: 10 sub-tests lock the four
documented patterns + shorthand limitation + lift-suite-style
queries (no clean role, returns empty as expected).

Aggregate metrics:
- real_003  (original): disc=7,  lift=7,  boost=14, meanΔ=-0.108
- real_003b (extended): disc=11, lift=10, boost=31, meanΔ=-0.202
The growth reflects more LEGITIMATE same-role same-cluster transfer
firing across styles, not bleed (verified by per-cluster bleed
table — Forklift Operator queries unchanged across all 4 styles).

Known limitation documented in real_003_findings.md: same-cluster,
same-role queries in shorthand still embed close enough that a
shorthand recording could bleed onto a different-role shorthand
query if both record + query strip role. Closing this requires
LLM extraction or known-cities lookup at record + query time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 21:42:02 -05:00
root
997527be4d matrix: cross-role playbook gate — closes real_001 bleed (OPEN #1)
real_001 surfaced same-client+city queries bleeding across roles:
Q#2 (Forklift Operator @ Beacon Freight Detroit) recorded e-6193
in the playbook corpus. Q#5 (Pickers same client+city) and Q#10
(CNC Operator same client+city) embedded within 0.13-0.18 cosine of
Q#2's query — well inside the 0.20 inject threshold — so e-6193
injected on both, demoting the cold-pass-correct workers.

Root cause: the inject distance threshold isn't tight enough on
the same-client+city cluster. Cosine collapses queries that share
city + client + count-token + time-token regardless of role. The
existing judge gate is per-injection at record time and doesn't
fire at retrieve time.

Fix: structural role gate in front of both Shape A boost and
Shape B inject. PlaybookEntry gains Role; SearchRequest gains
QueryRole. When both are non-empty and differ under roleEqual's
case+plural normalization, the entry is rejected before BoostFactor
or judge-gate logic runs.

Backward-compat: empty role on either side disables the gate —
preserves behavior for the lift suite's free-form multi-constraint
queries that have no clean single role. Caller-supplied (not
inferred), so existing recordings unaffected.
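
A hedged sketch of the gate predicate: empty role on either side
disables it, otherwise roles compare under case + simple plural
normalization. The real roleEqual in internal/matrix may differ.

```go
// Role gate sketch: reject a playbook entry only when both roles are present
// and differ after lowercase + -s/-es plural trimming.
package main

import (
	"fmt"
	"strings"
)

func normalizeRole(r string) string {
	r = strings.ToLower(strings.TrimSpace(r))
	if strings.HasSuffix(r, "es") {
		return strings.TrimSuffix(r, "es")
	}
	return strings.TrimSuffix(r, "s")
}

func roleEqual(a, b string) bool { return normalizeRole(a) == normalizeRole(b) }

func allowPlaybookEntry(entryRole, queryRole string) bool {
	if entryRole == "" || queryRole == "" {
		return true // backward compat: gate disabled when either side is empty
	}
	return roleEqual(entryRole, queryRole)
}

func main() {
	fmt.Println(allowPlaybookEntry("Forklift Operator", "Forklift Operators")) // true
	fmt.Println(allowPlaybookEntry("Forklift Operator", "CNC Operator"))       // false
	fmt.Println(allowPlaybookEntry("", "CNC Operator"))                        // true (gate off)
}
```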

Wire-through:
- internal/matrix/playbook.go: Role field, NewPlaybookEntryWithRole,
  roleEqual helper with plural+case normalization
- internal/matrix/retrieve.go: QueryRole on SearchRequest, threaded
  to both ApplyPlaybookBoost + InjectPlaybookMisses
- cmd/matrixd/main.go: role on POST /matrix/playbooks/record + bulk
- scripts/playbook_lift/main.go: extractRoleFromNeed regex pulls
  role from "Need N {role}{s} in" queries (the fill_events shape);
  free-form queries fall back to empty (gate disabled)

Tests (5 new):
- TestInjectPlaybookMisses_RoleGateRejectsCrossRole: exact Q#10
  scenario (distance 0.135, recorded "Forklift Operator", query
  "CNC Operator") — locks the bleed at unit level
- TestInjectPlaybookMisses_RoleGateAllowsSameRole: Forklift Operator
  recording fires on Forklift Operators query (plural normalization)
- TestInjectPlaybookMisses_RoleGateBackwardCompat: empty Role on
  either side = gate disabled, preserves current behavior
- TestApplyPlaybookBoost_RoleGateRejectsCrossRole: Shape A defense
  in depth — boost doesn't fire on cross-role even when answer is
  in cold top-K
- TestRoleEqual_PluralAndCase: case + -s + -es plural normalization

Verification (real_002, same query set as real_001):
- Q#5 Pickers @ Beacon Freight: e-6193 → e-8499 (no bleed)
- Q#10 CNC Operator @ Beacon Freight: e-6193 → w-2404 (no bleed)
- Discoveries + lifts unchanged at 2 each (same-role lift still fires)
- Mean Δdist tightens from -0.127 to -0.040 (boosts no longer
  pulling distances through the floor on cross-role mismatches)

Findings: reports/reality-tests/real_002_findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:34:10 -05:00
root
7f2f112e6a reality_test real_001: real-shape coordinator queries — surfaces cross-role bleed
First retrieval probe with non-synthetic query distribution. Pulls
N rows from /home/profit/lakehouse/data/datasets/fill_events.parquet
(real-shape demand data) and translates each to the natural language
a coordinator would type: "Need {count} {role}s in {city} {state}
starting at {at} for {client}".

Headline: 8/10 cold-pass top-1 = judge-best on real distribution.
Substrate works on queries it was never trained for. v2-moe + workers
corpus carry the load.

Surfaced finding (the real value of running this): same-client+city
queries cluster, and Shape A's distance boost bleeds across roles
within the cluster. Q#2 (Forklift @ Beacon Freight Detroit) records
e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and
Q#10 (CNC Operator same client+city) inherit e-6193 at warm top-1
even though:
- Neither query has its own recorded playbook.
- Neither warm pass triggers a Shape B inject (boosted=0).
- The roles are different staffing categories.

Q#10 specifically demoted the cold-pass-correct w-3759 (judge rating
4 at rank 0) for a worker who was approved by the judge for a
different role on a different query.

Why the lift suite missed it: synthetic queries use 7 disjoint
scenario buckets (forklift+OSHA+WI / CDL+IL / etc.). Real demand
clusters on (client, city). The cluster doesn't exist in the
synthetic distribution.

Why the judge gate doesn't catch it: the gate (5a3364f) is
per-injection at record time. After approval the worker rides Shape A
distance boosts on all later same-cluster queries with no second
gate call.

Becomes new OPEN #1. Fix candidate: role-scoped playbook corpus
metadata + Shape A boost gate on role match. Cheap; doesn't need
new judge calls.

Files:
- scripts/cutover/gen_real_queries.go: parquet → coordinator NL
- tests/reality/real_coord_queries.txt: 10 generated queries
- reports/reality-tests/playbook_lift_real_001.md: harness output
- reports/reality-tests/real_001_findings.md: the reading

Repro:
  go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
  QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
    WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:18:40 -05:00
root
5687ec65c2 G5 cutover prep: embed parity probe — Rust /ai/embed ↔ Go /v1/embed verified
First concrete cutover artifact: scripts/cutover/embed_parity.sh
brings up Go embedd + gateway alongside the live Rust gateway,
hits both /ai/embed and /v1/embed with the same forced model, and
emits a per-date verdict report under reports/cutover/.

Why embed first: the parity invariant is one math identity (cosine
sim of vectors against same input). Retrieve has thousands of edge
cases. If embed parity holds, all downstream vector consumers
inherit confidence; if it doesn't, we catch it in 30s instead of
after a flip.
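
The identity the probe asserts, as a sketch (the probe itself is shell
plus helpers; this is just the math):

  package main

  import "math"

  // cosine returns the cosine similarity of two same-length vectors;
  // 1.0 means identical direction regardless of L2 norm.
  func cosine(a, b []float64) float64 {
      var dot, na, nb float64
      for i := range a {
          dot += a[i] * b[i]
          na += a[i] * a[i]
          nb += b[i] * b[i]
      }
      if na == 0 || nb == 0 {
          return 0
      }
      return dot / (math.Sqrt(na) * math.Sqrt(nb))
  }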

Verdict 2026-04-30: 5/5 samples cosine=1.000000 with model forced
to nomic-embed-text (v1). Same with nomic-embed-text-v2-moe (both
Ollamas have it loaded). Math is provably equivalent across the
gateway plumbing.

Drift catalog (reports/cutover/SUMMARY.md):
- URL: Rust /ai/embed vs Go /v1/embed
- Wire: Rust {embeddings, dimensions} (plural) vs Go {vectors,
  dimension} (singular). Wire-format adapter is the only real
  cutover work for this endpoint.
- L2 norm: Rust unit vectors (~1.0); Go raw Ollama (~20-23). Same
  direction (cos=1.0); harmless under cosine-distance HNSW (which
  is Go vectord's default), but worth fixing in internal/embed/
  before extending to euclidean indexes.

reports/cutover/ now tracked (joined the scrum/ + reality-tests/
exemptions in .gitignore).

Next probe: /v1/matrix/retrieve ↔ Rust /vectors/hybrid for the
real user-facing retrieve path. Embed parity gives that probe a
clean foundation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:07:04 -05:00
root
5d49967833 multi_coord_stress: full Langfuse coverage — every phase + every call
Phase 1c-only tracing (commit 7e6431e) was the proof-of-concept.
This commit threads tracing through every phase: baseline / fresh-
resume / inbox burst / surge / swap / merge / handover (verbatim +
paraphrase) / split / reissue. Each phase is a parent span; each
matrix.search / LLM call inside is a child span.

Refactor:
- One run-level trace is created at driver startup.
- New startPhase(name, hour, meta) helper emits a phase span as a
  child of the run trace; subsequent emitSpan calls nest under it.
- New tracedSearch(spanName, query, corpora, ...) wraps matrixSearch
  with span emission. Every search call site replaced with this so
  the input/output JSON (query, corpora, k, playbook, exclude_n →
  top-K ids, top1 distance, boost/inject counts) lands in Langfuse.
- Phase 4b's paraphrase generation also emits llm.paraphrase spans.
- Phase 1c's existing inline span emission converted to use the new
  helpers (no more inboxTraceID variable).

Run #011 result: trace landed at http://localhost:3001 with 111
observations attached. Span breakdown:
  phase.* parents:         9 (one per phase that ran)
  matrix.search.baseline:  10
  matrix.search.fresh_verify: 3 (top-1 confirmed for all 3 fresh)
  observerd.inbox.record:  6
  llm.parse_demand:        6
  matrix.search.inbox:     6
  llm.judge_top1:          6
  matrix.search.surge:     12
  matrix.search.swap_orig: 1
  matrix.search.swap_replace: 1
  matrix.search.merge:     6
  matrix.search.handover_verbatim: 4
  llm.paraphrase:          4
  matrix.search.handover_paraphrase: 4
  matrix.search.split:     4
  matrix.search.reissue:   12
  matrix.search.reissue_retrieval_only: 12
  ─────────────
  Total:                   111

Browse: http://localhost:3001 → Traces → "multi_coord_stress run"
Each phase is a collapsible section showing per-call timing and
input/output JSON. Operators can drill into any single retrieval
to see exactly what query was issued and what came back.

All other metrics held: diversity 0.026, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4, fresh-resume 3/3
at top-1 (two-tier index), 200-worker swap Jaccard 0.000.

This is the FULL TEST J asked for — every action in the run
visible in Langfuse, full input/output drilldown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:43:32 -05:00
root
08a086779b multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1
Runs #003-#009 surfaced the same finding: fresh workers added
mid-run to the main 'workers' vectord index (5K items) reliably
*absorbed* (HTTP 200) but failed to *surface* in semantic queries
even with content-matching prompts. Distances on the verify queries
sat at 0.25-0.65 against existing workers; fresh items were beyond
top-K. Better embedder (v2-moe) didn't help — distances got TIGHTER
on existing items, pushing fresh items further out of reach.

Root cause: coder/hnsw incremental adds to a populated graph land
in poorly-connected regions and disappear from search traversal.
Known property of HNSW post-build adds; not a bug.

Fix: two-tier index pattern (canonical NRT search architecture).
Fresh content goes to a small "hot" corpus (fresh_workers); main
queries include it in the corpora list and merge results. Hot corpus
has no recall crowding because it's tiny; periodic batch job (post-
G3) merges it into the main index.

Implementation:
- ensureFreshIndex(hc, gw, name, dim) — idempotent POST
  /v1/vectors/index. 409 from re-create treated as "already there"
  (see the sketch after this list).
- ingestFreshWorker now takes idx parameter so callers can target
  fresh_workers instead of workers.
- multi_coord_stress phase 1b creates fresh_workers index + ingests
  3 fresh workers there + searches verifyCorpora=[workers,
  ethereal_workers, fresh_workers].
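
A minimal sketch of the ensureFreshIndex shape from the first
implementation item above (payload field names are assumptions; the
409-means-already-there handling is the point being sketched):

  package main

  import (
      "bytes"
      "encoding/json"
      "fmt"
      "net/http"
  )

  // ensureFreshIndex: create-if-absent for the hot corpus.
  func ensureFreshIndex(hc *http.Client, gw, name string, dim int) error {
      body, _ := json.Marshal(map[string]any{"name": name, "dimension": dim})
      resp, err := hc.Post(gw+"/v1/vectors/index", "application/json",
          bytes.NewReader(body))
      if err != nil {
          return err
      }
      defer resp.Body.Close()
      if resp.StatusCode == http.StatusConflict {
          return nil // already there: idempotent success
      }
      if resp.StatusCode >= 300 {
          return fmt.Errorf("create index %s: %s", name, resp.Status)
      }
      return nil
  }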

Run #010 result:
  fresh-001 (Senior tower crane rigger NCCCO Chicago)
    top-1: fresh-001 from fresh_workers, distance 0.143
  fresh-002 (Bilingual Spanish/English OSHA trainer Indianapolis)
    top-1: fresh-002 from fresh_workers, distance 0.146
  fresh-003 (FAA Part 107 drone surveyor Chicago)
    top-1: fresh-003 from fresh_workers, distance 0.129

3/3 fresh workers surface at top-1 — the absorption-but-not-
findable issue from runs #003-#009 is closed.

All other metrics held: diversity 0.007, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4, swap Jaccard 0.000,
inbox burst all 6 events accepted + traced to Langfuse.

This is the final structural fix for the multi-coord stress
suite. Phase 3 is feature-complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:31:45 -05:00
root
7e6431e4fd langfuse: Go-side client + Phase 1c instrumentation
The Rust side has Langfuse tracing already (gateway/v1/langfuse_trace.rs);
this commit lands Go-side parity so the multi-coord stress harness can
emit traces visible at http://localhost:3001.

internal/langfuse/client.go:
- Minimal Trace + Span + Flush API mirroring what the Rust emitter
  uses. Auth: Basic over public_key:secret_key.
- Best-effort posture: errors are slog.Warn'd, never block calling
  paths. Same fail-open as observerd's persistor (ADR-005 Decision
  5.1) — observability is a witness, not a gate.
- Events buffered until 50, then auto-flushed; explicit Flush() at
  process exit (see the sketch after this list).
- Each Trace/Span returns its id so callers can build hierarchies.
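
A minimal sketch of the buffering and fail-open posture (ingestion path
and event shape are assumptions, not the committed client):

  package langfuse

  import (
      "bytes"
      "encoding/json"
      "fmt"
      "log/slog"
      "net/http"
  )

  type Client struct {
      BaseURL   string
      PublicKey string
      SecretKey string
      buf       []map[string]any
  }

  // Span buffers one observation and returns its id so callers can nest.
  func (c *Client) Span(traceID, name string, input, output map[string]any) string {
      id := fmt.Sprintf("span-%d", len(c.buf)+1) // sketch only; real ids are unique
      c.enqueue(map[string]any{
          "id": id, "traceId": traceID, "name": name,
          "input": input, "output": output,
      })
      return id
  }

  func (c *Client) enqueue(ev map[string]any) {
      c.buf = append(c.buf, ev)
      if len(c.buf) >= 50 {
          c.Flush()
      }
  }

  // Flush posts buffered events; errors are warned, never returned, so
  // observability can never block the calling path (fail-open).
  func (c *Client) Flush() {
      if len(c.buf) == 0 {
          return
      }
      body, _ := json.Marshal(map[string]any{"batch": c.buf})
      req, err := http.NewRequest("POST", c.BaseURL+"/api/public/ingestion",
          bytes.NewReader(body))
      if err != nil {
          slog.Warn("langfuse: build request failed", "err", err)
          return
      }
      req.Header.Set("Content-Type", "application/json")
      req.SetBasicAuth(c.PublicKey, c.SecretKey)
      if resp, err := http.DefaultClient.Do(req); err != nil {
          slog.Warn("langfuse: flush failed", "err", err)
      } else {
          resp.Body.Close()
      }
      c.buf = c.buf[:0]
  }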

multi_coord_stress driver wiring:
- New --langfuse-env flag (default /etc/lakehouse/langfuse.env).
  Empty / missing / unparseable file → skip tracing with a logged
  warning; run still proceeds.
- Phase 1c (inbox burst) now emits one parent trace + 4 spans per
  inbox event:
    1. observerd.inbox.record  (post to /v1/observer/inbox)
    2. llm.parse_demand        (qwen2.5 → structured fields)
    3. matrix.search           (parsed query → top-K)
    4. llm.judge_top1          (rate top-1 vs original body)
  Each span carries input/output JSON + start/end times so the
  Langfuse UI shows a full waterfall per event.

Run #009 result:
  Trace landed: "multi_coord_stress phase 1c inbox burst"
  Observations attached: 24 (= 6 events × 4 spans)
  Tags: stress, phase-1c, inbox
  Browseable at http://localhost:3001 by tag query.

Other harness metrics: diversity 0.016, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4 — all unchanged
by the tracing addition (best-effort post in parallel).

Phase 1c is the proof-of-concept; future commits can wrap other
phases (baseline / merge / handover / split) in traces too. Once
that's done, the entire stress run becomes scrubbable in Langfuse
without grepping the events JSON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:25:03 -05:00
root
ce940f4a14 multi_coord_stress: judge re-rates inbox top-1 — recovers honesty signal
Run #007 surfaced a tradeoff: LLM-parsed inbox queries produce much
tighter cosine distances (0.05-0.10 in three cases) but lose the
"system has no good match" signal that high-distance results give.
A coordinator UI showing only distance can't tell wrong-domain
matches apart from real ones.

Fix: judge re-rates top-1 against the ORIGINAL inbox body (not the
LLM-parsed query). Coordinators see both:
  - distance: how close was retrieval in vector space
  - rating:   does this person actually fit the original ask
The pair tells the honest story.

Run #008 result on the 6 inbox events:

  Demand                Top-1     Distance  Rating  Reading
  ─────────────────────────────────────────────────────────────
  Forklift Cleveland    w-3573    0.29      4       Strong
  Production Indy       e-1764    0.41      3       Adjacent
  Crane Chicago         e-7798    0.23      1       TIGHT BUT WRONG
  Bilingual safety Indy w-3918    0.05      5       Perfect
  Drone Chicago         e-1058    0.06      5       Perfect (verify e-1058)
  Warehouse Milwaukee   w-460     0.32      4       Strong

The crane-Chicago case is the architectural-honesty signal at work:
distance 0.23 says "tight match" but the judge, reading the original
body, rates it 1. A coordinator seeing only distance would ship the
wrong worker; a coordinator seeing distance+rating sees the
disagreement and escalates.

Net distribution: 5/6 rated 3+ (acceptable→perfect), 1/6 rated 1
(irrelevant despite tight cosine). The substrate-honesty signal is
recovered without losing the LLM-parse quality wins.

Cost: 6 extra judge calls (~9s on qwen2.5). Production amortizes
when judge runs only on top-1 of high-priority inbox events; the
search-cost-vs-quality tradeoff lives in the priority gate.

Implementation:
- New JudgeRating int field on Event (omitempty so non-judged
  events stay clean in JSON)
- New judgeInboxResult helper, reusing the same prompt structure as
  playbook_lift's judgeRate. The two could share an internal package
  if a third judge consumer appears.
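
A minimal sketch of the re-rate call (prompt wording and helper shape
are assumptions; format=json keeps the reply parseable):

  package main

  import (
      "bytes"
      "encoding/json"
      "fmt"
      "net/http"
  )

  // judgeInboxResult rates the retrieved top-1 against the ORIGINAL inbox
  // body on a 0-5 scale via Ollama's /api/generate with format=json.
  func judgeInboxResult(ollamaURL, model, inboxBody, profile string) (int, error) {
      prompt := fmt.Sprintf("Rate 0-5 how well this worker fits the request.\n"+
          "Request: %s\nWorker: %s\nReply as JSON: {\"rating\": <0-5>}",
          inboxBody, profile)
      payload, _ := json.Marshal(map[string]any{
          "model": model, "prompt": prompt, "format": "json", "stream": false,
      })
      resp, err := http.Post(ollamaURL+"/api/generate", "application/json",
          bytes.NewReader(payload))
      if err != nil {
          return 0, err
      }
      defer resp.Body.Close()
      var out struct {
          Response string `json:"response"`
      }
      if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
          return 0, err
      }
      var rated struct {
          Rating int `json:"rating"`
      }
      if err := json.Unmarshal([]byte(out.Response), &rated); err != nil {
          return 0, err
      }
      return rated.Rating, nil
  }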

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:16:49 -05:00
root
186d209aae multi_coord_stress: LLM-parsed inbox demands (qwen2.5)
Replaced the hard-coded DemandQuery on inbox events with an actual
LLM call: each email/SMS body is parsed by qwen2.5 (format=json,
schema-anchored) into structured {role, count, location, certs,
skills, shift}. The driver then composes a query string from those
fields and runs matrix.search.
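
The parse target and query composition, sketched (struct fields are
from the list above; the composed wording is an assumption):

  package main

  import (
      "fmt"
      "strings"
  )

  // ParsedDemand is the schema qwen2.5 is asked to emit (format=json).
  type ParsedDemand struct {
      Role     string   `json:"role"`
      Count    int      `json:"count"`
      Location string   `json:"location"`
      Certs    []string `json:"certs"`
      Skills   []string `json:"skills"`
      Shift    string   `json:"shift"`
  }

  // composeQuery flattens the parsed fields into the search string fed
  // to matrix.search.
  func composeQuery(d ParsedDemand) string {
      parts := []string{fmt.Sprintf("%d %s", d.Count, d.Role), d.Location}
      parts = append(parts, d.Certs...)
      parts = append(parts, d.Skills...)
      if d.Shift != "" {
          parts = append(parts, d.Shift+" shift")
      }
      return strings.Join(parts, ", ")
  }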

This is the real-product flow that the Phase 3 stress test was
asking for: real bodies → real LLM parsing → real search. Before
this commit, the DemandQuery was my hand-crafted string, which
made the inbox phase trivial.

Run #007 result vs #006 (same bodies, parser swapped):

  All 6 inbox events parsed cleanly — qwen2.5 nailed:
    "Need 50 forklift operators in Cleveland OH for Monday day
     shift. OSHA-30 + active forklift cert required."
    → {role:"forklift operator", count:50, location:"Cleveland, OH",
       certs:["OSHA-30","active forklift cert"], skills:[], shift:"day"}
    Other 5 similarly faithful (indy stayed as "indy", count
    defaulted to 1 when unspecified, no hallucinated fields).

  LLM-parsed queries produced TIGHTER matches than hard-coded:
    Demand              #006 dist  #007 dist  Δ
    Crane Chicago       0.499      0.093      -82%
    Drone Chicago       0.707      0.073      -90%
    Bilingual safety    0.240      0.048      -80%
    Forklift Cleveland  0.330      0.273      -17%
    Production Indy     0.260      0.399      +53%
    Warehouse Milwaukee 0.458      0.420       -8%

  Three matches landed at distance < 0.10 — verbatim-replay-tight
  territory. Structured queries embed sharper than conversational
  hand-crafted strings.

  Other metrics unchanged: diversity 0.000, determinism 1.000,
  verbatim handover 4/4, paraphrase handover 4/4.

Tradeoff worth flagging: the drone-Chicago case dropped from
distance 0.71 (clear "we don't have one") to 0.07 (confident match
returned). The OOD honesty signal weakens when LLM-parsed structure
makes any closest-neighbor look tight. Future Phase 4 work: judge
re-rates the top match before surfacing, so coordinators see "your
demand was for X but the closest match scored 2/5" rather than just
the worker ID + distance.

Substrate cost: +6 LLM calls per inbox burst (~9s on qwen2.5).
Production would amortize via a small dedicated parser model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:51:19 -05:00
root
e7fc63b216 observerd: /observer/inbox + multi-coord stress phase 1c (priority-ordered events)
Phase 3 ask: real-world inbox-style event injection during the stress
test. Coordinators in production receive emails + SMS that trigger
contract responses; the substrate has to RECORD these signals AND
react with a search using the embedded demand. This commit lands the
endpoint and exercises it end-to-end in the stress harness.

observerd surface:
- New POST /observer/inbox route — accepts {type, sender, subject,
  body, priority, tag} and records as ObservedOp with
  Source=SourceInbox. Type must be email|sms; body required;
  priority defaults to medium. The handler ONLY records — downstream
  triggers (search, ingest, etc.) are the caller's concern, recorded
  separately. Keeps the witness role pure; request validation is
  sketched after this list.
- New observer.SourceInbox = "inbox" alongside SourceMCP /
  SourceScenario / SourceWorkflow.
- Three contract tests on the new route (happy path / bad type / empty
  body), router-mount test extended, all green.
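
A minimal sketch of the request shape and validation rules described in
the route item above (handler plumbing omitted; struct name is an
assumption):

  package main

  import "fmt"

  type inboxRequest struct {
      Type     string `json:"type"`     // email | sms
      Sender   string `json:"sender"`
      Subject  string `json:"subject"`
      Body     string `json:"body"`     // required
      Priority string `json:"priority"` // defaults to medium
      Tag      string `json:"tag"`
  }

  func validateInbox(r *inboxRequest) error {
      if r.Type != "email" && r.Type != "sms" {
          return fmt.Errorf("type must be email or sms, got %q", r.Type)
      }
      if r.Body == "" {
          return fmt.Errorf("body is required")
      }
      if r.Priority == "" {
          r.Priority = "medium"
      }
      return nil
  }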

Stress harness phase 1c (Hour 9):
- 6 inbox events fire in priority order (urgent → high → medium):
    2 urgent emails (forklift Cleveland, production Indianapolis)
    1 high email (crane Chicago)
    1 high sms (bilingual safety Indianapolis)
    1 medium sms (drone Chicago)
    1 medium email (warehouse Milwaukee FYI)
- Each event:
    1. POSTs to /v1/observer/inbox (recorded by observerd)
    2. Triggers matrix.search using a parsed demand (the demand
       extraction is hard-coded for now; production needs a small
       LLM to parse from body)
    3. Captures both as events in the run JSON

Run #006 result (with v2-moe embedder + all phases including inbox):

  Diversity:
    Same-role-across-contracts Jaccard = 0.000 (n=9)
    Different-roles-same-contract Jaccard = 0.046 (n=18)
  Determinism: 1.000
  Verbatim handover: 4/4 (100%)
  Paraphrase handover: 4/4 (100%)
  Inbox burst:
    6/6 events accepted by observerd (200 status, all recorded)
    6/6 triggered searches produced distinct top-1 worker IDs
    distance distribution: 0.24 (Indy production) → 0.71 (Chicago
    drone surveyor — honest stretch since drones aren't in the
    5K-worker corpus, system surfaces closest neighbor at high
    distance rather than fabricating)

The drone-Chicago case is the architectural-honesty signal: when
the demand asks for a specialist NOT in the roster, the system
returns the closest semantic neighbor with a distance that flags
"this is a stretch." Coordinators reading distances see "we don't
have a great match here" rather than a confident wrong answer.

Total events captured: 67 (was 61 pre-inbox).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:34:36 -05:00
root
4da32ad102 embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in)
Local Ollama has three embedding models loaded:
  nomic-embed-text:latest        137M  768d  (previous default)
  nomic-embed-text-v2-moe:latest 475M  768d  (this commit's default)
  qwen3-embedding:latest         7.6B  4096d (would require dim change)

v2-moe is a drop-in upgrade — same 768 dim, 3.5× more params, MoE
architecture. Workers index doesn't need rebuilding, just future ingests
embed with the stronger model.

Run #005 result on the multi-coord stress suite:

  Diversity (same-role-across-contracts): 0.080 → 0.000 (n=9)
    → MoE is more discriminating: zero worker overlap across
      Milwaukee / Indianapolis / Chicago for shared role names.
      The geo + cert + skill context fully separates worker pools.
  Different-roles-same-contract: 0.013 → 0.036 (still ~96% diff)
  Determinism: 1.000 (unchanged)
  Verbatim handover: 4/4 (100%)
  Paraphrase handover: 4/4 (100%)

  200-worker swap: Jaccard 0.000 (unchanged — still perfect)

  Fresh-resume verify: STILL doesn't surface fresh workers in top-8.
    With v2-moe, distances increased (top-1 = 0.43–0.65 vs v1's 0.25–0.39)
    — the embedder is MORE discriminating, but the fresh worker's
    vector still doesn't outrank the 8th-best existing worker. Now
    suspected to be an HNSW post-build add issue (coder/hnsw
    incremental adds can land in hard-to-reach graph regions), not an
    embedder problem. A better embedder didn't fix it; needs a
    different strategy: full index rebuild after fresh adds, or
    explicit playbook-layer score boost for fresh workers, or
    hybrid (keyword + semantic) retrieval. Phase 3 investigation.

Cost: ingest is ~5× slower (workers 20s→100s; ethereal 35s→112s).
Acceptable for the quality jump on diversity. Real production with
incremental ingest won't pay this once-per-deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:26:52 -05:00
root
84a32f0d29 multi-coord stress Phase 2: ExcludeIDs + fresh-resume + 200-worker swap
Three Phase 2 additions land in this commit:

1. matrix.SearchRequest gains ExcludeIDs ([]string) — filters specific
   worker IDs out of results post-retrieval, AND skips them at the
   playbook boost+inject step (so excluded answers can't sneak back
   via Shape B). Real-world driver: coordinator placed N workers,
   client asks for replacements, system needs alternatives, not the
   same N. Threaded through retrieve.go after merge but before
   metadata filter so excluded IDs don't waste post-filter top-K
   slots (see the sketch after this list).

2. New harness phase 2b: 200-worker swap simulation. Captures the
   top-K from alpha's warehouse query, then re-issues with
   exclude_ids=<placed>. Result Jaccard(orig, swap) measures whether
   the substrate finds genuine alternatives.

3. New harness phase 1b: fresh-resume mid-run injection. Three new
   workers ingested via /v1/embed + /v1/vectors/index/workers/add,
   then verified findable via semantic queries matching resume content.
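
A minimal sketch of the post-merge exclusion step from item 1 (the
Result type is a stub; committed field names may differ):

  package matrix

  // Result is a stub for illustration; the committed type differs.
  type Result struct {
      ID       string
      Distance float64
  }

  // filterExcluded drops placed workers after corpus merge, before the
  // metadata filter, so exclusions don't consume post-filter top-K slots.
  func filterExcluded(results []Result, excludeIDs []string) []Result {
      skip := make(map[string]bool, len(excludeIDs))
      for _, id := range excludeIDs {
          skip[id] = true
      }
      kept := results[:0]
      for _, r := range results {
          if !skip[r.ID] {
              kept = append(kept, r)
          }
      }
      return kept
  }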

Plus Hour labels on every event (operational narrative: 0/6/12/18/
24/30/36/42/48) and a refactor of captureEvent to take hour as a
param.

Run #003 + #004 results (5K workers + 10K ethereal):

  Diversity (#004):
    Same-role-across-contracts Jaccard = 0.080 (n=9)
    Different-roles-same-contract Jaccard = 0.013 (n=18)
  Determinism: 1.000 (#004 unchanged)
  Verbatim handover:  4/4 = 100%
  Paraphrase handover: 4/4 = 100%

  Phase 2b — 200-worker swap (Jaccard 0.000):
    8 originally-placed workers fully replaced by 8 alternatives.
    ExcludeIDs substrate change works end-to-end — boost AND inject
    both honor the exclusion, so excluded workers don't return via
    the playbook either.

  Phase 1b — fresh-resume injection: REAL PRODUCT FINDING.
    Substrate ABSORPTION is fine — 3 /v1/vectors/index/workers/add
    calls at 200 status, 3 vectors persisted. But none of the 3
    fresh workers surfaced in top-8 even with semantic queries
    matching their resume content (e.g. "Senior tower crane rigger
    NCCCO Chicago" vs fresh-001's resume "Senior rigger with 12
    years tower-crane signaling..." NCCCO + Chicago).
    Top-1 came from existing workers at distance ~0.25; fresh
    workers' distances must be > 0.25, pushing them past rank 8.
    Cause: dense retrieval at 5000+ workers means many existing
    profiles cluster near any specific query in cosine space;
    nomic-embed-text (137M) introduces enough noise that a
    fresh worker doesn't reliably outrank them just because the
    text content overlaps.
    Workarounds (Phase 3 work): (a) hybrid retrieval (keyword +
    semantic), (b) playbook-layer score boost for fresh adds,
    (c) larger embedder. Documented in run #004 report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:19:29 -05:00
root
0fa42a0cc3 multi-coord stress Phase 1.5: shared-role contracts + paraphrase handover
Phase 1 had two known gaps: (1) the 3 contracts had zero shared role
names, so same-role-across-contracts Jaccard was vacuous (n=0); (2)
the verbatim handover at 100% was the trivial case, not the hard
learning test (paraphrased queries against another coord's playbook).

Both fixed in this commit.

Contract redesign — all 3 contracts now share warehouse worker /
admin assistant / heavy equipment operator roles, plus a unique
specialist per contract (industrial electrician / bilingual safety
coord / drone surveyor — the "specialist not on the standard roster"
case from J's spec). Counts and skill mixes vary per region.

New driver phase 4b — paraphrase handover. Bob runs qwen2.5-paraphrased
versions of Alice's contract queries against Alice's playbook
namespace. Tests whether institutional memory propagates across
coordinators AND across natural wording variation that Bob would
introduce when running Alice's contract.

Run #002 result (5K workers + 10K ethereal_workers, 4 demand × 3
coords + paraphrase handover):

  Diversity (the question J asked: locking or cycling?):
    Same-role-across-contracts Jaccard = 0.119 (n=9)
      → 88% of workers DIFFER across regions for the same role name.
        Milwaukee warehouse vs Indianapolis warehouse vs Chicago
        warehouse pull mostly distinct top-K from the same population.
        The system locks into geo+cert+skill context, not cycling.
    Different-roles-same-contract Jaccard = 0.004 (n=18)
      → role-specific retrieval works (unchanged from Phase 1).

  Determinism: Jaccard = 1.000 (n=12) — unchanged.

  Learning:
    Verbatim handover  4/4 = 100%  (trivial case, expected)
    Paraphrase handover 4/4 = 100% (HARD case — passes!)
      Of those 4 paraphrase recoveries:
        - 2 used boost (Alice's recording was already in Bob's
          paraphrase top-K; ApplyPlaybookBoost re-ranked to top-1)
        - 2 used Shape B inject (recording wasn't in Bob's
          paraphrase top-K; InjectPlaybookMisses brought it in)

The boost/inject mix is healthy — both paths are used and both
produce correct top-1s. Multi-coord institutional memory propagation
is empirically working under wording variation.

Sample warehouse worker top-1s across contracts (proves diversity):
  alice / Milwaukee     → w-713
  bob   / Indianapolis  → e-8447
  carol / Chicago       → e-7145
Three different workers from the same 15K-person population,
selected on geo+cert+skill context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:03:16 -05:00
root
61c7b55e48 multi-coord stress harness — Phase 1 of 48-hour mock
Three coordinators (alice / bob / carol) with three contracts
(Milwaukee distribution / Indianapolis manufacturing / Chicago
construction). 7-phase scenario runner: baseline → surge → merge →
handover → split → reissue → analysis. Each coord has a separate
playbook namespace (playbook_{name}) so institutional memory stays
isolated by default but transferable on demand.

Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints,
and Langfuse tracing — those are Phase 2/3.

Run #001 (52 events, 4 queries × 3 coords × 2 demand flavors):

  Diversity:
    Different-roles-same-contract Jaccard = 0.004 (n=18)
      → role-specific retrieval is working perfectly. Different
        roles within one contract pull totally different worker
        pools. System is NOT cycling; locks into per-role retrieval.
    Same-role-across-contracts Jaccard = N/A (n=0)
      → TEST-DESIGN ISSUE: the 3 contracts use distinct role names
        per industry (warehouse worker / production worker / general
        laborer), so no exact-name overlaps exist. Phase 2 should
        either share at least one role across contracts OR add a
        skill-based diversity metric.

  Determinism: Jaccard = 1.000 (n=12)
    → HNSW + Ollama retrieval is fully deterministic on identical
      query text. coder/hnsw + nomic-embed-text are stable.

  Learning: handover hit rate = 4/4 = 100%
    → Bob inherits Alice's recordings perfectly when bob runs
      identical queries with alice's playbook namespace. CAVEAT:
      this tests the trivial verbatim case, not paraphrase handover.
      The harder test (bob runs paraphrased queries with alice's
      playbook) is Phase 2 work.

Per-event capture in JSON: every matrix.search response is logged
with phase / coordinator / contract / role / query / top-K IDs +
distances + per-corpus counts + boosted/injected counts. Reviewable
via:
  jq '.events[] | select(.phase == "merge")'
  jq '.events[] | select(.coordinator == "alice")'
  jq '.events[] | select(.role == "warehouse worker")'

Notable finding from per-event: carol's "general laborer" and "crane
operator" queries both surface w-1009 as top-1, with crane operator
at distance 0.098 (very tight) and general laborer at 0.297. The
system found a worker who legitimately covers both roles — realistic
for small construction crews.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:55:29 -05:00
root
b13b5cd7a1 playbook_lift v4 metric: warm-top-1 re-judge — quality lift +24%/-14%
The rank-based "lift" metric (warm-top-1 == cold-judge-best) doesn't
distinguish "Shape B surfaced a strictly-better answer" from "Shape B
shuffled ranks but quality is unchanged" from "Shape B replaced a good
answer with a wrong one." This commit adds Pass 4: judge warm top-1
with the same prompt as cold ratings, then bucket the comparison.

Implementation:
- New --with-rejudge driver flag (default off).
- New WITH_REJUDGE harness env (default 1, on for prod runs).
- queryRun gains WarmTop1Metadata (cached during Pass 2 for the
  rejudge call) + WarmTop1Rating *int (nil-distinguishable; nil = no
  rejudge, 0..5 = rating).
- summary gains RejudgeAttempted, QualityLifted, QualityNeutral,
  QualityRegressed (counts of warm-rating > / == / < cold-rating).
- Markdown headline gains a Quality block when rejudge ran.
- ~21 extra judge calls (~30s on qwen2.5).

Run #005 result (split inject threshold 0.20 + paraphrase + rejudge):

  Quality lifted     5 / 21  (24%)  — 3× +2 rating, 2× +1 rating
  Quality neutral   13 / 21  (62%)  — includes OOD queries holding 1
  Quality regressed  3 / 21  (14%)
  Net rating delta  +3 across 21 queries (+0.14 average)

The 5 lifts were all rating-2 cold replaced with rating-3 or rating-4
warm — Shape B took mediocre matches and substituted substantively
better ones. Of the 3 regressions, two were small (-1 each); the
third is the Q11 4→1 drop described below.

Q11 is the cautionary tale: cold top-1 "production line worker"
(rating 4) got replaced by Q1's recorded "forklift OSHA-30 operator"
e-5729 (rating 1). Adjacent-domain cross-pollination — production
worker and forklift operator embed within 0.20 cosine because both
are warehouse-adjacent staffing queries, even though the judge
correctly distinguishes them. The split-threshold defense (0.5 boost
/ 0.20 inject) catches OOD cross-pollination (Q19/Q20/Q21 all stayed
neutral at rating 1) but not adjacent-domain cross-pollination.

Net product verdict: working, net-positive on quality, but the worst
case (Q11 4→1) is customer-visible and warrants a tighter inject
threshold OR an additional gate beyond cosine distance. Filed in
STATE_OF_PLAY OPEN as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:42:04 -05:00
root
67d1957b87 matrix: split boost / inject thresholds — kills Shape B cross-pollination
Run #003 surfaced rampant cross-pollination: Q2's "OSHA-30 forklift
Wisconsin" recording (w-4435) became warm top-1 for Q19 (dental
hygienist), Q20 (RN), Q21 (software engineer), and 6 other unrelated
staffing queries. Cause: InjectPlaybookMisses inherited the same
DefaultPlaybookMaxDistance (0.5) as the boost path, but inject is
structurally riskier than boost — boost only re-ranks results that
already retrieved on their own merits, while inject FORCES a result
into top-K, so a loose match cross-pollinates wrong-domain answers.

Empirical motivation from v3:
  Implied playbook hit distances for cross-pollinated cases: 0.20-0.46
  Implied distances for the 6/6 paraphrase recoveries:        0.23-0.30
  Threshold of 0.20 should keep most paraphrases, kill the OOD bleed.

Implementation:
- New DefaultPlaybookMaxInjectDistance = 0.20 in playbook.go.
- New PlaybookMaxInjectDistance field on SearchRequest (override).
- InjectPlaybookMisses signature gains maxInjectDist param; hits whose
  Distance exceeds it are skipped (boost path may still re-rank them).
- TestInjectPlaybookMisses_RespectsInjectThreshold locks the contract
  with one tight + one loose hit, asserting only the tight one injects.
- Existing tests pass explicit threshold (0 = default for tight tests,
  0.5 for the dedupe test which uses 0.30 hits).
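
The split in sketch form (constants per this commit; zero meaning
"use the default" is from the test note above):

  package matrix

  const (
      DefaultPlaybookMaxDistance       = 0.5  // boost: re-ranks only
      DefaultPlaybookMaxInjectDistance = 0.20 // inject: forces results in
  )

  // shouldInject gates the riskier path; 0 means "use the default".
  func shouldInject(hitDistance, maxInjectDist float64) bool {
      if maxInjectDist == 0 {
          maxInjectDist = DefaultPlaybookMaxInjectDistance
      }
      return hitDistance <= maxInjectDist
  }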

Run #004 result on identical queries with the split threshold:

  Verbatim discovery        8 (vs v3's 6 — judge variance, separate)
  Verbatim lift             6 / 8 (75%)
  Paraphrase top-1          6 / 8 (75%)
  Paraphrase any-rank in K  6 / 8

OOD queries Q19/Q20/Q21 ALL show warm top-1 = cold top-1 (no
injection) — cross-pollination eliminated where it was wrong-direction.
Mean Δ top-1 distance dropped from -0.164 (v3, distorted) to -0.071
(v4, comparable to v1's -0.053).

Two paraphrases missed in v4 (Q9, Q15) were ones where qwen2.5
rephrased liberally enough to drift past 0.20 — Q9: "Inventory
specialist..." → "Individual needed for inventory management..." and
Q15: "Engaged warehouse associate..." → "Warehouse associate currently
engaged with a robust history...". The system correctly refusing to
inject when it's not confident is the right product behavior; the
boost path still re-ranks recorded answers when they appear in regular
retrieval.

The Q6 ↔ Q7 cross-pollination ("Forklift-certified loader" ↔
"Hazmat warehouse worker") is legitimate — these are genuinely similar
staffing queries and the judge ranks both directions as plausible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:24:55 -05:00
root
154a72ea5e matrix: Shape B — inject playbook misses + 6/6 paraphrase recovery
The v0 boost-only stance documented in internal/matrix/playbook.go:22-27
("the boost only re-ranks results that ALREADY surfaced from the regular
retrieval") couldn't promote recorded answers that dropped out of a
paraphrase's top-K. playbook_lift_002 surfaced exactly that gap: 0/2
paraphrase recoveries because the recorded answers weren't in regular
retrieval at all (rank=-1).

Shape B: when warm-pass retrieval doesn't surface a playbook hit's
answer, inject a synthetic Result for it directly. Distance =
playbook_hit_distance × BoostFactor — same formula as the boost path so
injections land in comparable distance space. Caller re-sorts +
truncates after both boost and inject have run.
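
A minimal sketch of the inject path (types are stubs; the committed
InjectPlaybookMisses differs in detail):

  package matrix

  // Stub types for illustration only.
  type Result struct {
      ID       string
      Distance float64
  }

  type PlaybookHit struct {
      AnswerID string
      Distance float64 // distance of the recorded query to the new query
  }

  // injectPlaybookMisses appends a synthetic result for any recorded
  // answer the warm retrieval missed, at hit.Distance * boostFactor so
  // injections land in the same distance space as boosted results. The
  // caller re-sorts and truncates afterwards, per this commit.
  func injectPlaybookMisses(results []Result, hits []PlaybookHit,
      boostFactor float64) []Result {
      present := make(map[string]bool, len(results))
      for _, r := range results {
          present[r.ID] = true
      }
      for _, h := range hits {
          if present[h.AnswerID] {
              continue // already surfaced; the boost path re-ranks it
          }
          results = append(results,
              Result{ID: h.AnswerID, Distance: h.Distance * boostFactor})
          present[h.AnswerID] = true // dedupe multiple hits on one answer
      }
      return results
  }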

Result on playbook_lift_003 (Shape B + paraphrase pass):

  Verbatim discovery        6
  Verbatim lift             2 / 6
  **Paraphrase top-1**      **6 / 6**
  Paraphrase any-rank in K  6 / 6
  Mean Δ top-1 distance     -0.1637 (warm closer than cold)

Every paraphrase the judge generated landed the v1-recorded answer at
top-1 of the new query's results. The learning property holds — cosine
on embed(paraphrase) finds the recorded query's vector within
DefaultPlaybookMaxDistance (0.5), and Shape B injects the answer.

Verbatim lift dropped from v1's 7/8 because Shape B cross-pollinates
recorded answers across queries. w-4435 (Q2's recording) appears as
warm top-1 for several other queries because their embeddings are
within the playbook hit threshold of "OSHA-30 forklift Wisconsin." This
is a feature, not a bug — the matrix layer's purpose is to share
knowledge across queries — but the lift metric only counts "warm top-1
== cold judge best," so cross-pollinated lifts don't register. A v3
metric would re-judge the warm pass to measure true judge improvement.

Tests:
- TestInjectPlaybookMisses_AddsMissingAnswers — primary claim
- TestInjectPlaybookMisses_SkipsAnswersAlreadyPresent — no double-inject
- TestInjectPlaybookMisses_DedupesPerAnswer — multi-hit same answer
- TestInjectPlaybookMisses_EmptyHits — fast-path no-op

Driver fix: ParaphraseRecordedRank int → *int. The `omitempty` int
silently dropped rank=0 (top-1, the WANTED value) from JSON, making the
v003 report show "null" instead of "0" for every successful recovery.
Pointer keeps nil/rank-0 distinguishable.
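
The failure mode in one struct tag, sketched (the JSON tag name is an
assumption):

  package main

  // With a plain int, omitempty drops rank 0 (top-1, the wanted value)
  // from the JSON; a pointer keeps nil ("no paraphrase pass") distinct
  // from an explicit 0.
  type queryRun struct {
      ParaphraseRecordedRank *int `json:"paraphrase_recorded_rank,omitempty"`
  }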

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:06:13 -05:00
root
e9822f025d playbook_lift v2: paraphrase pass + run #002 finds boost-only limit
Adds an opt-in Pass 3 to the lift driver: for each query whose Pass 1
recorded a playbook, ask the judge to rephrase the query, then re-query
with playbook=true and check whether the recorded answer surfaces in
top-K. This is the test the v1 report's caveat #3 explicitly flagged
as the actual learning-property gate (not the cheap verbatim case).

Implementation:
- New flag --with-paraphrase on the driver (default off).
- New WITH_PARAPHRASE env in the harness (default 1, on for prod runs).
- New paraphrase_* fields on queryRun + summary, // 0 fallback in jq so
  re-rendering verbatim-only evidence stays clean.
- generateParaphrase() calls the same judge model with format=json and
  a tight schema; temperature=0.5 for variance without domain drift.
- Markdown report adds a paraphrase per-query table (only when the
  pass ran) and an honesty caveat about judge-also-rephrases coupling.

Run #002 result (reports/reality-tests/playbook_lift_002.{json,md}):

  Verbatim lift               2/2 (100% — Q7 + Q13, both stable from v1)
  Paraphrase top-1            0/2
  Paraphrase any-rank in K    0/2

Both paraphrases dropped the recorded answer OUT of top-K entirely
(rank=-1). This isn't a paraphrase-quality problem — qwen2.5's outputs
preserved intent ("Hazmat-certified warehouse worker comfortable with
cold storage" → "Warehouse worker with Hazmat certification and
experience in cold storage"). It's the v0 boost-only stance documented
in internal/matrix/playbook.go:22-27: the boost only re-ranks results
that ALREADY surfaced from regular retrieval. If paraphrase's cosine
retrieval doesn't include the recorded answer in top-K, no boost can
promote it.

The "Shape B" upgrade mentioned in the playbook.go comment — inject
playbook hits directly even when they weren't in the top-K — is what
would close this gap. The reality test surfaced exactly the gap the
docs warned about. Worth filing as the next product gate.

Run-to-run variance also visible: v1 had 8 discoveries, v2 had 2.
HNSW insertion order + judge variance both contribute. Stability of
Q7 and Q13 across both runs (lifted in v1 AND v2) is the most reliable
signal in the dataset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:47:41 -05:00
root
b2e45f7f26 playbook_lift: harness expansion + reality test #001 (7/8 lift, 87.5%)
The 5-loop substrate's load-bearing gate is verified — playbook +
matrix indexer give the results we're looking for. Per the report's
rubric, lift ≥ 50% of discoveries means matrix is doing real work;
7/8 = 87.5% blew through that.

Harness was structurally hiding bugs behind a 5-daemon stripped boot.
Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:

1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd
   rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark) —
   wrong domain for staffing queries; replaced with ethereal_workers
   (10K rows, real staffing schema, "e-" id prefix to avoid collision
   with workers' "w-"). staffing_workers driver gains -index-name +
   -id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running
   ~30s per judge call against the lift loop; reverted to
   qwen2.5:latest (~1s/call, 30× faster; the lift finding held)

Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go
so future drift fires in `go test`, not in a reality run. R-005 closed:

- cmd/matrixd/main_test.go (new) — playbook record drift detector +
  score bounds + 6 routes mounted
- cmd/queryd/main_test.go — wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new) — 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new) — 4 routes + invalid-op + unknown-mode

`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.

Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
  Queries              21 (staffing-domain, 7 categories)
  Discoveries          8 (judge ≠ cosine top-1)
  Lifts                7/8 (87.5%)
  Boosts triggered     9
  Mean Δ distance      -0.053 (warm closer than cold)
  OOD honesty          dental/RN/SWE rated 1, no fake matches
  Cross-corpus boosts  confirmed (e- ↔ w- swaps in lifts)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:22:21 -05:00
root
848cbf5fef phase 3: playbook_lift harness reads judge from config
migrate the reality-test harness's judge-model default from a
hardcoded "qwen3.5:latest" string to cfg.Models.LocalJudge.

resolution priority: explicit -judge flag > $JUDGE_MODEL env >
cfg.Models.LocalJudge from lakehouse.toml > hardcoded fallback.
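
The resolution chain in sketch form (the config type is stubbed; names
otherwise per this message):

  package main

  import "os"

  // modelsConfig stands in for cfg.Models loaded from lakehouse.toml.
  type modelsConfig struct{ LocalJudge string }

  // resolveJudge: flag > $JUDGE_MODEL > config > hardcoded fallback.
  func resolveJudge(flagVal string, cfg modelsConfig) string {
      if flagVal != "" {
          return flagVal
      }
      if env := os.Getenv("JUDGE_MODEL"); env != "" {
          return env
      }
      if cfg.LocalJudge != "" {
          return cfg.LocalJudge
      }
      return "qwen3.5:latest"
  }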

bumping the judge for run #N+1 now means editing one line in
lakehouse.toml [models].local_judge — no Go file or shell script
edits required.

changes:
- scripts/playbook_lift/main.go: -config flag added, judge default
  flips to "" so resolution chain runs. Imports internal/shared for
  config loader.
- scripts/playbook_lift.sh: JUDGE_MODEL no longer defaulted in bash;
  EFFECTIVE_JUDGE resolved by a mirror of the Go chain (env > config
  grep > qwen3.5:latest fallback). Used for the Ollama presence
  check + report header. Pre-flight grep avoids requiring jq just
  to read the toml.
- reports/reality-tests/README.md: documents the 4-step priority
  chain.

verified all 4 paths produce the expected judge:
- config (no env): qwen3.5:latest (from lakehouse.toml)
- env override:    env wins
- flag override:   flag wins over env
- missing config:  DefaultConfig fallback still gives qwen3.5:latest

just verify PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:57:28 -05:00
root
3dd7d9fe30 reality-tests: playbook-lift harness — does the 5-loop substrate beat raw cosine?
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
  rates top-K → record playbook entry pointing at the highest-rated
  result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
  ranking shift. Lift = real if recorded answer becomes top-1.

Files:
- scripts/playbook_lift/main.go         driver (391 LoC)
- scripts/playbook_lift.sh              stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt  query corpus (5 placeholders;
                                            J writes real 20+)
- reports/reality-tests/README.md       framework + interpretation
- .gitignore                            track reports/reality-tests/
                                        but ignore per-run JSON evidence

This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.

Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:22:36 -05:00
root
c41698acae scrum rerun-2 — 50/60 (Δ R1 +7, Δ baseline +15) at c7e3124
Audited stash-clean c7e3124 (30 commits past rerun-1 4840c10).
3 HIGH risks closed (R-002 internal/shared, R-003 internal/storeclient,
R-008 queryd/db.go). 3 advanced to partial (R-001 via fail-loud-bind +
opt-in auth, R-006 via g2_smoke_fixtures, R-007 via ADR-003 auth.go).

Biggest move: Agent Memory Correctness 4 → 9 — pathway Mem0 ops
(ADD/UPDATE/REVISE/RETIRE/HISTORY) all tested, including cycle-detection
and retired-trace-exclusion. Sprint 2 acceptance criteria are now
verified code, not design-bar work.

Two new findings:
- F1 (MED): cmd/{matrixd,observerd,pathwayd}/main_test.go absent —
  reopens R-005 against new daemons.
- F2 (LOW): scripts/staffing_*/main.go flag-defaults reach
  /home/profit/lakehouse/data/...

Evidence under reports/scrum/_evidence/rerun2/ (local; per
.gitkeep convention).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:13:01 -05:00
root
ff9823b871 scrum audit re-run: 35 → 43 / 60 after Phase A-E + S0.3
Re-runs the SCRUM.md framework against HEAD (4840c10) to score the
delta from the audit baseline at 91edd43. Composite +8.

Scoring deltas:
  Reproducibility       7 → 9  (just verify, just doctor, pre-push hook)
  Test Coverage         6 → 8  (168 proof harness assertions; Go-test
                                gaps in shared/storeclient remain)
  Trust Boundary        7 → 7  (no code change; R-001/R-007 open)
  Memory Correctness    3 → 4  (vectord persistence proven; Mem0
                                pathway/playbook still not ported)
  Deployment Readiness  4 → 5  (just doctor; REPLICATION/systemd open)
  Maintainability       8 → 8  (spine unchanged; harness obeys
                                CLAUDE_REFACTOR_GUARDRAILS)

Risk register changes:
  R-004 (smokes not gated)        CLOSED — just verify + pre-push hook
  R-005 (cmd/main.go untested)    partial — proof harness covers wiring
  R-012 (empty tests/ dir)        CLOSED — populated by harness
  R-001/R-002/R-003/R-006/R-007/R-008/R-009/R-010 unchanged

Sprint 0 progress:
  S0.1 just doctor               DONE
  S0.3 just verify + pre-push    DONE
  S0.6 tests/ dir cleanup        DONE
  S0.2 just smoke-fixtures       open
  S0.4 cmd/main_test × 6         partial (harness coverage; go-test gap)
  S0.5 shared/storeclient tests  open  (HIGH risks still unaddressed)

New finding from this rerun (worth recording):
  Queryd refresh-tick race in 04_query_correctness — cache-warm
  binaries fire SELECTs faster than queryd's 500ms refresh tick.
  Caught by integration mode going 104/0/1 → 102/1/1, fixed at
  4840c10 with proof_wait_for_sql helper. Exactly the failure-mode
  the harness was designed to catch.

Original 5 audit reports preserved as immutable history at
91edd43; this file documents the delta only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 05:37:45 -05:00
root
91edd43164 scrum audit: 5 reports under reports/scrum/ · score 35/60
Adapts docs/SCRUM.md framework (originally written for the
matrix-agent-validated repo) to the Go rewrite. Five deliverables:

  golang-lakehouse-scrum-test.md  top-line + scoring + verdict
  risk-register.md                12 findings, R-001..R-012
  claim-coverage-table.md         claim/test/risk for Sprint 2
  sprint-backlog.md               5 sprints, ~2 weeks of work
  acceptance-gates.md             DoD as runnable commands

Every claim cites file:line, command output, or "missing evidence."
Smoke chain ran clean (33s wall, all 9 PASS) and is captured in
reports/scrum/_evidence/smoke_chain.log (gitignored — runtime artifact).

Scoring:
  Reproducibility       7/10  9 smokes deterministic, no just/CI gate
  Test Coverage         6/10  internal/ packages tested, 6/7 cmd/ aren't
  Trust Boundary        7/10  escapes ok, zero auth, /sql is RCE-eq off-loopback
  Memory Correctness    3/10  pathway/playbook/observer not yet ported
  Deployment Readiness  4/10  no REPLICATION, no env template, no systemd
  Maintainability       8/10  no god-files, 7 lean binaries, ADRs current

Top three risks:
  R-001 HIGH  queryd /sql + DuckDB + non-loopback bind = RCE-equivalent
  R-002 HIGH  internal/shared (server.go + config.go) zero tests
  R-003 HIGH  internal/storeclient zero tests, used by 2 services
  R-004 MED   9-smoke chain green but not gated (no justfile/hook)

The audit is the work; refactors come after. Sprint 0 owns coverage
+ CI gating; Sprint 1 owns trust-boundary decisions; Sprints 2-3 are
mostly design-bar work for unbuilt agent components.

.gitignore exception: /reports/* + !/reports/scrum/ keeps reports/
a runtime-artifact directory while exposing reports/scrum/ as
tracked documentation. Mirrors the pattern future audit passes will
land in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 04:51:47 -05:00