47 Commits

Author SHA1 Message Date
root
626f18d491 pathway_memory: audit-consensus → retire wire
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
When the observer's hand-review explicitly rejects the output of a
hot-swap-recommended model, the matrix's recommendation was wrong
for this context. Auto-retire the trace so future agents don't
get the same poisoned recommendation in their preamble.

crates/vectord/src/pathway_memory.rs — add `trace_uid` to
HotSwapCandidate response and populate from the matched trace.
This gives consumers single-trace precision for /pathway/retire.

tests/real-world/scrum_master_pipeline.ts:
  - HotSwapCandidate interface gains trace_uid
  - new retirePathwayTrace() helper (fire-and-forget, fall-open)
  - in the obsVerdict reject branch: if hotSwap was active AND
    the rejected model is the hot-swap-recommended one AND
    observer confidence ≥0.7, fire retire and null hotSwap so
    post-loop replay bookkeeping doesn't double-process.
  - hotSwap declared `let` (was const) so it can be nulled

Cycle verdicts ("needs different angle") don't trigger retire —
only outright rejects do. Confidence gate avoids retiring on
heuristic-fallback verdicts that come back without a confidence
number. Closes the "audit-consensus → retire" item from
HANDOVER.md.
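
A minimal sketch of that reject branch (TypeScript; the HotSwapCandidate fields beyond trace_uid, the gateway base URL, and the reason string are assumptions):

```ts
// Shapes are illustrative; only trace_uid is specified by this commit.
interface HotSwapCandidate { trace_uid: string; model: string; }
interface ObsVerdict { verdict: "accept" | "reject" | "cycle"; confidence?: number; }

// Fire-and-forget, fall-open: a failed retire must never block the review loop.
async function retirePathwayTrace(traceUid: string, reason: string): Promise<void> {
  try {
    await fetch("http://localhost:3100/vectors/pathway/retire", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ trace_uid: traceUid, reason }),
    });
  } catch { /* fall open */ }
}

let hotSwap: HotSwapCandidate | null = null; // set earlier when query_hot_swap matched

function onObserverReject(obsVerdict: ObsVerdict, rejectedModel: string): void {
  if (
    hotSwap &&                            // hot-swap was active this review
    rejectedModel === hotSwap.model &&    // rejected model IS the recommended one
    (obsVerdict.confidence ?? 0) >= 0.7   // gate out confidence-less heuristic verdicts
  ) {
    void retirePathwayTrace(hotSwap.trace_uid, "observer_reject");
    hotSwap = null; // so post-loop replay bookkeeping doesn't double-process
  }
}
```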

Live-tested: insert synthetic trace → /pathway/retire by trace_uid
→ retired counter 1 → 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 00:01:20 -05:00
root
0115a60072 observer: add /relevance heuristic filter for adjacency pollution
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Matrix retrieval often surfaces high-cosine chunks that are about
symbols the focus file IMPORTS but doesn't define. The reviewer LLM
then hallucinates those imported-crate internals as in-file content
("I see main.rs does X" when X lives in queryd::context).

mcp-server/relevance.ts — pure scorer with five signals:
  path_match      +1.0  chunk source/doc_id encodes focus path
  defined_match   +0.6  chunk text mentions focus.defined_symbols
  token_overlap   +0.4  jaccard of non-stopword tokens
  prefix_match    +0.3  shared first-2-segment prefix
  import_only     -0.5  mentions only imported symbols (pollution)

Default threshold 0.3 — tuned empirically on the gateway/main.rs case.
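
A sketch of the scorer's shape under those weights (signal detection is simplified relative to the real mcp-server/relevance.ts heuristics):

```ts
interface FocusInfo { path: string; definedSymbols: string[]; importedSymbols: string[]; }
interface Chunk { docId: string; text: string; }

const STOPWORDS = new Set(["the", "a", "of", "in", "to", "and", "for"]);
const tokens = (s: string) =>
  new Set(s.toLowerCase().split(/\W+/).filter(t => t.length > 2 && !STOPWORDS.has(t)));

function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 || b.size === 0) return 0;
  let inter = 0;
  for (const t of a) if (b.has(t)) inter++;
  return inter / (a.size + b.size - inter);
}

function scoreChunk(chunk: Chunk, focus: FocusInfo, focusText: string): number {
  let score = 0;
  if (chunk.docId.includes(focus.path)) score += 1.0;                 // path_match
  const mentionsDefined = focus.definedSymbols.some(s => chunk.text.includes(s));
  if (mentionsDefined) score += 0.6;                                  // defined_match
  score += 0.4 * jaccard(tokens(chunk.text), tokens(focusText));      // token_overlap
  const prefix = focus.path.split("/").slice(0, 2).join("/");
  if (chunk.docId.startsWith(prefix)) score += 0.3;                   // prefix_match
  const mentionsImported = focus.importedSymbols.some(s => chunk.text.includes(s));
  if (mentionsImported && !mentionsDefined) score -= 0.5;             // import_only
  return score;
}

const THRESHOLD = Number(process.env.LH_RELEVANCE_THRESHOLD ?? 0.3);
```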

Also fixes a regex bug in the import extractor: the character class
was lowercase-only, so `use catalogd::Registry;` silently never
matched (regex backed off when it hit the uppercase R). Caught by
the test suite.

observer.ts — POST /relevance endpoint wraps filterChunks().
scrum_master_pipeline.ts — fetchMatrixContext gains optional
focusContent param; calls /relevance after collecting allHits and
before sort+top. Opt-out via LH_RELEVANCE_FILTER=0; threshold via
LH_RELEVANCE_THRESHOLD. Fall-open on observer failure.

9 unit tests, all green. Live probe on real shape correctly drops
a 0.7-cosine adjacency-pollution chunk while keeping in-focus hits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 23:51:45 -05:00
root
6ac7f61819 pathway_memory: Mem0 versioning + deletion (upsert/revise/retire/history)
Per J 2026-04-25: pathway_memory was append-only — every agent run added
a new trace, bad/failed runs polluted the matrix forever, no notion of
"this is the canonical evolved playbook." Ported playbook_memory's
Phase 25/27 patterns into pathway_memory so the agent loop's matrix
converges on best-known approaches per task class instead of bloating.

Fields added to PathwayTrace (all #[serde(default)] for back-compat):
- trace_uid: stable UUID per individual trace within a bucket
- version: u32 default 1
- parent_trace_uid, superseded_at, superseded_by_trace_uid
- retirement_reason (paired with existing retired:bool)

Methods added to PathwayMemory:
- upsert(trace) → PathwayUpsertOutcome {Added|Updated|Noop}
  Workflow-fingerprint dedup: ladder_attempts + final_verdict hash.
  Identical workflow → bumps existing replay_count instead of duplicating.
- revise(parent_uid, new_trace) → PathwayReviseOutcome
  Chains versions; rejects retired or already-superseded parents.
- retire(trace_uid, reason) → bool
  Marks specific trace retired with reason. Idempotent.
- history(trace_uid) → Vec<PathwayTrace>
  Walks parent_trace_uid back to root, then superseded_by forward to tip.
  Cycle-safe via visited set.
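
The history walk, sketched in TypeScript for illustration (the shipped implementation is the Rust code in pathway_memory.rs; trace storage is modeled as a flat map here):

```ts
interface PathwayTrace {
  trace_uid: string;
  parent_trace_uid?: string;
  superseded_by_trace_uid?: string;
}

// Walk parent links back to the root, then superseded_by forward to the tip.
// Visited sets make both walks cycle-safe against corrupted chains.
function history(traces: Map<string, PathwayTrace>, traceUid: string): PathwayTrace[] {
  let cur = traces.get(traceUid);
  if (!cur) return [];
  // 1. back to root
  const seenBack = new Set<string>([cur.trace_uid]);
  while (cur.parent_trace_uid) {
    const parent = traces.get(cur.parent_trace_uid);
    if (!parent || seenBack.has(parent.trace_uid)) break;
    seenBack.add(parent.trace_uid);
    cur = parent;
  }
  // 2. forward to tip
  const chain: PathwayTrace[] = [];
  const seenFwd = new Set<string>();
  while (cur && !seenFwd.has(cur.trace_uid)) {
    seenFwd.add(cur.trace_uid);
    chain.push(cur);
    cur = cur.superseded_by_trace_uid ? traces.get(cur.superseded_by_trace_uid) : undefined;
  }
  return chain;
}
```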

Retrieval gates updated:
- query_hot_swap skips superseded_at.is_some()
- bug_fingerprints_for skips both retired AND superseded

HTTP endpoints in service.rs:
- POST /vectors/pathway/upsert
- POST /vectors/pathway/retire
- POST /vectors/pathway/revise
- GET  /vectors/pathway/history/{trace_uid}

scripts/seal_agent_playbook.ts switched insert→upsert + accepts SESSION_DIR
arg so it can seal any archived session, not just iter4.

Verified live (4/4 ops):
- UPSERT first run: Added trace_uid 542ae53f
- UPSERT identical: Updated, replay_count bumped 0→1 (no duplicate)
- REVISE 542ae53f→87a70a61: parent stamped superseded_at, v2 created
- HISTORY of v2: chain_len=2, v1 superseded, v2 tip
- RETIRE iter-6 broken trace: retired=true, retirement_reason preserved
- pathway_memory.stats: total=79, retired=1, reuse_rate=0.0127

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 19:31:44 -05:00
root
ed83754f20 raw-corpus dump + vectorization + chicago contract inference pipeline
Three new pieces, executed in order:

scripts/dump_raw_corpus.sh
- One-shot bash that creates MinIO bucket `raw` and uploads all
  testing corpora as a persistent immutable test set. 365 MB total
  across 5 prefixes (chicago, entities, sec, staffing, llm_team)
  + MANIFEST.json. Sources: workers_500k.parquet (309 MB),
  resumes.parquet, entities.jsonl, sec_company_tickers.json,
  Chicago permits last 30d (2,853 records, 5.4 MB), 9 LLM Team
  Postgres tables dumped via row_to_json.

scripts/vectorize_raw_corpus.ts
- Bun script that fetches each raw-bucket source via mc, runs a
  source-specific extractor into {id, text} docs, posts to
  /vectors/index, polls job to completion. Verified results:
    chicago_permits_v1: 3,420 chunks
    entity_brief_v1:    634 chunks
    sec_tickers_v1:    10,341 chunks (after extractor fix for
                        wrapped {rows: {...}} JSON shape)
    llm_team_runs_v1:  in flight, 19K+ chunks
    llm_team_response_cache_v1: queued
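
The post-then-poll pattern in sketch form (only /vectors/index is named above; the request body shape, gateway URL, and /vectors/jobs poll route are assumptions):

```ts
interface Doc { id: string; text: string; }

const GATEWAY = process.env.GATEWAY ?? "http://localhost:3100"; // assumed

// Post extracted docs to the vector index, then poll the job to completion.
async function indexCorpus(corpus: string, docs: Doc[]): Promise<void> {
  const res = await fetch(`${GATEWAY}/vectors/index`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ corpus, docs }),
  });
  const { job_id } = (await res.json()) as { job_id: string };
  for (;;) {
    const job = (await (await fetch(`${GATEWAY}/vectors/jobs/${job_id}`)).json()) as {
      status: "queued" | "running" | "done" | "failed";
      chunks?: number;
    };
    if (job.status === "done") { console.log(`${corpus}: ${job.chunks} chunks`); return; }
    if (job.status === "failed") throw new Error(`${corpus}: index job failed`);
    await new Promise(r => setTimeout(r, 2000)); // poll every 2s
  }
}
```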

scripts/analyze_chicago_contracts.ts
- Real inference pipeline that picks N high-cost permits with
  named contractors from the raw bucket, queries all 6 contract-
  analysis corpora in parallel via /vectors/search, builds a
  MATRIX CONTEXT preamble, calls Grok 4.1 fast for structured
  staffing analysis, hand-reviews each via observer /review,
  appends to data/_kb/contract_analyses.jsonl.

tests/real-world/scrum_master_pipeline.ts
- MATRIX_CORPORA_FOR_TASK extended with two new task classes:
  contract_analysis (chicago + entity_brief + sec + llm_team_runs
    + llm_team_response_cache + distilled_procedural)
  staffing_inference (workers_500k_v8 + entity_brief + chicago
    + llm_team_runs + distilled_procedural)
  scrum_review unchanged.

This is the first time the matrix architecture operates on real
ingested data instead of code-review smoke tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 18:44:27 -05:00
root
a496ced848 scrum: unified matrix retriever — pull from ALL relevant KB corpora, not just pathway memory
Per J 2026-04-25 architectural correction: matrix index is the vector
indexing layer for the WHOLE knowledge base (distilled facts, procedures,
config hints, team runs, playbooks, pathway successes), not a single
narrow store. Built fetchMatrixContext(query, taskClass, filePath) that:

- Queries multiple persistent vector indexes in parallel via /vectors/search
- Collects hits per corpus + score + doc_id + 400-char excerpt
- Pulls pathway successes via existing helper, mapped to MatrixHit shape
- Sorts by score across corpora, returns top-N (default 8)
- Reports per-corpus hit counts + errors for transparency
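
A condensed sketch of that fan-out/merge (the /vectors/search body shape and gateway URL are assumptions):

```ts
interface MatrixHit {
  source_corpus: string;
  score: number;
  doc_id: string;
  excerpt: string; // 400-char excerpt
}

async function fetchMatrixContext(
  query: string,
  corpora: string[],
  topN = 8,
): Promise<MatrixHit[]> {
  const perCorpus = await Promise.all(
    corpora.map(async (corpus): Promise<MatrixHit[]> => {
      try {
        const res = await fetch("http://localhost:3100/vectors/search", {
          method: "POST",
          headers: { "content-type": "application/json" },
          body: JSON.stringify({ corpus, query, k: topN }),
        });
        const { hits } = (await res.json()) as {
          hits: { score: number; doc_id: string; text: string }[];
        };
        return hits.map(h => ({
          source_corpus: corpus,
          score: h.score,
          doc_id: h.doc_id,
          excerpt: h.text.slice(0, 400),
        }));
      } catch {
        return []; // one corpus down doesn't break the bundle
      }
    }),
  );
  // Sort by score ACROSS corpora, then take the global top-N.
  return perCorpus.flat().sort((a, b) => b.score - a.score).slice(0, topN);
}
```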

Per-task-class corpus list (MATRIX_CORPORA_FOR_TASK):
  scrum_review → distilled_factual, distilled_procedural,
                 distilled_config_hint, kb_team_runs_v1
  (staffing data deliberately excluded — not relevant to code review)

Probed live: distilled_config_hint top hit = 0.52, distilled_procedural
top = 0.49, kb_team_runs top = 0.59. Real signal across corpora.

Replaces the narrow proven-approaches preamble with a unified
MATRIX-INDEXED CONTEXT preamble tagged with source_corpus per chunk
so the model knows what kind of context it's seeing.

LH_SCRUM_MATRIX_RETRIEVE=0 still disables for A/B testing.

Future: promote to a Rust /v1/matrix endpoint once corpora list and
ranking logic stabilize. For now TS lets us iterate fast against the
live matrix without gateway restarts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 18:29:08 -05:00
root
d187bcd8ac scrum: stop cascading models on quality issues — single-model retry with enrichment
Architectural correction (J 2026-04-25):

The 9-rung ladder was treating cascade as the strategy. That's wrong.
ONE model handles the work, with same-model retries using enriched
context. Cycle to a different model ONLY on PROVIDER errors (network
/ auth / 5xx) — never on quality issues, because quality issues mean
the context needs more enrichment, not a different model.

Changes:
- LADDER shrunk from 11 entries to 3 (Grok 4.1 fast primary, DeepSeek
  V4 flash + Qwen3-235B as provider-error fallbacks). Removed Kimi
  K2.6, Gemini 2.5 flash, all Ollama Cloud rungs, OR free-tier rungs,
  local qwen3.5 — none were doing the work, all wasted attempts. They
  remain available as routable tools for the future mode router.
- Loop restructured: separate `modelIdx` from attempt counter.
  Provider error → modelIdx++ (advance fallback). Observer reject /
  cycle / thin response → retry SAME model with rejection notes
  feeding into the `learning` preamble; advance fallback only after
  MAX_QUALITY_RETRIES (default 2) exhausted on the current model.
- LH_SCRUM_MAX_QUALITY_RETRIES env to tune the per-model retry cap.
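
The restructured control flow in outline (a sketch; verdict classification and the enrichment preamble are stubbed):

```ts
const LADDER = ["x-ai/grok-4.1-fast", "deepseek/deepseek-v4-flash", "qwen/qwen3-235b-a22b-2507"];
const MAX_QUALITY_RETRIES = Number(process.env.LH_SCRUM_MAX_QUALITY_RETRIES ?? 2);

type Outcome =
  | { kind: "ok"; review: string }
  | { kind: "provider_error" }                  // network / auth / 5xx
  | { kind: "quality_reject"; notes: string };  // observer reject / cycle / thin

declare function runModel(model: string, learning: string[]): Promise<Outcome>;

async function reviewWithRetries(): Promise<string | null> {
  let modelIdx = 0;               // advances ONLY on provider errors (or retry exhaustion)
  let qualityRetries = 0;         // same-model retries with enriched context
  const learning: string[] = [];  // rejection notes fed back into the preamble
  while (modelIdx < LADDER.length) {
    const out = await runModel(LADDER[modelIdx], learning);
    if (out.kind === "ok") return out.review;
    if (out.kind === "provider_error") {
      modelIdx++;                 // different provider backbone, not a "smarter" model
      qualityRetries = 0;
      continue;
    }
    learning.push(out.notes);     // enrich context, retry the SAME model
    if (++qualityRetries > MAX_QUALITY_RETRIES) { modelIdx++; qualityRetries = 0; }
  }
  return null;
}
```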

What this preserves:
- Tree-split (treeSplitFile) is still the ONE legitimate model-switch
  trigger for context-overflow, but even it just re-runs the same
  model against smaller chunks.
- Pathway memory preamble still fires.
- Hot-swap reorder still applies — when a recommended model maps to
  the new shorter ladder.

Future direction (J 2026-04-25 note): the LLM Team multi-model modes
in /root/llm_team_ui.py are a REFERENCE PATTERN for a mode router we
will build INSIDE this gateway. Mimic the patterns, don't modify the
LLM Team UI itself. The mode router will pick the right approach for
each task class via the matrix index, not cascade through models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 18:08:31 -05:00
root
6432465e2c autonomous_loop: stop clobbering applier model/provider defaults
Found by running: the loop was setting LH_APPLIER_MODEL=qwen3-coder:480b
explicitly via env, which clobbered the applier's NEW default of
x-ai/grok-4.1-fast on openrouter. Result: applier kept hitting the
throttled ollama_cloud account and producing zero patches every iter.

Now LOOP_APPLIER_MODEL and LOOP_APPLIER_PROVIDER are optional overrides;
when unset, scrum_applier.ts uses its own defaults.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:54:54 -05:00
root
4ac56564c0 scrum + applier + observer: switch to paid OpenRouter ladder, add Kimi K2.6 + Gemini 2.5
Ollama Cloud was throttled across all 6 cloud rungs in iters 1-9, which
forced the loop into 0-review iterations even though the architecture
was sound. Swapping to paid OpenRouter unblocks the test path.

Ladder changes (top-of-ladder paid models, all under $0.85/M input):
- moonshotai/kimi-k2.6     ($0.74/$4.66, 256K) — capped at 25/hr
- x-ai/grok-4.1-fast       ($0.20/$0.50, 2M)   — primary general
- google/gemini-2.5-flash  ($0.30/$2.50, 1M)   — Google reasoning
- deepseek/deepseek-v4-flash ($0.14/$0.28, 1M) — cheap workhorse
- qwen/qwen3-235b-a22b-2507  ($0.07/$0.10, 262K) — cheapest big
Existing rungs (Ollama Cloud + free OR + local qwen3.5) kept as fallback.

Per-model rate limiter (MODEL_RATE_LIMITS in scrum_master_pipeline.ts):
- Persists call timestamps to data/_kb/rate_limit_calls.jsonl so caps
  survive process restarts (autonomous loop spawns a fresh subprocess
  per iteration; without persistence each iter would reset)
- O(1) writes, prune-on-read for the rolling 1h window
- Capped models log "SKIP (rate-limited: cap N/hr reached)" and the
  ladder cycles to the next rung
- J directive 2026-04-25: 25/hr on Kimi K2.6 to bound output cost
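
A sketch of the persistent rolling-window limiter (file path from this commit; the per-line JSON shape is an assumption):

```ts
import { appendFileSync, readFileSync, existsSync } from "node:fs";

const CALLS_FILE = "data/_kb/rate_limit_calls.jsonl";
const WINDOW_MS = 60 * 60 * 1000; // rolling 1h window

// O(1) write: append one line per call; never rewrite the file.
function recordCall(model: string): void {
  appendFileSync(CALLS_FILE, JSON.stringify({ model, ts: Date.now() }) + "\n");
}

// Prune-on-read: count only timestamps inside the window. Because state lives
// on disk, caps survive the fresh subprocess each loop iteration spawns.
function isRateLimited(model: string, capPerHour: number): boolean {
  if (!existsSync(CALLS_FILE)) return false;
  const cutoff = Date.now() - WINDOW_MS;
  const calls = readFileSync(CALLS_FILE, "utf8")
    .split("\n")
    .filter(Boolean)
    .map(l => JSON.parse(l) as { model: string; ts: number })
    .filter(c => c.model === model && c.ts >= cutoff);
  return calls.length >= capPerHour;
}
```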

Observer hand-review cloud tier swapped from ollama_cloud/qwen3-coder:480b
to openrouter/x-ai/grok-4.1-fast — proven to emit precise semantic
verdicts (named "AccessControl::can_access() doesn't exist" specifically
in 2026-04-25 tests instead of the heuristic fallback).

Applier patch emitter swapped from ollama_cloud/qwen3-coder:480b to
openrouter/x-ai/grok-4.1-fast (default; LH_APPLIER_MODEL +
LH_APPLIER_PROVIDER override). This was the third LLM call we missed —
without it, observer accepts a review but applier never produces patches
because its emitter was still hitting the throttled account.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:49:02 -05:00
root
e79e51ed70 tests: autonomous_loop.ts — goal-driven scrum + applier retry harness
Wraps tests/real-world/scrum_master_pipeline.ts and scrum_applier.ts in
a single autonomous loop that runs scrum → applier --commit → optional
git push, observes per-iteration outcomes via observer /event, journals
to data/_kb/autonomous_loops.jsonl. Stops when 2 consecutive iters land
zero commits OR LOOP_MAX_ITERS reached.

Env knobs:
  LOOP_TARGETS — comma-sep paths, default 3 high-traffic Lakehouse files
  LOOP_MAX_ITERS — default 3
  LOOP_PUSH=1 — push branch after each commit-landing iter
  LOOP_BRANCH — default scrum/auto-apply-19814 (refuses to run elsewhere)
  LOOP_MIN_CONF — applier min confidence (default 85)
  LOOP_APPLIER_MODEL — default qwen3-coder:480b

Causality preserved: targets pass through to LH_APPLIER_FILES so applier
patches what scrum just reviewed (vs picking from global review history).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:32:15 -05:00
root
3f166a5558 scrum + observer: hand-review wire — judgment moved out of the inner loop
Pre-2026-04-25 the scrum_master applied a hardcoded grounding-rate gate
inline. That baked policy into the wrong layer — semantic judgment about
whether a review is grounded belongs in the observer (which has Langfuse
traces, sees every response across the system, and can call cloud LLMs
for real evaluation). Scrum should report DATA, observer DECIDES.

What landed:
- scrum_master_pipeline.ts: removed the inline grounding-pct threshold;
  every accepted candidate now POSTs to observer's /review endpoint with
  {response, source_content, grounding_stats, model, attempt}. Observer
  returns {verdict: accept|reject|cycle, confidence, notes}. On observer
  failure, scrum falls open to accept (observer is policy, not blocker).
- mcp-server/observer.ts: new POST /review endpoint with two-tier
  evaluator. Tier 1: cloud LLM (qwen3-coder:480b at temp=0) hand-reviews
  with full context — response + source excerpt + grounding stats — and
  emits structured verdict JSON. Tier 2: deterministic heuristic over
  grounding pct + total quotes when cloud throttles, marked source:
  "heuristic" so consumers can tune it later by comparing against cloud.
- Every verdict persists to data/_kb/observer_reviews.jsonl with full
  input snapshot so cloud vs heuristic can be A/B compared once cloud
  quota refreshes.
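
The scrum-side call with its fall-open default, sketched (observer port from the Phase 24 commit further down; the timeout value is an assumption):

```ts
interface ReviewVerdict {
  verdict: "accept" | "reject" | "cycle";
  confidence?: number;
  notes?: string;
  source?: "cloud" | "heuristic";
}

// Observer is policy, not a blocker: if it is down, scrum falls open to accept.
async function observerReview(payload: {
  response: string;
  source_content: string;
  grounding_stats: unknown;
  model: string;
  attempt: number;
}): Promise<ReviewVerdict> {
  try {
    const res = await fetch("http://localhost:3800/review", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(payload),
      signal: AbortSignal.timeout(10_000),
    });
    return (await res.json()) as ReviewVerdict;
  } catch {
    return { verdict: "accept", notes: "observer unreachable — fall open" };
  }
}
```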

Verified end-to-end: smoke loop iter 1 — observer returned `cycle` on
21% grounding (cycled to next rung), `reject` on 17% (gave up). Iter 2
— `reject` on 12% and 14%. Both UNRESOLVED with honest signal instead
of polluting pathway memory with hallucinated patterns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:32:04 -05:00
root
c90a509f49 applier: LH_APPLIER_FILES env to constrain to current-iter targets
Without this, the applier loaded the latest 34 reviews and patched the
highest-confidence file from history — which is meaningless when called
from the autonomous loop where the intent is "review file X this iter,
patch file X this iter." Now the loop passes its targets through and the
applier filters eligible reviews accordingly.

Causality is restored: scrum reviews file X → applier patches file X.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:31:49 -05:00
root
9ecc5848fa scrum: blind-response guard + anchor-grounding post-verifier
Two signal-quality fixes for the scrum loop:

1. isBlindResponse() — detects models that emit structurally-valid
   review JSON containing "no source code visible / cannot verify"
   even when source WAS supplied. Rejects so the ladder cycles to
   the next rung instead of accepting the blind hallucination.

2. verifyAnchorGrounding() + appendGroundingFooter() — post-process
   verifier that extracts every backtick-quoted snippet from the
   review and checks it against the original source content.
   Appends a grounding footer reporting grounded vs ungrounded
   counts so humans can audit hallucination rate at a glance.
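
Both helpers in sketch form (backtick extraction simplified to a single regex):

```ts
// Extract every backtick-quoted snippet and check it against the source.
function verifyAnchorGrounding(review: string, source: string) {
  const snippets = [...review.matchAll(/`([^`]+)`/g)].map(m => m[1]);
  const grounded = snippets.filter(s => source.includes(s));
  return { grounded: grounded.length, ungrounded: snippets.length - grounded.length };
}

function appendGroundingFooter(review: string, source: string): string {
  const { grounded, ungrounded } = verifyAnchorGrounding(review, source);
  return `${review}\n\n---\nGROUNDING: ${grounded} grounded / ${ungrounded} ungrounded snippets`;
}
```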

Born from the iter where llm_team_ui.py review came back with 6/10
findings hallucinated (invented render_template_string calls,
fabricated logger.exception sites, made-up SHA-256 hashing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:07:30 -05:00
root
4a94da2d41 tests/architecture_smoke — PRD-invariant probe against 500k workers
J's reset: I'd been iterating on pipeline internals without a
driver. The PRD says staffing is the REFERENCE consumer, not the
domain driver — the architecture is the thing. This test makes
that explicit.

8 sections exercise the PRD §Shared requirements against live
production-shaped data (500k workers parquet, 50k-chunk vector
index, 768d nomic embeddings):

  1. preconditions       — gateway + sidecar alive
  2. catalog lookup      — workers_500k resolves to 500000 rows
  3. SQL at scale        — count(*) + geo filter on 500k rows
  4. vector search       — /vectors/search returns top-k
  5. hybrid SQL+vector   — /vectors/hybrid with sql_filter
  6. playbook_memory     — /vectors/playbook_memory/stats
  7. pathway_memory      — ADR-021 stats + bug_fingerprints
  8. truth gate          — DROP TABLE blocked with 403

No cloud calls. Completes in ~5 seconds. Exits non-zero on any
failure; failure messages print "these are the next things to fix."

First-run measurements against current code:
  - 500k COUNT(*) = 22ms, OH-filtered = 20ms (invariant met)
  - vector search p=368ms on 10-NN
  - hybrid p=4662ms, returned 0 Toledo-OH hits (two signals worth
    investigating: the latency AND the empty result)
  - playbook_memory = 0 entries (rebuild never fired since boot)

The 11/11 pass means the substrate's contract is intact. The
measurements tell us WHERE to look next, not what to speculate.

Going forward: this script is the canary. Run it after every
substantive change. If a section flips from pass to fail, that IS
the regression; roll back or fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:12:14 -05:00
root
021c1b557f agent.ts: route generateCloud through /v1/chat (Phase 44 migration)
Phase 44 PRD (docs/CONTROL_PLANE_PRD.md:204) explicitly lists
`tests/multi-agent/agent.ts::generate()` as a migration target:
every internal LLM caller must flow through /v1/chat so usage
accounting + audit trail see all traffic.

generateCloud() was bypassing the gateway entirely — direct POST to
OLLAMA_CLOUD_URL/api/generate with the bearer key. This meant:
  - /v1/usage missed every agent.ts cloud call
  - No gateway-side caching, rate-limiting, or cost gating
  - Callers needed OLLAMA_CLOUD_KEY in env (leak risk; gateway
    already owns the key)

Migration:
  - Endpoint: OLLAMA_CLOUD_URL/api/generate → GATEWAY/v1/chat
  - Body shape: {prompt,options.num_predict,options.temperature} →
    OpenAI-compatible {messages[],temperature,max_tokens}
  - provider: "ollama_cloud" explicit in the request
  - Response extraction: data.response → data.choices[0].message.content
  - OLLAMA_CLOUD_KEY no longer required in agent.ts env
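
The migration as a before/after sketch (the gateway URL constant is an assumption):

```ts
const GATEWAY = process.env.GATEWAY ?? "http://localhost:3100";

// BEFORE — direct to Ollama Cloud, invisible to /v1/usage:
//   POST ${OLLAMA_CLOUD_URL}/api/generate
//   { model, prompt, options: { num_predict, temperature } } → data.response

// AFTER — through the gateway, OpenAI-compatible:
async function generateCloud(model: string, prompt: string, maxTokens = 1024): Promise<string> {
  const res = await fetch(`${GATEWAY}/v1/chat`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({
      provider: "ollama_cloud",    // explicit provider routing
      model,
      messages: [{ role: "user", content: prompt }],
      temperature: 0,
      max_tokens: maxTokens,
    }),
  });
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  return data.choices[0].message.content; // was: data.response
}
```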

Phase 44 gate verified: `grep localhost:3200/generate|/api/generate`
now only hits (a) the ollama_cloud.rs adapter itself (legit — it's
the gateway-side direct caller) and (b) this comment explaining the
migration history. Zero non-adapter code paths to /api/generate.

generate() (local Ollama) still goes direct to :3200 — that's the
t1_hot path. Phase 44 PRD focuses on cloud callers; hot-path local
generation deliberately stays direct for latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 13:27:54 -05:00
root
ed85620558 scrum: filter table-header words from bug_fingerprint extraction
Iter 11 surfaced "DeadCode:Flag" in the matrix — a noisy pattern_key
where "Flag" is the table column HEADER kimi produces for structured
review output, not an actual Rust identifier.

Kimi's standard format on recent iters:
  | # | Change                    | Flag       | Confidence |
  | 1 | Wire AgentIdentity into.. | Boundary.. | 92%        |

The extractor's KEYWORDS set already filtered Rust grammar words
(self, mut, async, etc) and the FLAG_VARIANTS themselves. Adding
markdown-layout words (Flag, Change, Confidence, PRD, Plan) closes
the last common noise class.

One-line addition — empirically validated against the iter 11
vectord trace that produced DeadCode:Flag. Future iters won't
reproduce that specific noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 13:22:50 -05:00
root
0cf1b7c45a scrum_master: env-configurable tree-split threshold + shard size
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Hard-coded constants (FILE_TREE_SPLIT_THRESHOLD=6000, FILE_SHARD_SIZE=3500)
were tuned for Rust source files in crates/<crate>/src/*.rs. Running
the pipeline against /root/llm-team-ui/llm_team_ui.py (13K lines, ~400KB)
would produce ~200 shards per review at the default size — not viable.

Two env vars now:
  - LH_SCRUM_TREE_SPLIT_THRESHOLD — when tree-split fires (default 6000)
  - LH_SCRUM_SHARD_SIZE — bytes per shard (default 3500)

For the big-Python case the CLAUDE.md in /root/llm-team-ui/ recommends
LH_SCRUM_TREE_SPLIT_THRESHOLD=20000, LH_SCRUM_SHARD_SIZE=12000 which
brings the 13K-line file down to ~35 shards — same ballpark as a
typical Rust file review.

No default change. Existing lakehouse runs unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 13:02:45 -05:00
root
f4cff660aa ADR-021 Phase D fix: strip flag names + Rust keywords from pattern_keys
Iter 9 revealed two quality bugs in the extractor:

1. Kimi wraps the Flag column in backticks (`DeadCode`), so the flag
   name itself was captured as a code token. Result: pattern_keys like
   "DeadCode:DeadCode" that match nothing and add noise to the index.
   Fix: filter FLAG_VARIANTS out of token candidates.

2. Complex backtick content like `Foo::bar(&self) -> u64` was rejected
   wholesale by the identifier regex. Fallback now scans for identifier
   substrings and ranks by ::-qualified paths first, then length.
   Bonus: filter Rust keywords (self, mut, async, etc) since they're
   grammar, not bug-shape signal.

Dry-run on iter 9 delta.rs output produces semantically meaningful keys:
  DeadCode:DeltaStats::tombstones_applied
  NullableConfusion:DeltaError-DeltaStats-apply_delta
  BoundaryViolation:apply_delta-journald::emit-rows_dropped_by_tombstones
  PseudoImpl:apply_delta-delta_ops-validate_schema

These are stable under reviewer prose variation (canonical sort + top-3
slice) and precise enough to separate different bugs within the same
Flag category.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 06:05:50 -05:00
root
ee31424d0c ADR-021 Phase D: bug_fingerprint pattern extraction from reviewer output
Some checks failed
lakehouse/auditor 4 blocking issues: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Fills the gap between Phase B (flags tagged) and Phase C (preamble
quotes past fingerprints): parses each reviewer line that mentions a
Flag variant, collects backtick-quoted identifiers, canonicalizes them
(sorted alphabetically, top 3), and emits a stable pattern_key of
shape `{Flag}:{tok1}-{tok2}-{tok3}`.

Stability by design: canonical sort means "row_count + QueryResponse"
and "QueryResponse + row_count" produce the same key, so variation in
reviewer prose doesn't fragment the index. Top-3 cap keeps keys short
while retaining enough signal to separate different bugs of the same
category.
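
The canonicalization in miniature (a sketch; the real identifier filter is stricter):

```ts
// Identifier-shaped tokens only: ::-paths, snake_case, CamelCase, length ≥ 3.
const IDENT = /[A-Za-z_][A-Za-z0-9_]*(?:::[A-Za-z_][A-Za-z0-9_]*)*/g;

function patternKey(flag: string, line: string): string | null {
  const tokens = [...new Set(
    [...line.matchAll(/`([^`]+)`/g)]      // backtick-quoted spans only
      .flatMap(m => m[1].match(IDENT) ?? [])
      .filter(t => t.length >= 3),
  )];
  if (tokens.length === 0) return null;
  // Canonical sort + top-3 slice: reviewer prose order can't fragment the index.
  const top3 = tokens.sort().slice(0, 3);
  return `${flag}:${top3.join("-")}`;
}

// patternKey("UnitMismatch", "`row_count` vs `QueryResponse`") and
// patternKey("UnitMismatch", "`QueryResponse` vs `row_count`") yield the same key.
```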

Dry-run validation on iter-8 delta.rs output (crates/queryd prefix)
extracted 10 semantically meaningful fingerprints including:
  - UnitMismatch:base_rows-checked_add-checked_sub
  - DeadCode:queryd::delta::write_delta (P9-001 dead-function finding)
  - BoundaryViolation:can_access-log_query-masked_columns (P13-001 gap)
  - NullableConfusion:CompactResult-DeltaError-IntegerOverflow

Cross-cutting signal: kimi-k2:1t's finding #5 explicitly quoted the
seeded pathway memory preamble ("Pathway memory flags row_count-
file_count unit mismatch") and proposed overflow-checked arithmetic as
the fix. That is the compounding loop in action — prior bug context
shifted the reviewer's attention toward a specific instance of the
same class, which produces a specific pattern_key that will compound
further on the next iter.

Filter: identifier-shaped tokens only (A-Za-z_ / :: paths / snake_case
/ CamelCase). Skips punctuation, prose quotes, and tokens <3 chars so
generic nouns and partial words don't pollute the index.

What's still queued (Phase E):
  - type_hints_used population from catalogd column types + Arrow schema
  - auditor → pathway audit_consensus update wire (strict-audit gate
    activation)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 06:02:07 -05:00
root
0a0843b605 ADR-021: semantic-correctness layer lands in pathway_memory (A+B+C)
Some checks failed
lakehouse/auditor 4 blocking issues: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Phase A — data model (vectord/src/pathway_memory.rs):
  + SemanticFlag enum (9 variants: UnitMismatch, TypeConfusion,
    NullableConfusion, OffByOne, StaleReference, PseudoImpl, DeadCode,
    WarningNoise, BoundaryViolation) as #[serde(tag = "kind")]
  + TypeHint { source, symbol, type_repr }
  + BugFingerprint { flag, pattern_key, example, occurrences }
  + PathwayTrace gains semantic_flags, type_hints_used, bug_fingerprints
    all #[serde(default)] for back-compat deserialization of pre-ADR-021
    traces on disk
  + build_pathway_vec now tokenizes flag:{variant} + bug:{flag}:{key}
    so traces with different bug histories cluster separately in the
    similarity gate (proven by pathway_vec_differs_when_bug_fingerprint_added
    test)

Phase B — producer (scrum_master_pipeline.ts):
  + Prompt addendum: each finding must carry `**Flag: <CATEGORY>**` tag
    alongside the existing Confidence: NN% tag. 9 category choices plus
    `None` for improvements that aren't bug-shaped.
  + Parser extracts tagged flags from reviewer markdown; falls back to
    bare-word match if reviewer omits the label. Deduplicated per trace.
  + PathwayTracePayload gains semantic_flags / type_hints_used /
    bug_fingerprints fields. Wire format matches Rust serde tagged enum
    so TS and Rust interop directly.
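
The TS side of that wire format, sketched (field lists abbreviated; names are the ones given in Phase A above):

```ts
// Mirrors Rust's #[serde(tag = "kind")] — the variant name travels in "kind".
type SemanticFlag =
  | { kind: "UnitMismatch" }
  | { kind: "TypeConfusion" }
  | { kind: "NullableConfusion" }
  | { kind: "OffByOne" }
  | { kind: "StaleReference" }
  | { kind: "PseudoImpl" }
  | { kind: "DeadCode" }
  | { kind: "WarningNoise" }
  | { kind: "BoundaryViolation" };

interface TypeHint { source: string; symbol: string; type_repr: string; }
interface BugFingerprint { flag: SemanticFlag; pattern_key: string; example: string; occurrences: number; }

interface PathwayTracePayload {
  semantic_flags: SemanticFlag[];
  type_hints_used: TypeHint[];
  bug_fingerprints: BugFingerprint[];
  // ...existing trace fields
}
```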

Phase C — pre-review enrichment:
  + new `/vectors/pathway/bug_fingerprints` endpoint aggregates
    occurrences by (flag, pattern_key) across traces sharing a narrow
    fingerprint, sorts by frequency, returns top-K.
  + scrum calls it before the ladder and prepends a PATHWAY MEMORY
    preamble to the reviewer prompt ("these patterns appeared N times
    on this file area before — check for recurrences"). Empty on
    fresh install; grows as the matrix index learns.

Tests: 27 pathway_memory tests green (was 18). New tests:
  - pathway_trace_deserializes_without_new_fields_backcompat
  - semantic_flag_serializes_as_tagged_enum
  - bug_fingerprint_roundtrips_through_serde
  - pathway_vec_differs_when_bug_fingerprint_added
  - semantic_flag_discriminates_by_variant
  - bug_fingerprints_aggregate_by_pattern_key (sums occurrences, sorts desc)
  - bug_fingerprints_empty_for_unseen_fingerprint
  - bug_fingerprints_respects_limit
  - insert_preserves_semantic_fields (roundtrip via persist + reload)

Workspace warnings unchanged at 11.

What's still queued (not this commit):
  - type_hints_used population from catalogd column types + Arrow schema
  - bug_fingerprint extraction from reviewer output (Phase D — for now
    semantic_flags populate but the fingerprint key requires parsing
    code-shape from the finding; next iteration's work)
  - auditor → pathway audit_consensus update wire (explicit-fail gate)

Why this commit matters: the mechanical applier's gates are syntactic
(warning count, patch size, rationale-token alignment). The
queryd/delta.rs base_rows bug (86901f8) was found by human reading —
unit mismatch between row counts and file counts. At 100 bugs this
deep, humans can't catch them all; the matrix index has to learn the
shapes. This commit gives it the fields to learn into and the surface
to read from.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 05:49:10 -05:00
root
2f8b347f37 pathway_memory: consensus-designed sidecar + hot-swap learning loop
Some checks failed
lakehouse/auditor 11 warnings — see review
10-probe N=3 consensus (kimi-k2:1t / gpt-oss:120b / qwen3.5:latest /
deepseek-v3.1:671b / qwen3-coder:480b / mistral-large-3:675b /
qwen3.5:397b + 2 stability re-probes; 2 openrouter probes 429'd) locked
the design across three rounds. Full JSON responses in
data/_kb/consensus_reducer_design_{mocq3akn,mocq6pi1,mocqatik}.json.

What it does

Preserves FULL backtrack context per reviewed file (ladder attempts +
latencies + reject reasons, KB chunks with provenance + cosine + rank,
observer signals, context7 bridge hits, sub-pipeline calls, audit
consensus) and indexes them by narrow fingerprint for hot-swap of
proven review pathways.

When scrum reviews a file:
  1. narrow fingerprint = task_class + file_prefix + signal_class
  2. query_hot_swap checks pathway memory for a match that passes
     probation (≥3 replays @ ≥80% success) + audit gate + similarity
     (≥0.90 cosine on normalized-metadata-token embedding)
  3. if hot-swap eligible, recommended model tried first in the ladder
  4. replay outcome reported back, updating the pathway's success_rate
  5. pathways below 0.80 after ≥3 replays retire permanently (sticky)
  6. full PathwayTrace always inserted at end of review — hot-swap
     grows with use, it doesn't bootstrap from nothing
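
The eligibility gate, condensed into a sketch (cosine/embedding computation elided):

```ts
interface Pathway {
  replay_count: number;
  success_rate: number;                      // updated by each replay outcome
  retired: boolean;                          // sticky once set
  audit_consensus: "pass" | "fail" | null;   // null allowed during bootstrap
}

// `sim` is cosine similarity on the normalized-metadata-token embedding.
function hotSwapEligible(p: Pathway, sim: number): boolean {
  if (p.retired) return false;                    // retirement is sticky
  if (p.audit_consensus === "fail") return false; // explicit audit FAIL blocks
  if (p.replay_count < 3) return false;           // probation: ≥3 replays
  if (p.success_rate < 0.8) return false;         // ≥80% success rate
  return sim >= 0.9;                              // similarity gate
}
```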

Gate design is load-bearing:
  - narrow fingerprint (6 of 8 consensus models converged on the same
    3-field composition; lock) — enables generalization within crate
  - probation ≥3 replays — binomial tail at 80% is ~5%, below is noise
  - success rate ≥0.80 — mistral + qwen3-coder independently proposed
    this exact threshold across two rounds
  - similarity ≥0.90 — middle of the 0.85/0.95 consensus spread
  - bootstrap: null audit_consensus ALLOWED (auditor → pathway update
    not wired yet; probation + success_rate gates alone enforce safety
    during bootstrap; explicit audit FAIL still blocks)
  - retirement is sticky — prevents oscillation on noise

Files

  + crates/vectord/src/pathway_memory.rs  (new, 600 lines + 18 tests)
    PathwayTrace, LadderAttempt, KbChunkRef, ObserverSignal, BridgeHit,
    SubPipelineCall, AuditConsensus, HotSwapCandidate, PathwayMemory,
    PathwayMemoryStats. 18/18 tests green.
    Cosine + 32-bucket L2-normalized embedding; mirror of TS impl.
  M crates/vectord/src/lib.rs
    pub mod pathway_memory;
  M crates/vectord/src/service.rs
    VectorState grows pathway_memory field;
    4 HTTP handlers (/pathway/insert, /pathway/query,
    /pathway/record_replay, /pathway/stats).
  M crates/gateway/src/main.rs
    Construct PathwayMemory + load from storage on boot,
    wire into VectorState.
  M tests/real-world/scrum_master_pipeline.ts
    Byte-matching TS bucket-hash (verified same bucket indices as
    Rust); pre-ladder hot-swap query; ladder reorder on hit;
    per-attempt latency capture; post-accept trace insert
    (fire-and-forget); replay outcome recording;
    observer /event emits pathway_hot_swap_hit, pathway_similarity,
    rungs_saved per review for the VCP UI.
  M ui/server.ts
    /data/pathway_stats aggregates /vectors/pathway/stats +
    scrum_reviews.jsonl window for the value metric.
  M ui/ui.js
    Three new metric cards:
      · pathway reuse rate (activity: is it firing?)
      · avg rungs saved (value: is it earning its keep?)
      · pathways tracked (stability: retirement = learning)

What's not in this commit (queued)

  - auditor → pathway audit_consensus update wire (explicit audit-fail
    block activates when this lands)
  - bridge_hits + sub_pipeline_calls population from context7 / LLM
    Team extract results (fields wired, callers not yet)
  - replay log (PathwayReplayOutcome {matched_id, succeeded, ts}) as
    a separate jsonl for forensic audit of why specific replays failed

Why > summarization

Summaries discard the causal chain. With this, auditor can verify
citation provenance, applier can distinguish lucky from learned paths,
and the matrix indexing actually stores end-to-end pathways instead of
just RAG chunks — which is what J meant by "why aren't we using it
for everything."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 05:15:32 -05:00
root
5e8d87bf34 cleanup: remove unused HashSet import from 96b46cd + tighten applier gates
Some checks failed
lakehouse/auditor 1 blocking issue: cloud: claim not backed — "journal event verified live (total_events_created 0→1 after probe)."
96b46cd ("first auto-applied commit") added `use tracing;` and
`use std::collections::HashSet;` to queryd/service.rs under a commit
message claiming to add a destructive SQL filter. HashSet was unused —
cargo check passed (warnings aren't errors) but the workspace now
carries a permanent `unused_imports` warning. `use tracing;` is
redundant but not flagged by the compiler, so it stays.

This is an honest postmortem of the rationale-diff divergence problem:
emitter claimed one thing, diffed another. The cargo-green gate alone
can't catch that.

Applier hardening in this commit addresses all three failure modes:
  - new-warning gate: reject patches that keep build green but add
    warnings (baseline → post-patch diff)
  - rationale-diff token alignment heuristic: reject patches whose
    rationale shares no vocabulary with the actual new_string
  - dry-run workspace revert: COMMIT=0 was silently leaving files
    modified between runs; now reverts after each cargo check
  - prompt additions: forbid unused-symbol imports; require rationale
    vocabulary to appear in the diff
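
The token-alignment heuristic in sketch form (the word-length cutoff is an assumption):

```ts
// Reject a patch whose rationale shares no vocabulary with what it actually writes.
function rationaleAligned(rationale: string, newString: string): boolean {
  const words = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter(w => w.length > 3));
  const r = words(rationale);
  for (const w of words(newString)) if (r.has(w)) return true;
  return false;
}
```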

Next-iter applier runs should produce cleaner commits or none at all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 04:25:53 -05:00
root
25ea3de836 observer: fix LLM Team escalation — route to /v1/chat qwen3-coder:480b instead of dead mode
Some checks failed
lakehouse/auditor 1 blocking issue: cloud: claim not backed — "journal event verified live (total_events_created 0→1 after probe)."
Discovery 2026-04-24: /api/run?mode=code_review returns "Unknown mode"
(error response from llm_team_ui.py). The 2026-04-24 observer escalation
wiring pointed at a dead endpoint and was failing silently. My earlier
claim of "9 registered LLM Team modes" came from GET probes that all
returned 405 — I interpreted that as "POST-only endpoints exist" when
it just means "GET is not allowed for anything, and on POST only `extract`
is registered."

Rewire: observer's escalateFailureClusterToLLMTeam now hits
  POST /v1/chat { provider: "ollama_cloud", model: "qwen3-coder:480b", ... }
which is the same coding-specialist rung 2 of the scrum ladder that
reliably produces substantive reviews. Probe shows 1240 chars of
substantive analysis in ~8.7s.

Also tightens scrum_applier:
  * MODEL default: kimi-k2:1t → qwen3-coder:480b (coding specialist)
  * Size gate: 20 lines → 6 lines (surgical patches only)
  * Max patches per file: 3 → 2
  * Prompt: explicit forbidden-actions list (no struct renames, no
    function-signature changes, no new modules) and mechanical-only
    whitelist

These changes produced the first auto-applied commit (96b46cd), which
landed a 2-line import addition that passed cargo check. Zero-to-one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 04:14:33 -05:00
root
8b77d67c9c OpenRouter rescue ladder + tree-split reduce fix + observer→LLM Team + scrum_applier + first auto-applied patch
Some checks failed
lakehouse/auditor 1 blocking issue: cloud: claim not backed — "journal event verified live (total_events_created 0→1 after probe)."
## Infrastructure (scrum loop hardening)

crates/gateway/src/v1/openrouter.rs — new OpenRouter provider
  Direct HTTPS to openrouter.ai/api/v1/chat/completions with OpenAI-compatible shape.
  Key resolution: OPENROUTER_API_KEY env → /home/profit/.env → /root/llm_team_config.json
  (shares LLM Team UI's quota). Added after iter 5 hit repeated Ollama Cloud 502s on
  kimi-k2:1t — different provider backbone as rescue rung. Unit tests pin the URL
  stripping and OpenAI wire shape.

crates/gateway/src/v1/mod.rs + main.rs
  Added `"openrouter" | "openrouter_free"` arm to /v1/chat dispatch.
  V1State.openrouter_key loaded at startup via openrouter::resolve_openrouter_key()
  mirroring the Ollama Cloud pattern. Startup log:
    "v1: OpenRouter key loaded — /v1/chat provider=openrouter enabled"

tests/real-world/scrum_master_pipeline.ts
  * 9-rung ladder — kimi-k2:1t → qwen3-coder:480b → deepseek-v3.1:671b →
    mistral-large-3:675b → gpt-oss:120b → qwen3.5:397b → openrouter/gpt-oss-120b:free
    → openrouter/gemma-3-27b-it:free → local qwen3.5:latest.
    Added qwen3-coder:480b as rung 2 after live probes confirmed it rescues
    kimi-k2:1t 502s cleanly (0.9s latency, substantive reviews).
    Dropped devstral-2 (displaced by qwen3-coder); dropped kimi-k2.6 (not available);
    dropped minimax-m2.7 (returned 0 chars / 400 thinking tokens).
    Local fallback promoted qwen3.5:latest per J's direction 2026-04-24.
  * MAX_ATTEMPTS bumped 6 → 9 to accommodate the rescue tier.
  * Tree-split scratchpad fixed — was concatenating shard markers directly
    into the reviewer input, causing kimi-k2:1t to write titles like
    "Forensic Audit Report – file.rs (shard 3)". Now uses internal §N§
    markers during accumulation and runs a proper reduce step that
    collapses per-shard digests into ONE coherent file-level synthesis
    with markers stripped. Matches the Phase 21 aibridge::tree_split
    map→reduce design. Fallback to stripped scratchpad if reducer returns thin.

tests/real-world/scrum_applier.ts — NEW (737 lines)
  The auto-apply pipeline. Reads scrum_reviews.jsonl, filters rows where
  gradient_tier ∈ {auto, dry_run} AND confidence_avg ≥ MIN_CONF (default 90),
  asks the reviewer model for concrete old_string/new_string patch JSON,
  applies via text replacement, runs cargo check after each file, commits
  if green and reverts if red. Deny-list: /etc/, config/, ops/, auditor/,
  docs/, data/, mcp-server/, ui/, sidecar/, scripts/. Hard caps: per-patch
  confidence ≥ MIN_CONF, old_string must be exactly unique, max 20 lines per
  patch. Never runs on main without explicit LH_APPLIER_BRANCH override.
  Audit trail in data/_kb/auto_apply.jsonl.

  Empirical behavior (dry-run over iter 4 reviews):
    5 eligible files → 1 green commit-ready, 2 build-red reverts, 2 all-rejected
  The build-green gate caught 2 bad patches before they'd have merged.
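
The exact-unique replacement gate at the core of the applier, sketched:

```ts
// Apply old_string → new_string ONLY if old_string occurs exactly once:
// zero matches means a stale patch, two or more means an ambiguous one.
function applyPatch(fileText: string, oldString: string, newString: string): string | null {
  const first = fileText.indexOf(oldString);
  if (first === -1) return null;                                  // not found
  if (fileText.indexOf(oldString, first + 1) !== -1) return null; // not unique
  return fileText.slice(0, first) + newString + fileText.slice(first + oldString.length);
}
```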

mcp-server/observer.ts — LLM Team code_review escalation
  When a sig_hash accumulates ≥3 failures (ESCALATION_THRESHOLD), fire-and-forget
  POST /api/run?mode=code_review at localhost:5000 with the failure cluster context.
  Parses facts/entities/relationships/file_hints from the response. Writes to a
  new data/_kb/observer_escalations.jsonl surface. Answers J's vision of the
  observer triggering richer LLM Team calls when failures pile up.
  Non-blocking: runs parallel to existing qwen2.5 analyzer, never replaces it.
  Tracks escalated sig_hashes in a session-local Set to avoid re-hammering
  LLM Team when a cluster persists across observer cycles.

crates/aibridge/src/context.rs
  First auto-applied patch produced by scrum_applier.ts (dry-run path —
  applier writes files in dry-run mode but doesn't commit; bug noted for
  iter 6 fix). Adds #[deprecated] annotation to the inline estimate_tokens
  helper pointing callers to the centralized shared::model_matrix::ModelMatrix
  entry point (P21-002 — duplicate token-estimator surfaces). Cargo check
  passes with the annotation (verified by applier's own build gate).

## Visual Control Plane (UI)

ui/server.ts — Bun.serve on :3950 with /data/* fan-out:
  /data/services, /data/reviews, /data/metrics, /data/trust, /data/overrides,
  /data/findings, /data/outcomes, /data/audit_facts, /data/file/:path,
  /data/refactor_signals, /data/search?q=, /data/signal_classes,
  /data/logs/:svc (journalctl tail per systemd unit), /data/scrum_log.
  Bug fix: tryFetch always attempts JSON.parse before falling back to text
  — observer's Bun.serve returns JSON without application/json content-type,
  which was displaying stats as a raw string ("0 ops" on map) before.

ui/index.html + ui.css — dark neo-brutalist shell. 6 views:
  MAP (D3 force-graph + overlays) / TRACE (per-file iter history) /
  TRAJECTORY (signal-class cards + refactor-signals table + reverse-index
  search box) / METRICS (every card has SOURCE + GOOD lines explaining
  where the number comes from and what target trajectory means) /
  KB (card grid with tooltips on every field) / CONSOLE (per-service
  journalctl tabs).

ui/ui.js — polling client, D3 wiring, signal-class panel, refactor-signals
  table, reverse-index search, per-service console tabs. Bug fix:
  renderNodeContext had Object.entries() iterating string characters when
  /health returned a plain string — now guards with typeof check so
  "lakehouse ok" renders as one row instead of "0 l / 1 a / 2 k / ...".

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 03:45:35 -05:00
root
21fd3b9c61 Scrum-driven fixes: P5-001 auth wired, P42-001 truth evaluator, P9-001 journal on ingest
Some checks failed
lakehouse/auditor 2 blocking issues: cloud: claim not backed — "| **P9-001** (partial) | `crates/ingestd/src/service.rs` | **3 → 6** ↑↑↑ | `journal.record_ing
Apply the highest-confidence findings from the Phase 0→42 forensic sweep
after four scrum-master iterations under the adversarial prompt. Each fix
is independently validated by a later scrum iteration scoring the same
file higher under the same bar.

Code changes
────────────
P5-001 — crates/gateway/src/auth.rs + main.rs
  api_key_auth was marked #[allow(dead_code)] and never wrapped around
  the router, so `[auth] enabled=true` logged a green message and
  enforced nothing. Now wired via from_fn_with_state, with constant-time
  header compare and /health exempted for LB probes.

P42-001 — crates/truth/src/lib.rs
  TruthStore::check() ignored RuleCondition entirely — signature looked
  like enforcement, body returned every action unconditionally. Added
  evaluate(task_class, ctx) that actually walks FieldEquals / FieldEmpty /
  FieldGreater / Always against a serde_json::Value via dot-path lookup.
  check() kept for back-compat. Tests 14 → 24 (10 new exercising real
  pass/fail semantics). serde_json moved to [dependencies].
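
The evaluator's shape, sketched in TypeScript for illustration (the shipped version is Rust walking a serde_json::Value):

```ts
type RuleCondition =
  | { kind: "Always" }
  | { kind: "FieldEquals"; path: string; value: unknown }
  | { kind: "FieldEmpty"; path: string }
  | { kind: "FieldGreater"; path: string; value: number };

// Dot-path lookup into a JSON context: "job.rows" → ctx.job.rows
function lookup(ctx: unknown, path: string): unknown {
  return path.split(".").reduce<unknown>(
    (cur, seg) =>
      cur && typeof cur === "object" ? (cur as Record<string, unknown>)[seg] : undefined,
    ctx,
  );
}

function evaluate(cond: RuleCondition, ctx: unknown): boolean {
  switch (cond.kind) {
    case "Always":
      return true;
    case "FieldEquals":
      return lookup(ctx, cond.path) === cond.value;
    case "FieldEmpty": {
      const v = lookup(ctx, cond.path);
      return v === undefined || v === null || v === "" ||
             (Array.isArray(v) && v.length === 0);
    }
    case "FieldGreater":
      return Number(lookup(ctx, cond.path)) > cond.value;
  }
}
```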

P9-001 (partial) — crates/ingestd/src/service.rs
Added Option<Journal> to IngestState + a journal.record_ingest() call
  on /ingest/file success. Gateway wires it with `journal.clone()` before
  the /journal nest consumes the original. First-ever internal mutation
  journal event verified live (total_events_created 0→1 after probe).

Iter-4 scrum scored these files higher under same prompt:
  ingestd/src/service.rs      3 → 6  (P9-001 visible)
  truth/src/lib.rs            3 → 4  (P42-001 visible)
  gateway/src/auth.rs         3 → 4  (P5-001 visible)
  gateway/src/execution_loop  4 → 6  (indirect)
  storaged/src/federation     3 → 4  (indirect)

Infrastructure additions
────────────────────────
 * tests/real-world/scrum_master_pipeline.ts
   - cloud-first ladder: kimi-k2:1t → deepseek-v3.1:671b → mistral-large-3:675b
     → gpt-oss:120b → devstral-2:123b → qwen3.5:397b (deep final thinker)
   - LH_SCRUM_FORENSIC env: injects SCRUM_FORENSIC_PROMPT.md as adversarial preamble
   - LH_SCRUM_PROPOSAL env: per-iter fix-wave doc override
   - Confidence extraction (markdown + JSON), schema v4 KB rows with:
     verdict, critical_failures_count, verified_components_count,
     missing_components_count, output_format, gradient_tier
   - Model trust profile written per file-accept to data/_kb/model_trust.jsonl
   - Fire-and-forget POST to observer /event so by_source.scrum appears in /stats

 * mcp-server/observer.ts — unchanged in shape, confirmed receiving scrum events

 * ui/ — new Visual Control Plane on :3950
   - Bun.serve with /data/{services,reviews,metrics,trust,overrides,findings,file,refactor_signals,search,logs/:svc,scrum_log}
   - Views: MAP (D3 graph, 5 overlays) / TRACE (per-file iter timeline) /
     TRAJECTORY (refactor signals + reverse index search) / METRICS (explainers
     with SOURCE + GOOD lines) / KB (card grid with tooltips) / CONSOLE (per-service
     journalctl tail, tabs for gateway/sidecar/observer/mcp/ctx7/auditor/langfuse)
   - tryFetch always attempts JSON.parse (fix for observer returning JSON without content-type)
   - renderNodeContext primitive-vs-object guard (fix for gateway /health string)

 * docs/SCRUM_FIX_WAVE.md     — iter-specific scope directing the scrum
 * docs/SCRUM_FORENSIC_PROMPT.md — adversarial audit prompt (verdict/critical/verified schema)
 * docs/SCRUM_LOOP_NOTES.md   — iteration observations + fix-next-loop queue
 * docs/SYSTEM_EVOLUTION_LAYERS.md — Layers 1-10 roadmap (trust profiling, execution DNA, drift sentinel, etc)

Measurements across iterations
──────────────────────────────
 iter 1 (soft prompt, gpt-oss:120b):   mean score 5.00/10
 iter 3 (forensic, kimi-k2:1t):        mean score 3.56/10 (−1.44 — bar raised)
 iter 4 (same bar, post fixes):        mean score 4.00/10 (+0.44 — fixes landed)

 Score movement iter3→iter4: ↑5 ↓1 =12
 21/21 first-attempt accept by kimi-k2:1t in iter 4
 20/21 emitted forensic JSON (richer signal than markdown)
 16 verified_components captured (proof-of-life, new metric)
 Permission Gradient distribution: 0 auto · 16 dry_run · 4 sim · 1 block

 Observer loop: by_source {scrum: 21, langfuse: 1985, phase24_audit: 1}
 v1/usage: 224 requests, 477K tokens, all tracked

Signal classes per file (iter 3 → iter 4):
 CONVERGING:  1 (ingestd/service.rs — fix clearly landed)
 LOOPING:     4 (catalogd/registry, main, queryd/service, vectord/index_registry)
 ORBITING:    1 (truth — novel findings surfacing as surface ones fix)
 PLATEAU:     9 (scores flat with high confidence — diminishing returns)
 MIXED:       6

Loop thesis status
──────────────────
A file's score rises only when the scrum confirms a real fix landed.
No false positives yet across 3 iterations. Fixes applied to 3 files all
raised their independent scores under the same adversarial prompt. Loop
is measurable, not hand-wavy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 02:25:43 -05:00
root
e2ccddd8d2 Test updates: scenarios manifest + nine_consecutive_audits 2026-04-23 01:57:44 -05:00
7c1745611a Audit pipeline PR #9: determinism + fact extraction + verifier gate + KB stats + context injection (PR #9)
Bundles PR #9's work for the audit pipeline:

- N=3 consensus on cloud inference (gpt-oss:120b parallel) with qwen3-coder:480b tie-breaker
- audit_discrepancies.jsonl logs N-run disagreements
- scrum_master reviews route through llm_team fact extraction; source="scrum_review"
- Verifier-gated persistence: drops INCORRECT, keeps UNVERIFIABLE/UNCHECKED; schema_version:2
- scrum_master_reviewed flag on accepted reviews
- auditor/kb_stats.ts: on-demand observability script
- claim_parser history/proof pattern class (verified-on-PR, was-flipping, the-proven-X)
- claim_parser quoted-string guard (mirrors static.ts fix)
- fact_extractor project context injection via docs/AUDITOR_CONTEXT.md
- Fixed verifier-verdict parser to handle multiple gemma2 output formats

Empirical: 3-run determinism test on unchanged PR #9 SHA showed 7/7 warn findings stable; block count oscillation eliminated; llm_team quality scores 8-9 on context-injected extract runs.

See PR #9 for full run-by-run commit history.
2026-04-23 05:29:38 +00:00
156dae6732 Auditor self-test branch: real-world pipelines + cohesion Phase C + KB index (PR #8)
Bundles 12 commits validating the auditor + scrum_master architecture end-to-end:

- enrich_prd_pipeline / hard_task_escalation / scrum_master_pipeline stress tests
- Tree-split + scrum_reviews.jsonl + kb_query surfacing
- Verdict → audit_lessons feedback loop (closed)
- kb_index aggregator with confidence-based severity policy
- 9-run + 5-run empirical tests proved the predictive-compounding property
- Level 1 correction: temp=0 cloud inference for deterministic per-claim verdicts
- audit_one.ts dry-run CLI
- Fixes: static quoted-string guard, empirical-claim classification, symbol-resolver gate, repo-file size cap

See PR #8 for run-by-run commit history.
2026-04-23 03:28:32 +00:00
profit
f44b6b3e6b Control-plane pivot: Phase 38-44 plan + bot scaffold
Direction shift 2026-04-22: docs/CONTROL_PLANE_PRD.md becomes the
long-horizon architecture target. Existing Lakehouse (docs/PRD.md,
Phases 0-37) is preserved as the reference implementation and first
consumer. New 6-layer architecture:

  L1 Universal API /v1/chat /v1/usage /v1/sessions /v1/tools /v1/context
  L2 Routing & Policy Engine (rules, fallback chains, cost gating)
  L3 Provider Adapter Layer (Ollama + OpenRouter + Gemini + Claude)
  L4 Knowledge + Memory + Playbooks (already built)
  L5 Execution Loop (scenarios + bot/cycle.ts instances)
  L6 Observability + token accounting

Phases 38-44 sequenced with detailed per-phase specs in the PRD.
Current scope: staffing domain (synthetic workers_500k, contracts,
emails, SMS, playbooks). DevOps (Terraform/Ansible) is long-horizon
target — architecture-compatible but not current.

Files added:
- docs/CONTROL_PLANE_PRD.md — 6-layer architecture, Phase 38-44
  sequencing with staffing-first Truth Layer + Validation pipeline
- bot/ — manual-only PR bot scaffold. First consumer test-bed for
  /v1/chat (Phase 38). Mem0-aligned ADD/UPDATE/NOOP apply semantics;
  KB feedback loop reads prior cycles on same gap and injects into
  cloud prompt so bot cycles compound like scenario.ts runs do.
- tests/multi-agent/run_stress.ts — the 6-task diverse stress test
  referenced in the previous commit but missing from its staging

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 02:43:31 -05:00
profit
5b1fcf6d27 Phase 28-36 body of work
Accumulated since a6f12e2 (Phase 21 Rust port + Phase 27 versioning):

- Phase 36: embed_semaphore on VectorState (permits=1) serializes
  seed embed calls — prevents sidecar socket collisions under
  concurrent /seed stress load
- Phase 31+: run_stress.ts 6-task diverse stress scaffolding;
  run_e2e_rated.ts + orchestrator.ts tightening
- Catalog dedupe cleanup: 16 duplicate manifests removed; canonical
  candidates.parquet (10.5MB -> 76KB) + placements.parquet (1.2MB ->
  11KB) regenerated post-dedupe; fresh manifests for active datasets
- vectord: harness EvalSet refinements (+181), agent portfolio
  rotation + ingest triggers (+158), autotune + rag adjustments
- catalogd/storaged/ingestd/mcp-server: misc tightening
- docs: Phase 28-36 PRD entries + DECISIONS ADR additions;
  control-plane pivot banner added to top of docs/PRD.md (pointing
  at docs/CONTROL_PLANE_PRD.md which lands in next commit)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 02:41:15 -05:00
root
52561d10d3 Input normalizer + unified memory query — "seamless with whatever input"
J asked directly: "did we implement our memory findings so that our
knowledge base and our configuration playbook [work] seamlessly with
whatever input they're given?" Honest answer tonight was "one of five
findings shipped, normalizer is the blocker." This closes that gap.

NORMALIZER (tests/multi-agent/normalize.ts):
Accepts structured JSON, natural language, or mixed. Returns canonical
NormalizedInput { role, city, state, count, client, deadline, intent,
confidence, extraction_method, missing_fields } for any downstream
consumer.

Three-tier path:
  1. Structured fast-path — already-shaped input skips LLM
  2. Regex path — "need 3 welders in Nashville, TN" parses without LLM.
     City/state parser tightened to 1-3 capitalized words + "in {city}"
     anchor preference + case-exact full-state-name variants to prevent
     "Forklift Operators in Chicago" being captured as the city name
  3. LLM fallback — qwen3 local with think:false + 400 max_tokens for
     inputs the regex can't handle
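
The three-tier dispatch, condensed (NormalizedInput trimmed to a few fields; regex shown for the simple "N roles in City, ST" case only; the LLM fallback is stubbed):

```ts
interface NormalizedInput {
  role?: string; city?: string; state?: string; count?: number;
  extraction_method: "structured" | "regex" | "llm";
}

declare function llmNormalize(input: string): Promise<NormalizedInput>; // qwen3 fallback

async function normalize(input: unknown): Promise<NormalizedInput> {
  // Tier 1: structured fast-path — already-shaped input skips the LLM entirely.
  if (input && typeof input === "object" && "role" in input) {
    return { ...(input as object), extraction_method: "structured" } as NormalizedInput;
  }
  const text = String(input);
  // Tier 2: regex — "need 3 welders in Nashville, TN" parses without an LLM call.
  const m = text.match(
    /(\d+)\s+([a-z ]+?)s?\s+in\s+([A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+){0,2}),?\s+([A-Z]{2})\b/,
  );
  if (m) {
    return { count: Number(m[1]), role: m[2].trim(), city: m[3], state: m[4],
             extraction_method: "regex" };
  }
  // Tier 3: LLM fallback for anything the regex can't handle.
  return llmNormalize(text);
}
```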

Unit tests (tests/multi-agent/normalize.test.ts): 9/9 pass. Covers
structured fast-path, misplacement→rescue intent, state-name→abbrev
conversion, regex extraction from natural language, plural role +
full state name edge case, rescue intent keyword precedence, partial
input reporting missing fields, empty object fallthrough, async/sync
parity on clean inputs.

UNIFIED MEMORY QUERY (tests/multi-agent/memory_query.ts):
One function, five parallel fan-outs, one bundle returned:
  - playbook_workers — hybrid_search via gateway with use_playbook_memory
  - pathway_recommendation — KB recommender for this sig
  - neighbor_signatures — K-NN sigs weighted by staffer competence
  - prior_lessons — T3 overseer lessons filtered by city/state
  - top_staffers — competence-sorted leaderboard
  - discovered_patterns — top workers endorsed across past playbooks
    for this (role, city, state)
  - latency_ms — per-source + total
Every branch is best-effort: one source down doesn't break the bundle.

HTTP ENDPOINT (mcp-server/index.ts):
  POST /memory/query with body {input: <anything>} → MemoryQueryResult
Returns the same shape the TS function does. Typed with types.ts for
future UI consumption.

VERIFIED:
  curl POST /memory/query with structured {role,city,state,count}
    → extraction_method=structured, 10 playbook workers, top score 0.878
  curl POST /memory/query with "I need 3 welders in Nashville, TN"
    → extraction_method=regex (no LLM call), 319ms total, 8 endorsements
      for Lauren Gomez auto-discovered as top Nashville Welder

Honest remaining gaps (documented for next phase):
  - Mem0 ADD/UPDATE/DELETE/NOOP — we still only ADD + mark_failed
  - Zep validity windows — playbook entries have timestamps but no
    retirement semantic
  - Letta working-memory / hot cache — every query scans all 1560
    playbook entries
  - Memory profiles / scoped queries — global pool, no per-staffer
    private subsets

2 of 5 findings now shipped (multi-strategy retrieval in Rust, input
normalization + unified query in TS). The remaining 3 are architectural
additions queued as Phase 25 items — validity windows first since it's
the most load-bearing for long-running systems.
2026-04-20 23:59:05 -05:00
root
b95dd86556 Phase 24 — observer HTTP ingest + scenario outcome streaming
Closes the gap J flagged: observer wraps MCP:3700, scenarios hit
gateway:3100 directly, observer idle at 0 ops across 3600+ cycles.
Now scenarios POST per-event outcomes to observer's new HTTP ingest
on :3800, observer consumes them alongside MCP-wrapped ops, and the
ERROR_ANALYZER and PLAYBOOK_BUILDER loops see the full picture.

observer.ts:
- Bun.serve() HTTP listener on OBSERVER_PORT (default 3800):
  GET /health    — basic + ring depth
  GET /stats     — total / success / failure / by_source / recent
                   scenario ops digest
  POST /event    — accept scenario outcome, shape it into ObservedOp
                   with source="scenario" + staffer_id + sig_hash +
                   event_kind + role/city/state + rescue flags
- recordExternalOp() — shared ring-buffer insert so the main analyzer
  + playbook builder don't care where the op came from
- ObservedOp extended with provenance fields
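
The route skeleton, roughly (Bun.serve shape; recordExternalOp and the
stats digest are stubbed):

  declare function recordExternalOp(op: unknown): number; // returns ring depth
  declare function statsDigest(): unknown;

  Bun.serve({
    port: Number(process.env.OBSERVER_PORT ?? 3800),
    async fetch(req) {
      const { pathname } = new URL(req.url);
      if (req.method === "GET" && pathname === "/health")
        return Response.json({ ok: true });
      if (req.method === "GET" && pathname === "/stats")
        return Response.json(statsDigest());
      if (req.method === "POST" && pathname === "/event") {
        const outcome = await req.json();
        // shape scenario outcome into an ObservedOp with provenance
        const ring_size = recordExternalOp({ ...outcome, source: "scenario" });
        return Response.json({ accepted: true, ring_size });
      }
      return new Response("not found", { status: 404 });
    },
  });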

persistOp() FIX — old path POSTed to /ingest/file?name=observed_operations
which REPLACES the dataset (flagged in feedback_ingest_replace_semantics.md).
Every op was silently wiping all prior ops. Replaced with append to
data/_observer/ops.jsonl so the historical trace is durable across
analyzer cycles and process restarts.
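
The append path is one line of fs; the contrast with the old
replace-on-write endpoint is the whole fix:

  import { appendFileSync, mkdirSync } from "node:fs";

  function persistOp(op: unknown): void {
    mkdirSync("data/_observer", { recursive: true });
    // append, never replace — each op is one JSONL row
    appendFileSync("data/_observer/ops.jsonl", JSON.stringify(op) + "\n");
  }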

scenario.ts:
- OBSERVER_URL env (default http://localhost:3800)
- postObserverEvent() helper with 2s AbortSignal.timeout so observer
  being down doesn't block scenario flow
- Per-event POST after ctx.results.push(result), carrying staffer_id,
  sig_hash (via imported computeSignature), event_kind + role + city
  + state + count + rescue_attempted / rescue_succeeded + truncated
  output_summary
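
The fire-and-forget helper, roughly:

  const OBSERVER_URL = process.env.OBSERVER_URL ?? "http://localhost:3800";

  function postObserverEvent(event: Record<string, unknown>): void {
    // 2s cap + swallowed rejection: a down observer never blocks the scenario
    fetch(`${OBSERVER_URL}/event`, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(event),
      signal: AbortSignal.timeout(2000),
    }).catch(() => {});
  }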

VERIFIED:
  curl POST /event → {"accepted":true,"ring_size":1}
  curl GET /stats → {"total":1,"successes":1,"by_source":{"scenario":1},
    "recent_scenario_ops":[{...staffer_id,kind,role}]}

Final v3 demo leaderboard (9 runs per staffer, cumulative 3 batches):
  James (local):   92.9% fill, 36.8 cites, score 0.775 — RANK 1
  Maria (full):    81.0% fill, 26.2 cites, score 0.727
  Sam (basic):     61.9% fill, 28.2 cites, score 0.640
  Alex (minimal):  59.5% fill, 32.2 cites, score 0.631
Honest finding: Alex has MORE citations than Sam despite NO T3 and NO
rescue. Playbook inheritance alone is firing hardest when overseer is
absent. The 59.5% fill rate (up from 0% when qwen2.5 was executor)
proves cloud-exec + playbook inheritance is the floor the architecture
delivers.

Local gpt-oss:20b T3 outperforms cloud gpt-oss:120b T3 by 12pt fill
rate on this workload — the cloud overseer pays latency + variance for
no measurable gain. Worth flagging in the next models.json tune.
2026-04-20 23:49:30 -05:00
root
137aed64fb Coherence pass — PRD/PHASES updates, config snapshot wired, unit tests
J flagged the audit: "make sure everything flows coherently, no
pseudocode or unnecessary patches or ignoring any particular part of
what we built." This is that pass.

PRD.md updates:
- Phase 19 refinement block — geo-filter + role-prefilter WIRED with
  citation density numbers (0.32 → 1.38, and 2 → 28 on same scenario).
- Phase 20 rewrite — mistral dropped, qwen3.5 + qwen3 local hot path,
  think:false as the key mechanical finding, kimi-k2.6 upgrade path.
- Phase 21 status block — think plumbing + cloud executor routing
  added after original commit.
- Phase 22 item B (cloud rescue) — pivot sanitizer, rescue verified
  1/3 on stress_01.
- Phase 23 NEW — staffer identity + tool_level + competence-weighted
  retrieval + kb_staffer_report. Auto-discovered worker labels called
  out with real numbers (Rachel Lewis 12× across 4 staffers).
- Phase 24 NEW — Observer/Autotune integration gap DOCUMENTED, not
  fixed. Observer has been idle at 0 ops for 3600+ cycles because
  scenarios hit gateway:3100 directly, bypassing MCP:3700 which the
  observer wraps. This is the honest "we're not using it in these
  tests" signal J surfaced. Fix deferred; gap visible now.

PHASES.md:
- Appended Phases 20-23 as checked, Phase 24 as unchecked gap.
- Updated footer count: 102 unit tests across all layers.
- Latest line updated with 14× citation lift + 46.4pt tool-asymmetry
  finding.

scenario.ts:
- snapshotConfig() was defined but never called. Now fires at every
  scenario start with a stable sha256 hash over the active model set +
  tool_level + cloud flags. config_snapshots.jsonl finally populates,
  which the error_corrections diff path needs to work correctly.
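
A stable-hash sketch; the key-sorting serializer is an assumption about
how "stable" is achieved:

  import { createHash } from "node:crypto";

  // deterministic serialization: object keys sorted at every depth
  function stable(v: unknown): string {
    if (v === null || typeof v !== "object") return JSON.stringify(v);
    if (Array.isArray(v)) return `[${v.map(stable).join(",")}]`;
    return `{${Object.keys(v).sort()
      .map(k => `${JSON.stringify(k)}:${stable((v as Record<string, unknown>)[k])}`)
      .join(",")}}`;
  }

  function snapshotConfig(cfg: object): string {
    return createHash("sha256").update(stable(cfg)).digest("hex");
  }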

kb.test.ts (new): 4 signature invariant tests — stability across
unrelated fields (date, contract, staffer), sensitivity to role/city/
count changes, digest shape. All pass under `bun test`.

service.rs: 6 Rust extractor tests for extract_target_geo +
extract_target_role — basic, missing-state-returns-none, word
boundary (civilian != city), multi-word role, absent role, quoted
value parse. All pass under `cargo test -p vectord --lib extractor_tests`.

Dangling items now honestly documented rather than silently pending:
- Chunking cache (config/models.json SPEC, not wired) — flagged
- Playbook versioning (SPEC, not wired) — flagged
- Observer integration (WIRED but disconnected) — new Phase 24
2026-04-20 23:29:13 -05:00
root
ad0edbe29c Cloud kimi-k2.5 executor for weak tiers + multi-strategy playbook retrieval
Two coupled changes from the 2026 agent-memory research + tool
asymmetry findings.

SCENARIO (weak-tier cloud substitute):
qwen2.5 collapsed to 0/14 across the basic/minimal tool_levels.
Replace with cloud kimi-k2.5 on Ollama Cloud — same family as k2.6
(pro-tier locked today, on J's upgrade path). Plumb cloud flag
through ACTIVE_EXECUTOR_CLOUD / ACTIVE_REVIEWER_CLOUD into
generateContinuable so executor/reviewer can route to cloud when
tool_level requires. think:false supported by Kimi family.

Tool level mapping (revised):
  full     — qwen3.5 local + qwen3 local + cloud gpt-oss:120b T3 + rescue
  local    — qwen3.5 local + qwen3 local + local gpt-oss:20b T3 + rescue
  basic    — kimi-k2.5 cloud + qwen3 local + local T3, no rescue
  minimal  — kimi-k2.5 cloud + qwen3 local, no T3, no rescue.
             Playbook inheritance alone on the decision path.

This is the honest version of J's "minimal tools still works via
inheritance" hypothesis — with the executor no longer broken at the
tokenizer level, we can actually measure whether playbook retrieval
substitutes for missing overseers.

PLAYBOOK_MEMORY (multi-strategy retrieval):
Zep / Mem0 research shows multi-strategy rerank (semantic + keyword +
graph + temporal) outperforms single-strategy cosine. Lakehouse now
has a two-tier scheme:

  1. Exact (role, city, state) match: skip cosine, assign similarity=1.0,
     take up to top_k/2+1 slots. These are identity-class neighbors —
     the strongest possible signal.
  2. Cosine fallback within the same (city, state) but different role:
     fills remaining slots.

Exposed as compute_boost_for_filtered_with_role(target_geo, target_role).
Backwards-compatible: compute_boost_for_filtered forwards with role=None
so existing callers keep their current behavior.
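
A TypeScript rendering of the two-tier selection (the shipped code is
Rust in playbook_memory.rs; passing role=null reproduces the old
cosine-only behavior):

  interface Entry { id: string; role: string; city: string; state: string; cosine: number }

  function twoTierBoost(entries: Entry[], city: string, state: string,
                        role: string | null, topK: number) {
    // tier 1: identity-class neighbors, similarity pinned to 1.0
    const exact = role === null ? [] : entries
      .filter(e => e.role === role && e.city === city && e.state === state)
      .map(e => ({ ...e, similarity: 1.0 }))
      .slice(0, Math.floor(topK / 2) + 1);
    // tier 2: cosine fallback within the same (city, state)
    const rest = entries
      .filter(e => e.city === city && e.state === state &&
                   !exact.some(x => x.id === e.id))
      .sort((a, b) => b.cosine - a.cosine)
      .map(e => ({ ...e, similarity: e.cosine }))
      .slice(0, topK - exact.length);
    return [...exact, ...rest];
  }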

Service.rs wires both: extract_target_geo and extract_target_role pull
from the executor's SQL filter. grab_eq_value is factored out of
extract_target_geo so both lookups share one parser. Diagnostic log
now prints target_role alongside target_geo for every hybrid_search:

  playbook_boost: boosts=88 sources=39 parsed=39 matched=5
    target_geo=Some(("Nashville", "TN")) target_role=Some("Welder")

Verified: Nashville Welder query returns 5/10 boosted workers in
top_k with clean role+geo provenance.

Research sources: atlan.com Agent Memory Frameworks 2026, Mem0 paper
(arxiv 2504.19413), Zep/Graphiti LongMemEval comparison, ossinsight
Agent Memory Race 2026.

kimi-k2.6 on current key returns 403 — pro-tier upgrade required.
kimi-k2.5 is the substitute today; swap to k2.6 by renaming one line
in applyToolLevel once the subscription lands.
2026-04-20 23:20:07 -05:00
root
5e89407939 Phase 23 refinement — per-staffer tool_level variance
Staffer.tool_level now controls which subsystems a specific run gets:

  full     — qwen3.5 + qwen3 + cloud T3 + cloud rescue
  local    — qwen3.5 + qwen3 + local gpt-oss:20b T3 + rescue
  basic    — qwen2.5 + qwen2.5 + local T3, no rescue
  minimal  — qwen2.5 + qwen2.5, NO T3, NO rescue. Playbook
             inheritance only.

applyToolLevel() mutates module-scoped ACTIVE_* slots each run from the
env defaults, so prior staffer's overrides never leak. Hot-path code
reads ACTIVE_EXECUTOR / ACTIVE_REVIEWER / ACTIVE_T3_DISABLED /
ACTIVE_OVERVIEW_CLOUD / ACTIVE_RETRY_ON_FAIL instead of the baked
constants.
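
The reset-then-override shape, abbreviated to three slots (model ids
per level follow the table above):

  let ACTIVE_EXECUTOR = "";
  let ACTIVE_T3_DISABLED = false;
  let ACTIVE_RETRY_ON_FAIL = true;

  function applyToolLevel(level: "full" | "local" | "basic" | "minimal"): void {
    // reset from env defaults first so the prior staffer's level can't leak
    ACTIVE_EXECUTOR = process.env.LH_EXECUTOR ?? "qwen3.5:latest";
    ACTIVE_T3_DISABLED = false;
    ACTIVE_RETRY_ON_FAIL = true;
    if (level === "basic" || level === "minimal") {
      ACTIVE_EXECUTOR = "qwen2.5";      // weak tier
      ACTIVE_RETRY_ON_FAIL = false;     // no rescue
    }
    if (level === "minimal") ACTIVE_T3_DISABLED = true;  // no overseer
  }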

The architectural question this answers: does playbook_memory
inheritance carry enough knowledge to let a weakly-tooled coordinator
still produce usable outcomes? "Minimal" Alex runs qwen2.5 exec + no
reviewer overseer + no cloud rescue. If Alex still fills events at a
reasonable rate, the playbook system is the real knowledge carrier —
the senior stack is nice-to-have, not the sine qua non.

Demo personas mapped:
  Maria (senior, 48mo, full)
  James (mid, 14mo, local)
  Sam (junior, 4mo, basic)
  Alex (trainee, 1mo, minimal)

Same 3 contracts (Nashville downtown, Joliet warehouse, Indianapolis
assembly) across all four → 12 runs. KB + kb_staffer_report.py
leaderboard already wired; competence_score will now reflect real tool
asymmetry instead of LLM sampling variance.
2026-04-20 22:50:05 -05:00
root
6b71c8e9b2 Phase 23 — contract terms + staffer identity + competence-weighted retrieval
Matrix-index the "who handled this" dimension so top staffers become
the training signal and juniors inherit their playbooks automatically
via the boost pipeline. Auto-discovered indicators emerge from
comparing trajectories across staffers on similar contracts — that was
always the architectural point; this wires the last piece.

ContractTerms:
- deadline, budget_total_usd, budget_per_hour_max, local_bonus_per_hour,
  local_bonus_radius_mi, fill_requirement ("paramount" | "preferred")
- Attached to ScenarioSpec, propagated into T3 checkpoint + cloud
  rescue prompts so cloud reasons about trade-offs (pivot within bonus
  radius first; respect per-hour cap; split across cities when
  fill_requirement=paramount).
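
As a type — field-for-field from the bullet above; whether any field is
optional in the shipped code is not shown here:

  interface ContractTerms {
    deadline: string;
    budget_total_usd: number;
    budget_per_hour_max: number;
    local_bonus_per_hour: number;
    local_bonus_radius_mi: number;
    fill_requirement: "paramount" | "preferred";
  }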

Staffer:
- {id, name, tenure_months, role: senior|mid|junior|trainee}
- On ScenarioSpec; logged at scenario start; attached to KB outcome
- Recomputed StafferStats written to data/_kb/staffers.jsonl after
  every run: total_runs, fill_rate, avg_turns, avg_citations,
  rescue_rate, competence_score.
- Competence formula: 0.45*fill_rate + 0.20*turn_efficiency +
  0.20*citation_density + 0.15*rescue_rate. Normalized to 0..1.

findNeighbors now returns weighted_score = cosine × best_staffer_competence
(floored at 0.3 so high-similarity low-competence neighbors still
surface). pathway_recommender prompt shows the top staffer's identity
so cloud knows WHOSE playbook it's synthesizing from.
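
Both formulas from above, as TypeScript:

  interface StafferStats {
    fill_rate: number; turn_efficiency: number;
    citation_density: number; rescue_rate: number;  // each 0..1
  }

  const competenceScore = (s: StafferStats) =>
    0.45 * s.fill_rate + 0.20 * s.turn_efficiency +
    0.20 * s.citation_density + 0.15 * s.rescue_rate;

  // competence floor of 0.3 keeps high-similarity neighbors from
  // low-competence staffers on the board
  const weightedScore = (cosine: number, bestStafferCompetence: number) =>
    cosine * Math.max(bestStafferCompetence, 0.3);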

Demo infrastructure:
- tests/multi-agent/gen_staffer_demo.ts: 4 personas (Maria senior,
  James mid, Sam junior, Alex trainee) × 3 contracts (Nashville Welder,
  Joliet Warehouse, Indianapolis Assembly). 12 scenarios total.
- scripts/run_staffer_demo.sh: runs the 12 sequentially with
  LH_OVERVIEW_CLOUD=1. Post-run calls kb_staffer_report.py.
- scripts/kb_staffer_report.py: leaderboard + cross-staffer worker
  overlap (names endorsed by ≥2 staffers → auto-discovered high-value
  workers). Top vs bottom differential.

gen_scenarios.ts (Phase 22 generator) also now emits contract terms
on 70% of generated specs — future KB batches populate with realistic
constraint patterns instead of bare role+city+count.

Stress scenario from item A intentionally NOT the production test.
Real staffing has constraints; Nashville contract + staffer demo is
the honest test of whether the architecture produces measurable
differential between coordinator skill levels.

Demo batch launched — 12 runs × ~3min each ≈ 40min unattended. Report
emitted after batch.
2026-04-20 22:16:09 -05:00
root
a7fc8e2256 Item B — cloud-rescue retry on event failure
When a scenario event fails (drift abort or other error) and
LH_RETRY_ON_FAIL is on (default when cloud T3 is enabled), ask cloud
for a concrete pivot — new city, role, or count — then re-run the
event with the remediation's fields. Capped at 1 retry per event so a
genuinely-impossible scenario can't burn budget.

requestCloudRemediation(event, result):
- Feeds the same diagnostic bundle T3 checkpoints get (SQL filters,
  row counts, SQL errors, reviewer drift reasons, gap signals).
- Prompt demands structured JSON: {retry, new_city, new_role,
  new_count, rationale}.
- Cloud is instructed to pivot to NEAREST alternate city when
  zero-supply detected, broaden role when uniquely scarce, reduce
  count when clearly unachievable, or return retry=false when no
  pivot seems viable.
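
The contract the prompt demands, as a type:

  interface CloudRemediation {
    retry: boolean;
    new_city?: string;   // "Hammond, IN" shapes happen — see sanitizer below
    new_role?: string;
    new_count?: number;
    rationale: string;
  }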

EventResult additions:
- retry_attempt, retry_remediation (with rationale + cloud_model +
  duration), retry_result (full inner result shape), original_event.
- If retry succeeded, it becomes the primary result and original_event
  preserves what was attempted first. If retry also failed, the
  primary stays the failure and retry is recorded alongside.

Sanitizer on cloud output: model sometimes emits "Hammond, IN" in
new_city with "IN" in a non-existent new_state field, producing
"Hammond, IN, IN" downstream. Split new_city on comma, take first
token as city, extract state if present after the comma. Original
event's state is the fallback.
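
The sanitizer in miniature:

  function sanitizeCity(newCity: string, fallbackState: string) {
    const [city, maybeState] = newCity.split(",").map(s => s.trim());
    // "Hammond, IN" → {city: "Hammond", state: "IN"};
    // no comma-state → original event's state
    const state = /^[A-Z]{2}$/.test(maybeState ?? "") ? maybeState! : fallbackState;
    return { city, state };
  }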

VERIFIED on stress_01.json with LH_OVERVIEW_CLOUD=1:
  Without rescue (item A baseline):  1/5 events ok
  With rescue (item B):              3/5 events ok
Gary IN misplacement: drift → cloud proposed South Bend IN → retry
filled 1/1. Rationale stored in retry_remediation for forensics.

Known limits surfaced (future work):
- City-field mangling failed one rescue before the sanitizer landed;
  next run will use the fix.
- Cloud picks alternate cities without knowing ground-truth supply.
  Flint → Saginaw pivoted but Saginaw also had sparse Welders.
  Future: expose a /vectors/supply-estimate endpoint cloud can consult
  before proposing a pivot.
2026-04-20 22:01:45 -05:00
root
c21b261877 Item A — stress scenario + enriched T3 diagnostic prompt
Proves cloud passthrough works end-to-end AND fixes the diagnostic
quality problem that first run surfaced.

STRESS SCENARIO (tests/multi-agent/scenarios/stress_01.json):
Five genuinely hard events with varied failure modes:
- Gary, IN 5× Electrician: ZERO supply (city not in workers_500k)
- Peoria, IL 8× Safety Coordinator: scarce role, initial pool only 5
- Flint, MI 3× Welder: ZERO supply
- Grand Rapids, MI 4× Tool & Die Maker: scarce but solvable
- Gary, IN 1× Electrician misplacement: repeats event 1's impossibility

FIRST RUN (stress v1) — cloud passthrough works, diagnosis vague:
  T3 checkpoint: "Potential drift flags for upcoming role"
  Lesson: "Before dispatching, query pool status. Update turn counter..."
Generic tactical advice that doesn't address the real problem.
Root cause: T3 prompt only saw outcome summary, not the raw
SQL/pool/drift signals the executor had in its log.

DIAGNOSTIC FIX:
- Added LogEntry[] `sharedLog` parameter to runAgentFill so the caller
  retains the trace even when runAgentFill throws drift-abort.
- EventResult gained `diagnostic_log` field populated on both OK and
  FAIL paths.
- extractDiagnostics() pulls SQL filters, hybrid_search row counts,
  SQL errors, and reviewer drift notes from the log.
- Checkpoint prompt now includes FAILURE FORENSICS block for failed
  events: SQL filters attempted, row counts, errors, drift reasons,
  and an explicit teaching note about zero-supply detection.
- Cross-day lesson prompt flags each event with [ZERO-SUPPLY: pivot
  city needed] tag when drift reasons mention "no match"/"no
  candidates"/"0 rows". PRIORITY clause in the prompt tells the model
  its lesson MUST name alternate cities when that tag appears.
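
A sketch of the log scrape; the LogEntry shape and kind strings beyond
those named above are assumptions:

  interface LogEntry { kind: string; text: string }

  function extractDiagnostics(log: LogEntry[]) {
    const texts = (kind: string) =>
      log.filter(e => e.kind === kind).map(e => e.text);
    return {
      sql_filters: texts("sql_filter"),
      row_counts: texts("hybrid_search")
        .map(t => t.match(/rows=(\d+)/)?.[1]).filter(Boolean),
      sql_errors: texts("sql_error"),
      drift_reasons: texts("drift"),
      // drives the [ZERO-SUPPLY: pivot city needed] tag
      zero_supply: texts("drift").some(t =>
        /no match|no candidates|0 rows/i.test(t)),
    };
  }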

SECOND RUN (stress v2 with enriched prompt) — cloud diagnosis sharp:
  T3 after Flint: risk="Zero candidate supply for Welder in Flint"
                  hint="search Welder×3 in Saginaw, MI (≈30 mi) or
                        expand role to Metal Fabricator"
  T3 after Gary:  risk="Zero supply for Electrician in Gary, IN"
                  hint="Pivot to Chicago, IL (≈40 min); broaden to
                        Electrical Technician within 60 min radius"
  Lesson: specific, per-city, with distances, role-broadening
  fallback, and pre-loading strategy — actionable for item B retry.

Cloud 120b call latencies consistent: 4.8-8.0s per prompt. Cloud
passthrough proven under stress.

Fill outcomes unchanged (1/5 — three impossible events correctly
rejected, plus one propagated JSON-emission edge case in the
retry-pivot reasoning). The knowledge to rescue them now exists in the lesson;
item B wires the retry.
2026-04-20 21:54:29 -05:00
root
330cb90f99 Lift k cap, drop ornamental reason field, scenario generator
ITEM 1 — k CAP + REASON FIELD
The hybrid_search default k was hard-coded to 10. For multi-fill events
(5× expansion, 4× emergency) that's pool=10 → propose 5-of-10, half
the candidates become the answer with no room for rejection. Executor
prompt now instructs k to scale with target_count: k = max(count*5, 20),
cap 80. Default helper bumped 10 → 20.
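
The scaling rule in one line:

  // k = max(count*5, 20), capped at 80
  const kForEvent = (count: number) => Math.min(Math.max(count * 5, 20), 80);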

Fill.reason dropped from required to optional. Nothing downstream ever
consumed it — resolveWorkerIds, sealSale, retrospective all use
candidate_id and name. Models loved to write 100-150 char justifications
per fill; on 4+ fills that blew the JSON budget before the structure
closed. Test 1 run result after this change: FIRST EVER 5/5 on the
Riverfront Steel scenario, 13 total turns across 5 events. The event
that failed last run (emergency 4×Loader with truncated reason-field
continuation) now clears in 2 turns.

Progression:
  mistral baseline:                  0/5
  qwen3.5 + continuation + think:false: 4/5
  qwen3.5 + k=20 + no-reason:        5/5 ✓

ITEM 2 — SCENARIO GENERATOR (NOT YET TESTED E2E)
tests/multi-agent/gen_scenarios.ts emits N deterministic ScenarioSpecs
with varied clients (15 companies), cities (20 Midwest cities known
to exist in workers_500k), role mixes (14 industrial staffing roles,
weighted realistic), and event sequences. Each gets a unique sig_hash
so the KB populates with distinct neighbor signatures.

scripts/run_kb_batch.sh runs all generated specs sequentially against
scenario.ts, logs per-scenario outcomes, and reports KB state at the
end. Each run takes ~2-4min; 20-30 scenarios = 1-2hr unattended.

Next: test the generator+batch on a small N (3-5) to verify KB
populates correctly and pathway recommendations start getting neighbor
signal instead of cold-starts. Then item 3 (Rust re-weighting of
hybrid_search by playbook_memory success).
2026-04-20 20:31:34 -05:00
root
9c1400d738 Phase 22 — Internal Knowledge Library (KB)
Meta-layer over Phase 19 playbook_memory. Phase 19 answers "which
WORKERS worked for this event"; KB answers "which CONFIG worked for
this playbook signature" — model choice, budget hints, pathway notes,
error corrections.

tests/multi-agent/kb.ts:
- computeSignature(): stable sha256 hash of the (kind, role, count,
  city, state) tuple sequence. Same scenario shape → same sig
  (sketched after this list).
- indexRun(): extracts sig, embeds spec digest via sidecar, appends
  outcome record, upserts signature to data/_kb/signatures.jsonl.
- findNeighbors(): cosine-ranks the k most-similar signatures from
  prior runs for a target spec.
- detectErrorCorrections(): scans outcomes for same-sig fail→succeed
  pairs, diffs the model set, logs to error_corrections.jsonl.
- recommendFor(): feeds target digest + k-NN neighbors + recent
  corrections to the overview model, gets back a structured JSON
  recommendation (top_models, budget_hints, pathway_notes), appends
  to pathway_recommendations.jsonl. JSON-shape constrained so the
  executor can inherit it mechanically.
- loadRecommendation(): at scenario start, pulls newest rec matching
  this sig (or nearest).
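
computeSignature() in miniature — only the event tuple enters the hash:

  import { createHash } from "node:crypto";

  interface EventSpec { kind: string; role: string; count: number; city: string; state: string }

  function computeSignature(events: EventSpec[]): string {
    // date/contract/staffer never enter the hash → stable across runs
    const tuples = events.map(e => [e.kind, e.role, e.count, e.city, e.state]);
    return createHash("sha256").update(JSON.stringify(tuples)).digest("hex");
  }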

scenario.ts:
- Reads KB recommendation at startup (alongside prior lessons).
- Injects pathway_notes into guidanceFor() executor context.
- After retrospective, indexes the run + synthesizes next rec.

Cold-start behavior: first run with no history writes a low-confidence
"no prior data" rec so the signal that something was attempted is
captured. Second run gets "low confidence, 0 neighbors" until a third
distinct sig gives the embedder something to compare against — hence
the upcoming scenario generator.

VERIFIED:
- data/_kb/ populated after one scenario run: 1 outcome (sig=4674…,
  4/5 ok, 16 turns total), 1 signature, 2 recs (cold + post-run).
- Recommendation JSON-parsed cleanly from gpt-oss:20b overview model.

PRD Phase 22 added with file layout, cycle description, and the
rationale for file-based MVP → Rust port progression that matches
how Phase 21 primitives shipped.

What's NOT here yet (batched follow-ups per J's request, tested
between each):
- Lift the k=10 hybrid_search cap to adaptive k=max(count*5, 20)
- Scenario generator to bulk-populate KB with varied signatures
- Rust re-weighting: push playbook_memory success signal INTO
  hybrid_search scoring, not just post-hoc boost
2026-04-20 20:27:12 -05:00
root
0c4868c191 qwen3.5 executor + continuation primitive + think:false
Three coupled fixes that together turned the Riverfront Steel scenario
from 0/5 (mistral) to 4/5 (qwen3.5) with T3 flagging real staffing
concerns rather than linter advice.

MODEL SWAP
- Executor: mistral → qwen3.5:latest (9.7B, 262K ctx, thinking).
  mistral's decoder emitted malformed JSON on complex SQL filters
  regardless of prompt; J called it — stop using mistral.
- Reviewer: qwen2.5 → qwen3:latest (40K ctx)
- Applied to scenario.ts, orchestrator.ts, network_proving.ts,
  run_e2e_rated.ts

CONTINUATION PRIMITIVE (agent.ts)
- generateContinuable(): empty-response → geometric backoff retry;
  truncated-JSON → continue from partial as scratchpad; bounded by
  budget cap + max_continuations. No more "bump max_tokens until it
  stops truncating" tourniquet.
- generateTreeSplit(): map-reduce for oversized input corpora with
  running scratchpad digest, reduce pass for final synthesis.
- Empty text no longer throws — it's a signal to continuable that
  thinking ate the budget.
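
Control-flow sketch (generate() and the truncation probe are stubbed;
the token-budget cap is elided):

  declare function generate(prompt: string): Promise<string>;
  declare function looksTruncatedJson(text: string): boolean;
  const sleep = (ms: number) => new Promise<void>(r => setTimeout(r, ms));

  async function generateContinuable(prompt: string, maxContinuations = 3): Promise<string> {
    let out = "";
    let backoff = 500;
    for (let i = 0; i <= maxContinuations; i++) {
      const piece = await generate(
        out ? `${prompt}\n\nPartial output so far, continue it:\n${out}` : prompt
      );
      if (!piece) {                          // empty = thinking ate the budget
        await sleep(backoff); backoff *= 2;  // geometric backoff, then retry
        continue;
      }
      out += piece;
      if (!looksTruncatedJson(out)) return out;  // structure closed — done
    }
    return out;  // bounded; caller decides what to do with a still-open JSON
  }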

think:false FOR HOT PATH
- qwen3.5 burned ~650 tokens of hidden thinking for trivial JSON
  emission. For executor/reviewer/draft: think:false. For T3/T4/T5
  overseers: thinking stays on (that's the point).
- Sidecar generate endpoint accepts `think` bool, passes through to
  Ollama's /api/generate.

VERIFIED OUTCOMES
Riverfront Steel 2026-04-21, qwen3.5+continuable+think:false:
  08:00 baseline_fill  3/3  4 turns
  10:30 recurring      2/2  3 turns (1 playbook citation)
  12:15 expansion      0/5  drift-aborted (5-fill orchestration
                            problem, separate work)
  14:00 emergency      4/4  3 turns (1 citation)
  15:45 misplacement   1/1  3 turns
  → T3 caught Patrick Ross double-booking across events
  → T3 flagged forklift cert drift on the event that failed
  → Cross-day lesson proposed "maintain buffer of ≥3 emergency
    candidates, pre-fetch certs for expansion, booking system
    cross-check" — real staffing advice, not generic linter output

PRD PHASE 21 rewritten to reflect the actual primitive shape (two-
call map-reduce with scratchpad glue) instead of the tourniquet
approach originally documented. Rust port queued for next sprint.

scripts/ab_t3_test.sh: A/B harness that chains B→C→D runs and emits
tests/multi-agent/playbooks/ab_scorecard.json.
2026-04-20 20:19:02 -05:00
root
6e7ca1830e Phase 21 foundation — context stability + chunking pipeline
PRD: add Phase 20 (model matrix, wired) and Phase 21 (context stability,
partial). Phase 21 exists because LLM Team hit this exact wall — running
multi-model ranking on large context silently truncated, rankings
degraded, no pipeline caught it. The stable answer: every agent call
goes through a budget check against the model's declared context_window
minus safety_margin, with a declared overflow_policy when the check
fails.

config/models.json:
- context_window + context_budget per tier
- overflow_policies block: summarize_oldest_tool_results_via_t3,
  chunk_lessons_via_cosine_topk, two_pass_map_reduce,
  escalate_to_kimi_k2_1t_or_split_decision
- chunking_cache spec (data/_chunk_cache/, corpus-hash keyed)

agent.ts:
- estimateTokens() chars/4 biased safe ~15%
- CONTEXT_WINDOWS table (fallback; prod reads models.json)
- assertContextBudget() — throws on overflow with exact numbers, can
  bypass with bypass_budget:true for callers with their own policy
- Wired into generate() and generateCloud() so EVERY call is checked
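
A sketch of the check (window numbers are illustrative; prod reads them
from models.json):

  const CONTEXT_WINDOWS: Record<string, number> = {
    // fallback table only — illustrative numbers
    "mistral": 32_768,
    "gpt-oss:120b": 131_072,
  };

  // chars/4, biased ~15% toward overestimating
  const estimateTokens = (text: string) => Math.ceil((text.length / 4) * 1.15);

  function assertContextBudget(model: string, prompt: string,
                               opts?: { bypass_budget?: boolean }): void {
    if (opts?.bypass_budget) return;  // caller brings its own overflow policy
    const budget = CONTEXT_WINDOWS[model] ?? 8_192;
    const est = estimateTokens(prompt);
    if (est > budget) throw new Error(
      `context overflow for ${model}: ~${est} tokens > ${budget} (${est - budget} over)`);
  }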

scenario.ts:
- T3 lesson archive to data/_playbook_lessons/*.json (the old
  /vectors/playbook_memory/seed path was silently failing with HTTP 400
  because it requires 'fill: Role xN in City, ST' operation shape)
- loadPriorLessons() at scenario start — filters by city/state match,
  date-sorted, takes top-3
- prior_lessons.json archived per-run (honest signal for A/B)
- guidanceFor() injects up to 2 prior lessons (≤500 chars each) into
  the executor's per-event context
- Retrospective shows explicit "Prior lessons loaded: N" line

Verified: the budget check rejects a 150K-char prompt for mistral
(7532 tokens over) and accepts it for gpt-oss:120b with 90K headroom.
The enforcement is
in-band on every call now, not an afterthought.

Full chunking service (Rust) remains deferred to the sprint this feeds:
crates/aibridge/src/budget.rs + chunk.rs + storaged/chunk_cache.rs
2026-04-20 19:34:44 -05:00
root
03d723e7e6 Model matrix — 5 tiers, local hard workers + cloud overseers
config/models.json is the authoritative catalog. Hot path (T1/T2) stays
local; cloud is consulted only for overview (T3), strategic (T4), and
gatekeeper (T5) calls. J named qwen3.5 + newer models (minimax-m2.7,
glm-5, qwen3-next) specifically — all mapped with real reachable IDs
verified against ollama.com/api/tags.

Tier shape:
- t1_hot     mistral + qwen2.5 local       — 50-200 calls/scenario
- t2_review  qwen2.5 + qwen3 local         — 5-14 calls/event
- t3_overview gpt-oss:120b cloud           — 1-3 calls/scenario
- t4_strategic qwen3.5:397b + glm-4.7      — 1-10 calls/day
- t5_gatekeeper kimi-k2-thinking           — 1-5 calls/day, audit-logged

Rate budgets are declared in-config — Ollama Cloud paid tier is generous
but we cap overview/strategic/gatekeeper so no single rogue scenario can
blow the day's quota.

Experimental rotation list wired but disabled by default. When enabled,
T4 randomly routes 10% of calls to a rotating minimax/GLM/qwen-next/
deepseek/nemotron/cogito/mistral-large candidate, logs comparisons, and
auto-promotes after 3 rotations of wins.

Playbook versioning SPEC embedded under `playbook_versioning` key: every
seed gets version + parent_id + retired_at + architecture_snapshot, so
when a schema migration breaks a playbook we can pinpoint which change
retired it. Implementation flagged for next sprint (touches gateway +
catalogd + mcp-server) — not wired here.

- scenario.ts now loads config/models.json at init, env vars still override
- mcp-server exposes /models/matrix read-only so UI can render it
2026-04-20 19:24:41 -05:00
root
e4ae5b646e T3 overview tier — mid-day checkpoints + cross-day lesson
Hot path (T1/T2) stays mistral + qwen2.5. The new T3 tier runs a
thinking model SPARINGLY — after every misplacement, every N-th event
(default N=3), and once post-scenario for the cross-day lesson.

- agent.ts: generateCloud() for Ollama Cloud (gpt-oss:120b etc). Uses
  the same /api/generate shape; thinking field is discarded.
- scenario.ts: runOverviewCheckpoint + runCrossDayLesson. Outputs land
  in checkpoints.jsonl and lesson.md. Lesson also seeds playbook_memory
  under operation "cross-day-lesson-{date}" — future runs pick it up
  through the existing similarity boost.
- Env knobs: LH_OVERVIEW_CLOUD=1 routes T3 to cloud, LH_OVERVIEW_MODEL
  overrides (default gpt-oss:20b local, gpt-oss:120b cloud),
  LH_T3_CHECKPOINT_EVERY controls cadence, LH_T3_DISABLE=1 turns it off.

Why this shape: prior feedback_phase19_seed_text.md warned that verbose
seeds dilute the embedding and silently kill the boost. T3's rich prose
goes to lesson.md; the embedded "approach" + "context" stay terse.

Verified end-to-end: local 20b checkpoint 10.9s, lesson 4.0s; cloud
120b lesson 3.7s. Cloud output is both faster AND more specific than
local (sequenced, tactical, logging advice included).
2026-04-20 19:21:45 -05:00
root
f8e8d25b5f Unblock complex scenarios: JSON tolerance + optional question + mistral exec
parseAction now strips stray `)` before `}` and trailing commas —
qwen2.5 emits those regularly on tool_call outputs; soft-fix beats
retry-loops. hybrid_search no longer hard-requires `question`; defaults
to "qualified available workers" when the model drops it (mistral's
most common failure mode on complex events).
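
Both tolerances in miniature:

  // soft-fix the two recurring emission bugs before JSON.parse
  function tolerantParse(raw: string): unknown {
    const cleaned = raw
      .replace(/\)\s*}/g, "}")         // stray ')' before '}'
      .replace(/,\s*([}\]])/g, "$1");  // trailing commas
    return JSON.parse(cleaned);
  }

  // optional question with the documented default
  const questionOf = (args: { question?: string }) =>
    args.question ?? "qualified available workers";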

Kept original TOOL_CATALOG shape (args examples only, not full
action envelopes). The verbose few-shot version from the prior
iteration confused mistral into wrapping propose_done as tool_call.

Scenario V7 result: expansion (5 Forklift Ops) and emergency
(4 Loaders) — previously-failing complex events — now seal reliably.
Pool sizes: 687 and 380 from 500K corpus. Patterns endpoint produces
real operator-actionable signals:
  expansion: "recurring certifications: Forklift (40%), OSHA-10 (40%)
             · recurring skills: mill (40%) · archetype mostly: leader
             · reliability median 0.83"
Baseline + recurring are now flaky (inverted trade-off, pure
model-reliability variance).
2026-04-20 15:28:30 -05:00
root
1274ab2cb3 Scenario harness: Path 1+2 integration + schema hardening
Upgrades to tests/multi-agent/scenario.ts to exercise the full Path 1+2
feature set on a real warehouse-client week (5 events on one client):

- Hard SCHEMA ENFORCEMENT block in every event's guidance. Prior runs
  had mistral read narrative words ("shift", "recurring", "expansion")
  as SQL column names. Schema is now locked explicitly with valid
  columns listed and CAST guidance for availability + reliability.
- playbook_memory_k bumped 10 → 100 to match server default.
- Canonical short seed text (operation + "{kind} fill via hybrid
  search" + "{role} fill in {city}, {state}"). Verbose LLM rationales
  dilute embeddings and silently kill boost (Pass 1 finding).
- /vectors/playbook_memory/mark_failed fires automatically on
  misplacement events — records the no-shower's failure so future
  searches for same city+role dampen their boost.
- /vectors/playbook_memory/patterns call per event — surfaces what the
  meta-index discovered (recurring certs/skills/archetype/reliability)
  for that query into the dispatch log and retrospective.
- Retrospective now includes a workers-touched audit table (every
  worker who reached a decision, with outcome column) and a
  discovered-patterns-evolution section across events.

Honest limitations this surfaced in the real run:
- mistral's executor prompt-adherence degrades on high-count events
  (5+ fills) and scenario-specific language (emergency/misplacement).
  3 of 5 events aborted via drift guard. Baseline + recurring sealed
  cleanly with real fills + SMS + emails + seeded playbooks.
- worker_id resolution returns "undefined" for some names when name
  matching is ambiguous in workers_500k (multiple workers with same
  name in same city).
2026-04-20 15:09:14 -05:00
root
25b7e6c3a7 Phase 19 wiring + Path 1/2 work + chain integrity fixes
Backend:
- crates/vectord/src/playbook_memory.rs (new): Phase 19 in-memory boost
  store with seed/rebuild/snapshot, plus temporal decay (e^(-age/30)
  per playbook — sketched after this list), persist_to_sql endpoint
  backing successful_playbooks_live,
  and discover_patterns endpoint for meta-index pattern aggregation
  (recurring certs/skills/archetype/reliability across similar past fills).
- DEFAULT_TOP_K_PLAYBOOKS bumped 5 → 25; old default silently missed
  most boosts when memory had > 25 entries.
- service.rs: new routes /vectors/playbook_memory/{seed,rebuild,stats,
  persist_sql,patterns}.
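
The decay weight from the first bullet, rendered in TypeScript (the
shipped version is Rust):

  // e^(-age_days/30): a 30-day-old playbook keeps ~37% of its boost,
  // a 90-day-old one ~5%
  const decayWeight = (ageDays: number) => Math.exp(-ageDays / 30);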

Bun staffing co-pilot (mcp-server/):
- /search, /match, /verify, /proof, /simulation/run, MCP tools all
  forward use_playbook_memory:true and playbook_memory_k:25 to the
  hybrid endpoint. Boost was previously dark across the entire app.
- /log no longer POSTs to /ingest/file — that endpoint REPLACES the
  dataset's object list, so single-row CSV writes were wiping all prior
  rows in successful_playbooks (sp_rows went 33→1 in one /log call).
  /log now seeds playbook_memory with canonical short text and calls
  /persist_sql to keep successful_playbooks_live in sync.
- /simulation/run cumulative end-of-week CSV write removed for the same
  reason. Per-day per-contract /seed (added in this session) is the
  accumulating feedback path now.
- search.html addWorkerInsight renders a green "Endorsed · N playbooks"
  chip with playbook citations when boost > 0.

Internal Dioxus UI (crates/ui/):
- Dashboard phase list rewritten through Phase 19 (was stuck at "Phase
  16: File Watcher" / "Phase 17: DB Connector" — both wrong).
- Removed fabricated "27ms" stat label.
- Ask tab examples + SQL default replaced with real staffing prompts
  against candidates/clients/job_orders (was referencing nonexistent
  employees/products/events).
- New Playbook tab exposes /vectors/playbook_memory/{stats,rebuild} and
  side-by-side hybrid search (boost OFF vs ON) with citations.

Tests (tests/multi-agent/):
- run_e2e_rated.ts: parallel two-agent (mistral + qwen2.5) build phase
  + verifier rating (geo, auth, persist, boost, speed → /10).
- network_proving.ts: continuous build → verify → repeat with
  staffing-recruiter profile hot-swap; geo-discrimination check.
- chain_of_custody.ts: single recruiter operation traced through every
  layer (Bun /search, direct /vectors/hybrid parity, /log, SQL,
  playbook_memory growth, profile activation, post-op boost lift).
2026-04-20 06:21:13 -05:00
root
19bdfab227 Phase 2: DataFusion query engine over Parquet
- queryd: SessionContext with custom URL scheme to avoid path doubling with LocalFileSystem
- queryd: ListingTable registration from catalog ObjectRefs with schema inference
- queryd: POST /query/sql returns JSON {columns, rows, row_count}
- queryd→catalogd wiring: reads all datasets, registers as named tables
- gateway: wires QueryEngine with shared store + registry
- e2e verified: SELECT *, WHERE/ORDER BY, COUNT/AVG all correct

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 05:48:20 -05:00