lakehouse

Author	SHA1	Message	Date
profit	156dae6732	Auditor self-test branch: real-world pipelines + cohesion Phase C + KB index (PR #8 ) Bundles 12 commits validating the auditor + scrum_master architecture end-to-end: - enrich_prd_pipeline / hard_task_escalation / scrum_master_pipeline stress tests - Tree-split + scrum_reviews.jsonl + kb_query surfacing - Verdict → audit_lessons feedback loop (closed) - kb_index aggregator with confidence-based severity policy - 9-run + 5-run empirical tests proved the predictive-compounding property - Level 1 correction: temp=0 cloud inference for deterministic per-claim verdicts - audit_one.ts dry-run CLI - Fixes: static quoted-string guard, empirical-claim classification, symbol-resolver gate, repo-file size cap See PR #8 for run-by-run commit history.	2026-04-23 03:28:32 +00:00
profit	6d7b251607	Merge pull request 'Phase 45 slice 3: doc_drift check + resolve endpoints' (#5 ) from phase/45-slice-3 into main	2026-04-22 19:14:11 +00:00
profit	8bacd43465	Phase 45 slice 3: doc_drift check + resolve endpoints Some checks failed lakehouse/auditor cloud: claim not backed — "Previously the hybrid fixture honestly reported layer 5 as 404/unimplemented. With this PR it flips " Closes the last open loop of Phase 45. Previously, playbooks could carry doc_refs (slice 1) and the context7 bridge could report drift (slice 2) — but nothing tied them together. An operator had no way to say "check this playbook against its doc sources and flag it if the docs moved." This slice wires that. Ships: - crates/vectord/src/doc_drift.rs — thin context7 bridge client. No cache (bridge has its own 5-min TTL). No retry (transient failure = Unknown outcome, caller decides). - PlaybookMemory::flag_doc_drift(id) — stamps doc_drift_flagged_at idempotently. Once flagged, compute_boost_for_filtered_with_role excludes the entry from both the non-geo and geo-indexed boost paths until resolved. - PlaybookMemory::resolve_doc_drift(id) — human re-admission. Stamps doc_drift_reviewed_at which clears the boost exclusion. - PlaybookMemory::get_entry(id) — new read-only accessor the handler uses to read doc_refs without exposing the state lock. - POST /vectors/playbook_memory/doc_drift/check/{id} - POST /vectors/playbook_memory/doc_drift/resolve/{id} Design call: Unknown outcomes from the bridge (bridge down, tool not in context7, no snippet_hash recorded) are NEVER enough to flag. Only a positive drifted=true from the bridge flips the flag. A down bridge doesn't silently drift-flag every playbook. Tests (5 new, in upsert_tests mod): - flag_doc_drift_stamps_timestamp_and_persists - flag_doc_drift_is_idempotent_on_already_flagged - resolve_doc_drift_clears_flag_admission_gate - boost_excludes_flagged_unreviewed_entries - boost_re_admits_resolved_entries 14/14 upsert tests pass (9 pre-existing + 5 new). Live end-to-end — hybrid fixture on auditor/scaffold (merged to main at b6d69b2) now shows: overall: PASS shipped: [38, 40, 45.1, 45.2, 45.3] placeholder: [—] ✓ Phase 38 /v1/chat 4039ms ✓ Phase 40 Langfuse trace 11ms ✓ Phase 45.1 seed + doc_refs 748ms ✓ Phase 45.2 bridge diff 563ms ✓ Phase 45.3 drift-check endpoint 116ms ← was a 404 before this First time the fixture reports overall=PASS with zero placeholder layers. The honest "not built" signal on layer 5 is now honestly "built and working." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 14:12:57 -05:00
profit	e57ab8ad01	Merge pull request 'ops: systemd units for auditor + context7 bridge' (#4 ) from ops/auditor-systemd-units into main	2026-04-22 09:17:09 +00:00
profit	c85c55006d	ops: systemd units for auditor + context7 bridge Some checks failed lakehouse/auditor 3 warnings — see review Promotes two previously manual-start Bun services to systemd so they survive restarts + run continuously. - ops/systemd/lakehouse-auditor.service — polls Gitea every 90s, runs 4 audit checks per PR head SHA, posts commit status + review comment. Runs as root to match existing lakehouse-* service conventions on this host; can read /home/profit/.git-credentials (0600 profit:profit). - ops/systemd/lakehouse-context7-bridge.service — HTTP wrapper on :3900 for Phase 45 doc-drift detection. Decoupled from gateway; runs independently. - ops/systemd/install.sh — idempotent installer (copy → daemon-reload → enable --now). Prints post-install active/enabled status. - ops/systemd/README.md — run/stop/logs/pause docs. Pause control stays per-service (bot.paused / auditor.paused files at repo root). Not wired to branch protection yet — the auditor's commit status is currently advisory, not enforcing. Flip via Gitea branch_protections API when confident.	2026-04-22 04:15:58 -05:00
profit	b6d69b2e82	Merge pull request 'Auditor: PR-claim hard-block reviewer (scaffold)' (#1 ) from auditor/scaffold into main	2026-04-22 09:13:34 +00:00
profit	b82caa9971	Merge pull request 'Fix: UPDATE branch of upsert_entry dropped doc_refs + valid_until' (#3 ) from fix/upsert-outcome-update-merge into main	2026-04-22 09:11:15 +00:00
profit	1270e167fe	Post-merge: update test pattern matches for struct-like UpsertOutcome After merging main (with the UpsertOutcome struct-like enum shape from PR #2), the 4 new upsert tests needed pattern-match updates: UpsertOutcome::Added(_) → UpsertOutcome::Added { .. } 9/9 upsert tests pass.	2026-04-22 04:11:13 -05:00
profit	4dca2a6705	Merge branch 'main' of https://git.agentview.dev/profit/lakehouse into fix/upsert-outcome-update-merge	2026-04-22 04:10:27 -05:00
profit	b667fdeff1	Fix: UpsertOutcome newtype serde panic (silent since Phase 26) Auditor found this via hybrid fixture 2026-04-22. Blocks the serde-tag-newtype shape by converting to struct-like variants. See PR #2 body for full context. Manual merge: auditor commit status was failure due to 1 false-positive inference finding on a commit-message reference; underlying fix is verified (curl against live gateway confirmed all 3 upsert paths return valid JSON). Proceeding per human review.	2026-04-22 09:10:07 +00:00
profit	320009ddf4	Fix: UPDATE branch of upsert_entry dropped doc_refs + valid_until All checks were successful lakehouse/auditor all checks passed (3 findings, all info) The auditor's hybrid fixture (branch auditor/scaffold) surfaced this on 2026-04-22. A re-seed of the same (operation, day) pair with new endorsed_names merged the names but silently discarded the incoming doc_refs and valid_until fields. schema_fingerprint was partially handled (set-if-Some) but doc_refs and valid_until weren't touched. Root cause: the UPDATE arm of upsert_entry at playbook_memory.rs:609 only covered: - endorsed_names (union-merge) - timestamp - embedding (if Some) - schema_fingerprint (if Some) Fix: - valid_until — refresh if caller provides one - doc_refs — merge by tool (case-insensitive). Same-tool new entry supersedes older one; different-tool refs are appended. Empty incoming doc_refs preserves existing (don't wipe on partial seed). 4 new regression tests under upsert_tests: - update_merges_doc_refs_with_existing_ones - update_same_tool_supersedes_older_version - update_preserves_existing_doc_refs_when_new_entry_has_none - update_refreshes_valid_until_when_caller_provides_one Test result: 9/9 upsert tests pass (4 new + 5 pre-existing). Branch basis note: this branch is off main, so the UpsertOutcome enum here still has the newtype variants Added(String) / Noop(String). PR #2 (fix/upsert-outcome-serde) changes that enum to struct-like. When PR #2 merges first this branch needs a trivial rebase; the UPDATE arm logic is untouched by that change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 04:06:54 -05:00
profit	c33c1bcbc5	Auditor: poller + live end-to-end proof All checks were successful lakehouse/auditor all checks passed (4 findings, all info) auditor/index.ts (task #9) — the top-level poller. 90s interval, dedupes by head SHA via data/_auditor/state.json, supports --once for CLI testing. Env gates: LH_AUDITOR_RUN_DYNAMIC=1 to include the hybrid fixture (default off; it mutates live state), LH_AUDITOR_SKIP_INFERENCE=1 for fast runs without cloud calls. Single-shot run proof (task #10): cycle 1: 2 open PRs audit PR #2 f0a3ed68 "Fix: UpsertOutcome newtype serde panic" verdict=block, 9 findings (1 block, 5 warn, 3 info) audit PR #1 039ed324 "Auditor: PR-claim hard-block reviewer" verdict=approve, 4 findings (0 block, 0 warn, 4 info) audits_run=2, state persisted Commit statuses and issue comments posted live to Gitea. PR #2 is currently hard-blocked (lakehouse/auditor commit status = failure); PR #1 has a passing status. State survives restart — next cycle skips already-audited SHAs. Both PRs now have the audit comment with per-check breakdown. Operator can read the comment, fix blocking findings (or defend them with a reply), push a new commit; auditor re-audits on new SHA, verdict updates, merge gate responds accordingly. The full loop J asked for is closed: 1. static check caught own Phase 45 placeholder (b933334) 2. hybrid fixture caught UpsertOutcome serde panic (9c893fb) 3. LLM-Team-style codereview caught ternary bug (5bbcaf4) 4. auditor poller now runs on every open PR, block/approve with evidence, re-audits on new SHAs Tasks done: 1-11 (except 12, a scoped follow-up fix for UPDATE branch dropping doc_refs). The auditor is running, catching real bugs in its own build, and gating merges.	2026-04-22 04:02:36 -05:00
profit	039ed32411	Auditor: KB query check + verdict orchestrator + Gitea poster All checks were successful lakehouse/auditor all checks passed (4 findings, all info) auditor/checks/kb_query.ts (task #7) — reads data/_kb/outcomes.jsonl, error_corrections.jsonl, data/_observer/ops.jsonl, data/_bot/cycles/. Cheap/offline: no model calls, tail-reads only. Fail-rate >30% in recent scenario outcomes → warn; otherwise info. Live-proven: 1 finding emitted against current KB state (69 scenario runs, 27.7% fail rate — below warn threshold). auditor/audit.ts (task #8) — orchestrator. Runs static + dynamic + inference + kb_query in parallel, calls assembleVerdict, persists to data/_auditor/verdicts/, posts to Gitea (commit status + issue comment). AuditOptions supports skip_dynamic/skip_inference/dry_run for iteration. auditor/gitea.ts — added postIssueComment (author can comment on own PR, unlike postReview which self-review-blocks). static.ts — skip BLOCK_PATTERNS scan on auditor/checks/ and auditor/fixtures/* because those files legitimately contain the patterns as regex/string-literal data. WARN/INFO patterns (TODO comments, hardcoded placeholders) still run. Live-proven: dry-run audit of PR #1 after fix went from 13 block findings to 0 from static; 11 warn from inference still fire on real overreach claims. Dry-run audit against PR #1, skip_dynamic=true: verdict: block (BEFORE the static fix) verdict: request_changes (AFTER — inference correctly flagged "tasks 1-9 complete" as not backed; 0 false-positive blocks from static self-match) 42.5s total across checks (mostly cloud inference: 36s) 26 claims, 39KB diff Tasks 5 + 6 + 7 + 8 complete. Remaining: #9 (poller) + #10 (end-to-end proof) + #12 (upsert UPDATE merge fix).	2026-04-22 03:59:38 -05:00
profit	efc7b5ac44	Auditor: dynamic + inference checks auditor/checks/dynamic.ts — wraps runHybridFixture, maps layer results to Findings. Placeholder-style errors (404/unimplemented/ slice N) → info; other failures → warn. Always emits a summary finding with real numbers (shipped/placeholder phase counts + per- layer latency). Live-tested against current stack: 2 info findings, 0 warnings — all shipped layers actually work. auditor/checks/inference.ts — wraps the run_codereview reviewer pattern from llm_team_ui.py, adapted for claim-vs-diff verification. Calls /v1/chat provider=ollama_cloud model=gpt-oss:120b. Requests strict JSON response with claim_verdicts[] and unflagged_gaps[]. A strong claim marked "not backed" by cloud → BLOCK severity; moderate → warn; weak → info. Cloud-unreachable or unparseable-output → info (never blocks on the reviewer being down). Live-tested against PR #1 (this PR, 20 claims, 39KB diff): - 36.9s round-trip - 7 block + 23 warn + 2 info findings - gpt-oss:120b correctly flagged "Fully-functional auditor (tasks 1-9 complete)" as not-backed (only 6/10 tasks done at that commit) — accurate catch - Some false positives from the original 15KB truncation threshold (cloud missed gitea.ts, flagged "no Gitea client present") - Bumped MAX_DIFF_CHARS from 15000 to 40000 to fit the full PR diff in context; reviewer precision improves accordingly Tasks 5 + 6 completed. Remaining: #7 (KB query), #8 (verdict + Gitea poster), #9 (poller), #10 (end-to-end proof), #12 (upsert UPDATE-drops-doc_refs).	2026-04-22 03:54:18 -05:00
profit	c5da680add	Fixture: unique-per-run nonce eliminates state-pollution false positive After the serde fix (PR #2, fix/upsert-outcome-serde) landed on main, re-running this fixture STILL reported "doc_refs field is empty" — but with a different root cause than the panic. Root cause: pre-fix runs panicked on response serialization but had already added entries to state (panic happened between upsert_entry returning and the handler's serde_json::json! of the response). So state.json was polluted with __auditor_test_worker__ entries from those runs, WITHOUT doc_refs (doc_refs wasn't even wired at the time those state rows were written). The fixture's `find(endorsed_names.includes(TEST_WORKER_NAME))` was picking the oldest polluted entry, not the fresh one. Compounding: discovered a secondary bug while investigating — upsert_entry's UPDATE branch only merges endorsed_names. doc_refs, schema_fingerprint, valid_until on an UPDATE are silently dropped. Filed as task #12, separate PR to follow. Fix in this fixture: use a nonce suffix on both TEST_WORKER_NAME and TEST_OPERATION so every run is guaranteed to hit the ADD path in upsert_entry, sidestepping the UPDATE bug AND eliminating state pollution entirely. Live re-run after this edit: ✓ Phase 38 /v1/chat 449ms, 42 tokens ✓ Phase 40 Langfuse trace 20ms ✓ Phase 45.1 seed + doc_refs 239ms, doc_refs.length=1 persisted ✓ Phase 45.2 bridge diff 2ms, drifted=true ✗ Phase 45.3 drift-check HONEST 404 (endpoint not built) shipped_phases: [38, 40, 45.1, 45.2] (was [38, 40, 45.2]) placeholder: [45.3] (was [45.1, 45.3]) One fewer placeholder — exactly because the serde fix merged on fix/upsert-outcome-serde and the fixture now cleanly exercises the path. The loop is: fixture finds bug → PR fixes bug → fixture re-run confirms fix → one fewer placeholder.	2026-04-22 03:50:46 -05:00
profit	f0a3ed6832	Fix: UpsertOutcome newtype variants panicked serde from Phase 26 Some checks failed lakehouse/auditor 1 blocking issue: cloud: claim not backed — "Verified live after gateway restart:" playbook_memory.rs:257 — UpsertOutcome had two newtype variants carrying a bare String: Added(String) Noop(String) under #[serde(tag = "mode")]. serde cannot tag newtype variants of primitive types, so every serialization threw: "cannot serialize tagged newtype variant UpsertOutcome::Added containing a string" This caused gateway /vectors/playbook_memory/seed to panic the tokio worker on EVERY call that reached Added or Noop, returning an empty socket close to the client. The bug was silent from commit 640db8c (Phase 26, 2026-04-21) until 2026-04-22 when the auditor's hybrid fixture (auditor/fixtures/hybrid_38_40_45.ts on the auditor/scaffold branch) exercised the endpoint live and gateway logs showed the panic. Fix — convert both newtype variants to struct-like: Added { playbook_id: String } Noop { playbook_id: String } Updated all 7 construction + pattern-match sites. Updated rustdoc on the enum explaining why the shape is what it is. JSON wire format is now uniform across all three variants: {"mode":"added","playbook_id":"pb-..."} {"mode":"updated","playbook_id":"pb-...","merged_names":[...]} {"mode":"noop","playbook_id":"pb-..."} Verified live after gateway restart: curl /seed new payload → mode=added, playbook 860231f5 curl /seed new payload + doc_refs → mode=added, playbook 11d348d9 curl /seed identical re-submit → mode=noop, same id 860231f5, entries_after unchanged (Mem0 contract intact) Tests: 51/51 vectord lib tests green. Release build clean. This is a follow-up bug fix landed in its own branch (fix/upsert-outcome-serde) rather than commingled with other work. The auditor's hybrid fixture on the auditor/scaffold branch will now light up layer 3 (phase45_seed_with_doc_refs) as a pass once this merges — previously it failed here with an empty socket close.	2026-04-22 03:48:05 -05:00
profit	5bbcaf4c33	Fix: layer-2 Langfuse filter used meaningless ternary Caught by running a side-test through LLM Team's run_codereview flow (gpt-oss:120b reviewer) against this fixture, 2026-04-22. BEFORE: const ourStart = Date.parse( l1.evidence.match(/tokens=/) ? result.ran_at : result.ran_at ); // Both branches return result.ran_at — the ternary is meaningless. // result.ran_at is the fixture start time, NOT the moment we fired // /v1/chat. Any trace created between fixture-start and chat-fetch // would false-negative. AFTER: const chat_request_sent_ms = Date.now(); // captured before layer 1 // ... const recent = items.filter(t => Date.parse(t.timestamp) >= chat_request_sent_ms ); Re-ran the fixture against the live stack — layers 1,2,4 still pass (no regression); layer 2 trace matched at age=2494ms which is within the chat-to-trace propagation window. Layers 3,5 still fail for the original unrelated reasons (UpsertOutcome serde panic + Phase 45 slice 3 endpoint not built). First concrete act-on-finding from a code-checker run. The process works.	2026-04-22 03:44:36 -05:00
profit	9c893fbb8c	Auditor: hybrid fixture — found a pre-existing bug on first live run auditor/fixtures/hybrid_38_40_45.ts — the never-before-run hybrid test. Exercises Phase 38 /v1/chat → Phase 40 Langfuse → Phase 45 slice 1 seed+doc_refs → Phase 45 slice 2 bridge drift → (expected- fail) Phase 45 slice 3 drift-check endpoint. auditor/fixtures/cli.ts — standalone runner. Human-readable summary to stderr, machine-readable JSON to stdout, exit code 0/1/2 for pass / fail / partial_pass. Live run results — honest measurements, not hand-waved: ✓ Phase 38 /v1/chat returns 9 visible tokens, 6.7s latency ("docker run is a common Docker command.") ✓ Phase 40 Langfuse trace 18a8a0b7 landed in 2.5s ✗ Phase 45.1 seed endpoint returns empty reply — discovered a PRE-EXISTING BUG unrelated to doc_refs: playbook_memory.rs:257 UpsertOutcome has newtype variants Added(String) and Noop(String) under #[serde(tag="mode")] — serde panics on serialize. panicked at crates/vectord/src/service.rs:2323: Error("cannot serialize tagged newtype variant UpsertOutcome::Added containing a string") Reproduced: curl /seed with AND without doc_refs both get "Empty reply from server" (socket closed mid-response). This bug has existed since Phase 26 shipped (commit 640db8c, 2026-04-21). No test or caller in the repo exercised the response path live against the gateway until this fixture did. ✓ Phase 45.2 context7 bridge confirms drift: current hash 475a0396ca436bba vs our stale input, upstream last updated 2026-04-20 ✗ Phase 45.3 /doc_drift/check endpoint — correctly unreachable because layer 3 blocked us from getting a playbook_id; endpoint still doesn't exist independent of that Real numbers published: per-layer latency_ms, token counts, trace_age_ms, library_id, current_hash_length. All stored in the JSON output for downstream audit. Value delivered: the fixture's first live run found a bug that unit tests, compile checks, and my own "phase shipped" commits all missed. Exactly the gap J called out — the auditor is doing what it's supposed to do. Bug fix is a SEPARATE concern: new task #11 tracks a separate PR (fix/upsert-outcome-serde) so the audit finding and the fix stay cleanly attributed.	2026-04-22 03:34:20 -05:00
profit	b933334ae2	Auditor: static diff check — catches own Phase 45 placeholder auditor/checks/static.ts — grep-style scan of PR diffs, no AST, no LLM. High-signal patterns only. Severity grading: - BLOCK — unimplemented!(), todo!(), panic!("not implemented"), throw new Error("not implemented") - WARN — TODO/FIXME/XXX/HACK in added lines; new pub struct fields with <2 mentions in the diff (added but nobody reads it — placeholder state) - INFO — hardcoded "placeholder"/"dummy"/"foobar"/"changeme"/"xxx" strings in added lines Live-proven — the existential test J asked for: vs PR #1 (scaffold): 0 findings (all scaffold fields cross- reference within the diff) vs commit 2a4b81b (Phase 5 WARN: every DocRef field (tool, 45 first slice — I version_seen, snippet_hash, source_url, half-admitted placeholder): seen_at) added with 0 read-sites in the diff That's the auditor flagging my own "Phase 45 first slice" commit as state-without-consumer, which is exactly what I half-admitted it was. If PR #1 had required auditor-pass (branch protection), the DocRef commit would have been blocked pre-merge. The auditor works because it agreed with the honest read. Next: dynamic hybrid test fixture (task #4) — the never-run multi- layer pipeline test.	2026-04-22 03:29:31 -05:00
profit	bfe8985233	Auditor: claim parser auditor/claim_parser.ts — reads PR body + commit messages, extracts ship-claims. Regex-based, intentionally not LLM-driven: the parser's job is to surface claim substrates, not to judge them (that's the inference check's job, runs later with cloud model). Three strength tiers: - strong — "verified end-to-end", "live-proven", "production-ready", "phase N shipped", "proven" - moderate — "shipped", "landed", "green", "passing", "works", "complete", "done" - weak — "should work", "expected to", "probably" Live-proven against PR #1 (this PR): 4 claims extracted from 1 commit (2 strong, 2 moderate). "live-proven" correctly tagged as strong (it IS a stronger claim than "shipped"). Next: static diff check consumes these claims + the PR diff to find placeholder patterns — empty fns, TODO, unwired fields, etc.	2026-04-22 03:28:06 -05:00
profit	f48dd2f20b	Auditor scaffold: types + Gitea client + policy stub + README All-Bun sub-agent that watches open PRs on Gitea, reads ship-claims, and hard-blocks merges when the code doesn't back the claim. First commit of N; this is the skeleton. Dynamic/static/inference/kb checks + poller land in follow-up commits on this same branch. - auditor/types.ts — Claim, Finding, Verdict, PrSnapshot shapes - auditor/gitea.ts — minimal API client (listOpenPrs, getPrDiff, postCommitStatus, postReview). Live-proven: returned 0 open PRs against our repo (which IS the current state — every commit today went to main directly, which is the problem this auditor is meant to prevent) - auditor/policy.ts — stub `assembleVerdict` + severity rules. Intentionally conservative defaults: strong claim + zero evidence = block, not warn. - auditor/README.md — how to run + the hard-block mechanism Workflow discipline change: starting with this branch, no more direct pushes to main. Every change lands as a PR. When this auditor is fully built and running, it'll review its own completion PR — the recursive self-test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 03:26:56 -05:00
profit	affab8ac83	Phase 45 slice 2: context7 HTTP bridge for doc drift detection Bun bridge on :3900 that wraps context7's public API and exposes the surface gateway consumes for Phase 45 drift checks. Own port so a failure here never tips over mcp-server on :3700. Endpoints: GET /health status + cache stats GET /docs/:tool resolve tool → library_id → fetch docs → return descriptor {snippet_hash, last_updated, source_url, docs_preview, ...} GET /docs/:tool/diff?since=X compare current snippet_hash to X; returns {drifted: bool, current, previous, preview if drifted} GET /cache debug dump of cached entries Implementation notes: - 5 minute in-memory cache (context7 rate-limits by IP; gateway drift-checks are the hot caller) - 1500-token slices from context7 (enough for drift-meaningful hash, not so much we hammer their API) - snippet_hash = SHA-256 prefix (16 hex chars) of fetched content - Library resolution prefers "finalized" state; falls back to top result if none finalized Verified live against context7.com: - /health → ok, 0 cache, 300s TTL - /docs/docker → library_id /docker/docs, title "Docker", hash 475a0396ca436bba, last updated 2026-04-20 - /docs/docker (again) → cache hit, 0.37ms (5400× speedup) - /docs/docker/diff?since=stale-hash-0000 → drifted=true, preview included - /docs/docker/diff?since=<current hash> → drifted=false, preview omitted (honest: no drift to show) Not yet wired: - Gateway consumer (Phase 45 slice 3): /vectors/playbook_memory/doc_drift/check/{id} calls this bridge and updates DocRef.snippet_hash + doc_drift_flagged_at - Systemd unit (bridge is manual-start for now, same as bot/) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 03:17:17 -05:00
profit	2a4b81bf48	Phase 45 (first slice): DocRef + doc_refs field on PlaybookEntry Phase J keeps asking for: playbooks know which external docs they used, get flagged when those docs drift. This commit ships the data model; context7 bridge + drift check endpoints land in follow-ups. Added to crates/vectord/src/playbook_memory.rs: - pub struct DocRef { tool, version_seen, snippet_hash, source_url, seen_at } — one external doc reference - PlaybookEntry.doc_refs: Vec<DocRef> — empty on legacy entries, serde default ensures pre-Phase-45 persisted state loads cleanly - PlaybookEntry.doc_drift_flagged_at: Option<String> — set by the (future) drift-check code when context7 reports newer version - PlaybookEntry.doc_drift_reviewed_at: Option<String> — set by human via /resolve endpoint after reviewing the diagnosis - impl Default for PlaybookEntry — collapses most test-helper constructors from 17 explicit fields to 6-9 fields + ..Default::default() Updated SeedPlaybookRequest + RevisePlaybookRequest (service.rs) to accept optional doc_refs: the seed/revise endpoints already take the field, downstream drift detection (Phase 45.2) consumes it. Docs: docs/CONTROL_PLANE_PRD.md gains full Phase 45 spec with gate criteria, non-goals, and risk notes. Tests: 51/51 vectord lib tests green (same count as before, field additions are backward-compat). Memory: project_doc_drift_vision.md written so this keeps coming back to the front of mind. Next slices (same phase): context7 HTTP bridge in mcp-server, /vectors/playbook_memory/doc_drift/check/{id} endpoint, overview- model drift synthesis writing to data/_kb/doc_drift_corrections.jsonl, boost exclusion for flagged+unreviewed entries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 03:14:07 -05:00
profit	75a0f424ef	Phase 40 (early): Langfuse tracing on /v1/chat — observability recovery The lost stack J flagged was partly already present: Langfuse container has been running 2 days with the staffing project, SDK installed, mcp-server tracing gw:/* routes. What was missing was Rust-side /v1/chat emission — the new Phase 38/39 code bypassed Langfuse entirely. This commit bridges it. Fire-and-forget HTTP POST to http://localhost:3001/api/public/ingestion (batch {trace-create + generation-create}) on every chat call. Non-blocking — spawned tokio task, response latency unaffected. Trace failures log warn and drop, never propagate. Verified end-to-end after restart: - Log line "v1: Langfuse tracing enabled" at startup - /v1/chat local (qwen3.5:latest) → v1.chat:ollama trace appears with lat=0.41s, 24+6 tokens - /v1/chat cloud (gpt-oss:120b) → v1.chat:ollama_cloud trace appears with lat=1.87s, 73+87 tokens - mcp-server's existing gw:/log + gw:/intelligence/* traces continue to flow into the same project unchanged Files: - crates/gateway/src/v1/langfuse_trace.rs (new, 195 LOC) — thin client, no SDK. reqwest Basic Auth. ChatTrace payload + event serializer. from_env_or_defaults() resolver matches mcp-server/tracing.ts conventions (pk-lf-staffing / sk-lf- staffing-secret / localhost:3001) - crates/gateway/src/v1/mod.rs — V1State.langfuse field, emission after successful provider call (post-dispatch, pre-usage-update) - crates/gateway/src/main.rs — resolve + log at startup Tests: 12/12 green (9 prior + 3 for langfuse_trace: ingestion-batch serialization, uuid generator uniqueness, env resolver shape). Recovered piece #1 of 3 from the lost-stack narrative. Still open: - Langfuse → observer :3800 pipe (Phase 40 mid-deliverable) - Gitea MCP reconnect in mcp-server/index.ts (Phase 40 late) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 03:04:28 -05:00
profit	6316433062	Phase 40 scope: Langfuse + Gitea MCP recovery as named deliverables J flagged that a prior version of this stack had Langfuse traces piping into the observer + Gitea MCP for repo ops — lost. Adding these as explicit Phase 40 deliverables alongside routing engine + Gemini/Claude adapters. Findings during scope-check: - Langfuse container is already running (Up 2 days, langfuse:2, localhost:3001 healthcheck passes) - mcp-server/tracing.ts + package.json already have SDK wired - Credentials pk-lf-staffing / sk-lf-staffing-secret (from env) - Gitea MCP binary still installed at gitea-mcp@0.0.10 So recovery here is mostly re-connecting existing infra: 1. Add Rust-side Langfuse client for /v1/chat tracing (gateway currently bypasses tracing, mcp-server already has it) 2. Wire Langfuse → observer :3800 pipe 3. Register Gitea MCP in mcp-server/index.ts tool list Each landing as part of Phase 40 when the routing engine ships.	2026-04-22 03:01:28 -05:00
profit	42a11d35cd	Phase 39 (first slice): Ollama Cloud adapter on /v1/chat Second provider wired. /v1/chat now routes by optional `provider` field: default "ollama" hits local via sidecar, "ollama_cloud" (or "cloud") hits ollama.com/api/generate directly with Bearer auth. Key sourced at gateway startup from OLLAMA_CLOUD_KEY env, then /root/llm_team_config.json (providers.ollama_cloud.api_key), then OLLAMA_CLOUD_API_KEY env. Config source matches LLM Team convention. Shape-identical to scenario.ts::generateCloud — same endpoint, same body, same Bearer auth. Cloud path bypasses sidecar entirely (sidecar is local-only by design, mirrors TS agent.ts). Changes: - crates/gateway/src/v1/ollama_cloud.rs (new, 130 LOC) — reqwest client, resolve_cloud_key(), chat() adapter, CloudGenerateBody / CloudGenerateResponse wire shapes - crates/gateway/src/v1/ollama.rs — flatten_messages_public() re-export so sibling adapters reuse the shape collapse - crates/gateway/src/v1/mod.rs — provider field on ChatRequest, dispatch match in chat() handler, ollama_cloud_key on V1State - crates/gateway/src/main.rs — resolves cloud key at startup, logs which source provided it - crates/gateway/Cargo.toml — reqwest 0.12 with rustls-tls Verified end-to-end after restart: - provider=ollama → qwen3.5:latest local (~400ms, Phase 38 unchanged) - provider=ollama_cloud + model=gpt-oss:120b → real 225-word technical response in 5.4s, 313 tokens Tests: 9/9 green (7 from Phase 38 + 2 new for cloud body serialization and key resolver shape). Not in this slice: trait extraction (full Phase 39 scope adds ProviderAdapter trait + OpenRouter adapter + fallback chain logic). These land next with Phase 40 routing engine on top. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 02:57:42 -05:00
profit	8cbbd0ef70	Phase 38 fix: default think=false on /v1/chat Live-test caught the Phase 21 thinking-model trap on first call. qwen3.5 with max_tokens=50 and default think behavior burned all 50 tokens on hidden reasoning; visible content was "". completion_tokens exactly matching max_tokens was the tell. Adapter now defaults think: Some(false) matching scenario.ts hot-path discipline. Callers that want reasoning (overseers, T3+) opt in via a non-OpenAI `think: true` extension field on the request. Verified end-to-end after restart: - "Lakehouse supports ACID and raw data." (5 words, 516ms) - "tokio\nasync-std\nsmol" (3 Rust crates, 391ms) - /v1/usage accumulates across calls (2 req / 95 total tokens) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 02:50:09 -05:00
profit	4cb405bb42	Phase 38: Universal API skeleton — /v1/chat, /v1/usage, /v1/sessions First slice of the control-plane pivot. OpenAI-compatible surface over the existing aibridge → Ollama path. Additive — no existing routes touched. All 7 unit tests green, release build clean. What ships: - crates/gateway/src/v1/mod.rs — router, V1State (ai_client + Usage counter), ChatRequest/ChatResponse/Message/UsageBlock types, handlers for /chat, /usage, /sessions. OpenAI-compatible field shapes: {model, messages[{role,content}], temperature?, max_tokens?, stream?} - crates/gateway/src/v1/ollama.rs — shape adapter. Flattens messages into (system, prompt), calls aibridge.generate, unwraps response back into OpenAI /v1/chat shape. Prefers sidecar-reported tokens; falls back to chars/4 ceiling estimate matching Phase 21 convention. - crates/gateway/src/main.rs — one new mod, one .nest("/v1", ...) Tests (7/7): - chat_request_parses_openai_shape - chat_request_accepts_minimal - usage_counter_default_is_zero - flatten_separates_system_from_turns - flatten_concatenates_multiple_system_messages - flatten_with_no_system_returns_empty_system - estimate_tokens_chars_div_4_ceiling Not in this phase (per CONTROL_PLANE_PRD.md): streaming, tool calls, session state, multi-provider, fallback chain, cost gating. All land in Phases 39-44. Next: live-test POST /v1/chat after gateway restart, then migrate bot/propose.ts off direct sidecar calls to prove the loop end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 02:47:15 -05:00
profit	f44b6b3e6b	Control-plane pivot: Phase 38-44 plan + bot scaffold Direction shift 2026-04-22: docs/CONTROL_PLANE_PRD.md becomes the long-horizon architecture target. Existing Lakehouse (docs/PRD.md, Phases 0-37) is preserved as the reference implementation and first consumer. New 6-layer architecture: L1 Universal API /v1/chat /v1/usage /v1/sessions /v1/tools /v1/context L2 Routing & Policy Engine (rules, fallback chains, cost gating) L3 Provider Adapter Layer (Ollama + OpenRouter + Gemini + Claude) L4 Knowledge + Memory + Playbooks (already built) L5 Execution Loop (scenarios + bot/cycle.ts instances) L6 Observability + token accounting Phases 38-44 sequenced with detailed per-phase specs in the PRD. Current scope: staffing domain (synthetic workers_500k, contracts, emails, SMS, playbooks). DevOps (Terraform/Ansible) is long-horizon target — architecture-compatible but not current. Files added: - docs/CONTROL_PLANE_PRD.md — 6-layer architecture, Phase 38-44 sequencing with staffing-first Truth Layer + Validation pipeline - bot/ — manual-only PR bot scaffold. First consumer test-bed for /v1/chat (Phase 38). Mem0-aligned ADD/UPDATE/NOOP apply semantics; KB feedback loop reads prior cycles on same gap and injects into cloud prompt so bot cycles compound like scenario.ts runs do. - tests/multi-agent/run_stress.ts — the 6-task diverse stress test referenced in the previous commit but missing from its staging Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 02:43:31 -05:00
profit	5b1fcf6d27	Phase 28-36 body of work Accumulated since a6f12e2 (Phase 21 Rust port + Phase 27 versioning): - Phase 36: embed_semaphore on VectorState (permits=1) serializes seed embed calls — prevents sidecar socket collisions under concurrent /seed stress load - Phase 31+: run_stress.ts 6-task diverse stress scaffolding; run_e2e_rated.ts + orchestrator.ts tightening - Catalog dedupe cleanup: 16 duplicate manifests removed; canonical candidates.parquet (10.5MB -> 76KB) + placements.parquet (1.2MB -> 11KB) regenerated post-dedupe; fresh manifests for active datasets - vectord: harness EvalSet refinements (+181), agent portfolio rotation + ingest triggers (+158), autotune + rag adjustments - catalogd/storaged/ingestd/mcp-server: misc tightening - docs: Phase 28-36 PRD entries + DECISIONS ADR additions; control-plane pivot banner added to top of docs/PRD.md (pointing at docs/CONTROL_PLANE_PRD.md which lands in next commit) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 02:41:15 -05:00
profit	a6f12e2609	Phase 21 Rust port + Phase 27 playbook versioning + doc-sync Phase 21 — Rust port of scratchpad + tree-split primitives (companion to the 2026-04-21 TS shipment). New crates/aibridge modules: context.rs — estimate_tokens (chars/4 ceil), context_window_for, assert_context_budget returning a BudgetCheck with numeric diagnostics on both success and overflow. Windows table mirrors config/models.json. continuation.rs — generate_continuable<G: TextGenerator>. Handles the two failure modes: empty-response from thinking models (geometric 2x budget backoff up to budget_cap) and truncated-non-empty (continuation with partial as scratchpad). is_structurally_complete balances braces then JSON.parse-checks. Guards the degen case "all retries empty, don't loop on empty partial". tree_split.rs — generate_tree_split map->reduce with running scratchpad. Per-shard + reduce-prompt go through assert_context_budget first; loud-fails rather than silently truncating. Oldest-digest-first scratchpad truncation at scratchpad_budget (default 6000 t). TextGenerator trait (native async-fn-in-trait, edition 2024). AiClient implements it; ScriptedGenerator test double lets tests inject canned sequences without a live Ollama. GenerateRequest gained think: Option<bool> — forwards to sidecar for per-call hidden-reasoning opt-out on hot-path JSON emitters. Three existing callsites updated (rag.rs x2, service.rs hybrid answer). Phase 27 — Playbook versioning. PlaybookEntry gained four optional fields (all #[serde(default)] so pre-Phase-27 state loads as roots): version u32, default 1 parent_id Option<String>, previous version's playbook_id superseded_at Option<String>, set when newer version replaces superseded_by Option<String>, the playbook_id that replaced New methods: revise_entry(parent_id, new_entry) — appends new version, stamps superseded_at+superseded_by on parent, inherits parent_id and sets version = parent + 1 on the new entry. Rejects revising a retired or already-superseded parent (tip-of-chain is the only valid revise target). history(playbook_id) — returns full chain root->tip from any node. Walks parent_id back to root, then superseded_by forward to tip. Cycle-safe. Superseded entries excluded from boost (same rule as retired): filter in compute_boost_for_filtered_with_role (both active-entries prefilter and geo-filtered path), rebuild_geo_index, and upsert_entry's existing- idx search. status_counts returns (total, retired, superseded, failures); /status JSON reports active = total - retired - superseded. Endpoints: POST /vectors/playbook_memory/revise GET /vectors/playbook_memory/history/{id} Doc-sync — PHASES.md + PRD.md drifted from git after Phases 24-26 shipped. Fixes applied: - Phase 24 marked shipped (commit b95dd86) with detail of observer HTTP ingest + scenario outcome streaming. PRD "NOT YET WIRED" rewritten to reflect shipped state. - Phase 25 (validity windows, commit e0a843d) added to PHASES + PRD. - Phase 26 (Mem0 upsert + Letta hot cache, commit 640db8c) added. - Phase 27 entry added to both docs. - Phase 19.6 time decay corrected: was documented as "deferred", actually wired via BOOST_HALF_LIFE_DAYS = 30.0 in playbook_memory.rs. - Phase E/Phase 8 tombstone-at-compaction limit note updated — Phase E.2 closed it. Tests: 8 new version_tests in vectord (chain-metadata stamping, retired/superseded parent rejection, boost exclusion, history from root/tip/middle, legacy default round-trip, status counts). 25 new aibridge tests (context/continuation/tree_split). Workspace total 145 green (was 120). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:40:49 -05:00
root	640db8c63c	Phase 26 — Mem0 upsert + Letta geo hot cache Closes the two remaining 2026-era memory findings. Both are optimizations per J's framing — not load-bearing, but good data hygiene + future-proofing at scale. MEM0 UPSERT (data hygiene): Before: /seed always appended. A scenario re-running the same operation on the same day wrote duplicate entries, inflating the playbook corpus with near-identical rows. Now: upsert_entry(new) inspects existing non-retired entries and decides ADD / UPDATE / NOOP: ADD → no matching (operation, day, city, state) tuple, append UPDATE → match exists with different names → merge (union, stable order), refresh timestamp, keep original playbook_id so citations stay valid NOOP → match exists with identical names → skip, return id Day-granularity keying on timestamp YYYY-MM-DD means intraday re-seeds dedup but tomorrow's same-operation is a fresh ADD. Retired entries don't block new seeds — they're out of scope anyway. Seed endpoint returns {outcome: {mode, playbook_id, merged_names?}, entries_after}. Append=false retains old replace-all semantics. 5 unit tests pass: first_seed_is_add, identical_reseed_is_noop, same_day_different_names_updates_and_merges, different_day_same_op_is_add, retired_entry_doesnt_block_new_seed. Live verified: three successive seeds with (Alice), (Alice), (Alice, Bob) left entry count unchanged at 1936 with merged names {Alejandro, Lauren, Alice, Bob}. Previously would have been 3 appends. LETTA GEO HOT CACHE (scale primitive): Added geo_index: HashMap<(city_lower, state_upper), Vec<usize>> alongside PlaybookMemoryState. Rebuilt on every mutation: set_entries, retire_one, retire_on_schema_drift, upsert_entry, load_from_storage. compute_boost_for_filtered_with_role now uses the index for O(1) geo lookup instead of scanning all entries. At current scale (1.9K) the scan was sub-ms; at 100K+ the scan becomes the dominant cost. The hot cache future-proofs without adding an LRU abstraction. Retired entries excluded from index; valid_until still checked on the hot path since it can elapse between rebuilds. Owns cloned PlaybookEntries in the geo_filtered vector so the state read-lock is released before cosine scoring — avoids lock contention on the scoring path. Memory-findings progress: 5 of 5 shipped. ✓ Multi-strategy parallel retrieval (Phase 19 refinement) ✓ Input normalization + unified /memory/query (Phase 24 TS) ✓ Zep validity windows (Phase 25) ✓ Mem0 UPSERT (Phase 26 today) ✓ Letta geo hot cache (Phase 26 today) All 18 playbook_memory tests pass. v0.26-memory-complete	2026-04-21 00:24:05 -05:00
root	138592dc56	Spec v2 — all chapters aligned with Phases 19-25 Full audit pass on devop.live/lakehouse/spec. Five chapters were stale, one had an outright incorrect line. Scope was bigger than ch6 alone — J asked "you want to update all" and the honest answer was yes. Ch 1 (Repository layout): - mcp-server row gains /memory/query, /models/matrix, /system/summary, observer.ts with :3800 listener - tests/multi-agent/ row lists all new files: kb.ts, normalize.ts, memory_query.ts, gen_scenarios.ts, gen_staffer_demo.ts, and the colocated unit tests (kb.test.ts, normalize.test.ts) - NEW config/ row documents models.json as the 5-tier matrix - data/ row enumerates the four learning-loop directories: _kb/, _playbook_lessons/, _observer/, _chunk_cache/ Ch 3 (Measurement & indexing): - NEW "Model matrix (Phase 20)" subsection — 5-tier table (T1 hot / T2 review / T3 overview / T4 strategic / T5 gatekeeper), per-tier primary model, frequency, the think:false mechanical finding called out with the 650-token reasoning-budget example - NEW "Continuation primitive (Phase 21)" paragraph - NEW "Per-staffer tool_level (Phase 23)" section with full/local/ basic/minimal mapping and the 46pt fill-rate delta from the 36-run demo Ch 7 (Scale story): - FIX: playbook_memory growth bullet was claiming "No TTL or merge policy" — Phase 25 added retirement via valid_until + schema_fingerprint + /retire endpoint. Rewritten to name current state (1936 entries, active vs retired split exposed). Ch 8 (Error surfaces): - Five new rows added to the failure-mode table: * Zero-supply city → cloud rescue (Phase 22 item B) with the Gary IN → South Bend IN concrete example * LLM truncation → generateContinuable (Phase 21) * Schema migration → /vectors/playbook_memory/retire (Phase 25) * Observer unreachable → scenario silent-skip + append journal survivability Ch 9 (Per-staffer context): - NEW "Staffer identity + competence-weighted retrieval (Phase 23)" section with the competence_score formula and findNeighbors weighted_score - NEW "Auto-discovered reliable-performer labels" section naming Rachel D. Lewis (18 endorsements) and Angela U. Ward (19) as concrete output of 36-run demo Ch 10 (A day in the life): - Added 17:15 timeline entry — Kim using /memory/query with natural language, regex normalizer extracting role/city/count in 0ms - 17:00 entry updated to mention KB indexing + pathway recommendation + observer stream - 22:00 entry updated to mention detectErrorCorrections nightly scan Ch 11 (Known limits & non-goals): - FIX: "playbook_memory compaction" bullet rewritten since retirement is now wired; reframed as the honest Mem0 UPDATE/NOOP gap - Added Letta hot cache deferred item with honest "cheap at 1.9K, will bite at 100K" framing - Added Chunking cache (Phase 21 Rust port) deferred item - Added Observer → autotune feedback wire deferred item (Phase 26+) Footer bumped v1 2026-04-20 → v2 2026-04-21 with Phase list. Verified all updates live on devop.live/lakehouse/spec.	2026-04-21 00:16:41 -05:00
root	e0a843d1a5	Phase 25 — validity windows + playbook retirement Addresses the load-bearing memory gap J flagged: playbook entries had timestamps but no retirement semantic. When a schema migration changed a column or a seasonal contract ended, stale playbooks kept boosting candidates silently. Zep 2026-era finding — temporal validity is the single highest-value memory-hygiene primitive. SCHEMA (PlaybookEntry gains four optional fields, serde default): schema_fingerprint — SHA-256 over dataset (column, type) tuples at seed time. Missing = legacy entry, never auto-retired on drift. valid_until — RFC3339 hard expiry. compute_boost skips entries past this moment. retired_at — Set by retire_one or retire_on_schema_drift. Retired entries excluded from all boost calculations but kept in journal. retirement_reason — Human-readable: "schema_drift: ...", "expired: ...", "manual: ..." RETRIEVAL PATH (compute_boost_for_filtered_with_role): Before geo+cosine, active_entries filter removes anything retired OR past valid_until. Uses chrono::Utc::now() once per call, no per- entry clock queries. NEW METHODS on PlaybookMemory: retire_one(playbook_id, reason) retire_on_schema_drift(city, state, current_fp, reason) — idempotent, scopes by (city, state) so a Nashville migration doesn't touch Chicago. Skips legacy entries with no fingerprint. status_counts() -> (total, retired, failures) HTTP ENDPOINTS: POST /vectors/playbook_memory/retire {playbook_id, reason} → retire by id {city, state, current_schema_fingerprint, reason} → schema drift GET /vectors/playbook_memory/status {total, active, retired, failures} SEED REQUEST extended with optional schema_fingerprint + valid_until so the orchestrator (scenario.ts) can pass the current schema hash when seeding, without a round trip through catalogd. UNIT TESTS (5/5 pass): retire_one_marks_entry_and_persists, retired_entries_do_not_boost, expired_valid_until_is_skipped, schema_drift_retires_mismatched_fingerprints_only, schema_drift_skips_other_cities. LIVE VERIFIED: /status on current state = 1936 entries, 43 failures. POST /retire with a sample playbook_id → "retired":1, /status now reports active=1935, retired=1. Memory-findings progress: 3 of 5 shipped. ✓ Multi-strategy parallel retrieval (Phase 19 refinement) ✓ Input normalization + unified /memory/query (Phase 24 TS) ✓ Zep-style validity windows (Phase 25, tonight) ⏳ Mem0 UPDATE / DELETE / NOOP ops (dedup same-(op,date) seeds) ⏳ Letta working-memory hot cache (not biting at 1.5K entries)	2026-04-21 00:11:02 -05:00
root	3fb3a60da4	Spec ch6 rewrite — 3 learning paths → 7 + honest gap list J flagged the spec out of alignment with what's built. Ch6 now reflects the full current architecture: - Path 1 (playbook boost) — formula kept; geo+role prefilter refinement called out with measured 14× citation lift - Path 2 (pattern discovery) — unchanged - Path 3 (autotune agent) — unchanged - Path 4 (KB + pathway recommender) — Phase 22, file layout documented - Path 5 (cloud rescue on failure) — Phase 22 item B, verified stress_01 example cited - Path 6 (staffer competence-weighted retrieval) — Phase 23, competence_score formula included, cross-staffer auto- discovered worker labels (Rachel D. Lewis 18× endorsements) - Path 7 (observer outcome ingest) — Phase 24, :3800 HTTP listener + ops.jsonl append journal Input normalizer + unified /memory/query surface documented as the "seamless with whatever input" answer, with the 319ms natural-language latency number. Honest gaps kept visible in the spec itself, not hidden: - Zep validity windows (most load-bearing remaining) - Mem0 UPDATE/DELETE/NOOP ops - Letta working-memory hot cache Live at https://devop.live/lakehouse/spec#ch6 after service restart. Verified post-deploy: geo+role prefilter, 14× delta, validity windows gap all present in served HTML.	2026-04-21 00:03:06 -05:00
root	52561d10d3	Input normalizer + unified memory query — "seamless with whatever input" J asked directly: "did we implement our memory findings so that our knowledge base and our configuration playbook [work] seamlessly with whatever input they're given?" Honest answer tonight was "one of five findings shipped, normalizer is the blocker." This closes that gap. NORMALIZER (tests/multi-agent/normalize.ts): Accepts structured JSON, natural language, or mixed. Returns canonical NormalizedInput { role, city, state, count, client, deadline, intent, confidence, extraction_method, missing_fields } for any downstream consumer. Three-tier path: 1. Structured fast-path — already-shaped input skips LLM 2. Regex path — "need 3 welders in Nashville, TN" parses without LLM. City/state parser tightened to 1-3 capitalized words + "in {city}" anchor preference + case-exact full-state-name variants to prevent "Forklift Operators in Chicago" being captured as the city name 3. LLM fallback — qwen3 local with think:false + 400 max_tokens for inputs the regex can't handle Unit tests (tests/multi-agent/normalize.test.ts): 9/9 pass. Covers structured fast-path, misplacement→rescue intent, state-name→abbrev conversion, regex extraction from natural language, plural role + full state name edge case, rescue intent keyword precedence, partial input reporting missing fields, empty object fallthrough, async/sync parity on clean inputs. UNIFIED MEMORY QUERY (tests/multi-agent/memory_query.ts): One function, five parallel fan-outs, one bundle returned: - playbook_workers — hybrid_search via gateway with use_playbook_memory - pathway_recommendation — KB recommender for this sig - neighbor_signatures — K-NN sigs weighted by staffer competence - prior_lessons — T3 overseer lessons filtered by city/state - top_staffers — competence-sorted leaderboard - discovered_patterns — top workers endorsed across past playbooks for this (role, city, state) - latency_ms — per-source + total Every branch is best-effort: one source down doesn't break the bundle. HTTP ENDPOINT (mcp-server/index.ts): POST /memory/query with body {input: <anything>} → MemoryQueryResult Returns the same shape the TS function does. Typed with types.ts for future UI consumption. VERIFIED: curl POST /memory/query with structured {role,city,state,count} → extraction_method=structured, 10 playbook workers, top score 0.878 curl POST /memory/query with "I need 3 welders in Nashville, TN" → extraction_method=regex (no LLM call), 319ms total, 8 endorsements for Lauren Gomez auto-discovered as top Nashville Welder Honest remaining gaps (documented for next phase): - Mem0 ADD/UPDATE/DELETE/NOOP — we still only ADD + mark_failed - Zep validity windows — playbook entries have timestamps but no retirement semantic - Letta working-memory / hot cache — every query scans all 1560 playbook entries - Memory profiles / scoped queries — global pool, no per-staffer private subsets 2 of 5 findings now shipped (multi-strategy retrieval in Rust, input normalization + unified query in TS). The remaining 3 are architectural additions queued as Phase 25 items — validity windows first since it's the most load-bearing for long-running systems.	2026-04-20 23:59:05 -05:00
root	b95dd86556	Phase 24 — observer HTTP ingest + scenario outcome streaming Closes the gap J flagged: observer wraps MCP:3700, scenarios hit gateway:3100 directly, observer idle at 0 ops across 3600+ cycles. Now scenarios POST per-event outcomes to observer's new HTTP ingest on :3800, observer consumes them alongside MCP-wrapped ops, ERROR_ ANALYZER and PLAYBOOK_BUILDER loops see the full picture. observer.ts: - Bun.serve() HTTP listener on OBSERVER_PORT (default 3800): GET /health — basic + ring depth GET /stats — total / success / failure / by_source / recent scenario ops digest POST /event — accept scenario outcome, shape it into ObservedOp with source="scenario" + staffer_id + sig_hash + event_kind + role/city/state + rescue flags - recordExternalOp() — shared ring-buffer insert so the main analyzer + playbook builder don't care where the op came from - ObservedOp extended with provenance fields persistOp() FIX — old path POSTed to /ingest/file?name=observed_operations which REPLACES the dataset (flagged in feedback_ingest_replace_semantics.md). Every op was silently wiping all prior ops. Replaced with append to data/_observer/ops.jsonl so the historical trace is durable across analyzer cycles and process restarts. scenario.ts: - OBSERVER_URL env (default http://localhost:3800) - postObserverEvent() helper with 2s AbortSignal.timeout so observer being down doesn't block scenario flow - Per-event POST after ctx.results.push(result), carrying staffer_id, sig_hash (via imported computeSignature), event_kind + role + city + state + count + rescue_attempted / rescue_succeeded + truncated output_summary VERIFIED: curl POST /event → {"accepted":true,"ring_size":1} curl GET /stats → {"total":1,"successes":1,"by_source":{"scenario":1}, "recent_scenario_ops":[{...staffer_id,kind,role}]} Final v3 demo leaderboard (9 runs per staffer, cumulative 3 batches): James (local): 92.9% fill, 36.8 cites, score 0.775 — RANK 1 Maria (full): 81.0% fill, 26.2 cites, score 0.727 Sam (basic): 61.9% fill, 28.2 cites, score 0.640 Alex (minimal): 59.5% fill, 32.2 cites, score 0.631 Honest finding: Alex has MORE citations than Sam despite NO T3 and NO rescue. Playbook inheritance alone is firing hardest when overseer is absent. The 59.5% fill rate (up from 0% when qwen2.5 was executor) proves cloud-exec + playbook inheritance is the floor the architecture delivers. Local gpt-oss:20b T3 outperforms cloud gpt-oss:120b T3 by 12pt fill rate on this workload — cloud overseer paying latency+variance for no measurable gain, worth flagging in next models.json tune.	2026-04-20 23:49:30 -05:00
root	137aed64fb	Coherence pass — PRD/PHASES updates, config snapshot wired, unit tests J flagged the audit: "make sure everything flows coherently, no pseudocode or unnecessary patches or ignoring any particular part of what we built." This is that pass. PRD.md updates: - Phase 19 refinement block — geo-filter + role-prefilter WIRED with citation density numbers (0.32 → 1.38, and 2 → 28 on same scenario). - Phase 20 rewrite — mistral dropped, qwen3.5 + qwen3 local hot path, think:false as the key mechanical finding, kimi-k2.6 upgrade path. - Phase 21 status block — think plumbing + cloud executor routing added after original commit. - Phase 22 item B (cloud rescue) — pivot sanitizer, rescue verified 1/3 on stress_01. - Phase 23 NEW — staffer identity + tool_level + competence-weighted retrieval + kb_staffer_report. Auto-discovered worker labels called out with real numbers (Rachel Lewis 12× across 4 staffers). - Phase 24 NEW — Observer/Autotune integration gap DOCUMENTED, not fixed. Observer has been idle at 0 ops for 3600+ cycles because scenarios hit gateway:3100 directly, bypassing MCP:3700 which the observer wraps. This is the honest "we're not using it in these tests" signal J surfaced. Fix deferred; gap visible now. PHASES.md: - Appended Phases 20-23 as checked, Phase 24 as unchecked gap. - Updated footer count: 102 unit tests across all layers. - Latest line updated with 14× citation lift + 46.4pt tool-asymmetry finding. scenario.ts: - snapshotConfig() was defined but never called. Now fires at every scenario start with a stable sha256 hash over the active model set + tool_level + cloud flags. config_snapshots.jsonl finally populates, which the error_corrections diff path needs to work correctly. kb.test.ts (new): 4 signature invariant tests — stability across unrelated fields (date, contract, staffer), sensitivity to role/city/ count changes, digest shape. All pass under `bun test`. service.rs: 6 Rust extractor tests for extract_target_geo + extract_target_role — basic, missing-state-returns-none, word boundary (civilian != city), multi-word role, absent role, quoted value parse. All pass under `cargo test -p vectord --lib extractor_tests`. Dangling items now honestly documented rather than silently pending: - Chunking cache (config/models.json SPEC, not wired) — flagged - Playbook versioning (SPEC, not wired) — flagged - Observer integration (WIRED but disconnected) — new Phase 24	2026-04-20 23:29:13 -05:00
root	ad0edbe29c	Cloud kimi-k2.5 executor for weak tiers + multi-strategy playbook retrieval Two coupled changes from the 2026 agent-memory research + tool asymmetry findings. SCENARIO (weak-tier cloud substitute): qwen2.5 collapsed to 0/14 across the basic/minimal tool_levels. Replace with cloud kimi-k2.5 on Ollama Cloud — same family as k2.6 (pro-tier locked today, on J's upgrade path). Plumb cloud flag through ACTIVE_EXECUTOR_CLOUD / ACTIVE_REVIEWER_CLOUD into generateContinuable so executor/reviewer can route to cloud when tool_level requires. think:false supported by Kimi family. Tool level mapping (revised): full — qwen3.5 local + qwen3 local + cloud gpt-oss:120b T3 + rescue local — qwen3.5 local + qwen3 local + local gpt-oss:20b T3 + rescue basic — kimi-k2.5 cloud + qwen3 local + local T3, no rescue minimal — kimi-k2.5 cloud + qwen3 local, no T3, no rescue. Playbook inheritance alone on the decision path. This is the honest version of J's "minimal tools still works via inheritance" hypothesis — with the executor no longer broken at the tokenizer level, we can actually measure whether playbook retrieval substitutes for missing overseers. PLAYBOOK_MEMORY (multi-strategy retrieval): Zep / Mem0 research shows multi-strategy rerank (semantic + keyword + graph + temporal) outperforms single-strategy cosine. Lakehouse now has a two-tier: 1. Exact (role, city, state) match: skip cosine, assign similarity=1.0, take up to top_k/2+1 slots. These are identity-class neighbors — the strongest possible signal. 2. Cosine fallback within the same (city, state) but different role: fills remaining slots. Exposed as compute_boost_for_filtered_with_role(target_geo, target_role). Backwards-compatible: compute_boost_for_filtered forwards with role=None so existing callers keep their current behavior. Service.rs wires both: extract_target_geo and extract_target_role pull from the executor's SQL filter. grab_eq_value is factored out of extract_target_geo so both lookups share one parser. Diagnostic log now prints target_role alongside target_geo for every hybrid_search: playbook_boost: boosts=88 sources=39 parsed=39 matched=5 target_geo=Some(("Nashville", "TN")) target_role=Some("Welder") Verified: Nashville Welder query returns 5/10 boosted workers in top_k with clean role+geo provenance. Research sources: atlan.com Agent Memory Frameworks 2026, Mem0 paper (arxiv 2504.19413), Zep/Graphiti LongMemEval comparison, ossinsight Agent Memory Race 2026. kimi-k2.6 on current key returns 403 — pro-tier upgrade required. kimi-k2.5 is the substitute today; swap to k2.6 by renaming one line in applyToolLevel once the subscription lands.	2026-04-20 23:20:07 -05:00
root	5e89407939	Phase 23 refinement — per-staffer tool_level variance Staffer.tool_level now controls which subsystems a specific run gets: full — qwen3.5 + qwen3 + cloud T3 + cloud rescue local — qwen3.5 + qwen3 + local gpt-oss:20b T3 + rescue basic — qwen2.5 + qwen2.5 + local T3, no rescue minimal — qwen2.5 + qwen2.5, NO T3, NO rescue. Playbook inheritance only. applyToolLevel() mutates module-scoped ACTIVE_* slots each run from the env defaults, so prior staffer's overrides never leak. Hot-path code reads ACTIVE_EXECUTOR / ACTIVE_REVIEWER / ACTIVE_T3_DISABLED / ACTIVE_OVERVIEW_CLOUD / ACTIVE_RETRY_ON_FAIL instead of the baked constants. The architectural question this answers: does playbook_memory inheritance carry enough knowledge to let a weakly-tooled coordinator still produce usable outcomes? "Minimal" Alex runs qwen2.5 exec + no reviewer overseer + no cloud rescue. If Alex still fills events at a reasonable rate, the playbook system is the real knowledge carrier — the senior stack is nice-to-have, not the sine qua non. Demo personas mapped: Maria (senior, 48mo, full) James (mid, 14mo, local) Sam (junior, 4mo, basic) Alex (trainee, 1mo, minimal) Same 3 contracts (Nashville downtown, Joliet warehouse, Indianapolis assembly) across all four → 12 runs. KB + kb_staffer_report.py leaderboard already wired; competence_score will now reflect real tool asymmetry instead of LLM sampling variance.	2026-04-20 22:50:05 -05:00
root	6b71c8e9b2	Phase 23 — contract terms + staffer identity + competence-weighted retrieval Matrix-index the "who handled this" dimension so top staffers become the training signal and juniors inherit their playbooks automatically via the boost pipeline. Auto-discovered indicators emerge from comparing trajectories across staffers on similar contracts — that was always the architectural point; this wires the last piece. ContractTerms: - deadline, budget_total_usd, budget_per_hour_max, local_bonus_per_hour, local_bonus_radius_mi, fill_requirement ("paramount" \| "preferred") - Attached to ScenarioSpec, propagated into T3 checkpoint + cloud rescue prompts so cloud reasons about trade-offs (pivot within bonus radius first; respect per-hour cap; split across cities when fill_requirement=paramount). Staffer: - {id, name, tenure_months, role: senior\|mid\|junior\|trainee} - On ScenarioSpec; logged at scenario start; attached to KB outcome - Recomputed StafferStats written to data/_kb/staffers.jsonl after every run: total_runs, fill_rate, avg_turns, avg_citations, rescue_rate, competence_score. - Competence formula: 0.45fill_rate + 0.20turn_efficiency + 0.20citation_density + 0.15rescue_rate. Normalized to 0..1. findNeighbors now returns weighted_score = cosine × best_staffer_competence (floored at 0.3 so high-similarity low-competence neighbors still surface). pathway_recommender prompt shows the top staffer's identity so cloud knows WHOSE playbook it's synthesizing from. Demo infrastructure: - tests/multi-agent/gen_staffer_demo.ts: 4 personas (Maria senior, James mid, Sam junior, Alex trainee) × 3 contracts (Nashville Welder, Joliet Warehouse, Indianapolis Assembly). 12 scenarios total. - scripts/run_staffer_demo.sh: runs the 12 sequentially with LH_OVERVIEW_CLOUD=1. Post-run calls kb_staffer_report.py. - scripts/kb_staffer_report.py: leaderboard + cross-staffer worker overlap (names endorsed by ≥2 staffers → auto-discovered high-value workers). Top vs bottom differential. gen_scenarios.ts (Phase 22 generator) also now emits contract terms on 70% of generated specs — future KB batches populate with realistic constraint patterns instead of bare role+city+count. Stress scenario from item A intentionally NOT the production test. Real staffing has constraints; Nashville contract + staffer demo is the honest test of whether the architecture produces measurable differential between coordinator skill levels. Demo batch launched — 12 runs × ~3min each ≈ 40min unattended. Report emitted after batch.	2026-04-20 22:16:09 -05:00
root	a7fc8e2256	Item B — cloud-rescue retry on event failure When a scenario event fails (drift abort or other error) and LH_RETRY_ON_FAIL is on (default when cloud T3 is enabled), ask cloud for a concrete pivot — new city, role, or count — then re-run the event with the remediation's fields. Capped at 1 retry per event so a genuinely-impossible scenario can't burn budget. requestCloudRemediation(event, result): - Feeds the same diagnostic bundle T3 checkpoints get (SQL filters, row counts, SQL errors, reviewer drift reasons, gap signals). - Prompt demands structured JSON: {retry, new_city, new_role, new_count, rationale}. - Cloud is instructed to pivot to NEAREST alternate city when zero-supply detected, broaden role when uniquely scarce, reduce count when clearly unachievable, or return retry=false when no pivot seems viable. EventResult additions: - retry_attempt, retry_remediation (with rationale + cloud_model + duration), retry_result (full inner result shape), original_event. - If retry succeeded, it becomes the primary result and original_event preserves what was attempted first. If retry also failed, the primary stays the failure and retry is recorded alongside. Sanitizer on cloud output: model sometimes emits "Hammond, IN" in new_city with "IN" in a non-existent new_state field, producing "Hammond, IN, IN" downstream. Split new_city on comma, take first token as city, extract state if present after the comma. Original event's state is the fallback. VERIFIED on stress_01.json with LH_OVERVIEW_CLOUD=1: Without rescue (item A baseline): 1/5 events ok With rescue (item B): 3/5 events ok Gary IN misplacement: drift → cloud proposed South Bend IN → retry filled 1/1. Rationale stored in retry_remediation for forensics. Known limits surfaced (future work): - City-field mangling failed one rescue before the sanitizer landed; next run will use the fix. - Cloud picks alternate cities without knowing ground-truth supply. Flint → Saginaw pivoted but Saginaw also had sparse Welders. Future: expose a /vectors/supply-estimate endpoint cloud can consult before proposing a pivot.	2026-04-20 22:01:45 -05:00
root	c21b261877	Item A — stress scenario + enriched T3 diagnostic prompt Proves cloud passthrough works end-to-end AND fixes the diagnostic quality problem that first run surfaced. STRESS SCENARIO (tests/multi-agent/scenarios/stress_01.json): Five genuinely hard events with varied failure modes: - Gary, IN 5× Electrician: ZERO supply (city not in workers_500k) - Peoria, IL 8× Safety Coordinator: scarce role, initial pool only 5 - Flint, MI 3× Welder: ZERO supply - Grand Rapids, MI 4× Tool & Die Maker: scarce but solvable - Gary, IN 1× Electrician misplacement: repeats event 1's impossibility FIRST RUN (stress v1) — cloud passthrough works, diagnosis vague: T3 checkpoint: "Potential drift flags for upcoming role" Lesson: "Before dispatching, query pool status. Update turn counter..." Generic tactical advice that doesn't address the real problem. Root cause: T3 prompt only saw outcome summary, not the raw SQL/pool/drift signals the executor had in its log. DIAGNOSTIC FIX: - Added LogEntry[] `sharedLog` parameter to runAgentFill so the caller retains the trace even when runAgentFill throws drift-abort. - EventResult gained `diagnostic_log` field populated on both OK and FAIL paths. - extractDiagnostics() pulls SQL filters, hybrid_search row counts, SQL errors, and reviewer drift notes from the log. - Checkpoint prompt now includes FAILURE FORENSICS block for failed events: SQL filters attempted, row counts, errors, drift reasons, and an explicit teaching note about zero-supply detection. - Cross-day lesson prompt flags each event with [ZERO-SUPPLY: pivot city needed] tag when drift reasons mention "no match"/"no candidates"/"0 rows". PRIORITY clause in the prompt tells the model its lesson MUST name alternate cities when that tag appears. SECOND RUN (stress v2 with enriched prompt) — cloud diagnosis sharp: T3 after Flint: risk="Zero candidate supply for Welder in Flint" hint="search Welder×3 in Saginaw, MI (≈30 mi) or expand role to Metal Fabricator" T3 after Gary: risk="Zero supply for Electrician in Gary, IN" hint="Pivot to Chicago, IL (≈40 min); broaden to Electrical Technician within 60 min radius" Lesson: specific, per-city, with distances, role-broadening fallback, and pre-loading strategy — actionable for item B retry. Cloud 120b call latencies consistent: 4.8-8.0s per prompt. Cloud passthrough proven under stress. Fill outcomes unchanged (1/5 — correct rejection of three impossible events + one propagating JSON emission edge case on retry pivot reasoning). The knowledge to rescue them now exists in the lesson; item B wires the retry.	2026-04-20 21:54:29 -05:00
root	a663698571	Item 3 — geo-filtered playbook boost; diagnostic logging ROOT CAUSE (found via instrumentation, not hunch): After a 20-scenario corpus batch, only 6/40 successful (role, city) combos ever triggered playbook_memory citations on subsequent runs. Added `playbook_boost:` tracing::info! line in vectord::service to log boost map size vs candidate pool vs match count. One query revealed: boosts=170 sources=50 parsed=50 matched=0 170 endorsed workers came back from compute_boost_for — but zero were in the 50-candidate Toledo pool. The boost map was pulling globally- ranked semantic neighbors (top-100 playbooks across ALL cities), dominated by Kansas City / Chicago / Detroit forklift playbooks the Toledo SQL filter would never admit. The mechanism was correct at the per-playbook level; the problem was pool intersection. FIX (surgical, not cap-tuning): - playbook_memory::compute_boost_for_filtered(): accepts optional (city, state) filter. When set, skips playbooks from other geos BEFORE cosine-ranking, so top-k is within the target city. - Backwards-compatible: compute_boost_for() calls the filtered variant with None — existing callers unchanged. - service::hybrid_search(): extracts target (city, state) from the executor's SQL filter via a small parser (extract_target_geo), passes to compute_boost_for_filtered. VERIFIED: Before fix: boosts=170 sources=50 parsed=50 matched=0 (0% hit) After fix: boosts=36 sources=50 parsed=50 matched=11 (22% hit) Top-k=10 now has 7/10 boosted workers with 2-3 citations each. Boost values 0.075-0.113 on cosine scores 0.67-0.74 — meaningful reorder without saturation. scripts/kb_measure.py: Aggregator that reads data/_kb/.jsonl and playbooks//results.json, reports fill rate, citation density, recommender confidence trend, and zero-citation-ok combos (item 3 target signal). Used to measure before/after on bigger batches. Diagnostic logging stays — the class of "boosts computed but not matched" bug can recur if the SQL filter format ever drifts, and without the counter it's invisible. Every hybrid_search with use_playbook_memory=true now logs its boost stats.	2026-04-20 21:35:04 -05:00
root	330cb90f99	Lift k cap, drop ornamental `reason` field, scenario generator ITEM 1 — k CAP + REASON FIELD The hybrid_search default k was hard-coded to 10. For multi-fill events (5× expansion, 4× emergency) that's pool=10 → propose 5-of-10, half the candidates become the answer with no room for rejection. Executor prompt now instructs k to scale with target_count: k = max(count*5, 20), cap 80. Default helper bumped 10 → 20. Fill.reason dropped from required to optional. Nothing downstream ever consumed it — resolveWorkerIds, sealSale, retrospective all use candidate_id and name. Models loved to write 100-150 char justifications per fill; on 4+ fills that blew the JSON budget before the structure closed. Test 1 run result after this change: FIRST EVER 5/5 on the Riverfront Steel scenario, 13 total turns across 5 events. The event that failed last run (emergency 4×Loader with truncated reason-field continuation) now clears in 2 turns. Progression: mistral baseline: 0/5 qwen3.5 + continuation + think:false: 4/5 qwen3.5 + k=20 + no-reason: 5/5 ✓ ITEM 2 — SCENARIO GENERATOR (NOT YET TESTED E2E) tests/multi-agent/gen_scenarios.ts emits N deterministic ScenarioSpecs with varied clients (15 companies), cities (20 Midwest cities known to exist in workers_500k), role mixes (14 industrial staffing roles, weighted realistic), and event sequences. Each gets a unique sig_hash so the KB populates with distinct neighbor signatures. scripts/run_kb_batch.sh runs all generated specs sequentially against scenario.ts, logs per-scenario outcomes, and reports KB state at the end. Each run takes ~2-4min; 20-30 scenarios = 1-2hr unattended. Next: test the generator+batch on a small N (3-5) to verify KB populates correctly and pathway recommendations start getting neighbor signal instead of cold-starts. Then item 3 (Rust re-weighting of hybrid_search by playbook_memory success).	2026-04-20 20:31:34 -05:00
root	9c1400d738	Phase 22 — Internal Knowledge Library (KB) Meta-layer over Phase 19 playbook_memory. Phase 19 answers "which WORKERS worked for this event"; KB answers "which CONFIG worked for this playbook signature" — model choice, budget hints, pathway notes, error corrections. tests/multi-agent/kb.ts: - computeSignature(): stable sha256 hash of the (kind, role, count, city, state) tuple sequence. Same scenario shape → same sig. - indexRun(): extracts sig, embeds spec digest via sidecar, appends outcome record, upserts signature to data/_kb/signatures.jsonl. - findNeighbors(): cosine-ranks the k most-similar signatures from prior runs for a target spec. - detectErrorCorrections(): scans outcomes for same-sig fail→succeed pairs, diffs the model set, logs to error_corrections.jsonl. - recommendFor(): feeds target digest + k-NN neighbors + recent corrections to the overview model, gets back a structured JSON recommendation (top_models, budget_hints, pathway_notes), appends to pathway_recommendations.jsonl. JSON-shape constrained so the executor can inherit it mechanically. - loadRecommendation(): at scenario start, pulls newest rec matching this sig (or nearest). scenario.ts: - Reads KB recommendation at startup (alongside prior lessons). - Injects pathway_notes into guidanceFor() executor context. - After retrospective, indexes the run + synthesizes next rec. Cold-start behavior: first run with no history writes a low-confidence "no prior data" rec so the signal that something was attempted is captured. Second run gets "low confidence, 0 neighbors" until a third distinct sig gives the embedder something to compare against — hence the upcoming scenario generator. VERIFIED: - data/_kb/ populated after one scenario run: 1 outcome (sig=4674…, 4/5 ok, 16 turns total), 1 signature, 2 recs (cold + post-run). - Recommendation JSON-parsed cleanly from gpt-oss:20b overview model. PRD Phase 22 added with file layout, cycle description, and the rationale for file-based MVP → Rust port progression that matches how Phase 21 primitives shipped. What's NOT here yet (batched follow-ups per J's request, tested between each): - Lift the k=10 hybrid_search cap to adaptive k=max(count*5, 20) - Scenario generator to bulk-populate KB with varied signatures - Rust re-weighting: push playbook_memory success signal INTO hybrid_search scoring, not just post-hoc boost	2026-04-20 20:27:12 -05:00
root	0c4868c191	qwen3.5 executor + continuation primitive + think:false Three coupled fixes that together turned the Riverfront Steel scenario from 0/5 (mistral) to 4/5 (qwen3.5) with T3 flagging real staffing concerns rather than linter advice. MODEL SWAP - Executor: mistral → qwen3.5:latest (9.7B, 262K ctx, thinking). mistral's decoder emitted malformed JSON on complex SQL filters regardless of prompt; J called it — stop using mistral. - Reviewer: qwen2.5 → qwen3:latest (40K ctx) - Applied to scenario.ts, orchestrator.ts, network_proving.ts, run_e2e_rated.ts CONTINUATION PRIMITIVE (agent.ts) - generateContinuable(): empty-response → geometric backoff retry; truncated-JSON → continue from partial as scratchpad; bounded by budget cap + max_continuations. No more "bump max_tokens until it stops truncating" tourniquet. - generateTreeSplit(): map-reduce for oversized input corpora with running scratchpad digest, reduce pass for final synthesis. - Empty text no longer throws — it's a signal to continuable that thinking ate the budget. think:false FOR HOT PATH - qwen3.5 burned ~650 tokens of hidden thinking for trivial JSON emission. For executor/reviewer/draft: think:false. For T3/T4/T5 overseers: thinking stays on (that's the point). - Sidecar generate endpoint accepts `think` bool, passes through to Ollama's /api/generate. VERIFIED OUTCOMES Riverfront Steel 2026-04-21, qwen3.5+continuable+think:false: 08:00 baseline_fill 3/3 4 turns 10:30 recurring 2/2 3 turns (1 playbook citation) 12:15 expansion 0/5 drift-aborted (5-fill orchestration problem, separate work) 14:00 emergency 4/4 3 turns (1 citation) 15:45 misplacement 1/1 3 turns → T3 caught Patrick Ross double-booking across events → T3 flagged forklift cert drift on the event that failed → Cross-day lesson proposed "maintain buffer of ≥3 emergency candidates, pre-fetch certs for expansion, booking system cross-check" — real staffing advice, not generic linter output PRD PHASE 21 rewritten to reflect the actual primitive shape (two- call map-reduce with scratchpad glue) instead of the tourniquet approach originally documented. Rust port queued for next sprint. scripts/ab_t3_test.sh: A/B harness that chains B→C→D runs and emits tests/multi-agent/playbooks/ab_scorecard.json.	2026-04-20 20:19:02 -05:00
root	6e7ca1830e	Phase 21 foundation — context stability + chunking pipeline PRD: add Phase 20 (model matrix, wired) and Phase 21 (context stability, partial). Phase 21 exists because LLM Team hit this exact wall — running multi-model ranking on large context silently truncated, rankings degraded, no pipeline caught it. The stable answer: every agent call goes through a budget check against the model's declared context_window minus safety_margin, with a declared overflow_policy when the check fails. config/models.json: - context_window + context_budget per tier - overflow_policies block: summarize_oldest_tool_results_via_t3, chunk_lessons_via_cosine_topk, two_pass_map_reduce, escalate_to_kimi_k2_1t_or_split_decision - chunking_cache spec (data/_chunk_cache/, corpus-hash keyed) agent.ts: - estimateTokens() chars/4 biased safe ~15% - CONTEXT_WINDOWS table (fallback; prod reads models.json) - assertContextBudget() — throws on overflow with exact numbers, can bypass with bypass_budget:true for callers with their own policy - Wired into generate() and generateCloud() so EVERY call is checked scenario.ts: - T3 lesson archive to data/_playbook_lessons/*.json (the old /vectors/playbook_memory/seed path was silently failing with HTTP 400 because it requires 'fill: Role xN in City, ST' operation shape) - loadPriorLessons() at scenario start — filters by city/state match, date-sorted, takes top-3 - prior_lessons.json archived per-run (honest signal for A/B) - guidanceFor() injects up to 2 prior lessons (≤500 chars each) into the executor's per-event context - Retrospective shows explicit "Prior lessons loaded: N" line Verified: mistral correctly rejects a 150K-char prompt (7532 tokens over), gpt-oss:120b accepts it with 90K headroom. The enforcement is in-band on every call now, not an afterthought. Full chunking service (Rust) remains deferred to the sprint this feeds: crates/aibridge/src/budget.rs + chunk.rs + storaged/chunk_cache.rs	2026-04-20 19:34:44 -05:00
root	03d723e7e6	Model matrix — 5 tiers, local hard workers + cloud overseers config/models.json is the authoritative catalog. Hot path (T1/T2) stays local; cloud is consulted only for overview (T3), strategic (T4), and gatekeeper (T5) calls. J named qwen3.5 + newer models (minimax-m2.7, glm-5, qwen3-next) specifically — all mapped with real reachable IDs verified against ollama.com/api/tags. Tier shape: - t1_hot mistral + qwen2.5 local — 50-200 calls/scenario - t2_review qwen2.5 + qwen3 local — 5-14 calls/event - t3_overview gpt-oss:120b cloud — 1-3 calls/scenario - t4_strategic qwen3.5:397b + glm-4.7 — 1-10 calls/day - t5_gatekeeper kimi-k2-thinking — 1-5 calls/day, audit-logged Rate budgets are declared in-config — Ollama Cloud paid tier is generous but we cap overview/strategic/gatekeeper so no single rogue scenario can blow the day's quota. Experimental rotation list wired but disabled by default. When enabled, T4 randomly routes 10% of calls to a rotating minimax/GLM/qwen-next/ deepseek/nemotron/cogito/mistral-large candidate, logs comparisons, and auto-promotes after 3 rotations of wins. Playbook versioning SPEC embedded under `playbook_versioning` key: every seed gets version + parent_id + retired_at + architecture_snapshot, so when a schema migration breaks a playbook we can pinpoint which change retired it. Implementation flagged for next sprint (touches gateway + catalogd + mcp-server) — not wired here. - scenario.ts now loads config/models.json at init, env vars still override - mcp-server exposes /models/matrix read-only so UI can render it	2026-04-20 19:24:41 -05:00
root	e4ae5b646e	T3 overview tier — mid-day checkpoints + cross-day lesson Hot path (T1/T2) stays mistral + qwen2.5. The new T3 tier runs a thinking model SPARINGLY — after every misplacement, every N-th event (default N=3), and once post-scenario for the cross-day lesson. - agent.ts: generateCloud() for Ollama Cloud (gpt-oss:120b etc). Uses the same /api/generate shape; thinking field is discarded. - scenario.ts: runOverviewCheckpoint + runCrossDayLesson. Outputs land in checkpoints.jsonl and lesson.md. Lesson also seeds playbook_memory under operation "cross-day-lesson-{date}" — future runs pick it up through the existing similarity boost. - Env knobs: LH_OVERVIEW_CLOUD=1 routes T3 to cloud, LH_OVERVIEW_MODEL overrides (default gpt-oss:20b local, gpt-oss:120b cloud), LH_T3_CHECKPOINT_EVERY controls cadence, LH_T3_DISABLE=1 turns it off. Why this shape: prior feedback_phase19_seed_text.md warned that verbose seeds dilute the embedding and silently kill the boost. T3's rich prose goes to lesson.md; the embedded "approach" + "context" stay terse. Verified end-to-end: local 20b checkpoint 10.9s, lesson 4.0s; cloud 120b lesson 3.7s. Cloud output is both faster AND more specific than local (sequenced, tactical, logging advice included).	2026-04-20 19:21:45 -05:00

... 3 4 5 6 7 ...

389 Commits