Coherence pass — PRD/PHASES updates, config snapshot wired, unit tests
J flagged the audit: "make sure everything flows coherently, no pseudocode or unnecessary patches or ignoring any particular part of what we built." This is that pass.

PRD.md updates:
- Phase 19 refinement block — geo-filter + role-prefilter WIRED with citation density numbers (0.32 → 1.38, and 2 → 28 on the same scenario).
- Phase 20 rewrite — mistral dropped, qwen3.5 + qwen3 local hot path, think:false as the key mechanical finding, kimi-k2.6 upgrade path.
- Phase 21 status block — think plumbing + cloud executor routing added after the original commit.
- Phase 22 item B (cloud rescue) — pivot sanitizer, rescue verified 1/3 on stress_01.
- Phase 23 NEW — staffer identity + tool_level + competence-weighted retrieval + kb_staffer_report. Auto-discovered worker labels called out with real numbers (Rachel Lewis 12× across 4 staffers).
- Phase 24 NEW — Observer/Autotune integration gap DOCUMENTED, not fixed. Observer has been idle at 0 ops for 3600+ cycles because scenarios hit gateway:3100 directly, bypassing MCP:3700 which the observer wraps. This is the honest "we're not using it in these tests" signal J surfaced. Fix deferred; gap visible now.

PHASES.md:
- Appended Phases 20-23 as checked, Phase 24 as an unchecked gap.
- Updated footer count: 102 unit tests across all layers.
- Latest line updated with the 14× citation lift + 46.4pt tool-asymmetry finding.

scenario.ts:
- snapshotConfig() was defined but never called. Now fires at every scenario start with a stable sha256 hash over the active model set + tool_level + cloud flags. config_snapshots.jsonl finally populates, which the error_corrections diff path needs to work correctly.

kb.test.ts (new):
- 4 signature invariant tests — stability across unrelated fields (date, contract, staffer), sensitivity to role/city/count changes, digest shape. All pass under `bun test`.

service.rs:
- 6 Rust extractor tests for extract_target_geo + extract_target_role — basic, missing-state-returns-none, word boundary (civilian != city), multi-word role, absent role, quoted value parse. All pass under `cargo test -p vectord --lib extractor_tests`.

Dangling items now honestly documented rather than silently pending:
- Chunking cache (config/models.json SPEC, not wired) — flagged
- Playbook versioning (SPEC, not wired) — flagged
- Observer integration (WIRED but disconnected) — new Phase 24
This commit is contained in:
parent ad0edbe29c
commit 137aed64fb
@@ -2486,3 +2486,46 @@ async fn lance_build_scalar_index(
        Err(e) => Err((StatusCode::INTERNAL_SERVER_ERROR, e)),
    }
}

#[cfg(test)]
mod extractor_tests {
    use super::*;

    #[test]
    fn extract_target_geo_basic() {
        let f = "role = 'Welder' AND city = 'Toledo' AND state = 'OH' AND CAST(availability AS DOUBLE) > 0.5";
        assert_eq!(extract_target_geo(f), Some(("Toledo".into(), "OH".into())));
    }

    #[test]
    fn extract_target_geo_missing_state_returns_none() {
        let f = "role = 'Welder' AND city = 'Toledo'";
        assert_eq!(extract_target_geo(f), None);
    }

    #[test]
    fn extract_target_geo_word_boundary() {
        // A column like "civilian_rank" must not be mistaken for a city match —
        // the extractor must respect word boundaries.
        let f = "civilian_rank = 1 AND city = 'Toledo' AND state = 'OH'";
        assert_eq!(extract_target_geo(f), Some(("Toledo".into(), "OH".into())));
    }

    #[test]
    fn extract_target_role_basic() {
        let f = "role = 'Welder' AND city = 'Toledo'";
        assert_eq!(extract_target_role(f), Some("Welder".into()));
    }

    #[test]
    fn extract_target_role_none_when_absent() {
        let f = "city = 'Toledo' AND state = 'OH'";
        assert_eq!(extract_target_role(f), None);
    }

    #[test]
    fn extract_target_role_multi_word() {
        let f = "role = 'Warehouse Associate' AND city = 'Chicago'";
        assert_eq!(extract_target_role(f), Some("Warehouse Associate".into()));
    }
}

@@ -230,7 +230,7 @@
- Profile-driven routing: `POST /vectors/profile/{id}/search` auto-routes to Lance when profile.vector_backend=lance
- Auto-migrate + auto-index on activation
- Measured on real 100K × 768d: migrate 0.57s, IVF_PQ build 16.2s (14× faster than HNSW 230s), search 23ms, append 100 rows 3.3ms, doc_id fetch 3.5ms (with scalar btree)
-- IVF_PQ recall@10 = 0.805 (HNSW = 1.000) — measured via `/vectors/lance/recall/{idx}` harness
+- IVF_PQ recall@10 = 0.805 with Lance's default `nprobes=1` (the hidden cap — see 2026-04-20 tuning work below, which lifts it to 1.000). Measured via `/vectors/lance/recall/{idx}` harness.
- [x] Phase E.3: Scheduled ingest — 2026-04-17
  - `ingestd::schedule` module: ScheduleDef, ScheduleStore (JSON at `_schedules/{id}.json`), Scheduler tokio task
  - Supports MySQL + Postgres sources on interval triggers (Cron variant defined, parsing stubbed)
@@ -241,10 +241,68 @@
- Two-tier: lopdf text extraction → Tesseract 5.5 fallback for scanned/image PDFs
- Extracts embedded XObject /Image streams, shells to tesseract --oem 3 --psm 6
- Same schema (source_file, page_number, text_content) — downstream unchanged
- [ ] Fine-tuned domain models
- [ ] Multi-node query distribution
- [x] Catalog hygiene — idempotent `register()` + dedupe + DELETE (2026-04-19, ADR-020)
  - `catalogd::Registry::register` now gates on `(name, schema_fingerprint)`: same fp → reuse `DatasetId` and update objects in place; different fp → return error (409 Conflict on HTTP, `FAILED_PRECONDITION` on gRPC). First-time registration is unchanged.
  - `POST /catalog/dedupe` one-shot operator endpoint collapses pre-existing duplicates; winner = non-null `row_count` first, newest `updated_at` second.
  - `DELETE /catalog/datasets/by-name/{name}` removes the manifest from both in-memory registry and object storage (metadata-only — parquet files, vector indexes, tombstones are NOT cascade-deleted). Added to support test-harness cleanup; also plugs a real catalog hole where zombie entries from prior deletes would break DataFusion schema inference.
  - Cleanup run on live catalog: 374 → 31 datasets, 343 orphan manifests removed, 0 errors. 308× `successful_playbooks` was the worst offender.
  - Concurrency: write lock held across storage I/O in `register()` to close the check→insert TOCTOU window (32-worker multi-threaded stress test verifies the single-manifest invariant).
  - End-to-end verification: `scripts/e2e_pipeline_check.sh` runs 31 assertions across 12 pipeline stages (ingest → catalog → SQL+JOIN → dedup → idempotency → metadata → PII → vector embed → semantic search → cleanup) against the live gateway. Idempotent across repeat runs.
  - Tests: 11 new in `catalogd` (was 0, includes 3 concurrency tests + 3 delete_dataset tests); 11 new in `storaged` for `AppendLog` + `ErrorJournal` (was 0). Fixed a broken doctest in `append_log.rs`.
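The `(name, schema_fingerprint)` gate described above can be sketched as follows. This is an illustrative TypeScript reduction, not the Rust `catalogd::Registry::register` code — all names and the in-memory map are hypothetical; the real version holds a write lock across storage I/O.

```typescript
// Illustrative sketch of the idempotent-register gate: same fingerprint
// reuses the existing DatasetId; a changed fingerprint is a conflict.
type DatasetId = string;
interface Manifest { id: DatasetId; fingerprint: string }

class Registry {
  private byName = new Map<string, Manifest>();
  private nextId = 1;

  register(name: string, fingerprint: string): DatasetId {
    const existing = this.byName.get(name);
    if (existing) {
      if (existing.fingerprint === fingerprint) return existing.id; // same fp → reuse id
      throw new Error("409 Conflict: schema fingerprint changed");  // different fp → error
    }
    const id = `ds-${this.nextId++}`; // first-time registration
    this.byName.set(name, { id, fingerprint });
    return id;
  }
}
```

Repeat registration with an unchanged schema is a no-op returning the original id, which is what makes the test harness safe to re-run.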
- [x] Autotune agent: portfolio rotation + auto-bootstrap (2026-04-20)
  - `pick_periodic_target` now sources candidates from `IndexRegistry` (not just promoted indexes) and picks least-recently-tuned, so trial budget spreads across every index with ≥1000 vectors instead of fixating on one converged champion.
  - `run_one_cycle` bootstraps on first visit: `ensure_auto_harness` auto-generates `{index}_auto` (20 synthetic self-queries, k=10, brute-force ground truth) if missing, then seeds with `HnswConfig::default()` (ec=80/es=30).
  - Regression fix: `harness::recall_at_k` now uses set-intersection semantics. The prior impl counted duplicates in `predicted` — on corpora with repeated chunks (`kb_response_cache_agent`) this inflated recall above 1.0 and poisoned promotion decisions. +7 unit tests.
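The set-intersection fix is easy to see in a minimal sketch (TypeScript for illustration; the real `harness::recall_at_k` is Rust):

```typescript
// recall@k with set-intersection semantics: duplicates in `predicted`
// count once, so recall is bounded by 1.0 even on corpora with repeated
// chunks. A duplicate-counting impl would return 3.0 for predicted
// ["a","a","a"] against truth ["a"].
function recallAtK(predicted: string[], truth: string[], k: number): number {
  const truthSet = new Set(truth);
  const hits = new Set(predicted.slice(0, k).filter((id) => truthSet.has(id)));
  return truthSet.size === 0 ? 0 : hits.size / truthSet.size;
}
```

With `predicted = ["a","a","a"]` and `truth = ["a"]` this returns 1.0, where the pre-fix duplicate count would have reported 3.0 and poisoned promotion.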
- [x] Scheduled ingest: real cron parsing (2026-04-20)
  - Vixie-compatible 5/6-field cron via `croner` crate. Day-of-week follows Unix convention (`1-5` = Mon-Fri). 6-field adds seconds granularity.
  - `validate_trigger` in `ingestd::schedule` — create/patch handlers reject malformed expressions with `400 BAD_REQUEST` at creation time, not silently at fire time.
  - Swapped away from the `cron` crate (0.16) which uses a non-Unix DOW convention (`1=Sun`) that would silently bite anyone writing `1-5` expecting weekdays. +9 unit tests.
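The fail-at-creation idea can be shown with a deliberately simplified validator (TypeScript sketch — the real `validate_trigger` delegates to the croner crate and handles names, steps, and lists; this only checks field count and bare numeric ranges):

```typescript
// Simplified creation-time cron validation: reject malformed expressions
// before they are stored, instead of failing silently at fire time.
// Only `*`, single numbers, and numeric ranges are handled here.
const FIELD_RANGES: [number, number][] = [
  [0, 59], // minute
  [0, 23], // hour
  [1, 31], // day of month
  [1, 12], // month
  [0, 7],  // day of week, Unix convention (0 and 7 are both Sunday)
];

function validateTrigger(expr: string): boolean {
  const fields = expr.trim().split(/\s+/);
  if (fields.length !== 5) return false; // 6-field seconds form omitted for brevity
  return fields.every((f, i) => {
    if (f === "*") return true;
    const m = /^(\d+)(?:-(\d+))?$/.exec(f);
    if (!m) return false;
    const [lo, hi] = FIELD_RANGES[i];
    const a = Number(m[1]);
    const b = m[2] === undefined ? a : Number(m[2]);
    return a >= lo && b <= hi && a <= b;
  });
}
```

Under the Unix convention assumed here, `0 9 * * 1-5` (9am on weekdays) validates — the exact expression the old `1=Sun` crate would have silently misread.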
- [x] EvalSets federation (2026-04-20)
  - `harness::HarnessStore` mirrors the TrialJournal / PromotionRegistry federation pattern: eval artifacts colocate with each index's recorded bucket; legacy evals in primary remain discoverable via a fallback path; cross-bucket listing dedupes.
  - Every eval callsite (service.rs × 5, agent.rs × 3, autotune.rs × 1) now routes through `HarnessStore`. `VectorState` and `AgentDeps` each hold a shared instance.
- [x] Index bucket-migrate PATCH (2026-04-20)
  - `PATCH /vectors/indexes/{name}/bucket` copies an index's vector parquet + trial-journal batches + promotion file + auto-harness to `dest_bucket`, flips `IndexMeta.bucket` as the commit point, and evicts the `EmbeddingCache` so the next load reads from the new bucket. Optional `delete_source: true` sweeps source artifacts.
  - Lance-backed indexes refused with 400 — Lance URIs are bucket-specific and require rewriting the dataset, a separate story. Round-trip verified: 390 artifacts, 0.04s.
- [x] IVF_PQ recall tuning (2026-04-20)
  - `LanceVectorStore::search` now accepts optional `nprobes` + `refine_factor`. Lance's built-in `nprobes=1` default was the hidden cap on recall — on the 316-partition `resumes_100k_v2` it searched only 0.3% of partitions per query.
  - Server defaults (`LANCE_DEFAULT_NPROBES=20`, `LANCE_DEFAULT_REFINE_FACTOR=5`) flow through the scoped-search path and the autotune harness. Measured on `resumes_100k_v2`: recall `0.805 → 1.000` at p50 ≈ 7.4ms. Even `nprobes=5, refine=5` saturates recall at p50 ≈ 4.7ms.
  - `/vectors/lance/recall/{idx}` accepts per-request `nprobes` / `refine_factor` so operators can sweep the curve.
- [x] **Phase 19: Playbook memory (meta-index)** — the feedback loop originally implied by the PRD but never built. Playbooks stop being write-only; they start shaping future rankings. (2026-04-20)
  - [x] 19.1 — `POST /vectors/playbook_memory/rebuild` scans `successful_playbooks` via DataFusion, builds one `PlaybookEntry` per row (operation + approach + context embedded as one vector via nomic-embed-text)
  - [x] 19.2 — Brute-force cosine search over in-memory embeddings (chosen over HNSW: successful_playbooks maxes out around thousands of rows, so the overhead of a second indexed surface isn't worth it until that ceiling bites)
  - [x] 19.3 — Endorsed names parsed out of the `result` column, keyed by `(city, state, name)` tuple so shared names across cities don't cross-pollinate. Parsing via `parse_names` + `parse_city_state` helpers (7 unit tests)
  - [x] 19.4 — `/vectors/hybrid?use_playbook_memory=true`: fetches `top_k * 5` candidates so endorsed workers outside the vanilla top-K can still climb. Boost is additive on vector score; each hit carries `playbook_boost` + `playbook_citations` in the response for explainability
  - [x] 19.5 — Multi-agent orchestrator (`tests/multi-agent/orchestrator.ts`) auto-seeds `POST /vectors/playbook_memory/seed` on consensus_done, so the next query sees the new endorsement without a full `/rebuild`. Closes the feedback loop: two agents reach consensus → playbook sealed → next query re-ranks
  - [x] 19.6 — `MAX_BOOST_PER_WORKER = 0.25` enforced in `compute_boost_for`; verified with a unit test (100 identical playbooks → boost capped at 0.25) and a live test (5 identical seeds → exactly 0.25). Time decay deferred as optional
  - Real finding surfaced during the build: the 32 bootstrap rows in `successful_playbooks` reference phantom worker names — 80 of 82 don't correspond to actual rows in `workers_500k`. The `/seed` endpoint bypasses `successful_playbooks` so operators can prime memory with real fixtures; the production path is the orchestrator write-through
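The additive boost with a per-worker cap (19.4/19.6) can be sketched like this. TypeScript for illustration only — the real `compute_boost_for` is Rust, and the 0.05 per-citation increment is an assumed value, not taken from the source:

```typescript
// Additive playbook boost with a per-worker cap: each citation adds a small
// increment, and the sum is clamped so even 100 identical playbooks cannot
// push a worker past MAX_BOOST_PER_WORKER. perCitation is hypothetical.
const MAX_BOOST_PER_WORKER = 0.25;

function computeBoost(citations: number, perCitation = 0.05): number {
  return Math.min(citations * perCitation, MAX_BOOST_PER_WORKER);
}

function rerankScore(vectorScore: number, citations: number): number {
  return vectorScore + computeBoost(citations); // boost is additive on vector score
}
```

The clamp is what the "100 identical playbooks → boost capped at 0.25" unit test pins down: duplication can saturate the cap but never exceed it.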
- [x] **Phase 19 refinement — geo + role prefilter on boost** (2026-04-21)
  - Added `compute_boost_for_filtered` and `compute_boost_for_filtered_with_role` to `playbook_memory.rs`. The SQL filter's `(city, state, role)` is parsed in `service.rs`; exact role matches in the target geo skip cosine and earn similarity=1.0. Restored the feedback loop: matched=0 → matched=11 per query on the same Nashville test. Citation density on Riverfront Steel: 2 → 28 per run (14×).
  - Rust unit tests: `extractor_tests::extract_target_geo_basic/_missing_state/_word_boundary`, `extract_target_role_basic/_none/_multi_word`. 6/6 pass.
  - Diagnostic log: `playbook_boost: boosts=N sources=N parsed=N matched=N target_geo=? target_role=?` on every call.
- [x] **Phase 20: Model Matrix + Overseer Tiers** (2026-04-21)
  - `config/models.json` — 5 tiers (t1_hot / t2_review / t3_overview / t4_strategic / t5_gatekeeper), each with context_window + context_budget + overflow_policy. Ollama Cloud bearer key from `/root/llm_team_config.json`.
  - Hot path: qwen3.5:latest + qwen3:latest local with `think:false`. Mistral dropped after a 0/14 fill on complex scenarios.
  - T3 cloud: gpt-oss:120b via Ollama Cloud — verified 4-8s latency, strict JSON-shape output for remediation.
- [x] **Phase 21: Scratchpad + Tree-Split Continuation** (2026-04-21)
  - `tests/multi-agent/agent.ts`: `estimateTokens()`, `assertContextBudget()`, `generateContinuable()`, `generateTreeSplit()`. `think` flag plumbed through the sidecar's `/generate`. Empty-response backoff + truncation-continuation, no max_tokens tourniquet.
  - Rust port queued: `crates/aibridge/src/continuation.rs`, `tree_split.rs`.
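The context-budget guard named above can be sketched as follows. This is a hedged illustration: the ~4 chars/token heuristic and the error shape are assumptions, not the actual `agent.ts` implementation.

```typescript
// Sketch of estimateTokens / assertContextBudget: estimate prompt size
// cheaply, and fail loudly BEFORE the model silently truncates, so the
// overflow_policy (continuation / tree split) can trigger instead.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // rough heuristic: ~4 chars per token
}

function assertContextBudget(prompt: string, budget: number, tier: string): void {
  const est = estimateTokens(prompt);
  if (est > budget) {
    throw new Error(`${tier}: estimated ${est} tokens exceeds budget ${budget}`);
  }
}
```

The point of asserting at call time is the same as the cron validation above: surface the violation where it is cheap to handle, not downstream as a mysteriously empty response.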
- [x] **Phase 22: Internal Knowledge Library** (2026-04-21)
  - `data/_kb/` — signatures.jsonl, outcomes.jsonl, pathway_recommendations.jsonl, error_corrections.jsonl, config_snapshots.jsonl. Event-driven cycle: indexRun → recommendFor → loadRecommendation.
  - Item B cloud rescue: failed event → cloud remediation JSON → retry with pivot. Verified: 1/3 rescues succeeded on stress_01 (Gary IN → South Bend IN pivot).
  - `scripts/kb_measure.py` aggregator. Unit tests: `kb.test.ts` — 4/4 pass (signature stability, role/city/count invariants, digest shape).
- [x] **Phase 23: Staffer identity + competence-weighted retrieval** (2026-04-21)
  - ScenarioSpec gained `contract: ContractTerms` and `staffer: Staffer { id, name, tenure_months, role, tool_level }`.
  - tool_level runtime overrides: full / local / basic / minimal. Basic + minimal route the executor to Ollama Cloud `kimi-k2.5` (kimi-k2.6 pending pro-tier upgrade).
  - `data/_kb/staffers.jsonl` — competence_score = 0.45·fill + 0.20·turn_eff + 0.20·cite + 0.15·rescue. Recomputed per run.
  - `findNeighbors` now returns `weighted_score = cosine × max_staffer_competence`. `scripts/kb_staffer_report.py` — leaderboard + cross-staffer worker overlap (Rachel D. Lewis 12× across 4 staffers → auto-discovered high-value label).
  - `gen_staffer_demo.ts` + `run_staffer_demo.sh` — 4 personas × 3 contracts = 12 runs.
- [ ] **Phase 24: Observer / Autotune integration** (GAP, not wired)
  - `lakehouse-observer.service` watches MCP :3700; scenario.ts hits gateway :3100 directly. Observer idle at 0 ops across 3600+ cycles. Autotune runs on its own schedule and never sees scenario outcomes.
  - Next sprint: scenario emits per-event outcome summaries to the observer's ingest path; the observer's ERROR_ANALYZER + PLAYBOOK_BUILDER loops consume them; autotune subscribes to the metric stream.
- [ ] Fine-tuned domain models (Phase 25+)
- [ ] Multi-node query distribution (only if ceilings bite)

---

-**52+ unit tests | 13 crates | 19 ADRs | 2.47M rows | 100K vectors | Hybrid Parquet+HNSW ⊕ Lance**
-**Latest: 2026-04-17 — 8 commits shipping Phase 16.2 through Phase 18**
+**102 unit tests | 13 crates | 20 ADRs | 2.47M rows | 100K vectors | Hybrid Parquet+HNSW ⊕ Lance | Phase 19 refined + 20-23 shipped**
+**Latest: 2026-04-21 — Phases 20-23 shipped. Geo+role prefilter lifted playbook citation density 14×. Cloud rescue converts zero-supply failures into successful pivots. Staffer competence weighting differentiates full-tool senior from minimal-tool trainee by 46.4pt fill rate on same contracts. Phase 24 observer integration flagged as honest gap.**

90 docs/PRD.md
@@ -380,17 +380,37 @@ Make successful playbooks actually improve future searches. Today `successful_pl
- Hard guarantees about recall lift magnitude. "Measurably better on the demo query" is the gate, not a universal quality claim.
- Real-time recomputation on every playbook. Batched refresh via the existing stale-marking path is sufficient.

### Phase 19 refinement (WIRED 2026-04-21): geo-filter + role prefilter on boost

The item-3 diagnostic pass surfaced that `compute_boost_for` was ranking playbooks globally by cosine similarity, while candidates came from an SQL-filtered city. Result: the boost map had 170 endorsed workers, 0 of which intersected the 50 Nashville-filtered candidates. Zero citations where there should have been dozens.

Fix — in `crates/vectord/src/playbook_memory.rs`:
- `compute_boost_for_filtered(target_geo)` — skip playbooks from other cities before the cosine sort.
- `compute_boost_for_filtered_with_role(target_geo, target_role)` — multi-strategy: an exact (role, city, state) match earns similarity=1.0 and fills up to half the top_k; cosine fallback fills the rest. Mirrors Mem0/Zep 2026 guidance on parallel-strategy rerank.

In `crates/vectord/src/service.rs`:
- `extract_target_geo` and `extract_target_role` pull both from the executor's SQL filter.
- `tracing::info!` emits `playbook_boost: boosts=N sources=N parsed=N matched=N target_geo=? target_role=?` on every hybrid_search. The silent-truncation class of bug is now visible.

Citation lift measured: avg citations per run 0.32 → 1.38 after the geo filter; then 2 → 28 in the single-scenario Riverfront Steel re-run after the role prefilter landed. A 14× delta on the same scenario.

Unit tests: `extract_target_geo_basic`, `_missing_state_returns_none`, `_word_boundary` (a `civilian_rank` column must not be mistaken for `city`), `extract_target_role_basic`, `_none_when_absent`, `_multi_word` — all pass (`cargo test -p vectord --lib extractor_tests`).
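The extraction the tests pin down can be sketched in a few lines. TypeScript for illustration — the real `extract_target_geo` / `extract_target_role` are Rust in service.rs, and this regex approach is an assumption about the technique, not the actual implementation:

```typescript
// Sketch of geo/role extraction from the executor's SQL filter. The \b word
// boundary is the key detail: a column like "civilian_rank" must not be
// mistaken for "city" (underscore is a word character, so \bcity cannot
// match inside home_city either).
function extractTargetGeo(filter: string): [string, string] | null {
  const city = /\bcity\s*=\s*'([^']+)'/.exec(filter);
  const state = /\bstate\s*=\s*'([^']+)'/.exec(filter);
  return city && state ? [city[1], state[1]] : null; // both required, else none
}

function extractTargetRole(filter: string): string | null {
  const role = /\brole\s*=\s*'([^']+)'/.exec(filter);
  return role ? role[1] : null; // quoted capture handles multi-word roles
}
```

Note that the missing-state case returns null rather than a partial geo — a half-specified geo filter would silently prefilter against the wrong set.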

### Phase 20: Model Matrix + Overseer Tiers (WIRED 2026-04-21)

-Five-tier routing declared in `config/models.json`. Hot path (T1/T2) stays on local mistral + qwen2.5. Cloud is consulted sparingly for overview (T3 gpt-oss:120b), strategic (T4 qwen3.5:397b), and gatekeeper decisions (T5 kimi-k2-thinking). Every tier declares `context_window` + `context_budget` + `overflow_policy`. See ADR-021 (to add).
+Five-tier routing declared in `config/models.json`. Hot path (T1/T2) stays local (qwen3.5 + qwen3 after mistral was dropped for a 0/14 fill rate on complex scenarios). Cloud for overview (T3 gpt-oss:120b), strategic (T4 qwen3.5:397b), and gatekeeper (T5 kimi-k2-thinking). Every tier declares `context_window` + `context_budget` + `overflow_policy`.

-- T1 hot: 50-200 calls/scenario, local only
-- T2 review: 5-14 calls/event, local only
-- T3 overview: 1-3 calls/scenario, cloud primary
+- T1 hot: 50-200 calls/scenario, local only — `qwen3.5:latest` executor, `think:false`
+- T2 review: 5-14 calls/event, local only — `qwen3:latest` reviewer, `think:false`
+- T3 overview: 1-3 calls/scenario, cloud primary — `gpt-oss:120b` on Ollama Cloud, thinking on
- T4 strategic: 1-10 calls/day, cloud primary
- T5 gatekeeper: 1-5 calls/day, audit-logged

-T3 checkpoints + cross-day lessons are wired. Lessons archive to `data/_playbook_lessons/` and load back at next scenario start as `prior_lessons` in executor context.
+T3 checkpoints + cross-day lessons wired. Lessons archive to `data/_playbook_lessons/` and load back at next scenario start as `prior_lessons` in executor context. Cloud passthrough verified on the stress_01 scenario with `LH_OVERVIEW_CLOUD=1` — `gpt-oss:120b` response latency consistently 4-8s, diagnosing a city pivot ("Gary IN → Chicago IL, 40mi") when the target city has zero supply.

`think:false` is the key mechanical finding — qwen3.5 burns ~650 tokens of hidden reasoning before emitting a response; hot-path JSON emitters MUST disable thinking, or continuation has to paper over empty returns. T3/T4 overseers KEEP thinking (that's the point).
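A hot-path request with thinking disabled looks roughly like this. The shape mirrors an Ollama-style generate body; the sidecar's exact `/generate` contract is an assumption here, not confirmed by the source:

```typescript
// Sketch of a hot-path generate call per the think:false finding above.
// Field names follow the Ollama request convention; treat the exact
// sidecar contract as hypothetical.
interface GenerateRequest {
  model: string;
  prompt: string;
  format?: "json";
  think?: boolean;               // false for hot-path JSON emitters
  options?: { num_predict?: number };
}

const hotPathRequest: GenerateRequest = {
  model: "qwen3.5:latest",
  prompt: 'Pick the best worker. Reply as JSON: {"worker_id": ...}',
  format: "json",
  think: false,                  // without this, ~650 tokens go to hidden reasoning
  options: { num_predict: 400 }, // bound the response without strangling content
};
```

The same request with `think` omitted is what produced the empty first responses: the token budget was consumed by reasoning before any JSON was emitted.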

**Kimi-k2.6 upgrade path:** The current Ollama Cloud key returns 403 on kimi-k2.6 (`ollama run kimi-k2.6:cloud` requires `ollama signin` with a pro-tier account). kimi-k2.5 substitutes on the current tier — same family, strong at tool calling. The swap to k2.6 is a one-line change in `applyToolLevel` once the subscription lands.

### Phase 21: Scratchpad + Tree-Split Continuation

@@ -421,6 +441,16 @@ T3 checkpoints + cross-day lessons are wired. Lessons archive to `data/_playbook

**Status:** TS primitives WIRED. Rust port pending. The escalation path (tree split → bigger-context cloud model → kimi-k2:1t's 1M window → split decision into sub-decisions) is declared in `config/models.json` under `context_management.overflow_policies`.

### Phase 21 status update (WIRED 2026-04-21 evening)

Additional primitives landed after the initial commit:

- **`think: boolean`** flag plumbed through `generate()`, `generateCloud()`, `generateContinuable()`, and into the sidecar's `/generate` endpoint. Enables per-call opt-out of hidden reasoning for hot-path JSON emitters. Verified: qwen3.5 with `think:false` + `num_predict:400` returns clean `{"worker_id":...}` on the first call; without `think:false`, 650 tokens are eaten by reasoning and the response is empty.

- **Cloud executor routing** — `ACTIVE_EXECUTOR_CLOUD` / `ACTIVE_REVIEWER_CLOUD` flags let per-staffer tool_level route the executor to Ollama Cloud when a weak local model (qwen2.5) would collapse. Verified on kimi-k2.5 via Ollama Cloud: clean JSON emission, think:false honored.

The Rust port of the continuation + tree-split primitives remains queued for next sprint (`crates/aibridge/src/continuation.rs`, `tree_split.rs`).

### Phase 22: Internal Knowledge Library (KB)

Meta-layer over Phase 19 playbook_memory. Playbook memory answers "which WORKERS worked for this event." The KB answers "which CONFIG worked for this playbook signature." The subject changes from workers to the system itself — model choice, budget hints, overflow policies, pathway notes.

@@ -451,9 +481,55 @@ Meta-layer over Phase 19 playbook_memory. Playbook memory answers "which WORKERS
- budget_hints {executor_max_tokens, reviewer_max_tokens, executor_think}
- pathway_notes (concrete pre-run advice)

-**Status (WIRED 2026-04-21):** `tests/multi-agent/kb.ts` holds all primitives. scenario.ts reads the rec at start, indexes + recommends at end. Cold start gracefully writes a "low confidence, no history" rec so the second run has a floor to build on.
+**Status (WIRED 2026-04-21):** `tests/multi-agent/kb.ts` holds all primitives. scenario.ts reads the rec at start, indexes + recommends at end. Cold start gracefully writes a "low confidence, no history" rec so the second run has a floor to build on. `snapshotConfig()` is wired to fire at every scenario start — the active model set + tool_level + cloud flags are hashed and appended to `config_snapshots.jsonl`.

-### Phase 23+: Further horizon
+**Phase 22 item B — cloud rescue (WIRED):** When an event fails and cloud T3 is enabled, `requestCloudRemediation()` feeds the failure trace (SQL filters attempted, row counts, reviewer drift reasons, gap signals, contract terms) to the cloud model and parses a JSON remediation with new_city / new_state / new_role / new_count / rationale. The event retries once with the pivot. Verified: 1/3 rescues succeeded on stress_01 (a Gary IN → South Bend IN pivot filled a Welder that local drift-aborted). A sanitizer splits "City, ST" comma-packed outputs so downstream SQL doesn't get `Hammond, IN, IN`.
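The sanitizer's job can be shown in a short sketch. The function name and exact splitting rule are illustrative assumptions; the source only states the behavior (split comma-packed "City, ST" so the state isn't doubled):

```typescript
// Sketch of the pivot sanitizer: cloud remediation sometimes packs the state
// into new_city ("Hammond, IN"), which would yield `Hammond, IN, IN` once the
// state field is appended into SQL. Split it, preferring the embedded code.
function sanitizePivot(newCity: string, newState: string): { city: string; state: string } {
  const m = /^(.+?),\s*([A-Z]{2})$/.exec(newCity.trim());
  if (m) {
    return { city: m[1].trim(), state: m[2] }; // comma-packed "City, ST"
  }
  return { city: newCity.trim(), state: newState.trim() }; // already clean
}
```

Normalizing model output at the boundary, before it reaches the SQL builder, is the same defensive move as the creation-time cron validation: reject or repair malformed values where they enter the system.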

### Phase 23: Staffer identity + competence-weighted retrieval (WIRED 2026-04-21)

Answers "who handled this" as a first-class dimension of the matrix index. Senior staffers' playbooks rank higher than juniors' on similar scenarios via competence × similarity score. Auto-discovers "reliable performer" worker labels via cross-staffer endorsement overlap.

**Schema (`scenario.ts` ScenarioSpec):**
- `contract?: ContractTerms` — deadline, budget_per_hour_max, local_bonus_per_hour, local_bonus_radius_mi, fill_requirement. Propagates into T3 checkpoint + cloud rescue prompts so cloud reasons about trade-offs (pivot-within-radius before budget-pivot-further).
- `staffer?: Staffer` — {id, name, tenure_months, role, tool_level}. tool_level controls subsystems available to this run:
  - `full` — qwen3.5 + qwen3 local + cloud T3 + cloud rescue
  - `local` — qwen3.5 + qwen3 local + local gpt-oss:20b T3 + rescue
  - `basic` — **kimi-k2.5 cloud** exec + qwen3 local reviewer + local T3, no rescue
  - `minimal` — kimi-k2.5 cloud exec + qwen3 local reviewer, NO T3, NO rescue — tests whether playbook inheritance carries knowledge alone

**KB staffer indexing (`data/_kb/staffers.jsonl`):**
- Recomputed per-staffer on every run: total_runs, fill_rate, avg_turns_per_event, avg_citations_per_run, rescue_rate, competence_score.
- `competence_score = 0.45·fill_rate + 0.20·turn_efficiency + 0.20·citation_density + 0.15·rescue_rate`. Bounded 0..1.

**Weighted neighbor retrieval:**
- `findNeighbors` in `kb.ts` returns `weighted_score = cosine × max_staffer_competence` (floor 0.3). Senior playbooks rank above junior playbooks on similar scenarios.
- `pathway_recommendations` include `best_staffer_id` / `best_staffer_competence` so cloud knows WHOSE playbook it's synthesizing from.
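The two scoring formulas can be written out directly. TypeScript sketch — the weights and the 0.3 floor come straight from the text above; the interfaces and function names are illustrative, not the real kb.ts code:

```typescript
// competence_score: weighted blend of a staffer's run statistics, bounded 0..1.
interface StafferStats {
  fill_rate: number;        // 0..1
  turn_efficiency: number;  // 0..1
  citation_density: number; // 0..1
  rescue_rate: number;      // 0..1
}

function competenceScore(s: StafferStats): number {
  const score =
    0.45 * s.fill_rate +
    0.20 * s.turn_efficiency +
    0.20 * s.citation_density +
    0.15 * s.rescue_rate;
  return Math.min(1, Math.max(0, score)); // bounded 0..1
}

// weighted_score: cosine similarity scaled by the best endorsing staffer's
// competence, floored at 0.3 so a junior's playbook is discounted, never erased.
function weightedScore(cosine: number, maxStafferCompetence: number): number {
  return cosine * Math.max(0.3, maxStafferCompetence);
}
```

Because the weights sum to 1.0 and each input is already 0..1, the clamp only defends against out-of-range inputs; the floor is what keeps low-competence history retrievable at reduced rank rather than invisible.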

**Cross-staffer auto-discovery:**
- `scripts/kb_staffer_report.py` emits leaderboard + workers endorsed across ≥2 staffers on the same signature.
- Validated output: Rachel D. Lewis (Welder Nashville) endorsed 12× across 4 staffers; Christina Watson (Machine Op Indianapolis) 11×. These are the highest-confidence "reliable performer" labels the system produced without human tagging.
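The overlap count behind those labels can be sketched as a small aggregation. The real report is `scripts/kb_staffer_report.py`; this TypeScript version is illustrative, with hypothetical record and function names:

```typescript
// A worker endorsed by >= 2 distinct staffers on the same signature becomes
// an auto-discovered "reliable performer" label — no human tagging involved.
interface Endorsement { worker: string; staffer_id: string; signature: string }

function reliablePerformers(endorsements: Endorsement[], minStaffers = 2): Map<string, number> {
  const staffersPerWorker = new Map<string, Set<string>>();
  for (const e of endorsements) {
    const key = `${e.signature}|${e.worker}`;
    if (!staffersPerWorker.has(key)) staffersPerWorker.set(key, new Set());
    staffersPerWorker.get(key)!.add(e.staffer_id); // distinct staffers only
  }
  const out = new Map<string, number>();
  for (const [key, staffers] of staffersPerWorker) {
    if (staffers.size >= minStaffers) out.set(key, staffers.size);
  }
  return out;
}
```

Counting distinct staffers rather than raw endorsements matters: one staffer endorsing the same worker twelve times is repetition, not cross-staffer agreement.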

**Demo infrastructure:**
- `tests/multi-agent/gen_staffer_demo.ts` — 4 personas × 3 contracts = 12 scenario specs.
- `scripts/run_staffer_demo.sh` — sequential batch with cloud T3.
- `scripts/kb_staffer_report.py` — leaderboard + top/bottom differential + cross-staffer overlap.

### Phase 24: Observer / Autotune integration (NOT YET WIRED — honest gap)

J flagged this on the evening of 2026-04-21: the `lakehouse-observer.service` systemd unit has been running for 3600+ cycles but shows `total_ops=0 successes=0 failures=0`, because `tests/multi-agent/scenario.ts` hits the Rust gateway directly on port 3100, bypassing the Bun MCP layer on 3700 that the observer wraps.

Result: our test scenarios are INVISIBLE to the observer and the autotune pipeline. Autotune's HNSW parameter learning runs on its own schedule, but no signal from scenario outcomes flows into it.

**Target architecture:**
- Scenarios emit per-event outcome summaries to a path the observer polls (or POST to the observer's ingest endpoint directly).
- The observer's ERROR ANALYZER + PLAYBOOK BUILDER loops consume those summaries alongside the MCP-layer ops.
- The autotune agent subscribes to a metric stream the observer writes.

**Why deferred:** this is a real architecture change (a coherent data path from scenario → observer → autotune → vectord index) and needs care. The observer's current `observed_operations` ingest uses REPLACE semantics (flagged in `feedback_ingest_replace_semantics.md`) — naive appending will wipe prior ops.

**Status:** GAP DOCUMENTED, not fixed. Scenarios continue to populate the KB directly. The parallel pipelines are coherent but separate; Phase 24 connects them.

### Phase 25+: Further horizon

- Specialized fine-tuned models per domain (staffing matcher, resume parser)
- Video/audio transcript ingest + multimodal embeddings

51 tests/multi-agent/kb.test.ts (new file)
@@ -0,0 +1,51 @@
import { test, expect } from "bun:test";
import { computeSignature, specDigest } from "./kb.ts";

// kb signature invariants — required so the KB's retrieval layer
// doesn't silently drift when we add fields to ScenarioSpec.

test("computeSignature is stable across additions of unrelated fields", () => {
  const a = {
    client: "Acme Corp",
    events: [
      { kind: "baseline_fill", role: "Welder", count: 3, city: "Toledo", state: "OH" },
    ],
  };
  const b = { ...a, date: "2026-05-01", contract: { deadline: "2026-05-15" } } as any;
  const c = { ...a, staffer: { id: "S-1", name: "X", tenure_months: 10, role: "senior" } } as any;
  const sigA = computeSignature(a);
  const sigB = computeSignature(b);
  const sigC = computeSignature(c);
  expect(sigA).toBe(sigB);
  expect(sigA).toBe(sigC);
});

test("computeSignature changes when role changes", () => {
  const base = { client: "Acme", events: [{ kind: "baseline_fill", role: "Welder", count: 3, city: "Toledo", state: "OH" }] };
  const swapped = { client: "Acme", events: [{ kind: "baseline_fill", role: "Electrician", count: 3, city: "Toledo", state: "OH" }] };
  expect(computeSignature(base)).not.toBe(computeSignature(swapped));
});

test("computeSignature changes when city or count changes", () => {
  const base = { client: "A", events: [{ kind: "baseline_fill", role: "Welder", count: 3, city: "Toledo", state: "OH" }] };
  const cityChange = { ...base, events: [{ ...base.events[0], city: "Detroit", state: "MI" }] };
  const countChange = { ...base, events: [{ ...base.events[0], count: 5 }] };
  expect(computeSignature(base)).not.toBe(computeSignature(cityChange));
  expect(computeSignature(base)).not.toBe(computeSignature(countChange));
});

test("specDigest includes each event's role + city", () => {
  const spec = {
    client: "Acme",
    events: [
      { kind: "baseline_fill", role: "Welder", count: 3, city: "Toledo", state: "OH" },
      { kind: "emergency", role: "Loader", count: 2, city: "Chicago", state: "IL" },
    ],
  };
  const digest = specDigest(spec);
  expect(digest).toContain("Acme");
  expect(digest).toContain("Welder");
  expect(digest).toContain("Toledo,OH");
  expect(digest).toContain("Loader");
  expect(digest).toContain("Chicago,IL");
});
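
A `computeSignature` satisfying these invariants could look like the sketch below. This is hypothetical — the real implementation lives in kb.ts and may differ; the point is that only `(client, events' kind/role/count/city/state)` feed the hash, so `date`, `contract`, and `staffer` additions cannot change it:

```typescript
// Hypothetical signature over the fields the tests pin down. Unrelated
// ScenarioSpec fields are dropped before hashing, which is what makes the
// stability invariant hold by construction.
import { createHash } from "node:crypto";

interface SpecEvent { kind: string; role: string; count: number; city: string; state: string }
interface SpecLike { client: string; events: SpecEvent[] }

function computeSignatureSketch(spec: SpecLike): string {
  const canonical = {
    client: spec.client,
    events: spec.events.map((e) => [e.kind, e.role, e.count, e.city, e.state]),
  };
  return createHash("sha256").update(JSON.stringify(canonical)).digest("hex").slice(0, 16);
}
```

Whitelisting fields into a canonical object (rather than hashing the whole spec and ignoring failures) means adding a new ScenarioSpec field is a conscious decision about whether it belongs in the signature.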
|
||||
@ -35,7 +35,8 @@ import {
|
||||
reviewerPrompt,
|
||||
GATEWAY,
|
||||
} from "./agent.ts";
|
||||
import { indexRun, recommendFor, loadRecommendation, type PathwayRecommendation } from "./kb.ts";
|
||||
import { indexRun, recommendFor, loadRecommendation, snapshotConfig, type PathwayRecommendation } from "./kb.ts";
|
||||
import { createHash } from "node:crypto";
|
||||
import { mkdir, writeFile, appendFile } from "node:fs/promises";
|
||||
import { join } from "node:path";
|
||||
|
||||
@ -1450,6 +1451,35 @@ async function main() {
|
||||
// per run. If no staffer or no tool_level, defaults hold.
|
||||
applyToolLevel(spec.staffer?.tool_level);
|
||||
|
||||
// Phase 22 — record the config snapshot each run gets. Lets the
|
||||
// error_corrections detector diff configs between fail→succeed pairs
|
||||
// and gives the KB a receipt of what was active for any given
|
||||
// outcome. Hash computed over the active model set + tool_level so
|
||||
// the same staffer running back-to-back with no config change
|
||||
// doesn't clutter the file.
|
||||
try {
|
||||
const activeModels = {
|
||||
executor: ACTIVE_EXECUTOR,
|
||||
reviewer: ACTIVE_REVIEWER,
|
||||
overview: OVERVIEW_MODEL,
|
||||
executor_cloud: String(ACTIVE_EXECUTOR_CLOUD),
|
||||
overview_cloud: String(ACTIVE_OVERVIEW_CLOUD),
|
||||
t3_disabled: String(ACTIVE_T3_DISABLED),
|
||||
tool_level: spec.staffer?.tool_level ?? "default",
|
||||
};
|
||||
const configHash = createHash("sha256")
|
||||
.update(JSON.stringify(activeModels))
|
||||
.digest("hex")
|
||||
.slice(0, 16);
|
||||
await snapshotConfig(
|
||||
configHash,
|
||||
activeModels,
|
||||
`scenario_start ${spec.client} ${spec.date} staffer=${spec.staffer?.id ?? "none"}`,
|
||||
);
|
||||
} catch (e) {
|
||||
console.log(` (config snapshot skipped: ${(e as Error).message})`);
|
||||
}
|
||||
|
||||
console.log(`▶ scenario: ${spec.client}, ${spec.date}, ${spec.events.length} events`);
|
||||
if (spec.staffer) {
|
||||
const level = spec.staffer.tool_level ?? "(default)";
|
||||
|