From 97dd3f826d4541158a3b0c2b94f871742ed49e85 Mon Sep 17 00:00:00 2001 From: root Date: Wed, 29 Apr 2026 20:27:41 -0500 Subject: [PATCH] =?UTF-8?q?SPEC=20=C2=A73.5/=C2=A73.6/=C2=A73.7/=C2=A73.8?= =?UTF-8?q?=20=E2=80=94=20name=20F/B/C=20as=20port=20targets=20+=20add=20A?= =?UTF-8?q?rchon-style=20workflow=20runner?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per the 2026-04-29 scope-discipline pause: the wave shipped four pieces beyond SPEC §3.4 component scope, and one architectural pattern surfaced (Archon-style multi-pass workflow runner) that's the observer's natural growth path. Document them as port targets so the next scrum review has authoritative SPEC components. §3.5 — Drift quantification (loop 5 of the PRD) Names the SCORER drift work shipped in be65f85 + the deferred shapes (PLAYBOOK drift, EMBEDDING drift, AUDIT BASELINE drift). Acceptance gates G3.5.A–B. §3.6 — Staffing-side structured filter Names the metadata-filter MVP shipped in b199093 + the deferred pre-retrieval SQL gate via queryd. Acceptance gates G3.6.A–C. §3.7 — Operational rating wiring Names the bulk playbook-record endpoint shipped in 6392772 + the deferred UI shim, negative-feedback path, and time-decay. Acceptance gates G3.7.A–B. §3.8 — Observer-KB workflow runner (Archon-style multi-pass) — PORT TARGET, not yet started Documents the architecture J was working on across the Rust observer-kb branch (10 commits ahead of main, never merged) and the local Archon mod (committed 2026-04-29 as 3f2afc8 in /home/profit/external/Archon, not pushed to coleam00/Archon). The pattern: multi-pass mode chain (extract → validator → hallucination → consensus → redteam → pipeline → render) where each pass is a deterministic measurement. The observer is the natural home — workflows ARE observation patterns whose every step is recorded. Five components in dependency order: workflow definition (YAML), node executor (DAG runner), provenance recording (ObservedOps), mode catalog (matrix.search, distillation.score, drift.scorer, llm.chat), HTTP surface (/v1/observer/workflow/run). Reference materials on the system (preserved, not lost): - /home/profit/lakehouse/.archon/workflows/lakehouse-architect-review.yaml (Rust main, 69919d9) — 3-node Archon-via-Lakehouse proof - /home/profit/external/Archon dev branch — upstream engine with local pi/provider.ts mod (3f2afc8) for Lakehouse routing - Rust observer-kb branch — apps/observer-kb/docs/PRD.md + Python prototypes proven on real ChatGPT/Claude PDF data Acceptance gates G3.8.A–D. Estimated effort: L. PRD updated with "Observer as system resource (clarified 2026-04-29)" section pointing at §3.8 as the architectural growth path. The bare-bones observerd in bc9ab93 is the substrate; the workflow runner is what makes it the "objective measurement engine" the small-model pipeline needs. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/PRD.md | 21 ++++++ docs/SPEC.md | 206 +++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 227 insertions(+) diff --git a/docs/PRD.md b/docs/PRD.md index bbfe0c4..902f258 100644 --- a/docs/PRD.md +++ b/docs/PRD.md @@ -34,6 +34,27 @@ What the Go refactor is FOR: a second-language pass surfaces architectural weakn **The playbook + matrix indexer must produce the results we're looking for.** That's the single load-bearing acceptance criterion. Throughput, scaling, code elegance — all secondary. 
If a deep-field reality test on the 500K corpus surfaces wrong answers, the loop isn't working and we fix that before adding anything else. +### Observer as system resource (clarified 2026-04-29) + +The observer is not a service among services — it's a *system +resource*. Its job is to be objective about the process: watch +everything, record measurements, surface what worked vs what +didn't, feed the KB so the playbook substrate can decide the +right pathway to the correct outcome. + +The bare-bones observerd shipped in `bc9ab93` (event ingest + +stats) is the substrate for this. The architectural pattern +that grows it into the full "objective measurement engine" is +the **multi-pass workflow runner** documented in SPEC §3.8 — +inspired by Archon (`/home/profit/external/Archon`) and proven +in the Rust `observer-kb` branch's Python prototypes (`deep_analysis.py`, +`extract_knowledge.py`, `process_knowledge.py`). + +The pipeline mode-chain (extract → validator → hallucination → +consensus → redteam → pipeline → render) IS how the observer +makes actionable decisions: each mode pass is a deterministic +measurement; what survives the gauntlet is what feeds the KB. + ### Triage / human-in-loop Most cases are abstract enough that small-model + pathway + matrix can complete them. Some can't — they need a human. The system's job is to **identify which is which** and only escalate the second class. Frontier models partially solve this internally with their thinking loops; we're externalizing it so: diff --git a/docs/SPEC.md b/docs/SPEC.md index 6f3d958..1c5a8b6 100644 --- a/docs/SPEC.md +++ b/docs/SPEC.md @@ -192,6 +192,212 @@ and got reduced to "we have vectord" in earlier port-planning. The SPEC names it explicitly so the port preserves the multi-corpus retrieval shape AND the learning loop, not just the HNSW substrate. +### §3.5 — Drift quantification (loop 5 of the PRD) + +**What it is.** PRD names "drift" as the 5th loop: quantify when +historical decisions stop matching current reality. Distinct from +the rating+distillation loop because drift is MEASUREMENT, not +LEARNING. The learning loop says "this match worked, remember it"; +the drift loop says "this 4-month-old playbook entry — does it +still match what the substrate would surface today?" + +**What's shipped (commit `be65f85`):** + - SCORER drift: re-runs current `distillation.ScoreRecord` over + historical (EvidenceRecord, persisted_category) pairs and + reports mismatches + a sorted shift matrix + - `internal/drift/drift.go` — pure-function `ComputeScorerDrift` + - 6 unit tests covering no-drift, shift detection, multi-shift + sorted-by-count, includeEntries flag, empty input, scorer-version + stamping + +**Future drift shapes (not shipped):** + - PLAYBOOK drift: re-run playbook queries through current + matrix-search; recorded answer not in top-K = drift + - EMBEDDING drift: KS-test on vector distribution at T1 vs T2 + - AUDIT BASELINE drift: matches Rust `audit_baselines.jsonl` + longitudinal signal + +**Acceptance gates:** +- G3.5.A — A scorer-version bump triggers a non-zero `Drifted` count + on a corpus of historical ScoredRuns where the new logic produces + different categories than the persisted ones. +- G3.5.B — `ScorerDriftReport.ShiftMatrix` is deterministic-ordered + (count desc, ties broken alphabetically) so JSON output is stable + across runs. 
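+
+To make G3.5.A–B concrete, a minimal sketch of the drift
+computation and its deterministic ordering. Types and names here
+(`HistoricalRecord`, `Shift`, `computeScorerDrift`) are
+illustrative; the shipped shapes in `internal/drift/drift.go` are
+authoritative.
+
+```go
+package drift
+
+import "sort"
+
+// HistoricalRecord pairs a persisted category with whatever the
+// current scorer needs to re-score it (illustrative shape).
+type HistoricalRecord struct {
+	PersistedCategory string
+	Evidence          map[string]any
+}
+
+// Shift is one (old category → new category) bucket in the
+// shift matrix.
+type Shift struct {
+	From, To string
+	Count    int
+}
+
+// computeScorerDrift re-runs the current scorer over historical
+// records and buckets every mismatch. Pure function, no I/O.
+func computeScorerDrift(records []HistoricalRecord,
+	score func(map[string]any) string) (drifted int, matrix []Shift) {
+	counts := map[[2]string]int{}
+	for _, r := range records {
+		if now := score(r.Evidence); now != r.PersistedCategory {
+			drifted++
+			counts[[2]string{r.PersistedCategory, now}]++
+		}
+	}
+	for k, n := range counts {
+		matrix = append(matrix, Shift{From: k[0], To: k[1], Count: n})
+	}
+	// G3.5.B: count desc, ties alphabetical, so JSON output is
+	// stable across runs.
+	sort.Slice(matrix, func(i, j int) bool {
+		a, b := matrix[i], matrix[j]
+		if a.Count != b.Count {
+			return a.Count > b.Count
+		}
+		if a.From != b.From {
+			return a.From < b.From
+		}
+		return a.To < b.To
+	})
+	return drifted, matrix
+}
+```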
+
+### §3.6 — Staffing-side structured filter
+
+**What it is.** Reality tests on the candidates + workers corpora
+(commits `0d1553c`, `a97881d`) surfaced that pure semantic retrieval
+can't gate by location/status/availability — the matrix indexer
+returns Production Workers for a Forklift+OSHA-30 query because
+nomic-embed-text's geometry doesn't separate the role labels well.
+Structured filtering is the addressable piece: gate the candidate
+set on metadata fields instead of leaning on embedding geometry.
+The full shape filters BEFORE semantic ranking; the shipped MVP
+filters after retrieval (see Deferred).
+
+**What's shipped (commit `b199093`):**
+  - `SearchRequest.MetadataFilter` — `map[string]any` of metadata
+    field → expected value (single value, or list-of-values for OR
+    semantics within a key, AND across keys; the match semantics
+    are sketched after the gates below)
+  - Post-retrieval filter applied before top-K truncation in
+    `internal/matrix/retrieve.go`
+  - `SearchResponse.MetadataFilterDropped` for telemetry on filter
+    aggressiveness
+  - 7 unit tests covering nil filter, missing metadata, exact match,
+    AND across keys, OR within list, bool match, malformed JSON
+
+**Deferred:**
+  - Pre-retrieval SQL gate via `queryd` (the actual hybrid). The
+    post-retrieval filter is an MVP that helps when the candidate
+    set is mostly relevant; for aggressive filters that drop most
+    results, a SQL pre-filter into matrix retrieval would surface
+    the right candidates with less wasted embedding work.
+  - Filter language richer than equality (e.g. range, prefix, regex).
+
+**Acceptance gates:**
+- G3.6.A — `MetadataFilter: {"state": "IL"}` against a mixed-state
+  corpus drops every non-IL result; `MetadataFilterDropped` reports
+  the count.
+- G3.6.B — List filter `{"state": ["IL", "WI"]}` keeps both states,
+  drops the rest (OR within key).
+- G3.6.C — Multi-key filter is AND: a result missing any key is
+  dropped, no exception.
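+
+A minimal sketch of the match semantics the gates pin down (AND
+across keys, OR within a list value, a missing key drops the
+result). `matchesFilter` is an illustrative name; the shipped
+logic in `internal/matrix/retrieve.go` is authoritative. JSON
+scalars (string, bool, float64) are assumed as metadata values so
+`==` comparison is safe.
+
+```go
+package matrix
+
+// matchesFilter reports whether a result's metadata passes the
+// MetadataFilter: AND across keys, OR within a list value, and a
+// result missing any filtered key is dropped (G3.6.C).
+func matchesFilter(meta, filter map[string]any) bool {
+	for key, want := range filter {
+		got, ok := meta[key]
+		if !ok {
+			return false // missing key → dropped, no exception
+		}
+		list, isList := want.([]any)
+		if !isList {
+			if got != want { // single value: exact match
+				return false
+			}
+			continue
+		}
+		hit := false // list value: OR within the key
+		for _, w := range list {
+			if got == w {
+				hit = true
+				break
+			}
+		}
+		if !hit {
+			return false
+		}
+	}
+	return true
+}
+```
+
+Retrieval would apply this before top-K truncation and count the
+misses into `MetadataFilterDropped`.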
+
+### §3.7 — Operational rating wiring
+
+**What it is.** PRD loop 4 (rating + distillation) needs real
+inflows to be a learning system rather than a substrate. The
+playbook-record endpoint (`06e7152`) takes one (query, answer,
+score) per call; wiring it to real signal sources is what makes
+the system smarter with use.
+
+**What's shipped (commit `6392772`):**
+  - `POST /v1/matrix/playbooks/bulk` — bulk-record N successes;
+    per-entry success/failure response so callers can see which of
+    a 4,701-row historical placement import succeeded and which
+    failed validation.
+  - Single-record path from `06e7152` unchanged.
+
+**Deferred:**
+  - UI shim for click-tracking (no Go demo UI yet — the Bun demo at
+    `devop.live/lakehouse/` is still serving the public surface).
+    When the Go UI lands or a feedback API is added to the Bun UI,
+    every coordinator click → bulk-batched POST → playbook entry.
+  - Negative feedback (this match didn't work). Currently only
+    positive scores are recorded; a rejection signal would help the
+    learning loop avoid pushing bad matches.
+  - Time-decay on playbook scores so stale recommendations attenuate.
+
+**Acceptance gates:**
+- G3.7.A — Bulk POST of N entries returns `{recorded, failed,
+  results[]}` with per-entry IDs/errors, no single-entry failure
+  aborting the batch.
+- G3.7.B — Each recorded entry surfaces in `/v1/matrix/search` with
+  `use_playbook=true` after a re-query.
+
+### §3.8 — Observer-KB workflow runner (Archon-style multi-pass)
+
+**What it is.** The architectural pattern documented in the Rust
+`observer-kb` branch (10 commits ahead of main, never merged) and
+proven by `/home/profit/external/Archon`'s workflow engine.
+Multiple mode passes process the data, each pass an objective
+measurement that contributes to the KB:
+
+```
+Raw data
+  ↓ Mode: EXTRACT        structured facts/entities/relationships
+  ↓ Mode: VALIDATOR      fact-check, confidence 1-10
+  ↓ Mode: HALLUCINATION  verify each claim, flag likely fabrications
+  ↓ Mode: CONSENSUS      multiple passes until extraction converges
+  ↓ Mode: REDTEAM        attack what survived, patch what fails
+  ↓ Mode: PIPELINE       clean → Q&A structure → topic group → rank
+  ↓ RENDER               curated doc anchored on questions
+```
+
+This is the *orchestrator* missing from §3.4 components 1–5: each
+SPEC §3.4 piece (relevance, downgrade, scorer, drift) is a "mode";
+what's missing is the workflow engine that chains them.
+
+**Why it matters.** Per the PRD's product vision, the observer
+should make actionable decisions based on watching what's
+successful. The workflow runner is how observers compose modes
+into multi-pass pipelines that score outcomes rigorously enough
+to feed the KB and inform the playbook substrate.
+
+**Reference materials on the system:**
+  - `/home/profit/lakehouse/.archon/workflows/lakehouse-architect-review.yaml`
+    (committed `69919d9` in main) — proves Archon-via-Lakehouse
+    works with a 3-node `shape → weakness → improvement` workflow
+  - `/home/profit/external/Archon` — the upstream workflow engine
+    (cloned 2026-04-26); `packages/providers/src/community/pi/provider.ts`
+    carries the local Lakehouse-routing mod, committed locally as
+    `3f2afc8` (not pushed to upstream `coleam00/Archon`)
+  - Rust `observer-kb` branch (10 commits, +4338/-55506 LoC) —
+    `apps/observer-kb/docs/PRD.md` documents the multi-pass
+    architecture; `scripts/{deep_analysis,extract_knowledge,process_knowledge}.py`
+    are the Python prototypes that proved it on real ChatGPT/Claude
+    PDF data (496 topics, 300 decisions, 100 insights extracted)
+
+**Components to port (in dependency order; the core types and DAG
+walk are sketched after the list):**
+
+1. **Workflow definition** (`internal/workflow/types.go`) — YAML
+   schema matching Archon's shape: `name`, `description`, `provider`,
+   `model`, list of `nodes` each with `id`, `prompt`, `allowed_tools`,
+   `effort`, `idle_timeout`, `depends_on`. The `depends_on` edges form
+   a DAG; the runner resolves it topologically.
+
+2. **Node executor** (`internal/workflow/runner.go`) — given a
+   workflow and a starting context, walks the DAG, executes each
+   node by dispatching to the configured backend (matrix.Search,
+   distillation.ScoreRecord, drift.ComputeScorerDrift, or a generic
+   prompt-against-LLM via gateway `/v1/chat`), captures per-node
+   output, and makes it available as `$.output` in subsequent
+   nodes.
+
+3. **Provenance recording** — every node execution lands an
+   ObservedOp (via the observerd substrate from `bc9ab93`) with
+   `source: "workflow"`, the workflow name + node ID, input/output
+   summaries, and timing. The ring buffer + JSONL log become the
+   substrate for the rating+distillation loop's KB feed.
+
+4. **Mode catalog** (`internal/workflow/modes.go`) — registry of
+   the modes the runner can dispatch to. Each mode is a Go function
+   matching a uniform
+   `func(ctx context.Context, input map[string]any) (map[string]any, error)`
+   signature so workflows can compose them. Initial modes from
+   §3.4: `matrix.search`, `matrix.relevance`, `matrix.downgrade`,
+   `playbook.record`, `playbook.lookup`, `distillation.score`,
+   `drift.scorer`. Plus `llm.chat` for free-form mode prompts.
+
+5. **HTTP surface** — `POST /v1/observer/workflow/run` accepts a
+   workflow YAML body + a starting context; returns the per-node
+   results + the chain of ObservedOps generated. `GET
+   /v1/observer/workflow/list` lists workflows in a known directory
+   for operator discoverability.
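+
+A minimal sketch of what components 1, 2, and 4 imply, under
+stated assumptions: the type and function names (`Node`,
+`Workflow`, `Mode`, `Run`, `modeFor`) are illustrative, YAML
+parsing is elided, the DAG walk is a simple repeated ready-node
+sweep rather than a tuned scheduler, and provenance recording
+(component 3) is reduced to a comment.
+
+```go
+package workflow
+
+import (
+	"context"
+	"fmt"
+)
+
+// Node and Workflow mirror the component-1 YAML schema
+// (illustrative subset; unmarshalling via a YAML library elided).
+type Node struct {
+	ID        string   `yaml:"id"`
+	Prompt    string   `yaml:"prompt"`
+	DependsOn []string `yaml:"depends_on"`
+}
+
+type Workflow struct {
+	Name  string `yaml:"name"`
+	Nodes []Node `yaml:"nodes"`
+}
+
+// Mode is the uniform signature from component 4.
+type Mode func(ctx context.Context, input map[string]any) (map[string]any, error)
+
+// Run resolves the depends_on DAG topologically (repeated
+// ready-node sweep) and dispatches each node to its mode.
+func Run(ctx context.Context, wf Workflow, modeFor func(Node) Mode,
+	start map[string]any) (map[string]map[string]any, error) {
+	outputs := make(map[string]map[string]any, len(wf.Nodes))
+	for len(outputs) < len(wf.Nodes) {
+		progressed := false
+		for _, n := range wf.Nodes {
+			if _, done := outputs[n.ID]; done {
+				continue
+			}
+			ready := true
+			for _, dep := range n.DependsOn {
+				if _, ok := outputs[dep]; !ok {
+					ready = false
+					break
+				}
+			}
+			if !ready {
+				continue
+			}
+			// Input = starting context + upstream outputs (what a
+			// node's prompt sees as $.output).
+			input := map[string]any{}
+			for k, v := range start {
+				input[k] = v
+			}
+			for _, dep := range n.DependsOn {
+				input[dep] = outputs[dep]
+			}
+			out, err := modeFor(n)(ctx, input)
+			if err != nil {
+				return nil, fmt.Errorf("node %s: %w", n.ID, err)
+			}
+			// Component 3 would land an ObservedOp here: source
+			// "workflow", wf.Name + n.ID, summaries, timing.
+			outputs[n.ID] = out
+			progressed = true
+		}
+		if !progressed {
+			return nil, fmt.Errorf("workflow %s: cycle or missing"+
+				" dependency", wf.Name)
+		}
+	}
+	return outputs, nil
+}
+```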
+
+**Why integrate into observerd, not a new service.** The observer
+is the system resource that watches and records. Workflows ARE
+observation patterns — multi-step processes whose every step is
+recorded. Putting the runner inside observerd keeps the
+"measurement → KB feed" wiring tight; a separate service would
+re-implement the recording layer.
+
+**Acceptance gates:**
+- G3.8.A — Load a workflow YAML matching the Archon `lakehouse-architect-review.yaml`
+  shape; the runner executes the 3-node DAG topologically.
+- G3.8.B — Each node execution lands an ObservedOp with
+  `source: "workflow"` and the node's input/output. The stats
+  endpoint shows the workflow ops.
+- G3.8.C — A node referencing `$.output` in its prompt resolves
+  correctly; a missing reference is a clear error, not a silent
+  empty string.
+- G3.8.D — The mode catalog dispatches a `matrix.search` invocation
+  to the matrixd backend without going through HTTP (in-process
+  function call when matrixd is co-resident).
+
+**Status:** PORT TARGET, not yet started. The SPEC commits the
+design; implementation is its own wave (estimated **L** effort
+given the DAG runner + mode dispatch + provenance recording).
+
 ### §3.3 — UI (HTMX)
 
 **Approach:** server-rendered Go templates using `html/template`,