From 97dd3f826d4541158a3b0c2b94f871742ed49e85 Mon Sep 17 00:00:00 2001 From: root Date: Wed, 29 Apr 2026 20:27:41 -0500 Subject: [PATCH] =?UTF-8?q?SPEC=20=C2=A73.5/=C2=A73.6/=C2=A73.7/=C2=A73.8?= =?UTF-8?q?=20=E2=80=94=20name=20F/B/C=20as=20port=20targets=20+=20add=20A?= =?UTF-8?q?rchon-style=20workflow=20runner?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per the 2026-04-29 scope-discipline pause: the wave shipped four pieces beyond SPEC §3.4 component scope, and one architectural pattern surfaced (Archon-style multi-pass workflow runner) that's the observer's natural growth path. Document them as port targets so the next scrum review has authoritative SPEC components. §3.5 — Drift quantification (loop 5 of the PRD) Names the SCORER drift work shipped in be65f85 + the deferred shapes (PLAYBOOK drift, EMBEDDING drift, AUDIT BASELINE drift). Acceptance gates G3.5.A–B. §3.6 — Staffing-side structured filter Names the metadata-filter MVP shipped in b199093 + the deferred pre-retrieval SQL gate via queryd. Acceptance gates G3.6.A–C. §3.7 — Operational rating wiring Names the bulk playbook-record endpoint shipped in 6392772 + the deferred UI shim, negative-feedback path, and time-decay. Acceptance gates G3.7.A–B. §3.8 — Observer-KB workflow runner (Archon-style multi-pass) — PORT TARGET, not yet started Documents the architecture J was working on across the Rust observer-kb branch (10 commits ahead of main, never merged) and the local Archon mod (committed 2026-04-29 as 3f2afc8 in /home/profit/external/Archon, not pushed to coleam00/Archon). The pattern: multi-pass mode chain (extract → validator → hallucination → consensus → redteam → pipeline → render) where each pass is a deterministic measurement. The observer is the natural home — workflows ARE observation patterns whose every step is recorded. Five components in dependency order: workflow definition (YAML), node executor (DAG runner), provenance recording (ObservedOps), mode catalog (matrix.search, distillation.score, drift.scorer, llm.chat), HTTP surface (/v1/observer/workflow/run). Reference materials on the system (preserved, not lost): - /home/profit/lakehouse/.archon/workflows/lakehouse-architect-review.yaml (Rust main, 69919d9) — 3-node Archon-via-Lakehouse proof - /home/profit/external/Archon dev branch — upstream engine with local pi/provider.ts mod (3f2afc8) for Lakehouse routing - Rust observer-kb branch — apps/observer-kb/docs/PRD.md + Python prototypes proven on real ChatGPT/Claude PDF data Acceptance gates G3.8.A–D. Estimated effort: L. PRD updated with "Observer as system resource (clarified 2026-04-29)" section pointing at §3.8 as the architectural growth path. The bare-bones observerd in bc9ab93 is the substrate; the workflow runner is what makes it the "objective measurement engine" the small-model pipeline needs. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/PRD.md | 21 ++++++ docs/SPEC.md | 206 +++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 227 insertions(+) diff --git a/docs/PRD.md b/docs/PRD.md index bbfe0c4..902f258 100644 --- a/docs/PRD.md +++ b/docs/PRD.md @@ -34,6 +34,27 @@ What the Go refactor is FOR: a second-language pass surfaces architectural weakn **The playbook + matrix indexer must produce the results we're looking for.** That's the single load-bearing acceptance criterion. Throughput, scaling, code elegance — all secondary. 
If a deep-field reality test on the 500K corpus surfaces wrong answers, the loop isn't working and we fix that before adding anything else. +### Observer as system resource (clarified 2026-04-29) + +The observer is not a service among services — it's a *system +resource*. Its job is to be objective about the process: watch +everything, record measurements, surface what worked vs what +didn't, feed the KB so the playbook substrate can decide the +right pathway to the correct outcome. + +The bare-bones observerd shipped in `bc9ab93` (event ingest + +stats) is the substrate for this. The architectural pattern +that grows it into the full "objective measurement engine" is +the **multi-pass workflow runner** documented in SPEC §3.8 — +inspired by Archon (`/home/profit/external/Archon`) and proven +in the Rust `observer-kb` branch's Python prototypes (`deep_analysis.py`, +`extract_knowledge.py`, `process_knowledge.py`). + +The pipeline mode-chain (extract → validator → hallucination → +consensus → redteam → pipeline → render) IS how the observer +makes actionable decisions: each mode pass is a deterministic +measurement; what survives the gauntlet is what feeds the KB. + ### Triage / human-in-loop Most cases are abstract enough that small-model + pathway + matrix can complete them. Some can't — they need a human. The system's job is to **identify which is which** and only escalate the second class. Frontier models partially solve this internally with their thinking loops; we're externalizing it so: diff --git a/docs/SPEC.md b/docs/SPEC.md index 6f3d958..1c5a8b6 100644 --- a/docs/SPEC.md +++ b/docs/SPEC.md @@ -192,6 +192,212 @@ and got reduced to "we have vectord" in earlier port-planning. The SPEC names it explicitly so the port preserves the multi-corpus retrieval shape AND the learning loop, not just the HNSW substrate. +### §3.5 — Drift quantification (loop 5 of the PRD) + +**What it is.** PRD names "drift" as the 5th loop: quantify when +historical decisions stop matching current reality. Distinct from +the rating+distillation loop because drift is MEASUREMENT, not +LEARNING. The learning loop says "this match worked, remember it"; +the drift loop says "this 4-month-old playbook entry — does it +still match what the substrate would surface today?" + +**What's shipped (commit `be65f85`):** + - SCORER drift: re-runs current `distillation.ScoreRecord` over + historical (EvidenceRecord, persisted_category) pairs and + reports mismatches + a sorted shift matrix + - `internal/drift/drift.go` — pure-function `ComputeScorerDrift` + - 6 unit tests covering no-drift, shift detection, multi-shift + sorted-by-count, includeEntries flag, empty input, scorer-version + stamping + +**Future drift shapes (not shipped):** + - PLAYBOOK drift: re-run playbook queries through current + matrix-search; recorded answer not in top-K = drift + - EMBEDDING drift: KS-test on vector distribution at T1 vs T2 + - AUDIT BASELINE drift: matches Rust `audit_baselines.jsonl` + longitudinal signal + +**Acceptance gates:** +- G3.5.A — A scorer-version bump triggers a non-zero `Drifted` count + on a corpus of historical ScoredRuns where the new logic produces + different categories than the persisted ones. +- G3.5.B — `ScorerDriftReport.ShiftMatrix` is deterministic-ordered + (count desc, ties broken alphabetically) so JSON output is stable + across runs. 
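+
+To make G3.5.A–B concrete, a minimal sketch of the drift
+computation and its deterministic ordering. Types and names here
+(`HistoricalRecord`, `Shift`, `computeScorerDrift`) are
+illustrative; the shipped shapes in `internal/drift/drift.go` are
+authoritative.
+
+```go
+package drift
+
+import "sort"
+
+// HistoricalRecord pairs a persisted category with whatever the
+// current scorer needs to re-score it (illustrative shape).
+type HistoricalRecord struct {
+	PersistedCategory string
+	Evidence          map[string]any
+}
+
+// Shift is one (old category → new category) bucket in the
+// shift matrix.
+type Shift struct {
+	From, To string
+	Count    int
+}
+
+// computeScorerDrift re-runs the current scorer over historical
+// records and buckets every mismatch. Pure function, no I/O.
+func computeScorerDrift(records []HistoricalRecord,
+	score func(map[string]any) string) (drifted int, matrix []Shift) {
+	counts := map[[2]string]int{}
+	for _, r := range records {
+		if now := score(r.Evidence); now != r.PersistedCategory {
+			drifted++
+			counts[[2]string{r.PersistedCategory, now}]++
+		}
+	}
+	for k, n := range counts {
+		matrix = append(matrix, Shift{From: k[0], To: k[1], Count: n})
+	}
+	// G3.5.B: count desc, ties alphabetical, so JSON output is
+	// stable across runs.
+	sort.Slice(matrix, func(i, j int) bool {
+		a, b := matrix[i], matrix[j]
+		if a.Count != b.Count {
+			return a.Count > b.Count
+		}
+		if a.From != b.From {
+			return a.From < b.From
+		}
+		return a.To < b.To
+	})
+	return drifted, matrix
+}
+```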
+
+### §3.6 — Staffing-side structured filter
+
+**What it is.** Reality tests on the candidates + workers corpora
+(commits `0d1553c`, `a97881d`) surfaced that pure semantic retrieval
+can't gate by location/status/availability — the matrix indexer
+returns Production Workers for a Forklift+OSHA-30 query because
+nomic-embed-text's geometry doesn't separate the role labels well.
+Structured filtering is the addressable piece: gate the candidate
+set on metadata fields instead of leaning on embedding geometry.
+The full shape filters BEFORE semantic ranking; the shipped MVP
+filters after retrieval (see Deferred).
+
+**What's shipped (commit `b199093`):**
+  - `SearchRequest.MetadataFilter` — `map[string]any` of metadata
+    field → expected value (single value, or list-of-values for OR
+    semantics within a key, AND across keys; the match semantics
+    are sketched after the gates below)
+  - Post-retrieval filter applied before top-K truncation in
+    `internal/matrix/retrieve.go`
+  - `SearchResponse.MetadataFilterDropped` for telemetry on filter
+    aggressiveness
+  - 7 unit tests covering nil filter, missing metadata, exact match,
+    AND across keys, OR within list, bool match, malformed JSON
+
+**Deferred:**
+  - Pre-retrieval SQL gate via `queryd` (the actual hybrid). The
+    post-retrieval filter is an MVP that helps when the candidate
+    set is mostly relevant; for aggressive filters that drop most
+    results, a SQL pre-filter into matrix retrieval would surface
+    the right candidates with less wasted embedding work.
+  - Filter language richer than equality (e.g. range, prefix, regex).
+
+**Acceptance gates:**
+- G3.6.A — `MetadataFilter: {"state": "IL"}` against a mixed-state
+  corpus drops every non-IL result; `MetadataFilterDropped` reports
+  the count.
+- G3.6.B — List filter `{"state": ["IL", "WI"]}` keeps both states,
+  drops the rest (OR within key).
+- G3.6.C — Multi-key filter is AND: a result missing any key is
+  dropped, no exception.
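+
+A minimal sketch of the match semantics the gates pin down (AND
+across keys, OR within a list value, a missing key drops the
+result). `matchesFilter` is an illustrative name; the shipped
+logic in `internal/matrix/retrieve.go` is authoritative. JSON
+scalars (string, bool, float64) are assumed as metadata values so
+`==` comparison is safe.
+
+```go
+package matrix
+
+// matchesFilter reports whether a result's metadata passes the
+// MetadataFilter: AND across keys, OR within a list value, and a
+// result missing any filtered key is dropped (G3.6.C).
+func matchesFilter(meta, filter map[string]any) bool {
+	for key, want := range filter {
+		got, ok := meta[key]
+		if !ok {
+			return false // missing key → dropped, no exception
+		}
+		list, isList := want.([]any)
+		if !isList {
+			if got != want { // single value: exact match
+				return false
+			}
+			continue
+		}
+		hit := false // list value: OR within the key
+		for _, w := range list {
+			if got == w {
+				hit = true
+				break
+			}
+		}
+		if !hit {
+			return false
+		}
+	}
+	return true
+}
+```
+
+Retrieval would apply this before top-K truncation and count the
+misses into `MetadataFilterDropped`.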
+
+### §3.7 — Operational rating wiring
+
+**What it is.** PRD loop 4 (rating + distillation) needs real
+inflows to be a learning system rather than a substrate. The
+playbook-record endpoint (`06e7152`) takes one (query, answer,
+score) per call; wiring it to real signal sources is what makes
+the system smarter with use.
+
+**What's shipped (commit `6392772`):**
+  - `POST /v1/matrix/playbooks/bulk` — bulk-record N successes;
+    per-entry success/failure response so callers can see which of
+    a 4,701-row historical placement import succeeded and which
+    failed validation.
+  - Single-record path from `06e7152` unchanged.
+
+**Deferred:**
+  - UI shim for click-tracking (no Go demo UI yet — the Bun demo at
+    `devop.live/lakehouse/` is still serving the public surface).
+    When the Go UI lands or a feedback API is added to the Bun UI,
+    every coordinator click → bulk-batched POST → playbook entry.
+  - Negative feedback (this match didn't work). Currently only
+    positive scores are recorded; a rejection signal would help the
+    learning loop avoid pushing bad matches.
+  - Time-decay on playbook scores so stale recommendations attenuate.
+
+**Acceptance gates:**
+- G3.7.A — Bulk POST of N entries returns `{recorded, failed,
+  results[]}` with per-entry IDs/errors, no single-entry failure
+  aborting the batch.
+- G3.7.B — Each recorded entry surfaces in `/v1/matrix/search` with
+  `use_playbook=true` after a re-query.
+
+### §3.8 — Observer-KB workflow runner (Archon-style multi-pass)
+
+**What it is.** The architectural pattern documented in the Rust
+`observer-kb` branch (10 commits ahead of main, never merged) and
+proven by `/home/profit/external/Archon`'s workflow engine.
+Multiple mode passes process the data, each pass an objective
+measurement that contributes to the KB:
+
+```
+Raw data
+  ↓ Mode: EXTRACT        structured facts/entities/relationships
+  ↓ Mode: VALIDATOR      fact-check, confidence 1-10
+  ↓ Mode: HALLUCINATION  verify each claim, flag likely fabrications
+  ↓ Mode: CONSENSUS      multiple passes until extraction converges
+  ↓ Mode: REDTEAM        attack what survived, patch what fails
+  ↓ Mode: PIPELINE       clean → Q&A structure → topic group → rank
+  ↓ RENDER               curated doc anchored on questions
+```
+
+This is the *orchestrator* missing from §3.4 components 1–5: each
+SPEC §3.4 piece (relevance, downgrade, scorer, drift) is a "mode";
+what's missing is the workflow engine that chains them.
+
+**Why it matters.** Per the PRD's product vision, the observer
+should make actionable decisions based on watching what's
+successful. The workflow runner is how observers compose modes
+into multi-pass pipelines that score outcomes rigorously enough
+to feed the KB and inform the playbook substrate.
+
+**Reference materials on the system:**
+  - `/home/profit/lakehouse/.archon/workflows/lakehouse-architect-review.yaml`
+    (committed `69919d9` in main) — proves Archon-via-Lakehouse
+    works with a 3-node `shape → weakness → improvement` workflow
+  - `/home/profit/external/Archon` — the upstream workflow engine
+    (cloned 2026-04-26); `packages/providers/src/community/pi/provider.ts`
+    carries the local Lakehouse-routing mod, committed locally as
+    `3f2afc8` (not pushed to upstream `coleam00/Archon`)
+  - Rust `observer-kb` branch (10 commits, +4338/-55506 LoC) —
+    `apps/observer-kb/docs/PRD.md` documents the multi-pass
+    architecture; `scripts/{deep_analysis,extract_knowledge,process_knowledge}.py`
+    are the Python prototypes that proved it on real ChatGPT/Claude
+    PDF data (496 topics, 300 decisions, 100 insights extracted)
+
+**Components to port (in dependency order; the core types and DAG
+walk are sketched after the list):**
+
+1. **Workflow definition** (`internal/workflow/types.go`) — YAML
+   schema matching Archon's shape: `name`, `description`, `provider`,
+   `model`, list of `nodes` each with `id`, `prompt`, `allowed_tools`,
+   `effort`, `idle_timeout`, `depends_on`. The `depends_on` edges form
+   a DAG; the runner resolves it topologically.
+
+2. **Node executor** (`internal/workflow/runner.go`) — given a
+   workflow and a starting context, walks the DAG, executes each
+   node by dispatching to the configured backend (matrix.Search,
+   distillation.ScoreRecord, drift.ComputeScorerDrift, or a generic
+   prompt-against-LLM via gateway `/v1/chat`), captures per-node
+   output, and makes it available as `$.output` in subsequent
+   nodes.
+
+3. **Provenance recording** — every node execution lands an
+   ObservedOp (via the observerd substrate from `bc9ab93`) with
+   `source: "workflow"`, the workflow name + node ID, input/output
+   summaries, and timing. The ring buffer + JSONL log become the
+   substrate for the rating+distillation loop's KB feed.
+
+4. **Mode catalog** (`internal/workflow/modes.go`) — registry of
+   the modes the runner can dispatch to. Each mode is a Go function
+   matching a uniform
+   `func(ctx context.Context, input map[string]any) (map[string]any, error)`
+   signature so workflows can compose them. Initial modes from
+   §3.4: `matrix.search`, `matrix.relevance`, `matrix.downgrade`,
+   `playbook.record`, `playbook.lookup`, `distillation.score`,
+   `drift.scorer`. Plus `llm.chat` for free-form mode prompts.
+
+5. **HTTP surface** — `POST /v1/observer/workflow/run` accepts a
+   workflow YAML body + a starting context; returns the per-node
+   results + the chain of ObservedOps generated. `GET
+   /v1/observer/workflow/list` lists workflows in a known directory
+   for operator discoverability.
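+
+A minimal sketch of what components 1, 2, and 4 imply, under
+stated assumptions: the type and function names (`Node`,
+`Workflow`, `Mode`, `Run`, `modeFor`) are illustrative, YAML
+parsing is elided, the DAG walk is a simple repeated ready-node
+sweep rather than a tuned scheduler, and provenance recording
+(component 3) is reduced to a comment.
+
+```go
+package workflow
+
+import (
+	"context"
+	"fmt"
+)
+
+// Node and Workflow mirror the component-1 YAML schema
+// (illustrative subset; unmarshalling via a YAML library elided).
+type Node struct {
+	ID        string   `yaml:"id"`
+	Prompt    string   `yaml:"prompt"`
+	DependsOn []string `yaml:"depends_on"`
+}
+
+type Workflow struct {
+	Name  string `yaml:"name"`
+	Nodes []Node `yaml:"nodes"`
+}
+
+// Mode is the uniform signature from component 4.
+type Mode func(ctx context.Context, input map[string]any) (map[string]any, error)
+
+// Run resolves the depends_on DAG topologically (repeated
+// ready-node sweep) and dispatches each node to its mode.
+func Run(ctx context.Context, wf Workflow, modeFor func(Node) Mode,
+	start map[string]any) (map[string]map[string]any, error) {
+	outputs := make(map[string]map[string]any, len(wf.Nodes))
+	for len(outputs) < len(wf.Nodes) {
+		progressed := false
+		for _, n := range wf.Nodes {
+			if _, done := outputs[n.ID]; done {
+				continue
+			}
+			ready := true
+			for _, dep := range n.DependsOn {
+				if _, ok := outputs[dep]; !ok {
+					ready = false
+					break
+				}
+			}
+			if !ready {
+				continue
+			}
+			// Input = starting context + upstream outputs (what a
+			// node's prompt sees as $.output).
+			input := map[string]any{}
+			for k, v := range start {
+				input[k] = v
+			}
+			for _, dep := range n.DependsOn {
+				input[dep] = outputs[dep]
+			}
+			out, err := modeFor(n)(ctx, input)
+			if err != nil {
+				return nil, fmt.Errorf("node %s: %w", n.ID, err)
+			}
+			// Component 3 would land an ObservedOp here: source
+			// "workflow", wf.Name + n.ID, summaries, timing.
+			outputs[n.ID] = out
+			progressed = true
+		}
+		if !progressed {
+			return nil, fmt.Errorf("workflow %s: cycle or missing"+
+				" dependency", wf.Name)
+		}
+	}
+	return outputs, nil
+}
+```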
+
+**Why integrate into observerd, not a new service.** The observer
+is the system resource that watches and records. Workflows ARE
+observation patterns — multi-step processes whose every step is
+recorded. Putting the runner inside observerd keeps the
+"measurement → KB feed" wiring tight; a separate service would
+re-implement the recording layer.
+
+**Acceptance gates:**
+- G3.8.A — Load a workflow YAML matching the Archon `lakehouse-architect-review.yaml`
+  shape; the runner executes the 3-node DAG topologically.
+- G3.8.B — Each node execution lands an ObservedOp with
+  `source: "workflow"` and the node's input/output. The stats
+  endpoint shows the workflow ops.
+- G3.8.C — A node referencing `$.output` in its prompt resolves
+  correctly; a missing reference is a clear error, not a silent
+  empty string.
+- G3.8.D — The mode catalog dispatches a `matrix.search` invocation
+  to the matrixd backend without going through HTTP (in-process
+  function call when matrixd is co-resident).
+
+**Status:** PORT TARGET, not yet started. The SPEC commits the
+design; implementation is its own wave (estimated **L** effort
+given the DAG runner + mode dispatch + provenance recording).
+
 ### §3.3 — UI (HTMX)
 
 **Approach:** server-rendered Go templates using `html/template`,