SPEC §3.5/§3.6/§3.7/§3.8 — name F/B/C as port targets + add Archon-style workflow runner

Per the 2026-04-29 scope-discipline pause: the wave shipped four
pieces beyond SPEC §3.4 component scope, and one architectural
pattern surfaced (Archon-style multi-pass workflow runner) that's
the observer's natural growth path. Document them as port targets
so the next scrum review has authoritative SPEC components.

§3.5 — Drift quantification (loop 5 of the PRD)
  Names the SCORER drift work shipped in be65f85 + the deferred
  shapes (PLAYBOOK drift, EMBEDDING drift, AUDIT BASELINE drift).
  Acceptance gates G3.5.A–B.

§3.6 — Staffing-side structured filter
  Names the metadata-filter MVP shipped in b199093 + the deferred
  pre-retrieval SQL gate via queryd. Acceptance gates G3.6.A–C.

§3.7 — Operational rating wiring
  Names the bulk playbook-record endpoint shipped in 6392772 + the
  deferred UI shim, negative-feedback path, and time-decay.
  Acceptance gates G3.7.A–B.

§3.8 — Observer-KB workflow runner (Archon-style multi-pass) —
       PORT TARGET, not yet started
  Documents the architecture J was working on across the Rust
  observer-kb branch (10 commits ahead of main, never merged) and
  the local Archon mod (committed 2026-04-29 as 3f2afc8 in
  /home/profit/external/Archon, not pushed to coleam00/Archon).

  The pattern: multi-pass mode chain (extract → validator →
  hallucination → consensus → redteam → pipeline → render) where
  each pass is a deterministic measurement. The observer is the
  natural home — workflows ARE observation patterns whose every
  step is recorded. Five components in dependency order: workflow
  definition (YAML), node executor (DAG runner), provenance
  recording (ObservedOps), mode catalog (matrix.search,
  distillation.score, drift.scorer, llm.chat), HTTP surface
  (/v1/observer/workflow/run).

  Reference materials on the system (preserved, not lost):
    - /home/profit/lakehouse/.archon/workflows/lakehouse-architect-review.yaml
      (Rust main, 69919d9) — 3-node Archon-via-Lakehouse proof
    - /home/profit/external/Archon dev branch — upstream engine
      with local pi/provider.ts mod (3f2afc8) for Lakehouse routing
    - Rust observer-kb branch — apps/observer-kb/docs/PRD.md +
      Python prototypes proven on real ChatGPT/Claude PDF data

  Acceptance gates G3.8.A–D. Estimated effort: L.

PRD updated with "Observer as system resource (clarified
2026-04-29)" section pointing at §3.8 as the architectural growth
path. The bare-bones observerd in bc9ab93 is the substrate; the
workflow runner is what makes it the "objective measurement engine"
the small-model pipeline needs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
root 2026-04-29 20:27:41 -05:00
parent bc9ab93afe
commit 97dd3f826d
2 changed files with 227 additions and 0 deletions


@ -34,6 +34,27 @@ What the Go refactor is FOR: a second-language pass surfaces architectural weakn
**The playbook + matrix indexer must produce the results we're looking for.** That's the single load-bearing acceptance criterion. Throughput, scaling, code elegance — all secondary. If a deep-field reality test on the 500K corpus surfaces wrong answers, the loop isn't working and we fix that before adding anything else.
### Observer as system resource (clarified 2026-04-29)
The observer is not a service among services — it's a *system
resource*. Its job is to be objective about the process: watch
everything, record measurements, surface what worked vs what
didn't, feed the KB so the playbook substrate can decide the
right pathway to the correct outcome.
The bare-bones observerd shipped in `bc9ab93` (event ingest +
stats) is the substrate for this. The architectural pattern
that grows it into the full "objective measurement engine" is
the **multi-pass workflow runner** documented in SPEC §3.8 —
inspired by Archon (`/home/profit/external/Archon`) and proven
in the Rust `observer-kb` branch's Python prototypes (`deep_analysis.py`,
`extract_knowledge.py`, `process_knowledge.py`).
The pipeline mode-chain (extract → validator → hallucination →
consensus → redteam → pipeline → render) IS how the observer
makes actionable decisions: each mode pass is a deterministic
measurement; what survives the gauntlet is what feeds the KB.
### Triage / human-in-loop
Most cases are abstract enough that small-model + pathway + matrix can complete them. Some can't — they need a human. The system's job is to **identify which is which** and only escalate the second class. Frontier models partially solve this internally with their thinking loops; we're externalizing it so:


@ -192,6 +192,212 @@ and got reduced to "we have vectord" in earlier port-planning. The
SPEC names it explicitly so the port preserves the multi-corpus
retrieval shape AND the learning loop, not just the HNSW substrate.
### §3.5 — Drift quantification (loop 5 of the PRD)
**What it is.** PRD names "drift" as the 5th loop: quantify when
historical decisions stop matching current reality. Distinct from
the rating+distillation loop because drift is MEASUREMENT, not
LEARNING. The learning loop says "this match worked, remember it";
the drift loop says "this 4-month-old playbook entry — does it
still match what the substrate would surface today?"
**What's shipped (commit `be65f85`):**
- SCORER drift: re-runs current `distillation.ScoreRecord` over
historical (EvidenceRecord, persisted_category) pairs and
reports mismatches + a sorted shift matrix
- `internal/drift/drift.go` — pure-function `ComputeScorerDrift`
- 6 unit tests covering no-drift, shift detection, multi-shift
sorted-by-count, includeEntries flag, empty input, scorer-version
stamping
**Future drift shapes (not shipped):**
- PLAYBOOK drift: re-run playbook queries through current
matrix-search; recorded answer not in top-K = drift
- EMBEDDING drift: KS-test on vector distribution at T1 vs T2
- AUDIT BASELINE drift: matches Rust `audit_baselines.jsonl`
longitudinal signal
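The EMBEDDING drift shape above is unshipped; as a hedged sketch of the underlying measurement, the two-sample Kolmogorov-Smirnov statistic compares a scalar projection of the vectors (e.g. distance to a fixed centroid) at T1 vs T2. Function and projection choice here are illustrative assumptions, not the planned implementation:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// ksStatistic computes the two-sample Kolmogorov-Smirnov statistic
// D = max over x of |F1(x) - F2(x)|, where F1 and F2 are the empirical
// CDFs of the two samples. A large D means the T1 and T2 distributions
// have diverged. Sketch only; not the shipped drift code.
func ksStatistic(a, b []float64) float64 {
	sort.Float64s(a)
	sort.Float64s(b)
	i, j := 0, 0
	var d float64
	for i < len(a) && j < len(b) {
		// Advance past all copies of the smaller value so ties are
		// consumed on both sides before the CDF gap is measured.
		x := math.Min(a[i], b[j])
		for i < len(a) && a[i] == x {
			i++
		}
		for j < len(b) && b[j] == x {
			j++
		}
		diff := math.Abs(float64(i)/float64(len(a)) - float64(j)/float64(len(b)))
		if diff > d {
			d = diff
		}
	}
	return d
}

func main() {
	// Hypothetical centroid distances at two snapshots.
	t1 := []float64{0.1, 0.2, 0.3, 0.4}
	t2 := []float64{0.6, 0.7, 0.8, 0.9}
	fmt.Printf("D = %.2f\n", ksStatistic(t1, t2))
}
```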
**Acceptance gates:**
- G3.5.A — A scorer-version bump triggers a non-zero `Drifted` count
on a corpus of historical ScoredRuns where the new logic produces
different categories than the persisted ones.
- G3.5.B — `ScorerDriftReport.ShiftMatrix` is deterministic-ordered
(count desc, ties broken alphabetically) so JSON output is stable
across runs.
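A minimal sketch of the shape gates G3.5.A-B imply. Type and field names here are simplified assumptions; the real definitions live in `internal/drift/drift.go`:

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical simplified shapes standing in for the real drift types.
type ScoredRun struct {
	PersistedCategory string // category stored at decision time
	CurrentCategory   string // category the current scorer produces today
}

type Shift struct {
	From, To string
	Count    int
}

type ScorerDriftReport struct {
	Total, Drifted int
	ShiftMatrix    []Shift
}

// computeScorerDrift counts persisted-vs-current mismatches and sorts
// the shift matrix count-desc, ties broken alphabetically, so the JSON
// output is stable across runs (the G3.5.B contract).
func computeScorerDrift(runs []ScoredRun) ScorerDriftReport {
	counts := map[[2]string]int{}
	rep := ScorerDriftReport{Total: len(runs)}
	for _, r := range runs {
		if r.PersistedCategory != r.CurrentCategory {
			rep.Drifted++
			counts[[2]string{r.PersistedCategory, r.CurrentCategory}]++
		}
	}
	for k, n := range counts {
		rep.ShiftMatrix = append(rep.ShiftMatrix, Shift{From: k[0], To: k[1], Count: n})
	}
	sort.Slice(rep.ShiftMatrix, func(i, j int) bool {
		a, b := rep.ShiftMatrix[i], rep.ShiftMatrix[j]
		if a.Count != b.Count {
			return a.Count > b.Count
		}
		if a.From != b.From {
			return a.From < b.From
		}
		return a.To < b.To
	})
	return rep
}

func main() {
	rep := computeScorerDrift([]ScoredRun{
		{"forklift", "forklift"},
		{"forklift", "warehouse"},
		{"forklift", "warehouse"},
		{"clerical", "admin"},
	})
	fmt.Println(rep.Drifted, rep.ShiftMatrix)
}
```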
### §3.6 — Staffing-side structured filter
**What it is.** Reality tests on the candidates + workers corpora
(commits `0d1553c`, `a97881d`) surfaced that pure semantic retrieval
can't gate by location/status/availability — the matrix indexer
returns Production Workers for a Forklift+OSHA-30 query because
nomic-embed-text's geometry doesn't separate the role labels well.
Structured filtering is the addressable piece: pre-filter the
candidate set on metadata fields BEFORE semantic ranking.
**What's shipped (commit `b199093`):**
- `SearchRequest.MetadataFilter`: a `map[string]any` of metadata
field → expected value (single value or list-of-values for OR
semantics within a key, AND across keys)
- Post-retrieval filter applied before top-K truncation in
`internal/matrix/retrieve.go`
- `SearchResponse.MetadataFilterDropped` for telemetry on filter
aggressiveness
- 7 unit tests covering nil filter, missing metadata, exact match,
AND across keys, OR within list, bool match, malformed JSON
**Deferred:**
- Pre-retrieval SQL gate via `queryd` (the actual hybrid). The
post-retrieval filter is an MVP that helps when the candidate
set is mostly relevant; for aggressive filters that drop most
results, a SQL pre-filter into matrix retrieval would surface
the right candidates with less wasted embedding work.
- Filter language richer than equality (e.g. range, prefix, regex).
**Acceptance gates:**
- G3.6.A — `MetadataFilter: {"state": "IL"}` against a mixed-state
corpus drops every non-IL result; `MetadataFilterDropped` reports
the count.
- G3.6.B — List filter `{"state": ["IL", "WI"]}` keeps both states,
drops the rest (OR within key).
- G3.6.C — Multi-key filter is AND: a result missing any key is
dropped, no exception.
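The filter semantics above (AND across keys, OR within a list value, missing key drops the result) can be sketched as a single predicate. This is an illustrative reconstruction, not the code in `internal/matrix/retrieve.go`:

```go
package main

import "fmt"

// matchesFilter applies the post-retrieval metadata gate: AND across
// keys, OR within a list value, exact equality for single values, and
// a missing key always drops the result (gate G3.6.C). Sketch only.
func matchesFilter(meta map[string]any, filter map[string]any) bool {
	for key, want := range filter {
		got, ok := meta[key]
		if !ok {
			return false // missing metadata key: drop, no exception
		}
		switch w := want.(type) {
		case []any: // list value: OR semantics within the key
			matched := false
			for _, v := range w {
				if got == v {
					matched = true
					break
				}
			}
			if !matched {
				return false
			}
		default: // single value: exact match
			if got != w {
				return false
			}
		}
	}
	return true
}

func main() {
	meta := map[string]any{"state": "IL", "status": "active"}
	fmt.Println(matchesFilter(meta, map[string]any{"state": []any{"IL", "WI"}}))
	fmt.Println(matchesFilter(meta, map[string]any{"state": "WI"}))
}
```

Results failing the predicate before top-K truncation would be tallied into `MetadataFilterDropped`.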
### §3.7 — Operational rating wiring
**What it is.** PRD loop 4 (rating + distillation) needs real
inflows to be a learning system rather than a substrate. The
playbook-record endpoint (`06e7152`) takes one (query, answer,
score) per call; productizing it into actual signal sources is what
makes the system get smarter with use.
**What's shipped (commit `6392772`):**
- `POST /v1/matrix/playbooks/bulk` — bulk-record N successes;
per-entry success/failure response so callers can see which of
a 4,701-row historical placement import succeeded vs which
failed validation.
- Single-record path from `06e7152` unchanged.
**Deferred:**
- UI shim for click-tracking (no Go demo UI yet — the Bun demo at
`devop.live/lakehouse/` is still serving the public surface).
When the Go UI lands or a feedback API is added to the Bun UI,
every coordinator click → bulk-batched POST → playbook entry.
- Negative feedback (this match didn't work). Currently only
positive scores are recorded; a rejection signal would help the
learning loop avoid pushing bad matches.
- Time-decay on playbook scores so stale recommendations attenuate.
**Acceptance gates:**
- G3.7.A — Bulk POST of N entries returns `{recorded, failed,
results[]}` with per-entry IDs/errors, no single-entry failure
aborting the batch.
- G3.7.B — Each recorded entry surfaces in `/v1/matrix/search` with
`use_playbook=true` after a re-query.
### §3.8 — Observer-KB workflow runner (Archon-style multi-pass)
**What it is.** The architectural pattern documented in the Rust
`observer-kb` branch (10 commits ahead of main, never merged) and
proven by `/home/profit/external/Archon`'s workflow engine. Multiple
mode passes processing data, with each pass an objective measurement
that contributes to the KB:
```
Raw data
↓ Mode: EXTRACT structured facts/entities/relationships
↓ Mode: VALIDATOR fact-check, confidence 1-10
↓ Mode: HALLUCINATION verify each claim, flag likely fabrications
↓ Mode: CONSENSUS multiple passes until extraction converges
↓ Mode: REDTEAM attack what survived, patch what fails
↓ Mode: PIPELINE clean → Q&A structure → topic group → rank
↓ RENDER curated doc anchored on questions
```
This is the *orchestrator* missing from §3.4 components 1-5: each
SPEC §3.4 piece (relevance, downgrade, scorer, drift) is a "mode";
what's missing is the workflow engine that chains them.
**Why it matters.** Per the PRD's product vision: the observer
should make actionable decisions based on watching what's
successful. The workflow runner is how observers compose modes
into multi-pass pipelines that score outcomes rigorously enough
to feed the KB and inform the playbook substrate.
**Reference materials on the system:**
- `/home/profit/lakehouse/.archon/workflows/lakehouse-architect-review.yaml`
(committed `69919d9` in main) — proves Archon-via-Lakehouse
works with a 3-node `shape → weakness → improvement` workflow
- `/home/profit/external/Archon` — the upstream workflow engine
(cloned 2026-04-26); `packages/providers/src/community/pi/provider.ts`
has the local Lakehouse-routing mod committed locally as
`3f2afc8` (not pushed to upstream `coleam00/Archon`)
- Rust `observer-kb` branch (10 commits, +4338/-55506 LoC) —
`apps/observer-kb/docs/PRD.md` documents the multi-pass
architecture; `scripts/{deep_analysis,extract_knowledge,process_knowledge}.py`
are the Python prototypes that proved it on real ChatGPT/Claude
PDF data (496 topics, 300 decisions, 100 insights extracted)
**Components to port (in dependency order):**
1. **Workflow definition** (`internal/workflow/types.go`) — YAML
schema matching Archon's shape: `name`, `description`, `provider`,
`model`, list of `nodes` each with `id`, `prompt`, `allowed_tools`,
`effort`, `idle_timeout`, `depends_on`. The depends_on edges form
a DAG; the runner resolves topologically.
2. **Node executor** (`internal/workflow/runner.go`) — given a
workflow and a starting context, walks the DAG, executes each
node by dispatching to the configured backend (matrix.Search,
distillation.ScoreRecord, drift.ComputeScorerDrift, or a generic
prompt-against-LLM via gateway `/v1/chat`), captures per-node
output, makes it available as `$<node_id>.output` in subsequent
nodes.
3. **Provenance recording** — every node execution lands an
ObservedOp (via the observerd substrate from `bc9ab93`) with
`source: "workflow"`, the workflow name + node ID, input/output
summaries, and timing. The ring buffer + JSONL log become the
substrate for the rating+distillation loop's KB feed.
4. **Mode catalog** (`internal/workflow/modes.go`) — registry of
the modes the runner can dispatch to. Each mode is a Go function
matching a uniform `func(ctx, input map[string]any) (map[string]any, error)`
signature so workflows can compose them. Initial modes from
§3.4: `matrix.search`, `matrix.relevance`, `matrix.downgrade`,
`playbook.record`, `playbook.lookup`, `distillation.score`,
`drift.scorer`. Plus `llm.chat` for free-form mode prompts.
5. **HTTP surface**: `POST /v1/observer/workflow/run` accepts a
workflow YAML body + a starting context; returns the per-node
results + the chain of ObservedOps generated. `GET
/v1/observer/workflow/list` lists workflows in a known directory
for operator discoverability.
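A minimal workflow definition in the shape component 1 describes. The field names come from the schema list above; the workflow name, prompts, and mode values are illustrative, not a shipped workflow:

```yaml
name: example-drift-review
description: illustrative 3-node chain; prompts and values are examples
provider: lakehouse
model: small-default
nodes:
  - id: search
    prompt: "matrix.search: Forklift + OSHA-30 candidates in IL"
    allowed_tools: [matrix.search]
    effort: low
    idle_timeout: 60
  - id: score
    prompt: "distillation.score over $search.output"
    allowed_tools: [distillation.score]
    depends_on: [search]
  - id: render
    prompt: "summarize $score.output for the KB"
    allowed_tools: [llm.chat]
    depends_on: [score]
```

The `depends_on` edges form the DAG (`search → score → render`); the runner resolves them topologically and substitutes each `$<node_id>.output` reference before dispatch.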
**Why integrate into observerd, not a new service.** The observer
is the system resource that watches and records. Workflows ARE
observation patterns — multi-step processes whose every step is
recorded. Putting the runner inside observerd keeps the
"measurement → KB feed" wiring tight; a separate service would
re-implement the recording layer.
**Acceptance gates:**
- G3.8.A — Load a workflow YAML matching the Archon `lakehouse-architect-review.yaml`
shape; runner executes the 3-node DAG topologically.
- G3.8.B — Each node execution lands an ObservedOp with
`source: "workflow"` and the node's input/output. Stats endpoint
shows the workflow ops.
- G3.8.C — A node referencing `$<prior_node>.output` in its prompt
resolves correctly; missing reference is a clear error not a
silent empty string.
- G3.8.D — Mode catalog dispatches `matrix.search` invocation to
the matrixd backend without going through HTTP (in-process
function call when matrixd is co-resident).
**Status:** PORT TARGET, not yet started. SPEC commits the design;
implementation is its own wave (estimated **L** effort given the
DAG runner + mode dispatch + provenance recording).
### §3.3 — UI (HTMX)
**Approach:** server-rendered Go templates using `html/template`,