SPEC §3.5/§3.6/§3.7/§3.8 — name F/B/C as port targets + add Archon-style workflow runner
Per the 2026-04-29 scope-discipline pause: the wave shipped four
pieces beyond SPEC §3.4 component scope, and one architectural
pattern surfaced (Archon-style multi-pass workflow runner) that's
the observer's natural growth path. Document them as port targets
so the next scrum review has authoritative SPEC components.
§3.5 — Drift quantification (loop 5 of the PRD)
Names the SCORER drift work shipped in be65f85 + the deferred
shapes (PLAYBOOK drift, EMBEDDING drift, AUDIT BASELINE drift).
Acceptance gates G3.5.A–B.
§3.6 — Staffing-side structured filter
Names the metadata-filter MVP shipped in b199093 + the deferred
pre-retrieval SQL gate via queryd. Acceptance gates G3.6.A–C.
§3.7 — Operational rating wiring
Names the bulk playbook-record endpoint shipped in 6392772 + the
deferred UI shim, negative-feedback path, and time-decay.
Acceptance gates G3.7.A–B.
§3.8 — Observer-KB workflow runner (Archon-style multi-pass) —
PORT TARGET, not yet started
Documents the architecture J was working on across the Rust
observer-kb branch (10 commits ahead of main, never merged) and
the local Archon mod (committed 2026-04-29 as 3f2afc8 in
/home/profit/external/Archon, not pushed to coleam00/Archon).
The pattern: multi-pass mode chain (extract → validator →
hallucination → consensus → redteam → pipeline → render) where
each pass is a deterministic measurement. The observer is the
natural home — workflows ARE observation patterns whose every
step is recorded. Five components in dependency order: workflow
definition (YAML), node executor (DAG runner), provenance
recording (ObservedOps), mode catalog (matrix.search,
distillation.score, drift.scorer, llm.chat), HTTP surface
(/v1/observer/workflow/run).
Reference materials on the system (preserved, not lost):
- /home/profit/lakehouse/.archon/workflows/lakehouse-architect-review.yaml
(Rust main, 69919d9) — 3-node Archon-via-Lakehouse proof
- /home/profit/external/Archon dev branch — upstream engine
with local pi/provider.ts mod (3f2afc8) for Lakehouse routing
- Rust observer-kb branch — apps/observer-kb/docs/PRD.md +
Python prototypes proven on real ChatGPT/Claude PDF data
Acceptance gates G3.8.A–D. Estimated effort: L.
PRD updated with "Observer as system resource (clarified
2026-04-29)" section pointing at §3.8 as the architectural growth
path. The bare-bones observerd in bc9ab93 is the substrate; the
workflow runner is what makes it the "objective measurement engine"
the small-model pipeline needs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 97dd3f826d (parent bc9ab93afe)

docs/PRD.md (+21)
@@ -34,6 +34,27 @@ What the Go refactor is FOR: a second-language pass surfaces architectural weakn

**The playbook + matrix indexer must produce the results we're looking for.** That's the single load-bearing acceptance criterion. Throughput, scaling, code elegance — all secondary. If a deep-field reality test on the 500K corpus surfaces wrong answers, the loop isn't working and we fix that before adding anything else.

### Observer as system resource (clarified 2026-04-29)

The observer is not a service among services — it's a *system
resource*. Its job is to be objective about the process: watch
everything, record measurements, surface what worked vs what
didn't, feed the KB so the playbook substrate can decide the
right pathway to the correct outcome.

The bare-bones observerd shipped in `bc9ab93` (event ingest +
stats) is the substrate for this. The architectural pattern
that grows it into the full "objective measurement engine" is
the **multi-pass workflow runner** documented in SPEC §3.8 —
inspired by Archon (`/home/profit/external/Archon`) and proven
in the Rust `observer-kb` branch's Python prototypes
(`deep_analysis.py`, `extract_knowledge.py`,
`process_knowledge.py`).

The pipeline mode-chain (extract → validator → hallucination →
consensus → redteam → pipeline → render) IS how the observer
makes actionable decisions: each mode pass is a deterministic
measurement; what survives the gauntlet is what feeds the KB.

### Triage / human-in-loop

Most cases are abstract enough that small-model + pathway + matrix can complete them. Some can't — they need a human. The system's job is to **identify which is which** and only escalate the second class. Frontier models partially solve this internally with their thinking loops; we're externalizing it so:

docs/SPEC.md (+206)
@@ -192,6 +192,212 @@ and got reduced to "we have vectord" in earlier port-planning. The
SPEC names it explicitly so the port preserves the multi-corpus
retrieval shape AND the learning loop, not just the HNSW substrate.

### §3.5 — Drift quantification (loop 5 of the PRD)

**What it is.** PRD names "drift" as the 5th loop: quantify when
historical decisions stop matching current reality. Distinct from
the rating+distillation loop because drift is MEASUREMENT, not
LEARNING. The learning loop says "this match worked, remember it";
the drift loop says "this 4-month-old playbook entry — does it
still match what the substrate would surface today?"

**What's shipped (commit `be65f85`):**
- SCORER drift: re-runs current `distillation.ScoreRecord` over
  historical (EvidenceRecord, persisted_category) pairs and
  reports mismatches + a sorted shift matrix
- `internal/drift/drift.go` — pure-function `ComputeScorerDrift`
  (report shape sketched below)
- 6 unit tests covering no-drift, shift detection, multi-shift
  sorted-by-count, includeEntries flag, empty input, scorer-version
  stamping

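A minimal sketch of the report shape and the ordering rule those
tests pin down. `ComputeScorerDrift` and
`ScorerDriftReport.ShiftMatrix` are the names this section commits
to; every other field and helper name below is an illustrative
assumption, not the shipped `internal/drift/drift.go` API.

```go
// Sketch only — fields beyond ScorerDriftReport.ShiftMatrix are
// assumptions, not the shipped drift.go API.
package drift

import "sort"

type ShiftCell struct {
	From  string // persisted category
	To    string // category the current scorer assigns
	Count int
}

type ScorerDriftReport struct {
	ScorerVersion string      // stamped, per the shipped tests
	Total         int         // historical pairs examined
	Drifted       int         // pairs whose category changed (G3.5.A)
	ShiftMatrix   []ShiftCell // deterministically ordered (G3.5.B)
}

// sortShiftMatrix orders cells count desc, ties broken
// alphabetically on From then To, so JSON output is stable
// across runs (G3.5.B).
func sortShiftMatrix(cells []ShiftCell) {
	sort.Slice(cells, func(i, j int) bool {
		a, b := cells[i], cells[j]
		if a.Count != b.Count {
			return a.Count > b.Count
		}
		if a.From != b.From {
			return a.From < b.From
		}
		return a.To < b.To
	})
}
```
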
**Future drift shapes (not shipped):**
- PLAYBOOK drift: re-run playbook queries through current
  matrix-search; recorded answer not in top-K = drift (see the
  check sketched below)
- EMBEDDING drift: KS-test on vector distribution at T1 vs T2
- AUDIT BASELINE drift: matches Rust `audit_baselines.jsonl`
  longitudinal signal

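For the deferred PLAYBOOK shape, the core check is small enough to
sketch. Nothing here is shipped; the function and its inputs are
assumptions about how the deferred design might look.

```go
// Hypothetical PLAYBOOK drift check — assumed design for a
// deferred item, not shipped code.
package drift

// playbookEntryDrifted re-runs one recorded (query → answer) pair
// through current matrix-search elsewhere, then flags drift when
// the recorded answer no longer appears in the top-K result IDs.
func playbookEntryDrifted(topK []string, recordedAnswer string) bool {
	for _, id := range topK {
		if id == recordedAnswer {
			return false // still surfaced: no drift
		}
	}
	return true // fell out of top-K: drift
}
```
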
**Acceptance gates:**
- G3.5.A — A scorer-version bump triggers a non-zero `Drifted` count
  on a corpus of historical ScoredRuns where the new logic produces
  different categories than the persisted ones.
- G3.5.B — `ScorerDriftReport.ShiftMatrix` is deterministically
  ordered (count desc, ties broken alphabetically) so JSON output is
  stable across runs.

### §3.6 — Staffing-side structured filter

**What it is.** Reality tests on the candidates + workers corpora
(commits `0d1553c`, `a97881d`) surfaced that pure semantic retrieval
can't gate by location/status/availability — the matrix indexer
returns Production Workers for a Forklift+OSHA-30 query because
nomic-embed-text's geometry doesn't separate the role labels well.
Structured filtering is the addressable piece: pre-filter the
candidate set on metadata fields BEFORE semantic ranking.

**What's shipped (commit `b199093`):**
- `SearchRequest.MetadataFilter` — `map[string]any` of metadata
  field → expected value (single value or list-of-values for OR
  semantics within a key, AND across keys)
- Post-retrieval filter applied before top-K truncation in
  `internal/matrix/retrieve.go` (predicate sketched below)
- `SearchResponse.MetadataFilterDropped` for telemetry on filter
  aggressiveness
- 7 unit tests covering nil filter, missing metadata, exact match,
  AND across keys, OR within list, bool match, malformed JSON

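A minimal sketch of the filter semantics gates G3.6.A–C pin down.
`SearchRequest.MetadataFilter` is the shipped name; the predicate
below is an illustrative stand-in for the logic in
`internal/matrix/retrieve.go`, not a copy of it.

```go
// Sketch of the MetadataFilter semantics — AND across keys, OR
// within a list value, missing key drops the result. Illustrative
// only; the shipped logic lives in internal/matrix/retrieve.go.
package matrix

import "reflect"

func matchesFilter(meta, filter map[string]any) bool {
	for key, want := range filter {
		got, ok := meta[key]
		if !ok {
			return false // missing metadata key drops the result (G3.6.C)
		}
		if list, isList := want.([]any); isList {
			matched := false // list value: OR within the key (G3.6.B)
			for _, v := range list {
				if reflect.DeepEqual(got, v) {
					matched = true
					break
				}
			}
			if !matched {
				return false
			}
		} else if !reflect.DeepEqual(got, want) {
			return false // single value: exact equality (G3.6.A)
		}
	}
	return true // AND across keys: every key matched
}
```

Applied per result before top-K truncation; each `false` would
increment `MetadataFilterDropped`.
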
**Deferred:**
- Pre-retrieval SQL gate via `queryd` (the actual hybrid). The
  post-retrieval filter is an MVP that helps when the candidate
  set is mostly relevant; for aggressive filters that drop most
  results, a SQL pre-filter into matrix retrieval would surface
  the right candidates with less wasted embedding work.
- Filter language richer than equality (e.g. range, prefix, regex).

**Acceptance gates:**
- G3.6.A — `MetadataFilter: {"state": "IL"}` against a mixed-state
  corpus drops every non-IL result; `MetadataFilterDropped` reports
  the count.
- G3.6.B — List filter `{"state": ["IL", "WI"]}` keeps both states,
  drops the rest (OR within key).
- G3.6.C — Multi-key filter is AND: a result missing any key is
  dropped, no exception.

### §3.7 — Operational rating wiring

**What it is.** PRD loop 4 (rating + distillation) needs real
inflows to be a learning system rather than a substrate. The
playbook-record endpoint (`06e7152`) takes one (query, answer,
score) per call; productizing it into actual signal sources is what
makes the system get smarter with use.

**What's shipped (commit `6392772`):**
- `POST /v1/matrix/playbooks/bulk` — bulk-record N successes;
  per-entry success/failure response (shapes sketched below) so
  callers can see which of a 4,701-row historical placement import
  succeeded vs which failed validation.
- Single-record path from `06e7152` unchanged.

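Illustrative wire shapes for the bulk endpoint. Gate G3.7.A fixes
only `{recorded, failed, results[]}`; the remaining field names are
assumptions made for the sketch.

```go
// Sketch of the bulk playbook-record wire shapes. Only
// {recorded, failed, results[]} is pinned by G3.7.A; the other
// field names are illustrative assumptions.
package playbook

type BulkEntry struct {
	Query  string  `json:"query"`
	Answer string  `json:"answer"`
	Score  float64 `json:"score"`
}

type BulkRecordRequest struct {
	Entries []BulkEntry `json:"entries"`
}

type BulkEntryResult struct {
	ID    string `json:"id,omitempty"`    // set when the entry recorded
	Error string `json:"error,omitempty"` // set when validation failed
}

type BulkRecordResponse struct {
	Recorded int               `json:"recorded"`
	Failed   int               `json:"failed"`
	Results  []BulkEntryResult `json:"results"` // one per entry, input order
}
```

Per-entry results mean one malformed row in a 4,701-row import
shows up as its own error entry rather than aborting the batch.
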
**Deferred:**
- UI shim for click-tracking (no Go demo UI yet — the Bun demo at
  `devop.live/lakehouse/` is still serving the public surface).
  When the Go UI lands or a feedback API is added to the Bun UI,
  every coordinator click → bulk-batched POST → playbook entry.
- Negative feedback (this match didn't work). Currently only
  positive scores are recorded; a rejection signal would help the
  learning loop avoid pushing bad matches.
- Time-decay on playbook scores so stale recommendations attenuate
  (see the sketch below).

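One plausible shape for the deferred time-decay: exponential
attenuation with a configurable half-life. This is entirely an
assumption, not committed design.

```go
// Hypothetical time-decay for playbook scores — this item is
// deferred; nothing below is shipped or committed design.
package playbook

import (
	"math"
	"time"
)

// decayedScore attenuates a recorded score by age, so with
// halfLife = 90 days a 6-month-old entry carries ~25% of its
// original weight.
func decayedScore(score float64, recordedAt time.Time, halfLife time.Duration) float64 {
	age := time.Since(recordedAt)
	return score * math.Pow(0.5, age.Seconds()/halfLife.Seconds())
}
```
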
**Acceptance gates:**
- G3.7.A — Bulk POST of N entries returns `{recorded, failed,
  results[]}` with per-entry IDs/errors, no single-entry failure
  aborting the batch.
- G3.7.B — Each recorded entry surfaces in `/v1/matrix/search` with
  `use_playbook=true` after a re-query.

### §3.8 — Observer-KB workflow runner (Archon-style multi-pass)

**What it is.** The architectural pattern documented in the Rust
`observer-kb` branch (10 commits ahead of main, never merged) and
proven by `/home/profit/external/Archon`'s workflow engine. Multiple
mode passes processing data, with each pass an objective measurement
that contributes to the KB:

```
Raw data
  ↓ Mode: EXTRACT        structured facts/entities/relationships
  ↓ Mode: VALIDATOR      fact-check, confidence 1-10
  ↓ Mode: HALLUCINATION  verify each claim, flag likely fabrications
  ↓ Mode: CONSENSUS      multiple passes until extraction converges
  ↓ Mode: REDTEAM        attack what survived, patch what fails
  ↓ Mode: PIPELINE       clean → Q&A structure → topic group → rank
  ↓ RENDER               curated doc anchored on questions
```

This is the *orchestrator* missing from §3.4 components 1-5: each
SPEC §3.4 piece (relevance, downgrade, scorer, drift) is a "mode";
what's missing is the workflow engine that chains them.

**Why it matters.** Per the PRD's product vision: the observer
should make actionable decisions based on watching what's
successful. The workflow runner is how observers compose modes
into multi-pass pipelines that score outcomes rigorously enough
to feed the KB and inform the playbook substrate.

**Reference materials on the system:**
- `/home/profit/lakehouse/.archon/workflows/lakehouse-architect-review.yaml`
  (committed `69919d9` in main) — proves Archon-via-Lakehouse
  works with a 3-node `shape → weakness → improvement` workflow
- `/home/profit/external/Archon` — the upstream workflow engine
  (cloned 2026-04-26); `packages/providers/src/community/pi/provider.ts`
  has the local Lakehouse-routing mod committed locally as
  `3f2afc8` (not pushed to upstream `coleam00/Archon`)
- Rust `observer-kb` branch (10 commits, +4338/-55506 LoC) —
  `apps/observer-kb/docs/PRD.md` documents the multi-pass
  architecture; `scripts/{deep_analysis,extract_knowledge,process_knowledge}.py`
  are the Python prototypes that proved it on real ChatGPT/Claude
  PDF data (496 topics, 300 decisions, 100 insights extracted)

**Components to port (in dependency order; a sketch of components
1, 2, and 4 follows the list):**

1. **Workflow definition** (`internal/workflow/types.go`) — YAML
   schema matching Archon's shape: `name`, `description`, `provider`,
   `model`, list of `nodes` each with `id`, `prompt`, `allowed_tools`,
   `effort`, `idle_timeout`, `depends_on`. The `depends_on` edges form
   a DAG; the runner resolves it topologically.

2. **Node executor** (`internal/workflow/runner.go`) — given a
   workflow and a starting context, walks the DAG, executes each
   node by dispatching to the configured backend (matrix.Search,
   distillation.ScoreRecord, drift.ComputeScorerDrift, or a generic
   prompt-against-LLM via gateway `/v1/chat`), captures per-node
   output, and makes it available as `$<node_id>.output` in
   subsequent nodes.

3. **Provenance recording** — every node execution lands an
   ObservedOp (via the observerd substrate from `bc9ab93`) with
   `source: "workflow"`, the workflow name + node ID, input/output
   summaries, and timing. The ring buffer + JSONL log become the
   substrate for the rating+distillation loop's KB feed.

4. **Mode catalog** (`internal/workflow/modes.go`) — registry of
   the modes the runner can dispatch to. Each mode is a Go function
   matching a uniform `func(ctx, input map[string]any) (map[string]any, error)`
   signature so workflows can compose them. Initial modes from
   §3.4: `matrix.search`, `matrix.relevance`, `matrix.downgrade`,
   `playbook.record`, `playbook.lookup`, `distillation.score`,
   `drift.scorer`. Plus `llm.chat` for free-form mode prompts.

5. **HTTP surface** — `POST /v1/observer/workflow/run` accepts a
   workflow YAML body + a starting context; returns the per-node
   results + the chain of ObservedOps generated. `GET
   /v1/observer/workflow/list` lists workflows in a known directory
   for operator discoverability.

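A minimal sketch of components 1, 2, and 4 under the names above.
The struct fields mirror the YAML keys listed in component 1; the
`mode` field on a node, the helper names, and the error strings are
illustrative assumptions, not committed design. The comment in the
run loop marks where component 3's ObservedOp would land.

```go
// Sketch of components 1, 2, and 4. Fields mirror the YAML keys
// named in this section; Node.Mode, helper names, and error text
// are assumptions for illustration only.
package workflow

import (
	"context"
	"fmt"
	"strings"
)

// Component 1: workflow definition matching the Archon YAML shape.
type Node struct {
	ID           string   `yaml:"id"`
	Mode         string   `yaml:"mode"` // ASSUMED: names the catalog entry to dispatch to
	Prompt       string   `yaml:"prompt"`
	AllowedTools []string `yaml:"allowed_tools"`
	Effort       string   `yaml:"effort"`
	IdleTimeout  string   `yaml:"idle_timeout"`
	DependsOn    []string `yaml:"depends_on"`
}

type Workflow struct {
	Name        string `yaml:"name"`
	Description string `yaml:"description"`
	Provider    string `yaml:"provider"`
	Model       string `yaml:"model"`
	Nodes       []Node `yaml:"nodes"`
}

// Component 4: uniform mode signature so workflows can compose modes.
type Mode func(ctx context.Context, input map[string]any) (map[string]any, error)

// Registry keyed by mode name: "matrix.search", "drift.scorer", "llm.chat", ...
var Modes = map[string]Mode{}

// Component 2: walk the depends_on DAG topologically. Each finished
// node's output is exposed to later prompts as $<node_id>.output.
func Run(ctx context.Context, wf Workflow, start map[string]any) (map[string]map[string]any, error) {
	outputs := map[string]map[string]any{}
	pending := append([]Node(nil), wf.Nodes...)
	for len(pending) > 0 {
		ran := false
		next := pending[:0] // in-place filter of not-yet-ready nodes
		for _, n := range pending {
			if !depsMet(n.DependsOn, outputs) {
				next = append(next, n) // not ready; retry next pass
				continue
			}
			prompt, err := resolveRefs(n.Prompt, outputs)
			if err != nil {
				return nil, fmt.Errorf("node %s: %w", n.ID, err)
			}
			mode, ok := Modes[n.Mode]
			if !ok {
				return nil, fmt.Errorf("node %s: unknown mode %q", n.ID, n.Mode)
			}
			out, err := mode(ctx, map[string]any{"prompt": prompt, "start": start})
			if err != nil {
				return nil, fmt.Errorf("node %s: %w", n.ID, err)
			}
			outputs[n.ID] = out // component 3: land the ObservedOp here
			ran = true
		}
		pending = next
		if !ran && len(pending) > 0 {
			return nil, fmt.Errorf("workflow %s: cycle or unmet depends_on", wf.Name)
		}
	}
	return outputs, nil
}

func depsMet(deps []string, outputs map[string]map[string]any) bool {
	for _, d := range deps {
		if _, ok := outputs[d]; !ok {
			return false
		}
	}
	return true
}

// resolveRefs substitutes each completed $<node_id>.output; a "$"
// left next to ".output" afterwards is treated as a dangling
// reference, not a silent empty string (crude check, sketch only).
func resolveRefs(prompt string, outputs map[string]map[string]any) (string, error) {
	for id, out := range outputs {
		prompt = strings.ReplaceAll(prompt, "$"+id+".output", fmt.Sprint(out["output"]))
	}
	if strings.Contains(prompt, "$") && strings.Contains(prompt, ".output") {
		return "", fmt.Errorf("unresolved $<node_id>.output reference")
	}
	return prompt, nil
}
```

The ready-set loop doubles as cycle detection: a full pass that
runs nothing while nodes remain means the `depends_on` edges don't
form a DAG, which matches the "clear error, not silent" spirit of
G3.8.C.
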
**Why integrate into observerd, not a new service.** The observer
is the system resource that watches and records. Workflows ARE
observation patterns — multi-step processes whose every step is
recorded. Putting the runner inside observerd keeps the
"measurement → KB feed" wiring tight; a separate service would
re-implement the recording layer.

**Acceptance gates:**
- G3.8.A — Load a workflow YAML matching the Archon `lakehouse-architect-review.yaml`
  shape; the runner executes the 3-node DAG topologically.
- G3.8.B — Each node execution lands an ObservedOp with
  `source: "workflow"` and the node's input/output. The stats
  endpoint shows the workflow ops.
- G3.8.C — A node referencing `$<prior_node>.output` in its prompt
  resolves correctly; a missing reference is a clear error, not a
  silent empty string.
- G3.8.D — Mode catalog dispatches a `matrix.search` invocation to
  the matrixd backend without going through HTTP (in-process
  function call when matrixd is co-resident).

**Status:** PORT TARGET, not yet started. SPEC commits the design;
implementation is its own wave (estimated **L** effort given the
DAG runner + mode dispatch + provenance recording).

### §3.3 — UI (HTMX)

**Approach:** server-rendered Go templates using `html/template`,