106 Commits

Author SHA1 Message Date
root
997527be4d matrix: cross-role playbook gate — closes real_001 bleed (OPEN #1)
real_001 surfaced same-client+city queries bleeding across roles:
Q#2 (Forklift Operator @ Beacon Freight Detroit) recorded e-6193
in the playbook corpus. Q#5 (Pickers same client+city) and Q#10
(CNC Operator same client+city) embedded within 0.13-0.18 cosine of
Q#2's query — well inside the 0.20 inject threshold — so e-6193
injected on both, demoting the cold-pass-correct workers.

Root cause: the inject distance threshold isn't tight enough on
the same-client+city cluster. Cosine collapses queries that share
city + client + count-token + time-token regardless of role. The
existing judge gate is per-injection at record time and doesn't
fire at retrieve time.

Fix: structural role gate in front of both Shape A boost and
Shape B inject. PlaybookEntry gains Role; SearchRequest gains
QueryRole. When both are non-empty and differ under roleEqual's
case+plural normalization, the entry is rejected before BoostFactor
or judge-gate logic runs.

Backward-compat: empty role on either side disables the gate —
preserves behavior for the lift suite's free-form multi-constraint
queries that have no clean single role. Caller-supplied (not
inferred), so existing recordings unaffected.
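
The gate above can be sketched as follows. This is a minimal
illustration, not the code in internal/matrix/playbook.go: only the
name roleEqual and the case+plural behavior come from this message;
gateAllows and the exact plural rule are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// roleEqual reports whether two role labels match after lowercasing and
// allowing either side to carry a trailing "s" or "es".
// Hypothetical implementation of the normalization described above.
func roleEqual(a, b string) bool {
	a = strings.ToLower(strings.TrimSpace(a))
	b = strings.ToLower(strings.TrimSpace(b))
	return a == b ||
		a+"s" == b || a+"es" == b ||
		b+"s" == a || b+"es" == a
}

// gateAllows is a hypothetical helper: an empty role on either side
// disables the gate (backward compat); otherwise the roles must match.
func gateAllows(entryRole, queryRole string) bool {
	if entryRole == "" || queryRole == "" {
		return true
	}
	return roleEqual(entryRole, queryRole)
}

func main() {
	fmt.Println(gateAllows("Forklift Operator", "CNC Operator"))       // false
	fmt.Println(gateAllows("Forklift Operator", "forklift operators")) // true
	fmt.Println(gateAllows("", "CNC Operator"))                        // true
}
```

Comparing with an appended suffix rather than stripping one avoids
mangling singular roles that already end in "s" or "es".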

Wire-through:
- internal/matrix/playbook.go: Role field, NewPlaybookEntryWithRole,
  roleEqual helper with plural+case normalization
- internal/matrix/retrieve.go: QueryRole on SearchRequest, threaded
  to both ApplyPlaybookBoost + InjectPlaybookMisses
- cmd/matrixd/main.go: role on POST /matrix/playbooks/record + bulk
- scripts/playbook_lift/main.go: extractRoleFromNeed regex pulls
  role from "Need N {role}{s} in" queries (the fill_events shape);
  free-form queries fall back to empty (gate disabled)
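
For the "Need N {role}{s} in" shape, the extraction might look like
the sketch below. The function name comes from this message; the
regex itself is an assumption and may differ from the one in
scripts/playbook_lift/main.go.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// needRE is an assumed pattern for the fill_events query shape:
// "Need <count> <role> in <city> ...". The lazy group stops at the
// first " in ", so a role that itself contains " in " would be cut short.
var needRE = regexp.MustCompile(`(?i)^\s*need\s+\d+\s+(.+?)\s+in\s`)

// extractRoleFromNeed returns the role phrase, or "" when the query
// does not follow the expected shape (which leaves the role gate disabled).
func extractRoleFromNeed(q string) string {
	m := needRE.FindStringSubmatch(q)
	if m == nil {
		return ""
	}
	return strings.TrimSpace(m[1])
}

func main() {
	fmt.Printf("%q\n", extractRoleFromNeed(
		"Need 3 Forklift Operators in Detroit MI starting at 06:00 for Beacon Freight"))
	fmt.Printf("%q\n", extractRoleFromNeed("Forklift loader with OSHA cert"))
}
```

The plural is left on the extracted role here, on the assumption that
the comparison side handles plural normalization.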

Tests (5 new):
- TestInjectPlaybookMisses_RoleGateRejectsCrossRole: exact Q#10
  scenario (distance 0.135, recorded "Forklift Operator", query
  "CNC Operator") — locks the bleed at unit level
- TestInjectPlaybookMisses_RoleGateAllowsSameRole: Forklift Operator
  recording fires on Forklift Operators query (plural normalization)
- TestInjectPlaybookMisses_RoleGateBackwardCompat: empty Role on
  either side = gate disabled, preserves current behavior
- TestApplyPlaybookBoost_RoleGateRejectsCrossRole: Shape A defense
  in depth — boost doesn't fire on cross-role even when answer is
  in cold top-K
- TestRoleEqual_PluralAndCase: case + -s + -es plural normalization

Verification (real_002, same query set as real_001):
- Q#5 Pickers @ Beacon Freight: e-6193 → e-8499 (no bleed)
- Q#10 CNC Operator @ Beacon Freight: e-6193 → w-2404 (no bleed)
- Discoveries + lifts unchanged at 2 each (same-role lift still fires)
- Mean Δdist tightens from -0.127 to -0.040 (boosts no longer
  pulling distances through the floor on cross-role mismatches)

Findings: reports/reality-tests/real_002_findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:34:10 -05:00
root
7f2f112e6a reality_test real_001: real-shape coordinator queries — surfaces cross-role bleed
First retrieval probe with non-synthetic query distribution. Pulls
N rows from /home/profit/lakehouse/data/datasets/fill_events.parquet
(real-shape demand data) and translates each to the natural language
a coordinator would type: "Need {count} {role}s in {city} {state}
starting at {at} for {client}".

Headline: 8/10 cold-pass top-1 = judge-best on real distribution.
Substrate works on queries it was never trained for. v2-moe + workers
corpus carry the load.

Surfaced finding (the real value of running this): same-client+city
queries cluster, and Shape A's distance boost bleeds across roles
within the cluster. Q#2 (Forklift @ Beacon Freight Detroit) records
e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and
Q#10 (CNC Operator same client+city) inherit e-6193 at warm top-1
even though:
- Neither query has its own recorded playbook.
- Neither warm pass triggers a Shape B inject (boosted=0).
- The roles are different staffing categories.

Q#10 specifically demoted the cold-pass-correct w-3759 (judge rating
4 at rank 0) for a worker who was approved by the judge for a
different role on a different query.

Why the lift suite missed it: synthetic queries use 7 disjoint
scenario buckets (forklift+OSHA+WI / CDL+IL / etc.). Real demand
clusters on (client, city). The cluster doesn't exist in the
synthetic distribution.

Why the judge gate doesn't catch it: the gate (5a3364f) is
per-injection at record time. After approval the worker rides Shape A
distance boosts on all later same-cluster queries with no second
gate call.

Becomes new OPEN #1. Fix candidate: role-scoped playbook corpus
metadata + Shape A boost gate on role match. Cheap; doesn't need
new judge calls.

Files:
- scripts/cutover/gen_real_queries.go: parquet → coordinator NL
- tests/reality/real_coord_queries.txt: 10 generated queries
- reports/reality-tests/playbook_lift_real_001.md: harness output
- reports/reality-tests/real_001_findings.md: the reading

Repro:
  go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
  QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
    WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:18:40 -05:00
root
5687ec65c2 G5 cutover prep: embed parity probe — Rust /ai/embed ↔ Go /v1/embed verified
First concrete cutover artifact: scripts/cutover/embed_parity.sh
brings up Go embedd + gateway alongside the live Rust gateway,
hits both /ai/embed and /v1/embed with the same forced model, and
emits a per-date verdict report under reports/cutover/.

Why embed first: the parity invariant is one math identity (cosine
sim of vectors against same input). Retrieve has thousands of edge
cases. If embed parity holds, all downstream vector consumers
inherit confidence; if it doesn't, we catch it in 30s instead of
after a flip.

Verdict 2026-04-30: 5/5 samples cosine=1.000000 with model forced
to nomic-embed-text (v1). Same with nomic-embed-text-v2-moe (both
Ollamas have it loaded). On these samples the math is equivalent
across the gateway plumbing.

Drift catalog (reports/cutover/SUMMARY.md):
- URL: Rust /ai/embed vs Go /v1/embed
- Wire: Rust {embeddings, dimensions} (plural) vs Go {vectors,
  dimension} (singular). Wire-format adapter is the only real
  cutover work for this endpoint.
- L2 norm: Rust unit vectors (~1.0); Go raw Ollama (~20-23). Same
  direction (cos=1.0); harmless under cosine-distance HNSW (which
  is Go vectord's default), but worth fixing in internal/embed/
  before extending to euclidean indexes.
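
Why the L2-norm drift is harmless under cosine but not under
euclidean distance, as a small worked sketch. The vectors are made up
for illustration (one unit-length, one with norm 20); nothing here is
taken from internal/embed/.

```go
package main

import (
	"fmt"
	"math"
)

func norm(v []float64) float64 {
	var s float64
	for _, x := range v {
		s += x * x
	}
	return math.Sqrt(s)
}

// cosine is scale-invariant: multiplying either vector by a positive
// constant leaves the result unchanged.
func cosine(a, b []float64) float64 {
	var dot float64
	for i := range a {
		dot += a[i] * b[i]
	}
	return dot / (norm(a) * norm(b))
}

// euclid is NOT scale-invariant, which is where raw norms would matter.
func euclid(a, b []float64) float64 {
	var s float64
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return math.Sqrt(s)
}

// normalize returns v scaled to unit length.
func normalize(v []float64) []float64 {
	n := norm(v)
	out := make([]float64, len(v))
	for i, x := range v {
		out[i] = x / n
	}
	return out
}

func main() {
	unit := normalize([]float64{3, 4}) // unit vector, like the Rust side
	raw := []float64{12, 16}           // same direction, norm 20, like raw Ollama
	fmt.Printf("cosine=%.6f\n", cosine(unit, raw))
	fmt.Printf("euclidean raw=%.3f\n", euclid(unit, raw))
	fmt.Printf("euclidean normalized=%.3f\n", euclid(unit, normalize(raw)))
}
```

Normalizing at embed time would make both index types agree, at the
cost of one pass over each vector.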

reports/cutover/ now tracked (joined the scrum/ + reality-tests/
exemptions in .gitignore).

Next probe: /v1/matrix/retrieve ↔ Rust /vectors/hybrid for the
real user-facing retrieve path. Embed parity gives that probe a
clean foundation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:07:04 -05:00
root
a2fa9a2ce7 scripts/scrum_review: pipe diff via temp files — fixes argv overflow on large bundles
`jq --arg` and `curl --data-binary @-` both read stdin/argv-bound
buffers. Diffs >~128KB blow past the kernel's argv limit even when
piped via stdin (because we still build `body` as a shell variable
first, then feed it to curl). Voice-ai full bundle was 156K and
hit it.

Switch to writing user/system/body to mktemp files, jq reads via
--rawfile, curl reads via @file. Same on-the-wire shape, no argv
involvement. Cleanup with rm at the end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:57:34 -05:00
root
68d9e554b0 shared: auto-emit Langfuse trace+span per HTTP request — closes OPEN #2
Adds langfuseMiddleware in internal/shared so every daemon's
shared.Run gets free production-traffic trace visibility when
LANGFUSE_URL + LANGFUSE_PUBLIC_KEY + LANGFUSE_SECRET_KEY are set.
Same env names + file shape as the multi_coord_stress driver, so
operators ship one /etc/lakehouse/langfuse.env across the deploy.

Wiring is auth-gated: middleware runs INSIDE the RequireAuth group,
so 401s from credential-stuffing don't pollute traces. /health is
exempt so LB probes don't either. Missing env vars → nil client →
middleware is a passthrough no-op (fail-open per ADR-005 5.1).

Bundled deploy:
- langfuse.env.example template (mode 0640, root:lakehouse)
- 11 systemd units gain `EnvironmentFile=-/etc/lakehouse/langfuse.env`
  (leading - so missing file = OK)
- REPLICATION.md bootstrap section documents setup

Tests (4): nil passthrough, /health bypass, real-request emission,
status-writer wrapping. All green.

STATE_OF_PLAY OPEN list: 5 rows → 4 rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:55:42 -05:00
root
5a3364f539 matrix: judge-gated Shape B inject — closes lift-suite tail issues
Lift suite run #004 left two unresolved tail issues:
- Q6 ("Forklift loader") ↔ Q7 ("Hazmat warehouse, cold storage")
  swap recordings as warm top-1 because their embeddings are within
  0.20 cosine of each other. Distance gate can't tell them apart.
- Q9 + Q15 lose paraphrase recovery when qwen2.5 rephrases past the
  0.20 threshold. Distance says "drift too far"; sometimes the drift
  is real (skip), sometimes the paraphrase is still on-domain (don't
  want to skip).

Multi-coord run #008's judge re-rating proved the LLM can
distinguish: Q3 crane case landed at distance 0.23 (looks tight)
but rating 1 (irrelevant). The judge sees domain mismatch the
embedder doesn't.

This commit lifts that pattern into the matrix substrate. Shape B
inject now optionally routes every candidate through a judge gate
before the rank insert lands. Distance + judge BOTH have to approve.

internal/matrix/playbook.go:
- InjectPlaybookMisses signature gains a query string + an
  optional InjectGate. nil gate preserves pre-judge-gating
  behavior (current tests already pass with nil).
- New InjectGate interface + InjectGateFunc adapter for tests
  and non-LLM callers.
- Per-candidate gate.Approve(query, hit) call inserted between
  the dedup and the inject. Rejected candidates skip silently;
  injected count reflects post-gate decision.
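
The interface-plus-adapter pattern above, sketched under assumptions:
Hit, the bool return type, and injectCandidates are illustrative and
not the actual signatures in internal/matrix/playbook.go. Only the
names InjectGate, InjectGateFunc, and Approve(query, hit) come from
this message.

```go
package main

import "fmt"

// Hit is a hypothetical candidate shape.
type Hit struct {
	ID       string
	Distance float64
}

// InjectGate decides whether a candidate may be injected for a query.
type InjectGate interface {
	Approve(query string, hit Hit) bool
}

// InjectGateFunc adapts a plain function to InjectGate, in the same
// spirit as http.HandlerFunc.
type InjectGateFunc func(query string, hit Hit) bool

func (f InjectGateFunc) Approve(query string, hit Hit) bool { return f(query, hit) }

// injectCandidates keeps candidates that pass the distance threshold
// AND the gate. A nil gate means distance-only behavior.
func injectCandidates(query string, cands []Hit, maxDist float64, gate InjectGate) []Hit {
	var out []Hit
	for _, c := range cands {
		if c.Distance > maxDist {
			continue
		}
		if gate != nil && !gate.Approve(query, c) {
			continue // rejected candidates skip silently
		}
		out = append(out, c)
	}
	return out
}

func main() {
	cands := []Hit{{"e-1", 0.12}, {"e-2", 0.19}, {"e-3", 0.35}}
	rejectAll := InjectGateFunc(func(string, Hit) bool { return false })
	fmt.Println(len(injectCandidates("q", cands, 0.20, nil)))       // 2
	fmt.Println(len(injectCandidates("q", cands, 0.20, rejectAll))) // 0
}
```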

internal/matrix/judge.go (new, ~140 lines):
- LLMJudgeGate calls an Ollama-shape /api/chat endpoint with the
  same 1-5 staffing-rubric prompt that worked in multi_coord
  run #008. fail-closed on HTTP/JSON errors (don't inject if
  judge can't speak — better miss than wrong-domain).
- NewLLMJudgeGate returns nil when URL or Model is empty,
  matching InjectGate's nil-means-no-judge semantics.

internal/matrix/retrieve.go:
- SearchRequest gains JudgeURL, JudgeModel, JudgeMinRating
  fields. Run() builds an LLMJudgeGate when set; passes nil
  otherwise. Backward compatible — existing callers see no
  behavior change.

Tests:
- TestInjectPlaybookMisses_GateRejectsCandidate (rejectAll → 0
  injected, even with tight distance)
- TestInjectPlaybookMisses_GateApprovesCandidate (approveAll →
  same as nil-gate behavior)
- TestInjectPlaybookMisses_GateSeesCorrectQuery (gate receives
  CURRENT query + RECORDED query separately so it can score
  the (current, candidate) pair)
- All 5 existing inject tests updated to new signature

go test ./internal/matrix → all 8 inject tests pass.
go test ./internal/matrix ./internal/shared ./cmd/{matrixd,
queryd,pathwayd,observerd} → all green.

STATE_OF_PLAY:
- OPEN item #1 (judge-gated injection) closed.
- DO NOT RELITIGATE adds the substrate-level judge-gate lock.
- OPEN list now 5 rows (was 6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:38:12 -05:00
root
247e36e687 STATE_OF_PLAY: trim OPEN list — 9 rows → 6, ordered by product leverage
Sprint 4 row removed (shipped: a59ef5b systemd + 54a05d9 docker).
ADR-006 row already dropped on the previous STATE update.

Two lift-suite tail items (Q6↔Q7 adjacent-query, Q9/Q15 liberal-
paraphrase) consolidated into one "judge-gated playbook injection"
row — both are downstream of the same fix (let the judge approve
before Shape B inserts). Captures the design lineage from
multi-coord run #008's judge-rating pattern.

Three items folded into a single "operational nice-to-haves" row:
real-time clock, chatd fixture storage half, liberal-paraphrase
calibration. None are product-blocking; each lights up when
someone hits its specific trigger.

Reorder reflects leverage on the active product theory (multi-
coord staffing co-pilot via the 5-loop substrate), not effort:
1. Judge-gated injection (lift quality + lift-tail closure)
2. Wider Langfuse instrumentation (production observability)
3. Fresh→main merge (operational hygiene as the corpus grows)
4. Distillation full port (production dependency, not yet)
5. Drift quantification (research)
6. Operational nice-to-haves

Lead-in note added: "Items move to closed when the work demands
them, not on a calendar." Locks intent against future drift toward
a sprawling todo list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:32:31 -05:00
root
54a05d9311 Sprint 4 deployment artifacts: Dockerfile + docker-compose
Parallel deploy target to the systemd units that landed in a59ef5b.
Single image carries all 11 daemons; docker-compose runs one
container per daemon with the same dependency graph as the systemd
units. Useful when systemd isn't available (Mac dev, remote VMs
without root) or when isolation to a private docker network is
preferred.

Dockerfile (multi-stage):
- Builder: golang:1.25-bookworm. DuckDB cgo needs gcc + glibc;
  alpine's musl doesn't link the official duckdb-go bindings cleanly.
- Runtime: debian:bookworm-slim — same libc, much smaller surface.
  Adds ca-certificates (outbound HTTPS to OpenRouter/OpenCode/Kimi),
  curl + jq (in-container healthchecks + smoke probes), tini (PID 1
  signal forwarding so docker stop sends SIGTERM to the daemon, not
  to a wrapper).
- Single image, multiple binaries. Ships all 11 cmd/* + 3 scripts/
  (staffing_workers, playbook_lift, multi_coord_stress) so deployed
  stacks can run reality tests against themselves.
- Non-root runtime user (uid 999 lakehouse). Layout matches
  /usr/local/bin/lakehouse/<daemon> from REPLICATION.md.
- ENTRYPOINT=tini; no default CMD — operators / compose pick
  which daemon explicitly.

docker-compose.yml (11 services):
- Same dependency graph as deploy/systemd/. depends_on with
  service_healthy condition matches Requires= equivalents:
    catalogd → storaged
    ingestd → storaged + catalogd
    queryd → catalogd
    matrixd → embedd + vectord
- Gateway uses bare depends_on (no health condition) — Wants=
  equivalent so single-upstream restart doesn't cascade.
- chatd has per-provider env_file entries (one each for
  ollama_cloud, openrouter, opencode, kimi) — missing files are
  silently OK, matching the systemd unit's EnvironmentFile=- list.
- Persistent state on the lakehouse-state named volume; commented
  driver_opts shows how to bind to a host path for off-volume
  backups.

.dockerignore:
- Excludes bin/ + reports/ + data/ + git metadata + .env files.
- Especially excludes lakehouse.toml/secrets-go.toml/auth.env so
  local dev configs don't accidentally bake into a published image.

REPLICATION.md gains a Docker section between systemd setup and
the logs section. Ten-line copy-paste from "git clone" to
"docker compose up -d", plus a docker-vs-systemd differences
table covering process supervision, logs, restart policy, file
ownership, host networking quirks, and backup targets.

Validation: docker compose config --quiet → exit 0 (with
placeholder env files in place).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:58:47 -05:00
root
a59ef5b930 Sprint 4 deployment artifacts: 11 systemd units + REPLICATION.md + env templates
Builds on ADR-006 to ship the operator-facing bits Sprint 4 was
blocked on. Single-host deploy is now a documented procedure.

deploy/systemd/ (12 files):
- 11 .service units, one per daemon. Each follows the same template:
  Type=simple, User=lakehouse, hardening (NoNewPrivileges,
  ProtectSystem=strict, ProtectHome, PrivateTmp, ReadWritePaths
  scoped to /var/lib/lakehouse + /var/log/lakehouse), JSON to
  journald with per-daemon SyslogIdentifier, EnvironmentFile=- on
  /etc/lakehouse/auth.env.
- Dependency graph baked in via After=/Requires=:
    storaged → standalone (only network-online)
    catalogd → Requires storaged
    ingestd → Requires storaged + catalogd
    queryd → Requires catalogd
    matrixd → Requires embedd + vectord
    gateway → Wants every other daemon (Wants= not Requires=
              so a single upstream restart doesn't cascade-restart
              the gateway)
    pathwayd / observerd / vectord / embedd / chatd → standalone
- chatd unit reads 4 cloud-provider EnvironmentFile=s
  (ollama_cloud / openrouter / opencode / kimi) — each is its own
  file so per-provider key rotation doesn't restart the others.
- lakehouse-go.target: convenience aggregator. Operators
  systemctl start/stop/enable lakehouse-go.target instead of
  managing 11 daemons individually. Per-daemon WantedBy=
  this target.

deploy/etc-lakehouse/ (2 templates):
- auth.env.example: AUTH_TOKEN per ADR-006 6.2 + rotation playbook
  comments. The committed file is empty — operators copy + fill in.
- secrets-go.toml.example: [s3.primary] template with
  REPLACE_ME placeholders. Multi-bucket G2 example commented.

REPLICATION.md (top-level):
- Operator runbook from fresh box → 11 daemons running.
- Prereqs (Go 1.25+, gcc, MinIO, Ollama, optionally Langfuse +
  Postgres for Langfuse) with reachability checks.
- Bind ports table (3110–3220, shifted by 10 from Rust legacy).
- Bootstrap: useradd → build → install → config → secrets →
  systemd → validation.
- Auth posture matrix (loopback / non-loopback / multi-host / TLS).
- Token rotation procedure inline (ADR-006 Decision 6.5).
- Logs (journalctl), backup paths, troubleshooting matrix.

Validation: systemd-analyze verify passed on all 11 .service files
(only "not executable" warnings, expected since binaries don't live
at /usr/local/bin/lakehouse/ until step 2 of bootstrap runs).

Sprint 4 is now operator-ready. Next: Dockerfile + multi-stage
build for container deploys (separate concern; deploy targets
either systemd OR docker, not both).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:54:49 -05:00
root
814197cfd3 ADR-006: auth posture for non-loopback deploy + token rotation impl
ADR-003 locked the auth substrate; ADR-006 ratifies the operator
playbook + adds two implementation pieces needed for Sprint 4
deployment: env-resolved tokens and dual-token rotation.

Six decisions locked in docs/DECISIONS.md:
- 6.1: Non-loopback bind requires auth.token (mechanical gate at
       shared.Run, already implemented; this ratifies it).
- 6.2: Token from env, not TOML. /etc/lakehouse/auth.env (mode 0600)
       loaded by systemd EnvironmentFile=. New TokenEnv field on
       AuthConfig defaults to "AUTH_TOKEN".
- 6.3: AllowedIPs for inter-service same-trust-domain; Token for
       cross-trust-boundary (gateway ↔ external).
- 6.4: /health stays unauthenticated; everything else under
       shared.Run is gated. Already implemented; ratified here.
- 6.5: Token rotation is dual-token. New SecondaryTokens []string
       on AuthConfig — both primary and any secondary pass auth
       during the rotation window. Implemented in this commit.
- 6.6: TLS terminates at the network edge (nginx/Caddy), not
       in-process. Daemons stay HTTP-only; internal traffic stays
       on private subnets per Decision 6.3.

Implementation:
- internal/shared/config.go: AuthConfig gains TokenEnv +
  SecondaryTokens fields. New resolveAuthFromEnv() called by
  LoadConfig fills Token from os.Getenv(TokenEnv) when Token is
  empty. TokenEnv defaults to "AUTH_TOKEN" so the happy path needs
  no TOML config.
- internal/shared/auth.go: RequireAuth pre-encodes Bearer headers
  for primary + every secondary token; per-request constant-time
  compare walks the slice. Fast path is 1 compare (primary).
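
A minimal sketch of the dual-token check, assuming the accepted
headers are pre-encoded once at startup. buildAccepted and authorized
are hypothetical names; the real RequireAuth is an HTTP middleware and
handles cases (such as no token configured) that this sketch does not.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// buildAccepted pre-encodes "Bearer <token>" for the primary and every
// secondary token, primary first. Empty tokens are skipped.
func buildAccepted(primary string, secondary []string) [][]byte {
	var out [][]byte
	for _, t := range append([]string{primary}, secondary...) {
		if t != "" {
			out = append(out, []byte("Bearer "+t))
		}
	}
	return out
}

// authorized walks the accepted list with a constant-time compare per
// entry. With the primary first, the common case is a single compare.
func authorized(header string, accepted [][]byte) bool {
	h := []byte(header)
	for _, want := range accepted {
		if subtle.ConstantTimeCompare(h, want) == 1 {
			return true
		}
	}
	return false
}

func main() {
	accepted := buildAccepted("new-token", []string{"old-token"})
	fmt.Println(authorized("Bearer new-token", accepted)) // true
	fmt.Println(authorized("Bearer old-token", accepted)) // true (rotation window)
	fmt.Println(authorized("Bearer stale", accepted))     // false
}
```

Once every client has moved to the new token, dropping the old one
from the secondary list closes the rotation window.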

Tests:
- TestLoadConfig_AuthTokenFromEnv (3 sub-tests): default env name,
  custom token_env, explicit Token wins over env.
- TestRequireAuth_SecondaryTokenAccepted: both primary + secondary
  tokens pass during rotation window.
- TestRequireAuth_SecondaryTokensOnly: only-secondary path works
  for the case where primary was just promoted-to-empty mid-rotation.

go test ./internal/shared all green; existing auth_test.go
unchanged (constant-time compare path preserved).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:51:14 -05:00
root
6c93a38093 scrum multi_coord_phase3: 4 fixes from cross-lineage review
Cross-lineage scrum on bundle 87cbd10..f971e64 (3,652 lines)
produced 4 actionable findings, all defensive hardening.

1. (Opus WARN) internal/langfuse/client.go:queue
   Synchronous Flush at maxBatch threshold blocked the calling
   goroutine for the full 5s HTTP timeout when Langfuse hiccupped,
   defeating the "best-effort, never blocks calling path" contract
   in the package doc. Now fire-and-forget via goroutine.

2. (Opus + Kimi convergent) cmd/observerd/main.go:handleInbox
   - Free-form priority string was accepted; "nonsense" passed
     through unchecked. Now closed enum: urgent|high|medium|low (+
     empty defaults to medium). Tested: TestInbox_RejectsBadPriority.
   - No size cap on body, only emptiness check; multi-MB payloads
     would bloat observer's ring + JSONL. Now 8 KiB cap returns 413.
     Tested: TestInbox_RejectsOversizedBody.
   - Subject/sender/tag concatenated into InputSummary without
     newline stripping; embedded \n could corrupt JSONL line-based
     parsers. New sanitizeInboxField strips \r\n + caps at 256 chars
     before interpolation.

3. (Opus INFO) scripts/multi_coord_stress/main.go
   Removed dead `must[T]` generic — tracedSearch took over the
   fail-fast role for matrix searches, so the helper became unused.

4. (Opus INFO) scripts/multi_coord_stress/main.go:Event
   `JudgeRating int` collapsed "judge errored" and "judge said
   unrated" both to 0. Changed to *int — nil = errored, 1-5 =
   verdict. judgeInboxResult still returns 0 on error; caller
   gates on > 0 before assigning.

Dismissed (with rationale):
- Opus WARN ExcludeIDs ordering: verified by code read — filter
  applies after sort + before top-K truncation as documented;
  no slot waste possible.
- Opus INFO 10 prior-run reports contradict #011: those are
  point-in-time snapshots; intentional history.
- Kimi INFO Langfuse error suppression: design intent (best-effort
  per package doc).
- Kimi INFO contract schema validation: defer until contract count
  grows enough to make hand-edit drift a real risk.
- Kimi INFO paraphrase prompt duplicated across lift + multi_coord:
  defer (lift to internal/paraphrase/ when a third consumer appears).
- Qwen HOLD: single-line, no actionable finding.

go test ./cmd/observerd ./internal/langfuse all green; multi_coord
driver builds clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:42:07 -05:00
root
f971e64745 g2_smoke: accept nomic-embed-text* family members as default
Pre-push hook caught the regression — the smoke hardcoded
MODEL = "nomic-embed-text" and the bump to nomic-embed-text-v2-moe
in 4da32ad failed the gate.

Fix: glob-match the family prefix (nomic-embed-text*). Both v1 and
v2-moe are 768d drop-ins; the property the smoke is locking is
dim + distinct-vectors, not the exact model variant. Operators
swap the variant in lakehouse.toml without needing to touch the
smoke.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:37:20 -05:00
root
db2e57402e STATE_OF_PLAY: capture multi-coord stress wave (Phase 1-3 verified)
Anchor was last touched at v4 split-threshold; since then the
multi-coord stress harness landed end-to-end across 11 commits.
Future sessions reading this file need to see the verified state,
not derive from git log.

Major additions:
- New "Multi-coordinator stress test (Phase 1 → 3)" section in
  VERIFIED WORKING. 11-row capability table covering per-coord
  playbook isolation, diversity metrics, paraphrase handover,
  ExcludeIDs swap, fresh-resume two-tier, inbox endpoints, LLM
  demand parsing, judge re-rating, Langfuse tracing.
- Substrate-gains list under that section: ExcludeIDs on
  SearchRequest, observer.SourceInbox + /observer/inbox,
  internal/langfuse client, embedd default bumped to v2-moe,
  two-tier fresh_workers index pattern.
- Last-verified bumped to 16:42 CDT on the run #011 anchor.

DO NOT RELITIGATE expanded with six new locks:
1. Boost / inject use SEPARATE thresholds (0.5 / 0.20)
2. Multi-coord product theory is empirically VALIDATED
3. Fresh content uses two-tier indexing (fresh_workers)
4. embedd.default_model = nomic-embed-text-v2-moe (don't downgrade)
5. Inbox flow: parse + search + judge + trace
6. Langfuse Go-side client lives at internal/langfuse/

OPEN list refresh:
- Removed: re-judge metric (shipped as b13b5cd), adjacent-query as
  separate item (folded into a single "judge-approves-before-inject"
  follow-up), liberal-paraphrase (kept).
- Added: real-time 48-hour clock, wider Langfuse instrumentation,
  periodic fresh→main merge job.

RECENT VERIFIED WAVE table extended with 11 commits (b13b5cd..5d49967).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:30:04 -05:00
root
5d49967833 multi_coord_stress: full Langfuse coverage — every phase + every call
Phase 1c-only tracing (commit 7e6431e) was the proof-of-concept.
This commit threads tracing through every phase: baseline / fresh-
resume / inbox burst / surge / swap / merge / handover (verbatim +
paraphrase) / split / reissue. Each phase is a parent span; each
matrix.search / LLM call inside is a child span.

Refactor:
- One run-level trace is created at driver startup.
- New startPhase(name, hour, meta) helper emits a phase span as a
  child of the run trace; subsequent emitSpan calls nest under it.
- New tracedSearch(spanName, query, corpora, ...) wraps matrixSearch
  with span emission. Every search call site replaced with this so
  the input/output JSON (query, corpora, k, playbook, exclude_n →
  top-K ids, top1 distance, boost/inject counts) lands in Langfuse.
- Phase 4b's paraphrase generation also emits llm.paraphrase spans.
- Phase 1c's existing inline span emission converted to use the new
  helpers (no more inboxTraceID variable).

Run #011 result: trace landed at http://localhost:3001 with 111
observations attached. Span breakdown:
  phase.* parents:         9 (one per phase that ran)
  matrix.search.baseline:  10
  matrix.search.fresh_verify: 3 (top-1 confirmed for all 3 fresh)
  observerd.inbox.record:  6
  llm.parse_demand:        6
  matrix.search.inbox:     6
  llm.judge_top1:          6
  matrix.search.surge:     12
  matrix.search.swap_orig: 1
  matrix.search.swap_replace: 1
  matrix.search.merge:     6
  matrix.search.handover_verbatim: 4
  llm.paraphrase:          4
  matrix.search.handover_paraphrase: 4
  matrix.search.split:     4
  matrix.search.reissue:   12
  matrix.search.reissue_retrieval_only: 12
  ─────────────
  Total:                   111

Browse: http://localhost:3001 → Traces → "multi_coord_stress run"
Each phase is a collapsible section showing per-call timing and
input/output JSON. Operators can drill into any single retrieval
to see exactly what query was issued and what came back.

All other metrics held: diversity 0.026, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4, fresh-resume 3/3
at top-1 (two-tier index), 200-worker swap Jaccard 0.000.

This is the FULL TEST J asked for — every action in the run
visible in Langfuse, full input/output drilldown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:43:32 -05:00
root
08a086779b multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1
Runs #003-#009 surfaced the same finding: fresh workers added
mid-run to the main 'workers' vectord index (5K items) reliably
*absorbed* (HTTP 200) but failed to *surface* in semantic queries
even with content-matching prompts. Distances on the verify queries
sat at 0.25-0.65 against existing workers; fresh items were beyond
top-K. Better embedder (v2-moe) didn't help — distances got TIGHTER
on existing items, pushing fresh items further out of reach.

Root cause: coder/hnsw incremental adds to a populated graph land
in poorly-connected regions and disappear from search traversal.
Known property of HNSW post-build adds; not a bug.

Fix: two-tier index pattern (canonical NRT search architecture).
Fresh content goes to a small "hot" corpus (fresh_workers); main
queries include it in the corpora list and merge results. Hot corpus
has no recall crowding because it's tiny; periodic batch job (post-
G3) merges it into the main index.

Implementation:
- ensureFreshIndex(hc, gw, name, dim) — idempotent POST
  /v1/vectors/index. 409 from re-create treated as "already there."
- ingestFreshWorker now takes idx parameter so callers can target
  fresh_workers instead of workers.
- multi_coord_stress phase 1b creates fresh_workers index + ingests
  3 fresh workers there + searches verifyCorpora=[workers,
  ethereal_workers, fresh_workers].
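
The merge step of the two-tier pattern, as a sketch. Hit and mergeTopK
are hypothetical; the distances for fresh-001 and the 0.25+ range for
existing workers are taken from this message, the worker ids are made
up.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a hypothetical search result carrying its source corpus.
type Hit struct {
	ID       string
	Corpus   string
	Distance float64
}

// mergeTopK concatenates per-corpus results, sorts by ascending
// distance, and truncates to k. Because the hot corpus is searched on
// its own, its items cannot be crowded out of the main graph's
// traversal; they compete only at merge time.
func mergeTopK(k int, perCorpus ...[]Hit) []Hit {
	var all []Hit
	for _, hs := range perCorpus {
		all = append(all, hs...)
	}
	sort.SliceStable(all, func(i, j int) bool { return all[i].Distance < all[j].Distance })
	if len(all) > k {
		all = all[:k]
	}
	return all
}

func main() {
	workers := []Hit{{"w-1", "workers", 0.25}, {"w-2", "workers", 0.31}}
	fresh := []Hit{{"fresh-001", "fresh_workers", 0.143}}
	top := mergeTopK(2, workers, fresh)
	fmt.Println(top[0].ID, top[0].Corpus) // fresh-001 fresh_workers
}
```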

Run #010 result:
  fresh-001 (Senior tower crane rigger NCCCO Chicago)
    top-1: fresh-001 from fresh_workers, distance 0.143
  fresh-002 (Bilingual Spanish/English OSHA trainer Indianapolis)
    top-1: fresh-002 from fresh_workers, distance 0.146
  fresh-003 (FAA Part 107 drone surveyor Chicago)
    top-1: fresh-003 from fresh_workers, distance 0.129

3/3 fresh workers surface at top-1 — the absorption-but-not-
findable issue from runs #003-#009 is closed.

All other metrics held: diversity 0.007, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4, swap Jaccard 0.000,
inbox burst all 6 events accepted + traced to Langfuse.

This is the final structural fix for the multi-coord stress
suite. Phase 3 is feature-complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:31:45 -05:00
root
7e6431e4fd langfuse: Go-side client + Phase 1c instrumentation
The Rust side has Langfuse tracing already (gateway/v1/langfuse_trace.rs);
this commit lands Go-side parity so the multi-coord stress harness can
emit traces visible at http://localhost:3001.

internal/langfuse/client.go:
- Minimal Trace + Span + Flush API mirroring what the Rust emitter
  uses. Auth: Basic over public_key:secret_key.
- Best-effort posture: errors are slog.Warn'd, never block calling
  paths. Same fail-open as observerd's persistor (ADR-005 Decision
  5.1) — observability is a witness, not a gate.
- Events buffered until 50, then auto-flushed; explicit Flush() at
  process exit.
- Each Trace/Span returns its id so callers can build hierarchies.

multi_coord_stress driver wiring:
- New --langfuse-env flag (default /etc/lakehouse/langfuse.env).
  Empty / missing / unparseable file → skip tracing with a logged
  warning; run still proceeds.
- Phase 1c (inbox burst) now emits one parent trace + 4 spans per
  inbox event:
    1. observerd.inbox.record  (post to /v1/observer/inbox)
    2. llm.parse_demand        (qwen2.5 → structured fields)
    3. matrix.search           (parsed query → top-K)
    4. llm.judge_top1          (rate top-1 vs original body)
  Each span carries input/output JSON + start/end times so the
  Langfuse UI shows a full waterfall per event.

Run #009 result:
  Trace landed: "multi_coord_stress phase 1c inbox burst"
  Observations attached: 24 (= 6 events × 4 spans)
  Tags: stress, phase-1c, inbox
  Browseable at http://localhost:3001 by tag query.

Other harness metrics: diversity 0.016, determinism 1.000,
verbatim handover 4/4, paraphrase handover 4/4 — all unchanged
by the tracing addition (best-effort post in parallel).

Phase 1c is the proof-of-concept; future commits can wrap other
phases (baseline / merge / handover / split) in traces too. Once
that's done, the entire stress run becomes scrubbable in Langfuse
without grepping the events JSON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:25:03 -05:00
root
ce940f4a14 multi_coord_stress: judge re-rates inbox top-1 — recovers honesty signal
Run #007 surfaced a tradeoff: LLM-parsed inbox queries produce much
tighter cosine distances (0.05-0.10 in three cases) but lose the
"system has no good match" signal that high-distance results give.
A coordinator UI showing only distance can't tell wrong-domain
matches apart from real ones.

Fix: judge re-rates top-1 against the ORIGINAL inbox body (not the
LLM-parsed query). Coordinators see both:
  - distance: how close was retrieval in vector space
  - rating:   does this person actually fit the original ask
The pair tells the honest story.

Run #008 result on the 6 inbox events:

  Demand                Top-1     Distance  Rating  Reading
  ─────────────────────────────────────────────────────────────
  Forklift Cleveland    w-3573    0.29      4       Strong
  Production Indy       e-1764    0.41      3       Adjacent
  Crane Chicago         e-7798    0.23      1       TIGHT BUT WRONG
  Bilingual safety Indy w-3918    0.05      5       Perfect
  Drone Chicago         e-1058    0.06      5       Perfect (verify e-1058)
  Warehouse Milwaukee   w-460     0.32      4       Strong

The crane-Chicago case is the architectural-honesty signal at work:
distance 0.23 says "tight match" but the judge says rating 1 reading
the original body. A coordinator seeing only distance would ship the
wrong worker; a coordinator seeing distance+rating sees the disagreement
and escalates.

Net distribution: 5/6 rated 3+ (acceptable→perfect), 1/6 rated 1
(irrelevant despite tight cosine). The substrate-honesty signal is
recovered without losing the LLM-parse quality wins.
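
One way a coordinator UI could combine the two signals is sketched below. The `reading` function and its cut-offs (distance 0.25, rating 3) are assumptions made for illustration; the harness itself only records the distance and the rating side by side.

```go
package main

import "fmt"

// reading combines retrieval distance with the judge's 1-5 rating.
// The 0.25 and 3 cut-offs are illustrative assumptions, not harness values.
func reading(distance float64, rating int) string {
	tight := distance <= 0.25
	fits := rating >= 3
	switch {
	case tight && !fits:
		return "escalate: tight but wrong"
	case !tight && !fits:
		return "no good match"
	case tight && fits:
		return "confident match"
	default:
		return "acceptable stretch"
	}
}

func main() {
	fmt.Println(reading(0.23, 1)) // crane Chicago row
	fmt.Println(reading(0.05, 5)) // bilingual safety Indy row
	fmt.Println(reading(0.41, 3)) // production Indy row
}
```

The point of the sketch is the first branch: a tight distance alone never yields a confident reading when the judge disagrees.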

Cost: 6 extra judge calls (~9s on qwen2.5). Production amortizes
when judge runs only on top-1 of high-priority inbox events; the
search-cost-vs-quality tradeoff lives in the priority gate.

Implementation:
- New JudgeRating int field on Event (omitempty so non-judged
  events stay clean in JSON)
- New judgeInboxResult helper, reusing the same prompt structure as
  playbook_lift's judgeRate. The two could share an internal package
  if a third judge consumer appears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:16:49 -05:00
root
186d209aae multi_coord_stress: LLM-parsed inbox demands (qwen2.5)
Replaced the hard-coded DemandQuery on inbox events with an actual
LLM call: each email/SMS body is parsed by qwen2.5 (format=json,
schema-anchored) into structured {role, count, location, certs,
skills, shift}. The driver then composes a query string from those
fields and runs matrix.search.
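
A minimal sketch of the parse-then-compose step, assuming the model returns strict JSON with the six fields named above. The `ParsedDemand` type, the count default applied in code, and the query layout produced by `composeQuery` are assumptions for illustration; the driver's actual composition format is not shown in this log.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ParsedDemand mirrors the structured fields the parser model is asked for.
type ParsedDemand struct {
	Role     string   `json:"role"`
	Count    int      `json:"count"`
	Location string   `json:"location"`
	Certs    []string `json:"certs"`
	Skills   []string `json:"skills"`
	Shift    string   `json:"shift"`
}

// parseDemand decodes the model output. Defaulting a missing count to 1
// here is an assumption of this sketch.
func parseDemand(raw []byte) (ParsedDemand, error) {
	var d ParsedDemand
	if err := json.Unmarshal(raw, &d); err != nil {
		return d, err
	}
	if d.Count == 0 {
		d.Count = 1
	}
	return d, nil
}

// composeQuery builds a search string from the non-empty fields.
// The "; "-joined layout is hypothetical.
func composeQuery(d ParsedDemand) string {
	parts := []string{d.Role}
	if d.Location != "" {
		parts = append(parts, "in "+d.Location)
	}
	if len(d.Certs) > 0 {
		parts = append(parts, "certs: "+strings.Join(d.Certs, ", "))
	}
	if len(d.Skills) > 0 {
		parts = append(parts, "skills: "+strings.Join(d.Skills, ", "))
	}
	if d.Shift != "" {
		parts = append(parts, d.Shift+" shift")
	}
	return strings.Join(parts, "; ")
}

func main() {
	raw := []byte(`{"role":"forklift operator","count":50,
		"location":"Cleveland, OH",
		"certs":["OSHA-30","active forklift cert"],"skills":[],"shift":"day"}`)
	d, err := parseDemand(raw)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(composeQuery(d))
}
```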

This is the real-product flow that the Phase 3 stress test was
asking for: real bodies → real LLM parsing → real search. Before
this commit, the DemandQuery was my hand-crafted string, which
made the inbox phase trivial.

Run #007 result vs #006 (same bodies, parser swapped):

  All 6 inbox events parsed cleanly — qwen2.5 nailed:
    "Need 50 forklift operators in Cleveland OH for Monday day
     shift. OSHA-30 + active forklift cert required."
    → {role:"forklift operator", count:50, location:"Cleveland, OH",
       certs:["OSHA-30","active forklift cert"], skills:[], shift:"day"}
    Other 5 similarly faithful (indy stayed as "indy", count
    defaulted to 1 when unspecified, no hallucinated fields).

  LLM-parsed queries produced TIGHTER matches than hard-coded:
    Demand              #006 dist  #007 dist  Δ
    Crane Chicago       0.499      0.093      -81%
    Drone Chicago       0.707      0.073      -90%
    Bilingual safety    0.240      0.048      -80%
    Forklift Cleveland  0.330      0.273      -17%
    Production Indy     0.260      0.399      +53%
    Warehouse Milwaukee 0.458      0.420       -8%

  Three matches landed at distance < 0.10 — verbatim-replay-tight
  territory. Structured queries embed sharper than conversational
  hand-crafted strings.

  Other metrics unchanged: diversity 0.000, determinism 1.000,
  verbatim handover 4/4, paraphrase handover 4/4.

Tradeoff worth flagging: the drone-Chicago case dropped from
distance 0.71 (clear "we don't have one") to 0.07 (confident match
returned). The OOD honesty signal weakens when LLM-parsed structure
makes any closest-neighbor look tight. Future Phase 4 work: judge
re-rates the top match before surfacing, so coordinators see "your
demand was for X but the closest match scored 2/5" rather than just
the worker ID + distance.

Substrate cost: +6 LLM calls per inbox burst (~9s on qwen2.5).
Production would amortize via a small dedicated parser model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:51:19 -05:00
root
e7fc63b216 observerd: /observer/inbox + multi-coord stress phase 1c (priority-ordered events)
Phase 3 ask: real-world inbox-style event injection during the stress
test. Coordinators in production receive emails + SMS that trigger
contract responses; the substrate has to RECORD these signals AND
react with a search using the embedded demand. This commit lands the
endpoint and exercises it end-to-end in the stress harness.

observerd surface:
- New POST /observer/inbox route — accepts {type, sender, subject,
  body, priority, tag} and records as ObservedOp with
  Source=SourceInbox. Type must be email|sms; body required;
  priority defaults to medium. The handler ONLY records — downstream
  triggers (search, ingest, etc.) are the caller's concern, recorded
  separately. Keeps the witness role pure.
- New observer.SourceInbox = "inbox" alongside SourceMCP /
  SourceScenario / SourceWorkflow.
- Three contract tests on the new route (happy path / bad type / empty
  body), router-mount test extended, all green.
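
The validation rules listed for the new route can be sketched as below. The `InboxRequest` type and `Normalize` method are hypothetical names; only the rules themselves (type must be email or sms, body required, priority defaults to medium) come from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// InboxRequest mirrors the JSON fields accepted by the inbox route.
type InboxRequest struct {
	Type     string `json:"type"`
	Sender   string `json:"sender"`
	Subject  string `json:"subject"`
	Body     string `json:"body"`
	Priority string `json:"priority"`
	Tag      string `json:"tag"`
}

// Normalize validates the request and fills the default priority.
// It only checks and defaults; recording is a separate step.
func (r *InboxRequest) Normalize() error {
	if r.Type != "email" && r.Type != "sms" {
		return errors.New("type must be email or sms")
	}
	if r.Body == "" {
		return errors.New("body is required")
	}
	if r.Priority == "" {
		r.Priority = "medium"
	}
	return nil
}

func main() {
	// The three cases mirror the contract tests: happy path, bad type, empty body.
	cases := []InboxRequest{
		{Type: "sms", Body: "Need a bilingual safety coord in Indianapolis"},
		{Type: "fax", Body: "hello"},
		{Type: "email", Body: ""},
	}
	for _, c := range cases {
		if err := c.Normalize(); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted with priority", c.Priority)
	}
}
```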

Stress harness phase 1c (Hour 9):
- 6 inbox events fire in priority order (urgent → high → medium):
    2 urgent emails (forklift Cleveland, production Indianapolis)
    1 high email (crane Chicago)
    1 high sms (bilingual safety Indianapolis)
    1 medium sms (drone Chicago)
    1 medium email (warehouse Milwaukee FYI)
- Each event:
    1. POSTs to /v1/observer/inbox (recorded by observerd)
    2. Triggers matrix.search using a parsed demand (the demand
       extraction is hard-coded for now; production needs a small
       LLM to parse from body)
    3. Captures both as events in the run JSON

Run #006 result (with v2-moe embedder + all phases including inbox):

  Diversity:
    Same-role-across-contracts Jaccard = 0.000 (n=9)
    Different-roles-same-contract Jaccard = 0.046 (n=18)
  Determinism: 1.000
  Verbatim handover: 4/4 (100%)
  Paraphrase handover: 4/4 (100%)
  Inbox burst:
    6/6 events accepted by observerd (200 status, all recorded)
    6/6 triggered searches produced distinct top-1 worker IDs
    distance distribution: 0.24 (Indy bilingual safety) → 0.71 (Chicago
    drone surveyor — honest stretch since drones aren't in the
    5K-worker corpus, system surfaces closest neighbor at high
    distance rather than fabricating)

The drone-Chicago case is the architectural-honesty signal: when
the demand asks for a specialist NOT in the roster, the system
returns the closest semantic neighbor with a distance that flags
"this is a stretch." Coordinators reading distances see "we don't
have a great match here" rather than a confident wrong answer.

Total events captured: 67 (was 61 pre-inbox).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:34:36 -05:00
root
4da32ad102 embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in)
Local Ollama has three embedding models loaded:
  nomic-embed-text:latest        137M  768d  (previous default)
  nomic-embed-text-v2-moe:latest 475M  768d  (this commit's default)
  qwen3-embedding:latest         7.6B  4096d (would require dim change)

v2-moe is a drop-in upgrade: same 768 dim, ~3.5× the params, MoE
architecture. The workers index doesn't need rebuilding; future ingests
simply embed with the stronger model.

Run #005 result on the multi-coord stress suite:

  Diversity (same-role-across-contracts): 0.080 → 0.000 (n=9)
    → MoE is more discriminating: zero worker overlap across
      Milwaukee / Indianapolis / Chicago for shared role names.
      The geo + cert + skill context fully separates worker pools.
  Different-roles-same-contract: 0.013 → 0.036 (still ~96% diff)
  Determinism: 1.000 (unchanged)
  Verbatim handover: 4/4 (100%)
  Paraphrase handover: 4/4 (100%)

  200-worker swap: Jaccard 0.000 (unchanged — still perfect)

  Fresh-resume verify: STILL doesn't surface fresh workers in top-8.
    With v2-moe, distances increased (top-1 = 0.43–0.65 vs v1's 0.25–0.39)
    — the embedder is MORE discriminating, but the fresh worker's
    vector still doesn't outrank the 8th-best existing worker. This is
    now suspected to be an HNSW post-build add issue rather than an
    embedder problem (coder/hnsw incremental adds can land in
    hard-to-reach graph regions). Better embedder didn't fix it; needs a
    different strategy: full index rebuild after fresh adds, or
    explicit playbook-layer score boost for fresh workers, or
    hybrid (keyword + semantic) retrieval. Phase 3 investigation.

Cost: ingest is ~3-5× slower (workers 20s→100s, 5×; ethereal 35s→112s,
3.2×). Acceptable for the quality jump on diversity. Real production
with incremental ingest won't pay this cost on every deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:26:52 -05:00
root
84a32f0d29 multi-coord stress Phase 2: ExcludeIDs + fresh-resume + 200-worker swap
Three Phase 2 additions land in this commit:

1. matrix.SearchRequest gains ExcludeIDs ([]string) — filters specific
   worker IDs out of results post-retrieval, AND skips them at the
   playbook boost+inject step (so excluded answers can't sneak back
   via Shape B). Real-world driver: coordinator placed N workers,
   client asks for replacements, system needs alternatives, not the
   same N. Threaded through retrieve.go after merge but before
   metadata filter so excluded IDs don't waste post-filter top-K slots.

2. New harness phase 2b: 200-worker swap simulation. Captures the
   top-K from alpha's warehouse query, then re-issues with
   exclude_ids=<placed>. Result Jaccard(orig, swap) measures whether
   the substrate finds genuine alternatives.

3. New harness phase 1b: fresh-resume mid-run injection. Three new
   workers ingested via /v1/embed + /v1/vectors/index/workers/add,
   then verified findable via semantic queries matching resume content.
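
The exclusion filter and the swap metric from items 1 and 2 can be sketched together. This is an illustration under assumptions: `Result`, `excludeIDs`, and `jaccard` are hypothetical names, and the real filter sits inside retrieve.go between merge and the metadata filter, which this sketch does not model.

```go
package main

import "fmt"

// Result is a hypothetical stand-in for one retrieved worker.
type Result struct {
	ID       string
	Distance float64
}

// excludeIDs drops any result whose ID appears in exclude, keeping order.
func excludeIDs(results []Result, exclude []string) []Result {
	if len(exclude) == 0 {
		return results
	}
	skip := make(map[string]struct{}, len(exclude))
	for _, id := range exclude {
		skip[id] = struct{}{}
	}
	kept := make([]Result, 0, len(results))
	for _, r := range results {
		if _, found := skip[r.ID]; !found {
			kept = append(kept, r)
		}
	}
	return kept
}

// jaccard returns |A ∩ B| / |A ∪ B| over two ID lists (0 when both are empty).
func jaccard(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, id := range a {
		setA[id] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, id := range b {
		setB[id] = struct{}{}
	}
	inter := 0
	for id := range setB {
		if _, found := setA[id]; found {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	pool := []Result{{"w-1", 0.10}, {"w-2", 0.12}, {"w-3", 0.15}, {"w-4", 0.18}}
	placed := []string{"w-1", "w-2"}
	swap := excludeIDs(pool, placed)

	swapIDs := make([]string, 0, len(swap))
	for _, r := range swap {
		swapIDs = append(swapIDs, r.ID)
	}
	// Zero overlap means every placed worker was replaced by an alternative.
	fmt.Println("swap IDs:", swapIDs, "jaccard:", jaccard(placed, swapIDs))
}
```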

Plus Hour labels on every event (operational narrative: 0/6/12/18/
24/30/36/42/48) and a refactor of captureEvent to take hour as a
param.

Run #003 + #004 results (5K workers + 10K ethereal):

  Diversity (#004):
    Same-role-across-contracts Jaccard = 0.080 (n=9)
    Different-roles-same-contract Jaccard = 0.013 (n=18)
  Determinism: 1.000 (#004 unchanged)
  Verbatim handover:  4/4 = 100%
  Paraphrase handover: 4/4 = 100%

  Phase 2b — 200-worker swap (Jaccard 0.000):
    8 originally-placed workers fully replaced by 8 alternatives.
    ExcludeIDs substrate change works end-to-end — boost AND inject
    both honor the exclusion, so excluded workers don't return via
    the playbook either.

  Phase 1b — fresh-resume injection: REAL PRODUCT FINDING.
    Substrate ABSORPTION is fine — 3 /v1/vectors/index/workers/add
    calls at 200 status, 3 vectors persisted. But none of the 3
    fresh workers surfaced in top-8 even with semantic queries
    matching their resume content (e.g. "Senior tower crane rigger
    NCCCO Chicago" vs fresh-001's resume "Senior rigger with 12
    years tower-crane signaling..." NCCCO + Chicago).
    Top-1 came from existing workers at distance ~0.25; fresh
    workers' distances must be > 0.25, pushing them past rank 8.
    Cause: dense retrieval at 5000+ workers means many existing
    profiles cluster near any specific query in cosine space;
    nomic-embed-text (137M) introduces enough noise that a
    fresh worker doesn't reliably outrank them just because the
    text content overlaps.
    Workarounds (Phase 3 work): (a) hybrid retrieval (keyword +
    semantic), (b) playbook-layer score boost for fresh adds,
    (c) larger embedder. Documented in run #004 report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:19:29 -05:00
root
0fa42a0cc3 multi-coord stress Phase 1.5: shared-role contracts + paraphrase handover
Phase 1 had two known gaps: (1) the 3 contracts had zero shared role
names, so same-role-across-contracts Jaccard was vacuous (n=0); (2)
the verbatim handover at 100% was the trivial case, not the hard
learning test (paraphrased queries against another coord's playbook).

Both fixed in this commit.

Contract redesign — all 3 contracts now share warehouse worker /
admin assistant / heavy equipment operator roles, plus a unique
specialist per contract (industrial electrician / bilingual safety
coord / drone surveyor — the "specialist not on the standard roster"
case from J's spec). Counts and skill mixes vary per region.

New driver phase 4b — paraphrase handover. Bob runs qwen2.5-paraphrased
versions of Alice's contract queries against Alice's playbook
namespace. Tests whether institutional memory propagates across
coordinators AND across natural wording variation that Bob would
introduce when running Alice's contract.

Run #002 result (5K workers + 10K ethereal_workers, 4 demand × 3
coords + paraphrase handover):

  Diversity (the question J asked: locking or cycling?):
    Same-role-across-contracts Jaccard = 0.119 (n=9)
      → 88% of workers DIFFER across regions for the same role name.
        Milwaukee warehouse vs Indianapolis warehouse vs Chicago
        warehouse pull mostly distinct top-K from the same population.
        The system locks into geo+cert+skill context, not cycling.
    Different-roles-same-contract Jaccard = 0.004 (n=18)
      → role-specific retrieval works (unchanged from Phase 1).

  Determinism: Jaccard = 1.000 (n=12) — unchanged.

  Learning:
    Verbatim handover  4/4 = 100%  (trivial case, expected)
    Paraphrase handover 4/4 = 100% (HARD case — passes!)
      Of those 4 paraphrase recoveries:
        - 2 used boost (Alice's recording was already in Bob's
          paraphrase top-K; ApplyPlaybookBoost re-ranked to top-1)
        - 2 used Shape B inject (recording wasn't in Bob's
          paraphrase top-K; InjectPlaybookMisses brought it in)

The boost/inject mix is healthy — both paths are used and both
produce correct top-1s. Multi-coord institutional memory propagation
is empirically working under wording variation.

Sample warehouse worker top-1s across contracts (proves diversity):
  alice / Milwaukee     → w-713
  bob   / Indianapolis  → e-8447
  carol / Chicago       → e-7145
Three different workers from the same 15K-person population,
selected on geo+cert+skill context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:03:16 -05:00
root
61c7b55e48 multi-coord stress harness — Phase 1 of 48-hour mock
Three coordinators (alice / bob / carol) with three contracts
(Milwaukee distribution / Indianapolis manufacturing / Chicago
construction). 7-phase scenario runner: baseline → surge → merge →
handover → split → reissue → analysis. Each coord has a separate
playbook namespace (playbook_{name}) so institutional memory stays
isolated by default but transferable on demand.

Phase 1 deliberately skips the 48-hour clock, email/SMS endpoints,
and Langfuse tracing — those are Phase 2/3.

Run #001 (52 events, 4 queries × 3 coords × 2 demand flavors):

  Diversity:
    Different-roles-same-contract Jaccard = 0.004 (n=18)
      → role-specific retrieval is working perfectly. Different
        roles within one contract pull totally different worker
        pools. System is NOT cycling; locks into per-role retrieval.
    Same-role-across-contracts Jaccard = N/A (n=0)
      → TEST-DESIGN ISSUE: the 3 contracts use distinct role names
        per industry (warehouse worker / production worker / general
        laborer), so no exact-name overlaps exist. Phase 2 should
        either share at least one role across contracts OR add a
        skill-based diversity metric.

  Determinism: Jaccard = 1.000 (n=12)
    → HNSW + Ollama retrieval is fully deterministic on identical
      query text. coder/hnsw + nomic-embed-text are stable.

  Learning: handover hit rate = 4/4 = 100%
    → Bob inherits Alice's recordings perfectly when bob runs
      identical queries with alice's playbook namespace. CAVEAT:
      this tests the trivial verbatim case, not paraphrase handover.
      The harder test (bob runs paraphrased queries with alice's
      playbook) is Phase 2 work.

Per-event capture in JSON: every matrix.search response is logged
with phase / coordinator / contract / role / query / top-K IDs +
distances + per-corpus counts + boosted/injected counts. Reviewable
via:
  jq '.events[] | select(.phase == "merge")'
  jq '.events[] | select(.coordinator == "alice")'
  jq '.events[] | select(.role == "warehouse worker")'

Notable finding from per-event: carol's "general laborer" and "crane
operator" queries both surface w-1009 as top-1, with crane operator
at distance 0.098 (very tight) and general laborer at 0.297. The
system found a worker who legitimately covers both roles — realistic
for small construction crews.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:55:29 -05:00
root
b13b5cd7a1 playbook_lift v4 metric: warm-top-1 re-judge — quality lift +24%/-14%
The rank-based "lift" metric (warm-top-1 == cold-judge-best) doesn't
distinguish "Shape B surfaced a strictly-better answer" from "Shape B
shuffled ranks but quality is unchanged" from "Shape B replaced a good
answer with a wrong one." This commit adds Pass 4: judge warm top-1
with the same prompt as cold ratings, then bucket the comparison.

Implementation:
- New --with-rejudge driver flag (default off).
- New WITH_REJUDGE harness env (default 1, on for prod runs).
- queryRun gains WarmTop1Metadata (cached during Pass 2 for the
  rejudge call) + WarmTop1Rating *int (nil-distinguishable; nil = no
  rejudge, 0..5 = rating).
- summary gains RejudgeAttempted, QualityLifted, QualityNeutral,
  QualityRegressed (counts of warm-rating > / == / < cold-rating).
- Markdown headline gains a Quality block when rejudge ran.
- ~21 extra judge calls (~30s on qwen2.5).
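
The bucketing described above can be sketched as follows. The `WarmTop1Rating`, `RejudgeAttempted`, and `Quality*` names come from the list above; the `ColdRating` and `NetDelta` fields, the `bucket` function, and the sample data are assumptions for illustration only.

```go
package main

import "fmt"

// queryRun holds the two ratings being compared. WarmTop1Rating is a
// pointer so "no rejudge" (nil) stays distinct from a real rating.
type queryRun struct {
	ColdRating     int  // hypothetical field name for the cold-pass rating
	WarmTop1Rating *int // nil = rejudge did not run for this query
}

type summary struct {
	RejudgeAttempted int
	QualityLifted    int
	QualityNeutral   int
	QualityRegressed int
	NetDelta         int // hypothetical: sum of warm minus cold ratings
}

// bucket counts warm > cold, warm == cold, and warm < cold comparisons.
func bucket(runs []queryRun) summary {
	var s summary
	for _, r := range runs {
		if r.WarmTop1Rating == nil {
			continue // skipped queries do not count as neutral
		}
		s.RejudgeAttempted++
		delta := *r.WarmTop1Rating - r.ColdRating
		s.NetDelta += delta
		switch {
		case delta > 0:
			s.QualityLifted++
		case delta < 0:
			s.QualityRegressed++
		default:
			s.QualityNeutral++
		}
	}
	return s
}

func intPtr(v int) *int { return &v }

func main() {
	// Illustrative data only, not the run #005 ratings.
	runs := []queryRun{
		{ColdRating: 2, WarmTop1Rating: intPtr(4)}, // lifted by +2
		{ColdRating: 1, WarmTop1Rating: intPtr(1)}, // neutral
		{ColdRating: 4, WarmTop1Rating: intPtr(1)}, // regressed by -3
		{ColdRating: 3, WarmTop1Rating: nil},       // not rejudged
	}
	fmt.Printf("%+v\n", bucket(runs))
}
```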

Run #005 result (split inject threshold 0.20 + paraphrase + rejudge):

  Quality lifted     5 / 21  (24%)  — 3× +2 rating, 2× +1 rating
  Quality neutral   13 / 21  (62%)  — includes OOD queries holding 1
  Quality regressed  3 / 21  (14%)
  Net rating delta  +3 across 21 queries (+0.14 average)

The 5 lifts were all rating-2 cold replaced with rating-3 or rating-4
warm — Shape B took mediocre matches and substituted substantively
better ones. Two of the 3 regressions were small (-1, -1); the third was -3.

Q11 is the cautionary tale: cold top-1 "production line worker"
(rating 4) got replaced by Q1's recorded "forklift OSHA-30 operator"
e-5729 (rating 1). Adjacent-domain cross-pollination — production
worker and forklift operator embed within 0.20 cosine because both
are warehouse-adjacent staffing queries, even though the judge
correctly distinguishes them. The split-threshold defense (0.5 boost
/ 0.20 inject) catches OOD cross-pollination (Q19/Q20/Q21 all stayed
neutral at rating 1) but not adjacent-domain cross-pollination.

Net product verdict: working, net-positive on quality, but the worst
case (Q11 4→1) is customer-visible and warrants a tighter inject
threshold OR an additional gate beyond cosine distance. Filed in
STATE_OF_PLAY OPEN as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:42:04 -05:00
root
87cbd10090 STATE_OF_PLAY: v4 split-threshold result + adjacent-query observation
- Reality test table extends from #001-#003 to #001-#004; v4 row marked
  as "the honest configuration" because OOD cross-pollination is gone.
- Shape B section gains the split-threshold rationale (boost safe at
  loose, inject structurally riskier so tighter).
- Verbatim drop framing rewritten — v3→v4 is configuration evolution,
  not regression.
- OPEN: closed "Shape B cap/decay" + the conditional Q15 boost-math
  item (Shape B + split threshold addressed both). Replaced with two
  finer-grained follow-ups: adjacent-query Q6↔Q7 swap (might be
  correct, verify with v4 re-judge metric) and liberal-paraphrase
  recovery loss (Q9/Q15 missed because qwen2.5 drifted >0.20).
- RECENT VERIFIED WAVE adds 94fc3b6 + 67d1957.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:26:23 -05:00
root
67d1957b87 matrix: split boost / inject thresholds — kills Shape B cross-pollination
Run #003 surfaced rampant cross-pollination: Q2's "OSHA-30 forklift
Wisconsin" recording (w-4435) became warm top-1 for Q19 (dental
hygienist), Q20 (RN), Q21 (software engineer), and 6 other unrelated
staffing queries. Cause: InjectPlaybookMisses inherited the same
DefaultPlaybookMaxDistance (0.5) as the boost path, but inject is
structurally riskier than boost — boost only re-ranks results that
already retrieved on their own merits, while inject FORCES a result
into top-K, so a loose match cross-pollinates wrong-domain answers.

Empirical motivation from v3:
  Implied playbook hit distances for cross-pollinated cases: 0.20-0.46
  Implied distances for the 6/6 paraphrase recoveries:        0.23-0.30
  Threshold of 0.20 should keep most paraphrases, kill the OOD bleed.

Implementation:
- New DefaultPlaybookMaxInjectDistance = 0.20 in playbook.go.
- New PlaybookMaxInjectDistance field on SearchRequest (override).
- InjectPlaybookMisses signature gains maxInjectDist param; hits whose
  Distance exceeds it are skipped (boost path may still re-rank them).
- TestInjectPlaybookMisses_RespectsInjectThreshold locks the contract
  with one tight + one loose hit, asserting only the tight one injects.
- Existing tests pass explicit threshold (0 = default for tight tests,
  0.5 for the dedupe test which uses 0.30 hits).
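
The split can be read as a per-hit decision, sketched below. The two constants and the "0 means default" convention come from the notes above; the `decide` and `effectiveInjectThreshold` functions and the returned labels are hypothetical, and treating the boost bound as inclusive is an assumption.

```go
package main

import "fmt"

const (
	DefaultPlaybookMaxDistance       = 0.5  // boost path: loose is safe
	DefaultPlaybookMaxInjectDistance = 0.20 // inject path: must be tight
)

// effectiveInjectThreshold treats a zero override as "use the default".
func effectiveInjectThreshold(override float64) float64 {
	if override <= 0 {
		return DefaultPlaybookMaxInjectDistance
	}
	return override
}

// decide says what happens to one playbook hit.
// answerInTopK reports whether regular retrieval already surfaced the answer.
func decide(hitDistance float64, answerInTopK bool, maxInjectDist float64) string {
	if hitDistance > DefaultPlaybookMaxDistance {
		return "ignore" // too far to count as a playbook hit at all
	}
	if answerInTopK {
		return "boost" // re-rank a result that retrieved on its own merits
	}
	if hitDistance <= effectiveInjectThreshold(maxInjectDist) {
		return "inject" // force into top-K only when the match is tight
	}
	return "skip" // loose hit on a missing answer: do not inject
}

func main() {
	fmt.Println(decide(0.18, false, 0))   // tight miss
	fmt.Println(decide(0.30, false, 0))   // loose miss
	fmt.Println(decide(0.30, true, 0))    // loose hit already in top-K
	fmt.Println(decide(0.30, false, 0.5)) // explicit override
}
```

Keeping boost and inject behind different bounds means a loose hit can still re-rank a result that earned its place, while only tight hits are allowed to add one.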

Run #004 result on identical queries with the split threshold:

  Verbatim discovery        8 (vs v3's 6 — judge variance, separate)
  Verbatim lift             6 / 8 (75%)
  Paraphrase top-1          6 / 8 (75%)
  Paraphrase any-rank in K  6 / 8

OOD queries Q19/Q20/Q21 ALL show warm top-1 = cold top-1 (no
injection) — cross-pollination eliminated where it was wrong-direction.
Mean Δ top-1 distance shrank in magnitude from -0.164 (v3, distorted) to -0.071
(v4, comparable to v1's -0.053).

Two paraphrases missed in v4 (Q9, Q15) were ones where qwen2.5
rephrased liberally enough to drift past 0.20 — Q9: "Inventory
specialist..." → "Individual needed for inventory management..." and
Q15: "Engaged warehouse associate..." → "Warehouse associate currently
engaged with a robust history...". The system correctly refusing to
inject when it's not confident is the right product behavior; the
boost path still re-ranks recorded answers when they appear in regular
retrieval.

The Q6 ↔ Q7 cross-pollination ("Forklift-certified loader" ↔
"Hazmat warehouse worker") is legitimate — these are genuinely similar
staffing queries and the judge ranks both directions as plausible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:24:55 -05:00
root
94fc3b67ec STATE_OF_PLAY: capture v3 reality test + Shape B + cross-pollination
- Reality test section now spans v1/v2/v3 across one table — the
  product story (boost-only verbatim → paraphrase gap → Shape B
  closes the gap) is legible without reading the reports.
- Verbatim-lift drop v1→v3 (7→2) explicitly framed as
  cross-pollination, NOT regression — and filed as v4 re-judge metric
  in OPEN.
- "DO NOT RELITIGATE" gains: Shape B is the stance now (don't revert
  to boost-only); local_judge stays on qwen2.5 (don't bump to qwen3.5
  for cleanliness — vision-SSM cost geometry).
- OPEN list: removed the now-closed paraphrase v2 row + the boost-math
  Q15 row (Shape B may have addressed it; flagged for verify after v4).
  Added v4 re-judge metric and Shape B injection cap/decay design call.
- RECENT VERIFIED WAVE adds the four new commits past 6c02c90
  (2c71d1c, 9ce067b, e9822f0, 154a72e).
- Matrix indexer §5/5 component description now references
  InjectPlaybookMisses + the run #002→#003 evidence chain.
- [models] tier registry comment locks the local_judge=qwen2.5 choice
  with the rationale inline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:09:31 -05:00
root
154a72ea5e matrix: Shape B — inject playbook misses + 6/6 paraphrase recovery
The v0 boost-only stance documented in internal/matrix/playbook.go:22-27
("the boost only re-ranks results that ALREADY surfaced from the regular
retrieval") couldn't promote recorded answers that dropped out of a
paraphrase's top-K. playbook_lift_002 surfaced exactly that gap: 0/2
paraphrase recoveries because the recorded answers weren't in regular
retrieval at all (rank=-1).

Shape B: when warm-pass retrieval doesn't surface a playbook hit's
answer, inject a synthetic Result for it directly. Distance =
playbook_hit_distance × BoostFactor — same formula as the boost path so
injections land in comparable distance space. Caller re-sorts +
truncates after both boost and inject have run.
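
A minimal sketch of the inject step and the caller-side re-sort, assuming a boost factor of 0.5 (a made-up value) and hypothetical `Result`, `PlaybookHit`, `injectMisses`, and `sortAndTruncate` names. When several hits point at the same answer this sketch keeps the first one, which may differ from the real dedupe rule.

```go
package main

import (
	"fmt"
	"sort"
)

type Result struct {
	ID       string
	Distance float64
	Injected bool
}

type PlaybookHit struct {
	AnswerID string
	Distance float64 // distance between the new query and the recorded query
}

// injectMisses appends a synthetic Result for every hit whose answer is not
// already present. Distance = hit distance × boostFactor, the same formula
// as the boost path, so injected rows sort in comparable space.
func injectMisses(results []Result, hits []PlaybookHit, boostFactor float64) []Result {
	present := make(map[string]bool, len(results))
	for _, r := range results {
		present[r.ID] = true
	}
	for _, h := range hits {
		if present[h.AnswerID] {
			continue // already retrieved, or already injected by an earlier hit
		}
		present[h.AnswerID] = true
		results = append(results, Result{
			ID:       h.AnswerID,
			Distance: h.Distance * boostFactor,
			Injected: true,
		})
	}
	return results
}

// sortAndTruncate is the caller's step after boost and inject have run.
func sortAndTruncate(results []Result, k int) []Result {
	sort.SliceStable(results, func(i, j int) bool {
		return results[i].Distance < results[j].Distance
	})
	if len(results) > k {
		results = results[:k]
	}
	return results
}

func main() {
	retrieved := []Result{{ID: "w-1", Distance: 0.30}, {ID: "w-2", Distance: 0.35}}
	hits := []PlaybookHit{
		{AnswerID: "w-9", Distance: 0.10}, // missing from retrieval: injected
		{AnswerID: "w-9", Distance: 0.12}, // same answer again: deduped
		{AnswerID: "w-1", Distance: 0.05}, // already present: left to boost
	}
	for _, r := range sortAndTruncate(injectMisses(retrieved, hits, 0.5), 2) {
		fmt.Printf("%s injected=%v\n", r.ID, r.Injected)
	}
}
```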

Result on playbook_lift_003 (Shape B + paraphrase pass):

  Verbatim discovery        6
  Verbatim lift             2 / 6
  **Paraphrase top-1**      **6 / 6**
  Paraphrase any-rank in K  6 / 6
  Mean Δ top-1 distance     -0.1637 (warm closer than cold)

Every paraphrase the judge generated landed the v1-recorded answer at
top-1 of the new query's results. The learning property holds — cosine
on embed(paraphrase) finds the recorded query's vector within
DefaultPlaybookMaxDistance (0.5), and Shape B injects the answer.

Verbatim lift dropped from v1's 7/8 because Shape B cross-pollinates
recorded answers across queries. w-4435 (Q2's recording) appears as
warm top-1 for several other queries because their embeddings are
within the playbook hit threshold of "OSHA-30 forklift Wisconsin." This
is a feature, not a bug — the matrix layer's purpose is to share
knowledge across queries — but the lift metric only counts "warm top-1
== cold judge best," so cross-pollinated lifts don't register. A v4
metric would re-judge the warm pass to measure true judge improvement.

Tests:
- TestInjectPlaybookMisses_AddsMissingAnswers — primary claim
- TestInjectPlaybookMisses_SkipsAnswersAlreadyPresent — no double-inject
- TestInjectPlaybookMisses_DedupesPerAnswer — multi-hit same answer
- TestInjectPlaybookMisses_EmptyHits — fast-path no-op

Driver fix: ParaphraseRecordedRank int → *int. The `omitempty` int
silently dropped rank=0 (top-1, the WANTED value) from JSON, making the
v003 report show "null" instead of "0" for every successful recovery.
Pointer keeps nil/rank-0 distinguishable.
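
The driver fix is easy to reproduce with encoding/json alone. In this sketch the `rankValue` and `rankPointer` types and the `encode` helper are hypothetical; only the `omitempty` behavior is the point.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rankValue uses a plain int: omitempty treats 0 as empty and drops it.
type rankValue struct {
	Rank int `json:"rank,omitempty"`
}

// rankPointer uses *int: only a nil pointer is empty, so rank 0 survives.
type rankPointer struct {
	Rank *int `json:"rank,omitempty"`
}

// encode marshals v and returns the JSON text (empty string on error).
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	zero := 0
	fmt.Println(encode(rankValue{Rank: 0}))       // {}
	fmt.Println(encode(rankPointer{Rank: &zero})) // {"rank":0}
	fmt.Println(encode(rankPointer{}))            // {}
}
```

A reader of the first output cannot tell a top-1 recovery from a missing measurement, which is how the report ended up rendering null for successful recoveries.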

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:06:13 -05:00
root
e9822f025d playbook_lift v2: paraphrase pass + run #002 finds boost-only limit
Adds an opt-in Pass 3 to the lift driver: for each query whose Pass 1
recorded a playbook, ask the judge to rephrase the query, then re-query
with playbook=true and check whether the recorded answer surfaces in
top-K. This is the test the v1 report's caveat #3 explicitly flagged
as the actual learning-property gate (not the cheap verbatim case).

Implementation:
- New flag --with-paraphrase on the driver (default off).
- New WITH_PARAPHRASE env in the harness (default 1, on for prod runs).
- New paraphrase_* fields on queryRun + summary, // 0 fallback in jq so
  re-rendering verbatim-only evidence stays clean.
- generateParaphrase() calls the same judge model with format=json and
  a tight schema; temperature=0.5 for variance without domain drift.
- Markdown report adds a paraphrase per-query table (only when the
  pass ran) and an honesty caveat about judge-also-rephrases coupling.

Run #002 result (reports/reality-tests/playbook_lift_002.{json,md}):

  Verbatim lift               2/2 (100% — Q7 + Q13, both stable from v1)
  Paraphrase top-1            0/2
  Paraphrase any-rank in K    0/2

Both paraphrases dropped the recorded answer OUT of top-K entirely
(rank=-1). This isn't a paraphrase-quality problem — qwen2.5's outputs
preserved intent ("Hazmat-certified warehouse worker comfortable with
cold storage" → "Warehouse worker with Hazmat certification and
experience in cold storage"). It's the v0 boost-only stance documented
in internal/matrix/playbook.go:22-27: the boost only re-ranks results
that ALREADY surfaced from regular retrieval. If paraphrase's cosine
retrieval doesn't include the recorded answer in top-K, no boost can
promote it.

The "Shape B" upgrade mentioned in the playbook.go comment — inject
playbook hits directly even when they weren't in the top-K — is what
would close this gap. The reality test surfaced exactly the gap the
docs warned about. Worth filing as the next product gate.

Run-to-run variance also visible: v1 had 8 discoveries, v2 had 2.
HNSW insertion order + judge variance both contribute. Stability of
Q7 and Q13 across both runs (lifted in v1 AND v2) is the most reliable
signal in the dataset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:47:41 -05:00
root
9ce067bd9d observerd: test that locks ADR-005 Decision 5.3
TestWorkflowRun_AllProvenanceRecordedPostRun proves that
handleWorkflowRun records ObservedOps only AFTER runner.Run returns,
not interleaved with node execution.

The test pauses inside a node via a controlled channel, samples
observer.Store mid-run (must be 0), unblocks, then samples again
(must be N). If a future commit adds per-node streaming (e.g.
runner.NodeHook firing as each node finishes), n1's record would
appear before the unblock and the first assertion fires.
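
The pause-and-sample technique can be sketched without the real handler. Everything here is a stand-in: `store`, `runWorkflow`, and `sampleAroundPause` are hypothetical, and `runWorkflow` simply models the post-run batching the ADR describes.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a stand-in for the observer store.
type store struct {
	mu   sync.Mutex
	rows []string
}

func (s *store) record(row string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rows = append(s.rows, row)
}

func (s *store) count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.rows)
}

// runWorkflow executes every node first, then records one row per node.
// This models post-run batching; a streaming variant would record in the loop.
func runWorkflow(nodes []func(), s *store) {
	for _, node := range nodes {
		node()
	}
	for i := range nodes {
		s.record(fmt.Sprintf("n%d", i+1))
	}
}

// sampleAroundPause blocks the second node on a channel, samples the store
// while the run is paused, releases the node, and samples again at the end.
func sampleAroundPause() (mid, post int) {
	s := &store{}
	entered := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})

	paused := func() {
		close(entered) // tell the test the node is now running
		<-release      // wait until the test has sampled
	}
	nodes := []func(){func() {}, paused, func() {}}

	go func() {
		runWorkflow(nodes, s)
		close(done)
	}()

	<-entered
	mid = s.count() // n1 has finished; batching means nothing is recorded yet
	close(release)
	<-done
	post = s.count()
	return mid, post
}

func main() {
	mid, post := sampleAroundPause()
	fmt.Printf("mid-run rows=%d, post-run rows=%d\n", mid, post)
}
```

If recording moved into the node loop, the mid-run sample would already see the first node's row, which is exactly the drift the real test is meant to catch.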

This is an intentional test-as-spec lock. Closing the streaming gap is
deferred per the ADR ("acceptable for short workflows; streaming
callback is the right shape when workflows get longer") — but if
someone later adds the streaming callback without updating the ADR,
this test catches it in `go test` instead of leaving the doc and
code drifted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:35:41 -05:00
root
2c71d1c637 ADR-005: observer fail-safe semantics
Closes the OPEN item from STATE_OF_PLAY. Required because observerd is
now on the prod-realistic data path via the lift harness boot (b2e45f7),
so the next consumer (scrum runner / distillation rebuild / production
workflow) needs the fail-safe rationale locked, not implicit.

The Rust "verdict:accept on crash" anti-pattern doesn't translate
one-to-one to the Go observer (witness, not gate). But four adjacent
fail-safe decisions are real and live:

5.1 Persist failure is logged-not-fatal; ring is in-flight source of
    truth. Persist-required mode deferred to a future opt-in ADR.

5.2 Mode failure → Success=false, no panic-swallow path. The runner
    catches mode errors and surfaces them via node.Error; downstream
    consumers see failures explicitly rather than as fake successes
    (the Rust anti-pattern surface).

5.3 One row per node, recorded post-run. A workflow with N nodes
    produces N audit rows, never a per-workflow catch-all that
    survives partial crashes. Known gap: recording happens after
    runner.Run returns (acceptable for short workflows; streaming
    callback is the right shape when workflows get longer).

5.4 /observer/event accepts on full ring (oldest evicted). Refusing
    to write would translate every burst into client errors — wrong
    direction for an audit witness.
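
A minimal sketch of the accept-and-evict behavior in 5.4, assuming a fixed-capacity ring with capacity greater than zero. The `ring` type and its methods are hypothetical names for illustration and are not the actual observer.Store API.

```go
package main

import "fmt"

// ring is a hypothetical fixed-capacity buffer that never refuses a write:
// when full, the oldest entry is overwritten.
type ring struct {
	buf   []string
	start int // index of the oldest entry
	size  int // number of live entries
}

// newRing assumes capacity > 0.
func newRing(capacity int) *ring {
	return &ring{buf: make([]string, capacity)}
}

// add always accepts the op and reports whether an older entry was evicted.
func (r *ring) add(op string) (evicted bool) {
	if r.size < len(r.buf) {
		r.buf[(r.start+r.size)%len(r.buf)] = op
		r.size++
		return false
	}
	r.buf[r.start] = op
	r.start = (r.start + 1) % len(r.buf)
	return true
}

// snapshot returns the live entries, oldest first.
func (r *ring) snapshot() []string {
	out := make([]string, 0, r.size)
	for i := 0; i < r.size; i++ {
		out = append(out, r.buf[(r.start+i)%len(r.buf)])
	}
	return out
}

func main() {
	r := newRing(3)
	for _, op := range []string{"a", "b", "c", "d"} {
		r.add(op)
	}
	fmt.Println(r.snapshot()) // prints [b c d]
}
```

The writer never sees an error during a burst; the cost is that the oldest entries are lost, which is the trade-off the ADR accepts for a witness.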

Mostly ratifies existing behavior; cross-checked claims against
actual code (caught one error in Decision 5.3 draft — recording is
post-run-batched, not per-node-as-it-completes — and the ADR now
states reality).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:32:12 -05:00
root
6c02c905c8 scrum lift_001: 4 fixes from cross-lineage review
Cross-lineage scrum on b2e45f7 produced 1 convergent + 3 single-reviewer
findings worth fixing. All apply.

1. (Opus WARN + Qwen INFO convergent) scripts/playbook_lift.sh: replace
   sleep 2.5 in SQL probe with active polling up to 5s. refresh_every=1s
   is a lower bound; under load the manifest may not be visible in a
   fixed sleep, which would 4xx the probe and abort the reality run.

2. (Opus INFO) scripts/playbook_lift.sh: report template glued
   "env JUDGE_MODEL" + value as "env JUDGE_MODELqwen2.5:latest" with no
   separator. Replaced two :+/:- substitution chains with a single
   JUDGE_SOURCE variable computed once at the top of the harness.

3. (Opus INFO) scripts/staffing_workers/main.go: -id-prefix "" silently
   allowed, defeating the flag's purpose (cross-corpus collision prevention).
   Now log.Fatal at startup with explicit hint.

4. (Opus WARN) cmd/{pathwayd,observerd}/main_test.go: newTestRouter
   returned http.Handler then re-cast to chi.Router for chi.Walk.
   Returning chi.Router directly satisfies http.Handler AND avoids an
   assertion that would panic if future middleware wraps the router.

Dismissed (with rationale):
- Kimi INFO hardcoded MinIO endpoint: harness is local-by-design.
- Kimi WARN matrixd accepts 502/500: documented; real retriever needs
  real upstreams the test doesn't spin up.
- Qwen INFO queryd strings.Contains: brittle but very low risk; restating
  through typed-error path would couple without adding signal.

go test ./cmd/{matrixd,queryd,pathwayd,observerd} all green.

Verdicts at reports/scrum/_evidence/2026-04-30/verdicts/lift_001_*.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:27:24 -05:00
root
b2e45f7f26 playbook_lift: harness expansion + reality test #001 (7/8 lift, 87.5%)
The 5-loop substrate's load-bearing gate is verified — playbook +
matrix indexer give the results we're looking for. Per the report's
rubric, lift ≥ 50% of discoveries means matrix is doing real work;
7/8 = 87.5% blew through that.

Harness was structurally hiding bugs behind a 5-daemon stripped boot.
Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:

1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd
   rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark) —
   wrong domain for staffing queries; replaced with ethereal_workers
   (10K rows, real staffing schema, "e-" id prefix to avoid collision
   with workers' "w-"). staffing_workers driver gains -index-name +
   -id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running
   ~30s per judge call against the lift loop; reverted to
   qwen2.5:latest (~1s/call, 30× faster, held lift theory)

Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go
so future drift fires in `go test`, not in a reality run. R-005 closed:

- cmd/matrixd/main_test.go (new) — playbook record drift detector +
  score bounds + 6 routes mounted
- cmd/queryd/main_test.go — wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new) — 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new) — 4 routes + invalid-op + unknown-mode

`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.

Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
  Queries              21 (staffing-domain, 7 categories)
  Discoveries          8 (judge ≠ cosine top-1)
  Lifts                7/8 (87.5%)
  Boosts triggered     9
  Mean Δ distance      -0.053 (warm closer than cold)
  OOD honesty          dental/RN/SWE rated 1, no fake matches
  Cross-corpus boosts  confirmed (e- ↔ w- swaps in lifts)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:22:21 -05:00
root
740eb0d00c scrum_review: switch curl to stdin so large diffs don't blow argv
Phase 4-bundle review (128KB diff) hit "Argument list too long" when
curl --data was passed the body as a literal arg. Pipe via stdin
with --data-binary @- instead. Lifts the practical bundle size from
~30KB to whatever fits in process memory.

Caught while running the harness scrum on golangLAKEHOUSE today —
the bigger Phase A+B harness diff (4566 lines) tripped it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:46:52 -05:00
root
511083ae40 docs: SPEC §3.9 (chatd) + §3.10 (local-review-harness sibling)
- SPEC §1 component table: add chatd row marked DONE; replaces
  Rust gateway's v1::ollama_cloud / openrouter / opencode adapters
  + the aibridge crate.
- SPEC §3.9 — chatd shipped: 5-provider routing (ollama, ollama_cloud,
  openrouter, opencode, kimi) by model-name prefix or :cloud suffix.
  Captures the Anthropic 4.7 temperature-deprecation quirk + the
  local-Ollama think=false default that the playbook_lift judge
  needed. Mentions scrum_review.sh as the reusable cross-lineage
  vehicle eating chatd's own /v1/chat.
- SPEC §3.10 — local-review-harness sibling tool: separate repo at
  git.agentview.dev/profit/local-review-harness, MVP shipped today.
  Documents the cross-pollination plan for when both substrates
  stabilize (chatd as the harness's LLM backend; harness findings
  as Lakehouse pathway-memory drift signal; .memory/known-risks
  as a matrix corpus). Explicit "don't re-port" so future Claudes
  don't try to absorb the harness into Lakehouse.
- STATE_OF_PLAY.md: SIBLING TOOLS section with 1-line summary
  + pointer to SPEC §3.10.

No code changes. just verify still PASS — touched only docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 01:01:23 -05:00
root
c5c31b6ca6 docs: STATE_OF_PLAY.md — Go-side truth anchor (mirrors Rust convention)
Adds the "verified working RIGHT NOW / DO NOT RELITIGATE / OPEN"
anchor at the repo root, mirroring /home/profit/lakehouse/STATE_OF_PLAY.md.
Memory files (project_golang_lakehouse.md) supplement; this file is
the verified-truth pointer.

Sections:
- VERIFIED WORKING: 13 cmd binaries + 18 smokes + 5 matrix components
  + Mem0 pathway + observerd + workflow runner + chatd 5-provider
  dispatcher + model tier registry. just verify PASS in 31s.
- DO NOT RELITIGATE: 4 ratified ADRs (DECISIONS.md ADR-001..004) +
  today's scrum dispositions (B-1..B-4 fixed, FP-A1/A2/C1 dismissed)
  + session frame items (Rust legacy is maintenance-only, etc.).
- OPEN: reality test held on J's queries, 3 daemon main_test.go gap,
  Sprint 4 deployment, ADR-005 observer fail-safe, ADR-006 auth posture.
- RECENT WAVE: 6-commit table 05273ac..e4ee002 documenting today's
  4 phases + scrum + tooling.
- RUNTIME CHEATSHEET: just verify, chatd boot, /v1/chat/providers
  probe, scrum_review.sh usage.
- VISION: 5-loop substrate gate from project_small_model_pipeline_vision.md.

The read-mem skill (in /root/.claude/skills/read-mem/) and project
memory file are updated to reference this file as the primary Go anchor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:37:24 -05:00
root
e4ee0029c0 scrum_review.sh: reusable 3-lineage cross-review driver
Bash driver wrapping /v1/chat for Opus + Kimi + Qwen3-coder review
runs. Used today to scrum the 4-phase wave (1,624 LoC of chatd +
config-refactor + Rust cleanup) and caught 2 BLOCKs + 2 WARNs.

Usage:
  ./scripts/scrum_review.sh <bundle.diff> <bundle_label>

Output: reports/scrum/_evidence/<DATE>/verdicts/<bundle>_<reviewer>.md
verbatim, per the evidence-only convention. Per-reviewer latency +
token counts captured in the report header.

System prompt enforces the BLOCK/WARN/INFO + WHERE/WHAT/WHY shape
per feedback_cross_lineage_review.md — leads with verdict, no
preamble (Kimi tends to spend tokens thinking otherwise).

Reviewer fleet matches project_golang_lakehouse.md "Scrum routing":
- opencode/claude-opus-4-7
- openrouter/moonshotai/kimi-k2-0905
- openrouter/qwen/qwen3-coder

This is the first dogfood of chatd as the scrum vehicle — eats its
own /v1/chat dispatcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:29:36 -05:00
root
0efc7363c5 scrum 2026-04-30: 4 real fixes + 2 INFOs from cross-lineage review
3-lineage scrum (Opus 4.7 / Kimi K2.6 / Qwen3-coder) on today's wave
landed 4 real findings (2 BLOCK + 2 WARN) and 2 INFO touch-ups.
Verbatim verdicts + disposition table at:
  reports/scrum/_evidence/2026-04-30/

B-1 (BLOCK Opus + INFO Kimi convergent) — ResolveKey API:
  collapse from 3-arg (envVar, envFileName, envFilePath) to 2-arg
  (envVar, envFilePath). Pre-fix every chatd caller passed the env
  var name twice; if operator renamed *_key_env in lakehouse.toml
  while keeping the canonical KEY= line in the .env file, fallback
  silently missed.

B-2 (WARN Opus + WARN Kimi convergent) — handleProviders probe:
  drop the synthesize-then-Resolve probe; look up by name directly
  via Registry.Available(name). Prior probe synthesized "<name>/probe"
  model strings and routed through Resolve, fragile to any future
  routing rule (e.g. cloud-suffix special case).

B-3 (BLOCK Opus single — verified by trace + end-to-end probe) —
  OllamaCloud.Chat StripPrefix used "cloud" but registry routes
  "ollama_cloud/<m>". Result: upstream got the prefixed model name
  and 400'd. Smoke missed it because chatd_smoke runs without
  ollama_cloud registered. Now strips the right prefix; new
  TestOllamaCloud_StripsCorrectPrefix locks both prefix + suffix
  cases. Verified live: ollama_cloud/deepseek-v3.2 round-trips
  cleanly through the real ollama.com endpoint.

B-4 (WARN Opus single) — Ollama finishReason: read done_reason
  field instead of inferring from done bool alone. Newer Ollama
  reports done=true with done_reason="length" on truncation; the
  prior code mapped that to "stop" and lost the truncation signal
  the playbook_lift judge needs to retry. New
  TestFinishReasonFromOllama_PrefersDoneReason covers the fallback
  ladder.
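
One plausible shape for the B-4 fallback ladder, as a sketch only. The function name, its signature, and the empty-string result for an unfinished response are assumptions; the source confirms only that done_reason is preferred and that done=true alone previously mapped to "stop".

```go
package main

import "fmt"

// finishReason prefers the explicit done_reason field and only falls back
// to the done flag when done_reason is absent (hypothetical helper).
func finishReason(done bool, doneReason string) string {
	if doneReason != "" {
		return doneReason // e.g. "length" on truncation
	}
	if done {
		return "stop" // older responses that carry only the bool
	}
	return "" // assumption: unfinished responses report no reason
}

func main() {
	fmt.Println(finishReason(true, "length")) // prints length
	fmt.Println(finishReason(true, ""))       // prints stop
}
```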

INFOs:
- B-5: replace hand-rolled insertion sort in Registry.Names with
  sort.Strings (Opus called the "avoid sort import" comment a
  false economy — correct).
- A-1: clarify the playbook_lift.sh comment around -judge "" arg
  passing (Opus noted the comment said "env priority" but didn't
  reflect that the empty arg also passes through the Go driver's
  resolution chain).

False positives dismissed (3, documented in disposition.md):
- Kimi: TestMaybeDowngrade_WithConfigList wrong assertion (test IS
  correct per design — model excluded from weak list = strong = downgrade)
- Qwen: nil-deref claim (defensive code already handles nil)
- Opus: qwen3.5:latest doesn't exist on Ollama hub (true on the
  public hub but local install has it)

just verify: PASS. chatd_smoke 6/6 PASS. New regression tests:
3 (B-2, B-3, B-4 each get a focused test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:28:08 -05:00
root
05273ac06b phase 4: chatd — multi-provider LLM dispatcher (ollama / cloud / openrouter / opencode / kimi)
new cmd/chatd on :3220 routes /v1/chat to the right provider based
on model-name prefix or :cloud suffix. closes the architectural gap
named in lakehouse.toml [models]: tiers map to model IDs, but until
phase 4 there was no service that could actually CALL those models
from go.

routing rules (registry.Resolve):
  ollama/<m>          → local Ollama (prefix stripped)
  ollama_cloud/<m>    → Ollama Cloud
  <m>:cloud           → Ollama Cloud (suffix variant — kimi-k2.6:cloud)
  openrouter/<v>/<m>  → OpenRouter (prefix stripped, OpenAI-compat)
  opencode/<m>        → OpenCode unified Zen+Go
  kimi/<m>            → Kimi For Coding (api.kimi.com/coding/v1)
  bare names          → local Ollama (default)
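
The routing table above could be expressed roughly as below. This is a sketch, not registry.Resolve itself: the `resolve` signature is hypothetical, stripping the prefix for opencode/ and kimi/ is an assumption (the table only states it for ollama/ and openrouter/), and the :cloud suffix is passed through unchanged as a further assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// prefixRoutes lists the provider prefixes from the routing table.
var prefixRoutes = []string{"ollama_cloud", "ollama", "openrouter", "opencode", "kimi"}

// resolve maps a model string to a provider name and the model name that
// would be sent upstream.
func resolve(model string) (provider, upstream string) {
	for _, p := range prefixRoutes {
		if strings.HasPrefix(model, p+"/") {
			return p, strings.TrimPrefix(model, p+"/")
		}
	}
	if strings.HasSuffix(model, ":cloud") {
		return "ollama_cloud", model // suffix variant, name kept as-is (assumption)
	}
	// Bare names and unknown prefixes go to local Ollama with no rewrite.
	return "ollama", model
}

func main() {
	for _, m := range []string{
		"ollama/qwen2.5:latest",
		"kimi-k2.6:cloud",
		"openrouter/qwen/qwen3-coder",
		"unknown/model",
	} {
		p, u := resolve(m)
		fmt.Printf("%-30s -> %s (%s)\n", m, p, u)
	}
}
```

Whether a provider is actually registered is a separate check; this sketch covers only the name-to-provider mapping.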

provider implementations:
- internal/chat/types.go      Provider interface, Request/Response, errors
- internal/chat/registry.go   prefix + :cloud suffix dispatch
- internal/chat/ollama.go     local Ollama via /api/chat (think=false default)
- internal/chat/ollama_cloud.go  Ollama Cloud via /api/generate (Bearer auth)
- internal/chat/openai_compat.go shared OpenAI Chat Completions for the
                                 OpenRouter/OpenCode/Kimi family
- internal/chat/builder.go    BuildRegistry from BuilderInput;
                              ResolveKey reads env then .env file fallback

config:
- ChatdConfig in internal/shared/config.go with bind, ollama_url,
  per-provider key env names + .env fallback paths, timeout
- Gateway gains chatd_url + /v1/chat + /v1/chat/* routes
- lakehouse.toml [chatd] block with /etc/lakehouse/<provider>.env defaults

tests (19 in internal/chat):
- registry: prefix + :cloud + errors + telemetry + provider listing
- ollama: happy path + prefix strip + format=json + 500 mapping +
  flatten_messages
- openai_compat: happy path + format=json + 429 mapping + zero-choices

think=false default in ollama + ollama_cloud — local hot path skips
reasoning, low-budget callers (the playbook_lift judge at max_tokens=10)
get direct answers instead of empty content + done_reason=length.
proven via chatd_smoke acceptance.

acceptance gate: scripts/chatd_smoke.sh — 6/6 PASS:
1. /v1/chat/providers lists exactly registered providers (1 in dev mode)
2. bare model → ollama default with content + token counts + latency
3. explicit ollama/<m> → prefix stripped at upstream
4. <m>:cloud without ollama_cloud registered → 404 (no silent fall-through)
5. unknown/<m> → falls through to default → upstream 502 (no prefix rewrite)
6. missing model field → 400

just verify: PASS (vet + 30 packages × short tests + 9 smokes).
chatd_smoke is a domain smoke (not in just verify, mirrors matrix /
observer / pathway pattern).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:08:29 -05:00
root
848cbf5fef phase 3: playbook_lift harness reads judge from config
migrate the reality-test harness's judge-model default from a
hardcoded "qwen3.5:latest" string to cfg.Models.LocalJudge.

resolution priority: explicit -judge flag > $JUDGE_MODEL env >
cfg.Models.LocalJudge from lakehouse.toml > hardcoded fallback.
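
The priority chain reduces to "first non-empty value wins". A minimal sketch, assuming each source has already been read into a string; `resolveJudge` and `fallbackJudge` are hypothetical names, and the fallback value is the one this commit reports.

```go
package main

import (
	"fmt"
	"os"
)

// fallbackJudge is the hardcoded last resort named in this commit.
const fallbackJudge = "qwen3.5:latest"

// resolveJudge returns the first non-empty value in priority order:
// flag, then environment, then config, then the hardcoded fallback.
func resolveJudge(flagVal, envVal, cfgVal string) string {
	for _, v := range []string{flagVal, envVal, cfgVal} {
		if v != "" {
			return v
		}
	}
	return fallbackJudge
}

func main() {
	// An empty flag value lets the rest of the chain run.
	judge := resolveJudge("", os.Getenv("JUDGE_MODEL"), "")
	fmt.Println("effective judge:", judge)
}
```

Defaulting the flag to the empty string is what lets the lower-priority sources take effect.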

bumping the judge for run #N+1 now means editing one line in
lakehouse.toml [models].local_judge — no Go file or shell script
edits required.

changes:
- scripts/playbook_lift/main.go: -config flag added, judge default
  flips to "" so resolution chain runs. Imports internal/shared for
  config loader.
- scripts/playbook_lift.sh: JUDGE_MODEL no longer defaulted in bash;
  EFFECTIVE_JUDGE resolved by mirror-of-the-Go-chain (env > config
  grep > qwen3.5:latest fallback). Used for the Ollama presence
  check + report header. Pre-flight grep avoids requiring jq just
  to read the toml.
- reports/reality-tests/README.md: documents the 4-step priority
  chain.

verified all 4 paths produce the expected judge:
- config (no env): qwen3.5:latest (from lakehouse.toml)
- env override:    env wins
- flag override:   flag wins over env
- missing config:  DefaultConfig fallback still gives qwen3.5:latest

just verify PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:57:28 -05:00
root
622e124b8f phase 2: matrix.downgrade reads WeakModels from config
migrate the strong-model auto-downgrade gate from a hardcoded weak
list to cfg.Models.WeakModels. backward compatible: existing API
preserved, callers that don't migrate keep using DefaultWeakModels.

changes:
- internal/matrix/downgrade.go: split IsWeakModel into rule-based
  base (`:free` suffix/infix) + literal-list lookup. New
  IsWeakModelInList(model, list) takes the config-supplied list.
  DowngradeInput grows a WeakModels field; nil falls back to
  DefaultWeakModels (preserves pre-phase-2 behavior).
- internal/workflow/modes.go: add MatrixDowngradeWithWeakList(list)
  factory mirroring MatrixSearch's pattern. Plain MatrixDowngrade
  kept for backward compat.
- cmd/matrixd/main.go: handlers struct holds weakModels populated
  from cfg.Models.WeakModels at startup; handleDowngrade threads it
  into every DowngradeInput.
- cmd/observerd/main.go: registerBuiltinModes accepts weakModels
  and uses the factory variant. observerd reads cfg.Models.WeakModels
  in main().
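
A sketch of the rule-plus-list split described for downgrade.go. Names are lowercased to mark them as illustrative, the contents of `defaultWeakModels` are a placeholder rather than the real default list, and treating any occurrence of ":free" as weak is this sketch's reading of "suffix/infix".

```go
package main

import (
	"fmt"
	"strings"
)

// defaultWeakModels is a placeholder; the real default list is not shown
// in the commit message.
var defaultWeakModels = []string{"qwen3.5:latest"}

// isWeakModelInList applies the rule-based check first, then the literal list.
func isWeakModelInList(model string, list []string) bool {
	if strings.Contains(model, ":free") {
		return true
	}
	for _, m := range list {
		if m == model {
			return true
		}
	}
	return false
}

// effectiveWeakList falls back to the defaults only when the config list
// is nil; an explicitly empty list stays empty.
func effectiveWeakList(cfg []string) []string {
	if cfg == nil {
		return defaultWeakModels
	}
	return cfg
}

func main() {
	fmt.Println(isWeakModelInList("qwen3.5:latest", effectiveWeakList(nil)))        // prints true
	fmt.Println(isWeakModelInList("qwen3.5:latest", effectiveWeakList([]string{}))) // prints false
	fmt.Println(isWeakModelInList("some/model:free", effectiveWeakList([]string{}))) // prints true
}
```

Distinguishing nil from empty lets an operator deliberately configure "no literal weak models" without losing the rule-based check.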

end-to-end verified: downgrade + matrix + observer + workflow smokes
all pass. Existing TestMaybeDowngrade_TruthTable + TestIsWeakModel
unchanged (backward compat). Two new tests cover the config path:
- TestIsWeakModelInList — covers rule + literal + empty + nil
- TestMaybeDowngrade_WithConfigList — verifies cfg list overrides
  default

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:52:18 -05:00
root
ec1d031996 phase 1: add [models] tier config — additive, no callers migrate yet
Codifies the small-model-pipeline tiering (per project_small_model_pipeline_vision.md)
in lakehouse.toml [models] section. Tier names map to actual model
IDs; bumping a model means editing one line, not hunting through code.

Tier philosophy:
- local_*    : on-box Ollama. Inner-loop hot path. Repeated calls.
- cloud_*    : Ollama Cloud (Pro plan). Larger context, fail-up tier.
- frontier_* : OpenRouter / OpenCode. Rate-limited, billed per call.

weak_models is the codified "local-hot-path eligible" list — phase 2
will migrate matrix.downgrade to read it instead of hardcoding.

Defaults reflect 2026-04-29 architecture: qwen3.5:latest as local
(stronger than qwen2.5, same JSON-clean property), kimi-k2.6 as cloud
judge (kimi-k2:1t still upstream-broken), opus-4-7 + kimi-k2-0905 as
frontier review/arch via OpenRouter, opencode/claude-opus-4-7 as
frontier_free leveraging the OpenCode subscription.

3 new tests in internal/shared/config_test.go:
- TestDefaultConfig_ModelsTier — locks tier defaults
- TestModelsConfig_IsWeak     — weak-bypass list
- TestLoadConfig_ModelsTOMLRoundTrip — override semantics

just verify PASS (g2 had one flake on first run — Ollama transfer
truncation; clean on retry, unrelated to this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:48:45 -05:00
root
3dd7d9fe30 reality-tests: playbook-lift harness — does the 5-loop substrate beat raw cosine?
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
  rates top-K → record playbook entry pointing at the highest-rated
  result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
  ranking shift. Lift = real if recorded answer becomes top-1.
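
The two-pass bookkeeping can be summarized with a short sketch. The `queryResult` struct and `liftStats` function are hypothetical and are not taken from scripts/playbook_lift/main.go; they only encode the definitions above (a discovery is a query where the judge's best differs from cold top-1, a lift is a discovery whose recorded answer becomes warm top-1).

```go
package main

import "fmt"

// queryResult holds the three IDs needed per query (hypothetical shape).
type queryResult struct {
	coldTop1   string // top-1 by distance with the playbook off
	judgedBest string // highest-rated result in the cold pass
	warmTop1   string // top-1 with the playbook on
}

// liftStats counts discoveries and how many of them were lifted to top-1.
func liftStats(results []queryResult) (discoveries, lifts int) {
	for _, r := range results {
		if r.judgedBest == r.coldTop1 {
			continue // cosine already agreed with the judge
		}
		discoveries++
		if r.warmTop1 == r.judgedBest {
			lifts++
		}
	}
	return discoveries, lifts
}

func main() {
	d, l := liftStats([]queryResult{
		{coldTop1: "w-1", judgedBest: "w-1", warmTop1: "w-1"}, // no discovery
		{coldTop1: "w-2", judgedBest: "w-7", warmTop1: "w-7"}, // discovery, lifted
		{coldTop1: "w-3", judgedBest: "w-9", warmTop1: "w-3"}, // discovery, not lifted
	})
	fmt.Printf("discoveries=%d lifts=%d\n", d, l) // prints discoveries=2 lifts=1
}
```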

Files:
- scripts/playbook_lift/main.go         driver (391 LoC)
- scripts/playbook_lift.sh              stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt  query corpus (5 placeholders;
                                            J writes real 20+)
- reports/reality-tests/README.md       framework + interpretation
- .gitignore                            track reports/reality-tests/
                                        but ignore per-run JSON evidence

This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.

Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:22:36 -05:00
root
8278eb9a87 scrum2 cleanup: JSON-marshal in stringifyValue, drop dead detectCycle, name SourceWorkflow
5 small fixes from the §3.8 scrum2 review wave:

- workflow.stringifyValue now JSON-marshals maps/slices instead of
  fmt.Sprint %v (Opus+Kimi convergent: LLM modes were getting Go's
  map[k:v] syntax, which is unparseable as JSON context).
- workflow.detectCycle removed — duplicate of topoSort that discarded
  the useful node ID. Validate() now calls topoSort directly and
  returns its wrapped ErrCycle.
- observer.SourceWorkflow named constant — was an implicit string
  cast (observer.Source("workflow")) at the cmd/observerd handler.
- Unused context imports + dead silencer comments removed across
  workflow/modes.go and observerd/main.go.
- Unused store parameter dropped from registerBuiltinModes (reserved
  comment removed; can be re-added when a mode actually needs it).
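
To illustrate the first fix, here is a sketch of the difference between fmt.Sprint and JSON-marshalling for composite values. This `stringifyValue` is a hypothetical reconstruction that handles only map[string]any and []any; the real function may cover more types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stringifyValue renders maps and slices as JSON so an LLM prompt receives
// parseable context, and leaves other values to fmt.Sprint.
func stringifyValue(v any) string {
	switch t := v.(type) {
	case string:
		return t
	case map[string]any, []any:
		b, err := json.Marshal(t)
		if err != nil {
			return fmt.Sprint(v) // fall back rather than fail the node
		}
		return string(b)
	default:
		return fmt.Sprint(v)
	}
}

func main() {
	m := map[string]any{"k": "v"}
	fmt.Println(fmt.Sprint(m))     // prints map[k:v]
	fmt.Println(stringifyValue(m)) // prints {"k":"v"}
}
```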

just verify still PASS — these are pure cleanup, no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:16:07 -05:00
root
c41698acae scrum rerun-2 — 50/60 (Δ R1 +7, Δ baseline +15) at c7e3124
Audited stash-clean c7e3124 (30 commits past rerun-1 4840c10).
3 HIGH risks closed (R-002 internal/shared, R-003 internal/storeclient,
R-008 queryd/db.go). 3 advanced to partial (R-001 via fail-loud-bind +
opt-in auth, R-006 via g2_smoke_fixtures, R-007 via ADR-003 auth.go).

Biggest move: Agent Memory Correctness 4 → 9 — pathway Mem0 ops
(ADD/UPDATE/REVISE/RETIRE/HISTORY) all tested, including cycle-detection
and retired-trace-exclusion. Sprint 2 acceptance criteria are now
verified code, not design-bar work.

Two new findings:
- F1 (MED): cmd/{matrixd,observerd,pathwayd}/main_test.go absent —
  reopens R-005 against new daemons.
- F2 (LOW): scripts/staffing_*/main.go flag-defaults reach
  /home/profit/lakehouse/data/...

Evidence under reports/scrum/_evidence/rerun2/ (local; per
.gitkeep convention).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:13:01 -05:00
root
c7e3124208 §3.8 second slice: real modes wired (matrix.relevance/downgrade/search,
distillation.score, drift.scorer)

Lands the workflow.Mode adapters for the §3.4 components + the
distillation scorer + drift quantifier. Workflows can now compose
real measurement capabilities; the substrate's parallel
capabilities become composable Lego bricks (per the prior commit's
closing insight).

Modes registered (in observerd's registerBuiltinModes):

  Pure-function wrappers (no I/O):
    - matrix.relevance    → matrix.FilterChunks
    - matrix.downgrade    → matrix.MaybeDowngrade
    - distillation.score  → distillation.ScoreRecord
    - drift.scorer        → drift.ComputeScorerDrift

  HTTP-backed:
    - matrix.search       → POST matrixd /matrix/search
                             (registered only when matrixd_url is set)

  Fixture (kept from §3.8 first slice):
    - fixture.echo, fixture.upper

internal/workflow/modes.go:
  Each mode follows the same glue pattern: marshal generic input
  through a typed struct (free schema validation + clear error
  messages), call the underlying capability, return a generic
  output map. Roundtrip-via-JSON gives us schema validation
  without writing custom field-by-field coercion.
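
  The glue pattern can be sketched like this, assuming Go 1.18+ generics. `decodeInput`, `downgradeInput`, and `downgradeMode` are hypothetical names, and the echo at the end stands in for the call into the underlying capability.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// downgradeInput is a hypothetical typed view of one mode's input.
type downgradeInput struct {
	Mode  string `json:"mode"`
	Model string `json:"model"`
}

// decodeInput round-trips a generic map through JSON into a typed struct,
// so type mismatches surface as decode errors.
func decodeInput[T any](in map[string]any) (T, error) {
	var out T
	raw, err := json.Marshal(in)
	if err != nil {
		return out, fmt.Errorf("marshal input: %w", err)
	}
	if err := json.Unmarshal(raw, &out); err != nil {
		return out, fmt.Errorf("decode input: %w", err)
	}
	return out, nil
}

// downgradeMode shows the full shape: decode, validate, call, return a map.
func downgradeMode(in map[string]any) (map[string]any, error) {
	typed, err := decodeInput[downgradeInput](in)
	if err != nil {
		return nil, err
	}
	if typed.Mode == "" || typed.Model == "" {
		return nil, errors.New("mode and model are required")
	}
	// Stand-in for the real capability call.
	return map[string]any{"mode": typed.Mode, "model": typed.Model}, nil
}

func main() {
	out, err := downgradeMode(map[string]any{"mode": "lakehouse", "model": "qwen3.5:latest"})
	fmt.Println(out, err)
	_, err = downgradeMode(map[string]any{"mode": "lakehouse", "model": 42})
	fmt.Println(err != nil) // prints true: a number cannot decode into a string field
}
```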

internal/workflow/modes_test.go (10 tests, all PASS):
  - matrix.relevance filters adjacency pollution (Connector kept,
    catalogd::Registry dropped — same headline as the relevance
    smoke, run through the workflow mode)
  - matrix.downgrade flips lakehouse→isolation on strong model;
    keeps lakehouse on weak (qwen3.5:latest); errors on missing
    fields
  - distillation.score rates scrum_review attempt_1 as accepted;
    rejects empty record
  - drift.scorer reports zero drift on matched inputs; errors on
    empty inputs slice
  - matrix.search HTTP flow round-trips through httptest fake
    matrixd; non-OK status surfaces a clear error

scripts/workflow_smoke.sh (5 assertions PASS, was 4):
  New assertion #5: real-mode chain
    matrix.downgrade (lakehouse + grok-4.1-fast → isolation)
    → distillation.score (scrum_review attempt_1 → accepted)
  Proves §3.4 components compose through the workflow runner with
  no fixture intermediation. Both nodes ran successfully, runner
  recorded provenance, status=succeeded.

  Mode listing assertion now expects 7 modes (5 real + 2 fixture)
  instead of just the fixtures.

17-smoke regression all green. SPEC §3.8 acceptance gate G3.8.D
("Mode catalog dispatches matrix.search invocation to the matrixd
backend without going through HTTP") still pending — current path
goes through HTTP for matrix.search, which is the cleaner service-
mesh shape but slower than direct in-process. In-process dispatch
when matrixd is co-resident is a future optimization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:39:26 -05:00
root
e30da6e5aa §3.8 first slice: workflow runner skeleton + DAG executor + observerd integration
Lands the structural piece of SPEC §3.8 (Observer-KB workflow runner)
documented in 97dd3f8: types + DAG runner + reference substitution +
provenance recording into observerd. Real-mode integrations
(matrix.search, distillation.score, drift.scorer, llm.chat) come in
follow-up commits — this commit proves the mechanics.

internal/workflow/types.go:
  - Workflow / Node / NodeResult / RunResult types matching Archon's
    YAML shape so existing workflows (e.g. lakehouse-architect-review.yaml)
    load directly. Optional `mode` field added — implicit fall-back is
    "llm.chat" matching Archon's convention.
  - Mode signature: func(Context, map[string]any) (map[string]any, error)
  - 5 sentinel errors: ErrCycle, ErrMissingDep, ErrUnknownMode,
    ErrDuplicateNodeID, ErrUnresolvedRef
  - Validate enforces structural invariants: unique IDs, every
    depends_on resolves, no cycles

internal/workflow/runner.go:
  - Kahn's-algorithm topological sort, stable for declaration-order
    ties (deterministic execution + JSON output across runs)
  - Reference substitution: $node_id.output.key.path resolves through
    nested maps; $node_id alone resolves to the whole output map
  - Skip cascade: a node whose dependency failed/skipped is skipped
    with explicit "upstream node X failed" error in NodeResult, never
    silently dropped
  - Per-node provenance: NodeResult.StartedAt + DurationMs captured
    for every execution
  - Mode pre-validation: every node's mode checked against registry
    BEFORE any node runs — typo catches in 5ms not after 6 nodes
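
  A compact sketch of a Kahn-style topological sort that breaks ties by declaration order. It assumes node IDs are unique (as Validate is said to enforce); the `node` type, `topoSort` signature, and `errCycle` value here are illustrative, not the runner's own code. It rescans the declaration list on every pick, which is quadratic but keeps the ordering rule obvious.

```go
package main

import (
	"errors"
	"fmt"
)

var errCycle = errors.New("workflow contains a cycle")

// node is a hypothetical minimal workflow node.
type node struct {
	id        string
	dependsOn []string
}

// topoSort repeatedly picks the earliest-declared node whose dependencies
// have all been emitted, so output is deterministic across runs.
func topoSort(nodes []node) ([]string, error) {
	indeg := make(map[string]int, len(nodes))
	dependents := make(map[string][]string)
	for _, n := range nodes {
		indeg[n.id] = 0
	}
	for _, n := range nodes {
		for _, d := range n.dependsOn {
			if _, ok := indeg[d]; !ok {
				return nil, fmt.Errorf("node %q depends on unknown node %q", n.id, d)
			}
			indeg[n.id]++
			dependents[d] = append(dependents[d], n.id)
		}
	}
	order := make([]string, 0, len(nodes))
	done := make(map[string]bool, len(nodes))
	for len(order) < len(nodes) {
		picked := false
		for _, n := range nodes { // declaration order breaks ties
			if done[n.id] || indeg[n.id] != 0 {
				continue
			}
			done[n.id] = true
			order = append(order, n.id)
			for _, dep := range dependents[n.id] {
				indeg[dep]--
			}
			picked = true
			break
		}
		if !picked {
			return nil, errCycle // remaining nodes all wait on each other
		}
	}
	return order, nil
}

func main() {
	order, err := topoSort([]node{
		{id: "shape"},
		{id: "weakness", dependsOn: []string{"shape"}},
		{id: "improvement", dependsOn: []string{"weakness"}},
	})
	fmt.Println(order, err) // prints [shape weakness improvement] <nil>
}
```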

internal/workflow/runner_test.go (14 tests, all PASS):
  - Validate: missing name, no nodes, duplicate IDs, missing deps, cycles
  - Run: single node, 3-node DAG with chained $-refs (shape→weakness→improvement),
    failed-node skip cascade with independent siblings still running,
    unknown-mode abort, unresolved-reference error, implicit
    llm.chat fallback, provenance fields populated, inputs (not just
    prompt) honor $-refs, topological-sort stability for ties

cmd/observerd extended:
  - POST /observer/workflow/run executes a workflow, records each
    node's execution as an ObservedOp (source="workflow"), returns
    the full RunResult
  - GET /observer/workflow/modes lists the registered mode names
  - registerBuiltinModes wires fixture.echo + fixture.upper for v0;
    real modes register here in follow-up commits

scripts/workflow_smoke.sh (4 assertions PASS):
  - GET /modes lists fixture.echo + fixture.upper
  - 3-node DAG executes: shape (uppercase "hello world") → weakness
    (sees "HELLO WORLD" via $shape.output.upper ref) → improvement
    (sees "HELLO WORLD" propagated through 2-hop $weakness.output.prompt)
  - /observer/stats shows by_source.workflow == 3 (one per node) and
    total == 3 — provenance lands as expected
  - Unknown mode → 400 with "unknown mode" in error body

17-smoke regression all green. Acceptance gates G3.8.A (Archon-shape
workflow loads + executes topologically) + G3.8.B (per-node ObservedOps)
+ G3.8.C ($prior_node.output ref resolves, error on missing ref) all
satisfied. G3.8.D (in-process matrix.search dispatch) deferred until
a real mode is wired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:34:30 -05:00
root
97dd3f826d SPEC §3.5/§3.6/§3.7/§3.8 — name F/B/C as port targets + add Archon-style workflow runner
Per the 2026-04-29 scope-discipline pause: the wave shipped four
pieces beyond SPEC §3.4 component scope, and one architectural
pattern surfaced (Archon-style multi-pass workflow runner) that's
the observer's natural growth path. Document them as port targets
so the next scrum review has authoritative SPEC components.

§3.5 — Drift quantification (loop 5 of the PRD)
  Names the SCORER drift work shipped in be65f85 + the deferred
  shapes (PLAYBOOK drift, EMBEDDING drift, AUDIT BASELINE drift).
  Acceptance gates G3.5.A–B.

§3.6 — Staffing-side structured filter
  Names the metadata-filter MVP shipped in b199093 + the deferred
  pre-retrieval SQL gate via queryd. Acceptance gates G3.6.A–C.

§3.7 — Operational rating wiring
  Names the bulk playbook-record endpoint shipped in 6392772 + the
  deferred UI shim, negative-feedback path, and time-decay.
  Acceptance gates G3.7.A–B.

§3.8 — Observer-KB workflow runner (Archon-style multi-pass) —
       PORT TARGET, not yet started
  Documents the architecture J was working on across the Rust
  observer-kb branch (10 commits ahead of main, never merged) and
  the local Archon mod (committed 2026-04-29 as 3f2afc8 in
  /home/profit/external/Archon, not pushed to coleam00/Archon).

  The pattern: multi-pass mode chain (extract → validator →
  hallucination → consensus → redteam → pipeline → render) where
  each pass is a deterministic measurement. The observer is the
  natural home — workflows ARE observation patterns whose every
  step is recorded. Five components in dependency order: workflow
  definition (YAML), node executor (DAG runner), provenance
  recording (ObservedOps), mode catalog (matrix.search,
  distillation.score, drift.scorer, llm.chat), HTTP surface
  (/v1/observer/workflow/run).

  Reference materials on the system (preserved, not lost):
    - /home/profit/lakehouse/.archon/workflows/lakehouse-architect-review.yaml
      (Rust main, 69919d9) — 3-node Archon-via-Lakehouse proof
    - /home/profit/external/Archon dev branch — upstream engine
      with local pi/provider.ts mod (3f2afc8) for Lakehouse routing
    - Rust observer-kb branch — apps/observer-kb/docs/PRD.md +
      Python prototypes proven on real ChatGPT/Claude PDF data

  Acceptance gates G3.8.A–D. Estimated effort: L.

PRD updated with "Observer as system resource (clarified
2026-04-29)" section pointing at §3.8 as the architectural growth
path. The bare-bones observerd in bc9ab93 is the substrate; the
workflow runner is what makes it the "objective measurement engine"
the small-model pipeline needs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:27:41 -05:00
root
bc9ab93afe H: observerd — autonomous-iteration witness loop (SPEC §2 port)
Port of the load-bearing pieces of mcp-server/observer.ts (Rust
system, 852 lines TS) per SPEC §2's named target. Implements PRD
loop 3 ("Observer loop — watches each run, refines configs").

Routes (all under /v1/observer/* via gateway):
  GET  /observer/health   — liveness
  GET  /observer/stats    — total / successes / failures /
                             by_source / recent_scenario_ops
                             (matches Rust JSON shape exactly)
  POST /observer/event    — record one ObservedOp; auto-defaults
                             timestamp + source, validates required
                             fields (endpoint), persists to JSONL,
                             appends to ring buffer

Architecture:
  - internal/observer/types.go — ObservedOp model + Source taxonomy
    (mcp / scenario / langfuse / overseer_correction). Mirrors the
    Rust shape so JSON round-trips during cutover.
  - internal/observer/store.go — Store + Persistor. Ring buffer cap
    matches Rust's 2000; recent_scenarios cap matches Rust's 10.
    Same persist-then-apply order as pathwayd; same corruption-
    tolerant replay (skip malformed lines + warn).
  - cmd/observerd — :3219 HTTP service, fronted by gateway as
    /v1/observer/*.
  - lakehouse.toml + DefaultConfig — [observerd] block matches the
    pathwayd pattern (Bind + PersistPath; empty path = ephemeral).
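
A minimal sketch of the Store behavior described above: validate, fill defaults, persist the JSONL line, and only then apply to a fixed-capacity ring buffer. The struct fields, the NewStore signature, and the "mcp" default source are assumptions for illustration; the real code in internal/observer/store.go is not shown in this log.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
	"time"
)

// ObservedOp is a hypothetical, trimmed-down event shape.
type ObservedOp struct {
	Endpoint  string    `json:"endpoint"`
	Source    string    `json:"source"`
	Success   bool      `json:"success"`
	Timestamp time.Time `json:"timestamp"`
}

// Store keeps the most recent ops in a fixed-capacity ring buffer.
type Store struct {
	ring  []ObservedOp
	next  int       // index of the next write
	count int       // number of valid slots (<= len(ring))
	out   io.Writer // JSONL sink; nil means ephemeral
}

// NewStore expects capacity > 0.
func NewStore(capacity int, out io.Writer) *Store {
	return &Store{ring: make([]ObservedOp, capacity), out: out}
}

// Record validates, fills defaults, persists, then applies in memory.
// If persistence fails, the in-memory state is left untouched.
func (s *Store) Record(op ObservedOp) error {
	if strings.TrimSpace(op.Endpoint) == "" {
		return errors.New("endpoint is required")
	}
	if op.Timestamp.IsZero() {
		op.Timestamp = time.Now().UTC()
	}
	if op.Source == "" {
		op.Source = "mcp" // assumed default; the log does not name it
	}
	if s.out != nil {
		line, err := json.Marshal(op)
		if err != nil {
			return err
		}
		if _, err := s.out.Write(append(line, '\n')); err != nil {
			return err
		}
	}
	s.ring[s.next] = op
	s.next = (s.next + 1) % len(s.ring)
	if s.count < len(s.ring) {
		s.count++
	}
	return nil
}

// Recent returns the buffered ops, oldest first.
func (s *Store) Recent() []ObservedOp {
	out := make([]ObservedOp, 0, s.count)
	start := (s.next - s.count + len(s.ring)) % len(s.ring)
	for i := 0; i < s.count; i++ {
		out = append(out, s.ring[(start+i)%len(s.ring)])
	}
	return out
}

func main() {
	s := NewStore(3, io.Discard)
	for i := 0; i < 5; i++ {
		_ = s.Record(ObservedOp{Endpoint: fmt.Sprintf("/op/%d", i), Success: true})
	}
	for _, op := range s.Recent() {
		fmt.Println(op.Endpoint) // prints /op/2, /op/3, /op/4
	}
}
```

Persisting before applying means a failed write never leaves an op in memory that a later replay could not reproduce.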

Tests + smoke (all PASS):
  - 7 unit tests in store_test.go: validation, default fields,
    stats aggregation, recent-scenarios cap + ordering, ring-buffer
    rollover at cap, JSONL round-trip persistence, corruption-
    tolerant replay (1 valid + 1 corrupt + 1 valid → 2 applied)
  - scripts/observer_smoke.sh: 4 assertions through gateway:
    record 5 events (3 ok / 2 fail across 2 sources), stats
    aggregate correctly, empty-endpoint→400, kill+restart preserves
    state via JSONL replay (5 ops, 3 ok, 2 err survive)

Deferred (named in package + cmd doc, not in this commit):
  - POST /observer/review (cloud-LLM hand-review fall-back). The
    heuristic-only path could land cheaply but the productized
    cloud path (qwen3-coder fall-back) is a multi-day port.
  - Background loops: analyzeErrors, consolidatePlaybooks,
    tailOverseerCorrections (read overseer_corrections.jsonl into
    the ring buffer once per cycle).
  - escalateFailureClusterToLLMTeam (failure clustering trigger
    that posts to LLM Team's /api/run with code_review mode).

/relevance is NOT duplicated — already ported in 9588bd8 to
internal/matrix/relevance.go (component 3 of SPEC §3.4).

16-smoke regression all green (D1-D6, G1, G1P, G2, storaged_cap,
pathway, matrix, relevance, downgrade, playbook, observer).
13 binaries now: gateway, storaged, catalogd, ingestd, queryd,
vectord, embedd, pathwayd, matrixd, observerd, mcpd, fake_ollama
(plus catalogd-only test build).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:18:02 -05:00
root
6392772f41 C: bulk playbook record — operational rating wiring
POST /v1/matrix/playbooks/bulk accepts an array of playbook entries
and records each independently; a per-entry failure doesn't abort
the batch. Designed for two operational use cases:

  1. Backfilling historical placement data into the playbook
     substrate (the Rust system has 4,701 fill operations recorded
     with embeddings; that data deserves to feed the Go learning
     loop without a 4,701-call procedural script).
  2. Batched click-tracking from a session's worth of coordinator
     interactions, posted once at idle rather than per-click.

Per-entry response shape: {index, playbook_id} on success or
{index, error} on failure. The index lets the caller match each
failure to its input entry without diffing.

Smoke (scripts/playbook_smoke.sh, new assertion #4):
  Bulk POST 3 entries: 2 valid (alpha→widget-a, bravo→widget-b) +
  1 invalid (empty query_text). Verifies recorded=2, failed=1,
  the 2 valid ones get playbook_ids back, and the invalid one
  surfaces its validation error in-line.
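
A sketch of the per-entry semantics, using the same 2-valid + 1-invalid payload as the smoke. The Entry fields other than query_text, the recorded/failed/results envelope keys, and the ID scheme are assumptions; only the {index, playbook_id} / {index, error} result shape comes from this commit.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Entry is a hypothetical request item; answer_id is an invented field name.
type Entry struct {
	QueryText string `json:"query_text"`
	AnswerID  string `json:"answer_id"`
}

// Result carries either a playbook_id or an error for one input index.
type Result struct {
	Index      int    `json:"index"`
	PlaybookID string `json:"playbook_id,omitempty"`
	Error      string `json:"error,omitempty"`
}

// BulkResponse is an assumed envelope around the per-entry results.
type BulkResponse struct {
	Recorded int      `json:"recorded"`
	Failed   int      `json:"failed"`
	Results  []Result `json:"results"`
}

// recordOne stands in for the single-record path; the ID scheme is a
// placeholder.
func recordOne(i int, e Entry) (string, error) {
	if strings.TrimSpace(e.QueryText) == "" {
		return "", errors.New("query_text is required")
	}
	return fmt.Sprintf("pb-%04d", i), nil
}

// recordBulk records every entry independently: a failure is reported
// in-line at its index and the loop continues.
func recordBulk(entries []Entry) BulkResponse {
	resp := BulkResponse{Results: make([]Result, 0, len(entries))}
	for i, e := range entries {
		id, err := recordOne(i, e)
		if err != nil {
			resp.Failed++
			resp.Results = append(resp.Results, Result{Index: i, Error: err.Error()})
			continue
		}
		resp.Recorded++
		resp.Results = append(resp.Results, Result{Index: i, PlaybookID: id})
	}
	return resp
}

func main() {
	resp := recordBulk([]Entry{
		{QueryText: "alpha", AnswerID: "widget-a"},
		{QueryText: "bravo", AnswerID: "widget-b"},
		{QueryText: "", AnswerID: "widget-c"}, // invalid: empty query_text
	})
	out, _ := json.MarshalIndent(resp, "", "  ")
	fmt.Println(string(out))
}
```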

Single-record /matrix/playbooks/record from 06e7152 still works
unchanged; bulk is additive. The corpus field can be set per-
entry or once at the batch level (entry-level wins on collision).
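
The corpus precedence rule is small enough to show directly. This is a sketch with a hypothetical helper name and example corpus values; what happens when both values are empty is not stated here, so the sketch simply returns the empty string.

```go
package main

import "fmt"

// resolveCorpus applies "entry-level wins on collision": use the entry's
// corpus when set, otherwise fall back to the batch-level value.
func resolveCorpus(batchCorpus, entryCorpus string) string {
	if entryCorpus != "" {
		return entryCorpus
	}
	return batchCorpus
}

func main() {
	fmt.Println(resolveCorpus("fills", ""))       // fills
	fmt.Println(resolveCorpus("fills", "clicks")) // clicks
}
```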

Per the small-model autonomous pipeline framing: this is the
"playbook gets denser with each iteration" mechanism. Click
tracking → bulk POST → playbook entries → future similar queries
get those answers boosted via the existing /matrix/search
use_playbook path. The learning loop now has both inflows wired
(single + bulk) — what remains is the demo UI shim that calls
/feedback on result interaction (deferred — no Go demo UI yet).

15-smoke regression all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:10:13 -05:00