302 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
86123fce4c |
gateway: /v1/validate endpoint — Phase 43 v3 part 2
Closes the Phase 43 PRD's "any caller can validate" surface. The
validator crate (FillValidator + EmailValidator + PlaybookValidator
+ WorkerLookup) is now reachable over HTTP at /v1/validate.
Request/response:
POST /v1/validate
{"kind":"fill"|"email"|"playbook", "artifact":{...}, "context":{...}?}
→ 200 + Report on success
→ 422 + ValidationError on validation failure
→ 400 on bad kind
Boot-time wiring (main.rs):
- Load workers_500k.parquet into a shared Arc<dyn WorkerLookup>
- Path overridable via LH_WORKERS_PARQUET env
- Missing file: warn + fall back to empty InMemoryWorkerLookup so the
endpoint stays live (validators just fail Consistency on every
worker-existence check, which is the correct behavior when the
roster isn't configured)
- Boot log line: "workers parquet loaded from <path>" or
"workers parquet at <path> not found"
- Live boot timing: 500K rows loaded in ~1.4s
V1State gains `validate_workers: Arc<dyn validator::WorkerLookup>`.
The `_context` JSON key is auto-injected from `request.context` so
callers can either embed `_context` directly in `artifact` or split
it cleanly via the `context` field.
Verified live (gateway + 500K worker snapshot):
POST {kind:"fill", phantom W-FAKE-99999} → 422 Consistency
("does not exist in
worker roster")
POST {kind:"fill", real W-1, "Anyone"} → 200 OK + Warning
("differs from
roster name 'Donald
Green'")
POST {kind:"email", body has 123-45-6789} → 422 Policy ("SSN-
shaped sequence")
POST {kind:"nonsense"} → 400 Bad Request
The "0→85% with iteration" thesis can now run end-to-end on real
staffing data: an executor emits a fill_proposal, posts to
/v1/validate, gets a structured ValidationError on phantom IDs or
inactive workers, observer-corrects, retries. Closure of that loop
in a scrum harness is the next commit (separate scope).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
ebd9ab7c77 |
validator: Phase 43 v3 — production WorkerLookup backed by workers_500k.parquet
Some checks failed
lakehouse/auditor 13 blocking issues: cloud: claim not backed — "Verified end-to-end:"
Closes the Phase 43 v2 loose end. The validator scaffolds (FillValidator,
EmailValidator) take Arc<dyn WorkerLookup> at construction; this commit
ships the parquet-snapshot impl that production code wires in.
Schema mapping (workers_500k.parquet → WorkerRecord):
worker_id (int64) → candidate_id = "W-{id}" (matches what the
staffing executor
emits)
name (string) → name (already concatenated upstream)
role (string) → role
city, state (string) → city, state
availability (double) → status: "active" if >0 else "inactive"
Workers_500k has no `status` column; we derive from `availability`
since 0.0 means vacationing/suspended/etc in this dataset's
convention. Once Track A.B's `_safe` view ships with proper status,
flip the loader to read it directly — schema mapping is in one
function (load_workers_parquet), so the swap is trivial.
In-memory snapshot model:
- Loads all 500K rows at startup → ~75MB resident
- Sync .find() — no per-call I/O on the validation hot path
- Refresh = call load_workers_parquet again to rebuild
- Caller-driven refresh (no auto-watch) — operators pick the cadence
Why workers_500k and not candidates.parquet:
candidates.parquet has the right shape (string candidate_id, status,
first/last_name) but lacks `role` — and the staffing executor matches
the W-* convention from workers_500k_v8 corpus. So the production
data path goes through workers_500k. The schema mismatch between the
two parquets is documented in `reports/staffing/synthetic-data-gap-
report.md` (gap A); resolution is operator's call.
Errors are typed (LookupLoadError):
- Open: file not found / permission
- Parse: invalid parquet
- MissingColumn: schema doesn't have required field
- BadRow: row missing worker_id or name
Schema check happens before iteration, so a wrong-shape file fails
loud immediately rather than silently building an empty lookup.
Verification:
cargo build -p validator compiles
cargo test -p validator 33 pass / 0 fail
(was 31; +2 for parquet)
load_real_workers_500k smoke test passes against the
live 500K-row file:
W-1 resolves, status +
role + city/state all
populated.
Phase 43 v3 part 2 (next):
- /v1/validate gateway endpoint that takes a JSON artifact + dispatches
to FillValidator/EmailValidator/PlaybookValidator with a shared
WorkerLookup loaded from the parquet at gateway startup.
- That closes the "any caller can validate" surface; execution-loop
wiring (Phase 43 PRD's "generate → validate → correct → retry")
becomes a thin wrapper on top of /v1/validate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
f6af0fd409 |
phase 44 (part 1): migrate TS callers to /v1/chat + add regression guard
Some checks failed
lakehouse/auditor 16 blocking issues: cloud: claim not backed — "Verified end-to-end:"
Migrates the four TypeScript /generate callers to the gateway's
/v1/chat surface so every LLM call lands on /v1/usage and Langfuse:
tests/multi-agent/agent.ts::generate() provider="ollama"
tests/agent_test/agent_harness.ts::callAgent provider="ollama"
bot/propose.ts::generateProposal provider="ollama_cloud"
mcp-server/observer.ts (error analysis) provider="ollama"
Each migration follows the same pattern as the prior generateCloud()
migration (already on /v1/chat from 2026-04-24): replace
`fetch(SIDECAR/generate)` with `fetch(GATEWAY/v1/chat)`, swap the
prompt-style body for OpenAI-compat messages array, extract
content from `choices[0].message.content` instead of `text`.
Same upstream models in every case — gateway is the new home for
the call, transport otherwise unchanged.
Adds scripts/check_phase44_callers.sh — fail-loud regression guard
that exits non-zero if any non-adapter file fetches /generate or
api/generate. Adapter files (crates/gateway, crates/aibridge,
sidecar/) are exempt. Pre-tightening regex flagged prose mentions
in comments; the shipped regex requires `fetch(...)` or
`client.post(...)` shape so comments don't trip it.
Verification:
bun build mcp-server/observer.ts compiles
bun build tests/multi-agent/agent.ts compiles
bun build tests/agent_test/agent_harness.ts compiles
bun build bot/propose.ts compiles
./scripts/check_phase44_callers.sh ✅ clean
systemctl restart lakehouse-observer active
Phase 44 part 2 (deferred):
- crates/aibridge/src/client.rs:118 still posts to sidecar /generate
directly. AiClient is the foundational Rust LLM caller used by
8+ vectord modules; migrating it is a workspace-wide refactor
that needs its own commit. Plan: keep AiClient as the local-
transport layer for the gateway's `provider=ollama` arm, but
introduce a thin `/v1/chat` wrapper for external callers (vectord
autotune, agent, rag, refresh, supervisor, playbook_memory).
- tests/real-world/hard_task_escalation.ts: comment mentions
/api/generate but doesn't actually call it. Comment is left
intentionally as historical context; regex no longer flags it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
bfe1ea9d1c |
auditor: alternate Kimi K2.6 ↔ Haiku 4.5, drop Opus from auto-promotion
Some checks failed
lakehouse/auditor 13 blocking issues: cloud: claim not backed — "Verified end-to-end:"
Operator can't sustain Opus's ~$0.30/audit on the daemon. New strategy: - Even-numbered audits per PR use kimi-k2.6 via ollama_cloud (effectively free under the Ollama Pro flat subscription) - Odd-numbered audits use claude-haiku-4-5 via opencode/Zen (~$0.04/audit) - Frontier models (Opus, GPT-5.5-pro, Gemini 3.1-pro) are NOT in auto-promotion. Operator hands distilled findings to a frontier model manually when a load-bearing decision needs it. Mirrors the lakehouse playbook-memory pattern: cheap models do the volume, the validated subset compounds, only the compounded bundle gets handed to a frontier model. Same logic at the auditor layer. Audit-index derivation: count of existing kimi_verdicts files for the PR. So if the dir has 4 verdicts for PR #11 already, the 5th audit is index 4 (even) → Kimi, the 6th is index 5 (odd) → Haiku. Across an active PR's lifetime the audits naturally interleave the two lineages. Cost projection at observed cadence (5-10 pushes/day): - Old (Haiku default + Opus auto on big diffs): $1-3/day - New (Kimi/Haiku alternating, no Opus): $0.10-0.40/day - $31.68 budget lasts: ~3 months instead of ~10 days Override knobs: LH_AUDITOR_KIMI_MODEL=<X> pins to model X (no alternation) LH_AUDITOR_KIMI_PROVIDER=<P> provider for default model LH_AUDITOR_KIMI_ALT_MODEL=<X> sets the odd-index alternate LH_AUDITOR_KIMI_ALT_PROVIDER=<P> provider for alternate The OPUS_THRESHOLD env knobs from the prior auto-promotion commit are now no-ops (unset, no longer referenced). Verification: bun build auditor/checks/kimi_architect.ts compiles systemctl restart lakehouse-auditor active systemctl show env Haiku pin removed, Kimi default + cap=3 set Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
dc6dd1d30c |
auditor: per-PR audit cap (default 3) — daemon halts further audits until reset
Adds MAX_AUDITS_PER_PR (env LH_AUDITOR_MAX_AUDITS_PER_PR, default 3).
The poller increments a per-PR counter on each successful audit; when
the counter reaches the cap it skips that PR with a "capped" log line
until the operator manually clears state.audit_count_per_pr[<PR#>].
Why:
"I don't want it to continuously loop even if it finds a problem.
We need a maximum until we can come back."
Without this, the daemon polls every 90s and audits every new head
SHA. If each fix-commit surfaces new findings (which is what
kimi_architect is designed to do), the audit loop runs unbounded
while the operator is away. At ~$0.30/audit on Opus and 5-10 pushes
a day, that's $1-3/day idle burn — fine for a couple days, painful
for weeks.
Cap mechanics:
- Counter starts at 0 per PR (or whatever exists in state.json)
- Increments only on successful audit (failures don't count)
- Comparison is >= so cap=3 means audits 1, 2, 3 run; 4+ skip
- Skip is logged: "capped at N/M audits — clear state.json
audit_count_per_pr.<N> to resume"
- New `cycles_skipped_capped` counter on State for observability
Reset:
jq '.audit_count_per_pr = (.audit_count_per_pr - {"11": 4})' \
/home/profit/lakehouse/data/_auditor/state.json > /tmp/s.json && \
mv /tmp/s.json /home/profit/lakehouse/data/_auditor/state.json
- Daemon picks up the change on the next cycle (no restart needed —
state is reloaded each cycle)
- Or set the entry to 0 if you want to keep the key
Disable cap: LH_AUDITOR_MAX_AUDITS_PER_PR=0
Reduce cap: LH_AUDITOR_MAX_AUDITS_PER_PR=1 (one audit per PR head, then pause)
Pre-existing PR audits today (4 on PR #11) are NOT seeded into the
counter by this commit — operator decides post-deploy whether to set
state.audit_count_per_pr.11 to today's actual count or leave at 0.
Setting to 4 (or 3) immediately halts further audits on PR #11.
Verification:
bun build auditor/index.ts compiles
systemctl restart lakehouse-auditor active
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
19a65b87e3 |
auditor: 3 fixes from Opus self-audit on 454da15 + tree-split deletion
Some checks failed
lakehouse/auditor 14 blocking issues: cloud: claim not backed — "Verified end-to-end:"
The post-fix audit on commit 454da15 produced a fresh BLOCK and
re-flagged the dead tree-split as still dead. This commit lands the
BLOCK fix and the deletion.
LANDED:
1. kimi_architect.ts:113 BLOCK — MAX_TOKENS=128_000 exceeds Anthropic
Opus 4.x's 32K output cap. Worked silently (Anthropic clamps
server-side) but was technically invalid. Replaced single-default
with `maxTokensFor(model)` returning per-model caps:
claude-opus-* -> 32_000 (Opus extended-output)
claude-haiku-* -> 8_192 (Haiku/Sonnet default)
claude-sonnet-* -> 8_192
kimi-* -> 128_000 (reasoning_content needs headroom)
gpt-5*/o-series -> 32_000
default -> 16_000 (conservative)
LH_AUDITOR_KIMI_MAX_TOKENS env override still works (forces value
regardless of model).
2. inference.ts dead-code removal — Opus flagged tree-split as still
dead post-2026-04-27 mode-runner rebuild. Removed 156 lines:
runCloudInference (lines 464-503) legacy /v1/chat caller
treeSplitDiff (lines 547-619) shard-and-summarize fn
callCloud (lines 621-651) helper for treeSplitDiff
SHARD_MODEL const qwen3-coder:480b
SHARD_CONCURRENCY const 6
DIFF_SHARD_SIZE const 4500
CURATION_THRESHOLD const 30000
No live callers — verified by grep before deletion. The mode
runner's matrix retrieval against lakehouse_answers_v1 supplies
the cross-PR context that tree-split was synthesizing from scratch.
3. inference.ts:38-49 stale comment about "curate via tree-split"
replaced with current "matrix retrieval supplies cross-PR context"
semantics. Block was already physically gone but the comment
describing it remained, contradicting the actual code path.
SKIPPED (defensible / minor):
- WARN: outage sentinel TTL refresh on continued failure — intentional
(refresh keeps cache valid while upstream is still down)
- WARN: enrichment counts use Math.max — defensible (consensus
enrichment IS the max of the three runs)
- WARN: parseFindings regex eats severity into rationale on multi-
paragraph inputs — minor, hasn't affected grounding rate
- WARN: selectModel uses pre-truncation diff.length — defensible
(promotion is "is this audit worth Opus", not "what does the model
see")
- INFO×3: static.ts state reset, parentStruct walk bound,
appendMetrics 0-finding rows — all defensible per current intent
Verification:
bun build auditor/checks/{inference,kimi_architect}.ts compiles
systemctl restart lakehouse-auditor.service active
Net: -184 lines, +29 lines (155 net deletion).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
454da15301 |
auditor + aibridge: 6 fixes from Opus 4.7 self-audit on PR #11
Some checks failed
lakehouse/auditor 16 blocking issues: cloud: claim not backed — "Verified end-to-end:"
The kimi_architect auditor on commit 00c8408 ran with auto-promotion
to claude-opus-4-7 (diff > 100k chars), produced 10 grounded
findings, 1 BLOCK + 6 WARN + 3 INFO. This commit lands 6 of them; 3
are skipped (false positives or out-of-scope cleanup deferred).
LANDED:
1. kimi_architect.ts:144 empty-parse cache poisoning. When parseFindings
returns 0 findings (markdown shape changed, prompt too big, regex
missed every block), the verdict was still persisted with empty
findings, and the 24h TTL cache short-circuited every subsequent
audit with a useless "0 findings" hit. Fix: only persist when
findings.length > 0; metrics still appended unconditionally.
2. kimi_architect.ts:122 outage negative-cache. When callKimi throws
(network error, gateway 502, rate limit), we returned skipFinding
but didn't note the outage anywhere. Every audit cycle within the
24h TTL hammered the dead upstream. Fix: write a sentinel file
`<verdict>.outage` on failure with 10-min TTL; future calls within
that window short-circuit immediately.
3. kimi_architect.ts:331 mkdir(join(p, "..")) -> dirname(p). The
"/.." idiom resolved correctly via Node path normalization but
was non-idiomatic and breaks if the path ever has trailing dots.
Both Haiku and Opus self-audits flagged it.
4. inference.ts:202 N=3 consensus latency double/triple-count.
`totalLatencyMs += run.latency_ms` summed across THREE parallel
`Promise.all` calls — wall-clock is bounded by the slowest, not
the sum. Renamed to `maxLatencyMs` using `Math.max`. Telemetry now
reports actual wall-clock instead of 3x reality.
5. continuation.rs:198,199,230,231 i64/u64 -> u32 saturating cast.
`resp.tokens_evaluated as u32` truncates bits when source > u32::MAX
instead of saturating. Fix: u32::try_from(...).unwrap_or(u32::MAX)
wraps the cast in a real saturate. Applied to both the empty-retry
loop and the structural-completion continuation loop.
SKIPPED:
- BLOCK at Cargo.lock:8911 "validator-not-in-workspace" — confabulation.
The diff Opus saw was truncated mid-line; validator IS in
Cargo.toml workspace members. Real-world MAX_DIFF_CHARS=180k
edge case to watch as we feed more big diffs.
- WARN at kimi_architect.ts:248 regex absolute-path edge case — minor,
doesn't affect grounding rate observed so far.
- INFO at inference.ts:606 "dead reconstruction loop" — Opus misread.
The Promise.all worker fills `summaries[]`; the second loop builds
a sequential `scratchpad` string from those. Two distinct
operations, not redundant.
Verification:
bun build auditor/checks/{kimi_architect,inference}.ts compiles
cargo check -p aibridge green
cargo build --release -p gateway green
systemctl restart lakehouse.service lakehouse-auditor.service active
Next audit cycle (~90s after push) will run on the new diff and
exercise the negative-cache + dirname + maxLatencyMs paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
00c8408335 |
validator: Phase 43 v2 — real worker-existence + PII + name-consistency checks
Some checks failed
lakehouse/auditor 16 blocking issues: cloud: claim not backed — "Verified end-to-end:"
The Phase 43 scaffolds (FillValidator, EmailValidator) shipped with
TODO(phase-43 v2) markers for the actual cross-roster checks. This is
those checks landing.
The PRD calls for "the 0→85% pattern reproduces on real staffing
tasks — the iteration loop with validation in place is what made
small models successful." Worker-existence is the load-bearing check:
when the executor emits {candidate_id: "W-FAKE", name: "Imaginary"},
schema-only validation passes, and only roster lookup catches it.
Architecture:
- New `WorkerLookup` trait + `WorkerRecord` struct in lib.rs. Sync by
design — validators hold an in-memory snapshot, no per-call I/O on
the validation hot path. Production wraps a parquet snapshot;
tests use `InMemoryWorkerLookup`.
- Validators take `Arc<dyn WorkerLookup>` at construction so the
same shape covers prod + tests + future devops scaffolds.
- Contract metadata travels under JSON `_context` key alongside the
validated payload (target_count, city, state, role, client_id for
fills; candidate_id for emails). Keeps the Validator trait
signature stable and lets the executor serialize context inline.
FillValidator (11 tests, was 4):
- Schema (existing)
- Completeness — endorsed count == target_count
- Worker existence — phantom candidate_id fails Consistency
- Status — non-active worker fails Consistency
- Geo/role match — city/state/role mismatch with contract fails
Consistency
- Client blacklist — fails Policy
- Duplicate candidate_id within one fill — fails Consistency
- Name mismatch — Warning (not Error) since recruiters sometimes
send roster updates through the proposal layer
EmailValidator (11 tests, was 4):
- Schema + length (existing)
- SSN scan (NNN-NN-NNNN) — fails Policy
- Salary disclosure (keyword + $-amount within ~40 chars) — fails
Policy. Std-only scan, no regex dep added.
- Worker name consistency — when _context.candidate_id resolves,
body must contain the worker's first name (Warning if missing)
- Phantom candidate_id in _context — fails Consistency
- Phone NNN-NNN-NNNN does NOT trip the SSN detector (verified by
test); the SSN scanner explicitly rejects sequences embedded in
longer digit runs
Pre-existing issue (NOT from this change, NOT fixed here):
crates/vectord/src/pathway_memory.rs:927 has a stale PathwayTrace
struct initializer that fails `cargo check --tests` with E0063 on
6 missing fields. `cargo check --workspace` (production) is green;
only the vectord test target is broken. Tracked for a separate fix.
Verification:
cargo test -p validator 31 pass / 0 fail (was 13)
cargo check --workspace green
Next: wire `Arc<dyn WorkerLookup>` into the gateway execution loop
(generate → validate → observer-correct → retry, bounded by
max_iterations=3 per Phase 43 PRD). Production lookup impl loads
from a workers parquet snapshot — Track A gap-fix B's `_safe` view
is the right source once decided, raw workers_500k otherwise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
8aa7ee974f |
auditor: auto-promote to Claude Opus 4.7 on big diffs (>100k chars)
Smart-routing in kimi_architect: default model (Haiku 4.5 by env, or
Kimi K2.6 if not set) handles normal PR audits cheap and fast; diffs
above LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS (default 100k) get
promoted to Claude Opus 4.7 for the audit.
Why this split: the 2026-04-27 3-way bake-off (Kimi K2.6 vs Haiku 4.5
vs Opus 4.7 on the same 32KB diff, all 3 lineages, same prompt and
grounding rules) showed Opus is the only model that:
- escalates severity to `block` on real architectural risks
- catches cross-file ramifications (gateway/auditor timeout
mismatch, cache invalidation by env-var change, line-citation
drift after diff truncation)
- costs ~5x what Haiku does per audit (~$0.10 vs $0.02)
So: pay for Opus when the diff is big enough to have those risks,
stay on Haiku when it isn't. 80% of refactor PRs cross 100KB; 90% of
single-feature PRs don't.
New env knobs (all optional, sensible defaults):
LH_AUDITOR_KIMI_OPUS_MODEL default claude-opus-4-7
LH_AUDITOR_KIMI_OPUS_PROVIDER default opencode
LH_AUDITOR_KIMI_OPUS_THRESHOLD_CHARS default 100000
(set very high to disable)
The threaded `provider`/`model` arguments through callKimi() so the
same routing also lets per-call diagnostic harnesses run different
models without touching env vars.
Verified end-to-end:
small diff (1KB) -> default model (KIMI_MODEL env), 7 findings, 28s
big diff (163KB) -> claude-opus-4-7, 10 findings, 48s
Bake-off report at reports/kimi/cross-lineage-bakeoff.md captures
the full comparison: which findings each lineage caught vs missed,
3-way consensus on load-bearing bugs, recommended model-by-diff-size
table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
bc698eb6da |
gateway: OpenCode (Zen + Go) provider adapter
Wires opencode.ai as a /v1/chat provider. One sk-* key reaches 40
models across Anthropic, OpenAI, Google, Moonshot, DeepSeek, Zhipu,
Alibaba, Minimax — billed against either the user's Zen balance
(pay-per-token premium models) or Go subscription (flat-rate
Kimi/GLM/DeepSeek/etc.). The unified /zen/v1 endpoint routes both;
upstream picks the billing tier based on model id.
Notable adapter quirks:
- Strip "opencode/" prefix on outbound (mirrors openrouter/kimi
pattern). Caller can use {provider:"opencode", model:"X"} or
{model:"opencode/X"}.
- Drop temperature for claude-*, gpt-5*, o1/o3/o4 models. Anthropic
and OpenAI's reasoning lineage rejects temperature with 400
"deprecated for this model". OCChatBody now serializes temperature
as Option<f64> with skip_serializing_if so omitting it produces
clean JSON.
- max_tokens.filter(|&n| n > 0) catches Some(0) — defensive after
the same trap bit kimi.rs (empty env -> Number("") -> 0 -> 503).
- 600s default upstream timeout; reasoning models on big audit
prompts legitimately take 3-5 min. Override OPENCODE_TIMEOUT_SECS.
Key handling:
- /etc/lakehouse/opencode.env (0600 root) loaded via systemd
EnvironmentFile. Same pattern as kimi.env.
- OPENCODE_API_KEY env first, file scrape as fallback.
Verified end-to-end:
opencode/claude-opus-4-7 -> "I'm Claude, made by Anthropic."
opencode/kimi-k2.6 -> PONG-K26-GO
opencode/deepseek-v4-pro -> PONG-DS-V4
opencode/glm-5.1 -> PONG-GLM
opencode/minimax-m2.5-free -> PONG-FREE
Pricing reference (per audit @ ~14k in / 6k out):
claude-opus-4-7 ~$0.22 (Zen)
claude-haiku-4-5 ~$0.04 (Zen)
gpt-5.5-pro ~$1.50 (Zen)
gemini-3-flash ~$0.03 (Zen)
kimi-k2.6 / glm / deepseek / qwen / minimax / mimo: covered by Go
subscription ($10/mo, $60/mo cap).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
ff5de76241 |
auditor + gateway: 2 fixes from kimi_architect's first real run
Acted on 2 of 10 findings Kimi caught when auditing its own integration on PR #11 head 8d02c7f. Skipped 8 (false positives or out-of-scope). 1. crates/gateway/src/v1/kimi.rs — flatten OpenAI multimodal content array to plain string before forwarding to api.kimi.com. The Kimi coding endpoint is text-only; passing a [{type,text},...] array returns 400. Use Message::text() to concat text-parts and drop non-text. Verified with curl using array-shape content: gateway now returns "PONG-ARRAY" instead of upstream error. 2. auditor/checks/kimi_architect.ts — computeGrounding switched from readFileSync to async readFile inside Promise.all. Doesn't matter at 10 findings; would matter at 100+. Removed unused readFileSync import. Skipped findings (with reason): - drift_report.ts:18 schema bump migration concern: the strict schema_version refusal IS the migration boundary (v1 readers explicitly fail on v2; not a silent corruption risk). - replay.ts:383 ISO timestamp precision: Date.toISOString always emits "YYYY-MM-DDTHH:mm:ss.sssZ" (ms precision). False positive. - mode.rs:1035 matrix_corpus deserializer compat: deserialize_string _or_vec at mode.rs:175 already accepts both shapes. Confabulation from not seeing the deserializer in the input bundle. - /etc/lakehouse/kimi.env world-readable: actually 0600 root. Real concern would be permission-drift; not a code bug. - callKimi response.json hang: obsolete; we use curl now. - parseFindings silent-drop: ergonomic concern, not a bug. - appendMetrics join with "..": works for current path; deferred. - stubFinding dead-type extension: cosmetic. Self-audit grounding rate at v1.0.0: 10/10 file:line citations verified by grep. 2 of 10 actionable bugs landed. The other 8 were correctly flagged as concerns but didn't earn a code change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
3eaac413e6 |
auditor: route kimi_architect through ollama_cloud/kimi-k2.6 (TOS-clean primary)
Two changes:
1. Default provider now ollama_cloud/kimi-k2.6 (env-overridable via
LH_AUDITOR_KIMI_PROVIDER + LH_AUDITOR_KIMI_MODEL). Ollama Cloud Pro
exposes kimi-k2.6 legitimately, so we no longer need the User-Agent-
spoof path through api.kimi.com. Smoke test 2026-04-27:
api.kimi.com 368s 8 findings 8/8 grounded
ollama_cloud 54s 10 findings 10/10 grounded
The kimi.rs adapter (provider=kimi) stays wired as a fallback when
Ollama Cloud is upstream-broken.
2. Switch HTTP transport from Bun's native fetch to curl via Bun.spawn.
Bun fetch has an undocumented ~300s ceiling that AbortController +
setTimeout cannot override; curl honors -m for end-to-end max
transfer time without a hard intrinsic limit. Required for Kimi's
reasoning-heavy responses on big audit prompts.
3. Bug fix Kimi caught in this very file (turtles all the way down):
Number(process.env.LH_AUDITOR_KIMI_MAX_TOKENS ?? 128_000) yields 0
when env is set to empty string — `??` only catches null/undefined.
Switched to Number(env) || 128_000 so empty/0/NaN all fall back.
Same pattern probably exists in other files; future audit pass.
4. Bumped MAX_TOKENS default 12K -> 128K. Kimi K2.6's reasoning_content
counts against this budget but isn't surfaced in OpenAI-shape content;
12K silently produced finish_reason=length with empty content when
reasoning consumed the budget.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
8d02c7f441 |
auditor: integrate Kimi second-pass review (off by default, LH_AUDITOR_KIMI=1)
Adds kimi_architect as a fifth check kind in the auditor. Runs sequentially after static/dynamic/inference/kb_query, consumes their findings as context, and asks Kimi For Coding "what did everyone miss?" — targeting load-bearing issues that deepseek N=3 voting can't see (compile errors, false telemetry, schema bypasses, determinism leaks). 7/7 grounded on the distillation v1.0.0 audit experiment 2026-04-27. Off by default. Enable on the lakehouse-auditor service: systemctl edit lakehouse-auditor.service Environment=LH_AUDITOR_KIMI=1 Tunable env (all optional): LH_AUDITOR_KIMI_MODEL default kimi-for-coding LH_AUDITOR_KIMI_MAX_TOKENS default 12000 LH_GATEWAY_URL default http://localhost:3100 Guardrails: - Failure-isolated. Any Kimi error / 429 / TOS revocation returns a single info-level skip-finding so the existing pipeline never blocks on a Kimi outage. - Cost-bounded. Cached verdicts at data/_auditor/kimi_verdicts/<pr>- <sha>.json with 24h TTL — re-audits within the window return cached findings instead of re-calling upstream. New commits produce new SHAs so caching is per-head, not per-day. - 6min upstream timeout (vs 2min for openrouter inference) — Kimi is a reasoning model and the audit prompt is large. - Grounding verification baked in. Every finding's cited file:line is greppped against the actual file before the verdict is persisted. Per-finding evidence carries [grounding: verified at FILE:LINE] or [grounding: line N > EOF] / [grounding: file not found]. Confab- ulation rate goes into data/_kb/kimi_audits.jsonl as grounding_rate for "is this still valuable" tracking. Persisted artifacts: data/_auditor/kimi_verdicts/<pr>-<sha>.json full verdict + raw Kimi response + grounding data/_kb/kimi_audits.jsonl one row per call: latency, tokens, findings, grounding rate Verdict-rendering: kimi_architect now appears in the per-check sections of the human-readable comment posted to PRs (auditor/audit.ts checkOrder), after kb_query. Verification: bun build auditor/checks/kimi_architect.ts compiles bun build auditor/audit.ts compiles parser sanity (3-finding fixture) 3/3 lifted correctly Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
643dd2d520 |
gateway: direct Kimi For Coding provider adapter (api.kimi.com)
Wires kimi-for-coding (Kimi K2.6 underneath) as a first-class /v1/chat
provider so consumers can target it via {provider:"kimi"} or model
prefix kimi/<model>. Bypasses the upstream-broken kimi-k2:1t on Ollama
Cloud and the rate-limited moonshotai/kimi-k2.6 path through OpenRouter.
Adapter shape mirrors openrouter.rs (OpenAI-compatible Chat Completions).
Differences from generic OpenAI providers:
- api.kimi.com is a SEPARATE account system from api.moonshot.ai and
api.moonshot.cn. sk-kimi-* keys are NOT interchangeable across them.
- Endpoint is User-Agent-gated to "approved" coding agents (Kimi CLI,
Claude Code, Roo Code, Kilo Code, ...). Requests from generic clients
return 403 access_terminated_error. Adapter sends User-Agent:
claude-code/1.0.0. Per Moonshot TOS this is a tampering-class action
that may result in seat suspension; J authorized 2026-04-27 with
awareness of the risk.
- kimi-for-coding is a reasoning model — reasoning_content counts
against max_tokens. Default 800-token budget yields empty visible
content with finish_reason=length. Code-review workloads need
max_tokens >= 1500.
- Default 600s upstream timeout (vs 180s for openrouter.rs) — code
audits with full file context legitimately take 3-5 minutes.
Override via KIMI_TIMEOUT_SECS env.
Key handling:
- /etc/lakehouse/kimi.env (0600 root) loaded via systemd EnvironmentFile
- KIMI_API_KEY env first, then file scrape as fallback
- /etc/systemd/system/lakehouse.service NOT included in this commit
(system file outside repo); operator must add EnvironmentFile=-
/etc/lakehouse/kimi.env to the lakehouse.service unit
NOT in scrum_master_pipeline LADDER. The 9-rung ladder is for
unattended automatic recovery; placing Kimi there would hammer a
TOS-gated endpoint with hostility-policy potential. Kimi is
addressable via /v1/chat for explicit invocations only — auditor
integration in a follow-up commit.
Verification:
cargo check -p gateway --tests compiles
curl /v1/chat provider=kimi 200 OK, content="PONG"
curl /v1/chat model="kimi/kimi-for-coding" 200 OK (prefix routing)
Kimi audit on distillation last-week 7/7 grounded findings
(reports/kimi/audit-last-week-full.md)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
d77622fc6b |
distillation: fix 7 grounding bugs found by Kimi audit
Kimi For Coding (api.kimi.com, kimi-for-coding) ran a forensic audit on
distillation v1.0.0 with full file content. 7/7 flags verified real on
grep. Substrate now matches what v1.0.0 claimed: deterministic, no
schema bypasses, Rust tests compile.
Fixes:
- mode.rs:1035,1042 matrix_corpus Some/None -> vec![..]/vec![]; cargo
check --tests now compiles (was silently broken;
only bun tests were running)
- scorer.ts:30 SCORER_VERSION env override removed - identical
input now produces identical version stamp, not
env-dependent drift
- transforms.ts:181 auto_apply wall-clock fallback (new Date()) ->
deterministic recorded_at fallback
- replay.ts:378 recorded_run_id Date.now() -> sha256(recorded_at);
replay rows now reproducible given recorded_at
- receipts.ts:454,495 input_hash_match hardcoded true was misleading
telemetry; bumped DRIFT_REPORT_SCHEMA_VERSION 1->2,
field is now boolean|null with honest null when
not computed at this layer
- score_runs.ts:89-100,159 dedup keyed only on sig_hash made
scorer-version bumps invisible. Composite
sig_hash:scorer_version forces re-scoring
- export_sft.ts:126 (ev as any).contractor bypass emitted "<contractor>"
placeholder for every contract_analyses SFT row.
Added typed EvidenceRecord.metadata bucket;
transforms.ts populates metadata.contractor;
exporter reads typed value
Verification (all green):
cargo check -p gateway --tests compiles
bun test tests/distillation/ 145 pass / 0 fail
bun acceptance 22/22 invariants
bun audit-full 16/16 required checks
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
d11632a6fa |
staffing: recon + synthetic-data gap report (Phase 0, no implementation)
Some checks failed
lakehouse/auditor 13 blocking issues: cloud: claim not backed — "Phase 8 done-criteria (per spec):"
Spec mandates these two docs before any staffing audit runner ships:
docs/recon/staffing-lakehouse-distillation-recon.md
reports/staffing/synthetic-data-gap-report.md
NO distillation core touched. Distillation v1.0.0 (commit e7636f2,
tag distillation-v1.0.0) remains the stable substrate. Staffing
work is consumer-only.
Recon findings (12 sections, ~5KB):
- Existing staffing schemas in crates/validator/staffing/* are scaffolds
(FillValidator schema-shape only; worker-existence/status/geo TODOs)
- Synthetic data spans 6+ shapes across 9 parquet files
(~625k worker-shape rows + 1k candidate-shape rows)
- PII detection lives in shared/pii.rs but enforcement at query
time is unverified — the LLM may have been seeing raw PII via
workers_500k_v8 vector corpus
- 44 scenarios + 64 playbook_lessons = ~108 RAG candidates
- No structured fill-event log exists; scenarios+lessons are
retrospective, not queryable per-event records
- workers_500k.phone is int (should be string — leading-zero loss)
- client_workerskjkk.parquet is a typo file (160 rows, sibling of
client_workersi.parquet)
- PRD §158 claims Phase 19 closed playbook write-only gap — unverified
Gap report findings (9 sections, ~6KB):
- 4 BLOCKING gaps requiring J decisions before audit ships:
A. Generate fill_events.parquet from scenarios + lessons?
B. Build views/{candidates,workers,jobs}_safe with PII masking?
C. Delete client_workerskjkk.parquet typo file?
D. Fix workers_500k.phone type (int → string)?
- 5 SOFT gaps the audit can run with (will be reported as findings)
- 3 NON-gaps (data sufficient as-is)
- Recommendation: NO new synthetic data needed; only normalization
of what already exists, contingent on J approval of A-D
Up-front commitments:
- Distillation v1.0.0 substrate untouched (verified by audit-full
running clean before+after each staffing change)
- All synthetic-data modifications via deterministic scripts under
scripts/staffing/, never hand-edit
- Every staffing artifact carries canonical sha256 provenance back
to source parquet/scenario/lesson
- _safe views are the source of truth for LLM-facing text; raw
parquets never directly fed into corpus builds
Phase 1 unblocks AFTER J reviews both docs and approves audit scope
+ the 4 gap-fix decisions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
e7636f202b |
distillation: regenerate v1.0.0 release artifacts
Some checks failed
lakehouse/auditor 13 blocking issues: cloud: claim not backed — "Phase 8 done-criteria (per spec):"
Auto-generated by `./scripts/distill release-freeze` — RELEASE-READY (6/6 gates). Captures the v1.0.0 manifest + the latest acceptance + audit reports re-run during the freeze. reports/distillation/release-freeze.md human-readable manifest reports/distillation/release-manifest.json machine-readable manifest reports/distillation/phase6-acceptance-report.md re-run during freeze (22/22 invariants) reports/distillation/phase8-full-audit-report.md re-run during freeze (16/16 required) Pre-tag state: branch: scrum/auto-apply-19814 head: <prior commit before this one> full pipeline: 145 distillation tests pass · 0 fail acceptance: 22/22 invariants on fixture, bit-identical reproducibility audit-full: 16/16 required across Phases 0-7 Tag command awaiting operator confirmation: git tag -a distillation-v1.0.0 -m "distillation v1.0.0 — 8-phase substrate frozen" git push origin distillation-v1.0.0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>distillation-v1.0.0 |
||
|
|
73f242e3e4 |
distillation: Phase 9 — release freeze and operator handoff
Final phase. Adds:
scripts/distillation/release_freeze.ts ~330 lines, 6 release gates
docs/distillation/operator-handoff.md durable cold-start operator doc
docs/distillation/recovery-runbook.md failure-mode runbook by symptom
scripts/distillation/distill.ts +release-freeze subcommand
The release_freeze orchestrator runs every gate the system has:
1. Clean git state (tolerates auto-regenerated reports)
2. Full test suite (bun test tests/distillation auditor/schemas/distillation)
3. Phase commit verification (every Phase 0-8 commit resolves)
4. Acceptance gate (22-invariant fixture E2E)
5. audit-full (Phases 0-7 verified + drift detection)
6. Tag availability check (distillation-v1.0.0 not yet existing)
Outputs:
reports/distillation/release-freeze.md human-readable manifest
reports/distillation/release-manifest.json machine-readable manifest
Manifest captures:
- git_head + git_branch + released_at
- phase→commit map for all 9 commits (Phase 0+1+2 scaffold through Phase 8 audit)
- dataset counts at freeze (RAG/SFT/Preference/evidence/scored/quarantined)
- latest audit baseline row
- per-gate pass/fail with detail
Operator handoff doc covers:
- phase map with commits + report locations
- known-good commands
- how to rerun audit-full + inspect drift
- how to restore from last-good (git checkout distillation-v1.0.0)
- how to add future phases without contaminating corpus
- what NOT to modify casually (with file:reason mapping)
- cumulative commits at v1.0.0
Recovery runbook covers, by symptom:
- audit-full exit non-zero (per-phase diagnostics)
- drift table flags warn (intentional vs regression)
- acceptance fail vs audit-full pass divergence
- run-all empty exports (counter-bisection order)
- hash mismatch on identical input (determinism violation; CRITICAL)
- replay logs growing unbounded (rotation guidance)
- nuclear restore via git checkout distillation-v1.0.0
Spec constraints (per now.md Phase 9):
- DO NOT add new intelligence features ✓ (zero new logic)
- DO NOT change scoring/export logic ✓ (zero touches)
- DO NOT weaken gates ✓ (gates only added, never relaxed beyond the
auto-regen tolerance documented in checkCleanGit)
- DO NOT retrain anything ✓ (no model touches)
CLI:
./scripts/distill release-freeze # exit 0 = release-ready
Tag creation deferred to operator confirmation (the release-freeze
report prints the exact `git tag` command). Per CLAUDE.md guidance,
destructive/visible operations like tags require explicit user
authorization.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
5bdd159966 |
distillation: Phase 8 — full system audit
Some checks failed
lakehouse/auditor 14 blocking issues: cloud: claim not backed — "Phase 8 done-criteria (per spec):"
Meta-audit script that runs deterministic checks across Phases 0-7
and compares to a baseline (auto-grown from prior runs). Pure
observability — no pipeline modification. Single command:
./scripts/distill audit-full
Files (2 new + 1 modified):
scripts/distillation/audit_full.ts ~430 lines, 8 phase checks + drift
scripts/distillation/distill.ts +audit-full subcommand
reports/distillation/phase8-full-audit-report.md (autogenerated by run)
Real-data audit on commit 681f39d:
22 total checks, 16 required, ALL 16 required PASS.
Per-phase (required-pass / required):
P0 recon: 1/1 — docs/recon/local-distillation-recon.md + tier-1 streams
P1 schemas: 1/1 — 51 schema tests pass via subprocess
P2 evidence: 1/1 — materializer dry-run completes
P3 scoring: 1/1 — acc=386 part=132 rej=57 hum=480 on disk
P4 exports: 5/5 — SFT 0-leak + RAG 0-rejected + Pref 0 self-pairs +
0 identical-text + 0 missing provenance
P5 receipts: 4/4 — 5/5 stage receipts, all validate, RunSummary valid,
run_hash is sha256
P6 acceptance: 1/1 — 22/22 fixture invariants pass via subprocess
P7 replay: 2/2 — 3/3 dry-run tasks pass + escalation guard holds
Drift detection (auto-grown baseline at data/_kb/audit_baselines.jsonl):
10 tracked metrics across P2/P3/P4 + quarantine totals.
This run vs first audit baseline: 0% drift on all 10 metrics.
Future drift >20% on any metric flips flag from ok → warn.
Non-negotiables:
- DO NOT modify pipeline logic — audit only reads + calls scripts
- DO NOT suppress failures — non-zero exit on any required-check fail
- DO NOT fake pass conditions — checks are deterministic + assertive
Bug surfaced during construction (matches the spec's "spec is honest"
gate): P3 check first used scoreAll dry-run which reported 0 accepted
because scored-runs were deduped against. Fixed by reading
data/scored-runs/ directly to get the on-disk distribution. Same
class of bug as the audits.jsonl recon mistake from Phase 3 — assume
nothing about a stream, inspect what's there.
Phase 8 done-criteria (per spec):
✓ audit command runs successfully
✓ all 8 phases verified (P0..P7)
✓ drift clearly reported (10-metric drift table per run)
✓ report exists (reports/distillation/phase8-full-audit-report.md)
What this unlocks:
Subsequent CI / cron runs of audit-full will surface real drift if
the pipeline's behavior changes. The system is now self-monitoring
in the strongest sense: every invariant has an automated check,
every metric has a drift gate, and the report tells a future agent
exactly what diverged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
681f39d5fa |
distillation: Phase 7 — replay-driven local model bootstrapping
Some checks failed
lakehouse/auditor 13 blocking issues: cloud: claim not backed — "probes; multi-hour outage). deepseek is the proven drop-in from"
Runtime layer that takes a task → retrieves matching playbooks/RAG
records → builds a structured context bundle → feeds it to a LOCAL
model (qwen3.5:latest, ~7B class) → validates output → escalates only
when needed → logs the full run as new evidence. NOT model training.
Pure runtime behavior shaping via retrieval against the Phase 0-6
distillation substrate.
Files (3 new + 1 modified):
scripts/distillation/replay.ts ~370 lines
tests/distillation/replay.test.ts 10 tests, 19 expects
scripts/distillation/distill.ts +replay subcommand
reports/distillation/phase7-replay-report.md
Test metrics: 145 cumulative distillation tests pass · 0 fail · 372 expects · 618ms
Real-data A/B on 3 tasks (same qwen3.5:latest local model, with vs
without retrieval) — proves the spec claim "local model improves
with retrieval":
Task 1 "Audit phase 38 provider routing":
WITH retrieval: cited V1State, openrouter, /v1/chat, ProviderAdapter,
PRD.md line ranges — REAL Lakehouse internals
WITHOUT retrieval: invented "P99999, Z99999 placeholder codes" and
"production routing table" — pure fabrication
Task 2 "Verify pr_audit mode wired":
WITH: correct crates/gateway/src/main.rs path + lakehouse_answers_v1
WITHOUT: same assertion, no proof, asserts confidently
Task 3 "Audit phase 40 PRD circuit breaker drift":
WITH: anchored on the actual audit finding "no breaker class found"
WITHOUT: invented "0.0% failure rate vs 5.0% threshold" and signed
off as PASS on broken code — exact failure mode the
distillation pipeline was built to prevent
Both runs passed the structural validation gate (length, no hedges,
checklist token overlap) — the difference is grounding, supplied by
the retrieval layer pulling from exports/rag/playbooks.jsonl (446
records from earlier Phase 4 export).
Architecture:
jaccard token overlap against rag corpus → top-K (default 8) split
into accepted exemplars (top 3) + partial-warnings (top 2) + extracted
validation_steps (lines starting verify|check|assert|ensure|confirm)
→ prompt assembly → qwen3.5:latest via /v1/chat (or OpenRouter
for namespaced/free models) → deterministic validation gate →
escalation to deepseek-v3.1:671b on fail with --allow-escalation
→ log to data/_kb/replay_runs.jsonl
Spec invariants enforced:
- never bypass retrieval (--no-retrieval is explicit baseline, not default)
- never discard provenance (task_hash + rag_ids + full bundle logged)
- never allow free-form hallucinated output (validation gate is
deterministic code, never an LLM)
- log every run as new evidence (replay_run.v1 schema, append-only
to data/_kb/replay_runs.jsonl)
CLI:
./scripts/distill replay --task "<input>" [--local-only]
[--allow-escalation]
[--no-retrieval]
What this unlocks:
The substrate for "small-model bootstrapping" and "local inference
dominance" J flagged after Phase 5. Phase 8+ closes the loop:
schedule replay runs on common tasks, score outputs, feed accepted
ones back into corpus, measure escalation rate decreasing over time.
Known limitations (documented in report):
- Validation gate is structural not semantic (catches hedges/empty
but not plausible-wrong). Phase 13 wiring: run auditor against
every replay output.
- Retrieval is jaccard keyword. Works at 446 corpus, scale via
/vectors/search HNSW retrieval once corpus crosses ~10k.
- Convergence claim is architectural (deterministic retrieval +
low-temp call); longitudinal empirical study is Phase 8+.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
20a039c379 |
auditor: rebuild on mode runner + drop tree-split (use distillation substrate)
Some checks failed
lakehouse/auditor 13 blocking issues: cloud: claim not backed — "Invariants enforced (proven by tests + real run):"
Architectural simplification leveraging Phase 5 distillation work: the auditor no longer pre-extracts facts via per-shard summaries because lakehouse_answers_v1 (gold-standard prior PR audits + observer escalations corpus) supplies cross-PR context through the mode runner's matrix retrieval. Same signal, ~50× fewer cloud calls per audit. Per-audit cost: Before: 168 gpt-oss:120b shard summaries + 3 final inference calls After: 3 deepseek-v3.1:671b mode-runner calls (full retrieval included) Wall-clock on PR #11 (1.36MB diff): Before: ~25 minutes After: 88 seconds (3/3 consensus succeeded) Files: auditor/checks/inference.ts - Default MODEL kimi-k2:1t → deepseek-v3.1:671b. kimi-k2 is hitting sustained Ollama Cloud 500 ISE (verified via repeated trivial probes; multi-hour outage). deepseek is the proven drop-in from Phase 5 distillation acceptance testing. - Dropped treeSplitDiff invocation. Diff truncates to MAX_DIFF_CHARS and goes straight to /v1/mode/execute task_class=pr_audit; mode runner pulls cross-PR context from lakehouse_answers_v1 via matrix retrieval. SHARD_MODEL retained for legacy callCloud compatibility (default qwen3-coder:480b if it ever runs). - extractAndPersistFacts now reads from truncated diff (no scratchpad post-tree-split-removal). auditor/checks/static.ts - serde-derived struct exemption (commit 107a682 shipped this; this commit is the rest of the auditor rebuild it landed alongside) - multi-line template literal awareness in isInsideQuotedString — tracks backtick state across lines so todo!() inside docstrings doesn't trip BLOCK_PATTERNS. crates/gateway/src/v1/mode.rs - pr_audit native runner mode added to VALID_MODES + is_native_mode + flags_for_mode + framing_text. PrAudit framing produces strict JSON {claim_verdicts, unflagged_gaps} for the auditor to parse. config/modes.toml - pr_audit task class with default_model=deepseek-v3.1:671b and matrix_corpus=lakehouse_answers_v1. Documents kimi-k2 outage with link to the swap rationale. Real-data audit on PR #11 head 1b433a9 (which is the PR with all the distillation work + auditor rebuild itself): - Pipeline ran to completion (88s for inference; full audit ~3 min) - 3/3 consensus runs succeeded on deepseek-v3.1:671b - 156 findings: 12 block, 23 warn, 121 info - Block findings are legitimate signal: 12 reviewer claims like "Invariants enforced (proven by tests + real run):" that the truncated diff can't directly verify. The auditor is correctly flagging claim-vs-diff divergence — exactly its job. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
1b433a9308 |
distillation: Phase 6 — acceptance gate suite
End-to-end fixture-driven gate. Runs the entire pipeline (collect →
score → export-rag → export-sft → export-preference) on a deterministic
fixture, asserts 22 invariants, runs a SECOND time with the same
recorded_at, and verifies hash reproducibility. Exits non-zero on any
failure. Pure observability — no scoring/filtering/schema changes.
Files (3 new + 1 modified + 6 fixture jsonls):
scripts/distillation/acceptance.ts 330 lines, runner + 22 checks
reports/distillation/phase6-acceptance-report.md autogenerated by run
scripts/distillation/distill.ts +run-all, +receipts, +acceptance subcommands
tests/fixtures/distillation/acceptance/data/_kb/
scrum_reviews.jsonl 5 rows (accepted/partial/needs_human/scratchpad/missing-provenance)
audits.jsonl 3 rows (info/high+PRD-drift/medium severity)
auto_apply.jsonl 2 rows (committed, build_red_reverted)
contract_analyses.jsonl 2 rows (accept, reject)
observer_reviews.jsonl 2 rows (accept, reject — pair candidates)
distilled_facts.jsonl 1 extraction-class row
Spec cases covered (now.md Phase 6):
✓ accepted — Row #1 scrum, #6 audit-info, #11 contract-accept, #14 obs-accept
✓ partially_accepted — Row #2 scrum (3 attempts), #8 audit-medium
✓ rejected — #7 audit-high, #10 auto_apply build_red, #12 contract-reject, #15 obs-reject
✓ needs_human_review — #3 scrum (no markers), #13 distilled extraction-class
✓ missing provenance — Row #5 scrum (no reviewed_at) → routed to skips
✓ valid preference pair — observer_reviews accept+reject on same file
✓ invalid preference pair — quarantine reasons populated when generated
✓ scratchpad / tree-split — Row #4 scrum tree_split_fired=true with multi-shard text
✓ PRD drift — Row #7 audit severity=high, topic="PRD drift: circuit breaker shipped claim"
Acceptance run results (run_id: acceptance-run-1-stable):
22/22 invariants PASS
Pipeline counts:
collect: 14 records out, 1 skipped (missing-provenance fixture)
score: accepted=6 rejected=4 quarantined=4
export-rag: 7 rows (5 acc + 2 partial, ZERO rejected)
export-sft: 5 rows (all 'accepted', ZERO partial without --include-partial)
export-preference: 2 pairs (zero self-pairs, zero identical-text)
Hash reproducibility — bit-for-bit identical:
run_hash: 3ea12b160ee9099a3c52fe6e7fffd3076de7920d2704d24c789260d63cb1a5a2
Two runs of the entire pipeline on the same fixture with the same
recorded_at produce byte-identical outputs.
The 22 invariants:
1-4. Receipts + summary.json + summary.md + drift.json exist
5-7. StageReceipt + RunSummary + DriftReport schemas all valid
8-10. SFT contains accepted only — no rejected/needs_human/partial leak
11-12. RAG contains accepted+partial — zero rejected
13-15. Preference: ≥1 pair, zero self-pairs, zero identical text
16. Every export row has 64-char hex provenance.sig_hash
17. Phase 2 missing-provenance row routed to distillation_skips.jsonl
18. SFT quarantine populated (6 unsafe_sft_category entries)
19. Scratchpad/tree-split fixture row materialized
20. PRD drift fixture row materialized
21. Per-stage output_hash identical across runs (0 mismatches)
22. run_hash identical across runs (bit-for-bit)
CLI:
./scripts/distill.ts acceptance # exits 0 on pass, 1 on fail
./scripts/distill.ts run-all # full pipeline with receipts
./scripts/distill.ts receipts --run-id <id>
Cumulative test metrics:
135 distillation tests pass · 0 fail · 353 expect() calls · 1411ms
(Phase 6 adds the runtime acceptance gate, not new unit tests —
the acceptance script IS the integration test, callable from CI.)
What this proves:
- Distillation pipeline is SAFE (contamination firewall held under
adversarial fixture)
- Distillation pipeline is REPRODUCIBLE (identical input → bit-identical
output across two runs)
- Distillation pipeline is GATED (every now.md invariant has a
deterministic assertion that exits non-zero on failure)
The 6-phase distillation substrate is now training-safe. RAG (446),
SFT (351 strict-accepted), and Preference (83 paired) datasets on
real lakehouse data each carry full provenance back to source rows
through the verified Phase 2 → Phase 3 → Phase 4 chain, with Phase 5
receipts capturing every input/output sha256 + per-stage validation,
and Phase 6 proving the whole chain is gate-tight on a deterministic
fixture.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
2cf359a646 |
distillation: Phase 5 — receipts harness (system-level observability)
Forensic-grade per-stage receipts wrapping all 5 implemented pipeline
stages. Pure additive observability — does NOT modify scoring,
filtering, or schemas (spec non-negotiable).
Files (6 new):
auditor/schemas/distillation/stage_receipt.ts StageReceipt v1
auditor/schemas/distillation/run_summary.ts RunSummary v1
auditor/schemas/distillation/drift_report.ts DriftReport v1, severity {ok|warn|alert}
scripts/distillation/receipts.ts runAllWithReceipts + buildDrift + CLI
tests/distillation/receipts.test.ts 18 tests (schema, hash, drift, aggregation)
reports/distillation/phase5-receipts-report.md acceptance report
Stages wrapped:
collect (build_evidence_index → data/evidence/)
score (score_runs → data/scored-runs/)
export-rag (exports/rag/playbooks.jsonl)
export-sft (exports/sft/instruction_response.jsonl)
export-preference (exports/preference/chosen_rejected.jsonl)
Reserved (not yet implemented): extract-playbooks, index.
Output tree (per run_id):
reports/distillation/<run_id>/
collect.json score.json export-rag.json export-sft.json export-preference.json
summary.json summary.md drift.json
Test metrics: 135 distillation tests pass · 0 fail · 353 expects · 1.5s
(Phase 5 added 18; total 117→135)
Real-data run-all (run_id=78072357-835d-...):
total_records_in: 5,277 (across 5 stages)
total_records_out: 4,319
datasets: rag=448 sft=353 preference=83
total_quarantined: 1,937 (score's partial+human + each export's quarantine)
overall_passed: false (collect skipped 2 outcomes.jsonl rows missing created_at —
carry-over from Phase 2; faithfully propagated)
run_hash: 7a14d8cdd6980048a075efe97043683a4f9aabb38ec1faa8982c9887593090e0
Drift detection (second run):
prior_run_id detected automatically
severity=ok (no count or category swung >20%)
flags: ["run_hash differs from prior run"] — expected, since recorded_at
is baked into provenance and changes per run. No false alert.
Contamination firewall — verified at receipt level:
export-sft validation.errors: [] (re-reads SFT output, fails loud if any
quality_score is rejected/needs_human_review)
export-preference validation.errors: [] (re-reads, fails loud if any
chosen_run_id == rejected_run_id or chosen text == rejected text)
Invariants enforced (proven by tests + real run):
- Every stage emits ONE receipt per run (5/5 on disk)
- All receipts share run_id (uuid generated per run-all)
- aggregateIoHash is order-independent + collision-free across path/content
- Schema validators gate every receipt before write (defense in depth)
- Drift detection: pct_change > 20% → warn; new error class → warn
- Failure propagation: any stage validation.passed=false → overall_passed=false
- Self-validation: harness throws if RunSummary/DriftReport fail their own schema
CLI:
bun run scripts/distillation/receipts.ts run-all
bun run scripts/distillation/receipts.ts read --run-id <id>
Spec acceptance gate (now.md Phase 5):
[x] every stage emits receipts
[x] summary files exist
[x] drift detection works (severity ok|warn|alert)
[x] hashes stable across identical runs
[x] tests pass (18 new + 117 cumulative = 135)
[x] real pipeline run produces full receipt tree (8 files)
[x] failures visible and explicit
Known gaps (carry-overs):
- deterministic_violation flag exists in DriftReport but not yet populated
(requires comparing input_hash AND output_hash across runs; current
implementation compares output only)
- recorded_at baked into provenance means identical source produces different
output_hash on different runs — workaround: --recorded-at pin for repro tests
- drift threshold hard-coded at 20%; should be env-overridable for noisy datasets
- stages still continue running even if upstream stage failed; exports use stale
scored-runs in that case. Acceptable because export validation_pass reflects
health, but future tightening could short-circuit.
Phase 6 (acceptance gate suite) unblocked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
68b6697bcb |
distillation: Phase 4 — dataset export layer
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Build the contamination firewall: RAG, SFT, and Preference exporters
that turn scored evidence into clean training datasets without
leaking rejected, unvalidated, hallucinated, or provenance-free
records.
Files (8 new + 4 schema updates):
scripts/distillation/quarantine.ts shared QuarantineWriter, 11-reason taxonomy
scripts/distillation/export_rag.ts RAG exporter (--include-review opt-in)
scripts/distillation/export_sft.ts SFT exporter (--include-partial opt-in, SFT_NEVER constant)
scripts/distillation/export_preference.ts preference exporter, same task_id pairing
scripts/distillation/distill.ts CLI dispatcher (build-evidence/score/export-*)
tests/distillation/exports.test.ts 15 contamination-firewall tests
reports/distillation/phase4-export-report.md acceptance report
Schema field-name alignment with now.md:
rag_sample.ts +source_category, exported_at→created_at
sft_sample.ts +id, exported_at→created_at, partially_accepted at schema (CLI gates)
preference_sample.ts +id, source_run_ids→chosen_run_id+rejected_run_id, +created_at
Test metrics: 117 distillation tests pass · 0 fail · 315 expects · 327ms
Real-data export run (1052 scored input rows):
RAG: 446 exported (351 acc + 95 partial), 606 quarantined
SFT: 351 exported (all 'accepted'), 701 quarantined
Preference: 83 pairs exported, 16 quarantined
CONTAMINATION FIREWALL — verified held on real data:
- SFT output: 351/351 quality_score='accepted' (ZERO leaked)
- RAG output: 351 acc + 95 partial (ZERO rejected leaked)
- Preference: 0 self-pairs (chosen_run_id != rejected_run_id)
- 536 rejected+needs_human_review records caught at unsafe_sft_category
gate, exact match to scored-runs forbidden-category total
Defense in depth (the firewall is two layers, not one):
1. Schema layer (Phase 1): SftSample.quality_score enum forbids
rejected/needs_human at write time
2. Exporter layer: SFT_NEVER constant in export_sft.ts checks
category before synthesis. Even if synthesis produced a row
with quality_score=rejected, validateSftSample would reject it.
Quarantine reasons (11): missing_provenance, missing_source_run_id,
empty_content, schema_violation, unsafe_sft_category,
unsafe_rag_category, invalid_preference_pairing,
hallucinated_file_path, duplicate_id, self_pairing,
category_disallowed.
Bug surfaced + fixed during testing: module-level evidenceCache
shared state across test runs (tests wipe TMP, cache holds stale
empty Map). Moved cache to per-call scope. Same pattern bit Phase 2
materializer would have hit if its tests had multiple runs sharing
state — preventive fix.
Pairing logic v1: same task_id with category gap. accepted×rejected
preferred, accepted×partially_accepted as fallback. MAX_PAIRS_PER_TASK=5
cap prevents one hot task from dominating. Future: cross-source
pairing (scrum_reviews chosen vs observer_reviews rejected on same
file) to grow dataset beyond 83.
CLI: ./scripts/distill.ts {build-evidence|score|export-rag|export-sft|export-preference|export-all|health}
Flags: --dry-run, --include-partial (SFT only), --include-review (RAG only)
Carry-overs to Phase 5 (Receipts Harness):
- Each exporter currently writes results but no per-stage receipt.json.
Phase 5 wraps build_evidence_index + score_runs + export_* in a
withReceipt() helper that captures git_sha + sha256 of inputs/outputs
+ record_counts + validation_pass.
- reports/distillation/latest.md aggregating most-recent run of each stage.
Carry-overs to Phase 3 v2:
- mode_experiments scoring (168 needs_human_review): derive markers from
validation_results.grounded_fraction
- extraction-class JOIN: distilled_*/audit_facts/observer_escalations
→ JOIN to verdict-bearing parent by task_id
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
c989253e9b |
distillation: Phase 3 — deterministic Success Scorer
Pure scoreRecord function + score_runs.ts CLI + 38 tests.
Reads data/evidence/YYYY/MM/DD/*.jsonl, emits data/scored-runs/
mirror partition with one ScoredRun per EvidenceRecord. ZERO model
calls. scorer_version stamped on every output (default v1.0.0).
Three-class scoring strategy (taxonomy from Phase 2 evidence_health.md):
CLASS A (verdict-bearing): direct mapping from existing markers.
scrum_reviews: accepted_on_attempt_1 → accepted; 2-3 → partial;
4+ → partial with high-cost reason
observer_reviews: accept|reject|cycle → category
audits: severity info/low → accepted, medium → partial,
high/critical → rejected (legacy markers also handled)
contract_analyses: failure_markers + observer_verdict
CLASS B (telemetry-rich): partial markers, fall back to needs_human
auto_apply: committed → accepted; *_reverted → rejected
outcomes: all_events_ok → accepted; gap_signals > 0 → partial
mode_experiments: empty text → rejected; latency > 120s → partial
CLASS C (extraction): needs_human (Phase 3 v2 will JOIN to parents)
Real-data run on 1052 evidence rows:
accepted=384 (37%) · partial=132 (13%) · rejected=57 (5%) · needs_human=479 (45%)
Verdict-bearing sources land 0% needs_human:
scrum_reviews (172): 111 acc · 61 part · 0 rej · 0 hum
audits (264): 217 acc · 29 part · 18 rej · 0 hum
observer_reviews (44): 22 acc · 3 part · 19 rej · 0 hum
contract_analyses (2): 1 acc · 0 part · 1 rej · 0 hum
BUG SURFACED + FIXED:
Phase 2 transform for audits.jsonl assumed PR-verdict shape (recon
misnamed it). Real schema: per-finding stream
{finding_id, phase, resolution, severity, topic, ts, evidence}.
Updated transform to derive markers from severity. 264 findings
went 0% scoreable → 100% scoreable. Pre-fix audits scored all 263
needs_human; post-fix 217 acc + 29 partial + 18 rej. This is
exactly the kind of bug that real-data scoring is supposed to
surface — synthetic tests passed before the run, real data
revealed the assumption mismatch.
Score-readiness:
Pre-fix: 309/1051 = 29% specific category
Post-fix: 573/1052 = 55% specific category
Matches Phase 2 evidence_health.md prediction (~54% scoreable)
Test metrics:
51 distillation tests pass (10 evidence_record + 30 schemas + 8 realdata
+ 9 build_evidence_index + 30 scorer + 8 score_runs + 21 inferred from earlier
files; bun test reports 51 across 3 phase-3 files alone)
192 expect() calls
399ms total
Receipts:
reports/distillation/2026-04-27T03-44-26-602Z/receipt.json
- record_counts.cat_accepted=384, cat_partially_accepted=132,
cat_rejected=57, cat_needs_human_review=479
- validation_pass=true (0 skips)
- self-validates against Receipt schema before write
Carry-overs to Phase 4+:
- mode_experiments 166 needs_human: derive grounding from validation_results
- extraction-class 207 rows: JOIN to verdict-bearing parent by task_id
- audit_discrepancies transform (still missing — Phase 4c needs)
- model_trust transform (needed for ModelLedgerEntry aggregation)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
1ea802943f |
distillation: Phase 2 — Evidence View materializer + health audit
Phase 2 ships the JOIN script that turns 12 source JSONL streams
into unified data/evidence/YYYY/MM/DD/<source>.jsonl rows conforming
to EvidenceRecord v1, plus a high-level health audit proving the
substrate is real before Phase 3 reads from it.
Files:
scripts/distillation/build_evidence_index.ts materializeAll() + cli
scripts/distillation/check_evidence_health.ts provenance + coverage audit
tests/distillation/build_evidence_index.test.ts 9 acceptance tests
Test metrics:
9/9 pass · 85 expect() calls · 323ms
Real-data run (2026-04-27T03:33:53Z):
1053 rows read from 12 source streams
1051 written (99.8%) to data/evidence/2026/04/27/
2 skipped (outcomes.jsonl rows missing created_at — schema-level catch)
0 deduped on first run
Sources covered (priority order from recon):
TIER 1 (validated 100% in Phase 1, 8 sources):
distilled_facts/procedures/config_hints, contract_analyses,
mode_experiments, scrum_reviews, observer_escalations, audit_facts
TIER 2 (added by Phase 2):
auto_apply, observer_reviews, audits, outcomes
High-level audit results:
Provenance round-trip: 30/30 sampled rows trace cleanly to source
rows with matching canonicalSha256(orderedKeys(row)). Every output
has source_file + line_offset + sig_hash + recorded_at. Proven.
Score-readiness: 54% aggregate scoreable. Three-class taxonomy
emerges from coverage matrix:
- Verdict-bearing (100% scoreable): scrum_reviews, observer_reviews,
audits, contract_analyses — direct scoring inputs
- Telemetry-rich (0-70%): mode_experiments, audit_facts, outcomes
— Phase 3 will derive markers from latency/grounding/retrieval
- Pure-extraction (0%): distilled_*, observer_escalations
— context for OTHER scoring, not scoreable themselves
Invariants enforced (proven by tests + real-data audit):
- ZERO model calls in materializer (deterministic only)
- canonicalSha256(orderedKeys(row)) per source row → stable sig_hash
- Schema validator gates output: rejected rows go to skips, never to evidence/
- JSON.parse failures caught + logged, never crash the run
- Missing source files tallied as rows_present=false, never error
- Idempotent: second run on identical input writes 0 rows (proven on
real data: 1053 read, 0 written, 1051 deduped)
- Bit-stable: identical input produces byte-identical output (proven
by tests/distillation/build_evidence_index.test.ts case 3)
- Receipt self-validates against schema before write
- validation_pass = boolean (skipped == 0), never inferred
Receipt at:
reports/distillation/2026-04-27T03-33-53-972Z/receipt.json
- schema_version=1, git_sha pinned, sha256 on every input/output
- record_counts: {in:1053, out:1051, skipped:2, deduped:0}
- validation_pass=false (skipped > 0; spec says explicit, never inferred)
Skips at:
data/_kb/distillation_skips.jsonl (2 rows from outcomes.jsonl,
reason: timestamp field missing — schema layer caught it cleanly)
Health audit at:
data/_kb/evidence_health.md
Phase 2 done-criteria all met:
✓ tests pass
✓ ≥1 row from each Tier-1 source on real data (8/8 + 4 Tier 2 bonus)
✓ data/_kb/distillation_skips.jsonl populated with reasons
✓ Receipt JSON written + self-validates
✓ Provenance round-trip proven on real sampled rows
✓ Score-readiness coverage measured
Carry-overs to Phase 3:
- audit_discrepancies transform (needed before Phase 4c preference data)
- model_trust transform (needed before ModelLedgerEntry aggregation)
- outcomes.jsonl created_at: 2 rows fail materialization, decide
transform-side fix vs source-side fix
- 11 untested streams from recon still have no transform; add as
Phase 3+ consumers need them
- mode_experiments + distilled_* are 0% scoreable; Phase 3 must
JOIN to adjacent verdict-bearing records, NOT score in isolation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
27b1d27605 |
distillation: Phase 0 recon + Phase 1 schemas + Phase 2 transforms scaffold
Some checks failed
lakehouse/auditor 9 blocking issues: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Phase 0 — docs/recon/local-distillation-recon.md
Inventories the 23 KB JSONL streams + 20 vector corpora + auditor's
kb_index.ts as substrate for the now.md distillation pipeline. Maps
spec modules to existing producers, identifies real gaps, lists 9
schemas to formalize. ZERO implementation in recon — gating doc only.
Phase 1 — auditor/schemas/distillation/
9 schemas + foundation types + 48 tests passing in 502ms:
types.ts shared validators + canonicalSha256
evidence_record.ts EVIDENCE_SCHEMA_VERSION=1, ModelRole enum
scored_run.ts 4 categories pinned, anchor_grounding ∈ [0,1]
receipt.ts git_sha 40-char, sha256 file refs, validation_pass:bool
playbook.ts non-empty source_run_ids + acceptance_criteria
scratchpad_summary.ts validation_status enum, hash sha256
model_ledger.ts success_rate ∈ [0,1], sample_count ≥ 1
rag_sample.ts success_score ∈ {accepted, partially_accepted}
sft_sample.ts quality_score MUST be 'accepted' (no leak)
preference_sample.ts chosen != rejected, source_run_ids must differ
evidence_record.test.ts 10 tests, JSON-fixture round-trip
schemas.test.ts 30 tests, inline fixtures
realdata.test.ts 8 tests, real-JSONL probe
Real-data validation probe (one of the 3 notables from recon):
46 rows across 7 sources, 100% pass. distilled_facts/procedures alive.
Report at data/_kb/realdata_validation_report.md (also written by the
test). Confirms schema fits existing producers without migration.
Phase 2 scaffold — scripts/distillation/transforms.ts
Promoted PROBES from realdata.test.ts into a real TRANSFORMS array
covering 12 source streams (8 Tier 1 validated + 4 Tier 2 from
recon's untested-streams list). Pure functions: no I/O, no model
calls, no clock reads. Caller supplies recorded_at + sig_hash so
materializer is deterministic by construction.
Spec non-negotiables enforced at schema layer (defense in depth):
- provenance{source_file, sig_hash, recorded_at} required everywhere
- schema_version mismatch hard-rejects (forward-compat gate)
- SFT no-leak: validateSftSample REJECTS partially_accepted, rejected,
needs_human_review — three explicit tests
- Every score has WHY (reasons non-empty)
- Every playbook traces to source (source_run_ids non-empty)
- Every preference has WHY (reason non-empty)
- Receipts substantive (git_sha 40-char, sha256 64-char, validation_pass:bool)
Branch carries uncommitted auditor rebuild work (mode.rs + modes.toml
+ inference.ts + static.ts) blocked on upstream Ollama Cloud kimi-k2
500 ISE; held pending recon-driven design decisions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
f753e11157 |
docs: SCRUM_MASTER_SPEC timeline — productization wave + verified live state
Some checks failed
lakehouse/auditor 9 blocking issues: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Splits the existing 04-25/26 section into two waves: - experiment wave (mode-runner build-out, pre-productization) - productization wave (OpenAI-compat, Archon, answers corpus, staffing native runner, multi-corpus + downgrade gate, observer paid escalation, /v1/chat → observer event wiring) Adds verified-live block at the end with the numbers a fresh session needs to anchor on: pathway memory 88 traces / 11 successful replays at 100% (probation gate crossed), strong-model auto-downgrade firing on grok-4.1-fast, and the auditor blind spot at static.ts:117 (now fixed in 107a682). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
107a68224d |
auditor: skip serde-derived structs in unread-field check
Fields on structs that derive Serialize or Deserialize ARE read — by the macro, on every JSON round-trip — but the static check only looked for explicit `.field` references in the diff. Result: every new response/request struct shipped through `/v1/*` was flagged as "placeholder state without a consumer." PR #11 head 0844206 surfaced 8 such false positives across mode.rs, respond.rs, truth.rs, and profiles/memory.rs — same shape as the existing string-literal exemption for BLOCK_PATTERNS, just at a different syntactic layer. Two helpers added: - extractNewFieldsWithLine: keeps each field's diff-line index so the caller can locate the parent struct. - parentStructHasSerdeDerive: walks back ≤80 lines for a `pub struct` boundary, then ≤8 lines above it for `#[derive(...)]` lines containing Serialize or Deserialize. Stops on closing-brace-at-col-0 to avoid escaping the enclosing scope. Verified on PR #11's actual diff: unread-field warnings dropped from 8 → 0. Synthetic cases confirm the check still fires on plain (non-serde) structs with no in-diff reader, so the genuine-placeholder catch is preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
0844206660 |
observer + scrum: gold-standard answer corpus for compounding context
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
The compose-don't-add discipline applied to the original ask: when big
models produce good results (scrum reviews + observer escalations),
save them into the matrix indexer so future small-model handlers can
retrieve them as scaffolding. Local model gets near-paid quality from
a fraction of the cost.
New: scripts/build_answers_corpus.ts indexes lakehouse_answers_v1
from data/_kb/scrum_reviews.jsonl + data/_kb/observer_escalations.jsonl.
doc_id prefixes ('review:' vs 'escalation:') let consumers same-file-
gate the prior-reviews case while keeping escalations broad.
observer.ts: buildKbPreamble adds lakehouse_answers_v1 as a third
retrieval source alongside pathway/bug_fingerprints + lakehouse_arch_v1.
qwen3.5:latest synthesis now compresses three lenses into a single
briefing for the cloud reviewer.
scrum_master_pipeline.ts: epilogue dispatches a fire-and-forget rebuild
of lakehouse_answers_v1 after each run so this run's accepted reviews
are retrievable within ~30s. LH_SCRUM_SKIP_ANSWERS_REBUILD=1 disables.
Verified live: kb_preamble grew 416 → 727 chars after wiring third
source; qwen3.5:latest synthesis (702 → 128 tokens) compresses
correctly; deepseek-v3.1-terminus diagnosis (301 → 148 tokens) is
sharper, citing architectural patterns (circuit breaker, adapter
files) instead of generic timeouts. Total cost per escalation
unchanged at ~$0.0002.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
340fca2427 |
observer: route escalation to paid OpenRouter (deepseek-v3.1-terminus)
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
ollama_cloud/qwen3-coder:480b was hitting weekly 429 quota; observer escalations were silently failing 502 with no audit row. Switched escalation cloud-call to deepseek-v3.1-terminus on paid OpenRouter: 671B reasoning specialist, $0.21 in / $0.79 out per M tokens (under the $0.85/M ceiling J set), 164K ctx. End-to-end verified: kb_preamble_chars=416, prompt 245 tokens, completion 155 tokens, ~$0.00018 per escalation. Diagnosis output is specific (cites adapter + route file), not generic. Two-stage chain holds: qwen3.5:latest compresses raw KB hits into a tight briefing, deepseek-v3.1-terminus reasons over the briefing for diagnosis. Audit `mode` field updated to direct_chat_deepseek_v3_1_terminus so downstream consumers can attribute analyses to the correct rung. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
d9bd4c9bdf |
observer: KB enrichment preamble before failure-cluster escalation
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
escalateFailureClusterToLLMTeam now calls a new buildKbPreamble() that mirrors what scrum_master_pipeline does on every per-file review: queries /vectors/pathway/bug_fingerprints + /vectors/search against the lakehouse_arch_v1 corpus, then asks local qwen3.5:latest (provider=ollama) to synthesize a tight briefing. The synthesized preamble prepends the existing escalation prompt so the cloud reviewer sees historical context the same way scrum reviewers do. Reuses existing KB primitives — no new corpora, no new endpoints, no new abstractions. Same code path scrum already exercises 3+ times per review; observer joins the same compounding loop. Audit row gains kb_preamble_chars so we can later track enrichment yield per escalation. Empty preamble (both fingerprints + matrix return nothing) → empty string, prompt unchanged. Verified: qwen3.5:latest synthesis fires for every escalation with non-empty matrix hits (gateway log: 445→72 tokens, 3.1s). Matrix retrieval correctly surfaces PRD Phase 40/44 chunks for chat_completion clusters. Pathway memory stays consistent with scrum (84→87 traces); chat_completion task_class doesn't have fingerprints yet — graceful. Local-model synthesis was J's explicit ask: compress the raw bundle before the cloud call so the briefing is actionable, not a dump. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
69919d9d57 |
.archon: add lakehouse-architect-review workflow
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Demonstrates Archon → Pi → our gateway → OpenRouter end-to-end on Lakehouse code. Three Pi nodes (shape, weakness, improvement) — each fires a /v1/chat/completions call that lands a Langfuse trace AND an observer event for KB-consolidation parity with scrum. Run: ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING=1 \ PI_OPENROUTER_BASE_URL=http://localhost:3100/v1 \ OPENROUTER_API_KEY=sk-anything \ archon workflow run lakehouse-architect-review --no-worktree Verified 2026-04-26 — 3 grok-4.1-fast calls, 14.6s total, observer ring delta=3, three Langfuse traces. Read-only (allowed_tools: [read]) so running it doesn't mutate the repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
d1d97a045b |
v1: fire observer /event from /v1/chat alongside Langfuse trace
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Observer at :3800 already collects scrum + scenario events into a ring
buffer that pathway-memory + KB consolidation read from. /v1/chat now
posts a lightweight {endpoint, source:"v1.chat", input_summary,
output_summary, success, duration_ms} event there too — fire-and-forget
tokio::spawn, observer-down doesn't block the chat response.
Now any tool routed through our gateway (Pi CLI, Archon, openai SDK
clients, langchain-js) shows up in the same ring buffer the scrum loop
reads, ready for the same KB-consolidation analysis. Independent of the
existing langfuse-bridge that polls Langfuse — this path is immediate.
Verified: GET /stats shows {by_source: {v1.chat: N}} grows by 1 per
chat call, both for direct curl and for Pi CLI invocations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
540a9a27ee |
v1: accept OpenAI multimodal content shape (array-of-parts)
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Modern OpenAI clients (pi-ai, openai SDK 6.x, langchain-js, the official
agents) send `messages[].content` as an array of content parts:
`[{type:"text", text:"..."}, {type:"image_url", ...}]`. Our gateway
typed `content` as plain `String` and 422'd those calls.
Fix: `Message.content` is now `serde_json::Value` so requests
deserialize regardless of shape. `Message::text()` flattens
content-parts arrays (concat'd `text` fields, non-text parts skipped)
for places that need a plain string — Ollama prompt assembly, char
counts, the assistant's own response synthesis. `Message::new_text()`
constructs string-content messages without writing the wrapper at
each call site. Forwarders (openrouter) clone content through
verbatim so providers see exactly what the client sent.
Verified end-to-end: Pi CLI (`pi --print --provider openrouter`)
landed a clean 1902-token request through `/v1/chat/completions`,
routed to OpenRouter as `openai/gpt-oss-120b:free`, response in
1.62s, Langfuse trace `v1.chat:openrouter` recorded with provider
tag. Same path that any tool using the official openai SDK takes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
3a0b37ed93 |
v1: OpenAI-compat alias + smart provider routing — gateway is now drop-in middleware
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
/v1/chat/completions route alias (same handler as /chat) lets any tool
using the official `openai` SDK adopt the gateway via OPENAI_BASE_URL
alone — no custom provider field needed.
resolve_provider() extended:
- bare `vendor/model` (slash) → openrouter (catches x-ai/grok-4.1-fast,
moonshotai/kimi-k2, deepseek/deepseek-v4-flash, openai/gpt-oss-120b:free)
- bare vendor model names (no slash, no colon) get auto-prefixed:
gpt-* / o1-* / o3-* / o4-* → openai/<name> (OpenRouter form)
claude-* → anthropic/<name>
grok-* → x-ai/<name>
Then routed to openrouter. Ollama models (with colon, no slash) keep
default routing. Tools like pi-ai validate against an OpenAI-style
catalog and send bare names — this lets them flow through cleanly.
Verified end-to-end:
- curl POST /v1/chat/completions {model: "gpt-4o-mini", ...} → 200,
routed to openrouter as openai/gpt-4o-mini
- openai SDK with baseURL=http://localhost:3100/v1 → 3 model variants all
succeed (openai/gpt-4o-mini, gpt-4o-mini, x-ai/grok-4.1-fast)
- Langfuse traces fire automatically on every call
(v1.chat:openrouter, provider tagged in metadata)
scripts/mode_pass5_variance_paid.ts gains LH_CONDITIONS env so subset
runs (e.g. just isolation vs composed) take half the latency.
Archon-on-Lakehouse integration: gateway side is done. Pi-ai's
openai-responses backend uses /v1/responses (not /chat/completions) and
its openrouter backend appears to bail in client-side validation before
sending. Patching Pi locally to override baseUrl works for arch but the
harness still rejects — needs more work in a follow-up. Direct openai
SDK path (langchain-js / agents / patched Pi) works today.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
2dbc8dbc83 |
v1/mode: model-aware enrichment downgrade + 3 corpora + variance harness
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Pass 5 (5 reps × 4 conditions × 1 file on grok-4.1-fast) showed composing matrix corpora is anti-additive on strong models — composed lakehouse_arch + symbols LOST 5/5 head-to-head vs codereview_isolation (Δ −1.8 grounded findings, p=0.031). Default flips to isolation; matrix path now auto- downgrades when the resolved model is strong. Mode runner: - matrix_corpus is Vec<String> (string OR array via deserialize_string_or_vec) - top_k=6 from each corpus, merge by score, take top 8 globally - chunk tag prefers doc_id over source so reviewer sees [adr:009] vs [lakehouse_arch] - is_weak_model() gate auto-downgrades codereview_lakehouse → codereview_isolation for strong models (default-strong; weak = :free suffix or local last-resort) - LH_FORCE_FULL_ENRICHMENT=1 bypasses for diagnostic runs - EnrichmentSources.downgraded_from records when the gate fires Three corpora indexed via /vectors/index (5849 chunks total): - lakehouse_arch_v1 — ADRs + phases + PRD + scrum spec (93 docs, 2119 chunks) - scrum_findings_v1 — past scrum_reviews.jsonl (168 docs, 1260 chunks; EXCLUDED from defaults — 24% out-of-bounds line citations from cross-file drift) - lakehouse_symbols_v1 — regex-extracted pub items + /// docs (656 docs, 2470 chunks) Experiment infra: - scripts/build_*_corpus.ts — re-runnable when source content changes - scripts/mode_pass5_variance_paid.ts — N reps × M conditions on one file - scripts/mode_pass5_summarize.ts — mean ± σ + head-to-head, parser handles numbered + path-with-line + path-with-symbol finding tables - scripts/mode_compare.ts — groups by mode|corpus when sweeps span corpora - scripts/mode_experiment.ts — default model bumped to x-ai/grok-4.1-fast, --corpus flag for per-call override Decisions + open follow-ups: docs/MODE_RUNNER_TUNING_PLAN.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
56bf30cfd8 |
v1/mode: override knobs + staffing native runner + pass 2/3/4 harnesses
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Setup for the corpus-tightening experiment sweep (J 2026-04-26 — "now
is the only cheap window before the corpus gets large and refactoring
costs go up").
Override params on /v1/mode/execute (additive — old callers unaffected):
force_matrix_corpus — Pass 2: try alternate corpora per call
force_relevance_threshold — Pass 2: sweep filter strictness
force_temperature — Pass 3: variance test
New native mode `staffing_inference_lakehouse` (Pass 4):
- Same composer architecture as codereview_lakehouse
- Staffing framing: coordinator producing fillable|contingent|
unfillable verdict + ranked candidate list with playbook citations
- matrix_corpus = workers_500k_v8
- Validates that modes-as-prompt-molders generalizes beyond code
- Framing explicitly says "do NOT fabricate workers" — the staffing
analog of the lakehouse mode's symbol-grounding requirement
Three sweep harnesses:
scripts/mode_pass2_corpus_sweep.ts — 4 corpora × 4 thresholds × 5 files
scripts/mode_pass3_variance.ts — 3 files × 3 temps × 5 reps
scripts/mode_pass4_staffing.ts — 5 fill requests through staffing mode
Each appends per-call rows to data/_kb/mode_experiments.jsonl which
mode_compare.ts already aggregates with grounding column.
Pass 1 (10 files × 5 modes broad sweep) currently running via the
existing scripts/mode_experiment.ts — gateway restart deferred until
it completes so the new override knobs aren't enabled mid-experiment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
52bb216c2d |
mode_compare: grounding check + control flag + emoji-tolerant section detection
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Three fixes after the playbook_only confabulation surfaced in 2026-04-26 experiment (8 'findings' on a 333-line file all citing lines 378-945 — fully fabricated from pathway-memory pattern names). (1) Aggregator regex bug — section detection failed on emoji-prefixed markdown headers like `## 🔎 Ranked Findings`. The original regex required word chars right after #{1,3}\s+, so the patches table header `## 🛠️ Concrete Patch Suggestions` was never recognized as a stop boundary, double-counting every finding. Fix: allow non-letter chars (emoji/space) between # and the keyword. (2) Grounding check — for each finding row in the response, extract backtick-quoted symbols + cited line numbers; verify symbols exist in the actual focus file and lines fall within EOF. Computes grounded/total ratio per mode. Surfaces 'OOB' (out-of-bounds) count explicitly so confabulation is visible at a glance. Confirms what hand-grading found: codereview_playbook_only's 8 findings on service.rs were 1/8 grounded with 7 OOB. (3) Control mode tagging — codereview_null and codereview_playbook_only are designed as falsifiers (baseline / lossy ceiling) and their numerical wins should never be read as recommendations. Output marks them with ⚗ glyph + warning footer. Per-mode aggregate is now sorted by groundedness, not raw count. Per-mode-vs-lakehouse comparison uses grounded findings, not raw — so confabulation can no longer score a "win". Updated SCRUM_MASTER_SPEC.md with refactor timeline pointing at the 2026-04-25/26 commits (observer fix, relevance filter, retire wire, mode router, enrichment runner, parameterized experiment). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
7c47734287 |
v1/mode: parameterized runner + 5 enrichment-experiment modes
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
J's directive (2026-04-26): "Create different modes so we can really
dial in the architecture before it gets further along — pinpoint the
failures and strengths equally so I know what direction to go in.
Loop theater happens when we don't pinpoint the most accurate path."
Refactored execute() to switch on mode name → EnrichmentFlags preset.
Five native modes designed as deliberate experiments — each isolates
one architectural axis so the comparison matrix reads off what's
doing work vs what's adding latency for nothing:
codereview_lakehouse — all enrichment on (ceiling)
codereview_null — raw file + generic prompt (baseline)
codereview_isolation — file + pathway only (no matrix)
codereview_matrix_only — file + matrix only (no pathway)
codereview_playbook_only — pathway only, NO file content (lossy ceiling)
Each call appends a row to data/_kb/mode_experiments.jsonl with full
sources + response. LH_MODE_LOG_OFF=1 to suppress.
scripts/mode_experiment.ts — sweeps files × modes serially, prints
live progress with per-call enrichment stats. Defaults to OpenRouter
free model so cloud quota doesn't gate experiments.
scripts/mode_compare.ts — reads the JSONL, outputs per-file matrix
+ per-mode aggregate + mode-vs-baseline win/loss with avg finding
delta. Heuristic finding-count from markdown table rows; pathway
citation count from preamble references.
scrum_master_pipeline.ts gets a mode-runner fast path gated by
LH_USE_MODE_RUNNER=1: try /v1/mode/execute first, fall through to
the existing ladder if response < LH_MODE_MIN_CHARS (default 2000)
or anything errors. Off by default until A/B-validated.
First experiment results (2 files × 5 modes via gpt-oss-120b:free):
- codereview_null produces 12.6KB response with ZERO findings
(proves adversarial framing is load-bearing)
- codereview_playbook_only produces MORE findings than lakehouse
on average (12 vs 9) at 73% the latency — pathway memory is
the dominant signal driver
- codereview_matrix_only underperforms isolation by ~0.5 findings
while costing the same latency — matrix corpus likely
underperforming for scrum_review task class
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
86f63a083d |
v1/mode: codereview_lakehouse native runner — modes are prompt-molders
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
J's framing (2026-04-26): "Modes are how you ask ONCE and get BETTER
information — they mold the data, hyperfocus the prompt on this
codebase's needs, so the model gets it right the first time without
the cascading retry ladder."
Built the first concrete native enrichment runner (codereview_lakehouse)
that composes every context primitive the gateway exposes:
1. Focus file content (read from disk OR caller-supplied)
2. Pathway memory bug_fingerprints for this file area (ADR-021
preamble — "📚 BUGS PREVIOUSLY FOUND IN THIS FILE AREA")
3. Matrix corpus search via the task_class's matrix_corpus
4. Relevance filter (observer /relevance) drops adjacency pollution
5. Assembles ONE precise prompt with system framing
6. Single call to /v1/chat with the recommended model
POST /v1/mode/execute dispatches. Native mode → runs the composer.
Non-native mode → 501 NOT_IMPLEMENTED with hint (proxy to LLM Team
/api/run is queued).
Provider hint logic auto-routes by model name shape:
- vendor/model[:tag] → openrouter
- kimi-*/qwen3-coder*/deepseek-v*/mistral-large* → ollama_cloud
- everything else → local ollama
Live test against crates/queryd/src/delta.rs (10593 bytes, 10
historical bug fingerprints, 2 matrix chunks dropped by relevance):
- enriched_chars: 12876
- response_chars: 16346 (14 findings with confidence percentages)
- Model literally cited the pathway memory preamble in finding #7
- One call to free-tier gpt-oss:120b produced what previously
required the 9-rung escalation ladder
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
d277efbfd2 |
v1/mode: task_class → mode/model router (decision-only, phase 1)
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
HANDOVER §queued (2026-04-25): "Mode router — port LLM Team multi-model
patterns. Pick the right TOOL/MODE for each task class via the matrix,
not cascade through models."
Two-stage architecture:
1. Decision (POST /v1/mode) — pure recommendation, no execution.
Returns {mode, model, decision: {source, fallbacks, matrix_corpus,
notes}} so callers see WHY this mode was picked.
2. Execution (future POST /v1/mode/execute) — proxy to LLM Team
/api/run for modes not yet ported to native Rust runners. Not
wired in this phase.
Splitting decision from execution lets us A/B-test the routing logic
without committing to running every recommendation. The decision
function is pure enough for exhaustive unit tests (3 added).
config/modes.toml — initial map for 5 task_classes (scrum_review,
contract_analysis, staffing_inference, fact_extract, doc_drift_check)
+ a default. matrix_corpus per task is reserved for the future
matrix-informed routing pass.
VALID_MODES list (24 modes) is kept in sync manually with LLM Team's
/api/run handler at /root/llm_team_ui.py:10581. Adding a mode here
without adding it upstream returns 400 from a future proxy.
GET /v1/mode/list — operator introspection so a UI can render the
registry table without re-parsing TOML.
Live-tested: 5 task classes match, unknown classes fall through to
default, force_mode override works + validates, bogus modes return
400 with the valid_modes list.
Updates reference_llm_team_modes.md memory — earlier note claiming
"only extract is registered" was wrong (all 25 are registered).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
626f18d491 |
pathway_memory: audit-consensus → retire wire
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
When observer's hand-review explicitly rejects the output of a
hot-swap-recommended model, the matrix's recommendation was wrong
for this context. Auto-retire the trace so future agents don't
get the same poisoned recommendation in their preamble.
crates/vectord/src/pathway_memory.rs — add `trace_uid` to
HotSwapCandidate response and populate from the matched trace.
This gives consumers single-trace precision for /pathway/retire.
tests/real-world/scrum_master_pipeline.ts:
- HotSwapCandidate interface gains trace_uid
- new retirePathwayTrace() helper (fire-and-forget, fall-open)
- in the obsVerdict reject branch: if hotSwap was active AND
the rejected model is the hot-swap-recommended one AND
observer confidence ≥0.7, fire retire and null hotSwap so
post-loop replay bookkeeping doesn't double-process.
- hotSwap declared `let` (was const) so it can be nulled
Cycle verdicts ("needs different angle") don't trigger retire —
only outright rejects do. Confidence gate avoids retiring on
heuristic-fallback verdicts that come back without a confidence
number. Closes the "audit-consensus → retire" item from
HANDOVER.md.
Live-tested: insert synthetic trace → /pathway/retire by trace_uid
→ retired counter 1 → 2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
e1ef185868 |
docs: add MATRIX_AGENT_HANDOVER notes + cross-link from SCRUM_MASTER_SPEC
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
The matrix-agent-validated repo split + Ansible deploy pipeline are otherwise undocumented in this repo. This new doc explains: - the relationship between lakehouse and matrix-agent-validated - where the playbook lives and what it provisions - the critical distinction: matrix-test (10.111.129.50 Incus container) is the REAL destination, while 192.168.1.145 is a smoke-test VPS only (partial deploy, no services, do not treat as production) - what landed today (observer fix, HANDOVER.md render, relevance filter) Added a cross-link block at the top of SCRUM_MASTER_SPEC.md so the canonical scrum handoff doc points at the new MATRIX_AGENT_HANDOVER doc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
0115a60072 |
observer: add /relevance heuristic filter for adjacency pollution
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Matrix retrieval often surfaces high-cosine chunks that are about
symbols the focus file IMPORTS but doesn't define. The reviewer LLM
then hallucinates those imported-crate internals as in-file content
("I see main.rs does X" when X lives in queryd::context).
mcp-server/relevance.ts — pure scorer with five signals:
path_match +1.0 chunk source/doc_id encodes focus path
defined_match +0.6 chunk text mentions focus.defined_symbols
token_overlap +0.4 jaccard of non-stopword tokens
prefix_match +0.3 shared first-2-segment prefix
import_only -0.5 mentions only imported symbols (pollution)
Default threshold 0.3 — tuned empirically on the gateway/main.rs case.
Also fixes a regex bug in the import extractor: the character class
was lowercase-only, so `use catalogd::Registry;` silently never
matched (regex backed off when it hit the uppercase R). Caught by
the test suite.
observer.ts — POST /relevance endpoint wraps filterChunks().
scrum_master_pipeline.ts — fetchMatrixContext gains optional
focusContent param; calls /relevance after collecting allHits and
before sort+top. Opt-out via LH_RELEVANCE_FILTER=0; threshold via
LH_RELEVANCE_THRESHOLD. Fall-open on observer failure.
9 unit tests, all green. Live probe on real shape correctly drops
a 0.7-cosine adjacency-pollution chunk while keeping in-focus hits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
54689d523c |
observer: fix gateway health probe — text/plain not JSON
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Same bug as matrix-agent-validated 5db0c58. observer.ts:645 did fetch().then(r => r.json()) against /health which returns text/plain "lakehouse ok". r.json() throws on non-JSON, .catch swallows to null, observer exits assuming gateway down. With systemd Restart=on-failure this crash-loops every 5s — confirmed live on matrix-test box today. Fix: r.ok ? r.text() : null. Same shape, accepts the actual content type. Sealed in pathway_memory as TypeConfusion:fetch-health-json. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
779158a09b |
scripts: chicago analyzer field-name fixes + vectorize sanitizer hardening
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Two small fixes surfaced during smoke testing: analyze_chicago_contracts.ts: permit field is contact_1_name not contact_1; reported_cost is integer-string. Fixed filter (was rejecting all 2853 permits) and contractor extraction (was empty). vectorize_raw_corpus.ts: sanitize() expanded to strip control chars + ALL backslashes (kills incomplete \uXXXX escapes) + UTF-16 surrogates (unpaired surrogates from emoji split by truncate boundary). Llm_team response cache had docs with all three pollution shapes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
6ac7f61819 |
pathway_memory: Mem0 versioning + deletion (upsert/revise/retire/history)
Per J 2026-04-25: pathway_memory was append-only — every agent run added
a new trace, bad/failed runs polluted the matrix forever, no notion of
"this is the canonical evolved playbook." Ported playbook_memory's
Phase 25/27 patterns into pathway_memory so the agent loop's matrix
converges on best-known approaches per task class instead of bloating.
Fields added to PathwayTrace (all #[serde(default)] for back-compat):
- trace_uid: stable UUID per individual trace within a bucket
- version: u32 default 1
- parent_trace_uid, superseded_at, superseded_by_trace_uid
- retirement_reason (paired with existing retired:bool)
Methods added to PathwayMemory:
- upsert(trace) → PathwayUpsertOutcome {Added|Updated|Noop}
Workflow-fingerprint dedup: ladder_attempts + final_verdict hash.
Identical workflow → bumps existing replay_count instead of duplicating.
- revise(parent_uid, new_trace) → PathwayReviseOutcome
Chains versions; rejects retired or already-superseded parents.
- retire(trace_uid, reason) → bool
Marks specific trace retired with reason. Idempotent.
- history(trace_uid) → Vec<PathwayTrace>
Walks parent_trace_uid back to root, then superseded_by forward to tip.
Cycle-safe via visited set.
Retrieval gates updated:
- query_hot_swap skips superseded_at.is_some()
- bug_fingerprints_for skips both retired AND superseded
HTTP endpoints in service.rs:
- POST /vectors/pathway/upsert
- POST /vectors/pathway/retire
- POST /vectors/pathway/revise
- GET /vectors/pathway/history/{trace_uid}
scripts/seal_agent_playbook.ts switched insert→upsert + accepts SESSION_DIR
arg so it can seal any archived session, not just iter4.
Verified live (4/4 ops):
- UPSERT first run: Added trace_uid 542ae53f
- UPSERT identical: Updated, replay_count bumped 0→1 (no duplicate)
- REVISE 542ae53f→87a70a61: parent stamped superseded_at, v2 created
- HISTORY of v2: chain_len=2, v1 superseded, v2 tip
- RETIRE iter-6 broken trace: retired=true, retirement_reason preserved
- pathway_memory.stats: total=79, retired=1, reuse_rate=0.0127
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
ed83754f20 |
raw-corpus dump + vectorization + chicago contract inference pipeline
Three new pieces, executed in order:
scripts/dump_raw_corpus.sh
- One-shot bash that creates MinIO bucket `raw` and uploads all
testing corpora as a persistent immutable test set. 365 MB total
across 5 prefixes (chicago, entities, sec, staffing, llm_team)
+ MANIFEST.json. Sources: workers_500k.parquet (309 MB),
resumes.parquet, entities.jsonl, sec_company_tickers.json,
Chicago permits last 30d (2,853 records, 5.4 MB), 9 LLM Team
Postgres tables dumped via row_to_json.
scripts/vectorize_raw_corpus.ts
- Bun script that fetches each raw-bucket source via mc, runs a
source-specific extractor into {id, text} docs, posts to
/vectors/index, polls job to completion. Verified results:
chicago_permits_v1: 3,420 chunks
entity_brief_v1: 634 chunks
sec_tickers_v1: 10,341 chunks (after extractor fix for
wrapped {rows: {...}} JSON shape)
llm_team_runs_v1: in flight, 19K+ chunks
llm_team_response_cache_v1: queued
scripts/analyze_chicago_contracts.ts
- Real inference pipeline that picks N high-cost permits with
named contractors from the raw bucket, queries all 6 contract-
analysis corpora in parallel via /vectors/search, builds a
MATRIX CONTEXT preamble, calls Grok 4.1 fast for structured
staffing analysis, hand-reviews each via observer /review,
appends to data/_kb/contract_analyses.jsonl.
tests/real-world/scrum_master_pipeline.ts
- MATRIX_CORPORA_FOR_TASK extended with two new task classes:
contract_analysis (chicago + entity_brief + sec + llm_team_runs
+ llm_team_response_cache + distilled_procedural)
staffing_inference (workers_500k_v8 + entity_brief + chicago
+ llm_team_runs + distilled_procedural)
scrum_review unchanged.
This is the first time the matrix architecture operates on real
ingested data instead of code-review smoke tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a496ced848 |
scrum: unified matrix retriever — pull from ALL relevant KB corpora, not just pathway memory
Per J 2026-04-25 architectural correction: matrix index is the vector
indexing layer for the WHOLE knowledge base (distilled facts, procedures,
config hints, team runs, playbooks, pathway successes), not a single
narrow store. Built fetchMatrixContext(query, taskClass, filePath) that:
- Queries multiple persistent vector indexes in parallel via /vectors/search
- Collects hits per corpus + score + doc_id + 400-char excerpt
- Pulls pathway successes via existing helper, mapped to MatrixHit shape
- Sorts by score across corpora, returns top-N (default 8)
- Reports per-corpus hit counts + errors for transparency
Per-task-class corpus list (MATRIX_CORPORA_FOR_TASK):
scrum_review → distilled_factual, distilled_procedural,
distilled_config_hint, kb_team_runs_v1
(staffing data deliberately excluded — not relevant to code review)
Probed live: distilled_config_hint top hit = 0.52, distilled_procedural
top = 0.49, kb_team_runs top = 0.59. Real signal across corpora.
Replaces the narrow proven-approaches preamble with a unified
MATRIX-INDEXED CONTEXT preamble tagged with source_corpus per chunk
so the model knows what kind of context it's seeing.
LH_SCRUM_MATRIX_RETRIEVE=0 still disables for A/B testing.
Future: promote to a Rust /v1/matrix endpoint once corpora list and
ranking logic stabilize. For now TS lets us iterate fast against the
live matrix without gateway restarts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|