lakehouse

profit/lakehouse

Commit Graph

Author	SHA1	Message	Date

profit

181c35b829

scrum_master fact extraction + verifier gate + schema_version bump

Three bundled changes that round out the KB enrichment pipeline
(PR #9 commits B/C/D compressed into one — they all touch the same
persist surfaces so splitting them would just add noise):

B. scrum_master reviews now route accepted review bodies through
   fact_extractor (same llm_team extract pipeline as inference) and
   append to data/_kb/audit_facts.jsonl tagged source:"scrum_review".
   One KB, two producers — downstream consumers can filter by source
   when they care about provenance. Skips reviews <120 chars
   (one-liners / LGTM-type comments with no extractable knowledge).

C. Verifier-gated fact persistence. fact_extractor now parses the
   verifier's free-form prose into per-fact verdicts (CORRECT /
   INCORRECT / UNVERIFIABLE / UNCHECKED). Facts marked INCORRECT are
   dropped on write; CORRECT + UNVERIFIABLE + UNCHECKED are kept
   (dropping UNVERIFIABLE would lose ~90% of real signal — the
   verifier's prior-knowledge base doesn't know Lakehouse internals,
   so domain-specific facts read as UNVERIFIABLE by default).

   verifier_verdicts array is persisted alongside facts so downstream
   queries can surface high-confidence facts (CORRECT) separately
   from provisional ones (UNVERIFIABLE).

   schema_version:2 added to both scrum_reviews.jsonl and
   audit_facts.jsonl writes. Old (v1) rows remain readable; new rows
   get the field so the forward-compat reader in kb_query can
   differentiate.

D. scrum_master_reviewed:true flag added to scrum_reviews.jsonl
   rows on accept. Future kb_query surfacing can filter by this
   (e.g., "show me PRs where a scrum review exists vs only inference"
   as governance signal). Also carried into audit_facts.jsonl when
   the scrum_review source path writes there.
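
The gating rules in B through D can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the row shape, the helper names (`shouldExtractReview`, `gateFacts`, `buildScrumReviewRow`), and the choice to treat a fact with no matching verdict as UNCHECKED are all assumptions. Only the verdict names, the 120-character cutoff, and the `source`, `schema_version`, `scrum_master_reviewed`, and `verifier_verdicts` fields come from the message above.

```typescript
// Verdict names are taken from the commit message.
type Verdict = "CORRECT" | "INCORRECT" | "UNVERIFIABLE" | "UNCHECKED";

// Hypothetical row shape; the real audit_facts.jsonl row may carry more fields.
interface ScrumFactsRow {
  schema_version: 2;
  source: "scrum_review";
  scrum_master_reviewed: true;
  facts: string[];
  verifier_verdicts: Verdict[];
}

const MIN_REVIEW_CHARS = 120; // reviews shorter than this are skipped

// One-liners and LGTM-type comments carry no extractable knowledge.
function shouldExtractReview(reviewBody: string): boolean {
  return reviewBody.length >= MIN_REVIEW_CHARS;
}

// Drop facts marked INCORRECT; keep CORRECT, UNVERIFIABLE and UNCHECKED.
// Assumption: a fact without a matching verdict counts as UNCHECKED.
function gateFacts(
  facts: string[],
  verdicts: Verdict[],
): { facts: string[]; verifier_verdicts: Verdict[] } {
  const kept: string[] = [];
  const keptVerdicts: Verdict[] = [];
  facts.forEach((fact, i) => {
    const verdict: Verdict = verdicts[i] ?? "UNCHECKED";
    if (verdict === "INCORRECT") return;
    kept.push(fact);
    keptVerdicts.push(verdict);
  });
  return { facts: kept, verifier_verdicts: keptVerdicts };
}

// Build the row that would be appended for an accepted scrum review.
function buildScrumReviewRow(facts: string[], verdicts: Verdict[]): ScrumFactsRow {
  const gated = gateFacts(facts, verdicts);
  return {
    schema_version: 2,
    source: "scrum_review",
    scrum_master_reviewed: true,
    facts: gated.facts,
    verifier_verdicts: gated.verifier_verdicts,
  };
}
```

Keeping the verdict array aligned index-for-index with the surviving facts is one way to let a later query separate CORRECT facts from provisional ones without re-running the verifier.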

2026-04-22 23:40:21 -05:00

profit

77650c4ba3

auditor: inference curation layer + llm_team fact extraction → KB

Closes the cycle J asked for: curated cloud output lands structured
knowledge in the KB so future audits have architectural context, not
just a log of per-finding signatures.

Three pieces:

1. Inference curation (tree-split) — when diff > 30KB, shard at 4.5KB,
   summarize each shard via cloud (temp=0, think=false on small
   shards; think=true on main call). Merge into scratchpad. The cloud
   verification then runs against the scratchpad, not truncated raw.
   Eliminates the 40KB MAX_DIFF_CHARS truncation path for large PRs
   (PR #8 is 102KB, so 62KB was being lost). Anti-false-positive
   guard in the prompt: the cloud model is told that absence from the
   scratchpad is NOT absence from the diff, so it doesn't flag
   curated-out symbols as missing. The unflagged_gaps section is
   dropped entirely when curated (the scratchpad can't ground it).

2. fact_extractor — TS client for llm_team_ui's extract-facts mode at
   localhost:5000/api/run. Sends curated scratchpad through qwen2.5
   extractor + gemma2 verifier, parses SSE stream, returns structured
   {facts, entities, relationships, verification, llm_team_run_id}.
   Best-effort: if llm_team is down, extraction fails silently and
   the audit still completes. The call is awaited so CLI tools
   (audit_one.ts) don't exit before extraction lands; the systemd
   poller has 90s of headroom, so the extra ~15s doesn't matter.

3. audit_facts.jsonl + checkAuditFacts() — one row per curated audit
   with the extraction result. kb_query tails the jsonl, explodes
   entity rows, aggregates by entity name with distinct-PR counting,
   surfaces entities recurring in 2+ PRs as info findings. Filters
   out short names (<3 chars, extractor truncation artifacts) and
   generic types (string/number/etc.) so signal isn't drowned.
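
The aggregation described in piece 3 might look something like the sketch below. The row shape, the function name `recurringEntities`, and the exact contents of the generic-type list are assumptions made for illustration; the commit only states the rules (distinct-PR counting, a 2+ PR threshold, names under 3 characters and generic types filtered out).

```typescript
// Hypothetical, trimmed-down view of one audit_facts.jsonl row.
interface EntityRow {
  pr_number: number;
  entities: { name: string }[];
}

// Illustrative list; the commit only says "string/number/etc.".
const GENERIC_TYPES = new Set(["string", "number", "boolean", "object", "any"]);

// Count the distinct PRs each entity name appears in and return only
// the names that recur in at least `minPrs` PRs.
function recurringEntities(rows: EntityRow[], minPrs = 2): Map<string, number> {
  const prsByEntity = new Map<string, Set<number>>();
  for (const row of rows) {
    for (const entity of row.entities) {
      const name = entity.name;
      if (name.length < 3) continue; // likely extractor truncation artifacts
      if (GENERIC_TYPES.has(name.toLowerCase())) continue; // too generic to be signal
      const prs = prsByEntity.get(name) ?? new Set<number>();
      prs.add(row.pr_number); // a Set makes repeat audits of one PR count once
      prsByEntity.set(name, prs);
    }
  }
  const recurring = new Map<string, number>();
  prsByEntity.forEach((prs, name) => {
    if (prs.size >= minPrs) recurring.set(name, prs.size);
  });
  return recurring;
}
```

Counting PRs through a set rather than counting rows means that re-auditing the same PR several times does not inflate an entity's recurrence.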

Verified end-to-end on PR #8: 102KB diff → 23 shards → 1KB scratchpad
→ qwen2.5 extracted 4 facts + 6 entities + 6 relationships (real
code-level knowledge: AggregateOptions<T> type, aggregate<T> async
function with real signature, typed relationships). llm_team_run_id
cross-references to llm_team's own team_runs table.
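
The shard count reported above follows from the thresholds in piece 1, as the rough sketch below shows. It assumes 1KB = 1024 characters and plain fixed-size slicing; the actual splitter may cut on file or hunk boundaries. The names `needsCuration`, `shardDiff`, and `buildScratchpad`, and the `summarize` callback standing in for the per-shard cloud call, are hypothetical.

```typescript
const KB = 1024; // assumption: sizes in the commit are counted in 1024-char units
const CURATION_THRESHOLD = 30 * KB; // "when diff > 30KB"
const SHARD_SIZE = 4.5 * KB; // "shard at 4.5KB" = 4608 chars

// Curation only kicks in for diffs strictly larger than the threshold.
function needsCuration(diff: string): boolean {
  return diff.length > CURATION_THRESHOLD;
}

// Split the diff into consecutive fixed-size shards (the last may be shorter).
function shardDiff(diff: string, shardSize: number = SHARD_SIZE): string[] {
  const shards: string[] = [];
  for (let start = 0; start < diff.length; start += shardSize) {
    shards.push(diff.slice(start, start + shardSize));
  }
  return shards;
}

// Summarize every shard and merge the summaries into one scratchpad.
// `summarize` is a placeholder for the cloud call; shards are processed
// sequentially here for simplicity, which is an assumption.
async function buildScratchpad(
  diff: string,
  summarize: (shard: string) => Promise<string>,
): Promise<string> {
  if (!needsCuration(diff)) return diff; // small diffs go through unmodified
  const summaries: string[] = [];
  for (const shard of shardDiff(diff)) {
    summaries.push(await summarize(shard));
  }
  return summaries.join("\n");
}
```

Under these assumptions a 102KB diff is 104448 characters, and 104448 / 4608 rounds up to 23 shards, matching the count reported for PR #8.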

Also: audit.ts passes (pr_number, head_sha) as InferenceContext so
extracted facts are scope-tagged for the KB index.

2026-04-22 23:09:14 -05:00

2 Commits