3 Commits
56dbfb7d03
fact_extractor: project context + fixed verifier-verdict parser
Checks failed: lakehouse/auditor reported 8 warnings (see review).
Two bundled changes. Both came out of J's observation that the
verifier was defaulting to UNVERIFIABLE on domain-specific facts
because it had no idea what Lakehouse was, which project's code it
was reading, or what framework the types belonged to.
1. Project context preamble. Added docs/AUDITOR_CONTEXT.md, a concise
(under 400 words) description of the project (crates, services,
architecture phases, the auditor's own role). fact_extractor
reads it once, caches it, prepends it to the extract prompt as a
"PROJECT CONTEXT (for grounding; do NOT extract from this)"
section. Both extractor and verifier now see this context, so
statements like "aggregate<T> returns Map<string, AggregateRow>"
get grounded as "this is a TypeScript function in the Lakehouse
auditor subsystem" and the verifier can reason about plausibility
instead of guessing.
2. Verifier-verdict parser fix. Gemma2's output format varies between
"**Verdict:** CORRECT" and just "* **CORRECT**" inline (observed
variance across runs). The old regex required "Verdict:" as a
label and missed the second format — causing all verdicts to
stay UNCHECKED. Replaced with a two-pass approach: find each
fact section start ("**N.**" or "N."), slice to the next section,
scan the slice for the first CORRECT|INCORRECT|UNVERIFIABLE
token. Handles both formats plus unfenced fallback.
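The two-pass parse described in item 2 amounts to something like the
sketch below; `parseVerdicts` and `Verdict` are illustrative names,
not the real module API, and the actual regexes may differ:

```typescript
type Verdict = "CORRECT" | "INCORRECT" | "UNVERIFIABLE" | "UNCHECKED";

// Pass 1: locate each fact section start ("**N.**" or "N." at line
// start). Pass 2: slice up to the next section start and scan the
// slice for the first verdict token, so both "**Verdict:** CORRECT"
// and the bare "* **CORRECT**" inline form are caught.
function parseVerdicts(output: string, factCount: number): Verdict[] {
  const verdicts: Verdict[] = new Array(factCount).fill("UNCHECKED");
  const sectionRe = /^\s*(?:\*\*)?(\d+)\.(?:\*\*)?/gm;
  const starts: { n: number; idx: number }[] = [];
  let m: RegExpExecArray | null;
  while ((m = sectionRe.exec(output)) !== null) {
    starts.push({ n: Number(m[1]), idx: m.index });
  }
  for (let i = 0; i < starts.length; i++) {
    const { n, idx } = starts[i];
    if (n < 1 || n > factCount) continue; // ignore stray numbers
    const end = i + 1 < starts.length ? starts[i + 1].idx : output.length;
    const slice = output.slice(idx, end);
    const tok = slice.match(/\b(CORRECT|INCORRECT|UNVERIFIABLE)\b/);
    if (tok) verdicts[n - 1] = tok[1] as Verdict;
  }
  return verdicts;
}
```

Facts whose section never yields a token stay UNCHECKED rather than
being guessed, which matches the gate's keep-by-default behavior.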
Verified: 4-fact test extraction went from 0/4 verdicts scored
(pre-fix) to 2/4 CORRECT + 2/4 UNVERIFIABLE (post-fix). The 2
UNVERIFIABLE cases are domain-specific code behavior the verifier
legitimately can't confirm without reading source — correct stance,
not a parser miss.
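The context-preamble mechanism from item 1 can be sketched roughly as
follows; `loadProjectContext` and `buildExtractPrompt` are
illustrative names for this sketch, not the actual fact_extractor API:

```typescript
import { readFileSync } from "fs";

let cachedContext: string | null = null;

// Read docs/AUDITOR_CONTEXT.md once per process and cache it; a
// missing context file degrades gracefully to an empty preamble.
function loadProjectContext(path = "docs/AUDITOR_CONTEXT.md"): string {
  if (cachedContext === null) {
    try {
      cachedContext = readFileSync(path, "utf8").trim();
    } catch {
      cachedContext = "";
    }
  }
  return cachedContext;
}

// Prepend the context as a clearly labeled grounding section so the
// extractor knows not to mine facts from the preamble itself.
function buildExtractPrompt(scratchpad: string, context: string): string {
  if (!context) return scratchpad;
  return [
    "PROJECT CONTEXT (for grounding; do NOT extract from this):",
    context,
    "",
    "CONTENT TO EXTRACT FROM:",
    scratchpad,
  ].join("\n");
}
```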
No new consensus modes yet. J suggested adding codereview or
validator as a second pass; holding until we see whether context
injection alone gives sufficient signal lift.
181c35b829
scrum_master fact extraction + verifier gate + schema_version bump
Three bundled changes that round out the KB enrichment pipeline (PR #9
commits B/C/D compressed into one: they all touch the same persist
surfaces, so splitting them would just add noise).
B. scrum_master reviews now route accepted review bodies through
fact_extractor (the same llm_team extract pipeline used for
inference) and append to data/_kb/audit_facts.jsonl tagged
source:"scrum_review". One KB, two producers; downstream consumers
can filter by source when they care about provenance. Reviews under
120 chars are skipped (one-liners and LGTM-type comments with no
extractable knowledge).
C. Verifier-gated fact persistence. fact_extractor now parses the
verifier's free-form prose into per-fact verdicts (CORRECT /
INCORRECT / UNVERIFIABLE / UNCHECKED). Facts marked INCORRECT are
dropped on write; CORRECT, UNVERIFIABLE, and UNCHECKED are kept.
Dropping UNVERIFIABLE would lose ~90% of real signal: the verifier's
prior-knowledge base doesn't know Lakehouse internals, so
domain-specific facts read as UNVERIFIABLE by default. The
verifier_verdicts array is persisted alongside facts so downstream
queries can surface high-confidence facts (CORRECT) separately from
provisional ones (UNVERIFIABLE). schema_version:2 added to both
scrum_reviews.jsonl and audit_facts.jsonl writes; old (v1) rows
remain readable, and new rows get the field so the forward-compat
reader in kb_query can differentiate.
D. scrum_master_reviewed:true flag added to scrum_reviews.jsonl rows
on accept. Future kb_query surfacing can filter by this (e.g., "show
me PRs where a scrum review exists vs. only inference" as a
governance signal). Also carried into audit_facts.jsonl when the
scrum_review source path writes there.
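The verdict-gated write in item C can be sketched as below; the row
shape and the `gateFacts` helper are assumptions for illustration,
not the actual persist code:

```typescript
type Verdict = "CORRECT" | "INCORRECT" | "UNVERIFIABLE" | "UNCHECKED";

interface FactRow {
  schema_version: 2; // v2 rows carry the field; v1 rows stay readable
  source: "inference" | "scrum_review";
  facts: string[];
  verifier_verdicts: Verdict[];
}

// Drop INCORRECT facts on write; keep CORRECT, UNVERIFIABLE, and
// UNCHECKED, since dropping UNVERIFIABLE would discard most
// domain-specific facts. Verdicts are persisted alongside the facts.
function gateFacts(
  facts: string[],
  verdicts: Verdict[],
  source: FactRow["source"],
): FactRow {
  const kept: string[] = [];
  const keptVerdicts: Verdict[] = [];
  facts.forEach((f, i) => {
    const v = verdicts[i] ?? "UNCHECKED"; // missing verdict = UNCHECKED
    if (v === "INCORRECT") return;
    kept.push(f);
    keptVerdicts.push(v);
  });
  return {
    schema_version: 2,
    source,
    facts: kept,
    verifier_verdicts: keptVerdicts,
  };
}
```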
77650c4ba3
auditor: inference curation layer + llm_team fact extraction → KB
Closes the cycle J asked for: curated cloud output lands structured
knowledge in the KB so future audits have architectural context, not
just a log of per-finding signatures. Three pieces:
1. Inference curation (tree-split). When diff > 30KB, shard at 4.5KB
and summarize each shard via cloud (temp=0, think=false on small
shards; think=true on the main call), then merge into a scratchpad.
Cloud verification then runs against the scratchpad, not truncated
raw diff. This eliminates the 40KB MAX_DIFF_CHARS truncation path for
large PRs (PR #8 is 102KB; we were losing 62KB). Anti-false-positive
guard in the prompt: cloud is told scratchpad absence is NOT diff
absence, so it doesn't flag curated-out symbols as missing. The
unflagged_gaps section is dropped entirely when curated (the
scratchpad can't ground them).
2. fact_extractor: a TS client for llm_team_ui's extract-facts mode
at localhost:5000/api/run. Sends the curated scratchpad through the
qwen2.5 extractor + gemma2 verifier, parses the SSE stream, and
returns structured {facts, entities, relationships, verification,
llm_team_run_id}. Best-effort: if llm_team is down, extraction fails
silently and the audit still completes. AWAITED so CLI tools
(audit_one.ts) don't exit before extraction lands; the systemd poller
has 90s headroom, so the extra ~15s doesn't matter.
3. audit_facts.jsonl + checkAuditFacts(): one row per curated audit
with the extraction result. kb_query tails the jsonl, explodes entity
rows, aggregates by entity name with distinct-PR counting, and
surfaces entities recurring in 2+ PRs as info findings. Filters out
short names (<3 chars, extractor truncation artifacts) and generic
types (string/number/etc.) so signal isn't drowned.
Verified end-to-end on PR #8: 102KB diff → 23 shards → 1KB scratchpad
→ qwen2.5 extracted 4 facts + 6 entities + 6 relationships (real
code-level knowledge: the AggregateOptions<T> type, the aggregate<T>
async function with its real signature, typed relationships).
llm_team_run_id cross-references llm_team's own team_runs table.
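The size thresholds in piece 1 imply a sharding step roughly like
this sketch (`shardDiff` is a hypothetical name; the real splitter
may respect hunk boundaries rather than cutting at fixed offsets):

```typescript
// Constants taken from the commit message: curation kicks in above
// 30KB, and shards are 4.5KB each.
const CURATE_THRESHOLD = 30 * 1024;
const SHARD_SIZE = 4.5 * 1024; // 4608 bytes

// Small diffs pass through unchanged; large diffs become fixed-size
// shards, each summarized independently before the merged-scratchpad
// verification pass.
function shardDiff(diff: string): string[] {
  if (diff.length <= CURATE_THRESHOLD) return [diff];
  const shards: string[] = [];
  for (let i = 0; i < diff.length; i += SHARD_SIZE) {
    shards.push(diff.slice(i, i + SHARD_SIZE));
  }
  return shards;
}
```

At these constants a 102KB diff yields 23 shards, matching the PR #8
numbers quoted above.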
Also: audit.ts passes (pr_number, head_sha) as InferenceContext so extracted facts are scope-tagged for the KB index.
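The kb_query aggregation in piece 3 can be sketched as follows; the
row shape, `recurringEntities` name, and the generic-type list are
assumptions, not the actual kb_query code:

```typescript
interface AuditFactRow {
  pr_number: number;
  entities: string[];
}

// Generic built-in type names carry no project-specific signal.
const GENERIC_TYPES = new Set([
  "string", "number", "boolean", "object", "any", "void",
]);

// Explode entity rows, count distinct PRs per entity, and surface
// entities recurring in 2+ PRs; names under 3 chars (extractor
// truncation artifacts) and generic types are filtered out.
function recurringEntities(rows: AuditFactRow[], minPrs = 2): string[] {
  const prsByEntity = new Map<string, Set<number>>();
  for (const row of rows) {
    for (const name of row.entities) {
      if (name.length < 3 || GENERIC_TYPES.has(name.toLowerCase())) continue;
      if (!prsByEntity.has(name)) prsByEntity.set(name, new Set());
      prsByEntity.get(name)!.add(row.pr_number);
    }
  }
  return [...prsByEntity.entries()]
    .filter(([, prs]) => prs.size >= minPrs)
    .map(([name]) => name);
}
```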