3 Commits

Author SHA1 Message Date
root
7bb432f6c8 distillation: full SFT export port — closes OPEN #2 fully
Follow-up to b216b7e (which shipped the SFT export substrate). This
commit ports the synthesis logic, completing the migration:

- SynthesizeSft(scored, ev, recordedAt, sftID) → *SftSample
  Mirrors the Rust synthesizeSft byte-for-byte. Returns nil for
  extraction-class records + empty-text records (same skip
  semantics as Rust).
- LoadEvidenceByRunID(scoredPath, cache) reads the paired evidence
  JSONL (path derived by /scored-runs/ → /evidence/ replacement).
  Per-call cache so multiple scored-runs files in the same dir
  don't reload the same evidence.
- buildInstruction maps source_file stem → per-class instruction
  template. All 8 templates (scrum_reviews, mode_experiments,
  auto_apply, audits, observer_reviews, contract_analyses,
  outcomes, default) match Rust output exactly so a/b validation
  between runtimes can diff JSONL byte-for-byte.
- stemFromSourceFile strips data/_kb/ prefix + .jsonl suffix.
- ExportSft now writes data/distilled/sft/sft_export.jsonl with
  the synthesized samples (DryRun=true skips file write).

Per-class templates verified by 8-case sub-test:
- scrum_reviews → "Review the file '...' against the PRD..."
- mode_experiments → "Run task_class='...' for file..."
- auto_apply → "Auto-apply: emit a 6-line surgical patch..."
- audits with phase: prefix → strips to bare phase name
- observer_reviews → "Observer-review the latest attempt..."
- contract_analyses with permit: prefix → strips to permit ID
- outcomes → "Run scenario; report per-event outcome..."
- unknown source → "Source 'X' run; produce the appropriate output"

Caveat documented inline: contract_analyses uses ev.metadata.contractor
in Rust to produce "Analyze contractor 'X' for permit 'Y'" when
present. Go's EvidenceRecord doesn't carry a free-form metadata bag
yet, so we always emit the no-contractor form. Operators needing
contractor-aware instructions can extend EvidenceRecord with an
explicit Metadata field (separate ADR).

Test additions (5 new):
- TestSynthesizeSft_PerSourceClass: 8 sub-cases, one per template
- TestSynthesizeSft_RejectsExtraction: extraction-role records skipped
- TestSynthesizeSft_RejectsEmptyText: empty/whitespace text skipped
- TestSynthesizeSft_ContextAssembly: matrix + pathway + model
  context string formatting matches Rust " · " join
- TestExportSft_FullPort_WritesJSONL: end-to-end fixture, asserts
  output contains expected instruction + omits firewalled records

Pre-existing TestExportSft_PartialPort_FirewallFires renamed +
updated to TestExportSft_FirewallFiresBeforeEvidenceLoad — reflects
the new contract that records passing the firewall but lacking
evidence land in "not-instructable" rather than being silently
exported. Honest semantics shift documented in the test.

OPEN #2 now fully closed (was: substrate-only). The synthesis path
no longer requires the Rust pipeline to be invoked — Go-side
operators can run the full distillation export end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:06:57 -05:00
root
b216b7e5b6 fix the other 4: close all OPEN-list items in one wave
Substantial wave addressing all 4 prior OPEN items. Three closed in
full, one partially (the speculative half deliberately deferred).

OPEN #1 — Periodic fresh→main index merge (FULL):
- POST /v1/vectors/index/{src}/merge with {dest, clear_source}
- Idempotent on re-runs (existing-in-dest items skipped)
- internal/vectord/index.go: new Index.IDs() snapshot method +
  i.ids tracker field as canonical ID set, independent of meta
  map's nil-vs-{} sparseness (was a real bug — IDs() backed by meta
  alone missed items added with nil metadata)
- 4 cmd-level integration tests (happy path drain+clear, dim
  mismatch, dest not found, self-merge rejection) + 1 unit test
- DecodeIndex backward-compat: old envelopes restore i.ids from
  meta keys (best effort; new items going forward use the tracker)

OPEN #2 — Distillation SFT export (SUBSTRATE):
- internal/distillation/sft_export.go ports the load-bearing half:
  IsSftNever predicate + ListScoredRunFiles (data/scored-runs/YYYY/
  MM/DD walk) + LoadScoredRunsFromFile + partial ExportSft.
- Synthesis (instruction/input/response generation) deferred to a
  separate wave — too big for this session, but the substrate
  makes the next wave a port-not-design exercise.
- TestSftNever_PinsExpectedSet locks the contamination firewall
  set: if a future commit adds/removes from SftNever, this test
  fails — forcing the change through review.
- 5 new tests; firewall fires end-to-end through the partial port.

OPEN #3 — Distribution drift via PSI (FULL):
- internal/drift/drift.go: ComputeDistributionDrift via Population
  Stability Index. Standard finance/risk metric, well-defined
  verdict tiers (stable < 0.10, minor 0.10–0.25, major ≥ 0.25).
- Equal-width bucketing over combined min/max so neither dist
  falls outside; epsilon-clamping for empty buckets so log doesn't
  blow up. Per-bucket breakdown for drilldown.
- Pairs with the existing ComputeScorerDrift: scorer drift is
  categorical, distribution drift is continuous. Different shapes,
  same package.
- 7 new tests covering identical-is-stable, hard-shift-is-major,
  moderate-detected-not-stable, empty-inputs-safe, all-identical-
  safe, bucket-counts-conserved, num-buckets-clamping.

OPEN #4 — Ops nice-to-haves (PARTIAL — wall-clock done, others
deferred):
- (a) Real-time wall-clock for stress harness: per-phase elapsed
  time logged to stdout as it runs (`[stress] phase NAME starting
  (T+12.3s)` + `[stress] phase NAME done — 8.5s (T+20.8s)`).
  Output.PhaseTimings + Output.TotalElapsedMs in JSON.
- (b) chatd fixture-mode S3 mock + (c) liberal-paraphrase
  calibration: not actioned — no fired trigger, would be
  speculative. Documented as deferred-until-need rather than
  ignored. Per the project's discipline ("don't add features
  beyond what the task requires").

OPEN list now empty / steady-state. Future items will land as
production triggers fire.

Build + vet + tests green; 18 new tests across the 4 closures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:42:11 -05:00
root
57d0df125d E (partial): distillation port — scorer + contamination firewall
First slice of the Rust v1.0.0 distillation substrate (e7636f2)
ported to Go per ADR-001 #4 (port LOGIC, not bit-identical
reproducibility). This commit lands the LOAD-BEARING pieces named
in project_distillation_substrate.md memory:

  - The deterministic Success Scorer (8 sub-scorers + dispatch)
  - The contamination firewall on SFT samples (the "non-negotiable"
    spec property: rejected/needs_human_review NEVER ship to SFT)
  - All on-wire types + validators for ScoredRun, SftSample,
    EvidenceRecord with Provenance

Files:
  internal/distillation/types.go  — types + ScorerVersion + SftNever
                                    + ValidateScoredRun + ValidateSftSample
  internal/distillation/scorer.go — ScoreRecord + 8 class scorers +
                                    BuildScoredRun (deterministic)
  internal/distillation/scorer_test.go — ~40 test cases:
    - source-class dispatch (verdict / telemetry / extraction)
    - scrum_review (4 attempt cases)
    - observer_review (5 verdict cases)
    - audit (legacy + severity, 9 cases)
    - auto_apply (4 cases)
    - outcomes / mode_experiment / extraction
    - CONTAMINATION FIREWALL: ErrSftContamination sentinel fires
      on rejected/needs_human_review, distinct from typo errors
    - empty-pair guard (instruction/response trim != "")
    - reasons-required ScoredRun validation
    - deterministic sig_hash on identical input
    - purity check (input not mutated, repeatable output)

Per the 2026-04-29 cross-lineage scrum's discipline: false-positive
findings would be dismissed inline (none in this commit). Real
findings would be addressed before merge — but this is greenfield
port code reviewed against its Rust source line-by-line, which the
test suite encodes as truth tables.

Explicitly DEFERRED to follow-up commits:
  - Materialization layer (jsonl read/write, date-partitioned
    storage in data/scored-runs/YYYY/MM/DD/, evidence index)
  - SFT exporter (file iteration + filtering — the SCORING firewall
    is here; the EXPORT firewall is the next layer)
  - export_preference, export_rag (other export shapes)
  - Acceptance harness (16/16 acceptance gate that locks v1.0.0)
  - replay, receipts, build_evidence_index, transforms

The scorer + firewall validator are pure functions — operational
tooling layers on top without changing the deterministic logic the
downstream learning loop depends on. The Go ScorerVersion stays at
v1.0.0 to match the Rust e7636f2 baseline; bumping in the Go
materialization commit is reserved for the next scoring-rule
change, NOT the port itself.

15-smoke regression all green. vet clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:04:29 -05:00