Two threads landing together — the doc edits interleave so they ship in a single commit.

1. **vectord substrate fix verified at original scale** (closes the 2026-05-01 thread). Re-ran multitier 5min @ conc=50: 132,211 scenarios at 438/sec, 6/6 classes at 0% failure (was 4/6 pre-fix). Throughput dropped 1,115 → 438/sec because previously-broken scenarios now do real HNSW Add work — the honest cost of correctness. The fix (i.vectors side-store + safeGraphAdd recover wrappers + smallIndexRebuildThreshold=32 + saveTask coalescing) holds at the footprint that originally surfaced the bug.
2. **Materializer port** — internal/materializer + cmd/materializer + scripts/materializer_smoke.sh. Ports scripts/distillation/transforms.ts (12 transforms) + build_evidence_index.ts (idempotency, day-partition, receipt). The on-wire JSON shape matches the TS side, so Bun and Go runs are interchangeable. 14 tests green.
3. **Replay port** — internal/replay + cmd/replay + scripts/replay_smoke.sh. Ports scripts/distillation/replay.ts (retrieve → bundle → /v1/chat → validate → log). Closes the audit-FULL phase 7 live invocation on the Go side. Both runtimes append to the same data/_kb/replay_runs.jsonl (schema=replay_run.v1). 14 tests green.

Side effect on internal/distillation/types.go: EvidenceRecord gained prompt_tokens, completion_tokens, and metadata fields to mirror the TS shape the materializer transforms produce.

STATE_OF_PLAY refreshed to 2026-05-02; the ARCHITECTURE_COMPARISON decisions tracker moves the materializer + replay items from _open_ to DONE and adds the substrate-fix scale-verification row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
493 lines
18 KiB
Go
// Package distillation is the Go port of the Rust v1.0.0 distillation
// substrate (frozen at e7636f2). Per ADR-001 #4: port LOGIC, not
// bit-identical reproducibility.
//
// What this package owns (this commit):
//   - The deterministic scorer: EvidenceRecord → ScoredRun
//   - Score categories + scorer version constant
//   - SftSample type + validator with the contamination firewall
//     (the safety-critical piece — rejected/needs_human_review must
//     NEVER ship to SFT)
//
// What's deferred to follow-up commits:
//   - Materialization layer (file iteration, jsonl read/write,
//     date-partitioned storage) — operational tooling on top of
//     the scorer logic
//   - export_preference, export_rag (other export shapes)
//   - acceptance harness (the gate that locks v1.0.0)
//   - replay, receipts, evidence-index builders
//
// The scorer + SftSample validator are the LOAD-BEARING pieces per
// the project_distillation_substrate.md memory. The rest is plumbing
// that can land incrementally without changing the logic the
// downstream learning loop depends on.
package distillation

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// ScoreCategory is one of the four deterministic verdicts. Matches the
// Rust SCORE_CATEGORIES exactly.
type ScoreCategory string

const (
	CategoryAccepted          ScoreCategory = "accepted"
	CategoryPartiallyAccepted ScoreCategory = "partially_accepted"
	CategoryRejected          ScoreCategory = "rejected"
	CategoryNeedsHumanReview  ScoreCategory = "needs_human_review"
)

// AllScoreCategories lists every legal category — used by validators.
var AllScoreCategories = []ScoreCategory{
	CategoryAccepted,
	CategoryPartiallyAccepted,
	CategoryRejected,
	CategoryNeedsHumanReview,
}

// ScorerVersion is hardcoded — the deterministic-output contract
// requires this. Bump the literal in the same commit as any scoring-
// rule change so the version stamp moves atomically with the logic.
// Mirrors the Rust SCORER_VERSION (also v1.0.0 at e7636f2).
const ScorerVersion = "v1.0.0"

// SftQualityScore enumerates the categories LEGAL in SFT exports.
// SftNever (defined below) is the inverse — categories that NEVER
// ship to SFT under any flag combination. The contamination firewall
// is enforced at the schema layer (ValidateSftSample) AND by the
// exporter: defense in depth.
type SftQualityScore string

const (
	SftQualityAccepted          SftQualityScore = "accepted"
	SftQualityPartiallyAccepted SftQualityScore = "partially_accepted"
)

// SftQualityScores lists the quality scores legal in SFT samples.
// The default is SftQualityAccepted only; the --include-partial CLI
// flag expands it to both. rejected and needs_human_review are NEVER
// legal.
var SftQualityScores = []SftQualityScore{
	SftQualityAccepted,
	SftQualityPartiallyAccepted,
}

// SftNever is the contamination firewall: the ScoreCategories that
// NEVER ship to SFT under ANY caller flag. Enforced at the schema
// layer (ValidateSftSample) AND at the exporter layer. Per the Rust
// e7636f2 spec: "Hard non-negotiable: this set never expands. If you
// find yourself adding 'needs_human_review' or 'rejected' here, stop
// — that's the contamination the spec forbids."
//
// Exported so callers AND the validator share the same source of
// truth. Modifying this variable changes the contract; reviewers
// should treat any commit that touches it as a security review.
var SftNever = []ScoreCategory{
	CategoryRejected,
	CategoryNeedsHumanReview,
}

// SftSampleSchemaVersion bumps when the on-wire SftSample shape
// changes incompatibly. Keep in sync with the Rust
// SFT_SAMPLE_SCHEMA_VERSION.
const SftSampleSchemaVersion = 1

// ScoredRunSchemaVersion bumps when the on-wire ScoredRun shape
// changes incompatibly. Keep in sync with the Rust
// SCORED_RUN_SCHEMA_VERSION.
const ScoredRunSchemaVersion = 1

// EvidenceSchemaVersion mirrors the Rust EVIDENCE_SCHEMA_VERSION.
// This package consumes EvidenceRecord; producing it is a separate
// concern (the materialization layer, not yet ported).
const EvidenceSchemaVersion = 1

// ModelRole categorizes the kind of model output represented by an
// EvidenceRecord. Used by the SFT exporter to filter "real model
// output" from pure-extraction rows.
type ModelRole string

const (
	RoleExecutor    ModelRole = "executor"
	RoleReviewer    ModelRole = "reviewer"
	RoleExtractor   ModelRole = "extractor"
	RoleVerifier    ModelRole = "verifier"
	RoleCategorizer ModelRole = "categorizer"
	RoleTiebreaker  ModelRole = "tiebreaker"
	RoleApplier     ModelRole = "applier"
	RoleEmbedder    ModelRole = "embedder"
	RoleOther       ModelRole = "other"
)

// Provenance is the source linkage every distillation record carries.
// SourceFile is required (no record without source linkage); the other
// fields are best-effort aids for de-duplication and trace-back.
type Provenance struct {
	SourceFile string `json:"source_file"`
	LineOffset int64  `json:"line_offset,omitempty"`
	SigHash    string `json:"sig_hash"`
	RecordedAt string `json:"recorded_at"` // ISO 8601
}

// ObserverVerdict is what an observer returned for an executor's
// output. Matches the Rust enum, but as a string type for JSON
// flexibility.
type ObserverVerdict string

const (
	VerdictAccept ObserverVerdict = "accept"
	VerdictReject ObserverVerdict = "reject"
	VerdictCycle  ObserverVerdict = "cycle"
)

// EvidenceRecord is one row in the canonical evidence stream.
// Producing it (transforms from raw KB streams) is a separate
// concern; this package consumes it.
//
// Fields mirror the Rust EvidenceRecord at e7636f2. Optional fields
// use Go pointers / slices so missing-vs-empty stays distinguishable
// for the scorer's heuristics.
type EvidenceRecord struct {
	RunID         string `json:"run_id"`
	TaskID        string `json:"task_id"`
	Timestamp     string `json:"timestamp"`
	SchemaVersion int    `json:"schema_version"`

	Provenance Provenance `json:"provenance"`

	ModelName     string    `json:"model_name,omitempty"`
	ModelProvider string    `json:"model_provider,omitempty"`
	ModelRole     ModelRole `json:"model_role,omitempty"`

	InputHash  string `json:"input_hash,omitempty"`
	OutputHash string `json:"output_hash,omitempty"`

	SourceFiles []string `json:"source_files,omitempty"`
	CommandsRun []string `json:"commands_run,omitempty"`

	RetrievedContext *RetrievedContext `json:"retrieved_context,omitempty"`

	ObserverNotes      []string        `json:"observer_notes,omitempty"`
	ObserverVerdict    ObserverVerdict `json:"observer_verdict,omitempty"`
	ObserverConfidence float64         `json:"observer_confidence,omitempty"`
	ScratchpadSummary  string          `json:"scratchpad_summary,omitempty"`

	SuccessMarkers []string `json:"success_markers,omitempty"`
	FailureMarkers []string `json:"failure_markers,omitempty"`

	ValidationResults map[string]any `json:"validation_results,omitempty"`

	HumanOverride *HumanOverride `json:"human_override,omitempty"`

	CostUSD          float64 `json:"cost_usd,omitempty"`
	LatencyMs        int64   `json:"latency_ms,omitempty"`
	PromptTokens     int64   `json:"prompt_tokens,omitempty"`
	CompletionTokens int64   `json:"completion_tokens,omitempty"`
	Text             string  `json:"text,omitempty"`

	// Domain-specific bucket for source-row fields that don't earn a
	// top-level slot, e.g. contract_analyses carries `contractor` here.
	// Typed scalar values only — keep this small or it becomes a junk
	// drawer. Mirrors EvidenceRecord.metadata in evidence_record.ts.
	Metadata map[string]any `json:"metadata,omitempty"`
}

// RetrievedContext captures what the model saw via retrieval. Matches
// the Rust shape exactly so the JSON round-trips byte-identical (per
// ADR-001 #4 "logic, not bit-identical" — but on-wire compatibility
// is desirable for tooling that consumes EvidenceRecord JSONL).
type RetrievedContext struct {
	MatrixCorpora           []string `json:"matrix_corpora,omitempty"`
	MatrixHits              int      `json:"matrix_hits,omitempty"`
	MatrixChunksKept        int      `json:"matrix_chunks_kept,omitempty"`
	MatrixChunksDropped     int      `json:"matrix_chunks_dropped,omitempty"`
	PathwayFingerprintsSeen int      `json:"pathway_fingerprints_seen,omitempty"`
}

// HumanOverride captures a human-in-the-loop decision overriding the
// scorer's verdict. Recorded but doesn't change the scorer's output;
// downstream consumers (UI, distillation acceptance) decide how to
// treat it.
type HumanOverride struct {
	Overrider    string `json:"overrider"`
	Decision     string `json:"decision"` // accept|reject|needs_review
	Reason       string `json:"reason"`
	OverriddenAt string `json:"overridden_at"`
}

// SubScores carries the deterministic scorer's intermediate signals
// alongside the final ScoreCategory. Persisted on every ScoredRun
// so a downstream UI can show "why" without re-running the scorer.
type SubScores struct {
	CargoGreen             *bool           `json:"cargo_green,omitempty"`
	AnchorGrounding        *float64        `json:"anchor_grounding,omitempty"`
	SchemaValid            *bool           `json:"schema_valid,omitempty"`
	PathwayReplaySucceeded *bool           `json:"pathway_replay_succeeded,omitempty"`
	ObserverVerdict        ObserverVerdict `json:"observer_verdict,omitempty"`
	AcceptedOnAttempt      *int            `json:"accepted_on_attempt,omitempty"`

	// Extra fields the Rust schema accepted as `[key: string]: unknown`.
	// Captured here as a free-form map so future signals don't require
	// type-system changes. The json:"-" tag means encoding/json skips
	// it; use MarshalSubScores to serialize it into the same object.
	Extras map[string]any `json:"-"`
}

// ScoredRun is the deterministic scorer's output, one per
// EvidenceRecord. Provenance ties back to the materialized evidence
// row (not the raw source stream).
type ScoredRun struct {
	SchemaVersion  int           `json:"schema_version"`
	EvidenceRunID  string        `json:"evidence_run_id"`
	EvidenceTaskID string        `json:"evidence_task_id"`
	Category       ScoreCategory `json:"category"`
	Reasons        []string      `json:"reasons"` // non-empty
	ScoredAt       string        `json:"scored_at"`
	ScorerVersion  string        `json:"scorer_version"`
	SubScores      *SubScores    `json:"sub_scores,omitempty"`
	Provenance     Provenance    `json:"provenance"`
}

// SftSample is one entry in exports/sft/instruction_response.jsonl.
// The contamination firewall lives in ValidateSftSample.
type SftSample struct {
	SchemaVersion int             `json:"schema_version"`
	ID            string          `json:"id"`
	Instruction   string          `json:"instruction"`
	Context       string          `json:"context"` // empty allowed; null/missing not
	Response      string          `json:"response"`
	SourceRunID   string          `json:"source_run_id"`
	QualityScore  SftQualityScore `json:"quality_score"`
	CreatedAt     string          `json:"created_at"`
	Provenance    Provenance      `json:"provenance"`
}

// ─── Validators ──────────────────────────────────────────────────

// ValidationError is a single field-level violation.
type ValidationError struct {
	Field   string
	Message string
}

func (e ValidationError) Error() string {
	return fmt.Sprintf("%s: %s", e.Field, e.Message)
}

// ValidationErrors is the joinable error returned by the validators
// when one or more fields violate the schema.
type ValidationErrors []ValidationError

func (es ValidationErrors) Error() string {
	if len(es) == 0 {
		return "no errors"
	}
	parts := make([]string, len(es))
	for i, e := range es {
		parts[i] = e.Error()
	}
	return strings.Join(parts, "; ")
}

// HasErrors reports whether one or more errors are present.
func (es ValidationErrors) HasErrors() bool { return len(es) > 0 }

// ValidateScoredRun mirrors the Rust validateScoredRun. Returns nil
// on success, or a ValidationErrors carrying the field-level violations.
func ValidateScoredRun(r ScoredRun) error {
	var errs ValidationErrors
	if r.SchemaVersion != ScoredRunSchemaVersion {
		errs = append(errs, ValidationError{
			"schema_version",
			fmt.Sprintf("expected %d, got %d", ScoredRunSchemaVersion, r.SchemaVersion),
		})
	}
	if r.EvidenceRunID == "" {
		errs = append(errs, ValidationError{"evidence_run_id", "must be non-empty"})
	}
	if r.EvidenceTaskID == "" {
		errs = append(errs, ValidationError{"evidence_task_id", "must be non-empty"})
	}
	if !validISOTimestamp(r.ScoredAt) {
		errs = append(errs, ValidationError{"scored_at", "must be ISO 8601 timestamp"})
	}
	if r.ScorerVersion == "" {
		errs = append(errs, ValidationError{"scorer_version", "must be non-empty"})
	}
	if len(r.Reasons) == 0 {
		errs = append(errs, ValidationError{"reasons", "must be non-empty (every score needs a reason)"})
	}
	if !isValidCategory(r.Category) {
		errs = append(errs, ValidationError{"category", fmt.Sprintf("must be one of %v, got %q", AllScoreCategories, r.Category)})
	}
	if err := validateProvenance(r.Provenance, "provenance"); err != nil {
		errs = append(errs, err...)
	}
	if r.SubScores != nil && r.SubScores.AnchorGrounding != nil {
		ag := *r.SubScores.AnchorGrounding
		if ag < 0 || ag > 1 {
			errs = append(errs, ValidationError{"sub_scores.anchor_grounding", "must be in [0, 1]"})
		}
	}
	if errs.HasErrors() {
		return errs
	}
	return nil
}

// ValidateSftSample is the contamination firewall. Returns
// ErrSftContamination (wrapped) when quality_score is in SftNever —
// the safety-critical guarantee the spec calls non-negotiable.
//
// Other field violations come back as ValidationErrors.
func ValidateSftSample(s SftSample) error {
	var errs ValidationErrors
	if s.SchemaVersion != SftSampleSchemaVersion {
		errs = append(errs, ValidationError{
			"schema_version",
			fmt.Sprintf("expected %d, got %d", SftSampleSchemaVersion, s.SchemaVersion),
		})
	}
	if s.ID == "" {
		errs = append(errs, ValidationError{"id", "must be non-empty"})
	}
	if strings.TrimSpace(s.Instruction) == "" {
		errs = append(errs, ValidationError{"instruction", "must be non-whitespace (no empty pairs)"})
	}
	if strings.TrimSpace(s.Response) == "" {
		errs = append(errs, ValidationError{"response", "must be non-whitespace (no empty pairs)"})
	}
	// Context is a required string, but empty is allowed. (The field is
	// always typed as string in Go, so "set" vs "missing" can only be
	// distinguished at the JSON layer; here empty is fine.)
	if s.SourceRunID == "" {
		errs = append(errs, ValidationError{"source_run_id", "must be non-empty"})
	}
	if !validISOTimestamp(s.CreatedAt) {
		errs = append(errs, ValidationError{"created_at", "must be ISO 8601 timestamp"})
	}
	if err := validateProvenance(s.Provenance, "provenance"); err != nil {
		errs = append(errs, err...)
	}

	// Contamination firewall. Hard non-negotiable per the spec.
	if !isLegalSftQualityScore(s.QualityScore) {
		// If it's in SftNever, surface the firewall sentinel — callers
		// can errors.Is(err, ErrSftContamination) to reliably detect
		// "the spec said never" as opposed to "you typo'd a category."
		if isContaminationCategory(s.QualityScore) {
			return fmt.Errorf("%w: quality_score %q in SftNever (rejected/needs_human_review never legal in SFT)",
				ErrSftContamination, s.QualityScore)
		}
		errs = append(errs, ValidationError{
			"quality_score",
			fmt.Sprintf("must be one of %v, got %q", SftQualityScores, s.QualityScore),
		})
	}

	if errs.HasErrors() {
		return errs
	}
	return nil
}

// ErrSftContamination is the firewall sentinel. When ValidateSftSample
// rejects a sample because its quality_score is in SftNever, callers
// can errors.Is(err, ErrSftContamination) to reliably distinguish
// "spec violation" from "typo'd category."
var ErrSftContamination = errors.New("distillation: SFT contamination — quality_score in SftNever")

// ─── Internal helpers ────────────────────────────────────────────

func isValidCategory(c ScoreCategory) bool {
	for _, v := range AllScoreCategories {
		if c == v {
			return true
		}
	}
	return false
}

func isLegalSftQualityScore(q SftQualityScore) bool {
	for _, v := range SftQualityScores {
		if q == v {
			return true
		}
	}
	return false
}

func isContaminationCategory(q SftQualityScore) bool {
	// Compare as strings — the on-wire value is the same for both
	// types, so the firewall catches any value that matches SftNever
	// regardless of which Go type it arrived as.
	for _, v := range SftNever {
		if string(v) == string(q) {
			return true
		}
	}
	return false
}

func validISOTimestamp(s string) bool {
	if s == "" {
		return false
	}
	// time.Parse with the RFC3339 layout covers the common ISO 8601
	// shapes (it accepts fractional seconds even though the layout
	// omits them); the RFC3339Nano retry is belt-and-braces, since
	// the Rust producers vary.
	if _, err := time.Parse(time.RFC3339, s); err == nil {
		return true
	}
	if _, err := time.Parse(time.RFC3339Nano, s); err == nil {
		return true
	}
	return false
}

func validateProvenance(p Provenance, field string) ValidationErrors {
	var errs ValidationErrors
	if p.SourceFile == "" {
		errs = append(errs, ValidationError{field + ".source_file", "must be non-empty"})
	}
	if p.SigHash == "" {
		errs = append(errs, ValidationError{field + ".sig_hash", "must be non-empty"})
	}
	if !validISOTimestamp(p.RecordedAt) {
		errs = append(errs, ValidationError{field + ".recorded_at", "must be ISO 8601 timestamp"})
	}
	return errs
}

// MarshalSubScores is a shim — Go's encoding/json doesn't merge a
// "rest" map into the struct's JSON output by default. Callers that
// need Extras serialized into the same object can use this helper.
func MarshalSubScores(s *SubScores) ([]byte, error) {
	if s == nil {
		return []byte("null"), nil
	}
	// First marshal the typed fields normally. The alias type sheds
	// any methods, so this can't recurse into a custom MarshalJSON.
	type alias SubScores
	base, err := json.Marshal((*alias)(s))
	if err != nil {
		return nil, err
	}
	if len(s.Extras) == 0 {
		return base, nil
	}
	// Decode back to a map, merge Extras, re-encode. Less efficient,
	// but it keeps the field semantics correct: typed fields win over
	// extras on collision.
	var combined map[string]any
	if err := json.Unmarshal(base, &combined); err != nil {
		return nil, err
	}
	for k, v := range s.Extras {
		if _, exists := combined[k]; !exists {
			combined[k] = v
		}
	}
	return json.Marshal(combined)
}
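The marshal-to-map-and-merge move above can be exercised standalone. `scores` and `marshalMerged` are pared-down stand-ins for SubScores and MarshalSubScores, not the real API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scores is a toy stand-in for SubScores: one typed field plus a
// free-form Extras map that encoding/json skips (json:"-").
type scores struct {
	Score  int            `json:"score"`
	Extras map[string]any `json:"-"`
}

func marshalMerged(s scores) ([]byte, error) {
	base, err := json.Marshal(s) // typed fields only; Extras is skipped
	if err != nil {
		return nil, err
	}
	var m map[string]any
	if err := json.Unmarshal(base, &m); err != nil {
		return nil, err
	}
	for k, v := range s.Extras {
		if _, exists := m[k]; !exists { // typed fields win on collision
			m[k] = v
		}
	}
	return json.Marshal(m) // map keys are emitted in sorted order
}

func main() {
	out, _ := marshalMerged(scores{Score: 1, Extras: map[string]any{"note": "x", "score": 99}})
	fmt.Println(string(out)) // {"note":"x","score":1}
}
```

Note the collision case: an Extras key that shadows a typed field ("score" above) is dropped, which is the contract the real helper documents.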