root d023b07b30 Real-scale validation post-G0: configurable ingest cap + workers_500k metrics
Validated G0 substrate against the production workers_500k.parquet
dataset (18 cols × 500,000 rows). Findings + one applied fix:

Finding #1 (FIXED): ingestd's hardcoded 256 MiB cap rejected the 500K
CSV (344 MiB) with 413. Cap fired correctly, no OOM. Extracted to
[ingestd].max_ingest_bytes config field; default 256 MiB, override
per deployment for known-large workloads. With cap bumped to 512 MiB,
500K ingest succeeds in 3.12s with ingestd peak RSS 209 MiB.

Finding #2 (deferred): ingestd doesn't release memory between
ingests. Go runtime conservative; long-running daemon, fine.

Finding #3: DuckDB-via-httpfs is healthy at 500K. GROUP BY 45ms,
count(*) 24ms, AVG 47ms, schema introspection 25ms. Sub-linear
scaling vs 100K — the s3:// read path is not a bottleneck.

Finding #4: ADR-010 type inference correctly handled real staffing
data. worker_id → BIGINT, numeric scores → DOUBLE, multi-line
resume_text → VARCHAR. 1000-row sample sufficient.

Finding #5: Go's encoding/csv handles RFC 4180 quoted-comma fields
and multi-line quoted text without LazyQuotes — confirming the D4
scrum's dismissal of Qwen's BLOCK on this point.

Net: substrate handles production-scale data with one config knob.
No correctness issues, no OOMs, no silent type errors.
All 6 G0 smokes still PASS after the cap-config change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:32:08 -05:00
..