01 Headline
Exactly one family robustly separates the models: JSON constraint-emission, where 26B-A4B holds ~90–100% across d6–d12 while E4B falls to ~30–55% (McNemar p≈1×10⁻¹⁹). Code generation does not separate them at all (both 100% through 22 rungs incl. regex matching, Dijkstra, knapsack). The state-tracking and arith "divergences" that a naïve pass would have reported turned out to be token-truncation artifacts — once deconfounded, both models track state to d38 and arithmetic to d16 at parity. The real differentiator is emitting many simultaneous computed constraints, not raw multi-step reasoning. Click a card to jump to its detail.
02 Method
Both models always see the same generated item (deterministic seeds), enabling paired statistics. Every task family has a difficulty knob and an objective auto-grader — no LLM-as-judge.
- Adaptive search. A coarse pass brackets each family with an exponential walk up its difficulty ladder (3 items/milestone) until a rung both models fail; a refine pass then bisects and fills every rung inside the contested zone with up to N = 18 paired items, sharpening the pass-rate curves and the inter-model gap exactly where it matters.
- Wilson 95% CIs on every per-rung pass rate; a paired McNemar test over items in the contested band gives the significance and direction of the gap.
- Competence frontier = the highest difficulty a model still passes at ≥ τ (70%). Divergence band = the difficulty range where one model stays competent while the other has started to fail (or the two differ by ≥ 25 pts).
- Four families: arith (operator-precedence value), state (register tracking over N ops), json (N simultaneous constraints, several computed), code (12→22-rung algorithm ladder, graded by hidden unit tests in a sandboxed subprocess).
- Resumable. State lives in an append-only JSONL log; each window self-limits and the controller re-derives what to probe next, so the run survives kill/restart and can be deepened mid-campaign.
03 Per-family results
Pass rate vs difficulty for each family. Solid blue = E4B, dashed orange = 26B-A4B; shaded ribbons are Wilson 95% CIs, the dotted grey line is the τ=70% competence threshold, and amber bands mark the contested difficulties. Hover any point for the exact rate, CI, and n.
04 Recommendation for the agent pipeline
Routing tied to the clean measured frontiers — not priors, and not the truncation artifacts. The rule that survives scrutiny: only structured multi-constraint output (the JSON family) reliably separates the models. Everything else — code synthesis, state tracking, arithmetic — is at parity across the full range we could measure cleanly, so default to E4B and reserve 26B-A4B specifically for roles that emit dense, computed, schema-bound output.
| Role | Dominant load | Closest family | E4B frontier | 26B frontier | Route to |
|---|---|---|---|---|---|
| Implementer | Write a function from a spec | code | E4B — safe | ||
| Tester | Generate/execute checks, small algorithms | code | E4B — safe | ||
| Critic / tool-caller (structured) | Emit JSON with many computed/derived fields | json | 26B-A4B | ||
| Supervisor | Track evolving task/agent state | state | E4B — safe* | ||
| Planner | Multi-step constrained reasoning | arith | E4B — safe* |
05 Harness validation & corrections
The provided harness had never been run against the live endpoint. Validation surfaced four real bugs (fixed before any campaign data was trusted), plus one measurement confound caught during analysis.
solve(*args) (correct for multi-arg rungs like glob/Levenshtein/Dijkstra), but rungs 1–4 encoded
a single list argument unwrapped, so sum([1,2,3]) was called as solve(1,2,3),
raised, and was counted as a fail. Every model scored 0/N on those rungs regardless of correctness.ANSWER:, and rewrote the grader to extract the final answer (ANSWER: → last
= N → last integer).len(rows) % 4, but
rows grow by two per paired item, so it only ever yielded 0 or 2 — the state and
code families never led the rotation.Mid-campaign deepening (Step 3): the code ladder was extended from 12 to 22 rungs (adding LCS, LIS, coin-change, word-break, knapsack, Dijkstra, regex matching, longest-unique-substring, trapping-rain-water, min-path-sum) and json from 8 to 12 constraints once both models aced the original tops; intermediate rungs were inserted where the curve jumped steeply; and N_BAND was raised 8 → 12 → 18 to tighten the CIs and McNemar power.