Gemma-4 E4B vs 26B-A4B — Divergence Benchmark

01 Headline

Exactly one family robustly separates the models: JSON constraint emission, where 26B-A4B holds ~90–100% across d6–d12 while E4B falls to ~30–55% (McNemar p≈1×10⁻¹⁹). Code generation does not separate them at all (both 100% through 22 rungs, including regex matching, Dijkstra, and knapsack). The state-tracking and arithmetic "divergences" that a naïve pass would have reported turned out to be token-truncation artifacts: once deconfounded, both models track state to d38 and handle arithmetic to d16 at parity. The real differentiator is emitting many simultaneous computed constraints, not raw multi-step reasoning.

02 Method

Both models always see the same generated item (deterministic seeds), enabling paired statistics. Every task family has a difficulty knob and an objective auto-grader — no LLM-as-judge.

Adaptive search. A coarse pass brackets each family with an exponential walk up its difficulty ladder (3 items/milestone) until a rung both models fail; a refine pass then bisects and fills every rung inside the contested zone with up to N = 18 paired items, sharpening the pass-rate curves and the inter-model gap exactly where it matters.
Wilson 95% CIs on every per-rung pass rate; a paired McNemar test over items in the contested band gives the significance and direction of the gap.
Competence frontier = the highest difficulty a model still passes at ≥ τ (70%). Divergence band = the difficulty range where one model stays competent while the other has started to fail (or the two differ by ≥ 25 pts).
Four families: arith (operator-precedence value), state (register tracking over N ops), json (N simultaneous constraints, several computed), code (12→22-rung algorithm ladder, graded by hidden unit tests in a sandboxed subprocess).
Resumable. State lives in an append-only JSONL log; each window self-limits and the controller re-derives what to probe next, so the run survives kill/restart and can be deepened mid-campaign.
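
The interval and the significance test named above can be sketched in a few lines. This is a minimal illustration, not the harness's own code: the function names are hypothetical, and the exact (binomial) form of McNemar's test is an assumption, since the report does not say whether the exact or the chi-square variant was used.

```python
import math


def wilson_interval(passes, n, z=1.959964):
    """Wilson score interval for a binomial pass rate (z=1.96 gives ~95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b = paired items only model A passed, c = paired items only model B passed.
    Items both models passed or both failed carry no information and are ignored.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability of a split at least this lopsided.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For a rung where a model passes 18 of 18 items, `wilson_interval(18, 18)` has a lower bound of roughly 0.82, which is why a perfect score at N = 18 still leaves a visible ribbon. The direction of the gap is read from which of `b` and `c` is larger.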

03 Per-family results

Pass rate vs difficulty for each family. Solid blue = E4B, dashed orange = 26B-A4B; shaded ribbons are Wilson 95% CIs, the dotted grey line is the τ=70% competence threshold, and amber bands mark the contested difficulties.
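
The τ threshold and the contested bands follow the Method definitions of competence frontier and divergence band. A minimal sketch of those two definitions is given here, under stated assumptions: the function names are hypothetical, "highest difficulty a model still passes" is read as the highest rung whose pass rate is at least τ, and the band is evaluated rung by rung. The harness may implement either differently.

```python
TAU = 0.70   # competence threshold from the Method section
GAP = 0.25   # pass-rate gap that also marks a rung as contested


def competence_frontier(rates, tau=TAU):
    """Highest difficulty whose pass rate is >= tau, or None if there is none.

    `rates` maps difficulty -> pass rate in [0, 1].
    """
    passing = [d for d, r in rates.items() if r >= tau]
    return max(passing) if passing else None


def divergence_band(rates_a, rates_b, tau=TAU, gap=GAP):
    """Difficulties where exactly one model is competent or the gap is >= `gap`."""
    band = []
    for d in sorted(rates_a.keys() & rates_b.keys()):
        one_competent = (rates_a[d] >= tau) != (rates_b[d] >= tau)
        if one_competent or abs(rates_a[d] - rates_b[d]) >= gap:
            band.append(d)
    return band


# Toy pass rates, invented purely for illustration (not benchmark data).
toy_a = {2: 1.0, 4: 0.9, 6: 0.5, 8: 0.3}
toy_b = {2: 1.0, 4: 1.0, 6: 0.95, 8: 0.9}
print(competence_frontier(toy_a), competence_frontier(toy_b))
print(divergence_band(toy_a, toy_b))
```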

04 Recommendation for the agent pipeline

Routing is tied to the clean measured frontiers, not to priors and not to the truncation artifacts. The rule that survives scrutiny: only structured multi-constraint output (the JSON family) reliably separates the models. Everything else (code synthesis, state tracking, arithmetic) is at parity across the full range we could measure cleanly, so default to E4B and reserve 26B-A4B specifically for roles that emit dense, computed, schema-bound output.

Role	Dominant load	Closest family	Route to
Implementer	Write a function from a spec	code	E4B — safe
Tester	Generate/execute checks, small algorithms	code	E4B — safe
Critic / tool-caller (structured)	Emit JSON with many computed/derived fields	json	26B-A4B
Supervisor	Track evolving task/agent state	state	E4B — safe*
Planner	Multi-step constrained reasoning	arith	E4B — safe*

E4B and 26B-A4B are served concurrently with no swap penalty, so a router can send each call to the cheaper sufficient model per role at no reload cost. The clean win: only structured-output roles (Critic / tool-caller emitting many computed JSON fields) justify 26B-A4B; there E4B drops the computed constraints (sums, parity flags) while keeping the structure. Code, state-tracking, and arithmetic roles run safely on E4B.

* The Supervisor/Planner verdicts reversed after deconfounding: the first (truncated) pass made them look like 26B-only roles, but clean re-runs show parity up to the hardest difficulty tested. These verdicts are bounded by what we could measure and are not proof of parity at arbitrary scale.
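
The routing rule can be written as a small lookup. This is only a sketch of the table's logic: the role keys, the `route` function, and the model labels are illustrative names, not identifiers from the pipeline or the serving endpoint.

```python
# Role -> closest benchmark family, copied from the routing table.
ROLE_FAMILY = {
    "implementer": "code",
    "tester": "code",
    "critic": "json",
    "supervisor": "state",
    "planner": "arith",
}

# Families where the larger model showed a clean, measured advantage.
ESCALATE_FAMILIES = {"json"}

SMALL_MODEL = "E4B"       # label only; the real model ID is an assumption
LARGE_MODEL = "26B-A4B"   # label only; the real model ID is an assumption


def route(role):
    """Return the model label for a role, defaulting to the small model."""
    family = ROLE_FAMILY.get(role.lower())
    if family in ESCALATE_FAMILIES:
        return LARGE_MODEL
    # Parity families and unknown roles fall through to the default.
    return SMALL_MODEL
```

Keying the decision on the family rather than on the role means a new measurement only has to update `ESCALATE_FAMILIES`; for example, if a later campaign found a clean state-tracking gap, the Supervisor would move without touching the role table.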

05 Harness validation & corrections

The provided harness had never been run against the live endpoint. Validation surfaced four real bugs (fixed before any campaign data was trusted), plus one measurement confound caught during analysis.

Bug 1 — code grader silently failed the four easiest rungs. The grader invokes solve(*args) (correct for multi-arg rungs like glob/Levenshtein/Dijkstra), but rungs 1–4 encoded a single list argument unwrapped, so sum([1,2,3]) was called as solve(1,2,3), raised, and was counted as a fail. Every model scored 0/N on those rungs regardless of correctness.

Fix. Wrapped the single-list args so the call convention matches the grader.
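
The call-convention mismatch is easy to reproduce in isolation. The sketch below uses a stand-in grader and stand-in solutions (all names other than `solve` are hypothetical) to show why an unwrapped list fails under `solve(*args)` and why wrapping it in a one-element tuple fixes it.

```python
def grade(solve_fn, args, expected):
    """Stand-in grader: unpack args, treat any exception as a failure."""
    try:
        return solve_fn(*args) == expected
    except Exception:
        return False


def solve_sum(xs):
    """A correct single-argument solution: sum a list."""
    return sum(xs)


def solve_add(a, b):
    """A correct two-argument solution, where unpacking is the right call."""
    return a + b


broken_args = [1, 2, 3]       # unpacks to solve_sum(1, 2, 3) -> TypeError
fixed_args = ([1, 2, 3],)     # unpacks to solve_sum([1, 2, 3])

print(grade(solve_sum, broken_args, 6))   # False: a correct solution is failed
print(grade(solve_sum, fixed_args, 6))    # True
print(grade(solve_add, (2, 5), 7))        # True: multi-arg rungs were unaffected
```

Because the stand-in grader swallows the exception, the failure looks identical to a wrong answer, which matches the silent 0/N pattern described for the affected rungs.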

Bug 2 — arithmetic was truncated at 32 tokens, biasing the verbose model. Both models show their working before answering; a 32-token cap cut them off mid-reasoning. The more verbose 26B-A4B was penalized harder (0.25 vs 0.38 at just four operators) — the test was measuring brevity, not arithmetic.

Fix. Raised the budget, reworded the prompt to allow working then require a trailing ANSWER:, and rewrote the grader to extract the final answer (ANSWER: → last = N → last integer).
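
The extraction order in the rewritten grader (an explicit ANSWER: line, then the last "= N", then the last integer) could look like the following. The regular expressions and the function name are assumptions for illustration; the real grader's patterns are not shown in this report.

```python
import re


def extract_answer(text):
    """Return the model's final integer answer, or None if no integer appears.

    Tries progressively weaker signals and always prefers the LAST match,
    so intermediate working is not mistaken for the answer.
    """
    # 1) Explicit marker requested by the prompt.
    hits = re.findall(r"ANSWER:\s*(-?\d+)", text)
    if hits:
        return int(hits[-1])
    # 2) Last "= N" in the working.
    hits = re.findall(r"=\s*(-?\d+)", text)
    if hits:
        return int(hits[-1])
    # 3) Last integer anywhere in the text.
    hits = re.findall(r"-?\d+", text)
    return int(hits[-1]) if hits else None
```

The fallbacks only help when the final value was actually emitted. On output cut off mid-working, step 2 or 3 will return an intermediate value, so a fallback chain does not remove the need for an adequate token budget.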

Bug 3 — state grader fragility. A 48-token cap plus first-match register parsing could capture an intermediate register value instead of the final one.

Fix. Raised the budget and switched to last-match per register.
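
The difference between first-match and last-match parsing shows up as soon as a model narrates intermediate steps. The sketch below assumes register values appear as `name = integer`; that output format, the regex, and the function names are hypothetical.

```python
import re

# Assumed format: a register name, "=", then an integer literal.
REGISTER = re.compile(r"\b([A-Za-z]\w*)\s*=\s*(-?\d+)")


def parse_first(text):
    """Old behaviour: keep the first value seen for each register."""
    values = {}
    for name, val in REGISTER.findall(text):
        values.setdefault(name, int(val))
    return values


def parse_last(text):
    """Fixed behaviour: later assignments overwrite earlier ones."""
    values = {}
    for name, val in REGISTER.findall(text):
        values[name] = int(val)
    return values


transcript = "Step 1: a = 3\nStep 2: b = 4\nStep 3: a = 7\nFinal: a = 7, b = 4"
print(parse_first(transcript))   # picks up the intermediate a = 3
print(parse_last(transcript))    # picks up the final a = 7
```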

Bug 4 — broken family interleaving. The rotation used len(rows) % 4, but rows grow by two per paired item, so it only ever yielded 0 or 2 — the state and code families never led the rotation.

Fix. Rotated by paired-item count so all four families interleave; also added per-row logging of the raw model text (the harness logged none, yet failure-mode inspection requires it).
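
A short simulation makes the rotation bug concrete. The family order is assumed to be arith, state, json, code (the order in which the Method section lists them, consistent with state and code sitting at the odd indices); the row layout and function names are hypothetical.

```python
FAMILIES = ["arith", "state", "json", "code"]  # assumed order


def leader_buggy(rows):
    """Original rule: index by raw row count, which is always even."""
    return FAMILIES[len(rows) % 4]


def leader_fixed(rows):
    """Fixed rule: index by paired-item count (two rows per item)."""
    return FAMILIES[(len(rows) // 2) % 4]


def simulate(pick, n_items=8):
    """Run n_items paired items and record which family led each one."""
    rows, leaders = [], []
    for _ in range(n_items):
        family = pick(rows)
        leaders.append(family)
        # One row per model for every paired item.
        rows.append({"family": family, "model": "E4B"})
        rows.append({"family": family, "model": "26B-A4B"})
    return leaders


print(simulate(leader_buggy))   # alternates between arith and json only
print(simulate(leader_fixed))   # cycles through all four families
```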

Confound caught in analysis — token-cap truncation (and what it overturned). A first pass showed dramatic state and arith divergence — but inspecting raw outputs revealed E4B's verbose step-by-step style was hitting the token cap before it emitted its final answer (state d30: E4B truncated on 18/18 items vs 26B on 4/18). That asymmetrically depressed the more-verbose E4B and manufactured the gap. Re-running state and arith at a 2048-token budget erased the state divergence entirely (both models now pass the hardest rungs, d34/d38, at ~100%) and collapsed arith to near-parity (both 18/18 at d16). Had the failure-mode read been skipped, three of four "divergences" would have been reported — two of them wrong. json is the one separation that survives clean scrutiny; code was always truncation-free. (Arith still carries residual E4B truncation above d16 — flagged on its panel.)

Mid-campaign deepening (Step 3): the code ladder was extended from 12 to 22 rungs (adding LCS, LIS, coin-change, word-break, knapsack, Dijkstra, regex matching, longest-unique-substring, trapping-rain-water, min-path-sum) and json from 8 to 12 constraints once both models aced the original tops; intermediate rungs were inserted where the curve jumped steeply; and N_BAND was raised 8 → 12 → 18 to tighten the CIs and McNemar power.