Where do Gemma-4 E4B and 26B-A4B diverge?

An adaptive, difficulty-bisecting accuracy benchmark that pushes two co-resident Gemma-4 models past the point where easy tasks saturate, and locates the competence frontier and divergence band for each of four task families — using only objective, checkable graders.

Gemma-4 E4B · Q6_K vs Gemma-4 26B-A4B · Q6_K_XL
graded evaluations temperature 0 · paired identical items 146 min active probing endpoint trixi · llama.cpp generated

01 Headline

Exactly one family robustly separates the models: JSON constraint-emission, where 26B-A4B holds ~90–100% across d6–d12 while E4B falls to ~30–55% (McNemar p≈1×10⁻¹⁹). Code generation does not separate them at all (both 100% through 22 rungs incl. regex matching, Dijkstra, knapsack). The state-tracking and arith "divergences" that a naïve pass would have reported turned out to be token-truncation artifacts — once deconfounded, both models track state to d38 and arithmetic to d16 at parity. The real differentiator is emitting many simultaneous computed constraints, not raw multi-step reasoning. Click a card to jump to its detail.

02 Method

Both models always see the same generated item (deterministic seeds), enabling paired statistics. Every task family has a difficulty knob and an objective auto-grader — no LLM-as-judge.

03 Per-family results

Pass rate vs difficulty for each family. Solid blue = E4B, dashed orange = 26B-A4B; shaded ribbons are Wilson 95% CIs, the dotted grey line is the τ=70% competence threshold, and amber bands mark the contested difficulties. Hover any point for the exact rate, CI, and n.

04 Recommendation for the agent pipeline

Routing tied to the clean measured frontiers — not priors, and not the truncation artifacts. The rule that survives scrutiny: only structured multi-constraint output (the JSON family) reliably separates the models. Everything else — code synthesis, state tracking, arithmetic — is at parity across the full range we could measure cleanly, so default to E4B and reserve 26B-A4B specifically for roles that emit dense, computed, schema-bound output.

RoleDominant loadClosest familyE4B frontier26B frontierRoute to
ImplementerWrite a function from a speccode E4B — safe
TesterGenerate/execute checks, small algorithmscode E4B — safe
Critic / tool-caller (structured)Emit JSON with many computed/derived fieldsjson 26B-A4B
SupervisorTrack evolving task/agent statestate E4B — safe*
PlannerMulti-step constrained reasoningarith E4B — safe*
E4B and 26B-A4B are served concurrently with no swap penalty, so a router can send each call to the cheaper-sufficient model per-role at no reload cost. The clean win: only structured-output roles (Critic / tool-caller emitting many computed JSON fields) justify 26B-A4B — there E4B drops the computed constraints (sums, parity flags) while keeping the structure. Code, state-tracking, and arithmetic roles run safely on E4B. * The Supervisor/Planner verdicts reversed after deconfounding: the first (truncated) pass made them look like 26B-only roles; clean re-runs show parity to the hardest difficulty tested — a caution that these are bounded by what we could measure, not proof of parity at arbitrary scale.

05 Harness validation & corrections

The provided harness had never been run against the live endpoint. Validation surfaced four real bugs (fixed before any campaign data was trusted), plus one measurement confound caught during analysis.

Bug 1 — code grader silently failed the four easiest rungs. The grader invokes solve(*args) (correct for multi-arg rungs like glob/Levenshtein/Dijkstra), but rungs 1–4 encoded a single list argument unwrapped, so sum([1,2,3]) was called as solve(1,2,3), raised, and was counted as a fail. Every model scored 0/N on those rungs regardless of correctness.
Fix. Wrapped the single-list args so the call convention matches the grader.
Bug 2 — arithmetic was truncated at 32 tokens, biasing the verbose model. Both models show their working before answering; a 32-token cap cut them off mid-reasoning. The more verbose 26B-A4B was penalized harder (0.25 vs 0.38 at just four operators) — the test was measuring brevity, not arithmetic.
Fix. Raised the budget, reworded the prompt to allow working then require a trailing ANSWER:, and rewrote the grader to extract the final answer (ANSWER: → last = N → last integer).
Bug 3 — state grader fragility. A 48-token cap plus first-match register parsing could capture an intermediate register value instead of the final one.
Fix. Raised the budget and switched to last-match per register.
Bug 4 — broken family interleaving. The rotation used len(rows) % 4, but rows grow by two per paired item, so it only ever yielded 0 or 2 — the state and code families never led the rotation.
Fix. Rotated by paired-item count so all four families interleave; also added per-row logging of the raw model text (the harness logged none, yet failure-mode inspection requires it).
Confound caught in analysis — token-cap truncation (and what it overturned). A first pass showed dramatic state and arith divergence — but inspecting raw outputs revealed E4B's verbose step-by-step style was hitting the token cap before it emitted its final answer (state d30: E4B truncated on 18/18 items vs 26B on 4/18). That asymmetrically depressed the more-verbose E4B and manufactured the gap. Re-running state and arith at a 2048-token budget erased the state divergence entirely (both models now pass the hardest rungs, d34/d38, at ~100%) and collapsed arith to near-parity (both 18/18 at d16). Had the failure-mode read been skipped, three of four "divergences" would have been reported — two of them wrong. json is the one separation that survives clean scrutiny; code was always truncation-free. (Arith still carries residual E4B truncation above d16 — flagged on its panel.)

Mid-campaign deepening (Step 3): the code ladder was extended from 12 to 22 rungs (adding LCS, LIS, coin-change, word-break, knapsack, Dijkstra, regex matching, longest-unique-substring, trapping-rain-water, min-path-sum) and json from 8 to 12 constraints once both models aced the original tops; intermediate rungs were inserted where the curve jumped steeply; and N_BAND was raised 8 → 12 → 18 to tighten the CIs and McNemar power.