Lilac Labs

Qwen 3.6 & Gemma 4 on M3 Ultra Mac Studio

Wilson Nguyen

The release of Qwen 3.6 and Gemma 4 by Alibaba and Google demonstrates considerable progress in developing smaller open-source models to replace their more expensive closed-source counterparts, with evaluation scores matching those of previous-generation SOTA models. Despite their small sizes, these models still demand substantial compute to run on consumer-grade hardware, often forcing aggressive quantization, reduced context windows, and higher latencies.

Though they are difficult to run today, our hypothesis is that consumers will soon be able to run these models without significant workarounds. This experiment maps out latency (pp, prefill speed, and tg, decode speed) and resource consumption (memory usage) for these models on an M3 Ultra Mac Studio, serving as a baseline for future experiments. We will not focus on evaluation scores, as that work will be done separately. This specific variant of the Mac Studio was chosen for its entry-level pricing and hardware specifications, making it a natural stepping stone for anyone looking to run their own personal agents. Our experiment tests the following combinations:

  • 4 models (2 Gemma, 2 Qwen):
    • Gemma-4-31B-it (dense, 30.7 B params, sliding-window attention)
    • Gemma-4-26B-A4B-it (MoE, 25.2 B total / 3.8 B active, sliding-window attention)
    • Qwen3.6-27B (dense, 27.8 B params, linear+full hybrid attention)
    • Qwen3.6-35B-A3B (MoE, 34.7 B total / 2.9 B active, linear+full hybrid attention)
  • 2 inference backends: MLX, llama.cpp
  • 3 parameter quantizations (KV-cache remains BF16): BF16, 8-bit, 4-bit
  • 6 context windows: 4K → 256K

1. Setup#

Our experiment setup involves a single M3 Ultra Mac Studio along with open-source software.

1.1 Hardware#

| Field | Value |
| --- | --- |
| Chip | Apple M3 Ultra |
| OS | macOS 26.4.1 (Darwin 25.4.0) |
| RAM | 96 GB unified (LPDDR5) |
| Memory bandwidth | 819 GB/s |
| GPU cores | 60 |
| GPU compute | 21.47 TFLOPS BF16 |
| Disk at start | 845 GB free |
| MSRP as of 05/13/2026 | $3,999.00 |
Table 1. Hardware configuration of the M3 Ultra Mac Studio.

1.2 Benchmark matrix#

| Axis | Values | Count |
| --- | --- | --- |
| Model | Gemma-4-31B-it, Gemma-4-26B-A4B-it, Qwen3.6-27B, Qwen3.6-35B-A3B | 4 |
| Quantization | BF16 (orig), 8-bit (Q8_0 / MLX 8b), 4-bit (Q4_K_M / MLX 4b) | 3 |
| Backend | MLX (mlx-community), llama.cpp (GGUF) | 2 |
| Context | 4K, 16K, 32K, 64K, 128K, 256K | 6 |
Table 2. Combinations tested.
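
The sweep is simply the Cartesian product of these four axes. A minimal sketch of how it can be enumerated (the token counts behind the context labels are the usual powers of two):

```python
from itertools import product

MODELS = ["Gemma-4-31B-it", "Gemma-4-26B-A4B-it", "Qwen3.6-27B", "Qwen3.6-35B-A3B"]
QUANTS = ["bf16", "8bit", "4bit"]
BACKENDS = ["mlx", "llama.cpp"]
CONTEXTS = [4_096, 16_384, 32_768, 65_536, 131_072, 262_144]  # 4K ... 256K

# Each <model, quantization, backend, context> combination is one isolated test.
matrix = list(product(MODELS, QUANTS, BACKENDS, CONTEXTS))
print(len(matrix))  # 4 * 3 * 2 * 6 = 144
```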

1.3 Software versions#

| Software | Version |
| --- | --- |
| llama.cpp (llama-bench) | build b9090 (Homebrew) |
| ggml | 0.11.0 |
| MLX (Python) | 0.31.2 |
| mlx-lm | 0.31.3 |
| Python (venv) | 3.13.13 |
Table 3. Software used along with their versions.

The quantized variants of the models were produced with the following tools (a scripted sketch appears after the list):

  • MLX: mlx_lm.convert -q --q-bits {4,8}
  • GGUF: convert_hf_to_gguf.py + llama-quantize
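
As a rough illustration, the conversions can be scripted end to end; the sketch below simply shells out to those tools (paths and output names are placeholders, and flags should be double-checked against your installed mlx-lm and llama.cpp versions):

```python
import subprocess

HF_MODEL = "path/to/hf-checkpoint"  # placeholder: local Hugging Face model directory

# MLX: one quantized copy per bit-width, same command as the bullet above.
for bits in (4, 8):
    subprocess.run(
        ["mlx_lm.convert", "--hf-path", HF_MODEL, "-q", "--q-bits", str(bits),
         "--mlx-path", f"mlx-{bits}bit"],
        check=True,
    )

# GGUF: export a full-precision GGUF first, then quantize it with llama-quantize.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", HF_MODEL, "--outfile", "model-bf16.gguf"],
    check=True,
)
for preset in ("Q8_0", "Q4_K_M"):
    subprocess.run(
        ["llama-quantize", "model-bf16.gguf", f"model-{preset.lower()}.gguf", preset],
        check=True,
    )
```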

1.4 Architecture#

| Model | FFN | Attention layers | Total params | Active params |
| --- | --- | --- | --- | --- |
| Gemma-4-31B-it | Dense | 50 sliding / 10 full | 30.7 B | 30.7 B |
| Qwen3.6-27B | Dense | 48 linear / 16 full | 27.8 B | 27.8 B |
| Gemma-4-26B-A4B-it | MoE | 25 sliding / 5 full | 25.2 B | 3.8 B |
| Qwen3.6-35B-A3B | MoE | 30 linear / 10 full | 34.7 B | 2.9 B |
Table 4. Architectural composition of the four models under test, read from each model's Hugging Face config.json and model card. Layer counts list attention types, with a sliding window of 1024 tokens for Gemma models.
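
The attention mix matters for memory later on: full-attention layers cache keys and values for every token in the context, sliding-window layers cap their cache at the window (1024 tokens for Gemma), and linear-attention layers keep a fixed-size state that does not grow with context. A back-of-the-envelope KV-cache estimator, using placeholder head counts and dimensions rather than the real config.json values:

```python
def kv_cache_bytes(ctx, full_layers, sliding_layers=0, window=1024,
                   kv_heads=8, head_dim=128, dtype_bytes=2):
    """Rough BF16 KV-cache size in bytes. kv_heads and head_dim are
    placeholders, not the real config.json values; linear-attention
    layers are omitted because their state does not grow with context."""
    per_token_per_layer = 2 * kv_heads * head_dim * dtype_bytes  # K and V
    full = full_layers * ctx * per_token_per_layer
    sliding = sliding_layers * min(ctx, window) * per_token_per_layer
    return full + sliding

# Example: a Gemma-4-31B-it-like layout (10 full + 50 sliding layers) at 256K.
# Only the 10 full-attention layers keep growing past the 1024-token window.
print(kv_cache_bytes(262_144, full_layers=10, sliding_layers=50) / 1e9, "GB")
```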

2. Method#

To minimize potential contamination, each test (one <model, quantization, backend, context> combination) was measured in isolation, with no state carried over from the previous test. Each test proceeds as follows:

  1. Start a fresh process. No Python state, allocator cache, or model weights are carried over from the previous test.
  2. Load the model and tokenizer into memory.
  3. Run one untimed warmup pass. This discards JIT-compile cost, kernel preparation, and first-touch cache misses before measurement begins.
  4. Measure prefill in a separate backend invocation (-p ctx -n 0), repeating five times to record pp tok/s.
  5. Measure decode in a separate backend invocation (-p 0 -n 128 -d ctx), repeating five times to record tg tok/s.
  6. Record per-repetition timings, peak Metal memory, and process resident set size.
  7. Report the median over the five measured runs; standard deviation is tracked alongside.
  8. Exit the process so the next test starts from a clean slate.

Prefill and decode run as separate backend invocations so neither phase contaminates the other's timing. Each test records the exact model file used so that a re-run hits the same bytes.
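
For the llama.cpp side, a minimal sketch of that per-test driver, assuming llama-bench's JSON output and the flags quoted above (field names and flag spellings should be checked against your build); the MLX side follows the same pattern with mlx-lm's generation APIs:

```python
import json
import statistics
import subprocess

def bench_median(model_path: str, extra: list[str], reps: int = 5) -> float:
    """Invoke llama-bench `reps` separate times and return the median tok/s."""
    samples = []
    for _ in range(reps):
        out = subprocess.run(
            ["llama-bench", "-m", model_path, "-r", "1", "-o", "json", *extra],
            check=True, capture_output=True, text=True,
        ).stdout
        samples.append(json.loads(out)[0]["avg_ts"])  # one test entry per invocation
    return statistics.median(samples)

def run_test(model_path: str, ctx: int) -> dict:
    """One isolated <model, context> measurement: prefill, then decode."""
    pp = bench_median(model_path, ["-p", str(ctx), "-n", "0"])               # prefill only
    tg = bench_median(model_path, ["-p", "0", "-n", "128", "-d", str(ctx)])  # decode at depth = ctx
    return {"model": model_path, "ctx": ctx, "pp_tps": pp, "tg_tps": tg}
```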

3. Results#

3.1 4-bit MLX#

To establish a baseline, we started by testing each model with 4-bit quantization on the MLX backend. Figures 1–3 plot the three measured metrics across all context sizes.

[Figure 1 plot: prefill throughput vs. context window, 4-bit MLX, one line per model]
Figure 1. Prefill throughput (pp tok/s) vs. context window, 4-bit MLX. The two MoE models (pink + violet) prefill ~7× faster than the two dense models (amber + emerald) at short context, narrowing to ~4× at 256K.
| Model | 4K | 16K | 32K | 64K | 128K | 256K | Δ 4K → 256K |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-4-31B-it 4b MLX | 259.5 | 243.6 | 226.4 | 195.5 | 152.4 | 104.7 | −60% |
| Qwen3.6-27B 4b MLX | 329.5 | 315.0 | 296.0 | 260.0 | 205.4 | 143.4 | −56% |
| Gemma-4-26B-A4B-it 4b MLX | 1991.5 | 1795.2 | 1595.6 | 1285.8 | **897.8** | **557.2** | −72% |
| Qwen3.6-35B-A3B 4b MLX | **2432.3** | **2156.6** | **1832.6** | **1375.8** | 870.8 | 500.0 | −79% |
Table 6. Prefill throughput (pp tok/s) across context windows. The Δ column reports relative change between the 4K and 256K endpoints. Bold entries are the per-context-column maxima.
[Figure 2 plot: decode throughput vs. context window, 4-bit MLX, one line per model]
Figure 2. Decode throughput (tg tok/s) vs. context window, 4-bit MLX. MoE leads at every context; dense decode runs at roughly one-third the MoE rate at short context, reflecting the bandwidth-versus-dispatch tradeoff discussed under Figure 5 below.
| Model | 4K | 16K | 32K | 64K | 128K | 256K | Δ 4K → 256K |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-4-31B-it 4b MLX | 30.5 | 27.1 | 23.7 | 19.1 | 13.6 | 8.6 | −72% |
| Qwen3.6-27B 4b MLX | 38.1 | 35.2 | 32.2 | 28.2 | 22.5 | 15.8 | −59% |
| Gemma-4-26B-A4B-it 4b MLX | 105.4 | 88.3 | 74.6 | 55.5 | 35.9 | 21.5 | −80% |
| Qwen3.6-35B-A3B 4b MLX | **109.6** | **98.0** | **89.8** | **76.5** | **60.1** | **41.7** | −62% |
Table 7. Decode throughput (tg tok/s) across context windows. The Δ column reports relative change between the 4K and 256K endpoints. Bold entries are the per-context-column maxima.
[Figure 3 plot: peak resident memory vs. context window, 4-bit MLX, one line per model]
Figure 3. Peak resident memory (% of 96 GB) vs. context window, 4-bit MLX. Gemma-4-26B-A4B-it has the smallest footprint thanks to sliding-window attention; Gemma-4-31B-it dense climbs steepest because its full-attention layers' KV-cache grows linearly with context.
| Model | 4K | 16K | 32K | 64K | 128K | 256K | Δ 4K → 256K |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-4-31B-it 4b MLX | 21.5 | 24.0 | 27.6 | 34.9 | 49.5 | 78.5 | +265% |
| Qwen3.6-27B 4b MLX | 18.5 | 20.6 | 23.2 | 28.6 | 39.8 | 62.1 | +236% |
| Gemma-4-26B-A4B-it 4b MLX | **16.2** | **17.3** | **18.7** | **21.6** | **27.6** | **39.7** | +145% |
| Qwen3.6-35B-A3B 4b MLX | 21.5 | 22.6 | 24.0 | 26.9 | 32.7 | 44.5 | +107% |
Table 8. Peak resident memory (% of 96 GB unified memory) across context windows. The Δ column reports relative change between the 4K and 256K endpoints. Bold entries are the per-context-column minima (lower = smaller footprint).
| Model | pp @ 4K | pp @ 256K (Δ) | tg @ 4K | tg @ 256K (Δ) | RAM @ 4K | RAM @ 256K (Δ) |
| --- | --- | --- | --- | --- | --- | --- |
| Gemma-4-31B-it 4b MLX | 259.5 | 104.7 (−60%) | 30.5 | 8.6 (−72%) | 21.5% | 78.5% (+265%) |
| Qwen3.6-27B 4b MLX | 329.5 | 143.4 (−56%) | 38.1 | 15.8 (−59%) | 18.5% | 62.1% (+236%) |
| Gemma-4-26B-A4B-it 4b MLX | 1991.5 | **557.2 (−72%)** | 105.4 | 21.5 (−80%) | **16.2%** | **39.7% (+145%)** |
| Qwen3.6-35B-A3B 4b MLX | **2432.3** | 500.0 (−79%) | **109.6** | **41.7 (−62%)** | 21.5% | 44.5% (+107%) |
Table 5. The 4-bit MLX row per model at the two context endpoints. Prefill (pp) and decode (tg) throughput are reported in tok/s; resident memory as a percentage of the 96 GB unified memory. The parenthetical Δ in the 256K columns reports the relative change vs. the respective 4K baseline. Bold entries are the per-column best.

At a glance, the MoE models post significantly higher prefill and decode speeds than their dense counterparts, with the Qwen MoE leading decode at every context size. The two MoE models show similar prefill speeds, with Qwen ahead at context windows ≤ 64K and closely trailing Gemma at context windows > 64K.
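
The dense-versus-MoE gap called out in Figure 1's caption falls straight out of the endpoints in Table 6:

```python
# Endpoints from Table 6: fastest MoE vs. fastest dense prefill, 4-bit MLX.
moe_4k, dense_4k = 2432.3, 329.5
moe_256k, dense_256k = 557.2, 143.4
print(round(moe_4k / dense_4k, 1))      # ~7.4x at 4K
print(round(moe_256k / dense_256k, 1))  # ~3.9x at 256K
```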

3.2 Full experiment#

Following the initial 4-bit MLX subset, we ran the experiment across the entire matrix to get the results below. We've omitted results where the configuration exhausted the Mac Studio's memory, either terminating the test program or failing to allocate.

[Figure 4 plot: prefill throughput vs. context window, one panel per model, six lines per panel for quantization × backend]
Figure 4. Prefill throughput (pp tok/s) vs. context window. Shared log y-axis across panels, so vertical position is comparable between panels. Quant curves cluster within each panel, consistent with compute-bound prefill (with Qwen3.6-35B-A3B BF16 MLX at 128K as a near-cap exception, sitting ~17% below the 4-bit/8-bit lines). MoE (bottom row) prefills ~7× faster than dense (top row) at short context, narrowing to ~4× at 256K. Markers: ○ 4-bit, ◻ 8-bit, △ BF16. Solid + filled = MLX, dashed + hollow = GGUF.
| Model | 4K | 16K | 32K | 64K | 128K | 256K |
| --- | --- | --- | --- | --- | --- | --- |
| Gemma-4-31B-it 4-bit MLX | 259.5 | 243.6 | 226.4 | 195.5 | 152.4 | 104.7 |
| Gemma-4-31B-it 4-bit GGUF | 253.7 | 226.6 | 197.9 | 157.7 | 111.6 | – |
| Gemma-4-31B-it 8-bit MLX | 257.4 | 242.3 | 225.4 | 194.7 | 151.7 | – |
| Gemma-4-31B-it 8-bit GGUF | 263.8 | 234.5 | 203.9 | 161.4 | 113.3 | – |
| Gemma-4-31B-it BF16 MLX | 261.5 | 246.7 | 228.9 | 195.6 | – | – |
| Gemma-4-31B-it BF16 GGUF | 268.9 | 238.7 | 207.0 | 163.4 | 114.1 | – |
| Qwen3.6-27B 4-bit MLX | 329.5 | 315.0 | 296.0 | 260.0 | 205.4 | 143.4 |
| Qwen3.6-27B 4-bit GGUF | 302.3 | 267.8 | 261.9 | 226.0 | 175.8 | 120.2 |
| Qwen3.6-27B 8-bit MLX | 325.7 | 312.0 | 293.4 | 257.9 | 204.0 | 142.8 |
| Qwen3.6-27B 8-bit GGUF | 315.3 | 296.3 | 271.6 | 233.2 | 180.0 | 122.3 |
| Qwen3.6-27B BF16 MLX | 330.2 | 317.1 | 298.1 | 261.4 | 206.0 | – |
| Qwen3.6-27B BF16 GGUF | 318.1 | 298.4 | 273.3 | 234.5 | 180.4 | 122.0 |
| Gemma-4-26B-A4B-it 4-bit MLX | 1991.5 | 1795.2 | 1595.6 | 1285.8 | 897.8 | 557.2 |
| Gemma-4-26B-A4B-it 4-bit GGUF | 1652.8 | 1363.7 | 1118.5 | 816.6 | 526.2 | 305.0 |
| Gemma-4-26B-A4B-it 8-bit MLX | 1961.0 | 1775.2 | 1579.3 | 1275.7 | 893.6 | 555.4 |
| Gemma-4-26B-A4B-it 8-bit GGUF | 1682.0 | 1384.2 | 1131.9 | 823.8 | 529.8 | 305.7 |
| Gemma-4-26B-A4B-it BF16 MLX | 1946.5 | 1768.0 | 1575.1 | 1271.6 | 889.4 | 553.9 |
| Gemma-4-26B-A4B-it BF16 GGUF | 1700.6 | 1403.7 | 1145.4 | 831.7 | 534.3 | 308.6 |
| Qwen3.6-35B-A3B 4-bit MLX | 2432.3 | 2156.6 | 1832.6 | 1375.8 | 870.8 | 500.0 |
| Qwen3.6-35B-A3B 4-bit GGUF | 1826.3 | 1555.5 | 1293.5 | 962.3 | 621.3 | 355.6 |
| Qwen3.6-35B-A3B 8-bit MLX | 2406.9 | 2138.3 | 1819.7 | 1368.1 | 867.0 | 498.6 |
| Qwen3.6-35B-A3B 8-bit GGUF | 1730.5 | 1481.9 | 1244.5 | 936.2 | 612.2 | 351.5 |
| Qwen3.6-35B-A3B BF16 MLX | 2459.9 | 2189.4 | 1853.0 | 1381.6 | 722.6 | – |
| Qwen3.6-35B-A3B BF16 GGUF | 1736.8 | 1485.0 | 1246.5 | 937.2 | 610.4 | 350.9 |
Table 9. Full prefill throughput (pp tok/s) for every model × quant × backend × context window. Tests marked – are aborted measurements (weights + KV-cache exceeded the Mac Studio's memory constraints).
[Figure 5 plot: decode throughput vs. context window, one panel per model, six lines per panel for quantization × backend]
Figure 5. Decode throughput (tg tok/s) vs. context window. Shared log y-axis. Dotted lines on dense panels mark the BF16 bandwidth ceiling (819 GB/s ÷ bytes-per-token at 4K). Measured BF16 lines sit just below, since dense decode is bandwidth-limited. MoE ceilings are off-chart at ~125+ tok/s; MoE hits dispatch overhead before bandwidth. Quant spread is wide on dense, narrow on MoE.
| Model | 4K | 16K | 32K | 64K | 128K | 256K |
| --- | --- | --- | --- | --- | --- | --- |
| Gemma-4-31B-it 4-bit MLX | 30.5 | 27.1 | 23.7 | 19.1 | 13.6 | 8.6 |
| Gemma-4-31B-it 4-bit GGUF | 26.2 | 24.1 | 21.8 | 18.1 | 12.4 | – |
| Gemma-4-31B-it 8-bit MLX | 18.5 | 17.2 | 15.7 | 13.5 | 10.5 | – |
| Gemma-4-31B-it 8-bit GGUF | 19.0 | 17.8 | 16.5 | 14.3 | 10.5 | – |
| Gemma-4-31B-it BF16 MLX | 11.0 | 10.5 | 9.9 | 9.0 | – | – |
| Gemma-4-31B-it BF16 GGUF | 11.2 | 10.8 | 10.3 | 9.4 | 7.6 | – |
| Qwen3.6-27B 4-bit MLX | 38.1 | 35.2 | 32.2 | 28.2 | 22.5 | 15.8 |
| Qwen3.6-27B 4-bit GGUF | 26.4 | 25.1 | 23.4 | 20.1 | 14.5 | 8.4 |
| Qwen3.6-27B 8-bit MLX | 22.9 | 21.7 | 20.4 | 18.8 | 16.1 | 12.4 |
| Qwen3.6-27B 8-bit GGUF | 20.8 | 19.7 | 18.7 | 16.6 | 12.7 | 7.6 |
| Qwen3.6-27B BF16 MLX | 13.2 | 12.9 | 12.5 | 11.8 | 10.7 | – |
| Qwen3.6-27B BF16 GGUF | 12.7 | 12.3 | 11.9 | 11.0 | 9.1 | 6.2 |
| Gemma-4-26B-A4B-it 4-bit MLX | 105.4 | 88.3 | 74.6 | 55.5 | 35.9 | 21.5 |
| Gemma-4-26B-A4B-it 4-bit GGUF | 87.1 | 82.9 | 75.1 | 64.4 | 49.6 | 33.6 |
| Gemma-4-26B-A4B-it 8-bit MLX | 83.5 | 73.0 | 62.8 | 48.6 | 32.8 | 20.4 |
| Gemma-4-26B-A4B-it 8-bit GGUF | 79.5 | 76.0 | 69.1 | 59.7 | 46.6 | 31.9 |
| Gemma-4-26B-A4B-it BF16 MLX | 64.7 | 59.1 | 52.1 | 42.1 | 29.7 | 19.1 |
| Gemma-4-26B-A4B-it BF16 GGUF | 62.3 | 59.7 | 55.6 | 49.5 | 40.2 | 29.3 |
| Qwen3.6-35B-A3B 4-bit MLX | 109.6 | 98.0 | 89.8 | 76.5 | 60.1 | 41.7 |
| Qwen3.6-35B-A3B 4-bit GGUF | 81.1 | 75.6 | 70.3 | 61.0 | 45.9 | 26.9 |
| Qwen3.6-35B-A3B 8-bit MLX | 88.2 | 81.3 | 74.8 | 65.1 | 53.1 | 38.2 |
| Qwen3.6-35B-A3B 8-bit GGUF | 73.4 | 69.1 | 64.3 | 56.5 | 43.4 | 26.0 |
| Qwen3.6-35B-A3B BF16 MLX | 73.6 | 68.2 | 63.6 | 56.3 | 46.9 | – |
| Qwen3.6-35B-A3B BF16 GGUF | 59.8 | 56.9 | 53.7 | 48.0 | 38.3 | 23.9 |
Table 10. Full decode throughput (tg tok/s) for every model × quant × backend × context window. Tests marked – are aborted measurements (weights + KV-cache exceeded the Mac Studio's memory constraints).

Dense BF16 is pinned to the bandwidth wall. For dense models at BF16, decode throughput is predominantly bandwidth-bound: tg ≤ bandwidth / (weight_bytes + kv_cache_bytes), since every decode step streams the full weights plus the entire KV-cache. For example, with Gemma-4-31B-it at 4K, each step streams 62.9 GB (61.4 GB weights + 1.5 GB KV-cache) against the 819 GB/s peak, giving a 13.0 tok/s ceiling, which the recorded 11.0 tok/s closely matches.
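
Plugging in the numbers quoted above as a quick sanity check:

```python
BANDWIDTH_GBPS = 819.0  # M3 Ultra peak memory bandwidth

def decode_ceiling(weight_gb: float, kv_cache_gb: float) -> float:
    """Upper bound on decode tok/s when each step streams the full weights
    plus the entire KV-cache exactly once."""
    return BANDWIDTH_GBPS / (weight_gb + kv_cache_gb)

# Gemma-4-31B-it BF16 at 4K: 61.4 GB weights + 1.5 GB KV-cache.
print(round(decode_ceiling(61.4, 1.5), 1))  # ~13.0 tok/s ceiling; measured 11.0 tok/s
```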

Below the bandwidth wall, fixed overheads take over. Both quantization and MoE leave the Mac Studio's bandwidth underutilized, so dispatch overheads dominate the remaining time. Take, for instance, the following at 4K context (a quick decomposition in code follows the list):

  • Gemma-4-31B-it 4b GGUF streams 20.18 GB per decode step (18.67 GB weights + 1.51 GB KV-cache) against the 819 GB/s peak, giving a 40.6 tok/s ceiling; the recorded speed is 26.2 tok/s (64.5%)
  • Qwen3.6-35B-A3B BF16 MLX streams 5.8 GB per decode step (2.9 B active params × 2 bytes for BF16) against the 819 GB/s peak, giving a 141 tok/s ceiling; the recorded speed is 73.6 tok/s (52%)
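
Turning that arithmetic around gives a rough estimate of the fixed per-step overhead (treating memory traffic and fixed costs as strictly serial, which is a simplification):

```python
BANDWIDTH_GBPS = 819.0

def per_step_overhead_ms(bytes_per_step_gb: float, measured_tps: float) -> float:
    """Measured step time minus the time the memory traffic alone would need."""
    step_ms = 1000.0 / measured_tps
    bandwidth_ms = 1000.0 * bytes_per_step_gb / BANDWIDTH_GBPS
    return step_ms - bandwidth_ms

# Qwen3.6-35B-A3B BF16 MLX at 4K: 5.8 GB/step, 73.6 tok/s measured.
print(round(per_step_overhead_ms(5.8, 73.6), 1))    # ~6.5 ms of non-bandwidth time per token
# Gemma-4-31B-it 4-bit GGUF at 4K: 20.18 GB/step, 26.2 tok/s measured.
print(round(per_step_overhead_ms(20.18, 26.2), 1))  # ~13.5 ms
```
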
[Figure 6 plot: peak resident memory vs. context window, one panel per model, six lines per panel for quantization × backend]
Figure 6. Peak resident memory (% of 96 GB) vs. context window. Dotted reference lines mark ~85% and 100% of unified memory. MLX = Metal heap peak; GGUF = model file + KV-cache estimate with OS paging (not directly comparable at the high end). Markers: ○ 4-bit, ◻ 8-bit, △ BF16. Solid + filled = MLX, dashed + hollow = GGUF.
| Model | 4K | 16K | 32K | 64K | 128K | 256K |
| --- | --- | --- | --- | --- | --- | --- |
| Gemma-4-31B-it 4-bit MLX | 21.5 | 24.0 | 27.6 | 34.9 | 49.5 | 78.5 |
| Gemma-4-31B-it 4-bit GGUF | 20.2 | 22.2 | 24.9 | 30.3 | 41.0 | – |
| Gemma-4-31B-it 8-bit MLX | 36.9 | 39.3 | 42.9 | 50.3 | 64.9 | – |
| Gemma-4-31B-it 8-bit GGUF | 34.1 | 36.1 | 38.8 | 44.2 | 54.9 | – |
| Gemma-4-31B-it BF16 MLX | 65.4 | 68.1 | 71.7 | 79.0 | – | – |
| Gemma-4-31B-it BF16 GGUF | 62.9 | 64.9 | 67.6 | 73.0 | 83.7 | – |
| Qwen3.6-27B 4-bit MLX | 18.5 | 20.6 | 23.2 | 28.6 | 39.8 | 62.1 |
| Qwen3.6-27B 4-bit GGUF | 17.9 | 21.1 | 25.4 | 34.0 | 51.2 | 85.5 |
| Qwen3.6-27B 8-bit MLX | 32.0 | 34.0 | 36.7 | 42.1 | 53.2 | 75.5 |
| Qwen3.6-27B 8-bit GGUF | 29.7 | 32.9 | 37.2 | 45.8 | 62.9 | 97.3 |
| Qwen3.6-27B BF16 MLX | 57.1 | 59.1 | 61.8 | 67.3 | 78.5 | – |
| Qwen3.6-27B BF16 GGUF | 54.9 | 58.1 | 62.4 | 71.0 | 88.2 | – |
| Gemma-4-26B-A4B-it 4-bit MLX | 16.2 | 17.3 | 18.7 | 21.6 | 27.6 | 39.7 |
| Gemma-4-26B-A4B-it 4-bit GGUF | 17.2 | 17.7 | 18.3 | 19.7 | 22.4 | 27.7 |
| Gemma-4-26B-A4B-it 8-bit MLX | 28.6 | 29.7 | 31.1 | 33.9 | 39.9 | 52.0 |
| Gemma-4-26B-A4B-it 8-bit GGUF | 27.2 | 27.7 | 28.4 | 29.7 | 32.4 | 37.8 |
| Gemma-4-26B-A4B-it BF16 MLX | 52.2 | 53.3 | 54.7 | 57.6 | 63.6 | 75.7 |
| Gemma-4-26B-A4B-it BF16 GGUF | 50.9 | 51.4 | 52.0 | 53.4 | 56.1 | 61.4 |
| Qwen3.6-35B-A3B 4-bit MLX | 21.5 | 22.6 | 24.0 | 26.9 | 32.7 | 44.5 |
| Qwen3.6-35B-A3B 4-bit GGUF | 21.5 | 22.5 | 23.8 | 26.5 | 31.9 | 42.6 |
| Qwen3.6-35B-A3B 8-bit MLX | 38.8 | 39.9 | 41.3 | 44.2 | 50.0 | 61.8 |
| Qwen3.6-35B-A3B 8-bit GGUF | 37.2 | 38.2 | 39.6 | 42.3 | 47.6 | 58.4 |
| Qwen3.6-35B-A3B BF16 MLX | 71.3 | 72.4 | 73.8 | 76.6 | 82.4 | – |
| Qwen3.6-35B-A3B BF16 GGUF | 69.7 | 70.7 | 72.1 | 74.7 | 80.1 | 90.8 |
Table 11. Full peak resident memory (% of 96 GB unified memory) for every model × quant × backend × context window. Tests marked – are aborted measurements (weights + KV-cache exceeded the Mac Studio's memory constraints).

MoE's active-parameter advantage doesn't save BF16 at 256K on MLX. Qwen3.6-35B-A3B's 34.7 B total params still sit resident (only 2.9 B stream per decode token), and that is enough to fail allocation at 256K, alongside every dense BF16 MLX run.

MLX vs. GGUF skew depends on the workload. MLX generally beats GGUF on both prefill and decode. With long-context sliding-window MoE, however, decode flips and GGUF dominates (e.g. Gemma-4-26B-A4B-it 4b at 256K: 33.6 tok/s GGUF vs. 21.5 tok/s MLX, +56% throughput). The flip is specific to sliding-window attention: it bounds each layer's KV-cache to a fixed 1024-token window regardless of context length, so the per-step byte stream stays tiny (small active-parameter footprint plus bounded KV), and GGUF's mmap'd weight path amortizes well under that profile. The other MoE in the sweep (Qwen3.6-35B-A3B, linear+full hybrid attention) doesn't get the same effect; at 256K decode it's 41.7 tok/s MLX vs. 26.9 tok/s GGUF, with MLX still ahead by 55%.

4. Takeaways#

| Model | Active / Total params | Decode (BF16 @ 4K) ceiling → measured | Prefill (BF16 @ 4K) ceiling → measured | RAM @ 256K (4b MLX): weights + KV |
| --- | --- | --- | --- | --- |
| Gemma-4-31B-it | 30.7 / 30.7 B | 13.0 → 11.0 (85%) | 350 → 261.5 (75%) | 18 GB + 57 GB (78.5%) |
| Qwen3.6-27B | 27.8 / 27.8 B | 14.4 → 13.2 (92%) | 386 → 330.2 (86%) | 17 GB + 43 GB (62.1%) |
| Gemma-4-26B-A4B-it | 3.8 / 25.2 B | 101 → 64.7 (64%) | 2,826 → 1,946.5 (69%) | 15 GB + 23 GB (39.7%) |
| Qwen3.6-35B-A3B | 2.9 / 34.7 B | 141 → 73.6 (52%) | 3,702 → 2,459.9 (66%) | 21 GB + 22 GB (44.5%) |
Table 12. Theoretical ceilings versus measured throughput. Decode and prefill columns use BF16 MLX at 4K context: decode ceiling = 819 GB/s ÷ bytes-per-step, prefill ceiling = 21.47 TFLOPS ÷ (2 × active params), and the percentage in parentheses is measured / ceiling; the gap is dispatch overhead, kernel launch latency, and other fixed per-step costs. RAM @ 256K is measured at the experiment's maximum context using 4-bit MLX (the practical default); the breakdown is total resident weights (total params × ~0.6 bytes for Q4_K_M / MLX 4b) plus everything else: KV-cache, per-step buffers, and allocator overhead, which can't be cleanly separated from the measurement. All four models complete at this setting. Active / Total params differs for MoE since full weights stay resident while only active params stream per decode token.
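
For reference, the ceiling columns can be recomputed directly from the hardware specs in Table 1 and the parameter counts in Table 4:

```python
BANDWIDTH_GBPS = 819.0   # peak memory bandwidth (Table 1)
COMPUTE_TFLOPS = 21.47   # BF16 GPU compute (Table 1)

def decode_ceiling(bytes_per_step_gb: float) -> float:
    # One decode step streams the full weights plus the KV-cache once.
    return BANDWIDTH_GBPS / bytes_per_step_gb

def prefill_ceiling(active_params_b: float) -> float:
    # Prefill is compute-bound: ~2 FLOPs per active parameter per token.
    return COMPUTE_TFLOPS * 1e12 / (2 * active_params_b * 1e9)

# Qwen3.6-35B-A3B: 2.9 B active params, ~5.8 GB streamed per BF16 decode step at 4K.
print(round(decode_ceiling(5.8)))    # ~141 tok/s
print(round(prefill_ceiling(2.9)))   # ~3702 tok/s
```
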
  • MLX outpaces GGUF in nearly every test. It wins on both prefill and decode, with the lone exception of long-context sliding-window MoE decode.

  • Dispatch overhead becomes a tax as bytes-per-step shrink. The dense BF16 models' decode throughput sits at 85–92% of the bandwidth ceiling, whereas MoE BF16 decode sits at only 52–64% of its (much higher) ceiling, as fixed per-step costs (kernel dispatch, launch latency) begin to dominate.