Lilac Labs

Qwen 3.6 & Gemma 4 on M3 Ultra Mac Studio

Wilson Nguyen

The release of Qwen 3.6 and Gemma 4 by Alibaba and Google demonstrates considerable progress in developing smaller open-source models to replace their more expensive closed-source counterparts, with evaluation scores matching those of previous-generation SOTA models. Despite their small sizes, these models still demand substantial compute to run on consumer-grade hardware, often forcing aggressive quantization, reduced context windows, and higher latencies.

Though they are difficult to run today, our hypothesis is that consumers will soon be able to run these models without significant workarounds. This experiment maps out latency (pp, prefill speed, and tg, decode speed) and resource consumption (memory usage) for these models on an M3 Ultra Mac Studio, serving as a baseline for future experiments. We will not focus on evaluation scores, as that work will be done separately. This specific variant of the Mac Studio was chosen for its entry-level pricing and hardware specifications, making it a natural stepping stone for anyone looking to run their own personal agents. Our experiment tests the following combinations:

  • 4 models (2 Gemma, 2 Qwen):
    • Gemma-4-31B-it (dense, 30.7 B params, sliding-window attention)
    • Gemma-4-26B-A4B-it (MoE, 25.2 B total / 3.8 B active, sliding-window attention)
    • Qwen3.6-27B (dense, 27.8 B params, linear+full hybrid attention)
    • Qwen3.6-35B-A3B (MoE, 34.7 B total / 2.9 B active, linear+full hybrid attention)
  • 2 inference backends: MLX, llama.cpp
  • 3 parameter quantizations (KV-cache remains BF16): BF16, 8-bit, 4-bit
  • 6 context windows: 4K → 256K

1. Setup#

Our experiment setup involves a single M3 Ultra Mac Studio along with open-source software.

1.1 Hardware#

| Field | Value |
| --- | --- |
| Chip | Apple M3 Ultra |
| OS | macOS 26.4.1 (Darwin 25.4.0) |
| RAM | 96 GB unified (LPDDR5) |
| Memory bandwidth | 819 GB/s |
| GPU cores | 60 |
| GPU compute | 21.47 TFLOPS BF16 |
| Disk at start | 845 GB free |
| MSRP as of 05/13/2026 | $3,999.00 |
Table 1. Hardware configuration of the M3 Ultra Mac Studio.

1.2 Benchmark matrix#

| Axis | Values | Count |
| --- | --- | --- |
| Model | Gemma-4-31B-it, Gemma-4-26B-A4B-it, Qwen3.6-27B, Qwen3.6-35B-A3B | 4 |
| Quantization | BF16 (orig), 8-bit (Q8_0 / MLX 8b), 4-bit (Q4_K_M / MLX 4b) | 3 |
| Backend | MLX (mlx-community), llama.cpp (GGUF) | 2 |
| Context | 4K, 16K, 32K, 64K, 128K, 256K | 6 |
Table 2. Combinations tested.
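
The sweep is simply the Cartesian product of these four axes. A minimal sketch of how it can be enumerated (the token counts behind the context labels are the usual powers of two):

```python
from itertools import product

MODELS = ["Gemma-4-31B-it", "Gemma-4-26B-A4B-it", "Qwen3.6-27B", "Qwen3.6-35B-A3B"]
QUANTS = ["bf16", "8bit", "4bit"]
BACKENDS = ["mlx", "llama.cpp"]
CONTEXTS = [4_096, 16_384, 32_768, 65_536, 131_072, 262_144]  # 4K ... 256K

# Each <model, quantization, backend, context> combination is one isolated test.
matrix = list(product(MODELS, QUANTS, BACKENDS, CONTEXTS))
print(len(matrix))  # 4 * 3 * 2 * 6 = 144
```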

1.3 Software versions#

| Software | Version |
| --- | --- |
| llama.cpp (llama-bench) | build b9090 (Homebrew) |
| ggml | 0.11.0 |
| MLX (Python) | 0.31.2 |
| mlx-lm | 0.31.3 |
| Python (venv) | 3.13.13 |
Table 3. Software used along with their versions.

The quantized variants of the models were produced with the following tools (a scripted sketch appears after the list):

  • MLX: mlx_lm.convert -q --q-bits {4,8}
  • GGUF: convert_hf_to_gguf.py + llama-quantize
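
As a rough illustration, the conversions can be scripted end to end; the sketch below simply shells out to those tools (paths and output names are placeholders, and flags should be double-checked against your installed mlx-lm and llama.cpp versions):

```python
import subprocess

HF_MODEL = "path/to/hf-checkpoint"  # placeholder: local Hugging Face model directory

# MLX: one quantized copy per bit-width, same command as the bullet above.
for bits in (4, 8):
    subprocess.run(
        ["mlx_lm.convert", "--hf-path", HF_MODEL, "-q", "--q-bits", str(bits),
         "--mlx-path", f"mlx-{bits}bit"],
        check=True,
    )

# GGUF: export a full-precision GGUF first, then quantize it with llama-quantize.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", HF_MODEL, "--outfile", "model-bf16.gguf"],
    check=True,
)
for preset in ("Q8_0", "Q4_K_M"):
    subprocess.run(
        ["llama-quantize", "model-bf16.gguf", f"model-{preset.lower()}.gguf", preset],
        check=True,
    )
```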

1.4 Architecture#

| Model | FFN | Attention layers | Total params | Active params |
| --- | --- | --- | --- | --- |
| Gemma-4-31B-it | Dense | 50 sliding / 10 full | 30.7 B | 30.7 B |
| Qwen3.6-27B | Dense | 48 linear / 16 full | 27.8 B | 27.8 B |
| Gemma-4-26B-A4B-it | MoE | 25 sliding / 5 full | 25.2 B | 3.8 B |
| Qwen3.6-35B-A3B | MoE | 30 linear / 10 full | 34.7 B | 2.9 B |
Table 4. Architectural composition of the four models under test, read from each model's Hugging Face config.json and model card. Layer counts list attention types, with a sliding window of 1024 tokens for Gemma models.
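
The attention mix matters for memory later on: full-attention layers cache keys and values for every token in the context, sliding-window layers cap their cache at the window (1024 tokens for Gemma), and linear-attention layers keep a fixed-size state that does not grow with context. A back-of-the-envelope KV-cache estimator, using placeholder head counts and dimensions rather than the real config.json values:

```python
def kv_cache_bytes(ctx, full_layers, sliding_layers=0, window=1024,
                   kv_heads=8, head_dim=128, dtype_bytes=2):
    """Rough BF16 KV-cache size in bytes. kv_heads and head_dim are
    placeholders, not the real config.json values; linear-attention
    layers are omitted because their state does not grow with context."""
    per_token_per_layer = 2 * kv_heads * head_dim * dtype_bytes  # K and V
    full = full_layers * ctx * per_token_per_layer
    sliding = sliding_layers * min(ctx, window) * per_token_per_layer
    return full + sliding

# Example: a Gemma-4-31B-it-like layout (10 full + 50 sliding layers) at 256K.
# Only the 10 full-attention layers keep growing past the 1024-token window.
print(kv_cache_bytes(262_144, full_layers=10, sliding_layers=50) / 1e9, "GB")
```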

2. Method#

To minimize potential contamination, each test (one <model, quantization, backend, context> combination) was measured in isolation, with no state carried over from the previous test. Each test proceeds as follows:

  1. Start a fresh process. No Python state, allocator cache, or model weights are carried over from the previous test.
  2. Load the model and tokenizer into memory.
  3. Run one untimed warmup pass. This discards JIT-compile cost, kernel preparation, and first-touch cache misses before measurement begins.
  4. Measure prefill in a separate backend invocation (-p ctx -n 0), repeating five times to record pp tok/s.
  5. Measure decode in a separate backend invocation (-p 0 -n 128 -d ctx), repeating five times to record tg tok/s.
  6. Record per-repetition timings, peak Metal memory, and process resident set size.
  7. Report the median over the five measured runs; standard deviation is tracked alongside.
  8. Exit the process so the next test starts from a clean slate.

Prefill and decode run as separate backend invocations so neither phase contaminates the other's timing. Each test records the exact model file used so that a re-run hits the same bytes.
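
For the llama.cpp side, a minimal sketch of that per-test driver, assuming llama-bench's JSON output and the flags quoted above (field names and flag spellings should be checked against your build); the MLX side follows the same pattern with mlx-lm's generation APIs:

```python
import json
import statistics
import subprocess

def bench_median(model_path: str, extra: list[str], reps: int = 5) -> float:
    """Invoke llama-bench `reps` separate times and return the median tok/s."""
    samples = []
    for _ in range(reps):
        out = subprocess.run(
            ["llama-bench", "-m", model_path, "-r", "1", "-o", "json", *extra],
            check=True, capture_output=True, text=True,
        ).stdout
        samples.append(json.loads(out)[0]["avg_ts"])  # one test entry per invocation
    return statistics.median(samples)

def run_test(model_path: str, ctx: int) -> dict:
    """One isolated <model, context> measurement: prefill, then decode."""
    pp = bench_median(model_path, ["-p", str(ctx), "-n", "0"])               # prefill only
    tg = bench_median(model_path, ["-p", "0", "-n", "128", "-d", str(ctx)])  # decode at depth = ctx
    return {"model": model_path, "ctx": ctx, "pp_tps": pp, "tg_tps": tg}
```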

3. Results#

3.1 4-bit MLX#

To establish a baseline, we started by testing each model with 4-bit quantization on the MLX backend. Figures 1–3 plot the three measured metrics across all context sizes.

[Figure 1 plot: prefill throughput vs. context window, 4-bit MLX, one line per model]
Figure 1. Prefill throughput (pp tok/s) vs. context window, 4-bit MLX. The two MoE models (pink + violet) prefill ~7× faster than the two dense models (amber + emerald) at short context, narrowing to ~4× at 256K.
| Model | 4K | 16K | 32K | 64K | 128K | 256K | Δ 4K → 256K |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-4-31B-it 4b MLX | 259.5 | 243.6 | 226.4 | 195.5 | 152.4 | 104.7 | −60% |
| Qwen3.6-27B 4b MLX | 329.5 | 315.0 | 296.0 | 260.0 | 205.4 | 143.4 | −56% |
| Gemma-4-26B-A4B-it 4b MLX | 1991.5 | 1795.2 | 1595.6 | 1285.8 | **897.8** | **557.2** | −72% |
| Qwen3.6-35B-A3B 4b MLX | **2432.3** | **2156.6** | **1832.6** | **1375.8** | 870.8 | 500.0 | −79% |
Table 6. Prefill throughput (pp tok/s) across context windows. The Δ column reports relative change between the 4K and 256K endpoints. Bold entries are the per-context-column maxima.
[Figure 2 plot: decode throughput vs. context window, 4-bit MLX, one line per model]
Figure 2. Decode throughput (tg tok/s) vs. context window, 4-bit MLX. MoE leads at every context; dense decode runs at roughly one-third the MoE rate at short context, reflecting the bandwidth-versus-dispatch tradeoff discussed under Figure 5 below.
| Model | 4K | 16K | 32K | 64K | 128K | 256K | Δ 4K → 256K |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-4-31B-it 4b MLX | 30.5 | 27.1 | 23.7 | 19.1 | 13.6 | 8.6 | −72% |
| Qwen3.6-27B 4b MLX | 38.1 | 35.2 | 32.2 | 28.2 | 22.5 | 15.8 | −59% |
| Gemma-4-26B-A4B-it 4b MLX | 105.4 | 88.3 | 74.6 | 55.5 | 35.9 | 21.5 | −80% |
| Qwen3.6-35B-A3B 4b MLX | **109.6** | **98.0** | **89.8** | **76.5** | **60.1** | **41.7** | −62% |
Table 7. Decode throughput (tg tok/s) across context windows. The Δ column reports relative change between the 4K and 256K endpoints. Bold entries are the per-context-column maxima.
[Figure 3 plot: peak resident memory vs. context window, 4-bit MLX, one line per model]
Figure 3. Peak resident memory (% of 96 GB) vs. context window, 4-bit MLX. Gemma-4-26B-A4B-it has the smallest footprint thanks to sliding-window attention; Gemma-4-31B-it dense climbs steepest because its full-attention layers' KV-cache grows linearly with context.
| Model | 4K | 16K | 32K | 64K | 128K | 256K | Δ 4K → 256K |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-4-31B-it 4b MLX | 21.5 | 24.0 | 27.6 | 34.9 | 49.5 | 78.5 | +265% |
| Qwen3.6-27B 4b MLX | 18.5 | 20.6 | 23.2 | 28.6 | 39.8 | 62.1 | +236% |
| Gemma-4-26B-A4B-it 4b MLX | **16.2** | **17.3** | **18.7** | **21.6** | **27.6** | **39.7** | +145% |
| Qwen3.6-35B-A3B 4b MLX | 21.5 | 22.6 | 24.0 | 26.9 | 32.7 | 44.5 | +107% |
Table 8. Peak resident memory (% of 96 GB unified memory) across context windows. The Δ column reports relative change between the 4K and 256K endpoints. Bold entries are the per-context-column minima (lower = smaller footprint).
| Model | pp @ 4K | pp @ 256K (Δ) | tg @ 4K | tg @ 256K (Δ) | RAM @ 4K | RAM @ 256K (Δ) |
| --- | --- | --- | --- | --- | --- | --- |
| Gemma-4-31B-it 4b MLX | 259.5 | 104.7 (−60%) | 30.5 | 8.6 (−72%) | 21.5% | 78.5% (+265%) |
| Qwen3.6-27B 4b MLX | 329.5 | 143.4 (−56%) | 38.1 | 15.8 (−59%) | 18.5% | 62.1% (+236%) |
| Gemma-4-26B-A4B-it 4b MLX | 1991.5 | **557.2 (−72%)** | 105.4 | 21.5 (−80%) | **16.2%** | **39.7% (+145%)** |
| Qwen3.6-35B-A3B 4b MLX | **2432.3** | 500.0 (−79%) | **109.6** | **41.7 (−62%)** | 21.5% | 44.5% (+107%) |
Table 5. The 4-bit MLX row per model at the two context endpoints. Prefill (pp) and decode (tg) throughput are reported in tok/s; resident memory as a percentage of the 96 GB unified memory. The parenthetical Δ in the 256K columns reports the relative change vs. the respective 4K baseline. Bold entries are the per-column best.

At a glance, the MoE models post significantly higher prefill and decode speeds than their dense counterparts, with the Qwen MoE leading decode at every context size. The two MoE models show similar prefill speeds, with Qwen ahead at context windows ≤ 64K and closely trailing Gemma at context windows > 64K.
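
The dense-versus-MoE gap called out in Figure 1's caption falls straight out of the endpoints in Table 6:

```python
# Endpoints from Table 6: fastest MoE vs. fastest dense prefill, 4-bit MLX.
moe_4k, dense_4k = 2432.3, 329.5
moe_256k, dense_256k = 557.2, 143.4
print(round(moe_4k / dense_4k, 1))      # ~7.4x at 4K
print(round(moe_256k / dense_256k, 1))  # ~3.9x at 256K
```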

3.2 Full experiment#

Following the initial 4-bit MLX subset, we ran the experiment across the entire matrix to get the results below. We've omitted results where the configuration exhausted the Mac Studio's memory, either terminating the test program or failing to allocate.

[Figure 4 plot: prefill throughput vs. context window, one panel per model, six lines per panel for quantization × backend]
Figure 4. Prefill throughput (pp tok/s) vs. context window. Shared log y-axis across panels, so vertical position is comparable between panels. Quant curves cluster within each panel, consistent with compute-bound prefill (with Qwen3.6-35B-A3B BF16 MLX at 128K as a near-cap exception, sitting ~17% below the 4-bit/8-bit lines). MoE (bottom row) prefills ~7× faster than dense (top row) at short context, narrowing to ~4× at 256K. Markers: ○ 4-bit, ◻ 8-bit, △ BF16. Solid + filled = MLX, dashed + hollow = GGUF.
| Model | 4K | 16K | 32K | 64K | 128K | 256K |
| --- | --- | --- | --- | --- | --- | --- |
| Gemma-4-31B-it 4-bit MLX | 259.5 | 243.6 | 226.4 | 195.5 | 152.4 | 104.7 |
| Gemma-4-31B-it 4-bit GGUF | 253.7 | 226.6 | 197.9 | 157.7 | 111.6 | – |
| Gemma-4-31B-it 8-bit MLX | 257.4 | 242.3 | 225.4 | 194.7 | 151.7 | – |
| Gemma-4-31B-it 8-bit GGUF | 263.8 | 234.5 | 203.9 | 161.4 | 113.3 | – |
| Gemma-4-31B-it BF16 MLX | 261.5 | 246.7 | 228.9 | 195.6 | – | – |
| Gemma-4-31B-it BF16 GGUF | 268.9 | 238.7 | 207.0 | 163.4 | 114.1 | – |
| Qwen3.6-27B 4-bit MLX | 329.5 | 315.0 | 296.0 | 260.0 | 205.4 | 143.4 |
| Qwen3.6-27B 4-bit GGUF | 302.3 | 267.8 | 261.9 | 226.0 | 175.8 | 120.2 |
| Qwen3.6-27B 8-bit MLX | 325.7 | 312.0 | 293.4 | 257.9 | 204.0 | 142.8 |
| Qwen3.6-27B 8-bit GGUF | 315.3 | 296.3 | 271.6 | 233.2 | 180.0 | 122.3 |
| Qwen3.6-27B BF16 MLX | 330.2 | 317.1 | 298.1 | 261.4 | 206.0 | – |
| Qwen3.6-27B BF16 GGUF | 318.1 | 298.4 | 273.3 | 234.5 | 180.4 | 122.0 |
| Gemma-4-26B-A4B-it 4-bit MLX | 1991.5 | 1795.2 | 1595.6 | 1285.8 | 897.8 | 557.2 |
| Gemma-4-26B-A4B-it 4-bit GGUF | 1652.8 | 1363.7 | 1118.5 | 816.6 | 526.2 | 305.0 |
| Gemma-4-26B-A4B-it 8-bit MLX | 1961.0 | 1775.2 | 1579.3 | 1275.7 | 893.6 | 555.4 |
| Gemma-4-26B-A4B-it 8-bit GGUF | 1682.0 | 1384.2 | 1131.9 | 823.8 | 529.8 | 305.7 |
| Gemma-4-26B-A4B-it BF16 MLX | 1946.5 | 1768.0 | 1575.1 | 1271.6 | 889.4 | 553.9 |
| Gemma-4-26B-A4B-it BF16 GGUF | 1700.6 | 1403.7 | 1145.4 | 831.7 | 534.3 | 308.6 |
| Qwen3.6-35B-A3B 4-bit MLX | 2432.3 | 2156.6 | 1832.6 | 1375.8 | 870.8 | 500.0 |
| Qwen3.6-35B-A3B 4-bit GGUF | 1826.3 | 1555.5 | 1293.5 | 962.3 | 621.3 | 355.6 |
| Qwen3.6-35B-A3B 8-bit MLX | 2406.9 | 2138.3 | 1819.7 | 1368.1 | 867.0 | 498.6 |
| Qwen3.6-35B-A3B 8-bit GGUF | 1730.5 | 1481.9 | 1244.5 | 936.2 | 612.2 | 351.5 |
| Qwen3.6-35B-A3B BF16 MLX | 2459.9 | 2189.4 | 1853.0 | 1381.6 | 722.6 | – |
| Qwen3.6-35B-A3B BF16 GGUF | 1736.8 | 1485.0 | 1246.5 | 937.2 | 610.4 | 350.9 |
Table 9. Full prefill throughput (pp tok/s) for every model × quant × backend × context window. Tests marked – are aborted measurements (weights + KV-cache exceeded the Mac Studio's memory constraints).
[Figure 5 plot: decode throughput vs. context window, one panel per model, six lines per panel for quantization × backend]
Figure 5. Decode throughput (tg tok/s) vs. context window. Shared log y-axis. Dotted lines on dense panels mark the BF16 bandwidth ceiling (819 GB/s ÷ bytes-per-token at 4K). Measured BF16 lines sit just below, since dense decode is bandwidth-limited. MoE ceilings are off-chart at ~125+ tok/s; MoE hits dispatch overhead before bandwidth. Quant spread is wide on dense, narrow on MoE.
| Model | 4K | 16K | 32K | 64K | 128K | 256K |
| --- | --- | --- | --- | --- | --- | --- |
| Gemma-4-31B-it 4-bit MLX | 30.5 | 27.1 | 23.7 | 19.1 | 13.6 | 8.6 |
| Gemma-4-31B-it 4-bit GGUF | 26.2 | 24.1 | 21.8 | 18.1 | 12.4 | – |
| Gemma-4-31B-it 8-bit MLX | 18.5 | 17.2 | 15.7 | 13.5 | 10.5 | – |
| Gemma-4-31B-it 8-bit GGUF | 19.0 | 17.8 | 16.5 | 14.3 | 10.5 | – |
| Gemma-4-31B-it BF16 MLX | 11.0 | 10.5 | 9.9 | 9.0 | – | – |
| Gemma-4-31B-it BF16 GGUF | 11.2 | 10.8 | 10.3 | 9.4 | 7.6 | – |
| Qwen3.6-27B 4-bit MLX | 38.1 | 35.2 | 32.2 | 28.2 | 22.5 | 15.8 |
| Qwen3.6-27B 4-bit GGUF | 26.4 | 25.1 | 23.4 | 20.1 | 14.5 | 8.4 |
| Qwen3.6-27B 8-bit MLX | 22.9 | 21.7 | 20.4 | 18.8 | 16.1 | 12.4 |
| Qwen3.6-27B 8-bit GGUF | 20.8 | 19.7 | 18.7 | 16.6 | 12.7 | 7.6 |
| Qwen3.6-27B BF16 MLX | 13.2 | 12.9 | 12.5 | 11.8 | 10.7 | – |
| Qwen3.6-27B BF16 GGUF | 12.7 | 12.3 | 11.9 | 11.0 | 9.1 | 6.2 |
| Gemma-4-26B-A4B-it 4-bit MLX | 105.4 | 88.3 | 74.6 | 55.5 | 35.9 | 21.5 |
| Gemma-4-26B-A4B-it 4-bit GGUF | 87.1 | 82.9 | 75.1 | 64.4 | 49.6 | 33.6 |
| Gemma-4-26B-A4B-it 8-bit MLX | 83.5 | 73.0 | 62.8 | 48.6 | 32.8 | 20.4 |
| Gemma-4-26B-A4B-it 8-bit GGUF | 79.5 | 76.0 | 69.1 | 59.7 | 46.6 | 31.9 |
| Gemma-4-26B-A4B-it BF16 MLX | 64.7 | 59.1 | 52.1 | 42.1 | 29.7 | 19.1 |
| Gemma-4-26B-A4B-it BF16 GGUF | 62.3 | 59.7 | 55.6 | 49.5 | 40.2 | 29.3 |
| Qwen3.6-35B-A3B 4-bit MLX | 109.6 | 98.0 | 89.8 | 76.5 | 60.1 | 41.7 |
| Qwen3.6-35B-A3B 4-bit GGUF | 81.1 | 75.6 | 70.3 | 61.0 | 45.9 | 26.9 |
| Qwen3.6-35B-A3B 8-bit MLX | 88.2 | 81.3 | 74.8 | 65.1 | 53.1 | 38.2 |
| Qwen3.6-35B-A3B 8-bit GGUF | 73.4 | 69.1 | 64.3 | 56.5 | 43.4 | 26.0 |
| Qwen3.6-35B-A3B BF16 MLX | 73.6 | 68.2 | 63.6 | 56.3 | 46.9 | – |
| Qwen3.6-35B-A3B BF16 GGUF | 59.8 | 56.9 | 53.7 | 48.0 | 38.3 | 23.9 |
Table 10. Full decode throughput (tg tok/s) for every model × quant × backend × context window. Tests marked – are aborted measurements (weights + KV-cache exceeded the Mac Studio's memory constraints).

Dense BF16 is pinned to the bandwidth wall. For dense models at BF16, decode throughput is predominantly bandwidth-bound: tg ≤ bandwidth / (weight_bytes + kv_cache_bytes), since every decode step streams the full weights plus the entire KV-cache. For example, with Gemma-4-31B-it at 4K, each step streams 62.9 GB (61.4 GB weights + 1.5 GB KV-cache) against the 819 GB/s peak, giving a 13.0 tok/s ceiling, which the recorded 11.0 tok/s closely matches.
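
Plugging in the numbers quoted above as a quick sanity check:

```python
BANDWIDTH_GBPS = 819.0  # M3 Ultra peak memory bandwidth

def decode_ceiling(weight_gb: float, kv_cache_gb: float) -> float:
    """Upper bound on decode tok/s when each step streams the full weights
    plus the entire KV-cache exactly once."""
    return BANDWIDTH_GBPS / (weight_gb + kv_cache_gb)

# Gemma-4-31B-it BF16 at 4K: 61.4 GB weights + 1.5 GB KV-cache.
print(round(decode_ceiling(61.4, 1.5), 1))  # ~13.0 tok/s ceiling; measured 11.0 tok/s
```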

Below the bandwidth wall, fixed overheads take over. Both quantization and MoE leave the Mac Studio's bandwidth underutilized, so dispatch overheads dominate the remaining time. Take, for instance, the following at 4K context (a quick decomposition in code follows the list):

  • Gemma-4-31B-it 4b GGUF streams 20.18 GB per decode step (18.67 GB weights + 1.51 GB KV-cache) against the 819 GB/s peak, giving a 40.6 tok/s ceiling; the recorded speed is 26.2 tok/s (64.5%)
  • Qwen3.6-35B-A3B BF16 MLX streams 5.8 GB per decode step (2.9 B active params × 2 bytes for BF16) against the 819 GB/s peak, giving a 141 tok/s ceiling; the recorded speed is 73.6 tok/s (52%)
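
Turning that arithmetic around gives a rough estimate of the fixed per-step overhead (treating memory traffic and fixed costs as strictly serial, which is a simplification):

```python
BANDWIDTH_GBPS = 819.0

def per_step_overhead_ms(bytes_per_step_gb: float, measured_tps: float) -> float:
    """Measured step time minus the time the memory traffic alone would need."""
    step_ms = 1000.0 / measured_tps
    bandwidth_ms = 1000.0 * bytes_per_step_gb / BANDWIDTH_GBPS
    return step_ms - bandwidth_ms

# Qwen3.6-35B-A3B BF16 MLX at 4K: 5.8 GB/step, 73.6 tok/s measured.
print(round(per_step_overhead_ms(5.8, 73.6), 1))    # ~6.5 ms of non-bandwidth time per token
# Gemma-4-31B-it 4-bit GGUF at 4K: 20.18 GB/step, 26.2 tok/s measured.
print(round(per_step_overhead_ms(20.18, 26.2), 1))  # ~13.5 ms
```
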
[Figure 6 plot: peak resident memory vs. context window, one panel per model, six lines per panel for quantization × backend]
Figure 6. Peak resident memory (% of 96 GB) vs. context window. Dotted reference lines mark ~85% and 100% of unified memory. MLX = Metal heap peak; GGUF = model file + KV-cache estimate with OS paging (not directly comparable at the high end). Markers: ○ 4-bit, ◻ 8-bit, △ BF16. Solid + filled = MLX, dashed + hollow = GGUF.
| Model | 4K | 16K | 32K | 64K | 128K | 256K |
| --- | --- | --- | --- | --- | --- | --- |
| Gemma-4-31B-it 4-bit MLX | 21.5 | 24.0 | 27.6 | 34.9 | 49.5 | 78.5 |
| Gemma-4-31B-it 4-bit GGUF | 20.2 | 22.2 | 24.9 | 30.3 | 41.0 | – |
| Gemma-4-31B-it 8-bit MLX | 36.9 | 39.3 | 42.9 | 50.3 | 64.9 | – |
| Gemma-4-31B-it 8-bit GGUF | 34.1 | 36.1 | 38.8 | 44.2 | 54.9 | – |
| Gemma-4-31B-it BF16 MLX | 65.4 | 68.1 | 71.7 | 79.0 | – | – |
| Gemma-4-31B-it BF16 GGUF | 62.9 | 64.9 | 67.6 | 73.0 | 83.7 | – |
| Qwen3.6-27B 4-bit MLX | 18.5 | 20.6 | 23.2 | 28.6 | 39.8 | 62.1 |
| Qwen3.6-27B 4-bit GGUF | 17.9 | 21.1 | 25.4 | 34.0 | 51.2 | 85.5 |
| Qwen3.6-27B 8-bit MLX | 32.0 | 34.0 | 36.7 | 42.1 | 53.2 | 75.5 |
| Qwen3.6-27B 8-bit GGUF | 29.7 | 32.9 | 37.2 | 45.8 | 62.9 | 97.3 |
| Qwen3.6-27B BF16 MLX | 57.1 | 59.1 | 61.8 | 67.3 | 78.5 | – |
| Qwen3.6-27B BF16 GGUF | 54.9 | 58.1 | 62.4 | 71.0 | 88.2 | – |
| Gemma-4-26B-A4B-it 4-bit MLX | 16.2 | 17.3 | 18.7 | 21.6 | 27.6 | 39.7 |
| Gemma-4-26B-A4B-it 4-bit GGUF | 17.2 | 17.7 | 18.3 | 19.7 | 22.4 | 27.7 |
| Gemma-4-26B-A4B-it 8-bit MLX | 28.6 | 29.7 | 31.1 | 33.9 | 39.9 | 52.0 |
| Gemma-4-26B-A4B-it 8-bit GGUF | 27.2 | 27.7 | 28.4 | 29.7 | 32.4 | 37.8 |
| Gemma-4-26B-A4B-it BF16 MLX | 52.2 | 53.3 | 54.7 | 57.6 | 63.6 | 75.7 |
| Gemma-4-26B-A4B-it BF16 GGUF | 50.9 | 51.4 | 52.0 | 53.4 | 56.1 | 61.4 |
| Qwen3.6-35B-A3B 4-bit MLX | 21.5 | 22.6 | 24.0 | 26.9 | 32.7 | 44.5 |
| Qwen3.6-35B-A3B 4-bit GGUF | 21.5 | 22.5 | 23.8 | 26.5 | 31.9 | 42.6 |
| Qwen3.6-35B-A3B 8-bit MLX | 38.8 | 39.9 | 41.3 | 44.2 | 50.0 | 61.8 |
| Qwen3.6-35B-A3B 8-bit GGUF | 37.2 | 38.2 | 39.6 | 42.3 | 47.6 | 58.4 |
| Qwen3.6-35B-A3B BF16 MLX | 71.3 | 72.4 | 73.8 | 76.6 | 82.4 | – |
| Qwen3.6-35B-A3B BF16 GGUF | 69.7 | 70.7 | 72.1 | 74.7 | 80.1 | 90.8 |
Table 11. Full peak resident memory (% of 96 GB unified memory) for every model × quant × backend × context window. Tests marked – are aborted measurements (weights + KV-cache exceeded the Mac Studio's memory constraints).

MoE's active-parameter advantage doesn't save BF16 at 256K on MLX. Qwen3.6-35B-A3B's 34.7 B total params still sit resident (only 2.9 B stream per decode token), and that is enough to fail allocation at 256K, alongside every dense BF16 MLX run.

MLX vs. GGUF skew depends on the workload. MLX generally beats GGUF on both prefill and decode. With long-context sliding-window MoE, however, decode flips and GGUF dominates (e.g. Gemma-4-26B-A4B-it 4b at 256K: 33.6 tok/s GGUF vs. 21.5 tok/s MLX, +56% throughput). The flip is specific to sliding-window attention: it bounds each layer's KV-cache to a fixed 1024-token window regardless of context length, so the per-step byte stream stays tiny (small active-parameter footprint plus bounded KV), and GGUF's mmap'd weight path amortizes well under that profile. The other MoE in the sweep (Qwen3.6-35B-A3B, linear+full hybrid attention) doesn't get the same effect; at 256K decode it's 41.7 tok/s MLX vs. 26.9 tok/s GGUF, with MLX still ahead by 55%.

4. Takeaways#

| Model | Active / Total params | Decode (BF16 @ 4K) ceiling → measured | Prefill (BF16 @ 4K) ceiling → measured | RAM @ 256K (4b MLX): weights + KV |
| --- | --- | --- | --- | --- |
| Gemma-4-31B-it | 30.7 / 30.7 B | 13.0 → 11.0 (85%) | 350 → 261.5 (75%) | 18 GB + 57 GB (78.5%) |
| Qwen3.6-27B | 27.8 / 27.8 B | 14.4 → 13.2 (92%) | 386 → 330.2 (86%) | 17 GB + 43 GB (62.1%) |
| Gemma-4-26B-A4B-it | 3.8 / 25.2 B | 101 → 64.7 (64%) | 2,826 → 1,946.5 (69%) | 15 GB + 23 GB (39.7%) |
| Qwen3.6-35B-A3B | 2.9 / 34.7 B | 141 → 73.6 (52%) | 3,702 → 2,459.9 (66%) | 21 GB + 22 GB (44.5%) |
Table 12. Theoretical ceilings versus measured throughput. Decode and prefill columns use BF16 MLX at 4K context: decode ceiling = 819 GB/s ÷ bytes-per-step, prefill ceiling = 21.47 TFLOPS ÷ (2 × active params), and the percentage in parentheses is measured / ceiling; the gap is dispatch overhead, kernel launch latency, and other fixed per-step costs. RAM @ 256K is measured at the experiment's maximum context using 4-bit MLX (the practical default); the breakdown is total resident weights (total params × ~0.6 bytes for Q4_K_M / MLX 4b) plus everything else: KV-cache, per-step buffers, and allocator overhead, which can't be cleanly separated from the measurement. All four models complete at this setting. Active / Total params differs for MoE since full weights stay resident while only active params stream per decode token.
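
For reference, the ceiling columns can be recomputed directly from the hardware specs in Table 1 and the parameter counts in Table 4:

```python
BANDWIDTH_GBPS = 819.0   # peak memory bandwidth (Table 1)
COMPUTE_TFLOPS = 21.47   # BF16 GPU compute (Table 1)

def decode_ceiling(bytes_per_step_gb: float) -> float:
    # One decode step streams the full weights plus the KV-cache once.
    return BANDWIDTH_GBPS / bytes_per_step_gb

def prefill_ceiling(active_params_b: float) -> float:
    # Prefill is compute-bound: ~2 FLOPs per active parameter per token.
    return COMPUTE_TFLOPS * 1e12 / (2 * active_params_b * 1e9)

# Qwen3.6-35B-A3B: 2.9 B active params, ~5.8 GB streamed per BF16 decode step at 4K.
print(round(decode_ceiling(5.8)))    # ~141 tok/s
print(round(prefill_ceiling(2.9)))   # ~3702 tok/s
```
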
  • MLX outpaces GGUF in nearly every test. It wins on both prefill and decode, with the lone exception of long-context sliding-window MoE decode.

  • Dispatch overhead becomes a tax as bytes-per-step shrink. The dense BF16 models' decode throughput sits at 85–92% of the bandwidth ceiling, whereas MoE BF16 decode sits at only 52–64% of its (much higher) ceiling, as fixed per-step costs (kernel dispatch, launch latency) begin to dominate.