Qwen 3.6 & Gemma 4 on M3 Ultra Mac Studio
The release of Qwen 3.6 (Alibaba) and Gemma 4 (Google) demonstrates how far small open-source models have come as replacements for their more expensive closed-source counterparts, with evaluation scores matching previous-generation SOTA models. Despite their small sizes, these models still demand significant compute to run on consumer-grade hardware, typically forcing aggressive quantization, smaller context windows, and higher latencies.
Though they are difficult to run today, our hypothesis is that consumers will soon be able to run these models without significant workarounds. This experiment probes the bounds of throughput (pp, prefill speed, and tg, decode speed) and resource consumption (memory usage) for these models on an M3 Ultra Mac Studio, to serve as a baseline for future experiments. We will not focus on evaluation scores; that work will be done separately. This specific Mac Studio variant was chosen for its entry-level pricing and hardware specifications, making it a natural stepping stone for anyone looking to run their own personal agents. Our experiment tests the following combinations:
- 4 models (2 Gemma, 2 Qwen):
  - Gemma-4-31B-it (dense, 30.7 B params, sliding-window attention)
  - Gemma-4-26B-A4B-it (MoE, 25.2 B total / 3.8 B active, sliding-window attention)
  - Qwen3.6-27B (dense, 27.8 B params, linear+full hybrid attention)
  - Qwen3.6-35B-A3B (MoE, 34.7 B total / 2.9 B active, linear+full hybrid attention)
- 2 inference backends: MLX, llama.cpp
- 3 parameter quantizations (KV-cache remains BF16): BF16, 8-bit, 4-bit
- 6 context windows: 4K → 256K
1. Setup#
Our experiment setup involves a single M3 Ultra Mac Studio along with open-source software.
1.1 Hardware#
| Field | Value |
|---|---|
| Chip | Apple M3 Ultra |
| OS | macOS 26.4.1 (Darwin 25.4.0) |
| RAM | 96 GB unified (LPDDR5) |
| Memory bandwidth | 819 GB/s |
| GPU cores | 60 |
| GPU compute | 21.47 TFLOPS BF16 |
| Disk at start | 845 GB free |
| MSRP as of 05/13/2026 | $3,999.00 |
1.2 Benchmark matrix#
| Axis | Values | Count |
|---|---|---|
| Model | Gemma-4-31B-it, Gemma-4-26B-A4B-it, Qwen3.6-27B, Qwen3.6-35B-A3B | 4 |
| Quantization | BF16 (orig), 8-bit (Q8_0 / MLX 8b), 4-bit (Q4_K_M / MLX 4b) | 3 |
| Backend | MLX (mlx-community), llama.cpp (GGUF) | 2 |
| Context | 4K, 16K, 32K, 64K, 128K, 256K | 6 |
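For orientation, the sweep is simply the cross product of these four axes (144 candidate tests before out-of-memory configurations are dropped). A minimal sketch of the enumeration; the names below are illustrative, not the actual harness:

```python
from itertools import product

MODELS = ["Gemma-4-31B-it", "Gemma-4-26B-A4B-it", "Qwen3.6-27B", "Qwen3.6-35B-A3B"]
QUANTS = ["bf16", "8bit", "4bit"]          # KV-cache stays BF16 in all cases
BACKENDS = ["mlx", "llama.cpp"]
CONTEXTS = [4_096, 16_384, 32_768, 65_536, 131_072, 262_144]

# 4 x 3 x 2 x 6 = 144 combinations; configurations that cannot fit in 96 GB
# are dropped when the test itself fails to allocate.
tests = list(product(MODELS, QUANTS, BACKENDS, CONTEXTS))
print(len(tests))  # 144
```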
1.3 Software versions#
| Software | Version |
|---|---|
| llama.cpp (llama-bench) | build b9090 (Homebrew) |
| ggml | 0.11.0 |
| MLX (Python) | 0.31.2 |
| mlx-lm | 0.31.3 |
| Python (venv) | 3.13.13 |
The quantized variants of the models in this experiment were produced with (a sketch of both paths follows below):
- MLX: `mlx_lm.convert -q --q-bits {4,8}`
- GGUF: `convert_hf_to_gguf.py` + `llama-quantize`
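For reference, the two conversion paths look roughly like the following. This is a sketch only: the repo id and output paths are placeholders, and exact flags may differ across tool versions.

```python
import subprocess

HF_MODEL = "org/gemma-4-26b-a4b-it"              # placeholder HF repo id
LOCAL_CKPT = "checkpoints/gemma-4-26b-a4b-it"    # placeholder local HF checkpoint dir
OUT = "models/gemma-4-26b-a4b-it"                # placeholder output prefix

# MLX path: quantize directly from the Hugging Face checkpoint
# (4-bit shown; pass --q-bits 8 for the 8-bit variant).
subprocess.run([
    "mlx_lm.convert", "--hf-path", HF_MODEL,
    "--mlx-path", f"{OUT}-mlx-4bit", "-q", "--q-bits", "4",
], check=True)

# GGUF path: export a BF16 GGUF first, then requantize with llama-quantize.
subprocess.run([
    "python", "convert_hf_to_gguf.py", LOCAL_CKPT,
    "--outtype", "bf16", "--outfile", f"{OUT}-bf16.gguf",
], check=True)
subprocess.run([
    "llama-quantize", f"{OUT}-bf16.gguf", f"{OUT}-q4_k_m.gguf", "Q4_K_M",
], check=True)
```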
1.4 Architecture#
| Model | FFN | Attention layers | Total params | Active params |
|---|---|---|---|---|
| Gemma-4-31B-it | Dense | 50 sliding 10 full | 30.7 B | 30.7 B |
| Qwen3.6-27B | Dense | 48 linear 16 full | 27.8 B | 27.8 B |
| Gemma-4-26B-A4B-it | MoE | 25 sliding 5 full | 25.2 B | 3.8 B |
| Qwen3.6-35B-A3B | MoE | 30 linear 10 full | 34.7 B | 2.9 B |
Architecture details are taken from each model's config.json and model card. Layer counts list attention types, with a sliding window of 1024 tokens for the Gemma models.
2. Method#
To minimize potential contamination in our experiment, each test (one <model, quantization, backend, context>) was measured in isolation, with no state carried over from the previous test. The tests were done as follows:
- Start a fresh process. No Python state, allocator cache, or model weights are carried over from the previous test.
- Load the model and tokenizer into memory.
- Run one untimed warmup pass. This discards JIT-compile cost, kernel preparation, and first-touch cache misses before measurement begins.
- Measure prefill in a separate backend invocation (
-p ctx -n 0), repeating five times to recordpp tok/s. - Measure decode in a separate backend invocation (
-p 0 -n 128 -d ctx), repeating five times to recordtg tok/s. - Record per-repetition timings, peak Metal memory, and process resident set size.
- Report the median over the five measured runs; standard deviation is tracked alongside.
- Exit the process so the next test starts from a clean slate.
Prefill and decode run as separate backend invocations so neither phase contaminates the other's timing. Each test records the exact model file used so that a re-run hits the same bytes.
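As a concrete illustration of the llama.cpp side of this procedure, each test can be driven as its own llama-bench process. The flags below are real llama-bench options, the model path is a placeholder, and the MLX side follows the same pattern with mlx-lm. A minimal sketch:

```python
import subprocess

MODEL = "models/qwen3.6-35b-a3b-q4_k_m.gguf"   # placeholder model path
CTX = 65_536                                    # one point on the context axis
REPS = 5

# Prefill-only invocation: process CTX prompt tokens, generate nothing.
prefill = subprocess.run(
    ["llama-bench", "-m", MODEL, "-p", str(CTX), "-n", "0",
     "-r", str(REPS), "-o", "json"],
    capture_output=True, text=True, check=True,
)

# Decode-only invocation: generate 128 tokens with the KV-cache filled to CTX (-d).
decode = subprocess.run(
    ["llama-bench", "-m", MODEL, "-p", "0", "-n", "128", "-d", str(CTX),
     "-r", str(REPS), "-o", "json"],
    capture_output=True, text=True, check=True,
)

# Each invocation is a separate OS process, so weights, allocator caches, and
# Metal state are dropped before the next <model, quant, backend, context> test.
print(prefill.stdout)
print(decode.stdout)
```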
3. Results#
3.1 4-bit MLX#
As a basis for the rest of the experiment, we first tested each model at 4-bit quantization on the MLX backend. Figures 1–3 plot the three measured metrics across all context sizes.
Figure 1: Prefill throughput (pp tok/s) vs. context window, 4-bit MLX. The two MoE models (pink + violet) prefill ~7× faster than the two dense models (amber + emerald) at short context, narrowing to ~4× at 256K.
Data behind Figure 1:
| Model | 4K | 16K | 32K | 64K | 128K | 256K | Δ 4K → 256K |
|---|---|---|---|---|---|---|---|
| Gemma-4-31B-it 4b MLX | 259.5 | 243.6 | 226.4 | 195.5 | 152.4 | 104.7 | −60% |
| Qwen3.6-27B 4b MLX | 329.5 | 315.0 | 296.0 | 260.0 | 205.4 | 143.4 | −56% |
| Gemma-4-26B-A4B-it 4b MLX | 1991.5 | 1795.2 | 1595.6 | 1285.8 | 897.8 | 557.2 | −72% |
| Qwen3.6-35B-A3B 4b MLX | 2432.3 | 2156.6 | 1832.6 | 1375.8 | 870.8 | 500.0 | −79% |
Prefill throughput (pp tok/s) across context windows. The Δ column reports relative change between the 4K and 256K endpoints. Bold entries are the per-context-column maxima.
Figure 2: Decode throughput (tg tok/s) vs. context window, 4-bit MLX. MoE leads at every context; dense decode runs at roughly one-third the MoE rate at short context, reflecting the bandwidth-versus-dispatch tradeoff discussed under Figure 5 below.
Data behind Figure 2:
| Model | 4K | 16K | 32K | 64K | 128K | 256K | Δ 4K → 256K |
|---|---|---|---|---|---|---|---|
| Gemma-4-31B-it 4b MLX | 30.5 | 27.1 | 23.7 | 19.1 | 13.6 | 8.6 | −72% |
| Qwen3.6-27B 4b MLX | 38.1 | 35.2 | 32.2 | 28.2 | 22.5 | 15.8 | −59% |
| Gemma-4-26B-A4B-it 4b MLX | 105.4 | 88.3 | 74.6 | 55.5 | 35.9 | 21.5 | −80% |
| Qwen3.6-35B-A3B 4b MLX | 109.6 | 98.0 | 89.8 | 76.5 | 60.1 | 41.7 | −62% |
Decode throughput (tg tok/s) across context windows. The Δ column reports relative change between the 4K and 256K endpoints. Bold entries are the per-context-column maxima.
Figure 3: Resident memory (% of the 96 GB unified memory) vs. context window, 4-bit MLX.
Data behind Figure 3:
| Model | 4K | 16K | 32K | 64K | 128K | 256K | Δ 4K → 256K |
|---|---|---|---|---|---|---|---|
| Gemma-4-31B-it 4b MLX | 21.5 | 24.0 | 27.6 | 34.9 | 49.5 | 78.5 | +265% |
| Qwen3.6-27B 4b MLX | 18.5 | 20.6 | 23.2 | 28.6 | 39.8 | 62.1 | +236% |
| Gemma-4-26B-A4B-it 4b MLX | 16.2 | 17.3 | 18.7 | 21.6 | 27.6 | 39.7 | +145% |
| Qwen3.6-35B-A3B 4b MLX | 21.5 | 22.6 | 24.0 | 26.9 | 32.7 | 44.5 | +107% |
Resident memory (% of the 96 GB unified memory) across context windows. The Δ column reports relative change between the 4K and 256K endpoints.
| Model | pp @ 4K | pp @ 256K (Δ) | tg @ 4K | tg @ 256K (Δ) | RAM @ 4K | RAM @ 256K (Δ) |
|---|---|---|---|---|---|---|
| Gemma-4-31B-it 4b MLX | 259.5 | 104.7 (−60%) | 30.5 | 8.6 (−72%) | 21.5% | 78.5% (+265%) |
| Qwen3.6-27B 4b MLX | 329.5 | 143.4 (−56%) | 38.1 | 15.8 (−59%) | 18.5% | 62.1% (+236%) |
| Gemma-4-26B-A4B-it 4b MLX | 1991.5 | 557.2 (−72%) | 105.4 | 21.5 (−80%) | 16.2% | 39.7% (+145%) |
| Qwen3.6-35B-A3B 4b MLX | 2432.3 | 500.0 (−79%) | 109.6 | 41.7 (−62%) | 21.5% | 44.5% (+107%) |
Prefill (pp) and decode (tg) throughput are reported in tok/s; resident memory as a percentage of the 96 GB unified memory. The parenthetical Δ in the 256K columns reports the relative change vs. the respective 4K baseline. Bold entries are the per-column best.
At a glance, the MoE models post significantly higher prefill and decode speeds than their dense counterparts, and the Qwen models lead decode at every context size. The two MoE models show similar prefill speeds, with Qwen beating Gemma at context windows ≤ 64K and closely trailing Gemma's prefill at context windows > 64K.
3.2 Full experiment#
Following the initial 4-bit MLX subset, we ran the experiment across the entire matrix to get the following results. We've omitted results where the test configuration exhausted the Mac Studio's memory, either terminating the test program or failing to allocate.
Figure 4: Prefill throughput (pp tok/s) vs. context window. Shared log y-axis across panels, so vertical position is comparable between panels. Quant curves cluster within each panel, consistent with compute-bound prefill (with Qwen3.6-35B-A3B BF16 MLX at 128K as a near-cap exception, sitting ~17% below the 4-bit/8-bit lines). MoE (bottom row) prefills ~7× faster than dense (top row) at short context, narrowing to ~4× at 256K. Markers: ○ 4-bit, ◻ 8-bit, △ BF16. Solid + filled = MLX, dashed + hollow = GGUF.
Data behind Figure 4:
| Model | 4K | 16K | 32K | 64K | 128K | 256K |
|---|---|---|---|---|---|---|
| Gemma-4-31B-it 4-bit MLX | 259.5 | 243.6 | 226.4 | 195.5 | 152.4 | 104.7 |
| Gemma-4-31B-it 4-bit GGUF | 253.7 | 226.6 | 197.9 | 157.7 | 111.6 | – |
| Gemma-4-31B-it 8-bit MLX | 257.4 | 242.3 | 225.4 | 194.7 | 151.7 | – |
| Gemma-4-31B-it 8-bit GGUF | 263.8 | 234.5 | 203.9 | 161.4 | 113.3 | – |
| Gemma-4-31B-it BF16 MLX | 261.5 | 246.7 | 228.9 | 195.6 | – | – |
| Gemma-4-31B-it BF16 GGUF | 268.9 | 238.7 | 207.0 | 163.4 | 114.1 | – |
| Qwen3.6-27B 4-bit MLX | 329.5 | 315.0 | 296.0 | 260.0 | 205.4 | 143.4 |
| Qwen3.6-27B 4-bit GGUF | 302.3 | 267.8 | 261.9 | 226.0 | 175.8 | 120.2 |
| Qwen3.6-27B 8-bit MLX | 325.7 | 312.0 | 293.4 | 257.9 | 204.0 | 142.8 |
| Qwen3.6-27B 8-bit GGUF | 315.3 | 296.3 | 271.6 | 233.2 | 180.0 | 122.3 |
| Qwen3.6-27B BF16 MLX | 330.2 | 317.1 | 298.1 | 261.4 | 206.0 | – |
| Qwen3.6-27B BF16 GGUF | 318.1 | 298.4 | 273.3 | 234.5 | 180.4 | 122.0 |
| Gemma-4-26B-A4B-it 4-bit MLX | 1991.5 | 1795.2 | 1595.6 | 1285.8 | 897.8 | 557.2 |
| Gemma-4-26B-A4B-it 4-bit GGUF | 1652.8 | 1363.7 | 1118.5 | 816.6 | 526.2 | 305.0 |
| Gemma-4-26B-A4B-it 8-bit MLX | 1961.0 | 1775.2 | 1579.3 | 1275.7 | 893.6 | 555.4 |
| Gemma-4-26B-A4B-it 8-bit GGUF | 1682.0 | 1384.2 | 1131.9 | 823.8 | 529.8 | 305.7 |
| Gemma-4-26B-A4B-it BF16 MLX | 1946.5 | 1768.0 | 1575.1 | 1271.6 | 889.4 | 553.9 |
| Gemma-4-26B-A4B-it BF16 GGUF | 1700.6 | 1403.7 | 1145.4 | 831.7 | 534.3 | 308.6 |
| Qwen3.6-35B-A3B 4-bit MLX | 2432.3 | 2156.6 | 1832.6 | 1375.8 | 870.8 | 500.0 |
| Qwen3.6-35B-A3B 4-bit GGUF | 1826.3 | 1555.5 | 1293.5 | 962.3 | 621.3 | 355.6 |
| Qwen3.6-35B-A3B 8-bit MLX | 2406.9 | 2138.3 | 1819.7 | 1368.1 | 867.0 | 498.6 |
| Qwen3.6-35B-A3B 8-bit GGUF | 1730.5 | 1481.9 | 1244.5 | 936.2 | 612.2 | 351.5 |
| Qwen3.6-35B-A3B BF16 MLX | 2459.9 | 2189.4 | 1853.0 | 1381.6 | 722.6 | – |
| Qwen3.6-35B-A3B BF16 GGUF | 1736.8 | 1485.0 | 1246.5 | 937.2 | 610.4 | 350.9 |
Prefill throughput (pp tok/s) for every model × quant × backend × context window. Tests marked – are aborted measurements (weights + KV-cache exceeded the Mac Studio's memory constraints).
Figure 5: Decode throughput (tg tok/s) vs. context window. Shared log y-axis. Dotted lines on dense panels mark the BF16 bandwidth ceiling (819 GB/s ÷ bytes-per-token at 4K). Measured BF16 lines sit just below, since dense decode is bandwidth-limited. MoE ceilings are off-chart at ~125+ tok/s; MoE hits dispatch overhead before bandwidth. Quant spread is wide on dense, narrow on MoE.
Data behind Figure 5:
| Model | 4K | 16K | 32K | 64K | 128K | 256K |
|---|---|---|---|---|---|---|
| Gemma-4-31B-it 4-bit MLX | 30.5 | 27.1 | 23.7 | 19.1 | 13.6 | 8.6 |
| Gemma-4-31B-it 4-bit GGUF | 26.2 | 24.1 | 21.8 | 18.1 | 12.4 | – |
| Gemma-4-31B-it 8-bit MLX | 18.5 | 17.2 | 15.7 | 13.5 | 10.5 | – |
| Gemma-4-31B-it 8-bit GGUF | 19.0 | 17.8 | 16.5 | 14.3 | 10.5 | – |
| Gemma-4-31B-it BF16 MLX | 11.0 | 10.5 | 9.9 | 9.0 | – | – |
| Gemma-4-31B-it BF16 GGUF | 11.2 | 10.8 | 10.3 | 9.4 | 7.6 | – |
| Qwen3.6-27B 4-bit MLX | 38.1 | 35.2 | 32.2 | 28.2 | 22.5 | 15.8 |
| Qwen3.6-27B 4-bit GGUF | 26.4 | 25.1 | 23.4 | 20.1 | 14.5 | 8.4 |
| Qwen3.6-27B 8-bit MLX | 22.9 | 21.7 | 20.4 | 18.8 | 16.1 | 12.4 |
| Qwen3.6-27B 8-bit GGUF | 20.8 | 19.7 | 18.7 | 16.6 | 12.7 | 7.6 |
| Qwen3.6-27B BF16 MLX | 13.2 | 12.9 | 12.5 | 11.8 | 10.7 | – |
| Qwen3.6-27B BF16 GGUF | 12.7 | 12.3 | 11.9 | 11.0 | 9.1 | 6.2 |
| Gemma-4-26B-A4B-it 4-bit MLX | 105.4 | 88.3 | 74.6 | 55.5 | 35.9 | 21.5 |
| Gemma-4-26B-A4B-it 4-bit GGUF | 87.1 | 82.9 | 75.1 | 64.4 | 49.6 | 33.6 |
| Gemma-4-26B-A4B-it 8-bit MLX | 83.5 | 73.0 | 62.8 | 48.6 | 32.8 | 20.4 |
| Gemma-4-26B-A4B-it 8-bit GGUF | 79.5 | 76.0 | 69.1 | 59.7 | 46.6 | 31.9 |
| Gemma-4-26B-A4B-it BF16 MLX | 64.7 | 59.1 | 52.1 | 42.1 | 29.7 | 19.1 |
| Gemma-4-26B-A4B-it BF16 GGUF | 62.3 | 59.7 | 55.6 | 49.5 | 40.2 | 29.3 |
| Qwen3.6-35B-A3B 4-bit MLX | 109.6 | 98.0 | 89.8 | 76.5 | 60.1 | 41.7 |
| Qwen3.6-35B-A3B 4-bit GGUF | 81.1 | 75.6 | 70.3 | 61.0 | 45.9 | 26.9 |
| Qwen3.6-35B-A3B 8-bit MLX | 88.2 | 81.3 | 74.8 | 65.1 | 53.1 | 38.2 |
| Qwen3.6-35B-A3B 8-bit GGUF | 73.4 | 69.1 | 64.3 | 56.5 | 43.4 | 26.0 |
| Qwen3.6-35B-A3B BF16 MLX | 73.6 | 68.2 | 63.6 | 56.3 | 46.9 | – |
| Qwen3.6-35B-A3B BF16 GGUF | 59.8 | 56.9 | 53.7 | 48.0 | 38.3 | 23.9 |
Decode throughput (tg tok/s) for every model × quant × backend × context window. Tests marked – are aborted measurements (weights + KV-cache exceeded the Mac Studio's memory constraints).
Dense BF16 is pinned to the bandwidth wall. For dense models at BF16, decode throughput is predominantly bandwidth-bound: tg ≤ bandwidth / (weight_bytes + kv_bytes_per_token). For example, with Gemma-4-31B-it at 4K, each step streams 62.9 GB (61.4 GB weights + 1.5 GB KV-cache) against the 819 GB/s peak, giving a 13.0 tok/s ceiling, which the recorded 11.0 tok/s closely matches.
Below the bandwidth wall, fixed overheads take over. Both quantization and MoE shrink the bytes streamed per step, leaving the Mac Studio's bandwidth underutilized, so dispatch overheads dominate the remaining time. Two examples at 4K context (a worked calculation follows the list):
- Gemma-4-31B-it 4b GGUF streams 20.18 GB per decode step (18.67 GB weights + 1.51 GB KV-cache) against the 819 GB/s peak, giving a 40.6 tok/s ceiling; the recorded speed is 26.2 tok/s (64.5%)
- Qwen3.6-35B-A3B BF16 MLX streams 5.8 GB per decode step (2.9 B active params × 2 bytes for BF16) against the 819 GB/s peak, giving a 141 tok/s ceiling; the recorded speed is 73.6 tok/s (52%)
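The same arithmetic as a quick sanity check, using the byte counts quoted above; this only reproduces the ceilings, not the measured numbers:

```python
BANDWIDTH_GBPS = 819.0  # M3 Ultra peak memory bandwidth, GB/s

def decode_ceiling(weight_gb: float, kv_gb_per_step: float) -> float:
    """Upper bound on decode tok/s if each step streams weights + KV exactly once."""
    return BANDWIDTH_GBPS / (weight_gb + kv_gb_per_step)

# Gemma-4-31B-it BF16 @ 4K: 61.4 GB weights + 1.5 GB KV -> ~13.0 tok/s (measured 11.0)
print(round(decode_ceiling(61.4, 1.5), 1))
# Gemma-4-31B-it 4-bit GGUF @ 4K: 18.67 GB weights + 1.51 GB KV -> ~40.6 tok/s (measured 26.2)
print(round(decode_ceiling(18.67, 1.51), 1))
# Qwen3.6-35B-A3B BF16 @ 4K: 2.9 B active params x 2 bytes = 5.8 GB -> ~141 tok/s (measured 73.6)
print(round(decode_ceiling(2.9 * 2, 0.0), 1))
```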
Figure 6: Resident memory (% of the 96 GB unified memory) vs. context window, for every model × quant × backend.
Data behind Figure 6:
| Model | 4K | 16K | 32K | 64K | 128K | 256K |
|---|---|---|---|---|---|---|
| Gemma-4-31B-it 4-bit MLX | 21.5 | 24.0 | 27.6 | 34.9 | 49.5 | 78.5 |
| Gemma-4-31B-it 4-bit GGUF | 20.2 | 22.2 | 24.9 | 30.3 | 41.0 | – |
| Gemma-4-31B-it 8-bit MLX | 36.9 | 39.3 | 42.9 | 50.3 | 64.9 | – |
| Gemma-4-31B-it 8-bit GGUF | 34.1 | 36.1 | 38.8 | 44.2 | 54.9 | – |
| Gemma-4-31B-it BF16 MLX | 65.4 | 68.1 | 71.7 | 79.0 | – | – |
| Gemma-4-31B-it BF16 GGUF | 62.9 | 64.9 | 67.6 | 73.0 | 83.7 | – |
| Qwen3.6-27B 4-bit MLX | 18.5 | 20.6 | 23.2 | 28.6 | 39.8 | 62.1 |
| Qwen3.6-27B 4-bit GGUF | 17.9 | 21.1 | 25.4 | 34.0 | 51.2 | 85.5 |
| Qwen3.6-27B 8-bit MLX | 32.0 | 34.0 | 36.7 | 42.1 | 53.2 | 75.5 |
| Qwen3.6-27B 8-bit GGUF | 29.7 | 32.9 | 37.2 | 45.8 | 62.9 | 97.3 |
| Qwen3.6-27B BF16 MLX | 57.1 | 59.1 | 61.8 | 67.3 | 78.5 | – |
| Qwen3.6-27B BF16 GGUF | 54.9 | 58.1 | 62.4 | 71.0 | 88.2 | – |
| Gemma-4-26B-A4B-it 4-bit MLX | 16.2 | 17.3 | 18.7 | 21.6 | 27.6 | 39.7 |
| Gemma-4-26B-A4B-it 4-bit GGUF | 17.2 | 17.7 | 18.3 | 19.7 | 22.4 | 27.7 |
| Gemma-4-26B-A4B-it 8-bit MLX | 28.6 | 29.7 | 31.1 | 33.9 | 39.9 | 52.0 |
| Gemma-4-26B-A4B-it 8-bit GGUF | 27.2 | 27.7 | 28.4 | 29.7 | 32.4 | 37.8 |
| Gemma-4-26B-A4B-it BF16 MLX | 52.2 | 53.3 | 54.7 | 57.6 | 63.6 | 75.7 |
| Gemma-4-26B-A4B-it BF16 GGUF | 50.9 | 51.4 | 52.0 | 53.4 | 56.1 | 61.4 |
| Qwen3.6-35B-A3B 4-bit MLX | 21.5 | 22.6 | 24.0 | 26.9 | 32.7 | 44.5 |
| Qwen3.6-35B-A3B 4-bit GGUF | 21.5 | 22.5 | 23.8 | 26.5 | 31.9 | 42.6 |
| Qwen3.6-35B-A3B 8-bit MLX | 38.8 | 39.9 | 41.3 | 44.2 | 50.0 | 61.8 |
| Qwen3.6-35B-A3B 8-bit GGUF | 37.2 | 38.2 | 39.6 | 42.3 | 47.6 | 58.4 |
| Qwen3.6-35B-A3B BF16 MLX | 71.3 | 72.4 | 73.8 | 76.6 | 82.4 | – |
| Qwen3.6-35B-A3B BF16 GGUF | 69.7 | 70.7 | 72.1 | 74.7 | 80.1 | 90.8 |
Resident memory (% of the 96 GB unified memory) for every model × quant × backend × context window. Tests marked – are aborted measurements.
MoE's active-parameter advantage doesn't save BF16 at 256K. Qwen3.6-35B-A3B's 34.7 B total params still sit resident (only 2.9 B stream per decode token): the BF16 weights alone take ~69 GB, and with the 256K KV-cache measured at ~22 GB in the 4-bit runs (the KV-cache stays BF16 regardless of weight quantization), the total clears 91 GB before runtime overhead, enough to fail allocation at 256K alongside every dense BF16 row.
MLX vs. GGUF skew depends on the workload. MLX generally beats GGUF on both prefill and decode. With long-context sliding-window MoE decode, however, the ranking flips and GGUF dominates (e.g. Gemma-4-26B-A4B-it 4-bit at 256K: 33.6 tok/s GGUF vs. 21.5 MLX, +56% throughput). The flip is specific to sliding-window attention: it bounds each layer's KV cache to a fixed 1024-token window regardless of context length, so the per-step byte stream stays tiny (small active-parameter footprint + bounded KV) and GGUF's mmap'd weight path amortizes well under that profile. The other MoE in the sweep (Qwen3.6-35B-A3B, linear+full hybrid attention) doesn't get the same effect; at 256K decode it's 41.7 tok/s MLX vs. 26.9 GGUF, MLX still ahead by 55%.
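A rough way to see the bounded-KV effect: only the full-attention layers' KV grows with context, while the sliding layers are capped at the 1024-token window. The sketch below uses an assumed, purely illustrative per-layer per-token KV size; the real figure depends on head count and head dimension, which we don't reproduce here.

```python
WINDOW = 1024                      # Gemma sliding-window size, in tokens
KV_MB_PER_TOK_PER_LAYER = 0.01     # assumed per-layer KV footprint; illustrative only

def kv_gb(context: int, sliding_layers: int, full_layers: int) -> float:
    """Approximate KV-cache size read per decode step, in GB."""
    sliding = sliding_layers * min(context, WINDOW) * KV_MB_PER_TOK_PER_LAYER
    full = full_layers * context * KV_MB_PER_TOK_PER_LAYER
    return (sliding + full) / 1024

# Gemma-4-26B-A4B-it: 25 sliding + 5 full layers. Past 1K tokens the sliding term
# is constant, so KV growth from 4K to 256K comes only from the 5 full layers.
for ctx in (4_096, 262_144):
    print(ctx, round(kv_gb(ctx, 25, 5), 2), "GB (illustrative)")
```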
4. Takeaways#
| Model | Active / Total params | Decode (BF16 @ 4K) ceiling → measured | Prefill (BF16 @ 4K) ceiling → measured | RAM @ 256K (4b MLX): weights + KV |
|---|---|---|---|---|
| Gemma-4-31B-it | 30.7 / 30.7 B | 13.0 → 11.0 (85%) | 350 → 261.5 (75%) | 18 GB + 57 GB (78.5%) |
| Qwen3.6-27B | 27.8 / 27.8 B | 14.4 → 13.2 (92%) | 386 → 330.2 (86%) | 17 GB + 43 GB (62.1%) |
| Gemma-4-26B-A4B-it | 3.8 / 25.2 B | 101 → 64.7 (64%) | 2,826 → 1,946.5 (69%) | 15 GB + 23 GB (39.7%) |
| Qwen3.6-35B-A3B | 2.9 / 34.7 B | 141 → 73.6 (52%) | 3,702 → 2,459.9 (66%) | 21 GB + 22 GB (44.5%) |
- MLX outpaces GGUF in nearly every test. It wins on both prefill and decode, with the single exception of long-context sliding-window MoE decode.
- Dispatch overhead becomes a tax when bytes-per-step shrinks. The dense BF16 models' decode throughput sits at 85–92% of the bandwidth ceiling, whereas MoE BF16 decode sits at 52–64% of its ceiling as the fixed per-step costs (dispatch, dequantization) begin to dominate.