Jun 11, 2026 · 1:14 AM

llama.cpp adds Multi-Token Prediction and doubles Qwen3.6 27B throughput for local inference

llama.cpp's merge of Multi-Token Prediction unlocks large local inference speedups for Qwen3.6 27B, with community benchmarks showing roughly 2.4× on Strix Halo and 2.17× on other rigs, making on-device open-weight models far more practical for low-latency and privacy-sensitive workloads.

Judith Murphy

llama.cpp has now merged Multi-Token Prediction support, and early Qwen3.6 27B tests show a real speed gain for local inference, especially on dense models running on consumer-class GPUs.

The important part is not that another inference flag appeared in a developer tool. It is that a widely used local LLM runtime just made a 27B open-weight model feel materially faster on hardware many builders already own. The llama.cpp project merged PR #22673 on May 16, bringing MTP speculative decoding into the main branch, and the first wave of community testing has been unusually clear: Qwen3.6 27B can generate tokens far faster when paired with MTP-capable GGUF weights and tuned correctly.

For local AI users, that matters immediately. A model that was useful but slightly slow in chat, coding, or agent workflows becomes easier to use as a daily driver. Lower latency changes behavior. People ask more follow-up questions, run more local experiments, and become less dependent on hosted inference for tasks where privacy, cost, or offline access matters.

llama.cpp is popular because it turns model releases into something ordinary developers can actually run. MTP fits that pattern. Instead of asking users to buy a larger GPU, it lets supported models draft several tokens ahead and then, through speculative decoding, keep only the drafted tokens that the full model confirms. The feature still depends on the right model files and runtime settings, but the idea is simple: when drafts are accepted, a single verification step yields several tokens instead of one.
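The draft-then-verify idea can be sketched in a few lines of Python. This is a toy illustration of greedy speculative decoding in general, not llama.cpp's implementation: `draft_next` and `target_next` are hypothetical stand-ins for the cheap drafting path and the full model, and tokens are plain integers. In a real runtime the verification would happen in one batched forward pass rather than a Python loop.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical: context -> next token


def speculative_step(context: List[Token], draft_next: NextToken,
                     target_next: NextToken, draft_len: int) -> List[Token]:
    """Run one draft-then-verify step and return the tokens it produced."""
    # Draft phase: cheaply guess draft_len tokens ahead.
    drafted: List[Token] = []
    for _ in range(draft_len):
        drafted.append(draft_next(context + drafted))

    # Verify phase: accept drafted tokens until the first disagreement.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(context + accepted)
        if tok != expected:
            # Replace the first wrong guess with the target's own token.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft was accepted, so the target contributes one more token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(context: List[Token], draft_next: NextToken,
             target_next: NextToken, draft_len: int,
             max_new: int) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return how many verify steps were used."""
    out = list(context)
    steps = 0
    while len(out) - len(context) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, draft_len))
        steps += 1
    return out[:len(context) + max_new], steps


# Toy models: the target counts upward modulo 10; the draft agrees except
# right after token 4, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

With greedy acceptance like this, the output matches what the target alone would have produced; the draft only changes how many verification steps are needed to get there.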

The Benchmarks Are Promising, But Specific

The strongest numbers so far come from community benchmark runs, not a vendor launch event, so they should be read with that context. In one widely shared LocalLLaMA post, a tester reported Qwen3.6 27B on a Framework Desktop with AMD Strix Halo moving from 7.4 to 18.1 tokens per second in Q8_0, a 2.44x gain. The same post reported a dual RTX 3090 layer-split setup rising from 25.7 to 55.9 tokens per second in Q8_0, or 2.17x, while a single RTX 3090 Q4_K_M run improved from 38.7 to 59.5 tokens per second, or about 1.54x.

Those figures line up with a DataCamp tutorial published May 14, which measured Qwen3.6 27B on an RTX 3090 moving from 38 tokens per second to 65 tokens per second after enabling MTP, roughly a 1.71x speedup. That is not the same as saying every system will double. It does say the improvement is large enough to survive outside a narrow lab setup.
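For readers who want to check the arithmetic, the short script below recomputes each speedup from the tokens-per-second figures quoted above. The labels are informal descriptions of the runs, and because the quoted throughputs are themselves rounded, a recomputed ratio can differ from a reported multiplier in the last digit.

```python
# Baseline and MTP throughput (tokens per second) as quoted in the article.
runs = {
    "Strix Halo, Q8_0": (7.4, 18.1),
    "2x RTX 3090 layer split, Q8_0": (25.7, 55.9),
    "RTX 3090, Q4_K_M": (38.7, 59.5),
    "RTX 3090, DataCamp tutorial": (38.0, 65.0),
}


def speedup(baseline_tps: float, mtp_tps: float) -> float:
    """Return the throughput ratio of the MTP run over the baseline run."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return mtp_tps / baseline_tps


if __name__ == "__main__":
    for label, (baseline_tps, mtp_tps) in runs.items():
        print(f"{label}: {speedup(baseline_tps, mtp_tps):.3f}x")
```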

The gains also appear stronger on dense models than on mixture-of-experts models. The same community benchmark showed Qwen3.6 35B-A3B improving by smaller margins, likely because only a fraction of that model's parameters is active per token, so its baseline decoding is already comparatively cheap. That distinction is important for buyers and builders. MTP is not a universal speed button. It changes the economics most when the baseline generation path is expensive enough for speculative drafting to pay off.
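A rough way to reason about when drafting pays off is the standard simplified cost model for speculative decoding. The sketch below is not a model of llama.cpp or of any measured system; it assumes each drafted token is accepted independently with probability `accept_prob`, that verifying a batch of drafts costs about one ordinary decode step, and that each drafted token adds `draft_cost_ratio` of a decode step in overhead. All three parameter names and any values plugged into them are illustrative assumptions.

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens produced per verify step under independent acceptance."""
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_prob == 1.0:
        # Every draft is accepted, plus the one token the verifier adds.
        return float(draft_len + 1)
    # Geometric series: 1 + p + p^2 + ... + p^draft_len
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


def estimated_speedup(accept_prob: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Estimated speedup over plain decoding under the assumptions above."""
    # One verify pass (cost 1.0) plus the overhead of producing the drafts.
    step_cost = 1.0 + draft_len * draft_cost_ratio
    return expected_tokens_per_step(accept_prob, draft_len) / step_cost


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show how overhead erodes the gain.
    for ratio in (0.05, 0.15, 0.30):
        print(ratio, round(estimated_speedup(0.8, 3, ratio), 2))
```

In this toy model, the same acceptance rate yields a smaller gain as the relative overhead grows, which is one plausible reading of why a model with a cheap baseline path benefits less.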

Why Local AI Gets More Practical

Faster local inference reduces two persistent problems: waiting and spending. If a workstation can produce useful responses at close to twice the previous rate, the user experience moves closer to a hosted model without sending data to an external provider. That is valuable for developers working with private repositories, researchers handling sensitive notes, and companies that want internal assistants without routing everything through a cloud API.

There is also a competitive angle. Qwen3.6 27B was already getting attention because it sits in a useful middle range: large enough to be capable, small enough to run locally with quantization. MTP makes that class of model more attractive. The practical question shifts from whether local inference is possible to whether it is pleasant enough to use every day.

The caveats are real. Users need MTP-ready GGUF weights, recent llama.cpp builds, and enough memory headroom for the feature. Prompt processing can take a hit in some configurations, and draft size needs tuning by GPU, quantization, and workload. Hugging Face model pages are already publishing example commands, but this is still early adopter territory, not a one-click mainstream feature.

Output quality is another area where careful testing matters. Community reports say results can be byte-identical to baseline decoding at the same seed and temperature when configured correctly, but teams using MTP in production should verify their own prompts, especially for coding and agent workflows where small differences can matter.
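One minimal way to run that kind of check is to send the same prompts through both configurations and compare the raw bytes. In the sketch below, `baseline` and `candidate` are hypothetical callables that a team would wire up to its own runtime, with MTP off and on respectively and with the seed and temperature held fixed; nothing here calls llama.cpp directly.

```python
from typing import Callable, Iterable, List, Tuple

Generate = Callable[[str], str]  # hypothetical: prompt -> completion text


def first_difference(a: str, b: str) -> int:
    """Return the index of the first differing character, or -1 if equal."""
    if a == b:
        return -1
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One string is a strict prefix of the other.
    return min(len(a), len(b))


def find_mismatches(prompts: Iterable[str], baseline: Generate,
                    candidate: Generate) -> List[Tuple[str, int]]:
    """Return (prompt, first differing index) for every prompt that diverges."""
    mismatches: List[Tuple[str, int]] = []
    for prompt in prompts:
        expected = baseline(prompt)
        actual = candidate(prompt)
        # Compare encoded bytes so the check is literally byte-for-byte.
        if expected.encode("utf-8") != actual.encode("utf-8"):
            mismatches.append((prompt, first_difference(expected, actual)))
    return mismatches
```

An empty result over a representative prompt set is evidence, not proof, of equivalence; coding and agent prompts from real workloads are the ones worth including.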

The near-term signal is clear. MTP will not replace cloud inference, and it will not make every local model suddenly fast. But it does narrow the gap for serious local setups. Watch for cleaner builds, better defaults in local model apps, and broader MTP GGUF releases, because that is when this stops being a benchmark thread and starts becoming part of the normal local AI stack.


Judith Murphy is a financial journalist and market analyst covering AI, technology stocks, and emerging market trends. She has contributed to multiple financial publications and brings a data-driven approach to her coverage of the technology sector and its impact on global markets.