110 tok/s on RTX 4070 Super with Qwen3.6 35B

A LocalLLaMA community benchmark shows Qwen3.6-35B-A3B running at about 110 tokens per second on an RTX 4070 Super with 12GB of VRAM. The result is notable, but it depends on a specific ik_llama.cpp setup, so builders should treat it as a recipe to reproduce rather than a general hardware promise.

A 35-billion-parameter model running this fast on a mid-range consumer GPU would have sounded unrealistic not long ago. That is why the May 21 Reddit benchmark from LocalLLaMA user janvitos is getting attention: it points to a practical local inference path for developers who want capable models without sending every prompt to a cloud API.

According to the Reddit post, Qwen3.6-35B-A3B reached a 110.24 tokens per second average on an RTX 4070 Super 12GB system running CachyOS, with an AMD Ryzen 7 9700X and 48GB of DDR5-6000 memory. The setup used the ik_llama.cpp fork, the byteshape Qwen3.6-35B-A3B-IQ4_XS-4.19bpw GGUF quant, a 131,072 token context, Q8 KV cache settings, and multi-token prediction parameters tuned for the fork.
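A quick back-of-the-envelope check helps put the quant figure in context. The sketch below assumes the nominal 35 billion parameter count and a uniform 4.19 bits per weight, both taken from the names above; real GGUF files mix tensor types and carry metadata, so the actual file size will differ. The function name is illustrative only.

```python
# Rough weight-size estimate for a quantized model.
# Assumptions: nominal 35e9 parameters and a flat 4.19 bits per weight.
# This ignores KV cache, activations, and file metadata.

def quantized_weight_bytes(total_params: float, bits_per_weight: float) -> float:
    """Approximate bytes needed to store the weights alone."""
    return total_params * bits_per_weight / 8


weight_bytes = quantized_weight_bytes(35e9, 4.19)
weight_gib = weight_bytes / 2**30
vram_gib = 12  # card capacity from the post, treated here as GiB

print(f"approximate weights: {weight_gib:.1f} GiB, VRAM: {vram_gib} GiB")
print("weights alone fit in VRAM:", weight_gib < vram_gib)
```

Under these assumptions the weights alone exceed the card's 12GB, which suggests that part of the model sits in system memory. The post's 48GB of DDR5 is consistent with that reading, though the exact split between GPU and CPU is not something this estimate can confirm.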

This is not a vendor benchmark. It is a community result, and that distinction matters. The same post also showed results from current regular llama.cpp with MTP and the same quant, ranging from 79.8 to 97.0 tokens per second across the tasks shown. The author described the ik_llama.cpp result as a 22 percent increase, which is impressive but narrower than a simple comparison against an earlier 80 tok/s headline would suggest.
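The size of the gain depends heavily on which baseline is chosen. The short sketch below works through the arithmetic using only the figures quoted above; the implied baseline is derived from the 22 percent claim and is not a number reported in the post.

```python
# Relative throughput gain of the ik_llama.cpp result against each end
# of the regular llama.cpp range quoted in the post.

def percent_gain(new_rate: float, baseline_rate: float) -> float:
    """Percentage increase of new_rate over baseline_rate."""
    return (new_rate / baseline_rate - 1) * 100


ik_rate = 110.24                          # tok/s, ik_llama.cpp average
llama_cpp_low, llama_cpp_high = 79.8, 97.0  # tok/s, regular llama.cpp range

gain_vs_low = percent_gain(ik_rate, llama_cpp_low)
gain_vs_high = percent_gain(ik_rate, llama_cpp_high)

# Baseline that a 22 percent increase would imply (derived, not reported).
implied_baseline = ik_rate / 1.22

print(f"gain vs slowest regular result: {gain_vs_low:.1f}%")
print(f"gain vs fastest regular result: {gain_vs_high:.1f}%")
print(f"baseline implied by a 22% gain: {implied_baseline:.1f} tok/s")
```

The gain is much larger against the slowest regular llama.cpp task than against the fastest one, and the baseline implied by the 22 percent claim falls inside the quoted range.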

Why Qwen3.6-35B-A3B fits the moment

Qwen3.6-35B-A3B is built for this kind of local deployment. The model documentation on Hugging Face lists it as a sparse Mixture-of-Experts model with 35 billion total parameters and about 3 billion active parameters. It uses 256 experts, with eight routed experts plus one shared expert activated per token.
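To make the sparse routing idea concrete, here is a minimal sketch of top-k expert selection. It is not the model's actual router: real routers use a learned gating network, while this example uses random scores, and it assumes the shared expert sits alongside the 256 routed experts rather than among them.

```python
import random

# Illustrative top-k routing. Scores are random stand-ins for the output
# of a learned gating network; names and structure are hypothetical.

NUM_ROUTED_EXPERTS = 256
TOP_K = 8


def route_token(router_scores, top_k=TOP_K):
    """Return the indices of the top_k highest-scoring routed experts."""
    ranked = sorted(
        range(len(router_scores)),
        key=lambda i: router_scores[i],
        reverse=True,
    )
    return sorted(ranked[:top_k])


rng = random.Random(0)
scores = [rng.random() for _ in range(NUM_ROUTED_EXPERTS)]
chosen = route_token(scores)

# Eight routed experts plus the always-on shared expert (assumed separate).
experts_used_per_token = len(chosen) + 1

# Share of total parameters active per token, from the nominal figures.
active_fraction = 3e9 / 35e9

print("routed experts chosen:", chosen)
print("experts used per token:", experts_used_per_token)
print(f"active parameter share: {active_fraction:.1%}")
```

Each token touches only a small slice of the expert pool, which is why the per-token compute tracks the roughly 3 billion active parameters rather than the full 35 billion.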

The architecture also leans on Gated DeltaNet linear attention for most of its layers, with full attention appearing in every fourth layer. That design reduces the pressure that long-context inference puts on memory. Combined with sparse expert routing, it lets the model draw on 35 billion parameters of capacity while computing with only about 3 billion of them for each token.
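The layer pattern can be pictured with a small sketch. The layer count of 40 is a hypothetical placeholder, not a figure from the model documentation, and the choice of which layer in each group of four uses full attention is also an assumption. The point is only the ratio: if a growing KV cache is needed mainly for full-attention layers, roughly a quarter of the layers contribute to it.

```python
# Illustrative hybrid layer layout: full attention on every fourth layer,
# linear attention elsewhere. The layer count below is hypothetical.

def layer_kinds(num_layers: int, full_every: int = 4) -> list:
    """Label each layer as 'full' or 'linear' attention."""
    return [
        "full" if (index + 1) % full_every == 0 else "linear"
        for index in range(num_layers)
    ]


layers = layer_kinds(40)  # 40 is a placeholder, not the real layer count
full_count = layers.count("full")
linear_count = layers.count("linear")

# Fraction of layers that would need a per-token KV cache under this layout.
kv_layer_fraction = full_count / len(layers)

print(layers[:8])
print(f"full: {full_count}, linear: {linear_count}, "
      f"KV-cache layers: {kv_layer_fraction:.0%}")
```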

That is the real story for small teams. Local AI is no longer limited to tiny models that feel compromised from the first prompt. A privacy-focused startup, an indie coding tool, or a research workflow can now consider a model in the 35B class on hardware that many developers already understand how to buy, install, and maintain.

The configuration matters as much as the card

The benchmark is also a reminder that local inference performance is not just about the GPU. The RTX 4070 Super is important, but the recipe matters just as much: the quant file, the fork, the cache settings, the context size, the operating system, and whether the display is using the same GPU all affect the result.

The post specifically notes that the monitor was plugged into integrated graphics, leaving the RTX 4070 Super available for inference. It also recommends adjusting --fit-margin if users run into out-of-memory errors. That is useful information, but it also means the 110 tok/s figure should not be read as a plug-and-play outcome for every 12GB setup.

The GGUF ecosystem around Qwen3.6 is moving quickly. Hugging Face listings now include multiple IQ4_XS and Q4_K_M variants from community quantizers, including mradermacher and byteshape-related releases. Those options give builders more room to trade size, quality, memory use, and speed, but they also make reproducibility harder. Two people can say they are running Qwen3.6-35B-A3B and still be testing meaningfully different systems.
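One way to keep comparisons honest is to record the full recipe alongside every throughput number. The sketch below is a hypothetical record format, not something the post or any tool prescribes; the field names are illustrative, and the second quant filename is a made-up placeholder used only to show that a different file yields a different fingerprint.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

# Hypothetical benchmark record. Field names are illustrative; the values
# for the first recipe are taken from the setup described in the post.


@dataclass(frozen=True)
class BenchmarkRecipe:
    model: str
    quant_file: str
    runtime: str
    context_tokens: int
    kv_cache_type: str
    gpu: str
    operating_system: str
    display_on_inference_gpu: bool

    def fingerprint(self) -> str:
        """Short stable hash of every field, for labeling results."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


reported = BenchmarkRecipe(
    model="Qwen3.6-35B-A3B",
    quant_file="Qwen3.6-35B-A3B-IQ4_XS-4.19bpw",
    runtime="ik_llama.cpp",
    context_tokens=131_072,
    kv_cache_type="Q8",
    gpu="RTX 4070 Super 12GB",
    operating_system="CachyOS",
    display_on_inference_gpu=False,
)

# Same model name, different quant file (placeholder name for illustration).
other = BenchmarkRecipe(**{**asdict(reported), "quant_file": "example-Q4_K_M"})

print(reported.fingerprint(), other.fingerprint())
print("same system:", reported.fingerprint() == other.fingerprint())
```

Attaching a fingerprint like this to each reported number makes it obvious when two results that share a model name were produced by different systems.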

The reliability caveat is not optional

Speed is only valuable if the output survives real use. That is why the silent corruption issue around speculative decoding deserves more than a footnote. The outsourc-e qwen36-4090-recipes GitHub repo, which tested Qwen3.6-27B on an RTX 4090, found that some cross-vocabulary speculative decoding setups produced strong throughput numbers while breaking JSON, lists, quote escapes, and tool-call boundaries.

That caveat does not directly invalidate the RTX 4070 Super benchmark, which uses Qwen3.6-35B-A3B with MTP settings. It does, however, underline the same production lesson: builders should validate the outputs they care about, not just the aggregate token rate. A coding assistant that corrupts braces or a workflow agent that damages JSON is not faster in any useful business sense.
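A minimal version of that validation is easy to sketch. The example below checks whether each model output parses as JSON and reports a failure rate next to whatever throughput was measured. The sample strings are invented for illustration and do not come from the benchmark or the GitHub repo; a real harness would feed in actual model responses and likely add schema checks.

```python
import json

# Structural check for model outputs that are expected to be JSON.
# Sample outputs below are invented for illustration only.


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def failure_rate(outputs) -> float:
    """Fraction of outputs that fail to parse; 0.0 for an empty batch."""
    if not outputs:
        return 0.0
    failures = sum(1 for text in outputs if not is_valid_json(text))
    return failures / len(outputs)


samples = [
    '{"tool": "search", "args": {"q": "llama"}}',  # well formed
    '{"tool": "search", "args": {"q": "llama"}',   # missing closing brace
    '[1, 2, 3',                                    # truncated list
]

rate = failure_rate(samples)
print(f"structural failure rate: {rate:.0%}")
```

Running the same check against the same prompts with and without a speed-oriented setting is a simple way to see whether the extra tokens per second come at the cost of structure.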

For founders building AI products where privacy is part of the value proposition, the takeaway is practical. Local inference is becoming fast enough to matter, even on consumer hardware. The next question is whether a chosen fork, quant, and launch configuration can deliver that speed repeatedly while preserving the structure and quality the product depends on.