FastDMS Claims 6.4x KV Cache Compression While Running Faster Than vLLM and the Benchmark Numbers Are Credible Enough to Take Seriously

A developer posted a reference implementation of FastDMS to r/LocalLLaMA on May 4, reporting 5 to 8 times less KV memory usage than vLLM BF16 at 8K context, alongside decode throughput 1.5 to 2 times faster than vLLM, on Llama 3.2 1B benchmarks. The technique comes from a research paper on Dynamic Memory Sparsification (DMS) originally published by a joint team from Nvidia, the University of Warsaw, and the University of Edinburgh. The community response suggests the numbers are holding up to scrutiny from engineers who have independently tested adjacent techniques.

Understanding why this claim is interesting requires a brief explanation of what the KV cache actually costs in production. Every token a transformer processes requires storing key and value vectors for every attention layer in the model, and those vectors must remain in GPU memory for the entire duration of the generation. At 8K context with a 32-billion-parameter model, the KV cache alone can consume as much as 62 gigabytes of GPU memory in BF16, depending on the attention-head layout and the number of concurrent sequences, leaving almost no headroom for model weights, activations, or additional requests on a single H100. vLLM's PagedAttention system, introduced in 2023, addressed the fragmentation problem by allocating KV cache in fixed-size pages rather than reserving one contiguous region per request, reducing memory waste from 60-80% to under 4%. That was a significant improvement for throughput and concurrent request handling, and it is why vLLM became the dominant serving framework for production LLM deployments. What PagedAttention did not address is the absolute size of the KV cache for long contexts, which grows linearly with sequence length regardless of how efficiently it is allocated. FastDMS addresses that absolute size problem through a different mechanism: learned token eviction.
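The linear growth described above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the function name `kv_cache_bytes` and the model shape (16 layers, 8 KV heads, head dimension 64) are assumptions chosen to resemble a small grouped-query model, not figures taken from the FastDMS post, and the 6.4x division simply applies the reported compression ratio to the result.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer.

    bytes_per_elem=2 corresponds to BF16. The leading factor of 2
    accounts for storing both a key and a value vector per token.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical model shape, for illustration only (check the real config).
shape = dict(num_layers=16, num_kv_heads=8, head_dim=64)

full = kv_cache_bytes(**shape, seq_len=8192)
print(f"8K context, one sequence: {full / 2**20:.1f} MiB")  # 256.0 MiB

# Cache size is linear in both sequence length and batch size.
print(f"32K context, one sequence: {kv_cache_bytes(**shape, seq_len=32768) / 2**20:.1f} MiB")
print(f"8K context, 16 sequences:  {kv_cache_bytes(**shape, seq_len=8192, batch_size=16) / 2**30:.1f} GiB")

# Applying the reported 6.4x compression ratio to the single-sequence figure.
print(f"8K context at 6.4x:        {full / 6.4 / 2**20:.1f} MiB")
```

Paging changes how these bytes are laid out and how much is wasted, but not the total the formula produces, which is why eviction targets a different part of the problem.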

Dynamic Memory Sparsification works by training a model to identify which key-value entries in the cache can be discarded without material degradation of output quality. The idea is not new. H2O, SnapKV, PyramidKV, and KV-Compress are all prior work in the KV sparsification space, and the original DMS paper from the Nvidia, Warsaw, and Edinburgh team is a recent academic contribution to that lineage. What the FastDMS developer has done is create a practical reference implementation with a training setup, validate it on Llama 3.2 1B against WikiText-2, and publish the benchmark methodology openly enough that other engineers can reproduce and challenge the results. The reported perplexity delta of -0.28% with 6.4x compression is the number that matters most for quality assessment. Perplexity is an imperfect proxy for real-world task performance, but a negative delta at 6.4x compression, meaning the compressed model has slightly lower perplexity than the baseline, is a strong signal that the eviction policy is not randomly discarding critical tokens. The KL divergence (KLD) of 0.026 nats per token, which measures how much the compressed model's token distribution diverges from the full-cache baseline, is low enough that most production workloads would be unlikely to notice the difference in output quality.
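For readers who want to reproduce the two quality metrics on their own runs, here is a minimal sketch of how a perplexity delta and a per-token KL divergence in nats are commonly computed. It is not taken from the FastDMS code: the function names are hypothetical, the KL direction (baseline relative to compressed) is an assumption about the reported figure, and the inputs are toy values rather than benchmark data.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mean_kl_nats(baseline_logits, compressed_logits):
    """Mean KL(baseline || compressed) per token position, in nats.

    Each argument is a list of per-position logit vectors.
    The direction of the divergence is an assumption.
    """
    total = 0.0
    for base, comp in zip(baseline_logits, compressed_logits):
        log_p = log_softmax(base)
        log_q = log_softmax(comp)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(baseline_logits)


def perplexity(token_nlls):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def ppl_delta_percent(baseline_nlls, compressed_nlls):
    """Relative perplexity change; negative means the compressed run scored better."""
    base = perplexity(baseline_nlls)
    return 100.0 * (perplexity(compressed_nlls) - base) / base


# Toy values only, not FastDMS results.
full_cache = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
compressed = [[1.9, 0.6, -1.0], [0.0, 0.3, 3.1]]
print(mean_kl_nats(full_cache, compressed))
print(ppl_delta_percent([2.10, 1.95, 2.30], [2.09, 1.96, 2.29]))
```

The two metrics answer different questions: perplexity scores the compressed model against the reference text, while KL divergence compares it directly with the full-cache model's predictions, so it can flag drift even when perplexity happens to improve.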

The speed improvement alongside the memory improvement is the combination that elevates this beyond a standard compression result. Most KV cache compression techniques trade speed for memory: the overhead of deciding which tokens to evict, and of maintaining the sparse cache structure, adds latency that partially offsets the memory benefit. FastDMS reports 1.5 to 2x decode throughput improvements over vLLM BF16, attributable to two factors. First, the smaller KV cache reduces the memory bandwidth required for attention computation at each decode step, and since decode-time attention over the KV cache is memory-bandwidth-bound rather than compute-bound on modern GPUs, reducing the cache size directly accelerates decode. Second, the compact DMS approach reduces actual device memory pressure, not just KV bytes, which improves memory access patterns and reduces cache eviction overhead in the underlying GPU memory system. The community response to those numbers has been appropriately cautious but not dismissive. Several engineers with vLLM production experience have noted that the benchmark context length of 8K is relatively short by current standards, and that the gains at 32K or 128K context, where KV memory pressure is most acute, have not yet been published.
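The bandwidth argument can be illustrated with a simple lower-bound model in which each decode step must read the model weights and the resident KV cache once from device memory. This is a rough sketch under stated assumptions, not a reconstruction of the FastDMS benchmark: the function names, the byte counts, and the bandwidth figure are all hypothetical, and the model ignores compute, kernel launch overhead, and the cost of the eviction logic itself.

```python
def decode_step_seconds(weight_bytes, kv_bytes, bandwidth_bytes_per_s):
    """Lower-bound time for one decode step if it is purely bandwidth-bound."""
    return (weight_bytes + kv_bytes) / bandwidth_bytes_per_s


def speedup_from_kv_compression(weight_bytes, kv_bytes, ratio):
    """Ratio of step times before and after shrinking the KV cache by `ratio`.

    Bandwidth cancels out, so only the byte counts matter.
    """
    before = weight_bytes + kv_bytes
    after = weight_bytes + kv_bytes / ratio
    return before / after


GIB = 2**30

# Hypothetical numbers: 2 GiB of weights, assumed 2 TB/s of memory bandwidth.
weights = 2 * GIB
bandwidth = 2e12

for kv_gib in (0.25, 2, 16):
    kv = kv_gib * GIB
    t = decode_step_seconds(weights, kv, bandwidth)
    s = speedup_from_kv_compression(weights, kv, ratio=6.4)
    print(f"KV = {kv_gib:>5} GiB  step >= {t * 1e3:.3f} ms  speedup at 6.4x = {s:.2f}x")
```

In this toy model the speedup is always below the compression ratio, because the weight reads are untouched, and it grows as the KV cache makes up a larger share of the bytes moved per step. That is consistent with the caution about the 8K benchmark: whether longer contexts or larger batches shift the balance further is exactly what remains to be measured.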

The serving-stack efficiency thesis is the broader claim worth examining, because it has implications that extend well beyond FastDMS's specific results. The AI infrastructure competition for the past three years has been primarily about model size, parameter count, and training compute as proxies for capability. That competition has produced a generation of models that require expensive GPU clusters to train and expensive inference infrastructure to serve at scale. The economics of inference at long context are particularly challenging: when a user fills a 128K context window with a large document, the cost of serving that request is not dominated by the matrix multiplications that scale with model size. Much of it is KV cache memory and attention bandwidth. Techniques that address KV cache efficiency, whether through quantisation, sparsification, or learned eviction, directly improve the per-token cost of long-context inference without requiring a new architecture or retraining from scratch, although learned eviction does add a fine-tuning step. For a startup running document analysis, legal review, code generation over large codebases, or multi-turn agent workflows where context windows accumulate over many turns, the marginal cost per token at long context is a direct input to unit economics. A 6.4x reduction in KV memory with a 1.5 to 2x decode speed improvement, if it holds at longer contexts and across a broader range of models, is not an incremental optimisation. It is a step change in the cost structure of long-context inference.

The practical limitations the community has correctly identified are worth stating clearly. FastDMS has been validated on Llama 3.2 1B on WikiText-2, a language modelling benchmark with relatively short, homogeneous sequences. Production workloads are more diverse, with varying prompt structures, domain-specific vocabulary, instruction-following requirements, and output length distributions that may stress the eviction policy differently than language modelling evaluation does. The results at 8K context are promising but the 32K, 64K, and 128K regimes, where the memory savings would be most commercially significant, have not been benchmarked. Integration with vLLM's PagedAttention allocator would require engineering work beyond the current reference implementation, and the training overhead of learning the eviction policy adds a fine-tuning step that most deployment workflows do not currently include. None of these are reasons to dismiss the result. They are the checklist of validation work that converts a strong reference implementation into a production-ready serving optimisation, and the fact that a developer has published credible initial benchmarks openly is exactly how that validation process begins.
