DeepSeek V4 shows how cheaper AI may come from lower precision

DeepSeek V4 is not just another big model release. Its real message for founders is that the next cost advantage in AI may come from precision, memory, and stability engineering rather than raw GPU scale.

DeepSeek has put the full V4 technical report into the open, and the most interesting part is not the headline parameter count. It is the way the Chinese lab is trying to make a trillion-scale model cheaper to run, easier to stretch across long context, and more credible for startups that cannot spend like OpenAI, Google, Anthropic, or Meta.

The release centers on two mixture-of-experts models. DeepSeek-V4-Pro is listed at 1.6 trillion total parameters with 49 billion active per token. DeepSeek-V4-Flash is far smaller, at 284 billion total parameters with 13 billion active. Both support a one-million-token context window, and both are being positioned less as simple chat models than as infrastructure for coding agents, research agents, and long-running tool workflows.
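The gap between total and active parameters is the core of the mixture-of-experts bargain, and it is easy to check with quick arithmetic. The Python sketch below uses only the figures quoted above; the dictionary layout and function name are illustrative, not anything taken from DeepSeek.

```python
# Total and active parameter counts as quoted in the article, in billions.
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600, "active_b": 49},
    "DeepSeek-V4-Flash": {"total_b": 284, "active_b": 13},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that take part in computing one token."""
    return active_b / total_b


for name, params in MODELS.items():
    frac = active_fraction(params["total_b"], params["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
```

Both models touch only a small slice of their weights per token, which is why per-token compute and total memory footprint have to be reasoned about separately.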

That matters because AI founders are already past the phase where a model demo is enough. The real question is whether a system can serve customers at a price that leaves room for a business. A model that performs well but burns too much memory, needs exotic clusters, or slows down after a long context is not a product advantage. It is a margin problem.

The sharpest technical choice in V4 is its use of low precision. The instruct models use FP4 for the MoE expert weights and FP8 for much of the rest of the stack, while implementation notes from vLLM and deployment teams point to MXFP4 experts with FP8 scales. In practical terms, DeepSeek is compressing the expensive expert layers hard while keeping enough numerical structure around them to avoid collapsing quality.
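To make "4-bit values plus a shared scale" concrete, here is a minimal Python sketch of block-wise FP4 quantization. It assumes the common microscaling layout: blocks of 32 weights, 4-bit E2M1 values, and one power-of-two scale per block. Whether V4's kernels use exactly this layout is an assumption, and the function names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (the sign is stored separately).
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # assumed block size, following the microscaling convention


def quantize_block(values):
    """Return (scale, codes): one shared scale and one FP4 value per weight."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0, [0.0] * len(values)
    # Power-of-two scale chosen so the largest weight lands in the top
    # binade of the FP4 grid (E2M1 has a maximum exponent of 2).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    codes = []
    for v in values:
        # Saturate at the largest FP4 magnitude, then snap to the nearest one.
        target = min(abs(v) / scale, FP4_MAGNITUDES[-1])
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(math.copysign(nearest, v))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from the shared scale and FP4 codes."""
    return [scale * c for c in codes]


# Toy block of 32 evenly spaced weights (made-up data, not model weights).
weights = [0.1 * i - 1.6 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale}, worst absolute error={worst:.3f}")
```

The shared scale is the "numerical structure" doing the work here: each block keeps its own dynamic range, so a handful of large weights in one block does not wash out the small weights in another.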

This is where quantization-aware training becomes important. Post-training quantization can work well for deployment, but it often asks the model to tolerate a precision regime it was not trained to expect. QAT brings the low-precision constraint into training or fine-tuning, so the model learns around the limitations rather than being squeezed afterward. For MoE systems, where expert weights dominate storage and bandwidth, that difference can be meaningful.
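A toy example shows the mechanism. The sketch below trains a single weight with "fake quantization" in the forward pass and a straight-through estimator in the backward pass, which is one common way to implement QAT. It is not DeepSeek's training recipe; the grid spacing, learning rate, and data are all made up for illustration.

```python
import math

STEP = 0.5  # hypothetical spacing of the low-precision grid


def fake_quant(w, step=STEP):
    """Snap a full-precision weight to the nearest grid point."""
    return math.floor(w / step + 0.5) * step


def train_qat(xs, ys, lr=0.05, steps=200):
    """Fit y ~ q(w) * x, where q is the quantizer.

    The forward pass uses the quantized weight. The backward pass uses a
    straight-through estimator: dq/dw is treated as 1, so the gradient with
    respect to q(w) is applied directly to the latent full-precision weight.
    """
    w = 0.0  # latent full-precision weight kept during training
    for _ in range(steps):
        q = fake_quant(w)
        grad_q = sum(2 * x * (q * x - y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad_q  # straight-through update of the latent weight
    return w, fake_quant(w)


# The true slope 0.8 is not on the grid; the closest grid points are 0.5 and 1.0.
xs = [1.0, 2.0, 3.0]
ys = [0.8 * x for x in xs]
latent_w, deployed_w = train_qat(xs, ys)
print(f"latent weight={latent_w:.3f}, deployed (quantized) weight={deployed_w}")
```

The latent weight ends up hovering near the boundary between the two closest grid points, which is the sense in which the model learns around the quantizer instead of having precision removed after the fact.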

The lesson for startups is straightforward. If FP4 experts hold up under real workloads, serving cost can fall without forcing every team into a tiny model. Lower memory pressure also changes the hardware conversation. A founder choosing between a closed API, a hosted open model, and a self-managed cluster now has another variable to test: not just which model is smartest, but which precision stack delivers acceptable quality per dollar.
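A back-of-the-envelope calculation shows why precision changes the hardware conversation. The sketch below estimates weight storage only, and it rests on two loud assumptions: the share of parameters that sits in the experts (0.95 here is a placeholder, not a reported figure) and an effective 4.25 bits per expert weight, meaning 4-bit values plus one 8-bit scale per 32 weights.

```python
def weight_memory_gb(total_params_b, expert_share, expert_bits=4.25, other_bits=8.0):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    total_params_b: total parameters, in billions.
    expert_share:   ASSUMED fraction of parameters held in MoE experts.
    expert_bits:    ASSUMED effective bits per expert weight (4 + 8/32).
    other_bits:     bits per weight for everything else (FP8 by default).
    """
    avg_bits = expert_share * expert_bits + (1.0 - expert_share) * other_bits
    return total_params_b * avg_bits / 8.0


ASSUMED_EXPERT_SHARE = 0.95  # placeholder value for illustration only

for name, total_b in (("DeepSeek-V4-Pro", 1600), ("DeepSeek-V4-Flash", 284)):
    mixed = weight_memory_gb(total_b, ASSUMED_EXPERT_SHARE)
    bf16 = weight_memory_gb(total_b, 0.0, other_bits=16.0)  # 16-bit baseline
    print(f"{name}: about {mixed:.0f} GB mixed FP4/FP8 vs {bf16:.0f} GB at 16 bits")
```

The estimate ignores KV cache, activations, and runtime overhead, so it is a floor on memory, not a sizing guide. Quality per dollar still has to be measured on real workloads.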

Stability Is The Hard Part

Low precision is only useful if the model stays trainable, and V4 makes several design choices aimed at that problem. The paper describes manifold-constrained hyper-connections, which replace plain residual connections with a more controlled mixing structure. It also uses the Muon optimizer for most parameters, with AdamW reserved for parts such as embeddings, prediction heads, biases, and normalization weights.
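The optimizer split described above can be pictured as a simple routing rule over parameter names and shapes. The following Python sketch is a guess at what such a rule might look like: the parameter names, keywords, and shapes are hypothetical, and it does not implement Muon or AdamW themselves.

```python
# Hypothetical name fragments marking parameters that stay on AdamW.
ADAMW_KEYWORDS = ("embed", "lm_head", "norm")


def assign_optimizer(name, shape):
    """Pick an optimizer label for one parameter tensor."""
    if len(shape) < 2:
        # Biases and normalization gains are vectors, not weight matrices.
        return "adamw"
    if any(keyword in name for keyword in ADAMW_KEYWORDS):
        return "adamw"
    return "muon"


# Made-up parameter inventory for illustration.
params = {
    "embed_tokens.weight": (128000, 4096),
    "layers.0.attn.q_proj.weight": (4096, 4096),
    "layers.0.attn.q_proj.bias": (4096,),
    "layers.0.norm.weight": (4096,),
    "layers.0.experts.3.up_proj.weight": (4096, 2048),
    "lm_head.weight": (128000, 4096),
}

groups = {name: assign_optimizer(name, shape) for name, shape in params.items()}
for name, optimizer in groups.items():
    print(f"{optimizer:>5}  {name}")
```

In this sketch the large attention and expert matrices go to Muon, while everything else falls back to AdamW, matching the division of labor the paper describes.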

Those details sound academic until you connect them to cost. Training instability wastes compute. Failed runs waste more. If DeepSeek can improve convergence while pushing weights into FP4 and FP8 formats, the advantage is not only cheaper inference. It is a path toward more efficient model development, especially for teams that need to iterate quickly and cannot afford repeated large-scale failures.

There are also routing and activation controls designed for MoE behavior. Reports on the technical release highlight anticipatory routing, which separates token routing from the most current parameter update, and clamping around SwiGLU activations to limit runaway values. These are not glamorous features, but they are the kind of engineering choices that decide whether low precision works in production or becomes a benchmark-only claim.
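Activation clamping is the easier of the two ideas to illustrate. Below is a minimal Python sketch of a SwiGLU unit with its inputs clamped before the nonlinearity. The clamp threshold of 10.0 and the choice to clamp both branches symmetrically are assumptions for illustration; the article does not specify where or how V4 applies its clamp.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def clamp(x, limit):
    """Restrict x to the interval [-limit, limit]."""
    return max(-limit, min(x, limit))


def clamped_swiglu(gate, up, limit=10.0):
    """SwiGLU on scalar inputs: silu(gate) * up, with both inputs clamped.

    `limit` is a hypothetical threshold. Clamping bounds the product, so a
    single outlier cannot exceed the range of a low-precision format.
    """
    return silu(clamp(gate, limit)) * clamp(up, limit)


print(clamped_swiglu(1.0, 2.0))   # ordinary values pass through unchanged
print(clamped_swiglu(1e6, -1e6))  # outliers are bounded by limit * limit
```

With the clamp in place the output magnitude can never exceed the square of the limit, which is the kind of guarantee that matters when downstream tensors are stored in formats with a narrow range.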

According to a Hugging Face analysis of the V4 release, the long-context efficiency is just as central as the parameter count: at one million tokens, V4-Pro uses 27 percent of the single-token inference FLOPs of DeepSeek-V3.2 and 10 percent of its KV cache memory, while V4-Flash drops those figures to 10 percent and 7 percent. That is the infrastructure story founders should watch closely.
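Those percentages translate into capacity numbers with simple arithmetic. The Python sketch below converts them into reduction factors and asks how many one-million-token sessions would fit in a fixed KV cache budget. The baseline of eight sessions is a made-up figure, and the calculation assumes KV cache is the only binding constraint, which real deployments will not satisfy exactly.

```python
# Cost at one million tokens relative to DeepSeek-V3.2, in percent,
# as quoted in the article.
RELATIVE_TO_V32 = {
    "V4-Pro": {"flops_pct": 27, "kv_cache_pct": 10},
    "V4-Flash": {"flops_pct": 10, "kv_cache_pct": 7},
}


def reduction_factor(percent):
    """Turn 'uses N percent of the baseline' into an 'N times cheaper' factor."""
    return 100 / percent


def sessions_in_same_budget(baseline_sessions, kv_cache_pct):
    """Whole sessions that fit in the KV memory the baseline needs.

    Assumes KV cache is the only limit; integer math avoids rounding surprises.
    """
    return baseline_sessions * 100 // kv_cache_pct


BASELINE_SESSIONS = 8  # hypothetical number of concurrent V3.2 sessions

for name, rel in RELATIVE_TO_V32.items():
    flops_x = reduction_factor(rel["flops_pct"])
    sessions = sessions_in_same_budget(BASELINE_SESSIONS, rel["kv_cache_pct"])
    print(f"{name}: {flops_x:.1f}x fewer FLOPs per token, "
          f"{sessions} sessions in the KV budget of {BASELINE_SESSIONS}")
```

Even under these simplifying assumptions, the order of magnitude is the point: a tenfold drop in KV cache is the difference between long context as a demo and long context as a priced product feature.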

The Open Model Pressure

DeepSeek is also applying pressure to the broader open-model market. Qwen and Llama have helped make capable models available outside the closed frontier labs, but V4 pushes the conversation toward systems design. The question is no longer only whether an open model can score well against proprietary models. It is whether open releases can ship with the kernels, quantization formats, routing choices, and serving patterns needed to make them economically useful.

That distinction matters for product teams. A model that is open in license but painful to serve still leaves many startups dependent on large API vendors. A model that is open, efficient, and supported by frameworks such as SGLang and vLLM gives them a credible alternative. Even teams that never self-host can use that option as leverage when negotiating API pricing.

There are reasons to stay cautious. DeepSeek has not magically removed the need for serious hardware, and FP4 is not a free lunch. Quality can shift across tasks, kernels may be hardware-specific, and deployment maturity will matter as much as the paper. The Pro model is still enormous, and the Flash model is the more practical target for many companies.

Still, V4 points in a useful direction. The next wave of AI competition may be won by teams that understand precision, memory, routing, and stability as product levers. For founders, the takeaway is not to chase every new model release. It is to measure the full cost of intelligence, from training reliability to serving latency, because that is where the business model will increasingly be decided.
