Jun 3, 2026 · 11:44 PM

Nvidia turns one reasoning model into three with Star Elastic

Nvidia's Star Elastic release packages 30B, 23B, and 12B reasoning models into a single checkpoint that can be sliced before deployment. The bigger question is whether elastic checkpoints can help startups control inference costs without adding new serving complexity.

Julian Lim

Nvidia's Star Elastic points to a more flexible way to run reasoning models: one checkpoint, several cost profiles, and fewer duplicated deployment decisions.

Nvidia has released a new Nemotron Labs elastic model that makes a simple promise with serious infrastructure implications. Instead of asking teams to choose between separate 30B, 23B, and 12B reasoning models, Star Elastic puts all three inside one checkpoint and lets developers slice out the smaller variants before deployment.

That sounds technical, because it is. But the business point is straightforward. AI teams are no longer just chasing the largest model they can afford. They are trying to find the cheapest reliable model for each job, especially when reasoning workloads can burn through long thinking traces before producing a short final answer. For startups, that can be the difference between a product that scales and one that quietly eats its gross margin.

According to Nvidia's Hugging Face model card, NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16 is a 3-in-1 elastic large language model with 30B, 23B, and 12B parameter variants sharing the same parameter space. The 23B and 12B versions can be extracted from the 30B checkpoint through zero-shot slicing, with no extra fine-tuning required. The model was released on Hugging Face on May 7, 2026, under the Nvidia Open Model License, and Nvidia says it is ready for commercial use.

The attraction is not just storage convenience. Maintaining separate model families creates a lot of operational drag. Teams need different checkpoints, separate evaluation runs, separate serving assumptions, and often a messy internal debate about whether the cheaper model is good enough for a particular customer workflow. Star Elastic tries to compress that decision into one model family whose variants trade quality against latency and compute along a single curve.

The architecture is based on Nemotron 3 Nano 30B and uses a hybrid Mamba2, Transformer, and mixture-of-experts design. The full model has 30B total parameters with 3.6B active per token. The 23B variant has 2.8B active parameters, while the 12B variant has 2.0B. All three share the same 52-layer structure, attention heads, Mamba heads, and expert count, while differing in embedding dimension and MoE feed-forward dimension.

That is where the zero-shot slicing becomes important. This is not the same as training a big model, then later building a smaller one from scratch or running a separate distillation process for each target size. Nvidia says the elastic family was produced by post-training the parent model with roughly 160B tokens, equal to about 0.6% of the parent model's approximately 25T-token pretraining budget. The claimed memory advantage is also meaningful: deploying the 12B, 23B, and 30B variants together in BF16 requires 58.9GB, compared with 126.1GB for three separate Nano V3 checkpoints.
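The two ratios in that paragraph are easy to check by hand. The short Python sketch below only restates the figures quoted above from Nvidia's model card; the variable names are ours, and the percentages are derived, not reported by Nvidia.

```python
# Figures as quoted above from Nvidia's model card (approximate values).
POST_TRAINING_TOKENS = 160e9   # ~160B tokens of elastic post-training
PRETRAINING_TOKENS = 25e12     # ~25T tokens of parent-model pretraining
ELASTIC_BF16_GB = 58.9         # all three variants served from one elastic checkpoint
SEPARATE_BF16_GB = 126.1       # three separate checkpoints


def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


token_share = percent(POST_TRAINING_TOKENS, PRETRAINING_TOKENS)
memory_saved_gb = SEPARATE_BF16_GB - ELASTIC_BF16_GB
memory_saved_pct = percent(memory_saved_gb, SEPARATE_BF16_GB)

print(f"post-training share of pretraining budget: {token_share:.2f}%")
print(f"memory saved: {memory_saved_gb:.1f} GB ({memory_saved_pct:.1f}%)")
```

On these numbers the post-training run is about 0.64% of the pretraining budget, and the shared checkpoint needs roughly 53% less memory than three separate ones.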

The practical outcome is a model strategy that looks closer to infrastructure tuning than model shopping. A team running a support agent, coding assistant, or research copilot could use a smaller slice where speed matters, then reserve the full model for harder synthesis or premium paths. That is not glamorous, but it is exactly the kind of decision that shows up in cloud bills.
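As a rough illustration of what that tuning could look like, here is a minimal Python sketch of a routing table that maps workloads to slices. The workload names, the `pick_slice` helper, and the routing policy are all hypothetical; only the parameter counts come from the model card figures quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    name: str
    total_params_b: float    # total parameters, in billions
    active_params_b: float   # active parameters per token, in billions


# Parameter counts as quoted from the model card.
SLICES = {
    "12B": Slice("12B", 12, 2.0),
    "23B": Slice("23B", 23, 2.8),
    "30B": Slice("30B", 30, 3.6),
}

# Hypothetical policy: which slice serves which workload by default.
ROUTES = {
    "support_chat": "12B",
    "code_completion": "23B",
    "research_synthesis": "30B",
}


def pick_slice(workload: str, premium: bool = False) -> Slice:
    """Choose a slice for a workload; premium traffic and unknown workloads get the full model."""
    if premium:
        return SLICES["30B"]
    return SLICES[ROUTES.get(workload, "30B")]


print(pick_slice("support_chat").name)                # 12B
print(pick_slice("support_chat", premium=True).name)  # 30B
```

Defaulting unknown workloads to the full model is a conservative choice here: it costs more, but it avoids silently degrading quality for a workflow nobody has evaluated on a smaller slice.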

The reasoning budget is the real story

The most interesting part of Star Elastic is Nvidia's elastic budget control. Reasoning models often split their work into a thinking phase and an answering phase. The thinking phase can be long and repetitive, while the final response is much shorter and needs stronger instruction following. Nvidia argues that a smaller model can handle much of the high-volume thinking, while the larger model can take over for the final answer.

In Nvidia's own results, the 23B-to-30B configuration, in which the 23B variant handles the thinking and the 30B variant produces the final answer, gives the strongest accuracy-latency tradeoff across a range of budgets. The company says this approach can deliver up to 16% higher accuracy and 1.9x lower latency compared with standard single-model budget control. That is a useful claim, but it comes with an important caveat. Nvidia says switching nested submodels inside a single generation is not yet supported in standard vLLM and currently requires a custom inference path.
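The control flow behind that idea can be sketched in a few lines of Python. This is not Nvidia's inference path: the `two_phase_answer` function, the prompt layout, and both stub generators are hypothetical stand-ins, and the sketch makes two separate calls where Nvidia describes switching submodels inside one generation.

```python
from typing import Callable

# A generator takes a prompt and a max-token budget and returns text.
Generate = Callable[[str, int], str]


def two_phase_answer(question: str, think: Generate, answer: Generate,
                     thinking_budget: int, answer_budget: int) -> str:
    """Spend the thinking budget on one model, then let another write the answer."""
    trace = think(question, thinking_budget)
    # Hypothetical prompt layout; a real model would use its own chat template.
    prompt = f"Question: {question}\nReasoning: {trace}\nAnswer:"
    return answer(prompt, answer_budget)


def small_model_stub(prompt: str, max_tokens: int) -> str:
    """Stand-in for the smaller slice: emits one 'step' token per budget unit."""
    return " ".join(["step"] * max_tokens)


def large_model_stub(prompt: str, max_tokens: int) -> str:
    """Stand-in for the full model: reports how much reasoning it was handed."""
    return f"final answer based on {prompt.count('step')} reasoning steps"


result = two_phase_answer("What is 2 + 2?", small_model_stub, large_model_stub,
                          thinking_budget=4, answer_budget=32)
print(result)  # final answer based on 4 reasoning steps
```

In a real deployment the two callables would wrap the smaller and larger slices, and the thinking budget would be the knob that trades accuracy against latency.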

That caveat matters because many startups do not want a clever research path if it turns into custom serving maintenance. The simpler near-term use case is pre-deployment slicing: choose the 23B or 12B checkpoint for a product tier, run normal inference, and avoid carrying three unrelated models. The more advanced use case, switching model size mid-reasoning, is where the cost savings could become more interesting, but only when the serving stack catches up.

There is also a strategic Nvidia angle here. Star Elastic is available in BF16, FP8, and NVFP4 versions, and the model card points to support for Hopper and Blackwell systems. Smaller variants can increase throughput, with Nvidia reporting gains of 1.8x for the 23B model and 2.4x for the 12B model on H100 under its stated serving setup. That reinforces Nvidia's broader position: its models, quantization formats, software stack, and GPUs are increasingly being packaged as one performance story.
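To put those throughput multipliers in cost terms, the Python sketch below converts them into a relative cost per token. It rests on two assumptions not stated in the source: that the multipliers are measured against the full 30B model, and that cost per token scales inversely with throughput on the same GPU at the same hourly price.

```python
# Throughput multipliers as reported by Nvidia on H100.
# Assumption: they are relative to the full 30B model (baseline 1.0).
THROUGHPUT_VS_30B = {"30B": 1.0, "23B": 1.8, "12B": 2.4}


def relative_cost_per_token(speedup: float) -> float:
    """Cost per token relative to the baseline, assuming cost is inversely proportional to throughput."""
    return 1.0 / speedup


for name, speedup in THROUGHPUT_VS_30B.items():
    cost = relative_cost_per_token(speedup)
    print(f"{name}: {cost:.2f}x baseline cost per token ({(1 - cost) * 100:.0f}% lower)")
```

Under those assumptions the 23B slice costs a little over half as much per token as the full model, and the 12B slice a little over 40%. Real savings depend on batch sizes, sequence lengths, and how well each slice holds up on the workload.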

For people running open models locally, the release is interesting because the smaller quantized variants make the Nemotron 3 Nano architecture easier to test on lower-tier RTX hardware. For enterprise buyers, the bigger question is whether elastic checkpoints can reduce the complexity of production AI systems without locking teams too tightly to one vendor's assumptions.

The next test is adoption. If developers treat Star Elastic as a practical alternative to maintaining separate distilled models, Nvidia will have shown that model flexibility can become a serving feature, not just a research result. If the custom inference requirements slow usage, the release will still matter, but more as a sign of where reasoning infrastructure is heading than as an immediate default for production teams.

Julian Lim is an entrepreneur, technology writer, and a researcher. He started JL Data Analysis after graduating from NUS in Intelligent Systems. Julian writes about technology innovations and entrepreneurship on Business Times, Asia Pacific Magazine and occasionally contributes to Startup Fortune.