Nous Research says its new Token Superposition Training method can cut LLM pre-training time by up to 2.5x without changing the model architecture.
Nous Research has taken aim at one of AI's biggest cost centers. The company says its Token Superposition Training, or TST, can accelerate pre-training by as much as 2.5x at the same loss. The claim is backed by experiments across model sizes from 270 million parameters to a 10-billion-parameter mixture-of-experts system, according to the paper posted on arXiv and the company's own blog.
That matters because pre-training is where the money disappears fastest. For startups trying to build frontier or near-frontier foundation models, the compute bill is often the barrier between an ambitious prototype and something that can actually compete. If a method like TST holds up outside the lab, it gives smaller teams a more realistic path to training serious models on less expensive infrastructure, or at least stretching every GPU hour further.
The basic idea sounds almost deceptively simple. During an initial phase, the model does not train on individual tokens in the usual way. Instead, it trains on contiguous groups of tokens, averaging their embeddings and predicting the next group with a modified loss. It then switches back to standard next-token prediction for a recovery phase. The result, Nous says, is a drop-in training method that leaves the final inference-time model unchanged while improving data throughput per FLOP.
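To make the grouping step concrete, here is a minimal sketch in plain Python. It is an illustration, not Nous Research's implementation. The function names, the toy embedding table, and the group size of 2 are all hypothetical. The article says only that the loss is "modified", so the loss below, an average cross-entropy over every token in the next group, is an assumption made for the example.

```python
import math

def superpose(embeddings, group_size):
    """Average the embeddings of each contiguous group of `group_size` tokens.

    A trailing partial group is dropped (a simplification for this sketch).
    """
    n_groups = len(embeddings) // group_size
    dim = len(embeddings[0])
    out = []
    for g in range(n_groups):
        chunk = embeddings[g * group_size:(g + 1) * group_size]
        out.append([sum(vec[d] for vec in chunk) / group_size for d in range(dim)])
    return out

def group_targets(token_ids, group_size):
    """For group g, the prediction target is the set of token ids in group g + 1."""
    n_groups = len(token_ids) // group_size
    groups = [token_ids[g * group_size:(g + 1) * group_size] for g in range(n_groups)]
    return groups[1:]

def multi_target_loss(logits, target_ids):
    """ASSUMED loss: mean cross-entropy over all tokens in the next group.

    The source does not specify the modified loss; this is one plausible form.
    """
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return sum(log_z - logits[t] for t in target_ids) / len(target_ids)

# Toy data: a hypothetical 10-token vocabulary with 2-dimensional embeddings.
table = {i: [float(i), 2.0 * i] for i in range(10)}
token_ids = [3, 1, 4, 1, 5, 9]

inputs = superpose([table[i] for i in token_ids], group_size=2)
targets = group_targets(token_ids, group_size=2)
print(inputs)   # three averaged vectors instead of six token embeddings
print(targets)  # the next-group token ids that each position should predict
```

With a group size of 2, the model sees half as many input positions for the same stretch of text, which is where the claimed throughput gain would come from. In the recovery phase described above, training would go back to ordinary one-token-at-a-time prediction.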
The size range is part of why the announcement has drawn attention. Nous says TST was validated at 270M, 600M, and 3B dense scales, then carried through to a 10B A1B MoE model, which is the kind of spread that tends to make researchers pay attention. The paper's headline number is a 2.5x reduction in total pre-training time at the 10B scale under equal-loss settings, while the company's public post describes the result as a 2-3x wall-clock speedup at fixed FLOPs.
That is a cleaner story than many efficiency proposals get to tell. A lot of training tricks improve one part of the pipeline only to create new complexity elsewhere, whether that means a new architecture, a finicky optimizer setup, or an inference penalty later on. Nous is arguing that TST avoids that trap because the final model behaves like a conventionally trained one, which makes the method easier to imagine in real production workflows.
The timing also matters. The release was published in early May 2026, and the company framed it as a practical answer to the pre-training bottleneck rather than a theoretical curiosity. There is already plenty of discussion around the paper in technical circles and on Reddit's r/singularity, but the key point for readers is not the chatter. It is that a research lab known for the open-source Hermes family is now putting a concrete efficiency claim on the table at a moment when model training costs are still climbing.
How it compares with Meta's approach
The obvious comparison is Meta's multi-token prediction work, which also tries to make training denser by asking models to predict more than one token at a time. Meta's approach aims at better training efficiency and faster inference, and it gets there by adding prediction heads to the model rather than changing the training loop the way TST does. In other words, both methods attack the same bottleneck, but from different directions.
That difference is important for real-world cost savings. Meta's method is tied to an architectural choice, which can make it appealing inside a large research organization with broad infrastructure and engineering support. TST is trying to be more surgical. It claims to improve throughput without altering the tokenizer, optimizer, data, or model architecture, which is exactly the kind of message that resonates with teams trying to move fast on a budget.
For startups, the practical question is not which paper sounds smarter. It is which method can be adopted with the least disruption and the highest chance of lowering the cost per useful training run. If TST keeps its promise, the advantage is not just faster training but more attempts, more iterations, and a better shot at finding a model that works before the runway runs out.
That could shift some of the competitive pressure away from the hyperscalers. Big players will always have an advantage in raw compute, but efficiency gains compound. A team that can train at 2.5x the effective speed can spend less to reach the same milestone, or spend the same amount and push farther. Either way, the economics of who gets to build foundation models starts to look a little less fixed.
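The arithmetic behind that trade-off is easy to check. The sketch below assumes cost scales linearly with wall-clock training time and uses made-up GPU-hour figures. Neither the budget nor the per-run cost comes from Nous Research, and both function names are hypothetical.

```python
def time_fraction(speedup):
    """Fraction of the baseline training time needed at a given speedup."""
    return 1.0 / speedup

def runs_per_budget(budget_gpu_hours, baseline_run_gpu_hours, speedup):
    """Whole equal-loss training runs that fit in a fixed GPU-hour budget."""
    return int(budget_gpu_hours * speedup // baseline_run_gpu_hours)

# Hypothetical numbers: a 100,000 GPU-hour budget and a 50,000 GPU-hour baseline run.
print(time_fraction(2.5))                      # share of baseline time per run
print(runs_per_budget(100_000, 50_000, 1.0))   # runs without any speedup
print(runs_per_budget(100_000, 50_000, 2.5))   # runs at the claimed 2.5x
```

At 2.5x, each run takes 40 percent of the baseline time, so under these assumptions the same budget covers five runs instead of two.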
Nous Research is not claiming it has solved AI training costs once and for all. It is claiming something more practical and, arguably, more interesting: that the most expensive phase of model development may still have room for meaningful gains without rewriting the whole stack. If that proves durable, the people who care most will not be the ones buying the biggest clusters. They will be the ones trying to decide whether a much smaller budget can now buy them a seat at the table.