Caltech's PrismML shrinks AI models to fit your phone without losing their mind

A Caltech-backed startup claims to have solved one of AI's most stubborn engineering problems: making large language models dramatically smaller without sacrificing what makes them useful.

The AI industry has spent years convincing itself that bigger is better. More parameters, more compute, more data center square footage. That assumption is now being tested by a small team out of the California Institute of Technology that says the answer was never more hardware. It was better mathematics.

PrismML, a startup led by Caltech computer scientist and mathematician Babak Hassibi, came out of stealth in late March 2026 and open-sourced its 1-bit large language model technology. The claim is striking: its flagship Bonsai 8B model, derived from Qwen3-8B, can run inference up to eight times faster than a standard 16-bit model while cutting energy consumption by as much as 75 to 80 percent. The whole model fits in 1.15 gigabytes. For context, most capable LLMs require tens of gigabytes just to load.
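The size figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a round 8 billion parameters for an "8B" model and decimal gigabytes. Both are illustrative assumptions, and the helper names are hypothetical. PrismML has not published a byte-level breakdown here, so the gap between the ideal 1-bit size and the reported 1.15 gigabytes is shown only as an effective bits-per-weight figure, not explained.

```python
# Rough storage arithmetic for an "8B" model at different weight precisions.
# NUM_PARAMS is an assumed round figure, not an official parameter count.

NUM_PARAMS = 8e9          # assumption: "8B" taken literally
REPORTED_SIZE_GB = 1.15   # figure quoted in the article


def model_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def effective_bits_per_weight(size_gb: float, num_params: float) -> float:
    """Average number of bits spent per parameter for a given file size."""
    return size_gb * 1e9 * 8 / num_params


fp16_gb = model_size_gb(NUM_PARAMS, 16)
one_bit_gb = model_size_gb(NUM_PARAMS, 1)
reported_bits = effective_bits_per_weight(REPORTED_SIZE_GB, NUM_PARAMS)

print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f"1-bit weights:  {one_bit_gb:.2f} GB")
print(f"Reported size implies about {reported_bits:.2f} bits per weight")
```

Under these assumptions the reported size works out to slightly more than one bit per parameter, which would be consistent with some components or metadata being stored at higher precision.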

The mechanism is what makes this different from earlier quantization efforts. PrismML reduces each model weight to a single bit, represented as either +1 or -1. Previous attempts at 1-bit compression mostly targeted convolutional neural networks, not the transformer architectures that underpin modern LLMs. Hassibi's team claims to have cracked that problem through years of developing a proprietary mathematical framework. "We spent years developing the mathematical theory required to compress a neural network without losing its reasoning capabilities," he said. The effect, he explains, is much like compressing a digital photograph without losing visual fidelity. The image looks the same. It just takes up far less space.
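To make the idea of a +1 or -1 weight concrete, here is a minimal sketch of a generic sign-plus-scale binarization, the textbook approach from earlier binary-network research. It is not PrismML's method, which is proprietary. The function names and the per-row scale (the mean absolute weight) are assumptions chosen for illustration.

```python
# Generic sign-and-scale binarization of one weight row.
# This is NOT PrismML's proprietary method; it only illustrates what
# "one bit per weight" means and why naive binarization loses accuracy.


def binarize_row(weights):
    """Return (signs, scale): signs are +1/-1, scale is the mean |w|."""
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, scale


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


row = [0.4, -0.2, 0.1, -0.5]   # hypothetical full-precision weights
x = [1.0, 2.0, 3.0, 4.0]       # hypothetical input activations

signs, scale = binarize_row(row)
exact = dot(row, x)             # full-precision result
approx = scale * dot(signs, x)  # 1-bit weights plus one shared scale

print("signs:", signs, "scale:", round(scale, 3))
print("exact:", round(exact, 3), "approx:", round(approx, 3))
```

The two results differ noticeably on this toy row. Closing that kind of gap across billions of weights is the hard part that PrismML says its mathematical framework addresses.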

The timing is not accidental. AI inference costs have become a serious problem for companies trying to scale deployments beyond internal tools and demos. Running a capable LLM in production is expensive. It requires high-end GPUs, large amounts of memory bandwidth, and significant power draw. Every query costs something. Multiply that across millions of users, and the economics get uncomfortable quickly. OpenAI, Google, and Anthropic have all faced questions about whether their margin structures can survive at scale, and that pressure filters down to every enterprise buying API access.

If PrismML's compression holds up at production quality across diverse tasks, it changes the conversation considerably. Models capable enough to automate real workflows could run locally on a laptop or smartphone. That is not just a cost reduction. It is an architectural shift. Applications that currently depend on cloud round-trips could operate entirely on-device, with meaningful implications for latency, privacy, and cost. Vinod Khosla of Khosla Ventures, one of PrismML's backers, put it simply: "You can fit a much better model on a phone. That's a big deal."

The company raised $16.25 million in a SAFE and seed round from Khosla Ventures, Cerberus Capital Management, and Caltech itself. Amir Salek, a senior managing director at Cerberus and a veteran of both Google and Nvidia, said he was convinced the team had achieved a genuine mathematical breakthrough with the potential to reshape AI economics. That is notable. Cerberus is not typically known for early-stage AI bets.

The Skeptic's View

There are legitimate reasons to read these claims carefully. The AI field has a long history of results that perform brilliantly on benchmarks and struggle in messy real-world deployments. Quantization always involves trade-offs, and the relevant question is not whether a compressed model can match a full-precision model on a controlled evaluation suite. The question is whether it holds up when users push it in unexpected directions, across specialized domains, or in agentic workflows that chain multiple reasoning steps together.

The underlying architecture of Bonsai 8B remains a standard dense decoder-only model. The innovation is in the deployment framework: full 1-bit weight storage for matrix-intensive components, a runtime format compatible with mainstream inference engines, and optimized kernels for both the MLX and llama.cpp backends. That last detail matters because it means developers can test it today without waiting for specialized hardware. The open-source release is a meaningful signal of confidence.
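As a rough illustration of what 1-bit weight storage involves, the sketch below packs eight +1/-1 weights into each byte and computes a dot product directly from the packed bytes. The bit layout and function names are hypothetical. This is not the runtime format or kernel design used by PrismML, MLX, or llama.cpp, and production kernels would operate on whole machine words rather than one bit at a time.

```python
# Hypothetical bit-packing layout: bit value 1 means +1, bit value 0 means -1,
# with weight i stored in bit (i % 8) of byte (i // 8).


def pack_signs(signs):
    """Pack a list of +1/-1 values into bytes, eight weights per byte."""
    packed = bytearray((len(signs) + 7) // 8)
    for i, s in enumerate(signs):
        if s > 0:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_signs(packed, count):
    """Recover the first `count` +1/-1 values from packed bytes."""
    return [1 if (packed[i // 8] >> (i % 8)) & 1 else -1 for i in range(count)]


def packed_dot(packed, count, activations):
    """Dot product against packed weights using only adds and subtracts."""
    total = 0.0
    for i in range(count):
        bit = (packed[i // 8] >> (i % 8)) & 1
        total += activations[i] if bit else -activations[i]
    return total


signs = [1, -1, 1, -1, 1, 1, -1, -1, 1, -1]
activations = [float(v) for v in range(1, 11)]

packed = pack_signs(signs)
print(len(signs), "weights stored in", len(packed), "bytes")
print("dot product:", packed_dot(packed, len(signs), activations))
```

Because every weight is either +1 or -1, the inner loop never multiplies by a weight. It only adds or subtracts the activation, which is one reason 1-bit kernels can be cheaper in both time and energy.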

It also matters that the mathematics are proprietary. Open-sourcing the model weights while keeping the compression methodology closed is a reasonable commercial decision, but it does limit independent verification of the core claim. Other researchers will need time to stress-test the Bonsai outputs before the field reaches a consensus on whether this represents a durable advance or a well-engineered demo.

The Bigger Picture for AI Infrastructure

What PrismML is really challenging is the assumption that AI progress requires proportional growth in physical infrastructure. The last several years have been defined by a race to build more: more data centers, more GPU clusters, more undersea cables, more power capacity. That race has attracted hundreds of billions in investment and reshaped energy markets in the United States and Europe. The counterargument, which companies like PrismML are now making loudly, is that the field is overparameterized and underoptimized. There is more intelligence available per unit of compute than current deployment methods extract.

Edge AI is the most immediate beneficiary if this approach proves out. Robotics, industrial sensors, medical devices, and consumer electronics all have hardware constraints that currently exclude capable AI. A model that retains its reasoning quality inside 1.15 gigabytes would open markets that cloud-dependent architectures simply cannot reach. The same efficiency gains would reduce the energy footprint of data center inference, which is becoming a real concern for regulators and sustainability-conscious enterprise buyers alike.

Whether PrismML delivers on its boldest claims or not, it is now part of a broader and accelerating conversation about intelligence density over raw scale. The companies that figure out how to do more with less will have a structural advantage as inference costs become the defining constraint in enterprise AI. That shift is already underway, and the researchers working on it deserve more attention than the next model launch announcement.
