Needle shows tiny models can move AI agents onto devices

Cactus Compute has open-sourced Needle, a 26M-parameter tool-calling model built for phones, watches and glasses. The bigger story is not a benchmark win, but a cheaper infrastructure layer for startups building agentic apps.

Most AI agent products still assume that every useful action needs a large model somewhere in the loop. Needle challenges that assumption by asking a narrower question: what if choosing the right tool and filling in its arguments is a small enough job to run locally, quickly and cheaply?

Cactus Compute released Needle on May 12, with weights on Hugging Face, code on GitHub and an MIT license. According to the company's launch post on Reddit, the model has 26 million parameters, runs at 6,000 tokens per second for prefill and 1,200 tokens per second for decode on consumer devices, and was trained specifically for single-shot function calling rather than general chat.
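To put those throughput figures in context, here is a back-of-the-envelope latency estimate. It is a sketch, not a benchmark: the two rates come from the launch post, while the request size (300 prompt tokens, 30 output tokens) is an assumption chosen for illustration, and real devices add tokenization, scheduling and memory overhead that this ignores.

```python
# Throughput figures quoted in the launch post (tokens per second).
PREFILL_TOKENS_PER_S = 6_000
DECODE_TOKENS_PER_S = 1_200


def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Rough latency for one single-shot tool call, in milliseconds."""
    prefill_s = prompt_tokens / PREFILL_TOKENS_PER_S  # reading the prompt
    decode_s = output_tokens / DECODE_TOKENS_PER_S    # writing the tool call
    return (prefill_s + decode_s) * 1000


# Hypothetical request: 300 prompt tokens (tool schemas plus the user command)
# and 30 output tokens (the structured tool call). Both sizes are assumptions.
latency_ms = estimate_latency_ms(prompt_tokens=300, output_tokens=30)
print(f"{latency_ms:.0f} ms")
```

Under those assumptions the call would take about 75 milliseconds, split between 50 ms of prefill and 25 ms of decode.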

That distinction matters. A model that can talk about the weather is not the same thing as a model that can decide to call a weather API with the right location field. Google's Gemini documentation describes function calling as the process where a model returns structured output that an application can use to call external systems. For agent builders, that structured handoff is often where the product either feels instant or starts to drag.
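A minimal sketch of that handoff looks like the following. The JSON shape, the `get_weather` function and the `dispatch` helper are all hypothetical, invented here for illustration; they are not Needle's or Gemini's actual output format. The point is only that the model emits data, and the application decides what to run.

```python
import json


def get_weather(location: str) -> str:
    """Hypothetical stand-in for a real weather API call."""
    return f"Forecast requested for {location}"


# Registry of tools the application is willing to execute.
TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str) -> str:
    """Parse the model's structured output and call the matching tool."""
    call = json.loads(model_output)
    tool = TOOLS[call["name"]]        # KeyError if the model names an unknown tool
    return tool(**call["arguments"])  # TypeError if the arguments do not fit


# Assumed model output for a request like "What's the weather in Lagos?"
model_output = '{"name": "get_weather", "arguments": {"location": "Lagos"}}'
result = dispatch(model_output)
print(result)
```

The model never touches the weather service itself. If it picks the wrong tool name or the wrong field, the failure shows up in `dispatch`, which is why tool selection and argument filling are worth measuring separately from chat quality.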

Needle is pitched as a specialist, not a miniature replacement for frontier models. Cactus says the model was pretrained on 200 billion tokens across 16 TPU v6e chips over 27 hours, then post-trained on 2 billion synthetic function-calling tokens in 45 minutes. The synthetic data was generated with Gemini and covers 15 categories, including timers, messaging, navigation and smart home tasks.
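Those training figures imply some useful ratios. The arithmetic below uses only the numbers Cactus reported; the one assumption is that all 16 chips were busy for the full 27 hours and shared the work evenly, which the writeup does not state.

```python
# Figures reported by Cactus.
PARAMS = 26e6
PRETRAIN_TOKENS = 200e9
PRETRAIN_HOURS = 27
CHIPS = 16
POSTTRAIN_TOKENS = 2e9
POSTTRAIN_MINUTES = 45

# Aggregate pretraining throughput in tokens per second.
pretrain_tps = PRETRAIN_TOKENS / (PRETRAIN_HOURS * 3600)

# Per-chip throughput, assuming an even split across all 16 chips.
per_chip_tps = pretrain_tps / CHIPS

# Post-training throughput in tokens per second.
posttrain_tps = POSTTRAIN_TOKENS / (POSTTRAIN_MINUTES * 60)

# Training tokens seen per model parameter during pretraining.
tokens_per_param = PRETRAIN_TOKENS / PARAMS

# Total accelerator time spent on pretraining.
pretrain_chip_hours = CHIPS * PRETRAIN_HOURS

print(f"pretraining: {pretrain_tps:,.0f} tokens/s ({per_chip_tps:,.0f} per chip)")
print(f"post-training: {posttrain_tps:,.0f} tokens/s")
print(f"tokens per parameter: {tokens_per_param:,.0f}")
print(f"pretraining chip-hours: {pretrain_chip_hours}")
```

That works out to roughly 2.06 million tokens per second during pretraining, about 7,700 training tokens per parameter, and 432 chip-hours for the pretraining run.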

The architecture is the part that makes the project more interesting than another small-model release. Cactus calls it a Simple Attention Network. The public writeup says Needle uses attention and gating, with no MLP or feed-forward layers. The argument is straightforward: tool calling is mostly retrieval and assembly. Match the user's request to the right tool, extract the values, then emit valid structured data.

If that view is right, many agent workflows have been carrying too much model for the job. A mobile assistant does not need a large language model to understand that "set a timer for ten minutes" maps to a timer function with a duration field. It needs reliability, low latency and enough local intelligence to avoid sending every small command to a server.

This is where the startup angle becomes practical. Cloud inference is easy to start with, but it becomes painful when a product needs to process routine actions all day. Every tap, voice command or background automation can turn into a cost line. A tiny open model that handles the routing layer locally gives founders a way to reserve expensive models for the tasks that actually need reasoning.

Distillation changes the economics

Needle also shows how frontier models may become factories for smaller models. Cactus used Gemini-generated data to train a narrow open model for a repeated product task. That is a different pattern from simply calling Gemini in production. The frontier model helps create the training signal, while the deployed system can run on a phone, laptop or wearable.

For entrepreneurs, that changes the build-vs-buy calculation. A team making an agent for field workers, personal productivity, smart homes or health devices may not need to own a giant model. It may need a narrow local model for intent and tool selection, a reliable runtime, and a cloud fallback for the difficult cases. Cactus is already positioning its broader engine around that hybrid idea, with documentation describing on-device execution, tool calling, retrieval support and cloud handoff.
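The hybrid pattern described above can be sketched in a few lines. Everything here is hypothetical: the `ToolCall` shape, the confidence score, the 0.8 threshold and the two stub models are assumptions made for illustration, not the Cactus engine's API. A real local model may not expose a confidence value at all, in which case the fallback rule would need another signal, such as schema validation failure.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict
    confidence: float  # hypothetical score in [0, 1]


Model = Callable[[str], Optional[ToolCall]]


def route(command: str, local_model: Model, cloud_model: Model,
          known_tools: set, min_confidence: float = 0.8):
    """Try the on-device model first; hand off to the cloud when unsure."""
    call = local_model(command)
    if (call is not None and call.name in known_tools
            and call.confidence >= min_confidence):
        return "local", call
    return "cloud", cloud_model(command)


# Stub models standing in for a tiny on-device model and a cloud model.
def stub_local(command: str) -> Optional[ToolCall]:
    if "timer" in command:
        return ToolCall("set_timer", {"duration_minutes": 10}, 0.97)
    return None  # the local model has no confident answer


def stub_cloud(command: str) -> Optional[ToolCall]:
    return ToolCall("plan_trip", {"request": command}, 1.0)


tools = {"set_timer"}
easy = route("set a timer for ten minutes", stub_local, stub_cloud, tools)
hard = route("plan a three-city trip under budget", stub_local, stub_cloud, tools)
print(easy[0], hard[0])
```

In this sketch the routine command stays on the device and only the open-ended request reaches the cloud model, which is the cost split the hybrid design is meant to achieve.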

The limits are just as important. Cactus says Needle beats larger models such as FunctionGemma-270M, Qwen-0.6B, Granite-350M and LFM2.5-350M on single-shot function calling, but it also notes those models have broader capacity and can be better in conversational settings. That is the right framing. A tiny specialist model can be useful without pretending to be a general assistant.

There are also normal open-source questions to test. The published speed figures are impressive, but developers will want to benchmark on their own devices, tools and latency budgets. They will also need to inspect the dataset pipeline, evaluate schema adherence and decide whether MIT licensing solves enough of the commercial risk for their product. Open weights are not the same as production readiness.
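Schema adherence in particular is easy to measure before committing to a model. The check below is a minimal sketch: the `set_timer` schema, the output format and the four sample outputs are all made up for illustration, and a production test would use the app's real tool definitions and a much larger sample of actual model outputs.

```python
import json

# Hypothetical tool schema: tool name plus required argument names and types.
TIMER_SCHEMA = {"name": "set_timer", "required": {"duration_minutes": int}}


def adheres(raw_output: str, schema: dict) -> bool:
    """Return True if the raw model output is a valid call for this schema."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # not valid JSON at all
    if not isinstance(call, dict) or call.get("name") != schema["name"]:
        return False  # wrong shape or wrong tool
    args = call.get("arguments")
    if not isinstance(args, dict) or set(args) != set(schema["required"]):
        return False  # missing or unexpected argument names
    return all(isinstance(args[key], expected)
               for key, expected in schema["required"].items())


# Made-up sample outputs covering common failure modes.
outputs = [
    '{"name": "set_timer", "arguments": {"duration_minutes": 10}}',     # valid
    '{"name": "set_timer", "arguments": {"duration_minutes": "ten"}}',  # wrong type
    '{"name": "set_timer", "arguments": {"duration_minutes": 10',       # truncated
    '{"name": "set_timer", "arguments": {"duration_minutes": 10, "label": "tea"}}',  # extra field
]

adherence_rate = sum(adheres(o, TIMER_SCHEMA) for o in outputs) / len(outputs)
print(f"schema adherence: {adherence_rate:.0%}")
```

Only the first sample passes, so this toy set scores 25%. Running the same kind of check over real outputs, per tool, gives a number that can be compared directly across candidate models.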

Still, Needle lands at the right moment. AI startups are moving from demos into workflows, and workflows are full of small, repetitive decisions. If those decisions can move onto the device, products can become faster, cheaper and more private at the same time.

The next thing to watch is whether developers treat Needle as a standalone model or as a template. The real opportunity may be thousands of tiny specialists, each trained for a narrow set of tools, sitting between users and larger models. That would make agentic apps less dependent on constant cloud calls, and it would give startups a more durable cost structure as usage grows.
