## Needle: A 26M Tool-Calling Model from Cactus-Compute — Source-Backed Research
### What is Needle?
Needle is a **26 million parameter** function-calling (tool-use) model developed by Cactus-Compute. It's designed to run on consumer devices like phones, watches, and glasses. The model was distilled from **Gemini 3.1 Flash Lite** and is trained specifically for single-shot function calling.
**Key sources:**
- **GitHub repo**: [cactus-compute/needle](https://github.com/cactus-compute/needle)
- **Hugging Face weights**: [Cactus-Compute/needle](https://huggingface.co/Cactus-Compute/needle)
- **Hacker News discussion**: [Show HN: Needle](https://news.ycombinator.com/item?id=48111896) (280 points, 102 comments)
- **Architecture docs**: [Simple Attention Networks](https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md)
---
### The 6000 tok/s prefill and 1200 tok/s decode claims
**What these numbers mean:**
- **Prefill (6000 tok/s)**: The speed at which the model processes the input prompt (the tool definitions and user query). This stage is exceptionally fast because the encoder digests the input in one parallel pass, so there is no per-token KV cache of the input to build and maintain.
- **Decode (1200 tok/s)**: The speed at which the model generates output tokens (the JSON response with tool calls). This is still very fast for edge devices.
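To put the two throughput claims in perspective, here is a back-of-envelope latency calculation. The prompt and output sizes are illustrative assumptions (not published figures); only the tok/s rates come from the post:

```python
# Back-of-envelope end-to-end latency at the claimed throughputs.
PREFILL_TOK_S = 6000   # claimed prefill rate
DECODE_TOK_S = 1200    # claimed decode rate

prompt_tokens = 1500   # e.g. a system prompt plus a batch of tool schemas (assumed)
output_tokens = 60     # a short JSON tool call (assumed)

prefill_ms = prompt_tokens / PREFILL_TOK_S * 1000
decode_ms = output_tokens / DECODE_TOK_S * 1000
total_ms = prefill_ms + decode_ms

print(f"prefill: {prefill_ms:.0f} ms, decode: {decode_ms:.0f} ms, total: {total_ms:.0f} ms")
# prefill: 250 ms, decode: 50 ms, total: 300 ms
```

Under these assumptions a full tool call lands in roughly a third of a second on-device, which is the practical meaning of the speed claims.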
**Why it's so fast:**
1. **Tiny parameter count** (26M vs billions in typical LLMs)
2. **No FFN (Feed-Forward Network)**: The architecture removes the largest component of standard transformers (~2/3 of parameters)
3. **Encoder-decoder design**: Input tokens are encoded once into a fixed representation, then the decoder generates output without re-attending to the full input at each step
4. **Cactus runtime**: Custom inference engine optimized for mobile/wearables
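The "~2/3 of parameters" figure for the FFN in point 2 checks out arithmetically. For a standard transformer block with the usual 4x FFN expansion (ignoring embeddings, norms, and biases), the ratio is size-independent:

```python
# Sanity check: what share of a standard transformer block is the FFN?
d = 512                       # hidden size (any value; the ratio doesn't depend on it)
attn_params = 4 * d * d       # Q, K, V, and output projections
ffn_params = 2 * d * (4 * d)  # up- and down-projection with the usual 4x expansion

ffn_fraction = ffn_params / (attn_params + ffn_params)
print(f"FFN share of block parameters: {ffn_fraction:.2f}")  # 0.67
```

So dropping the FFN removes about two thirds of each block, which is how the same layer count fits in 26M parameters.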
**Source**: The claims come directly from the [Hacker News post](https://news.ycombinator.com/item?id=48111896) and are backed by the [architecture documentation](https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md).
---
### Why a 26M tool-router matters for AI agents
**The core insight from the authors:**
> "Tool calling is fundamentally retrieval-and-assembly (match query to tool name, extract argument values, emit JSON), not reasoning."
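To make the "retrieval-and-assembly" framing concrete, here is a deliberately naive sketch: score tools by keyword overlap with the query, pick the best match, and emit JSON. This is *not* Needle's algorithm (Needle is a learned model), and the tool names and keywords are invented for illustration:

```python
# Toy illustration of tool calling as retrieval-and-assembly:
# match query to a tool name, stub out argument slots, emit JSON.
import json
import re

TOOLS = {
    "add_appointment": {"keywords": {"appointment", "meeting", "calendar", "schedule", "save"},
                        "args": ["title", "time"]},
    "send_message":    {"keywords": {"send", "text", "message", "reply"},
                        "args": ["recipient", "body"]},
}

def route(query: str) -> str:
    words = set(re.findall(r"\w+", query.lower()))
    # "match query to tool name": most keyword overlap wins
    name = max(TOOLS, key=lambda t: len(TOOLS[t]["keywords"] & words))
    # "extract argument values": left as placeholders in this sketch
    call = {"name": name, "arguments": {a: None for a in TOOLS[name]["args"]}}
    # "emit JSON"
    return json.dumps(call)

print(route("save this as an appointment for tomorrow 10:00"))
```

The point of the quote is that this whole pipeline is pattern matching and slot filling, not open-ended reasoning, which is why a tiny model can plausibly do it well.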
**Why this matters:**
1. **Edge deployment**: Most AI agents today require cloud APIs. Needle enables **on-device** tool routing, which means:
- No network latency
- Privacy-preserving (data stays on device)
- Works offline
- Lower cost (no API calls)
2. **Specialization over generalization**: Instead of using a massive general-purpose model for tool calling, Needle shows that a tiny specialized model can outperform larger models on this specific task. The HN discussion notes it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling.
3. **Architectural innovation**: The "Simple Attention Network" architecture (no FFN, encoder-decoder only) challenges conventional wisdom about transformer design. As the docs state: "For a task that is about routing information (query -> tool alignment), attention is the right primitive."
4. **Practical applications**: The HN discussion highlights potential use cases like:
- Command-line tools that understand natural language
- On-device assistants that can route to local tools
- Privacy-first desktop/mobile apps
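To make the "attention is the right primitive" idea from point 3 concrete, here is a minimal numpy sketch of an attention-only encoder-decoder step. The width, layer count (one), and random weights are illustrative assumptions, not Needle's actual configuration:

```python
# Minimal attention-only encoder-decoder step: no feed-forward network
# anywhere, just linear projections + scaled dot-product attention.
import numpy as np

rng = np.random.default_rng(0)
d = 32                                  # model width (assumed)

def attn(q, k, v):
    """Scaled dot-product attention with a numerically stable softmax."""
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Encoder: self-attention over the input (tool schemas + query), run ONCE.
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
x = rng.normal(size=(10, d))            # 10 input tokens
memory = attn(x @ Wq, x @ Wk, x @ Wv)   # fixed encoded representation

# Decoder step: cross-attention from the current output token into the
# fixed memory -- the input is never re-encoded per generated token.
Uq, Uk, Uv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
y = rng.normal(size=(1, d))             # current decoder token
out = attn(y @ Uq, memory @ Uk, memory @ Uv)
print(out.shape)  # (1, 32)
```

The structural point is visible in the shapes: the encoder output is computed once and reused at every decode step, which is what makes prefill so cheap relative to a decoder-only model with a growing KV cache.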
---
### What's still unclear / Open questions
**From the HN discussion and my research:**
1. **Performance on ambiguous queries**: One commenter asked about handling ambiguity: "Can I send it something like a text message 'lets catch up at coffee tomorrow 10:00' and a command like 'save this' and have it choose a 'add appointment' action from hundreds (or even tens) of possible tools?" The model is trained on 15 tool categories, but scalability to hundreds of tools isn't fully demonstrated.
2. **Real-world accuracy benchmarks**: While the authors claim it beats other models on single-shot function calling, detailed benchmark results aren't published yet. Commenters in the HN discussion asked for more data on the model's "discriminatory power."
3. **Conversation handling**: The authors explicitly state that while Needle beats other models on single-shot function calling, those models "have more scope/capacity and excel in conversational settings." Needle is designed for single-shot tool calls, not multi-turn conversations.
4. **Quantization impact**: The model is trained with INT4 quantization (14MB final size), but the performance impact of this aggressive quantization isn't fully documented.
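The stated numbers are at least internally consistent: 26M parameters at 4 bits each comes to about 13 MB of raw weights, close to the 14MB file size. Where the remaining ~1 MB goes is my assumption, not something the sources state:

```python
# Does 26M parameters at INT4 line up with the stated ~14 MB file size?
params = 26e6
bits_per_param = 4
raw_mb = params * bits_per_param / 8 / 1e6
print(f"{raw_mb:.0f} MB of raw INT4 weights")  # 13 MB
# The remaining ~1 MB plausibly covers quantization scales/zero-points,
# any tensors kept at higher precision, and file metadata (an assumption).
```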
5. **Cactus runtime availability**: The inference engine is part of the broader Cactus project, but the HN discussion notes that the team doesn't have infrastructure ready for a public demo yet. A community member deployed a [Hugging Face space](https://huggingface.co/spaces/benoitfavre/needle-playground) as a workaround.
6. **Generalization beyond tool calling**: The authors claim the "no FFN" finding generalizes to retrieval-augmented generation (RAG), but experimental results aren't published yet.
---
### Bottom line
Needle represents a compelling shift toward **specialized, tiny models** for specific AI agent tasks. It's not a general-purpose assistant, but a highly optimized tool-router that could enable a new class of on-device AI agents. The architectural innovations (an FFN-free, attention-only encoder-decoder) are worth watching, even if the specific claims need more independent verification.
**MIT licensed, fully open weights, and finetunable on consumer hardware** — making it accessible for experimentation and real-world deployment.