## Needle: A 26M Tool-Calling Model from Cactus-Compute — Source-Backed Research
### What is Needle?
Needle is a **26 million parameter** function-calling (tool-use) model developed by Cactus-Compute. It's designed to run on consumer devices like phones, watches, and glasses. The model was distilled from **Gemini 3.1 Flash Lite** and is trained specifically for single-shot function calling.
**Key sources:**
- **GitHub repo**: [cactus-compute/needle](https://github.com/cactus-compute/needle)
- **Hugging Face weights**: [Cactus-Compute/needle](https://huggingface.co/Cactus-Compute/needle)
- **Hacker News discussion**: [Show HN: Needle](https://news.ycombinator.com/item?id=48111896) (280 points, 102 comments)
- **Architecture docs**: [Simple Attention Networks](https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md)
---
### The 6000 tok/s prefill and 1200 tok/s decode claims
**What these numbers mean:**
- **Prefill (6000 tok/s)**: The speed at which the model processes the input prompt (the tool definitions and user query). This stage is exceptionally fast because the encoder digests the input in one parallel pass, so there is no per-token KV cache of the input to build and maintain.
- **Decode (1200 tok/s)**: The speed at which the model generates output tokens (the JSON response with tool calls). This is still very fast for edge devices.
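To put the two throughput claims in perspective, here is a back-of-envelope latency calculation. The prompt and output sizes are illustrative assumptions (not published figures); only the tok/s rates come from the post:

```python
# Back-of-envelope end-to-end latency at the claimed throughputs.
PREFILL_TOK_S = 6000   # claimed prefill rate
DECODE_TOK_S = 1200    # claimed decode rate

prompt_tokens = 1500   # e.g. a system prompt plus a batch of tool schemas (assumed)
output_tokens = 60     # a short JSON tool call (assumed)

prefill_ms = prompt_tokens / PREFILL_TOK_S * 1000
decode_ms = output_tokens / DECODE_TOK_S * 1000
total_ms = prefill_ms + decode_ms

print(f"prefill: {prefill_ms:.0f} ms, decode: {decode_ms:.0f} ms, total: {total_ms:.0f} ms")
# prefill: 250 ms, decode: 50 ms, total: 300 ms
```

Under these assumptions a full tool call lands in roughly a third of a second on-device, which is the practical meaning of the speed claims.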
**Why it's so fast:**
1. **Tiny parameter count** (26M vs billions in typical LLMs)
2. **No FFN (Feed-Forward Network)**: The architecture removes the largest component of standard transformers (~2/3 of parameters)
3. **Encoder-decoder design**: Input tokens are encoded once into a fixed representation, then the decoder generates output without re-attending to the full input at each step
4. **Cactus runtime**: Custom inference engine optimized for mobile/wearables
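The "~2/3 of parameters" figure for the FFN in point 2 checks out arithmetically. For a standard transformer block with the usual 4x FFN expansion (ignoring embeddings, norms, and biases), the ratio is size-independent:

```python
# Sanity check: what share of a standard transformer block is the FFN?
d = 512                       # hidden size (any value; the ratio doesn't depend on it)
attn_params = 4 * d * d       # Q, K, V, and output projections
ffn_params = 2 * d * (4 * d)  # up- and down-projection with the usual 4x expansion

ffn_fraction = ffn_params / (attn_params + ffn_params)
print(f"FFN share of block parameters: {ffn_fraction:.2f}")  # 0.67
```

So dropping the FFN removes about two thirds of each block, which is how the same layer count fits in 26M parameters.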
**Source**: The claims come directly from the [Hacker News post](https://news.ycombinator.com/item?id=48111896) and are backed by the [architecture documentation](https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md).
---
### Why a 26M tool-router matters for AI agents
**The core insight from the authors:**
> "Tool calling is fundamentally retrieval-and-assembly (match query to tool name, extract argument values, emit JSON), not reasoning."
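To make the "retrieval-and-assembly" framing concrete, here is a deliberately naive sketch: score tools by keyword overlap with the query, pick the best match, and emit JSON. This is *not* Needle's algorithm (Needle is a learned model), and the tool names and keywords are invented for illustration:

```python
# Toy illustration of tool calling as retrieval-and-assembly:
# match query to a tool name, stub out argument slots, emit JSON.
import json
import re

TOOLS = {
    "add_appointment": {"keywords": {"appointment", "meeting", "calendar", "schedule", "save"},
                        "args": ["title", "time"]},
    "send_message":    {"keywords": {"send", "text", "message", "reply"},
                        "args": ["recipient", "body"]},
}

def route(query: str) -> str:
    words = set(re.findall(r"\w+", query.lower()))
    # "match query to tool name": most keyword overlap wins
    name = max(TOOLS, key=lambda t: len(TOOLS[t]["keywords"] & words))
    # "extract argument values": left as placeholders in this sketch
    call = {"name": name, "arguments": {a: None for a in TOOLS[name]["args"]}}
    # "emit JSON"
    return json.dumps(call)

print(route("save this as an appointment for tomorrow 10:00"))
```

The point of the quote is that this whole pipeline is pattern matching and slot filling, not open-ended reasoning, which is why a tiny model can plausibly do it well.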
**Why this matters:**
1. **Edge deployment**: Most AI agents today require cloud APIs. Needle enables **on-device** tool routing, which means:
- No network latency
- Privacy-preserving (data stays on device)
- Works offline
- Lower cost (no API calls)
2. **Specialization over generalization**: Instead of using a massive general-purpose model for tool calling, Needle shows that a tiny specialized model can outperform larger models on this specific task. The HN discussion notes it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling.
3. **Architectural innovation**: The "Simple Attention Network" architecture (no FFN, encoder-decoder only) challenges conventional wisdom about transformer design. As the docs state: "For a task that is about routing information (query -> tool alignment), attention is the right primitive."
4. **Practical applications**: The HN discussion highlights potential use cases like:
- Command-line tools that understand natural language
- On-device assistants that can route to local tools
- Privacy-first desktop/mobile apps
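To make the "attention is the right primitive" idea from point 3 concrete, here is a minimal numpy sketch of an attention-only encoder-decoder step. The width, layer count (one), and random weights are illustrative assumptions, not Needle's actual configuration:

```python
# Minimal attention-only encoder-decoder step: no feed-forward network
# anywhere, just linear projections + scaled dot-product attention.
import numpy as np

rng = np.random.default_rng(0)
d = 32                                  # model width (assumed)

def attn(q, k, v):
    """Scaled dot-product attention with a numerically stable softmax."""
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Encoder: self-attention over the input (tool schemas + query), run ONCE.
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
x = rng.normal(size=(10, d))            # 10 input tokens
memory = attn(x @ Wq, x @ Wk, x @ Wv)   # fixed encoded representation

# Decoder step: cross-attention from the current output token into the
# fixed memory -- the input is never re-encoded per generated token.
Uq, Uk, Uv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
y = rng.normal(size=(1, d))             # current decoder token
out = attn(y @ Uq, memory @ Uk, memory @ Uv)
print(out.shape)  # (1, 32)
```

The structural point is visible in the shapes: the encoder output is computed once and reused at every decode step, which is what makes prefill so cheap relative to a decoder-only model with a growing KV cache.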
---
### What's still unclear / Open questions
**From the HN discussion and my research:**
1. **Performance on ambiguous queries**: One commenter asked about handling ambiguity: "Can I send it something like a text message 'lets catch up at coffee tomorrow 10:00' and a command like 'save this' and have it choose a 'add appointment' action from hundreds (or even tens) of possible tools?" The model is trained on 15 tool categories, but scalability to hundreds of tools isn't fully demonstrated.
2. **Real-world accuracy benchmarks**: While the authors claim it beats other models on single-shot function calling, detailed benchmark results aren't published yet. Commenters in the HN discussion asked for more data on the model's "discriminatory power."
3. **Conversation handling**: The authors explicitly state that while Needle beats other models on single-shot function calling, those models "have more scope/capacity and excel in conversational settings." Needle is designed for single-shot tool calls, not multi-turn conversations.
4. **Quantization impact**: The model is trained with INT4 quantization (14MB final size), but the performance impact of this aggressive quantization isn't fully documented.
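The stated numbers are at least internally consistent: 26M parameters at 4 bits each comes to about 13 MB of raw weights, close to the 14MB file size. Where the remaining ~1 MB goes is my assumption, not something the sources state:

```python
# Does 26M parameters at INT4 line up with the stated ~14 MB file size?
params = 26e6
bits_per_param = 4
raw_mb = params * bits_per_param / 8 / 1e6
print(f"{raw_mb:.0f} MB of raw INT4 weights")  # 13 MB
# The remaining ~1 MB plausibly covers quantization scales/zero-points,
# any tensors kept at higher precision, and file metadata (an assumption).
```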
5. **Cactus runtime availability**: The inference engine is part of the broader Cactus project, but the HN discussion notes that the team doesn't have infrastructure ready for a public demo yet. A community member deployed a [Hugging Face space](https://huggingface.co/spaces/benoitfavre/needle-playground) as a workaround.
6. **Generalization beyond tool calling**: The authors claim the "no FFN" finding generalizes to retrieval-augmented generation (RAG), but experimental results aren't published yet.
---
### Bottom line
Needle represents a compelling shift toward **specialized, tiny models** for specific AI agent tasks. It's not a general-purpose assistant, but a highly optimized tool-router that could enable a new class of on-device AI agents. The architectural innovations (an FFN-free, attention-only encoder-decoder) are worth watching, even if the specific claims need more independent verification.
**MIT licensed, fully open weights, and finetunable on consumer hardware** — making it accessible for experimentation and real-world deployment.