Investigate this live topic: AIR: Adaptive Interleaved Reasoning with Code in MLLMs. Start with http://arxiv.org/abs/2606.23678v1 and browse beyond it. Summarize what changed, why it matters, and cite the strongest sources.
<tool_call>
<function=done>
<parameter=result>## AIR: Adaptive Interleaved Reasoning with Code in MLLMs
**Paper:** [arXiv:2606.23678v1](https://arxiv.org/abs/2606.23678v1) (June 22, 2026) — submitted to ICML
**Authors:** Cong Han, Xiaolan Lan, Haibo Qiu, Yujie Zhong
**Code:** [github.com/CongHan0808/AIR](https://github.com/CongHan0808/AIR)
---
### What Changed
Prior open-source work on interleaved reasoning in multimodal LLMs (e.g., Visual-ToolFormer, Kimi-VL, DeepSeek-V3) focused almost exclusively on **visual tool-use** (rotation, zooming, cropping) to help models perceive images better. These approaches relied on predefined heuristics and were not designed to handle **numerical computation** problems.
AIR breaks from this by giving MLLMs the ability to **adaptively invoke Python code** during reasoning — not just for image manipulation, but for complex mathematical calculations. The model learns *when* code execution is needed and *when* it can reason without it, mirroring how humans reach for a calculator only when mental math won't suffice.
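
To make the "reach for a calculator only when needed" idea concrete, here is a minimal sketch of an interleaved reasoning loop. It is not taken from the AIR repository: the `<code>`/`<output>` tag format, the `generate` callable, and the function names are all assumptions chosen for illustration, and the executor has no sandboxing.

```python
import contextlib
import io
import re

# Hypothetical tag format; the paper's actual delimiters may differ.
CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_python(source: str) -> str:
    """Execute a snippet and return what it printed (no sandbox; sketch only)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue().strip()


def interleaved_reason(generate, prompt: str, max_turns: int = 4) -> str:
    """Alternate model text and code execution until the model stops asking for code.

    `generate` is an assumed callable mapping the transcript so far to the
    model's next reasoning step.
    """
    transcript = prompt
    for _ in range(max_turns):
        step = generate(transcript)
        transcript += step
        match = CODE_BLOCK.search(step)
        if match is None:
            # The model chose to continue without a tool call.
            break
        # Feed the execution result back so the next step can use it.
        transcript += f"\n<output>{run_python(match.group(1))}</output>\n"
    return transcript
```

The loop itself is not where the adaptivity lives: the model decides whether a step contains a code block, and the loop only executes what it is given. A step with no code block ends the loop without any tool call.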
The three key technical contributions:
1. **Two-stage cold-start data pipeline:** Instead of trying to generate interleaved reasoning data in one shot (which fails in multimodal settings), AIR first generates textual chain-of-thought, then rewrites it into code-augmented interleaved reasoning. This decoupled approach produces higher-quality SFT training data.
2. **Dual data filtering for RL:** Two strategies — *Self-Sampled* (multi-turn consensus via Pass@k across rollouts) and *Prior-Filtered* (teacher model verification) — curate high-fidelity training data for reinforcement learning, reducing noise in the training distribution.
3. **Group-constrained reward for adaptive tool invocation:** AIR modifies GRPO (Group Relative Policy Optimization) with group constraints so that the model learns *whether* and *when* to call code tools. According to the paper, this also **addresses the training instability problem** in agentic RL: as the proportion of tool-use grows, standard RL training becomes unstable and can collapse, and the group-constrained reward prevents this.
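
The filtering and reward ideas in items 2 and 3 can be sketched roughly as follows. Only the group-relative standardization in `group_relative_advantages` is standard GRPO. The pass-rate criterion in `self_sampled_keep` and the tool penalty in `group_constrained_rewards` are assumptions made for illustration, since the exact rules were not extracted from the paper, and all function names and thresholds are hypothetical.

```python
from statistics import mean, pstdev


def self_sampled_keep(correct_flags, low=0.0, high=1.0):
    """Assumed filter: keep a prompt only if the policy solves it in some,
    but not all, of its sampled rollouts."""
    rate = sum(correct_flags) / len(correct_flags)
    return low < rate < high


def group_constrained_rewards(correct, used_tool, tool_penalty=0.2):
    """Hypothetical group constraint, not the paper's formula.

    If any rollout in the group solved the prompt without code, correct
    rollouts that did call code receive a small penalty.
    """
    solved_without_tool = any(c and not t for c, t in zip(correct, used_tool))
    rewards = []
    for c, t in zip(correct, used_tool):
        reward = 1.0 if c else 0.0
        if c and t and solved_without_tool:
            reward -= tool_penalty
        rewards.append(reward)
    return rewards


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO step: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under these assumptions, a group in which every rollout earns the same reward yields all-zero advantages, which is one reason for filtering out prompts the policy always or never solves.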
---
### Key Results
- **+6.1 percentage points average improvement** across evaluation benchmarks after RL training
- **+9.9 pp accuracy increase** specifically on interleaved reasoning samples
- **>95% tool-use success rate** — the model reliably executes code when it decides to invoke it
- The group-constrained reward mechanism enables **stable long-term agentic RL training**, avoiding the model collapse that plagues other approaches as tool-use scales up
---
### Why It Matters
1. **Fills a gap in open-source MLLMs:** OpenAI's o3/o4-mini pioneered interleaved reasoning with code, but open-source MLLMs had not yet replicated it for numerical computation. AIR is among the first to demonstrate this capability in open models.
2. **Solves a real training stability problem:** The finding that adaptive tool invocation (via group-constrained rewards) stabilizes long-term agentic RL is significant — it means models can be trained for longer without collapse, unlocking more capable agents.
3. **Practical paradigm shift:** Moving from "always use tools" or "never use tools" to "adaptively decide when tools help" is a more human-like and efficient approach to multimodal reasoning.
4. **Broader research context:** AIR belongs to the fast-growing body of interleaved-reasoning work tracked by the "Awesome Interleaving Reasoning" list (281 stars on GitHub), alongside works like OpenAI o3, DeepSeek-R1, Kimi-K1.5, and Visual-ToolFormer, but it extends the paradigm beyond vision to computation.
---
### Strongest Sources
| Source | What it provides |
|--------|-----------------|
| **[arXiv:2606.23678v1](https://arxiv.org/abs/2606.23678v1)** | Primary paper with full methodology, experiments, and results |
| **[GitHub: CongHan0808/AIR](https://github.com/CongHan0808/AIR)** | Code repository (DATA + RL directories) |
| **[Awesome-Interleaving-Reasoning](https://github.com/Osilly/Awesome-Interleaving-Reasoning)** | Curated list covering the interleaved-reasoning landscape (281★); places AIR in context alongside o3, DeepSeek-R1, Kimi-K1.5, Visual-ToolFormer, etc. |
| **[AlphaXiv discussion](https://www.alphaxiv.org/abs/2606.23678)** | Community discussion thread |
| **[arxivtldr summary](https://arxivtldr.org/abs/2606.23678)** | TLDR summary of the paper |
---
### What's Missing / Limitations
- The repository's README is essentially empty (6 bytes), so reproducibility details beyond the code itself are sparse
- The HTML rendering of the paper on arXiv was truncated, so I couldn't extract the full benchmark-by-benchmark table breakdowns (the paper likely includes results on MathVista, MathVerse, and similar benchmarks)
- No external news coverage or blog analysis was found; the paper is very recent (posted June 22, 2026, one day before this summary)
- The paper acknowledges limitations in the conclusion section but I couldn't access the full text of that section</parameter>
</function>
</tool_call>