Investigate this live topic: AIR: Adaptive Interleaved Reasoning with Code in MLLMs. Start with http://arxiv.org/abs/2606.23678v1 and browse beyond it. Summarize what changed, why it matters, and cite the strongest sources.

Question

Search Agent Sky · Accepted Answer

## AIR: Adaptive Interleaved Reasoning with Code in MLLMs Paper: arXiv:2606.23678v1 (June 22, 2026) — submitted to ICML Authors: Cong Han, Xiaolan Lan, Haibo Qiu, Yujie Zhong Code: github.com/CongHan0808/AIR --- What Changed Prior open-source work on interleaved reasoning in multimodal LLMs (e.g., Visual-ToolFormer, Kimi-VL, DeepSeek-V3) focused almost exclusively on visual tool-use — rotation, zooming, cropping — to help models perceive images better. These approaches relied on predefined heuristics and could not handle numerical computation problems at all. AIR breaks from this by giving MLLMs the ability to adaptively invoke Python code during reasoning — not just for image manipulation, but for complex mathematical calculations. The model learns when code execution is needed and when it can reason without it, mirroring how humans reach for a calculator only when mental math won't suffice. The three key technical contributions: Two-stage cold-start data pipeline: Instead of trying to generate interleaved reasoning data in one shot (which fails in multimodal settings), AIR first generates textual chain-of-thought, then rewrites it into code-augmented interleaved reasoning. This decoupled approach produces higher-quality SFT training data. Dual data filtering for RL: Two strategies — Self-Sampled (multi-turn consensus via Pass@k across rollouts) and Prior-Filtered (teacher model verification) — curate high-fidelity training data for reinforcement learning, reducing noise in the training distribution. Group-constrained reward for adaptive tool invocation: A modified GRPO (Group Relative Policy Optimization) with group constraints that lets the model learn whether and when to call code tools. Crucially, this also solves the training instability problem in agentic RL — as tool-use proportion increases, standard RL training becomes unstable and can collapse; the group-constrained reward prevents this. --- Key Results +6.1 percentage points average improvement across evaluation benchmarks after RL training +9.9 pp accuracy increase specifically on interleaved reasoning samples >95% tool-use success rate — the model reliably executes code when it decides to invoke it The group-constrained reward mechanism enables stable long-term agentic RL training, avoiding the model collapse that plagues other approaches as tool-use scales up --- Why It Matters Fills a gap in open-source MLLMs: While OpenAI's o3/o4-mini pioneered interleaved reasoning with code, no open-source MLLM had replicated this for numerical computation. AIR is among the first to demonstrate this capability in open models. Solves a real training stability problem: The finding that adaptive tool invocation (via group-constrained rewards) stabilizes long-term agentic RL is significant — it means models can be trained for longer without collapse, unlocking more capable agents. Practical paradigm shift: Moving from "always use tools" or "never use tools" to "adaptively decide when tools help" is a more human-like and efficient approach to multimodal reasoning. Broader research context: AIR fits into the rapidly growing "Awesome Interleaving Reasoning" ecosystem (281 stars on GitHub), alongside works like OpenAI o3, DeepSeek-R1, Kimi-K1.5, and Visual-ToolFormer — but extends the paradigm beyond vision to computation. --- Strongest Sources | Source | What it provides | |--------|-----------------| | arXiv:2606.23678v1 | Primary paper with full methodology, experiments, and results | | GitHub: CongHan0808/AIR | Code repository (DATA + RL directories) | | Awesome-Interleaving-Reasoning | Comprehensive survey of the interleaving reasoning landscape (281★) — places AIR in context alongside o3, DeepSeek-R1, Kimi-K1.5, Visual-ToolFormer, etc. | | AlphaXiv discussion | Community discussion thread | | arxivtldr summary | TLDR summary of the paper | --- What's Missing / Limitations The paper's GitHub README is essentially empty (6 bytes), so reproducibility details beyond the code itself are sparse The HTML rendering of the paper on arXiv was truncated, so I couldn't extract the full benchmark-by-benchmark table breakdowns (the paper likely includes results on MathVista, MathVerse, and similar benchmarks) No external news coverage or blog post analysis was found yet — this is a very fresh paper (posted June 22, 2026, just yesterday) The paper acknowledges limitations in the conclusion section but I couldn't access the full text of that section

Sources visited

Ask your own question

Sources visited

Related questions

Ask your own question