InterleaveThinker: Reinforcing Agentic Interleaved Generation
Abstract
InterleaveThinker adds interleaved generation capabilities to existing image generators through a multi-agent pipeline with planner and critic agents, achieving performance comparable to state-of-the-art models while also improving results on reasoning benchmarks.
Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot perform interleaved generation (producing text-image sequences), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold start. We then develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose an accuracy reward and a step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
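The planner, generator, and critic interaction described in the abstract can be sketched as a simple control loop. This is a minimal illustration only, not the authors' implementation: the names Step, Verdict, run_interleaved, max_retries, and the toy_* stand-ins are all hypothetical, and the real agents are models rather than string functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    text: str          # text segment of the interleaved output
    instruction: str   # instruction handed to the image generator


@dataclass
class Verdict:
    accepted: bool             # does the image follow the instruction?
    refined_instruction: str   # replacement instruction if it does not


def run_interleaved(
    prompt: str,
    planner: Callable[[str], List[Step]],
    generator: Callable[[str], str],
    critic: Callable[[str, str], Verdict],
    max_retries: int = 2,  # hypothetical cap on regenerations per step
) -> List[Tuple[str, str]]:
    """Plan the sequence, then generate each image with critic-guided retries."""
    output = []
    for step in planner(prompt):
        instruction = step.instruction
        image = generator(instruction)
        for _ in range(max_retries):
            verdict = critic(instruction, image)
            if verdict.accepted:
                break
            # Regenerate with the critic's refined instruction.
            instruction = verdict.refined_instruction
            image = generator(instruction)
        # If retries run out, the last regenerated image is kept unchecked.
        output.append((step.text, image))
    return output


# Toy stand-ins (assumptions for illustration) so the loop can be run as-is.
def toy_planner(prompt: str) -> List[Step]:
    return [Step(f"Step {i}: {prompt}", f"draw step {i} of {prompt}") for i in (1, 2)]


def toy_generator(instruction: str) -> str:
    return f"<image: {instruction}>"


def toy_critic(instruction: str, image: str) -> Verdict:
    if instruction.endswith("(detailed)"):
        return Verdict(True, instruction)
    return Verdict(False, instruction + " (detailed)")
```

In this toy setup, calling run_interleaved("making tea", toy_planner, toy_generator, toy_critic) rejects each first attempt once and accepts the regenerated image, so every step costs two generator calls. Retry behaviour of this kind is one way a single trajectory can accumulate a large number of generator calls.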
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners (2026)
- InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward (2026)
- Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning (2026)
- Generation Navigator: A State-Aware Agentic Framework for Image Generation (2026)
- FiRe: Fine-grained Multimodal Reasoning for Enhanced Image Generation (2026)
- Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback (2026)
- From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing (2026)
the dual reward scheme for the critic rl is the most interesting bit here, letting a single-step update approximate full trajectory optimization across 25+ generator calls.
R_acc keeps outputs faithful to the planner, while R_step pushes consistency across steps, which is neat as a computationally light surrogate for end-to-end trajectory optimization.
my worry is how brittle this is to planner errors or noisy evaluations—if the plan is off, can the critic's signals end up reinforcing a bad path rather than steering corrections?
a quick ablation on how much of the gain comes from the planner vs the critic would help, but i can see the value in decoupling planning from execution in this setting.
btw the arxivlens breakdown helped me parse the method details, https://arxivlens.com/PaperView/Details/interleavethinker-reinforcing-agentic-interleaved-generation-2785-c670d9dc
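For readers unfamiliar with GRPO, the single-step idea discussed in the comment above can be illustrated with the standard group-relative advantage: several critic responses are sampled for the same step, each is scored, and each score is normalized against its group. This is a hedged sketch, not the paper's formulation. The equal weighting of R_acc and R_step, the function names, and the use of the population standard deviation are all assumptions, and how the two rewards are actually computed is not specified here.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def combined_reward(r_acc: float, r_step: float,
                    w_acc: float = 1.0, w_step: float = 1.0) -> float:
    """Weighted sum of the two rewards (the weights are hypothetical)."""
    return w_acc * r_acc + w_step * r_step


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled critic responses for one step, scored as (R_acc, R_step).
group = [combined_reward(1.0, 1.0), combined_reward(1.0, 0.0), combined_reward(0.0, 0.0)]
advantages = group_relative_advantages(group)
```

Because the advantage depends only on rewards within one step's group, each update needs generator calls for that step alone rather than for the full trajectory.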
Get this paper in your agent:
hf papers read 2606.13679
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper 3
InterleaveThinker/InterleaveThinker-Critic-8B