Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning
Abstract
Visual-Seeker enables visual-native multimodal deep search through active visual reasoning, outperforming several proprietary models on multimodal search benchmarks in real-world web environments.
Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by using external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and text-only evidence trajectories, limiting the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent built on active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details and dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, validating robust visual-native reasoning and search in real-world web environments. The code and data can be accessed at: https://github.com/ZhengboZhang/Visual-Seeker.
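The abstract does not specify the agent's interface, so the following is only a minimal sketch of what an active visual reasoning loop could look like, not the authors' implementation. It assumes a policy that, at each step, either crops a region of the input image, issues a web search, or answers, with both visual and textual evidence accumulated in one trajectory. All names here (`Action`, `State`, `run_agent`, `toy_policy`, `toy_search`) are hypothetical and stand in for a real MLLM and real tools.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Action:
    """One step chosen by the (hypothetical) policy."""
    kind: str                 # "crop", "search", or "answer"
    box: Box = (0, 0, 0, 0)   # used when kind == "crop"
    query: str = ""           # used when kind == "search"
    answer: str = ""          # used when kind == "answer"


@dataclass
class State:
    question: str
    image_size: Tuple[int, int]                         # (width, height)
    evidence: List[str] = field(default_factory=list)   # mixed visual/text evidence


def run_agent(
    state: State,
    policy: Callable[[State], Action],
    search_tool: Callable[[str], List[str]],
    max_steps: int = 8,
) -> str:
    """Interleave visual actions and search until the policy answers."""
    for _ in range(max_steps):
        action = policy(state)
        if action.kind == "crop":
            # Clamp the requested region to the image bounds before recording it.
            left, top, right, bottom = action.box
            width, height = state.image_size
            box = (max(0, left), max(0, top), min(width, right), min(height, bottom))
            state.evidence.append(f"visual:crop{box}")
        elif action.kind == "search":
            for hit in search_tool(action.query):
                state.evidence.append(f"web:{hit}")
        elif action.kind == "answer":
            return action.answer
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
    return ""  # step budget exhausted without an answer


def toy_policy(state: State) -> Action:
    """Stand-in for an MLLM: look closer first, then search, then answer."""
    if not state.evidence:
        return Action(kind="crop", box=(900, 50, 1100, 200))
    if len(state.evidence) == 1:
        return Action(kind="search", query="logo on storefront")
    return Action(kind="answer", answer=f"answered with {len(state.evidence)} pieces of evidence")


def toy_search(query: str) -> List[str]:
    """Stand-in for a web search tool; returns canned results."""
    return ["result A", "result B"]


demo_state = State(question="Which company owns this store?", image_size=(1024, 768))
demo_answer = run_agent(demo_state, toy_policy, toy_search)
print(demo_answer)
print(demo_state.evidence)
```

In this sketch the crop is stored as evidence alongside search hits, which is one way to read "dynamically harvests visual evidence throughout the search process"; a text-only trajectory would keep only the `web:` entries.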
Community
Neat paper. The shift from treating vision as a static input to actively hunting for evidence throughout the search process feels like a logical step for MLLMs. I like that they are focusing on visual-native reasoning rather than just relying on text-based trajectories.
Since the model relies on these 5K synthesized multimodal trajectories, how much does the agent's performance rely on the quality of that specific data pipeline versus the underlying model architecture?
I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/6a148022-df2d-4c10-aaf2-8aa0c62f9144
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents (2026)
- InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search (2026)
- Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search (2026)
- VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch (2026)
- Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents (2026)
- InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward (2026)
- Act2See: Emergent Active Visual Perception for Video Reasoning (2026)