MMSearch-Plus: Benchmarking Provenance-Aware Search for Multimodal Browsing Agents

Xijia Tao*1, Yihua Teng*2, Xinxing Su*2, Xinyu Fu2, Jihao Wu2, Chaofan Tao2, Ziru Liu2, Haoli Bai2, Rui Liu†2, Lingpeng Kong†1
1The University of Hong Kong, 2Huawei Inc.

Abstract

Multimodal large language models (MLLMs) are increasingly deployed as web agents, yet many multimodal browsing benchmarks can be solved by shallow, fixed workflows that lean on high-recall image search and nearby text—masking the genuinely multimodal challenges of fine-grained visual reasoning, provenance verification, and long-horizon tool use.

We introduce MMSearch-Plus, a benchmark of 311 tasks that places heavy demands on multimodal understanding while preserving the difficulty profile of strong text-only browsing suites. Each item is constructed to contain multiple weak, localized visual signals that must be extracted, propagated through iterative text-image search, and cross-validated under retrieval noise before answering. Our curation procedure, Spatial-Temporal Extrapolation, seeds questions whose answers require extrapolating from spatial cues (micro-text, part-level appearance, layouts, signage) and temporal traces (broadcast overlays, seasonal context) to out-of-image facts such as events, dates, and venues.

We provide a model-agnostic agent framework with browsing tools and evaluate a range of closed and open MLLMs. The strongest agent (o3) attains 15.1% accuracy without search and 36.0% with full rollout under our framework, while a strong open-source model (Qwen-2.5-VL-72B-Instruct) achieves 0.0% without search and 6.9% after 20 rounds of search. Beyond answer accuracy, we assess bounding-box production and cropped-image search, and conduct an error analysis that surfaces failures in source verification, part-based reasoning, and long-horizon planning.

Leaderboard

End-to-end results on the MMSearch-Plus benchmark across search modes. All numbers are accuracy (%). Columns report the overall average (Avg), accuracy by question category, and accuracy by difficulty split (Easy/Hard).

| Model / Search Mode | Avg | Geo. | Sports | Acad. | Film/TV | Tech | Games | Vlog | Music | Easy | Hard |
|---------------------|-----|------|--------|-------|---------|------|-------|------|-------|------|------|
| Closed-source LMMs | | | | | | | | | | | |
| o3 (2025-04-16) | | | | | | | | | | | |
| Without Search | 15.1 | 31.2 | 14.8 | 6.0 | 17.5 | 13.9 | 3.2 | 5.3 | 11.8 | 50.0 | 0.0 |
| Image Search | 19.3 | 28.1 | 14.8 | 18.0 | 30.0 | 22.2 | 3.2 | 5.3 | 17.6 | 63.8 | 0.0 |
| Full Rollout 🥇 | 36.0 | 35.9 | 24.1 | 50.0 | 42.5 | 44.4 | 16.1 | 42.1 | 29.4 | 54.3 | 28.1 |
| Gemini-2.5-Pro | | | | | | | | | | | |
| Without Search | 10.6 | 15.6 | 11.1 | 6.0 | 12.5 | 13.9 | 0.0 | 15.8 | 5.9 | 35.1 | 0.0 |
| Image Search | 16.4 | 26.6 | 11.1 | 18.0 | 20.0 | 16.7 | 3.2 | 0.0 | 23.5 | 54.3 | 0.0 |
| Full Rollout 🥈 | 23.8 | 39.1 | 14.8 | 12.0 | 27.5 | 33.3 | 6.5 | 26.3 | 29.4 | 46.8 | 13.8 |
| GPT-5 | | | | | | | | | | | |
| Without Search | 10.3 | 21.9 | 7.4 | 4.0 | 7.5 | 8.3 | 0.0 | 5.3 | 15.8 | 27.7 | 2.8 |
| Image Search 🥉 | 16.4 | 25.0 | 11.1 | 14.0 | 22.5 | 19.4 | 3.2 | 0.0 | 29.4 | 50.0 | 1.8 |
| Open-source LMMs | | | | | | | | | | | |
| Qwen-2.5-VL-72B-Instruct | | | | | | | | | | | |
| Without Search | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Image Search | 13.5 | 20.3 | 7.4 | 18.0 | 17.5 | 11.1 | 3.2 | 0.0 | 23.5 | 41.5 | 1.4 |
| Full Rollout | 6.1 | 9.5 | 7.4 | 4.0 | 5.0 | 2.8 | 3.2 | 5.3 | 11.8 | 17.0 | 1.4 |
Models are grouped into closed-source and open-source LMMs, and each model is evaluated under up to three search modes: Without Search, Image Search, and Full Rollout. Medals rank the top three models by their best-performing configuration.
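For concreteness, the accuracies above reduce to simple bucket counting over per-item correctness judgments. The sketch below is illustrative rather than the official evaluation script; the record fields ("category", "difficulty", "correct") are assumed names for exposition.

```python
# Illustrative aggregation of per-item correctness into the accuracies above.
# This is a sketch, not the official evaluation script; the record fields
# ("category", "difficulty", "correct") are assumed names.
from collections import defaultdict

def aggregate(records):
    """Return accuracy (%) overall ("Avg"), per category, and per difficulty."""
    buckets = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for r in records:
        for key in ("Avg", r["category"], r["difficulty"]):
            buckets[key][0] += int(r["correct"])
            buckets[key][1] += 1
    return {k: round(100.0 * c / n, 1) for k, (c, n) in buckets.items()}

print(aggregate([
    {"category": "Geo.", "difficulty": "Easy", "correct": True},
    {"category": "Sports", "difficulty": "Hard", "correct": False},
]))
# -> {'Avg': 50.0, 'Geo.': 100.0, 'Easy': 100.0, 'Sports': 0.0, 'Hard': 0.0}
```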

MMSearch-Plus Dataset

Overview

Recent advances in multimodal large language models (MLLMs) have enabled them to act as capable browsing agents, yet existing multimodal benchmarks such as MMSearch can often be solved through relatively fixed workflows that require little genuine multimodal reasoning. Many current benchmarks rely heavily on external image search, so the MLLM primarily orchestrates retrieval rather than performing deep visual reasoning: when search engines return highly relevant images, even unimodal LLMs can often answer from the accompanying text alone. This happens because a single strong image search can surface pages whose surrounding text already contains the answer, making image search tools and MLLMs partially interchangeable as information sources.

In contrast, recent text-only browsing benchmarks like BrowseComp emphasize persistence and creative, multi-step search for hard-to-find, entangled information, and models achieve much lower success rates on them (GPT-4o scores below 1% in direct-answer settings and under 2% even with browsing tools). Building on these insights, MMSearch-Plus introduces a BrowseComp-style multimodal benchmark that combines the persistence and high-reasoning demands of challenging text browsing with truly multimodal workflows that cannot be reduced to simple search-and-retrieve patterns.

Our benchmark targets challenging scenarios that require: (1) fine-grained, exhaustive visual reasoning that compels models to mine subtle, localized cues rather than rely on a single dominant entity; (2) provenance and source verification under retrieval noise—discriminating authentic sources when image results are conflicting and validating images embedded in webpages; and (3) long, tool-augmented reasoning chains with systematic cross-modal evidence gathering and resilience to near-duplicates. Unlike existing benchmarks where answers can often be read directly from prompts or images, MMSearch-Plus requires extrapolating from spatial cues (micro-text, layouts, uniforms, signage) and temporal traces (broadcast overlays, seasonal context) to identify events, dates, or locations not explicitly present.

Pipeline of MMSearch-Plus

Example MMSearch-Plus item demonstrating our BrowseComp-style approach. Given a 2025 concert photo and the query "What was the singer's performance time?", the agent must extract multiple localized cues—micro-text/lyrics, performer identification, festival/brand signage, and distinctive stage props—then issue targeted iterative searches to (i) identify the artist/outfit, (ii) resolve the specific event and venue, and (iii) cross-validate official schedules to obtain the exact performance time. This exemplifies our emphasis on fine-grained multimodal reasoning with rigorous provenance verification under retrieval noise.

Our 311-task benchmark spans diverse domains including geography, sports, academia, film/TV, technology, games, vlogs, and music. Each item is systematically curated using our Spatial-Temporal Extrapolation procedure to ensure genuine multimodal difficulty that matches the persistence demands of challenging text-only browsing benchmarks.
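To make the full-rollout workflow concrete, here is a minimal sketch of an agent loop of the kind described above, assuming hypothetical tool wrappers (call_mllm, text_search, image_search, crop_image) and a JSON action format; the released framework's interfaces may differ.

```python
# Minimal sketch of a full-rollout agent loop in the spirit described above.
# call_mllm, text_search, image_search, and crop_image are hypothetical
# placeholders, not the released framework's API, and the JSON action
# format is an assumption for illustration.
import json

MAX_ROUNDS = 20  # e.g. the 20-round budget mentioned for Qwen-2.5-VL above

def rollout(question, image, call_mllm, text_search, image_search, crop_image):
    messages = [{"role": "user", "content": [question, image]}]
    for _ in range(MAX_ROUNDS):
        reply = call_mllm(messages)       # assumed to return a JSON action string
        action = json.loads(reply)        # e.g. {"tool": "text_search", "args": {...}}
        if action.get("tool") == "text_search":
            observation = text_search(action["args"]["query"])
        elif action.get("tool") == "image_search":
            # Optionally crop to a bounding box so a weak, localized cue
            # (micro-text, signage, a stage prop) dominates the visual query.
            bbox = action["args"].get("bbox")
            query_image = crop_image(image, bbox) if bbox else image
            observation = image_search(query_image)
        else:
            return action.get("answer")   # model committed to a final answer
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "tool", "content": observation})
    return None  # search budget exhausted without a final answer
```

Because the tools are passed in as plain callables, the same loop can drive any MLLM that emits a structured action, which is what makes such a framework model-agnostic.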

Key statistics of MMSearch-Plus.

Category distribution of MMSearch-Plus.

Data Curation Method: Spatial-Temporal Extrapolation

Overview of data curation strategy.

A central challenge in BrowseComp-like benchmarks arises from the large intermediate search space induced by soft, fuzzy constraints. This requires agents to perform non-trivial cross-validation and identify the correct target. In designing our benchmark, rather than remixing existing text-only datasets, we aim to construct problems that naturally expand the search space during multimodal information seeking, thereby testing an agent's ability for strategic planning and uncertainty-aware reasoning in dynamic environments.

Inspired by GeoGuessr-style tasks, our problems are anchored on real-world events. Agents must piece together fragmented visual information to identify the underlying source event. Task difficulty is modulated by varying the richness of both visual clues and textual context. Even a single visual fragment can expand the search space dramatically, requiring careful comparison with retrieved content and cross-validation against other multimodal evidence. In more difficult cases, the process mirrors human cognition: the agent must iteratively generate hypotheses, verify them against internal knowledge or retrieved content, and refine its reasoning chain across interleaved text and images. Such processes result in extended trajectories that demand robust contextual understanding.

Once an event is identified, we formulate questions that probe its metadata or chain together multi-hop queries. To further elevate difficulty, we introduce Spatial-Temporal Extrapolation. Instead of asking what is directly visible, we query what is contextually implied but physically absent, compelling reasoning beyond the pixels to reconstruct the broader event. Spatial extrapolation targets unseen entities—individuals off-screen, facing away, or partially obscured—while temporal extrapolation probes events preceding or following the depicted moment. This design forces agents to first localize the event precisely (e.g., time, match, or episode), and then retrieve and reason over wider contextual knowledge from diverse sources.
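As a rough illustration of what a curated item carries, the following schema sketch pairs a seed image with its spatial and temporal cues and an extrapolated question; the field names and example values are assumptions for exposition, not the released dataset format.

```python
# Illustrative item schema for Spatial-Temporal Extrapolation. The field
# names and example values are assumptions for exposition, not the
# released dataset format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MMSearchPlusItem:
    image_path: str                                          # seed photo of a real-world event
    spatial_cues: List[str] = field(default_factory=list)    # micro-text, signage, uniforms, props
    temporal_cues: List[str] = field(default_factory=list)   # broadcast overlays, seasonal context
    question: str = ""                                       # asks about what is implied, not visible
    answer: str = ""                                         # out-of-image fact: event, date, venue, ...
    category: str = ""                                       # Geo., Sports, Acad., Film/TV, ...
    difficulty: str = ""                                     # Easy or Hard

# In the spirit of the concert example above (values are illustrative):
item = MMSearchPlusItem(
    image_path="concert_2025.jpg",
    spatial_cues=["on-screen lyrics", "festival signage", "distinctive stage props"],
    temporal_cues=["2025 tour branding"],
    question="What was the singer's performance time?",
    answer="<set time from the official festival schedule>",
    category="Music",
    difficulty="Hard",
)
```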

Experimental Results

Bar plot of performance by search mode.

Human-annotated error types of Gemini-2.5-Pro.

Erroneous Case Analysis

Reasoning Trajectory Analysis

This section provides an overview of reasoning-trajectory statistics across three MLLMs (o3, Gemini, Qwen). Trajectories were collected in full-rollout mode, where models have access to both text and image search functions; the analysis examines how each model uses these search capabilities during its reasoning process.

Left charts: Distribution of image search calls and text search calls per trajectory, showing how each model balances between textual and visual information gathering strategies.

Right charts: Relationship between assistant word count and the number of search calls, stratified by correctness (correct vs. incorrect responses). This reveals patterns in how verbose reasoning correlates with search behavior and ultimate accuracy across different models.
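A minimal sketch of how such per-trajectory statistics can be computed is shown below, assuming a simple list-of-turns trajectory format with "role", optional "tool", and text "content" fields (an assumption for illustration, not the exact logging format of our framework).

```python
# Sketch of the per-trajectory statistics described above. The trajectory
# format (a list of turn dicts carrying "role", optional "tool", and text
# "content") is an assumption, not the exact logging format of the framework.
def trajectory_stats(trajectory, correct):
    image_calls = sum(1 for turn in trajectory if turn.get("tool") == "image_search")
    text_calls = sum(1 for turn in trajectory if turn.get("tool") == "text_search")
    assistant_words = sum(
        len(str(turn.get("content", "")).split())
        for turn in trajectory if turn.get("role") == "assistant"
    )
    return {
        "image_calls": image_calls,
        "text_calls": text_calls,
        "assistant_words": assistant_words,
        "correct": correct,
    }

# Aggregating these dicts over all rollouts yields the call-count
# distributions (left charts) and the word-count vs. search-call
# relationship stratified by correctness (right charts).
```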

Concurrent Work

During our work, we became aware of some related efforts that explore similar multimodal browsing challenges, though with different approaches and focus areas.

BrowseComp-VL (Geng et al., 2025) expands difficulty mainly through the text search space in a BrowseComp-like manner. However, its image component often simplifies to a single identifiable entity that can be quickly found and used primarily for initial anchoring. More specifically, BrowseComp-VL is constructed by first creating multi-hop text QA tasks (following the BrowseComp style with entity obfuscation) and then converting them to visual QA by replacing explicit entity mentions with images retrieved from the web. This design means that many problems essentially become text search and webpage navigation tasks after an initial visual recognition step, rather than requiring sustained fine-grained visual reasoning throughout the process.

Another related effort is MM-BrowseComp (Li et al., 2025), which also explores multimodal browsing capabilities. Our work differs in several key aspects: (a) our data sources and curation methodology focus on spatial-temporal extrapolation from real-world events, (b) we provide a general search framework that can support any multimodal large language model, and (c) we conduct a detailed analysis of whether "thinking with images" and cropping strategies actually help current MLLMs excel on our benchmark.

While these concurrent works make valuable contributions to the field, our MMSearch-Plus benchmark is uniquely designed to require sustained multimodal reasoning throughout the entire search process, rather than relegating vision to an initial recognition step.

BibTeX

@misc{tao2025mmsearchplussimplechallengingbenchmark,
      title={MMSearch-Plus: A Simple Yet Challenging Benchmark for Multimodal Browsing Agents}, 
      author={Xijia Tao and Yihua Teng and Xinxing Su and Xinyu Fu and Jihao Wu and Chaofan Tao and Ziru Liu and Haoli Bai and Rui Liu and Lingpeng Kong},
      year={2025},
      eprint={2508.21475},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.21475}, 
}