MMSearch-Plus: Benchmarking Provenance-Aware Search for Multimodal Browsing Agents

Xijia Tao*1, Yihua Teng*2, Xinxing Su*2, Xinyu Fu2, Jihao Wu2, Chaofan Tao2, Ziru Liu2, Haoli Bai2, Rui Liu2, Lingpeng Kong†1
1The University of Hong Kong, 2Huawei Inc.

Abstract

Existing multimodal browsing benchmarks often fail to require genuine multimodal reasoning, as many tasks can be solved with text-only heuristics without vision-in-the-loop verification.

We introduce MMSearch-Plus, a 311-task benchmark that enforces multimodal understanding by requiring extraction and propagation of fine-grained visual cues through iterative image–text retrieval and cross-validation under retrieval noise. Our curation procedure seeds questions whose answers require extrapolating from spatial cues and temporal traces to out-of-image facts such as events, dates, and venues. Beyond the dataset, we provide a model-agnostic agent framework with standard browsing tools and a set-of-mark (SoM) module, which lets the agent place marks, crop subregions, and launch targeted image/text searches. SoM enables provenance-aware zoom-and-retrieve and improves robustness in multi-step reasoning.

We evaluated closed- and open-source MLLMs in this framework. The strongest system achieves an end-to-end accuracy of 36.0%, and integrating SoM produces consistent gains in multiple settings, with improvements up to +3.9 points. From failure analysis, we observe recurring errors in locating relevant webpages and distinguishing between visually similar events. These results underscore the challenges of real-world multimodal search and establish MMSearch-Plus as a rigorous benchmark for advancing agentic MLLMs.

Leaderboard

End-to-end results on the MMSearch-Plus benchmark across search modes. All numbers are accuracy (%). Columns report the overall average (Avg), accuracy by category (Geo., Sports, Acad., Film/TV, Tech, Games, Vlog, Music), and accuracy by difficulty (Easy, Hard).

Model / Search Mode    Avg    Geo.   Sports  Acad.  Film/TV  Tech   Games  Vlog   Music   Easy   Hard
Human
    Browser 22.8 20.3 25.9 20.0 25.0 19.4 16.1 31.6 35.3 34.0 18.0
Closed-source MLLMs
o3 (2025-04-16)
    Without Search 15.1 31.2 14.8 6.0 17.5 13.9 3.2 5.3 11.8 50.0 0.0
    Image Search 19.3 28.1 14.8 18.0 30.0 22.2 3.2 5.3 17.6 63.8 0.0
    Full Rollout 🥈 36.0 35.9 24.1 50.0 42.5 44.4 16.1 42.1 29.4 54.3 28.1
    Full Rollout + SoM 🥇 37.7 45.3 29.6 46.0 45.0 45.7 16.1 26.3 29.4 62.8 26.9
GPT-5
    Without Search 10.3 21.9 7.4 4.0 7.5 8.3 0.0 5.3 15.8 27.7 2.8
    Image Search 16.4 25.0 11.1 14.0 22.5 19.4 3.2 0.0 29.4 50.0 1.8
Gemini-2.5-Pro
    Without Search 10.6 15.6 11.1 6.0 12.5 13.9 0.0 15.8 5.9 35.1 0.0
    Image Search 16.4 26.6 11.1 18.0 20.0 16.7 3.2 0.0 23.5 54.3 0.0
    Full Rollout 23.8 39.1 14.8 12.0 27.5 33.3 6.5 26.3 29.4 46.8 13.8
    Full Rollout + SoM 🥉 27.7 40.6 22.2 24.0 25.0 33.3 19.4 15.8 29.4 54.3 16.1
Open-source MLLMs
Qwen-2.5-VL-72B-Instruct
    Without Search 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    Image Search 13.5 20.3 7.4 18.0 17.5 11.1 3.2 0.0 23.5 41.5 1.4
    Full Rollout 6.1 9.5 7.4 4.0 5.0 2.8 3.2 5.3 11.8 17.0 1.4
    Full Rollout + SoM 7.1 10.9 3.7 4.0 10.0 5.6 6.5 5.3 11.8 18.1 2.3
Models are grouped into closed-source and open-source MLLMs. Each model reports results for up to four search modes: Without Search, Image Search, Full Rollout, and Full Rollout + SoM (Set-of-Mark). The three best-performing configurations overall are marked with medals.

MMSearch-Plus Dataset

Overview

Recent advances in multimodal large language models (MLLMs) have enabled them to act as capable browsing agents, yet existing multimodal benchmarks such as MMSearch can often be solved through relatively fixed workflows that require little genuine multimodal reasoning. Many current benchmarks rely heavily on external image search, where the MLLM primarily orchestrates tool calls rather than performing deep visual reasoning: when search engines retrieve highly relevant images, even unimodal LLMs can frequently answer by reasoning over the accompanying text alone. This happens because a single strong image search can surface pages whose surrounding text already contains the answer, making image search tools and MLLMs partially interchangeable as information sources.

In contrast, recent text-only browsing benchmarks like BrowseComp emphasize persistence and creative, multi-step search for hard-to-find, entangled information, achieving much lower success rates (GPT-4o scores below 1% in direct-answer settings and under 2% even with browsing tools). Building on these insights, MMSearch-Plus introduces a BrowseComp-style multimodal benchmark that combines the persistence and high-reasoning demands of challenging text browsing with truly multimodal workflows that cannot be reduced to simple search-and-retrieve patterns.

Our benchmark targets challenging scenarios that require: (1) fine-grained, exhaustive visual reasoning that compels models to mine subtle, localized cues rather than rely on a single dominant entity; (2) provenance and source verification under retrieval noise—discriminating authentic sources when image results are conflicting and validating images embedded in webpages; and (3) long, tool-augmented reasoning chains with systematic cross-modal evidence gathering and resilience to near-duplicates. Unlike existing benchmarks where answers can often be read directly from prompts or images, MMSearch-Plus requires extrapolating from spatial cues (micro-text, layouts, uniforms, signage) and temporal traces (broadcast overlays, seasonal context) to identify events, dates, or locations not explicitly present.

Pipeline of MMSearch-Plus

Three multimodal reasoning paradigms: (1) Without search: an MLLM answers a factual visual question (VQA) using only its internal knowledge; (2) Whole-image search only: an MLLM combines its internal knowledge with the provided search results for the question's image; (3) MMSearch-Plus agentic framework: an MLLM can freely call a set of visual and search tools to extract fine-grained visual cues and search with precision.
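For concreteness, the three paradigms can be sketched roughly as follows. This is an illustrative outline only: the `mllm` client, the `search_image` tool, and the message format are assumptions for exposition, not the framework's actual API.

```python
# Illustrative sketch of the three reasoning paradigms. The `mllm` client,
# the `search_image` tool, and the message format are hypothetical stand-ins,
# not the benchmark's actual API.

def answer_without_search(mllm, image, question):
    # Paradigm 1: the model answers from internal knowledge only.
    return mllm.generate(images=[image], prompt=question)

def answer_with_image_search(mllm, image, question, search_image):
    # Paradigm 2: a single whole-image search; the model reasons over the
    # retrieved pages together with its internal knowledge.
    results = search_image(image)  # e.g. page titles, snippets, thumbnails
    return mllm.generate(images=[image],
                         prompt=f"{question}\n\nImage search results:\n{results}")

def answer_agentic(mllm, image, question, tools, max_steps=20):
    # Paradigm 3: free-form tool use (mark, crop, image/text search) until the
    # model emits a final answer or the step budget is exhausted.
    state = [{"role": "user", "content": [image, question]}]
    for _ in range(max_steps):
        action = mllm.step(state, tools=tools)  # model picks a tool call or answers
        if action.type == "final_answer":
            return action.text
        observation = tools[action.name](**action.args)
        state.append({"role": "assistant", "content": action})
        state.append({"role": "tool", "name": action.name, "content": observation})
    return None  # no final answer within the step budget
```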


Example MMSearch-Plus item demonstrating our BrowseComp-style approach. Given a 2025 concert photo and the query "What was the singer's performance time?", the agent must extract multiple localized cues—micro-text/lyrics, performer identification, festival/brand signage, and distinctive stage props—then issue targeted iterative searches to (i) identify the artist/outfit, (ii) resolve the specific event and venue, and (iii) cross-validate official schedules to obtain the exact performance time. This exemplifies our emphasis on fine-grained multimodal reasoning with rigorous provenance verification under retrieval noise.

Our 311-task benchmark spans diverse domains including geography, sports, academia, film/TV, technology, games, vlogs, and music. Each item is systematically curated using our Spatial-Temporal Extrapolation procedure to ensure genuine multimodal difficulty that matches the persistence demands of challenging text-only browsing benchmarks.

Key statistics of MMSearch-Plus.

Category distribution of MMSearch-Plus.

Data Curation Method: Spatial-Temporal Extrapolation

Overview of data curation strategy.

A central challenge in BrowseComp-like benchmarks arises from the large intermediate search space induced by soft, fuzzy constraints. This requires agents to perform non-trivial cross-validation and identify the correct target. In designing our benchmark, rather than remixing existing text-only datasets, we aim to construct problems that naturally expand the search space during multimodal information seeking, thereby testing an agent's ability for strategic planning and uncertainty-aware reasoning in dynamic environments.

Inspired by GeoGuessr-style tasks, our problems are anchored on real-world events. Agents must piece together fragmented visual information to identify the underlying source event. Task difficulty is modulated by varying the richness of both visual cues and textual context. Even a single visual fragment can expand the search space dramatically, requiring careful comparison with retrieved content and cross-validation against other multimodal evidence. In more difficult cases, this mirrors human cognition: the agent must iteratively generate hypotheses, verify them against internal knowledge or retrieved content, and refine its reasoning chain across interleaved text and images. Such processes result in extended trajectories that demand robust contextual understanding.

Once an event is identified, we formulate questions that probe its metadata or chain together multi-hop queries. To further elevate difficulty, we introduce Spatial-Temporal Extrapolation. Instead of asking what is directly visible, we query what is contextually implied but physically absent, compelling reasoning beyond the pixels to reconstruct the broader event. Spatial extrapolation targets unseen entities—individuals off-screen, facing away, or partially obscured—while temporal extrapolation probes events preceding or following the depicted moment. This design forces agents to first localize the event precisely (e.g., time, match, or episode), and then retrieve and reason over wider contextual knowledge from diverse sources.
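As a rough illustration of how a curated item could be organized, the sketch below shows one possible record layout; the field names are our own assumptions for exposition, not the released data schema.

```python
from dataclasses import dataclass, field

@dataclass
class MMSearchPlusItem:
    # Hypothetical record layout for one benchmark item; field names are
    # illustrative assumptions, not the released data schema.
    image_path: str     # image anchored on a real-world event
    question: str       # asks for something implied by, but not visible in, the image
    answer: str         # ground-truth string
    category: str       # e.g. "Sports", "Film/TV", "Music"
    difficulty: str     # "Easy" or "Hard"
    extrapolation: str  # "spatial" (unseen entity) or "temporal" (before/after the moment)
    clue_types: list[str] = field(default_factory=list)  # e.g. ["signage", "broadcast overlay"]
```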

Set-of-Mark (SoM) Visualizations

The Set-of-Mark (SoM) module enhances our multimodal browsing framework by enabling agents to place marks, crop subregions, and launch targeted searches. Below are examples showing how SoM annotations help identify and extract fine-grained visual cues that are crucial for answering complex questions requiring spatial-temporal reasoning.

These examples illustrate how the Set-of-Mark module enables provenance-aware search by:

  • Fine-grained region marking: Identifying specific visual elements that contain relevant information
  • Spatial reasoning: Understanding layout and positional relationships between elements
  • Temporal context extraction: Capturing time-sensitive information from broadcasts and dynamic content
  • Cross-modal validation: Linking visual cues to textual information for verification

The SoM annotations demonstrate the complexity of real-world multimodal reasoning tasks where answers cannot be obtained through simple text-based heuristics but require genuine understanding of visual content combined with external knowledge retrieval.
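A minimal sketch of the mark-crop-retrieve loop that SoM enables is given below, assuming a hypothetical `image_search` tool and pixel-box marks; it is illustrative, not the module's actual implementation.

```python
from PIL import Image

def crop_region(image_path: str, box: tuple[int, int, int, int]) -> Image.Image:
    """Crop one marked subregion given a (left, top, right, bottom) pixel box."""
    return Image.open(image_path).crop(box)

def zoom_and_retrieve(image_path, marks, image_search):
    """Crop each agent-placed mark and launch a targeted image search, keeping
    the mark id so retrieved evidence stays attributable (provenance-aware) to
    the visual cue that produced it.

    `marks` maps a mark id to a pixel box; `image_search` is a hypothetical
    reverse-image-search tool that returns candidate pages for a cropped image.
    """
    evidence = {}
    for mark_id, box in marks.items():
        crop = crop_region(image_path, box)
        evidence[mark_id] = image_search(crop)
    return evidence
```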

Experimental Results

Bar plot of performance by search mode.

Human-annotated error types of Gemini-2.5-Pro.

Erroneous Case Analysis

Reasoning Trajectory Analysis

This section provides an overview of reasoning-trajectory statistics across three MLLMs (o3, Gemini, and Qwen). Trajectories were collected in full-rollout mode, where models have access to both text and image search tools, and the analysis examines how the different models use these search capabilities during reasoning.

Left charts: Distribution of image search calls and text search calls per trajectory, showing how each model balances between textual and visual information gathering strategies.

Right charts: Relationship between assistant word count and the number of search calls, stratified by correctness (correct vs. incorrect responses). This reveals patterns in how verbose reasoning correlates with search behavior and ultimate accuracy across different models.
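A hedged sketch of how these per-trajectory statistics could be computed from rollout logs is shown below; the message format (dicts with `role`, `tool`, and `content` fields) is an assumption for illustration, not the framework's actual logging schema.

```python
def trajectory_stats(trajectory, is_correct):
    """Summarize one full-rollout trajectory: search calls per modality,
    assistant verbosity, and correctness.

    Assumes `trajectory` is a list of message dicts with a `role` field and,
    for tool calls, a `tool` field in {"image_search", "text_search"}; this
    log format is an assumption, not the framework's actual schema.
    """
    image_calls = sum(1 for m in trajectory if m.get("tool") == "image_search")
    text_calls = sum(1 for m in trajectory if m.get("tool") == "text_search")
    assistant_words = sum(
        len(m["content"].split())
        for m in trajectory
        if m.get("role") == "assistant" and isinstance(m.get("content"), str)
    )
    return {
        "image_search_calls": image_calls,
        "text_search_calls": text_calls,
        "assistant_words": assistant_words,
        "correct": is_correct,
    }
```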

Concurrent Work

During our work, we became aware of some related efforts that explore similar multimodal browsing challenges, though with different approaches and focus areas.

BrowseComp-VL (Geng et al., 2025) increases difficulty mainly by expanding the text search space in a BrowseComp-like manner. However, the image component often reduces to a single identifiable entity that can be quickly recognized and is used primarily for initial anchoring. More specifically, BrowseComp-VL is constructed by first creating multi-hop text QA tasks (following the BrowseComp style with entity obfuscation) and then converting them to visual QA by replacing explicit entity mentions with images retrieved from the web. As a result, many problems essentially become text search and webpage navigation tasks after an initial visual recognition step, rather than requiring sustained fine-grained visual reasoning throughout the process.

Another related effort is MM-BrowseComp (Li et al., 2025), which also explores multimodal browsing capabilities. Our work differs in several key aspects: (a) our data sources and curation methodology focus on spatial-temporal extrapolation from real-world events, (b) we provide a general search framework that can support any multimodal large language model, and (c) we conduct a detailed analysis of whether "thinking with images" and cropping strategies actually help current MLLMs excel on our benchmark.

While these concurrent works make valuable contributions to the field, our MMSearch-Plus benchmark is uniquely designed to require sustained multimodal reasoning throughout the entire search process, rather than relegating vision to an initial recognition step.

BibTeX

@article{tao2025mmsearch,
  title={MMSearch-Plus: A Simple Yet Challenging Benchmark for Multimodal Browsing Agents},
  author={Tao, Xijia and Teng, Yihua and Su, Xinxing and Fu, Xinyu and Wu, Jihao and Tao, Chaofan and Liu, Ziru and Bai, Haoli and Liu, Rui and Kong, Lingpeng},
  journal={arXiv preprint arXiv:2508.21475},
  year={2025}
}