MMSearch-Plus: Benchmarking Provenance-Aware Search for Multimodal Browsing Agents

Xijia Tao*1, Yihua Teng*2, Xinxing Su*2, Xinyu Fu2, Jihao Wu2, Chaofan Tao2, Ziru Liu2, Haoli Bai2, Rui Liu2, Lingpeng Kong†1
1The University of Hong Kong, 2Huawei Inc.

Abstract

Existing multimodal browsing benchmarks often fail to require genuine multimodal reasoning, as many tasks can be solved with text-only heuristics without vision-in-the-loop verification.

We introduce MMSearch-Plus, a 311-task benchmark that enforces multimodal understanding by requiring extraction and propagation of fine-grained visual cues through iterative image–text retrieval and cross-validation under retrieval noise. Our curation procedure seeds questions whose answers require extrapolating from spatial cues and temporal traces to out-of-image facts such as events, dates, and venues. Beyond the dataset, we provide a model-agnostic agent framework with standard browsing tools and a set-of-mark (SoM) module, which lets the agent place marks, crop subregions, and launch targeted image/text searches. SoM enables provenance-aware zoom-and-retrieve and improves robustness in multi-step reasoning.

We evaluated closed- and open-source MLLMs in this framework. The strongest system achieves an end-to-end accuracy of 36.0%, and integrating SoM produces consistent gains in multiple settings, with improvements up to +3.9 points. From failure analysis, we observe recurring errors in locating relevant webpages and distinguishing between visually similar events. These results underscore the challenges of real-world multimodal search and establish MMSearch-Plus as a rigorous benchmark for advancing agentic MLLMs.

Leaderboard

End-to-end results on the MMSearch-Plus benchmark across search modes. All numbers are accuracy (%). Columns report the overall average (Avg), accuracy by category (Geo., Sports, Acad., Film/TV, Tech, Games, Vlog, Music), and accuracy by difficulty (Easy, Hard).

Model / Search Mode    Avg    Geo.   Sports  Acad.  Film/TV  Tech   Games  Vlog   Music   Easy   Hard
Human
    Browser 22.8 20.3 25.9 20.0 25.0 19.4 16.1 31.6 35.3 34.0 18.0
Closed-source MLLMs
o3 (2025-04-16)
    Without Search 15.1 31.2 14.8 6.0 17.5 13.9 3.2 5.3 11.8 50.0 0.0
    Image Search 19.3 28.1 14.8 18.0 30.0 22.2 3.2 5.3 17.6 63.8 0.0
    Full Rollout 🥈 36.0 35.9 24.1 50.0 42.5 44.4 16.1 42.1 29.4 54.3 28.1
    Full Rollout + SoM 🥇 37.7 45.3 29.6 46.0 45.0 45.7 16.1 26.3 29.4 62.8 26.9
GPT-5
    Without Search 10.3 21.9 7.4 4.0 7.5 8.3 0.0 5.3 15.8 27.7 2.8
    Image Search 16.4 25.0 11.1 14.0 22.5 19.4 3.2 0.0 29.4 50.0 1.8
Gemini-2.5-Pro
    Without Search 10.6 15.6 11.1 6.0 12.5 13.9 0.0 15.8 5.9 35.1 0.0
    Image Search 16.4 26.6 11.1 18.0 20.0 16.7 3.2 0.0 23.5 54.3 0.0
    Full Rollout 23.8 39.1 14.8 12.0 27.5 33.3 6.5 26.3 29.4 46.8 13.8
    Full Rollout + SoM 🥉 27.7 40.6 22.2 24.0 25.0 33.3 19.4 15.8 29.4 54.3 16.1
Open-source MLLMs
Qwen-2.5-VL-72B-Instruct
    Without Search 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    Image Search 13.5 20.3 7.4 18.0 17.5 11.1 3.2 0.0 23.5 41.5 1.4
    Full Rollout 6.1 9.5 7.4 4.0 5.0 2.8 3.2 5.3 11.8 17.0 1.4
    Full Rollout + SoM 7.1 10.9 3.7 4.0 10.0 5.6 6.5 5.3 11.8 18.1 2.3
Models are grouped into closed-source and open-source MLLMs. Each model reports results for up to four search modes: Without Search, Image Search, Full Rollout, and Full Rollout + SoM (Set-of-Mark). The three best-performing configurations overall are marked with medals.

MMSearch-Plus Dataset

Overview

Recent advances in multimodal large language models (MLLMs) have enabled them to act as capable browsing agents, yet existing multimodal benchmarks such as MMSearch can often be solved through relatively fixed workflows that require little genuine multimodal reasoning. Many current benchmarks rely heavily on external image search, where the MLLM primarily orchestrates tool calls rather than performing deep visual reasoning: when search engines retrieve highly relevant images, even unimodal LLMs can frequently answer by reasoning over the accompanying text alone. This happens because a single strong image search can surface pages whose surrounding text already contains the answer, making image search tools and MLLMs partially interchangeable as information sources.

In contrast, recent text-only browsing benchmarks like BrowseComp emphasize persistence and creative, multi-step search for hard-to-find, entangled information, achieving much lower success rates (GPT-4o scores below 1% in direct-answer settings and under 2% even with browsing tools). Building on these insights, MMSearch-Plus introduces a BrowseComp-style multimodal benchmark that combines the persistence and high-reasoning demands of challenging text browsing with truly multimodal workflows that cannot be reduced to simple search-and-retrieve patterns.

Our benchmark targets challenging scenarios that require: (1) fine-grained, exhaustive visual reasoning that compels models to mine subtle, localized cues rather than rely on a single dominant entity; (2) provenance and source verification under retrieval noise—discriminating authentic sources when image results are conflicting and validating images embedded in webpages; and (3) long, tool-augmented reasoning chains with systematic cross-modal evidence gathering and resilience to near-duplicates. Unlike existing benchmarks where answers can often be read directly from prompts or images, MMSearch-Plus requires extrapolating from spatial cues (micro-text, layouts, uniforms, signage) and temporal traces (broadcast overlays, seasonal context) to identify events, dates, or locations not explicitly present.

Pipeline of MMSearch-Plus

Three multimodal reasoning paradigms: (1) Without search: an MLLM answers a factual visual question (VQA) using only its internal knowledge; (2) Whole-image search only: an MLLM combines its internal knowledge with the provided search results for the question's image; (3) MMSearch-Plus agentic framework: an MLLM can freely call a set of visual and search tools to extract fine-grained visual cues and search with precision.
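For concreteness, the three paradigms can be sketched roughly as follows. This is an illustrative outline only: the `mllm` client, the `search_image` tool, and the message format are assumptions for exposition, not the framework's actual API.

```python
# Illustrative sketch of the three reasoning paradigms. The `mllm` client,
# the `search_image` tool, and the message format are hypothetical stand-ins,
# not the benchmark's actual API.

def answer_without_search(mllm, image, question):
    # Paradigm 1: the model answers from internal knowledge only.
    return mllm.generate(images=[image], prompt=question)

def answer_with_image_search(mllm, image, question, search_image):
    # Paradigm 2: a single whole-image search; the model reasons over the
    # retrieved pages together with its internal knowledge.
    results = search_image(image)  # e.g. page titles, snippets, thumbnails
    return mllm.generate(images=[image],
                         prompt=f"{question}\n\nImage search results:\n{results}")

def answer_agentic(mllm, image, question, tools, max_steps=20):
    # Paradigm 3: free-form tool use (mark, crop, image/text search) until the
    # model emits a final answer or the step budget is exhausted.
    state = [{"role": "user", "content": [image, question]}]
    for _ in range(max_steps):
        action = mllm.step(state, tools=tools)  # model picks a tool call or answers
        if action.type == "final_answer":
            return action.text
        observation = tools[action.name](**action.args)
        state.append({"role": "assistant", "content": action})
        state.append({"role": "tool", "name": action.name, "content": observation})
    return None  # no final answer within the step budget
```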


Example MMSearch-Plus item demonstrating our BrowseComp-style approach. Given a 2025 concert photo and the query "What was the singer's performance time?", the agent must extract multiple localized cues—micro-text/lyrics, performer identification, festival/brand signage, and distinctive stage props—then issue targeted iterative searches to (i) identify the artist/outfit, (ii) resolve the specific event and venue, and (iii) cross-validate official schedules to obtain the exact performance time. This exemplifies our emphasis on fine-grained multimodal reasoning with rigorous provenance verification under retrieval noise.

Our 311-task benchmark spans diverse domains including geography, sports, academia, film/TV, technology, games, vlogs, and music. Each item is systematically curated using our Spatial-Temporal Extrapolation procedure to ensure genuine multimodal difficulty that matches the persistence demands of challenging text-only browsing benchmarks.

Key statistics of MMSearch-Plus.

Category distribution of MMSearch-Plus.

Data Curation Method: Spatial-Temporal Extrapolation

Overview of data curation strategy.

A central challenge in BrowseComp-like benchmarks arises from the large intermediate search space induced by soft, fuzzy constraints. This requires agents to perform non-trivial cross-validation and identify the correct target. In designing our benchmark, rather than remixing existing text-only datasets, we aim to construct problems that naturally expand the search space during multimodal information seeking, thereby testing an agent's ability for strategic planning and uncertainty-aware reasoning in dynamic environments.

Inspired by GeoGuessr-style tasks, our problems are anchored on real-world events. Agents must piece together fragmented visual information to identify the underlying source event. Task difficulty is modulated by varying the richness of both visual cues and textual context. Even a single visual fragment can expand the search space dramatically, requiring careful comparison with retrieved content and cross-validation against other multimodal evidence. In more difficult cases, this mirrors human cognition: the agent must iteratively generate hypotheses, verify them against internal knowledge or retrieved content, and refine its reasoning chain across interleaved text and images. Such processes result in extended trajectories that demand robust contextual understanding.

Once an event is identified, we formulate questions that probe its metadata or chain together multi-hop queries. To further elevate difficulty, we introduce Spatial-Temporal Extrapolation. Instead of asking what is directly visible, we query what is contextually implied but physically absent, compelling reasoning beyond the pixels to reconstruct the broader event. Spatial extrapolation targets unseen entities—individuals off-screen, facing away, or partially obscured—while temporal extrapolation probes events preceding or following the depicted moment. This design forces agents to first localize the event precisely (e.g., time, match, or episode), and then retrieve and reason over wider contextual knowledge from diverse sources.
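As a rough illustration of how a curated item could be organized, the sketch below shows one possible record layout; the field names are our own assumptions for exposition, not the released data schema.

```python
from dataclasses import dataclass, field

@dataclass
class MMSearchPlusItem:
    # Hypothetical record layout for one benchmark item; field names are
    # illustrative assumptions, not the released data schema.
    image_path: str     # image anchored on a real-world event
    question: str       # asks for something implied by, but not visible in, the image
    answer: str         # ground-truth string
    category: str       # e.g. "Sports", "Film/TV", "Music"
    difficulty: str     # "Easy" or "Hard"
    extrapolation: str  # "spatial" (unseen entity) or "temporal" (before/after the moment)
    clue_types: list[str] = field(default_factory=list)  # e.g. ["signage", "broadcast overlay"]
```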

Set-of-Mark (SoM) Visualizations

The Set-of-Mark (SoM) module enhances our multimodal browsing framework by enabling agents to place marks, crop subregions, and launch targeted searches. Below are examples showing how SoM annotations help identify and extract fine-grained visual cues that are crucial for answering complex questions requiring spatial-temporal reasoning.

These examples illustrate how the Set-of-Mark module enables provenance-aware search by:

  • Fine-grained region marking: Identifying specific visual elements that contain relevant information
  • Spatial reasoning: Understanding layout and positional relationships between elements
  • Temporal context extraction: Capturing time-sensitive information from broadcasts and dynamic content
  • Cross-modal validation: Linking visual cues to textual information for verification

The SoM annotations demonstrate the complexity of real-world multimodal reasoning tasks where answers cannot be obtained through simple text-based heuristics but require genuine understanding of visual content combined with external knowledge retrieval.
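A minimal sketch of the mark-crop-retrieve loop that SoM enables is given below, assuming a hypothetical `image_search` tool and pixel-box marks; it is illustrative, not the module's actual implementation.

```python
from PIL import Image

def crop_region(image_path: str, box: tuple[int, int, int, int]) -> Image.Image:
    """Crop one marked subregion given a (left, top, right, bottom) pixel box."""
    return Image.open(image_path).crop(box)

def zoom_and_retrieve(image_path, marks, image_search):
    """Crop each agent-placed mark and launch a targeted image search, keeping
    the mark id so retrieved evidence stays attributable (provenance-aware) to
    the visual cue that produced it.

    `marks` maps a mark id to a pixel box; `image_search` is a hypothetical
    reverse-image-search tool that returns candidate pages for a cropped image.
    """
    evidence = {}
    for mark_id, box in marks.items():
        crop = crop_region(image_path, box)
        evidence[mark_id] = image_search(crop)
    return evidence
```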

Experimental Results

Bar plot of performance by search mode.

Human-annotated error types of Gemini-2.5-Pro.

Erroneous Case Analysis

Reasoning Trajectory Analysis

This section provides an overview of reasoning-trajectory statistics across three MLLMs (o3, Gemini, and Qwen). Trajectories were collected in full-rollout mode, where models have access to both text and image search tools, and the analysis examines how the different models use these search capabilities during reasoning.

Left charts: Distribution of image search calls and text search calls per trajectory, showing how each model balances between textual and visual information gathering strategies.

Right charts: Relationship between assistant word count and the number of search calls, stratified by correctness (correct vs. incorrect responses). This reveals patterns in how verbose reasoning correlates with search behavior and ultimate accuracy across different models.
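A hedged sketch of how these per-trajectory statistics could be computed from rollout logs is shown below; the message format (dicts with `role`, `tool`, and `content` fields) is an assumption for illustration, not the framework's actual logging schema.

```python
def trajectory_stats(trajectory, is_correct):
    """Summarize one full-rollout trajectory: search calls per modality,
    assistant verbosity, and correctness.

    Assumes `trajectory` is a list of message dicts with a `role` field and,
    for tool calls, a `tool` field in {"image_search", "text_search"}; this
    log format is an assumption, not the framework's actual schema.
    """
    image_calls = sum(1 for m in trajectory if m.get("tool") == "image_search")
    text_calls = sum(1 for m in trajectory if m.get("tool") == "text_search")
    assistant_words = sum(
        len(m["content"].split())
        for m in trajectory
        if m.get("role") == "assistant" and isinstance(m.get("content"), str)
    )
    return {
        "image_search_calls": image_calls,
        "text_search_calls": text_calls,
        "assistant_words": assistant_words,
        "correct": is_correct,
    }
```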

Concurrent Work

During our work, we became aware of some related efforts that explore similar multimodal browsing challenges, though with different approaches and focus areas.

BrowseComp-VL (Geng et al., 2025) increases difficulty mainly by expanding the text search space in a BrowseComp-like manner. However, the image component often reduces to a single identifiable entity that can be quickly recognized and is used primarily for initial anchoring. More specifically, BrowseComp-VL is constructed by first creating multi-hop text QA tasks (following the BrowseComp style with entity obfuscation) and then converting them to visual QA by replacing explicit entity mentions with images retrieved from the web. As a result, many problems essentially become text search and webpage navigation tasks after an initial visual recognition step, rather than requiring sustained fine-grained visual reasoning throughout the process.

Another related effort is MM-BrowseComp (Li et al., 2025), which also explores multimodal browsing capabilities. Our work differs in several key aspects: (a) our data sources and curation methodology focus on spatial-temporal extrapolation from real-world events, (b) we provide a general search framework that can support any multimodal large language model, and (c) we conduct a detailed analysis of whether "thinking with images" and cropping strategies actually help current MLLMs excel on our benchmark.

While these concurrent works make valuable contributions to the field, our MMSearch-Plus benchmark is uniquely designed to require sustained multimodal reasoning throughout the entire search process, rather than relegating vision to an initial recognition step.

BibTeX

@article{tao2025mmsearch,
  title={MMSearch-Plus: A Simple Yet Challenging Benchmark for Multimodal Browsing Agents},
  author={Tao, Xijia and Teng, Yihua and Su, Xinxing and Fu, Xinyu and Wu, Jihao and Tao, Chaofan and Liu, Ziru and Bai, Haoli and Liu, Rui and Kong, Lingpeng},
  journal={arXiv preprint arXiv:2508.21475},
  year={2025}
}