ICML 2026

Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents

Visual tools help most as training-time scaffolding through broader exploration, not through persistent tool usage.

Dong-Hee Kim¹, Reuben Tan², Donghyun Kim¹
¹Korea University · ²Microsoft Research
Paper summary

Abstract

A study of tool-use collapse in visual chain-of-thought agents.

Visual agents employ external tools to incorporate fine-grained evidence. While prior work focused on visual search, we investigate harder 3D spatial reasoning tasks.


The central finding is tool-use collapse: accuracy improves while tool use falls toward zero. Explicitly incentivizing tool use yields only marginal gains despite high frequency.


Both vanilla training and reward-based encouragement reduce rollout diversity, whereas adaptive entropy regularization preserves broader exploration and achieves the best performance. This supports a view of tools as training-time scaffolding.

Tool-use collapse

Accuracy improves while tool use declines.

Frequency is not enough

100% tool usage brings only marginal gains.

Diversity matters

Exploration yields the best performance.

Core result

Main results

Tool-use frequency and reasoning performance move differently.

Figure 2 · Training dynamics of tool use and accuracy

We compare three different reinforcement learning approaches and observe their distinct tool-use behaviors and resulting accuracies:

  • Vanilla RFT (Tool Collapse): Under standard reinforcement fine-tuning, the agent learns that answering directly, without multi-step tool use, earns reward faster. Tool usage falls from roughly 20% to roughly 2%, and accuracy gains are modest (59.2% on 3DSRBench).
  • Tool-Encouraging Reward: An explicit bonus for using tools drives the agent to invoke tools in 100% of rollouts. Because this tool use is not accompanied by diverse exploration, final accuracy is only marginally higher (59.9% vs. 59.2% for vanilla RFT).
  • Entropy Regularization: Rather than forcing tool usage, this method applies an adaptive penalty when the agent's generated thoughts become too repetitive, which pushes the agent to explore diverse reasoning paths early in training. Its tool usage also declines eventually (to roughly 3%), yet the exploration it scaffolds leads to the highest final accuracy (62.9%).
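
The difference between the two interventions above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's formulation: `shaped_reward`, `update_entropy_coef`, `tool_bonus`, `target_entropy`, and `step_size` are hypothetical names, and the coefficient update shown is one common way to make an entropy term adaptive (raise its weight when the policy's entropy drops below a target, lower it otherwise).

```python
import math


def shaped_reward(correct: bool, used_tool: bool, tool_bonus: float = 0.0) -> float:
    """Task reward plus an optional bonus for calling a tool.

    tool_bonus = 0.0 corresponds to a plain correctness reward;
    tool_bonus > 0.0 is a hypothetical tool-encouraging reward.
    """
    return float(correct) + (tool_bonus if used_tool else 0.0)


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def update_entropy_coef(coef, mean_entropy, target_entropy,
                        step_size=0.01, max_coef=1.0):
    """Assumed adaptive rule: strengthen the entropy term when the policy
    is less diverse than the target, weaken it when it is more diverse."""
    coef += step_size * (target_entropy - mean_entropy)
    return min(max(coef, 0.0), max_coef)


def regularized_objective(mean_reward, mean_entropy, coef):
    """Quantity to maximize: reward plus a weighted entropy bonus."""
    return mean_reward + coef * mean_entropy


if __name__ == "__main__":
    coef = 0.1
    # A collapsing policy (low entropy) pushes the coefficient up.
    coef = update_entropy_coef(coef, mean_entropy=0.2, target_entropy=1.0)
    print(coef, regularized_objective(0.6, 0.2, coef))
```

Note that the reward bonus acts on whether a tool is called, while the entropy term acts on how varied the generations are, regardless of tool calls.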

Training summary on 3DSRBench

| Method | Tools? | 3DSRBench Acc. | Tool usage (Init → Sat) |
|---|---|---|---|
| Vanilla RFT | Yes | 59.2% | ~20% → ~2% |
| Tool-banned | No | 58.1% | 0% → 0% |
| Reward-encouraging | Yes | 59.9% | ~20% → 100% |
| Entropy-regularized | Yes | 62.9% | ~20% → ~3% |
| Tool-banned + entropy | No | 57.8% | 0% → 0% |

Mechanism

Exploration analysis

The performance gap comes from diversity in reasoning and crop selection.

Figure 3 · Textual exploration stays diverse

Distinct n-gram ratios decrease steadily for vanilla and reward-based methods. Entropy regularization maintains higher textual diversity.
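
As a rough illustration of the metric, a distinct n-gram ratio can be computed as the number of unique n-grams divided by the total number of n-grams across a batch of rollouts. The sketch below assumes whitespace tokenization and pooling over the whole batch; the function name `distinct_n` and both of those choices are assumptions, not details taken from the paper.

```python
def distinct_n(texts, n=2):
    """Fraction of unique n-grams among all n-grams in a batch of texts.

    Values near 1.0 mean almost every n-gram is new (diverse rollouts);
    values near 0.0 mean the rollouts repeat the same phrases (templated).
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()  # assumed tokenization: whitespace
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    templated = ["the chair is left of the table"] * 3
    varied = ["the chair is left of the table",
              "zooming in shows a lamp behind it"]
    print(distinct_n(templated), distinct_n(varied))
```

Three identical rollouts share every bigram, so the ratio drops to one third, while the two unrelated rollouts score 1.0.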

Figure 4 · Visual exploration

Entropy-regularized models explore regions tied to context, while the vanilla policy fixates narrowly.

Why this matters

  • Text space: Diverse reasoning prevents templated thought patterns.
  • Visual space: Broader crop exploration improves coverage.
  • Interpretation: Tools help most by broadening the training trajectory.
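
One simple way to put a number on "coverage" in the visual space is the fraction of the image touched by at least one crop across a set of rollouts. The sketch below is a hypothetical proxy, not the analysis used in the paper: `crop_coverage`, the normalized `(x0, y0, x1, y1)` box format, and the grid resolution are all assumptions made for illustration.

```python
def crop_coverage(boxes, grid=20):
    """Fraction of grid cells whose centre lies inside at least one crop box.

    boxes: iterable of (x0, y0, x1, y1) in normalized [0, 1] image coordinates.
    """
    boxes = list(boxes)
    covered = 0
    for row in range(grid):
        for col in range(grid):
            cx = (col + 0.5) / grid  # cell centre, x
            cy = (row + 0.5) / grid  # cell centre, y
            if any(x0 <= cx <= x1 and y0 <= cy <= y1
                   for x0, y0, x1, y1 in boxes):
                covered += 1
    return covered / (grid * grid)


if __name__ == "__main__":
    # A policy that always crops the same central patch.
    narrow = [(0.4, 0.4, 0.6, 0.6)] * 4
    # A policy whose crops land in different parts of the image.
    broad = [(0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0)]
    print(crop_coverage(narrow), crop_coverage(broad))
```

Repeating the same crop does not increase the score, so the measure rewards where the crops land, not how many are issued.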
Transfer

Generalization

The asymmetry extends to a broader tool suite.

Extension to Medical VQA

Figure 5 · Dynamics on VQA-RAD

Vanilla and entropy-regularized training both reduce tool use over time, while reward-based encouragement keeps the agent invoking tools in almost every rollout.

Table 4 · VQA-RAD Validation

| Method | Accuracy |
|---|---|
| Vanilla RFT | 46.34 |
| Tool-Encouraged RFT | 47.23 |
| Entropy Regularization | 48.78 |

Entropy regularization, the method that preserves exploration diversity, again performs best (48.78), suggesting that the same dynamics hold in a medical domain.

Robustness on General Spatial Tasks

Table 3 · Impact on general spatial understanding

| Method | 3DSRBench (Target Task) | CV-Bench-3D (General Domain) |
|---|---|---|
| Qwen2.5-VL 7B | 48.4 | 82.9 |
| Mini-o3 | 54.5 | 77.6 |
| Vanilla RFT | 59.2 | 76.7 |
| Tool-encourage | 59.9 | 74.5 |
| Entropy-regularized | 62.9 | 78.8 |

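What this means

Standard RFT improves performance on the target task (3DSRBench, 59.2 vs. 54.5 for Mini-o3) but lowers accuracy on the general benchmark (CV-Bench-3D, 76.7 vs. 77.6), and the tool-encouraging reward lowers it further (74.5). Entropy regularization is the only training variant that improves on Mini-o3 on both benchmarks (62.9 and 78.8), although every row other than Qwen2.5-VL 7B remains below its 82.9 on CV-Bench-3D. This suggests that diverse exploration shapes more robust representations.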
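
The comparison can be checked with a few lines of arithmetic over the Table 3 numbers. Using Mini-o3 as the reference row is an assumption made here for illustration (the page does not state which row the trained variants should be compared against), and the helper name `deltas` is hypothetical.

```python
# (3DSRBench, CV-Bench-3D) scores copied from Table 3.
results = {
    "Qwen2.5-VL 7B": (48.4, 82.9),
    "Mini-o3": (54.5, 77.6),
    "Vanilla RFT": (59.2, 76.7),
    "Tool-encourage": (59.9, 74.5),
    "Entropy-regularized": (62.9, 78.8),
}


def deltas(results, reference):
    """Per-method score change on both benchmarks relative to a reference row."""
    ref_target, ref_general = results[reference]
    return {
        name: (round(target - ref_target, 1), round(general - ref_general, 1))
        for name, (target, general) in results.items()
        if name != reference
    }


# Assumption: Mini-o3 is the reference point for the trained variants.
vs_mini_o3 = deltas(results, "Mini-o3")
improves_both = [name for name, (d_target, d_general) in vs_mini_o3.items()
                 if d_target > 0 and d_general > 0]

if __name__ == "__main__":
    for name, (d_target, d_general) in vs_mini_o3.items():
        print(f"{name}: 3DSRBench {d_target:+.1f}, CV-Bench-3D {d_general:+.1f}")
    print(improves_both)  # ['Entropy-regularized']
```

Changing the `reference` argument to `"Qwen2.5-VL 7B"` shows the other side of the table: every other row gains on 3DSRBench and loses on CV-Bench-3D.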
Citation

BibTeX

@inproceedings{kim2026diversity,
  title={Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents},
  author={Kim, Dong-Hee and Tan, Reuben and Kim, Donghyun},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning},
  year={2026},
  address={Seoul, South Korea},
  publisher={PMLR},
  url={https://scaffolded-exploration.github.io/}
}