MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents

1Jilin University 2National Taiwan University 3Microsoft Research Asia
MindPower Overview Diagram

Figure 1: MindPower Benchmark Overview. We evaluate Robot-Centric ToM through two tasks: False-Belief Correction and Implicit Goal Inference & Completion, assessing whether VLM-based embodied agents can generate correct decisions and actions. We further propose the MindPower Reasoning Hierarchy, comprising three levels and six layers. Existing VLMs perform poorly across layers, especially in action reasoning, while our model shows substantial improvements.

Abstract

Theory of Mind (ToM) refers to the ability to infer others’ mental states, such as beliefs, desires, and intentions. Current vision–language embodied agents lack ToM-based decision-making, and existing benchmarks focus solely on human mental states while ignoring the agent’s own perspective, hindering coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making, and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM Reasoning to model both self and others, and finally generates decisions and actions guided by the inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages VLMs to produce consistent ToM Reasoning and behavior. Our model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.

ToM-Embodied Benchmark

Benchmark Overview

Figure 2: MindPower Reasoning Hierarchy. The agent first receives multimodal input, then performs mental reasoning to form beliefs, desires, and intentions, and finally makes decisions and generates an action plan based on this reasoning.

Evaluation Tasks

False-Belief Correction

Evaluate whether an embodied agent can detect and correct a human’s mistaken belief about the environment (e.g., misjudged object locations).

Implicit Goal Inference & Completion

Test the agent’s ability to infer unstated intentions from subtle behavioral cues, such as searching or repeated failed attempts.

MindPower Reasoning Hierarchy

Level 1

Perception

<Perception>
Level 2

Mental Reasoning

<Belief> <Desire> <Intention>
Level 3

Decision & Action

<Decision> <Action>
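The six layers above define a structured output format for the agent's response. As an illustration, the sketch below parses and validates such an output, assuming each layer is emitted as a paired `<Layer>...</Layer>` span in hierarchy order; the function name, regex, and sample text are our own assumptions, not the paper's implementation.

```python
import re

# The six layer names of the MindPower Reasoning Hierarchy.
LAYERS = ["Perception", "Belief", "Desire", "Intention", "Decision", "Action"]

def parse_hierarchy(text: str) -> dict:
    """Extract each <Layer>...</Layer> span, enforcing hierarchy order.

    Returns a dict mapping layer name -> content; raises ValueError if a
    layer is missing or appears out of order (a format violation).
    """
    result = {}
    last_end = -1
    for layer in LAYERS:
        m = re.search(rf"<{layer}>(.*?)</{layer}>", text, re.DOTALL)
        if m is None:
            raise ValueError(f"missing <{layer}> layer")
        if m.start() < last_end:
            raise ValueError(f"<{layer}> appears out of order")
        result[layer] = m.group(1).strip()
        last_end = m.end()
    return result

# Hypothetical agent output for a false-belief scenario:
sample = (
    "<Perception>Mug on the counter; human searching the drawer.</Perception>"
    "<Belief>The human believes the mug is in the drawer.</Belief>"
    "<Desire>The human wants coffee.</Desire>"
    "<Intention>The human intends to fetch the mug.</Intention>"
    "<Decision>Correct the false belief.</Decision>"
    "<Action>Point to the counter and hand over the mug.</Action>"
)
parsed = parse_hierarchy(sample)
```

A check like this is the kind of well-formedness test a format-based reward could build on.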

Method

MindPower Methodology Diagram

Framework Overview

To effectively train the agent, we employ a two-stage pipeline:

  • Stage 1: Supervised Fine-Tuning (SFT) establishes fundamental capabilities.
  • Stage 2: Group Relative Policy Optimization (GRPO) combines Mind-Reward and Format-Reward to enhance BDI consistency and Robot-Centric optimality.
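To make Stage 2 concrete, the sketch below shows how a combined reward could be turned into group-relative advantages in the GRPO style: each sampled rollout's reward is normalized against the mean and standard deviation of its group. The reward weights, function names, and toy scores are assumptions for illustration, not the paper's implementation.

```python
from statistics import mean, pstdev

def combined_reward(mind_r: float, format_r: float,
                    w_mind: float = 0.8, w_format: float = 0.2) -> float:
    """Weighted sum of Mind-Reward and Format-Reward (weights assumed)."""
    return w_mind * mind_r + w_format * format_r

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: z-score each reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# One prompt, a group of 4 sampled rollouts with toy (mind, format) scores:
rewards = [combined_reward(m, f)
           for m, f in [(0.9, 1.0), (0.4, 1.0), (0.7, 0.0), (0.2, 0.0)]]
advs = grpo_advantages(rewards)
```

Because the advantages are centered within each group, rollouts are reinforced only relative to their siblings, which is what makes the group-sampling step essential.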

Experimental Results

Table 1: Quantitative Evaluation. We evaluate our model against both image-based and video-based VLMs. “B” denotes the BERTScore, “S” represents the Sentence Transformer score, and “BPC” means BDI and Perspective Consistency. The BPC score ranges from 0 to 10, while all other metrics are normalized to a range of 0 to 100.

| Method | Perception B | Perception S | Belief B | Belief S | Desire B | Desire S | Intention B | Intention S | Decision B | Decision S | Action SR | Action AC | BPC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human Study | | | | | | | | | | | | | |
| Human Baseline | - | - | 47.65 | 61.81 | 46.76 | 53.71 | 39.18 | 52.93 | 34.55 | 56.66 | 19.37 | 26.26 | 8.19 |
| Video-input | | | | | | | | | | | | | |
| Gemini-2.5 Flash | 31.10 | 48.36 | 29.07 | 38.64 | 28.36 | 30.69 | 19.05 | 29.04 | 21.68 | 34.57 | 1.38 | 1.35 | 8.72 |
| Gemini-2.5 Pro | 24.62 | 43.43 | 32.02 | 36.79 | 31.38 | 30.21 | 22.65 | 30.33 | 24.23 | 33.87 | 2.08 | 2.54 | 8.56 |
| Qwen2.5-VL-7B-Instruct | 26.05 | 38.20 | 20.27 | 28.43 | 26.05 | 22.93 | 16.01 | 23.21 | 16.69 | 26.56 | 0.29 | 0.22 | 6.07 |
| VideoLLaMA3-7B | 14.80 | 31.86 | 7.82 | 30.08 | 8.09 | 21.76 | 4.61 | 24.28 | 5.34 | 19.59 | 0.63 | 0.60 | 5.33 |
| InternVL3.5-8B | 23.23 | 42.26 | 21.98 | 26.90 | 22.20 | 22.45 | 16.53 | 23.21 | 15.64 | 28.76 | 0.10 | 0.08 | 6.52 |
| Video-LLaVA | 2.96 | 25.33 | 5.05 | 14.87 | 6.82 | 15.55 | 16.63 | 15.30 | 3.29 | 19.50 | 0.08 | 0.08 | 4.81 |
| Video-ChatGPT | 7.04 | 27.00 | 9.90 | 25.72 | 5.16 | 16.79 | 2.70 | 21.44 | 1.46 | 19.95 | 0.00 | 0.00 | 5.52 |
| VideoChat-R1 | 27.47 | 42.47 | 21.57 | 30.11 | 22.56 | 20.36 | 15.03 | 24.70 | 17.21 | 25.71 | 0.64 | 0.82 | 6.00 |
| Video-R1 | 30.56 | 47.46 | 25.56 | 34.58 | 26.68 | 29.17 | 17.13 | 27.56 | 18.91 | 30.33 | 1.43 | 1.72 | 6.45 |
| Image-input | | | | | | | | | | | | | |
| GPT-4o | 33.07 | 48.37 | 30.05 | 39.47 | 31.16 | 32.75 | 16.16 | 29.55 | 19.96 | 34.35 | 1.82 | 2.91 | 8.05 |
| Qwen2.5-VL-7B-Instruct | 24.89 | 39.97 | 19.46 | 29.21 | 22.59 | 19.14 | 16.80 | 23.49 | 19.11 | 23.79 | 0.15 | 0.15 | 6.72 |
| InternVL3.5-8B | 6.43 | 18.78 | 15.71 | 20.77 | 19.30 | 17.38 | 13.97 | 19.72 | 12.62 | 18.77 | 0.00 | 0.00 | 5.95 |
| LLaVA-OV-8B | 8.08 | 26.45 | 15.09 | 23.21 | 22.31 | 21.40 | 16.21 | 19.58 | 17.11 | 21.25 | 0.00 | 0.00 | 6.45 |
| Ours | | | | | | | | | | | | | |
| Mind-Reward only | 21.84 | 39.99 | 18.70 | 27.81 | 21.35 | 18.85 | 21.90 | 23.30 | 17.58 | 24.68 | 0.28 | 0.40 | 6.63 |
| SFT only | 32.78 | 52.72 | 43.15 | 42.48 | 47.01 | 37.83 | 34.86 | 39.48 | 36.70 | 43.84 | 8.50 | 10.48 | 8.78 |
| Ours (SFT + Mind-Reward) | 44.79 | 59.93 | 49.14 | 46.49 | 51.25 | 45.75 | 37.79 | 42.57 | 40.17 | 47.12 | 11.75 | 15.40 | 8.87 |
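As a toy illustration of the "S" columns, the sketch below scores a prediction against a reference by cosine similarity of their embeddings, rescaled to the table's 0-100 range. The real evaluation would embed text with a Sentence Transformer model; here the embedding vectors are placeholders, and the scaling choice is our assumption.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def s_score(pred_emb: list[float], ref_emb: list[float]) -> float:
    """Map cosine similarity from [-1, 1] onto the table's 0-100 scale."""
    return 50.0 * (cosine(pred_emb, ref_emb) + 1.0)

# Placeholder embeddings standing in for sentence-encoder outputs:
pred = [0.2, 0.8, 0.1]
ref = [0.25, 0.75, 0.05]
score = s_score(pred, ref)
```

The BERTScore ("B") columns work analogously but match token-level embeddings rather than a single sentence vector.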

BibTeX

@article{zhang2025mindpower,
  title={MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents},
  author={Zhang, Ruoxuan and Zheng, Qiyun and Zhou, Zhiyu and Liao, Ziqi and Wu, Siyu and Jiang-Lin, Jian-Yu and Wen, Bin and Xie, Hongxia and Fu, Jianlong and Cheng, Wen-Huang},
  journal={arXiv preprint},
  year={2025}
}