IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

"What I cannot create, I do not understand."
——Richard Feynman

Teaser Image

We propose IR3D-Bench, a benchmark that challenges VLMs to demonstrate real scene understanding by actively recreating 3D structures from images using tools. This "understanding-by-creating" approach probes the generative and tool-using capacity of vision-language agents (VLAs), moving beyond the descriptive or conversational capacity measured by traditional scene-understanding benchmarks.

Contents:

  • Dataset Integration
  • Inverse Rendering
  • Benchmark Metrics
  • Experimental Results


Motivation

Benchmark Category Comparison

Humans demonstrate true understanding through creation: we can recreate observed scenes because we genuinely comprehend their spatial relationships and physical attributes. In contrast, current Vision-Language Agents (VLAs) are primarily evaluated on recognition tasks such as captioning or QA, which fail to assess this deeper understanding. Can VLAs truly understand what they see? IR3D-Bench tests this by asking them to recreate what they observe.

Stage 1: Dataset Integration and Inverse Rendering

CLEVR Dataset Integration

We use the CLEVR dataset, a widely used diagnostic benchmark for visual reasoning over synthetic 3D scenes. Our work uses the validation split, which contains 15,000 synthetic images. Each image depicts 3 to 10 objects with detailed annotations covering their 3D coordinates, pixel-space projections, shape, color, size, material, and spatial relationships. These rich annotations make CLEVR well suited to evaluating 3D reconstruction and spatial reasoning in a controlled environment.
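For reference, here is a minimal sketch of reading these annotations, assuming the standard CLEVR v1.0 scenes-JSON layout (the file path is illustrative):

```python
import json

# Load the CLEVR validation scene annotations (path is illustrative).
with open("CLEVR_v1.0/scenes/CLEVR_val_scenes.json") as f:
    scenes = json.load(f)["scenes"]

scene = scenes[0]
for obj in scene["objects"]:
    # Per-object attributes plus 3D coordinates and pixel-space projection.
    print(obj["shape"], obj["color"], obj["size"], obj["material"])
    print("  3d_coords:", obj["3d_coords"], " pixel_coords:", obj["pixel_coords"])

# Spatial relationships are stored per direction as adjacency lists:
# relationships["left"][i] holds indices of objects to the left of object i.
left_of_first = scene["relationships"]["left"][0]
print("objects left of object 0:", left_of_first)
```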


Inverse Rendering

The VLM is prompted with an image and a textual instruction to infer object-level geometric and material parameters, outputting a structured scene representation in JSON format. These predictions are then used to reconstruct the scene in Blender.

Stage 1 Pipeline
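To make the reconstruction step concrete, below is a minimal sketch of how one predicted object entry (in the JSON format shown in the refinement example later on this page) could be instantiated with Blender's Python API. The schema field names follow that example; a full pipeline would branch on the predicted shape rather than assuming a cylinder.

```python
import bpy

def add_object(obj: dict) -> None:
    """Instantiate one predicted object entry in the Blender scene."""
    # Geometry: the example schema describes a cylinder via radius/depth.
    p = obj["size_params"]
    bpy.ops.mesh.primitive_cylinder_add(
        radius=p["radius"], depth=p["depth"], location=obj["location"]
    )
    blender_obj = bpy.context.active_object
    blender_obj.rotation_euler = obj["rotation_euler"]

    # Material: map predicted PBR parameters onto a Principled BSDF node.
    mat = bpy.data.materials.new(name=obj["material"]["name"])
    mat.use_nodes = True
    bsdf = mat.node_tree.nodes["Principled BSDF"]
    bsdf.inputs["Base Color"].default_value = obj["material"]["base_color"]
    bsdf.inputs["Metallic"].default_value = obj["material"]["metallic"]
    bsdf.inputs["Roughness"].default_value = obj["material"]["roughness"]
    blender_obj.data.materials.append(mat)
```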

Stage 2: Benchmark Evaluation

Our metric suite provides a comprehensive view of the VLA's internal world model and generative precision:

  • Localization: Object count, spatial alignment, and relation consistency
  • Visual Appearance: Shape and material accuracy via mask- and attribute-level scores (see the IoU/Dice sketch below)
  • Language-Aligned Semantics: Layout fidelity and object plausibility assessed via GPT-4o
Stage 2 Pipeline
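As an illustration of the mask-level scoring, here is a minimal sketch of computing IoU and Dice between a predicted (rendered) object mask and the ground-truth mask; this is our own illustrative implementation, not necessarily the benchmark's exact matching procedure:

```python
import numpy as np

def mask_scores(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """IoU and Dice between two boolean object masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0    # empty-vs-empty counts as perfect
    denom = pred.sum() + gt.sum()
    dice = 2 * inter / denom if denom else 1.0
    return float(iou), float(dice)
```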

Experimental Results

IR3D-Bench Leaderboard

Gemini-2.5-pro demonstrates strong understanding of object spatial positions and relative layouts. Grok-3 excels at modeling fine-grained details such as material and color. Qwen2.5-VL-72B struggles in more complex scenarios.

Holistic Comparison over the Metrics Suite

Holistic Comparison Figure

Quantitative Comparison

Column groups: Layout & Localization (Pix. Dist., Count Acc, Bbox) · Relation (Rel. Acc) · Instance Seg. (IoU, Dice) · CLIP Score (Color, Size, Material, Shape, Overall) · LLM Score (Obj App., Layout, Overall).

| Model | Release | Pix. Dist.↓ | Count Acc↑ | Bbox↑ | Rel. Acc↑ | IoU↑ | Dice↑ | Color↑ | Size↑ | Material↑ | Shape↑ | CLIP Overall↑ | Obj App.↑ | Layout↑ | LLM Overall↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Latest Proprietary Models** | | | | | | | | | | | | | | | |
| Gemini-2.5-pro | 2025-03 | 0.3791 | 1.00 | 0.45 | 0.55 | 0.11 | 0.18 | 96.12 | 97.00 | 99.50 | 99.75 | 93.08 | 2.96 | 2.05 | 2.62 |
| Gemini-2.0-flash | 2025-02 | 0.4291 | 0.99 | 0.37 | 0.46 | 0.08 | 0.13 | 96.59 | 97.67 | 99.41 | 99.92 | 94.97 | 2.99 | 2.08 | 2.72 |
| Claude3.5-Sonnet | 2024-10 | 0.5402 | 0.87 | 0.50 | 0.28 | 0.09 | 0.14 | 93.19 | 96.77 | 97.39 | 98.60 | 91.39 | 2.67 | 1.85 | 2.28 |
| Claude-3.7-Sonnet | 2025-02 | 0.5099 | 0.93 | 0.53 | 0.38 | 0.09 | 0.14 | 97.71 | 98.34 | 99.42 | 99.09 | 96.36 | 3.05 | 2.10 | 2.82 |
| GPT-4.1 | 2025-04 | 0.4366 | 1.00 | 0.48 | 0.42 | 0.08 | 0.13 | 97.55 | 97.34 | 98.96 | 98.66 | 94.59 | 2.68 | 1.66 | 2.34 |
| GPT-4o | 2024-11 | 0.5528 | 0.94 | 0.29 | 0.30 | 0.07 | 0.11 | 96.70 | 98.36 | 98.66 | 99.88 | 94.22 | 2.90 | 1.94 | 2.52 |
| grok-3 | 2024-12 | 0.4378 | 0.98 | 0.33 | 0.38 | 0.08 | 0.13 | 98.04 | 99.15 | 99.87 | 99.89 | 97.80 | 3.02 | 2.06 | 2.71 |
| **Open-source Models** | | | | | | | | | | | | | | | |
| DeepSeek-VL2 | 2024-12 | × Failed | | | | | | | | | | | | | |
| Llama-3.2-11B-Vision | 2024-09 | × Failed | | | | | | | | | | | | | |
| H2OVL-Mississippi-2B | 2024-10 | × Failed | | | | | | | | | | | | | |
| LLaVA-NeXT | 2025-01 | 0.6835 | 0.69 | 0.38 | 0.12 | 0.03 | 0.04 | 92.11 | 96.78 | 96.31 | 96.85 | 89.17 | 2.03 | 0.96 | 1.47 |
| Mistral3 | 2025-01 | 0.4733 | 0.99 | 0.26 | 0.44 | 0.06 | 0.11 | 99.56 | 99.79 | 99.85 | 99.90 | 97.95 | 3.17 | 2.16 | 2.78 |
| phi-3.5-Vision | 2024-07 | 0.6027 | 0.80 | 0.45 | 0.13 | 0.02 | 0.03 | 91.44 | 96.35 | 93.08 | 96.35 | 87.06 | 2.10 | 1.01 | 1.53 |
| phi4_mm | 2025-02 | 0.6192 | 0.92 | 0.21 | 0.32 | 0.03 | 0.05 | 94.82 | 93.16 | 96.02 | 99.58 | 92.63 | 2.59 | 1.49 | 2.04 |
| Pixtral-12B | 2024-11 | 0.4661 | 0.98 | 0.23 | 0.42 | 0.07 | 0.11 | 99.28 | 99.90 | 99.03 | 99.83 | 98.93 | 3.22 | 2.15 | 2.78 |
| Aria | 2024-11 | 0.5932 | 0.87 | 0.25 | 0.17 | 0.05 | 0.08 | 95.96 | 99.22 | 99.22 | 99.80 | 92.09 | 2.90 | 1.91 | 2.44 |
| Idefics3-8B | 2024-08 | 0.9100 | 0.97 | 0.11 | 0.18 | 0.03 | 0.06 | 98.35 | 99.83 | 95.35 | 99.98 | 97.97 | 3.14 | 1.79 | 2.48 |
| InternVL2.5-8B | 2024-11 | 0.9511 | 1.00 | 0.22 | 0.28 | 0.03 | 0.05 | 99.85 | 99.92 | 99.85 | 99.98 | 99.80 | 3.02 | 1.86 | 2.51 |
| InternVL2.5-38B | 2024-11 | 0.5233 | 1.00 | 0.23 | 0.38 | 0.07 | 0.11 | 99.79 | 99.98 | 100.00 | 100.00 | 99.86 | 3.26 | 2.17 | 2.83 |
| InternVL3-8B | 2025-04 | 0.5549 | 1.00 | 0.32 | 0.30 | 0.05 | 0.08 | 99.20 | 99.49 | 98.82 | 99.62 | 98.82 | 3.00 | 1.89 | 2.49 |
| InternVL3-38B | 2025-04 | 0.4560 | 1.00 | 0.18 | 0.40 | 0.07 | 0.13 | 99.15 | 99.98 | 100.00 | 100.00 | 99.47 | 3.25 | 2.22 | 2.89 |
| Qwen2.5-VL-7B | 2025-01 | 0.6537 | 0.96 | 0.40 | 0.30 | 0.04 | 0.06 | 98.21 | 99.60 | 99.71 | 99.86 | 96.89 | 3.04 | 1.95 | 2.55 |
| Qwen2.5-VL-72B | 2025-01 | 0.4082 | 1.00 | 0.21 | 0.39 | 0.08 | 0.13 | 99.86 | 99.98 | 99.99 | 99.98 | 99.80 | 3.24 | 2.20 | 3.02 |

Visual Results

Iterative Refinements

As the number of refinement rounds increases, cases on which GPT-4o initially performed poorly gradually improve, eventually even surpassing Gemini-2.5-pro.

User:
Objective: Refine the parameters of all objects in a 3D scene JSON to closely match a provided ground-truth (GT) image, under a fixed camera setup.
Goal: Achieve a refined scene JSON whose rendered image (with the fixed camera) is visually and spatially consistent with the GT image in terms of object count, placement, size, shape, material, and inter-object relationships.
GT Image

VLM:
"Obj1": {
    "name": "brown large rubber cylinder",
    "location": [-2.0, 2.0, 1.0],
    "rotation_euler": [0.0, 0.0, 0.0],
    "size_params": { "radius": 0.8, "depth": 2.0 },
    "material": {
        "name": "BrownRubber",
        "base_color": [0.6, 0.4, 0.2, 1.0],
        "metallic": 0.0,
        "roughness": 0.9
    }
}
Blender Render (GIF)
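A minimal sketch of such a refinement loop is shown below; render_scene and query_vlm are hypothetical stand-ins for the fixed-camera Blender render and the VLM API call, and NUM_ROUNDS is an illustrative choice:

```python
import json

NUM_ROUNDS = 5  # number of refinement iterations (illustrative)

def refine_scene(scene_json: dict, gt_image) -> dict:
    """Iteratively re-prompt the VLM with its own render next to the GT image."""
    for _ in range(NUM_ROUNDS):
        rendered = render_scene(scene_json)  # hypothetical: Blender render, fixed camera
        prompt = (
            "Refine the parameters of all objects in this 3D scene JSON to "
            "closely match the ground-truth image, under a fixed camera setup.\n"
            f"Current scene JSON:\n{json.dumps(scene_json, indent=2)}"
        )
        # Hypothetical VLM call that sees both images and returns updated JSON.
        response = query_vlm(prompt, images=[gt_image, rendered])
        scene_json = json.loads(response)
    return scene_json
```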

Conclusion

IR3D-Bench redefines VLM scene understanding through agentic inverse rendering, challenging VLAs to reconstruct 3D scenes from 2D images via automatic tool use. Our experiments show that current VLMs grasp high-level object attributes and possess basic tool-use abilities, but struggle with precise spatial control. We also find that iterative refinement and careful prompt design can improve reconstruction quality, providing guidance for future VLM research. With IR3D-Bench, we provide the community with a systematic framework for measuring progress in VLM scene understanding, moving beyond passive observation toward agentic understanding-by-creating.

BibTeX

@article{liu2025ir3d,
  title={IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering},
  author={Liu, Parker and Li, Chenxin and Li, Zhengxin and Wu, Yipeng and Li, Wuyang and Yang, Zhiqin and Zhang, Zhenyuan and Lin, Yunlong and Han, Sirui and Feng, Brandon},
  journal={arXiv preprint},
  year={2025}
}

The website template was originally borrowed from here.