IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

"What I cannot create, I do not understand."
——Richard Feynman

Teaser Image

We propose IR3D-Bench, a benchmark that challenges VLMs to demonstrate real scene understanding by actively recreating 3D structures from images using tools. This "understanding-by-creating" approach probes the generative and tool-using capacity of vision-language agents (VLAs), moving beyond the descriptive or conversational capacity measured by traditional scene-understanding benchmarks.

Contents:

  • Dataset Integration
  • Inverse Rendering
  • Benchmark Metrics
  • Experimental Results


Motivation

Benchmark Category Comparison

Humans demonstrate true understanding through creation: we can recreate observed scenes because we genuinely comprehend their spatial relationships and physical attributes. In contrast, current Vision-Language Agents (VLAs) are primarily evaluated on recognition tasks such as captioning or QA, which fail to assess this deeper understanding. Can VLAs truly understand what they see? IR3D-Bench tests this by asking them to recreate what they observe.

Stage 1: Dataset Integration and Inverse Rendering

CLEVR Dataset Integration

We use the CLEVR dataset, a widely used diagnostic benchmark for visual reasoning over synthetic 3D scenes. Our work uses the validation split, which contains 15,000 synthetic images. Each image depicts 3 to 10 objects with detailed annotations covering their 3D coordinates, pixel-space projections, shape, color, size, material, and spatial relationships. These rich annotations make CLEVR well suited to evaluating 3D reconstruction and spatial reasoning in a controlled environment.
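For reference, here is a minimal sketch of reading these annotations, assuming the standard CLEVR v1.0 scenes-JSON layout (the file path is illustrative):

```python
import json

# Load the CLEVR validation scene annotations (path is illustrative).
with open("CLEVR_v1.0/scenes/CLEVR_val_scenes.json") as f:
    scenes = json.load(f)["scenes"]

scene = scenes[0]
for obj in scene["objects"]:
    # Per-object attributes plus 3D coordinates and pixel-space projection.
    print(obj["shape"], obj["color"], obj["size"], obj["material"])
    print("  3d_coords:", obj["3d_coords"], " pixel_coords:", obj["pixel_coords"])

# Spatial relationships are stored per direction as adjacency lists:
# relationships["left"][i] holds indices of objects to the left of object i.
left_of_first = scene["relationships"]["left"][0]
print("objects left of object 0:", left_of_first)
```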


Inverse Rendering

The VLM is prompted with an image and a textual instruction to infer object-level geometric and material parameters, outputting a structured scene representation in JSON format. These predictions are then used to reconstruct the scene in Blender.

Stage 1 Pipeline
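To make the reconstruction step concrete, below is a minimal sketch of how one predicted object entry (in the JSON format shown in the refinement example later on this page) could be instantiated with Blender's Python API. The schema field names follow that example; a full pipeline would branch on the predicted shape rather than assuming a cylinder.

```python
import bpy

def add_object(obj: dict) -> None:
    """Instantiate one predicted object entry in the Blender scene."""
    # Geometry: the example schema describes a cylinder via radius/depth.
    p = obj["size_params"]
    bpy.ops.mesh.primitive_cylinder_add(
        radius=p["radius"], depth=p["depth"], location=obj["location"]
    )
    blender_obj = bpy.context.active_object
    blender_obj.rotation_euler = obj["rotation_euler"]

    # Material: map predicted PBR parameters onto a Principled BSDF node.
    mat = bpy.data.materials.new(name=obj["material"]["name"])
    mat.use_nodes = True
    bsdf = mat.node_tree.nodes["Principled BSDF"]
    bsdf.inputs["Base Color"].default_value = obj["material"]["base_color"]
    bsdf.inputs["Metallic"].default_value = obj["material"]["metallic"]
    bsdf.inputs["Roughness"].default_value = obj["material"]["roughness"]
    blender_obj.data.materials.append(mat)
```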

Stage 2: Benchmark Evaluation

Our metric suite provides a comprehensive view of the VLA's internal world model and generative precision:

  • Localization: Object count, spatial alignment, and relation consistency
  • Visual Appearance: Shape and material accuracy via mask- and attribute-level scores (see the IoU/Dice sketch below)
  • Language-Aligned Semantics: Layout fidelity and object plausibility assessed via GPT-4o
Stage 2 Pipeline
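As an illustration of the mask-level scoring, here is a minimal sketch of computing IoU and Dice between a predicted (rendered) object mask and the ground-truth mask; this is our own illustrative implementation, not necessarily the benchmark's exact matching procedure:

```python
import numpy as np

def mask_scores(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """IoU and Dice between two boolean object masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0    # empty-vs-empty counts as perfect
    denom = pred.sum() + gt.sum()
    dice = 2 * inter / denom if denom else 1.0
    return float(iou), float(dice)
```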

Experimental Results

IR3D-Bench Leaderboard

Gemini-2.5-pro demonstrates strong understanding of object spatial positions and relative layouts. Grok-3 excels at modeling fine-grained details such as material and color. Qwen2.5-VL-72B struggles in more complex scenarios.

Holistic Comparison over the Metrics Suite

Holistic Comparison Figure

Quantitative Comparison

Column groups: Layout & Localization (Pix. Dist., Count Acc, Bbox) · Relation (Rel. Acc) · Instance Seg. (IoU, Dice) · CLIP Score (Color, Size, Material, Shape, Overall) · LLM Score (Obj App., Layout, Overall).

| Model | Release | Pix. Dist.↓ | Count Acc↑ | Bbox↑ | Rel. Acc↑ | IoU↑ | Dice↑ | Color↑ | Size↑ | Material↑ | Shape↑ | CLIP Overall↑ | Obj App.↑ | Layout↑ | LLM Overall↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Latest Proprietary Models** | | | | | | | | | | | | | | | |
| Gemini-2.5-pro | 2025-03 | 0.3791 | 1.00 | 0.45 | 0.55 | 0.11 | 0.18 | 96.12 | 97.00 | 99.50 | 99.75 | 93.08 | 2.96 | 2.05 | 2.62 |
| Gemini-2.0-flash | 2025-02 | 0.4291 | 0.99 | 0.37 | 0.46 | 0.08 | 0.13 | 96.59 | 97.67 | 99.41 | 99.92 | 94.97 | 2.99 | 2.08 | 2.72 |
| Claude3.5-Sonnet | 2024-10 | 0.5402 | 0.87 | 0.50 | 0.28 | 0.09 | 0.14 | 93.19 | 96.77 | 97.39 | 98.60 | 91.39 | 2.67 | 1.85 | 2.28 |
| Claude-3.7-Sonnet | 2025-02 | 0.5099 | 0.93 | 0.53 | 0.38 | 0.09 | 0.14 | 97.71 | 98.34 | 99.42 | 99.09 | 96.36 | 3.05 | 2.10 | 2.82 |
| GPT-4.1 | 2025-04 | 0.4366 | 1.00 | 0.48 | 0.42 | 0.08 | 0.13 | 97.55 | 97.34 | 98.96 | 98.66 | 94.59 | 2.68 | 1.66 | 2.34 |
| GPT-4o | 2024-11 | 0.5528 | 0.94 | 0.29 | 0.30 | 0.07 | 0.11 | 96.70 | 98.36 | 98.66 | 99.88 | 94.22 | 2.90 | 1.94 | 2.52 |
| grok-3 | 2024-12 | 0.4378 | 0.98 | 0.33 | 0.38 | 0.08 | 0.13 | 98.04 | 99.15 | 99.87 | 99.89 | 97.80 | 3.02 | 2.06 | 2.71 |
| **Open-source Models** | | | | | | | | | | | | | | | |
| DeepSeek-VL2 | 2024-12 | × Failed | | | | | | | | | | | | | |
| Llama-3.2-11B-Vision | 2024-09 | × Failed | | | | | | | | | | | | | |
| H2OVL-Mississippi-2B | 2024-10 | × Failed | | | | | | | | | | | | | |
| LLaVA-NeXT | 2025-01 | 0.6835 | 0.69 | 0.38 | 0.12 | 0.03 | 0.04 | 92.11 | 96.78 | 96.31 | 96.85 | 89.17 | 2.03 | 0.96 | 1.47 |
| Mistral3 | 2025-01 | 0.4733 | 0.99 | 0.26 | 0.44 | 0.06 | 0.11 | 99.56 | 99.79 | 99.85 | 99.90 | 97.95 | 3.17 | 2.16 | 2.78 |
| phi-3.5-Vision | 2024-07 | 0.6027 | 0.80 | 0.45 | 0.13 | 0.02 | 0.03 | 91.44 | 96.35 | 93.08 | 96.35 | 87.06 | 2.10 | 1.01 | 1.53 |
| phi4_mm | 2025-02 | 0.6192 | 0.92 | 0.21 | 0.32 | 0.03 | 0.05 | 94.82 | 93.16 | 96.02 | 99.58 | 92.63 | 2.59 | 1.49 | 2.04 |
| Pixtral-12B | 2024-11 | 0.4661 | 0.98 | 0.23 | 0.42 | 0.07 | 0.11 | 99.28 | 99.90 | 99.03 | 99.83 | 98.93 | 3.22 | 2.15 | 2.78 |
| Aria | 2024-11 | 0.5932 | 0.87 | 0.25 | 0.17 | 0.05 | 0.08 | 95.96 | 99.22 | 99.22 | 99.80 | 92.09 | 2.90 | 1.91 | 2.44 |
| Idefics3-8B | 2024-08 | 0.9100 | 0.97 | 0.11 | 0.18 | 0.03 | 0.06 | 98.35 | 99.83 | 95.35 | 99.98 | 97.97 | 3.14 | 1.79 | 2.48 |
| InternVL2.5-8B | 2024-11 | 0.9511 | 1.00 | 0.22 | 0.28 | 0.03 | 0.05 | 99.85 | 99.92 | 99.85 | 99.98 | 99.80 | 3.02 | 1.86 | 2.51 |
| InternVL2.5-38B | 2024-11 | 0.5233 | 1.00 | 0.23 | 0.38 | 0.07 | 0.11 | 99.79 | 99.98 | 100.00 | 100.00 | 99.86 | 3.26 | 2.17 | 2.83 |
| InternVL3-8B | 2025-04 | 0.5549 | 1.00 | 0.32 | 0.30 | 0.05 | 0.08 | 99.20 | 99.49 | 98.82 | 99.62 | 98.82 | 3.00 | 1.89 | 2.49 |
| InternVL3-38B | 2025-04 | 0.4560 | 1.00 | 0.18 | 0.40 | 0.07 | 0.13 | 99.15 | 99.98 | 100.00 | 100.00 | 99.47 | 3.25 | 2.22 | 2.89 |
| Qwen2.5-VL-7B | 2025-01 | 0.6537 | 0.96 | 0.40 | 0.30 | 0.04 | 0.06 | 98.21 | 99.60 | 99.71 | 99.86 | 96.89 | 3.04 | 1.95 | 2.55 |
| Qwen2.5-VL-72B | 2025-01 | 0.4082 | 1.00 | 0.21 | 0.39 | 0.08 | 0.13 | 99.86 | 99.98 | 99.99 | 99.98 | 99.80 | 3.24 | 2.20 | 3.02 |

Visual Results

Iterative Refinements

As the number of refinement rounds increases, cases on which GPT-4o initially performed poorly gradually improve, eventually even surpassing Gemini-2.5-pro.

User:
Objective: Refine the parameters of all objects in a 3D scene JSON to closely match a provided ground-truth (GT) image, under a fixed camera setup.
Goal: Achieve a refined scene JSON whose rendered image (with the fixed camera) is visually and spatially consistent with the GT image in terms of object count, placement, size, shape, material, and inter-object relationships.
GT Image

VLM:
"Obj1": {
    "name": "brown large rubber cylinder",
    "location": [-2.0, 2.0, 1.0],
    "rotation_euler": [0.0, 0.0, 0.0],
    "size_params": { "radius": 0.8, "depth": 2.0 },
    "material": {
        "name": "BrownRubber",
        "base_color": [0.6, 0.4, 0.2, 1.0],
        "metallic": 0.0,
        "roughness": 0.9
    }
}
Blender Render (GIF)
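A minimal sketch of such a refinement loop is shown below; render_scene and query_vlm are hypothetical stand-ins for the fixed-camera Blender render and the VLM API call, and NUM_ROUNDS is an illustrative choice:

```python
import json

NUM_ROUNDS = 5  # number of refinement iterations (illustrative)

def refine_scene(scene_json: dict, gt_image) -> dict:
    """Iteratively re-prompt the VLM with its own render next to the GT image."""
    for _ in range(NUM_ROUNDS):
        rendered = render_scene(scene_json)  # hypothetical: Blender render, fixed camera
        prompt = (
            "Refine the parameters of all objects in this 3D scene JSON to "
            "closely match the ground-truth image, under a fixed camera setup.\n"
            f"Current scene JSON:\n{json.dumps(scene_json, indent=2)}"
        )
        # Hypothetical VLM call that sees both images and returns updated JSON.
        response = query_vlm(prompt, images=[gt_image, rendered])
        scene_json = json.loads(response)
    return scene_json
```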

Conclusion

IR3D-Bench redefines VLM scene understanding through agentic inverse rendering, challenging VLAs to reconstruct 3D scenes from 2D images via automatic tool use. Our experiments show that current VLMs grasp high-level object attributes and possess basic tool-use abilities, but struggle with precise spatial control. We also find that iterative refinement and careful prompt design can improve reconstruction quality, providing guidance for future VLM research. With IR3D-Bench, we provide the community with a systematic framework for measuring progress in VLM scene understanding, moving beyond passive observation toward agentic understanding-by-creating.

BibTeX

@article{liu2025ir3d,
  title={IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering},
  author={Liu, Parker and Li, Chenxin and Li, Zhengxin and Wu, Yipeng and Li, Wuyang and Yang, Zhiqin and Zhang, Zhenyuan and Lin, Yunlong and Han, Sirui and Feng, Brandon},
  journal={arXiv preprint},
  year={2025}
}

The website template was originally borrowed from here.