DriveSpatial: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

Hao Vo¹, Khoa Vo¹, Phu Loc Nguyen¹, Sieu Tran¹, Duc Minh Nguyen¹, Ngo Xuan Cuong¹, Gladys Gawugah¹, Sreevenkata Anjani Tishita Godavarthi¹, Chase Rainwater¹, Nghi D. Q. Bui², Anh Nguyen³, Duy Minh Ho Nguyen⁴, Ngan Le¹

¹University of Arkansas, USA    ²Google Research, Google    ³University of Liverpool, UK    ⁴Max Planck Research School for Intelligent Systems


DriveSpatial overview
Overview of DriveSpatial. (Top) Spatiotemporal intelligence in driving mirrors human cognition: multi-view observations are used to construct an internal scene representation (Cognitive Scene Construction), infer spatial relationships (Multi-view Relational Understanding), and connect percepts across time (Temporal Reasoning). (Bottom left) DriveSpatial covers 20 tasks from four core abilities across five AD datasets. (Bottom right) Representative questions from three ability categories.

Abstract

Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes.

We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph encoding object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning.

Evaluating 15 representative VLMs reveals a substantial human–model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction as the key bottleneck. Language-only prompting is insufficient, while explicit BEV grounding consistently improves performance.


Question Examples

Representative samples from nine of the 20 tasks, grouped by ability. Each question requires integrating multi-view camera inputs; correct answers are shown in green in the original paper.

Question samples
Question samples from DriveSpatial across nine tasks, grouped by ability: Cognitive Scene Construction (Const.), Multi-view Relational Understanding (Unders.), and Temporal Reasoning (Reas.).

Dataset Statistics

DriveSpatial is built from five large-scale AD datasets (nuScenes, Waymo, TruckScenes, AV2, and ONCE), spanning car and truck platforms across diverse weather conditions, scene layouts, and times of day.

Dataset statistics
DriveSpatial statistics. (Left) Sunburst view of the 20 tasks grouped under the Construction (Const.), Understanding (Unders.), and Reasoning (Reas.) abilities. (Right) Scene-level diversity: nine weather conditions, three periods of day, and seven scene types.

Benchmark Construction Framework

DriveSpatial is built through a four-stage pipeline with human-in-the-loop verification at both the metadata and QA generation stages.

Benchmark construction pipeline
Construction pipeline. (1) Data collection and calibration across five AD datasets; (2) Metadata completion (weather, time of day, scene type) via SigLIP-2 and Qwen3-VL with expert verification; (3) Dynamic multi-relational scene graph construction encoding spatial, interaction, action, and temporal edges; (4) QA generation and expert verification enforcing cross-view and spatiotemporal constraints.
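Stage (3) can be pictured with a small data-structure sketch. The following is only an illustration under stated assumptions, not the authors' implementation: the class names `Edge` and `SceneGraph`, the method `cross_view_pairs`, the relation labels, and the object ids are all hypothetical. It also assumes one plausible reading of the cross-view constraint, namely that a relation needs cross-view reasoning when its two objects never appear together in a single camera.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    src: str    # object id (hypothetical naming)
    dst: str    # object id
    kind: str   # "spatial", "interaction", "action", or "temporal"
    label: str  # illustrative relation label, e.g. "in_front_of"
    frame: int  # frame index at which the relation holds


@dataclass
class SceneGraph:
    # visibility[frame][object_id] -> set of camera names that see the object
    visibility: dict = field(default_factory=lambda: defaultdict(dict))
    edges: list = field(default_factory=list)

    def add_object(self, frame, obj_id, cameras):
        self.visibility[frame][obj_id] = set(cameras)

    def add_edge(self, src, dst, kind, label, frame):
        self.edges.append(Edge(src, dst, kind, label, frame))

    def cross_view_pairs(self, frame):
        """Spatial edges whose endpoints are both visible but share no camera.

        Assumption: this is one way to enforce cross-view reasoning; the
        source does not spell out the exact rule.
        """
        vis = self.visibility[frame]
        selected = []
        for e in self.edges:
            if e.frame != frame or e.kind != "spatial":
                continue
            a, b = vis.get(e.src, set()), vis.get(e.dst, set())
            if a and b and not (a & b):
                selected.append(e)
        return selected


# Toy scene: camera names and objects are made up for illustration.
g = SceneGraph()
g.add_object(0, "car_1", ["CAM_FRONT"])
g.add_object(0, "truck_2", ["CAM_BACK"])
g.add_object(0, "ped_3", ["CAM_FRONT"])
g.add_edge("car_1", "truck_2", "spatial", "in_front_of", 0)
g.add_edge("car_1", "ped_3", "spatial", "left_of", 0)
pairs = g.cross_view_pairs(0)
```

In this toy scene only the `car_1`/`truck_2` relation is selected, because `car_1` and `ped_3` are both visible in the same camera and could be related from a single view.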

Main Results

Results on DriveSpatial for 15 VLMs across three model groups. The human upper bound is measured on DriveSpatial-mini (1,000 stratified questions). ‡ denotes entries evaluated on DriveSpatial-mini.

Method                  Rank  Avg.   Const. Acc ↑  Unders. Acc ↑  Unders. RMSE ↓  Reas. Acc ↑

Baselines
Random                  13    26.33  25.37         28.24          -               25.39
Frequency               8     32.45  32.94         33.89          -               30.51
Human ‡                 1     83.39  86.20         85.62          10.62           88.96

Proprietary
GPT-4o ‡                3     51.37  48.22         59.84          12.71           58.76
GPT-5 ‡                 2     54.98  55.20         62.45          10.41           57.69
Gemini-2 Pro ‡          4     47.26  44.09         58.51          13.20           52.38

Generalist (image-based)
LLaVA-Onevision-7B      10    28.65  33.73         27.64          15.76           40.34
DeepSeek-VL2-Small      14    24.88  23.92         26.60          15.15           39.26
Gemma-3-12B-it          7     35.25  35.55         44.05          14.65           40.80
InternVL-3.5 8B         6     36.73  39.00         44.15          14.26           41.31
InternVL-3 8B           16    24.20  30.63         27.26          15.95           30.65
Qwen3-VL 8B             5     42.24  42.76         50.62          14.32           47.66

Generalist (video-based)
LongVA-7B               12    25.91  26.84         35.90          20.26           35.25
LLaVA-Video-7B          13    25.24  26.94         27.57          15.68           36.89

Specialist
RoboTron-Drive          18    23.14  24.31         21.03          15.49           39.56
Ego3D                   11    26.48  30.21         30.67          15.95           34.53
SpaceThinker            17    24.53  27.16         31.74          16.96           31.65
SenseNova-SI            9     30.54  37.91         36.25          15.25           32.73

‡ Evaluated on DriveSpatial-mini. Avg. = (Const.Acc + Unders.Acc − Unders.RMSE + Reas.Acc) / 3.
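
As a quick check of the Avg. formula, the sketch below recomputes it from the per-ability scores reported in the results. The function name `average_score` is illustrative. Treating a missing RMSE as zero for the Random and Frequency baselines is an assumption; the page does not state it, although it reproduces their reported averages.

```python
def average_score(const_acc, unders_acc, reas_acc, unders_rmse=None):
    """Avg. = (Const.Acc + Unders.Acc - Unders.RMSE + Reas.Acc) / 3.

    Assumption: when no RMSE is reported, the RMSE term is omitted.
    """
    penalty = 0.0 if unders_rmse is None else unders_rmse
    return (const_acc + unders_acc - penalty + reas_acc) / 3


human = average_score(86.20, 85.62, 88.96, unders_rmse=10.62)
gpt5 = average_score(55.20, 62.45, 57.69, unders_rmse=10.41)
random_baseline = average_score(25.37, 28.24, 25.39)

print(round(human, 2), round(gpt5, 2), round(random_baseline, 2))
# → 83.39 54.98 26.33
print(round(human - gpt5, 1))
# → 28.4
```

The last value matches the 28.4-point human–model gap quoted in the abstract.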

Per-source generalization
Per-source generalization. Best-performing model per group across five data sources. Human performance is stable across sources; VLM performance drops notably on ONCE.
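
A per-source breakdown of this kind can be computed by grouping answers by their originating dataset. The sketch below is a minimal illustration, not the authors' evaluation code: the record keys `source`, `prediction`, and `answer` are hypothetical, and the toy records are made up rather than taken from the benchmark.

```python
from collections import defaultdict


def accuracy_by_source(records):
    """Return accuracy (in percent) per data source.

    Each record is a dict with hypothetical keys:
    'source', 'prediction', and 'answer'.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["source"]] += 1
        hits[r["source"]] += int(r["prediction"] == r["answer"])
    return {s: 100.0 * hits[s] / totals[s] for s in totals}


# Made-up records for illustration only.
records = [
    {"source": "nuScenes", "prediction": "A", "answer": "A"},
    {"source": "nuScenes", "prediction": "B", "answer": "A"},
    {"source": "ONCE", "prediction": "C", "answer": "D"},
    {"source": "ONCE", "prediction": "D", "answer": "D"},
    {"source": "ONCE", "prediction": "A", "answer": "D"},
]
per_source = accuracy_by_source(records)
```

Comparing the resulting dictionary across models makes source-specific drops visible, such as the one the caption reports for ONCE.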

BibTeX

@article{vo2026drivespatial,
  title   = {DriveSpatial: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving},
  author  = {Vo, Hao and Vo, Khoa and Nguyen, Phu Loc and Tran, Sieu and
             Nguyen, Duc Minh and Cuong, Ngo Xuan and Gawugah, Gladys and
             Godavarthi, Sreevenkata Anjani Tishita and Rainwater, Chase and
             Bui, Nghi D. Q. and Nguyen, Anh and Ho Nguyen, Duy Minh and Le, Ngan},
  year    = {2026}
}