Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Paper: arXiv:2603.19235
VEGA-3D is a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator to enrich Multimodal Large Language Models (MLLMs) with implicit 3D spatial priors for scene understanding, spatial reasoning, and embodied decision making.
More details can be found in the paper: Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding.
If you find VEGA-3D useful in your research, please consider citing:
@inproceedings{wu2026vega,
  title={Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding},
  author={Xianjin Wu and Dingkang Liang and Tianrui Feng and Kui Xia and Yumeng Zhang and Xiaofan Li and Xiao Tan and Xiang Bai},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2026}
}