VEGA-3D (Video Extracted Generative Awareness)

VEGA-3D is a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator to enrich Multimodal Large Language Models (MLLMs) with implicit 3D spatial priors for scene understanding, spatial reasoning, and embodied decision making.

More details can be found in the paper: Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding.

Project Page: https://h-embodvis.github.io/VEGA-3D/
Repository: https://github.com/H-EmbodVis/VEGA-3D

Citation

If you find VEGA-3D useful in your research, please consider citing:

@inproceedings{wu2026vega,
      title={Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding},
      author={Xianjin Wu and Dingkang Liang and Tianrui Feng and Kui Xia and Yumeng Zhang and Xiaofan Li and Xiao Tan and Xiang Bai},
      booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
      year={2026}
}

Format: Safetensors
Model size: 10B params
Tensor type: BF16
Task: Video-Text-to-Text
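
As a rough guide to what the 10B-parameter BF16 checkpoint means for storage and loading, the sketch below estimates the memory taken by the weights alone. It assumes exactly 10e9 parameters (the card gives only the rounded "10B") and 2 bytes per BF16 value. The helper name `weight_memory_gb` is hypothetical and is not part of the VEGA-3D codebase. Activations, caches, and framework overhead are not counted, so actual usage will be higher.

```python
# Back-of-the-envelope estimate of weight memory for a checkpoint.
# Assumption: the parameter count is exactly 10e9 (the card says "10B").
# Only the weights are counted, not activations or runtime overhead.

# Bytes used to store one parameter in each tensor type.
BYTES_PER_PARAM = {"BF16": 2, "FP16": 2, "FP32": 4}


def weight_memory_gb(num_params: float, dtype: str = "BF16") -> float:
    """Return approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # 10e9 parameters * 2 bytes each, expressed in GB.
    print(f"BF16 weights: about {weight_memory_gb(10e9):.1f} GB")
    # The same weights upcast to FP32 would take twice as much.
    print(f"FP32 weights: about {weight_memory_gb(10e9, 'FP32'):.1f} GB")
```

Under these assumptions the weights take about 20 GB in BF16, which is a lower bound to check against available disk space and accelerator memory before downloading.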

Paper for H-EmbodVis/VEGA-3D-Spatial-Reasoning

Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding (Paper 2603.19235, published Mar 19)