VEGA-3D (Video Extracted Generative Awareness)

VEGA-3D is a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator to enrich Multimodal Large Language Models (MLLMs) with implicit 3D spatial priors for scene understanding, spatial reasoning, and embodied decision making.

More details can be found in the paper "Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding".

Citation

If you find VEGA-3D useful in your research, please consider citing:

@inproceedings{wu2026vega,
      title={Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding},
      author={Xianjin Wu and Dingkang Liang and Tianrui Feng and Kui Xia and Yumeng Zhang and Xiaofan Li and Xiao Tan and Xiang Bai},
      booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
      year={2026}
}

Model size: 10B params (Safetensors; tensor type BF16)