MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation
Abstract
MIMFlow combines Normalizing Flows with Masked Image Modeling to improve generative modeling by decoupling semantic representation from pixel-level details, achieving better performance with fewer tokens.
Normalizing Flows (NFs) are powerful generative models capable of exact density estimation and sampling. However, their strict invertibility often forces the model to exhaust its capacity on low-level pixel details, hindering the capture of high-level semantic structures. While Masked Image Modeling (MIM) has excelled in representation learning, its integration into generative pipelines has remained largely modular and disjointed. In this paper, we propose MIMFlow, a unified end-to-end framework that jointly optimizes latent semantics, pixel reconstruction, and generative flow. By employing a VAE encoder to infer semantic latents from masked images, MIMFlow achieves a principled decoupling of the generative task: the Normalizing Flow focuses on modeling a simplified, low-frequency semantic manifold, while a specialized decoder handles high-frequency synthesis. This design effectively resolves the inherent capacity bottleneck of NFs, allowing the model to prioritize global structural coherence over redundant noise. Empirical results on ImageNet 256×256 show that MIMFlow-L reaches 71.3% linear probing accuracy and an FID of 2.50. Despite using only 128 tokens (50% fewer than standard models), it yields a 32.8% performance gain over similar-scale NF baselines. Our code is available at https://github.com/MCG-NJU/MIMFlow.
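To make the "exact density estimation" and "strict invertibility" properties concrete, here is a minimal NumPy sketch of a single affine coupling layer, a common normalizing-flow building block. It is not MIMFlow's architecture: the function names, the tiny linear conditioners `w_s` and `w_t`, and the random `latents` array (standing in for semantic latents produced by an encoder) are all assumptions made for illustration.

```python
import numpy as np

def coupling_forward(x, w_s, w_t):
    """Affine coupling: pass x_a through unchanged, transform x_b conditioned on x_a."""
    x_a, x_b = np.split(x, 2, axis=-1)
    log_s = np.tanh(x_a @ w_s)           # bounded log-scale for numerical stability
    t = x_a @ w_t                        # shift
    z_b = x_b * np.exp(log_s) + t
    log_det = log_s.sum(axis=-1)         # log |det Jacobian| of the transform
    return np.concatenate([x_a, z_b], axis=-1), log_det

def coupling_inverse(z, w_s, w_t):
    """Exact inverse: x_a is available unchanged, so scale and shift can be recomputed."""
    z_a, z_b = np.split(z, 2, axis=-1)
    log_s = np.tanh(z_a @ w_s)
    t = z_a @ w_t
    x_b = (z_b - t) * np.exp(-log_s)
    return np.concatenate([z_a, x_b], axis=-1)

def log_likelihood(x, w_s, w_t):
    """Exact log p(x) via change of variables with a standard normal prior on z."""
    z, log_det = coupling_forward(x, w_s, w_t)
    d = z.shape[-1]
    log_pz = -0.5 * (z ** 2).sum(axis=-1) - 0.5 * d * np.log(2 * np.pi)
    return log_pz + log_det

rng = np.random.default_rng(0)
dim = 8                                              # hypothetical latent width
w_s = rng.normal(scale=0.1, size=(dim // 2, dim // 2))
w_t = rng.normal(scale=0.1, size=(dim // 2, dim // 2))
latents = rng.normal(size=(4, dim))                  # stand-in for encoder latents

z, log_det = coupling_forward(latents, w_s, w_t)
recovered = coupling_inverse(z, w_s, w_t)
log_px = log_likelihood(latents, w_s, w_t)
```

Because the layer must be exactly invertible, it cannot discard any information in its input. This is the constraint the abstract refers to: when the flow operates on raw pixels it has to account for every low-level detail, whereas a flow over a compact semantic latent leaves high-frequency synthesis to a separate decoder.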
Community
Accepted by ECCV 2026
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Vision Foundation Models as Generalist Tokenizers for Image Generation (2026)
- DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders (2026)
- SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation (2026)
- FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion (2026)
- SPAR: Semantic-Pixel Self-Alignment and Adaptive Routing for Unified Multimodal Models (2026)
- PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion (2026)
- HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion (2026)