Latent Diffusion Models with Masked AutoEncoders (LDMAE)
Junho Lee*, Jeongwoo Shin*, Hyungwook Choi, Joonseok Lee†
Seoul National University, Seoul, Korea (* Equal contribution, † Corresponding author)
Abstract
This project implements Latent Diffusion Models with Masked AutoEncoders (LDMAE), presented at ICCV 2025. We analyze the role of autoencoders in LDMs and identify three key properties: latent smoothness, perceptual compression quality, and reconstruction quality. We show that existing autoencoders fail to satisfy all three properties simultaneously, and propose Variational Masked AutoEncoders (VMAEs), which take advantage of the hierarchical features maintained by Masked AutoEncoders. Through comprehensive experiments, we demonstrate significantly improved image generation quality and computational efficiency.
The codebase is built upon MAE and LightningDiT.
Requirements
Environment Setup
- Create a conda environment:
conda create -n ldmae python=3.10
conda activate ldmae
- Install dependencies:
pip install -r requirements.txt
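As a quick sanity check of the environment (assuming PyTorch is among the pinned dependencies, as both MAE and LightningDiT are PyTorch codebases):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # expect the installed version and True on a GPU machine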
Project Structure
ldmae_for_github/
├── LDMAE/               # Main diffusion model implementation (based on LightningDiT)
│   ├── configs/         # Configuration files for different datasets
│   ├── datasets/        # Dataset loaders and utilities
│   ├── models/          # Model architectures
│   ├── tokenizer/       # Tokenization modules
│   └── pretrain_weight/ # Directory for pretrained weights
├── VMAE/                # Masked Autoencoder implementation
│   ├── train_ae.sh      # Autoencoder training script
│   └── ...
└── requirements.txt     # Python dependencies
Training Pipeline
Step 1: Train Autoencoder
First, train the autoencoder model using the VMAE module:
cd VMAE
bash train_ae.sh
The training script includes:
- Autoencoder training
- Positional embedding replacement
- Decoder fine-tuning
After training is complete, save the trained model checkpoint as vmaef8d16.pth in the LDMAE/pretrain_weight/ directory.
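For example, assuming the VMAE run writes its final checkpoint to an output/ directory (the source filename below is hypothetical; substitute your actual checkpoint), the copy could look like:
mkdir -p ../LDMAE/pretrain_weight  # run from within VMAE/
cp output/checkpoint-last.pth ../LDMAE/pretrain_weight/vmaef8d16.pth  # hypothetical source path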
**Pretrained checkpoints are also available HERE**
Step 2: Configure Datasets
Before proceeding with feature extraction and training, configure the dataset paths in the config files located in the LDMAE/configs/ directory:
- For ImageNet: edit configs/imagenet/lightningdit_b_vmae_f8d16_cfg.yaml
- For CelebA-HQ: edit configs/celeba_hq/lightningdit_b_vmae_f8d16_cfg.yaml
Update the dataset paths according to your local setup.
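A minimal sketch of the kind of entries to change (the section and key names below are assumptions; match whatever the shipped config actually uses):
data:
  data_path: /path/to/imagenet/train  # assumed key name; point to your local dataset root
  image_size: 256                     # assumed key name; keep the value shipped with the config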
Step 3: Feature Extraction
Extract features from your datasets using the trained autoencoder:
ImageNet
cd LDMAE
bash run_extract_feature.sh configs/imagenet/lightningdit_b_vmae_f8d16_cfg.yaml
CelebA-HQ
cd LDMAE
bash run_extract_feature.sh configs/celeba_hq/lightningdit_b_vmae_f8d16_cfg.yaml
Step 4: Train Diffusion Model
Train the diffusion model on the extracted features:
ImageNet
bash run_train.sh configs/imagenet/lightningdit_b_vmae_f8d16_cfg.yaml
CelebA-HQ
bash run_train.sh configs/celeba_hq/lightningdit_b_vmae_f8d16_cfg.yaml
Step 5: Inference
Generate images using the trained model:
bash run_inference.sh {CONFIG_PATH}
Replace {CONFIG_PATH} with the path to your configuration file (e.g., configs/imagenet/lightningdit_b_vmae_f8d16_cfg.yaml).
python tools/save_npz.py {CONFIG_PATH} # save your npz
python tools/evaluator.py /path/to/reference.npz /path/to/your.npz # calculate metrics
You can download FID stats from HERE
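A hedged end-to-end example for ImageNet (the reference and sample file names are assumptions; use the stats file you downloaded and the .npz your run actually produces):
CONFIG=configs/imagenet/lightningdit_b_vmae_f8d16_cfg.yaml
python tools/save_npz.py $CONFIG  # pack generated samples into an .npz
python tools/evaluator.py VIRTUAL_imagenet256_labeled.npz samples.npz  # assumed file names; prints FID and related metrics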
Configuration Files
The project includes various configuration files for different model variants and datasets:
- ImageNet configs: located in LDMAE/configs/imagenet/
- CelebA-HQ configs: located in LDMAE/configs/celeba_hq/
Each configuration file specifies:
- Model architecture parameters
- Training hyperparameters
- Dataset paths
- Autoencoder settings
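As a rough, hypothetical outline of how such a file may be organized (the section names are assumptions, not the actual schema):
model:  # architecture parameters (e.g., depth, hidden size)
train:  # training hyperparameters (batch size, learning rate, steps)
data:   # dataset paths, as configured in Step 2
vae:    # autoencoder settings (checkpoint path, downsampling factor)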
Notes
- Ensure all dataset paths are correctly configured before training
- The autoencoder must be trained before feature extraction
- Feature extraction is required before training the diffusion model
- The codebase is based on LightningDiT with minimal modifications
Citation
If you use this code in your research, please cite our paper:
@InProceedings{Lee_2025_ICCV,
author = {Lee, Junho and Shin, Jeongwoo and Choi, Hyungwook and Lee, Joonseok},
title = {Latent Diffusion Models with Masked AutoEncoders},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {17422-17431}
}
@article{lee2025latent,
title={Latent Diffusion Models with Masked AutoEncoders},
author={Lee, Junho and Shin, Jeongwoo and Choi, Hyungwook and Lee, Joonseok},
journal={arXiv preprint arXiv:2507.09984},
year={2025}
}