Latent Diffusion Models with Masked AutoEncoders (LDMAE)
Junho Lee*, Jeongwoo Shin*, Hyungwook Choi, Joonseok Lee†
Seoul National University, Seoul, Korea (* Equal contribution, † Corresponding author)
Abstract
This project implements Latent Diffusion Models with Masked AutoEncoders (LDMAE), presented at ICCV 2025. We analyze the role of autoencoders in LDMs and identify three key properties: latent smoothness, perceptual compression quality, and reconstruction quality. We show that existing autoencoders fail to satisfy all three properties simultaneously, and propose Variational Masked AutoEncoders (VMAEs), which take advantage of the hierarchical features maintained by Masked AutoEncoders. Through comprehensive experiments, we demonstrate significantly improved image generation quality and computational efficiency.
The codebase is built upon MAE and LightningDiT.
Requirements
Environment Setup
- Create a conda environment:
conda create -n ldmae python=3.10
conda activate ldmae
- Install dependencies:
pip install -r requirements.txt
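As a quick sanity check of the environment (assuming PyTorch is among the pinned dependencies, as both MAE and LightningDiT are PyTorch codebases):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # expect the installed version and True on a GPU machine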
Project Structure
ldmae_for_github/
├── LDMAE/               # Main diffusion model implementation (based on LightningDiT)
│   ├── configs/         # Configuration files for different datasets
│   ├── datasets/        # Dataset loaders and utilities
│   ├── models/          # Model architectures
│   ├── tokenizer/       # Tokenization modules
│   └── pretrain_weight/ # Directory for pretrained weights
├── VMAE/                # Masked Autoencoder implementation
│   ├── train_ae.sh      # Autoencoder training script
│   └── ...
└── requirements.txt     # Python dependencies
Training Pipeline
Step 1: Train Autoencoder
First, train the autoencoder model using the VMAE module:
cd VMAE
bash train_ae.sh
The training script includes:
- Autoencoder training
- Positional embedding replacement
- Decoder fine-tuning
After training is complete, save the trained model checkpoint as vmaef8d16.pth in the LDMAE/pretrain_weight/ directory.
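For example, assuming the VMAE run writes its final checkpoint to an output/ directory (the source filename below is hypothetical; substitute your actual checkpoint), the copy could look like:
mkdir -p ../LDMAE/pretrain_weight  # run from within VMAE/
cp output/checkpoint-last.pth ../LDMAE/pretrain_weight/vmaef8d16.pth  # hypothetical source path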
**Pretrained checkpoints are also available HERE**
Step 2: Configure Datasets
Before proceeding with feature extraction and training, configure the dataset paths in the config files located in the LDMAE/configs/ directory:
- For ImageNet: edit configs/imagenet/lightningdit_b_vmae_f8d16_cfg.yaml
- For CelebA-HQ: edit configs/celeba_hq/lightningdit_b_vmae_f8d16_cfg.yaml
Update the dataset paths according to your local setup.
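A minimal sketch of the kind of entries to change (the section and key names below are assumptions; match whatever the shipped config actually uses):
data:
  data_path: /path/to/imagenet/train  # assumed key name; point to your local dataset root
  image_size: 256                     # assumed key name; keep the value shipped with the config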
Step 3: Feature Extraction
Extract features from your datasets using the trained autoencoder:
ImageNet
cd LDMAE
bash run_extract_feature.sh configs/imagenet/lightningdit_b_vmae_f8d16_cfg.yaml
CelebA-HQ
cd LDMAE
bash run_extract_feature.sh configs/celeba_hq/lightningdit_b_vmae_f8d16_cfg.yaml
Step 4: Train Diffusion Model
Train the diffusion model on the extracted features:
ImageNet
bash run_train.sh configs/imagenet/lightningdit_b_vmae_f8d16_cfg.yaml
CelebA-HQ
bash run_train.sh configs/celeba_hq/lightningdit_b_vmae_f8d16_cfg.yaml
Step 5: Inference
Generate images using the trained model:
bash run_inference.sh {CONFIG_PATH}
Replace {CONFIG_PATH} with the path to your configuration file (e.g., configs/imagenet/lightningdit_b_vmae_f8d16_cfg.yaml).
python tools/save_npz.py {CONFIG_PATH} # save your npz
python tools/evaluator.py /path/to/reference.npz /path/to/your.npz # calculate metrics
You can download FID stats from HERE
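A hedged end-to-end example for ImageNet (the reference and sample file names are assumptions; use the stats file you downloaded and the .npz your run actually produces):
CONFIG=configs/imagenet/lightningdit_b_vmae_f8d16_cfg.yaml
python tools/save_npz.py $CONFIG  # pack generated samples into an .npz
python tools/evaluator.py VIRTUAL_imagenet256_labeled.npz samples.npz  # assumed file names; prints FID and related metrics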
Configuration Files
The project includes various configuration files for different model variants and datasets:
- ImageNet configs: located in LDMAE/configs/imagenet/
- CelebA-HQ configs: located in LDMAE/configs/celeba_hq/
Each configuration file specifies:
- Model architecture parameters
- Training hyperparameters
- Dataset paths
- Autoencoder settings
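As a rough, hypothetical outline of how such a file may be organized (the section names are assumptions, not the actual schema):
model:  # architecture parameters (e.g., depth, hidden size)
train:  # training hyperparameters (batch size, learning rate, steps)
data:   # dataset paths, as configured in Step 2
vae:    # autoencoder settings (checkpoint path, downsampling factor)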
Notes
- Ensure all dataset paths are correctly configured before training
- The autoencoder must be trained before feature extraction
- Feature extraction is required before training the diffusion model
- The codebase is based on LightningDiT with minimal modifications
Citation
If you use this code in your research, please cite our paper:
@InProceedings{Lee_2025_ICCV,
author = {Lee, Junho and Shin, Jeongwoo and Choi, Hyungwook and Lee, Joonseok},
title = {Latent Diffusion Models with Masked AutoEncoders},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {17422-17431}
}
@article{lee2025latent,
title={Latent Diffusion Models with Masked AutoEncoders},
author={Lee, Junho and Shin, Jeongwoo and Choi, Hyungwook and Lee, Joonseok},
journal={arXiv preprint arXiv:2507.09984},
year={2025}
}