mdiffae_v1
mDiffAE – Masked Diffusion AutoEncoder. A fast, single-GPU-trainable diffusion autoencoder with a 64-channel spatial bottleneck. Uses decoder token masking as an implicit regularizer instead of REPA alignment.
This variant (mdiffae_v1): 81.4M parameters, 310.6 MB. Bottleneck: 64 channels at patch size 16 (compression ratio 12x).
Documentation
- Technical Report – architecture, masking strategy, and results
- iRDiffAE Technical Report – full background on VP diffusion, DiCo blocks, patchify encoder, AdaLN
- Results – interactive viewer with full-resolution side-by-side comparisons
Quick Start
```python
import torch
from m_diffae import MDiffAE

# Load from HuggingFace Hub (or a local path)
model = MDiffAE.from_pretrained("data-archetype/mdiffae_v1", device="cuda")

# Encode
images = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 16
latents = model.encode(images)

# Decode (1 step by default – PSNR-optimal)
H, W = images.shape[-2:]
recon = model.decode(latents, height=H, width=W)

# Reconstruct (encode + 1-step decode)
recon = model.reconstruct(images)
```
Note: Hub downloads require `pip install huggingface_hub safetensors`. You can also pass a local directory path to `from_pretrained()`.
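Inputs must have H and W divisible by 16 (the patch size). If your images do not, a small padding helper can bring them to the nearest valid size before encoding. This is a minimal sketch, not part of the `m_diffae` API; the function name `pad_to_multiple` is hypothetical.

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(images: torch.Tensor, multiple: int = 16) -> torch.Tensor:
    """Reflect-pad a [B, C, H, W] batch so H and W are divisible by `multiple`.

    Hypothetical helper, not part of m_diffae: crop the reconstruction back to
    the original H x W after decoding.
    """
    _, _, h, w = images.shape
    pad_h = (-h) % multiple  # rows to add at the bottom
    pad_w = (-w) % multiple  # columns to add on the right
    # F.pad padding order for 4D input is (left, right, top, bottom)
    return F.pad(images, (0, pad_w, 0, pad_h), mode="reflect")
```

After `model.reconstruct(padded)`, slice the output back to `[..., :H, :W]` to recover the original resolution.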
Architecture
| Property | Value |
|---|---|
| Parameters | 81,410,624 |
| File size | 310.6 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 4 |
| Decoder topology | Flat sequential (no skip connections) |
| Bottleneck dim | 64 |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG mechanism | Token-level masking (ratio 0.75) |
| Training regularizer | Decoder token masking (75% ratio, 50% apply prob) |
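The "decoder token masking (75% ratio, 50% apply prob)" regularizer can be sketched as follows. This is an illustrative reading of the table, not the repository's training code: the real implementation may drop masked tokens from the sequence rather than zero them, and the function name `mask_decoder_tokens` is hypothetical.

```python
import torch

def mask_decoder_tokens(tokens: torch.Tensor,
                        mask_ratio: float = 0.75,
                        apply_prob: float = 0.5) -> torch.Tensor:
    """Zero out a random `mask_ratio` fraction of decoder tokens.

    Masking is applied with probability `apply_prob` per batch; otherwise the
    tokens pass through unchanged. Sketch only; hypothetical helper.
    """
    B, N, D = tokens.shape
    if torch.rand(()) >= apply_prob:
        return tokens  # masking skipped for this batch
    num_keep = int(N * (1 - mask_ratio))
    # Random per-sample scores; keep the `num_keep` lowest-scoring tokens
    scores = torch.rand(B, N)
    keep_idx = scores.argsort(dim=1)[:, :num_keep]
    keep_mask = torch.zeros(B, N, dtype=torch.bool)
    keep_mask.scatter_(1, keep_idx, True)
    return tokens * keep_mask.unsqueeze(-1)
```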
Encoder: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by DiCo blocks (depthwise conv + compact channel attention + GELU MLP) with learned residual gates.
Decoder: VP diffusion conditioned on encoder latents and timestep via shared-base + per-layer low-rank AdaLN-Zero. 4 flat sequential blocks (no skip connections).
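The patchify stem described above (PixelUnshuffle + 1x1 conv) can be sketched with the dimensions from the architecture table (patch size 16, model dim 896). The class name `PatchifyStem` is hypothetical; this shows the shape bookkeeping, not the repository's exact module.

```python
import torch
import torch.nn as nn

class PatchifyStem(nn.Module):
    """Space-to-depth patchify: PixelUnshuffle then a 1x1 conv to model dim.

    Hypothetical sketch of the encoder stem, using the documented dims.
    """
    def __init__(self, in_ch: int = 3, patch: int = 16, dim: int = 896):
        super().__init__()
        # [B, 3, H, W] -> [B, 3*16*16, H/16, W/16]
        self.unshuffle = nn.PixelUnshuffle(patch)
        # 1x1 conv projects the 768 stacked channels to the 896-dim model width
        self.proj = nn.Conv2d(in_ch * patch * patch, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.unshuffle(x))
```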
Compared to iRDiffAE: iRDiffAE uses an 8-block decoder (2 start + 4 middle + 2 end) with skip connections and 128 bottleneck channels (needed partly because REPA occupies half the channels). mDiffAE uses 4 flat blocks with no skip connections and 64 bottleneck channels (12x compression vs iRDiffAE's 6x), which gives better channel utilisation.
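The 12x figure follows directly from the table values: each 16x16 patch carries 3 x 16 x 16 = 768 input values, and the bottleneck stores 64 channels per patch position.

```python
# Compression ratio from the documented dims (patch 16, RGB input, 64-ch bottleneck)
patch, in_ch, bottleneck = 16, 3, 64
pixels_per_patch = in_ch * patch * patch  # 768 input values per patch
ratio = pixels_per_patch / bottleneck     # 768 / 64 = 12.0
print(ratio)
```

With iRDiffAE's 128-channel bottleneck the same arithmetic gives 768 / 128 = 6x.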
Key Differences from iRDiffAE
| Aspect | iRDiffAE v1 | mDiffAE v1 |
|---|---|---|
| Bottleneck dim | 128 | 64 |
| Decoder depth | 8 (2+4+2 skip-concat) | 4 (flat sequential) |
| PDG mechanism | Block dropping | Token masking |
| Training regularizer | REPA + covariance reg | Decoder token masking |
Recommended Settings
Best quality is achieved with 1 DDIM step and PDG disabled. PDG can sharpen images but should be kept very low (1.01–1.05).
| Setting | Default |
|---|---|
| Sampler | DDIM |
| Steps | 1 |
| PDG | Disabled |
| PDG strength (if enabled) | 1.05 |
```python
from m_diffae import MDiffAEInferenceConfig

# PSNR-optimal (fast, 1 step)
cfg = MDiffAEInferenceConfig(num_steps=1, sampler="ddim")
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```
Citation
```bibtex
@misc{m_diffae,
  title  = {mDiffAE: A Fast Masked Diffusion Autoencoder},
  author = {data-archetype},
  year   = {2026},
  month  = mar,
  url    = {https://huggingface.co/data-archetype/mdiffae_v1},
}
```
Dependencies
- PyTorch >= 2.0
- safetensors (for loading weights)
License
Apache 2.0