mdiffae_v1
mDiffAE – Masked Diffusion AutoEncoder. A fast, single-GPU-trainable diffusion autoencoder with a 64-channel spatial bottleneck. Uses decoder token masking as an implicit regularizer instead of REPA alignment.
This variant (mdiffae_v1): 81.4M parameters, 310.6 MB. Bottleneck: 64 channels at patch size 16 (compression ratio 12x).
Documentation
- Technical Report – architecture, masking strategy, and results
- iRDiffAE Technical Report – full background on VP diffusion, DiCo blocks, patchify encoder, AdaLN
- Results – interactive viewer with full-resolution side-by-side comparisons
Quick Start
```python
import torch
from m_diffae import MDiffAE

# Load from HuggingFace Hub (or a local path)
model = MDiffAE.from_pretrained("data-archetype/mdiffae_v1", device="cuda")

# Encode
images = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 16
latents = model.encode(images)

# Decode (1 step by default – PSNR-optimal)
H, W = images.shape[-2:]
recon = model.decode(latents, height=H, width=W)

# Reconstruct (encode + 1-step decode)
recon = model.reconstruct(images)
```
Note: Hub downloads require `pip install huggingface_hub safetensors`. You can also pass a local directory path to `from_pretrained()`.
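Inputs must have H and W divisible by 16 (the patch size). If your images do not, a small padding helper can bring them to the nearest valid size before encoding. This is a minimal sketch, not part of the `m_diffae` API; the function name `pad_to_multiple` is hypothetical.

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(images: torch.Tensor, multiple: int = 16) -> torch.Tensor:
    """Reflect-pad a [B, C, H, W] batch so H and W are divisible by `multiple`.

    Hypothetical helper, not part of m_diffae: crop the reconstruction back to
    the original H x W after decoding.
    """
    _, _, h, w = images.shape
    pad_h = (-h) % multiple  # rows to add at the bottom
    pad_w = (-w) % multiple  # columns to add on the right
    # F.pad padding order for 4D input is (left, right, top, bottom)
    return F.pad(images, (0, pad_w, 0, pad_h), mode="reflect")
```

After `model.reconstruct(padded)`, slice the output back to `[..., :H, :W]` to recover the original resolution.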
Architecture
| Property | Value |
|---|---|
| Parameters | 81,410,624 |
| File size | 310.6 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 4 |
| Decoder topology | Flat sequential (no skip connections) |
| Bottleneck dim | 64 |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG mechanism | Token-level masking (ratio 0.75) |
| Training regularizer | Decoder token masking (75% ratio, 50% apply prob) |
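The "decoder token masking (75% ratio, 50% apply prob)" regularizer can be sketched as follows. This is an illustrative reading of the table, not the repository's training code: the real implementation may drop masked tokens from the sequence rather than zero them, and the function name `mask_decoder_tokens` is hypothetical.

```python
import torch

def mask_decoder_tokens(tokens: torch.Tensor,
                        mask_ratio: float = 0.75,
                        apply_prob: float = 0.5) -> torch.Tensor:
    """Zero out a random `mask_ratio` fraction of decoder tokens.

    Masking is applied with probability `apply_prob` per batch; otherwise the
    tokens pass through unchanged. Sketch only; hypothetical helper.
    """
    B, N, D = tokens.shape
    if torch.rand(()) >= apply_prob:
        return tokens  # masking skipped for this batch
    num_keep = int(N * (1 - mask_ratio))
    # Random per-sample scores; keep the `num_keep` lowest-scoring tokens
    scores = torch.rand(B, N)
    keep_idx = scores.argsort(dim=1)[:, :num_keep]
    keep_mask = torch.zeros(B, N, dtype=torch.bool)
    keep_mask.scatter_(1, keep_idx, True)
    return tokens * keep_mask.unsqueeze(-1)
```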
Encoder: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by DiCo blocks (depthwise conv + compact channel attention + GELU MLP) with learned residual gates.
Decoder: VP diffusion conditioned on encoder latents and timestep via shared-base + per-layer low-rank AdaLN-Zero. 4 flat sequential blocks (no skip connections).
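The patchify stem described above (PixelUnshuffle + 1x1 conv) can be sketched with the dimensions from the architecture table (patch size 16, model dim 896). The class name `PatchifyStem` is hypothetical; this shows the shape bookkeeping, not the repository's exact module.

```python
import torch
import torch.nn as nn

class PatchifyStem(nn.Module):
    """Space-to-depth patchify: PixelUnshuffle then a 1x1 conv to model dim.

    Hypothetical sketch of the encoder stem, using the documented dims.
    """
    def __init__(self, in_ch: int = 3, patch: int = 16, dim: int = 896):
        super().__init__()
        # [B, 3, H, W] -> [B, 3*16*16, H/16, W/16]
        self.unshuffle = nn.PixelUnshuffle(patch)
        # 1x1 conv projects the 768 stacked channels to the 896-dim model width
        self.proj = nn.Conv2d(in_ch * patch * patch, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.unshuffle(x))
```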
Compared to iRDiffAE: iRDiffAE uses an 8-block decoder (2 start + 4 middle + 2 end) with skip connections and 128 bottleneck channels (needed partly because REPA occupies half the channels). mDiffAE uses 4 flat blocks with no skip connections and 64 bottleneck channels (12x compression vs iRDiffAE's 6x), which gives better channel utilisation.
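The 12x figure follows directly from the table values: each 16x16 patch carries 3 x 16 x 16 = 768 input values, and the bottleneck stores 64 channels per patch position.

```python
# Compression ratio from the documented dims (patch 16, RGB input, 64-ch bottleneck)
patch, in_ch, bottleneck = 16, 3, 64
pixels_per_patch = in_ch * patch * patch  # 768 input values per patch
ratio = pixels_per_patch / bottleneck     # 768 / 64 = 12.0
print(ratio)
```

With iRDiffAE's 128-channel bottleneck the same arithmetic gives 768 / 128 = 6x.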
Key Differences from iRDiffAE
| Aspect | iRDiffAE v1 | mDiffAE v1 |
|---|---|---|
| Bottleneck dim | 128 | 64 |
| Decoder depth | 8 (2+4+2 skip-concat) | 4 (flat sequential) |
| PDG mechanism | Block dropping | Token masking |
| Training regularizer | REPA + covariance reg | Decoder token masking |
Recommended Settings
Best quality is achieved with 1 DDIM step and PDG disabled. PDG can sharpen images but should be kept very low (1.01–1.05).
| Setting | Default |
|---|---|
| Sampler | DDIM |
| Steps | 1 |
| PDG | Disabled |
| PDG strength (if enabled) | 1.05 |
```python
from m_diffae import MDiffAEInferenceConfig

# PSNR-optimal (fast, 1 step)
cfg = MDiffAEInferenceConfig(num_steps=1, sampler="ddim")
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
```
Citation
```bibtex
@misc{m_diffae,
  title  = {mDiffAE: A Fast Masked Diffusion Autoencoder},
  author = {data-archetype},
  year   = {2026},
  month  = mar,
  url    = {https://huggingface.co/data-archetype/mdiffae_v1},
}
```
Dependencies
- PyTorch >= 2.0
- safetensors (for loading weights)
License
Apache 2.0