mdiffae_v1

mDiffAE (Masked Diffusion AutoEncoder). A fast, single-GPU-trainable diffusion autoencoder with a 64-channel spatial bottleneck. Uses decoder token masking as an implicit regularizer instead of REPA alignment.

This variant (mdiffae_v1): 81.4M parameters, 310.6 MB. Bottleneck: 64 channels at patch size 16 (compression ratio 12x).
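The 12x figure follows directly from the patch geometry: each 16x16 RGB patch carries 3 × 16 × 16 = 768 input values, which the encoder maps to 64 bottleneck channels. A quick check:

```python
# Compression ratio from the patch geometry stated above.
patch_size, in_channels, bottleneck = 16, 3, 64
values_per_patch = in_channels * patch_size * patch_size  # 768
compression = values_per_patch / bottleneck
print(compression)  # 12.0
```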

Documentation

Quick Start

import torch
from m_diffae import MDiffAE

# Load from HuggingFace Hub (or a local path)
model = MDiffAE.from_pretrained("data-archetype/mdiffae_v1", device="cuda")

# Encode
images = ...  # [B, 3, H, W] in [-1, 1], H and W divisible by 16
H, W = images.shape[-2:]  # spatial size, needed again for decoding
latents = model.encode(images)

# Decode (1 step by default; PSNR-optimal)
recon = model.decode(latents, height=H, width=W)

# Reconstruct (encode + 1-step decode)
recon = model.reconstruct(images)

Note: Requires pip install huggingface_hub safetensors for Hub downloads. You can also pass a local directory path to from_pretrained().

Architecture

| Property | Value |
|---|---|
| Parameters | 81,410,624 |
| File size | 310.6 MB |
| Patch size | 16 |
| Model dim | 896 |
| Encoder depth | 4 |
| Decoder depth | 4 |
| Decoder topology | Flat sequential (no skip connections) |
| Bottleneck dim | 64 |
| MLP ratio | 4.0 |
| Depthwise kernel | 7 |
| AdaLN rank | 128 |
| PDG mechanism | Token-level masking (ratio 0.75) |
| Training regularizer | Decoder token masking (75% ratio, 50% apply prob) |
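The token-masking regularizer in the last row can be sketched as follows. This is a hedged illustration, assuming per-sample masking with a single learned mask token; the function and argument names are placeholders, not mDiffAE's actual API.

```python
import torch

# Sketch of the decoder-token-masking regularizer: with probability
# apply_prob per sample, a fraction `ratio` of that sample's decoder
# tokens is replaced by a mask token (assumed to be a learned vector).
def mask_decoder_tokens(tokens, mask_token, ratio=0.75, apply_prob=0.5):
    B, N, D = tokens.shape
    out = tokens.clone()
    num_mask = int(N * ratio)
    for b in range(B):
        if torch.rand(()) < apply_prob:
            idx = torch.randperm(N)[:num_mask]
            out[b, idx] = mask_token
    return out

tokens = torch.randn(4, 16, 64)          # [B, N, D] decoder tokens
masked = mask_decoder_tokens(tokens, mask_token=torch.zeros(64))
```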

Encoder: Deterministic. Patchify (PixelUnshuffle + 1x1 conv) followed by DiCo blocks (depthwise conv + compact channel attention + GELU MLP) with learned residual gates.
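The patchify stem described above can be sketched in a few lines. This is an illustration of the stated PixelUnshuffle + 1x1-conv design using the documented patch size (16) and model dim (896); module names are placeholders, not mDiffAE's internals.

```python
import torch
import torch.nn as nn

# PixelUnshuffle folds each 16x16 patch into the channel dimension,
# then a 1x1 conv projects the 3*16*16 = 768 channels to model dim 896.
patch, dim = 16, 896
patchify = nn.Sequential(
    nn.PixelUnshuffle(patch),                         # [B, 3, H, W] -> [B, 768, H/16, W/16]
    nn.Conv2d(3 * patch * patch, dim, kernel_size=1), # -> [B, 896, H/16, W/16]
)

x = torch.randn(2, 3, 64, 64)
tokens = patchify(x)  # [2, 896, 4, 4]
```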

Decoder: VP diffusion conditioned on encoder latents and timestep via shared-base + per-layer low-rank AdaLN-Zero. 4 flat sequential blocks (no skip connections).
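The shared-base + per-layer low-rank AdaLN-Zero conditioning can be sketched as below, using the documented rank (128) and model dim (896). The exact factorization mDiffAE uses is an assumption; this only illustrates the low-rank, zero-initialized modulation idea.

```python
import torch
import torch.nn as nn

# A shared base projection maps the conditioning vector (latent +
# timestep embedding) to a rank-128 code; a per-layer head expands it
# to scale/shift/gate. Zero init makes modulation a no-op at the start
# of training (the "Zero" in AdaLN-Zero).
dim, rank = 896, 128
shared_base = nn.Linear(dim, rank)    # shared across decoder layers
per_layer = nn.Linear(rank, 3 * dim)  # one such head per decoder layer
nn.init.zeros_(per_layer.weight)
nn.init.zeros_(per_layer.bias)

cond = torch.randn(2, dim)
shift, scale, gate = per_layer(shared_base(cond)).chunk(3, dim=-1)

h = torch.randn(2, 16, dim)  # decoder tokens
h_out = h + gate.unsqueeze(1) * (
    h * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
)  # identity at init, since gate == 0
```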

Compared to iRDiffAE: iRDiffAE uses an 8-block decoder (2 start + 4 middle + 2 end) with skip connections and 128 bottleneck channels (needed partly because REPA occupies half the channels). mDiffAE uses 4 flat blocks with no skip connections and 64 bottleneck channels (12x compression vs iRDiffAE's 6x), which gives better channel utilisation.

Key Differences from iRDiffAE

| Aspect | iRDiffAE v1 | mDiffAE v1 |
|---|---|---|
| Bottleneck dim | 128 | 64 |
| Decoder depth | 8 (2+4+2 skip-concat) | 4 (flat sequential) |
| PDG mechanism | Block dropping | Token masking |
| Training regularizer | REPA + covariance reg | Decoder token masking |

Recommended Settings

Best quality is achieved with 1 DDIM step and PDG disabled. PDG can sharpen images but should be kept very low (1.01–1.05).

| Setting | Default |
|---|---|
| Sampler | DDIM |
| Steps | 1 |
| PDG | Disabled |
| PDG strength (if enabled) | 1.05 |

from m_diffae import MDiffAEInferenceConfig

# PSNR-optimal (fast, 1 step)
cfg = MDiffAEInferenceConfig(num_steps=1, sampler="ddim")
recon = model.decode(latents, height=H, width=W, inference_config=cfg)
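If PDG is enabled, its effect at a denoising step can be pictured with the standard guidance combination: extrapolating from a degraded branch (conditioning tokens masked, per the token-level PDG mechanism above) toward the fully conditioned branch. This is a hypothetical sketch, not the library's API; the combination rule and function names are assumptions.

```python
import torch

# Guided prediction: strength 1.0 reproduces the fully conditioned
# branch; strengths just above 1.0 (e.g. 1.05) extrapolate slightly
# past it, away from the token-masked branch.
def pdg_combine(pred_full, pred_masked, strength=1.05):
    return pred_masked + strength * (pred_full - pred_masked)
```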

Citation

@misc{m_diffae,
  title   = {mDiffAE: A Fast Masked Diffusion Autoencoder},
  author  = {data-archetype},
  year    = {2026},
  month   = mar,
  url     = {https://huggingface.co/data-archetype/mdiffae_v1},
}

Dependencies

  • PyTorch >= 2.0
  • safetensors (for loading weights)

License

Apache 2.0
