DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis
Paper: arXiv 2405.14224
A novel lightweight architecture for image generation that combines:
| Component | Source | Role |
|---|---|---|
| Liquid Time-Constant Networks | Hasani et al. 2020 | Adaptive ODE dynamics via the closed-form CfC solution; bounded by construction |
| Selective State Space Models | Gu & Dao 2023 (Mamba) | Linear-time long-range context, parallelizable scanning |
| Zigzag Scanning | ZigMa 2024 | 2D spatial awareness through alternating scan patterns |
| Physics-Informed Loss | Wang et al. 2020, PIDM 2024 | Smoothness + TV regularization for training stability |
| Rectified Flow Matching | Lipman et al. 2022 | ODE-based generation; no noise schedule tuning needed |
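Zigzag scanning serializes the 2D patch grid so that spatially adjacent patches stay adjacent in the 1D sequence the SSM consumes. A minimal sketch of one such pattern (the helper name `zigzag_order` is illustrative, not the repo's API; ZigMa-style models rotate among several directional variants per layer):

```python
def zigzag_order(H: int, W: int) -> list[int]:
    """Boustrophedon ("zigzag") scan over an H x W patch grid: row-major,
    reversing direction on every other row, so the sequence never jumps
    across the image the way a plain raster scan does at row ends."""
    order = []
    for r in range(H):
        cols = range(W) if r % 2 == 0 else range(W - 1, -1, -1)
        order.extend(r * W + c for c in cols)
    return order

# On a 2x3 grid, row 1 is traversed right-to-left:
print(zigzag_order(2, 3))  # [0, 1, 2, 5, 4, 3]
```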
```
Noise x₀ ~ N(0, I) ──▶ LiquidFlow v_θ(x_t, t) ──▶ Image x₁
               │
        ┌──────┴──────┐
        │ Patchify    │  (image → non-overlapping patches)
        │ + PosEmb    │  (2D learnable positions)
        │ + DepthConv │  (local structure preservation)
        └──────┬──────┘
               │
  ┌────────────┴────────────┐
  │   L × LiquidSSM Block   │
  │  ┌───────────────────┐  │
  │  │ AdaLN (t-cond)    │  │  ← DiT-style conditioning
  │  │ Zigzag Scan       │  │  ← rotates scan pattern per layer
  │  │ SelectiveSSM      │  │  ← Mamba-style, input-dependent A, B, C, Δ
  │  │ + LiquidCfC       │  │  ← CfC gating: σ(−f_τ)⊙h + (1−σ(−f_τ))⊙f_x
  │  │ + FFN             │  │  ← GELU feed-forward
  │  │ + Skip Connect    │  │  ← U-Net style long skips
  │  └───────────────────┘  │
  └────────────┬────────────┘
               │
        ┌──────┴──────┐
        │ DepthConv   │  (local refinement)
        │ Unpatchify  │  (patches → image)
        └──────┬──────┘
               │
     velocity v_θ (same shape as input)
```
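The Patchify step at the top of the pipeline is just a non-overlapping tiling of the image into flattened patch tokens. A toy sketch on a plain Python grid (the function name is illustrative; the real model patchifies (B, C, H, W) tensors with strided reshapes):

```python
def patchify(img: list[list[int]], p: int) -> list[list[int]]:
    """Split an H x W grid (list of rows) into non-overlapping p x p
    patches, each flattened row-major into one token vector."""
    H, W = len(img), len(img[0])
    patches = []
    for i in range(0, H, p):          # top-left corner of each patch
        for j in range(0, W, p):
            patches.append([img[r][c]
                            for r in range(i, i + p)
                            for c in range(j, j + p)])
    return patches

img = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy "image"
print(patchify(img, 2)[0])  # first 2x2 patch: [0, 1, 4, 5]
```

Unpatchify at the bottom of the pipeline is the exact inverse: each token is reshaped back to a p×p tile and placed at its grid position.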
Instead of solving the Liquid ODE numerically (sequential, slow):

```
dx/dt = -[1/τ + f(x, I, t)] * x + f(x, I, t)
```

we use the Closed-form Continuous-depth (CfC) solution (parallel, fast, stable):

```python
gate  = sigmoid(-f_tau(x, h))              # time-constant gating
new_h = gate * h + (1 - gate) * f_x(x, h)  # bounded update
```

The sigmoid gate makes each update a convex combination of the old hidden state and the new candidate, so hidden states stay bounded: neither explosion nor collapse is possible by construction.
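Why boundedness falls out for free: the gate lies strictly in (0, 1), so each new state is a convex blend of the previous state and the candidate and can never leave the interval they span. A scalar sketch (function and argument names here are illustrative, not the repo's):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def cfc_update(h: float, f_tau: float, f_x: float) -> float:
    """One CfC step: gate in (0, 1) blends the old state h with the
    candidate f_x, so the result is always between the two."""
    gate = sigmoid(-f_tau)                 # time-constant gating
    return gate * h + (1.0 - gate) * f_x   # bounded (convex) update

# f_tau = 0 gives gate = 0.5, the midpoint of h and f_x:
print(cfc_update(0.0, 0.0, 1.0))  # 0.5
```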
Each LiquidSSM block runs two parallel branches, SelectiveSSM and LiquidCfC. A learnable mixing coefficient α balances them: output = α·SSM + (1-α)·Liquid.
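A scalar sketch of the branch mixing. Storing α as an unconstrained logit and squashing it through a sigmoid is one common way to keep a learnable mixing weight in (0, 1); that parameterization is an assumption here, not necessarily how the repo stores it:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Learnable logit, squashed to (0, 1). Init at 0 -> an even 50/50 blend.
alpha_logit = 0.0
alpha = sigmoid(alpha_logit)

# Stand-in scalar outputs for the two branches of one feature:
ssm_out, liquid_out = 1.2, -0.6
mixed = alpha * ssm_out + (1 - alpha) * liquid_out  # convex blend
```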
| Variant | Params | Image Size | Patch | GPU VRAM (bs=16) | Use Case |
|---|---|---|---|---|---|
| tiny | 5.9M | 128×128 | 4 | ~4 GB | Quick experiments, mobile |
| small | 13.7M | 128×128 | 4 | ~8 GB | Production 128×128 |
| base | 37.6M | 256×256 | 8 | ~12 GB | High quality |
| 512 | 38.1M | 512×512 | 16 | ~14 GB | High resolution |
Open the notebook: LiquidFlow_Training.ipynb
It has interactive widgets for each step of the workflow.
```bash
pip install torch torchvision einops pillow matplotlib tqdm
```

```bash
# Quick test (CIFAR-10 32×32)
python liquidflow/train.py --model_size tiny --img_size 32 --dataset cifar10 --epochs 50 --batch_size 64

# Production (Flowers 128×128)
python liquidflow/train.py --model_size small --img_size 128 --dataset flowers --epochs 200 --batch_size 16

# Custom images
python liquidflow/train.py --model_size small --img_size 128 --dataset folder --data_dir /path/to/images
```
```python
from liquidflow import liquidflow_small, euler_sample, make_grid_image
import torch

model = liquidflow_small(img_size=128)  # 13.7M params
# ... after training ...
model.eval()
images = euler_sample(model, (16, 3, 128, 128), num_steps=50, device='cuda')
grid = make_grid_image(images.clamp(-1, 1) * 0.5 + 0.5, nrow=4)  # [-1, 1] → [0, 1]
grid.save('generated.png')
```
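At heart, Euler sampling is fixed-step integration of dx/dt = v_θ(x, t) from t = 0 (noise) to t = 1 (image). A scalar sketch with a stand-in velocity function (the helper name and signature are illustrative; the repo's `euler_sample` batches tensors on a device and calls the trained network):

```python
def euler_sample_sketch(v, x0: float, num_steps: int = 50) -> float:
    """Fixed-step Euler integration of dx/dt = v(x, t) over t in [0, 1]."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * v(x, t)  # one Euler step along the velocity field
    return x

# With the ideal rectified-flow field v(x, t) = x1 - x0 (a constant),
# Euler integration recovers x1 exactly regardless of step count:
x0, x1 = -1.0, 3.0
out = euler_sample_sketch(lambda x, t: x1 - x0, x0, num_steps=10)
print(out)  # 3.0
```

This is why rectified flows pair well with cheap samplers: the straighter the learned field, the fewer Euler steps are needed.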
```
├── liquidflow/
│   ├── __init__.py            # Package exports
│   ├── model.py               # Core architecture (LiquidFlowNet, LiquidCfCCell, SelectiveSSM)
│   ├── losses.py              # Physics-informed flow matching loss + EMA
│   ├── sampling.py            # Euler & Heun ODE samplers
│   └── train.py               # Full training script with CLI
├── LiquidFlow_Training.ipynb  # Colab/Kaggle notebook
├── smoke_test.py              # Comprehensive CPU test suite (25 tests)
└── README.md
```
L = L_flow + λ_smooth · L_smooth + λ_tv · L_tv
| Term | Formula | Purpose |
|---|---|---|
| L_flow | ‖v_θ(x_t, t) − (x₁ − x₀)‖² | Learn straight-line velocity field |
| L_smooth | ‖∇²x_pred‖² (Laplacian) | Penalize high-frequency noise |
| L_tv | ‖∇x_pred‖₁ (Total Variation) | Edge-preserving smoothness |
Physics loss is warmed up over the first 500 steps.
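A minimal 1D sketch of how the three terms and the warmup could combine (hypothetical signature and default weights; the repo's `losses.py` operates on image tensors and may weight or schedule differently):

```python
def physics_flow_loss(v_pred, v_target, x_pred, step,
                      lam_smooth=0.01, lam_tv=0.01, warmup_steps=500):
    """L = L_flow + w * (lam_smooth * L_smooth + lam_tv * L_tv) on 1D
    signals (lists of floats), with the physics terms linearly warmed
    up over the first `warmup_steps` training steps."""
    n = len(v_pred)
    # Flow matching: MSE against the straight-line target velocity.
    l_flow = sum((p - t) ** 2 for p, t in zip(v_pred, v_target)) / n
    # Laplacian penalty: squared second differences of the prediction.
    l_smooth = sum((x_pred[i - 1] - 2 * x_pred[i] + x_pred[i + 1]) ** 2
                   for i in range(1, len(x_pred) - 1))
    # Total variation: absolute first differences (edge-preserving).
    l_tv = sum(abs(x_pred[i + 1] - x_pred[i]) for i in range(len(x_pred) - 1))
    w = min(1.0, step / warmup_steps)  # linear warmup of the physics terms
    return l_flow + w * (lam_smooth * l_smooth + lam_tv * l_tv)

# At step 0 the physics terms contribute nothing, only L_flow:
print(physics_flow_loss([0.0, 0.0], [1.0, 1.0], [0.0, 1.0, 0.0], step=0))  # 1.0
```

Warming up the regularizers this way lets the velocity field form first, before the smoothness constraints start shaping its outputs.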
| Goal | Dataset | Model | Size | Epochs | Time (T4) |
|---|---|---|---|---|---|
| Sanity check | CIFAR-10 | tiny | 32 | 20 | ~5 min |
| Baseline | CIFAR-10 | tiny | 128 | 100 | ~2 hrs |
| Quality | Flowers-102 | small | 128 | 200 | ~4 hrs |
| Faces | CelebA | small | 128 | 50 | ~6 hrs |
| High-res | CelebA | 512 | 512 | 100 | ~12 hrs |
The notebook includes TorchScript and ONNX export cells. The tiny model produces a ~24MB file for on-device inference.
MIT