# LatentRecurrentFlow (LRF): A Novel Mobile-First Image Generation Architecture

A genuinely new architecture for image generation, designed from scratch to run on consumer devices with 3–4 GB of RAM and to train within a 16 GB memory budget.
## 🔥 v2 Training Results (CIFAR-10)
Trained end-to-end on CIFAR-10 (50K images, 10 classes) using:
- Pre-trained TAESD (2.4M frozen params) as the VAE: f=8 compression, 32×32 → 4×4×4 latents
- 1.47M-parameter denoising core with recursive refinement (4 shared blocks × 2 recursions = 8 effective layers)
- Rectified flow matching with SNR-weighted loss and 10% CFG dropout
- Training: 30 epochs, AdamW with cosine schedule, EMA decay 0.999
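The recipe above (rectified flow, SNR-weighted loss, 10% CFG dropout) can be sketched as a single batch-construction step. This is an illustrative NumPy sketch, not the repo's `train_v2.py`: the function name `rectified_flow_batch`, the `null_label` token, and the exact SNR weighting are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_batch(z0, labels, null_label=10, cfg_drop=0.1):
    """Build one training batch: linear interpolation to noise, velocity
    target, 10% CFG label dropout, and an SNR-style loss weight.
    The exact weighting used in this repo may differ; this is a sketch."""
    B = z0.shape[0]
    t = rng.uniform(1e-3, 1 - 1e-3, size=(B, 1, 1, 1))  # per-sample timestep
    eps = rng.standard_normal(z0.shape)                  # Gaussian noise
    z_t = (1 - t) * z0 + t * eps                         # rectified-flow path
    target = eps - z0                                    # velocity target
    drop = rng.uniform(size=B) < cfg_drop                # 10% CFG dropout
    labels = np.where(drop, null_label, labels)          # swap in null token
    snr = ((1 - t) / t) ** 2                             # signal-to-noise ratio
    weight = snr / (snr + 1)                             # bounded SNR weight
    return z_t, target, labels, weight.reshape(B)

z0 = rng.standard_normal((8, 4, 4, 4))
z_t, target, labels, weight = rectified_flow_batch(z0, np.zeros(8, dtype=int))
```

The model's MSE between predicted and target velocity would then be scaled by `weight` before averaging.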
| Metric | Value |
|---|---|
| Final Loss | 0.931 |
| Training Time | ~70 min (CPU only!) |
| VAE Recon MSE | 0.068 |
| All 10 classes produce colorful images | ✅ |
### Sample Outputs
VAE Reconstruction (top: original, bottom: TAESD reconstruction):
Training progression (epoch 5 → 30):
Class-conditional generation (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck):
Loss curve:
### Validation: No Grey Images
Every class produces images with proper variance:
```
airplane   : std=0.383, range=1.908 ✅
automobile : std=0.448, range=2.000 ✅
bird       : std=0.341, range=1.663 ✅
cat        : std=0.521, range=2.000 ✅
deer       : std=0.401, range=1.869 ✅
dog        : std=0.477, range=1.994 ✅
frog       : std=0.366, range=1.996 ✅
horse      : std=0.499, range=1.972 ✅
ship       : std=0.448, range=1.786 ✅
truck      : std=0.510, range=1.944 ✅
```
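A check like the one above can be reproduced with a few lines of NumPy. The helper below is an illustrative sketch (the function name `grey_image_check` and the thresholds are assumptions, not the repo's actual validation code); it flags near-constant outputs by their per-image std and value range, assuming images in [-1, 1].

```python
import numpy as np

def grey_image_check(images, std_min=0.1, range_min=0.5):
    """Flag near-constant ('grey') generations. Expects images in [-1, 1];
    the thresholds here are illustrative, not the repo's exact values."""
    flat = images.reshape(len(images), -1)
    std = float(flat.std(axis=1).mean())
    value_range = float((flat.max(axis=1) - flat.min(axis=1)).mean())
    return std >= std_min and value_range >= range_min, std, value_range

rng = np.random.default_rng(0)
ok, std, value_range = grey_image_check(rng.uniform(-1, 1, (4, 3, 32, 32)))
grey_ok, _, _ = grey_image_check(np.zeros((4, 3, 32, 32)))  # all-grey batch
```

A batch of constant images fails the check, while any batch with healthy pixel variance passes.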
## Architecture Overview
LRF combines five key innovations into a single coherent architecture:
| Innovation | Source Inspiration | What It Does |
|---|---|---|
| Recursive Latent Refinement (RLR) | HRM/TRM (2025) | Iterative fixed-point reasoning with O(1) memory backprop |
| Efficient Spatial Mixer | ViG/GLA + DyDiLA | Attention + DW-Conv locality (adapts to sequence length) |
| Pre-trained TAESD VAE | madebyollin/taesd | f=8 compression, 2.4M params, works out-of-box |
| Rectified Flow objective | SD3 / Liu et al. | Clean linear ODE for training and few-step sampling |
| Additive Image Conditioning | OmniGen | Same core supports text-to-image AND editing |
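The first innovation in the table, recursive latent refinement, amounts to reusing a small stack of shared blocks several times. A minimal NumPy sketch, with tiny residual linear maps standing in for the real attention + DW-conv mixer blocks (names and shapes here are illustrative, not the repo's `RecursiveLatentCore`):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16
# Four shared "blocks" (toy stand-ins for the real mixer blocks)
blocks = [0.05 * rng.standard_normal((D, D)) for _ in range(4)]

def recursive_core(z, n_recursions=2):
    """Apply 4 shared blocks n_recursions times: 8 effective layers with
    parameters for only 4. Residual updates keep the iteration stable."""
    for _ in range(n_recursions):
        for W in blocks:
            z = z + np.tanh(z @ W)  # residual update per shared block
    return z

z = rng.standard_normal((2, D))
refined = recursive_core(z)
```

Doubling `n_recursions` doubles effective depth at zero parameter cost, which is the core of the mobile-first efficiency argument.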
## v2 Architecture (Trained & Validated)
| Component | Parameters | Description |
|---|---|---|
| TAESD VAE (frozen) | 2.4M | Pre-trained image encoder/decoder |
| Denoising Core | 1.47M | 4 shared blocks × 2 inner recursions |
| Class Conditioner | 1.4K | Learned class embeddings for CIFAR-10 |
| **Trainable Total** | **1.47M** | |
## How It Works

```python
# 1. Encode image to latent (TAESD, frozen)
z_0 = vae.encode(image)           # [B, 4, 4, 4]

# 2. Add noise (rectified flow)
z_t = (1 - t) * z_0 + t * noise   # linear interpolation

# 3. Predict velocity (recursive denoising core)
v = core(z_t, t, class_label)     # 4 blocks × 2 recursions

# 4. Training target
loss = MSE(v, noise - z_0)        # velocity matching

# 5. Sampling (Euler ODE solver, t = 1 → 0)
for step in timesteps:
    v = core(z, t, class_label)
    z = z - dt * v

# 6. Decode to image (TAESD, frozen)
image = vae.decode(z)
```
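The sampling loop in step 5 can be made concrete with a self-contained Euler integrator. This is a NumPy sketch: `velocity_fn` stands in for the trained core plus conditioning, and the toy linear field used below is only for demonstration.

```python
import numpy as np

def euler_sample(velocity_fn, shape, num_steps=50, seed=0):
    """Integrate dz/dt = v from t=1 (pure noise) down to t=0 (data).
    velocity_fn stands in for the trained core plus conditioning."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)      # start from noise at t = 1
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        z = z - dt * velocity_fn(z, t)  # one Euler step toward t = 0
    return z

# Toy linear field v(z, t) = z contracts the noise toward the origin
samples = euler_sample(lambda z, t: z, (4, 4, 4, 4))
```

With the real model, `velocity_fn` would also apply classifier-free guidance by mixing conditional and unconditional predictions.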
## Quick Start

**Generate from trained model:**

```python
import torch
from lrf.model_v2 import LRFv2, RectifiedFlowScheduler
from diffusers import AutoencoderTiny

# Load frozen VAE and trained checkpoint
vae = AutoencoderTiny.from_pretrained('madebyollin/taesd')
ckpt = torch.load('trained/cifar10_checkpoint.pt', map_location='cpu', weights_only=False)
model = LRFv2(ckpt['config'])
for name, p in model.named_parameters():
    p.data.copy_(ckpt['ema_params'][name])   # load EMA weights
model.eval()

# Generate (class 3 = cat)
scheduler = RectifiedFlowScheduler()
labels = torch.full((4,), 3, dtype=torch.long)
z = scheduler.sample(model, (4, 4, 4, 4), labels, num_steps=50, cfg_scale=3.0)
images = vae.decode(z).sample.clamp(-1, 1)
```
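To write the decoded batch to disk, a small conversion helper is handy. This is not part of the repo; it assumes decoded images in [-1, 1] with NCHW layout and produces uint8 HWC arrays suitable for `PIL.Image.fromarray`:

```python
import numpy as np

def to_uint8(images):
    """Map decoded images in [-1, 1], shape (B, C, H, W), to uint8
    (B, H, W, C) arrays suitable for PIL.Image.fromarray."""
    arr = np.asarray(images, dtype=np.float64)
    arr = np.clip((arr + 1.0) * 127.5, 0, 255).astype(np.uint8)
    return arr.transpose(0, 2, 3, 1)

frames = to_uint8(np.zeros((4, 3, 32, 32)))  # mid-grey test input
```

With torch tensors, call `.cpu().numpy()` on the detached output before passing it in.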
**Train from scratch:**

```bash
python lrf/train_v2.py
```
## Files

| File | Description |
|---|---|
| `lrf/model_v2.py` | Core architecture (EfficientSpatialMixer, RecursiveLatentCore, LRFv2) |
| `lrf/train_v2.py` | CIFAR-10 training pipeline with TAESD VAE |
| `trained/cifar10_checkpoint.pt` | Trained weights (30 epochs, EMA) |
| `trained/config.json` | Model configuration |
| `samples/` | Generated sample images at various epochs |
| `lrf/model.py` | v1 architecture (research prototype) |
| `lrf/training.py` | v1 training pipeline |
| `lrf/pipeline.py` | HF-compatible inference pipeline |
| `notebook.ipynb` | Interactive walkthrough |
## Training Curriculum (Full Scale)
| Stage | Resolution | Data | Freeze | Train | LR | Steps |
|---|---|---|---|---|---|---|
| 1. VAE | 256² | ImageNet/COCO | - | VAE | 1e-4 | 50K |
| 2. Flow (low) | 64² | LAION-aesthetic | VAE | Core+Text | 1e-4 | 100K |
| 3. Flow (mid) | 256² | Filtered LAION | VAE | Core+Text | 5e-5 | 200K |
| 4. Flow (high) | 512² | Curated+JourneyDB | VAE | Core+Text | 2e-5 | 100K |
| 5. Distill | 512² | Same as 4 | VAE+Text | Core | 1e-5 | 50K |
| 6. Editing | 512² | InstructPix2Pix | VAE | Core+Text | 1e-5 | 50K |
**Shortcut (proven in this repo):** Skip Stage 1 entirely by using the pre-trained TAESD, and start directly at Stage 2.
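One way to drive such a curriculum is a plain list of stage configs, iterated in order. The dict keys and the `run_curriculum` helper below are hypothetical; the values are copied from the table above.

```python
# Stage configs mirroring the curriculum table; keys are illustrative
CURRICULUM = [
    {"stage": "vae",       "res": 256, "train": "vae",       "lr": 1e-4, "steps": 50_000},
    {"stage": "flow_low",  "res": 64,  "train": "core+text", "lr": 1e-4, "steps": 100_000},
    {"stage": "flow_mid",  "res": 256, "train": "core+text", "lr": 5e-5, "steps": 200_000},
    {"stage": "flow_high", "res": 512, "train": "core+text", "lr": 2e-5, "steps": 100_000},
    {"stage": "distill",   "res": 512, "train": "core",      "lr": 1e-5, "steps": 50_000},
    {"stage": "editing",   "res": 512, "train": "core+text", "lr": 1e-5, "steps": 50_000},
]

def run_curriculum(stages, skip_vae=True):
    """With a pre-trained TAESD, drop stage 1 and start at the low-res
    flow stage (the shortcut used in this repo)."""
    return [s["stage"] for s in stages if not (skip_vae and s["stage"] == "vae")]

order = run_curriculum(CURRICULUM)
```

The shortcut simply removes the first entry; everything downstream is unchanged.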
## Relevant Papers (Grouped by Problem)

### Subquadratic Spatial Mixing
- PDE-SSM-DiT (2603.13663): O(N log N) via Fourier PDE, 34× speedup
- DiMSUM (2411.04168): Mamba + wavelet, FID 2.11
- ViG/GLA (2405.18425): Gated Linear Attention, 90% memory savings
- DyDiLA (2601.13683): Dynamic differential linear attention
### Recursive Reasoning
- HRM (2506.21734): Fixed-point recurrence, O(1) memory via IFT
- TRM (2510.04871): 7M params → 45% ARC-AGI-1
### Compact Latent Spaces
- SANA DC-AE (2410.10629): f=32, PSNR 29.29
- SnapGen (2412.09619): 1.38M tiny decoder
- TAESD (madebyollin): 2.4M params, f=8, works immediately
### Few-Step Generation
- Consistency Models (2303.01469): One-step from diffusion
- LCM (2310.04378): 2-4 step via consistency distillation
### Editing Architectures
- OmniGen (2409.11340): Unified generation + editing
- InstructPix2Pix (2211.09800): Text-guided editing
## License
Apache 2.0