# LatentRecurrentFlow (LRF): A Novel Mobile-First Image Generation Architecture

A genuinely new architecture for image generation, designed from scratch to run on consumer devices with 3–4 GB of RAM and to train within a 16 GB memory budget.
## 🔥 v2 Training Results (CIFAR-10)
Trained end-to-end on CIFAR-10 (50K images, 10 classes) using:
- Pre-trained TAESD (2.4M frozen params) as the VAE: f=8 compression, 32×32 → 4×4×4 latents
- 1.47M-parameter denoising core with recursive refinement (4 shared blocks × 2 recursions = 8 effective layers)
- Rectified flow matching with SNR-weighted loss and 10% CFG dropout
- Training: 30 epochs, AdamW with cosine schedule, EMA decay 0.999
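The recipe above (rectified flow, SNR-weighted loss, 10% CFG dropout) can be sketched as a single batch-construction step. This is an illustrative NumPy sketch, not the repo's `train_v2.py`: the function name `rectified_flow_batch`, the `null_label` token, and the exact SNR weighting are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_batch(z0, labels, null_label=10, cfg_drop=0.1):
    """Build one training batch: linear interpolation to noise, velocity
    target, 10% CFG label dropout, and an SNR-style loss weight.
    The exact weighting used in this repo may differ; this is a sketch."""
    B = z0.shape[0]
    t = rng.uniform(1e-3, 1 - 1e-3, size=(B, 1, 1, 1))  # per-sample timestep
    eps = rng.standard_normal(z0.shape)                  # Gaussian noise
    z_t = (1 - t) * z0 + t * eps                         # rectified-flow path
    target = eps - z0                                    # velocity target
    drop = rng.uniform(size=B) < cfg_drop                # 10% CFG dropout
    labels = np.where(drop, null_label, labels)          # swap in null token
    snr = ((1 - t) / t) ** 2                             # signal-to-noise ratio
    weight = snr / (snr + 1)                             # bounded SNR weight
    return z_t, target, labels, weight.reshape(B)

z0 = rng.standard_normal((8, 4, 4, 4))
z_t, target, labels, weight = rectified_flow_batch(z0, np.zeros(8, dtype=int))
```

The model's MSE between predicted and target velocity would then be scaled by `weight` before averaging.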
| Metric | Value |
|---|---|
| Final Loss | 0.931 |
| Training Time | ~70 min (CPU only!) |
| VAE Recon MSE | 0.068 |
| All 10 classes produce colorful images | ✅ |
### Sample Outputs
VAE Reconstruction (top: original, bottom: TAESD reconstruction):
Training progression (epoch 5 → 30):
Class-conditional generation (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck):
Loss curve:
### Validation: No Grey Images
Every class produces images with proper variance:
```
airplane   : std=0.383, range=1.908 ✅
automobile : std=0.448, range=2.000 ✅
bird       : std=0.341, range=1.663 ✅
cat        : std=0.521, range=2.000 ✅
deer       : std=0.401, range=1.869 ✅
dog        : std=0.477, range=1.994 ✅
frog       : std=0.366, range=1.996 ✅
horse      : std=0.499, range=1.972 ✅
ship       : std=0.448, range=1.786 ✅
truck      : std=0.510, range=1.944 ✅
```
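A check like the one above can be reproduced with a few lines of NumPy. The helper below is an illustrative sketch (the function name `grey_image_check` and the thresholds are assumptions, not the repo's actual validation code); it flags near-constant outputs by their per-image std and value range, assuming images in [-1, 1].

```python
import numpy as np

def grey_image_check(images, std_min=0.1, range_min=0.5):
    """Flag near-constant ('grey') generations. Expects images in [-1, 1];
    the thresholds here are illustrative, not the repo's exact values."""
    flat = images.reshape(len(images), -1)
    std = float(flat.std(axis=1).mean())
    value_range = float((flat.max(axis=1) - flat.min(axis=1)).mean())
    return std >= std_min and value_range >= range_min, std, value_range

rng = np.random.default_rng(0)
ok, std, value_range = grey_image_check(rng.uniform(-1, 1, (4, 3, 32, 32)))
grey_ok, _, _ = grey_image_check(np.zeros((4, 3, 32, 32)))  # all-grey batch
```

A batch of constant images fails the check, while any batch with healthy pixel variance passes.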
## Architecture Overview
LRF combines five key innovations into a single coherent architecture:
| Innovation | Source Inspiration | What It Does |
|---|---|---|
| Recursive Latent Refinement (RLR) | HRM/TRM (2025) | Iterative fixed-point reasoning with O(1) memory backprop |
| Efficient Spatial Mixer | ViG/GLA + DyDiLA | Attention + DW-Conv locality (adapts to sequence length) |
| Pre-trained TAESD VAE | madebyollin/taesd | f=8 compression, 2.4M params, works out-of-box |
| Rectified Flow objective | SD3 / Liu et al. | Clean linear ODE for training and few-step sampling |
| Additive Image Conditioning | OmniGen | Same core supports text-to-image AND editing |
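The first innovation in the table, recursive latent refinement, amounts to reusing a small stack of shared blocks several times. A minimal NumPy sketch, with tiny residual linear maps standing in for the real attention + DW-conv mixer blocks (names and shapes here are illustrative, not the repo's `RecursiveLatentCore`):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16
# Four shared "blocks" (toy stand-ins for the real mixer blocks)
blocks = [0.05 * rng.standard_normal((D, D)) for _ in range(4)]

def recursive_core(z, n_recursions=2):
    """Apply 4 shared blocks n_recursions times: 8 effective layers with
    parameters for only 4. Residual updates keep the iteration stable."""
    for _ in range(n_recursions):
        for W in blocks:
            z = z + np.tanh(z @ W)  # residual update per shared block
    return z

z = rng.standard_normal((2, D))
refined = recursive_core(z)
```

Doubling `n_recursions` doubles effective depth at zero parameter cost, which is the core of the mobile-first efficiency argument.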
## v2 Architecture (Trained & Validated)
| Component | Parameters | Description |
|---|---|---|
| TAESD VAE (frozen) | 2.4M | Pre-trained image encoder/decoder |
| Denoising Core | 1.47M | 4 shared blocks × 2 inner recursions |
| Class Conditioner | 1.4K | Learned class embeddings for CIFAR-10 |
| **Trainable Total** | **1.47M** | |
## How It Works

```python
# 1. Encode image to latent (TAESD, frozen)
z_0 = vae.encode(image)           # [B, 4, 4, 4]

# 2. Add noise (rectified flow)
z_t = (1 - t) * z_0 + t * noise   # linear interpolation

# 3. Predict velocity (recursive denoising core)
v = core(z_t, t, class_label)     # 4 blocks × 2 recursions

# 4. Training target
loss = MSE(v, noise - z_0)        # velocity matching

# 5. Sampling (Euler ODE solver, t = 1 → 0)
for step in timesteps:
    v = core(z, t, class_label)
    z = z - dt * v

# 6. Decode to image (TAESD, frozen)
image = vae.decode(z)
```
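The sampling loop in step 5 can be made concrete with a self-contained Euler integrator. This is a NumPy sketch: `velocity_fn` stands in for the trained core plus conditioning, and the toy linear field used below is only for demonstration.

```python
import numpy as np

def euler_sample(velocity_fn, shape, num_steps=50, seed=0):
    """Integrate dz/dt = v from t=1 (pure noise) down to t=0 (data).
    velocity_fn stands in for the trained core plus conditioning."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)      # start from noise at t = 1
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        z = z - dt * velocity_fn(z, t)  # one Euler step toward t = 0
    return z

# Toy linear field v(z, t) = z contracts the noise toward the origin
samples = euler_sample(lambda z, t: z, (4, 4, 4, 4))
```

With the real model, `velocity_fn` would also apply classifier-free guidance by mixing conditional and unconditional predictions.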
## Quick Start

**Generate from trained model:**

```python
import torch
from lrf.model_v2 import LRFv2, RectifiedFlowScheduler
from diffusers import AutoencoderTiny

# Load frozen VAE and trained checkpoint
vae = AutoencoderTiny.from_pretrained('madebyollin/taesd')
ckpt = torch.load('trained/cifar10_checkpoint.pt', map_location='cpu', weights_only=False)
model = LRFv2(ckpt['config'])
for name, p in model.named_parameters():
    p.data.copy_(ckpt['ema_params'][name])   # load EMA weights
model.eval()

# Generate (class 3 = cat)
scheduler = RectifiedFlowScheduler()
labels = torch.full((4,), 3, dtype=torch.long)
z = scheduler.sample(model, (4, 4, 4, 4), labels, num_steps=50, cfg_scale=3.0)
images = vae.decode(z).sample.clamp(-1, 1)
```
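To write the decoded batch to disk, a small conversion helper is handy. This is not part of the repo; it assumes decoded images in [-1, 1] with NCHW layout and produces uint8 HWC arrays suitable for `PIL.Image.fromarray`:

```python
import numpy as np

def to_uint8(images):
    """Map decoded images in [-1, 1], shape (B, C, H, W), to uint8
    (B, H, W, C) arrays suitable for PIL.Image.fromarray."""
    arr = np.asarray(images, dtype=np.float64)
    arr = np.clip((arr + 1.0) * 127.5, 0, 255).astype(np.uint8)
    return arr.transpose(0, 2, 3, 1)

frames = to_uint8(np.zeros((4, 3, 32, 32)))  # mid-grey test input
```

With torch tensors, call `.cpu().numpy()` on the detached output before passing it in.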
**Train from scratch:**

```bash
python lrf/train_v2.py
```
## Files

| File | Description |
|---|---|
| `lrf/model_v2.py` | Core architecture (EfficientSpatialMixer, RecursiveLatentCore, LRFv2) |
| `lrf/train_v2.py` | CIFAR-10 training pipeline with TAESD VAE |
| `trained/cifar10_checkpoint.pt` | Trained weights (30 epochs, EMA) |
| `trained/config.json` | Model configuration |
| `samples/` | Generated sample images at various epochs |
| `lrf/model.py` | v1 architecture (research prototype) |
| `lrf/training.py` | v1 training pipeline |
| `lrf/pipeline.py` | HF-compatible inference pipeline |
| `notebook.ipynb` | Interactive walkthrough |
## Training Curriculum (Full Scale)
| Stage | Resolution | Data | Freeze | Train | LR | Steps |
|---|---|---|---|---|---|---|
| 1. VAE | 256² | ImageNet/COCO | - | VAE | 1e-4 | 50K |
| 2. Flow (low) | 64² | LAION-aesthetic | VAE | Core+Text | 1e-4 | 100K |
| 3. Flow (mid) | 256² | Filtered LAION | VAE | Core+Text | 5e-5 | 200K |
| 4. Flow (high) | 512² | Curated+JourneyDB | VAE | Core+Text | 2e-5 | 100K |
| 5. Distill | 512² | Same as 4 | VAE+Text | Core | 1e-5 | 50K |
| 6. Editing | 512² | InstructPix2Pix | VAE | Core+Text | 1e-5 | 50K |
**Shortcut (proven in this repo):** Skip Stage 1 entirely by using the pre-trained TAESD, and start directly at Stage 2.
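One way to drive such a curriculum is a plain list of stage configs, iterated in order. The dict keys and the `run_curriculum` helper below are hypothetical; the values are copied from the table above.

```python
# Stage configs mirroring the curriculum table; keys are illustrative
CURRICULUM = [
    {"stage": "vae",       "res": 256, "train": "vae",       "lr": 1e-4, "steps": 50_000},
    {"stage": "flow_low",  "res": 64,  "train": "core+text", "lr": 1e-4, "steps": 100_000},
    {"stage": "flow_mid",  "res": 256, "train": "core+text", "lr": 5e-5, "steps": 200_000},
    {"stage": "flow_high", "res": 512, "train": "core+text", "lr": 2e-5, "steps": 100_000},
    {"stage": "distill",   "res": 512, "train": "core",      "lr": 1e-5, "steps": 50_000},
    {"stage": "editing",   "res": 512, "train": "core+text", "lr": 1e-5, "steps": 50_000},
]

def run_curriculum(stages, skip_vae=True):
    """With a pre-trained TAESD, drop stage 1 and start at the low-res
    flow stage (the shortcut used in this repo)."""
    return [s["stage"] for s in stages if not (skip_vae and s["stage"] == "vae")]

order = run_curriculum(CURRICULUM)
```

The shortcut simply removes the first entry; everything downstream is unchanged.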
## Relevant Papers (Grouped by Problem)

### Subquadratic Spatial Mixing
- PDE-SSM-DiT (2603.13663): O(N log N) via Fourier PDE, 34× speedup
- DiMSUM (2411.04168): Mamba + wavelet, FID 2.11
- ViG/GLA (2405.18425): Gated Linear Attention, 90% memory savings
- DyDiLA (2601.13683): Dynamic differential linear attention
### Recursive Reasoning
- HRM (2506.21734): Fixed-point recurrence, O(1) memory via IFT
- TRM (2510.04871): 7M params → 45% ARC-AGI-1
### Compact Latent Spaces
- SANA DC-AE (2410.10629): f=32, PSNR 29.29
- SnapGen (2412.09619): 1.38M tiny decoder
- TAESD (madebyollin): 2.4M params, f=8, works immediately
### Few-Step Generation
- Consistency Models (2303.01469): One-step from diffusion
- LCM (2310.04378): 2-4 step via consistency distillation
### Editing Architectures
- OmniGen (2409.11340): Unified generation + editing
- InstructPix2Pix (2211.09800): Text-guided editing
## License
Apache 2.0