LatentRecurrentFlow (LRF) β€” A Novel Mobile-First Image Generation Architecture

A new image generation architecture designed from scratch to run on consumer devices with 3–4 GB of RAM and to train within a 16 GB memory budget.

πŸ”₯ v2 Training Results (CIFAR-10)

Trained on CIFAR-10 (50K images, 10 classes), with the VAE kept frozen, using:

  • Pre-trained TAESD (2.4M frozen params) as the VAE β€” f=8 compression, 32Γ—32 β†’ 4Γ—4Γ—4 latents
  • 1.47M parameter denoising core with recursive refinement (4 shared blocks Γ— 2 recursions = 8 effective layers)
  • Rectified flow matching with SNR-weighted loss and 10% CFG dropout
  • Training: 30 epochs, AdamW with cosine schedule, EMA decay 0.999
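The recipe above (rectified flow target, SNR-style loss weighting, 10% CFG dropout) can be sketched as a single training step. This is an illustrative sketch, not the repo's actual API: `core`, `null_label`, and the specific weighting function are assumptions.

```python
import torch

def training_step(core, z0, labels, null_label, p_drop=0.1):
    """One rectified-flow training step with CFG label dropout.
    `core` and `null_label` are placeholders for the repo's denoiser and
    unconditional class id; the SNR-style weighting is an assumption."""
    B = z0.shape[0]
    t = torch.rand(B, device=z0.device)                  # uniform timesteps in [0, 1]
    t_ = t.view(B, 1, 1, 1)
    noise = torch.randn_like(z0)
    z_t = (1 - t_) * z0 + t_ * noise                     # linear interpolation
    # classifier-free guidance dropout: replace ~10% of labels with the null id
    drop = torch.rand(B, device=z0.device) < p_drop
    labels = torch.where(drop, torch.full_like(labels, null_label), labels)
    v = core(z_t, t, labels)                             # predicted velocity
    target = noise - z0                                  # rectified-flow target
    # SNR-style weighting (assumption): down-weight near-pure-noise timesteps
    w = (1 - t_).clamp(min=0.1)
    return (w * (v - target) ** 2).mean()
```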
| Metric | Value |
| --- | --- |
| Final Loss | 0.931 |
| Training Time | ~70 min (CPU only!) |
| VAE Recon MSE | 0.068 |
| All 10 classes produce colorful images | ✅ |

Sample Outputs

VAE Reconstruction (top: original, bottom: TAESD reconstruction):

[Image: VAE reconstruction grid]

Training progression (epoch 5 β†’ 30):

[Images: samples at epoch 5 and epoch 30]

Class-conditional generation (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck):

[Image: final class-conditional samples]

Loss curve:

[Image: training loss curve]

Validation: No Grey Images

Every class produces images with proper variance:

airplane    : std=0.383, range=1.908 βœ…
automobile  : std=0.448, range=2.000 βœ…
bird        : std=0.341, range=1.663 βœ…
cat         : std=0.521, range=2.000 βœ…
deer        : std=0.401, range=1.869 βœ…
dog         : std=0.477, range=1.994 βœ…
frog        : std=0.366, range=1.996 βœ…
horse       : std=0.499, range=1.972 βœ…
ship        : std=0.448, range=1.786 βœ…
truck       : std=0.510, range=1.944 βœ…
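The per-class check above can be reproduced with a few lines. This is a minimal sketch: `decoded` stands for a batch of decoded images in [-1, 1] for one class, and the collapse thresholds are an assumption.

```python
import torch

def grey_image_stats(decoded: torch.Tensor):
    """Collapse check for a batch of decoded images in [-1, 1]:
    a batch that collapsed to grey has near-zero std and a tiny value range."""
    std = decoded.std().item()
    rng = (decoded.max() - decoded.min()).item()
    ok = std > 0.1 and rng > 0.5   # thresholds are an assumption
    return std, rng, ok
```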

Architecture Overview

LRF combines five key innovations into a single coherent architecture:

| Innovation | Source Inspiration | What It Does |
| --- | --- | --- |
| Recursive Latent Refinement (RLR) | HRM/TRM (2025) | Iterative fixed-point reasoning with O(1)-memory backprop |
| Efficient Spatial Mixer | ViG/GLA + DyDiLA | Attention + DW-conv locality (adapts to sequence length) |
| Pre-trained TAESD VAE | madebyollin/taesd | f=8 compression, 2.4M params, works out of the box |
| Rectified Flow objective | SD3 / Liu et al. | Clean linear ODE for training and few-step sampling |
| Additive Image Conditioning | OmniGen | Same core supports text-to-image AND editing |
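The Recursive Latent Refinement idea can be sketched as a small stack of weight-shared blocks reused for several inner recursions, with gradients kept only through the final pass (a TRM-style approximation of O(1)-memory backprop; the module names and block design here are illustrative, not the repo's actual code).

```python
import torch
import torch.nn as nn

class RecursiveCore(nn.Module):
    """Weight-shared blocks applied for `recursions` passes:
    4 blocks x 2 recursions = 8 effective layers with 4 blocks' parameters."""
    def __init__(self, dim=64, n_blocks=4, recursions=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(n_blocks)
        )
        self.recursions = recursions

    def forward(self, z):
        for r in range(self.recursions):
            if r < self.recursions - 1:
                # earlier recursions run without grad, approximating
                # O(1)-memory backprop through the fixed-point iteration
                with torch.no_grad():
                    for blk in self.blocks:
                        z = z + blk(z)       # residual refinement
            else:
                for blk in self.blocks:
                    z = z + blk(z)
        return z
```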

v2 Architecture (Trained & Validated)

| Component | Parameters | Description |
| --- | --- | --- |
| TAESD VAE (frozen) | 2.4M | Pre-trained image encoder/decoder |
| Denoising Core | 1.47M | 4 shared blocks × 2 inner recursions |
| Class Conditioner | 1.4K | Learned class embeddings for CIFAR-10 |
| Trainable Total | 1.47M | |
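Totals like the ones above can be verified with a standard one-liner over `model.parameters()` (generic PyTorch, not repo-specific code):

```python
import torch.nn as nn

def count_trainable(model: nn.Module) -> int:
    """Number of parameters that receive gradients (frozen modules excluded)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Running this on the denoising core should report ~1.47M, while the frozen TAESD VAE contributes zero trainable parameters.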

How It Works

# 1. Encode image to latent (TAESD, frozen)
z_0 = vae.encode(image)                    # [B, 4, 4, 4]

# 2. Add noise (rectified flow)
z_t = (1-t) * z_0 + t * noise              # Linear interpolation

# 3. Predict velocity (recursive denoising core)
v = core(z_t, t, class_label)              # 4 blocks Γ— 2 recursions

# 4. Training target
loss = MSE(v, noise - z_0)                 # Velocity matching

# 5. Sampling (Euler ODE solver, t=1β†’0)
for step in timesteps:
    v = core(z, t, class_label)
    z = z - dt * v

# 6. Decode to image (TAESD, frozen)
image = vae.decode(z)
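With classifier-free guidance (used at sampling time with `cfg_scale=3.0` in the Quick Start below), step 5 becomes a guided Euler loop. This sketch assumes a `core(z, t, labels)` callable and a reserved null class id; it is not the repo's `RectifiedFlowScheduler` implementation.

```python
import torch

def sample_cfg(core, shape, labels, null_label, num_steps=50, cfg_scale=3.0):
    """Euler ODE solver from t=1 (pure noise) to t=0 with CFG."""
    z = torch.randn(shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), 1.0 - i * dt)
        v_cond = core(z, t, labels)
        v_uncond = core(z, t, torch.full_like(labels, null_label))
        v = v_uncond + cfg_scale * (v_cond - v_uncond)   # guided velocity
        z = z - dt * v                                   # Euler step toward t=0
    return z
```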

Quick Start

Generate from trained model:

import torch
from lrf.model_v2 import LRFv2, RectifiedFlowScheduler
from diffusers import AutoencoderTiny

# Load
vae = AutoencoderTiny.from_pretrained('madebyollin/taesd')
ckpt = torch.load('trained/cifar10_checkpoint.pt', map_location='cpu', weights_only=False)
model = LRFv2(ckpt['config'])
for name, p in model.named_parameters():
    p.data.copy_(ckpt['ema_params'][name])
model.eval()

# Generate (class 3 = cat)
scheduler = RectifiedFlowScheduler()
labels = torch.full((4,), 3, dtype=torch.long)
z = scheduler.sample(model, (4,4,4,4), labels, num_steps=50, cfg_scale=3.0)
images = vae.decode(z).sample.clamp(-1, 1)
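To save the generated batch, the clamped [-1, 1] tensor can be mapped to uint8 images (a common convention, shown here as a sketch assuming `images` has shape [B, 3, H, W]):

```python
import torch

def to_uint8(images: torch.Tensor) -> torch.Tensor:
    """Map decoded images from [-1, 1] to [0, 255] uint8 in NHWC layout,
    ready for PIL.Image.fromarray or an image writer."""
    images = (images.clamp(-1, 1) + 1) * 127.5
    return images.round().to(torch.uint8).permute(0, 2, 3, 1)
```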

Train from scratch:

python lrf/train_v2.py

Files

| File | Description |
| --- | --- |
| `lrf/model_v2.py` | Core architecture (EfficientSpatialMixer, RecursiveLatentCore, LRFv2) |
| `lrf/train_v2.py` | CIFAR-10 training pipeline with TAESD VAE |
| `trained/cifar10_checkpoint.pt` | Trained weights (30 epochs, EMA) |
| `trained/config.json` | Model configuration |
| `samples/` | Generated sample images at various epochs |
| `lrf/model.py` | v1 architecture (research prototype) |
| `lrf/training.py` | v1 training pipeline |
| `lrf/pipeline.py` | HF-compatible inference pipeline |
| `notebook.ipynb` | Interactive walkthrough |

Training Curriculum (Full Scale)

| Stage | Resolution | Data | Freeze | Train | LR | Steps |
| --- | --- | --- | --- | --- | --- | --- |
| 1. VAE | 256² | ImageNet/COCO | – | VAE | 1e-4 | 50K |
| 2. Flow (low) | 64² | LAION-aesthetic | VAE | Core+Text | 1e-4 | 100K |
| 3. Flow (mid) | 256² | Filtered LAION | VAE | Core+Text | 5e-5 | 200K |
| 4. Flow (high) | 512² | Curated+JourneyDB | VAE | Core+Text | 2e-5 | 100K |
| 5. Distill | 512² | Same as 4 | VAE+Text | Core | 1e-5 | 50K |
| 6. Editing | 512² | InstructPix2Pix | VAE | Core+Text | 1e-5 | 50K |

Shortcut (proven in this repo): Skip Stage 1 entirely by using pre-trained TAESD. Start directly at Stage 2.


Relevant Papers (Grouped by Problem)

Subquadratic Spatial Mixing

  • PDE-SSM-DiT (2603.13663): O(N log N) via Fourier PDE, 34Γ— speedup
  • DiMSUM (2411.04168): Mamba + wavelet, FID 2.11
  • ViG/GLA (2405.18425): Gated Linear Attention, 90% memory savings
  • DyDiLA (2601.13683): Dynamic differential linear attention

Recursive Reasoning

  • HRM (2506.21734): Fixed-point recurrence, O(1) memory via IFT
  • TRM (2510.04871): 7M params β†’ 45% ARC-AGI-1

Compact Latent Spaces

  • SANA DC-AE (2410.10629): f=32, PSNR 29.29
  • SnapGen (2412.09619): 1.38M tiny decoder
  • TAESD (madebyollin): 2.4M params, f=8, works immediately

Few-Step Generation

  • Consistency Models (2303.01469): One-step from diffusion
  • LCM (2310.04378): 2-4 step via consistency distillation

Editing Architectures

  • OmniGen (2409.11340): Unified generation + editing
  • InstructPix2Pix (2211.09800): Text-guided editing

License

Apache 2.0
