RationalRewards: Reasoning Rewards Scale Visual Generation at Both Training and Test Time
Abstract
Training reward models to generate multi-dimensional critiques improves visual generation through both enhanced reinforcement learning rewards and test-time refinement loops, achieving state-of-the-art performance with reduced training data requirements.
Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
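The test-time Generate-Critique-Refine loop can be sketched concretely. The Python below is a minimal illustration under assumed interfaces, not the authors' released implementation: `generate`, `critique`, and `revise` are hypothetical stand-ins for a frozen image generator, the RationalRewards critic (returning a structured rationale plus an overall score), and a prompt reviser; `max_rounds` and `accept_score` are illustrative knobs.

```python
from typing import Any, Callable, Dict, Tuple

def generate_critique_refine(
    prompt: str,
    generate: Callable[[str], Any],                  # hypothetical: frozen text-to-image generator
    critique: Callable[[str, Any], Dict[str, Any]],  # hypothetical: reward model -> rationale + score
    revise: Callable[[str, str, str], str],          # hypothetical: rewrites the prompt from the critique
    max_rounds: int = 3,
    accept_score: float = 0.9,
) -> Tuple[Any, float]:
    """Refine outputs by revising the prompt only; no generator weights are updated."""
    current_prompt, best_image, best_score = prompt, None, float("-inf")
    for _ in range(max_rounds):
        image = generate(current_prompt)
        # Critique against the *original* intent, not the revised prompt,
        # so revisions cannot drift away from what the user asked for.
        report = critique(prompt, image)
        score = report["overall_score"]
        if score > best_score:
            best_image, best_score = image, score
        if score >= accept_score:
            break  # the critique deems the output good enough
        # Turn the critique's named failure modes into a targeted prompt revision.
        current_prompt = revise(prompt, current_prompt, report["rationale"])
    return best_image, best_score
```

Note that the loop only rewrites the prompt between rounds, which is what lets it match RL fine-tuning on some benchmarks without any parameter updates.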
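PARROT's training-data pipeline can likewise be sketched at a high level from the three stages the abstract names (anchored generation, consistency filtering, distillation). Everything below is an assumption-labeled illustration: `rationalize` and `favors_preferred` are hypothetical helpers, and the exact filtering rule is a guess at the stated idea.

```python
from typing import Any, Callable, Dict, List, Tuple

def build_rationale_dataset(
    preference_pairs: List[Tuple[str, Any, Any]],            # (prompt, preferred, rejected)
    rationalize: Callable[[str, Any, Any], Dict[str, Any]],  # hypothetical: anchored rationale generator
    favors_preferred: Callable[[Dict[str, Any]], bool],      # hypothetical: which output the rationale favors
) -> List[Dict[str, Any]]:
    """Sketch of PARROT: recover rationales from preference labels,
    keep only rationales consistent with the human label, and use
    the survivors as distillation targets for the reward model."""
    kept: List[Dict[str, Any]] = []
    for prompt, preferred, rejected in preference_pairs:
        # Anchored generation: the rationalizer is shown the human preference,
        # so it explains why the preferred output wins instead of guessing.
        rationale = rationalize(prompt, preferred, rejected)
        # Consistency filtering: drop rationales whose implied verdict
        # contradicts the human label.
        if favors_preferred(rationale):
            kept.append({"prompt": prompt, "chosen": preferred,
                         "rejected": rejected, "rationale": rationale})
    # Distillation (not shown): fine-tune the reward model to reproduce
    # the surviving rationales and their scores.
    return kept
```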
Community
RationalRewards is a reasoning reward model that scales image generation quality at both training and test time. A quick test-time prompt revision matches or beats full RL fine-tuning (the kind that eats 400 GPU hours) with almost zero extra delay. This suggests modern visual generators are secretly packed with massive dormant superpowers… we just needed the right key to wake them up. 🔥