arxiv:2512.17909

Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Published on Dec 19
Submitted by Shilong Zhang on Dec 22

Abstract

AI-generated summary

A systematic framework adapts understanding-oriented encoder features for generative tasks by introducing a semantic-pixel reconstruction objective, enabling state-of-the-art image reconstruction and generation.

Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.
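To make the idea of a semantic-pixel reconstruction objective concrete, here is a minimal PyTorch-style sketch. It is an illustration under assumptions rather than the paper's implementation: the module names (`encoder`, `decoder`, `sem_head`, `frozen_repr_encoder`), the L1 pixel term, the cosine-similarity semantic term, and the weights `w_pix`/`w_sem` are all hypothetical stand-ins.

```python
# Minimal sketch of a semantic-pixel reconstruction objective (illustration only).
import torch
import torch.nn.functional as F


def semantic_pixel_loss(image, encoder, decoder, sem_head, frozen_repr_encoder,
                        w_pix=1.0, w_sem=0.5):
    """Combine a pixel-level and a semantic-level reconstruction term.

    image:               (B, 3, H, W) input batch
    encoder:             maps image -> compact latent, e.g. (B, 96, H/16, W/16)
    decoder:             maps latent -> reconstructed image
    sem_head:            maps latent -> features matching the frozen encoder's output
    frozen_repr_encoder: pretrained understanding-oriented encoder (kept frozen)
    """
    z = encoder(image)        # compact latent (96 channels, 16x16 downsampling)
    recon = decoder(z)        # pixel-space reconstruction

    # Pixel term: enforce low-level fidelity (fine-grained geometry and texture).
    loss_pix = F.l1_loss(recon, image)

    # Semantic term: keep the latent aligned with understanding-oriented features.
    with torch.no_grad():
        target_feat = frozen_repr_encoder(image)
    pred_feat = sem_head(z)
    loss_sem = 1.0 - F.cosine_similarity(
        pred_feat.flatten(1), target_feat.flatten(1), dim=-1
    ).mean()

    return w_pix * loss_pix + w_sem * loss_sem
```

The point the abstract stresses is that both terms act on the same compact latent (96 channels at 16x16 spatial downsampling), so the space stays semantically rich while still supporting accurate pixel-level reconstruction.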

Community

Paper submitter


Project Page: https://jshilong.github.io/PS-VAE-PAGE/

Great work!

Great work! Do you have any plans to compare with SVG in future experiments?


Thanks for the question!

SVG is indeed an excellent concurrent work to RAE, and we briefly discussed it in the Related Work section. Due to limited compute resources, we did not run SVG experiments in this paper. I plan to add a comparison once I regain access to sufficient compute, but this may take some time since I've left Adobe and currently don't have large-scale resources available.

From a modeling perspective, SVG largely follows a paradigm similar to RAE's. While it introduces additional channels, its reconstruction quality still falls short of practical requirements such as image editing.
Moreover, its raw latent space is not very compact and suffers from off-manifold issues, again similar to RAE. This can also be seen in their latest T2I results: even when scaling to high resolutions, structural artifacts (e.g., broken limbs) persist, which are notoriously difficult to resolve. Adobe explored similar directions much earlier (e.g., with Qwen-VL encoders) and observed the same limitations.
