arxiv:2512.17909

Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Published on Dec 19
Submitted by Shilong Zhang on Dec 22

Abstract

AI-generated summary

A systematic framework adapts understanding-oriented encoder features for generative tasks by introducing a semantic-pixel reconstruction objective, enabling state-of-the-art image reconstruction and generation.

Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.
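To make the idea of a semantic-pixel reconstruction objective concrete, here is a minimal PyTorch-style sketch. It is an illustration under assumptions rather than the paper's implementation: the module names (`encoder`, `decoder`, `sem_head`, `frozen_repr_encoder`), the L1 pixel term, the cosine-similarity semantic term, and the weights `w_pix`/`w_sem` are all hypothetical stand-ins.

```python
# Minimal sketch of a semantic-pixel reconstruction objective (illustration only).
import torch
import torch.nn.functional as F


def semantic_pixel_loss(image, encoder, decoder, sem_head, frozen_repr_encoder,
                        w_pix=1.0, w_sem=0.5):
    """Combine a pixel-level and a semantic-level reconstruction term.

    image:               (B, 3, H, W) input batch
    encoder:             maps image -> compact latent, e.g. (B, 96, H/16, W/16)
    decoder:             maps latent -> reconstructed image
    sem_head:            maps latent -> features matching the frozen encoder's output
    frozen_repr_encoder: pretrained understanding-oriented encoder (kept frozen)
    """
    z = encoder(image)        # compact latent (96 channels, 16x16 downsampling)
    recon = decoder(z)        # pixel-space reconstruction

    # Pixel term: enforce low-level fidelity (fine-grained geometry and texture).
    loss_pix = F.l1_loss(recon, image)

    # Semantic term: keep the latent aligned with understanding-oriented features.
    with torch.no_grad():
        target_feat = frozen_repr_encoder(image)
    pred_feat = sem_head(z)
    loss_sem = 1.0 - F.cosine_similarity(
        pred_feat.flatten(1), target_feat.flatten(1), dim=-1
    ).mean()

    return w_pix * loss_pix + w_sem * loss_sem
```

The point the abstract stresses is that both terms act on the same compact latent (96 channels at 16x16 spatial downsampling), so the space stays semantically rich while still supporting accurate pixel-level reconstruction.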

Community

Paper submitter


Project Page: https://jshilong.github.io/PS-VAE-PAGE/

Great work!

Great work! Do you have any plans to compare with SVG in future experiments?


Thanks for the question!

SVG is indeed an excellent concurrent work to RAE, and we briefly discussed it in the Related Work section. Due to limited compute resources, we did not run SVG experiments in this paper. I plan to add a comparison once I regain access to sufficient compute, but this may take some time since I've left Adobe and currently don't have large-scale resources available.

From a modeling perspective, SVG largely follows a paradigm similar to RAE's. While it introduces additional channels, its reconstruction quality still falls short of practical requirements such as image editing.
Moreover, its raw latent space is not very compact and suffers from off-manifold issues, again similar to RAE. This can also be seen in their latest T2I results: even when scaling to high resolutions, structural artifacts (e.g., broken limbs) persist, which are notoriously difficult to resolve. Adobe explored similar directions much earlier (e.g., with Qwen-VL encoders) and observed the same limitations.
