Abstract
Gen-Searcher is a search-augmented image generation agent that performs multi-hop reasoning and search to collect textual knowledge and reference images for grounded generation, trained with supervised fine-tuning followed by agentic reinforcement learning with dual reward feedback.
Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge and thus often fail in real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesized images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models along multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.
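To make the dual-reward idea concrete, here is a minimal sketch (not the authors' released code) of how a text-based and an image-based reward might be blended per rollout and turned into group-relative advantages for GRPO. The function names, the mixing weight `alpha`, and the specific scores are illustrative assumptions only.

```python
# Minimal sketch of dual reward feedback for GRPO-style training.
# All names, weights, and scores below are assumptions for illustration,
# not the paper's actual implementation.
import numpy as np

def dual_reward(text_scores: np.ndarray, image_scores: np.ndarray,
                alpha: float = 0.5) -> np.ndarray:
    """Blend a text-based reward and an image-based reward for each rollout."""
    return alpha * text_scores + (1.0 - alpha) * image_scores

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: standardize rewards within one prompt's rollout group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four rollouts sampled for one search-intensive prompt.
text_scores = np.array([0.8, 0.4, 0.9, 0.5])   # e.g., faithfulness to retrieved knowledge
image_scores = np.array([0.6, 0.7, 0.9, 0.3])  # e.g., image-prompt alignment score
advantages = grpo_advantages(dual_reward(text_scores, image_scores))
print(advantages)  # per-rollout weights for the policy-gradient update
```

Normalizing the blended reward within each prompt's group is what makes the signal group-relative; the blend itself is one plausible way to combine the two reward channels the abstract describes.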
Community
Project page: https://gen-searcher.vercel.app/
Code: https://github.com/tulerfeng/Gen-Searcher
Model and Data: https://huggingface.co/GenSearcher
The following papers were recommended by the Semantic Scholar API
- MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline (2026)
- UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing (2026)
- DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing (2026)
- Enhancing Spatial Understanding in Image Generation via Reward Modeling (2026)
- VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning (2026)
- Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation (2026)
- MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models (2026)