RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
Abstract
RePlan, a plan-then-execute framework, enhances instruction-based image editing by combining a vision-language planner with a diffusion editor, achieving superior performance on complex, cluttered-scene editing tasks while using only limited training data.
Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io
The Challenge: IV-Complexity
Current instruction-based editing models struggle when intricate instructions meet cluttered, realistic scenes, a challenge we define as Instruction-Visual Complexity (IV-Complexity). In these scenarios, high-level global context is insufficient to distinguish specific targets from semantically similar objects (e.g., distinguishing a "used cup" from a clean glass on a messy desk).
The Gap: Global Semantic Guidance
Existing methods, including unified VLM-diffusion architectures, predominantly rely on Global Semantic Guidance. They compress instructions into global feature vectors, lacking spatial grounding. Consequently, edits often "spill over" into unrelated areas or modify the wrong targets, failing to preserve background consistency.
Our Solution: Region-Aligned Guidance
RePlan introduces a Plan-then-Execute framework that explicitly links text to pixels. Our key contributions include:
Reasoning-Guided Planning
A VLM planner performs Chain-of-Thought (CoT) reasoning to decompose complex instructions into structured, region-specific guidance (Bounding Boxes + Local Hints).
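For concreteness, here is a minimal sketch of what the planner's structured, region-aligned output could look like, assuming it emits a JSON plan of bounding boxes plus local hints. The field names (`regions`, `bbox`, `hint`, `negative`) are illustrative assumptions, not the paper's exact schema.

```python
# Minimal sketch of a structured plan a VLM planner might emit after CoT reasoning.
# Field names (regions, bbox, hint, negative) are illustrative, not the paper's schema.
import json
from dataclasses import dataclass

@dataclass
class RegionEdit:
    bbox: tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)
    hint: str                                # local edit instruction for this region
    negative: str | None = None              # optional regional negative prompt

def parse_plan(planner_output: str) -> list[RegionEdit]:
    """Parse the planner's JSON plan into region-level edit specifications."""
    plan = json.loads(planner_output)
    return [
        RegionEdit(tuple(r["bbox"]), r["hint"], r.get("negative"))
        for r in plan["regions"]
    ]

# Example: "replace the used cup on the messy desk with a potted cactus"
example = ('{"regions": [{"bbox": [0.62, 0.55, 0.78, 0.74], '
           '"hint": "a small potted cactus", "negative": "coffee cup"}]}')
print(parse_plan(example))
```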
Training-Free Attention Injection
We introduce a mechanism tailored for Multimodal DiT (MMDiT) that executes edits via region-constrained attention. This enables precise, multi-region parallel edits in a single pass while preserving the background, without requiring any training of the DiT backbone.
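The sketch below illustrates the general idea of region-constrained attention in an MMDiT-style joint attention block, assuming image tokens laid out on an h-by-w latent grid and a fixed number of text tokens per regional hint: hint tokens are blocked, via an additive negative-infinity bias, from attending to or being attended by image tokens outside their bounding box. It is a simplified illustration of the idea, not the paper's exact injection code.

```python
# Simplified sketch of region-constrained attention for an MMDiT-style block:
# text tokens of each local hint may only interact with image tokens whose
# latent-grid position falls inside that region's bounding box.
import torch

def region_attention_bias(h, w, regions, tokens_per_hint):
    """Build an additive attention bias over [image tokens | hint tokens].

    h, w            -- latent grid size (image tokens = h * w)
    regions         -- list of normalized bboxes (x1, y1, x2, y2), one per hint
    tokens_per_hint -- number of text tokens allotted to each regional hint
    """
    n_img = h * w
    n = n_img + tokens_per_hint * len(regions)
    bias = torch.zeros(n, n)

    # Centers of latent-grid cells in normalized image coordinates.
    ys = (torch.arange(h).float() + 0.5) / h
    xs = (torch.arange(w).float() + 0.5) / w
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")

    for i, (x1, y1, x2, y2) in enumerate(regions):
        inside = ((grid_x >= x1) & (grid_x <= x2) &
                  (grid_y >= y1) & (grid_y <= y2)).flatten()  # (n_img,) bool
        t0 = n_img + i * tokens_per_hint
        t1 = t0 + tokens_per_hint
        # Block hint <-> image attention outside this region in both directions.
        bias[t0:t1, :n_img].masked_fill_(~inside, float("-inf"))
        bias[:n_img, t0:t1].masked_fill_(~inside.unsqueeze(1), float("-inf"))

    return bias  # added to the joint attention logits before softmax
```

Because the bias touches only the attention logits, edits for all regions run in parallel within a single denoising pass and the DiT weights stay frozen.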
Efficient GRPO Training
We enhance the planner's reasoning capabilities using Group Relative Policy Optimization (GRPO). Remarkably, we achieve strong planning performance using only ~1k instruction-only samples, bypassing the need for large-scale paired image datasets.
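Below is a hedged sketch of the group-relative advantage computation at the core of GRPO, paired with a toy plan reward. The reward terms (format validity, a non-empty region list) and tensor shapes are illustrative stand-ins for the paper's actual reasoning- and grounding-based rewards.

```python
# Sketch of GRPO-style group-relative advantages for planner fine-tuning.
# For each instruction, the policy samples a group of candidate plans; each
# plan is scored, and advantages are rewards normalized within the group.
import json
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) -> advantages of the same shape."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def plan_reward(plan_text: str) -> float:
    """Toy reward: 1.0 for a parseable plan with at least one region, else 0.0.
    A real reward would also score reasoning fidelity and box grounding."""
    try:
        plan = json.loads(plan_text)
        return float(bool(plan.get("regions")))
    except (json.JSONDecodeError, AttributeError):
        return 0.0

# Example: 2 instructions, 4 sampled plans each
rewards = torch.tensor([[1.0, 0.0, 1.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```

Because the reward can be computed from the plan text alone, training needs only instruction prompts, which is why roughly 1k instruction-only samples suffice without paired before/after images.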
Interactive & Flexible Editing
RePlan's intermediate region guidance is fully editable, enabling user-in-the-loop intervention. Users can adjust bounding boxes or hints directly to refine results. Furthermore, our attention mechanism supports regional negative prompts to prevent bleeding effects.
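As a usage-style sketch, the snippet below shows how a user might tweak the intermediate plan before re-running execution: nudging a bounding box and attaching a regional negative prompt. The plan structure and the commented `run_editor` call are hypothetical placeholders rather than a released API.

```python
# Sketch of user-in-the-loop refinement: because the intermediate plan is plain
# data, a user can adjust a box or add a regional negative prompt and re-run
# only the execution step. The structure and run_editor call are hypothetical.
plan = {
    "regions": [
        {"bbox": [0.62, 0.55, 0.78, 0.74],
         "hint": "a small potted cactus",
         "negative": None}
    ]
}

# Shrink the box so the edit hugs the target more tightly.
plan["regions"][0]["bbox"] = [0.64, 0.57, 0.76, 0.72]

# Add a regional negative prompt to suppress bleeding into neighboring objects.
plan["regions"][0]["negative"] = "coffee cup, keyboard"

# edited = run_editor(source_image, plan)   # hypothetical execution call
```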
IV-Edit Benchmark
To foster future research, we establish IV-Edit, the first benchmark specifically designed to evaluate IV-Complex editing, filling the gap left by current subject-dominated evaluation sets.