Image-Text-to-Text
Transformers
File size: 2,885 Bytes
e92c73c
 
 
 
 
 
 
 
 
 
 
 
 
5c02ef0
 
 
 
 
 
 
e92c73c
5c02ef0
 
 
 
 
 
 
 
 
e92c73c
 
 
 
5c02ef0
e92c73c
 
 
 
5c02ef0
e92c73c
 
5c02ef0
e92c73c
 
 
 
 
5c02ef0
 
e92c73c
5c02ef0
 
 
fb14508
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a3fbc28
fb14508
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
---
pipeline_tag: image-text-to-text
library_name: transformers
license: mit
---

# Skywork-R1V4

<div align="center">   
  <img src="skywork-logo.png" alt="Introduction Image" width="500" height="400"> 
</div>

## 1. Model Introduction

**Skywork-R1V4** is a 30B (A3B) multimodal agent that unifies:
- Multimodal task planning
- Active image manipulation (“thinking with images”)
- Deep multimodal search (text × image)
- Interleaved tool-grounded reasoning

Skywork-R1V4 is trained **purely via supervised finetuning** on **< 30k high-quality, execution-consistent trajectories**.

At inference time, the model exhibits **emergent long-horizon reasoning**, executing **10+ tool calls** across visual operations and web search to solve complex real-world tasks.

Skywork-R1V4 achieves **state-of-the-art performance** on multimodal search benchmarks:
- **MMSearch: 66.1**
- **FVQA: 67.2**
- **Beats Gemini 2.5 Flash on all 11 comparable metrics**


## 2. Feature

### 🔍 **“Thinking With Images”**  
Skywork-R1V4 actively manipulates images through:

&nbsp;&nbsp;&nbsp;&nbsp;• Multi-stage cropping  
&nbsp;&nbsp;&nbsp;&nbsp;• Local detail extraction  
&nbsp;&nbsp;&nbsp;&nbsp;• Region attention  
&nbsp;&nbsp;&nbsp;&nbsp;• Visual clue refinement  


### 🔄 **Interleaved Reasoning**  
The model alternates between:

&nbsp;&nbsp;&nbsp;&nbsp;1. Visual reasoning  
&nbsp;&nbsp;&nbsp;&nbsp;2. Image operation  
&nbsp;&nbsp;&nbsp;&nbsp;3. Web search  
&nbsp;&nbsp;&nbsp;&nbsp;4. Cross-evidence verification


## 3. Links

- **Model Center**: https://platform.skyworkmodel.ai/#/model-center  
- **API Documentation (R1V4)**: https://docs.skyworkmodel.ai/r1v4/api-reference/completions.html

## 4. Citation
If you use Skywork-R1V4 in your research, please cite:

```
@misc{zhang2025skyworkr1v4agenticmultimodalintelligence,
      title={Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch}, 
      author={Yifan Zhang and Liang Hu and Haofeng Sun and Peiyu Wang and Yichen Wei and Shukang Yin and Jiangbo Pei and Wei Shen and Peng Xia and Yi Peng and Tianyidan Xie and Eric Li and Yang Liu and Xuchen Song and Yahui Zhou},
      year={2025},
      eprint={2512.02395},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.02395}, 
}
```


```
@misc{peng2025skyworkr1vpioneeringmultimodal,
      title={Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought}, 
      author={Yi Peng and Peiyu Wang and Xiaokun Wang and Yichen Wei and Jiangbo Pei and Weijie Qiu and Ai Jian and Yunzhuo Hao and Jiachun Pan and Tianyidan Xie and Li Ge and Rongxian Zhuang and Xuchen Song and Yang Liu and Yahui Zhou},
      year={2025},
      eprint={2504.05599},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.05599}, 
}
```