Image-Text-to-Text
Transformers
OrlandoHugBot committed on
Commit e92c73c · verified · 1 Parent(s): 5c02ef0

Update README.md

  1. README.md +30 -15
README.md CHANGED
@@ -1,4 +1,16 @@
- # 📌 Overview
+ ---
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ license: mit
+ ---
+
+ # Skywork-R1V4
+
+ <div align="center">
+ <img src="skywork-logo.png" alt="Introduction Image" width="500" height="400">
+ </div>
+
+ ## 1. Model Introduction
 
  **Skywork-R1V4** is a 30B (A3B) multimodal agent that unifies:
  - Multimodal task planning
@@ -6,7 +18,7 @@
  - Deep multimodal search (text × image)
  - Interleaved tool-grounded reasoning
 
- Unlike traditional VLMs that treat visual operations and search as disjoint capabilities—or agent systems that rely heavily on costly RL—Skywork-R1V4 is trained **purely via supervised finetuning** on **< 30k high-quality, execution-consistent trajectories**.
+ Skywork-R1V4 is trained **purely via supervised finetuning** on **< 30k high-quality, execution-consistent trajectories**.
 
  At inference time, the model exhibits **emergent long-horizon reasoning**, executing **10+ tool calls** across visual operations and web search to solve complex real-world tasks.
 
@@ -16,25 +28,28 @@ Skywork-R1V4 achieves **state-of-the-art performance** on multimodal search benc
  - **Beats Gemini 2.5 Flash on all 11 comparable metrics**
 
 
- # 🚀 Key Features
+ ## 2. Feature
+
+ ### 🔍 **“Thinking With Images”**
+ Skywork-R1V4 actively manipulates images through:
 
- ### **“Thinking With Images”**
- Skywork-R1V4 actively manipulates images:
- - Multi-stage cropping
- - Local detail extraction
- - Region attention
- - Visual clue refinement
+ &nbsp;&nbsp;&nbsp;&nbsp;• Multi-stage cropping
+ &nbsp;&nbsp;&nbsp;&nbsp;• Local detail extraction
+ &nbsp;&nbsp;&nbsp;&nbsp;• Region attention
+ &nbsp;&nbsp;&nbsp;&nbsp;• Visual clue refinement
 
- ### **Interleaved Reasoning**
+
+ ### 🔄 **Interleaved Reasoning**
  The model alternates between:
- 1. Visual reasoning
- 2. Image operation
- 3. Web search
- 4. Cross-evidence verification
+
+ &nbsp;&nbsp;&nbsp;&nbsp;1. Visual reasoning
+ &nbsp;&nbsp;&nbsp;&nbsp;2. Image operation
+ &nbsp;&nbsp;&nbsp;&nbsp;3. Web search
+ &nbsp;&nbsp;&nbsp;&nbsp;4. Cross-evidence verification
 
  ---
 
- # 🔗 Links
+ ## 3. Links
 
  - **Model Center**: https://platform.skyworkmodel.ai/#/model-center
  - **API Documentation (R1V4)**: https://docs.skyworkmodel.ai/r1v4/api-reference/completions.html
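
Note on the "Multi-stage cropping" item in the README's feature list: the README does not say how the model performs it, so the following is only a toy sketch of the general idea, where each crop is taken relative to the previous one and the final region is mapped back to original-image coordinates. The names `crop`, `multi_stage_crop`, `Image`, and `Box` are hypothetical and are not part of any Skywork API; the image is a plain nested list rather than a real image object.

```python
from typing import List, Tuple

# Toy stand-ins (assumptions for illustration, not Skywork types):
Image = List[List[int]]          # rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def multi_stage_crop(image: Image, boxes: List[Box]) -> Tuple[Image, Box]:
    """Apply each box relative to the previous crop.

    Returns the final crop and its location in original-image coordinates,
    so a later step could still refer back to the full image.
    """
    off_x, off_y = 0, 0
    current = image
    for left, top, right, bottom in boxes:
        current = crop(current, (left, top, right, bottom))
        # Accumulate offsets so the final box is expressed in original coordinates.
        off_x += left
        off_y += top
    height = len(current)
    width = len(current[0]) if current else 0
    return current, (off_x, off_y, off_x + width, off_y + height)
```

For example, on a 6×6 grid, cropping first to `(1, 1, 5, 5)` and then to `(1, 1, 3, 3)` narrows the view to a 2×2 region whose original-image box is `(2, 2, 4, 4)`.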