OrlandoHugBot committed · Commit 5c02ef0 (verified) · 1 parent: e1a59d2

Update README.md

Files changed (1): README.md (+40 -3)
README.md CHANGED
@@ -1,3 +1,40 @@
- ---
- license: apache-2.0
- ---
+ # 📌 Overview
+
+ **Skywork-R1V4** is a 30B (A3B) multimodal agent that unifies:
+ - Multimodal task planning
+ - Active image manipulation (“thinking with images”)
+ - Deep multimodal search (text × image)
+ - Interleaved tool-grounded reasoning
+
+ Traditional VLMs treat visual operations and search as disjoint capabilities, and many agent systems rely heavily on costly reinforcement learning (RL). Skywork-R1V4, by contrast, is trained **purely via supervised fine-tuning** on **fewer than 30k high-quality, execution-consistent trajectories**.
+
+ At inference time, the model exhibits **emergent long-horizon reasoning**, executing **10+ tool calls** across visual operations and web search to solve complex real-world tasks.
+
+ Skywork-R1V4 achieves **state-of-the-art performance** on multimodal search benchmarks:
+ - **MMSearch: 66.1**
+ - **FVQA: 67.2**
+ - **Outperforms Gemini 2.5 Flash on all 11 comparable metrics**
+
+ # 🚀 Key Features
+
+ ### **“Thinking With Images”**
+ Skywork-R1V4 actively manipulates images:
+ - Multi-stage cropping
+ - Local detail extraction
+ - Region attention
+ - Visual clue refinement
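
To make multi-stage cropping concrete, here is a minimal sketch of the bookkeeping it implies: when each crop is expressed relative to the previous view, the final region can be mapped back to coordinates in the original image. The function name `compose_crops` and the `(left, top, right, bottom)` box convention are assumptions made for illustration; the README does not describe how Skywork-R1V4 represents crops internally.

```python
from typing import List, Tuple

# Assumed convention: (left, top, right, bottom) in pixels, right/bottom exclusive.
Box = Tuple[int, int, int, int]


def compose_crops(image_size: Tuple[int, int], crops: List[Box]) -> Box:
    """Map a chain of nested crops back to original-image coordinates.

    Each box in `crops` is relative to the view produced by the previous crop.
    """
    width, height = image_size
    off_x, off_y = 0, 0
    for left, top, right, bottom in crops:
        # Reject boxes that fall outside the current view or are empty.
        if not (0 <= left < right <= width and 0 <= top < bottom <= height):
            raise ValueError(
                f"crop {(left, top, right, bottom)} does not fit a {width}x{height} view"
            )
        # Accumulate the offset and shrink the view to the cropped region.
        off_x += left
        off_y += top
        width, height = right - left, bottom - top
    return (off_x, off_y, off_x + width, off_y + height)


# Two-stage zoom on a 1000x800 image: a coarse crop, then a finer one inside it.
final_region = compose_crops((1000, 800), [(100, 50, 600, 450), (200, 100, 400, 300)])
```

Tracking the cumulative offset this way lets a detail found in a deep crop be reported in terms of the original image.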
+
+ ### **Interleaved Reasoning**
+ The model alternates between:
+ 1. Visual reasoning
+ 2. Image operation
+ 3. Web search
+ 4. Cross-evidence verification
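
The alternation above can be pictured as a simple control loop in which a policy picks the next action, a tool runs it, and the observation is appended to the trace. The sketch below is hypothetical throughout: `run_agent`, the tool names, the stub tools, and the `max_steps` budget are illustrative assumptions, not the Skywork-R1V4 runtime or its API.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (action, argument, observation)
Policy = Callable[[List[Step]], Tuple[str, str]]


def run_agent(policy: Policy, tools: Dict[str, Callable[[str], str]],
              max_steps: int = 16) -> Tuple[str, List[Step]]:
    """Alternate policy decisions and tool executions until an answer is given."""
    trace: List[Step] = []
    for _ in range(max_steps):
        action, argument = policy(trace)
        if action == "answer":
            return argument, trace
        if action in tools:
            observation = tools[action](argument)
        else:
            # Feed the error back so the policy can recover on the next step.
            observation = f"unknown tool: {action}"
        trace.append((action, argument, observation))
    raise RuntimeError("step budget exhausted without an answer")


# Stub tools standing in for image operations, web search, and verification.
stub_tools = {
    "crop": lambda arg: f"cropped region: {arg}",
    "search": lambda arg: f"search results for: {arg}",
    "verify": lambda arg: f"evidence checked: {arg}",
}


def scripted_policy(trace: List[Step]) -> Tuple[str, str]:
    """Stand-in for the model: replays a fixed plan, one step per call."""
    plan = [("crop", "storefront sign"), ("search", "text on sign"),
            ("verify", "sign text vs. search hits"), ("answer", "done")]
    return plan[len(trace)]


answer, trace = run_agent(scripted_policy, stub_tools)
```

In a real deployment the scripted policy would be replaced by calls to the model, and the step budget would need to accommodate the 10+ tool calls mentioned in the overview.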
+
+ ---
+
+ # 🔗 Links
+
+ - **Model Center**: https://platform.skyworkmodel.ai/#/model-center
+ - **API Documentation (R1V4)**: https://docs.skyworkmodel.ai/r1v4/api-reference/completions.html