---
license: apache-2.0
language:
- en
datasets:
- mychen76/invoices-and-receipts_ocr_v1
- unsloth/LaTeX_OCR
- prithivMLmods/Latex-KIE
base_model:
- Qwen/Qwen2-VL-2B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- text-generation-inference
- image-caption
- mini
- art explain
- visual report generation
- photo captions
- cutlines
- qwen2
- inscription subtitle
- representation
---
![2.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/yUKVKSX2E18k0h3YwCx1h.png)

# **Imgscope-OCR-2B-0527**

> The **Imgscope-OCR-2B-0527** model is a fine-tuned version of *Qwen2-VL-2B-Instruct*, optimized for *messy handwriting recognition*, *document OCR*, *realistic handwritten OCR*, and *math problem solving with LaTeX formatting*. It is trained on custom document and handwriting OCR datasets and keeps the base model's conversational interface, so it can serve multi-modal applications that combine visual and textual input.

> [!note]
> Colab Demo: https://huggingface.co/prithivMLmods/Imgscope-OCR-2B-0527/blob/main/Imgscope%20OCR%202B%200527%20Demo/Imgscope-OCR-2B-0527.ipynb

> [!note]
> Video Understanding Demo: https://huggingface.co/prithivMLmods/Imgscope-OCR-2B-0527/blob/main/Imgscope-OCR-2B-05270-Video-Understanding/Imgscope-OCR-2B-0527-Video-Understanding.ipynb


---

### Key Enhancements

* **SoTA Understanding of Images of Various Resolution & Ratio**
  Imgscope-OCR-2B-0527 achieves state-of-the-art performance on visual understanding benchmarks such as MathVista, DocVQA, RealWorldQA, and MTVQA.

* **Enhanced Handwriting OCR**
  Specifically optimized for recognizing and interpreting **realistic and messy handwriting** with high accuracy. Ideal for digitizing handwritten documents and notes.

* **Document OCR Fine-Tuning**
  Fine-tuned with curated and realistic **document OCR datasets**, enabling accurate extraction of text from various structured and unstructured layouts.

* **Understanding Videos of 20+ Minutes**
  Capable of processing long videos for **video-based question answering**, **transcription**, and **content generation**.

* **Device Control Agent**
  Supports decision-making and control capabilities for integration with **mobile devices**, **robots**, and **automation systems** using visual-textual commands.

* **Multilingual OCR Support**
  In addition to English and Chinese, the model supports **OCR in multiple languages** including European languages, Japanese, Korean, Arabic, and Vietnamese.

---

### How to Use

```python
import torch  # needed only if you enable the optional Flash Attention variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Imgscope-OCR-2B-0527",  # replace with updated model ID if available
    torch_dtype="auto",
    device_map="auto"
)

# Optional: Flash Attention for performance optimization
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "prithivMLmods/Imgscope-OCR-2B-0527",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Load processor
processor = AutoProcessor.from_pretrained("prithivMLmods/Imgscope-OCR-2B-0527")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Recognize the handwriting in this image."},
        ],
    }
]

# Prepare input
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to wherever device_map="auto" placed the model (GPU if available)
inputs = inputs.to(model.device)

# Generate output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
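
The upstream Qwen2-VL processor accepts `min_pixels` and `max_pixels` to bound how many visual tokens each image consumes, which can help trade memory for OCR detail on dense pages. The sketch below assumes this fine-tune inherits that behaviour unchanged. The helper names `pixel_budget` and `load_processor` and the default token counts are illustrative, not values recommended by the model author.

```python
# In Qwen2-VL, each visual token covers a 28x28 pixel area after patch merging.
PIXELS_PER_TOKEN = 28 * 28


def pixel_budget(min_tokens: int, max_tokens: int) -> dict:
    """Translate a visual-token range into processor pixel limits (hypothetical helper)."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return {
        "min_pixels": min_tokens * PIXELS_PER_TOKEN,
        "max_pixels": max_tokens * PIXELS_PER_TOKEN,
    }


def load_processor(model_id: str = "prithivMLmods/Imgscope-OCR-2B-0527",
                   min_tokens: int = 256, max_tokens: int = 1280):
    """Load the processor with a bounded per-image token budget."""
    # Imported lazily so pixel_budget() can be used without transformers installed.
    from transformers import AutoProcessor

    return AutoProcessor.from_pretrained(model_id, **pixel_budget(min_tokens, max_tokens))
```

A lower `max_tokens` reduces memory use at the cost of fine detail, so small or faint handwriting may need a higher ceiling.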

---

### Demo Inference

![Screenshot 2025-05-27 at 03-40-34 Gradio.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/9KiRkOGPB8cLl6VHwh2UD.png)
![Screenshot 2025-05-27 at 03-40-56 (anonymous) - output_e0fbfa20-686e-4bce-b2e8-25991be5a5a0.pdf.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/VOHQIrT7hCs5afGMRROvD.png)

### Video Inference

![Screenshot 2025-05-27 at 20-14-22 Video Understanding with Imgscope-OCR-2B-0527.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/fyAVI0hZICWpSXlcKaJF4.png)
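
A video request can reuse the image pipeline from the How to Use section, with a `video` entry in place of the `image` entry. The sketch below follows the message format documented upstream for `qwen_vl_utils`. The helper name `build_video_messages`, the file path, and the `fps` and `max_pixels` values are placeholders chosen for illustration.

```python
def build_video_messages(video_uri: str, prompt: str,
                         fps: float = 1.0, max_pixels: int = 360 * 420) -> list:
    """Build a single-turn chat message that pairs a video with a text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_uri,        # e.g. "file:///path/to/video.mp4"
                    "max_pixels": max_pixels,  # per-frame pixel cap
                    "fps": fps,                # frames sampled per second
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Placeholder path: point this at a real local file or URL.
messages = build_video_messages("file:///path/to/video.mp4", "Describe this video.")
```

The resulting `messages` can then go through `processor.apply_chat_template` and `process_vision_info` in the same way as the image example.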

---

### Buffering Output (Streaming)

```python
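def stream_output(streamer):
    """Yield the accumulated text each time the streamer emits a new chunk."""
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        # Strip the end-of-turn marker if it leaks into the stream
        buffer = buffer.replace("<|im_end|>", "")
        yield buffer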
```
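
The snippet above consumes a `streamer` without showing where it comes from. One common way to obtain one is `TextIteratorStreamer` from `transformers`, with generation running on a background thread. The sketch below assumes `model`, `processor`, and `inputs` come from the How to Use section. The function names `accumulate` and `stream_generate` are illustrative.

```python
from threading import Thread


def accumulate(chunks, stop_token="<|im_end|>"):
    """Yield the growing output text, dropping the end-of-turn marker."""
    buffer = ""
    for new_text in chunks:
        buffer += new_text
        buffer = buffer.replace(stop_token, "")
        yield buffer


def stream_generate(model, processor, inputs, max_new_tokens=128):
    """Run generation in a background thread and yield partial text as it arrives."""
    # Imported lazily so accumulate() can be exercised without transformers.
    from transformers import TextIteratorStreamer

    streamer = TextIteratorStreamer(
        processor.tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    thread = Thread(
        target=model.generate,
        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens),
    )
    thread.start()
    yield from accumulate(streamer)
    thread.join()
```

In a Gradio handler, iterating over `stream_generate(model, processor, inputs)` and yielding each value updates the output box incrementally.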

---

### Key Features

1. **Realistic Messy Handwriting OCR**

   * Fine-tuned for **complex and hard-to-read handwritten inputs** using real-world handwriting datasets.

2. **Document OCR and Layout Understanding**

   * Accurately extracts text from structured documents, including scanned pages, forms, and academic papers.

3. **Image and Text Multi-modal Reasoning**

   * Combines **vision-language capabilities** for tasks like captioning, answering image-based queries, and understanding image+text prompts.

4. **Math Problem Solving and LaTeX Rendering**

   * Converts mathematical expressions and problem-solving steps into **LaTeX** format.

5. **Multi-turn Conversations**

   * Supports **dialogue-based reasoning**, retaining context for follow-up questions.

6. **Video + Image + Text-to-Text Generation**

   * Accepts inputs from videos, images, or combined media with text, and generates relevant output accordingly.
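
Multi-turn use amounts to appending each reply to the message list before asking the next question, so the chat template sees the full history. This is a minimal sketch, assuming the message format from the How to Use section. The helper names, the file path, and the sample strings are hypothetical.

```python
def start_conversation(image_uri: str, prompt: str) -> list:
    """Open a conversation with an image and a first question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_uri},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def add_turn(messages: list, reply: str, follow_up: str) -> list:
    """Return a new history with the model's reply and the next user question."""
    return messages + [
        {"role": "assistant", "content": [{"type": "text", "text": reply}]},
        {"role": "user", "content": [{"type": "text", "text": follow_up}]},
    ]


history = start_conversation("file:///path/to/page.png", "Transcribe this page.")
# In practice `reply` would be output_text[0] from the generation step.
history = add_turn(history, "x^2 + 1 = 0", "Rewrite the equation in LaTeX.")
```

Passing `history` through `processor.apply_chat_template` and `process_vision_info` again produces the inputs for the follow-up turn.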

---

## **Intended Use**

**Imgscope-OCR-2B-0527** is intended for:

* Handwritten and printed document digitization
* OCR pipelines for educational institutions and businesses
* Academic and scientific content parsing, especially math-heavy documents
* Assistive tools for visually impaired users
* Robotic and mobile automation agents interpreting screen or camera data
* Multilingual OCR processing for document translation or archiving