# fx-nanovlm: Nano Vision-Language Model

This model is developed as part of the efforts of the Center for Excellence in Postal Technology (CEPT), Department of Post, Government of India, to develop and integrate AI/ML into service delivery for its customers.

## Model Description

fx-nanovlm is a lightweight Vision-Language Model (VLM) designed for efficient OCR and document understanding tasks. It combines the powerful SigLIP 2 vision encoder with the compact Gemma 3 language model through a custom 2-layer MLP projector.

## Architecture
| Component | Model | Parameters | Description |
|---|---|---|---|
| Vision Encoder | SigLIP 2 So400m | ~400M | Extracts visual features at 1152-dim |
| Projector | 2-layer MLP | ~1.15M | Aligns vision→text embeddings |
| Language Model | Gemma 3 270M | ~270M | Text generation & reasoning |
| Total | - | ~680M | Optimized for edge deployment |
```
Input Image (384×384)
          │
┌──────────────────────────────┐
│      SigLIP 2 So400m         │  ← Vision Encoder (Frozen)
│   Output: [B, 729, 1152]     │
└──────────────────────────────┘
          │
┌──────────────────────────────┐
│    2-Layer MLP Projector     │  ← Trained to align modalities
│      1152 → 640 → 640        │
└──────────────────────────────┘
          │
┌──────────────────────────────┐
│      Gemma 3 270M-IT         │  ← Language Model
│    18 layers, 640 hidden     │
└──────────────────────────────┘
          │
      Output Text
```
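For orientation, the projector is small enough to write out in full. The sketch below is a minimal PyTorch illustration of a 2-layer MLP mapping the 1152-dim SigLIP features into the 640-dim Gemma embedding space with a GELU activation (per `projector_hidden_act` in the config); the class and attribute names are illustrative, not the actual ones in `modeling_fx_nanovlm.py`.

```python
import torch.nn as nn

class ProjectorSketch(nn.Module):
    """Illustrative 2-layer MLP projector: 1152 -> 640 -> 640 (~1.15M params)."""

    def __init__(self, vision_dim: int = 1152, text_dim: int = 640):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, text_dim)  # 1152 -> 640
        self.act = nn.GELU()                        # projector_hidden_act = "gelu"
        self.fc2 = nn.Linear(text_dim, text_dim)    # 640 -> 640

    def forward(self, image_features):
        # image_features: [B, 729, 1152] from SigLIP 2
        return self.fc2(self.act(self.fc1(image_features)))  # [B, 729, 640]
```

The parameter count works out to 1152×640 + 640×640 plus biases, roughly 1.15M, matching the table above.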
## Key Features

- **Lightweight**: Only ~680M total parameters
- **OCR Optimized**: Trained on handwriting and document datasets
- **Efficient**: Designed for edge deployment and fast inference
- **Modular**: Easy to extend with LoRA fine-tuning
## Intended Uses
- Document OCR: Extract text from scanned documents
- Handwriting Recognition: Read handwritten text from images
- Document Q&A: Answer questions about document content
- Text Extraction: General-purpose text extraction from images
## Usage

### Installation

```bash
pip install transformers torch pillow
```

### Quick Start
```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor

# Import custom model classes
from modeling_fx_nanovlm import FxNanoVLMForConditionalGeneration
from configuration_fx_nanovlm import FxNanoVLMConfig

# Load model and tokenizer
model = FxNanoVLMForConditionalGeneration.from_pretrained(
    "Rohithcept/fx-nanovlm",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "Rohithcept/fx-nanovlm",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "google/siglip2-so400m-patch14-384"
)

# Load and preprocess the image
image = Image.open("document.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(model.device, dtype=torch.bfloat16)

# Create prompt with the <image> placeholder token
prompt = "<image>Extract the text from this image."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=pixel_values,
        max_new_tokens=512,
        do_sample=False,
    )

# Decode output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Batch Inference

```python
# Process multiple images in one call
images = [Image.open(f"doc_{i}.png").convert("RGB") for i in range(3)]
pixel_values = processor(images=images, return_tensors="pt").pixel_values

prompts = ["<image>Extract the text."] * len(images)
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    pixel_values=pixel_values.to(model.device, dtype=torch.bfloat16),
    max_new_tokens=512,
)
```
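To turn the batched `outputs` into text, `batch_decode` handles all sequences at once. A short follow-up to the snippet above (depending on the tokenizer's defaults, setting `tokenizer.padding_side = "left"` before batching is commonly recommended for decoder-only generation):

```python
# Decode every generated sequence in the batch
responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for i, text in enumerate(responses):
    print(f"doc_{i}.png -> {text}")
```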
## Training

### Stage 1: Projector Pre-training
The projector was trained to align SigLIP 2 visual features with Gemma 3 text embeddings:
| Hyperparameter | Value |
|---|---|
| Epochs | 3 |
| Batch Size | 32 (4 × 8 grad accum) |
| Learning Rate | 2e-3 |
| LR Scheduler | Cosine |
| Warmup Ratio | 10% |
| Weight Decay | 0.01 |
| Precision | bf16 |
Training Data:
- IAM Handwriting Dataset (~4.7K samples)
- GNHK Dataset (~8.7K samples)
- DocVQA (~18.6K samples)
Frozen Components:
- Vision Encoder (SigLIP 2)
- Language Model (Gemma 3)
Trainable Components:
- Multi-modal Projector (2-layer MLP)
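As a rough illustration of the freeze/train split and the optimizer settings in the table above, the sketch below wires them up by hand; the attribute names (`vision_tower`, `multi_modal_projector`, `language_model`) and the step count are assumptions, not taken from the actual training script.

```python
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

# Freeze the vision encoder and the language model
# (attribute names are illustrative; check modeling_fx_nanovlm.py for the real ones)
for module in (model.vision_tower, model.language_model):
    for p in module.parameters():
        p.requires_grad = False

# Only the multi-modal projector stays trainable
trainable_params = list(model.multi_modal_projector.parameters())

optimizer = AdamW(trainable_params, lr=2e-3, weight_decay=0.01)

# ~32K samples (IAM + GNHK + DocVQA), effective batch size 32, 3 epochs (approximate)
steps_per_epoch = 32_000 // 32
num_training_steps = 3 * steps_per_epoch
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.10 * num_training_steps),  # 10% warmup
    num_training_steps=num_training_steps,
)
```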
### Stage 2: LoRA Fine-tuning (Optional)
For task-specific fine-tuning, you can apply LoRA to the language model:
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model.language_model, lora_config)
```
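After wrapping, PEFT can report how small the LoRA update is relative to the ~270M-parameter language model; only the adapter weights are updated during fine-tuning.

```python
# Print trainable vs. total parameter counts for the LoRA-wrapped model
model.print_trainable_parameters()
```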
## Model Files

```
fx_nanovlm/
├── README.md                     # This file (model card)
├── config.json                   # Model configuration
├── model.safetensors             # Model weights (~2.8GB)
├── configuration_fx_nanovlm.py   # Custom config class
├── modeling_fx_nanovlm.py        # Custom model class
├── tokenizer.json                # Tokenizer
├── tokenizer.model               # SentencePiece model
├── tokenizer_config.json         # Tokenizer config
├── special_tokens_map.json       # Special tokens
└── added_tokens.json             # Added tokens (<image>)
```
## Limitations

- Image Resolution: Fixed at 384×384 pixels
- Language: Currently optimized for English
- Context Length: 512 tokens maximum sequence
- Training Data: OCR-focused, may not generalize to all vision tasks
## Technical Details

### Special Tokens

| Token | ID | Description |
|---|---|---|
| `<image>` | 262145 | Image placeholder token |
| `<eos>` | 1 | End of sequence |
| `<pad>` | 0 | Padding token |
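These IDs can be sanity-checked against the tokenizer loaded in the Quick Start above:

```python
# Verify the special-token IDs reported in the table
print(tokenizer.convert_tokens_to_ids("<image>"))  # expected: 262145
print(tokenizer.eos_token_id)                      # expected: 1
print(tokenizer.pad_token_id)                      # expected: 0
```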
### Model Configuration

```json
{
  "model_type": "fx_nanovlm",
  "vision_config": {
    "model_type": "siglip_vision_model",
    "hidden_size": 1152,
    "image_size": 384,
    "patch_size": 14,
    "num_attention_heads": 16,
    "num_hidden_layers": 27
  },
  "text_config": {
    "model_type": "gemma",
    "hidden_size": 640,
    "num_hidden_layers": 18,
    "vocab_size": 262146
  },
  "projector_hidden_act": "gelu",
  "image_token_index": 262145,
  "num_image_tokens": 729
}
```
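The `num_image_tokens` value follows from the vision settings above: a 384×384 input split into 14×14 patches gives 27 patches per side, i.e. 729 visual tokens, which matches the `[B, 729, 1152]` shape in the architecture diagram.

```python
# Derive the number of visual tokens from the vision config
image_size, patch_size = 384, 14
patches_per_side = image_size // patch_size   # 27
num_image_tokens = patches_per_side ** 2      # 729
```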
## Citation

If you use this model, please cite:

```bibtex
@misc{fx-nanovlm,
  author = {Rohith Reddy},
  title = {fx-nanovlm: A Nano Vision-Language Model for OCR},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Rohithcept/fx-nanovlm}
}
```
## License
This model is released under the Apache 2.0 License.
### Component Licenses
- SigLIP 2: Apache 2.0 (Google)
- Gemma 3: Gemma License (Google)
## Acknowledgments
- Team AI/ML, CEPT - Department of Post
- Google for the SigLIP 2 and Gemma 3 models
- HuggingFace for the transformers library
- The document AI research community
Model Card Contact: CEPT (Department of Post, Government of India)