fx-nanovlm: Nano Vision-Language Model

This model was developed by the Center for Excellence in Postal Technology (CEPT), Department of Post, Government of India, as part of its efforts to develop and integrate AI/ML into service delivery for its customers.

Model Description

fx-nanovlm is a lightweight Vision-Language Model (VLM) designed for efficient OCR and document understanding tasks. It combines the powerful SigLIP 2 vision encoder with the compact Gemma 3 language model through a custom 2-layer MLP projector.

Architecture

Component        Model            Parameters   Description
Vision Encoder   SigLIP 2 So400m  ~400M        Extracts 1152-dim visual features
Projector        2-layer MLP      ~1.15M       Aligns vision→text embeddings
Language Model   Gemma 3 270M     ~270M        Text generation & reasoning
Total            -                ~680M        Optimized for edge deployment
Input Image (384×384)
        ↓
┌────────────────────────────┐
│   SigLIP 2 So400m          │ ← Vision Encoder (Frozen)
│   Output: [B, 729, 1152]   │
└────────────────────────────┘
        ↓
┌────────────────────────────┐
│   2-Layer MLP Projector    │ ← Trained to align modalities
│   1152 → 640 → 640         │
└────────────────────────────┘
        ↓
┌────────────────────────────┐
│   Gemma 3 270M-IT          │ ← Language Model
│   18 layers, 640 hidden    │
└────────────────────────────┘
        ↓
    Output Text
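
The projector is the only newly trained component. The sketch below illustrates the 1152 → 640 → 640 mapping with a GELU activation (as listed in the model configuration); the class and attribute names are illustrative, not the actual ones used in modeling_fx_nanovlm.py.

import torch
import torch.nn as nn

# Illustrative sketch of the 2-layer MLP projector (names are assumptions;
# the real implementation lives in modeling_fx_nanovlm.py).
class ProjectorSketch(nn.Module):
    def __init__(self, vision_dim: int = 1152, text_dim: int = 640):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, text_dim)  # 1152 -> 640
        self.act = nn.GELU()                        # projector_hidden_act = "gelu"
        self.fc2 = nn.Linear(text_dim, text_dim)    # 640 -> 640

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: [B, 729, 1152] patch features from SigLIP 2
        return self.fc2(self.act(self.fc1(image_features)))  # [B, 729, 640]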

Key Features

  • 🚀 Lightweight: Only ~680M total parameters
  • 📖 OCR Optimized: Trained on handwriting and document datasets
  • ⚡ Efficient: Designed for edge deployment and fast inference
  • 🔧 Modular: Easy to extend with LoRA fine-tuning

Intended Uses

  • Document OCR: Extract text from scanned documents
  • Handwriting Recognition: Read handwritten text from images
  • Document Q&A: Answer questions about document content
  • Text Extraction: General-purpose text extraction from images
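
The prompt format follows the Quick Start example below: the <image> placeholder comes first, followed by the instruction. The prompts here are hypothetical illustrations of how each use case might be phrased.

# Hypothetical prompt templates for the intended uses above
prompts = {
    "document_ocr":    "<image>Extract the text from this image.",
    "handwriting":     "<image>Transcribe the handwritten text in this image.",
    "document_qa":     "<image>What is the total amount on this invoice?",
    "text_extraction": "<image>List all text visible in this image.",
}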

Usage

Installation

pip install transformers torch pillow
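
The optional LoRA fine-tuning example in the Training section additionally requires peft:

pip install peft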

Quick Start

import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor

# Import custom model classes
from modeling_fx_nanovlm import FxNanoVLMForConditionalGeneration
from configuration_fx_nanovlm import FxNanoVLMConfig

# Load model and tokenizer
model = FxNanoVLMForConditionalGeneration.from_pretrained(
    "Rohithcept/fx-nanovlm",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "Rohithcept/fx-nanovlm",
    trust_remote_code=True,
)

processor = AutoProcessor.from_pretrained(
    "google/siglip2-so400m-patch14-384"
)

# Load and process image
image = Image.open("document.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(model.device, dtype=torch.bfloat16)

# Create prompt
prompt = "<image>Extract the text from this image."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=pixel_values,
        max_new_tokens=512,
        do_sample=False,
    )

# Decode output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Batch Inference

# Process multiple images
images = [Image.open(f"doc_{i}.png").convert("RGB") for i in range(3)]
pixel_values = processor(images=images, return_tensors="pt").pixel_values

prompts = ["<image>Extract the text."] * len(images)
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    pixel_values=pixel_values.to(model.device, dtype=torch.bfloat16),
    max_new_tokens=512,
)

# Decode each generated sequence
responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)

Training

Stage 1: Projector Pre-training

The projector was trained to align SigLIP 2 visual features with Gemma 3 text embeddings:

Hyperparameter   Value
Epochs           3
Batch Size       32 (4 × 8 grad accum)
Learning Rate    2e-3
LR Scheduler     Cosine
Warmup Ratio     10%
Weight Decay     0.01
Precision        bf16
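
For illustration, the table above maps onto a transformers TrainingArguments configuration roughly as follows; this is an assumed reconstruction, since the original training script is not part of this repository.

from transformers import TrainingArguments

# Assumed Stage-1 settings, reconstructed from the hyperparameter table above
training_args = TrainingArguments(
    output_dir="./fx-nanovlm-stage1",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,     # effective batch size 4 x 8 = 32
    learning_rate=2e-3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,
)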

Training Data:

  • IAM Handwriting Dataset (~4.7K samples)
  • GNHK Dataset (~8.7K samples)
  • DocVQA (~18.6K samples)

Frozen Components:

  • Vision Encoder (SigLIP 2)
  • Language Model (Gemma 3)

Trainable Components:

  • Multi-modal Projector (2-layer MLP)
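
A minimal sketch of this freeze/train split is shown below. The vision_tower and multi_modal_projector attribute names are assumptions; check modeling_fx_nanovlm.py for the actual module names.

# Freeze the vision encoder and language model; train only the projector.
# Attribute names are assumptions (see modeling_fx_nanovlm.py).
for param in model.vision_tower.parameters():
    param.requires_grad = False
for param in model.language_model.parameters():
    param.requires_grad = False
for param in model.multi_modal_projector.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable / 1e6:.2f}M")  # expected ~1.15M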

Stage 2: LoRA Fine-tuning (Optional)

For task-specific fine-tuning, you can apply LoRA to the language model:

from peft import LoraConfig, get_peft_model

# LoRA on the attention projections of the Gemma 3 language model
lora_config = LoraConfig(
    r=16,                      # adapter rank
    lora_alpha=32,             # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wraps only the language model; the vision encoder and projector are untouched
model = get_peft_model(model.language_model, lora_config)

Model Files

fx_nanovlm/
├── README.md                    # This file (model card)
├── config.json                  # Model configuration
├── model.safetensors            # Model weights (~2.8GB)
├── configuration_fx_nanovlm.py  # Custom config class
├── modeling_fx_nanovlm.py       # Custom model class
├── tokenizer.json               # Tokenizer
├── tokenizer.model              # SentencePiece model
├── tokenizer_config.json        # Tokenizer config
├── special_tokens_map.json      # Special tokens
└── added_tokens.json            # Added tokens (<image>)

Limitations

  • Image Resolution: Fixed at 384×384 pixels
  • Language: Currently optimized for English
  • Context Length: Maximum sequence length of 512 tokens
  • Training Data: OCR-focused, may not generalize to all vision tasks

Technical Details

Special Tokens

Token     ID       Description
<image>   262145   Image placeholder token
<eos>     1        End of sequence
<pad>     0        Padding token
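
As a quick sanity check (using the tokenizer loaded in the Quick Start example), the <image> placeholder should map to the ID listed above:

# Verify the <image> placeholder token ID
image_token_id = tokenizer.convert_tokens_to_ids("<image>")
print(image_token_id)  # 262145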

Model Configuration

{
  "model_type": "fx_nanovlm",
  "vision_config": {
    "model_type": "siglip_vision_model",
    "hidden_size": 1152,
    "image_size": 384,
    "patch_size": 14,
    "num_attention_heads": 16,
    "num_hidden_layers": 27
  },
  "text_config": {
    "model_type": "gemma",
    "hidden_size": 640,
    "num_hidden_layers": 18,
    "vocab_size": 262146
  },
  "projector_hidden_act": "gelu",
  "image_token_index": 262145,
  "num_image_tokens": 729
}
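
Because the configuration class is custom, loading it programmatically requires trust_remote_code; a small illustrative check:

from transformers import AutoConfig

# Loads configuration_fx_nanovlm.py from the repository via trust_remote_code
config = AutoConfig.from_pretrained("Rohithcept/fx-nanovlm", trust_remote_code=True)
print(config.model_type)        # "fx_nanovlm"
print(config.num_image_tokens)  # 729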

Citation

If you use this model, please cite:

@misc{fx-nanovlm,
  author = {Rohith Reddy},
  title = {fx-nanovlm: A Nano Vision-Language Model for OCR},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Rohithcept/fx-nanovlm}
}

License

This model is released under the Apache 2.0 License.

Component Licenses

  • SigLIP 2: Apache 2.0 (Google)
  • Gemma 3: Gemma License (Google)

Acknowledgments

  • Team AI/ML, CEPT - Department of Post
  • Google for the SigLIP 2 and Gemma 3 models
  • HuggingFace for the transformers library
  • The document AI research community

Model Card Contact: CEPT [Department of Post - Govt. of India]
