fx-nanovlm: Nano Vision-Language Model

This model was developed by the Center for Excellence in Postal Technology (CEPT), Department of Post, Government of India, as part of its efforts to develop and integrate AI/ML into service delivery for its customers.

Model Description

fx-nanovlm is a lightweight Vision-Language Model (VLM) designed for efficient OCR and document understanding tasks. It combines the powerful SigLIP 2 vision encoder with the compact Gemma 3 language model through a custom 2-layer MLP projector.

Architecture

Component        Model            Parameters   Description
Vision Encoder   SigLIP 2 So400m  ~400M        Extracts 1152-dim visual features
Projector        2-layer MLP      ~1.15M       Aligns vision→text embeddings
Language Model   Gemma 3 270M     ~270M        Text generation & reasoning
Total            -                ~680M        Optimized for edge deployment
Input Image (384×384)
        ↓
┌────────────────────────────┐
│   SigLIP 2 So400m          │ ← Vision Encoder (Frozen)
│   Output: [B, 729, 1152]   │
└────────────────────────────┘
        ↓
┌────────────────────────────┐
│   2-Layer MLP Projector    │ ← Trained to align modalities
│   1152 → 640 → 640         │
└────────────────────────────┘
        ↓
┌────────────────────────────┐
│   Gemma 3 270M-IT          │ ← Language Model
│   18 layers, 640 hidden    │
└────────────────────────────┘
        ↓
    Output Text
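
The projector is the only newly trained component. The sketch below illustrates the 1152 → 640 → 640 mapping with a GELU activation (as listed in the model configuration); the class and attribute names are illustrative, not the actual ones used in modeling_fx_nanovlm.py.

import torch
import torch.nn as nn

# Illustrative sketch of the 2-layer MLP projector (names are assumptions;
# the real implementation lives in modeling_fx_nanovlm.py).
class ProjectorSketch(nn.Module):
    def __init__(self, vision_dim: int = 1152, text_dim: int = 640):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, text_dim)  # 1152 -> 640
        self.act = nn.GELU()                        # projector_hidden_act = "gelu"
        self.fc2 = nn.Linear(text_dim, text_dim)    # 640 -> 640

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: [B, 729, 1152] patch features from SigLIP 2
        return self.fc2(self.act(self.fc1(image_features)))  # [B, 729, 640]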

Key Features

  • 🚀 Lightweight: Only ~680M total parameters
  • 📖 OCR Optimized: Trained on handwriting and document datasets
  • ⚡ Efficient: Designed for edge deployment and fast inference
  • 🔧 Modular: Easy to extend with LoRA fine-tuning

Intended Uses

  • Document OCR: Extract text from scanned documents
  • Handwriting Recognition: Read handwritten text from images
  • Document Q&A: Answer questions about document content
  • Text Extraction: General-purpose text extraction from images
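
The prompt format follows the Quick Start example below: the <image> placeholder comes first, followed by the instruction. The prompts here are hypothetical illustrations of how each use case might be phrased.

# Hypothetical prompt templates for the intended uses above
prompts = {
    "document_ocr":    "<image>Extract the text from this image.",
    "handwriting":     "<image>Transcribe the handwritten text in this image.",
    "document_qa":     "<image>What is the total amount on this invoice?",
    "text_extraction": "<image>List all text visible in this image.",
}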

Usage

Installation

pip install transformers torch pillow
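
The optional LoRA fine-tuning example in the Training section additionally requires peft:

pip install peft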

Quick Start

import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor

# Import custom model classes
from modeling_fx_nanovlm import FxNanoVLMForConditionalGeneration
from configuration_fx_nanovlm import FxNanoVLMConfig

# Load model and tokenizer
model = FxNanoVLMForConditionalGeneration.from_pretrained(
    "Rohithcept/fx-nanovlm",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "Rohithcept/fx-nanovlm",
    trust_remote_code=True,
)

processor = AutoProcessor.from_pretrained(
    "google/siglip2-so400m-patch14-384"
)

# Load and process image
image = Image.open("document.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(model.device, dtype=torch.bfloat16)

# Create prompt
prompt = "<image>Extract the text from this image."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=pixel_values,
        max_new_tokens=512,
        do_sample=False,
    )

# Decode output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Batch Inference

# Process multiple images
images = [Image.open(f"doc_{i}.png").convert("RGB") for i in range(3)]
pixel_values = processor(images=images, return_tensors="pt").pixel_values

prompts = ["<image>Extract the text."] * len(images)
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    pixel_values=pixel_values.to(model.device, dtype=torch.bfloat16),
    max_new_tokens=512,
)

# Decode each generated sequence
responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)

Training

Stage 1: Projector Pre-training

The projector was trained to align SigLIP 2 visual features with Gemma 3 text embeddings:

Hyperparameter   Value
Epochs           3
Batch Size       32 (4 × 8 grad accum)
Learning Rate    2e-3
LR Scheduler     Cosine
Warmup Ratio     10%
Weight Decay     0.01
Precision        bf16
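
For illustration, the table above maps onto a transformers TrainingArguments configuration roughly as follows; this is an assumed reconstruction, since the original training script is not part of this repository.

from transformers import TrainingArguments

# Assumed Stage-1 settings, reconstructed from the hyperparameter table above
training_args = TrainingArguments(
    output_dir="./fx-nanovlm-stage1",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,     # effective batch size 4 x 8 = 32
    learning_rate=2e-3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,
)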

Training Data:

  • IAM Handwriting Dataset (~4.7K samples)
  • GNHK Dataset (~8.7K samples)
  • DocVQA (~18.6K samples)

Frozen Components:

  • Vision Encoder (SigLIP 2)
  • Language Model (Gemma 3)

Trainable Components:

  • Multi-modal Projector (2-layer MLP)
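
A minimal sketch of this freeze/train split is shown below. The vision_tower and multi_modal_projector attribute names are assumptions; check modeling_fx_nanovlm.py for the actual module names.

# Freeze the vision encoder and language model; train only the projector.
# Attribute names are assumptions (see modeling_fx_nanovlm.py).
for param in model.vision_tower.parameters():
    param.requires_grad = False
for param in model.language_model.parameters():
    param.requires_grad = False
for param in model.multi_modal_projector.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable / 1e6:.2f}M")  # expected ~1.15M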

Stage 2: LoRA Fine-tuning (Optional)

For task-specific fine-tuning, you can apply LoRA to the language model:

from peft import LoraConfig, get_peft_model

# LoRA on the attention projections of the Gemma 3 language model
lora_config = LoraConfig(
    r=16,                      # adapter rank
    lora_alpha=32,             # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wraps only the language model; the vision encoder and projector are untouched
model = get_peft_model(model.language_model, lora_config)

Model Files

fx_nanovlm/
├── README.md                    # This file (model card)
├── config.json                  # Model configuration
├── model.safetensors            # Model weights (~2.8GB)
├── configuration_fx_nanovlm.py  # Custom config class
├── modeling_fx_nanovlm.py       # Custom model class
├── tokenizer.json               # Tokenizer
├── tokenizer.model              # SentencePiece model
├── tokenizer_config.json        # Tokenizer config
├── special_tokens_map.json      # Special tokens
└── added_tokens.json            # Added tokens (<image>)

Limitations

  • Image Resolution: Fixed at 384×384 pixels
  • Language: Currently optimized for English
  • Context Length: Maximum sequence length of 512 tokens
  • Training Data: OCR-focused, may not generalize to all vision tasks

Technical Details

Special Tokens

Token     ID       Description
<image>   262145   Image placeholder token
<eos>     1        End of sequence
<pad>     0        Padding token
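
As a quick sanity check (using the tokenizer loaded in the Quick Start example), the <image> placeholder should map to the ID listed above:

# Verify the <image> placeholder token ID
image_token_id = tokenizer.convert_tokens_to_ids("<image>")
print(image_token_id)  # 262145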

Model Configuration

{
  "model_type": "fx_nanovlm",
  "vision_config": {
    "model_type": "siglip_vision_model",
    "hidden_size": 1152,
    "image_size": 384,
    "patch_size": 14,
    "num_attention_heads": 16,
    "num_hidden_layers": 27
  },
  "text_config": {
    "model_type": "gemma",
    "hidden_size": 640,
    "num_hidden_layers": 18,
    "vocab_size": 262146
  },
  "projector_hidden_act": "gelu",
  "image_token_index": 262145,
  "num_image_tokens": 729
}
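
Because the configuration class is custom, loading it programmatically requires trust_remote_code; a small illustrative check:

from transformers import AutoConfig

# Loads configuration_fx_nanovlm.py from the repository via trust_remote_code
config = AutoConfig.from_pretrained("Rohithcept/fx-nanovlm", trust_remote_code=True)
print(config.model_type)        # "fx_nanovlm"
print(config.num_image_tokens)  # 729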

Citation

If you use this model, please cite:

@misc{fx-nanovlm,
  author = {Rohith Reddy},
  title = {fx-nanovlm: A Nano Vision-Language Model for OCR},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Rohithcept/fx-nanovlm}
}

License

This model is released under the Apache 2.0 License.

Component Licenses

  • SigLIP 2: Apache 2.0 (Google)
  • Gemma 3: Gemma License (Google)

Acknowledgments

  • Team AI/ML, CEPT - Department of Post
  • Google for the SigLIP 2 and Gemma 3 models
  • HuggingFace for the transformers library
  • The document AI research community

Model Card Contact: CEPT [Department of Post - Govt. of India]
