Inference and usage

#3
by YsK-dev - opened

First of all, congratulations on the release of Infinity-Parser2-Pro! The performance metrics on olmOCR-Bench are impressive, especially the jump in throughput thanks to the MoE architecture.

However, as this is a large model, many of us in the community are looking for the best way to deploy this for practical use. To help us bridge the gap between "great results" and "production usage," could you clarify a few technical points?

Questions

  1. Official Quantization Recommendations: For those of us with limited VRAM (e.g., 24GB or 48GB), do you recommend a specific quantization method to preserve parsing accuracy? Have you tested AWQ, GPTQ, or GGUF (via llama.cpp) with this specific MoE architecture?
  2. Vision Encoder Precision: Does the vision tower require higher precision (FP16/BF16) to maintain document OCR quality, or is it safe to quantize the entire model (including the vision weights) to 4-bit/8-bit?
  3. vLLM Support: Are there specific flags or configuration settings needed to run this optimally in vLLM? In particular, how should --max-model-len be set to support the 64k context, and does the MoE expert routing require any special configuration?
  4. Usage Example: The current README is light on minimal inference code. Could you provide a simple "Hello World" snippet for parsing a single image/PDF to Markdown using the transformers or vLLM library?

Hardware Environment

I am planning to run this on:

  • GPU: Kaggle/Colab free tier
  • Target Task: Handwritten Turkish text, scanned PDFs

Thank you for your hard work on this model. Looking forward to your guidance!

Thank you for your interest in our project! Here are the answers to your questions:

  • Q1 & Q2: Our model is fine-tuned from the Qwen3.5 architecture. For practical deployment, we recommend GPTQ-Int4 or FP8 quantization. We haven't officially tested these techniques on our end yet, but the official Qwen repositories provide detailed quantization guides (https://huggingface.co/Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 and https://huggingface.co/Qwen/Qwen3.5-35B-A3B-FP8). Additionally, please stay tuned for our upcoming lightweight parsing model, Infinity-Parser2-Flash: a 2B dense model designed specifically for environments with limited computing resources, to be released in the near future.
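For the 24 GB / 48 GB question, a rough back-of-the-envelope helps pick a precision. This is only a sketch: it assumes the ~35B total parameter count implied by the referenced Qwen checkpoints, counts weight memory only, and adds a flat 20% allowance for activations and KV cache (real usage depends heavily on context length and batch size):

```python
def vram_estimate_gb(n_params: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    # Weight memory in GB, plus a rough 20% allowance for activations,
    # KV cache, and runtime buffers. Treat the result as a lower bound.
    return n_params * bits_per_weight / 8 / 1e9 * overhead

N = 35e9  # assumed total parameter count for a 35B-A3B MoE checkpoint
for name, bits in [("BF16", 16), ("FP8", 8), ("GPTQ-Int4", 4)]:
    print(f"{name}: ~{vram_estimate_gb(N, bits):.0f} GB")
```

Under these assumptions, only the Int4 checkpoint (~21 GB) comes close to a 24 GB card, FP8 (~42 GB) targets 48 GB, and BF16 (~84 GB) needs multiple GPUs; a Kaggle/Colab free-tier GPU (~16 GB) would likely need to wait for the announced 2B Flash model.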

  • Q3: The provided vLLM configurations represent our best practices for document parsing. For instance, --max-model-len is set to 64k to handle documents with dense, long text (such as magazines and newspapers). Feel free to adjust these parameters to suit your specific parsing scenarios.
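As a concrete starting point, a launch command along these lines reflects the settings above. The checkpoint path is a placeholder, and the memory flag is a common default rather than an official recommendation:

```shell
# Sketch only: replace the model path with the actual checkpoint.
vllm serve <path-or-hub-id-of-Infinity-Parser2-Pro> \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.90
```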

  • Q4: As you suggested, we have added a minimal "Hello World" snippet to the Quick Start section of our README.
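For readers who land on this thread before the README, here is a minimal sketch of the client side, assuming the model is served through vLLM's OpenAI-compatible endpoint. The model name and prompt text are illustrative, not official:

```python
import base64

def build_parse_request(image_bytes: bytes, model: str = "infinity-parser2-pro") -> dict:
    # Encode a page image as a base64 data URL, the format accepted by
    # OpenAI-compatible chat endpoints such as a local vLLM server.
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    {"type": "text",
                     "text": "Convert this document page to Markdown."},
                ],
            }
        ],
        "max_tokens": 4096,
    }

# Sending it with the official OpenAI client pointed at a local vLLM server:
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# resp = client.chat.completions.create(
#     **build_parse_request(open("page.png", "rb").read()))
# print(resp.choices[0].message.content)
```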

Thanks again for your valuable feedback! Feel free to let us know if you have any further questions.
