OpusTranslate Collection Collection of tiny models for the OpusTranslate mobile phone application. • 10 items • Updated 3 days ago • 2
Article GGML and llama.cpp join HF to ensure the long-term progress of Local AI +4 21 days ago • 483
LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR Paper • 2601.14251 • Published Jan 20 • 25
Article LightOnOCR-1B: The Case for End-to-End and Efficient Domain-Specific Vision-Language Models for OCR Oct 23, 2025 • 73
Article Transformers v5: Simple model definitions powering the AI ecosystem +2 Dec 1, 2025 • 305
SHAMIYAT: A Collection of Syrian Dialect Datasets & LLMs Collection A collection of datasets and language models focused on the Syrian dialect, supporting NLP research and applications for Syria • 4 items • Updated Nov 28, 2025 • 2
Article How to train a new language model from scratch using Transformers and Tokenizers Feb 14, 2020 • 60
Yiddish Whisper Training Collection Yiddish-based Whisper post-training - Crowdsourced Open Data • 10 items • Updated 10 days ago • 4
Scaling Low-Res MT via Synthetic Data Generation with LLMs Collection Synthetic baselines trained for our paper "Scaling Low-Resource MT via Synthetic Data Generation with LLMs", accepted as a main conference paper at EMNLP 2025. • 8 items • Updated Sep 16, 2025 • 1
Scaling Low-Resource MT via Synthetic Data Generation with LLMs Paper • 2505.14423 • Published May 20, 2025 • 2
DictaBERT Collection Collection of state-of-the-art language models for Hebrew, finetuned for various tasks, as detailed in the article: https://arxiv.org/abs/2308.16687 • 17 items • Updated Apr 4, 2024 • 6
Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction Paper • 2411.17835 • Published Nov 19, 2024 • 4