ViDoRe V3: a comprehensive evaluation of retrieval for enterprise use-cases Article • QuentinJG • Nov 5, 2025 • 64
Introducing RTEB: A New Standard for Retrieval Evaluation Article • fzliu, KennethEnevoldsen, Samoed, isaacchung, tomaarsen, fzoll • Oct 1, 2025 • 144
ModernVBERT: Towards Smaller Visual Document Retrievers Paper • 2510.01149 • Published Oct 1, 2025 • 33
Finally, a Replacement for BERT: Introducing ModernBERT Article • bwarner, NohTow, bclavie, orionweller, ohallstrom, staghado, alexisgallagher, rbiswasfc, fladhak, tomaarsen, ncoop57, griffin, jph00, johnowhitaker, iacolippo • Dec 19, 2024 • 740
Ettin Suite: SoTA Paired Encoders and Decoders Article • orionweller, kdricci, mmarone, NohTow, dlawrie, vandurme • Jul 16, 2025 • 80
SmolLM3: smol, multilingual, long-context reasoner Article • eliebak, cmpatino, anton-l, edbeeching, m-ric, nouamanetazi, akseljoonas, guipenedo, hynky, clefourrier, SaylorTwift, kashif, qgallouedec, hlarcher, glutamatt, Xenova, reach-vb, ngxson, craffel, lewtun, loubnabnl, lvwerra, thomwolf • Jul 8, 2025 • 775
Should We Still Pretrain Encoders with Masked Language Modeling? Paper • 2507.00994 • Published Jul 1, 2025 • 81
Efficient LLM Pretraining: Packed Sequences and Masked Attention Article • sirluk • Oct 7, 2024 • 71
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion Paper • 2503.11576 • Published Mar 14, 2025 • 158
DeepSearch Using Visual RAG in Agentic Frameworks 🔎 Article • paultltc • Mar 21, 2025 • 38
ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval Article • manu • Mar 18, 2025 • 16
SmolVLM Grows Smaller – Introducing the 256M & 500M Models! Article • andito, mfarre, merve • Jan 23, 2025 • 192
SmolVLM - small yet mighty Vision Language Model Article • andito, merve, mfarre, eliebak, pcuenq • Nov 26, 2024 • 417
Introducing smolagents: simple agents that write actions in code. Article • m-ric, merve, thomwolf • Dec 31, 2024 • 1.19k
RegMix: Data Mixture as Regression for Language Model Pre-training Paper • 2407.01492 • Published Jul 1, 2024 • 41
Parallel Sentences Datasets Collection • Each dataset in this collection has "english" and "non_english" columns. They can be used to make embedding models multilingual. • 14 items • Updated Dec 10, 2025 • 23