Releasing the largest multilingual open pretraining dataset • Pclanglais • Nov 13, 2024
The case for specialized pre-training: ultra-fast foundation models for dedicated tasks • Pclanglais • Aug 4, 2024
Announcing Finance Commons and the Bad Data Toolbox: Pioneering Open Data and Advanced Document Processing • Pclanglais • Jul 19, 2024
Post-OCR-Correction: 1 billion words dataset of automated OCR correction by LLM • Pclanglais • Apr 26, 2024
Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data • Pclanglais • Apr 18, 2024
Releasing Common Corpus: the largest public domain dataset for training LLMs • Pclanglais • Mar 20, 2024