The University of Edinburgh's Submission to the WMT22 Code-Mixing Shared Task (MixMT). arXiv:2210.11309, published Oct 20, 2022.
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies. arXiv:2503.10267, published Mar 13, 2025.
Tokenizer Choice For LLM Training: Negligible or Crucial? arXiv:2310.08754, published Oct 12, 2023.
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings. arXiv:2202.06671, published Feb 14, 2022.
Specialized Document Embeddings for Aspect-based Similarity of Research Papers. arXiv:2203.14541, published Mar 28, 2022.
Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning. arXiv:2301.09626, published Jan 23, 2023.
MMTEB: Massive Multilingual Text Embedding Benchmark. arXiv:2502.13595, published Feb 19, 2025.
Towards Best Practices for Open Datasets for LLM Training. arXiv:2501.08365, published Jan 14, 2025.
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus. arXiv:2406.08707, published Jun 13, 2024.
Semi-automatic staging area for high-quality structured data extraction from scientific literature. arXiv:2309.10923, published Sep 19, 2023.
Mining experimental data from Materials Science literature with Large Language Models: an evaluation study. arXiv:2401.11052, published Jan 19, 2024.
SuperMat: Construction of a linked annotated dataset from superconductors-related publications. arXiv:2101.02455, published Jan 7, 2021.
Automatic extraction of materials and properties from superconductors scientific literature. arXiv:2210.15600, published Oct 26, 2022.