AI & ML interests
Historical Media Analysis and Enrichment
Impresso - Media Monitoring of the Past is an interdisciplinary research project that uses machine learning to pursue a paradigm shift in the processing, semantic enrichment, representation, exploration, and study of historical media across modalities and across temporal, linguistic, and national borders. We develop the 🚀 Impresso Web App and the 🔬 Impresso Datalab (coming soon), which provide search, exploratory analysis, and programmatic access to an unprecedented corpus of multilingual historical newspaper and radio broadcast collections. Our work sits at the intersection of Natural Language Processing, Design, and History.
We share:
- 🤖 Impresso models tailored for historical, multilingual documents, covering language identification, OCR quality assessment, topic inference, named entity recognition (NER), and named entity linking (NEL).
- 📚 Impresso datasets curated from digitized historical media sources, designed to support ML development and evaluation. Datasets are currently in preparation and will soon be released, including a NER and NEL benchmark developed as part of the HIPE evaluation campaign, an image type classification dataset, and more.
Impresso gratefully acknowledges the continued support of its cultural heritage 🏛️ partners as well as funding from the SNSF (Grant Nos. CRSII5_173719 and CRSII5_213585) and the FNR (Grant No. 17498891).
Spaces (10)
Ocrqa Exploration
OCR Quality Exploration on Impresso Corpus
Multilingual Named Entity Recognition
Multilingual Named Entity Recognition in Historical Data
Multilingual Entity Linking
Multilingual Entity Linking for Historical Data
Floret Word Embedding Search
Search for similar words using word embeddings
OCR Quality Assessment Demo
Measure the OCR output quality using word recognition ratios
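
To illustrate what a word recognition ratio measures, here is a minimal sketch. It assumes the ratio is the share of tokens in a text that appear in a reference lexicon; the actual Impresso demo is not described here and may tokenize, normalize, or look up words differently. The function name `word_recognition_ratio` and the toy lexicon are hypothetical.

```python
import re


def word_recognition_ratio(text: str, lexicon: set) -> float:
    """Return the fraction of alphabetic tokens in `text` found in `lexicon`.

    Assumption: matching is case-insensitive and only runs of letters count
    as tokens; digits and punctuation are ignored.
    """
    # [^\W\d_]+ matches runs of Unicode letters (word chars minus digits and "_").
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in lexicon)
    return known / len(tokens)


# Hypothetical toy lexicon; a real one would come from a large, clean word list.
lexicon = {"the", "council", "met", "on", "monday", "morning"}

clean = "The council met on Monday morning"
noisy = "Tbe couneil met on Monday rnorning"  # typical OCR confusions: h/b, c/e, m/rn

print(word_recognition_ratio(clean, lexicon))
print(word_recognition_ratio(noisy, lexicon))
```

A higher ratio suggests cleaner OCR output. With such a small lexicon the score is only indicative, since rare words and proper names would be counted as unrecognized.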