Training mRNA Language Models Across 25 Species for $165
We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, CodonRoBERTa-large-v2 emerged as the clear winner with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that no other open-source project offers. Complete results, architectural decisions, and runnable code below.
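For readers unfamiliar with codon-level language modeling: the model's tokens are codons (nucleotide triplets) rather than characters or words. A minimal sketch of how a coding sequence might be split into codon tokens; the vocabulary, special tokens, and function names here are illustrative assumptions, not the project's actual code.

```python
# Hypothetical codon tokenizer sketch -- names are illustrative, not from the release.
from itertools import product

# Vocabulary: all 64 codons plus a few special tokens.
CODONS = ["".join(c) for c in product("ACGU", repeat=3)]
VOCAB = {tok: i for i, tok in enumerate(["<pad>", "<cls>", "<mask>"] + CODONS)}

def tokenize_cds(seq: str) -> list[int]:
    """Split a coding sequence into codon token ids (length must be a multiple of 3)."""
    seq = seq.upper().replace("T", "U")  # accept DNA input, work in mRNA space
    assert len(seq) % 3 == 0, "CDS length must be divisible by 3"
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return [VOCAB["<cls>"]] + [VOCAB[c] for c in codons]

ids = tokenize_cds("ATGGCCAAA")  # Met-Ala-Lys -> [<cls>, AUG, GCC, AAA]
```

Working at the codon level keeps the vocabulary tiny (67 tokens here) while aligning tokens with the biological unit that codon optimization actually operates on.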
We annotated 119K medical images with two frontier VLMs (Qwen 3.5, Kimi K2.5), cross-validated at 93% agreement, and produced 110K training records, all for under $500. Fine-tuning 3 small models (2-3B params) improved all benchmarks: the best model reaches a +15.0% average exact-match gain.
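The cross-validation step boils down to label-level agreement between the two annotators, keeping only records where they concur. A minimal sketch with toy labels (illustrative data, not the actual pipeline):

```python
# Sketch of dual-annotator cross-validation -- toy labels, illustrative only.

def agreement_rate(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of records where both annotators produced the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Two of three toy records agree -> rate of 2/3.
rate = agreement_rate(["cyst", "tumor", "cyst"], ["cyst", "tumor", "nodule"])
```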
Everything is open-sourced: datasets, adapters, and code.
DNA, mRNA, proteins, AI. I spent the last year going deep into computational biology as an ML engineer. This is Part I of what I found. 🧬
In 2024, AlphaFold won the Nobel Prize in Chemistry.
By 2026, the open-source community had built alternatives that outperform it.
That's the story I find most interesting about protein AI right now. Not just the science (which is incredible), but the speed at which open-source caught up. Multiple teams, independently, reproduced and then exceeded AlphaFold 3's accuracy with permissive licenses. The field went from prediction to generation: we're not just modeling known proteins anymore, we're designing new ones.
I spent months mapping this landscape for ML engineers. What the architectures actually are (spoiler: transformers and diffusion models), which tools to use for what, and which ones you can actually ship commercially.
Today I am releasing 105 open-source models for Personally Identifiable Information (PII) detection in French, German, and Italian.
All Apache 2.0 licensed. Free for commercial use. No restrictions.
Performance:
- French: 97.97% F1 (top model)
- German: 97.61% F1 (top model)
- Italian: 97.28% F1 (top model)
All top-10 models per language exceed 96% F1
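For reference, NER F1 scores like these are typically computed at the entity level: a prediction counts only if span and type both match the gold annotation exactly. A minimal sketch with toy spans (not real evaluation data):

```python
# Entity-level F1 sketch: entities are (start, end, type) tuples; exact match only.

def entity_f1(gold: set, pred: set) -> float:
    tp = len(gold & pred)  # true positives: exact span-and-type matches
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {(0, 5, "NAME"), (10, 25, "NSS"), (30, 42, "ADDRESS")}
pred = {(0, 5, "NAME"), (10, 25, "NSS"), (50, 60, "PHONE")}
score = entity_f1(gold, pred)  # 2 of 3 correct on each side -> F1 = 2/3
```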
Coverage:
- 55+ PII entity types per language
- Native ID formats: NSS (French), Sozialversicherungsnummer (German), Codice Fiscale (Italian)
- Language-specific address, phone, and name patterns
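To make the native ID formats concrete, here are simplified surface patterns for the three identifiers named above. These regexes are illustrative only: the released models learn such formats from data rather than using hand-written rules, and real validation also involves check digits the patterns below ignore.

```python
import re

# Simplified sketches of the native ID formats -- not the models' actual logic.
PATTERNS = {
    # French NSS: 15 digits (sex, birth year/month, location, serial, 2-digit key)
    "NSS": re.compile(r"\b[12]\d{14}\b"),
    # German Sozialversicherungsnummer: area + birth date digits, initial, serial
    "SVNR": re.compile(r"\b\d{8}[A-Z]\d{3}\b"),
    # Italian Codice Fiscale: 16 characters mixing letters and digits
    "CODICE_FISCALE": re.compile(r"\b[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]\b"),
}

def find_ids(text: str):
    """Return (format_name, matched_string) pairs found in the text."""
    return [(name, m.group()) for name, rx in PATTERNS.items() for m in rx.finditer(text)]

hits = find_ids("CF RSSMRA85T10A562S, NSS 190057512345678")  # both synthetic examples
```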
European healthcare operates in European languages. Clinical notes, patient records, and medical documents are generated in French, German, Italian, and other languages.
Effective de-identification requires:
- Native language understanding: not translation
- Local ID format recognition: each country has unique patterns
- Cultural context awareness: names, addresses, and formats vary

These models deliver production-ready accuracy without requiring data to leave your infrastructure or language.
HIPAA & GDPR Compliance

Built for US and European privacy regulations:
- On-premise deployment: Process data locally with zero external dependencies
- Data sovereignty: No API calls, no cloud services, no cross-border transfers
- Air-gapped capable: Deploy in fully isolated environments if required
- Regulatory-grade accuracy: Supporting Expert Determination standards

HIPAA and GDPR compliance across languages, without compliance gaps.
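The de-identification step itself is simple once a local NER model has produced entity spans: each PII span is replaced with a typed placeholder, entirely on-premise. A minimal sketch; the span dictionaries mimic Hugging Face token-classification output, and the toy note and entity names are illustrative.

```python
# On-premise de-identification sketch: replace detected PII spans with placeholders.
# The entity dicts mirror Hugging Face token-classification output format.

def deidentify(text: str, entities: list[dict]) -> str:
    """Replace each (start, end) span with [TYPE], right to left to keep offsets valid."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

note = "Patientin Anna Meier, SVNR 12180595M123."
ents = [
    {"entity_group": "NAME", "start": 10, "end": 20},
    {"entity_group": "SVNR", "start": 27, "end": 39},
]
clean = deidentify(note, ents)  # -> "Patientin [NAME], SVNR [SVNR]."
```

Processing spans right to left is the key detail: earlier character offsets stay valid while later spans are rewritten.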
Use Cases

- Hospital EHR systems: Automated patient record de-identification
- Clinical research: Multilingual dataset preparation for studies
- Insurance companies: Claims processing across
🚨 Day 8/8: OpenMed Medical Reasoning Dataset Release - THE GRAND FINALE
Today I complete my 8-day release series with Medical-Reasoning-SFT-Mega: the largest open medical reasoning dataset, combining outputs from 7 state-of-the-art AI models with fair-distribution deduplication.
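The post doesn't spell out what "fair-distribution deduplication" means, so here is one plausible reading, sketched as code: drop exact duplicate questions, and when the same question comes from several source models, keep the copy from whichever model has contributed the fewest records so far. Everything below (function, field names, toy data) is an assumption for illustration, not the actual pipeline.

```python
# Speculative sketch of fair-distribution deduplication -- illustrative only.
from collections import Counter
from hashlib import sha256

def fair_dedup(records: list[dict]) -> list[dict]:
    # Group records by a normalized hash of the question text.
    by_hash: dict[str, list[dict]] = {}
    for rec in records:
        h = sha256(rec["question"].strip().lower().encode()).hexdigest()
        by_hash.setdefault(h, []).append(rec)
    kept, counts = [], Counter()
    for group in by_hash.values():
        # Among duplicates, prefer the least-represented source model so far.
        pick = min(group, key=lambda r: counts[r["model"]])
        kept.append(pick)
        counts[pick["model"]] += 1
    return kept

data = [
    {"question": "Define sepsis.", "model": "A"},
    {"question": "What causes anemia?", "model": "A"},
    {"question": "what causes anemia? ", "model": "B"},  # duplicate after normalization
]
deduped = fair_dedup(data)  # keeps 2 records, one per model
```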
deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️
> pretty insane it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding
> very efficient vision-token-to-performance ratio
> covers 100 languages
IBM just released a small Swiss Army knife for document models: granite-docling-258M on Hugging Face 🔥
> not only a document converter: it can also do document question answering and understands multiple languages 🤯
> best part: released with Apache 2.0 license 👏 use it with your commercial projects!
> it supports transformers, vLLM and MLX from the get-go! 🤗
> built on SigLIP2 & granite-165M