CRAG-dual-encoder-base
CRAG: Causal Reasoning for Adversomics Graphs
This is the base model in the CRAG dual-encoder family for drug-adverse drug reaction (ADR) relation extraction. It uses a dual-encoder architecture with PubMedBERT to score drug-ADR pairs for causal pharmacovigilance graph construction.
Model Description
CRAG-dual-encoder-base is designed to identify causal relationships between drugs and adverse drug reactions from biomedical text. Given a drug mention and an ADR mention in context, the model predicts whether they share a causal relationship.
Architecture
┌─────────────────────────────────────────────────────────────┐
│                   CRAG Dual-Encoder Base                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│      Drug Context          ADR Context                      │
│           │                     │                           │
│           ▼                     ▼                           │
│      ┌──────────┐          ┌──────────┐                     │
│      │PubMedBERT│          │PubMedBERT│  (separate weights) │
│      │   Drug   │          │   ADR    │                     │
│      │ Encoder  │          │ Encoder  │                     │
│      └────┬─────┘          └────┬─────┘                     │
│           │                     │                           │
│           ▼                     ▼                           │
│      [CLS] Pool            [CLS] Pool                       │
│           │                     │                           │
│           └──────────┬──────────┘                           │
│                      │                                      │
│                      ▼                                      │
│               ┌──────────────┐                              │
│               │   Bilinear   │                              │
│               │    Fusion    │                              │
│               └──────┬───────┘                              │
│                      │                                      │
│                      ▼                                      │
│               ┌──────────────┐                              │
│               │   MLP Head   │                              │
│               │   (256→1)    │                              │
│               └──────┬───────┘                              │
│                      │                                      │
│                      ▼                                      │
│                  P(causal)                                  │
└─────────────────────────────────────────────────────────────┘
- Base Model: microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
- Hidden Dimension: 768
- Fusion Dimension: 256
- Parameters: ~220M (two separate BERT encoders)
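The fusion and head stages of the diagram can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the class name `FusionHead`, the low-rank form of the bilinear fusion (an element-wise product of two linear projections), and the ReLU and dropout choices are all assumptions. Only the 768 hidden size, the 256 fusion size and the 256→1 output come from this card. A full `nn.Bilinear(768, 768, 256)` layer would hold roughly 151M weights on its own, which seems hard to reconcile with the ~220M total, so the sketch uses a low-rank variant instead.

```python
import torch
import torch.nn as nn


class FusionHead(nn.Module):
    """Hypothetical fusion + classification head (names and layer choices are assumptions)."""

    def __init__(self, hidden_dim: int = 768, fusion_dim: int = 256, dropout: float = 0.1):
        super().__init__()
        # Project each pooled [CLS] vector into the shared fusion space.
        self.drug_proj = nn.Linear(hidden_dim, fusion_dim)
        self.adr_proj = nn.Linear(hidden_dim, fusion_dim)
        # Head mapping the fused 256-d vector to a single logit.
        self.head = nn.Sequential(nn.ReLU(), nn.Dropout(dropout), nn.Linear(fusion_dim, 1))

    def forward(self, drug_cls: torch.Tensor, adr_cls: torch.Tensor) -> torch.Tensor:
        # Low-rank bilinear interaction: element-wise product of the two projections.
        fused = self.drug_proj(drug_cls) * self.adr_proj(adr_cls)
        return self.head(fused).squeeze(-1)  # raw logits, shape (batch,)


torch.manual_seed(0)
head = FusionHead().eval()
drug_cls = torch.randn(4, 768)  # stand-ins for pooled drug-encoder outputs
adr_cls = torch.randn(4, 768)   # stand-ins for pooled ADR-encoder outputs
with torch.no_grad():
    p_causal = torch.sigmoid(head(drug_cls, adr_cls))
print(p_causal.shape)
```

The head returns logits rather than probabilities so that it can be paired with a numerically stable loss during training; the sigmoid is applied only at inference time.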
Training Procedure
The model was trained in two phases:
Phase 1: Contrastive Pre-training (3 epochs)
- InfoNCE loss with temperature τ=0.07
- Learns to bring true drug-ADR pairs close in embedding space
- Random negative sampling (mismatched pairs)
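The contrastive objective in Phase 1 can be illustrated with a short InfoNCE sketch. It assumes in-batch negatives (every other ADR in the batch serves as a mismatched pair for a given drug), L2-normalised embeddings, and a single drug-to-ADR direction; the card only states the loss, the temperature and the use of random mismatched pairs, so these details and the function name `info_nce_loss` are illustrative.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(drug_emb: torch.Tensor, adr_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch where row i of each tensor forms a true drug-ADR pair."""
    # Cosine similarity via L2-normalised dot products (normalisation is an assumption).
    drug_emb = F.normalize(drug_emb, dim=-1)
    adr_emb = F.normalize(adr_emb, dim=-1)
    logits = drug_emb @ adr_emb.T / temperature  # (batch, batch) similarity matrix
    # The matching ADR for drug i sits on the diagonal; other columns act as negatives.
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


torch.manual_seed(0)
drug = torch.randn(8, 256)
aligned = info_nce_loss(drug, drug + 0.01 * torch.randn(8, 256))
unrelated = info_nce_loss(drug, torch.randn(8, 256))
print(float(aligned), float(unrelated))  # the aligned batch should give the lower loss
```

A low temperature such as 0.07 sharpens the softmax, so small differences in cosine similarity translate into large differences in the loss.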
Phase 2: Classification Fine-tuning (5 epochs)
- Binary cross-entropy loss
- Balanced positive/negative samples
- Learning rate: 2e-5 with linear warmup
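A minimal sketch of the Phase 2 setup, under several assumptions: AdamW as the optimizer, a warmup of 10 steps, a constant rate after warmup, and a single linear layer standing in for the full dual encoder. The card specifies only binary cross-entropy, balanced samples and a 2e-5 learning rate with linear warmup.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def linear_warmup(warmup_steps: int):
    """Return an LR multiplier that ramps linearly from 0 to 1, then stays at 1."""
    def factor(step: int) -> float:
        return min(1.0, step / max(1, warmup_steps))
    return factor


torch.manual_seed(0)
model = nn.Linear(256, 1)           # stand-in for the full dual encoder
optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup(warmup_steps=10))
criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on raw logits

features = torch.randn(16, 256)         # stand-in fused representations
labels = torch.tensor([1.0, 0.0] * 8)   # balanced positives and negatives

for step in range(20):
    optimizer.zero_grad()
    loss = criterion(model(features).squeeze(-1), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()  # learning rate reaches 2e-5 once warmup completes

print(scheduler.get_last_lr())
```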
Training Data
- Dataset: ADE Corpus V2
- Configuration: Ade_corpus_v2_drug_ade_relation
- Training Examples: ~6,800 positive pairs + ~6,800 negative pairs
- Validation Examples: ~850 pairs
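One way to obtain a balanced set like the one described above is to pair each annotated drug with a randomly drawn effect that is not annotated for it. The sketch below is an assumption about how such negatives could be built, not the author's procedure; the helper name `build_balanced_pairs` is hypothetical, the `drug` and `effect` field names are assumed to mirror the dataset's relation configuration, and the three records are toy stand-ins rather than corpus rows.

```python
import random


def build_balanced_pairs(positives, seed=13, max_tries=100):
    """Pair each positive (drug, effect) with one mismatched negative drawn at random."""
    rng = random.Random(seed)
    known = {(p["drug"], p["effect"]) for p in positives}
    effects = sorted({p["effect"] for p in positives})
    examples = []
    for p in positives:
        examples.append({"drug": p["drug"], "effect": p["effect"], "label": 1})
        for _ in range(max_tries):
            effect = rng.choice(effects)
            # Reject candidates that are themselves annotated as causal pairs.
            if (p["drug"], effect) not in known:
                examples.append({"drug": p["drug"], "effect": effect, "label": 0})
                break
        # If no valid mismatch is found within max_tries, the positive stays unpaired.
    return examples


toy_positives = [
    {"drug": "aspirin", "effect": "gastrointestinal bleeding"},
    {"drug": "methotrexate", "effect": "hepatotoxicity"},
    {"drug": "lithium", "effect": "tremor"},
]
pairs = build_balanced_pairs(toy_positives)
print(sum(e["label"] for e in pairs), "positives of", len(pairs), "examples")
```

Random mismatches of this kind are usually easy to tell apart from true pairs, which is consistent with the card listing hard negatives as a feature of the stronger family members.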
Performance
| Metric | Value |
|---|---|
| F1 Score | 88.3% |
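For reference, F1 on a binary task of this kind is the harmonic mean of precision and recall over the positive (causal) class. The sketch below assumes a decision threshold of 0.5 on P(causal); the card does not state the threshold used, and the scores and labels are made-up illustrative values.

```python
def f1_score(probs, labels, threshold=0.5):
    """F1 for the positive class after thresholding predicted probabilities."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, y_hat in zip(labels, preds) if y == 1 and y_hat == 1)
    fp = sum(1 for y, y_hat in zip(labels, preds) if y == 0 and y_hat == 1)
    fn = sum(1 for y, y_hat in zip(labels, preds) if y == 1 and y_hat == 0)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly predicted positive
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative values only: two true positives, one false positive, one false negative.
print(f1_score([0.9, 0.8, 0.3, 0.6, 0.1], [1, 1, 1, 0, 0]))
```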
Comparison with CRAG Family
| Model | F1 | AUC | Key Features |
|---|---|---|---|
| CRAG-dual-encoder-base | 88.3% | - | PubMedBERT, random negatives |
| CRAG-dual-encoder-ade | 97.5% | 99.1% | BioLinkBERT, hard negatives, focal loss |
| CRAG-dual-encoder-mimicause | 98.9% | 99.8% | + MIMICause causal reasoning |
Usage
import torch
from transformers import AutoTokenizer, AutoModel
# Load model (custom architecture - need to define DualEncoderModel class)
# See training script for architecture definition
tokenizer = AutoTokenizer.from_pretrained("chrisvoncsefalvay/CRAG-dual-encoder-base")
# Example: Score a drug-ADR pair
drug_context = "Patient was prescribed aspirin for pain management."
adr_context = "The patient experienced gastrointestinal bleeding."
# Tokenize
drug_inputs = tokenizer(drug_context, return_tensors="pt", max_length=128, truncation=True, padding="max_length")
adr_inputs = tokenizer(adr_context, return_tensors="pt", max_length=128, truncation=True, padding="max_length")
# Forward pass (pseudo-code - requires loading custom model)
# drug_repr = model.encode_drug(**drug_inputs)
# adr_repr = model.encode_adr(**adr_inputs)
# score = model.classify(drug_repr, adr_repr)
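The commented-out calls above need a custom class. The sketch below shows what such a class might look like, using the method names from the pseudo-code (`encode_drug`, `encode_adr`, `classify`). It is a hypothetical stand-in: the attribute names, the low-rank fusion and the single-layer head are assumptions, and the released checkpoint's state-dict layout may differ, so consult the training script before loading real weights. To stay runnable offline, it uses a tiny randomly initialised `BertConfig` and dummy token ids in place of PubMedBERT and the tokenizer output.

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel


class DualEncoderModel(nn.Module):
    """Hypothetical stand-in for the custom class; real layer names may differ."""

    def __init__(self, config: BertConfig, fusion_dim: int = 256):
        super().__init__()
        # Two encoders with separate weights, as in the architecture diagram.
        self.drug_encoder = BertModel(config)
        self.adr_encoder = BertModel(config)
        self.drug_proj = nn.Linear(config.hidden_size, fusion_dim)
        self.adr_proj = nn.Linear(config.hidden_size, fusion_dim)
        self.head = nn.Linear(fusion_dim, 1)

    def encode_drug(self, **inputs):
        # [CLS] pooling: take the first token's final hidden state.
        return self.drug_encoder(**inputs).last_hidden_state[:, 0]

    def encode_adr(self, **inputs):
        return self.adr_encoder(**inputs).last_hidden_state[:, 0]

    def classify(self, drug_repr, adr_repr):
        fused = torch.relu(self.drug_proj(drug_repr) * self.adr_proj(adr_repr))
        return torch.sigmoid(self.head(fused)).squeeze(-1)  # P(causal)


# Tiny random config so the sketch runs offline (not the PubMedBERT configuration).
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64)
torch.manual_seed(0)
model = DualEncoderModel(config, fusion_dim=16).eval()

# Dummy token ids standing in for the tokenizer output.
drug_inputs = {"input_ids": torch.randint(0, 100, (1, 12)),
               "attention_mask": torch.ones(1, 12, dtype=torch.long)}
adr_inputs = {"input_ids": torch.randint(0, 100, (1, 12)),
              "attention_mask": torch.ones(1, 12, dtype=torch.long)}

with torch.no_grad():
    score = model.classify(model.encode_drug(**drug_inputs),
                           model.encode_adr(**adr_inputs))
print(score.shape)
```

With randomly initialised weights the score carries no meaning; the sketch only shows how the three calls fit together.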
Intended Uses
Primary Use Cases
- Pharmacovigilance: Automated extraction of drug-ADR relationships from literature
- Causal Graph Construction: Building drug-ADR knowledge graphs for safety analysis
- Literature Mining: Screening biomedical publications for adverse event reports
- Clinical Decision Support: Identifying potential drug safety signals
Out-of-Scope Uses
- Direct clinical decision-making without human review
- Diagnosis or treatment recommendations
- Processing non-English text
- Identifying drug-drug interactions (different task)
Limitations
- English Only: Trained exclusively on English biomedical text
- Domain Specific: Optimized for drug-ADR relationships; may not generalize to other biomedical relations
- Context Dependency: Requires both drug and ADR to be mentioned in related context
- Base Model Performance: This base version achieves 88.3% F1; consider using CRAG-dual-encoder-ade or CRAG-dual-encoder-mimicause for production use
Ethical Considerations
- Model predictions should be validated by domain experts before use in clinical or regulatory settings
- False negatives may miss important safety signals; false positives may trigger unnecessary reviews
- The model reflects biases present in the training data (ADE Corpus V2, sourced from MEDLINE)
Citation
@misc{crag-dual-encoder-2024,
title={CRAG: Causal Reasoning for Adversomics Graphs - Dual-Encoder Models for Drug-ADR Relation Extraction},
author={von Csefalvay, Chris},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/chrisvoncsefalvay/CRAG-dual-encoder-base}
}
Model Card Authors
Chris von Csefalvay (@chrisvoncsefalvay)
Model Card Contact
For questions or issues, please open a discussion on this model's repository or contact chris@chrisvoncsefalvay.com.