CRAG-dual-encoder-base

CRAG: Causal Reasoning for Adversomics Graphs

This is the base model in the CRAG dual-encoder family for extracting relations between drugs and adverse drug reactions (ADRs). It uses a dual-encoder architecture built on PubMedBERT to score drug-ADR pairs for causal pharmacovigilance graph construction.

Model Description

CRAG-dual-encoder-base is designed to identify causal relationships between drugs and adverse drug reactions from biomedical text. Given a drug mention and an ADR mention in context, the model predicts whether they share a causal relationship.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    CRAG Dual-Encoder Base                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Drug Context          ADR Context                         │
│        │                     │                              │
│        ▼                     ▼                              │
│  ┌──────────┐          ┌──────────┐                         │
│  │PubMedBERT│          │PubMedBERT│    (separate weights)   │
│  │  Drug    │          │   ADR    │                         │
│  │ Encoder  │          │ Encoder  │                         │
│  └────┬─────┘          └────┬─────┘                         │
│       │                     │                               │
│       ▼                     ▼                               │
│  [CLS] Pool            [CLS] Pool                           │
│       │                     │                               │
│       └────────┬────────────┘                               │
│                │                                            │
│                ▼                                            │
│        ┌──────────────┐                                     │
│        │   Bilinear   │                                     │
│        │   Fusion     │                                     │
│        └──────┬───────┘                                     │
│               │                                             │
│               ▼                                             │
│        ┌──────────────┐                                     │
│        │  MLP Head    │                                     │
│        │  (256→1)     │                                     │
│        └──────┬───────┘                                     │
│               │                                             │
│               ▼                                             │
│           P(causal)                                         │
└─────────────────────────────────────────────────────────────┘
  • Base Model: microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
  • Hidden Dimension: 768
  • Fusion Dimension: 256
  • Parameters: ~220M (two separate BERT encoders)
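
The fusion and head stages of the diagram can be sketched as follows. This is a minimal sketch under stated assumptions, not the released implementation: the class name `BilinearFusionHead`, the low-rank form of the bilinear interaction (two linear projections followed by an element-wise product), and the ReLU are all assumptions. Only the dimensions (768, 256, 1) come from this card. A full `torch.nn.Bilinear(768, 768, 256)` layer would hold about 151M weights on its own, which would not fit the ~220M total listed above, so a low-rank form seems the more plausible reading.

```python
import torch
from torch import nn


class BilinearFusionHead(nn.Module):
    """Hypothetical low-rank bilinear fusion followed by an MLP head (256 -> 1)."""

    def __init__(self, hidden_dim: int = 768, fusion_dim: int = 256):
        super().__init__()
        # One projection per encoder output (assumed, not confirmed by the card).
        self.drug_proj = nn.Linear(hidden_dim, fusion_dim)
        self.adr_proj = nn.Linear(hidden_dim, fusion_dim)
        self.mlp = nn.Sequential(nn.ReLU(), nn.Linear(fusion_dim, 1))

    def forward(self, drug_vec: torch.Tensor, adr_vec: torch.Tensor) -> torch.Tensor:
        # The element-wise product of the projections is a low-rank bilinear form.
        fused = self.drug_proj(drug_vec) * self.adr_proj(adr_vec)
        logit = self.mlp(fused).squeeze(-1)
        return torch.sigmoid(logit)  # P(causal) per pair


head = BilinearFusionHead()
drug_vec = torch.randn(4, 768)  # stand-ins for pooled [CLS] vectors
adr_vec = torch.randn(4, 768)
p_causal = head(drug_vec, adr_vec)  # one probability per pair, shape (4,)
```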

Training Procedure

The model was trained in two phases:

Phase 1: Contrastive Pre-training (3 epochs)

  • InfoNCE loss with temperature τ=0.07
  • Learns to bring true drug-ADR pairs close in embedding space
  • Random negative sampling (mismatched pairs)
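
The contrastive objective above can be illustrated with a short sketch. It assumes in-batch negatives (every other ADR in the batch serves as a mismatched pair for a given drug) and cosine similarity; the card does not say how negatives were batched or which similarity was used, so both are assumptions, as is the function name `info_nce_loss`. Only the temperature τ=0.07 is taken from the card.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(drug_emb: torch.Tensor, adr_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch where row i of drug_emb matches row i of adr_emb."""
    drug_emb = F.normalize(drug_emb, dim=-1)
    adr_emb = F.normalize(adr_emb, dim=-1)
    # Cosine similarity of every drug with every ADR, scaled by temperature.
    logits = drug_emb @ adr_emb.T / temperature
    # The true partner of row i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


# Toy embeddings: identical rows are perfectly matched pairs.
aligned = torch.eye(4)
loss_matched = info_nce_loss(aligned, aligned)
loss_mismatched = info_nce_loss(aligned, aligned.roll(1, dims=0))
```

With matched rows the loss is close to zero; shifting the ADR rows by one position makes every diagonal pair a mismatch and the loss rises sharply.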

Phase 2: Classification Fine-tuning (5 epochs)

  • Binary cross-entropy loss
  • Balanced positive/negative samples
  • Learning rate: 2e-5 with linear warmup
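
The learning-rate schedule can be sketched in a few lines. The peak rate of 2e-5 is from the card; the number of warmup steps (`warmup_steps=100`) is a hypothetical value, and holding the rate constant after warmup is an assumption, since the card does not say whether a decay follows.

```python
def warmup_lr(step: int, peak_lr: float = 2e-5, warmup_steps: int = 100) -> float:
    """Learning rate at a given 0-indexed optimizer step with linear warmup."""
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Assumed behaviour after warmup: hold the peak rate.
    return peak_lr


schedule = [warmup_lr(s) for s in (0, 49, 99, 500)]
```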

Training Data

  • Dataset: ADE Corpus V2
  • Configuration: Ade_corpus_v2_drug_ade_relation
  • Training Examples: ~6,800 positive pairs + ~6,800 negative pairs
  • Validation Examples: ~850 pairs
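
One way to arrive at a balanced set like the one above is to pair each positive example with a randomly mismatched drug-effect pair. The sketch below works on bare (drug, effect) tuples rather than full sentences, and the function name `add_random_negatives`, the seed, and the toy pairs are all hypothetical; the card does not describe the exact negative-construction procedure.

```python
import random


def add_random_negatives(positives, seed=0):
    """Return (drug, effect, label) triples with one mismatched negative per positive.

    A negative is only added when some effect exists that is not a known
    positive for that drug.
    """
    rng = random.Random(seed)
    known = set(positives)
    effects = sorted({effect for _, effect in positives})
    examples = [(drug, effect, 1) for drug, effect in positives]
    for drug, _ in positives:
        candidates = [e for e in effects if (drug, e) not in known]
        if candidates:
            examples.append((drug, rng.choice(candidates), 0))
    return examples


# Toy pairs for illustration only, not drawn from ADE Corpus V2.
toy_positives = [
    ("aspirin", "gastrointestinal bleeding"),
    ("methotrexate", "hepatotoxicity"),
    ("lithium", "tremor"),
]
examples = add_random_negatives(toy_positives)
```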

Performance

Metric      Value
F1 Score    88.3%
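
For reference, F1 is the harmonic mean of precision and recall and can be computed directly from confusion counts. The counts below are hypothetical and chosen only to show the arithmetic; the card does not report the confusion matrix or the decision threshold behind the 88.3% figure.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts: precision 0.8 and recall 0.8 give F1 0.8.
example_f1 = f1_score(tp=80, fp=20, fn=20)
```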

Comparison with CRAG Family

Model                          F1      AUC     Key Features
CRAG-dual-encoder-base         88.3%   -       PubMedBERT, random negatives
CRAG-dual-encoder-ade          97.5%   99.1%   BioLinkBERT, hard negatives, focal loss
CRAG-dual-encoder-mimicause    98.9%   99.8%   + MIMICause causal reasoning

Usage

import torch
from transformers import AutoTokenizer, AutoModel

# Load model (custom architecture - need to define DualEncoderModel class)
# See training script for architecture definition

tokenizer = AutoTokenizer.from_pretrained("chrisvoncsefalvay/CRAG-dual-encoder-base")

# Example: Score a drug-ADR pair
drug_context = "Patient was prescribed aspirin for pain management."
adr_context = "The patient experienced gastrointestinal bleeding."

# Tokenize
drug_inputs = tokenizer(drug_context, return_tensors="pt", max_length=128, truncation=True, padding="max_length")
adr_inputs = tokenizer(adr_context, return_tensors="pt", max_length=128, truncation=True, padding="max_length")

# Forward pass (pseudo-code - requires loading custom model)
# drug_repr = model.encode_drug(**drug_inputs)
# adr_repr = model.encode_adr(**adr_inputs)
# score = model.classify(drug_repr, adr_repr)
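
The pseudo-code above relies on a custom `DualEncoderModel` class that is defined in the training script, not in this card. The sketch below shows one shape such a class could take, exposing the same three method names. Everything here is an assumption: the constructor signature, the placeholder scoring layer (standing in for the bilinear fusion and MLP head), and the tiny randomly initialised `BertModel` encoders, which are used only so the example runs without downloading weights.

```python
import torch
from torch import nn
from transformers import BertConfig, BertModel


class DualEncoderModel(nn.Module):
    """Hypothetical wrapper: two separately weighted encoders plus a scoring head."""

    def __init__(self, drug_encoder: nn.Module, adr_encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.drug_encoder = drug_encoder
        self.adr_encoder = adr_encoder
        # Placeholder for the card's bilinear fusion + MLP head.
        self.scorer = nn.Linear(hidden_dim, 1)

    def encode_drug(self, **inputs) -> torch.Tensor:
        # [CLS] pooling: take the first token's hidden state.
        return self.drug_encoder(**inputs).last_hidden_state[:, 0]

    def encode_adr(self, **inputs) -> torch.Tensor:
        return self.adr_encoder(**inputs).last_hidden_state[:, 0]

    def classify(self, drug_repr: torch.Tensor, adr_repr: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.scorer(drug_repr * adr_repr)).squeeze(-1)


# Tiny random encoders (not PubMedBERT) so the sketch runs offline.
config = BertConfig(hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64)
model = DualEncoderModel(BertModel(config), BertModel(config), hidden_dim=32).eval()

ids = torch.randint(0, config.vocab_size, (1, 16))
inputs = {"input_ids": ids, "attention_mask": torch.ones_like(ids)}
with torch.no_grad():
    score = model.classify(model.encode_drug(**inputs), model.encode_adr(**inputs))
```

Because the weights are random, the resulting score is meaningless; the point is only the call pattern, which mirrors the commented-out lines above.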

Intended Uses

Primary Use Cases

  • Pharmacovigilance: Automated extraction of drug-ADR relationships from literature
  • Causal Graph Construction: Building drug-ADR knowledge graphs for safety analysis
  • Literature Mining: Screening biomedical publications for adverse event reports
  • Clinical Decision Support: Identifying potential drug safety signals
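
For the graph-construction use case, scored pairs can be turned into edges by thresholding. The sketch below is illustrative only: the function name `build_causal_graph`, the 0.5 threshold, and the scores are made up, and a real pipeline would need a threshold tuned on validation data plus expert review of the resulting edges.

```python
from collections import defaultdict


def build_causal_graph(scored_pairs, threshold=0.5):
    """Keep (drug, adr, score) triples at or above threshold as drug -> ADR edges."""
    graph = defaultdict(dict)
    for drug, adr, score in scored_pairs:
        if score >= threshold:
            # Keep the highest score seen for a repeated pair.
            graph[drug][adr] = max(score, graph[drug].get(adr, 0.0))
    return dict(graph)


# Made-up scores for illustration; not model outputs.
scored = [
    ("aspirin", "gastrointestinal bleeding", 0.93),
    ("aspirin", "headache", 0.12),
    ("aspirin", "gastrointestinal bleeding", 0.81),
]
graph = build_causal_graph(scored)
```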

Out-of-Scope Uses

  • Direct clinical decision-making without human review
  • Diagnosis or treatment recommendations
  • Processing non-English text
  • Identifying drug-drug interactions (different task)

Limitations

  1. English Only: Trained exclusively on English biomedical text
  2. Domain Specific: Optimized for drug-ADR relationships; may not generalize to other biomedical relations
  3. Context Dependency: Requires both drug and ADR to be mentioned in related context
  4. Base Model Performance: This base version achieves 88.3% F1; consider using CRAG-dual-encoder-ade or CRAG-dual-encoder-mimicause for production use

Ethical Considerations

  • Model predictions should be validated by domain experts before use in clinical or regulatory settings
  • False negatives may miss important safety signals; false positives may trigger unnecessary reviews
  • The model reflects biases present in the training data (ADE Corpus V2, sourced from MEDLINE)

Citation

@misc{crag-dual-encoder-2024,
  title={CRAG: Causal Reasoning for Adversomics Graphs - Dual-Encoder Models for Drug-ADR Relation Extraction},
  author={von Csefalvay, Chris},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/chrisvoncsefalvay/CRAG-dual-encoder-base}
}

Model Card Authors

Chris von Csefalvay (@chrisvoncsefalvay)

Model Card Contact

For questions or issues, please open a discussion on this model's repository or contact chris@chrisvoncsefalvay.com.
