CRAG-dual-encoder-base

CRAG: Causal Reasoning for Adversomics Graphs

This is the base model in the CRAG dual-encoder family for drug-adverse drug reaction (ADR) relation extraction. It uses a dual-encoder architecture with PubMedBERT to score drug-ADR pairs for causal pharmacovigilance graph construction.

Model Description

CRAG-dual-encoder-base identifies causal relationships between drugs and adverse drug reactions in biomedical text. Given a drug mention and an ADR mention, each with its surrounding context, the model outputs P(causal), the estimated probability that the drug caused the reaction.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    CRAG Dual-Encoder Base                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Drug Context          ADR Context                         │
│        │                     │                              │
│        ▼                     ▼                              │
│  ┌──────────┐          ┌──────────┐                         │
│  │PubMedBERT│          │PubMedBERT│    (separate weights)   │
│  │  Drug    │          │   ADR    │                         │
│  │ Encoder  │          │ Encoder  │                         │
│  └────┬─────┘          └────┬─────┘                         │
│       │                     │                               │
│       ▼                     ▼                               │
│  [CLS] Pool            [CLS] Pool                           │
│       │                     │                               │
│       └────────┬────────────┘                               │
│                │                                            │
│                ▼                                            │
│        ┌──────────────┐                                     │
│        │   Bilinear   │                                     │
│        │   Fusion     │                                     │
│        └──────┬───────┘                                     │
│               │                                             │
│               ▼                                             │
│        ┌──────────────┐                                     │
│        │  MLP Head    │                                     │
│        │  (256→1)     │                                     │
│        └──────┬───────┘                                     │
│               │                                             │
│               ▼                                             │
│           P(causal)                                         │
└─────────────────────────────────────────────────────────────┘

Base Model: microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
Hidden Dimension: 768
Fusion Dimension: 256
Parameters: ~220M (two separate BERT encoders)
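
The fusion and classification stages in the diagram can be illustrated with a minimal PyTorch sketch. It assumes the two encoders have already produced 768-dimensional [CLS] vectors, so random tensors stand in for them here. The class name `FusionHead`, the low-rank bilinear form (an elementwise product of two linear projections), the ReLU and the final sigmoid are all illustrative assumptions, not the released implementation, which is defined in the training script.

```python
import torch
from torch import nn


class FusionHead(nn.Module):
    """Hypothetical fusion + classification head for two pooled [CLS] vectors."""

    def __init__(self, hidden_dim: int = 768, fusion_dim: int = 256):
        super().__init__()
        # Separate projections for the drug and ADR representations.
        self.drug_proj = nn.Linear(hidden_dim, fusion_dim)
        self.adr_proj = nn.Linear(hidden_dim, fusion_dim)
        # Head mapping the fused vector (256) to a single logit.
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(fusion_dim, 1))

    def forward(self, drug_repr: torch.Tensor, adr_repr: torch.Tensor) -> torch.Tensor:
        # Low-rank bilinear fusion: elementwise product of the two projections.
        fused = self.drug_proj(drug_repr) * self.adr_proj(adr_repr)
        # Sigmoid turns the logit into a probability-like score, P(causal).
        return torch.sigmoid(self.head(fused)).squeeze(-1)


torch.manual_seed(0)
head = FusionHead()
drug_cls = torch.randn(4, 768)  # stand-ins for drug encoder [CLS] outputs
adr_cls = torch.randn(4, 768)   # stand-ins for ADR encoder [CLS] outputs
with torch.no_grad():
    p_causal = head(drug_cls, adr_cls)
print(p_causal.shape)  # torch.Size([4])
```

The low-rank form is used here only to keep the sketch small: a full bilinear layer from two 768-dimensional inputs to 256 outputs would by itself hold roughly 151M weights.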

Training Procedure

The model was trained in two phases:

Phase 1: Contrastive Pre-training (3 epochs)

InfoNCE loss with temperature τ=0.07
Learns to bring true drug-ADR pairs close in embedding space
Random negative sampling (mismatched pairs)
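
As a rough illustration of the Phase 1 objective, the sketch below computes an InfoNCE loss in plain Python. It assumes cosine similarity, a single drug-to-ADR direction, and negatives taken from the other pairs in the same batch; the card states only the loss, the temperature and that negatives are mismatched pairs, so these details may differ from the actual training code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def info_nce(drug_vecs, adr_vecs, tau=0.07):
    """Mean InfoNCE loss; item i of each list forms a true drug-ADR pair."""
    losses = []
    for i, d in enumerate(drug_vecs):
        logits = [cosine(d, a) / tau for a in adr_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of the matching ADR among all candidates.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings: aligned pairs give a low loss, shuffled pairs a high one.
drugs = [[1.0, 0.0], [0.0, 1.0]]
adrs_aligned = [[0.9, 0.1], [0.1, 0.9]]
adrs_shuffled = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = info_nce(drugs, adrs_aligned)
loss_shuffled = info_nce(drugs, adrs_shuffled)
print(loss_aligned < loss_shuffled)  # True
```

With τ=0.07 the similarities are scaled by about 14, so even modest differences in cosine similarity produce sharply peaked softmax distributions.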

Phase 2: Classification Fine-tuning (5 epochs)

Binary cross-entropy loss
Balanced positive/negative samples
Learning rate: 2e-5 with linear warmup
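
The two Phase 2 ingredients can be written out in a few lines. This is a sketch under assumptions: the warmup length (`warmup_steps=100`) is a hypothetical value, and the rate is held constant after warmup because the card does not say what happens afterwards.

```python
import math


def warmup_lr(step, peak_lr=2e-5, warmup_steps=100):
    """Linear warmup from 0 to peak_lr, then constant (post-warmup shape is assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


# Learning rate at a few steps of the hypothetical schedule.
for step in (0, 50, 100, 500):
    print(step, warmup_lr(step))

# A confident correct prediction is penalized far less than a confident wrong one.
print(round(bce(0.9, 1), 4))  # 0.1054
print(bce(0.1, 1) > bce(0.9, 1))  # True
```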

Training Data

Dataset: ADE Corpus V2
Configuration: Ade_corpus_v2_drug_ade_relation
Training Examples: ~6,800 positive pairs + ~6,800 negative pairs
Validation Examples: ~850 pairs
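
One plausible way to obtain a balanced set from positive-only annotations is to re-pair each drug with an effect drawn from a different example. The sketch below shows that idea on bare (drug, effect) strings; the function name `build_balanced_pairs`, the retry limit and the toy data are hypothetical, and the actual pipeline works on sentence contexts rather than bare names.

```python
import random


def build_balanced_pairs(positives, seed=0):
    """Return (drug, effect, label) triples with one random negative per positive.

    A negative re-pairs a drug with an effect from another example and is kept
    only if that (drug, effect) combination is not a known positive.
    """
    rng = random.Random(seed)
    known = set(positives)
    effects = [effect for _, effect in positives]
    examples = [(drug, effect, 1) for drug, effect in positives]
    for drug, _ in positives:
        for _ in range(100):  # bounded retries, so the loop always terminates
            candidate = rng.choice(effects)
            if (drug, candidate) not in known:
                examples.append((drug, candidate, 0))
                break
    rng.shuffle(examples)
    return examples


# Hypothetical toy positives; the real corpus supplies these from annotated sentences.
positives = [
    ("aspirin", "gastrointestinal bleeding"),
    ("methotrexate", "hepatotoxicity"),
    ("clozapine", "agranulocytosis"),
]
pairs = build_balanced_pairs(positives)
print(sum(label for _, _, label in pairs), len(pairs))  # 3 6
```

Negatives built this way are easy to tell apart from positives, which is consistent with the comparison table listing "random negatives" for this model and "hard negatives" for the stronger variant.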

Performance

Metric	Value
F1 Score	88.3%
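
For reference, F1 is the harmonic mean of precision and recall on the positive (causal) class. The sketch below computes it from thresholded scores; the scores and labels are made up, and the 0.5 decision threshold is an assumption, since the card does not state the threshold behind the reported figure.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class from binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up scores and labels, thresholded at 0.5 (the threshold is an assumption).
scores = [0.92, 0.81, 0.40, 0.65, 0.10, 0.30]
labels = [1, 1, 1, 0, 0, 0]
preds = [1 if s >= 0.5 else 0 for s in scores]
print(round(f1_score(labels, preds), 3))  # 0.667
```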

Comparison with CRAG Family

Model	F1	AUC	Key Features
CRAG-dual-encoder-base	88.3%	not reported	PubMedBERT, random negatives
CRAG-dual-encoder-ade	97.5%	99.1%	BioLinkBERT, hard negatives, focal loss
CRAG-dual-encoder-mimicause	98.9%	99.8%	+ MIMICause causal reasoning

Usage

import torch
from transformers import AutoTokenizer, AutoModel

# This checkpoint uses a custom architecture: define the DualEncoderModel class
# before loading the weights. See the training script for the class definition.

tokenizer = AutoTokenizer.from_pretrained("chrisvoncsefalvay/CRAG-dual-encoder-base")

# Example: score a drug-ADR pair
drug_context = "Patient was prescribed aspirin for pain management."
adr_context = "The patient experienced gastrointestinal bleeding."

# Tokenize each context separately; each encoder receives its own input
tokenizer_kwargs = dict(return_tensors="pt", max_length=128, truncation=True, padding="max_length")
drug_inputs = tokenizer(drug_context, **tokenizer_kwargs)
adr_inputs = tokenizer(adr_context, **tokenizer_kwargs)

# Forward pass (pseudo-code: requires a loaded DualEncoderModel instance named `model`)
# with torch.no_grad():
#     drug_repr = model.encode_drug(**drug_inputs)
#     adr_repr = model.encode_adr(**adr_inputs)
#     score = model.classify(drug_repr, adr_repr)

Intended Uses

Primary Use Cases

Pharmacovigilance: Automated extraction of drug-ADR relationships from literature
Causal Graph Construction: Building drug-ADR knowledge graphs for safety analysis
Literature Mining: Screening biomedical publications for adverse event reports
Clinical Decision Support: Identifying potential drug safety signals
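
For the graph-construction use case, scored pairs can be turned into edges by thresholding. The sketch below is one simple way to do that in plain Python; the function name `build_graph`, the 0.5 threshold and the scores are hypothetical, and any real pipeline would tune the threshold against reviewer workload and the cost of missed signals.

```python
from collections import defaultdict


def build_graph(scored_pairs, threshold=0.5):
    """Map each drug to the ADRs whose score meets the threshold."""
    graph = defaultdict(dict)
    for drug, adr, score in scored_pairs:
        if score >= threshold:
            # Keep the highest score seen for a repeated (drug, ADR) edge.
            graph[drug][adr] = max(score, graph[drug].get(adr, 0.0))
    return dict(graph)


# Made-up scores standing in for model outputs.
scored = [
    ("aspirin", "gastrointestinal bleeding", 0.94),
    ("aspirin", "gastrointestinal bleeding", 0.88),
    ("aspirin", "hair loss", 0.12),
    ("clozapine", "agranulocytosis", 0.97),
]
graph = build_graph(scored)
print(graph)
```

Edges produced this way are candidates for expert review, not confirmed causal links.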

Out-of-Scope Uses

Direct clinical decision-making without human review
Diagnosis or treatment recommendations
Processing non-English text
Identifying drug-drug interactions (different task)

Limitations

English Only: Trained exclusively on English biomedical text
Domain Specific: Optimized for drug-ADR relationships; may not generalize to other biomedical relations
Context Dependency: Requires the drug and the ADR to be mentioned in related contexts; pairs drawn from unrelated passages may be scored unreliably
Base Model Performance: This base version achieves 88.3% F1; consider using CRAG-dual-encoder-ade or CRAG-dual-encoder-mimicause for production use

Ethical Considerations

Model predictions should be validated by domain experts before use in clinical or regulatory settings
False negatives may miss important safety signals; false positives may trigger unnecessary reviews
The model reflects biases present in the training data (ADE Corpus V2, sourced from MEDLINE)

Citation

@misc{crag-dual-encoder-2024,
  title={CRAG: Causal Reasoning for Adversomics Graphs - Dual-Encoder Models for Drug-ADR Relation Extraction},
  author={von Csefalvay, Chris},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/chrisvoncsefalvay/CRAG-dual-encoder-base}
}

Model Card Authors

Chris von Csefalvay (@chrisvoncsefalvay)

Model Card Contact

For questions or issues, please open a discussion on this model's repository or contact chris@chrisvoncsefalvay.com.

Evaluation results

F1 Score on ADE Corpus V2 (self-reported): 0.883