PathoPreter: A Parameter-Efficient Clinical Support Model for SNV Risk Flagging
⚠️ CRITICAL DISCLAIMER
PathoPreter is a clinical research tool for risk prioritization, NOT a diagnostic device.
- DO NOT use this model to confirm or rule out a medical diagnosis.
- DO NOT use this model to determine medical treatment.
- DO NOT use this model as a replacement for ACMG guidelines or expert review.
The model outputs "High/Low Pathogenic Indication" signals intended solely to help clinicians prioritize variants for further manual investigation.
🔬 Model Overview
PathoPreter-4B-SNV is a specialized Large Language Model (LLM) fine-tuned to screen Single Nucleotide Variants (SNVs) for pathogenicity. Unlike generalist biomedical models, PathoPreter focuses on domain saturation: learning a dense representation of variant risk factors from over 1.1 million ClinVar records.
- Developer: Rohit Yadav (NIT Jalandhar)
- Base Architecture: Qwen-3 Instruct 4B
- Training Method: Low-Rank Adaptation (LoRA) via Unsloth on NVIDIA A100
- Input Data: HGVS variant strings + Gene context + Associated Conditions.
- Output: Semantic Risk Flag (High/Low Indication).
Key Features
- Variant-Level Isolation: Trained with strict HGVS separation so that the model is pushed to learn biological patterns rather than memorize training data.
- Deployable: Runs offline on consumer hardware (8-12GB VRAM) via GGUF.
- Safety-Aligned: Outputs are structurally constrained to prevent diagnostic overreach.
Performance & Benchmarking
The model was evaluated on a strictly isolated test set of 55,376 unseen variants. To prevent data leakage, we enforced HGVS-level isolation, ensuring no variant string from the training set appeared in the evaluation set.
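The isolation rule described above can be sketched as follows. This is a minimal illustration, not the author's actual split script: the `hgvs` field name, the `test_fraction` value, and the toy records are all assumptions made for the example.

```python
import random


def split_by_hgvs(records, test_fraction=0.05, seed=0):
    """Split records so that each HGVS string lands entirely in train or test.

    `records` is a list of dicts with a hypothetical "hgvs" key.
    """
    # Work on unique HGVS strings so duplicate submissions stay together.
    unique_hgvs = sorted({rec["hgvs"] for rec in records})
    rng = random.Random(seed)
    rng.shuffle(unique_hgvs)

    n_test = max(1, int(len(unique_hgvs) * test_fraction))
    test_keys = set(unique_hgvs[:n_test])

    train = [rec for rec in records if rec["hgvs"] not in test_keys]
    test = [rec for rec in records if rec["hgvs"] in test_keys]
    return train, test


def leaked_variants(train, test):
    """Return the set of HGVS strings that appear in both splits."""
    return {rec["hgvs"] for rec in train} & {rec["hgvs"] for rec in test}


# Toy records (illustrative only); two of them share an HGVS string.
toy_records = [{"hgvs": f"NM_TOY.1:c.{i}A>G", "label": i % 2} for i in range(8)]
toy_records.append({"hgvs": "NM_TOY.1:c.0A>G", "label": 0})

train_set, test_set = split_by_hgvs(toy_records, test_fraction=0.25)
assert not leaked_variants(train_set, test_set)
```

Splitting on the unique strings rather than on rows is the point of the sketch: a row-level random split could place two records for the same variant on opposite sides.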
Core Metrics (vs. Base Model)
| Metric | PathoPreter-4B | Raw Base Model (Qwen-2.5-7B) | Clinical Implication |
|---|---|---|---|
| Pathogenic Recall | 94.0% | 0% | PathoPreter flags 94% of high-risk variants; the raw model labeled every variant Benign and missed all of them. |
| Benign Specificity | 99.2% | 100% | PathoPreter rarely flags risk on safe variants. The raw model's 100% is an artifact of always predicting Benign. |
| Overall Accuracy | 98.57% | 87%* | The raw model's accuracy looks respectable only because the dataset is mostly benign; it caught no pathogenic variants, whereas PathoPreter reaches 94% recall on them. |
*Note: The raw base model achieved 87% accuracy simply by predicting "Benign" for everything, failing to catch a single pathogenic case. PathoPreter's accuracy reflects actual signal detection.
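To make the note above concrete, here is a small sketch of how recall, specificity, and accuracy diverge for an always-Benign predictor. The 13/87 class split is an assumption chosen only to match the 87% figure; the actual class balance of the test set is not stated in this card.

```python
def screening_metrics(y_true, y_pred):
    """Compute recall, specificity, and accuracy for binary labels (1 = pathogenic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }


# Illustrative toy set: 13 pathogenic and 87 benign variants (assumed split).
y_true = [1] * 13 + [0] * 87
always_benign = [0] * len(y_true)

baseline = screening_metrics(y_true, always_benign)
print(baseline)
```

On imbalanced data like this, accuracy alone rewards the trivial predictor, which is why the table reports recall and specificity separately.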
Industry Comparison (vs. CADD)
On a subset of 1,937 variants, PathoPreter was benchmarked against CADD (PHRED ≥ 20), a standard bioinformatics tool.
- CADD Recall: ~99.0%
- PathoPreter Recall: ~94.0%
Result: On this subset, PathoPreter's recall trails CADD's by roughly five percentage points, yet it operates entirely on text-based clinical metadata and does not require complex evolutionary conservation pipelines.
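For readers unfamiliar with how a CADD baseline is scored, the sketch below applies the PHRED ≥ 20 cutoff to a list of scores and measures recall on the pathogenic labels. The scores and labels are invented for illustration and are not taken from the 1,937-variant benchmark.

```python
def cadd_flags(phred_scores, threshold=20.0):
    """Flag a variant as high-risk (1) when its CADD PHRED score meets the cutoff."""
    return [1 if score >= threshold else 0 for score in phred_scores]


def recall(y_true, y_pred):
    """Fraction of truly pathogenic variants (label 1) that were flagged."""
    flagged = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(1 for t in y_true if t == 1)
    return flagged / positives if positives else 0.0


# Invented toy values, not benchmark data.
toy_scores = [34.0, 25.1, 12.3, 22.0, 5.4]
toy_labels = [1, 1, 1, 0, 0]

toy_recall = recall(toy_labels, cadd_flags(toy_scores))
print(f"toy CADD recall: {toy_recall:.2f}")
```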
🛠️ Usage
Method 1: Python (Unsloth / Transformers)
Use this if you are a developer wanting to run the full LoRA adapters.
```python
from unsloth import FastLanguageModel

# 1. Load the fine-tuned model and its tokenizer in 4-bit precision
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "YADAV0206/Qwen-3-4B-finetuned-PathoPreter-Rohit",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode

# 2. Define the input prompt
prompt = """
Variant: NM_000059.4(BRCA2):c.8499G>A (p.Lys2833=)
Associated conditions: Hereditary breast ovarian cancer syndrome
### Response:
"""

# 3. Run inference and print only the newly generated tokens
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 12)
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.batch_decode(new_tokens, skip_special_tokens = True))
```
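When screening more than one variant, it may help to build the prompt programmatically so that every query follows the same template as the example above. The helper below is a hypothetical convenience function, not part of the released code; it reproduces the example prompt's layout exactly.

```python
def build_prompt(variant: str, conditions: str) -> str:
    """Format one variant using the same layout as the example prompt above."""
    return (
        "\n"
        f"Variant: {variant}\n"
        f"Associated conditions: {conditions}\n"
        "### Response:\n"
    )


example = build_prompt(
    "NM_000059.4(BRCA2):c.8499G>A (p.Lys2833=)",
    "Hereditary breast ovarian cancer syndrome",
)
print(example)
```

The resulting string can be passed to `tokenizer([prompt], return_tensors = "pt")` in place of the hand-written prompt.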
Method 2: Local Inference (LM Studio / Ollama)
This repo includes GGUF files for easy local use.
1. Download qwen3-4b-instruct-2507.Q4_K_M.gguf.
2. Load it into LM Studio or Ollama.
3. Set the System Prompt: "You are an expert genetic variant classifier. Classify variants as Pathogenic or Benign based on the input."
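If the GGUF file has been imported into a locally running Ollama server, the model can also be queried from a script through Ollama's `/api/chat` endpoint. The sketch below assumes the default local address and a model registered under the hypothetical name `pathopreter`; the user-message layout mirrors the Python example and is likewise an assumption.

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are an expert genetic variant classifier. "
    "Classify variants as Pathogenic or Benign based on the input."
)


def build_chat_payload(variant, conditions, model="pathopreter"):
    """Build the JSON body for a non-streaming Ollama chat request.

    The model name is hypothetical; use whatever name the GGUF was imported under.
    """
    user_message = f"Variant: {variant}\nAssociated conditions: {conditions}"
    return {
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    }


def ask_ollama(payload, url="http://localhost:11434/api/chat"):
    """POST the payload to a local Ollama server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]
```

Calling `ask_ollama(build_chat_payload(...))` requires the server to be running; building the payload alone does not touch the network.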
Training Details
Dataset Construction
The training data was derived from a snapshot of ClinVar (NCBI) containing 1.1 million SNV records.
- Filtered For: Single Nucleotide Variants (SNVs) only.
- Labels: Binary mapping (Pathogenic/Likely Pathogenic → 1, Benign/Likely Benign → 0).
- Exclusions: VUS (Variants of Uncertain Significance), conflicting interpretations, and incomplete records were removed.
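The filtering and labeling rules above can be sketched as a small preprocessing function. The record field names (`variant_type`, `clinical_significance`, `hgvs`) are hypothetical stand-ins, and the card does not say how combined terms were handled, so this sketch simply drops anything outside the four listed categories.

```python
LABEL_MAP = {
    "pathogenic": 1,
    "likely pathogenic": 1,
    "benign": 0,
    "likely benign": 0,
}


def to_binary_label(clinical_significance):
    """Map a significance string to 1/0, or None when it should be excluded."""
    if not clinical_significance:
        return None
    # VUS, conflicting interpretations, and any unlisted term fall through to None.
    return LABEL_MAP.get(clinical_significance.strip().lower())


def build_training_rows(records):
    """Keep only complete SNV records that carry a usable binary label."""
    rows = []
    for rec in records:
        if rec.get("variant_type") != "single nucleotide variant":
            continue  # SNVs only
        label = to_binary_label(rec.get("clinical_significance"))
        if label is None or not rec.get("hgvs"):
            continue  # excluded or incomplete record
        rows.append({"hgvs": rec["hgvs"], "label": label})
    return rows
```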
Semantic Output Layer
To prevent misuse, the model maps binary internal predictions to safe clinical language:
- 1 (Internal) → "High Pathogenic Indication"
- 0 (Internal) → "Low Pathogenic Indication"
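The mapping above is simple enough to express as a lookup table. The card does not state whether the phrases are baked into the training targets or applied afterwards, so the sketch below treats it as a post-processing step; `parse_flag` is a hypothetical helper for reading the phrase back out of generated text.

```python
FLAG_TEXT = {
    1: "High Pathogenic Indication",
    0: "Low Pathogenic Indication",
}


def to_flag(label):
    """Translate an internal binary label into the user-facing phrase."""
    return FLAG_TEXT[label]


def parse_flag(generated_text):
    """Recover the binary label from model output, or None if no phrase is found."""
    text = generated_text.lower()
    for label, phrase in FLAG_TEXT.items():
        if phrase.lower() in text:
            return label
    return None
```

Returning `None` for unrecognized output, rather than defaulting to either class, keeps malformed generations from being silently counted as a low-risk flag.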
Limitations
- Scope: Validated only for SNVs. Performance on Indels, CNVs, or Structural Variants is unknown and likely poor.
- Binary Output: The model does not currently handle VUS (Variants of Uncertain Significance) complexity.
- Hallucination: Like all LLMs, the model can hallucinate. It is not a clinical diagnosis tool.
Resources
- GitHub Repository: YADAV1825/PathoPreter
- ClinVar Database: NCBI ClinVar