PathoPreter: A Parameter-Efficient Clinical Support Model for SNV Risk Flagging

⚠️ CRITICAL DISCLAIMER

PathoPreter is a clinical research tool for risk prioritization, NOT a diagnostic device.

DO NOT use this model to confirm or rule out a medical diagnosis.
DO NOT use this model to determine medical treatment.
DO NOT use this model as a replacement for ACMG guidelines or expert review.

The model outputs "High/Low Pathogenic Indication" signals intended solely to help clinicians prioritize variants for further manual investigation.

🔬 Model Overview

PathoPreter-4B-SNV is a specialized Large Language Model (LLM) fine-tuned to screen Single Nucleotide Variants (SNVs) for pathogenicity. Unlike generalist biomedical models, PathoPreter focuses on domain saturation—learning a dense representation of variant risk factors from over 1.1 million ClinVar records.

Developer: Rohit Yadav (NIT Jalandhar)
Base Architecture: Qwen-3 Instruct 4B
Training Method: Low-Rank Adaptation (LoRA) via Unsloth on NVIDIA A100
Input Data: HGVS variant strings + Gene context + Associated Conditions.
Output: Semantic Risk Flag (High/Low Indication).

Key Features

Variant-Level Isolation: Trained and evaluated with strict HGVS-level separation, so test performance reflects generalization to unseen variant strings rather than memorization of training data.
Deployable: Runs offline on consumer hardware (8-12GB VRAM) via GGUF.
Safety-Aligned: Outputs are structurally constrained to prevent diagnostic overreach.

📊 Performance & Benchmarking

The model was evaluated on a strictly isolated test set of 55,376 unseen variants. To prevent data leakage, we enforced HGVS-level isolation, ensuring no variant string from the training set appeared in the evaluation set.
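The HGVS-level isolation described above can be sketched as a deterministic, hash-based split. This is a minimal illustration, not the author's pipeline: the record layout (`hgvs`, `label`), the `test_fraction` value, and the toy variant strings are all assumptions made for the example.

```python
import hashlib

def assign_split(hgvs: str, test_fraction: float = 0.2) -> str:
    """Deterministically map one HGVS string to 'train' or 'test'.

    Hashing the string means every record sharing that exact string
    lands on the same side of the split.
    """
    digest = hashlib.sha256(hgvs.strip().encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"

def split_records(records, test_fraction: float = 0.2):
    """Split a list of dict records by their 'hgvs' key (assumed field name)."""
    train, test = [], []
    for rec in records:
        side = assign_split(rec["hgvs"], test_fraction)
        (test if side == "test" else train).append(rec)
    return train, test

# Toy records with made-up HGVS-like strings; 40 unique strings, 60 records,
# so some strings are duplicated on purpose.
records = [
    {"hgvs": f"NM_TOY.1(GENE{i % 40}):c.{i % 40}A>G", "label": i % 2}
    for i in range(60)
]
train, test = split_records(records)

# No variant string may appear on both sides.
overlap = {r["hgvs"] for r in train} & {r["hgvs"] for r in test}
print(len(train), len(test), "overlap:", len(overlap))
```

Note that this isolates exact strings only; two different HGVS expressions describing the same genomic change would need normalization first, which the sketch does not attempt.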

Core Metrics (vs. Base Model)

Metric	PathoPreter-4B	Raw Base Model (Qwen-2.5-7B)	Clinical Implication
Pathogenic Recall	94.0%	0%	PathoPreter flags 94% of pathogenic variants; the raw model labeled every variant Benign and caught none.
Benign Specificity	99.2%	100%	PathoPreter rarely flags benign variants as high risk; the raw model's 100% is an artifact of always predicting Benign.
Overall Accuracy	98.57%	87%*	The raw model's accuracy looks reasonable only because the dataset is mostly benign; it detected no pathogenic variants, whereas PathoPreter reaches 94% recall on them.

*Note: The raw base model achieved 87% accuracy simply by predicting "Benign" for everything, failing to catch a single pathogenic case. PathoPreter's accuracy reflects actual signal detection.
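To make the footnote concrete, here is a small sketch of how recall, specificity, and accuracy behave for a classifier that always predicts "Benign". The label counts below are toy values chosen only to mirror an 87% benign share; they are not the real test-set counts, and `screening_metrics` is a hypothetical helper name.

```python
def screening_metrics(y_true, y_pred):
    """Compute recall, specificity and accuracy for binary labels (1 = pathogenic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }

# Toy data: 13 pathogenic and 87 benign variants (illustrative only).
y_true = [1] * 13 + [0] * 87
always_benign = [0] * len(y_true)

print(screening_metrics(y_true, always_benign))
# recall is 0.0, specificity is 1.0, accuracy is 0.87
```

On imbalanced data like this, accuracy alone hides a total failure on the minority class, which is why recall is reported separately.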

Industry Comparison (vs. CADD)

On a subset of 1,937 variants, PathoPreter was benchmarked against CADD (PHRED ≥ 20), a standard bioinformatics tool.

CADD Recall: ~99.0%
PathoPreter Recall: ~94.0%

Result: On this subset, PathoPreter's recall (~94%) trails CADD's (~99%) by about five percentage points, while operating entirely on text-based clinical metadata and without requiring evolutionary conservation pipelines.
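For readers unfamiliar with the CADD cutoff, the sketch below shows what "PHRED ≥ 20" means and how recall under that rule can be computed. CADD's scaled score is PHRED-like, so a score of 20 corresponds to roughly the top 1% of scored variants. The scores and labels in the example are invented for illustration, and the function names are hypothetical.

```python
def phred_to_top_fraction(phred: float) -> float:
    """Convert a PHRED-scaled score to the fraction of variants ranked at or above it."""
    return 10 ** (-phred / 10)

def threshold_recall(phred_scores, labels, threshold: float = 20.0) -> float:
    """Recall of the rule 'flag if PHRED >= threshold' over pathogenic variants (label 1)."""
    pathogenic = [s for s, y in zip(phred_scores, labels) if y == 1]
    if not pathogenic:
        return 0.0
    return sum(1 for s in pathogenic if s >= threshold) / len(pathogenic)

# Invented scores and labels, for illustration only.
toy_scores = [25.1, 31.0, 12.4, 22.0]
toy_labels = [1, 1, 1, 0]

print(phred_to_top_fraction(20.0))               # about 0.01, i.e. top 1%
print(threshold_recall(toy_scores, toy_labels))  # 2 of 3 pathogenic variants flagged
```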

🛠️ Usage

Method 1: Python (Unsloth / Transformers)

Use this if you are a developer wanting to run the full LoRA adapters.

from unsloth import FastLanguageModel

# 1. Load the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "YADAV0206/Qwen-3-4B-finetuned-PathoPreter-Rohit",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

# 2. Define the input prompt
prompt = """
Variant: NM_000059.4(BRCA2):c.8499G>A (p.Lys2833=)
Associated conditions: Hereditary breast ovarian cancer syndrome
### Response:
"""

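# 3. Inference: generate a short completion and decode only the new tokens
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 12)
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]  # drop the echoed prompt
print(tokenizer.batch_decode(new_tokens, skip_special_tokens = True))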

Method 2: Local Inference (LM Studio / Ollama)

This repo includes GGUF files for easy local use.

Download qwen3-4b-instruct-2507.Q4_K_M.gguf.

Load into LM Studio or Ollama.

System Prompt: "You are an expert genetic variant classifier. Classify variants as Pathogenic or Benign based on the input."
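If the GGUF file is served through Ollama, the system prompt above can be sent over Ollama's local HTTP chat endpoint. The following is a sketch under several assumptions: the model tag `pathopreter` is a hypothetical local name you would choose when importing the GGUF, the host is Ollama's default local address, and the `temperature` setting is an illustrative choice rather than a documented recommendation. The `num_predict` value mirrors the `max_new_tokens = 12` used in Method 1.

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are an expert genetic variant classifier. "
    "Classify variants as Pathogenic or Benign based on the input."
)

def build_chat_payload(variant_prompt: str, model_tag: str = "pathopreter") -> dict:
    """Build the JSON body for Ollama's /api/chat endpoint.

    model_tag is a hypothetical local model name, not one published by the author.
    """
    return {
        "model": model_tag,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": variant_prompt},
        ],
        "stream": False,  # ask for a single JSON response
        "options": {"temperature": 0, "num_predict": 12},  # assumed settings
    }

def ask_ollama(variant_prompt: str, host: str = "http://localhost:11434") -> str:
    """Send one prompt to a locally running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(variant_prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]
```

`ask_ollama` requires a running Ollama server with the model already imported; `build_chat_payload` can be inspected on its own without one.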

📂 Training Details

Dataset Construction

The training data was derived from a snapshot of ClinVar (NCBI) containing 1.1 million SNV records.

Filtered For: Single Nucleotide Variants (SNVs) only.

Labels: Binary mapping (Pathogenic/Likely Pathogenic → 1, Benign/Likely Benign → 0).

Exclusions: VUS (Variants of Uncertain Significance), conflicting interpretations, and incomplete records were removed.
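The label mapping and exclusions above might look like the following in code. This is a sketch, not the author's preprocessing script: the exact ClinVar significance strings handled here are assumptions about the snapshot's vocabulary, and `to_binary_label` is a hypothetical helper.

```python
# Assumed spellings of ClinVar clinical-significance values (compared case-insensitively).
PATHOGENIC_TERMS = {"pathogenic", "likely pathogenic", "pathogenic/likely pathogenic"}
BENIGN_TERMS = {"benign", "likely benign", "benign/likely benign"}

def to_binary_label(clinical_significance):
    """Return 1 for (likely) pathogenic, 0 for (likely) benign, None to exclude the record.

    Anything outside the two term sets, including VUS, conflicting
    interpretations and missing values, is excluded.
    """
    if not clinical_significance:
        return None
    key = clinical_significance.strip().lower()
    if key in PATHOGENIC_TERMS:
        return 1
    if key in BENIGN_TERMS:
        return 0
    return None

examples = ["Pathogenic", "Likely benign", "Uncertain significance", ""]
print([to_binary_label(s) for s in examples])  # [1, 0, None, None]
```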

Semantic Output Layer

To prevent misuse, the model maps binary internal predictions to safe clinical language:

1 (Internal) → "High Pathogenic Indication"

0 (Internal) → "Low Pathogenic Indication"
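The mapping above, and its inverse for scoring generated text, can be sketched as follows. The two flag strings come from this card; the helper names and the choice to return `None` for ambiguous or unrecognized text are assumptions made for the example.

```python
RISK_FLAGS = {
    1: "High Pathogenic Indication",
    0: "Low Pathogenic Indication",
}

def to_risk_flag(internal_label: int) -> str:
    """Map the internal binary label to its risk-flag phrase."""
    if internal_label not in RISK_FLAGS:
        raise ValueError(f"unexpected internal label: {internal_label!r}")
    return RISK_FLAGS[internal_label]

def parse_risk_flag(generated_text: str):
    """Recover the binary label from model output, or None if it is unclear.

    Returns None when neither phrase, or both phrases, appear, so that
    malformed generations are not silently counted as a prediction.
    """
    text = generated_text.lower()
    found = [label for label, phrase in RISK_FLAGS.items() if phrase.lower() in text]
    return found[0] if len(found) == 1 else None

print(to_risk_flag(1))                                 # High Pathogenic Indication
print(parse_risk_flag("Low Pathogenic Indication"))    # 0
print(parse_risk_flag("I am not sure"))                # None
```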

🛑 Limitations

Scope: Validated only for SNVs. Performance on Indels, CNVs, or Structural Variants is unknown and likely poor.

Binary Output: The model does not currently handle VUS (Variants of Uncertain Significance) complexity.

Hallucination: Like all LLMs, the model can hallucinate. It is not a clinical diagnostic tool.

🔗 Resources

GitHub Repository: YADAV1825/PathoPreter

ClinVar Database: NCBI ClinVar

