PathoPreter: A Parameter-Efficient Clinical Support Model for SNV Risk Flagging


⚠️ CRITICAL DISCLAIMER

PathoPreter is a clinical research tool for risk prioritization, NOT a diagnostic device.

  • DO NOT use this model to confirm or rule out a medical diagnosis.
  • DO NOT use this model to determine medical treatment.
  • DO NOT use this model as a replacement for ACMG guidelines or expert review.

The model outputs "High/Low Pathogenic Indication" signals intended solely to help clinicians prioritize variants for further manual investigation.


🔬 Model Overview

PathoPreter-4B-SNV is a specialized Large Language Model (LLM) fine-tuned to screen Single Nucleotide Variants (SNVs) for pathogenicity. Unlike generalist biomedical models, PathoPreter focuses on domain saturation: learning a dense representation of variant risk factors from over 1.1 million ClinVar records.

  • Developer: Rohit Yadav (NIT Jalandhar)
  • Base Architecture: Qwen-3 Instruct 4B
  • Training Method: Low-Rank Adaptation (LoRA) via Unsloth on NVIDIA A100
  • Input Data: HGVS variant strings + Gene context + Associated Conditions.
  • Output: Semantic Risk Flag (High/Low Indication).

Key Features

  • Variant-Level Isolation: Trained and evaluated with strict HGVS separation, so reported performance reflects learned biological patterns rather than memorized training strings.
  • Deployable: Runs offline on consumer hardware (8-12GB VRAM) via GGUF.
  • Safety-Aligned: Outputs are structurally constrained to prevent diagnostic overreach.

📊 Performance & Benchmarking

The model was evaluated on a strictly isolated test set of 55,376 unseen variants. To prevent data leakage, we enforced HGVS-level isolation, ensuring no variant string from the training set appeared in the evaluation set.
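
The HGVS-level isolation described above can be sketched as a group-wise split, where every record sharing a variant string is routed to the same partition. This is a minimal illustration, not the author's actual pipeline: the record layout (an `hgvs` key), the hash-based assignment, and the 5% default test fraction are all assumptions.

```python
import hashlib

def assign_split(hgvs: str, test_fraction: float = 0.05) -> str:
    """Deterministically route a variant string to 'train' or 'test'.

    Hashing the HGVS string means every record for the same variant
    lands in the same partition, so no string can appear in both.
    """
    digest = hashlib.sha256(hgvs.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"

def split_records(records, test_fraction: float = 0.05):
    """Split dict records (hypothetical 'hgvs' key) into train and test lists."""
    train, test = [], []
    for rec in records:
        target = test if assign_split(rec["hgvs"], test_fraction) == "test" else train
        target.append(rec)
    return train, test

def leaked_variants(train, test):
    """Return HGVS strings present in both partitions (should be empty)."""
    return {r["hgvs"] for r in train} & {r["hgvs"] for r in test}
```

Running `leaked_variants(train, test)` after the split is a cheap sanity check: an empty set confirms that no variant string crosses the boundary.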

Core Metrics (vs. Base Model)

| Metric | PathoPreter-4B | Raw Base Model (Qwen-2.5-7B) | Clinical Implication |
|---|---|---|---|
| Pathogenic Recall | 94.0% | 0% | PathoPreter flags 94% of high-risk variants. The raw model missed every one, labeling all variants Benign. |
| Benign Specificity | 99.2% | 100% | PathoPreter rarely hallucinates risk on safe variants. The raw model scores 100% only because it always answers Benign. |
| Overall Accuracy | 98.57% | 87%* | High reliability across the full distribution. The raw model's accuracy looks good only because the dataset is mostly benign; it never caught a pathogenic variant, whereas PathoPreter reaches 94% recall on pathogenic variants. |

*Note: The raw base model achieved 87% accuracy simply by predicting "Benign" for everything, failing to catch a single pathogenic case. PathoPreter's accuracy reflects actual signal detection.
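
The note above is easiest to see with numbers. The sketch below computes recall, specificity, and accuracy from confusion-matrix counts and applies them to an always-"Benign" predictor. The counts (13 pathogenic and 87 benign variants per 100) are illustrative, chosen only to reproduce the 87% figure; they are not the actual test-set composition.

```python
def screening_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Recall, specificity, and accuracy from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }

# Illustrative counts only: 13 pathogenic and 87 benign variants.
# An always-"Benign" predictor misses every pathogenic variant (fn = 13)
# and is trivially right on every benign one (tn = 87).
always_benign = screening_metrics(tp=0, fn=13, tn=87, fp=0)
print(always_benign)
```

With these counts the predictor reaches 0.87 accuracy and 1.0 specificity while its recall is 0.0, which is why recall on the pathogenic class is the more informative number for a screening tool.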

Industry Comparison (vs. CADD)

On a subset of 1,937 variants, PathoPreter was benchmarked against CADD (PHRED ≥ 20), a standard bioinformatics tool.

  • CADD Recall: ~99.0%
  • PathoPreter Recall: ~94.0%

Result: On this subset, PathoPreter's recall is about five percentage points below CADD's, yet it operates entirely on text-based clinical metadata, without requiring complex evolutionary conservation pipelines.
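
For context on the threshold: CADD's PHRED-scaled score is a rank transform, so a score of 20 corresponds to the top 1% of substitutions ranked by predicted deleteriousness. A small sketch of that conversion and of the flagging rule follows; the function names are illustrative and not part of CADD or PathoPreter.

```python
def phred_to_top_fraction(phred: float) -> float:
    """Fraction of substitutions ranked at or above a PHRED-scaled score."""
    return 10 ** (-phred / 10)

def cadd_flag(phred: float, threshold: float = 20.0) -> bool:
    """True when a variant meets the PHRED >= threshold cutoff."""
    return phred >= threshold
```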


🛠️ Usage

Method 1: Python (Unsloth / Transformers)

Use this if you are a developer wanting to run the full LoRA adapters.

from unsloth import FastLanguageModel

# 1. Load the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "YADAV0206/Qwen-3-4B-finetuned-PathoPreter-Rohit",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

# 2. Define the input prompt
prompt = """
Variant: NM_000059.4(BRCA2):c.8499G>A (p.Lys2833=)
Associated conditions: Hereditary breast ovarian cancer syndrome
### Response:
"""

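# 3. Inference: generate a short completion and decode only the new tokens
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 12)
# generate() returns prompt + completion; slice off the prompt tokens
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.batch_decode(new_tokens, skip_special_tokens = True))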
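
When screening more than one variant, it can help to build the prompt programmatically. The helper below is a sketch that mirrors the layout of the example prompt above; whether this exact layout matches the fine-tuning template is an assumption, and `build_prompt` is a hypothetical name.

```python
def build_prompt(variant: str, conditions: str) -> str:
    """Format one variant in the same layout as the example prompt.

    Assumption: the model expects exactly this layout, including the
    trailing '### Response:' marker.
    """
    return (
        "\n"
        f"Variant: {variant.strip()}\n"
        f"Associated conditions: {conditions.strip()}\n"
        "### Response:\n"
    )

prompt = build_prompt(
    "NM_000059.4(BRCA2):c.8499G>A (p.Lys2833=)",
    "Hereditary breast ovarian cancer syndrome",
)
```

The resulting `prompt` string can be passed to the tokenizer exactly as in step 3.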

Method 2: Local Inference (LM Studio / Ollama)

This repo includes GGUF files for easy local use.

1. Download qwen3-4b-instruct-2507.Q4_K_M.gguf.
2. Load it into LM Studio or Ollama.
3. Set the system prompt: "You are an expert genetic variant classifier. Classify variants as Pathogenic or Benign based on the input."
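
If the GGUF is served through Ollama, it can also be queried from Python over Ollama's local REST API. This is a sketch under assumptions: the model name `pathopreter` is hypothetical (use whatever name you registered the GGUF under), the server is assumed to listen on the default `http://localhost:11434`, and the zero temperature and 12-token limit are illustrative choices rather than documented settings.

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are an expert genetic variant classifier. "
    "Classify variants as Pathogenic or Benign based on the input."
)

def build_payload(prompt: str, model: str = "pathopreter") -> dict:
    """Request body for Ollama's /api/generate endpoint (non-streaming)."""
    return {
        "model": model,  # hypothetical local model name
        "system": SYSTEM_PROMPT,
        "prompt": prompt,
        "stream": False,
        # Assumed settings: deterministic decoding, short completion.
        "options": {"temperature": 0, "num_predict": 12},
    }

def query_ollama(prompt: str, host: str = "http://localhost:11434") -> str:
    """POST the prompt to a running Ollama server and return its reply text."""
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

`query_ollama` only works while an Ollama server is running locally with the model loaded; `build_payload` can be inspected on its own without a server.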


📂 Training Details

Dataset Construction

The training data was derived from a snapshot of ClinVar (NCBI) containing 1.1 million SNV records.

  • Filtered For: Single Nucleotide Variants (SNVs) only.
  • Labels: Binary mapping (Pathogenic/Likely Pathogenic → 1, Benign/Likely Benign → 0).
  • Exclusions: VUS (Variants of Uncertain Significance), conflicting interpretations, and incomplete records were removed.
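
The labeling and exclusion rules above can be sketched as a single mapping function. The exact ClinVar significance strings the author matched are not given, so the sets below are assumptions, and `binary_label` is a hypothetical name.

```python
from typing import Optional

# Assumed ClinVar clinical-significance strings, compared case-insensitively.
PATHOGENIC = {"pathogenic", "likely pathogenic", "pathogenic/likely pathogenic"}
BENIGN = {"benign", "likely benign", "benign/likely benign"}

def binary_label(clinical_significance: Optional[str]) -> Optional[int]:
    """Map a ClinVar significance string to 1, 0, or None (excluded).

    VUS, conflicting interpretations, missing values, and anything
    unrecognized return None so the record can be dropped.
    """
    if not clinical_significance:
        return None
    value = clinical_significance.strip().lower()
    if value in PATHOGENIC:
        return 1
    if value in BENIGN:
        return 0
    return None
```

Returning `None` for anything outside the two sets keeps the exclusion rule conservative: an unfamiliar label is dropped rather than guessed.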

Semantic Output Layer

To prevent misuse, the model maps binary internal predictions to safe clinical language:

  • 1 (Internal) → "High Pathogenic Indication"
  • 0 (Internal) → "Low Pathogenic Indication"
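
A downstream script may want to convert between the internal labels and these phrases, and to reject any generated text that is not one of the two flags. The sketch below shows one way to do that as a post-processing step; where the mapping actually lives in the author's pipeline is not stated, and both function names are hypothetical.

```python
FLAGS = {1: "High Pathogenic Indication", 0: "Low Pathogenic Indication"}

def to_flag(label: int) -> str:
    """Translate an internal binary label into the public risk flag."""
    return FLAGS[label]

def parse_flag(generated_text: str):
    """Return 1 or 0 if the text contains exactly one known flag, else None."""
    text = generated_text.lower()
    hits = [label for label, phrase in FLAGS.items() if phrase.lower() in text]
    return hits[0] if len(hits) == 1 else None
```

Treating ambiguous or unexpected output as `None` lets the caller route that variant to manual review instead of silently assigning a flag.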


🛑 Limitations

  • Scope: Validated only for SNVs. Performance on Indels, CNVs, or Structural Variants is unknown and likely poor.
  • Binary Output: The model does not currently handle VUS (Variants of Uncertain Significance) complexity.
  • Hallucination: Like all LLMs, the model can hallucinate. It is not a clinical diagnosis tool.


🔗 Resources

  • GitHub Repository: YADAV1825/PathoPreter
  • ClinVar Database: NCBI ClinVar
