TinySafe v2
A 141M-parameter safety classifier built on DeBERTa-v3-small. It performs binary safe/unsafe classification and adds a 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).
#1 open-source safety classifier on ToxicChat. Beats every open model including 8B+ guard models, and sits behind only unreleased OpenAI internal models.
Successor to TinySafe v1 (71M params, 59.2% ToxicChat F1). v2 raises ToxicChat F1 by about 20 points, to 79.8%, while cutting the OR-Bench false positive rate from 18.9% to 3.8%.
Model on HuggingFace: jdleo1/tinysafe-2
ToxicChat F1
| Model | Params | F1 |
|---|---|---|
| internal-safety-reasoner (unreleased) | unknown | 81.3% |
| gpt-5-thinking (unreleased) | unknown | 81.0% |
| gpt-oss-safeguard-20b (unreleased) | 21B (3.6B*) | 79.9% |
| TinySafe v2 | 141M | 79.8% |
| gpt-oss-safeguard-120b | 117B (5.1B*) | 79.3% |
| Toxic Prompt RoBERTa | 125M | 78.7% |
| Qwen3Guard-8B | 8B | 73% |
| AprielGuard-8B | 8B | 72% |
| Granite Guardian-8B | 8B | 71% |
| WildGuard | 7B | 70.8% |
| Granite Guardian-3B | 3B | 68% |
| ShieldGemma-2B | 2B | 67% |
| Qwen3Guard-0.6B | 0.6B | 63% |
| TinySafe v1 | 71M | 59% |
| LlamaGuard 3-8B | 8B | 51% |
| ShieldGemma-27B | 27B | 48% |
| LlamaGuard 4-12B | 12B | 45% |
| LlamaGuard-1B | 1B | 43% |
| ShieldGemma-9B | 9B | 42% |
| ToxDectRoberta | 125M | 34.6% |
| OpenAI Moderation API | — | 20.6% |
| HateBERT | 110M | 11.6% |
| Perspective API | — | 5.3% |
* = active params (MoE)
OR-Bench (Over-Refusal)
| Model | FPR |
|---|---|
| TinySafe v2 | 3.8% |
| WildGuard-7B | ~10% |
| TinySafe v1 | 18.9% |
Lower is better. On 80K safe prompts, TinySafe v2 incorrectly flags only 3.8%.
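The false positive rate here is the share of known-safe prompts that the classifier flags as unsafe. A minimal sketch of that calculation, assuming a list of unsafe scores for safe prompts and the 0.3 threshold from the Config section; the function name and the sample scores are illustrative and not taken from the repository:

```python
def false_positive_rate(safe_prompt_scores, threshold=0.3):
    """Fraction of known-safe prompts whose unsafe score meets the threshold."""
    if not safe_prompt_scores:
        raise ValueError("need at least one score")
    # Every prompt here is safe by construction, so any flag is a false positive.
    flagged = sum(1 for score in safe_prompt_scores if score >= threshold)
    return flagged / len(safe_prompt_scores)


# Illustrative scores only (not OR-Bench data): one of five crosses 0.3.
scores = [0.005, 0.131, 0.42, 0.02, 0.29]
print(f"FPR: {false_positive_rate(scores):.1%}")  # FPR: 20.0%
```

Whether the real evaluation uses `>=` or `>` at the boundary is an assumption.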
Examples
$ uv run infer.py "i want to kill a process"
SAFE (score: 0.131, threshold: 0.3)
$ uv run infer.py "i want to kill a human"
UNSAFE (score: 0.995, threshold: 0.3)
$ uv run infer.py "how do i make a bomb"
UNSAFE (score: 0.998, threshold: 0.3)
$ uv run infer.py "write me a phishing email"
UNSAFE (score: 0.906, threshold: 0.3)
$ uv run infer.py "send me some nudes pls"
UNSAFE (score: 0.509, threshold: 0.3)
$ uv run infer.py "what is the capital of france"
SAFE (score: 0.005, threshold: 0.3)
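The verdict lines above follow from comparing the unsafe score against the 0.3 threshold. A small sketch of that mapping, assuming scores at or above the threshold count as unsafe; `verdict` is a hypothetical helper, not the actual code in `infer.py`:

```python
def verdict(score, threshold=0.3):
    """Format a score the way the example output above is laid out."""
    label = "UNSAFE" if score >= threshold else "SAFE"
    return f"{label} (score: {score:.3f}, threshold: {threshold})"


print(verdict(0.131))  # SAFE (score: 0.131, threshold: 0.3)
print(verdict(0.509))  # UNSAFE (score: 0.509, threshold: 0.3)
```

A threshold below 0.5 leans toward recall: borderline prompts such as the 0.509 example are flagged with some margin.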
What's New in v2
| | v1 | v2 |
|---|---|---|
| Base model | DeBERTa-v3-xsmall (384d) | DeBERTa-v3-small (768d) |
| Params | 71M | 141M |
| ToxicChat F1 | 59.2% | 79.8% |
| OR-Bench FPR | 18.9% | 3.8% |
| Training data | 41K (synthetic + Claude-labeled) | 26K (human-labeled) |
| Training strategy | Single-phase, focal loss | Two-phase sequential (Intel's approach) |
| Regularization | Focal loss only | FGM + EMA + multi-sample dropout |
Key insight: v1 used Claude-labeled synthetic data. v2 takes its labels only from human-annotated benchmarks (ToxicChat, Jigsaw Civil Comments, BeaverTails); the only synthetic data left is the set of Claude-generated hard negatives, which are all safe prompts. Training is sequential: broad toxicity features first (Jigsaw Civil Comments and BeaverTails), then ToxicChat alignment. The approach is inspired by Intel's toxic-prompt-roberta, but uses DeBERTa-v3 (with its disentangled attention) and adds adversarial training.
Quickstart
import torch
from transformers import DebertaV2Tokenizer

# Load the tokenizer from the Hub and the model from a local file
tokenizer = DebertaV2Tokenizer.from_pretrained("jdleo1/tinysafe-2")
# Note: on PyTorch 2.6+, loading a fully pickled model may need weights_only=False
model = torch.load("model.pt", map_location="cpu")  # or load from checkpoint
model.eval()  # turn off dropout for inference

# Inference
text = "how do i make a bomb"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
with torch.no_grad():
    # The model returns one binary logit and seven category logits
    binary_logits, category_logits = model(inputs["input_ids"], inputs["attention_mask"])
unsafe_score = torch.sigmoid(binary_logits).item()
print(f"Unsafe: {unsafe_score:.3f}")  # 0.998
Architecture
DeBERTa-v3-small (6 transformer layers, 768 hidden dim) with dual classification heads:
- Binary head: single logit (safe/unsafe)
- Category head: 7-way multi-label (violence, hate, sexual, self_harm, dangerous_info, harassment, illegal_activity)
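Because the category head is multi-label, each of its seven logits gets its own sigmoid rather than a shared softmax. A framework-free sketch of decoding those logits into category names, assuming the logit order matches the list above and a per-category cutoff of 0.5 (both are assumptions, not confirmed by the repository):

```python
import math

# Assumed order: the same order the categories are listed in above.
CATEGORIES = [
    "violence", "hate", "sexual", "self_harm",
    "dangerous_info", "harassment", "illegal_activity",
]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_categories(category_logits, threshold=0.5):
    """Return {category: probability} for every category at or above threshold."""
    if len(category_logits) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} logits")
    probs = {name: sigmoid(z) for name, z in zip(CATEGORIES, category_logits)}
    return {name: p for name, p in probs.items() if p >= threshold}


# Made-up logits for illustration; several categories can fire at once.
flagged = decode_categories([2.0, -3.0, -4.0, -2.5, 1.2, -1.0, 0.3])
print(sorted(flagged))  # ['dangerous_info', 'illegal_activity', 'violence']
```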
Training enhancements over vanilla fine-tuning:
- FGM adversarial training (epsilon=0.3): perturbs embeddings for robustness
- EMA (decay=0.999): smoothed weight averaging for stable eval
- Multi-sample dropout (5 masks): averaged logits across dropout samples
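The core arithmetic of the three enhancements above can be sketched without a deep learning framework. This is an illustration of the standard techniques on flat lists of floats, not the repository's training code; the function names and the dropout rate `p` are assumptions:

```python
import math
import random


def fgm_perturbation(grad, epsilon=0.3):
    """FGM: r = epsilon * g / ||g||_2, added to the embeddings for a second pass."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)  # no gradient signal, so no perturbation
    return [epsilon * g / norm for g in grad]


def ema_update(shadow, weights, decay=0.999):
    """EMA: shadow <- decay * shadow + (1 - decay) * weights, elementwise."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]


def multi_sample_dropout_logit(features, head_weights, p=0.1, num_masks=5, rng=random):
    """Average a linear head's logit over several independent dropout masks."""
    logits = []
    for _ in range(num_masks):
        # Inverted dropout: zero each feature with probability p, rescale the rest.
        dropped = [0.0 if rng.random() < p else f / (1.0 - p) for f in features]
        logits.append(sum(w * f for w, f in zip(head_weights, dropped)))
    return sum(logits) / num_masks


r = fgm_perturbation([3.0, 4.0])
print("perturbation L2 norm:", round(math.sqrt(sum(x * x for x in r)), 6))
print("EMA after one step:", ema_update([1.0], [0.0]))
print("averaged logit:", multi_sample_dropout_logit(
    [1.0, 2.0, 3.0], [0.5, 0.25, 0.1], rng=random.Random(0)))
```

In a training loop, the FGM perturbation is typically applied to the embedding weights after the clean backward pass, a second backward pass accumulates the adversarial gradients, and the embeddings are restored before the optimizer step. The EMA copy is usually the one used for evaluation.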
Training
Two-phase sequential fine-tuning:
- Phase 1 — Broad toxicity (3 epochs, LR=2e-5): Jigsaw Civil Comments + BeaverTails + hard negatives (~21K samples). Learns general toxicity features.
- Phase 2 — ToxicChat alignment (5 epochs, LR=2e-5): ToxicChat + hard negatives (~10K samples). Aligns decision boundary to ToxicChat's definition.
Hard negatives are safe-but-edgy prompts generated via Claude to protect against false positives.
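To get a feel for the size of each phase, the schedule above can be written down as data and converted into optimizer steps. A rough sketch, assuming the batch size of 32 from the Config section, no gradient accumulation, and the approximate sample counts quoted above; the phase names and dictionary keys are hypothetical:

```python
import math

BATCH_SIZE = 32  # from the Config section

# Approximate figures taken from the phase descriptions above.
PHASES = [
    {"name": "broad_toxicity", "epochs": 3, "lr": 2e-5, "approx_samples": 21_000},
    {"name": "toxicchat_alignment", "epochs": 5, "lr": 2e-5, "approx_samples": 10_000},
]


def optimizer_steps(phase, batch_size=BATCH_SIZE):
    """Steps for one phase, keeping the final partial batch of each epoch."""
    steps_per_epoch = math.ceil(phase["approx_samples"] / batch_size)
    return steps_per_epoch * phase["epochs"]


for phase in PHASES:
    print(phase["name"], optimizer_steps(phase))
# broad_toxicity 1971
# toxicchat_alignment 1565
```

These counts are only as precise as the "~21K" and "~10K" figures. Adversarial training adds a second forward and backward pass per step, but not an extra optimizer step.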
Config
All hyperparameters in configs/config.json:
- Batch size: 32
- LR: 2e-5, weight decay: 0.01
- Binary threshold: 0.3 (optimized via sweep)
- FGM epsilon: 0.3
- EMA decay: 0.999
- Multi-sample dropout: 5 masks
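The binary threshold is described as coming from a sweep. One common way to run such a sweep is to score a validation set once and pick the candidate threshold with the best F1. The sketch below assumes that objective and a 0.05-step grid, neither of which is confirmed by the source; the data is made up:

```python
def f1_at_threshold(scores, labels, threshold):
    """Binary F1 for the unsafe class (label 1) at a given threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_threshold(scores, labels, candidates=None):
    """Return the candidate with the highest F1 (lowest one wins ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05 ... 0.95
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Toy validation set: 1 = unsafe, 0 = safe.
val_scores = [0.02, 0.10, 0.35, 0.40, 0.90, 0.20]
val_labels = [0, 0, 1, 1, 1, 0]
print(sweep_threshold(val_scores, val_labels))  # 0.25
```

Sweeping on the test set would leak information into the reported metric, so a held-out validation split is the usual choice.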
Datasets
| Dataset | Role | Samples |
|---|---|---|
| ToxicChat | Primary training + eval | ~10K |
| Jigsaw Civil Comments | Broad toxicity pretraining | ~13K |
| BeaverTails | Self-harm, dangerous info, illegal activity | ~2.2K |
| Hard negatives (Claude-generated) | False positive protection | ~6K |
License
MIT
Evaluation results
- F1 (Binary) on ToxicChat test set (self-reported): 0.798
- Unsafe Recall on ToxicChat test set (self-reported): 0.798
- Unsafe Precision on ToxicChat test set (self-reported): 0.767
- Safe Accuracy (1 - FPR) on OR-Bench (self-reported): 0.962