SmolGRPO-135M by @maharshpatelx
This repository contains SmolGRPO-135M, a GRPO‑fine‑tuned version of HuggingFaceTB/SmolLM-135M-Instruct trained on the mlabonne/smoltldr dataset.[web:2][web:42][web:45]
The goal is to make a tiny model that learns to produce concise completions of roughly 50 tokens while staying lightweight and easy to run on a single GPU.[web:42][web:45]
Model description
- Base model: HuggingFaceTB/SmolLM-135M-Instruct (a 135M‑parameter decoder‑only model).[web:2][web:3]
- Fine‑tuning method: GRPO (Group Relative Policy Optimization) with LoRA adapters.[web:45][web:72]
- Dataset: mlabonne/smoltldr – 2k short prompt/completion pairs (train split), plus validation and test splits.[web:42][web:45]
- Objective: encourage completions close to an ideal length of ~50 tokens via a simple reward function.[web:45]
- Intended use: small, educational model for experimenting with GRPO, length‑controlled summarization, and instruction following on limited hardware.[web:45][web:76]
This model is primarily a learning / experimentation checkpoint, not a production‑ready general assistant.[web:45]
Usage
Quick start with pipeline
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_id = "maharshpatelx/SmolGRPO-135M"
device = 0 if torch.cuda.is_available() else -1
print(f"Using device index: {device}")
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
device=device,
)
messages = [
{
"role": "user",
"content": (
"Summarize this paragraph in about 50 tokens:\n\n"
"Cats are small domesticated carnivores that live closely with humans..."
),
},
]
outputs = pipe(
messages,
max_new_tokens=128,
do_sample=True,
temperature=0.5,
min_p=0.1,
)
print(outputs["generated_text"])
The model expects chat‑style messages and works best for short summaries or short‑form answers where a compact response is desired.[web:2][web:45]
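If you prefer to call the model directly instead of through pipeline, the same generation can be done with the tokenizer's chat template. The following is a minimal sketch using the same sampling settings as above:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "maharshpatelx/SmolGRPO-135M"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
messages = [
    {
        "role": "user",
        "content": (
            "Summarize this paragraph in about 50 tokens:\n\n"
            "Cats are small domesticated carnivores that live closely with humans..."
        ),
    },
]
# Apply the chat template and tokenize in one step
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.5,
        min_p=0.1,
    )
# Decode only the newly generated tokens
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))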
Example: long cat document (from the GRPO exercise)
prompt = """
# A long document about the Cat
The cat (Felis catus), also referred to as the domestic cat or house cat, is a small
domesticated carnivorous mammal. It is the only domesticated species of the family Felidae.
Advances in archaeology and genetics have shown that the domestication of the cat occurred
in the Near East around 7500 BC. It is commonly kept as a pet and farm cat, but also ranges
freely as a feral cat avoiding human contact. It is valued by humans for companionship and
its ability to kill vermin. Its retractable claws are adapted to killing small prey species
such as mice and rats. It has a strong, flexible body, quick reflexes, and sharp teeth,
and its night vision and sense of smell are well developed. It is a social species,
but a solitary hunter and a crepuscular predator. Cat communication includes
vocalizations—including meowing, purring, trilling, hissing, growling, and grunting—as
well as body language. It can hear sounds too faint or too high in frequency for human ears,
such as those made by small mammals. It secretes and perceives pheromones.
"""
messages = [
{"role": "user", "content": prompt},
]
outputs = pipe(
messages,
max_new_tokens=256,
do_sample=True,
temperature=0.5,
min_p=0.1,
)
print(outputs[0]["generated_text"][-1]["content"])  # the generated TL;DR
This mirrors the example used in the Hugging Face LLM course GRPO chapter to evaluate the fine‑tuned model.[web:45][web:46]
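To check how close a generation lands to the ~50‑token target, you can tokenize the assistant reply. The snippet below assumes the chat‑format pipeline output used above, where generated_text is the full message list:
reply = outputs[0]["generated_text"][-1]["content"]
num_tokens = len(tokenizer(reply)["input_ids"])
print(f"Reply length: {num_tokens} tokens")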
Training details
Data
- Dataset: mlabonne/smoltldr.[web:42][web:45]
- Splits: train (2,000), validation (200), test (200), each with prompt and completion fields.[web:42]
- Task: given a short prompt, produce a concise TL;DR‑style completion.
The dataset is well suited to testing length‑controlled summarization on a small model.[web:42]
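For reference, loading and inspecting the dataset looks like this (split and column names as described above):
from datasets import load_dataset
dataset = load_dataset("mlabonne/smoltldr")
print(dataset)  # DatasetDict with train/validation/test splits
example = dataset["train"][0]
print(example["prompt"][:200])   # beginning of the source text
print(example["completion"])     # short TL;DR-style target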
Reward function
A simple length‑based reward was used:
ideal_length = 50
def reward_len(completions, **kwargs):
return [-abs(ideal_length - len(completion)) for completion in completions]
- The reward is highest when the completion length, as measured by len() (characters for plain‑text completions), is close to 50, and it decreases linearly as the response becomes shorter or longer than that.[web:45]
- This encourages the model to generate compact answers around the target length rather than overly long completions (illustrated below).[web:45][web:76]
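As a quick sanity check, reward_len can be called on a few dummy completions of different lengths:
# Dummy completions of length 20, 50, and 120 characters
dummy = ["x" * 20, "x" * 50, "x" * 120]
print(reward_len(dummy))  # [-30, 0, -70] – the 50-character completion scores best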
LoRA configuration
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
task_type="CAUSAL_LM",
r=16,
lora_alpha=32,
target_modules="all-linear",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
LoRA reduces the number of trainable parameters to a small fraction of the base model, which lowers memory usage and makes fine‑tuning feasible on a single consumer GPU.[web:45]
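print_trainable_parameters already reports this; the fraction can also be computed directly:
# Count trainable vs. total parameters of the LoRA-wrapped model
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")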
GRPO configuration
Core settings (adapted for a single‑GPU run):
from trl import GRPOConfig
training_args = GRPOConfig(
output_dir="GRPO",
learning_rate=2e-5,
per_device_train_batch_size=1,
gradient_accumulation_steps=1,
max_prompt_length=256,
max_completion_length=64,
num_generations=2,
optim="adamw_torch",
num_train_epochs=1,
bf16=True, # when supported on GPU
report_to=["wandb"],
remove_unused_columns=False,
logging_steps=1,
)
- Trainer: trl.GRPOTrainer with reward_funcs=[reward_len] and train_dataset=dataset["train"] (sketched below).[web:45]
- Logging: Weights & Biases stores reward, loss, and KL divergence over training steps, which helps interpret how GRPO modifies the policy.[web:45][web:76]
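Putting the pieces together, the trainer setup looks roughly like this (a sketch based on the settings above; a Weights & Biases login is needed because report_to=["wandb"]):
import wandb
from trl import GRPOTrainer
wandb.login()  # needed once per environment for report_to=["wandb"]
trainer = GRPOTrainer(
    model=model,                     # LoRA-wrapped SmolLM-135M-Instruct
    reward_funcs=[reward_len],
    args=training_args,
    train_dataset=dataset["train"],
)
trainer.train()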
After training, LoRA adapters are merged back into the base model with:
merged_model = trainer.model.merge_and_unload()
merged_model.push_to_hub("SmolGRPO-135M", private=False, tags=["GRPO", "Reasoning-Course"])
This is the same pattern described in the LLM course for publishing the fine‑tuned checkpoint to the Hub.[web:45][web:72]
Limitations and intended use
- Model size: 135M parameters is very small; the model is not competitive with larger LLMs on complex reasoning or long‑context tasks.[web:2][web:3]
- Training budget: Single epoch on a small dataset means behavior is shaped but not robust across all domains.[web:42][web:45]
- Intended use:
- Educational example for GRPO + LoRA fine‑tuning.[web:45][web:73]
- Lightweight summarizer / short‑answer assistant for experimentation on limited hardware.
Do not use this model in high‑risk applications (medical, legal, financial, safety‑critical scenarios) without additional training, evaluation, and safety measures.[web:78]
How to reproduce training
High‑level steps:
- Load HuggingFaceTB/SmolLM-135M-Instruct as the base model.[web:2]
- Load mlabonne/smoltldr using datasets.load_dataset("mlabonne/smoltldr").[web:42][web:45]
- Wrap the model with LoRA using the configuration above.[web:45]
- Implement the reward_len function to target a length of ~50.[web:45]
- Configure GRPOConfig and instantiate GRPOTrainer with the train split.[web:45]
- Train for one epoch and monitor metrics in Weights & Biases.[web:45]
- Merge LoRA adapters into the base model and push the merged model to the Hub.[web:45][web:72]
For a full walkthrough (theory + code), see the “Practical Exercise: Fine‑tune a model with GRPO” in the Hugging Face LLM course.[web:45][web:73]
License
- Base model: HuggingFaceTB/SmolLM-135M-Instruct license; see that model card for exact terms.[web:2]
- Dataset: mlabonne/smoltldr; review its terms before commercial use.[web:42]
- This fine‑tuned model: same license as the base model, with no additional restrictions added.
By using this model, you agree to comply with the licenses and terms associated with the base model and dataset.