SmolGRPO-135M by @maharshpatelx
This repository contains SmolGRPO-135M, a GRPO‑fine‑tuned version of HuggingFaceTB/SmolLM-135M-Instruct trained on the mlabonne/smoltldr dataset.[web:2][web:42][web:45]
The goal is to make a tiny model that learns to produce concise completions of roughly 50 tokens while staying lightweight and easy to run on a single GPU.[web:42][web:45]
Model description
- Base model: HuggingFaceTB/SmolLM-135M-Instruct (a 135M‑parameter decoder‑only model).[web:2][web:3]
- Fine‑tuning method: GRPO (Group Relative Policy Optimization) with LoRA adapters.[web:45][web:72]
- Dataset: mlabonne/smoltldr – 2k short prompt/completion pairs (train split), plus validation and test splits.[web:42][web:45]
- Objective: encourage completions close to an ideal length of ~50 tokens via a simple reward function.[web:45]
- Intended use: small, educational model for experimenting with GRPO, length‑controlled summarization, and instruction following on limited hardware.[web:45][web:76]
This model is primarily a learning / experimentation checkpoint, not a production‑ready general assistant.[web:45]
Usage
Quick start with pipeline
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_id = "maharshpatelx/SmolGRPO-135M"
device = 0 if torch.cuda.is_available() else -1
print(f"Using device index: {device}")
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
device=device,
)
messages = [
{
"role": "user",
"content": (
"Summarize this paragraph in about 50 tokens:\n\n"
"Cats are small domesticated carnivores that live closely with humans..."
),
},
]
outputs = pipe(
messages,
max_new_tokens=128,
do_sample=True,
temperature=0.5,
min_p=0.1,
)
print(outputs["generated_text"])
The model expects chat‑style messages and works best for short summaries or short‑form answers where a compact response is desired.[web:2][web:45]
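If you prefer to call the model directly instead of through pipeline, the same generation can be done with the tokenizer's chat template. The following is a minimal sketch using the same sampling settings as above:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "maharshpatelx/SmolGRPO-135M"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
messages = [
    {
        "role": "user",
        "content": (
            "Summarize this paragraph in about 50 tokens:\n\n"
            "Cats are small domesticated carnivores that live closely with humans..."
        ),
    },
]
# Apply the chat template and tokenize in one step
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.5,
        min_p=0.1,
    )
# Decode only the newly generated tokens
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))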
Example: long cat document (from the GRPO exercise)
prompt = """
# A long document about the Cat
The cat (Felis catus), also referred to as the domestic cat or house cat, is a small
domesticated carnivorous mammal. It is the only domesticated species of the family Felidae.
Advances in archaeology and genetics have shown that the domestication of the cat occurred
in the Near East around 7500 BC. It is commonly kept as a pet and farm cat, but also ranges
freely as a feral cat avoiding human contact. It is valued by humans for companionship and
its ability to kill vermin. Its retractable claws are adapted to killing small prey species
such as mice and rats. It has a strong, flexible body, quick reflexes, and sharp teeth,
and its night vision and sense of smell are well developed. It is a social species,
but a solitary hunter and a crepuscular predator. Cat communication includes
vocalizations—including meowing, purring, trilling, hissing, growling, and grunting—as
well as body language. It can hear sounds too faint or too high in frequency for human ears,
such as those made by small mammals. It secretes and perceives pheromones.
"""
messages = [
{"role": "user", "content": prompt},
]
outputs = pipe(
messages,
max_new_tokens=256,
do_sample=True,
temperature=0.5,
min_p=0.1,
)
print(outputs[0]["generated_text"][-1]["content"])  # the generated TL;DR
This mirrors the example used in the Hugging Face LLM course GRPO chapter to evaluate the fine‑tuned model.[web:45][web:46]
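To check how close a generation lands to the ~50‑token target, you can tokenize the assistant reply. The snippet below assumes the chat‑format pipeline output used above, where generated_text is the full message list:
reply = outputs[0]["generated_text"][-1]["content"]
num_tokens = len(tokenizer(reply)["input_ids"])
print(f"Reply length: {num_tokens} tokens")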
Training details
Data
- Dataset: mlabonne/smoltldr.[web:42][web:45]
- Splits: train (2,000), validation (200), test (200), each with prompt and completion fields.[web:42]
- Task: given a short prompt, produce a concise TL;DR‑style completion.
The dataset is well suited to testing length‑controlled summarization on a small model.[web:42]
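For reference, loading and inspecting the dataset looks like this (split and column names as described above):
from datasets import load_dataset
dataset = load_dataset("mlabonne/smoltldr")
print(dataset)  # DatasetDict with train/validation/test splits
example = dataset["train"][0]
print(example["prompt"][:200])   # beginning of the source text
print(example["completion"])     # short TL;DR-style target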
Reward function
A simple length‑based reward was used:
ideal_length = 50
def reward_len(completions, **kwargs):
return [-abs(ideal_length - len(completion)) for completion in completions]
- The reward is highest when the completion length, as measured by len() (characters for plain‑text completions), is close to 50, and it decreases linearly as the response becomes shorter or longer than that.[web:45]
- This encourages the model to generate compact answers around the target length rather than overly long completions (illustrated below).[web:45][web:76]
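As a quick sanity check, reward_len can be called on a few dummy completions of different lengths:
# Dummy completions of length 20, 50, and 120 characters
dummy = ["x" * 20, "x" * 50, "x" * 120]
print(reward_len(dummy))  # [-30, 0, -70] – the 50-character completion scores best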
LoRA configuration
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
task_type="CAUSAL_LM",
r=16,
lora_alpha=32,
target_modules="all-linear",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
LoRA reduces the number of trainable parameters to a small fraction of the base model, which lowers memory usage and makes fine‑tuning feasible on a single consumer GPU.[web:45]
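print_trainable_parameters already reports this; the fraction can also be computed directly:
# Count trainable vs. total parameters of the LoRA-wrapped model
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")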
GRPO configuration
Core settings (adapted for a single‑GPU run):
from trl import GRPOConfig
training_args = GRPOConfig(
output_dir="GRPO",
learning_rate=2e-5,
per_device_train_batch_size=1,
gradient_accumulation_steps=1,
max_prompt_length=256,
max_completion_length=64,
num_generations=2,
optim="adamw_torch",
num_train_epochs=1,
bf16=True, # when supported on GPU
report_to=["wandb"],
remove_unused_columns=False,
logging_steps=1,
)
- Trainer: trl.GRPOTrainer with reward_funcs=[reward_len] and train_dataset=dataset["train"] (sketched below).[web:45]
- Logging: Weights & Biases stores reward, loss, and KL divergence over training steps, which helps interpret how GRPO modifies the policy.[web:45][web:76]
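Putting the pieces together, the trainer setup looks roughly like this (a sketch based on the settings above; a Weights & Biases login is needed because report_to=["wandb"]):
import wandb
from trl import GRPOTrainer
wandb.login()  # needed once per environment for report_to=["wandb"]
trainer = GRPOTrainer(
    model=model,                     # LoRA-wrapped SmolLM-135M-Instruct
    reward_funcs=[reward_len],
    args=training_args,
    train_dataset=dataset["train"],
)
trainer.train()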
After training, LoRA adapters are merged back into the base model with:
merged_model = trainer.model.merge_and_unload()
merged_model.push_to_hub("SmolGRPO-135M", private=False, tags=["GRPO", "Reasoning-Course"])
This is the same pattern described in the LLM course for publishing the fine‑tuned checkpoint to the Hub.[web:45][web:72]
Limitations and intended use
- Model size: 135M parameters is very small; the model is not competitive with larger LLMs on complex reasoning or long‑context tasks.[web:2][web:3]
- Training budget: Single epoch on a small dataset means behavior is shaped but not robust across all domains.[web:42][web:45]
- Intended use:
- Educational example for GRPO + LoRA fine‑tuning.[web:45][web:73]
- Lightweight summarizer / short‑answer assistant for experimentation on limited hardware.
Do not use this model in high‑risk applications (medical, legal, financial, safety‑critical scenarios) without additional training, evaluation, and safety measures.[web:78]
How to reproduce training
High‑level steps:
- Load HuggingFaceTB/SmolLM-135M-Instruct as the base model.[web:2]
- Load mlabonne/smoltldr using datasets.load_dataset("mlabonne/smoltldr").[web:42][web:45]
- Wrap the model with LoRA using the configuration above.[web:45]
- Implement the reward_len function to target a length of ~50.[web:45]
- Configure GRPOConfig and instantiate GRPOTrainer with the train split.[web:45]
- Train for one epoch and monitor metrics in Weights & Biases.[web:45]
- Merge LoRA adapters into the base model and push the merged model to the Hub.[web:45][web:72]
For a full walkthrough (theory + code), see the “Practical Exercise: Fine‑tune a model with GRPO” in the Hugging Face LLM course.[web:45][web:73]
License
- Base model: HuggingFaceTB/SmolLM-135M-Instruct license; see that model card for exact terms.[web:2]
- Dataset: mlabonne/smoltldr; review its terms before commercial use.[web:42]
- This fine‑tuned model: same license as the base model, with no additional restrictions added.
By using this model, you agree to comply with the licenses and terms associated with the base model and dataset.