
Model Card

1. Model Details

This model is the fine-tuned checkpoint described in the paper "Not All Steps are Informative: On the Linearity of LLMs' RLVR Training". It was trained with Reinforcement Learning with Verifiable Rewards (RLVR) to enhance reasoning capabilities.
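
Below is a minimal usage sketch, assuming the checkpoint loads as a standard causal language model through the Hugging Face transformers API; the prompt and generation settings are illustrative, not recommendations from the paper.

```python
# Minimal usage sketch (assumption: standard causal LM loadable with transformers).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Miaow-Lab/RLVR-Linearity-Checkpoints"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Solve step by step: what is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=1.0)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```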

2. Training Details

  • Hyperparameters:
    • Learning Rate: 1e-6
    • Train Batch Size: 128
    • PPO Mini Batch Size: 64
    • RL Algorithm: GRPO
    • Rollout Temperature: 1.0
    • Group Size: 16
  • Compute: Trained on 32 x H100 GPUs for about 150 hours.

For the full training configuration, please refer to config.json or the training scripts in our GitHub repository.
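
As a rough illustration, the hyperparameters listed above might map onto a training configuration along the following lines; the dictionary keys are hypothetical and do not necessarily match the actual config.json fields or training-script arguments.

```python
# Hypothetical sketch of the reported hyperparameters as a training config
# (key names are illustrative assumptions, not the actual config.json schema).
grpo_config = {
    "algorithm": "GRPO",
    "learning_rate": 1e-6,
    "train_batch_size": 128,
    "ppo_mini_batch_size": 64,
    "rollout_temperature": 1.0,
    "group_size": 16,  # number of sampled responses per prompt
}
```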

3. Citation

If you use this model in your research, please cite our paper:

@misc{wang2026stepsinformativelinearityllms,
      title={Not All Steps are Informative: On the Linearity of LLMs' RLVR Training}, 
      author={Tianle Wang and Zhongyuan Wu and Shenghao Jin and Hao Xu and Wei Chen and Ning Miao},
      year={2026},
      eprint={2601.04537},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.04537}, 
}

4. Motivation for this Model

This checkpoint is released primarily as a research artifact to facilitate the analysis of linearity in model outputs and weight updates during RLVR fine-tuning.
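
To make the intended analysis concrete, the sketch below shows one possible way (not the paper's code) to probe how closely each per-step weight update tracks the overall start-to-end update direction across a series of checkpoints; the revision tags are hypothetical placeholders.

```python
# Illustrative sketch (not the paper's code) of a linearity probe on weight updates:
# compare each per-step update direction against the overall update direction.
# Checkpoint revision names below are hypothetical assumptions.
import torch
from transformers import AutoModelForCausalLM

def flat_params(model):
    # Concatenate all parameters into one vector for direction comparison.
    return torch.cat([p.detach().float().flatten() for p in model.parameters()])

revisions = ["step-0", "step-50", "step-100"]  # hypothetical checkpoint tags
vecs = [
    flat_params(AutoModelForCausalLM.from_pretrained(
        "Miaow-Lab/RLVR-Linearity-Checkpoints", revision=rev))
    for rev in revisions
]

total_update = vecs[-1] - vecs[0]
for i in range(len(vecs) - 1):
    step_update = vecs[i + 1] - vecs[i]
    cos = torch.nn.functional.cosine_similarity(step_update, total_update, dim=0)
    print(f"{revisions[i]} -> {revisions[i + 1]}: cosine vs. total update = {cos.item():.4f}")
```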
