Rethinking Expert Trajectory Utilization in LLM Post-training
Abstract
The Sequential SFT-then-RL pipeline is identified as the most effective way to integrate expert trajectories, with guidelines for when to transition to RL, how to scale trajectory data, and how to select trajectories using the SFT validation loss.
While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) Refuting "Less is More" in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) Identifying that the Minimum SFT Validation Loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
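The two prescriptions that are easiest to operationalize are the transition rule (hand the model to RL near the SFT validation-loss minimum, i.e. the Stable or Mild Overfitting sub-phase) and the selection rule (prefer the expert-trajectory set whose SFT run reaches the lowest minimum validation loss). The sketch below is not the authors' implementation; the class, function, and dataset names (`SFTRun`, `select_trajectory_set`, the example loss curves) are hypothetical placeholders that only illustrate the bookkeeping implied by those two rules.

```python
# Minimal sketch (not the authors' code): use the minimum SFT validation loss
# to (a) rank candidate expert-trajectory sets and (b) pick the SFT checkpoint
# at which to hand off to RL. All names and numbers here are hypothetical.

from dataclasses import dataclass


@dataclass
class SFTRun:
    dataset_name: str
    val_losses: list[float]  # validation loss recorded after each SFT epoch

    @property
    def min_val_loss(self) -> float:
        return min(self.val_losses)

    @property
    def transition_epoch(self) -> int:
        # Transition to RL at (or just after) the validation-loss minimum,
        # i.e. the Stable / Mild Overfitting sub-phase, before heavier
        # overfitting erodes RL plasticity.
        return self.val_losses.index(self.min_val_loss) + 1


def select_trajectory_set(runs: list[SFTRun]) -> SFTRun:
    """Pick the trajectory set whose SFT run reaches the lowest minimum
    validation loss (the selection indicator proposed in the abstract)."""
    return min(runs, key=lambda r: r.min_val_loss)


if __name__ == "__main__":
    # Hypothetical validation-loss curves for three candidate trajectory sets.
    runs = [
        SFTRun("small_easy",  [1.20, 1.05, 1.01, 1.03, 1.08]),
        SFTRun("large_mixed", [1.10, 0.92, 0.88, 0.89, 0.95]),
        SFTRun("large_hard",  [1.15, 0.95, 0.90, 0.91, 0.97]),
    ]
    best = select_trajectory_set(runs)
    print(f"Selected trajectory set: {best.dataset_name}")
    print(f"Switch to RL after epoch {best.transition_epoch} "
          f"(min val loss {best.min_val_loss:.2f})")
```

In practice the validation losses would come from periodic evaluation during SFT; the point is that a single scalar per checkpoint is enough to drive both the trajectory-selection and the SFT-to-RL transition decisions.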
Community
The systematic study of expert trajectory utilization in LLM post-training.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Think Outside the Policy: In-Context Steered Policy Optimization (2025)
- RLoop: A Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization (2025)
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning (2025)
- VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation (2025)
- Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only (2025)
- Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability (2025)
- Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective (2025)
Nice work. Well aligned with my experience. The pattern might be different for a base model with adequate mid-training (which requires less SFT training). But I agree that the validation loss is a good indicator.
arXiv lens breakdown of this paper: https://arxivlens.com/PaperView/Details/rethinking-expert-trajectory-utilization-in-llm-post-training-1383-d4972d1b
- Key Findings
- Executive Summary
- Detailed Breakdown
- Practical Applications