In a Training Loop 🔄

Stefano Fiorucci PRO

anakin87

AI & ML interests

Language Models: orchestration, post-training, GRPO, synthetic data... Contributing to the Haystack LLM framework 🏗️

Recent Activity

posted an update about 7 hours ago
How does LLM training with RL environments work?

It all starts with Reinforcement Learning with Verifiable Rewards:
- a question is asked
- the model generates reasoning + an answer
- the answer is checked against ground truth
- the reward drives RL training

In this setup, the environment is simple: fixed questions and answers, rollout logic, and reward(s).

Consider a more complex tic-tac-toe env ❌⭕ It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions (envs can also include tools)

What happens at training time? We use Group Relative Policy Optimization (GRPO) with the tic-tac-toe env. No critic model is needed: the group is the baseline. Simpler than PPO.

1️⃣ Rollout generation: from the same board, the model plays N games via sampling
2️⃣ Each game is scored with deterministic rewards (win, format, ...)
3️⃣ The mean score is computed across the group
4️⃣ Each rollout's advantage = its score minus the group mean
5️⃣ The model is updated to favor trajectories above the baseline
🔁 Repeat

For a deep dive, check out 🌱 https://github.com/anakin87/llm-rl-environments-lil-course, a free hands-on course on RL environments for LLMs
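The group-relative advantage at the heart of GRPO (steps 2–4 above) can be sketched in a few lines. This is a minimal illustration of the idea, not the course's actual implementation; the reward values for win/draw/loss are assumed for the example.

```python
def group_relative_advantages(rewards):
    """Each rollout's advantage = its reward minus the group mean.

    The group mean acts as the baseline, so no critic model is needed.
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Example: 4 tic-tac-toe rollouts sampled from the same board,
# scored with assumed deterministic rewards (1.0 win, 0.5 draw, 0.0 loss)
rewards = [1.0, 0.0, 0.5, 1.0]
advantages = group_relative_advantages(rewards)
# Rollouts scored above the group mean (0.625) get a positive advantage,
# so the update favors those trajectories.
```

In practice GRPO implementations often also divide by the group's reward standard deviation to normalize the advantages, but the subtract-the-group-mean step shown here is the core of "the group is the baseline."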

Organizations

deepset · Blog-explorers · ZeroGPU Explorers · Hugging Face Discord Community