How does LLM training with RL environments work?
It all starts with Reinforcement Learning with Verifiable Rewards (RLVR):
- question asked
- model generates reasoning + answer
- answer checked against ground truth
- reward drives RL training
In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s)
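The verifiable-reward step above can be sketched in a few lines. This is a minimal illustration (the function name and the "Answer:" prompt convention are assumptions for the example, not from a specific library):

```python
def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Extract the final answer from the model's reasoning and compare it
    to the ground truth: reward 1.0 on an exact match, else 0.0.

    Assumes the model was prompted to end its output with "Answer: <value>".
    """
    answer = model_output.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0
```

Because the reward is computed deterministically from the output, no learned reward model is needed.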
Consider a more complex tic-tac-toe env:
It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions
(envs can also include tools)
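A toy version of such an env might look like this. All names here are illustrative, not from any particular framework; the point is the multi-turn step loop and the tunable opponent:

```python
import random

class TicTacToeEnv:
    """Toy multi-turn tic-tac-toe env: the agent plays 'X', the built-in
    opponent plays 'O' with adjustable skill."""
    WINS = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

    def __init__(self, opponent_skill=0.5, seed=None):
        self.opponent_skill = opponent_skill  # 0 = random play, 1 = always win/block
        self.rng = random.Random(seed)
        self.board = [" "] * 9

    def legal_moves(self):
        return [i for i, c in enumerate(self.board) if c == " "]

    def winner(self):
        for a, b, c in self.WINS:
            line = self.board[a] + self.board[b] + self.board[c]
            if line in ("XXX", "OOO"):
                return line[0]
        return None

    def _smart_move(self):
        # Prefer a winning move for O; otherwise block X's winning move.
        for mark in ("O", "X"):
            for i in self.legal_moves():
                self.board[i] = mark
                won = self.winner() == mark
                self.board[i] = " "
                if won:
                    return i
        return self.rng.choice(self.legal_moves())

    def step(self, agent_move):
        """Agent ('X') moves, then the opponent ('O') replies.
        Returns (board, reward, done)."""
        if agent_move not in self.legal_moves():
            return self.board, -1.0, True  # illegal move: penalize and end
        self.board[agent_move] = "X"
        if self.winner() == "X":
            return self.board, 1.0, True
        if not self.legal_moves():
            return self.board, 0.0, True  # draw
        # Opponent plays smart with probability = skill, else randomly.
        if self.rng.random() < self.opponent_skill:
            i = self._smart_move()
        else:
            i = self.rng.choice(self.legal_moves())
        self.board[i] = "O"
        if self.winner() == "O":
            return self.board, -1.0, True
        return self.board, 0.0, not self.legal_moves()
```

The `opponent_skill` knob gives a simple curriculum: start training against a random opponent, then raise the skill as the model improves.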
---
What happens at training?
We use Group Relative Policy Optimization (GRPO) with a tic-tac-toe env
No critic model is needed: the group itself serves as the baseline
Simpler than PPO
1️⃣ Rollout generation: from the same board, the model plays N games via sampling
2️⃣ Each game is scored with deterministic rewards (win, format, ...)
3️⃣ The mean score is computed across the group
4️⃣ Each rollout's advantage = its score minus the group mean
5️⃣ The model is updated to favor trajectories above the baseline
🔁 Repeat
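The group-relative advantage at the heart of GRPO is tiny to sketch. This simplified version uses a mean baseline only (some implementations also normalize by the group's standard deviation); the function name is illustrative:

```python
def group_advantages(rewards):
    """GRPO-style advantage: each rollout's reward minus the group mean.

    `rewards` holds the deterministic scores of N rollouts sampled
    from the same starting state (e.g. the same board).
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Example: 4 games from the same board -- two wins, two draws.
advs = group_advantages([1.0, 1.0, 0.0, 0.0])
# Rollouts above the group mean get positive advantage and are reinforced;
# those below get negative advantage and are discouraged.
```

This is why no critic is needed: the baseline comes from the group's own scores rather than a learned value function.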
For a deep dive, check out
https://github.com/anakin87/llm-rl-environments-lil-course
a free hands-on course on RL environments for LLMs