RefAlign: RL with Similarity-based Rewards
GitHub repository: https://github.com/mzhaoshuai/RefAlign
Paper: Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data.
This model is aligned with RefAlign, a versatile REINFORCE-style alignment algorithm that uses language generation evaluation metrics (such as BERTScore) between sampled generations and reference answers as surrogate rewards.
It is primarily aligned for safety.
The training data is available at https://huggingface.co/datasets/mzhaoshuai/Llama-3.3-70B-Inst-awq_SafeRLHF.
During Reinforcement Learning with Similarity-based Rewards, BERTScore serves as the reward function.
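Below is a minimal sketch of how such similarity-based rewards can be computed with the `bert-score` package. The use of the F1 score as the per-sample reward and the `facebook/bart-large-mnli` model identifier are assumptions based on the hyper-parameter table below, not the confirmed RefAlign implementation.

```python
# Hypothetical sketch: BERTScore between sampled generations and reference
# answers as a surrogate reward. Assumes the `bert-score` package; the choice
# of F1 as the reward and the model identifier are assumptions.
from bert_score import score


def similarity_rewards(sampled_generations, reference_answers):
    """Return one scalar reward per sampled generation.

    sampled_generations: list[str], the K sampled responses per prompt (flattened).
    reference_answers:   list[str], the reference answer paired with each sample.
    """
    # BERTScore returns precision, recall, and F1 tensors (one value per pair).
    # The layer used for embeddings falls back to the package default for this model.
    precision, recall, f1 = score(
        cands=sampled_generations,
        refs=reference_answers,
        model_type="facebook/bart-large-mnli",
        lang="en",
        verbose=False,
    )
    # Use F1 as the surrogate reward for each generation (assumption).
    return f1.tolist()
```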
| Hyper-Parameter | Value |
| --- | --- |
| LR | 3e-6 |
| Batch Size | 512 |
| Epochs | 2 |
| Prompt Length | 192 |
| Generation Length | 384 |
| Sampled Generations (K) | 2 |
| BERTScore Model | bart-large-mnli |
| Harmless Advantage Weight | 4.0 |
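The harmless advantage weight scales the safety-related term of the reward signal. The sketch below is a hypothetical illustration of one way such a weighted combination could be formed; the function names, the per-prompt mean baseline over the K samples, and the additive combination rule are assumptions for illustration, not the confirmed RefAlign procedure.

```python
# Hypothetical sketch of weighting a harmlessness advantage term.
# The baseline choice and combination rule are assumptions; see the paper
# and repository for the actual method.
import numpy as np

HARMLESS_ADVANTAGE_WEIGHT = 4.0  # value from the hyper-parameter table


def combined_advantages(helpful_rewards, harmless_rewards, k=2):
    """Compute per-sample advantages from rewards of K generations per prompt.

    helpful_rewards, harmless_rewards: sequences of length (num_prompts * k),
    grouped so that consecutive k entries belong to the same prompt.
    """
    def per_group_advantage(rewards):
        r = np.asarray(rewards, dtype=np.float64).reshape(-1, k)
        # Baseline each sample against the mean reward of its own prompt group.
        return (r - r.mean(axis=1, keepdims=True)).reshape(-1)

    helpful_adv = per_group_advantage(helpful_rewards)
    harmless_adv = per_group_advantage(harmless_rewards)
    # Weight the harmlessness term more heavily, as in the table above.
    return helpful_adv + HARMLESS_ADVANTAGE_WEIGHT * harmless_adv
```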