GPT-OSS 2048 Strategy Generator (GRPO Fine-tuned)

This is a fine-tuned version of unsloth/gpt-oss-20b, trained with GRPO (Group Relative Policy Optimization) to generate Python strategies for the 2048 game.

Gist

https://gist.github.com/bigsnarfdude/d444c1c9e6cf5b7377df22ea97eab10d

Model Description

  • Base Model: unsloth/gpt-oss-20b (20B parameters, 4-bit quantized)
  • Training Method: GRPO reinforcement learning
  • Task: Generate Python functions that play 2048 optimally
  • Training Steps: 1000
  • Performance: 60% win rate (reaching the 2048 tile) with an average score of 27,138

Training Details

Architecture

  • LoRA Configuration:
    • Rank: 4
    • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
    • Max sequence length: 768 tokens

GRPO Parameters

  • Learning rate: 5e-5
  • Batch size: 2
  • Weight decay: 0.01
  • Temperature: 1.0
  • Max training steps: 1000
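
These hyperparameters map naturally onto TRL's GRPOConfig. Below is a minimal sketch of how the run might be wired up, assuming the Unsloth + TRL stack listed under Training Infrastructure; the dataset and reward functions are placeholders, not the author's exact training script:

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    weight_decay=0.01,
    temperature=1.0,            # sampling temperature for GRPO rollouts
    max_steps=1000,
    max_completion_length=512,  # assumption; fits the 768-token context
)

trainer = GRPOTrainer(
    model=model,                # LoRA-wrapped model from FastLanguageModel
    args=training_args,
    # The three reward components described in the next section:
    reward_funcs=[function_works, no_cheating, strategy_succeeds],
    train_dataset=train_dataset,  # prompts asking for a `strategy` function
)
trainer.train()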

Reward Functions

The model was trained with three reward components, sketched in code after the list:

  1. function_works - Generated code executes without errors
  2. no_cheating - No forbidden modules or direct game state manipulation
  3. strategy_succeeds - Achieves high scores and wins games
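
A minimal sketch of what these components could look like as TRL-style reward functions (each takes a batch of completions and returns one float per completion). The extraction helper, thresholds, and penalty values are assumptions, not the author's exact code; `play_game` is the toy harness sketched under Training Performance below:

import re

FORBIDDEN = ("import os", "import sys", "__import__", "eval(", "exec(")

def _extract_code(text):
    """Use the first fenced Python block if present, else the raw completion."""
    m = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    return m.group(1) if m else text

def function_works(completions, **kwargs):
    """+1 if the completion defines a callable `strategy` without raising."""
    rewards = []
    for text in completions:
        ns = {}
        try:
            exec(_extract_code(text), ns)  # sandbox this in a real setup
            rewards.append(1.0 if callable(ns.get("strategy")) else 0.0)
        except Exception:
            rewards.append(0.0)
    return rewards

def no_cheating(completions, **kwargs):
    """Penalize forbidden modules and direct game-state manipulation."""
    return [
        -1.0 if any(tok in _extract_code(text) for tok in FORBIDDEN) else 0.0
        for text in completions
    ]

def strategy_succeeds(completions, **kwargs):
    """Reward high scores and wins; `play_game` is the toy harness below."""
    rewards = []
    for text in completions:
        try:
            ns = {}
            exec(_extract_code(text), ns)
            score, max_tile = play_game(ns["strategy"])
            rewards.append(2.0 if max_tile >= 2048 else score / 30000.0)
        except Exception:
            rewards.append(0.0)
    return rewards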

Training Performance

Checkpoint | Win Rate | Avg Score | Max Tile
-----------|----------|-----------|---------
100        | 0.0%     | 5,674     | 512
900        | 100.0%   | 22,794    | 2048
1000       | 60.0%    | 27,138    | 2048

The model shows a steep learning curve: the win rate climbs from 0% at checkpoint 100 to 100% at checkpoint 900, then settles at 60% at checkpoint 1000 while the average score continues to rise.
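
For reference, here is a toy version of the kind of evaluation harness that could produce these numbers: play full games with a generated strategy and track win rate, average score, and max tile. This is a hypothetical minimal 2048 implementation for illustration, not the author's evaluation code:

import random

def _slide_left(row):
    """Slide one row left, merging equal neighbours; returns (row, score gained)."""
    tiles = [v for v in row if v]
    out, gained, i = [], 0, 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            out.append(tiles[i] * 2)
            gained += tiles[i] * 2
            i += 2
        else:
            out.append(tiles[i])
            i += 1
    return out + [0] * (len(row) - len(out)), gained

def _apply(board, move):
    """Apply 'W'/'A'/'S'/'D' by reducing every direction to a left slide."""
    transpose = lambda b: [list(r) for r in zip(*b)]
    flip = lambda b: [r[::-1] for r in b]
    pre, post = {
        "A": (lambda b: b, lambda b: b),
        "D": (flip, flip),
        "W": (transpose, transpose),
        "S": (lambda b: flip(transpose(b)), lambda b: transpose(flip(b))),
    }[move]
    rows, gained = [], 0
    for row in pre(board):
        new, s = _slide_left(row)
        rows.append(new)
        gained += s
    return post(rows), gained

def _spawn(board):
    """Drop a 2 (90%) or 4 (10%) onto a random empty cell."""
    empties = [(i, j) for i, r in enumerate(board) for j, v in enumerate(r) if v == 0]
    if empties:
        i, j = random.choice(empties)
        board[i][j] = 4 if random.random() < 0.1 else 2

def play_game(strategy, size=4, max_moves=5000):
    board = [[0] * size for _ in range(size)]
    _spawn(board)
    _spawn(board)
    score = 0
    for _ in range(max_moves):
        new, gained = _apply(board, strategy(board))
        if new == board:  # strategy chose a no-op move; stop for simplicity
            break
        board, score = new, score + gained
        _spawn(board)
    return score, max(max(row) for row in board)

def evaluate(strategy, games=10):
    results = [play_game(strategy) for _ in range(games)]
    win_rate = sum(tile >= 2048 for _, tile in results) / games
    avg_score = sum(score for score, _ in results) / games
    return win_rate, avg_score, max(tile for _, tile in results)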

Usage

Loading the Model

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="vincentoh/gpt-oss-2048-gpro",
    max_seq_length=768,   # matches the training context length
    dtype=None,           # auto-detect
    load_in_4bit=True,    # 4-bit quantization, as used in training
)

# Set up for inference
FastLanguageModel.for_inference(model)

Generating Strategies

prompt = """Create a Python function called `strategy` that takes a 2D board as input and returns a move direction ('W', 'A', 'S', or 'D') to play the 2048 game optimally.

Requirements:
- Input: board (list of lists representing the game state)
- Output: single character string ('W' for up, 'A' for left, 'S' for down, 'D' for right)
- Goal: Achieve the 2048 tile with high score

Example usage:
```python
board = [[2, 0, 0, 0], [0, 4, 0, 0], [0, 0, 8, 0], [0, 0, 0, 16]]
move = strategy(board)
```

Implement the strategy function:
"""

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, not the echoed prompt
generated_code = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(generated_code)
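
The completion can then be turned into a callable and sanity-checked. A sketch; model-generated code is untrusted and should only be executed in a sandboxed environment:

import re

code = generated_code
fenced = re.search(r"```(?:python)?\n(.*?)```", code, re.DOTALL)
if fenced:
    code = fenced.group(1)  # prefer a fenced block if the model emitted one

namespace = {}
try:
    exec(code, namespace)   # untrusted code: sandbox this in practice
    strategy = namespace["strategy"]
    board = [[2, 0, 0, 0], [0, 4, 0, 0], [0, 0, 8, 0], [0, 0, 0, 16]]
    print(strategy(board))  # e.g. 'W'
except Exception as exc:
    print(f"Generated strategy failed validation: {exc}")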

Example Generated Strategy

Here's an example of a winning strategy generated by this model (60% win rate):

def strategy(board):
    # Find reachable moves
    moves = []
    for i, row in enumerate(board):
        for j, val in enumerate(row):
            if val == 0:
                # Check if we can move a tile into this empty spot
                if i > 0 and board[i-1][j] != 0:
                    moves.append("S")
                if i < len(board)-1 and board[i+1][j] != 0:
                    moves.append("W")
                if j > 0 and board[i][j-1] != 0:
                    moves.append("D")
                if j < len(row)-1 and board[i][j+1] != 0:
                    moves.append("A")

    # Prefer moving towards top-left corner
    if "W" in moves: return "W"
    if "A" in moves: return "A"
    if "D" in moves: return "D"
    if "S" in moves: return "S"
    return "W"

Model Card

  • Developed by: Vincent Oh
  • Model type: Causal Language Model (GptOssForCausalLM)
  • Language: English
  • License: Apache 2.0
  • Finetuned from: unsloth/gpt-oss-20b

Intended Use

This model is designed for:

  • Generating 2048 game playing strategies
  • Research in reinforcement learning for code generation
  • Educational purposes in game AI development
  • Benchmarking LLM code generation capabilities

Limitations

  • Focused specifically on 2048 game strategies
  • Performance may vary on different board sizes (trained on 6x6 boards)
  • Generated code should be validated before execution
  • Requires GPU with 20GB+ VRAM for full model inference

Hardware Requirements

  • Recommended: NVIDIA GPU with 20GB+ VRAM (e.g., RTX 4090, A100)
  • Minimum: 32GB system RAM for CPU inference (slow)
  • Storage: 13GB for full model weights

Citation

If you use this model in your research, please cite:

@misc{gpt-oss-2048-grpo,
  author = {Vincent Oh},
  title = {GPT-OSS 2048 Strategy Generator},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/vincentoh/gpt-oss-2048-gpro}},
  note = {Fine-tuned with GRPO reinforcement learning}
}

Training Infrastructure

  • GPU: NVIDIA RTX 4070 Ti Super (16GB VRAM)
  • Training Duration: ~12 hours for 1000 steps
  • Framework: Unsloth + TRL + Transformers

Contact

For questions or issues, please open an issue on the model repository or contact the author.
