We release: 67,000+ trajectories from 3,800 resolved issues in 1,800+ Python repos. That is about 3x more successful trajectories and 1.5x more repos than our previous dataset. Trajectories are long: 64 turns on average, up to 100 turns, and up to 131k tokens of context.
> RFT on this data lifts SWE-bench Verified scores: Qwen3-30B-Instruct from 25.7% to 50.3% Pass@1, and Qwen3-235B-Instruct from 46.2% to 61.7% Pass@1. We also see strong gains on SWE-rebench September.
> We also ran extensive evals: OpenHands under both 100-turn and 500-turn limits, comparing models under each limit, on SWE-bench Verified and several months of SWE-rebench.
> We also evaluate the tests the models write: how often the tests themselves are correct, and how often the final patch passes its own tests. This yields a pool of tests usable for verifiers and auto-graders.
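The usual way to check that a model-written test is correct is a fail-to-pass check: the test should fail on the base repo and pass once the candidate patch is applied. A minimal sketch with a toy in-memory "repo" (the helper names and the stand-in patch are hypothetical, not the release's actual harness):

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def fail_to_pass(repo: Path, test_cmd: list[str], apply_patch) -> bool:
    """A model-written test counts as correct if it fails before the
    candidate patch is applied and passes after."""
    before = subprocess.run(test_cmd, cwd=repo).returncode
    apply_patch(repo)
    after = subprocess.run(test_cmd, cwd=repo).returncode
    return before != 0 and after == 0


# Toy demo: a buggy module, a model-written test, and a candidate patch.
with tempfile.TemporaryDirectory() as d:
    repo = Path(d)
    (repo / "calc.py").write_text("def add(a, b):\n    return a - b  # bug\n")
    (repo / "test_calc.py").write_text("import calc\nassert calc.add(2, 3) == 5\n")

    def apply_patch(r: Path) -> None:
        # Stand-in for `git apply <patch>`: fix the operator.
        (r / "calc.py").write_text("def add(a, b):\n    return a + b\n")

    ok = fail_to_pass(repo, [sys.executable, "test_calc.py"], apply_patch)
    print("fail-to-pass:", ok)
```

In a real verifier the patch would be applied with `git apply` inside an isolated container, and the test command would be the repo's own test runner.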
We tested Qwen3-Coder, GPT-5, and 30+ other models on new SWE-bench-like tasks from July 2025!
Hi all, I’m Ibragim from Nebius.
We ran a benchmark on 34 fresh GitHub PR tasks from July 2025 using the SWE-rebench leaderboard https://swe-rebench.com/leaderboard . These are real, recent problems with no training-set contamination, and we evaluated both proprietary and open-source models.
Quick takeaways:
> GPT-5-Medium leads overall (29.4% resolved rate, 38.2% pass@5).
> Qwen3-Coder is the best open-source performer, matching GPT-5-High in pass@5 (32.4%) despite a lower resolved rate.
> Claude Sonnet 4.0 lags behind in pass@5 at 23.5%.
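Resolved rate here is a per-task pass@1 averaged over tasks; pass@5 can be computed with the standard unbiased pass@k estimator from Chen et al. (2021). A minimal sketch (whether the leaderboard uses exactly this estimator is an assumption):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts,
    drawn without replacement from n total attempts of which c are
    correct, solves the task. pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 5 attempts per task, one of which is correct:
print(pass_at_k(5, 1, 1))  # 0.2  (equals c/n, the per-task pass@1)
print(pass_at_k(5, 1, 5))  # 1.0  (any 5-of-5 sample includes the correct one)
```

Averaging these per-task values over all 34 tasks gives the leaderboard numbers.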
All tasks come from nebius/SWE-rebench-leaderboard, a continuously updated, decontaminated collection of real-world SWE tasks.