We release: 67,000+ trajectories from 3,800 resolved issues in 1,800+ Python repos. That is about 3x more successful trajectories and 1.5x more repos than our previous dataset. Trajectories are long: 64 turns on average, up to 100 turns, and up to 131k tokens of context.
> RFT on this data lifts SWE-bench Verified scores: Qwen3-30B-Instruct from 25.7% to 50.3% Pass@1, and Qwen3-235B-Instruct from 46.2% to 61.7% Pass@1. We also see strong gains on SWE-rebench September.
> We also ran extensive evals: OpenHands under both 100-turn and 500-turn limits, comparing models under each limit, on SWE-bench Verified and several months of SWE-rebench.
> We also evaluate the tests the models write: how often the tests themselves are correct, and how often the final patch passes its own tests. This yields a pool of tests usable for verifiers and auto-graders.
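The usual way to check that a model-written test is correct is a fail-to-pass check: the test should fail on the base repo and pass once the candidate patch is applied. A minimal sketch with a toy in-memory "repo" (the helper names and the stand-in patch are hypothetical, not the release's actual harness):

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def fail_to_pass(repo: Path, test_cmd: list[str], apply_patch) -> bool:
    """A model-written test counts as correct if it fails before the
    candidate patch is applied and passes after."""
    before = subprocess.run(test_cmd, cwd=repo).returncode
    apply_patch(repo)
    after = subprocess.run(test_cmd, cwd=repo).returncode
    return before != 0 and after == 0


# Toy demo: a buggy module, a model-written test, and a candidate patch.
with tempfile.TemporaryDirectory() as d:
    repo = Path(d)
    (repo / "calc.py").write_text("def add(a, b):\n    return a - b  # bug\n")
    (repo / "test_calc.py").write_text("import calc\nassert calc.add(2, 3) == 5\n")

    def apply_patch(r: Path) -> None:
        # Stand-in for `git apply <patch>`: fix the operator.
        (r / "calc.py").write_text("def add(a, b):\n    return a + b\n")

    ok = fail_to_pass(repo, [sys.executable, "test_calc.py"], apply_patch)
    print("fail-to-pass:", ok)
```

In a real verifier the patch would be applied with `git apply` inside an isolated container, and the test command would be the repo's own test runner.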
We tested Qwen3-Coder, GPT-5, and 30+ other models on new SWE-bench-like tasks from July 2025!
Hi all, I’m Ibragim from Nebius.
We ran a benchmark on 34 fresh GitHub PR tasks from July 2025 using the SWE-rebench leaderboard https://swe-rebench.com/leaderboard . These are real, recent problems with no training-set contamination, and we evaluated both proprietary and open-source models.
Quick takeaways:
> GPT-5-Medium leads overall (29.4% resolved rate, 38.2% pass@5).
> Qwen3-Coder is the best open-source performer, matching GPT-5-High in pass@5 (32.4%) despite a lower resolved rate.
> Claude Sonnet 4.0 lags behind in pass@5 at 23.5%.
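Resolved rate here is a per-task pass@1 averaged over tasks; pass@5 can be computed with the standard unbiased pass@k estimator from Chen et al. (2021). A minimal sketch (whether the leaderboard uses exactly this estimator is an assumption):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts,
    drawn without replacement from n total attempts of which c are
    correct, solves the task. pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 5 attempts per task, one of which is correct:
print(pass_at_k(5, 1, 1))  # 0.2  (equals c/n, the per-task pass@1)
print(pass_at_k(5, 1, 5))  # 1.0  (any 5-of-5 sample includes the correct one)
```

Averaging these per-task values over all 34 tasks gives the leaderboard numbers.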
All tasks come from nebius/SWE-rebench-leaderboard, a continuously updated, decontaminated collection of real-world SWE tasks.