Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents Paper • 2602.16699 • Published 2 days ago • 11
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling Paper • 2602.12279 • Published 8 days ago • 19
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces Paper • 2601.11868 • Published Jan 17 • 32
mlfoundations-dev/Qwen3-8B_exp-swd-r2egym-standard_glm_4.7_traces_locetash_save-strategy_steps Updated Jan 9
mlfoundations-dev/Qwen3-8B_perturbed-docker-exp-taskmaster2-tasks_glm_4.7_traces_locetash_save-strategy_steps Updated Jan 9
mlfoundations-dev/staqc-ot3-100k-code-subset-traces-terminus-2_save-strategy_steps_Qwen3-8B Updated Jan 4
mlfoundations-dev/GLM-4.6-stackexchange-overflow-sandboxes-32eps-65k-reasoning_learning-rate_1e-05_Qwen3-32B Updated Dec 28, 2025
mlfoundations-dev/GLM-4.6-stackexchange-overflow-sandboxes-32eps-65k-reasoning_num-train-epochs_6.0_Qwen3-32B Updated Dec 26, 2025
mlfoundations-dev/GLM-4.6-stackexchange-overflow-sandboxes-32eps-65k-reasoning_num-train-epochs_4.0_Qwen3-32B Updated Dec 25, 2025
mlfoundations-dev/openthoughts-4-code-qwen3-32b-annotated-32k_qwen3-1.7B_32k_eval_8179 Updated Dec 23, 2025 • 1