Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening
Abstract
A theoretically grounded method for improving large language model reasoning performance through distribution sharpening without iterative sampling or external rewards, achieving comparable results to reinforcement learning post-training with significantly reduced computational costs.
Reinforcement learning (RL) post-training is a dominant approach for improving the reasoning performance of large language models (LLMs), yet growing evidence suggests that its gains arise primarily from distribution sharpening rather than the acquisition of new capabilities. Recent work has shown that sampling from the power distribution of LLMs using Markov chain Monte Carlo (MCMC) can recover performance comparable to RL post-training without relying on external rewards; however, the high computational cost of MCMC makes such approaches impractical for widespread adoption. In this work, we propose a theoretically grounded alternative that eliminates the need for iterative MCMC. We derive a novel formulation showing that the global power distribution can be approximated by a token-level scaled low-temperature one, where the scaling factor captures future trajectory quality. Leveraging this insight, we introduce a training-free and verifier-free algorithm that sharpens the base model's generative distribution autoregressively. Empirically, we evaluate our method on math, QA, and code tasks across four LLMs, and show that our method matches or surpasses one-shot GRPO without relying on any external rewards, while reducing inference latency by over 10x compared to MCMC-based sampling.
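For readers who want a sense of what "token-level scaled low-temperature" means formally, here is a minimal sketch of the kind of factorization the abstract alludes to (the notation is ours, not the paper's). For an autoregressive model $p$ and power $\alpha > 1$, the global power distribution $q_\alpha(x_{1:T}) \propto p(x_{1:T})^\alpha$ has exact conditionals

$$
q_\alpha(x_t \mid x_{<t}) \;\propto\; p(x_t \mid x_{<t})^{\alpha}\, Z_\alpha(x_{\le t}),
\qquad
Z_\alpha(x_{\le t}) \;=\; \sum_{x_{>t}} p(x_{>t} \mid x_{\le t})^{\alpha},
$$

i.e., a low-temperature token term multiplied by a scaling factor $Z_\alpha(x_{\le t})$ that measures how much high-probability future mass the prefix retains (the "future trajectory quality" mentioned above). Approximating this scaling factor efficiently, without MCMC, is the paper's contribution; the identity only shows why such a token-level factorization exists.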
Community
What if RL isn’t teaching LLMs how to reason, but just sharpening what’s already there?
Most recent progress in LLM reasoning comes from RL post-training (GRPO, verifiers, rewards).
But there’s growing evidence that these gains may come less from learning new capabilities and more from reshaping the distribution of outputs.
In our new work, we take that idea seriously.
We show that:
- Reasoning trajectories already exist in base models
- What matters is how you sample, not how you retrain
- The global power distribution can be approximated autoregressively, without MCMC
The result is a training-free, verifier-free inference-time method that:
⚡ Matches GRPO-style post-training
⏱ Is ~10× faster than MCMC-based power sampling
🧪 Requires no rewards, no finetuning, no verifier
Conceptually, the key insight is simple:
Power sampling ≈ low-temperature sampling × future-aware token scaling
This lets us recover global reasoning behaviour token by token, without expensive trajectory-level inference.
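To make the token-by-token recipe concrete, here is a minimal decoding-loop sketch in the spirit of the insight above. This is not the paper's algorithm: the `future_quality` stub (mean log-prob of a short greedy rollout), the re-scoring over a top-k candidate set, and the weights `alpha`/`beta` are placeholder choices for illustration only; the paper derives the future-aware term rather than estimating it by rollout.

```python
# Sketch of "power sampling ≈ low-temperature sampling × future-aware token scaling".
# NOT the paper's method: future_quality() is a crude stand-in used only to show
# where a future-aware scaling term would enter an autoregressive decoding loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM works; gpt2 keeps the sketch cheap
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def future_quality(prefix_ids: torch.Tensor, horizon: int = 8) -> float:
    """Placeholder: mean log-prob of a short greedy continuation of the prefix."""
    ids, logps = prefix_ids, []
    for _ in range(horizon):
        logp = torch.log_softmax(model(ids).logits[:, -1, :], dim=-1)
        nxt = logp.argmax(dim=-1, keepdim=True)
        logps.append(logp.gather(-1, nxt).item())
        ids = torch.cat([ids, nxt], dim=-1)
    return sum(logps) / len(logps)

@torch.no_grad()
def power_sharpened_decode(prompt: str, alpha: float = 4.0, beta: float = 1.0,
                           max_new_tokens: int = 32, top_k: int = 8) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logp = torch.log_softmax(model(ids).logits[:, -1, :], dim=-1)
        cand_logp, cand_ids = logp.topk(top_k, dim=-1)  # rescore only top-k tokens
        scores = []
        for j in range(top_k):
            cand = torch.cat([ids, cand_ids[:, j:j + 1]], dim=-1)
            # token-level power (low temperature) + future-aware scaling
            scores.append(alpha * cand_logp[0, j].item() + beta * future_quality(cand))
        probs = torch.softmax(torch.tensor(scores), dim=-1)
        choice = torch.multinomial(probs, 1).item()
        ids = torch.cat([ids, cand_ids[:, choice:choice + 1]], dim=-1)
        if ids[0, -1].item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)

print(power_sharpened_decode("Question: what is 7 * 8? Answer:"))
```

The point of the sketch is only to show where the sharpening enters: everything happens inside a single left-to-right pass, with no trajectory-level MCMC, no reward model, and no finetuning.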
Does the work presented in the paper have a code implementation? 💻
arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/scalable-power-sampling-unlocking-efficient-training-free-reasoning-for-llms-via-distribution-sharpening-7200-1ca97c34
- Executive Summary
- Detailed Breakdown
- Practical Applications
Amazing ❤️
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training (2026)
- Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts (2026)
- Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective (2025)
- QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management (2025)
- DiRL: An Efficient Post-Training Framework for Diffusion Language Models (2025)
- Not All Steps are Informative: On the Linearity of LLMs' RLVR Training (2026)
- Coupled Variational Reinforcement Learning for Language Model General Reasoning (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend
I didn't see anything in the paper about power sampling over SFT-trained models. Is this a viable use case? Also, does it play nice with Latent Reasoning? Cheers. Great paper 👍
Thanks a lot for the kind words.
We tried it over a GRPO-trained model, but not over an SFT-trained one. It is indeed worth checking 🤗
