arXiv:2604.24927

Large Language Models Explore by Latent Distilling

Published on Apr 27 · Submitted by Zeng Yuanhao on Apr 30

Abstract

AI-generated summary: Exploratory Sampling enhances LLM generation diversity by using a lightweight distiller to predict hidden representations and bias decoding toward novel semantic patterns.

Generating diverse responses is crucial for test-time scaling of large language models (LLMs), yet standard stochastic sampling mostly yields surface-level lexical variation, limiting semantic exploration. In this paper, we propose Exploratory Sampling (ESamp), a decoding approach that explicitly encourages semantic diversity during generation. ESamp is motivated by the well-known observation that neural networks tend to make lower-error predictions on inputs similar to those encountered before, and incur higher prediction error on novel ones. Building on this property, we train a lightweight Distiller at test time to predict deep-layer hidden representations of the LLM from its shallow-layer representations, modeling the LLM's depth-wise representation transitions. During decoding, the Distiller continuously adapts to the mappings induced by the current generation context. ESamp uses the prediction error as a novelty signal to reweight candidate token extensions conditioned on the current prefix, thereby biasing decoding toward less-explored semantic patterns. ESamp is implemented with an asynchronous training–inference pipeline with less than 5% worst-case overhead (1.2% in the optimized release). Empirical results show that ESamp significantly boosts the Pass@k efficiency of reasoning models, matching or surpassing strong stochastic and heuristic baselines. Notably, ESamp generalizes robustly across mathematics, science, and code generation benchmarks and breaks the trade-off between diversity and coherence in creative writing. Our code has been released at: https://github.com/LinesHogan/tLLM.

Community

Paper author · Paper submitter

We are excited to share our new paper: “Large Language Models Explore by Latent Distilling.”

The core question we study is simple: when we sample multiple LLM responses at test time, are we really getting diverse reasoning paths, or just different surface forms of the same idea? To address this, we propose Exploratory Sampling (ESamp), a decoding method that encourages semantic exploration during generation.

ESamp trains a lightweight Latent Distiller online to predict deep-layer LLM representations from shallow-layer representations. The prediction error provides a novelty signal: familiar semantic trajectories become easier to predict, while under-explored directions produce higher error. We then use this signal to guide sampling toward less redundant continuations, with a formulation grounded in KL-regularized policy optimization.
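To make this concrete, here is a minimal PyTorch sketch of the distiller idea; the module structure, layer choices, and names (LatentDistiller, novelty_score, online_update) are illustrative assumptions, not the actual implementation.

import torch
import torch.nn as nn

# Minimal sketch: a small MLP maps shallow-layer hidden states to deep-layer
# hidden states; its prediction error serves as a per-token novelty signal.
class LatentDistiller(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, shallow: torch.Tensor) -> torch.Tensor:
        return self.net(shallow)

def novelty_score(distiller: LatentDistiller,
                  shallow: torch.Tensor,
                  deep: torch.Tensor) -> torch.Tensor:
    # Higher prediction error = less familiar semantic trajectory.
    with torch.no_grad():
        pred = distiller(shallow)
    return (pred - deep).pow(2).mean(dim=-1)  # per-token error, shape (batch, seq)

def online_update(distiller: LatentDistiller,
                  optimizer: torch.optim.Optimizer,
                  shallow: torch.Tensor,
                  deep: torch.Tensor) -> float:
    # One online training step on hidden states from the current generation context.
    loss = (distiller(shallow) - deep).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

The novelty score would then feed the KL-regularized reweighting of candidate continuations described above.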

Across math, science, code, and creative writing benchmarks, ESamp improves diversity and Pass@k efficiency while preserving strong throughput through an asynchronous implementation in tLLM.

We hope ESamp can be a useful step toward more efficient and principled test-time exploration for LLMs.

This paper is exciting not only because of the algorithm, but also because of the systems angle.

There are many recent attempts to intervene in LLM generation at test time, but in practice many of them become too slow once implemented seriously. ESamp is impressive because it shows that online adaptation during decoding does not have to destroy throughput. The paper decouples the lightweight Distiller's training and inference from the main LLM generation through an asynchronous pipeline, and the open-source tLLM implementation reports about 98.8% of the throughput of an optimized vLLM baseline under aligned benchmark settings.
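A toy sketch of that decoupling, assuming the decode loop only needs to hand hidden states to a background worker and never wait on it; the queue, worker, and function names are illustrative, not tLLM's actual pipeline.

import queue
import threading

# Hypothetical sketch of an asynchronous training-inference split: the decode
# loop enqueues hidden states without blocking, while a background thread
# trains the distiller whenever samples are available.
hidden_state_queue = queue.Queue(maxsize=64)

def distiller_worker(distiller, optimizer):
    while True:
        item = hidden_state_queue.get()
        if item is None:  # sentinel value shuts the worker down
            break
        shallow, deep = item
        loss = (distiller(shallow) - deep).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def enqueue_hidden_states(shallow, deep):
    # Called from the decode loop; silently drops samples if the worker lags,
    # so generation throughput is never gated on distiller training.
    try:
        hidden_state_queue.put_nowait((shallow.detach(), deep.detach()))
    except queue.Full:
        pass

def start_distiller_worker(distiller, optimizer):
    # Launch the background trainer; the decode loop keeps running regardless.
    worker = threading.Thread(target=distiller_worker,
                              args=(distiller, optimizer),
                              daemon=True)
    worker.start()
    return worker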

I think this matters a lot for the broader test-time intervention community. The runtime abstraction is useful beyond ESamp itself: researchers can design new decoding-time adaptation algorithms while relying on a high-throughput implementation path instead of maintaining fragile private forks of inference engines.

Algorithmically interesting, but also genuinely practical. That combination is rare.

Paper author · Paper submitter

Hi everyone! We are excited to share our work, Large Language Models Explore by Latent Distilling.

This paper introduces ESamp, a test-time sampling algorithm that helps LLMs generate multiple semantically diverse responses in parallel, rather than merely producing surface-level variations of the same idea. The key intuition is to use an online Latent Distiller to estimate whether the current generation trajectory is familiar or under-explored, and then guide sampling toward more novel semantic directions.

A major focus of this work is also efficiency. Through an algorithm-system co-design, ESamp reaches 98.8% of the throughput of an optimized vLLM baseline with most modern acceleration techniques enabled, showing that test-time intervention can be both effective and practical.

We have also open-sourced the efficient implementation as tLLM:
https://github.com/LinesHogan/tllm

tLLM is decoupled from ESamp and can be viewed as a lightweight module loader for vLLM-style inference, enabling low-overhead access to the LLM residual stream for a wide range of test-time algorithms. We hope it can help bring more intervention methods to production-level efficiency and serve as a shared playground for the community.
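For intuition, this is roughly what low-overhead residual-stream access can look like with plain PyTorch forward hooks; the attach_hidden_state_hooks helper and the model.model.layers layout are assumptions for illustration, not tLLM's actual interface.

def attach_hidden_state_hooks(model, shallow_idx: int, deep_idx: int):
    # Capture hidden states from two decoder layers during the forward pass.
    captured = {}

    def make_hook(name):
        def hook(module, inputs, output):
            # Many decoder blocks return a tuple whose first element is the hidden state.
            hidden = output[0] if isinstance(output, tuple) else output
            captured[name] = hidden.detach()
        return hook

    layers = model.model.layers  # assumed layout of common HF-style decoder models
    handles = [
        layers[shallow_idx].register_forward_hook(make_hook("shallow")),
        layers[deep_idx].register_forward_hook(make_hook("deep")),
    ]
    return captured, handles  # call handle.remove() on each handle when done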

If you are interested in accelerating your own algorithms or contributing to tLLM, feel free to reach out!

nerding out on the latent distiller idea: mapping shallow to deep hidden states to model depth-wise transitions is a clean test-time lever for semantic exploration. the trick is turning that latent prediction error into a per-token novelty signal and feeding it into a KL-regularized reweighting inside the async decode loop. the simple logit update logit_new = (1 + β) logit_ref − β logit_dist, together with the batch-wide coordination, feels like the right set of knobs for biasing away from overused reasoning patterns. btw the arxivlens breakdown helped me parse the method details, and i appreciated its recap next to the figures: https://arxivlens.com/PaperView/Details/large-language-models-explore-by-latent-distilling-5130-10dc14c9. one question: how sensitive is performance to which deep layer you predict from the shallow state, or to the exact depth L used by the distiller?
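for concreteness, a toy PyTorch version of that update; the meaning of logit_ref and logit_dist is taken as in the comment above, and the sampling wrapper plus the β and temperature defaults are just illustrative.

import torch

def exploratory_logits(logit_ref: torch.Tensor,
                       logit_dist: torch.Tensor,
                       beta: float) -> torch.Tensor:
    # logit_new = (1 + beta) * logit_ref - beta * logit_dist, as quoted above.
    return (1.0 + beta) * logit_ref - beta * logit_dist

def sample_next_token(logit_ref: torch.Tensor,
                      logit_dist: torch.Tensor,
                      beta: float = 0.5,
                      temperature: float = 1.0) -> torch.Tensor:
    # Standard temperature sampling on the reweighted logits (illustrative only).
    logits = exploratory_logits(logit_ref, logit_dist, beta) / temperature
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)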
