Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
Abstract
Large language model control methods are unified under a dynamic weight update framework, revealing a preference-utility trade-off and enabling improved steering through the SPLIT approach.
Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, guided by this analysis, we introduce SPLIT, a new steering approach that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.
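To make the shared log-odds scale concrete, here is a minimal sketch, not the paper's exact protocol: the model name, example pairs, and helper function are illustrative assumptions, showing how preference and utility can both be scored as log-odds between polarity-paired contrastive continuations with a causal LM.

```python
# Minimal sketch (illustrative, not the paper's protocol) of scoring preference
# and utility on a shared log-odds scale from polarity-paired continuations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` after `prompt`.
    Assumes the prompt tokenization is a prefix of the full tokenization
    (true for typical space-prefixed continuations with BPE tokenizers)."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    logits = model(full_ids).logits.log_softmax(dim=-1)
    start = prompt_ids.shape[1]
    cont_ids = full_ids[0, start:]
    # Each continuation token is predicted from the previous position's logits.
    return logits[0, start - 1:-1].gather(-1, cont_ids.unsqueeze(-1)).sum().item()

def log_odds(prompt: str, positive: str, negative: str) -> float:
    """Shared log-odds scale: log p(positive | prompt) - log p(negative | prompt)."""
    return continuation_logprob(prompt, positive) - continuation_logprob(prompt, negative)

# Polarity-paired contrastive examples (hypothetical).
prompt = "Overall, I thought the movie was"
preference = log_odds(prompt, " wonderful.", " terrible.")        # tendency toward the target concept
utility = log_odds(prompt, " wonderful.", " wonderful the the.")  # coherent vs. degenerate continuation
print(f"preference log-odds: {preference:.2f}, utility log-odds: {utility:.2f}")
```

The same two scores can be recomputed under any intervention (fine-tuned weights, a LoRA adapter, or an activation edit), which is what puts the different control methods on a comparable footing.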
Community
We unify LLM control methods as dynamic weight updates, analyze their trade-offs between preference (targeted behavior) and utility (task-valid generation) via a shared log-odds framework, explain these effects through activation manifolds, and introduce SPLIT, a steering method that enhances preference while better preserving utility.
Great paper—your unified view of control methods and the preference–utility trade-off provides a clear framework for understanding steering.
Our recent work SafeConstellations (https://arxiv.org/abs/2508.11290) takes a complementary approach. Instead of parameter updates, we analyze representation dynamics across layers, showing that tasks follow consistent "trajectory constellations" in embedding space. Over-refusals occur when benign inputs are pushed onto refusal-oriented trajectories.
We propose an inference-time method that selectively shifts representations back toward non-refusal pathways for over-refusal-prone tasks, reducing over-refusals by up to 73% with minimal impact on utility.
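As a rough sketch of the idea, not our exact implementation (the layer index, module path, direction vector, and scale below are hypothetical), inference-time representation shifting can be prototyped with a PyTorch forward hook that nudges hidden states along a non-refusal direction without touching the weights:

```python
# Generic sketch of inference-time representation shifting via a forward hook;
# the actual SafeConstellations procedure is described in the paper.
import torch

def make_shift_hook(direction: torch.Tensor, scale: float):
    """Return a hook that nudges hidden states along `direction`
    (e.g., away from a refusal-oriented trajectory) at inference time."""
    unit = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        shifted = hidden + scale * unit.to(hidden.dtype).to(hidden.device)
        return (shifted,) + output[1:] if isinstance(output, tuple) else shifted
    return hook

# Usage (illustrative): attach to one transformer block only for
# over-refusal-prone inputs, then detach after generation.
# layer = model.transformer.h[LAYER_IDX]   # hypothetical module path
# handle = layer.register_forward_hook(make_shift_hook(non_refusal_direction, 4.0))
# ... run generation ...
# handle.remove()
```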
Your activation-manifold explanation aligns closely with our findings—SafeConstellations realizes this principle at the trajectory level. It would be exciting to explore combining SPLIT-style control with task-specific trajectory steering.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API.
- YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation (2026)
- CARD: Cluster-level Adaptation with Reward-guided Decoding for Personalized Text Generation (2026)
- AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling (2026)
- Activation Steering for Masked Diffusion Language Models (2025)
- Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection (2026)
- CLaS-Bench: A Cross-Lingual Alignment and Steering Benchmark (2026)
- CBMAS: Cognitive Behavioral Modeling via Activation Steering (2026)