Abstract
Shallow-pi is a knowledge distillation framework that reduces transformer depth in vision-language-action models, achieving faster inference with minimal performance loss in real-world robotic applications.
The growing demand for real-time robotic deployment necessitates fast, on-device inference for vision-language-action (VLA) models. Within the VLA literature, efficiency has been studied extensively at the token level, for example via visual token pruning. In contrast, systematic transformer layer reduction has received limited attention and, to the best of our knowledge, has not been explored for flow-based VLA models under knowledge distillation. In this work, we propose Shallow-pi, a principled knowledge distillation framework that aggressively reduces the transformer depth of both the VLM backbone and the flow-based action head, compressing the model from 18 to 6 layers. Shallow-pi achieves more than 2x faster inference with less than a one percent absolute drop in success rate on standard manipulation benchmarks, establishing state-of-the-art performance among reduced-depth VLA models. Crucially, we validate our approach through industrial-scale real-world experiments on Jetson Orin and Jetson Thor across multiple robot platforms, including humanoid systems, in complex and dynamic manipulation scenarios.
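To make the depth-reduction idea concrete, the sketch below shows one common form of layer-wise knowledge distillation in PyTorch: a 6-layer student transformer is trained to match selected hidden states of a frozen 18-layer teacher under a uniform layer mapping. The encoder sizes, the 3*i + 2 mapping rule, and the MSE objective are illustrative assumptions, not the actual Shallow-pi training procedure.

```python
# Minimal sketch of depth-reduction knowledge distillation, assuming a uniform
# teacher->student layer mapping and MSE hidden-state matching. Model sizes,
# the mapping rule, and hyperparameters are illustrative, not the Shallow-pi recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(num_layers: int, d_model: int = 256, nhead: int = 8) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

teacher = make_encoder(num_layers=18).eval()   # frozen deep teacher (e.g. VLM backbone)
student = make_encoder(num_layers=6)           # shallow student: 3x depth reduction

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

def forward_with_hidden(encoder: nn.TransformerEncoder, x: torch.Tensor):
    """Run the encoder layer by layer and keep every intermediate hidden state."""
    hidden, out = [], x
    for layer in encoder.layers:
        out = layer(out)
        hidden.append(out)
    return out, hidden

def distill_step(tokens: torch.Tensor) -> torch.Tensor:
    """One distillation step: student layer i mimics teacher layer 3*i + 2."""
    with torch.no_grad():
        _, t_hidden = forward_with_hidden(teacher, tokens)
    _, s_hidden = forward_with_hidden(student, tokens)
    loss = sum(F.mse_loss(s_h, t_hidden[3 * i + 2]) for i, s_h in enumerate(s_hidden))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()

# Example usage on a dummy token batch (batch=8, seq=64, dim=256):
# loss = distill_step(torch.randn(8, 64, 256))
```

The uniform mapping that aligns each student layer with every third teacher layer is one standard choice for depth distillation; in practice the layer assignment and loss weighting would be tuned for the specific backbone and action head.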
Community
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- AC^2-VLA: Action-Context-Aware Adaptive Computation in Vision-Language-Action Models for Efficient Robotic Manipulation (2026)
- Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models (2025)
- BLURR: A Boosted Low-Resource Inference for Vision-Language-Action Models (2025)
- Robotic VLA Benefits from Joint Learning with Motion Image Diffusion (2025)
- VAT: Vision Action Transformer by Unlocking Full Representation of ViT (2025)
- HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies (2025)
- FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization (2025)