OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism
Abstract
Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment due to redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference paradigm that treats the KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this paradigm for π_{0.5}, the most popular MoT VLA, and evaluate it under representative robotic configurations. OxyGen achieves up to 3.7× speedup over isolated execution, delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously, without action quality degradation.
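The cross-task KV sharing idea can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a hypothetical `prefill_fn` interface and uses an observation-hash-keyed pool with reference counting, so that multiple tasks (e.g. language decoding and action generation) reuse the prefill KV of the same observation instead of recomputing it:

```python
# Hypothetical sketch of cross-task KV sharing (illustrative only).
# A shared pool keyed by observation hash: the first task to request an
# observation pays the prefill cost; later tasks reuse the cached entry.
from dataclasses import dataclass
import hashlib


@dataclass
class KVEntry:
    kv: list            # stand-in for per-layer key/value tensors
    ref_count: int = 0  # number of tasks currently using this entry


class UnifiedKVCache:
    """Shared KV pool: one prefill per observation, reused across tasks."""

    def __init__(self, prefill_fn):
        self.prefill_fn = prefill_fn  # assumed model prefill interface
        self.pool: dict[str, KVEntry] = {}

    def acquire(self, observation: bytes) -> KVEntry:
        key = hashlib.sha256(observation).hexdigest()
        if key not in self.pool:  # prefill only once per observation
            self.pool[key] = KVEntry(kv=self.prefill_fn(observation))
        entry = self.pool[key]
        entry.ref_count += 1      # track tasks sharing this cache
        return entry

    def release(self, observation: bytes) -> None:
        key = hashlib.sha256(observation).hexdigest()
        entry = self.pool[key]
        entry.ref_count -= 1
        if entry.ref_count == 0:  # evict once no task still needs it
            del self.pool[key]


# Usage: two tasks share one prefill of the same camera frame.
calls = []

def fake_prefill(obs):
    calls.append(obs)        # count how many prefills actually run
    return ["kv-tensors"]

cache = UnifiedKVCache(fake_prefill)
obs = b"camera_frame_0"
lang_kv = cache.acquire(obs)  # language task triggers prefill
act_kv = cache.acquire(obs)   # action task reuses the same entry
```

The reference count lets the pool evict an observation's KV only after every consuming task has released it, which is one simple policy for bounding pool size.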
Community
OxyGen optimizes multi-task inference for Mixture-of-Transformers (MoT) Vision-Language-Action (VLA) models (e.g., pi0.5) through unified KV cache management. It achieves up to 3.7x speedup over the baseline system (openpi), delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously, without action quality degradation.
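The second optimization, cross-frame continuous batching, can be sketched in the same spirit. This is a hypothetical illustration, not OxyGen's scheduler: each control cycle runs one fixed-rate action step, while in-flight language decode requests are carried across cycles and batched until they finish, so slow language decoding never blocks action generation:

```python
# Hypothetical sketch of cross-frame continuous batching (illustrative only).
# Action generation runs once per control cycle at a fixed rate; language
# decode requests persist across cycles, advancing one token per cycle.
from collections import deque


class ContinuousBatcher:
    def __init__(self, decode_step):
        # decode_step(request) -> bool: assumed interface that advances a
        # request by one token and returns True when the request is done.
        self.decode_step = decode_step
        self.inflight = deque()  # language requests spanning frames

    def submit(self, request) -> None:
        self.inflight.append(request)

    def run_cycle(self, act_step) -> None:
        act_step()  # fixed-rate action generation for this frame
        still_running = deque()
        for req in self.inflight:          # batched language decode step
            if not self.decode_step(req):  # keep unfinished requests
                still_running.append(req)
        self.inflight = still_running


# Usage: simulate requests needing different numbers of decode steps.
def decode_step(req):
    req["tokens_left"] -= 1
    return req["tokens_left"] == 0

actions = []
batcher = ContinuousBatcher(decode_step)
batcher.submit({"tokens_left": 3})  # long language response
batcher.submit({"tokens_left": 1})  # short language response
for _ in range(3):
    batcher.run_cycle(lambda: actions.append("action_chunk"))
```

After three cycles, three action chunks have been produced at a fixed rate even though the two language requests finished at different times.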
We have released the source code (based on openpi) on GitHub. You are welcome to try OxyGen with the official pi0.5 checkpoints and reproduce our experimental results on your machine. We plan to release models trained for this inference paradigm in the future.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ICaRus: Identical Cache Reuse for Efficient Multi Model Inference (2026)
- How Fast Can I Run My VLA? Demystifying VLA Inference Performance with VLA-Perf (2026)
- Environment-Aware Adaptive Pruning with Interleaved Inference Orchestration for Vision-Language-Action Models (2026)
- PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving (2026)
- PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference (2026)
- Learning to Accelerate Vision-Language-Action Models through Adaptive Visual Token Caching (2026)
- AgentServe: Algorithm-System Co-Design for Efficient Agentic AI Serving on a Consumer-Grade GPU (2026)