DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding
Abstract
DIFFA-2, a diffusion-based large audio language model, achieves competitive audio understanding performance with improved efficiency over autoregressive counterparts through enhanced encoding, dual adapters, and staged training.
Autoregressive (AR) large audio language models (LALMs) such as Qwen-2.5-Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and their strictly sequential decoding limits inference efficiency. Diffusion large language models (dLLMs) have recently been shown to make effective use of limited training data, and prior work on DIFFA indicates that replacing an AR backbone with a diffusion counterpart can substantially improve audio understanding under matched settings, albeit at a proof-of-concept scale without large-scale instruction tuning, preference alignment, or practical decoding schemes. We introduce DIFFA-2, a practical diffusion-based LALM for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum that combines semantic and acoustic alignment, large-scale supervised fine-tuning, and variance-reduced preference optimization, using only fully open-source corpora. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA and is competitive with strong AR LALMs under practical training budgets, supporting diffusion-based modeling as a viable backbone for large-scale audio understanding. Our code is available at https://github.com/NKU-HLT/DIFFA.git.
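The abstract describes dual semantic and acoustic adapters that bridge the speech encoder to the diffusion LLM backbone, but does not specify their form. Below is a minimal sketch of how such a dual-adapter front end could look, assuming simple MLP adapters and sequence-level concatenation as the fusion rule; all module names, dimensions, and the fusion choice are illustrative assumptions, not the released DIFFA-2 implementation.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Small MLP projecting frozen speech-encoder features into the LLM embedding space.
    (Hypothetical: the paper only states that two adapters exist, not their architecture.)"""

    def __init__(self, d_enc: int, d_model: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_enc, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class DualAdapterFrontend(nn.Module):
    """Two parallel adapters over the same encoder features: one aimed at
    text-semantic alignment, one at acoustic/paralinguistic detail."""

    def __init__(self, d_enc: int = 1024, d_model: int = 2048):
        super().__init__()
        self.semantic = Adapter(d_enc, d_model)
        self.acoustic = Adapter(d_enc, d_model)

    def forward(self, enc_feats: torch.Tensor) -> torch.Tensor:
        # enc_feats: (batch, frames, d_enc) from a frozen speech encoder.
        sem = self.semantic(enc_feats)
        aco = self.acoustic(enc_feats)
        # One plausible fusion: concatenate along the time axis, yielding a
        # (batch, 2 * frames, d_model) prefix fed to the diffusion LLM backbone.
        return torch.cat([sem, aco], dim=1)


if __name__ == "__main__":
    frontend = DualAdapterFrontend()
    feats = torch.randn(2, 50, 1024)  # 2 clips, 50 encoder frames each
    prefix = frontend(feats)
    print(prefix.shape)  # torch.Size([2, 100, 2048])
```

Concatenating the two streams keeps semantic and acoustic evidence separately addressable by the backbone's attention; the actual DIFFA-2 fusion and adapter design may differ from this sketch.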
Community
DIFFA-2 provides a practical diffusion-based large audio language model with semantic/acoustic adapters and a four-stage curriculum, improving general audio understanding under practical budgets.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition (2026)
- Fun-Audio-Chat Technical Report (2025)
- FastSLM: Hierarchical Frame Q-Former for Effective Speech Modality Adaptation (2026)
- MiMo-Audio: Audio Language Models are Few-Shot Learners (2025)
- AzeroS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning (2025)
- SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding (2025)
- AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation (2026)