arxiv:2604.26951

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

Published on Apr 29 · Submitted by N2048M on Apr 30

Abstract

AI-generated summary

Researchers developed TIDE, a framework for cross-architecture distillation of diffusion large language models that improves performance through specialized modules for distillation strength modulation, context enrichment, and cross-tokenizer objectives.

Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.
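
For intuition only, here is a minimal sketch of what a dual-axis distillation weight like TIDAL's could look like: strength is modulated jointly by training progress and diffusion timestep, so the noisiest, most heavily masked teacher predictions count for less, especially early in training. The sigmoid ramp, exponential discount, and the `k_*` constants below are my assumptions, not the schedule from the paper.

```python
import torch

def tidal_weight(progress: float, timestep: torch.Tensor,
                 k_progress: float = 5.0, k_noise: float = 5.0) -> torch.Tensor:
    """Illustrative dual-axis distillation weight; NOT the paper's exact schedule.

    progress: fraction of training completed, in [0, 1] (scalar).
    timestep: per-sample diffusion timesteps in [0, 1] (1.0 = fully masked, noisiest).
    Returns per-sample weights in (0, 1] that shrink where the teacher is least
    reliable: early in training and at high noise levels.
    """
    # Ramp distillation strength up over training (assumed sigmoid ramp).
    ramp = float(torch.sigmoid(torch.tensor(k_progress * (progress - 0.5))))
    # Discount high-noise timesteps, where teacher predictions are least trustworthy.
    noise_discount = torch.exp(-k_noise * timestep)
    return ramp * noise_discount

# Usage inside a training step, with kd_loss_per_sample of shape [batch]:
# loss = (tidal_weight(progress, t) * kd_loss_per_sample).mean()
```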

Community

Paper submitter

🌊 Excited to share TIDE: to our knowledge, the first framework for cross-architecture distillation of diffusion LLMs, where teacher and student may differ in architecture, attention pattern, and tokenizer.

We evaluate TIDE in two heterogeneous teacher–student distillation pipelines, both using a 0.6B Qwen3-BD3LM student architecture:

  • LLaDA2.0-mini 16B MoE teacher → 0.6B BD3LM student
  • WeDLM-8B-Instruct 8B dense teacher → 0.6B BD3LM student

Three modular components make this work:

  1. TIDAL: a dual-axis distillation schedule over training progress and diffusion timestep, which down-weights unreliable teacher signals in high-mask / high-noise regions.

  2. CompDemo: complementary mask demonstrations with dual teacher forward passes, giving each masked position extra visible context to sharpen teacher predictions (a rough sketch follows this list).

  3. Reverse CALM: a reversed chunk-level likelihood matching objective for cross-tokenizer alignment, yielding bounded gradients and stronger mode-seeking behavior.
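
To make the CompDemo idea more concrete, here is a rough sketch of complementary mask splitting with two teacher passes: the masked positions are split into two disjoint halves, and each teacher pass keeps one half masked while the other half keeps its ground-truth tokens, so every masked position is predicted with extra visible context. The HF-style `teacher(ids).logits` call, `MASK_ID`, and the random 50/50 split are assumptions of mine, not the released implementation.

```python
import torch

MASK_ID = 0  # hypothetical mask token id in the teacher's vocabulary

@torch.no_grad()
def compdemo_teacher_logits(teacher, input_ids, mask_positions):
    """Sketch of complementary-mask teacher demonstrations (illustrative only).

    input_ids:      [batch, seq] ground-truth token ids.
    mask_positions: [batch, seq] bool, True where the student sees a mask.
    Returns teacher logits for every masked position, where each position is
    predicted while the complementary half of the masked positions is revealed.
    """
    # Split masked positions into two disjoint, complementary halves.
    coin = torch.rand(mask_positions.shape, device=mask_positions.device) < 0.5
    half_a = mask_positions & coin
    half_b = mask_positions & ~coin

    logits = None
    for keep_masked in (half_a, half_b):
        ids = input_ids.clone()
        ids[keep_masked] = MASK_ID   # this half stays masked in this pass;
        out = teacher(ids).logits    # the other half keeps its ground-truth tokens.
        if logits is None:
            logits = torch.zeros_like(out)
        logits[keep_masked] = out[keep_masked]
    return logits
```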

Across 8 benchmarks, TIDE improves over the BD3LM baseline by +1.53 on average, and achieves +16.48 on HumanEval over a same-size AR model. The 0.6B student further achieves 22× memory compression and a 5.2× inference speedup compared with the 16B MoE teacher.

Code, checkpoints, and data are all open-sourced; feedback and upvotes are very welcome 🌟

πŸ“„ Paper: https://huggingface.co/papers/2604.26951
πŸ’» Code: https://github.com/PKU-YuanGroup/TIDE
🌐 Project: https://pku-yuangroup.github.io/TIDE-Page/
πŸ€— Models: https://huggingface.co/TIDE-dllm/models
πŸ“š Data: https://huggingface.co/TIDE-dllm/datasets

Reverse CALM is the standout detail for me: a cross-tokenizer objective that inverts chunk-level likelihood matching to bridge vocabulary gaps with a bounded-gradient signal. That is likely the knob that keeps cross-architecture distillation from becoming unstable when the teacher and student speak different vocabularies. I'd be curious to see an ablation on chunk size, and on whether a monotone clipping variant would change the stability vs. learning-speed tradeoff. The arxivlens breakdown helped me parse the method details and does a nice job of unpacking how Reverse CALM, CompDemo, and TIDAL fit together; worth a read alongside the paper link: https://arxivlens.com/PaperView/Details/turning-the-tide-cross-architecture-distillation-for-diffusion-large-language-models-1810-46f77bd9
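
To ground that discussion, here is a toy version of what a reversed, chunk-level likelihood-matching loss might look like. This is a guess at the mechanism, not the paper's Reverse CALM: the clamp-based gradient bound and the reverse-KL-style weighting are my assumptions, and the per-chunk log-likelihoods are presumed to be computed upstream by aligning the same text spans under each model's own tokenizer.

```python
import torch

def reverse_chunk_likelihood_loss(student_chunk_logp: torch.Tensor,
                                  teacher_chunk_logp: torch.Tensor,
                                  clip: float = 5.0) -> torch.Tensor:
    """Toy reversed chunk-level likelihood matching; not the paper's Reverse CALM.

    Both inputs hold per-chunk log-likelihoods of the *same* text spans, each
    computed under its own tokenizer, which is what makes chunk-level matching
    work across vocabularies. Reversing the KL direction makes the student
    mode-seeking; clamping the log-ratio is one crude way to bound gradients.
    """
    log_ratio = (student_chunk_logp - teacher_chunk_logp).clamp(-clip, clip)
    # Reverse-KL-style weighting: chunks the student already assigns mass to
    # dominate the match (the weights are detached, so only the clamped
    # log-ratio carries gradient).
    weights = student_chunk_logp.detach().exp()
    return (weights * log_ratio).mean()
```

Whether the real objective normalizes by chunk length, clips monotonically, or filters noisy chunks at both ends (the "dual-end noise filtering" from the abstract) is exactly the kind of ablation asked about above.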


