arxiv:2604.26951

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

Published on Apr 29 · Submitted by N2048M on Apr 30

Abstract

AI-generated summary

Researchers developed TIDE, a framework for cross-architecture distillation of diffusion large language models that improves performance through specialized modules for distillation strength modulation, context enrichment, and cross-tokenizer objectives.

Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.
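
For intuition only, here is a minimal sketch of what a dual-axis distillation weight like TIDAL's could look like: strength is modulated jointly by training progress and diffusion timestep, so the noisiest, most heavily masked teacher predictions count for less, especially early in training. The sigmoid ramp, exponential discount, and the `k_*` constants below are my assumptions, not the schedule from the paper.

```python
import torch

def tidal_weight(progress: float, timestep: torch.Tensor,
                 k_progress: float = 5.0, k_noise: float = 5.0) -> torch.Tensor:
    """Illustrative dual-axis distillation weight; NOT the paper's exact schedule.

    progress: fraction of training completed, in [0, 1] (scalar).
    timestep: per-sample diffusion timesteps in [0, 1] (1.0 = fully masked, noisiest).
    Returns per-sample weights in (0, 1] that shrink where the teacher is least
    reliable: early in training and at high noise levels.
    """
    # Ramp distillation strength up over training (assumed sigmoid ramp).
    ramp = float(torch.sigmoid(torch.tensor(k_progress * (progress - 0.5))))
    # Discount high-noise timesteps, where teacher predictions are least trustworthy.
    noise_discount = torch.exp(-k_noise * timestep)
    return ramp * noise_discount

# Usage inside a training step, with kd_loss_per_sample of shape [batch]:
# loss = (tidal_weight(progress, t) * kd_loss_per_sample).mean()
```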

Community

Paper submitter

🌊 Excited to share TIDE: to our knowledge, the first framework for cross-architecture distillation of diffusion LLMs, where teacher and student may differ in architecture, attention pattern, and tokenizer.

We evaluate TIDE in two heterogeneous teacher–student distillation pipelines, both using a 0.6B Qwen3-BD3LM student architecture:

  • LLaDA2.0-mini 16B MoE teacher → 0.6B BD3LM student
  • WeDLM-8B-Instruct 8B dense teacher → 0.6B BD3LM student

Three modular components make this work:

  1. TIDAL: a dual-axis distillation schedule over training progress and diffusion timestep, which down-weights unreliable teacher signals in high-mask / high-noise regions.

  2. CompDemo: complementary mask demonstrations with dual teacher forward passes, giving each masked position extra visible context to sharpen teacher predictions (a rough sketch follows this list).

  3. Reverse CALM: a reversed chunk-level likelihood matching objective for cross-tokenizer alignment, yielding bounded gradients and stronger mode-seeking behavior.
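
To make the CompDemo idea more concrete, here is a rough sketch of complementary mask splitting with two teacher passes: the masked positions are split into two disjoint halves, and each teacher pass keeps one half masked while the other half keeps its ground-truth tokens, so every masked position is predicted with extra visible context. The HF-style `teacher(ids).logits` call, `MASK_ID`, and the random 50/50 split are assumptions of mine, not the released implementation.

```python
import torch

MASK_ID = 0  # hypothetical mask token id in the teacher's vocabulary

@torch.no_grad()
def compdemo_teacher_logits(teacher, input_ids, mask_positions):
    """Sketch of complementary-mask teacher demonstrations (illustrative only).

    input_ids:      [batch, seq] ground-truth token ids.
    mask_positions: [batch, seq] bool, True where the student sees a mask.
    Returns teacher logits for every masked position, where each position is
    predicted while the complementary half of the masked positions is revealed.
    """
    # Split masked positions into two disjoint, complementary halves.
    coin = torch.rand(mask_positions.shape, device=mask_positions.device) < 0.5
    half_a = mask_positions & coin
    half_b = mask_positions & ~coin

    logits = None
    for keep_masked in (half_a, half_b):
        ids = input_ids.clone()
        ids[keep_masked] = MASK_ID   # this half stays masked in this pass;
        out = teacher(ids).logits    # the other half keeps its ground-truth tokens.
        if logits is None:
            logits = torch.zeros_like(out)
        logits[keep_masked] = out[keep_masked]
    return logits
```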

Across 8 benchmarks, TIDE improves over the BD3LM baseline by +1.53 on average, and achieves +16.48 on HumanEval over a same-size AR model. The 0.6B student further achieves 22× memory compression and a 5.2× inference speedup compared with the 16B MoE teacher.

Code, checkpoints, and data are all open-sourced; feedback and upvotes are very welcome 🌟

πŸ“„ Paper: https://huggingface.co/papers/2604.26951
πŸ’» Code: https://github.com/PKU-YuanGroup/TIDE
🌐 Project: https://pku-yuangroup.github.io/TIDE-Page/
πŸ€— Models: https://huggingface.co/TIDE-dllm/models
πŸ“š Data: https://huggingface.co/TIDE-dllm/datasets

Reverse CALM is the standout detail for me: a cross-tokenizer objective that inverts chunk-level likelihood matching to bridge vocabulary gaps with a bounded-gradient signal. That is likely the knob that keeps cross-architecture distillation from becoming unstable when the teacher and student speak different vocabularies. I'd be curious to see an ablation on chunk size, and on whether a monotone clipping variant would change the stability vs. learning-speed tradeoff. The arxivlens breakdown helped me parse the method details and does a nice job of unpacking how Reverse CALM, CompDemo, and TIDAL fit together; worth a read alongside the paper link: https://arxivlens.com/PaperView/Details/turning-the-tide-cross-architecture-distillation-for-diffusion-large-language-models-1810-46f77bd9
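
To ground that discussion, here is a toy version of what a reversed, chunk-level likelihood-matching loss might look like. This is a guess at the mechanism, not the paper's Reverse CALM: the clamp-based gradient bound and the reverse-KL-style weighting are my assumptions, and the per-chunk log-likelihoods are presumed to be computed upstream by aligning the same text spans under each model's own tokenizer.

```python
import torch

def reverse_chunk_likelihood_loss(student_chunk_logp: torch.Tensor,
                                  teacher_chunk_logp: torch.Tensor,
                                  clip: float = 5.0) -> torch.Tensor:
    """Toy reversed chunk-level likelihood matching; not the paper's Reverse CALM.

    Both inputs hold per-chunk log-likelihoods of the *same* text spans, each
    computed under its own tokenizer, which is what makes chunk-level matching
    work across vocabularies. Reversing the KL direction makes the student
    mode-seeking; clamping the log-ratio is one crude way to bound gradients.
    """
    log_ratio = (student_chunk_logp - teacher_chunk_logp).clamp(-clip, clip)
    # Reverse-KL-style weighting: chunks the student already assigns mass to
    # dominate the match (the weights are detached, so only the clamped
    # log-ratio carries gradient).
    weights = student_chunk_logp.detach().exp()
    return (weights * log_ratio).mean()
```

Whether the real objective normalizes by chunk length, clips monotonically, or filters noisy chunks at both ends (the "dual-end noise filtering" from the abstract) is exactly the kind of ablation asked about above.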


