We introduce a novel recipe for building an ultra-fast diffusion language model named d3LLM (pseuDo-Distilled Diffusion LLM).
- [2025/12/11]: We release the models on HuggingFace 🤗, see our d3LLM-LLaDA, d3LLM-Dream, and d3LLM-Dream-Coder.
- [2025/12/11]: We release the training scripts, training datasets, and evaluation code for d3LLM, see our GitHub repo.
- [2025/12/10]: We release the blog.
Demo of d3LLM: achieves up to 5× speedup over autoregressive models (Qwen-2.5-7B-it) on an H100 GPU and 3.6× speedup on an A100 GPU. You can try our demo.
d3LLM (pseuDo-Distilled Diffusion LLM) is a novel framework for building ultra-fast diffusion language models with negligible accuracy degradation. d3LLM achieves 5× speedup over autoregressive models on H100 GPUs while maintaining competitive performance.
We recommend creating a dedicated ~/Codes directory to maintain consistent paths during evaluation:
```shell
# Create workspace directory
mkdir -p ~/Codes
cd ~/Codes

# Clone the repository
git clone https://github.com/hao-ai-lab/d3LLM.git
cd d3LLM

# Install dependencies
# Make sure the pinned versions are installed: transformers==4.49.0,
# lm_eval==0.4.9, datasets==3.2.0, and flash_attn==2.7.4.post1
pip install -r requirements.txt
```

Note: We recommend cloning into `~/Codes/d3LLM`, which ensures the `eval_scripts` work out of the box with consistent paths.
Chat with d3LLM models using our simple chat scripts:
```shell
# Chat with d3LLM-Dream
python chat/chat_d3llm_dream.py

# Or chat with d3LLM-LLaDA
python chat/chat_d3llm_llada.py
```

Note that because our distillation data primarily consists of coding and math reasoning tasks, acceleration may only appear on prompts from these domains.
The d3LLM framework combines two key innovations:
Instead of random masking, we extract the teacher model's decoding order, i.e., the sequence in which it unmasks tokens. This pseudo-trajectory guides the student model to learn efficient generation patterns.
- Pseudo-Trajectory Extraction → 15% TPF improvement
- Progressive Noise Schedule → additional 18% TPF boost
- Progressive Window Sizing → another 8% TPF gain
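To make the trajectory idea concrete, here is a minimal sketch of extracting a decoding order from a teacher's denoising steps. This is an illustration only: the function name, the mask-token id, and the toy data are our assumptions, not the repository's actual API.

```python
MASK_ID = 0  # assumed mask-token id (illustrative)

def extract_decoding_order(states):
    """states: list of token-id lists, one per denoising step, from fully
    masked to fully decoded. Returns positions in the order the teacher
    unmasked them -- the pseudo-trajectory that supervises the student."""
    order = []
    for prev, cur in zip(states, states[1:]):
        for pos, (p, c) in enumerate(zip(prev, cur)):
            if p == MASK_ID and c != MASK_ID:  # unmasked at this step
                order.append(pos)
    return order

# toy run: a 4-token sequence decoded over two denoising steps
steps = [
    [0, 0, 0, 0],
    [0, 7, 0, 9],   # teacher unmasks positions 1 and 3 first
    [5, 7, 8, 9],   # then positions 0 and 2
]
print(extract_decoding_order(steps))  # [1, 3, 0, 2]
```

The resulting position list replaces random mask schedules during distillation, so the student learns which tokens the teacher commits to early.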
We enable parallel decoding across multiple blocks simultaneously using entropy-based token selection.
- Entropy-Based Multi-Block Decoding → 20% TPF improvement
- KV-Cache with Periodic Refresh → 20% TPS boost in long contexts
- Early Stopping on EOS → 5% TPF gain
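A rough sketch of the entropy-based selection idea: compute the predictive entropy of each masked position (possibly spanning several blocks) and commit every position whose entropy is below a threshold in a single forward pass. The helper names and the threshold value are illustrative assumptions, not the repository's implementation.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one position's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_positions(masked_probs, threshold=0.5):
    """masked_probs: {position: distribution over vocab} for masked slots.
    Unmask every position whose entropy falls below the threshold, so
    multiple tokens -- across blocks -- are committed in one pass."""
    return sorted(pos for pos, dist in masked_probs.items()
                  if token_entropy(dist) < threshold)

dists = {
    2: [0.97, 0.01, 0.01, 0.01],  # confident -> low entropy, committed
    5: [0.25, 0.25, 0.25, 0.25],  # uncertain -> high entropy, kept masked
    9: [0.90, 0.05, 0.03, 0.02],  # confident -> committed
}
print(select_positions(dists))  # [2, 9]
```

Because confident positions in later blocks can be committed before earlier blocks finish, the number of tokens per forward pass grows without waiting on strict left-to-right order.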
Together, these innovations achieve 5-10× speedup on TPF (tokens per forward) over vanilla diffusion models while maintaining accuracy. Based on the d3LLM framework, we have released three models on HuggingFace 🤗: d3LLM-LLaDA, d3LLM-Dream, and d3LLM-Coder.
We provide the training scripts for d3LLM-Dream and d3LLM-LLaDA. You can use the following commands to train the models.
```shell
# Training d3LLM-Dream
deepspeed --num_gpus=4 d3llm/d3llm_DREAM/distill_2_training/d3llm_dream_train.py

# Training d3LLM-LLaDA
deepspeed --num_gpus=4 d3llm/d3llm_LLaDA/distill_2_training/d3llm_llada_train.py
```

The trajectory dataset is already extracted and uploaded to HuggingFace (see Dream Trajectory and LLaDA Trajectory). You can also generate the pseudo-trajectory dataset yourself using the scripts in the `distill_1_data_prepare/` folder.
All evaluation scripts are in the `eval_scripts/` folder: just install the environment and run! We include comprehensive evaluation code for:
- ✅ d3LLM (our method)
- ✅ AR Model (e.g., Qwen-2.5-7B-it) - autoregressive baselines
- ✅ Vanilla LLaDA - original LLaDA model
- ✅ Vanilla Dream - original Dream model
- ✅ Fast-dLLM - training-free acceleration with KV cache
- ✅ D2F - discrete diffusion forcing
- ✅ dParallel - distilled dLLMs
- ✅ Fast-dLLM v2 - block-wise diffusion
See eval_scripts for more details.
Our d3LLM achieves the highest AUP (Accuracy Under Parallelism) scores across multiple dLLMs and tasks:
*Radar plots comparing AUP scores across different methods and benchmarks (panels: LLaDA-based Models, Dream-based Models, Coder Models).*
| Model | TPS (H100) | TPS (A100) | Speedup vs. AR |
|---|---|---|---|
| Qwen-2.5-7B (AR) | 57.32 | 50.36 | 1.00× |
| d3LLM-LLaDA | 288.89 | 183.33 | 3.47×–5.04× |
| d3LLM-Dream | 235.34 | 128.19 | 2.55×–4.67× |
Want more details? Check out our dLLM leaderboard and comprehensive results in this blog.
We further present a leaderboard comparing different diffusion LLMs across five representative benchmark tasks, using the AUP score (Accuracy Under Parallelism) as the primary evaluation metric. AUP is a hardware-independent metric that captures both the efficiency and the accuracy of a dLLM. More details can be found in AUP_leaderboard and this blog.
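The exact AUP formula is defined in the blog and leaderboard; as a rough illustration only, one natural way to score "accuracy under parallelism" is a normalized area under the accuracy-vs-parallelism curve, which rewards models that stay accurate as tokens per forward grow. The function, curve, and normalization below are our assumptions, not the official definition.

```python
def area_under_parallelism(points):
    """points: (tokens_per_forward, accuracy) pairs, sorted by TPF.
    Returns the trapezoidal area under the curve, normalized by the
    TPF span so the score stays on the accuracy scale [0, 1]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)  # trapezoid rule
    span = points[-1][0] - points[0][0]
    return area / span

# toy curve: accuracy decays as parallelism (TPF) increases
curve = [(1, 0.80), (4, 0.78), (8, 0.70)]
print(round(area_under_parallelism(curve), 4))  # 0.7614
```

A model that keeps accuracy flat at high TPF scores close to its base accuracy; one that collapses under parallelism is penalized, which is why a single number can rank both speed and quality.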
If you find d3LLM useful for your research, please star our project and cite our work.
@article{preprint'25:d3llm,
author = {Yu-Yang Qian and Junda Su and Lanxiang Hu and Peiyuan Zhang and Zhijie Deng and Peng Zhao and Hao Zhang},
title = {d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation},
journal = {ArXiv preprint},
volume = {to appear},
note = {\url{https://github.com/hao-ai-lab/d3LLM} [Accessed: 2025-12-11]},
year = {2025}
}

The paper about d3LLM and AUP is coming soon. Please stay tuned!
This project builds upon excellent open-source work:
- LLaDA - Large Language Diffusion Models
- Dream - Diffusion Large Language Models
- Fast-dLLM - Training-free acceleration
- D2F - Discrete diffusion forcing
- dParallel - Distilled dLLMs
- lm-evaluation-harness - Evaluation framework
⭐ Star us on GitHub and cite our paper if you find this project helpful!