
d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation 🚀

Blog | Demo | d3LLM-Dream | d3LLM-LLaDA | d3LLM-Coder | Paper

We introduce a novel recipe for building an ultra-fast diffusion language model named d3LLM (pseuDo-Distilled Diffusion LLM) 🚀.

📣 News

✨ Demo

Demo of d3LLM: achieves up to 5× speedup over an autoregressive baseline (Qwen-2.5-7B-it) on an H100 GPU and 3.6× speedup on an A100 GPU. You can try 🕹️ our demo.

d3LLM Demo

📖 What is d3LLM?

d3LLM (pseuDo-Distilled Diffusion LLM) is a novel framework for building ultra-fast diffusion language models with negligible accuracy degradation. d3LLM achieves 5× speedup over autoregressive models on H100 GPUs while maintaining competitive performance.

🎯 Getting Started

Installation

We recommend creating a dedicated ~/Codes directory to maintain consistent paths during evaluation:

```bash
# Create workspace directory
mkdir -p ~/Codes
cd ~/Codes

# Clone the repository
git clone https://github.com/hao-ai-lab/d3LLM.git
cd d3LLM

# Install dependencies. The pinned versions matter:
# transformers==4.49.0, lm_eval==0.4.9, datasets==3.2.0, flash_attn==2.7.4.post1
pip install -r requirements.txt
```

Note: We recommend cloning in ~/Codes/d3LLM, which ensures eval_scripts work out-of-the-box with consistent paths.
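Because the pinned dependency versions matter, it can help to verify them before running anything. A small sketch (the `check_versions` helper is hypothetical, not part of the repo; the pins are those listed in the install comment above):

```python
# Verify that key pinned dependencies match the versions the README calls out.
from importlib.metadata import version, PackageNotFoundError

EXPECTED = {
    "transformers": "4.49.0",
    "lm_eval": "0.4.9",
    "datasets": "3.2.0",
    "flash_attn": "2.7.4.post1",
}

def check_versions(expected):
    """Return a dict mapping package -> (installed_version, matches_expected)."""
    report = {}
    for pkg, want in expected.items():
        try:
            have = version(pkg)
        except PackageNotFoundError:
            have = None  # package not installed at all
        report[pkg] = (have, have == want)
    return report

if __name__ == "__main__":
    for pkg, (have, ok) in check_versions(EXPECTED).items():
        print(f"{pkg}: installed={have}, matches pin={ok}")
```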

Try d3LLM Instantly

Chat with d3LLM models using our simple chat scripts:

```bash
# Chat with d3LLM-Dream
python chat/chat_d3llm_dream.py

# Or chat with d3LLM-LLaDA
python chat/chat_d3llm_llada.py
```

Note that because our distillation data consists primarily of coding and math reasoning tasks, speedups are most pronounced on prompts from those domains.

🔬 How d3LLM Works

The d3LLM framework combines two key innovations:

(i) Pseudo-Trajectory Distillation 📚

Instead of random masking, we extract the teacher model's decoding order, i.e. the sequence in which it unmasks tokens. This pseudo-trajectory guides the student model to learn efficient generation patterns.

- Pseudo-Trajectory Extraction → 15% TPF improvement
- Progressive Noise Schedule → additional 18% TPF boost
- Progressive Window Sizing → another 8% TPF gain

Figure: our pseudo-trajectory-based distillation process.
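To illustrate the idea (this is not the repository's actual extraction code), a decoding order can be recovered by simulating confidence-ordered unmasking. The `extract_pseudo_trajectory` helper and the per-position confidence scores below are hypothetical stand-ins for the teacher's real outputs:

```python
def extract_pseudo_trajectory(confidences, steps):
    """Simulate confidence-ordered unmasking and record the decoding order.

    `confidences` holds the teacher's per-position confidence score
    (a hypothetical stand-in for its real per-step predictions). At each
    denoising step we unmask the most confident still-masked positions;
    the returned list of lists is the teacher's decoding order, which a
    student can then be trained to follow.
    """
    masked = set(range(len(confidences)))
    per_step = max(1, len(confidences) // steps)  # positions revealed per step
    order = []
    while masked:
        k = min(per_step, len(masked))
        # pick the k most confident masked positions
        picks = sorted(masked, key=lambda i: confidences[i], reverse=True)[:k]
        order.append(sorted(picks))
        masked -= set(picks)
    return order
```

For example, with confidences `[0.9, 0.1, 0.8, 0.2]` and two steps, positions 0 and 2 are unmasked first, then 1 and 3.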

(ii) Multi-Block Decoding Strategy ⚡

We enable parallel decoding across multiple blocks simultaneously using entropy-based token selection.

- Entropy-Based Multi-Block Decoding → 20% TPF improvement
- KV-Cache with Periodic Refresh → 20% TPS boost in long contexts
- Early Stopping on EOS → 5% TPF gain

Figure: entropy-based multi-block decoding with KV-cache and periodic refresh.
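A minimal sketch of entropy-based selection, assuming each block exposes a predictive distribution per masked position. The `select_tokens` function and the threshold value are illustrative, not d3LLM's actual implementation:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_tokens(block_probs, threshold):
    """Pick, per block, the masked positions whose predictive entropy falls
    below `threshold`, so several blocks can commit tokens in one forward
    pass. `block_probs` maps block_id -> {position: prob_distribution}.
    """
    selected = {}
    for block_id, positions in block_probs.items():
        picks = [pos for pos, dist in positions.items()
                 if entropy(dist) < threshold]
        # commit at least the lowest-entropy position so decoding advances
        if not picks and positions:
            picks = [min(positions, key=lambda p: entropy(positions[p]))]
        selected[block_id] = sorted(picks)
    return selected
```

Low-entropy (high-confidence) positions across different blocks are committed together, which is what lifts TPF above one token per forward pass.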

Together, these innovations achieve 5–10× speedup in TPF (tokens per forward) over vanilla diffusion models while maintaining accuracy. Based on the d3LLM framework, we have released three models on 🤗 HuggingFace: d3LLM-LLaDA, d3LLM-Dream, and d3LLM-Coder.
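The two throughput metrics used throughout are simple ratios; as a quick reference (trivial helpers, not from the repo):

```python
def tpf(tokens_generated, forward_passes):
    """Tokens per forward pass: an autoregressive model decodes one token
    per forward (TPF = 1), while a parallel diffusion decoder commits
    several tokens per forward. Hardware-independent."""
    return tokens_generated / forward_passes

def tps(tokens_generated, wall_seconds):
    """Tokens per second: end-to-end throughput, hardware-dependent."""
    return tokens_generated / wall_seconds
```

For instance, generating 256 tokens in 32 forward passes gives TPF = 8, versus TPF = 1 for an AR baseline.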

πŸ‹οΈβ€β™€οΈ Training d3LLM Models

We provide the training scripts for d3LLM-Dream and d3LLM-LLaDA. You can use the following commands to train the models.

```bash
# Training d3LLM-Dream
deepspeed --num_gpus=4 d3llm/d3llm_DREAM/distill_2_training/d3llm_dream_train.py

# Training d3LLM-LLaDA
deepspeed --num_gpus=4 d3llm/d3llm_LLaDA/distill_2_training/d3llm_llada_train.py
```

The trajectory datasets are already extracted and uploaded to HuggingFace (see Dream Trajectory and LLaDA Trajectory). You can also generate the pseudo-trajectory dataset yourself using the scripts in the distill_1_data_prepare/ folder.

🧪 Evaluation on Standard Benchmarks

All evaluation scripts are in the eval_scripts/ folder; just install the environment and run them. We include comprehensive evaluation code for the standard benchmarks.

See eval_scripts for more details.

📊 Benchmark Results

Our d3LLM achieves the highest AUP (Accuracy Under Parallelism) scores across multiple dLLMs and tasks:


Figure: radar plots comparing AUP scores across methods and benchmarks, for LLaDA-based, Dream-based, and Coder models.

Acceleration Highlights (on GSM8K-CoT Dataset)

| Model | H100 TPS | A100 TPS | Speedup vs. AR |
|---|---|---|---|
| Qwen-2.5-7B (AR) | 57.32 | 50.36 | 1.00× |
| d3LLM-LLaDA | 288.89 | 183.33 | 3.47×–5.04× |
| d3LLM-Dream | 235.34 | 128.19 | 2.55×–4.67× |

Want more details? Check out our dLLM leaderboard and comprehensive results at 🌐 this blog.

πŸ† Diffusion LLM Leaderboard

We further present a leaderboard that compares diffusion LLMs across five representative benchmark tasks, using the AUP score (Accuracy Under Parallelism) as the primary evaluation metric. AUP is a hardware-independent metric that captures both the efficiency and the accuracy of a dLLM. More details can be found in AUP_leaderboard and 🌐 this blog.
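The exact AUP formula is not spelled out here; one plausible reading, assumed purely for illustration (the real definition may differ, see the blog and leaderboard), is a normalized area under the accuracy-versus-parallelism curve:

```python
def aup_sketch(parallelism, accuracy):
    """HYPOTHETICAL AUP reading: trapezoidal area under the accuracy-vs-
    parallelism curve, normalized by the parallelism range. A method that
    keeps accuracy high as parallelism (e.g. TPF) grows scores higher.
    This is an assumption for illustration, not the paper's definition."""
    area = 0.0
    points = list(zip(parallelism, accuracy))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid rule
    return area / (parallelism[-1] - parallelism[0])
```

Under this reading, a model whose accuracy stays flat at 1.0 across all parallelism levels would score a perfect 1.0.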

πŸ“ Citation

If you find d3LLM useful for your research, please star our project and cite our work.

```bibtex
@article{preprint'25:d3llm,
  author  = {Yu-Yang Qian and Junda Su and Lanxiang Hu and Peiyuan Zhang and Zhijie Deng and Peng Zhao and Hao Zhang},
  title   = {d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation},
  journal = {ArXiv preprint},
  volume  = {to appear},
  note    = {\url{https://github.com/hao-ai-lab/d3LLM} [Accessed: 2025-12-11]},
  year    = {2025}
}
```

The paper about d3LLM and AUP is coming soon. Please stay tuned!

🙏 Acknowledgments

This project builds upon excellent open-source work:

⭐ Star us on GitHub and cite our paper if you find this project helpful!