*Inception Labs' "Mercury" diffusion LLM*
This is an educational repository for diffusion large language models (dLLMs), following the same format as nanoGPT (the reader should probably be familiar with nanoGPT before going further). It keeps up with the latest advancements in dLLMs, including the recent release of LLaDA, whose 8B model is available on Hugging Face. dLLMs are diffusion models that use a discrete random masking process and train a mask predictor to approximate the reverse process. dLLMs are still an ongoing research topic, but the first commercial dLLM has already been released (see the GIF above).
Some advantages of dLLMs over autoregressive (AR) LLMs are:

- **Faster generation**: dLLMs generate the entire sequence at once and refine it iteratively, allowing for more parallel computation (see the sketch after this list).
- **More diverse outputs**: diffusion models sample from a richer probability distribution at each step (compared to AR models, which optimize for the most likely token at each step), leading to more diverse and creative outputs.
- **More flexible with missing or corrupted data**: since diffusion models use a masking and denoising process, they can handle missing data naturally.
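To make the parallel-generation point concrete, here is a minimal sketch of iterative unmasking: start from an all-`[MASK]` sequence and, at each diffusion step, keep the predictions the model is most confident about. The `sample` function, the confidence-based unmasking schedule, and the mask id 126336 are illustrative assumptions, not this repository's actual generation code (see main.py for that).

```python
import torch

@torch.no_grad()
def sample(model, length=64, steps=16, temperature=1.0, mask_id=126336, device="cpu"):
    # Start from a fully masked sequence and unmask it over `steps` iterations.
    x = torch.full((1, length), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        logits = model(input_ids=x).logits                 # (1, length, vocab_size)
        probs = torch.softmax(logits / temperature, dim=-1)
        confidence, pred = probs.max(dim=-1)               # best token and its probability per position
        still_masked = x == mask_id
        if not still_masked.any():
            break
        # Reveal a fraction of the remaining masked positions, most confident first.
        n_unmask = max(1, int(still_masked.sum().item()) // (steps - step))
        confidence = confidence.masked_fill(~still_masked, -1.0)
        idx = confidence.topk(n_unmask, dim=-1).indices
        x[0, idx[0]] = pred[0, idx[0]]
    return x

# Example usage: tokens = sample(model, length=128, steps=32, temperature=0.8)
```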
Any autoregressive GPT-like model can easily be turned into a dLLM: the only change needed is to remove the causal mask from the self-attention mechanism. We choose to start from the boilerplate GPT-2 model that can be found in model.py.
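As a rough illustration (a sketch assuming model.py follows nanoGPT's flash-attention path; the function and variable names below are hypothetical, not the repository's exact code), the change boils down to calling scaled-dot-product attention with `is_causal=False`:

```python
import torch
import torch.nn.functional as F

def bidirectional_attention(q, k, v, dropout_p=0.0):
    # nanoGPT's CausalSelfAttention passes is_causal=True here; a dLLM simply drops
    # the causal mask so masked positions can be predicted from both directions.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=None,
                                          dropout_p=dropout_p, is_causal=False)

# Example: batch of 2, 4 heads, sequence length 8, head dimension 16
q = k = v = torch.randn(2, 4, 8, 16)
out = bidirectional_attention(q, k, v)   # (2, 4, 8, 16)
```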
Likewise, the training of this (no longer causal) GPT-like model is altered: the dLLM is now a mask predictor instead of a next-token predictor.
For a training sequence $x_0$, the dLLM is trained to minimize

$$\mathcal{L}(\theta) = -\mathbb{E}_{t,\,x_0,\,x_t}\left[\frac{1}{t}\sum_{i=1}^{L}\mathbf{1}\!\left[x_t^i = \text{[MASK]}\right]\log p_\theta\!\left(x_0^i \mid x_t\right)\right],$$

where $x_t$ is the noisy sequence obtained by masking each token of $x_0$ independently with probability $t$, $L$ is the sequence length, and $\mathbf{1}[\cdot]$ is the indicator function.

Pre-training: the dLLM is trained on text with random masks applied independently to all tokens at the same ratio $t \sim U[0, 1]$.
The pre-training code for the dLLM resembles autoregressive model training, with a few small changes:
```python
import torch
import torch.nn.functional as F

def forward_process(input_ids, eps=1e-3):
    b, l = input_ids.shape
    # Sample one masking ratio t ~ U[0, 1] per sequence, bounded away from 0 by eps
    t = torch.rand(b, device=input_ids.device)
    p_mask = (1 - eps) * t + eps
    p_mask = p_mask[:, None].repeat(1, l)
    # Mask each token independently with probability p_mask
    masked_indices = torch.rand((b, l), device=input_ids.device) < p_mask
    # 126336 is used for the [MASK] token
    noisy_batch = torch.where(masked_indices, 126336, input_ids)
    return noisy_batch, masked_indices, p_mask

# The data is an integer tensor of shape (b, 4096),
# where b represents the batch size and 4096 is the sequence length.
input_ids = batch["input_ids"]

# We set 1% of the pre-training data to a random length that is uniformly sampled from the range [1, 4096].
# The following implementation is not elegant and involves some data waste.
# However, the data waste is minimal, so we ignore it.
if torch.rand(1) < 0.01:
    random_length = torch.randint(1, input_ids.shape[1] + 1, (1,))
    input_ids = input_ids[:, :random_length]

noisy_batch, masked_indices, p_mask = forward_process(input_ids)
logits = model(input_ids=noisy_batch).logits

# Cross-entropy on the masked positions only, reweighted by 1 / p_mask (i.e. roughly 1 / t)
token_loss = F.cross_entropy(logits[masked_indices], input_ids[masked_indices], reduction='none') / p_mask[masked_indices]
loss = token_loss.sum() / (input_ids.shape[0] * input_ids.shape[1])
```

Please install the relevant dependencies:
```bash
pip install .
```

Please process the data by running (nanoGPT's data preparation code was shamelessly taken):
```bash
python data/shakespeare_char/prepare.py
```

Please note that, as mentioned earlier, the diffusion model samples from a broader probability distribution, making training more challenging than next-token prediction. This "increased complexity" is reflected in the noisiness of the training and validation losses.
To pre-train your model, which will be saved to `--save_filename`, run:
```bash
python train.py --save_filename="YOURFILENAME"
```

Among the most important hyperparameters that influence the generation process are the number of diffusion steps `steps` and the well-known `temperature` parameter. You can generate a few samples from the provided pre-trained model and change the hyperparameter values in the file:
python main.py --path="pretrained_gpt_last.pt"or given your trained version :
python main.py --path="YOURFILENAME"Code source used as inspiration :
Bibliography:

