This repository provides an implementation for the paper Can Speculative Sampling Accelerate ReAct Without Compromising Reasoning Quality?, for the ReAct paradigm using Llama2-70B and Speculative Sampling (SpS) techniques for answering questions from datasets like HotPotQA and FEVER. The code primarily refers to dust-tt/llama-ssp.
To install the necessary dependencies, run the following command:
pip install -r requirements.txt
Authenticate with Hugging Face using the CLI:
huggingface-cli login
To use Llama2-70b model for answering sample HotPotQA questions through the ReAct paradigm, execute the following command (requires over 35GB of GPU memory for model loading and extra memory for processing):
For HotPotQA:
python3 react-sps.py --dataset hotpot --mode llama2
For FEVER:
python3 react-sps.py --dataset fever --mode llama2
This will generate a <dataset>-llama2-<yymmddHHMM>.txt file to record the ReAct trajectories for each processed question.
To Speculative Sampling (Llama2-7b as approximate model and Llama-2 70b as target model) for answering sample HotPotQA questions through the ReAct paradigm, execute the following command (requires over 38.5GB of GPU memory for model loading and extra memory for processing):
For HotPotQA:
python3 react-sps.py --dataset hotpot --mode sps
For FEVER:
python3 react-sps.py --dataset fever --mode sps
This will generate a <dataset>-sps-<yymmddHHMM>.txt file to record the ReAct trajectories for each processed question.
To compare the performance of Llama2-70b and SpS in basic text generation tasks, use this command:
python3 react-sps.py --compare
You should see output like this:
Results for Llama-2-70b
Throughput: 1.28 tokens/sec
Latency: 0.780 sec/token
Results for SpS
Throughput: 2.80 tokens/sec
Latency: 0.356 sec/token
@inproceedings{
xu2024can,
title={Can Speculative Sampling Accelerate ReAct Without Compromising Reasoning Quality?},
author={Han Xu and Jingyang Ye and Yutong Li and Haipeng Chen},
booktitle={The Second Tiny Papers Track at ICLR 2024},
year={2024}
}