This repository contains a pipeline for benchmarking Large Language Models (LLMs) at cell type and gene set annotation introduced in:
George Crowley, Tabula Sapiens Consortium, Stephen R. Quake
bioRxiv 2024.10.10.617605
doi: https://doi.org/10.1101/2024.10.10.617605
- src/: Contains core functions
  - Plotting utilities
  - Benchmarking functions
- benchmark_pipeline/: Main pipeline directory
  - Snakefile: Defines the pipeline workflow and dependencies
  - .env: Configuration file for paths and API keys (create your own)
  - scripts/: Processing scripts, labeled in the order in which they are run
  - res/: Output directory for results and figures
1. Clone the repository:

    git clone https://github.com/ggit12/benchmark_llms.git

2. Create a local .env file by copying the template:

    cd benchmark_llms
    cp benchmark_pipeline/.env_template benchmark_pipeline/.env

3. Edit your .env file in benchmark_pipeline/ as needed, including:
    - LLM provider API keys
    - Data path
    - LLM used for downstream analysis

4. Create a conda environment with anndict, snakemake, and selenium:

    conda create -n benchmark_llms python=3.12
    conda activate benchmark_llms
    conda install -c conda-forge tbb numba
    pip install -r requirements.txt
    conda install -c conda-forge firefox geckodriver

5. (Optional) Instead of step 4, you can download a SIF file that contains all necessary libraries pre-installed:

    wget https://github.com/ggit12/benchmark_llms/releases/download/sif/benchmark_llms.sif
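
Before running anything, it can help to confirm that the setup steps above took effect. The following is a minimal sketch, not part of the repository: `missing_tools` and `env_key_is_set` are hypothetical helper names, and the tool list is only assumed from the install commands above. The key names to check are not given here, so take them from your own copy of `.env_template`.

```shell
# Hypothetical pre-flight helpers; none of this ships with the repository.

# Print the names of any commands from the argument list that are not on PATH.
missing_tools() {
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="${missing}${missing:+ }${tool}"
    fi
  done
  printf '%s\n' "$missing"
}

# Succeed only if the key ($2) has a non-empty value in the .env-style file ($1).
# Accepts KEY=value, KEY="value", and an optional leading "export".
env_key_is_set() {
  grep -Eqs "^[[:space:]]*(export[[:space:]]+)?${2}=[\"']?[^\"'[:space:]]" "$1"
}

# The tool list is an assumption based on the install commands above.
echo "Missing tools: $(missing_tools conda snakemake firefox geckodriver)"
```

To check a key, call the second helper with a name taken from your template, for example `env_key_is_set benchmark_pipeline/.env SOME_KEY || echo "SOME_KEY is empty"`, where `SOME_KEY` is a placeholder rather than a real variable name.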
Files are numbered sequentially to indicate processing order. Input and output of each script are defined in the Snakefile.
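
To see the processing order at a glance, you can list the scripts sorted by their leading number. This is a small sketch that assumes the file names begin with digits, as the description above suggests; `list_in_run_order` is a hypothetical helper, not something the repository provides.

```shell
# List the files in a directory sorted by their leading number.
# Assumes names start with digits (e.g. 1_foo.py, 10_bar.py); this naming is
# inferred from the description above, not checked against the repository.
list_in_run_order() {
  ls "$1" | sort -n
}
```

For example, `list_in_run_order scripts` from inside benchmark_pipeline/ would print the scripts in numeric order, so that a file numbered 10 sorts after 2 rather than after 1. To preview what the Snakefile itself would run without executing any jobs, snakemake's dry-run flag can be used: `snakemake --snakefile Snakefile -n`.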
To run the pipeline:
First cd into benchmark_pipeline/:
    cd benchmark_pipeline/

Then run the pipeline with snakemake:

    snakemake --snakefile Snakefile

Or, if you are running on a computing cluster with slurm, you can run the pipeline like this:
    tmux new -s snakemake -d
    tmux send-keys -t snakemake "cd $(pwd) && \
    conda activate benchmark_llms && \
    snakemake --snakefile Snakefile --profile slurm_profile" C-m

Note that this command might require some debugging, depending on your system configuration. For example, if conda is not initialized in your .bashrc, the command might fail.
In that case, you can start the session manually by following the example below.
    tmux new -s snakemake
    cd /path/to/benchmark_llms/benchmark_pipeline
    <<<Initialize conda here (however you normally would)>>>
    conda activate benchmark_llms
    snakemake --snakefile Snakefile --profile slurm_profile

Then Ctrl+B, D to detach from the session and leave it running in the background.
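
Once the session is detached, you may want to check whether it is still alive before reattaching. The sketch below wraps `tmux has-session` in a hypothetical helper, `session_state`; the session name `snakemake` matches the commands above, and everything else is illustrative.

```shell
# Print "running" if a tmux session with the given name exists, else "stopped".
# Also prints "stopped" if tmux is not installed or no tmux server is running.
session_state() {
  if tmux has-session -t "$1" 2>/dev/null; then
    echo "running"
  else
    echo "stopped"
  fi
}

session_state snakemake
```

If the session is reported as running, `tmux attach -t snakemake` reattaches to it so you can inspect the snakemake output.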