This repository accompanies the research paper:
Laure Ciernik*, Agnieszka Kraft*, Florian Barkmann, Joséphine Yates, and Valentina Boeva
"Robust and efficient annotation of cell states through gene signature scoring"
Genome Research, doi: 10.1101/gr.280926.125
*Equal contribution
- Overview
- Quick Start
- Installation
- Data Setup
- Repository Structure
- Running Experiments
- Correspondence
This repository contains Jupyter notebooks and scripts for reproducibility of experiments and visualizations from our research on gene signature scoring methods.
⚠️ Important: This code is shared for reproducibility purposes and is not organized as a package. For the production-ready implementation of ANS, please use our packaged repository.
- Clone the repository
  git clone https://github.com/BoevaLab/ANS_supplementary_information.git
  cd ANS_supplementary_information
- Set up environment
  conda create -n scoring_env python=3.9.18
  conda activate scoring_env
  pip install -r requirements.txt
- Set up project folder
  - Edit data/constants.py and replace all TODO placeholders with your actual paths
  - Update path placeholders in notebooks and scripts marked with TODO
  python data/constants.py
- Download data
  cd data
  python download_preprocessed_datasets.py
- Install ANS package (if not already installed)
  pip install git+https://github.com/BoevaLab/ANS_signature_scoring.git
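The setup steps ask you to replace every TODO placeholder in the scripts and notebooks. As a rough aid, the following sketch lists the lines that still contain the marker. The function name `find_todo_lines` and the set of scanned file types are assumptions made for illustration; they are not part of this repository.

```python
from pathlib import Path

# File types to scan; this selection is an assumption, extend it as needed.
SUFFIXES = {".py", ".ipynb", ".sh", ".R"}


def find_todo_lines(root, marker="TODO"):
    """Return (path, line number, stripped line) for every line containing the marker."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and file types outside the chosen selection.
        if not path.is_file() or path.suffix not in SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if marker in line:
                hits.append((path, number, line.strip()))
    return hits
```

Calling `find_todo_lines(".")` from the repository root returns the remaining placeholders; an empty list suggests that all paths in the scanned file types have been filled in.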
- Python 3.9.18
- Miniconda or Anaconda
- R (for specific experiments - see requirements below)
Choose one of the following methods:
Option 1: Using pip (recommended)
conda create -n scoring_env python=3.9.18
conda activate scoring_env
pip install -r requirements.txt
Option 2: Using conda environment file
# With build hashes (exact reproduction)
conda env create -f environment_with_build_hash.yml
# Without build hashes (more flexible)
conda env create -f environment.yml
For R-based experiments, install the packages listed in notebooks/construction_scoring_methods/session_info.txt.
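To confirm that the activated environment uses the Python version listed in the prerequisites, a small check such as the one below may help. The function name `check_python_version` and its three return labels are illustrative choices, not something this repository provides.

```python
import sys

# Version named in the prerequisites of this README.
EXPECTED = (3, 9, 18)


def check_python_version(current=None, expected=EXPECTED):
    """Return "ok", "patch-differs", or "mismatch" for the given or running interpreter."""
    current = tuple(current or sys.version_info[:3])
    if current == expected:
        return "ok"
    # Same major and minor version, different patch release.
    if current[:2] == expected[:2]:
        return "patch-differs"
    return "mismatch"


print(check_python_version())
```

A result other than "ok" does not necessarily mean the experiments will fail, but exact reproduction is most likely with the pinned version.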
- Set base path: In data/constants.py, set BASE_PATH_DRIVE to your project folder
- Required structure:
  your_project_folder/
  ├── data/           # Downloaded datasets go here
  ├── experiments/    # Experiment outputs
  └── results/        # Results outputs (logs, potentially other results)
- Replace placeholders: Search for TODO in all files and replace with actual paths
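The required project structure can be created by hand or with a few lines of Python. The sketch below assumes only the three sub-folder names listed above; the function name `ensure_project_structure` is hypothetical, and whether data/constants.py already creates these folders is not stated here.

```python
from pathlib import Path

# Sub-folders named in the required structure above.
REQUIRED_SUBFOLDERS = ("data", "experiments", "results")


def ensure_project_structure(base_path):
    """Create any missing required sub-folders under base_path and return their paths."""
    base = Path(base_path).expanduser()
    folders = []
    for name in REQUIRED_SUBFOLDERS:
        folder = base / name
        # exist_ok=True leaves folders that already exist untouched.
        folder.mkdir(parents=True, exist_ok=True)
        folders.append(folder)
    return folders
```

Pass the same folder that you set as BASE_PATH_DRIVE, for example `ensure_project_structure("~/your_project_folder")`.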
- CRC, ESCC, LUAD (Xing), LUAD (Kim)
- Breast cancer, skin cancer, ovarian cancer
- PBMC datasets
- CanSig preprocessed datasets
python data/download_preprocessed_datasets.py
Download all preprocessed datasets and annotations from our Google Drive folder.
├── README.md
├── data/                               # Data loading and preprocessing
│   ├── constants.py                    # ⚠️ Configure paths here
│   ├── load_data.py
│   ├── download_preprocessed_datasets.py
│   └── run_preprocessing_cancer.sh
├── experiments/                        # Experiment scripts
│   ├── comparable_score_ranges/
│   ├── control_bias/
│   ├── data_composition_experiments/
│   ├── runtime/
│   ├── signature_lengths_experiments/
│   ├── signature_noise_addition_experiments/
│   ├── run_all_experiments_on_dataset.sh
│   └── run_pbmc_experiments.sh
├── notebooks/                          # Analysis notebooks
│   ├── EMT_scoring_experiments/        # Case study: EMT signal decoupling
│   ├── comparable_score_ranges/
│   ├── construction_scoring_methods/
│   ├── control_genes_selection_experiments/
│   ├── correlation_scores_with_TRC_and_MTP_experiments/
│   ├── data_composition_experiments/
│   ├── signature_lengths_experiments/
│   └── signature_noise_addition_experiments/
├── environment.yml
├── environment_with_build_hash.yml
└── requirements.txt
The experiments and notebooks folders contain the code for the executed experiments. The experiments folder holds the Python and R scripts that run the experiments; the notebooks folder holds the notebooks that create the visualizations. See "Running Experiments" below for details on how to rerun the experiments.
All experiments expect downloaded data!
This step can be skipped if the annotations have been downloaded successfully from the Drive; see Data Setup.
Run:
cd data/sh_files
bash run_dgex_cancer_sigs_with_pseudobulks.sh
bash run_dgex_cancer_sigs_on_individual_samples.sh
bash run_dgex_non_rel_genes_with_pseudobulks.sh
The scripts load the preprocessed data, compute the malignant cell-specific signatures, and store them in the [Project location]/data/annotations folder.
To compare the equality of the Python and R implementations (Figure 1b), follow these steps:
1. Load the data and convert it to single-cell experiments in R by running the first part of notebooks/construction_scoring_methods/compare_python_and_R_versions_of_scoring_methods.ipynb
2. Adapt the missing paths in the R script notebooks/construction_scoring_methods/scoring_crc_escc_luad_w_ucell_jasmine_seurat_ans.R and run it
3. Run the rest of the notebook from step 1. It will create the subplot (Figure 1b)
To generate Figure 1c, first run the following experiment script
cd experiments
bash run_all_experiments_on_dataset.sh
For the visualization:
- Figure 1c and S3: Run notebooks/control_genes_selection_experiments/figures_contol_bias_experiment.ipynb
  Note: The test_folder variable has to be set to a results folder created by the previous script, e.g., [path to project]/experiments/control_genes_selection/mean_var_per_gene_scores/B memory kappa
- Figure S1: Run notebooks/control_genes_selection_experiments/Control_gene_selection_comparison_last_expression_bin.ipynb
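When choosing a value for the test_folder variable, it can help to see which results folders the experiment script has produced. The sketch below lists them, assuming the directory layout from the example path above; the function name `list_result_folders` is hypothetical.

```python
from pathlib import Path


def list_result_folders(project_root):
    """Return the sub-folders of the control-gene results directory, sorted by path."""
    # Layout taken from the example path in this README; adjust if yours differs.
    results_dir = (
        Path(project_root)
        / "experiments"
        / "control_genes_selection"
        / "mean_var_per_gene_scores"
    )
    if not results_dir.is_dir():
        return []
    return sorted(p for p in results_dir.iterdir() if p.is_dir())
```

Each returned path is a candidate for test_folder; an empty list suggests the experiment script has not yet written results to that location.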
To reproduce Figure 1d, run:
cd experiments/data_composition_experiments
bash run_data_comp_exp.sh crc
bash run_data_comp_exp.sh escc
The plots are stored at [project location]/experiments/data_composition_experiments/[crc | escc]/mean_norm/dgex_on_pseudobulk/strip_plots
For Figure S5, run notebooks/data_composition_experiments/variance_decrease.ipynb
To recreate Figure S4b and S6a, first run the scoring experiments:
cd experiments/signature_lengths_experiments
bash run_sig_length_exp.sh escc 100
bash run_sig_length_exp.sh crc 150
Then use the notebook notebooks/signature_lengths_experiments/result_heatmaps_figures.ipynb to create the figures.
To recreate Figure S4c and S6b, first run the scoring experiments:
cd experiments/signature_noise_addition_experiments
bash run_sig_noise_exp.sh escc 100
bash run_sig_noise_exp.sh crc 100
Then use the notebook notebooks/signature_noise_addition_experiments/sig_noise_experiment_figures.ipynb to create the figures.
To create the subplots of Figure 2 and Figures S7-S10, run the following experiment script:
cd experiments/comparable_score_ranges
bash run_comp_range_exp.sh
Then, to create all figures, run notebooks/comparable_score_ranges/create_plot.ipynb.
To create Figure 3, run notebooks/comparable_score_ranges/neuronal_differentation_dataset.py
To rerun the runtime experiment (Figure S2), run the notebook /experiments/runtime/runtime_comparison.ipynb.
To run the analysis, run the scripts notebooks/removal_of_signature_genes/removal_of_highly_expressed_genes_*py.
See notebooks/EMT_scoring_experiments/ for the main case study demonstrating decoupling of EMT signals in stromal and cancer cells. It contains a README.md file with the steps to reproduce the experiments.
For questions or issues, please contact Laure Ciernik or Agnieszka Kraft.