Family Analysis Pipeline

This repository contains code and tools developed for the paper Discovering Predisposing Genes for Hereditary Breast Cancer Using Deep Learning (https://academic.oup.com/bib/article/25/4/bbae346/7717952).

Overview

The FamAnalysis pipeline processes large datasets of missense variants and ranks each variant by its predicted pathogenicity relative to the other variants in the dataset. We provide here a user-friendly command-line interface for the DSRank methodology outlined in our paper. We hope to extend the interface to include FamRank in the near future.

Updates to current version:

We have replaced FIRM with precomputed AlphaMissense scores.
We have also replaced ESM-1v inference with precomputed ESM-1b scores, as described in the study Genome-wide prediction of disease variant effects with a deep protein language model (official GitHub repository: https://github.com/ntranoslab/esm-variants.git).
The current version can run on standard CPUs and does not require direct model inference using GPUs.

The pipeline consists of 3 stages:

Create protein and mutations objects based on input data by pulling all relevant metadata from UniProtKB.
Calculate pathogenicity scores using ESM, AlphaMissense and EVEmodel.
Normalize the model scores and rank variants using an aggregated score, DSRank (see paper).

Setup

Hardware requirements

The entire pipeline can run on a standard CPU. To install the database and precomputed scores, make sure you have at least 23 GB of free disk space.

Software requirements

FamAnalysis requires Python 3.10.

Installation

After cloning the project, run:

pip install pandas requests tqdm psutil click wget bio

python setup.py

In definitions.py, make sure to set CONTACT to your email address. This is required for UniProtKB requests.

Note: In case of an unstable internet connection, setup.py may raise a ConnectionError while downloading large files. Simply re-run the program starting from the method that crashed. The download will resume from the point where it stopped.

Running FamAnalysis

Run the following command to get help:

python main.py --help

Set the verbosity level using the --verbose flag.

Protein variants should be provided in the format shown in example.csv.

Note: DNA data (i.e., Chr, Start, End) is not required. If these fields are not available, please fill them with any arbitrary value.
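As a minimal sketch of the note above, the snippet below fills empty or missing Chr, Start and End fields with an arbitrary placeholder before the file is passed to the pipeline. Only the Chr, Start and End names come from this README; the "Protein" and "Variant" headers and the sample row are hypothetical, so check example.csv for the real column layout.

```python
import csv
import io

# Chr, Start and End are the DNA fields named in this README.
DNA_FIELDS = ("Chr", "Start", "End")


def fill_dna_fields(rows, placeholder="0"):
    """Return copies of rows with empty or missing DNA fields set to placeholder."""
    filled = []
    for row in rows:
        new_row = dict(row)
        for field in DNA_FIELDS:
            if not new_row.get(field):
                new_row[field] = placeholder
        filled.append(new_row)
    return filled


# Hypothetical input: the "Protein" and "Variant" headers are illustrative only.
sample = io.StringIO("Protein,Variant,Chr\nBRCA1,C61G,\n")
rows = fill_dna_fields(list(csv.DictReader(sample)))
```

The filled rows can then be written back with csv.DictWriter, using the same header order as example.csv.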

Initializing Database

To build protein and mutation objects for all variants, run:

python main.py --data-path example.csv --action init-DB

To limit CPUs used, run:

python main.py --data-path example.csv --action init-DB --workers 16

Note: When using more than 16 CPUs, or if the internet connection is unstable, you may need to increase the TIMEOUT, WAIT_TIME and RETRIES values in definitions.py.
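To illustrate why larger values help on a slow or crowded connection, here is a minimal sketch of a retry loop driven by three such settings. The names TIMEOUT, WAIT_TIME and RETRIES come from definitions.py, but the values, their exact semantics and the with_retries helper are assumptions made for this example, not the code FamAnalysis actually runs.

```python
import time

# Hypothetical values; the real defaults live in definitions.py.
TIMEOUT = 10.0    # seconds allowed for a single request
WAIT_TIME = 0.01  # seconds to sleep between attempts
RETRIES = 3       # maximum number of attempts


def with_retries(call, retries=RETRIES, wait_time=WAIT_TIME, timeout=TIMEOUT):
    """Call call(timeout) until it succeeds or all attempts are used up."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _ in range(retries):
        try:
            return call(timeout)
        except ConnectionError as err:
            last_error = err
            time.sleep(wait_time)
    raise last_error


# Simulated request that fails twice before succeeding.
attempts = {"n": 0}


def flaky(timeout):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"


result = with_retries(flaky)
```

With many parallel workers each request is more likely to time out, so raising the number of attempts or the wait between them gives every worker a better chance of completing.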

Calculating Model Scores

To calculate model scores for all variants, run:

python main.py --data-path example.csv --action score-AFM

python main.py --data-path example.csv --action score-ESM

python main.py --data-path example.csv --action score-EVE

To re-calculate scores for variants that were previously scored, run:

python main.py --data-path example.csv --action score-EVE --recalc 1

To use only EVEmodel without using CPT imputation, run:

python main.py --data-path example.csv --action score-EVE --use-cpt 0

To set offset for mutation position, run:

python main.py --data-path example.csv --action score-EVE --offset 1

Note: Changing the offset from its default is not recommended. Position shifts should preferably be corrected in the input file.
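One way to correct positions in the input file rather than with --offset is sketched below. It assumes variants are written as reference residue, position, alternate residue (for example "R175H"); that notation and the shift_variant helper are assumptions for illustration and may not match the format used in example.csv.

```python
import re

# Assumption: variants look like <ref><position><alt>, e.g. "R175H".
VARIANT_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def shift_variant(variant, offset):
    """Return the variant with its position shifted by offset."""
    match = VARIANT_RE.match(variant)
    if match is None:
        raise ValueError(f"unrecognized variant notation: {variant!r}")
    ref, pos, alt = match.groups()
    new_pos = int(pos) + offset
    if new_pos < 1:
        raise ValueError("shifted position must be at least 1")
    return f"{ref}{new_pos}{alt}"


shifted = shift_variant("R175H", 1)
```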

Ranking Scores

FamAnalysis uses DSRank to aggregate the model scores. The normalized scores of the three ML models’ predictions are averaged over all available scores per variant.

To rank all variants, run:

python main.py --data-path example.csv --action rank-DS

Use the --ds-thr flag to set the minimum number of model scores a variant needs for its DSRank to be calculated.
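The aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation from the paper: it uses min-max scaling as a stand-in for the normalization step, assumes that a higher normalized score means more pathogenic, and the model keys ("esm", "afm", "eve"), the ds_rank helper and the default threshold of 2 are all hypothetical.

```python
def min_max(scores):
    """Scale the non-missing values of one model to the range [0, 1]."""
    present = [s for s in scores if s is not None]
    if not present:
        return list(scores)
    low, high = min(present), max(present)
    span = high - low
    scaled = []
    for s in scores:
        if s is None:
            scaled.append(None)
        elif span == 0:
            scaled.append(0.5)  # all values identical; any constant works
        else:
            scaled.append((s - low) / span)
    return scaled


def ds_rank(model_scores, ds_thr=2):
    """Average the normalized scores per variant over the models that scored it.

    model_scores maps a model name to a list with one score (or None) per variant.
    Variants with fewer than ds_thr available scores get None.
    """
    normalized = [min_max(column) for column in model_scores.values()]
    n_variants = len(normalized[0])
    aggregated = []
    for i in range(n_variants):
        available = [column[i] for column in normalized if column[i] is not None]
        if len(available) >= ds_thr:
            aggregated.append(sum(available) / len(available))
        else:
            aggregated.append(None)
    return aggregated


# Three variants scored by three models; None marks a missing score.
scores = {
    "esm": [1.0, 3.0, 5.0],
    "afm": [0.0, 0.25, 1.0],
    "eve": [0.1, None, 0.9],
}
aggregated = ds_rank(scores, ds_thr=2)
strict = ds_rank(scores, ds_thr=3)

# Indices of scored variants, highest aggregated score first.
order = sorted(
    (i for i, s in enumerate(aggregated) if s is not None),
    key=lambda i: aggregated[i],
    reverse=True,
)
```

In this example the second variant has only two model scores, so it is ranked when the threshold is 2 and left unscored when the threshold is 3.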

Results

To create a CSV file summarizing all scores, run:

python main.py --data-path example.csv --action to-csv --out-path results.csv

Use --include-type 1 to include the ESM and EVE score types in the output file.
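For convenience, the commands from the sections above can be chained in one script. The sketch below uses only the actions and flags documented in this README. The build_commands and run_pipeline helpers are hypothetical wrappers, not part of FamAnalysis, and the script is assumed to run from the repository root.

```python
import subprocess
import sys

# Actions in the order they appear in this README.
ACTIONS = ["init-DB", "score-AFM", "score-ESM", "score-EVE", "rank-DS"]


def build_commands(data_path, out_path):
    """Return the list of main.py invocations for a full run."""
    base = [sys.executable, "main.py", "--data-path", data_path]
    commands = [base + ["--action", action] for action in ACTIONS]
    commands.append(base + ["--action", "to-csv", "--out-path", out_path])
    return commands


def run_pipeline(data_path, out_path):
    """Run every stage in order, stopping at the first one that fails."""
    for command in build_commands(data_path, out_path):
        subprocess.run(command, check=True)


commands = build_commands("example.csv", "results.csv")
```

Calling run_pipeline("example.csv", "results.csv") would execute the six commands in sequence. Because check=True raises on a non-zero exit code, a failed stage prevents later stages from running on incomplete data.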

Reference

If you use this code, please cite our paper.
