custom_msas

Generates custom multiple sequence alignments (MSAs) from public databases using the HMMER software. Based on the AlphaFold2 data pipeline.

Specifically, this repository:

Searches for homologous sequences in user-specified databases using jackhmmer.
Combines the results from all databases, removing duplicates.
Saves the database-specific and combined MSAs in the output_msas/ folder by default.
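
The combine step can be illustrated with a minimal sketch. This is not the repository's code: the function name combine_msas, the record layout, and the duplicate rule (same residues once gaps are stripped and case is ignored) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (description, aligned sequence)


def combine_msas(msas_by_db: Dict[str, List[Record]]) -> List[Record]:
    """Merge per-database MSAs, keeping the first occurrence of each sequence."""
    seen = set()
    combined: List[Record] = []
    # Dicts preserve insertion order, so databases are visited in the order given.
    for db_name, records in msas_by_db.items():
        for description, aligned in records:
            # Assumption: two hits are duplicates when their residues match
            # after gap characters are removed and case is ignored.
            key = aligned.replace("-", "").replace(".", "").upper()
            if key in seen:
                continue
            seen.add(key)
            combined.append((f"{db_name}|{description}", aligned))
    return combined


# Hypothetical toy alignments, one list per database.
example = {
    "uniref90": [("query", "MKT-AYI"), ("hit1", "MKTLAYI")],
    "small_bfd": [("hit2", "MKTLAYI"), ("hit3", "MRT-AYV")],
}
merged = combine_msas(example)
for description, aligned in merged:
    print(description, aligned)
```

Because earlier databases win ties, the order in which databases are listed decides which copy of a duplicated hit is kept.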

Installation

This repository

To install this repository, run the following commands in your terminal:

conda create -n custom_msas python=3.12
conda activate custom_msas
pipx install poetry
git clone https://github.com/annadiarov/custom_msas.git
cd custom_msas
poetry install

Note about pipx installation

If you don't have pipx installed, you can install it with pip, following the official documentation:
python3 -m pip install --user pipx
python3 -m pipx ensurepath

External dependencies

Follow the instructions to install:

HMMER. Note: this repo uses version 3.3.
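
To confirm which HMMER version is on your machine, a small check like the following may help. It is a sketch, not part of this repository: the helper names are hypothetical, and it assumes the help banner printed by jackhmmer -h contains a line of the form "HMMER 3.3 (Nov 2019)".

```python
import re
import subprocess
from typing import Optional


def parse_hmmer_version(help_text: str) -> Optional[str]:
    """Extract the version number from a HMMER help banner, if present."""
    # Assumption: the banner contains "HMMER <major>.<minor>[.<patch>]".
    match = re.search(r"HMMER\s+(\d+(?:\.\d+)*)", help_text)
    return match.group(1) if match else None


def installed_hmmer_version(binary: str = "jackhmmer") -> Optional[str]:
    """Run `<binary> -h` and return the reported version, or None if unavailable."""
    try:
        result = subprocess.run(
            [binary, "-h"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return None  # binary not found at the given path or on PATH
    return parse_hmmer_version(result.stdout)


if __name__ == "__main__":
    version = installed_hmmer_version()
    print(f"jackhmmer version: {version or 'not found'}")
```

Pass the same path you set as jackhmmer_binary_path in config.yaml to check that specific binary.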

Databases (WIP)

Table of publicly available databases that can be used to generate custom MSAs.

Database	Version	License	Description	Download link
small BFD	AlphaFold2 version	CC BY 4.0	Representative sequence from each cluster in Big Fantastic Database (BFD) clustered at 30% sequence identity	https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz
UniRef90	v2022_05	CC BY 4.0	UniProt Reference Clusters at 90% sequence identity. Version used by AlphaFold3.	https://ftp.uniprot.org/pub/databases/uniprot/previous_major_releases/release-2022_05/uniref/uniref2022_05.tar.gz

Configuration file

Before running the pipeline, make sure to edit the configuration file config.yaml to specify the paths to the HMMER binaries and the databases you have downloaded.

The configuration file has the following structure:

jackhmmer_binary_path: /path/to/hmmer-3.3/bin/jackhmmer
logger_level: INFO # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL

databases:
  # Warning: Database names are not case-sensitive
  db1:
    path: /path/to/db1.fasta
    max_sequences: 10000
  db2:
    path: /path/to/db2.fa
    max_sequences: 10000
  # Add as many databases as you want!

To add a new database to the pipeline, download it and add its path and max_sequences under the databases section of config.yaml, as shown above.
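
Since database names are not case-sensitive, two entries that differ only in case would collide. The sketch below shows one way such a check could look. It is an illustration under assumptions, not the repository's loader: the function name normalise_databases is hypothetical, and the input is a plain dict with the same shape as the YAML above (in practice it might come from a YAML parser such as PyYAML's yaml.safe_load).

```python
from typing import Any, Dict


def normalise_databases(config: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Lower-case database names and check that each entry is complete."""
    databases: Dict[str, Dict[str, Any]] = {}
    for name, entry in config.get("databases", {}).items():
        key = name.lower()  # names are treated as case-insensitive
        if key in databases:
            raise ValueError(f"Duplicate database name (case-insensitive): {name}")
        for field in ("path", "max_sequences"):
            if field not in entry:
                raise ValueError(f"Database '{name}' is missing '{field}'")
        max_sequences = int(entry["max_sequences"])
        if max_sequences <= 0:
            raise ValueError(f"Database '{name}' needs a positive max_sequences")
        databases[key] = {"path": str(entry["path"]), "max_sequences": max_sequences}
    return databases


# Hypothetical config mirroring the structure shown above.
config = {
    "jackhmmer_binary_path": "/path/to/hmmer-3.3/bin/jackhmmer",
    "logger_level": "INFO",
    "databases": {
        "UniRef90": {"path": "/path/to/db1.fasta", "max_sequences": 10000},
        "small_bfd": {"path": "/path/to/db2.fa", "max_sequences": 10000},
    },
}
databases = normalise_databases(config)
print(sorted(databases))
```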

Usage

To generate custom MSAs, run the following command:

custom_msas my_fasta.fasta out_dir -db uniref90 -db small_bfd -db mgnify

Where:

my_fasta.fasta is the input FASTA file containing the target sequence.
out_dir is an optional argument specifying the output directory where the MSAs will be saved. Default is output_msas/.
-db, --database specifies a database to search. Repeat the flag once per database, as in the example above; databases are searched in the order given.
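
Because the pipeline has only been tested on single-chain FASTA files, one way to process several targets is to call the command once per file. The following sketch builds those command lines from Python. The helper name build_command and the file names are hypothetical; only the argument layout (input FASTA, output directory, repeated -db flags) is taken from the usage shown above.

```python
import shlex
from pathlib import Path
from typing import List, Sequence


def build_command(fasta: Path, out_dir: Path, databases: Sequence[str]) -> List[str]:
    """Build one custom_msas invocation, repeating -db once per database."""
    command = ["custom_msas", str(fasta), str(out_dir)]
    for name in databases:
        command += ["-db", name]
    return command


# Hypothetical input files; each gets its own output sub-directory.
fasta_files = ["a.fasta", "b.fasta"]
commands = [
    build_command(Path(f), Path("out") / Path(f).stem, ["uniref90", "small_bfd"])
    for f in fasta_files
]
for command in commands:
    print(shlex.join(command))
```

To actually run each command, pass the list to subprocess.run(command, check=True) from an environment where custom_msas is installed.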

Check the help message for more options:

custom_msas --help

Current Limitations

Only tested on single-chain FASTA files.
Only tested on protein sequences.
Only generates unpaired MSAs.
