custom_msas

Generates custom multiple sequence alignments (MSAs) from public databases using HMMER. Based on the AlphaFold2 data pipeline.

Specifically, this repository:

  1. Searches for homologous sequences in user-specified databases using jackhmmer.
  2. Combines the results from all databases, removing duplicates.
  3. Saves the database-specific and combined MSAs (in the output_msas/ folder by default).
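
The combining step (step 2) can be illustrated with a minimal sketch. This is not the repository's actual code: the function name `combine_msas`, the `(header, aligned_sequence)` record layout, and the rule that two hits are duplicates when their sequences match after removing gaps and ignoring case are all assumptions made for illustration.

```python
def combine_msas(msas_by_db):
    """Merge per-database hits, keeping the first occurrence of each sequence.

    msas_by_db maps a database name to a list of (header, aligned_sequence)
    tuples, in the order the databases were searched (hypothetical layout).
    """
    seen = set()
    combined = []
    for db_name, records in msas_by_db.items():
        for header, aligned_seq in records:
            # Assumption: duplicates are detected on the ungapped,
            # upper-cased sequence rather than on the header.
            key = aligned_seq.replace("-", "").upper()
            if key in seen:
                continue
            seen.add(key)
            combined.append((header, aligned_seq))
    return combined


if __name__ == "__main__":
    hits = {
        "uniref90": [("query", "MKV-L"), ("hit_a", "MKVAL")],
        "small_bfd": [("query", "MKVL"), ("hit_b", "MRVAL")],
    }
    for header, seq in combine_msas(hits):
        print(header, seq)
```

Because dictionaries preserve insertion order, hits from the database searched first win when the same sequence appears in several databases.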

Installation

This repository

To install this repository, run the following commands in your terminal:

conda create -n custom_msas python=3.12
conda activate custom_msas
pipx install poetry
git clone https://github.com/annadiarov/custom_msas.git
cd custom_msas
poetry install

Note about pipx installation

If pipx is not installed, you can install it with pip, as described in the official documentation:

python3 -m pip install --user pipx
python3 -m pipx ensurepath

External dependencies

Follow the official installation instructions for:

  • HMMER. Note: this repo uses version 3.3.

Databases (WIP)

The following publicly available databases can be used to generate custom MSAs:

| Database | Version | License | Description | Download link |
|---|---|---|---|---|
| small BFD | AlphaFold2 version | CC BY 4.0 | Representative sequence from each cluster in the Big Fantastic Database (BFD), clustered at 30% sequence identity | https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz |
| UniRef90 | v2022_05 | CC BY 4.0 | UniProt Reference Clusters at 90% sequence identity. Version used by AlphaFold3. | https://ftp.uniprot.org/pub/databases/uniprot/previous_major_releases/release-2022_05/uniref/uniref2022_05.tar.gz |

Configuration file

Before running the pipeline, edit the configuration file config.yaml to set the path to the jackhmmer binary and the paths to the databases you have downloaded.

The configuration file has the following structure:

jackhmmer_binary_path: /path/to/hmmer-3.3/bin/jackhmmer
logger_level: INFO # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL

databases:
  # Warning: Database names are not case-sensitive
  db1:
    path: /path/to/db1.fasta
    max_sequences: 10000
  db2:
    path: /path/to/db2.fa
    max_sequences: 10000
  # Add as many databases as you want!

To add a new database to the pipeline, download it and add an entry with its path and max_sequences under the databases section of config.yaml, as shown above.
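
Before starting a long search, it can help to sanity-check the databases section. The sketch below is an illustration only, not the tool's own validation: it assumes the YAML file has already been parsed into a plain dictionary (for example with a YAML parser), and the helper names `normalize_databases` and `missing_paths` are hypothetical. It lower-cases database names, since they are not case-sensitive, and reports entries whose path does not exist.

```python
from pathlib import Path


def normalize_databases(config):
    """Return the databases section with lower-cased names and typed values."""
    normalized = {}
    for name, entry in (config.get("databases") or {}).items():
        key = str(name).lower()
        # Names are not case-sensitive, so "DB1" and "db1" would collide.
        if key in normalized:
            raise ValueError(f"duplicate database name (case-insensitive): {name}")
        max_sequences = int(entry["max_sequences"])
        if max_sequences <= 0:
            raise ValueError(f"max_sequences must be positive for {name}")
        normalized[key] = {
            "path": Path(entry["path"]),
            "max_sequences": max_sequences,
        }
    return normalized


def missing_paths(normalized):
    """List the names of databases whose FASTA file is not on disk."""
    return [name for name, entry in normalized.items()
            if not entry["path"].is_file()]


if __name__ == "__main__":
    # Hypothetical parsed config mirroring the structure shown above.
    config = {
        "databases": {
            "DB1": {"path": "/path/to/db1.fasta", "max_sequences": 10000},
            "db2": {"path": "/path/to/db2.fa", "max_sequences": 10000},
        }
    }
    dbs = normalize_databases(config)
    print("Missing database files:", missing_paths(dbs))
```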

Usage

To generate custom MSAs, run the following command:

custom_msas my_fasta.fasta out_dir -db uniref90 -db small_bfd -db mgnify

Where:

  • my_fasta.fasta is the input FASTA file containing the target sequence.
  • out_dir is an optional argument specifying the output directory where the MSAs will be saved. The default is output_msas/.
  • -db, --database specifies a database to search against. Repeat the flag to search several databases, which are searched in the order given.

Check the help message for more options:

custom_msas --help
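
When several targets need MSAs, the command can be assembled programmatically. The following is a minimal sketch that uses only the arguments documented above (the input FASTA, the optional out_dir, and repeated -db flags); the helper name `build_command` and the example file names are hypothetical.

```python
def build_command(fasta_path, databases, out_dir=None):
    """Build the argument list for one custom_msas run."""
    cmd = ["custom_msas", str(fasta_path)]
    if out_dir is not None:
        # out_dir is optional; the tool falls back to output_msas/ without it.
        cmd.append(str(out_dir))
    for db in databases:
        # One -db flag per database, in the order they should be searched.
        cmd.extend(["-db", db])
    return cmd


if __name__ == "__main__":
    # Hypothetical targets; each gets its own output directory.
    for target in ["target_a.fasta", "target_b.fasta"]:
        out_dir = "msas_" + target.removesuffix(".fasta")
        print(" ".join(build_command(target, ["uniref90", "small_bfd"], out_dir)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the custom_msas environment is active.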

Current Limitations

  • Only tested for single-chain FASTA files.
  • Only tested on protein sequences.
  • Only generates unpaired MSAs.
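
Given these limitations, a quick pre-flight check on the input can catch unsupported files early. The sketch below is an assumption-laden illustration, not part of the tool: the helper names and the accepted letter set (the 20 standard amino acids plus B, Z, X, U and O) are choices made here. Note that it cannot reliably tell nucleotide sequences apart from proteins, because A, C, G and T are also valid amino acid letters.

```python
# Assumption: 20 standard residues plus the ambiguity/rare codes B, Z, X, U, O.
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWYBZXUO")


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_input(text):
    """Return a list of problems; an empty list means the input looks fine."""
    problems = []
    records = read_fasta(text)
    if len(records) != 1:
        problems.append(f"expected 1 sequence, found {len(records)}")
    for header, seq in records:
        unexpected = sorted(set(seq.upper()) - PROTEIN_LETTERS)
        if unexpected:
            problems.append(f"{header}: unexpected characters {unexpected}")
    return problems


if __name__ == "__main__":
    print(check_input(">query\nMKVAL\nGHTW\n"))
```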
