Generates custom multiple sequence alignments (MSAs) from public databases using the HMMER software. Based on the AlphaFold2 data pipeline.
Specifically, this repository:
- Searches for homologous sequences in user-specified databases using jackhmmer.
- Combines the results from all databases, removing duplicates.
- Saves the database-specific and combined MSAs in the output_msas/ folder by default.
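The steps above can be sketched in Python as follows. This is a minimal illustration, not the repository's actual code: the function names `build_jackhmmer_cmd` and `merge_unique` are hypothetical, the jackhmmer options shown (`-o`, `-A`, `--noali`, `-N`, `--cpu`) are standard HMMER flags chosen for illustration, and deduplicating on the exact aligned sequence string is an assumption about how duplicates are defined.

```python
from typing import Dict, List, Tuple


def build_jackhmmer_cmd(binary: str, query_fasta: str, database: str,
                        out_sto: str, n_iter: int = 1, n_cpu: int = 8) -> List[str]:
    """Return an argv list for one jackhmmer search (the command is not run here)."""
    return [
        binary,
        "-o", "/dev/null",     # discard the human-readable report
        "-A", out_sto,         # save the alignment of hits in Stockholm format
        "--noali",             # leave per-hit alignments out of the report
        "-N", str(n_iter),     # number of search iterations
        "--cpu", str(n_cpu),   # number of worker threads
        query_fasta,
        database,
    ]


def merge_unique(per_db_hits: Dict[str, List[Tuple[str, str]]]) -> List[Tuple[str, str]]:
    """Concatenate hits from all databases, keeping the first copy of each sequence.

    Assumption: two hits are duplicates when their aligned sequences are identical.
    """
    seen = set()
    merged = []
    for hits in per_db_hits.values():  # dicts preserve insertion order
        for name, aligned_seq in hits:
            if aligned_seq not in seen:
                seen.add(aligned_seq)
                merged.append((name, aligned_seq))
    return merged
```

Each database would get its own command and its own output file, and `merge_unique` would then be applied to the parsed hits in the order the databases were searched, so the first database listed wins when the same sequence appears twice.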
To install this repository, run the following commands in your terminal:
conda create -n custom_msas python=3.12
conda activate custom_msas
pipx install poetry
git clone https://github.com/annadiarov/custom_msas.git
cd custom_msas
poetry install
Note about pipx installation
If you don't have pipx installed, you can install it using pip according to the official documentation as follows:
python3 -m pip install --user pipx
python3 -m pipx ensurepath
Follow the instructions to install:
- HMMER. Note: this repo uses version 3.3.
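To confirm which HMMER version is on your PATH, a small check along these lines may help. It is only a sketch: the helper names are hypothetical, and it assumes that `jackhmmer -h` prints a banner line containing `HMMER <version>`, as recent HMMER 3 releases do.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_hmmer_version(help_text: str) -> Optional[str]:
    """Extract a version such as '3.3' from the banner printed by `jackhmmer -h`."""
    match = re.search(r"HMMER\s+(\d+(?:\.\d+)*)", help_text)
    return match.group(1) if match else None


def installed_hmmer_version(binary: str = "jackhmmer") -> Optional[str]:
    """Return the version reported by the binary, or None if it cannot be found."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run([path, "-h"], capture_output=True, text=True, check=False)
    return parse_hmmer_version(result.stdout)
```

Calling `installed_hmmer_version()` returns `None` when jackhmmer is not installed or the banner has an unexpected format, so the result should be treated as a hint, not a guarantee.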
Table of publicly available databases that can be used to generate custom MSAs.
| Database | Version | License | Description | Download link |
|---|---|---|---|---|
| small BFD | AlphaFold2 version | CC BY 4.0 | Representative sequence from each cluster in the Big Fantastic Database (BFD), clustered at 30% sequence identity | https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz |
| UniRef90 | v2022_05 | CC BY 4.0 | UniProt Reference Clusters at 90% sequence identity. Version used by AlphaFold3. | https://ftp.uniprot.org/pub/databases/uniprot/previous_major_releases/release-2022_05/uniref/uniref2022_05.tar.gz |
Before running the pipeline, make sure to edit the configuration file
config.yaml to specify the paths to the HMMER binaries and
the databases you have downloaded.
The configuration file has the following structure:
jackhmmer_binary_path: /path/to/hmmer-3.3/bin/jackhmmer
logger_level: INFO  # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
databases:
  # Warning: Database names are not case-sensitive
  db1:
    path: /path/to/db1.fasta
    max_sequences: 10000
  db2:
    path: /path/to/db2.fa
    max_sequences: 10000
  # Add as many databases as you want!
To add a new database to the pipeline, just download it and add the required
information in the config.yaml file under the databases
section as shown above.
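Because database names are not case-sensitive, it can help to picture how the databases section might be normalized after loading. The sketch below is an assumption about one reasonable approach, not the repository's implementation: `normalize_databases` is a hypothetical helper that takes the already-parsed configuration (for example, the dictionary returned by PyYAML's `yaml.safe_load`) and treats both `path` and `max_sequences` as required.

```python
from typing import Any, Dict


def normalize_databases(config: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Lower-case database names and check that each entry has the expected keys."""
    normalized: Dict[str, Dict[str, Any]] = {}
    for name, entry in (config.get("databases") or {}).items():
        key = str(name).lower()  # 'UniRef90' and 'uniref90' refer to the same entry
        if key in normalized:
            raise ValueError(f"Database '{name}' is defined twice (names ignore case)")
        for field in ("path", "max_sequences"):
            if field not in entry:
                raise ValueError(f"Database '{name}' is missing '{field}'")
        normalized[key] = {
            "path": str(entry["path"]),
            "max_sequences": int(entry["max_sequences"]),
        }
    return normalized
```

With this kind of normalization, a name passed on the command line can be lower-cased and looked up directly, and a configuration that defines the same database twice under different capitalizations fails early.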
To generate custom MSAs, run the following command:
custom_msas my_fasta.fasta out_dir -db uniref90 -db small_bfd -db mgnify
Where:
- my_fasta.fasta is the input FASTA file containing the target sequence.
- out_dir is an optional argument specifying the output directory where the MSAs will be saved. Default is output_msas/.
- -db, --database specifies the databases to search against, in the order given.
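The argument layout described above can be mirrored with Python's standard `argparse` module. This is an illustrative sketch only: the real command may be built with a different library, and the destination name `databases` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one required input, one optional output dir, repeatable -db."""
    parser = argparse.ArgumentParser(prog="custom_msas")
    parser.add_argument("fasta", help="Input FASTA file containing the target sequence")
    parser.add_argument("out_dir", nargs="?", default="output_msas/",
                        help="Output directory for the MSAs")
    parser.add_argument("-db", "--database", action="append", dest="databases",
                        help="Database to search; repeat the flag to search several, in order")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.fasta, args.out_dir, args.databases)
```

Using `action="append"` keeps the databases in the order the flags were typed, which matches the ordered search described above.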
Check the help message for more options:
custom_msas --help
Limitations:
- Only tested on single-chain FASTA files.
- Only tested on protein sequences.
- Only generates unpaired MSAs.