This Python script automates the process of replacing long loops in protein structures with shorter ones, using MODELLER.
As input, it requires the protein structure model to be modified and information on the loops to be replaced. Each long loop is replaced with a shorter one, built from short stretches taken from the start and end of the original loop. For instance, if the sequence is something like:
       s          e
...LQFSRTNAMGSRGEGEDM...
where s and e mark loop start and end, we replace it with something like:
       s    e
...LQFSRTNEGEDM...
so only the first 3 and last 3 residues of the loop are kept. The rest of the structure is kept completely frozen; only the replaced loop is remodeled by this script.
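The sequence-level effect of this shortening can be sketched in a few lines of Python. This is an illustration only, not the script's own code: the function name `shorten_loop` and its arguments are hypothetical, and the loop in the example above is assumed to span `RTNAMGSRGEGE`.

```python
def shorten_loop(sequence, before, after, keep_start, keep_end):
    """Return `sequence` with the loop between residues `before` and `after`
    (1-based, both outside the loop) cut down to its first `keep_start`
    and last `keep_end` residues."""
    loop = sequence[before:after - 1]      # residues before+1 .. after-1
    if keep_start + keep_end > len(loop):
        raise ValueError("cannot keep more residues than the loop contains")
    # Slice from an explicit index: loop[-0:] would return the whole loop.
    tail = loop[len(loop) - keep_end:]
    return sequence[:before] + loop[:keep_start] + tail + sequence[after - 1:]


original = "LQFSRTNAMGSRGEGEDM"
# Assumed loop: residues 5-16 (RTNAMGSRGEGE), so the flanking residues are 4 and 17.
shortened = shorten_loop(original, before=4, after=17, keep_start=3, keep_end=3)
print(shortened)  # LQFSRTNEGEDM
```

The sketch only handles the sequence; the actual remodeling of the kept residues is done by MODELLER.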
- Python >=3.7
- MODELLER software
- Ensure MODELLER and its dependencies are installed and accessible.
- Have the paths for the FASTA file and PDB file ready.
- Have the UniProt ID of the protein ready.
- Running the Script
- Execute the following command to generate the alignment file that MODELLER needs from the parsed FASTA file and the loop setup, and to start the modeling process:
loop_replacer -f <fasta_file> -u <uniprot_id> -l <loop_positions> -r <residue_adjustments> -m <number_of_models> -p <pdb_file> (--chain <protein_chain>) > final_create_model.log
- Replace `<fasta_file>` with the path to your input FASTA file.
- Replace `<uniprot_id>` with the UniProt ID of the protein.
- `<loop_positions>` should be specified in the format "start:end start:end ...". Note that start and end are the residue numbers in the original structure immediately *before* and *after* the loop, respectively, so each loop begins right after start and ends right before end.
- `<residue_adjustments>` should be specified in the format "add_start:add_end add_start:add_end ...", with each pair corresponding to one of the loop positions. add_start and add_end are the numbers of residues to keep at the N-terminus and C-terminus of the loop, respectively. These are also the only residues that will be remodeled.
- `<number_of_models>` specifies how many models to generate with MODELLER.
- Replace `<pdb_file>` with the path to your input PDB file.
- `<protein_chain>` is an optional parameter to specify the chain of the protein (default is chain A).
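To make the numbering convention concrete, the sketch below parses the two options and reports which residues would be kept and which removed. It is a hedged illustration, not the script's parser: `parse_pairs` and `loop_plan` are hypothetical names, and the values are those of the example command given at the end of this README.

```python
def parse_pairs(tokens):
    """Turn ["1100:1288", "1389:1546"] into [(1100, 1288), (1389, 1546)]."""
    pairs = []
    for token in tokens:
        left, right = token.split(":")
        pairs.append((int(left), int(right)))
    return pairs


def loop_plan(loop_positions, residue_adjustments):
    """For each loop, list the kept and removed residue ranges (inclusive)."""
    loops = parse_pairs(loop_positions)
    keeps = parse_pairs(residue_adjustments)
    if len(loops) != len(keeps):
        raise ValueError("need one add_start:add_end pair per start:end pair")
    plan = []
    for (before, after), (keep_n, keep_c) in zip(loops, keeps):
        first, last = before + 1, after - 1   # actual loop residues
        if keep_n + keep_c > last - first + 1:
            raise ValueError(f"loop {before}:{after} is shorter than the residues to keep")
        plan.append({
            "kept_n": (first, first + keep_n - 1),
            "removed": (first + keep_n, last - keep_c),
            "kept_c": (last - keep_c + 1, last),
        })
    return plan


plan = loop_plan(["1100:1288", "1389:1546"], ["6:5", "4:5"])
for entry in plan:
    print(entry)
```

For the first loop this gives residues 1101-1106 and 1283-1287 as kept and 1107-1282 as removed, which shows that the numbers passed with `-l` are themselves outside the loop.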
- Output
- The script will create:
  - An alignment file (`<uniprot_id>.ali`) for modeling.
  - Multiple model files (`<uniprot_id>.<num>.pdb`) based on the specified number of models.
  - Important Output:
    - Log file (`final_create_model.log`) where errors, warnings, and a summary of successfully produced models are mentioned.
    - MODELLER Model Evaluation and suggested best predicted model summary.
    - Renumbered model files (`<uniprot_id>.<num>_renum.pdb`).
  - Remember that in the cancermuts step of MAVISp, when we generate the mutlist, we should define the range as two separate domains.
See the example directory for a full example. A typical command line looks like:
./loop_replacer -f Q9UPY3.fasta -u Q9UPY3 -l 1100:1288 1389:1546 -r 6:5 4:5 -m 5 -p ../Q9UPY3.pdb > test.log
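When several proteins need the same treatment, the call can also be assembled from Python. The following is a minimal sketch under stated assumptions: the helper `build_command` is hypothetical, it uses only the options documented above, and it assumes the loop and adjustment values are passed as separate arguments, as in the example command.

```python
import shlex


def build_command(fasta, uniprot_id, loops, adjustments, n_models, pdb,
                  chain=None, executable="./loop_replacer"):
    """Assemble the argument list for one loop_replacer run."""
    cmd = [executable, "-f", fasta, "-u", uniprot_id,
           "-l", *loops, "-r", *adjustments,
           "-m", str(n_models), "-p", pdb]
    if chain is not None:      # optional; the script defaults to chain A
        cmd += ["--chain", chain]
    return cmd


cmd = build_command("Q9UPY3.fasta", "Q9UPY3",
                    ["1100:1288", "1389:1546"], ["6:5", "4:5"],
                    5, "../Q9UPY3.pdb")
printable = " ".join(shlex.quote(arg) for arg in cmd)
print(printable)

# To run it and capture the log (only where loop_replacer is available):
# import subprocess
# with open("test.log", "w") as log:
#     subprocess.run(cmd, stdout=log, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and redirecting standard output to a file mirrors the `> test.log` part of the command line.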