ndaniels/HomologyTesting
Contents of this directory
==========================
Files in this directory include:
* add-good-chains-to-whitelist.py
* analyze-smurf.sh
* argumentparsers.py
* blastp-to-muscle.sh
* gargamel.py
* generate-hmm.py
* generate-matt-alignments.py
* generate-negative-controls.py
* generate-positive-controls.py
* get-good-chains.sh
* smurf-query.sh
* smurf-train.sh
How to use scripts in this directory
====================================
The analyze-smurf.sh script runs everything needed to generate multiple
alignments, HMM files, and positive and negative controls, using both smurf and
hmmer as aligners. It does this by running generate-matt-alignments.py,
generate-hmm.py, generate-positive-controls.py and
generate-negative-controls.py. The script that analyzes the results has not yet
been written, so for now analyze-smurf.sh only generates the results.
The four scripts generate-matt-alignments.py, generate-hmm.py,
generate-positive-controls.py and generate-negative-controls.py are intended to
be used together in that order. The gargamel.py module contains common
functionality for these scripts.
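The ordering constraint described above can be written down as a small lookup. The following is only a sketch: the names PIPELINE, prerequisites and missing_prerequisites are hypothetical and do not come from gargamel.py or any other file in this directory.

```python
# Order taken from the description above: each stage relies on the stages
# listed before it. PIPELINE and both helpers are illustrative names only.
PIPELINE = [
    "generate-matt-alignments.py",
    "generate-hmm.py",
    "generate-positive-controls.py",
    "generate-negative-controls.py",
]


def prerequisites(script):
    """Return the scripts that should have been run before `script`."""
    return PIPELINE[:PIPELINE.index(script)]


def missing_prerequisites(script, completed):
    """Return the prerequisites of `script` not found in `completed`, in order."""
    done = set(completed)
    return [name for name in prerequisites(script) if name not in done]
```

For example, calling missing_prerequisites with "generate-negative-controls.py" and a list of the stages already run reports which earlier stages are still outstanding.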
To generate multiple alignments from all proteins in a given superfamily,
leaving one family out at a time, use generate-matt-alignments.py.
If the multiple alignment fails because matt refuses to analyze certain chains,
use the get-good-chains.sh script. Use the add-good-chains-to-whitelist.py
script to add those good chains to the whitelist in this directory. NOTE: this
should only happen with recent versions of matt.
To generate HMM files from the multiple alignment output generated by matt, use
generate-hmm.py. Use this script only after running generate-matt-alignments.py.
To run smurf or hmmer queries on every protein not aligned by matt but in that
same superfamily, use generate-positive-controls.py. Use this script only after
running generate-matt-alignments.py and generate-hmm.py.
To run smurf or hmmer queries on every protein not in that superfamily, use the
generate-negative-controls.py script. Use this script only after running
generate-matt-alignments.py, generate-hmm.py and generate-positive-controls.py.
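The three sets of proteins described above (chains aligned by matt, positive controls, and negative controls) can be illustrated with a leave-one-family-out split. This is a sketch under assumptions: the nested dictionary layout, the function name leave_one_family_out and the chain identifiers in the usage note are all hypothetical, and the real scripts may organize their data quite differently.

```python
def leave_one_family_out(superfamilies, target):
    """Yield one split per family in the `target` superfamily.

    `superfamilies` is assumed to map a superfamily name to a dict that maps
    each family name to a list of chain identifiers.
    """
    families = superfamilies[target]
    # Negative controls: every chain outside the target superfamily.
    negatives = [
        chain
        for name, fams in superfamilies.items()
        if name != target
        for chains in fams.values()
        for chain in chains
    ]
    for held_out in families:
        # Chains to align: the target superfamily minus the held-out family.
        training = [
            chain
            for fam, chains in families.items()
            if fam != held_out
            for chain in chains
        ]
        yield {
            "held_out": held_out,
            "training": training,
            # Positive controls: same superfamily, but not aligned.
            "positive": list(families[held_out]),
            "negative": list(negatives),
        }
```

With a made-up input such as {"sf1": {"famA": ["1abc:A"], "famB": ["2xyz:B"]}, "sf2": {"famC": ["4ghi:A"]}} and target "sf1", the generator yields two splits, one holding out famA and one holding out famB, each with the chains of sf2 as negative controls.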
To simply generate an HMM file from matt for use with smurf, use the
smurf-train.sh script. To use smurf to query this generated HMM file with
specific protein chains, use the smurf-query.sh script.
Purpose of files
================
* add-good-chains-to-whitelist.py - Given a file containing a list of usable
PDB IDs and their associated chain letters, this script adds the PDB IDs and
chains read from that good chains input file to the whitelist file. The format
of the whitelist file is as follows: each line consists of a PDB ID, which is an
alphanumeric string of length 4, followed by a colon, followed by one or more
letters identifying a chain (for example, A, B, C, etc.). Repeated PDB IDs
are NOT allowed in this file. Behavior on repeated PDB IDs is undefined.
Lines starting with hashes are comments and are ignored.
* analyze-smurf.sh -
* argumentparsers.py -
* blastp-to-muscle.sh -
* gargamel.py - The alignment-test.py and query-negative-controls.py scripts
import this module in order to access the common functions which it defines,
which are useful for running matt/smurf on certain sets of protein chains.
* generate-hmm.py -
* generate-matt-alignments.py -
* generate-negative-controls.py - Runs smurf queries on all proteins NOT in the
superfamily aligned by the generate-matt-alignments.py script. This script
relies on the successful completion of generate-matt-alignments.py and
generate-hmm.py scripts, and the directory structure which they output.
generate-negative-controls.py requires some command-line arguments. For more
information on which command-line arguments this script accepts, run the
command:

    generate-negative-controls.py --help

* generate-positive-controls.py -
* get-good-chains.sh - Given a file containing the output from Matt redirected
from stdout, this script will output (to stdout) a list of protein chains
usable by Matt. The format of the output is as follows: each line consists
of a PDB ID, which is an alphanumeric string of length 4, followed by a
colon, followed by a single letter identifying a chain (for example, A, B, C,
etc.). Repeated PDB IDs are allowed in this file. Lines starting with hashes
are comments and are ignored.
NOTE: Currently, the downside to this method of determining which chains are
good and which are bad is that we have to run Matt once and let it fail
before we can determine which chains are good.
* smurf-train.sh - Given a file containing a list of PDB files, this script
runs Matt, SMURF-preparse, and hmmbuild on the specified set of PDB files to
create a hidden Markov model for use with SMURF, specifically for use with
the smurf-query.sh script. WARNING: Matt is finicky and may fail. If Matt
fails, the remaining steps of the script will produce segmentation faults.
* smurf-query.sh - Given a hidden Markov model and a FASTA file containing a
protein sequence to query, this script runs SMURF to determine an alignment
to the consensus template specified in the hidden Markov model generated by
the smurf-train.sh script.
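The two file formats described in the list above differ in one respect: the good chains output allows repeated PDB IDs with one chain letter per line, while the whitelist expects each PDB ID once with all of its chain letters. A minimal sketch of parsing both and merging them follows. It is not the code in add-good-chains-to-whitelist.py; the function names are hypothetical, and sorting the output by PDB ID and dropping comment lines are choices made only for this illustration.

```python
def parse_chain_lines(lines):
    """Map each PDB ID to its chain letters, skipping comments and blanks."""
    chains = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        pdb_id, _, letters = line.partition(":")
        pdb_id, letters = pdb_id.strip(), letters.strip()
        if len(pdb_id) != 4 or not pdb_id.isalnum() or not letters.isalpha():
            raise ValueError(f"malformed line: {raw!r}")
        # Repeated IDs (allowed in the good chains output) accumulate here.
        seen = chains.setdefault(pdb_id, [])
        for letter in letters:
            if letter not in seen:
                seen.append(letter)
    return chains


def merge_into_whitelist(whitelist_lines, good_chain_lines):
    """Return whitelist lines with one entry per PDB ID, sorted by ID."""
    merged = parse_chain_lines(whitelist_lines)
    for pdb_id, letters in parse_chain_lines(good_chain_lines).items():
        seen = merged.setdefault(pdb_id, [])
        for letter in letters:
            if letter not in seen:
                seen.append(letter)
    return [
        f"{pdb_id}:{''.join(letters)}"
        for pdb_id, letters in sorted(merged.items())
    ]
```

Because the merge goes through a dictionary keyed by PDB ID, the result never contains a repeated ID, which matches the whitelist constraint. PDB IDs are compared exactly as written, so "1ABC" and "1abc" would be treated as different entries in this sketch.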
Contact
=======
Jeffrey Finkelstein <jeffrey.finkelstein@gmail.com>
About
cross-validation framework for remote homology detection