Oliver Bonham-Carter, Allegheny College
email: obonhamcarter@allegheny.edu
Figure 1. GCS stands for Get Coding Sequences.
Genetic Music: Use your ears to study DNA!!
Bioinformatics tools often take sequences as their input. This program, GCS, creates fasta files of the coding sequences (those that produce protein) from a GenBank file. In addition, the program outputs the non-coding sequences (those that produce no known protein) from the GenBank file. These sequences can then be used for research or to test new tools.
Figure 2. In a GenBank file, there are references for the coding regions.
GCS works by finding the location references of the coding sequences in the GenBank record, as shown in Figure 2. It then uses these start and end markers to locate the actual sequences and places this sequence data into fasta files. The non-coding regions are found by removing the coding regions from the main sequence. The remaining sequence, from which all coding information has been removed, is the non-coding region. Sequences are then extracted from this body of non-coding genetic material.
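The idea described above can be illustrated with a minimal Python sketch. This is not GCS's actual code: the function name `split_coding`, the toy sequence, and the coordinates are hypothetical. The sketch also assumes simple 1-based, inclusive `start..end` locations on the forward strand, ignoring GenBank `join(...)` and `complement(...)` locations.

```python
def split_coding(sequence, locations):
    """Split a sequence into coding pieces and the non-coding remainder.

    `locations` holds 1-based, inclusive (start, end) pairs, the way
    simple GenBank CDS locations such as `3..11` are written.
    """
    coding = {}
    is_coding = [False] * len(sequence)
    for start, end in locations:
        # Convert the 1-based inclusive range to a Python slice.
        coding[(start, end)] = sequence[start - 1:end]
        for i in range(start - 1, min(end, len(sequence))):
            is_coding[i] = True
    # Whatever was never marked as coding forms the non-coding remainder.
    noncoding = "".join(
        base for base, flagged in zip(sequence, is_coding) if not flagged
    )
    return coding, noncoding


# Hypothetical toy record with two coding regions.
seq = "AAATGCCCTAAGGGATGTTTTGACC"
coding, noncoding = split_coding(seq, [(3, 11), (15, 23)])
print(coding)     # the two coding pieces, keyed by location
print(noncoding)  # the sequence left over once the coding pieces are removed
```

Marking positions in a boolean mask, rather than deleting slices one at a time, keeps the coordinates valid even when coding regions overlap.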
numOfSeqs_int = 20
maxSize_int = 400
Note: as shown above, the extracted sequences are 400 base pairs long, but this value can be customized in main.py, along with the number of sequences to produce.
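To show how these two settings could interact, here is a hedged sketch that draws fixed-size pieces from a non-coding string. Only the names `numOfSeqs_int` and `maxSize_int` come from the program. The function `sample_sequences`, the use of random start positions, and the seed are assumptions made for illustration, and GCS may select its sequences differently.

```python
import random

numOfSeqs_int = 20   # number of sequences to produce
maxSize_int = 400    # length of each extracted sequence, in base pairs


def sample_sequences(noncoding, num_seqs, max_size, seed=0):
    """Draw `num_seqs` pieces of at most `max_size` bases each.

    Start positions are chosen at random (an assumption of this sketch).
    """
    rng = random.Random(seed)
    size = min(max_size, len(noncoding))
    if size == 0:
        return []
    samples = []
    for _ in range(num_seqs):
        start = rng.randint(0, len(noncoding) - size)
        samples.append(noncoding[start:start + size])
    return samples


# Hypothetical 1000-base non-coding string used only for the demonstration.
demo_noncoding = "ACGT" * 250
samples = sample_sequences(demo_noncoding, numOfSeqs_int, maxSize_int)
print(len(samples), len(samples[0]))
```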
You must first install Poetry to manage the code's dependencies, and to run the program.
* Set up with Poetry:
+ poetry install
* Find online help:
+ poetry run gcs --bighelp
* Produce reduced-size sequences from a GenBank file:
+ poetry run gcs --data-file data/df.gb
* Produce full-size sequences from a GenBank file:
+ poetry run gcs --data-file data/df.gb --fullseqs
OUTPUT: All output files are saved in the directory `0_out/`.
- Coding files (`C_startLocation-endLocation.fasta`) are named according to their locations as detailed in GenBank files.
- The non-coding sequences (`nC_0.fasta`) are arbitrarily selected from the DNA text string after all the coding material has been removed from the super sequence.
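As a rough illustration of the naming scheme above, the following sketch writes one sequence to a fasta file. The helper `write_fasta`, the header line format, and the 60-character line width are assumptions, not details taken from GCS. The demonstration writes into a temporary directory so that it leaves no files behind.

```python
import tempfile
from pathlib import Path


def write_fasta(out_dir, name, sequence, width=60):
    """Write `sequence` to `<out_dir>/<name>.fasta` and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Wrap the sequence into fixed-width lines, as fasta files commonly do.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    path = out_dir / f"{name}.fasta"
    path.write_text(f">{name}\n" + "\n".join(lines) + "\n")
    return path


# Hypothetical coding sequence located at positions 3..11.
with tempfile.TemporaryDirectory() as tmp:
    path = write_fasta(Path(tmp) / "0_out", "C_3-11", "ATGCCCTAA")
    written = path.read_text()

print(path.name)
print(written)
```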
This program is used primarily to obtain DNA sequence data.
One of the main reasons to create genetic data as sequence files is
to facilitate and provide data for another excellent project:
Genmus,
which converts DNA fasta sequences into piano music.
This is also a work in progress. If you see any way to improve it, please let me know, or make the improvement in the code yourself via a pull request. I would be very grateful for any productive input that you may have.