
A project to extract coding and non-coding sequences from Genbank files. These sequences may be used as data for other tools.


developmentAC/GetCodingSeqs


GetCodingSequences: a coding/non-coding sequence extractor to be used with GenBank files.

Oliver Bonham-Carter, Allegheny College


Figure 1. GCS stands for Get Coding Sequences. Genetic Music: Use your ears to study DNA!!

Description

Many bioinformatics tools take sequences as input. This program, GCS, creates fasta files of the coding sequences (those that produce protein) in a GenBank file. It also outputs the non-coding sequences (those that produce no known protein) from the same GenBank file. These sequences can then be used for research or to test new tools.

Figure 2. In a GenBank file, there are location references for the coding regions.

Mechanism

GCS finds the coding sequences in a GenBank file by reading their location references in the record, as shown in Figure 2. GCS then uses these start and end markers to locate the actual sequences and places this sequence data into fasta files. The non-coding regions are found by removing the coding regions from the main sequence. The remaining sequence, from which all coding information has been removed, is the non-coding region. Sequences are then extracted from this body of non-coding genetic material.

    numOfSeqs_int = 20
    maxSize_int = 400

Note: as shown above, the size of the extracted sequences is 400 base pairs, but this value may be customized in main.py, along with the number of sequences to produce.
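The mechanism described above can be sketched in a few lines of Python. This is only an illustration, not the code in main.py: the function names `split_coding_noncoding` and `sample_windows` are hypothetical, locations are assumed to be 1-based inclusive (start, end) pairs as written in GenBank records, and complement strands and joined locations are ignored. How GCS itself chooses where each extracted sequence starts is not documented here, so the random window choice is an assumption.

```python
import random


def split_coding_noncoding(sequence, locations):
    """Split a sequence into coding pieces and the non-coding remainder.

    locations: iterable of 1-based inclusive (start, end) pairs (assumed format).
    """
    coding = {}
    keep = [True] * len(sequence)
    for start, end in locations:
        # GenBank positions are 1-based and inclusive, Python slices are not.
        coding[(start, end)] = sequence[start - 1:end]
        for i in range(start - 1, min(end, len(sequence))):
            keep[i] = False  # mark coding bases for removal
    # Whatever is left after removing coding bases is the non-coding text.
    noncoding = "".join(base for base, k in zip(sequence, keep) if k)
    return coding, noncoding


def sample_windows(text, num_of_seqs, max_size, seed=None):
    """Pick num_of_seqs windows of at most max_size bases (assumed strategy)."""
    if not text:
        return []
    rng = random.Random(seed)
    size = min(max_size, len(text))
    windows = []
    for _ in range(num_of_seqs):
        start = rng.randrange(len(text) - size + 1)
        windows.append(text[start:start + size])
    return windows


if __name__ == "__main__":
    coding, noncoding = split_coding_noncoding("AAACCCGGGTTT", [(4, 6), (10, 12)])
    print(coding)
    print(noncoding)
    print(sample_windows(noncoding, num_of_seqs=3, max_size=4, seed=1))
```

Here `num_of_seqs` and `max_size` play the roles of `numOfSeqs_int` and `maxSize_int` from the settings shown above.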

Running the code

You must first install Poetry to manage the code's dependencies, and to run the program.

* Set up with Poetry:
    + poetry install
* Find online help:
    + poetry run gcs --bighelp
* Produce reduced-size sequences from a GenBank file:
    + poetry run gcs --data-file data/df.gb
* Produce full-size sequences from a GenBank file:
    + poetry run gcs --data-file data/df.gb --fullseqs

OUTPUT: All output files are saved in the directory `0_out/`.

  • Coding files (C_startLocation-endLocation.fasta) are named according to their locations as detailed in the GenBank file.
  • The non-coding sequences (nC_0.fasta) are arbitrarily selected from the DNA text string after all the coding material has been removed from the super sequence.
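The naming scheme above could be produced with a small helper like the one below. This is a hedged sketch rather than the project's own code: the helper names, the header text, and the 70-character line width are assumptions, and the trailing number in `nC_0.fasta` is assumed to be a running index.

```python
from pathlib import Path


def coding_filename(start, end):
    """Name a coding file after its GenBank start and end locations."""
    return f"C_{start}-{end}.fasta"


def noncoding_filename(index):
    """Name a non-coding file with a running index (assumed meaning)."""
    return f"nC_{index}.fasta"


def write_fasta(out_dir, filename, header, sequence, width=70):
    """Write one sequence as a fasta file and return the path written."""
    out_path = Path(out_dir) / filename
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Break the sequence into fixed-width lines, as is conventional for fasta.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    out_path.write_text(">" + header + "\n" + "\n".join(lines) + "\n")
    return out_path


if __name__ == "__main__":
    # The locations and sequences here are made-up illustration values.
    print(write_fasta("0_out", coding_filename(4, 6), "coding 4..6", "CCC"))
    print(write_fasta("0_out", noncoding_filename(0), "noncoding 0", "AAAGGG"))
```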

Future Work

This program is used primarily to obtain DNA sequence data. One of the main reasons to produce genetic data as sequence files is to provide data for another excellent project: Genmus, which converts DNA fasta sequences into piano music.

This is also a work in progress. If you see any way to improve it, please let me know, or make the improvement in the code via a pull request. I would be very grateful for any productive input that you may have.
