A pipeline for processing environmental DNA (eDNA) sequences through bioinformatics workflows including quality control, taxonomic assignment, and diversity analysis.
- Download from source:

```bash
git clone https://github.com/ComputationalAgronomy/ednabp.git
cd ednabp
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Install package locally:

```bash
pip install -e .
```

If you plan to run the bp module, ensure its required external tools are installed and available in your PATH. Likewise, if you plan to run the seq module, ensure its required external tools are installed and available in your PATH.
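The specific tool lists for each module are not reproduced here. As a quick sanity check, a short Python snippet can verify whether a given tool is discoverable on your PATH; `blastn` and `iqtree2` below are illustrative names only (BLAST and IQTREE are referenced elsewhere in this README), not an authoritative list.

```python
import shutil

# Illustrative tool names only; substitute the external tools your modules need.
for tool in ["blastn", "iqtree2"]:
    path = shutil.which(tool)
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```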
Core processing pipeline with the following stages:
- Decompress: Extract compressed FASTQ files
- Merge: Combine paired-end reads
- Cut Primer: Remove primer sequences and filter reads by length
- FASTQ to FASTA: Format conversion
- Dereplicate: Remove duplicate sequences (see the conceptual sketch after this list)
- Denoise: Error correction
- Assign Taxa: Taxonomic classification
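As an illustration of what the Dereplicate stage does conceptually, here is a minimal, self-contained sketch that collapses identical sequences while keeping their abundance counts. This is a toy example of the idea, not the pipeline's actual implementation.

```python
from collections import Counter

# Toy reads standing in for the sequences of a FASTA file.
reads = ["ACGT", "ACGT", "TTGA", "ACGT", "TTGA"]

# Dereplication: collapse identical sequences, keeping abundance counts.
derep = Counter(reads)
for seq, count in derep.most_common():
    print(f">size={count}\n{seq}")
```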
- Data Objects: Structured data containers for pipeline results.
- Writing: Export diversity metric tables as CSV files.
- Plotting: Visualization tools (`barchart`, `heatmap`, `rankcorr`, `sankey`) using Plotly as the underlying package.
- Clustering: Sequence clustering analysis architecture that accepts a reducer class (e.g., `PCA`, `TSNE`, `UMAP`) and a clusterer class (e.g., `AgglomerativeClustering`, `HDBSCAN`); see the sketch after this list. Note: you may need to install additional packages to access these classes.
- Phylogenetics: Tree construction and analysis using IQTREE.
- (TODO) Haplotype Networks: Write NEXUS files as input for POPART to draw haplotype networks.
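The exact clustering entry point is not shown in this README, so the snippet below is a generic sketch of the reduce-then-cluster pattern the architecture describes, written directly against scikit-learn (assumed here to be the source of the named classes); consult the cluster module's API for how ednabp actually wires these together.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

# Toy feature matrix standing in for per-sequence features (e.g., k-mer counts).
rng = np.random.default_rng(0)
X = rng.random((50, 16))

# Reduce dimensionality first, then cluster in the reduced space.
reducer = PCA(n_components=2)
clusterer = AgglomerativeClustering(n_clusters=3)

X_reduced = reducer.fit_transform(X)
labels = clusterer.fit_predict(X_reduced)
print(labels)
```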
The cluster module may be separated into an independent repository in the future to keep each repo simple.
The phylo and hap_net modules may be removed, as wrapping those software packages in a Python interface is somewhat redundant compared with using them directly.
```python
from ednabp.bp import BioPipeline

pipeline = BioPipeline(
input_path="/path/to/files_folder", # Directory containing multiple files
# input_path="/path/to/single_file", # Alternative: single file input
output_path="/path/to/output",
)

custom_settings = {
"rm_p_5": "GGACGATAAGACCCTATAAA",
"rm_p_3": "ACTTTAGGGATAACAGCGT",
"min_read_len": 154,
"max_read_len": 189,
"blast_db": "/path/to/custom/blast/db",
"lineage_db": "/path/to/custom/lineage/db"
"verbose": True,
"n_cpu": 8,
}
pipeline = BioPipeline(
input_path="/path/to/files_folder",
output_path="/path/to/output",
**custom_settings
)
```

Alternatively, run the pipeline from the command line:

```bash
ednabp -i INPUT_PATH -o OUTPUT_PATH
```

To check the available parameters, run:

```bash
ednabp -h
```

Import the pipeline results into a data container:

```python
from ednabp.data import BPData
data = BPData()
data.import_data("results/")
# optional
data.import_metadata("path/to/sample_metadata")
data.import_spc_info("path/to/fishbase_db", "path/to/stock_db")
```

This package also supports importing data from another popular pipeline to run downstream analysis:

```python
from ednabp.data import MitoData
data = MitoData()
data.import_data("results/")
# optional
data.import_metadata("path/to/sample_metadata")
```

You can serialize and deserialize a data container for repeated use. This process is known as "pickling" and "unpickling." Note: never unpickle a .pkl file from an unknown source.

```python
data.pickle_data("path/to/save_dir", "save_name")
```

Next time, you only need to unpickle the data container and don't need to import everything again:

```python
data = BPData() # or MitoData()
data.unpickle_data("path/to/pkl_file")
```

Here is an example that writes an abundance table and draws a barchart of species abundance across samples:

```python
from ednabp.div.write import Writer
from ednabp.div.plot import barchart
# Create abundance dataframe
writer = Writer(data)
df = writer.abundance(taxa_lv='species')
# Generate barchart
fig, plotter = barchart(
df=df,
values='abundance',
index='species',
columns='sample_id'
)
```

We also provide two other metrics, richness and detection probability, plus three additional visualization options: heatmap, sankey diagram, and rank correlation matrix. Additionally, you can customize parameters for summarizing metrics and visualizing data, such as `taxa_lv`, `values`, `index`, and `columns`. These options give you the flexibility to describe your own data.
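For example, assuming `heatmap` mirrors the `barchart` signature above and that `Writer` exposes a richness method analogous to `abundance` (both are assumptions; check the actual API for the exact names and parameters), a richness heatmap across samples might look like this:

```python
from ednabp.div.write import Writer
from ednabp.div.plot import heatmap

writer = Writer(data)
# Hypothetical method name, by analogy with writer.abundance().
df = writer.richness(taxa_lv='genus')

# Signature assumed to mirror barchart() above.
fig, plotter = heatmap(
    df=df,
    values='richness',
    index='genus',
    columns='sample_id'
)
```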
Run the test suite:

```bash
pytest
# or
pytest ./tests/test_XXX.py
```
