
The purpose of this project is to implement eDNA bioinformatics processing and to explore taxonomic delimitation from various perspectives using eDNA sequences.


eDNA Bioinformatics Pipeline (ednabp)

A pipeline for processing environmental DNA (eDNA) sequences through bioinformatics workflows including quality control, taxonomic assignment, and diversity analysis.

Contents

  • Installation
  • Modules
  • Usage Examples
  • Testing

Installation

Package Installation

  1. Download the source:

```shell
git clone https://github.com/ComputationalAgronomy/ednabp.git
cd ednabp
```

  2. Install dependencies:

```shell
pip install -r requirements.txt
```

  3. Install the package locally:

```shell
pip install -e .
```

Additional prerequisites for running the bp module

If you plan to run the bp module, ensure the following external tools are installed and available on your PATH:

Additional prerequisites for running the seq module

If you plan to run the seq module, ensure the following external tools are installed and available on your PATH:
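Since the exact tool lists are not reproduced here, a quick generic way to verify that a given external tool is on your PATH is with the standard library's `shutil.which`. The tool names below are placeholders, not the actual prerequisites:

```python
import shutil

def check_tools(tools):
    """Map each tool name to whether an executable with that name is on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}

# Placeholder names -- substitute the actual prerequisites for the bp/seq modules.
status = check_tools(["some_required_tool", "another_tool"])
missing = [t for t, found in status.items() if not found]
if missing:
    print(f"Missing tools: {missing}")
```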

Modules

*(modules overview figure)*

1. Bioinformatics Pipeline (bp)

Core processing pipeline with the following stages:

  • Decompress: Extract compressed FASTQ files
  • Merge: Combine paired-end reads
  • Cut Primer: Remove primer sequences and filter reads by length
  • FASTQ to FASTA: Format conversion
  • Dereplicate: Remove duplicate sequences
  • Denoise: Error correction
  • Assign Taxa: Taxonomic classification
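Conceptually, the stages above run in sequence, each consuming the previous stage's output. A minimal sketch of that data flow (the stage functions below are illustrative stubs that only record execution order; the real implementations live in ednabp.bp):

```python
def run_stages(data, stages):
    """Apply each stage function in order, passing results forward."""
    for stage in stages:
        data = stage(data)
    return data

# Stage names taken from the pipeline description above.
STAGE_NAMES = ["decompress", "merge", "cut_primer", "fastq_to_fasta",
               "dereplicate", "denoise", "assign_taxa"]

def make_stage(name):
    """Build a stub stage that appends its name to the running history."""
    def stage(history):
        return history + [name]
    return stage

history = run_stages([], [make_stage(n) for n in STAGE_NAMES])
print(history)  # stages execute in the listed order
```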

2. Data Management (data)

  • Data Objects: Structured data containers for pipeline results.

A complete data container structure looks like the following:

*(data container structure figure)*
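The original figure is not reproduced here, but as a rough, hypothetical sketch, a structured container for pipeline results might group related tables like this. The field names are invented for illustration and do not reflect the actual BPData layout:

```python
from dataclasses import dataclass, field

# Hypothetical sketch only -- field names are invented for illustration
# and do not reflect ednabp's actual data container layout.
@dataclass
class ResultsContainer:
    abundance: dict = field(default_factory=dict)  # sample -> taxon -> read count
    metadata: dict = field(default_factory=dict)   # sample -> attributes
    taxonomy: dict = field(default_factory=dict)   # sequence id -> lineage

c = ResultsContainer()
c.abundance["sample_1"] = {"Engraulis japonicus": 42}
```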

3. Diversity Analysis (div)

  • Writing: Export diversity metrics as CSV tables.
  • Plotting: Visualization tools (barchart, heatmap, rankcorr, sankey) using Plotly as the underlying package.

4. Sequence Analysis (seq)

  • Clustering: Sequence clustering analysis architecture that accepts a reducer class (e.g., PCA, TSNE, UMAP) and a clusterer class (e.g., AgglomerativeClustering, HDBSCAN). Note: You may need to install additional packages to access these classes.
  • Phylogenetics: Tree construction and analysis using IQTREE.
  • (TODO) Haplotype Networks: Write NEXUS files as input for POPART to draw haplotype networks.
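The reducer-plus-clusterer architecture above can be illustrated with a small duck-typed sketch. The toy classes below stand in for, e.g., scikit-learn's PCA and AgglomerativeClustering; this is not ednabp's actual interface:

```python
class FirstAxisReducer:
    """Toy reducer: keep only the first feature (stands in for PCA/TSNE/UMAP)."""
    def fit_transform(self, X):
        return [[row[0]] for row in X]

class MeanThresholdClusterer:
    """Toy clusterer: label 1 if a point's first coordinate is above the mean."""
    def fit_predict(self, X):
        mean = sum(row[0] for row in X) / len(X)
        return [1 if row[0] > mean else 0 for row in X]

def cluster_sequences(features, reducer, clusterer):
    """Reduce dimensionality, then cluster -- the pattern the seq module accepts."""
    reduced = reducer.fit_transform(features)
    return clusterer.fit_predict(reduced)

features = [[0.1, 5.0], [0.2, 4.0], [9.0, 1.0], [8.5, 2.0]]
labels = cluster_sequences(features, FirstAxisReducer(), MeanThresholdClusterer())
print(labels)  # two groups, split by the first coordinate
```

Any pair of objects with `fit_transform` and `fit_predict` methods fits this pattern, which is why drop-in classes like PCA and HDBSCAN work without special-casing.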

The cluster module may be split into an independent repository in the future to keep each repository focused.

The phylo and hap_net modules may be removed, as wrapping those tools in a Python interface is somewhat redundant compared to running the software directly.

Usage Examples

Pipeline Processing

```python
from ednabp.bp import BioPipeline

# Run the default pipeline
pipeline = BioPipeline(
    input_path="/path/to/files_folder",   # Directory containing multiple files
    # input_path="/path/to/single_file",  # Alternative: single file input
    output_path="/path/to/output",
)

# Run with custom settings
custom_settings = {
    "rm_p_5": "GGACGATAAGACCCTATAAA",
    "rm_p_3": "ACTTTAGGGATAACAGCGT",
    "min_read_len": 154,
    "max_read_len": 189,
    "blast_db": "/path/to/custom/blast/db",
    "lineage_db": "/path/to/custom/lineage/db",
    "verbose": True,
    "n_cpu": 8,
}

pipeline = BioPipeline(
    input_path="/path/to/files_folder",
    output_path="/path/to/output",
    **custom_settings
)
```

CLI

```shell
ednabp -i INPUT_PATH -o OUTPUT_PATH
```

To check the available parameters, run:

```shell
ednabp -h
```

Data Management

Import from ednabp.bp.BioPipeline outputs

```python
from ednabp.data import BPData

data = BPData()
data.import_data("results/")
# optional
data.import_metadata("path/to/sample_metadata")
data.import_spc_info("path/to/fishbase_db", "path/to/stock_db")
```

Import from MiFish Pipeline outputs

This package also supports importing data from the MiFish Pipeline, another popular eDNA pipeline, for downstream analysis.

MiFish Pipeline webpage

```python
from ednabp.data import MitoData

data = MitoData()
data.import_data("results/")
# optional
data.import_metadata("path/to/sample_metadata")
```

Reuse a data container

You can serialize and deserialize a data container for repetitive use. This process is known as "pickling" and "unpickling." Note: Never unpickle a .pkl file from an unknown source.

```python
data.pickle_data("path/to/save_dir", "save_name")
```

The next time, you only need to unpickle the data container instead of importing everything again.

```python
data = BPData()  # or MitoData()
data.unpickle_data("path/to/pkl_file")
```
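Under the hood, this presumably relies on Python's standard library pickle module, along these lines (a sketch of the likely mechanics, not ednabp's actual implementation):

```python
import os
import pickle
import tempfile

def pickle_data(obj, save_dir, save_name):
    """Serialize obj to <save_dir>/<save_name>.pkl (illustrative sketch)."""
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(save_dir, save_name + ".pkl")
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path

def unpickle_data(path):
    """Load a previously pickled object. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)

save_dir = tempfile.mkdtemp()
path = pickle_data({"samples": ["a", "b"]}, save_dir, "container")
restored = unpickle_data(path)
```

The same security caveat applies: pickle will execute arbitrary code during deserialization, so never load a .pkl file from an unknown source.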

Diversity Metrics Summary

Here is an example that writes an abundance table and draws a bar chart of species abundance across samples.

```python
from ednabp.div.write import Writer
from ednabp.div.plot import barchart

# Create abundance dataframe
writer = Writer(data)
df = writer.abundance(taxa_lv='species')

# Generate barchart
fig, plotter = barchart(
    df=df,
    values='abundance',
    index='species',
    columns='sample_id'
)
```

We also provide two other metrics, richness and detection probability, plus three additional visualization options: heatmap, sankey diagram, and rank correlation matrix. You can also customize the parameters for summarizing metrics and visualizing data, such as taxa_lv, values, index, and columns, giving you the flexibility to explore your own data.
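For intuition, richness is simply the number of distinct taxa detected in each sample. From an abundance table it can be computed as in this standalone sketch with toy data (the Writer class handles this for you; this is not its API):

```python
# Toy abundance table: sample -> species -> read count.
abundance = {
    "site_A": {"Engraulis japonicus": 120, "Sardinops melanostictus": 30},
    "site_B": {"Engraulis japonicus": 5},
}

def richness(table):
    """Count the taxa with nonzero abundance in each sample."""
    return {sample: sum(1 for n in counts.values() if n > 0)
            for sample, counts in table.items()}

print(richness(abundance))  # {'site_A': 2, 'site_B': 1}
```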

Testing

Run the test suite:

```shell
pytest
# or run a single test file
pytest ./tests/test_XXX.py
```
