Workflows

dzhao edited this page Jan 14, 2026 · 3 revisions

EukFinder Workflows - User Decision Guide

This page shows you step-by-step what to do with your data. Follow the flowchart that matches your input data type.


⚠️ IMPORTANT: Input Files Must Be Uncompressed

EukFinder does NOT accept gzip-compressed files (.fastq.gz, .fasta.gz).

Before following any workflow below, decompress your input files:

# Decompress FASTQ files
gunzip *.fastq.gz
gunzip *.fq.gz

# Decompress FASTA files  
gunzip *.fasta.gz
gunzip *.fa.gz

# Verify files are uncompressed
file your_file.fastq  # Should report "ASCII text", not "gzip compressed data"
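
If you want to keep the compressed originals, or are unsure whether a file is still compressed despite its name, the check and decompression above could be sketched as follows. Both function names are hypothetical helpers, not part of EukFinder. The check relies on the two-byte gzip magic number (1f 8b) rather than the file extension.

```shell
# Hypothetical helper (not part of EukFinder): succeeds if the file starts
# with the gzip magic number 1f 8b, whatever its extension says.
is_gzipped() {
  magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$magic" = "1f8b" ]
}

# Decompress to a new file and leave the .gz original untouched.
# Usage: decompress_keep reads.fastq.gz   (writes reads.fastq)
decompress_keep() {
  gzip -dc "$1" > "${1%.gz}"
}
```

Unlike plain gunzip, which replaces the .gz file, this keeps both copies, so make sure you have the disk space.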

Workflow 1: You Have Raw/Untrimmed Paired-End FASTQ Files

Use this workflow if: Your reads are raw from the sequencer and need adapter/quality trimming

Flowchart

Input: raw paired FASTQ (R1 and R2, uncompressed)
            ↓
   Need to remove host DNA?
      ↙ (No)             ↘ (Yes)
    RUN:                  RUN:
    read_prep_env         read_prep
    (faster)              (with --hg)
      ↓                      ↓
      └──────────→ Run short_seqs
                    ↓
                 Output: classified contigs

Decision: read_prep vs read_prep_env

Question          | read_prep                                  | read_prep_env
Sample type       | Clinical, cultured (has host DNA)          | Environmental (water, soil, ocean)
Examples          | Human gut, plant tissue, cultured organism | Ocean water, soil, sediment, compost
Command           | Requires --hg host_genome.fasta            | No host genome needed
Speed             | Standard speed                             | 10-30% faster ⚡
Removes host DNA? | YES                                        | NO
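
The decision in this table can be expressed as a small wrapper. The sketch below only prints the command it would run, so you can review it first. The function name prep_command is hypothetical; the flags and values are the ones used in Steps 2a and 2b of this workflow.

```shell
# Hypothetical wrapper: print the preprocessing command for one sample.
# Pass a host genome as the 4th argument to select read_prep; omit it
# for environmental samples (read_prep_env).
prep_command() {
  r1=$1; r2=$2; out=$3; host=${4:-}
  if [ -n "$host" ]; then
    echo "eukfinder read_prep --r1 $r1 --r2 $r2 --hg $host -n 24 -i adapters.fa -o $out"
  else
    echo "eukfinder read_prep_env --r1 $r1 --r2 $r2 -n 24 -i adapters.fa -o $out"
  fi
}

# Example: prep_command sample_R1.fastq sample_R2.fastq sample_name host_genome.fasta
```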

Step-by-Step Instructions

Step 1: Prepare Your Files

Make sure your files are:

  • Uncompressed (not .gz)
  • Named clearly: sample_R1.fastq, sample_R2.fastq
  • In the same directory
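
A quick sanity check before preprocessing is to confirm that R1 and R2 hold the same number of reads, since a mismatch usually points to a truncated or mismatched file. The helpers below are hypothetical (not part of EukFinder) and assume standard 4-line FASTQ records with no wrapped sequence lines.

```shell
# Hypothetical helper: number of reads in an uncompressed FASTQ file,
# assuming 4 lines per record.
count_reads() {
  lines=$(wc -l < "$1" | tr -d ' ')
  echo $((lines / 4))
}

# Succeeds only if both mate files contain the same number of reads.
# Usage: check_pairs sample_R1.fastq sample_R2.fastq
check_pairs() {
  n1=$(count_reads "$1")
  n2=$(count_reads "$2")
  if [ "$n1" -ne "$n2" ]; then
    echo "Read count mismatch: $1 has $n1, $2 has $n2" >&2
    return 1
  fi
  echo "OK: $n1 read pairs"
}
```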

Step 2a: For Environmental Samples (NO host DNA)

Run read_prep_env (faster option):

eukfinder read_prep_env \
  --r1 sample_R1.fastq \
  --r2 sample_R2.fastq \
  -n 24 \
  -i adapters.fa \
  -o sample_name

Output files:

  • sample_name_R1PT.fq - Paired R1 reads (trimmed)
  • sample_name_R2PT.fq - Paired R2 reads (trimmed)
  • sample_name_R1unPT.fq - Unpaired reads (trimmed)
  • sample_name_centrifuge_P - Centrifuge results (paired)
  • sample_name_centrifuge_UP - Centrifuge results (unpaired)
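
Before moving on to short_seqs, it can help to confirm that all five files listed above were actually written. This is a minimal sketch; check_prep_outputs is a hypothetical helper, and it tests only that each file exists, not that its contents are valid.

```shell
# Hypothetical check: confirm the five expected output files exist for a
# given output prefix (the value passed to -o).
check_prep_outputs() {
  prefix=$1
  status=0
  for suffix in _R1PT.fq _R2PT.fq _R1unPT.fq _centrifuge_P _centrifuge_UP; do
    if [ ! -e "${prefix}${suffix}" ]; then
      echo "Missing: ${prefix}${suffix}" >&2
      status=1
    fi
  done
  return $status
}

# Example: check_prep_outputs sample_name && echo "Ready for short_seqs"
```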

Step 2b: For Samples WITH Host DNA (clinical, cultured)

Run read_prep with host genome:

eukfinder read_prep \
  --r1 sample_R1.fastq \
  --r2 sample_R2.fastq \
  --hg host_genome.fasta \
  -n 24 \
  -i adapters.fa \
  -o sample_name

Output files: Same as Step 2a above

Step 3: Run short_seqs

Use the output files from Step 2a or 2b:

eukfinder short_seqs \
  --r1 sample_name_R1PT.fq \
  --r2 sample_name_R2PT.fq \
  --un sample_name_R1unPT.fq \
  --pclass sample_name_centrifuge_P \
  --uclass sample_name_centrifuge_UP \
  -o sample_name \
  -n 24 -z 24 -t T \
  --max_m 400 -e 0.05 --pid 50 --cov 30 --mhlen 50

Output: Classified contigs in Eukfinder_results/ folder



Workflow 2: You Have Pre-cleaned/Trimmed FASTQ Files

Use this workflow if: Your paired-end reads have already been trimmed and cleaned (adapters removed, quality filtered)

Flowchart

Input: cleaned FASTQ (R1 and R2, uncompressed)
            ↓
   Have unpaired reads?
      ↙ (Yes)              ↘ (No)
    Use your               CREATE:
    unpaired file          dummy unpaired file
      ↓                      ↓
      └──────────→ Run centrifuge twice
                   (paired + unpaired)
                    ↓
                 Run short_seqs
                    ↓
                 Output: classified contigs

Step-by-Step Instructions

Step 1: Prepare Your Files

Make sure your files are:

  • Uncompressed (not .gz)
  • Named clearly: sample_R1.fastq, sample_R2.fastq
  • In the same directory (all input files together)

If you have unpaired/orphan reads, prepare that file too:

  • File name: sample_unpaired.fastq (uncompressed)

Step 2a: If You HAVE Unpaired Reads

No extra preparation is needed. Go to Step 3.

Step 2b: If You DON'T Have Unpaired Reads

Create a minimal dummy unpaired reads file:

# Copy the first 200 lines (50 reads) of R1 to stand in for "orphan" reads
head -n 200 sample_R1.fastq > sample_unpaired.fastq

This minimal file satisfies the requirement for an unpaired reads file. It is named sample_unpaired.fastq so that the commands in Steps 3 and 4 work unchanged.
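
If you want to control how many reads go into the dummy file, the same idea can be wrapped in a function. FASTQ records are 4 lines each, so the line count passed to head must be a multiple of 4 or the last record will be cut off. The function name make_dummy_unpaired is hypothetical, and the default of 50 reads simply mirrors the 200 lines used above.

```shell
# Hypothetical helper: copy the first N reads of R1 into a placeholder
# unpaired file. N reads = 4*N lines, so records are never truncated.
make_dummy_unpaired() {
  r1=$1; out=$2; n_reads=${3:-50}
  head -n $((n_reads * 4)) "$r1" > "$out"
}

# Example: make_dummy_unpaired sample_R1.fastq sample_unpaired.fastq
```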

Step 3: Run centrifuge

Run centrifuge twice, once on the paired reads and once on the unpaired file (real or dummy). The short_seqs command in Step 4 takes both classification files:

# For paired reads
centrifuge -x /path/to/centrifuge_db \
  -1 sample_R1.fastq \
  -2 sample_R2.fastq \
  -S sample_centrifuge_P

# For unpaired reads
centrifuge -x /path/to/centrifuge_db \
  -U sample_unpaired.fastq \
  -S sample_centrifuge_UP
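
To get a quick feel for what Centrifuge classified before running short_seqs, you could tally the classification file. This sketch assumes Centrifuge's default tab-separated output (a header line, then readID, seqID, taxID and further columns, with taxID 0 for unclassified reads). Check this against the output of your Centrifuge version. The function name summarise_centrifuge is hypothetical.

```shell
# Summarise a Centrifuge classification file.
# Assumption: default tabular output with a header line, read ID in
# column 1 and taxID in column 3 (taxID 0 = unclassified).
summarise_centrifuge() {
  awk -F '\t' '
    NR == 1 { next }                 # skip the header line
    !seen[$1]++ {                    # count each read ID only once
      if ($3 == 0) unclassified++; else classified++
    }
    END { printf "classified=%d unclassified=%d\n", classified, unclassified }
  ' "$1"
}

# Example: summarise_centrifuge sample_centrifuge_P
```

Reads with several equally good hits appear on more than one line, which is why the tally is keyed on the read ID rather than on the line count.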

Step 4: Run short_seqs

eukfinder short_seqs \
  --r1 sample_R1.fastq \
  --r2 sample_R2.fastq \
  --un sample_unpaired.fastq \
  --pclass sample_centrifuge_P \
  --uclass sample_centrifuge_UP \
  -o sample_name \
  -n 24 -z 24 -t T \
  --max_m 400 -e 0.05 --pid 50 --cov 30 --mhlen 50

Output: Classified contigs in Eukfinder_results/ folder


Workflow 3: You Have Long Reads or Assembled Contigs

Use this workflow if: You have PacBio/Nanopore long reads or pre-assembled contig sequences

Flowchart (Simple - One Step!)

Input: Long reads or assembled contigs
       (FASTA or FASTQ, uncompressed)
            ↓
        Run long_seqs
            ↓
        Output: classified contigs

Step-by-Step Instructions

Step 1: Prepare Your File

Make sure your file is:

  • Uncompressed (not .gz)
  • Named clearly: contigs.fasta or long_reads.fastq
  • Valid format: FASTA or FASTQ
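
A simple way to check the last point is to look at the first character of the file: FASTA records start with ">" and FASTQ records start with "@". The helper below is a hypothetical sketch of that check. It does not validate the rest of the file.

```shell
# Hypothetical helper: guess the format from the first character.
# FASTA records start with '>', FASTQ records with '@'. Anything else
# (including a file that is still gzip-compressed) is reported as unknown.
detect_format() {
  first=$(head -c 1 "$1")
  case "$first" in
    '>') echo fasta ;;
    '@') echo fastq ;;
    *)   echo unknown ;;
  esac
}

# Example: detect_format contigs.fasta
```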

Step 2: Run long_seqs

eukfinder long_seqs \
  -l contigs.fasta \
  -o sample_name \
  -n 24 -z 24 -t T \
  --max_m 400 -e 0.05 --pid 50 --cov 30 --mhlen 50

Output: Classified contigs in Eukfinder_results/ folder


Summary: Quick Command Lookup

Data Type                    | Primary Command | Notes
Raw paired reads (no host)   | read_prep_env   | Faster, environmental samples
Raw paired reads (with host) | read_prep --hg  | Requires host genome file
Pre-cleaned paired reads     | short_seqs      | Skip read_prep; run centrifuge first
Long reads / contigs         | long_seqs       | Single-step workflow

Output Files Explained

From read_prep / read_prep_env

File            | What it contains
*_R1PT.fq       | Paired-end R1 reads (trimmed)
*_R2PT.fq       | Paired-end R2 reads (trimmed)
*_R1unPT.fq     | Orphan/unpaired reads (trimmed)
*_centrifuge_P  | Centrifuge classification (paired)
*_centrifuge_UP | Centrifuge classification (unpaired)

From short_seqs or long_seqs

Found in Eukfinder_results/ folder:

File                  | What it contains
scf_output.Euk.fasta  | High-confidence eukaryotic contigs
scf_output.Unk.fasta  | Unclassified contigs
scf_output.EUnk.fasta | Combined Euk + Unk (permissive)
scf_output.Bact.fasta | Bacterial contigs
scf_output.Arch.fasta | Archaeal contigs
summary_table.txt     | Summary statistics
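
For a quick overview of a finished run, you can count the sequences in each result FASTA by counting header lines. This is a minimal sketch: count_result_seqs is a hypothetical helper, and it assumes the results sit in Eukfinder_results/ (or whatever directory you pass in).

```shell
# Count sequences in each result FASTA by counting header lines ('>').
# The directory argument defaults to Eukfinder_results.
count_result_seqs() {
  dir=${1:-Eukfinder_results}
  for f in "$dir"/*.fasta; do
    [ -e "$f" ] || continue                  # no FASTA files matched
    n=$(grep -c '^>' "$f" || true)           # grep exits 1 on zero matches
    printf '%s\t%s\n' "$(basename "$f")" "$n"
  done
}

# Example: count_result_seqs Eukfinder_results
```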

Important Notes

About Uncompressed Files

  • All input files MUST be uncompressed
  • Decompress using: gunzip *.fastq.gz

About Databases

  • Centrifuge database required for all workflows
  • Download databases using: eukfinder install

About Parameters

  • -n: Number of threads (adjust to your system)
  • -z: Number of chunks for assembly
  • -t: Assembly type (T = automatic)
  • --mhlen: Minimum hit length (default 50)
  • --pid: Percent identity threshold
  • --cov: Coverage threshold
  • -e: E-value threshold
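
The examples on this page hard-code -n 24. One way to adjust that to your system is to read the number of online CPUs, as sketched below. The variable name THREADS is just an illustration, and _NPROCESSORS_ONLN is widely supported (Linux, macOS) but not guaranteed everywhere, hence the fallback to 1.

```shell
# Pick a thread count from the number of online CPUs instead of
# hard-coding it; fall back to 1 if getconf cannot report it.
THREADS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "Using $THREADS threads"
```

You would then pass -n "$THREADS" in place of -n 24. On a shared server, consider using fewer threads than the machine reports.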

Troubleshooting

Error: "File is not FASTQ format"

  • Check if file is compressed (.gz) and decompress it
  • Verify file format: head -4 your_file.fastq (shows one full record; the first line should start with @)

Error: "unpaired reads required"

  • For paired-only data, create a dummy unpaired file (see Workflow 2, Step 2b)

Error: "centrifuge database not found"

  • Verify database path is correct
  • Run: eukfinder install to set up the databases

Need More Help?

  • Check the FAQ for detailed answers

Last updated: January 2025 · EukFinder v1.2.4