Workflows

EukFinder Workflows - User Decision Guide

This page shows you step-by-step what to do with your data. Follow the flowchart that matches your input data type.

⚠️ IMPORTANT: Input Files Must Be Uncompressed

EukFinder does NOT accept .gz compressed files (.fastq.gz, .fasta.gz)

Before following any workflow below, decompress your input files:

# Decompress FASTQ files
gunzip *.fastq.gz
gunzip *.fq.gz

# Decompress FASTA files  
gunzip *.fasta.gz
gunzip *.fa.gz

# Verify files are uncompressed
file your_file.fastq  # Should report text (e.g. "ASCII text"); "gzip compressed data" means the file is still compressed
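The wording that `file` prints varies between systems, so a stricter check can look at the first two bytes of the file instead. The sketch below tests for the gzip magic bytes (1f 8b). `is_gzipped` is a hypothetical helper name and is not part of EukFinder.

```shell
# Hypothetical helper (not part of EukFinder): succeeds if the file starts
# with the gzip magic bytes 1f 8b, regardless of its extension.
is_gzipped() {
  [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
}

# Example: warn about any FASTQ file that is still compressed.
# for f in *.fastq; do is_gzipped "$f" && echo "still compressed: $f"; done
```

This also catches compressed files that were renamed without the .gz extension.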

Workflow 1: You Have Raw/Untrimmed Paired-End FASTQ Files

Use this workflow if: Your reads are raw from the sequencer and need adapter/quality trimming

Flowchart

Input: raw paired FASTQ (R1 and R2, uncompressed)
            ↓
   Need to remove host DNA?
      ↙ (No)             ↘ (Yes)
    RUN:                  RUN:
    read_prep_env         read_prep
    (faster)              (with --hg)
      ↓                      ↓
      └──────────→ Run short_seqs
                    ↓
                 Output: classified contigs

Decision: read_prep vs read_prep_env

Question	read_prep	read_prep_env
Sample type	Clinical, cultured (has host DNA)	Environmental (water, soil, ocean)
Examples	Human gut, plant tissue, cultured organism	Ocean water, soil, sediment, compost
Command	Requires `--hg host_genome.fasta`	No host genome needed
Speed	Standard speed	10-30% faster ⚡
Removes host DNA?	YES	NO
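The decision in the table can be sketched as a small dry-run helper that prints the matching command instead of running it. `read_prep_cmd` is a hypothetical name used only for illustration; the flags are the ones shown in the steps on this page.

```shell
# Hypothetical dry-run helper: prints the eukfinder command for a sample.
# Arguments: sample name, R1, R2, adapter file, threads, optional host genome.
read_prep_cmd() {
  sample=$1; r1=$2; r2=$3; adapters=$4; threads=$5; host=${6:-}
  if [ -n "$host" ]; then
    # Host genome supplied: use read_prep with --hg
    echo "eukfinder read_prep --r1 $r1 --r2 $r2 --hg $host -n $threads -i $adapters -o $sample"
  else
    # No host genome: environmental sample, use read_prep_env
    echo "eukfinder read_prep_env --r1 $r1 --r2 $r2 -n $threads -i $adapters -o $sample"
  fi
}

# Example:
# read_prep_cmd sample_name sample_R1.fastq sample_R2.fastq adapters.fa 24
```

Printing the command first makes it easy to review before launching a long run.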

Step-by-Step Instructions

Step 1: Prepare Your Files

Make sure your files are:

Uncompressed (not .gz)
Named clearly: sample_R1.fastq, sample_R2.fastq
In the same directory
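Before trimming, it can help to confirm that R1 and R2 really are a matched pair. The sketch below compares read counts, assuming standard four-line FASTQ records that end with a newline. `count_reads` and `check_pairs` are hypothetical helpers, not EukFinder commands.

```shell
# Hypothetical helpers (not part of EukFinder).
# Count reads in an uncompressed FASTQ file, assuming 4 lines per record.
count_reads() {
  echo $(( $(wc -l < "$1") / 4 ))
}

# Succeed only if R1 and R2 contain the same number of reads.
check_pairs() {
  [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}

# Example:
# check_pairs sample_R1.fastq sample_R2.fastq || echo "R1/R2 read counts differ"
```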

Step 2a: For Environmental Samples (NO host DNA)

Run read_prep_env (faster option):

eukfinder read_prep_env \
  --r1 sample_R1.fastq \
  --r2 sample_R2.fastq \
  -n 24 \
  -i adapters.fa \
  -o sample_name

Output files:

sample_name_R1PT.fq - Paired R1 reads (trimmed)
sample_name_R2PT.fq - Paired R2 reads (trimmed)
sample_name_R1unPT.fq - Unpaired reads (trimmed)
sample_name_centrifuge_P - Centrifuge results (paired)
sample_name_centrifuge_UP - Centrifuge results (unpaired)

Step 2b: For Samples WITH Host DNA (clinical, cultured)

Run read_prep with host genome:

eukfinder read_prep \
  --r1 sample_R1.fastq \
  --r2 sample_R2.fastq \
  --hg host_genome.fasta \
  -n 24 \
  -i adapters.fa \
  -o sample_name

Output files: Same as Step 2a above
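Since Step 3 needs all five files, a quick existence check can save a failed run. This is a minimal sketch based on the file names listed in Step 2a; `check_read_prep_outputs` is a hypothetical helper name.

```shell
# Hypothetical helper: verify that every expected read_prep output exists
# for a given output prefix (the value passed to -o).
check_read_prep_outputs() {
  prefix=$1
  status=0
  for suffix in _R1PT.fq _R2PT.fq _R1unPT.fq _centrifuge_P _centrifuge_UP; do
    if [ ! -e "${prefix}${suffix}" ]; then
      echo "missing: ${prefix}${suffix}" >&2
      status=1
    fi
  done
  return $status
}

# Example:
# check_read_prep_outputs sample_name && echo "ready for short_seqs"
```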

Step 3: Run short_seqs

Use the output files from Step 2a or 2b:

eukfinder short_seqs \
  --r1 sample_name_R1PT.fq \
  --r2 sample_name_R2PT.fq \
  --un sample_name_R1unPT.fq \
  --pclass sample_name_centrifuge_P \
  --uclass sample_name_centrifuge_UP \
  -o sample_name \
  -n 24 -z 24 -t T \
  --max_m 400 -e 0.05 --pid 50 --cov 30 --mhlen 50

Output: Classified contigs in Eukfinder_results/ folder

Workflow 2: You Have Pre-cleaned/Trimmed FASTQ Files

Use this workflow if: Your paired-end reads have already been trimmed and cleaned (adapters removed, quality filtered)

Flowchart

Input: cleaned FASTQ (R1 and R2, uncompressed)
            ↓
   Have unpaired reads?
      ↙ (Yes)        ↘ (No)
    USE:           CREATE:
    your unpaired  dummy unpaired
    file           file
      ↓                   ↓
      └─────→ Run centrifuge twice (paired + unpaired)
                       ↓
                 Run short_seqs
                       ↓
            Output: classified contigs

Step-by-Step Instructions

Step 1: Prepare Your Files

Make sure your files are:

Uncompressed (not .gz)
Named clearly: sample_R1.fastq, sample_R2.fastq
In the same directory: keep all input files together

If you have unpaired/orphan reads, prepare that file too:

File name: sample_unpaired.fastq (uncompressed)

Step 2a: If You HAVE Unpaired Reads

No extra preparation is needed. Skip Step 2b and go to Step 3.

Step 2b: If You DON'T Have Unpaired Reads

Create a minimal dummy unpaired reads file:

# Copy the first 200 lines (50 reads) of R1 and treat them as "orphan" reads
head -n 200 sample_R1.fastq > sample_unpaired.fastq

This minimal file satisfies the requirement for unpaired reads. It is named sample_unpaired.fastq so that the commands in Steps 3 and 4 can be used unchanged.
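A FASTQ record spans four lines, so the number of lines copied should be a multiple of four (200 lines is 50 reads). The sketch below makes that explicit by taking a read count instead of a line count. `make_dummy_unpaired` is a hypothetical helper, not an EukFinder command.

```shell
# Hypothetical helper: copy the first N reads (4 lines each) of a FASTQ file
# into a placeholder "unpaired" file. Defaults to 50 reads (200 lines).
make_dummy_unpaired() {
  src=$1; dest=$2; nreads=${3:-50}
  head -n $(( nreads * 4 )) "$src" > "$dest"
}

# Example:
# make_dummy_unpaired sample_R1.fastq sample_unpaired.fastq 50
```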

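Step 3: Run centrifuge (paired and unpaired reads)

short_seqs needs a Centrifuge result for both the paired reads (--pclass) and the unpaired reads (--uclass), so run centrifuge twice. For the unpaired run, use your real unpaired file or the dummy file from Step 2b:

# For paired reads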
centrifuge -x /path/to/centrifuge_db \
  -1 sample_R1.fastq \
  -2 sample_R2.fastq \
  -S sample_centrifuge_P

# For unpaired reads
centrifuge -x /path/to/centrifuge_db \
  -U sample_unpaired.fastq \
  -S sample_centrifuge_UP
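To sanity-check a Centrifuge result before passing it to short_seqs, one option is to count how many reads received a classification. The sketch below assumes Centrifuge's default tab-separated output: one header line, readID in column 1, and taxID in column 3, with taxID 0 marking unclassified reads. If your output uses a different layout, adjust the column numbers. `count_classified_reads` is a hypothetical helper name.

```shell
# Hypothetical helper: count distinct reads that received a non-zero taxID.
# Assumes Centrifuge's default tab-separated output: one header line, then
# readID in column 1 and taxID in column 3 (taxID 0 = unclassified).
count_classified_reads() {
  awk -F '\t' '
    NR > 1 && $3 != 0 { seen[$1] = 1 }
    END { n = 0; for (r in seen) n++; print n }' "$1"
}

# Example:
# count_classified_reads sample_centrifuge_P
```

Reads are counted once even when Centrifuge reports several hits for the same read.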

Step 4: Run short_seqs

eukfinder short_seqs \
  --r1 sample_R1.fastq \
  --r2 sample_R2.fastq \
  --un sample_unpaired.fastq \
  --pclass sample_centrifuge_P \
  --uclass sample_centrifuge_UP \
  -o sample_name \
  -n 24 -z 24 -t T \
  --max_m 400 -e 0.05 --pid 50 --cov 30 --mhlen 50

Output: Classified contigs in Eukfinder_results/ folder

Workflow 3: You Have Long Reads or Assembled Contigs

Use this workflow if: You have PacBio/Nanopore long reads or pre-assembled contig sequences

Flowchart (Simple - One Step!)

Input: Long reads or assembled contigs
       (FASTA or FASTQ, uncompressed)
            ↓
        Run long_seqs
            ↓
        Output: classified contigs

Step-by-Step Instructions

Step 1: Prepare Your File

Make sure your file is:

Uncompressed (not .gz)
Named clearly: contigs.fasta or long_reads.fastq
Valid format: FASTA or FASTQ
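A quick way to confirm the format is to look at the first character of the file: FASTA headers start with ">" and FASTQ headers start with "@". The following is only a rough sketch, not a full validator, and `detect_seq_format` is a hypothetical helper name.

```shell
# Hypothetical helper: guess the format from the first character of the file.
# '>' suggests FASTA, '@' suggests FASTQ; anything else is reported as unknown.
detect_seq_format() {
  case "$(head -c 1 "$1")" in
    '>') echo fasta ;;
    '@') echo fastq ;;
    *)   echo unknown ;;
  esac
}

# Example:
# detect_seq_format contigs.fasta
```

A result of "unknown" often means the file is still compressed or starts with blank lines.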

Step 2: Run long_seqs

eukfinder long_seqs \
  -l contigs.fasta \
  -o sample_name \
  -n 24 -z 24 -t T \
  --max_m 400 -e 0.05 --pid 50 --cov 30 --mhlen 50

Output: Classified contigs in Eukfinder_results/ folder

Summary: Quick Command Lookup

Data Type	Primary Command	Notes
Raw paired reads (no host)	`read_prep_env`	Faster, environmental samples
Raw paired reads (with host)	`read_prep --hg`	Requires host genome file
Pre-cleaned paired reads	`short_seqs`	Skip preprocessing
Long reads / contigs	`long_seqs`	Single-step workflow

Output Files Explained

From read_prep / read_prep_env

File	What it contains
`*_R1PT.fq`	Paired-end R1 reads (trimmed)
`*_R2PT.fq`	Paired-end R2 reads (trimmed)
`*_R1unPT.fq`	Orphan/unpaired reads (trimmed)
`*_centrifuge_P`	Centrifuge classification (paired)
`*_centrifuge_UP`	Centrifuge classification (unpaired)

From short_seqs or long_seqs

Found in Eukfinder_results/ folder:

File	What it contains
`scf_output.Euk.fasta`	High-confidence eukaryotic contigs
`scf_output.Unk.fasta`	Unclassified contigs
`scf_output.EUnk.fasta`	Combined Euk + Unk (permissive)
`scf_output.Bact.fasta`	Bacterial contigs
`scf_output.Arch.fasta`	Archaeal contigs
`summary_table.txt`	Summary statistics
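To get a quick overview of a results folder, the number of sequences in each FASTA file can be counted from the ">" header lines. This is a minimal sketch; `summarize_fasta_counts` is a hypothetical helper and does not replace summary_table.txt.

```shell
# Hypothetical helper: print the number of sequences in each FASTA file of a
# results directory by counting header lines that start with '>'.
summarize_fasta_counts() {
  dir=$1
  for f in "$dir"/*.fasta; do
    [ -e "$f" ] || continue   # skip when no FASTA files match
    n=$(grep -c '^>' "$f" || true)
    printf '%s\t%s\n' "$(basename "$f")" "$n"
  done
}

# Example:
# summarize_fasta_counts Eukfinder_results
```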

Important Notes

About Uncompressed Files

All input files MUST be uncompressed
Decompress using: gunzip *.fastq.gz

About Databases

Centrifuge database required for all workflows
Download databases using: eukfinder install

About Parameters

-n: Number of threads (adjust to your system)
-z: Number of chunks for assembly
-t: Assembly type (T = automatic)
--mhlen: Minimum hit length (default 50)
--pid: Percent identity threshold
--cov: Coverage threshold
-e: E-value threshold
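The examples on this page use -n 24, which may exceed the cores available on a smaller machine. One way to pick a value is to query the CPU count, as in the sketch below. `detect_threads` is a hypothetical helper; on a shared cluster the scheduler's allocation should take priority over this number.

```shell
# Hypothetical helper: print the number of online CPUs, falling back to 1
# when neither nproc nor getconf can report it.
detect_threads() {
  nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# Example (other flags as shown in Workflow 3):
# eukfinder long_seqs -l contigs.fasta -o sample_name -n "$(detect_threads)" ...
```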

Troubleshooting

Error: "File is not FASTQ format"

Check whether the file is compressed (.gz) and decompress it
Verify the file format: head -4 your_file.fastq (the first line of a FASTQ record starts with @ and the third with +)
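The manual check can be sketched as a small function that inspects the first record only. It assumes the standard four-line FASTQ layout and is not a full validator. `check_first_fastq_record` is a hypothetical helper name.

```shell
# Hypothetical helper: check the first FASTQ record.
# Line 1 must start with '@', line 3 with '+', and the sequence (line 2)
# and quality (line 4) strings must have the same length.
check_first_fastq_record() {
  head -n 4 "$1" | awk '
    NR == 1 && substr($0, 1, 1) != "@" { bad = 1 }
    NR == 2 { seqlen = length($0) }
    NR == 3 && substr($0, 1, 1) != "+" { bad = 1 }
    NR == 4 && length($0) != seqlen { bad = 1 }
    END { if (bad || NR < 4) exit 1 }'
}

# Example:
# check_first_fastq_record your_file.fastq || echo "first record is not valid FASTQ"
```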

Error: "unpaired reads required"

For paired-only data, create a dummy unpaired file (see Workflow 2, Step 2b)

Error: "centrifuge database not found"

Verify that the database path is correct
Run eukfinder install to set up the databases

Need More Help?

Check the FAQ for detailed answers

Last updated: January 2025 (EukFinder v1.2.4)
