Workflows
This page shows you step-by-step what to do with your data. Follow the flowchart that matches your input data type.
EukFinder does NOT accept .gz compressed files (.fastq.gz, .fasta.gz)
Before following any workflow below, decompress your input files:
# Decompress FASTQ files
gunzip *.fastq.gz
gunzip *.fq.gz
# Decompress FASTA files
gunzip *.fasta.gz
gunzip *.fa.gz
# Verify files are uncompressed
file your_file.fastq # Should show "ASCII text" or "data"
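Two things to be aware of with the commands above: `gunzip *.fq.gz` reports an error when no file matches the pattern, and `gunzip` deletes the compressed originals. If either is a problem, a loop along these lines may help. It is only a sketch; the function name `decompress_inputs` is made up for this example and is not part of EukFinder.

```shell
# Decompress every gzipped FASTQ/FASTA file in a directory,
# keeping the .gz originals next to the uncompressed copies.
decompress_inputs() {
  dir="${1:-.}"
  for f in "$dir"/*.fastq.gz "$dir"/*.fq.gz "$dir"/*.fasta.gz "$dir"/*.fa.gz; do
    [ -e "$f" ] || continue        # pattern matched nothing; skip it
    gunzip -c "$f" > "${f%.gz}"    # e.g. sample.fastq.gz -> sample.fastq
    echo "decompressed: ${f%.gz}"
  done
}
```

Calling `decompress_inputs .` in the directory that holds your reads leaves both `sample_R1.fastq.gz` and `sample_R1.fastq` in place.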
Use this workflow if: Your reads are raw from the sequencer and need adapter/quality trimming
    Input: raw paired FASTQ (R1 and R2, uncompressed)
                          ↓
               Need to remove host DNA?
              ↙ (No)              ↘ (Yes)
            RUN:                   RUN:
        read_prep_env            read_prep
          (faster)              (with --hg)
             ↓                       ↓
             └──────────→ Run short_seqs
                               ↓
                    Output: classified contigs
| Question | read_prep | read_prep_env |
|---|---|---|
| Sample type | Clinical, cultured (has host DNA) | Environmental (water, soil, ocean) |
| Examples | Human gut, plant tissue, cultured organism | Ocean water, soil, sediment, compost |
| Command | Requires --hg host_genome.fasta | No host genome needed |
| Speed | Standard speed | 10-30% faster ⚡ |
| Removes host DNA? | YES | NO |
Step 1: Prepare Your Files
Make sure your files are:
- Uncompressed (not .gz)
- Named clearly: sample_R1.fastq, sample_R2.fastq
- In same directory
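The checklist above can be turned into a quick pre-flight test. The following is a minimal sketch, not an EukFinder feature: `check_inputs` is a hypothetical helper, and it compares directories as they are written on the command line, so mixing relative and absolute paths will trip the last check.

```shell
# Hypothetical pre-flight check for a pair of read files.
check_inputs() {
  r1="$1"; r2="$2"
  for f in "$r1" "$r2"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      return 1
    fi
    # gzip -t succeeds only when the data really is gzip-compressed,
    # whatever the file happens to be called.
    if gzip -t < "$f" 2>/dev/null; then
      echo "still compressed: $f" >&2
      return 1
    fi
  done
  if [ "$(dirname "$r1")" != "$(dirname "$r2")" ]; then
    echo "R1 and R2 are in different directories" >&2
    return 1
  fi
  echo "inputs look OK"
}
```

For example, `check_inputs sample_R1.fastq sample_R2.fastq` prints `inputs look OK` when both files pass.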
Step 2a: For Environmental Samples (NO host DNA)
Run read_prep_env (faster option):
eukfinder read_prep_env \
--r1 sample_R1.fastq \
--r2 sample_R2.fastq \
-n 24 \
-i adapters.fa \
-o sample_name

Output files:
- sample_name_R1PT.fq - Paired R1 reads (trimmed)
- sample_name_R2PT.fq - Paired R2 reads (trimmed)
- sample_name_R1unPT.fq - Unpaired reads (trimmed)
- sample_name_centrifuge_P - Centrifuge results (paired)
- sample_name_centrifuge_UP - Centrifuge results (unpaired)
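Before moving on to short_seqs it can be worth confirming that all five files were written. A small sketch, assuming the output names follow the `-o` prefix exactly as listed above (`check_read_prep_outputs` is an illustrative name, not an EukFinder command):

```shell
# Report which of the expected preprocessing outputs exist for a given -o prefix.
# Returns non-zero if any file is missing.
check_read_prep_outputs() {
  prefix="$1"
  missing=0
  for suffix in _R1PT.fq _R2PT.fq _R1unPT.fq _centrifuge_P _centrifuge_UP; do
    if [ -e "${prefix}${suffix}" ]; then
      echo "ok       ${prefix}${suffix}"
    else
      echo "MISSING  ${prefix}${suffix}"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Run it as `check_read_prep_outputs sample_name` from the directory where the preprocessing step was run.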
Step 2b: For Samples WITH Host DNA (clinical, cultured)
Run read_prep with host genome:
eukfinder read_prep \
--r1 sample_R1.fastq \
--r2 sample_R2.fastq \
--hg host_genome.fasta \
-n 24 \
-i adapters.fa \
-o sample_name

Output files: Same as Step 2a above
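Steps 2a and 2b differ only in the subcommand and the `--hg` option, so the choice can be scripted. The sketch below only prints the command for inspection; it does not run EukFinder. `read_prep_cmd` is a hypothetical helper, and the thread count and adapter file are simply copied from the examples above.

```shell
# Print (do not run) the preprocessing command for one sample.
# An optional fourth argument is treated as the host genome and selects read_prep.
read_prep_cmd() {
  r1="$1"; r2="$2"; out="$3"; host="${4:-}"
  if [ -n "$host" ]; then
    echo "eukfinder read_prep --r1 $r1 --r2 $r2 --hg $host -n 24 -i adapters.fa -o $out"
  else
    echo "eukfinder read_prep_env --r1 $r1 --r2 $r2 -n 24 -i adapters.fa -o $out"
  fi
}
```

For example, `read_prep_cmd sample_R1.fastq sample_R2.fastq sample_name` prints the read_prep_env command from Step 2a, and adding `host_genome.fasta` as a fourth argument prints the read_prep command from Step 2b.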
Step 3: Run short_seqs
Use the output files from Step 2a or 2b:
eukfinder short_seqs \
--r1 sample_name_R1PT.fq \
--r2 sample_name_R2PT.fq \
--un sample_name_R1unPT.fq \
--pclass sample_name_centrifuge_P \
--uclass sample_name_centrifuge_UP \
-o sample_name \
-n 24 -z 24 -t T \
--max_m 400 -e 0.05 --pid 50 --cov 30 --mhlen 50

Output: Classified contigs in Eukfinder_results/ folder
Use this workflow if: Your paired-end reads have already been trimmed and cleaned (adapters removed, quality filtered)
    Input: cleaned FASTQ (R1 and R2, uncompressed)
                          ↓
                 Have unpaired reads?
              ↙ (Yes)             ↘ (No)
            RUN:                 CREATE: fake unpaired file
         centrifuge                      ↓
           twice                 Use output with paired file
                                 to run centrifuge twice
             ↓                           ↓
             └─────→ Run short_seqs
                          ↓
               Output: classified contigs
Step 1: Prepare Your Files
Make sure your files are:
- Uncompressed (not .gz)
- Named clearly: sample_R1.fastq, sample_R2.fastq
- In same directory: All input files together
If you have unpaired/orphan reads, prepare that file too:
- File name: sample_unpaired.fastq (uncompressed)
Step 2a: If You HAVE Unpaired Reads
Skip this section and go to Step 3.
Step 2b: If You DON'T Have Unpaired Reads
Create a minimal dummy unpaired reads file:
# Extract the first 200 lines (50 reads) of R1 and treat them as "orphan" reads
head -n 200 sample_R1.fastq > sample_unpaired.fastq
This is a minimal file that satisfies the requirement for unpaired reads. It is named sample_unpaired.fastq so that it matches the commands in Steps 3 and 4.
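Because a FASTQ record spans four lines, the number of lines copied should be a multiple of four; otherwise the dummy file ends in a truncated read. A slightly more defensive version of the step above might look like this. It is a sketch that assumes plain 4-line records (no wrapped sequences), and `make_dummy_unpaired` is a name invented for this example.

```shell
# Copy the first N complete FASTQ records (4 lines each) from R1
# into a dummy unpaired file. N defaults to 50.
make_dummy_unpaired() {
  r1="$1"; out="$2"; nreads="${3:-50}"
  head -n "$((nreads * 4))" "$r1" > "$out"
  lines=$(( $(wc -l < "$out") ))
  # A truncated record leaves a line count that is not a multiple of 4.
  if [ "$lines" -eq 0 ] || [ $((lines % 4)) -ne 0 ]; then
    echo "warning: $out has $lines lines (not whole FASTQ records)" >&2
    return 1
  fi
  echo "$out: $((lines / 4)) reads"
}
```

Usage: `make_dummy_unpaired sample_R1.fastq sample_unpaired.fastq`.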
Step 3: Run centrifuge
Run centrifuge twice, once on the paired reads and once on the unpaired reads (your real unpaired file, or the dummy file from Step 2b):
# For paired reads
centrifuge -x /path/to/centrifuge_db \
-1 sample_R1.fastq \
-2 sample_R2.fastq \
-S sample_centrifuge_P
# For unpaired reads
centrifuge -x /path/to/centrifuge_db \
-U sample_unpaired.fastq \
-S sample_centrifuge_UP
Step 4: Run short_seqs
eukfinder short_seqs \
--r1 sample_R1.fastq \
--r2 sample_R2.fastq \
--un sample_unpaired.fastq \
--pclass sample_centrifuge_P \
--uclass sample_centrifuge_UP \
-o sample_name \
-n 24 -z 24 -t T \
--max_m 400 -e 0.05 --pid 50 --cov 30 --mhlen 50

Output: Classified contigs in Eukfinder_results/ folder
Use this workflow if: You have PacBio/Nanopore long reads or pre-assembled contig sequences
Input: Long reads or assembled contigs
(FASTA or FASTQ, uncompressed)
↓
Run long_seqs
↓
Output: classified contigs
Step 1: Prepare Your File
Make sure your file is:
- Uncompressed (not .gz)
- Named clearly: contigs.fasta or long_reads.fastq
- Valid format: FASTA or FASTQ
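A quick way to confirm the format is to look at the first character of the file: FASTA headers start with `>` and FASTQ headers start with `@`. The helper below is an illustrative sketch (`detect_seq_format` is not an EukFinder command) and only inspects that first byte, so it will not catch problems further into the file.

```shell
# Guess the sequence format from the first character of the file.
detect_seq_format() {
  first=$(head -c 1 "$1")
  case "$first" in
    '>') echo fasta ;;
    '@') echo fastq ;;
    *)   echo unknown ;;   # compressed or otherwise unexpected content
  esac
}
```

A file that is still gzip-compressed is reported as `unknown`, which is a hint to decompress it first.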
Step 2: Run long_seqs
eukfinder long_seqs \
-l contigs.fasta \
-o sample_name \
-n 24 -z 24 -t T \
--max_m 400 -e 0.05 --pid 50 --cov 30 --mhlen 50

Output: Classified contigs in Eukfinder_results/ folder
| Data Type | Primary Command | Notes |
|---|---|---|
| Raw paired reads (no host) | read_prep_env | Faster, environmental samples |
| Raw paired reads (with host) | read_prep --hg | Requires host genome file |
| Pre-cleaned paired reads | short_seqs | Skip preprocessing |
| Long reads / contigs | long_seqs | Single-step workflow |
| File | What it contains |
|---|---|
| *_R1PT.fq | Paired-end R1 reads (trimmed) |
| *_R2PT.fq | Paired-end R2 reads (trimmed) |
| *_R1unPT.fq | Orphan/unpaired reads (trimmed) |
| *_centrifuge_P | Centrifuge classification (paired) |
| *_centrifuge_UP | Centrifuge classification (unpaired) |
Found in Eukfinder_results/ folder:
| File | What it contains |
|---|---|
| scf_output.Euk.fasta | High-confidence eukaryotic contigs |
| scf_output.Unk.fasta | Unclassified contigs |
| scf_output.EUnk.fasta | Combined Euk + Unk (permissive) |
| scf_output.Bact.fasta | Bacterial contigs |
| scf_output.Arch.fasta | Archaeal contigs |
| summary_table.txt | Summary statistics |
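To get a quick overview of how many contigs ended up in each category, the FASTA headers in each result file can be counted. This is a sketch using standard tools only; `count_result_seqs` is a made-up name, and the file names are the ones listed in the table above.

```shell
# Count FASTA records (header lines starting with '>') in each result file.
count_result_seqs() {
  dir="${1:-Eukfinder_results}"
  for name in Euk Unk EUnk Bact Arch; do
    f="$dir/scf_output.$name.fasta"
    if [ -f "$f" ]; then
      # grep -c exits non-zero when the count is 0, so ignore its status.
      n=$(grep -c '^>' "$f" || true)
      printf '%s\t%s\n' "$name" "$n"
    else
      printf '%s\t%s\n' "$name" "missing"
    fi
  done
}
```

Called without arguments it looks in `Eukfinder_results/` under the current directory and prints one tab-separated line per category.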
- All input files MUST be uncompressed
- Decompress using: gunzip *.fastq.gz
- Centrifuge database required for all workflows
- Download databases using: eukfinder install
- -n: Number of threads (adjust to your system)
- -z: Number of chunks for assembly
- -t: Assembly type (T = automatic)
- --mhlen: Minimum hit length (default 50)
- --pid: Percent identity threshold
- --cov: Coverage threshold
- -e: E-value threshold
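The examples on this page use `-n 24`, which will oversubscribe a machine with fewer cores. One way to pick a value for `-n` is to ask the system how many cores are available. The snippet is a sketch; `detect_threads` is a hypothetical helper, and on a shared cluster you would normally use the core count granted by the scheduler instead.

```shell
# Print the number of available CPU cores, falling back to 1
# if it cannot be determined.
detect_threads() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  elif n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) && [ -n "$n" ]; then
    echo "$n"
  else
    echo 1
  fi
}
```

The result can then be passed as `-n "$(detect_threads)"` in place of `-n 24`.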
Error: "File is not FASTQ format"
- Check if file is compressed (.gz) and decompress it
- Verify file format: head -2 your_file.fastq (the first line should start with @)
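For a more thorough check than looking at the first two lines, the whole file can be scanned for the basic FASTQ layout: a header starting with `@`, a separator starting with `+`, and a total line count that is a multiple of four. The function below is a sketch under the assumption of 4-line records (no wrapped sequences); `fastq_sanity` is an illustrative name, not part of EukFinder.

```shell
# Basic FASTQ sanity check, assuming 4-line records.
# Prints "OK: N reads" on success and returns non-zero on the first problem found.
fastq_sanity() {
  awk '
    NR % 4 == 1 && substr($0, 1, 1) != "@" {
      printf "line %d: header does not start with @\n", NR; bad = 1; exit
    }
    NR % 4 == 3 && substr($0, 1, 1) != "+" {
      printf "line %d: separator does not start with +\n", NR; bad = 1; exit
    }
    END {
      if (bad) exit 1
      if (NR == 0 || NR % 4 != 0) {
        printf "%d lines: not a multiple of 4\n", NR; exit 1
      }
      printf "OK: %d reads\n", NR / 4
    }
  ' "$1"
}
```

Usage: `fastq_sanity your_file.fastq`.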
Error: "unpaired reads required"
- For paired-only data, create a dummy unpaired file (see Step 2b)
Error: "centrifuge database not found"
- Verify database path is correct
- Run: eukfinder install to set up databases
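To verify the database path, keep in mind that the value given to `centrifuge -x` is an index basename rather than a single file. The sketch below assumes the usual Centrifuge naming, where the index is stored as `<basename>.1.cf`, `<basename>.2.cf`, and so on; where `eukfinder install` places these files is not covered here, and `check_centrifuge_index` is a hypothetical helper.

```shell
# Check that a Centrifuge index basename (the value passed to -x)
# has at least one matching .cf index file.
check_centrifuge_index() {
  base="$1"
  set -- "$base".*.cf          # expand the glob into the positional parameters
  if [ -e "$1" ]; then
    echo "found $# index file(s) for $base"
  else
    echo "no .cf index files found for $base" >&2
    return 1
  fi
}
```

Usage: `check_centrifuge_index /path/to/centrifuge_db`.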
- Check the FAQ for detailed answers
Last updated: January 2025 EukFinder v1.2.4