A comprehensive annotation pipeline tailored for eukaryotic genomes, particularly those from less well-studied organisms. The pipeline integrates evidence-based gene prediction, homology mapping, and functional annotation to produce genome annotations that can be easily passed to downstream submission tools.
- Genome Annotation: Combines ab initio gene prediction with homology-based evidence to annotate protein-coding genes in eukaryotic genomes.
- Evidence Integration: Incorporates protein alignments, transcript assemblies, and RNA-seq hints for accurate gene models.
- Transcript Assembly: Preprocesses RNA-seq data into transcript assemblies for improved annotation accuracy.
- Functional Annotation: Annotates predicted proteins with functional information from UniProt-SwissProt and Pfam databases.
- Flexible Configuration: Supports various kingdoms (fungus, protist, animal, plant) and custom genetic codes.
- Docker Deployment: Simplifies installation and usage through containerization.
The pipeline requires Docker for isolated and reproducible execution. Before building the image, obtain a license for GeneMark-ES/ET/EP+ from topaz.gatech.edu/GeneMark/license_download.cgi.
git clone https://github.com/BFL-lab/eukan.git
cd eukan
docker build -t eukan -f Dockerfile .
- Docker
- GeneMark-ES/ET/EP+ license
- For functional annotation: HMMER suite, Python 3 with biopython and other dependencies
Use the provided annot-docker script as a wrapper to run the pipeline inside the Docker container.
# Display help
./annot-docker eukan -h
The following example assumes that the upstream script transcriptome_assembly.sh (described below) was used to reconstruct transcripts from RNA-Seq reads and align them to the genome, and that the hints file the script generates from RNA-Seq coverage, transcript alignments, and intron coordinates is available.
./annot-docker eukannotpass -g genome.fasta -p uniprot_sprot.fasta -tf nr_transcripts.fasta -tg nr_transcripts.gff3 -r hints_rnaseq.gff --utrs pasa.sqlite --protist
usage: eukan [-h] --genome genome.fasta --proteins PROTEINS [PROTEINS ...] [--transcriptsFasta transcriptassembly.fasta] [--transcriptsGFF transcriptassembly.gff3] [--rnaseq_hints hints.gff] [--existing_augustus species]
             [--strand_specific_transcripts] [--numcpu N] [--weights x y [z] [x y [z] ...]] [--code CODE] [--utrs UTRS] [--fungus] [--protist] [--animal] [--plant]
Annotates a eukaryotic genome.
required arguments:
  --genome genome.fasta
                        REQUIRED. Genome sequence in Fasta format. Ensure no lower-case nucleotides; the pipeline soft-masks repeats by converting to lower-case.
  --proteins PROTEINS [PROTEINS ...]
                        REQUIRED. One or more protein sequence Fasta files, separated by spaces.
optional arguments:
  --transcriptsFasta transcriptassembly.fasta
                        Assembled transcripts in Fasta format.
  --transcriptsGFF transcriptassembly.gff3
                        Assembled transcripts in GFF3 format.
  --rnaseq_hints hints.gff
                        GFF hints file generated from RNA-seq alignment.
  --existing_augustus species
                        Use pre-trained AUGUSTUS species parameters.
  --strand_specific_transcripts
                        Specify that assembled transcripts are strand-oriented.
  --numcpu N            Number of CPU threads to use (default: all available).
  --weights x y [z] [x y [z] ...]
                        Weights for scoring evidence sources: protein alignments, gene predictors, transcript assembly (if provided).
                        Default: 1 2, plus 10 if transcript assembly is included.
  --code CODE           Genetic code (see NCBI taxonomy utils).
  --utrs UTRS           PASA SQLite database path for adding UTRs.
  --fungus              Tune parameters for fungal genomes.
  --protist             Tune parameters for protist genomes.
  --animal              Tune parameters for animal genomes.
  --plant               Tune parameters for plant genomes.
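Because the help text asks for a genome without lower-case nucleotides, it can be useful to normalise the Fasta file before running the pipeline. The following is a minimal sketch, not part of eukan: the function name `uppercase_fasta` is hypothetical, and it assumes a plain (uncompressed) Fasta file in which header lines start with `>`. Note that upper-casing discards any soft-masking already present in the file.

```python
from typing import Iterable, Iterator


def uppercase_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield Fasta lines with sequence lines upper-cased.

    Header lines (starting with '>') are passed through unchanged so that
    sequence identifiers and descriptions keep their original case.
    """
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield line.upper()


# Small in-memory demonstration; a real run would iterate over an open file.
example = [">contig_1 example\n", "acgtNNacgt\n", "ACgtac\n"]
cleaned = list(uppercase_fasta(example))
```

To process a file, the same generator can be fed an open file handle and its output written with `writelines` to a new file, for example `genome.upper.fasta` (a hypothetical name).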
Prepare RNA-seq data for input using the transcriptome_assembly.sh script. This handles read mapping, assembly, and alignment to produce input files for the main pipeline.
# Display help
transcriptome_assembly.sh -h
Usage: transcriptome_assembly.sh [OPTIONS] <ARGS>
[OPTIONS] and corresponding <ARGS> are:
Either paired-end:
[-l] <left reads>
[-r] <right reads>
or single-end:
[-s] <single-end reads>
[-m] <min intron length> # default 20
[-M] <max intron length> # default 5000
[-g] <genome fasta>
[-p] <phred quality score (33 for MISEQ, 64 for HISEQ)> # default 33
[-n] <number of CPUs> # default MAX
[-S] <specify strand-specific assembly, either RF or FR> # default off
[-A] <switch on read mapping>
[-E] <switch to extract reads>
[-T] <switch on Trinity assembly>
[-e] <switch on StringTie assembly>
[-P] <switch on PASA alignment>
[-c] <genetic code according to ncbi table>
[-h] Display this help message
[-j] switch on jaccard clipping (for gene-dense organisms and high coverage data)
[-t] <EndToEnd/Local> # default Local
# Assemble paired-end reads with Trinity
transcriptome_assembly.sh -l left_reads.fastq -r right_reads.fastq -g genome.fasta -M 10000 -S RF -A -T -P
This run maps the reads to the genome, assembles transcripts de novo with Trinity, and then aligns the assembly with PASA for evidence integration.
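When the assembly step is driven from another script, it may help to build the argument list programmatically and validate it first. The sketch below is an illustration only: `build_assembly_command` is a hypothetical helper, it uses only flags listed in the usage text above, and it assumes that read mapping (`-A`) is always wanted, as in the example command.

```python
import shlex
from typing import List, Optional


def build_assembly_command(
    genome: str,
    left: Optional[str] = None,
    right: Optional[str] = None,
    single: Optional[str] = None,
    max_intron: Optional[int] = None,
    strand: Optional[str] = None,
    trinity: bool = True,
    pasa: bool = True,
) -> List[str]:
    """Return an argv list for transcriptome_assembly.sh (hypothetical helper)."""
    paired = left is not None and right is not None
    if single is not None and (left is not None or right is not None):
        raise ValueError("give either paired-end (-l/-r) or single-end (-s) reads")
    if single is None and not paired:
        raise ValueError("paired-end input needs both left and right reads")
    if strand not in (None, "RF", "FR"):
        raise ValueError("strand must be 'RF' or 'FR'")

    cmd = ["transcriptome_assembly.sh"]
    if single is not None:
        cmd += ["-s", single]
    else:
        cmd += ["-l", left, "-r", right]
    cmd += ["-g", genome]
    if max_intron is not None:
        cmd += ["-M", str(max_intron)]
    if strand is not None:
        cmd += ["-S", strand]
    cmd.append("-A")  # assumption: read mapping is always switched on
    if trinity:
        cmd.append("-T")
    if pasa:
        cmd.append("-P")
    return cmd


cmd = build_assembly_command(
    "genome.fasta",
    left="left_reads.fastq",
    right="right_reads.fastq",
    max_intron=10000,
    strand="RF",
)
command_line = shlex.join(cmd)  # printable form; requires Python 3.8+
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the script is on the `PATH`.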
The functional-annotation directory contains scripts to add functional information to predicted proteins using similarity searches against UniProt-SwissProt and Pfam databases.
- Databases: Prepare UniProt-SwissProt and Pfam databases (or use defaults if available).
- Dependencies: Install HMMER, Python 3, and required packages:
  - Python packages: biopython, requests, gffutils (see requirements.txt)

  cd functional-annotation
  pip install -r requirements.txt
- Get Databases: Use db-fetch.py to download and format the latest databases:

  python db-fetch.py

  This will download:
  - uniprot_sprot.faa: UniProt-SwissProt protein sequences
  - Pfam-A.hmm: Pfam HMM profiles (pressed for hmmscan)
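Before running the annotation step, a quick check that the databases are in place can save a failed run. This is a rough sketch under two assumptions that the text above does not confirm: that db-fetch.py writes both files into a single directory, and that "pressed" means the four index files that `hmmpress` creates next to the HMM file (`.h3f`, `.h3i`, `.h3m`, `.h3p`). The helper name `missing_database_files` is hypothetical.

```python
from pathlib import Path
from typing import List

# Index files written by hmmpress alongside the HMM database.
PRESSED_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_database_files(db_dir: str) -> List[str]:
    """Return the names of expected database files absent from db_dir."""
    base = Path(db_dir)
    expected = [base / "uniprot_sprot.faa", base / "Pfam-A.hmm"]
    expected += [base / ("Pfam-A.hmm" + suffix) for suffix in PRESSED_SUFFIXES]
    return [path.name for path in expected if not path.is_file()]
```

An empty list means everything the sketch looks for is present; otherwise the returned names show what still needs to be downloaded or pressed.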
The func-annot script runs phmmer against UniProt and hmmscan against Pfam to annotate protein sequences. Results are appended to Fasta headers or GFF3 attributes.
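For orientation, the two HMMER searches can be approximated by hand. The sketch below is not func-annot's actual invocation, which the text does not show; it only builds standard `phmmer` and `hmmscan` command lines using options that exist in HMMER 3 (`--tblout`, `--domtblout`, `-E`, `--cpu`). The helper names, output file names, and default values are assumptions chosen for illustration.

```python
from typing import List


def phmmer_command(proteins: str, uniprot: str, tblout: str,
                   evalue: float = 1e-5, cpus: int = 4) -> List[str]:
    """argv for a protein-vs-protein search: phmmer [options] <seqfile> <seqdb>."""
    return ["phmmer", "--tblout", tblout, "-E", str(evalue),
            "--cpu", str(cpus), proteins, uniprot]


def hmmscan_command(proteins: str, pfam: str, domtblout: str,
                    evalue: float = 1e-5, cpus: int = 4) -> List[str]:
    """argv for a protein-vs-profile search: hmmscan [options] <hmmdb> <seqfile>."""
    return ["hmmscan", "--domtblout", domtblout, "-E", str(evalue),
            "--cpu", str(cpus), pfam, proteins]


# Hypothetical file names for the tabular outputs.
sprot_cmd = phmmer_command("input.faa", "uniprot_sprot.faa", "sprot.tbl")
pfam_cmd = hmmscan_command("input.faa", "Pfam-A.hmm", "pfam.domtbl")
```

Note the argument order: `phmmer` takes the query sequences first and the target database second, while `hmmscan` takes the profile database first and the query sequences second.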
func-annot --proteins input.faa [--uniprot uniprot_sprot.faa] [--pfam Pfam-A.hmm] [--gff3 input.gff3] [--numcpu N] [--evalue 1e-5]
- --proteins PROTEINS, -p PROTEINS: Amino acid sequences in Fasta format (required).
- --uniprot uniprot_sprot.faa: UniProt-SwissProt database (default: /share/unsupported/databases/uniprot_sprot/uniprot_sprot.faa).
- --pfam Pfam-A.hmm: Pfam HMM database (default: /share/unsupported/databases/Pfam/35.0/Pfam-A.hmm).
- --gff3 gene_models.gff3: Optional GFF3 file to annotate with functional information.
- --numcpu N, -n N: Number of CPUs (default: all).
- --evalue Me-N, -e Me-N: E-value cutoff (default: 1e-1; marginal hits: 1e-3 to 1e-1).
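The E-value ranges in the --evalue description can be read as three bands. The following sketch shows one possible interpretation; the labels `confident`, `marginal`, and `rejected` and the function `classify_hit` are hypothetical, and how func-annot itself reports marginal hits is not specified here.

```python
def classify_hit(evalue: float, cutoff: float = 1e-1,
                 marginal_floor: float = 1e-3) -> str:
    """Place a hit in a band relative to the cutoff (illustrative labels).

    - above the cutoff: the hit is discarded
    - between marginal_floor and the cutoff (inclusive): a marginal hit
    - below marginal_floor: a confident hit
    """
    if evalue > cutoff:
        return "rejected"
    if evalue >= marginal_floor:
        return "marginal"
    return "confident"


labels = [classify_hit(e) for e in (1e-10, 5e-2, 0.5)]
```

With a stricter cutoff such as `--evalue 1e-5`, the marginal band disappears under this reading, because every hit that passes the cutoff already lies below 1e-3.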
- Annotated Fasta: input.mod.faa with functional descriptions in headers.
- Optional: input.mod.gff3 with added product and inference attributes.
# Run functional annotation pipeline and append information to fasta headers
func-annot -p input.faa
# Apply a stricter e-value cutoff, append functional info to the Fasta headers, and update the attributes column of the corresponding GFF3 with annotations that table2asn can read
func-annot -p input.faa --evalue 1e-5 --gff3 input.gff3