The dataset we will be working are the practice dataset from the H3ABioNet 16S rDNA diversity analysis SOP. The source data can be accessed here but for our purposes it is already on the cluster and stored here:/scratch/DB/bio/training/16SrRNA/dog_stool_samples
The table below contains the metadata associated with the dog stool samples. There are three dogs which are treated with increased percentage of a compound in their diet: 5 different treatments (0-4, representing an increased percentage of a compound in their diet).
| Sample | Dog | Treatment | Read counts r1 | Read counts r2 |
|---|---|---|---|---|
| Dog1 | B | 2 | 118343 | 118343 |
| Dog2 | G | 3 | 108679 | 108679 |
| Dog3 | K | 3 | 101482 | 101482 |
| Dog8 | B | 4 | 108731 | 108731 |
| Dog9 | G | 0 | 109500 | 109500 |
| Dog10 | K | 4 | 79342 | 79342 |
| Dog15 | B | 1 | 131483 | 131483 |
| Dog16 | G | 4 | 114424 | 114424 |
| Dog17 | K | 0 | 99610 | 99610 |
| Dog22 | B | 3 | 145029 | 145029 |
| Dog23 | G | 1 | 193158 | 193158 |
| Dog24 | K | 2 | 162487 | 162487 |
| Dog29 | B | 0 | 122776 | 122776 |
| Dog30 | G | 2 | 137315 | 137315 |
| Dog31 | K | 1 | 150613 | 150613 |
- Edit files and command line options, work with bash variables, use bash snippets (pipes, awk, loops).
- Run a 16S analysis pipeline from raw reads up to OTU classification and alignment.
- Once done with this tutorial you can continue with the R downstream analysis tutorial using the data generated in this tutorial.
- All the outputs have been generated here
/scratch/DB/bio/training/16SrRNA/16SrRNA-hex-tutorial/results - Please find someone next to you that looks like they know what they are doing.
- Let me know.
This tutorial should be run in an interactive session. Please do not runnning anything on the headnode.
qsub -I -q UCTlong -l walltime=08:00:00Once you are on a compute node you will see that the prompt changes from @srvslshpc001 to @srvslshpc60X e.g.
gerrit@srvslshpc001:~> qsub -I -q UCTlong -l walltime=08:00:00
qsub: waiting for job 1598565.srvslshpc001 to start
qsub: job 1598565.srvslshpc001 ready
gerrit@srvslshpc601:~> hostname
srvslshpc601source /scratch/DB/bio/training/16SrRNA/16SrRNA-hex-tutorial/config/activate_soft.shNow set some variables. For the process_dir set replace hpc30 with the name that has been given to you in the class
raw_reads_dir=/scratch/DB/bio/training/16SrRNA/dog_stool_samples
process_dir=/researchdata/fhgfs/hpc30
uparse_dir=$process_dir/uparse
taxonomy_dir=$process_dir/tax
alignment_dir=$process_dir/align
greengenes_db=/scratch/DB/bio/qiime/greengenes/gg_13_8_otus
gold_db=/scratch/DB/bio/qiime/uchime/gold.fa
sid_fastq_pair_list=/scratch/DB/bio/training/16SrRNA/16SrRNA-hex-tutorial/sid.fastq_pair.listfastqc_dir=$process_dir/fastqc
mkdir $fastqc_dir
fastqc --extract -f fastq -o $fastqc_dir -t 1 $raw_reads_dir/*fastqc_combine.pl -v --out $fastqc_dir --skip --files "$fastqc_dir/*_fastqc"Now lets view the reports.
renamed_dir=$uparse_dir"/renamed"
mkdir -p $renamed_dir
while read sid_fastq_pair; do sid=`echo $sid_fastq_pair | awk -F ' ' '{print $1}'`; fastq_r1=`echo $sid_fastq_pair | awk -F ' ' '{print $2}'`; fastq_r2=`echo $sid_fastq_pair | awk -F ' ' '{print $3}'`; fastq_r1_renamed=$renamed_dir"/"$(basename $fastq_r1); fastq_r2_renamed=$renamed_dir"/"$(basename $fastq_r2); rename_fastq_headers.sh $sid $fastq_r1 $fastq_r2 $fastq_r1_renamed $fastq_r2_renamed;done < $sid_fastq_pair_listThis will take about 10 minutes to run. Lets have a look at the headers once done.
fastq_maxdiffs=3
merged_dir=$uparse_dir"/merged"
mkdir $merged_dir
while read sid_fastq_pair; do sid=`echo $sid_fastq_pair | awk -F ' ' '{print $1}'`; fastq_r1=`echo $sid_fastq_pair | awk -F ' ' '{print $2}'`; fastq_r2=`echo $sid_fastq_pair | awk -F ' ' '{print $3}'`; fastq_r1_renamed=$renamed_dir"/"$(basename $fastq_r1); fastq_r2_renamed=$renamed_dir"/"$(basename $fastq_r2); usearch9 -fastq_mergepairs $fastq_r1_renamed -reverse $fastq_r2_renamed -fastq_maxdiffs $fastq_maxdiffs -fastqout $merged_dir"/"$sid".merged.fastq";done < $sid_fastq_pair_listThis will take about 1 minute to run. Lets have a look at the fastq files of the merge reads.
fastq_maxee=0.1
filtered_dir=$uparse_dir"/filtered"
mkdir $filtered_dir
while read sid_fastq_pair; do sid=`echo $sid_fastq_pair | awk -F ' ' '{print $1}'`; usearch9 -threads 1 -fastq_filter $merged_dir"/"$sid".merged.fastq" -fastq_maxee $fastq_maxee -fastqout $filtered_dir"/"$sid".merged.filtered.fastq" ;done < $sid_fastq_pair_listThis will take about 1 minute to run. Lets do a read count on the filtered fastqs.
filtered_fastqc_dir=$uparse_dir"/filtered.fastqc"
mkdir $filtered_fastqc_dir
fastqc --extract -f fastq -o $uparse_dir"/filtered.fastqc" -t 1 $filtered_dir/*.fastqThis will take about 2 minutes to run.
fastqc_combine.pl -v --out $filtered_fastqc_dir --skip --files "$filtered_fastqc_dir/*_fastqc"Lets have a look at the FastQC summaries and see if we notice any changes from the FastQC reports on the raw reads.
filtered_fasta_dir=$uparse_dir"/filtered.fasta"
mkdir $filtered_fasta_dir
for i in `ls -1 $filtered_dir/*.fastq`; do filename=$(basename "$i"); base="${filename%.*}"; seqtk seq -A $i > $filtered_fasta_dir/$base.fa; doneLets have a look at the fasta format.
cat $filtered_fasta_dir/*.fa > $uparse_dir/filtered_all.facat $uparse_dir/filtered_all.fa | grep -v "^>" | grep -v [^ACGTacgt] | sort -d | uniq -c | while read abundance sequence ; do hash=$(printf "${sequence}" | sha1sum); hash=${hash:0:40};printf ">%s;size=%d;\n%s\n" "${hash}" "${abundance}" "${sequence}"; done > $uparse_dir/filtered_all.uniques.fa 2> $uparse_dir/filtered_all.uniques.fa.eThis will take about 15 minutes. Lets have a look at the headers.
Sort by size
min_size=2
usearch9 -sortbysize $uparse_dir/filtered_all.uniques.fa -fastaout $uparse_dir/filtered_all.uniques.sorted.fa -minsize $min_sizeDo OTU picking
otu_radius_pct=3
usearch9 -cluster_otus $uparse_dir/filtered_all.uniques.sorted.fa -otu_radius_pct $otu_radius_pct -otus $uparse_dir/otus_raw.faThis will take about 30 seconds. Once done lets count how many OTUs were generated.
usearch9 -threads 1 -uchime2_ref $uparse_dir/otus_raw.fa -db $gold_db -mode high_confidence -strand plus -notmatched $uparse_dir/otus_chimOUT.faThis will take about 10 seconds. Once done lets check how many OTUs were detected as being chimeric.
Rename OTU headers
fasta_number.py $uparse_dir/otus_chimOUT.fa OTU_ > $uparse_dir/otus_repsetOUT.faSplit fasta files to reduce memory on usearch_global run
mkdir $uparse_dir/split_files
cd $uparse_dir/split_files
fasta-splitter.pl --n-parts 100 $uparse_dir/filtered_all.faDo de-dereplication
for i in $(ls $uparse_dir/split_files/*.fa); do usearch9 -usearch_global $i -db $uparse_dir/otus_repsetOUT.fa -id 0.97 -strand plus -uc $i.map.uc; doneCombine mappings
cat $uparse_dir/split_files/*.map.uc > $uparse_dir/otus_mappedOUT.ucThis will take about 1 minute to complete. Lets have a look at the final mapping file.
uc2otutab.py $uparse_dir/otus_mappedOUT.uc > $uparse_dir/otus_table.tab.txtThis will take about 20 seconds to complete. Have a look that the OTU table generated.
mkdir $taxonomy_dir
assign_taxonomy.py -i $uparse_dir/otus_repsetOUT.fa -o $taxonomy_dir -r $greengenes_db/rep_set/97_otus.fasta -t $greengenes_db/taxonomy/97_otu_taxonomy.txt -m uclustThis will take about a minutee to complete. Let look at the GreenGenes files and also the final output.
For downstream analysis we need a .biom file. Lets create that from the OTU table.
biom convert -i $uparse_dir/otus_table.tab.txt --table-type="OTU table" -o $process_dir/otus_table.biomNow lets add the taxonomy annotation to the .biom file.
biom add-metadata -i $process_dir/otus_table.biom -o $process_dir/otus_table.tax.biom --observation-metadata-fp $taxonomy_dir/otus_repsetOUT_tax_assignments.txt --observation-header OTUID,taxonomy,confidence --sc-separated taxonomy --float-fields confidenceLets have a look if the annotation has been added.
Allign sequences against a template database
mkdir $alignment_dir
align_seqs.py -m pynast -i $uparse_dir/otus_repsetOUT.fa -o $alignment_dir -t $greengenes_db/rep_set_aligned/97_otus.fastaThis will take about 5 minutes to complete. Have a look at the output.
Now filter the alignment to remove gaps.
filter_alignment.py -i $alignment_dir/otus_repsetOUT_aligned.fasta -o $alignment_dir/filteredCreate a phylogenetic tree
make_phylogeny.py -i $alignment_dir/filtered/otus_repsetOUT_aligned_pfiltered.fasta -o $process_dir/otus_repsetOUT_aligned_pfiltered.tre
biom summarize-table -i $process_dir/otus_table.tax.biom -o $process_dir/otus_table.tax.biom.summary.quantativebiom summarize-table --qualitative -i $process_dir/otus_table.tax.biom -o $process_dir/otus_table.tax.biom.summary.qualitative- FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
- ImageMagick (https://www.imagemagick.org/script/download.php)
- Fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
- Fasta-splitter (http://kirill-kryukov.com/study/tools/fasta-splitter/files/fasta-splitter-0.2.4.zip)
- USEARCH (http://www.drive5.com/usearch/download.html)
- UPARSE helper scripts (http://drive5.com/python/python_scripts.tar.gz)
- QIIME base (http://qiime.org/install/alternative.html)
