Authors:
- Software: Karitskaya Polina, Kozlova Polina,
Institute of Bioinformatics, Saint Petersburg, Russia.
- Clone the repository to your local machine using Git or download the Snakefile from the GitHub repository.
git clone git@github.com:Cinnamonness/Bioinf-practice-BI.git
cd Bioinf-practice-BI- Create working directory For example:
mkdir practice2
cd practice2-
Make sure the Snakefile is in your working directory. In this example, the Snakefile should be in the practice2 directory.
-
Ensure that you have downloaded the reference sequence.fasta. It should be in your working directory.
-
Ensure you have installed VarScan.v2.3.9.jar. It should be in your working directory.
-
Change the absolute path in Snakefile (it is the first string). To know your absolute path you can use command:
pwd- install snakemake:
pip install snakemake - Run the Snakefile using this command from your working directory:
snakemake -p --cores all This project includes the following features:
-
Data Download:
- Downloads FastQ files from the SRA FTP server for different samples (SRR1705851, SRR1705858, SRR1705859, and SRR1705860).
-
Reference Indexing:
- The reference genome (
sequence.fasta) is indexed using BWA to generate required index files for alignment.
- The reference genome (
-
Alignment and Sorting:
- Aligns FastQ files to the reference genome using BWA.
- Sorts the resulting BAM files using Samtools for downstream processing.
-
MPileup Generation:
- Generates MPileup files from the aligned BAM files using Samtools, which are used for variant calling.
-
Variant Calling:
- Two levels of variant calling are supported:
- Common Variants: Calls variants with a minimum frequency of 95% using VarScan.
- Rare Variants: Calls variants with a minimum frequency of 0.1% using VarScan.
- Two levels of variant calling are supported:
-
Parallel Processing:
- Processes multiple samples in parallel using Snakemake's expand functionality, ensuring that the workflow runs efficiently for multiple samples (SRR1705858, SRR1705859, SRR1705860).
-
VCF Output:
- Generates VCF files containing the variant calls for each sample, separated into two files: one for common variants and another for rare variants.
-
Sample-Specific Results:
- Each sample's analysis (alignment, variant calling, etc.) is handled separately, allowing for easy scalability to more samples by adding them to the Snakefile.
-
Compatibility with VarScan:
- Integration with VarScan for variant calling ensures high-quality SNP calling and variant analysis from high-depth sequencing data.
-
Efficient Data Management:
- All output files are neatly organized in the working directory, ensuring proper file management and easy retrieval of results.
-
snakemake 8.25.3
-
bwa 0.7.17-r1188
-
samtools 1.13
-
VarScan.v2.3.9