We developed a R based command-line computational tool sdMAF - a robust and easy-to-use software for analysis of sex differences in allele frequencies. The initial aim was to scale the analysis in the “Major sex differences in allele frequencies for X chromosomal variants in both the 1000 Genomes Project and gnomAD”1 to the genomic data of the UK Biobank2. Resulting manuscript "Comprehensive whole-genome analyses of the UK Biobank reveal significant sex differences in both genotype missingness and allele frequency on the X chromosome". It handled these data sets with ease, and significantly reduced analysis time compared to running the pipeline manually. With an extra layer of bash scripting, sdMAF can batch process the entire genome by chromosome in a parallel fashion. Users provide PLINK readable genotype calls files; subsequently, sdMAF does input checking, filtering, computing, and detailed logging.
sdMAF accepts a mixture of Pseudo-Autosomal Region (or Autosomal)/Non-Pseudo-Autosomal Region SNPs and automatically assigns the appropriate statistical test during computing. With the help of argparse R package3, sdMAF was able to provide a user-friendly command-line interface. Each argument is clearly explained using the help option.
Details of the method can be found here.
Copyright (Chen et al.,2022)
Citation: Chen DZ, Roshandel D, Wang Z, Sun L, Paterson AD, Comprehensive whole-genome analyses of the UK Biobank reveal significant sex differences in both genotype missingness and allele frequency on the X chromosome, Human Molecular Genetics, 2023;, ddad201, https://doi.org/10.1093/hmg/ddad201
Report bugs to zeya [dot] chen [at] sickkids [dot] ca.
usage: sdMAF -f <filename> -m <filename> [-h] [--version]
[--bim <filename>] [-o <filename>] [-l <filename>]
[--multi-allelic] [--sex-specific] [--mac <minimum count>]
Only gcount files required.
-f <filename>, --female <filename>
Female genotype count file produced by PLINK.
-m <filename>, --male <filename>
Male genotype count file produced by PLINK.
-h, --help show this help message and exit
--version Print the version of this tool and exit.
--bim <filename> Input bim file address to extract base pair
position.Optional if ID in .gcount file are all
chr:bp:a1:a2.
-o <filename>, --out <filename>
Specify output file name and address. Default
autosomal.sdMAF.
-l <filename>, --log <filename>
Log file name and address. Default 'YOURINPUTin--
out'_sdMAF.log.
--multi-allelic Indicate whether to keep multi-allelic SNPs in the
results or not. Default FALSE.
--sex-specific Include to use sex specific minimum allele count
filter for both males and females.
--mac <minimum count>
Sex combined minimum allele count filter. Variant with
minor allele count less than input will be filtered
out. Default 5.
- CHROM - Chromosome Code
- ID - Variant ID
- A1 - Reference Allele
- A2 - Alternative Allele
- BP - Base-pair Postion from genotype calls files
- Mmissing - Number of missing genotypes per SNP in males
- Fmissing - Number of missing genotypes per SNP in female
- F_A1A1 - Female Homozygous Reference Count per SNP
- F_A1A2 - Female Heterozygous Ref-Alt Count per SNP
- F_A2A2 - Female Homozygous Alternative Count per SNP
- M_A1A1.A1 - Male Homozygous/Haploid Reference Count per SNP
- M_A1A2 - Male Heterozygous Ref-Alt Count per SNP
- M_A2A2.A2 - Male Homozygous/Haploid Alternative Count per SNP
- LOG10P - Log10 transformed P-Value
- Ffreq - Female Alt Allele Frequency
- Mfreq - Male Alt Allele Frequency
- DIFmaf - Female - Male Frequency Difference of Sex Combined Minor Allele
- sdMISSING - Female - Male Difference of Missing Rate
- Pmissing - Log10 transformed P-Value from Fisher's Exact test
If you have a fairly large cohort, have access to clusters, and want to batch process entire genome; this guide will teach you how to parallelly compute sdMAF at each chromosome. If not, you can check stream processing single file under example below.
Following bash scripts assume your bed bim fam files for all chromosomes are in same folder and have similar naming except chromosome number. chrX PAR1/PAR2 is assumed to be chrXY and chrX NPR is assumed to be chrX, if not change it accordingly. sdMAF 0.0.3 accepts a mixture of NPR/PAR variants, so PAR1/PAR2 and NPR in one file is okay as long as PAR1/PAR2 have different chromosome codes (PAR1/PAR2 from plink2 --split-par and XY from plink1.9 --split-x). If your genotype files are not in plink format, just change the plink code.
All job commends are under Moab/Torque environment.
####Make your own Analysis.sh file by filling in these address.
#PBS -l vmem=30g,mem=30g
#PBS -l walltime=24:00:00
#PBS -o /YourOutFolder
#PBS -e /YourErrorFolder
cd /YourWorkDirectory
chr=$PARAM1
module load plink2
plink2 --bfile /YourBedFileFolder/BedFileNamechr${chr} --geno-counts --keep-males --out /YourMaleGcountFolder/chr${chr}
plink2 --bfile /YourBedFileFolder/BedFileNamechr${chr} --geno-counts --keep-females --out /YourFemaleGcountFolder/chr${chr}
module load R
Rscript sdMAF.R \
-f /YourFemaleGcountFolder/chr${chr}.gcount \
-m /YourMaleGcountFolder/chr${chr}.gcount \
--mac 10 \
-o /YourOutFolder/OutFileNamechr${chr} \
-l /YourLogFolder/LogFileNamechr${chr}
#### on HPF
cd /WhereAnalysis.shLocated
#Autosomes
let a=1
while [ $a -le 22 ]
do
qsub -v PARAM1=$a Analysis.sh
let a=$a+1
done
#Caution! Check naming of PAR/NPR regions in your Bed files before using this.
qsub -v PARAM1=X Analysis.sh
qsub -v PARAM1=XY Analysis.sh
#Merging results after all batch jobs completed.
let a=1
head -n 1 /YourOutFolder/OutFileNamechr${a}.sdMAF > header.txt
while [ $a -le 22 ]
do
echo "$(tail -n +2 /YourOutFolder/OutFileNamechr${a}.sdMAF)" > /YourOutFolder/OutFileNamechr${a}.sdMAF
let a=$a+1
done
echo "$(tail -n +2 /YourOutFolder/OutFileNamechrX.sdMAF)" > /YourOutFolder/OutFileNamechrX.sdMAF
echo "$(tail -n +2 /YourOutFolder/OutFileNamechrXY.sdMAF)" > /YourOutFolder/OutFileNamechrXY.sdMAF
cat header.txt /YourOutFolder/OutFileNamechr* > /YourOutFolder/OutFileName_chr1to23sdMAF.txt
#Use plink to produce gcount files.
plink2 --bfile pilot --geno-counts --keep-males --out /male/pilot
plink2 --bfile pilot --geno-counts --keep-females --out /female/pilot
#Run sdMAF.
Rscript sdMAF.R \
-f /female/pilot.gcount \
-m /male/pilot.gcount \
--mac 10 \
-o /out/pilot_test \
-l /log/pilot_test
########## sdMAF 0.0.3 ##########
An R based commend-line tool used to compute sex differences in allele frequencies.
sdMAF is free and comes with ABSOLUTELY NO WARRANTY.
Details of the method can be found at:
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010231#sec017:~:text=MAF%20between%20populations.-,1.1.%20sdMAF%20test.,-For%20each%20bi
Copyright 2022 Zeya Chen, Zhong Wang, Delnaz Roshandel, Lei Sun, Andrew D. Paterson.
Report bugs to zeya [dot] chen [at] sickkids [dot] ca.
#################################
Checking if inputs are valid.
Loading female gcount file found from /female/pilot.txt.
198380 samples detected from female gcount file.
Number of SNPs by chromosome from female gcount file:
X
3917799
Loading male genotype count file found from /male/pilot.txt.
165243 samples detected from male gcount file.
Number of SNPs per chromosome from male gcount file:
X
3917799
3917799 chrX NPR SNPs detected.
0 autosomal/PAR SNPs detected.
0 missing SNPs detected. Don't worry it will be filtered out later.
No bim file provided. OK unless ID column from genotype file not all in CHR:BP:A1:A2 form.
#################################
Input checkers all passed, now applying filters.
Keeping 115459 biallelic SNPs out of 3917799 total SNPs from Input.
Keeping 63289 SNPs out of 115459 SNPs based on a sex combined minor allele count filter of 10.
#################################
All filters applied, now computing sdMAF!
Now calculated 5% (3165/63289).
Now calculated 10% (6329/63289).
Now calculated 15% (9494/63289).
Now calculated 20% (12658/63289).
Now calculated 25% (15823/63289).
Now calculated 30% (18987/63289).
Now calculated 35% (22152/63289).
Now calculated 40% (25316/63289).
Now calculated 45% (28481/63289).
Now calculated 50% (31645/63289).
Now calculated 55% (34809/63289).
Now calculated 60% (37974/63289).
Now calculated 65% (41138/63289).
Now calculated 70% (44303/63289).
Now calculated 75% (47467/63289).
Now calculated 80% (50632/63289).
Now calculated 85% (53796/63289).
Now calculated 90% (56961/63289).
Now calculated 95% (60125/63289).
Finito !
63289 total SNPs in results.
63289 were chrX NPR SNPs.
0 were autosomal/PAR SNPs.
Number of SNPs by chromosome table:
X
63289
Writing results to /out/pilot_test.sdMAF and logs to /log/pilot_test_sdMAF.log
A complementray tool to the sdMAF which would only run on sdMAF processed results. To use it, you simply have to provide address of the sdMAF results file and indicate the output result file address. Example: Rscript .../sdMISSING.R .../pilot_test.sdMAF .../pilot_test.sdMISSING
- Wang, Zhong et al. “Major sex differences in allele frequencies for X chromosomal variants in both the 1000 Genomes Project and gnomAD.” PLoS genetics vol. 18,5 e1010231. 31 May. 2022, doi:10.1371/journal.pgen.1010231
- Genome-wide genetic data on ~500,000 UK Biobank participants Clare Bycroft, Colin Freeman, Desislava Petkova, Gavin Band, Lloyd T. Elliott, Kevin Sharp, Allan Motyer, Damjan Vukcevic, Olivier Delaneau, Jared O’Connell, Adrian Cortes, Samantha Welsh, Gil McVean, Stephen Leslie, Peter Donnelly, Jonathan Marchini bioRxiv 166298; doi: https://doi.org/10.1101/166298
- Trevor L Davis (2022). argparse: Command Line Optional and Positional Argument Parser. R package version 2.1.6. https://CRAN.R-project.org/package=argparse
