-
Notifications
You must be signed in to change notification settings - Fork 133
Description
Hi, thanks for the great tool!
I’ve stumbled on a bug that causes FP variant calls when performing a pairwise comparison of VCFs (xcmp mode) while when taking all variants into account (that is, including non-PASS).
We've tested Hap.py v0.3.14 (and also present in 0.3.15 looking at the code)
Please correct me if I understood it incorrectly ;)
Argument can be --usefiltered-truth used to take filtered variants (non-Pass and non-‘dot’) into account. While this option works It appears that these variants will always result in a FP as the struth will always have the genotype ./.
Example:
Truth VCF (only 1 variant shown):
15 90294306 rs6496598 C G 21269.49 SnpCluster AC=2;AF=1.00;AN=2 GT:AD:DP:GQ:PL 1/1:0,59:59:99:2696,207,0
Query (only 1 variant shown):
15 90294306 rs6496598 C G 21269.49 SnpCluster AC=2;AF=1.00;AN=2 GT:AD:DP:GQ:PL 1/1:0,28:28:99:1440,108,0
Command:
hap.py truth.vcf.gz query.vcf.gz --usefiltered-truth --reference <ref_fasta> --threads 2 -R <bed-file> -T <bed_file> -o output
I’ve tested different varieties of options (including --preprocess-truth) and the result was always the same:
15 90294306 . C G 21269.5 SnpCluster BS=90294306 GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 1/1:FP:.:tv:SNP:homalt:21269.5
Although the position itself was in the output (this is not the case when --usefiltered-truth was not used), the truth genotype is ./. or NOCALL, and thus a FP in this specific case.
Diving in the code I’ve noticed that the bug seems to be in the xcmp command as option “--apply-filters-truth False” should have been passed to this command when “--usefiltered-truth” is used.
Example:
Using some intermediate chunk of the original run I’ve test xcmp: again with two VCF files (filter.vcf.gz and pass.vcf.gz) and focusing on a single variant (15: 90294306).
For filter.vcf.gz the filter status for that variant is SnpCluster, for pass.vcf.gz this is PASS
filter.vcf.gz
15 90294306 . C G 21269.5 SnpCluster AF=1 GT:AD:ADO:DP:GQ:PL 1/1:0,59:0:59:99:0
pass.vcf.gz
15 90294306 . C G 21269.5 PASS AF=1 GT:AD:ADO:DP:GQ:PL 1/1:0,28:0:28:99:0
Command:
xcmp filter.vcf.gz pass.vcf.gz -o test_merge_xcmp.bcf -r Homo_sapiens.GRCh37.GATK.illumina.fasta
Output:
15 90294306 . C G 21269.5 . AF=1;BS=90294306;IQQ=21269.5;ctype=simple:mismatch;gtt2=gt_homalt;kind=missing;type=FP GT ./. 1/1
Using the --apply-filters-truth False filter results in the expected output.
Command:
xcmp filter.vcf.gz pass.vcf.gz -o test_merge_xcmp.bcf -r Homo_sapiens.GRCh37.GATK.illumina.fasta --apply-filters-truth False
Output:
15 90294306 . C G 21269.5 SnpCluster AF=1;BS=90294306;IQQ=21269.5;ctype=simple:match;gtt1=gt_homalt;gtt2=gt_homalt;kind=match;type=TP GT 1/1 1/1
Btw. (sanity check for myself): using pass.vcf.gz as truth always results in a correct output, regardless is -apply-filters-truth is used
For me it makes sense to include “--apply-filters-truth False“ in the xcmp command if --usefiltered-truth is used in the main script. If I look into the code, this does not seems to be the case/ an option.
Could you add this to the code, or am I missing something here?