- created a function called create_coordinate_range - created a test called test_create_coordinate_range
…ile. Added new GVF attributes to the file.
…pty only the mandatory VCF headers are printed
…checking for duplicates and merging duplicates
…cted in due course
# Conflicts: # README.md # convert_gvf_to_vcf/convertGVFtoVCF.py # convert_gvf_to_vcf/vcfline.py # tests/test_vcfline.py
generate_info_field_symbolic_allele, generate_info_field_for_imprecise_variant, generate_info_field_for_precise_variant. added new function convert_to_ci_bound to avoid repetition
merge_or_kept_vcf_objects = get_list_of_merged_vcf_objects(list_of_vcf_objects, samples)
# identify if duplicates are present after merging
has_dups, chrom_pos_list = has_duplicates(merge_or_kept_vcf_objects)
# while duplicates are present, merge, then re-check for dups
max_iterations = 100
iteration = 0
list_of_vcf_objects_to_be_filtered = merge_or_kept_vcf_objects
while has_dups and iteration < max_iterations:
    filtered_merge_or_kept_vcf_objects = filter_duplicates_by_merging(chrom_pos_list, has_dups,
                                                                      list_of_vcf_objects,
                                                                      list_of_vcf_objects_to_be_filtered, samples)
    has_dups, chrom_pos_list = has_duplicates(filtered_merge_or_kept_vcf_objects)
    iteration += 1
    list_of_vcf_objects_to_be_filtered = filtered_merge_or_kept_vcf_objects
    logger.info(f"Iteration of merge (remove dups): {iteration}")
Why is the merging algorithm not capable of merging all the duplicates in one pass?
The merge function compares the current line with the previous line (it is limited to 2 lines in its comparison and merge). For example:
The lines in the file: lineA, lineB, lineC
After the merge: lineAandB, lineC
After another iteration: lineAandBandC
> The merge function compares the current line with the previous line (it is limited to 2 lines in its comparison and merge).
OK, so now that you have identified the limitation, you should work on removing it rather than engineering something around it.
The issue is in get_list_of_merged_vcf_objects where you compare and merge separately.
A better algorithm would be:
- For all lines starting with line 2
  - take the `current` and `previous` line and compare them
  - if they are equal:
    - merge and set the merge result as the `previous` line
  - otherwise set the `current` line as the `previous` line
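The single-pass approach suggested in this comment could be sketched roughly as follows. This is only an illustration: `merge_adjacent`, `is_duplicate`, and `merge` are hypothetical names, and the toy tuples stand in for the real VCF line objects. It also assumes that duplicates are adjacent in the input (for example, because the lines are sorted by chromosome and position).

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def merge_adjacent(lines: List[T],
                   is_duplicate: Callable[[T, T], bool],
                   merge: Callable[[T, T], T]) -> List[T]:
    """Collapse every run of adjacent duplicates in a single pass.

    Hypothetical helper: `is_duplicate` and `merge` are placeholders for
    whatever comparison and merge logic the real code uses.
    """
    if not lines:
        return []
    merged: List[T] = []
    previous = lines[0]
    for current in lines[1:]:
        if is_duplicate(previous, current):
            # Keep the merge result as `previous`, so the next line is
            # compared against everything merged so far.
            previous = merge(previous, current)
        else:
            # The run of duplicates has ended: emit it and start a new one.
            merged.append(previous)
            previous = current
    merged.append(previous)
    return merged


# Toy records of the form (chrom, pos, ids); not the real VCF line class.
records = [
    ("chr1", 100, ["A"]),
    ("chr1", 100, ["B"]),
    ("chr1", 100, ["C"]),
    ("chr1", 200, ["D"]),
]


def same_site(a, b):
    return a[:2] == b[:2]


def combine(a, b):
    return (a[0], a[1], a[2] + b[2])


result = merge_adjacent(records, same_site, combine)
print(result)
```

Because the merge result replaces `previous`, lineA, lineB and lineC collapse into one record in the same pass, so no outer loop with `max_iterations` would be needed under these assumptions.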
Co-authored-by: Timothee Cezard <tcezard@ebi.ac.uk>