Skip to content

Statistics#29

Open
khetherin wants to merge 20 commits intoEBIvariation:mainfrom
khetherin:statistics
Open

Statistics#29
khetherin wants to merge 20 commits intoEBIvariation:mainfrom
khetherin:statistics

Conversation

@khetherin
Copy link
Collaborator

This PR is intended to provide an overview for the design of the conversion statistics.
Added new dataclass: StatisticsPayload
Added new class and tests: FileStatistics
Added a return value to the convertGVFtoVCF.py:convert function

creates output for input to statistics class
changed write_header params to get vcf_header_fields in statistics payload
@khetherin khetherin requested a review from tcezard February 4, 2026 10:20
@tcezard tcezard requested a review from apriltuesday February 11, 2026 11:51
Copy link

@apriltuesday apriltuesday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! A lot of my comments are more stylistic or suggestions for the future, but I would check the first comment in conversionstatistics.py as this will affect the accuracy of the counts.


def __gt__(self, other_vcf_line):
if isinstance(other_vcf_line, VcfLine):
return (self.chrom != other_vcf_line.chrom) or (self.pos > other_vcf_line.pos)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you have a TODO about the sorting so maybe you're planning to look at this later, but I think what you want is to order first by chrom and then by pos (and then maybe by ref?), which would make the condition like this:

Suggested change
return (self.chrom != other_vcf_line.chrom) or (self.pos > other_vcf_line.pos)
return (self.chrom > other_vcf_line.chrom) or (self.chrom == other_vcf_line.chrom and self.pos > other_vcf_line.pos)

A shortcut for this is to rely on the built-in comparison for tuple, which also makes it much easier to add ref if it's needed:

Suggested change
return (self.chrom != other_vcf_line.chrom) or (self.pos > other_vcf_line.pos)
return (self.chrom, self.pos, self.ref) > (other_vcf_line.chrom, other_vcf_line.pos, other_vcf_line.ref)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is on me. We're using this to ensure order because this is required for the merging algorithm to work.
Here we're abusing the __gt__ for this sole purpose but we don't want to fail the assert statement line 325 because two chromosomes are not ordered (that is not actually required in VCF)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for explaining. It looked weird to me but I think it's not actually contradictory with the equals method, just not a total ordering.

# # attribute mapping
self.vcf_info_counter = Counter()
self.vcf_imprecise_variants = self.vcf_info_counter["IMPRECISE"]
self.vcf_precise_variants = self.vcf_data_line_count - self.vcf_imprecise_variants

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For vcf_alt_missing, vcf_imprecise_variants, vcf_precise_variants: Each of these variables is defined only using whatever values are available at the time, so it won't "know" about any updates being made to the Counters or other variables as the processing occurs. For example:

>>> from collections import Counter
>>> x = Counter()
>>> y = x['frog']
>>> x['frog'] += 1
>>> x['frog']
1
>>> y
0

You can either just compute these at the time when you print out the report, or you could define @property methods, which would mean you could still refer to them as self.vcf_precise_variants and so on:

@property
def vcf_precise_variants(self):
    return self.vcf_data_line_count - self.vcf_imprecise_variants

for gvf_entry in read_in_gvf_data(gvf_input):
# record GVF counts
report.gvf_feature_line_count += 1
report.gvf_chromosome_count.update([gvf_entry.seqid])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're just counting one more thing, it's a bit more readable to do this:

Suggested change
report.gvf_chromosome_count.update([gvf_entry.seqid])
report.gvf_chromosome_count[gvf_entry.seqid] += 1

Comment on lines 60 to 64
def find_version(search_term, header_list):
for line in header_list:
if search_term in line:
return line
return None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this function is specifically to find the version number rather than a generic search, I wouldn't include search_term as a parameter but just always search for the appropriate header string, i.e.:

Suggested change
def find_version(search_term, header_list):
for line in header_list:
if search_term in line:
return line
return None
def find_version(header_list):
for line in header_list:
if "##gvf-version" in line:
return line
return None

You could also consider trying to extract the actual version number rather than returning the whole line, so replace return line with return line.split('=')[1] to get 1.06 instead of ##gvf-version=1.06 (assuming it's consistently formatted in the files of course!).

Comment on lines 70 to 71
if __name__ == '__main__':
unittest.main()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed here, the test runner should discover the tests on its own.

Suggested change
if __name__ == '__main__':
unittest.main()

'''
#
with open(file_to_write, "w") as stats_file:
stats_file.write(text_report) No newline at end of file

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily in this PR, but at some point you might want to think about how you'll use these count reports, and whether you want them to be human-readable and nicely formatted, or machine-readable and easy for another script to process (or both!). For example, I guess we'll be running this script on multiple studies and GVFs, do we want aggregate statistics in the end?

One thing you can do is dump a yaml file in addition to the report, since yaml is relatively readable but also really easy to load into another script. As an example you can look at the ClinVar counts (basically serving the same purpose as yours, it just accumulates counts while running a pipeline), where there's both dump_to_file and print_report methods.

if gvfline.startswith("#"):
gvf_header.append(line.strip())
else:
gvf_features.append(line.strip())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there supposed to be some assertions on the GVF lines here?

If possible I would add some checks on the counts to this test as well.


class FileStatistics:
"""
The responsibility of this class is to determine the statistics of a GVF or VCF file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The responsibility of this class is to determine the statistics of a GVF or VCF file.
The responsibility of this class is to accumulate counts and calculate statistics of a GVF to VCF file conversion.

{"Number of times each VCF sample has been seen:":<{key_width}}\n\t\t\t\t{self.vcf_sample_number_count}
{"Number of times INFO keys seen"}\n\t\t\t\t{self.vcf_info_counter}
{"VCF variant region Sequence Ontology ID counts"}\n\t\t\t\t{self.vcf_variant_region_SOID}
{"VCF variant call Sequence Ontology ID counts"}\n\t\t\t\t{self.vcf_variant_call_SOID}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this report will look like with these new lines and extra tabs (not that is matter too much really).
You could also reuse the __str__ in print_report or the other way around rather than implement both.


def __gt__(self, other_vcf_line):
if isinstance(other_vcf_line, VcfLine):
return (self.chrom != other_vcf_line.chrom) or (self.pos > other_vcf_line.pos)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is on me. We're using this to ensure order because this is required for the merging algorithm to work.
Here we're abusing the __gt__ for this sole purpose but we don't want to fail the assert statement line 325 because two chromosomes are not ordered (that is not actually required in VCF)

@Property decorator added to vcf_alt_missing, vcf_imprecise_variants, vcf_precise_variants
edit find_Version for gvf_version
docstrings added for clarification
edited tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants