Statistics by khetherin · Pull Request #29 · EBIvariation/convertGVFtoVCF

khetherin · 2026-02-04T10:19:56Z

This PR is intended to provide an overview of the design of the conversion statistics.
Added new dataclass: StatisticsPayload
Added new class and tests: FileStatistics
Added a return value to the convertGVFtoVCF.py:convert function

apriltuesday

Great work! A lot of my comments are more stylistic or suggestions for the future, but I would check the first comment in conversionstatistics.py as this will affect the accuracy of the counts.

apriltuesday · 2026-02-12T14:10:10Z

convert_gvf_to_vcf/vcfline.py


+    def __gt__(self, other_vcf_line):
+        if isinstance(other_vcf_line, VcfLine):
+            return (self.chrom != other_vcf_line.chrom) or (self.pos > other_vcf_line.pos)


I see you have a TODO about the sorting so maybe you're planning to look at this later, but I think what you want is to order first by chrom and then by pos (and then maybe by ref?), which would make the condition like this:

Suggested change

-            return (self.chrom != other_vcf_line.chrom) or (self.pos > other_vcf_line.pos)
+            return (self.chrom > other_vcf_line.chrom) or (self.chrom == other_vcf_line.chrom and self.pos > other_vcf_line.pos)

A shortcut for this is to rely on the built-in comparison for tuple, which also makes it much easier to add ref if it's needed:

Suggested change

-            return (self.chrom != other_vcf_line.chrom) or (self.pos > other_vcf_line.pos)
+            return (self.chrom, self.pos, self.ref) > (other_vcf_line.chrom, other_vcf_line.pos, other_vcf_line.ref)

This one is on me. We're using this to ensure order, because ordering is required for the merging algorithm to work.
Here we're abusing __gt__ for this sole purpose, but we don't want to fail the assert statement on line 325 just because two chromosomes are not ordered (which is not actually required in VCF).

Got it, thanks for explaining. It looked weird to me but I think it's not actually contradictory with the equals method, just not a total ordering.
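To make the difference discussed in this thread concrete, here is a minimal sketch. It is not the project's VcfLine class: `Record`, `tuple_gt` and `merge_gt` are hypothetical names used only for illustration, and the values are made up. It contrasts the tuple-based comparison with the condition used in the PR.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for VcfLine with only the fields compared here."""
    chrom: str
    pos: int
    ref: str


def tuple_gt(a: Record, b: Record) -> bool:
    # Lexicographic comparison: chrom first, then pos, then ref.
    return (a.chrom, a.pos, a.ref) > (b.chrom, b.pos, b.ref)


def merge_gt(a: Record, b: Record) -> bool:
    # The PR's condition: any change of chromosome counts as "greater",
    # so positions are only really compared within one chromosome.
    return (a.chrom != b.chrom) or (a.pos > b.pos)


chr2 = Record("chr2", 100, "A")
chr10 = Record("chr10", 50, "G")

# Tuple comparison is a total ordering, but chromosome names compare as
# strings, so "chr2" sorts after "chr10".
print(tuple_gt(chr2, chr10), tuple_gt(chr10, chr2))   # True False

# The merge-oriented condition holds in both directions across chromosomes,
# which is why it is not a total ordering.
print(merge_gt(chr2, chr10), merge_gt(chr10, chr2))   # True True
```

The second result is what lets a sortedness assertion pass for any chromosome order while still catching unsorted positions within a chromosome.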

apriltuesday · 2026-02-12T14:29:46Z

convert_gvf_to_vcf/conversionstatistics.py

+        # # attribute mapping
+        self.vcf_info_counter = Counter()
+        self.vcf_imprecise_variants = self.vcf_info_counter["IMPRECISE"]
+        self.vcf_precise_variants = self.vcf_data_line_count - self.vcf_imprecise_variants


For vcf_alt_missing, vcf_imprecise_variants, vcf_precise_variants: Each of these variables is defined only using whatever values are available at the time, so it won't "know" about any updates being made to the Counters or other variables as the processing occurs. For example:

    >>> from collections import Counter
    >>> x = Counter()
    >>> y = x['frog']
    >>> x['frog'] += 1
    >>> x['frog']
    1
    >>> y
    0

You can either just compute these at the time when you print out the report, or you could define @property methods, which would mean you could still refer to them as self.vcf_precise_variants and so on:

    @property
    def vcf_precise_variants(self):
        return self.vcf_data_line_count - self.vcf_imprecise_variants
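As a slightly fuller illustration of the property approach, here is a sketch under stated assumptions. `StatsSketch` is a hypothetical, cut-down stand-in for FileStatistics, and the INFO keys fed into it are made up. Only the attribute names come from the PR.

```python
from collections import Counter


class StatsSketch:
    """Hypothetical, cut-down stand-in for FileStatistics."""

    def __init__(self):
        self.vcf_data_line_count = 0
        self.vcf_info_counter = Counter()

    @property
    def vcf_imprecise_variants(self):
        # Looked up on every access, so it reflects the current Counter state.
        return self.vcf_info_counter["IMPRECISE"]

    @property
    def vcf_precise_variants(self):
        # Derived from the two live values above rather than stored once.
        return self.vcf_data_line_count - self.vcf_imprecise_variants


stats = StatsSketch()
for info_keys in (["IMPRECISE", "SVLEN"], ["END"], ["IMPRECISE"]):
    stats.vcf_data_line_count += 1
    stats.vcf_info_counter.update(info_keys)

print(stats.vcf_imprecise_variants, stats.vcf_precise_variants)  # 2 1
```

Because nothing is cached in `__init__`, the derived values stay consistent with the counters however late they are read.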

apriltuesday · 2026-02-12T14:39:59Z

convert_gvf_to_vcf/convertGVFtoVCF.py

+        for gvf_entry in read_in_gvf_data(gvf_input):
+            # record GVF counts
+            report.gvf_feature_line_count += 1
+            report.gvf_chromosome_count.update([gvf_entry.seqid])


If you're just counting one more thing, it's a bit more readable to do this:

Suggested change

-            report.gvf_chromosome_count.update([gvf_entry.seqid])
+            report.gvf_chromosome_count[gvf_entry.seqid] += 1
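A small sketch showing that the two forms count the same thing, using made-up seqid values. It also shows one reason the indexed form is easier to read: forgetting the list around a single string changes what `update` counts.

```python
from collections import Counter

seqids = ["chr1", "chr1", "chr2"]  # made-up GVF seqid values

with_update = Counter()
with_index = Counter()
for seqid in seqids:
    with_update.update([seqid])   # single key wrapped in a list
    with_index[seqid] += 1        # missing keys start at 0

print(with_update == with_index)  # True

# Pitfall: passing the bare string counts its characters instead.
pitfall = Counter()
pitfall.update("chr1")
print("chr1" in pitfall)          # False
```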

apriltuesday · 2026-02-12T14:59:06Z

convert_gvf_to_vcf/conversionstatistics.py

+    def find_version(search_term, header_list):
+        for line in header_list:
+            if search_term in line:
+                return line
+        return None


Since this function is specifically to find the version number rather than a generic search, I wouldn't include search_term as a parameter but just always search for the appropriate header string, i.e.:

Suggested change

-    def find_version(search_term, header_list):
-        for line in header_list:
-            if search_term in line:
-                return line
-        return None
+    def find_version(header_list):
+        for line in header_list:
+            if "##gvf-version" in line:
+                return line
+        return None

You could also consider trying to extract the actual version number rather than returning the whole line, so replace return line with return line.split('=')[1] to get 1.06 instead of ##gvf-version=1.06 (assuming it's consistently formatted in the files of course!).
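A hedged sketch of the version-extraction idea follows. `find_gvf_version` is a hypothetical name, and the assumption that the value may be separated from the pragma by either `=` or whitespace is mine, not something confirmed for the project's input files.

```python
from typing import Optional


def find_gvf_version(header_lines: list[str]) -> Optional[str]:
    """Return the value of the first ##gvf-version pragma, or None."""
    prefix = "##gvf-version"
    for line in header_lines:
        stripped = line.strip()
        if stripped.startswith(prefix):
            # Assumption: accept "##gvf-version=1.06" and "##gvf-version 1.06".
            value = stripped[len(prefix):].lstrip("= \t").strip()
            return value or None
    return None


print(find_gvf_version(["##gff-version 3", "##gvf-version=1.06"]))  # 1.06
```

Stripping the known prefix avoids the IndexError that `line.split('=')[1]` would raise on a line with no `=`.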

apriltuesday · 2026-02-12T15:01:35Z

tests/test_conversionstatistics.py

+if __name__ == '__main__':
+    unittest.main()


Not needed here, the test runner should discover the tests on its own.

Suggested change

-if __name__ == '__main__':
-    unittest.main()

apriltuesday · 2026-02-12T16:24:21Z

convert_gvf_to_vcf/conversionstatistics.py

+        '''
+        #
+        with open(file_to_write, "w") as stats_file:
+            stats_file.write(text_report)


Not necessarily in this PR, but at some point you might want to think about how you'll use these count reports, and whether you want them to be human-readable and nicely formatted, or machine-readable and easy for another script to process (or both!). For example, I guess we'll be running this script on multiple studies and GVFs, do we want aggregate statistics in the end?

One thing you can do is dump a yaml file in addition to the report, since yaml is relatively readable but also really easy to load into another script. As an example you can look at the ClinVar counts (basically serving the same purpose as yours, it just accumulates counts while running a pipeline), where there's both dump_to_file and print_report methods.
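A rough sketch of the "both human-readable and machine-readable" idea is below. To keep it standard-library only it swaps YAML for JSON, which is a substitution on my part. `CountsSketch`, `as_dict` and `aggregate` are hypothetical names. Only `dump_to_file`, `print_report` and the two count attributes are taken from the discussion.

```python
import json
from collections import Counter


class CountsSketch:
    """Hypothetical counts holder; the methods are assumptions, not the PR's code."""

    def __init__(self):
        self.gvf_feature_line_count = 0
        self.gvf_chromosome_count = Counter()

    def as_dict(self):
        # Counters become plain dicts so the output is easy to serialise.
        return {
            "gvf_feature_line_count": self.gvf_feature_line_count,
            "gvf_chromosome_count": dict(self.gvf_chromosome_count),
        }

    def dump_to_file(self, path):
        # Machine-readable output for later aggregation.
        with open(path, "w") as handle:
            json.dump(self.as_dict(), handle, indent=2, sort_keys=True)

    def print_report(self):
        # Human-readable output built from the same dictionary.
        return "\n".join(f"{key}: {value}" for key, value in self.as_dict().items())


def aggregate(paths):
    """Sum the dumped counts from several conversion runs."""
    total_lines = 0
    per_chromosome = Counter()
    for path in paths:
        with open(path) as handle:
            data = json.load(handle)
        total_lines += data["gvf_feature_line_count"]
        per_chromosome.update(data["gvf_chromosome_count"])
    return total_lines, per_chromosome
```

With PyYAML installed, `yaml.safe_dump` and `yaml.safe_load` could replace the `json` calls. Building both outputs from one dictionary keeps the report and the dump from disagreeing.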

apriltuesday · 2026-02-12T16:39:55Z

tests/test_convert_gvf_to_vcf.py

+                if gvfline.startswith("#"):
+                    gvf_header.append(line.strip())
+                else:
+                    gvf_features.append(line.strip())


Are there supposed to be some assertions on the GVF lines here?

If possible I would add some checks on the counts to this test as well.
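For instance, count assertions could look something like the sketch below. `split_gvf_lines` is a hypothetical helper that mirrors the loop in the hunk above, and the GVF lines are made up for illustration. They are not taken from the project's test data.

```python
import unittest


def split_gvf_lines(lines):
    """Separate header lines (starting with '#') from feature lines."""
    header, features = [], []
    for line in lines:
        target = header if line.startswith("#") else features
        target.append(line.strip())
    return header, features


class TestGvfLineCounts(unittest.TestCase):
    def test_header_and_feature_counts(self):
        lines = [
            "##gvf-version 1.06\n",
            "##genome-build NCBI GRCz11\n",
            "chr1\tDGVa\tcopy_number_loss\t77\t78\t.\t+\t.\tID=1\n",
            "chr1\tDGVa\tcopy_number_gain\t90\t95\t.\t+\t.\tID=2\n",
            "chr2\tDGVa\tinversion\t10\t20\t.\t+\t.\tID=3\n",
        ]
        header, features = split_gvf_lines(lines)
        # Assert on the counts as well as on the content.
        self.assertEqual(len(header), 2)
        self.assertEqual(len(features), 3)
        self.assertTrue(all(line.startswith("#") for line in header))
        self.assertEqual(features[0].split("\t")[0], "chr1")
```

The same pattern extends to the statistics object, for example by asserting that the reported feature line count equals `len(features)`.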

tcezard · 2026-02-11T11:55:18Z

convert_gvf_to_vcf/conversionstatistics.py

+
+class FileStatistics:
+    """
+    The responsibility of this class is to determine the statistics of a GVF or VCF file.


Suggested change

-    The responsibility of this class is to determine the statistics of a GVF or VCF file.
+    The responsibility of this class is to accumulate counts and calculate statistics of a GVF to VCF file conversion.

tcezard · 2026-02-12T09:33:02Z

convert_gvf_to_vcf/conversionstatistics.py

+            {"Number of times each VCF sample has been seen:":<{key_width}}\n\t\t\t\t{self.vcf_sample_number_count}
+            {"Number of times INFO keys seen"}\n\t\t\t\t{self.vcf_info_counter}
+            {"VCF variant region Sequence Ontology ID counts"}\n\t\t\t\t{self.vcf_variant_region_SOID}
+            {"VCF variant call Sequence Ontology ID counts"}\n\t\t\t\t{self.vcf_variant_call_SOID}


I'm not sure what this report will look like with these new lines and extra tabs (not that it matters too much really).
You could also reuse the __str__ in print_report or the other way around rather than implement both.
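One way to do that, as a sketch only: `ReportSketch` is a hypothetical, cut-down class, and the layout (one indented line per Counter key) is an assumption about what a tidier report might look like, not the PR's actual output.

```python
from collections import Counter


class ReportSketch:
    """Hypothetical, cut-down report holder with a single rendering path."""

    def __init__(self):
        self.vcf_data_line_count = 0
        self.vcf_info_counter = Counter()

    def __str__(self):
        key_width = 40
        label = "Number of VCF data lines:"
        lines = [
            f"{label:<{key_width}}{self.vcf_data_line_count}",
            "Number of times INFO keys seen:",
        ]
        # One indented line per key instead of embedded newlines and tabs.
        for key, count in sorted(self.vcf_info_counter.items()):
            lines.append(f"    {key:<{key_width - 4}}{count}")
        return "\n".join(lines)

    def print_report(self):
        # Reuse __str__ so the two renderings cannot drift apart.
        print(self)
```

Calling `print_report()` and `str(report)` then always produce the same text, so only one format needs testing.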

tcezard · 2026-02-12T22:04:27Z

convert_gvf_to_vcf/vcfline.py


+    def __gt__(self, other_vcf_line):
+        if isinstance(other_vcf_line, VcfLine):
+            return (self.chrom != other_vcf_line.chrom) or (self.pos > other_vcf_line.pos)


This one is on me. We're using this to ensure order, because ordering is required for the merging algorithm to work.
Here we're abusing __gt__ for this sole purpose, but we don't want to fail the assert statement on line 325 just because two chromosomes are not ordered (which is not actually required in VCF).

khetherin added 3 commits February 4, 2026 10:09

changed imports for statistics

db513d1

creates output for input to statistics class changed write_header params to get vcf_header_fields in statistics payload

changed test for convert function

d336324

create statistics class and tests

c907efb

khetherin requested a review from tcezard February 4, 2026 10:20

khetherin added 16 commits February 4, 2026 11:57

renamed list to a more descriptive name: dropped_gvf_attributes

a60722d

added a counter to the statistics payload for GVF Structural variant …

5504514

…Sequence ontology terms. added tests

added ALT allele counters and tests.

7690959

added merge counts and variant line counts

added INFO key counters

02cf26e

change of design

ee14efd

cleanup

6fd3922

check for sorted GVF file

9b10273

added print_report function and tidy up

b06dfbf

sorted test gvf file and tidy

262d5bb

sorted test gvf file and tidy

27c4cdb

add chr4 only for drosophila_GCA_000001215.2.fa, sorted test gvf file…

e50186e

…, edit test and tidy

Merge remote-tracking branch 'origin/statistics' into statistics

934eae7

add sorted gvf file

c17a692

edit convert_500 test

c4f715e

edit convert_500 test

4793096

edit convert_500 test

3eab4ca

tcezard requested a review from apriltuesday February 11, 2026 11:51

apriltuesday reviewed Feb 12, 2026

tcezard approved these changes Feb 12, 2026

made changes as suggested in code review:

c4a5d1a

@property decorator added to vcf_alt_missing, vcf_imprecise_variants, vcf_precise_variants; edited find_version for gvf_version; docstrings added for clarification; edited tests

3 participants