-
Notifications
You must be signed in to change notification settings - Fork 0
Review GAD2 #24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Review GAD2 #24
Conversation
grishchenkoira
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Хорошая работа! 🐱 Очень понравился фильтратор. Ниже есть несколько замечаний
| import os | ||
| import re | ||
| from typing import TextIO, Optional, Union | ||
| from abc import ABC, abstractmethod | ||
| from Bio import SeqIO, SeqUtils |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Правильно остортировано по происхождению бибилиотек, но нет лексикографического порядка.
| import os | |
| import re | |
| from typing import TextIO, Optional, Union | |
| from abc import ABC, abstractmethod | |
| from Bio import SeqIO, SeqUtils | |
| import os | |
| import re | |
| from abc import ABC, abstractmethod | |
| from typing import TextIO, Optional, Union | |
| from Bio import SeqIO, SeqUtils |
| complement_rule = {'a': 't', 'A': 'T', 't': 'a', 'T': 'A', | ||
| 'g': 'c', 'G': 'C', 'c': 'g', 'C': 'G'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Здорово, что переменная вынесена до объявления всех методов. Она для всего класса и не предполагается, что она будет изменяться у экземпляров. Лучше записать заглавными буквами.
| Function can procced only DNA seqences. | ||
| """ | ||
| transcribed_sequence = '' | ||
| transcribed_sequence = self.seq.replace('t', 'u').replace('T', 'U') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Нет проверки на корректность. Если в сиквенсе будет буква, которой нет в алфавите, то она появится и в новой последовательности.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ну в целом проверка это не зона отвественности транскрайба
| from abc import ABC, abstractmethod | ||
| from Bio import SeqIO, SeqUtils | ||
|
|
||
| class BiologicalSequence(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Отмечу здесь, потому что касается всех классов для последовательностей.
Нет аннотации типов.
|
|
||
| def __init__(self, seq) -> None: | ||
| super().__init__(seq = seq) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Нет класса для белковых последовательностей 😿
| if not input_path.endswith('.fastq'): | ||
| raise ValueError('Incorrect input file extension, should be .fastq') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Здорово, что есть контроль входного файла!
| filtererd_fastq = open(os.path.join(output_path, output_filename), mode='w') | ||
| for seq_record in SeqIO.parse(input_path, "fastq"): | ||
| if check_gc(seq_record.seq, gc_params) and check_length(seq_record.seq, len_bound_params) and check_quality(seq_record, quality_threshold): | ||
| SeqIO.write(seq_record, filtererd_fastq, "fastq") | ||
| filtererd_fastq.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Лучше использовать контекстный менеджер с оператором with, чтобы точно знать, что все закроется и не произойдет непредвиденного
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Даже в этом случае, согласен
KirPetrikov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Привет! Хорошая работа!
Куда-то потерялся AminoAcidSequence :'(
Функции для фильтрации, по-моему, уже можно было не выписывать отдельными, а реализовать сразу в run_fastq_filter. Но это не принципиально.
Докстринги - моё почтение!
Немного учесть замечания - и будет просто супер! Удачи!
| from abc import ABC, abstractmethod | ||
| from Bio import SeqIO, SeqUtils | ||
|
|
||
| class BiologicalSequence(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Насколько я разобрался: абстрактный класс написан прям как надо
| """Proccising nucleic acid sequences. | ||
| This class is parental for DNASequence and RNASequence classes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Докстринги - замечательно.
| """Proccising nucleic acid sequences. | ||
| This class is parental for DNASequence and RNASequence classes. | ||
| """ | ||
| def __init__(self, seq) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Обратил внимание: у классов аннотация типов только для ретёрна у инитов. Кажется, инитам она как раз не очень нужна. И точно для аргументов можно добавить.
| def __init__(self, seq) -> None: | |
| def __init__(self, seq: str) -> None: |
| if self.check_sequence(): | ||
| complement_sequence = '' | ||
| for nucleotide in self.seq: | ||
| if nucleotide in type(self).complement_rule: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Здесь у экземпляра класса вызывается атрибут complement_rule, но в самом классе NucleicAcidSequence его нет. Кажется, это не очень корректно.
Предлагаю такой вариант: в NucleicAcidSequence ставим всем подобным атрибутам = None. Тогда проверку для этих атрибутов в методах делаем такую:
if self.complement_rule is None:
raise NotImplementedError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Когда пишите код в комментах PR, добавляйте python после ``` - так синтаксис будет подсвечен как надо
| """ | ||
| if self.check_sequence(): | ||
| complement_sequence = '' | ||
| for nucleotide in self.seq: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Такую проверку можно сделать не в цикле, а просто проверив подмножество:
if set(self.seq).issubset(set(complement_rule)):
| return True | ||
|
|
||
|
|
||
| def int_to_tuple(input_parameters) -> tuple: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Вынести это в отдельную функцию - очень правильное решение, по-моему, упрощает логику кода.
| return (0, input_parameters) | ||
|
|
||
|
|
||
| def run_fastq_filter(input_path: Optional[str] = None, output_filename: Optional[str] = None, gc_bounds: Union[int, tuple] = (0, 100), length_bounds: Union[int, tuple] = (0, 2**32), quality_threshold: int = 0) -> TextIO: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
А зачем дефолтный None для input_path? Всё равно упадёт с ошибкой при попытке прочитать None, но если у необходимого атрибута не будет дефолтного значения и его забудут указать, ошибка будет другая, понятнее.
| def run_fastq_filter(input_path: Optional[str] = None, output_filename: Optional[str] = None, gc_bounds: Union[int, tuple] = (0, 100), length_bounds: Union[int, tuple] = (0, 2**32), quality_threshold: int = 0) -> TextIO: | |
| def run_fastq_filter(input_path: Optional[str], output_filename: Optional[str] = None, gc_bounds: Union[int, tuple] = (0, 100), length_bounds: Union[int, tuple] = (0, 2**32), quality_threshold: int = 0) -> TextIO: |
| - reads quality score (quality_threshold). | ||
|
|
||
| Input: | ||
| - input_path (str): path to .fastq file; include 4 strings: 1 - read ID, 2 - sequence, 3 - comment, 4 - quality. Default - None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Кажется, описание спецификации fastq-формата уже лишнее)
| - input_path (str): path to .fastq file; include 4 strings: 1 - read ID, 2 - sequence, 3 - comment, 4 - quality. Default - None. | |
| - input_path (str): path to .fastq file. |
| - output_filename (str): name of output file, by default, it will be saved in the directory 'fastq_filtrator_resuls'. Default name will be name of input file. | ||
| - gc_bounds (tuple or int): GC content filter parameters, it accepts lower and upper (tuple), or only upper threshold value (int). Default value (0, 100). | ||
| - length_bounds (tuple or int): read length filter parameters, it accepts lower and upper (tuple), or only upper threshold value (int). Default value (0, 2**32). | ||
| - quality_threshold (int): upper quality threshold in phred33 scale. Reads with average quality below the threshold are discarded. Default value - 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Только не верхняя граница, а нижняя.
| - quality_threshold (int): upper quality threshold in phred33 scale. Reads with average quality below the threshold are discarded. Default value - 0. | |
| - quality_threshold (int): lower quality threshold in phred33 scale. Reads with average quality below the threshold are discarded. Default value - 0. |
| gc_params = int_to_tuple(gc_bounds) | ||
| len_bound_params = int_to_tuple(length_bounds) | ||
| "Filter and record results" | ||
| filtererd_fastq = open(os.path.join(output_path, output_filename), mode='w') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Работать с файлами через with open надёжнее. Если, например, произойдёт ошибка перед filtererd_fastq.close(), файл не закроется.
Sarsaparella
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Все аккуратно и чистно, радуют красивые докстринги и оформление. Некоторые стистические решения мне показались не очень понятными, но мб это мой недостаток. Нет класса AminoAcidSequence, ну ничего, он почти такой же как и NucleicAcidSequence, только там алфавит другой 😁
| def __init__(self, seq) -> None: | ||
| self.seq = seq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Я не очень понимаю зачем тут писать "-> None" 🤔 хотя сам концепт мне нравится (первый раз с ним сталкиваюсь и возьму себе на вооружение)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__init__ должен возвращать None по определению, иначе питон бросит ошибку, поэтому да, в случае с __init__ так делать не надо. Но когда просто функция кастомная, то там такая аннотация будет полезна
| def gc_content(self): | ||
| return (sum(1 for _ in re.finditer(r'[GCgc]', self.seq)))/self.__len__() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ничего плохого в этом способе нет, однако выглядит немного неинтуитивно что ли, сложно читат 🤧
| 'g': 'c', 'G': 'C', 'c': 'g', 'C': 'G'} | ||
|
|
||
| def __init__(self, seq: str) -> None: | ||
| super().__init__(seq = seq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Возможно я чего-то не понимаю, но опять же не понятно зачем писать (seq = seq) 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
В целом явное лучше чем не явное, поэтому прописывать в какие аргументы какие значения мы передаем - звучит полезно
Review GAD2