Hw5 #2
base: hw4
Changes from all commits
6c194e8
2b7aa28
0f8cabf
ea45dab
243ca38
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
| @@ -1,23 +1,30 @@ | ||||||||
| # BioToolkit | ||||||||
|
|
||||||||
| This is small munbers of bioinformatics toolkit for work with DNA and RNA sequences. | ||||||||
|
|
||||||||
| ## What it can do | ||||||||
|
|
||||||||
| - DNA tools: check sequence is DNA, transcribe DNA to RNA, reverse, complement, reverse complement. | ||||||||
| - RNA tools: check sequence is RNA, reverse transcribe RNA to DNA, reverse, complement, reverse complement. | ||||||||
| - Filter FASTQ sequences by GC content, length, or quality. | ||||||||
|
|
||||||||
| ## How to use it | ||||||||
|
|
||||||||
| Clone repo: | ||||||||
|
|
||||||||
| ```bash | ||||||||
| git clone https://github.com/YOUR_USERNAME/BioToolkit.git | ||||||||
| cd BioToolkit | ||||||||
|
|
||||||||
| ## or | ||||||||
|
|
||||||||
| from BioToolkit import run_dna_rna_tools | ||||||||
| from BioToolkit import filter_fastq | ||||||||
|
|
||||||||
| # BioToolkit | ||||||||
|
|
||||||||
| This is a small bioinformatics toolkit for working with DNA and RNA sequences be dadaist2001. | ||||||||
|
Suggested change
|
||||||||
|
|
||||||||
| ## What it can do | ||||||||
|
|
||||||||
| - DNA tools: check if a sequence is DNA, transcribe DNA to RNA, reverse, complement, reverse complement. | ||||||||
| - RNA tools: check if a sequence is RNA, reverse transcribe RNA to DNA, reverse, complement, reverse complement. | ||||||||
| - Filter FASTQ sequences by GC content, length, or quality.. | ||||||||
| - Bioinformatics file utilities (HW5): | ||||||||
| - `convert_multiline_fasta_to_oneline(input_fasta, output_fasta=None)` – converts multi-line FASTA sequences to single-line format. | ||||||||
| - `parse_blast_output(input_file, output_file)` – extracts top hits from BLAST reports and saves them sorted. | ||||||||
| - `select_genes_from_gbk_to_fasta` – extracts neighboring genes from GenBank files into FASTA format (monster-function). | ||||||||
|
|
||||||||
| ## How to use it | ||||||||
|
|
||||||||
| Clone the repository: | ||||||||
|
|
||||||||
| ```bash | ||||||||
| git clone https://github.com/YOUR_USERNAME/BioToolkit.git | ||||||||
| cd BioToolkit | ||||||||
| ``` | ||||||||
|
|
||||||||
| ## For Python | ||||||||
|
|
||||||||
| from BioToolkit import run_dna_rna_tools | ||||||||
| from BioToolkit import filter_fastq | ||||||||
| from BioToolkit import convert_multiline_fasta_to_oneline | ||||||||
| from BioToolkit import parse_blast_output | ||||||||
| from BioToolkit import select_genes_from_gbk_to_fasta | ||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,123 @@ | ||
| def convert_multiline_fasta_to_oneline(input_fasta, output_fasta=None): | ||
| """ | ||
| A function that converts a FASTA file where sequences | ||
| are split across multiple lines | ||
| into a format where each sequence is on a single line. | ||
|
|
||
| Args: | ||
| input_fasta (str): path to the input file. | ||
| output_fasta (str, optional): path for the converted file. | ||
|
|
||
| Returns: | ||
| str: path to the converted FASTA file. | ||
|
Review comment: Not true, the function does not return that. |
||
| """ | ||
|
|
||
| if output_fasta is None: | ||
| output_fasta = "converted_" + input_fasta.split("/")[-1] | ||
|
Review comment on lines +15 to +16: Yes, this is a good way to handle it. The only thing is that, to my taste, I would label the file "oneline", or use some other name that says more explicitly what it was converted to. |
||
|
|
||
| with open(input_fasta, "r") as infile, open(output_fasta, "w") as outfile: | ||
|
|
||
| header = None | ||
| seq = "" | ||
|
|
||
| for line in infile: | ||
| line = line.strip() | ||
|
|
||
| if not line: | ||
| continue | ||
|
|
||
| if line.startswith(">"): | ||
| if header is not None: | ||
| outfile.write(f"{header}\n{seq}\n") | ||
|
|
||
| header = line | ||
| seq = "" | ||
| else: | ||
| seq += line | ||
|
|
||
| if header is not None: | ||
| outfile.write(f"{header}\n{seq}\n") | ||
|
|
||
|
|
||
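The review notes on `convert_multiline_fasta_to_oneline` (the docstring promises a returned path, and the default file name could be more explicit) could be addressed roughly as follows. This is a minimal sketch, not the author's code: the `oneline_` prefix, the type hints, and the use of `os.path.basename` are illustrative assumptions.

```python
import os
from typing import Optional


def convert_multiline_fasta_to_oneline(
    input_fasta: str, output_fasta: Optional[str] = None
) -> str:
    """Write every record as a header plus one sequence line; return the output path."""
    if output_fasta is None:
        # Assumed naming choice: "oneline_" instead of "converted_".
        output_fasta = "oneline_" + os.path.basename(input_fasta)

    with open(input_fasta) as infile, open(output_fasta, "w") as outfile:
        header = None
        chunks = []  # sequence pieces of the current record
        for line in infile:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    outfile.write(f"{header}\n{''.join(chunks)}\n")
                header = line
                chunks = []
            else:
                chunks.append(line)
        # Flush the last record.
        if header is not None:
            outfile.write(f"{header}\n{''.join(chunks)}\n")

    # Returning the path makes the docstring's "Returns" section true.
    return output_fasta
```

Collecting the pieces in a list and joining once avoids repeated string concatenation on long sequences.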
| def parse_blast_output(input_file, output_file): | ||
|
Review comment: Type annotations are missing. |
||
| """ | ||
| A function for extracting the first name for QUERY | ||
| from a BLAST report and saves all found names. | ||
|
|
||
| Args: | ||
| input_file (str): path to the BLAST file. | ||
| output_file (str): path to the file will be saved. | ||
|
|
||
| Returns: | ||
| None | ||
| """ | ||
|
|
||
| results = [] | ||
|
Review comment: A list is used instead of a set. Duplicates are possible when several QUERY entries share the same top hit. |
||
|
|
||
| with open(input_file, "r") as f: | ||
| lines = f.readlines() | ||
|
Review comment: BLAST files can be large, so they should not be loaded into memory in full. |
||
|
|
||
| for i in range(len(lines)): | ||
| line = lines[i].strip() | ||
|
|
||
| if "Sequences producing significant alignments" in line: | ||
| if i + 1 < len(lines): | ||
| next_line = lines[i + 1].strip() | ||
|
Review comment: Because of this part, the output is not in the expected format, but |
||
| if next_line: | ||
| results.append(next_line) | ||
|
Review comment: The assignment asks only for the Description, but here the full line is appended, which also contains other fields (even if the comment above is addressed). |
||
| results.sort() | ||
|
|
||
| with open(output_file, "w") as out: | ||
| for res in results: | ||
| out.write(res + "\n") | ||
|
|
||
|
|
||
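The review notes on `parse_blast_output` (missing type annotations, a list where a set would avoid duplicates, reading the whole file into memory, and storing the full line instead of the Description) can be illustrated with the sketch below. It rests on two assumptions about the report layout that the PR does not confirm: that a column header row starting with `Description` follows the "Sequences producing significant alignments" line, and that columns are separated by runs of two or more spaces.

```python
import re


def parse_blast_output(input_file: str, output_file: str) -> None:
    """Save the Description of the first hit of every query, unique and sorted."""
    descriptions = set()  # a set drops repeated top hits
    state = "search"  # "search" -> "header" -> "hit" -> "search"

    with open(input_file) as blast:
        for raw in blast:  # the file object yields one line at a time
            line = raw.strip()
            if state == "search":
                if "Sequences producing significant alignments" in line:
                    state = "header"
            elif state == "header":
                # Assumption: the column header row starts with "Description".
                if line.startswith("Description"):
                    state = "hit"
            elif state == "hit" and line:
                # Assumption: columns are separated by 2+ spaces,
                # so the first field is the Description.
                descriptions.add(re.split(r"\s{2,}", line)[0])
                state = "search"

    with open(output_file, "w") as out:
        for description in sorted(descriptions):
            out.write(description + "\n")
```

If the real reports use a different layout, only the two branches marked as assumptions would need to change.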
| def select_genes_from_gbk_to_fasta( | ||
| input_gbk: str, | ||
| genes: "str | list[str]", | ||
| n_before: int = 1, | ||
| n_after: int = 1, | ||
| output_fasta: str = "selected_genes.fasta", | ||
| ) -> None: | ||
| """ | ||
| The function for extracting protein sequences | ||
| for genes near specified genes of interest. | ||
| """ | ||
|
|
||
| with open(input_gbk, "r") as f: | ||
| all_lines = f.readlines() | ||
|
Review comment: Likewise, files should not be read into memory in full. With readline you can process them on the fly, reading one line at a time. |
||
|
|
||
| all_genes = [] | ||
| gene = "" | ||
| translation = "" | ||
|
|
||
| for line in all_lines: | ||
| line = line.rstrip() | ||
|
Review comment: rstrip is a right strip. It does not remove the indentation on the left, so the line will never start with what is searched for below. |
||
| if line.startswith("/gene="): | ||
| gene = line.split('"')[1] | ||
| elif line.startswith("/translation="): | ||
|
Review comment: The translation is often multi-line, and that is lost with this approach. |
||
| translation = line.split('"')[1] | ||
|
Review comment: And in this form it is always lost, because the line simply has no closing quote. |
||
| all_genes.append({"gene_name": gene, "protein": translation}) | ||
| gene = "" | ||
| translation = "" | ||
|
|
||
| if isinstance(genes, str): | ||
| genes_to_find = [genes] | ||
| else: | ||
| genes_to_find = genes | ||
|
|
||
| genes_to_write = [] | ||
| index = 0 | ||
| while index < len(all_genes): | ||
| g = all_genes[index] | ||
| if g["gene_name"] in genes_to_find: | ||
| start_index = max(0, index - n_before) | ||
| end_index = min(len(all_genes), index + n_after + 1) | ||
| for k in range(start_index, end_index): | ||
| if k != index: | ||
| genes_to_write.append(all_genes[k]) | ||
| index += 1 | ||
|
|
||
| with open(output_fasta, "w") as out_file: | ||
| for item in genes_to_write: | ||
| out_file.write(f">{item['gene_name']}\n{item['protein']}\n") | ||
Review comment: Part of the assignment is not implemented. As a finishing touch, it would also be nice to check whether output_fastq already ends in .fastq or not, but that is minor.
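The GenBank parsing issues raised in the review (left indentation surviving `rstrip`, multi-line `/translation` values, and reading the whole file at once) could be handled along these lines. This is only a sketch under stated assumptions: `iter_gene_translations` is a hypothetical helper name, and it assumes each `/translation` value ends on the first line that finishes with a double quote.

```python
from typing import Iterator, Tuple


def iter_gene_translations(input_gbk: str) -> Iterator[Tuple[str, str]]:
    """Yield (gene_name, protein) pairs while reading the file line by line."""
    gene = ""
    parts = None  # None means "not inside a /translation qualifier"

    with open(input_gbk) as gbk:
        for raw in gbk:
            line = raw.strip()  # strip both sides so qualifiers start the line
            if parts is not None:
                # Continuation of a multi-line translation.
                parts.append(line.rstrip('"'))
                if line.endswith('"'):
                    yield gene, "".join(parts)
                    gene, parts = "", None
            elif line.startswith("/gene="):
                gene = line.split('"')[1]
            elif line.startswith("/translation="):
                text = line[len('/translation="'):]
                if text.endswith('"'):
                    # The whole translation fits on one line.
                    yield gene, text[:-1]
                    gene = ""
                else:
                    parts = [text]
```

The pairs can be collected with `list(iter_gene_translations(path))` and then passed to the existing neighbor-selection loop, which expects the genes in file order.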