Completed HW6_Files #2

Open

DAChernikov wants to merge 14 commits into main from HW6_ChernikovDA

Owner

DAChernikov commented Oct 15, 2023

No description provided.

DAChernikov added 14 commits

October 7, 2023 17:38


          Add HW5 files

516f05d


          Update README.md

fe455ee

Rewrite README with full description of BioSeqTools


          Add files HW6

109d69a


          Move example data to dir

a1794cd


          Delete example_blast_results.txt

7b12a66


          Delete example_fastq.fastq

0c0be1c


          Delete example_gbk.gbk

17e3faa


          Delete example_multiline_fasta.fasta

f2134bb


          Delete example_multiline_fasta_converted.fasta

78a6310


          Delete filtered_output.fastq

76fa555


          Delete output.fasta

15e78af


          Delete output_selected_genes.fasta


          Delete parsed_blast_results.txt

21dbae7


          Update README.md

26b79e4

SidorinAnton reviewed

SidorinAnton left a comment

Not bad overall!
Main points:

Junk in the repository: .DS_Store and __pycache__. Use a .gitignore.
read and readlines: these generally work, but if the file is very large, the program can crash with something like a MemoryError.
So it is safer to read files (especially bioinformatics files) line by line :)

Modules/aminoacids_tools.py

+                  for amino_acid, count in amino_acid_counts.items():
+                      percentage = round(((count / total_amino_acids) * 100), 2)
+                      amino_acid_percentages[amino_acid] = percentage
+                  return f'Amino acids percentage of the sequence {seq}: {amino_acid_percentages}'

SidorinAnton Dec 3, 2023

Why return a string? :)
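
One way to read this comment: the function could return the computed data and leave message formatting to the caller, so the result stays usable in further calculations. A minimal sketch, assuming the percentages are computed over the whole sequence; the name `amino_acid_percentages` is hypothetical and is not taken from the PR:

```python
from collections import Counter


def amino_acid_percentages(seq):
    """Return a dict mapping each residue in seq to its percentage (2 decimals)."""
    if not seq:
        return {}
    counts = Counter(seq)
    total = len(seq)
    return {aa: round(count / total * 100, 2) for aa, count in counts.items()}


# The caller decides how (and whether) to turn the result into text.
percentages = amino_acid_percentages("AAGC")
message = f"Amino acids percentage of the sequence AAGC: {percentages}"
```

The same idea would apply to the molecular weight and isoelectric point functions: return the number, format it at the call site.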

Modules/aminoacids_tools.py

+                  weight = 18.02  # for the H and OH at the termini
+                  for amino_acid in seq:
+                      weight += amino_acid_weights[amino_acid]
+                  return f'Molecular weight of the sequence {seq}: {round(weight, 2)} Da'

SidorinAnton Dec 3, 2023

Why return a string? :)

Modules/aminoacids_tools.py

+                  # Substitute all found values into the formula and calculate pI
+                  pI = total_pK / count
+                  return f"Isoelectric point for the sequence {sequence}: {pI}"

SidorinAnton Dec 3, 2023

Why return a string? :)

bio_files_processor.py

Comment on lines +5 to +23

+                  sequences = {}
+                  current_id = None
+                  for line in fasta_lines:
+                      line = line.strip()
+                      if line.startswith('>'):
+                          current_id = line[1:]
+                          sequences[current_id] = ''
+                      else:
+                          if current_id:
+                              sequences[current_id] += line
+                  if output_fasta is None:
+                      output_fasta = input_fasta + ".fasta"
+                  with open(output_fasta, 'w') as output_file:
+                      for seq_id, sequence in sequences.items():
+                          output_file.write(f'>{seq_id}\n{sequence}\n')

SidorinAnton Dec 3, 2023

This can be simpler. Something along the lines of:

with open(INPUT) as inp_fa, open(OUTPUT, "w") as opt_fa:
    for line in inp_fa:
        if line.startswith(">"):
            opt_fa.write("\n")  # The first-line check could be done more carefully so there is no leading \n
            opt_fa.write(line)
            continue

        opt_fa.write(line.strip())
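
The first-line check mentioned in the comment could look roughly like the sketch below. It keeps the single-pass, line-by-line approach and tracks whether a header has already been seen, so the output has no leading blank line and ends with a newline. The flag name `seen_header` is an assumption, and unlike the PR version this sketch requires `output_fasta` to be given explicitly:

```python
def convert_multiline_fasta_to_oneline(input_fasta, output_fasta):
    """Join the sequence lines of each FASTA record into one line, streaming the input."""
    with open(input_fasta) as inp_fa, open(output_fasta, "w") as opt_fa:
        seen_header = False
        for line in inp_fa:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if seen_header:
                    opt_fa.write("\n")  # terminate the previous record's sequence
                opt_fa.write(line + "\n")
                seen_header = True
            elif seen_header:
                # Sequence text before the first header is ignored, as in the PR code.
                opt_fa.write(line)
        if seen_header:
            opt_fa.write("\n")  # terminate the last record's sequence
```

Only one line is held in memory at a time, so the size of the input file does not matter.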

bio_files_processor.py

@@ -0,0 +1,103 @@
+              def convert_multiline_fasta_to_oneline(input_fasta, output_fasta=None):
+                  with open(input_fasta, 'r') as input_file:
+                      fasta_lines = input_file.readlines()

SidorinAnton Dec 3, 2023

Not great: if the file is very large, this can crash with an error :)
Better to iterate over the lines.

bio_files_processor.py

+                                                 n_after=1, output_fasta=None):
+                  with open(input_gbk, 'r') as gbk_file:
+                      gbk_data = gbk_file.read()

SidorinAnton Dec 3, 2023

Not great: if the file is very large, this can crash with an error :)
Better to iterate over the lines.

bio_files_processor.py

+              def change_fasta_start_pos(input_fasta, shift, output_fasta):
+                  with open(input_fasta, 'r') as input_file:
+                      lines = input_file.readlines()

SidorinAnton Dec 3, 2023

Not great: if the file is very large, this can crash with an error :)
Better to iterate over the lines.

bio_files_processor.py

+              def parse_blast_output(input_file, output_file=None):
+                  with open(input_file, 'r') as f:
+                      lines = f.read().splitlines()

SidorinAnton Dec 3, 2023

Not great: if the file is very large, this can crash with an error :)
Better to iterate over the lines.

filter_fastq_files.py

@@ -0,0 +1,45 @@
+              def read_fastq_file(file_path):
+                  with open(file_path, 'r') as file:
+                      lines = file.readlines()

SidorinAnton Dec 3, 2023

Not great: if the file is very large, this can crash with an error :)
Better to iterate over the lines.
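
For FASTQ, iterating by lines could take the form of a generator that yields one record at a time. A minimal sketch, assuming every record occupies exactly four lines (header, sequence, separator, quality) and the file has no blank lines; the name `iter_fastq_records` is hypothetical:

```python
from itertools import islice


def iter_fastq_records(file_path):
    """Yield (name, sequence, quality) tuples, reading four lines at a time."""
    with open(file_path) as fastq:
        while True:
            record = [line.rstrip("\n") for line in islice(fastq, 4)]
            if not record:
                return  # end of file
            if len(record) < 4:
                raise ValueError("truncated FASTQ record")
            name, sequence, _separator, quality = record
            yield name, sequence, quality
```

A filter can then consume the generator and write each passing record immediately, so memory use stays constant regardless of file size.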

filter_fastq_files.py

Comment on lines +44 to +45

		# Usage example
		run_filter_fastq('./HW6_Files/example_fastq.fastq', gc_bounds=(10, 30), quality_threshold=30, output_filename='filtered_output.fastq')

SidorinAnton Dec 3, 2023

No, this doesn't belong here :))
If you want, you can move it to a separate file, for example.
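
The underlying problem is that a module-level call runs every time the module is imported. Besides moving the example to a separate file, as suggested above, a common Python alternative (not something the reviewer asked for) is an `if __name__ == "__main__":` guard. In the sketch below the function body is a placeholder standing in for the PR's real implementation:

```python
def run_filter_fastq(input_path, gc_bounds, quality_threshold, output_filename):
    """Placeholder for the PR's function; the real filtering logic is omitted."""
    return output_filename


# Runs only when the file is executed directly (python filter_fastq_files.py),
# not when another module does `import filter_fastq_files`.
if __name__ == "__main__":
    run_filter_fastq(
        "./HW6_Files/example_fastq.fastq",
        gc_bounds=(10, 30),
        quality_threshold=30,
        output_filename="filtered_output.fastq",
    )
```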
