diff --git a/HW4_Grigoriants/README.md b/HW4_Grigoriants/README.md new file mode 100644 index 0000000..c2c6801 --- /dev/null +++ b/HW4_Grigoriants/README.md @@ -0,0 +1,223 @@ +# Protein_tools.py +## A tool to work with protein sequences + +*Proteins* are under the constant focus of scientists. Currently, there is an enormous number of tools for working with nucleotide sequences; however, comparable tools for proteins are extremely rare. + + +`protein_tools.py` is an open-source program that facilitates working with protein sequences. + +## Usage +The program is based on the `run_protein_tools` function, which takes a list of **one-letter amino acid sequences**, the name of a procedure and any relevant arguments. If you have three-letter amino acid sequences, convert them with the `three_one_letter_code` procedure before using any other procedure on them. + +To start working with the program, run the following command: + +`run_protein_tools(sequences, procedure="procedure", ...)` + +Where: +- sequences - positional argument, a list of protein sequences +- procedure - keyword argument, the name of the procedure to use, passed as a *string* +- ... - additional keyword arguments for the chosen procedure, usually passed as *strings* (see *Options*) + +Before starting, check the *Options* and *Examples*. 
+## Options + +The program has five procedures; for more information, see the docstrings provided: + + `three_one_letter_code` + + ![image](https://drive.google.com/uc?export=view&id=1eACjU_CXFbqeu1iW3ekwcg81n-X3WvTG) + +- The main aim - to convert three-letter amino acid sequences to one-letter ones and vice versa +- For three-to-one conversion, the names of amino acids **must be separated with hyphens** +- Additional arguments: none +``` +""" +Convert protein sequences from one-letter to three-letter format and vice versa + +Case 1: get three-letter sequence\n +Use one-letter amino acid sequences of any letter case + +Case 2: get one-letter sequence\n +Use sequences of three-letter amino acids separated by "-". +Please note that sequences without "-" are parsed as one-letter code sequences\n +Example: for sequence "Ala" function will return "Ala-leu-ala" + +Arguments: +- sequences (tuple[str] or list[str]): protein sequences to convert\n +Example: ["WAG", "MkqRe", "msrlk", "Met-Ala-Gly", "Met-arg-asn-Trp-Ala-Gly", "arg-asn-trp"] + +Return: +- list: one-letter/three-letter protein sequences\n +Example: ["Trp-Ala-Gly", "Met-lys-gln-Arg-glu", "met-ser-arg-leu-lys", "MAG", "MrnWAG", "rnw"] +""" +``` + + `define_molecular_weight` + + ![image](https://drive.google.com/uc?export=view&id=1i9_4ys64XsAxnw-08zbgyBQnGzJoGJfr) + +- The main aim - to determine the exact molecular weight of protein sequences +- Additional arguments: none +``` +""" +Define molecular weight of the protein sequences + +Use one-letter amino acid sequences of any letter case +The molecular weight is: +- a sum of masses of each atom constituting a molecule +- expressed in units called daltons (Da) +- rounded to hundredths + +Arguments: +- sequences (tuple[str] or list[str]): protein sequences to weigh + +Return: +- dictionary: protein sequences as keys and molecular masses as values\n +Example: {"WAG": 332.39, "MkqRe": 690.88, "msrlk": 633.86} +""" +``` + + `search_for_motifs` + + 
![image](https://drive.google.com/uc?export=view&id=1_bVKRn4RblrfukIxoQc0NZ_FXaJliGAH) + +- The main aim - to search for the motif of interest in protein sequences +- Additional arguments: motif (*str*), overlapping (*bool*) +``` +""" +Search for motifs - conserved amino acid residues in protein sequences + +Search for one motif at a time\n +Search is letter case sensitive\n +Use one-letter amino acid code for desired sequences and motifs\n +Positions of AA in sequences are counted from 0\n +By default, overlapping matches are counted + +Arguments: +- sequences (tuple[str] or list[str]): sequences to check for given motif within\n +Example: sequences = ["AMGAGW", "GAWSGRAGA"] +- motif (str): desired motif to check for presence in every given sequence\n +Example: motif = "GA" +- overlapping (bool): count (True) or skip (False) overlapping matches. (Optional)\n +Example: overlapping = False +Return: +- dictionary: sequences (str) as keys, starting positions of the found motif (list) as values\n +Example: {"AMGAGW": [2], "GAWSGRAGA": [0, 7]} +""" +``` + `search_for_alt_frames` + + ![image](https://drive.google.com/uc?export=view&id=1AdXnkRDIRiC_5yiiI2qiAMSMWbZf1RIm) + +- The main aim - to look for alternative frames that start with methionine or with non-canonical start amino acids +- Ignores the last three amino acids due to the insignificance of alternative frames of this length +- Additional argument: alt_start_aa (*str*) +- Use alt_start_aa **only for non-canonical start amino acids** +- Without alt_start_aa the procedure finds alternative frames that start with methionine +``` +""" +Search for alternative frames in protein sequences + +Search is not letter case sensitive\n +Without an alt_start_aa argument search for frames that start with methionine ("M") +To search for frames with an alternative start amino acid, add the alt_start_aa argument\n +In alt_start_aa argument use one-letter code + +The function ignores the last three amino acids in sequences + +Arguments: +- 
sequences (tuple[str] or list[str]): sequences to check +- alt_start_aa (str): one-letter code of the alternative start amino acid (Optional)\n +Example: alt_start_aa = "I" + +Return: +- dictionary: sequences (str) as keys, alternative frames (list) as values +""" +``` +`convert_to_nucl_acids` + + ![image](https://drive.google.com/uc?export=view&id=1_pZJ0Gc-EVcR1zddpDW4Ok3w8t65fW_z) + +- The main aim - to convert protein sequences to DNA, RNA or both nucleic acid sequences +- The program uses the most frequent codons in human that could be found [here](https://www.genscript.com/tools/codon-frequency-table) +- Additional argument: nucl_acids (*str*) +- Use only DNA, RNA or both as nucl_acids (for more details, check *Examples*) +``` +""" +Convert protein sequences to RNA or DNA sequences. + +Use the most frequent codons in human. The source - https://www.genscript.com/tools/codon-frequency-table\n +All nucleic acids (DNA and RNA) are shown in 5'-3' direction + +Arguments: +- sequences (tuple[str] or list[str]): sequences to convert +- nucl_acids (str): the nucleic acid that is preferred\n +Example: nucl_acids = "RNA" - convert to RNA\n + nucl_acids = "DNA" - convert to DNA\n + nucl_acids = "both" - convert to RNA and DNA +Return: +- dictionary: nucleic acids (str) as keys, collection of sequences (list) as values +""" +``` + +## Examples +```python +# three_one_letter_code +run_protein_tools(['met-Asn-Tyr', 'Ile-Ala-Ala'], procedure='three_one_letter_code') # ['mNY', 'IAA'] +run_protein_tools(['mNY','IAA'], procedure='three_one_letter_code') # ['met-Asn-Tyr', 'Ile-Ala-Ala'] + + +# define_molecular_weight +run_protein_tools(['MNY','IAA'], procedure='define_molecular_weight') # {'MNY': 426.52, 'IAA': 273.35} + + +# search_for_motifs +run_protein_tools(['mNY','IAA'], procedure='search_for_motifs', motif='NY') +#Sequence: mNY +#Motif: NY +#Motif is present in protein sequence starting at positions: 1 + +#Sequence: IAA +#Motif: NY +#Motif is not 
present in protein sequence + +{'mNY': [1], 'IAA': []} + + +# search_for_alt_frames +run_protein_tools(['mNYQTMSPYYDMId'], procedure='search_for_alt_frames') # {'mNYQTMSPYYDMId': ['MSPYYDMId']} +run_protein_tools(['mNYTQTSP'], procedure='search_for_alt_frames', alt_start_aa='T') # {'mNYTQTSP': ['TQTSP']} + + +# convert_to_nucl_acids +run_protein_tools(['MNY'], procedure='convert_to_nucl_acids', nucl_acids = 'RNA') # {'RNA': ['AUGAACUAU']} +run_protein_tools(['MNY'], procedure='convert_to_nucl_acids', nucl_acids = 'DNA') # {'DNA': ['ATGAACTAT']} +run_protein_tools(['MNY'], procedure='convert_to_nucl_acids', nucl_acids = 'both') # {'RNA': ['AUGAACUAU'], 'DNA': ['ATGAACTAT']} + +``` + +## Troubleshooting + +| Type of the problem | Probable cause +| ------------------------------------------------------------ |-------------------- +| Output does not match the expected results | The procedure name is wrong, so you see the results of another procedure +| ValueError: No sequences provided | No list of sequences was passed +| ValueError: Wrong procedure | The procedure does not exist in this program +| TypeError: takes from 0 to 1 positional arguments but n were given | Sequences are not collected into a list +| ValueError: Invalid sequence given | The sequences do not correspond to the standard amino acid code +| ValueError: Please provide desired motif | The additional argument *motif* is missing in `search_for_motifs` +| ValueError: Invalid alternative start AA | The additional argument *alt_start_aa* in `search_for_alt_frames` contains more than one letter +| ValueError: Please provide desired type of nucl_acids | The additional argument *nucl_acids* is missing in `convert_to_nucl_acids` +| ValueError: Invalid nucl_acids argument | The additional argument in `convert_to_nucl_acids` is written incorrectly +## Contacts +Vladimir Grigoriants (vova.grig2002@gmail.com) +Team-leader. Bioinformatician, immunologist, MiLaboratory inc. 
TCR-libraries QC developer + +Ekaterina Shitik (shitik.ekaterina@gmail.com) +Doctor of medicine, molecular biologist whose main interests are gene engineering, AAV vectors and CRISPR/Cas9 technologies + +Vlada Tuliavko (vladislavi2742@gmail.com) +MiLaboratory inc. manager & designer, immunologist + +## Our team +![image](https://drive.google.com/uc?export=view&id=1tdSGpNl6GorFPZIqweB0PaGxQW5wK5Oo) diff --git a/HW4_Grigoriants/dictionaries.py b/HW4_Grigoriants/dictionaries.py new file mode 100644 index 0000000..c5725d1 --- /dev/null +++ b/HW4_Grigoriants/dictionaries.py @@ -0,0 +1,66 @@ +AMINO_ACIDS = { + "A": "Ala", + "C": "Cys", + "D": "Asp", + "E": "Glu", + "F": "Phe", + "G": "Gly", + "H": "His", + "I": "Ile", + "K": "Lys", + "L": "Leu", + "M": "Met", + "N": "Asn", + "P": "Pro", + "Q": "Gln", + "R": "Arg", + "S": "Ser", + "T": "Thr", + "V": "Val", + "W": "Trp", + "Y": "Tyr", +} +TRANSLATION_RULE = { + "F": "UUU", + "L": "CUG", + "I": "AUU", + "M": "AUG", + "V": "GUG", + "P": "CCG", + "T": "ACC", + "A": "GCG", + "Y": "UAU", + "H": "CAU", + "Q": "CAG", + "N": "AAC", + "K": "AAA", + "D": "GAU", + "E": "GAA", + "C": "UGC", + "W": "UGG", + "R": "CGU", + "S": "AGC", + "G": "GGC", +} +AMINO_ACID_WEIGHTS = { + "A": 89.09, + "C": 121.16, + "D": 133.10, + "E": 147.13, + "F": 165.19, + "G": 75.07, + "H": 155.16, + "I": 131.17, + "K": 146.19, + "L": 131.17, + "M": 149.21, + "N": 132.12, + "P": 115.13, + "Q": 146.15, + "R": 174.20, + "S": 105.09, + "T": 119.12, + "V": 117.15, + "W": 204.23, + "Y": 181.19, +} diff --git a/HW4_Grigoriants/protein_tools.py b/HW4_Grigoriants/protein_tools.py new file mode 100644 index 0000000..df92cef --- /dev/null +++ b/HW4_Grigoriants/protein_tools.py @@ -0,0 +1,289 @@ +import dictionaries + + +def three_one_letter_code(sequences: (tuple[str] or list[str])) -> list: + """ + Convert protein sequences from one-letter to three-letter format and vice versa + + Case 1: get three-letter sequence\n + Use one-letter amino acid sequences of any 
letter case + + Case 2: get one-letter sequence\n + Use sequences of three-letter amino acids separated by "-". + Please note that sequences without "-" are parsed as one-letter code sequences\n + Example: for sequence "Ala" function will return "Ala-leu-ala" + + Arguments: + - sequences (tuple[str] or list[str]): protein sequences to convert\n + Example: ["WAG", "MkqRe", "msrlk", "Met-Ala-Gly", "Met-arg-asn-Trp-Ala-Gly", "arg-asn-trp"] + + Return: + - list: one-letter/three-letter protein sequences\n + Example: ["Trp-Ala-Gly", "Met-lys-gln-Arg-glu", "met-ser-arg-leu-lys", "MAG", "MrnWAG", "rnw"] + """ + inversed_sequences = [] + for sequence in sequences: + inversed_sequence = [] + if "-" not in sequence: + for letter in sequence: + if letter.islower(): + inversed_sequence.append( + dictionaries.AMINO_ACIDS[letter.capitalize()].lower() + ) + else: + inversed_sequence.append(dictionaries.AMINO_ACIDS[letter]) + inversed_sequences.append("-".join(inversed_sequence)) + else: + aa_splitted = sequence.split("-") + for aa in aa_splitted: + aa_index = list(dictionaries.AMINO_ACIDS.values()).index( + aa.capitalize() + ) + if aa[0].islower(): + inversed_sequence.append( + list(dictionaries.AMINO_ACIDS.keys())[aa_index].lower() + ) + else: + inversed_sequence.append( + list(dictionaries.AMINO_ACIDS.keys())[aa_index] + ) + inversed_sequences.append("".join(inversed_sequence)) + return inversed_sequences + + +def define_molecular_weight(sequences: (tuple[str] or list[str])) -> dict: + """ + Define molecular weight of the protein sequences + + Use one-letter amino acid sequences of any letter case + The molecular weight is: + - a sum of masses of each atom constituting a molecule + - expressed in units called daltons (Da) + - rounded to hundredths + + Arguments: + - sequences (tuple[str] or list[str]): protein sequences to weigh + + Return: + - dictionary: protein sequences as keys and molecular masses as values\n + Example: {"WAG": 332.39, "MkqRe": 690.88, "msrlk": 633.86} + """ + 
sequences_weights = {} + for sequence in sequences: + sequence_weight = 0 + for letter in sequence: + sequence_weight += dictionaries.AMINO_ACID_WEIGHTS[letter.upper()] + sequence_weight -= (len(sequence) - 1) * 18 # subtract one water (~18 Da) per peptide bond + sequences_weights[sequence] = round(sequence_weight, 2) + return sequences_weights + + +def search_for_motifs( + sequences: (tuple[str] or list[str]), motif: str, overlapping: bool +) -> dict: + """ + Search for motifs - conserved amino acid residues in protein sequences + + Search for one motif at a time\n + Search is letter case sensitive\n + Use one-letter amino acid code for desired sequences and motifs\n + Positions of AA in sequences are counted from 0\n + By default, overlapping matches are counted + + Arguments: + - sequences (tuple[str] or list[str]): sequences to check for given motif within\n + Example: sequences = ["AMGAGW", "GAWSGRAGA"] + - motif (str): desired motif to check for presence in every given sequence\n + Example: motif = "GA" + - overlapping (bool): count (True) or skip (False) overlapping matches. 
(Optional)\n + Example: overlapping = False + Return: + - dictionary: sequences (str) as keys, starting positions of the found motif (list) as values\n + Example: {"AMGAGW": [2], "GAWSGRAGA": [0, 7]} + """ + new_line = "\n" + all_positions = {} + for sequence in sequences: + start = 0 + positions = [] + print(f"Sequence: {sequence}") + print(f"Motif: {motif}") + if motif in sequence: + while True: + start = sequence.find(motif, start) + if start == -1: + break + positions.append(start) + if overlapping: + start += 1 + else: + start += len(motif) + print_pos = ", ".join(str(x) for x in positions) + print_pos = f"{print_pos}{new_line}" + print( + f"Motif is present in protein sequence starting at positions: {print_pos}" + ) + else: + print(f"Motif is not present in protein sequence{new_line}") + all_positions[sequence] = positions + return all_positions + + +def search_for_alt_frames( + sequences: (tuple[str] or list[str]), alt_start_aa: str +) -> dict: + """ + Search for alternative frames in protein sequences + + Search is not letter case sensitive\n + Without an alt_start_aa argument search for frames that start with methionine ("M") + To search for frames with an alternative start amino acid, add the alt_start_aa argument\n + In alt_start_aa argument use one-letter code + + The function ignores the last three amino acids in sequences + + Arguments: + - sequences (tuple[str] or list[str]): sequences to check + - alt_start_aa (str): one-letter code of the alternative start amino acid (Optional)\n + Example: alt_start_aa = "I" + + Return: + - dictionary: sequences (str) as keys, alternative frames (list) as values + """ + alternative_frames = {} + num_position = 0 + for sequence in sequences: + alternative_frames[sequence] = [] + for amino_acid in sequence[1:-3]: + alt_frame = "" + num_position += 1 + if amino_acid == alt_start_aa or amino_acid == alt_start_aa.swapcase(): + alt_frame += sequence[num_position:] + alternative_frames[sequence].append(alt_frame) + 
num_position = 0 + return alternative_frames + + +def convert_to_nucl_acids( + sequences: (tuple[str] or list[str]), nucl_acids: str +) -> dict: + """ + Convert protein sequences to RNA or DNA sequences. + + Use the most frequent codons in human. The source - https://www.genscript.com/tools/codon-frequency-table\n + All nucleic acids (DNA and RNA) are shown in 5'-3' direction + + Arguments: + - sequences (tuple[str] or list[str]): sequences to convert + - nucl_acids (str): the nucleic acid that is preferred\n + Example: nucl_acids = "RNA" - convert to RNA\n + nucl_acids = "DNA" - convert to DNA\n + nucl_acids = "both" - convert to RNA and DNA + Return: + - dictionary: nucleic acids (str) as keys, collection of sequences (list) as values + """ + rule_of_translation = str.maketrans(dictionaries.TRANSLATION_RULE) + # add lower case pairs, because only upper case pairs are stored in dictionaries + rule_of_translation.update( + str.maketrans( + dict( + (k.lower(), v.lower()) for k, v in dictionaries.TRANSLATION_RULE.items() + ) + ) + ) + nucl_acid_seqs = {"RNA": [], "DNA": []} + for sequence in sequences: + rna_seq = sequence.translate(rule_of_translation) + if nucl_acids == "RNA": + nucl_acid_seqs["RNA"].append(rna_seq) + elif nucl_acids == "DNA": + dna_seq = rna_seq.replace("U", "T").replace("u", "t") + nucl_acid_seqs["DNA"].append(dna_seq) + elif nucl_acids == "both": + dna_seq = rna_seq.replace("U", "T").replace("u", "t") + nucl_acid_seqs["RNA"].append(rna_seq) + nucl_acid_seqs["DNA"].append(dna_seq) + if nucl_acids == "RNA": + del nucl_acid_seqs["DNA"] + if nucl_acids == "DNA": + del nucl_acid_seqs["RNA"] + return nucl_acid_seqs + + +PROTEINS_PROCEDURES_TO_FUNCTIONS = { + "search_for_motifs": search_for_motifs, + "search_for_alt_frames": search_for_alt_frames, + "convert_to_nucl_acids": convert_to_nucl_acids, + "three_one_letter_code": three_one_letter_code, + "define_molecular_weight": define_molecular_weight, +} + + +def check_and_parse_user_input( + sequences: 
(str, tuple[str] or list[str]), **kwargs +) -> tuple[dict, str]: + """ + Check if user input can be correctly processed\n + Parse sequences and arguments for desired procedure + + Arguments: + - sequences (list[str] or tuple[str]): sequences to process + - **kwargs - arguments needed to run the desired procedure + + Return: + - dictionary: a collection of procedure arguments and their values + - string: procedure name + """ + if isinstance(sequences, str): + sequences = sequences.split() + if "" in sequences or len(sequences) == 0: + raise ValueError("Empty sequence provided") + procedure = kwargs["procedure"] + if procedure not in PROTEINS_PROCEDURES_TO_FUNCTIONS.keys(): + raise ValueError("Wrong procedure") + allowed_inputs = set(dictionaries.AMINO_ACIDS.keys()) + allowed_inputs = allowed_inputs.union( + set(k.lower() for k in dictionaries.AMINO_ACIDS.keys()) + ) + if procedure == "three_one_letter_code": + allowed_inputs = allowed_inputs.union(set(dictionaries.AMINO_ACIDS.values())) + allowed_inputs = allowed_inputs.union( + set(v.lower() for v in dictionaries.AMINO_ACIDS.values()) + ) + for sequence in sequences: + allowed_inputs_seq = allowed_inputs.copy() + if procedure == "three_one_letter_code" and "-" in sequence: + allowed_inputs_seq -= set(dictionaries.AMINO_ACIDS.keys()) + allowed_inputs_seq -= set( + k.lower() for k in dictionaries.AMINO_ACIDS.keys() + ) + allowed_inputs_seq.union(set("-")) + if not set(sequence.split("-")).issubset(allowed_inputs_seq): + raise ValueError("Invalid sequence given") + else: + if not set(sequence).issubset(allowed_inputs_seq): + raise ValueError("Invalid sequence given") + procedure_arguments = {} + if procedure == "search_for_motifs": + if "motif" not in kwargs.keys(): + raise ValueError("Please provide desired motif") + procedure_arguments["motif"] = kwargs["motif"] + if "overlapping" not in kwargs.keys(): + procedure_arguments["overlapping"] = True + else: + procedure_arguments["overlapping"] = kwargs["overlapping"] 
+ elif procedure == "search_for_alt_frames": + if "alt_start_aa" not in kwargs.keys(): + procedure_arguments["alt_start_aa"] = "M" + else: + if len(kwargs["alt_start_aa"]) > 1: + raise ValueError("Invalid alternative start AA") + procedure_arguments["alt_start_aa"] = kwargs["alt_start_aa"] + elif procedure == "convert_to_nucl_acids": + if "nucl_acids" not in kwargs.keys(): + raise ValueError("Please provide desired type of nucl_acids") + if kwargs["nucl_acids"] not in {"DNA", "RNA", "both"}: + raise ValueError("Invalid nucl_acids argument") + procedure_arguments["nucl_acids"] = kwargs["nucl_acids"] + procedure_arguments["sequences"] = sequences + return procedure_arguments, procedure
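Reviewer note: `check_and_parse_user_input` returns the argument dictionary together with the procedure name, which suggests a dispatcher that looks the name up in `PROTEINS_PROCEDURES_TO_FUNCTIONS` and calls the function with the parsed arguments. The `run_protein_tools` function described in the README is not visible in this diff, so the following is only a minimal, self-contained sketch of that pattern, not the author's implementation. The names `count_residues`, `PROCEDURES`, `parse_input` and `run_tools` are hypothetical stand-ins invented for illustration.

```python
def count_residues(sequences, residue="M"):
    """Toy procedure (hypothetical): count one residue per sequence, ignoring case."""
    return {seq: seq.upper().count(residue.upper()) for seq in sequences}


# Hypothetical registry that plays the role of PROTEINS_PROCEDURES_TO_FUNCTIONS.
PROCEDURES = {"count_residues": count_residues}


def parse_input(sequences, **kwargs):
    """Toy stand-in for check_and_parse_user_input: validate, then collect arguments."""
    if isinstance(sequences, str):
        sequences = sequences.split()
    if len(sequences) == 0 or "" in sequences:
        raise ValueError("Empty sequence provided")
    procedure = kwargs.get("procedure")
    if procedure not in PROCEDURES:
        raise ValueError("Wrong procedure")
    arguments = {"sequences": sequences}
    if "residue" in kwargs:
        arguments["residue"] = kwargs["residue"]
    # Same return order as the function in the diff: arguments first, then the name.
    return arguments, procedure


def run_tools(sequences=(), **kwargs):
    """Dispatch: parse the input, look up the procedure, call it with keyword arguments."""
    arguments, procedure = parse_input(sequences, **kwargs)
    return PROCEDURES[procedure](**arguments)


print(run_tools(["mNYM", "IAA"], procedure="count_residues"))
```

Keeping validation in one function and dispatch in another means a new procedure only needs a registry entry and, if it takes extra arguments, one more branch in the parser.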