@Goosang-Yu

ProteinChain.from_rcsb makes it very convenient to fetch PDB entries. However, many entries, especially large and complex structures, contain residues that are missing from the deposited coordinates.

For instance, I fetched one of the Cas9 structures, "8G1I", using ProteinChain.from_rcsb.

from esm.utils.structure.protein_chain import ProteinChain

pdb_id = "8G1I"  # PDB ID of a Cas9 structure
chain_id = "A"  # Chain ID of the Cas9 protein in this entry
renal_dipep_chain = ProteinChain.from_rcsb(pdb_id, chain_id)

The sequence returned for 8G1I is shorter than the full amino acid sequence, because the missing residues are simply omitted.

pro_seq = renal_dipep_chain.sequence
print("Protein sequence:", pro_seq)
print("Protein length:", len(pro_seq))
Protein sequence: KYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLRKKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLFEENPINASGVDAKAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALSLGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDNLLAQIGDQYADLFLAAKNLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQLPEKYKEIFFDQSKNGYAGYIDGGASQEEFYKFIKPILEKMDGTEELLVKLNREDLLRKQRTFDNGSIPHQIHLGELHAILRRQEDFYPFLKDNREKIEKILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGASAQSFIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAFLSGEQKKAIVDLLFKTNRKVTVKQLKEDYFKKIECFDSVEIASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLTLFEDREMIEERLKTYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSDGFANRNFMQLIHDDSLTFKEDIQKAQSLHEHIANLAGSPAIKKGILQTVKVVDELVKVMGRHKPENIVIEMARNSRERMKRIEEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQELDINRLSDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNYWRQLNAKLIRKFDNLTKAERGGLSELDKAGFIKRQLVETRQITKHVAQILDSRMNTKYDENDKLIREVKVITLKSKLVSDFRKDFQFYKVREINNYHHAHDAYLNAVVGTALIKKYPKLESEFVYGDGKATAKYFFYSNIMNFFKTEIKRPLIETNGETGEIVWDKGRDFATVRKVLSMPQVNIVKKTEVQTGGFSKESILPKRNSDKLIARKKDWDPKKYGGFDSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIMERSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLASHEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENIIHLFTLTNLGAPAAFKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQ
Protein length: 1298

The atom array likewise contains no atoms for the missing residues.

renal_dipep_chain.atom_array.get_atom(1)
# Residues 1-3 are missing, so the first residue present has res_id=4
Atom(np.array([127.615, 117.513, 179.528], dtype=float32), chain_id="A", res_id=4, ins_code="", res_name="LYS", hetero=False, atom_name="CA", element="C")
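
One quick way to see where a chain has holes is to scan its residue numbers for jumps. The sketch below is a minimal standard-library illustration. `find_numbering_gaps` is a hypothetical helper, not part of ProteinChain, and it assumes that numbering starts at 1 and that every jump in numbering corresponds to missing residues, which does not hold for every PDB entry.

```python
def find_numbering_gaps(res_ids, first_expected=1):
    """Return (start, end) residue-number ranges that are absent from res_ids.

    res_ids: residue numbers, one per atom or one per residue.
    first_expected: residue number the chain is assumed to start at.
    """
    gaps = []
    previous = first_expected - 1
    for res_id in sorted(set(res_ids)):
        if res_id > previous + 1:
            # Everything between the last seen residue and this one is absent.
            gaps.append((previous + 1, res_id - 1))
        previous = res_id
    return gaps


# Toy input: per-atom residue numbers for a chain whose first observed residue is 4.
example_res_ids = [4, 4, 4, 5, 5, 6, 10, 11]
print(find_numbering_gaps(example_res_ids))  # [(1, 3), (7, 9)]
```

Passing the `res_id` values of the atom array shown above should flag residues 1-3 in the same way. Residues missing after the last observed one cannot be detected from numbering alone.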

This can be resolved by using pdbfixer to find the missing residues.

import pdbfixer

fixer = pdbfixer.PDBFixer(pdbid="8G1I")

# PDBFixer operations
fixer.findNonstandardResidues()
fixer.replaceNonstandardResidues()
fixer.findMissingResidues()
fixer.findMissingAtoms()

print("Missing Residues:", fixer.missingResidues)
Missing Residues: {(1, 0): ['MET', 'ASP', 'LYS'], (1, 577): ['SER', 'GLY', 'VAL', 'GLU', 'ASP', 'ARG', 'PHE', 'ASN'], (1, 701): ['VAL', 'SER', 'GLY', 'GLN', 'GLY', 'ASP'], (1, 748): ['GLU', 'ASN', 'GLN', 'THR', 'THR', 'GLN', 'LYS', 'GLY', 'GLN', 'LYS'], (1, 859): ['LEU'], (1, 864): ['THR', 'GLN'], (1, 982): ['TYR', 'LYS', 'VAL', 'TYR', 'ASP', 'VAL', 'ARG', 'LYS', 'MET', 'ILE', 'ALA', 'LYS', 'SER', 'GLU', 'GLN', 'GLU', 'ILE'], (1, 1003): ['THR', 'LEU', 'ALA', 'ASN', 'GLY', 'GLU', 'ILE', 'ARG'], (1, 1186): ['TYR', 'GLU', 'LYS', 'LEU', 'LYS', 'GLY', 'SER', 'PRO', 'GLU', 'ASP', 'ASN'], (1, 1298): ['LEU', 'GLY', 'GLY', 'ASP', 'PRO', 'LYS', 'LYS', 'LYS', 'ARG', 'LYS', 'VAL', 'MET', 'ASP', 'LYS', 'HIS', 'HIS', 'HIS', 'HIS', 'HIS', 'HIS'], (3, 0): ['DC', 'DC', 'DA', 'DG', 'DT', 'DG', 'DC', 'DG', 'DT', 'DA', 'DT', 'DA', 'DC', 'DC', 'DA', 'DG', 'DC', 'DA', 'DA', 'DA', 'DA', 'DC', 'DA', 'DC', 'DT', 'DC', 'DC']}
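
The calls above only locate the gaps. Each key of `missingResidues` is a `(chain index, position in that chain's observed residue list)` pair marking where the listed residues would be inserted. To build the residues and save the result, PDBFixer also provides `addMissingAtoms()`, and OpenMM's `PDBFile.writeFile` can write the structure out. The following is a minimal sketch of that workflow and not necessarily how this PR implements the option. The function name `write_fixed_pdb` and its arguments are assumptions, and the imports sit inside the function so that the definition loads even where pdbfixer is not installed.

```python
def write_fixed_pdb(pdb_id, out_path):
    """Download a PDB entry, model its missing residues and atoms, and save it.

    Returns a copy of the missing-residue report found before fixing.
    """
    # Imported lazily; calling this function requires pdbfixer and openmm.
    from pdbfixer import PDBFixer
    from openmm.app import PDBFile

    fixer = PDBFixer(pdbid=pdb_id)  # fetches the entry from RCSB
    fixer.findMissingResidues()
    missing = dict(fixer.missingResidues)  # keep the report for the caller
    fixer.findNonstandardResidues()
    fixer.replaceNonstandardResidues()
    fixer.findMissingAtoms()
    fixer.addMissingAtoms()  # builds the missing residues and atoms

    with open(out_path, "w") as handle:
        # keepIds=True preserves the original chain IDs and residue numbers.
        PDBFile.writeFile(fixer.topology, fixer.positions, handle, keepIds=True)
    return missing
```

Coordinates of the added residues are modeled by PDBFixer and are not experimental observations.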

I have added this functionality to ProteinChain.from_rcsb as an option, fix_pdb. It can be used as follows:

pdb_id = "8G1I"  # PDB ID of a Cas9 structure
chain_id = "A"  # Chain ID of the Cas9 protein in this entry
renal_dipep_chain = ProteinChain.from_rcsb(pdb_id, chain_id, fix_pdb=True)

The resulting protein chain includes the previously missing residues, so the sequence now has its full length.

pro_seq = renal_dipep_chain.sequence
print("Protein sequence:", pro_seq)
print("Protein length:", len(pro_seq))
Protein sequence: MDKKYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLRKKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLFEENPINASGVDAKAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALSLGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDNLLAQIGDQYADLFLAAKNLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQLPEKYKEIFFDQSKNGYAGYIDGGASQEEFYKFIKPILEKMDGTEELLVKLNREDLLRKQRTFDNGSIPHQIHLGELHAILRRQEDFYPFLKDNREKIEKILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGASAQSFIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAFLSGEQKKAIVDLLFKTNRKVTVKQLKEDYFKKIECFDSVEISGVEDRFNASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLTLFEDREMIEERLKTYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSDGFANRNFMQLIHDDSLTFKEDIQKAQVSGQGDSLHEHIANLAGSPAIKKGILQTVKVVDELVKVMGRHKPENIVIEMARENQTTQKGQKNSRERMKRIEEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQELDINRLSDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNYWRQLLNAKLITQRKFDNLTKAERGGLSELDKAGFIKRQLVETRQITKHVAQILDSRMNTKYDENDKLIREVKVITLKSKLVSDFRKDFQFYKVREINNYHHAHDAYLNAVVGTALIKKYPKLESEFVYGDYKVYDVRKMIAKSEQEIGKATAKYFFYSNIMNFFKTEITLANGEIRKRPLIETNGETGEIVWDKGRDFATVRKVLSMPQVNIVKKTEVQTGGFSKESILPKRNSDKLIARKKDWDPKKYGGFDSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIMERSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLASHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENIIHLFTLTNLGAPAAFKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQLGGDPKKKRKVMDKHHHHHH
Protein length: 1384
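
The difference between the two lengths can be checked against the PDBFixer report. The snippet below is a standalone sanity check: the dictionary is copied from the chain-index-1 entries of the `missingResidues` output shown earlier (that index appears to correspond to chain A, since the totals agree), and the variable names are illustrative only.

```python
# Chain-index-1 entries from the PDBFixer report: insert position -> residues.
missing_chain_1 = {
    0: ["MET", "ASP", "LYS"],
    577: ["SER", "GLY", "VAL", "GLU", "ASP", "ARG", "PHE", "ASN"],
    701: ["VAL", "SER", "GLY", "GLN", "GLY", "ASP"],
    748: ["GLU", "ASN", "GLN", "THR", "THR", "GLN", "LYS", "GLY", "GLN", "LYS"],
    859: ["LEU"],
    864: ["THR", "GLN"],
    982: ["TYR", "LYS", "VAL", "TYR", "ASP", "VAL", "ARG", "LYS", "MET",
          "ILE", "ALA", "LYS", "SER", "GLU", "GLN", "GLU", "ILE"],
    1003: ["THR", "LEU", "ALA", "ASN", "GLY", "GLU", "ILE", "ARG"],
    1186: ["TYR", "GLU", "LYS", "LEU", "LYS", "GLY", "SER", "PRO", "GLU",
           "ASP", "ASN"],
    1298: ["LEU", "GLY", "GLY", "ASP", "PRO", "LYS", "LYS", "LYS", "ARG", "LYS",
           "VAL", "MET", "ASP", "LYS", "HIS", "HIS", "HIS", "HIS", "HIS", "HIS"],
}

observed_length = 1298  # sequence length without fix_pdb
n_missing = sum(len(residues) for residues in missing_chain_1.values())

print("Missing residues:", n_missing)  # 86
print("Expected full length:", observed_length + n_missing)  # 1384
```

The 86 missing residues plus the 1298 observed ones give the 1384 residues reported with fix_pdb=True.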

Atom information can also be retrieved with the missing residues filled in by the fixer.

renal_dipep_chain.atom_array.get_atom(1)
Atom(np.array([125.775, 112.234, 187.093], dtype=float32), chain_id="A", res_id=1, ins_code="", res_name="MET", hetero=False, atom_name="CA", element="C")
