
Question about generating IDRs from EvoDiff-Seq #48

@zhang-bo-lilly

Hello, I am having trouble running the example in the "Generating intrinsically disordered regions" section of the README file.

Per #41, I downloaded the required dataset from https://zenodo.org/records/5146063, extracted human_idr_homologues.zip, and saved the contents as the human_protein_alignments directory, so the directory layout looks like this:

data/
├── blosum62-special-MSA.mat
├── human_idr_alignments
│   ├── human_idr_boundaries_gap.tsv
│   ├── human_idr_boundaries.tsv
│   └── human_protein_alignments
│       ├── HUMAN00009_1to68.fasta
│       ├── HUMAN00009_633to749.fasta
│       ├── HUMAN00009_92to145.fasta
...

From the root directory of the repository, I ran the following commands and got this output:

export AMLT_OUTPUT_DIR=./test_output
python evodiff/conditional_generation_msa.py --model-type msa_oa_dm_maxsub --cond-task idr --num-seqs 1 --amlt
INDEX FILE LEN 10634
Traceback (most recent call last):
  File "/home/xxxxx/evodiff/evodiff/conditional_generation_msa.py", line 1065, in <module>
    main()
  File "/home/xxxxx/evodiff/evodiff/conditional_generation_msa.py", line 150, in main
    src, start_idx, end_idx, original_msa, num_sequences, b_src, b_start_idx, b_end_idx, oma_id = get_IDR_MSAs(index_file, data_top_dir,
  File "/home/xxxxx/evodiff/evodiff/conditional_generation_msa.py", line 826, in get_IDR_MSAs
    msa_data, new_start_idx, new_end_idx, num_sequences, b_start_idx, b_end_idx, oma_id = subsample_IDR_MSA(index_file, tokenizer, max_seq_len=max_seq_len, n_sequences=n_sequences,
  File "/home/xxxxx/evodiff/evodiff/conditional_generation_msa.py", line 893, in subsample_IDR_MSA
    query_idx = [i for i, name in enumerate(msa_names) if name == row['OMA_ID']][0]  # get query index
IndexError: list index out of range

I stepped through the code with pdb and found the following:

(Pdb) p index_file.loc[index]
OMA_ID                                                HUMAN04185
UNIPROT_ID                                                Q96K76
START                                                        424
END                                                          479
IDR_SEQ        EDEKSPQTESCTDSGAENEGSCHSDQMSNDFSNDDGVDEGICLETN...
LENGTHS                                                       55
GAP START                                                    997
GAP END                                                     1141
GAP LENGTHS                                                  144
(Pdb) p row['OMA_ID']
'HUMAN04185'
(Pdb) p [file for i, file in enumerate(all_files) if 'HUMAN04185' in file]
['HUMAN04185_1to38.fasta', 'HUMAN04185_424to479.fasta', 'HUMAN04185_839to1026.fasta']
(Pdb) aa, bb=parse_fasta(data_dir + 'human_protein_alignments/HUMAN04185_1to38.fasta', return_names=True)
(Pdb) bb
['BRAFL21358 0 to 5', 'EPTBU02539 0 to 0', 'LEPOC10560 3 to 40', 'ANATE13683 3 to 20', 'SERDU25819 0 to 11', 'SCOMX25917 1 to 18', 'GASAC17394 1 to 37', 'TAKRU19760 1 to 40', 'TETNG11216 1 to 37', 'ORYLA12382 1 to 37', 'ORYME02443 0 to 14', 'NOTFU11912 3 to 20', 'CYPVA13923 3 to 20', 'POEFO06820 1 to 37', 'XIPMA06130 3 to 20', 'ORENI17527 1 to 38', 'AMPOC21119 3 to 20', 'HIPCM02252 3 to 20', 'GADMO19517 1 to 38', 'ASTMX08999 5 to 38', 'PYGNA16253 0 to 12', 'ICTPU01019 9 to 31', 'DANRE39301 3 to 20', 'LATCH10026 1 to 38', 'ORNAN18050 0 to 26', 'PROCA13584 0 to 25', 'LOXAF12537 1 to 39', 'ECHTE14028 0 to 25', 'RABIT01068 1 to 38', 'OCHPR15109 0 to 25', 'DIPOR05931 0 to 0', 'FUKDA04471 0 to 5', 'HETGA12775 0 to 25', 'CAVAP13955 0 to 17', 'CAVPO05047 0 to 25', 'CHILA04061 1 to 38', 'OCTDE12798 1 to 38', 'JACJA01745 0 to 25', 'CRIGR16916 1 to 38', 'MOUSE45885 1 to 18', 'RATNO01797 1 to 38', 'NANGA02552 1 to 38', 'CERAT32976 1 to 13', 'CHLSB00649 1 to 18', 'MACFA09490 1 to 13', 'MACMU07436 1 to 38', 'MACNE29351 1 to 13', 'MANLE36987 1 to 13', 'PAPAN05860 0 to 25', 'COLAP32362 1 to 13', 'RHIBE07503 1 to 13', 'RHIRO33601 0 to 0', 'GORGO03243 0 to 6', 'HUMAN04185 1 to 38', 'PANPA06196 0 to 0', 'PANTR02333 0 to 0', 'PONAB01347 1 to 38', 'NOMLE01511 1 to 18', 'AOTNA04675 1 to 13', 'SAIBB00262 1 to 13', 'TARSY11018 0 to 25', 'PROCO03960 1 to 13', 'OTOGA19308 0 to 25', 'TUPBE14316 0 to 0', 'CANLF08543 0 to 12', 'VULVU21503 0 to 0', 'MUSPF13712 0 to 24', 'AILME06514 0 to 39', 'URSAM01994 0 to 0', 'URSMA27578 0 to 12', 'FELCA11798 1 to 39', 'TURTR04946 0 to 21', 'BOVIN04360 0 to 38', 'SHEEP06239 0 to 39', 'PIGXX17664 1 to 38', 'VICPA03255 0 to 25', 'PTEVA15708 0 to 25', 'MYOLU05549 1 to 39', 'ERIEU12752 0 to 21', 'HORSE18107 0 to 25', 'DASNO16007 0 to 38', 'CHOHO10481 0 to 5', 'SARHA06263 1 to 18', 'MONDO10274 1 to 38', 'MACEU07613 0 to 25', 'PHACI02145 1 to 38', 'ANAPL07288 0 to 25', 'MELGA10549 0 to 25', 'CHICK11008 0 to 43', 'FICAL13955 0 to 0', 'TAEGU16862 0 to 25', 
'CHRPI18449 1 to 38', 'SPHPU04621 0 to 6', 'ANOCA16740 1 to 38', 'XENTR16027 0 to 24', 'CIOSA04555 0 to 0', 'STRPU17710 1 to 56', 'STRMM09003 1 to 20', 'DAPPU07360 0 to 0', 'ORCCI04184 11 to 56', 'DROPE01541 1 to 2', 'DROPS09123 1 to 2', 'LUCCU03187 10 to 33', 'CULSO18336 1 to 4', 'ANOGA02647 1 to 4', 'AEDAE08107 1 to 4', 'CULQU04626 1 to 13', 'APIME11570 0 to 5', 'BOMIM10786 0 to 5', 'LINHU12916 0 to 5', 'OOCBI04348 0 to 5', 'CAMFO12507 0 to 5', 'ATTCE04431 0 to 5', 'SOLIN10701 0 to 0', 'HARSA07974 0 to 5', 'RHOPR10225 0 to 0', 'PEDHC04140 31 to 113', 'ZOONE05774 1 to 20', 'LINUN26257 1 to 18', 'CRAGI03987 1 to 61', 'OCTBM24223 1 to 18', 'NEMVE01956 1 to 18', 'HYDVU05760 1 to 12', 'AMPQE22746 9 to 63']
(Pdb) [i for i, name in enumerate(bb) if name == 'HUMAN04185']
[]
(Pdb) [i for i, name in enumerate(bb) if 'HUMAN04185' in name]
[53]
(Pdb) aa, bb=parse_fasta(data_dir + 'human_protein_alignments/HUMAN04185_424to479.fasta', return_names=True)
(Pdb) [i for i, name in enumerate(bb) if name == 'HUMAN04185']
[]
(Pdb) aa, bb=parse_fasta(data_dir + 'human_protein_alignments/HUMAN04185_839to1026.fasta', return_names=True)
(Pdb) [i for i, name in enumerate(bb) if name == 'HUMAN04185']
[]

It seems to me that

query_idx = [i for i, name in enumerate(msa_names) if name == row['OMA_ID']][0] # get query index

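needs to be changed to use row['OMA_ID'] in name instead of name == row['OMA_ID'], because the sequence names in the FASTA files carry a range suffix (for example 'HUMAN04185 1 to 38'), so the exact comparison never matches. Is this correct?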
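
For illustration, here is a minimal sketch of one way the lookup could be made robust to the suffix. It is not the repository's code: find_query_index is a hypothetical helper, and it assumes each header starts with the OMA ID followed by whitespace, as in the pdb output above. Comparing the first token rather than using a substring test avoids accidental partial matches and gives a clearer error than IndexError when the ID is missing.

```python
def find_query_index(msa_names, oma_id):
    """Return the index of the first header whose leading token equals oma_id.

    Hypothetical helper: assumes headers look like 'HUMAN04185 1 to 38',
    i.e. the OMA ID followed by a whitespace-separated range suffix.
    """
    for i, name in enumerate(msa_names):
        tokens = name.split()
        if tokens and tokens[0] == oma_id:
            return i
    raise ValueError(f"{oma_id!r} not found among {len(msa_names)} MSA headers")


# Three headers copied from the pdb session above.
names = ['GORGO03243 0 to 6', 'HUMAN04185 1 to 38', 'PANPA06196 0 to 0']
print(find_query_index(names, 'HUMAN04185'))  # prints 1
```

With a helper like this, the failing line could be written as query_idx = find_query_index(msa_names, row['OMA_ID']), assuming msa_names holds the headers returned by parse_fasta.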
