
test_megamolbart.py not working #154


Description

@mircare

Hi, I got the MegaMolBART docker container up and running with the following command:

$ docker run --name megamolbart --gpus all --rm -v $(pwd)/megamolbart_v0.1/:/models/megamolbart -v $(pwd)/shared/:/shared nvcr.io/nvidia/clara/megamolbart:latest &

I git-cloned this repository into shared/, but I can't find a way to even test the model.
In particular, I get the following error when trying to run test_megamolbart.py:

root@5f1951df00f5:/shared# mv cheminformatics/megamolbart/megamolbart/ . && mv cheminformatics/megamolbart/tests/test_megamolbart.py .
root@13b98eed1cbd:/shared# python test_megamolbart.py 
using world size: 1 and model-parallel size: 1 
using torch.float32 for parameters ...
-------------------- arguments --------------------
  adam_beta1 ...................... 0.9
  adam_beta2 ...................... 0.999
  adam_eps ........................ 1e-08
  adlr_autoresume ................. False
  adlr_autoresume_interval ........ 1000
  apply_query_key_layer_scaling ... False
  apply_residual_connection_post_layernorm  False
  attention_dropout ............... 0.1
  attention_softmax_in_fp32 ....... False
  batch_size ...................... None
  bert_load ....................... None
  bias_dropout_fusion ............. False
  bias_gelu_fusion ................ False
  block_data_path ................. None
  checkpoint_activations .......... False
  checkpoint_in_cpu ............... False
  checkpoint_num_layers ........... 1
  clip_grad ....................... 1.0
  contigious_checkpointing ........ False
  cpu_optimizer ................... False
  cpu_torch_adam .................. False
  data_impl ....................... infer
  data_path ....................... None
  dataset_path .................... None
  DDP_impl ........................ local
  deepscale ....................... False
  deepscale_config ................ None
  deepspeed ....................... False
  deepspeed_activation_checkpointing  False
  deepspeed_config ................ None
  deepspeed_mpi ................... False
  distribute_checkpointed_activations  False
  distributed_backend ............. nccl
  dynamic_loss_scale .............. True
  eod_mask_loss ................... False
  eval_interval ................... 1000
  eval_iters ...................... 100
  exit_interval ................... None
  faiss_use_gpu ................... False
  finetune ........................ False
  fp16 ............................ False
  fp16_lm_cross_entropy ........... False
  fp32_allreduce .................. False
  gas ............................. 1
  hidden_dropout .................. 0.1
  hidden_size ..................... 256
  hysteresis ...................... 2
  ict_head_size ................... None
  ict_load ........................ None
  indexer_batch_size .............. 128
  indexer_log_interval ............ 1000
  init_method_std ................. 0.02
  layernorm_epsilon ............... 1e-05
  lazy_mpu_init ................... None
  load ............................ /models/megamolbart/checkpoints
  local_rank ...................... None
  log_interval .................... 100
  loss_scale ...................... None
  loss_scale_window ............... 1000
  lr .............................. None
  lr_decay_iters .................. None
  lr_decay_style .................. linear
  make_vocab_size_divisible_by .... 128
  mask_prob ....................... 0.15
  max_position_embeddings ......... 512
  merge_file ...................... None
  min_lr .......................... 0.0
  min_scale ....................... 1
  mmap_warmup ..................... False
  model_parallel_size ............. 1
  no_load_optim ................... False
  no_load_rng ..................... False
  no_save_optim ................... False
  no_save_rng ..................... False
  num_attention_heads ............. 8
  num_layers ...................... 4
  num_unique_layers ............... None
  num_workers ..................... 2
  onnx_safe ....................... None
  openai_gelu ..................... False
  override_lr_scheduler ........... False
  param_sharing_style ............. grouped
  params_dtype .................... torch.float32
  partition_activations ........... False
  pipe_parallel_size .............. 0
  profile_backward ................ False
  query_in_block_prob ............. 0.1
  rank ............................ 0
  report_topk_accuracies .......... []
  reset_attention_mask ............ False
  reset_position_ids .............. False
  save ............................ None
  save_interval ................... None
  scaled_masked_softmax_fusion .... False
  scaled_upper_triang_masked_softmax_fusion  False
  seed ............................ 1234
  seq_length ...................... None
  short_seq_prob .................. 0.1
  split ........................... 969, 30, 1
  synchronize_each_layer .......... False
  tensorboard_dir ................. None
  titles_data_path ................ None
  tokenizer_type .................. GPT2BPETokenizer
  train_iters ..................... None
  use_checkpoint_lr_scheduler ..... False
  use_cpu_initialization .......... False
  use_one_sent_docs ............... False
  vocab_file ...................... /models/megamolbart/bart_vocab.txt
  warmup .......................... 0.01
  weight_decay .................... 0.01
  world_size ...................... 1
  zero_allgather_bucket_size ...... 0.0
  zero_contigious_gradients ....... False
  zero_reduce_bucket_size ......... 0.0
  zero_reduce_scatter ............. False
  zero_stage ...................... 1.0
---------------- end of arguments ----------------
> initializing torch distributed ...
Traceback (most recent call last):
  File "test_megamolbart.py", line 16, in <module>
    wf = MegaMolBART()
  File "/shared/megamolbart/inference.py", line 71, in __init__
    initialize_megatron(args_defaults=args, ignore_unknown_args=True)
  File "/opt/conda/lib/python3.6/site-packages/megatron/initialize.py", line 77, in initialize_megatron
    finish_mpu_init()
  File "/opt/conda/lib/python3.6/site-packages/megatron/initialize.py", line 59, in finish_mpu_init
    _initialize_distributed()
  File "/opt/conda/lib/python3.6/site-packages/megatron/initialize.py", line 156, in _initialize_distributed
    init_method=init_method)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 448, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 133, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
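
Looking at the traceback, the failure happens when torch.distributed tries to bind its TCP rendezvous port, presumably because the service the container starts on launch has already taken the default one. A minimal sketch of the workaround I'm considering, assuming Megatron builds its tcp:// init_method from the MASTER_ADDR/MASTER_PORT environment variables (I haven't verified that this alone is enough):

import os

# Assumption: megatron/initialize.py reads MASTER_ADDR / MASTER_PORT from the
# environment (default port 6000), so pointing the test at an unused port
# should avoid colliding with whatever the container's own service already
# bound. Port 6001 is an arbitrary pick, not something from the repo.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ['MASTER_PORT'] = '6001'

from megamolbart.inference import MegaMolBART  # noqa: E402

wf = MegaMolBART()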

I am interested in getting the embeddings for a bunch of molecules. Any suggestions?
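
For context, this is roughly the call pattern I'm after once the workflow initializes. smiles2embedding and its return values are my guess from skimming megamolbart/inference.py, so the exact method name and signature may differ:

# Hypothetical usage sketch; smiles2embedding and the (embedding, pad_mask)
# return value are assumptions based on my reading of inference.py.
from megamolbart.inference import MegaMolBART

wf = MegaMolBART()
smiles_list = ['CC(=O)Oc1ccccc1C(=O)O', 'c1ccccc1']  # example SMILES

embeddings = []
for smi in smiles_list:
    emb, pad_mask = wf.smiles2embedding(smi)
    embeddings.append(emb)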
