@dependabot dependabot bot commented on behalf of github Apr 16, 2021

Bumps pytorch-lightning from 1.0.3 to 1.2.8.

Release notes

Sourced from pytorch-lightning's releases.

Standard weekly patch release

[1.2.8] - 2021-04-14

Added

  • Added TPUSpawn + IterableDataset error message (#6875)

Fixed

  • Fixed process rank not being available right away after Trainer instantiation (#6941)
  • Fixed sync_dist for tpus (#6950)
  • Fixed AttributeError for require_backward_grad_sync when running manual optimization with sharded plugin (#6915)
  • Fixed --gpus default for parser returned by Trainer.add_argparse_args (#6898)
  • Fixed TPU Spawn all gather (#6896)
  • Fixed EarlyStopping logic when min_epochs or min_steps requirement is not met (#6705)
  • Fixed csv extension check (#6436)
  • Fixed checkpoint issue when using Horovod distributed backend (#6958)
  • Fixed tensorboard exception raising (#6901)
  • Fixed setting the eval/train flag correctly on accelerator model (#6983)
  • Fixed DDP_SPAWN compatibility with bug_report_model.py (#6892)
  • Fixed bug where BaseFinetuning.flatten_modules() was duplicating leaf node parameters (#6879)
  • Set better defaults for rank_zero_only.rank when training is launched with SLURM and torchelastic:
    • Support SLURM and torchelastic global rank environment variables (#5715)
    • Remove hardcoding of local rank in accelerator connector (#6878)

Contributors

@ananthsub @awaelchli @ethanwharris @justusschock @kandluis @kaushikb11 @liob @SeanNaren @skmatz

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Standard weekly patch release

[1.2.7] - 2021-04-06

Fixed

  • Fixed a bug with omegaconf and xm.save (#6741)
  • Fixed an issue with IterableDataset when len is not defined (#6828)
  • Sanitize None params during pruning (#6836)
  • Enforce an epoch scheduler interval when using SWA (#6588)
  • Fixed TPU Colab hang issue, post training (#6816)
  • Fixed a bug where TensorBoardLogger would give a warning and not log correctly to a symbolic link save_dir (#6730)

Contributors

@awaelchli, @ethanwharris, @karthikprasad, @kaushikb11, @mibaumgartner, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

... (truncated)

Changelog

Sourced from pytorch-lightning's changelog.

[1.2.8] - 2021-04-14

Added

  • Added TPUSpawn + IterableDataset error message (#6875)

Fixed

  • Fixed process rank not being available right away after Trainer instantiation (#6941)
  • Fixed sync_dist for tpus (#6950)
  • Fixed AttributeError for require_backward_grad_sync when running manual optimization with sharded plugin (#6915)
  • Fixed --gpus default for parser returned by Trainer.add_argparse_args (#6898)
  • Fixed TPU Spawn all gather (#6896)
  • Fixed EarlyStopping logic when min_epochs or min_steps requirement is not met (#6705)
  • Fixed csv extension check (#6436)
  • Fixed checkpoint issue when using Horovod distributed backend (#6958)
  • Fixed tensorboard exception raising (#6901)
  • Fixed setting the eval/train flag correctly on accelerator model (#6983)
  • Fixed DDP_SPAWN compatibility with bug_report_model.py (#6892)
  • Fixed bug where BaseFinetuning.flatten_modules() was duplicating leaf node parameters (#6879)
  • Set better defaults for rank_zero_only.rank when training is launched with SLURM and torchelastic:
    • Support SLURM and torchelastic global rank environment variables (#5715)
    • Remove hardcoding of local rank in accelerator connector (#6878)

[1.2.7] - 2021-04-06

Fixed

  • Fixed a bug with omegaconf and xm.save (#6741)
  • Fixed an issue with IterableDataset when len is not defined (#6828)
  • Sanitize None params during pruning (#6836)
  • Enforce an epoch scheduler interval when using SWA (#6588)
  • Fixed TPU Colab hang issue, post training (#6816)
  • Fixed a bug where TensorBoardLogger would give a warning and not log correctly to a symbolic link save_dir (#6730)

[1.2.6] - 2021-03-30

Changed

  • Changed the behavior of on_epoch_start to run at the beginning of validation & test epoch (#6498)

Removed

  • Removed legacy code to include step dictionary returns in callback_metrics. Use self.log_dict instead. (#6682)

Fixed

  • Fixed DummyLogger.log_hyperparams raising a TypeError when running with fast_dev_run=True (#6398)

... (truncated)

Commits

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

@dependabot dependabot bot added the dependencies label (Pull requests that update a dependency file) Apr 16, 2021

dependabot bot commented on behalf of github Apr 24, 2021

Superseded by #18.

@dependabot dependabot bot closed this Apr 24, 2021
@dependabot dependabot bot deleted the dependabot/pip/python/requirements/pytorch-lightning-1.2.8 branch April 24, 2021 07:03
ijrsvt pushed a commit that referenced this pull request Aug 16, 2022
We encountered a SIGSEGV when running the Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is:

```
#0  0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) ()
   from /lib64/libstdc++.so.6
#1  0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#2  0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#3  0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#4  0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] ()
   from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#5  0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#6  0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#7  0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#8  0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#9  0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2
#11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2
#12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6
#14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2
#15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>)
    at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369
```

The root cause is that when loading `_raylet.so`, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. However, the static variables declared in `logging.cc` are not initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`).

It's better not to rely on the initialization order of static variables in different compilation units because it's not guaranteed. I propose to change all `RAY_LOG`s to `std::cerr` in `DelayManager::Init()`.
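
A minimal sketch of the proposed change, not the actual Ray code: the `DelayManager` below is a simplified, hypothetical stand-in, and the `enabled()` accessor exists only for illustration. It reports through `std::cerr` during static initialization instead of going through a logger whose own statics live in another compilation unit. Including `<iostream>` before the static object is defined means the standard streams are already initialized when `Init()` runs.

```cpp
#include <cstdlib>
#include <iostream>

namespace {

// Hypothetical, simplified stand-in for the DelayManager described above.
class DelayManager {
 public:
  DelayManager() { Init(); }

  // Illustration-only accessor: true if the environment variable was set.
  bool enabled() const { return enabled_; }

 private:
  void Init() {
    const char *delay_env = std::getenv("RAY_testing_asio_delay_us");
    if (delay_env == nullptr) {
      return;
    }
    enabled_ = true;
    // std::cerr does not depend on statics defined in other compilation
    // units, so it is usable here even though this runs while the shared
    // library is still being loaded.
    std::cerr << "RAY_testing_asio_delay_us is set to " << delay_env << std::endl;
  }

  bool enabled_ = false;
};

// Dynamically initialized at load time, before main() (or during dlopen).
static DelayManager _delay_manager;

}  // namespace
```

The trade-off is that messages printed this way bypass the normal log sink and formatting, which is probably acceptable for a testing-only code path.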

The crash happens in Ant's internal codebase. Not sure why this test case passes in the community version though.

BTW, I've tried different approaches:

1. Using a static local variable in `get_delay_us` and removing the global variable. This doesn't work because `init()` needs to access the variable as well.
2. Defining the global variable as type `std::unique_ptr<DelayManager>` and initializing it in `get_delay_us`. This works, but it requires a lock to be thread-safe.
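
For reference, the second approach could look roughly like the sketch below. It is written under stated assumptions rather than taken from the Ray sources: the `DelayManager` here is a hypothetical reduction that parses the environment variable as a single integer, and `get_delay_us` takes no arguments, both of which are simplifications. The point is only to show where the lock is needed.

```cpp
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <mutex>

// Hypothetical, simplified DelayManager: treats the environment variable
// as one integer delay in microseconds (an assumption for this sketch).
class DelayManager {
 public:
  DelayManager() {
    const char *delay_env = std::getenv("RAY_testing_asio_delay_us");
    if (delay_env != nullptr) {
      delay_us_ = std::strtoll(delay_env, nullptr, 10);
    }
  }

  std::int64_t delay_us() const { return delay_us_; }

 private:
  std::int64_t delay_us_ = 0;
};

// Both globals have constexpr constructors, so they are constant-initialized
// and do not take part in the dynamic initialization order problem.
std::unique_ptr<DelayManager> delay_manager;
std::mutex delay_manager_mutex;

std::int64_t get_delay_us() {
  // The lock makes the lazy construction safe if several threads call
  // get_delay_us() for the first time concurrently.
  std::lock_guard<std::mutex> guard(delay_manager_mutex);
  if (!delay_manager) {
    delay_manager.reset(new DelayManager());
  }
  return delay_manager->delay_us();
}
```

Because the `DelayManager` is constructed on first use rather than at load time, any logging it does happens after all static initializers have run. The cost is a mutex acquisition on every call.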