Conversation

@deven-amd

Cutting-n-pasting the three commit messages here in lieu of a description.

@ekuznetsov139, this PR essentially adds the `_LogSessionRunHook` class from your PR #11 (albeit with some tweaks). Please review. I will also be filing another PR soon (within a day or two) which pulls in all the AMP-related changes from that PR. Once both of those PRs are merged, PR #11 can be updated to be XLA-specific.

@c0redumb @micmelesse @xdgarrido please review and merge


Add a `--num_report_steps` option to specify reporting frequency.

Currently, the following information

  • `global_step/sec` and
  • `examples/sec`

gets displayed (and recorded via the summary writer) after every step.

The `--num_report_steps=N` option allows the user to specify the frequency (i.e. every N steps) with which this information is displayed and recorded.
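As an illustration of the flag's shape, here is a minimal sketch using `argparse`. This is an assumption for clarity only: the actual script most likely defines the flag through `tf.flags`/`absl.flags`, and the default value shown here is made up.

```python
import argparse

# Hypothetical stand-in for the real flag definition; the script itself
# presumably uses tf.flags / absl.flags with its own default.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--num_report_steps", type=int, default=10,
    help="Report throughput and loss metrics every N training steps.")

# Parse an example command line to show the flag in use.
args = parser.parse_args(["--num_report_steps=100"])
print(args.num_report_steps)  # 100
```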


Enable printing and recording of throughput + loss on a periodic basis

This commit adds the ability to report (i.e. display to stdout) the following information on a periodic basis:

  • step number
  • throughput
  • `total_loss`
  • `mlm_loss`
  • `nsp_loss`
  • `learning_rate`

The frequency of the reporting is specified via the `--num_report_steps` option.

Currently, only the `throughput` and `total_loss` values get recorded (to the trace-events file meant for TensorBoard consumption).

Note that `throughput` is the same as `examples/sec` and `total_loss` is the same as `loss`, both of which are already reported and recorded via the `TPUEstimator` implementation.

The `LogSessionRunHook` class is based on a similar class in the NVBERT implementation. It can be easily enhanced to report and record additional variables.
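The core of the hook is simple step-counting and throughput arithmetic. Below is a framework-agnostic sketch of that logic; the class name, method names, and structure are assumptions for illustration, since the real `LogSessionRunHook` subclasses TensorFlow's `SessionRunHook` and fetches the loss tensors via `before_run`/`after_run`.

```python
import time

class PeriodicReporter:
    """Hypothetical sketch of the periodic-reporting logic inside a
    LogSessionRunHook-style class (not the actual implementation)."""

    def __init__(self, batch_size, num_report_steps):
        self.batch_size = batch_size
        self.num_report_steps = num_report_steps
        self.step = 0
        self.t0 = time.perf_counter()

    def after_step(self, total_loss):
        """Called once per training step. Returns a report dict every
        num_report_steps steps, otherwise None."""
        self.step += 1
        if self.step % self.num_report_steps != 0:
            return None
        elapsed = time.perf_counter() - self.t0
        # examples processed since the last report, per second
        throughput = self.num_report_steps * self.batch_size / elapsed
        self.t0 = time.perf_counter()  # reset the window
        return {"step": self.step,
                "throughput": throughput,
                "total_loss": total_loss}
```

In the real hook the returned values would be printed to stdout and written through a summary writer so TensorBoard can pick them up.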


Disable the log messages from being printed twice.

Currently, all messages output via `tf.compat.v1.logging.info` get printed twice. For example:

```
INFO:tensorflow:**** Trainable Variables ****
I0610 12:40:23.553335 139787876316928 run_pretraining.py:256] **** Trainable Variables ****
```

Setting the `propagate` flag in the logger to `False` prevents this. For the above example, only one line will be printed:

```
INFO:tensorflow:**** Trainable Variables ****
```

This makes the output log file more readable.
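The duplication happens because `tf.compat.v1.logging` writes through the standard Python logger named `"tensorflow"`, whose records also propagate up to the root logger's handler. A minimal sketch of the fix, using only the standard `logging` module:

```python
import logging

# tf.compat.v1.logging routes through the Python logger named
# "tensorflow". It already has its own handler, so stopping propagation
# to the root logger removes the duplicate copy of each message.
tf_logger = logging.getLogger("tensorflow")
tf_logger.propagate = False
```

The exact line in the script may differ (e.g. it may grab the logger via `tf.get_logger()`), but the effect is the same: each record is emitted once, by the TensorFlow logger's own handler.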



@c0redumb c0redumb left a comment


Looks good to me

@c0redumb c0redumb merged commit 867f98c into master Jun 11, 2020