Gradient normalization lowers the maximum learning rate at which training converges.

I ran into this problem while training ResNet-18 on CIFAR-100 for an experiment. I haven't looked into it enough yet to find out what the cause is.
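For reference, here is a minimal sketch of what I mean by gradient normalization, assuming it means rescaling all gradients by their global L2 norm before the SGD update. This is dependency-free Python rather than my actual training code; the names `normalize_gradients` and `sgd_step` and the `eps` value are made up for illustration.

```python
import math

def normalize_gradients(grads, eps=1e-12):
    """Rescale a list of gradient vectors so their global L2 norm is ~1.

    Assumption: "gradient normalization" means dividing every gradient
    by the L2 norm taken over all parameters together.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = 1.0 / (total_norm + eps)  # eps guards against division by zero
    return [[g * scale for g in vec] for vec in grads]

def sgd_step(params, grads, lr, normalize=False):
    """One plain SGD update, optionally on normalized gradients."""
    if normalize:
        grads = normalize_gradients(grads)
    return [[p - lr * g for p, g in zip(pv, gv)]
            for pv, gv in zip(params, grads)]

# Toy values (hypothetical): two parameter groups, global gradient norm = 5.
params = [[1.0, 2.0], [3.0]]
grads = [[3.0, 0.0], [4.0]]

plain = sgd_step(params, grads, lr=0.1)                   # step length = lr * 5
normed = sgd_step(params, grads, lr=0.1, normalize=True)  # step length = lr
```

Under this definition the length of each update equals the learning rate regardless of how large or small the raw gradient is. I have not checked whether that is related to the behavior described above.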