I noticed that the CAKLD coefficient is computed without a mask (https://github.com/DD-DuDa/BitDistiller/blob/main/train/train.py#L375), so the mean is taken over every position rather than only the real input tokens. Have you tried applying the mask so that the mean is computed over the real input part only?
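
To make the question concrete, here is a minimal sketch of the masked mean I have in mind. It is not taken from the repo: `masked_mean_prob`, the `probs`/`labels` tensors, and the use of `-100` as the ignore index are my own assumptions for illustration, and the actual variable names and shapes in `train.py` may differ.

```python
import torch


def masked_mean_prob(
    probs: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100
) -> torch.Tensor:
    """Average per-token probabilities over non-ignored positions only.

    probs:  (batch, seq_len) probability assigned to each target token
            (hypothetical input, assumed to be gathered beforehand).
    labels: (batch, seq_len) target ids, with `ignore_index` marking
            positions that should not count (assumed convention).
    """
    # 1.0 where the position is a real token, 0.0 where it is ignored
    mask = (labels != ignore_index).to(probs.dtype)
    # Sum over real tokens only; clamp avoids division by zero if all are masked
    return (probs * mask).sum() / mask.sum().clamp(min=1)


# Toy batch: 3 real tokens, 3 ignored positions
probs = torch.tensor([[0.9, 0.5, 0.1], [0.8, 0.2, 0.3]])
labels = torch.tensor([[5, 7, -100], [3, -100, -100]])

unmasked = probs.mean()                   # averages all 6 positions
masked = masked_mean_prob(probs, labels)  # averages the 3 real tokens only
print(f"unmasked={unmasked.item():.4f} masked={masked.item():.4f}")
```

In this toy batch the two means differ noticeably, which is why I am wondering whether the coefficient changes much once the ignored positions are excluded.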