Why can I only replicate the whole model on each GPU, rather than use model parallelism, where different parts of the model are distributed across the GPUs? I followed the instructions on the webpage.