Replace all variable-size loops inside kernels with fixed-size loops plus a guard condition,
e.g.
for (int i = nlevsno-snl; i < nlevsno; ++i) {...};
should be
for (int i = 0; i < nlevsno; ++i) { if (i >= nlevsno-snl) ... };
This change targets excessive thread divergence. Fully addressing that divergence will ultimately require a packing/masking approach, which would allow much of the branching logic to be removed from the kernels.
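A minimal host-side sketch of the three loop forms, to make the rewrite concrete. It is an illustration under stated assumptions, not code from the actual kernels: the value of nlevsno, the function names sum_variable, sum_guarded and sum_masked, and the summation in the loop body are all hypothetical, and the multiply-by-mask form is only one possible reading of "masking".

```cpp
#include <array>
#include <cassert>

// Assumed value for illustration only: maximum number of snow layers.
constexpr int nlevsno = 5;

using Layers = std::array<double, nlevsno>;

// Original form: the trip count depends on snl, which can differ per thread.
inline double sum_variable(const Layers& x, int snl) {
  assert(snl >= 0 && snl <= nlevsno);
  double total = 0.0;
  for (int i = nlevsno - snl; i < nlevsno; ++i) {
    total += x[i];
  }
  return total;
}

// Fixed-size form: every thread runs nlevsno iterations; a guard selects
// the active layers.
inline double sum_guarded(const Layers& x, int snl) {
  assert(snl >= 0 && snl <= nlevsno);
  double total = 0.0;
  for (int i = 0; i < nlevsno; ++i) {
    if (i >= nlevsno - snl) {
      total += x[i];
    }
  }
  return total;
}

// Masked form (hypothetical): the guard becomes a 0/1 factor, so the loop
// body itself contains no if-statement.
inline double sum_masked(const Layers& x, int snl) {
  assert(snl >= 0 && snl <= nlevsno);
  double total = 0.0;
  for (int i = 0; i < nlevsno; ++i) {
    const double mask = (i >= nlevsno - snl) ? 1.0 : 0.0;
    total += mask * x[i];
  }
  return total;
}
```

All three forms give the same result for any snl between 0 and nlevsno. The masked form assumes inactive layers hold finite values: a NaN or infinity in an inactive slot would propagate through the multiplication, whereas the guarded form skips it.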