Make kernel loops more GPU-friendly

Replace all variable-size loops inside kernels with fixed-size loops plus a conditional guard, so that every thread executes the same number of iterations.

e.g.
`for (int i = nlevsno-snl; i < nlevsno; ++i) { ... }`
should be
`for (int i = 0; i < nlevsno; ++i) { if (i >= nlevsno-snl) { ... } }`
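
To make the equivalence concrete, here is a minimal host-side sketch of the two loop forms. The function names and the summation body are hypothetical stand-ins for whatever the real kernel does inside `{ ... }`; only `nlevsno` and `snl` come from the example above, and `snl` is assumed to be the non-negative count of active layers.

```cpp
#include <cassert>
#include <vector>

// Variable-size form: the trip count is snl, so it can differ between
// threads (columns) running the same kernel.
double sum_active_variable(const std::vector<double>& layer, int nlevsno, int snl) {
  assert(snl >= 0 && snl <= nlevsno);
  assert(static_cast<int>(layer.size()) >= nlevsno);
  double sum = 0.0;
  for (int i = nlevsno - snl; i < nlevsno; ++i) {
    sum += layer[i];
  }
  return sum;
}

// Fixed-size form: every thread runs nlevsno iterations, and the guard
// selects the same index range as the variable-size loop.
double sum_active_fixed(const std::vector<double>& layer, int nlevsno, int snl) {
  assert(snl >= 0 && snl <= nlevsno);
  assert(static_cast<int>(layer.size()) >= nlevsno);
  double sum = 0.0;
  for (int i = 0; i < nlevsno; ++i) {
    if (i >= nlevsno - snl) {
      sum += layer[i];
    }
  }
  return sum;
}
```

Both functions visit the active indices in the same order, so they should produce identical results; the fixed-size form trades a few idle iterations for a uniform trip count.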

Fully addressing the excessive thread divergence that this change targets will ultimately require a packing/masking approach, which would allow much of the branching logic to be removed from the kernels.
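
As a rough illustration of what masking could look like (one possible reading, not necessarily the design intended here), the guard can be replaced by a precomputed 0/1 mask so that the loop body contains no branch. `make_layer_mask` and `sum_active_masked` are hypothetical names, and the summation again stands in for the real kernel body.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: 1.0 for active layers (i >= nlevsno - snl), 0.0 otherwise.
std::vector<double> make_layer_mask(int nlevsno, int snl) {
  assert(snl >= 0 && snl <= nlevsno);
  std::vector<double> mask(nlevsno);
  for (int i = 0; i < nlevsno; ++i) {
    mask[i] = (i >= nlevsno - snl) ? 1.0 : 0.0;
  }
  return mask;
}

// Branch-free body: inactive layers contribute zero through the mask.
// Assumes inactive entries hold finite values; a NaN or Inf in an
// inactive layer would still propagate through the multiply.
double sum_active_masked(const std::vector<double>& layer,
                         const std::vector<double>& mask, int nlevsno) {
  assert(static_cast<int>(layer.size()) >= nlevsno);
  assert(static_cast<int>(mask.size()) >= nlevsno);
  double sum = 0.0;
  for (int i = 0; i < nlevsno; ++i) {
    sum += mask[i] * layer[i];
  }
  return sum;
}
```

In this sketch the mask is built once per column and can be reused by every loop over the same layers, which is where the reduction in per-kernel branching would come from.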