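It might also be interesting to take advantage of Tensor Cores, though I don't believe CUDA 8.0 supports them. As far as I know, they need CUDA 9.0 or later and a Volta-class or newer GPU (compute capability 7.0+).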