On GPU-accelerated implementations, the first iteration includes device initialization and the data copy. This distorts the reported iterations/second, because that one iteration is a significant outlier. It would be good to measure and report the time of the first iteration separately from the remaining iterations.
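A minimal sketch of how the two measurements could be split, written in Python for illustration only. The names `time_iterations`, `step`, and `sync` are hypothetical and not taken from any existing benchmark code. `step` stands for one iteration of the workload, and `sync` is an optional callback that blocks until the device has finished, which matters because GPU kernel launches are typically asynchronous.

```python
import time


def time_iterations(step, n_iters, sync=None):
    """Time each call to step() and report the first iteration separately.

    step    -- hypothetical callable that runs one iteration
    n_iters -- total number of iterations, including the first
    sync    -- optional hypothetical callable that waits for the device
               to finish, so asynchronous work is included in the timing
    """
    if n_iters < 1:
        raise ValueError("n_iters must be at least 1")

    durations = []
    for _ in range(n_iters):
        start = time.perf_counter()
        step()
        if sync is not None:
            sync()
        durations.append(time.perf_counter() - start)

    first = durations[0]          # includes device init and data copy
    rest = durations[1:]          # steady-state iterations only
    rest_total = sum(rest)
    # The rate is undefined if there is no steady-state sample.
    rate = len(rest) / rest_total if rest_total > 0 else None

    return {
        "first_s": first,
        "rest_total_s": rest_total,
        "rest_iters": len(rest),
        "rest_iters_per_s": rate,
    }
```

With this split, the report could show the first-iteration time on its own line and compute iterations/second from `rest_iters_per_s` only, so the one-off setup cost no longer skews the rate.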