From @aksarkar:
I found that fit_poisson_nmf spends almost all of its time in single-threaded code, specifically cost and in poisson_nmf_kkt. One obvious enhancement would be to move the objective function computation into the additive Poisson regression code (i.e., break it up across threads) and then sum up the results in single-threaded code. Similarly, one could compute the max absolute KKT residual inside the Poisson regression code and then take the max over the results in single-threaded code.