Add functionality to evaluate model averaging

I think this can be done solely by combining output objects, rather than by running a "combined" run_eval(). That is, you would first call run_eval() separately for each model you want to include, then combine the outputs and parse out the averaged predictions. This will need #21 for computing the weights, so that should be done first. The averaged predictions could perhaps be one of the generated outputs from #26.
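
A rough sketch of the combining step, to make the idea concrete. Everything here is an assumption for illustration: the `average_predictions` helper is hypothetical, the per-model predictions are assumed to have already been pulled out of each run_eval() output into plain lists aligned on the same observations, and the weights are assumed to be supplied from whatever #21 ends up producing.

```python
from typing import Dict, List


def average_predictions(
    predictions: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted average of per-model predictions (hypothetical helper).

    predictions: model name -> predictions, one per observation, assumed to
                 have been extracted from separate run_eval() outputs.
    weights:     model name -> non-negative weight (assumed to come from #21).
    """
    if set(predictions) != set(weights):
        raise ValueError("predictions and weights must cover the same models")

    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    lengths = {len(p) for p in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all models must predict the same observations")

    averaged = [0.0] * lengths.pop()
    for name, preds in predictions.items():
        w = weights[name] / total  # normalise so the weights sum to 1
        for i, value in enumerate(preds):
            averaged[i] += w * value
    return averaged


# Example with two made-up models and made-up weights.
combined = average_predictions(
    {"model_a": [1.0, 2.0], "model_b": [3.0, 4.0]},
    {"model_a": 3.0, "model_b": 1.0},
)
```

Normalising inside the helper means the weights do not have to sum to 1 beforehand, which may or may not match what #21 returns.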