Conversation
| cdef double node_impurity2(self, double* pred_weights) | ||
| cdef void children_impurity2(self, double* impurity_left, | ||
| double* impurity_right, double* pred_weights) | ||
| cdef double proxy_impurity_improvement2(self, double* pred_weights) nogil |
morgsmss7
left a comment
There was a problem hiding this comment.
I noted a few things about the tests. @jheiko1 Mostly they were good. My main comments would be to split everything into different tests, recalculate MSE for the tests with new y values, and look over some of the logic for the random state tests. Looks good otherwise :). I also added some comments for myself. Don't worry about those.
sklearn/tree/tests/test_tree.py
Outdated
| ----------------------- | ||
|
|
There was a problem hiding this comment.
@jheiko1 make sure you have charts for each of the tests like the ones above and include your impurity calculations in the comments as well.
sklearn/tree/tests/test_tree.py
Outdated
| Right node Mean1 = Mean2 = 4 | ||
| Total error = ((4 - 4)^2 * 1.0) | ||
| = 0 | ||
|
|
There was a problem hiding this comment.
@jheiko1 add a separate test with a descriptive name for each thing you are testing
There was a problem hiding this comment.
also for the tree fit at line 1841, I think this is incorrect because you are going to need a try-except block. Since there is only 1 split, the algorithm is going to randomly choose either the first output (3,3,4,7,8) OR the second output (2,4,3,6,7) to split on, so the impurity can either be the same as line 1852 if the first output is chosen or it could be what you calculated here. Just make sure you write out your calculation. (I edited this after looking at it more)
There was a problem hiding this comment.
Sorry, for some reason it won't let me comment on some of the lines, but at line 1855, I might suggest increasing to range 30 or something just because it might be stronger.
There was a problem hiding this comment.
for lines 1870 and 1875, I think you need to set a variable to np.random.randint(1,100,(5,7)) and then use the same y for both dt_axis_3 and dt_axis_4 because otherwise it will definitely be different because they are using different data. If it doesn't pass, I would increase the number or outputs from 7 to like 20.
There was a problem hiding this comment.
after line 1878, I think you also need to break the for loop because just because the assert runs doesn't mean it won't continue trying until i = 100. The way it is right now, whether the test passes is contingent on what happens at random state 100.
|
|
||
| Right Impurity = Total error / total weight | ||
| = 0 / 1.0 | ||
| = 0.0 |
There was a problem hiding this comment.
@jheiko1 line 1987 needs to be different. You'll need to calculate what happens when both y1 and y2 are chosen here. You also need to add another nested try-except block or two because a few things could happen: just y1 can be chosen with weight 1 or -1, just y2 can be chosen with weight 1 or -1, both can be chosen with weights 1, both can be chosen with weights -1, or both can be chosen y1 with weight 1 and y2 with weight -1 or y2 with weight 1 and y1 with weight -1. You are going to need a try except for each case.
There was a problem hiding this comment.
line 2007, same as previous comment about axis_proj. use the same y for both.
There was a problem hiding this comment.
also line 1992, increase to a larger number to make the argument stronger
There was a problem hiding this comment.
line 2009: same comment as axis. I think you need a break statement.
There was a problem hiding this comment.
typo in line 2014. should be oblique not axis (that was probably me)
There was a problem hiding this comment.
typo in 2025. should be oblique not MAE (that was probably also me)
| min_impurity_split=min_impurity_split, | ||
| random_state=random_state, | ||
| ccp_alpha=ccp_alpha) | ||
|
No newline at end of file |
There was a problem hiding this comment.
lots of unnecessary changes throughout and in this file. I'll fix these. @jheiko1
|
The shared weights portion of the PR should be ready for review now! |
sampan501
left a comment
There was a problem hiding this comment.
Started with a few areas where the algorithm speed could be improved. Since you are using cython, I highly recommend using MemoryView syntax for numpy arrays whenever are doing basic math operations
| k = rand_int(0, self.n_outputs, random_state) | ||
|
|
||
| cdef DOUBLE_t w = 1.0 | ||
| for p in range(start, pos): |
| w = sample_weight[i] | ||
| y_ik = self.y[i, k] | ||
| sq_sum_total += w * y_ik * y_ik | ||
| for k in range(self.n_outputs): |
There was a problem hiding this comment.
would be faster to use matrix manipulation instead of loop. Something like np.sum(w * self.y * self.y * pred_weights)
|
|
||
| impurity = sq_sum_total / self.weighted_n_node_samples | ||
| impurity -= (sum_total[k] / self.weighted_n_node_samples)**2.0 | ||
| for k in range(self.n_outputs): |
There was a problem hiding this comment.
same here, faster with list comprehension do something like impurity -= np.sum([sum_total[k] * pred_weights[k] / self.weighted_n_node_samples for k in range(self.n_outputs)
| proxy_impurity_right += sum_right[k] * sum_right[k] | ||
|
|
||
|
|
||
| for k in range(self.n_outputs): |
|
|
||
| impurity_left[0] -= (sum_left[k] / self.weighted_n_left) ** 2.0 | ||
| impurity_right[0] -= (sum_right[k] / self.weighted_n_right) ** 2.0 | ||
| for k in range(self.n_outputs): |
| cdef SIZE_t i | ||
| cdef SIZE_t p | ||
| cdef SIZE_t k | ||
| cdef UINT32_t rand_r_state |
| cdef double proxy_impurity_left = 0.0 | ||
| cdef double proxy_impurity_right = 0.0 | ||
|
|
||
| cdef UINT32_t rand_r_state |
| if sample_weight != NULL: | ||
| w = sample_weight[i] | ||
|
|
||
| for k in range(self.n_outputs): |
|
@j1c @bdpedigo I've looked over the checks that aren't passing, and I'm not sure what the problem is really. All of the tests that fail in the checks pass on my machine (which I guess is understandable, but I can't really fix it if I can't recreate it) and a lot of them seem pretty unrelated to the changes we made. I know this was a reach goal for the first sprint this semester, so it doesn't really matter in terms of grades, but I'm wondering how you guys think I should proceed. |
I'm initiating this PR because for some reason, the test_tree.py file will not run on @jheiko1's machine. She will make comments for her tests and assign reviewers below.
Fixes #28
@jheiko1 and I have been working on the same branch. I have updated the criteria to save the outputs used in impurity calculations. @jheiko1 wrote tests for the new split criteria.
As of right now, the shared weights code is not ready for review, but the tests are.