Implement Shared Weights + Tests by morgsmss7 · Pull Request #33 · NeuroDataDesign/scikit-learn

morgsmss7 · 2020-03-14T22:44:19Z

I'm initiating this PR because for some reason, the test_tree.py file will not run on @jheiko1's machine. She will make comments for her tests and assign reviewers below.

Fixes #28

@jheiko1 and I have been working on the same branch. I have updated the criteria to save the outputs used in impurity calculations. @jheiko1 wrote tests for the new split criteria.

As of right now, the shared weights code is not ready for review, but the tests are.

morgsmss7 · 2020-03-15T15:41:21Z

sklearn/tree/_criterion.pxd

    cdef double node_impurity2(self, double* pred_weights)
    cdef void children_impurity2(self, double* impurity_left,
                                double* impurity_right, double* pred_weights)
    cdef double proxy_impurity_improvement2(self, double* pred_weights) nogil


@morgsmss7 is this used?

morgsmss7

I noted a few things about the tests. @jheiko1 Mostly they were good. My main comments would be to split everything into different tests, recalculate MSE for the tests with new y values, and look over some of the logic for the random state tests. Looks good otherwise :). I also added some comments for myself. Don't worry about those.

sklearn/tree/_criterion.pyx

sklearn/tree/_splitter.pyx

morgsmss7 · 2020-03-15T15:51:24Z

sklearn/tree/tests/test_tree.py

    -----------------------
-


@jheiko1 make sure you have charts for each of the tests like the ones above and include your impurity calculations in the comments as well.

morgsmss7 · 2020-03-15T15:53:28Z

sklearn/tree/tests/test_tree.py

    Right node Mean1 = Mean2 = 4
    Total error = ((4 - 4)^2 * 1.0)
                = 0
-


@jheiko1 add a separate test with a descriptive name for each thing you are testing

Also, for the tree fit at line 1841, I think the expected value is incorrect, and you are going to need a try-except block. Since there is only 1 split, the algorithm randomly chooses either the first output (3,3,4,7,8) or the second output (2,4,3,6,7) to split on, so the impurity can either be the same as at line 1852 (if the first output is chosen) or be what you calculated here. Just make sure you write out your calculation. (I edited this after looking at it more.)

remove line 1845

Sorry, for some reason it won't let me comment on some of the lines, but at line 1855, I might suggest increasing to range 30 or something just because it might be stronger.

For lines 1870 and 1875, I think you need to set a variable to np.random.randint(1,100,(5,7)) and then use the same y for both dt_axis_3 and dt_axis_4; otherwise the results will definitely differ, because the two trees are fit on different data. If it doesn't pass, I would increase the number of outputs from 7 to something like 20.

After line 1878, I think you also need to break out of the for loop, because a passing assert does not stop the loop from continuing until i = 100. The way it is right now, whether the test passes is contingent on what happens at random state 100.
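The two control-flow points in this comment (accept either outcome of the random output choice, and stop the seed search as soon as a differing fit is found) can be sketched in plain Python. This is only an illustration under stated assumptions: `fit_impurity` is a hypothetical stand-in for fitting a one-split tree, not the PR's estimator. It picks one output at random and returns that output's variance.

```python
import random

# The two outputs quoted in the comment above.
Y1 = [3, 3, 4, 7, 8]
Y2 = [2, 4, 3, 6, 7]


def variance(values):
    """Population variance, used here as a stand-in impurity."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def fit_impurity(outputs, random_state):
    """Hypothetical stand-in for a one-split fit: choose one output at random."""
    rng = random.Random(random_state)
    return variance(rng.choice(outputs))


# Either output may be chosen, so accept both hand-computed values.
expected = {round(variance(Y1), 9), round(variance(Y2), 9)}
observed = round(fit_impurity([Y1, Y2], random_state=0), 9)
assert observed in expected

# Look for a seed that gives a different result, and stop as soon as one is
# found so the outcome does not depend on the last seed tried.
baseline = fit_impurity([Y1, Y2], random_state=0)
found_different = False
for seed in range(1, 100):
    if fit_impurity([Y1, Y2], random_state=seed) != baseline:
        found_different = True
        break
assert found_different
```

The membership check plays the role of the try-except here: the test passes if the observed impurity matches any of the hand-calculated candidates.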

morgsmss7 · 2020-03-17T15:55:54Z

sklearn/tree/tests/test_tree.py

-
    Right Impurity = Total error / total weight
            = 0 / 1.0
            = 0.0


@jheiko1 line 1987 needs to be different. You'll need to calculate what happens when both y1 and y2 are chosen here. You also need to add another nested try-except block or two, because several things could happen: just y1 can be chosen with weight 1 or -1; just y2 can be chosen with weight 1 or -1; both can be chosen with weights 1; both can be chosen with weights -1; or both can be chosen with opposite weights (y1 with weight 1 and y2 with weight -1, or y2 with weight 1 and y1 with weight -1). You are going to need a try-except for each case.

line 2007, same as previous comment about axis_proj. use the same y for both.

also line 1992, increase to a larger number to make the argument stronger

line 2009: same comment as axis. I think you need a break statement.

typo in line 2014. should be oblique not axis (that was probably me)

typo in 2025. should be oblique not MAE (that was probably also me)
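The cases listed at the start of this comment can be enumerated mechanically. The sketch below assumes, purely for illustration, that the impurity of a projection is the variance of the weighted sum of the outputs; the PR's actual criterion may differ. The y values are borrowed from the axis test, and `projected_impurity` is a hypothetical helper.

```python
from itertools import product

# Illustrative outputs borrowed from the axis test; not the oblique test's data.
Y1 = [3, 3, 4, 7, 8]
Y2 = [2, 4, 3, 6, 7]


def variance(values):
    """Population variance, assumed here to be the impurity of a projection."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def projected_impurity(weights):
    """Impurity of w1 * y1 + w2 * y2 for one choice of weights (hypothetical)."""
    w1, w2 = weights
    projected = [w1 * a + w2 * b for a, b in zip(Y1, Y2)]
    return round(variance(projected), 9)


# Every weight pair from {-1, 0, 1} except (0, 0), where no output is chosen.
cases = [w for w in product((-1, 0, 1), repeat=2) if w != (0, 0)]
impurities = {w: projected_impurity(w) for w in cases}
distinct = sorted(set(impurities.values()))

for weights, impurity in impurities.items():
    print(weights, impurity)
```

Under this variance assumption, flipping the sign of every weight leaves the impurity unchanged, so the eight cases share a smaller set of expected values. That would need to be checked against the real criterion before relying on it.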

morgsmss7 · 2020-03-17T16:01:26Z

sklearn/tree/tree.py

            min_impurity_split=min_impurity_split,
            random_state=random_state,
            ccp_alpha=ccp_alpha)
+


lots of unnecessary changes throughout and in this file. I'll fix these. @jheiko1

morgsmss7 · 2020-04-06T18:54:08Z

The shared weights portion of the PR should be ready for review now!

sampan501

Started with a few areas where the algorithm speed could be improved. Since you are using Cython, I highly recommend using memoryview syntax for NumPy arrays whenever you are doing basic math operations.

sampan501 · 2020-04-06T19:52:49Z

sklearn/tree/_criterion.pyx

-        k = rand_int(0, self.n_outputs, random_state) 

+        cdef DOUBLE_t w = 1.0
        for p in range(start, pos):


next 2 loops can be merged into one

sampan501 · 2020-04-06T20:15:44Z

sklearn/tree/_criterion.pyx

                w = sample_weight[i]
-            y_ik = self.y[i, k]
-            sq_sum_total += w * y_ik * y_ik
+            for k in range(self.n_outputs):


It would be faster to use matrix operations instead of a loop. Something like np.sum(w * self.y * self.y * pred_weights)
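As a rough check of this suggestion, the nested loop and the broadcast expression can be compared on small made-up data. The array names and shapes below are assumptions for illustration (per-sample weights of shape `(n_samples,)`, targets of shape `(n_samples, n_outputs)`, per-output `pred_weights` of shape `(n_outputs,)`). They are not taken from the PR's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_outputs = 6, 3

# Made-up data standing in for self.y, sample_weight and pred_weights.
y = rng.integers(1, 100, size=(n_samples, n_outputs)).astype(float)
sample_weight = rng.uniform(0.5, 2.0, size=n_samples)
pred_weights = np.array([1.0, 0.0, -1.0])  # hypothetical per-output weights

# Explicit double loop, mirroring the structure in the diff.
sq_sum_loop = 0.0
for i in range(n_samples):
    w = sample_weight[i]
    for k in range(n_outputs):
        sq_sum_loop += w * y[i, k] * y[i, k] * pred_weights[k]

# Broadcast form: sample weights vary along rows, pred_weights along columns.
sq_sum_vec = np.sum(sample_weight[:, None] * y * y * pred_weights)

print(np.isclose(sq_sum_loop, sq_sum_vec))
```

Because `w` is per-sample, it needs the extra axis (`sample_weight[:, None]`) to broadcast against the `(n_samples, n_outputs)` target array. Note also that NumPy calls such as `np.sum` require the GIL, so inside a `nogil` Cython method an explicit loop over a typed memoryview is the usual alternative.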

sampan501 · 2020-04-06T20:29:44Z

sklearn/tree/_criterion.pyx


        impurity = sq_sum_total / self.weighted_n_node_samples
-        impurity -= (sum_total[k] / self.weighted_n_node_samples)**2.0
+        for k in range(self.n_outputs):


Same here; it would be faster with a list comprehension. Do something like impurity -= np.sum([sum_total[k] * pred_weights[k] / self.weighted_n_node_samples for k in range(self.n_outputs)])

sampan501 · 2020-04-06T20:30:34Z

sklearn/tree/_criterion.pyx

-        proxy_impurity_right += sum_right[k] * sum_right[k]
-
-
+        for k in range(self.n_outputs):


list comprehension or broadcasting

sampan501 · 2020-04-06T20:33:23Z

sklearn/tree/_criterion.pyx


-        impurity_left[0] -= (sum_left[k] / self.weighted_n_left) ** 2.0
-        impurity_right[0] -= (sum_right[k] / self.weighted_n_right) ** 2.0
+        for k in range(self.n_outputs):


list comprehension thing here again

sampan501 · 2020-04-06T20:33:58Z

sklearn/tree/_criterion.pyx

        cdef SIZE_t i
        cdef SIZE_t p
        cdef SIZE_t k 
-        cdef UINT32_t rand_r_state


why was this code removed?

sampan501 · 2020-04-06T20:34:57Z

sklearn/tree/_criterion.pyx

        cdef double proxy_impurity_left = 0.0
        cdef double proxy_impurity_right = 0.0

-        cdef UINT32_t rand_r_state


same about this code block

sampan501 · 2020-04-06T20:35:49Z

sklearn/tree/_criterion.pyx

            if sample_weight != NULL:
                w = sample_weight[i]
+
            for k in range(self.n_outputs):


again, same list comment

morgsmss7 · 2020-05-14T02:57:37Z

@j1c @bdpedigo I've looked over the checks that aren't passing, and I'm not sure what the problem is really. All of the tests that fail in the checks pass on my machine (which I guess is understandable, but I can't really fix it if I can't recreate it) and a lot of them seem pretty unrelated to the changes we made. I know this was a reach goal for the first sprint this semester, so it doesn't really matter in terms of grades, but I'm wondering how you guys think I should proceed.

fixing issues I causes while attempting to resolve merge conflicts

2c9ce5a

morgsmss7 assigned jheiko1 and morgsmss7 Mar 17, 2020

morgsmss7 and others added 10 commits March 23, 2020 14:52

Merge branch 'master' into shared_weights

fcbce8d

fix issues from resolving merge conflicts

988627d

split tests and updated calculations in comments

ffb9402

split up test

b082510

fixed calulations for random.

a9bcb5a

final calculation fixes and random fix.

6819fc9

oblique random fit

e495621

diff random state fix

e658126

formatted with black

e16e2eb

delete unsed import

cf8d33a

sampan501 self-requested a review April 6, 2020 18:57

jheiko1 requested review from bdpedigo and j1c May 14, 2020 01:45

trying to add all sklearn code again

351bbf4

bdpedigo removed their request for review November 16, 2020 02:19

3 participants
