Support multiple tree implementations with Bonsai#79
Support multiple tree implementations with Bonsai#79erik-stripe merged 11 commits intostripe-archive:masterfrom
Conversation
This commit removes the multi-way split support Brushfire has. In practice splits end up being binary in most important cases, and we end up with a lot of extra unnecessary predicates and indirection. As of this commit, all the tests pass and things more or less work. There are some things that can still be cleaned up and commented better, but this believed to be a moment where things work.
This removes BinarySplit, since now all splits are binary. It is a bit nicer to be working with Split as a concrete case class.
This should fix all the issues raised by @tixxit.
The tests pass. This is definitely not in the best form, but it's a start! I need to do some benchmarking to try to figure out what kind of performance impact this change has.
This commit ensures that parallel traversals will not change the order nodes are visited when using an explicit seed. It includes a test which verifies that this is currently working using parallel collections. It also stabilizes the bonsai dependency at 0.1.3.
Add warning for failed inlining, remove unnecessary Types object, and so on.
There was a problem hiding this comment.
Any reason to prefer Tuple3[Int, T, A] => B over ((Int, T, A)) => B or (Int, T, A) => B?
There was a problem hiding this comment.
I thought it was easier to read -- I think it's easy to lose track of which parts are argument lists and which parts are tuples otherwise.
There was a problem hiding this comment.
I could change it here to use an argument list, but it's going to be provided as a tuple by the ops type class anyway. But if you'd prefer I can definitely change it.
There was a problem hiding this comment.
No strong preference, more curiosity. It seems a bit weird to use a tuple if you know nothing about bonsai, but we are using it, so I think this is fine.
|
Just to be clear about the scope here: it doesn't make it possible to train using multiple representations (which is a good thing, I think); it just introduces a generalized traversal abstraction that we can use in other contexts (like evaluation) with other tree representations. Right? |
|
@avibryant Right, exactly. |
1. Move Reorder into its own file with better Scaladoc comments. 2. Make Predicate's injections public, since predicate is public. 3. Improve some names and comments to be clearer. 4. A bit of other clean up.
|
@johnynek I'm happy to make the seed an explicit parameter -- someone else will probably seed with the clock, but that's OK (these don't need to be cryptographically secure AFAIK). |
|
I think the initial goal with |
|
So to answer @johnynek's question in a long-winded sort of way: It's difficult to know the I'll just comment this for now. It's possible that we could fix the design (and if we're unhappy enough with the current one I'll take a crack) but it would likely require some changes to bonsai to happen. |
This change makes the RNG seed handling more explicit, and makes it clear how it is initialized in the "defualt" case. It is still not secure (it's initialized from the lower 32-bits of System.nanoTime) but at least it's clearer what is going on.
The API is a bit fiddly, and this comment helps explain why.
|
@johnynek What do you think? Seems reasonable? |
|
👍 |
|
Awesome. I think the only thing I think we should do is bump the minor version at least, so that we can indicate a fairly big change (and also possibly release patches to the previous version should migration prove difficult, though I don't suspect we'll have too much trouble). @NathanHowell Any concerns with this plan? |
|
It seems like everyone has signed-off on this, and given that we do have a plan to deal with multi-way splits in the future, I'm going to merge this. Thanks everyone! |
There have been a bunch of big brushfire changes, so it seems appropriate to update the version to 0.7.0.
Support multiple tree implementations with Bonsai
This is a relatively ambitious pull request. The basic idea is to allow Brushfire to work with many different tree representations, instead of requiring users to use its
TreeandAnnotatedTreetypes. This flexibility is provided by Bonsai'sTreeOps(andFullBinaryTreeOps) type classes, which provide methods to work with trees in a generic way.One concrete benefit of this is that users can use the compressed trees provided by Bonsai instead of Brushfire's trees. This also means it should be easier to use Brushfire to train models used in other systems (although I haven't actually done this so I don't have a specific example).
Benchmarking I've done shows that there is no major performance difference here -- some of the optimizations in the PR offset the costs of a more generic model. I think there are probably places we could improve further, but for now this seems reasonable.
I should acknowledge that I still haven't fully answered @NathanHowell's concerns from #76. I would be happy to try the changes @avibryant's suggested if everyone agrees that this is the right way to go.
This PR includes and supercedes #76. I could also rebase this PR against #76 if we would prefer to merge that first.