Make all splits binary #76
This commit removes Brushfire's support for multi-way splits. In practice splits end up being binary in most important cases, and multi-way support leaves us with a lot of unnecessary predicates and indirection. As of this commit, all the tests pass and things more or less work. There are some things that can still be cleaned up and commented better, but this is believed to be a point where things work.
This removes BinarySplit, since now all splits are binary. It is a bit nicer to be working with Split as a concrete case class.
Super nit-picky, but could we put the key first?
Sure, I can make that change.
|
Re: serialization format - is there an easy way to include some backwards compatibility here, assuming all our splits are actually just binary splits? Otherwise this will make the migration path quite a bit harder, as we have production models using the old format. |
|
@NathanHowell : Mind taking a look? |
|
Overall I'm really happy with this change. Great stuff! |
|
@tixxit I suspect we can pretty easily write a script (in Scala or even Ruby or something) that translates the JSON. |
|
@avibryant The problem is that we need to support both formats in production for a short time. Upgrading the affected models should be relatively easy (eg re-running a job + version bump). |
|
@NathanHowell Will the serialization format change make your life much harder? If not, then we can just support the old format from within our model server for the change over, then delete the code. |
|
Yeah I imagined writing a script to update our models. And like @tixxit says, I think writing a custom injection internally that can parse either format (temporarily) is a way to switch over our services to the new code without breaking anything. |
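A minimal sketch of the "parse either format, temporarily" idea, assuming nothing about Brushfire's real serialization. `Model`, `ModelFormats`, and the two toy string formats (`v2|...` and `v1|...`) are hypothetical and exist only for illustration. The point is the shape: try the new decoder first, fall back to the old one, and delete the fallback after the changeover.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for a deserialized model; not a Brushfire type.
final case class Model(version: Int, payload: String)

object ModelFormats {
  // Hypothetical new format: "v2|<payload>".
  def parseNew(s: String): Try[Model] =
    if (s.startsWith("v2|")) Success(Model(2, s.drop(3)))
    else Failure(new IllegalArgumentException("not the new format"))

  // Hypothetical old format: "v1|<payload>".
  def parseOld(s: String): Try[Model] =
    if (s.startsWith("v1|")) Success(Model(1, s.drop(3)))
    else Failure(new IllegalArgumentException("not the old format"))

  // Try the new format first; only if that fails, try the old one.
  // Once all models are migrated, parseOld and this fallback can be removed.
  def parseEither(s: String): Try[Model] =
    parseNew(s).orElse(parseOld(s))
}
```

With this shape the model server only ever calls `parseEither`, so dropping old-format support later is a one-line change.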
This should fix all the issues raised by @tixxit.
|
Is it possible to ease the use of binary splits without completely removing the ability to have n-way splits? I've been working on implementing an MDLP splitter for TDigest (based on http://ijcai.org/Past%20Proceedings/IJCAI-93-VOL2/PDF/022.pdf) that will generate multiple splits per node. The goal would be to reduce tree depth (and training iterations, important for very large training sets) without affecting model quality. It is not universally applicable though. Orthogonal trees/forests may similarly benefit from multiway splits. |
|
@NathanHowell First of all, sorry for opening a PR removing a feature you need. Just out of curiosity, is there a problem with encoding a multi-way split as a series of binary splits? The memory/performance overhead of this should be pretty much the same as it was under the previous implementation, which used a collection of triples. I don't think it should make the trees any bigger, as far as allocations or size in memory goes. (And the big-O cost of traversing a series of binary nodes should be the same as traversing a collection of the same size.) We could provide a nice constructor to make it easy to instantiate a "logical" multi-way split. I could also imagine displaying a tree of many splits on the same feature as a single logical multi-way split (sort of the dual of the previously mentioned constructor). Is there anything else that would be needed? I think we can definitely support an explicit multi-way split node if needed, but I want to be sure I understand the requirement first. |
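To make the proposal concrete, here is a rough sketch of such a constructor and its dual. All types and names (`Node`, `Leaf`, `Split`, `MultiWay`, `unchain`) are simplified, hypothetical stand-ins rather than Brushfire's actual classes, and the split predicate is assumed to be a plain numeric `value < threshold` test.

```scala
// Hypothetical, simplified tree types; not Brushfire's real definitions.
sealed trait Node[+T]
final case class Leaf[+T](target: T) extends Node[T]
// Binary split: go left when row(key) < threshold, otherwise go right.
final case class Split[+T](key: String, threshold: Double, left: Node[T], right: Node[T])
    extends Node[T]

object MultiWay {
  // Build a "logical" multi-way split from n sorted thresholds and n + 1 children.
  // The result is a right-leaning chain of n binary splits on the same key.
  def apply[T](key: String, thresholds: List[Double], children: List[Node[T]]): Node[T] = {
    require(children.size == thresholds.size + 1, "need one more child than thresholds")
    require(thresholds == thresholds.sorted, "thresholds must be sorted ascending")
    thresholds.zip(children).foldRight(children.last) {
      case ((t, child), rest) => Split(key, t, child, rest)
    }
  }

  // The dual: recover thresholds and children from a right-leaning chain on `key`,
  // e.g. to display the chain as one multi-way split.
  def unchain[T](key: String, node: Node[T]): (List[Double], List[Node[T]]) =
    node match {
      case Split(k, t, left, right) if k == key =>
        val (ts, cs) = unchain(key, right)
        (t :: ts, left :: cs)
      case other => (Nil, List(other))
    }

  // Walk the tree for one row of feature values.
  def predict[T](node: Node[T], row: Map[String, Double]): T =
    node match {
      case Leaf(target) => target
      case Split(k, t, l, r) => if (row(k) < t) predict(l, row) else predict(r, row)
    }
}
```

In this sketch a chain of n binary splits visits at most n nodes per lookup, which is the same linear cost as scanning a collection of n predicates.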
|
@Striation it could certainly be a tree of splits instead of a single split, that is fine. I was just imagining that there would be implications for tree expansion to fill in targets (and possibly annotations) for each split point, even if they're not a leaf. |
|
This could be a follow-up PR, but my proposal is that we get rid of Incidentally this makes it very clear that |
|
We have three big pull requests in flight - this one, #74, and now #77 . @NathanHowell @Striation any thoughts on the right order to try to land them in? |
This PR simplifies brushfire's conception of split nodes (and splits) so that all decisions are binary. In practice this is almost always the case.
Some benefits of this:
… Predicate instances we need.
… Split class.
Some drawback:
What do you all think?