Skip to content
This repository was archived by the owner on Apr 8, 2021. It is now read-only.

Comments

Make all splits binary#76

Merged
erik-stripe merged 3 commits intostripe-archive:masterfrom
erik-stripe:make-all-splits-binary
Jan 27, 2016
Merged

Make all splits binary#76
erik-stripe merged 3 commits intostripe-archive:masterfrom
erik-stripe:make-all-splits-binary

Conversation

@erik-stripe
Copy link
Contributor

This PR simplifies brushfire's conception of split nodes (and splits) so that all decisions are binary. In practice this is almost always the case.

Some benefits of this:

  1. Should drastically reduce the number of Predicate instances we need.
  2. Avoid allocating a collection instances per split node.
  3. Simplify algorithms which had to encode/decode implicit binary structure.
  4. Nicer to use single concrete Split class.

Some drawback:

  1. Different serialization format.
  2. Source/binary-incompatible change.

What do you all think?

This commit removes the multi-way split support Brushfire has.
In practice splits end up being binary in most important cases,
and we end up with a lot of extra unnecessary predicates and
indirection.

As of this commit, all the tests pass and things more or less
work. There are some things that can still be cleaned up and
commented better, but this believed to be a moment where things
work.
This removes BinarySplit, since now all splits are binary. It is
a bit nicer to be working with Split as a concrete case class.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super nit-picky, but could we put the key first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can make that change.

@tixxit
Copy link
Contributor

tixxit commented Dec 10, 2015

Re: serialization format - is there an easy way to include some backwards compatibility here, assuming all our splits are actually just binary splits? Otherwise this will make the migration path quite a bit harder, as we have production models using the old format.

@tixxit
Copy link
Contributor

tixxit commented Dec 10, 2015

@NathanHowell : Mind taking a look?

@tixxit
Copy link
Contributor

tixxit commented Dec 10, 2015

Overall I'm really happy with this change. Great stuff!

@avibryant
Copy link
Contributor

@tixxit I suspect we can pretty easily write a script (in Scala or even Ruby or something) that translates the JSON.

@tixxit
Copy link
Contributor

tixxit commented Dec 10, 2015

@avibryant The problem is that we need to support both formats in production for a short time. Upgrading the affected models should be relatively easy (eg re-running a job + version bump).

@tixxit
Copy link
Contributor

tixxit commented Dec 10, 2015

@NathanHowell Will the serialization format change make your life much harder? If not, then we can just support the old format from within our model server for the change over, then delete the code.

@erik-stripe
Copy link
Contributor Author

Yeah I imagined writing a script to update our models. And like @tixxit says, I think writing a custom injection internally that can parse either format (temporarily) is a way to switch over our services to the new code without breaking anything.

This should fix all the issues raised by @tixxit.
@NathanHowell
Copy link
Contributor

Is it possible to ease the use of binary splits without completely removing the ability to have n-way splits?

I've been working on implementing an MDLP splitter for TDigest (based on http://ijcai.org/Past%20Proceedings/IJCAI-93-VOL2/PDF/022.pdf) that will generate multiple splits per node. The goal would be to reduce tree depth (and training iterations, important for very large training sets) without affecting model quality. It is not universally applicable though. Orthogonal trees/forests may similarly benefit from multiway splits.

@erik-stripe
Copy link
Contributor Author

@NathanHowell First of all, sorry for opening a PR removing a feature you need.

Just out of curiosity, is there a problem with encoding a mult-way split as a series of binary splits? The memory/performance overhead of this should be pretty much the same as it was under the previous implementation which used a collection of triples. I don't think it should make the trees any bigger, as far as allocations or size in memory goes. (And the big-O cost of traversing a series of binary nodes should be the same as traversing a collection of the same size.)

We could provide a nice constructor to make it easy to instantiate a "logical" multi-way split. I could also imagine displaying a tree of many splits on the same feature as a multi-way split logically (sort of the dual of the previously mentioned constructor). Is there anything else that would be needed?

I think we can definitely support an explicit multi-way split node if needed but I want to be sure I understand the requirement first.

@NathanHowell
Copy link
Contributor

@Striation it could certainly be a tree of splits instead of a single split, that is fine. I was just imaging that there would be implications for tree expansion to fill in targets (and possibly annotations) for each split point, even if they're not a leaf.

@avibryant
Copy link
Contributor

This could be a follow-up PR, but my proposal is that we get rid of Split and instead have Splitter directly produce Nodes, which could then be a series of binary splits to simulate a multi-way. We then need a way for Evaluator to evaluate a Node; my proposal is that its interface changes to take Iterable[T], which is just the Ts living in all the leaves of a Node, regardless of how deep the tree might be.

Incidentally this makes it very clear that Evaluator should actually be called TrainingError, and could usefully be applied to a whole tree...

@avibryant
Copy link
Contributor

We have three big pull requests in flight - this one, #74, and now #77 . @NathanHowell @Striation any thoughts on the right order to try to land them in?

@erik-stripe erik-stripe merged commit 7610672 into stripe-archive:master Jan 27, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants