Description
Describe the workflow you want to enable
I am part of the @neurodata team. In gradient-boosted trees, binning features has yielded large efficiency gains with little loss in performance. This feature should not be limited to gradient-boosted trees; it should be available in all decision trees [1].
By including binning as a feature for decision trees, we would enable massive speedups for decision trees that operate on high-dimensional data (in both feature and sample counts). This would be an additional tradeoff that users can opt into. The intuition behind binning for decision trees is exactly that of gradient-boosted trees.
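To make the intuition concrete, here is a minimal sketch of the pre-binning step that histogram-based gradient boosting performs: each continuous feature is mapped to at most max_bins integer codes using quantile bin edges, so split finding only has to scan a small number of candidate thresholds. The function names (find_bin_edges, bin_features) are illustrative, not from the codebase.

```python
import numpy as np

def find_bin_edges(X, max_bins=255):
    """Per-feature quantile bin edges (at most max_bins - 1 edges each)."""
    edges = []
    for j in range(X.shape[1]):
        # Interior quantiles of the feature; duplicates are dropped so
        # constant or low-cardinality features get fewer bins.
        qs = np.quantile(X[:, j], np.linspace(0, 1, max_bins + 1)[1:-1])
        edges.append(np.unique(qs))
    return edges

def bin_features(X, edges):
    """Replace each value by the index of its quantile bin (uint8 codes)."""
    X_binned = np.empty_like(X, dtype=np.uint8)
    for j, e in enumerate(edges):
        X_binned[:, j] = np.searchsorted(e, X[:, j])
    return X_binned

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
edges = find_bin_edges(X, max_bins=16)
Xb = bin_features(X, edges)
```

A tree built on Xb only needs to consider at most 15 thresholds per feature instead of up to 999, which is where the speedup on large sample sizes comes from.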
Describe your proposed solution
We propose introducing binning to the decision tree classifier and regressor.
An initial PR is proposed here: #24 (review)
However, it appears that many of the files were copied, and it is not 100% clear which are needed. Perhaps we can explore consolidating the _binning.py/.pyx files with the current versions under ensemble/_hist_gradient_boosting/*.
Changes to the Cython codebase
TBD
Changes to the Python API
The following two parameters would be added to DecisionTreeClassifier and DecisionTreeRegressor:
hist_binning=False,
max_bins=255
where the default number of bins follows that of HistGradientBoostingClassifier/Regressor.
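Since hist_binning is not implemented yet, the proposed behavior can be emulated today by pre-binning with scikit-learn's KBinsDiscretizer and fitting a stock DecisionTreeClassifier on the integer codes; this sketch is an approximation of what hist_binning=True would do internally, not the proposed implementation.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Emulate hist_binning=True with a modest max_bins by discretizing each
# feature into ordinal quantile bins before growing the tree.
binner = KBinsDiscretizer(n_bins=32, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

clf = DecisionTreeClassifier(random_state=0).fit(X_binned, y)
score = clf.score(X_binned, y)
```

The proposed parameters would fold this step into the estimator itself, avoiding the extra transform and letting the Cython splitter operate directly on compact integer codes.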
Additional context
These changes can also trivially be applied to Oblique Trees.
References:
[1] Ke et al., "LightGBM: A Highly Efficient Gradient Boosting Decision Tree", NeurIPS 2017. https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree