[ENH] Adding binning capabilities to decision trees #23

@adam2392

Description

Describe the workflow you want to enable

I am part of the @neurodata team. Feature binning has made gradient-boosted trees highly efficient with little loss in performance. This capability should not be limited to gradient-boosted trees; it should be available in all decision trees [1].

By including binning as an option for decision trees, we would enable large speedups for trees that operate on high-dimensional data (many features and/or many samples). This would be an additional speed/accuracy tradeoff that users can choose. The intuition behind binning for decision trees is exactly that of gradient-boosted trees: split candidates are evaluated over a small set of bin thresholds rather than over every unique feature value.
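A minimal NumPy sketch of the intuition (quantile-based binning, as used by histogram gradient boosting; `max_bins` here mirrors the proposed parameter, not an existing tree option):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 1))  # one continuous feature

# An exact tree considers up to n_samples - 1 split thresholds per
# feature; a binned tree considers at most max_bins - 1.
max_bins = 255

# Quantile-based bin edges (254 interior cut points for 255 bins).
edges = np.quantile(X[:, 0], np.linspace(0, 1, max_bins + 1)[1:-1])

# Map each value to its bin index; 255 bins fit in a uint8.
X_binned = np.searchsorted(edges, X[:, 0]).astype(np.uint8)

print(len(np.unique(X[:, 0])))   # ~100_000 candidate thresholds
print(len(np.unique(X_binned)))  # at most 255
```

The histogram of gradients/counts per bin can then be built in O(n_samples) per feature, and split search over bins is O(max_bins) instead of O(n_samples).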

Describe your proposed solution

We propose introducing binning to the decision tree classifier and regressor.

An initial PR is proposed here: #24 (review)
However, many files appear to have been copied, and it is not clear that all of them are needed. Perhaps we can explore consolidating the _binning.py/.pyx files with the current versions under ensemble/_hist_gradient_boosting/*.
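To illustrate the consolidation idea: the binning logic already exists as `_BinMapper` under `ensemble/_hist_gradient_boosting/`. It is a private API and subject to change between releases, so this is only a sketch of what reuse could look like:

```python
import numpy as np
# Private scikit-learn API, shown only to illustrate the consolidation
# idea -- not a stable import path.
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))

# _BinMapper reserves one bin for missing values, so n_bins=256 yields
# at most 255 bins for non-missing data and a uint8 binned matrix.
mapper = _BinMapper(n_bins=256, random_state=0)
X_binned = mapper.fit_transform(X)
print(X_binned.dtype)
```

If the tree code could accept such a pre-binned uint8 matrix, the duplicated _binning.py/.pyx files in the PR might be dropped entirely.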

Changes to the Cython codebase

TBD

Changes to the Python API

The following two parameters would be added to DecisionTreeClassifier and DecisionTreeRegressor:

hist_binning=False,
max_bins=255

where the default number of bins follows that of HistGradientBoostingClassifier/HistGradientBoostingRegressor.
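A short sketch of why 255 is the natural default (the `DecisionTreeClassifier(hist_binning=..., max_bins=...)` call below is the proposal's hypothetical API, not released scikit-learn):

```python
import numpy as np

# With one extra bin reserved for missing values (as in
# HistGradientBoosting), binned values span 0..255 and fit in a uint8
# array, keeping the binned feature matrix compact and cache-friendly.
max_bins = 255
n_total_bins = max_bins + 1  # including the missing-values bin
assert n_total_bins - 1 <= np.iinfo(np.uint8).max

# Hypothetical usage once the parameters exist:
# clf = DecisionTreeClassifier(hist_binning=True, max_bins=255)
```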

Additional context

These changes can also trivially be applied to Oblique Trees.

References:
[1] Ke, G. et al., "LightGBM: A Highly Efficient Gradient Boosting Decision Tree", NeurIPS 2017. https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree
