@eschmidt42

This pull request refactors the gradient boosted trees implementation with a focus on correctness, maintainability, and extensibility. The key changes are a refactor of the loss and gradient calculation logic for both regression and classification, an optimal step size (line search) for each boosting iteration, and comprehensive unit tests for the new utility functions. Data validation and utility function usage are also improved.

Gradient Boosted Trees Algorithm Refactor and Enhancements:

  • Refactored the loss and gradient (pseudo-residual) calculation logic for both regression and classification into standalone functions: get_pseudo_residual_mse, get_pseudo_residual_log_odds, get_start_estimate_mse, and get_start_estimate_log_odds, improving code clarity and reusability.
  • Implemented optimal step size (line search) for each boosting iteration via the new find_step_size function, replacing the previous fixed factor approach. Per-tree step sizes are now stored in step_sizes_.
  • Updated the fit and predict logic in both regressor and classifier to use the new residuals and step size logic, improving correctness and aligning with standard gradient boosting algorithms.
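
For readers unfamiliar with the terms, here is a minimal sketch of what starting estimates and pseudo-residuals usually look like in gradient boosting. It is not the code from this PR: the function names and signatures below are hypothetical, and the classification part assumes labels encoded as 0.0/1.0 with a log-loss objective, which may differ from the encoding the repository uses.

```python
import numpy as np


def start_estimate_mse(y):
    # Hypothetical helper: the constant minimizing squared error is the mean.
    return float(np.mean(y))


def pseudo_residual_mse(y, estimate):
    # Negative gradient of 0.5 * (y - estimate)**2 with respect to the estimate.
    return np.asarray(y, dtype=float) - estimate


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))


def start_estimate_log_odds(y01, eps=1e-12):
    # Assumption: y01 holds 0.0/1.0 labels. The constant minimizing log-loss
    # is the log-odds of the positive class; clip to avoid log(0).
    p = float(np.clip(np.mean(y01), eps, 1.0 - eps))
    return float(np.log(p / (1.0 - p)))


def pseudo_residual_log_odds(y01, estimate):
    # Negative gradient of the log-loss with respect to the raw score:
    # observed label minus predicted probability.
    return np.asarray(y01, dtype=float) - sigmoid(estimate)
```

Under these assumptions, each new tree is fitted to the pseudo-residuals of the current ensemble estimate, starting from the constant start estimate.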

Utility and API Improvements:

  • Replaced the old bool_to_float utility with vectorize_bool_to_float for efficient label mapping in classification.
  • Improved data validation so that ensure_all_finite is consistently respected in the fit and predict methods.
  • Added get_probabilities_from_mapped_bools for consistent probability output in classification.
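
As an illustration of the label mapping and probability output described above, here is a rough sketch. The PR text does not state the numeric encoding or the function signatures, so everything below is an assumption: the names map_bools_to_float and probabilities_from_log_odds are hypothetical, True/False are mapped to +1.0/-1.0 by default, and raw scores are treated as log-odds.

```python
import numpy as np


def map_bools_to_float(y, true_value=1.0, false_value=-1.0):
    # Hypothetical vectorized mapping; the +1.0/-1.0 encoding is an assumption.
    y = np.asarray(y, dtype=bool)
    return np.where(y, true_value, false_value)


def probabilities_from_log_odds(scores):
    # Assumption: scores are log-odds of the positive class.
    # Returns one row per sample: [P(False), P(True)].
    scores = np.asarray(scores, dtype=float)
    p_true = 1.0 / (1.0 + np.exp(-scores))
    return np.column_stack([1.0 - p_true, p_true])
```

A single np.where call avoids a Python-level loop over labels, which is the usual motivation for vectorizing this kind of mapping.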

Testing and Dependency Updates:

  • Added comprehensive unit tests for the new utility functions in tests/models/test_gradientboostedtrees.py, covering edge cases and ensuring correctness of pseudo-residual and starting estimate calculations.
  • Added scipy as a dependency for the new optimization routines.
  • Imported minimize_scalar from scipy.optimize to support line search in boosting.
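
To show how minimize_scalar can serve as a line search in boosting, here is a minimal sketch. It is not the PR's find_step_size: the function name line_search_step_size, its signature, and the mean squared error loss are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def mse_loss(y, estimate):
    # Mean squared error between targets and the current ensemble estimate.
    return float(np.mean((np.asarray(y) - np.asarray(estimate)) ** 2))


def line_search_step_size(y, current_estimate, tree_prediction, loss=mse_loss):
    # Hypothetical helper: find the scalar step that minimizes the loss
    # when the new tree's prediction is added to the current estimate.
    def objective(step):
        return loss(y, current_estimate + step * tree_prediction)

    result = minimize_scalar(objective)  # default Brent method, unbounded
    return float(result.x)


# Usage: the tree under-predicts the residuals by half, so the search
# should return a step close to 2.
y = np.array([1.0, 2.0, 3.0])
current = np.full_like(y, 2.0)
tree_pred = np.array([-0.5, 0.0, 0.5])
step = line_search_step_size(y, current, tree_pred)
```

Storing one such step per tree, as the PR does with step_sizes_, lets predict replay the same weighted sum that was built during fit.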

These changes collectively improve the flexibility, correctness, and maintainability of the gradient boosted trees models, and provide a solid foundation for further enhancements.

@eschmidt42 eschmidt42 merged commit 9dee0cd into main Aug 18, 2025
4 checks passed