
Scikit-learn

This repository contains structured notes on Scikit-learn's machine learning workflow, from preprocessing raw data to applying meta estimators for enhanced predictions, along with a Heart Disease Predictor project. It is intended as a concise, reference-friendly guide for anyone learning or refining their Scikit-learn skills.


Basic Workflow

Data → Model → Fit → Predict → Evaluate
The standard process involves defining input features (X), target labels (y), fitting a model, making predictions, and evaluating results using metrics.
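
A minimal sketch of this flow, using the built-in Iris dataset and a logistic regression (both chosen here purely for illustration):

```python
# Data -> Model -> Fit -> Predict -> Evaluate, end to end
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                        # Data: features X, labels y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)                # Model
model.fit(X_train, y_train)                              # Fit
y_pred = model.predict(X_test)                           # Predict
print(accuracy_score(y_test, y_pred))                    # Evaluate
```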


Preprocessing

Preprocessing ensures data consistency and better model performance.
Key tools (see the usage sketch after this list):

  • StandardScaler, MinMaxScaler for normalization
  • PolynomialFeatures for non-linear terms
  • QuantileTransformer for reshaping distributions
  • OneHotEncoder for categorical encoding
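
A brief sketch of how these transformers are applied; the toy arrays below are made up for illustration:

```python
# Each transformer follows the same fit_transform pattern
import numpy as np
from sklearn.preprocessing import (StandardScaler, MinMaxScaler,
                                   PolynomialFeatures, QuantileTransformer,
                                   OneHotEncoder)

X_num = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])   # numeric features
X_cat = np.array([["red"], ["blue"], ["red"]])                  # categorical feature

print(StandardScaler().fit_transform(X_num))                    # zero mean, unit variance
print(MinMaxScaler().fit_transform(X_num))                      # rescale to [0, 1]
print(PolynomialFeatures(degree=2).fit_transform(X_num))        # add squared/interaction terms
print(QuantileTransformer(n_quantiles=3).fit_transform(X_num))  # reshape the distribution
print(OneHotEncoder().fit_transform(X_cat).toarray())           # one-hot encode categories
```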

Modeling and Evaluation

Steps (a pipeline sketch follows the note below):

  1. Import libraries and load data
  2. Separate into X and y
  3. Preprocess and build pipelines
  4. Train and tune models
  5. Evaluate with metrics and visualization

Note: Avoid deprecated datasets such as the Boston housing dataset, which was removed from Scikit-learn over ethical and bias concerns.
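
A hedged sketch of steps 1–5 using a Pipeline; the breast cancer dataset and logistic regression are illustrative choices, not the project's actual setup:

```python
# Steps 1-5: load data, split X/y, build a pipeline, train, evaluate
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)                 # 1-2. load data, separate X and y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

pipe = Pipeline([                                           # 3. preprocessing + model in one object
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])
pipe.fit(X_train, y_train)                                  # 4. train
print(classification_report(y_test, pipe.predict(X_test)))  # 5. evaluate with metrics
```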


GridSearchCV and Cross Validation

GridSearchCV automates hyperparameter tuning and improves generalization by testing multiple configurations through cross-validation.
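
A minimal sketch, assuming a KNeighborsClassifier pipeline and an illustrative parameter grid:

```python
# GridSearchCV tries every grid combination with cross-validation
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

grid = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [3, 5, 7, 9]},   # step-name__parameter syntax
    cv=5,                                            # 5-fold cross-validation
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```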


Sample Weights vs Class Weights

Term            Purpose
class_weight    Balances imbalanced target classes
sample_weight   Assigns custom importance to specific samples
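
A small sketch contrasting the two on synthetic data; the weights and model choice are illustrative:

```python
# class_weight is set on the estimator; sample_weight is passed per row at fit time
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [0.2], [0.9], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 1, 1, 1, 1])                      # imbalanced: more 1s than 0s

# Reweight whole classes inversely to their frequency
clf_balanced = LogisticRegression(class_weight="balanced").fit(X, y)

# Emphasize one specific sample five times more than the others
weights = np.array([1, 1, 1, 1, 1, 5])
clf_weighted = LogisticRegression().fit(X, y, sample_weight=weights)
```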

Outlier Detection

Example: Isolation Forest
An unsupervised anomaly detection method that isolates observations through random partitioning; outliers are identified as points that require fewer splits to isolate.
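
A minimal sketch on synthetic 2-D data; the contamination value is an assumption for illustration:

```python
# Isolation Forest flags points that are easy to separate from the rest
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),        # inliers around the origin
               [[8.0, 8.0], [-9.0, 7.0]]])             # two obvious outliers

iso = IsolationForest(contamination=0.02, random_state=42).fit(X)
labels = iso.predict(X)                                 # +1 = inlier, -1 = outlier
print(np.where(labels == -1)[0])                        # indices flagged as anomalies
```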


Precision vs Recall

Metric      Meaning
Precision   Of predicted positives, how many were correct
Recall      Of actual positives, how many were identified
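
A tiny worked example with made-up labels:

```python
# 2 true positives, 1 false positive, 1 false negative
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

print(precision_score(y_true, y_pred))   # 2 / (2 + 1) ≈ 0.67 of predicted positives were correct
print(recall_score(y_true, y_pred))      # 2 / (2 + 1) ≈ 0.67 of actual positives were found
```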

Meta Estimators

Meta estimators enhance or combine other models (a VotingClassifier sketch follows this list):

  • VotingClassifier for model aggregation
  • Threshold adjusters for classification control
  • FeatureUnion for combining feature transformations
  • Group-based predictors for segmented training
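
A sketch of model aggregation with VotingClassifier; the base estimators here are illustrative:

```python
# Soft voting averages the predicted probabilities of the base models
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=5000)),
        ("rf", RandomForestClassifier(random_state=42)),
    ],
    voting="soft",
)
print(cross_val_score(vote, X, y, cv=5).mean())   # cross-validated accuracy of the ensemble
```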

Summary

Scikit-learn workflows follow a modular approach: preprocess data, define pipelines, tune with GridSearchCV, and optionally apply meta estimators for improved performance. The process encourages reproducibility, scalability, and clarity in machine learning projects.

  • Unrealistic or extreme inputs cause unstable predictions, while realistic values give consistent results; earlier inconsistent outputs were traced to random train/test splits made without a fixed seed.

  • Dataset imbalance needs stratified splitting or resampling.

  • Saving trained models with pickle allows them to be reloaded and reused later without retraining.

  • Preprocessing (e.g., scaling) generally improves stability and performance.

  • The model achieved 98% accuracy without overfitting.

  • Balanced data, realistic inputs, reproducibility, and preprocessing are key for reliable ML pipelines (see the sketch below).
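
A sketch that ties several of these points together (fixed seed, stratified split, pickle persistence); the dataset and model are placeholders, not the project's actual ones:

```python
# Reproducible split, scaled pipeline, and model persistence with pickle
import pickle
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)                  # stratified split with a fixed seed

model = Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=5000))]).fit(X_train, y_train)

with open("model.pkl", "wb") as f:                      # save the fitted pipeline
    pickle.dump(model, f)
with open("model.pkl", "rb") as f:                      # reload it for reuse
    reloaded = pickle.load(f)
print(reloaded.score(X_test, y_test))
```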


Thank You

Knowledge should not be gated behind paywalls or exclusivity. This repository exists so that anyone can access structured, practical Scikit-learn notes without restriction.
The journey doesn’t end here. After mastering meta estimators, take the next step with the full-fledged Scikit-learn project:

am_i_cooked: a more advanced heart disease predictor built on a larger dataset, featuring a Flask-based web UI for deployment.

Author

Created and maintained by Aaditya Yadav.


License

This project is licensed under the MIT License.
