This repository contains structured notes on Scikit-learn's machine learning workflow, from preprocessing raw data to applying meta estimators for enhanced predictions. It is intended as a concise, reference-friendly guide for anyone learning or refining their Scikit-learn skills, and it includes a project: a Heart Disease Predictor.
- Basic Workflow
- Preprocessing
- Modeling and Evaluation
- GridSearchCV and Cross Validation
- Sample Weights vs Class Weights
- Outlier Detection
- Precision vs Recall
- Meta Estimators
- Heart Disease Predictor
- Summary
- Author
- License
Data → Model → Fit → Predict → Evaluate
The standard process involves defining input features (X), target labels (y), fitting a model, making predictions, and evaluating results using metrics.
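The loop above can be sketched end to end. The iris dataset and logistic regression here are purely illustrative stand-ins, not part of the project itself:

```python
# Minimal sketch of the Data -> Model -> Fit -> Predict -> Evaluate loop.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)          # Data: features X, labels y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)  # Model
model.fit(X_train, y_train)                # Fit
preds = model.predict(X_test)              # Predict
print(accuracy_score(y_test, preds))       # Evaluate
```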
Preprocessing ensures data consistency and better model performance.
Key tools:
- `StandardScaler`, `MinMaxScaler` for normalization
- `PolynomialFeatures` for non-linear terms
- `QuantileTransformer` for reshaping distributions
- `OneHotEncoder` for categorical encoding
Steps:
- Import libraries and load data
- Separate into X and y
- Preprocess and build pipelines
- Train and tune models
- Evaluate with metrics and visualization
Note: Avoid deprecated datasets such as Boston due to bias concerns.
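The steps above can be condensed into a `Pipeline` sketch. The breast cancer dataset stands in for real data, and the model choice is an assumption:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)             # load data, split X and y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                       # preprocessing step
    ("clf", LogisticRegression(max_iter=1000)),        # model step
])
pipe.fit(X_train, y_train)                             # train
print(classification_report(y_test, pipe.predict(X_test)))  # evaluate
```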
GridSearchCV automates hyperparameter tuning and improves generalization by testing multiple configurations through cross-validation.
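A minimal `GridSearchCV` sketch; the estimator and parameter grid here are illustrative assumptions, not a recommended recipe:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7]},  # configurations to test
    cv=5,                                      # 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)     # best configuration found
```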
| Term | Purpose |
|---|---|
| `class_weight` | Balances imbalanced target classes |
| `sample_weight` | Assigns custom importance to specific samples |
Example: Isolation Forest
Unsupervised anomaly detection by isolating observations. Outliers are identified as points requiring fewer splits.
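A small `IsolationForest` sketch with synthetic data and one injected outlier; the `contamination` value is an assumption. `predict` returns `-1` for outliers and `1` for inliers:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.concatenate([rng.normal(0, 1, size=(100, 2)),
                    [[8.0, 8.0]]])     # one obvious outlier
iso = IsolationForest(contamination=0.01, random_state=42).fit(X)
labels = iso.predict(X)                # -1 = outlier, 1 = inlier
print(labels[-1])                      # label for the injected point
```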
| Metric | Meaning |
|---|---|
| Precision | Of predicted positives, how many were correct |
| Recall | Of actual positives, how many were identified |
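A worked example on hand-written labels, so the two formulas are easy to check by counting:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]   # 2 true positives, 1 false positive, 1 false negative

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2/3 ≈ 0.667
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 2/3 ≈ 0.667
```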
Meta estimators enhance or combine models:
- `VotingClassifier` for model aggregation
- Threshold adjusters for classification control
- `FeatureUnion` for combining feature transformations
- Group-based predictors for segmented training
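A `VotingClassifier` sketch aggregating two simple models; the base estimators and dataset are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="soft",             # average predicted probabilities across models
)
vote.fit(X, y)
print(vote.score(X, y))        # training accuracy of the ensemble
```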
Scikit-learn workflows follow a modular approach: preprocess data, define pipelines, tune with GridSearchCV, and optionally apply meta estimators for improved performance. The process encourages reproducibility, scalability, and clarity in machine learning projects.
- Unrealistic or extreme inputs cause unstable predictions; realistic values give consistent results. Inconsistent outputs were due to random splits without a fixed seed.
- Dataset imbalance needs stratified splitting or resampling.
- Saving models with pickle ensures correct loading and reuse.
- Preprocessing (e.g., scaling) generally improves stability and performance.
- Achieved 98% accuracy without overfitting.
- Balanced data, realistic inputs, reproducibility, and preprocessing are key for reliable ML pipelines.
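Two of the lessons above, a fixed seed and stratified splitting, can be combined in one call. The 90/10 toy labels are an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 90 + [1] * 10)      # imbalanced labels, 90/10 split
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,                        # preserve the class ratio in both splits
    random_state=42,                   # fixed seed -> reproducible splits
)
print(y_te.mean())                     # ≈ 0.1, matching the overall ratio
```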
Knowledge should not be gated behind paywalls or exclusivity. This repository exists so that anyone can access structured, practical Scikit-learn notes without restriction.
The journey doesn't end here. After mastering meta estimators, take the next step with the full-fledged Scikit-learn project **am_i_cooked**: a more advanced heart disease predictor built on a larger dataset, featuring a Flask-based web UI for deployment.
Created and maintained by
Aaditya Yadav