A comprehensive collection of machine learning and deep learning projects focused on predictive modeling across different domains: computer vision, medical diagnosis, insurance analytics, and multi-class classification.
- Overview
- Projects
- Technologies Used
- Getting Started
- Project Comparison
- Installation
- Usage
- Results Summary
- Repository Structure
- Contributing
This repository contains six distinct machine learning projects that demonstrate various approaches to predictive modeling:
- Real Estate Prediction: Housing price prediction using advanced regression techniques
- Insurance Analytics: Medical insurance cost prediction with feature engineering
- Transfer Learning: Advanced computer vision with pre-trained ResNet18
- Computer Vision: Image classification using Convolutional Neural Networks
- Medical Prediction: Healthcare analytics with traditional ML algorithms
- Multi-class Classification: Large-scale flower species recognition
Each project showcases different aspects of machine learning pipeline development, from data preprocessing to model evaluation, covering both traditional ML and deep learning approaches.
Domain: Real Estate Analytics | Type: Regression | Framework: Scikit-learn + XGBoost
A comprehensive machine learning pipeline for predicting median house values in California districts using the famous California Housing dataset from scikit-learn.
- Dataset: 20,640 housing districts with 8 numerical features
- Best Model: XGBoost with 84.4% R² score - Excellent performance!
- Pipeline: Complete regression workflow with advanced preprocessing
- Evaluation: Multiple metrics (R², RMSE, MAE) with cross-validation
- Optimization: RandomizedSearchCV for hyperparameter tuning
- Advanced Preprocessing: RobustScaler for outlier handling
- Model Comparison: Linear, Ridge, Lasso, Random Forest, XGBoost
- Cross-Validation: 10-fold CV for robust model selection
- Business Application: Suitable for Automated Valuation Models (AVM)
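The workflow above can be sketched on synthetic data. The notebook works with the real California Housing table; here `make_regression` stands in so the example is self-contained, and the model settings are illustrative rather than the notebook's exact hyperparameters:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the 8-feature housing table.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "Random Forest": RandomForestRegressor(n_estimators=20, random_state=42),
}

for name, model in models.items():
    # RobustScaler centers on medians and scales by IQR, so outlying
    # districts distort the features less than with StandardScaler.
    pipe = make_pipeline(RobustScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=10, scoring="r2")
    print(f"{name:>13}: mean CV R^2 = {scores.mean():.3f}")
```

The best candidate from this comparison is then tuned with RandomizedSearchCV, as the notebook does for XGBoost.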
| Model | Cross-Validation R² | Test R² | RMSE | MAE |
|---|---|---|---|---|
| Linear Regression | ~0.60 | - | - | - |
| Ridge Regression | ~0.60 | - | - | - |
| Lasso Regression | ~0.60 | - | - | - |
| Random Forest | ~0.80 | - | - | - |
| XGBoost | ~0.84 | 0.844 | 0.204 | 0.291 |
- 84.4% variance explained = Highly reliable price estimates
- Low prediction errors suitable for real estate valuation
- Scalable framework for property portfolio analysis
Location: ./Machine Learning/California Housing/
Main File: CaliforniaHousing.ipynb
Domain: Insurance Analytics | Type: Regression | Framework: Scikit-learn + XGBoost
A comprehensive machine learning pipeline for predicting medical insurance costs based on personal characteristics and lifestyle factors using advanced feature engineering and hyperparameter optimization.
- Dataset: 1,338 insurance records with 7 features
- Best Model: XGBoost with 88.4% R² score - Outstanding performance!
- Pipeline: Complete regression workflow with categorical encoding
- Evaluation: Cross-validation with business-focused metrics
- Optimization: RandomizedSearchCV for hyperparameter tuning
- Advanced Preprocessing: LabelEncoder + One-Hot encoding for categoricals
- Model Comparison: Linear, Ridge, Lasso, Random Forest, XGBoost
- Cross-Validation: 10-fold CV for robust model validation
- Business Application: Insurance premium calculation and risk assessment
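A small sketch of the categorical encoding step, using a toy frame with the insurance.csv column names (the values here are made up):

```python
import pandas as pd

# Toy rows following the insurance.csv schema; values are illustrative.
df = pd.DataFrame({
    "age": [19, 33, 46, 52],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 22.7, 30.1, 25.3],
    "children": [0, 1, 2, 0],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "northeast", "southeast", "northwest"],
})

# One-hot encode the categorical columns; drop_first avoids the redundant
# (perfectly collinear) dummy column that would hurt the linear models.
encoded = pd.get_dummies(df, columns=["sex", "smoker", "region"], drop_first=True)
print(encoded.columns.tolist())
```

The encoded frame then feeds the same model-comparison and RandomizedSearchCV loop used in the housing project.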
| Model | Cross-Validation R² | Test R² | Notes |
|---|---|---|---|
| Linear Regression | ~0.75 | - | Baseline |
| Ridge Regression | ~0.75 | - | Regularized |
| Lasso Regression | ~0.75 | - | Feature Selection |
| Random Forest | ~0.80 | - | Ensemble |
| XGBoost | ~0.87 | 0.884 | Best Performance |
- 88.4% variance explained = Extremely reliable cost estimates
- Feature insights for risk factor identification
- Scalable framework for insurance portfolio pricing
Location: ./Machine Learning/Medical Insurance/
Main File: Medical Insurance.ipynb
Domain: Computer Vision | Type: Binary Classification | Framework: PyTorch + Transfer Learning
An advanced transfer learning implementation using pre-trained ResNet18 for cats vs dogs classification, demonstrating state-of-the-art computer vision techniques with multiple fine-tuning strategies.
- Dataset: 8,000 training + 2,023 test images (cats vs dogs)
- Best Model: Fine-tuned ResNet18 with 98.5% test accuracy - Outstanding performance!
- Architecture: Pre-trained ResNet18 with modified classifier head
- Strategies: Feature extraction, full fine-tuning, partial fine-tuning
- Optimization: Multiple learning rates for backbone vs head layers
- Transfer Learning: ImageNet pre-trained ResNet18 backbone
- Advanced Data Augmentation: RandomResizedCrop, RandomHorizontalFlip
- Multiple Training Strategies:
- Feature extraction (frozen backbone): 98.37% accuracy
- Full fine-tuning: 98.56% accuracy
- Partial fine-tuning (layer4 + fc): 98.32% accuracy
- Differential Learning Rates: Lower LR for backbone, higher for head
- Model Variants: 7 different saved models with various configurations
| Strategy | Test Accuracy | Notes |
|---|---|---|
| Feature Extraction | 98.37% | Only FC layer trained |
| Full Fine-tuning | 98.56% | Best Performance |
| Partial Fine-tuning | 98.32% | Layer4 + FC trained |
| No Augmentation | ~98.3% | Baseline comparison |
| No Warmup | ~98.2% | Training strategy comparison |
- `resnet18_finetuned.pth`: Best full fine-tuning model
- `resnet18_finetuned_warmup.pth`: With warmup strategy
- `resnet18_finetuned_noaug.pth`: Without data augmentation
- `resnet18_partial_finetune.pth`: Partial fine-tuning variants
- 98.5% accuracy = Production-ready computer vision model
- Transfer learning efficiency = Faster training with less data
- Multiple strategies for different computational budgets
- Scalable framework for binary image classification tasks
Location: ./Deep Learning/Transfer Learning ResNet18/
Main File: Transfer Learning.ipynb
Domain: Computer Vision | Type: Multi-class Classification | Framework: PyTorch
An improved Convolutional Neural Network implementation for classifying images from the CIFAR-10 dataset into 10 distinct categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) with enhanced architecture and training strategies.
- Dataset: 60,000 32×32 color images in 10 classes
- Architecture: Enhanced CNN with BatchNormalization and Dropout
- Performance: 78.51% accuracy on test set - Significant improvement!
- Training: 30 epochs with optimized hyperparameters
- Output: Saved model weights (`cifar_net.pth`)
- Improved Architecture: 3-layer CNN with BatchNorm and Dropout
- Enhanced Preprocessing: Advanced data augmentation with larger batch size (32)
- GPU Acceleration: CUDA support for faster training
- Real-time Loss Monitoring: Training progress tracking
- Comprehensive Evaluation: Per-class accuracy analysis
- Conv Layers: 32→64→128 channels with 3×3 kernels and padding
- Batch Normalization: After each convolutional layer
- Dropout: 0.5 regularization before final classification
- Fully Connected: 512 hidden units for classification head
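Sketched as a PyTorch module, the architecture above might look like this (layer details are inferred from the bullet points, not copied from the notebook):

```python
import torch
import torch.nn as nn

class CifarNet(nn.Module):
    """Three conv blocks (32 -> 64 -> 128) with BatchNorm, then a 512-unit head."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                 # 32x32 input -> 4x4 after three pools
            nn.Linear(128 * 4 * 4, 512), nn.ReLU(),
            nn.Dropout(0.5),              # regularization before classification
            nn.Linear(512, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = CifarNet()(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```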
- Previous: ~54% accuracy (basic CNN)
- Current: 78.51% accuracy (+24.5 percentage points)
- Training: 30 epochs vs. previous 20 epochs
- Batch Size: Increased from 4 to 32 for better convergence
Location: ./CIFAR10/
Main File: CIFAR10_neural_network.ipynb
Domain: Healthcare Analytics | Type: Binary Classification | Framework: Scikit-learn + XGBoost
A comprehensive machine learning pipeline for predicting diabetes risk using medical and demographic data from the Pima Indians Diabetes Database.
- Dataset: 768 patient records with 8 medical features
- Best Model: XGBoost with 76% recall, 72% F1-score
- Pipeline: End-to-end ML workflow with preprocessing
- Clinical Focus: High recall to minimize missed diagnoses
- Class Imbalance: Handled with SMOTE and scale_pos_weight
- Advanced Preprocessing: Missing value imputation, robust scaling
- Feature Engineering: Glucose/Insulin ratio creation
- Model Comparison: Logistic Regression variants vs. XGBoost
- Medical Validation: Clinically relevant evaluation metrics
- Comprehensive EDA: Detailed data exploration and visualization
| Model | Recall | F1-Score | AUC | Notes |
|---|---|---|---|---|
| Logistic Regression | 0.55 | 0.65 | 0.82 | Baseline |
| Polynomial Features | 0.60 | 0.68 | 0.83 | Non-linear |
| XGBoost | 0.76 | 0.72 | 0.85 | Best |
Location: ./DiabetPredictor/
Main File: DiabetPredictor.ipynb
Domain: Computer Vision | Type: Multi-class Classification (102 classes) | Framework: PyTorch
A deep learning project for classifying flower images into 102 different species using the Oxford 102 Category Flower Dataset.
- Dataset: 8,189 high-resolution flower images
- Classes: 102 different flower species
- Architecture: 3-layer CNN with ~51.5M parameters
- Input: 224×224×3 RGB images
- Preprocessing: Resizing, normalization, tensor conversion
- Custom Dataset Class: Handles MATLAB label files
- Large-scale Classification: 102-way classification problem
- GPU Optimization: CUDA acceleration for training
- Data Loading: Efficient batch processing
- Pre-defined Splits: Uses official train/validation/test splits
```
Input (224×224×3) → Conv+ReLU+MaxPool → Conv+ReLU+MaxPool →
Conv+ReLU+MaxPool → Flatten → FC(512) → FC(102 classes)
```
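The custom Dataset class reads the Oxford label file with `scipy.io.loadmat`; a tiny in-memory `.mat` stands in for `imagelabels.mat` here so the snippet is self-contained:

```python
import io

import numpy as np
from scipy.io import loadmat, savemat

# Build a miniature stand-in for imagelabels.mat (MATLAB labels are 1-indexed).
buf = io.BytesIO()
savemat(buf, {"labels": np.array([[77, 3, 102]])})
buf.seek(0)

mat = loadmat(buf)                    # for the real file: loadmat("imagelabels.mat")
labels = mat["labels"].squeeze() - 1  # shift to 0-indexed classes for PyTorch
print(labels)
```

The same pattern reads `setid.mat` to recover the official train/validation/test splits.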
Location: ./Flower Classification/
Main File: Flowers.ipynb
- PyTorch: Deep learning models (Transfer Learning, CIFAR-10, Flower Classification)
- Scikit-learn: Traditional ML algorithms (Diabetes Predictor)
- XGBoost: Gradient boosting for tabular data
- Transfer Learning: Pre-trained model fine-tuning
- Pandas: Data manipulation and analysis
- NumPy: Numerical computing
- SciPy: Scientific computing (MATLAB file handling)
- imbalanced-learn: Class imbalance handling (SMOTE)
- Matplotlib: Plotting and visualization
- Seaborn: Statistical data visualization
- Jupyter Notebook: Interactive development
- Python 3.8+: Core programming language
- CUDA: GPU acceleration (optional)
- Python 3.8 or higher
- pip package manager
- Jupyter Notebook
- GPU with CUDA support (recommended for deep learning projects)
- Clone the repository

  ```bash
  git clone https://github.com/doroteaMonaco/Predictors-Projects.git
  cd Predictors-Projects
  ```

- Choose a project and navigate to its directory

  ```bash
  cd CIFAR10                  # For CIFAR-10 classification
  cd DiabetPredictor          # For diabetes prediction
  cd "Flower Classification"  # For flower classification
  ```

- Open the Jupyter notebook

  ```bash
  jupyter notebook
  ```
| Aspect | California Housing | Medical Insurance | Transfer Learning | CIFAR-10 | Diabetes Predictor | Flower Classification |
|---|---|---|---|---|---|---|
| Domain | Real Estate | Insurance Analytics | Computer Vision | Computer Vision | Healthcare | Computer Vision |
| Data Type | Tabular | Tabular | Images (224×224) | Images (32×32) | Tabular | Images (224×224) |
| Problem Type | Regression | Regression | Binary Classification | Classification | Binary Classification | Multi-class (102) |
| Samples | 20,640 | 1,338 | 10,023 | 60,000 | 768 | 8,189 |
| Algorithm | XGBoost | XGBoost | ResNet18 Transfer | CNN | XGBoost | CNN |
| Framework | Scikit-learn | Scikit-learn | PyTorch | PyTorch | Scikit-learn | PyTorch |
| Performance | 84.4% R² | 88.4% R² | 98.5% Accuracy | 78.5% Accuracy | 76% Recall | In Progress |
| Focus | Price Prediction | Cost Prediction | Transfer Learning | Multi-class | Medical/Recall | Large-scale |
| Complexity | High Pipeline | High Pipeline | Advanced CV | Medium | High Pipeline | High Architecture |
```bash
# Essential packages for all projects
pip install jupyter pandas numpy matplotlib seaborn

# For PyTorch projects (CIFAR-10, Flower Classification)
pip install torch torchvision torchaudio

# For traditional ML (Diabetes Predictor)
pip install scikit-learn xgboost imbalanced-learn

# For flower classification MATLAB files
pip install scipy

# For visualization
pip install pillow

# For CUDA-enabled PyTorch (check CUDA version first)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

Verify the installation:

```python
import torch
import sklearn
import xgboost
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Scikit-learn version:", sklearn.__version__)
```

Run each project from its directory:

```bash
cd "Machine Learning/California Housing"
jupyter notebook CaliforniaHousing.ipynb
# Uses built-in scikit-learn dataset
```

```bash
cd "Machine Learning/Medical Insurance"
jupyter notebook "Medical Insurance.ipynb"
# Uses included insurance.csv dataset
```

```bash
cd "Deep Learning/Transfer Learning ResNet18"
jupyter notebook "Transfer Learning.ipynb"
# Uses cats vs dogs dataset in archive/ folder
```

```bash
cd "Deep Learning/CIFAR10"
jupyter notebook CIFAR10_neural_network.ipynb
# Dataset downloads automatically
```

```bash
cd "Machine Learning/Diabet Predictor"
jupyter notebook DiabetPredictor.ipynb
# Uses included diabetes.csv dataset
```

```bash
cd "Deep Learning/Flower Classification"
jupyter notebook Flowers.ipynb
# Requires manual dataset setup (see project README)
```

- GPU Recommended: Deep learning projects benefit significantly from GPU acceleration
- Memory Requirements: Ensure sufficient RAM (8GB+) and GPU memory (2GB+)
- Training Time: Varies from minutes (diabetes) to hours (deep learning)
| Project | Metric | Value | Significance |
|---|---|---|---|
| Transfer Learning | Test Accuracy | 98.5% | Outstanding computer vision performance |
| California Housing | Test R² | 84.4% | Excellent regression performance |
| Medical Insurance | Test R² | 88.4% | Outstanding regression performance |
| CIFAR-10 | Test Accuracy | 78.5% | Strong CNN performance with improvements |
| Diabetes | Recall | 76% | High medical relevance |
| Diabetes | F1-Score | 72% | Balanced performance |
| Flower | Architecture | 51.5M params | Large-scale classification |
- Transfer Learning: Outstanding 98.5% accuracy for binary image classification
- Medical Insurance: Outstanding 88.4% R² score for insurance cost prediction
- California Housing: Excellent 84.4% R² score for regression prediction
- CIFAR-10: Strong 78.5% accuracy with enhanced CNN architecture (+24.5 percentage points)
- Diabetes: Clinically relevant model with 76% recall (38% improvement over baseline)
- Flower: Complex 102-class classification with sophisticated preprocessing
```
Predictors-Projects/
├── README.md                          # This comprehensive overview
├── .gitignore                         # Git ignore patterns
│
├── Machine Learning/                  # Traditional ML & Regression Projects
│   ├── California Housing/            # Real Estate Price Prediction
│   │   ├── README.md                  # Project-specific documentation
│   │   └── CaliforniaHousing.ipynb    # Complete regression pipeline
│   │
│   ├── Medical Insurance/             # Insurance Cost Prediction
│   │   ├── README.md                  # Detailed project documentation
│   │   ├── Medical Insurance.ipynb    # Complete ML pipeline (88.4% R²)
│   │   └── archive/                   # Dataset directory
│   │       └── insurance.csv          # Medical insurance dataset
│   │
│   └── Diabet Predictor/              # Healthcare Analytics
│       ├── README.md                  # Detailed project documentation
│       ├── DiabetPredictor.ipynb      # Complete ML pipeline
│       └── data_lab9/                 # Dataset directory
│           └── diabetes.csv           # Pima Indians Diabetes Database
│
└── Deep Learning/                     # Neural Networks & Computer Vision
    ├── Transfer Learning ResNet18/    # Advanced Transfer Learning
    │   ├── README.md                  # Project documentation
    │   ├── Transfer Learning.ipynb    # Complete transfer learning pipeline
    │   ├── resnet18_finetuned.pth     # Best model (98.5% accuracy)
    │   ├── resnet18_finetuned_warmup.pth  # Warmup strategy model
    │   ├── resnet18_finetuned_noaug.pth   # No augmentation model
    │   ├── resnet18_partial_finetune.pth  # Partial fine-tuning model
    │   └── archive/                   # Cats vs Dogs dataset
    │       ├── training_set/          # Training images
    │       └── test_set/              # Test images
    │
    ├── CIFAR10/                       # Computer Vision - 10 classes
    │   ├── README.md                  # Project-specific documentation
    │   ├── CIFAR10_neural_network.ipynb  # Main implementation
    │   ├── cifar_net.pth              # Saved model weights
    │   └── data/                      # CIFAR-10 dataset (auto-downloaded)
    │       └── cifar-10-batches-py/   # Extracted dataset files
    │
    ├── Fashion MINST/                 # Fashion item classification
    │   └── [Fashion MNIST files]      # Fashion classification project
    │
    └── Flower Classification/         # Computer Vision - 102 classes
        ├── README.md                  # Project documentation
        ├── Flowers.ipynb              # Deep learning implementation
        ├── imagelabels.mat            # Flower category labels
        ├── setid.mat                  # Train/test/validation splits
        └── 102flowers/                # Flower images dataset
            └── jpg/                   # 8,189 flower images
```
This collection demonstrates:
- Deep Learning: CNN architecture design and training
- Traditional ML: Feature engineering and model selection
- Data Preprocessing: Handling missing values, scaling, imbalance
- Model Evaluation: Appropriate metrics for different domains
- Framework Proficiency: PyTorch and Scikit-learn expertise
- Computer Vision: Image classification challenges and solutions
- Healthcare Analytics: Medical data characteristics and evaluation priorities
- Multi-class Problems: Scaling to large number of categories
- Reproducible Research: Comprehensive documentation and code organization
- Evaluation Focus: Domain-appropriate metrics (accuracy vs. recall)
- Pipeline Development: End-to-end ML workflow implementation
Contributions are welcome! Areas for improvement:
- Transfer Learning: Pre-trained models for image classification
- Hyperparameter Tuning: Systematic optimization
- Cross-Validation: Robust evaluation strategies
- Ensemble Methods: Combining multiple models
- Deployment: Model serving and API development
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- CIFAR-10: Canadian Institute for Advanced Research
- Pima Indians Diabetes: UCI Machine Learning Repository
- Oxford 102 Flowers: Visual Geometry Group, University of Oxford
- PyTorch Team: Deep learning framework
- Scikit-learn Contributors: Machine learning library
- XGBoost Developers: Gradient boosting framework
- Politecnico di Torino: Academic context and guidance
- Open Source Community: Libraries and tools that made this possible
Author: Dorotea Monaco
Institution: Politecnico di Torino
GitHub: @doroteaMonaco
Repository: Predictors-Projects
- California Housing Price Prediction
- Medical Insurance Cost Predictor
- Transfer Learning ResNet18
- Diabetes Predictor
- CIFAR-10 Implementation
- Flower Classification
Built with intelligence, passion, and scientific rigor
Last Updated: October 2025