arminerika/CreditRisk

Credit Card Default – Regression and Classification

This repository contains the code and analysis for a machine learning exercise that uses the Default of Credit Card Clients dataset from the UCI Machine Learning Repository.

The goal is to model credit card client behavior in two ways:

  1. Regression: Predict a client's payment amount, using PAY_AMT1 (the amount paid in the most recent month of the dataset) as the target
  2. Classification: Predict whether a client will default on their credit card payment next month

Dataset

The dataset includes:

  • Demographic information: SEX, EDUCATION, MARRIAGE, AGE
  • Credit limit: LIMIT_BAL
  • Past payment status: PAY_0, PAY_2, ..., PAY_6 (repayment status for 6 month period)
  • Past bill amounts: BILL_AMT1 to BILL_AMT6 (bill statements for 6 month period)
  • Past payment amounts: PAY_AMT1 to PAY_AMT6 (actual payments for 6 month period)
  • Target variable: default.payment.next.month (1 = default, 0 = no default)

The original data contains 30,000 observations and 25 columns.
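As a quick sanity check on these figures, a loader along the following lines could report the table shape, the share of defaults, and the number of missing cells. This is only a sketch: `load_credit_data` is a hypothetical helper name, and the default path assumes the CSV sits in the working directory as shown in the repository structure.

```python
from pathlib import Path

import pandas as pd

TARGET = "default.payment.next.month"


def load_credit_data(path="UCI_Credit_Card.csv"):
    """Hypothetical loader: read the CSV and summarize shape and class balance."""
    df = pd.read_csv(path)
    summary = {
        "shape": df.shape,                              # (rows, columns)
        "default_rate": float(df[TARGET].mean()),       # share of rows with target == 1
        "missing_cells": int(df.isna().sum().sum()),    # total NaN cells in the table
    }
    return df, summary


# Only runs when the dataset file is actually present.
if Path("UCI_Credit_Card.csv").exists():
    df, summary = load_credit_data()
    print(summary)
```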


Problem Formulation

Regression Task

  • Target: PAY_AMT1
    This is the amount paid in September (the most recent month in the dataset).
  • Goal: Given demographic variables, credit information, past bill amounts, and previous payments, estimate how much a client will pay in the next month.
  • Use Cases: Cash flow forecasting, budgeting, financial planning

Classification Task

  • Target: default.payment.next.month
    • 1: client defaulted on payment
    • 0: client did not default
  • Goal: Predict whether a client will default on their payment next month using the same feature set.
  • Use Cases: Risk assessment, credit decisions, early intervention for high-risk clients

Both tasks use the same feature matrix X, so the regression and classification models can be compared directly in the same real-world context.


Methods

All models are implemented using scikit-learn.

Preprocessing

  1. Select a subset of relevant features:

    • LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE
    • PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6
    • BILL_AMT1 to BILL_AMT6
    • PAY_AMT2 to PAY_AMT6
  2. Drop rows with missing values in the selected columns and targets

  3. Split into train and test sets (80/20 split)

  4. Standardize numerical features using StandardScaler fitted on the training data only (to prevent data leakage)
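The four steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `prepare` is a hypothetical helper, `random_state=42` is an arbitrary choice, every selected column is scaled (including the coded categorical ones), and a synthetic table stands in for the CSV so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Feature columns from step 1; PAY_AMT1 is left out because it is the regression target.
FEATURES = (
    ["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE"]
    + ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
    + [f"BILL_AMT{i}" for i in range(1, 7)]
    + [f"PAY_AMT{i}" for i in range(2, 7)]
)
REG_TARGET = "PAY_AMT1"
CLF_TARGET = "default.payment.next.month"


def prepare(df, test_size=0.2, seed=42):
    """Hypothetical helper: select columns, drop missing rows, split, and scale."""
    df = df.dropna(subset=FEATURES + [REG_TARGET, CLF_TARGET])
    X = df[FEATURES].to_numpy(dtype=float)
    y_reg = df[REG_TARGET].to_numpy(dtype=float)
    y_clf = df[CLF_TARGET].to_numpy(dtype=int)

    # One split shared by both targets, so both models see the same rows.
    X_tr, X_te, yr_tr, yr_te, yc_tr, yc_te = train_test_split(
        X, y_reg, y_clf, test_size=test_size, random_state=seed
    )

    # Fit the scaler on the training rows only, then apply it to both sets.
    scaler = StandardScaler().fit(X_tr)
    return scaler.transform(X_tr), scaler.transform(X_te), yr_tr, yr_te, yc_tr, yc_te


# Synthetic stand-in for UCI_Credit_Card.csv (assumption: real data has the same columns).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, len(FEATURES))), columns=FEATURES)
demo[REG_TARGET] = rng.normal(size=100)
demo[CLF_TARGET] = rng.integers(0, 2, size=100)

X_train, X_test, yr_train, yr_test, yc_train, yc_test = prepare(demo)
print(X_train.shape, X_test.shape)
```

Because the scaler's mean and variance come from the training rows alone, no information from the test rows reaches the model before evaluation.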

Models

  • Regression: LinearRegression from scikit-learn
  • Classification: LogisticRegression with max_iter=1000

No heavy hyperparameter tuning is performed because the focus is on understanding and comparing the two basic model types.
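To illustrate how the two estimators differ in what they return, here is a small sketch on synthetic data. The data, coefficients, and variable names are made up for the example; only `LinearRegression` and `LogisticRegression(max_iter=1000)` come from the description above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic stand-in for the scaled feature matrix (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y_amount = X @ np.array([3.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)
y_default = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0.8).astype(int)

reg = LinearRegression().fit(X, y_amount)
clf = LogisticRegression(max_iter=1000).fit(X, y_default)

amount_pred = reg.predict(X[:5])                # continuous values
default_prob = clf.predict_proba(X[:5])[:, 1]   # estimated P(class 1) per row
default_pred = clf.predict(X[:5])               # hard 0/1 labels

print(amount_pred)
print(default_prob)
print(default_pred)
```

The regressor outputs an amount directly, while the classifier outputs a probability that is turned into a label, which is why the two tasks need different evaluation metrics.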


Evaluation

Regression Metrics

  • Mean Absolute Error (MAE): Average absolute prediction error, in the same units as the payment amounts (New Taiwan dollars)
  • R-squared (R²): Indicates proportion of variance explained by the model

These metrics are used to measure how close the predicted payment amounts are to the actual payments.
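A small worked example may help make both metrics concrete. The four payment values below are invented for illustration; the hand calculations in the comments mirror what `mean_absolute_error` and `r2_score` compute.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted payment amounts.
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_pred = np.array([1500.0, 2000.0, 2500.0, 5000.0])

# MAE: mean of |actual - predicted| = (500 + 0 + 500 + 1000) / 4
mae = mean_absolute_error(y_true, y_pred)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)          # squared errors of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared errors of predicting the mean
r2 = r2_score(y_true, y_pred)

print(mae, r2, 1 - ss_res / ss_tot)
```

An R² of 0 corresponds to a model that does no better than always predicting the mean payment, and negative values are possible for models that do worse.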

Classification Metrics

  • Accuracy: Overall correctness of predictions
  • Precision: Of predicted defaults, how many were actual defaults
  • Recall: Of actual defaults, how many were correctly identified
  • F1-score: Harmonic mean of precision and recall
  • Confusion Matrix: Detailed breakdown of predictions
  • ROC Curve & AUC: Model's ability to distinguish between classes

Because the dataset is imbalanced (fewer defaults than non-defaults), precision, recall, and F1 are more informative than accuracy alone.
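The effect of imbalance on accuracy can be seen in a toy example. The ten labels and scores below are invented; the 0.5 threshold is an assumption made for the illustration.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up data: 8 non-defaults (0) and 2 defaults (1), with model scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.6, 0.7, 0.8, 0.35])
y_pred = (y_score >= 0.5).astype(int)   # assumed decision threshold of 0.5

# A baseline that always predicts "no default".
baseline = np.zeros_like(y_true)
print("baseline accuracy:", accuracy_score(y_true, baseline))
print("baseline recall:  ", recall_score(y_true, baseline))

# Rows of the confusion matrix are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("roc auc:  ", roc_auc_score(y_true, y_score))
```

The baseline reaches higher accuracy than the scored model here even though it never identifies a single default, which is the situation that precision, recall, and F1 are meant to expose.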


Repository Structure

.
├── UCI_Credit_Card.csv          # Dataset
├── ml_exercise.ipynb            # Main Jupyter notebook with analysis
├── models.py                    # Python module with ML functions
├── README.md                    # This file
├── requirements.txt             # Python dependencies
└── .gitignore

Installation and Usage

Prerequisites

  • Python 3.7 or higher
  • pip package manager

Setup

  1. Clone this repository:

    git clone <your-repo-url>
    cd <repo-name>
  2. Install required packages:

    pip install -r requirements.txt
  3. Run the Jupyter notebook:

    jupyter notebook ml_exercise.ipynb

Running the Analysis

The notebook is organized into the following sections:

  1. Data Loading: Import dependencies and load the dataset
  2. Data Exploration: Examine dataset structure and statistics
  3. Preprocessing: Build feature matrix and target variables
  4. Train/Test Split: Split data and standardize features
  5. Regression Model: Train and evaluate Linear Regression
  6. Classification Model: Train and evaluate Logistic Regression
  7. Visualizations:
    • Predicted vs Actual scatter plots
    • Residuals analysis
    • Confusion matrix
    • ROC curve
    • Feature importance
  8. Discussion: Interpret results and provide recommendations

Key Findings

  1. Class Imbalance: The dataset has more non-defaulters (77.88%) than defaulters (22.12%), which affects classification performance

  2. Feature Importance: Payment history features (PAY_0, PAY_2, etc.) and bill amounts are the strongest predictors for both tasks

  3. Model Trade-offs:

    • Classification model is better for identifying at-risk clients and making credit decisions
    • Regression model is better for financial forecasting and budgeting
  4. Performance: Both models show reasonable performance, but there's room for improvement through:

    • Feature engineering (ratios, trends, interactions)
    • Addressing class imbalance (SMOTE, class weighting)
    • Trying more complex models (Random Forest, Gradient Boosting, Neural Networks)
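Of the improvements listed, class weighting is the smallest change to try. The sketch below uses synthetic imbalanced data rather than the credit card dataset, and it scores on the training rows for brevity; in practice a held-out test set would be used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced data (hypothetical); class 1 plays the role of "default".
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
p = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 1.8)))
y = (rng.random(2000) < p).astype(int)

plain = LogisticRegression(max_iter=1000).fit(X, y)
# class_weight="balanced" weights samples inversely to their class frequency.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

print("share of class 1:  ", y.mean())
print("recall, unweighted:", recall_score(y, plain.predict(X)))
print("recall, balanced:  ", recall_score(y, weighted.predict(X)))
```

Weighting typically trades precision for recall, so the choice depends on whether missed defaults or false alarms are more costly.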

Recommendations

  1. For operational decisions (credit approvals, client contact): Use the classification model
  2. For financial planning (revenue forecasting, cash flow): Use the regression model
  3. For best results: Use both models together for comprehensive client risk assessment
  4. Future improvements:
    • Ensemble methods for better predictions
    • Cross-validation for more robust evaluation
    • Feature engineering for domain-specific insights
    • Cost-sensitive learning to account for business costs of false positives/negatives
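As one possible starting point for the cross-validation item, the scaler and classifier can be wrapped in a pipeline so that scaling is refit inside every fold. The data here is synthetic, and the choice of 5 folds and F1 scoring is an assumption made for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (hypothetical stand-in for the credit card features).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(scale=0.7, size=500) > 0.9).astype(int)

# The pipeline refits StandardScaler on each training fold, avoiding leakage.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

# With an integer cv and a classifier, scikit-learn uses stratified folds.
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print(scores.mean(), scores.std())
```

For cost-sensitive learning, `class_weight` also accepts a dictionary such as `{0: 1, 1: 5}`; the specific weights would have to come from the business cost of each error type.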

License

This project is for educational purposes as part of CS4680 Prompt Engineering coursework.


Acknowledgments

  • Dataset: UCI Machine Learning Repository
  • Course: CS4680 Prompt Engineering
  • Tools: Python, scikit-learn, pandas, matplotlib, seaborn

About

Model credit card client behaviors.
