This repository contains the code and analysis for a machine learning exercise that uses the Default of Credit Card Clients dataset from the UCI Machine Learning Repository.
The goal is to model credit card client behavior in two ways:
- Regression: Predict the amount a client will pay in the next month (PAY_AMT1)
- Classification: Predict whether a client will default on their credit card payment next month
- Name: Default of Credit Card Clients
- Source (UCI): Default of Credit Card Clients Dataset
- Kaggle: https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset
The dataset includes:
- Demographic information: `SEX`, `EDUCATION`, `MARRIAGE`, `AGE`
- Credit limit: `LIMIT_BAL`
- Past payment status: `PAY_0`, `PAY_2`, ..., `PAY_6` (repayment status over a 6-month period)
- Past bill amounts: `BILL_AMT1`–`BILL_AMT6` (bill statements over a 6-month period)
- Past payment amounts: `PAY_AMT1`–`PAY_AMT6` (actual payments over a 6-month period)
- Target variable: `default.payment.next.month` (1 = default, 0 = no default)
The original data contains 30,000 observations and 25 columns.
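For orientation, here is a minimal loading-and-inspection sketch, assuming the CSV file name listed in the project structure below and the column names above (the notebook's actual code may differ):

```python
import pandas as pd

# Load the dataset (file name as listed in the project structure)
df = pd.read_csv("UCI_Credit_Card.csv")

# Sanity checks: 30,000 rows x 25 columns, and the target's class balance
print(df.shape)
print(df["default.payment.next.month"].value_counts(normalize=True))
```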
- Target: `PAY_AMT1`, the amount paid in September (the most recent month in the dataset)
- Goal: Given demographic variables, credit information, past bill amounts, and previous payments, estimate how much a client will pay in the next month.
- Use Cases: Cash flow forecasting, budgeting, financial planning
- Target: `default.payment.next.month` (1 = client defaulted on payment, 0 = client did not default)
- Goal: Predict whether a client will default on their payment next month using the same feature set.
- Use Cases: Risk assessment, credit decisions, early intervention for high-risk clients
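A minimal sketch of how the two targets can be taken from the same dataframe, assuming the column names above (not necessarily the exact notebook code):

```python
import pandas as pd

df = pd.read_csv("UCI_Credit_Card.csv")

# Regression target: amount paid in the most recent month (September)
y_reg = df["PAY_AMT1"]

# Classification target: 1 = default next month, 0 = no default
y_clf = df["default.payment.next.month"]
```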
Both tasks use the same feature matrix X so that we can directly compare how regression and classification models behave on the same real-world context.
All models are implemented using scikit-learn.
- Select a subset of relevant features: `LIMIT_BAL`, `SEX`, `EDUCATION`, `MARRIAGE`, `AGE`, `PAY_0`, `PAY_2`–`PAY_6`, `BILL_AMT1`–`BILL_AMT6`, and `PAY_AMT2`–`PAY_AMT6` (`PAY_AMT1` is excluded because it is the regression target)
- Drop rows with missing values in the selected columns and targets
- Split into train and test sets (80/20 split)
- Standardize numerical features using `StandardScaler` fitted on the training data only, to prevent data leakage (see the preprocessing sketch after this list)
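A minimal sketch of these preprocessing steps, assuming the column names above; details such as `random_state` are illustrative, and the notebook and `models.py` may organize this differently:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("UCI_Credit_Card.csv")

# Feature subset listed above (PAY_AMT1 is held out as the regression target)
features = (
    ["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE", "PAY_0"]
    + [f"PAY_{i}" for i in range(2, 7)]        # PAY_2 ... PAY_6
    + [f"BILL_AMT{i}" for i in range(1, 7)]    # BILL_AMT1 ... BILL_AMT6
    + [f"PAY_AMT{i}" for i in range(2, 7)]     # PAY_AMT2 ... PAY_AMT6
)
targets = ["PAY_AMT1", "default.payment.next.month"]

# Drop rows with missing values in the selected columns and targets
df = df.dropna(subset=features + targets)

X = df[features]
y_reg = df["PAY_AMT1"]                    # regression target
y_clf = df["default.payment.next.month"]  # classification target

# One 80/20 split shared by both tasks
X_train, X_test, y_reg_train, y_reg_test, y_clf_train, y_clf_test = train_test_split(
    X, y_reg, y_clf, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only to prevent leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```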
- Regression: `LinearRegression` from scikit-learn
- Classification: `LogisticRegression` with `max_iter=1000`
No heavy hyperparameter tuning is performed because the focus is on understanding and comparing the two basic model types.
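Continuing from the preprocessing sketch above, fitting the two models might look roughly like this:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict next month's payment amount (PAY_AMT1)
reg = LinearRegression()
reg.fit(X_train_scaled, y_reg_train)

# Classification: predict default.payment.next.month
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_scaled, y_clf_train)
```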
- Mean Absolute Error (MAE): Measures the average absolute prediction error in the payment currency (NT dollars)
- R-squared (R²): Indicates proportion of variance explained by the model
These metrics are used to measure how close the predicted payment amounts are to the actual payments.
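Continuing from the sketches above, the regression metrics can be computed with scikit-learn roughly as follows:

```python
from sklearn.metrics import mean_absolute_error, r2_score

y_reg_pred = reg.predict(X_test_scaled)
print("MAE:", mean_absolute_error(y_reg_test, y_reg_pred))  # average error in NT dollars
print("R²:", r2_score(y_reg_test, y_reg_pred))              # proportion of variance explained
```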
- Accuracy: Overall correctness of predictions
- Precision: Of predicted defaults, how many were actual defaults
- Recall: Of actual defaults, how many were correctly identified
- F1-score: Harmonic mean of precision and recall
- Confusion Matrix: Detailed breakdown of predictions
- ROC Curve & AUC: Model's ability to distinguish between classes
Because the dataset is imbalanced (fewer defaults than non-defaults), precision, recall, and F1 are more informative than accuracy alone.
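A sketch of how these classification metrics can be computed, continuing from the model sketch above:

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

y_clf_pred = clf.predict(X_test_scaled)
y_clf_proba = clf.predict_proba(X_test_scaled)[:, 1]  # predicted probability of default

print(classification_report(y_clf_test, y_clf_pred))  # accuracy, precision, recall, F1
print(confusion_matrix(y_clf_test, y_clf_pred))
print("ROC AUC:", roc_auc_score(y_clf_test, y_clf_proba))
```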
.
├── UCI_Credit_Card.csv # Dataset
├── ml_exercise.ipynb # Main Jupyter notebook with analysis
├── models.py # Python module with ML functions
├── README.md # This file
├── requirements.txt # Python dependencies
└── .gitignore
- Python 3.7 or higher
- pip package manager
- Clone this repository: `git clone <your-repo-url>`, then `cd <repo-name>`
- Install required packages: `pip install -r requirements.txt`
- Run the Jupyter notebook: `jupyter notebook ml_exercise.ipynb`
The notebook is organized into the following sections:
- Data Loading: Import dependencies and load the dataset
- Data Exploration: Examine dataset structure and statistics
- Preprocessing: Build feature matrix and target variables
- Train/Test Split: Split data and standardize features
- Regression Model: Train and evaluate Linear Regression
- Classification Model: Train and evaluate Logistic Regression
- Visualizations (see the plotting sketch after this list):
- Predicted vs Actual scatter plots
- Residuals analysis
- Confusion matrix
- ROC curve
- Feature importance
- Discussion: Interpret results and provide recommendations
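As a sketch of the confusion-matrix and ROC-curve plots in the Visualizations section (continuing from the sketches above; assumes scikit-learn >= 1.0 for the display helpers):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
ConfusionMatrixDisplay.from_estimator(clf, X_test_scaled, y_clf_test, ax=axes[0])
RocCurveDisplay.from_estimator(clf, X_test_scaled, y_clf_test, ax=axes[1])
axes[0].set_title("Confusion matrix")
axes[1].set_title("ROC curve")
plt.tight_layout()
plt.show()
```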
- Class Imbalance: The dataset has more non-defaulters (77.88%) than defaulters (22.12%), which affects classification performance
- Feature Importance: Payment history features (`PAY_0`, `PAY_2`, etc.) and bill amounts are the strongest predictors for both tasks
- Model Trade-offs:
  - Classification model is better for identifying at-risk clients and making credit decisions
  - Regression model is better for financial forecasting and budgeting
- Performance: Both models show reasonable performance, but there's room for improvement through:
  - Feature engineering (ratios, trends, interactions)
  - Addressing class imbalance (SMOTE, class weighting)
  - Trying more complex models (Random Forest, Gradient Boosting, Neural Networks)
- For operational decisions (credit approvals, client contact): Use the classification model
- For financial planning (revenue forecasting, cash flow): Use the regression model
- For best results: Use both models together for comprehensive client risk assessment
- Future improvements:
- Ensemble methods for better predictions
- Cross-validation for more robust evaluation
- Feature engineering for domain-specific insights
- Cost-sensitive learning to account for business costs of false positives/negatives (see the class-weighting sketch after this list)
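As one possible illustration of the class-weighting, cross-validation, and cost-sensitive ideas listed above (a sketch only, not part of the current notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# class_weight="balanced" is a simple first step toward cost-sensitive learning;
# 5-fold cross-validation gives a more robust estimate than a single train/test split.
# X and y_clf are the feature matrix and classification target from the sketches above.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
scores = cross_val_score(model, X, y_clf, cv=5, scoring="f1")
print("Cross-validated F1:", scores.mean())
```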
This project is for educational purposes as part of CS4680 Prompt Engineering coursework.
- Dataset: UCI Machine Learning Repository
- Course: CS4680 Prompt Engineering
- Tools: Python, scikit-learn, pandas, matplotlib, seaborn