This repository contains the code and analysis for a machine learning exercise that uses the Default of Credit Card Clients dataset from the UCI Machine Learning Repository.
The goal is to model credit card client behavior in two ways:
- Regression: Predict the amount a client will pay in the next month (PAY_AMT1)
- Classification: Predict whether a client will default on their credit card payment next month
- Name: Default of Credit Card Clients
- Source (UCI): Default of Credit Card Clients Dataset
- Kaggle: https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset
The dataset includes:
- Demographic information: `SEX`, `EDUCATION`, `MARRIAGE`, `AGE`
- Credit limit: `LIMIT_BAL`
- Past payment status: `PAY_0`, `PAY_2`, ..., `PAY_6` (repayment status over a 6-month period)
- Past bill amounts: `BILL_AMT1`–`BILL_AMT6` (bill statements over a 6-month period)
- Past payment amounts: `PAY_AMT1`–`PAY_AMT6` (actual payments over a 6-month period)
- Target variable: `default.payment.next.month` (1 = default, 0 = no default)
The original data contains 30,000 observations and 25 columns.
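For orientation, here is a minimal loading-and-inspection sketch, assuming the CSV file name listed in the project structure below and the column names above (the notebook's actual code may differ):

```python
import pandas as pd

# Load the dataset (file name as listed in the project structure)
df = pd.read_csv("UCI_Credit_Card.csv")

# Sanity checks: 30,000 rows x 25 columns, and the target's class balance
print(df.shape)
print(df["default.payment.next.month"].value_counts(normalize=True))
```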
- Target: `PAY_AMT1`, the amount paid in September (the most recent month in the dataset)
- Goal: Given demographic variables, credit information, past bill amounts, and previous payments, estimate how much a client will pay in the next month.
- Use Cases: Cash flow forecasting, budgeting, financial planning
- Target: `default.payment.next.month` (1 = client defaulted on payment, 0 = client did not default)
- Goal: Predict whether a client will default on their payment next month using the same feature set.
- Use Cases: Risk assessment, credit decisions, early intervention for high-risk clients
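A minimal sketch of how the two targets can be taken from the same dataframe, assuming the column names above (not necessarily the exact notebook code):

```python
import pandas as pd

df = pd.read_csv("UCI_Credit_Card.csv")

# Regression target: amount paid in the most recent month (September)
y_reg = df["PAY_AMT1"]

# Classification target: 1 = default next month, 0 = no default
y_clf = df["default.payment.next.month"]
```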
Both tasks use the same feature matrix X so that we can directly compare how regression and classification models behave on the same real-world context.
All models are implemented using scikit-learn.
- Select a subset of relevant features: `LIMIT_BAL`, `SEX`, `EDUCATION`, `MARRIAGE`, `AGE`, `PAY_0`, `PAY_2`–`PAY_6`, `BILL_AMT1`–`BILL_AMT6`, and `PAY_AMT2`–`PAY_AMT6` (`PAY_AMT1` is excluded because it is the regression target)
- Drop rows with missing values in the selected columns and targets
- Split into train and test sets (80/20 split)
- Standardize numerical features using `StandardScaler` fitted on the training data only, to prevent data leakage (see the preprocessing sketch after this list)
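A minimal sketch of these preprocessing steps, assuming the column names above; details such as `random_state` are illustrative, and the notebook and `models.py` may organize this differently:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("UCI_Credit_Card.csv")

# Feature subset listed above (PAY_AMT1 is held out as the regression target)
features = (
    ["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE", "PAY_0"]
    + [f"PAY_{i}" for i in range(2, 7)]        # PAY_2 ... PAY_6
    + [f"BILL_AMT{i}" for i in range(1, 7)]    # BILL_AMT1 ... BILL_AMT6
    + [f"PAY_AMT{i}" for i in range(2, 7)]     # PAY_AMT2 ... PAY_AMT6
)
targets = ["PAY_AMT1", "default.payment.next.month"]

# Drop rows with missing values in the selected columns and targets
df = df.dropna(subset=features + targets)

X = df[features]
y_reg = df["PAY_AMT1"]                    # regression target
y_clf = df["default.payment.next.month"]  # classification target

# One 80/20 split shared by both tasks
X_train, X_test, y_reg_train, y_reg_test, y_clf_train, y_clf_test = train_test_split(
    X, y_reg, y_clf, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only to prevent leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```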
- Regression: `LinearRegression` from scikit-learn
- Classification: `LogisticRegression` with `max_iter=1000`
No heavy hyperparameter tuning is performed because the focus is on understanding and comparing the two basic model types.
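Continuing from the preprocessing sketch above, fitting the two models might look roughly like this:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict next month's payment amount (PAY_AMT1)
reg = LinearRegression()
reg.fit(X_train_scaled, y_reg_train)

# Classification: predict default.payment.next.month
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_scaled, y_clf_train)
```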
- Mean Absolute Error (MAE): Measures the average absolute prediction error in the payment currency (NT dollars)
- R-squared (R²): Indicates proportion of variance explained by the model
These metrics are used to measure how close the predicted payment amounts are to the actual payments.
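Continuing from the sketches above, the regression metrics can be computed with scikit-learn roughly as follows:

```python
from sklearn.metrics import mean_absolute_error, r2_score

y_reg_pred = reg.predict(X_test_scaled)
print("MAE:", mean_absolute_error(y_reg_test, y_reg_pred))  # average error in NT dollars
print("R²:", r2_score(y_reg_test, y_reg_pred))              # proportion of variance explained
```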
- Accuracy: Overall correctness of predictions
- Precision: Of predicted defaults, how many were actual defaults
- Recall: Of actual defaults, how many were correctly identified
- F1-score: Harmonic mean of precision and recall
- Confusion Matrix: Detailed breakdown of predictions
- ROC Curve & AUC: Model's ability to distinguish between classes
Because the dataset is imbalanced (fewer defaults than non-defaults), precision, recall, and F1 are more informative than accuracy alone.
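A sketch of how these classification metrics can be computed, continuing from the model sketch above:

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

y_clf_pred = clf.predict(X_test_scaled)
y_clf_proba = clf.predict_proba(X_test_scaled)[:, 1]  # predicted probability of default

print(classification_report(y_clf_test, y_clf_pred))  # accuracy, precision, recall, F1
print(confusion_matrix(y_clf_test, y_clf_pred))
print("ROC AUC:", roc_auc_score(y_clf_test, y_clf_proba))
```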
.
├── UCI_Credit_Card.csv # Dataset
├── ml_exercise.ipynb # Main Jupyter notebook with analysis
├── models.py # Python module with ML functions
├── README.md # This file
├── requirements.txt # Python dependencies
└── .gitignore
- Python 3.7 or higher
- pip package manager
- Clone this repository: `git clone <your-repo-url>`, then `cd <repo-name>`
- Install required packages: `pip install -r requirements.txt`
- Run the Jupyter notebook: `jupyter notebook ml_exercise.ipynb`
The notebook is organized into the following sections:
- Data Loading: Import dependencies and load the dataset
- Data Exploration: Examine dataset structure and statistics
- Preprocessing: Build feature matrix and target variables
- Train/Test Split: Split data and standardize features
- Regression Model: Train and evaluate Linear Regression
- Classification Model: Train and evaluate Logistic Regression
- Visualizations (see the plotting sketch after this list):
- Predicted vs Actual scatter plots
- Residuals analysis
- Confusion matrix
- ROC curve
- Feature importance
- Discussion: Interpret results and provide recommendations
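As a sketch of the confusion-matrix and ROC-curve plots in the Visualizations section (continuing from the sketches above; assumes scikit-learn >= 1.0 for the display helpers):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
ConfusionMatrixDisplay.from_estimator(clf, X_test_scaled, y_clf_test, ax=axes[0])
RocCurveDisplay.from_estimator(clf, X_test_scaled, y_clf_test, ax=axes[1])
axes[0].set_title("Confusion matrix")
axes[1].set_title("ROC curve")
plt.tight_layout()
plt.show()
```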
- Class Imbalance: The dataset has more non-defaulters (77.88%) than defaulters (22.12%), which affects classification performance
- Feature Importance: Payment history features (`PAY_0`, `PAY_2`, etc.) and bill amounts are the strongest predictors for both tasks
- Model Trade-offs:
  - Classification model is better for identifying at-risk clients and making credit decisions
  - Regression model is better for financial forecasting and budgeting
- Performance: Both models show reasonable performance, but there's room for improvement through:
  - Feature engineering (ratios, trends, interactions)
  - Addressing class imbalance (SMOTE, class weighting)
  - Trying more complex models (Random Forest, Gradient Boosting, Neural Networks)
- For operational decisions (credit approvals, client contact): Use the classification model
- For financial planning (revenue forecasting, cash flow): Use the regression model
- For best results: Use both models together for comprehensive client risk assessment
- Future improvements:
- Ensemble methods for better predictions
- Cross-validation for more robust evaluation
- Feature engineering for domain-specific insights
- Cost-sensitive learning to account for business costs of false positives/negatives (see the class-weighting sketch after this list)
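As one possible illustration of the class-weighting, cross-validation, and cost-sensitive ideas listed above (a sketch only, not part of the current notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# class_weight="balanced" is a simple first step toward cost-sensitive learning;
# 5-fold cross-validation gives a more robust estimate than a single train/test split.
# X and y_clf are the feature matrix and classification target from the sketches above.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
scores = cross_val_score(model, X, y_clf, cv=5, scoring="f1")
print("Cross-validated F1:", scores.mean())
```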
This project is for educational purposes as part of CS4680 Prompt Engineering coursework.
- Dataset: UCI Machine Learning Repository
- Course: CS4680 Prompt Engineering
- Tools: Python, scikit-learn, pandas, matplotlib, seaborn