This project implements the K-Means clustering algorithm to segment mall customers based on their Age, Annual Income, and Spending Score. The goal is to group customers into clusters that exhibit similar characteristics for targeted marketing or customer segmentation strategies.
The dataset used in this project is included in the repository. It contains customer data with the following columns:
- CustomerID: Unique identifier for each customer.
- Gender: The gender of the customer.
- Age: The age of the customer.
- Annual Income (k$): The annual income of the customer in thousands of dollars.
- Spending Score (1-100): A score assigned to customers based on their spending behavior.
You can directly use this dataset to replicate the customer segmentation process.
-
Clone the Repository:
git clone https://github.com/Vignesha-S/Project-3.git
-
Install Required Libraries: If you have a requirements.txt file, you can install the dependencies with:
pip install -r requirements.txt
Otherwise, manually install the necessary libraries:
pip install pandas matplotlib seaborn scikit-learn
-
Run the Jupyter Notebook: Open and run the Jupyter Notebook (k_means_clustering.ipynb) using Jupyter Notebook or JupyterLab:
jupyter notebook k_means_clustering.ipynb
- k_means_clustering.ipynb: The main Jupyter notebook file that contains all steps, including data loading, exploration, preprocessing, clustering, and visualization.
- data/: Folder containing the dataset (Mall_Customers.csv).
- requirements.txt: A file containing the list of dependencies for the project.
-
Data Loading and Initial Exploration:
- The dataset is loaded and inspected for its structure.
- Features such as Age, Annual Income, and Spending Score are selected for clustering.
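The loading and feature-selection step might look roughly like the sketch below. The file path and column names are taken from this README; the helper names `load_customers` and `select_features` are hypothetical and are not necessarily what the notebook uses.

```python
import pandas as pd

# Column names as listed in the dataset description in this README.
FEATURES = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]


def load_customers(path="data/Mall_Customers.csv"):
    """Read the customer CSV into a DataFrame (path assumed from the file list)."""
    return pd.read_csv(path)


def select_features(df):
    """Return a copy of the three numeric columns used for clustering."""
    missing = [col for col in FEATURES if col not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return df[FEATURES].copy()
```

In a notebook cell, `df = load_customers()` followed by `df.info()` and `X = select_features(df)` would cover the inspection and selection described above.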
-
Exploratory Data Analysis (EDA):
- Histograms and KDE plots are used to understand the distribution of features (Age and Spending Score).
- A correlation heatmap is generated to check the relationships between the features.
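As a rough illustration of what the correlation heatmap displays, the underlying matrix can be computed with pandas alone. The function name `feature_correlations` is hypothetical, and Pearson correlation is an assumption (it is the pandas default); the notebook may compute it differently.

```python
import pandas as pd


def feature_correlations(df, columns):
    """Pearson correlation matrix for the given numeric columns."""
    return df[columns].corr(method="pearson")
```

The resulting DataFrame is symmetric with ones on the diagonal. Passing it to `seaborn.heatmap(corr, annot=True)` is one common way to render it as a heatmap.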
-
Data Preprocessing:
- The data is scaled using StandardScaler to normalize the features and prepare them for clustering.
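Scaling matters here because K-Means relies on Euclidean distance, so a feature with a larger numeric range would otherwise dominate. A minimal sketch of the scaling step, assuming a hypothetical helper named `scale_features`:

```python
from sklearn.preprocessing import StandardScaler


def scale_features(X):
    """Standardize each column to zero mean and unit variance.

    Returns the scaled array together with the fitted scaler, so that
    results can later be mapped back to the original units.
    """
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    return X_scaled, scaler
```

Keeping the fitted scaler around is a design choice for this sketch: `scaler.inverse_transform(...)` can convert cluster centroids back to years, k$, and score points for interpretation.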
-
K-Means Clustering:
- The Elbow Method is applied to determine the optimal number of clusters, which is found to be 4.
- The K-Means model is trained on the scaled data to segment the customers into 4 clusters.
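The Elbow Method fits K-Means for a range of cluster counts and records the inertia (within-cluster sum of squared distances) for each; the "elbow" is where adding clusters stops reducing inertia substantially. The sketch below is one way to do this with scikit-learn. The function names, the range of k from 1 to 10, `n_init=10`, and `random_state=42` are all assumptions for illustration, not values confirmed by the notebook.

```python
from sklearn.cluster import KMeans


def elbow_inertias(X_scaled, k_values=range(1, 11), random_state=42):
    """Fit K-Means once per k and return {k: inertia}."""
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(X_scaled)
        inertias[k] = model.inertia_
    return inertias


def fit_kmeans(X_scaled, n_clusters=4, random_state=42):
    """Fit the final model and return it with one cluster label per row."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = model.fit_predict(X_scaled)
    return model, labels
```

Plotting the dictionary values against k gives the elbow curve; `fit_kmeans` then uses the 4 clusters reported above.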
-
Centroids Visualization:
- After clustering, the centroids of the clusters are visualized in a 3D plot along with the customer data points.
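A 3D scatter of the three features with the centroids overlaid could be drawn along the following lines. This is a sketch using matplotlib; the function name `plot_clusters_3d`, the colormap, and the marker styling are assumptions rather than the notebook's actual choices.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_clusters_3d(X, labels, centroids,
                     axis_names=("Age", "Annual Income (k$)", "Spending Score (1-100)")):
    """Scatter the points coloured by cluster, with centroids marked in red."""
    X = np.asarray(X)
    centroids = np.asarray(centroids)

    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(projection="3d")

    # Customer points, one colour per cluster label.
    ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels, cmap="viridis", s=20, alpha=0.6)
    # Cluster centroids, drawn larger so they stand out.
    ax.scatter(centroids[:, 0], centroids[:, 1], centroids[:, 2],
               c="red", marker="X", s=200, label="Centroids")

    ax.set_xlabel(axis_names[0])
    ax.set_ylabel(axis_names[1])
    ax.set_zlabel(axis_names[2])
    ax.legend()
    return fig, ax
```

Note that if the model was fitted on scaled data, `model.cluster_centers_` is also in scaled units, so the points and centroids passed in should share the same space (both scaled, or both in original units).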
-
Cluster Summary and Interpretation:
- A summary of each cluster is generated by calculating the mean values of Age, Annual Income, and Spending Score within each cluster.
- The cluster means are then interpreted along two axes: high-income vs. low-income groups and high-spending vs. low-spending groups.
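The per-cluster summary amounts to a group-by on the cluster label. A minimal sketch with pandas follows; the function name `summarize_clusters` and the extra `Count` column are assumptions added for illustration.

```python
import pandas as pd


def summarize_clusters(df, labels, features):
    """Mean of each feature per cluster, plus the number of customers in it."""
    labeled = df[features].copy()
    labeled["Cluster"] = labels
    summary = labeled.groupby("Cluster")[features].mean()
    summary["Count"] = labeled.groupby("Cluster").size()
    return summary
```

Computing the means on the original (unscaled) columns keeps the summary in readable units, which makes labels such as "high-income, low-spending" easier to justify.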
-
Final Visualization:
- A final 2D plot is generated to visualize the clusters and their centroids, helping to understand the customer segments in the feature space.
- The K-Means algorithm successfully segmented the customers into 4 clusters based on their Age, Annual Income, and Spending Score.
- The clusters represent different customer segments that can be targeted for marketing strategies, such as:
- Cluster 0: High-income and low-spending group.
- Cluster 1: High-income and high-spending group.
- Cluster 2: Low-income and medium-spending group.
- Cluster 3: Medium-income and low-spending group.
This project is licensed under the MIT License - see the LICENSE file for details.