Skip to content

πŸ• Data Mining project for "ABCDEats Inc." using clustering (K-Means, SOM) to segment customers based on purchasing behavior. Includes an interactive Streamlit dashboard for exploration. Developed for the DM course at NOVA IMS.

License

Notifications You must be signed in to change notification settings

Silvestre17/DM_FoodDeliveryClustering_MasterProject

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

10 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

ABCDEats Banner

πŸ• ABCDEats Inc: A Data Mining Approach to Customer Segmentation πŸ“¦

GitHub Repo Live Dashboard Dashboard Code Repo

πŸ“ Description

This repository documents a comprehensive Data Mining project focused on ABCDEats Inc., a fictional food delivery service. We analyze a rich dataset of customer transactions and behaviors to develop a data-driven segmentation strategy. The goal is to empower ABCDEats to move beyond a one-size-fits-all approach and tailor its marketing, promotions, and service offerings to distinct customer profiles.

✨ Objective

The primary objectives of this project are to:

  1. Conduct an Exploratory Data Analysis (EDA) to understand customer behaviors, trends, and patterns.
  2. Preprocess the data, handling inconsistencies, missing values, outliers, and perform feature engineering/selection.
  3. Apply and evaluate various Clustering Algorithms (Hierarchical, K-Means, SOM, Density-based) from different perspectives (Overall, Value-based, Behavior-based).
  4. Develop a Final Customer Segmentation solution by comparing and potentially merging results from different approaches.
  5. Profile the resulting customer segments, highlighting their key characteristics.
  6. Suggest actionable Business Applications and marketing strategies for each segment.
  7. (Optional) Develop an interactive Web Application for exploring the EDA and segmentation results.

πŸŽ“ Project Context

This project was developed for the Data Mining course as part of the Master's in Data Science and Advanced Analytics program at NOVA IMS. The work was completed during the 1st Semester of the 2024/2025 academic year.

πŸ› οΈ Technologies & Libraries

The project was implemented entirely in Python, leveraging a powerful stack of libraries for data science, machine learning, and web deployment.

Python Jupyter Notebook Pandas NumPy Scikit-Learn MiniSom
Streamlit Plotly Matplotlib Seaborn

πŸ—ΊοΈ Project Workflow (CRISP-DM)

The project strictly followed the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology. The overall workflow is visualized below:

Project Workflow

(Diagram summarizing the key phases and steps of the project)


πŸ—οΈ Project Structure (CRISP-DM Phases)

  1. Business Understanding: πŸ’‘

    • Defined the core business problem: Need for effective customer segmentation for ABCDEats Inc. to personalize marketing and services.
    • Established project objectives aligned with business goals (improve customer satisfaction, retention, revenue).
  2. Data Understanding: πŸ”

    • Explored the initial dataset (31,888 customers, 56 features).
    • Identified data types, distributions (skewness, kurtosis), and initial relationships (pair plots).
    • Detected missing values (customer_age, first_order, HR_0), duplicates, and inconsistencies ('-' in customer_region, last_promo; illogical vendor/product/order_counts).

    Pandas NumPy Matplotlib Seaborn

  3. Data Preparation & Feature Engineering πŸ› οΈ

    • Cleaning: Handled duplicates (removed 13), treated inconsistencies (removed 18 illogical rows, reinterpreted '-').
    • Missing Value Imputation: Used deterministic logic (first_order, HR_0) and KNNImputer (customer_age).
    • Feature Engineering: Created new features (e.g., order_count, days_between_orders, customer_region_buckets, last_promo_bin, CUI totals/averages/most spent, PCA components). Discarded less informative engineered features (e.g., CUI proportions).
    • Outlier Handling: Applied a mixed strategy (modified IQR and manual removal based on boxplots/domain knowledge), retaining 98.61% of data.
    • Variable Selection: Used Spearman correlation (threshold 0.8) to identify and remove redundant features (vendor_count, product_count, days_between_orders, customer_age, customer_age_group).
    • Feature Scaling: Applied StandardScaler to numerical features for distance-based algorithms.
    • Dimensionality Reduction: Used PCA separately on CUI and HR feature groups to reduce noise/redundancy while preserving variance (kept 7 CUI PCs, 4 HR PCs). Original DOW variables were retained.

    Pandas Scikit-Learn

  4. Modeling: Multi-Perspective Clustering 🧠

    • Applied multiple clustering algorithms:
      • Hierarchical Clustering (HC - Agglomerative, Ward linkage)
      • K-Means
      • Self-Organizing Maps (SOM - using MiniSom) + HC/K-Means
      • Density-Based: Mean Shift, DBSCAN, Gaussian Mixture Models (GMM)
    • Performed clustering on 'Overall', 'Value-based', and 'Behavior-based' feature subsets.

    Scikit-Learn MiniSom

  5. Evaluation & Final Segmentation βœ…

    • Determined optimal cluster numbers using Elbow method (Inertia/SSE), Silhouette analysis, RΒ² metric (for HC), AIC/BIC (for GMM), and visual inspection (dendrograms).
    • Compared performance across algorithms and perspectives based on RΒ² and silhouette scores.
    • Selected best-performing methods for each perspective (SOM+K-Means overall, K-Means value, SOM+K-Means behavior).
    • Manually merged the 'Value' (k=3) and 'Behavior' (k=4) solutions based on centroid analysis to create a final, more robust 5-cluster solution.
    • Visualized cluster separation using t-SNE and UMAP.
  6. Deployment πŸš€

    • Profiling: Characterized the final 5 clusters using descriptive statistics, bar plots, and heatmaps.
    • Business Applications: Defined marketing strategies tailored to each segment.
    • (Optional) Interactive Dashboard: Developed a web application using Streamlit and Plotly for dynamic exploration of EDA and segmentation results. Access the App Here!

    Streamlit Plotly

πŸ“ˆ Results - Final Customer Segments

Based on the merged clustering solution (Value K-Means + Behavior SOM+K-Means), five distinct customer segments were identified:

Segment ID Segment Name Key Characteristics Recommended Marketing Approach
0 The Mainstream Base - Largest group (41.74%).
- Average spending & behavior, similar to overall dataset.
- Moderate to low engagement.
- Prefers Asian & American cuisines.
- Balanced across regions; uses card payments.
- Offer tiered loyalty (discounts/perks for higher spending/frequency).
- Target promotions for American/Asian cuisines & combo deals.
1 The Promo Pursuers - Second largest (38.00%), low engagement (lowest order count).
- Low total spend, but high average spend per order.
- Likely motivated by delivery promotions.
- Slight preference for evening orders & Noodles/Chinese/Chicken.
- Offer free delivery for orders above a certain value.
- Implement points-based rewards program redeemable for discounts/free delivery.
2 The Convenience Seekers - Concentrated in Region 2 (8.56%).
- High order frequency (lunch/dinner).
- Prefers Chicken, Chinese, Noodles, Other; less Asian/Street Food.
- Moderate spenders, but significant volume.
- Focus on premium dining experience (personalized service), especially in Region 2.
- Offer exclusive menu previews/early access.
- Loyalty program rewarding spend per order & frequency.
3 The Balanced Spenders - Located mostly in Region 2 & 4 (6.76%).
- Similar activity times to Cluster 2 (lunch/dinner) but lower frequency/spend.
- Prefers Italian & Other cuisines; less keen on Street Food/Snacks/Asian.
- Highlight Italian/Other cuisines in promotions (exclusive deals).
- Target lunch/dinner promotions.
- Offer discount combos for higher spend.
4 The Late-Night Enthusiasts - Highest spenders (absolute & average) (4.93%).
- Predominantly in Region 8.
- Strong preference for Asian, Snack, Street Food.
- Orders primarily late night & early breakfast.
- Less preference for Italian/Other.
- Highlight breakfast & late-night specific items/offerings.
- Introduce city-specific promotions (Region 8).
- Offer special discounts/VIP access for high spenders.

πŸ‘₯ Group 37 Members

  • AndrΓ© Silvestre, 20240502
  • Filipa Pereira, 20240509
  • Umeima Mahomed, 20240543

About

πŸ• Data Mining project for "ABCDEats Inc." using clustering (K-Means, SOM) to segment customers based on purchasing behavior. Includes an interactive Streamlit dashboard for exploration. Developed for the DM course at NOVA IMS.

Topics

Resources

License

Stars

Watchers

Forks