This repository documents a comprehensive Data Mining project focused on ABCDEats Inc., a fictional food delivery service. We analyze a rich dataset of customer transactions and behaviors to develop a data-driven segmentation strategy. The goal is to empower ABCDEats to move beyond a one-size-fits-all approach and tailor its marketing, promotions, and service offerings to distinct customer profiles.
The primary objectives of this project are to:
- Conduct an Exploratory Data Analysis (EDA) to understand customer behaviors, trends, and patterns.
- Preprocess the data: handle inconsistencies, missing values, and outliers, and perform feature engineering and selection.
- Apply and evaluate various Clustering Algorithms (Hierarchical, K-Means, SOM, Density-based) from different perspectives (Overall, Value-based, Behavior-based).
- Develop a Final Customer Segmentation solution by comparing and potentially merging results from different approaches.
- Profile the resulting customer segments, highlighting their key characteristics.
- Suggest actionable Business Applications and marketing strategies for each segment.
- (Optional) Develop an interactive Web Application for exploring the EDA and segmentation results.
This project was developed for the Data Mining course as part of the Master's in Data Science and Advanced Analytics program at NOVA IMS. The work was completed during the 1st Semester of the 2024/2025 academic year.
The project was implemented entirely in Python, using libraries for data science, machine learning, and web deployment.
The project strictly followed the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology. The overall workflow is visualized below:
(Diagram summarizing the key phases and steps of the project)
- Business Understanding:
  - Defined the core business problem: the need for effective customer segmentation so that ABCDEats Inc. can personalize marketing and services.
  - Established project objectives aligned with business goals (improve customer satisfaction, retention, revenue).
- Data Understanding:
  - Explored the initial dataset (31,888 customers, 56 features).
  - Identified data types, distributions (skewness, kurtosis), and initial relationships (pair plots).
  - Detected missing values (`customer_age`, `first_order`, `HR_0`), duplicates, and inconsistencies (`'-'` in `customer_region` and `last_promo`; illogical `vendor/product/order_counts`).
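The checks listed under Data Understanding could be sketched with pandas as follows. This is a minimal illustration on an invented toy frame, not the project's notebook code: the `audit` helper, the sample values, and the "products ordered but zero vendors" consistency rule are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def audit(df: pd.DataFrame) -> dict:
    """Summarise missing values, duplicates, placeholders, and distribution shape."""
    numeric = df.select_dtypes(include="number")
    return {
        "missing": df.isna().sum(),                  # NaN count per column
        "n_duplicates": int(df.duplicated().sum()),  # fully repeated rows
        "dash_placeholders": (df == "-").sum(),      # '-' used as a placeholder value
        "skewness": numeric.skew(),
        "kurtosis": numeric.kurt(),                  # excess kurtosis (normal = 0)
        # Assumed consistency rule, for illustration only:
        # a customer cannot have ordered products from zero vendors.
        "n_illogical": int(
            ((df["product_count"] > 0) & (df["vendor_count"] == 0)).sum()
        ),
    }


# Toy data (invented values, only the column names come from the project).
df = pd.DataFrame({
    "customer_age": [23, 31, np.nan, 45, 23],
    "customer_region": ["2", "-", "8", "4", "2"],
    "last_promo": ["DELIVERY", "-", "-", "DISCOUNT", "DELIVERY"],
    "vendor_count": [2, 1, 3, 0, 2],
    "product_count": [5, 2, 7, 4, 5],
})

report = audit(df)
print(report["missing"])
print("duplicates:", report["n_duplicates"], "| illogical rows:", report["n_illogical"])
```

Running the same helper on the real dataset would give a first inventory of the issues that the preparation phase then resolves.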
- Data Preparation & Feature Engineering
  - Cleaning: Handled duplicates (removed 13), treated inconsistencies (removed 18 illogical rows, reinterpreted '-').
  - Missing Value Imputation: Used deterministic logic (`first_order`, `HR_0`) and `KNNImputer` (`customer_age`).
  - Feature Engineering: Created new features (e.g., `order_count`, `days_between_orders`, `customer_region_buckets`, `last_promo_bin`, CUI totals/averages/most spent, PCA components). Discarded less informative engineered features (e.g., CUI proportions).
  - Outlier Handling: Applied a mixed strategy (modified IQR and manual removal based on boxplots/domain knowledge), retaining 98.61% of data.
  - Variable Selection: Used Spearman correlation (threshold 0.8) to identify and remove redundant features (`vendor_count`, `product_count`, `days_between_orders`, `customer_age`, `customer_age_group`).
  - Feature Scaling: Applied `StandardScaler` to numerical features for distance-based algorithms.
  - Dimensionality Reduction: Used `PCA` separately on CUI and HR feature groups to reduce noise/redundancy while preserving variance (kept 7 CUI PCs, 4 HR PCs). Original DOW variables were retained.
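A rough sketch of four of the preparation steps above (KNN imputation, Spearman-based variable selection, scaling, and PCA on one feature group) using scikit-learn. The data is synthetic, the `CUI_*` column names and the `drop_redundant` helper are hypothetical, and the number of components is chosen arbitrarily; only the 0.8 threshold and the class names come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer table (values are invented).
rng = np.random.default_rng(0)
n = 200
base = rng.gamma(shape=2.0, scale=10.0, size=n)
df = pd.DataFrame({
    "customer_age": rng.integers(18, 70, size=n).astype(float),
    "order_count": base + rng.normal(0, 1, n),
    "product_count": base * 1.3 + rng.normal(0, 1, n),  # nearly redundant with order_count
    "CUI_Asian": rng.gamma(2.0, 5.0, n),
    "CUI_Italian": rng.gamma(2.0, 5.0, n),
    "CUI_American": rng.gamma(2.0, 5.0, n),
})
df.loc[rng.choice(n, size=10, replace=False), "customer_age"] = np.nan

# 1) KNN imputation (features left unscaled here for brevity).
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=5).fit_transform(df), columns=df.columns
)


# 2) Spearman filter: for each pair with |rho| > threshold, drop the later column.
def drop_redundant(frame: pd.DataFrame, threshold: float = 0.8):
    corr = frame.corr(method="spearman").abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


selected, dropped = drop_redundant(imputed)

# 3) Standardise so every feature has mean 0 and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(selected), columns=selected.columns
)

# 4) PCA on one feature group only (here the hypothetical CUI_* columns).
cui_cols = [c for c in scaled.columns if c.startswith("CUI_")]
pca = PCA(n_components=2)
cui_pcs = pca.fit_transform(scaled[cui_cols])

print("dropped:", dropped)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

Which member of a correlated pair to drop is a judgment call; this sketch simply drops the later column, whereas the project's own choice of features to remove is the list given above.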
- Modeling: Multi-Perspective Clustering
  - Applied multiple clustering algorithms:
    - Hierarchical Clustering (HC - Agglomerative, Ward linkage)
    - K-Means
    - Self-Organizing Maps (SOM - using `MiniSom`) + HC/K-Means
    - Density-Based: Mean Shift, DBSCAN, Gaussian Mixture Models (GMM)
  - Performed clustering on 'Overall', 'Value-based', and 'Behavior-based' feature subsets.
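The multi-perspective idea can be illustrated by running several algorithms on different column subsets of the same scaled matrix. The sketch below covers only three of the algorithms named above (Ward hierarchical clustering, K-Means, GMM) and leaves out SOM, Mean Shift, and DBSCAN. The data is synthetic, and the column indices assigned to each perspective and the `cluster_all` helper are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic, already "prepared" feature matrix with 4 columns.
X, _ = make_blobs(n_samples=300, centers=3, n_features=4,
                  cluster_std=1.0, random_state=42)
X = StandardScaler().fit_transform(X)

# Hypothetical perspectives: each one is a subset of column indices.
perspectives = {
    "overall": [0, 1, 2, 3],
    "value": [0, 1],
    "behavior": [2, 3],
}


def cluster_all(data: np.ndarray, k: int) -> dict:
    """Fit each algorithm on the same data and return one label array per method."""
    return {
        "hierarchical_ward": AgglomerativeClustering(
            n_clusters=k, linkage="ward"
        ).fit_predict(data),
        "kmeans": KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(data),
        "gmm": GaussianMixture(n_components=k, random_state=42).fit_predict(data),
    }


labels = {name: cluster_all(X[:, cols], k=3) for name, cols in perspectives.items()}

for name, result in labels.items():
    sizes = {method: np.bincount(lab).tolist() for method, lab in result.items()}
    print(name, sizes)
```

Keeping the labels in one nested dictionary makes it straightforward to compare algorithms within a perspective before choosing one per perspective.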
- Evaluation & Final Segmentation
  - Determined optimal cluster numbers using Elbow method (Inertia/SSE), Silhouette analysis, R² metric (for HC), AIC/BIC (for GMM), and visual inspection (dendrograms).
  - Compared performance across algorithms and perspectives based on R² and silhouette scores.
  - Selected best-performing methods for each perspective (SOM+K-Means overall, K-Means value, SOM+K-Means behavior).
  - Manually merged the 'Value' (k=3) and 'Behavior' (k=4) solutions based on centroid analysis to create a final, more robust 5-cluster solution.
  - Visualized cluster separation using t-SNE and UMAP.
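As an illustration of the metrics used to choose the number of clusters, the sketch below computes inertia, R², and the silhouette score for a range of k with K-Means. It assumes the common definition R² = 1 - SSW/SST (within-cluster sum of squares over total sum of squares); the project may compute it differently, and the `r2_clusters` helper and the synthetic data are invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def r2_clusters(X: np.ndarray, labels: np.ndarray) -> float:
    """R² = 1 - SSW / SST: share of total variance explained by the partition."""
    sst = ((X - X.mean(axis=0)) ** 2).sum()
    ssw = sum(
        ((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
        for c in np.unique(labels)
    )
    return 1.0 - ssw / sst


# Synthetic data standing in for a scaled feature subset.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

results = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = {
        "inertia": km.inertia_,                       # SSE, used for the elbow plot
        "r2": r2_clusters(X, km.labels_),
        "silhouette": silhouette_score(X, km.labels_),
    }

# One possible selection rule: the k with the highest silhouette score.
best_k = max(results, key=lambda k: results[k]["silhouette"])

for k, m in results.items():
    print(k, {name: round(value, 3) for name, value in m.items()})
print("best k by silhouette:", best_k)
```

R² always increases with k, so it is read like an elbow curve, while the silhouette score can peak at an intermediate k. That is one reason to look at both alongside the dendrogram.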
- Deployment
  - Profiling: Characterized the final 5 clusters using descriptive statistics, bar plots, and heatmaps.
  - Business Applications: Defined marketing strategies tailored to each segment.
  - (Optional) Interactive Dashboard: Developed a web application using Streamlit and Plotly for dynamic exploration of EDA and segmentation results. Access the App Here!
  - Dashboard App Repository: Silvestre17/DM_Dashboard
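The profiling step could be approximated with a pandas group-by that reports each cluster's share of customers and its feature means, plus each mean relative to the overall mean (a natural input for a heatmap). The `profile` helper, the `total_spend` column, and the toy values are hypothetical.

```python
import pandas as pd

# Toy labelled data (invented values).
df = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 2],
    "total_spend": [20.0, 30.0, 25.0, 10.0, 14.0, 90.0],
    "order_count": [3, 5, 4, 1, 1, 9],
})


def profile(frame: pd.DataFrame, label_col: str = "cluster"):
    """Return per-cluster means with size shares, and means relative to the overall mean."""
    means = frame.groupby(label_col).mean(numeric_only=True)
    share = (
        frame[label_col].value_counts(normalize=True).sort_index().mul(100).round(2)
    )
    means.insert(0, "share_pct", share)  # aligned on the cluster label index

    overall = frame.drop(columns=label_col).mean(numeric_only=True)
    relative = means.drop(columns="share_pct") / overall  # > 1 means above average
    return means, relative


means, relative = profile(df)
print(means)
print(relative.round(2))
```

Values of `relative` above 1 flag features where a segment is above the overall average, which is the kind of contrast the segment descriptions rely on.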
Based on the merged clustering solution (Value K-Means + Behavior SOM+K-Means), five distinct customer segments were identified:
| Segment ID | Segment Name | Key Characteristics | Recommended Marketing Approach |
|---|---|---|---|
| 0 | The Mainstream Base | - Largest group (41.74% of customers).<br>- Average spending and behavior, similar to the overall dataset.<br>- Moderate to low engagement.<br>- Prefers Asian and American cuisines.<br>- Balanced across regions; uses card payments. | - Offer tiered loyalty (discounts/perks for higher spending/frequency).<br>- Target promotions for American/Asian cuisines and combo deals. |
| 1 | The Promo Pursuers | - Second largest group (38.00% of customers), low engagement (lowest order count).<br>- Low total spend, but high average spend per order.<br>- Likely motivated by delivery promotions.<br>- Slight preference for evening orders and Noodles/Chinese/Chicken. | - Offer free delivery for orders above a certain value.<br>- Implement a points-based rewards program redeemable for discounts/free delivery. |
| 2 | The Convenience Seekers | - 8.56% of customers, concentrated in Region 2.<br>- High order frequency (lunch/dinner).<br>- Prefers Chicken, Chinese, Noodles, Other; less Asian/Street Food.<br>- Moderate spenders, but significant volume. | - Focus on premium dining experience (personalized service), especially in Region 2.<br>- Offer exclusive menu previews/early access.<br>- Loyalty program rewarding spend per order and frequency. |
| 3 | The Balanced Spenders | - 6.76% of customers, located mostly in Regions 2 and 4.<br>- Similar activity times to Cluster 2 (lunch/dinner) but lower frequency/spend.<br>- Prefers Italian and Other cuisines; less keen on Street Food/Snacks/Asian. | - Highlight Italian/Other cuisines in promotions (exclusive deals).<br>- Target lunch/dinner promotions.<br>- Offer discount combos for higher spend. |
| 4 | The Late-Night Enthusiasts | - 4.93% of customers; highest spenders (absolute and average).<br>- Predominantly in Region 8.<br>- Strong preference for Asian, Snack, Street Food.<br>- Orders primarily late night and early breakfast.<br>- Less preference for Italian/Other. | - Highlight breakfast and late-night specific items/offerings.<br>- Introduce city-specific promotions (Region 8).<br>- Offer special discounts/VIP access for high spenders. |
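For readers curious how two label sets (one value-based, one behavior-based) can be combined into a single segmentation, here is a hedged sketch: cross-tabulate the label pairs, then apply a hand-written map from each pair to a final segment. The labels and the `merge_map` below are invented for illustration and do not reproduce the project's actual merge, which was decided manually from centroid analysis.

```python
import pandas as pd

# Invented labels for eight customers from two separate clusterings.
value_labels = pd.Series([0, 0, 1, 2, 2, 1, 0, 2], name="value")
behavior_labels = pd.Series([0, 1, 1, 3, 3, 2, 0, 3], name="behavior")

# Contingency table: how many customers fall in each (value, behavior) pair.
pairs = pd.crosstab(value_labels, behavior_labels)

# Hypothetical hand-written merge map: (value, behavior) -> final segment.
# In practice, pairs with similar centroids would be mapped to the same segment.
merge_map = {
    (0, 0): 0,
    (0, 1): 1,
    (1, 1): 1,
    (1, 2): 2,
    (2, 3): 3,
}

final = pd.Series(
    [merge_map[pair] for pair in zip(value_labels.tolist(), behavior_labels.tolist())],
    name="segment",
)

print(pairs)
print(final.value_counts().sort_index())
```

The contingency table shows which combinations are actually populated, which is usually the starting point for deciding which pairs to merge.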
- André Silvestre, 20240502
- Filipa Pereira, 20240509
- Umeima Mahomed, 20240543
