This project focuses on cleaning and preparing a dataset related to layoffs in various companies. The goal is to perform data cleaning tasks such as deduplication, standardisation, handling null values, and removing unnecessary columns and rows. These steps are essential to make the data consistent, usable, and ready for further analysis or reporting.
The primary dataset used for this analysis is the 'Layoffs Dataset' from Kaggle, which is current up to March 2023. The latest version of the dataset can be accessed here.
- MySQL
- Data Cleaning
- Data Transformation
- DBeaver
In the data preparation & cleaning phase, we performed the following tasks (illustrative MySQL sketches for each step follow the list):
- **Deduplication:**
  - Created a staging table to copy the original data for manipulation.
  - Identified duplicates based on key columns (`company`, `location`, `industry`, `total_laid_off`, `percentage_laid_off`, `date`, and `country`).
  - Removed duplicate records by using a window function to assign a row number to each group of potential duplicates and then deleting rows with a row number greater than 1.
- **Standardisation:**
  - Replaced blank or invalid values in the `industry` column with `NULL`.
  - Trimmed whitespace from the `company` column to standardise values.
  - Standardised values in the `industry` and `country` columns.
  - Reformatted the `date` column into a standardised `DATE` datatype.
- **Handling Null Values:**
  - Handled missing values in the `industry` column by replacing them with values from other records with matching `company` and `location` information.
- **Row & Column Removal:**
  - Removed redundant columns such as `row_num`, used only for deduplication purposes.
  - Removed rows with `NULL` values in critical fields like `total_laid_off` and `percentage_laid_off`.
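The sketches below show how each step might look in MySQL. They assume the raw data sits in a table named `layoffs` and is copied into staging tables named `layoffs_staging` and `layoffs_staging2`; all of these names are placeholders rather than the project's actual identifiers. First, deduplication with `ROW_NUMBER()`:

```sql
-- Working copy so the raw table stays untouched (table names are placeholders)
CREATE TABLE layoffs_staging LIKE layoffs;
INSERT INTO layoffs_staging SELECT * FROM layoffs;

-- Second staging table with an extra row_num column used for deduplication
CREATE TABLE layoffs_staging2 LIKE layoffs_staging;
ALTER TABLE layoffs_staging2 ADD COLUMN row_num INT;

-- Number each group of potential duplicates across the key columns
INSERT INTO layoffs_staging2
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, location, industry, total_laid_off,
                        percentage_laid_off, `date`, country
       ) AS row_num
FROM layoffs_staging;

-- Keep the first row of each group, delete the rest
DELETE FROM layoffs_staging2
WHERE row_num > 1;
```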
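A possible shape for the standardisation step. The specific value fixes (the `Crypto` and trailing-full-stop examples) and the `%m/%d/%Y` date format are illustrative assumptions, not details taken from the write-up above:

```sql
-- Blank industry values become NULL so they can be back-filled later
UPDATE layoffs_staging2 SET industry = NULL WHERE industry = '';

-- Trim stray whitespace from company names
UPDATE layoffs_staging2 SET company = TRIM(company);

-- Collapse variant spellings of the same industry (illustrative example)
UPDATE layoffs_staging2 SET industry = 'Crypto' WHERE industry LIKE 'Crypto%';

-- Strip trailing punctuation from country names (illustrative example)
UPDATE layoffs_staging2 SET country = TRIM(TRAILING '.' FROM country);

-- Parse the text date, then switch the column to a proper DATE type
UPDATE layoffs_staging2 SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');
ALTER TABLE layoffs_staging2 MODIFY COLUMN `date` DATE;
```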
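Back-filling missing `industry` values from other records with the same `company` and `location` can be done with a self-join update, roughly as follows:

```sql
-- Fill NULL industry values from other rows with the same company and location
UPDATE layoffs_staging2 AS t1
JOIN  layoffs_staging2 AS t2
  ON  t1.company  = t2.company
 AND  t1.location = t2.location
SET   t1.industry = t2.industry
WHERE t1.industry IS NULL
  AND t2.industry IS NOT NULL;
```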
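Finally, row and column removal. This sketch takes one reading of "critical fields" and drops only rows where both layoff figures are missing, then discards the deduplication helper column:

```sql
-- Rows with no layoff figures at all add little value to the analysis
DELETE FROM layoffs_staging2
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;

-- The row_num helper is no longer needed once duplicates are removed
ALTER TABLE layoffs_staging2 DROP COLUMN row_num;
```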
Based on the results of the data cleaning process, we recommend the following actions:
- Further Analysis: With the cleaned dataset, further exploratory data analysis (EDA) can be conducted to find insights like trends in layoffs across industries and regions.
- Data Archiving: Store cleaned datasets in a centralised location for easy access and sharing with others.
- Data Cleaning in SQL - Alex The Analyst