SanitiPy automates the data cleaning process for your data science projects using Python.
SanitiPy is a user-friendly Python library designed to streamline the data cleaning and preprocessing workflow. It provides essential utilities to prepare datasets for analysis or modeling by handling common data quality issues such as duplicate entries, missing values, and inconsistent data types.
- Remove Duplicates: Easily eliminate duplicate rows from your DataFrame to ensure data integrity.
- Handle Missing Values: Automatically identify and remove rows containing NaN (Not a Number) values.
- Infer Data Types: Intelligently detect and convert column data types, including:
  - Converting potential datetime columns based on a configurable ratio of valid dates.
  - Converting numeric-like values to proper numeric types.
  - Falling back to string type when type inference is unsuccessful.
- Automated Cleaning Process: The DataCleaner class orchestrates the cleaning steps, ensuring your data is ready for further analysis.
You can install SanitiPy using pip:
pip install sanitipy
Quick example of using the package with a pandas DataFrame:
import pandas as pd
from sanitipy import DataCleaner
# Create a sample DataFrame with some common data issues
data = {
'ID': [1, 2, 3, 1, 4, 5],
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve'],
'Value': [100, 200, None, 100, 400, 500],
'Date': ['2023/01/01', '2023/01/02', '2023/01/03', '2023/01/01', 'invalid-date', '2023/01/05'],
'Category': ['A', 'B', 'C', 'A', 'D', 'E']
}
df = pd.DataFrame(data)
# Initialize the DataCleaner
cleaner = DataCleaner(df)
# Clean the data
cleaned_df = cleaner.clean_data()
DataCleaner is the main class for orchestrating the data cleaning process.
- __init__(self, data_frame: pd.DataFrame):
  - Initializes the DataCleaner with a pandas DataFrame.
- clean_data(self) -> pd.DataFrame:
  - Performs a sequence of cleaning operations:
    - Removes duplicate rows.
    - Removes rows with missing values (if any are detected). Raises a ValueError if missing values persist after removal.
    - Infers and converts data types for columns with inconsistent types.
    - Resets the DataFrame index.
  - Returns the cleaned pandas DataFrame.
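The sequence of steps above can be approximated with plain pandas. The following is only a sketch of the documented order of operations, not SanitiPy's actual implementation: the function name `clean_sketch` is hypothetical, and the type-inference step is omitted to keep the example short.

```python
import pandas as pd


def clean_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the documented steps, using plain pandas."""
    # Step 1: remove duplicate rows.
    out = df.drop_duplicates()

    # Step 2: drop rows with missing values, only if any are detected.
    if out.isna().sum().sum() > 0:
        out = out.dropna()
        # Mirror the documented behaviour: fail loudly if any remain.
        if out.isna().sum().sum() > 0:
            raise ValueError("Missing values persist after removal.")

    # Step 3 (type inference) is intentionally left out of this sketch.

    # Step 4: reset the index so it runs 0..n-1 again.
    return out.reset_index(drop=True)


df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 4],
    "Value": [100, 200, None, 100, 400],
})
cleaned = clean_sketch(df)
print(cleaned)
```

In this sample, the fourth row duplicates the first and the third row has a missing Value, so three rows remain after cleaning.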
Provides individual data transformation and cleaning utilities.
- remove_duplicates(self, data: pd.DataFrame) -> pd.DataFrame:
  - Removes duplicate rows from the input DataFrame.
- remove_na(self, data: pd.DataFrame) -> pd.DataFrame:
  - Removes rows containing any NaN values from the input DataFrame.
- infer_data_types(self, data_frame: pd.DataFrame, date_time_ratio: float = 0.5) -> pd.DataFrame:
  - Infers and converts data types for columns.
  - date_time_ratio: The threshold (0-1) for treating an object column as datetime. Default is 0.5 (50% valid date values required).
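To make the role of date_time_ratio concrete, here is a minimal sketch of ratio-based type inference in plain pandas. It is an illustration under stated assumptions, not SanitiPy's code: the function name `infer_types_sketch`, the `date_format` parameter, and the use of `>=` for the threshold comparison are all hypothetical choices.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def infer_types_sketch(
    df: pd.DataFrame,
    date_time_ratio: float = 0.5,
    date_format: str = "%Y/%m/%d",  # hypothetical parameter, not part of SanitiPy
) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        values = out[col]
        # Columns that already have a numeric or datetime dtype are left alone.
        if is_numeric_dtype(values) or is_datetime64_any_dtype(values):
            continue

        # 1) Datetime: convert if enough values parse as dates.
        parsed = pd.to_datetime(values, format=date_format, errors="coerce")
        if len(values) > 0 and parsed.notna().mean() >= date_time_ratio:
            out[col] = parsed  # unparseable entries become NaT
            continue

        # 2) Numeric: convert only if every value is numeric-like.
        numeric = pd.to_numeric(values, errors="coerce")
        if numeric.notna().all():
            out[col] = numeric
            continue

        # 3) Fallback: treat the column as strings.
        out[col] = values.astype(str)
    return out


df = pd.DataFrame({
    "Date": ["2023/01/01", "2023/01/02", "invalid-date", "2023/01/05"],
    "Amount": ["1", "2", "3", "4"],
    "Label": ["a", "b", "3", "d"],
})
typed = infer_types_sketch(df)
print(typed.dtypes)
```

Here three of the four Date values parse (a ratio of 0.75), so the column is converted at the default threshold but would stay a string column with `date_time_ratio=0.9`.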
Used internally by DataCleaner to check data quality.
- check_missing_values(self, data: pd.DataFrame) -> int:
  - Returns the total count of missing values in the DataFrame.
- validate_data_types(self) -> bool:
  - Checks if all columns in the DataFrame have consistent data types. Returns True if all columns have the same data type or if the DataFrame is empty, False otherwise.
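The two checks described above can be expressed in a few lines of pandas. This is a sketch of the described behaviour only: the standalone functions `count_missing` and `has_uniform_dtypes` are hypothetical names, and unlike the documented validate_data_types method they take the DataFrame as an explicit argument.

```python
import pandas as pd


def count_missing(df: pd.DataFrame) -> int:
    # Total number of missing cells across all columns.
    return int(df.isna().sum().sum())


def has_uniform_dtypes(df: pd.DataFrame) -> bool:
    # True for an empty DataFrame, or when every column shares one dtype.
    return df.empty or len(set(df.dtypes)) == 1


mixed = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", None]})
uniform = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

print(count_missing(mixed))
print(has_uniform_dtypes(mixed), has_uniform_dtypes(uniform))
```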
Contributions are welcome! If you have suggestions for improvements or new features, please open an issue or submit a pull request.
This project is licensed under the GNU License - see the LICENSE file for details.