rayan589/fraud_detection

# Project Overview
This project demonstrates a machine learning pipeline for fraud detection using two approaches:

- Python, Scikit-learn, and Pandas for data augmentation and model training.
- PySpark for data preprocessing and training machine learning models on large datasets.


# File Descriptions
## File 1: data_augmentation_and_ml_models.ipynb

Contains the implementation of data augmentation using bootstrap sampling with Gaussian noise.
Includes machine learning model training and evaluation using libraries such as Scikit-learn and Pandas.
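
Bootstrap sampling with Gaussian noise can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `augment_bootstrap_gaussian`, the `noise_scale` parameter, and the column names in the usage note are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def augment_bootstrap_gaussian(df, n_rows, numeric_cols, noise_scale=0.05, seed=0):
    """Resample rows with replacement, then jitter the numeric columns.

    Hypothetical helper for illustration; noise_scale is the noise standard
    deviation expressed as a fraction of each column's own standard deviation.
    """
    rng = np.random.default_rng(seed)

    # Bootstrap step: draw row positions uniformly with replacement.
    positions = rng.integers(0, len(df), size=n_rows)
    out = df.iloc[positions].reset_index(drop=True).copy()

    # Noise step: zero-mean Gaussian noise, scaled per column.
    sigma = df[numeric_cols].std(ddof=0).to_numpy() * noise_scale
    noise = rng.normal(0.0, 1.0, size=(n_rows, len(numeric_cols))) * sigma
    out[numeric_cols] = out[numeric_cols].to_numpy(dtype=float) + noise

    # Columns not listed in numeric_cols (e.g. the label) are left untouched.
    return out
```

For example, assuming the original data has numeric columns named `amount` and `account_age_days` (hypothetical names), `augment_bootstrap_gaussian(df, n_rows=10_000, numeric_cols=["amount", "account_age_days"])` would grow a 1k-row frame to 10k rows while keeping each resampled row's label.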

## File 2: data_preprocessing_and_pyspark_training.ipynb

Focuses on data preprocessing and machine learning model training using PySpark.
Designed to handle large datasets efficiently, showcasing the importance of big data tools.

## Data files

- CSV file of the original dataset (1k rows)
- CSV file of the augmented dataset (10k rows)
- .txt file containing links to the augmented 100k-row and 1M-row datasets

# Key Insight
Comparing the two approaches shows that as datasets grow (here from 1k to 1M rows), big data tools like PySpark become crucial for scalability and performance.
