GitHub - rayan589/fraud


# Project Overview
This project demonstrates a machine learning pipeline for fraud detection using two approaches:

## Python, Scikit-learn, and Pandas for data augmentation and model training.
## PySpark for data preprocessing and training machine learning models on large datasets.


# File Descriptions
## File 1: data_augmentation_and_ml_models.ipynb

Contains the implementation of data augmentation using bootstrap sampling with Gaussian noise.
Includes machine learning model training and evaluation using libraries such as Scikit-learn and Pandas.
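
The augmentation idea described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `augment_with_bootstrap_noise`, the `noise_scale` parameter, the `is_fraud` label column, and the toy columns are all assumptions made for the example. Rows are resampled with replacement, and each numeric feature then receives Gaussian noise proportional to that column's standard deviation, while the label is left unchanged.

```python
import numpy as np
import pandas as pd


def augment_with_bootstrap_noise(df, n_rows, noise_scale=0.05,
                                 label_col="is_fraud", seed=0):
    """Resample rows with replacement, then jitter numeric features.

    `noise_scale` and `label_col` are hypothetical names for this sketch.
    """
    rng = np.random.default_rng(seed)

    # Bootstrap step: draw n_rows row positions with replacement.
    positions = rng.integers(0, len(df), size=n_rows)
    sampled = df.iloc[positions].reset_index(drop=True)

    # Noise step: add zero-mean Gaussian noise to every numeric feature,
    # scaled by that column's standard deviation. The label is not perturbed.
    numeric_cols = sampled.select_dtypes(include="number").columns
    feature_cols = [c for c in numeric_cols if c != label_col]
    for col in feature_cols:
        sigma = df[col].std(ddof=0) * noise_scale
        jitter = rng.normal(0.0, sigma, size=n_rows)
        sampled[col] = sampled[col].to_numpy(dtype=float) + jitter
    return sampled


# Toy data standing in for the original dataset (columns invented for illustration).
original = pd.DataFrame({
    "amount": [12.5, 250.0, 31.0, 980.0, 45.5],
    "account_age_days": [400, 12, 730, 3, 150],
    "is_fraud": [0, 1, 0, 1, 0],
})
augmented = augment_with_bootstrap_noise(original, n_rows=50)
print(augmented.shape)  # (50, 3)
```

Scaling the noise per column keeps the perturbation comparable across features with very different ranges; a single fixed standard deviation would be another reasonable choice.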

## File 2: data_preprocessing_and_pyspark_training.ipynb

Focuses on data preprocessing and machine learning model training using PySpark.
Designed to handle large datasets efficiently, showcasing the importance of big data tools.

## CSV file of the original dataset (1k rows)
## CSV file of the augmented dataset (10k rows)
## .txt file containing the link for the augmented 100k-row dataset and the augmented 1M-row dataset

# Key Insight
As datasets grow in size (here, from 1k rows up to 1M rows), big data tools like PySpark become crucial for scalability and performance.