This project establishes a batch processing pipeline for airline data on AWS. When data lands in the S3 bucket, CloudTrail records the event and an EventBridge rule matches it, triggering a Step Functions state machine that automates the ETL operations executed with AWS Glue. The end goal is to load the processed data into Amazon Redshift, with notifications sent for both successful and failed ETL jobs.
The pipeline structure is as follows:
- Airline Data Loading to S3:
  - The producer uploads airline data to the S3 bucket.
- Event Trigger with CloudTrail and EventBridge:
  - CloudTrail records S3 data events, and an EventBridge rule matches the upload and triggers the workflow (see the rule sketch after this list).
- Automation with Step Functions:
  - EventBridge starts the Step Functions state machine, which orchestrates the processing workflow (see the state machine sketch below).
- Glue Crawler and PySpark ETL Job:
  - The state machine runs a Glue Crawler to infer the schema of the newly arrived data.
  - Once the crawler completes, the state machine starts the Glue PySpark ETL job (see the job sketch below).
  - If the job succeeds, the processed data is loaded into Amazon Redshift.
  - Notifications are sent for both successful and failed ETL jobs.
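The sketches below illustrate the main pieces; all resource names, ARNs, and account IDs are placeholders rather than values from this project. First, a boto3 sketch of the event trigger: an EventBridge rule that matches CloudTrail-recorded `PutObject` calls on the landing bucket and targets the state machine.

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical names -- substitute your own bucket, state machine, and role ARNs.
BUCKET = "airline-data-landing"
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:airline-etl"
EVENTS_ROLE_ARN = "arn:aws:iam::123456789012:role/eventbridge-invoke-stepfn"

# Match S3 PutObject data events recorded by CloudTrail for the landing bucket.
pattern = {
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["s3.amazonaws.com"],
        "eventName": ["PutObject"],
        "requestParameters": {"bucketName": [BUCKET]},
    },
}

events.put_rule(
    Name="airline-data-upload",
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)

# Point the rule at the Step Functions state machine.
events.put_targets(
    Rule="airline-data-upload",
    Targets=[{"Id": "start-etl", "Arn": STATE_MACHINE_ARN, "RoleArn": EVENTS_ROLE_ARN}],
)
```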
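The state machine itself could be defined roughly as the Amazon States Language document below, built here as a Python dict: start the crawler, run the Glue job synchronously, then publish a success or failure message to an SNS topic. A fixed wait stands in for polling the crawler's status, and the crawler, job, and topic names are assumptions.

```python
import json

# Hypothetical resource names -- adjust for your account.
definition = {
    "Comment": "Crawl new airline data, run the Glue ETL job, then notify.",
    "StartAt": "StartCrawler",
    "States": {
        "StartCrawler": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "airline-raw-crawler"},
            "Next": "WaitForCrawler",
        },
        # A fixed wait stands in here for polling glue:GetCrawler until READY.
        "WaitForCrawler": {"Type": "Wait", "Seconds": 120, "Next": "RunEtlJob"},
        "RunEtlJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "airline-etl-job"},
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "Next": "NotifySuccess",
        },
        "NotifySuccess": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:airline-etl-alerts",
                "Message": "Airline ETL job succeeded.",
            },
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:airline-etl-alerts",
                "Message": "Airline ETL job failed.",
            },
            "End": True,
        },
    },
}

# Paste the JSON into the Step Functions console or pass it to boto3 create_state_machine.
print(json.dumps(definition, indent=2))
```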
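Finally, a minimal sketch of the Glue PySpark job: read the table the crawler registered in the Data Catalog, apply a transformation, and write to Redshift through a preconfigured Glue connection. The database, table, connection, and bucket names are placeholders.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler registered in the Data Catalog (placeholder names).
flights = glue_context.create_dynamic_frame.from_catalog(
    database="airline_db", table_name="raw_flights"
)

# Example transformation: drop rows missing a flight number.
flights = flights.filter(lambda row: row["flight_number"] is not None)

# Load into Redshift through a preconfigured Glue connection; the temp dir stages data in S3.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=flights,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "public.flights", "database": "dev"},
    redshift_tmp_dir="s3://airline-data-temp/redshift/",
)

job.commit()
```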
Prerequisites:
- An AWS account with IAM permissions for the services above.
- A VPC connection from Glue to the Redshift cluster.
