Data Engineering code examples for batch processing with Python, PySpark, Airflow and AWS/Localstack.
This project brings together a few data technology tools and concepts in a simple way, including batch processing, task orchestration, and the use of different data stores.
This project uses Docker Compose to provide the services that comprise the stack, including:
- Localstack for AWS S3, with initial datasets and an S3 bucket serving as the datalake (see the boto3 sketch after this list)
- Postgres as the data warehouse storage
- Airflow for task orchestration
- Trino for the query engine
- Hive Metastore for data catalogue
- Docker for container management
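As a quick way to verify that Localstack is serving the datalake, the bucket contents can be listed from Python with boto3. This is a minimal sketch only: the Localstack edge endpoint (`http://localhost:4566`), the dummy credentials, and the bucket name `datalake` are assumptions based on common Localstack defaults, not values taken from this repository's configuration.

```python
# Minimal sketch: inspect the datalake bucket served by Localstack.
# Endpoint, credentials and bucket name are assumptions, not project values.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",  # default Localstack edge port (assumed)
    aws_access_key_id="test",              # Localstack accepts dummy credentials
    aws_secret_access_key="test",
    region_name="us-east-1",
)

# Show the buckets created by the initialisation scripts.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# List a few objects from a hypothetical "datalake" bucket.
response = s3.list_objects_v2(Bucket="datalake", MaxKeys=10)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```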
The repository is organized into the following projects:
- batch-jobs - Spark scripts written with PySpark (a minimal job sketch follows this list).
- airflow - Airflow DAGs and operators that run the batch jobs (a minimal DAG sketch also follows).
- hive-metastore - Hive standalone metastore for mapping partitioned parquet files on S3.
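To give a feel for what a job in batch-jobs looks like, here is a minimal PySpark sketch that reads a raw dataset from the datalake and writes it back as partitioned parquet, which is the layout the Hive metastore maps for Trino. The S3A endpoint, credentials, paths, and the partition column are illustrative assumptions, not the project's actual configuration.

```python
# Minimal PySpark batch job sketch: read raw data from the datalake and
# write it back as partitioned parquet. Endpoint, paths and the partition
# column are assumptions for illustration only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example-batch-job")
    # Point the S3A filesystem at Localstack (endpoint and credentials assumed).
    .config("spark.hadoop.fs.s3a.endpoint", "http://localstack:4566")
    .config("spark.hadoop.fs.s3a.access.key", "test")
    .config("spark.hadoop.fs.s3a.secret.key", "test")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read a hypothetical raw dataset from the datalake.
df = spark.read.option("header", "true").csv("s3a://datalake/raw/orders.csv")

# Write partitioned parquet so the Hive metastore can expose it to Trino.
(
    df.write
    .mode("overwrite")
    .partitionBy("order_date")  # hypothetical partition column
    .parquet("s3a://datalake/processed/orders/")
)

spark.stop()
```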
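On the orchestration side, a DAG in airflow can spawn the batch-jobs image as a Docker container, which is why containers appear and disappear when watching `docker ps` during a run (see the run instructions below). This is a sketch assuming Airflow 2.x with the Docker provider installed; the DAG id, schedule, command, and entrypoint path are hypothetical, while the image tag matches the build step later in this README.

```python
# Minimal sketch of an Airflow DAG that runs a batch job as a Docker container.
# Assumes Airflow 2.x with apache-airflow-providers-docker installed; the DAG id,
# schedule and command are illustrative, the image tag comes from the build step.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="example_batch_job",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_batch_job = DockerOperator(
        task_id="run_batch_job",
        image="batch-jobs:latest",  # built in the run instructions below
        command="spark-submit /app/jobs/example_job.py",  # hypothetical entrypoint
        docker_url="unix://var/run/docker.sock",
        network_mode="bridge",  # hypothetical network
    )
```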
- Build the Spark batch jobs container: `docker build -f projects/batch-jobs/Dockerfile -t batch-jobs:latest ./projects/batch-jobs`
- Run Docker Compose: `docker compose up --build`
- View Airflow at http://localhost:8080 with user `admin` and password `admin`.
- Monitor Trino at http://localhost:8081 with user `trino`.
- Watch the Docker tasks being spawned with `watch -n1 docker ps`.
- Query the datalake on `jdbc:trino://localhost:8081/hive` with user `trino` and no password (a Python equivalent follows this list).
- Query the data warehouse on `jdbc:postgresql://localhost:5432/data_warehouse` with user `postgres` and password `password`.
- If you are curious, query the hive metastore on `jdbc:postgresql://localhost:5452/hive` with user `postgres` and password `password`.
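As a programmatic alternative to the JDBC URLs above, the same endpoints can be reached from Python. This sketch assumes the `trino` and `psycopg2` client packages are installed; the queries are generic placeholders rather than tables guaranteed to exist in this project.

```python
# Minimal sketch: query the datalake via Trino and the warehouse via Postgres.
# Connection details come from this README; the queries are generic placeholders.
import trino
import psycopg2

# Datalake through Trino (hive catalog, user "trino", no password).
trino_conn = trino.dbapi.connect(
    host="localhost", port=8081, user="trino", catalog="hive", schema="default"
)
trino_cur = trino_conn.cursor()
trino_cur.execute("SHOW TABLES")
print(trino_cur.fetchall())

# Data warehouse on Postgres.
pg_conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="data_warehouse",
    user="postgres",
    password="password",
)
with pg_conn.cursor() as pg_cur:
    pg_cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public'"
    )
    print(pg_cur.fetchall())
pg_conn.close()
```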
- Load one of the projects, such as batch-jobs, as a project itself.
- Follow the respective `README.md` for build and run instructions.