Data Engineering code examples for batch processing with Python, PySpark, Airflow and AWS/Localstack.
This project brings together a few data technology tools and concepts in a simple way, including batch processing, task orchestration, and the use of different data stores.
This project uses Docker Compose to provide the services that comprise the stack, including:
- Localstack for AWS S3, with initial datasets and an S3 bucket serving as the datalake (see the boto3 sketch after this list)
- Postgres as the data warehouse storage
- Airflow for task orchestration
- Trino for the query engine
- Hive Metastore for data catalogue
- Docker for container management
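As a quick way to verify that Localstack is serving the datalake, the bucket contents can be listed from Python with boto3. This is a minimal sketch only: the Localstack edge endpoint (`http://localhost:4566`), the dummy credentials, and the bucket name `datalake` are assumptions based on common Localstack defaults, not values taken from this repository's configuration.

```python
# Minimal sketch: inspect the datalake bucket served by Localstack.
# Endpoint, credentials and bucket name are assumptions, not project values.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",  # default Localstack edge port (assumed)
    aws_access_key_id="test",              # Localstack accepts dummy credentials
    aws_secret_access_key="test",
    region_name="us-east-1",
)

# Show the buckets created by the initialisation scripts.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# List a few objects from a hypothetical "datalake" bucket.
response = s3.list_objects_v2(Bucket="datalake", MaxKeys=10)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```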
The repository is organized into the following projects:
- batch-jobs - Spark scripts written with PySpark (a minimal job sketch follows this list).
- airflow - Airflow DAGs and operators that run the batch jobs (a minimal DAG sketch also follows).
- hive-metastore - Hive standalone metastore for mapping partitioned parquet files on S3.
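To give a feel for what a job in batch-jobs looks like, here is a minimal PySpark sketch that reads a raw dataset from the datalake and writes it back as partitioned parquet, which is the layout the Hive metastore maps for Trino. The S3A endpoint, credentials, paths, and the partition column are illustrative assumptions, not the project's actual configuration.

```python
# Minimal PySpark batch job sketch: read raw data from the datalake and
# write it back as partitioned parquet. Endpoint, paths and the partition
# column are assumptions for illustration only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example-batch-job")
    # Point the S3A filesystem at Localstack (endpoint and credentials assumed).
    .config("spark.hadoop.fs.s3a.endpoint", "http://localstack:4566")
    .config("spark.hadoop.fs.s3a.access.key", "test")
    .config("spark.hadoop.fs.s3a.secret.key", "test")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read a hypothetical raw dataset from the datalake.
df = spark.read.option("header", "true").csv("s3a://datalake/raw/orders.csv")

# Write partitioned parquet so the Hive metastore can expose it to Trino.
(
    df.write
    .mode("overwrite")
    .partitionBy("order_date")  # hypothetical partition column
    .parquet("s3a://datalake/processed/orders/")
)

spark.stop()
```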
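On the orchestration side, a DAG in airflow can spawn the batch-jobs image as a Docker container, which is why containers appear and disappear when watching `docker ps` during a run (see the run instructions below). This is a sketch assuming Airflow 2.x with the Docker provider installed; the DAG id, schedule, command, and entrypoint path are hypothetical, while the image tag matches the build step later in this README.

```python
# Minimal sketch of an Airflow DAG that runs a batch job as a Docker container.
# Assumes Airflow 2.x with apache-airflow-providers-docker installed; the DAG id,
# schedule and command are illustrative, the image tag comes from the build step.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="example_batch_job",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_batch_job = DockerOperator(
        task_id="run_batch_job",
        image="batch-jobs:latest",  # built in the run instructions below
        command="spark-submit /app/jobs/example_job.py",  # hypothetical entrypoint
        docker_url="unix://var/run/docker.sock",
        network_mode="bridge",  # hypothetical network
    )
```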
- Build the Spark batch jobs container: `docker build -f projects/batch-jobs/Dockerfile -t batch-jobs:latest ./projects/batch-jobs`
- Run Docker Compose: `docker compose up --build`
- View Airflow at http://localhost:8080 with user `admin` and password `admin`.
- Monitor Trino at http://localhost:8081 with user `trino`.
- Watch the Docker tasks being spawned with `watch -n1 docker ps`.
- Query the datalake on `jdbc:trino://localhost:8081/hive` with user `trino` and no password (a Python equivalent follows this list).
- Query the data warehouse on `jdbc:postgresql://localhost:5432/data_warehouse` with user `postgres` and password `password`.
- If you are curious, query the hive metastore on `jdbc:postgresql://localhost:5452/hive` with user `postgres` and password `password`.
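As a programmatic alternative to the JDBC URLs above, the same endpoints can be reached from Python. This sketch assumes the `trino` and `psycopg2` client packages are installed; the queries are generic placeholders rather than tables guaranteed to exist in this project.

```python
# Minimal sketch: query the datalake via Trino and the warehouse via Postgres.
# Connection details come from this README; the queries are generic placeholders.
import trino
import psycopg2

# Datalake through Trino (hive catalog, user "trino", no password).
trino_conn = trino.dbapi.connect(
    host="localhost", port=8081, user="trino", catalog="hive", schema="default"
)
trino_cur = trino_conn.cursor()
trino_cur.execute("SHOW TABLES")
print(trino_cur.fetchall())

# Data warehouse on Postgres.
pg_conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="data_warehouse",
    user="postgres",
    password="password",
)
with pg_conn.cursor() as pg_cur:
    pg_cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public'"
    )
    print(pg_cur.fetchall())
pg_conn.close()
```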
- Load one of the projects, such as batch-jobs, as a project itself.
- Follow the respective `README.md` for build and run instructions.