Airflow Capstone Project: Wikipedia Pageviews Analysis

Overview

This project is an Airflow DAG designed to download, process, and store Wikipedia pageview data for specific companies. It retrieves hourly pageview data, extracts the relevant information, and writes the results to a PostgreSQL database.

Table of Contents

• Prerequisites
• Setup
• DAG Details
• Usage
• Tasks
• Database Schema

Prerequisites

• Python: version 3.6 or higher
• Apache Airflow: version 2.0 or higher
• PostgreSQL: a running instance with the required database and table
• Docker: optional, if deploying with Docker

Setup

Clone the repository:

```bash
git clone https://github.com/yousraiq/airflow_assignment.git
cd airflow_assignment
```

Set up PostgreSQL:

Ensure that you have a PostgreSQL instance running and create the following table:

```sql
CREATE TABLE pageview_counts (
    pagename VARCHAR(255),
    viewcount INTEGER,
    execution_date TIMESTAMP
);
```

Configure Airflow:

Ensure your Airflow environment is set up correctly, and create a connection in Airflow for PostgreSQL with the conn_id set to my_postgres.
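
For example, the connection can be created with Airflow 2's connections CLI. The host, port, database, and credentials below are placeholders; substitute the values for your own PostgreSQL instance:

```bash
airflow connections add my_postgres \
    --conn-type postgres \
    --conn-host localhost \
    --conn-port 5432 \
    --conn-schema airflow \
    --conn-login airflow \
    --conn-password airflow
```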

DAG Details

The DAG defined in this project is named capstone and runs on an hourly schedule. It consists of four main tasks (a minimal sketch of the DAG definition follows the list):

1. Download Wikipedia Pageviews Data: fetches the gzip file containing pageview data for the current execution date.

2. Extract the Downloaded Data: unzips the downloaded gzip file.

3. Fetch Pageviews for Specific Companies: parses the extracted file and counts the pageviews for the specified companies: Google, Amazon, Apple, Microsoft, and Facebook.

4. Write Results to PostgreSQL: writes the counted pageviews into a PostgreSQL database.
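
Below is a minimal sketch of how the DAG skeleton might be wired together. It is illustrative only: the start date, file paths, and download helper are assumptions, and the authoritative definition lives in this repository's DAG file. The fetch_pageviews and write_to_postgres tasks are sketched in the Tasks section.

```python
from urllib import request

import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def _get_data(year, month, day, hour, output_path):
    # Wikimedia publishes hourly pageview dumps under this URL pattern.
    url = (
        "https://dumps.wikimedia.org/other/pageviews/"
        f"{year}/{year}-{month:0>2}/"
        f"pageviews-{year}{month:0>2}{day:0>2}-{hour:0>2}0000.gz"
    )
    request.urlretrieve(url, output_path)


with DAG(
    dag_id="capstone",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),  # illustrative
    schedule_interval="@hourly",
    catchup=False,
    template_searchpath="/tmp",  # lets downstream tasks find generated SQL files
) as dag:
    get_data = PythonOperator(
        task_id="get_data",
        python_callable=_get_data,
        # Templated values resolve to the hour being processed.
        op_kwargs={
            "year": "{{ execution_date.year }}",
            "month": "{{ execution_date.month }}",
            "day": "{{ execution_date.day }}",
            "hour": "{{ execution_date.hour }}",
            "output_path": "/tmp/wikipageviews.gz",
        },
    )

    extract_gz = BashOperator(
        task_id="extract_gz",
        bash_command="gunzip --force /tmp/wikipageviews.gz",
    )

    get_data >> extract_gz
```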

Usage

Start Airflow:

1. If using Docker, run the following command:

   ```bash
   docker-compose up
   ```

2. If using a local Airflow installation, start the Airflow webserver and scheduler:

   ```bash
   airflow webserver --port 8080
   airflow scheduler
   ```

Access the Airflow UI:

Open your web browser and navigate to http://localhost:8080 to view and trigger the DAG.

Trigger the DAG:

From the Airflow UI, you can manually trigger the capstone DAG and monitor the execution of each task.
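
The DAG can also be triggered from the command line:

```bash
airflow dags trigger capstone
```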

Tasks

• get_data: downloads the Wikipedia pageviews data.
• extract_gz: unzips the downloaded data file.
• fetch_pageviews: parses the data to count pageviews for specific companies.
• write_to_postgres: inserts the results into the PostgreSQL database.
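
The last two tasks can be sketched as follows, assuming the common pattern of having fetch_pageviews stage INSERT statements into a SQL file that write_to_postgres then executes; the file paths and this hand-off are assumptions, not necessarily the repository's exact implementation. The dag variable and template_searchpath come from the skeleton in the DAG Details section.

```python
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator


def _fetch_pageviews(pagenames, execution_date):
    # Each line of the dump reads: "<domain> <page_title> <view_count> <size>".
    result = dict.fromkeys(pagenames, 0)
    with open("/tmp/wikipageviews") as f:
        for line in f:
            domain_code, page_title, view_counts = line.split(" ")[:3]
            if domain_code == "en" and page_title in pagenames:
                result[page_title] = view_counts

    # Stage one INSERT per company for the downstream task to execute.
    with open("/tmp/postgres_query.sql", "w") as f:
        for pagename, count in result.items():
            f.write(
                "INSERT INTO pageview_counts VALUES "
                f"('{pagename}', {count}, '{execution_date}');\n"
            )


fetch_pageviews = PythonOperator(
    task_id="fetch_pageviews",
    python_callable=_fetch_pageviews,
    op_kwargs={"pagenames": {"Google", "Amazon", "Apple", "Microsoft", "Facebook"}},
    dag=dag,
)

write_to_postgres = PostgresOperator(
    task_id="write_to_postgres",
    postgres_conn_id="my_postgres",  # the connection created during setup
    sql="postgres_query.sql",        # found via the DAG's template_searchpath
    dag=dag,
)

fetch_pageviews >> write_to_postgres
```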

Database Schema

The results are stored in the pageview_counts table with the following schema:

| Column Name    | Type         |
| -------------- | ------------ |
| pagename       | VARCHAR(255) |
| viewcount      | INTEGER      |
| execution_date | TIMESTAMP    |
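
With this schema in place, a quick ad-hoc query (hypothetical, not part of the DAG) can show which company drew the most views in a given hour:

```sql
SELECT pagename, viewcount
FROM pageview_counts
WHERE execution_date = '2024-01-01 10:00:00'
ORDER BY viewcount DESC
LIMIT 1;
```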
