Ensure that you have a PostgreSQL instance running, then create the following table: `CREATE TABLE pageview_counts (pagename VARCHAR(255), viewcount INTEGER, execution_date TIMESTAMP);`
- Download Wikipedia Pageviews Data: Fetches the gzip file containing pageviews data for the current execution date.
- Extract the Downloaded Data: Unzips the downloaded gzip file.
- Fetch Pageviews for Specific Companies: Parses the extracted file and counts the pageviews for specified companies: Google, Amazon, Apple, Microsoft, and Facebook.
- Write Results to PostgreSQL: Writes the counted pageviews into a PostgreSQL database.
- If using a local Airflow installation, start the Airflow web server and scheduler: `airflow webserver --port 8080` and `airflow scheduler`.
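
The download, extract, and count steps listed above can be sketched in plain Python, outside of Airflow. This is a minimal illustration, not the DAG's actual task code: the URL layout is assumed to follow the public Wikimedia hourly dumps, the file is assumed to hold space-separated `domain_code page_title view_count response_size` rows, and the function names, the `COMPANIES` constant, and the restriction to the `en` domain are hypothetical choices made for the example.

```python
import gzip

# The five pages named in the task list above.
COMPANIES = {"Google", "Amazon", "Apple", "Microsoft", "Facebook"}


def pageviews_url(ts):
    """Build the hourly dump URL for a datetime (assumed Wikimedia layout)."""
    return (
        "https://dumps.wikimedia.org/other/pageviews/"
        f"{ts:%Y}/{ts:%Y-%m}/pageviews-{ts:%Y%m%d-%H}0000.gz"
    )


def count_pageviews(gz_path, pagenames=COMPANIES, domain="en"):
    """Return {pagename: views} for the requested pages in one gzip dump."""
    counts = dict.fromkeys(pagenames, 0)
    # gzip.open in text mode decompresses while reading, so no separate
    # unzip step is needed in this sketch.
    with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            parts = line.split()
            if len(parts) != 4:
                continue  # skip rows that do not match the assumed format
            domain_code, page_title, view_count, _ = parts
            if domain_code == domain and page_title in counts:
                counts[page_title] += int(view_count)
    return counts
```

Calling `count_pageviews("pageviews.gz")` on a downloaded file would return a dictionary with one entry per company, with zero for pages that do not appear in that hour.
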
The `pageview_counts` table has the following columns:

| Column Name | Type |
|---|---|
| pagename | VARCHAR(255) |
| viewcount | INTEGER |
| execution_date | TIMESTAMP |
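
To show how counted pageviews map onto these columns, here is a small sketch that turns a dictionary of counts into rows in column order, together with a parameterized `INSERT` statement. The `build_rows` helper and the `INSERT_SQL` constant are hypothetical names for illustration, and the `%s` placeholder style assumes a DB-API driver such as psycopg2; the pipeline itself may write its results differently.

```python
from datetime import datetime

# Parameterized statement matching the pageview_counts columns.
# The %s placeholders assume a driver such as psycopg2 (an assumption here).
INSERT_SQL = (
    "INSERT INTO pageview_counts (pagename, viewcount, execution_date) "
    "VALUES (%s, %s, %s)"
)


def build_rows(counts, execution_date):
    """Turn {pagename: viewcount} into tuples in table column order."""
    return [
        (name, views, execution_date)
        for name, views in sorted(counts.items())
    ]


# Example values only; real counts would come from the parsing step.
rows = build_rows({"Google": 7, "Apple": 3}, datetime(2024, 1, 2, 3))
```

With an open DB-API cursor, the rows could then be written in one call, for example `cursor.executemany(INSERT_SQL, rows)`, which keeps the values out of the SQL string itself.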