Author: Tianlin He
Date: 12 Aug 2020
Tag: #Udacity #Data Engineering
A startup called Sparkify wants to analyse the data they have been collected on songs and user activity on their music streaming app, with a particular interest in the songs that are listened by the users. The tasks of this project include:
- Query the data in the form of JSON logs from two local directories
- Create a Postgres database schema and ETL pipeline based on the queries
- Test the database and pipeline by running queries and validate with the expected results
The JSON data reside in two local directories with structure:
The log data record user activities (user id, user level etc), while the song data contain information about songs (artist, year etc).
We will create a star schema optimised for queries on song play analysis, which include a fact table and four dimension tables:
* `songplays` Records in log data associated with song plays in the app
usersUsers in the app- songs
Songs in music database artistsArtists in music databasetimeTimestamps of records in log data broken down into specific time units (hour, day, week etc)
The project is developed in a local environment (MacOS) with detailed steps in step-by-step-data-modelling-locally. In addition, there are several tips:
- For a typical data engineering task, the datatypes columns were not predefined. Thus they have to be manually added by inspecting the data files
- Don't forget to define a primary key (PK) for each table
ON CONFLICT (PK) DO actionin the case of duplicated rows
Once we built the database, we can run some queries in test.ipynb:
%sql SELECT COUNT(*) FROM songplays;There are songplays.
%sql SELECT * FROM songplays WHERE song_id IS NOT NULL;| songplay_id | start_time | user_id | level | song_id | artist_id | session_id | location | user_agent |
|---|---|---|---|---|---|---|---|---|
| 5537 | 2018-11-21 21:56:47.796000 | 15 | paid | SOZCTXZ12AB0182364 | AR5KOSW1187FB35FF4 | 818 | Chicago-Naperville-Elgin, IL-IN-WI | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36" |
We could only match one song from music with log data.
%sql SELECT level, COUNT(*) FROM users GROUP BY level;There are
%%sql SELECT a.name AS artist, s.title AS song, s.duration AS song_length
FROM songs AS s
JOIN artists AS a
ON s.artist_id=a.artist_id
ORDER BY s.duration DESC
LIMIT 1;The longest song is:
| artist | song | song_length |
|---|---|---|
| Faiz Ali Faiz | Sohna Nee Sohna Data | 599.24853 |
%%sql SELECT *
FROM time
WHERE start_time=(SELECT MAX(start_time) FROM time);The most recent record is:
| start_time | hour | day | week | month | year | weekday |
|---|---|---|---|---|---|---|
| 2018-11-30 19:54:24.796000 | 19 | 30 | 48 | 11 | 2018 | 4 |
