Construct a data pipeline capable of normalising and storing data in AWS (Amazon Web Services) environment
- Solution should be utilising AWS solutions
- Code should be written in Python
- Data should be normalised as much as possible
- Data processing should be triggered by AWS S3 event notifications. More info can be found here.
- Data processing should be done utilising AWS Lambda Function
- Final data should be stored on AWS S3 or one of the AWS databases (RDS, DynamoDB etc.)
- At least one unit test should be created
- You can find dataset in data directory as
Open_DataRDW.csvfile - Dataset is an open data subset from https://opendata.rdw.nl/Voertuigen.
- Vehicle type -
Voertuigsoort- should be normalised to English term. Same goes for body type (Inrichting), and colour (Eerste kleur). - All date fields should be converted to ISO-8601 timezone-aware datetime format.
- Depending where the final data is going to be stored, it may need to be grouped by vehicle make (e.g. Ford, Audi et). Thus, if it is JSON/CSV final output, it should be separated to distinct files. If final output chosen as a database, grouping is not necessary however uniqueness should be explicitly defined by license plate number.
- Any
n.v.t.null values in colour column should be correctly translated to final output (e.g. empty for CSV,nullfor DBs)
- Code should be shared with @smagu and @Kontuzijus as collaborators on private repository when it is ready for a review
- Any explanations should be written either in README.md or within the code as comments
- AWS Free Tier should be sufficient enough for all the necessary work
- AWS resource creation such as Lambda Function and S3 events can be either created manually, or deployed using frameworks such as Serverless.
- The easiest approach is by using AWS console and supplied examples
- To imitate a new S3 object, you may upload new file by hand and then use
s3:ObjectCreated:*event notifaction to trigger AWS Lambda