From 5ffc1888b60db2b72a0f7884115cf97ffc3bf385 Mon Sep 17 00:00:00 2001 From: Riaz Arbi Date: Fri, 14 May 2021 18:02:28 +0200 Subject: [PATCH 01/28] Add not recommending watch for bugfixes --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index a175d4f..043d124 100644 --- a/README.md +++ b/README.md @@ -37,6 +37,8 @@ You can use any tool to produce the output, e.g. Python, R, Excel, Power BI, Tab 3. Host your repository somewhere that is publicly accessible. If you're using GitHub, please use a fork of our original repository. 4. Inform us via email that your challenge is complete, including a link to your repo. Be sure to check that it is set to public. +**Be sure to 'watch' this repo for changes - we may push bugfixes** + NOTE: If you would like to _improve_ the content of this repository, by fixing typos or perhaps enhancing the challenge, please do so by submitting a merge request. ### Candidates where programming is not required (Data Analysts) From fc060c85127bd5e0d61184003fc0b2a81db2dd15 Mon Sep 17 00:00:00 2001 From: Gordon Inggs Date: Fri, 14 May 2021 18:41:08 +0200 Subject: [PATCH 02/28] Adding dummy credentials to make things easier. --- README.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/README.md b/README.md index 043d124..9123767 100644 --- a/README.md +++ b/README.md @@ -56,6 +56,7 @@ For all roles, we expect the challenge response to include what you consider to Your code should be well formatted according to generally accepted style guides and include whatever is necessary for a team-mate unfamiliar with it to maintain it. ### 0. Setup +#### Data We have made the following datasets available (each filename is a link).
These are all available in an AWS bucket `cct-ds-code-challenge-input-data`, in the `af-south-1` region, with the object names being the filenames below: * [`sr.csv.gz`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/sr.csv.gz) contains 36 months of service request data, where each row is a service request. A service request is a request from one of the residents of the City of Cape Town to undertake significant work. This is an important source of information on service delivery, and our performance thereof. *Note* as indicated by the extension, this file is compressed. * [`sr_hex.csv.gz`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/sr_hex.csv.gz) contains the same data as `sr.csv` as well as a column `h3_level8_index`, which contains the appropriate resolution level 8 H3 index for that request. If the request doesn't have a valid geolocation, the index value will be `0`. *Note* as indicated by the extension, this file is compressed. @@ -67,6 +68,11 @@ We have made the following datasets available (each filename is a link). These a In some of the tasks below you will be creating datasets that are similar to these, feel free to use them to validate your work. +#### Dummy AWS Credentials +We have made AWS credentials, with the appropriate permissions set, available [here](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/ds_code_challenge_creds.json). + +*Note* These creds don't have any special access, other than what is already set on these resources for anonymous access. They are provided mainly to make using the various AWS client libraries easier. + ### 1. Data Extraction (if applying for a Data Engineering Position) Use the [AWS S3 SELECT](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html) command to read in the H3 resolution 8 data from `city-hex-polygons-8-10.geojson`.
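As a sketch of how the S3 SELECT read above might look in `python` with `boto3` (the `resolution` property name and the exact SELECT expression are assumptions to be checked against the file's actual schema):

```python
import json


def records_from_select_events(event_stream):
    """Concatenate the Records payloads of an S3 SELECT event stream
    and parse the newline-delimited JSON documents they contain."""
    payload = b"".join(
        event["Records"]["Payload"]
        for event in event_stream
        if "Records" in event
    )
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


def fetch_resolution8_features(s3_client,
                               bucket="cct-ds-code-challenge-input-data",
                               key="city-hex-polygons-8-10.geojson"):
    """Ask S3 SELECT to filter the GeoJSON server-side, returning only the
    resolution 8 features (assumes each feature has a `resolution` property)."""
    resp = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=("SELECT * FROM s3object[*].features[*] f "
                    "WHERE f.properties.resolution = 8"),
        InputSerialization={"JSON": {"Type": "DOCUMENT"}},
        OutputSerialization={"JSON": {}},
    )
    return records_from_select_events(resp["Payload"])
```

A suitable client could be created with `boto3.client("s3", region_name="af-south-1")` using the dummy credentials above; the event-stream handling is kept in its own function so it can be tested without touching AWS.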
Use the `city-hex-polygons-8.geojson` file to validate your work. From 4b68376899ba354027304d4ee55db5187a13a457 Mon Sep 17 00:00:00 2001 From: Gordon Inggs Date: Mon, 15 Nov 2021 15:19:58 +0200 Subject: [PATCH 03/28] Adding expectation of effort paragraph --- README.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 9123767..b579591 100644 --- a/README.md +++ b/README.md @@ -16,9 +16,14 @@ Principles of reproducible analysis and code versioning are very important to ou So, follow common conventions with respect to directory structure and names to make your work as easy to follow as possible. +## Expectation of Effort +We expect you to spend up to 48 calendar hours working on this assessment per position. If you are finding that you are spending significantly more time than this, then please contact whoever sent you the link to this assessment to let them know. + +You should have received over 7 days' warning that you would be undertaking this assessment. Please notify [Delyno du Toit](delyno.dutoit@capetown.gov.za) if this was not the case. + ### Candidates where programming is required (Data Scientist and Engineers) Requirements and notes: -* Our primary programming languages are `python` and `R`. We will accept code that is packaged in `py`, `.ipynb`, `.R` and `.Rmd` files. +* Our primary programming languages are `python` and `R`. We will accept code that is packaged in `.py`, `.ipynb`, `.R` and `.Rmd` files. * Bash or similar scripting language files are fine for glue. You may develop in any development environment you choose. * We expect to be able to clone your repo, immediately identify what script to execute from your README file, and execute it to completion with no human interaction. In order to ensure that our environment has the right libraries or packages, please follow standard python (PEP8) or R guidelines for structure in your code, i.e. place `import` and `library()` commands at the top of your scripts.
From f8bdb0fdc77736e092ebb5d7d2fe20d4ff093e56 Mon Sep 17 00:00:00 2001 From: Gordon Inggs Date: Wed, 22 Jun 2022 20:18:55 +0200 Subject: [PATCH 04/28] Updating Challenge for Round 3 (#2) * Updating Challenge for Round 3 * Added section on "Things to focus on". Also added references to these attributes throughout the challenge * Tweaked wording of (2) to suggest it's fine to just use H3 directly * Changed question in (3). Further tweaked on suggestions from @jolandabeck, adding open ended 3rd question. * Adjusted wording of (4), and added anomaly detection challenge. Clarified the wording of some questions, and extended, asking for the application of insights developed. * Adjusted wording of (5) to make clearer where we want judgement exercised. Added more complex joins to make up for (6) being removed. * Removed (6) - we just weren't getting anything good back --- README.md | 70 +++++++++++++++++++++++++++++++------------------------ 1 file changed, 40 insertions(+), 30 deletions(-) diff --git a/README.md b/README.md index b579591..06a1bfd 100644 --- a/README.md +++ b/README.md @@ -16,11 +16,19 @@ Principles of reproducible analysis and code versioning are very important to ou So, follow common conventions with respect to directory structure and names to make your work as easy to follow as possible. -## Expectation of Effort +## What we're looking for +### Expectation of Effort We expect you to spend up to 48 calendar hours working on this assessment per position. If you are finding that you are spending significantly more time than this, then please contact whoever sent you the link to this assessment to let them know. You should have received over 7 days' warning that you would be undertaking this assessment. Please notify [Delyno du Toit](delyno.dutoit@capetown.gov.za) if this was not the case.
+### Things to focus on +Over and above the tasks specified below, there are particular aspects of each position that we would like you to pay attention to: + +* Data Scientist candidates - we're looking for both good statistical insight into problems, as well as the ability to communicate complex topics. Please make a special effort to highlight what you believe to be the crux of a particular problem, as well as how your work addresses it. +* Data Engineer candidates - as the key enablers of our unit's work, we really want to see work done in a sustainable manner: writing for easy comprehension, testing, clean code and modularity all bring us joy. +* Data Analyst candidates - we consider success for our analysts when they provide the insights that inform actual decisions. Hence, we want evidence of both the ability to surface these insights from data, as well as the rhetorical skill in conveying the implications thereof. Your audience is intelligent, but non-specialist. + ### Candidates where programming is required (Data Scientist and Engineers) Requirements and notes: * Our primary programming languages are `python` and `R`. We will accept code that is packaged in `.py`, `.ipynb`, `.R` and `.Rmd` files. @@ -44,10 +52,11 @@ You can use any tool to produce the output, e.g. Python, R, Excel, Power BI, Tab **Be sure to 'watch' this repo for changes - we may push bugfixes** -NOTE: If you would like to _improve_ the content of this repository, by fixing typos or perhaps enhancing the challenge, please do so by submitting a merge request. +NOTE: If you would like to _improve_ the content of this repository, by fixing typos or perhaps enhancing the challenge, please do so by submitting a pull request. ### Candidates where programming is not required (Data Analysts) *NB* If you prefer, you may submit using the workflow described above. + 1. Download this repository using the Code -> `Download ZIP` option in the top right-hand corner. 2. Add your work into this folder. 3.
Create a compressed archive file with all of your work in it. @@ -56,22 +65,20 @@ NOTE: If you would like to _improve_ the content of this repository, by fixing t ## Challenge Follow the steps below, completing those indicated as relevant to the positions for which you are interviewing. If there are any steps that you cannot complete after a reasonable amount of effort, rather move on to later steps, attempting everything relevant at least once. -For all roles, we expect the challenge response to include what you consider to be role-appropriate testing and validation. For example, a Data Scientist might want to include MAPE scores or confusion matrices. A Data Engineer may want to include logging and data quality validation tests. A Data Analyst might want to plot histograms of request counts to ensure that outliers aren't overwhelming your analysis. +For all roles, we expect the challenge response to include what you consider to be role-appropriate testing and validation. For example, a Data Scientist might want to include MAPE scores or confusion matrices. A Data Engineer may want to include logging and data quality validation tests, as well as unit and even integration tests. A Data Analyst might want to plot histograms of the data in question to ensure that outliers aren't overwhelming your analysis. Your code should be well formatted according to generally accepted style guides and include whatever is necessary for a team-mate unfamiliar with it to maintain it. ### 0. Setup #### Data We have made the following datasets available (each filename is a link). These are all available in an AWS bucket `cct-ds-code-challenge-input-data`, in the `af-south-1` region, with the object names being the filenames below: -* [`sr.csv.gz`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/sr.csv.gz) contains 36 months of service request data, where each row is a service request.
A service request is a request from one of the residents of the City of Cape Town to undertake significant work. This is an important source of information on service delivery, and our performance thereof. *Note* as indicated by the extension, this file is compressed. +* [`sr.csv.gz`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/sr.csv.gz) contains 12 months of service request data, where each row is a service request. A service request is a request from one of the residents of the City of Cape Town to undertake significant work. This is an important source of information on service delivery, and our performance thereof. *Note* as indicated by the extension, this file is compressed. * [`sr_hex.csv.gz`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/sr_hex.csv.gz) contains the same data as `sr.csv` as well as a column `h3_level8_index`, which contains the appropriate resolution level 8 H3 index for that request. If the request doesn't have a valid geolocation, the index value will be `0`. *Note* as indicated by the extension, this file is compressed. * [`sr_hex_truncated.csv`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/sr_hex_truncated.csv) is a truncated version of `sr_hex.csv`, containing only 3 months of data. * [`city-hex-polygons-8.geojson`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/city-hex-polygons-8.geojson) contains the [H3 spatial indexing system](https://h3geo.org/) polygons and index values for the bounds of the City of Cape Town, at resolution level 8. * [`city-hex-polygons-8-10.geojson`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/city-hex-polygons-8-10.geojson) contains the [H3 spatial indexing system](https://h3geo.org/) polygons and index values for resolution levels 8, 9 and 10, for the City of Cape Town. -*Note* Some of these files are large, so start downloading as soon as possible. 
- -In some of the tasks below you will be creating datasets that are similar to these, feel free to use them to validate your work. +In some of the tasks below you will be creating datasets that are similar to these, feel free to use the provided files to validate your work. #### Dummy AWS Credentials We have made AWS credentials available in the following file, with the appropriate permissions set, [here](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/ds_code_challenge_creds.json). @@ -81,52 +88,55 @@ We have made AWS credentials available in the following file, with the appropria ### 1. Data Extraction (if applying for a Data Engineering Position) Use the [AWS S3 SELECT](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html) command to read in the H3 resolution 8 data from `city-hex-polygons-8-10.geojson`. Use the `city-hex-polygons-8.geojson` file to validate your work. -Please log the time taken to perform the operations described, and within reason, try to optimise latency and computational resources used. +Please log the time taken to perform the operations described, and within reason, try to optimise latency and computational resources used. Please also note the comments above about the nature of the code that we expect. ### 2. Initial Data Transformation (if applying for a Data Engineering and/or Science Position) -Join the file `city-hex-polygons-8.geojson` to the service request dataset, such that each service request is assigned to a single H3 hexagon. Use the `sr_hex.csv` file to validate your work. +Join the equivalent of the contents of the file `city-hex-polygons-8.geojson` to the service request dataset, such that each service request is assigned to a single H3 resolution level 8 hexagon. Use the `sr_hex.csv` file to validate your work. For any requests where the `Latitude` and `Longitude` fields are empty, set the index value to `0`. 
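A minimal sketch of the hex assignment in `python`/`pandas`, with the fallback to `0` for missing coordinates. The H3 call is passed in as a function so that any H3 library version can be used; with the `h3` package's v3 API it could be `lambda lat, lon: h3.geo_to_h3(lat, lon, 8)`. Column names follow `sr_hex.csv`:

```python
import pandas as pd


def assign_h3_level8(df, cell_fn, lat_col="Latitude", lon_col="Longitude"):
    """Assign each service request a resolution 8 H3 index, falling back
    to "0" where the coordinates are missing.

    cell_fn(lat, lon) -> H3 index string for that point.
    """
    def to_hex(row):
        if pd.isna(row[lat_col]) or pd.isna(row[lon_col]):
            return "0"
        return cell_fn(row[lat_col], row[lon_col])

    out = df.copy()
    out["h3_level8_index"] = out.apply(to_hex, axis=1)
    return out
```

The resulting `h3_level8_index` column can then be compared against `sr_hex.csv` for validation, and the share of `"0"` values used for the join-failure logging described below.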
-Include logging that lets the executor know how many of the records failed to join, and include a join error threshold above which the script will error out. Please also log the time taken to perform the operations described, and within reason, try to optimise latency and computational resources used. +Include logging that lets the executor know how many of the records failed to join, and include a join error threshold above which the script will error out. Please motivate your choice of error threshold. Please also log the time taken to perform the operations described, and within reason, try to optimise latency and computational resources used. ### 3. Descriptive Analytic Tasks (if applying for a Data Analyst Position) Please use the `sr_hex_truncated.csv` dataset for the tasks below. Please provide the following: -1. Provide a visual answer to the question "which areas and request types should Electricity concentrate on to reduce the overall volume of their requests". -2. Provide a working prototype dashboard for monitoring progress in reducing Electricity service request volume per area, and per type. +1. An answer to the question "In which suburbs should the Water and Sanitation directorate concentrate their infrastructure improvement efforts?". Please motivate how you related the data provided to infrastructure issues. +2. Provide a visual mock of a dashboard for the purpose of monitoring progress in applying the insights developed in (1). It should focus the user on performance pain points. Add a note for each visual element, explaining how it helps fulfil this overall function. Please also provide a brief explanation of how the data provided would be used to realise what is contained in your mock. +3. Identify value-adding insights for the management of Water and Sanitation, from the dataset provided, with regard to water provision within the City.
-An Executive-level person should be able to read this report and follow your analysis without guidance. +An Executive-level, non-specialist person should be able to read this report and follow your analysis without guidance. ### 4. Predictive Analytic Tasks (if applying for a Data Science Position) -Please use `sr_hex.csv` dataset, only looking at requests from the `Water and Sanitation Services` department. - -Please chose two of the following: -1. *Time series challenge*: Predict the weekly number of expected service requests per hex for the next 4 weeks. -2. *Introspection challenge*: Reshape the data into number of requests, per type, per hex in the last 12 months. Chose a particular request type, or group of requests. Develop a model that predicts the number of requests of your selected type, using the rest of your data. Based upon the model, and any other analysis, determine the drivers of requests of that particular type(s). -3. *Classification challenge*: Classify a hex as formal, informal or rural based on the data derived from the service request data. - +Using the `sr_hex.csv` dataset, please choose __two__ of the following: +1. *Time series challenge*: Predict the number of service requests per hex that will be created each week, for the next 4 weeks. +2. *Introspection challenge*: + 1. Reshape the data into number of requests created, per type, per H3 level 8 hex in the last 12 months. + 2. Choose a type, and then develop a model that predicts the number of requests of that type per hex. + 3. Use the model developed in (2) to predict the number in (1). + 4. Based upon the model, and any other analysis, determine the drivers of requests of that type. +3. *Classification challenge*: Classify a hex as sparsely or densely populated, solely based on the service request data. Provide an explanation of how you're using the data to perform this classification.
Using your classifier, please highlight any unexpected or unusual classifications, and comment on why that might be the case. +4. *Anomaly Detection challenge*: Reshape the data into the number of requests created per department, per day. Please identify any days in the first 6 months of 2020 where an anomalous number of requests were created for a particular department. Please describe how you would motivate to the director of that department why they should investigate that anomaly. Your argument should rely upon the contents of the dataset and/or your anomaly detection model. +Feel free to use any other data you can find in the public domain, except for task (3). **The final output of the execution of your code should be a self-contained `html` file or executed `ipynb` file that is your report.** A statistically minded layperson should be able to read this report and follow your analysis without guidance. -Please log the time taken to perform the operations described, and within reason, try to optimise latency and computation resources used. +Please log the time taken to perform the operations described, and within reason, try to optimise latency and computation resources used. Please also note the comments above with respect to the nature of work that we expect from data scientists. ### 5. Further Data Transformations (if applying for a Data Engineering Position) -Write a script which anonymises the `sr_hex.csv` file, but preserves the following resolutions (You may use H3 indexes or lat/lon coordinates for your spatial data): - * location accuracy to within approximately 500m - * temporal accuracy to within 6 hours - * scrubs any columns which may contain personally identifiable information. +1. Create a subsample of the data by selecting all of the requests in `sr_hex.csv.gz` which are within 1 minute of the centroid of the BELLVILLE SOUTH official suburb.
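Reading "within 1 minute" as one minute of arc (1/60 of a degree, roughly 1.85 km on the ground), the subsampling step might be sketched as a simple bounding-box filter (the centroid coordinates in the test below are placeholders, not the actual BELLVILLE SOUTH centroid):

```python
import pandas as pd

ARC_MINUTE = 1 / 60  # one minute of arc, in degrees


def within_one_arc_minute(df, centroid_lat, centroid_lon,
                          lat_col="Latitude", lon_col="Longitude"):
    """Keep requests whose latitude and longitude both lie within one
    arc minute of the given centroid (a bounding-box reading of the task)."""
    near = ((df[lat_col] - centroid_lat).abs() <= ARC_MINUTE) & \
           ((df[lon_col] - centroid_lon).abs() <= ARC_MINUTE)
    return df[near]
```

A great-circle distance test rather than a bounding box is a defensible alternative reading; either way, document the interpretation chosen.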
You may determine the centroid of the suburb by the method of your choice, but if any external data is used, your code should download it and perform the centroid calculation programmatically. Please clearly document your method. -We expect in the accompanying report that follows you will justify as to why this data is now anonymised. Please limit this commentary to less than 500 words. If your code is written in a code notebook such as Jupyter notebook or Rmarkdown, you can include this commentary in your notebook. +2. Augment your filtered subsample of `sr_hex.csv.gz` from (1) with the appropriate [wind direction and speed data for 2020](https://www.capetown.gov.za/_layouts/OpenDataPortalHandler/DownloadHandler.ashx?DocumentName=Wind_direction_and_speed_2020.ods&DatasetDocument=https%3A%2F%2Fcityapps.capetown.gov.za%2Fsites%2Fopendatacatalog%2FDocuments%2FWind%2FWind_direction_and_speed_2020.ods) from the Bellville South Air Quality Measurement site, from when the notification was created. All of the steps for downloading and preparing the wind data, as well as the join, should be performed programmatically within your script. -### 6. Data Loading Tasks (if applying for a Data Engineering Position) -Select a subset of columns (including the H3 index column) from the `sr_hex.csv` or the anonymised file created in the task above, and write it to the write-only S3 bucket. +3. Write a script which anonymises your augmented subsample from (2), but preserves the following precisions (You may use H3 indices or lat/lon coordinates for your spatial data): + * location accuracy to within approximately 500m + * temporal accuracy to within 6 hours +Please also remove any columns which you believe could lead to the resident who made the request being identified. We expect in the accompanying report that you will justify why this data is now anonymised. Please limit this commentary to less than 500 words.
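As a sketch of the precision-coarsening in (3), assuming the subsample keeps the `h3_level8_index` column: resolution 8 H3 cells have edge lengths of roughly 460 m, so dropping the raw coordinates and keeping only the level 8 index meets the ~500 m requirement, while flooring timestamps to 6-hour bins meets the temporal one. The timestamp column name here is an assumption:

```python
import pandas as pd


def coarsen_precision(df, time_col="creation_timestamp",
                      drop_cols=("Latitude", "Longitude")):
    """Reduce spatial precision to the H3 level 8 index (cell edges are
    roughly 460 m) and temporal precision to 6-hour bins, dropping the raw
    coordinate columns plus any other identifying columns supplied."""
    out = df.drop(columns=list(drop_cols), errors="ignore")
    out[time_col] = pd.to_datetime(out[time_col]).dt.floor("6h")
    return out
```

Columns that could identify the resident (names, addresses, reference numbers, free-text descriptions) would be passed in via `drop_cols` and justified in the accompanying commentary.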
If your code is written in a code notebook such as a Jupyter notebook or Rmarkdown, you can include this commentary in your notebook. -Be sure to name your output file something that is recognisable as your work, and unlikely to collide with the names of others. +Please also note the comments above about the nature of the code that we expect. ## Contact -You can contact riaz.arbi, gordon.inggs and/or colinscott.anthony @ capetown.gov.za for any questions on the above. +You can contact gordon.inggs and/or colinscott.anthony @ capetown.gov.za for any questions on the above. From 3f2d9fadd42dae8fdb69c9f0b3da2a290c1107f5 Mon Sep 17 00:00:00 2001 From: demainou Date: Wed, 22 Mar 2023 17:15:39 +0200 Subject: [PATCH 05/28] Viz Engineer proposals Made some proposals for Viz Engineer --- README.md | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 06a1bfd..4608c40 100644 --- a/README.md +++ b/README.md @@ -44,7 +44,7 @@ You can use any tool to produce the output, e.g. Python, R, Excel, Power BI, Tab -### Candidates where programming is required (Data Scientist and Engineers) +### Candidates where programming is required (Data Scientist; Engineers and Visualization Engineers) 1. Clone this repository and load it into your development environment. 2. Work the challenge, committing regularly to document your progress. Try to have structured, meaningful commits, where each one adds significant functionality in a coherent manner. 3. Host your repository somewhere that is publicly accessible. If you're using GitHub, please use a fork of our original repository.
@@ -136,6 +136,22 @@ Please log the time taken to perform the operations described, and within reason * temporal accuracy to within 6 hours Please also remove any columns which you believe could lead to the resident who made the request being identified. We expect in the accompanying report that you will justify why this data is now anonymised. Please limit this commentary to less than 500 words. If your code is written in a code notebook such as a Jupyter notebook or Rmarkdown, you can include this commentary in your notebook. +### 6. Data Visualization Task (if applying for a Data Visualization Engineering Position) + +Using the `sr_hex_truncated.csv` dataset and open source web development technologies, develop a data visualization / dashboard that helps to answer the question: + +*"In which suburbs should the Water and Sanitation directorate concentrate their infrastructure improvement efforts?".* + +The data visualization / dashboard must include the following: + +1. A chart (plot) or charts (plots) that helps to answer the above question. +2. A spatial map that helps to answer the above question. +3. Make (1) and (2) interactive to allow users to explore the data and uncover insights. +4. Cross filtering: a filter on the map must update the chart(s) accordingly with the filter, and vice versa. +5. Data Storytelling: use the dataset to create a data-driven story, using visualizations to support your narrative. +6. Design Principles: Can you explain why you have chosen certain colors (e.g. for legends), fonts, the layout or anything else that will help us understand your thinking in designing the data visualization / dashboard. +7. Publish your work using online service like https://pages.github.com/ or anything other means you are familiar with. We want to see the end product and interact with it. + Please also note the comments above about the nature of the code that we expect.
## Contact From eb91aa8c33dd00d7f5d04a76d8a491f33df33142 Mon Sep 17 00:00:00 2001 From: demainou Date: Thu, 23 Mar 2023 08:58:17 +0200 Subject: [PATCH 06/28] amendments to data viz --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 4608c40..22e4fd7 100644 --- a/README.md +++ b/README.md @@ -150,7 +150,7 @@ The data visualization / dashboard must include the following: 4. Cross filtering: a filter on the map must update the chart(s) accordingly with the filter, and vice versa. 5. Data Storytelling: use the dataset to create a data-driven story, using visualizations to support your narrative. 6. Design Principles: Can you explain why you have chosen certain colors (e.g. for legends), fonts, the layout or anything else that will help us understand your thinking in designing the data visualization / dashboard. -7. Publish your work using online service like https://pages.github.com/ or anything other means you are familiar with. We want to see the end product and interact with it. +7. Publish your work using online service like https://pages.github.com/ or usinf any other means you are familiar with. We want to see the end product and interact with it. Please also note the comments above about the nature of the code that we expect. From d2bb2f3e6f335978383260cd8e601aedcb8d8214 Mon Sep 17 00:00:00 2001 From: demainou Date: Tue, 28 Mar 2023 08:47:38 +0200 Subject: [PATCH 07/28] data storytelling amendment Co-authored-by: Gordon Inggs --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 22e4fd7..2e02a2c 100644 --- a/README.md +++ b/README.md @@ -148,7 +148,7 @@ The data visualization / dashboard must include the following: 2. A spatial map that helps to answer the above question. 3. Make (1) and (2) interactive to allow users to explore the data and uncover insights 4. 
Cross filtering: a filter on the map must update the chart(s) accordingly with the filter, and vice versa. -5. Data Storytelling: use the dataset to create a data-driven story, using visualizations to support your narrative. +5. Data Storytelling: in a separate markdown document, use the dataset and your visualisations to outline a data-driven story that answers the above question. In your document, describe how your visualisations support your narrative. 6. Design Principles: Can you explain why you have chosen certain colors (e.g. for legends), fonts, the layout or anything else that will help us understand your thinking in designing the data visualization / dashboard. 7. Publish your work using online service like https://pages.github.com/ or usinf any other means you are familiar with. We want to see the end product and interact with it. From f626594edb2907c381d1a1c246be1d49e49f3477 Mon Sep 17 00:00:00 2001 From: demainou Date: Tue, 28 Mar 2023 08:49:16 +0200 Subject: [PATCH 08/28] change crossfilter to plot brushing Co-authored-by: Gordon Inggs --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 2e02a2c..96bf794 100644 --- a/README.md +++ b/README.md @@ -147,7 +147,7 @@ The data visualization / dashboard must include the following: 1. A chart (plot) or charts (plots) that helps to answer the above question. 2. A spatial map that helps to answer the above question. 3. Make (1) and (2) interactive to allow users to explore the data and uncover insights -4. Cross filtering: a filter on the map must update the chart(s) accordingly with the filter, and vice versa. +4. Cross plot brushing: a filter on the map must update the chart(s) accordingly with the filter, and vice versa. 5. Data Storytelling: in a separate markdown document, use the dataset and your visualisations to outline a data-driven story that answers the above question. In your document, describe how your visualisations support your narrative. 6. 
Design Principles: Can you explain why you have chosen certain colors (e.g. for legends), fonts, the layout or anything else that will help us understand your thinking in designing the data visualization / dashboard. 7. Publish your work using online service like https://pages.github.com/ or usinf any other means you are familiar with. We want to see the end product and interact with it. From 60fa68b2b6ddcc27dd361cb26f3ee157076fe3ae Mon Sep 17 00:00:00 2001 From: demainou Date: Tue, 28 Mar 2023 08:51:39 +0200 Subject: [PATCH 09/28] amendment to publishing work Co-authored-by: Gordon Inggs --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 96bf794..012ed67 100644 --- a/README.md +++ b/README.md @@ -150,7 +150,7 @@ The data visualization / dashboard must include the following: 4. Cross plot brushing: a filter on the map must update the chart(s) accordingly with the filter, and vice versa. 5. Data Storytelling: in a separate markdown document, use the dataset and your visualisations to outline a data-driven story that answers the above question. In your document, describe how your visualisations support your narrative. 6. Design Principles: Can you explain why you have chosen certain colors (e.g. for legends), fonts, the layout or anything else that will help us understand your thinking in designing the data visualization / dashboard. -7. Publish your work using online service like https://pages.github.com/ or usinf any other means you are familiar with. We want to see the end product and interact with it. +7. Publish your work using an online service such as https://pages.github.com/ or any other means you are familiar with. Anyone with an Internet connection and a modern browser such as Google Chrome, Mozilla Firefox or Microsoft Edge, should be able to see the end product and interact with it. Please reference the published link in the `README.md` of your repository. 
Please also note the comments above about the nature of the code that we expect. From a604cfc345ed4f457bba09f9c00972499c36e0ba Mon Sep 17 00:00:00 2001 From: demainou Date: Tue, 28 Mar 2023 08:52:52 +0200 Subject: [PATCH 10/28] design principle amendment Co-authored-by: Gordon Inggs --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 012ed67..6b18cb1 100644 --- a/README.md +++ b/README.md @@ -149,7 +149,7 @@ The data visualization / dashboard must include the following: 3. Make (1) and (2) interactive to allow users to explore the data and uncover insights 4. Cross plot brushing: a filter on the map must update the chart(s) accordingly with the filter, and vice versa. 5. Data Storytelling: in a separate markdown document, use the dataset and your visualisations to outline a data-driven story that answers the above question. In your document, describe how your visualisations support your narrative. -6. Design Principles: Can you explain why you have chosen certain colors (e.g. for legends), fonts, the layout or anything else that will help us understand your thinking in designing the data visualization / dashboard. +6. Design Principles: In a separate markdown document, please explain why you have chosen certain colours (e.g. for legends), fonts, the layout or anything else that will help us understand your thinking in designing the data visualisation / dashboard to answer the question. 7. Publish your work using an online service such as https://pages.github.com/ or any other means you are familiar with. Anyone with an Internet connection and a modern browser such as Google Chrome, Mozilla Firefox or Microsoft Edge, should be able to see the end product and interact with it. Please reference the published link in the `README.md` of your repository. Please also note the comments above about the nature of the code that we expect. 
From 1b09c9f8c7b11ccdc2e5339f49c318220da2a7f4 Mon Sep 17 00:00:00 2001 From: demainou Date: Tue, 28 Mar 2023 08:58:59 +0200 Subject: [PATCH 11/28] made suggested changes to vis eng --- README.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 6b18cb1..2b3ace4 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ Over and and above the tasks specified below, there are particular aspects of ea * Data Engineer candidates - as the key enablers of our unit's work, we really want to see work done in a sustainable manner: writing for easy comprehension, testing, clean code, modularity all bring us joy. * Data Analyst candidates - we consider success for our analysts when they provide the insights that inform actual decisions. Hence, we want evidence of both the ability to surface these insights from data, as well as the rhetorical skill in conveying the implications thereof. Your audience is intelligent, but non-specialist. -### Candidates where programming is required (Data Scientist and Engineers) +### Candidates where programming is required (Data Scientist; Engineers & Visualisation Engineer) Requirements and notes: * Our primary programming languages are `python` and `R`. We will accept code that is packaged in `.py`, `.ipynb`, `.R` and `.Rmd` files. * Bash or similar scripting language files are fine for glue. You may develop in any development environment you choose. @@ -44,7 +44,7 @@ Requirements and notes: You can use any tool to produce the output, e.g. Python, R, Excel, Power BI, Tableau, etc. The final deliverable needs to be a pdf report with your analysis. ## How to submit -### Candidates where programming is required (Data Scientist; Engineers and Visualization Engineers) +### Candidates where programming is required (Data Scientist; Engineers and Visualisation Engineers) 1. Clone this repository and load it into your development environment. 2. 
Work the challenge, committing regularly to document your progress. Try have structured, meaningful commits, where each one adds significant functionality in a coherent manner. 3. Host your repository somewhere that is publicly accessible. If you're using GitHub, please use a fork of our original repository. @@ -136,16 +136,16 @@ Please log the time taken to perform the operations described, and within reason * temporal accuracy to within 6 hours Please also remove any columns which you believe could lead to the resident who made the request being identified. We expect in the accompanying report that you will justify as to why this data is now anonymised. Please limit this commentary to less than 500 words. If your code is written in a code notebook such as Jupyter notebook or Rmarkdown, you can include this commentary in your notebook. -### 6. Data Visualization Task (if applying for a Data Visualization Engineering Position) +### 6. Data Visualisation Task (if applying for a Data Visualisation Engineering Position) -Using the `sr_hex_truncated.csv` dataset and open source web development technologies, develop a data visualization / dashboard that help to answer the question: +Using the `sr_hex_truncated.csv` dataset and open source web technologies, develop a data visualisation / dashboard that help to answer the question: *"In which suburbs should the Water and Sanitation directorate concentrate their infrastructure improvement efforts?".* -The data visualization / dashboard must include the following: +The data visualisation / dashboard must include the following: 1. A chart (plot) or charts (plots) that helps to answer the above question. -2. A spatial map that helps to answer the above question. +2. A cartographic map with identifiable landmark features (e.g. major roads, railways, etc.) 3. Make (1) and (2) interactive to allow users to explore the data and uncover insights 4. 
Cross plot brushing: a filter on the map must update the chart(s) accordingly with the filter, and vice versa. 5. Data Storytelling: in a separate markdown document, use the dataset and your visualisations to outline a data-driven story that answers the above question. In your document, describe how your visualisations support your narrative. From 5502d4157e9df1e7f888e1fb6815245a35a7cf9b Mon Sep 17 00:00:00 2001 From: demainou Date: Wed, 29 Mar 2023 10:42:24 +0200 Subject: [PATCH 12/28] added task 2 as an item to complete --- README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 2b3ace4..f4ef76e 100644 --- a/README.md +++ b/README.md @@ -90,7 +90,7 @@ Use the [AWS S3 SELECT](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3 Please log the time taken to perform the operations described, and within reason, try to optimise latency and computational resources used. Please also note the comments above about the nature of the code that we expect. -### 2. Initial Data Transformation (if applying for a Data Engineering and/or Science Position) +### 2. Initial Data Transformation (if applying for a Data Engineering and/or Science Position and Visualisation Engineer) Join the equivalent of the contents of the file `city-hex-polygons-8.geojson` to the service request dataset, such that each service request is assigned to a single H3 resolution level 8 hexagon. Use the `sr_hex.csv` file to validate your work. For any requests where the `Latitude` and `Longitude` fields are empty, set the index value to `0`. @@ -144,7 +144,8 @@ Using the `sr_hex_truncated.csv` dataset and open source web technologies, devel The data visualisation / dashboard must include the following: -1. A chart (plot) or charts (plots) that helps to answer the above question. +1. Complete **TASK 2 - Initial Data Transformation** and use the dataset to complete the tasks below. 
+A chart (plot) or charts (plots) that helps to answer the above question. 2. A cartographic map with identifiable landmark features (e.g. major roads, railways, etc.) 3. Make (1) and (2) interactive to allow users to explore the data and uncover insights 4. Cross plot brushing: a filter on the map must update the chart(s) accordingly with the filter, and vice versa. From f58a571a9478b12b6941b65b8fe29e9ffff616a2 Mon Sep 17 00:00:00 2001 From: demainou Date: Wed, 29 Mar 2023 10:49:24 +0200 Subject: [PATCH 13/28] update to Candidates where programming is required Co-authored-by: Colin Anthony --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index f4ef76e..b33b3f9 100644 --- a/README.md +++ b/README.md @@ -31,7 +31,8 @@ Over and and above the tasks specified below, there are particular aspects of ea ### Candidates where programming is required (Data Scientist; Engineers & Visualisation Engineer) Requirements and notes: -* Our primary programming languages are `python` and `R`. We will accept code that is packaged in `.py`, `.ipynb`, `.R` and `.Rmd` files. +* For Data Science and Data Engineering, our primary programming languages are `python`, `R` and `SQL`. We will accept code that is packaged in `.py`, `.ipynb`, `.R` and `.Rmd` files. Scripts in `.sql` may also be included where applicable. +* Data Visualisation engineers should have knowledge of either `python` or `R`, and relevant front-end programming languages (e.g. Javascript, HTML, CSS). We will accept code that is packaged in `.py`, `.R` and appropriate front-end programming language specific files, e.g. `.js`, `.html` etc. * Bash or similar scripting language files are fine for glue. You may develop in any development environment you choose. * We expect to be able to clone your repo, immediately identify what script to execute from your README file, and execute it to completion with no human interaction. 
In order to ensure that our environment has the right libraries or packages, please follow standard python (PEP8) or R guidelines for structure in your code, i.e place `import` and `library()` commands at the top of your scripts. From 036e53e100eb2c64273141a93638c63f75dad202 Mon Sep 17 00:00:00 2001 From: demainou Date: Wed, 29 Mar 2023 10:50:38 +0200 Subject: [PATCH 14/28] update to data storytelling Co-authored-by: Colin Anthony --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index b33b3f9..d4b95d4 100644 --- a/README.md +++ b/README.md @@ -150,7 +150,7 @@ A chart (plot) or charts (plots) that helps to answer the above question. 2. A cartographic map with identifiable landmark features (e.g. major roads, railways, etc.) 3. Make (1) and (2) interactive to allow users to explore the data and uncover insights 4. Cross plot brushing: a filter on the map must update the chart(s) accordingly with the filter, and vice versa. -5. Data Storytelling: in a separate markdown document, use the dataset and your visualisations to outline a data-driven story that answers the above question. In your document, describe how your visualisations support your narrative. +5. Data Storytelling: in a separate markdown document, titled `data-driven-storytelling.md`, provide a brief, step-by-step, point form description of how your visualisations (and information from the dataset) outline a data-driven story that answers the above question. 6. Design Principles: In a separate markdown document, please explain why you have chosen certain colours (e.g. for legends), fonts, the layout or anything else that will help us understand your thinking in designing the data visualisation / dashboard to answer the question. 7. Publish your work using an online service such as https://pages.github.com/ or any other means you are familiar with. 
Anyone with an Internet connection and a modern browser such as Google Chrome, Mozilla Firefox or Microsoft Edge, should be able to see the end product and interact with it. Please reference the published link in the `README.md` of your repository. From 0ee7debcd6b71fd688167abfd0d1d10a890d8e61 Mon Sep 17 00:00:00 2001 From: demainou Date: Wed, 29 Mar 2023 10:51:19 +0200 Subject: [PATCH 15/28] update to design principles Co-authored-by: Colin Anthony --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index d4b95d4..6cece14 100644 --- a/README.md +++ b/README.md @@ -151,7 +151,7 @@ A chart (plot) or charts (plots) that helps to answer the above question. 3. Make (1) and (2) interactive to allow users to explore the data and uncover insights 4. Cross plot brushing: a filter on the map must update the chart(s) accordingly with the filter, and vice versa. 5. Data Storytelling: in a separate markdown document, titled `data-driven-storytelling.md`, provide a brief, step-by-step, point form description of how your visualisations (and information from the dataset) outline a data-driven story that answers the above question. -6. Design Principles: In a separate markdown document, please explain why you have chosen certain colours (e.g. for legends), fonts, the layout or anything else that will help us understand your thinking in designing the data visualisation / dashboard to answer the question. +6. Design Principles: In a separate markdown document, titled `visualisation-design-choices.md`, please provide a brief, point form explanation for why you have chosen certain colours (e.g. for legends), fonts, the layout or anything else that will help us understand your thinking in designing the data visualisation / dashboard to answer the question. 7. Publish your work using an online service such as https://pages.github.com/ or any other means you are familiar with. 
Anyone with an Internet connection and a modern browser such as Google Chrome, Mozilla Firefox or Microsoft Edge, should be able to see the end product and interact with it. Please reference the published link in the `README.md` of your repository. Please also note the comments above about the nature of the code that we expect. From 3a7fc2d155e105c83216b7623ef0e1d0d16a8835 Mon Sep 17 00:00:00 2001 From: demainou Date: Wed, 29 Mar 2023 10:51:49 +0200 Subject: [PATCH 16/28] update publish work section Co-authored-by: Colin Anthony --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 6cece14..143296b 100644 --- a/README.md +++ b/README.md @@ -152,7 +152,7 @@ A chart (plot) or charts (plots) that helps to answer the above question. 4. Cross plot brushing: a filter on the map must update the chart(s) accordingly with the filter, and vice versa. 5. Data Storytelling: in a separate markdown document, titled `data-driven-storytelling.md`, provide a brief, step-by-step, point form description of how your visualisations (and information from the dataset) outline a data-driven story that answers the above question. 6. Design Principles: In a separate markdown document, titled `visualisation-design-choices.md`, please provide a brief, point form explanation for why you have chosen certain colours (e.g. for legends), fonts, the layout or anything else that will help us understand your thinking in designing the data visualisation / dashboard to answer the question. -7. Publish your work using an online service such as https://pages.github.com/ or any other means you are familiar with. Anyone with an Internet connection and a modern browser such as Google Chrome, Mozilla Firefox or Microsoft Edge, should be able to see the end product and interact with it. Please reference the published link in the `README.md` of your repository. +7. 
Publish your work using an online service such as https://pages.github.com/ or any other means you are familiar with. Anyone with an Internet connection and a modern browser such as Google Chrome, Mozilla Firefox or Microsoft Edge, should be able to see the end product and interact with it. Please reference the published link to your visualisation tool in the `README.md` of your repository. Please also note the comments above about the nature of the code that we expect. From 8496ceea7ba4147e057d4e0052a95d0660f9f760 Mon Sep 17 00:00:00 2001 From: demainou Date: Wed, 29 Mar 2023 10:55:50 +0200 Subject: [PATCH 17/28] corrected task 2 made this the dataset to use for the vis task --- README.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 143296b..cb65cc7 100644 --- a/README.md +++ b/README.md @@ -139,14 +139,13 @@ Please also remove any columns which you believe could lead to the resident who ### 6. Data Visualisation Task (if applying for a Data Visualisation Engineering Position) -Using the `sr_hex_truncated.csv` dataset and open source web technologies, develop a data visualisation / dashboard that help to answer the question: +Complete **TASK 2 - Initial Data Transformation** and use the constructed dataset and open source web technologies, to develop a data visualisation / dashboard that help to answer the question: *"In which suburbs should the Water and Sanitation directorate concentrate their infrastructure improvement efforts?".* The data visualisation / dashboard must include the following: -1. Complete **TASK 2 - Initial Data Transformation** and use the dataset to complete the tasks below. -A chart (plot) or charts (plots) that helps to answer the above question. +1. A chart (plot) or charts (plots) that helps to answer the above question. 2. A cartographic map with identifiable landmark features (e.g. major roads, railways, etc.) 3. 
Make (1) and (2) interactive to allow users to explore the data and uncover insights 4. Cross plot brushing: a filter on the map must update the chart(s) accordingly with the filter, and vice versa. From 43b2aca8ab7bd16767f9c10b976a57fbdb94e5e8 Mon Sep 17 00:00:00 2001 From: demainou Date: Thu, 30 Mar 2023 11:28:54 +0200 Subject: [PATCH 18/28] amended dataset to use for task --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index cb65cc7..254b43a 100644 --- a/README.md +++ b/README.md @@ -139,7 +139,7 @@ Please also remove any columns which you believe could lead to the resident who ### 6. Data Visualisation Task (if applying for a Data Visualisation Engineering Position) -Complete **TASK 2 - Initial Data Transformation** and use the constructed dataset and open source web technologies, to develop a data visualisation / dashboard that help to answer the question: +Using the [`sr_hex.csv.gz`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/sr_hex.csv.gz) dataset and open source web technologies, develop a data visualisation / dashboard that help to answer the question: *"In which suburbs should the Water and Sanitation directorate concentrate their infrastructure improvement efforts?".* From ecef77fe6896e4025eafad3f5210082b1657ce9e Mon Sep 17 00:00:00 2001 From: Colin Anthony Date: Thu, 30 Mar 2023 17:25:01 +0200 Subject: [PATCH 19/28] Apply suggestions from code review make the referenced filename consistent with other references --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 254b43a..c5a2872 100644 --- a/README.md +++ b/README.md @@ -92,7 +92,7 @@ Use the [AWS S3 SELECT](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3 Please log the time taken to perform the operations described, and within reason, try to optimise latency and computational resources used. 
Please also note the comments above about the nature of the code that we expect. ### 2. Initial Data Transformation (if applying for a Data Engineering and/or Science Position and Visualisation Engineer) -Join the equivalent of the contents of the file `city-hex-polygons-8.geojson` to the service request dataset, such that each service request is assigned to a single H3 resolution level 8 hexagon. Use the `sr_hex.csv` file to validate your work. +Join the equivalent of the contents of the file `city-hex-polygons-8.geojson` to the service request dataset, such that each service request is assigned to a single H3 resolution level 8 hexagon. Use the `sr_hex.csv.gz` file to validate your work. For any requests where the `Latitude` and `Longitude` fields are empty, set the index value to `0`. From 9bd855d6698a743e078083127d1582f1c544a538 Mon Sep 17 00:00:00 2001 From: Gordon Inggs Date: Thu, 10 Aug 2023 15:48:23 +0200 Subject: [PATCH 20/28] Reworking Data Analyst and Data Vis for August 2023 recruitment (#4) Co-authored-by: demainou Co-authored-by: Toufeeq Ockards --- README.md | 45 +++++++++++++++++++++++++-------------------- 1 file changed, 25 insertions(+), 20 deletions(-) diff --git a/README.md b/README.md index c5a2872..205cca3 100644 --- a/README.md +++ b/README.md @@ -18,21 +18,21 @@ So, follow common conventions with respect to directory structure and names to m ## What we're looking for ### Expectation of Effort -We expect you to spend up to 48 calendar hours working on this assessment per position. If you are finding that you spending significantly more time than this, then please contact whomever sent you the link to this assessment to let them know. +We expect you to spend up to 48 calendar hours working on this assessment per position. If you are finding that you are spending significantly more time than this, then please contact whomever sent you the link to this assessment to let them know. 
-You should have received over 7 days warning that you would be undertaking this assesment. Please notify [Delyno du Toit](delyno.dutoit@capetown.gov.za) if this was not the case. +You should have received over 7 days warning that you would be undertaking this assessment. Please notify [Delyno du Toit](delyno.dutoit@capetown.gov.za) if this was not the case. ### Things to focus on -Over and and above the tasks specified below, there are particular aspects of each position that we would like you to pay attention to: +Over and above the tasks specified below, there are particular aspects of each position that we would like you to pay attention to: * Data Scientist candidates - we're looking for both good, statistical insight into problems, as well as the ability to communicate complex topics. Please make special effort to highlight what you believe to be the crux of a particular problem, as well as how your work addresses it. * Data Engineer candidates - as the key enablers of our unit's work, we really want to see work done in a sustainable manner: writing for easy comprehension, testing, clean code, modularity all bring us joy. -* Data Analyst candidates - we consider success for our analysts when they provide the insights that inform actual decisions. Hence, we want evidence of both the ability to surface these insights from data, as well as the rhetorical skill in conveying the implications thereof. Your audience is intelligent, but non-specialist. +* Data Analyst candidates - we think our analysts have done a good job when they provide insights that inform actual decisions. Hence, we want evidence of both the ability to surface these insights from data, as well as the skill to convey those insights. Your audience is intelligent, but non-specialist. 
### Candidates where programming is required (Data Scientist; Engineers & Visualisation Engineer) Requirements and notes: * For Data Science and Data Engineering, our primary programming languages are `python`, `R` and `SQL`. We will accept code that is packaged in `.py`, `.ipynb`, `.R` and `.Rmd` files. Scripts in `.sql` may also be included where applicable. -* Data Visualisation engineers should have knowledge of either `python` or `R`, and relevant front-end programming languages (e.g. Javascript, HTML, CSS). We will accept code that is packaged in `.py`, `.R` and appropriate front-end programming language specific files, e.g. `.js`, `.html` etc. +* Data Visualisation engineers should have knowledge of either `python` or `R`, and relevant front-end programming languages (e.g. Javascript, HTML, CSS). We will accept code that is packaged in `.py`, `.R` and appropriate front-end programming language specific files, e.g. `.js`, `.html` etc. Furthermore, we greatly appreciate adherence to the principles and guidelines of [Single Page Applications](https://en.wikipedia.org/wiki/Single-page_application). * Bash or similar scripting language files are fine for glue. You may develop in any development environment you choose. * We expect to be able to clone your repo, immediately identify what script to execute from your README file, and execute it to completion with no human interaction. In order to ensure that our environment has the right libraries or packages, please follow standard python (PEP8) or R guidelines for structure in your code, i.e place `import` and `library()` commands at the top of your scripts. @@ -42,7 +42,7 @@ Requirements and notes: ### Candidates where programming is not required (Data Analysts) *Note* If you prefer, you may submit using the requirements described above. -You can use any tool to produce the output, e.g. Python, R, Excel, Power BI, Tableau, etc. The final deliverable needs to be a pdf report with your analysis. 
+You can use any tool to produce the output, e.g. Python, R, Excel, Power BI, Tableau, etc. The **final deliverable needs to be a pdf report** with your analysis. ## How to submit ### Candidates where programming is required (Data Scientist; Engineers and Visualisation Engineers) @@ -91,7 +91,7 @@ Use the [AWS S3 SELECT](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3 Please log the time taken to perform the operations described, and within reason, try to optimise latency and computational resources used. Please also note the comments above about the nature of the code that we expect. -### 2. Initial Data Transformation (if applying for a Data Engineering and/or Science Position and Visualisation Engineer) +### 2. Initial Data Transformation (if applying for a Data Engineering, Visualisation Engineer and/or Science Position) Join the equivalent of the contents of the file `city-hex-polygons-8.geojson` to the service request dataset, such that each service request is assigned to a single H3 resolution level 8 hexagon. Use the `sr_hex.csv.gz` file to validate your work. For any requests where the `Latitude` and `Longitude` fields are empty, set the index value to `0`. @@ -99,14 +99,20 @@ For any requests where the `Latitude` and `Longitude` fields are empty, set the Include logging that lets the executor know how many of the records failed to join, and include a join error threshold above which the script will error out. Please motivate why you have selected the error threshold that you have. Please also log the time taken to perform the operations described, and within reason, try to optimise latency and computational resources used. ### 3. Descriptive Analytic Tasks (if applying for a Data Analyst Position) +*Note:* We are most interested in how you reason about the problem. + Please use the `sr_hex_truncated.csv` dataset to address the following. Please provide the following: -1. 
An answer to the question "In which suburbs should the Water and Sanitation directorate concentrate their infrastructure improvement efforts?". Please motivate how you related the data provided to infrastructure issues. -2. Provide a visual mock of a dashboard for the purpose of monitoring progress in applying the insights developed in (1). It should focus the user on performance pain points. Add a note for each visual element, explaining how it helps fulfill this overall function. Please also provide a brief explanation as to how the data provided would be used to realise what is contained in your mock. -3. Identify value-adding insights for the management of Water and Sanitation, from the dataset provided, with regards to water provision within the City. +1. An answer to the question "In which 3 suburbs should the Urban Mobility directorate concentrate their infrastructure improvement efforts?". Please motivate how you related the data provided to infrastructure issues. +2. An answer to the questions: + 1. Focusing on the Urban Mobility directorate - "What is the median & 80th percentile time to complete each service request across the City?" (each row represents a service request). + 2. Focusing on the Urban Mobility directorate - "What is the median & 80th percentile time to complete each service request for the 3 suburbs identified in (1)?" (each row represents a service request). + 3. "Are there any significant differences in the median and 80th percentile completion times between the City as a whole and the 3 suburbs identified in (1)?". Please elaborate on the similarities or differences. +3. Provide a visual mock of a dashboard for the purpose of monitoring progress in applying the insights developed in (1) & (2). It should focus the user on performance pain points. Add a note for each visual element, explaining how it helps fulfill this overall function. 
Please also provide a brief explanation as to how the data provided would be used to realise what is contained in your mock. +4. Identify value-adding insights for the management of Urban Mobility, from the dataset provided, in regard to commuter transport within the City. -An Executive-level, non-specialist person should be able to read this report and follow your analysis without guidance. +The **final deliverable** is a report (in PDF form) for the Executive Management team of the City. An Executive-level, non-specialist person should be able to read the report and follow your analysis without guidance. ### 4. Predictive Analytic Tasks (if applying for a Data Science Position) Using the `sr_hex.csv` dataset, please choose __two__ of the following: @@ -139,21 +145,20 @@ Please also remove any columns which you believe could lead to the resident who ### 6. Data Visualisation Task (if applying for a Data Visualisation Engineering Position) -Using the [`sr_hex.csv.gz`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/sr_hex.csv.gz) dataset and open source web technologies, develop a data visualisation / dashboard that help to answer the question: +Using the [`sr_hex.csv.gz`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/sr_hex.csv.gz) dataset and open source front-end web technologies (html, css, javascript, etc), develop a data visualisation / dashboard that helps to answer the question: *"In which suburbs should the Water and Sanitation directorate concentrate their infrastructure improvement efforts?".* The data visualisation / dashboard must include the following: -1. A chart (plot) or charts (plots) that helps to answer the above question. -2. A cartographic map with identifiable landmark features (e.g. major roads, railways, etc.) -3. Make (1) and (2) interactive to allow users to explore the data and uncover insights -4. 
Cross plot brushing: a filter on the map must update the chart(s) accordingly with the filter, and vice versa. -5. Data Storytelling: in a separate markdown document, titled `data-driven-storytelling.md`, provide a brief, step-by-step, point form description of how your visualisations (and information from the dataset) outline a data-driven story that answers the above question. -6. Design Principles: In a separate markdown document, titled `visualisation-design-choices.md`, please provide a brief, point form explanation for why you have chosen certain colours (e.g. for legends), fonts, the layout or anything else that will help us understand your thinking in designing the data visualisation / dashboard to answer the question. -7. Publish your work using an online service such as https://pages.github.com/ or any other means you are familiar with. Anyone with an Internet connection and a modern browser such as Google Chrome, Mozilla Firefox or Microsoft Edge, should be able to see the end product and interact with it. Please reference the published link to your visualisation tool in the `README.md` of your repository. +1. A chart (plot) or charts (plots) that helps to answer the above question. +2. A minimalist cartographic map with identifiable landmark features (e.g. major roads, railways, etc.) and some representation of the data. +3. Make (1) and (2) interactive in some manner, so as to allow users to explore the data and uncover insights. The following example [Map with "range" sliders](https://observablehq.com/d/a040753103477386) demonstrates an interactive map. However, you're not limited to this example – feel free to explore other interactive approaches. +4. Data Storytelling: in a separate markdown document, titled `data-driven-storytelling.md`, provide a brief, step-by-step, point form description of how your visualisations (and information from the dataset) outline a data-driven story that answers the above question. +5. 
Design Principles: In a separate markdown document, titled `visualisation-design-choices.md`, please provide a brief, point form explanation for why you have chosen certain colours (e.g. for legends), fonts, the layout or anything else that will help us understand your thinking in designing the data visualisation / dashboard to answer the question. +6. Publish your work using an online service such as https://pages.github.com/ or any other means you are familiar with. Anyone with an Internet connection and a modern browser such as Google Chrome, Mozilla Firefox or Microsoft Edge, should be able to see the end product and interact with it. Please reference the published link to your visualisation tool in the `README.md` of your repository. Please also note the comments above about the nature of the code that we expect. ## Contact -You can contact gordon.inggs and/or colinscott.anthony @ capetown.gov.za for any questions on the above. +You can contact gordon.inggs, muhammed.ockards and/or colinscott.anthony @ capetown.gov.za for any questions on the above. From 155161340100d0ab60f1897223442edaee8c7b06 Mon Sep 17 00:00:00 2001 From: Gordon Inggs Date: Wed, 18 Oct 2023 10:08:11 +0200 Subject: [PATCH 21/28] Changing Data Analyst task to focus on UWM --- README.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 205cca3..0c4d47e 100644 --- a/README.md +++ b/README.md @@ -104,15 +104,15 @@ Include logging that lets the executor know how many of the records failed to jo Please use the `sr_hex_truncated.csv` dataset to address the following. Please provide the following: -1. An answer to the question "In which 3 suburbs should the Urban Mobility directorate concentrate their infrastructure improvement efforts?". Please motivate how you related the data provided to infrastructure issues. +1. 
An answer to the question "In which 3 suburbs should the Urban Waste Management directorate concentrate their infrastructure improvement efforts?". Please motivate how you related the data provided to infrastructure issues.
 2. An answer to the questions:
-	1. Focusing on the Urban Mobility directorate - "What is the median & 80th percentile time to complete each service request across the City?" (each row represent a service request).
-	2. Focusing on the Urban Mobility directorate - "What is the median & 80th percentile time to complete each service request for the 3 suburbs identified in (1)?" (each row represent a service request).
+	1. Focusing on the Urban Waste Management directorate - "What is the median & 80th percentile time to complete each service request across the City?" (each row represents a service request).
+	2. Focusing on the Urban Waste Management directorate - "What is the median & 80th percentile time to complete each service request for the 3 suburbs identified in (1)?" (each row represents a service request).
 	3. "Is there any significant differences in the median and 80th percentile completion times between the City as a whole and the 3 suburbs identified in(1)?". Please elaborate on the similarities or differences.
 3. Provide a visual mock of a dashboard for the purpose of monitoring progress in applying the insights developed in (1) & (2). It should focus the user on performance pain points. Add a note for each visual element, explaining how it helps fulfill this overall function.
 Please also provide a brief explanation as to how the data provided would be used to realise what is contained in your mock.
-4. Identify value-adding insights for the management of Urban Mobility, from the dataset provided, in regard to commuter transport within the City.
+4. Identify value-adding insights for the management of Urban Waste Management, from the dataset provided, in regard to waste collection within the City.
-The **final deliverable** is a report (in PDF form) for the Executive Management team of the City. An Executive-level, non-specialist person should be able to read the report and follow your analysis without guidance.
+The **final deliverable** is a report (in PDF form) for the Executive Management team of the City. An Executive-level, non-specialist should be able to read the report and follow your analysis without guidance.
 ### 4. Predictive Analytic Tasks (if applying for a Data Science Position)
 Using the `sr_hex.csv` dataset, please chose __two__ of the following:
From b8a394d7feeacfcd0faf70a4a0163a985edb24d3 Mon Sep 17 00:00:00 2001
From: mockards
Date: Thu, 7 Dec 2023 13:15:57 +0200
Subject: [PATCH 22/28] include Front End Developer to README.md.
---
 README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/README.md b/README.md
index 0c4d47e..ca85bf2 100644
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@
 ## Purpose
-The purpose of this challenge is to evaluate the skills of prospective Data Scientists, Engineers and Analysts for positions in the City of Cape Town's Data Science unit.
+The purpose of this challenge is to evaluate the skills of prospective Data Scientists, Engineers, Analysts and Front End Developers for positions in the City of Cape Town's Data Science unit.
 ## Intended audience
@@ -45,7 +45,7 @@ Requirements and notes:
 You can use any tool to produce the output, e.g. Python, R, Excel, Power BI, Tableau, etc. The **final deliverable needs to be a pdf report** with your analysis.
 ## How to submit
-### Candidates where programming is required (Data Scientist; Engineers and Visualisation Engineers)
+### Candidates where programming is required (Data Scientist; Engineers, Visualisation Engineers and Front End Developers)
 1. Clone this repository and load it into your development environment.
 2. Work the challenge, committing regularly to document your progress.
Try have structured, meaningful commits, where each one adds significant functionality in a coherent manner. 3. Host your repository somewhere that is publicly accessible. If you're using GitHub, please use a fork of our original repository. @@ -91,7 +91,7 @@ Use the [AWS S3 SELECT](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3 Please log the time taken to perform the operations described, and within reason, try to optimise latency and computational resources used. Please also note the comments above about the nature of the code that we expect. -### 2. Initial Data Transformation (if applying for a Data Engineering, Visualisation Engineer and/or Science Position) +### 2. Initial Data Transformation (if applying for a Data Engineering, Visualisation Engineer, Front End Developer and/or Science Position) Join the equivalent of the contents of the file `city-hex-polygons-8.geojson` to the service request dataset, such that each service request is assigned to a single H3 resolution level 8 hexagon. Use the `sr_hex.csv.gz` file to validate your work. For any requests where the `Latitude` and `Longitude` fields are empty, set the index value to `0`. @@ -143,7 +143,7 @@ Please log the time taken to perform the operations described, and within reason * temporal accuracy to within 6 hours Please also remove any columns which you believe could lead to the resident who made the request being identified. We expect in the accompanying report that you will justify as to why this data is now anonymised. Please limit this commentary to less than 500 words. If your code is written in a code notebook such as Jupyter notebook or Rmarkdown, you can include this commentary in your notebook. -### 6. Data Visualisation Task (if applying for a Data Visualisation Engineering Position) +### 6. 
Data Visualisation Task (if applying for a Data Visualisation Engineering or Front End Developer Position) Using the [`sr_hex.csv.gz`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/sr_hex.csv.gz) dataset and open source front-end web technologies (html, css, javascript, etc), develop a data visualisation / dashboard that help to answer the question: From 6fb301d9fa045866fa45518e9648d3fd8e6a5d7f Mon Sep 17 00:00:00 2001 From: mockards Date: Thu, 7 Dec 2023 13:41:32 +0200 Subject: [PATCH 23/28] add Front End Developer to requirements --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index ca85bf2..efd8a5d 100644 --- a/README.md +++ b/README.md @@ -29,10 +29,10 @@ Over and above the tasks specified below, there are particular aspects of each p * Data Engineer candidates - as the key enablers of our unit's work, we really want to see work done in a sustainable manner: writing for easy comprehension, testing, clean code, modularity all bring us joy. * Data Analyst candidates - we think our analysts have done a good job when they provide insights that inform actual decisions. Hence, we want evidence of both the ability to surface these insights from data, as well as the skill to convey those insights. Your audience is intelligent, but non-specialist. -### Candidates where programming is required (Data Scientist; Engineers & Visualisation Engineer) +### Candidates where programming is required (Data Scientist; Engineers, Visualisation Engineer and Front End Developers) Requirements and notes: * For Data Science and Data Engineering, our primary programming languages are `python`, `R` and `SQL`. We will accept code that is packaged in `.py`, `.ipynb`, `.R` and `.Rmd` files. Scripts in `.sql` may also be included where applicable. -* Data Visualisation engineers should have knowledge of either `python` or `R`, and relevant front-end programming languages (e.g. Javascript, HTML, CSS). 
We will accept code that is packaged in `.py`, `.R` and appropriate front-end programming language specific files, e.g. `.js`, `.html` etc. Furthermore, we greatly appreciate adherence to the principles and guidelines of [Single Page Applications](https://en.wikipedia.org/wiki/Single-page_application). +* Data Visualisation engineers and Front End Developers should have knowledge of either `python` or `R`, and relevant front-end programming languages (e.g. Javascript, HTML, CSS). We will accept code that is packaged in `.py`, `.R` and appropriate front-end programming language specific files, e.g. `.js`, `.html` etc. Furthermore, we greatly appreciate adherence to the principles and guidelines of [Single Page Applications](https://en.wikipedia.org/wiki/Single-page_application). * Bash or similar scripting language files are fine for glue. You may develop in any development environment you choose. * We expect to be able to clone your repo, immediately identify what script to execute from your README file, and execute it to completion with no human interaction. In order to ensure that our environment has the right libraries or packages, please follow standard python (PEP8) or R guidelines for structure in your code, i.e place `import` and `library()` commands at the top of your scripts. From 9c0721327ac3e39df1db7d23aeb773552bc71c69 Mon Sep 17 00:00:00 2001 From: Gordon Inggs Date: Thu, 12 Jun 2025 11:22:59 +0200 Subject: [PATCH 24/28] Adding caveat about follow-on questions --- README.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/README.md b/README.md index efd8a5d..24ba697 100644 --- a/README.md +++ b/README.md @@ -44,6 +44,14 @@ Requirements and notes: You can use any tool to produce the output, e.g. Python, R, Excel, Power BI, Tableau, etc. The **final deliverable needs to be a pdf report** with your analysis. 
+### Follow-on Questions
+If we invite you to an interview after you have completed and submitted this technical assessment, we will ask follow-up
+questions about the work submitted. These questions might be at a very detailed level, or broadly conceptual, relating to
+the choices made in completing this assessment.
+
+We do not expect perfect recall of what you may have submitted, but we do expect a deep knowledge of the content, and
+how it works.
+
 ## How to submit
 ### Candidates where programming is required (Data Scientist; Engineers, Visualisation Engineers and Front End Developers)
 1. Clone this repository and load it into your development environment.
From 55abfa5a9f85e7d777cf1b503a98a0b532c0c6d0 Mon Sep 17 00:00:00 2001
From: Gordon Inggs
Date: Thu, 12 Jun 2025 11:23:19 +0200
Subject: [PATCH 25/28] Adding computer vision question for DS position
---
 README.md | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/README.md b/README.md
index 24ba697..26fd0d4 100644
--- a/README.md
+++ b/README.md
@@ -86,6 +86,7 @@ We have made the following datasets available (each filename is a link). These a
 * [`sr_hex_truncated.csv`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/sr_hex_truncated.csv) is a truncated version of `sr_hex.csv`, containing only 3 months of data.
 * [`city-hex-polygons-8.geojson`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/city-hex-polygons-8.geojson) contains the [H3 spatial indexing system](https://h3geo.org/) polygons and index values for the bounds of the City of Cape Town, at resolution level 8.
 * [`city-hex-polygons-8-10.geojson`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/city-hex-polygons-8-10.geojson) contains the [H3 spatial indexing system](https://h3geo.org/) polygons and index values for resolution levels 8, 9 and 10, for the City of Cape Town.
+* `swimming-pool-labels` (`s3://cct-ds-code-challenge-input-data/images/swimming-pool`) contains a random sample of aerial images from Cape Town, organised into two prefixes, `yes` or `no`, corresponding to whether there is a swimming pool in the image. Within each label prefix, there is a manifest file listing all the images available, i.e. [yes](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/images/swimming-pool/yes/manifest) and [no](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/images/swimming-pool/no/manifest).
 In some of the tasks below you will be creating datasets that are similar to these, feel free to use the provided files to validate your work.
@@ -123,17 +124,20 @@ The **final deliverable** is a report (in PDF form) for the Executive Management
 ### 4. Predictive Analytic Tasks (if applying for a Data Science Position)
-Using the `sr_hex.csv` dataset, please chose __two__ of the following:
-1. *Time series challenge*: Predict the weekly number of expected service requests per hex that will be created each week, for the next 4 weeks.
-2. *Introspection challenge*:
+
+Please choose __two__ of the following:
+1. *Time series challenge*: Predict the weekly number of expected service requests per hex that will be created each week using `sr_hex.csv`, for 4 weeks past the end of the dataset.
+2. *Introspection challenge*: (using `sr_hex.csv`)
 1. Reshape the data into number of requests created, per type, per H3 level 8 hex in the last 12 months.
 2. Choose a type, and then develop a model that predicts the number of requests of that type per hex.
 3. Use the model developed in (2) to predict the number in (1).
 4. Based upon the model, and any other analysis, determine the drivers of requests of that particular type(s).
-3.
*Classification challenge*: Classify a hex as sparsely or densely populated, solely based on the service request data. Provide an explanation as to how you're using the data to perform this classification. Using your classifier, please highlight any unexpected or unusual classifications, and comment on why that might be the case.
-4. *Anomaly Detection challenge*: Reshape the data into the number of requests created per department, per day. Please identify any days in the first 6 months of 2020 where an anomalous number of requests were created for a particular department. Please describe how you would motivate to the director of that department why they should investigate that anomaly. Your argument should rely upon the contents of the dataset and/or your anomaly detection model.
+3. *Classification challenge*: Classify a hex in `sr_hex.csv` as sparsely or densely populated, solely based on the service request data. Provide an explanation as to how you're using the data to perform this classification. Using your classifier, please highlight any unexpected or unusual classifications, and comment on why that might be the case.
+4. *Anomaly Detection challenge*: Reshape the `sr_hex.csv` data into the number of requests created per department, per day. Please identify any days in the first 6 months of 2020 where an anomalous number of requests were created for a particular department. Please describe how you would motivate to the director of that department why they should investigate that anomaly. Your argument should rely upon the contents of the dataset and/or your anomaly detection model.
-Feel free to use any other data you can find in the public domain, except for task (3).
+5. *Computer Vision classification challenge*: Use a sample of images from the `swimming-pool` dataset to develop a model that classifies whether an image contains a swimming pool or not. Use the provided labels to validate your model.
+ +Feel free to use any other data you can find in the public domain, except for tasks (3) and (5). **The final output of the execution of your code should be a self-contained `html` file or executed `ipynb` file that is your report.** From be4c1a0901f1d56ee683d9bdeaccb449975ceb3f Mon Sep 17 00:00:00 2001 From: Gordon Inggs Date: Thu, 12 Jun 2025 11:23:33 +0200 Subject: [PATCH 26/28] Asking for validation in data transformation question --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 26fd0d4..c2c9a7c 100644 --- a/README.md +++ b/README.md @@ -103,7 +103,7 @@ Please log the time taken to perform the operations described, and within reason ### 2. Initial Data Transformation (if applying for a Data Engineering, Visualisation Engineer, Front End Developer and/or Science Position) Join the equivalent of the contents of the file `city-hex-polygons-8.geojson` to the service request dataset, such that each service request is assigned to a single H3 resolution level 8 hexagon. Use the `sr_hex.csv.gz` file to validate your work. -For any requests where the `Latitude` and `Longitude` fields are empty, set the index value to `0`. +For any requests where the `Latitude` and `Longitude` fields are empty, set the index value to `0`. Use your judgement to include any other appropriate validation. Include logging that lets the executor know how many of the records failed to join, and include a join error threshold above which the script will error out. Please motivate why you have selected the error threshold that you have. Please also log the time taken to perform the operations described, and within reason, try to optimise latency and computational resources used. 
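The join-error threshold and failure logging asked for in the data transformation task above can be sketched minimally. This is a stdlib-only Python illustration; the function name, the 25% default threshold, and the example counts are all assumptions for illustration, not part of the challenge materials:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sr_hex_join")

def check_join_failures(total: int, failed: int, threshold: float = 0.25) -> float:
    """Log how many records failed to join; error out above `threshold`.

    `failed` is the number of service requests that could not be matched to an
    H3 level-8 hexagon; `total` is the number of requests processed. The 25%
    default is an arbitrary illustrative value -- the challenge asks you to
    motivate your own choice of threshold.
    """
    rate = failed / total if total else 0.0
    log.info("%d of %d records failed to join (%.1f%%)", failed, total, 100 * rate)
    if rate > threshold:
        raise RuntimeError(
            f"Join failure rate {rate:.1%} exceeds threshold {threshold:.1%}"
        )
    return rate

rate = check_join_failures(total=1_000, failed=37)  # illustrative counts only
```

Keeping the check in one function makes the threshold a single, documented parameter that the accompanying report can reference when motivating the chosen value.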
From bacae87abdc92b887357766ad75073a9081ef6a8 Mon Sep 17 00:00:00 2001
From: Gordon Inggs
Date: Thu, 12 Jun 2025 11:24:03 +0200
Subject: [PATCH 27/28] Refining data scientist output
---
 README.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
index c2c9a7c..00ff170 100644
--- a/README.md
+++ b/README.md
@@ -141,9 +141,12 @@ Feel free to use any other data you can find in the public domain, except for ta
 **The final output of the execution of your code should be a self-contained `html` file or executed `ipynb` file that is your report.**
-A statistically minded layperson should be able to read this report and follow your analysis without guidance.
+A statistically minded layperson should be able to read this report and follow your analysis without guidance. In the
+report there should be evidence of any model training done (e.g. loss graph, output logs from hyperparameter tuning),
+along with quantitative measures or predictions of the quality of any models developed. We also expect to see some
+process commentary, describing the quality of any initial results, refinements made, and the resulting improvement.
-Please log the time taken to perform the operations described, and within reason, try to optimise latency and computation resources used. Please also note the comments above with respect to the nature of work that we expect from data scientists.
+Please also log the time taken to perform the operations described, and within reason, try to optimise latency and computational resources used. Please also note the comments above with respect to the nature of work that we expect from data scientists.
 ### 5. Further Data Transformations (if applying for a Data Engineering Position)
 1. Create a subsample of the data by selecting all of the requests in `sr_hex.csv.gz` which are within 1 minute of the centroid of the BELLVILLE SOUTH official suburb.
You may determine the centroid of the suburb by the method of your choice, but if any external data is used, your code should programmatically download and perform the centroid calculation. Please clearly document your method.
From ddb027cf0604945db1f4684a2af022339c7edea0 Mon Sep 17 00:00:00 2001
From: demainou
Date: Fri, 20 Jun 2025 16:44:53 +0200
Subject: [PATCH 28/28] additions
---
 README.md | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/README.md b/README.md
index 00ff170..8e647ff 100644
--- a/README.md
+++ b/README.md
@@ -125,16 +125,19 @@ The **final deliverable** is a report (in PDF form) for the Executive Management
 ### 4. Predictive Analytic Tasks (if applying for a Data Science Position)
-Please choose __two__ of the following:
+Please choose __one__ of the following four tasks to solve. (For the task you choose, we expect you to provide (1) an initial solution and (2) an improved version of your initial solution. For both, we expect the code together with evidence or images of training or inference results, e.g. metrics, loss graphs, output logs from hyperparameter tuning, confusion matrices, etc.)
+
 1. *Time series challenge*: Predict the weekly number of expected service requests per hex that will be created each week using `sr_hex.csv`, for 4 weeks past the end of the dataset.
-2. *Introspection challenge*: (using `sr_hex.csv`)
- 1. Reshape the data into number of requests created, per type, per H3 level 8 hex in the last 12 months.
- 2. Choose a type, and then develop a model that predicts the number of requests of that type per hex.
- 3. Use the model developed in (2) to predict the number in (1).
- 4. Based upon the model, and any other analysis, determine the drivers of requests of that particular type(s).
+2. *Introspection challenge*: (using `sr_hex.csv`)
+ 2.1. Reshape the data into number of requests created, per type, per H3 level 8 hex in the last 12 months.
+ 2.2.
Choose a type, and then develop a model that predicts the number of requests of that type per hex.
+ 2.3. Use the model developed in (2.2) to predict the number in (2.1).
+ 2.4. Based upon the model, and any other analysis, determine the drivers of requests of that particular type.
 3. *Classification challenge*: Classify a hex in `sr_hex.csv` as sparsely or densely populated, solely based on the service request data. Provide an explanation as to how you're using the data to perform this classification. Using your classifier, please highlight any unexpected or unusual classifications, and comment on why that might be the case.
 4. *Anomaly Detection challenge*: Reshape the `sr_hex.csv` data into the number of requests created per department, per day. Please identify any days in the first 6 months of 2020 where an anomalous number of requests were created for a particular department. Please describe how you would motivate to the director of that department why they should investigate that anomaly. Your argument should rely upon the contents of the dataset and/or your anomaly detection model.
+
+This task must be solved. (We expect you to provide (1) an initial solution and (2) an improved version of your initial solution. For both, we expect the code together with evidence or images of training or inference results, e.g. metrics, loss graphs, output logs from hyperparameter tuning, confusion matrices, etc.)
+
 5. *Computer Vision classification challenge*: Use a sample of images from the `swimming-pool` dataset to develop a model that classifies whether an image contains a swimming pool or not. Use the provided labels to validate your model.

 Feel free to use any other data you can find in the public domain, except for tasks (3) and (5).
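The reshaping step in the anomaly detection task (4) can be illustrated with a minimal, stdlib-only sketch: count requests per (department, day), then flag days far from that department's median using the median absolute deviation (MAD). All names here are assumptions, `rows` stands in for records parsed from `sr_hex.csv`, and MAD is just one plausible detector, not the challenge's prescribed method:

```python
from collections import Counter
from statistics import median

def daily_counts(rows):
    """rows: iterable of (department, date) pairs -> Counter keyed by (dept, day)."""
    return Counter((dept, day) for dept, day in rows)

def anomalous_days(counts, dept, k=5.0):
    """Flag days where |count - median| > k * MAD for the given department."""
    series = {day: n for (d, day), n in counts.items() if d == dept}
    values = list(series.values())
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1.0  # guard against zero MAD
    return sorted(day for day, n in series.items() if abs(n - med) > k * mad)

# Synthetic illustration: 10 requests/day for 28 days, then one burst day.
rows = [("Water", f"2020-01-{d:02d}") for d in range(1, 29) for _ in range(10)]
rows += [("Water", "2020-01-30")] * 80
flagged = anomalous_days(daily_counts(rows), "Water")  # flagged == ["2020-01-30"]
```

The per-department grouping mirrors the "requests per department, per day" reshaping the task asks for; a real submission would motivate the detector and the `k` cut-off to the department's director.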