Creates a training set and uses supervised learning (text classification) to build a model that predicts whether a webpage is relevant or not relevant, based on features extracted from the page's URL and title.
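As a rough illustration of the feature-extraction step, the sketch below turns a URL and a page title into bag-of-words token counts. The function name `extract_features`, the `url:`/`title:` prefixes, and the tokenization rule are assumptions made for this example; the project's actual features may differ.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumption: tokens are lowercase runs of letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def extract_features(url: str, title: str) -> Counter:
    """Return token counts for a page's URL and title (hypothetical helper)."""
    parsed = urlparse(url.lower())
    # Use only the host and path, so the scheme ("https") is not a feature.
    url_text = f"{parsed.netloc} {parsed.path}"

    features = Counter()
    for token in TOKEN_RE.findall(url_text):
        features[f"url:{token}"] += 1
    for token in TOKEN_RE.findall(title.lower()):
        features[f"title:{token}"] += 1
    return features


if __name__ == "__main__":
    example = extract_features(
        "https://example.com/Machine-Learning/intro",
        "Intro to Machine Learning",
    )
    for name, count in sorted(example.items()):
        print(name, count)
```

Prefixing tokens by their source keeps a word in the URL distinct from the same word in the title, which lets a classifier weight the two differently. Counts like these could then be passed to any standard text classifier.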
The current training set contains 1271 samples: 700 relevant and 571 not relevant. This training set currently yields an F1 score of 0.84 for the "relevant" class.
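For reference, the F1 score of a class is the harmonic mean of its precision and recall, which simplifies to 2·TP / (2·TP + FP + FN). The sketch below computes it from confusion counts and also shows the class balance of the training set. The helper names are hypothetical, and the confusion counts in the usage example are made-up illustrative values, not results from this project.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for one class from true positives, false positives, false negatives."""
    denominator = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when the class never occurs
    # and is never predicted.
    return 2 * tp / denominator if denominator else 0.0


def class_share(count: int, total: int) -> float:
    """Fraction of the training set belonging to one class."""
    return count / total


if __name__ == "__main__":
    # Class balance, using the sample counts stated above.
    total = 700 + 571  # 1271 samples
    print(f"relevant:     {class_share(700, total):.3f}")
    print(f"not relevant: {class_share(571, total):.3f}")

    # Illustrative counts only (not this project's confusion matrix).
    print(f"F1 = {f1_score(tp=8, fp=2, fn=2):.2f}")
```

With roughly 55% of samples in the "relevant" class, the set is only mildly imbalanced, so the per-class F1 is a reasonable summary here.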