An easy-to-install Python web scraper for RACER.com news articles
This project utilizes Scrapy (v. 2.11.0), a fast and powerful open-source Python package designed for extracting information from websites. This web scraper will take RACER.com news articles, remove any external links or ads, and store the title and article content in .txt files.
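As a rough illustration of the extraction step, the sketch below pulls a title and body text out of an HTML string while skipping the text of links that point to other sites. It is not the project's Scrapy code: it swaps in Python's standard-library `html.parser` so it runs without extra packages. The names `ArticleExtractor` and `extract_article` are hypothetical, and it assumes that the title sits in an `<h1>` tag, that the body sits in `<p>` tags, and that "external" means any host other than racer.com.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ArticleExtractor(HTMLParser):
    """Collect <h1> text as the title and <p> text as the body (assumed layout)."""

    SITE_HOST = "racer.com"  # assumption: links to any other host are external

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.paragraphs = []
        self._in_h1 = False
        self._in_p = False
        self._skip_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append([])
        elif tag == "a":
            host = urlparse(dict(attrs).get("href") or "").netloc.lower()
            internal = host == self.SITE_HOST or host.endswith("." + self.SITE_HOST)
            # Relative links have no host and are treated as internal.
            self._skip_link = bool(host) and not internal

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False
        elif tag == "p":
            self._in_p = False
        elif tag == "a":
            self._skip_link = False

    def handle_data(self, data):
        if self._skip_link:
            return  # drop the text of external links
        if self._in_h1:
            self.title_parts.append(data)
        elif self._in_p:
            self.paragraphs[-1].append(data)


def extract_article(html):
    """Return (title, list_of_paragraphs) with whitespace normalised."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    body = [" ".join("".join(p).split()) for p in parser.paragraphs]
    return title, [p for p in body if p]
```

Ad removal is not covered here, since it depends on markup that this README does not describe.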
Inside main.py there are a few simple functions - one to loop through the URLs in the input file, one to scrape a specific URL, and lastly one to write the scraped data to external text files.
Note that it is not necessary to delete the article.txt files before running the code. They will automatically be overwritten upon execution.
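The three functions described above, and the overwrite behaviour, can be sketched roughly as follows. This is a minimal outline rather than the contents of main.py: the function names (`read_urls`, `write_article`, `process_all`) and the `article<N>.txt` file-name pattern are assumptions made for illustration, and the scraping step is passed in as a callable so the outline stays independent of any particular scraping library.

```python
from pathlib import Path


def read_urls(input_path):
    """Return the non-empty lines of the input file, one URL per line."""
    lines = Path(input_path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def write_article(index, title, body, out_dir="."):
    """Write one article to article<index>.txt, replacing any earlier copy."""
    path = Path(out_dir) / f"article{index}.txt"  # hypothetical naming scheme
    # Mode "w" truncates an existing file, so old output need not be deleted.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(f"{title}\n\n{body}\n")
    return path


def process_all(input_path, scrape, out_dir="."):
    """Loop over the URLs, scrape each one, and write the result to disk.

    `scrape` is any callable that takes a URL and returns (title, body).
    """
    written = []
    for index, url in enumerate(read_urls(input_path), start=1):
        title, body = scrape(url)
        written.append(write_article(index, title, body, out_dir))
    return written
```

Because each file is opened in write mode, running `process_all` a second time simply replaces the previous output.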
- OS: Windows 10 or higher, macOS 10.13+ 64-bit, Linux
- Resources: Minimum 400 MB to download/install Anaconda. See the Anaconda installation documentation for details.
- First, install Conda. It is needed to use the .yml file, which installs the packages required for scraping.
- Next, clone this repo: `git clone https://github.com/StayCool21/CS325_WebScraper.git`. You will probably need to unzip the files after cloning.
- Open up Anaconda by typing `anaconda prompt` in the Windows Start menu.
- In Anaconda, navigate to the directory that contains the unzipped repo files and check that all repo files are visible.
- Initialize the Python environment by typing `conda env create -f requirements.yml`. This will automatically install any necessary dependencies for running the script.
- Remember that the input file must contain URLs from RACER.com, one URL on each line with no spaces. You can use the `url.txt` file as a sample that will produce the output files. Make sure you have exactly 5 input URLs.
- Also, the URLs must contain `https://` in the address. If you use `www.racer.com/...`, a MissingSchema exception will be thrown.
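The input rules above could be checked before scraping with something like the sketch below. It is an illustrative helper and not part of the repository: `validate_urls` and `EXPECTED_COUNT` are hypothetical names, and it raises a plain `ValueError` with a readable message rather than waiting for the MissingSchema exception to surface during the request.

```python
from urllib.parse import urlparse

EXPECTED_COUNT = 5  # the input file is expected to hold exactly 5 URLs


def validate_urls(lines):
    """Return the cleaned URLs, or raise ValueError for the first problem found."""
    urls = [line.strip() for line in lines if line.strip()]
    if len(urls) != EXPECTED_COUNT:
        raise ValueError(f"expected {EXPECTED_COUNT} URLs, found {len(urls)}")
    for url in urls:
        if any(ch.isspace() for ch in url):
            raise ValueError(f"URL contains whitespace: {url!r}")
        parts = urlparse(url)
        # "www.racer.com/..." parses with an empty scheme, which is the case
        # that would otherwise fail later with MissingSchema.
        if parts.scheme != "https":
            raise ValueError(f"URL must start with https://: {url!r}")
        host = parts.netloc.lower()
        if host != "racer.com" and not host.endswith(".racer.com"):
            raise ValueError(f"not a RACER.com URL: {url!r}")
    return urls
```

A typical use would be `validate_urls(open("url.txt", encoding="utf-8"))` at the start of the run, so that a malformed input file is reported before any request is made.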
You have two options: using Anaconda's CLI or using an IDE like Visual Studio Code.
- Open up Anaconda by typing `anaconda prompt` in the Windows Start menu.
- In Anaconda, navigate to the directory that contains the unzipped repo files.
- Type `python main.py` in the CLI. If no errors appear after pressing Enter, the five articles were written successfully.
- If you have Visual Studio Code installed, you can change the environment (lower right-hand side) to the Anaconda environment and run from there.