This beginner project scrapes headlines from the BBC News homepage using libcurl and std::regex.
- Sends an HTTP request to https://bbc.com/news
- Saves the website's structure as a txt file
- Extracts headline titles using regular expressions
- Filters out short or irrelevant titles (e.g. "News", "Sport")
- Prints valid headlines to the terminal
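The extraction and filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `extractHeadlines` and `filterHeadlines`, the `<h2>` pattern, and the length threshold are all assumptions, and the real BBC markup may need a different pattern.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the text of every <h2>...</h2> element in `html`.
// The <h2> pattern is an assumption for illustration only.
std::vector<std::string> extractHeadlines(const std::string& html) {
    static const std::regex pattern("<h2[^>]*>([^<]+)</h2>");
    std::vector<std::string> titles;
    for (std::sregex_iterator it(html.begin(), html.end(), pattern), end; it != end; ++it) {
        titles.push_back((*it)[1].str());  // capture group 1 is the text between the tags
    }
    return titles;
}

// Hypothetical filter: drops titles shorter than `minLength` characters,
// which removes short navigation labels such as "News" or "Sport".
std::vector<std::string> filterHeadlines(const std::vector<std::string>& titles,
                                         std::size_t minLength) {
    std::vector<std::string> kept;
    for (const std::string& title : titles) {
        if (title.size() >= minLength) {
            kept.push_back(title);
        }
    }
    return kept;
}
```

Calling `filterHeadlines(extractHeadlines(html), 10)` on the downloaded page text would then give the list of titles to print. A length cut-off is only one possible way to filter; a list of excluded words would work as well.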
- C++
- libcurl
- Regular Expressions
I used this to get hands-on practice with web scraping.
- Hardly applicable to other websites, as many websites block web scrapers
- Known bug: the program enters an infinite loop after running
- Using regex makes the code harder to extend; switching to a proper HTML parser would probably be better
- The code can be extended by adding more regex patterns in the main function. To do this, inspect the website's structure (the txt file) and search for keywords like lastUpdated. To find where those keywords sit in the file, search for a known news title
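As a rough idea of what an extra pattern could look like, here is a sketch that pulls out the values following a lastUpdated keyword. It assumes the saved txt file contains JSON-like fragments such as `"lastUpdated":1718000000000`; that format and the helper name `extractLastUpdated` are assumptions, so check the actual file before relying on it.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every numeric value that follows a "lastUpdated" key.
// Assumes JSON-like text of the form "lastUpdated":<digits>, with optional spaces
// around the colon. Adjust the pattern to whatever the saved txt file really contains.
std::vector<std::string> extractLastUpdated(const std::string& page) {
    static const std::regex pattern("\"lastUpdated\"\\s*:\\s*(\\d+)");
    std::vector<std::string> values;
    for (std::sregex_iterator it(page.begin(), page.end(), pattern), end; it != end; ++it) {
        values.push_back((*it)[1].str());  // capture group 1 holds the digits
    }
    return values;
}
```

Other keywords can be handled the same way by changing the key name and the capture group.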
Feel free to fork, I am here to learn!
- Make sure libcurl is installed.
- Compile the code:
g++ -std=c++11 main.cpp -o scraper -lcurl