🕷️ BBC Web Crawler (Python)

A simple Python web crawler that fetches the BBC homepage, parses the HTML, extracts all valid links, and then crawls the pages it discovers.
It was built with requests and BeautifulSoup4 as part of internship learning tasks.
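
The fetch-and-parse step described above might look roughly like the sketch below. It is not the code in crawler.py: the names START_URL, fetch_html and extract_links, the start URL itself, and the 10-second timeout are all assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Assumed start page; the URL actually used is defined in crawler.py.
START_URL = "https://www.bbc.com"


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
    except requests.RequestException as exc:
        print(f"Skipping {url}: {exc}")
        return None
    return response.text


def extract_links(html, base_url):
    """Return an absolute URL for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    html = fetch_html(START_URL)
    if html is not None:
        for link in extract_links(html, START_URL):
            print(link)
```

Catching requests.RequestException covers connection errors, timeouts, and malformed URLs in one place, so a single bad page does not stop the crawl.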

🚀 Features

Fetches HTML pages using requests
Parses page content using BeautifulSoup
Extracts and normalizes all valid <a> links
Maintains a frontier list for crawling
Uses a visited dictionary to prevent repeated crawling
Basic error handling for network issues
URL filtering (skips # links, help links, invalid URLs)
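
As a rough illustration of how the frontier list, the visited dictionary, and the URL filter could fit together, here is a minimal sketch that uses only the standard library. The function names, the max_pages limit, the get_links callback, and the rule that a "help link" is any URL whose path contains "/help" are assumptions, not details taken from crawler.py.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_link(base_url, href):
    """Return an absolute http(s) URL without its fragment, or None to skip it."""
    if not href or href.startswith("#"):
        return None  # empty href or same-page anchor
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None  # mailto:, javascript:, or otherwise invalid URL
    if "/help" in parts.path:
        return None  # assumed definition of a "help link"
    return absolute


def crawl(start_url, get_links, max_pages=50):
    """Crawl breadth-first; get_links(url) must return the raw hrefs on that page."""
    frontier = [start_url]  # URLs waiting to be crawled
    visited = {}            # URLs already crawled, stored as url -> True
    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)
        if url in visited:
            continue
        visited[url] = True
        for href in get_links(url):
            link = normalize_link(url, href)
            if link and link not in visited and link not in frontier:
                frontier.append(link)
    return visited
```

Passing get_links in as a parameter keeps the loop independent of the network, so it can be tried against a small in-memory dictionary of pages before pointing it at a real site.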

🛠️ Technologies Used

Python 3.x
requests
beautifulsoup4

📦 Installation

# Clone the repository
git clone https://github.com/sarvesh-devrukhkar-ftech/web-crawler-in-python.git
cd web-crawler-in-python

# Create a virtual environment
python3 -m venv venv

# Activate the virtual environment
source venv/bin/activate      # macOS / Linux
venv\Scripts\activate         # Windows

# Install dependencies
pip install -r requirements.txt

# Run the Crawler
python crawler.py
