A simple Python web scraper that extracts the title, paragraphs, and images from a webpage and converts the result to a PDF file.
- Takes a URL from the user
- Scrapes the webpage for title, paragraphs, and images
- Saves the data to a text file
- Converts the data to a PDF file
- Implements error handling with try-except-else blocks
- Includes file handling operations
- Creates unique PDF files with timestamps
- Handles long titles in PDFs by adjusting font size or breaking into multiple lines
- Python 3.6+
- Required packages:
  - requests
  - beautifulsoup4
  - fpdf
- Clone this repository or download the files
- Install the required packages:
pip install requests beautifulsoup4 fpdf
Run the script:
python web_scraper.py
When prompted, enter the complete URL of the website you want to scrape (including http:// or https://).
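A minimal sketch of how a "complete URL" check could look before any request is made. The function name `is_complete_url` is hypothetical and is not taken from web_scraper.py; it only illustrates the http:// or https:// requirement using the standard library.

```python
from urllib.parse import urlparse


def is_complete_url(url: str) -> bool:
    """Return True if url has an http/https scheme and a host part.

    Hypothetical helper for illustration; the actual script may validate
    its input differently.
    """
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(is_complete_url("https://example.com/page"))  # True
print(is_complete_url("example.com/page"))          # False
```

A check like this only confirms the shape of the URL. Whether the page is reachable is still decided by the network request itself.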
The script will:
- Scrape the website
- Save the content to a unique text file (e.g., scraped_data_20240615_123045.txt)
- Generate a unique PDF file (e.g., scraped_data_20240615_123045.pdf)
Each time you run the script, it will create new files rather than overwriting existing ones.
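The file names shown above follow a date and time pattern. One way such names might be built is sketched here; `output_names` and its parameters are hypothetical, and the script's own naming code may differ.

```python
from datetime import datetime
from typing import Optional, Tuple


def output_names(now: Optional[datetime] = None,
                 prefix: str = "scraped_data") -> Tuple[str, str]:
    """Return (text file name, PDF file name) sharing one timestamp.

    Hypothetical helper: passing `now` explicitly makes the result
    reproducible; by default the current local time is used.
    """
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{prefix}_{stamp}.txt", f"{prefix}_{stamp}.pdf"


print(output_names(datetime(2024, 6, 15, 12, 30, 45)))
# ('scraped_data_20240615_123045.txt', 'scraped_data_20240615_123045.pdf')
```

With one-second resolution, two runs started within the same second would produce the same names, so this scheme is unique per second rather than per run.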
The script includes comprehensive error handling for:
- Network connection issues
- Invalid URLs
- File I/O operations
- PDF generation
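As an illustration of the try-except-else pattern for one of the cases above (file I/O), here is a small sketch. The function `save_text` is hypothetical and uses only the standard library; it is not the script's actual code.

```python
import os
import tempfile


def save_text(path: str, text: str) -> bool:
    """Write text to path; return True on success, False on an I/O error."""
    try:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
    except OSError as exc:
        # Covers missing directories, permission problems, and similar failures.
        print(f"Could not write {path}: {exc}")
        return False
    else:
        # Runs only when the try block raised nothing.
        print(f"Saved {len(text)} characters to {path}")
        return True


with tempfile.TemporaryDirectory() as tmp:
    save_text(os.path.join(tmp, "scraped_data.txt"), "Example title\n")
    # The parent directory does not exist, so this takes the except branch.
    save_text(os.path.join(tmp, "missing_dir", "out.txt"), "never written")
```

The same shape would presumably apply to the network step, with the request inside `try` and `requests.exceptions.RequestException` in the `except` clause.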
The PDF generation has been improved to:
- Prevent titles from extending beyond the page width
- Adjust the font size for very long titles
- Break exceptionally long titles into multiple lines
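The title handling described above can be sketched independently of the PDF library. In this sketch every name (`fit_title`, `wrap_words`, `fake_measure`), the font sizes, and the 190-unit width are assumptions made for illustration; the width corresponds to an A4 page with 10 mm margins. The text-measuring function is passed in so the logic can be tried without fpdf installed.

```python
from typing import Callable, List, Sequence, Tuple

# (text, font size) -> rendered width, in the same unit as max_width
Measure = Callable[[str, float], float]


def wrap_words(text: str, size: float, max_width: float,
               measure: Measure) -> List[str]:
    """Greedily pack words into lines no wider than max_width.

    A single word wider than max_width is left on its own line.
    """
    lines: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and measure(candidate, size) > max_width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


def fit_title(title: str, max_width: float, measure: Measure,
              sizes: Sequence[float] = (16, 14, 12)) -> Tuple[float, List[str]]:
    """Try each size from largest to smallest; wrap if none fits on one line."""
    for size in sizes:
        if measure(title, size) <= max_width:
            return size, [title]
    smallest = sizes[-1]
    return smallest, wrap_words(title, smallest, max_width, measure)


def fake_measure(text: str, size: float) -> float:
    """Stand-in metric: every character is half the font size wide."""
    return len(text) * size * 0.5


print(fit_title("Short title", 190, fake_measure))
print(fit_title("A moderately long headline", 190, fake_measure))
print(fit_title(" ".join(["scraping"] * 8), 190, fake_measure))
```

With fpdf, the measuring function could plausibly be built from `set_font` followed by `get_string_width`, and each returned line written with its own cell call. That wiring is not shown here and would need to be checked against the installed fpdf version.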
This is a simple web scraper for educational purposes. Some websites may have measures to prevent scraping or might have complex structures that this basic scraper cannot handle effectively.