ScraperX: Web Content Mining and Document Generation Tool

kaival775/web-scrapper

Web Scraper with PDF Generation

A simple Python web scraper that extracts the title, paragraphs, and images from a webpage and saves the result as a text file and a PDF file.

Features

  • Takes a URL from the user
  • Scrapes the webpage for title, paragraphs, and images
  • Saves the data to a text file
  • Converts the data to a PDF file
  • Implements error handling with try-except-else blocks
  • Includes file handling operations
  • Creates unique PDF files with timestamps
  • Handles long titles in PDFs by adjusting font size or breaking into multiple lines
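The extraction step listed above can be sketched as follows. This is a minimal illustration, not the script's actual code: it uses the standard library's `html.parser` instead of beautifulsoup4 so that it runs without extra packages, and the names `PageExtractor` and `extract` are hypothetical.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the page title, paragraph text, and image sources."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self.images = []
        self._in_title = False
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self._buf = []
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags without a src attribute
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            # Join nested text (e.g. inside <b>) and collapse whitespace.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buf.append(data)


def extract(html):
    """Return a dict with the title, paragraphs, and image URLs of a page."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "paragraphs": parser.paragraphs,
        "images": parser.images,
    }
```

Paragraphs without a closing `</p>` tag are not captured by this sketch; a tolerant parser such as beautifulsoup4 handles such markup more gracefully.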

Requirements

  • Python 3.6+
  • Required packages:
    • requests
    • beautifulsoup4
    • fpdf

Installation

  1. Clone this repository or download the files
  2. Install the required packages:
pip install requests beautifulsoup4 fpdf

Usage

Run the script:

python web_scraper.py

When prompted, enter the complete URL of the website you want to scrape (including http:// or https://).
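A check for a "complete" URL might look like the sketch below. It assumes that only `http` and `https` schemes are accepted and that a host name is required; the function name `is_complete_url` is hypothetical and not taken from the script.

```python
from urllib.parse import urlparse


def is_complete_url(url):
    """Return True if the URL has an http/https scheme and a host."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

For example, `is_complete_url("https://example.com/page")` is true, while `is_complete_url("example.com")` is false because the scheme is missing.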

The script will:

  1. Scrape the website
  2. Save the content to a unique text file (e.g., scraped_data_20240615_123045.txt)
  3. Generate a unique PDF file (e.g., scraped_data_20240615_123045.pdf)

Each time you run the script, it will create new files rather than overwriting existing ones.
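File names like the ones above can be produced with a timestamp in `YYYYMMDD_HHMMSS` form. The following is a small sketch assuming that format; the helper name `unique_basename` and its parameters are hypothetical.

```python
from datetime import datetime


def unique_basename(prefix="scraped_data", now=None):
    """Build a base file name such as scraped_data_20240615_123045."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}"


base = unique_basename()
text_path = base + ".txt"
pdf_path = base + ".pdf"
```

With one-second resolution, two runs started within the same second would produce the same name, so this approach suits interactive use rather than rapid batch runs.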

Error Handling

The script uses try-except-else blocks to handle errors in:

  • Network connection issues
  • Invalid URLs
  • File I/O operations
  • PDF generation
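The try-except-else pattern can be illustrated with the file-writing step. This is a sketch under assumptions: the function name `save_text` and its messages are hypothetical, and the script's own handling may differ.

```python
def save_text(path, text):
    """Write text to path; return True on success, False on an I/O error."""
    try:
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)
    except OSError as exc:
        # Covers a missing directory, permission problems, a full disk, etc.
        print(f"Could not write {path}: {exc}")
        return False
    else:
        # Runs only when the try block raised no exception.
        print(f"Saved {path}")
        return True
```

The same shape applies to the network step: wrap `requests.get(...)` in `try`, catch `requests.exceptions.RequestException` in `except`, and process the response in `else`.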

PDF Improvements

To handle long titles, PDF generation will:

  • Prevent titles from extending beyond the page width
  • Adjust the font size for very long titles
  • Break exceptionally long titles into multiple lines
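One way to combine these three behaviours is to shrink the font first and wrap only if the title is still too wide. The sketch below is an illustration, not the script's implementation: `fit_title`, `approx_width`, and the size limits are hypothetical, and `approx_width` is a crude stand-in for a real text-measuring call such as fpdf's `get_string_width`.

```python
import textwrap


def approx_width(text, size):
    """Rough estimate: each character is about half the font size wide."""
    return len(text) * size * 0.5


def fit_title(title, max_width, measure=approx_width,
              start_size=24, min_size=14, step=2):
    """Return (font_size, lines) such that every line fits in max_width."""
    size = start_size
    # Step 1: reduce the font size until the title fits or the minimum is hit.
    while size > min_size and measure(title, size) > max_width:
        size = max(min_size, size - step)
    if measure(title, size) <= max_width:
        return size, [title]
    # Step 2: still too wide at the minimum size, so wrap on word boundaries.
    avg_char = measure(title, size) / len(title)
    chars_per_line = max(1, int(max_width / avg_char))
    return size, textwrap.wrap(title, width=chars_per_line)
```

Each returned line can then be written as its own centered cell in the PDF at the chosen font size.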

Note

This is a simple web scraper for educational purposes. Some websites may have measures to prevent scraping or might have complex structures that this basic scraper cannot handle effectively.
