Web Scraper API

A high-performance web crawling and scraping service built with FastAPI and asyncio.

Features

  • High Performance: Built with asyncio for concurrent crawling
  • RESTful API: Clean, well-documented FastAPI endpoints
  • Customizable: Configure crawl depth, request limits, and more
  • Intelligent Parsing: Automatically detects and extracts content from product pages and generic web pages
  • Production Ready: Includes error handling, logging, and rate limiting
  • Cloud Deployment: Ready to deploy on Render with minimal configuration

Tech Stack

  • FastAPI: Modern, high-performance web framework for building APIs
  • asyncio/aiohttp: For asynchronous HTTP requests and concurrent crawling
  • Beautiful Soup 4: For HTML parsing
  • Pydantic: For data validation and settings management
  • Uvicorn: ASGI server for production deployment

Installation

Clone the repository and install dependencies:

git clone https://github.com/thnamus/scraper-api.git
cd scraper-api
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Usage

Starting the API Server

python run.py

Or directly with Uvicorn:

uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
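
If you use the run.py entry point, it is expected to be a thin wrapper around Uvicorn. A minimal sketch, assuming the FastAPI app object lives in app/main.py; the host, port, and reload values below are illustrative, not necessarily the project's actual defaults:

# run.py, minimal sketch; host, port, and reload flag are assumptions
import uvicorn

if __name__ == "__main__":
    uvicorn.run("app.main:app", host="0.0.0.0", port=8000, reload=True)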

API Documentation

Once the server is running, you can access the interactive API documentation (FastAPI's built-in docs):

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

Online Demo API

You can test the API without local installation at:

Making API Requests

Example API request to crawl a website:

curl -X 'POST' \
  'http://localhost:8000/crawl' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "url": "https://example.com",
  "max_pages": 10
}'
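
The same request can be made from Python. A quick sketch using the requests library; the /crawl endpoint and payload fields mirror the curl example above, and since the response schema isn't documented here, the body is simply printed:

import requests

# Illustrative client call; point it at wherever the API is running
resp = requests.post(
    "http://localhost:8000/crawl",
    json={"url": "https://example.com", "max_pages": 10},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())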

Project Structure

scraper/
├── app/
│   ├── __init__.py
│   ├── main.py          # FastAPI application and endpoints
│   ├── crawler.py       # Web crawler implementation
│   ├── parser.py        # HTML parsing logic
│   └── utils.py         # Utility functions
├── requirements.txt     # Project dependencies
├── run.py              # Server startup script
├── render.yaml         # Render deployment configuration
└── runtime.txt         # Python version specification

Configuration

The crawler can be configured with the following settings:

  • max_pages: Maximum number of pages to crawl per request (default: 50)
  • MAX_CONNECTIONS: Maximum concurrent connections (default: 10)
  • DEFAULT_TIMEOUT: Request timeout in seconds (default: 10)
  • REQUEST_DELAY: Delay between requests in seconds (default: 0.5)
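
How these settings are overridden isn't shown here, but since the deployment section below supplies them as environment variables, here is a sketch of how they would typically be read. The names and defaults come from the list above; the code itself is illustrative, not the project's actual settings module:

import os

# Illustrative only: read the documented settings from the environment,
# falling back to the defaults listed above
MAX_CONNECTIONS = int(os.getenv("MAX_CONNECTIONS", "10"))
DEFAULT_TIMEOUT = float(os.getenv("DEFAULT_TIMEOUT", "10"))
REQUEST_DELAY = float(os.getenv("REQUEST_DELAY", "0.5"))
DEFAULT_MAX_PAGES = int(os.getenv("DEFAULT_MAX_PAGES", "50"))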

Deployment on Render

This project is ready to deploy on Render:

  1. Create a Render account if you don't have one

  2. Connect your GitHub repository to Render

  3. Create a new Web Service:

    • Select your repository
    • Render will automatically detect the configuration from render.yaml (a sketch of this file appears after these steps)
    • Or configure manually:
      • Environment: Python
      • Build Command: pip install -r requirements.txt
      • Start Command: uvicorn app.main:app --host 0.0.0.0 --port $PORT
  4. Set environment variables (optional):

    • MAX_CONNECTIONS: 5 (recommended for free tier)
    • REQUEST_DELAY: 1.0 (be respectful to target websites)
    • DEFAULT_MAX_PAGES: 50
  5. Deploy the service:

    • Render will automatically build and deploy your application
    • Access your API at the URL provided by Render
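
For reference, a minimal render.yaml along these lines might look as follows. The service name and environment-variable values are illustrative assumptions; the build and start commands mirror the manual configuration above:

# render.yaml, illustrative sketch, not the project's actual file
services:
  - type: web
    name: scraper-api
    env: python
    buildCommand: pip install -r requirements.txt
    startCommand: uvicorn app.main:app --host 0.0.0.0 --port $PORT
    envVars:
      - key: MAX_CONNECTIONS
        value: "5"
      - key: REQUEST_DELAY
        value: "1.0"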

Free Tier Considerations

Render's free tier has certain limitations:

  • Services sleep after 15 minutes of inactivity
  • Limited CPU and memory resources
  • Consider setting lower MAX_CONNECTIONS and higher REQUEST_DELAY values

Future Enhancements

  • Add authentication for API access
  • Add proxy support for IP rotation
  • Include support for JavaScript-rendered content
  • Implement content storage in databases
