Web Scraper API

A high-performance web crawling and scraping service built with FastAPI and asyncio.

Features

  • High Performance: Built with asyncio for concurrent crawling
  • RESTful API: Clean, well-documented FastAPI endpoints
  • Customizable: Configure crawl depth, request limits, and more
  • Intelligent Parsing: Automatically detects and extracts content from product pages and generic web pages
  • Production Ready: Includes error handling, logging, and rate limiting
  • Cloud Deployment: Ready to deploy on Render with minimal configuration

Tech Stack

  • FastAPI: Modern, high-performance web framework for building APIs
  • asyncio/aiohttp: For asynchronous HTTP requests and concurrent crawling
  • Beautiful Soup 4: For HTML parsing
  • Pydantic: For data validation and settings management
  • Uvicorn: ASGI server for production deployment

Installation

Clone the repository and install dependencies:

git clone https://github.com/thnamus/scraper-api.git
cd scraper-api
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Usage

Starting the API Server

python run.py

Or directly with Uvicorn:

uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
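
If you use the run.py entry point, it is expected to be a thin wrapper around Uvicorn. A minimal sketch, assuming the FastAPI app object lives in app/main.py; the host, port, and reload values below are illustrative, not necessarily the project's actual defaults:

# run.py, minimal sketch; host, port, and reload flag are assumptions
import uvicorn

if __name__ == "__main__":
    uvicorn.run("app.main:app", host="0.0.0.0", port=8000, reload=True)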

API Documentation

Once the server is running, you can access the interactive API documentation (FastAPI's built-in docs):

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

Online Demo API

You can test the API without local installation at:

Making API Requests

Example API request to crawl a website:

curl -X 'POST' \
  'http://localhost:8000/crawl' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "url": "https://example.com",
  "max_pages": 10
}'
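
The same request can be made from Python. A quick sketch using the requests library; the /crawl endpoint and payload fields mirror the curl example above, and since the response schema isn't documented here, the body is simply printed:

import requests

# Illustrative client call; point it at wherever the API is running
resp = requests.post(
    "http://localhost:8000/crawl",
    json={"url": "https://example.com", "max_pages": 10},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())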

Project Structure

scraper/
├── app/
│   ├── __init__.py
│   ├── main.py          # FastAPI application and endpoints
│   ├── crawler.py       # Web crawler implementation
│   ├── parser.py        # HTML parsing logic
│   └── utils.py         # Utility functions
├── requirements.txt     # Project dependencies
├── run.py              # Server startup script
├── render.yaml         # Render deployment configuration
└── runtime.txt         # Python version specification

Configuration

The crawler can be configured with the following settings:

  • max_pages: Maximum number of pages to crawl per request (default: 50)
  • MAX_CONNECTIONS: Maximum concurrent connections (default: 10)
  • DEFAULT_TIMEOUT: Request timeout in seconds (default: 10)
  • REQUEST_DELAY: Delay between requests in seconds (default: 0.5)
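
How these settings are overridden isn't shown here, but since the deployment section below supplies them as environment variables, here is a sketch of how they would typically be read. The names and defaults come from the list above; the code itself is illustrative, not the project's actual settings module:

import os

# Illustrative only: read the documented settings from the environment,
# falling back to the defaults listed above
MAX_CONNECTIONS = int(os.getenv("MAX_CONNECTIONS", "10"))
DEFAULT_TIMEOUT = float(os.getenv("DEFAULT_TIMEOUT", "10"))
REQUEST_DELAY = float(os.getenv("REQUEST_DELAY", "0.5"))
DEFAULT_MAX_PAGES = int(os.getenv("DEFAULT_MAX_PAGES", "50"))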

Deployment on Render

This project is ready to deploy on Render:

  1. Create a Render account if you don't have one

  2. Connect your GitHub repository to Render

  3. Create a new Web Service:

    • Select your repository
    • Render will automatically detect the configuration from render.yaml (a sketch of this file appears after these steps)
    • Or configure manually:
      • Environment: Python
      • Build Command: pip install -r requirements.txt
      • Start Command: uvicorn app.main:app --host 0.0.0.0 --port $PORT
  4. Set environment variables (optional):

    • MAX_CONNECTIONS: 5 (recommended for free tier)
    • REQUEST_DELAY: 1.0 (be respectful to target websites)
    • DEFAULT_MAX_PAGES: 50
  5. Deploy the service:

    • Render will automatically build and deploy your application
    • Access your API at the URL provided by Render
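
For reference, a minimal render.yaml along these lines might look as follows. The service name and environment-variable values are illustrative assumptions; the build and start commands mirror the manual configuration above:

# render.yaml, illustrative sketch, not the project's actual file
services:
  - type: web
    name: scraper-api
    env: python
    buildCommand: pip install -r requirements.txt
    startCommand: uvicorn app.main:app --host 0.0.0.0 --port $PORT
    envVars:
      - key: MAX_CONNECTIONS
        value: "5"
      - key: REQUEST_DELAY
        value: "1.0"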

Free Tier Considerations

Render's free tier has certain limitations:

  • Services sleep after 15 minutes of inactivity
  • Limited CPU and memory resources
  • Consider setting lower MAX_CONNECTIONS and higher REQUEST_DELAY values

Future Enhancements

  • Add authentication for API access
  • Add proxy support for IP rotation
  • Include support for JavaScript-rendered content
  • Implement content storage in databases
