A high-performance web crawling and scraping service built with FastAPI and asyncio.
- High Performance: Built with asyncio for concurrent crawling
- RESTful API: Clean, well-documented FastAPI endpoints
- Customizable: Configure crawl depth, request limits, and more
- Intelligent Parsing: Automatically detects and extracts content from product pages and generic web pages
- Production Ready: Includes error handling, logging, and rate limiting
- Cloud Deployment: Ready to deploy on Render with minimal configuration
Built with the following stack (a short usage sketch follows this list):
- FastAPI: Modern, high-performance web framework for building APIs
- asyncio/aiohttp: For asynchronous HTTP requests and concurrent crawling
- Beautiful Soup 4: For HTML parsing
- Pydantic: For data validation and settings management
- Uvicorn: ASGI server for production deployment
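To show how these pieces fit together, here is a minimal, self-contained sketch of the asyncio/aiohttp + Beautiful Soup pattern. It is illustrative only: the function names and the title-only extraction are assumptions, not this project's actual crawler code.

```python
# Illustrative sketch: concurrent fetching with aiohttp, parsing with bs4.
import asyncio

import aiohttp
from bs4 import BeautifulSoup

async def fetch_title(session: aiohttp.ClientSession, url: str) -> str:
    # One GET per URL; the shared session pools connections for us.
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
        html = await resp.text()
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string.strip() if soup.title and soup.title.string else ""

async def main(urls: list[str]) -> None:
    async with aiohttp.ClientSession() as session:
        # gather() runs all fetches concurrently on one event loop.
        titles = await asyncio.gather(*(fetch_title(session, u) for u in urls))
        for url, title in zip(urls, titles):
            print(f"{url} -> {title}")

if __name__ == "__main__":
    asyncio.run(main(["https://example.com"]))
```

Sharing a single `ClientSession` across all requests is what keeps concurrent crawling cheap: connections are pooled instead of re-established per page.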
Clone the repository and install dependencies:

```bash
git clone https://github.com/thnamus/scraper-api.git
cd scraper-api
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Start the server:

```bash
python run.py
```

Or run it directly with Uvicorn:

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

Once the server is running, you can access the interactive API documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
You can also test the API without installing anything locally:
- Live API: https://scraper.thnamus.tech
- Documentation: https://scraper.thnamus.tech/docs
- ReDoc: https://scraper.thnamus.tech/redoc
Example API request to crawl a website:

```bash
curl -X 'POST' \
  'http://localhost:8000/crawl' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "max_pages": 10
  }'
```
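The actual response schema is defined by the service in app/main.py; the shape below is only an illustration, and all field names in it are assumptions rather than the real contract:

```json
{
  "pages_crawled": 10,
  "results": [
    {
      "url": "https://example.com",
      "title": "Example Domain",
      "content": "..."
    }
  ]
}
```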
The project is organized as follows:

```
scraper/
├── app/
│   ├── __init__.py
│   ├── main.py          # FastAPI application and endpoints
│   ├── crawler.py       # Web crawler implementation
│   ├── parser.py        # HTML parsing logic
│   └── utils.py         # Utility functions
├── requirements.txt     # Project dependencies
├── run.py               # Server startup script
├── render.yaml          # Render deployment configuration
└── runtime.txt          # Python version specification
```
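To make the layout concrete, here is a hypothetical sketch of how app/main.py might wire the /crawl endpoint. The request model and the echoed response are assumptions, not the repository's actual code:

```python
# Hypothetical sketch of the /crawl wiring; the real app/main.py may differ.
from fastapi import FastAPI
from pydantic import BaseModel, HttpUrl

app = FastAPI(title="Scraper API")

class CrawlRequest(BaseModel):
    url: HttpUrl         # starting URL to crawl
    max_pages: int = 50  # matches the documented default

@app.post("/crawl")
async def crawl(request: CrawlRequest) -> dict:
    # The real endpoint would delegate to app/crawler.py; this sketch
    # just echoes the validated parameters to stay self-contained.
    return {"url": str(request.url), "max_pages": request.max_pages}
```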
The crawler can be configured with the following settings:
- `max_pages`: Maximum number of pages to crawl (default: 50)
- `MAX_CONNECTIONS`: Maximum concurrent connections (default: 10)
- `DEFAULT_TIMEOUT`: Request timeout in seconds (default: 10)
- `REQUEST_DELAY`: Delay between requests in seconds (default: 0.5)
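One plausible way these settings are surfaced is through environment variables, matching the deployment section below. A minimal sketch, assuming the documented names map one-to-one to environment variables (the module itself is an assumption, not the project's actual config code):

```python
# Illustrative settings module: read crawler tunables from the environment,
# falling back to the documented defaults.
import os

MAX_CONNECTIONS = int(os.environ.get("MAX_CONNECTIONS", "10"))
DEFAULT_TIMEOUT = float(os.environ.get("DEFAULT_TIMEOUT", "10"))
REQUEST_DELAY = float(os.environ.get("REQUEST_DELAY", "0.5"))
DEFAULT_MAX_PAGES = int(os.environ.get("DEFAULT_MAX_PAGES", "50"))
```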
This project is ready to deploy on Render:
1. Create a Render account if you don't have one.
2. Connect your GitHub repository to Render.
3. Create a new Web Service:
   - Select your repository
   - Render will automatically detect the configuration from `render.yaml` (a sample is sketched after these steps)
   - Or configure manually:
     - Environment: Python
     - Build Command: `pip install -r requirements.txt`
     - Start Command: `uvicorn app.main:app --host 0.0.0.0 --port $PORT`
4. Set environment variables (optional):
   - `MAX_CONNECTIONS`: 5 (recommended for the free tier)
   - `REQUEST_DELAY`: 1.0 (be respectful to target websites)
   - `DEFAULT_MAX_PAGES`: 50
5. Deploy the service:
   - Render will automatically build and deploy your application
   - Access your API at the URL provided by Render
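For reference, a render.yaml along these lines would encode the manual configuration above. This is a sketch, not the repository's actual file: the service name is a placeholder, and field names should be checked against Render's current blueprint spec.

```yaml
services:
  - type: web
    name: scraper-api              # placeholder service name
    env: python
    buildCommand: pip install -r requirements.txt
    startCommand: uvicorn app.main:app --host 0.0.0.0 --port $PORT
    envVars:
      - key: MAX_CONNECTIONS
        value: "5"
      - key: REQUEST_DELAY
        value: "1.0"
```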
Render's free tier has certain limitations:
- Services sleep after 15 minutes of inactivity
- Limited CPU and memory resources
- Consider setting a lower `MAX_CONNECTIONS` and a higher `REQUEST_DELAY`
Planned improvements:
- Add authentication for API access (a minimal sketch follows this list)
- Add proxy support for IP rotation
- Include support for JavaScript-rendered content
- Implement content storage in databases
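As a starting point for the authentication item, FastAPI's dependency injection supports a simple API-key check. A minimal sketch, assuming the key lives in a SCRAPER_API_KEY environment variable and arrives in an X-API-Key header (both names are hypothetical, not part of the current API):

```python
# Hypothetical API-key guard for future authentication support.
import os

from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

async def require_api_key(x_api_key: str = Header(default="")) -> None:
    # FastAPI maps the x_api_key parameter to the X-API-Key request header;
    # the expected value is read from the environment in this sketch.
    if x_api_key != os.environ.get("SCRAPER_API_KEY", ""):
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.get("/protected", dependencies=[Depends(require_api_key)])
async def protected() -> dict:
    return {"ok": True}
```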