A powerful web crawler and scraper with a terminal-based UI for extracting and analyzing web content.
- Web Crawling: Discover and map all pages on a website
- Web Scraping: Extract content from specific pages
- Multiple Export Formats: Convert to Markdown, JSON, HTML, or CSV
- Batch Processing: Scrape multiple URLs from previous crawls
- Pause & Resume: Save crawl state and resume later
- Content Search: Find specific content in your crawled/scraped data
- Proxy Support: Use proxies to avoid IP blocking
- User Agent Rotation: Rotate user agents to appear more like a regular browser
- Robots.txt Compliance: Respect website crawling rules
- Rate Limiting: Configurable delays between requests (see the sketch after this feature list)
- Blacklist/Whitelist: Control which URLs are crawled
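
For a concrete picture of how the robots.txt compliance, rate limiting, and user-agent rotation features above typically work, here is a minimal Python sketch. It is illustrative only and is not hungry's actual implementation; the USER_AGENTS list and polite_get helper are made-up names, and it assumes the requests library plus the standard urllib.robotparser module.

```python
# Illustrative sketch only (not hungry's actual code): shows how robots.txt
# checks, a crawl delay, and user-agent rotation commonly fit together.
import random
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

# Hypothetical pool of user agents to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def polite_get(url, crawl_delay=1.0):
    """Fetch a URL only if robots.txt allows it, waiting between requests."""
    parts = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    if not robots.can_fetch("*", url):
        return None  # respect the site's crawling rules

    time.sleep(crawl_delay)  # rate limiting between requests
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # user agent rotation
    return requests.get(url, headers=headers, timeout=10)
```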
- Python 3.6+
- Windows or macOS/Linux
Clone this repository:

Make a new folder, open your terminal or PowerShell, and navigate to that folder:

```
cd "name of folder"
```

Once you're in that folder, use the git clone command below:

```
git clone https://github.com/KenKaiii/hungry.git
```

Now navigate to the folder that was downloaded:

```
cd hungry
```
Run the setup script:

Windows:

```
.\setup.bat
```

macOS/Linux:

```
chmod +x setup.sh
./setup.sh
```
The setup script will:
- Create a virtual environment
- Install required dependencies
- Set up necessary folders
- Create default settings
Run the crawler:

Windows:

```
.\hungry.bat
```

macOS/Linux:

```
./hungry.sh
```

Get help:

Windows:

```
hungry.bat /help
```

macOS/Linux:

```
./hungry.sh --help
```
- Crawl: Discover all pages on a website
- Scrape: Extract content from a specific page
- Scrape All URLs: Process multiple URLs from a previous crawl
- Resume Crawl: Continue a previously paused crawl
- Search: Find content in your crawled/scraped data
- Settings: Configure crawler behavior
- Results/: Contains scraped content
- Crawled/: Contains lists of crawled URLs
- Exports/: Contains exported data (see the example after this list)
- Logs/: Contains log files
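
If you want to work with exported data outside the tool, the sketch below walks the Exports/ folder and previews any JSON files it finds. The folder name comes from this README, but the assumption that exports sit there as plain .json files, and whatever schema they contain, depends on your export settings.

```python
# Illustrative sketch: preview JSON exports found under Exports/.
# The folder name is from the README; file names and schema are assumptions.
import json
from pathlib import Path

exports_dir = Path("Exports")

if not exports_dir.is_dir():
    print("No Exports/ folder found; run a crawl and export something first.")
else:
    for path in sorted(exports_dir.glob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Works whether the export is a list of records or a single object.
        count = len(data) if isinstance(data, (list, dict)) else 1
        print(f"{path.name}: {count} top-level item(s)")
```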
You can configure the crawler through the Settings menu or by directly editing the settings.json file.
Key settings:
- respect_robots_txt: Whether to respect robots.txt rules
- crawl_delay: Delay between requests in seconds
- max_pages: Maximum number of pages to crawl before asking to continue
- export_formats: Default export formats
- use_proxies: Whether to use proxies
- proxies: List of proxy URLs to use
- rotate_user_agents: Whether to rotate user agents
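
If you prefer scripting over the Settings menu, the sketch below edits settings.json with Python's standard json module. The key names are the ones listed above; the values (and the export format names) are illustrative assumptions, not the project's actual defaults.

```python
# Illustrative sketch: tweak settings.json from a script instead of the menu.
# Key names come from this README; the values below are assumptions, not defaults.
import json
from pathlib import Path

settings_path = Path("settings.json")  # created by the setup script
settings = json.loads(settings_path.read_text(encoding="utf-8"))

settings["respect_robots_txt"] = True          # obey robots.txt rules
settings["crawl_delay"] = 2                    # seconds between requests
settings["max_pages"] = 500                    # ask before crawling past this
settings["rotate_user_agents"] = True          # vary the User-Agent header
settings["use_proxies"] = False                # set True and fill in "proxies"
settings["export_formats"] = ["markdown", "json"]  # assumed format names

settings_path.write_text(json.dumps(settings, indent=2), encoding="utf-8")
```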
This tool is provided for educational and research purposes only. Always respect website terms of service and robots.txt rules. The authors are not responsible for any misuse of this software.
MIT License
Contributions are welcome! Please feel free to submit a Pull Request.
Created by KenKaiii