A powerful web crawler and scraper with a terminal-based UI for extracting and analyzing web content.
- Web Crawling: Discover and map all pages on a website
- Web Scraping: Extract content from specific pages
- Multiple Export Formats: Convert to Markdown, JSON, HTML, or CSV
- Batch Processing: Scrape multiple URLs from previous crawls
- Pause & Resume: Save crawl state and resume later
- Content Search: Find specific content in your crawled/scraped data
- Proxy Support: Use proxies to avoid IP blocking
- User Agent Rotation: Rotate user agents to appear more like a regular browser
- Robots.txt Compliance: Respect website crawling rules
- Rate Limiting: Configurable delays between requests (see the sketch after this feature list)
- Blacklist/Whitelist: Control which URLs are crawled
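
For a concrete picture of how the robots.txt compliance, rate limiting, and user-agent rotation features above typically work, here is a minimal Python sketch. It is illustrative only and is not hungry's actual implementation; the USER_AGENTS list and polite_get helper are made-up names, and it assumes the requests library plus the standard urllib.robotparser module.

```python
# Illustrative sketch only (not hungry's actual code): shows how robots.txt
# checks, a crawl delay, and user-agent rotation commonly fit together.
import random
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

# Hypothetical pool of user agents to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def polite_get(url, crawl_delay=1.0):
    """Fetch a URL only if robots.txt allows it, waiting between requests."""
    parts = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    if not robots.can_fetch("*", url):
        return None  # respect the site's crawling rules

    time.sleep(crawl_delay)  # rate limiting between requests
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # user agent rotation
    return requests.get(url, headers=headers, timeout=10)
```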
- Python 3.6+
- Windows or macOS/Linux
Clone this repository:

Make a new folder, open your terminal or PowerShell, and navigate to that folder:

```
cd "name of folder"
```

Once you're in that folder, use the git clone command below:

```
git clone https://github.com/KenKaiii/hungry.git
```

Now navigate to the folder that was downloaded:

```
cd hungry
```
Run the setup script:

Windows:

```
.\setup.bat
```

macOS/Linux:

```
chmod +x setup.sh
./setup.sh
```
The setup script will:
- Create a virtual environment
- Install required dependencies
- Set up necessary folders
- Create default settings
Run the crawler:

Windows:

```
.\hungry.bat
```

macOS/Linux:

```
./hungry.sh
```

Get help:

Windows:

```
hungry.bat /help
```

macOS/Linux:

```
./hungry.sh --help
```
- Crawl: Discover all pages on a website
- Scrape: Extract content from a specific page
- Scrape All URLs: Process multiple URLs from a previous crawl
- Resume Crawl: Continue a previously paused crawl
- Search: Find content in your crawled/scraped data
- Settings: Configure crawler behavior
- Results/: Contains scraped content
- Crawled/: Contains lists of crawled URLs
- Exports/: Contains exported data (see the example after this list)
- Logs/: Contains log files
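
If you want to work with exported data outside the tool, the sketch below walks the Exports/ folder and previews any JSON files it finds. The folder name comes from this README, but the assumption that exports sit there as plain .json files, and whatever schema they contain, depends on your export settings.

```python
# Illustrative sketch: preview JSON exports found under Exports/.
# The folder name is from the README; file names and schema are assumptions.
import json
from pathlib import Path

exports_dir = Path("Exports")

if not exports_dir.is_dir():
    print("No Exports/ folder found; run a crawl and export something first.")
else:
    for path in sorted(exports_dir.glob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Works whether the export is a list of records or a single object.
        count = len(data) if isinstance(data, (list, dict)) else 1
        print(f"{path.name}: {count} top-level item(s)")
```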
You can configure the crawler through the Settings menu or by directly editing the settings.json file.
Key settings:
- respect_robots_txt: Whether to respect robots.txt rules
- crawl_delay: Delay between requests in seconds
- max_pages: Maximum number of pages to crawl before asking to continue
- export_formats: Default export formats
- use_proxies: Whether to use proxies
- proxies: List of proxy URLs to use
- rotate_user_agents: Whether to rotate user agents
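
If you prefer scripting over the Settings menu, the sketch below edits settings.json with Python's standard json module. The key names are the ones listed above; the values (and the export format names) are illustrative assumptions, not the project's actual defaults.

```python
# Illustrative sketch: tweak settings.json from a script instead of the menu.
# Key names come from this README; the values below are assumptions, not defaults.
import json
from pathlib import Path

settings_path = Path("settings.json")  # created by the setup script
settings = json.loads(settings_path.read_text(encoding="utf-8"))

settings["respect_robots_txt"] = True          # obey robots.txt rules
settings["crawl_delay"] = 2                    # seconds between requests
settings["max_pages"] = 500                    # ask before crawling past this
settings["rotate_user_agents"] = True          # vary the User-Agent header
settings["use_proxies"] = False                # set True and fill in "proxies"
settings["export_formats"] = ["markdown", "json"]  # assumed format names

settings_path.write_text(json.dumps(settings, indent=2), encoding="utf-8")
```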
This tool is provided for educational and research purposes only. Always respect website terms of service and robots.txt rules. The authors are not responsible for any misuse of this software.
MIT License
Contributions are welcome! Please feel free to submit a Pull Request.
Created by KenKaiii