A powerful and user-friendly Python tool that downloads web documentation and converts it to clean Markdown format! Perfect for offline reading, documentation migration, or content analysis.
## Features

- Multiple crawling methods:
  - Sitemap-based crawling (auto-detects common sitemap locations)
  - Recursive link-following for sites without sitemaps
  - Custom sitemap URL support
  - Support for sitemap indexes and nested sitemaps
- Converts HTML to clean Markdown format
- Maintains documentation structure with proper directory hierarchy
- Shows real-time progress with progress bars
- Respects rate limiting and robots.txt rules
- Smart error handling and detailed logging
- Organized output with clean filenames
- User-friendly command-line interface with clear prompts
- Command-line argument support for automation/scripting
- Configurable maximum page count and delay between requests
## Installation

- Clone or download this repository

- Create a virtual environment:

  ```
  # On Windows
  python -m venv venv
  .\venv\Scripts\activate

  # On macOS/Linux
  python -m venv venv
  source venv/bin/activate
  ```

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```
## Configuration Options

### Delay Between Requests

- Purpose: Controls the time interval between consecutive requests to avoid overloading the target server.
- Default: 1.0 second
- Usage:
  - Command line: `--delay 2.5` (in seconds)
  - Interactive mode: enter a value when prompted
- Recommendation: Use higher values (2-3 seconds) for smaller servers and lower values (0.5-1 second) for robust sites.
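Under the hood, a per-request delay typically amounts to nothing more than sleeping between fetches. Here is a minimal sketch of the idea; the function and variable names are illustrative, not the tool's actual internals:

```python
import time

import requests

def fetch_all(urls, delay=1.0):
    """Fetch each URL in turn, sleeping `delay` seconds between requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # rate limit: wait before every request after the first
        response = requests.get(url, timeout=30)
        pages[url] = response.text
    return pages
```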
### Maximum Pages

- Purpose: Sets the maximum number of pages to download, preventing unintended large-scale crawling.
- Usage:
  - Command line: `--max-pages 100`
  - Interactive mode: enter a value when prompted, or leave empty for no limit
- Note: Setting this appropriately helps control execution time and output size.
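A page cap usually just bounds the crawl loop. A hedged sketch of how such a limit might work, assuming a breadth-first crawl (all names here are illustrative):

```python
from collections import deque

def crawl(start_url, get_links, max_pages=None):
    """Breadth-first crawl that stops once `max_pages` pages have been visited.

    `get_links` is assumed to return the in-domain links found on a page.
    """
    queue, seen = deque([start_url]), set()
    while queue:
        if max_pages is not None and len(seen) >= max_pages:
            break  # hit the page limit; remaining URLs stay pending
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        queue.extend(get_links(url))
    return seen
```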
### Robots.txt Compliance

- Purpose: Determines whether the crawler should respect robots.txt restrictions.
- Default: Enabled (respects robots.txt)
- Usage:
  - Command line: use `--no-robots` to disable
  - Interactive mode: answer 'n' when prompted
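For illustration, this kind of check can be written with the standard library's `urllib.robotparser`; the tool itself lists a third-party robots.txt parser among its dependencies, so treat this as an equivalent sketch rather than its actual code:

```python
from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt once, then consult it per URL.
rp = RobotFileParser("https://docs.example.com/robots.txt")
rp.read()

url = "https://docs.example.com/getting-started"
if rp.can_fetch("*", url):
    print("allowed:", url)
else:
    print("disallowed by robots.txt:", url)
```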
### Crawling Method

- Purpose: Determines how the tool discovers pages to download.
- Options:
  - Auto-detect sitemap (`--method auto`): fastest when sitemaps are available
  - Recursive crawling (`--method recursive`): most thorough but slower
  - Custom sitemap URL (`--method sitemap --sitemap URL`): best for known sitemap locations
- Note: Choose based on the structure of the documentation site and your specific needs.
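Auto-detection generally means probing a few conventional sitemap locations before falling back to recursive crawling. A minimal sketch of that idea; the exact paths the tool tries may differ:

```python
import requests

# Common sitemap locations (illustrative; the tool's own list may differ).
COMMON_SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap/sitemap.xml"]

def find_sitemap(base_url):
    """Return the first common sitemap URL that responds with HTTP 200, else None."""
    for path in COMMON_SITEMAP_PATHS:
        url = base_url.rstrip("/") + path
        try:
            response = requests.head(url, timeout=10, allow_redirects=True)
            if response.status_code == 200:
                return url
        except requests.RequestException:
            continue
    return None  # no sitemap found; caller can fall back to recursive crawling
```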
### Output Directory

- Purpose: Specifies where the converted Markdown files will be saved.
- Default: `markdown_docs`
- Usage:
  - Command line: `--output custom_folder_name`
  - Interactive mode: enter a value when prompted
## Usage

- Activate the virtual environment (if not already activated):

  ```
  # On Windows
  .\venv\Scripts\activate

  # On macOS/Linux
  source venv/bin/activate
  ```

- Run the script:

  ```
  python main.py
  ```
- Follow the interactive prompts:
  - Enter the documentation base URL
  - Choose your preferred crawling method:
    - Auto-detect sitemap.xml (tries common locations)
    - Recursive crawling (follows links within the domain)
    - Enter a custom sitemap URL
  - Choose an output directory for the Markdown files
  - Set optional parameters such as the delay between requests
You can also run the script with command-line arguments for automation:
```
python main.py --url https://docs.example.com --output docs_output --method recursive --delay 1.5 --max-pages 100 --no-robots
```

Available arguments:

- `--url`: Base URL of the documentation
- `--output`: Output directory name (default: `markdown_docs`)
- `--method`: Crawling method (`auto`/`recursive`/`sitemap`)
- `--sitemap`: Custom sitemap URL (required if `--method sitemap`)
- `--delay`: Delay between requests in seconds (default: 1.0)
- `--max-pages`: Maximum number of pages to download
- `--no-robots`: Ignore robots.txt restrictions
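These flags map naturally onto Python's `argparse`. Here is a sketch of how such an interface might be declared; it mirrors the documented options, not necessarily the script's exact code:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Download web documentation and convert it to Markdown"
)
parser.add_argument("--url", help="Base URL of the documentation")
parser.add_argument("--output", default="markdown_docs", help="Output directory name")
parser.add_argument("--method", choices=["auto", "recursive", "sitemap"], default="auto",
                    help="Crawling method")
parser.add_argument("--sitemap", help="Custom sitemap URL (required if --method sitemap)")
parser.add_argument("--delay", type=float, default=1.0,
                    help="Delay between requests in seconds")
parser.add_argument("--max-pages", type=int, default=None,
                    help="Maximum number of pages to download")
parser.add_argument("--no-robots", action="store_true",
                    help="Ignore robots.txt restrictions")
args = parser.parse_args()
```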
Example interactive session:

```
$ python main.py

╔═══════════════════════════════════════════╗
║      Documentation Downloader v1.0        ║
║        Convert Docs to Markdown           ║
╚═══════════════════════════════════════════╝

Welcome to Documentation Downloader!
This tool will help you convert web documentation to Markdown format.

Enter the base documentation URL: https://docs.example.com

Choose crawling method:
1. Auto-detect sitemap.xml
2. Recursive crawling (follows links)
3. Enter custom sitemap URL
Enter choice (1/2/3): 2

Enter output directory name [markdown_docs]: my_docs
Enter delay between requests in seconds [1.0]: 2
Maximum number of pages to download (leave empty for no limit): 50
Respect robots.txt restrictions? (y/n) [y]: y

Starting documentation download...
Downloading documentation: 100%|██████████| 42/42 [01:24<00:00]
Pages: 42, Pending: 13

Success! Documentation has been downloaded and converted.
You can find the Markdown files in the 'my_docs' directory.
```

The downloaded documentation maintains its original structure:
```
my_docs/
├── index.md
├── getting-started/
│   ├── installation.md
│   └── configuration.md
├── guides/
│   ├── basic-usage.md
│   └── advanced-features.md
└── api/
    └── reference.md
```
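This layout follows from mapping each page's URL path onto the output directory. A hedged sketch of that mapping; the sanitization rules here are illustrative, not the tool's exact ones:

```python
import re
from pathlib import Path
from urllib.parse import urlparse

def output_path(url, output_dir="markdown_docs"):
    """Map a page URL to a .md file path that mirrors the site's structure."""
    path = urlparse(url).path.strip("/")
    if not path:
        path = "index"  # the site root becomes index.md
    path = re.sub(r"\.html?$", "", path)           # drop .html/.htm extensions
    path = re.sub(r"[^A-Za-z0-9/._-]", "_", path)  # replace unsafe filename characters
    return Path(output_dir) / f"{path}.md"

print(output_path("https://docs.example.com/getting-started/installation.html"))
# -> markdown_docs/getting-started/installation.md
```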
Each Markdown file includes:
- Clean, readable content
- Original formatting preserved
- YAML frontmatter with:
  - Original title
  - Source URL
  - Download timestamp
Example Markdown file:

```markdown
---
title: Getting Started Guide
source_url: https://docs.example.com/getting-started
date_downloaded: 2024-03-14 11:20:15
---

# Getting Started

Rest of the converted content...
```

The script creates a `crawler.log` file with detailed information about the download process, which is helpful for debugging any issues.
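For a sense of how frontmatter like the example above can be assembled around the html2text output, here is an illustrative helper; the field names follow the example file, but the function itself is a sketch, not the tool's actual code:

```python
from datetime import datetime

import html2text

def to_markdown(html, title, source_url):
    """Convert an HTML page to Markdown topped with a YAML frontmatter header."""
    converter = html2text.HTML2Text()
    converter.body_width = 0  # don't hard-wrap converted lines
    body = converter.handle(html)
    frontmatter = (
        "---\n"
        f"title: {title}\n"
        f"source_url: {source_url}\n"
        f"date_downloaded: {datetime.now():%Y-%m-%d %H:%M:%S}\n"
        "---\n\n"
    )
    return frontmatter + body
```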
The tool respects robots.txt rules by default, but you can disable this with the --no-robots flag or by answering "n" to the robots.txt prompt.
The tool can handle both standard sitemaps and sitemap indexes (which contain links to multiple sitemaps).
The tool provides detailed error handling and logging, with graceful fallbacks when issues occur.
- Choose the appropriate crawling method:
  - Sitemap-based: faster and more efficient when a sitemap is available
  - Recursive: more thorough but slower; great for sites without sitemaps
- Respect website terms of service and robots.txt
- Use reasonable delays between requests (default: 1 second)
- Some websites may block automated downloads
- Large documentation sites may take significant time to download
Contributions are welcome! Feel free to:
- Report issues
- Suggest improvements
- Submit pull requests
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with Python and lots of ❤️
- Uses excellent libraries:
  - beautifulsoup4 for HTML parsing
  - html2text for conversion
  - tqdm for progress bars
  - requests for HTTP requests
  - validators for URL validation
  - python-robots for robots.txt parsing
- Inspired by the need for offline documentation access