Extract ALL links from web applications - A powerful CLI tool built with Python, using Typer for the CLI interface and Scrapling for advanced scraping capabilities with anti-bot detection.
The primary goal of this tool is to comprehensively extract every link found on web applications, with advanced filtering, source detection, and analysis capabilities.
# Install with pipx (recommended)
pipx install .
# Extract all links from a single page
scrape_this scrape https://example.com --depth 0
# Crawl entire website and extract ALL links
scrape_this scrape https://example.com --depth 3
# Extract only PDF files across a website
scrape_this scrape https://example.com -e "pdf" --links-only

- Comprehensive Link Extraction: Extract ALL links from web applications with a single command
- Source Detection: Automatically identify where each link was found (navigation, header, footer, main content, sidebar, etc.)
- Smart Filtering: Filter by internal/external domains, link types, file extensions
- Multiple Output Formats: Export to JSON, CSV, HTML, or view as formatted tables
- Link Analysis: Automatically categorize links by type (images, documents, pages, media, etc.)
- Link Validation: Check if links are accessible with HTTP status codes (optional)
- JavaScript Support: Render JavaScript-heavy SPAs with headless browser automation
- Recursive Crawling: Crawl entire websites with depth control and page limits (a minimal illustrative sketch follows this list)
- Comment Extraction: Extract HTML and JavaScript comments for security analysis
- Rich Terminal Output: Beautiful progress indicators and formatted results
- Authentication Support: Custom headers for APIs and authenticated endpoints
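Conceptually, the recursive crawl is a breadth-first traversal bounded by the requested depth and page limit. The sketch below illustrates that model with only the Python standard library; it is not scrape_this's actual implementation (which builds on Scrapling), and the function names are invented for the example.

```python
# Illustrative only: depth- and page-limited breadth-first link collection.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url, depth=2, max_pages=50):
    start_host = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])  # (url, depth at which it was found)
    visited, links = set(), []
    while queue and len(visited) < max_pages:
        url, current_depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            absolute = urljoin(url, href)
            links.append({"url": absolute, "found_on_page": url})
            # Only queue same-host pages, and only while under the depth limit.
            if current_depth < depth and urlparse(absolute).netloc == start_host:
                queue.append((absolute, current_depth + 1))
    return links
```

Calling crawl("https://example.com", depth=1) corresponds roughly to scrape_this scrape https://example.com --depth 1.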
pipx installs the tool in an isolated environment and makes it available globally:
# Install pipx if you haven't already
python -m pip install --user pipx
python -m pipx ensurepath
# Install scrape_this
pipx install .
# Or install from a git repository
pipx install git+https://github.com/yourusername/scrape_this.git

After installation, you can use scrape_this directly from anywhere:
scrape_this scrape https://example.com
scrape_this spider https://example.com --depth 2
scrape_this --help

Alternatively, install with pip:

pip install .
# Or install from requirements file for development
pip install -r requirements.txt

For development or if you want to modify the code:
git clone https://github.com/yourusername/scrape_this.git
cd scrape_this
pip install -e .

Verify the installation:

scrape_this version

For JavaScript rendering (--js flag), you need to install browser dependencies:
# Install browser dependencies for JavaScript support
scrapling install
# Or install Camoufox manually
pip install camoufox
camoufox fetch

If JavaScript mode is not set up, the tool will automatically fall back to regular HTTP fetching with a warning.
# Extract all links from a webpage
scrape_this links https://example.com
# Get only internal links as JSON
scrape_this links https://example.com --internal-only -f json
# Get only external links and save to CSV
scrape_this links https://example.com --external-only -o external_links.csv
# Get only subdomain links
scrape_this links https://example.com --subdomains-only
# Include subdomains as internal links
scrape_this links https://example.com --include-subdomains --internal-only
# See detailed progress and link breakdown
scrape_this links https://example.com --show-progress
# Filter by file types (images, documents, etc.)
scrape_this links https://example.com --filter images -f json
scrape_this links https://example.com --filter "pdf,doc,docx" -o documents.csv
# Filter by specific file extensions (Scrapy-inspired)
scrape_this links https://example.com --extensions "pdf,zip,exe"
scrape_this links https://example.com -e "docx,pptx,xlsx" -o office_files.csv
# Validate links (check if accessible)
scrape_this links https://example.com --validate --internal-only
# Extract links with JavaScript rendering
scrape_this links https://example.com --js
# Include duplicate links
scrape_this links https://example.com --allow-duplicates

Each extracted link includes a source field indicating where it was found on the page:
| Source Type | Description | Common Examples |
|---|---|---|
| navigation | Main navigation menus | Home, About, Products, Services |
| header | Page headers and top sections | Login, Register, Account links |
| footer | Page footers | Privacy Policy, Terms, Contact |
| main_content | Article/content areas | "Read more", inline article links |
| sidebar | Sidebar sections | Related articles, categories |
| breadcrumb | Breadcrumb navigation | Home > Category > Page |
| content | General page content | Any other content links |
| unknown | Could not be determined | Dynamically loaded content |
Example output with source detection:
{
"url": "https://example.com/privacy",
"text": "Privacy Policy",
"source": "footer",
"is_internal": true,
"link_type": "page"
}
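The source value can be thought of as a heuristic over the link's ancestor elements (nav, header, footer, and so on). The snippet below is a rough sketch of that idea, using BeautifulSoup purely for illustration; it is not the tool's actual detection logic.

```python
# Illustrative heuristic: classify a link by its nearest meaningful ancestor.
from bs4 import BeautifulSoup

ANCESTOR_SOURCES = {
    "nav": "navigation",
    "header": "header",
    "footer": "footer",
    "main": "main_content",
    "article": "main_content",
    "aside": "sidebar",
}


def guess_source(a_tag):
    for parent in a_tag.parents:
        if parent.name in ANCESTOR_SOURCES:
            return ANCESTOR_SOURCES[parent.name]
        if "breadcrumb" in " ".join(parent.get("class", [])):
            return "breadcrumb"
    return "content"


soup = BeautifulSoup("<footer><a href='/privacy'>Privacy Policy</a></footer>", "html.parser")
print(guess_source(soup.find("a")))  # -> footer
```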
### Primary Link Extraction (Default Mode)
# Extract all links from a single page
scrape_this scrape https://example.com --depth 0
# Crawl entire website and gather ALL links
scrape_this scrape https://example.com --depth 3 --max-pages 100
# Find all downloadable files across website
scrape_this scrape https://example.com -e "pdf,docx,zip,exe" --links-only
# Get all external links for SEO analysis
scrape_this scrape https://mysite.com --external-only --validate -f csv
# JavaScript-heavy app with full analysis
scrape_this scrape https://spa.example.com --js --include-comments --depth 2

# Extract page content instead of links
scrape_this scrape https://example.com --content-only
# Extract specific elements with CSS selector
scrape_this scrape https://example.com --selector "h1, h2" -f json
# JavaScript rendering with screenshot
scrape_this scrape https://example.com --content-only --js --screenshot page.png
# Extract comments from source code
scrape_this scrape https://example.com --content-only --include-comments --comment-type javascript

# Include comments in link extraction
scrape_this scrape https://example.com --include-comments --depth 2
# Extract comments across entire website during crawling
scrape_this scrape https://example.com --include-comments --depth 3 --comment-type javascript
# Get only HTML comments with content extraction
scrape_this scrape https://example.com --content-only --include-comments --comment-type html -f json
# Filter JavaScript comments by minimum length
scrape_this scrape https://spa-site.com --content-only --include-comments --comment-type javascript --min-comment-length 20

Create a rules file (e.g., rules.json):
{
"title": {"selector": "h1", "attribute": "text"},
"price": {"selector": ".price", "attribute": "text"},
"image": {"selector": "img.product", "attribute": "src"},
"features": {"selector": "li.feature", "attribute": "text", "all": true}
}

Then extract data:
scrape_this extract https://shop.example.com/product rules.json -o product.json
# With JavaScript rendering and custom headers
scrape_this extract https://spa-site.com/data rules.json --js --wait 2.0 --headers '{"X-API-Key": "secret"}'

# Custom headers
scrape_this scrape https://api.example.com --headers '{"Authorization": "Bearer TOKEN"}'
# Limit results
scrape_this scrape https://example.com -s ".item" --limit 10
# Export as CSV
scrape_this scrape https://example.com -s "table tr" -f csv -o data.csvscrape: π Extract ALL links from web applications (primary command)links: Extract links from a single webpage with advanced filteringextract: Extract structured data using custom rulesversion: Show version information
Output formats:
- table: Rich formatted table (default for links command) - Great for terminal viewing
- json: JSON format - Perfect for programmatic processing
- csv: CSV format - Ideal for spreadsheet analysis
- html: HTML table format - For web reports

Example JSON output from a crawl:
{
"crawl_info": {
"start_url": "https://example.com",
"pages_crawled": 5,
"max_depth": 2,
"total_links": 127,
"files_found": 15
},
"links": [
{
"url": "https://example.com/docs/guide.pdf",
"text": "Download User Guide",
"domain": "example.com",
"path": "/docs/guide.pdf",
"is_internal": true,
"is_subdomain": false,
"link_type": "document",
"source": "main_content",
"original_href": "/docs/guide.pdf",
"found_on_page": "https://example.com/support"
},
{
"url": "https://cdn.example.com/images/logo.png",
"text": "",
"domain": "cdn.example.com",
"is_internal": false,
"is_subdomain": true,
"link_type": "image",
"source": "header",
"found_on_page": "https://example.com"
}
]
}

The tool automatically classifies links by domain relationship:
- Internal: Same domain as the source (including www variants)
- Subdomain: Subdomains of the source domain (e.g., api.example.com when the source is example.com)
- External: Different domains entirely
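As a rough sketch of the idea (a simple hostname comparison with urllib.parse; the tool's real rules may be more nuanced):

```python
# Illustrative only: classify a link relative to the page it was found on.
from urllib.parse import urlparse


def strip_www(host):
    return host[4:] if host.startswith("www.") else host


def classify_domain(link_url, source_url):
    link_host = strip_www(urlparse(link_url).netloc.lower())
    source_host = strip_www(urlparse(source_url).netloc.lower())
    if link_host == source_host:
        return "internal"
    if link_host.endswith("." + source_host):
        return "subdomain"
    return "external"


print(classify_domain("https://www.example.com/about", "https://example.com"))  # internal
print(classify_domain("https://api.example.com/v1", "https://example.com"))     # subdomain
print(classify_domain("https://other.org/page", "https://example.com"))         # external
```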
The links command automatically categorizes links into types:
- page: Regular web pages (no file extension)
- image: Image files (.jpg, .png, .gif, .svg, etc.)
- document: Documents (.pdf, .doc, .docx, .xls, etc.)
- video: Video files (.mp4, .avi, .mkv, etc.)
- audio: Audio files (.mp3, .wav, .flac, etc.)
- archive: Compressed files (.zip, .rar, .tar, etc.)
- code: Code files (.js, .css, .json, .html, etc.)
- api: API endpoints (contain 'api' or query parameters)
- other: Everything else
You can also use these convenient filter groups:
- images: All image types
- documents: All document types
- media: Video and audio files
- files: All downloadable files
- pages: Regular web pages
- code: Code and markup files
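A minimal sketch of how links could be bucketed into these types by file extension (illustrative; the extension table here is abbreviated and may not match the tool's exact mapping):

```python
# Illustrative extension-based link typing, not the tool's exact table.
import os
from urllib.parse import urlparse

TYPE_BY_EXTENSION = {
    "image": {".jpg", ".jpeg", ".png", ".gif", ".svg"},
    "document": {".pdf", ".doc", ".docx", ".xls", ".xlsx"},
    "video": {".mp4", ".avi", ".mkv"},
    "audio": {".mp3", ".wav", ".flac"},
    "archive": {".zip", ".rar", ".tar"},
    "code": {".js", ".css", ".json", ".html"},
}


def link_type(url):
    parsed = urlparse(url)
    if "api" in parsed.path.lower() or parsed.query:
        return "api"
    ext = os.path.splitext(parsed.path)[1].lower()
    for kind, extensions in TYPE_BY_EXTENSION.items():
        if ext in extensions:
            return kind
    return "page" if not ext else "other"


print(link_type("https://example.com/docs/guide.pdf"))  # -> document
print(link_type("https://example.com/about"))           # -> page
```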
# Extract all downloadable files from a website
scrape_this links https://example.com --filter files -o all_files.json
# Find all PDF documents
scrape_this links https://research-site.com --filter pdf -f csv
# Get all external links for SEO analysis
scrape_this links https://mysite.com --external-only -o external_links.csv
# Extract all images from a gallery
scrape_this links https://gallery.example.com --filter images --validate
# Find internal navigation links
scrape_this links https://example.com --internal-only --filter pages
# Extract all media files (audio + video)
scrape_this links https://media-site.com --filter media -f json
# Get all subdomain links (API endpoints, CDNs, etc.)
scrape_this links https://example.com --subdomains-only --show-progress
# Comprehensive link analysis with progress details
scrape_this links https://example.com --show-progress -o all_links.json

# Find all external links for backlink analysis
scrape_this scrape https://mysite.com --external-only --validate -o external_links.csv
# Analyze internal link structure
scrape_this scrape https://mysite.com --internal-only --depth 3 -o site_structure.json
# Find orphaned pages (pages with no internal links)
scrape_this scrape https://mysite.com --depth 5 --internal-only -f json

# Find all PDFs on a website
scrape_this scrape https://example.com -e "pdf" --links-only -o all_pdfs.csv
# Locate all media files (images, videos, audio)
scrape_this scrape https://example.com --filter media --depth 3
# Find all downloadable documents
scrape_this scrape https://docs.example.com --filter documents --max-pages 200

# Extract all JavaScript comments for security review
scrape_this scrape https://app.example.com --include-comments --comment-type javascript
# Find all API endpoints
scrape_this scrape https://example.com --filter api --depth 2
# Discover all form actions and endpoints
scrape_this scrape https://example.com --selector "form" --content-only

# Complete site inventory before migration
scrape_this scrape https://old-site.com --depth 10 --max-pages 1000 -o full_inventory.json
# Find all image assets for migration
scrape_this scrape https://old-site.com --filter images --links-only -o all_images.csv
# Identify all subdomains and external dependencies
scrape_this scrape https://example.com --subdomains-only --depth 3

pipx is the recommended way to install CLI tools because it:
- Isolated environments: Each tool gets its own virtual environment
- Global access: Tools are available from anywhere on your system
- Easy updates: pipx upgrade scrape_this to update
- Easy removal: pipx uninstall scrape_this to remove completely
- No conflicts: Dependencies don't interfere with your system Python
# If installed with pipx
pipx uninstall scrape_this
# If installed with pip
pip uninstall scrape_this

Best practices:
- Start Small: Test with --depth 0 or --depth 1 before full crawls
- Use Filters Early: Apply --internal-only or extension filters to reduce processing
- Set Page Limits: Use --max-pages to prevent infinite crawls
- Skip JavaScript: Only use --js when necessary (it's slower)
- Export Formats: Use JSON for further processing, CSV for spreadsheets
- Respect robots.txt: Check site policies before crawling
- Use Rate Limiting: Don't overwhelm servers with rapid requests
- Filter Smartly: Combine filters for precise results (e.g., --internal-only -e "pdf")
- Validate Selectively: Use --validate only when needed (it's slower)
- Save Raw Data: Export to JSON first, then filter/analyze offline (see the sketch after this list)
- Monitor Progress: Use --show-progress for long crawls
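For the "export to JSON first, analyze offline" workflow, here is an assumed sketch that post-processes an exported crawl with pandas (a listed dependency). The field names follow the example output above, and the file name is simply the one used in the migration example; treat it as illustrative rather than canonical.

```python
# Illustrative offline analysis of an exported crawl (e.g. full_inventory.json).
import json

import pandas as pd

with open("full_inventory.json") as f:
    data = json.load(f)

links = pd.DataFrame(data["links"])

# Summarize by link type and by the page section each link was found in.
print(links["link_type"].value_counts())
print(links.groupby("source").size())

# Keep only internal PDF links and hand them off as CSV.
pdfs = links[links["is_internal"] & links["url"].str.lower().str.endswith(".pdf")]
pdfs.to_csv("internal_pdfs.csv", index=False)
```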
# Combine multiple filters for precise extraction
scrape_this scrape https://example.com \
--internal-only \
--filter documents \
--extensions "pdf,docx" \
--max-pages 100 \
--validate \
-o important_docs.json
# Use custom headers for authenticated scraping
scrape_this scrape https://api.example.com \
--headers '{"Authorization": "Bearer YOUR_TOKEN"}' \
--depth 2 \
-o api_endpoints.json

Requirements:
- Python 3.8+
- typer[all]
- scrapling
- rich
- pandas
To contribute:
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes
- Run tests: python -m pytest (if available)
- Submit a pull request
License: MIT