A flexible and powerful web crawler that extracts metadata, titles, and page content in structured Markdown format. This web crawler provides full proxy control, enabling stable and anonymous large-scale crawling across any website. Designed for developers, analysts, and businesses needing reliable website content extraction with complete configurability.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for a web crawler, you've just found your team. Let’s Chat. 👆👆
This project crawls websites and extracts essential on-page data such as titles, metadata, and Markdown-formatted body content. It solves the challenges of collecting structured content from multi-page websites while giving users full control over proxy usage and crawling behavior. Perfect for researchers, SEO professionals, content analysts, and developers needing automated content discovery.
- Crawls a website and follows internal links to collect all accessible pages.
- Extracts metadata, OpenGraph fields, and SEO-relevant attributes.
- Converts page content to clean, readable Markdown.
- Supports user-provided proxies for anonymity and anti-blocking.
- Works from a single start URL and scales across large sites (see the usage sketch below).
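As a rough sketch of how a run could be wired up, the snippet below starts a crawl from a single URL with an optional proxy. The `crawl` function signature and the option names (`startUrl`, `proxyUrls`, `maxPages`) are illustrative assumptions, not the project's documented API; the real entry point lives in `src/crawler.js` and the supported options are listed in `src/config/settings.example.json`.

```js
// Hypothetical usage sketch -- crawl() and the option names below are
// illustrative assumptions, not the project's documented API.
const { crawl } = require('./src/crawler');

async function main() {
  const results = await crawl({
    startUrl: 'https://example.com',                          // single entry point; internal links are followed
    proxyUrls: ['http://user:pass@proxy.example.com:8000'],   // optional, recommended for large sites
    maxPages: 500,                                            // stop after this many pages
  });

  // Each result holds page_url, title, metadata, and Markdown content.
  console.log(JSON.stringify(results[0], null, 2));
}

main().catch(console.error);
```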
| Feature | Description |
|---|---|
| Full-site crawling | Automatically discovers and crawls all internal pages of the target website. |
| Metadata extraction | Collects meta tags, descriptions, keywords, and SEO markers. |
| Markdown content output | Converts raw HTML into structured Markdown for readability and portability. |
| Proxy support | Allows users to connect via their own proxy servers for anonymity and safety. |
| Customizable crawling options | Adjust settings to match the site structure or reduce load (see the configuration sketch below the table). |
| Multi-format export | Download results as JSON, CSV, or other common formats. |
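To give a feel for what the customizable options might cover, the sketch below loads the example settings file and overrides a few fields before a run. The option names (`maxDepth`, `requestDelayMs`, `exportFormat`) are assumptions for illustration only; `src/config/settings.example.json` defines the fields the crawler actually reads.

```js
// Illustrative only -- the option names below are assumptions, not the documented schema.
const fs = require('fs');

const defaults = JSON.parse(
  fs.readFileSync('./src/config/settings.example.json', 'utf8')
);

const settings = {
  ...defaults,
  maxDepth: 3,            // how far to follow internal links from the start URL
  requestDelayMs: 500,    // throttle requests to reduce load on the target site
  exportFormat: 'csv',    // e.g. 'json' or 'csv'
};
```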
| Field Name | Field Description |
|---|---|
| page_url | URL of the crawled page. |
| title | Page title extracted from the `<title>` tag. |
| metadata | Serialized metadata fields including keywords and descriptions. |
| content | The page’s readable content converted to Markdown. |
Example: [ { "page_url": "http://www.FITaxPlanning.com/taxcenter2.php", "title": "Placentia, CA Accounting Firm | Tax Center Page | Financial Insight Tax Planning, Inc.", "metadata": "{"viewport": "width=device-width, initial-scale=1.0", "description": "Take a look at our Tax Center page.", "keywords": "QuickBooks, CPA, Tax Preparation"}", "content": "## FITax Planning, Inc.\n\n * Home\n * About\n ...\n\n# Tax Center\n\n## Days until April 15\n\nDeadline for Individuals and C-Corporations\n..." } ]
web-crawler-scraper/
├── src/
│ ├── crawler.js
│ ├── extractors/
│ │ ├── metadata_parser.js
│ │ ├── markdown_converter.js
│ │ └── link_resolver.js
│ ├── utils/
│ │ ├── proxy_manager.js
│ │ └── request_handler.js
│ ├── outputs/
│ │ └── exporters.js
│ └── config/
│ └── settings.example.json
├── data/
│ ├── sample_input.json
│ └── sample_output.json
├── docs/
│ └── reference.md
├── tests/
│ ├── crawler.test.js
│ └── parser.test.js
├── LICENSE
├── package.json
└── README.md
- SEO specialists use it to extract titles and metadata, enabling them to audit website content for optimization opportunities.
- Researchers use the crawler to collect structured content, so they can analyze topics, patterns, and textual features.
- Developers integrate it to automate full-site documentation extraction, helping them migrate or refactor content-heavy websites.
- Businesses use it to monitor competitor site changes, empowering them to track updates and strategic shifts.
- Content teams use the crawler to convert HTML websites into Markdown, allowing them to repurpose content across platforms.
Q: Can it crawl an entire website from a single starting URL? Yes. The crawler automatically discovers and follows internal links until all accessible pages are processed.
Q: Do I need to use a proxy? Proxies are optional but highly recommended for large websites or sites with rate limits.
Q: What proxy formats are supported? Any standard HTTP/HTTPS proxy URL works, including residential, datacenter, mobile, rotating, and static proxies (see the example below).
Q: In what formats can I download the results? Results can be exported in JSON, CSV, or accessed programmatically.
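For reference, a standard proxy URL follows the `scheme://user:password@host:port` pattern. How the list is passed to the crawler (a `proxyUrls` array is assumed here, as in the usage sketch above) depends on your configuration.

```js
// Typical proxy URL shapes -- hostnames and credentials are placeholders.
const proxyUrls = [
  'http://user:pass@proxy.example.com:8000',   // authenticated residential or datacenter proxy
  'http://rotating.example.net:9000',          // rotating gateway without credentials
];
```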
- Primary Metric: Typically crawls medium-sized sites (100–200 pages) in minutes with high throughput.
- Reliability Metric: Maintains a consistent crawling success rate across websites with varying structure and depth.
- Efficiency Metric: Optimized request handling reduces unnecessary loads and minimizes proxy usage.
- Quality Metric: Produces high-fidelity Markdown content with accurate metadata extraction and minimal formatting artifacts.
