
Web Crawler

A flexible, configurable web crawler that extracts metadata, titles, and page content as structured Markdown. It provides full proxy control for stable, anonymous large-scale crawling across any website, and is designed for developers, analysts, and businesses that need reliable website content extraction.

Bitbash Banner

Telegram   WhatsApp   Gmail   Website

Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for a web crawler, you've just found your team. Let's Chat. 👆👆

Introduction

This project crawls websites and extracts essential on-page data such as titles, metadata, and Markdown-formatted body content. It solves the challenges of collecting structured content from multi-page websites while giving users full control over proxy usage and crawling behavior. Perfect for researchers, SEO professionals, content analysts, and developers needing automated content discovery.

Full-Site Content Extraction

  • Crawls a website and follows internal links to collect all accessible pages (see the sketch after this list).
  • Extracts metadata, OpenGraph fields, and SEO-relevant attributes.
  • Converts page content to clean, readable Markdown.
  • Supports user-provided proxies for anonymity and anti-blocking.
  • Works with a single URL input and scales across large sites.
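
For intuition, here is a minimal sketch of that crawl loop: a breadth-first traversal that only follows links on the starting URL's origin. It is an illustrative assumption (Node 18+ global fetch, naive regex link extraction), not the project's actual crawler.js.

```js
// Minimal crawl-loop sketch: breadth-first traversal of internal links.
// Assumes Node 18+ (global fetch); regex link extraction is for brevity only.
async function crawl(startUrl, maxPages = 50) {
  const origin = new URL(startUrl).origin;
  const queue = [startUrl];
  const seen = new Set(queue);
  const pages = [];

  while (queue.length > 0 && pages.length < maxPages) {
    const url = queue.shift();
    const res = await fetch(url);
    const type = res.headers.get("content-type") ?? "";
    if (!res.ok || !type.includes("text/html")) continue;

    const html = await res.text();
    pages.push({ page_url: url, html });

    // Queue unseen same-origin links; a real crawler would use an HTML parser.
    for (const [, href] of html.matchAll(/href="([^"#]+)"/g)) {
      try {
        const link = new URL(href, url).href;
        if (link.startsWith(origin) && !seen.has(link)) {
          seen.add(link);
          queue.push(link);
        }
      } catch {
        // Ignore malformed hrefs.
      }
    }
  }
  return pages;
}

crawl("https://example.com/").then((p) => console.log(`Crawled ${p.length} pages`));
```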

Features

| Feature | Description |
| --- | --- |
| Full-site crawling | Automatically discovers and crawls all internal pages of the target website. |
| Metadata extraction | Collects meta tags, descriptions, keywords, and SEO markers. |
| Markdown content output | Converts raw HTML into structured Markdown for readability and portability. |
| Proxy support | Allows users to connect via their own proxy servers for anonymity and safety. |
| Customizable crawling options | Adjust settings to match site structure or reduce load. |
| Multi-format export | Download results as JSON, CSV, or other common formats. |
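
To illustrate the Markdown conversion step, the sketch below uses the turndown npm package; this is an assumption for demonstration, and the project's own markdown_converter.js may work differently.

```js
// Hypothetical HTML-to-Markdown conversion using the `turndown` package.
import TurndownService from "turndown";

const turndown = new TurndownService({ headingStyle: "atx" });
const markdown = turndown.turndown("<h1>Tax Center</h1><p>Days until April 15</p>");
console.log(markdown); // "# Tax Center\n\nDays until April 15"
```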

What Data This Scraper Extracts

| Field Name | Field Description |
| --- | --- |
| page_url | URL of the crawled page. |
| title | Page title extracted from `<title>`. |
| metadata | Serialized metadata fields including keywords and descriptions. |
| content | The page's readable content converted to Markdown. |
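
The sketch below shows how these fields might be assembled from a fetched page. It assumes the cheerio package for HTML parsing; the project's metadata_parser.js may take a different approach.

```js
// Hypothetical field extraction with `cheerio` (not the project's own parser).
import * as cheerio from "cheerio";

function extractFields(html, pageUrl) {
  const $ = cheerio.load(html);

  // Collect <meta> tags keyed by name (or property, for OpenGraph fields).
  const metadata = {};
  $("meta").each((_, el) => {
    const key = $(el).attr("name") ?? $(el).attr("property");
    const value = $(el).attr("content");
    if (key && value) metadata[key] = value;
  });

  return {
    page_url: pageUrl,
    title: $("title").text(),
    metadata: JSON.stringify(metadata), // serialized, as in the table above
  };
}
```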

Example Output

```json
[
  {
    "page_url": "http://www.FITaxPlanning.com/taxcenter2.php",
    "title": "Placentia, CA Accounting Firm | Tax Center Page | Financial Insight Tax Planning, Inc.",
    "metadata": "{\"viewport\": \"width=device-width, initial-scale=1.0\", \"description\": \"Take a look at our Tax Center page.\", \"keywords\": \"QuickBooks, CPA, Tax Preparation\"}",
    "content": "## FITax Planning, Inc.\n\n * Home\n * About\n ...\n\n# Tax Center\n\n## Days until April 15\n\nDeadline for Individuals and C-Corporations\n..."
  }
]
```


Directory Structure Tree

web-crawler-scraper/
├── src/
│   ├── crawler.js
│   ├── extractors/
│   │   ├── metadata_parser.js
│   │   ├── markdown_converter.js
│   │   └── link_resolver.js
│   ├── utils/
│   │   ├── proxy_manager.js
│   │   └── request_handler.js
│   ├── outputs/
│   │   └── exporters.js
│   └── config/
│       └── settings.example.json
├── data/
│   ├── sample_input.json
│   └── sample_output.json
├── docs/
│   └── reference.md
├── tests/
│   ├── crawler.test.js
│   └── parser.test.js
├── LICENSE
├── package.json
└── README.md

Use Cases

  • SEO specialists use it to extract titles and metadata, enabling them to audit website content for optimization opportunities.
  • Researchers use the crawler to collect structured content, so they can analyze topics, patterns, and textual features.
  • Developers integrate it to automate full-site documentation extraction, helping them migrate or refactor content-heavy websites.
  • Businesses use it to monitor competitor site changes, empowering them to track updates and strategic shifts.
  • Content teams use the crawler to convert HTML websites into Markdown, allowing them to repurpose content across platforms.

FAQs

Q: Can it crawl an entire website from a single starting URL? Yes. The crawler automatically discovers and follows internal links until all accessible pages are processed.

Q: Do I need to use a proxy? Proxies are optional but highly recommended for large websites or sites with rate limits.

Q: What proxy formats are supported? Any standard HTTP/HTTPS proxy URL works, including residential, datacenter, mobile, rotating, and static proxies.
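
As an illustration, a proxy URL could be wired into requests like this. The packages shown (node-fetch and https-proxy-agent) are assumptions for the sketch, not necessarily what src/utils/proxy_manager.js uses.

```js
// Hypothetical proxy wiring; PROXY_URL e.g. "http://user:pass@proxy.example.com:8080".
import fetch from "node-fetch";
import { HttpsProxyAgent } from "https-proxy-agent";

const proxyUrl = process.env.PROXY_URL;
const agent = proxyUrl ? new HttpsProxyAgent(proxyUrl) : undefined;

const res = await fetch("https://example.com/", { agent });
console.log(res.status);
```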

Q: In what formats can I download the results? Results can be exported in JSON, CSV, or accessed programmatically.
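
A minimal export sketch, assuming the four documented fields (not the project's actual exporters.js):

```js
// Writes crawl results to <basename>.json and a flat <basename>.csv.
import { writeFileSync } from "node:fs";

function exportResults(records, basename = "results") {
  writeFileSync(`${basename}.json`, JSON.stringify(records, null, 2));

  const header = ["page_url", "title", "metadata", "content"];
  const escape = (v) => `"${String(v ?? "").replaceAll('"', '""')}"`;
  const rows = records.map((r) => header.map((k) => escape(r[k])).join(","));
  writeFileSync(`${basename}.csv`, [header.join(","), ...rows].join("\n"));
}
```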


Performance Benchmarks and Results

  • Primary Metric: Typically crawls medium-sized sites (100–200 pages) in minutes with high throughput.
  • Reliability Metric: Maintains a consistent crawling success rate across websites with varying structure and depth.
  • Efficiency Metric: Optimized request handling reduces unnecessary loads and minimizes proxy usage.
  • Quality Metric: Produces high-fidelity Markdown content with accurate metadata extraction and minimal formatting artifacts.

Book a Call | Watch on YouTube

Review 1

“Bitbash is a top-tier automation partner, innovative, reliable, and dedicated to delivering real results every time.”

Nathan Pennington
Marketer
★★★★★

Review 2

“Bitbash delivers outstanding quality, speed, and professionalism, truly a team you can rely on.”

Eliza
SEO Affiliate Expert
★★★★★

Review 3

“Exceptional results, clear communication, and flawless delivery. Bitbash nailed it.”

Syed
Digital Strategist
★★★★★