
Web Crawler

A flexible, configurable web crawler that extracts metadata, titles, and page content as structured Markdown. It provides full proxy control for stable, anonymous large-scale crawling across any website, and is designed for developers, analysts, and businesses that need reliable website content extraction.

Bitbash Banner

Telegram   WhatsApp   Gmail   Website

Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for a web crawler, you've just found your team. Let's Chat. 👆👆

Introduction

This project crawls websites and extracts essential on-page data such as titles, metadata, and Markdown-formatted body content. It solves the challenges of collecting structured content from multi-page websites while giving users full control over proxy usage and crawling behavior. Perfect for researchers, SEO professionals, content analysts, and developers needing automated content discovery.

Full-Site Content Extraction

  • Crawls a website and follows internal links to collect all accessible pages (see the sketch after this list).
  • Extracts metadata, OpenGraph fields, and SEO-relevant attributes.
  • Converts page content to clean, readable Markdown.
  • Supports user-provided proxies for anonymity and anti-blocking.
  • Works with a single URL input and scales across large sites.
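
For intuition, here is a minimal sketch of that crawl loop: a breadth-first traversal that only follows links on the starting URL's origin. It is an illustrative assumption (Node 18+ global fetch, naive regex link extraction), not the project's actual crawler.js.

```js
// Minimal crawl-loop sketch: breadth-first traversal of internal links.
// Assumes Node 18+ (global fetch); regex link extraction is for brevity only.
async function crawl(startUrl, maxPages = 50) {
  const origin = new URL(startUrl).origin;
  const queue = [startUrl];
  const seen = new Set(queue);
  const pages = [];

  while (queue.length > 0 && pages.length < maxPages) {
    const url = queue.shift();
    const res = await fetch(url);
    const type = res.headers.get("content-type") ?? "";
    if (!res.ok || !type.includes("text/html")) continue;

    const html = await res.text();
    pages.push({ page_url: url, html });

    // Queue unseen same-origin links; a real crawler would use an HTML parser.
    for (const [, href] of html.matchAll(/href="([^"#]+)"/g)) {
      try {
        const link = new URL(href, url).href;
        if (link.startsWith(origin) && !seen.has(link)) {
          seen.add(link);
          queue.push(link);
        }
      } catch {
        // Ignore malformed hrefs.
      }
    }
  }
  return pages;
}

crawl("https://example.com/").then((p) => console.log(`Crawled ${p.length} pages`));
```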

Features

| Feature | Description |
| --- | --- |
| Full-site crawling | Automatically discovers and crawls all internal pages of the target website. |
| Metadata extraction | Collects meta tags, descriptions, keywords, and SEO markers. |
| Markdown content output | Converts raw HTML into structured Markdown for readability and portability. |
| Proxy support | Allows users to connect via their own proxy servers for anonymity and safety. |
| Customizable crawling options | Adjust settings to match site structure or reduce load. |
| Multi-format export | Download results as JSON, CSV, or other common formats. |
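
To illustrate the Markdown conversion step, the sketch below uses the turndown npm package; this is an assumption for demonstration, and the project's own markdown_converter.js may work differently.

```js
// Hypothetical HTML-to-Markdown conversion using the `turndown` package.
import TurndownService from "turndown";

const turndown = new TurndownService({ headingStyle: "atx" });
const markdown = turndown.turndown("<h1>Tax Center</h1><p>Days until April 15</p>");
console.log(markdown); // "# Tax Center\n\nDays until April 15"
```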

What Data This Scraper Extracts

| Field Name | Field Description |
| --- | --- |
| page_url | URL of the crawled page. |
| title | Page title extracted from `<title>`. |
| metadata | Serialized metadata fields including keywords and descriptions. |
| content | The page's readable content converted to Markdown. |
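
The sketch below shows how these fields might be assembled from a fetched page. It assumes the cheerio package for HTML parsing; the project's metadata_parser.js may take a different approach.

```js
// Hypothetical field extraction with `cheerio` (not the project's own parser).
import * as cheerio from "cheerio";

function extractFields(html, pageUrl) {
  const $ = cheerio.load(html);

  // Collect <meta> tags keyed by name (or property, for OpenGraph fields).
  const metadata = {};
  $("meta").each((_, el) => {
    const key = $(el).attr("name") ?? $(el).attr("property");
    const value = $(el).attr("content");
    if (key && value) metadata[key] = value;
  });

  return {
    page_url: pageUrl,
    title: $("title").text(),
    metadata: JSON.stringify(metadata), // serialized, as in the table above
  };
}
```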

Example Output

```json
[
  {
    "page_url": "http://www.FITaxPlanning.com/taxcenter2.php",
    "title": "Placentia, CA Accounting Firm | Tax Center Page | Financial Insight Tax Planning, Inc.",
    "metadata": "{\"viewport\": \"width=device-width, initial-scale=1.0\", \"description\": \"Take a look at our Tax Center page.\", \"keywords\": \"QuickBooks, CPA, Tax Preparation\"}",
    "content": "## FITax Planning, Inc.\n\n * Home\n * About\n ...\n\n# Tax Center\n\n## Days until April 15\n\nDeadline for Individuals and C-Corporations\n..."
  }
]
```


Directory Structure Tree

web-crawler-scraper/
├── src/
│   ├── crawler.js
│   ├── extractors/
│   │   ├── metadata_parser.js
│   │   ├── markdown_converter.js
│   │   └── link_resolver.js
│   ├── utils/
│   │   ├── proxy_manager.js
│   │   └── request_handler.js
│   ├── outputs/
│   │   └── exporters.js
│   └── config/
│       └── settings.example.json
├── data/
│   ├── sample_input.json
│   └── sample_output.json
├── docs/
│   └── reference.md
├── tests/
│   ├── crawler.test.js
│   └── parser.test.js
├── LICENSE
├── package.json
└── README.md

Use Cases

  • SEO specialists use it to extract titles and metadata, enabling them to audit website content for optimization opportunities.
  • Researchers use the crawler to collect structured content, so they can analyze topics, patterns, and textual features.
  • Developers integrate it to automate full-site documentation extraction, helping them migrate or refactor content-heavy websites.
  • Businesses use it to monitor competitor site changes, empowering them to track updates and strategic shifts.
  • Content teams use the crawler to convert HTML websites into Markdown, allowing them to repurpose content across platforms.

FAQs

Q: Can it crawl an entire website from a single starting URL? Yes. The crawler automatically discovers and follows internal links until all accessible pages are processed.

Q: Do I need to use a proxy? Proxies are optional but highly recommended for large websites or sites with rate limits.

Q: What proxy formats are supported? Any standard HTTP/HTTPS proxy URL works, including residential, datacenter, mobile, rotating, and static proxies.
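
As an illustration, a proxy URL could be wired into requests like this. The packages shown (node-fetch and https-proxy-agent) are assumptions for the sketch, not necessarily what src/utils/proxy_manager.js uses.

```js
// Hypothetical proxy wiring; PROXY_URL e.g. "http://user:pass@proxy.example.com:8080".
import fetch from "node-fetch";
import { HttpsProxyAgent } from "https-proxy-agent";

const proxyUrl = process.env.PROXY_URL;
const agent = proxyUrl ? new HttpsProxyAgent(proxyUrl) : undefined;

const res = await fetch("https://example.com/", { agent });
console.log(res.status);
```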

Q: In what formats can I download the results? Results can be exported in JSON, CSV, or accessed programmatically.
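
A minimal export sketch, assuming the four documented fields (not the project's actual exporters.js):

```js
// Writes crawl results to <basename>.json and a flat <basename>.csv.
import { writeFileSync } from "node:fs";

function exportResults(records, basename = "results") {
  writeFileSync(`${basename}.json`, JSON.stringify(records, null, 2));

  const header = ["page_url", "title", "metadata", "content"];
  const escape = (v) => `"${String(v ?? "").replaceAll('"', '""')}"`;
  const rows = records.map((r) => header.map((k) => escape(r[k])).join(","));
  writeFileSync(`${basename}.csv`, [header.join(","), ...rows].join("\n"));
}
```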


Performance Benchmarks and Results

  • Primary Metric: Typically crawls medium-sized sites (100–200 pages) in minutes with high throughput.
  • Reliability Metric: Maintains a consistent crawling success rate across websites with varying structure and depth.
  • Efficiency Metric: Optimized request handling reduces unnecessary loads and minimizes proxy usage.
  • Quality Metric: Produces high-fidelity Markdown content with accurate metadata extraction and minimal formatting artifacts.

Book a Call | Watch on YouTube

Review 1

“Bitbash is a top-tier automation partner, innovative, reliable, and dedicated to delivering real results every time.”

Nathan Pennington
Marketer
★★★★★

Review 2

“Bitbash delivers outstanding quality, speed, and professionalism, truly a team you can rely on.”

Eliza
SEO Affiliate Expert
★★★★★

Review 3

“Exceptional results, clear communication, and flawless delivery. Bitbash nailed it.”

Syed
Digital Strategist
★★★★★