A professional, production-ready PHP sitemap generator by iProDev (Hemn Chawroka) — supports concurrency, robots.txt, gzip compression, sitemap index files, and comprehensive error handling.
- ✨ Database Storage with change detection and historical tracking
- 🔄 Resume Capability with checkpoint system
- 🎯 SEO Analysis and content quality checking
- 📊 Performance Metrics and detailed analytics
- 🖼️ Multi-format Sitemaps (Images, Videos, News)
- 🌐 JavaScript Rendering support for SPAs
- 🔐 Proxy Support with rotation
- 🔔 Webhook Notifications for events
- 📅 Scheduled Crawling with cron integration
- 🎨 Interactive Mode for easy configuration
- ⚡ Caching System (File & Redis)
- 🎛️ Smart Filtering with priority rules
- 📈 Rate Limiting with retry handling
- Features
- Installation
- Quick Start
- CLI Usage
- Advanced Features
- Programmatic Usage
- Examples
- Configuration
- API Reference
- Testing
- Docker
- Troubleshooting
- Contributing
- 🚀 High Performance - Concurrent HTTP requests
- 🤖 Robots.txt Compliant - Respects crawling rules
- 📦 Gzip Compression - Automatic compression
- 📊 Sitemap Index - Multiple sitemap files
- 🛡️ Error Handling - Comprehensive error management
- 📝 PSR-3 Logging - Standard logging interface
- 💾 Database Storage - SQLite/MySQL/PostgreSQL support
- 🔄 Change Detection - Track URL changes over time
- 📈 SEO Analysis - Analyze pages for SEO issues
- 🔍 Quality Checks - Find duplicates, broken links
- 🎯 Smart Filtering - Include/exclude patterns
- ⚡ Caching - File and Redis cache support
- 📍 Resume Support - Continue interrupted crawls
- 🔔 Webhooks - Real-time notifications
- 📅 Scheduling - Automated periodic crawls
- 🌐 JavaScript - Render SPAs with headless Chrome
- 🔐 Proxy Support - HTTP/SOCKS proxies with rotation
- 🎨 Interactive Mode - User-friendly configuration
- 📄 Standard XML Sitemap
- 🖼️ Image Sitemap
- 🎬 Video Sitemap
- 📰 News Sitemap
Install via Composer:

```bash
composer require iprodev/sitemap-generator-pro
```

Requirements:

- PHP >= 8.0
- Extensions: curl, xml, mbstring, zlib, pdo
- Optional: redis, posix
Quick start:

```bash
php bin/sitemap --url=https://www.example.com
```

Or use interactive mode:

```bash
php bin/sitemap --interactive
```

A more complete invocation:

```bash
php bin/sitemap \
--url=https://www.example.com \
--out=./sitemaps \
--concurrency=20 \
--cache-enabled \
--db-enabled \
--seo-analysis \
--image-sitemap \
--webhook-url=https://example.com/webhook \
  --verbose
```

CLI options:

```text
--url=<URL> # Starting URL (required)
--out=<PATH> # Output directory
--concurrency=<N> # Concurrent requests (1-100)
--max-pages=<N> # Maximum pages to crawl
--max-depth=<N> # Maximum link depth
--public-base=<URL> # Public base URL for sitemap index
--verbose, -v # Verbose output
--help, -h # Show help

--cache-enabled # Enable caching
--cache-driver=file # Cache driver: file|redis
--cache-ttl=3600 # Cache TTL in seconds

--db-enabled # Enable database storage
--db-dsn=<DSN> # Database DSN
--detect-changes # Compare with previous crawl
--only-changed # Only include changed URLs

--resume # Resume from checkpoint
--checkpoint-interval=<N> # Save checkpoint every N pages

--rate-limit=<N> # Requests per minute
--delay=<MS> # Delay between requests (ms)

--exclude=<PATTERNS> # Exclude patterns (comma-separated)
--include=<PATTERNS> # Include only patterns
--priority-rules=<JSON> # Priority rules as JSON

--seo-analysis # Enable SEO analysis
--check-quality # Check content quality
--find-duplicates # Find duplicate content
--find-broken-links # Find broken links

--image-sitemap # Generate image sitemap
--video-sitemap # Generate video sitemap
--news-sitemap # Generate news sitemap

--enable-javascript # Enable JS rendering
--chrome-path=<PATH> # Path to Chrome/Chromium
--wait-for-ajax=<MS> # Wait time for AJAX

--proxy=<URL> # Proxy URL
--proxy-file=<PATH> # Load proxies from file
--rotate-proxies # Rotate through proxies

--webhook-url=<URL> # Webhook for notifications
--notify-on-complete # Notify when complete
--notify-on-error # Notify on errors
```

Track changes over time:
```bash
php bin/sitemap \
--url=https://example.com \
--db-enabled \
--detect-changes
```

The system will:
- Store all URLs in database
- Compare with previous crawl
- Generate a change report of new, modified, and deleted URLs (see the sketch below)
- Track SEO metrics over time
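For programmatic access to the same information, here is a minimal sketch that reuses the `Database` and `ChangeDetector` calls documented in the Programmatic Usage section. The `new`/`modified`/`deleted` keys are an assumption about the report's shape, so adjust them to whatever `detectChanges()` actually returns.

```php
<?php
require 'vendor/autoload.php';

use IProDev\Sitemap\Database\Database;
use IProDev\Sitemap\ChangeDetector;

// Calls below mirror the Programmatic Usage example later in this README.
$db = new Database('sqlite:./sitemap.db');
$db->createTables();

$crawlId   = $db->startCrawl('example.com', 'https://example.com', []);
$prevCrawl = $db->getPreviousCrawl('example.com', $crawlId);

if ($prevCrawl) {
    $detector = new ChangeDetector($db);
    $changes  = $detector->detectChanges($prevCrawl['id'], $crawlId);

    // Assumed report keys; verify against the real structure.
    foreach (['new', 'modified', 'deleted'] as $type) {
        $count = isset($changes[$type]) ? count($changes[$type]) : 0;
        echo strtoupper($type) . ": {$count} URLs\n";
    }
}
```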
Large crawls can be resumed:
```bash
php bin/sitemap \
--url=https://example.com \
--resume \
--checkpoint-interval=1000
```

Analyze pages for SEO issues:
```bash
php bin/sitemap \
--url=https://example.com \
--seo-analysis \
--find-duplicates \
--find-broken-links
```

Reports include:
- Missing title/meta descriptions
- Duplicate content
- Broken links
- Page load times
- Mobile optimization
- Structured data
For SPAs (React, Vue, Angular):
```bash
php bin/sitemap \
--url=https://spa.example.com \
--enable-javascript \
--chrome-path=/usr/bin/chromium \
--wait-for-ajax=5000
```

Set up automated crawls:
```php
use IProDev\Sitemap\Scheduler\CronScheduler;
$scheduler = new CronScheduler();
$scheduler->addSchedule('daily-crawl', [
'url' => 'https://example.com',
'schedule' => 'daily', // or cron: '0 2 * * *'
'out' => './sitemaps',
'db_enabled' => true
]);
// Add to crontab:
// * * * * * php bin/scheduler
```

Get notified of events:
```bash
php bin/sitemap \
--url=https://example.com \
--webhook-url=https://example.com/webhook \
--notify-on-complete \
--notify-on-error
```

Webhook payload:
```json
{
"event": "crawl.completed",
"timestamp": "2025-01-20T10:30:00Z",
"data": {
"url": "https://example.com",
"stats": {
"pages": 1523,
"duration": 45.3
}
}
}
```
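On the receiving end, a webhook handler only needs to decode the JSON body. The sketch below is plain PHP with no dependency on this library; the field names simply mirror the example payload above.

```php
<?php
// webhook.php - minimal receiver for the payload shown above.
$payload = json_decode(file_get_contents('php://input'), true) ?: [];

if (($payload['event'] ?? '') === 'crawl.completed') {
    $stats = $payload['data']['stats'] ?? [];
    error_log(sprintf(
        'Crawl of %s finished: %d pages in %.1fs',
        $payload['data']['url'] ?? 'unknown',
        $stats['pages'] ?? 0,
        $stats['duration'] ?? 0
    ));
}

http_response_code(200);
```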
Use proxies for crawling:

```bash
# Single proxy
php bin/sitemap \
--url=https://example.com \
--proxy=http://proxy.example.com:8080
# Proxy file with rotation
php bin/sitemap \
--url=https://example.com \
--proxy-file=./proxies.txt \
--rotate-proxies
```

Proxy file format:
```text
http://proxy1.example.com:8080
http://proxy2.example.com:8080|username:password
socks5://proxy3.example.com:1080
```
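Proxies can presumably also be configured programmatically. The sketch below assumes the `Fetcher` options array is forwarded to the underlying Guzzle client, as the `'verify'` example in Troubleshooting suggests; `'proxy'` is a standard Guzzle request option, not a documented `Fetcher` option, so treat this as an assumption.

```php
<?php
require 'vendor/autoload.php';

use IProDev\Sitemap\Fetcher;

// Assumption: options pass through to Guzzle, so 'proxy' behaves like
// the 'verify' option shown in the Troubleshooting section.
$fetcher = new Fetcher([
    'concurrency' => 10,
    'proxy'       => 'http://proxy1.example.com:8080',
]);
```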
Control what gets crawled:
```bash
php bin/sitemap \
--url=https://example.com \
--exclude="/admin/*,/test/*,*.pdf" \
--include="/products/*,/blog/*" \
--priority-rules='{"homepage":1.0,"/products/*":0.8}'
```

Basic programmatic usage:

```php
use IProDev\Sitemap\Fetcher;
use IProDev\Sitemap\Crawler;
use IProDev\Sitemap\SitemapWriter;
use IProDev\Sitemap\RobotsTxt;
$fetcher = new Fetcher(['concurrency' => 20]);
$robots = RobotsTxt::fromUrl('https://example.com', $fetcher);
$crawler = new Crawler($fetcher, $robots);
$pages = $crawler->crawl('https://example.com', 10000, 5);
$files = SitemapWriter::write($pages, './sitemaps');
```

With database storage and change detection:

```php
use IProDev\Sitemap\Database\Database;
use IProDev\Sitemap\ChangeDetector;
// Initialize database
$db = new Database('sqlite:./sitemap.db');
$db->createTables();
// Start crawl
$domain = 'example.com';
$crawlId = $db->startCrawl($domain, 'https://example.com', []);
// Crawl and save
foreach ($pages as $page) {
$db->saveUrl($crawlId, $page);
}
// Detect changes
$prevCrawl = $db->getPreviousCrawl($domain, $crawlId);
if ($prevCrawl) {
$detector = new ChangeDetector($db);
$changes = $detector->detectChanges($prevCrawl['id'], $crawlId);
print_r($changes);
}
```

SEO analysis:

```php
use IProDev\Sitemap\Analyzer\SeoAnalyzer;
$analyzer = new SeoAnalyzer();
foreach ($pages as $page) {
$analysis = $analyzer->analyze(
$page['url'],
$page['html'],
$page['status_code']
);
echo "Score: {$analysis['score']}/100\n";
echo "Issues: " . count($analysis['issues']) . "\n";
}
```

Caching:

```php
use IProDev\Sitemap\Cache\FileCache;
use IProDev\Sitemap\Cache\RedisCache;
// File cache
$cache = new FileCache('./cache', 3600);
// Redis cache
$cache = new RedisCache('127.0.0.1', 6379);
// Use in fetcher
$fetcher = new Fetcher(['cache' => $cache]);
```

Create sitemap.config.php:
```php
<?php
return [
'url' => 'https://example.com',
'out' => './sitemaps',
'concurrency' => 20,
'max_pages' => 10000,
'max_depth' => 5,
'cache_enabled' => true,
'db_enabled' => true,
'seo_analysis' => true,
'exclude' => ['/admin/*', '/test/*'],
'priority_rules' => [
'homepage' => 1.0,
'/products/*' => 0.8,
'/blog/*' => 0.6
]
];
```
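Since sitemap.config.php simply returns a PHP array, the same settings can be reused in your own scripts. This sketch only relies on the `Fetcher` option shown elsewhere in this README; how the CLI maps the remaining keys is up to the tool itself.

```php
<?php
require 'vendor/autoload.php';

use IProDev\Sitemap\Fetcher;

// Load the same settings the CLI uses and reuse the documented option.
$config  = require 'sitemap.config.php';
$fetcher = new Fetcher(['concurrency' => $config['concurrency']]);
```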
Use the config file from the CLI:

```bash
php bin/sitemap --config=sitemap.config.php
```

The examples/ directory contains ready-to-use code examples:
Simple sitemap generation with minimal configuration:
```php
require 'vendor/autoload.php';
use IProDev\Sitemap\Fetcher;
use IProDev\Sitemap\Crawler;
use IProDev\Sitemap\SitemapWriter;
use IProDev\Sitemap\RobotsTxt;
$fetcher = new Fetcher(['concurrency' => 10]);
$robots = RobotsTxt::fromUrl('https://example.com', $fetcher);
$crawler = new Crawler($fetcher, $robots);
$pages = $crawler->crawl('https://example.com', 1000, 3);
SitemapWriter::write($pages, './output');
```

An advanced example adds caching, filtering, and rate limiting.
Full-featured example demonstrating all capabilities:
- Database storage with change detection
- SEO analysis and quality checks
- Performance metrics
- Webhook notifications
- Image sitemap generation
- Resume capability
Run the comprehensive example:
```bash
php examples/comprehensive.php
```

Core classes:

| Class | Description |
|---|---|
| `Fetcher` | HTTP client with concurrent request support |
| `Crawler` | Web crawler with depth and page limits |
| `Parser` | HTML parser for link extraction |
| `RobotsTxt` | Robots.txt parser with wildcard support |
| `SitemapWriter` | XML sitemap generator with gzip compression |
| `Utils` | Utility functions (URL handling, formatting) |
Advanced classes:

| Class | Description |
|---|---|
| `Database` | SQLite/MySQL/PostgreSQL storage |
| `ChangeDetector` | Detect URL changes between crawls |
| `SeoAnalyzer` | Page-level SEO analysis |
| `ContentQualityChecker` | Find duplicates, broken links, thin content |
| `UrlFilter` | Include/exclude patterns with priority rules |
| `FileCache` / `RedisCache` | Caching implementations |
| `RateLimiter` | Request throttling |
| `CrawlCheckpoint` | Resume interrupted crawls |
| `WebhookNotifier` | Event notifications |
| `ProxyManager` | Proxy rotation support |
| `PerformanceMetrics` | Crawl statistics and reporting |
Sitemap writers:

| Class | Description |
|---|---|
| `SitemapWriter` | Standard XML sitemap |
| `ImageSitemapWriter` | Image sitemap with title/caption |
| `VideoSitemapWriter` | Video sitemap with metadata |
| `NewsSitemapWriter` | News sitemap for Google News |
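As a rough illustration of the specialized writers: the sketch below assumes `ImageSitemapWriter` lives in the same namespace as `SitemapWriter`, exposes a static `write($pages, $outDir)` with the same shape, and reads image data from an `images` key on each page entry. None of that is documented here, so check the class before relying on it.

```php
<?php
require 'vendor/autoload.php';

// Namespace, method signature, and the 'images' field are all assumptions.
use IProDev\Sitemap\ImageSitemapWriter;

$pages = [
    [
        'url'    => 'https://example.com/products/widget',
        'images' => [
            ['loc' => 'https://example.com/img/widget.jpg', 'title' => 'Widget'],
        ],
    ],
];

$files = ImageSitemapWriter::write($pages, './sitemaps');
```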
```bash
# Run tests
composer test
# With coverage
composer test-coverage
# Code style
composer lint
# Static analysis
composer analyze
# All checks
composer check
```

Build and run with Docker:

```bash
# Build
docker build -t sitemap-generator-pro .
# Run
docker run --rm \
-v $(pwd)/sitemaps:/app/output \
sitemap-generator-pro \
--url=https://example.com \
--out=/app/output
```

Performance tips:

- Increase Concurrency: for faster crawling, use `--concurrency=50`
- Enable Caching: reduce duplicate requests with `--cache-enabled --cache-driver=redis`
- Use Database: track changes efficiently with `--db-enabled --detect-changes`
- Smart Filtering: reduce unnecessary pages with `--exclude="/admin/*,*.pdf"`
- Resume Support: handle large sites with `--resume --checkpoint-interval=1000`
Common issues:

1. Memory Exhausted Error
```bash
# Increase PHP memory limit
php -d memory_limit=512M bin/sitemap --url=https://example.com
```

2. Too Many Open Connections
```bash
# Reduce concurrency
php bin/sitemap --url=https://example.com --concurrency=5
```

3. Crawl Takes Too Long
```bash
# Use checkpoints and resume
php bin/sitemap --url=https://example.com --resume --checkpoint-interval=500
```

4. Rate Limited by Server
```bash
# Enable rate limiting
php bin/sitemap --url=https://example.com --rate-limit=30 --delay=1000
```

5. JavaScript-rendered Content Not Captured
```bash
# Enable JavaScript rendering (requires Chrome/Chromium)
php bin/sitemap --url=https://example.com --enable-javascript --chrome-path=/usr/bin/chromium
```

6. SSL Certificate Errors
```php
// In code, configure Guzzle to skip verification (not recommended for production)
$fetcher = new Fetcher(['verify' => false]);
```
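A safer alternative, assuming the option is forwarded to Guzzle as the snippet above suggests, is to point `verify` at a CA bundle instead of disabling verification; the path below is illustrative and varies by system.

```php
// Prefer a CA bundle over disabling verification.
$fetcher = new Fetcher(['verify' => '/etc/ssl/certs/ca-certificates.crt']);
```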
Enable verbose output for debugging:

```bash
php bin/sitemap --url=https://example.com --verbose
```

When using database storage, check crawl history:
```bash
sqlite3 sitemap.db "SELECT * FROM crawls ORDER BY created_at DESC LIMIT 5;"
```

Security features:

- Path traversal prevention
- URL validation and sanitization
- Safe XML generation
- Proxy authentication support
- Rate limiting to prevent blocking
MIT License - see LICENSE.md
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Run tests (`composer check`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Development setup:

```bash
# Clone the repository
git clone https://github.com/iprodev/PHP-XML-Sitemap-Generator.git
cd sitemap-generator-pro
# Install dependencies
composer install
# Run tests
composer test
# Check code style
composer lint
# Fix code style issues
composer phpcbf
# Run static analysis
composer analyze
```

Created by iProDev (Hemn Chawroka) - https://github.com/iprodev
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Wiki
Made with ❤️ by iProDev (Hemn Chawroka)