224 changes: 224 additions & 0 deletions .github/workflows/CONTRIBUTING.md
@@ -0,0 +1,224 @@
# Contributing to Crawler

Thank you for your interest in contributing to the Crawler project! We welcome contributions from everyone. This document provides guidelines and instructions for contributing.

## Code of Conduct

We are committed to providing a welcoming and inspiring community for all. Please be respectful and constructive in all interactions. Harassment, discrimination, or disruptive behavior will not be tolerated.

## How to Contribute

There are many ways to contribute to this project:

- **Report bugs** by opening an issue with detailed information
- **Suggest features** with clear use cases and expected behavior
- **Improve documentation** by fixing typos or clarifying confusing sections
- **Submit code changes** by creating pull requests with meaningful improvements
- **Review pull requests** and provide constructive feedback to other contributors

## Getting Started

### Prerequisites

- Python 3.8 or higher
- Git
- A MySQL database for testing (optional but recommended)
- A code editor or IDE of your choice

### Setting Up Your Development Environment

1. Fork the repository on GitHub
2. Clone your fork locally:
```bash
git clone https://github.com/your-username/crawler.git
cd crawler
```
3. Create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
4. Install development dependencies:
```bash
pip install -r requirements-dev.txt
```
5. Create a local `.env` file for testing:
```bash
cp .env.example .env
```

## Making Changes

### Branch Naming

Create a descriptive branch name for your changes:
- `feature/add-proxy-support`
- `bugfix/fix-mysql-connection-timeout`
- `docs/improve-readme`
- `test/add-crawler-tests`

```bash
git checkout -b feature/your-feature-name
```

### Code Style

Follow these guidelines to maintain consistent code quality:

- Use PEP 8 style guide for Python code
- Keep lines under 100 characters when possible
- Use meaningful variable and function names
- Add docstrings to functions and classes
- Use type hints where applicable

Example:
```python
import requests


def fetch_url(url: str, timeout: int = 10) -> str:
    """
    Fetch content from a given URL.

    Args:
        url: The URL to fetch
        timeout: Request timeout in seconds (default: 10)

    Returns:
        The HTML content of the page

    Raises:
        requests.exceptions.RequestException: If the request fails
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

### Testing

Before submitting a pull request, ensure your code passes all tests:

```bash
# Run all tests
pytest

# Run tests with coverage
pytest --cov=crawler

# Run specific test file
pytest tests/test_crawler.py
```

Write tests for new features:
```python
from crawler import fetch_url  # adjust the import to match where fetch_url lives


def test_fetch_url_success():
    """Test that fetch_url returns content for valid URLs."""
    result = fetch_url("https://example.com")
    assert result is not None
    assert len(result) > 0
```
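
Tests that hit the network can be slow and flaky in CI. Below is a minimal mocked variant as a sketch; it assumes `fetch_url` lives in a `crawler` module and calls `requests.get` internally, so adjust the patch target to your actual layout:

```python
from unittest.mock import MagicMock, patch

from crawler import fetch_url  # hypothetical module path; adjust as needed


def test_fetch_url_mocked():
    """Verify fetch_url returns the response body without real network I/O."""
    fake_response = MagicMock()
    fake_response.text = "<html>stub</html>"
    fake_response.raise_for_status.return_value = None

    # Patch requests.get as resolved inside the crawler module.
    with patch("crawler.requests.get", return_value=fake_response) as mock_get:
        result = fetch_url("https://example.com")

    mock_get.assert_called_once_with("https://example.com", timeout=10)
    assert result == "<html>stub</html>"
```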

### Commits

Write clear, descriptive commit messages:

```bash
# Good
git commit -m "Add proxy support to crawler

- Add ProxyManager class to handle proxy rotation
- Update fetch_url to accept proxy configuration
- Add tests for proxy connection handling"

# Avoid
git commit -m "fix stuff"
git commit -m "changes"
```

## Submitting Changes

### Pull Request Process

1. Ensure all tests pass and code is formatted correctly
2. Push your branch to your fork:
```bash
git push origin feature/your-feature-name
```
3. Open a pull request on GitHub with:
- A clear title describing the change
- A detailed description of what was changed and why
- Reference to any related issues (e.g., "Fixes #123")
- Screenshots or examples if applicable
4. Address review comments and make requested changes
5. Ensure the CI/CD pipeline passes
6. Once approved, your PR will be merged

### Pull Request Template

```markdown
## Description
Brief explanation of what this PR does.

## Changes Made
- Change 1
- Change 2
- Change 3

## Related Issues
Fixes #123

## Testing
Describe how you tested these changes.

## Checklist
- [ ] Code follows style guidelines
- [ ] Tests pass locally
- [ ] Documentation is updated
- [ ] No breaking changes (or documented in PR)
```

## Reporting Bugs

When reporting bugs, please include:

- **Description**: What you were trying to do
- **Expected behavior**: What should have happened
- **Actual behavior**: What actually happened
- **Environment**: Python version, OS, MySQL version
- **Steps to reproduce**: Clear steps to replicate the issue
- **Error message**: Full error traceback if available
- **Screenshots**: If applicable

Example:
```
Title: Crawler fails with timeout on large datasets

Description: When crawling more than 10,000 pages, the crawler
consistently times out.

Steps to reproduce:
1. Configure crawler with 15,000 pages
2. Run `python crawler.py`
3. After ~8,000 pages, connection fails

Expected: Crawler should complete all 15,000 pages
Actual: Crawler crashes with timeout error

Environment: Python 3.9, Ubuntu 20.04, MySQL 8.0
```

## Suggesting Features

When suggesting features, explain:

- **Use case**: Why this feature is needed
- **Expected behavior**: How it should work
- **Alternative approaches**: Other possible implementations
- **Impact**: How it affects existing functionality

## Documentation

Help improve documentation by:

- Fixing typos and grammatical errors
- Adding missing sections or examples
- Clarifying confusing explanations
- Adding inline code comments for complex logic
83 changes: 83 additions & 0 deletions .github/workflows/README.md
@@ -0,0 +1,83 @@
# Crawler

A web crawler for collecting and processing data from specified sources.

## Table of Contents

- [Installation](#installation)
- [Configuration](#configuration)
- [Usage](#usage)
- [Database Setup](#database-setup)
- [Contributing](#contributing)

## Installation

Install the required dependencies:

```bash
pip install -r requirements.txt
```

Ensure you have Python 3.8+ installed on your system.

## Configuration

### Environment Variables

Create a `.env` file in the project root with the following variables:

```
DATABASE_HOST=localhost
DATABASE_USER=crawler_user
DATABASE_PASSWORD=your_password
DATABASE_NAME=crawler_db
```

Update these values according to your local environment.
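
For reference, here is a minimal sketch of how these variables might be loaded at startup, assuming the `python-dotenv` package (the crawler's actual loading code may differ):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the project root into os.environ

DB_CONFIG = {
    "host": os.environ.get("DATABASE_HOST", "localhost"),
    "user": os.environ["DATABASE_USER"],
    "password": os.environ["DATABASE_PASSWORD"],
    "database": os.environ["DATABASE_NAME"],
}
```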

## Usage

Run the crawler with:

```bash
python crawler.py
```

Optional flags:
- `--verbose`: Enable detailed logging output
- `--limit N`: Limit crawling to N pages
- `--timeout S`: Set request timeout to S seconds
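
As an illustration, these flags could be wired up with `argparse` along the following lines (a sketch, not necessarily the project's actual CLI code):

```python
import argparse

parser = argparse.ArgumentParser(description="Run the crawler.")
parser.add_argument("--verbose", action="store_true",
                    help="enable detailed logging output")
parser.add_argument("--limit", type=int, metavar="N",
                    help="limit crawling to N pages")
parser.add_argument("--timeout", type=int, metavar="S", default=10,
                    help="set request timeout to S seconds")
args = parser.parse_args()
print(args.verbose, args.limit, args.timeout)
```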

## Database Setup

### MySQL Configuration

The crawler uses MySQL to store collected data. Follow these steps to set up your database:

1. **Install MySQL**: Download and install from [MySQL Official Website](https://dev.mysql.com/downloads/mysql/)

2. **Create Database and User**:
```sql
CREATE DATABASE crawler_db;
CREATE USER 'crawler_user'@'localhost' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON crawler_db.* TO 'crawler_user'@'localhost';
FLUSH PRIVILEGES;
```

3. **Initialize Tables**: Run the database migration script:
```bash
python scripts/init_db.py
```

### Connection Details

- **Host**: localhost (default)
- **Port**: 3306 (default MySQL port)
- **User**: crawler_user
- **Database**: crawler_db

Update the connection parameters in your `.env` file if using different settings.
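
To verify connectivity, a quick sketch using `mysql-connector-python` (assumed here; PyMySQL works similarly) with the defaults above:

```python
import mysql.connector  # assumes mysql-connector-python is installed

conn = mysql.connector.connect(
    host="localhost",
    port=3306,
    user="crawler_user",
    password="your_password",  # use the value from your .env
    database="crawler_db",
)
cursor = conn.cursor()
cursor.execute("SELECT VERSION()")
print(cursor.fetchone())  # e.g. ('8.0.36',)
conn.close()
```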

## Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
78 changes: 78 additions & 0 deletions .github/workflows/azure-webapps-node.yml
@@ -0,0 +1,78 @@
# This workflow will build and push a Node.js application to an Azure Web App when a commit is pushed to your default branch.
#
# This workflow assumes you have already created the target Azure App Service web app.
# For instructions see https://docs.microsoft.com/en-us/azure/app-service/quickstart-nodejs?tabs=linux&pivots=development-environment-cli
#
# To configure this workflow:
#
# 1. Download the Publish Profile for your Azure Web App. You can download this file from the Overview page of your Web App in the Azure Portal.
# For more information: https://docs.microsoft.com/en-us/azure/app-service/deploy-github-actions?tabs=applevel#generate-deployment-credentials
#
# 2. Create a secret in your repository named AZURE_WEBAPP_PUBLISH_PROFILE and paste the publish profile contents as the value of the secret.
# For instructions on obtaining the publish profile see: https://docs.microsoft.com/azure/app-service/deploy-github-actions#configure-the-github-secret
#
# 3. Change the value of AZURE_WEBAPP_NAME. Optionally, change the AZURE_WEBAPP_PACKAGE_PATH and NODE_VERSION environment variables below.
#
# For more information on GitHub Actions for Azure: https://github.com/Azure/Actions
# For more information on the Azure Web Apps Deploy action: https://github.com/Azure/webapps-deploy
# For more samples to get started with GitHub Action workflows to deploy to Azure: https://github.com/Azure/actions-workflow-samples

on:
push:
branches: [ "main" ]
workflow_dispatch:

env:
AZURE_WEBAPP_NAME: your-app-name # set this to your application's name
AZURE_WEBAPP_PACKAGE_PATH: '.' # set this to the path to your web app project, defaults to the repository root
NODE_VERSION: '20.x' # set this to the node version to use

permissions:
contents: read

jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

- name: Set up Node.js
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: 'npm'

- name: npm install, build, and test
run: |
npm install
npm run build --if-present
npm run test --if-present
- name: Upload artifact for deployment job
uses: actions/upload-artifact@v4
with:
name: node-app
path: .

deploy:
permissions:
contents: none
runs-on: ubuntu-latest
needs: build
environment:
name: 'Development'
url: ${{ steps.deploy-to-webapp.outputs.webapp-url }}

steps:
- name: Download artifact from build job
uses: actions/download-artifact@v4
with:
name: node-app

- name: 'Deploy to Azure WebApp'
id: deploy-to-webapp
uses: azure/webapps-deploy@v2
with:
app-name: ${{ env.AZURE_WEBAPP_NAME }}
publish-profile: ${{ secrets.AZURE_WEBAPP_PUBLISH_PROFILE }}
package: ${{ env.AZURE_WEBAPP_PACKAGE_PATH }}