1,551 changes: 1,551 additions & 0 deletions data/raw/tutorial_course.json

Large diffs are not rendered by default.

50 changes: 50 additions & 0 deletions data/raw/tutorial_info.csv
Owner

We should store the URL tree in the DB and version it by date.
The extracted content should carry this version and its URL path within the tree in the content data model, so we can easily fetch related content.
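
One way to read this suggestion, as a minimal sketch rather than the agreed design: each crawl produces a dated snapshot of the URL tree, and every content record keeps the snapshot's version plus its own path from the root. The names `UrlTreeVersion`, `ContentRecord`, `make_version`, and `related_content` are hypothetical and do not appear in this PR. Plain dataclasses stand in for whatever the database layer ends up being.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class UrlTreeVersion:
    """One dated snapshot of the crawled URL tree (hypothetical model)."""
    version: str    # ISO date string used as the version label
    root_url: str
    # Maps a parent URL to its child URLs, in sidebar order.
    children: Dict[str, List[str]] = field(default_factory=dict)

    def path_to(self, url: str) -> List[str]:
        """Return the URLs from the root down to `url`, or [] if absent."""
        stack = [[self.root_url]]
        seen = set()
        while stack:
            path = stack.pop()
            node = path[-1]
            if node == url:
                return path
            if node in seen:
                continue  # guard against cycles in the link graph
            seen.add(node)
            for child in self.children.get(node, []):
                stack.append(path + [child])
        return []


@dataclass
class ContentRecord:
    """Extracted page content tagged with the tree version it came from."""
    title: str
    url: str
    content: str
    tree_version: str
    tree_path: List[str]


def make_version(root_url: str, children: Dict[str, List[str]],
                 snapshot_date: date) -> UrlTreeVersion:
    """Create a snapshot whose version label is the snapshot date."""
    return UrlTreeVersion(snapshot_date.isoformat(), root_url, children)


def related_content(records: List[ContentRecord], tree_version: str,
                    path_prefix: List[str]) -> List[ContentRecord]:
    """Records from one tree version whose path starts with `path_prefix`."""
    n = len(path_prefix)
    return [r for r in records
            if r.tree_version == tree_version and r.tree_path[:n] == path_prefix]
```

With this shape, fetching everything under the MySQL section of a given crawl is a prefix match on `tree_path` within one `tree_version`. A real store would likely index both fields.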

@@ -0,0 +1,50 @@
title,url
No Title Found,https://www.w3schools.com/nodejs/default.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_intro.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_get_started.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_modules.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_http.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_filesystem.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_url.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_npm.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_events.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_uploadfiles.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_email.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_create_db.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_create_table.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_insert.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_select.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_where.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_orderby.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_delete.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_drop_table.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_update.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_limit.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_join.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_create_db.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_createcollection.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_insert.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_find.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_query.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_sort.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_delete.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_drop.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_update.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_limit.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_join.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi_gpio_intro.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi_blinking_led.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi_led_pushbutton.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi_flowing_leds.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi_webserver_websocket.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi_rgb_led_websocket.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi_components.asp
No Title Found,https://www.w3schools.com/nodejs/ref_modules.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_compiler.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_server.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_syllabus.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_study_plan.asp
No Title Found,https://www.w3schools.com/nodejs/nodejs_exam.asp
50 changes: 50 additions & 0 deletions data/raw/tutorial_modules.csv
@@ -0,0 +1,50 @@
title,url
Node.js HOME,https://www.w3schools.com/nodejs/default.asp
Node.js Intro,https://www.w3schools.com/nodejs/nodejs_intro.asp
Node.js Get Started,https://www.w3schools.com/nodejs/nodejs_get_started.asp
Node.js Modules,https://www.w3schools.com/nodejs/nodejs_modules.asp
Node.js HTTP Module,https://www.w3schools.com/nodejs/nodejs_http.asp
Node.js File System,https://www.w3schools.com/nodejs/nodejs_filesystem.asp
Node.js URL Module,https://www.w3schools.com/nodejs/nodejs_url.asp
Node.js NPM,https://www.w3schools.com/nodejs/nodejs_npm.asp
Node.js Events,https://www.w3schools.com/nodejs/nodejs_events.asp
Node.js Upload Files,https://www.w3schools.com/nodejs/nodejs_uploadfiles.asp
Node.js Email,https://www.w3schools.com/nodejs/nodejs_email.asp
MySQL Get Started,https://www.w3schools.com/nodejs/nodejs_mysql.asp
MySQL Create Database,https://www.w3schools.com/nodejs/nodejs_mysql_create_db.asp
MySQL Create Table,https://www.w3schools.com/nodejs/nodejs_mysql_create_table.asp
MySQL Insert Into,https://www.w3schools.com/nodejs/nodejs_mysql_insert.asp
MySQL Select From,https://www.w3schools.com/nodejs/nodejs_mysql_select.asp
MySQL Where,https://www.w3schools.com/nodejs/nodejs_mysql_where.asp
MySQL Order By,https://www.w3schools.com/nodejs/nodejs_mysql_orderby.asp
MySQL Delete,https://www.w3schools.com/nodejs/nodejs_mysql_delete.asp
MySQL Drop Table,https://www.w3schools.com/nodejs/nodejs_mysql_drop_table.asp
MySQL Update,https://www.w3schools.com/nodejs/nodejs_mysql_update.asp
MySQL Limit,https://www.w3schools.com/nodejs/nodejs_mysql_limit.asp
MySQL Join,https://www.w3schools.com/nodejs/nodejs_mysql_join.asp
MongoDB Get Started,https://www.w3schools.com/nodejs/nodejs_mongodb.asp
MongoDB Create DB,https://www.w3schools.com/nodejs/nodejs_mongodb_create_db.asp
MongoDB Collection,https://www.w3schools.com/nodejs/nodejs_mongodb_createcollection.asp
MongoDB Insert,https://www.w3schools.com/nodejs/nodejs_mongodb_insert.asp
MongoDB Find,https://www.w3schools.com/nodejs/nodejs_mongodb_find.asp
MongoDB Query,https://www.w3schools.com/nodejs/nodejs_mongodb_query.asp
MongoDB Sort,https://www.w3schools.com/nodejs/nodejs_mongodb_sort.asp
MongoDB Delete,https://www.w3schools.com/nodejs/nodejs_mongodb_delete.asp
MongoDB Drop Collection,https://www.w3schools.com/nodejs/nodejs_mongodb_drop.asp
MongoDB Update,https://www.w3schools.com/nodejs/nodejs_mongodb_update.asp
MongoDB Limit,https://www.w3schools.com/nodejs/nodejs_mongodb_limit.asp
MongoDB Join,https://www.w3schools.com/nodejs/nodejs_mongodb_join.asp
RasPi Get Started,https://www.w3schools.com/nodejs/nodejs_raspberrypi.asp
RasPi GPIO Introduction,https://www.w3schools.com/nodejs/nodejs_raspberrypi_gpio_intro.asp
RasPi Blinking LED,https://www.w3schools.com/nodejs/nodejs_raspberrypi_blinking_led.asp
RasPi LED & Pushbutton,https://www.w3schools.com/nodejs/nodejs_raspberrypi_led_pushbutton.asp
RasPi Flowing LEDs,https://www.w3schools.com/nodejs/nodejs_raspberrypi_flowing_leds.asp
RasPi WebSocket,https://www.w3schools.com/nodejs/nodejs_raspberrypi_webserver_websocket.asp
RasPi RGB LED WebSocket,https://www.w3schools.com/nodejs/nodejs_raspberrypi_rgb_led_websocket.asp
RasPi Components,https://www.w3schools.com/nodejs/nodejs_raspberrypi_components.asp
Built-in Modules,https://www.w3schools.com/nodejs/ref_modules.asp
Node.js Compiler,https://www.w3schools.com/nodejs/nodejs_compiler.asp
Node.js Server,https://www.w3schools.com/nodejs/nodejs_server.asp
Node.js Syllabus,https://www.w3schools.com/nodejs/nodejs_syllabus.asp
Node.js Study Plan,https://www.w3schools.com/nodejs/nodejs_study_plan.asp
Node.js Certificate,https://www.w3schools.com/nodejs/nodejs_exam.asp
81 changes: 81 additions & 0 deletions main.py
@@ -0,0 +1,81 @@
#!/usr/bin/env python
"""
Main entry point for the web scrapers application.
"""
import argparse
import logging
import sys

from src.scrapers.w3schools import W3SchoolsScraper
from src.app import ScraperApplication

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler(sys.stdout),
        logging.FileHandler('scrapers.log')
    ]
)
logger = logging.getLogger(__name__)


def parse_arguments():
    """Parse command line arguments"""
    parser = argparse.ArgumentParser(description='Web scrapers for tutorial websites')
    parser.add_argument(
        '--url',
        type=str,
        default="https://www.w3schools.com/nodejs/nodejs_intro.asp",
        help='Starting URL for the scrapers'
    )
    parser.add_argument(
        '--output',
        type=str,
        default="data/raw",
        help='Output directory for scraped data'
    )
    parser.add_argument(
        '--delay',
        type=float,
        default=1.0,
        help='Delay between requests in seconds'
    )
    return parser.parse_args()


def main():
    """Main function to run the scrapers"""
    try:
        args = parse_arguments()

        # Log startup information
        logger.info("Starting Web Scraper")
        logger.info(f"URL: {args.url}")
        logger.info(f"Output directory: {args.output}")
        logger.info(f"Request delay: {args.delay} seconds")

        # Create the scrapers
        w3schools_scraper = W3SchoolsScraper(args.url, delay=args.delay)

        # Create the application
        app = ScraperApplication(w3schools_scraper)
Owner

Instead of ScraperApplication, we will use a ZenML pipeline for orchestration.


        # Run the application
        data = app.run(args.output)

        # Access and print metadata
        logger.info("\nMetadata from the course:")
        for key, value in data["metadata"].items():
            logger.info(f"  {key}: {value}")

        logger.info("\nScraping and export completed!")
        return 0
    except Exception as e:
        logger.error(f"Error in main: {e}", exc_info=True)
        return 1


if __name__ == "__main__":
    sys.exit(main())
Empty file added src/__init__.py
Empty file.
Binary file added src/__pycache__/__init__.cpython-312.pyc
Binary file not shown.
Binary file added src/__pycache__/app.cpython-312.pyc
Binary file not shown.
54 changes: 54 additions & 0 deletions src/app.py
@@ -0,0 +1,54 @@
import os
import logging
from typing import Dict, Any, Optional

from src.scrapers.base import WebScraper
from src.scrapers.w3schools import W3SchoolsScraper
from src.exporters.json_exporter import JsonExporter
from src.exporters.csv_exporter import CsvExporter

logger = logging.getLogger(__name__)


class ScraperApplication:
    """Main application that coordinates scraping and exporting"""

    def __init__(self, scraper: WebScraper):
        self.scraper = scraper
        self.json_exporter = JsonExporter()
        self.csv_exporter = CsvExporter()

    def run(self, output_dir: str = "data") -> Dict[str, Any]:
        """Run the scrapers and export the data"""
        # Create output directory if it doesn't exist
        os.makedirs(output_dir, exist_ok=True)

        # Run the scrapers
        if isinstance(self.scraper, W3SchoolsScraper):
            try:
                logger.info(f"Starting scrapers application with output directory: {output_dir}")
                course_data = self.scraper.scrape_course()

                # Export full course data to JSON
                json_filename = os.path.join(output_dir, "tutorial_course.json")
                self.json_exporter.export(course_data, json_filename)

                # Export modules to CSV
                modules_filename = os.path.join(output_dir, "tutorial_modules.csv")
                self.csv_exporter.export(course_data.modules, modules_filename)

                # Export tutorial titles and URLs to CSV
                tutorial_info = [
                    {"title": t.title, "url": t.url} for t in course_data.tutorials
                ]
                tutorials_filename = os.path.join(output_dir, "tutorial_info.csv")
                self.csv_exporter.export(tutorial_info, tutorials_filename, ["title", "url"])

                logger.info("Scraping and exporting completed successfully")
                return course_data.dict()
            except Exception as e:
                logger.error(f"Error in scrapers application: {e}")
                raise
        else:
            logger.error("Unsupported scrapers type")
            raise ValueError("Unsupported scrapers type")
6 changes: 6 additions & 0 deletions src/exporters/__init__.py
@@ -0,0 +1,6 @@
# Import exporters for easier access
from src.exporters.base import DataExporter
from src.exporters.json_exporter import JsonExporter
from src.exporters.csv_exporter import CsvExporter

__all__ = ['DataExporter', 'JsonExporter', 'CsvExporter']
Binary file added src/exporters/__pycache__/__init__.cpython-312.pyc
Binary file not shown.
Binary file added src/exporters/__pycache__/base.cpython-312.pyc
Binary file not shown.
Binary file not shown.
Binary file not shown.
17 changes: 17 additions & 0 deletions src/exporters/base.py
@@ -0,0 +1,17 @@
import logging
from abc import ABC, abstractmethod
from typing import Any

logger = logging.getLogger(__name__)


class DataExporter(ABC):
    """Abstract base class for data exporters"""

    def __init__(self):
        self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")

    @abstractmethod
    def export(self, data: Any, filename: str) -> None:
        """Export data to a file"""
        pass
Comment on lines +1 to +17
Owner

Not required for now, as we will be storing the text in MongoDB (NoSQL).
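
To make the alternative concrete, here is a rough sketch of shaping one scraped page into a JSON-compatible document of the kind a document store such as MongoDB accepts. The function name `to_document`, the `scraped_at` field, and the choice of a URL hash as `_id` are assumptions for illustration, not part of this PR. The actual write with a database driver is deliberately left out.

```python
import hashlib
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def to_document(title: str, url: str, content: str,
                code_examples: Optional[List[Dict[str, str]]] = None,
                scraped_at: Optional[datetime] = None) -> Dict[str, Any]:
    """Build a JSON-compatible dict for one tutorial page (hypothetical helper)."""
    # Mirror the PR's TutorialContent rule that content must not be empty.
    if not content.strip():
        raise ValueError("content cannot be empty")
    if scraped_at is None:
        scraped_at = datetime.now(timezone.utc)
    return {
        # Assumption: a hash of the URL as the key, so that re-scraping the
        # same page maps to the same document instead of a duplicate.
        "_id": hashlib.sha256(url.encode("utf-8")).hexdigest(),
        "title": title,
        "url": url,
        "content": content,
        "code_examples": list(code_examples or []),
        "scraped_at": scraped_at.isoformat(),
    }
```

A deterministic `_id` is only one option. If every crawl should be kept as history, the key would need to include the crawl version as well.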

41 changes: 41 additions & 0 deletions src/exporters/csv_exporter.py
@@ -0,0 +1,41 @@
import csv
import logging
from typing import List, Dict, Any, Optional

from pydantic import BaseModel
from src.exporters.base import DataExporter

logger = logging.getLogger(__name__)


class CsvExporter(DataExporter):
    """CSV data exporter"""

    def export(self, data: List[Dict], filename: str, fields: Optional[List[str]] = None) -> None:
        """Export data to CSV file"""
        if not data:
            self.logger.warning("No data to export")
            return

        try:
            # Determine fields if not provided
            if not fields and data:
                if isinstance(data[0], BaseModel):
                    fields = list(data[0].dict().keys())
                else:
                    fields = list(data[0].keys())

            with open(filename, 'w', newline='', encoding='utf-8') as f:
                writer = csv.DictWriter(f, fieldnames=fields)
                writer.writeheader()

                for item in data:
                    if isinstance(item, BaseModel):
                        writer.writerow(item.dict())
                    else:
                        writer.writerow(item)

            self.logger.info(f"Data exported to {filename}")
        except Exception as e:
            self.logger.error(f"Error exporting data to {filename}: {e}")
            raise
25 changes: 25 additions & 0 deletions src/exporters/json_exporter.py
@@ -0,0 +1,25 @@
import json
import logging
from typing import Any

from pydantic import BaseModel
from src.exporters.base import DataExporter

logger = logging.getLogger(__name__)


class JsonExporter(DataExporter):
    """JSON data exporter"""

    def export(self, data: Any, filename: str) -> None:
        """Export data to JSON file"""
        try:
            with open(filename, 'w', encoding='utf-8') as f:
                if isinstance(data, BaseModel):
                    json.dump(data.dict(), f, indent=4, ensure_ascii=False)
                else:
                    json.dump(data, f, indent=4, ensure_ascii=False)
            self.logger.info(f"Data exported to {filename}")
        except Exception as e:
            self.logger.error(f"Error exporting data to {filename}: {e}")
            raise
4 changes: 4 additions & 0 deletions src/models/__init__.py
@@ -0,0 +1,4 @@
# Import models for easier access
from src.models.schema import TutorialLink, CodeExample, TutorialContent, TutorialCourse

__all__ = ['TutorialLink', 'CodeExample', 'TutorialContent', 'TutorialCourse']
Binary file added src/models/__pycache__/__init__.cpython-312.pyc
Binary file not shown.
Binary file added src/models/__pycache__/schema.cpython-312.pyc
Owner

Please add all the metadata files to .gitignore and remove them from the PR.

Binary file not shown.
38 changes: 38 additions & 0 deletions src/models/schema.py
@@ -0,0 +1,38 @@
from typing import List, Dict, Any, Optional
from pydantic import BaseModel, Field, validator


class TutorialLink(BaseModel):
    """Model for a tutorial link in the sidebar"""
    title: str
    url: str


class CodeExample(BaseModel):
    """Model for a code example"""
    language: str = "javascript"  # Default language
    code: str


class TutorialContent(BaseModel):
    """Model for the content of a tutorial page"""
    title: str
    url: str
    content: str
    code_examples: List[CodeExample] = []
    next_link: Optional[str] = None

    @validator('content')
    def content_not_empty(cls, v):
        if not v or len(v.strip()) == 0:
            raise ValueError('content cannot be empty')
        return v


class TutorialCourse(BaseModel):
    """Model for the entire tutorial course"""
    title: str
    source_url: str
    modules: List[TutorialLink]
    tutorials: List[TutorialContent]
    metadata: Dict[str, Any] = Field(default_factory=dict)
Comment on lines +1 to +38
Owner

Great start on the content data model.

5 changes: 5 additions & 0 deletions src/scrapers/__init__.py
@@ -0,0 +1,5 @@
# Import scrapers for easier access
from src.scrapers.base import WebScraper
from src.scrapers.w3schools import W3SchoolsScraper

__all__ = ['WebScraper', 'W3SchoolsScraper']
Binary file added src/scrapers/__pycache__/__init__.cpython-312.pyc
Binary file not shown.
Binary file added src/scrapers/__pycache__/base.cpython-312.pyc
Binary file not shown.
Binary file not shown.