web scraping methodology #3
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| title,url | ||
| No Title Found,https://www.w3schools.com/nodejs/default.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_intro.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_get_started.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_modules.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_http.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_filesystem.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_url.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_npm.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_events.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_uploadfiles.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_email.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_create_db.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_create_table.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_insert.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_select.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_where.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_orderby.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_delete.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_drop_table.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_update.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_limit.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mysql_join.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_create_db.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_createcollection.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_insert.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_find.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_query.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_sort.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_delete.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_drop.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_update.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_limit.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_mongodb_join.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi_gpio_intro.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi_blinking_led.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi_led_pushbutton.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi_flowing_leds.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi_webserver_websocket.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi_rgb_led_websocket.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_raspberrypi_components.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/ref_modules.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_compiler.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_server.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_syllabus.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_study_plan.asp | ||
| No Title Found,https://www.w3schools.com/nodejs/nodejs_exam.asp |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| title,url | ||
| Node.js HOME,https://www.w3schools.com/nodejs/default.asp | ||
| Node.js Intro,https://www.w3schools.com/nodejs/nodejs_intro.asp | ||
| Node.js Get Started,https://www.w3schools.com/nodejs/nodejs_get_started.asp | ||
| Node.js Modules,https://www.w3schools.com/nodejs/nodejs_modules.asp | ||
| Node.js HTTP Module,https://www.w3schools.com/nodejs/nodejs_http.asp | ||
| Node.js File System,https://www.w3schools.com/nodejs/nodejs_filesystem.asp | ||
| Node.js URL Module,https://www.w3schools.com/nodejs/nodejs_url.asp | ||
| Node.js NPM,https://www.w3schools.com/nodejs/nodejs_npm.asp | ||
| Node.js Events,https://www.w3schools.com/nodejs/nodejs_events.asp | ||
| Node.js Upload Files,https://www.w3schools.com/nodejs/nodejs_uploadfiles.asp | ||
| Node.js Email,https://www.w3schools.com/nodejs/nodejs_email.asp | ||
| MySQL Get Started,https://www.w3schools.com/nodejs/nodejs_mysql.asp | ||
| MySQL Create Database,https://www.w3schools.com/nodejs/nodejs_mysql_create_db.asp | ||
| MySQL Create Table,https://www.w3schools.com/nodejs/nodejs_mysql_create_table.asp | ||
| MySQL Insert Into,https://www.w3schools.com/nodejs/nodejs_mysql_insert.asp | ||
| MySQL Select From,https://www.w3schools.com/nodejs/nodejs_mysql_select.asp | ||
| MySQL Where,https://www.w3schools.com/nodejs/nodejs_mysql_where.asp | ||
| MySQL Order By,https://www.w3schools.com/nodejs/nodejs_mysql_orderby.asp | ||
| MySQL Delete,https://www.w3schools.com/nodejs/nodejs_mysql_delete.asp | ||
| MySQL Drop Table,https://www.w3schools.com/nodejs/nodejs_mysql_drop_table.asp | ||
| MySQL Update,https://www.w3schools.com/nodejs/nodejs_mysql_update.asp | ||
| MySQL Limit,https://www.w3schools.com/nodejs/nodejs_mysql_limit.asp | ||
| MySQL Join,https://www.w3schools.com/nodejs/nodejs_mysql_join.asp | ||
| MongoDB Get Started,https://www.w3schools.com/nodejs/nodejs_mongodb.asp | ||
| MongoDB Create DB,https://www.w3schools.com/nodejs/nodejs_mongodb_create_db.asp | ||
| MongoDB Collection,https://www.w3schools.com/nodejs/nodejs_mongodb_createcollection.asp | ||
| MongoDB Insert,https://www.w3schools.com/nodejs/nodejs_mongodb_insert.asp | ||
| MongoDB Find,https://www.w3schools.com/nodejs/nodejs_mongodb_find.asp | ||
| MongoDB Query,https://www.w3schools.com/nodejs/nodejs_mongodb_query.asp | ||
| MongoDB Sort,https://www.w3schools.com/nodejs/nodejs_mongodb_sort.asp | ||
| MongoDB Delete,https://www.w3schools.com/nodejs/nodejs_mongodb_delete.asp | ||
| MongoDB Drop Collection,https://www.w3schools.com/nodejs/nodejs_mongodb_drop.asp | ||
| MongoDB Update,https://www.w3schools.com/nodejs/nodejs_mongodb_update.asp | ||
| MongoDB Limit,https://www.w3schools.com/nodejs/nodejs_mongodb_limit.asp | ||
| MongoDB Join,https://www.w3schools.com/nodejs/nodejs_mongodb_join.asp | ||
| RasPi Get Started,https://www.w3schools.com/nodejs/nodejs_raspberrypi.asp | ||
| RasPi GPIO Introduction,https://www.w3schools.com/nodejs/nodejs_raspberrypi_gpio_intro.asp | ||
| RasPi Blinking LED,https://www.w3schools.com/nodejs/nodejs_raspberrypi_blinking_led.asp | ||
| RasPi LED & Pushbutton,https://www.w3schools.com/nodejs/nodejs_raspberrypi_led_pushbutton.asp | ||
| RasPi Flowing LEDs,https://www.w3schools.com/nodejs/nodejs_raspberrypi_flowing_leds.asp | ||
| RasPi WebSocket,https://www.w3schools.com/nodejs/nodejs_raspberrypi_webserver_websocket.asp | ||
| RasPi RGB LED WebSocket,https://www.w3schools.com/nodejs/nodejs_raspberrypi_rgb_led_websocket.asp | ||
| RasPi Components,https://www.w3schools.com/nodejs/nodejs_raspberrypi_components.asp | ||
| Built-in Modules,https://www.w3schools.com/nodejs/ref_modules.asp | ||
| Node.js Compiler,https://www.w3schools.com/nodejs/nodejs_compiler.asp | ||
| Node.js Server,https://www.w3schools.com/nodejs/nodejs_server.asp | ||
| Node.js Syllabus,https://www.w3schools.com/nodejs/nodejs_syllabus.asp | ||
| Node.js Study Plan,https://www.w3schools.com/nodejs/nodejs_study_plan.asp | ||
| Node.js Certificate,https://www.w3schools.com/nodejs/nodejs_exam.asp |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,81 @@ | ||
| #!/usr/bin/env python | ||
| """ | ||
| Main entry point for the web scrapers application. | ||
| """ | ||
| import argparse | ||
| import logging | ||
| import sys | ||
|
|
||
| from src.scrapers.w3schools import W3SchoolsScraper | ||
| from src.app import ScraperApplication | ||
|
|
||
| # Configure logging | ||
| logging.basicConfig( | ||
| level=logging.INFO, | ||
| format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', | ||
| handlers=[ | ||
| logging.StreamHandler(sys.stdout), | ||
| logging.FileHandler('scrapers.log') | ||
| ] | ||
| ) | ||
| logger = logging.getLogger(__name__) | ||
Aarthy-Ocha marked this conversation as resolved.
|
||
|
|
||
|
|
||
| def parse_arguments(): | ||
| """Parse command line arguments""" | ||
| parser = argparse.ArgumentParser(description='Web scrapers for tutorial websites') | ||
| parser.add_argument( | ||
| '--url', | ||
| type=str, | ||
| default="https://www.w3schools.com/nodejs/nodejs_intro.asp", | ||
| help='Starting URL for the scrapers' | ||
| ) | ||
| parser.add_argument( | ||
| '--output', | ||
| type=str, | ||
| default="data/raw", | ||
| help='Output directory for scraped data' | ||
| ) | ||
| parser.add_argument( | ||
| '--delay', | ||
| type=float, | ||
| default=1.0, | ||
| help='Delay between requests in seconds' | ||
| ) | ||
| return parser.parse_args() | ||
|
|
||
|
|
||
| def main(): | ||
| """Main function to run the scrapers""" | ||
| try: | ||
| args = parse_arguments() | ||
|
|
||
| # Log startup information | ||
| logger.info("Starting Web Scraper") | ||
| logger.info(f"URL: {args.url}") | ||
| logger.info(f"Output directory: {args.output}") | ||
| logger.info(f"Request delay: {args.delay} seconds") | ||
|
|
||
| # Create the scrapers | ||
| w3schools_scraper = W3SchoolsScraper(args.url, delay=args.delay) | ||
|
|
||
| # Create the application | ||
| app = ScraperApplication(w3schools_scraper) | ||
|
Owner
Instead of ScraperApplication, we will use a ZenML pipeline for orchestration. |
||
|
|
||
| # Run the application | ||
| data = app.run(args.output) | ||
|
|
||
| # Access and print metadata | ||
| logger.info("\nMetadata from the course:") | ||
| for key, value in data["metadata"].items(): | ||
| logger.info(f" {key}: {value}") | ||
|
|
||
| logger.info("\nScraping and export completed!") | ||
| return 0 | ||
| except Exception as e: | ||
| logger.error(f"Error in main: {e}", exc_info=True) | ||
| return 1 | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| sys.exit(main()) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| import os | ||
| import logging | ||
| from typing import Dict, Any, Optional | ||
|
|
||
| from src.scrapers.base import WebScraper | ||
| from src.scrapers.w3schools import W3SchoolsScraper | ||
| from src.exporters.json_exporter import JsonExporter | ||
| from src.exporters.csv_exporter import CsvExporter | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class ScraperApplication: | ||
| """Main application that coordinates scraping and exporting""" | ||
|
|
||
| def __init__(self, scraper: WebScraper): | ||
| self.scraper = scraper | ||
| self.json_exporter = JsonExporter() | ||
| self.csv_exporter = CsvExporter() | ||
|
|
||
| def run(self, output_dir: str = "data") -> Dict[str, Any]: | ||
| """Run the scrapers and export the data""" | ||
| # Create output directory if it doesn't exist | ||
| os.makedirs(output_dir, exist_ok=True) | ||
|
|
||
| # Run the scrapers | ||
| if isinstance(self.scraper, W3SchoolsScraper): | ||
| try: | ||
| logger.info(f"Starting scrapers application with output directory: {output_dir}") | ||
| course_data = self.scraper.scrape_course() | ||
|
|
||
| # Export full course data to JSON | ||
| json_filename = os.path.join(output_dir, "tutorial_course.json") | ||
| self.json_exporter.export(course_data, json_filename) | ||
|
|
||
| # Export modules to CSV | ||
| modules_filename = os.path.join(output_dir, "tutorial_modules.csv") | ||
| self.csv_exporter.export(course_data.modules, modules_filename) | ||
|
|
||
| # Export tutorial titles and URLs to CSV | ||
| tutorial_info = [ | ||
| {"title": t.title, "url": t.url} for t in course_data.tutorials | ||
| ] | ||
| tutorials_filename = os.path.join(output_dir, "tutorial_info.csv") | ||
| self.csv_exporter.export(tutorial_info, tutorials_filename, ["title", "url"]) | ||
|
|
||
| logger.info("Scraping and exporting completed successfully") | ||
| return course_data.dict() | ||
| except Exception as e: | ||
| logger.error(f"Error in scrapers application: {e}") | ||
| raise | ||
| else: | ||
| logger.error("Unsupported scrapers type") | ||
| raise ValueError("Unsupported scrapers type") |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| # Import exporters for easier access | ||
| from src.exporters.base import DataExporter | ||
| from src.exporters.json_exporter import JsonExporter | ||
| from src.exporters.csv_exporter import CsvExporter | ||
|
|
||
| __all__ = ['DataExporter', 'JsonExporter', 'CsvExporter'] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| import logging | ||
| from abc import ABC, abstractmethod | ||
| from typing import Any | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class DataExporter(ABC): | ||
| """Abstract base class for data exporters""" | ||
|
|
||
| def __init__(self): | ||
| self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}") | ||
|
|
||
| @abstractmethod | ||
| def export(self, data: Any, filename: str) -> None: | ||
| """Export data to a file""" | ||
| pass | ||
|
Comment on lines +1 to +17
Owner
Not required for now, as we will be storing the text in MongoDB (NoSQL). |
||
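The review comment above points toward MongoDB rather than file exporters. A minimal sketch of what that storage step might look like follows, assuming documents are plain dicts keyed by `url`. `CollectionLike`, `store_tutorials`, and `InMemoryCollection` are hypothetical names introduced only for illustration; the one real API assumed is that a pymongo collection exposes `replace_one(filter, replacement, upsert=...)`.

```python
from typing import Any, Dict, Iterable, Protocol


class CollectionLike(Protocol):
    """Minimal subset of a MongoDB collection used by this sketch (hypothetical)."""

    def replace_one(
        self, filter: Dict[str, Any], replacement: Dict[str, Any], upsert: bool = False
    ) -> Any:
        ...


def store_tutorials(collection: CollectionLike, tutorials: Iterable[Dict[str, Any]]) -> int:
    """Upsert each tutorial document, keyed by its URL; return how many were written."""
    written = 0
    for doc in tutorials:
        # Keying on "url" keeps re-runs idempotent: a re-scraped page
        # replaces its earlier document instead of creating a duplicate.
        collection.replace_one({"url": doc["url"]}, doc, upsert=True)
        written += 1
    return written


class InMemoryCollection:
    """Hypothetical stand-in for a real collection, useful for local experiments."""

    def __init__(self) -> None:
        self.docs: Dict[str, Dict[str, Any]] = {}

    def replace_one(self, filter, replacement, upsert=False):
        key = filter["url"]
        # Replace an existing document, or insert one when upsert is requested.
        if key in self.docs or upsert:
            self.docs[key] = dict(replacement)
```

With a real database, the same function could presumably be handed a pymongo collection in place of `InMemoryCollection`; connection setup is left out here because the PR does not specify it.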
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| import csv | ||
| import logging | ||
| from typing import List, Dict, Any | ||
|
|
||
| from pydantic import BaseModel | ||
| from src.exporters.base import DataExporter | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class CsvExporter(DataExporter): | ||
| """CSV data exporter""" | ||
|
|
||
| def export(self, data: List[Dict], filename: str, fields: List[str] = None) -> None: | ||
| """Export data to CSV file""" | ||
| if not data: | ||
| self.logger.warning("No data to export") | ||
| return | ||
|
|
||
| try: | ||
| # Determine fields if not provided | ||
| if not fields and data: | ||
| if isinstance(data[0], BaseModel): | ||
| fields = list(data[0].dict().keys()) | ||
| else: | ||
| fields = list(data[0].keys()) | ||
|
|
||
| with open(filename, 'w', newline='', encoding='utf-8') as f: | ||
| writer = csv.DictWriter(f, fieldnames=fields) | ||
| writer.writeheader() | ||
|
|
||
| for item in data: | ||
| if isinstance(item, BaseModel): | ||
| writer.writerow(item.dict()) | ||
| else: | ||
| writer.writerow(item) | ||
|
|
||
| self.logger.info(f"Data exported to {filename}") | ||
| except Exception as e: | ||
| self.logger.error(f"Error exporting data to {filename}: {e}") | ||
| raise |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| import json | ||
| import logging | ||
| from typing import Any | ||
|
|
||
| from pydantic import BaseModel | ||
| from src.exporters.base import DataExporter | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class JsonExporter(DataExporter): | ||
| """JSON data exporter""" | ||
|
|
||
| def export(self, data: Any, filename: str) -> None: | ||
| """Export data to JSON file""" | ||
| try: | ||
| with open(filename, 'w', encoding='utf-8') as f: | ||
| if isinstance(data, BaseModel): | ||
| json.dump(data.dict(), f, indent=4, ensure_ascii=False) | ||
| else: | ||
| json.dump(data, f, indent=4, ensure_ascii=False) | ||
| self.logger.info(f"Data exported to {filename}") | ||
| except Exception as e: | ||
| self.logger.error(f"Error exporting data to {filename}: {e}") | ||
| raise |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| # Import models for easier access | ||
| from src.models.schema import TutorialLink, CodeExample, TutorialContent, TutorialCourse | ||
|
|
||
| __all__ = ['TutorialLink', 'CodeExample', 'TutorialContent', 'TutorialCourse'] |
|
Owner
Please add all the metadata files to .gitignore and remove them from the PR. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| from typing import List, Dict, Any, Optional | ||
| from pydantic import BaseModel, Field, validator | ||
|
|
||
|
|
||
| class TutorialLink(BaseModel): | ||
| """Model for a tutorial link in the sidebar""" | ||
| title: str | ||
| url: str | ||
|
|
||
|
|
||
| class CodeExample(BaseModel): | ||
| """Model for a code example""" | ||
| language: str = "javascript" # Default language | ||
| code: str | ||
|
|
||
|
|
||
| class TutorialContent(BaseModel): | ||
| """Model for the content of a tutorial page""" | ||
| title: str | ||
| url: str | ||
| content: str | ||
| code_examples: List[CodeExample] = [] | ||
| next_link: Optional[str] = None | ||
|
|
||
| @validator('content') | ||
| def content_not_empty(cls, v): | ||
| if not v or len(v.strip()) == 0: | ||
| raise ValueError('content cannot be empty') | ||
| return v | ||
|
|
||
|
|
||
| class TutorialCourse(BaseModel): | ||
| """Model for the entire tutorial course""" | ||
| title: str | ||
| source_url: str | ||
| modules: List[TutorialLink] | ||
| tutorials: List[TutorialContent] | ||
| metadata: Dict[str, Any] = Field(default_factory=dict) | ||
|
Comment on lines +1 to +38
Owner
Great start on the content data model. |
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| # Import scrapers for easier access | ||
| from src.scrapers.base import WebScraper | ||
| from src.scrapers.w3schools import W3SchoolsScraper | ||
|
|
||
| __all__ = ['WebScraper', 'W3SchoolsScraper'] |
We should store the URL tree in the database and version it by date.
The extracted content should carry this version and its URL path in the tree within the content data model, so we can easily fetch related content.
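One way to read the suggestion above is sketched below. It uses standard-library dataclasses rather than the PR's pydantic models so it stays self-contained, and every name in it (`UrlTreeVersion`, `VersionedContent`, `related`, `tree_version`, `tree_path`) is an assumption made for illustration, not part of the PR. The snapshot is versioned by its crawl date, and each content record stores that version together with its path in the tree.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Tuple
from urllib.parse import urlparse


@dataclass
class UrlTreeVersion:
    """A snapshot of the crawled URL tree, versioned by crawl date (hypothetical model)."""
    crawled_on: date
    # Nested dict keyed by path segment; leaves are empty dicts.
    tree: Dict[str, dict] = field(default_factory=dict)

    @property
    def version(self) -> str:
        # ISO date string, e.g. "2025-01-15", used as the version label.
        return self.crawled_on.isoformat()

    def add_url(self, url: str) -> Tuple[str, ...]:
        """Insert a URL into the tree and return its path within the tree."""
        parsed = urlparse(url)
        segments = (parsed.netloc, *[s for s in parsed.path.split("/") if s])
        node = self.tree
        for segment in segments:
            node = node.setdefault(segment, {})
        return segments


@dataclass
class VersionedContent:
    """Extracted content tagged with the tree version and its path in that tree."""
    title: str
    url: str
    content: str
    tree_version: str
    tree_path: Tuple[str, ...]


def related(items: List[VersionedContent], target: VersionedContent) -> List[VersionedContent]:
    """Return items from the same tree version that share the target's parent path."""
    parent = target.tree_path[:-1]
    return [
        item for item in items
        if item is not target
        and item.tree_version == target.tree_version
        and item.tree_path[:-1] == parent
    ]
```

Storing the path as a tuple of segments keeps sibling lookups to a simple prefix comparison; in a database the same idea could be expressed as an indexed path field, though that choice is left open here.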