web scraping methodology #3

Aarthy-Ocha · 2025-04-15T05:06:07Z

No description provided.

main.py

src/scrapers/base.py

simrathanspal · 2025-04-18T14:39:24Z

src/scrapers/base.py

+        return session
+
+    def get_page_content(self, url: str) -> Optional[str]:
+        """Get HTML content from a URL"""


Let's elaborate on the docstring, for example:
"Get HTML content from a URL with a session timeout of 30 seconds" (or is it minutes?).
"The session has a default of 3 retries" ... etc.
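
A minimal sketch of what the expanded docstring could look like. The `BaseScraper` class, its constructor, and the `timeout` attribute are hypothetical stand-ins, since the diff only shows the method signature. The 30-second timeout and the return-`None`-on-failure behaviour are assumptions to confirm against the actual session setup. The session is assumed to behave like `requests.Session`.

```python
from typing import Optional


class BaseScraper:
    """Hypothetical stand-in for the scraper base class in src/scrapers/base.py."""

    def __init__(self, session, timeout: float = 30.0):
        # `session` is assumed to behave like requests.Session.
        self.session = session
        # Timeout in seconds; the value 30 is an assumption to be confirmed.
        self.timeout = timeout

    def get_page_content(self, url: str) -> Optional[str]:
        """Fetch a page and return its HTML as text.

        The request is sent through ``self.session``, which is assumed to be
        configured with a default of 3 retries, and it times out after
        ``self.timeout`` seconds (assumed default: 30).

        Args:
            url: Absolute URL of the page to fetch.

        Returns:
            The response body as a string, or ``None`` if the request
            failed or the server returned an HTTP error status.
        """
        try:
            response = self.session.get(url, timeout=self.timeout)
            response.raise_for_status()
        except OSError:
            # requests.RequestException derives from IOError (alias of OSError),
            # so this also covers connection errors and HTTP error statuses.
            return None
        return response.text
```

Stating the unit, the retry count, and the failure behaviour in the docstring means callers do not have to read the session setup to know what to expect.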

simrathanspal · 2025-04-18T14:44:37Z

main.py

+        w3schools_scraper = W3SchoolsScraper(args.url, delay=args.delay)
+
+        # Create the application
+        app = ScraperApplication(w3schools_scraper)


Instead of `ScraperApplication`, we will use a ZenML pipeline for orchestration.

simrathanspal · 2025-04-18T14:47:06Z

src/exporters/base.py

+import logging
+from abc import ABC, abstractmethod
+from typing import Any
+
+logger = logging.getLogger(__name__)
+
+
+class DataExporter(ABC):
+    """Abstract base class for data exporters"""
+
+    def __init__(self):
+        self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
+
+    @abstractmethod
+    def export(self, data: Any, filename: str) -> None:
+        """Export data to a file"""
+        pass


Not required for now, as we will be storing the text in MongoDB (NoSQL).

simrathanspal · 2025-04-18T14:52:04Z

src/models/schema.py

+from typing import List, Dict, Any, Optional
+from pydantic import BaseModel, Field, validator
+
+
+class TutorialLink(BaseModel):
+    """Model for a tutorial link in the sidebar"""
+    title: str
+    url: str
+
+
+class CodeExample(BaseModel):
+    """Model for a code example"""
+    language: str = "javascript"  # Default language
+    code: str
+
+
+class TutorialContent(BaseModel):
+    """Model for the content of a tutorial page"""
+    title: str
+    url: str
+    content: str
+    code_examples: List[CodeExample] = []
+    next_link: Optional[str] = None
+
+    @validator('content')
+    def content_not_empty(cls, v):
+        if not v or len(v.strip()) == 0:
+            raise ValueError('content cannot be empty')
+        return v
+
+
+class TutorialCourse(BaseModel):
+    """Model for the entire tutorial course"""
+    title: str
+    source_url: str
+    modules: List[TutorialLink]
+    tutorials: List[TutorialContent]
+    metadata: Dict[str, Any] = Field(default_factory=dict)


Great start on the content data model.

simrathanspal · 2025-04-18T14:56:46Z

data/raw/tutorial_info.csv

We should store the URL tree in the DB and version it by date.
The extracted content should carry this version and its URL path within the tree in the content data model, so we can easily fetch related content.
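
One way this could look, as a rough sketch only: the function names, the field names `tree_version` and `url_path`, and the example URLs are all hypothetical, and the resulting dictionaries are plain Python structures that would still need to be written to the database by separate code.

```python
from datetime import date
from typing import Any, Dict, List
from urllib.parse import urlparse


def url_path_segments(url: str) -> List[str]:
    """Split a URL's path into its non-empty segments."""
    return [part for part in urlparse(url).path.split("/") if part]


def build_url_tree(urls: List[str], crawl_date: date) -> Dict[str, Any]:
    """Build one date-versioned URL-tree document from a flat list of URLs.

    Nodes are stored as {"name", "children"} so that segment names
    containing dots never end up as document field names.
    """
    root: Dict[str, Any] = {"name": "", "children": []}
    for url in urls:
        node = root
        for segment in url_path_segments(url):
            child = next((c for c in node["children"] if c["name"] == segment), None)
            if child is None:
                child = {"name": segment, "children": []}
                node["children"].append(child)
            node = child
    return {"version": crawl_date.isoformat(), "tree": root}


def content_document(title: str, url: str, content: str, tree_version: str) -> Dict[str, Any]:
    """Build a content document that points back into a versioned URL tree."""
    return {
        "title": title,
        "url": url,
        "content": content,
        "tree_version": tree_version,        # which crawl this content belongs to
        "url_path": url_path_segments(url),  # position of the page in the tree
    }


# Illustrative URLs only.
tree = build_url_tree(
    [
        "https://www.w3schools.com/js/js_intro.asp",
        "https://www.w3schools.com/js/js_syntax.asp",
    ],
    date(2025, 4, 18),
)
doc = content_document(
    "JS Introduction",
    "https://www.w3schools.com/js/js_intro.asp",
    "Tutorial text goes here.",
    tree["version"],
)
```

With both fields on each content document, related pages could be fetched by filtering on `tree_version` together with a shared prefix of `url_path`.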

simrathanspal · 2025-04-18T15:04:00Z

src/models/__pycache__/schema.cpython-312.pyc

Please add all the metadata files (such as `__pycache__` and `.pyc` files) to `.gitignore` and remove them from the PR.

web scraping methodology

6b78398

simrathanspal requested changes Apr 18, 2025


3 participants
