@Aarthy-Ocha
Collaborator

No description provided.

        return session

    def get_page_content(self, url: str) -> Optional[str]:
        """Get HTML content from a URL"""
Owner

Let's elaborate on the docstring. For example:
Get HTML content from a URL with a session timeout of 30 (seconds or minutes? Please state the unit).
The session has a default of 3 retries ... etc.
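
One way the expanded docstring could look is sketched below. This is a standard-library stand-in, not the PR's code: `PageFetcher`, `timeout_seconds`, `max_retries`, and the injectable `opener` are hypothetical names, and the choice of seconds as the unit is an assumption, since the unit is exactly what needs confirming. The PR's own session object is swapped here for `urllib.request.urlopen` so the sketch runs without third-party packages.

```python
import urllib.request
from typing import Callable, Optional


class PageFetcher:
    """Hypothetical stand-in for the scraper class under review."""

    def __init__(
        self,
        timeout_seconds: float = 30.0,  # assumed unit: seconds
        max_retries: int = 3,           # retries after the first attempt
        opener: Callable = urllib.request.urlopen,
    ) -> None:
        self.timeout_seconds = timeout_seconds
        self.max_retries = max_retries
        self._opener = opener

    def get_page_content(self, url: str) -> Optional[str]:
        """Fetch a page and return its HTML as text.

        Each attempt is abandoned after ``timeout_seconds`` seconds
        (default 30). A failed attempt is retried up to ``max_retries``
        times (default 3), so at most ``max_retries + 1`` requests are made.

        Args:
            url: Absolute URL of the page to fetch.

        Returns:
            The decoded response body, or ``None`` if every attempt failed.
        """
        for _ in range(self.max_retries + 1):
            try:
                with self._opener(url, timeout=self.timeout_seconds) as response:
                    return response.read().decode("utf-8", errors="replace")
            except OSError:
                # URLError and socket timeouts are both OSError subclasses.
                continue
        return None
```

Stating the unit, the retry count, and the `None` return value in the docstring means a caller does not have to read the session setup to learn how a failure surfaces.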

w3schools_scraper = W3SchoolsScraper(args.url, delay=args.delay)

# Create the application
app = ScraperApplication(w3schools_scraper)
Owner

Instead of ScraperApplication, we will use a ZenML pipeline for orchestration.

Comment on lines +1 to +17
import logging
from abc import ABC, abstractmethod
from typing import Any

logger = logging.getLogger(__name__)


class DataExporter(ABC):
    """Abstract base class for data exporters"""

    def __init__(self):
        self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")

    @abstractmethod
    def export(self, data: Any, filename: str) -> None:
        """Export data to a file"""
        pass
Owner

Not required for now, as we will be storing the text in MongoDB (NoSQL).

Comment on lines +1 to +38
from typing import List, Dict, Any, Optional
from pydantic import BaseModel, Field, validator


class TutorialLink(BaseModel):
    """Model for a tutorial link in the sidebar"""
    title: str
    url: str


class CodeExample(BaseModel):
    """Model for a code example"""
    language: str = "javascript"  # Default language
    code: str


class TutorialContent(BaseModel):
    """Model for the content of a tutorial page"""
    title: str
    url: str
    content: str
    code_examples: List[CodeExample] = []
    next_link: Optional[str] = None

    @validator('content')
    def content_not_empty(cls, v):
        if not v or len(v.strip()) == 0:
            raise ValueError('content cannot be empty')
        return v


class TutorialCourse(BaseModel):
    """Model for the entire tutorial course"""
    title: str
    source_url: str
    modules: List[TutorialLink]
    tutorials: List[TutorialContent]
    metadata: Dict[str, Any] = Field(default_factory=dict)
Owner

Great start on the content data model.

Owner

We should store the URL tree in the DB and version it by date.
The extracted content should carry this version and its URL path within the tree in the content data model, so we can easily fetch related content.
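
A rough sketch of what this could look like follows. It uses plain dataclasses rather than the PR's pydantic models so it runs with the standard library alone, and every name in it (`UrlTreeNode`, `UrlTreeSnapshot`, `VersionedTutorialContent`, `tree_version`, `tree_path`, `related_content`) is hypothetical. Representing the path as a list of titles from the root is also an assumption; a list of URLs would work the same way.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class UrlTreeNode:
    """One entry in the crawled URL tree (hypothetical shape)."""
    title: str
    url: str
    children: List["UrlTreeNode"] = field(default_factory=list)


@dataclass
class UrlTreeSnapshot:
    """A URL tree versioned by the date it was crawled."""
    version: str  # ISO date string
    root: UrlTreeNode

    def path_to(self, url: str) -> Optional[List[str]]:
        """Return the titles from the root down to `url`, or None if absent."""
        def walk(node: UrlTreeNode, trail: List[str]) -> Optional[List[str]]:
            trail = trail + [node.title]
            if node.url == url:
                return trail
            for child in node.children:
                found = walk(child, trail)
                if found is not None:
                    return found
            return None

        return walk(self.root, [])


@dataclass
class VersionedTutorialContent:
    """Extracted content tagged with the tree version and its path in the tree."""
    title: str
    url: str
    content: str
    tree_version: str
    tree_path: List[str]


def related_content(
    items: List[VersionedTutorialContent],
    tree_version: str,
    path_prefix: List[str],
) -> List[VersionedTutorialContent]:
    """Select items from one tree version that sit under `path_prefix`."""
    n = len(path_prefix)
    return [
        item for item in items
        if item.tree_version == tree_version and item.tree_path[:n] == path_prefix
    ]


# Usage with made-up URLs and a made-up crawl date.
leaf = UrlTreeNode("Array Methods", "https://example.com/js/array_methods")
arrays = UrlTreeNode("JS Arrays", "https://example.com/js/arrays", [leaf])
root = UrlTreeNode("JS Tutorial", "https://example.com/js/", [arrays])
snapshot = UrlTreeSnapshot(version=date(2024, 1, 31).isoformat(), root=root)
```

Storing `tree_version` and `tree_path` on each content record means related pages can be fetched with a simple filter on two fields, without reloading and walking the tree.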

Owner

Please add all the metadata files to .gitignore and remove them from the PR.
