Stone Scraper is an AI-powered tool for automated web data extraction. Built with Streamlit, LangChain, and Selenium to transform unstructured web content into clean, actionable insights. Currently working on OpenAI API cloud deployment. Try it out ⬇️⬇️


STONESCRAPER

Extracting Web Intelligence, One Page at a Time!

Built with the tools and technologies:

Markdown, Selenium, Python, LangChain, BeautifulSoup, Ollama, OpenAI, Streamlit, and VS Code


📍 Overview

Stone Scraper is a ground-breaking AI-powered tool that transforms the way we harvest data from the web. By automating the extraction of structured data from websites, it offers invaluable assistance to data analysts, web researchers, and anyone interested in web data mining. Leveraging state-of-the-art technologies like Streamlit, Langchain, and Selenium, it streamlines web scraping, HTML cleaning, and data parsing. The result? A user-friendly, efficient, and interactive solution that turns unstructured web content into organized, actionable insights.


👾 Features

⚙️ Architecture
  • The architecture is based on a Python-powered script.
  • The project is structured into distinct script files for specific tasks such as scrape.py for web scraping and parse.py for data parsing.
  • The main.py script integrates all functions and provides an interactive user interface.
🔩 Code Quality
  • The code is well-structured into separate Python scripts with specific tasks.
  • There is a clear distinction of tasks among different scripts such as main.py, scrape.py, and parse.py.
  • The use of Python and its simplicity ensures maintainability and readability of the code.
📄 Documentation
  • The primary language used is Python, with 3 Python files in the project.
  • Installation and usage commands are well-documented.
  • Dependencies are managed through the requirements.txt file.
🔌 Integrations
  • The project integrates several Python packages such as streamlit, langchain, selenium, and beautifulsoup4.
  • It also includes chromedriver for web navigation and scraping.
  • The python-dotenv package is used for managing environment variables.
🧩 Modularity
  • Tasks are separated into different scripts ensuring high modularity.
  • scrape.py handles web scraping, parse.py handles data parsing, and main.py integrates all functionalities.
  • The use of requirements.txt for managing dependencies also adds to the modularity.
🧪 Testing
  • Unit tests can be run using the pytest command as indicated in the documentation.
  • However, specific details about the test coverage and the results are not provided.
⚡️ Performance
  • The use of selenium and chromedriver ensures efficient web scraping.
  • beautifulsoup4 and lxml allow for fast and efficient HTML parsing.
  • The langchain and langchain_ollama packages provide high-performance language processing capabilities.
🛡️ Security
  • The use of python-dotenv for managing environment variables provides a level of security for sensitive data such as API keys (a minimal sketch follows this list).
  • However, without more details about authentication or encryption methods, a complete security analysis cannot be provided.
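
As noted under Integrations and Security, python-dotenv keeps secrets in a local .env file rather than in the source. A minimal sketch of that pattern; the OPENAI_API_KEY variable name is an assumption for illustration, not taken from the repository:

    # Load secrets from a local .env file that is never committed to git
    import os
    from dotenv import load_dotenv

    load_dotenv()  # reads key=value pairs from .env into the process environment

    # Hypothetical key name -- the actual .env entries are not documented here
    api_key = os.getenv("OPENAI_API_KEY")
    if api_key is None:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")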

📁 Project Structure

└── Stone_Scraper.git/
    ├── README.md
    ├── __pycache__
    │   ├── parse.cpython-313.pyc
    │   └── scrape.cpython-313.pyc
    ├── chromedriver
    ├── main.py
    ├── parse.py
    ├── requirements.txt
    └── scrape.py

📂 Project Index

STONE_SCRAPER.GIT/
__root__
chromedriver - The bundled ChromeDriver binary that Selenium uses to drive a Chrome browser for web navigation and scraping.
main.py - Powers Stone Scraper, a web intelligence tool that extracts structured data from websites using AI. It manages the web scraping process, runs AI parsing on the cleaned content, and provides an interactive user interface for data extraction. It is aimed at data analysts, web researchers, and anyone who needs to extract data from websites.
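
A hedged sketch of how a Streamlit front end could wire the pipeline together. The helper names (scrape_website, clean_html, split_dom_content, parse_with_ollama) are assumptions for illustration, sketched under scrape.py and parse.py below, and are not necessarily the repository's actual function names:

    # Streamlit UI driving scrape -> clean -> chunk -> parse
    import streamlit as st

    from scrape import scrape_website, clean_html, split_dom_content  # hypothetical helpers
    from parse import parse_with_ollama                               # hypothetical helper

    st.title("Stone Scraper")

    url = st.text_input("Website URL")
    if st.button("Scrape site") and url:
        raw_html = scrape_website(url)  # Selenium fetch
        st.session_state["chunks"] = split_dom_content(clean_html(raw_html))
        st.success(f"Scraped {len(st.session_state['chunks'])} content chunks")

    description = st.text_area("Describe the data you want to extract")
    if st.button("Parse content") and description and "chunks" in st.session_state:
        result = parse_with_ollama(st.session_state["chunks"], description)
        st.text_area("Parsed result", result)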
scrape.py - Automates the web scraping process, extracts HTML content, and manages secure credentials. It launches a browser, navigates to the specified website, and retrieves the full-page HTML. The script then cleans the HTML by removing scripts, styles, and unnecessary whitespace, and breaks the DOM text into manageable chunks for subsequent processing stages.
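
A minimal sketch of that flow using Selenium and BeautifulSoup. The local chromedriver path, the function names, and the chunk size are assumptions for illustration:

    # Fetch a page with Selenium, strip noise with BeautifulSoup, and chunk the text
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    def scrape_website(url: str) -> str:
        """Launch Chrome via the bundled chromedriver and return the full-page HTML."""
        driver = webdriver.Chrome(service=Service("./chromedriver"))
        try:
            driver.get(url)
            return driver.page_source
        finally:
            driver.quit()

    def clean_html(html: str) -> str:
        """Remove scripts, styles, and extra whitespace, keeping only visible text."""
        soup = BeautifulSoup(html, "lxml")
        for tag in soup(["script", "style"]):
            tag.decompose()
        text = soup.get_text(separator="\n")
        return "\n".join(line.strip() for line in text.splitlines() if line.strip())

    def split_dom_content(text: str, max_chars: int = 6000) -> list[str]:
        """Break the cleaned text into chunks small enough for the language model."""
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]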
requirements.txt - Manages the project's dependencies, specifying Python packages such as Streamlit, LangChain, Selenium, BeautifulSoup4, and others. These packages support web scraping, HTML/XML parsing, environment management, and the interactive web application, forming a crucial part of the project's functionality.
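
An illustrative requirements.txt along those lines; the package list is inferred from this README, and the actual file may pin versions or include more entries:

    streamlit
    langchain
    langchain_ollama
    selenium
    beautifulsoup4
    lxml
    python-dotenv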
parse.py - Uses the OllamaLLM language model to extract quantitative or structured data from web content. It generates structured prompts, applies the language model to each content chunk, then parses and combines the results. The outcome is a single string of parsed results, potentially formatted as a clean table.
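
A hedged sketch of that parsing step using langchain and langchain_ollama; the model name and prompt wording are assumptions, not the repository's actual values:

    # Prompt an Ollama-served model over each chunk and join the parsed pieces
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_ollama import OllamaLLM

    TEMPLATE = (
        "You are extracting structured data from scraped web content.\n"
        "Content chunk: {dom_content}\n"
        "Extract only the information matching this description: {parse_description}\n"
        "Return a clean table if possible, and nothing else."
    )

    model = OllamaLLM(model="llama3")  # hypothetical model choice
    chain = ChatPromptTemplate.from_template(TEMPLATE) | model

    def parse_with_ollama(chunks: list[str], parse_description: str) -> str:
        """Apply the prompt to every chunk and combine the results into one string."""
        results = [
            chain.invoke({"dom_content": chunk, "parse_description": parse_description})
            for chunk in chunks
        ]
        return "\n".join(results)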

🚀 Getting Started

☑️ Prerequisites

Before getting started with Stone Scraper, ensure your runtime environment meets the following requirements:

  • Programming Language: Python
  • Package Manager: Pip

⚙️ Installation

Install Stone Scraper using one of the following methods:

Build from source:

  1. Clone the Stone_Scraper repository:
❯ git clone https://github.com/ralstonraphael/Stone_Scraper.git
  2. Navigate to the project directory:
❯ cd Stone_Scraper
  3. Install the project dependencies using pip:
❯ pip install -r requirements.txt

🤖 Usage

Run Stone Scraper using the following command:

❯ python {entrypoint}
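
Since main.py serves a Streamlit interface, the app is typically launched through Streamlit's runner rather than the plain interpreter (assuming main.py is the intended entrypoint):

❯ streamlit run main.py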

🧪 Testing

Run the test suite using the following command:

❯ pytest
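
As noted under Features, test coverage details are not provided in the current snapshot. A minimal pytest example for the hypothetical split_dom_content helper from the scrape.py sketch above could look like:

    # test_scrape.py sketch: chunking should cover the input without losing text
    from scrape import split_dom_content

    def test_split_dom_content_roundtrip():
        text = "word " * 5000
        chunks = split_dom_content(text, max_chars=1000)
        assert all(len(chunk) <= 1000 for chunk in chunks)
        assert "".join(chunks) == text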

📌 Project Roadmap

  • Task 1: Implement web scraper API.
  • Task 2: Implement OpenAI API and deploy through the cloud.
  • Task 3: Add an option to input your own API key.

🔰 Contributing

Contributing Guidelines
  1. Fork the Repository: Start by forking the project repository to your GitHub account.
  2. Clone Locally: Clone the forked repository to your local machine using a git client.
    git clone https://github.com/ralstonraphael/Stone_Scraper.git
  3. Create a New Branch: Always work on a new branch, giving it a descriptive name.
    git checkout -b new-feature-x
  4. Make Your Changes: Develop and test your changes locally.
  5. Commit Your Changes: Commit with a clear message describing your updates.
    git commit -m 'Implemented new feature x.'
  6. Push to GitHub: Push the changes to your forked repository.
    git push origin new-feature-x
  7. Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
  8. Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!


🎗 License

This project is released under the Unlicense. For more details, refer to the LICENSE file.


🙌 Acknowledgments

I’d like to express my gratitude to the tools, libraries, and communities that made this project possible:

  • Streamlit – for making it incredibly easy to build interactive web apps with Python.
  • LangChain – for providing a powerful framework to integrate LLMs in structured pipelines.
  • OpenAI – for their accessible and reliable language models, powering intelligent text analysis.
  • Ollama – for enabling local LLM experimentation and model integration.
  • Selenium – for automating browser interactions and web scraping workflows.
  • BeautifulSoup – for efficient and intuitive HTML parsing.
  • Python – the language at the core of everything.
  • Special thanks to the open-source community for making powerful tools accessible to everyone.
