Stone Scraper is an AI-powered tool for automated web data extraction. Built with Streamlit, LangChain, and Selenium to transform unstructured web content into clean, actionable insights. Currently working on OpenAI API cloud deployment. Try it out ⬇️⬇️


STONESCRAPER

Extracting Web Intelligence, One Page at a Time!

Built with the tools and technologies:

Markdown, Selenium, Python, LangChain, BeautifulSoup, Ollama, OpenAI, Streamlit, and VS Code


📍 Overview

Stone Scraper is a ground-breaking AI-powered tool that transforms the way we harvest data from the web. By automating the extraction of structured data from websites, it offers invaluable assistance to data analysts, web researchers, and anyone interested in web data mining. Leveraging state-of-the-art technologies like Streamlit, Langchain, and Selenium, it streamlines web scraping, HTML cleaning, and data parsing. The result? A user-friendly, efficient, and interactive solution that turns unstructured web content into organized, actionable insights.


👾 Features

⚙️ Architecture
  • The architecture is based on a Python-powered script.
  • The project is structured into distinct script files for specific tasks such as scrape.py for web scraping and parse.py for data parsing.
  • The main.py script integrates all functions and provides an interactive user interface.
🔩 Code Quality
  • The code is well-structured into separate Python scripts with specific tasks.
  • There is a clear distinction of tasks among different scripts such as main.py, scrape.py, and parse.py.
  • The use of Python and its simplicity ensures maintainability and readability of the code.
📄 Documentation
  • The primary language used is Python, with 3 Python files in the project.
  • Installation and usage commands are well-documented.
  • Dependencies are managed through the requirements.txt file.
🔌 Integrations
  • The project integrates several Python packages such as streamlit, langchain, selenium, and beautifulsoup4.
  • It also includes chromedriver for web navigation and scraping.
  • The python-dotenv package is used for managing environment variables.
🧩 Modularity
  • Tasks are separated into different scripts ensuring high modularity.
  • scrape.py handles web scraping, parse.py handles data parsing, and main.py integrates all functionalities.
  • The use of requirements.txt for managing dependencies also adds to the modularity.
🧪 Testing
  • Unit tests can be run using the pytest command as indicated in the documentation.
  • However, specific details about the test coverage and the results are not provided.
⚡️ Performance
  • The use of selenium and chromedriver ensures efficient web scraping.
  • beautifulsoup4 and lxml allow for fast and efficient HTML parsing.
  • The langchain and langchain_ollama packages provide high-performance language processing capabilities.
🛡️ Security
  • The use of python-dotenv for managing environment variables provides a level of security for sensitive data such as API keys (a minimal sketch follows this list).
  • However, without more details about authentication or encryption methods, a complete security analysis cannot be provided.
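
As noted under Integrations and Security, python-dotenv keeps secrets in a local .env file rather than in the source. A minimal sketch of that pattern; the OPENAI_API_KEY variable name is an assumption for illustration, not taken from the repository:

    # Load secrets from a local .env file that is never committed to git
    import os
    from dotenv import load_dotenv

    load_dotenv()  # reads key=value pairs from .env into the process environment

    # Hypothetical key name -- the actual .env entries are not documented here
    api_key = os.getenv("OPENAI_API_KEY")
    if api_key is None:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")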

📁 Project Structure

└── Stone_Scraper.git/
    ├── README.md
    ├── __pycache__
    │   ├── parse.cpython-313.pyc
    │   └── scrape.cpython-313.pyc
    ├── chromedriver
    ├── main.py
    ├── parse.py
    ├── requirements.txt
    └── scrape.py

📂 Project Index

STONE_SCRAPER.GIT/
__root__
chromedriver - The bundled ChromeDriver binary that Selenium uses to drive a Chrome browser for web navigation and scraping.
main.py - Powers Stone Scraper, a web intelligence tool that extracts structured data from websites using AI. It manages the web scraping process, runs AI parsing on the cleaned content, and provides an interactive user interface for data extraction. It is aimed at data analysts, web researchers, and anyone who needs to extract data from websites.
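
A hedged sketch of how a Streamlit front end could wire the pipeline together. The helper names (scrape_website, clean_html, split_dom_content, parse_with_ollama) are assumptions for illustration, sketched under scrape.py and parse.py below, and are not necessarily the repository's actual function names:

    # Streamlit UI driving scrape -> clean -> chunk -> parse
    import streamlit as st

    from scrape import scrape_website, clean_html, split_dom_content  # hypothetical helpers
    from parse import parse_with_ollama                               # hypothetical helper

    st.title("Stone Scraper")

    url = st.text_input("Website URL")
    if st.button("Scrape site") and url:
        raw_html = scrape_website(url)  # Selenium fetch
        st.session_state["chunks"] = split_dom_content(clean_html(raw_html))
        st.success(f"Scraped {len(st.session_state['chunks'])} content chunks")

    description = st.text_area("Describe the data you want to extract")
    if st.button("Parse content") and description and "chunks" in st.session_state:
        result = parse_with_ollama(st.session_state["chunks"], description)
        st.text_area("Parsed result", result)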
scrape.py - Automates the web scraping process, extracts HTML content, and manages secure credentials. It launches a browser, navigates to the specified website, and retrieves the full-page HTML. The script then cleans the HTML by removing scripts, styles, and unnecessary whitespace, and breaks the DOM text into manageable chunks for subsequent processing stages.
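
A minimal sketch of that flow using Selenium and BeautifulSoup. The local chromedriver path, the function names, and the chunk size are assumptions for illustration:

    # Fetch a page with Selenium, strip noise with BeautifulSoup, and chunk the text
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    def scrape_website(url: str) -> str:
        """Launch Chrome via the bundled chromedriver and return the full-page HTML."""
        driver = webdriver.Chrome(service=Service("./chromedriver"))
        try:
            driver.get(url)
            return driver.page_source
        finally:
            driver.quit()

    def clean_html(html: str) -> str:
        """Remove scripts, styles, and extra whitespace, keeping only visible text."""
        soup = BeautifulSoup(html, "lxml")
        for tag in soup(["script", "style"]):
            tag.decompose()
        text = soup.get_text(separator="\n")
        return "\n".join(line.strip() for line in text.splitlines() if line.strip())

    def split_dom_content(text: str, max_chars: int = 6000) -> list[str]:
        """Break the cleaned text into chunks small enough for the language model."""
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]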
requirements.txt - Manages the project's dependencies, specifying Python packages such as Streamlit, LangChain, Selenium, BeautifulSoup4, and others. These packages support web scraping, HTML/XML parsing, environment management, and the interactive web application, forming a crucial part of the project's functionality.
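
An illustrative requirements.txt along those lines; the package list is inferred from this README, and the actual file may pin versions or include more entries:

    streamlit
    langchain
    langchain_ollama
    selenium
    beautifulsoup4
    lxml
    python-dotenv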
parse.py - Uses the OllamaLLM language model to extract quantitative or structured data from web content. It generates structured prompts, applies the language model to each content chunk, then parses and combines the results. The outcome is a single string of parsed results, potentially formatted as a clean table.
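
A hedged sketch of that parsing step using langchain and langchain_ollama; the model name and prompt wording are assumptions, not the repository's actual values:

    # Prompt an Ollama-served model over each chunk and join the parsed pieces
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_ollama import OllamaLLM

    TEMPLATE = (
        "You are extracting structured data from scraped web content.\n"
        "Content chunk: {dom_content}\n"
        "Extract only the information matching this description: {parse_description}\n"
        "Return a clean table if possible, and nothing else."
    )

    model = OllamaLLM(model="llama3")  # hypothetical model choice
    chain = ChatPromptTemplate.from_template(TEMPLATE) | model

    def parse_with_ollama(chunks: list[str], parse_description: str) -> str:
        """Apply the prompt to every chunk and combine the results into one string."""
        results = [
            chain.invoke({"dom_content": chunk, "parse_description": parse_description})
            for chunk in chunks
        ]
        return "\n".join(results)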

🚀 Getting Started

☑️ Prerequisites

Before getting started with Stone Scraper, ensure your runtime environment meets the following requirements:

  • Programming Language: Python
  • Package Manager: Pip

⚙️ Installation

Install Stone Scraper using one of the following methods:

Build from source:

  1. Clone the Stone_Scraper repository:
❯ git clone https://github.com/ralstonraphael/Stone_Scraper.git
  2. Navigate to the project directory:
❯ cd Stone_Scraper
  3. Install the project dependencies using pip:
❯ pip install -r requirements.txt

🤖 Usage

Run Stone Scraper using the following command:

❯ python {entrypoint}
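
Since main.py serves a Streamlit interface, the app is typically launched through Streamlit's runner rather than the plain interpreter (assuming main.py is the intended entrypoint):

❯ streamlit run main.py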

🧪 Testing

Run the test suite using the following command:

❯ pytest
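
As noted under Features, test coverage details are not provided in the current snapshot. A minimal pytest example for the hypothetical split_dom_content helper from the scrape.py sketch above could look like:

    # test_scrape.py sketch: chunking should cover the input without losing text
    from scrape import split_dom_content

    def test_split_dom_content_roundtrip():
        text = "word " * 5000
        chunks = split_dom_content(text, max_chars=1000)
        assert all(len(chunk) <= 1000 for chunk in chunks)
        assert "".join(chunks) == text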

📌 Project Roadmap

  • Task 1: Implement web scraper API.
  • Task 2: Implement OpenAI API and deploy through the cloud.
  • Task 3: Add an option to input your own API key.

🔰 Contributing

Contributing Guidelines
  1. Fork the Repository: Start by forking the project repository to your GitHub account.
  2. Clone Locally: Clone the forked repository to your local machine using a git client.
    git clone https://github.com/ralstonraphael/Stone_Scraper.git
  3. Create a New Branch: Always work on a new branch, giving it a descriptive name.
    git checkout -b new-feature-x
  4. Make Your Changes: Develop and test your changes locally.
  5. Commit Your Changes: Commit with a clear message describing your updates.
    git commit -m 'Implemented new feature x.'
  6. Push to GitHub: Push the changes to your forked repository.
    git push origin new-feature-x
  7. Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
  8. Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!


🎗 License

This project is released under the Unlicense. For more details, refer to the LICENSE file.


🙌 Acknowledgments

I’d like to express my gratitude to the tools, libraries, and communities that made this project possible:

  • Streamlit – for making it incredibly easy to build interactive web apps with Python.
  • LangChain – for providing a powerful framework to integrate LLMs in structured pipelines.
  • OpenAI – for their accessible and reliable language models, powering intelligent text analysis.
  • Ollama – for enabling local LLM experimentation and model integration.
  • Selenium – for automating browser interactions and web scraping workflows.
  • BeautifulSoup – for efficient and intuitive HTML parsing.
  • Python – the language at the core of everything.
  • Special thanks to the open-source community for making powerful tools accessible to everyone.
