Extracting Web Intelligence, One Page at a Time!
Built with the tools and technologies:
Table of Contents
Stone Scraper is a ground-breaking AI-powered tool that transforms the way we harvest data from the web. By automating the extraction of structured data from websites, it offers invaluable assistance to data analysts, web researchers, and anyone interested in web data mining. Leveraging state-of-the-art technologies like Streamlit, LangChain, and Selenium, it streamlines web scraping, HTML cleaning, and data parsing. The result? A user-friendly, efficient, and interactive solution that turns unstructured web content into organized, actionable insights.
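As a rough illustration of that interactive workflow, a Streamlit front end might look like the sketch below. The widget labels and the stubbed pipeline call are assumptions for illustration, not Stone Scraper's actual code.

```python
# Minimal Streamlit front-end sketch. Widget labels and the stubbed
# pipeline call are illustrative assumptions, not Stone Scraper's code.
import streamlit as st

st.title("Stone Scraper")
url = st.text_input("Website URL")
description = st.text_area("Describe the data you want to extract")

if st.button("Scrape and parse") and url:
    # In the real project, scraping lives in scrape.py and AI parsing in
    # parse.py; the call is stubbed here so the sketch stays self-contained.
    st.info(f"Would scrape {url} and extract: {description}")
```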
| | Feature | Summary |
|---|---|---|
| ⚙️ | Architecture | |
| 🔩 | Code Quality | |
| 📄 | Documentation | |
| 🔌 | Integrations | |
| 🧩 | Modularity | |
| 🧪 | Testing | |
| ⚡️ | Performance | |
| 🛡️ | Security | |
```sh
└── Stone_Scraper.git/
    ├── README.md
    ├── __pycache__
    │   ├── parse.cpython-313.pyc
    │   └── scrape.cpython-313.pyc
    ├── chromedriver
    ├── main.py
    ├── parse.py
    ├── requirements.txt
    └── scrape.py
```

STONE_SCRAPER.GIT/
__root__

- chromedriver – The ChromeDriver binary that Selenium uses to launch and control a Chrome browser during scraping.
- main.py – Powers Stone Scraper, a web intelligence tool that extracts structured data from websites using artificial intelligence. It manages the web scraping process, runs AI parsing on the cleaned content, and provides an interactive user interface for data extraction, making it essential for data analysts, web researchers, and anyone who needs to pull data from websites.
- scrape.py – Automates the web scraping process, extracting HTML content and managing secure credentials. It launches a browser, navigates to the specified website, and retrieves the full-page HTML. The script then cleans the HTML by removing scripts, styles, and unnecessary whitespace, and breaks the DOM text into manageable chunks for subsequent processing stages.
- requirements.txt – Manages the project's dependencies, specifying Python packages such as Streamlit, LangChain, Selenium, BeautifulSoup4, and others. These packages support web scraping, HTML parsing, environment management, and the interactive web application, and are therefore a crucial part of the project's functionality.
- parse.py – Uses the OllamaLLM language model to extract quantitative or structured data from web content. It generates structured prompts, applies the language model, then parses and combines the web-content chunks based on these prompts. The outcome is a single string of parsed results, potentially formatted as a clean table. A condensed sketch of this scrape-and-parse pipeline is shown after the index.
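The index above describes the core flow: scrape.py fetches and cleans page HTML and splits it into chunks, and parse.py runs an Ollama model over those chunks with a structured prompt. Below is a condensed sketch of that pipeline; the function names, prompt wording, chunk size, and the llama3 model choice are illustrative assumptions, not the project's actual identifiers.

```python
# Condensed sketch of the scrape -> clean -> chunk -> parse pipeline.
# Function names, prompt wording, chunk size, and the model name are
# assumptions for illustration, not Stone Scraper's actual code.
from bs4 import BeautifulSoup
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import OllamaLLM
from selenium import webdriver


def scrape_website(url: str) -> str:
    """Launch a Chrome browser, load the page, and return its full HTML."""
    driver = webdriver.Chrome()  # assumes chromedriver is available
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def clean_html(html: str) -> str:
    """Strip scripts/styles and collapse whitespace, keeping visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.extract()
    lines = (line.strip() for line in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)


def split_into_chunks(text: str, max_chars: int = 6000) -> list[str]:
    """Break the cleaned DOM text into model-sized chunks."""
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]


def parse_chunks(chunks: list[str], description: str) -> str:
    """Apply an Ollama model to each chunk and join the extracted results."""
    model = OllamaLLM(model="llama3")  # model name is an assumption
    prompt = ChatPromptTemplate.from_template(
        "Extract only the information matching this description: {description}\n"
        "Text:\n{chunk}"
    )
    chain = prompt | model
    return "\n".join(
        chain.invoke({"description": description, "chunk": chunk})
        for chunk in chunks
    )
```

A front end such as main.py would then call these helpers in sequence and display the joined result in the Streamlit interface.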
Before getting started with Stone_Scraper.git, ensure your runtime environment meets the following requirements:
- Programming Language: Python
- Package Manager: Pip
Install Stone_Scraper.git using one of the following methods:
Build from source:
- Clone the Stone_Scraper.git repository:
```sh
❯ git clone https://github.com/ralstonraphael/Stone_Scraper.git
```

- Navigate to the project directory:

```sh
❯ cd Stone_Scraper.git
```

- Install the project dependencies:

```sh
❯ pip install -r requirements.txt
```

Run Stone_Scraper.git using the following command:

Using pip

```sh
❯ python {entrypoint}
```
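Note: because main.py builds a Streamlit interface, the app is normally launched through Streamlit rather than plain `python`; the exact command below is an assumption based on the dependencies listed in requirements.txt.

```sh
❯ streamlit run main.py
```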
Run the test suite using the following command:

Using pip

```sh
❯ pytest
```
- Task 1: Implement the web scraper API.
- Task 2: Implement the OpenAI API and deploy through the cloud.
- Task 3: Add an option to input your own API key.
- 💬 Join the Discussions: Share your insights, provide feedback, or ask questions.
- 🐛 Report Issues: Submit bugs found or log feature requests for the Stone_Scraper.git project.
- 💡 Submit Pull Requests: Review open PRs, and submit your own PRs.
Contributing Guidelines
- Fork the Repository: Start by forking the project repository to your GitHub account.
- Clone Locally: Clone the forked repository to your local machine using a git client.

  ```sh
  git clone https://github.com/ralstonraphael/Stone_Scraper.git
  ```

- Create a New Branch: Always work on a new branch, giving it a descriptive name.

  ```sh
  git checkout -b new-feature-x
  ```

- Make Your Changes: Develop and test your changes locally.
- Commit Your Changes: Commit with a clear message describing your updates.

  ```sh
  git commit -m 'Implemented new feature x.'
  ```

- Push to GitHub: Push the changes to your forked repository.

  ```sh
  git push origin new-feature-x
  ```

- Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
- Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!
This project is released under the Unlicense. For more details, refer to the LICENSE file.
I’d like to express my gratitude to the tools, libraries, and communities that made this project possible:
- Streamlit – for making it incredibly easy to build interactive web apps with Python.
- LangChain – for providing a powerful framework to integrate LLMs in structured pipelines.
- OpenAI – for their accessible and reliable language models, powering intelligent text analysis.
- Ollama – for enabling local LLM experimentation and model integration.
- Selenium – for automating browser interactions and web scraping workflows.
- BeautifulSoup – for efficient and intuitive HTML parsing.
- Python – the language at the core of everything.
- Special thanks to the open-source community for making powerful tools accessible to everyone.
