Topic Researcher Bot is an intelligent assistant designed to streamline the process of researching and analyzing online content. It uses LLMs (via LangChain and Ollama) to automate topic-based article discovery, content cleaning, and summarization. This tool is ideal for researchers, analysts, and content teams who want fast, structured insights from recent web articles.
- Automated Web Search: Uses Google Custom Search API to find recent articles related to specific topics and sites.
- Article Scraping: Extracts content from article pages using BeautifulSoup and custom rules to avoid noisy or irrelevant data.
- LLM-based Cleaning: Leverages LLMs to isolate and retain only the core article body, removing ads, UI elements, and boilerplate.
- Summarization: Uses a second LLM pass to generate concise summaries of the cleaned content.
- Structured Storage: Articles are stored in a MongoDB database for easy retrieval, filtering, and deletion.
- Batch Article Retrieval: Supports multi-week historical searches and batch retrieval per topic-site pair.
- Backend: Developed using FastAPI
Defines the Article class with fields like:
`id`, `topics`, `sites`, `title`, `url`, `source`, `content`, `clean_content`, `summary`
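As a rough illustration, the fields listed above could be modeled like this. This is a minimal sketch using a standard-library dataclass; the types, defaults, and the `to_document()` helper are assumptions for illustration, not the project's actual definition (which may well use a Pydantic model, given the FastAPI backend).

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Article:
    # Field names come from the README; types and defaults are assumptions.
    title: str
    url: str
    source: str
    content: str
    topics: List[str] = field(default_factory=list)
    sites: List[str] = field(default_factory=list)
    clean_content: Optional[str] = None  # filled in after the LLM cleaning pass
    summary: Optional[str] = None        # filled in after the summarization pass
    id: Optional[str] = None             # assumed to be assigned by the database

    def to_document(self) -> dict:
        """Hypothetical helper: return a plain dict, omitting an unset id."""
        doc = asdict(self)
        if doc["id"] is None:
            del doc["id"]
        return doc


article = Article(
    title="Example headline",
    url="https://example.com/story",
    source="example.com",
    content="Raw scraped text",
    topics=["ai"],
    sites=["example.com"],
)
```

Leaving `clean_content` and `summary` optional lets an article be stored right after scraping and enriched later.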
Handles MongoDB operations for:
- Storing new articles
- Retrieving and deleting existing ones
- Removing duplicates based on title and content hash
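One way duplicate removal by title and content hash might work is sketched below. The function names `article_fingerprint` and `remove_duplicates`, the whitespace and case normalization, and the choice of SHA-256 are all assumptions made for illustration; the project may hash differently or deduplicate inside MongoDB itself.

```python
import hashlib
from typing import Dict, Iterable, List


def article_fingerprint(title: str, content: str) -> str:
    """Hash the normalized title and content into one hex digest."""
    normalized_title = " ".join(title.lower().split())
    normalized_content = " ".join(content.split())
    payload = normalized_title + "\n" + normalized_content
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def remove_duplicates(articles: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first article seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for article in articles:
        key = article_fingerprint(article["title"], article["content"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique


articles = [
    {"title": "Rates Hold Steady", "content": "The bank kept rates unchanged."},
    {"title": "rates  hold steady", "content": "The bank kept   rates unchanged."},
    {"title": "Rates Rise", "content": "The bank raised rates."},
]
unique_articles = remove_duplicates(articles)
```

Here the second entry differs from the first only in case and spacing, so it is treated as a duplicate and dropped.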
Core script that:
- Searches and scrapes articles via Google CSE
- Cleans noisy text using LLM prompts
- Summarizes content with LangChain + Ollama
- Manages LLM memory and prompts for efficiency
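To make the search step more concrete, the sketch below builds a request URL for the Google Custom Search JSON API for a single topic-site pair. The helper name `build_search_url` and the placeholder credentials are hypothetical, and the core script may construct its queries differently; the query parameters themselves (`key`, `cx`, `q`, `siteSearch`, `siteSearchFilter`, `dateRestrict`, `num`) are part of the public API.

```python
from urllib.parse import urlencode

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(topic, site, api_key, engine_id, weeks=1, num=10):
    """Build a Custom Search JSON API request URL for one topic-site pair."""
    params = {
        "key": api_key,               # API key (placeholder value below)
        "cx": engine_id,              # search engine ID (placeholder value below)
        "q": topic,
        "siteSearch": site,           # restrict results to this site
        "siteSearchFilter": "i",      # "i" includes the site, "e" would exclude it
        "dateRestrict": f"w{weeks}",  # only results from the past N weeks
        "num": num,                   # results per request, at most 10
    }
    return f"{CSE_ENDPOINT}?{urlencode(params)}"


url = build_search_url(
    "renewable energy", "example.com", "YOUR_API_KEY", "YOUR_ENGINE_ID", weeks=2
)
```

The resulting URL can be fetched with any HTTP client; the JSON response lists matching pages under its `items` key.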
- Define Topics & Sites: Configure your topics and target news sites.
- Run Retrieval: Use `get_recent_articles(topics, sites)` to perform the search and scraping.
- Clean & Summarize: Each article goes through the `clean_article_text()` and `summarize_article_text()` functions for post-processing.
- View/Manage Data: Use database functions to filter or delete articles as needed.
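The clean-then-summarize part of this workflow could be wired together roughly as follows. This is only a sketch: `process_articles` is a hypothetical helper, and `fake_clean` and `fake_summarize` are stand-ins for the LLM-backed `clean_article_text()` and `summarize_article_text()`, whose real signatures are not documented here.

```python
from typing import Callable, Dict, List


def process_articles(
    articles: List[Dict[str, str]],
    clean: Callable[[str], str],
    summarize: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Run cleaning, then summarization, on each article's raw content."""
    processed = []
    for article in articles:
        clean_content = clean(article["content"])
        processed.append(
            {**article, "clean_content": clean_content, "summary": summarize(clean_content)}
        )
    return processed


# Stand-ins for the LLM-backed functions so the sketch runs without a model.
def fake_clean(text: str) -> str:
    return text.replace("[AD]", "").strip()


def fake_summarize(text: str) -> str:
    return text.split(".")[0] + "."


raw = [{"title": "Example", "content": "[AD] First sentence. Second sentence."}]
result = process_articles(raw, fake_clean, fake_summarize)
```

Passing the cleaning and summarization functions in as arguments keeps the pipeline testable without a running model, and lets the summary be computed from the cleaned text rather than the raw scrape.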
- FastAPI
- LangChain
- Ollama or Groq AI (LLM backend)
- MongoDB
- Google Custom Search API
- BeautifulSoup