This project aims to extract and analyze topics from Eenadu news articles using Natural Language Processing (NLP) techniques. The pipeline includes web scraping, preprocessing, topic modeling, and result visualization.
- Open Google Colab: visit Google Colab.
- Create a new notebook or upload the provided .ipynb file.
- Install the required libraries by running the following commands in a Colab code cell, one command per line:

      !pip install indic-nlp-library
      !pip install indicnlp
      !pip install gensim
      !pip install pyLDAvis
      !pip install stanza
- Run the scripts, in this order:
  - Web scraping
    - Purpose: Automates the extraction of Telugu news articles from the Eenadu website.
    - Steps: Run the notebook to collect the articles and save the dataset as a CSV file.
  - Preprocessing
    - Purpose: Performs normalization, tokenization, stop word removal, stemming, and POS tagging.
    - Steps: Upload the CSV file generated by the web scraper and preprocess the text for topic modeling.
  - Topic modeling
    - Purpose: Implements LDA for topic modeling and uses coherence scores to choose the number of topics.
    - Steps: Generate topics, calculate coherence scores, and visualize the topics using pyLDAvis.
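
The first stage, collecting article text and saving it as a CSV file, could look roughly like the sketch below. It is not the project's scraper. It assumes the article body sits in `<p>` tags, leaves out the page download, and uses hypothetical column names (`url`, `text`) and helper names.

```python
import csv
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> elements of one HTML page."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []       # text pieces of the paragraph being read
        self.paragraphs = []    # finished, non-empty paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self._in_paragraph = False
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested inline tags (<b>, <a>, ...) is kept as well.
        if self._in_paragraph:
            self._buffer.append(data)


def extract_article_text(html):
    """Return the paragraph text of an HTML page as one string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.paragraphs)


def save_articles(rows, path):
    """Write (url, text) pairs to a UTF-8 CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "text"])
        writer.writerows(rows)
```

Writing the file as UTF-8 keeps the Telugu text intact when the preprocessing notebook reads it back.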
- Indic NLP Library
  - Purpose: A library designed specifically for processing Indic languages such as Telugu.
  - Importance:
    - Provides tokenization, transliteration, and text normalization for Telugu script.
    - Handles Telugu-specific script conventions, so the text is normalized consistently before modeling.
    - Forms the basis of the preprocessing stage for this non-Latin script.
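
As a rough illustration of the normalization and tokenization steps, the sketch below uses `IndicNormalizerFactory` and `indic_tokenize.trivial_tokenize` from the Indic NLP Library. The function names and the stop word set are hypothetical; the project's actual stop word list is not shown here.

```python
def normalize_and_tokenize(text, lang="te"):
    """Normalize Telugu text, then split it into word and punctuation tokens."""
    # Imported inside the function so the cell still loads if the
    # library has not been installed yet.
    from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
    from indicnlp.tokenize import indic_tokenize

    normalizer = IndicNormalizerFactory().get_normalizer(lang)
    normalized = normalizer.normalize(text)
    return indic_tokenize.trivial_tokenize(normalized, lang=lang)


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the given stop word collection."""
    return [token for token in tokens if token not in stop_words]
```

A call such as `remove_stop_words(normalize_and_tokenize(article_text), stop_words)` would then produce the token list for one article, where `article_text` and `stop_words` are supplied by the notebook.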
- Stanza
  - Purpose: A multilingual NLP library from the Stanford NLP Group for text analysis.
  - Importance:
    - Provides tokenization, part-of-speech tagging, and dependency parsing.
    - Supports Telugu, which makes it suitable for deeper linguistic analysis such as POS tagging.
    - Adds linguistic annotations that help prepare cleaner input for topic modeling.
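
The POS tagging step might be sketched as follows. `stanza.download` and `stanza.Pipeline` are Stanza's own entry points; the helper names, and the choice to keep only nouns and proper nouns, are assumptions made for illustration rather than the project's confirmed settings.

```python
def build_telugu_pipeline():
    """Download the Telugu models if needed and return a tokenize+POS pipeline."""
    # Imported lazily so the helpers below work without Stanza installed.
    import stanza

    stanza.download("te")  # needs network access the first time it runs
    return stanza.Pipeline(lang="te", processors="tokenize,pos")


def pos_tags(doc):
    """Return (word, universal POS tag) pairs from a Stanza document."""
    return [
        (word.text, word.upos)
        for sentence in doc.sentences
        for word in sentence.words
    ]


def keep_pos(tagged, allowed=("NOUN", "PROPN")):
    """Keep only the words whose tag is in `allowed` (an assumed filter)."""
    return [word for word, tag in tagged if tag in allowed]
```

With a pipeline `nlp = build_telugu_pipeline()`, calling `keep_pos(pos_tags(nlp(article_text)))` would return the retained words for one article.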
- Gensim
  - Purpose: A Python library for topic modeling and document similarity analysis.
  - Importance:
    - Implements the Latent Dirichlet Allocation (LDA) algorithm for topic modeling.
    - Processes large corpora efficiently by streaming documents instead of loading them all into memory.
    - Includes coherence score calculation to evaluate topic model quality.
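
One way to pick the number of topics by coherence is sketched below with Gensim's `Dictionary`, `LdaModel`, and `CoherenceModel`. The candidate topic counts, the `passes` value, and the `u_mass` measure are assumptions chosen to keep the example small; the project may use a different measure such as `c_v`.

```python
def select_lda_by_coherence(texts, topic_counts=(2, 3, 4), seed=42):
    """Train one LDA model per topic count and return the most coherent one.

    texts: list of token lists, one per document.
    Returns (best_model, dictionary, corpus, scores), where scores maps
    each topic count to its coherence value.
    """
    # Imported lazily so the cell loads even before gensim is installed.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]

    scores, models = {}, {}
    for k in topic_counts:
        lda = LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=k,
            random_state=seed,  # fixed seed for repeatable runs
            passes=10,
        )
        coherence = CoherenceModel(
            model=lda,
            corpus=corpus,
            dictionary=dictionary,
            coherence="u_mass",
            processes=1,  # avoid spawning worker processes in a notebook
        ).get_coherence()
        scores[k] = coherence
        models[k] = lda

    # For u_mass, values closer to zero indicate more coherent topics.
    best_k = max(scores, key=scores.get)
    return models[best_k], dictionary, corpus, scores
```

Returning the dictionary and corpus alongside the model makes it easy to hand all three to a visualization step afterwards.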
- pyLDAvis
  - Purpose: A library for interactive topic modeling visualization.
  - Importance:
    - Helps visualize and interpret the results of LDA models.
    - Provides an interactive view for exploring how topics relate to each other and which words define them.
    - Useful for presenting the resulting topics to readers.
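
A minimal sketch of the visualization step, assuming a trained Gensim LDA model together with its corpus and dictionary. The function name and output file name are hypothetical. The `pyLDAvis.gensim_models` module is the name used by recent pyLDAvis releases; older releases exposed it as `pyLDAvis.gensim`.

```python
def save_topic_visualization(lda_model, corpus, dictionary, path="lda_topics.html"):
    """Build the interactive pyLDAvis view and save it as a standalone HTML file."""
    # Imported lazily so the cell loads even before pyLDAvis is installed.
    import pyLDAvis
    import pyLDAvis.gensim_models as gensimvis

    prepared = gensimvis.prepare(lda_model, corpus, dictionary)
    pyLDAvis.save_html(prepared, path)
    return path
```

Inside a Colab notebook, calling `pyLDAvis.enable_notebook()` and then `pyLDAvis.display(prepared)` shows the same view inline instead of writing a file.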
- TF-IDF (via Gensim or custom implementation)
  - Purpose: A statistical measure of how important a word is to a document relative to the whole corpus.
  - Importance:
    - Prepares the text for LDA by highlighting distinctive words while down-weighting words that appear in most documents.
    - Reduces noise in the data, improving the quality of the topic modeling.
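
A custom TF-IDF implementation could be as small as the sketch below. It uses one common variant (term frequency divided by document length, multiplied by the natural log of inverse document frequency); Gensim's `TfidfModel` applies different defaults, so the numbers will not match it exactly. The function names and the threshold parameter are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists.
    weight = (count in doc / doc length) * ln(number of docs / docs containing term)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def drop_low_weight_terms(docs, threshold):
    """Remove tokens whose TF-IDF weight in their document is <= threshold."""
    weights = tf_idf(docs)
    return [
        [term for term in doc if doc_weights[term] > threshold]
        for doc, doc_weights in zip(docs, weights)
    ]
```

A term that occurs in every document gets weight zero under this variant, so a threshold of `0.0` removes exactly those corpus-wide terms.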