This project classifies news articles into four categories — World, Sports, Business, and Sci/Tech — using the AG News dataset. It features text preprocessing, TF-IDF vectorization, and classification using Logistic Regression and a feedforward Neural Network.
- Text preprocessing: tokenization, stopword removal, lemmatization
- TF-IDF vectorization for feature extraction
- Classification using Logistic Regression and a feedforward Neural Network
- Visualizations: class distribution and word clouds per category
- Model evaluation with accuracy, classification report, and confusion matrix
The goal is to build a classifier to predict the category of a news article based on its title and description.
- Data cleaning and preprocessing include tokenization, stopword removal, and lemmatization.
- Feature extraction using TF-IDF vectorization (uni-grams and bi-grams).
- Classification using:
  - Logistic Regression
  - Feedforward Neural Network with dropout and L2 regularization
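The preprocessing steps above can be sketched as follows. This is a minimal, self-contained stand-in: the actual script uses NLTK's stopword list and `WordNetLemmatizer`, which are replaced here by a tiny hand-rolled stopword set and a placeholder lemmatizer so the sketch runs without NLTK's downloaded data.

```python
import re

# Stand-in for NLTK's English stopword list (nltk.corpus.stopwords.words("english"))
STOPWORDS = {"the", "a", "an", "in", "on", "of", "and", "to", "is"}

def lemmatize(token: str) -> str:
    # Placeholder: the project uses nltk.stem.WordNetLemmatizer().lemmatize(token)
    return token

def preprocess(text: str) -> str:
    text = text.lower()
    tokens = re.findall(r"[a-z]+", text)                # simple tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    tokens = [lemmatize(t) for t in tokens]             # lemmatization
    return " ".join(tokens)

print(preprocess("Stocks Fell Sharply on the Earnings News"))
# stocks fell sharply earnings news
```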
The dataset consists of CSV files (train.csv and test.csv) with the following columns:
- class_index: Numeric class label (1 to 4)
- title: News article title
- description: News article description
Class mapping:
| class_index | category_name |
|---|---|
| 1 | World |
| 2 | Sports |
| 3 | Business |
| 4 | Sci/Tech |
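The class mapping above can be applied to the loaded data with a pandas `map`. The toy rows below stand in for the real `train.csv` contents, so the snippet is self-contained:

```python
import pandas as pd

CLASS_NAMES = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

# Toy rows standing in for the real train.csv / test.csv content
df = pd.DataFrame({
    "class_index": [1, 2, 3, 4],
    "title": ["UN summit opens", "Cup final tonight", "Shares slide", "New chip unveiled"],
    "description": ["d1", "d2", "d3", "d4"],
})
df["category_name"] = df["class_index"].map(CLASS_NAMES)
print(df["category_name"].tolist())
# ['World', 'Sports', 'Business', 'Sci/Tech']
```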
- Clone the repository:

  ```bash
  git clone https://github.com/your-username/news-category-classification.git
  cd news-category-classification
  ```

- (Optional) Create and activate a virtual environment:
  ```bash
  python -m venv venv
  # On Linux/macOS
  source venv/bin/activate
  # On Windows
  venv\Scripts\activate
  ```

- Install the required packages directly with pip:
  ```bash
  pip install pandas numpy matplotlib seaborn wordcloud nltk scikit-learn tensorflow
  ```

- Update the dataset file paths inside the Python script:
Open the main script file (main.py or your script filename) and replace the following variables with the paths to your local dataset files:
  ```python
  train_path = r"YOUR_LOCAL_PATH_TO_train.csv"
  test_path = r"YOUR_LOCAL_PATH_TO_test.csv"
  ```

- Trained with TF-IDF features (up to 10,000 features, uni-grams and bi-grams).
- Maximum iterations set to 1000.
- Uses the `scikit-learn` implementation with a fixed random seed for reproducibility.
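The Logistic Regression setup described above can be sketched with scikit-learn like this. The hyperparameters (`max_features=10000`, `ngram_range=(1, 2)`, `max_iter=1000`) come from the description; the seed value and the toy corpus are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# TF-IDF with uni-grams and bi-grams, capped at 10,000 features
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
# max_iter=1000 and a fixed seed, as described; 42 is an assumed seed value
clf = LogisticRegression(max_iter=1000, random_state=42)

# Toy corpus standing in for the preprocessed title + description text
texts = ["the team won the final match",
         "stocks fell sharply after earnings",
         "a late goal sealed the victory",
         "markets rallied on strong profits"]
labels = [2, 3, 2, 3]  # 2 = Sports, 3 = Business

X = vectorizer.fit_transform(texts)
clf.fit(X, labels)
preds = clf.predict(vectorizer.transform(["the match ended in victory"]))
```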
- Input layer size matches TF-IDF feature size.
- Two hidden layers with 512 and 256 neurons respectively.
- Uses ReLU activation functions.
- Includes Dropout (0.5) and L2 regularization (0.01) to reduce overfitting.
- Output layer with softmax activation for multi-class classification.
- Optimizer: Adam.
- Loss: Sparse categorical crossentropy.
- Early stopping with patience of 3 epochs.
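The architecture described above can be sketched in Keras as follows. Layer sizes, dropout, L2 strength, loss, optimizer, and early-stopping patience come from the bullets; `NUM_FEATURES` is assumed to equal the TF-IDF vocabulary size:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, regularizers

NUM_FEATURES = 10000  # assumed to match the TF-IDF feature count

model = tf.keras.Sequential([
    layers.Input(shape=(NUM_FEATURES,)),
    layers.Dense(512, activation="relu", kernel_regularizer=regularizers.l2(0.01)),
    layers.Dropout(0.5),
    layers.Dense(256, activation="relu", kernel_regularizer=regularizers.l2(0.01)),
    layers.Dropout(0.5),
    layers.Dense(4, activation="softmax"),  # four news categories
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                              restore_best_weights=True)
# Training would then look like (X_train, y_train assumed):
# model.fit(X_train, y_train, validation_split=0.1, epochs=20, callbacks=[early_stop])
```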
The models produced the following outputs and metrics:
- Class Distribution
- Logistic Regression Accuracy
- Logistic Regression Confusion Matrix
- Neural Network Accuracy Over Epochs
- Neural Network Training History
- Successful Predictions
- Word Cloud - Business
- Word Cloud - Sci/Tech
- Word Cloud - Sports
- Word Cloud - World
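The evaluation metrics listed above are typically computed with scikit-learn along these lines; the toy `y_test`/`y_pred` arrays stand in for the real test-set predictions:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy true/predicted labels standing in for the real test-set predictions
y_test = [1, 2, 3, 4, 2, 3]
y_pred = [1, 2, 3, 4, 3, 3]

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred,
                            target_names=["World", "Sports", "Business", "Sci/Tech"]))
print(confusion_matrix(y_test, y_pred))
```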
- Bar plot for class distribution (`class_distribution.png`).
- Word clouds for each category (`wordcloud_<category>.png`).
- Confusion matrix heatmap for Logistic Regression (`logreg_confusion_matrix.png`).
- Neural network training accuracy and loss over epochs (`nn_training_history.png`).
All visualizations are saved automatically when you run the script.
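Saving a figure like the class-distribution bar plot can be sketched with matplotlib as follows; the counts are illustrative placeholders, not measured values:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so saving works without a display
import matplotlib.pyplot as plt

categories = ["World", "Sports", "Business", "Sci/Tech"]
counts = [30000, 30000, 30000, 30000]  # illustrative placeholder counts

plt.figure(figsize=(6, 4))
plt.bar(categories, counts)
plt.title("Class Distribution")
plt.ylabel("Number of articles")
plt.tight_layout()
plt.savefig("class_distribution.png")  # filename matching the list above
plt.close()
```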
This project is licensed under the MIT License - see the LICENSE file for details.