A production-ready sentiment analysis system built using classical NLP techniques on the IMDB Movie Reviews dataset.
This project demonstrates an end-to-end NLP workflow, from raw text preprocessing to model training, evaluation, and real-time inference via API and UI.
⚠️ No transformers were used.
This project intentionally focuses on strong NLP fundamentals before moving to modern LLM-based systems.
---
The system classifies movie reviews as Positive or Negative using a TF-IDF + Logistic Regression pipeline and exposes predictions through:
- A FastAPI REST API
- An interactive Streamlit UI
This project is part of my Pre-Transformer NLP Project Series, designed to build deep intuition for text pipelines and production ML systems.
- Build a classical NLP pipeline from scratch
- Perform robust text preprocessing
- Extract features using TF-IDF
- Train and evaluate a machine-learning classifier
- Persist and reload model artifacts safely
- Serve predictions via a REST API
- (Optional) Provide a human-friendly UI
- Lowercasing
- HTML tag removal
- URL removal
- Punctuation & digit removal
- Stopword removal (NLTK)
- Stemming (Porter Stemmer)
- TF-IDF Vectorization
- Unigrams + Bigrams
- Feature cap for efficiency & generalization
- Logistic Regression (binary classification)
- Probability-based confidence scores
- Lightweight, fast, and interpretable
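The preprocessing steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the contents of `src/preprocess.py`: the function name `clean_review`, the regular expressions, and the tiny `DEMO_STOPWORDS` set are assumptions made for the example. The project itself names NLTK's stopword list and the Porter stemmer, which can be passed in through the `stopwords` and `stem` parameters.

```python
import re

# Assumption: a tiny illustrative stopword set. The project uses NLTK's
# English stopword list instead.
DEMO_STOPWORDS = frozenset(
    {"a", "an", "the", "is", "was", "and", "of", "to", "in", "it", "this"}
)

HTML_TAG = re.compile(r"<[^>]+>")            # e.g. <br />, <b>...</b>
URL = re.compile(r"(?:https?://|www\.)\S+")  # http(s) links and www. links
NON_LETTER = re.compile(r"[^a-z\s]")         # punctuation, digits, symbols


def clean_review(text, stopwords=DEMO_STOPWORDS, stem=lambda token: token):
    """Lowercase, strip HTML/URLs/punctuation/digits, drop stopwords, stem.

    `stem` defaults to the identity function so the sketch has no
    third-party dependency; pass a real stemmer to apply stemming.
    """
    text = text.lower()
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = NON_LETTER.sub(" ", text)
    tokens = [stem(tok) for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)
```

With NLTK installed and its stopword corpus downloaded, the same function could be called as `clean_review(text, stopwords=set(nltk.corpus.stopwords.words("english")), stem=nltk.stem.PorterStemmer().stem)`.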
project-sentiment/
│
├── data/
│   ├── imdb.csv
│   └── imdb_clean.csv
│
├── models/
│   ├── sentiment_model.joblib
│   └── vectorizer.joblib
│
├── src/
│   ├── preprocess.py
│   ├── train.py
│   └── predict.py
│
├── app.py              # Streamlit UI
├── requirements.txt
└── README.md
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

Ensure data/imdb_clean.csv exists.
python -m src.train

This will:
- Train the sentiment classifier
- Evaluate performance
- Save model artifacts:
  - models/sentiment_model.joblib
  - models/vectorizer.joblib
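As a rough sketch of the train, evaluate, and save steps, the following uses scikit-learn's `TfidfVectorizer` (unigrams plus bigrams, capped vocabulary) with `LogisticRegression`, and writes the two artifacts with joblib. It is not the actual `src/train.py`: the function names, the 80/20 split, the seed, `max_iter=1000`, and the `max_features=50_000` cap are all assumptions, since the README does not state them.

```python
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def train_and_save(texts, labels, model_dir="models", max_features=50_000, seed=42):
    """Fit TF-IDF + Logistic Regression, evaluate on a held-out split, save both."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=seed, stratify=labels
    )

    # Unigrams + bigrams, with a vocabulary cap (the cap value is an assumption).
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=max_features)
    x_train_vec = vectorizer.fit_transform(x_train)  # fit on training data only
    x_test_vec = vectorizer.transform(x_test)

    model = LogisticRegression(max_iter=1000)
    model.fit(x_train_vec, y_train)

    preds = model.predict(x_test_vec)
    metrics = {
        "accuracy": accuracy_score(y_test, preds),
        "f1": f1_score(y_test, preds),  # labels assumed to be 0 (neg) / 1 (pos)
    }

    out_dir = Path(model_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, out_dir / "sentiment_model.joblib")
    joblib.dump(vectorizer, out_dir / "vectorizer.joblib")
    return metrics


def load_artifacts(model_dir="models"):
    """Reload the saved model and vectorizer as a (model, vectorizer) pair."""
    out_dir = Path(model_dir)
    model = joblib.load(out_dir / "sentiment_model.joblib")
    vectorizer = joblib.load(out_dir / "vectorizer.joblib")
    return model, vectorizer
```

Because joblib files are pickle-based, loading one can execute arbitrary code, so `load_artifacts` should only be pointed at files produced by a trusted training run.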
Accuracy: ~0.89
F1 Score: ~0.89
Confusion Matrix:
[[TN FP]
[FN TP]]
(Exact values may vary slightly due to randomness.)
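To make the layout of the confusion matrix and the two reported metrics concrete, here is a small dependency-free sketch that computes them from true and predicted labels. The function name `binary_metrics` and the 0 = negative, 1 = positive encoding are assumptions for the example; the `[[TN, FP], [FN, TP]]` layout matches what scikit-learn's `confusion_matrix` returns for labels `[0, 1]`.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, F1 (positive class), and a [[TN, FP], [FN, TP]] matrix."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:  # truth == 1 and pred == 0
            fn += 1

    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "f1": f1,
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }
```

Rows of the matrix are true classes and columns are predicted classes, so the off-diagonal entries count the two kinds of mistakes.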
uvicorn src.predict:app --reload

Open Swagger UI:
http://127.0.0.1:8000/docs
curl -X POST "http://127.0.0.1:8000/predict" \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"This movie was absolutely fantastic\"}"

{
  "sentiment": "positive",
  "confidence": 0.97
}

Run:
streamlit run app.py

Provides a simple UI to test predictions interactively.
| Review | Prediction |
|---|---|
| Amazing acting and storyline | Positive |
| Boring movie, waste of time | Negative |
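Rows like the ones in the table above could be checked against a running API with a small Python client. The sketch below uses only the standard library; the helper names `build_predict_request` and `predict_sentiment` are hypothetical, and the base URL assumes the local uvicorn server on port 8000.

```python
import json
import urllib.request


def build_predict_request(text, base_url="http://127.0.0.1:8000"):
    """Build the POST request for /predict without sending it."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_sentiment(text, base_url="http://127.0.0.1:8000", timeout=10.0):
    """Send one review to the API and return the decoded JSON response."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `predict_sentiment("Boring movie, waste of time")` should return a dictionary with `sentiment` and `confidence` keys, mirroring the curl example.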
- Python
- NLTK
- scikit-learn
- FastAPI
- Uvicorn
- Streamlit
- Pandas / NumPy
- Built a production-style NLP system
- Understood classical NLP pipelines end-to-end
- Learned artifact management & safe loading
- Deployed ML inference via REST API
- Created a user-facing ML demo UI
- Compare TF-IDF vs Word2Vec / GloVe
- Replace Logistic Regression with LightGBM
- Add batch inference & logging
- Upgrade to Transformer-based model
- Deploy to cloud (Render / Hugging Face Spaces)
Tanish Sarkar Pre-Transformer NLP Projects


