A lean NLP project exploring automated classification of medical transcripts into their respective specialties using traditional and modern NLP methods.
This project builds a multi-class text classifier using the MTSamples dataset (medical transcription notes).
The goal is to assign each record to the correct medical specialty (e.g., Surgery, Radiology, Pain Management).
We benchmark classical models and progressively improve them with contextual embeddings (transformers).
- Language: Python
- Libraries: pandas, scikit-learn, matplotlib, seaborn, XGBoost
- Environment: Jupyter Notebooks (VS Code)
- ML Tasks: NLP preprocessing, TF-IDF vectorization, classification, and evaluation
med-nlp-classification-proj/
│
├── data/
│ └── mtsamples.csv
│
├── notebooks/
│ ├── 01_data_cleaning.ipynb
│ ├── 02_exploratory_analysis.ipynb
│ ├── 03_tfidf_baseline_model.ipynb
│ ├── 04_model_evaluation.ipynb
│ ├── 05_error_analysis.ipynb
│
├── models/
│ └── gradient_boosting_baseline.pkl
│
├── visuals/
│ ├── confusion_matrix.png
│ └── class_performance_chart.png
│
├── requirements.txt
├── README.md
└── .gitignore| Step | Description |
|---|---|
| 1 | Data cleaning & preprocessing |
| 2 | Exploratory data analysis (EDA) |
| 3 | Feature extraction using TF-IDF |
| 4 | Baseline model training (Logistic Regression, Random Forest, Gradient Boosting) |
| 5 | Model evaluation & performance benchmarking |
| 6 | Error analysis & confusion matrix |
| 7 | Transformer-based model (to be added) |
- Accuracy: 12.8%
- F1 Score: 12.4%
- Top Classes: Pain Management, Sleep Medicine, Surgery
- Lowest Classes: Hospice Care, Work Comp, Emergency Reports
💬 Interpretation:
The baseline demonstrates the limitations of sparse features in multi-class NLP with imbalanced data.
Misclassifications like Orthopedic → Radiology reveal semantic overlap between specialties.
- Integrate transformer embeddings (e.g., BioClinicalBERT)
- Fine-tune and compare contextual models
- Deploy a minimal Streamlit interface for specialty prediction
Stephen Karanja
📍 Nairobi, Kenya
💼 Data Analyst | AI & Automation Specialist
🔗 LinkedIn
📧 muhurakaranja7@gmail.com