This repository contains the code and documentation for the final project in CST8507 - Natural Language Processing, based on SemEval-2025 Task 9: Food Hazard Detection Challenge. The project implements machine learning models to classify food incident reports by predicting product and hazard categories from report titles. The implementation includes baseline models (TF-IDF with Logistic Regression and Random Forest) and an advanced fine-tuned BERT model, along with explainability and error analysis.
- Team Members: Khaled Saleh, Yazid Rahmouni
- Repository Structure:
- `NLP_Final_Project_Code.ipynb`: Jupyter notebook with the full code implementation (data loading, preprocessing, models, evaluation).
- `incidents_train.csv`, `incidents_valid.csv`, `incidents_test.csv`: Dataset files (not included here; download from Zenodo as described below).
- `README.md`: This file.
The project addresses the SemEval-2025 Task 9: Food Hazard Detection Challenge, which focuses on automated classification of food safety incident reports to identify potential hazards. This is crucial for public health, as it enables faster response to food recalls and contamination events.
- Subtask 1: Coarse-grained classification – Predict the product category (22 classes, e.g., "meat, egg and dairy products", "fruits and vegetables") and hazard category (10 classes, e.g., "biological", "foreign bodies") from the title of the incident report.
- Subtask 2: Fine-grained classification – Predict the specific product (1,142 types, e.g., "smoked sausage", "chicken breast") and hazard (128 types, e.g., "listeria monocytogenes", "plastic fragment").
The data is multilingual and multiclass, requiring models that are accurate and interpretable to support real-world applications in food safety monitoring.
The dataset consists of simulated food recall incident reports, sourced from `food_recall_incidents.csv` on Zenodo.
- Splits:
  - Training: 5,082 samples
  - Validation: 565 samples
  - Test: 997 samples
- Features:
  - `title`: The main text feature (incident report title, preprocessed for modeling).
  - Labels: `product-category` (coarse product), `hazard-category` (coarse hazard), `product` (fine-grained product), `hazard` (fine-grained hazard).
  - Other columns (not used for modeling): `year`, `month`, `day`, `country`, `text` (full report body), etc.
- Preprocessing (a minimal sketch follows this list):
  - Convert titles to lowercase.
  - Remove punctuation and extra spaces.
  - Example: Original title "Recall Notification: FSIS-024-94" → Processed: "recall notification fsis02494".
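A minimal sketch of this cleaning step, assuming a helper named `clean_title` (the name is illustrative, not taken from the notebook):

```python
import re

def clean_title(title: str) -> str:
    """Lowercase a title, strip punctuation, and collapse extra whitespace."""
    title = title.lower()
    title = re.sub(r"[^\w\s]", "", title)        # remove punctuation
    return re.sub(r"\s+", " ", title).strip()    # collapse extra spaces

# Reproduces the example above:
print(clean_title("Recall Notification: FSIS-024-94"))  # -> "recall notification fsis02494"
```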
The dataset is imbalanced, with some categories (e.g., "meat, egg and dairy products") dominating, which poses challenges for fine-grained classification. Labels were encoded using `sklearn.preprocessing.LabelEncoder` for consistency across splits (a loading and encoding sketch follows the download instructions below).
To obtain the dataset:
- Download from Zenodo or the SemEval-2025 task repository.
- Place the CSV files in the root directory to run the notebook.
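Once the files are in place, loading the splits and encoding one label column might look like the following sketch (column and variable names follow the dataset description above; the notebook's exact code may differ):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the three splits from the repository root.
train = pd.read_csv("incidents_train.csv")
valid = pd.read_csv("incidents_valid.csv")
test = pd.read_csv("incidents_test.csv")

# Fit one encoder per target on the union of all splits so every label string
# maps to the same integer id in every split.
encoder = LabelEncoder()
encoder.fit(pd.concat([train["hazard-category"], valid["hazard-category"], test["hazard-category"]]))

y_train = encoder.transform(train["hazard-category"])
y_valid = encoder.transform(valid["hazard-category"])
y_test = encoder.transform(test["hazard-category"])
```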
The primary metric is Macro F1-score, which averages F1-scores across all classes without weighting by support. This is suitable for imbalanced multiclass problems, as it treats rare classes equally.
- F1-score = 2 * (Precision * Recall) / (Precision + Recall)
- Macro F1 = Average of per-class F1-scores.
Additional analysis includes confusion matrices, classification reports (precision, recall, F1 per class), and inspection of misclassified samples.
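In code, these metrics reduce to standard scikit-learn calls (a generic sketch with toy labels, not the notebook's exact evaluation code):

```python
from sklearn.metrics import f1_score, classification_report, confusion_matrix

# Toy integer-encoded labels for one target (e.g., hazard-category).
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]

print("Macro F1:", f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1
print(classification_report(y_true, y_pred))                   # per-class precision/recall/F1
print(confusion_matrix(y_true, y_pred))                        # rows: true class, columns: predicted
```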
- Vectorization: TF-IDF (Term Frequency-Inverse Document Frequency) with `sklearn.feature_extraction.text.TfidfVectorizer` (`max_features=5000`, `ngram_range=(1, 2)` for unigrams and bigrams).
- Classifiers:
  - Logistic Regression (`sklearn.linear_model.LogisticRegression`, `max_iter=1000`).
  - Random Forest (`sklearn.ensemble.RandomForestClassifier`, `n_estimators=100`, `random_state=42`).
- Trained separately for each subtask/target (product-category, hazard-category, product, hazard); see the sketch after this list.
- Explainability: SHAP (SHapley Additive exPlanations) for feature importance in Logistic Regression (e.g., keywords like "recall", "allergy").
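A minimal sketch of this baseline pipeline for one target, using the hyperparameters listed above (toy data stands in for the preprocessed titles and encoded labels):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Toy stand-ins for preprocessed titles and encoded labels of one target.
train_titles = ["recall of smoked sausage due to listeria",
                "plastic fragment found in chocolate bars",
                "undeclared milk in cookies allergy alert",
                "salmonella detected in frozen chicken breast"]
y_train = [0, 1, 2, 0]
valid_titles = ["listeria recall for deli meat", "metal fragment in cereal"]
y_valid = [0, 1]

# TF-IDF features over unigrams and bigrams, capped at 5,000 features.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_titles)
X_valid = vectorizer.transform(valid_titles)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

for name, model in [("Logistic Regression", lr), ("Random Forest", rf)]:
    macro = f1_score(y_valid, model.predict(X_valid), average="macro")
    print(f"{name} macro F1: {macro:.4f}")
```

The same pipeline is repeated for each of the four label columns to obtain the per-target baselines.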
References for baselines:
- TF-IDF: Jones, K. S. (1972). "A statistical interpretation of term specificity and its application in retrieval." Journal of Documentation.
- Logistic Regression: Wright, R. E. (1995). "Logistic regression." In Reading and understanding multivariate statistics.
- Random Forest: Breiman, L. (2001). "Random forests." Machine Learning.
- SHAP: Lundberg, S. M., & Lee, S. I. (2017). "A unified approach to interpreting model predictions." NeurIPS.
- BERT (Bidirectional Encoder Representations from Transformers): fine-tuned `bert-base-uncased` for sequence classification using `transformers.BertForSequenceClassification`.
  - Tokenizer: `BertTokenizer` with `max_length=128`.
  - Optimizer: AdamW with learning rate 2e-5.
  - Training: 3 epochs, `batch_size=32` (or 16 for the fine-grained targets, which have many more classes).
  - Custom Dataset class for the PyTorch DataLoader.
- Separate models for each target to handle the different class counts (see the sketch after this list).
- Explainability: LIME (Local Interpretable Model-agnostic Explanations) for local explanations of individual BERT predictions.
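A condensed, hedged sketch of such a fine-tuning setup (the notebook's actual Dataset class and training loop may differ; toy data stands in for real titles and labels):

```python
import torch
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification

class TitleDataset(Dataset):
    """Tokenizes preprocessed titles once and serves (input_ids, attention_mask, label) items."""
    def __init__(self, titles, labels, tokenizer, max_length=128):
        self.enc = tokenizer(list(titles), truncation=True, padding="max_length",
                             max_length=max_length, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {"input_ids": self.enc["input_ids"][idx],
                "attention_mask": self.enc["attention_mask"][idx],
                "labels": self.labels[idx]}

# Toy stand-ins for preprocessed titles and encoded labels of one target.
train_titles = ["recall of smoked sausage due to listeria", "plastic fragment found in chocolate bars"]
y_train = [0, 1]
num_classes = 10  # e.g., hazard-category; set per target

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_classes).to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)
loader = DataLoader(TitleDataset(train_titles, y_train, tokenizer), batch_size=32, shuffle=True)

model.train()
for epoch in range(3):  # 3 epochs, as above
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        loss = model(**batch).loss  # cross-entropy is computed internally when labels are passed
        loss.backward()
        optimizer.step()
```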
Reference for BERT:
- Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL.
All models were trained on preprocessed titles only, with reproducibility ensured via a fixed random seed (42); a minimal seeding sketch is shown below.
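Seeding typically covers Python's `random`, NumPy, and PyTorch; the notebook's exact seeding code may differ from this sketch:

```python
import random
import numpy as np
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
```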
Validation results:

| Model | Subtask | Macro F1 |
|---|---|---|
| Logistic Regression | Subtask 1 Product Cat | 0.5105 |
| Logistic Regression | Subtask 1 Hazard Cat | 0.5295 |
| Random Forest | Subtask 1 Product Cat | 0.5294 |
| Random Forest | Subtask 1 Hazard Cat | 0.5785 |
| Logistic Regression | Subtask 2 Product | 0.1354 |
| Logistic Regression | Subtask 2 Hazard | 0.1982 |
| Random Forest | Subtask 2 Product | 0.2248 |
| Random Forest | Subtask 2 Hazard | 0.3832 |
| BERT | Subtask 1 Product Cat | ~0.69 |
(Note: BERT's weighted-average F1 on validation was ~0.69; the macro value shown is approximated from the classification report.)
Test results:

| Model | Subtask | Macro F1 |
|---|---|---|
| Logistic Regression | Subtask 1 Product Cat | 0.4463 |
| Logistic Regression | Subtask 1 Hazard Cat | 0.5417 |
| Random Forest | Subtask 1 Product Cat | 0.4249 |
| Random Forest | Subtask 1 Hazard Cat | 0.5827 |
| Logistic Regression | Subtask 2 Product | 0.0787 |
| Logistic Regression | Subtask 2 Hazard | 0.1764 |
| Random Forest | Subtask 2 Product | 0.2293 |
| Random Forest | Subtask 2 Hazard | 0.3359 |
| BERT | Subtask 1 Product Cat | 0.5283 |
| BERT | Subtask 1 Hazard Cat | 0.5135 |
| BERT | Subtask 2 Product | 0.0096 |
| BERT | Subtask 2 Hazard | 0.1491 |
Error analysis revealed common misclassifications between similar categories (e.g., "ices and desserts" vs. "nuts, nut products and seeds"). SHAP highlighted keywords like "recall" and "allergy" as important features.
BERT outperforms the baselines on Subtask 1 (coarse classification) on the validation set (~0.69 vs. ~0.51-0.58 Macro F1) and on the test product-category target (0.5283 vs. 0.4463 for LR and 0.4249 for RF), although Random Forest remains stronger on the test hazard-category target (0.5827 vs. 0.5135 for BERT). BERT's contextual embeddings capture nuances in the titles that the TF-IDF bag-of-words representation misses.
However, for Subtask 2 (fine-grained), BERT underperforms (e.g., test Product: 0.0096 vs. 0.2293 RF), likely due to extreme class imbalance (1,142 products) and limited training data/epochs. Baselines like RF handle sparsity better here but still yield low scores overall.
Recommendations for improvement: address the class imbalance (e.g., via oversampling, as sketched below), train BERT for more epochs, or use hierarchical classification (predict the coarse category first, then the fine-grained label).
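As one concrete form of the oversampling suggestion, the training split could be rebalanced by random upsampling before fitting the baselines (a hedged sketch; the label column name follows the dataset description above):

```python
import pandas as pd
from sklearn.utils import resample

train = pd.read_csv("incidents_train.csv")
label_col = "product-category"  # or any of the four targets

# Upsample every class (with replacement) to the size of the largest class.
max_count = train[label_col].value_counts().max()
balanced_parts = [
    resample(group, replace=True, n_samples=max_count, random_state=42)
    for _, group in train.groupby(label_col)
]
train_balanced = pd.concat(balanced_parts).sample(frac=1, random_state=42)  # shuffle rows
print(train_balanced[label_col].value_counts())
```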
Implementation references (libraries used):
- Scikit-learn for TF-IDF, LR, RF: Pedregosa, F., et al. (2011). "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research.
- Hugging Face Transformers for BERT: Wolf, T., et al. (2020). "Transformers: State-of-the-Art Natural Language Processing." EMNLP.
- SHAP: https://github.com/shap/shap
- LIME: https://github.com/marcotcr/lime
- Install dependencies: `pip install -r requirements.txt` (create the file from the notebook imports: pandas, numpy, scikit-learn, transformers, torch, shap, lime, etc.).
- Run the notebook: `jupyter notebook NLP_Final_Project_Code.ipynb`.
- Use a CUDA-capable GPU for BERT training if one is available.
For questions, contact the team members.