This project aims to classify automotive fault descriptions into predefined complaint categories using a BERT-based model. It also extracts root causes and failure modes from the fault descriptions.
Data Preprocessing:
- Extract keywords from fault descriptions.
- Identify root causes and failure modes.
Model Training:
- Tokenize the input data using the BERT tokenizer.
- Train a BERT model for sequence classification on the preprocessed data.
- Save the trained model and tokenizer.
Prediction:
- Load the trained model and tokenizer.
- Predict complaints for new fault descriptions.
- Save predictions to an Excel file.
Online Learning:
- Retrain the model with new predictions to improve accuracy over time.

Requirements:
- Python 3.6+
- PyTorch
- Transformers (Hugging Face)
- SpaCy
- NLTK
- pandas
- openpyxl

Clone the repository:
git clone https://github.com/jatintop/fault-description-classification.git
cd fault-description-classification

Install dependencies:
pip install -r requirements.txt
python -m spacy download en_core_web_lg

Set up NLTK stemmer:
import nltk
nltk.download('snowball_data')

The script preprocesses fault descriptions by extracting keywords, root causes, and failure modes.
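
The `extract_root_cause` and `extract_failure_mode` functions in the script that follows are left as stubs, and this README does not describe their logic. Purely as an illustration, one simple rule-based approach is to look for causal cue phrases and a small vocabulary of failure terms. The cue phrases, the term list, and the `_sketch` function names here are all assumptions, not part of the project:

```python
import re

# Hypothetical cue phrases; a real project would tune these on its own data.
CAUSE_CUES = re.compile(
    r"\b(?:due to|caused by|because of|as a result of)\s+(.+?)(?:[.;,]|$)",
    re.IGNORECASE,
)

# Hypothetical failure-mode vocabulary, for illustration only.
FAILURE_TERMS = ("leak", "crack", "noise", "overheat", "stall", "vibration", "corrosion")


def extract_root_cause_sketch(text):
    """Return the phrase following the first causal cue, or None if no cue is found."""
    match = CAUSE_CUES.search(text)
    return match.group(1).strip() if match else None


def extract_failure_mode_sketch(text):
    """Return the first entry of FAILURE_TERMS that occurs in the text, or None."""
    lowered = text.lower()
    for term in FAILURE_TERMS:
        if term in lowered:
            return term
    return None
```

A rule-based pass like this is easy to inspect but misses anything phrased outside its cue list, so it is better treated as a baseline than as a finished extractor.
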
import spacy
from nltk.stem.snowball import SnowballStemmer

# Initialize SpaCy and Snowball Stemmer
nlp = spacy.load("en_core_web_lg")
stemmer = SnowballStemmer(language='english')

def extract_keywords(text):
    doc = nlp(text)
    keywords = [stemmer.stem(token.text) for token in doc if not token.is_stop and token.is_alpha]
    return ' '.join(keywords)

def extract_root_cause(text):
    # Root cause extraction logic
    pass

def extract_failure_mode(text):
    # Failure mode extraction logic
    pass

Train a BERT model using fault descriptions and complaint categories.
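
The training snippet relies on a `label_dict` that maps category names to integer ids, and the prediction code later uses the inverse `index_to_label`. Neither is defined in this README, so the following is only a sketch of how they could be built, together with a simple shuffled train/validation split. The helper names `build_label_maps` and `train_val_split`, the 20% validation fraction, and the seed are assumptions:

```python
import random


def build_label_maps(complaints):
    """Map each distinct complaint category to an integer id, and back."""
    categories = sorted(set(complaints))  # sorted so ids are stable across runs
    label_dict = {label: idx for idx, label in enumerate(categories)}
    index_to_label = {idx: label for label, idx in label_dict.items()}
    return label_dict, index_to_label


def train_val_split(n_items, val_fraction=0.2, seed=42):
    """Return shuffled (train_indices, val_indices) covering range(n_items)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(n_items * val_fraction))
    return indices[n_val:], indices[:n_val]
```

The index lists can then be used to slice the keyword strings and labels into two sets before tokenizing. Saving `label_dict` alongside the model keeps the id-to-category mapping consistent between training and prediction.
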
from transformers import BertTokenizerFast, BertForSequenceClassification, Trainer, TrainingArguments
import torch

# Tokenize and encode data
# keywords is assumed to be a list of preprocessed description strings,
# and complaints the matching list of category labels
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
encodings = tokenizer(keywords, truncation=True, padding=True, max_length=128, return_tensors='pt')
labels = torch.tensor([label_dict[label] for label in complaints])

# Split data into training and validation sets
# Create datasets and train the model

Load the trained model and tokenizer to predict complaints for new fault descriptions.
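
The prediction function that follows takes the arg-max over the model's logits and maps the winning index back to a category name. In plain Python, with the logits given as a list of floats, that decoding step amounts to roughly the following. The `softmax` and `decode_prediction` helpers are hypothetical, and returning a confidence value is an addition that the project's own function does not make:

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(logits, index_to_label):
    """Return (label, confidence) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return index_to_label[best], probs[best]
```

Keeping the confidence next to the label makes it possible to flag uncertain predictions for manual review.
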
def predict_complaint(FaultD):
    # Preprocess the raw description the same way as the training data
    keywords = extract_keywords(FaultD)
    inputs = tokenizer(keywords, return_tensors='pt', truncation=True, padding=True, max_length=128)
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits, dim=1).item()
    complaint = index_to_label[predicted_class]
    return complaint

# Load new data and predict complaints

Retrain the model with new predictions to adapt to new data.
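
Retraining on the model's own predictions can reinforce its mistakes, and this README does not say how new predictions are vetted before they are reused. One common safeguard, shown here only as a sketch, is to keep predictions above a confidence threshold and to skip texts that are already in the training set. The function names, the tuple layout, and the 0.9 threshold are assumptions:

```python
def select_for_retraining(predictions, min_confidence=0.9):
    """Keep (text, label) pairs whose confidence meets the threshold.

    predictions is an iterable of (text, label, confidence) tuples.
    """
    return [(text, label) for text, label, conf in predictions if conf >= min_confidence]


def merge_training_data(existing, new_examples):
    """Append new (text, label) pairs, skipping texts that already have a label."""
    seen = {text for text, _ in existing}
    merged = list(existing)
    for text, label in new_examples:
        if text not in seen:
            merged.append((text, label))
            seen.add(text)
    return merged
```

The merged list would then be tokenized and passed to the trainer in the same way as the original training data. Having a person review low-confidence cases before they are added is a further option.
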
def retrain_model(new_data_df):
    # Combine new data with existing training data
    # Retrain the model
    trainer.train()
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

Save the predictions along with root causes and failure modes to an Excel file.
new_df['Prediction'] = new_df['FaultD'].apply(predict_complaint)
new_df['Root Cause'] = new_df['FaultD'].apply(extract_root_cause)
new_df['Failure Mode'] = new_df['FaultD'].apply(extract_failure_mode)
new_df.to_excel(output_file_path, index=False)

This project is licensed under the MIT License. See the LICENSE file for details.

For any queries or suggestions, please contact jatintopakar@yahoo.com.