Explainable Early Detection of Diabetic Foot Ulcers Using Thermal Imaging with Vision Transformers and Grad-CAM
This project utilizes Vision Transformer (ViT) and Data-efficient Image Transformer (DeiT) models to classify plantar thermogram images for the early detection of Diabetic Foot Ulcers (DFU).
To evaluate the models, I use stratified 5-fold cross-validation, so that every fold preserves the overall class proportions, and a class-weighted loss function to counter class imbalance. Grad-CAM visualizations are also included to make the models' predictions explainable.
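The evaluation setup can be illustrated with a minimal, dependency-free sketch. It is not the project's actual code: the helper names `stratified_k_fold` and `balanced_class_weights`, the label encoding, and the class sizes are all assumptions made for illustration. The weighting shown is the common inverse-frequency heuristic; the weighting scheme actually used in this project is not stated here.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs whose class ratios match the full set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, val_idx


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical labels (0 = control, 1 = DFU); the class sizes are invented.
labels = [0] * 20 + [1] * 80
splits = list(stratified_k_fold(labels, k=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    # Weights are computed on the training portion only, to avoid leakage.
    weights = balanced_class_weights([labels[i] for i in train_idx])
    print(fold, dict(Counter(labels[i] for i in val_idx)), weights)
```

In a training loop, the per-class weights would typically be passed to the loss function so that errors on the minority class cost more than errors on the majority class.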
I use Gradient-weighted Class Activation Mapping (Grad-CAM) to visualize the regions of the thermogram that the models focus on when making a prediction.
| ViT Attention Map | DeiT Attention Map |
|---|---|
| ![]() | ![]() |
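For readers unfamiliar with Grad-CAM, the core computation is small enough to sketch in plain Python. This is an illustration under stated assumptions, not the project's implementation: `grad_cam` is a hypothetical helper, and the inputs are toy values. For a transformer, the patch-token outputs of a chosen block (without the class token) are usually reshaped into an H x W grid first; which layer this project hooks is not specified here.

```python
def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM map.

    activations, gradients: K channels, each an H x W nested list.
    gradients holds d(class score) / d(activation) for the target class.
    """
    k = len(activations)
    h, w = len(activations[0]), len(activations[0][0])

    # Channel weight alpha_c = spatial mean of the gradient for channel c.
    alphas = [sum(sum(row) for row in grad) / (h * w) for grad in gradients]

    # Weighted sum over channels, then ReLU to keep only positive evidence.
    cam = [
        [
            max(0.0, sum(alphas[c] * activations[c][i][j] for c in range(k)))
            for j in range(w)
        ]
        for i in range(h)
    ]

    # Scale to [0, 1] so the map can be overlaid on the thermogram.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example: 2 channels on a 2 x 2 grid (values invented for illustration).
acts = [[[1, 2], [3, 4]], [[4, 3], [2, 1]]]
grads = [[[1, 1], [1, 1]], [[-1, 0], [0, -1]]]
cam = grad_cam(acts, grads)
for row in cam:
    print([round(value, 3) for value in row])
```

The resulting low-resolution map is normally upsampled to the input image size and blended with the thermogram as a heatmap.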
The following table shows the performance metrics (Mean ± Std) across folds for the Vision Transformer (ViT) and Data-efficient Image Transformer (DeiT) models.
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| ViT | 0.9193 ± 0.0489 | 0.9416 ± 0.0493 | 0.9508 ± 0.0471 | 0.9452 ± 0.0330 |
| DeiT | 0.9221 ± 0.0387 | 0.9541 ± 0.0180 | 0.9385 ± 0.0456 | 0.9459 ± 0.0277 |
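As a rough sketch of how "Mean ± Std" figures like these are typically produced, the snippet below computes precision, recall, and F1 per fold and then aggregates them. The helpers `prf` and `summarize` are hypothetical, the per-fold counts are invented, and the use of the population standard deviation is an assumption (the source does not say whether population or sample standard deviation was reported).

```python
from statistics import mean, pstdev


def prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def summarize(values):
    """Format per-fold values as 'mean ± std' (population std, an assumption)."""
    return f"{mean(values):.4f} ± {pstdev(values):.4f}"


# Invented per-fold counts (tp, fp, fn); these are NOT the project's results.
fold_counts = [(18, 1, 2), (19, 2, 1), (17, 1, 3), (20, 0, 0), (18, 2, 2)]
per_fold = [prf(*counts) for counts in fold_counts]

print("Precision:", summarize([p for p, _, _ in per_fold]))
print("Recall:   ", summarize([r for _, r, _ in per_fold]))
print("F1 Score: ", summarize([f for _, _, f in per_fold]))
```

Because F1 is computed per fold and then averaged, the mean F1 generally differs slightly from the harmonic mean of the mean precision and mean recall, which is consistent with the table above.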
Comparison of classification performance between ViT and DeiT.
| Vision Transformer (ViT) | Data-efficient Image Transformer (DeiT) |
|---|---|
| ![]() | ![]() |
Accuracy and F1 score over training epochs, averaged across folds.
| Vision Transformer (ViT) | Data-efficient Image Transformer (DeiT) |
|---|---|
| ![]() | ![]() |





