CNN-VIT-Project

Advanced Topics Project by Lana Lee, Alex Manko, and Jack Tinker

This project explores methods for combining the strengths of CNNs with those of the more modern vision transformers (ViTs). We used the Food-101 dataset, introduced by Bossard et al., to evaluate the efficacy of our approaches.

CODE.ipynb contains all code for this project, including data loading and preprocessing, model definition, training, and evaluation. Each section is clearly labelled.
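For reference, here is a minimal sketch of the data-loading and preprocessing step, assuming torchvision's built-in Food101 dataset wrapper and the standard ImageNet preprocessing expected by the pretrained backbones (the notebook contains the exact pipeline):

```python
import torch
from torchvision import datasets, transforms

# Standard ImageNet-style preprocessing expected by the pretrained backbones.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Food-101: 101 classes, 750 training and 250 test images per class.
train_set = datasets.Food101(root="data", split="train", transform=preprocess, download=True)
test_set = datasets.Food101(root="data", split="test", transform=preprocess, download=True)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64, shuffle=False, num_workers=4)
```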

PyTorch's built-in support for vision transformers is currently quite limited relative to CNNs, so we began by implementing a basic, customizable ViT architecture from scratch. Debugging and preliminary evaluation of this model were performed on the MNIST dataset. This simple model performed poorly on the Food-101 dataset, likely because Food-101 is relatively small for training a ViT from scratch.

In light of this, we pivoted to a ViT pretrained on ImageNet-1K to establish a baseline. Setting the last 4 of 16 attention heads to be trainable gave the following results:

The pretrained vision transformer reached around 60% accuracy before rapidly overfitting.

(Figure: training results for the pretrained ViT baseline.)
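The exact configuration is in the notebook; as a rough sketch (assuming torchvision's pretrained vit_b_16 and unfreezing the last four encoder blocks as the trainable attention layers), the baseline setup looks something like this:

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# ViT-B/16 pretrained on ImageNet-1K.
vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Freeze everything, then unfreeze only the last few encoder blocks.
for p in vit.parameters():
    p.requires_grad = False
for block in vit.encoder.layers[-4:]:
    for p in block.parameters():
        p.requires_grad = True

# Replace the ImageNet head with a Food-101 head (101 classes); the new layer is trainable.
vit.heads = nn.Sequential(nn.Linear(vit.hidden_dim, 101))

# Optimize only the trainable parameters.
optimizer = torch.optim.AdamW((p for p in vit.parameters() if p.requires_grad), lr=1e-4)
```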

We then tried a naive combination of the pretrained ViT and a ResNet50 (also pretrained on ImageNet) by simply concatenating their outputs and passing the result through a linear layer. Both backbones were frozen so that only this final linear layer was trained.

(Figures: accuracy and loss curves for the combined model.)
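A minimal sketch of this fusion head (assuming torchvision's pretrained backbones; the class name ConcatFusion is ours for illustration, not the notebook's):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights, vit_b_16, ViT_B_16_Weights

class ConcatFusion(nn.Module):
    """Frozen ResNet50 + frozen ViT-B/16; only the final linear layer is trained."""

    def __init__(self, num_classes=101):
        super().__init__()
        self.cnn = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
        self.cnn.fc = nn.Identity()        # expose the 2048-d pooled CNN features
        self.vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        self.vit.heads = nn.Identity()     # expose the 768-d class-token features
        for p in list(self.cnn.parameters()) + list(self.vit.parameters()):
            p.requires_grad = False        # freeze both backbones
        self.classifier = nn.Linear(2048 + 768, num_classes)

    def forward(self, x):
        feats = torch.cat([self.cnn(x), self.vit(x)], dim=1)
        return self.classifier(feats)
```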

Next, we used ResNet50 as a feature extractor, passing its features to our untrained ViT. The ResNet was frozen and the ViT was trained.

(Figure: results for the ResNet-to-ViT feature-extractor model.)
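One way to sketch this idea, under our own assumptions (the 7x7 ResNet feature map is flattened into 49 tokens for a small transformer encoder built from PyTorch primitives; the notebook has the actual implementation):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class ResNetTokenViT(nn.Module):
    """Frozen ResNet50 trunk produces a 7x7x2048 feature map whose 49 spatial
    positions are used as tokens for a trainable transformer encoder."""

    def __init__(self, num_classes=101, dim=768, depth=6, heads=8):
        super().__init__()
        cnn = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
        self.trunk = nn.Sequential(*list(cnn.children())[:-2])   # drop avgpool and fc
        for p in self.trunk.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(2048, dim)                  # CNN channels -> token dimension
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))   # learnable class token
        self.pos = nn.Parameter(torch.zeros(1, 1 + 7 * 7, dim))  # learnable positions
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                 # x: (B, 3, 224, 224)
        f = self.trunk(x)                                 # (B, 2048, 7, 7)
        tokens = self.proj(f.flatten(2).transpose(1, 2))  # (B, 49, dim)
        tokens = torch.cat([self.cls.expand(len(x), -1, -1), tokens], dim=1) + self.pos
        return self.head(self.encoder(tokens)[:, 0])      # classify from the class token
```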

Finally, we ran the two pretrained models in parallel, frozen as before, but with a multi-head attention mechanism before the linear layer that attends to and weights their outputs. The results of this model are below:

(Figures: accuracy and loss curves for the attention-fusion model.)
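A minimal sketch of that attention-based fusion (the class name AttentionFusion, the projection dimension, and the use of nn.MultiheadAttention are our assumptions about how this could be implemented):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights, vit_b_16, ViT_B_16_Weights

class AttentionFusion(nn.Module):
    """Frozen ResNet50 and ViT-B/16 run in parallel; multi-head self-attention
    over their projected embeddings weights the two streams before the classifier."""

    def __init__(self, num_classes=101, dim=512, num_heads=4):
        super().__init__()
        self.cnn = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
        self.cnn.fc = nn.Identity()
        self.vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        self.vit.heads = nn.Identity()
        for p in list(self.cnn.parameters()) + list(self.vit.parameters()):
            p.requires_grad = False                      # backbones stay frozen
        self.proj_cnn = nn.Linear(2048, dim)             # map both streams to a common size
        self.proj_vit = nn.Linear(768, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, x):
        # Treat the two backbone embeddings as a two-token sequence.
        tokens = torch.stack([self.proj_cnn(self.cnn(x)), self.proj_vit(self.vit(x))], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)  # (B, 2, dim)
        return self.classifier(attended.flatten(1))      # (B, 2*dim) -> class logits
```

Projecting both embeddings to a common dimension lets them be treated as a short token sequence, so the attention block can learn how much weight to give the CNN stream versus the ViT stream for each image.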

We used four attention heads because that configuration had the highest accuracy, though a single head likely could have reduced the model's complexity at little cost to accuracy:

(Figure: accuracy versus number of attention heads.)

Below are sample predictions produced by our final model. (Code to reproduce these results can be found at the end of the notebook.)

(Figure: sample predictions from the final model.)
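A minimal sketch of how such predictions could be generated, reusing the hypothetical AttentionFusion module and Food-101 test loader from the sketches above (in practice the trained weights would be loaded first):

```python
import torch

model = AttentionFusion(num_classes=101)   # load trained weights here in practice
model.eval()

images, labels = next(iter(test_loader))
with torch.no_grad():
    preds = model(images).argmax(dim=1)

for p, t in zip(preds, labels):
    print(f"predicted: {test_set.classes[int(p)]}  actual: {test_set.classes[int(t)]}")
```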

Given more time and computing resources, we would like to explore how the performance of the combined model could be improved by unfreezing the later layers of the pretrained models for fine-tuning. Another promising direction would be to use these hybrid models to generate pseudo-labels for an unlabeled dataset in a student-teacher setup. Vision transformers are an inherently data-hungry architecture, so more data would likely improve performance substantially, especially when training from scratch.

Code to produce all visuals can be found in CODE.ipynb.
