CNN-VIT-Project

Advanced Topics Project by Lana Lee, Alex Manko, and Jack Tinker

This project explores methods for combining the strengths of CNNs with those of the more modern vision transformers (ViTs). We used the Food-101 dataset, introduced by Bossard et al., to evaluate the efficacy of our approaches.

CODE.ipynb contains all code for this project, including data loading and preprocessing, model definition, training, and evaluation. Each section is clearly labelled.
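For reference, here is a minimal sketch of the data-loading and preprocessing step, assuming torchvision's built-in Food101 dataset wrapper and the standard ImageNet preprocessing expected by the pretrained backbones (the notebook contains the exact pipeline):

```python
import torch
from torchvision import datasets, transforms

# Standard ImageNet-style preprocessing expected by the pretrained backbones.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Food-101: 101 classes, 750 training and 250 test images per class.
train_set = datasets.Food101(root="data", split="train", transform=preprocess, download=True)
test_set = datasets.Food101(root="data", split="test", transform=preprocess, download=True)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64, shuffle=False, num_workers=4)
```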

PyTorch's built-in support for vision transformers is currently quite limited relative to CNNs, so we began by implementing a basic, customizable ViT architecture from scratch. Debugging and preliminary evaluation of this model were performed on the MNIST dataset. This simple model performed poorly on the Food-101 dataset, likely because Food-101 is relatively small for training a ViT from scratch.

In light of this, we pivoted to a ViT pretrained on ImageNet-1K to establish a baseline. Setting the last 4 of 16 attention heads to be trainable gave the following results:

The pretrained vision transformer reached around 60% accuracy before rapidly overfitting.

(Figure: training results for the pretrained ViT baseline.)
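The exact configuration is in the notebook; as a rough sketch (assuming torchvision's pretrained vit_b_16 and unfreezing the last four encoder blocks as the trainable attention layers), the baseline setup looks something like this:

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# ViT-B/16 pretrained on ImageNet-1K.
vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Freeze everything, then unfreeze only the last few encoder blocks.
for p in vit.parameters():
    p.requires_grad = False
for block in vit.encoder.layers[-4:]:
    for p in block.parameters():
        p.requires_grad = True

# Replace the ImageNet head with a Food-101 head (101 classes); the new layer is trainable.
vit.heads = nn.Sequential(nn.Linear(vit.hidden_dim, 101))

# Optimize only the trainable parameters.
optimizer = torch.optim.AdamW((p for p in vit.parameters() if p.requires_grad), lr=1e-4)
```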

We then tried a naive combination of the pretrained ViT and a ResNet50 (also pretrained on ImageNet) by simply concatenating their outputs and passing the result through a linear layer. Both backbones were frozen so that only this final linear layer was trained.

(Figures: accuracy and loss curves for the combined model.)
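A minimal sketch of this fusion head (assuming torchvision's pretrained backbones; the class name ConcatFusion is ours for illustration, not the notebook's):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights, vit_b_16, ViT_B_16_Weights

class ConcatFusion(nn.Module):
    """Frozen ResNet50 + frozen ViT-B/16; only the final linear layer is trained."""

    def __init__(self, num_classes=101):
        super().__init__()
        self.cnn = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
        self.cnn.fc = nn.Identity()        # expose the 2048-d pooled CNN features
        self.vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        self.vit.heads = nn.Identity()     # expose the 768-d class-token features
        for p in list(self.cnn.parameters()) + list(self.vit.parameters()):
            p.requires_grad = False        # freeze both backbones
        self.classifier = nn.Linear(2048 + 768, num_classes)

    def forward(self, x):
        feats = torch.cat([self.cnn(x), self.vit(x)], dim=1)
        return self.classifier(feats)
```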

Next, we used ResNet50 as a feature extractor, passing its features to our untrained ViT. The ResNet was frozen and the ViT was trained.

(Figure: results for the ResNet-to-ViT feature-extractor model.)
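One way to sketch this idea, under our own assumptions (the 7x7 ResNet feature map is flattened into 49 tokens for a small transformer encoder built from PyTorch primitives; the notebook has the actual implementation):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class ResNetTokenViT(nn.Module):
    """Frozen ResNet50 trunk produces a 7x7x2048 feature map whose 49 spatial
    positions are used as tokens for a trainable transformer encoder."""

    def __init__(self, num_classes=101, dim=768, depth=6, heads=8):
        super().__init__()
        cnn = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
        self.trunk = nn.Sequential(*list(cnn.children())[:-2])   # drop avgpool and fc
        for p in self.trunk.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(2048, dim)                  # CNN channels -> token dimension
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))   # learnable class token
        self.pos = nn.Parameter(torch.zeros(1, 1 + 7 * 7, dim))  # learnable positions
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                 # x: (B, 3, 224, 224)
        f = self.trunk(x)                                 # (B, 2048, 7, 7)
        tokens = self.proj(f.flatten(2).transpose(1, 2))  # (B, 49, dim)
        tokens = torch.cat([self.cls.expand(len(x), -1, -1), tokens], dim=1) + self.pos
        return self.head(self.encoder(tokens)[:, 0])      # classify from the class token
```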

Finally, we ran the two pretrained models in parallel, frozen as before, but with a multi-head attention mechanism before the linear layer that attends to and weights their outputs. The results of this model are below:

(Figures: accuracy and loss curves for the attention-fusion model.)
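A minimal sketch of that attention-based fusion (the class name AttentionFusion, the projection dimension, and the use of nn.MultiheadAttention are our assumptions about how this could be implemented):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights, vit_b_16, ViT_B_16_Weights

class AttentionFusion(nn.Module):
    """Frozen ResNet50 and ViT-B/16 run in parallel; multi-head self-attention
    over their projected embeddings weights the two streams before the classifier."""

    def __init__(self, num_classes=101, dim=512, num_heads=4):
        super().__init__()
        self.cnn = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
        self.cnn.fc = nn.Identity()
        self.vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        self.vit.heads = nn.Identity()
        for p in list(self.cnn.parameters()) + list(self.vit.parameters()):
            p.requires_grad = False                      # backbones stay frozen
        self.proj_cnn = nn.Linear(2048, dim)             # map both streams to a common size
        self.proj_vit = nn.Linear(768, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, x):
        # Treat the two backbone embeddings as a two-token sequence.
        tokens = torch.stack([self.proj_cnn(self.cnn(x)), self.proj_vit(self.vit(x))], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)  # (B, 2, dim)
        return self.classifier(attended.flatten(1))      # (B, 2*dim) -> class logits
```

Projecting both embeddings to a common dimension lets them be treated as a short token sequence, so the attention block can learn how much weight to give the CNN stream versus the ViT stream for each image.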

We used four attention heads because that configuration had the highest accuracy, though a single head likely could have reduced the model's complexity at little cost to accuracy:

(Figure: accuracy versus number of attention heads.)

Below are sample predictions produced by our final model. (Code to reproduce these results can be found at the end of the notebook.)

(Figure: sample predictions from the final model.)
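A minimal sketch of how such predictions could be generated, reusing the hypothetical AttentionFusion module and Food-101 test loader from the sketches above (in practice the trained weights would be loaded first):

```python
import torch

model = AttentionFusion(num_classes=101)   # load trained weights here in practice
model.eval()

images, labels = next(iter(test_loader))
with torch.no_grad():
    preds = model(images).argmax(dim=1)

for p, t in zip(preds, labels):
    print(f"predicted: {test_set.classes[int(p)]}  actual: {test_set.classes[int(t)]}")
```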

Given more time and computing resources, we would like to explore how the performance of the combined model could be improved by unfreezing the later layers of the pretrained models for fine-tuning. Another promising direction would be to use these hybrid models to generate pseudo-labels for an unlabeled dataset in a student-teacher setup. Vision transformers are an inherently data-hungry architecture, so more data would likely improve performance substantially, especially when training from scratch.

Code to produce all visuals can be found in CODE.ipynb.
