A Neural Image Caption Generator

  1. Introduction
  2. Architecture

Introduction

Architecture

(architecture diagram)

  • The PyTorch implementation can be found in encoder_decoder.py.

Encoder

  • The encoder is an EfficientNet with weights pretrained on ImageNet.
  • The final layer of the EfficientNet is removed, and all prior layers are frozen for the duration of training.
  • The extracted image features are passed through a linear layer that reduces their dimensionality to that of the joint embedding space.
  • This projection layer is trained jointly with the decoder so that the joint embedding space is learned end-to-end (see the sketch after this list).
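A minimal sketch of what such an encoder can look like in PyTorch (the class name, the use of torchvision's `efficientnet_b0`, and the 1280-dimensional feature width are illustrative assumptions, not taken from `encoder_decoder.py`):

```python
import torch
import torch.nn as nn
from torchvision import models

class Encoder(nn.Module):
    """EfficientNet backbone with a trainable projection into the joint embedding space."""

    def __init__(self, embed_dim: int):
        super().__init__()
        # Pretrained EfficientNet; the classification head is dropped and only
        # the convolutional feature extractor is kept.
        backbone = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
        self.features = backbone.features
        self.pool = nn.AdaptiveAvgPool2d(1)

        # Freeze all pretrained layers for the duration of training.
        for p in self.features.parameters():
            p.requires_grad = False

        # Only this projection layer is trained jointly with the decoder.
        self.project = nn.Linear(1280, embed_dim)  # 1280 = EfficientNet-B0 feature width (assumed variant)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.pool(self.features(images)).flatten(1)  # (B, 1280)
        return self.project(feats)                               # (B, embed_dim)
```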

Decoder

  • The decoder is an LSTM which generates a caption for the image.
  • At the start of decoding, the feature vector from the encoder is fed into the LSTM as its first input, so the hidden state is conditioned on the embedded representation of the image.
  • A linear layer maps the hidden-state outputs to the vocabulary space, producing a probability distribution over the next word in the caption (see the sketch after this list).
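A minimal sketch of the corresponding decoder (again, class name and dimensions are illustrative assumptions rather than the contents of `encoder_decoder.py`). The image embedding is prepended as the first LSTM input, and a final linear layer produces vocabulary logits:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """LSTM language model conditioned on the image embedding."""

    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Maps hidden states to vocabulary logits, i.e. an (unnormalised)
        # distribution over the next word in the caption.
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_embedding: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Prepend the image embedding as the first "token" so the hidden state
        # sees the image before any words are generated.
        word_embeddings = self.embed(captions)                            # (B, T, embed_dim)
        inputs = torch.cat([image_embedding.unsqueeze(1), word_embeddings], dim=1)
        hidden, _ = self.lstm(inputs)                                     # (B, T+1, hidden_dim)
        return self.fc(hidden)                                            # (B, T+1, vocab_size)
```

During training, such a decoder would typically be driven with teacher forcing, for example:

```python
image_emb = encoder(images)            # (B, embed_dim)
logits = decoder(image_emb, captions)  # score each next-word prediction against the shifted caption
```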

