Captionify is a simple web application that generates English captions from images using a deep learning model. It combines a FastAPI backend (for model inference) and a responsive HTML/CSS/JS frontend.
The model uses a CNN + RNN (encoder-decoder) architecture and is trained on datasets such as Flickr8k.
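For orientation, here is a minimal sketch of such a CNN + RNN captioning model. The ResNet-50 backbone, layer sizes, and class names are illustrative assumptions, not the repository's exact code.

```python
# A minimal sketch of a CNN encoder + RNN decoder for captioning.
# Backbone choice and dimensions are assumptions for illustration.
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    def __init__(self, embed_size: int):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the final classification layer; keep the pooled features.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)

    def forward(self, images):
        features = self.cnn(images).flatten(1)  # (batch, 2048)
        return self.fc(features)                # (batch, embed_size)

class DecoderRNN(nn.Module):
    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image features as the first step of the sequence.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)  # per-step vocabulary logits
```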
- Upload images via drag & drop or a file browser.
- Generate descriptive English captions from your images.
- Live typing effect for generated captions.
- Trained on Flickr8k dataset with a custom PyTorch model.
- API endpoint `POST /reload-model` to dynamically reload the latest model from Kaggle (see the sketch after this list).
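A minimal sketch of calling the reload endpoint from Python. Only the route comes from this README; the use of the `requests` library and the printed response body are assumptions.

```python
# Trigger a reload of the latest model on a locally running server.
import requests

response = requests.post("http://localhost:8000/reload-model")
response.raise_for_status()
print(response.text)  # response shape is an assumption
```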
```bash
git clone https://github.com/quangduy201/captionify.git
cd captionify
```

```bash
# Create a virtual environment named '.venv'
python -m venv .venv

# Activate the virtual environment
# On Windows
.venv\Scripts\activate
# On macOS/Linux
source .venv/bin/activate
```

```bash
pip install -r requirements.txt
```

```bash
uvicorn run:app --reload
```

This will:
- Automatically download the latest model from the Kaggle Model Hub (sketched below).
- Load the model and vocabulary into memory.
- Start FastAPI on http://localhost:8000.
Open a web browser and go to http://localhost:8000 to access the application.
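The startup behavior described above amounts to roughly the following. This is a minimal sketch: the model handle and checkpoint keys are hypothetical, not the repository's actual values; only the use of `kagglehub` and the `checkpoint.pth.tar` filename come from this README.

```python
# Download the latest model from the Kaggle Model Hub and load the
# checkpoint and vocabulary into memory.
import kagglehub
import torch

model_dir = kagglehub.model_download("quangduy201/captionify/pyTorch/default")  # hypothetical handle
checkpoint = torch.load(f"{model_dir}/checkpoint.pth.tar", map_location="cpu")

state_dict = checkpoint["state_dict"]  # assumed key for model weights
vocab = checkpoint["vocab"]            # assumed key for the fitted vocabulary
```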
You can train a custom captioning model using the provided Kaggle notebook:
1. Visit the provided Kaggle link and create a copy of the notebook.
2. Choose the dataset that is most suitable for your model. The default dataset is flickr8k.
3. Train the model using a GPU T4 x2 or GPU P100 accelerator.
4. Press Run all in the notebook to train your custom model.
5. After training, download the trained model checkpoint (`/kaggle/working/training/output/checkpoint.pth.tar`).
6. Place the downloaded `checkpoint.pth.tar` file in the `training/output` directory of the repository.
7. If you want to keep improving a trained model, upload your current checkpoint and attach it as an Input of the notebook.
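Resuming from a previous checkpoint typically looks like the following. The Input path and checkpoint keys here follow common PyTorch conventions and are assumptions, not the notebook's exact layout; the `nn.LSTM` stands in for the real captioning model.

```python
# Restore model and optimizer state from a checkpoint attached as a
# notebook Input, then continue training from there.
import torch
import torch.nn as nn

model = nn.LSTM(256, 512)  # placeholder for the captioning model
optimizer = torch.optim.Adam(model.parameters())

checkpoint = torch.load(
    "/kaggle/input/your-checkpoint/checkpoint.pth.tar",  # hypothetical Input path
    map_location="cpu",
)
model.load_state_dict(checkpoint["state_dict"])
optimizer.load_state_dict(checkpoint["optimizer"])
```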
- fastapi
- uvicorn
- torch
- torchvision
- spacy
- tqdm
- Pillow
- python-multipart
- tensorboard
- kagglehub