The OBSS Internship Competition 2025 is a machine learning challenge focused on image captioning. The goal is to generate semantically meaningful and contextually accurate textual descriptions for input images.
This repository contains my work throughout the competition, including models, training experiments, submission files, and result visualizations.
This project leverages state-of-the-art vision-language models, fine-tuned on the competition dataset, to generate high-quality image captions (a minimal captioning sketch follows the list):
- 🔹 BLIP – Bootstrapped Language-Image Pretraining
- 🔹 BLIP-2 – Enhanced with Flan-T5-XL for multimodal reasoning
- 🔹 GIT – Generative Image-to-Text Transformer
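As a quick illustration of the captioning pipeline these models share, here is a minimal sketch using the base BLIP checkpoint from the Hugging Face hub. The checkpoint id `Salesforce/blip-image-captioning-base` and the file name `example.jpg` are placeholders, not this repository's fine-tuned weights:

```python
# Minimal BLIP captioning sketch (base checkpoint, not the fine-tuned weights).
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"  # swap in a fine-tuned checkpoint here
).to(device)

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt").to(device)

output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```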
The dataset contains 21,367 images. Below are a few examples from the training set:

Caption quality is primarily evaluated using the Fréchet GTE Distance (FGD) metric, a semantic similarity score based on sentence embeddings.
🔸 Embedding Model: gte-small
🔸 Additional metrics such as BLEU and CIDEr may be added for comparison
Evaluation scripts can be included if needed.
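For reference, here is a sketch of how a Fréchet distance over gte-small sentence embeddings can be computed. This is a reconstruction of the FGD idea, not the official competition scorer; the model id `thenlper/gte-small`, the helper `frechet_distance`, and the toy captions are assumptions:

```python
# Sketch of a Fréchet-style distance between two sets of caption embeddings.
# Reconstruction of the FGD idea -- not the official competition scorer.
import numpy as np
from scipy.linalg import sqrtm
from sentence_transformers import SentenceTransformer

def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):  # sqrtm can return tiny imaginary noise
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

model = SentenceTransformer("thenlper/gte-small")  # assumed embedding model id
preds = model.encode(
    ["a dog runs on the beach", "two people ride bikes", "a red car on a street"],
    convert_to_numpy=True,
)
refs = model.encode(
    ["a dog running along the shore", "cyclists on a path", "a red car parked outside"],
    convert_to_numpy=True,
)
# Real evaluation should use the full caption sets so the covariances are meaningful.
print(frechet_distance(preds, refs))
```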
🚀 How to Use

🔗 Run Experiments via Colab: each model has a dedicated Colab notebook.
