This repository contains code and notebooks to fine-tune the BLIP-2 (Bootstrapping Language-Image Pre-training) model for image captioning on the Flickr8k dataset, using LoRA via the PEFT library for efficient training.
- Base model: BLIP-2 from Salesforce (vision encoder + Q-Former + frozen LLM)
- Task: Image captioning
- Dataset: Flickr8k (8,000 images, each with 5 captions)
- Fine-tuning method: LoRA (Low-Rank Adaptation), applied via PEFT (Parameter-Efficient Fine-Tuning) — see the sketch after this list
- Frameworks / Libraries:
  - `transformers` (Hugging Face)
  - `datasets` for data loading
  - `peft` for LoRA
  - `bitsandbytes` (optional, for memory-efficient training)
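For orientation, here is a minimal sketch of how these pieces fit together: load a BLIP-2 checkpoint with `transformers` and wrap it with LoRA adapters via `peft`. The checkpoint name and the `target_modules` below are assumptions; the values actually used in this repo live in `src/train.py`.

```python
from transformers import Blip2ForConditionalGeneration, Blip2Processor
from peft import LoraConfig, get_peft_model

# Assumed base checkpoint; any BLIP-2 variant from the Hugging Face Hub works the same way.
base_model = "Salesforce/blip2-opt-2.7b"

processor = Blip2Processor.from_pretrained(base_model)
model = Blip2ForConditionalGeneration.from_pretrained(base_model)  # optionally load_in_8bit=True with bitsandbytes

# LoRA config: only the low-rank adapter weights are trained; the base model stays frozen.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # assumed: attention projections of the OPT language model
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```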
To set up the project:

- Clone the repository:

  ```bash
  git clone https://github.com/Muavia1/BLIP-2-Fine-Tuned.git
  cd BLIP-2-Fine-Tuned
  ```

- Create a Python environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate   # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
To fine-tune the model:
```bash
python src/train.py \
  --dataset_dir /path/to/flickr8k/ \
  --output_dir ./models/blip2-finetuned/ \
  --per_device_train_batch_size 8 \
  --learning_rate 1e-4 \
  --num_train_epochs 5 \
  --lora_rank 8 \
  --lora_alpha 32 \
  --lora_dropout 0.1
```

Parameters explanation (a minimal training-step sketch follows the list):

- `dataset_dir`: directory where the Flickr8k images and captions are stored
- `output_dir`: where to save the fine-tuned model
- `lora_rank`, `lora_alpha`, `lora_dropout`: LoRA hyperparameters (adapter rank, scaling factor, and dropout)
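For reference, a single LoRA training step for BLIP-2 captioning boils down to roughly the following. This continues from the loading sketch above; `train_dataset` and its `image`/`caption` fields are hypothetical placeholders, and `src/train.py` is the authoritative implementation.

```python
import torch
from torch.utils.data import DataLoader

def collate_fn(batch):
    # Each item is assumed to look like {"image": <PIL.Image>, "caption": "a dog runs ..."}.
    images = [item["image"] for item in batch]
    captions = [item["caption"] for item in batch]
    inputs = processor(images=images, text=captions, padding=True, return_tensors="pt")
    inputs["labels"] = inputs["input_ids"].clone()  # caption tokens double as the targets
    return inputs                                   # (padding tokens are not masked out here, for simplicity)

loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_fn)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # only LoRA params have requires_grad=True

model.train()
for batch in loader:
    outputs = model(**batch)   # forward pass returns the language-modeling loss
    outputs.loss.backward()    # gradients flow only into the LoRA adapters
    optimizer.step()
    optimizer.zero_grad()
```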
After fine-tuning, you can generate captions given an image:
```bash
python src/inference.py \
  --image_path /path/to/image.jpg \
  --model_dir ./models/blip2-finetuned/ \
  --max_length 30
```

The script loads the fine-tuned model and prints a generated caption for the provided image.
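Under the hood, inference amounts to loading the base model, attaching the saved LoRA adapters, and calling `generate`. The base checkpoint name below is an assumption; see `src/inference.py` for the actual logic.

```python
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor
from peft import PeftModel

base_model = "Salesforce/blip2-opt-2.7b"     # assumed base checkpoint
adapter_dir = "./models/blip2-finetuned/"    # the --model_dir passed above

processor = Blip2Processor.from_pretrained(base_model)
model = Blip2ForConditionalGeneration.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter_dir)  # attach the fine-tuned LoRA adapters
model.eval()

image = Image.open("/path/to/image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_length=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```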
Alternatively, you can run everything from the notebook:

- Open `notebooks/fine_tune.ipynb`
- Follow the steps to load the data, define the model, apply LoRA, train, and run inference
- Useful for quick experiments and visualizations
After training, you can evaluate the generated captions with standard metrics such as BLEU, ROUGE, or CIDEr: use the inference script to generate predictions and compare them against the ground-truth captions from Flickr8k.
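As a minimal illustration, corpus-level BLEU can be computed with NLTK once you have one generated caption and the five reference captions per test image (`generated_captions` and `reference_captions` are placeholder variables):

```python
from nltk.translate.bleu_score import corpus_bleu

# reference_captions: list of lists, five ground-truth captions per image
# generated_captions: list of strings, one generated caption per image
references = [[ref.lower().split() for ref in refs] for refs in reference_captions]
hypotheses = [pred.lower().split() for pred in generated_captions]

print("BLEU-4:", corpus_bleu(references, hypotheses))  # default weights give 4-gram BLEU
```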
Why this project:

- Efficient Training: LoRA-based fine-tuning updates only a small set of adapter weights instead of the full model, saving memory and compute.
- Practical Application: Fine-tuned for image captioning — useful in accessibility, image search, and content description.
- Reproducible: Notebook + scripts make it easy to reproduce results or extend to other datasets.
- Modular: You can easily adapt the code for other vision-language tasks (e.g., VQA) or datasets.
Possible future directions:

- Fine-tune on larger datasets (e.g., Flickr30k, MS-COCO)
- Experiment with other PEFT methods (e.g., QLoRA)
- Add support for visual question answering (VQA) or image-text retrieval
- Deploy as a public inference API or a Streamlit / Gradio app (see the sketch below)
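For the last point, a minimal Gradio demo could wrap the inference code like this. The `processor` and `model` objects are the ones built in the inference sketch above, and nothing like this ships in the repo yet; it is only a sketch of what the deployment could look like.

```python
import gradio as gr
from PIL import Image

def caption_image(image: Image.Image) -> str:
    # Reuses the processor / LoRA-adapted model from the inference sketch above.
    inputs = processor(images=image, return_tensors="pt")
    generated_ids = model.generate(**inputs, max_length=30)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()

demo = gr.Interface(
    fn=caption_image,
    inputs=gr.Image(type="pil"),
    outputs="text",
    title="BLIP-2 + LoRA Flickr8k Captioning",
)
demo.launch()
```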
Contributions are very welcome! Feel free to:
- Open issues (e.g., bugs, feature requests)
- Submit pull requests
- Share improved hyperparameters, training tricks, or evaluation scripts
References:

- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (Li et al., 2023, arXiv:2301.12597)
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021, arXiv:2106.09685), used here via the Hugging Face PEFT library
- Flickr8k image captioning dataset (Hodosh et al., 2013)
This project is licensed under the MIT License. Feel free to use, modify, and distribute it.