BLIP-2 Fine-Tuned for Image Captioning 🖼️

This repository contains code and notebooks to fine-tune the BLIP-2 (Bootstrapping Language-Image Pre-training) model for image captioning on the Flickr8k dataset, using PEFT (LoRA) to enable efficient training.


🚀 Overview

  • Base model: BLIP-2 from Salesforce (vision encoder + Q-Former + frozen LLM)
  • Task: Image captioning
  • Dataset: Flickr8k (8,000 images, each with 5 captions)
  • Fine-tuning method: LoRA (Low-Rank Adaptation), applied via the PEFT library
  • Frameworks / Libraries:
    • transformers (Hugging Face)
    • datasets for data loading
    • peft for LoRA
    • bitsandbytes (optional, for memory-efficient training)

🛠️ Installation

  1. Clone the repository:

    git clone https://github.com/Muavia1/BLIP-2-Fine-Tuned.git
    cd BLIP-2-Fine-Tuned
    
  2. Create a Python environment (recommended):

    python -m venv venv
    source venv/bin/activate   # On Windows: venv\Scripts\activate

  3. Install dependencies:

    pip install -r requirements.txt

📊 Fine-Tuning / Training

To fine-tune the model:

python src/train.py \
  --dataset_dir /path/to/flickr8k/ \
  --output_dir ./models/blip2-finetuned/ \
  --per_device_train_batch_size 8 \
  --learning_rate 1e-4 \
  --num_train_epochs 5 \
  --lora_rank 8 \
  --lora_alpha 32 \
  --lora_dropout 0.1

Key parameters:

  • dataset_dir: Directory containing the Flickr8k images and captions
  • output_dir: Directory where the fine-tuned model is saved
  • lora_rank, lora_alpha, lora_dropout: LoRA hyperparameters (see the sketch below)
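
The LoRA flags map directly onto a peft LoraConfig. Below is a minimal sketch of how the adapters can be attached to BLIP-2 — not the repository's exact src/train.py; the base checkpoint name and target_modules are assumptions:

from transformers import Blip2ForConditionalGeneration, Blip2Processor
from peft import LoraConfig, get_peft_model

base_name = "Salesforce/blip2-opt-2.7b"  # assumed base checkpoint
processor = Blip2Processor.from_pretrained(base_name)
model = Blip2ForConditionalGeneration.from_pretrained(base_name)

# LoRA hyperparameters mirroring the CLI flags above
lora_config = LoraConfig(
    r=8,                                  # --lora_rank
    lora_alpha=32,                        # --lora_alpha
    lora_dropout=0.1,                     # --lora_dropout
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable

With rank 8 adapters, only a fraction of a percent of the model's weights receive gradients, which is what makes fine-tuning feasible on a single GPU.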

🔍 Inference

After fine-tuning, you can generate a caption for a given image:

python src/inference.py \
  --image_path /path/to/image.jpg \
  --model_dir ./models/blip2-finetuned/ \
  --max_length 30

The script will load the fine-tuned model and output a generated caption for the provided image.
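
For reference, here is a minimal inference sketch assuming the fine-tuned weights were saved as PEFT adapters under ./models/blip2-finetuned/ and the base checkpoint is Salesforce/blip2-opt-2.7b (both assumptions; this shows the typical PEFT loading pattern, not the repository's exact script):

import torch
from PIL import Image
from peft import PeftModel
from transformers import Blip2ForConditionalGeneration, Blip2Processor

base_name = "Salesforce/blip2-opt-2.7b"  # assumed base checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained(base_name)
base_model = Blip2ForConditionalGeneration.from_pretrained(base_name, torch_dtype=dtype)
# Load the LoRA adapters on top of the frozen base model
model = PeftModel.from_pretrained(base_model, "./models/blip2-finetuned/").to(device)
model.eval()

image = Image.open("/path/to/image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_length=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())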


📓 Notebook Usage

  • Open notebooks/fine_tune.ipynb
  • Follow the steps to load data, define the model, apply LoRA, train, and run inference
  • Useful for quick experiments / visualizations

🧪 Evaluation & Results

  • After training, you can evaluate generated captions using standard metrics like BLEU, ROUGE, or CIDEr.
  • Use the inference script to generate predictions and compare them with ground-truth captions from Flickr8k (a BLEU example is sketched below).
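
As a concrete example, BLEU can be computed with Hugging Face's evaluate library. The caption strings below are placeholders; in practice, predictions come from the inference script and references from the Flickr8k annotations (each image has 5 reference captions):

import evaluate

bleu = evaluate.load("bleu")
predictions = ["a dog runs across the grass"]        # generated captions
references = [[
    "a dog is running on the grass",
    "a brown dog runs through a grassy field",
]]                                                   # one list of references per image
print(bleu.compute(predictions=predictions, references=references))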

✅ Why Use This

  • Efficient Training: LoRA-based fine-tuning means you don’t have to update the full model, saving memory and compute.
  • Practical Application: Fine-tuned for image captioning — useful in accessibility, image search, and content description.
  • Reproducible: Notebook + scripts make it easy to reproduce results or extend to other datasets.
  • Modular: You can easily adapt the code for other vision-language tasks (e.g., VQA) or datasets.

🚀 Future Work / Extensions

  • Fine-tune on larger datasets (e.g., Flickr30k, MS-COCO)
  • Experiment with other PEFT methods (e.g., QLoRA)
  • Add support for visual question answering (VQA) or image-text retrieval
  • Deploy as a public inference API or Streamlit / Gradio app

🤝 Contributing

Contributions are very welcome! Feel free to:

  • Open issues (e.g., bugs, feature requests)
  • Submit pull requests
  • Share improved hyperparameters, training tricks, or evaluation scripts

📚 References

  • BLIP-2: Li et al., "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models" (arXiv:2301.12597)
  • LoRA: Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (arXiv:2106.09685); PEFT library: https://github.com/huggingface/peft
  • Flickr8k: Hodosh et al., "Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics" (JAIR, 2013)

📝 License

This project is licensed under the MIT License. Feel free to use, modify, and distribute it.

