This repository contains code and notebooks to fine-tune the BLIP-2 (Bootstrapping Language-Image Pre-training) model for image captioning on the Flickr8k dataset, using LoRA via the PEFT library for efficient training.
- Base model: BLIP-2 from Salesforce (vision encoder + Q-Former + frozen LLM)
- Task: Image captioning
- Dataset: Flickr8k (8,000 images, each with 5 captions)
- Fine-tuning method: LoRA (Low-Rank Adaptation), applied via PEFT (Parameter-Efficient Fine-Tuning) — see the sketch after this list
- Frameworks / Libraries:
  - `transformers` (Hugging Face)
  - `datasets` for data loading
  - `peft` for LoRA
  - `bitsandbytes` (optional, for memory-efficient training)
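For orientation, here is a minimal sketch of how these pieces fit together: load a BLIP-2 checkpoint with `transformers` and wrap it with LoRA adapters via `peft`. The checkpoint name and the `target_modules` below are assumptions; the values actually used in this repo live in `src/train.py`.

```python
from transformers import Blip2ForConditionalGeneration, Blip2Processor
from peft import LoraConfig, get_peft_model

# Assumed base checkpoint; any BLIP-2 variant from the Hugging Face Hub works the same way.
base_model = "Salesforce/blip2-opt-2.7b"

processor = Blip2Processor.from_pretrained(base_model)
model = Blip2ForConditionalGeneration.from_pretrained(base_model)  # optionally load_in_8bit=True with bitsandbytes

# LoRA config: only the low-rank adapter weights are trained; the base model stays frozen.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # assumed: attention projections of the OPT language model
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```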
To set up the project:

- Clone the repository:

  ```bash
  git clone https://github.com/Muavia1/BLIP-2-Fine-Tuned.git
  cd BLIP-2-Fine-Tuned
  ```

- Create a Python environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate   # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
To fine-tune the model:
```bash
python src/train.py \
  --dataset_dir /path/to/flickr8k/ \
  --output_dir ./models/blip2-finetuned/ \
  --per_device_train_batch_size 8 \
  --learning_rate 1e-4 \
  --num_train_epochs 5 \
  --lora_rank 8 \
  --lora_alpha 32 \
  --lora_dropout 0.1
```

Parameters explanation (a minimal training-step sketch follows the list):

- `dataset_dir`: directory where the Flickr8k images and captions are stored
- `output_dir`: where to save the fine-tuned model
- `lora_rank`, `lora_alpha`, `lora_dropout`: LoRA hyperparameters (adapter rank, scaling factor, and dropout)
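For reference, a single LoRA training step for BLIP-2 captioning boils down to roughly the following. This continues from the loading sketch above; `train_dataset` and its `image`/`caption` fields are hypothetical placeholders, and `src/train.py` is the authoritative implementation.

```python
import torch
from torch.utils.data import DataLoader

def collate_fn(batch):
    # Each item is assumed to look like {"image": <PIL.Image>, "caption": "a dog runs ..."}.
    images = [item["image"] for item in batch]
    captions = [item["caption"] for item in batch]
    inputs = processor(images=images, text=captions, padding=True, return_tensors="pt")
    inputs["labels"] = inputs["input_ids"].clone()  # caption tokens double as the targets
    return inputs                                   # (padding tokens are not masked out here, for simplicity)

loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_fn)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # only LoRA params have requires_grad=True

model.train()
for batch in loader:
    outputs = model(**batch)   # forward pass returns the language-modeling loss
    outputs.loss.backward()    # gradients flow only into the LoRA adapters
    optimizer.step()
    optimizer.zero_grad()
```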
After fine-tuning, you can generate captions given an image:
```bash
python src/inference.py \
  --image_path /path/to/image.jpg \
  --model_dir ./models/blip2-finetuned/ \
  --max_length 30
```

The script loads the fine-tuned model and prints a generated caption for the provided image.
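Under the hood, inference amounts to loading the base model, attaching the saved LoRA adapters, and calling `generate`. The base checkpoint name below is an assumption; see `src/inference.py` for the actual logic.

```python
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor
from peft import PeftModel

base_model = "Salesforce/blip2-opt-2.7b"     # assumed base checkpoint
adapter_dir = "./models/blip2-finetuned/"    # the --model_dir passed above

processor = Blip2Processor.from_pretrained(base_model)
model = Blip2ForConditionalGeneration.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter_dir)  # attach the fine-tuned LoRA adapters
model.eval()

image = Image.open("/path/to/image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_length=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```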
Alternatively, you can run everything from the notebook:

- Open `notebooks/fine_tune.ipynb`
- Follow the steps to load the data, define the model, apply LoRA, train, and run inference
- Useful for quick experiments and visualizations
After training, you can evaluate the generated captions with standard metrics such as BLEU, ROUGE, or CIDEr: use the inference script to generate predictions and compare them against the ground-truth captions from Flickr8k.
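As a minimal illustration, corpus-level BLEU can be computed with NLTK once you have one generated caption and the five reference captions per test image (`generated_captions` and `reference_captions` are placeholder variables):

```python
from nltk.translate.bleu_score import corpus_bleu

# reference_captions: list of lists, five ground-truth captions per image
# generated_captions: list of strings, one generated caption per image
references = [[ref.lower().split() for ref in refs] for refs in reference_captions]
hypotheses = [pred.lower().split() for pred in generated_captions]

print("BLEU-4:", corpus_bleu(references, hypotheses))  # default weights give 4-gram BLEU
```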
Why this project:

- Efficient Training: LoRA-based fine-tuning updates only a small set of adapter weights instead of the full model, saving memory and compute.
- Practical Application: Fine-tuned for image captioning — useful in accessibility, image search, and content description.
- Reproducible: Notebook + scripts make it easy to reproduce results or extend to other datasets.
- Modular: You can easily adapt the code for other vision-language tasks (e.g., VQA) or datasets.
Possible future directions:

- Fine-tune on larger datasets (e.g., Flickr30k, MS-COCO)
- Experiment with other PEFT methods (e.g., QLoRA)
- Add support for visual question answering (VQA) or image-text retrieval
- Deploy as a public inference API or a Streamlit / Gradio app (see the sketch below)
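For the last point, a minimal Gradio demo could wrap the inference code like this. The `processor` and `model` objects are the ones built in the inference sketch above, and nothing like this ships in the repo yet; it is only a sketch of what the deployment could look like.

```python
import gradio as gr
from PIL import Image

def caption_image(image: Image.Image) -> str:
    # Reuses the processor / LoRA-adapted model from the inference sketch above.
    inputs = processor(images=image, return_tensors="pt")
    generated_ids = model.generate(**inputs, max_length=30)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()

demo = gr.Interface(
    fn=caption_image,
    inputs=gr.Image(type="pil"),
    outputs="text",
    title="BLIP-2 + LoRA Flickr8k Captioning",
)
demo.launch()
```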
Contributions are very welcome! Feel free to:
- Open issues (e.g., bugs, feature requests)
- Submit pull requests
- Share improved hyperparameters, training tricks, or evaluation scripts
References:

- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (Li et al., 2023, arXiv:2301.12597)
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021, arXiv:2106.09685), used here via the Hugging Face PEFT library
- Flickr8k image captioning dataset (Hodosh et al., 2013)
This project is licensed under the MIT License. Feel free to use, modify, and distribute it.