Using the BLIP-2 Model for Image Captioning

2024-03-05

Overview

In the previous post we looked at the BLIP model for image captioning. The same group of researchers at Salesforce has since developed a more advanced successor, BLIP-2. In this post we will look at the BLIP-2 model and how to use it for image captioning tasks.



The BLIP-2 paper proposes a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer (Q-Former), which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods.
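To make the two-stage idea more concrete, below is a deliberately simplified PyTorch sketch of the BLIP-2 data flow: a frozen image encoder produces patch features, a small set of learnable query tokens cross-attends to them inside a Q-Former-like module, and only that module (plus a projection into the language model's embedding space, omitted here) would be trained. All names, dimensions, and the single attention layer are illustrative assumptions, not the actual BLIP-2 implementation.

import torch
import torch.nn as nn

class ToyQFormer(nn.Module):
    """Learnable query tokens that cross-attend to frozen image features (simplified to a single attention layer)."""
    def __init__(self, num_queries=32, dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, image_features):
        queries = self.queries.expand(image_features.size(0), -1, -1)
        out, _ = self.cross_attn(queries, image_features, image_features)
        return out  # a small, fixed number of visual tokens for the frozen LLM

# stand-in for a frozen ViT image encoder: maps 196 flattened image patches to 768-d features
frozen_image_encoder = nn.Linear(3 * 16 * 16, 768).requires_grad_(False)
patches = torch.randn(1, 196, 3 * 16 * 16)
image_features = frozen_image_encoder(patches)   # shape (1, 196, 768)

# only the Q-Former is trainable; the image encoder and the LLM stay frozen
qformer = ToyQFormer()
visual_tokens = qformer(image_features)          # shape (1, 32, 768)
print(visual_tokens.shape)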



Image captioning with BLIP-2

Step 1: Install the lavis library

The lavis library provides a simple API for loading pre-trained models and processing images and text. It can be installed with pip:

$ pip install salesforce-lavis
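
To confirm the installation and see which model variants your LAVIS version provides, you can print the model zoo registry exported by lavis.models; the exact entries depend on the installed version, but the BLIP-2 architectures should appear among them.

# list the architectures and model types bundled with the installed LAVIS version;
# look for the blip2_* entries to see which BLIP-2 checkpoints are available
from lavis.models import model_zoo
print(model_zoo)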
    

Step 2: Generate image captions

The code snippet below downloads an example image, loads the pre-trained BLIP-2 model (the FlanT5-XL checkpoint fine-tuned for COCO captioning), and generates captions with both beam search and nucleus sampling.

import torch
import requests
from PIL import Image
from lavis.models import load_model_and_preprocess

img_url = 'https://i.pinimg.com/564x/26/c7/35/26c7355fe46f62d84579857c6f8c4ea5.jpg'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# download the example image and convert it to RGB
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# load the pre-trained BLIP-2 model (FlanT5-XL, fine-tuned for COCO captioning)
# together with its matching image preprocessors
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5",
    model_type="caption_coco_flant5xl",
    is_eval=True,
    device=device
)

# preprocess the image and add a batch dimension
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# generate a caption using beam search
print(model.generate({"image": image}))

# generate captions using nucleus sampling;
# nucleus sampling is non-deterministic, so you may get different captions on each run
print(model.generate({"image": image}, use_nucleus_sampling=True, num_captions=3))
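The generate() call returns a plain Python list of caption strings. If you need more control over decoding, the LAVIS BLIP-2 generate() method also exposes common generation parameters such as num_beams, max_length and min_length (parameter names taken from the LAVIS source; availability may vary slightly between versions):

# a sketch of tuning the decoding parameters; the parameter names follow the
# LAVIS BLIP-2 generate() signature and may differ between library versions
captions = model.generate(
    {"image": image},
    num_beams=5,       # wider beam search: usually better captions, slower decoding
    max_length=30,     # upper bound on the caption length, in tokens
    min_length=8,      # avoid overly short captions
)
print(captions)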
    

CaptionCraft support for BLIP-2

CaptionCraft provides an easy-to-integrate API for image captioning using the BLIP-2 model. You can try it out for free at https://rapidapi.com/fantascatllc/api/image-caption-generator2.