ChatTTS: A Text-to-Speech Model for Dialogue Scenarios

2024-07-31

Overview

ChatTTS is an open-source text-to-speech model; its source code is available at https://github.com/2noise/ChatTTS. Its highlights include:

  • Conversational TTS: ChatTTS is optimized for dialogue-based tasks, enabling natural and expressive speech synthesis. It supports multiple speakers, facilitating interactive conversations.
  • Fine-grained Control: the model can predict and control fine-grained prosodic features, including laughter, pauses, and interjections (see the usage sketch after this list).
  • Better Prosody: ChatTTS surpasses most open-source TTS models in terms of prosody. Pretrained models are provided to support further research and development.
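
As a quick illustration of the conversational and fine-grained-control features, the sketch below follows the usage shown in the project README at the time of writing. Method names (for example load vs. the older load_models) have changed between releases, so verify them against the version you install; the example text and output path are ours.

    # Minimal ChatTTS sketch, based on the project README; verify method
    # names against the installed version.
    import ChatTTS
    import torch
    import torchaudio

    chat = ChatTTS.Chat()
    chat.load()  # download and load the pretrained models

    # Inline tokens such as [laugh] and [uv_break] control laughter and pauses.
    texts = ["Hello, welcome to the show! [laugh] Let's get started. [uv_break]"]
    wavs = chat.infer(texts)

    # ChatTTS produces 24 kHz audio; torchaudio.save expects (channels, samples).
    audio = torch.from_numpy(wavs[0])
    if audio.dim() == 1:
        audio = audio.unsqueeze(0)
    torchaudio.save("output.wav", audio, 24000)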

BLIP was proposed in "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation" by Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. The diagram below demonstrates how BLIP works at a high level.

[Figure: high-level overview of the BLIP framework]

Image captioning with BLIP

Next we will demonstrate how to use the BLIP model for image captioning from scratch.


Step 1: Clone the BLIP repository

    $ git clone https://github.com/salesforce/BLIP

Step 2: Create a virtual environment and install the required packages

    $ cd BLIP/
    $ python3 -m venv .venv
    $ source .venv/bin/activate
    $ pip install -r requirements.txt

Step 3: Generate image captions

The following Python code shows how to generate image captions with the BLIP model: it loads a demo image from the internet and generates two captions, one with beam search and one with nucleus sampling. Beam search and nucleus sampling are two popular decoding strategies for sequence generation. Simply put, beam search is deterministic, so the generated caption stays consistent across runs, while nucleus sampling is stochastic, so the caption can vary each run; the BLIP authors found the more diverse captions from nucleus sampling beneficial in their bootstrapping setup. A short toy sketch of nucleus sampling comes first, followed by the full captioning script.
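
To make the stochastic behavior concrete, here is a small, self-contained sketch of top-p (nucleus) filtering over a toy distribution. It is purely illustrative and independent of the BLIP code; the function name nucleus_sample is ours.

    import torch

    def nucleus_sample(probs, top_p=0.9):
        # Keep the smallest set of tokens whose cumulative probability
        # reaches top_p, renormalize, and sample from that set.
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=0)
        keep = (cumulative - sorted_probs) < top_p  # mass accumulated before each token
        truncated = sorted_probs * keep
        truncated = truncated / truncated.sum()
        choice = torch.multinomial(truncated, num_samples=1)
        return sorted_idx[choice].item()

    # Toy next-token distribution over five "tokens". Greedy or beam decoding
    # would always pick token 0; nucleus sampling varies across calls.
    probs = torch.tensor([0.5, 0.2, 0.15, 0.1, 0.05])
    print([nucleus_sample(probs) for _ in range(5)])

With that intuition in place, the full captioning script follows.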

    from PIL import Image
    import requests
    import torch
    from torchvision import transforms
    from torchvision.transforms.functional import InterpolationMode

    # blip_decoder lives in the cloned repository, so run this script from
    # the BLIP/ directory.
    from models.blip import blip_decoder

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    img_url = 'https://i.pinimg.com/564x/26/c7/35/26c7355fe46f62d84579857c6f8c4ea5.jpg'
    model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_caption_capfilt_large.pth'

    def load_demo_image(image_size, device):
        # Download the demo image and apply BLIP's preprocessing (resize,
        # tensor conversion, and normalization with CLIP statistics).
        raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
        transform = transforms.Compose([
            transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
            transforms.ToTensor(),
            transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
        ])
        image = transform(raw_image).unsqueeze(0).to(device)  # add a batch dimension
        return image

    image_size = 384
    image = load_demo_image(image_size=image_size, device=device)

    # Download the pretrained captioning checkpoint and switch to inference mode.
    model = blip_decoder(pretrained=model_url, image_size=image_size, vit='base')
    model.eval()
    model = model.to(device)

    with torch.no_grad():
        # Beam search: deterministic decoding.
        caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)
        print(f'caption (beam search): {caption[0]}')

        # Nucleus sampling: stochastic decoding with top_p=0.9.
        caption = model.generate(image, sample=True, top_p=0.9, max_length=20, min_length=5)
        print(f'caption (nucleus sampling): {caption[0]}')
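
To caption a local file instead of the demo URL, apply the same preprocessing. The sketch below reuses model, image_size, and device from the script above; the path photo.jpg is a placeholder for your own image.

    # Caption a local image (the path is a placeholder).
    raw = Image.open('photo.jpg').convert('RGB')
    transform = transforms.Compose([
        transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
    ])
    local_image = transform(raw).unsqueeze(0).to(device)

    with torch.no_grad():
        caption = model.generate(local_image, sample=False, num_beams=3, max_length=20, min_length=5)
        print(f'caption (local image): {caption[0]}')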

CaptionCraft support for BLIP

CaptionCraft provides an easy-to-integrate API for image captioning using the BLIP model. You can try it out for free at https://rapidapi.com/fantascatllc/api/image-caption-generator2.

    import requests

    url = "https://image-caption-generator2.p.rapidapi.com/v2/captions/simple"
    params = {"imageUrl": "https://i.pinimg.com/564x/26/c7/35/26c7355fe46f62d84579857c6f8c4ea5.jpg"}
    headers = {
        "X-RapidAPI-Key": "<Your-RapidAPI-Key>",  # substitute your own RapidAPI key
        "X-RapidAPI-Host": "image-caption-generator2.p.rapidapi.com"
    }

    response = requests.get(url, headers=headers, params=params)
    print(response.json())
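
For anything beyond a quick test, add basic error handling around the request. The sketch below is one way to do it; the response schema is not documented here, so inspect the JSON before relying on specific fields.

    # Defensive variant: fail fast on HTTP errors instead of parsing an error body.
    response = requests.get(url, headers=headers, params=params, timeout=30)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx status codes
    data = response.json()
    print(data)  # inspect the structure before extracting fields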