ChatTTS: a text-to-speech model for dialogue scenarios
2024-07-31

Overview
ChatTTS is an open-source text-to-speech model (source code is available at https://github.com/2noise/ChatTTS). Its highlights include:
- Conversational TTS: ChatTTS is optimized for dialogue-based tasks, enabling natural and expressive speech synthesis. It supports multiple speakers, facilitating interactive conversations.
- Fine-grained Control: the model can predict and control fine-grained prosodic features, including laughter, pauses, and interjections (see the usage sketch after this list).
- Better Prosody: ChatTTS surpasses most open-source TTS models in terms of prosody. Pretrained models are provided to support further research and development.
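As a rough illustration, the sketch below follows the basic usage pattern shown in the ChatTTS README. Treat the details as assumptions to check against the repository: the loader method (chat.load() vs. the older chat.load_models()), the exact control tokens, and the waveform shape returned by infer() have changed between releases.

import torch
import torchaudio
import ChatTTS

chat = ChatTTS.Chat()
chat.load()  # assumption: newer releases use chat.load(); older ones use chat.load_models()

# Control tokens such as [laugh] and [uv_break] are described in the ChatTTS docs for
# laughter and pauses; whether they can be placed directly in the input text depends
# on the version and the text-refinement settings.
texts = ["Hello, welcome to the show. [uv_break] Glad you could make it. [laugh]"]

wavs = chat.infer(texts)  # returns a list of numpy waveforms, one per input text

wav = torch.from_numpy(wavs[0])
if wav.dim() == 1:            # depending on the version, the waveform may be 1-D
    wav = wav.unsqueeze(0)    # torchaudio.save expects (channels, samples)
torchaudio.save("output.wav", wav, 24000)  # ChatTTS generates 24 kHz audio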
BLIP was proposed in BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation by Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. The diagram below demonstrates how BLIP works at a high level.
Image captioning with BLIP
Next we will demonstrate how to use the BLIP model for image captioning from scratch.
Step 1: Clone the BLIP repository
$ git clone https://github.com/salesforce/BLIP
Step 2: Create a virtual environment and install the required packages
$ cd BLIP/
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
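Optionally, you can sanity-check the environment before running the demo (the exact versions will depend on what requirements.txt pins):

$ python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"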
Step 3: Generate image captions
The following Python code shows how to generate image captions with the BLIP model. It loads a demo image from the internet and generates two captions, one with beam search and one with nucleus sampling. Beam search and nucleus sampling are two popular decoding strategies for sequence generation. Simply put, beam search is deterministic, so the generated caption stays consistent across runs, while nucleus sampling is stochastic: it can yield more diverse and often better captions, but the output may vary from run to run.
from PIL import Image
import requests
import torch
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode

from models.blip import blip_decoder

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

img_url = 'https://i.pinimg.com/564x/26/c7/35/26c7355fe46f62d84579857c6f8c4ea5.jpg'
model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_caption_capfilt_large.pth'

def load_demo_image(image_size, device):
    # Download the demo image and convert it to RGB
    raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
    # Resize to the model's input size, convert to a tensor, and normalize
    transform = transforms.Compose([
        transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
    ])
    # Add a batch dimension and move the tensor to the target device
    image = transform(raw_image).unsqueeze(0).to(device)
    return image

image_size = 384
image = load_demo_image(image_size=image_size, device=device)

# Load the pretrained BLIP captioning model (ViT-B backbone) and switch to eval mode
model = blip_decoder(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device)

with torch.no_grad():
    # Beam search: deterministic decoding
    caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)
    print(f'caption (beam search): {caption[0]}')
    # Nucleus sampling: stochastic decoding
    caption = model.generate(image, sample=True, top_p=0.9, max_length=20, min_length=5)
    print(f'caption (nucleus sampling): {caption[0]}')
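To make the difference between the two decoding strategies concrete, here is a toy, self-contained sketch of nucleus (top-p) sampling over a made-up next-token distribution. The probabilities are illustrative only and are not produced by BLIP.

import numpy as np

rng = np.random.default_rng(0)

def nucleus_sample(probs, top_p=0.9):
    # Keep the smallest set of tokens whose cumulative probability reaches top_p,
    # renormalize within that set, then sample one token from it.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    nucleus = order[:cutoff]
    return rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum())

# Hypothetical next-token distribution over a 5-token vocabulary (not from BLIP)
probs = np.array([0.50, 0.25, 0.15, 0.07, 0.03])

# Greedy or beam decoding would deterministically favor token 0 here, whereas
# nucleus sampling with top_p=0.9 draws from tokens {0, 1, 2}, so repeated runs can differ.
print([int(nucleus_sample(probs)) for _ in range(5)])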
CaptionCraft support for BLIP
CaptionCraft provides an easy-to-integrate API for image captioning using the BLIP model. You can try it out for free at https://rapidapi.com/fantascatllc/api/image-caption-generator2. The snippet below sends an image URL to the endpoint and prints the returned JSON.
import requests

url = "https://image-caption-generator2.p.rapidapi.com/v2/captions/simple"

# The image to caption, passed as a query parameter
params = {"imageUrl": "https://i.pinimg.com/564x/26/c7/35/26c7355fe46f62d84579857c6f8c4ea5.jpg"}

headers = {
    "X-RapidAPI-Key": "<Your-RapidAPI-Key>",  # replace with your own RapidAPI key
    "X-RapidAPI-Host": "image-caption-generator2.p.rapidapi.com"
}

response = requests.get(url, headers=headers, params=params)
print(response.json())
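As an optional refinement, you may want a timeout and a status check around the request so that failures surface clearly. This only uses standard requests functionality; the endpoint, headers, and parameters are unchanged from above.

response = requests.get(url, headers=headers, params=params, timeout=30)
response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
print(response.json())       # the JSON body returned by the API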