StyleTTS2 Tutorial - The Ultimate Guide to Getting Started

A complete guide to understanding StyleTTS 2, which achieves human-level text-to-speech through style diffusion and adversarial training with large speech language models.

What is StyleTTS 2?

StyleTTS 2 is a state-of-the-art text-to-speech (TTS) model that represents a significant leap in the field of speech synthesis. It is designed to produce human-like speech by incorporating advanced techniques such as style diffusion and adversarial training with large speech language models (SLMs). One of the key innovations in StyleTTS 2 is its ability to model speech styles as latent random variables through diffusion models. This approach enables the system to generate a style that is most suitable for the given text, without the need for reference speech. The use of diffusion models, known for their effectiveness in generating diverse and high-quality outputs in various domains, allows StyleTTS 2 to achieve a wide range of speech styles and tones, enhancing the naturalness and expressiveness of the synthesized speech.

Another significant aspect of StyleTTS 2 is its use of adversarial training with large pre-trained SLMs, such as WavLM. These large models are used as discriminators in conjunction with a novel differentiable duration modeling technique. This end-to-end training approach is crucial for improving the naturalness of the speech output. The discriminators, trained on vast amounts of speech data, are adept at distinguishing between natural and synthetic speech, guiding the TTS model to produce outputs that are increasingly indistinguishable from human speech. This method not only enhances the quality of the speech synthesis but also contributes to the model's ability to adapt to different voices and speaking styles, making it highly versatile.

The performance of StyleTTS 2 is particularly noteworthy when evaluated on various speech datasets. It has shown remarkable results, surpassing human recordings on the single-speaker LJSpeech dataset and matching human performance on the multispeaker VCTK dataset. Furthermore, when trained on the LibriTTS dataset, StyleTTS 2 outperforms existing publicly available models in zero-shot speaker adaptation, demonstrating its superior ability to generate natural-sounding speech in a variety of voices without prior training on those specific voices. This level of performance indicates that StyleTTS 2 has achieved a milestone in TTS technology, reaching human-level synthesis on both single and multispeaker datasets. The integration of style diffusion, adversarial training, and the use of large SLMs makes StyleTTS 2 a groundbreaking development in the quest for realistic and adaptable speech synthesis.

Key Components of StyleTTS 2

StyleTTS 2 combines several innovative components that work together to achieve high-quality, human-level text-to-speech synthesis:

  1. Style Diffusion Models: These are at the core of StyleTTS 2. Style diffusion models speech styles as latent random variables using diffusion processes, which allows the system to generate a suitable style for the given text without the need for reference speech. The diffusion models are key to creating a wide range of diverse and natural-sounding speech styles.
  2. Adversarial Training Framework: StyleTTS 2 employs adversarial training, a technique commonly used in generative models. In this context, large pre-trained speech language models (SLMs) such as WavLM are used as discriminators. These discriminators are trained to differentiate between natural human speech and synthetic speech generated by the TTS model, guiding the TTS model to produce outputs that closely resemble human speech.
  3. Differentiable Duration Modeling: This is a novel component in StyleTTS 2 that allows for end-to-end training of the TTS model. Duration modeling is crucial in speech synthesis for determining how long each phoneme should be held. Making this process differentiable allows for smoother and more natural transitions between sounds, contributing significantly to the naturalness of the synthesized speech.
  4. Large Pre-Trained Speech Language Models (SLMs): StyleTTS 2 integrates large SLMs like WavLM into its training. These models, trained on extensive speech datasets, bring a deep understanding of language and speech nuances, which is critical for generating high-quality, natural-sounding speech.
  5. End-to-End Training Mechanism: The entire system is designed for end-to-end training, meaning that all components from text input to speech output are trained together in a unified framework. This approach ensures that the various elements of the model work in harmony, leading to more efficient learning and better overall performance.
  6. Zero-Shot Speaker Adaptation: This refers to the model's ability to adapt to new voices without prior training on those specific voices, which is particularly important for a TTS system that must handle a wide range of voices and speaking styles.

These components collectively enable StyleTTS 2 to achieve its goal of human-level speech synthesis, making it a versatile and powerful tool in the field of text-to-speech technology.

How to get started with StyleTTS 2

To begin using StyleTTS 2, you'll need Python installed on your system along with the packages listed below. You'll also need a code editor or an Integrated Development Environment (IDE) to write and run your code. For this guide, we'll use Visual Studio Code (VS Code), a popular, user-friendly editor with good Python support. This setup gives you a solid environment for exploring the capabilities of StyleTTS 2.

Requirements

To begin the installation, your system needs Python 3.7 or a later version. This ensures that the packages installed below work as expected.
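
If you want to confirm this before installing anything, a quick check from Python works. This snippet is just a convenience and is not part of the StyleTTS 2 repository:

import sys

# Confirm the interpreter meets the minimum version required by the dependencies.
assert sys.version_info >= (3, 7), f"Python 3.7+ required, found {sys.version.split()[0]}"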

git clone https://github.com/yl4579/StyleTTS2.git
cd StyleTTS2
pip install SoundFile torchaudio munch torch pydub pyyaml librosa nltk matplotlib accelerate transformers phonemizer einops einops-exts tqdm typing-extensions git+https://github.com/resemble-ai/monotonic_align.git
sudo apt-get install espeak-ng
pip install gdown
gdown --id 1K3jt1JEbtohBLUA0X75KLw36TW7U1yxq
unzip Models.zip

The initial step in our setup process involved cloning the StyleTTS 2 GitHub repository. To do this, we used the git clone command, which created a local copy of the StyleTTS 2 source code on our machine. Once the cloning was complete, we navigated into the newly created StyleTTS2 directory, so that all subsequent commands run within the project environment.

Following the repository cloning, our next task was to install various Python packages using pip, Python's package installer. These packages are essential for running StyleTTS, as they include dependencies for audio processing, machine learning, and other necessary functionalities. The installation of these packages was a straightforward but vital step, ensuring that all the tools and libraries required for the tutorial were readily available in our environment.

After setting up the necessary Python packages, we proceeded to install espeak-ng. This software is a key component in text-to-speech systems, responsible for converting text into phonemes and enabling the synthesis of speech in various languages and accents. The installation of espeak-ng is a critical step in preparing our system for effective speech synthesis, as it lays the foundation for generating natural and accurate speech outputs.

Next, we installed gdown, a tool that facilitates the downloading of files from Google Drive via command line. This utility was particularly important for our setup, as we needed to download specific pre-trained models for StyleTTS, which were hosted on Google Drive. Using gdown with the specified file ID (gdown --id 1K3jt1JEbtohBLUA0X75KLw36TW7U1yxq), we were able to efficiently retrieve the necessary model files.

The final step in our setup process was to unzip the downloaded model files. This action was performed using the unzip command, which extracted the contents of the Models.zip file. Unzipping these models was essential, as it made them accessible for use in StyleTTS, allowing us to leverage the pre-trained models for high-quality text-to-speech synthesis.

import torch
torch.manual_seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

Here, we import PyTorch, a leading deep learning library, and set a manual seed to ensure reproducibility of results. The cudnn.benchmark and cudnn.deterministic flags are set so that GPU operations behave deterministically across runs, which matters when you want identical outputs from identical inputs.

import random
random.seed(0)

import numpy as np
np.random.seed(0)

Similar to PyTorch, we set seeds for the random and numpy libraries. This step is crucial for reproducibility, ensuring that random number generation is consistent.

import nltk
nltk.download('punkt')

The Natural Language Toolkit (nltk) is imported and we download the 'punkt' tokenizer models. This is used for dividing text into a list of sentences or words, which is essential in text processing for TTS.

# load packages
import time
import random
import yaml
from munch import Munch
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
import torchaudio
import librosa
from nltk.tokenize import word_tokenize

Here, we load various packages necessary for our TTS system. This includes yaml for configuration files, Munch for dictionary-like objects, audio processing libraries like torchaudio and librosa, and nltk for text processing.

from models import *
from utils import *
from text_utils import TextCleaner
textclenaer = TextCleaner()


We import our model architectures, utility functions, and text utilities. The TextCleaner is initialized (note that the variable name textclenaer follows the repository's own spelling) and will be used for cleaning and preparing text for synthesis.

%matplotlib inline

This line is specific to Jupyter Notebooks, enabling the inline display of plots.

device = 'cuda' if torch.cuda.is_available() else 'cpu'

We set the device for computation. If a CUDA-enabled GPU is available, it's used; otherwise, the CPU is used.

to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
mean, std = -4, 4

Here, we define a transformation that converts audio waveforms into mel spectrograms, a representation commonly used in speech and audio processing. The parameters (80 mel bands, an FFT size of 2048, a 1200-sample window, and a 300-sample hop at 24 kHz) match the settings used by the pre-trained models, and mean and std are used later to normalize the log-mel values.
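
As a quick illustration of what this transform produces (not part of the original notebook), one second of audio at 24 kHz yields an 80-band mel spectrogram with roughly 24000 / 300 = 80 frames:

dummy = torch.zeros(24000)   # one second of silence at 24 kHz
mel = to_mel(dummy)
print(mel.shape)             # torch.Size([80, 81]) with torchaudio's default centering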

def length_to_mask(lengths):
    mask = torch.arange(lengths.max()).unsqueeze(0).expand(lengths.shape[0], -1).type_as(lengths)
    mask = torch.gt(mask+1, lengths.unsqueeze(1))
    return mask

This function creates a mask based on lengths, useful in processing sequences of different lengths, a common scenario in TTS.
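
For example (an illustrative snippet, not from the repository), two sequences of lengths 3 and 5 produce a mask in which True marks the padded positions of the shorter sequence:

lengths = torch.LongTensor([3, 5])
print(length_to_mask(lengths))
# tensor([[False, False, False,  True,  True],
#         [False, False, False, False, False]])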

def preprocess(wave):
    wave_tensor = torch.from_numpy(wave).float()
    mel_tensor = to_mel(wave_tensor)
    mel_tensor = (torch.log(1e-5 + mel_tensor.unsqueeze(0)) - mean) / std
    return mel_tensor
    

The preprocess function converts an audio waveform into a Mel Spectrogram, normalizes it, and prepares it for the model.

def compute_style(ref_dicts):
    reference_embeddings = {}
    for key, path in ref_dicts.items():
        wave, sr = librosa.load(path, sr=24000)
        audio, index = librosa.effects.trim(wave, top_db=30)
        if sr != 24000:
            audio = librosa.resample(audio, sr, 24000)
        mel_tensor = preprocess(audio).to(device)

        with torch.no_grad():
            ref = model.style_encoder(mel_tensor.unsqueeze(1))
        reference_embeddings[key] = (ref.squeeze(1), audio)

    return reference_embeddings
    

Finally, compute_style is a function that computes style embeddings from reference audio files. This is crucial in StyleTTS2, where the style of speech is an important aspect of synthesis.
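
A minimal usage sketch is shown below. It assumes the StyleTTS 2 model built later in this guide has already been loaded, and that reference.wav is a placeholder for a short 24 kHz recording you supply yourself:

ref_dicts = {'speaker_1': 'reference.wav'}   # placeholder path to your own audio
ref_embeddings = compute_style(ref_dicts)
style, trimmed_audio = ref_embeddings['speaker_1']
print(style.shape)                           # style embedding computed from the reference clip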

In this segment, we delve deeper into the StyleTTS2 system, exploring a Python code snippet that demonstrates the advanced capabilities of this text-to-speech synthesis framework.

# load phonemizer
import phonemizer
global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True, with_stress=True, words_mismatch='ignore')

We begin by importing and setting up the phonemizer, which is crucial for converting text into phonetic representations. The EspeakBackend is configured for English (US), with specific settings to preserve punctuation and stress in speech, enhancing the naturalness of the synthesized voice.
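
To see what the phonemizer produces, you can run it on a short sentence. This is purely an illustrative check; the exact output depends on your espeak-ng installation:

ps = global_phonemizer.phonemize(["StyleTTS 2 sounds surprisingly natural."])
print(ps[0])   # an IPA-style phoneme string with stress marks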


config = yaml.safe_load(open("Models/LJSpeech/config.yml"))

Here, we load the configuration settings from a YAML file. These settings are essential for defining various parameters and paths used in the model.
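
If you are curious what the configuration contains, you can print the entries this guide relies on; these keys are the ones used in the steps that follow:

for key in ('ASR_config', 'ASR_path', 'F0_path', 'PLBERT_dir'):
    print(key, '->', config.get(key))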

# load pretrained ASR model
ASR_config = config.get('ASR_config', False)
ASR_path = config.get('ASR_path', False)
text_aligner = load_ASR_models(ASR_path, ASR_config)

The code loads a pre-trained Automatic Speech Recognition (ASR) model, which StyleTTS 2 uses as its text aligner for mapping phonemes to audio frames, a fundamental part of the TTS pipeline.

# load pretrained F0 model
F0_path = config.get('F0_path', False)
pitch_extractor = load_F0_models(F0_path)

Next, we load a model dedicated to extracting the fundamental frequency (F0) from speech. This aspect is crucial for capturing the pitch characteristics of the voice, contributing to the naturalness of the output.

# load BERT model
from Utils.PLBERT.util import load_plbert
BERT_path = config.get('PLBERT_dir', False)
plbert = load_plbert(BERT_path)

Here, the phoneme-level BERT (PL-BERT) model is loaded. It provides contextual representations of the phoneme sequence, which helps the model capture linguistic nuances and produce natural-sounding prosody.

model = build_model(recursive_munch(config['model_params']), text_aligner, pitch_extractor, plbert)
_ = [model[key].eval() for key in model]
_ = [model[key].to(device) for key in model]

The main StyleTTS2 model is constructed using various components like the text aligner, pitch extractor, and BERT model. The model is then set to evaluation mode and moved to the appropriate computational device (GPU or CPU).
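
As an optional sanity check (not in the original notebook), you can list the sub-modules returned by build_model and their sizes; model behaves like a dictionary of PyTorch modules:

for key in model:
    n_params = sum(p.numel() for p in model[key].parameters())
    print(f"{key}: {n_params / 1e6:.2f}M parameters")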

params_whole = torch.load("Models/LJSpeech/epoch_2nd_00100.pth", map_location='cpu')
params = params_whole['net']

This section loads the pre-trained model parameters. These parameters are essential for the model to function correctly and produce high-quality speech.

for key in model:
    if key in params:
        print('%s loaded' % key)
        try:
            model[key].load_state_dict(params[key])
        except:
            from collections import OrderedDict
            state_dict = params[key]
            new_state_dict = OrderedDict()
            for k, v in state_dict.items():
                name = k[7:] # remove `module.`
                new_state_dict[name] = v
            # load params
            model[key].load_state_dict(new_state_dict, strict=False)
_ = [model[key].eval() for key in model]

The loop iterates through the model components, loading the respective pre-trained parameters. This step is crucial for ensuring that each part of the model is correctly initialized with learned weights.

from Modules.diffusion.sampler import DiffusionSampler, ADPM2Sampler, KarrasSchedule

sampler = DiffusionSampler(
    model.diffusion.diffusion,
    sampler=ADPM2Sampler(),
    sigma_schedule=KarrasSchedule(sigma_min=0.0001, sigma_max=3.0, rho=9.0), # empirical parameters
    clamp=False
)

The code sets up a diffusion sampler, which is responsible for generating style vectors during synthesis. Here the ADPM2 sampling algorithm is combined with a Karras sigma schedule (the parameters shown are empirical) to control how styles are sampled from noise.

def inference(text, noise, diffusion_steps=5, embedding_scale=1):
    text = text.strip()
    text = text.replace('"', '')
    ps = global_phonemizer.phonemize([text])
    ps = word_tokenize(ps[0])
    ps = ' '.join(ps)

    tokens = textclenaer(ps)
    tokens.insert(0, 0)
    tokens = torch.LongTensor(tokens).to(device).unsqueeze(0)

    with torch.no_grad():
        input_lengths = torch.LongTensor([tokens.shape[-1]]).to(tokens.device)
        text_mask = length_to_mask(input_lengths).to(tokens.device)

        t_en = model.text_encoder(tokens, input_lengths, text_mask)
        bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
        d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

        s_pred = sampler(noise,
              embedding=bert_dur[0].unsqueeze(0), num_steps=diffusion_steps,
              embedding_scale=embedding_scale).squeeze(0)

        s = s_pred[:, 128:]
        ref = s_pred[:, :128]

        d = model.predictor.text_encoder(d_en, s, input_lengths, text_mask)

        x, _ = model.predictor.lstm(d)
        duration = model.predictor.duration_proj(x)
        duration = torch.sigmoid(duration).sum(axis=-1)
        pred_dur = torch.round(duration.squeeze()).clamp(min=1)

        pred_dur[-1] += 5

        pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
        c_frame = 0
        for i in range(pred_aln_trg.size(0)):
            pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
            c_frame += int(pred_dur[i].data)

        # encode prosody
        en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
        F0_pred, N_pred = model.predictor.F0Ntrain(en, s)
        out = model.decoder((t_en @ pred_aln_trg.unsqueeze(0).to(device)),
                                F0_pred, N_pred, ref.squeeze().unsqueeze(0))

    return out.squeeze().cpu().numpy()
    

The inference function is defined for generating speech from text. It takes text input, processes it through various model components, and synthesizes speech. This function showcases the end-to-end capability of StyleTTS2.

def LFinference(text, s_prev, noise, alpha=0.7, diffusion_steps=5, embedding_scale=1):
  text = text.strip()
  text = text.replace('"', '')
  ps = global_phonemizer.phonemize([text])
  ps = word_tokenize(ps[0])
  ps = ' '.join(ps)

  tokens = textclenaer(ps)
  tokens.insert(0, 0)
  tokens = torch.LongTensor(tokens).to(device).unsqueeze(0)

  with torch.no_grad():
      input_lengths = torch.LongTensor([tokens.shape[-1]]).to(tokens.device)
      text_mask = length_to_mask(input_lengths).to(tokens.device)

      t_en = model.text_encoder(tokens, input_lengths, text_mask)
      bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
      d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

      s_pred = sampler(noise,
            embedding=bert_dur[0].unsqueeze(0), num_steps=diffusion_steps,
            embedding_scale=embedding_scale).squeeze(0)

      if s_prev is not None:
          # convex combination of previous and current style
          s_pred = alpha * s_prev + (1 - alpha) * s_pred

      s = s_pred[:, 128:]
      ref = s_pred[:, :128]

      d = model.predictor.text_encoder(d_en, s, input_lengths, text_mask)

      x, _ = model.predictor.lstm(d)
      duration = model.predictor.duration_proj(x)
      duration = torch.sigmoid(duration).sum(axis=-1)
      pred_dur = torch.round(duration.squeeze()).clamp(min=1)

      pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
      c_frame = 0
      for i in range(pred_aln_trg.size(0)):
          pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
          c_frame += int(pred_dur[i].data)

      # encode prosody
      en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
      F0_pred, N_pred = model.predictor.F0Ntrain(en, s)
      out = model.decoder((t_en @ pred_aln_trg.unsqueeze(0).to(device)),
                              F0_pred, N_pred, ref.squeeze().unsqueeze(0))

  return out.squeeze().cpu().numpy(), s_pred
  

Lastly, the LFinference function is similar to inference, but it can blend the newly sampled style with the style from the previous segment (weighted by alpha). This keeps the voice consistent when synthesizing long-form text sentence by sentence, as shown in the sketch below.
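
Below is a minimal long-form sketch, assuming the setup above is in place: the text is split into sentences with nltk, each sentence is synthesized with LFinference, and the returned style is carried forward so consecutive sentences sound consistent. The passage and parameter values are only illustrative:

import numpy as np
import nltk

passage = ("StyleTTS 2 models speech styles as latent random variables. "
           "Carrying the style across sentences keeps long narrations coherent.")

s_prev = None
wavs = []
for sentence in nltk.sent_tokenize(passage):
    noise = torch.randn(1, 1, 256).to(device)
    wav, s_prev = LFinference(sentence, s_prev, noise,
                              alpha=0.7, diffusion_steps=5, embedding_scale=1)
    wavs.append(wav)

full_audio = np.concatenate(wavs)   # a single 24 kHz waveform for the whole passage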

All code

%cd StyleTTS2

import torch
torch.manual_seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

import random
random.seed(0)

import numpy as np
np.random.seed(0)

import nltk
nltk.download('punkt')

# load packages
import time
import random
import yaml
from munch import Munch
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
import torchaudio
import librosa
from nltk.tokenize import word_tokenize

from models import *
from utils import *
from text_utils import TextCleaner
textclenaer = TextCleaner()

%matplotlib inline

device = 'cuda' if torch.cuda.is_available() else 'cpu'

to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
mean, std = -4, 4

def length_to_mask(lengths):
    mask = torch.arange(lengths.max()).unsqueeze(0).expand(lengths.shape[0], -1).type_as(lengths)
    mask = torch.gt(mask+1, lengths.unsqueeze(1))
    return mask

def preprocess(wave):
    wave_tensor = torch.from_numpy(wave).float()
    mel_tensor = to_mel(wave_tensor)
    mel_tensor = (torch.log(1e-5 + mel_tensor.unsqueeze(0)) - mean) / std
    return mel_tensor

def compute_style(ref_dicts):
    reference_embeddings = {}
    for key, path in ref_dicts.items():
        wave, sr = librosa.load(path, sr=24000)
        audio, index = librosa.effects.trim(wave, top_db=30)
        if sr != 24000:
            audio = librosa.resample(audio, sr, 24000)
        mel_tensor = preprocess(audio).to(device)

        with torch.no_grad():
            ref = model.style_encoder(mel_tensor.unsqueeze(1))
        reference_embeddings[key] = (ref.squeeze(1), audio)

    return reference_embeddings

# load phonemizer
import phonemizer
global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True, with_stress=True, words_mismatch='ignore')

config = yaml.safe_load(open("Models/LJSpeech/config.yml"))

# load pretrained ASR model
ASR_config = config.get('ASR_config', False)
ASR_path = config.get('ASR_path', False)
text_aligner = load_ASR_models(ASR_path, ASR_config)

# load pretrained F0 model
F0_path = config.get('F0_path', False)
pitch_extractor = load_F0_models(F0_path)

# load BERT model
from Utils.PLBERT.util import load_plbert
BERT_path = config.get('PLBERT_dir', False)
plbert = load_plbert(BERT_path)

model = build_model(recursive_munch(config['model_params']), text_aligner, pitch_extractor, plbert)
_ = [model[key].eval() for key in model]
_ = [model[key].to(device) for key in model]

params_whole = torch.load("Models/LJSpeech/epoch_2nd_00100.pth", map_location='cpu')
params = params_whole['net']

for key in model:
    if key in params:
        print('%s loaded' % key)
        try:
            model[key].load_state_dict(params[key])
        except:
            from collections import OrderedDict
            state_dict = params[key]
            new_state_dict = OrderedDict()
            for k, v in state_dict.items():
                name = k[7:] # remove `module.`
                new_state_dict[name] = v
            # load params
            model[key].load_state_dict(new_state_dict, strict=False)
_ = [model[key].eval() for key in model]

from Modules.diffusion.sampler import DiffusionSampler, ADPM2Sampler, KarrasSchedule

sampler = DiffusionSampler(
    model.diffusion.diffusion,
    sampler=ADPM2Sampler(),
    sigma_schedule=KarrasSchedule(sigma_min=0.0001, sigma_max=3.0, rho=9.0), # empirical parameters
    clamp=False
)

def inference(text, noise, diffusion_steps=5, embedding_scale=1):
    text = text.strip()
    text = text.replace('"', '')
    ps = global_phonemizer.phonemize([text])
    ps = word_tokenize(ps[0])
    ps = ' '.join(ps)

    tokens = textclenaer(ps)
    tokens.insert(0, 0)
    tokens = torch.LongTensor(tokens).to(device).unsqueeze(0)

    with torch.no_grad():
        input_lengths = torch.LongTensor([tokens.shape[-1]]).to(tokens.device)
        text_mask = length_to_mask(input_lengths).to(tokens.device)

        t_en = model.text_encoder(tokens, input_lengths, text_mask)
        bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
        d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

        s_pred = sampler(noise,
              embedding=bert_dur[0].unsqueeze(0), num_steps=diffusion_steps,
              embedding_scale=embedding_scale).squeeze(0)

        s = s_pred[:, 128:]
        ref = s_pred[:, :128]

        d = model.predictor.text_encoder(d_en, s, input_lengths, text_mask)

        x, _ = model.predictor.lstm(d)
        duration = model.predictor.duration_proj(x)
        duration = torch.sigmoid(duration).sum(axis=-1)
        pred_dur = torch.round(duration.squeeze()).clamp(min=1)

        pred_dur[-1] += 5

        pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
        c_frame = 0
        for i in range(pred_aln_trg.size(0)):
            pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
            c_frame += int(pred_dur[i].data)

        # encode prosody
        en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
        F0_pred, N_pred = model.predictor.F0Ntrain(en, s)
        out = model.decoder((t_en @ pred_aln_trg.unsqueeze(0).to(device)),
                                F0_pred, N_pred, ref.squeeze().unsqueeze(0))

    return out.squeeze().cpu().numpy()

def LFinference(text, s_prev, noise, alpha=0.7, diffusion_steps=5, embedding_scale=1):
  text = text.strip()
  text = text.replace('"', '')
  ps = global_phonemizer.phonemize([text])
  ps = word_tokenize(ps[0])
  ps = ' '.join(ps)

  tokens = textclenaer(ps)
  tokens.insert(0, 0)
  tokens = torch.LongTensor(tokens).to(device).unsqueeze(0)

  with torch.no_grad():
      input_lengths = torch.LongTensor([tokens.shape[-1]]).to(tokens.device)
      text_mask = length_to_mask(input_lengths).to(tokens.device)

      t_en = model.text_encoder(tokens, input_lengths, text_mask)
      bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
      d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

      s_pred = sampler(noise,
            embedding=bert_dur[0].unsqueeze(0), num_steps=diffusion_steps,
            embedding_scale=embedding_scale).squeeze(0)

      if s_prev is not None:
          # convex combination of previous and current style
          s_pred = alpha * s_prev + (1 - alpha) * s_pred

      s = s_pred[:, 128:]
      ref = s_pred[:, :128]

      d = model.predictor.text_encoder(d_en, s, input_lengths, text_mask)

      x, _ = model.predictor.lstm(d)
      duration = model.predictor.duration_proj(x)
      duration = torch.sigmoid(duration).sum(axis=-1)
      pred_dur = torch.round(duration.squeeze()).clamp(min=1)

      pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
      c_frame = 0
      for i in range(pred_aln_trg.size(0)):
          pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
          c_frame += int(pred_dur[i].data)

      # encode prosody
      en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
      F0_pred, N_pred = model.predictor.F0Ntrain(en, s)
      out = model.decoder((t_en @ pred_aln_trg.unsqueeze(0).to(device)),
                              F0_pred, N_pred, ref.squeeze().unsqueeze(0))

  return out.squeeze().cpu().numpy(), s_pred


Generating text-to-speech

We've successfully set up our codebase, paving the way for generating voiceovers from simple text inputs. The next step is to write the text we want to vocalize and feed it into the inference function defined above, which transforms the written words into natural-sounding speech. Let's prepare our text and try out the synthesis.

text = "StyleTTS 2 is a text-to-speech model that leverages style diffusion and adversarial training with large speech language models to achieve human-level text-to-speech synthesis." 

start = time.time()
noise = torch.randn(1,1,256).to(device)
wav = inference(text, noise, diffusion_steps=5, embedding_scale=1)
rtf = (time.time() - start) / (len(wav) / 24000)
print(f"RTF = {rtf:5f}")
import IPython.display as ipd
display(ipd.Audio(wav, rate=24000))

Let's break down the code:

Defining the Text to be Synthesized:

text = "StyleTTS 2 is a text-to-speech model that leverages style diffusion and adversarial training with large speech language models to achieve human-level text-to-speech synthesis."

This line sets the variable text to a string that describes StyleTTS 2. This is the input text that will be converted into speech.

Measuring Synthesis Time:

start = time.time()

This line records the current time (in seconds) before the synthesis starts. It's used to calculate how long the synthesis process takes.

Generating Noise Input:

noise = torch.randn(1,1,256).to(device)

This line generates a random noise tensor with shape (1, 1, 256) using PyTorch. The noise tensor is the starting point for the style diffusion sampler, which denoises it into a style vector conditioned on the text. The tensor is moved to the appropriate device (CPU or GPU).

Performing the Synthesis:

wav = inference(text, noise, diffusion_steps=5, embedding_scale=1)

Here, the inference function is called with the input text, the generated noise, and additional parameters (diffusion_steps and embedding_scale). This function is responsible for converting the input text into a waveform (audio data).

rtf = (time.time() - start) / (len(wav) / 24000)
print(f"RTF = {rtf:5f}")

After the synthesis, the current time is recorded again, and the difference from the start time is calculated. This difference is the total time taken for the synthesis. The Real-Time Factor (RTF) is then calculated by dividing this time by the duration of the generated audio in seconds (len(wav) / 24000, assuming the sample rate is 24,000 Hz). RTF is a measure of how much longer the synthesis takes compared to the duration of the generated audio. An RTF of less than 1 means the synthesis is faster than real-time.

import IPython.display as ipd
display(ipd.Audio(wav, rate=24000))

Finally, the synthesized audio (wav) is played. This is done using IPython's Audio class, which is capable of playing an array of audio data in a Jupyter notebook environment. The sample rate is set to 24,000 Hz, which should match the rate used in the synthesis process.
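
If you'd rather keep the result than only play it in the notebook, you can write it to disk with soundfile (installed earlier); the filename here is just an example:

import soundfile as sf
sf.write('styletts2_demo.wav', wav, 24000)   # save the synthesized waveform at 24 kHz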

In summary, this code snippet takes a piece of text, synthesizes it into speech using the StyleTTS 2 model, measures the time taken for this process, calculates the Real-Time Factor, and plays back the synthesized audio.

Output: a roughly 12-second audio clip of the synthesized sentence.

In conclusion

I trust that this concise guide has given you a clearer understanding of StyleTTS 2 and its capabilities. By now, you should have a solid foundation for exploring and using this advanced text-to-speech technology. If you have any questions or need further assistance, feel free to reach out or consult the documentation for more in-depth information. Happy experimenting with StyleTTS 2!