Overview of Zero-Shot Multi-Speaker TTS Systems: Top Q&As

The Speech AI Summit is an annual conference that brings together experts in the field of AI and speech technology to discuss the latest industry trends and advancements. This post summarizes the top questions asked during Overview of Zero-Shot Multi-Speaker TTS Systems, a recorded talk from the 2022 summit featuring Coqui.ai.

Synthesizing a voice with seconds of audio

Text-to-speech (TTS) systems have advanced significantly in recent years with deep learning approaches. These advances have motivated research that aims to synthesize speech in the voice of a target speaker using just a few seconds of that speaker's audio. This approach is called zero-shot multi-speaker TTS. The Coqui.ai session explored the timeline and state-of-the-art technology behind this approach.

Here are some key takeaways from the session:

  • YourTTS achieved state-of-the-art performance in English and showed the feasibility of performing zero-shot multi-speaker TTS in a target language by using a single-speaker dataset. This opened possibilities for the development of these systems in low-resource languages, such as indigenous languages. For more information, see YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone.
  • Advances in speaker verification systems can improve the performance of zero-shot multi-speaker TTS systems.
  • Zero-shot multi-speaker TTS can be used to generate new artificial voices. This is achieved by sampling a new speaker embedding. The new speaker embedding can be a completely random vector or an interpolation between different speaker embeddings. For example, you could generate voices that are free of copyright restrictions.

Top Q&As for zero-shot multi-speaker TTS systems

Can you create entirely brand new voices? Are there benefits to zero-shot to consider over one-minute fine-tuning? What are the hardware requirements to train a TTS model? Edresson Casanova dives into the top questions for developing zero-shot multi-speaker TTS systems.

How is text-to-speech quality measured?

Generally, the quality and naturalness of a TTS system are evaluated using the mean opinion score (MOS). With this metric, human evaluators listen to the audio and give a score on a scale between one and five, with one indicating bad quality and five indicating excellent quality.
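As a rough illustration of how such ratings are aggregated, the sketch below averages a set of listener scores and reports an approximate 95% confidence interval. The ratings are made-up sample values, the normal-approximation interval is an assumption about how the spread is reported, and the function name `mean_opinion_score` is hypothetical rather than part of any toolkit mentioned in the session.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("MOS ratings must be between 1 and 5")
    mos = statistics.mean(ratings)
    # Normal-approximation interval: 1.96 * standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width

# Illustrative ratings only (not real evaluation data).
ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

In practice, each audio sample is usually rated by several listeners, and the interval narrows as more ratings are collected.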

In zero-shot multi-speaker TTS systems, you must also evaluate how closely the synthesized voice matches each new speaker by using a similarity MOS. In addition, a speaker encoder can be used to measure speaker similarity automatically: extract the speaker embeddings for two audio samples and compute the cosine similarity between them. This metric is called the speaker encoder cosine similarity (SECS).
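The SECS computation itself is small once the embeddings exist. The sketch below assumes the two embeddings have already been extracted by some speaker encoder; the 4-dimensional vectors are toy values standing in for real embeddings, and `cosine_similarity` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for speaker embeddings (assumed, not real data).
reference_embedding = [0.2, 0.1, -0.4, 0.7]
synthesized_embedding = [0.25, 0.05, -0.35, 0.65]
secs = cosine_similarity(reference_embedding, synthesized_embedding)
print(f"SECS = {secs:.3f}")
```

The value ranges from -1 to 1, and a higher value means the synthesized sample is closer to the reference speaker in the encoder's embedding space.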

Researchers have recently published papers that explore the use of artificial neural networks to predict the MOS. At present, the generalization of these systems is not good enough, especially for new recording conditions, such as a different microphone or noise environment.

Can speech-to-text systems be used to measure text-to-speech system quality?

Speech-to-text (STT) systems can be used to check whether the TTS model’s pronunciation is correct, but this approach is not widely used in the literature. Evaluation with an STT model covers only pronunciation, not the quality of the speech itself.
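One way to run such a pronunciation check is to transcribe the synthesized audio and compare the transcript with the input text. The sketch below uses word error rate (WER) for that comparison. WER is an assumption here, because the session does not name a specific metric, and the transcript is a hard-coded placeholder rather than the output of a real STT model.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference text must not be empty")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

input_text = "the quick brown fox jumps over the lazy dog"
# Placeholder for what an STT model might return for the synthesized audio.
stt_transcript = "the quick brown fox jumped over the lazy dog"
print(f"WER = {word_error_rate(input_text, stt_transcript):.2f}")
```

A low WER suggests the words were pronounced intelligibly, but it says nothing about naturalness or speaker similarity, and STT errors also inflate the score.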

What is the benefit of zero-shot compared to one-minute fine-tuning?

Zero-shot can work well, but not always. When the recording conditions or the voice are too different from those seen in training, zero-shot can fail and produce a voice that is not very similar to the target speaker’s voice. In this case, one-minute fine-tuning can be used. The YourTTS paper shows that the model can learn voices well through fine-tuning, even voices for which zero-shot performed poorly.

How important is the architecture of the speaker encoder? Do you suggest training the speaker encoder separately or along with the spectrogram generator?

The speaker encoder is one of the most important components for the final quality of zero-shot multi-speaker TTS models. Without good speaker embeddings, the model can’t clone new voices. The speaker encoder used in the YourTTS model was pretrained separately on thousands of speakers, and it was kept frozen during training.

Some papers—such as Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding—show that training a speaker encoder-like module along with the TTS model could produce good results. In my experience, it depends on how many speakers you have in the training set. Without adequate speaker diversity, the model easily overfits and does not work well with speakers or recording conditions not seen in training.

Is it possible to interpolate speaker encoder representation to create an unseen voice as a mix of known voices?

It is possible. It is also possible to generate new artificial voices by using a random speaker embedding. Although the YourTTS Colab demos do not cover it, the SC-GlowTTS Colab demo shows an example of how to generate a completely new artificial voice.
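Both ideas can be sketched in a few lines. The example below assumes linear interpolation and a Gaussian random vector, each followed by L2 normalization. Whether normalization is appropriate depends on the speaker encoder, and none of these helper names come from the YourTTS or SC-GlowTTS code. The result would then be passed to the TTS model in place of a real speaker's embedding.

```python
import math
import random

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def interpolate_embeddings(emb_a, emb_b, alpha):
    """Linear blend: alpha=0 gives speaker A, alpha=1 gives speaker B."""
    mixed = [(1 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]
    return l2_normalize(mixed)

def random_embedding(dim, seed=None):
    """Sample a random unit-length vector to use as a new speaker embedding."""
    rng = random.Random(seed)
    return l2_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])

# Toy 2-dimensional embeddings (assumed values; real embeddings are much larger).
speaker_a = [1.0, 0.0]
speaker_b = [0.0, 1.0]
mixed_voice = interpolate_embeddings(speaker_a, speaker_b, alpha=0.5)
new_voice = random_embedding(dim=8, seed=0)
print(mixed_voice, len(new_voice))
```

A random vector is not guaranteed to land in a region of the embedding space that produces a natural voice, so sampled voices usually need to be auditioned.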

Are models phoneme-based or character-based for training?

YourTTS is character-based, whereas SC-GlowTTS, for example, is phoneme-based. We decided to train YourTTS on characters instead of phonemes because the model is intended for low-resource languages, which normally do not have good phonemizers.
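To make the distinction concrete, here is a minimal sketch of a character-based input encoding, where each character maps directly to an integer ID and no phonemizer is involved. The vocabulary construction, the padding convention, and the sample sentences are assumptions for illustration, not the actual YourTTS text front end.

```python
def build_char_vocab(texts):
    """Assign an integer ID to every character seen in the training texts."""
    chars = sorted(set("".join(texts)))
    # ID 0 is reserved for padding in this sketch.
    return {ch: idx for idx, ch in enumerate(chars, start=1)}

def encode(text, vocab):
    """Convert text to a list of character IDs (raises KeyError on unseen characters)."""
    return [vocab[ch] for ch in text]

# Hypothetical training sentences in the target language.
texts = ["olá mundo", "bom dia"]
vocab = build_char_vocab(texts)
print(encode("bom dia", vocab))
```

A phoneme-based model would insert a grapheme-to-phoneme step before this mapping, which is the component that is often missing for low-resource languages.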

How much data center compute is required to train your leading text-to-speech model?

YourTTS used one NVIDIA V100 32-GB GPU with a batch size of 64. However, it is possible to train it with a smaller batch size using GPUs with less VRAM. I have never tried a GPU with less VRAM, but I know that some Coqui TTS contributors have already fine-tuned the YourTTS model using GPUs with 11 GB of VRAM.

When computing the speaker embeddings, does it help to exclude certain segments from embedding extraction like silence or unvoiced or plosive phonemes?

The speaker encoder should learn to ignore silences and focus just on speech. Even so, during dataset preprocessing we first removed long silences at the beginning and end of each recording to avoid problems during model training, and then removed the remaining long silences. However, we do not remove unvoiced or plosive phoneme segments.
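The idea of trimming leading and trailing silence can be sketched with a simple amplitude threshold. This is a simplification for illustration only: the threshold value is an assumption, real pipelines typically work on frame energy rather than individual samples, and this is not the preprocessing code used for YourTTS.

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose absolute amplitude is below threshold."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]

# Toy waveform: near-silent edges around a short burst of speech-like samples.
audio = [0.0, 0.001, -0.002, 0.3, -0.25, 0.0, 0.4, 0.002, 0.0]
print(trim_silence(audio))  # [0.3, -0.25, 0.0, 0.4]
```

Note that the quiet sample in the middle is kept, because only the edges are trimmed.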

Can zero-shot text-to-speech be achieved for expressive speech?

It can be achieved. At Coqui.ai, we have already developed a model that can do zero-shot multi-speaker TTS and generate expressive speech in five different emotions. This model is available through Coqui Studio.

More resources

From fine-tuning a model to generating a custom voice, speech AI technology helps organizations tackle complex conversations globally. Check out the following resources to learn how your organization can integrate speech AI into core operations.

  • Get a detailed overview of the growing speech AI landscape with the free ebook, Introduction to Speech AI.
  • Explore how to build and deploy real-time speech AI pipelines for your application with the free ebook, Building Speech AI Applications.
  • Learn how to customize a speech recognition pipeline with the self-paced NVIDIA Deep Learning Institute course, Get Started with Highly Accurate Custom ASR for Speech AI.

Source: NVIDIA