MetaVoice-1B

Introduction

MetaVoice-1B is a 1.2-billion-parameter text-to-speech (TTS) model trained on roughly 100,000 hours of speech, with a focus on emotional speech rhythm and tone in English. It can clone voices with minimal training data, including zero-shot cloning of American and British voices from a reference sample of about 30 seconds, and it supports cross-lingual voice cloning through fine-tuning. The model also handles long-form synthesis and is designed to minimize hallucinated words.

Under the hood, MetaVoice-1B predicts EnCodec tokens from text and speaker information using a multi-stage pipeline: a causal GPT, a non-causal transformer, and multi-band diffusion, with optimizations such as KV-caching and batching (see the Architecture section below). It is released under the Apache 2.0 license and is available on GitHub and Hugging Face (metavoiceio/metavoice-1B-v0.1), along with a Colab notebook for users to experiment with its capabilities and contribute to its ongoing development.

Supported Languages

MetaVoice-1B performs zero-shot cloning of American and British English voices out of the box, and it supports cross-lingual voice cloning through fine-tuning. The team reports success with as little as one minute of training data for Indian speakers, and support for a wider variety of accents and languages is planned. Future updates are also expected to expand the model's fine-tuning abilities, allowing for more personalized voice cloning.

Features of the MetaVoice-1B Model

The MetaVoice-1B model, with its innovative text-to-speech (TTS) capabilities, offers a range of features focused on enhancing the quality and versatility of synthesized speech. A few key aspects are:

  1. Emotionally Expressive Speech: The model infuses emotional depth into speech rhythm and tone, making the generated voice sound more natural and engaging.
  2. Advanced Voice Cloning: It clones voices from minimal training data, including zero-shot cloning for American and British English accents, enabling a broad range of voice styles with little input.
  3. Long-Form Speech Synthesis: Unlike TTS systems optimized for short clips, MetaVoice-1B handles longer speech segments, making it suitable for applications like audiobooks or podcasts.
  4. Technical Innovations in Speech Generation: Generation combines a causal GPT, a non-causal transformer, and multi-band diffusion, which together contribute to the quality and efficiency of the synthesized voice.
  5. Efficiency and Performance Enhancements: KV-caching, which reuses attention keys and values across decoding steps, and batching can reduce the computational load and speed up voice generation.

Together, these features make MetaVoice-1B a powerful text-to-speech tool, particularly for applications that require emotional depth and variety in voice synthesis.

Applications of the MetaVoice-1B Model

The MetaVoice-1B model can be applied wherever realistic and emotionally expressive speech synthesis is crucial. Its ability to clone voices from minimal data, including zero-shot cloning for American and British accents, makes it well suited to creating diverse, natural-sounding voiceovers, and fine-tuning extends this to other languages and accents. It can be used in audiobooks, podcasts, virtual assistants, and customer service bots, where long-form speech synthesis is essential, while the emotional depth it adds to speech rhythm and tone could also benefit gaming, animation, and educational tools, making interactions more engaging and lifelike.

MetaVoice-1B's advanced text-to-speech capabilities make it suitable for a wide range of applications:

  1. Audiobooks: Creating expressive and varied narration.
  2. Virtual Assistants: Enhancing interaction with natural-sounding voices.
  3. Customer Service Bots: Providing more human-like responses.
  4. E-Learning Platforms: Making educational content engaging.
  5. Gaming: Adding depth to character voices.
  6. Animation and Film Dubbing: Offering realistic voice acting.
  7. Language Learning Tools: Assisting in pronunciation and language training.
  8. Accessibility Tools: Aiding visually impaired users with natural voice reading.
  9. Public Announcements: Automated, clear voice announcements.
  10. Podcasting: Generating diverse voices for content creation.


Architecture

As the MetaVoice team describes the approach: "We predict EnCodec tokens from text, and speaker information. This is then diffused up to the waveform level, with post-processing applied to clean up the audio."

The architecture of the MetaVoice-1B model involves several stages working in sequence (a structural sketch in code follows the list):

  1. First-Stage Token Prediction: A causal GPT predicts the initial hierarchies of EnCodec tokens from the text and the speaker information.
  2. Second-Stage Token Prediction: A smaller non-causal transformer, with strong generalization capabilities, then predicts the remaining hierarchies of tokens.
  3. Speaker Information Handling: Speaker characteristics are conditioned at the token embedding layer, using embeddings from a separately trained speaker verification network.
  4. Tokenization and Sampling: A custom-trained BPE tokenizer encodes the text, and condition-free sampling is employed to improve the model's voice cloning capability.
  5. Waveform Generation and Refinement: Multi-band diffusion converts the predicted tokens into a waveform, and DeepFilterNet is applied afterwards to remove background artifacts introduced by the diffusion step, yielding cleaner audio output.
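
To make the data flow concrete, here is a minimal structural sketch of the pipeline in Python. Every class name below is a hypothetical stand-in rather than MetaVoice's real API, and each stage is stubbed out so the control flow runs end to end; it illustrates only the order of operations described above.

import numpy as np

# All classes below are hypothetical placeholders for the pipeline
# stages, not MetaVoice's actual API.

class SpeakerVerificationNet:
    def embed(self, reference_audio: np.ndarray) -> np.ndarray:
        return np.zeros(256)  # stub: fixed-size speaker embedding

class FirstStageGPT:
    def generate(self, text_tokens, speaker_embedding):
        # Stub: a causal GPT would predict the initial EnCodec hierarchies.
        return [[0] * len(text_tokens)] * 2

class SecondStageTransformer:
    def predict(self, coarse_tokens):
        # Stub: a non-causal transformer fills in the remaining hierarchies.
        return coarse_tokens + [[0] * len(coarse_tokens[0])] * 6

class MultiBandDiffusion:
    def decode(self, tokens) -> np.ndarray:
        return np.zeros(24000)  # stub: one second of audio at 24 kHz

class DeepFilterNetCleanup:
    def enhance(self, waveform: np.ndarray) -> np.ndarray:
        return waveform  # stub: background-artifact removal

def synthesise(text: str, reference_audio: np.ndarray) -> np.ndarray:
    speaker = SpeakerVerificationNet().embed(reference_audio)  # speaker conditioning
    text_tokens = list(text.encode("utf-8"))                   # stand-in for BPE tokenization
    coarse = FirstStageGPT().generate(text_tokens, speaker)    # stage 1: causal GPT
    full = SecondStageTransformer().predict(coarse)            # stage 2: non-causal transformer
    waveform = MultiBandDiffusion().decode(full)               # tokens to waveform via diffusion
    return DeepFilterNetCleanup().enhance(waveform)            # post-processing cleanup

audio = synthesise("Hello world.", np.zeros(48000))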

How to Use the Model in Python

To use the MetaVoice-1B model in Python, you typically need to follow these general steps:

  1. Install Necessary Libraries: This might include libraries like transformers from Hugging Face.
  2. Load the Model: Use appropriate functions to load the MetaVoice-1B model.
  3. Prepare Input Data: This includes the text you want to synthesize and any speaker information.
  4. Run the Model: Pass your input data to the model to generate speech.
  5. Process the Output: Handle the generated speech as per your application's needs.

For specific code examples and detailed instructions, see the model card at metavoiceio/metavoice-1B-v0.1 on Hugging Face and the metavoice-src repository on GitHub, which carry the authoritative setup and usage guidance.

Install the Transformers Library: You can install it using pip:


pip install transformers
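
The officially supported inference path, however, is the project's own code in the metavoice-src repository on GitHub rather than the generic transformers classes. A rough from-source setup might look like the following; the exact steps are an assumption here, so check the repository's README for the current instructions:

git clone https://github.com/metavoiceio/metavoice-src.git
cd metavoice-src
pip install -r requirements.txt  # exact install steps may differ; see the repo README
pip install -e .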

Load the Model and Tokenizer: With the transformers library, the generic loading pattern looks something like the following. Note that MetaVoice-1B is not a standard transformers architecture, so these auto classes may not load it directly; the metavoice-src path above is the more reliable route.

from transformers import AutoModel, AutoTokenizer

# Generic Hugging Face loading pattern; this may fail for MetaVoice-1B,
# which ships its own inference code (see metavoice-src above).
model_name = "metavoiceio/metavoice-1B-v0.1"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Prepare Your Input Text: Tokenize your input text using the tokenizer.

Generate Speech: Pass the tokenized text to the model to generate speech; the model returns audio data.

Process the Output: Handle the generated audio as needed for your application, for example by saving it to a WAV file. A hedged end-to-end example based on the project's own interface follows.
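
As a concrete end-to-end illustration, the sketch below follows the interface described in the metavoice-src repository at the time of writing. The fam.llm.fast_inference import path, the TTS class, and the behavior of its synthesise method are assumptions to verify against the current repository.

# Sketch based on the interface described in the metavoice-src repo;
# treat the import path, class name, and arguments as assumptions.
from fam.llm.fast_inference import TTS

tts = TTS()  # loads (and on first use, downloads) the model weights

# synthesise() is expected to write a WAV file and return its path.
wav_path = tts.synthesise(
    text="MetaVoice-1B is an open-source text-to-speech model.",
    spk_ref_path="reference_speaker.wav",  # reference clip of the voice to clone
)
print(f"Generated audio saved to {wav_path}")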


Conclusion

In conclusion, the MetaVoice-1B model represents a significant step forward in text-to-speech technology. With its approach to voice cloning, emotional expression, and long-form synthesis, it opens up new possibilities for natural and engaging voice experiences. Whether for audiobooks, virtual assistants, or e-learning platforms, MetaVoice-1B's capabilities point toward digital voices that come ever closer to human speech, bringing greater realism and accessibility to digital interactions. As the technology continues to evolve, open models like this one will play an important role in shaping the landscape of voice-based applications.