OpenVoice Complete Guide

Introduction

OpenVoice is an open-source voice cloning tool developed by a team of AI researchers from MIT, Tsinghua University, and the Canadian startup MyShell. It can clone a voice with remarkable precision and generate natural-sounding speech that mimics that voice in multiple languages, while allowing control over accent, rhythm, and intonation. The advantages of OpenVoice are three-fold: accurate tone color cloning, flexible voice style control, and zero-shot cross-lingual voice cloning. It can be used in a wide range of industries and applications, such as enhancing media content creation, revolutionizing chatbots and interactive AI interfaces, and preserving the voice of a loved one.

OpenVoice's standout feature is its ability to clone voices from very short audio clips, a capability that has garnered significant attention in the AI community. This efficiency in voice cloning, even with minimal audio input, demonstrates the advanced nature of the technology. Beyond voice cloning, OpenVoice can perform multiple synthesis transformations on a reference voice. These transformations include altering the tone, infusing various emotions, and modifying rhythm, pauses, and intonation. This level of control and customization makes OpenVoice a versatile tool in the field of voice synthesis and AI.

Features of OpenVoice

OpenVoice, developed by MyShell, offers several advanced features for voice cloning:

  1. Accurate Tone Color Cloning: OpenVoice excels at replicating the unique tone color of a reference voice, ensuring a high degree of similarity.
  2. Flexible Voice Style Control: It allows extensive customization of voice styles, including adjustments to emotion, accent, rhythm, pauses, and intonation, providing users with a wide range of expressive capabilities.
  3. Zero-shot Cross-lingual Voice Cloning: A standout feature, this allows the cloning of voices in languages not included in the training dataset, demonstrating its versatility in language handling.
  4. Multi-Language and Accent Support: OpenVoice can generate speech in multiple languages and accents, catering to diverse linguistic needs.
  5. Customizable Base Speaker Model: Users can replace the default base speaker model with any model of their choice, enabling them to select any language and style for their voice cloning projects.

Why OpenVoice?

OpenVoice is valuable for its advanced voice cloning capabilities, enabling accurate replication of tone and style from brief audio samples. Its versatility extends to cross-lingual applications and customizable voice styles, including emotion and rhythm adjustments. This technology can enhance accessibility, provide personalized user experiences, and support diverse linguistic needs, making it a valuable tool in various sectors like entertainment, customer service, and assistive technology.

As an open-source model, OpenVoice benefits from the contributions of a wide community, ensuring regular updates and maintenance. This community-driven development model not only enhances the tool's capabilities but also keeps it accessible and free for integration into various applications. This aspect makes OpenVoice an appealing choice for developers and organizations looking to incorporate advanced voice cloning technology without incurring high costs.

Applications of OpenVoice

OpenVoice, with its advanced voice cloning technology, can be applied in various fields:

  1. Accessibility: Assisting visually impaired individuals by converting text to speech in a natural, familiar voice.
  2. Entertainment: Creating varied character voices in animations, games, and audiobooks.
  3. Education: Tailoring language learning tools with authentic accents and intonations.
  4. Customer Service: Enhancing interactive voice response systems with more natural, engaging voices.
  5. Personalized Alerts and Messages: Customizing voice notifications in apps and devices.
  6. Research and Development: Facilitating studies in linguistics and AI voice recognition.

How to Integrate OpenVoice into Your Python App

To integrate OpenVoice into your Python application, follow these general steps:

  1. Clone the OpenVoice repository from GitHub.
  2. Set up a Python environment and install necessary dependencies as outlined in the OpenVoice documentation.
  3. Download the required model checkpoint and place it in the appropriate directory.
  4. Utilize the provided Python notebooks (demo_part1.ipynb and demo_part2.ipynb) as examples to understand how to use the OpenVoice API within your application.
  5. Adapt the code from these examples to suit your application's specific requirements.

Clone the Repository:


git clone https://github.com/myshell-ai/OpenVoice.git

Set Up Environment: Use Conda to create an environment with Python 3.9 and activate it.


conda create -n openvoice python=3.9
conda activate openvoice

Install Dependencies: Install PyTorch and other libraries using Conda, then use pip to install the requirements from the provided requirements.txt file.


conda install pytorch torchvision torchaudio -c pytorch
pip install -r requirements.txt
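
To confirm the environment is set up correctly before moving on, a quick optional check from within the activated environment:


import torch

# Verify the PyTorch install and report whether a CUDA GPU is visible.
print(torch.__version__)
print(torch.cuda.is_available())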

Download the Checkpoint: Download the necessary model checkpoint and place it in the 'checkpoints' directory.

https://myshell-public-repo-hosting.s3.amazonaws.com/checkpoints_1226.zip
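
If you prefer to script this step, here is a minimal sketch that downloads and extracts the archive (assuming the URL above is still live and that you run it from the repository root):


import urllib.request
import zipfile

# Fetch the checkpoint archive and unpack it in place; this creates
# the 'checkpoints' directory used in the demos below.
url = 'https://myshell-public-repo-hosting.s3.amazonaws.com/checkpoints_1226.zip'
urllib.request.urlretrieve(url, 'checkpoints_1226.zip')
with zipfile.ZipFile('checkpoints_1226.zip') as zf:
    zf.extractall('.')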

Voice Style Control Demo


import os
import torch
import se_extractor
from api import BaseSpeakerTTS, ToneColorConverter

# Initialization
ckpt_base = 'checkpoints/base_speakers/EN'
ckpt_converter = 'checkpoints/converter'
device = 'cuda:0'
output_dir = 'outputs'

base_speaker_tts = BaseSpeakerTTS(f'{ckpt_base}/config.json', device=device)
base_speaker_tts.load_ckpt(f'{ckpt_base}/checkpoint.pth')

tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

os.makedirs(output_dir, exist_ok=True)
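
If you do not have a CUDA GPU, the device can be selected dynamically instead of hard-coded; inference on CPU should still work, just more slowly:


# Fall back to CPU when CUDA is unavailable.
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'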

Obtain Tone Color Embedding

The source_se is the tone color embedding of the base speaker. It is an average over multiple sentences generated by the base speaker. We provide the precomputed result here, but readers are free to extract source_se themselves.


source_se = torch.load(f'{ckpt_base}/en_default_se.pth').to(device)
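
If you would rather extract source_se yourself than load the provided file, one possible approach (a hypothetical sketch, not the repo's precomputed embedding) is to synthesize a sample with the base speaker and feed it to se_extractor:


# Hypothetical alternative: derive the base speaker embedding
# from audio generated by the base speaker itself.
base_sample = f'{output_dir}/base_sample.wav'
base_speaker_tts.tts('A few sentences spoken by the base speaker.',
                     base_sample, speaker='default', language='English', speed=1.0)
source_se, _ = se_extractor.get_se(base_sample, tone_color_converter,
                                   target_dir='processed', vad=True)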

The reference_speaker.mp3 below points to the short audio clip of the reference speaker whose voice we want to clone. We provide an example here. If you use your own reference speakers, please make sure each speaker has a unique filename: the se_extractor will save the targeted_se under the filename of the audio and will not automatically overwrite it.


reference_speaker = 'resources/example_reference.mp3'
target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, target_dir='processed', vad=True)
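
Since target_se is a torch tensor (the precomputed embeddings elsewhere in this guide are loaded with torch.load), one option is to cache it and skip re-extraction on later runs:


# Optional: save the extracted embedding for reuse across runs.
torch.save(target_se, f'{output_dir}/example_reference_se.pth')
# Later: target_se = torch.load(f'{output_dir}/example_reference_se.pth').to(device)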

Inference


save_path = f'{output_dir}/output_en_default.wav'

# Run the base speaker tts
text = "This audio is generated by OpenVoice."
src_path = f'{output_dir}/tmp.wav'
base_speaker_tts.tts(text, src_path, speaker='default', language='English', speed=1.0)

# Run the tone color converter
encode_message = "@MyShell"
tone_color_converter.convert(
    audio_src_path=src_path, 
    src_se=source_se, 
    tgt_se=target_se, 
    output_path=save_path,
    message=encode_message)
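
To quickly check the result, you can play the converted file inline if you are working in a Jupyter notebook (as the demo notebooks assume):


from IPython.display import Audio

# Listen to the tone-color-converted output.
Audio(save_path)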

Try different styles and speeds. The style can be controlled by the speaker parameter of the base_speaker_tts.tts method. Available choices: friendly, cheerful, excited, sad, angry, terrified, shouting, whispering. Note that the tone color embedding needs to be updated when switching styles (the style speakers use en_style_se.pth rather than en_default_se.pth, as shown below). The speed can be controlled by the speed parameter. Let's try whispering with speed 0.9.


source_se = torch.load(f'{ckpt_base}/en_style_se.pth').to(device)
save_path = f'{output_dir}/output_whispering.wav'

# Run the base speaker tts
text = "This audio is generated by OpenVoice with a half-performance model."
src_path = f'{output_dir}/tmp.wav'
base_speaker_tts.tts(text, src_path, speaker='whispering', language='English', speed=0.9)

# Run the tone color converter
encode_message = "@MyShell"
tone_color_converter.convert(
    audio_src_path=src_path, 
    src_se=source_se, 
    tgt_se=target_se, 
    output_path=save_path,
    message=encode_message)
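
Building on this, a small sketch that batch-generates one output per style (assuming the en_style_se.pth embedding loaded above applies to all of the non-default styles):


# Hypothetical batch run over the styles listed above.
styles = ['friendly', 'cheerful', 'excited', 'sad', 'angry',
          'terrified', 'shouting', 'whispering']
for style in styles:
    base_speaker_tts.tts(text, src_path, speaker=style, language='English', speed=1.0)
    tone_color_converter.convert(
        audio_src_path=src_path,
        src_se=source_se,
        tgt_se=target_se,
        output_path=f'{output_dir}/output_{style}.wav',
        message=encode_message)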

Try different languages. OpenVoice can achieve multi-lingual voice cloning by simply replacing the base speaker. We provide an example with a Chinese base speaker here, and we encourage readers to try demo_part2.ipynb for a detailed demo.


ckpt_base = 'checkpoints/base_speakers/ZH'
base_speaker_tts = BaseSpeakerTTS(f'{ckpt_base}/config.json', device=device)
base_speaker_tts.load_ckpt(f'{ckpt_base}/checkpoint.pth')

source_se = torch.load(f'{ckpt_base}/zh_default_se.pth').to(device)
save_path = f'{output_dir}/output_chinese.wav'

# Run the base speaker tts
text = "今天天气真好,我们一起出去吃饭吧。"
src_path = f'{output_dir}/tmp.wav'
base_speaker_tts.tts(text, src_path, speaker='default', language='Chinese', speed=1.0)

# Run the tone color converter
encode_message = "@MyShell"
tone_color_converter.convert(
    audio_src_path=src_path, 
    src_se=source_se, 
    tgt_se=target_se, 
    output_path=save_path,
    message=encode_message)

Cross-Lingual Voice Clone Demo


import os
import torch
import se_extractor
from api import ToneColorConverter

Initialization


ckpt_converter = 'checkpoints/converter'
device = 'cuda:0'
output_dir = 'outputs'

tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

os.makedirs(output_dir, exist_ok=True)

In this demo, we will use OpenAI TTS as the base speaker to produce multi-lingual speech audio. Users can flexibly change the base speaker according to their own needs. Please create a file named .env and place your OpenAI key in it as OPENAI_API_KEY=xxx. We have also provided a Chinese base speaker model (see demo_part1.ipynb).


from openai import OpenAI
from dotenv import load_dotenv

# Please create a file named .env and place your
# OpenAI key as OPENAI_API_KEY=xxx
load_dotenv() 

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

response = client.audio.speech.create(
    model="tts-1",
    voice="nova",
    input="This audio will be used to extract the base speaker tone color embedding. " + \
        "Typically a very short audio should be sufficient, but increasing the audio " + \
        "length will also improve the output audio quality."
)

response.stream_to_file(f"{output_dir}/openai_source_output.mp3")
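
Note that newer releases of the openai package deprecate stream_to_file on the plain response in favor of the streaming-response form; if you see a deprecation warning, a sketch of the replacement:


# Streaming variant for recent openai SDK releases.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="nova",
    input="Short sample used only to extract the base speaker tone color.",
) as streamed:
    streamed.stream_to_file(f"{output_dir}/openai_source_output.mp3")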

Obtain Tone Color Embedding

The source_se is the tone color embedding of the base speaker. It is an average over multiple sentences with multiple emotions of the base speaker. Here, we extract it directly from the OpenAI TTS output generated above.


base_speaker = f"{output_dir}/openai_source_output.mp3"
source_se, audio_name = se_extractor.get_se(base_speaker, tone_color_converter, vad=True)

reference_speaker = 'resources/example_reference.mp3'
target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, vad=True)

Inference


# Run the base speaker tts
text = [
    "MyShell is a decentralized and comprehensive platform for discovering, creating, and staking AI-native apps.",
    "MyShell es una plataforma descentralizada y completa para descubrir, crear y apostar por aplicaciones nativas de IA.",
    "MyShell est une plateforme décentralisée et complète pour découvrir, créer et miser sur des applications natives d'IA.",
    "MyShell ist eine dezentralisierte und umfassende Plattform zum Entdecken, Erstellen und Staken von KI-nativen Apps.",
    "MyShell è una piattaforma decentralizzata e completa per scoprire, creare e scommettere su app native di intelligenza artificiale.",
    "MyShellは、AIネイティブアプリの発見、作成、およびステーキングのための分散型かつ包括的なプラットフォームです。",
    "MyShell — это децентрализованная и всеобъемлющая платформа для обнаружения, создания и стейкинга AI-ориентированных приложений.",
    "MyShell هي منصة لامركزية وشاملة لاكتشاف وإنشاء ورهان تطبيقات الذكاء الاصطناعي الأصلية.",
    "MyShell是一个去中心化且全面的平台,用于发现、创建和投资AI原生应用程序。",
    "MyShell एक विकेंद्रीकृत और व्यापक मंच है, जो AI-मूल ऐप्स की खोज, सृजन और स्टेकिंग के लिए है।",
    "MyShell é uma plataforma descentralizada e abrangente para descobrir, criar e apostar em aplicativos nativos de IA."
]
src_path = f'{output_dir}/tmp.wav'

for i, t in enumerate(text):

    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=t,
    )

    response.stream_to_file(src_path)

    save_path = f'{output_dir}/output_crosslingual_{i}.wav'

    # Run the tone color converter
    encode_message = "@MyShell"
    tone_color_converter.convert(
        audio_src_path=src_path, 
        src_se=source_se, 
        tgt_se=target_se, 
        output_path=save_path,
        message=encode_message)
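
After the loop finishes, there should be one converted file per input language. A quick way to confirm:


import glob

# List the cross-lingual outputs produced by the loop above.
for path in sorted(glob.glob(f'{output_dir}/output_crosslingual_*.wav')):
    print(path)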

Conclusion

OpenVoice emerges as a groundbreaking tool in the realm of voice cloning technology. Its ability to accurately clone voice tones from minimal audio input, coupled with its versatile applications in cross-lingual cloning and voice style customization, makes it a valuable asset for various sectors. As an open-source platform, it not only fosters community collaboration and continuous improvement but also offers cost-effective integration for developers and businesses. OpenVoice is poised to revolutionize how we interact with and utilize voice technology in diverse applications. For a deeper dive into OpenVoice, visit the OpenVoice GitHub page.

If you haven't subscribed to our YouTube channel, please subscribe to get our latest videos on different topics.