What Is SSML? Use Cases, Examples & Best Practices For Using SSML

Curious about what SSML is? Learn what Speech Synthesis Markup Language does, explore its applications, and get tips for using it in your projects.

Unreal Speech

Apr 10, 2024 • 8 min read

SSML (Speech Synthesis Markup Language) is an essential tool for developers working with text-to-speech technology. Once you understand what SSML is, you can use it to produce richer, more expressive synthesized speech in your applications. Keep reading to see which features and customization options SSML offers and how they help you build voice experiences that sound natural to your users.

What Is SSML And Why Do You Need It?

Speech Synthesis Markup Language (SSML) is an XML-based markup language that offers an enhanced way to modify text-to-speech output attributes. SSML provides control over various speech attributes like pitch, pronunciation, speaking rate, volume, and more. This structured approach to text-to-speech output allows for customization that surpasses basic text input, enabling improved naturalness and human-like qualities in synthesized speech.

Why SSML is Essential for TTS

Speech output generated by text-to-speech systems may not always meet expectations in terms of naturalness. SSML is pivotal in refining synthesized speech to improve its human-like qualities. SSML can rectify pronunciation errors, regulate speech pacing, adjust intonation, insert pauses, emphasize particular words or phrases, and make the speech sound more engaging. In essence, SSML enhances the overall quality and natural flow of synthesized speech output.

Common Use Cases Of SSML

1. Text Structure

In SSML, I can define the structure of my text to shape my speech output, using elements such as paragraph, sentence, and break (some engines also offer a dedicated silence element). Some engines additionally let me place event markers such as bookmarks in the text, or request viseme events, for later processing. A viseme is the visual counterpart of a phoneme: the mouth shape associated with a speech sound, which is useful for lip-syncing an avatar to the audio.

2. Voice and Style Customization

SSML lets me choose the voice and language of my speech output and, where the engine supports them, a speaking style and role. I can use multiple voices within a single SSML document and adjust properties like emphasis, speaking rate, pitch, and volume. I can also insert pre-recorded audio, such as sound effects or music, to enrich the output.

3. Pronunciation Control

With SSML, I can meticulously control the pronunciation of my output audio. I have the liberty to adjust pronunciations using phonemes and a custom lexicon to enhance clarity. I can specify the pronunciation of specific words or mathematical expressions to ensure that the speech output is articulate and precise.

How To Use SSML

Using SSML in Dialogue Systems

When creating a dialogue system, Speech Synthesis Markup Language (SSML) lets you add richness and customization to the spoken text. SSML specifies how text should be spoken, allowing developers to control aspects such as pronunciation, volume, rate, pitch, and more. By wrapping dialogue in the <speak> tag, you tell the text-to-speech system to interpret the content as SSML. <speak> is the required root element of an SSML document, so every other SSML tag must appear inside it.

Incorporating SSML tags in your dialogue can enhance the naturalness and expressiveness of the synthesized speech. By utilizing tags such as <prosody> and <break>, developers can modify the speech characteristics to align with the intended tone of the conversation. SSML opens a world of possibilities for creating engaging and interactive voice applications.

SSML Tag Overview

SSML offers a wide range of tags that allow you to fine-tune the spoken content of your dialogue system. Below are a few key SSML tags and their functions:

<prosody>: Controls the rate, volume, pitch, and range of the speech.
<break>: Inserts a pause in the speech.
<emphasis>: Specifies words that should be emphasized.
<phoneme>: Allows the phonetic pronunciation of a word to be defined.
<sub>: Speaks the text given in its alias attribute in place of the written text, for example to expand an abbreviation.

These are just a few examples of the many SSML tags you can use to customize spoken text in your dialogue system. Whether you want a robot to speak slowly and clearly or a teenager to deliver their lines with enthusiasm, SSML tags help you achieve the desired effect.

Integration of SSML

Integrating SSML into your dialogue system is relatively simple. You can embed SSML directly in the text you send to the text-to-speech engine. To implement SSML, surround the spoken text with the <speak> tag and incorporate the necessary SSML tags within it. Once the text is processed by the speech synthesis engine, the SSML tags will influence how each word is spoken.
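
As a minimal sketch of that wrapping step, the Python helper below escapes XML-reserved characters and surrounds the text with <speak>. The function name to_ssml is hypothetical, and how the resulting string is sent to an engine depends on that engine's own API.

```python
from xml.sax.saxutils import escape


def to_ssml(text: str) -> str:
    """Wrap plain text in a <speak> root, escaping &, < and >."""
    return f"<speak>{escape(text)}</speak>"


print(to_ssml("Q&A starts in 5 minutes"))
```

Escaping matters because a bare & or < in the text would make the SSML invalid XML, and many engines reject the whole request in that case.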

Using SSML in dialogue systems is a powerful method for enhancing the overall user experience. It allows developers to create more dynamic and engaging content by personalizing how the text is spoken. By incorporating SSML, developers can ensure that their voice applications generate more natural and lifelike synthetic speech.

Examples Of SSML Tags

SSML tags are essential for customizing text-to-speech systems.

Audio Tag

The audio tag allows you to include sound files in your dialogue. For example, you can use it to add sound effects or background music to your text-to-speech output. It's a fantastic way of immersing your audience in an auditory experience.
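
A short sketch of the audio tag follows. The URL is a placeholder, and supported file formats and hosting requirements differ between engines. The text inside the tag is the fallback that is spoken if the file cannot be played.

```python
import xml.etree.ElementTree as ET

# Hypothetical URL, used only for illustration.
ssml = (
    "<speak>"
    'Welcome back. <audio src="https://example.com/chime.mp3">chime</audio> '
    "Here is today's news."
    "</speak>"
)

root = ET.fromstring(ssml)  # raises ParseError if the markup is malformed
audio = root.find("audio")
print(audio.get("src"), "| fallback:", audio.text)
```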

Break Tag

The break tag is another useful tool in your SSML toolbox. It allows you to insert pauses into your spoken text, enhancing comprehension or perhaps adding drama to your narrative.
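
For instance, a pause can be generated from a duration in milliseconds. The helper name pause is made up for this sketch. The break tag also accepts a strength attribute (such as "weak" or "strong") instead of an explicit time.

```python
def pause(milliseconds: int) -> str:
    """Return a <break> tag for a pause of the given length."""
    if milliseconds < 0:
        raise ValueError("pause length must be non-negative")
    return f'<break time="{milliseconds}ms"/>'


ssml = f"<speak>And the winner is{pause(1500)} you.</speak>"
print(ssml)
```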

Emphasis Tag

The emphasis tag marks words or phrases that should be stressed. Engines typically render emphasis by changing some combination of pitch, loudness, and timing, so the marked words stand out from the rest of the conversation.
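
The sketch below wraps a phrase in an emphasis tag and rejects levels outside the four defined by the SSML specification. The helper name emphasize is an assumption, and how strongly each level is rendered is up to the engine.

```python
from xml.sax.saxutils import escape

# Emphasis levels defined by the SSML specification.
LEVELS = {"strong", "moderate", "reduced", "none"}


def emphasize(text: str, level: str = "moderate") -> str:
    """Wrap text in <emphasis> with a validated level attribute."""
    if level not in LEVELS:
        raise ValueError(f"unknown emphasis level: {level}")
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


print(f"<speak>I need this {emphasize('today', 'strong')}.</speak>")
```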

Lang Tag

The lang tag helps your TTS system to understand which language or dialect you want it to use. It's particularly useful for multilingual projects.
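
To illustrate, the snippet below switches to French for a single word inside an English document and then lists the language switches it finds. Which language codes are accepted, and whether the active voice can actually speak them, depends on the engine.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

ssml = (
    '<speak xml:lang="en-US">'
    'The French word for hello is <lang xml:lang="fr-FR">bonjour</lang>.'
    "</speak>"
)

root = ET.fromstring(ssml)
switches = [(el.get(XML_LANG), el.text) for el in root.iter("lang")]
print(switches)
```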

P Tag

The p tag marks a paragraph. Engines typically add a pause at the end of each paragraph, which helps listeners understand when a section of text is ending.

Phoneme

The phoneme tag lets you specify the exact pronunciation of a word using a phonetic alphabet such as the International Phonetic Alphabet (IPA).
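
Here is one way to build a phoneme tag in Python. The helper name and the IPA transcription are illustrative, and engines differ in which phonetic alphabets and symbols they accept. Building the element with ElementTree takes care of escaping the attribute value.

```python
import xml.etree.ElementTree as ET


def phoneme(word: str, ipa: str) -> str:
    """Return a <phoneme> tag that pronounces `word` as the IPA string `ipa`."""
    element = ET.Element("phoneme", {"alphabet": "ipa", "ph": ipa})
    element.text = word
    return ET.tostring(element, encoding="unicode")


print(f"<speak>I say {phoneme('tomato', 'təˈmɑːtəʊ')}.</speak>")
```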

Prosody

The prosody tag helps you to adjust the volume, speed, and pitch of the spoken text. If you want to add drama, or if you need to speak extra softly for a particular part of your conversation, prosody is your friend.
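
A small sketch of a prosody wrapper is shown below. The function name is hypothetical, and it passes attributes through unchecked, so the values (for example "slow", "soft", or a percentage) must be ones the target engine accepts.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text: str, **attrs: str) -> str:
    """Wrap text in <prosody>; attrs such as rate, pitch, volume are passed through."""
    rendered = "".join(f" {name}={quoteattr(value)}" for name, value in attrs.items())
    return f"<prosody{rendered}>{escape(text)}</prosody>"


print(prosody("Please listen carefully.", rate="slow", volume="soft"))
```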

S Tag

The s tag marks a sentence. Engines typically add a shorter pause at the end of a sentence than at the end of a paragraph, so it works like a more subtle version of the p tag.

Say-as Tag

The say-as tag tells your text-to-speech system how to interpret certain words, phrases, or numbers through its interpret-as attribute. For example, “1234” can be read as a number (“one thousand, two hundred and thirty-four”) or character by character (“one, two, three, four”).
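
The following sketch renders the same digits in two ways. The interpret-as values are not fixed by the core SSML specification; "cardinal" and "characters" are commonly accepted, but check your engine's documentation. The helper name say_as is an assumption.

```python
import xml.etree.ElementTree as ET


def say_as(text: str, interpret_as: str) -> str:
    """Wrap text in <say-as> with the given interpret-as value."""
    return f'<say-as interpret-as="{interpret_as}">{text}</say-as>'


ssml = (
    "<speak>"
    f"Your score is {say_as('1234', 'cardinal')}. "
    f"Your code is {say_as('1234', 'characters')}."
    "</speak>"
)

# Parse the result to confirm it is well-formed and list the hints used.
hints = [el.get("interpret-as") for el in ET.fromstring(ssml).iter("say-as")]
print(hints)
```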

Speak Tag

The speak tag is the root element for all spoken text.

Sub Tag

The sub tag allows you to substitute spoken text for written text, like reading “e.g.” aloud as “for example.”

Voice Tag

The voice tag is used to specify a custom TTS voice, like an Amazon Polly voice for Alexa skills.
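
A two-speaker exchange might look like the sketch below. "Joanna" and "Matthew" are Amazon Polly voice names; other engines use their own voice names, and not every engine allows several voices in one request.

```python
import xml.etree.ElementTree as ET

ssml = (
    "<speak>"
    '<voice name="Joanna">Did you finish the report?</voice> '
    '<voice name="Matthew">Almost. I need one more hour.</voice>'
    "</speak>"
)

# Parse the markup and list the voices in the order they speak.
speakers = [voice.get("name") for voice in ET.fromstring(ssml).iter("voice")]
print(speakers)
```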

W Tag

The w tag, as implemented by Amazon, changes how a word is pronounced by stating its part of speech or sense, like pronouncing “read” as “red” when it is in the past tense.
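
The sketch below uses the role values from Amazon's SSML dialect (amazon:VBD for a past-tense reading, amazon:VB for a present-tense verb). Treat this as engine-specific markup: other engines may ignore or reject it.

```python
import xml.etree.ElementTree as ET

ssml = (
    "<speak>"
    'Yesterday I <w role="amazon:VBD">read</w> the book. '
    'Today I will <w role="amazon:VB">read</w> it again.'
    "</speak>"
)

# Parse the markup and list the role assigned to each occurrence of the word.
roles = [(w.text, w.get("role")) for w in ET.fromstring(ssml).iter("w")]
print(roles)
```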

Different systems have unique tags, but the standard tags should work on most systems.

How To Use SSML Tags In Speech Synthesis Systems

To add a long pause to a dialogue, place a <break> tag at the point where you'd like the system to pause. For instance, <speak>Hi, my name is Unrealspeech, and here's today's news. <break time="2s"/> Unrealspeech World publishes a new guide to SSML...</speak>.

Slow Down

You can also slow down the speed of the dialogue. To accomplish this, you can use <speak><prosody rate="x-slow">Hi, my name is Unrealspeech.</prosody></speak>. Note that you can nest SSML tags within each other, so several dialogue manipulations can be combined. This is similar to how you'd insert an <a> tag within a <p> tag in web development. For example, <p>Unrealspeech World rocks. <a href="https://unreal.world">Check it out here.</a></p>.

Raise the Pitch

If you'd like to raise the pitch of a single word and have that word pronounced in French, you could use <speak><prosody pitch="high"><lang xml:lang="fr-FR">Bonjour!</lang></prosody></speak>. These manipulations work well with standard TTS voices but may be less effective, or only partially supported, with neural voices.

The Limitations Of SSML

As flexible as SSML is, it isn’t always perfect. Every conversation designer has a story to tell about the pain SSML has caused them at some point in their career: meticulously tweaking phonemes to pronounce a brand name just right, or working in the appropriate pauses to make the cadence sound more natural. It all takes work.

At some point, if you need to make really big changes, you’ll start to affect the quality of the actual voice and it’ll have a detrimental impact on your application.

You may run into these limitations when working with SSML:

1. Complexity

SSML markup can add complexity to the text, making it harder to read and maintain.

2. Limited Support

Not all TTS engines or platforms support SSML, limiting its interoperability.

3. Learning Curve

Using SSML requires learning the markup language, which may require additional time and effort.

4. Compatibility Issues

SSML support varies across different TTS engines and platforms, leading to potential compatibility issues when deploying SSML-enhanced content.
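
One way to cope with engines that lack SSML support is to fall back to plain text. The sketch below (function name hypothetical) removes all markup and keeps only the text content. It does not apply sub aliases, and it keeps the fallback text of audio tags, so treat it as a starting point.

```python
import xml.etree.ElementTree as ET


def strip_ssml(ssml: str) -> str:
    """Return the text content of an SSML string with all markup removed."""
    root = ET.fromstring(ssml)
    text = "".join(root.itertext())
    return " ".join(text.split())  # collapse whitespace left behind by removed tags


print(strip_ssml(
    '<speak>Hi there. <break time="1s"/> <emphasis>Welcome</emphasis> back.</speak>'
))
```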

Best Practices And Tips For Using SSML

Familiarity with the full range of SSML tags and attributes, including lesser-known ones such as the spell-out value of say-as and the src attribute of audio, is essential for effective speech synthesis. Understanding the nuances of each one can greatly enhance the quality of the synthesized speech.

Optimization Strategies

Optimizing SSML documents involves balancing the use of various elements to achieve clear and natural-sounding speech. This includes careful consideration of break strength, prosody pitch, and emphasis levels.

Test and Iterate

Regularly test your SSML-enhanced content across different TTS engines and platforms to identify any compatibility issues or improvements needed, and iterate accordingly.
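
Part of that testing can be automated before any audio is generated. The sketch below checks that a string is well-formed XML, that its root is <speak>, and that it only uses tags from an allowlist. The allowlist and function name are illustrative; replace the set with the tags your target engine documents. It also assumes the SSML carries no xmlns declaration, since ElementTree would then prefix every tag name with the namespace.

```python
import xml.etree.ElementTree as ET

# Illustrative allowlist; not taken from any particular engine.
SUPPORTED = {"speak", "break", "emphasis", "prosody", "say-as", "sub", "p", "s"}


def check_ssml(ssml: str, supported=SUPPORTED) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]

    problems = []
    if root.tag != "speak":
        problems.append(f"root element is <{root.tag}>, expected <speak>")
    for element in root.iter():
        if element.tag not in supported:
            problems.append(f"unsupported tag <{element.tag}>")
    return problems


print(check_ssml('<speak><voice name="x">Hi</voice></speak>'))
```

Running the same check with a different allowlist per engine gives a quick first signal of compatibility problems, though listening to the actual output remains necessary.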

Try Unreal Speech for Free Today — Affordably and Scalably Convert Text into Natural-Sounding Speech with Our Text-to-Speech API

Unreal Speech offers a cost-effective and scalable text-to-speech API with natural-sounding AI voices. It provides the cheapest and most high-quality solution in the market, reducing text-to-speech costs by up to 90%. With its super-fast/low latency API, Unreal Speech delivers human-like AI voices with the option for per-word timestamps. Its simple and easy-to-use API allows for giving your LLM a voice and offering this functionality at scale.

If you are looking for an affordable, scalable, and realistic TTS solution to incorporate into your products, try Unreal Speech's text-to-speech API for free today to convert text into natural-sounding speech.
