How to customize generated speech with SSML tags

If you want to add an extra pause or emphasize a word in your Text-to-Speech audio, check out this article!

Written by Karyn Green
Updated over a week ago

Wave.video uses Amazon Polly technology to generate audio tracks from text. The default result isn't always perfect, and you might want to fine-tune the speech. That's where SSML tags come in!

How to start using SSML in your Text-to-Speech

It's easy! Wrap your text in an opening <speak> tag and a closing </speak> tag:

<speak>Hello!</speak>
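If you prepare your scripts outside the editor, a small helper can do the wrapping for you. The sketch below is only an illustration: the to_ssml name is made up for this example, and it assumes your text is plain (no SSML tags yet). Because SSML is XML, characters such as & and < have to be escaped, which the helper handles.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def to_ssml(text: str) -> str:
    """Wrap plain text in <speak> tags, escaping &, < and >."""
    return f"<speak>{escape(text)}</speak>"


ssml = to_ssml("Tom & Jerry say: 1 < 2")
ET.fromstring(ssml)  # raises ParseError if the result is not well-formed XML
print(ssml)
```

Paste the printed result into your Text-to-Speech field as usual.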

Please note that some SSML tags work only with regular voices, and others only with neural voices. Make sure that you're using the correct ones here.

How to make a pause

TTS automatically pauses after commas, at the ends of sentences, and between paragraphs. The easiest way to add an extra pause is the <break> tag. It's available for both regular and neural voices.

The strength attribute defines how strong the pause will be: weak, medium, strong, or x-strong. You can also set the exact length of the pause with the time attribute, in seconds (0.8s) or milliseconds (800ms). See the examples below:

<speak>
Oh, laziness, come, come to me, <break strength="strong"/> alone.
You’re called for by soft coolness and good rest <break time="0.8s"/>
Only in you I see my goddess own
</speak>
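If you generate many scripts, it can help to build <break> tags with a small function that catches typos before you paste the text. This is a minimal sketch, not part of Wave.video: break_tag is a hypothetical helper, and it only accepts the strength values listed above plus times written in seconds or milliseconds.

```python
import re

# Strength values listed in this article.
STRENGTHS = {"weak", "medium", "strong", "x-strong"}


def break_tag(strength=None, time=None) -> str:
    """Return a <break> tag with either a strength or an explicit time."""
    if (strength is None) == (time is None):
        raise ValueError("pass exactly one of strength or time")
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"unknown strength: {strength!r}")
        return f'<break strength="{strength}"/>'
    # Accept durations such as "0.8s" or "800ms".
    if not re.fullmatch(r"\d+(\.\d+)?(s|ms)", time):
        raise ValueError(f"bad time value: {time!r}")
    return f'<break time="{time}"/>'


line = f"Oh, laziness, come, come to me, {break_tag(strength='strong')} alone."
print(f"<speak>{line}</speak>")
```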

How to change the pitch of the voice or make it louder/quieter

You can make the voice sound louder or quieter with the <prosody> tag. Set its volume attribute to silent, x-soft, soft, medium, loud, or x-loud:

<speak>
Everybody's wondering <prosody volume="x-loud">where did the blues</prosody> come from?
</speak>

Want to control the volume more precisely? Specify a value in dB. Try it:

<speak>
And everything <prosody volume="-5dB">looks good</prosody> tonight
</speak>

Note: +6dB roughly doubles the volume, and -6dB roughly halves it.

Volume control is supported by both regular and neural Text-to-Speech.
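To get a feel for other dB values, you can use the standard decibel-to-amplitude formula, ratio = 10^(dB / 20). The snippet below is a rough sketch: it assumes the volume change follows this textbook formula, which matches the +6dB and -6dB figures in the note above, and the function name is just for illustration.

```python
def db_to_amplitude_ratio(db: float) -> float:
    """Convert a dB change to an amplitude multiplier (textbook formula)."""
    return 10 ** (db / 20)


for db in (6, -6, -5):
    print(f"{db:+d}dB -> x{db_to_amplitude_ratio(db):.2f}")
# +6dB -> x2.00
# -6dB -> x0.50
# -5dB -> x0.56
```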

Make speech faster or slower

The same <prosody> tag helps here! Set the speed with the rate attribute, using x-slow, slow, medium, fast, x-fast, or a percentage. Try this:

<speak>
<prosody rate="x-slow">Red lorry, yellow lorry.</prosody>
<prosody rate="fast">Red lorry, yellow lorry.</prosody>
<prosody rate="200%">Red lorry, yellow lorry.</prosody>
</speak>

This works for both neural and regular voices.
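Since rate and volume are both attributes of <prosody>, you can combine them in one tag. Here is a small sketch of a helper that does this; prosody is a hypothetical function name, and it simply passes your values through without checking them against what a given voice supports.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text: str, rate=None, volume=None) -> str:
    """Wrap text in a <prosody> tag with whichever attributes are given."""
    attrs = {"rate": rate, "volume": volume}
    attr_str = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in attrs.items()
        if value is not None
    )
    return f"<prosody{attr_str}>{escape(text)}</prosody>"


print(prosody("Red lorry, yellow lorry.", rate="x-slow"))
print(prosody("Red lorry, yellow lorry.", rate="fast", volume="-5dB"))
```

Remember to wrap the final text in <speak> tags before using it.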

Make an emphasis

To emphasize a word, use the <emphasis> tag with the level attribute. It has three options, which work as follows:

  • Strong: Increases the volume and slows the speaking rate so that the speech is louder and slower.

  • Moderate: Increases the volume and slows the speaking rate, but less than strong. Moderate is the default.

  • Reduced: Decreases the volume and speeds up the speaking rate. Speech is softer and faster.

Here's an example:

<speak>
<emphasis level="reduced">She is the one</emphasis>
who <emphasis level="strong">will notice</emphasis>
that the first snapdragon of Spring <emphasis level="moderate">is in bloom</emphasis>
</speak>

The <emphasis> tag doesn't work with neural voices.

Newscaster speech

Looking for newscaster-style speech? We've got that covered!

<speak>
<amazon:domain name="news">
From the Tuesday, April 16th, 1912 edition of The Guardian newspaper: The maiden voyage of the White Star liner Titanic, the largest ship ever launched, has ended in disaster. The Titanic started her trip from Southampton to New York on Wednesday. Late on Sunday night, she struck an iceberg off the Grand Banks of Newfoundland. By wireless telegraphy, she sent out signals of distress, and several liners were near enough to catch and respond to the call.
</amazon:domain>
</speak>

However, this trick is only available for certain neural voices:

  • Matthew and Joanna (en-US)

  • Lupe (es-US)

  • Amy (en-GB)
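If you build newscaster scripts programmatically, it's worth checking the voice first. The sketch below assumes the voice list above is complete; newscaster_ssml and NEWS_VOICES are names invented for this example.

```python
from xml.sax.saxutils import escape

# Voices this article lists as supporting the newscaster style.
NEWS_VOICES = {"Matthew": "en-US", "Joanna": "en-US", "Lupe": "es-US", "Amy": "en-GB"}


def newscaster_ssml(text: str, voice: str) -> str:
    """Wrap text in the news domain tag, rejecting unlisted voices."""
    if voice not in NEWS_VOICES:
        raise ValueError(f"{voice} is not listed as a newscaster voice")
    body = escape(text)
    return f'<speak><amazon:domain name="news">{body}</amazon:domain></speak>'


print(newscaster_ssml("The Titanic struck an iceberg.", "Amy"))
```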

Want to do more with SSML?

Find out all the options of this feature in the Amazon Polly documentation.
