IPA in SSML

Several text-to-speech (TTS) services today can produce natural-sounding speech audio from written text. Three of the best known are Amazon Polly, Google Cloud Text-to-Speech, and Microsoft Azure Speech Service.

A useful feature of these services is support for SSML (Speech Synthesis Markup Language), an XML-based language that lets you control aspects of speech such as volume, pitch, rate, and, most interestingly, pronunciation.

Customizing pronunciation with SSML

In SSML, you can customize the pronunciation of a word in the audio response using the <phoneme> tag. Here's an SSML example that overrides the pronunciation of the word "potatoes" by providing an explicit phonetic transcription:

<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-GB">
She's growing carrots and <phoneme alphabet="ipa"
ph="ˈblɑː">potatoes</phoneme> in her garden this year.
</speak>

Depending on the TTS implementation, pronunciation strings in SSML can use different phonetic alphabets, some of which are language-specific. The universal standard for phonetic transcription is the International Phonetic Alphabet (IPA), which many TTS services support.
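If you generate SSML programmatically, the <phoneme> element is easy to build with Python's standard library. The following is a minimal sketch, not any service's official client: the helper names phoneme and speak are hypothetical, and the x-sampa value at the end assumes the target service accepts that alphabet (check your provider's documentation).

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def phoneme(text: str, ph: str, alphabet: str = "ipa") -> str:
    """Wrap `text` in a <phoneme> element that carries the transcription `ph`."""
    # quoteattr picks a safe quote style and escapes &, <, > in attribute values.
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
        f"{escape(text)}</phoneme>"
    )


def speak(body: str, lang: str = "en-GB") -> str:
    """Wrap an already-escaped SSML fragment in a <speak> root element."""
    return (
        '<?xml version="1.0"?>\n'
        f'<speak xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>{body}</speak>'
    )


# Rebuild the example from this article.
ssml = speak(
    escape("She's growing carrots and ")
    + phoneme("potatoes", "ˈblɑː")
    + escape(" in her garden this year.")
)
print(ssml)

# The same transcription in X-SAMPA, where the stress mark is a double quote.
# Assumption: the service you target accepts alphabet="x-sampa".
print(phoneme("potatoes", '"blA:', alphabet="x-sampa"))
```

Escaping matters here because transcriptions in ASCII-based alphabets can contain characters such as a double quote that would otherwise break the attribute.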

IPA, the lingua franca of phonetics

In the SSML example above, the custom pronunciation is written in IPA, the standardized system of phonetic notation created by the International Phonetic Association.

As a written representation of speech sounds, IPA is a very powerful and complex tool used extensively in (computational) linguistics. You can also find IPA transcriptions in the pronunciation section of many dictionaries and linguistics textbooks.

IPA-to-speech

When the pronunciation is provided explicitly, as in the SSML above, the text inside the <phoneme> tag no longer matters: the synthesizer speaks the transcription in the ph attribute instead. This means that we're essentially dealing with "IPA-to-speech" rather than "text-to-speech".
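To make the "IPA-to-speech" idea concrete, here is a rough sketch of a function that turns a bare IPA string into a complete SSML document. The function name ipa_to_ssml and its placeholder parameter are hypothetical, and this is not necessarily how any particular IPA reader is implemented. It only builds the markup; sending it to a TTS service is left out.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize SSML elements without a namespace prefix (xmlns="...").
ET.register_namespace("", SSML_NS)


def ipa_to_ssml(ipa: str, lang: str = "en-GB", placeholder: str = "x") -> str:
    """Build an SSML document whose only content is one IPA transcription.

    `placeholder` is the text inside <phoneme>; it is there for
    well-formedness and is not expected to affect the audio.
    """
    speak = ET.Element(f"{{{SSML_NS}}}speak")
    speak.set(f"{{{XML_NS}}}lang", lang)  # serialized as xml:lang
    phoneme = ET.SubElement(
        speak, f"{{{SSML_NS}}}phoneme", {"alphabet": "ipa", "ph": ipa}
    )
    phoneme.text = placeholder
    return ET.tostring(speak, encoding="unicode")


# Two documents with different inner text carry the same pronunciation.
print(ipa_to_ssml("ˈblɑː", placeholder="potatoes"))
print(ipa_to_ssml("ˈblɑː", placeholder="anything"))
```

The lang argument still matters: it selects the language (and, on most services, the voice) used to render the IPA symbols.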

Using this feature of speech synthesizers, I've created a simple IPA reader that converts IPA to speech in 42 languages.

Bottom line

By using IPA, a universal phonetic notation, in SSML, you gain fine-grained control over the audio a TTS service produces and can create tailored audio experiences for a variety of applications.

Made by Anton Vasetenkov.