Speech Synthesis Markup Language (SSML) is an XML-based markup language used with text-to-speech (TTS) applications. It gives you finer control over how text is read aloud, including how the TTS engine pronounces proper names, acronyms, numbers, and similar content.
Just as HTML is an alternative to plain text on the web, SSML is an alternative to submitting plain text to the speech synthesizer: you mark up the input text with SSML elements that specify details such as pauses, pronunciation, and emphasis. The following is an example of an SSML document:
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-GB">
Hi!
<break time="2s" />
Would you like some <phoneme alphabet="ipa" ph="ləˈsænjə">lasagne</phoneme>?
</speak>
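When the input starts out as plain text, it has to be wrapped in a `<speak>` root element and XML-escaped before it can be submitted as SSML. The following Python snippet is a minimal sketch of that step. The helper name `plaintext_to_ssml` and the sample sentence are illustrative assumptions, not part of any particular TTS SDK; only the namespace and attributes are taken from the example above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Namespace used by the <speak> element in the example document above.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def plaintext_to_ssml(text: str, lang: str = "en-GB") -> str:
    """Wrap plain text in a minimal SSML document (hypothetical helper)."""
    body = escape(text)  # escapes &, < and > so the result stays well-formed XML
    return (
        '<?xml version="1.0"?>\n'
        f'<speak version="1.1" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"{body}"
        "</speak>"
    )


if __name__ == "__main__":
    ssml = plaintext_to_ssml("Fish & chips cost < 10 pounds.")
    print(ssml)
    # Parsing the result is a quick check that it is well-formed XML.
    root = ET.fromstring(ssml)
    print(root.text)
```

Escaping matters because characters such as `&` and `<` are legal in plain text but would otherwise make the SSML document malformed.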
The following table summarizes the most commonly used SSML elements:
Element | Description |
---|---|
<speak> | The root element of an SSML document. |
<break> | Represents a pause of specified duration. |
<say-as> | Used to specify additional information about how content such as numbers, dates, and times is to be pronounced. |
<audio> | Used to insert an audio clip into the synthesized audio. |
<sub> | Used to substitute input text with an alternative utterance. |
<prosody> | Used to customize prosodic features of the contained text, such as pitch, rate, and volume. |
<emphasis> | Requests that the contained text be spoken with more or less emphasis. |
<phoneme> | Used to provide a phonetic pronunciation for the contained text. |
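
To show how the elements in the table combine, the sketch below embeds a single SSML document that uses each of them once, then parses it with Python's standard `xml.etree.ElementTree` to confirm it is well-formed and to list the elements it contains. The sentence content, the `interpret-as="cardinal"` value, and the `https://example.com/chime.wav` URL are illustrative assumptions; supported `interpret-as` values and audio formats vary between TTS engines, so check your engine's documentation.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML document; attribute values and the audio URL are assumptions.
SSML_DOC = """<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-GB">
  You have <say-as interpret-as="cardinal">3</say-as> items in your basket.
  <break time="500ms"/>
  The <phoneme alphabet="ipa" ph="ləˈsænjə">lasagne</phoneme> ships from the
  <sub alias="United Kingdom">UK</sub>
  <prosody rate="slow" pitch="low">within three working days</prosody>,
  and delivery is <emphasis level="strong">free</emphasis>.
  <audio src="https://example.com/chime.wav">chime</audio>
</speak>
"""


def elements_used(ssml: str) -> list[str]:
    """Return the distinct element names in an SSML string, in document order."""
    root = ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    names = []
    for element in root.iter():
        local_name = element.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
        if local_name not in names:
            names.append(local_name)
    return names


if __name__ == "__main__":
    print(elements_used(SSML_DOC))
```

The text inside `<audio>` acts as fallback content if the clip cannot be played, and the text inside `<sub>` is what appears in the written input while the `alias` value is what gets spoken.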