When to use SSML
Plain text is great when you want the voice to just speak naturally. SSML is for when you need:
- Specific pauses (500ms between paragraphs, 2s before a punchline)
- Emphasis on specific words
- Switching voices mid-speech (dialog, character voice)
- Switching languages mid-speech (bilingual narration)
- Custom pronunciations (brand names, acronyms)
- Overriding prosody on specific phrases
Entering SSML mode
Tick "Use SSML format" in the Text-to-Speech editor. The plain editor turns into an SSML editor with a toolbar of common inserts.
The SSML editor toolbar
One-click inserts for common tags:
- Voice: wrap selected text in <voice name="...">
- Style: <mstts:express-as style="cheerful">
- Prosody: <prosody rate="fast" pitch="+10%">
- Break: <break time="500ms" /> for precise pauses
- Language: <lang xml:lang="es-ES"> for inline language switches
- Emphasis: <emphasis level="strong">
Example: two-voice dialog
<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US">
<voice name="en-US-AvaNeural">Hey, did you try the new feature?</voice>
<break time="400ms" />
<voice name="en-US-AndrewNeural">Yeah, it's great. Let me show you.</voice>
</speak>
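If you generate dialog from a script rather than typing it, the same structure can be assembled programmatically. The following is a minimal Python sketch, not part of Speech Studio: the function name build_dialog and its parameters are hypothetical. It nests each pause inside the preceding voice element, a conservative placement chosen for this sketch, and escapes the spoken text so characters like & do not break the markup.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def build_dialog(turns, pause_ms=400, lang="en-US"):
    """Build an SSML document with one <voice> element per (voice_name, text) turn.

    Hypothetical helper for illustration; not a Speech Studio API.
    """
    parts = [
        '<speak xmlns="http://www.w3.org/2001/10/synthesis" '
        f'version="1.0" xml:lang={quoteattr(lang)}>'
    ]
    for i, (voice_name, text) in enumerate(turns):
        is_last = i == len(turns) - 1
        # Add a pause after every turn except the final one.
        pause = "" if is_last else f'<break time="{pause_ms}ms" />'
        parts.append(f"<voice name={quoteattr(voice_name)}>{escape(text)}{pause}</voice>")
    parts.append("</speak>")
    ssml = "\n".join(parts)
    ET.fromstring(ssml)  # raises ParseError if the result is not well-formed XML
    return ssml

dialog = build_dialog([
    ("en-US-AvaNeural", "Hey, did you try the new feature?"),
    ("en-US-AndrewNeural", "Yeah, it's great. Let me show you."),
])
```

The resulting string can be pasted into the SSML editor like any hand-written document.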
Example: precise pauses
Welcome to the podcast. <break time="1s" />
Today we're talking about desktop software. <break time="500ms" />
And why it still matters.
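For longer scripts, pauses like these can be added from a list of lines rather than by hand. The sketch below is one way to do it in Python, under stated assumptions: with_pauses is a hypothetical helper, and the duration check only accepts values written in seconds or milliseconds (such as "1s" or "500ms"), which is narrower than what SSML itself permits.

```python
import re
from xml.sax.saxutils import escape

# Accepts values such as "500ms", "1s", or "1.5s" (an assumption of this sketch).
_DURATION = re.compile(r"^\d+(\.\d+)?(ms|s)$")

def with_pauses(segments):
    """Join (text, pause) pairs into an SSML fragment, one segment per line.

    A pause of None means no <break> is added after that segment.
    """
    lines = []
    for text, pause in segments:
        line = escape(text)
        if pause is not None:
            if not _DURATION.match(pause):
                raise ValueError(f"unsupported break time: {pause!r}")
            line += f' <break time="{pause}" />'
        lines.append(line)
    return "\n".join(lines)

fragment = with_pauses([
    ("Welcome to the podcast.", "1s"),
    ("Today we're talking about desktop software.", "500ms"),
    ("And why it still matters.", None),
])
```

Rejecting malformed durations up front means a typo such as "500 ms" fails in your script instead of at generation time.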
Validation
Tick Validate SSML before generating. Speech Studio checks syntax locally before sending to Azure, so you catch typos such as an unclosed tag without spending API calls.
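To give a sense of what a local syntax check can catch, here is a rough Python sketch. It is not Speech Studio's validator, whose rules are not described here; check_ssml is a hypothetical function that only tests XML well-formedness and that the root element is speak.

```python
import xml.etree.ElementTree as ET

def check_ssml(ssml):
    """Return a list of problems; an empty list means these basic checks passed."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple pointing at the fault.
        line, column = err.position
        return [f"not well-formed at line {line}, column {column}: {err}"]
    problems = []
    # Parsed tags look like "{namespace}name", so drop the namespace part.
    if root.tag.rsplit("}", 1)[-1] != "speak":
        problems.append("root element must be <speak>")
    return problems

sample = (
    '<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US">'
    '<voice name="en-US-AvaNeural">Hi</voice>'
    "</speak>"
)
result = check_ssml(sample)
```

A check like this finds unclosed tags, mismatched quotes, and undeclared prefixes, but it cannot tell whether a voice name or style actually exists on the service.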
MultiTalker voices
With a MultiTalker voice (e.g. Ava & Andrew), you don't need to wrap each speaker's lines in a separate <voice> element. The model infers turn-taking from your text. Perfect for quick dialog generation.