SSML Support
Kaizen Speech Studio supports Speech Synthesis Markup Language (SSML), giving you fine-grained control over how your text is spoken. With SSML, you can add pauses, control emphasis, mix multiple voices, and create professional-quality audio productions.
Don't want to write XML manually?
Use our Visual SSML Editor to build SSML with a simple point-and-click interface — no coding required. Select voices, adjust prosody with sliders, add emotions, and copy the generated XML directly into Speech Studio.
What Is SSML?
SSML is an XML-based markup language that tells the text-to-speech engine exactly how to pronounce your text. Instead of relying on the AI to interpret your text on its own, SSML lets you specify:
- Pauses between words or sentences
- Emphasis on specific words
- Pitch, rate, and volume changes within the text
- Multiple voices in the same audio file
- Pronunciation overrides for specific words
Basic SSML Structure
Every SSML document is wrapped in a single <speak> root element:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
Your text goes here.
</speak>
Common SSML Tags
Break (Pause)
Add a pause at any point in the text:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
Welcome to our presentation.
<break time="1s"/>
Let us begin with the introduction.
</speak>
| Attribute | Values | Example |
|---|---|---|
| time | Milliseconds or seconds | 500ms, 1s, 2s |
| strength | Predefined pauses | none, x-weak, weak, medium, strong, x-strong |
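As a minimal sketch, the strength attribute can stand in for time when the relative weight of a pause matters more than its exact duration. The sentence text is illustrative, and the actual length of each strength level depends on the voice and engine.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  First point.
  <!-- A short, barely noticeable pause -->
  <break strength="weak"/>
  Second point.
  <!-- The longest predefined pause -->
  <break strength="x-strong"/>
  Final point.
</speak>
```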
Prosody (Speed, Pitch, Volume)
Control how the text is spoken:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<prosody rate="slow" pitch="low" volume="soft">
This text is spoken slowly, in a low pitch, and at a soft volume.
</prosody>
</speak>
Rate values: x-slow, slow, medium, fast, x-fast, or a percentage like +20% or -30%
Pitch values: x-low, low, medium, high, x-high, or relative values like +10%
Volume values: silent, x-soft, soft, medium, loud, x-loud
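The named values above can be swapped for relative percentages when you need finer steps. The following sketch uses the percentage forms listed above; the sentence text is illustrative, and how large a change sounds in practice may vary by voice.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Slightly faster, slightly higher, and louder than the default -->
  <prosody rate="+20%" pitch="+10%" volume="loud">
    This sentence is spoken a little faster and a little higher.
  </prosody>
  <!-- Only the rate changes here; pitch and volume stay at their defaults -->
  <prosody rate="-30%">
    This sentence is spoken noticeably slower.
  </prosody>
</speak>
```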
Emphasis
Add emphasis to specific words:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
This is <emphasis level="strong">extremely</emphasis> important.
</speak>
Levels: reduced, none, moderate, strong
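A short sketch of the milder levels, for comparison with the strong example above. The sentences are illustrative, and how audible each level is depends on the selected voice.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- moderate: a noticeable but not forceful stress -->
  Please <emphasis level="moderate">save your work</emphasis> before closing.
  <!-- reduced: the phrase is de-emphasized relative to the surrounding text -->
  <emphasis level="reduced">This step is optional.</emphasis>
</speak>
```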
Voice (Multiple Voices)
Mix different voices in the same audio file:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-JennyNeural">
Hi, I'm Jenny. Let me introduce my colleague.
</voice>
<voice name="en-US-GuyNeural">
Hello! I'm Guy. Nice to meet you.
</voice>
</speak>
Great for Dialogues
The multi-voice feature is perfect for creating podcast-style conversations, audiobook dialogues, or educational content with different speakers.
Say-As (Pronunciation Control)
Control how specific content is interpreted:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
The date is <say-as interpret-as="date" format="mdy">03/12/2026</say-as>.
Call us at <say-as interpret-as="telephone">1-800-555-0199</say-as>.
The total is <say-as interpret-as="currency">$49.99</say-as>.
</speak>
Interpret-as values: date, time, telephone, currency, number, spell-out, characters, ordinal, cardinal
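A few of the other values from the list above might be used as follows. This is a sketch with made-up sample text; the exact spoken form of each value depends on the voice and language.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- spell-out: read character by character instead of as a word -->
  Your confirmation code is <say-as interpret-as="spell-out">KZ42</say-as>.
  <!-- ordinal: "3" is read as "third" -->
  You finished in <say-as interpret-as="ordinal">3</say-as> place.
  <!-- cardinal: read as a whole number rather than digit by digit -->
  We have <say-as interpret-as="cardinal">250</say-as> seats available.
</speak>
```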
Complete Example
Here is a full SSML example combining multiple features:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-JennyNeural">
<prosody rate="medium" pitch="medium">
Welcome to Kaizen Speech Studio!
<break time="500ms"/>
Today, we will learn about <emphasis level="strong">SSML</emphasis>,
the Speech Synthesis Markup Language.
<break time="1s"/>
</prosody>
<prosody rate="slow" pitch="low" volume="soft">
SSML gives you complete control over how your text is spoken.
</prosody>
</voice>
<break time="1s"/>
<voice name="en-US-GuyNeural">
And I am here to demonstrate a second voice in the same audio file.
<break time="500ms"/>
Pretty cool, right?
</voice>
</speak>
Using SSML in Speech Studio
- Open Speech Studio and navigate to the TTS screen
- Switch to SSML mode (look for an SSML toggle or tab)
- Enter your SSML markup in the text area
- Click Generate to produce the audio
- Preview and save as usual
Valid XML Required
SSML must be valid XML. Make sure all tags are properly opened and closed. A missing closing tag will cause an error.
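The sketch below shows a well-formed document and marks the three places where errors most often creep in: unclosed tags, unescaped special characters, and empty elements without a trailing slash. The text content is illustrative. The XML comments are there for explanation only; whether Speech Studio accepts comments in its input is an assumption, so remove them if the editor rejects them.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Every opening tag has a matching closing tag -->
  <prosody rate="slow">
    <!-- In text, "&" is written as &amp; and "<" is written as &lt; -->
    Terms &amp; conditions apply to orders &lt; 10 units.
  </prosody>
  <!-- Empty elements such as break are self-closed with "/>" -->
  <break time="500ms"/>
  Thank you.
</speak>
```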
Tips for Natural-Sounding SSML
- Use breaks between sections -- A <break time="500ms"/> between paragraphs sounds more natural than continuous speech
- Vary the rate -- Slow down for important points and speed up for transitions
- Use emphasis sparingly -- Too much emphasis sounds unnatural
- Mix voices for dialogue -- Two different voices make conversations sound realistic
- Test incrementally -- Build your SSML step by step, testing after each addition
- Use prosody for emotion -- Combine rate, pitch, and volume to convey different moods
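As one way to apply the last tip, the sketch below contrasts an excited delivery with a calm one by changing rate, pitch, and volume together. The wording and the specific value combinations are illustrative choices, not recommended presets.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Excited: faster, higher, louder -->
  <prosody rate="fast" pitch="high" volume="loud">
    We just hit our goal!
  </prosody>
  <break time="500ms"/>
  <!-- Calm: slower, lower, softer -->
  <prosody rate="slow" pitch="low" volume="soft">
    Now, let's take a moment to reflect.
  </prosody>
</speak>
```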
Troubleshooting SSML
"Invalid SSML" error:
- Check that all tags are properly closed
- Verify the <speak> wrapper tag is present
- Make sure voice names are correct (e.g., en-US-JennyNeural)
Audio sounds choppy:
- Reduce the number of break tags
- Use smoother rate transitions (avoid jumping from x-slow to x-fast)
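A smoother transition might step through adjacent rate values instead of jumping between extremes. The following is a sketch with illustrative text; the right step size depends on the voice and the content.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Move one step at a time: slow, then medium, then fast -->
  <prosody rate="slow">Let's start with the key idea.</prosody>
  <prosody rate="medium">Here is some supporting detail.</prosody>
  <prosody rate="fast">And a quick recap before we move on.</prosody>
</speak>
```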
Voice not changing:
- Double-check the voice name spelling
- Ensure the voice name matches an available Azure neural voice
Next Steps
- Text-to-Speech Guide -- Standard TTS without SSML
- Voice Selection -- Find voice names for SSML
- Video Dubbing -- Apply SSML concepts to video dubbing
- FAQ -- Common questions
Need help? Contact us at [email protected]