SSML Support

Kaizen Speech Studio supports Speech Synthesis Markup Language (SSML), giving you fine-grained control over how your text is spoken. With SSML, you can add pauses, control emphasis, mix multiple voices, and create professional-quality audio productions.

Don't want to write XML manually?

Use our Visual SSML Editor to build SSML through a point-and-click interface, with no coding required. Select voices, adjust prosody with sliders, add emotions, and copy the generated XML directly into Speech Studio.

What Is SSML?

SSML is an XML-based markup language that tells the text-to-speech engine exactly how to speak your text. Rather than leaving the AI to interpret the text on its own, you can specify:

Pauses between words or sentences
Emphasis on specific words
Pitch, rate, and volume changes within the text
Multiple voices in the same audio file
Pronunciation overrides for specific words

Basic SSML Structure

Every SSML document is wrapped in a <speak> element, which declares the SSML version, the XML namespace, and the language of the text:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    Your text goes here.
</speak>

Common SSML Tags

Break (Pause)

Add a pause at any point in the text:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    Welcome to our presentation.
    <break time="1s"/>
    Let us begin with the introduction.
</speak>

Attribute	Accepts	Example values
`time`	A duration in milliseconds or seconds	`500ms`, `1s`, `2s`
`strength`	A predefined pause strength	`none`, `x-weak`, `weak`, `medium`, `strong`, `x-strong`
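
The example above uses `time`. As a small sketch, the `strength` attribute is written the same way but names a relative pause instead of a duration. How long each strength actually lasts depends on the voice, so it is worth checking the result by ear:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    First point.
    <!-- strength picks a relative pause; the exact length depends on the voice -->
    <break strength="weak"/>
    Second point.
    <break strength="x-strong"/>
    And now, a new topic.
</speak>
```

Use `time` when you need an exact duration and `strength` when you only care that one pause is longer than another.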

Prosody (Speed, Pitch, Volume)

Control how the text is spoken:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <prosody rate="slow" pitch="low" volume="soft">
        This text is spoken slowly, in a low pitch, and at a soft volume.
    </prosody>
</speak>

Rate values: x-slow, slow, medium, fast, x-fast, or a percentage like +20% or -30%

Pitch values: x-low, low, medium, high, x-high, or relative values like +10%

Volume values: silent, x-soft, soft, medium, loud, x-loud
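
Named values and relative values can be combined on the same element. The sketch below uses only the value forms listed above; the sentences themselves are placeholder text:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <!-- relative values are offsets from the voice's default -->
    <prosody rate="+20%" pitch="+10%" volume="loud">
        This sentence is faster, slightly higher, and louder than normal.
    </prosody>
    <prosody rate="-30%">
        This one slows down by thirty percent.
    </prosody>
</speak>
```

Relative values are handy when a named step such as `fast` is too large a jump for the voice you are using.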

Emphasis

Add emphasis to specific words:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    This is <emphasis level="strong">extremely</emphasis> important.
</speak>

Levels: reduced, none, moderate, strong
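
The weaker levels are useful too. As a sketch, `reduced` plays a phrase down while `moderate` gives a word a light stress. How audible the difference is will likely vary by voice, so treat this as a starting point:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <!-- reduced plays a phrase down; moderate gives a word a light stress -->
    The fee, <emphasis level="reduced">as always</emphasis>, is due
    <emphasis level="moderate">before</emphasis> the first of the month.
</speak>
```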

Voice (Multiple Voices)

Mix different voices in the same audio file:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        Hi, I'm Jenny. Let me introduce my colleague.
    </voice>
    <voice name="en-US-GuyNeural">
        Hello! I'm Guy. Nice to meet you.
    </voice>
</speak>

Great for Dialogues

The multi-voice feature is perfect for creating podcast-style conversations, audiobook dialogues, or educational content with different speakers.

Say-As (Pronunciation Control)

Control how specific content is interpreted:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    The date is <say-as interpret-as="date" format="mdy">03/12/2026</say-as>.
    Call us at <say-as interpret-as="telephone">1-800-555-0199</say-as>.
    The total is <say-as interpret-as="currency">$49.99</say-as>.
</speak>

Interpret-as values: date, time, telephone, currency, number, spell-out, characters, ordinal, cardinal
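
The example above covers `date`, `telephone`, and `currency`. A short sketch of three more values from the list follows; the exact wording the engine produces may differ between voices and languages, and the code and numbers are made up for illustration:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <!-- spell-out reads the text letter by letter -->
    Your confirmation code is <say-as interpret-as="spell-out">KZN</say-as>.
    <!-- ordinal reads 3 as "third"; cardinal reads 250 as a whole number -->
    You finished <say-as interpret-as="ordinal">3</say-as>
    out of <say-as interpret-as="cardinal">250</say-as> runners.
</speak>
```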

Complete Example

Here is a full SSML example combining multiple features:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <prosody rate="medium" pitch="medium">
            Welcome to Kaizen Speech Studio!
            <break time="500ms"/>
            Today, we will learn about <emphasis level="strong">SSML</emphasis>,
            the Speech Synthesis Markup Language.
            <break time="1s"/>
        </prosody>
        <prosody rate="slow" pitch="low" volume="soft">
            SSML gives you complete control over how your text is spoken.
        </prosody>
    </voice>
    <break time="1s"/>
    <voice name="en-US-GuyNeural">
        And I am here to demonstrate a second voice in the same audio file.
        <break time="500ms"/>
        Pretty cool, right?
    </voice>
</speak>

Using SSML in Speech Studio

Open Speech Studio and navigate to the TTS screen
Switch to SSML mode (look for an SSML toggle or tab)
Enter your SSML markup in the text area
Click Generate to produce the audio
Preview and save as usual

Valid XML Required

SSML must be valid XML. Make sure all tags are properly opened and closed. A missing closing tag will cause an error.
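
Unclosed tags are not the only way to break validity. Because SSML is XML, the characters `&` and `<` are reserved and need escaping when they appear in the text to be spoken. A minimal sketch using the standard XML entities `&amp;` and `&lt;` (the sentences are placeholder text):

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <!-- a bare ampersand or less-than sign would make the document invalid -->
    Smith &amp; Sons opens at nine.
    Remember that 3 &lt; 5.
</speak>
```

The engine receives the unescaped characters; how it speaks a symbol such as `<` depends on the voice.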

Tips for Natural-Sounding SSML

Use breaks between sections -- A <break time="500ms"/> between paragraphs sounds more natural than continuous speech
Vary the rate -- Slow down for important points and speed up for transitions
Use emphasis sparingly -- Too much emphasis sounds unnatural
Mix voices for dialogue -- Two different voices make conversations sound realistic
Test incrementally -- Build your SSML step by step, testing after each addition
Use prosody for emotion -- Combine rate, pitch, and volume to convey different moods
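
Several of these tips can be applied together. The sketch below speeds up a transition, pauses between sections, and slows down for a key point with a single emphasized word. The specific values are a starting point to tune by ear, not recommended settings:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <!-- transition: slightly faster -->
    <prosody rate="+10%">
        Before we move on, a quick reminder.
    </prosody>
    <!-- short pause between sections -->
    <break time="500ms"/>
    <!-- key point: slower, with one emphasized word -->
    <prosody rate="slow">
        Back up your files <emphasis level="moderate">before</emphasis> you update.
    </prosody>
</speak>
```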

Troubleshooting SSML

"Invalid SSML" error:

Check that all tags are properly closed
Verify the <speak> wrapper tag is present
Make sure voice names are correct (e.g., en-US-JennyNeural)
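
As a sketch, here is a document that passes the checks above, with comments marking the spots where mistakes most often creep in. The voice name is the one used elsewhere on this page:

```xml
<!-- 1. the speak wrapper is present, with version, xmlns, and xml:lang -->
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <!-- 2. the voice name is spelled exactly as listed -->
    <voice name="en-US-JennyNeural">
        Thanks for calling.
        <!-- 3. empty elements such as break end with a slash -->
        <break time="500ms"/>
        <!-- 4. paired elements close in the reverse order they were opened -->
        <prosody rate="slow"><emphasis level="moderate">Please</emphasis> hold.</prosody>
    </voice>
</speak>
```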

Audio sounds choppy:

Reduce the number of break tags
Use smoother rate transitions (avoid jumping from x-slow to x-fast)
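
One way to smooth a rate change, sketched below, is to step through an intermediate value rather than jumping between extremes. The three-step split is an illustration, not a rule:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <!-- step the rate down gradually instead of jumping from fast to slow -->
    <prosody rate="fast">That covers the background.</prosody>
    <prosody rate="medium">Now for the part that matters.</prosody>
    <prosody rate="slow">Please read the safety notice first.</prosody>
</speak>
```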

Voice not changing:

Double-check the voice name spelling
Ensure the voice name matches an available Azure neural voice

Next Steps

Text-to-Speech Guide -- Standard TTS without SSML
Voice Selection -- Find voice names for SSML
Video Dubbing -- Apply SSML concepts to video dubbing
FAQ -- Common questions

Need help? Contact us at [email protected]