SSML Support

Kaizen Speech Studio supports Speech Synthesis Markup Language (SSML), giving you fine-grained control over how your text is spoken. With SSML, you can add pauses, control emphasis, mix multiple voices, and create professional-quality audio productions.

Don't want to write XML manually?

Use our Visual SSML Editor to build SSML with a simple point-and-click interface — no coding required. Select voices, adjust prosody with sliders, add emotions, and copy the generated XML directly into Speech Studio.

What Is SSML?

SSML is an XML-based markup language that tells the text-to-speech engine how to speak your text. Rather than leaving the interpretation entirely to the engine, you can use SSML to specify:

  • Pauses between words or sentences
  • Emphasis on specific words
  • Pitch, rate, and volume changes within the text
  • Multiple voices in the same audio file
  • Pronunciation overrides for specific words

Basic SSML Structure

Every SSML document is wrapped in a <speak> element, which declares the SSML version, the XML namespace, and the language of the text:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    Your text goes here.
</speak>
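
If you generate SSML from a script rather than typing it by hand, the wrapper can be added programmatically. The Python sketch below shows one way to do that. `wrap_in_speak` is a hypothetical helper name, not part of Speech Studio, and it assumes the input is plain text that contains no SSML tags of its own.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def wrap_in_speak(text: str, lang: str = "en-US") -> str:
    """Wrap plain text in a <speak> element (hypothetical helper)."""
    # escape() replaces &, <, and > so the text stays valid XML.
    body = escape(text)
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>\n'
        f"    {body}\n"
        "</speak>"
    )


print(wrap_in_speak("Terms & conditions apply."))
```

The result can be pasted into SSML mode like any hand-written document.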

Common SSML Tags

Break (Pause)

Add a pause at any point in the text:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    Welcome to our presentation.
    <break time="1s"/>
    Let us begin with the introduction.
</speak>

Attribute   Values                     Example
time        Milliseconds or seconds    500ms, 1s, 2s
strength    Predefined pauses          none, x-weak, weak, medium, strong, x-strong

Prosody (Speed, Pitch, Volume)

Control how the text is spoken:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <prosody rate="slow" pitch="low" volume="soft">
        This text is spoken slowly, in a low pitch, and at a soft volume.
    </prosody>
</speak>

Rate values: x-slow, slow, medium, fast, x-fast, or a percentage like +20% or -30%

Pitch values: x-low, low, medium, high, x-high, or relative values like +10%

Volume values: silent, x-soft, soft, medium, loud, x-loud
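
When prosody values come from user input or a settings file, checking them before you generate audio can catch typos early. The following Python sketch validates against the value lists given above. `prosody_tag` is a hypothetical helper, and the percentage pattern is an assumption based on the +20%, -30%, and +10% examples; the engine may accept other forms that this check rejects.

```python
import re
from xml.sax.saxutils import escape

# Named values as listed in this guide.
RATES = {"x-slow", "slow", "medium", "fast", "x-fast"}
PITCHES = {"x-low", "low", "medium", "high", "x-high"}
VOLUMES = {"silent", "x-soft", "soft", "medium", "loud", "x-loud"}

# Assumed pattern for signed percentages such as +20% or -30%.
PERCENT = re.compile(r"[+-]\d+(\.\d+)?%")


def _check(value: str, named: set, allow_percent: bool) -> str:
    """Return the value unchanged if it is allowed, else raise ValueError."""
    if value in named or (allow_percent and PERCENT.fullmatch(value)):
        return value
    raise ValueError(f"unsupported prosody value: {value!r}")


def prosody_tag(text: str, rate: str = "medium", pitch: str = "medium",
                volume: str = "medium") -> str:
    """Build a <prosody> element after validating each attribute."""
    rate = _check(rate, RATES, allow_percent=True)
    pitch = _check(pitch, PITCHES, allow_percent=True)
    volume = _check(volume, VOLUMES, allow_percent=False)
    return (
        f'<prosody rate="{rate}" pitch="{pitch}" volume="{volume}">'
        f"{escape(text)}</prosody>"
    )


print(prosody_tag("This part is a little faster.", rate="+20%"))
```

A misspelled value such as rate="quick" raises a ValueError instead of producing markup that may fail later.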

Emphasis

Add emphasis to specific words:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    This is <emphasis level="strong">extremely</emphasis> important.
</speak>

Levels: reduced, none, moderate, strong

Voice (Multiple Voices)

Mix different voices in the same audio file:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        Hi, I'm Jenny. Let me introduce my colleague.
    </voice>
    <voice name="en-US-GuyNeural">
        Hello! I'm Guy. Nice to meet you.
    </voice>
</speak>

Great for Dialogues

The multi-voice feature is perfect for creating podcast-style conversations, audiobook dialogues, or educational content with different speakers.
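
For longer dialogues, writing each <voice> block by hand gets repetitive. As a rough sketch, the Python function below turns a list of (voice name, text) pairs into one SSML document. `build_dialogue` is a hypothetical helper, and the two voice names are simply the ones used in the example above.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_dialogue(lines, lang="en-US"):
    """Turn (voice_name, text) pairs into one SSML document."""
    parts = [f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>']
    for voice_name, text in lines:
        # One <voice> element per line of dialogue, in script order.
        parts.append(f"    <voice name={quoteattr(voice_name)}>")
        parts.append(f"        {escape(text)}")
        parts.append("    </voice>")
    parts.append("</speak>")
    return "\n".join(parts)


script = [
    ("en-US-JennyNeural", "Hi, I'm Jenny. Let me introduce my colleague."),
    ("en-US-GuyNeural", "Hello! I'm Guy. Nice to meet you."),
]
print(build_dialogue(script))
```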

Say-As (Pronunciation Control)

Control how specific content is interpreted:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    The date is <say-as interpret-as="date" format="mdy">03/12/2026</say-as>.
    Call us at <say-as interpret-as="telephone">1-800-555-0199</say-as>.
    The total is <say-as interpret-as="currency">$49.99</say-as>.
</speak>

Interpret-as values: date, time, telephone, currency, number, spell-out, characters, ordinal, cardinal

Complete Example

Here is a full SSML example combining multiple features:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <prosody rate="medium" pitch="medium">
            Welcome to Kaizen Speech Studio!
            <break time="500ms"/>
            Today, we will learn about <emphasis level="strong">SSML</emphasis>,
            the Speech Synthesis Markup Language.
            <break time="1s"/>
        </prosody>
        <prosody rate="slow" pitch="low" volume="soft">
            SSML gives you complete control over how your text is spoken.
        </prosody>
    </voice>
    <break time="1s"/>
    <voice name="en-US-GuyNeural">
        And I am here to demonstrate a second voice in the same audio file.
        <break time="500ms"/>
        Pretty cool, right?
    </voice>
</speak>

Using SSML in Speech Studio

  1. Open Speech Studio and navigate to the TTS screen
  2. Switch to SSML mode (look for an SSML toggle or tab)
  3. Enter your SSML markup in the text area
  4. Click Generate to produce the audio
  5. Preview and save as usual

Valid XML Required

SSML must be valid XML. Make sure every tag is properly opened and closed; a missing closing tag will cause an error. Reserved characters in your text must also be escaped: write &amp; for & and &lt; for <.
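
If you want to catch these problems before pasting markup into Speech Studio, a quick local check is possible with Python's standard library. The sketch below only tests that the markup is well-formed XML with a <speak> root; it does not confirm that the engine supports every tag or attribute. `check_ssml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def check_ssml(ssml: str):
    """Return None if the markup parses, else a short error message."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as err:
        return f"Not well-formed XML: {err}"
    # With a namespace declared, ElementTree reports the tag as "{namespace}speak".
    if root.tag.rsplit("}", 1)[-1] != "speak":
        return "Root element must be <speak>"
    return None


good = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xml:lang="en-US">Hello.<break time="1s"/></speak>'
)
bad = good.replace("</speak>", "")  # simulate a missing closing tag

print(check_ssml(good))
print(check_ssml(bad))
```

The first call prints None; the second prints a message that includes the parser's line and column, which helps locate the unclosed tag.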

Tips for Natural-Sounding SSML

  1. Use breaks between sections: A <break time="500ms"/> between paragraphs sounds more natural than continuous speech
  2. Vary the rate: Slow down for important points and speed up for transitions
  3. Use emphasis sparingly: Too much emphasis sounds unnatural
  4. Mix voices for dialogue: Two different voices make conversations sound realistic
  5. Test incrementally: Build your SSML step by step, testing after each addition
  6. Use prosody for emotion: Combine rate, pitch, and volume to convey different moods

Troubleshooting SSML

"Invalid SSML" error:

  • Check that all tags are properly closed
  • Verify the <speak> wrapper tag is present
  • Make sure voice names are correct (e.g., en-US-JennyNeural)

Audio sounds choppy:

  • Reduce the number of break tags
  • Use smoother rate transitions (avoid jumping from x-slow to x-fast)

Voice not changing:

  • Double-check the voice name spelling
  • Ensure the voice name matches an available Azure neural voice
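
Misspelled voice names are easy to miss by eye. One possible way to spot them is to pull every voice name out of the markup and compare it with a list you trust, as in the Python sketch below. `voice_names` is a hypothetical helper, and `KNOWN_VOICES` is a placeholder set that you would replace with the voices actually available to you.

```python
import xml.etree.ElementTree as ET


def voice_names(ssml: str):
    """List the name attribute of every <voice> element, in document order."""
    root = ET.fromstring(ssml)
    return [
        el.get("name")
        for el in root.iter()
        # Strip the "{namespace}" prefix that ElementTree adds to tag names.
        if el.tag.rsplit("}", 1)[-1] == "voice"
    ]


# Placeholder set: replace with the voices available in your account.
KNOWN_VOICES = {"en-US-JennyNeural", "en-US-GuyNeural"}

ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">Hi.</voice>
    <voice name="en-US-GuyNeurl">Hello.</voice>
</speak>"""

unknown = [name for name in voice_names(ssml) if name not in KNOWN_VOICES]
print(unknown)
```

Here the second voice name is missing a letter, so it is the only entry reported.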

Need help? Contact us at [email protected]