SSML editor — Kaizen Speech Studio Help

When to use SSML

Plain text works well when you want the voice to speak naturally. Use SSML when you need:

Specific pauses (500ms between paragraphs, 2s before a punchline)
Emphasis on specific words
Switching voices mid-speech (dialog, character voice)
Switching languages mid-speech (bilingual narration)
Custom pronunciations (brand names, acronyms)
Overriding prosody (rate, pitch, volume) on specific phrases
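
For the last two items, a minimal sketch using the standard SSML <sub> and <prosody> elements might look like the following. The voice name is borrowed from the dialog example further down, and the sentences are made up for illustration. Whether a given voice honors every attribute is an assumption to verify.

```xml
<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US">
  <voice name="en-US-AvaNeural">
    <!-- sub: the alias is spoken in place of the written text -->
    The <sub alias="World Wide Web Consortium">W3C</sub> maintains the SSML standard.
    <!-- prosody: slows down and lowers the pitch of this one phrase only -->
    <prosody rate="slow" pitch="-5%">Read this part carefully.</prosody>
  </voice>
</speak>
```

Text outside the <sub> and <prosody> elements is spoken with the voice's default settings.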

Entering SSML mode

Tick Use SSML format in the Text-to-Speech editor. The plain editor turns into an SSML editor with a toolbar of common inserts.

The SSML editor toolbar

One-click inserts for common tags:

Voice — wrap selected text in <voice name="...">
Style — <mstts:express-as style="cheerful">
Prosody — <prosody rate="fast" pitch="+10%">
Break — <break time="500ms" /> for precise pauses
Language — <lang xml:lang="es-ES"> for inline language switches
Emphasis — <emphasis level="strong">
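
As a rough illustration of how these inserts combine, here is a sketch that uses all six tags in one document. The mstts prefix has to be declared on <speak> for the XML to be well-formed. Whether the editor adds that declaration for you is not stated here, so it is written out explicitly. Style, language, and emphasis support varies by voice, so treat the voice name and the sample sentences as assumptions.

```xml
<speak xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts"
       version="1.0" xml:lang="en-US">
  <!-- Voice: everything spoken sits inside a voice element -->
  <voice name="en-US-AvaNeural">
    <!-- Style: applies a speaking style to the wrapped text -->
    <mstts:express-as style="cheerful">Good news, the update is live!</mstts:express-as>
    <!-- Break: an explicit half-second pause -->
    <break time="500ms" />
    <!-- Prosody: faster and slightly higher for this phrase -->
    <prosody rate="fast" pitch="+10%">Here is the quick version.</prosody>
    <!-- Emphasis: stresses a single word -->
    This one is <emphasis level="strong">really</emphasis> worth trying.
    <!-- Language: inline switch to Spanish -->
    <lang xml:lang="es-ES">Hasta luego.</lang>
  </voice>
</speak>
```

If a voice does not support a style or an inline language, the tag may be ignored or rejected, so test with the voice you plan to ship.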

Example: two-voice dialog

<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US">
  <voice name="en-US-AvaNeural">Hey, did you try the new feature?<break time="400ms" /></voice>
  <voice name="en-US-AndrewNeural">Yeah, it's great. Let me show you.</voice>
</speak>

Example: precise pauses

Welcome to the podcast. <break time="1s" />
Today we're talking about desktop software. <break time="500ms" />
And why it still matters.
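
The snippet above is a fragment. If the editor does not add the surrounding elements for you (this page does not say either way), a complete document might look like this sketch, which reuses a voice name from the dialog example as a placeholder:

```xml
<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US">
  <voice name="en-US-AvaNeural">
    <!-- One-second pause after the greeting -->
    Welcome to the podcast. <break time="1s" />
    <!-- Half-second pause before the closing line -->
    Today we're talking about desktop software. <break time="500ms" />
    And why it still matters.
  </voice>
</speak>
```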

Validation

Tick Validate SSML before generating. Speech Studio then checks the syntax locally before sending anything to Azure, so you catch typos without spending API calls.
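
Exactly which checks the validator runs is not documented here. As an assumed example, an unclosed tag is the kind of typo a local syntax check would typically flag:

```xml
<!-- Malformed: the voice element is never closed -->
<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US">
  <voice name="en-US-AvaNeural">Welcome to the podcast.
</speak>

<!-- Fixed: every opening tag has a matching closing tag -->
<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US">
  <voice name="en-US-AvaNeural">Welcome to the podcast.</voice>
</speak>
```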

MultiTalker voices

With a MultiTalker voice (e.g. Ava & Andrew), you don't need to wrap each speaker's lines in a separate <voice> element. The model infers turn-taking from your text, which makes it handy for quick dialog generation.