Core feature

SSML editor

Speech Synthesis Markup Language for fine control: pauses, emphasis, multi-voice dialog, language switches.

When to use SSML

Plain text is great when you want the voice to just speak naturally. SSML is for when you need:

  • Specific pauses (500ms between paragraphs, 2s before a punchline)
  • Emphasis on specific words
  • Switching voices mid-speech (dialog, character voice)
  • Switching languages mid-speech (bilingual narration)
  • Custom pronunciations (brand names, acronyms)
  • Overriding prosody on specific phrases
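Custom pronunciations have no toolbar button listed on this page, so they are typically typed by hand. The following is a minimal sketch using the standard SSML <sub> and <phoneme> elements. The voice name is borrowed from the dialog example further down and is only a placeholder; whether a given voice honors <phoneme> is an assumption to check against your voice.

```xml
<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US">
  <voice name="en-US-AvaNeural">
    <!-- sub: the voice speaks the alias instead of the written text -->
    This editor accepts <sub alias="Speech Synthesis Markup Language">SSML</sub>.
    <!-- phoneme: spell out the exact pronunciation in IPA -->
    I say <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>.
  </voice>
</speak>
```

<sub> is usually the simpler choice for acronyms and brand names, because you write the replacement in ordinary letters rather than phonetic symbols.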

Entering SSML mode

Tick Use SSML format in the Text-to-Speech editor. The plain editor turns into an SSML editor with a toolbar of common inserts.

The SSML editor toolbar

One-click inserts for common tags:

  • Voice: wrap selected text in <voice name="...">
  • Style: <mstts:express-as style="cheerful">
  • Prosody: <prosody rate="fast" pitch="+10%">
  • Break: <break time="500ms" /> for precise pauses
  • Language: <lang xml:lang="es-ES"> for inline language switches
  • Emphasis: <emphasis level="strong">
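To show how these inserts fit together, here is a sketch of one document that uses each of them. The voice names (en-US-JennyNeural, en-US-AvaMultilingualNeural) are assumptions chosen for illustration, not names this page prescribes. Note the extra xmlns:mstts declaration, which the mstts:express-as tag needs. Which styles, language switches, and emphasis levels actually take effect depends on the voice you pick.

```xml
<speak xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts"
       version="1.0" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <!-- Style: applies a speaking style to the wrapped text -->
    <mstts:express-as style="cheerful">Welcome back!</mstts:express-as>
    <!-- Prosody: faster and slightly higher for this sentence only -->
    <prosody rate="fast" pitch="+10%">Here is the short version.</prosody>
    <!-- Break: a half-second pause -->
    <break time="500ms" />
    <!-- Emphasis: stress a single word -->
    This step is <emphasis level="strong">really</emphasis> important.
  </voice>
  <voice name="en-US-AvaMultilingualNeural">
    <!-- Language: switch language inline; assumes a multilingual voice -->
    The Spanish word for thank you is <lang xml:lang="es-ES">gracias</lang>.
  </voice>
</speak>
```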

Example: two-voice dialog

<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US">
  <voice name="en-US-AvaNeural">Hey, did you try the new feature?<break time="400ms" /></voice>
  <voice name="en-US-AndrewNeural">Yeah, it's great. Let me show you.</voice>
</speak>

Example: precise pauses

Welcome to the podcast. <break time="1s" />
Today we're talking about desktop software. <break time="500ms" />
And why it still matters.
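The fragment above omits the document wrapper. If you need a complete SSML document (for example, to reuse the same text in another tool), a minimal sketch looks like this. The voice name is taken from the dialog example and is only a placeholder; this page does not say whether the editor adds the wrapper for you.

```xml
<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US">
  <voice name="en-US-AvaNeural">
    Welcome to the podcast. <break time="1s" />
    Today we're talking about desktop software. <break time="500ms" />
    And why it still matters.
  </voice>
</speak>
```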

Validation

Tick Validate SSML before generating. Speech Studio then checks the syntax locally before sending anything to Azure, so you catch typos without spending API calls.
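As an illustration of the kind of typo a syntax check is meant to catch, the snippet below is deliberately broken: the closing tag is misspelled, so the document is not well-formed XML and any XML parser will reject it. Exactly which checks Speech Studio runs beyond this is not stated here, so treat the example as an assumption about typical behavior.

```xml
<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US">
  <!-- Broken on purpose: the closing tag below should be </voice> -->
  <voice name="en-US-AvaNeural">Hey, did you try the new feature?</voise>
</speak>
```

Other common slips of the same kind are a missing closing quote on an attribute value and a <break> written without its trailing slash.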

MultiTalker voices

With a MultiTalker voice (e.g. Ava & Andrew), you don't need to wrap each speaker in its own <voice> element. The model infers turn-taking from your text, which makes it a good fit for quick dialog generation.