Advanced Dialect TTS Tuning: SSML, Prosody, Emotion Control

Why Advanced Tuning Matters

Basic dialect TTS calls meet most needs, but creating professional-grade content requires finer control. Advanced tuning helps you:

Improve naturalness: Make speech closer to human reading
Enhance expressiveness: Adjust emotion and rhythm based on content
Optimize listening experience: Reduce fatigue, improve completion rates
Enable complex scenarios: Multi-character dialogues, dramatic scenes

This guide is for developers and creators already familiar with basic TTS calls. If you're new to dialect TTS, start with [Getting Started with Dialect TTS](/en/blog/getting-started-with-dialect-tts).

SSML Tags Explained

SSML (Speech Synthesis Markup Language) is the W3C standard markup language for controlling speech synthesis. It lets you specify pauses, speaking rate, pitch, and volume directly in the text you send for synthesis.

Basic SSML Structure

<speak>
  This is normal text.
  <break time="500ms"/>
  Here we inserted a half-second pause.
</speak>

Pause Control `<break>`

Pauses are crucial for natural-sounding speech:

<speak>
  Hey folks<break time="300ms"/>let me show you something cool today!
  <break strength="medium"/>
  First look at this<break time="200ms"/>then check this out.
</speak>

Pause Parameters:

Parameter	Values	Effect
time	100ms-3000ms	Precise duration control
strength	none	No pause
strength	x-weak	Very short pause
strength	weak	Short pause
strength	medium	Medium pause (default)
strength	strong	Long pause
strength	x-strong	Very long pause

Usage Examples:

<!-- Between paragraphs -->
<speak>
  First paragraph ends here.
  <break strength="strong"/>
  Second paragraph begins.
</speak>

<!-- Between list items -->
<speak>
  Today's menu includes<break time="200ms"/>
  braised pork<break time="300ms"/>
  sweet and sour ribs<break time="300ms"/>
  and twice-cooked pork.
</speak>
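
When list content comes from data rather than hand-written markup, the same pattern can be generated. The following is a minimal sketch, not part of any SDK: `build_list_ssml` and its parameter names are hypothetical, and it escapes the text so characters such as `&` do not break the markup.

```python
from xml.sax.saxutils import escape

def build_list_ssml(intro, items, intro_pause_ms=200, item_pause_ms=300):
    """Build one <speak> document with a <break> after the intro
    and between consecutive list items (none after the last item)."""
    parts = [escape(intro), f'<break time="{intro_pause_ms}ms"/>']
    for i, item in enumerate(items):
        parts.append(escape(item))
        if i < len(items) - 1:
            parts.append(f'<break time="{item_pause_ms}ms"/>')
    return "<speak>" + "".join(parts) + "</speak>"

ssml = build_list_ssml(
    "Today's menu includes",
    ["braised pork", "sweet and sour ribs", "and twice-cooked pork."],
)
print(ssml)
```

The pause lengths default to the values used in the menu example above; tune them by ear for your voice and dialect.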

Speed Control `<prosody>`

<speak>
  <prosody rate="slow">This part speaks slowly</prosody>
  <prosody rate="fast">This part speaks quickly</prosody>
  <prosody rate="120%">Speed increased by 20%</prosody>
  <prosody rate="80%">Speed decreased by 20%</prosody>
</speak>

Speed Parameters:

Value	Effect
x-slow	Very slow (~50%)
slow	Slow (~75%)
medium	Normal (100%)
fast	Fast (~125%)
x-fast	Very fast (~150%)
50%-200%	Precise percentage
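
If your settings store speed as a multiplier (1.0 = normal), it can be converted to an SSML percentage. This sketch assumes that a multiplier of 0.9 corresponds to `rate="90%"`; that mapping is an assumption, and the helper names are hypothetical.

```python
from xml.sax.saxutils import escape

def speed_to_rate(speed, lo=0.5, hi=2.0):
    """Convert a speed multiplier to an SSML rate percentage,
    clamped to the 50%-200% range listed in the table above."""
    clamped = max(lo, min(hi, speed))
    return f"{round(clamped * 100)}%"

def with_rate(text, speed):
    """Wrap escaped text in a <prosody> element with the converted rate."""
    return f'<prosody rate="{speed_to_rate(speed)}">{escape(text)}</prosody>'

print(with_rate("This part is slightly slower", 0.9))
```

Clamping keeps out-of-range values from reaching the service, which is easier to debug than a rejected request.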

Pitch Control `<prosody>`

<speak>
  <prosody pitch="high">Higher pitch</prosody>
  <prosody pitch="low">Lower pitch</prosody>
  <prosody pitch="+10%">Slightly higher</prosody>
  <prosody pitch="-10%">Slightly lower</prosody>
</speak>

Volume Control `<prosody>`

<speak>
  <prosody volume="loud">This is louder</prosody>
  <prosody volume="soft">This is softer</prosody>
  <prosody volume="+6dB">Volume increased by 6dB</prosody>
</speak>

Combined Usage

<speak>
  <prosody rate="slow" pitch="low" volume="soft">
    This is a deep, slow narration
  </prosody>
  <break time="500ms"/>
  <prosody rate="fast" pitch="high" volume="loud">
    Suddenly getting excited!
  </prosody>
</speak>
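
Nested markup like this is easy to get wrong by hand, so it can help to check a document locally before sending it. The sketch below uses only the standard library; `check_ssml` is a hypothetical helper, and it assumes `time` values are written in milliseconds within the 100ms-3000ms range from the pause table (the SSML standard also allows seconds, which this check would flag).

```python
import re
import xml.etree.ElementTree as ET

STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}

def check_ssml(ssml):
    """Return a list of problems; an empty list means none were detected."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as exc:
        # Unclosed or mismatched tags end up here
        return [f"not well-formed XML: {exc}"]
    problems = []
    if root.tag != "speak":
        problems.append(f"root element is <{root.tag}>, expected <speak>")
    for brk in root.iter("break"):
        time_attr = brk.get("time")
        strength = brk.get("strength")
        if time_attr is not None:
            match = re.fullmatch(r"(\d+)ms", time_attr)
            if not match or not 100 <= int(match.group(1)) <= 3000:
                problems.append(f"break time {time_attr!r} outside 100ms-3000ms")
        if strength is not None and strength not in STRENGTHS:
            problems.append(f"unknown break strength {strength!r}")
    return problems

print(check_ssml('<speak>Hi<break time="500ms"/>there</speak>'))
```

This only catches structural mistakes and out-of-range pauses; whether a given voice honors each attribute still has to be verified by listening.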

Emotion Parameter Adjustment

Beyond SSML, the XiangYinGe API supports direct emotion control through request parameters.

Emotion Types

emotions = {
    "neutral": "Neutral, suitable for news",
    "happy": "Happy, for entertainment",
    "sad": "Sad, for emotional stories",
    "angry": "Angry, for emotional expression",
    "fearful": "Fearful, for suspense",
    "surprised": "Surprised, for plot twists",
    "cheerful": "Cheerful, for short videos",
    "enthusiastic": "Enthusiastic, for live commerce",
    "storytelling": "Narrative, for audiobooks",
    "friendly": "Friendly, for customer service"
}

Emotion Intensity Control

data = {
    "text": "This product is absolutely amazing!",
    "dialect": "sichuan",
    "voice": "sichuan_meizi",
    "emotion": "enthusiastic",
    "emotion_intensity": 0.8
}

Intensity Range: 0.0 (weakest) to 1.0 (strongest)

Recommended Settings:

Content Type	Emotion	Intensity
News broadcast	neutral	0.3-0.5
Audiobook narration	storytelling	0.5-0.7
Short video	cheerful	0.6-0.8
Live commerce	enthusiastic	0.7-0.9
Emotional climax	sad/happy	0.8-1.0
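
The table can be encoded as a lookup so requests stay inside the recommended band. This is a sketch under assumptions: the content-type keys and the `emotion_settings` helper are hypothetical, and the "Emotional climax" row is left out because it covers two emotions.

```python
# (emotion, min intensity, max intensity), taken from the table above
RECOMMENDED = {
    "news": ("neutral", 0.3, 0.5),
    "audiobook": ("storytelling", 0.5, 0.7),
    "short_video": ("cheerful", 0.6, 0.8),
    "live_commerce": ("enthusiastic", 0.7, 0.9),
}

def emotion_settings(content_type, intensity=None):
    """Return emotion parameters for a content type.
    With no intensity given, use the midpoint of the recommended band;
    otherwise clamp the requested value into the band."""
    emotion, lo, hi = RECOMMENDED[content_type]
    if intensity is None:
        intensity = (lo + hi) / 2
    intensity = max(lo, min(hi, intensity))
    return {"emotion": emotion, "emotion_intensity": round(intensity, 2)}

print(emotion_settings("live_commerce", 1.0))
```

The returned dict can be merged into a request payload like the `data` example above.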

Dynamic Emotion Switching

For long texts, split the content into segments, synthesize each with its own emotion settings, and concatenate the results:

segments = [
    {
        "text": "The story begins on a quiet morning.",
        "emotion": "storytelling",
        "emotion_intensity": 0.5
    },
    {
        "text": "Suddenly, a loud noise broke the silence!",
        "emotion": "surprised",
        "emotion_intensity": 0.9
    },
    {
        "text": "He slowly walked toward the source of the sound...",
        "emotion": "fearful",
        "emotion_intensity": 0.6
    }
]

audios = []
for segment in segments:
    response = client.synthesize(**segment, dialect="dongbei", voice="dongbei_laotie")
    audios.append(response.audio)

final_audio = concatenate_audio(audios)
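
The loop above relies on a `concatenate_audio` helper that is not shown. One possible shape for it, assuming each `response.audio` is a complete WAV file held as bytes (an assumption; the actual return type is not specified here), uses only the standard library `wave` module. The name `concatenate_wav` is hypothetical.

```python
import io
import wave

def concatenate_wav(chunks):
    """Concatenate WAV byte strings that share channels, sample width, and rate."""
    if not chunks:
        raise ValueError("no audio to concatenate")
    out = io.BytesIO()
    params = None
    with wave.open(out, "wb") as writer:
        for chunk in chunks:
            with wave.open(io.BytesIO(chunk), "rb") as reader:
                fmt = (reader.getnchannels(), reader.getsampwidth(), reader.getframerate())
                if params is None:
                    # The first chunk defines the output format
                    params = fmt
                    writer.setnchannels(fmt[0])
                    writer.setsampwidth(fmt[1])
                    writer.setframerate(fmt[2])
                elif fmt != params:
                    raise ValueError(f"format mismatch: {fmt} != {params}")
                writer.writeframes(reader.readframes(reader.getnframes()))
    return out.getvalue()
```

Compressed formats such as MP3 cannot be joined this way; for those, decode first or use the pydub-based approach shown later in this guide.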

Multi-Character Dialogue Processing

Character Voice Assignment

characters = {
    "narrator": {
        "voice": "shanghai_ajie",
        "emotion": "storytelling",
        "speed": 0.9
    },
    "young_man": {
        "voice": "dongbei_xiaohuo",
        "emotion": "cheerful",
        "speed": 1.05
    },
    "grandmother": {
        "voice": "sichuan_meizi",
        "emotion": "friendly",
        "speed": 0.85,
        "pitch": 0.95
    }
}

Dialogue Script Parsing

import re

script = '''
Narrator: It was a cold winter day.
Young Man: Grandma, it's snowing outside!
Grandmother: Is it now? Let's make something warm to eat.
Narrator: And so, the two began preparing lunch.
'''

def parse_dialogue(script):
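    # Capture everything before the first colon so multi-word roles such as
    # "Young Man" match; \w+ alone stops at the space and drops those lines.
    pattern = r'([^:]+):\s*(.+)'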
    dialogues = []
    for line in script.strip().split('\n'):
        match = re.match(pattern, line)
        if match:
            role, text = match.groups()
            dialogues.append({"role": role, "text": text})
    return dialogues

role_mapping = {
    "Narrator": "narrator",
    "Young Man": "young_man",
    "Grandmother": "grandmother"
}

dialogues = parse_dialogue(script)

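dialogue_audios = []
for dialogue in dialogues:
    # Fall back to the narrator voice for unrecognized roles
    character = role_mapping.get(dialogue["role"], "narrator")
    config = characters[character]
    audio = client.synthesize(
        text=dialogue["text"],
        dialect="dongbei",
        **config
    )
    # Keep each clip so the lines can be joined in script order
    dialogue_audios.append(audio)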

Dialogue Gap Optimization

def generate_dialogue_audio(dialogues, gap_duration=500):
    audios = []
    for i, dialogue in enumerate(dialogues):
        character = role_mapping.get(dialogue["role"], "narrator")
        config = characters[character]

        audio = client.synthesize(
            text=dialogue["text"],
            dialect="dongbei",
            **config
        )
        audios.append(audio)

        if i < len(dialogues) - 1:
            gap = generate_silence(gap_duration)
            audios.append(gap)

    return concatenate_audio(audios)
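
`generate_silence` is also left undefined above. A minimal sketch, assuming the pipeline works with 16-bit PCM WAV bytes at 16 kHz mono (assumptions; match these to whatever format your synthesis calls actually return):

```python
import io
import wave

def generate_silence(duration_ms, sample_rate=16000, channels=1, sample_width=2):
    """Return a WAV file (as bytes) containing duration_ms of silence.
    Zero bytes are silence for 16-bit PCM; 8-bit WAV is unsigned and
    would need 0x80 instead, so sample_width=1 is not handled here."""
    n_frames = sample_rate * duration_ms // 1000
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(sample_width)
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00" * (n_frames * channels * sample_width))
    return buf.getvalue()

gap = generate_silence(500)
print(len(gap), "bytes")
```

A fixed 500 ms gap is a reasonable starting point; a slightly longer gap when the speaker changes than between two lines by the same speaker tends to sound more like real conversation.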

Batch Production Pipeline

Efficient Batch Architecture

import asyncio
from concurrent.futures import ThreadPoolExecutor

class TTSPipeline:
    def __init__(self, api_key, max_workers=5):
        self.client = TTSClient(api_key=api_key)
        self.executor = ThreadPoolExecutor(max_workers=max_workers)

    async def process_batch(self, items):
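        # Inside a coroutine, get_running_loop() is the supported way to
        # obtain the loop; get_event_loop() is deprecated for this use.
        loop = asyncio.get_running_loop()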
        tasks = [
            loop.run_in_executor(
                self.executor,
                self._synthesize_item,
                item
            )
            for item in items
        ]
        return await asyncio.gather(*tasks)

    def _synthesize_item(self, item):
        try:
            result = self.client.synthesize(
                text=item["text"],
                dialect=item.get("dialect", "sichuan"),
                voice=item.get("voice", "sichuan_meizi"),
                **item.get("options", {})
            )
            return {"id": item["id"], "audio": result, "success": True}
        except Exception as e:
            return {"id": item["id"], "error": str(e), "success": False}

Usage Example

async def main():
    pipeline = TTSPipeline(api_key="your_api_key", max_workers=10)

    items = [
        {"id": 1, "text": "First segment", "dialect": "sichuan"},
        {"id": 2, "text": "Second segment", "dialect": "dongbei"},
        {"id": 3, "text": "Third segment", "dialect": "cantonese"},
    ]

    results = await pipeline.process_batch(items)

    for result in results:
        if result["success"]:
            result["audio"].save(f"output_{result['id']}.mp3")
        else:
            print(f"Error for item {result['id']}: {result['error']}")

asyncio.run(main())
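
Because `_synthesize_item` returns a success flag instead of raising, a batch always completes, and the failures can be collected for a second pass. A small sketch of that bookkeeping (the helper names are hypothetical; the result shape follows the dicts returned above):

```python
def split_results(results):
    """Separate batch results into {id: audio} and {id: error message}."""
    succeeded = {r["id"]: r["audio"] for r in results if r["success"]}
    failed = {r["id"]: r["error"] for r in results if not r["success"]}
    return succeeded, failed

def items_to_retry(items, failed):
    """Pick the original request items whose ids failed."""
    return [item for item in items if item["id"] in failed]

results = [
    {"id": 1, "audio": b"...", "success": True},
    {"id": 2, "error": "timeout", "success": False},
]
ok, failed = split_results(results)
print(sorted(ok), failed)
```

The list returned by `items_to_retry` can be passed straight back to `process_batch`.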

Error Retry Mechanism

import time
from functools import wraps

def retry_on_failure(max_retries=3, delay=1.0):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    if attempt < max_retries - 1:
                        time.sleep(delay * (2 ** attempt))
            raise last_exception
        return wrapper
    return decorator

@retry_on_failure(max_retries=3, delay=1.0)
def synthesize_with_retry(client, **kwargs):
    return client.synthesize(**kwargs)
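
The decorator above retries on every exception, including ones that will never succeed (for example an invalid voice name). A variant that retries only selected error types and adds random jitter to the backoff is sketched below. `TransientError` is a hypothetical placeholder; substitute whatever timeout or rate-limit exceptions your client actually raises.

```python
import random
import time
from functools import wraps

class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, throttling)."""

def retry_transient(max_retries=3, delay=1.0, retry_on=(TransientError,), sleep=time.sleep):
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_retries - 1:
                        raise  # out of attempts: surface the last error
                    backoff = delay * (2 ** attempt)
                    # Jitter spreads out retries from parallel workers
                    sleep(backoff + random.uniform(0, backoff))
        return wrapper
    return decorator
```

Exceptions outside `retry_on` propagate immediately, so configuration mistakes fail fast instead of waiting through the full backoff schedule. The `sleep` parameter exists only to make the decorator easy to test.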

Progress Tracking & Logging

import logging
from tqdm import tqdm

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def batch_synthesize_with_progress(items, pipeline):
    results = []
    failed_items = []

    with tqdm(total=len(items), desc="Processing") as pbar:
        for item in items:
            try:
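                # "id" is bookkeeping only; do not forward it to synthesize()
                params = {k: v for k, v in item.items() if k != "id"}
                result = synthesize_with_retry(pipeline.client, **params)
                results.append({"id": item["id"], "audio": result})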
                logger.info(f"Success: {item['id']}")
            except Exception as e:
                failed_items.append({"id": item["id"], "error": str(e)})
                logger.error(f"Failed: {item['id']} - {e}")
            finally:
                pbar.update(1)

    return results, failed_items

Audio Post-Processing

Volume Normalization

from pydub import AudioSegment

def normalize_audio(audio_path, target_dBFS=-20.0):
    audio = AudioSegment.from_file(audio_path)
    change_in_dBFS = target_dBFS - audio.dBFS
    normalized_audio = audio.apply_gain(change_in_dBFS)
    return normalized_audio
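
The gain applied above is simply the difference between the target and measured levels. One edge case is worth guarding: a fully silent clip measures negative infinity dBFS, which would produce an infinite gain. A sketch of the arithmetic with that guard and a cap on the correction (the helper name and the 30 dB cap are assumptions for illustration):

```python
import math

def gain_to_target(current_dbfs, target_dbfs=-20.0, max_gain_db=30.0):
    """Return the gain in dB that moves current_dbfs to target_dbfs.
    Silent input is left untouched, and the correction is capped so a
    near-silent clip is not amplified into pure noise."""
    if math.isinf(current_dbfs):
        return 0.0
    gain = target_dbfs - current_dbfs
    return max(-max_gain_db, min(max_gain_db, gain))

print(gain_to_target(-26.0))  # clip measured at -26 dBFS, target -20 dBFS
```

The result can be passed to `apply_gain` in place of `change_in_dBFS`.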

Adding Background Music

def add_background_music(voice_path, music_path, music_volume=-20):
    voice = AudioSegment.from_file(voice_path)
    music = AudioSegment.from_file(music_path)

    if len(music) < len(voice):
        music = music * (len(voice) // len(music) + 1)
    music = music[:len(voice)]

    music = music + music_volume

    combined = voice.overlay(music)
    return combined

Audio Concatenation with Crossfade

def concatenate_with_crossfade(audio_paths, crossfade_ms=200):
    audios = [AudioSegment.from_file(path) for path in audio_paths]

    result = audios[0]
    for audio in audios[1:]:
        result = result.append(audio, crossfade=crossfade_ms)

    return result
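
pydub's `append` raises a `ValueError` when the crossfade is longer than either segment, which can happen with very short clips such as interjections. A sketch that plans a safe crossfade for each join from the clip lengths alone (`plan_crossfades` is a hypothetical helper; lengths are in milliseconds, as returned by `len()` on an `AudioSegment`):

```python
def plan_crossfades(lengths_ms, requested_ms=200):
    """Return (crossfade per join, total duration) for clips joined in order.
    Each crossfade is limited by the audio accumulated so far and by the
    clip being appended."""
    if not lengths_ms:
        return [], 0
    total = lengths_ms[0]
    fades = []
    for length in lengths_ms[1:]:
        fade = max(0, min(requested_ms, total, length))
        fades.append(fade)
        total += length - fade  # overlapping audio is counted once
    return fades, total

fades, total = plan_crossfades([1000, 150, 2000])
print(fades, total)
```

Each planned value can then be passed as the `crossfade` argument for the corresponding `append` call.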

Performance Optimization

Caching Strategy

import hashlib
import os

class TTSCache:
    def __init__(self, cache_dir="./cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _get_cache_key(self, **params):
        param_str = str(sorted(params.items()))
        return hashlib.md5(param_str.encode()).hexdigest()

    def get(self, **params):
        key = self._get_cache_key(**params)
        cache_path = os.path.join(self.cache_dir, f"{key}.mp3")
        if os.path.exists(cache_path):
            return cache_path
        return None

    def set(self, audio_data, **params):
        key = self._get_cache_key(**params)
        cache_path = os.path.join(self.cache_dir, f"{key}.mp3")
        with open(cache_path, "wb") as f:
            f.write(audio_data)
        return cache_path
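
One caveat with `str(sorted(params.items()))`: it sorts only the top-level keys, so a nested dict (for example an `options` value) can produce different keys for the same settings depending on insertion order. A sketch of a key function that canonicalizes nested values with JSON, assuming all parameters are JSON-serializable (`stable_cache_key` is a hypothetical name):

```python
import hashlib
import json

def stable_cache_key(**params):
    """Hash parameters so that key order, at any nesting level, does not matter."""
    canonical = json.dumps(
        params, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

key = stable_cache_key(text="你好", dialect="sichuan", options={"speed": 0.9})
print(key[:16])
```

Include every parameter that affects the audio (text, dialect, voice, emotion, speed) in the key; anything left out will cause stale audio to be served.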

Concurrency Control

from threading import Semaphore

class RateLimiter:
    def __init__(self, max_concurrent=5):
        self.semaphore = Semaphore(max_concurrent)

    def __enter__(self):
        self.semaphore.acquire()
        return self

    def __exit__(self, *args):
        self.semaphore.release()

rate_limiter = RateLimiter(max_concurrent=10)

def synthesize_with_rate_limit(**kwargs):
    with rate_limiter:
        return client.synthesize(**kwargs)
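
The semaphore above caps how many requests are in flight at once, but not how many start per second. If the service also enforces a requests-per-second quota (an assumption; check the limits for your plan), a minimum-interval limiter can be used alongside it. `IntervalLimiter` is a hypothetical class; the `clock` and `sleep` parameters exist so it can be tested without real delays.

```python
import threading
import time

class IntervalLimiter:
    """Space out calls so that at most one starts every min_interval seconds."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve a start slot under the lock, then sleep outside it so
        # other threads can reserve their own slots in the meantime.
        with self._lock:
            now = self._clock()
            start = max(now, self._next_allowed)
            self._next_allowed = start + self.min_interval
        delay = start - now
        if delay > 0:
            self._sleep(delay)

limiter = IntervalLimiter(min_interval=0.2)  # roughly 5 requests per second
limiter.wait()
```

Call `limiter.wait()` immediately before each synthesis request, inside the `with rate_limiter:` block if both limits apply.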

Debugging & Troubleshooting

Common Issues

Issue	Possible Cause	Solution
Unnatural speed	Parameter too extreme	Keep within 0.8-1.2 range
Weak emotion	Intensity too low	Increase emotion_intensity
Pauses too long/short	Incorrect break values	Adjust time parameter
Inconsistent volume	No normalization	Apply post-processing
Character confusion	Similar voice traits	Choose more distinct voices

Debug Mode

import json

def debug_synthesize(**kwargs):
    print("=== Request Parameters ===")
    print(json.dumps(kwargs, indent=2, ensure_ascii=False))

    result = client.synthesize(**kwargs)

    print("=== Response Info ===")
    print(f"Duration: {result.duration}s")
    print(f"Sample Rate: {result.sample_rate}")
    print(f"Channels: {result.channels}")

    return result

Best Practices Summary

Tuning Checklist

Choose appropriate speed for content type (news slower, videos faster)
Add pauses at natural break points
Match emotion intensity to content
Use distinctly different voices for multiple characters
Implement error retry for batch processing
Use caching to avoid redundant generation
Normalize output audio volume

Content Type Presets

PRESETS = {
    "news": {
        "speed": 0.9,
        "emotion": "neutral",
        "emotion_intensity": 0.4,
        "break_after_sentence": "300ms"
    },
    "audiobook": {
        "speed": 0.85,
        "emotion": "storytelling",
        "emotion_intensity": 0.6,
        "break_after_paragraph": "800ms"
    },
    "short_video": {
        "speed": 1.1,
        "emotion": "cheerful",
        "emotion_intensity": 0.7,
        "break_after_sentence": "200ms"
    },
    "live_commerce": {
        "speed": 1.05,
        "emotion": "enthusiastic",
        "emotion_intensity": 0.8,
        "break_after_sentence": "150ms"
    }
}
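
The `break_after_*` entries do not appear among the request parameters shown earlier, so this sketch treats them as client-side settings (an assumption): it strips them from the preset and turns `break_after_sentence` into SSML pauses. `apply_preset` is a hypothetical helper, and the sentence splitter handles only Latin punctuation followed by whitespace.

```python
import re
from xml.sax.saxutils import escape

def apply_preset(text, preset):
    """Return (ssml, params): the text with a pause after each sentence,
    and the remaining preset entries to send as request parameters."""
    params = {k: v for k, v in preset.items() if not k.startswith("break_after_")}
    pause = preset.get("break_after_sentence")
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    joiner = f'<break time="{pause}"/>' if pause else " "
    ssml = "<speak>" + joiner.join(escape(s) for s in sentences) + "</speak>"
    return ssml, params

news = {
    "speed": 0.9,
    "emotion": "neutral",
    "emotion_intensity": 0.4,
    "break_after_sentence": "300ms",
}
ssml, params = apply_preset("Good evening. Here is the news.", news)
print(ssml)
print(params)
```

Paragraph-level pauses (`break_after_paragraph`) are not handled here; they would need the text split on blank lines first.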

Next Steps

With these tuning techniques you are ready to produce professional-grade content. Related guides:

Getting Started with Dialect TTS: Review basics
Dialect TTS API Deep Integration: Enterprise architecture
Dialect TTS Service Comparison: Choose the right service

For questions, contact us via email: hello@xiangyinge.com