Advanced Dialect TTS Tuning: SSML, Prosody, Emotion Control

Improve naturalness and controllability with SSML tags, pauses, and emotion parameters.

XiangYinGe Team

1/2/2025 | 9 min read

Why Advanced Tuning Matters

Basic dialect TTS calls meet most needs, but creating professional-grade content requires finer control. Advanced tuning helps you:

  • Improve naturalness: Make speech closer to human reading
  • Enhance expressiveness: Adjust emotion and rhythm based on content
  • Optimize listening experience: Reduce fatigue, improve completion rates
  • Enable complex scenarios: Multi-character dialogues, dramatic scenes

This guide is for developers and creators already familiar with basic TTS calls. If you're new to dialect TTS, start with [Getting Started with Dialect TTS](/en/blog/getting-started-with-dialect-tts).

SSML Tags Explained

SSML (Speech Synthesis Markup Language) is the W3C standard markup language for controlling speech synthesis. It gives you explicit control over pauses, speaking rate, pitch, and volume.

Basic SSML Structure

<speak>
  This is normal text.
  <break time="500ms"/>
  Here we inserted a half-second pause.
</speak>

Pause Control <break>

Pauses are crucial for natural-sounding speech:

<speak>
  Hey folks<break time="300ms"/>let me show you something cool today!
  <break strength="medium"/>
  First look at this<break time="200ms"/>then check this out.
</speak>

Pause Parameters:

| Parameter | Value | Effect |
|-----------|-------|--------|
| time | 100ms-3000ms | Precise duration control |
| strength | none | No pause |
| strength | x-weak | Very short pause |
| strength | weak | Short pause |
| strength | medium | Medium pause (default) |
| strength | strong | Long pause |
| strength | x-strong | Very long pause |

Usage Examples:

<!-- Between paragraphs -->
<speak>
  First paragraph ends here.
  <break strength="strong"/>
  Second paragraph begins.
</speak>

<!-- Between list items -->
<speak>
  Today's menu includes<break time="200ms"/>
  braised pork<break time="300ms"/>
  sweet and sour ribs<break time="300ms"/>
  and twice-cooked pork.
</speak>
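
When the list items come from data rather than hand-written markup, the same pattern can be generated. The sketch below is only an illustration: `join_with_breaks` is a hypothetical helper, not part of any SDK, and it reuses the 100ms-3000ms range from the table above as its validity check.

```python
from xml.sax.saxutils import escape


def join_with_breaks(phrases, gap_ms=300):
    """Wrap phrases in <speak>, separated by fixed-length <break> tags."""
    # Range taken from the pause parameter table (100ms-3000ms).
    if not 100 <= gap_ms <= 3000:
        raise ValueError("gap_ms must be within 100-3000 ms")
    separator = f'<break time="{gap_ms}ms"/>'
    # Escape &, < and > so that user text cannot break the markup.
    body = separator.join(escape(phrase.strip()) for phrase in phrases)
    return f"<speak>{body}</speak>"
```

Escaping matters here because a raw `&` or `<` in a menu item would make the SSML document invalid.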

Speed Control <prosody>

<speak>
  <prosody rate="slow">This part speaks slowly</prosody>
  <prosody rate="fast">This part speaks quickly</prosody>
  <prosody rate="120%">Speed increased by 20%</prosody>
  <prosody rate="80%">Speed decreased by 20%</prosody>
</speak>

Speed Parameters:

| Value | Effect |
|-------|--------|
| x-slow | Very slow (~50%) |
| slow | Slow (~75%) |
| medium | Normal (100%) |
| fast | Fast (~125%) |
| x-fast | Very fast (~150%) |
| 50%-200% | Precise percentage |

Pitch Control <prosody>

<speak>
  <prosody pitch="high">Higher pitch</prosody>
  <prosody pitch="low">Lower pitch</prosody>
  <prosody pitch="+10%">Slightly higher</prosody>
  <prosody pitch="-10%">Slightly lower</prosody>
</speak>

Volume Control <prosody>

<speak>
  <prosody volume="loud">This is louder</prosody>
  <prosody volume="soft">This is softer</prosody>
  <prosody volume="+6dB">Volume increased by 6dB</prosody>
</speak>

Combined Usage

<speak>
  <prosody rate="slow" pitch="low" volume="soft">
    This is a deep, slow narration
  </prosody>
  <break time="500ms"/>
  <prosody rate="fast" pitch="high" volume="loud">
    Suddenly getting excited!
  </prosody>
</speak>
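
If you assemble SSML in code, a small builder keeps the attributes consistent. This is a minimal sketch: `prosody` and `speed_to_rate` are hypothetical helper names, and the 50%-200% bound is taken from the speed table above rather than from any API guarantee.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, rate=None, pitch=None, volume=None):
    """Wrap text in a <prosody> tag, emitting only the attributes that are set."""
    attrs = {"rate": rate, "pitch": pitch, "volume": volume}
    rendered = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in attrs.items()
        if value is not None
    )
    if not rendered:
        # Nothing to adjust: return the escaped text without a wrapper tag.
        return escape(text)
    return f"<prosody{rendered}>{escape(text)}</prosody>"


def speed_to_rate(speed):
    """Convert a multiplier such as 1.2 into an SSML rate string such as '120%'."""
    percent = round(speed * 100)
    if not 50 <= percent <= 200:
        raise ValueError("speed must map to 50%-200%")
    return f"{percent}%"
```

For example, `prosody("Slow down", rate=speed_to_rate(0.8))` produces a tag with `rate="80%"`.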

Emotion Parameter Adjustment

Beyond SSML, the XiangYinGe API supports direct emotion control through request parameters.

Emotion Types

emotions = {
    "neutral": "Neutral, suitable for news",
    "happy": "Happy, for entertainment",
    "sad": "Sad, for emotional stories",
    "angry": "Angry, for emotional expression",
    "fearful": "Fearful, for suspense",
    "surprised": "Surprised, for plot twists",
    "cheerful": "Cheerful, for short videos",
    "enthusiastic": "Enthusiastic, for live commerce",
    "storytelling": "Narrative, for audiobooks",
    "friendly": "Friendly, for customer service"
}

Emotion Intensity Control

data = {
    "text": "This product is absolutely amazing!",
    "dialect": "sichuan",
    "voice": "sichuan_meizi",
    "emotion": "enthusiastic",
    "emotion_intensity": 0.8
}

Intensity Range: 0.0 (weakest) to 1.0 (strongest)

Recommended Settings:

| Content Type | Emotion | Intensity |
|--------------|---------|-----------|
| News broadcast | neutral | 0.3-0.5 |
| Audiobook narration | storytelling | 0.5-0.7 |
| Short video | cheerful | 0.6-0.8 |
| Live commerce | enthusiastic | 0.7-0.9 |
| Emotional climax | sad/happy | 0.8-1.0 |
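
Since intensity is only meaningful between 0.0 and 1.0, it can help to validate the values before sending a request. The following is a sketch under assumptions: `emotion_params` is a hypothetical helper, the emotion names are copied from the list above, and clamping out-of-range values (rather than rejecting them) is a design choice, not documented API behavior.

```python
SUPPORTED_EMOTIONS = {
    "neutral", "happy", "sad", "angry", "fearful",
    "surprised", "cheerful", "enthusiastic", "storytelling", "friendly",
}


def emotion_params(emotion, intensity):
    """Return request fields with a known emotion and an intensity in [0.0, 1.0]."""
    if emotion not in SUPPORTED_EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    # Clamp rather than fail, so a slightly out-of-range value still works.
    clamped = max(0.0, min(1.0, float(intensity)))
    return {"emotion": emotion, "emotion_intensity": clamped}
```

The returned dictionary can be merged into a request such as the `data` example above.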

Dynamic Emotion Switching

For long texts, set different emotions per segment:

segments = [
    {
        "text": "The story begins on a quiet morning.",
        "emotion": "storytelling",
        "emotion_intensity": 0.5
    },
    {
        "text": "Suddenly, a loud noise broke the silence!",
        "emotion": "surprised",
        "emotion_intensity": 0.9
    },
    {
        "text": "He slowly walked toward the source of the sound...",
        "emotion": "fearful",
        "emotion_intensity": 0.6
    }
]

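# `client` is an already initialized TTS client, and `concatenate_audio`
# is assumed to be defined elsewhere; it joins the clips in order.
audios = []
for segment in segments:
    # Each segment supplies its own text, emotion, and intensity.
    response = client.synthesize(**segment, dialect="dongbei", voice="dongbei_laotie")
    audios.append(response.audio)

final_audio = concatenate_audio(audios)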

Multi-Character Dialogue Processing

Character Voice Assignment

characters = {
    "narrator": {
        "voice": "shanghai_ajie",
        "emotion": "storytelling",
        "speed": 0.9
    },
    "young_man": {
        "voice": "dongbei_xiaohuo",
        "emotion": "cheerful",
        "speed": 1.05
    },
    "grandmother": {
        "voice": "sichuan_meizi",
        "emotion": "friendly",
        "speed": 0.85,
        "pitch": 0.95
    }
}

Dialogue Script Parsing

import re

script = '''
Narrator: It was a cold winter day.
Young Man: Grandma, it's snowing outside!
Grandmother: Is it now? Let's make something warm to eat.
Narrator: And so, the two began preparing lunch.
'''

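def parse_dialogue(script):
    # Match "Role: text". [^:]+ accepts multi-word roles such as "Young Man",
    # which \w+ cannot match because it stops at the space.
    pattern = r'([^:]+):\s*(.+)'
    dialogues = []
    for line in script.strip().split('\n'):
        match = re.match(pattern, line)
        if match:
            role, text = match.groups()
            dialogues.append({"role": role.strip(), "text": text.strip()})
    return dialogues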

role_mapping = {
    "Narrator": "narrator",
    "Young Man": "young_man",
    "Grandmother": "grandmother"
}

dialogues = parse_dialogue(script)

for dialogue in dialogues:
    character = role_mapping.get(dialogue["role"], "narrator")
    config = characters[character]
    audio = client.synthesize(
        text=dialogue["text"],
        dialect="dongbei",
        **config
    )
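
Because the loop falls back to the narrator for any role it does not recognize, a typo in the script can silently put a line in the wrong voice. A small check before synthesis makes that visible. This is a sketch only; `unmapped_roles` is a hypothetical helper name.

```python
def unmapped_roles(dialogues, role_mapping):
    """Return roles used in the script that have no entry in role_mapping."""
    missing = []
    for dialogue in dialogues:
        role = dialogue["role"]
        # Keep first-seen order and report each role only once.
        if role not in role_mapping and role not in missing:
            missing.append(role)
    return missing
```

Running this on the parsed script and stopping when the result is non-empty avoids paying for audio that has to be regenerated.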

Dialogue Gap Optimization

def generate_dialogue_audio(dialogues, gap_duration=500):
    audios = []
    for i, dialogue in enumerate(dialogues):
        character = role_mapping.get(dialogue["role"], "narrator")
        config = characters[character]

        audio = client.synthesize(
            text=dialogue["text"],
            dialect="dongbei",
            **config
        )
        audios.append(audio)

        if i < len(dialogues) - 1:
            gap = generate_silence(gap_duration)
            audios.append(gap)

    return concatenate_audio(audios)
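
The function above relies on `generate_silence` and `concatenate_audio`, which are not defined in this guide. One possible implementation, using only the standard library, is sketched below. It assumes every clip is available as 16-bit PCM WAV bytes with the same sample rate and channel count; the 24000 Hz default is a placeholder, so match it to whatever your synthesized audio actually uses.

```python
import io
import wave

SAMPLE_WIDTH = 2  # bytes per sample (16-bit PCM, where silence is all zeros)


def generate_silence(duration_ms, sample_rate=24000, channels=1):
    """Return WAV bytes containing duration_ms of silence."""
    n_frames = int(sample_rate * duration_ms / 1000)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(SAMPLE_WIDTH)
        out.setframerate(sample_rate)
        out.writeframes(b"\x00" * (n_frames * channels * SAMPLE_WIDTH))
    return buffer.getvalue()


def concatenate_audio(wav_clips):
    """Join WAV clips (bytes) in order; all clips must share one format."""
    if not wav_clips:
        raise ValueError("need at least one clip")
    fmt = None
    frames = []
    for clip in wav_clips:
        with wave.open(io.BytesIO(clip), "rb") as src:
            clip_fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if fmt is None:
                fmt = clip_fmt
            elif clip_fmt != fmt:
                raise ValueError("clips differ in channels, sample width, or rate")
            frames.append(src.readframes(src.getnframes()))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        out.writeframes(b"".join(frames))
    return buffer.getvalue()
```

If your audio is MP3 or another compressed format, decode it first or use an audio library for this step.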

Batch Production Pipeline

Efficient Batch Architecture

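import asyncio
from concurrent.futures import ThreadPoolExecutor

# TTSClient is the client class from the TTS SDK you are using; import it
# according to that SDK's documentation.

class TTSPipeline:
    def __init__(self, api_key, max_workers=5):
        self.client = TTSClient(api_key=api_key)
        self.executor = ThreadPoolExecutor(max_workers=max_workers)

    async def process_batch(self, items):
        # Inside a coroutine, get_running_loop() is the supported way to
        # obtain the current event loop.
        loop = asyncio.get_running_loop()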
        tasks = [
            loop.run_in_executor(
                self.executor,
                self._synthesize_item,
                item
            )
            for item in items
        ]
        return await asyncio.gather(*tasks)

    def _synthesize_item(self, item):
        try:
            result = self.client.synthesize(
                text=item["text"],
                dialect=item.get("dialect", "sichuan"),
                voice=item.get("voice", "sichuan_meizi"),
                **item.get("options", {})
            )
            return {"id": item["id"], "audio": result, "success": True}
        except Exception as e:
            return {"id": item["id"], "error": str(e), "success": False}

Usage Example

async def main():
    pipeline = TTSPipeline(api_key="your_api_key", max_workers=10)

    items = [
        {"id": 1, "text": "First segment", "dialect": "sichuan"},
        {"id": 2, "text": "Second segment", "dialect": "dongbei"},
        {"id": 3, "text": "Third segment", "dialect": "cantonese"},
    ]

    results = await pipeline.process_batch(items)

    for result in results:
        if result["success"]:
            result["audio"].save(f"output_{result['id']}.mp3")
        else:
            print(f"Error for item {result['id']}: {result['error']}")

asyncio.run(main())
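
The same batch pattern can also be written without a dedicated executor. The variant below is a sketch, not the pipeline's actual implementation: it uses `asyncio.to_thread` (Python 3.9+) with an `asyncio.Semaphore` to cap concurrency, and `synthesize` is a stand-in for whatever blocking call your client exposes.

```python
import asyncio


async def synthesize_all(items, synthesize, max_concurrent=5):
    """Run a blocking synthesize(item) call for every item, a few at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(item):
        async with semaphore:
            try:
                # to_thread keeps the event loop free while the call blocks.
                audio = await asyncio.to_thread(synthesize, item)
                return {"id": item["id"], "audio": audio, "success": True}
            except Exception as exc:
                return {"id": item["id"], "error": str(exc), "success": False}

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(run_one(item) for item in items))
```

Because failures are captured per item, one bad segment does not cancel the rest of the batch.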

Error Retry Mechanism

import time
from functools import wraps

def retry_on_failure(max_retries=3, delay=1.0):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    if attempt < max_retries - 1:
                        time.sleep(delay * (2 ** attempt))
            raise last_exception
        return wrapper
    return decorator

@retry_on_failure(max_retries=3, delay=1.0)
def synthesize_with_retry(client, **kwargs):
    return client.synthesize(**kwargs)
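
The decorator above retries on every exception, including ones that will never succeed, such as invalid parameters. A possible refinement is sketched here: it retries only on exception types you list and adds random jitter so that parallel workers do not retry in lockstep. The default `retry_on` tuple is an assumption; substitute the transient error types your client actually raises.

```python
import random
import time
from functools import wraps


def retry_transient(max_retries=3, delay=1.0,
                    retry_on=(ConnectionError, TimeoutError),
                    sleep=time.sleep):
    """Retry only on the listed exception types, with exponential backoff."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_retries - 1:
                        raise  # out of attempts: surface the last error
                    backoff = delay * (2 ** attempt)
                    # Add up to 50% jitter on top of the base backoff.
                    sleep(backoff + random.uniform(0, backoff / 2))
        return wrapper
    return decorator
```

The `sleep` parameter exists so the wait can be replaced in tests; in production the default `time.sleep` is used.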

Progress Tracking & Logging

import logging
from tqdm import tqdm

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def batch_synthesize_with_progress(items, pipeline):
    results = []
    failed_items = []

    with tqdm(total=len(items), desc="Processing") as pbar:
        for item in items:
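            try:
                # Pass everything except the bookkeeping "id" to the API call.
                params = {k: v for k, v in item.items() if k != "id"}
                result = synthesize_with_retry(
                    pipeline.client,
                    **params
                )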
                results.append({"id": item["id"], "audio": result})
                logger.info(f"Success: {item['id']}")
            except Exception as e:
                failed_items.append({"id": item["id"], "error": str(e)})
                logger.error(f"Failed: {item['id']} - {e}")
            finally:
                pbar.update(1)

    return results, failed_items

Audio Post-Processing

Volume Normalization

from pydub import AudioSegment

def normalize_audio(audio_path, target_dBFS=-20.0):
    audio = AudioSegment.from_file(audio_path)
    change_in_dBFS = target_dBFS - audio.dBFS
    normalized_audio = audio.apply_gain(change_in_dBFS)
    return normalized_audio
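
The gain applied above is simply the difference between the target level and the measured level. The standard-library sketch below shows that arithmetic for 16-bit samples, along with a guard for silent clips, whose level is negative infinity and would otherwise yield an infinite gain. The function names are hypothetical, and the level is computed as RMS relative to full scale (32768), which is an assumption about how your audio library defines dBFS.

```python
import math

FULL_SCALE_16BIT = 32768


def dbfs_16bit(samples):
    """RMS level of 16-bit samples in dB relative to full scale."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    return 20 * math.log10(rms / FULL_SCALE_16BIT)


def gain_to_target(current_dbfs, target_dbfs=-20.0):
    """Gain in dB that moves current_dbfs to target_dbfs (0 for silence)."""
    if math.isinf(current_dbfs):
        return 0.0  # silent clip: no finite gain would reach the target
    return target_dbfs - current_dbfs
```

For example, a clip measured at -26 dBFS needs +6 dB of gain to reach the -20 dBFS target.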

Adding Background Music

def add_background_music(voice_path, music_path, music_volume=-20):
    voice = AudioSegment.from_file(voice_path)
    music = AudioSegment.from_file(music_path)

    if len(music) < len(voice):
        music = music * (len(voice) // len(music) + 1)
    music = music[:len(voice)]

    music = music + music_volume

    combined = voice.overlay(music)
    return combined

Audio Concatenation with Crossfade

def concatenate_with_crossfade(audio_paths, crossfade_ms=200):
    audios = [AudioSegment.from_file(path) for path in audio_paths]

    result = audios[0]
    for audio in audios[1:]:
        result = result.append(audio, crossfade=crossfade_ms)

    return result

Performance Optimization

Caching Strategy

import hashlib
import os

class TTSCache:
    def __init__(self, cache_dir="./cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _get_cache_key(self, **params):
        param_str = str(sorted(params.items()))
        return hashlib.md5(param_str.encode()).hexdigest()

    def get(self, **params):
        key = self._get_cache_key(**params)
        cache_path = os.path.join(self.cache_dir, f"{key}.mp3")
        if os.path.exists(cache_path):
            return cache_path
        return None

    def set(self, audio_data, **params):
        key = self._get_cache_key(**params)
        cache_path = os.path.join(self.cache_dir, f"{key}.mp3")
        with open(cache_path, "wb") as f:
            f.write(audio_data)
        return cache_path
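
In practice the cache is usually wrapped around the synthesis call so that a miss triggers generation and a hit skips it. The sketch below shows that get-or-create flow in a self-contained form. It is an illustration under assumptions: `synthesize` stands in for any callable that returns audio bytes, and the key is built with `json.dumps(sort_keys=True)` and SHA-256 instead of the `str(sorted(...))` and MD5 combination used above.

```python
import hashlib
import json
import os


def cache_key(params):
    """Stable key: the same parameters in any order give the same hash."""
    payload = json.dumps(params, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_synthesize(cache_dir, synthesize, **params):
    """Return the path of cached audio, generating it only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{cache_key(params)}.mp3")
    if not os.path.exists(path):
        audio_bytes = synthesize(**params)
        # Write to a temporary file first so a crash mid-write cannot
        # leave a truncated file that later looks like a valid cache hit.
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(audio_bytes)
        os.replace(tmp_path, path)
    return path
```

Every parameter that affects the audio, including emotion and speed, has to be part of `params`; otherwise different requests would collide on the same file.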

Concurrency Control

from threading import Semaphore

class RateLimiter:
    def __init__(self, max_concurrent=5):
        self.semaphore = Semaphore(max_concurrent)

    def __enter__(self):
        self.semaphore.acquire()
        return self

    def __exit__(self, *args):
        self.semaphore.release()

rate_limiter = RateLimiter(max_concurrent=10)

def synthesize_with_rate_limit(**kwargs):
    with rate_limiter:
        return client.synthesize(**kwargs)
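
Note that the semaphore above limits how many requests run at the same time, not how many are started per second. If your quota is expressed as requests per second, a spacing-based limiter is one way to complement it. This is a minimal sketch; `IntervalLimiter` is a hypothetical class, and the right rate depends on your plan's actual limits.

```python
import threading
import time


class IntervalLimiter:
    """Space out calls so that at most max_per_second start each second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        """Block until this caller's slot arrives; return the seconds waited."""
        with self._lock:
            now = self._clock()
            start = max(now, self._next_allowed)
            # Reserve the slot now so other threads queue up behind it.
            self._next_allowed = start + self.min_interval
        delay = start - now
        if delay > 0:
            self._sleep(delay)
        return delay
```

Call `limiter.wait()` immediately before each synthesis request; it can be combined with the semaphore when both limits apply.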

Debugging & Troubleshooting

Common Issues

| Issue | Possible Cause | Solution |
|-------|----------------|----------|
| Unnatural speed | Parameter too extreme | Keep within 0.8-1.2 range |
| Weak emotion | Intensity too low | Increase emotion_intensity |
| Pauses too long/short | Incorrect break values | Adjust time parameter |
| Inconsistent volume | No normalization | Apply post-processing |
| Character confusion | Similar voice traits | Choose more distinct voices |

Debug Mode

import json

def debug_synthesize(**kwargs):
    print("=== Request Parameters ===")
    print(json.dumps(kwargs, indent=2, ensure_ascii=False))

    result = client.synthesize(**kwargs)

    print("=== Response Info ===")
    print(f"Duration: {result.duration}s")
    print(f"Sample Rate: {result.sample_rate}")
    print(f"Channels: {result.channels}")

    return result

Best Practices Summary

Tuning Checklist

  • Choose appropriate speed for content type (news slower, videos faster)
  • Add pauses at natural break points
  • Match emotion intensity to content
  • Use distinctly different voices for multiple characters
  • Implement error retry for batch processing
  • Use caching to avoid redundant generation
  • Normalize output audio volume

Content Type Presets

PRESETS = {
    "news": {
        "speed": 0.9,
        "emotion": "neutral",
        "emotion_intensity": 0.4,
        "break_after_sentence": "300ms"
    },
    "audiobook": {
        "speed": 0.85,
        "emotion": "storytelling",
        "emotion_intensity": 0.6,
        "break_after_paragraph": "800ms"
    },
    "short_video": {
        "speed": 1.1,
        "emotion": "cheerful",
        "emotion_intensity": 0.7,
        "break_after_sentence": "200ms"
    },
    "live_commerce": {
        "speed": 1.05,
        "emotion": "enthusiastic",
        "emotion_intensity": 0.8,
        "break_after_sentence": "150ms"
    }
}
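
The `break_after_*` entries in these presets are not request parameters shown elsewhere in this guide, so they need to be turned into SSML pauses by your own code. One way to do that is sketched below, assuming the preset shape above. `apply_preset` is a hypothetical helper, it handles only `break_after_sentence`, and its sentence splitting is deliberately naive (a decimal such as "3.5" would be split).

```python
import re
from xml.sax.saxutils import escape

# A sentence is a run of non-terminators plus its closing punctuation.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def apply_preset(text, preset):
    """Split a preset into request parameters and an SSML string with pauses."""
    # Everything except the break_after_* hints is treated as a request field.
    params = {k: v for k, v in preset.items() if not k.startswith("break_after_")}
    pause = preset.get("break_after_sentence")
    if pause:
        sentences = [s.strip() for s in _SENTENCE.findall(text) if s.strip()]
        body = f'<break time="{pause}"/>'.join(escape(s) for s in sentences)
    else:
        body = escape(text.strip())
    return params, f"<speak>{body}</speak>"
```

How the resulting SSML is passed to the API depends on the client you use, which is why it is returned separately from the other parameters.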

Next Steps

With these tuning techniques in hand, you can start producing professional-grade dialect content.

For questions, contact us via email: hello@xiangyinge.com
