Advanced Dialect TTS Tuning: SSML, Prosody, Emotion Control
Improve naturalness and controllability with SSML tags, pauses, and emotion parameters.
XiangYinGe Team
Why Advanced Tuning Matters
Basic dialect TTS calls meet most needs, but creating professional-grade content requires finer control. Advanced tuning helps you:
- Improve naturalness: Make speech closer to human reading
- Enhance expressiveness: Adjust emotion and rhythm based on content
- Optimize listening experience: Reduce fatigue, improve completion rates
- Enable complex scenarios: Multi-character dialogues, dramatic scenes
SSML Tags Explained
SSML (Speech Synthesis Markup Language) is the standard markup language for controlling speech synthesis. It gives fine-grained control over pauses, speaking rate, pitch, and volume.
Basic SSML Structure
<speak>
This is normal text.
<break time="500ms"/>
Here we inserted a half-second pause.
</speak>
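Because SSML is XML, a malformed document (an unclosed tag, a bare `&`) is likely to be rejected or misread. The following is a minimal sketch of a local pre-flight check built on Python's standard library. The function name `is_well_formed_ssml` is illustrative and not part of the XiangYinGe API, and the check covers only XML well-formedness and the `<speak>` root, not which tags the service actually supports.

```python
import xml.etree.ElementTree as ET


def is_well_formed_ssml(ssml: str) -> bool:
    """Return True if `ssml` parses as XML and its root element is <speak>."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError:
        return False
    # Strip an optional "{namespace}" prefix before comparing the tag name
    return root.tag.rsplit("}", 1)[-1] == "speak"
```

Running this before each request catches typos such as a `<break>` that is missing its closing slash.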
Pause Control <break>
Pauses are crucial for natural-sounding speech:
<speak>
Hey folks<break time="300ms"/>let me show you something cool today!
<break strength="medium"/>
First look at this<break time="200ms"/>then check this out.
</speak>
Pause Parameters:
| Parameter | Values | Effect |
|---|---|---|
| time | 100ms-3000ms | Precise duration control |
| strength | none | No pause |
| strength | x-weak | Very short pause |
| strength | weak | Short pause |
| strength | medium | Medium pause (default) |
| strength | strong | Long pause |
| strength | x-strong | Very long pause |
Usage Examples:
<!-- Between paragraphs -->
<speak>
First paragraph ends here.
<break strength="strong"/>
Second paragraph begins.
</speak>
<!-- Between list items -->
<speak>
Today's menu includes<break time="200ms"/>
braised pork<break time="300ms"/>
sweet and sour ribs<break time="300ms"/>
and twice-cooked pork.
</speak>
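When list items come from data rather than hand-written markup, the pattern above can be generated. This is a sketch only: `ssml_list` and its parameters are hypothetical names, and the default gaps simply mirror the 200ms and 300ms values used in the example.

```python
from xml.sax.saxutils import escape


def ssml_list(intro: str, items: list[str],
              intro_gap_ms: int = 200, item_gap_ms: int = 300) -> str:
    """Build a <speak> document with a pause after the intro and between items."""
    intro_gap = f'<break time="{intro_gap_ms}ms"/>'
    item_gap = f'<break time="{item_gap_ms}ms"/>'
    # escape() protects characters such as & and < that would break the XML
    body = item_gap.join(escape(item) for item in items)
    return f"<speak>{escape(intro)}{intro_gap}{body}</speak>"


print(ssml_list("Today's menu includes", ["braised pork", "sweet and sour ribs"]))
```

Escaping the text matters because menu names, product titles, and similar user data often contain `&`.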
Speed Control <prosody>
<speak>
<prosody rate="slow">This part speaks slowly</prosody>
<prosody rate="fast">This part speaks quickly</prosody>
<prosody rate="120%">Speed increased by 20%</prosody>
<prosody rate="80%">Speed decreased by 20%</prosody>
</speak>
Speed Parameters:
| Value | Effect |
|---|---|
| x-slow | Very slow (~50%) |
| slow | Slow (~75%) |
| medium | Normal (100%) |
| fast | Fast (~125%) |
| x-fast | Very fast (~150%) |
| 50%-200% | Precise percentage |
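To compare named rates with percentages, or to validate a value before sending it, both forms can be converted to one multiplier. This sketch assumes the approximate multipliers and the 50%-200% range from the table above; `rate_to_multiplier` is an illustrative helper, not an API function.

```python
# Approximate multipliers taken from the speed table above
NAMED_RATES = {"x-slow": 0.50, "slow": 0.75, "medium": 1.00, "fast": 1.25, "x-fast": 1.50}


def rate_to_multiplier(rate: str) -> float:
    """Convert a named rate or a percentage string such as "120%" to a multiplier."""
    if rate in NAMED_RATES:
        return NAMED_RATES[rate]
    if rate.endswith("%"):
        value = float(rate[:-1]) / 100.0
        if not 0.5 <= value <= 2.0:
            raise ValueError(f"rate {rate!r} is outside the 50%-200% range")
        return value
    raise ValueError(f"unrecognized rate: {rate!r}")
```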
Pitch Control <prosody>
<speak>
<prosody pitch="high">Higher pitch</prosody>
<prosody pitch="low">Lower pitch</prosody>
<prosody pitch="+10%">Slightly higher</prosody>
<prosody pitch="-10%">Slightly lower</prosody>
</speak>
Volume Control <prosody>
<speak>
<prosody volume="loud">This is louder</prosody>
<prosody volume="soft">This is softer</prosody>
<prosody volume="+6dB">Volume increased by 6dB</prosody>
</speak>
Combined Usage
<speak>
<prosody rate="slow" pitch="low" volume="soft">
This is a deep, slow narration
</prosody>
<break time="500ms"/>
<prosody rate="fast" pitch="high" volume="loud">
Suddenly getting excited!
</prosody>
</speak>
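Combined attributes are easy to mistype by hand. A small builder, sketched here under the assumption that only `rate`, `pitch`, and `volume` are used, keeps the markup consistent. The name `prosody` is a hypothetical helper, not part of any SDK.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text: str, **attrs: str) -> str:
    """Wrap `text` in a <prosody> tag; only rate, pitch, and volume are accepted."""
    allowed = ("rate", "pitch", "volume")
    unknown = set(attrs) - set(allowed)
    if unknown:
        raise ValueError(f"unsupported prosody attributes: {sorted(unknown)}")
    # Emit attributes in a fixed order so the output is deterministic
    rendered = "".join(f" {name}={quoteattr(attrs[name])}" for name in allowed if name in attrs)
    return f"<prosody{rendered}>{escape(text)}</prosody>"


narration = prosody("This is a deep, slow narration", rate="slow", pitch="low", volume="soft")
print(f"<speak>{narration}</speak>")
```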
Emotion Parameter Adjustment
Beyond SSML, the XiangYinGe API supports direct emotion control through request parameters.
Emotion Types
emotions = {
"neutral": "Neutral, suitable for news",
"happy": "Happy, for entertainment",
"sad": "Sad, for emotional stories",
"angry": "Angry, for emotional expression",
"fearful": "Fearful, for suspense",
"surprised": "Surprised, for plot twists",
"cheerful": "Cheerful, for short videos",
"enthusiastic": "Enthusiastic, for live commerce",
"storytelling": "Narrative, for audiobooks",
"friendly": "Friendly, for customer service"
}
Emotion Intensity Control
data = {
"text": "This product is absolutely amazing!",
"dialect": "sichuan",
"voice": "sichuan_meizi",
"emotion": "enthusiastic",
"emotion_intensity": 0.8
}
Intensity Range: 0.0 (weakest) to 1.0 (strongest)
Recommended Settings:
| Content Type | Emotion | Intensity |
|---|---|---|
| News broadcast | neutral | 0.3-0.5 |
| Audiobook narration | storytelling | 0.5-0.7 |
| Short video | cheerful | 0.6-0.8 |
| Live commerce | enthusiastic | 0.7-0.9 |
| Emotional climax | sad/happy | 0.8-1.0 |
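The recommended bands above can be turned into a lookup that picks a concrete intensity. This is a sketch: the content-type keys and the `emphasis` knob are illustrative choices, and only the four single-emotion rows of the table are included.

```python
# (emotion, low, high) per content type, copied from the table above
RECOMMENDED = {
    "news": ("neutral", 0.3, 0.5),
    "audiobook": ("storytelling", 0.5, 0.7),
    "short_video": ("cheerful", 0.6, 0.8),
    "live_commerce": ("enthusiastic", 0.7, 0.9),
}


def emotion_settings(content_type: str, emphasis: float = 0.5) -> dict:
    """Pick an intensity inside the recommended band.

    emphasis 0.0 selects the low end of the band, 1.0 the high end.
    """
    emotion, low, high = RECOMMENDED[content_type]
    emphasis = min(max(emphasis, 0.0), 1.0)  # clamp to [0, 1]
    intensity = round(low + (high - low) * emphasis, 2)
    return {"emotion": emotion, "emotion_intensity": intensity}
```

The returned dictionary uses the same `emotion` and `emotion_intensity` keys as the request example, so it can be merged into the request data.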
Dynamic Emotion Switching
For long texts, set different emotions per segment:
segments = [
{
"text": "The story begins on a quiet morning.",
"emotion": "storytelling",
"emotion_intensity": 0.5
},
{
"text": "Suddenly, a loud noise broke the silence!",
"emotion": "surprised",
"emotion_intensity": 0.9
},
{
"text": "He slowly walked toward the source of the sound...",
"emotion": "fearful",
"emotion_intensity": 0.6
}
]
audios = []
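for segment in segments:
    # Synthesize each segment with its own emotion settings
    response = client.synthesize(**segment, dialect="dongbei", voice="dongbei_laotie")
    audios.append(response.audio)
# concatenate_audio: helper that joins the clips in order
final_audio = concatenate_audio(audios)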
Multi-Character Dialogue Processing
Character Voice Assignment
characters = {
"narrator": {
"voice": "shanghai_ajie",
"emotion": "storytelling",
"speed": 0.9
},
"young_man": {
"voice": "dongbei_xiaohuo",
"emotion": "cheerful",
"speed": 1.05
},
"grandmother": {
"voice": "sichuan_meizi",
"emotion": "friendly",
"speed": 0.85,
"pitch": 0.95
}
}
Dialogue Script Parsing
import re
script = '''
Narrator: It was a cold winter day.
Young Man: Grandma, it's snowing outside!
Grandmother: Is it now? Let's make something warm to eat.
Narrator: And so, the two began preparing lunch.
'''
def parse_dialogue(script):
    # Role names may contain spaces ("Young Man"), so capture everything up to the first colon
    pattern = r'([^:]+):\s*(.+)'
    dialogues = []
    for line in script.strip().split('\n'):
        match = re.match(pattern, line.strip())
        if match:
            role, text = match.groups()
            dialogues.append({"role": role.strip(), "text": text.strip()})
    return dialogues
role_mapping = {
"Narrator": "narrator",
"Young Man": "young_man",
"Grandmother": "grandmother"
}
dialogues = parse_dialogue(script)
for dialogue in dialogues:
    # Unknown roles fall back to the narrator voice
    character = role_mapping.get(dialogue["role"], "narrator")
    config = characters[character]
    audio = client.synthesize(
        text=dialogue["text"],
        dialect="dongbei",
        **config
    )
Dialogue Gap Optimization
def generate_dialogue_audio(dialogues, gap_duration=500):
    # gap_duration: silence between lines (milliseconds)
    audios = []
    for i, dialogue in enumerate(dialogues):
        character = role_mapping.get(dialogue["role"], "narrator")
        config = characters[character]
        audio = client.synthesize(
            text=dialogue["text"],
            dialect="dongbei",
            **config
        )
        audios.append(audio)
        # Add a gap after every line except the last
        if i < len(dialogues) - 1:
            gap = generate_silence(gap_duration)
            audios.append(gap)
    return concatenate_audio(audios)
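The helpers `generate_silence` and `concatenate_audio` are used above without a definition. One possible sketch follows, using only the standard library and assuming that every clip is an uncompressed WAV file held as `bytes`. That format is an assumption: if the service returns MP3 or a custom response object, a library such as pydub is a better fit.

```python
import io
import wave


def generate_silence(duration_ms: int, sample_rate: int = 16000) -> bytes:
    """Return a mono 16-bit WAV file containing `duration_ms` of silence."""
    n_frames = sample_rate * duration_ms // 1000
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        out.setframerate(sample_rate)
        out.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


def concatenate_audio(clips: list[bytes]) -> bytes:
    """Join WAV clips in order; all clips must share channels, sample width, and rate."""
    if not clips:
        raise ValueError("no clips to concatenate")
    fmt = None
    frames = []
    for clip in clips:
        with wave.open(io.BytesIO(clip), "rb") as src:
            clip_fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if fmt is None:
                fmt = clip_fmt
            elif clip_fmt != fmt:
                raise ValueError("clips have different audio formats")
            frames.append(src.readframes(src.getnframes()))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        out.writeframes(b"".join(frames))
    return buf.getvalue()
```

The silence must use the same sample rate and channel count as the synthesized speech, otherwise the format check raises `ValueError`.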
Batch Production Pipeline
Efficient Batch Architecture
import asyncio
from concurrent.futures import ThreadPoolExecutor
class TTSPipeline:
    def __init__(self, api_key, max_workers=5):
        self.client = TTSClient(api_key=api_key)
        self.executor = ThreadPoolExecutor(max_workers=max_workers)
    async def process_batch(self, items):
        # Run the blocking synthesize calls in worker threads
        loop = asyncio.get_running_loop()
        tasks = [
            loop.run_in_executor(
                self.executor,
                self._synthesize_item,
                item
            )
            for item in items
        ]
        return await asyncio.gather(*tasks)
    def _synthesize_item(self, item):
        # Errors are returned rather than raised, so one failure does not abort the batch
        try:
            result = self.client.synthesize(
                text=item["text"],
                dialect=item.get("dialect", "sichuan"),
                voice=item.get("voice", "sichuan_meizi"),
                **item.get("options", {})
            )
            return {"id": item["id"], "audio": result, "success": True}
        except Exception as e:
            return {"id": item["id"], "error": str(e), "success": False}
Usage Example
async def main():
    pipeline = TTSPipeline(api_key="your_api_key", max_workers=10)
    items = [
        {"id": 1, "text": "First segment", "dialect": "sichuan"},
        {"id": 2, "text": "Second segment", "dialect": "dongbei"},
        {"id": 3, "text": "Third segment", "dialect": "cantonese"},
    ]
    results = await pipeline.process_batch(items)
    for result in results:
        if result["success"]:
            result["audio"].save(f"output_{result['id']}.mp3")
        else:
            print(f"Error for item {result['id']}: {result['error']}")
asyncio.run(main())
Error Retry Mechanism
import time
from functools import wraps
def retry_on_failure(max_retries=3, delay=1.0):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    # Exponential backoff: delay, 2*delay, 4*delay, ...
                    if attempt < max_retries - 1:
                        time.sleep(delay * (2 ** attempt))
            # All attempts failed: re-raise the most recent error
            raise last_exception
        return wrapper
    return decorator
@retry_on_failure(max_retries=3, delay=1.0)
def synthesize_with_retry(client, **kwargs):
    return client.synthesize(**kwargs)
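To see how long a failing call can block, it helps to list the waits the decorator performs. The sketch below reproduces its `delay * 2 ** attempt` schedule; `backoff_delays` is an illustrative helper, not part of the code above.

```python
def backoff_delays(max_retries: int = 3, delay: float = 1.0) -> list[float]:
    """Delays slept between attempts; there is no sleep after the final attempt."""
    return [delay * (2 ** attempt) for attempt in range(max_retries - 1)]


delays = backoff_delays(max_retries=3, delay=1.0)
print(delays, "total wait:", sum(delays), "seconds")
```

With the defaults, a call that fails every time waits 1 second and then 2 seconds before the last error is raised. Adding a small random jitter to each delay is a common refinement when many workers retry at once.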
Progress Tracking & Logging
import logging
from tqdm import tqdm
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def batch_synthesize_with_progress(items, pipeline):
    results = []
    failed_items = []
    with tqdm(total=len(items), desc="Processing") as pbar:
        for item in items:
            # "id" is bookkeeping only; pass the remaining keys to the API
            params = {k: v for k, v in item.items() if k != "id"}
            try:
                result = synthesize_with_retry(
                    pipeline.client,
                    **params
                )
                results.append({"id": item["id"], "audio": result})
                logger.info(f"Success: {item['id']}")
            except Exception as e:
                failed_items.append({"id": item["id"], "error": str(e)})
                logger.error(f"Failed: {item['id']} - {e}")
            finally:
                pbar.update(1)
    return results, failed_items
Audio Post-Processing
Volume Normalization
from pydub import AudioSegment
def normalize_audio(audio_path, target_dBFS=-20.0):
    audio = AudioSegment.from_file(audio_path)
    # Gain needed to move the clip's average loudness to the target level
    change_in_dBFS = target_dBFS - audio.dBFS
    normalized_audio = audio.apply_gain(change_in_dBFS)
    return normalized_audio
Adding Background Music
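def add_background_music(voice_path, music_path, music_volume=-20):
    # music_volume is a gain in dB applied to the music (negative = quieter)
    voice = AudioSegment.from_file(voice_path)
    music = AudioSegment.from_file(music_path)
    # Loop the music if it is shorter than the voice track
    if len(music) < len(voice):
        music = music * (len(voice) // len(music) + 1)
    # Trim to the voice length in every case
    music = music[:len(voice)]
    music = music + music_volume
    combined = voice.overlay(music)
    return combined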
Audio Concatenation with Crossfade
def concatenate_with_crossfade(audio_paths, crossfade_ms=200):
    audios = [AudioSegment.from_file(path) for path in audio_paths]
    result = audios[0]
    for audio in audios[1:]:
        result = result.append(audio, crossfade=crossfade_ms)
    return result
Performance Optimization
Caching Strategy
import hashlib
import os
class TTSCache:
    def __init__(self, cache_dir="./cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
    def _get_cache_key(self, **params):
        # Sort the parameters so that keyword order does not change the key
        param_str = str(sorted(params.items()))
        return hashlib.md5(param_str.encode()).hexdigest()
    def get(self, **params):
        key = self._get_cache_key(**params)
        cache_path = os.path.join(self.cache_dir, f"{key}.mp3")
        if os.path.exists(cache_path):
            return cache_path
        return None
    def set(self, audio_data, **params):
        key = self._get_cache_key(**params)
        cache_path = os.path.join(self.cache_dir, f"{key}.mp3")
        with open(cache_path, "wb") as f:
            f.write(audio_data)
        return cache_path
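A cache only helps if every request goes through it. The wrapper below is a sketch of the usual get-or-create pattern. It assumes a cache object with the `get` and `set` methods shown above and a `synthesize` callable that returns raw audio bytes. That return type is an assumption, since the real client may return a richer response object.

```python
from typing import Callable


def synthesize_cached(cache, synthesize: Callable[..., bytes], **params) -> str:
    """Return the cached file path for `params`, calling `synthesize` only on a miss."""
    path = cache.get(**params)
    if path is not None:
        return path  # cache hit: no API call
    audio_data = synthesize(**params)
    return cache.set(audio_data, **params)
```

Every parameter that affects the audio (text, dialect, voice, emotion, speed) must be part of `params`, otherwise different requests would share one cached file.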
Concurrency Control
from threading import Semaphore
class RateLimiter:
    # Caps the number of requests in flight at once (a concurrency limit)
    def __init__(self, max_concurrent=5):
        self.semaphore = Semaphore(max_concurrent)
    def __enter__(self):
        self.semaphore.acquire()
        return self
    def __exit__(self, *args):
        self.semaphore.release()
rate_limiter = RateLimiter(max_concurrent=10)
def synthesize_with_rate_limit(**kwargs):
    with rate_limiter:
        return client.synthesize(**kwargs)
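The semaphore above limits how many requests run at the same time, not how many start per second. If the service also enforces a requests-per-second quota (an assumption; check the limits of your plan), a minimum-interval throttle can be used alongside it. The class below is an illustrative sketch with hypothetical names.

```python
import threading
import time


class MinIntervalThrottle:
    """Block callers so that consecutive calls start at least `interval` seconds apart."""

    def __init__(self, interval: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock      # injectable for testing
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        with self._lock:
            now = self._clock()
            wait_for = max(0.0, self._next_allowed - now)
            # Reserve the next slot before sleeping so concurrent callers queue in order
            self._next_allowed = max(now, self._next_allowed) + self.interval
        if wait_for > 0:
            self._sleep(wait_for)
```

Calling `throttle.wait()` immediately before each synthesize call spaces requests out, while the semaphore still bounds how many are in flight.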
Debugging & Troubleshooting
Common Issues
| Issue | Possible Cause | Solution |
|---|---|---|
| Unnatural speed | Speed parameter too extreme | Keep speed within 0.8-1.2 (80%-120%) |
| Weak emotion | Intensity too low | Increase emotion_intensity |
| Pauses too long/short | Incorrect break values | Adjust time parameter |
| Inconsistent volume | No normalization | Apply post-processing |
| Character confusion | Similar voice traits | Choose more distinct voices |
Debug Mode
import json
def debug_synthesize(**kwargs):
    print("=== Request Parameters ===")
    print(json.dumps(kwargs, indent=2, ensure_ascii=False))
    result = client.synthesize(**kwargs)
    print("=== Response Info ===")
    print(f"Duration: {result.duration}s")
    print(f"Sample Rate: {result.sample_rate}")
    print(f"Channels: {result.channels}")
    return result
Best Practices Summary
Tuning Checklist
- Choose appropriate speed for content type (news slower, videos faster)
- Add pauses at natural break points
- Match emotion intensity to content
- Use distinctly different voices for multiple characters
- Implement error retry for batch processing
- Use caching to avoid redundant generation
- Normalize output audio volume
Content Type Presets
PRESETS = {
"news": {
"speed": 0.9,
"emotion": "neutral",
"emotion_intensity": 0.4,
"break_after_sentence": "300ms"
},
"audiobook": {
"speed": 0.85,
"emotion": "storytelling",
"emotion_intensity": 0.6,
"break_after_paragraph": "800ms"
},
"short_video": {
"speed": 1.1,
"emotion": "cheerful",
"emotion_intensity": 0.7,
"break_after_sentence": "200ms"
},
"live_commerce": {
"speed": 1.05,
"emotion": "enthusiastic",
"emotion_intensity": 0.8,
"break_after_sentence": "150ms"
}
}
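The `break_after_sentence` and `break_after_paragraph` entries read like pause hints rather than request parameters. One way to apply a preset, sketched under that assumption, is to split it into API parameters and SSML with a `<break>` after each sentence. `apply_preset` is a hypothetical helper, and its sentence splitting is deliberately simple.

```python
import re
from xml.sax.saxutils import escape


def apply_preset(text: str, preset: dict) -> tuple[str, dict]:
    """Return (ssml, params): SSML with sentence breaks plus the remaining settings."""
    # Assumption: keys starting with "break_" are SSML hints, not request parameters
    params = {k: v for k, v in preset.items() if not k.startswith("break_")}
    pause = preset.get("break_after_sentence")
    # Split after ., !, ? followed by whitespace, or directly after Chinese punctuation
    pieces = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    sentences = [s.strip() for s in pieces if s.strip()]
    joiner = f'<break time="{pause}"/>' if pause else " "
    ssml = "<speak>" + joiner.join(escape(s) for s in sentences) + "</speak>"
    return ssml, params
```

The returned `params` can be passed to the synthesis call together with the SSML text, dialect, and voice.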
Next Steps
With these tuning techniques in hand, you can start producing professional-quality dialect audio.
Related Resources
- Getting Started with Dialect TTS: Review basics
- Dialect TTS API Deep Integration: Enterprise architecture
- Dialect TTS Service Comparison: Choose the right service
For questions, contact us via email: hello@xiangyinge.com