Sichuan Dialect TTS Batch Guide: SRT Export + Emotion Control
Hands-on workflow and API examples for batch synthesis, segmentation, and subtitle generation.
XiangYinGe Team
This hands-on tutorial covers a production workflow end to end: preparing scripts → batch-synthesizing Sichuan dialect audio → automatically generating subtitles (SRT/VTT) → packaging for export. You can follow it in XiangYinGe Web Studio or adapt the API batch-processing examples (Node/Python); a troubleshooting guide is included.
Target Audience & Expected Outcomes
Who Should Read This
- Media/Tourism/Government: Need localized Sichuan dialect broadcasting with unified voice style and compliance
- MCN/Educational Institutions: Batch produce short video narrations with one-click subtitle and project export
- Developers/Power Users: Want to automate long text processing and batch scripts with APIs
What You'll Produce
- Sichuan dialect audio with unified voice/style (WAV/MP3)
- Well-aligned SRT/VTT subtitles (with timecodes)
- Reusable dictionary/emotion presets/segmentation strategy templates
Part 1: Data Preparation (Script Templates, Naming Conventions, Pronunciation Dictionary)
1. Script Template (CSV/Excel)
We recommend managing scripts and parameters in a structured table so they can be imported directly into Web Studio or passed to the API.
| id | title | speaker | style | speed | pause_ms | emotion | text |
|---|---|---|---|---|---|---|---|
| 0001 | City Promo 01 | sc_female_A | narr | 1.00 | 180 | calm | Today's Chengdu has warmth in its street life and power in innovation. |
| 0002 | City Promo 02 | sc_female_A | narr | 0.95 | 160 | warm | Come chat, drink tea, and watch face-changing opera. |
| 0003 | Short Video 01 | sc_male_B | promo | 1.10 | 120 | upbeat | Let's go! Check out the trending spots, it's awesome! |
Field Descriptions:
- speaker: Voice ID (example values; actual IDs depend on your platform)
- style: narr (narration), promo (promotional), assistant (assistant), etc.
- speed: Speed coefficient (0.8~1.2 commonly used)
- pause_ms: Inter-sentence pause (milliseconds)
- emotion: calm/warm/upbeat/serious... (example values)
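Before importing a sheet, it can help to check it against the field descriptions above. The following is a minimal sketch, not part of the XiangYinGe tooling: the function name `validate_rows`, the set of required columns, and the treatment of the 0.8~1.2 range as a warning threshold are all assumptions made for illustration.

```python
import csv
import io

# Columns assumed to be mandatory for synthesis (an illustrative choice).
REQUIRED = ("id", "speaker", "style", "text")


def validate_rows(csv_text, speed_range=(0.8, 1.2)):
    """Return (row_id, problem) tuples; an empty list means nothing was flagged."""
    problems = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rid = row.get("id") or "?"
        for field in REQUIRED:
            if not (row.get(field) or "").strip():
                problems.append((rid, f"missing {field}"))
        try:
            speed = float(row.get("speed") or 1.0)
            if not speed_range[0] <= speed <= speed_range[1]:
                problems.append((rid, f"speed {speed} outside {speed_range}"))
        except ValueError:
            problems.append((rid, "speed is not a number"))
        try:
            if int(row.get("pause_ms") or 0) < 0:
                problems.append((rid, "pause_ms is negative"))
        except ValueError:
            problems.append((rid, "pause_ms is not an integer"))
    return problems


sample = (
    "id,title,speaker,style,speed,pause_ms,emotion,text\n"
    "0001,City Promo 01,sc_female_A,narr,1.00,180,calm,Hello Chengdu\n"
    "0002,City Promo 02,sc_female_A,narr,1.50,160,warm,\n"
)
for row_id, problem in validate_rows(sample):
    print(row_id, problem)
```

The speed range is only the commonly used band from the table, so a flagged value is a prompt to double-check rather than a hard error.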
2. Naming Conventions
- Project Directory: {projectSlug}-{yyyymmdd}, e.g., chengdu-promo-20250818/
- Audio: {id}_{speaker}_{style}.mp3 → 0001_sc_female_A_narr.mp3
- Subtitles: Same base name, e.g., 0001_sc_female_A_narr.srt
- Manifest: manifest.json records batch task parameters and output paths
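The naming scheme above is easy to centralize in two small helpers so that every script in a project produces the same paths. This is a sketch; `project_dir` and `output_names` are hypothetical helper names, not part of any XiangYinGe SDK.

```python
from datetime import date


def project_dir(slug, day):
    """Build {projectSlug}-{yyyymmdd} from a slug and a date."""
    return f"{slug}-{day:%Y%m%d}"


def output_names(row_id, speaker, style, audio_ext="mp3"):
    """Build matching audio and subtitle file names for one script row."""
    base = f"{row_id}_{speaker}_{style}"
    return {"audio": f"{base}.{audio_ext}", "srt": f"{base}.srt"}


print(project_dir("chengdu-promo", date(2025, 8, 18)))
print(output_names("0001", "sc_female_A", "narr"))
```

Keeping the id as a string (for example "0001" rather than 1) preserves the leading zeros that the convention relies on.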
3. Dialect Pronunciation Dictionary (Optional but Highly Recommended)
Create entries for polyphones, place names, personal names, and foreign words to improve pronunciation and stress.
Example (CSV):
| term | phoneme | note |
|---|---|---|
| Jinli | /tɕin˨˩ li˨˩/ | Tourist spot name |
| Longmenzhen | /luŋ˧˥ mən˨˩ t͡ʂən˥˩/ | Sichuan dialect: casual chat |
| Bashi | /pa˥ ʂɿ˥˩/ | Sichuan dialect: comfortable/pleasant |
| Hotpot | huo2 guo1 | Can use Mandarin phonemes with accent mapping |
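A dictionary kept as CSV can be turned into a list of entries for the request body. The sketch below assumes the `lexicon` payload shape used in the Node.js example in this guide (a list of objects with `term`, `phoneme`, and `note`); the helper names `load_lexicon` and `lexicon_for_text` are illustrative, and whether the service accepts per-request filtering is not confirmed here.

```python
import csv
import io


def load_lexicon(csv_text):
    """Parse term/phoneme/note rows; later duplicates of a term override earlier ones."""
    entries = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        term = (row.get("term") or "").strip()
        phoneme = (row.get("phoneme") or "").strip()
        if not term or not phoneme:
            continue  # skip incomplete rows instead of sending empty overrides
        entries[term] = {
            "term": term,
            "phoneme": phoneme,
            "note": (row.get("note") or "").strip(),
        }
    return list(entries.values())


def lexicon_for_text(lexicon, text):
    """Keep only the entries whose term actually occurs in the text."""
    return [entry for entry in lexicon if entry["term"] in text]


sample = (
    "term,phoneme,note\n"
    "Hotpot,huo2 guo1,Mandarin phonemes with accent mapping\n"
    "Bashi,,missing phoneme so this row is skipped\n"
    "Hotpot,huo3 guo1,later row wins\n"
)
lexicon = load_lexicon(sample)
print(lexicon_for_text(lexicon, "Hotpot tonight?"))
```

Filtering per text keeps request bodies small when the dictionary grows to hundreds of entries.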
Part 2: Batch Synthesis (Web Studio & API Methods)
Method A: Web Studio (No Code)
- Create Project → Select "Sichuan Dialect" package → Choose base voice (e.g., sc_female_A, sc_male_B)
- Import Scripts: Upload CSV/Excel → Map fields (text/speaker/style/speed/pause_ms/emotion)
- Select Segmentation Strategy (the default is fine; see Part 3 for details):
  - Punctuation-based segmentation
  - Soft limit of 18 Chinese characters (split by pause probability if exceeded)
  - Modal particles and dialect words bind preferentially to the previous sentence
- Set Emotion/Prosody Templates (Optional): Apply promo-upbeat for ads, narr-calm for narration
- One-Click Synthesis: Choose "4/8/16 parallel threads" and "retry on error"
- Review & Batch Edit: Preview each item with the player; dictionary corrections apply automatically
- Export: Select MP3 320kbps + SRT + manifest.json; the result can be packaged as a ZIP
Method B: API (Node/Python Examples)
Node.js (Read CSV → Synthesize MP3 & SRT)
// package.json requires: node-fetch, csv-parse, fs-extra
// Run as an ES module ("type": "module"); top-level await is used below
import fetch from 'node-fetch'
import { parse } from 'csv-parse/sync'
import fs from 'fs-extra'
import path from 'path'

const BASE_URL = 'https://api.xiangyinge.com/v1'
const API_KEY = process.env.XIANGYINGE_API_KEY
const outDir = 'chengdu-promo-20250818'
await fs.ensureDir(outDir)

const csv = fs.readFileSync('./scripts_sichuan.csv', 'utf8')
const rows = parse(csv, { columns: true, skip_empty_lines: true })
const manifest = []

// Convert seconds to an SRT timestamp (HH:MM:SS,mmm)
const toSrtTime = (sec) => {
  const ms = Math.round(sec * 1000) // round to avoid float artifacts such as 2199.999
  const h = String(Math.floor(ms / 3600000)).padStart(2, '0')
  const m = String(Math.floor((ms % 3600000) / 60000)).padStart(2, '0')
  const s = String(Math.floor((ms % 60000) / 1000)).padStart(2, '0')
  const ms3 = String(ms % 1000).padStart(3, '0')
  return `${h}:${m}:${s},${ms3}`
}

for (const r of rows) {
  const payload = {
    dialect: 'sc', // Sichuan dialect
    speaker: r.speaker || 'sc_female_A',
    style: r.style || 'narr',
    speed: Number(r.speed || 1.0),
    pause_ms: Number(r.pause_ms || 150),
    emotion: r.emotion || 'calm',
    text: r.text,
    // Optional: custom dictionary to override default pronunciation
    lexicon: [{ term: 'Longmenzhen', phoneme: 'luong2 men2 zhen4', note: 'dialect word' }],
    // Request subtitle timeline (highly recommended for SRT/VTT export)
    with_alignment: true,
  }

  const res = await fetch(`${BASE_URL}/tts:synthesize`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(payload),
  })
  if (!res.ok) {
    // Log the failure and move on so one bad row does not stop the batch
    console.error(`Synth failed for id=${r.id}`, await res.text())
    continue
  }

  const data = await res.json()
  // Assuming response returns { audio_base64, alignment: [{start,end,text}], audio_format }
  // Use the resolved payload values so file names stay valid when a CSV cell is empty
  const audioPath = path.join(outDir, `${r.id}_${payload.speaker}_${payload.style}.mp3`)
  const srtPath = audioPath.replace(/\.mp3$/, '.srt')

  // Write audio
  await fs.writeFile(audioPath, Buffer.from(data.audio_base64, 'base64'))

  // Write SRT: index, time range, text, then a blank line between cues
  const srt = data.alignment
    .map((seg, i) => `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text}\n`)
    .join('\n')
  await fs.writeFile(srtPath, srt, 'utf8')

  manifest.push({
    id: r.id,
    title: r.title,
    audio: path.basename(audioPath),
    srt: path.basename(srtPath),
    params: payload,
  })
  console.log('OK', r.id)
}

// Write manifest
await fs.writeJson(path.join(outDir, 'manifest.json'), { items: manifest }, { spaces: 2 })
console.log('DONE', outDir)
Python (Long Text Auto-Segmentation → Batch Synthesis)
# Dependencies: requests, pandas, tqdm
import base64
import json
import os
import re

import pandas as pd
import requests
from tqdm import tqdm

BASE_URL = 'https://api.xiangyinge.com/v1'
API_KEY = os.environ['XIANGYINGE_API_KEY']
OUT_DIR = 'chengdu-promo-20250818'
os.makedirs(OUT_DIR, exist_ok=True)


def split_cn(text, max_len=18):
    # Simple segmentation: split after sentence punctuation (full-width and ASCII),
    # then hard-split any segment that is still longer than max_len
    parts = re.split(r'([。！？；…!?;])', text)
    merged = [''.join(parts[i:i + 2]).strip() for i in range(0, len(parts), 2)]
    chunks = []
    for m in merged:
        while len(m) > max_len:
            chunks.append(m[:max_len])
            m = m[max_len:]
        if m:
            chunks.append(m)
    return chunks


def tts_one(text, speaker='sc_female_A', style='narr', speed=1.0, pause_ms=150, emotion='calm'):
    payload = {
        'dialect': 'sc',
        'speaker': speaker, 'style': style,
        'speed': speed, 'pause_ms': pause_ms, 'emotion': emotion,
        'text': text, 'with_alignment': True,
    }
    r = requests.post(f'{BASE_URL}/tts:synthesize',
                      headers={'Authorization': f'Bearer {API_KEY}'},
                      json=payload, timeout=60)
    r.raise_for_status()
    return r.json()


def to_srt_time(sec):
    ms = round(sec * 1000)  # round to avoid float artifacts
    h, ms = divmod(ms, 3600000)
    m, ms = divmod(ms, 60000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


# Read ids as strings so leading zeros (0001) survive into the file names
df = pd.read_csv('scripts_sichuan.csv', dtype={'id': str})
manifest = []

for _, row in tqdm(df.iterrows(), total=len(df)):
    text = row['text']
    # You can also send the entire text to server for segmentation; here we demo local segmentation
    pieces = split_cn(text, max_len=18)
    align_all, audio_all = [], b''
    offset = 0.0  # start time of the current piece within the concatenated audio

    # Simplified: synthesize sentence by sentence then concatenate
    # (production should use server-side cascading to avoid seams)
    for p in pieces:
        data = tts_one(p, row.get('speaker', 'sc_female_A'),
                       row.get('style', 'narr'),
                       float(row.get('speed', 1.0)),
                       int(row.get('pause_ms', 150)),
                       row.get('emotion', 'calm'))
        audio_all += base64.b64decode(data['audio_base64'])
        # Shift each piece's alignment by the running offset. This approximates the piece
        # duration by its last alignment end; use server-side unified alignment in production
        for seg in data['alignment']:
            align_all.append({**seg, 'start': seg['start'] + offset, 'end': seg['end'] + offset})
        if data['alignment']:
            offset += data['alignment'][-1]['end']

    base = f"{row['id']}_{row.get('speaker', 'sc_female_A')}_{row.get('style', 'narr')}"
    audio_path = os.path.join(OUT_DIR, base + '.mp3')
    srt_path = os.path.join(OUT_DIR, base + '.srt')

    with open(audio_path, 'wb') as f:
        f.write(audio_all)

    with open(srt_path, 'w', encoding='utf-8') as f:
        for i, seg in enumerate(align_all, 1):
            f.write(f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n{seg['text']}\n\n")

    manifest.append({'id': row['id'], 'title': row.get('title', ''),
                     'audio': os.path.basename(audio_path),
                     'srt': os.path.basename(srt_path)})

with open(os.path.join(OUT_DIR, 'manifest.json'), 'w', encoding='utf-8') as f:
    json.dump({'items': manifest}, f, ensure_ascii=False, indent=2)
print('DONE', OUT_DIR)
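Web Studio offers "4/8/16 parallel threads" and "retry on error" as options; a scripted batch can approximate both with the standard library. The following is a sketch under stated assumptions: `with_retry` and `run_batch` are hypothetical helpers, the exponential backoff schedule is an arbitrary choice, and `flaky_synth` is a stand-in for a real synthesis call such as the `tts_one` function above. Check your plan's rate limits before raising the worker count.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retry(fn, attempts=3, base_delay=0.5):
    """Wrap fn so that failures are retried with exponential backoff."""
    def wrapped(*args, **kwargs):
        for attempt in range(1, attempts + 1):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts:
                    raise  # give up after the last attempt
                time.sleep(base_delay * 2 ** (attempt - 1))
    return wrapped


def run_batch(fn, items, workers=4):
    """Run fn over items in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))


# Stand-in for a network call: fails once per item, then succeeds.
calls = {}


def flaky_synth(text):
    calls[text] = calls.get(text, 0) + 1
    if calls[text] == 1:
        raise RuntimeError("transient error")
    return f"audio for {text}"


results = run_batch(with_retry(flaky_synth, base_delay=0.0), ["a", "b", "c"])
print(results)
```

Threads suit this workload because each task spends most of its time waiting on the network rather than computing.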
Part 3: Automatic Segmentation & Subtitle Timeline (Alignment)
Why Segmentation Matters
- Improved Naturalness: Pauses at phrase/semantic boundaries sound more natural
- Better Subtitle Readability: Optimal 12-18 characters per line
- Easier Post-Production: More stable segmentation and replacement
Recommended Strategies
- Punctuation Priority: Split directly at the marks , 。 ! ? ; … and apply a secondary split at colons and in long sentences based on pause probability
- Modal Particle Binding: Particles such as "ma/yo/oh/ei" should merge with the previous sentence
- Duration Control: Keep each sentence between 1.0 and 6.0 seconds; merge if too short, split if too long
- Dialect Dictionary Weighting: Don't break phrases marked as "fixed collocations" in the dictionary
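Two of the strategies above, punctuation priority and modal particle binding, can be sketched locally in a few lines. This is an illustration only, not the segmentation the service performs: the Chinese characters chosen for "ma/yo/oh/ei", the sample sentence, and the fixed-width fallback for overlong segments (in place of pause probability) are all assumptions.

```python
import re

PUNCT = "，。！？；…,!?;"
# Assumed written forms of the particles "ma/yo/oh/ei".
PARTICLES = ("嘛", "哟", "哦", "诶")


def segment(text, max_len=18):
    """Split at punctuation, attach lone particles to the previous segment, cap length."""
    p = re.escape(PUNCT)
    # Each part is a clause plus the punctuation run that follows it.
    parts = re.findall(rf"[^{p}]+[{p}]*|[{p}]+", text.strip())
    bound = []
    for part in parts:
        body = part.strip(PUNCT + " ")
        if bound and body in PARTICLES:
            bound[-1] += part  # particle binding
        else:
            bound.append(part)
    segments = []
    for seg in bound:
        while len(seg) > max_len:  # crude fallback for overlong segments
            segments.append(seg[:max_len])
            seg = seg[max_len:]
        if seg:
            segments.append(seg)
    return segments


text = "今天的成都，有烟火气的温度，也有创新的力量。走起，摆龙门阵，哟！"
for piece in segment(text):
    print(piece)
```

Duration control and dictionary weighting need audio durations and the fixed-collocation list, so they are left out of this sketch.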
SRT Example
1
00:00:00,000 --> 00:00:02,200
Today's Chengdu has warmth in its street life.

2
00:00:02,200 --> 00:00:04,900
And power in innovation.

3
00:00:04,900 --> 00:00:07,400
Let's go, time to chat!
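When only SRT files are available, a VTT version can be derived from them locally. The sketch below is a minimal converter, assuming well-formed SRT input with blank lines between cues; `srt_to_vtt` is an illustrative name. It adds the WEBVTT header, drops the numeric cue indices (optional in VTT), and switches the millisecond separator from a comma to a period.

```python
import re


def srt_to_vtt(srt_text):
    """Convert SRT text to WebVTT text."""
    out = ["WEBVTT", ""]
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]  # drop the numeric cue index
        if not lines:
            continue
        lines[0] = lines[0].replace(",", ".")  # 00:00:02,200 -> 00:00:02.200
        out.extend(lines + [""])
    return "\n".join(out)


srt = (
    "1\n00:00:00,000 --> 00:00:02,200\nToday's Chengdu has warmth in its street life.\n\n"
    "2\n00:00:02,200 --> 00:00:04,900\nAnd power in innovation.\n"
)
print(srt_to_vtt(srt))
```

Only the timing line is rewritten, so commas inside the subtitle text are left untouched.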
Part 4: Export & Delivery
Common Export Combinations
- Audio: WAV 48kHz (for post-production) or MP3 320kbps (for final output)
- Subtitles: SRT (universal) and VTT (web-friendly)
- Project Package: manifest.json (parameters and semantic segmentation records)
Delivery Recommendations
- Use unified naming so that NLEs (Premiere/CapCut) can associate audio and subtitles automatically
- Include README.md (voice, parameters, generation date, version)
- For public release, you must include a "Synthetic Content Disclosure" and an authorization statement
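For API-driven batches, the ZIP delivery can be assembled with the standard library. This sketch assumes the `manifest.json` shape written by the scripts in this guide (`{"items": [{"audio": ..., "srt": ...}]}`); `package_project` is a hypothetical helper. It refuses to build the archive if the manifest references a file that is missing.

```python
import json
import zipfile
from pathlib import Path


def package_project(project_dir, zip_path=None):
    """Zip every file in project_dir after checking it against manifest.json."""
    project_dir = Path(project_dir)
    manifest = json.loads((project_dir / "manifest.json").read_text(encoding="utf-8"))
    missing = [
        name
        for item in manifest["items"]
        for name in (item["audio"], item["srt"])
        if not (project_dir / name).exists()
    ]
    if missing:
        raise FileNotFoundError(f"manifest references missing files: {missing}")

    # Default: write the archive next to the project directory.
    zip_path = Path(zip_path or f"{project_dir}.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in sorted(project_dir.iterdir()):
            if f.is_file():
                zf.write(f, arcname=f"{project_dir.name}/{f.name}")
    return zip_path
```

Calling `package_project("chengdu-promo-20250818")` would produce `chengdu-promo-20250818.zip` beside the project directory, with README.md included if it is present in the folder.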
Part 5: Common Issues & Troubleshooting (Sichuan Dialect Scenarios)
1. Segmentation too fragmented or too long?
If segments are too fragmented, raise the "max character threshold" or enable "modal particle binding"; if they are too long, lower the threshold or add commas to break up long sentences.
2. Polyphones/place names pronounced wrong?
Add entries to the dictionary (they take priority over the default pronunciation), and check whether the word was segmented incorrectly.
3. Unstable emotion, inconsistent rhythm?
Unify style and emotion; for long text use paragraph-level templates to avoid inter-sentence style drift.
4. Numbers/dates/English abbreviations not pronounced as desired?
Spell them out in the text ("two zero two five", "US dollars one hundred twenty") or override them with dictionary entries.
5. Sichuan dialect words not following conventional pronunciation?
Add entries for "fixed collocations"; provide phonemes/IPA if necessary for exact pronunciation.
6. Concatenation artifacts (local segmentation then stitching)?
Use server-side cascaded alignment or increase "cross-sentence smoothing"; avoid local concatenation.
7. SRT timeline slightly drifts from audio?
Enable "phoneme-level alignment" or apply global regression correction during export.
8. Compliance & Disclosure
Enable synthesis watermark, output metadata (generation time, request ID, voice ID), retain logs for 6+ months.
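For issue 7, one plausible reading of "global regression correction" is a least-squares linear fit: measure a few subtitle timestamps against where the words actually occur in the audio, fit t' = a·t + b, and apply it to every cue. The source does not define the method, so the sketch below is an interpretation; `fit_linear` and `correct_cues` are hypothetical names and the anchor values are made up for illustration.

```python
def fit_linear(pairs):
    """Least-squares fit of y = a*x + b over (subtitle_time, audio_time) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two anchor pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("anchor times must not all be identical")
    a = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / sxx
    b = mean_y - a * mean_x
    return a, b


def correct_cues(cues, a, b):
    """Apply the fitted mapping to each cue's start and end, clamped at zero."""
    return [
        {**cue, "start": max(0.0, a * cue["start"] + b), "end": max(0.0, a * cue["end"] + b)}
        for cue in cues
    ]


# Illustrative anchors: subtitles run 1% fast relative to the audio.
anchors = [(0.0, 0.0), (10.0, 10.1), (20.0, 20.2)]
a, b = fit_linear(anchors)
cues = [{"start": 4.9, "end": 7.4, "text": "Let's go, time to chat!"}]
print(correct_cues(cues, a, b))
```

A linear fit handles constant offsets and uniform speed mismatches; drift that varies within a file needs phoneme-level alignment instead.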
Part 6: Quality Self-Check List (Printable)
- Speed/pauses match scenario (ads slightly faster, narration slightly slower)
- Dictionary coverage: place names, personal names, brand words, polyphones
- Subtitles 12-18 characters per line, 1-6 seconds duration, appropriate line breaks
- Consistent loudness throughout (target -16 to -14 LUFS for short videos)
- Synthesis disclosure and authorization statement attached
- Reproducible output: complete manifest.json parameters
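The subtitle item on this checklist can be checked automatically. The sketch below parses SRT text and flags cues whose duration falls outside 1 to 6 seconds or whose lines exceed 18 characters; `check_srt` is an illustrative helper, and the character limit is meant for Chinese text, so translated subtitles would need a different threshold.

```python
import re

TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp):
    """Parse HH:MM:SS,mmm into seconds."""
    h, m, s, ms = map(int, TIME.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def check_srt(srt_text, max_chars=18, min_dur=1.0, max_dur=6.0):
    """Return (cue_index, problem) tuples for cues that break the checklist limits."""
    issues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            issues.append((lines[0] if lines else "?", "malformed cue"))
            continue
        start, end = (to_seconds(t) for t in lines[1].split("-->"))
        duration = end - start
        if not min_dur <= duration <= max_dur:
            issues.append((lines[0], f"duration {duration:.1f}s"))
        for text in lines[2:]:
            if len(text) > max_chars:
                issues.append((lines[0], f"line has {len(text)} chars"))
    return issues


sample = (
    "1\n00:00:00,000 --> 00:00:02,200\n今天的成都\n\n"
    "2\n00:00:02,200 --> 00:00:02,700\n走起\n"
)
print(check_srt(sample))
```

Only the upper character bound is enforced, since a short line is not an error in itself.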
Part 7: Next Steps
Related Resources
- XiangYinGe API Documentation (includes sandbox key and examples)
- Choose a Package (early bird users get credits and onboarding support)
- Enterprise/Tourism/Media: Support for private deployment and custom voices, contact: hello@xiangyinge.com
Conclusion
You've now mastered the complete pipeline from scripts to Sichuan dialect audio and subtitles. Whether you're in media, tourism projects, or short video teams, you can standardize and template this workflow for continuous reuse.