Sichuan Dialect TTS Batch Guide: SRT Export + Emotion Control

This hands-on tutorial covers the full production pipeline: preparing scripts → batch-synthesizing Sichuan dialect audio → automatically generating subtitles (SRT/VTT) → packaging for export. You can follow it in XiangYinGe Web Studio without writing code, or adapt the API batch-processing examples (Node/Python). A troubleshooting guide follows at the end.

Target Audience & Expected Outcomes

Who Should Read This

Media/Tourism/Government: Need localized Sichuan dialect broadcasting with unified voice style and compliance
MCN/Educational Institutions: Batch produce short video narrations with one-click subtitle and project export
Developers/Power Users: Want to automate long text processing and batch scripts with APIs

What You'll Produce

Sichuan dialect audio with unified voice/style (WAV/MP3)
Well-aligned SRT/VTT subtitles (with timecodes)
Reusable dictionary/emotion presets/segmentation strategy templates

If you don't have a sample yet, we recommend using the "sample script" below to quickly generate a demo.

Part 1: Data Preparation (Script Templates, Naming Conventions, Pronunciation Dictionary)

1. Script Template (CSV/Excel)

We recommend using structured tables to manage scripts and parameters for easy import into Web Studio or API.

id	title	speaker	style	speed	pause_ms	emotion	text
0001	City Promo 01	sc_female_A	narr	1.00	180	calm	Today's Chengdu has warmth in its street life and power in innovation.
0002	City Promo 02	sc_female_A	narr	0.95	160	warm	Come chat, drink tea, and watch face-changing opera.
0003	Short Video 01	sc_male_B	promo	1.10	120	upbeat	Let's go! Check out the trending spots, it's awesome!

Field Descriptions:

speaker: Voice ID (example values, actual IDs depend on your platform)
style: `narr` (narration), `promo` (promotional), `assistant` (assistant), etc.
speed: Speed coefficient (0.8~1.2 commonly used)
pause_ms: Inter-sentence pause (milliseconds)
emotion: calm/warm/upbeat/serious... (example values)
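
Before importing a sheet, it can help to catch malformed rows early. The following is a minimal sketch, not part of any XiangYinGe tooling: `validate_row` is a hypothetical helper, the required-field list is an assumption, and the 0.8-1.2 speed range is only the "commonly used" range quoted above, so out-of-range values are reported rather than rejected.

```python
from typing import Dict, List

# Assumption: these three fields are the minimum needed to synthesize a row.
REQUIRED_FIELDS = ("id", "speaker", "text")


def validate_row(row: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one script row (empty list = looks fine)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not (row.get(field) or "").strip():
            problems.append(f"missing {field}")
    try:
        speed = float(row.get("speed") or 1.0)
        if not 0.8 <= speed <= 1.2:
            problems.append(f"speed {speed} outside the commonly used 0.8-1.2 range")
    except ValueError:
        problems.append("speed is not a number")
    try:
        if int(row.get("pause_ms") or 0) < 0:
            problems.append("pause_ms is negative")
    except ValueError:
        problems.append("pause_ms is not an integer")
    return problems
```

Running this over every row read with `csv.DictReader` and printing the non-empty results gives a quick pre-flight report before any synthesis request is sent.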

2. Naming Conventions

Project Directory: {projectSlug}-{yyyymmdd}, e.g., chengdu-promo-20250818/
Audio: {id}_{speaker}_{style}.mp3 → 0001_sc_female_A_narr.mp3
Subtitles: Same base name as the audio, e.g., 0001_sc_female_A_narr.srt
Manifest: manifest.json records batch task parameters and output paths
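
The naming scheme above can be sketched as two small helpers. The function names `project_dir` and `output_names` are hypothetical and exist only for illustration; the patterns themselves are the ones listed above.

```python
from datetime import date


def project_dir(project_slug: str, day: date) -> str:
    """Build the {projectSlug}-{yyyymmdd} directory name."""
    return f"{project_slug}-{day:%Y%m%d}"


def output_names(item_id: str, speaker: str, style: str, audio_ext: str = "mp3") -> dict:
    """Build matching audio and subtitle file names that share one base name."""
    stem = f"{item_id}_{speaker}_{style}"
    return {"audio": f"{stem}.{audio_ext}", "srt": f"{stem}.srt"}
```

Generating both names from one stem keeps audio and subtitles paired even if the audio format changes from MP3 to WAV.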

3. Dialect Pronunciation Dictionary (Optional but Highly Recommended)

Create entries for polyphones, place names, personal names, foreign words to improve pronunciation and stress.

Example (CSV):

term	phoneme	note
Jinli	/tɕin˨˩ li˨˩/	Tourist spot name
Longmenzhen	/luŋ˧˥ mən˨˩ t͡ʂən˥˩/	Sichuan dialect: casual chat
Bashi	/pa˥ ʂɿ˥˩/	Sichuan dialect: comfortable/pleasant
Hotpot	huo2 guo1	Can use Mandarin phonemes with accent mapping

The dictionary can be uploaded in Web Studio or attached to API requests (see the `lexicon` field in the API examples below).
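
As a rough sketch of how a dictionary file might be turned into request data, the helper below reads a CSV with the `term`, `phoneme`, and `note` columns shown above and returns a list of entries. It assumes the API accepts entries in the same shape as the `lexicon` field in the Node example; `load_lexicon` is a hypothetical name, and rows without a term or phoneme are skipped rather than sent.

```python
import csv
import io


def load_lexicon(csv_text: str, delimiter: str = ",") -> list:
    """Parse dictionary CSV text into a list of {term, phoneme, note} entries."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    entries = []
    for row in reader:
        term = (row.get("term") or "").strip()
        phoneme = (row.get("phoneme") or "").strip()
        if not term or not phoneme:
            continue  # an entry without a term or phoneme cannot override anything
        entries.append({"term": term, "phoneme": phoneme, "note": (row.get("note") or "").strip()})
    return entries
```

Pass `delimiter="\t"` if the dictionary is stored as tab-separated text.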

Part 2: Batch Synthesis (Web Studio & API Methods)

Method A: Web Studio (No Code)

Create Project → Select "Sichuan Dialect" package → Choose base voice (e.g., sc_female_A, sc_male_B)
Import Scripts: Upload CSV/Excel → Map fields (text/speaker/style/speed/pause_ms/emotion)
Select Segmentation Strategy (the default is fine; see Part 3 for details):
- Punctuation-based segmentation
- Soft limit of 18 Chinese characters (split by pause probability if exceeded)
- Modal particles and dialect words are kept with the preceding sentence where possible
Set Emotion/Prosody Templates (Optional): Apply promo-upbeat for ads, narr-calm for narration
One-Click Synthesis: Choose "4/8/16 parallel threads" and "retry on error"
Review & Batch Edit: Preview each item in the player; dictionary corrections are applied automatically
Export: Select MP3 320kbps + SRT + manifest.json, optionally packaged as a ZIP

Pro tip: Enable "**Same text deduplication cache**" - identical sentences reused in multiple places **won't be charged multiple times**.
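
How Web Studio implements this cache is not documented here, and whether API calls are deduplicated for billing is something to confirm with the platform. For API scripts, a client-side analogue might look like the sketch below: requests are keyed by a hash of the full payload, so identical text with identical parameters is synthesized only once per run. `SynthCache` and `cache_key` are hypothetical names.

```python
import hashlib
import json


def cache_key(payload: dict) -> str:
    """Hash the payload in a key-order-independent way."""
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SynthCache:
    """Wrap a synthesize(payload) callable and reuse results for identical payloads."""

    def __init__(self, synthesize):
        self._synthesize = synthesize
        self._store = {}
        self.calls = 0  # number of real synthesis calls made

    def get(self, payload: dict):
        key = cache_key(payload)
        if key not in self._store:
            self.calls += 1
            self._store[key] = self._synthesize(payload)
        return self._store[key]
```

Because the key covers every parameter, the same sentence with a different speaker or emotion is treated as a separate request.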

Method B: API (Node/Python Examples)

**Note**: The following are example interfaces. Use your actual documentation for `BASE_URL` and field names. You can test with a **sandbox key** first, then switch to production.

Node.js (Read CSV → Synthesize MP3 & SRT)

// package.json requires: node-fetch, csv-parse, fs-extra, and "type": "module" (the script uses ESM imports and top-level await)
import fetch from 'node-fetch'
import { parse } from 'csv-parse/sync'
import fs from 'fs-extra'
import path from 'path'

const BASE_URL = 'https://api.xiangyinge.com/v1'
const API_KEY = process.env.XIANGYINGE_API_KEY
const outDir = 'chengdu-promo-20250818'

await fs.ensureDir(outDir)

const csv = fs.readFileSync('./scripts_sichuan.csv', 'utf8')
const rows = parse(csv, { columns: true, skip_empty_lines: true })

const manifest = []

for (const r of rows) {
  const payload = {
    dialect: 'sc', // Sichuan dialect
    speaker: r.speaker || 'sc_female_A',
    style: r.style || 'narr',
    speed: Number(r.speed || 1.0),
    pause_ms: Number(r.pause_ms || 150),
    emotion: r.emotion || 'calm',
    text: r.text,
    // Optional: custom dictionary to override default pronunciation
    lexicon: [{ term: 'Longmenzhen', phoneme: 'luong2 men2 zhen4', note: 'dialect word' }],
    // Request subtitle timeline (highly recommended for SRT/VTT export)
    with_alignment: true,
  }

  const res = await fetch(`${BASE_URL}/tts:synthesize`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(payload),
  })

  if (!res.ok) {
    const t = await res.text()
    console.error(`Synth failed for id=${r.id}`, t)
    continue
  }

  const data = await res.json()
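  // Assuming the response returns { audio_base64, alignment: [{start,end,text}], audio_format }
  // Use the payload values so file names stay correct when the CSV leaves speaker/style blank
  const audioPath = path.join(outDir, `${r.id}_${payload.speaker}_${payload.style}.mp3`)
  const srtPath = audioPath.replace(/\.mp3$/, '.srt')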

  // Write audio
  const buf = Buffer.from(data.audio_base64, 'base64')
  await fs.writeFile(audioPath, buf)

  // Write SRT (assumes alignment times are in seconds)
  const toSrtTime = (sec) => {
    const ms = Math.round(sec * 1000) // round so floating-point error cannot drop a millisecond
    const h = String(Math.floor(ms / 3600000)).padStart(2, '0')
    const m = String(Math.floor((ms % 3600000) / 60000)).padStart(2, '0')
    const s = String(Math.floor((ms % 60000) / 1000)).padStart(2, '0')
    const ms3 = String(ms % 1000).padStart(3, '0')
    return `${h}:${m}:${s},${ms3}`
  }
  // Fall back to an empty list if the response carries no alignment
  const srt = (data.alignment ?? [])
    .map((seg, i) => `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text}\n`)
    .join('\n')

  await fs.writeFile(srtPath, srt, 'utf8')

  manifest.push({
    id: r.id,
    title: r.title,
    audio: path.basename(audioPath),
    srt: path.basename(srtPath),
    params: payload,
  })

  console.log('OK', r.id)
}

// Write manifest
await fs.writeJson(path.join(outDir, 'manifest.json'), { items: manifest }, { spaces: 2 })
console.log('DONE', outDir)

Python (Long Text Auto-Segmentation → Batch Synthesis)

# Dependencies: requests, pandas, tqdm
import base64, json, os
import pandas as pd
import requests
from tqdm import tqdm

BASE_URL = 'https://api.xiangyinge.com/v1'
API_KEY = os.environ['XIANGYINGE_API_KEY']
OUT_DIR = 'chengdu-promo-20250818'
os.makedirs(OUT_DIR, exist_ok=True)

def split_cn(text, max_len=18):
    # Simple segmentation: split by punctuation, then further split long segments
    import re
    parts = re.split(r'([。！？；…])', text)
    merged = [''.join(parts[i:i+2]).strip() for i in range(0, len(parts), 2)]
    chunks = []
    for m in merged:
        while len(m) > max_len:
            chunks.append(m[:max_len])
            m = m[max_len:]
        if m:
            chunks.append(m)
    return chunks

def tts_one(text, speaker='sc_female_A', style='narr', speed=1.0, pause_ms=150, emotion='calm'):
    payload = {
        'dialect': 'sc',
        'speaker': speaker, 'style': style,
        'speed': speed, 'pause_ms': pause_ms, 'emotion': emotion,
        'text': text, 'with_alignment': True
    }
    r = requests.post(f'{BASE_URL}/tts:synthesize',
                      headers={'Authorization': f'Bearer {API_KEY}'},
                      json=payload, timeout=60)
    r.raise_for_status()
    return r.json()

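# Read ids as strings; otherwise pandas parses 0001 as the integer 1 and file names lose their zeros
df = pd.read_csv('scripts_sichuan.csv', dtype={'id': str})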
manifest = []

for _, row in tqdm(df.iterrows(), total=len(df)):
    text = row['text']
    # You can also send the entire text to server for segmentation; here we demo local segmentation
    pieces = split_cn(text, max_len=18)
    align_all, audio_all = [], b''
    offset = 0.0  # running start time (seconds) of the current piece on the combined timeline

    # Simplified: synthesize sentence by sentence then concatenate
    # (production should use server-side cascading to avoid seams)
    for p in pieces:
        data = tts_one(p, row.get('speaker', 'sc_female_A'),
                          row.get('style', 'narr'),
                          float(row.get('speed', 1.0)),
                          int(row.get('pause_ms', 150)),
                          row.get('emotion', 'calm'))
        audio_all += base64.b64decode(data['audio_base64'])
        for seg in data['alignment']:
            # Shift each piece onto the combined timeline. This approximates a piece's length
            # by its last alignment end; use server-side unified alignment in production.
            align_all.append({**seg, 'start': seg['start'] + offset, 'end': seg['end'] + offset})
        if data['alignment']:
            offset = align_all[-1]['end']

    base = f"{row['id']}_{row.get('speaker','sc_female_A')}_{row.get('style','narr')}"
    audio_path = os.path.join(OUT_DIR, base + '.mp3')
    srt_path = os.path.join(OUT_DIR, base + '.srt')

    with open(audio_path, 'wb') as f:
        f.write(audio_all)

    def to_srt_time(sec):
        ms = int(round(sec * 1000))  # round rather than truncate to avoid off-by-one milliseconds
        h, ms = divmod(ms, 3600000)
        m, ms = divmod(ms, 60000)
        s, ms = divmod(ms, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    with open(srt_path, 'w', encoding='utf-8') as f:
        for i, seg in enumerate(align_all, 1):
            f.write(f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n{seg['text']}\n\n")

    manifest.append({'id': row['id'], 'title': row.get('title',''),
                    'audio': os.path.basename(audio_path),
                    'srt': os.path.basename(srt_path)})

with open(os.path.join(OUT_DIR, 'manifest.json'), 'w', encoding='utf-8') as f:
    json.dump({'items': manifest}, f, ensure_ascii=False, indent=2)

print('DONE', OUT_DIR)
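
In the script above, a single failed request raises and stops the whole batch. One possible way to make it more forgiving, similar in spirit to the "retry on error" option in Web Studio, is a small retry wrapper with exponential backoff. This is a generic sketch: `with_retries` is a hypothetical helper, the attempt count and delays are arbitrary, and real code should catch only the network errors it expects (for example `requests.RequestException`).

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n seconds and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for n in range(attempts):
        try:
            return func()
        except Exception as exc:  # narrow this to the errors you expect in real use
            last_error = exc
            if n < attempts - 1:
                sleep(base_delay * (2 ** n))
    raise last_error
```

With the script above, a call might look like `data = with_retries(lambda: tts_one(p))`; the `sleep` parameter exists so the delay can be replaced in tests.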

Part 3: Automatic Segmentation & Subtitle Timeline (Alignment)

Why Segmentation Matters

Improved Naturalness: Pauses at phrase/semantic boundaries sound more natural
Better Subtitle Readability: Optimal 12-18 characters per line
Easier Post-Production: Stable segment boundaries make it simpler to re-cut or replace individual lines

Recommended Strategies

Punctuation Priority: Split directly at ，。！？；…; for colons and long sentences, apply a secondary split by pause probability
Modal Particle Binding: Particles like "ma/yo/oh/ei" should be merged into the previous sentence
Duration Control: Keep each sentence between 1.0 and 6.0 seconds; merge segments that are too short and split those that are too long
Dialect Dictionary Weighting: Don't break phrases marked as "fixed collocations" in the dictionary
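
The merge half of the duration-control rule can be sketched on alignment segments of the form `{start, end, text}` (times in seconds, as assumed in the API examples). The helper below folds any segment shorter than the minimum into the segment before it, as long as the combined span stays within the maximum. This also approximates modal particle binding, since a trailing particle is usually very short. `merge_short_segments` is a hypothetical name, and splitting over-long segments is left out because it needs word-level timing.

```python
def merge_short_segments(segments, min_dur=1.0, max_dur=6.0, joiner=""):
    """Merge segments shorter than min_dur into the previous one when the result fits max_dur."""
    merged = []
    for seg in segments:
        seg = dict(seg)  # copy so the caller's data is not modified
        if merged:
            prev = merged[-1]
            too_short = (seg["end"] - seg["start"]) < min_dur
            fits = (seg["end"] - prev["start"]) <= max_dur
            if too_short and fits:
                prev["end"] = seg["end"]
                prev["text"] = prev["text"] + joiner + seg["text"]
                continue
        # A short first segment has nothing before it, so it is kept as-is
        merged.append(seg)
    return merged
```

Use the default `joiner=""` for Chinese text and `joiner=" "` for space-separated languages.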

SRT Example

1
00:00:00,000 --> 00:00:02,200
Today's Chengdu has warmth in its street life.

2
00:00:02,200 --> 00:00:04,900
And power in innovation.

3
00:00:04,900 --> 00:00:07,400
Let's go, time to chat!

Part 4: Export & Delivery

Common Export Combinations

Audio: WAV 48kHz (for post-production) or MP3 320kbps (for final output)
Subtitles: SRT (universal) and VTT (web-friendly)
Project Package: manifest.json (parameters and semantic segmentation records)
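
If an API response only yields SRT, a VTT file can be derived from it locally. The sketch below relies on two facts about the formats: WebVTT files start with a `WEBVTT` header line, and WebVTT timestamps use a period instead of a comma before the milliseconds. `srt_to_vtt` is a hypothetical helper and does not handle styling or positioning cues.

```python
import re

# Matches an SRT timestamp such as 00:00:02,200
_SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT by adding the header and fixing timestamp separators."""
    lines = []
    for line in srt_text.replace("\r\n", "\n").strip().split("\n"):
        if "-->" in line:  # only touch timing lines, never subtitle text
            line = _SRT_TIME.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"
```

The numeric cue counters from the SRT are kept; WebVTT treats them as optional cue identifiers.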

Delivery Recommendations

Use unified naming so NLEs (Premiere/CapCut) can auto-associate audio with subtitles
Include a README.md (voice, parameters, generation date, version)
Public releases must include a "Synthetic Content Disclosure" and an authorization statement

Part 5: Common Issues & Troubleshooting (Sichuan Dialect Scenarios)

1. Segmentation too fragmented or too long?

Raise or lower the "max character threshold", or enable "modal particle binding"; in long sentences, add commas to give the segmenter better split points.

2. Polyphones/place names pronounced wrong?

Add entries to the dictionary (they take priority over the default pronunciation) and check whether the term was split incorrectly during segmentation.

3. Unstable emotion, inconsistent rhythm?

Unify style and emotion; for long text use paragraph-level templates to avoid inter-sentence style drift.

4. Numbers/dates/English abbreviations not pronounced as desired?

Spell them out in the text ("two zero two five", "US dollars one hundred twenty") or override them with a dictionary entry.

5. Sichuan dialect words not following conventional pronunciation?

Add entries for "fixed collocations"; provide phonemes/IPA if necessary for exact pronunciation.

6. Concatenation artifacts (local segmentation then stitching)?

Use server-side cascaded alignment or increase "cross-sentence smoothing"; avoid local concatenation.

7. SRT timeline slightly drifts from audio?

Enable "phoneme-level alignment" or apply global regression correction during export.

8. Compliance & Disclosure

Enable the synthesis watermark, export metadata (generation time, request ID, voice ID), and retain logs for at least 6 months.
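
For the timeline drift in issue 7, one reading of "global regression correction" is a least-squares linear fit: measure a few anchor points by hand (the time a subtitle shows versus the time the words are actually heard), fit `reference = a * observed + b`, and apply that mapping to every cue. The sketch below works under that interpretation only; how the platform implements its own correction is not stated here, and `fit_linear` and `correct_segments` are hypothetical names.

```python
def fit_linear(observed, reference):
    """Least-squares fit of reference = a * observed + b; returns (a, b)."""
    n = len(observed)
    if n < 2 or n != len(reference):
        raise ValueError("need at least two matching anchor pairs")
    mean_x = sum(observed) / n
    mean_y = sum(reference) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    if sxx == 0:
        raise ValueError("anchor times must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, reference))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def correct_segments(segments, a, b):
    """Apply the fitted mapping to each segment's start and end, clamping at zero."""
    return [
        {**seg, "start": max(0.0, a * seg["start"] + b), "end": max(0.0, a * seg["end"] + b)}
        for seg in segments
    ]
```

A slope `a` different from 1 indicates drift that grows over time, while `b` captures a constant offset; three or four anchors spread across the clip are usually enough to tell the two apart.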

Part 6: Quality Self-Check List (Printable)

Speed/pauses match scenario (ads slightly faster, narration slightly slower)
Dictionary coverage: place names, personal names, brand words, polyphones
Subtitles 12-18 characters per line, 1-6 seconds duration, appropriate line breaks
Consistent loudness throughout (a target of roughly -14 to -16 LUFS is common for short videos)
Synthesis disclosure and authorization statement attached
Reproducible output: complete manifest.json parameters
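
The subtitle items on this checklist can be checked mechanically. The sketch below flags cues whose lines exceed the character limit or whose duration falls outside 1-6 seconds, using segments of the form `{start, end, text}` with times in seconds. `lint_subtitles` and its issue codes are hypothetical; the limits are the ones from the checklist and were chosen with Chinese text in mind.

```python
def lint_subtitles(segments, max_chars=18, min_dur=1.0, max_dur=6.0):
    """Return (cue_number, issue_code) pairs for cues that break the checklist limits."""
    issues = []
    for number, seg in enumerate(segments, 1):
        for line in seg["text"].split("\n"):
            if len(line) > max_chars:
                issues.append((number, "line_too_long"))
        duration = seg["end"] - seg["start"]
        if duration < min_dur:
            issues.append((number, "too_short"))
        elif duration > max_dur:
            issues.append((number, "too_long"))
    return issues
```

An empty result means every cue is within limits; loudness and disclosure still need to be checked separately.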

Part 7: Next Steps

XiangYinGe API Documentation (includes sandbox key and examples)
Choose a Package (early bird users get credits and onboarding support)
Enterprise/Tourism/Media: Support for private deployment and custom voices, contact: hello@xiangyinge.com

Conclusion

You've now mastered the complete pipeline from scripts to Sichuan dialect audio and subtitles. Whether you're in media, tourism projects, or short video teams, you can standardize and template this workflow for continuous reuse.
