Qwen3-TTS Integration
Run Alibaba's Qwen3-TTS locally for high-quality, multilingual text-to-speech. This guide covers setting up the OpenAI-compatible TTS server included with Libre WebUI.
Overview
Qwen3-TTS is an advanced text-to-speech system featuring:
- 9 pre-built voices spanning English, Chinese, Japanese, and Korean
- Support for 10 languages, including German, French, Spanish, Italian, Portuguese, and Russian
- Voice cloning from 3-second audio samples
- Voice design using natural language descriptions
- Instruction control for emotion and prosody
The included server wraps Qwen3-TTS in an OpenAI-compatible API, allowing Libre WebUI to use it through the standard plugin system.
Requirements
| Component | Minimum | Recommended |
|---|---|---|
| Python | 3.12+ | 3.12 (not 3.14) |
| GPU VRAM | 4GB (0.6B models) | 8GB+ (1.7B models) |
| RAM | 8GB | 16GB+ |
| Disk | 5GB | 10GB |
Platform Support
| Platform | Backend | Notes |
|---|---|---|
| NVIDIA GPU | CUDA | Best performance, bfloat16 support |
| Apple Silicon | MPS | Use 0.6B models for memory efficiency |
| CPU | PyTorch | Slower, use 0.6B models |
Use the customvoice-0.6b model variant on Mac to avoid memory pressure. The 1.7B models may cause system instability on machines with 16GB unified memory.
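The guidance above can be summarized in a small helper. This is a hypothetical sketch (neither `detect_backend` nor `pick_model` is part of the server; the server just takes `--model` on the command line), useful if you script the launch:

```python
def detect_backend() -> str:
    """Return 'cuda', 'mps', or 'cpu' based on what PyTorch reports."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        if torch.backends.mps.is_available():
            return "mps"
    except (ImportError, AttributeError):
        pass
    return "cpu"

def pick_model(backend: str) -> str:
    """1.7B models want ~8GB of VRAM; fall back to 0.6B on MPS and CPU."""
    return "customvoice-1.7b" if backend == "cuda" else "customvoice-0.6b"
```

For example, `python server.py --model $(python -c 'from pick import detect_backend, pick_model; print(pick_model(detect_backend()))')` would start the server with a size appropriate for the machine.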
Quick Start
1. Install the Server
cd examples/qwen-tts-server
# Create virtual environment (Python 3.12 required)
python3.12 -m venv venv
source venv/bin/activate # Linux/macOS
# or: venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
2. Start the Server
# NVIDIA GPU (recommended)
python server.py --model customvoice-1.7b
# Apple Silicon
python server.py --model customvoice-0.6b
# CPU (slower)
python server.py --model customvoice-0.6b
The server runs at http://localhost:8100 by default.
3. Configure Libre WebUI
The plugin is pre-configured in plugins/qwen-tts.json. Enable it in Settings → Plugins → Qwen3 TTS.
4. Test It
curl http://localhost:8100/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "qwen3-tts", "input": "Hello, welcome to Libre WebUI!", "voice": "Ryan"}' \
--output speech.wav
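The same request can be made from Python using only the standard library. A minimal sketch (the `build_speech_request` and `synthesize` helpers are illustrative, not part of the project; `synthesize` assumes the server from step 2 is running):

```python
import json
import urllib.request

def build_speech_request(text: str, voice: str = "Ryan",
                         url: str = "http://localhost:8100/v1/audio/speech"):
    """Build a POST request carrying the same JSON body as the curl example."""
    body = json.dumps({"model": "qwen3-tts", "input": text, "voice": voice})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def synthesize(text: str, out_path: str = "speech.wav") -> None:
    """Send the request and save the WAV response (requires a running server)."""
    req = build_speech_request(text)
    with urllib.request.urlopen(req) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
```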
Available Models
| Model | Size | Use Case |
|---|---|---|
| customvoice-1.7b | ~3.5GB | Pre-built voices with instruction control |
| customvoice-0.6b | ~1.5GB | Lightweight variant for limited VRAM |
| voicedesign-1.7b | ~3.5GB | Create voices from text descriptions |
| base-1.7b | ~3.5GB | Voice cloning from 3-second samples |
| base-0.6b | ~1.5GB | Lightweight voice cloning |
Voices
Pre-Built Voices (CustomVoice Models)
| Voice | Language | Description |
|---|---|---|
| Ryan | English | Male, clear and natural |
| Aiden | English | Male, warm tone |
| Vivian | Chinese | Female, professional |
| Serena | Chinese | Female, friendly |
| Uncle_Fu | Chinese | Male, mature |
| Dylan | Chinese | Male, Beijing dialect |
| Eric | Chinese | Male, Sichuan dialect |
| Ono_Anna | Japanese | Female |
| Sohee | Korean | Female |
OpenAI Voice Aliases
For compatibility with OpenAI TTS clients, the server maps OpenAI voice names:
| OpenAI Voice | Maps To |
|---|---|
| alloy | Ryan |
| echo | Aiden |
| fable | Vivian |
| onyx | Uncle_Fu |
| nova | Serena |
| shimmer | Ono_Anna |
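The mapping amounts to a simple lookup with pass-through for native voice names. A sketch (the alias table comes from above; lowercase normalization of incoming names is an assumption about the server's behavior):

```python
# Alias map taken from the table above.
OPENAI_ALIASES = {
    "alloy": "Ryan",
    "echo": "Aiden",
    "fable": "Vivian",
    "onyx": "Uncle_Fu",
    "nova": "Serena",
    "shimmer": "Ono_Anna",
}

def resolve_voice(name: str) -> str:
    """Map an OpenAI voice name to a Qwen3-TTS voice; pass others through."""
    return OPENAI_ALIASES.get(name.lower(), name)
```

This is why an off-the-shelf OpenAI TTS client that requests `alloy` still gets audio: the server substitutes `Ryan` transparently.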
API Reference
Speech Generation
Endpoint: POST /v1/audio/speech
{
"model": "qwen3-tts",
"input": "Text to convert to speech",
"voice": "Ryan",
"response_format": "wav",
"instruct": "Speak with enthusiasm",
"language": "English"
}
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | qwen3-tts | Model identifier |
| input | string | required | Text to synthesize (max 10,000 chars) |
| voice | string | ryan | Voice name (see table above) |
| response_format | string | wav | Audio format (only wav supported) |
| instruct | string | "" | Emotion/prosody instruction |
| language | string | auto-detect | Override language detection |
Response: Audio file (audio/wav)
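A request body with the defaults from the table can be assembled and pre-validated client-side. A sketch (`speech_payload` is a hypothetical helper; the server performs its own validation regardless):

```python
def speech_payload(text, *, model="qwen3-tts", voice="Ryan",
                   response_format="wav", instruct="", language=None):
    """Assemble a /v1/audio/speech body using the defaults from the table."""
    if not text:
        raise ValueError("input is required")
    if len(text) > 10_000:
        raise ValueError("input exceeds the 10,000-character limit")
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "instruct": instruct,
    }
    if language is not None:  # omit to let the server auto-detect
        payload["language"] = language
    return payload
```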
Voice Design
Endpoint: POST /v1/audio/voice-design
Create custom voices from natural language descriptions.
{
"model": "qwen3-tts-voicedesign",
"input": "Welcome to our service.",
"voice_description": "A warm, friendly female voice with a slight British accent",
"language": "English"
}
Requires the voicedesign-1.7b model to be loaded.
Voice Cloning
Endpoint: POST /v1/audio/voice-clone
Clone a voice from a 3+ second audio sample.
curl -X POST http://localhost:8100/v1/audio/voice-clone \
-F "input=Hello, this is my cloned voice." \
-F "[email protected]" \
-F "reference_text=This is what was said in the reference." \
--output cloned.wav
| Parameter | Type | Description |
|---|---|---|
| input | string | Text to synthesize |
| reference_audio | file | 3+ second audio sample |
| reference_text | string | Transcript of reference audio |
Requires the base-1.7b or base-0.6b model to be loaded.
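curl handles the multipart encoding above automatically; doing the same from Python with only the standard library means building the multipart/form-data body by hand. An illustrative sketch (field names follow the table; the boundary string is arbitrary):

```python
import uuid

def multipart_body(fields: dict, file_field: str, filename: str,
                   file_bytes: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one file upload as multipart/form-data.

    Returns the body and the Content-Type header value to send with it.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="{file_field}"; filename="{filename}"\r\n'
        f"Content-Type: audio/wav\r\n\r\n".encode() + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"
```

POST the returned body to `/v1/audio/voice-clone` with the returned Content-Type header, mirroring the `-F` flags in the curl example.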
List Voices
Endpoint: GET /v1/voices
{
"voices": [
{"id": "ryan", "name": "Ryan", "language": "English"},
{"id": "aiden", "name": "Aiden", "language": "English"},
...
]
}
Health Check
Endpoint: GET /health
{"status": "healthy", "model_loaded": true}
Server Configuration
python server.py [OPTIONS]
| Option | Default | Description |
|---|---|---|
| --host | 0.0.0.0 | Host to bind to |
| --port | 8100 | Port to bind to |
| --model | customvoice-1.7b | Model variant to load |
Network Access
To access the server from other machines on your network:
# Start server on all interfaces
python server.py --host 0.0.0.0 --port 8100
# Access from another machine
curl http://192.168.1.100:8100/v1/audio/speech ...
Update the plugin endpoint in plugins/qwen-tts.json:
{
"endpoint": "http://192.168.1.100:8100/v1/audio/speech",
"capabilities": {
"tts": {
"endpoint": "http://192.168.1.100:8100/v1/audio/speech"
}
}
}
Production Features
Text Sanitization
The server automatically sanitizes input text to prevent model hangs:
- Removes emojis and symbols
- Strips markdown formatting (*bold*, _italic_, etc.)
- Collapses repeated characters (FUUUUU → FUU)
- Removes stage directions (*(action)*, (whispers))
- Normalizes whitespace
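These rules roughly correspond to a chain of regex substitutions. An illustrative approximation (the server's actual implementation in server.py may differ in detail, particularly in emoji coverage):

```python
import re

def sanitize(text: str) -> str:
    """Approximate the sanitization rules described above."""
    # Emojis and common symbol blocks (partial ranges; an approximation).
    text = re.sub(r"[\U0001F000-\U0001FAFF\u2600-\u27BF]", "", text)
    # Stage directions: *(action)* and bare (whispers)/(laughs)/(sighs).
    text = re.sub(r"\*\(.*?\)\*|\((?:whispers|laughs|sighs)\)", "", text)
    # Markdown markers, after stage directions so *(...)* is caught whole.
    text = re.sub(r"[*_`#]+", "", text)
    # Collapse runs of 3+ identical characters down to two (FUUUUU -> FUU).
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Normalize whitespace.
    return re.sub(r"\s+", " ", text).strip()
```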
Text Chunking
Long text is automatically split at sentence boundaries:
- Maximum 500 characters per chunk
- 30-second timeout per chunk
- Failed chunks are skipped, remaining chunks continue
- Chunks are concatenated into single audio response
This prevents timeouts on long AI responses while maintaining natural speech flow.
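The chunking step can be sketched as a greedy split at sentence boundaries. This is a simplification of whatever server.py actually does (`max_chars` mirrors the 500-character limit above; a single sentence longer than the limit is emitted as-is here):

```python
import re

def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then synthesized with its own timeout, and the resulting audio segments are concatenated in order.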
Multi-GPU Setup
For systems with multiple GPUs, the server forces single-GPU execution to avoid tensor device mismatches:
device_map = {"": "cuda:0"} # Uses first GPU only
To use a specific GPU:
CUDA_VISIBLE_DEVICES=1 python server.py --model customvoice-1.7b
Troubleshooting
Model Download Fails
The model downloads from Hugging Face on first run. If it fails:
# Set Hugging Face token for gated models
export HF_TOKEN=hf_...
# Or download manually
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Out of Memory (Apple Silicon)
RuntimeError: MPS backend out of memory
Use the smaller model variant:
python server.py --model customvoice-0.6b
CUDA Out of Memory
torch.cuda.OutOfMemoryError: CUDA out of memory
- Close other GPU applications
- Use the 0.6B model variant
- Reduce chunk size in server.py (max_chunk_size=300)
Server Times Out
If generation times out on long text:
- The server automatically chunks text and continues with remaining chunks
- Check server logs for which chunks timed out
- Consider shortening your input text
Audio Sounds Wrong
- Repeated syllables: Usually caused by emojis or special characters. The sanitizer should handle this automatically.
- Wrong language: Set the language parameter explicitly in the request.
- Unnatural pauses: Text may be splitting at wrong boundaries. Check for unusual punctuation.
Plugin Configuration
The included plugin (plugins/qwen-tts.json):
{
"id": "qwen-tts",
"name": "Qwen3 TTS",
"type": "tts",
"endpoint": "http://localhost:8100/v1/audio/speech",
"auth": {
"header": "",
"key_env": ""
},
"model_map": [
"qwen3-tts",
"qwen3-tts-customvoice",
"qwen3-tts-voicedesign",
"qwen3-tts-clone"
],
"capabilities": {
"tts": {
"endpoint": "http://localhost:8100/v1/audio/speech",
"model_map": [
"qwen3-tts",
"qwen3-tts-customvoice",
"qwen3-tts-voicedesign",
"qwen3-tts-clone"
],
"config": {
"voices": ["Ryan", "Aiden", "Vivian", "Serena", "Uncle_Fu", "Dylan", "Eric", "Ono_Anna", "Sohee"],
"default_voice": "Ryan",
"formats": ["wav"],
"default_format": "wav",
"max_characters": 10000,
"supports_streaming": false,
"no_auth_required": true
}
}
},
"description": "Qwen3-TTS local TTS server (NVIDIA CUDA, Apple MPS, or CPU)",
"documentation_url": "https://github.com/QwenLM/Qwen3-TTS"
}
Resources
- Qwen3-TTS GitHub - Official repository
- Qwen3-TTS Demo - Try it online
- Alibaba Cloud TTS Docs - Cloud API documentation