
Kyutai TTS Integration

Run Kyutai's TTS models locally for high-quality text-to-speech. This guide covers both Pocket TTS (CPU) and TTS 1.6B (GPU), each served by an OpenAI-compatible server included with Libre WebUI.

Overview

Kyutai offers two TTS models:

| Model | Parameters | Device | Best For |
|---|---|---|---|
| Pocket TTS | 100M | CPU only | Laptops, low-resource environments |
| TTS 1.6B | 1.6B | GPU/MPS/CPU | Servers, high-quality synthesis |

Both use the CALM (Continuous Audio Language Models) framework and support voice cloning from audio samples.

Pocket TTS (CPU)

Lightweight TTS that runs in real time on CPU. No GPU required.

Requirements

| Component | Minimum |
|---|---|
| Python | 3.10 - 3.14 |
| PyTorch | 2.5+ |
| RAM | 4GB |
| Disk | 500MB |

Quick Start

cd examples/kyutai-tts-server

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Start server
python server.py

Server runs at http://localhost:8200.

Test It

curl http://localhost:8200/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "kyutai-tts", "input": "Hello, welcome to Libre WebUI!", "voice": "alba"}' \
--output speech.wav
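
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_speech_request` and `synthesize_to_file` are hypothetical and not part of the server, and the sketch assumes the server is running on the default port shown above.

```python
import json
import urllib.request


def build_speech_request(text, voice="alba", model="kyutai-tts",
                         base_url="http://localhost:8200"):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, path="speech.wav", **kwargs):
    """Send the request and write the returned audio bytes to `path`."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(path, "wb") as f:
        f.write(audio)
    return path
```

Calling `synthesize_to_file("Hello, welcome to Libre WebUI!")` should produce the same `speech.wav` as the curl command, provided the server is reachable.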

Voices

| Voice | Description |
|---|---|
| Alba | Female, clear and natural |
| Marius | Male, warm tone |
| Javert | Male, authoritative |
| Jean | Male, gentle |
| Fantine | Female, soft |
| Cosette | Female, young |
| Eponine | Female, expressive |
| Azelma | Female, bright |

Performance

  • ~6x real-time on MacBook Air M4
  • ~200ms latency for first audio chunk
  • Uses only 2 CPU cores
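
A figure such as "~6x real-time" means the server produces about six seconds of audio per second of wall-clock time. The sketch below shows one way to measure this on your own machine from a returned WAV payload. The function names are hypothetical, and it assumes a complete (non-streamed) WAV file whose header carries the correct frame count.

```python
import io
import wave


def wav_duration_seconds(wav_bytes):
    """Return the duration of a WAV payload, in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def realtime_factor(wav_bytes, elapsed_seconds):
    """Seconds of audio produced per second of wall-clock time."""
    return wav_duration_seconds(wav_bytes) / elapsed_seconds
```

To use it, time the HTTP request with `time.perf_counter()` and pass the response body together with the elapsed time. For example, 12 seconds of audio returned in 2 seconds gives a factor of 6.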

TTS 1.6B (GPU)

High-quality TTS with GPU acceleration. Automatic device selection: CUDA > MPS > CPU.

Requirements

| Component | Minimum | Recommended |
|---|---|---|
| Python | 3.10+ | 3.12 |
| GPU VRAM | 6GB | 8GB+ |
| RAM | 8GB | 16GB+ |
| Disk | 4GB | 8GB |

Platform Support

| Platform | Backend | Notes |
|---|---|---|
| NVIDIA GPU | CUDA | Best performance, bfloat16 support |
| Apple Silicon | MPS | Uses float16 |
| CPU | PyTorch | Slower, float32 |

Quick Start

cd examples/kyutai-tts-1.6b-server

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install PyTorch with CUDA (for NVIDIA GPUs)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install dependencies
pip install -r requirements.txt

# Start server (auto-detects GPU)
python server.py

Server runs at http://localhost:8201.

Device Selection

# Auto-detect (CUDA > MPS > CPU)
python server.py

# Force specific device
python server.py --device cuda
python server.py --device mps
python server.py --device cpu
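
The auto-detection order (CUDA > MPS > CPU) and the per-platform precision from the Platform Support table can be sketched as follows. This is an illustration of the priority rule, not the code in `server.py`; `pick_device`, `detect_device`, and `DTYPE_BY_DEVICE` are hypothetical names.

```python
# Precision per device, as listed in the Platform Support table.
DTYPE_BY_DEVICE = {"cuda": "bfloat16", "mps": "float16", "cpu": "float32"}


def pick_device(cuda_available, mps_available):
    """Apply the CUDA > MPS > CPU priority to two availability flags."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Query PyTorch if it is installed; fall back to CPU when it is not."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)
    mps_available = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_available)
```

Keeping the priority rule in a function that takes plain booleans makes it easy to check without a GPU present.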

Test It

curl http://localhost:8201/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "kyutai-tts-1.6b", "input": "Hello from the GPU!", "voice": "alba"}' \
--output speech.wav

Voices

Alba MacKenna (CC BY 4.0):

| Voice | Style |
|---|---|
| alba / alba-casual | Casual conversation |
| alba-merchant | Merchant character |
| alba-announcer | Announcer style |

Expresso (CC BY-NC 4.0 - non-commercial):

| Voice | Emotion |
|---|---|
| expresso-happy | Happy |
| expresso-sad | Sad |
| expresso-angry | Angry |

VCTK (CC BY 4.0):

  • vctk-p225, vctk-p226, vctk-p227, vctk-p228

Voice Cloning

Both servers support cloning voices from audio files.

Pocket TTS

# From local file
curl http://localhost:8200/v1/audio/voice-clone \
-F "input=Hello from a cloned voice" \
-F "reference_audio=@my_voice.wav" \
--output cloned.wav

# From HuggingFace URL
curl http://localhost:8200/v1/audio/voice-clone-url \
-H "Content-Type: application/json" \
-d '{
"input": "Hello world!",
"voice_url": "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
}' \
--output speech.wav
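
The file upload that `curl -F` performs is a `multipart/form-data` request. The sketch below builds an equivalent request with the Python standard library, using the `input` and `reference_audio` field names from the curl example. The helper names are hypothetical, and the `audio/wav` part type is an assumption about what the server accepts.

```python
import urllib.request
import uuid


def encode_multipart(fields, file_field, filename, file_bytes,
                     file_type="audio/wav"):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode("utf-8"),
        ]
    lines += [
        f"--{boundary}".encode(),
        (f'Content-Disposition: form-data; name="{file_field}"; '
         f'filename="{filename}"').encode(),
        f"Content-Type: {file_type}".encode(),
        b"",
        file_bytes,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


def build_clone_request(text, filename, audio_bytes,
                        base_url="http://localhost:8200"):
    """Build (but do not send) a POST request for /v1/audio/voice-clone."""
    body, content_type = encode_multipart(
        {"input": text}, "reference_audio", filename, audio_bytes)
    return urllib.request.Request(
        f"{base_url}/v1/audio/voice-clone",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

Pass the result to `urllib.request.urlopen` and write the response body to a `.wav` file, as with the speech endpoint.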

TTS 1.6B

Pass any HuggingFace voice path as the voice parameter:

curl http://localhost:8201/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "kyutai-tts-1.6b",
"input": "Custom voice synthesis",
"voice": "hf://kyutai/tts-voices/vctk/p230.wav"
}' \
--output speech.wav

API Reference

Speech Generation

Endpoint: POST /v1/audio/speech

{
  "model": "kyutai-tts",
  "input": "Text to convert to speech",
  "voice": "alba",
  "response_format": "wav",
  "stream": false
}

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | varies | kyutai-tts or kyutai-tts-1.6b |
| input | string | required | Text to synthesize (max 10,000 chars) |
| voice | string | alba | Voice name or HuggingFace path |
| response_format | string | wav | Audio format (only wav supported) |
| stream | boolean | false | Enable streaming (Pocket TTS only) |
| cfg_coef | float | 2.0 | Classifier-free guidance (1.6B only) |

Response: Audio file (audio/wav)
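
The constraints in the parameter table can be checked on the client before a request is sent. The following is a sketch of such a check; `build_payload` is a hypothetical helper, and the server may validate differently or return its own error messages.

```python
MAX_INPUT_CHARS = 10_000
MODELS = {"kyutai-tts", "kyutai-tts-1.6b"}


def build_payload(text, model="kyutai-tts", voice="alba",
                  response_format="wav", stream=False, cfg_coef=None):
    """Return a request body for /v1/audio/speech, or raise ValueError."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if not text:
        raise ValueError("input is required")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input exceeds {MAX_INPUT_CHARS} characters")
    if response_format != "wav":
        raise ValueError("only wav is supported")
    if stream and model != "kyutai-tts":
        raise ValueError("stream is available for Pocket TTS only")
    if cfg_coef is not None and model != "kyutai-tts-1.6b":
        raise ValueError("cfg_coef applies to the 1.6B model only")

    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "stream": stream,
    }
    # Omit cfg_coef unless set, so the server default applies.
    if cfg_coef is not None:
        payload["cfg_coef"] = cfg_coef
    return payload
```

Serialize the returned dictionary with `json.dumps` and send it as the POST body.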

OpenAI Voice Aliases

For compatibility with OpenAI TTS clients:

| OpenAI Voice | Pocket TTS | TTS 1.6B |
|---|---|---|
| alloy | alba | alba |
| echo | marius | vctk-p225 |
| fable | cosette | expresso-happy |
| onyx | javert | vctk-p226 |
| nova | fantine | alba-announcer |
| shimmer | eponine | alba-merchant |
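
The alias table amounts to a per-model lookup. A sketch of that lookup is shown below; `resolve_voice` is a hypothetical name, and passing unknown names through unchanged is an assumption about how non-alias voices are handled.

```python
# OpenAI voice name -> (Pocket TTS voice, TTS 1.6B voice), from the table above.
OPENAI_VOICE_ALIASES = {
    "alloy": ("alba", "alba"),
    "echo": ("marius", "vctk-p225"),
    "fable": ("cosette", "expresso-happy"),
    "onyx": ("javert", "vctk-p226"),
    "nova": ("fantine", "alba-announcer"),
    "shimmer": ("eponine", "alba-merchant"),
}


def resolve_voice(voice, model="kyutai-tts"):
    """Map an OpenAI voice name to a Kyutai voice; pass other names through."""
    if voice not in OPENAI_VOICE_ALIASES:
        return voice
    pocket_voice, large_voice = OPENAI_VOICE_ALIASES[voice]
    return large_voice if model == "kyutai-tts-1.6b" else pocket_voice
```

For example, `resolve_voice("echo")` yields `marius`, while `resolve_voice("echo", "kyutai-tts-1.6b")` yields `vctk-p225`.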

List Voices

Endpoint: GET /v1/voices

Health Check

Endpoint: GET /health
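
Because models are loaded at startup, a client may want to wait for the health endpoint before sending requests. The sketch below polls `GET /health` and treats any HTTP 200 as healthy. The response body format is not documented here, so it is ignored, and the function names and retry defaults are hypothetical.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=2.0):
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health",
                                    timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(base_url, attempts=30, delay=1.0,
                       probe=is_healthy, sleep=time.sleep):
    """Poll the health endpoint until it succeeds or attempts run out."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        # No need to sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

The `probe` and `sleep` parameters exist so the retry loop can be exercised without a running server.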


Plugin Configuration

Pocket TTS

Enable in Settings > Plugins > Kyutai TTS

Plugin file: plugins/kyutai-tts.json

{
  "id": "kyutai-tts",
  "name": "Kyutai TTS",
  "type": "tts",
  "endpoint": "http://localhost:8200/v1/audio/speech",
  "capabilities": {
    "tts": {
      "config": {
        "voices": ["Alba", "Marius", "Javert", "Jean", "Fantine", "Cosette", "Eponine", "Azelma"],
        "default_voice": "Alba",
        "supports_streaming": true,
        "no_auth_required": true
      }
    }
  }
}

TTS 1.6B

Enable in Settings > Plugins > Kyutai TTS 1.6B

Plugin file: plugins/kyutai-tts-1.6b.json

{
  "id": "kyutai-tts-1.6b",
  "name": "Kyutai TTS 1.6B",
  "type": "tts",
  "endpoint": "http://localhost:8201/v1/audio/speech",
  "capabilities": {
    "tts": {
      "config": {
        "voices": ["Alba", "Alba-Casual", "Alba-Merchant", "Alba-Announcer", "Expresso-Happy", "Expresso-Sad", "Expresso-Angry", "VCTK-P225", "VCTK-P226"],
        "default_voice": "Alba",
        "supports_streaming": true,
        "no_auth_required": true
      }
    }
  }
}

Network Access

To access from other machines:

# Start server on all interfaces
python server.py --host 0.0.0.0

# Access from another machine
curl http://192.168.1.100:8200/v1/audio/speech ...

Update the plugin endpoint accordingly:

{
  "endpoint": "http://192.168.1.100:8200/v1/audio/speech"
}
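
If you maintain several plugin files, the endpoint change can be scripted. The sketch below swaps only the host in a plugin configuration and keeps the scheme, port, and path. `with_host` is a hypothetical helper that operates on the parsed JSON; it does not touch any Libre WebUI API.

```python
import copy
from urllib.parse import urlsplit, urlunsplit


def with_host(plugin, host):
    """Return a copy of `plugin` whose endpoint points at `host`."""
    updated = copy.deepcopy(plugin)
    parts = urlsplit(plugin["endpoint"])
    # Keep the original port (8200 or 8201) if one was given.
    netloc = f"{host}:{parts.port}" if parts.port else host
    updated["endpoint"] = urlunsplit(parts._replace(netloc=netloc))
    return updated
```

Load the plugin file with `json.load`, pass the dictionary through `with_host`, and write it back with `json.dump`.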

Troubleshooting

Model Download Fails

Models are downloaded from HuggingFace on first run. If the download fails for a gated model, set an access token:

# Set token for gated models
export HF_TOKEN=hf_...

CUDA Out of Memory

For TTS 1.6B on limited VRAM:

  1. Close other GPU applications
  2. Try cfg_coef=1.5 for lower memory usage
  3. Use Pocket TTS instead (CPU-based)

Audio Quality Issues

  • Robotic sound: Try a different voice
  • Cut-off audio: The text may be too long; the server splits long text into chunks automatically
  • Wrong pronunciation: Model is optimized for English and French

MPS (Apple Silicon) Issues

RuntimeError: MPS backend error

The 1.6B model uses float16 on MPS. If issues persist, force CPU:

python server.py --device cpu

Comparison with Qwen3-TTS

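| Feature | Kyutai Pocket | Kyutai 1.6B | Qwen3-TTS |
|---|---|---|---|
| Parameters | 100M | 1.6B | 0.6B-1.7B |
| GPU Required | No | Optional | Yes |
| Languages | English | EN/FR | 10 languages |
| Voice Cloning | Yes | Yes | Yes |
| Voice Design | No | No | Yes |
| Port | 8200 | 8201 | 8100 |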

Choose Kyutai for English-focused use cases (TTS 1.6B also covers French) and a simpler setup. Choose Qwen3-TTS for multilingual support and voice design features.


Resources