What software should I use for text-to-speech?
I’ve noticed that some videos are fully narrated using artificial intelligence, and the sound quality is quite high. Could you recommend any simple tools for creating such narration that work for free and offline?
Most available options either require an internet connection, are paid, or offer mediocre sound quality. Some offline tools require deep learning and possibly training your own model.
I’m willing to spend time training my own model since I have a GPU for this purpose.
The best free offline text-to-speech tools are Coqui TTS, MaryTTS, and eSpeak. Coqui TTS is the most powerful of the three: it supports Russian, runs offline, and can train or fine-tune models on a GPU, producing voices that can sound very close to human speech. MaryTTS and eSpeak run on the CPU only.
Contents
- Best Free Offline TTS Tools
- Coqui TTS: Most Powerful Solution
- MaryTTS: Multifunctional Platform
- eSpeak: Compact Solution
- Setting Up Model Training on GPU
- Comparison of Best Options
- Conclusion and Recommendations
Best Free Offline TTS Tools
There are several excellent free tools for text-to-speech that work offline and offer high-quality audio:
Coqui TTS (formerly Mozilla TTS)
- Key advantages: Fully offline operation, GPU support, voice cloning capabilities, Russian language support
- Voice quality: Natural, expressive speech with emotional coloring
- Language support: Russian, English, German, French, Spanish, and others
- Features: Modular architecture, model fine-tuning capabilities
MaryTTS
- Key advantages: Java platform, Russian language support, offline operation
- Voice quality: Natural synthesized speech
- Language support: Russian (3 voices), English (27), German (7), French (3), and others
- Features: Extensible architecture, ability to add new voices
eSpeak
- Key advantages: Very compact, lightweight, Russian language support
- Voice quality: Basic quality but clear pronunciation
- Language support: Russian, English, and more than 100 other languages
- Features: Works via command line or simple API
Coqui TTS: Most Powerful Solution
Coqui TTS is a leading open-source speech synthesis toolkit that delivers high-quality output completely offline, with GPU acceleration.
Key Features
Multi-platform support
- Works on Windows, Linux, and macOS
- CUDA support for NVIDIA GPUs through PyTorch; AMD GPUs need a ROCm build of PyTorch
- Runs on the CPU as well, only more slowly
Voice quality
- Neural XTTS models that produce natural-sounding speech
- Voice cloning: reproducing a person’s voice from a short audio recording
- Speaking style and emotional coloring carried over from the reference recording
Russian language support
- Special configurations for the Russian language
- Ability to train models on Russian language data
- Correct processing of phonetics and intonations
Installation and Basic Usage
# Clone the repository
git clone https://github.com/coqui-ai/TTS
cd TTS
# Installation with all components (for inference only, `pip install TTS` is enough)
pip install -e ".[all,dev,notebooks]"
Basic usage example:
from TTS.api import TTS
# Load the multilingual XTTS v2 model on the GPU (set gpu=False to run on the CPU)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
# Synthesize Russian speech, cloning the voice from a reference recording
tts.tts_to_file(
    text="Привет! Это пример синтеза речи на русском языке.",
    file_path="russian_speech.wav",
    speaker_wav="/path/to/your/voice_sample.wav",
    language="ru",
)
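A full video narration is usually longer than one sentence. The sketch below splits a script into paragraphs and pairs each one with a numbered output file, so every chunk can be passed to `tts.tts_to_file` separately. The helper name `plan_narration` and the file-name pattern are illustrative assumptions, not part of Coqui TTS.

```python
import re
from pathlib import Path


def plan_narration(script: str, out_dir: str = "narration") -> list[tuple[str, Path]]:
    """Split a script on blank lines and assign each paragraph a WAV file name."""
    # Collapse line breaks inside a paragraph into single spaces.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", script)]
    paragraphs = [p for p in paragraphs if p]  # drop empty chunks
    return [
        (text, Path(out_dir) / f"part_{index:03d}.wav")
        for index, text in enumerate(paragraphs, start=1)
    ]


if __name__ == "__main__":
    demo = "Первый абзац.\n\nВторой абзац\nна двух строках."
    for text, path in plan_narration(demo):
        # With a loaded model, each pair could be synthesized like this:
        # tts.tts_to_file(text=text, file_path=str(path),
        #                 speaker_wav="/path/to/your/voice_sample.wav", language="ru")
        print(path, "<-", text)
```

Keeping one file per paragraph makes it easy to re-render a single passage after fixing a typo instead of regenerating the whole narration.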
MaryTTS: Multifunctional Platform
MaryTTS is a Java platform for speech synthesis with rich functionality and Russian language support.
Main Characteristics
Architecture
- Modular system that allows new voices and components to be added
- Plugin support for extending functionality
- Integration with various audio systems
Quality and languages
- High-quality natural synthesized speech
- Russian language support with three different voices
- Ability to customize pronunciation and intonations
Offline operation
- Fully autonomous operation without internet connection
- Suitable for use in systems with enhanced security requirements
- Efficient operation on various hardware configurations
Russian Language Features
MaryTTS is reported to offer three different voices for the Russian language (what is actually available depends on the voice packages you install):
- Male voice with neutral intonation
- Female voice with soft diction
- Voice for educational materials with clear articulation
The platform also allows:
- Customizing speech speed and pitch
- Adding pauses and intonational accents
- Exporting audio in various formats
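MaryTTS is usually driven through its built-in HTTP server. The sketch below builds a `/process` request and saves the returned WAV data. It assumes a MaryTTS server is already running locally on its usual port 59125 with a Russian voice installed; the function names `mary_process_url` and `synthesize` are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def mary_process_url(text: str, locale: str = "ru",
                     host: str = "localhost", port: int = 59125) -> str:
    """Build a /process request URL for a locally running MaryTTS server."""
    query = urlencode({
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": locale,
    })
    return f"http://{host}:{port}/process?{query}"


def synthesize(text: str, wav_path: str) -> None:
    """Fetch the synthesized WAV from the server (MaryTTS must be running)."""
    with urlopen(mary_process_url(text)) as response, open(wav_path, "wb") as out:
        out.write(response.read())
```

Calling `synthesize("Привет, мир!", "hello.wav")` would then write the audio to `hello.wav`, provided the server is up.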
eSpeak: Compact Solution
eSpeak is a lightweight speech synthesizer, perfect for basic text-to-speech tasks.
eSpeak Advantages
Minimal requirements
- Very small distribution size
- Low system resource consumption
- Works on weak computers without GPU
Russian language support
- Correct Cyrillic processing
- Clear word pronunciation
- Support for different accents and speech styles
Ease of use
- Works via command line
- Simple API for integration into applications
- Support for batch text processing
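As a rough illustration of the command-line workflow and batch processing, the sketch below wraps the `espeak-ng` executable (the maintained eSpeak fork) with `subprocess`. It assumes `espeak-ng` is installed and on the PATH; the helper names and the default speed and pitch values are arbitrary choices for the example.

```python
import shutil
import subprocess


def espeak_command(text: str, wav_path: str, voice: str = "ru",
                   speed_wpm: int = 150, pitch: int = 50) -> list[str]:
    """Build an espeak-ng command line that writes speech to a WAV file."""
    return [
        "espeak-ng",
        "-v", voice,           # voice / language
        "-s", str(speed_wpm),  # speed in words per minute
        "-p", str(pitch),      # pitch, 0 to 99
        "-w", wav_path,        # write output to this WAV file
        text,
    ]


def speak_all(lines: list[str], prefix: str = "line") -> None:
    """Batch-process several lines, producing one WAV file per line."""
    if shutil.which("espeak-ng") is None:
        raise RuntimeError("espeak-ng is not installed or not on PATH")
    for index, line in enumerate(lines, start=1):
        subprocess.run(espeak_command(line, f"{prefix}_{index:03d}.wav"), check=True)
```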
Limitations
- Basic speech quality compared to neural network solutions
- Limited naturalness of intonations
- Fewer customization options
Setting Up Model Training on GPU
Since you have a graphics card, you can train your own high-quality speech synthesis model for Russian.
Hardware Requirements
Recommended parameters
- NVIDIA GPU with video memory of at least 8GB
- CUDA 11.0 or higher
- Python 3.9 to 3.11 (the range recent Coqui TTS releases are tested with)
- RAM 16GB+
Alternative options
- For AMD GPU: ROCm support
- For systems without GPU: CPU usage is possible, but training will be significantly longer
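To check a machine against the 8GB recommendation before installing anything heavy, a small script can query `nvidia-smi`. This is a sketch that assumes an NVIDIA driver is installed; the function names are illustrative, and the memory figure is reported in MiB.

```python
import shutil
import subprocess

MIN_VRAM_MIB = 8 * 1024  # the 8GB recommendation above


def parse_gpus(smi_output: str) -> list[tuple[str, int]]:
    """Parse 'name, memory_mib' lines into (name, memory_mib) tuples."""
    gpus = []
    for line in smi_output.strip().splitlines():
        name, memory = line.rsplit(",", 1)
        gpus.append((name.strip(), int(memory)))
    return gpus


def query_gpus() -> list[tuple[str, int]]:
    """Ask nvidia-smi for GPU names and total memory; empty list if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpus(result.stdout)


if __name__ == "__main__":
    for name, memory in query_gpus():
        verdict = "OK" if memory >= MIN_VRAM_MIB else "below recommendation"
        print(f"{name}: {memory} MiB ({verdict})")
```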
Model Training Process
1. Data preparation
# Install dependencies
pip install torch torchaudio
pip install -e .[all,dev,notebooks]
# Prepare dataset in [text, audio_file, speaker_name] format
# Example structure:
# [
# ["Hello world", "/path/to/audio.wav", "speaker1"],
# ["How are you?", "/path/to/audio2.wav", "speaker1"]
# ]
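One way to produce that list is to read it from a metadata file. The sketch below assumes an LJSpeech-style layout, with one `file_id|text` row per line and audio stored as `<file_id>.wav`; the function name `load_samples` and the paths are hypothetical, so adapt them to your own recordings.

```python
import csv
from pathlib import Path


def load_samples(metadata_path: str, wav_dir: str, speaker: str) -> list[list[str]]:
    """Read 'file_id|text' rows and return [text, audio_file, speaker_name] items."""
    samples = []
    with open(metadata_path, encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            file_id, text = row[0].strip(), row[1].strip()
            samples.append([text, str(Path(wav_dir) / f"{file_id}.wav"), speaker])
    return samples
```

For example, `load_samples("metadata.csv", "wavs", "speaker1")` returns one `[text, audio_file, speaker_name]` entry per transcript line.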
2. Configure training settings
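from TTS.tts.configs.shared_configs import BaseDatasetConfig, BaseTTSConfig
from TTS.utils.audio import AudioProcessor
# Describe the dataset; the formatter name and paths are placeholders for your own data
# (older releases call the `formatter` field `name`)
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="metadata.csv",
    path="/path/to/dataset",
)
# In practice you would use a model-specific config class that extends BaseTTSConfig
config = BaseTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    # phoneme_cleaners targets English text; consider multilingual_cleaners for Russian
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="ru",
    phoneme_cache_path="./phoneme_cache",
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    output_path="./output",
    datasets=[dataset_config],
)
# Initialize audio processor
audio_processor = AudioProcessor.init_from_config(config)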
3. Start training
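# The Trainer lives in the separate `trainer` package that is installed with Coqui TTS
from trainer import Trainer, TrainerArgs
# `model`, `train_samples`, and `eval_samples` must be created beforehand:
# the model from a model-specific class, the samples e.g. with
# TTS.tts.datasets.load_tts_samples(dataset_config, eval_split=True).
# In recent releases the audio processor is passed to the model, not to the Trainer.
# Initialize trainer
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path=config.output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
# Start training
trainer.fit()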
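To get a feel for how long such a run is, the number of optimizer steps can be estimated from the dataset size and the `batch_size=32` and `epochs=1000` values in the configuration above. This is plain arithmetic, assuming the last partial batch of each epoch is kept; the function name is illustrative.

```python
import math


def training_steps(num_samples: int, batch_size: int = 32,
                   epochs: int = 1000) -> tuple[int, int]:
    """Return (steps per epoch, total steps) for a dataset of num_samples clips."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


if __name__ == "__main__":
    per_epoch, total = training_steps(1500)
    print(f"{per_epoch} steps per epoch, {total} steps in total")
```

With 1,500 clips this gives 47 steps per epoch and 47,000 steps in total, so in practice training is usually stopped once the evaluation samples sound good rather than after all 1000 epochs.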
GPU Optimization
CUDA settings
import torch
# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
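# Speed optimization: let cuDNN benchmark and pick the fastest kernels
torch.backends.cudnn.benchmark = True
# Keep this False for speed; set it to True (and benchmark to False) only for reproducible runs
torch.backends.cudnn.deterministic = False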
Parallel processing
# Multiprocessing for data handling
config.num_loader_workers = 8 # Number of workers for data loading
config.num_eval_loader_workers = 8
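Rather than hard-coding 8, the worker count can be derived from the number of CPU cores. The heuristic below (leave two cores free, cap at 8) is only an assumption for illustration, not a Coqui TTS recommendation.

```python
import os


def suggest_loader_workers(reserved_cores: int = 2, upper_limit: int = 8) -> int:
    """Pick a data-loader worker count from the CPU core count."""
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(upper_limit, cores - reserved_cores))


if __name__ == "__main__":
    workers = suggest_loader_workers()
    print(f"Suggested num_loader_workers: {workers}")
    # The result could then be assigned to the config, for example:
    # config.num_loader_workers = workers
    # config.num_eval_loader_workers = workers
```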
Comparison of Best Options
| Parameter | Coqui TTS | MaryTTS | eSpeak |
|---|---|---|---|
| Speech quality | ★★★★★ (High) | ★★★★☆ (Good) | ★★☆☆☆ (Basic) |
| GPU support | ✅ (Full) | ❌ (CPU only) | ❌ (CPU only) |
| Distribution size | ~2GB | ~500MB | ~50MB |
| Setup complexity | Medium | Low | Very low |
| Russian language | ✅ (Excellent support) | ✅ (3 voices) | ✅ (Basic support) |
| Cloning capability | ✅ | ❌ | ❌ |
| Offline operation | ✅ | ✅ | ✅ |
| Platforms | Windows/Linux/macOS | Any OS with a Java runtime | Windows/Linux/macOS |
Conclusion and Recommendations
Based on your requirements (free, offline, high quality, GPU support, and training capability), I recommend the following approach:
For beginners
eSpeak: choose it if you need a quick solution with minimal resource requirements. It is easy to install and use, but the quality will be basic.
For advanced users
Coqui TTS: the optimal choice, combining high quality, GPU support, and training capability. It takes more time to master, but the results are worth it.
For comprehensive tasks
MaryTTS: a good alternative if you need a stable Java platform with rich functionality and Russian language support.
Practical steps to get started:
- Start with Coqui TTS: install the basic version and test with pre-trained models
- Prepare data: collect 1-2 hours of high-quality voice recordings for training
- Train the model: use GPU acceleration for efficient training
- Test and improve: experiment with different settings to achieve optimal quality
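To see how close a recording folder is to the 1-2 hour target, the total duration can be summed with Python's standard `wave` module. This is a sketch that assumes uncompressed PCM WAV files in a single directory; the function names are illustrative.

```python
import wave
from pathlib import Path


def wav_seconds(path: Path) -> float:
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_hours(wav_dir: str) -> float:
    """Sum the durations of all *.wav files in a directory, in hours."""
    return sum(wav_seconds(p) for p in sorted(Path(wav_dir).glob("*.wav"))) / 3600


if __name__ == "__main__":
    # "wavs" is a placeholder directory name.
    print(f"Collected {total_hours('wavs'):.2f} hours of audio")
```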
Coqui TTS is currently the best free solution for offline speech synthesis with GPU training capability, offering quality comparable to commercial alternatives.