NeuroAgent

Best Free TTS Tools for Text-to-Speech

Discover the best free offline text-to-speech tools: Coqui TTS (with GPU support), MaryTTS, and eSpeak, for high-quality speech synthesis without an internet connection.

Question

What software should I use for text-to-speech?

I’ve noticed that some videos are fully narrated using artificial intelligence, and the sound quality is quite high. Could you recommend any simple tools for creating such narration that work for free and offline?

Most available options either require an internet connection, are paid, or offer mediocre sound quality. Some offline tools require deep learning and possibly training your own model.

I’m willing to spend time training my own model since I have a GPU for this purpose.

NeuroAgent

The best free offline text-to-speech tools are Coqui TTS, MaryTTS, and eSpeak. Coqui TTS is the most powerful of the three: it supports Russian, runs offline, and can train models on a GPU, producing voices that come close to natural human speech. MaryTTS and eSpeak run on the CPU only.


Best Free Offline TTS Tools

There are several excellent free tools for text-to-speech that work offline and offer high-quality audio:

Coqui TTS (formerly Mozilla TTS)

  • Key advantages: Fully offline operation, GPU support, voice cloning capabilities, Russian language support
  • Voice quality: Natural, expressive speech with emotional coloring
  • Language support: Russian, English, German, French, Spanish, and others
  • Features: Modular architecture, model fine-tuning capabilities

MaryTTS

  • Key advantages: Java platform, Russian language support, offline operation
  • Voice quality: Natural synthesized speech
  • Language support: Russian (3 voices), English (27), German (7), French (3), and others
  • Features: Extensible architecture, ability to add new voices

eSpeak

  • Key advantages: Very compact, lightweight, Russian language support
  • Voice quality: Basic quality but clear pronunciation
  • Language support: Russian, English, and more than 100 other languages
  • Features: Works via command line or simple API

Coqui TTS: Most Powerful Solution

Coqui TTS is a leading open-source speech synthesis toolkit that delivers high-quality output fully offline, with GPU acceleration.

Key Features

Multi-platform support

  • Works on Windows, Linux, and macOS
  • CUDA support for NVIDIA GPUs through PyTorch (there is no OpenCL backend; AMD GPUs need a ROCm build of PyTorch)
  • Falls back to the CPU when no supported GPU is present

Voice quality

  • Use of neural network XTTS models for creating natural speech
  • Voice cloning support - reproducing any person’s voice from a short audio recording
  • Emotional coloring of speech with different pronunciation styles

Russian language support

  • Special configurations for the Russian language
  • Ability to train models on Russian language data
  • Correct processing of phonetics and intonations

Installation and Basic Usage

bash
# Clone the repository
git clone https://github.com/coqui-ai/TTS
cd TTS

# Installation with all components
pip install -e ".[all,dev,notebooks]"  # quotes keep shells such as zsh from expanding the brackets

Basic usage example:

python
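import torch
from TTS.api import TTS

# Use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the multilingual XTTS v2 model (downloaded on first use, then cached locally)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Text-to-speech in Russian, cloning the voice from a reference recording
tts.tts_to_file(
    text="Привет! Это пример синтеза речи на русском языке.",
    file_path="russian_speech.wav",
    speaker_wav="/path/to/your/voice_sample.wav",
    language="ru",
)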

MaryTTS: Multifunctional Platform

MaryTTS is a Java platform for speech synthesis with rich functionality and Russian language support.

Main Characteristics

Architecture

  • Modular system that allows adding new voices and components
  • Plugin support for extending functionality
  • Integration with various audio systems

Quality and languages

  • High-quality natural synthesized speech
  • Russian language support with three different voices
  • Ability to customize pronunciation and intonations

Offline operation

  • Fully autonomous operation without internet connection
  • Suitable for use in systems with enhanced security requirements
  • Efficient operation on various hardware configurations

Russian Language Features

MaryTTS offers three different voices for the Russian language:

  • Male voice with neutral intonation
  • Female voice with soft diction
  • Voice for educational materials with clear articulation

The platform also allows:

  • Customizing speech speed and pitch
  • Adding pauses and intonational accents
  • Exporting audio in various formats
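
A script can request audio from MaryTTS through its built-in HTTP server. The sketch below assumes a MaryTTS 5.x server is already running locally on its default port 59125 and that a voice for the requested locale is installed. `build_marytts_url` and `synthesize` are hypothetical helper names, not part of MaryTTS.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

MARY_URL = "http://localhost:59125/process"  # assumption: default local server address


def build_marytts_url(text, locale="ru", base_url=MARY_URL):
    """Build a /process request URL asking MaryTTS for a WAV rendering of `text`."""
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": locale,
    }
    return base_url + "?" + urlencode(params)


def synthesize(text, wav_path, locale="ru"):
    """Fetch synthesized audio from the running server and save it to `wav_path`."""
    with urlopen(build_marytts_url(text, locale)) as response:
        audio = response.read()
    with open(wav_path, "wb") as out:
        out.write(audio)


if __name__ == "__main__":
    # Only prints the request URL; call synthesize() once the server is running.
    print(build_marytts_url("Привет, мир"))
```

Keeping URL construction separate from the network call makes the request easy to inspect before a server is available.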

eSpeak: Compact Solution

eSpeak is a lightweight speech synthesizer, perfect for basic text-to-speech tasks.

eSpeak Advantages

Minimal requirements

  • Very small distribution size
  • Low system resource consumption
  • Works on weak computers without GPU

Russian language support

  • Correct Cyrillic processing
  • Clear word pronunciation
  • Support for different accents and speech styles

Ease of use

  • Works via command line
  • Simple API for integration into applications
  • Support for batch text processing
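
Command-line use and batch processing can be combined in a few lines of Python. This is a minimal sketch, assuming the `espeak-ng` executable is on the PATH (pass `executable="espeak"` for the classic build); `build_espeak_command` and `speak_batch` are hypothetical helper names.

```python
import subprocess


def build_espeak_command(text, voice="ru", wav_path=None, speed_wpm=None,
                         executable="espeak-ng"):
    """Return the argv list for one synthesizer call (no shell involved)."""
    cmd = [executable, "-v", voice]
    if speed_wpm is not None:
        cmd += ["-s", str(speed_wpm)]  # speaking rate in words per minute
    if wav_path is not None:
        cmd += ["-w", wav_path]        # write a WAV file instead of playing audio
    cmd.append(text)
    return cmd


def speak_batch(lines, out_prefix="line", voice="ru"):
    """Render each text line to its own WAV file: line_000.wav, line_001.wav, ..."""
    paths = []
    for i, line in enumerate(lines):
        path = f"{out_prefix}_{i:03d}.wav"
        subprocess.run(build_espeak_command(line, voice, wav_path=path), check=True)
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Only prints the command; call speak_batch() to actually produce audio files.
    print(build_espeak_command("Привет!", wav_path="hello.wav"))
```

Passing the arguments as a list avoids shell quoting problems with Cyrillic text and punctuation.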

Limitations

  • Basic speech quality compared to neural network solutions
  • Limited naturalness of intonations
  • Fewer customization options

Setting Up Model Training on GPU

Since you have a GPU, you can train or fine-tune your own high-quality Russian speech synthesis model.

Hardware Requirements

Recommended parameters

  • NVIDIA GPU with video memory of at least 8GB
  • CUDA 11.0 or higher
  • Python 3.8+
  • RAM 16GB+

Alternative options

  • For AMD GPU: ROCm support
  • For systems without GPU: CPU usage is possible, but training will be significantly longer
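
Before starting a long training run, it can help to check the machine against the recommendations above. The sketch below covers only the Python version and GPU memory; the thresholds mirror the list, and `check_requirements` and `detect_vram_gb` are hypothetical helper names. It does not check RAM or the CUDA version.

```python
import sys

MIN_PYTHON = (3, 8)
MIN_VRAM_GB = 8


def check_requirements(python_version, vram_gb):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python 3.8 or newer is recommended")
    if vram_gb is None:
        problems.append("no CUDA GPU detected (or PyTorch is not installed)")
    elif vram_gb < MIN_VRAM_GB:
        problems.append(f"GPU has {vram_gb:.1f} GB of memory; 8 GB or more is recommended")
    return problems


def detect_vram_gb():
    """Return the memory of GPU 0 in decimal GB, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # Decimal gigabytes, so a card sold as "8 GB" is not rejected for reporting slightly under 8 GiB
    return torch.cuda.get_device_properties(0).total_memory / 1e9


if __name__ == "__main__":
    for problem in check_requirements(sys.version_info, detect_vram_gb()):
        print("Warning:", problem)
```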

Model Training Process

1. Data preparation

bash
# Install dependencies
pip install torch torchaudio
pip install -e ".[all,dev,notebooks]"  # run inside the cloned TTS directory

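# Prepare the dataset as a list of samples with text, audio_file, and speaker_name fields
# Example structure:
# [
#   {"text": "Hello world", "audio_file": "/path/to/audio.wav", "speaker_name": "speaker1"},
#   {"text": "How are you?", "audio_file": "/path/to/audio2.wav", "speaker_name": "speaker1"}
# ]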
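
One possible way to produce records with the text, audio file, and speaker name fields is to keep the transcripts in an LJSpeech-style `metadata.csv` (`id|text|normalized text`) next to a `wavs/` folder. The sketch below is an illustration in plain Python, not Coqui's own loader; `write_metadata` and `load_samples` are hypothetical names, and transcripts are assumed not to contain the `|` character.

```python
import os
import tempfile


def write_metadata(dataset_dir, entries):
    """Write `id|text|normalized text` rows to <dataset_dir>/metadata.csv."""
    path = os.path.join(dataset_dir, "metadata.csv")
    with open(path, "w", encoding="utf-8", newline="") as f:
        for clip_id, text in entries:
            f.write(f"{clip_id}|{text}|{text}\n")
    return path


def load_samples(dataset_dir, speaker_name="speaker1"):
    """Parse metadata.csv into a list of sample records, one per clip."""
    samples = []
    with open(os.path.join(dataset_dir, "metadata.csv"), encoding="utf-8") as f:
        for line in f:
            clip_id, _raw, normalized = line.rstrip("\n").split("|")
            samples.append({
                "text": normalized,
                "audio_file": os.path.join(dataset_dir, "wavs", clip_id + ".wav"),
                "speaker_name": speaker_name,
            })
    return samples


if __name__ == "__main__":
    # Demo in a temporary folder so the sketch runs without real recordings.
    with tempfile.TemporaryDirectory() as tmp:
        write_metadata(tmp, [("clip_0001", "Hello world"), ("clip_0002", "How are you?")])
        for sample in load_samples(tmp):
            print(sample)
```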

2. Configure training settings

python
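from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from TTS.utils.audio import AudioProcessor

# Describe the dataset; the path is a placeholder for your own data
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",  # expects metadata.csv plus a wavs/ folder
    meta_file_train="metadata.csv",
    path="/path/to/your/dataset",
)

# BaseTTSConfig is only a base class, so a concrete model config is needed;
# GlowTTSConfig is used here as one option that accepts the same fields
config = GlowTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="ru",
    phoneme_cache_path="./phoneme_cache",
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    output_path="./output",
    datasets=[dataset_config],
)

# Initialize audio processor
audio_processor = AudioProcessor.init_from_config(config)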

3. Start training

python
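from trainer import Trainer, TrainerArgs  # separate `trainer` package, installed with Coqui TTS
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.tts.utils.text.tokenizer import TTSTokenizer

# Build the tokenizer, data splits, and model from the config of step 2.
# GlowTTS is one possible model; the model class must match the config type.
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(config.datasets, eval_split=True)
model = GlowTTS(config, audio_processor, tokenizer, speaker_manager=None)

# Initialize trainer
trainer = Trainer(
    TrainerArgs(),
    config,
    config.output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)

# Start training
trainer.fit()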

GPU Optimization

CUDA settings

python
import torch

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

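# Speed optimization: let cuDNN benchmark and pick the fastest kernels.
# Set deterministic = True (and benchmark = False) only if you need
# reproducible runs, at some cost in speed.
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False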

Parallel processing

python
# Multiprocessing for data handling
config.num_loader_workers = 8  # Number of workers for data loading
config.num_eval_loader_workers = 8
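
The right worker count depends on how many CPU cores are available. The heuristic below is an assumption for illustration, not a Coqui rule: leave a couple of cores free for the training process itself and cap the result at the value used above. `suggest_loader_workers` is a hypothetical helper.

```python
import os


def suggest_loader_workers(cpu_count=None, reserve=2, cap=8):
    """Pick a data-loader worker count: leave `reserve` cores free, never exceed `cap`."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return max(1, min(cap, cpu_count - reserve))


if __name__ == "__main__":
    print("Suggested data-loader workers:", suggest_loader_workers())
```

The result can then be assigned to `config.num_loader_workers` and `config.num_eval_loader_workers`.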

Comparison of Best Options

| Parameter | Coqui TTS | MaryTTS | eSpeak |
|---|---|---|---|
| Speech quality | ★★★★★ (High) | ★★★★☆ (Good) | ★★☆☆☆ (Basic) |
| GPU support | ✅ (Full) | ❌ (CPU only) | ❌ (CPU only) |
| Distribution size | ~2GB | ~500MB | ~50MB |
| Setup complexity | Medium | Low | Very low |
| Russian language | ✅ (Excellent support) | ✅ (3 voices) | ✅ (Basic support) |
| Cloning capability | ✅ | ❌ | ❌ |
| Offline operation | ✅ | ✅ | ✅ |
| Platforms | Windows/Linux/macOS | Java platform | Windows/Linux/macOS |

Conclusion and Recommendations

Based on your requirements (free, offline, high quality, GPU support, and training capability), I recommend the following approach:

For beginners

eSpeak — if you need a quick solution with minimal resource requirements. Easy to install and use, but quality will be basic.

For advanced users

Coqui TTS — the optimal choice that combines high quality, GPU support, and training capability. Will require more time to master, but the results are worth it.

For comprehensive tasks

MaryTTS — a good alternative if you need a stable Java platform with rich functionality and Russian language support.

Practical steps to get started:

  1. Start with Coqui TTS — install the basic version and test with pre-trained models
  2. Prepare data — collect 1-2 hours of high-quality voice recordings for training
  3. Train the model — use GPU acceleration for efficient training
  4. Test and improve — experiment with different settings to achieve optimal quality
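
To check step 2, the total length of the collected recordings can be measured with Python's standard `wave` module. This is a minimal sketch, assuming the clips are uncompressed PCM WAV files stored directly in one folder; `wav_duration_seconds` and `total_hours` are hypothetical helper names.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the length of one PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_hours(dataset_dir):
    """Sum the duration of every .wav file directly inside `dataset_dir`, in hours."""
    seconds = sum(
        wav_duration_seconds(os.path.join(dataset_dir, name))
        for name in sorted(os.listdir(dataset_dir))
        if name.lower().endswith(".wav")
    )
    return seconds / 3600


if __name__ == "__main__":
    # Demo on a synthetic 2-second silent clip so the sketch runs without real data.
    with tempfile.TemporaryDirectory() as tmp:
        with wave.open(os.path.join(tmp, "clip.wav"), "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)  # 16-bit samples
            wav.setframerate(22050)
            wav.writeframes(b"\x00\x00" * 22050 * 2)
        print(f"{total_hours(tmp) * 3600:.1f} seconds of audio")
```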

Coqui TTS is currently the best free solution for offline speech synthesis with GPU training capability, offering quality comparable to commercial alternatives.

