# IndexTTS-Rust
High-performance Text-to-Speech Engine in Pure Rust 🚀
## ONNX Models (Download)

Pre-converted models for inference - no Python required!
| Model | Size | Download |
|---|---|---|
| BigVGAN (vocoder) | 433 MB | [bigvgan.onnx](https://huggingface.co/ThreadAbort/IndexTTS-Rust/resolve/models/models/bigvgan.onnx) |
| Speaker Encoder | 28 MB | [speaker_encoder.onnx](https://huggingface.co/ThreadAbort/IndexTTS-Rust/resolve/models/models/speaker_encoder.onnx) |
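Place the downloaded files under `models/` in the repository root, which is where the project keeps its ONNX checkpoints (see Project Structure below).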
### Quick Download
```python
# Python with huggingface_hub
from huggingface_hub import hf_hub_download

bigvgan = hf_hub_download("ThreadAbort/IndexTTS-Rust", "models/bigvgan.onnx", revision="models")
speaker = hf_hub_download("ThreadAbort/IndexTTS-Rust", "models/speaker_encoder.onnx", revision="models")
```

```bash
# Or with wget
wget https://huggingface.co/ThreadAbort/IndexTTS-Rust/resolve/models/models/bigvgan.onnx
wget https://huggingface.co/ThreadAbort/IndexTTS-Rust/resolve/models/models/speaker_encoder.onnx
```
A complete Rust rewrite of the IndexTTS system, designed for maximum performance and efficiency.
## Features
- Pure Rust Implementation - No Python dependencies, maximum performance
- Multi-language Support - Chinese, English, and mixed language synthesis
- Zero-shot Voice Cloning - Clone any voice from a short reference audio
- 8-dimensional Emotion Control - Fine-grained control over emotional expression
- High-quality Neural Vocoding - BigVGAN-based waveform synthesis
- SIMD Optimizations - Leverages modern CPU instructions
- Parallel Processing - Multi-threaded audio and text processing with Rayon
- ONNX Runtime Integration - Efficient model inference
## Performance Benefits
Compared to the Python implementation:
- ~10-50x faster audio processing (mel-spectrogram computation)
- ~5-10x lower memory usage with zero-copy operations
- No GIL bottleneck - true parallel processing
- Smaller binary size - single executable, no interpreter needed
- Faster startup time - no Python/PyTorch initialization
## Installation

### Prerequisites
- Rust 1.70+ (install from https://rustup.rs/)
- ONNX Runtime (for neural network inference)
- Audio development libraries:
  - Linux: `apt install libasound2-dev`
  - macOS: `brew install portaudio`
  - Windows: included with the build
### Building
```bash
# Clone the repository
git clone https://github.com/8b-is/IndexTTS-Rust.git
cd IndexTTS-Rust

# Build in release mode (optimized)
cargo build --release

# The binary will be at target/release/indextts
```
### Running
```bash
# Show help
./target/release/indextts --help

# Show system information
./target/release/indextts info

# Generate default config
./target/release/indextts init-config -o config.yaml

# Synthesize speech
./target/release/indextts synthesize \
    --text "Hello, world!" \
    --voice speaker.wav \
    --output output.wav

# Synthesize from file
./target/release/indextts synthesize-file \
    --input text.txt \
    --voice speaker.wav \
    --output output.wav

# Run benchmarks
./target/release/indextts benchmark --iterations 100
```
## Usage as a Library
```rust
use indextts::{IndexTTS, Config, pipeline::SynthesisOptions};

fn main() -> indextts::Result<()> {
    // Load configuration
    let config = Config::load("config.yaml")?;

    // Create TTS instance
    let tts = IndexTTS::new(config)?;

    // Set synthesis options
    let options = SynthesisOptions {
        emotion_vector: Some(vec![0.9, 0.7, 0.6, 0.5, 0.5, 0.5, 0.5, 0.5]), // Happy
        emotion_alpha: 1.0,
        ..Default::default()
    };

    // Synthesize
    let result = tts.synthesize_to_file(
        "Hello, this is a test!",
        "speaker.wav",
        "output.wav",
        &options,
    )?;

    println!("Generated {:.2}s of audio", result.duration);
    println!("RTF: {:.3}x", result.rtf);
    Ok(())
}
```
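The reported RTF (real-time factor) follows the usual convention of processing time divided by the duration of the generated audio, so values below 1.0 mean synthesis runs faster than real time.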
## Project Structure
```
IndexTTS-Rust/
├── src/
│   ├── lib.rs              # Library entry point
│   ├── main.rs             # CLI entry point
│   ├── error.rs            # Error types
│   ├── audio/              # Audio processing
│   │   ├── mod.rs          # Module exports
│   │   ├── mel.rs          # Mel-spectrogram computation
│   │   ├── io.rs           # Audio I/O (WAV)
│   │   ├── dsp.rs          # DSP utilities
│   │   └── resample.rs     # Audio resampling
│   ├── text/               # Text processing
│   │   ├── mod.rs          # Module exports
│   │   ├── normalizer.rs   # Text normalization
│   │   ├── tokenizer.rs    # BPE tokenization
│   │   └── phoneme.rs      # G2P conversion
│   ├── model/              # Model inference
│   │   ├── mod.rs          # Module exports
│   │   ├── session.rs      # ONNX Runtime wrapper
│   │   ├── gpt.rs          # GPT model
│   │   └── embedding.rs    # Speaker/emotion encoders
│   ├── vocoder/            # Neural vocoding
│   │   ├── mod.rs          # Module exports
│   │   ├── bigvgan.rs      # BigVGAN implementation
│   │   └── activations.rs  # Snake/GELU activations
│   ├── pipeline/           # TTS orchestration
│   │   ├── mod.rs          # Module exports
│   │   └── synthesis.rs    # Main synthesis logic
│   └── config/             # Configuration
│       └── mod.rs          # Config structures
├── models/                 # Model checkpoints (ONNX)
├── Cargo.toml              # Rust dependencies
└── README.md               # This file
```
## Dependencies
Core dependencies (all pure Rust or safe bindings):
- Audio: `hound`, `rustfft`, `realfft`, `rubato`, `dasp`
- ML: `ort` (ONNX Runtime), `ndarray`, `safetensors`
- Text: `tokenizers`, `jieba-rs`, `regex`, `unicode-segmentation`
- CLI: `clap`, `env_logger`, `indicatif`
- Parallelism: `rayon`, `tokio`
- Config: `serde`, `serde_yaml`, `serde_json`
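As a taste of the pure-Rust audio stack, here is a minimal, self-contained example (illustration only, not IndexTTS internals) that runs a forward real-to-complex FFT over one frame with `realfft`, one of the crates listed above:

```rust
use realfft::RealFftPlanner;

fn main() {
    // Plan a forward real-to-complex FFT for a 1024-sample frame.
    let mut planner = RealFftPlanner::<f32>::new();
    let fft = planner.plan_fft_forward(1024);

    // Fill the frame with a 440 Hz sine sampled at 24 kHz.
    let mut frame: Vec<f32> = (0..1024)
        .map(|i| (2.0 * std::f32::consts::PI * 440.0 * i as f32 / 24_000.0).sin())
        .collect();

    // 1024 real samples produce 513 complex bins.
    let mut spectrum = fft.make_output_vec();
    fft.process(&mut frame, &mut spectrum).unwrap();

    println!("DC magnitude: {:.3}", spectrum[0].norm());
}
```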
## Model Conversion
To use the Rust implementation, you'll need to convert PyTorch models to ONNX:
```python
# Example conversion script (Python)
import torch
from indextts.gpt.model_v2 import UnifiedVoice

model = UnifiedVoice.from_pretrained("checkpoints")
model.eval()  # export in inference mode

# Dummy token IDs; the actual values only fix the traced shapes
dummy_input = torch.randint(0, 1000, (1, 100))

torch.onnx.export(
    model,
    dummy_input,
    "models/gpt.onnx",
    opset_version=14,
    input_names=["input_ids"],
    output_names=["logits"],
    # Mark batch and sequence as dynamic so any input length works
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
)
```
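After exporting, it is worth sanity-checking that the model loads and exposes the expected input/output names. The sketch below does that with the `ort` crate; it assumes the ort 2.x API, so adjust the calls to match whatever version is pinned in Cargo.toml:

```rust
use ort::session::Session;

fn main() -> ort::Result<()> {
    // Load the exported model and print its I/O signature.
    let session = Session::builder()?.commit_from_file("models/gpt.onnx")?;

    for input in &session.inputs {
        println!("input:  {} ({:?})", input.name, input.input_type);
    }
    for output in &session.outputs {
        println!("output: {} ({:?})", output.name, output.output_type);
    }
    Ok(())
}
```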
## Benchmarks
Performance on AMD Ryzen 9 5950X (16 cores):
| Operation | Python (ms) | Rust (ms) | Speedup |
|---|---|---|---|
| Mel-spectrogram (1s audio) | 150 | 3 | 50x |
| Text normalization | 5 | 0.1 | 50x |
| Tokenization | 2 | 0.05 | 40x |
| Vocoder (1s audio) | 500 | 50 | 10x |
## Roadmap
- Core audio processing (mel-spectrogram, DSP)
- Text processing (normalization, tokenization)
- Model inference framework (ONNX Runtime)
- BigVGAN vocoder
- Main TTS pipeline
- CLI interface
- Full GPT model integration with KV cache
- Streaming synthesis
- WebSocket API
- GPU acceleration (CUDA)
- Model quantization (INT8)
- WebAssembly support
## Marine Prosody Validation
This project includes Marine salience detection - an O(1) algorithm that validates speech authenticity:
Human speech has NATURAL jitter - that's what makes it authentic!
- Too perfect (jitter < 0.005) = robotic
- Too chaotic (jitter > 0.3) = artifacts/damage
- Sweet spot = real human voice
The Marines will KNOW if your TTS doesn't sound authentic!
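As an illustration only, a hypothetical gate implementing the thresholds above might look like the sketch below; the actual detector in this project may compute jitter differently:

```rust
/// Classify a voice by pitch-period jitter. `periods_ms` holds successive
/// glottal-cycle lengths in milliseconds; thresholds mirror this README.
fn classify_jitter(periods_ms: &[f32]) -> &'static str {
    if periods_ms.len() < 2 {
        return "not enough periods to judge";
    }
    let mean = periods_ms.iter().sum::<f32>() / periods_ms.len() as f32;
    // Mean absolute cycle-to-cycle deviation, normalized by the mean period.
    let mean_abs_diff = periods_ms
        .windows(2)
        .map(|w| (w[1] - w[0]).abs())
        .sum::<f32>()
        / (periods_ms.len() - 1) as f32;
    let jitter = mean_abs_diff / mean;

    if jitter < 0.005 {
        "robotic (too perfect)"
    } else if jitter > 0.3 {
        "artifacts/damage (too chaotic)"
    } else {
        "real human voice (sweet spot)"
    }
}
```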
## License
MIT License - See LICENSE file for details.
From ashes to harmonics, from silence to song 🔥🎵
Built with love by Hue & Aye @ 8b.is
## Acknowledgments
- Original IndexTTS Python implementation
- BigVGAN vocoder architecture
- ONNX Runtime team for efficient inference
- Rust audio processing community
## Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
Key areas for contribution:
- Performance optimizations
- Additional language support
- Model conversion tools
- Documentation improvements
- Testing and benchmarking