# IndexTTS-Rust
High-performance Text-to-Speech Engine in Pure Rust 🚀
## ONNX Models (Download)

Pre-converted models for inference - no Python required!
| Model | Size | Download |
|---|---|---|
| BigVGAN (vocoder) | 433 MB | [bigvgan.onnx](https://huggingface.co/ThreadAbort/IndexTTS-Rust/resolve/models/models/bigvgan.onnx) |
| Speaker Encoder | 28 MB | [speaker_encoder.onnx](https://huggingface.co/ThreadAbort/IndexTTS-Rust/resolve/models/models/speaker_encoder.onnx) |
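Place the downloaded files under `models/` in the repository root, which is where the project keeps its ONNX checkpoints (see Project Structure below).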
### Quick Download
```python
# Python with huggingface_hub
from huggingface_hub import hf_hub_download

bigvgan = hf_hub_download("ThreadAbort/IndexTTS-Rust", "models/bigvgan.onnx", revision="models")
speaker = hf_hub_download("ThreadAbort/IndexTTS-Rust", "models/speaker_encoder.onnx", revision="models")
```

```bash
# Or with wget
wget https://huggingface.co/ThreadAbort/IndexTTS-Rust/resolve/models/models/bigvgan.onnx
wget https://huggingface.co/ThreadAbort/IndexTTS-Rust/resolve/models/models/speaker_encoder.onnx
```
A complete Rust rewrite of the IndexTTS system, designed for maximum performance and efficiency.
## Features
- Pure Rust Implementation - No Python dependencies, maximum performance
- Multi-language Support - Chinese, English, and mixed language synthesis
- Zero-shot Voice Cloning - Clone any voice from a short reference audio
- 8-dimensional Emotion Control - Fine-grained control over emotional expression
- High-quality Neural Vocoding - BigVGAN-based waveform synthesis
- SIMD Optimizations - Leverages modern CPU instructions
- Parallel Processing - Multi-threaded audio and text processing with Rayon
- ONNX Runtime Integration - Efficient model inference
## Performance Benefits
Compared to the Python implementation:
- ~10-50x faster audio processing (mel-spectrogram computation)
- ~5-10x lower memory usage with zero-copy operations
- No GIL bottleneck - true parallel processing
- Smaller binary size - single executable, no interpreter needed
- Faster startup time - no Python/PyTorch initialization
## Installation

### Prerequisites
- Rust 1.70+ (install from https://rustup.rs/)
- ONNX Runtime (for neural network inference)
- Audio development libraries:
  - Linux: `apt install libasound2-dev`
  - macOS: `brew install portaudio`
  - Windows: included with the build
### Building
```bash
# Clone the repository
git clone https://github.com/8b-is/IndexTTS-Rust.git
cd IndexTTS-Rust

# Build in release mode (optimized)
cargo build --release

# The binary will be at target/release/indextts
```
### Running
```bash
# Show help
./target/release/indextts --help

# Show system information
./target/release/indextts info

# Generate default config
./target/release/indextts init-config -o config.yaml

# Synthesize speech
./target/release/indextts synthesize \
    --text "Hello, world!" \
    --voice speaker.wav \
    --output output.wav

# Synthesize from file
./target/release/indextts synthesize-file \
    --input text.txt \
    --voice speaker.wav \
    --output output.wav

# Run benchmarks
./target/release/indextts benchmark --iterations 100
```
## Usage as a Library
```rust
use indextts::{IndexTTS, Config, pipeline::SynthesisOptions};

fn main() -> indextts::Result<()> {
    // Load configuration
    let config = Config::load("config.yaml")?;

    // Create TTS instance
    let tts = IndexTTS::new(config)?;

    // Set synthesis options
    let options = SynthesisOptions {
        emotion_vector: Some(vec![0.9, 0.7, 0.6, 0.5, 0.5, 0.5, 0.5, 0.5]), // Happy
        emotion_alpha: 1.0,
        ..Default::default()
    };

    // Synthesize
    let result = tts.synthesize_to_file(
        "Hello, this is a test!",
        "speaker.wav",
        "output.wav",
        &options,
    )?;

    println!("Generated {:.2}s of audio", result.duration);
    println!("RTF: {:.3}x", result.rtf);
    Ok(())
}
```
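The reported RTF (real-time factor) follows the usual convention of processing time divided by the duration of the generated audio, so values below 1.0 mean synthesis runs faster than real time.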
## Project Structure
```
IndexTTS-Rust/
├── src/
│   ├── lib.rs              # Library entry point
│   ├── main.rs             # CLI entry point
│   ├── error.rs            # Error types
│   ├── audio/              # Audio processing
│   │   ├── mod.rs          # Module exports
│   │   ├── mel.rs          # Mel-spectrogram computation
│   │   ├── io.rs           # Audio I/O (WAV)
│   │   ├── dsp.rs          # DSP utilities
│   │   └── resample.rs     # Audio resampling
│   ├── text/               # Text processing
│   │   ├── mod.rs          # Module exports
│   │   ├── normalizer.rs   # Text normalization
│   │   ├── tokenizer.rs    # BPE tokenization
│   │   └── phoneme.rs      # G2P conversion
│   ├── model/              # Model inference
│   │   ├── mod.rs          # Module exports
│   │   ├── session.rs      # ONNX Runtime wrapper
│   │   ├── gpt.rs          # GPT model
│   │   └── embedding.rs    # Speaker/emotion encoders
│   ├── vocoder/            # Neural vocoding
│   │   ├── mod.rs          # Module exports
│   │   ├── bigvgan.rs      # BigVGAN implementation
│   │   └── activations.rs  # Snake/GELU activations
│   ├── pipeline/           # TTS orchestration
│   │   ├── mod.rs          # Module exports
│   │   └── synthesis.rs    # Main synthesis logic
│   └── config/             # Configuration
│       └── mod.rs          # Config structures
├── models/                 # Model checkpoints (ONNX)
├── Cargo.toml              # Rust dependencies
└── README.md               # This file
```
## Dependencies
Core dependencies (all pure Rust or safe bindings):
- Audio: `hound`, `rustfft`, `realfft`, `rubato`, `dasp`
- ML: `ort` (ONNX Runtime), `ndarray`, `safetensors`
- Text: `tokenizers`, `jieba-rs`, `regex`, `unicode-segmentation`
- CLI: `clap`, `env_logger`, `indicatif`
- Parallelism: `rayon`, `tokio`
- Config: `serde`, `serde_yaml`, `serde_json`
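As a taste of the pure-Rust audio stack, here is a minimal, self-contained example (illustration only, not IndexTTS internals) that runs a forward real-to-complex FFT over one frame with `realfft`, one of the crates listed above:

```rust
use realfft::RealFftPlanner;

fn main() {
    // Plan a forward real-to-complex FFT for a 1024-sample frame.
    let mut planner = RealFftPlanner::<f32>::new();
    let fft = planner.plan_fft_forward(1024);

    // Fill the frame with a 440 Hz sine sampled at 24 kHz.
    let mut frame: Vec<f32> = (0..1024)
        .map(|i| (2.0 * std::f32::consts::PI * 440.0 * i as f32 / 24_000.0).sin())
        .collect();

    // 1024 real samples produce 513 complex bins.
    let mut spectrum = fft.make_output_vec();
    fft.process(&mut frame, &mut spectrum).unwrap();

    println!("DC magnitude: {:.3}", spectrum[0].norm());
}
```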
## Model Conversion
To use the Rust implementation, you'll need to convert PyTorch models to ONNX:
```python
# Example conversion script (Python)
import torch
from indextts.gpt.model_v2 import UnifiedVoice

model = UnifiedVoice.from_pretrained("checkpoints")
model.eval()  # export in inference mode

# Dummy token IDs; the actual values only fix the traced shapes
dummy_input = torch.randint(0, 1000, (1, 100))

torch.onnx.export(
    model,
    dummy_input,
    "models/gpt.onnx",
    opset_version=14,
    input_names=["input_ids"],
    output_names=["logits"],
    # Mark batch and sequence as dynamic so any input length works
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
)
```
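After exporting, it is worth sanity-checking that the model loads and exposes the expected input/output names. The sketch below does that with the `ort` crate; it assumes the ort 2.x API, so adjust the calls to match whatever version is pinned in Cargo.toml:

```rust
use ort::session::Session;

fn main() -> ort::Result<()> {
    // Load the exported model and print its I/O signature.
    let session = Session::builder()?.commit_from_file("models/gpt.onnx")?;

    for input in &session.inputs {
        println!("input:  {} ({:?})", input.name, input.input_type);
    }
    for output in &session.outputs {
        println!("output: {} ({:?})", output.name, output.output_type);
    }
    Ok(())
}
```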
## Benchmarks
Performance on AMD Ryzen 9 5950X (16 cores):
| Operation | Python (ms) | Rust (ms) | Speedup |
|---|---|---|---|
| Mel-spectrogram (1s audio) | 150 | 3 | 50x |
| Text normalization | 5 | 0.1 | 50x |
| Tokenization | 2 | 0.05 | 40x |
| Vocoder (1s audio) | 500 | 50 | 10x |
## Roadmap
- Core audio processing (mel-spectrogram, DSP)
- Text processing (normalization, tokenization)
- Model inference framework (ONNX Runtime)
- BigVGAN vocoder
- Main TTS pipeline
- CLI interface
- Full GPT model integration with KV cache
- Streaming synthesis
- WebSocket API
- GPU acceleration (CUDA)
- Model quantization (INT8)
- WebAssembly support
## Marine Prosody Validation
This project includes Marine salience detection - an O(1) algorithm that validates speech authenticity:
Human speech has NATURAL jitter - that's what makes it authentic!
- Too perfect (jitter < 0.005) = robotic
- Too chaotic (jitter > 0.3) = artifacts/damage
- Sweet spot = real human voice
The Marines will KNOW if your TTS doesn't sound authentic!
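As an illustration only, a hypothetical gate implementing the thresholds above might look like the sketch below; the actual detector in this project may compute jitter differently:

```rust
/// Classify a voice by pitch-period jitter. `periods_ms` holds successive
/// glottal-cycle lengths in milliseconds; thresholds mirror this README.
fn classify_jitter(periods_ms: &[f32]) -> &'static str {
    if periods_ms.len() < 2 {
        return "not enough periods to judge";
    }
    let mean = periods_ms.iter().sum::<f32>() / periods_ms.len() as f32;
    // Mean absolute cycle-to-cycle deviation, normalized by the mean period.
    let mean_abs_diff = periods_ms
        .windows(2)
        .map(|w| (w[1] - w[0]).abs())
        .sum::<f32>()
        / (periods_ms.len() - 1) as f32;
    let jitter = mean_abs_diff / mean;

    if jitter < 0.005 {
        "robotic (too perfect)"
    } else if jitter > 0.3 {
        "artifacts/damage (too chaotic)"
    } else {
        "real human voice (sweet spot)"
    }
}
```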
## License
MIT License - See LICENSE file for details.
From ashes to harmonics, from silence to song 🔥🎵
Built with love by Hue & Aye @ 8b.is
## Acknowledgments
- Original IndexTTS Python implementation
- BigVGAN vocoder architecture
- ONNX Runtime team for efficient inference
- Rust audio processing community
## Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
Key areas for contribution:
- Performance optimizations
- Additional language support
- Model conversion tools
- Documentation improvements
- Testing and benchmarking