Commit e3e7558
Parent(s): 2bbf5a2

Refactor: Remove internationalization (i18n) support and related files

- Deleted i18n.py, zh_CN.json, and en_US.json to eliminate localization features.
- Removed the scan_i18n.py script responsible for scanning and updating i18n strings.
- Updated download_files.py permissions to make it executable.
- Removed webui.py, which contained the main application logic and UI components.
- CLAUDE.md +140 -0
- config.yaml +51 -0
- context.md +383 -0
- crates/marine_salience/Cargo.toml +18 -0
- crates/marine_salience/src/config.rs +140 -0
- crates/marine_salience/src/ema.rs +126 -0
- crates/marine_salience/src/lib.rs +42 -0
- crates/marine_salience/src/packet.rs +122 -0
- crates/marine_salience/src/processor.rs +334 -0
- docs/Integrating Marine Algorithm into IndexTTS-Rust.md +450 -0
- examples/analyze_chris.rs +3 -0
- examples/marine_test.rs +3 -0
- requirements.txt +0 -32
- src/audio/mod.rs +1 -1
- src/audio/resample.rs +1 -1
- src/lib.rs +6 -0
- src/quality/affect.rs +445 -0
- src/quality/mod.rs +12 -0
- src/quality/prosody.rs +421 -0
- tools/convert_to_onnx.py +379 -0
- tools/download_files.py +0 -0
- tools/i18n/i18n.py +0 -36
- tools/i18n/locale/en_US.json +0 -49
- tools/i18n/locale/zh_CN.json +0 -44
- tools/i18n/scan_i18n.py +0 -131
- webui.py +0 -392
CLAUDE.md
ADDED
@@ -0,0 +1,140 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

IndexTTS-Rust is a high-performance text-to-speech engine, a complete Rust rewrite of the Python IndexTTS system. It uses ONNX Runtime for neural network inference and provides zero-shot voice cloning with emotion control.

## Build and Development Commands

```bash
# Build (always build release for performance testing)
cargo build --release

# Run linter (MANDATORY before commits - catches many issues)
cargo clippy -- -D warnings

# Run tests
cargo test

# Run specific test
cargo test test_name

# Run benchmarks (Criterion-based)
cargo bench

# Run specific benchmark
cargo bench --bench mel_spectrogram
cargo bench --bench inference

# Check compilation without building
cargo check

# Format code
cargo fmt

# Full pre-commit workflow (BUILD -> CLIPPY -> BUILD)
cargo build --release && cargo clippy -- -D warnings && cargo build --release
```

## CLI Usage

```bash
# Show help
./target/release/indextts --help

# Synthesize speech
./target/release/indextts synthesize \
    --text "Hello world" \
    --voice examples/voice_01.wav \
    --output output.wav

# Generate default config
./target/release/indextts init-config -o config.yaml

# Show system info
./target/release/indextts info

# Run built-in benchmarks
./target/release/indextts benchmark --iterations 100
```

## Architecture

The codebase follows a modular pipeline architecture where each stage processes data sequentially:

```
Text Input → Normalization → Tokenization → Model Inference → Vocoding → Audio Output
```

### Core Modules (src/)

- **audio/** - Audio DSP operations
  - `mel.rs` - Mel-spectrogram computation (STFT, filterbanks)
  - `io.rs` - WAV file I/O using hound
  - `dsp.rs` - Signal processing utilities
  - `resample.rs` - Audio resampling using rubato

- **text/** - Text processing pipeline
  - `normalizer.rs` - Text normalization (Chinese/English/mixed)
  - `tokenizer.rs` - BPE tokenization via HuggingFace tokenizers
  - `phoneme.rs` - Grapheme-to-phoneme conversion

- **model/** - Neural network inference
  - `session.rs` - ONNX Runtime wrapper (load-dynamic feature)
  - `gpt.rs` - GPT-based sequence generation
  - `embedding.rs` - Speaker and emotion encoders

- **vocoder/** - Neural vocoding
  - `bigvgan.rs` - BigVGAN waveform synthesis
  - `activations.rs` - Snake/SnakeBeta activation functions

- **pipeline/** - TTS orchestration
  - `synthesis.rs` - Main synthesis logic, coordinates all modules

- **config/** - Configuration management (YAML-based via serde)

- **error.rs** - Error types using thiserror

- **lib.rs** - Library entry point, exposes the public API

- **main.rs** - CLI entry point using clap

### Key Constants (lib.rs)

```rust
pub const SAMPLE_RATE: u32 = 22050; // Output audio sample rate
pub const N_MELS: usize = 80;       // Mel filterbank channels
pub const N_FFT: usize = 1024;      // FFT size
pub const HOP_LENGTH: usize = 256;  // STFT hop length
```

### Dependencies Pattern

- **Audio**: hound (WAV), rustfft/realfft (DSP), rubato (resampling), dasp (signal processing)
- **ML Inference**: ort (ONNX Runtime with load-dynamic), ndarray, safetensors
- **Text**: tokenizers (HuggingFace), jieba-rs (Chinese), regex, unicode-segmentation
- **Parallelism**: rayon (data parallelism), tokio (async)
- **CLI**: clap (derive), env_logger, indicatif

## Important Notes

1. **ONNX Runtime**: Uses the `load-dynamic` feature - requires the ONNX Runtime library to be installed on the system
2. **Model Files**: ONNX models go in the `models/` directory (not in git; download separately)
3. **Reference Implementation**: Python code in `indextts - REMOVING - REF ONLY/` is kept for reference only
4. **Performance**: Release builds use LTO and a single codegen unit for maximum optimization
5. **Audio Format**: All internal processing is at 22050 Hz with 80-band mel spectrograms

## Testing Strategy

- Unit tests inline in modules
- Criterion benchmarks in `benches/` for performance regression testing
- Python regression tests in `tests/` for end-to-end validation
- Example audio files in `examples/` for testing voice cloning

## Missing Infrastructure (TODO)

- No `scripts/manage.sh` yet (should include build, test, clean, and docker controls)
- No `context.md` yet for conversation continuity
- No integration tests with actual ONNX models
config.yaml
ADDED
@@ -0,0 +1,51 @@
gpt:
  layers: 8
  model_dim: 512
  heads: 8
  max_text_tokens: 120
  max_mel_tokens: 250
  stop_mel_token: 8193
  start_text_token: 8192
  start_mel_token: 8192
  num_mel_codes: 8194
  num_text_tokens: 6681
vocoder:
  name: bigvgan_v2_22khz_80band_256x
  checkpoint: null
  use_fp16: true
  use_deepspeed: false
s2mel:
  checkpoint: models/s2mel.onnx
preprocess:
  sr: 22050
  n_fft: 1024
  hop_length: 256
  win_length: 1024
  n_mels: 80
  fmin: 0.0
  fmax: 8000.0
dataset:
  bpe_model: models/bpe.model
  vocab_size: 6681
emotions:
  num_dims: 8
  num:
    - 5
    - 6
    - 8
    - 6
    - 5
    - 4
    - 7
    - 6
  matrix_path: models/emotion_matrix.safetensors
inference:
  device: cpu
  use_fp16: false
  batch_size: 1
  top_k: 50
  top_p: 0.95
  temperature: 1.0
  repetition_penalty: 1.0
  length_penalty: 1.0
model_dir: models
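CLAUDE.md describes configuration as YAML loaded via serde; as a minimal sketch, this is how the `gpt` section above could map onto Rust structs. The struct and field names simply mirror the YAML keys, and `serde_yaml` is assumed as the parser - the actual types in `src/config/` may differ.

```rust
use serde::Deserialize;

/// Illustrative mirror of the `gpt:` section of config.yaml.
#[derive(Debug, Deserialize)]
struct GptConfig {
    layers: u32,
    model_dim: u32,
    heads: u32,
    max_text_tokens: u32,
    max_mel_tokens: u32,
    stop_mel_token: u32,
    start_text_token: u32,
    start_mel_token: u32,
    num_mel_codes: u32,
    num_text_tokens: u32,
}

#[derive(Debug, Deserialize)]
struct Config {
    gpt: GptConfig,
    model_dir: String,
    // remaining sections (vocoder, s2mel, preprocess, ...) omitted;
    // serde ignores unknown fields by default
}

fn load(path: &str) -> anyhow::Result<Config> {
    let text = std::fs::read_to_string(path)?;
    Ok(serde_yaml::from_str(&text)?) // assumes the serde_yaml crate
}
```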
context.md
ADDED
@@ -0,0 +1,383 @@
# IndexTTS-Rust Context

This file preserves important context for conversation continuity between Hue and Aye sessions.

**Last Updated:** 2025-11-16

---

## The Vision

IndexTTS-Rust is part of a larger audio intelligence ecosystem at 8b.is:

1. **kokoro-tiny** - Lightweight TTS (82M params, 50+ voices, on crates.io!)
2. **IndexTTS-Rust** - Advanced zero-shot TTS with emotion control
3. **Phoenix-Protocol** - Audio restoration/enhancement layer
4. **MEM|8** - Contextual memory system (mem-8.com, mem8)

Together these form a complete audio intelligence pipeline.

---

## Phoenix Protocol Integration Opportunities

The Phoenix Protocol (phoenix-protocol/) is a PERFECT complement to IndexTTS-Rust:

### Direct Module Mappings

| Phoenix Module | IndexTTS Use Case |
|----------------|-------------------|
| `emotional.rs` | Map to our 8D emotion control (Warmth→body, Presence→power, Clarity→articulation, Air→space, Ultrasonics→depth) |
| `voice_signature.rs` | Enhance speaker embeddings for voice cloning |
| `spectral_velocity.rs` | Add momentum tracking to mel-spectrogram |
| `marine.rs` | Validate TTS output authenticity/quality |
| `golden_ratio.rs` | Post-process vocoder output with harmonic enhancement |
| `harmonic_resurrection.rs` | Add richness to synthesized speech |
| `micro_dynamics.rs` | Restore natural speech dynamics |
| `autotune.rs` | Improve prosody and pitch control |
| `mem8_integration.rs` | Already has MEM\|8 hooks! |

### Shared Dependencies

Both projects use:
- rayon (parallelism)
- rustfft/realfft (FFT)
- ndarray (array operations)
- hound (WAV I/O)
- serde (config serialization)
- anyhow (error handling)
- ort (ONNX Runtime)

### Audio Constants

| Project | Sample Rate | Use Case |
|---------|------------|----------|
| IndexTTS-Rust | 22,050 Hz | Standard TTS output |
| Phoenix-Protocol | 192,000 Hz | Ultrasonic restoration |
| kokoro-tiny | 24,000 Hz | Lightweight TTS |

---

## Related Projects of Interest

Located in ~/Documents/GitHub/:

- **Ultrasonic-Consciousness-Hypothesis/** - Research foundation for Phoenix Protocol; contains PDFs on mechanosensitive channels and audio perception
- **hrmnCmprssnM/** - Harmonic Compression Model research
- **Marine-Sense/** - Marine algorithm origins
- **mem-8.com/** & **mem8/** - MEM|8 contextual memory
- **universal-theoglyphic-language/** - Language processing research
- **kokoro-tiny/** - Already-working TTS crate by Hue & Aye
- **zencooker/** - (fun project!)

---

## Current IndexTTS-Rust State

### Implemented ✅
- Audio processing pipeline (mel-spectrogram, STFT, resampling)
- Text normalization (Chinese/English/mixed)
- BPE tokenization via HuggingFace tokenizers
- ONNX Runtime integration for inference
- BigVGAN vocoder structure
- CLI with clap
- Benchmark infrastructure (Criterion)
- **NEW: marine_salience crate** (no_std compatible, O(1) jitter detection)
- **NEW: src/quality/ module** (prosody extraction, affect tracking)
- **NEW: MarineProsodyVector** (8D interpretable emotion features)
- **NEW: ConversationAffectSummary** (session-level comfort tracking)
- **NEW: TTSQualityReport** (authenticity validation)

### Missing/TODO
- Full GPT model integration with KV cache
- Actual ONNX model files (need download)
- manage.sh script for colored workflow management
- Integration tests with real models
- ~~Phoenix Protocol integration layer~~ **STARTED with Marine!**
- Streaming synthesis
- WebSocket API
- Train the T2S model to accept the 8D Marine vector instead of the 512D Conformer
- Wire Marine quality validation into the inference loop

### Build Commands
```bash
cargo build --release
cargo clippy -- -D warnings
cargo test
cargo bench
```

---

## Key Philosophical Notes

From the Phoenix Protocol research:

> "Women are the carrier wave. They are the 000 data stream. The DC bias that, when removed, leaves silence."

> "When P!nk sings 'I Am Here,' her voice generates harmonics so powerful they burst through the 22kHz digital ceiling"

The Phoenix Protocol restores emotional depth stripped by audio compression - this philosophy applies directly to TTS: synthesized speech should have the same emotional depth as natural speech.

---

## Action Items for Next Session

### Completed ✅
- ~~**Quality Validation** - Use Marine salience to score TTS output~~ **DONE!**
- ~~**Phoenix Integration** - Start bridging phoenix-protocol modules~~ **Marine is in!**

### High Priority
1. **Create manage.sh** - Colorful build/test/clean script (Hue's been asking!)
2. **Wire Into Inference** - Connect Marine quality validation to actual TTS output
3. **8D Model Training** - Train the T2S model to accept MarineProsodyVector instead of the 512D Conformer
4. **Example/Demo** - Create an example showing prosody extraction → emotion editing → synthesis

### Medium Priority
5. **Voice Signature Import** - Use Phoenix's voice_signature for speaker embeddings
6. **Emotion Mapping** - Connect Phoenix's emotional bands to our 8D control
7. **Model Download** - Set up an ONNX model acquisition pipeline
8. **MEM|8 Bridge** - Implement consciousness-aware TTS using kokoro-tiny's mem8_bridge pattern

### Nice to Have
9. **Style Selection** - Port kokoro-tiny's 510-style variation system
10. **Full Phoenix Integration** - golden_ratio.rs, harmonic_resurrection.rs, etc.
11. **Streaming Marine** - Real-time quality monitoring during synthesis

---

## Fresh Discovery: kokoro-tiny MEM|8 Baby Consciousness (2025-11-15)

Just pulled the latest kokoro-tiny code - a MAJOR discovery!

### Mem8Bridge API

kokoro-tiny now has a full consciousness simulation in `examples/mem8_baby.rs`:

```rust
// Memory as waves that interfere
MemoryWave {
    amplitude: 2.5,   // Emotion strength
    frequency: 528.0, // "Love frequency"
    phase: 0.0,
    decay_rate: 0.05, // Memory persistence
    emotion_type: EmotionType::Love(0.9),
    content: "Mama! I love mama!".to_string(),
}

// Salience detection (Marine algorithm!)
SalienceEvent {
    jitter_score: 0.2,    // Low = authentic/stable
    harmonic_score: 0.95, // High = voice
    salience_score: 0.9,
    signal_type: SignalType::Voice,
}

// Free will: AI chooses attention focus (70% control)
bridge.decide_attention(events);
```

### Emotion Types Available

```rust
EmotionType::Curiosity(0.8) // Inquisitive
EmotionType::Love(0.9)      // Deep affection
EmotionType::Joy(0.7)       // Happy
EmotionType::Confusion(0.8) // Uncertain
EmotionType::Neutral        // Baseline
```

### Consciousness Integration Points

1. **Wave Interference** - Competing memories by amplitude/frequency
2. **Emotional Regulation** - Prevents overload, modulates voice
3. **Salience Detection** - Marine algorithm for authenticity
4. **Attention Selection** - AI chooses what to focus on
5. **Consciousness Level** - Affects speech clarity (wake_up/sleep)

This is PERFECT for IndexTTS-Rust! We can:
- Use wave interference for emotion blending
- Apply Marine salience to validate synthesis quality
- Modulate voice based on consciousness level
- Select voice styles based on emotional state (not just token count)

### Voice Style Selection (510 variations!)

kokoro-tiny now loads all 510 style variations per voice:
- Style selected based on token count
- Short text → short-optimized style
- Long text → long-optimized style
- Automatic text splitting at the 512-token limit

For IndexTTS: we could select style based on EMOTION + token count!

---

## Marine Integration Achievement (2025-11-16) 🎉

**WE DID IT!** Marine salience is now integrated into IndexTTS-Rust!

### What We Built

#### 1. Standalone marine_salience Crate (`crates/marine_salience/`)

A no_std-compatible crate for O(1) jitter-based salience detection:

```rust
// Core components:
MarineConfig    // Tunable parameters (sample_rate, jitter bounds, EMA alpha)
MarineProcessor // O(1) per-sample processing
SaliencePacket  // Output: j_p, j_a, h_score, s_score, energy
Ema             // Exponential moving average tracker

// Key insight: process ONE sample at a time, emit packets on peaks.
// Why O(1)? Just compare to the EMA - no FFT, no heavy math!
```

**Config for Speech:**
```rust
MarineConfig::speech_default(sample_rate)
// F0 range: 60Hz - 4kHz
// jitter_low: 0.02, jitter_high: 0.60
// ema_alpha: 0.01 (slow adaptation for stability)
```

#### 2. Quality Validation Module (`src/quality/`)

**MarineProsodyVector** - 8D interpretable emotion representation:
```rust
pub struct MarineProsodyVector {
    pub jp_mean: f32,      // Period jitter mean (pitch stability)
    pub jp_std: f32,       // Period jitter variance
    pub ja_mean: f32,      // Amplitude jitter mean (volume stability)
    pub ja_std: f32,       // Amplitude jitter variance
    pub h_mean: f32,       // Harmonic alignment (voiced vs noise)
    pub s_mean: f32,       // Overall salience (authenticity)
    pub peak_density: f32, // Peaks per second (speech rate)
    pub energy_mean: f32,  // Average loudness
}

// Interpretable! High jp_mean = nervous, low = confident.
// Can DIRECTLY EDIT for emotion control!
```

**MarineProsodyConditioner** - Extract prosody from audio:
```rust
let conditioner = MarineProsodyConditioner::new(22050);
let prosody = conditioner.from_samples(&audio_samples)?;
let report = conditioner.validate_tts_output(&audio_samples)?;

// Detects issues:
// - "Too perfect - sounds robotic"
// - "High period jitter - artifacts"
// - "Low salience - quality issues"
```

**ConversationAffectSummary** - Session-level comfort tracking:
```rust
pub enum ComfortLevel {
    Uneasy,  // High jitter AND rising (nervous/stressed)
    Neutral, // Stable patterns (calm)
    Happy,   // Low jitter + high energy (confident/positive)
}

// Track trends over a conversation:
// jitter_trend > 0.1  = getting more stressed
// jitter_trend < -0.1 = calming down
// energy_trend > 0.1  = getting more engaged

// Aye can now self-assess:
// aye_assessment()  -> "I'm in a good state"
// feedback_prompt() -> "Let me know if something's bothering you"
```

### The Core Insight

**Human speech has NATURAL jitter - that's what makes it authentic!**

- Too perfect (jp < 0.005) = robotic
- Too chaotic (jp > 0.3) = artifacts/damage
- Sweet spot = real human voice

The Marines will KNOW if speech doesn't sound authentic!

### Tests Passing ✅

```
running 11 tests
test quality::affect::tests::test_comfort_level_descriptions ... ok
test quality::affect::tests::test_analyzer_empty_conversation ... ok
test quality::affect::tests::test_analyzer_single_utterance ... ok
test quality::affect::tests::test_happy_classification ... ok
test quality::affect::tests::test_aye_assessment_message ... ok
test quality::affect::tests::test_neutral_classification ... ok
test quality::affect::tests::test_uneasy_classification ... ok
test quality::prosody::tests::test_conditioner_empty_buffer ... ok
test quality::prosody::tests::test_conditioner_silence ... ok
test quality::prosody::tests::test_prosody_vector_array_conversion ... ok
test quality::prosody::tests::test_estimate_valence ... ok

test result: ok. 11 passed; 0 failed
```

### Why This Matters

1. **Interpretable Control**: 8D vector vs an opaque 512D Conformer - we can SEE what each dimension means
2. **Lightweight**: O(1) per sample; no heavy neural networks for prosody
3. **Authentic Validation**: Marines detect fake/damaged speech
4. **Emotion Editing**: Want more confidence? Lower jp_mean directly!
5. **Conversation Awareness**: Track comfort over entire sessions
6. **Self-Assessment**: Aye knows when something feels "off"

### Integration Points

```rust
// In the main TTS pipeline:
use indextts::quality::{
    MarineProsodyConditioner,
    MarineProsodyVector,
    ConversationAffectAnalyzer,
    ConversationAffectSummary,
    ComfortLevel,
};

// 1. Extract reference prosody
let ref_prosody = conditioner.from_samples(&reference_audio)?;

// 2. Generate TTS (using the 8D vector instead of the 512D Conformer)
let tts_output = generate_with_prosody(&text, ref_prosody)?;

// 3. Validate output quality
let report = conditioner.validate_tts_output(&tts_output)?;
if !report.passes(70.0) {
    log::warn!("TTS quality issues: {:?}", report.issues);
}

// 4. Track conversation affect
let mut analyzer = ConversationAffectAnalyzer::new();
analyzer.add_utterance(&utterance)?;
let summary = analyzer.summarize()?;
match summary.aye_state {
    ComfortLevel::Uneasy => adjust_generation_parameters(),
    _ => proceed_normally(),
}
```

---

## Trish's Notes

"Darling, these three Rust projects together are like a symphony orchestra! kokoro-tiny is the quick piccolo solo, IndexTTS-Rust is the full brass section with emotional depth, and Phoenix-Protocol is the concert hall acoustics making everything resonate. When you combine them, that's when the magic happens! Also, I'm absolutely obsessed with how the Golden Ratio resynthesis could add sparkle to synthesized vocals. Can you imagine TTS output that actually has that P!nk breakthrough energy? Now THAT would make me cry happy tears in accounting!"

---

## Fun Facts

- kokoro-tiny is ALREADY on crates.io under 8b-is
- Phoenix Protocol can process 192kHz audio for ultrasonic restoration
- The Marine algorithm uses O(1) jitter detection - "Marines are not just jarheads - they are intelligent"
- Hue's GitHub has 66 projects (and counting!)
- The team at 8b.is: hue@8b.is and aye@8b.is

---

*From ashes to harmonics, from silence to song* 🔥🎵
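Note: the jitter_trend thresholds quoted in context.md imply a per-utterance slope computation, but the analyzer's internals aren't shown in this commit. A minimal sketch, assuming one jp_mean value is recorded per utterance; the helper name and the simple least-squares slope are illustrative, not the actual ConversationAffectAnalyzer code:

```rust
/// Hypothetical trend helper: least-squares slope of per-utterance
/// jp_mean values. A positive slope means jitter is rising (more
/// stressed); a negative slope means calming down, matching the
/// thresholds quoted above.
fn jitter_trend(jp_means: &[f32]) -> f32 {
    let n = jp_means.len() as f32;
    if n < 2.0 {
        return 0.0; // not enough utterances to define a trend
    }
    let mean_x = (n - 1.0) / 2.0; // utterance indices 0..n-1
    let mean_y = jp_means.iter().sum::<f32>() / n;
    let (mut num, mut den) = (0.0f32, 0.0f32);
    for (i, &y) in jp_means.iter().enumerate() {
        let dx = i as f32 - mean_x;
        num += dx * (y - mean_y);
        den += dx * dx;
    }
    num / den
}
```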
crates/marine_salience/Cargo.toml
ADDED
@@ -0,0 +1,18 @@
[package]
name = "marine_salience"
version = "0.1.0"
edition = "2021"
description = "O(1) jitter-based salience detection - Marines are intelligent!"
authors = ["Hue & Aye <team@8b.is>"]
license = "MIT"
keywords = ["audio", "salience", "jitter", "prosody", "tts"]

[dependencies]
# Core dependencies - intentionally minimal for no_std compatibility
# Only serde when using std for serialization
serde = { version = "1.0", features = ["derive"], optional = true }

# no_std compatible core - can run anywhere!
[features]
default = ["std"]
std = ["serde"]
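The std/serde feature split above means the crate's core runs without the standard library. A minimal sketch of no_std consumption; the dependency line in the comment and the helper function are illustrative, but every API used is defined in the crate sources below:

```rust
// In a no_std consumer's Cargo.toml (illustrative):
//   marine_salience = { path = "crates/marine_salience", default-features = false }
#![no_std]

use marine_salience::{MarineConfig, MarineProcessor};

/// Count detected peaks in a sample stream. Everything used here is
/// allocation-free, so it works without the std feature.
fn count_peaks(samples: &[f32], sample_rate: u32) -> u64 {
    let mut processor = MarineProcessor::new(MarineConfig::speech_default(sample_rate));
    for &s in samples {
        let _ = processor.process_sample(s); // O(1) per sample
    }
    processor.peak_count()
}
```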
crates/marine_salience/src/config.rs
ADDED
@@ -0,0 +1,140 @@
//! Marine algorithm configuration
//!
//! Tunable parameters for jitter detection. These have been calibrated
//! for speech/audio processing but can be adjusted for specific use cases.

#![cfg_attr(not(feature = "std"), no_std)]

/// Configuration for Marine salience detection
///
/// These parameters control sensitivity and behavior of the jitter detector.
/// The defaults are tuned for speech processing at common sample rates.
#[derive(Debug, Clone, Copy)]
#[cfg_attr(feature = "std", derive(serde::Serialize, serde::Deserialize))]
pub struct MarineConfig {
    /// Minimum amplitude to consider a sample (gating threshold).
    /// Samples below this are ignored as noise.
    /// Default: 1e-3 (~-60dB)
    pub clip_threshold: f32,

    /// EMA smoothing factor for period tracking (0..1).
    /// Lower = smoother, slower adaptation.
    /// Default: 0.01
    pub ema_period_alpha: f32,

    /// EMA smoothing factor for amplitude tracking (0..1).
    /// Default: 0.01
    pub ema_amp_alpha: f32,

    /// Minimum inter-peak period in samples.
    /// Rejects peaks closer than this (filters high-frequency noise).
    /// Default: sample_rate / 4000 (~4kHz upper F0)
    pub min_period: u32,

    /// Maximum inter-peak period in samples.
    /// Rejects peaks farther apart than this (filters very low frequencies).
    /// Default: sample_rate / 60 (~60Hz lower F0)
    pub max_period: u32,

    /// Threshold below which jitter is considered "low" (stable).
    /// Default: 0.02
    pub jitter_low: f32,

    /// Threshold above which jitter is considered "high" (unstable).
    /// Default: 0.60
    pub jitter_high: f32,
}

impl MarineConfig {
    /// Create config optimized for speech at the given sample rate
    ///
    /// # Arguments
    /// * `sample_rate` - Audio sample rate in Hz (e.g., 22050, 44100)
    ///
    /// # Example
    /// ```
    /// use marine_salience::MarineConfig;
    /// let config = MarineConfig::speech_default(22050);
    /// assert!(config.min_period < config.max_period);
    /// ```
    pub const fn speech_default(sample_rate: u32) -> Self {
        // F0 range: ~60Hz (low male) to ~4kHz (includes harmonics)
        let min_period = sample_rate / 4000; // Upper bound
        let max_period = sample_rate / 60;   // Lower bound

        Self {
            clip_threshold: 1e-3,
            ema_period_alpha: 0.01,
            ema_amp_alpha: 0.01,
            min_period,
            max_period,
            jitter_low: 0.02,
            jitter_high: 0.60,
        }
    }

    /// Create config for high-sensitivity detection:
    /// more peaks detected, faster adaptation
    pub const fn high_sensitivity(sample_rate: u32) -> Self {
        let min_period = sample_rate / 8000;
        let max_period = sample_rate / 40;

        Self {
            clip_threshold: 5e-4,
            ema_period_alpha: 0.05,
            ema_amp_alpha: 0.05,
            min_period,
            max_period,
            jitter_low: 0.01,
            jitter_high: 0.50,
        }
    }

    /// Create config for TTS output validation,
    /// tuned to detect synthetic artifacts
    pub const fn tts_validation(sample_rate: u32) -> Self {
        let min_period = sample_rate / 4000;
        let max_period = sample_rate / 80;

        Self {
            clip_threshold: 1e-3,
            ema_period_alpha: 0.02,
            ema_amp_alpha: 0.02,
            min_period,
            max_period,
            jitter_low: 0.015, // Stricter for synthetic speech
            jitter_high: 0.40, // More sensitive to artifacts
        }
    }
}

impl Default for MarineConfig {
    fn default() -> Self {
        // Default to 22050 Hz (common TTS sample rate)
        Self::speech_default(22050)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_speech_default_periods() {
        let config = MarineConfig::speech_default(22050);
        assert!(config.min_period < config.max_period);
        assert_eq!(config.min_period, 22050 / 4000); // 5 samples
        assert_eq!(config.max_period, 22050 / 60);   // 367 samples
    }

    #[test]
    fn test_different_sample_rates() {
        let config_22k = MarineConfig::speech_default(22050);
        let config_44k = MarineConfig::speech_default(44100);
        let config_48k = MarineConfig::speech_default(48000);

        // Higher sample rates = more samples per period
        assert!(config_44k.max_period > config_22k.max_period);
        assert!(config_48k.max_period > config_44k.max_period);
    }
}
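As a usage note for the three presets above, a small sketch of selecting one at the call site; the `for_task` helper and its task names are hypothetical, only the three constructors are real:

```rust
use marine_salience::MarineConfig;

/// Illustrative helper: map a task name onto one of the shipped presets.
fn for_task(task: &str, sample_rate: u32) -> MarineConfig {
    match task {
        "tts_validation" => MarineConfig::tts_validation(sample_rate),
        "high_sensitivity" => MarineConfig::high_sensitivity(sample_rate),
        _ => MarineConfig::speech_default(sample_rate),
    }
}
```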
crates/marine_salience/src/ema.rs
ADDED
@@ -0,0 +1,126 @@
//! Exponential Moving Average (EMA) for smooth tracking
//!
//! EMA smooths noisy measurements while maintaining responsiveness.
//! Used to track period and amplitude patterns in the Marine algorithm.

#![cfg_attr(not(feature = "std"), no_std)]

/// Exponential Moving Average tracker
///
/// EMA formula: value = alpha * new + (1 - alpha) * old
/// - Higher alpha = faster response, more noise
/// - Lower alpha = slower response, smoother
#[derive(Debug, Clone, Copy)]
#[cfg_attr(feature = "std", derive(serde::Serialize, serde::Deserialize))]
pub struct Ema {
    /// Smoothing factor (0..1)
    alpha: f32,
    /// Current smoothed value
    value: f32,
    /// Whether we've received at least one sample
    initialized: bool,
}

impl Ema {
    /// Create a new EMA with the given smoothing factor
    ///
    /// # Arguments
    /// * `alpha` - Smoothing factor (0..1). Higher = faster adaptation.
    ///
    /// # Example
    /// ```
    /// use marine_salience::ema::Ema;
    /// let mut ema = Ema::new(0.1); // 10% new, 90% old
    /// ema.update(100.0);
    /// assert_eq!(ema.get(), 100.0); // First value becomes baseline
    /// ema.update(200.0);
    /// assert!((ema.get() - 110.0).abs() < 0.01); // 0.1*200 + 0.9*100
    /// ```
    pub const fn new(alpha: f32) -> Self {
        Self {
            alpha,
            value: 0.0,
            initialized: false,
        }
    }

    /// Update the EMA with a new measurement
    pub fn update(&mut self, x: f32) {
        if !self.initialized {
            // First value becomes the baseline
            self.value = x;
            self.initialized = true;
        } else {
            // EMA update: new = alpha * x + (1 - alpha) * old
            self.value = self.alpha * x + (1.0 - self.alpha) * self.value;
        }
    }

    /// Get the current smoothed value
    pub fn get(&self) -> f32 {
        self.value
    }

    /// Check if the EMA has been initialized (received at least one sample)
    pub fn is_ready(&self) -> bool {
        self.initialized
    }

    /// Reset the EMA to the uninitialized state
    pub fn reset(&mut self) {
        self.value = 0.0;
        self.initialized = false;
    }

    /// Get the smoothing factor
    pub fn alpha(&self) -> f32 {
        self.alpha
    }

    /// Set a new smoothing factor
    pub fn set_alpha(&mut self, alpha: f32) {
        self.alpha = alpha.clamp(0.0, 1.0);
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_first_value_becomes_baseline() {
        let mut ema = Ema::new(0.1);
        assert!(!ema.is_ready());
        ema.update(42.0);
        assert!(ema.is_ready());
        assert_eq!(ema.get(), 42.0);
    }

    #[test]
    fn test_ema_smoothing() {
        let mut ema = Ema::new(0.1);
        ema.update(100.0);
        ema.update(200.0);
        // 0.1 * 200 + 0.9 * 100 = 20 + 90 = 110
        assert!((ema.get() - 110.0).abs() < 0.001);
    }

    #[test]
    fn test_high_alpha_fast_response() {
        let mut ema = Ema::new(0.9);
        ema.update(100.0);
        ema.update(200.0);
        // 0.9 * 200 + 0.1 * 100 = 180 + 10 = 190
        assert!((ema.get() - 190.0).abs() < 0.001);
    }

    #[test]
    fn test_reset() {
        let mut ema = Ema::new(0.1);
        ema.update(100.0);
        assert!(ema.is_ready());
        ema.reset();
        assert!(!ema.is_ready());
        assert_eq!(ema.get(), 0.0);
    }
}
crates/marine_salience/src/lib.rs
ADDED
@@ -0,0 +1,42 @@
//! # Marine Salience - O(1) Jitter-Based Authenticity Detection
//!
//! "Marines are not just jarheads - they are actually very intelligent"
//!
//! This crate provides a universal salience primitive that can detect the
//! "authenticity" of audio signals by measuring timing and amplitude jitter.
//!
//! ## Why "Marine"?
//! - Marines are stable and reliable under pressure
//! - Low jitter = authentic/stable signal
//! - High jitter = damaged/synthetic signal
//!
//! ## Use Cases
//! - **TTS Quality Validation** - Is synthesized speech authentic?
//! - **Prosody Extraction** - Extract 8D interpretable emotion vectors
//! - **Conversation Affect** - Track comfort level over sessions
//! - **Real-time Monitoring** - O(1) per-sample processing
//!
//! ## Core Insight
//! Human voice has NATURAL jitter patterns. Perfect smoothness = synthetic.
//! The Marine algorithm detects these patterns to distinguish authentic
//! from damaged or artificial audio.

#![cfg_attr(not(feature = "std"), no_std)]

pub mod config;
pub mod ema;
pub mod packet;
pub mod processor;

// Re-export main types
pub use config::MarineConfig;
pub use packet::SaliencePacket;
pub use processor::MarineProcessor;

/// Marine algorithm version
pub const VERSION: &str = env!("CARGO_PKG_VERSION");

/// Default jitter thresholds tuned for speech.
/// These values accommodate natural musical/speech variation.
pub const DEFAULT_JITTER_LOW: f32 = 0.02;  // Below = very stable
pub const DEFAULT_JITTER_HIGH: f32 = 0.60; // Above = heavily damaged
crates/marine_salience/src/packet.rs
ADDED
@@ -0,0 +1,122 @@
//! Salience packet - the output of Marine analysis
//!
//! Contains jitter measurements and quality scores for a detected peak.

#![cfg_attr(not(feature = "std"), no_std)]

/// Salience packet emitted on peak detection
///
/// Contains all the jitter and quality metrics for a single audio event.
/// These packets can be aggregated to form prosody vectors or quality scores.
#[derive(Debug, Clone, Copy, PartialEq)]
#[cfg_attr(feature = "std", derive(serde::Serialize, serde::Deserialize))]
pub struct SaliencePacket {
    /// Period jitter - timing instability between peaks.
    /// Lower = more stable/musical, higher = more chaotic.
    /// Range: 0.0+ (normalized difference from expected period)
    pub j_p: f32,

    /// Amplitude jitter - loudness instability.
    /// Lower = consistent volume, higher = erratic dynamics.
    /// Range: 0.0+ (normalized difference from expected amplitude)
    pub j_a: f32,

    /// Harmonic alignment score.
    /// 1.0 = perfectly voiced/harmonic, 0.0 = noise.
    /// For now this is simplified; can be enhanced with FFT.
    pub h_score: f32,

    /// Overall salience score (authenticity).
    /// 1.0 = perfect quality, 0.0 = heavily damaged.
    /// Computed from the inverse of combined jitter.
    pub s_score: f32,

    /// Local peak energy (amplitude squared).
    /// Represents loudness at this event.
    pub energy: f32,

    /// Sample index where this peak occurred.
    /// Useful for temporal analysis.
    pub sample_index: u64,
}

impl SaliencePacket {
    /// Create a new salience packet
    pub fn new(
        j_p: f32,
        j_a: f32,
        h_score: f32,
        s_score: f32,
        energy: f32,
        sample_index: u64,
    ) -> Self {
        Self {
            j_p,
            j_a,
            h_score,
            s_score,
            energy,
            sample_index,
        }
    }

    /// Get the combined jitter metric:
    /// the average of period and amplitude jitter
    pub fn combined_jitter(&self) -> f32 {
        (self.j_p + self.j_a) / 2.0
    }

    /// Check if this represents high-quality audio
    /// (low jitter, high salience)
    pub fn is_high_quality(&self, threshold: f32) -> bool {
        self.s_score >= threshold
    }

    /// Check if this indicates damaged/synthetic audio
    pub fn is_damaged(&self, jitter_threshold: f32) -> bool {
        self.combined_jitter() > jitter_threshold
    }
}

/// Special salience markers for non-peak events
#[derive(Debug, Clone, Copy, PartialEq)]
#[cfg_attr(feature = "std", derive(serde::Serialize, serde::Deserialize))]
pub enum SalienceMarker {
    /// Normal peak detected
    Peak(SaliencePacket),
    /// Fracture/gap detected (silence)
    Fracture,
    /// High noise floor detected
    Noise,
    /// Insufficient data for analysis
    Insufficient,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_combined_jitter() {
        let packet = SaliencePacket::new(0.1, 0.3, 1.0, 0.8, 0.5, 0);
        assert!((packet.combined_jitter() - 0.2).abs() < 0.001);
    }

    #[test]
    fn test_is_high_quality() {
        let good = SaliencePacket::new(0.01, 0.02, 1.0, 0.95, 0.5, 0);
        let bad = SaliencePacket::new(0.5, 0.6, 0.5, 0.3, 0.5, 0);

        assert!(good.is_high_quality(0.8));
        assert!(!bad.is_high_quality(0.8));
    }

    #[test]
    fn test_is_damaged() {
        let good = SaliencePacket::new(0.01, 0.02, 1.0, 0.95, 0.5, 0);
        let bad = SaliencePacket::new(0.5, 0.6, 0.5, 0.3, 0.5, 0);

        assert!(!good.is_damaged(0.3));
        assert!(bad.is_damaged(0.3));
    }
}
crates/marine_salience/src/processor.rs
ADDED
@@ -0,0 +1,334 @@
//! Core Marine processor - O(1) per-sample jitter detection
//!
//! The heart of the Marine algorithm. Processes audio samples one at a time,
//! detecting peaks and computing jitter metrics in constant time.
//!
//! "Marines are not just jarheads - they are actually very intelligent"

#![cfg_attr(not(feature = "std"), no_std)]

use crate::config::MarineConfig;
use crate::ema::Ema;
use crate::packet::{SalienceMarker, SaliencePacket};

/// Marine salience processor
///
/// Processes audio samples one at a time, detecting peaks and computing
/// jitter metrics. Designed for O(1) per-sample operation.
///
/// # Example
/// ```
/// use marine_salience::{MarineConfig, MarineProcessor};
///
/// let config = MarineConfig::speech_default(22050);
/// let mut processor = MarineProcessor::new(config);
///
/// // Process samples (e.g., from an audio buffer)
/// let samples = vec![0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5];
/// for sample in &samples {
///     if let Some(marker) = processor.process_sample(*sample) {
///         match marker {
///             marine_salience::packet::SalienceMarker::Peak(packet) => {
///                 println!("Peak detected! Salience: {:.2}", packet.s_score);
///             }
///             _ => {}
///         }
///     }
/// }
/// ```
pub struct MarineProcessor {
    /// Configuration parameters
    cfg: MarineConfig,

    /// Previous sample (t-2)
    prev2: f32,
    /// Previous sample (t-1)
    prev1: f32,
    /// Current sample index
    idx: u64,

    /// Sample index of the last detected peak
    last_peak_idx: u64,
    /// Amplitude of the last detected peak
    last_peak_amp: f32,

    /// EMA tracker for inter-peak periods
    ema_period: Ema,
    /// EMA tracker for peak amplitudes
    ema_amp: Ema,

    /// Number of peaks detected so far
    peak_count: u64,
}

impl MarineProcessor {
    /// Create a new Marine processor with the given configuration
    pub fn new(cfg: MarineConfig) -> Self {
        Self {
            cfg,
            prev2: 0.0,
            prev1: 0.0,
            idx: 0,
            last_peak_idx: 0,
            last_peak_amp: 0.0,
            ema_period: Ema::new(cfg.ema_period_alpha),
            ema_amp: Ema::new(cfg.ema_amp_alpha),
            peak_count: 0,
        }
    }

    /// Process a single audio sample - an O(1) operation
    ///
    /// Returns Some(SalienceMarker) when a peak is detected or a special
    /// condition occurs, None otherwise.
    ///
    /// # Arguments
    /// * `sample` - Audio sample value (typically -1.0 to 1.0)
    ///
    /// # Returns
    /// - `Some(Peak(packet))` - Peak detected with jitter metrics
    /// - `Some(Fracture)` - Silence/gap detected
    /// - `Some(Noise)` - High noise floor detected
    /// - `None` - No significant event at this sample
    pub fn process_sample(&mut self, sample: f32) -> Option<SalienceMarker> {
        let i = self.idx;
        self.idx += 1;

        // Pre-gating: ignore samples below the threshold
        if sample.abs() < self.cfg.clip_threshold {
            self.prev2 = self.prev1;
            self.prev1 = sample;
            return None;
        }

        // Peak detection: prev1 is a peak if prev2 < prev1 > sample
        // (simple local-maximum detection)
        let is_peak = i >= 2
            && self.prev1.abs() >= self.cfg.clip_threshold
            && self.prev1.abs() > self.prev2.abs()
            && self.prev1.abs() > sample.abs();

        let mut result = None;

        if is_peak {
            let peak_idx = i - 1;
            let amp = self.prev1.abs();
            let energy = amp * amp;

            // Calculate period (time since last peak)
            let period = if self.last_peak_idx == 0 {
                0.0
            } else {
                (peak_idx - self.last_peak_idx) as f32
            };

            // Only process if the period is within the valid range
            if period > self.cfg.min_period as f32 && period < self.cfg.max_period as f32 {
                if self.ema_period.is_ready() {
                    // Calculate jitter metrics
                    let jp = (period - self.ema_period.get()).abs() / self.ema_period.get();
                    let ja = (amp - self.ema_amp.get()).abs() / self.ema_amp.get();

                    // Harmonic score (simplified - TODO: FFT-based detection).
                    // For now, assume voiced content (h = 1.0).
                    // In production, this would check for harmonic structure.
                    let h = 1.0;

                    // Salience score: inverse of combined jitter.
                    // Higher jitter = lower salience.
                    let s = 1.0 / (1.0 + jp + ja);

                    result = Some(SalienceMarker::Peak(SaliencePacket::new(
                        jp, ja, h, s, energy, peak_idx,
                    )));
                }

                // Update EMAs with the new measurements
                self.ema_period.update(period);
                self.ema_amp.update(amp);
            }

            self.last_peak_idx = peak_idx;
            self.last_peak_amp = amp;
            self.peak_count += 1;
        }

        // Update sample history
        self.prev2 = self.prev1;
        self.prev1 = sample;

        result
    }

    /// Process a buffer of samples, collecting all salience packets
    ///
    /// More efficient than calling process_sample repeatedly when you
    /// have a full buffer available.
    ///
    /// # Arguments
    /// * `samples` - Buffer of audio samples
    ///
    /// # Returns
    /// Vector of salience packets for all detected peaks
    #[cfg(feature = "std")]
    pub fn process_buffer(&mut self, samples: &[f32]) -> Vec<SaliencePacket> {
        let mut packets = Vec::new();

        for &sample in samples {
            if let Some(SalienceMarker::Peak(packet)) = self.process_sample(sample) {
                packets.push(packet);
            }
        }

        packets
    }

    /// Reset processor state (start fresh)
    pub fn reset(&mut self) {
        self.prev2 = 0.0;
        self.prev1 = 0.0;
        self.idx = 0;
        self.last_peak_idx = 0;
        self.last_peak_amp = 0.0;
        self.ema_period.reset();
        self.ema_amp.reset();
        self.peak_count = 0;
    }

    /// Get the number of peaks detected so far
    pub fn peak_count(&self) -> u64 {
        self.peak_count
    }

    /// Get the current sample index
    pub fn current_index(&self) -> u64 {
        self.idx
    }

    /// Check if the processor has enough data for reliable jitter
    pub fn is_warmed_up(&self) -> bool {
        self.peak_count >= 3 && self.ema_period.is_ready()
    }

    /// Get the current expected period (from the EMA)
    pub fn expected_period(&self) -> Option<f32> {
        if self.ema_period.is_ready() {
            Some(self.ema_period.get())
        } else {
            None
        }
    }

    /// Get the current expected amplitude (from the EMA)
    pub fn expected_amplitude(&self) -> Option<f32> {
        if self.ema_amp.is_ready() {
            Some(self.ema_amp.get())
        } else {
            None
        }
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_peak_detection() {
        let config = MarineConfig::speech_default(22050);
        let mut processor = MarineProcessor::new(config);

        // Create a simple signal with peaks at samples 10, 20, 30, ...
        let mut samples = vec![0.0; 100];
        for i in (10..100).step_by(10) {
            samples[i] = 0.5; // Peak
            if i > 0 {
                samples[i - 1] = 0.3; // Rising edge
            }
            if i < 99 {
                samples[i + 1] = 0.3; // Falling edge
            }
        }

        let mut peak_count = 0;
        for sample in &samples {
|
| 256 |
+
if let Some(SalienceMarker::Peak(_)) = processor.process_sample(*sample) {
|
| 257 |
+
peak_count += 1;
|
| 258 |
+
}
|
| 259 |
+
}
|
| 260 |
+
|
| 261 |
+
// Should detect several peaks (not all due to period constraints)
|
| 262 |
+
assert!(peak_count > 0);
|
| 263 |
+
}
|
| 264 |
+
|
| 265 |
+
#[test]
|
| 266 |
+
fn test_jitter_calculation() {
|
| 267 |
+
let mut config = MarineConfig::speech_default(22050);
|
| 268 |
+
config.min_period = 5;
|
| 269 |
+
config.max_period = 20;
|
| 270 |
+
let mut processor = MarineProcessor::new(config);
|
| 271 |
+
|
| 272 |
+
// Create signal with consistent period of 10 samples
|
| 273 |
+
let mut detected_packets = vec![];
|
| 274 |
+
for cycle in 0..10 {
|
| 275 |
+
for i in 0..10 {
|
| 276 |
+
let sample = if i == 5 {
|
| 277 |
+
0.8 // Peak in middle
|
| 278 |
+
} else if i == 4 || i == 6 {
|
| 279 |
+
0.5 // Edges
|
| 280 |
+
} else {
|
| 281 |
+
0.01 // Just above threshold
|
| 282 |
+
};
|
| 283 |
+
|
| 284 |
+
if let Some(SalienceMarker::Peak(packet)) = processor.process_sample(sample) {
|
| 285 |
+
detected_packets.push(packet);
|
| 286 |
+
}
|
| 287 |
+
}
|
| 288 |
+
}
|
| 289 |
+
|
| 290 |
+
// With consistent periods, later packets should have low jitter
|
| 291 |
+
if detected_packets.len() > 3 {
|
| 292 |
+
let last = detected_packets.last().unwrap();
|
| 293 |
+
// Jitter should be relatively low for consistent signal
|
| 294 |
+
assert!(last.j_p < 0.5, "Period jitter too high: {}", last.j_p);
|
| 295 |
+
}
|
| 296 |
+
}
|
| 297 |
+
|
| 298 |
+
#[test]
|
| 299 |
+
fn test_reset() {
|
| 300 |
+
let config = MarineConfig::speech_default(22050);
|
| 301 |
+
let mut processor = MarineProcessor::new(config);
|
| 302 |
+
|
| 303 |
+
// Process some samples
|
| 304 |
+
for _ in 0..100 {
|
| 305 |
+
processor.process_sample(0.5);
|
| 306 |
+
}
|
| 307 |
+
assert!(processor.current_index() > 0);
|
| 308 |
+
|
| 309 |
+
// Reset and verify
|
| 310 |
+
processor.reset();
|
| 311 |
+
assert_eq!(processor.current_index(), 0);
|
| 312 |
+
assert_eq!(processor.peak_count(), 0);
|
| 313 |
+
assert!(!processor.is_warmed_up());
|
| 314 |
+
}
|
| 315 |
+
|
| 316 |
+
#[cfg(feature = "std")]
|
| 317 |
+
#[test]
|
| 318 |
+
fn test_process_buffer() {
|
| 319 |
+
let mut config = MarineConfig::speech_default(22050);
|
| 320 |
+
config.min_period = 5;
|
| 321 |
+
config.max_period = 50;
|
| 322 |
+
let mut processor = MarineProcessor::new(config);
|
| 323 |
+
|
| 324 |
+
// Generate test signal with peaks
|
| 325 |
+
let mut samples = Vec::new();
|
| 326 |
+
for _ in 0..20 {
|
| 327 |
+
samples.extend_from_slice(&[0.01, 0.3, 0.8, 0.3, 0.01]);
|
| 328 |
+
}
|
| 329 |
+
|
| 330 |
+
let packets = processor.process_buffer(&samples);
|
| 331 |
+
// Should detect multiple peaks
|
| 332 |
+
assert!(packets.len() > 0);
|
| 333 |
+
}
|
| 334 |
+
}
|
docs/Integrating Marine Algorithm into IndexTTS-Rust.md
ADDED
@@ -0,0 +1,450 @@
# **A Technical Report on the Integration of the Marine Salience Algorithm into the IndexTTS2-Rust Architecture**

## **Executive Summary**

This report details a comprehensive technical framework for the integration of the novel Marine Algorithm 1 into the existing IndexTTS-Rust project. The IndexTTS-Rust system is understood to be a Rust implementation of the IndexTTS2 architecture, a cascaded autoregressive (AR) Text-to-Speech (TTS) model detailed in the aaai2026.tex paper.1

The primary objective of this integration is to leverage the unique, time-domain salience detection capabilities of the Marine Algorithm (e.g., jitter analysis) 1 to significantly improve the quality, controllability, and emotional expressiveness of the synthesized speech.

The core of this strategy involves **replacing the Conformer-based emotion perceiver of the IndexTTS2 Text-to-Semantic (T2S) module** 1 with a new, lightweight, and prosodically-aware Rust module based on the Marine Algorithm. This report provides a full analysis of the architectural foundations, a detailed integration strategy, a complete Rust-level implementation guide, and an analysis of the training and inferential implications of this modification.

## **Part 1: Architectural Foundations: The IndexTTS2 Pipeline and the Marine Salience Primitive**

A successful integration requires a deep, functional understanding of the two systems being merged. This section deconstructs the IndexTTS2 architecture as the "host" system 1 and re-frames the Marine Algorithm 1 as the "implant" feature extractor.

### **1.1 Deconstruction of the IndexTTS2 Generative Pipeline**

The aaai2026.tex paper describes IndexTTS2 as a state-of-the-art, cascaded zero-shot TTS system.1 Its architecture is composed of three distinct, sequentially-trained modules:

1. **Text-to-Semantic (T2S) Module:** This is an autoregressive (AR) Transformer-based model. Its primary function is to convert a sequence of text inputs into a sequence of "semantic tokens." This module is the system's "brain," determining the content, rhythm, and prosody of the speech.
2. **Semantic-to-Mel (S2M) Module:** This is a non-autoregressive (NAR) model. It takes the discrete semantic tokens from the T2S module and converts them into a dense mel-spectrogram. This module functions as the system's "vocal tract," rendering the semantic instructions into a spectral representation. The paper notes this module "incorporate[s] GPT latent representations to significantly improve the stability of the generated speech".1
3. **Vocoder Module:** This is a pre-trained BigVGANv2 vocoder.1 Its sole function is to perform the final conversion from the mel-spectrogram (from S2M) into a raw audio waveform.

The critical component for this integration is the **T2S Conditioning Mechanism**. The IndexTTS2 T2S module's behavior is conditioned on two separate audio prompts, a design intended to achieve disentangled control 1:

* **Timbre Prompt:** This audio prompt is processed by a "speaker perceiver conditioner" to generate a speaker attribute vector, c. This vector defines *who* is speaking (i.e., the vocal identity).
* **Style Prompt:** This *separate* audio prompt is processed by a "Conformer-based emotion perceiver conditioner" to generate an emotion vector, e. This vector defines *how* they are speaking (i.e., the emotion, prosody, and rhythm).

The T2S Transformer then consumes these vectors, additively combined, as part of its input: [c + e, p, ..., E_text, ..., E_sem].1

A key architectural detail is the IndexTTS2 paper's explicit use of a **Gradient Reversal Layer (GRL)** "to eliminate emotion-irrelevant information" and achieve "speaker-emotion disentanglement".1 The presence of a GRL, an adversarial training technique, strongly implies that the "Conformer-based emotion perceiver" is *not* naturally adept at this separation. A general-purpose Conformer, when processing the style prompt, will inevitably encode both prosodic features (pitch, energy) and speaker-specific features (formants, timbre). The GRL is thus employed as an adversarial "patch" to force the e vector to be "ignorant" of the speaker. This reveals a complex, computationally-heavy, and potentially fragile point in the IndexTTS2 design—a weakness that the Marine Algorithm is perfectly suited to address.

### **1.2 The Marine Algorithm as a Superior Prosodic Feature Extractor**

The marine-Universal-Salience-algoritm.tex paper 1 introduces the Marine Algorithm as a "universal, modality-agnostic salience detector" that operates in the time domain with O(1) per-sample complexity. While its described applications are broad, its specific mechanics make it an ideal, purpose-built *prosody quantifier* for speech.

The algorithm's 5-step process (Pre-gating, Peak Detection, Jitter Computation, Harmonic Alignment, Salience Score) 1 is, in effect, a direct measurement of the suprasegmental features that define prosody:

* **Period Jitter ($J_p$):** Defined as $J_p = |T_i - \text{EMA}(T)|$, this metric quantifies the instability of the time between successive peaks (the fundamental period).1 In speech, this is a direct, time-domain correlate for *pitch instability*. High, structured $J_p$ (i.e., high jitter with a stable EMA) represents intentional prosodic features like vibrato, vocal fry, or creaky voice—all key carriers of emotion.
* **Amplitude Jitter ($J_a$):** Defined as $J_a = |A_i - \text{EMA}(A)|$, this metric quantifies the instability of peak amplitudes.1 In speech, this is a correlate for *amplitude shimmer* or "vocal roughness," which are strong cues for affective states such as arousal, stress, or anger.
* **Harmonic Alignment ($H$):** This check for integer-multiple relationships in peak spacing 1 directly measures the *purity* and *periodicity* of the tone. It quantifies the distinction between a clear, voiced, harmonic sound and a noisy, chaotic, or unvoiced signal (e.g., breathiness, whispering, or a scream).
* **Energy ($E$) and Peak Detection:** The algorithm's pre-gating ($\theta_c$) and peak detection steps inherently track the signal's energy and the *density* of glottal pulses, which correlate directly to loudness and fundamental frequency (pitch), respectively.

The algorithm's description as "biologically plausible" and analogous to cochlear/amygdalar filtering 1 is not merely conceptual. It signifies that the algorithm is *a priori* biased to extract the same low-level features that the human auditory system uses to perceive emotion and prosody. This makes it a far more "correct" feature extractor for this task than a generic, large-scale Conformer, which learns from statistical correlation rather than first principles. Furthermore, its O(1) complexity 1 makes it orders of magnitude more efficient than the Transformer-based Conformer it will replace.
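
To make the two jitter definitions concrete, here is the arithmetic on a toy case (the numbers are illustrative, not taken from either paper):

$$J_p = |T_i - \text{EMA}(T)|, \qquad J_a = |A_i - \text{EMA}(A)|$$

If the period EMA has settled at $\text{EMA}(T) = 100$ samples and the next peak arrives after $T_i = 110$ samples, then $J_p = 10$ samples; if the amplitude EMA sits at $\text{EMA}(A) = 0.50$ and the new peak measures $A_i = 0.42$, then $J_a = 0.08$. Repeated excursions of this kind against a *stable* EMA are the "structured jitter" pattern (vibrato, fry) described above, whereas excursions that drag the EMA itself around indicate unstructured noise.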
## **Part 2: Integration Strategy: Replacing the T2S Emotion Perceiver**

The integration path is now clear. The IndexTTS2 T2S module 1 requires a clean, disentangled prosody vector e. The original Conformer-based conditioner provides a "polluted" vector that must be "cleaned" by a GRL.1 The Marine Algorithm 1 is, by its very design, a *naturally disentangled* prosody extractor.

### **2.1 Formal Proposal: The MarineProsodyConditioner**

The formal integration strategy is as follows:

1. The "Conformer-based emotion perceiver conditioner" 1 is **removed** from the IndexTTS2 architecture.
2. A new, from-scratch Rust module, tentatively named the MarineProsodyConditioner, is **created**.
3. This new module's sole function is to accept the file path to the style_prompt audio, load its samples, and process them using a Rust implementation of the Marine Algorithm.1
4. It will aggregate the resulting time-series of salience data into a single, fixed-size feature vector, e', which will serve as the new "emotion vector."

### **2.2 Feature Vector Engineering: Defining the New e'**

The Marine Algorithm produces a *stream* of SaliencePackets, one for each detected peak.1 The T2S Transformer, however, requires a *single, fixed-size* conditioning vector.1 We must therefore define an aggregation strategy to distill this time-series into a descriptive statistical summary.

The proposed feature vector, the MarineProsodyVector (our new e'), will be an 8-dimensional vector composed of the mean and standard deviation of the algorithm's key outputs over the entire duration of the style prompt.

**Table 1: MarineProsodyVector Struct Definition**

This table defines the precise "interface" between the marine_salience crate and the indextts_rust crate.

| Field | Type | Description | Source |
| :---- | :---- | :---- | :---- |
| jp_mean | f32 | Mean Period Jitter ($J_p$). Correlates to average pitch instability. | 1 |
| jp_std | f32 | Std. Dev. of $J_p$. Correlates to *variance* in pitch instability. | 1 |
| ja_mean | f32 | Mean Amplitude Jitter ($J_a$). Correlates to average vocal roughness. | 1 |
| ja_std | f32 | Std. Dev. of $J_a$. Correlates to *variance* in vocal roughness. | 1 |
| h_mean | f32 | Mean Harmonic Alignment ($H$). Correlates to average tonal purity. | 1 |
| s_mean | f32 | Mean Salience Score ($S$). Correlates to overall signal "structuredness". | 1 |
| peak_density | f32 | Number of detected peaks per second. Correlates to fundamental frequency (F0/pitch). | 1 |
| energy_mean | f32 | Mean energy ($E$) of detected peaks. Correlates to loudness/amplitude. | 1 |

This small, 8-dimensional vector is dense, interpretable, and packed with prosodic information, in stark contrast to the opaque, high-dimensional, and entangled vector produced by the original Conformer.1
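
For reference, the aggregation behind Table 1 is plain summary statistics. Given the $N$ SaliencePackets emitted for a style prompt of duration $D$ seconds, each mean/std pair and the peak rate are computed as follows (this matches the arithmetic in the conditioner code of Part 3.4, which uses the population standard deviation):

$$\mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma_x = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu_x\right)^2}, \qquad \text{peak\_density} = \frac{N}{D}$$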
### **2.3 Theoretical Justification: The Synergistic Disentanglement**

This integration provides a profound architectural improvement by solving the speaker-style disentanglement problem more elegantly and efficiently than the original IndexTTS2 design.1

The central challenge in the original architecture is that the Conformer-based conditioner processes the *entire* signal, capturing both temporal features (pitch, which is prosody) and spectral features (formants, which define speaker identity). This "entanglement" necessitates the use of the adversarial GRL to "un-learn" the speaker information.1

The Marine Algorithm 1 fundamentally sidesteps this problem. Its design is based on **peak detection, spacing, and amplitude**.1 It is almost entirely *blind* to the complex spectral-envelope (formant) information that defines a speaker's unique timbre. It measures the *instability* of the fundamental frequency, not the F0 itself, and the *instability* of the amplitude, not the spectral shape.

Therefore, the MarineProsodyVector (e') is **naturally disentangled**. It is a *pure* representation of prosody, containing negligible speaker-identity information.

When this new e' vector is fed into the T2S model's input, [c + e', ...], the system receives two *orthogonal* conditioning vectors:

1. c (from the speaker perceiver 1): Contains the speaker's timbre (formants, etc.).
2. e' (from the MarineProsodyConditioner 1): Contains the speaker's prosody (jitter, rhythm, etc.).

This clean separation provides two major benefits:

1. **Superior Timbre Cloning:** The speaker vector c no longer has to "compete" with an "entangled" style vector e. The T2S model will receive a cleaner speaker signal, leading to more accurate zero-shot voice cloning.
2. **Superior Emotional Expression:** The style vector e' is a clean, simple, and interpretable signal. The T2S Transformer will be able to learn the mapping from (e.g.) jp_mean = 0.8 to "generate creaky semantic tokens" much more easily than from an opaque 512-dimensional Conformer embedding.

This change simplifies the T2S model's learning task, which should lead to faster convergence and higher final quality. The GRL 1 may become entirely unnecessary, further simplifying the training regime and stabilizing the model.

## **Part 3: Implementation Guide: An IndexTTS-Rust Integration**

This section provides a concrete, code-level guide for implementing the proposed integration.

### **3.1 Addressing the README.md Data Gap**

A critical limitation in preparing this analysis is the repeated failure to access the user-provided IndexTTS-Rust README.md file.2 This file contains the project's specific file structure, API definitions, and module layout.

To overcome this, this report will posit a **hypothetical yet idiomatic Rust project structure** based on the logical components described in the IndexTTS2 paper.1 All subsequent code examples will adhere to this structure. The project owner is expected to map these file paths and function names to their actual, private codebase.

### **3.2 Table 2: Hypothetical IndexTTS-Rust Project Structure**

The following workspace structure is assumed for all implementation examples.

```text
indextts_rust_workspace/
├── Cargo.toml            (Workspace root)
│
├── indextts_rust/        (The main application/library crate)
│   ├── Cargo.toml
│   └── src/
│       ├── main.rs       (Binary entry point)
│       ├── lib.rs        (Library entry point & API)
│       ├── error.rs      (Project-wide error types)
│       ├── audio.rs      (Audio I/O: e.g., fn load_wav_samples)
│       ├── vocoder.rs    (Wrapper for BigVGANv2 model)
│       ├── t2s/
│       │   ├── mod.rs    (T2S module definition)
│       │   ├── model.rs  (AR Transformer implementation)
│       │   └── conditioner.rs (Handles 'c' and 'e' vector generation)
│       └── s2m/
│           ├── mod.rs    (S2M module definition)
│           └── model.rs  (NAR model implementation)
│
└── marine_salience/      (The NEW crate for the Marine Algorithm)
    ├── Cargo.toml
    └── src/
        ├── lib.rs        (Public API: MarineProcessor, etc.)
        ├── config.rs     (MarineConfig struct)
        ├── processor.rs  (MarineProcessor struct and logic)
        ├── ema.rs        (EmaTracker helper struct)
        └── packet.rs     (SaliencePacket struct)
```

### **3.3 Crate Development: marine_salience**

A new, standalone Rust crate, marine_salience, should be created. This crate will encapsulate all logic for the Marine Algorithm 1, ensuring it is modular, testable, and reusable.

**Table 3: marine_salience Crate - Public API Definition**

| Struct / fn | Field / Signature | Type | Description |
| :---- | :---- | :---- | :---- |
| MarineConfig | clip_threshold | f32 | $\theta_c$, pre-gating sensitivity.1 |
| | ema_period_alpha | f32 | Smoothing factor for Period EMA. |
| | ema_amplitude_alpha | f32 | Smoothing factor for Amplitude EMA. |
| SaliencePacket | j_p | f32 | Period Jitter ($J_p$).1 |
| | j_a | f32 | Amplitude Jitter ($J_a$).1 |
| | h_score | f32 | Harmonic Alignment score ($H$).1 |
| | s_score | f32 | Final Salience Score ($S$).1 |
| | energy | f32 | Peak energy ($E$).1 |
| MarineProcessor | new(config: MarineConfig) | Self | Constructor. |
| | process_sample(&mut self, sample: f32, sample_idx: u64) | Option<SaliencePacket> | The O(1) processing function. |

**marine_salience/src/processor.rs (Implementation Sketch):**

The MarineProcessor struct will hold the state, including EmaTracker instances for period and amplitude, the last_peak_sample index, last_peak_amplitude, and the current_direction of the signal (e.g., +1 for rising, -1 for falling).

The process_sample function is the O(1) core, implementing the algorithm from 1 (a sketch of the EMA helper these steps rely on follows the list):

1. **Pre-gating:** Check if sample.abs() > config.clip_threshold.
2. **Peak Detection:** Track the signal's direction. A change from +1 (rising) to -1 (falling) signifies a peak at sample_idx - 1, as per the formula x(n-1) < x(n) > x(n+1).1
3. **Jitter Computation:** If a peak is detected at n:
   * Calculate current period T_i = n - self.last_peak_sample.
   * Calculate current amplitude A_i = sample_at(n).
   * Calculate J_p = |T_i - self.ema_period.value()|.1
   * Calculate J_a = |A_i - self.ema_amplitude.value()|.1
   * Update the EMAs: self.ema_period.update(T_i), self.ema_amplitude.update(A_i).
4. **Harmonic Alignment:** Perform the check for $H$.1
5. **Salience Score:** Compute $S = w_e E + w_j (1/J) + w_h H$.1
6. Update self.last_peak_sample = n, self.last_peak_amplitude = A_i.
7. Return Some(SaliencePacket { ... }).
8. If no peak is detected, return None.
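
The steps above lean on the small EMA helper (ema.rs in the layout of Table 2). Its contents are not reproduced in this report, so the following is a minimal sketch of one plausible shape, assuming a fixed-alpha update and a warm-up flag; the method names mirror the usage above but are otherwise assumptions, not a confirmed API.

```rust
/// Minimal EMA tracker sketch (hypothetical shape for marine_salience/src/ema.rs).
pub struct EmaTracker {
    alpha: f32,        // smoothing factor in (0, 1]
    value: f32,        // current smoothed estimate
    initialized: bool, // false until the first update
}

impl EmaTracker {
    pub fn new(alpha: f32) -> Self {
        Self { alpha, value: 0.0, initialized: false }
    }

    /// Standard EMA update: v <- alpha * x + (1 - alpha) * v.
    pub fn update(&mut self, x: f32) {
        if self.initialized {
            self.value = self.alpha * x + (1.0 - self.alpha) * self.value;
        } else {
            self.value = x; // seed with the first observation
            self.initialized = true;
        }
    }

    pub fn value(&self) -> f32 {
        self.value
    }

    pub fn is_ready(&self) -> bool {
        self.initialized
    }
}
```

The warm-up flag matters because the first jitter values are meaningless until the EMAs have absorbed at least one real period and amplitude.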
### **3.4 Modifying the indextts_rust Crate**

With the marine_salience crate complete, the indextts_rust crate can now be modified.

indextts_rust/Cargo.toml:
Add the new crate as a dependency:

```toml
[dependencies]
marine_salience = { path = "../marine_salience" }
# ... other dependencies (tch, burn, ndarray, etc.)
```

indextts_rust/src/t2s/conditioner.rs:
This is the central modification. The file responsible for generating the e vector is completely refactored.

```rust
// BEFORE: Original Conformer-based
//
// use tch::Tensor;
// use crate::audio::AudioData;
//
// // This struct holds the large, complex Conformer model
// pub struct ConformerEmotionPerceiver {
//     //... model weights...
// }
//
// impl ConformerEmotionPerceiver {
//     pub fn get_style_embedding(&self, audio: &AudioData) -> Result<Tensor, ModelError> {
//         // 1. Convert AudioData to mel-spectrogram tensor
//         // 2. Pass spectrogram through Conformer layers
//         // 3. (GRL logic is applied during training)
//         // 4. Return an opaque, high-dimensional 'e' vector
//         //    (e.g., )
//     }
// }

// AFTER: New MarineProsodyConditioner
//
use marine_salience::processor::{MarineProcessor, SaliencePacket};
use marine_salience::config::MarineConfig;
use crate::audio::load_wav_samples; // From hypothetical audio.rs
use std::path::Path;
use anyhow::Result;

// This is the struct defined in Table 1
#[derive(Debug, Clone)]
pub struct MarineProsodyVector {
    pub jp_mean: f32,
    pub jp_std: f32,
    pub ja_mean: f32,
    pub ja_std: f32,
    pub h_mean: f32,
    pub s_mean: f32,
    pub peak_density: f32,
    pub energy_mean: f32,
}

// This new struct and function replace the Conformer
pub struct MarineProsodyConditioner {
    config: MarineConfig,
}

impl MarineProsodyConditioner {
    pub fn new(config: MarineConfig) -> Self {
        Self { config }
    }

    pub fn get_marine_style_vector(&self, style_prompt_path: &Path, sample_rate: f32) -> Result<MarineProsodyVector> {
        // 1. Load audio samples
        // Assumes audio.rs provides this function
        let samples = load_wav_samples(style_prompt_path)?;
        let duration_sec = samples.len() as f32 / sample_rate;

        // 2. Instantiate and run the MarineProcessor
        let mut processor = MarineProcessor::new(self.config.clone());
        let mut packets = Vec::<SaliencePacket>::new();

        for (i, sample) in samples.iter().enumerate() {
            if let Some(packet) = processor.process_sample(*sample, i as u64) {
                packets.push(packet);
            }
        }

        if packets.is_empty() {
            return Err(anyhow::anyhow!("No peaks detected in style prompt."));
        }

        // 3. Aggregate packets into the final feature vector
        let num_packets = packets.len() as f32;

        let mut jp_mean = 0.0;
        let mut ja_mean = 0.0;
        let mut h_mean = 0.0;
        let mut s_mean = 0.0;
        let mut energy_mean = 0.0;

        for p in &packets {
            jp_mean += p.j_p;
            ja_mean += p.j_a;
            h_mean += p.h_score;
            s_mean += p.s_score;
            energy_mean += p.energy;
        }

        jp_mean /= num_packets;
        ja_mean /= num_packets;
        h_mean /= num_packets;
        s_mean /= num_packets;
        energy_mean /= num_packets;

        // Calculate standard deviation (variance)
        let mut jp_std = 0.0;
        let mut ja_std = 0.0;
        for p in &packets {
            jp_std += (p.j_p - jp_mean).powi(2);
            ja_std += (p.j_a - ja_mean).powi(2);
        }
        jp_std = (jp_std / num_packets).sqrt();
        ja_std = (ja_std / num_packets).sqrt();

        let peak_density = num_packets / duration_sec;

        Ok(MarineProsodyVector {
            jp_mean,
            jp_std,
            ja_mean,
            ja_std,
            h_mean,
            s_mean,
            peak_density,
            energy_mean,
        })
    }
}
```
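
For orientation, a call site for the new conditioner might look like the sketch below. MarineConfig::default() and the 22,050 Hz sample rate are placeholder assumptions (the report does not pin down either), so treat them as illustrative only.

```rust
use std::path::Path;

fn style_vector_example() -> anyhow::Result<()> {
    // Assumes MarineConfig implements Default; substitute real tuning values.
    let conditioner = MarineProsodyConditioner::new(MarineConfig::default());
    let e_prime = conditioner.get_marine_style_vector(Path::new("style_prompt.wav"), 22_050.0)?;
    println!(
        "jp_mean = {:.3}, peak_density = {:.1} peaks/s",
        e_prime.jp_mean, e_prime.peak_density
    );
    Ok(())
}
```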
### **3.5 Updating the T2S Model (indextts_rust/src/t2s/model.rs)**

This change is **breaking** and **mandatory**. The IndexTTS2 T2S model 1 was trained on a high-dimensional e vector (e.g., 512-dim). Our new e' vector is 8-dimensional. The T2S model's architecture must be modified to accept this.

The change will be in the T2S Transformer's input embedding layer, which projects the conditioning vectors into the model's main hidden dimension (e.g., 1024-dim).

**(Example using tch-rs or burn pseudo-code):**

```rust
// In src/t2s/model.rs
//
// pub struct T2S_Transformer {
//     ...
//     speaker_projector: nn::Linear,
//     style_projector: nn::Linear, // The layer to change
//     ...
// }
//
// impl T2S_Transformer {
//     pub fn new(config: &T2S_Config, vs: &nn::Path) -> Self {
//         ...
//         // BEFORE:
//         // let style_projector = nn::linear(
//         //     vs / "style_projector",
//         //     512, // Original Conformer 'e' dimension
//         //     config.hidden_dim,
//         //     Default::default()
//         // );
//
//         // AFTER:
//         let style_projector = nn::linear(
//             vs / "style_projector",
//             8, // New MarineProsodyVector 'e'' dimension
//             config.hidden_dim,
//             Default::default()
//         );
//         ...
//     }
// }
```

This change creates a new, untrained model. The S2M and Vocoder modules 1 can remain unchanged, but the T2S module must now be retrained.

## **Part 4: Training, Inference, and Qualitative Implications**

This architectural change has profound, positive implications for the entire system, from training to user-facing control.

### **4.1 Retraining the T2S Module**

The modification in Part 3.5 is a hard-fork of the model architecture; retraining the T2S module 1 is not optional.

**Training Plan:**

1. **Model:** The S2M and Vocoder modules 1 can be completely frozen. Only the T2S module with the new 8-dimensional style_projector (from 3.5) needs to be trained.
2. **Dataset Preprocessing:** The *entire* training dataset used for the original IndexTTS2 1 must be re-processed (a sketch of this pass follows the list).
   * For *every* audio file in the dataset, the MarineProsodyConditioner::get_marine_style_vector function (from 3.4) must be run *once*.
   * The resulting 8-dimensional MarineProsodyVector must be saved as the new "ground truth" style label for that utterance.
3. **Training:** The T2S module is now trained as described in the aaai2026.tex paper.1 During the training step, it will load the pre-computed MarineProsodyVector as the e' vector, which will be added to the c (speaker) vector and fed into the Transformer.
4. **Hypothesis:** This training run is expected to converge *faster* and to a *higher* qualitative ceiling. The model is no longer burdened by the complex, adversarial GRL-based disentanglement.1 It is instead learning a much simpler, more direct correlation between a clean prosody vector (e') and the target semantic token sequences.
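
A minimal sketch of step 2's preprocessing pass follows. The flat WAV directory, the assumed Default config, the fixed corpus sample rate, and the ".marine8" text-label format are all illustrative choices, not project requirements.

```rust
// Hypothetical one-off preprocessing pass for step 2 (names and layout are illustrative).
use std::fs;
use std::path::Path;

fn preprocess_dataset(wav_dir: &Path, out_dir: &Path) -> anyhow::Result<()> {
    let conditioner = MarineProsodyConditioner::new(MarineConfig::default()); // assumed Default impl
    for entry in fs::read_dir(wav_dir)? {
        let path = entry?.path();
        if path.extension().map_or(false, |e| e == "wav") {
            let v = conditioner.get_marine_style_vector(&path, 22_050.0)?; // assumed corpus rate
            // Persist the 8 values as the utterance's style label (format is a placeholder).
            let label = format!(
                "{} {} {} {} {} {} {} {}",
                v.jp_mean, v.jp_std, v.ja_mean, v.ja_std,
                v.h_mean, v.s_mean, v.peak_density, v.energy_mean
            );
            let out = out_dir.join(path.file_stem().unwrap()).with_extension("marine8");
            fs::write(out, label)?;
        }
    }
    Ok(())
}
```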
### **4.2 Inference-Time Control**

This integration unlocks a new, powerful mode of "synthetic" or "direct" prosody control, fulfilling the proposals implicit in the user's query.

* **Mode 1: Reference-Based (Standard):**
  * A user provides a style_prompt.wav.
  * The get_marine_style_vector function (from 3.4) is called.
  * The resulting MarineProsodyVector e' is fed into the T2S model.
  * This "copies" the prosody from the reference audio, just as the original IndexTTS2 1 intended, but with higher fidelity.
* **Mode 2: Synthetic-Control (New):**
  * The user provides *no* style prompt.
  * Instead, the user *directly constructs* the 8-dimensional MarineProsodyVector to achieve a desired effect. The application's UI could expose 8 sliders for these values.
  * **Example 1: "Agitated / Rough Voice"**
    * e' = MarineProsodyVector { jp_mean: 0.8, jp_std: 0.5, ja_mean: 0.7, ja_std: 0.4, ... }
  * **Example 2: "Stable / Monotone Voice"**
    * e' = MarineProsodyVector { jp_mean: 0.05, jp_std: 0.01, ja_mean: 0.05, ja_std: 0.01, ... }
  * **Example 3: "High-Pitch / High-Energy Voice"**
    * e' = MarineProsodyVector { peak_density: 300.0, energy_mean: 0.9, ... }

This provides a small, interpretable, and powerful "control panel" for prosody, a significant breakthrough in controllable TTS that was not possible with the original opaque Conformer embedding.1 A complete literal for Example 1 is sketched below.
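
Written out as a compiling struct literal, Example 1 might look as follows; the four values the prose elides ("...") are illustrative fill-ins, not recommended settings.

```rust
// Example 1 ("Agitated / Rough Voice") with every field populated.
fn agitated_preset() -> MarineProsodyVector {
    MarineProsodyVector {
        jp_mean: 0.8,
        jp_std: 0.5,
        ja_mean: 0.7,
        ja_std: 0.4,
        h_mean: 0.6,         // assumed: still largely voiced/tonal
        s_mean: 0.4,         // assumed: moderate overall structure
        peak_density: 180.0, // assumed: mid-range pitch, in peaks/second
        energy_mean: 0.7,    // assumed: raised loudness
    }
}
```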
### **4.3 Bridging to Downstream Fidelity (S2M)**

The benefits of this integration propagate through the entire cascade. The S2M module's quality is directly dependent on the quality of the semantic tokens it receives from T2S.1

The aaai2026.tex paper 1 states the S2M module uses "GPT latent representations to significantly improve the stability of the generated speech." This suggests the S2M is a powerful and stable *renderer*. However, a renderer is only as good as the instructions it receives.

In the original system, the S2M module likely received semantic tokens with "muddled" or "averaged-out" prosody, resulting from the T2S model's struggle with the entangled e vector. The S2M's "stability" 1 may have come at the *cost* of expressiveness, as it learned to smooth over inconsistent prosodic instructions.

With the new MarineProsodyConditioner, the T2S model will now produce semantic tokens that are *far more richly, explicitly, and accurately* encoded with prosodic intent. The S2M module's "GPT latents" 1 will receive a higher-fidelity, more consistent input signal. This creates a synergistic effect: the S2M's stable rendering capabilities 1 will now be applied to a *more expressive* set of instructions. The result is an end-to-end system that is *both* stable *and* highly expressive.

## **Part 5: Report Conclusions and Future Trajectories**

### **5.1 Summary of Improvements**

The integration framework detailed in this report achieves the project's goals by:

1. **Replacing** a computationally heavy, black-box Conformer 1 with a lightweight, O(1), biologically-plausible, and Rust-native MarineProcessor.1
2. **Solving** a core architectural problem in the IndexTTS2 design by providing a *naturally disentangled*, speaker-invariant prosody vector, which simplifies or obviates the need for the adversarial GRL.1
3. **Unlocking** a powerful "synthetic control" mode, allowing users to *directly* manipulate prosody at inference time via an 8-dimensional, interpretable control vector.
4. **Improving** end-to-end system quality by providing a cleaner, more explicit prosodic signal to the T2S module 1, which in turn provides a higher-fidelity semantic token stream to the S2M module.1

### **5.2 Future Trajectories**

This new architecture opens two significant avenues for future research.

**1. True Streaming Synthesis with Dynamic Conditioning**

The IndexTTS2 T2S module is autoregressive 1, and the Marine Algorithm is O(1) per-sample.1 This is a perfect combination for real-time applications.

A future version could implement a "Dynamic Conditioning" mode. In this mode, a MarineProcessor runs on a live microphone input (e.g., from the user) in a parallel thread. It continuously calculates the MarineProsodyVector over a short, sliding window (e.g., 500 ms). This e' vector is then *hot-swapped* into the T2S model's conditioning state *during* the autoregressive generation loop. The result would be a TTS model that mirrors the user's emotional prosody in real-time.
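
One possible shape for that tracking thread is sketched below, simplified to a tumbling window rather than a true sliding window. The channel plumbing is illustrative, and vector_from_samples is a hypothetical helper that would run a MarineProcessor over the buffer and aggregate the packets exactly as in Part 3.4.

```rust
use std::sync::mpsc::{Receiver, Sender};
use std::thread;

fn spawn_prosody_tracker(
    mic_rx: Receiver<Vec<f32>>,        // chunks of live microphone samples
    e_tx: Sender<MarineProsodyVector>, // consumed by the generation loop
) {
    thread::spawn(move || {
        const WINDOW_LEN: usize = 11_025; // ~500 ms at an assumed 22.05 kHz rate
        let mut window: Vec<f32> = Vec::with_capacity(WINDOW_LEN);
        while let Ok(chunk) = mic_rx.recv() {
            window.extend_from_slice(&chunk);
            if window.len() >= WINDOW_LEN {
                // Re-analyze only the most recent window, then discard it.
                if let Some(e_prime) = vector_from_samples(&window) { // hypothetical helper
                    let _ = e_tx.send(e_prime); // generator hot-swaps on its next step
                }
                window.clear();
            }
        }
    });
}
```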
**2. Active Quality Monitoring (Vocoder Feedback Loop)**

The Marine Algorithm is a "universal... salience detector" that distinguishes "structured signals from noise".1 This capability can be used as a quality metric for the vocoder's output.

An advanced implementation could create a feedback loop:

1. The BigVGANv2 vocoder 1 produces its output audio.
2. This audio is *immediately* fed *back* into a MarineProcessor.
3. The processor analyzes the output. The key insight from the Marine paper 1 is the use of the **Exponential Moving Average (EMA)**.
   * **Desired Prosody (e.g., vocal fry):** Will produce high $J_p$/$J_a$, but the $\text{EMA}(T)$ and $\text{EMA}(A)$ will remain *stable*. The algorithm will correctly identify this as a *structured* signal.
   * **Undesired Artifact (e.g., vocoder hiss, phase noise):** Will produce high $J_p$/$J_a$, but the $\text{EMA}(T)$ and $\text{EMA}(A)$ will become *unstable*. The algorithm will correctly identify this as *unstructured noise*.

This creates a quantitative, real-time metric for "output fidelity" that can distinguish desirable prosody from undesirable artifacts. This metric could be used to automatically flag or discard bad generations, or even as a reward function for a Reinforcement Learning (RL) agent tasked with fine-tuning the S2M or Vocoder modules.
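
One way to operationalize "EMA stability" as a score is to sample the period EMA after each detected peak and measure how much it wanders. The sketch below does exactly that; the normalization and the 1/(1+x) squashing are illustrative choices, not something either paper specifies.

```rust
/// Map a trace of EMA(T) readings (one per detected peak) to a (0, 1] fidelity
/// score: a structured signal keeps EMA(T) nearly constant, so variance is low
/// and the score stays near 1; wandering EMAs (noise) drive the score down.
fn output_fidelity(ema_period_trace: &[f32]) -> f32 {
    let n = ema_period_trace.len() as f32;
    if n < 2.0 {
        return 1.0; // too little evidence to penalize
    }
    let mean = ema_period_trace.iter().sum::<f32>() / n;
    let var = ema_period_trace.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    1.0 / (1.0 + var / mean.max(f32::EPSILON))
}
```

A generation whose score falls below some tuned threshold could then be flagged or regenerated automatically.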
#### **Works cited**

1. marine-Universal-Salience-algoritm.tex
2. uploaded: IndexTTS-Rust README.md (file could not be accessed)
examples/analyze_chris.rs
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:00940abda6dd597d7dacdbb97761fb0635d0dcc7dc30d5391fe159129008b03a
size 8470
examples/marine_test.rs
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d179d8f3adc5338e94ee2b92f366a36d03c32b51767223d1eefeb42ce9165374
size 10845
requirements.txt
DELETED
@@ -1,32 +0,0 @@
-accelerate==1.8.1
-descript-audiotools==0.7.2
-transformers==4.52.1
-tokenizers==0.21.0
-cn2an==0.5.22
-ffmpeg-python==0.2.0
-Cython==3.0.7
-g2p-en==2.1.0
-jieba==0.42.1
-json5==0.10.0
-keras==2.9.0
-numba==0.58.1
-numpy==1.26.2
-pandas==2.1.3
-matplotlib==3.8.2
-munch==4.0.0
-opencv-python==4.9.0.80
-tensorboard==2.9.1
-librosa==0.10.2.post1
-safetensors==0.5.2
-deepspeed==0.17.1
-modelscope==1.27.0
-omegaconf
-sentencepiece
-gradio
-tqdm
-textstat
-huggingface_hub
-spaces
-
-WeTextProcessing; platform_machine != "Darwin"
-wetext; platform_system == "Darwin"
src/audio/mod.rs
CHANGED
@@ -4,7 +4,7 @@

 mod dsp;
 mod io;
-mod mel;
+pub mod mel;
 mod resample;

 pub use dsp::{apply_preemphasis, dynamic_range_compression, dynamic_range_decompression, normalize_audio, normalize_audio_peak, apply_fade};
src/audio/resample.rs
CHANGED
@@ -31,7 +31,7 @@ pub fn resample(audio: &AudioData, target_sr: u32) -> Result<AudioData> {
     let mut input_buffer = vec![vec![0.0f32; input_frames_needed]];
     let mut output_samples = Vec::new();

-    let mut pos = 0;
+    let mut pos = 0;
     while pos < audio.samples.len() {
         // Fill input buffer
         let end = (pos + input_frames_needed).min(audio.samples.len());
src/lib.rs
CHANGED
@@ -27,6 +27,7 @@ pub mod config;
 pub mod error;
 pub mod model;
 pub mod pipeline;
+pub mod quality;
 pub mod text;
 pub mod vocoder;

@@ -34,6 +35,11 @@ pub use config::Config;
 pub use error::{Error, Result};
 pub use pipeline::IndexTTS;

+// Re-export Marine quality validation
+pub use quality::{
+    ComfortLevel, ConversationAffectSummary, MarineProsodyConditioner, MarineProsodyVector,
+};
+
 /// Library version
 pub const VERSION: &str = env!("CARGO_PKG_VERSION");
src/quality/affect.rs
ADDED
@@ -0,0 +1,445 @@
//! Conversation Affect Tracking - Session-level comfort analysis
//!
//! After a conversation, Aye can determine: "This felt uneasy / ok / happy"
//! based on Marine prosody patterns over time.
//!
//! The key insight: jitter patterns reveal emotional state
//! - Rising jitter over conversation = increasing tension
//! - Stable low jitter = calm exchange
//! - High energy + low jitter = positive/confident

use super::prosody::MarineProsodyVector;

/// Comfort level classification
///
/// After a conversation, this represents the overall emotional tone.
/// Used by Aye to self-assess: "How did I make you feel?"
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ComfortLevel {
    /// High jitter AND rising over session - tension/nervousness
    Uneasy,
    /// Stable but low energy, or mildly jittery but not escalating
    Neutral,
    /// Good energy, low/stable jitter - positive interaction
    Happy,
}

impl ComfortLevel {
    /// Convert to emoji representation
    pub fn emoji(&self) -> &'static str {
        match self {
            ComfortLevel::Uneasy => "😟",
            ComfortLevel::Neutral => "😐",
            ComfortLevel::Happy => "😊",
        }
    }

    /// Convert to descriptive string
    pub fn description(&self) -> &'static str {
        match self {
            ComfortLevel::Uneasy => "uneasy or tense",
            ComfortLevel::Neutral => "neutral or stable",
            ComfortLevel::Happy => "comfortable and positive",
        }
    }

    /// Convert to numeric score (-1 = uneasy, 0 = neutral, 1 = happy)
    pub fn score(&self) -> i8 {
        match self {
            ComfortLevel::Uneasy => -1,
            ComfortLevel::Neutral => 0,
            ComfortLevel::Happy => 1,
        }
    }
}

/// Conversation affect summary
///
/// Aggregates Marine prosody data over an entire conversation to
/// provide session-level emotional assessment.
#[derive(Debug, Clone)]
pub struct ConversationAffectSummary {
    /// Comfort level of the human speaker (if analyzed)
    pub human_state: Option<ComfortLevel>,
    /// Comfort level of Aye's output
    pub aye_state: ComfortLevel,
    /// Overall audio/structure quality (0..1)
    pub quality_score: f32,
    /// Number of utterances analyzed
    pub utterance_count: usize,
    /// Session duration in seconds
    pub duration_seconds: f32,
    /// Mean prosody statistics
    pub mean_prosody: MarineProsodyVector,
    /// Jitter trend (positive = rising, negative = falling)
    pub jitter_trend: f32,
    /// Energy trend (positive = rising, negative = falling)
    pub energy_trend: f32,
}

impl ConversationAffectSummary {
    /// Generate Aye's self-assessment message
    pub fn aye_assessment(&self) -> String {
        let emoji = self.aye_state.emoji();
        let desc = self.aye_state.description();

        let quality_desc = if self.quality_score > 0.8 {
            "very good"
        } else if self.quality_score > 0.6 {
            "good"
        } else if self.quality_score > 0.4 {
            "moderate"
        } else {
            "low"
        };

        format!(
            "{} Aye thinks this conversation felt {}. Audio quality was {} ({:.0}%). \
             {} {} utterances over {:.1} seconds.",
            emoji,
            desc,
            quality_desc,
            self.quality_score * 100.0,
            if self.jitter_trend > 0.05 {
                "Tension seemed to increase."
            } else if self.jitter_trend < -0.05 {
                "Tension seemed to decrease."
            } else {
                "Emotional tone stayed consistent."
            },
            self.utterance_count,
            self.duration_seconds
        )
    }

    /// Generate prompt for asking human for feedback
    pub fn feedback_prompt(&self) -> String {
        format!(
            "Aye would like to improve. How did this conversation make you feel?\n\
             A) Uneasy or tense 😟\n\
             B) Neutral or okay 😐\n\
             C) Comfortable and positive 😊\n\n\
             Aye's self-assessment: {} ({})",
            self.aye_state.emoji(),
            self.aye_state.description()
        )
    }
}

/// Conversation affect analyzer
///
/// Collects prosody vectors over a conversation and computes
/// session-level emotional state.
pub struct ConversationAffectAnalyzer {
    /// Collected prosody vectors
    utterances: Vec<MarineProsodyVector>,
    /// Total audio duration
    total_duration_seconds: f32,
    /// Configuration thresholds
    config: AffectAnalyzerConfig,
}

/// Configuration for affect classification
#[derive(Debug, Clone, Copy)]
pub struct AffectAnalyzerConfig {
    /// Threshold for "high" combined jitter
    pub high_jitter_threshold: f32,
    /// Threshold for "rising" jitter trend
    pub rising_jitter_threshold: f32,
    /// Threshold for "high" energy (happy indicator)
    pub high_energy_threshold: f32,
}

impl Default for AffectAnalyzerConfig {
    fn default() -> Self {
        Self {
            high_jitter_threshold: 0.4,
            rising_jitter_threshold: 0.1,
            high_energy_threshold: 0.5,
        }
    }
}

impl ConversationAffectAnalyzer {
    /// Create new analyzer with default config
    pub fn new() -> Self {
        Self {
            utterances: Vec::new(),
            total_duration_seconds: 0.0,
            config: AffectAnalyzerConfig::default(),
        }
    }

    /// Create with custom configuration
    pub fn with_config(config: AffectAnalyzerConfig) -> Self {
        Self {
            utterances: Vec::new(),
            total_duration_seconds: 0.0,
            config,
        }
    }

    /// Add an utterance's prosody to the conversation
    pub fn add_utterance(&mut self, prosody: MarineProsodyVector, duration_seconds: f32) {
        self.utterances.push(prosody);
        self.total_duration_seconds += duration_seconds;
    }

    /// Reset analyzer for new conversation
    pub fn reset(&mut self) {
        self.utterances.clear();
        self.total_duration_seconds = 0.0;
    }

    /// Analyze conversation and produce affect summary
    pub fn analyze(&self) -> Option<ConversationAffectSummary> {
        if self.utterances.is_empty() {
            return None;
        }

        let n = self.utterances.len() as f32;

        // Calculate mean prosody
        let mut mean_prosody = MarineProsodyVector::zeros();
        for p in &self.utterances {
            mean_prosody.jp_mean += p.jp_mean;
            mean_prosody.jp_std += p.jp_std;
            mean_prosody.ja_mean += p.ja_mean;
            mean_prosody.ja_std += p.ja_std;
            mean_prosody.h_mean += p.h_mean;
            mean_prosody.s_mean += p.s_mean;
            mean_prosody.peak_density += p.peak_density;
            mean_prosody.energy_mean += p.energy_mean;
        }
        mean_prosody.jp_mean /= n;
        mean_prosody.jp_std /= n;
        mean_prosody.ja_mean /= n;
        mean_prosody.ja_std /= n;
        mean_prosody.h_mean /= n;
        mean_prosody.s_mean /= n;
        mean_prosody.peak_density /= n;
        mean_prosody.energy_mean /= n;

        // Calculate trends (first vs last)
        let jitter_trend = if self.utterances.len() >= 2 {
            let first = self.utterances.first().unwrap().combined_jitter();
            let last = self.utterances.last().unwrap().combined_jitter();
            last - first
        } else {
            0.0
        };

        let energy_trend = if self.utterances.len() >= 2 {
            let first = self.utterances.first().unwrap().energy_mean;
            let last = self.utterances.last().unwrap().energy_mean;
            last - first
        } else {
            0.0
        };

        // Classify comfort level
        let aye_state = self.classify_comfort(
            mean_prosody.combined_jitter(),
            jitter_trend,
            mean_prosody.energy_mean,
        );

        let quality_score = mean_prosody.s_mean;

        Some(ConversationAffectSummary {
            human_state: None, // Would require analyzing human audio
            aye_state,
            quality_score,
            utterance_count: self.utterances.len(),
            duration_seconds: self.total_duration_seconds,
            mean_prosody,
            jitter_trend,
            energy_trend,
        })
    }

    /// Classify comfort level based on jitter and energy patterns
    fn classify_comfort(
        &self,
        mean_jitter: f32,
        trend_jitter: f32,
        mean_energy: f32,
    ) -> ComfortLevel {
        let high_jitter = mean_jitter > self.config.high_jitter_threshold;
        let rising_jitter = trend_jitter > self.config.rising_jitter_threshold;

        if high_jitter && rising_jitter {
            // Jitter is high AND getting worse = tension/unease
            ComfortLevel::Uneasy
        } else if mean_energy > self.config.high_energy_threshold && !high_jitter {
            // Good energy with stable jitter = positive/happy
            ComfortLevel::Happy
        } else {
            // In-between: stable but low energy, or slightly jittery but stable
            ComfortLevel::Neutral
        }
    }

    /// Get number of utterances collected
    pub fn utterance_count(&self) -> usize {
        self.utterances.len()
    }

    /// Get total duration
    pub fn total_duration(&self) -> f32 {
        self.total_duration_seconds
    }
}

impl Default for ConversationAffectAnalyzer {
    fn default() -> Self {
        Self::new()
    }
}

#[cfg(test)]
mod tests {
|
| 302 |
+
use super::*;
|
| 303 |
+
|
| 304 |
+
#[test]
|
| 305 |
+
fn test_comfort_level_descriptions() {
|
| 306 |
+
assert_eq!(ComfortLevel::Uneasy.emoji(), "😟");
|
| 307 |
+
assert_eq!(ComfortLevel::Neutral.emoji(), "😐");
|
| 308 |
+
assert_eq!(ComfortLevel::Happy.emoji(), "😊");
|
| 309 |
+
|
| 310 |
+
assert_eq!(ComfortLevel::Uneasy.score(), -1);
|
| 311 |
+
assert_eq!(ComfortLevel::Neutral.score(), 0);
|
| 312 |
+
assert_eq!(ComfortLevel::Happy.score(), 1);
|
| 313 |
+
}
|
| 314 |
+
|
| 315 |
+
#[test]
|
| 316 |
+
fn test_analyzer_empty_conversation() {
|
| 317 |
+
let analyzer = ConversationAffectAnalyzer::new();
|
| 318 |
+
assert!(analyzer.analyze().is_none());
|
| 319 |
+
}
|
| 320 |
+
|
| 321 |
+
#[test]
|
| 322 |
+
fn test_analyzer_single_utterance() {
|
| 323 |
+
let mut analyzer = ConversationAffectAnalyzer::new();
|
| 324 |
+
let prosody = MarineProsodyVector {
|
| 325 |
+
jp_mean: 0.1,
|
| 326 |
+
jp_std: 0.05,
|
| 327 |
+
ja_mean: 0.1,
|
| 328 |
+
ja_std: 0.05,
|
| 329 |
+
h_mean: 1.0,
|
| 330 |
+
s_mean: 0.8,
|
| 331 |
+
peak_density: 50.0,
|
| 332 |
+
energy_mean: 0.6,
|
| 333 |
+
};
|
| 334 |
+
analyzer.add_utterance(prosody, 2.0);
|
| 335 |
+
|
| 336 |
+
let summary = analyzer.analyze().unwrap();
|
| 337 |
+
assert_eq!(summary.utterance_count, 1);
|
| 338 |
+
assert_eq!(summary.duration_seconds, 2.0);
|
| 339 |
+
}
|
| 340 |
+
|
| 341 |
+
#[test]
|
| 342 |
+
fn test_uneasy_classification() {
|
| 343 |
+
let mut analyzer = ConversationAffectAnalyzer::new();
|
| 344 |
+
|
| 345 |
+
// First utterance: moderate jitter
|
| 346 |
+
analyzer.add_utterance(
|
| 347 |
+
MarineProsodyVector {
|
| 348 |
+
jp_mean: 0.3,
|
| 349 |
+
jp_std: 0.1,
|
| 350 |
+
ja_mean: 0.3,
|
| 351 |
+
ja_std: 0.1,
|
| 352 |
+
h_mean: 1.0,
|
| 353 |
+
s_mean: 0.5,
|
| 354 |
+
peak_density: 50.0,
|
| 355 |
+
energy_mean: 0.3,
|
| 356 |
+
},
|
| 357 |
+
1.0,
|
| 358 |
+
);
|
| 359 |
+
|
| 360 |
+
// Second utterance: HIGH jitter (rising trend)
|
| 361 |
+
analyzer.add_utterance(
|
| 362 |
+
MarineProsodyVector {
|
| 363 |
+
jp_mean: 0.6,
|
| 364 |
+
jp_std: 0.2,
|
| 365 |
+
ja_mean: 0.5,
|
| 366 |
+
ja_std: 0.2,
|
| 367 |
+
h_mean: 0.8,
|
| 368 |
+
s_mean: 0.3,
|
| 369 |
+
peak_density: 60.0,
|
| 370 |
+
energy_mean: 0.4,
|
| 371 |
+
},
|
| 372 |
+
1.0,
|
| 373 |
+
);
|
| 374 |
+
|
| 375 |
+
let summary = analyzer.analyze().unwrap();
|
| 376 |
+
assert_eq!(summary.aye_state, ComfortLevel::Uneasy);
|
| 377 |
+
assert!(summary.jitter_trend > 0.0); // Rising jitter
|
| 378 |
+
}
|
| 379 |
+
|
| 380 |
+
#[test]
|
| 381 |
+
fn test_happy_classification() {
|
| 382 |
+
let mut analyzer = ConversationAffectAnalyzer::new();
|
| 383 |
+
|
| 384 |
+
// High energy, low jitter = happy
|
| 385 |
+
analyzer.add_utterance(
|
| 386 |
+
MarineProsodyVector {
|
| 387 |
+
jp_mean: 0.1,
|
| 388 |
+
jp_std: 0.05,
|
| 389 |
+
ja_mean: 0.1,
|
| 390 |
+
ja_std: 0.05,
|
| 391 |
+
h_mean: 1.0,
|
| 392 |
+
s_mean: 0.9,
|
| 393 |
+
peak_density: 80.0,
|
| 394 |
+
energy_mean: 0.7,
|
| 395 |
+
},
|
| 396 |
+
2.0,
|
| 397 |
+
);
|
| 398 |
+
|
| 399 |
+
let summary = analyzer.analyze().unwrap();
|
| 400 |
+
assert_eq!(summary.aye_state, ComfortLevel::Happy);
|
| 401 |
+
}
|
| 402 |
+
|
| 403 |
+
#[test]
|
| 404 |
+
fn test_neutral_classification() {
|
| 405 |
+
let mut analyzer = ConversationAffectAnalyzer::new();
|
| 406 |
+
|
| 407 |
+
// Low energy, moderate jitter = neutral
|
| 408 |
+
analyzer.add_utterance(
|
| 409 |
+
MarineProsodyVector {
|
| 410 |
+
jp_mean: 0.2,
|
| 411 |
+
jp_std: 0.1,
|
| 412 |
+
ja_mean: 0.2,
|
| 413 |
+
ja_std: 0.1,
|
| 414 |
+
h_mean: 1.0,
|
| 415 |
+
s_mean: 0.7,
|
| 416 |
+
peak_density: 40.0,
|
| 417 |
+
energy_mean: 0.3,
|
| 418 |
+
},
|
| 419 |
+
1.5,
|
| 420 |
+
);
|
| 421 |
+
|
| 422 |
+
let summary = analyzer.analyze().unwrap();
|
| 423 |
+
assert_eq!(summary.aye_state, ComfortLevel::Neutral);
|
| 424 |
+
}
|
| 425 |
+
|
| 426 |
+
#[test]
|
| 427 |
+
fn test_aye_assessment_message() {
|
| 428 |
+
let summary = ConversationAffectSummary {
|
| 429 |
+
human_state: None,
|
| 430 |
+
aye_state: ComfortLevel::Happy,
|
| 431 |
+
quality_score: 0.85,
|
| 432 |
+
utterance_count: 5,
|
| 433 |
+
duration_seconds: 30.0,
|
| 434 |
+
mean_prosody: MarineProsodyVector::zeros(),
|
| 435 |
+
jitter_trend: -0.1,
|
| 436 |
+
energy_trend: 0.2,
|
| 437 |
+
};
|
| 438 |
+
|
| 439 |
+
let message = summary.aye_assessment();
|
| 440 |
+
assert!(message.contains("😊"));
|
| 441 |
+
assert!(message.contains("comfortable"));
|
| 442 |
+
assert!(message.contains("85%"));
|
| 443 |
+
assert!(message.contains("5 utterances"));
|
| 444 |
+
}
|
| 445 |
+
}
|
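The analyzer above is meant to be fed once per synthesized turn and queried at session end. A minimal driver sketch of that pattern, assuming the crate is exposed as `indextts` with these module paths (an inference from this commit's file layout, not a published API); the audio buffers are placeholders, and the per-utterance vectors come from the `MarineProsodyConditioner` in `src/quality/prosody.rs` below:

```rust
// Sketch only: the `indextts` crate name and module paths are assumptions
// drawn from this commit's file layout.
use indextts::quality::affect::ConversationAffectAnalyzer;
use indextts::quality::MarineProsodyConditioner;

fn main() {
    let sample_rate = 22050u32;
    let conditioner = MarineProsodyConditioner::new(sample_rate);
    let mut analyzer = ConversationAffectAnalyzer::new();

    // Placeholder: one buffer per synthesized reply in a conversation.
    let audio_turns: Vec<Vec<f32>> = Vec::new();

    for samples in &audio_turns {
        let seconds = samples.len() as f32 / sample_rate as f32;
        // `from_samples` only errors on an empty buffer; silence yields zeros.
        if let Ok(prosody) = conditioner.from_samples(samples) {
            analyzer.add_utterance(prosody, seconds);
        }
    }

    // `analyze` returns None until at least one utterance was recorded.
    if let Some(summary) = analyzer.analyze() {
        println!("{}", summary.aye_assessment());
        println!("{}", summary.feedback_prompt());
    }

    // Start fresh for the next session.
    analyzer.reset();
}
```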
src/quality/mod.rs
ADDED

@@ -0,0 +1,12 @@
//! Quality validation module using Marine salience
//!
//! Provides TTS output validation, prosody extraction, and conversation
//! affect tracking using the Marine algorithm.
//!
//! "Marines are not just jarheads - they are actually very intelligent"

pub mod prosody;
pub mod affect;

pub use prosody::{MarineProsodyConditioner, MarineProsodyVector};
pub use affect::{ComfortLevel, ConversationAffectSummary};
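Note the re-export surface: the conditioner, vector, `ComfortLevel`, and summary are lifted to the module root, while `ConversationAffectAnalyzer` and `TTSQualityReport` stay behind their submodules. An import sketch, with `indextts` as an assumed crate name:

```rust
// Import sketch; `indextts` as the crate name is an assumption.
// Lifted to the module root by the `pub use` lines above:
use indextts::quality::{ComfortLevel, ConversationAffectSummary};
use indextts::quality::{MarineProsodyConditioner, MarineProsodyVector};
// Not re-exported, so reached through their submodules:
use indextts::quality::affect::ConversationAffectAnalyzer;
use indextts::quality::prosody::TTSQualityReport;
```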
src/quality/prosody.rs
ADDED

@@ -0,0 +1,421 @@
//! Marine Prosody Conditioner - Extract 8D interpretable emotion vectors
//!
//! Uses Marine salience to extract prosodic features from reference audio.
//! These features are interpretable and can be directly edited for control.
//!
//! The 8D vector captures:
//! 1. Period jitter (mean & std) - pitch stability
//! 2. Amplitude jitter (mean & std) - roughness/strain
//! 3. Harmonic alignment - voiced vs noisy
//! 4. Overall salience - authenticity score
//! 5. Peak density - speech rate/intensity
//! 6. Energy - loudness

use crate::error::{Error, Result};

/// 8-dimensional prosody vector extracted from audio
///
/// These features capture the "emotional signature" of speech:
/// - Low jitter + high energy = confident/happy
/// - High jitter + low energy = nervous/uneasy
/// - Stable patterns = calm, unstable = agitated
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MarineProsodyVector {
    /// Mean period jitter (pitch stability)
    /// Lower = more stable pitch, Higher = more variation
    pub jp_mean: f32,

    /// Standard deviation of period jitter
    /// Captures consistency of pitch patterns
    pub jp_std: f32,

    /// Mean amplitude jitter (volume stability)
    /// Lower = consistent volume, Higher = erratic
    pub ja_mean: f32,

    /// Standard deviation of amplitude jitter
    /// Captures volume pattern consistency
    pub ja_std: f32,

    /// Mean harmonic alignment score
    /// 1.0 = perfectly voiced, 0.0 = noise
    pub h_mean: f32,

    /// Mean overall salience score
    /// Overall authenticity/quality rating
    pub s_mean: f32,

    /// Peak density (peaks per second)
    /// Related to speech rate and intensity
    pub peak_density: f32,

    /// Mean energy level
    /// Average loudness of detected peaks
    pub energy_mean: f32,
}

impl MarineProsodyVector {
    /// Create zero vector (baseline)
    pub fn zeros() -> Self {
        Self {
            jp_mean: 0.0,
            jp_std: 0.0,
            ja_mean: 0.0,
            ja_std: 0.0,
            h_mean: 1.0,
            s_mean: 1.0,
            peak_density: 0.0,
            energy_mean: 0.0,
        }
    }

    /// Convert to f32 array for neural network input
    pub fn to_array(&self) -> [f32; 8] {
        [
            self.jp_mean,
            self.jp_std,
            self.ja_mean,
            self.ja_std,
            self.h_mean,
            self.s_mean,
            self.peak_density,
            self.energy_mean,
        ]
    }

    /// Create from f32 array
    pub fn from_array(arr: [f32; 8]) -> Self {
        Self {
            jp_mean: arr[0],
            jp_std: arr[1],
            ja_mean: arr[2],
            ja_std: arr[3],
            h_mean: arr[4],
            s_mean: arr[5],
            peak_density: arr[6],
            energy_mean: arr[7],
        }
    }

    /// Get combined jitter (average of period and amplitude)
    pub fn combined_jitter(&self) -> f32 {
        (self.jp_mean + self.ja_mean) / 2.0
    }

    /// Estimate emotional valence from prosody
    /// Returns value from -1.0 (negative) to 1.0 (positive)
    pub fn estimate_valence(&self) -> f32 {
        // High energy + low jitter = positive
        // Low energy + high jitter = negative
        let jitter_factor = 1.0 / (1.0 + self.combined_jitter());
        let energy_factor = self.energy_mean.sqrt();

        // Combine factors, normalize to -1..1 range
        (jitter_factor * energy_factor * 2.0 - 1.0).clamp(-1.0, 1.0)
    }

    /// Estimate arousal/intensity level
    /// Returns value from 0.0 (calm) to 1.0 (excited)
    pub fn estimate_arousal(&self) -> f32 {
        // High peak density + high energy + some jitter variance = high arousal
        let density_factor = (self.peak_density / 100.0).clamp(0.0, 1.0);
        let energy_factor = self.energy_mean.sqrt();
        let variance_factor = (self.jp_std + self.ja_std).clamp(0.0, 1.0);

        ((density_factor + energy_factor + variance_factor) / 3.0).clamp(0.0, 1.0)
    }
}

impl Default for MarineProsodyVector {
    fn default() -> Self {
        Self::zeros()
    }
}

/// Marine-based prosody conditioner for TTS
///
/// Replaces heavy Conformer-style extractors with lightweight, interpretable
/// Marine salience features. This gives you:
/// - 8D interpretable emotion vector
/// - Direct editability for control
/// - Biologically plausible processing
/// - O(n) linear time extraction
pub struct MarineProsodyConditioner {
    sample_rate: u32,
    jitter_low: f32,
    jitter_high: f32,
    min_period: u32,
    max_period: u32,
    ema_alpha: f32,
}

impl MarineProsodyConditioner {
    /// Create new prosody conditioner for given sample rate
    pub fn new(sample_rate: u32) -> Self {
        // F0 range: ~60Hz (low male) to ~4kHz (includes harmonics)
        let min_period = sample_rate / 4000;
        let max_period = sample_rate / 60;

        Self {
            sample_rate,
            jitter_low: 0.02,
            jitter_high: 0.60,
            min_period,
            max_period,
            ema_alpha: 0.01,
        }
    }

    /// Extract prosody vector from audio samples
    ///
    /// Analyzes the audio to produce an 8D prosody vector capturing
    /// the emotional/stylistic characteristics of the speech.
    ///
    /// # Arguments
    /// * `samples` - Audio samples (typically -1.0 to 1.0 range)
    ///
    /// # Returns
    /// * `Ok(MarineProsodyVector)` - Extracted prosody features
    /// * `Err` - If insufficient peaks detected
    pub fn from_samples(&self, samples: &[f32]) -> Result<MarineProsodyVector> {
        if samples.is_empty() {
            return Err(Error::Audio("Empty audio buffer".into()));
        }

        // Detect peaks and collect jitter measurements
        let mut peaks: Vec<PeakInfo> = Vec::new();
        let clip_threshold = 1e-3;

        // Simple peak detection
        for i in 1..samples.len().saturating_sub(1) {
            let prev = samples[i - 1].abs();
            let curr = samples[i].abs();
            let next = samples[i + 1].abs();

            if curr > prev && curr > next && curr > clip_threshold {
                peaks.push(PeakInfo {
                    index: i,
                    amplitude: curr,
                });
            }
        }

        if peaks.len() < 3 {
            // Not enough peaks for meaningful analysis
            return Ok(MarineProsodyVector::zeros());
        }

        // Calculate inter-peak periods and jitter
        let mut periods: Vec<f32> = Vec::new();
        let mut amplitudes: Vec<f32> = Vec::new();
        let mut jp_values: Vec<f32> = Vec::new();
        let mut ja_values: Vec<f32> = Vec::new();

        // Use EMA for tracking
        let mut ema_period = 0.0f32;
        let mut ema_amp = 0.0f32;
        let mut ema_initialized = false;

        for i in 1..peaks.len() {
            let period = (peaks[i].index - peaks[i - 1].index) as f32;
            let amp = peaks[i].amplitude;

            // Check if period is in valid range
            if period > self.min_period as f32 && period < self.max_period as f32 {
                periods.push(period);
                amplitudes.push(amp);

                if !ema_initialized {
                    ema_period = period;
                    ema_amp = amp;
                    ema_initialized = true;
                } else {
                    // Calculate jitter
                    let jp = (period - ema_period).abs() / ema_period;
                    let ja = (amp - ema_amp).abs() / ema_amp;
                    jp_values.push(jp);
                    ja_values.push(ja);

                    // Update EMA
                    ema_period = self.ema_alpha * period + (1.0 - self.ema_alpha) * ema_period;
                    ema_amp = self.ema_alpha * amp + (1.0 - self.ema_alpha) * ema_amp;
                }
            }
        }

        if jp_values.is_empty() {
            return Ok(MarineProsodyVector::zeros());
        }

        // Compute statistics
        let n = jp_values.len() as f32;
        let duration_sec = samples.len() as f32 / self.sample_rate as f32;

        // Mean calculations
        let jp_mean = jp_values.iter().sum::<f32>() / n;
        let ja_mean = ja_values.iter().sum::<f32>() / n;
        let energy_mean = amplitudes.iter().map(|a| a * a).sum::<f32>() / amplitudes.len() as f32;

        // Std calculations
        let jp_var = jp_values.iter().map(|x| (x - jp_mean).powi(2)).sum::<f32>() / n;
        let ja_var = ja_values.iter().map(|x| (x - ja_mean).powi(2)).sum::<f32>() / n;
        let jp_std = jp_var.sqrt();
        let ja_std = ja_var.sqrt();

        // Harmonic score (simplified - assume voiced content)
        let h_mean = 1.0;

        // Overall salience score
        let s_mean = 1.0 / (1.0 + jp_mean + ja_mean);

        // Peak density
        let peak_density = peaks.len() as f32 / duration_sec;

        Ok(MarineProsodyVector {
            jp_mean,
            jp_std,
            ja_mean,
            ja_std,
            h_mean,
            s_mean,
            peak_density,
            energy_mean,
        })
    }

    /// Validate TTS output quality using Marine salience
    ///
    /// Returns quality score and potential issues detected
    pub fn validate_tts_output(&self, samples: &[f32]) -> Result<TTSQualityReport> {
        let prosody = self.from_samples(samples)?;

        let mut issues = Vec::new();

        // Check for common TTS problems
        if prosody.jp_mean < 0.005 {
            issues.push("Too perfect - sounds robotic (add natural variation)");
        }

        if prosody.jp_mean > 0.3 {
            issues.push("High period jitter - possible artifacts");
        }

        if prosody.ja_mean > 0.4 {
            issues.push("High amplitude jitter - volume inconsistency");
        }

        if prosody.s_mean < 0.4 {
            issues.push("Low salience - audio quality issues");
        }

        if prosody.peak_density < 10.0 {
            issues.push("Low peak density - missing speech energy");
        }

        let quality_score = prosody.s_mean * 100.0;

        Ok(TTSQualityReport {
            prosody,
            quality_score,
            issues,
        })
    }

    /// Get the configured sample rate
    pub fn sample_rate(&self) -> u32 {
        self.sample_rate
    }
}

/// Internal peak information
struct PeakInfo {
    index: usize,
    amplitude: f32,
}

/// TTS quality validation report
#[derive(Debug, Clone)]
pub struct TTSQualityReport {
    /// Extracted prosody vector
    pub prosody: MarineProsodyVector,
    /// Overall quality score (0-100)
    pub quality_score: f32,
    /// List of detected issues
    pub issues: Vec<&'static str>,
}

impl TTSQualityReport {
    /// Check if quality passes threshold
    pub fn passes(&self, threshold: f32) -> bool {
        self.quality_score >= threshold && self.issues.is_empty()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_prosody_vector_array_conversion() {
        let vec = MarineProsodyVector {
            jp_mean: 0.1,
            jp_std: 0.05,
            ja_mean: 0.2,
            ja_std: 0.1,
            h_mean: 0.9,
            s_mean: 0.8,
            peak_density: 50.0,
            energy_mean: 0.3,
        };

        let arr = vec.to_array();
        let reconstructed = MarineProsodyVector::from_array(arr);

        assert_eq!(vec.jp_mean, reconstructed.jp_mean);
        assert_eq!(vec.s_mean, reconstructed.s_mean);
    }

    #[test]
    fn test_conditioner_empty_buffer() {
        let conditioner = MarineProsodyConditioner::new(22050);
        let result = conditioner.from_samples(&[]);
        assert!(result.is_err());
    }

    #[test]
    fn test_conditioner_silence() {
        let conditioner = MarineProsodyConditioner::new(22050);
        let silence = vec![0.0; 1000];
        let prosody = conditioner.from_samples(&silence).unwrap();
        // Should return zeros for silence
        assert_eq!(prosody.peak_density, 0.0);
    }

    #[test]
    fn test_estimate_valence() {
        let positive = MarineProsodyVector {
            jp_mean: 0.01,
            jp_std: 0.01,
            ja_mean: 0.01,
            ja_std: 0.01,
            h_mean: 1.0,
            s_mean: 0.95,
            peak_density: 100.0,
            energy_mean: 0.8,
        };

        let negative = MarineProsodyVector {
            jp_mean: 0.5,
            jp_std: 0.3,
            ja_mean: 0.4,
            ja_std: 0.2,
            h_mean: 0.7,
            s_mean: 0.4,
            peak_density: 30.0,
            energy_mean: 0.1,
        };

        // Higher energy + lower jitter should give more positive valence
        assert!(positive.estimate_valence() > negative.estimate_valence());
    }
}
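`validate_tts_output` plus `TTSQualityReport::passes` form a natural post-synthesis gate. A sketch of that gate, again assuming the `indextts` crate name; `synthesize` is a hypothetical stand-in for the real pipeline, while the validator calls match the API added in this file:

```rust
// Quality-gate sketch: `indextts` crate name and `synthesize` are
// hypothetical stand-ins.
use indextts::quality::MarineProsodyConditioner;

/// Placeholder for the real synthesis pipeline.
fn synthesize(_text: &str) -> Vec<f32> {
    vec![0.0; 22050]
}

fn main() {
    let conditioner = MarineProsodyConditioner::new(22050);
    let samples = synthesize("Hello from the quality gate.");

    match conditioner.validate_tts_output(&samples) {
        // Accept only a clean report above the score threshold.
        Ok(report) if report.passes(70.0) => {
            println!("accepted: score {:.1}", report.quality_score);
        }
        // Below threshold or flagged: log the issues, then retry or fall back.
        Ok(report) => {
            eprintln!("rejected: score {:.1}, issues {:?}", report.quality_score, report.issues);
        }
        // `from_samples` (and thus the validator) errors only on empty input.
        Err(_) => eprintln!("validation failed: empty audio buffer"),
    }
}
```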
tools/convert_to_onnx.py
ADDED

@@ -0,0 +1,379 @@
#!/usr/bin/env python3
"""
Convert IndexTTS-2 PyTorch models to ONNX format for Rust inference!

This script converts the three main models:
1. GPT model (gpt.pth) - Autoregressive text-to-semantic generation
2. S2Mel model (s2mel.pth) - Semantic-to-mel spectrogram conversion
3. BigVGAN - Mel-to-waveform vocoder (already available as ONNX from NVIDIA)

Usage:
    python tools/convert_to_onnx.py

Output:
    models/gpt.onnx
    models/s2mel.onnx
    models/bigvgan.onnx (if needed, otherwise use NVIDIA's)

Why ONNX?
- Cross-platform: Works on Windows, Linux, macOS, M1/M2 Macs
- Fast: ONNX Runtime is highly optimized
- Rust-native: ort crate provides excellent ONNX Runtime bindings
- No Python: Production inference without Python dependency hell!

Author: Aye & Hue @ 8b.is
"""

import os
import sys

# Setup paths
script_dir = os.path.dirname(os.path.abspath(__file__))
project_root = os.path.dirname(script_dir)
os.chdir(project_root)

# Set HF cache
os.environ['HF_HUB_CACHE'] = './checkpoints/hf_cache'

print("=" * 70)
print(" IndexTTS-2 PyTorch to ONNX Converter")
print(" For Rust inference with ort crate!")
print("=" * 70)
print()

# Check for models
if not os.path.exists("checkpoints/gpt.pth"):
    print("ERROR: Models not found!")
    print("Run: python tools/download_files.py -s huggingface")
    sys.exit(1)

import torch
import torch.onnx
import numpy as np
from pathlib import Path

# Add reference code to path
sys.path.insert(0, "indextts - REMOVING - REF ONLY")

# Create output directory
output_dir = Path("models")
output_dir.mkdir(exist_ok=True)

print(f"PyTorch version: {torch.__version__}")
print(f"Output directory: {output_dir}")
print()


def export_speaker_encoder():
    """
    Export the CAM++ speaker encoder to ONNX.

    This model extracts speaker embeddings from reference audio.
    Input: mel spectrogram [batch, n_mels, time]
    Output: speaker embedding [batch, 192]
    """
    print("\n" + "=" * 50)
    print("Exporting Speaker Encoder (CAM++)")
    print("=" * 50)

    try:
        from omegaconf import OmegaConf
        from indextts.s2mel.modules.campplus.DTDNN import CAMPPlus

        # Load config
        cfg = OmegaConf.load("checkpoints/config.yaml")

        # Create model
        model = CAMPPlus(feat_dim=80, embedding_size=192)

        # Load weights
        weights_path = "./checkpoints/hf_cache/models--funasr--campplus/snapshots/fb71fe990cbf6031ae6987a2d76fe64f94377b7e/campplus_cn_common.bin"
        if os.path.exists(weights_path):
            state_dict = torch.load(weights_path, map_location='cpu')
            model.load_state_dict(state_dict)
            print(f"Loaded weights from: {weights_path}")

        model.eval()

        # CAMPPlus expects [batch, time, n_mels] NOT [batch, n_mels, time]!
        # This is the key insight - the model processes time-series of mel features
        dummy_input = torch.randn(1, 100, 80)  # [batch, time, features]

        # Verify forward pass works before export
        with torch.no_grad():
            test_output = model(dummy_input)
            print(f"Forward pass works! Output shape: {test_output.shape}")

        # Export to ONNX
        output_path = output_dir / "speaker_encoder.onnx"
        torch.onnx.export(
            model,
            dummy_input,
            str(output_path),
            input_names=['mel_spectrogram'],
            output_names=['speaker_embedding'],
            dynamic_axes={
                'mel_spectrogram': {0: 'batch', 1: 'time'},  # time is dim 1!
                'speaker_embedding': {0: 'batch'}
            },
            opset_version=18,  # Use 18+ for latest features
            do_constant_folding=True,
        )

        # Verify the export
        import onnx
        onnx_model = onnx.load(str(output_path))
        onnx.checker.check_model(onnx_model)

        print(f"✓ Exported: {output_path}")
        print(f"  Input: mel_spectrogram [batch, time, 80]")  # Corrected!
        print(f"  Output: speaker_embedding [batch, 192]")
        print(f"✓ ONNX model verified!")
        return True

    except Exception as e:
        print(f"✗ Failed to export speaker encoder: {e}")
        import traceback
        traceback.print_exc()
        return False


def export_gpt_model():
    """
    Export the GPT autoregressive model to ONNX.

    This is the most complex model - generates semantic tokens from text.
    We may need to export it in parts due to KV caching.

    Input: text_tokens [batch, seq_len], speaker_embedding [batch, 192]
    Output: semantic_codes [batch, code_len]
    """
    print("\n" + "=" * 50)
    print("Exporting GPT Model (Autoregressive)")
    print("=" * 50)

    try:
        from omegaconf import OmegaConf

        # Load the full model config
        cfg = OmegaConf.load("checkpoints/config.yaml")

        # This is tricky - GPT models with KV caching are hard to export
        # We might need to:
        # 1. Export just the forward pass without caching
        # 2. Or export separate encoder/decoder parts

        print("GPT model export is complex due to:")
        print("  - Autoregressive generation with KV caching")
        print("  - Dynamic sequence lengths")
        print("  - Multiple internal components")
        print()
        print("Options:")
        print("  A) Export without KV cache (slower but simpler)")
        print("  B) Export encoder + single-step decoder (efficient)")
        print("  C) Use torch.compile + ONNX tracing")
        print()

        # For now, let's try the simpler approach
        from infer_v2 import IndexTTS2

        # Load model
        tts = IndexTTS2(
            cfg_path="checkpoints/config.yaml",
            model_dir="checkpoints",
            use_fp16=False,
            device="cpu"
        )

        # Get the GPT component
        gpt = tts.gpt
        gpt.eval()

        print(f"GPT model loaded: {type(gpt)}")
        print(f"Parameters: {sum(p.numel() for p in gpt.parameters()):,}")

        # The GPT model architecture:
        # - Text encoder (embeddings + transformer)
        # - Speaker conditioning
        # - Autoregressive decoder

        # Let's export the text encoder first
        output_path = output_dir / "gpt_encoder.onnx"

        # Create dummy inputs
        text_tokens = torch.randint(0, 30000, (1, 32), dtype=torch.int64)

        # This will likely fail due to complex control flow
        # but let's try!
        print(f"Attempting GPT export (may require modifications)...")

        # For now, just report what we learned
        print()
        print("Note: Full GPT export requires modifying the model code")
        print("to remove dynamic control flow. Creating a wrapper...")

        return False

    except Exception as e:
        print(f"✗ Failed to export GPT: {e}")
        import traceback
        traceback.print_exc()
        return False


def export_s2mel_model():
    """
    Export the Semantic-to-Mel model (flow matching).

    This converts semantic codes to mel spectrograms.
    Input: semantic_codes [batch, code_len], speaker_embedding [batch, 192]
    Output: mel_spectrogram [batch, 80, mel_len]
    """
    print("\n" + "=" * 50)
    print("Exporting S2Mel Model (Flow Matching)")
    print("=" * 50)

    try:
        from omegaconf import OmegaConf

        cfg = OmegaConf.load("checkpoints/config.yaml")

        print("S2Mel model (Diffusion/Flow Matching) is also complex:")
        print("  - Multiple denoising steps (iterative)")
        print("  - CFM (Conditional Flow Matching) requires ODE solving")
        print()
        print("Export strategy:")
        print("  1. Export the single denoising step")
        print("  2. Run iteration loop in Rust")
        print()

        return False

    except Exception as e:
        print(f"✗ Failed to export S2Mel: {e}")
        import traceback
        traceback.print_exc()
        return False


def export_bigvgan():
    """
    Export BigVGAN vocoder to ONNX.

    Good news: NVIDIA provides pre-trained BigVGAN models!
    Even better: They're designed for easy ONNX export.

    Input: mel_spectrogram [batch, 80, mel_len]
    Output: waveform [batch, 1, wave_len]
    """
    print("\n" + "=" * 50)
    print("Exporting BigVGAN Vocoder")
    print("=" * 50)

    try:
        # BigVGAN from NVIDIA is easier to export
        # Let's check if we already have it

        print("BigVGAN options:")
        print("  1. Use NVIDIA's pre-exported ONNX (recommended)")
        print("     https://github.com/NVIDIA/BigVGAN")
        print()
        print("  2. Export from PyTorch weights (we'll do this)")
        print()

        # Try to load BigVGAN
        try:
            from bigvgan import bigvgan
            model = bigvgan.BigVGAN.from_pretrained(
                'nvidia/bigvgan_v2_22khz_80band_256x',
                use_cuda_kernel=False
            )
            model.eval()
            model.remove_weight_norm()  # Important for ONNX!

            print(f"BigVGAN loaded from HuggingFace")
            print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

            # Create dummy input
            dummy_mel = torch.randn(1, 80, 100)

            # Export
            output_path = output_dir / "bigvgan.onnx"
            torch.onnx.export(
                model,
                dummy_mel,
                str(output_path),
                input_names=['mel_spectrogram'],
                output_names=['waveform'],
                dynamic_axes={
                    'mel_spectrogram': {0: 'batch', 2: 'mel_length'},
                    'waveform': {0: 'batch', 2: 'wave_length'}
                },
                opset_version=18,  # Use 18+ for latest features
                do_constant_folding=True,
            )

            print(f"✓ Exported: {output_path}")
            print(f"  Input: mel_spectrogram [batch, 80, mel_len]")
            print(f"  Output: waveform [batch, 1, wave_len]")

            # Verify the export
            import onnx
            onnx_model = onnx.load(str(output_path))
            onnx.checker.check_model(onnx_model)
            print(f"✓ ONNX model verified!")

            return True

        except ImportError:
            print("bigvgan package not installed, installing...")
            os.system("pip install bigvgan")
            print("Please re-run the script.")
            return False

    except Exception as e:
        print(f"✗ Failed to export BigVGAN: {e}")
        import traceback
        traceback.print_exc()
        return False


def main():
    print("\nStarting ONNX conversion...\n")

    results = {}

    # Export each component
    results['speaker_encoder'] = export_speaker_encoder()
    results['gpt'] = export_gpt_model()
    results['s2mel'] = export_s2mel_model()
    results['bigvgan'] = export_bigvgan()

    # Summary
    print("\n" + "=" * 70)
    print(" CONVERSION SUMMARY")
    print("=" * 70)

    for name, success in results.items():
        status = "✓ SUCCESS" if success else "✗ NEEDS WORK"
        print(f"  {name:20} {status}")

    print()

    if all(results.values()):
        print("All models converted! Ready for Rust inference.")
    else:
        print("Some models need manual intervention.")
        print()
        print("For complex models (GPT, S2Mel), consider:")
        print("  1. Modifying the Python code to remove dynamic control flow")
        print("  2. Using torch.jit.trace with concrete inputs")
        print("  3. Exporting subcomponents separately")
        print("  4. Using ONNX Runtime's transformer optimizations")

    print()
    print("Output directory:", output_dir.absolute())


if __name__ == "__main__":
    main()
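The layout caveat in `export_speaker_encoder` matters on the Rust side too: the exported graph expects `[batch, time, 80]`, while mel pipelines typically produce `[n_mels, time]`. A hedged `ndarray` sketch of the transpose shim follows; feeding the tensor into ONNX Runtime is deliberately left out, since the `ort` API varies by version, and the `ndarray` dependency itself is an assumption about the project's Cargo setup:

```rust
// Layout shim sketch for the exported speaker encoder.
use ndarray::{Array2, Array3, Axis};

/// Convert a [n_mels, time] mel into the encoder's [1, time, n_mels] input.
fn to_speaker_encoder_layout(mel: &Array2<f32>) -> Array3<f32> {
    let time_major = mel.t().to_owned(); // [time, n_mels]
    time_major.insert_axis(Axis(0))      // [1, time, n_mels]
}

fn main() {
    let mel = Array2::<f32>::zeros((80, 100)); // 100 frames of 80 mel bins
    let input = to_speaker_encoder_layout(&mel);
    assert_eq!(input.shape(), &[1, 100, 80]); // matches the ONNX dynamic axes
}
```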
tools/download_files.py
CHANGED

File without changes
tools/i18n/i18n.py
DELETED

@@ -1,36 +0,0 @@
import json
import locale
import os

I18N_JSON_DIR : os.PathLike = os.path.join(os.path.dirname(os.path.relpath(__file__)), 'locale')

def load_language_list(language):
    with open(os.path.join(I18N_JSON_DIR, f"{language}.json"), "r", encoding="utf-8") as f:
        language_list = json.load(f)
    return language_list

def scan_language_list():
    language_list = []
    for name in os.listdir(I18N_JSON_DIR):
        if name.endswith(".json"):language_list.append(name.split('.')[0])
    return language_list

class I18nAuto:
    def __init__(self, language=None):
        if language in ["Auto", None]:
            language = locale.getdefaultlocale()[0]
            # getlocale can't identify the system's language ((None, None))
        if not os.path.exists(os.path.join(I18N_JSON_DIR, f"{language}.json")):
            language = "en_US"
        self.language = language
        self.language_map = load_language_list(language)

    def __call__(self, key):
        return self.language_map.get(key, key)

    def __repr__(self):
        return "Use Language: " + self.language

if __name__ == "__main__":
    i18n = I18nAuto(language='en_US')
    print(i18n)
tools/i18n/locale/en_US.json
DELETED

@@ -1,49 +0,0 @@
{
    "本软件以自拟协议开源, 作者不对软件具备任何控制力, 使用软件者、传播软件导出的声音者自负全责.": "This software is open-sourced under customized license. The author has no control over the software, and users of the software, as well as those who distribute the audio generated by the software, assume full responsibility.",
    "如不认可该条款, 则不能使用或引用软件包内任何代码和文件. 详见根目录LICENSE.": "If you do not agree to these terms, you are not permitted to use or reference any code or files within the software package. For further details, please refer to the LICENSE files in the root directory.",
    "时长必须为正数": "Duration must be a positive number",
    "请输入有效的浮点数": "Please enter a valid floating-point number",
    "使用情感参考音频": "Use emotion reference audio",
    "使用情感向量控制": "Use emotion vectors",
    "使用情感描述文本控制": "Use text description to control emotion",
    "上传情感参考音频": "Upload emotion reference audio",
    "情感权重": "Emotion control weight",
    "喜": "Happy",
    "怒": "Angry",
    "哀": "Sad",
    "惧": "Afraid",
    "厌恶": "Disgusted",
    "低落": "Melancholic",
    "惊喜": "Surprised",
    "平静": "Calm",
    "情感描述文本": "Emotion description",
    "请输入情绪描述(或留空以自动使用目标文本作为情绪描述)": "Please input an emotion description (or leave blank to automatically use the main text prompt)",
    "高级生成参数设置": "Advanced generation parameter settings",
    "情感向量之和不能超过1.5,请调整后重试。": "The sum of the emotion vectors cannot exceed 1.5. Please adjust and try again.",
    "音色参考音频": "Voice Reference",
    "音频生成": "Speech Synthesis",
    "文本": "Text",
    "生成语音": "Synthesize",
    "生成结果": "Synthesis Result",
    "功能设置": "Settings",
    "分句设置": "Text segmentation settings",
    "参数会影响音频质量和生成速度": "These parameters affect the audio quality and generation speed.",
    "分句最大Token数": "Max tokens per generation segment",
    "建议80~200之间,值越大,分句越长;值越小,分句越碎;过小过大都可能导致音频质量不高": "Recommended range: 80 - 200. Larger values require more VRAM but improves the flow of the speech, while lower values require less VRAM but means more fragmented sentences. Values that are too small or too large may lead to less coherent speech.",
    "预览分句结果": "Preview of the audio generation segments",
    "序号": "Index",
    "分句内容": "Content",
    "Token数": "Token Count",
    "情感控制方式": "Emotion control method",
    "GPT2 采样设置": "GPT-2 Sampling Configuration",
    "参数会影响音频多样性和生成速度详见": "Influences both the diversity of the generated audio and the generation speed. For further details, refer to",
    "是否进行采样": "Enable GPT-2 sampling",
    "生成Token最大数量,过小导致音频被截断": "Maximum number of tokens to generate. If text exceeds this, the audio will be cut off.",
    "请上传情感参考音频": "Please upload the emotion reference audio",
    "当前模型版本": "Current model version: ",
    "请输入目标文本": "Please input the text to synthesize",
    "例如:委屈巴巴、危险在悄悄逼近": "e.g. deeply sad, danger is creeping closer",
    "与音色参考音频相同": "Same as the voice reference",
    "情感随机采样": "Randomize emotion sampling",
    "显示实验功能": "Show experimental features"
}
tools/i18n/locale/zh_CN.json
DELETED

@@ -1,44 -0,0 @@
{
    "本软件以自拟协议开源, 作者不对软件具备任何控制力, 使用软件者、传播软件导出的声音者自负全责.": "本软件以自拟协议开源, 作者不对软件具备任何控制力, 使用软件者、传播软件导出的声音者自负全责.",
    "如不认可该条款, 则不能使用或引用软件包内任何代码和文件. 详见根目录LICENSE.": "如不认可该条款, 则不能使用或引用软件包内任何代码和文件. 详见根目录LICENSE.",
    "时长必须为正数": "时长必须为正数",
    "请输入有效的浮点数": "请输入有效的浮点数",
    "使用情感参考音频": "使用情感参考音频",
    "使用情感向量控制": "使用情感向量控制",
    "使用情感描述文本控制": "使用情感描述文本控制",
    "上传情感参考音频": "上传情感参考音频",
    "情感权重": "情感权重",
    "喜": "喜",
    "怒": "怒",
    "哀": "哀",
    "惧": "惧",
    "厌恶": "厌恶",
    "低落": "低落",
    "惊喜": "惊喜",
    "平静": "平静",
    "情感描述文本": "情感描述文本",
    "请输入情绪描述(或留空以自动使用目标文本作为情绪描述)": "请输入情绪描述(或留空以自动使用目标文本作为情绪描述)",
    "高级生成参数设置": "高级生成参数设置",
    "情感向量之和不能超过1.5,请调整后重试。": "情感向量之和不能超过1.5,请调整后重试。",
    "音色参考音频": "音色参考音频",
    "音频生成": "音频生成",
    "文本": "文本",
    "生成语音": "生成语音",
    "生成结果": "生成结果",
    "功能设置": "功能设置",
    "分句设置": "分句设置",
    "参数会影响音频质量和生成速度": "参数会影响音频质量和生成速度",
    "分句最大Token数": "分句最大Token数",
    "建议80~200之间,值越大,分句越长;值越小,分句越碎;过小过大都可能导致音频质量不高": "建议80~200之间,值越大,分句越长;值越小,分句越碎;过小过大都可能导致音频质量不高",
    "预览分句结果": "预览分句结果",
    "序号": "序号",
    "分句内容": "分句内容",
    "Token数": "Token数",
    "情感控制方式": "情感控制方式",
    "GPT2 采样设置": "GPT2 采样设置",
    "参数会影响音频多样性和生成速度详见": "参数会影响音频多样性和生成速度详见",
    "是否进行采样": "是否进行采样",
    "生成Token最大数量,过小导致音频被截断": "生成Token最大数量,过小导致音频被截断",
    "显示实验功能": "显示实验功能",
    "例如:委屈巴巴、危险在悄悄逼近": "例如:委屈巴巴、危险在悄悄逼近"
}
tools/i18n/scan_i18n.py
DELETED

@@ -1,131 +0,0 @@
import ast
import glob
import json
import os
from collections import OrderedDict

I18N_JSON_DIR : os.PathLike = os.path.join(os.path.dirname(os.path.relpath(__file__)), 'locale')
DEFAULT_LANGUAGE: str = "zh_CN"  # default language
TITLE_LEN : int = 60  # title display width
KEY_LEN : int = 30  # key display width
SHOW_KEYS : bool = False  # whether to print key details
SORT_KEYS : bool = False  # whether to write the file sorted by global key order

def extract_i18n_strings(node):
    i18n_strings = []

    if (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "i18n"
    ):
        for arg in node.args:
            if isinstance(arg, ast.Str):
                i18n_strings.append(arg.s)

    for child_node in ast.iter_child_nodes(node):
        i18n_strings.extend(extract_i18n_strings(child_node))

    return i18n_strings

def scan_i18n_strings():
    """
    scan the directory for all .py files (recursively)
    for each file, parse the code into an AST
    for each AST, extract the i18n strings
    """
    strings = []
    print(" Scanning Files and Extracting i18n Strings ".center(TITLE_LEN, "="))
    for filename in glob.iglob("**/*.py", recursive=True):
        try:
            with open(filename, "r", encoding="utf-8") as f:
                code = f.read()
                if "I18nAuto" in code:
                    tree = ast.parse(code)
                    i18n_strings = extract_i18n_strings(tree)
                    print(f"{filename.ljust(KEY_LEN*3//2)}: {len(i18n_strings)}")
                    if SHOW_KEYS:
                        print("\n".join([s for s in i18n_strings]))
                    strings.extend(i18n_strings)
        except Exception as e:
            print(f"\033[31m[Failed] Error occur at {filename}: {e}\033[0m")

    code_keys = set(strings)
    print(f"{'Total Unique'.ljust(KEY_LEN*3//2)}: {len(code_keys)}")
    return code_keys

def update_i18n_json(json_file, standard_keys):
    standard_keys = sorted(standard_keys)
    print(f" Process {json_file} ".center(TITLE_LEN, "="))
    # Read the JSON file
    with open(json_file, "r", encoding="utf-8") as f:
        json_data = json.load(f, object_pairs_hook=OrderedDict)
    # Print the number of JSON entries before processing
    len_before = len(json_data)
    print(f"{'Total Keys'.ljust(KEY_LEN)}: {len_before}")
    # Identify missing keys and fill them in
    miss_keys = set(standard_keys) - set(json_data.keys())
    if len(miss_keys) > 0:
        print(f"{'Missing Keys (+)'.ljust(KEY_LEN)}: {len(miss_keys)}")
        for key in miss_keys:
            if DEFAULT_LANGUAGE in json_file:
                # For the default language, key and value are identical.
                json_data[key] = key
            else:
                # For other languages, set the value to "#!" + key to mark it as untranslated.
                json_data[key] = "#!" + key
            if SHOW_KEYS:
                print(f"{'Added Missing Key'.ljust(KEY_LEN)}: {key}")
    # Identify unused keys and delete them
    diff_keys = set(json_data.keys()) - set(standard_keys)
    if len(diff_keys) > 0:
        print(f"{'Unused Keys (-)'.ljust(KEY_LEN)}: {len(diff_keys)}")
        for key in diff_keys:
            del json_data[key]
            if SHOW_KEYS:
                print(f"{'Removed Unused Key'.ljust(KEY_LEN)}: {key}")
    # Sort by key order
    json_data = OrderedDict(
        sorted(
            json_data.items(),
            key=lambda x: (
                list(standard_keys).index(x[0]) if x[0] in standard_keys and not x[1].startswith('#!') else len(json_data),
            )
        )
    )
    # Print the number of JSON entries after processing
    if len(miss_keys) != 0 or len(diff_keys) != 0:
        print(f"{'Total Keys (After)'.ljust(KEY_LEN)}: {len(json_data)}")
    # Identify keys still awaiting translation
    num_miss_translation = 0
    duplicate_items = {}
    for key, value in json_data.items():
        if value.startswith("#!"):
            num_miss_translation += 1
            if SHOW_KEYS:
                print(f"{'Missing Translation'.ljust(KEY_LEN)}: {key}")
        if value in duplicate_items:
            duplicate_items[value].append(key)
        else:
            duplicate_items[value] = [key]
    # Report any duplicate values
    for value, keys in duplicate_items.items():
        if len(keys) > 1:
            print("\n".join([f"\033[31m{'[Failed] Duplicate Value'.ljust(KEY_LEN)}: {key} -> {value}\033[0m" for key in keys]))

    if num_miss_translation > 0:
        print(f"\033[31m{'[Failed] Missing Translation'.ljust(KEY_LEN)}: {num_miss_translation}\033[0m")
    else:
        print(f"\033[32m[Passed] All Keys Translated\033[0m")
    # Write the processed result back to the JSON file
    with open(json_file, "w", encoding="utf-8") as f:
        json.dump(json_data, f, ensure_ascii=False, indent=4, sort_keys=SORT_KEYS)
        f.write("\n")
    print(f" Updated {json_file} ".center(TITLE_LEN, "=") + '\n')

if __name__ == "__main__":
    code_keys = scan_i18n_strings()
    for json_file in os.listdir(I18N_JSON_DIR):
        if json_file.endswith(r".json"):
            json_file = os.path.join(I18N_JSON_DIR, json_file)
            update_i18n_json(json_file, code_keys)
webui.py
DELETED
|
@@ -1,392 +0,0 @@
|
|
import spaces
import json
import os
import sys
import threading
import time

import warnings

import numpy as np

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

import pandas as pd

current_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.append(current_dir)
sys.path.append(os.path.join(current_dir, "indextts"))

import argparse
parser = argparse.ArgumentParser(
    description="IndexTTS WebUI",
    formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument("--verbose", action="store_true", default=False, help="Enable verbose mode")
parser.add_argument("--port", type=int, default=7860, help="Port to run the web UI on")
parser.add_argument("--host", type=str, default="0.0.0.0", help="Host to run the web UI on")
parser.add_argument("--model_dir", type=str, default="./checkpoints", help="Model checkpoints directory")
parser.add_argument("--fp16", action="store_true", default=False, help="Use FP16 for inference if available")
parser.add_argument("--deepspeed", action="store_true", default=False, help="Use DeepSpeed to accelerate if available")
parser.add_argument("--cuda_kernel", action="store_true", default=False, help="Use CUDA kernel for inference if available")
parser.add_argument("--gui_seg_tokens", type=int, default=120, help="GUI: Max tokens per generation segment")
cmd_args = parser.parse_args()

from tools.download_files import download_model_from_huggingface
download_model_from_huggingface(os.path.join(current_dir, "checkpoints"),
                                os.path.join(current_dir, "checkpoints", "hf_cache"))

import gradio as gr
from indextts.infer_v2 import IndexTTS2
from tools.i18n.i18n import I18nAuto

i18n = I18nAuto(language="Auto")
MODE = 'local'
tts = IndexTTS2(model_dir=cmd_args.model_dir,
                cfg_path=os.path.join(cmd_args.model_dir, "config.yaml"),
                use_fp16=cmd_args.fp16,
                use_deepspeed=cmd_args.deepspeed,
                use_cuda_kernel=cmd_args.cuda_kernel,
                )
# Supported UI languages
LANGUAGES = {
    "中文": "zh_CN",
    "English": "en_US"
}
EMO_CHOICES = [i18n("与音色参考音频相同"),
               i18n("使用情感参考音频"),
               i18n("使用情感向量控制"),
               i18n("使用情感描述文本控制")]
EMO_CHOICES_BASE = EMO_CHOICES[:3]  # basic options
EMO_CHOICES_EXPERIMENTAL = EMO_CHOICES  # all options (including text description)

os.makedirs("outputs/tasks", exist_ok=True)
os.makedirs("prompts", exist_ok=True)

MAX_LENGTH_TO_USE_SPEED = 70
with open("examples/cases.jsonl", "r", encoding="utf-8") as f:
    example_cases = []
    for line in f:
        line = line.strip()
        if not line:
            continue
        example = json.loads(line)
        if example.get("emo_audio", None):
            emo_audio_path = os.path.join("examples", example["emo_audio"])
        else:
            emo_audio_path = None
        example_cases.append([os.path.join("examples", example.get("prompt_audio", "sample_prompt.wav")),
                              EMO_CHOICES[example.get("emo_mode", 0)],
                              example.get("text"),
                              emo_audio_path,
                              example.get("emo_weight", 1.0),
                              example.get("emo_text", ""),
                              example.get("emo_vec_1", 0),
                              example.get("emo_vec_2", 0),
                              example.get("emo_vec_3", 0),
                              example.get("emo_vec_4", 0),
                              example.get("emo_vec_5", 0),
                              example.get("emo_vec_6", 0),
                              example.get("emo_vec_7", 0),
                              example.get("emo_vec_8", 0),
                              example.get("emo_text") is not None]
                             )

def normalize_emo_vec(emo_vec):
    # emotion factors for better user experience
    k_vec = [0.75, 0.70, 0.80, 0.80, 0.75, 0.75, 0.55, 0.45]
    tmp = np.array(k_vec) * np.array(emo_vec)
    if np.sum(tmp) > 0.8:
        tmp = tmp * 0.8 / np.sum(tmp)
    return tmp.tolist()

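# NOTE: illustrative sketch, not part of the original webui.py. Worked
# example of normalize_emo_vec above: each slider value is damped by its
# per-emotion factor, then the whole vector is rescaled so its sum never
# exceeds 0.8.
#
#   normalize_emo_vec([1.0, 0, 0, 0, 0, 0, 0, 1.0])
#   # k_vec scaling       -> [0.75, 0, 0, 0, 0, 0, 0, 0.45], sum = 1.2 > 0.8
#   # rescale by 0.8/1.2  -> [0.5, 0, 0, 0, 0, 0, 0, 0.3]
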
@spaces.GPU
def gen_single(emo_control_method, prompt, text,
               emo_ref_path, emo_weight,
               vec1, vec2, vec3, vec4, vec5, vec6, vec7, vec8,
               emo_text, emo_random,
               max_text_tokens_per_segment=120,
               *args, progress=gr.Progress()):
    output_path = None
    if not output_path:
        output_path = os.path.join("outputs", f"spk_{int(time.time())}.wav")
    # set gradio progress
    tts.gr_progress = progress
    do_sample, top_p, top_k, temperature, \
        length_penalty, num_beams, repetition_penalty, max_mel_tokens = args
    kwargs = {
        "do_sample": bool(do_sample),
        "top_p": float(top_p),
        "top_k": int(top_k) if int(top_k) > 0 else None,
        "temperature": float(temperature),
        "length_penalty": float(length_penalty),
        "num_beams": num_beams,
        "repetition_penalty": float(repetition_penalty),
        "max_mel_tokens": int(max_mel_tokens),
        # "typical_sampling": bool(typical_sampling),
        # "typical_mass": float(typical_mass),
    }
    if type(emo_control_method) is not int:
        emo_control_method = emo_control_method.value
    if emo_control_method == 0:  # emotion from speaker
        emo_ref_path = None  # remove external reference audio
    if emo_control_method == 1:  # emotion from reference audio
        # normalize emo_alpha for better user experience
        emo_weight = emo_weight * 0.8
        pass
    if emo_control_method == 2:  # emotion from custom vectors
        vec = [vec1, vec2, vec3, vec4, vec5, vec6, vec7, vec8]
        vec = normalize_emo_vec(vec)
    else:
        # don't use the emotion vector inputs for the other modes
        vec = None

    if emo_text == "":
        # erase empty emotion descriptions; `infer()` will then automatically use the main prompt
        emo_text = None

    print(f"Emo control mode:{emo_control_method},weight:{emo_weight},vec:{vec}")
    output = tts.infer(spk_audio_prompt=prompt, text=text,
                       output_path=output_path,
                       emo_audio_prompt=emo_ref_path, emo_alpha=emo_weight,
                       emo_vector=vec,
                       use_emo_text=(emo_control_method == 3), emo_text=emo_text, use_random=emo_random,
                       verbose=cmd_args.verbose,
                       max_text_tokens_per_segment=int(max_text_tokens_per_segment),
                       **kwargs)
    return gr.update(value=output, visible=True)

def update_prompt_audio():
    update_button = gr.update(interactive=True)
    return update_button

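# NOTE: illustrative sketch, not part of the original webui.py. gen_single
# maps the radio index onto IndexTTS2.infer arguments roughly as follows
# (argument names are taken from the call above; file paths are hypothetical):
#
#   mode 0 "same as voice reference" -> emo_audio_prompt=None, emo_vector=None
#   mode 1 "emotion reference audio" -> emo_audio_prompt=<file>, emo_alpha=weight*0.8
#   mode 2 "emotion vector sliders"  -> emo_vector=normalize_emo_vec([vec1..vec8])
#   mode 3 "emotion description"     -> use_emo_text=True, emo_text=<text or None>
#
#   tts.infer(spk_audio_prompt="prompts/voice.wav", text="...",
#             output_path="outputs/demo.wav", use_emo_text=True,
#             emo_text="danger is quietly approaching")
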
with gr.Blocks(title="IndexTTS Demo") as demo:
    mutex = threading.Lock()
    gr.HTML('''
    <h2><center>IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech</h2>
    <p align="center">
    <a href='https://arxiv.org/abs/2506.21619'><img src='https://img.shields.io/badge/ArXiv-2506.21619-red'></a>
    </p>
    ''')

    with gr.Tab(i18n("音频生成")):
        with gr.Row():
            os.makedirs("prompts", exist_ok=True)
            prompt_audio = gr.Audio(label=i18n("音色参考音频"), key="prompt_audio",
                                    sources=["upload", "microphone"], type="filepath")
            prompt_list = os.listdir("prompts")
            default = ''
            if prompt_list:
                default = prompt_list[0]
            with gr.Column():
                input_text_single = gr.TextArea(label=i18n("文本"), key="input_text_single", placeholder=i18n("请输入目标文本"), info=f"{i18n('当前模型版本')}{tts.model_version or '1.0'}")
                gen_button = gr.Button(i18n("生成语音"), key="gen_button", interactive=True)
                output_audio = gr.Audio(label=i18n("生成结果"), visible=True, key="output_audio")
        experimental_checkbox = gr.Checkbox(label=i18n("显示实验功能"), value=False)
        with gr.Accordion(i18n("功能设置")):
            # Emotion control options
            with gr.Row():
                emo_control_method = gr.Radio(
                    choices=EMO_CHOICES_BASE,
                    type="index",
                    value=EMO_CHOICES_BASE[0], label=i18n("情感控制方式"))
            # Emotion reference audio
            with gr.Group(visible=False) as emotion_reference_group:
                with gr.Row():
                    emo_upload = gr.Audio(label=i18n("上传情感参考音频"), type="filepath")

            # Random emotion sampling
            with gr.Row(visible=False) as emotion_randomize_group:
                emo_random = gr.Checkbox(label=i18n("情感随机采样"), value=False)

            # Emotion vector controls
            with gr.Group(visible=False) as emotion_vector_group:
                with gr.Row():
                    with gr.Column():
                        vec1 = gr.Slider(label=i18n("喜"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)
                        vec2 = gr.Slider(label=i18n("怒"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)
                        vec3 = gr.Slider(label=i18n("哀"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)
                        vec4 = gr.Slider(label=i18n("惧"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)
                    with gr.Column():
                        vec5 = gr.Slider(label=i18n("厌恶"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)
                        vec6 = gr.Slider(label=i18n("低落"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)
                        vec7 = gr.Slider(label=i18n("惊喜"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)
                        vec8 = gr.Slider(label=i18n("平静"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)

            with gr.Group(visible=False) as emo_text_group:
                with gr.Row():
                    emo_text = gr.Textbox(label=i18n("情感描述文本"),
                                          placeholder=i18n("请输入情绪描述(或留空以自动使用目标文本作为情绪描述)"),
                                          value="",
                                          info=i18n("例如:委屈巴巴、危险在悄悄逼近"))


            with gr.Row(visible=False) as emo_weight_group:
                emo_weight = gr.Slider(label=i18n("情感权重"), minimum=0.0, maximum=1.0, value=0.8, step=0.01)

        with gr.Accordion(i18n("高级生成参数设置"), open=False, visible=False) as advanced_settings_group:
            with gr.Row():
                with gr.Column(scale=1):
                    gr.Markdown(f"**{i18n('GPT2 采样设置')}** _{i18n('参数会影响音频多样性和生成速度详见')} [Generation strategies](https://huggingface.co/docs/transformers/main/en/generation_strategies)._")
                    with gr.Row():
                        do_sample = gr.Checkbox(label="do_sample", value=True, info=i18n("是否进行采样"))
                        temperature = gr.Slider(label="temperature", minimum=0.1, maximum=2.0, value=0.8, step=0.1)
                    with gr.Row():
                        top_p = gr.Slider(label="top_p", minimum=0.0, maximum=1.0, value=0.8, step=0.01)
                        top_k = gr.Slider(label="top_k", minimum=0, maximum=100, value=30, step=1)
                        num_beams = gr.Slider(label="num_beams", value=3, minimum=1, maximum=10, step=1)
                    with gr.Row():
                        repetition_penalty = gr.Number(label="repetition_penalty", precision=None, value=10.0, minimum=0.1, maximum=20.0, step=0.1)
                        length_penalty = gr.Number(label="length_penalty", precision=None, value=0.0, minimum=-2.0, maximum=2.0, step=0.1)
                    max_mel_tokens = gr.Slider(label="max_mel_tokens", value=1500, minimum=50, maximum=tts.cfg.gpt.max_mel_tokens, step=10, info=i18n("生成Token最大数量,过小导致音频被截断"), key="max_mel_tokens")
                    # with gr.Row():
                    #     typical_sampling = gr.Checkbox(label="typical_sampling", value=False, info="not recommended")
                    #     typical_mass = gr.Slider(label="typical_mass", value=0.9, minimum=0.0, maximum=1.0, step=0.1)
                with gr.Column(scale=2):
                    gr.Markdown(f'**{i18n("分句设置")}** _{i18n("参数会影响音频质量和生成速度")}_')
                    with gr.Row():
                        initial_value = max(20, min(tts.cfg.gpt.max_text_tokens, cmd_args.gui_seg_tokens))
                        max_text_tokens_per_segment = gr.Slider(
                            label=i18n("分句最大Token数"), value=initial_value, minimum=20, maximum=tts.cfg.gpt.max_text_tokens, step=2, key="max_text_tokens_per_segment",
                            info=i18n("建议80~200之间,值越大,分句越长;值越小,分句越碎;过小过大都可能导致音频质量不高"),
                        )
                    with gr.Accordion(i18n("预览分句结果"), open=True) as segments_settings:
                        segments_preview = gr.Dataframe(
                            headers=[i18n("序号"), i18n("分句内容"), i18n("Token数")],
                            key="segments_preview",
                            wrap=True,
                        )
        advanced_params = [
            do_sample, top_p, top_k, temperature,
            length_penalty, num_beams, repetition_penalty, max_mel_tokens,
            # typical_sampling, typical_mass,
        ]

    if len(example_cases) > 2:
        example_table = gr.Examples(
            examples=example_cases[:-2],
            examples_per_page=20,
            inputs=[prompt_audio,
                    emo_control_method,
                    input_text_single,
                    emo_upload,
                    emo_weight,
                    emo_text,
                    vec1, vec2, vec3, vec4, vec5, vec6, vec7, vec8, experimental_checkbox]
        )
    elif len(example_cases) > 0:
        example_table = gr.Examples(
            examples=example_cases,
            examples_per_page=20,
            inputs=[prompt_audio,
                    emo_control_method,
                    input_text_single,
                    emo_upload,
                    emo_weight,
                    emo_text,
                    vec1, vec2, vec3, vec4, vec5, vec6, vec7, vec8, experimental_checkbox]
        )

    def on_input_text_change(text, max_text_tokens_per_segment):
        if text and len(text) > 0:
            text_tokens_list = tts.tokenizer.tokenize(text)

            segments = tts.tokenizer.split_segments(text_tokens_list, max_text_tokens_per_segment=int(max_text_tokens_per_segment))
            data = []
            for i, s in enumerate(segments):
                segment_str = ''.join(s)
                tokens_count = len(s)
                data.append([i, segment_str, tokens_count])
            return {
                segments_preview: gr.update(value=data, visible=True, type="array"),
            }
        else:
            df = pd.DataFrame([], columns=[i18n("序号"), i18n("分句内容"), i18n("Token数")])
            return {
                segments_preview: gr.update(value=df),
            }

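# NOTE: illustrative sketch, not part of the original webui.py. The preview
# above uses the same per-segment token limit that gen_single later passes
# to tts.infer: the text is tokenized once, then grouped into segments no
# longer than max_text_tokens_per_segment tokens, e.g.:
#
#   tokens = tts.tokenizer.tokenize("A long target text ...")
#   segments = tts.tokenizer.split_segments(tokens, max_text_tokens_per_segment=120)
#   for i, s in enumerate(segments):
#       print(i, ''.join(s), len(s))   # index, segment text, token count
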
    def on_method_select(emo_control_method):
        if emo_control_method == 1:  # emotion reference audio
            return (gr.update(visible=True),
                    gr.update(visible=False),
                    gr.update(visible=False),
                    gr.update(visible=False),
                    gr.update(visible=True)
                    )
        elif emo_control_method == 2:  # emotion vectors
            return (gr.update(visible=False),
                    gr.update(visible=True),
                    gr.update(visible=True),
                    gr.update(visible=False),
                    gr.update(visible=False)
                    )
        elif emo_control_method == 3:  # emotion text description
            return (gr.update(visible=False),
                    gr.update(visible=True),
                    gr.update(visible=False),
                    gr.update(visible=True),
                    gr.update(visible=True)
                    )
        else:  # 0: same as speaker voice
            return (gr.update(visible=False),
                    gr.update(visible=False),
                    gr.update(visible=False),
                    gr.update(visible=False),
                    gr.update(visible=False)
                    )

    def on_experimental_change(is_exp):
        # Toggle the emotion control choices
        # (the third return value has no actual effect)
        if is_exp:
            return gr.update(choices=EMO_CHOICES_EXPERIMENTAL, value=EMO_CHOICES_EXPERIMENTAL[0]), gr.update(visible=True), gr.update(value=example_cases)
        else:
            return gr.update(choices=EMO_CHOICES_BASE, value=EMO_CHOICES_BASE[0]), gr.update(visible=False), gr.update(value=example_cases[:-2])

    emo_control_method.select(on_method_select,
                              inputs=[emo_control_method],
                              outputs=[emotion_reference_group,
                                       emotion_randomize_group,
                                       emotion_vector_group,
                                       emo_text_group,
                                       emo_weight_group]
                              )

    input_text_single.change(
        on_input_text_change,
        inputs=[input_text_single, max_text_tokens_per_segment],
        outputs=[segments_preview]
    )

    experimental_checkbox.change(
        on_experimental_change,
        inputs=[experimental_checkbox],
        outputs=[emo_control_method, advanced_settings_group, example_table.dataset]  # advanced-parameters accordion
    )

    max_text_tokens_per_segment.change(
        on_input_text_change,
        inputs=[input_text_single, max_text_tokens_per_segment],
        outputs=[segments_preview]
    )

    prompt_audio.upload(update_prompt_audio,
                        inputs=[],
                        outputs=[gen_button])

    gen_button.click(gen_single,
                     inputs=[emo_control_method, prompt_audio, input_text_single, emo_upload, emo_weight,
                             vec1, vec2, vec3, vec4, vec5, vec6, vec7, vec8,
                             emo_text, emo_random,
                             max_text_tokens_per_segment,
                             *advanced_params,
                             ],
                     outputs=[output_audio])


if __name__ == "__main__":
    demo.queue(20)
    demo.launch(server_name=cmd_args.host, server_port=cmd_args.port)