ThreadAbort committed on
Commit e3e7558 · 1 Parent(s): 2bbf5a2

Refactor: Remove internationalization (i18n) support and related files


- Deleted i18n.py, zh_CN.json, and en_US.json files to eliminate localization features.
- Removed scan_i18n.py script responsible for scanning and updating i18n strings.
- Updated download_files.py permissions to make it executable.
- Removed webui.py, which contained the main application logic and UI components.

CLAUDE.md ADDED
@@ -0,0 +1,140 @@
1
+ # CLAUDE.md
2
+
3
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4
+
5
+ ## Project Overview
6
+
7
+ IndexTTS-Rust is a high-performance Text-to-Speech engine, a complete Rust rewrite of the Python IndexTTS system. It uses ONNX Runtime for neural network inference and provides zero-shot voice cloning with emotion control.
8
+
9
+ ## Build and Development Commands
10
+
11
+ ```bash
12
+ # Build (always build release for performance testing)
13
+ cargo build --release
14
+
15
+ # Run linter (MANDATORY before commits - catches many issues)
16
+ cargo clippy -- -D warnings
17
+
18
+ # Run tests
19
+ cargo test
20
+
21
+ # Run specific test
22
+ cargo test test_name
23
+
24
+ # Run benchmarks (Criterion-based)
25
+ cargo bench
26
+
27
+ # Run specific benchmark
28
+ cargo bench --bench mel_spectrogram
29
+ cargo bench --bench inference
30
+
31
+ # Check compilation without building
32
+ cargo check
33
+
34
+ # Format code
35
+ cargo fmt
36
+
37
+ # Full pre-commit workflow (BUILD -> CLIPPY -> BUILD)
38
+ cargo build --release && cargo clippy -- -D warnings && cargo build --release
39
+ ```
40
+
41
+ ## CLI Usage
42
+
43
+ ```bash
44
+ # Show help
45
+ ./target/release/indextts --help
46
+
47
+ # Synthesize speech
48
+ ./target/release/indextts synthesize \
49
+ --text "Hello world" \
50
+ --voice examples/voice_01.wav \
51
+ --output output.wav
52
+
53
+ # Generate default config
54
+ ./target/release/indextts init-config -o config.yaml
55
+
56
+ # Show system info
57
+ ./target/release/indextts info
58
+
59
+ # Run built-in benchmarks
60
+ ./target/release/indextts benchmark --iterations 100
61
+ ```
62
+
63
+ ## Architecture
64
+
65
+ The codebase follows a modular pipeline architecture where each stage processes data sequentially:
66
+
67
+ ```
68
+ Text Input → Normalization → Tokenization → Model Inference → Vocoding → Audio Output
69
+ ```
70
+
71
+ ### Core Modules (src/)
72
+
73
+ - **audio/** - Audio DSP operations
74
+ - `mel.rs` - Mel-spectrogram computation (STFT, filterbanks)
75
+ - `io.rs` - WAV file I/O using hound
76
+ - `dsp.rs` - Signal processing utilities
77
+ - `resample.rs` - Audio resampling using rubato
78
+
79
+ - **text/** - Text processing pipeline
80
+ - `normalizer.rs` - Text normalization (Chinese/English/mixed)
81
+ - `tokenizer.rs` - BPE tokenization via HuggingFace tokenizers
82
+ - `phoneme.rs` - Grapheme-to-phoneme conversion
83
+
84
+ - **model/** - Neural network inference
85
+ - `session.rs` - ONNX Runtime wrapper (load-dynamic feature)
86
+ - `gpt.rs` - GPT-based sequence generation
87
+ - `embedding.rs` - Speaker and emotion encoders
88
+
89
+ - **vocoder/** - Neural vocoding
90
+ - `bigvgan.rs` - BigVGAN waveform synthesis
91
+ - `activations.rs` - Snake/SnakeBeta activation functions
92
+
93
+ - **pipeline/** - TTS orchestration
94
+ - `synthesis.rs` - Main synthesis logic, coordinates all modules
95
+
96
+ - **config/** - Configuration management (YAML-based via serde)
97
+
98
+ - **error.rs** - Error types using thiserror
99
+
100
+ - **lib.rs** - Library entry point, exposes public API
101
+
102
+ - **main.rs** - CLI entry point using clap
103
+
104
+ ### Key Constants (lib.rs)
105
+
106
+ ```rust
107
+ pub const SAMPLE_RATE: u32 = 22050; // Output audio sample rate
108
+ pub const N_MELS: usize = 80; // Mel filterbank channels
109
+ pub const N_FFT: usize = 1024; // FFT size
110
+ pub const HOP_LENGTH: usize = 256; // STFT hop length
111
+ ```
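+
+ A quick sanity check on what these constants imply for timing (a minimal sketch; `SAMPLE_RATE` and `HOP_LENGTH` are the library constants above, the derived numbers are computed here only for illustration):
+
+ ```rust
+ fn main() {
+     const SAMPLE_RATE: u32 = 22050;
+     const HOP_LENGTH: usize = 256;
+     // One mel frame is emitted every HOP_LENGTH samples.
+     let frames_per_second = SAMPLE_RATE as f32 / HOP_LENGTH as f32; // ~86.1
+     let ms_per_frame = 1000.0 * HOP_LENGTH as f32 / SAMPLE_RATE as f32; // ~11.6 ms
+     println!("{frames_per_second:.1} mel frames/s, {ms_per_frame:.1} ms per frame");
+ }
+ ```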
112
+
113
+ ### Dependencies Pattern
114
+
115
+ - **Audio**: hound (WAV), rustfft/realfft (DSP), rubato (resampling), dasp (signal processing)
116
+ - **ML Inference**: ort (ONNX Runtime with load-dynamic), ndarray, safetensors
117
+ - **Text**: tokenizers (HuggingFace), jieba-rs (Chinese), regex, unicode-segmentation
118
+ - **Parallelism**: rayon (data parallelism), tokio (async)
119
+ - **CLI**: clap (derive), env_logger, indicatif
120
+
121
+ ## Important Notes
122
+
123
+ 1. **ONNX Runtime**: Uses `load-dynamic` feature - requires ONNX Runtime library installed on system
124
+ 2. **Model Files**: ONNX models go in `models/` directory (not in git, download separately)
125
+ 3. **Reference Implementation**: Python code in `indextts - REMOVING - REF ONLY/` is kept for reference only
126
+ 4. **Performance**: Release builds use LTO and single codegen-unit for maximum optimization
127
+ 5. **Audio Format**: All internal processing at 22050 Hz, 80-band mel spectrograms
128
+
129
+ ## Testing Strategy
130
+
131
+ - Unit tests inline in modules
132
+ - Criterion benchmarks in `benches/` for performance regression testing
133
+ - Python regression tests in `tests/` for end-to-end validation
134
+ - Example audio files in `examples/` for testing voice cloning
135
+
136
+ ## Missing Infrastructure (TODO)
137
+
138
+ - No `scripts/manage.sh` yet (should include build, test, clean, docker controls)
139
+ - No `context.md` yet for conversation continuity
140
+ - No integration tests with actual ONNX models
config.yaml ADDED
@@ -0,0 +1,51 @@
1
+ gpt:
2
+ layers: 8
3
+ model_dim: 512
4
+ heads: 8
5
+ max_text_tokens: 120
6
+ max_mel_tokens: 250
7
+ stop_mel_token: 8193
8
+ start_text_token: 8192
9
+ start_mel_token: 8192
10
+ num_mel_codes: 8194
11
+ num_text_tokens: 6681
12
+ vocoder:
13
+ name: bigvgan_v2_22khz_80band_256x
14
+ checkpoint: null
15
+ use_fp16: true
16
+ use_deepspeed: false
17
+ s2mel:
18
+ checkpoint: models/s2mel.onnx
19
+ preprocess:
20
+ sr: 22050
21
+ n_fft: 1024
22
+ hop_length: 256
23
+ win_length: 1024
24
+ n_mels: 80
25
+ fmin: 0.0
26
+ fmax: 8000.0
27
+ dataset:
28
+ bpe_model: models/bpe.model
29
+ vocab_size: 6681
30
+ emotions:
31
+ num_dims: 8
32
+ num:
33
+ - 5
34
+ - 6
35
+ - 8
36
+ - 6
37
+ - 5
38
+ - 4
39
+ - 7
40
+ - 6
41
+ matrix_path: models/emotion_matrix.safetensors
42
+ inference:
43
+ device: cpu
44
+ use_fp16: false
45
+ batch_size: 1
46
+ top_k: 50
47
+ top_p: 0.95
48
+ temperature: 1.0
49
+ repetition_penalty: 1.0
50
+ length_penalty: 1.0
51
+ model_dir: models
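+
+ CLAUDE.md describes configuration as YAML-based via serde. As a rough sketch of how one section above could deserialize (struct and field names here are assumptions, `serde_yaml` is assumed as the parser, and the real types in `src/config/` are not part of this commit):
+
+ ```rust
+ use serde::Deserialize;
+
+ // Hypothetical mirror of the `preprocess:` block; the actual config structs may differ.
+ #[derive(Debug, Deserialize)]
+ struct PreprocessSection {
+     sr: u32,
+     n_fft: usize,
+     hop_length: usize,
+     win_length: usize,
+     n_mels: usize,
+     fmin: f32,
+     fmax: f32,
+ }
+
+ fn main() -> Result<(), serde_yaml::Error> {
+     // Same values as the preprocess section above.
+     let yaml = "sr: 22050\nn_fft: 1024\nhop_length: 256\nwin_length: 1024\nn_mels: 80\nfmin: 0.0\nfmax: 8000.0";
+     let pre: PreprocessSection = serde_yaml::from_str(yaml)?;
+     println!("{pre:?}");
+     Ok(())
+ }
+ ```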
context.md ADDED
@@ -0,0 +1,383 @@
1
+ # IndexTTS-Rust Context
2
+
3
+ This file preserves important context for conversation continuity between Hue and Aye sessions.
4
+
5
+ **Last Updated:** 2025-11-16
6
+
7
+ ---
8
+
9
+ ## The Vision
10
+
11
+ IndexTTS-Rust is part of a larger audio intelligence ecosystem at 8b.is:
12
+
13
+ 1. **kokoro-tiny** - Lightweight TTS (82M params, 50+ voices, on crates.io!)
14
+ 2. **IndexTTS-Rust** - Advanced zero-shot TTS with emotion control
15
+ 3. **Phoenix-Protocol** - Audio restoration/enhancement layer
16
+ 4. **MEM|8** - Contextual memory system (mem-8.com, mem8)
17
+
18
+ Together these form a complete audio intelligence pipeline.
19
+
20
+ ---
21
+
22
+ ## Phoenix Protocol Integration Opportunities
23
+
24
+ The Phoenix Protocol (phoenix-protocol/) is a PERFECT complement to IndexTTS-Rust:
25
+
26
+ ### Direct Module Mappings
27
+
28
+ | Phoenix Module | IndexTTS Use Case |
29
+ |----------------|-------------------|
30
+ | `emotional.rs` | Map to our 8D emotion control (Warmth→body, Presence→power, Clarity→articulation, Air→space, Ultrasonics→depth) |
31
+ | `voice_signature.rs` | Enhance speaker embeddings for voice cloning |
32
+ | `spectral_velocity.rs` | Add momentum tracking to mel-spectrogram |
33
+ | `marine.rs` | Validate TTS output authenticity/quality |
34
+ | `golden_ratio.rs` | Post-process vocoder output with harmonic enhancement |
35
+ | `harmonic_resurrection.rs` | Add richness to synthesized speech |
36
+ | `micro_dynamics.rs` | Restore natural speech dynamics |
37
+ | `autotune.rs` | Improve prosody and pitch control |
38
+ | `mem8_integration.rs` | Already has MEM|8 hooks! |
39
+
40
+ ### Shared Dependencies
41
+
42
+ Both projects use:
43
+ - rayon (parallelism)
44
+ - rustfft/realfft (FFT)
45
+ - ndarray (array operations)
46
+ - hound (WAV I/O)
47
+ - serde (config serialization)
48
+ - anyhow (error handling)
49
+ - ort (ONNX Runtime)
50
+
51
+ ### Audio Constants
52
+
53
+ | Project | Sample Rate | Use Case |
54
+ |---------|------------|----------|
55
+ | IndexTTS-Rust | 22,050 Hz | Standard TTS output |
56
+ | Phoenix-Protocol | 192,000 Hz | Ultrasonic restoration |
57
+ | kokoro-tiny | 24,000 Hz | Lightweight TTS |
58
+
59
+ ---
60
+
61
+ ## Related Projects of Interest
62
+
63
+ Located in ~/Documents/GitHub/:
64
+
65
+ - **Ultrasonic-Consciousness-Hypothesis/** - Research foundation for Phoenix Protocol, contains PDFs on mechanosensitive channels and audio perception
66
+ - **hrmnCmprssnM/** - Harmonic Compression Model research
67
+ - **Marine-Sense/** - Marine algorithm origins
68
+ - **mem-8.com/** & **mem8/** - MEM|8 contextual memory
69
+ - **universal-theoglyphic-language/** - Language processing research
70
+ - **kokoro-tiny/** - Already working TTS crate by Hue & Aye
71
+ - **zencooker/** - (fun project!)
72
+
73
+ ---
74
+
75
+ ## Current IndexTTS-Rust State
76
+
77
+ ### Implemented ✅
78
+ - Audio processing pipeline (mel-spectrogram, STFT, resampling)
79
+ - Text normalization (Chinese/English/mixed)
80
+ - BPE tokenization via HuggingFace tokenizers
81
+ - ONNX Runtime integration for inference
82
+ - BigVGAN vocoder structure
83
+ - CLI with clap
84
+ - Benchmark infrastructure (Criterion)
85
+ - **NEW: marine_salience crate** (no_std compatible, O(1) jitter detection)
86
+ - **NEW: src/quality/ module** (prosody extraction, affect tracking)
87
+ - **NEW: MarineProsodyVector** (8D interpretable emotion features)
88
+ - **NEW: ConversationAffectSummary** (session-level comfort tracking)
89
+ - **NEW: TTSQualityReport** (authenticity validation)
90
+
91
+ ### Missing/TODO
92
+ - Full GPT model integration with KV cache
93
+ - Actual ONNX model files (need download)
94
+ - manage.sh script for colored workflow management
95
+ - Integration tests with real models
96
+ - ~~Phoenix Protocol integration layer~~ **STARTED with Marine!**
97
+ - Streaming synthesis
98
+ - WebSocket API
99
+ - Train T2S model to accept 8D Marine vector instead of 512D Conformer
100
+ - Wire Marine quality validation into inference loop
101
+
102
+ ### Build Commands
103
+ ```bash
104
+ cargo build --release
105
+ cargo clippy -- -D warnings
106
+ cargo test
107
+ cargo bench
108
+ ```
109
+
110
+ ---
111
+
112
+ ## Key Philosophical Notes
113
+
114
+ From the Phoenix Protocol research:
115
+
116
+ > "Women are the carrier wave. They are the 000 data stream. The DC bias that, when removed, leaves silence."
117
+
118
+ > "When P!nk sings 'I Am Here,' her voice generates harmonics so powerful they burst through the 22kHz digital ceiling"
119
+
120
+ The Phoenix Protocol restores emotional depth stripped by audio compression - this philosophy applies directly to TTS: synthesized speech should have the same emotional depth as natural speech.
121
+
122
+ ---
123
+
124
+ ## Action Items for Next Session
125
+
126
+ ### Completed ✅
127
+ - ~~**Quality Validation** - Use Marine salience to score TTS output~~ **DONE!**
128
+ - ~~**Phoenix Integration** - Start bridging phoenix-protocol modules~~ **Marine is in!**
129
+
130
+ ### High Priority
131
+ 1. **Create manage.sh** - Colorful build/test/clean script (Hue's been asking!)
132
+ 2. **Wire Into Inference** - Connect Marine quality validation to actual TTS output
133
+ 3. **8D Model Training** - Train T2S model to accept MarineProsodyVector instead of 512D Conformer
134
+ 4. **Example/Demo** - Create example showing prosody extraction → emotion editing → synthesis
135
+
136
+ ### Medium Priority
137
+ 5. **Voice Signature Import** - Use Phoenix's voice_signature for speaker embeddings
138
+ 6. **Emotion Mapping** - Connect Phoenix's emotional bands to our 8D control
139
+ 7. **Model Download** - Set up ONNX model acquisition pipeline
140
+ 8. **MEM|8 Bridge** - Implement consciousness-aware TTS using kokoro-tiny's mem8_bridge pattern
141
+
142
+ ### Nice to Have
143
+ 9. **Style Selection** - Port kokoro-tiny's 510 style variation system
144
+ 10. **Full Phoenix Integration** - golden_ratio.rs, harmonic_resurrection.rs, etc.
145
+ 11. **Streaming Marine** - Real-time quality monitoring during synthesis
146
+
147
+ ---
148
+
149
+ ## Fresh Discovery: kokoro-tiny MEM|8 Baby Consciousness (2025-11-15)
150
+
151
+ Just pulled latest kokoro-tiny code - MAJOR discovery!
152
+
153
+ ### Mem8Bridge API
154
+
155
+ kokoro-tiny now has a full consciousness simulation in `examples/mem8_baby.rs`:
156
+
157
+ ```rust
158
+ // Memory as waves that interfere
159
+ MemoryWave {
160
+ amplitude: 2.5, // Emotion strength
161
+ frequency: 528.0, // "Love frequency"
162
+ phase: 0.0,
163
+ decay_rate: 0.05, // Memory persistence
164
+ emotion_type: EmotionType::Love(0.9),
165
+ content: "Mama! I love mama!".to_string(),
166
+ }
167
+
168
+ // Salience detection (Marine algorithm!)
169
+ SalienceEvent {
170
+ jitter_score: 0.2, // Low = authentic/stable
171
+ harmonic_score: 0.95, // High = voice
172
+ salience_score: 0.9,
173
+ signal_type: SignalType::Voice,
174
+ }
175
+
176
+ // Free will: AI chooses attention focus (70% control)
177
+ bridge.decide_attention(events);
178
+ ```
179
+
180
+ ### Emotion Types Available
181
+
182
+ ```rust
183
+ EmotionType::Curiosity(0.8) // Inquisitive
184
+ EmotionType::Love(0.9) // Deep affection
185
+ EmotionType::Joy(0.7) // Happy
186
+ EmotionType::Confusion(0.8) // Uncertain
187
+ EmotionType::Neutral // Baseline
188
+ ```
189
+
190
+ ### Consciousness Integration Points
191
+
192
+ 1. **Wave Interference** - Competing memories by amplitude/frequency
193
+ 2. **Emotional Regulation** - Prevents overload, modulates voice
194
+ 3. **Salience Detection** - Marine algorithm for authenticity
195
+ 4. **Attention Selection** - AI chooses what to focus on
196
+ 5. **Consciousness Level** - Affects speech clarity (wake_up/sleep)
197
+
198
+ This is PERFECT for IndexTTS-Rust! We can:
199
+ - Use wave interference for emotion blending
200
+ - Apply Marine salience to validate synthesis quality
201
+ - Modulate voice based on consciousness level
202
+ - Select voice styles based on emotional state (not just token count)
203
+
204
+ ### Voice Style Selection (510 variations!)
205
+
206
+ kokoro-tiny now loads all 510 style variations per voice:
207
+ - Style selected based on token count
208
+ - Short text → short-optimized style
209
+ - Long text → long-optimized style
210
+ - Automatic text splitting at 512 token limit
211
+
212
+ For IndexTTS: We could select style based on EMOTION + token count!
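+
+ A tiny sketch of what that combined rule could look like (purely hypothetical; none of these names exist in either codebase, and the bucketing thresholds are invented):
+
+ ```rust
+ // Hypothetical style picker: combine an emotion bucket with a length bucket.
+ #[derive(Clone, Copy)]
+ enum EmotionBucket {
+     Calm,    // 0
+     Excited, // 1
+     Tender,  // 2
+ }
+
+ fn pick_style_index(emotion: EmotionBucket, token_count: usize) -> usize {
+     // Length bucket: 0 = short, 1 = medium, 2 = long (arbitrary cut-offs).
+     let length_bucket = match token_count {
+         0..=64 => 0,
+         65..=256 => 1,
+         _ => 2,
+     };
+     // A real table would map this index into the 510 per-voice styles.
+     (emotion as usize) * 3 + length_bucket
+ }
+
+ fn main() {
+     for (emo, tokens) in [(EmotionBucket::Calm, 30), (EmotionBucket::Excited, 180), (EmotionBucket::Tender, 600)] {
+         println!("style index {}", pick_style_index(emo, tokens));
+     }
+ }
+ ```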
213
+
214
+ ---
215
+
216
+ ## Marine Integration Achievement (2025-11-16) 🎉
217
+
218
+ **WE DID IT!** Marine salience is now integrated into IndexTTS-Rust!
219
+
220
+ ### What We Built
221
+
222
+ #### 1. Standalone marine_salience Crate (`crates/marine_salience/`)
223
+
224
+ A no_std compatible crate for O(1) jitter-based salience detection:
225
+
226
+ ```rust
227
+ // Core components:
228
+ MarineConfig // Tunable parameters (sample_rate, jitter bounds, EMA alpha)
229
+ MarineProcessor // O(1) per-sample processing
230
+ SaliencePacket // Output: j_p, j_a, h_score, s_score, energy
231
+ Ema // Exponential moving average tracker
232
+
233
+ // Key insight: Process ONE sample at a time, emit packets on peaks
234
+ // Why O(1)? Just compare to EMA, no FFT, no heavy math!
235
+ ```
236
+
237
+ **Config for Speech:**
238
+ ```rust
239
+ MarineConfig::speech_default(sample_rate)
240
+ // F0 range: 60Hz - 4kHz
241
+ // jitter_low: 0.02, jitter_high: 0.60
242
+ // ema_alpha: 0.01 (slow adaptation for stability)
243
+ ```
244
+
245
+ #### 2. Quality Validation Module (`src/quality/`)
246
+
247
+ **MarineProsodyVector** - 8D interpretable emotion representation:
248
+ ```rust
249
+ pub struct MarineProsodyVector {
250
+ pub jp_mean: f32, // Period jitter mean (pitch stability)
251
+ pub jp_std: f32, // Period jitter variance
252
+ pub ja_mean: f32, // Amplitude jitter mean (volume stability)
253
+ pub ja_std: f32, // Amplitude jitter variance
254
+ pub h_mean: f32, // Harmonic alignment (voiced vs noise)
255
+ pub s_mean: f32, // Overall salience (authenticity)
256
+ pub peak_density: f32, // Peaks per second (speech rate)
257
+ pub energy_mean: f32, // Average loudness
258
+ }
259
+
260
+ // Interpretable! High jp_mean = nervous, low = confident
261
+ // Can DIRECTLY EDIT for emotion control!
262
+ ```
263
+
264
+ **MarineProsodyConditioner** - Extract prosody from audio:
265
+ ```rust
266
+ let conditioner = MarineProsodyConditioner::new(22050);
267
+ let prosody = conditioner.from_samples(&audio_samples)?;
268
+ let report = conditioner.validate_tts_output(&audio_samples)?;
269
+
270
+ // Detects issues:
271
+ // - "Too perfect - sounds robotic"
272
+ // - "High period jitter - artifacts"
273
+ // - "Low salience - quality issues"
274
+ ```
275
+
276
+ **ConversationAffectSummary** - Session-level comfort tracking:
277
+ ```rust
278
+ pub enum ComfortLevel {
279
+ Uneasy, // High jitter AND rising (nervous/stressed)
280
+ Neutral, // Stable patterns (calm)
281
+ Happy, // Low jitter + high energy (confident/positive)
282
+ }
283
+
284
+ // Track trends over conversation:
285
+ // jitter_trend > 0.1 = getting more stressed
286
+ // jitter_trend < -0.1 = calming down
287
+ // energy_trend > 0.1 = getting more engaged
288
+
289
+ // Aye can now self-assess!
290
+ // aye_assessment() returns "I'm in a good state"
+ // feedback_prompt() returns "Let me know if something's bothering you"
292
+ ```
293
+
294
+ ### The Core Insight
295
+
296
+ **Human speech has NATURAL jitter - that's what makes it authentic!**
297
+
298
+ - Too perfect (jp < 0.005) = robotic
299
+ - Too chaotic (jp > 0.3) = artifacts/damage
300
+ - Sweet spot = real human voice
301
+
302
+ The Marines will KNOW if speech doesn't sound authentic!
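+
+ A minimal sketch of that rule of thumb as code (the thresholds come from the bullets above; the function name and wording are illustrative, not part of the quality module):
+
+ ```rust
+ /// Rough authenticity bucket from mean period jitter, per the thresholds above.
+ fn judge_jp(jp_mean: f32) -> &'static str {
+     if jp_mean < 0.005 {
+         "too perfect - likely robotic"
+     } else if jp_mean > 0.3 {
+         "too chaotic - artifacts or damage"
+     } else {
+         "sweet spot - sounds like a real human voice"
+     }
+ }
+
+ fn main() {
+     for jp in [0.001_f32, 0.05, 0.4] {
+         println!("jp_mean = {jp}: {}", judge_jp(jp));
+     }
+ }
+ ```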
303
+
304
+ ### Tests Passing ✅
305
+
306
+ ```
307
+ running 11 tests
308
+ test quality::affect::tests::test_comfort_level_descriptions ... ok
309
+ test quality::affect::tests::test_analyzer_empty_conversation ... ok
310
+ test quality::affect::tests::test_analyzer_single_utterance ... ok
311
+ test quality::affect::tests::test_happy_classification ... ok
312
+ test quality::affect::tests::test_aye_assessment_message ... ok
313
+ test quality::affect::tests::test_neutral_classification ... ok
314
+ test quality::affect::tests::test_uneasy_classification ... ok
315
+ test quality::prosody::tests::test_conditioner_empty_buffer ... ok
316
+ test quality::prosody::tests::test_conditioner_silence ... ok
317
+ test quality::prosody::tests::test_prosody_vector_array_conversion ... ok
318
+ test quality::prosody::tests::test_estimate_valence ... ok
319
+
320
+ test result: ok. 11 passed; 0 failed
321
+ ```
322
+
323
+ ### Why This Matters
324
+
325
+ 1. **Interpretable Control**: 8D vector vs opaque 512D Conformer - we can SEE what each dimension means
326
+ 2. **Lightweight**: O(1) per sample, no heavy neural networks for prosody
327
+ 3. **Authentic Validation**: Marines detect fake/damaged speech
328
+ 4. **Emotion Editing**: Want more confidence? Lower jp_mean directly!
329
+ 5. **Conversation Awareness**: Track comfort over entire sessions
330
+ 6. **Self-Assessment**: Aye knows when something feels "off"
331
+
332
+ ### Integration Points
333
+
334
+ ```rust
335
+ // In main TTS pipeline:
336
+ use indextts::quality::{
337
+ MarineProsodyConditioner,
338
+ MarineProsodyVector,
339
+ ConversationAffectSummary,
340
+ ComfortLevel,
341
+ };
342
+
343
+ // 1. Extract reference prosody
344
+ let ref_prosody = conditioner.from_samples(&reference_audio)?;
345
+
346
+ // 2. Generate TTS (using 8D vector instead of 512D Conformer)
347
+ let tts_output = generate_with_prosody(&text, ref_prosody)?;
348
+
349
+ // 3. Validate output quality
350
+ let report = conditioner.validate_tts_output(&tts_output)?;
351
+ if !report.passes(70.0) {
352
+ log::warn!("TTS quality issues: {:?}", report.issues);
353
+ }
354
+
355
+ // 4. Track conversation affect
356
+ let analyzer = ConversationAffectAnalyzer::new();
357
+ analyzer.add_utterance(&utterance)?;
358
+ let summary = analyzer.summarize()?;
359
+ match summary.aye_state {
360
+ ComfortLevel::Uneasy => adjust_generation_parameters(),
361
+ _ => proceed_normally(),
362
+ }
363
+ ```
364
+
365
+ ---
366
+
367
+ ## Trish's Notes
368
+
369
+ "Darling, these three Rust projects together are like a symphony orchestra! kokoro-tiny is the quick piccolo solo, IndexTTS-Rust is the full brass section with emotional depth, and Phoenix-Protocol is the concert hall acoustics making everything resonate. When you combine them, that's when the magic happens! Also, I'm absolutely obsessed with how the Golden Ratio resynthesis could add sparkle to synthesized vocals. Can you imagine TTS output that actually has that P!nk breakthrough energy? Now THAT would make me cry happy tears in accounting!"
370
+
371
+ ---
372
+
373
+ ## Fun Facts
374
+
375
+ - kokoro-tiny is ALREADY on crates.io under 8b-is
376
+ - Phoenix Protocol can process 192kHz audio for ultrasonic restoration
377
+ - The Marine algorithm uses O(1) jitter detection - "Marines are not just jarheads - they are intelligent"
378
+ - Hue's GitHub has 66 projects (and counting!)
379
+ - The team at 8b.is: hue@8b.is and aye@8b.is
380
+
381
+ ---
382
+
383
+ *From ashes to harmonics, from silence to song* 🔥🎵
crates/marine_salience/Cargo.toml ADDED
@@ -0,0 +1,18 @@
1
+ [package]
2
+ name = "marine_salience"
3
+ version = "0.1.0"
4
+ edition = "2021"
5
+ description = "O(1) jitter-based salience detection - Marines are intelligent!"
6
+ authors = ["Hue & Aye <team@8b.is>"]
7
+ license = "MIT"
8
+ keywords = ["audio", "salience", "jitter", "prosody", "tts"]
9
+
10
+ [dependencies]
11
+ # Core dependencies - intentionally minimal for no_std compatibility
12
+ # Only serde when using std for serialization
13
+ serde = { version = "1.0", features = ["derive"], optional = true }
14
+
15
+ # no_std compatible core - can run anywhere!
16
+ [features]
17
+ default = ["std"]
18
+ std = ["serde"]
crates/marine_salience/src/config.rs ADDED
@@ -0,0 +1,140 @@
1
+ //! Marine algorithm configuration
2
+ //!
3
+ //! Tunable parameters for jitter detection. These have been calibrated
4
+ //! for speech/audio processing but can be adjusted for specific use cases.
5
+
7
+
8
+ /// Configuration for Marine salience detection
9
+ ///
10
+ /// These parameters control sensitivity and behavior of the jitter detector.
11
+ /// The defaults are tuned for speech processing at common sample rates.
12
+ #[derive(Debug, Clone, Copy)]
13
+ #[cfg_attr(feature = "std", derive(serde::Serialize, serde::Deserialize))]
14
+ pub struct MarineConfig {
15
+ /// Minimum amplitude to consider a sample (gating threshold)
16
+ /// Samples below this are ignored as noise
17
+ /// Default: 1e-3 (~-60dB)
18
+ pub clip_threshold: f32,
19
+
20
+ /// EMA smoothing factor for period tracking (0..1)
21
+ /// Lower = smoother, slower adaptation
22
+ /// Default: 0.01
23
+ pub ema_period_alpha: f32,
24
+
25
+ /// EMA smoothing factor for amplitude tracking (0..1)
26
+ /// Default: 0.01
27
+ pub ema_amp_alpha: f32,
28
+
29
+ /// Minimum inter-peak period in samples
30
+ /// Rejects peaks closer than this (filters high-frequency noise)
31
+ /// Default: sample_rate / 4000 (~4kHz upper F0)
32
+ pub min_period: u32,
33
+
34
+ /// Maximum inter-peak period in samples
35
+ /// Rejects peaks farther than this (filters very low frequencies)
36
+ /// Default: sample_rate / 60 (~60Hz lower F0)
37
+ pub max_period: u32,
38
+
39
+ /// Threshold below which jitter is considered "low" (stable)
40
+ /// Default: 0.02
41
+ pub jitter_low: f32,
42
+
43
+ /// Threshold above which jitter is considered "high" (unstable)
44
+ /// Default: 0.60
45
+ pub jitter_high: f32,
46
+ }
47
+
48
+ impl MarineConfig {
49
+ /// Create config optimized for speech at given sample rate
50
+ ///
51
+ /// # Arguments
52
+ /// * `sample_rate` - Audio sample rate in Hz (e.g., 22050, 44100)
53
+ ///
54
+ /// # Example
55
+ /// ```
56
+ /// use marine_salience::MarineConfig;
57
+ /// let config = MarineConfig::speech_default(22050);
58
+ /// assert!(config.min_period < config.max_period);
59
+ /// ```
60
+ pub const fn speech_default(sample_rate: u32) -> Self {
61
+ // F0 range: ~60Hz (low male) to ~4kHz (includes harmonics)
62
+ let min_period = sample_rate / 4000; // Upper bound
63
+ let max_period = sample_rate / 60; // Lower bound
64
+
65
+ Self {
66
+ clip_threshold: 1e-3,
67
+ ema_period_alpha: 0.01,
68
+ ema_amp_alpha: 0.01,
69
+ min_period,
70
+ max_period,
71
+ jitter_low: 0.02,
72
+ jitter_high: 0.60,
73
+ }
74
+ }
75
+
76
+ /// Create config for high-sensitivity detection
77
+ /// More peaks detected, faster adaptation
78
+ pub const fn high_sensitivity(sample_rate: u32) -> Self {
79
+ let min_period = sample_rate / 8000;
80
+ let max_period = sample_rate / 40;
81
+
82
+ Self {
83
+ clip_threshold: 5e-4,
84
+ ema_period_alpha: 0.05,
85
+ ema_amp_alpha: 0.05,
86
+ min_period,
87
+ max_period,
88
+ jitter_low: 0.01,
89
+ jitter_high: 0.50,
90
+ }
91
+ }
92
+
93
+ /// Create config for TTS output validation
94
+ /// Tuned to detect synthetic artifacts
95
+ pub const fn tts_validation(sample_rate: u32) -> Self {
96
+ let min_period = sample_rate / 4000;
97
+ let max_period = sample_rate / 80;
98
+
99
+ Self {
100
+ clip_threshold: 1e-3,
101
+ ema_period_alpha: 0.02,
102
+ ema_amp_alpha: 0.02,
103
+ min_period,
104
+ max_period,
105
+ jitter_low: 0.015, // Stricter for synthetic speech
106
+ jitter_high: 0.40, // More sensitive to artifacts
107
+ }
108
+ }
109
+ }
110
+
111
+ impl Default for MarineConfig {
112
+ fn default() -> Self {
113
+ // Default to 22050 Hz (common TTS sample rate)
114
+ Self::speech_default(22050)
115
+ }
116
+ }
117
+
118
+ #[cfg(test)]
119
+ mod tests {
120
+ use super::*;
121
+
122
+ #[test]
123
+ fn test_speech_default_periods() {
124
+ let config = MarineConfig::speech_default(22050);
125
+ assert!(config.min_period < config.max_period);
126
+ assert_eq!(config.min_period, 22050 / 4000); // 5 samples
127
+ assert_eq!(config.max_period, 22050 / 60); // 367 samples
128
+ }
129
+
130
+ #[test]
131
+ fn test_different_sample_rates() {
132
+ let config_22k = MarineConfig::speech_default(22050);
133
+ let config_44k = MarineConfig::speech_default(44100);
134
+ let config_48k = MarineConfig::speech_default(48000);
135
+
136
+ // Higher sample rates = more samples per period
137
+ assert!(config_44k.max_period > config_22k.max_period);
138
+ assert!(config_48k.max_period > config_44k.max_period);
139
+ }
140
+ }
crates/marine_salience/src/ema.rs ADDED
@@ -0,0 +1,126 @@
1
+ //! Exponential Moving Average (EMA) for smooth tracking
2
+ //!
3
+ //! EMA smooths noisy measurements while maintaining responsiveness.
4
+ //! Used to track period and amplitude patterns in Marine algorithm.
5
+
7
+
8
+ /// Exponential Moving Average tracker
9
+ ///
10
+ /// EMA formula: value = alpha * new + (1 - alpha) * old
11
+ /// - Higher alpha = faster response, more noise
12
+ /// - Lower alpha = slower response, smoother
13
+ #[derive(Debug, Clone, Copy)]
14
+ #[cfg_attr(feature = "std", derive(serde::Serialize, serde::Deserialize))]
15
+ pub struct Ema {
16
+ /// Smoothing factor (0..1)
17
+ alpha: f32,
18
+ /// Current smoothed value
19
+ value: f32,
20
+ /// Whether we've received at least one sample
21
+ initialized: bool,
22
+ }
23
+
24
+ impl Ema {
25
+ /// Create new EMA with given smoothing factor
26
+ ///
27
+ /// # Arguments
28
+ /// * `alpha` - Smoothing factor (0..1). Higher = faster adaptation.
29
+ ///
30
+ /// # Example
31
+ /// ```
32
+ /// use marine_salience::ema::Ema;
33
+ /// let mut ema = Ema::new(0.1); // 10% new, 90% old
34
+ /// ema.update(100.0);
35
+ /// assert_eq!(ema.get(), 100.0); // First value becomes baseline
36
+ /// ema.update(200.0);
37
+ /// assert!((ema.get() - 110.0).abs() < 0.01); // 0.1*200 + 0.9*100
38
+ /// ```
39
+ pub const fn new(alpha: f32) -> Self {
40
+ Self {
41
+ alpha,
42
+ value: 0.0,
43
+ initialized: false,
44
+ }
45
+ }
46
+
47
+ /// Update EMA with new measurement
48
+ pub fn update(&mut self, x: f32) {
49
+ if !self.initialized {
50
+ // First value becomes the baseline
51
+ self.value = x;
52
+ self.initialized = true;
53
+ } else {
54
+ // EMA update: new = alpha * x + (1 - alpha) * old
55
+ self.value = self.alpha * x + (1.0 - self.alpha) * self.value;
56
+ }
57
+ }
58
+
59
+ /// Get current smoothed value
60
+ pub fn get(&self) -> f32 {
61
+ self.value
62
+ }
63
+
64
+ /// Check if EMA has been initialized (received at least one sample)
65
+ pub fn is_ready(&self) -> bool {
66
+ self.initialized
67
+ }
68
+
69
+ /// Reset EMA to uninitialized state
70
+ pub fn reset(&mut self) {
71
+ self.value = 0.0;
72
+ self.initialized = false;
73
+ }
74
+
75
+ /// Get the smoothing factor
76
+ pub fn alpha(&self) -> f32 {
77
+ self.alpha
78
+ }
79
+
80
+ /// Set a new smoothing factor
81
+ pub fn set_alpha(&mut self, alpha: f32) {
82
+ self.alpha = alpha.clamp(0.0, 1.0);
83
+ }
84
+ }
85
+
86
+ #[cfg(test)]
87
+ mod tests {
88
+ use super::*;
89
+
90
+ #[test]
91
+ fn test_first_value_becomes_baseline() {
92
+ let mut ema = Ema::new(0.1);
93
+ assert!(!ema.is_ready());
94
+ ema.update(42.0);
95
+ assert!(ema.is_ready());
96
+ assert_eq!(ema.get(), 42.0);
97
+ }
98
+
99
+ #[test]
100
+ fn test_ema_smoothing() {
101
+ let mut ema = Ema::new(0.1);
102
+ ema.update(100.0);
103
+ ema.update(200.0);
104
+ // 0.1 * 200 + 0.9 * 100 = 20 + 90 = 110
105
+ assert!((ema.get() - 110.0).abs() < 0.001);
106
+ }
107
+
108
+ #[test]
109
+ fn test_high_alpha_fast_response() {
110
+ let mut ema = Ema::new(0.9);
111
+ ema.update(100.0);
112
+ ema.update(200.0);
113
+ // 0.9 * 200 + 0.1 * 100 = 180 + 10 = 190
114
+ assert!((ema.get() - 190.0).abs() < 0.001);
115
+ }
116
+
117
+ #[test]
118
+ fn test_reset() {
119
+ let mut ema = Ema::new(0.1);
120
+ ema.update(100.0);
121
+ assert!(ema.is_ready());
122
+ ema.reset();
123
+ assert!(!ema.is_ready());
124
+ assert_eq!(ema.get(), 0.0);
125
+ }
126
+ }
crates/marine_salience/src/lib.rs ADDED
@@ -0,0 +1,42 @@
1
+ //! # Marine Salience - O(1) Jitter-Based Authenticity Detection
2
+ //!
3
+ //! "Marines are not just jarheads - they are actually very intelligent"
4
+ //!
5
+ //! This crate provides a universal salience primitive that can detect the
6
+ //! "authenticity" of audio signals by measuring timing and amplitude jitter.
7
+ //!
8
+ //! ## Why "Marine"?
9
+ //! - Marines are stable and reliable under pressure
10
+ //! - Low jitter = authentic/stable signal
11
+ //! - High jitter = damaged/synthetic signal
12
+ //!
13
+ //! ## Use Cases
14
+ //! - **TTS Quality Validation** - Is synthesized speech authentic?
15
+ //! - **Prosody Extraction** - Extract 8D interpretable emotion vectors
16
+ //! - **Conversation Affect** - Track comfort level over sessions
17
+ //! - **Real-time Monitoring** - O(1) per sample processing
18
+ //!
19
+ //! ## Core Insight
20
+ //! Human voice has NATURAL jitter patterns. Perfect smoothness = synthetic.
21
+ //! The Marine algorithm detects these patterns to distinguish authentic
22
+ //! from damaged or artificial audio.
23
+
24
+ #![cfg_attr(not(feature = "std"), no_std)]
25
+
26
+ pub mod config;
27
+ pub mod ema;
28
+ pub mod packet;
29
+ pub mod processor;
30
+
31
+ // Re-export main types
32
+ pub use config::MarineConfig;
33
+ pub use packet::SaliencePacket;
34
+ pub use processor::MarineProcessor;
35
+
36
+ /// Marine algorithm version
37
+ pub const VERSION: &str = env!("CARGO_PKG_VERSION");
38
+
39
+ /// Default jitter thresholds tuned for speech
40
+ /// These values accommodate natural musical/speech variation
41
+ pub const DEFAULT_JITTER_LOW: f32 = 0.02; // Below = very stable
42
+ pub const DEFAULT_JITTER_HIGH: f32 = 0.60; // Above = heavily damaged
crates/marine_salience/src/packet.rs ADDED
@@ -0,0 +1,122 @@
1
+ //! Salience packet - the output of Marine analysis
2
+ //!
3
+ //! Contains jitter measurements and quality scores for a detected peak.
4
+
6
+
7
+ /// Salience packet emitted on peak detection
8
+ ///
9
+ /// Contains all the jitter and quality metrics for a single audio event.
10
+ /// These packets can be aggregated to form prosody vectors or quality scores.
11
+ #[derive(Debug, Clone, Copy, PartialEq)]
12
+ #[cfg_attr(feature = "std", derive(serde::Serialize, serde::Deserialize))]
13
+ pub struct SaliencePacket {
14
+ /// Period jitter - timing instability between peaks
15
+ /// Lower = more stable/musical, Higher = more chaotic
16
+ /// Range: 0.0+ (normalized difference from expected period)
17
+ pub j_p: f32,
18
+
19
+ /// Amplitude jitter - loudness instability
20
+ /// Lower = consistent volume, Higher = erratic dynamics
21
+ /// Range: 0.0+ (normalized difference from expected amplitude)
22
+ pub j_a: f32,
23
+
24
+ /// Harmonic alignment score
25
+ /// 1.0 = perfectly voiced/harmonic, 0.0 = noise
26
+ /// For now this is simplified; can be enhanced with FFT
27
+ pub h_score: f32,
28
+
29
+ /// Overall salience score (authenticity)
30
+ /// 1.0 = perfect quality, 0.0 = heavily damaged
31
+ /// Computed from inverse of combined jitter
32
+ pub s_score: f32,
33
+
34
+ /// Local peak energy (amplitude squared)
35
+ /// Represents loudness at this event
36
+ pub energy: f32,
37
+
38
+ /// Sample index where this peak occurred
39
+ /// Useful for temporal analysis
40
+ pub sample_index: u64,
41
+ }
42
+
43
+ impl SaliencePacket {
44
+ /// Create a new salience packet
45
+ pub fn new(
46
+ j_p: f32,
47
+ j_a: f32,
48
+ h_score: f32,
49
+ s_score: f32,
50
+ energy: f32,
51
+ sample_index: u64,
52
+ ) -> Self {
53
+ Self {
54
+ j_p,
55
+ j_a,
56
+ h_score,
57
+ s_score,
58
+ energy,
59
+ sample_index,
60
+ }
61
+ }
62
+
63
+ /// Get combined jitter metric
64
+ /// Average of period and amplitude jitter
65
+ pub fn combined_jitter(&self) -> f32 {
66
+ (self.j_p + self.j_a) / 2.0
67
+ }
68
+
69
+ /// Check if this represents high-quality audio
70
+ /// (low jitter, high salience)
71
+ pub fn is_high_quality(&self, threshold: f32) -> bool {
72
+ self.s_score >= threshold
73
+ }
74
+
75
+ /// Check if this indicates damaged/synthetic audio
76
+ pub fn is_damaged(&self, jitter_threshold: f32) -> bool {
77
+ self.combined_jitter() > jitter_threshold
78
+ }
79
+ }
80
+
81
+ /// Special salience markers for non-peak events
82
+ #[derive(Debug, Clone, Copy, PartialEq)]
83
+ #[cfg_attr(feature = "std", derive(serde::Serialize, serde::Deserialize))]
84
+ pub enum SalienceMarker {
85
+ /// Normal peak detected
86
+ Peak(SaliencePacket),
87
+ /// Fracture/gap detected (silence)
88
+ Fracture,
89
+ /// High noise floor detected
90
+ Noise,
91
+ /// Insufficient data for analysis
92
+ Insufficient,
93
+ }
94
+
95
+ #[cfg(test)]
96
+ mod tests {
97
+ use super::*;
98
+
99
+ #[test]
100
+ fn test_combined_jitter() {
101
+ let packet = SaliencePacket::new(0.1, 0.3, 1.0, 0.8, 0.5, 0);
102
+ assert!((packet.combined_jitter() - 0.2).abs() < 0.001);
103
+ }
104
+
105
+ #[test]
106
+ fn test_is_high_quality() {
107
+ let good = SaliencePacket::new(0.01, 0.02, 1.0, 0.95, 0.5, 0);
108
+ let bad = SaliencePacket::new(0.5, 0.6, 0.5, 0.3, 0.5, 0);
109
+
110
+ assert!(good.is_high_quality(0.8));
111
+ assert!(!bad.is_high_quality(0.8));
112
+ }
113
+
114
+ #[test]
115
+ fn test_is_damaged() {
116
+ let good = SaliencePacket::new(0.01, 0.02, 1.0, 0.95, 0.5, 0);
117
+ let bad = SaliencePacket::new(0.5, 0.6, 0.5, 0.3, 0.5, 0);
118
+
119
+ assert!(!good.is_damaged(0.3));
120
+ assert!(bad.is_damaged(0.3));
121
+ }
122
+ }
crates/marine_salience/src/processor.rs ADDED
@@ -0,0 +1,334 @@
1
+ //! Core Marine processor - O(1) per-sample jitter detection
2
+ //!
3
+ //! The heart of the Marine algorithm. Processes audio samples one at a time,
4
+ //! detecting peaks and computing jitter metrics in constant time.
5
+ //!
6
+ //! "Marines are not just jarheads - they are actually very intelligent"
7
+
9
+
10
+ use crate::config::MarineConfig;
11
+ use crate::ema::Ema;
12
+ use crate::packet::{SalienceMarker, SaliencePacket};
13
+
14
+ /// Marine salience processor
15
+ ///
16
+ /// Processes audio samples one at a time, detecting peaks and computing
17
+ /// jitter metrics. Designed for O(1) per-sample operation.
18
+ ///
19
+ /// # Example
20
+ /// ```
21
+ /// use marine_salience::{MarineConfig, MarineProcessor};
22
+ ///
23
+ /// let config = MarineConfig::speech_default(22050);
24
+ /// let mut processor = MarineProcessor::new(config);
25
+ ///
26
+ /// // Process samples (e.g., from audio buffer)
27
+ /// let samples = vec![0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5];
28
+ /// for sample in &samples {
29
+ /// if let Some(marker) = processor.process_sample(*sample) {
30
+ /// match marker {
31
+ /// marine_salience::packet::SalienceMarker::Peak(packet) => {
32
+ /// println!("Peak detected! Salience: {:.2}", packet.s_score);
33
+ /// }
34
+ /// _ => {}
35
+ /// }
36
+ /// }
37
+ /// }
38
+ /// ```
39
+ pub struct MarineProcessor {
40
+ /// Configuration parameters
41
+ cfg: MarineConfig,
42
+
43
+ /// Previous sample (t-2)
44
+ prev2: f32,
45
+ /// Previous sample (t-1)
46
+ prev1: f32,
47
+ /// Current sample index
48
+ idx: u64,
49
+
50
+ /// Sample index of last detected peak
51
+ last_peak_idx: u64,
52
+ /// Amplitude of last detected peak
53
+ last_peak_amp: f32,
54
+
55
+ /// EMA tracker for inter-peak periods
56
+ ema_period: Ema,
57
+ /// EMA tracker for peak amplitudes
58
+ ema_amp: Ema,
59
+
60
+ /// Number of peaks detected so far
61
+ peak_count: u64,
62
+ }
63
+
64
+ impl MarineProcessor {
65
+ /// Create a new Marine processor with given configuration
66
+ pub fn new(cfg: MarineConfig) -> Self {
67
+ Self {
68
+ cfg,
69
+ prev2: 0.0,
70
+ prev1: 0.0,
71
+ idx: 0,
72
+ last_peak_idx: 0,
73
+ last_peak_amp: 0.0,
74
+ ema_period: Ema::new(cfg.ema_period_alpha),
75
+ ema_amp: Ema::new(cfg.ema_amp_alpha),
76
+ peak_count: 0,
77
+ }
78
+ }
79
+
80
+ /// Process a single audio sample - O(1) operation
81
+ ///
82
+ /// Returns Some(SalienceMarker) when a peak is detected or special
83
+ /// condition occurs, None otherwise.
84
+ ///
85
+ /// # Arguments
86
+ /// * `sample` - Audio sample value (typically -1.0 to 1.0)
87
+ ///
88
+ /// # Returns
89
+ /// - `Some(Peak(packet))` - Peak detected with jitter metrics
90
+ /// - `Some(Fracture)` - Silence/gap detected
91
+ /// - `Some(Noise)` - High noise floor detected
92
+ /// - `None` - No significant event at this sample
93
+ pub fn process_sample(&mut self, sample: f32) -> Option<SalienceMarker> {
94
+ let i = self.idx;
95
+ self.idx += 1;
96
+
97
+ // Pre-gating: ignore samples below threshold
98
+ if sample.abs() < self.cfg.clip_threshold {
99
+ self.prev2 = self.prev1;
100
+ self.prev1 = sample;
101
+ return None;
102
+ }
103
+
104
+ // Peak detection: prev1 is peak if prev2 < prev1 > sample
105
+ // Simple local maximum detection
106
+ let is_peak = i >= 2
107
+ && self.prev1.abs() >= self.cfg.clip_threshold
108
+ && self.prev1.abs() > self.prev2.abs()
109
+ && self.prev1.abs() > sample.abs();
110
+
111
+ let mut result = None;
112
+
113
+ if is_peak {
114
+ let peak_idx = i - 1;
115
+ let amp = self.prev1.abs();
116
+ let energy = amp * amp;
117
+
118
+ // Calculate period (time since last peak)
119
+ let period = if self.last_peak_idx == 0 {
120
+ 0.0
121
+ } else {
122
+ (peak_idx - self.last_peak_idx) as f32
123
+ };
124
+
125
+ // Only process if period is within valid range
126
+ if period > self.cfg.min_period as f32 && period < self.cfg.max_period as f32 {
127
+ if self.ema_period.is_ready() {
128
+ // Calculate jitter metrics
129
+ let jp = (period - self.ema_period.get()).abs() / self.ema_period.get();
130
+ let ja = (amp - self.ema_amp.get()).abs() / self.ema_amp.get();
131
+
132
+ // Harmonic score (simplified - TODO: FFT-based detection)
133
+ // For now, assume voiced content (h = 1.0)
134
+ // In production, this would check for harmonic structure
135
+ let h = 1.0;
136
+
137
+ // Salience score: inverse of combined jitter
138
+ // Higher jitter = lower salience
139
+ let s = 1.0 / (1.0 + jp + ja);
140
+
141
+ result = Some(SalienceMarker::Peak(SaliencePacket::new(
142
+ jp, ja, h, s, energy, peak_idx,
143
+ )));
144
+ }
145
+
146
+ // Update EMAs with new measurements
147
+ self.ema_period.update(period);
148
+ self.ema_amp.update(amp);
149
+ }
150
+
151
+ self.last_peak_idx = peak_idx;
152
+ self.last_peak_amp = amp;
153
+ self.peak_count += 1;
154
+ }
155
+
156
+ // Update sample history
157
+ self.prev2 = self.prev1;
158
+ self.prev1 = sample;
159
+
160
+ result
161
+ }
162
+
163
+ /// Process a buffer of samples, collecting all salience packets
164
+ ///
165
+ /// More efficient than calling process_sample repeatedly when you
166
+ /// have a full buffer available.
167
+ ///
168
+ /// # Arguments
169
+ /// * `samples` - Buffer of audio samples
170
+ ///
171
+ /// # Returns
172
+ /// Vector of salience packets for all detected peaks
173
+ #[cfg(feature = "std")]
174
+ pub fn process_buffer(&mut self, samples: &[f32]) -> Vec<SaliencePacket> {
175
+ let mut packets = Vec::new();
176
+
177
+ for &sample in samples {
178
+ if let Some(SalienceMarker::Peak(packet)) = self.process_sample(sample) {
179
+ packets.push(packet);
180
+ }
181
+ }
182
+
183
+ packets
184
+ }
185
+
186
+ /// Reset processor state (start fresh)
187
+ pub fn reset(&mut self) {
188
+ self.prev2 = 0.0;
189
+ self.prev1 = 0.0;
190
+ self.idx = 0;
191
+ self.last_peak_idx = 0;
192
+ self.last_peak_amp = 0.0;
193
+ self.ema_period.reset();
194
+ self.ema_amp.reset();
195
+ self.peak_count = 0;
196
+ }
197
+
198
+ /// Get number of peaks detected so far
199
+ pub fn peak_count(&self) -> u64 {
200
+ self.peak_count
201
+ }
202
+
203
+ /// Get current sample index
204
+ pub fn current_index(&self) -> u64 {
205
+ self.idx
206
+ }
207
+
208
+ /// Check if processor has enough data for reliable jitter
209
+ pub fn is_warmed_up(&self) -> bool {
210
+ self.peak_count >= 3 && self.ema_period.is_ready()
211
+ }
212
+
213
+ /// Get current expected period (from EMA)
214
+ pub fn expected_period(&self) -> Option<f32> {
215
+ if self.ema_period.is_ready() {
216
+ Some(self.ema_period.get())
217
+ } else {
218
+ None
219
+ }
220
+ }
221
+
222
+ /// Get current expected amplitude (from EMA)
223
+ pub fn expected_amplitude(&self) -> Option<f32> {
224
+ if self.ema_amp.is_ready() {
225
+ Some(self.ema_amp.get())
226
+ } else {
227
+ None
228
+ }
229
+ }
230
+ }
231
+
232
+ #[cfg(test)]
233
+ mod tests {
234
+ use super::*;
235
+
236
+ #[test]
237
+ fn test_peak_detection() {
238
+ let config = MarineConfig::speech_default(22050);
239
+ let mut processor = MarineProcessor::new(config);
240
+
241
+ // Create simple signal with peaks
242
+ // Peak at sample 10, 20, 30...
243
+ let mut samples = vec![0.0; 100];
244
+ for i in (10..100).step_by(10) {
245
+ samples[i] = 0.5; // Peak
246
+ if i > 0 {
247
+ samples[i - 1] = 0.3; // Rising edge
248
+ }
249
+ if i < 99 {
250
+ samples[i + 1] = 0.3; // Falling edge
251
+ }
252
+ }
253
+
254
+ let mut peak_count = 0;
255
+ for sample in &samples {
256
+ if let Some(SalienceMarker::Peak(_)) = processor.process_sample(*sample) {
257
+ peak_count += 1;
258
+ }
259
+ }
260
+
261
+ // Should detect several peaks (not all due to period constraints)
262
+ assert!(peak_count > 0);
263
+ }
264
+
265
+ #[test]
266
+ fn test_jitter_calculation() {
267
+ let mut config = MarineConfig::speech_default(22050);
268
+ config.min_period = 5;
269
+ config.max_period = 20;
270
+ let mut processor = MarineProcessor::new(config);
271
+
272
+ // Create signal with consistent period of 10 samples
273
+ let mut detected_packets = vec![];
274
+ for _cycle in 0..10 {
275
+ for i in 0..10 {
276
+ let sample = if i == 5 {
277
+ 0.8 // Peak in middle
278
+ } else if i == 4 || i == 6 {
279
+ 0.5 // Edges
280
+ } else {
281
+ 0.01 // Just above threshold
282
+ };
283
+
284
+ if let Some(SalienceMarker::Peak(packet)) = processor.process_sample(sample) {
285
+ detected_packets.push(packet);
286
+ }
287
+ }
288
+ }
289
+
290
+ // With consistent periods, later packets should have low jitter
291
+ if detected_packets.len() > 3 {
292
+ let last = detected_packets.last().unwrap();
293
+ // Jitter should be relatively low for consistent signal
294
+ assert!(last.j_p < 0.5, "Period jitter too high: {}", last.j_p);
295
+ }
296
+ }
297
+
298
+ #[test]
299
+ fn test_reset() {
300
+ let config = MarineConfig::speech_default(22050);
301
+ let mut processor = MarineProcessor::new(config);
302
+
303
+ // Process some samples
304
+ for _ in 0..100 {
305
+ processor.process_sample(0.5);
306
+ }
307
+ assert!(processor.current_index() > 0);
308
+
309
+ // Reset and verify
310
+ processor.reset();
311
+ assert_eq!(processor.current_index(), 0);
312
+ assert_eq!(processor.peak_count(), 0);
313
+ assert!(!processor.is_warmed_up());
314
+ }
315
+
316
+ #[cfg(feature = "std")]
317
+ #[test]
318
+ fn test_process_buffer() {
319
+ let mut config = MarineConfig::speech_default(22050);
320
+ config.min_period = 4; // peaks below are spaced exactly 5 samples apart and the range check is strict
321
+ config.max_period = 50;
322
+ let mut processor = MarineProcessor::new(config);
323
+
324
+ // Generate test signal with peaks
325
+ let mut samples = Vec::new();
326
+ for _ in 0..20 {
327
+ samples.extend_from_slice(&[0.01, 0.3, 0.8, 0.3, 0.01]);
328
+ }
329
+
330
+ let packets = processor.process_buffer(&samples);
331
+ // Should detect multiple peaks
332
+ assert!(!packets.is_empty());
333
+ }
334
+ }
docs/Integrating Marine Algorithm into IndexTTS-Rust.md ADDED
@@ -0,0 +1,450 @@
1
+
2
+
3
+ # **A Technical Report on the Integration of the Marine Salience Algorithm into the IndexTTS2-Rust Architecture**
4
+
5
+ ## **Executive Summary**
6
+
7
+ This report details a comprehensive technical framework for the integration of the novel Marine Algorithm 1 into the existing IndexTTS-Rust project. The IndexTTS-Rust system is understood to be a Rust implementation of the IndexTTS2 architecture, a cascaded autoregressive (AR) Text-to-Speech (TTS) model detailed in the aaai2026.tex paper.1
8
+
9
+ The primary objective of this integration is to leverage the unique, time-domain salience detection capabilities of the Marine Algorithm (e.g., jitter analysis) 1 to significantly improve the quality, controllability, and emotional expressiveness of the synthesized speech.
10
+
11
+ The core of this strategy involves **replacing the Conformer-based emotion perceiver of the IndexTTS2 Text-to-Semantic (T2S) module** 1 with a new, lightweight, and prosodically-aware Rust module based on the Marine Algorithm. This report provides a full analysis of the architectural foundations, a detailed integration strategy, a complete Rust-level implementation guide, and an analysis of the training and inferential implications of this modification.
12
+
13
+ ## **Part 1: Architectural Foundations: The IndexTTS2 Pipeline and the Marine Salience Primitive**
14
+
15
+ A successful integration requires a deep, functional understanding of the two systems being merged. This section deconstructs the IndexTTS2 architecture as the "host" system 1 and re-frames the Marine Algorithm 1 as the "implant" feature extractor.
16
+
17
+ ### **1.1 Deconstruction of the IndexTTS2 Generative Pipeline**
18
+
19
+ The aaai2026.tex paper describes IndexTTS2 as a state-of-the-art, cascaded zero-shot TTS system.1 Its architecture is composed of three distinct, sequentially-trained modules:
20
+
21
+ 1. **Text-to-Semantic (T2S) Module:** This is an autoregressive (AR) Transformer-based model. Its primary function is to convert a sequence of text inputs into a sequence of "semantic tokens." This module is the system's "brain," determining the content, rhythm, and prosody of the speech.
22
+ 2. **Semantic-to-Mel (S2M) Module:** This is a non-autoregressive (NAR) model. It takes the discrete semantic tokens from the T2S module and converts them into a dense mel-spectrogram. This module functions as the system's "vocal tract," rendering the semantic instructions into a spectral representation. The paper notes this module "incorporate[s] GPT latent representations to significantly improve the stability of the generated speech".1
23
+ 3. **Vocoder Module:** This is a pre-trained BigVGANv2 vocoder.1 Its sole function is to perform the final conversion from the mel-spectrogram (from S2M) into a raw audio waveform.
24
+
25
+ The critical component for this integration is the **T2S Conditioning Mechanism**. The IndexTTS2 T2S module's behavior is conditioned on two separate audio prompts, a design intended to achieve disentangled control 1:
26
+
27
+ * **Timbre Prompt:** This audio prompt is processed by a "speaker perceiver conditioner" to generate a speaker attribute vector, c. This vector defines *who* is speaking (i.e., the vocal identity).
28
+ * **Style Prompt:** This *separate* audio prompt is processed by a "Conformer-based emotion perceiver conditioner" to generate an emotion vector, e. This vector defines *how* they are speaking (i.e., the emotion, prosody, and rhythm).
29
+
30
+ The T2S Transformer then consumes these vectors, additively combined, as part of its input: [c + e, p, ..., E_text, ..., E_sem].1
31
+
32
+ A key architectural detail is the IndexTTS2 paper's explicit use of a **Gradient Reversal Layer (GRL)** "to eliminate emotion-irrelevant information" and achieve "speaker-emotion disentanglement".1 The presence of a GRL, an adversarial training technique, strongly implies that the "Conformer-based emotion perceiver" is *not* naturally adept at this separation. A general-purpose Conformer, when processing the style prompt, will inevitably encode both prosodic features (pitch, energy) and speaker-specific features (formants, timbre). The GRL is thus employed as an adversarial "patch" to force the e vector to be "ignorant" of the speaker. This reveals a complex, computationally-heavy, and potentially fragile point in the IndexTTS2 design—a weakness that the Marine Algorithm is perfectly suited to address.
33
+
34
+ ### **1.2 The Marine Algorithm as a Superior Prosodic Feature Extractor**
35
+
36
+ The marine-Universal-Salience-algoritm.tex paper 1 introduces the Marine Algorithm as a "universal, modality-agnostic salience detector" that operates in the time domain with O(1) per-sample complexity. While its described applications are broad, its specific mechanics make it an ideal, purpose-built *prosody quantifier* for speech.
37
+
38
+ The algorithm's 5-step process (Pre-gating, Peak Detection, Jitter Computation, Harmonic Alignment, Salience Score) 1 is, in effect, a direct measurement of the suprasegmental features that define prosody:
39
+
40
+ * **Period Jitter ($J_p$):** Defined as $J_p = |T_i - \text{EMA}(T)|$, this metric quantifies the instability of the time between successive peaks (the fundamental period).1 In speech, this is a direct, time-domain correlate for *pitch instability*. High, structured $J_p$ (i.e., high jitter with a stable EMA) represents intentional prosodic features like vibrato, vocal fry, or creaky voice, all key carriers of emotion (a worked numeric instance follows this list).
+ * **Amplitude Jitter ($J_a$):** Defined as $J_a = |A_i - \text{EMA}(A)|$, this metric quantifies the instability of peak amplitudes.1 In speech, this is a correlate for *amplitude shimmer* or "vocal roughness," which are strong cues for affective states such as arousal, stress, or anger.
+ * **Harmonic Alignment ($H$):** This check for integer-multiple relationships in peak spacing 1 directly measures the *purity* and *periodicity* of the tone. It quantifies the distinction between a clear, voiced, harmonic sound and a noisy, chaotic, or unvoiced signal (e.g., breathiness, whispering, or a scream).
+ * **Energy ($E$) and Peak Detection:** The algorithm's pre-gating ($\theta_c$) and peak detection steps inherently track the signal's energy and the *density* of glottal pulses, which correlate directly to loudness and fundamental frequency (pitch), respectively.
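+
+ For concreteness, the processor in `crates/marine_salience/src/processor.rs` uses a *normalized* variant of these definitions, dividing each deviation by its running EMA so the resulting jitter is dimensionless (the paper's definitions above are the raw absolute differences):
+
+ $$\hat{J}_p = \frac{|T_i - \text{EMA}(T)|}{\text{EMA}(T)}, \qquad \hat{J}_a = \frac{|A_i - \text{EMA}(A)|}{\text{EMA}(A)}$$
+
+ For example, with $\text{EMA}(T) = 100$ samples and $T_i = 103$ samples, $\hat{J}_p = 3/100 = 0.03$: a 3% period deviation, well inside the range that `MarineConfig::speech_default` treats as ordinary speech (its `jitter_high` threshold is 0.60).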
44
+
45
+ The algorithm's description as "biologically plausible" and analogous to cochlear/amygdalar filtering 1 is not merely conceptual. It signifies that the algorithm is *a priori* biased to extract the same low-level features that the human auditory system uses to perceive emotion and prosody. This makes it a far more "correct" feature extractor for this task than a generic, large-scale Conformer, which learns from statistical correlation rather than first principles. Furthermore, its O(1) complexity 1 makes it orders of magnitude more efficient than the Transformer-based Conformer it will replace.
46
+
47
+ ## **Part 2: Integration Strategy: Replacing the T2S Emotion Perceiver**
48
+
49
+ The integration path is now clear. The IndexTTS2 T2S module 1 requires a clean, disentangled prosody vector e. The original Conformer-based conditioner provides a "polluted" vector that must be "cleaned" by a GRL.1 The Marine Algorithm 1 is, by its very design, a *naturally disentangled* prosody extractor.
50
+
51
+ ### **2.1 Formal Proposal: The MarineProsodyConditioner**
52
+
53
+ The formal integration strategy is as follows:
54
+
55
+ 1. The "Conformer-based emotion perceiver conditioner" 1 is **removed** from the IndexTTS2 architecture.
56
+ 2. A new, from-scratch Rust module, tentatively named the MarineProsodyConditioner, is **created**.
57
+ 3. This new module's sole function is to accept the file path to the style\_prompt audio, load its samples, and process them using a Rust implementation of the Marine Algorithm.1
58
+ 4. It will aggregate the resulting time-series of salience data into a single, fixed-size feature vector, e', which will serve as the new "emotion vector."
59
+
60
+ ### **2.2 Feature Vector Engineering: Defining the New e'**
61
+
62
+ The Marine Algorithm produces a *stream* of SaliencePackets, one for each detected peak.1 The T2S Transformer, however, requires a *single, fixed-size* conditioning vector.1 We must therefore define an aggregation strategy to distill this time-series into a descriptive statistical summary.
63
+
64
+ The proposed feature vector, the MarineProsodyVector (our new e'), will be an 8-dimensional vector composed of the mean and standard deviation of the algorithm's key outputs over the entire duration of the style prompt.
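+ Concretely, if the style prompt yields $N$ SaliencePackets, each "mean" field is the plain sample mean of the corresponding per-peak value, each "std" field is its (population) standard deviation, and peak_density is the peak count normalized by prompt duration, e.g. $\bar{J}_p = \frac{1}{N}\sum_{i=1}^{N} J_p^{(i)}$, $\sigma_{J_p} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(J_p^{(i)} - \bar{J}_p\bigr)^2}$, and $\text{peak density} = N / \text{duration (s)}$.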
65
+
66
+ **Table 1: MarineProsodyVector Struct Definition**
67
+
68
+ This table defines the precise "interface" between the marine\_salience crate and the indextts\_rust crate.
69
+
70
+ | Field | Type | Description | Source |
71
+ | :---- | :---- | :---- | :---- |
72
+ | jp_mean | f32 | Mean Period Jitter ($J_p$). Correlates to average pitch instability. | 1 |
73
+ | jp_std | f32 | Std. Dev. of $J_p$. Correlates to *variance* in pitch instability. | 1 |
74
+ | ja_mean | f32 | Mean Amplitude Jitter ($J_a$). Correlates to average vocal roughness. | 1 |
75
+ | ja_std | f32 | Std. Dev. of $J_a$. Correlates to *variance* in vocal roughness. | 1 |
76
+ | h_mean | f32 | Mean Harmonic Alignment ($H$). Correlates to average tonal purity. | 1 |
77
+ | s_mean | f32 | Mean Salience Score ($S$). Correlates to overall signal "structuredness". | 1 |
78
+ | peak_density | f32 | Number of detected peaks per second. Correlates to fundamental frequency (F0/pitch). | 1 |
79
+ | energy_mean | f32 | Mean energy ($E$) of detected peaks. Correlates to loudness/amplitude. | 1 |
80
+
81
+ This small, 8-dimensional vector is dense, interpretable, and packed with prosodic information, in stark contrast to the opaque, high-dimensional, and entangled vector produced by the original Conformer.1
82
+
83
+ ### **2.3 Theoretical Justification: The Synergistic Disentanglement**
84
+
85
+ This integration provides a profound architectural improvement by solving the speaker-style disentanglement problem more elegantly and efficiently than the original IndexTTS2 design.1
86
+
87
+ The central challenge in the original architecture is that the Conformer-based conditioner processes the *entire* signal, capturing both temporal features (pitch, which is prosody) and spectral features (formants, which define speaker identity). This "entanglement" necessitates the use of the adversarial GRL to "un-learn" the speaker information.1
88
+
89
+ The Marine Algorithm 1 fundamentally sidesteps this problem. Its design is based on **peak detection, spacing, and amplitude**.1 It is almost entirely *blind* to the complex spectral-envelope (formant) information that defines a speaker's unique timbre. It measures the *instability* of the fundamental frequency, not the F0 itself, and the *instability* of the amplitude, not the spectral shape.
90
+
91
+ Therefore, the MarineProsodyVector (e') is **naturally disentangled**. It is a *pure* representation of prosody, containing negligible speaker-identity information.
92
+
93
+ When this new e' vector is fed into the T2S model's input, $[c + e', \dots]$, the system receives two *orthogonal* conditioning vectors:
94
+
95
+ 1. c (from the speaker perceiver 1): Contains the speaker's timbre (formants, etc.).
96
+ 2. e' (from the MarineProsodyConditioner 1): Contains the speaker's prosody (jitter, rhythm, etc.).
97
+
98
+ This clean separation provides two major benefits:
99
+
100
+ 1. **Superior Timbre Cloning:** The speaker vector c no longer has to "compete" with an "entangled" style vector e. The T2S model will receive a cleaner speaker signal, leading to more accurate zero-shot voice cloning.
101
+ 2. **Superior Emotional Expression:** The style vector e' is a clean, simple, and interpretable signal. The T2S Transformer will be able to learn the mapping from (e.g.) jp\_mean \= 0.8 to "generate creaky semantic tokens" much more easily than from an opaque 512-dimensional Conformer embedding.
102
+
103
+ This change simplifies the T2S model's learning task, which should lead to faster convergence and higher final quality. The GRL 1 may become entirely unnecessary, further simplifying the training regime and stabilizing the model.
104
+
105
+ ## **Part 3: Implementation Guide: An IndexTTS-Rust Integration**
106
+
107
+ This section provides a concrete, code-level guide for implementing the proposed integration.
108
+
109
+ ### **3.1 Addressing the README.md Data Gap**
110
+
111
+ A critical limitation in preparing this analysis is the repeated failure to access the user-provided IndexTTS-Rust README.md file.2 This file contains the project's specific file structure, API definitions, and module layout.
112
+
113
+ To overcome this, this report will posit a **hypothetical yet idiomatic Rust project structure** based on the logical components described in the IndexTTS2 paper.1 All subsequent code examples will adhere to this structure. The project owner is expected to map these file paths and function names to their actual, private codebase.
114
+
115
+ ### **3.2 Table 2: Hypothetical IndexTTS-Rust Project Structure**
116
+
117
+ The following workspace structure is assumed for all implementation examples.
118
+
119
+ Plaintext
120
+
121
+ indextts\_rust\_workspace/
122
+ ├── Cargo.toml (Workspace root)
123
+
124
+ ├── indextts\_rust/ (The main application/library crate)
125
+ │ ├── Cargo.toml
126
+ │ └── src/
127
+ │ ├── main.rs (Binary entry point)
128
+ │ ├── lib.rs (Library entry point & API)
129
+ │ ├── error.rs (Project-wide error types)
130
+ │ ├── audio.rs (Audio I/O: e.g., fn load\_wav\_samples)
131
+ │ ├── vocoder.rs (Wrapper for BigVGANv2 model)
132
+ │ ├── t2s/
133
+ │ │ ├── mod.rs (T2S module definition)
134
+ │ │ ├── model.rs (AR Transformer implementation)
135
+ │ │ └── conditioner.rs(Handles 'c' and 'e' vector generation)
136
+ │ └── s2m/
137
+ │ ├── mod.rs (S2M module definition)
138
+ │ └── model.rs (NAR model implementation)
139
+
140
+ └── marine\_salience/ (The NEW crate for the Marine Algorithm)
141
+ ├── Cargo.toml
142
+ └── src/
143
+ ├── lib.rs (Public API: MarineProcessor, etc.)
144
+ ├── config.rs (MarineConfig struct)
145
+ ├── processor.rs (MarineProcessor struct and logic)
146
+ ├── ema.rs (EmaTracker helper struct)
147
+ └── packet.rs (SaliencePacket struct)
148
+
149
+ ### **3.3 Crate Development: marine\_salience**
150
+
151
+ A new, standalone Rust crate, marine\_salience, should be created. This crate will encapsulate all logic for the Marine Algorithm 1, ensuring it is modular, testable, and reusable.
152
+
153
+ **Table 3: marine\_salience Crate \- Public API Definition**
154
+
155
+ | Struct / fn | Field / Signature | Type | Description |
156
+ | :---- | :---- | :---- | :---- |
157
+ | MarineConfig | clip_threshold | f32 | $\theta_c$, pre-gating sensitivity.1 |
158
+ | | ema_period_alpha | f32 | Smoothing factor for Period EMA. |
159
+ | | ema_amplitude_alpha | f32 | Smoothing factor for Amplitude EMA. |
160
+ | SaliencePacket | j_p | f32 | Period Jitter ($J_p$).1 |
161
+ | | j_a | f32 | Amplitude Jitter ($J_a$).1 |
162
+ | | h_score | f32 | Harmonic Alignment score ($H$).1 |
163
+ | | s_score | f32 | Final Salience Score ($S$).1 |
164
+ | | energy | f32 | Peak energy ($E$).1 |
165
+ | MarineProcessor | new(config: MarineConfig) | Self | Constructor. |
166
+ | | process_sample(&mut self, sample: f32, sample_idx: u64) | Option\<SaliencePacket\> | The O(1) processing function. |
167
+
168
+ **marine\_salience/src/processor.rs (Implementation Sketch):**
169
+
170
+ The MarineProcessor struct will hold the state: EmaTracker instances for period and amplitude, the last_peak_sample index, last_peak_amplitude, and the current_direction of the signal (+1 for rising, -1 for falling).
171
+
172
+ The process_sample function is the O(1) core, implementing the algorithm from 1; a Rust sketch follows the numbered steps below:
173
+
174
+ 1. **Pre-gating:** Check if sample.abs() \> config.clip\_threshold.
175
+ 2. **Peak Detection:** Track the signal's direction. A change from +1 (rising) to -1 (falling) signifies a peak at sample_idx - 1, as per the condition $x(n-1) < x(n) > x(n+1)$.1
176
+ 3. **Jitter Computation:** If a peak is detected at n:
177
+ * Calculate the current period $T_i$ as n - self.last_peak_sample.
178
+ * Calculate the current amplitude $A_i$ as the sample value at the peak.
179
+ * Calculate $J_p = |T_i - \text{EMA}(T)|$ using self.ema_period.value().1
180
+ * Calculate $J_a = |A_i - \text{EMA}(A)|$ using self.ema_amplitude.value().1
181
+ * Update the EMAs: self.ema_period.update(T_i), self.ema_amplitude.update(A_i).
182
+ 4. **Harmonic Alignment:** Perform the check for $H$.1
183
+ 5. **Salience Score:** Compute $S = w_e E + w_j (1/J) + w_h H$.1
184
+ 6. Update self.last_peak_sample = n and self.last_peak_amplitude = A_i.
185
+ 7. Return Some(SaliencePacket {... }).
186
+ 8. If no peak is detected, return None.
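+ A minimal Rust sketch of this loop follows. It uses the MarineConfig and SaliencePacket shapes from Table 3, but the EMA helper, the salience weights ($w_e, w_j, w_h$), and the harmonic-alignment check are simplified placeholders rather than the real marine_salience implementation.
+
+ Rust
+
+ struct Ema { alpha: f32, value: Option<f32> }
+
+ impl Ema {
+     fn new(alpha: f32) -> Self { Self { alpha, value: None } }
+     fn value(&self) -> Option<f32> { self.value }
+     fn update(&mut self, x: f32) {
+         self.value = Some(match self.value {
+             Some(v) => self.alpha * x + (1.0 - self.alpha) * v,
+             None => x,
+         });
+     }
+ }
+
+ /// Pre-gating threshold and EMA smoothing factors (Table 3).
+ pub struct MarineConfig { pub clip_threshold: f32, pub ema_period_alpha: f32, pub ema_amplitude_alpha: f32 }
+
+ /// Per-peak output (Table 3).
+ pub struct SaliencePacket { pub j_p: f32, pub j_a: f32, pub h_score: f32, pub s_score: f32, pub energy: f32 }
+
+ pub struct MarineProcessor {
+     config: MarineConfig,
+     ema_period: Ema,
+     ema_amplitude: Ema,
+     last_sample: f32,
+     rising: bool,
+     last_peak_sample: Option<u64>,
+ }
+
+ impl MarineProcessor {
+     pub fn new(config: MarineConfig) -> Self {
+         let (pa, aa) = (config.ema_period_alpha, config.ema_amplitude_alpha);
+         Self {
+             config,
+             ema_period: Ema::new(pa),
+             ema_amplitude: Ema::new(aa),
+             last_sample: 0.0,
+             rising: false,
+             last_peak_sample: None,
+         }
+     }
+
+     /// O(1) per-sample update; emits a packet only when a peak is detected.
+     pub fn process_sample(&mut self, sample: f32, sample_idx: u64) -> Option<SaliencePacket> {
+         // 1. Pre-gating: sub-threshold samples only update direction tracking.
+         if sample.abs() <= self.config.clip_threshold {
+             self.rising = sample > self.last_sample;
+             self.last_sample = sample;
+             return None;
+         }
+         // 2. Peak detection: a rising-to-falling flip means the *previous* sample was a local maximum.
+         let was_rising = self.rising;
+         self.rising = sample > self.last_sample;
+         let peak_amplitude = self.last_sample.abs();
+         self.last_sample = sample;
+         if !(was_rising && !self.rising) {
+             return None;
+         }
+         let peak_idx = sample_idx.saturating_sub(1);
+
+         // 3. Jitter against the running EMAs; the very first peak only seeds the state.
+         let Some(prev_peak) = self.last_peak_sample else {
+             self.last_peak_sample = Some(peak_idx);
+             self.ema_amplitude.update(peak_amplitude);
+             return None;
+         };
+         let period = (peak_idx - prev_peak) as f32;
+         let j_p = (period - self.ema_period.value().unwrap_or(period)).abs();
+         let j_a = (peak_amplitude - self.ema_amplitude.value().unwrap_or(peak_amplitude)).abs();
+         self.ema_period.update(period);
+         self.ema_amplitude.update(peak_amplitude);
+         self.last_peak_sample = Some(peak_idx);
+
+         // 4. Harmonic alignment: placeholder (assume perfectly voiced).
+         let h_score = 1.0;
+
+         // 5. Salience S = w_e*E + w_j*(1/J) + w_h*H, with illustrative weights.
+         let energy = peak_amplitude * peak_amplitude;
+         let s_score = 0.4 * energy + 0.4 / (1.0 + j_p + j_a) + 0.2 * h_score;
+
+         Some(SaliencePacket { j_p, j_a, h_score, s_score, energy })
+     }
+ }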
187
+
188
+ ### **3.4 Modifying the indextts\_rust Crate**
189
+
190
+ With the marine\_salience crate complete, the indextts\_rust crate can now be modified.
191
+
192
+ indextts\_rust/Cargo.toml:
193
+ Add the new crate as a dependency:
194
+
195
+ Ini, TOML
196
+
197
+ [dependencies]
198
+ marine_salience = { path = "../marine_salience" }
199
+ # ... other dependencies (tch, burn, ndarray, etc.)
200
+
201
+ indextts\_rust/src/t2s/conditioner.rs:
202
+ This is the central modification. The file responsible for generating the e vector is completely refactored.
203
+
204
+ Rust
205
+
206
+ // BEFORE: Original Conformer-based
207
+ //
208
+ // use tch::Tensor;
209
+ // use crate::audio::AudioData;
210
+ //
211
+ // // This struct holds the large, complex Conformer model
212
+ // pub struct ConformerEmotionPerceiver {
213
+ // //... model weights...
214
+ // }
215
+ //
216
+ // impl ConformerEmotionPerceiver {
217
+ // pub fn get\_style\_embedding(\&self, audio: \&AudioData) \-\> Result\<Tensor, ModelError\> {
218
+ // // 1\. Convert AudioData to mel-spectrogram tensor
219
+ // // 2\. Pass spectrogram through Conformer layers
220
+ // // 3\. (GRL logic is applied during training)
221
+ // // 4\. Return an opaque, high-dimensional 'e' vector
222
+ // // (e.g., )
223
+ // }
224
+ // }
225
+
226
+ // AFTER: New MarineProsodyConditioner
227
+ //
228
+ use marine\_salience::processor::{MarineProcessor, SaliencePacket};
229
+ use marine\_salience::config::MarineConfig;
230
+ use crate::audio::load\_wav\_samples; // From hypothetical audio.rs
231
+ use std::path::Path;
232
+ use anyhow::Result;
233
+
234
+ // This is the struct defined in Table 1
235
+ #[derive(Debug, Clone, Copy, PartialEq)]
236
+ pub struct MarineProsodyVector {
237
+ pub jp\_mean: f32,
238
+ pub jp\_std: f32,
239
+ pub ja\_mean: f32,
240
+ pub ja\_std: f32,
241
+ pub h\_mean: f32,
242
+ pub s\_mean: f32,
243
+ pub peak\_density: f32,
244
+ pub energy\_mean: f32,
245
+ }
246
+
247
+ // This new struct and function replace the Conformer
248
+ pub struct MarineProsodyConditioner {
249
+ config: MarineConfig,
250
+ }
251
+
252
+ impl MarineProsodyConditioner {
253
+ pub fn new(config: MarineConfig) \-\> Self {
254
+ Self { config }
255
+ }
256
+
257
+ pub fn get\_marine\_style\_vector(&self, style\_prompt\_path: \&Path, sample\_rate: f32) \-\> Result\<MarineProsodyVector\> {
258
+ // 1\. Load audio samples
259
+ // Assumes audio.rs provides this function
260
+ let samples \= load\_wav\_samples(style\_prompt\_path)?;
261
+ let duration\_sec \= samples.len() as f32 / sample\_rate;
262
+
263
+ // 2\. Instantiate and run the MarineProcessor
264
+ let mut processor \= MarineProcessor::new(self.config.clone());
265
+ let mut packets \= Vec::\<SaliencePacket\>::new();
266
+
267
+ for (i, sample) in samples.iter().enumerate() {
268
+ if let Some(packet) \= processor.process\_sample(\*sample, i as u64) {
269
+ packets.push(packet);
270
+ }
271
+ }
272
+
273
+ if packets.is\_empty() {
274
+ return Err(anyhow::anyhow\!("No peaks detected in style prompt."));
275
+ }
276
+
277
+ // 3\. Aggregate packets into the final feature vector
278
+ let num\_packets \= packets.len() as f32;
279
+
280
+ let mut jp\_mean \= 0.0;
281
+ let mut ja\_mean \= 0.0;
282
+ let mut h\_mean \= 0.0;
283
+ let mut s\_mean \= 0.0;
284
+ let mut energy\_mean \= 0.0;
285
+
286
+ for p in \&packets {
287
+ jp\_mean \+= p.j\_p;
288
+ ja\_mean \+= p.j\_a;
289
+ h\_mean \+= p.h\_score;
290
+ s\_mean \+= p.s\_score;
291
+ energy\_mean \+= p.energy;
292
+ }
293
+
294
+ jp\_mean /= num\_packets;
295
+ ja\_mean /= num\_packets;
296
+ h\_mean /= num\_packets;
297
+ s\_mean /= num\_packets;
298
+ energy\_mean /= num\_packets;
299
+
300
+ // Calculate standard deviation (variance)
301
+ let mut jp\_std \= 0.0;
302
+ let mut ja\_std \= 0.0;
303
+ for p in \&packets {
304
+ jp\_std \+= (p.j\_p \- jp\_mean).powi(2);
305
+ ja\_std \+= (p.j\_a \- ja\_mean).powi(2);
306
+ }
307
+ jp\_std \= (jp\_std / num\_packets).sqrt();
308
+ ja\_std \= (ja\_std / num\_packets).sqrt();
309
+
310
+ let peak\_density \= num\_packets / duration\_sec;
311
+
312
+ Ok(MarineProsodyVector {
313
+ jp\_mean,
314
+ jp\_std,
315
+ ja\_mean,
316
+ ja\_std,
317
+ h\_mean,
318
+ s\_mean,
319
+ peak\_density,
320
+ energy\_mean,
321
+ })
322
+ }
323
+ }
324
+
325
+ ### **3.5 Updating the T2S Model (indextts\_rust/src/t2s/model.rs)**
326
+
327
+ This change is **breaking** and **mandatory**. The IndexTTS2 T2S model 1 was trained on a high-dimensional e vector (e.g., 512-dim). Our new e' vector is 8-dimensional. The T2S model's architecture must be modified to accept this.
328
+
329
+ The change will be in the T2S Transformer's input embedding layer, which projects the conditioning vectors into the model's main hidden dimension (e.g., 1024-dim).
330
+
331
+ **(Example using tch-rs or burn pseudo-code):**
332
+
333
+ Rust
334
+
335
+ // In src/t2s/model.rs
336
+ //
337
+ // pub struct T2S\_Transformer {
338
+ // ...
339
+ // speaker\_projector: nn::Linear,
340
+ // style\_projector: nn::Linear, // The layer to change
341
+ // ...
342
+ // }
343
+ //
344
+ // impl T2S\_Transformer {
345
+ // pub fn new(config: \&T2S\_Config, vs: \&nn::Path) \-\> Self {
346
+ // ...
347
+ // // BEFORE:
348
+ // // let style\_projector \= nn::linear(
349
+ // // vs / "style\_projector",
350
+ // // 512, // Original Conformer 'e' dimension
351
+ // // config.hidden\_dim,
352
+ // // Default::default()
353
+ // // );
354
+ //
355
+ // // AFTER:
356
+ // let style\_projector \= nn::linear(
357
+ // vs / "style\_projector",
358
+ // 8, // New MarineProsodyVector 'e'' dimension
359
+ // config.hidden\_dim,
360
+ // Default::default()
361
+ // );
362
+ // ...
363
+ // }
364
+ // }
365
+
366
+ This change creates a new, untrained model. The S2M and Vocoder modules 1 can remain unchanged, but the T2S module must now be retrained.
367
+
368
+ ## **Part 4: Training, Inference, and Qualitative Implications**
369
+
370
+ This architectural change has profound, positive implications for the entire system, from training to user-facing control.
371
+
372
+ ### **4.1 Retraining the T2S Module**
373
+
374
+ The modification in Part 3.5 is a hard-fork of the model architecture; retraining the T2S module 1 is not optional.
375
+
376
+ **Training Plan:**
377
+
378
+ 1. **Model:** The S2M and Vocoder modules 1 can be completely frozen. Only the T2S module with the new 8-dimensional style\_projector (from 3.5) needs to be trained.
379
+ 2. **Dataset Preprocessing:** The *entire* training dataset used for the original IndexTTS2 1 must be re-processed.
380
+ * For *every* audio file in the dataset, the MarineProsodyConditioner::get\_marine\_style\_vector function (from 3.4) must be run *once*.
381
+ * The resulting 8-dimensional MarineProsodyVector must be saved as the new "ground truth" style label for that utterance (a sketch of this preprocessing pass follows the list below).
382
+ 3. **Training:** The T2S module is now trained as described in the aaai2026.tex paper.1 During the training step, it will load the pre-computed MarineProsodyVector as the e' vector, which will be added to the c (speaker) vector and fed into the Transformer.
383
+ 4. **Hypothesis:** This training run is expected to converge *faster* and to a *higher* qualitative ceiling. The model is no longer burdened by the complex, adversarial GRL-based disentanglement.1 It is instead learning a much simpler, more direct correlation between a clean prosody vector (e') and the target semantic token sequences.
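+ A minimal sketch of the preprocessing pass from step 2, assuming hypothetical list_training_wavs() and load_wav_samples() helpers and using the MarineProsodyConditioner::from_samples API added under src/quality/prosody.rs (the indextts crate name is assumed); the cached .prosody.txt format is illustrative:
+
+ Rust
+
+ use indextts::quality::MarineProsodyConditioner; // crate name assumed
+ use std::path::{Path, PathBuf};
+
+ // Hypothetical helpers: enumerate training clips and load mono f32 samples plus sample rate.
+ fn list_training_wavs() -> Vec<PathBuf> { unimplemented!() }
+ fn load_wav_samples(path: &Path) -> (Vec<f32>, u32) { unimplemented!() }
+
+ /// Pre-compute the 8-dimensional style label for every utterance in the dataset.
+ fn precompute_style_labels() {
+     for wav in list_training_wavs() {
+         let (samples, sample_rate) = load_wav_samples(&wav);
+         let conditioner = MarineProsodyConditioner::new(sample_rate);
+         let e_prime = conditioner
+             .from_samples(&samples)
+             .expect("no usable peaks in training clip");
+         // Persist the raw 8 floats next to the clip as the new ground-truth style label.
+         let floats: Vec<String> = e_prime.to_array().iter().map(|v| v.to_string()).collect();
+         std::fs::write(wav.with_extension("prosody.txt"), floats.join(" "))
+             .expect("failed to write style label");
+     }
+ }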
384
+
385
+ ### **4.2 Inference-Time Control**
386
+
387
+ This integration unlocks a new, powerful mode of "synthetic" or "direct" prosody control, fulfilling the proposals implicit in the user's query.
388
+
389
+ * **Mode 1: Reference-Based (Standard):**
390
+ * A user provides a style\_prompt.wav.
391
+ * The get\_marine\_style\_vector function (from 3.4) is called.
392
+ * The resulting MarineProsodyVector e' is fed into the T2S model.
393
+ * This "copies" the prosody from the reference audio, just as the original IndexTTS2 1 intended, but with higher fidelity.
394
+ * **Mode 2: Synthetic-Control (New):**
395
+ * The user provides *no* style prompt.
396
+ * Instead, the user *directly constructs* the 8-dimensional MarineProsodyVector to achieve a desired effect. The application's UI could expose 8 sliders for these values; a code sketch of such a preset follows these examples.
397
+ * **Example 1: "Agitated / Rough Voice"**
398
+ * e' \= MarineProsodyVector { jp\_mean: 0.8, jp\_std: 0.5, ja\_mean: 0.7, ja\_std: 0.4,... }
399
+ * **Example 2: "Stable / Monotone Voice"**
400
+ * e' \= MarineProsodyVector { jp\_mean: 0.05, jp\_std: 0.01, ja\_mean: 0.05, ja\_std: 0.01,... }
401
+ * **Example 3: "High-Pitch / High-Energy Voice"**
402
+ * e' \= MarineProsodyVector { peak\_density: 300.0, energy\_mean: 0.9,... }
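+ As a sketch, such a preset maps one-to-one onto the MarineProsodyVector struct added under src/quality/prosody.rs (the indextts crate name is assumed); the numeric values below are illustrative, not calibrated:
+
+ Rust
+
+ use indextts::quality::MarineProsodyVector; // also re-exported from the crate root
+
+ /// Illustrative "agitated / rough voice" preset: high, unstable jitter, moderate energy.
+ fn agitated_preset() -> MarineProsodyVector {
+     MarineProsodyVector {
+         jp_mean: 0.8,
+         jp_std: 0.5,
+         ja_mean: 0.7,
+         ja_std: 0.4,
+         h_mean: 0.8,
+         s_mean: 0.4,
+         peak_density: 120.0,
+         energy_mean: 0.6,
+     }
+ }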
403
+
404
+ This provides a small, interpretable, and powerful "control panel" for prosody, a significant breakthrough in controllable TTS that was not possible with the original opaque Conformer embedding.1
405
+
406
+ ### **4.3 Bridging to Downstream Fidelity (S2M)**
407
+
408
+ The benefits of this integration propagate through the entire cascade. The S2M module's quality is directly dependent on the quality of the semantic tokens it receives from T2S.1
409
+
410
+ The aaai2026.tex paper 1 states the S2M module uses "GPT latent representations to significantly improve the stability of the generated speech." This suggests the S2M is a powerful and stable *renderer*. However, a renderer is only as good as the instructions it receives.
411
+
412
+ In the original system, the S2M module likely received semantic tokens with "muddled" or "averaged-out" prosody, resulting from the T2S model's struggle with the entangled e vector. The S2M's "stability" 1 may have come at the *cost* of expressiveness, as it learned to smooth over inconsistent prosodic instructions.
413
+
414
+ With the new MarineProsodyConditioner, the T2S model will now produce semantic tokens that are *far more richly, explicitly, and accurately* encoded with prosodic intent. The S2M module's "GPT latents" 1 will receive a higher-fidelity, more consistent input signal. This creates a synergistic effect: the S2M's stable rendering capabilities 1 will now be applied to a *more expressive* set of instructions. The result is an end-to-end system that is *both* stable *and* highly expressive.
415
+
416
+ ## **Part 5: Report Conclusions and Future Trajectories**
417
+
418
+ ### **5.1 Summary of Improvements**
419
+
420
+ The integration framework detailed in this report achieves the project's goals by:
421
+
422
+ 1. **Replacing** a computationally heavy, black-box Conformer 1 with a lightweight, O(1), biologically-plausible, and Rust-native MarineProcessor.1
423
+ 2. **Solving** a core architectural problem in the IndexTTS2 design by providing a *naturally disentangled*, speaker-invariant prosody vector, which simplifies or obviates the need for the adversarial GRL.1
424
+ 3. **Unlocking** a powerful "synthetic control" mode, allowing users to *directly* manipulate prosody at inference time via an 8-dimensional, interpretable control vector.
425
+ 4. **Improving** end-to-end system quality by providing a cleaner, more explicit prosodic signal to the T2S module 1, which in turn provides a higher-fidelity semantic token stream to the S2M module.1
426
+
427
+ ### **5.2 Future Trajectories**
428
+
429
+ This new architecture opens two significant avenues for future research.
430
+
431
+ 1\. True Streaming Synthesis with Dynamic Conditioning
432
+ The IndexTTS2 T2S module is autoregressive 1, and the Marine Algorithm is O(1) per-sample.1 This is a perfect combination for real-time applications.
433
+ A future version could implement a "Dynamic Conditioning" mode. In this mode, a MarineProcessor runs on a live microphone input (e.g., from the user) in a parallel thread. It continuously calculates the MarineProsodyVector over a short, sliding window (e.g., 500ms). This e' vector is then *hot-swapped* into the T2S model's conditioning state *during* the autoregressive generation loop. The result would be a TTS model that mirrors the user's emotional prosody in real-time.
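+ A hedged sketch of that sliding-window tracker, reusing MarineProsodyConditioner::from_samples from src/quality/prosody.rs over the most recent 500 ms of microphone audio (the indextts crate name is assumed); the update_style_condition hook on the T2S decoder is hypothetical:
+
+ Rust
+
+ use indextts::quality::{MarineProsodyConditioner, MarineProsodyVector};
+ use std::collections::VecDeque;
+
+ struct LiveProsodyTracker {
+     conditioner: MarineProsodyConditioner,
+     window: VecDeque<f32>,
+     window_len: usize, // number of samples in the sliding window
+ }
+
+ impl LiveProsodyTracker {
+     fn new(sample_rate: u32) -> Self {
+         Self {
+             conditioner: MarineProsodyConditioner::new(sample_rate),
+             window: VecDeque::new(),
+             window_len: (sample_rate as usize) / 2, // ~500 ms
+         }
+     }
+
+     /// Feed a block of microphone samples; returns an updated e' once the window is full.
+     fn push(&mut self, block: &[f32]) -> Option<MarineProsodyVector> {
+         self.window.extend(block.iter().copied());
+         while self.window.len() > self.window_len {
+             self.window.pop_front();
+         }
+         if self.window.len() < self.window_len {
+             return None;
+         }
+         let samples: Vec<f32> = self.window.iter().copied().collect();
+         self.conditioner.from_samples(&samples).ok()
+     }
+ }
+
+ // Inside the autoregressive generation loop (hypothetical hook):
+ // if let Some(e_prime) = tracker.push(&mic_block) {
+ //     t2s.update_style_condition(e_prime.to_array()); // hot-swap the 8-dim conditioning
+ // }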
434
+
435
+ 2\. Active Quality Monitoring (Vocoder Feedback Loop)
436
+ The Marine Algorithm is a "universal... salience detector" that distinguishes "structured signals from noise".1 This capability can be used as a quality metric for the vocoder's output.
437
+ An advanced implementation could create a feedback loop:
438
+
439
+ 1. The BigVGANv2 vocoder 1 produces its output audio.
440
+ 2. This audio is *immediately* fed *back* into a MarineProcessor.
441
+ 3. The processor analyzes the output. The key insight from the Marine paper 1 is the use of the **Exponential Moving Average (EMA)**.
442
+ * **Desired Prosody (e.g., vocal fry):** Will produce high $J_p$/$J_a$, but $\text{EMA}(T)$ and $\text{EMA}(A)$ will remain *stable*. The algorithm will correctly identify this as a *structured* signal.
443
+ * **Undesired Artifact (e.g., vocoder hiss, phase noise):** Will produce high $J_p$/$J_a$, but $\text{EMA}(T)$ and $\text{EMA}(A)$ will become *unstable*. The algorithm will correctly identify this as *unstructured noise*.
444
+
445
+ This creates a quantitative, real-time metric for "output fidelity" that can distinguish desirable prosody from undesirable artifacts. This metric could be used to automatically flag or discard bad generations, or even as a reward function for a Reinforcement Learning (RL) agent tasked with fine-tuning the S2M or Vocoder modules.
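+ A minimal sketch of such a gate, using the validate_tts_output / TTSQualityReport::passes API added under src/quality/prosody.rs (the indextts crate name is assumed); the 60.0 threshold and the reject-on-error policy are illustrative:
+
+ Rust
+
+ use indextts::quality::MarineProsodyConditioner;
+
+ /// Re-analyze the vocoder output and decide whether to keep the generation.
+ fn passes_marine_gate(vocoder_output: &[f32], sample_rate: u32) -> bool {
+     let conditioner = MarineProsodyConditioner::new(sample_rate);
+     match conditioner.validate_tts_output(vocoder_output) {
+         Ok(report) => {
+             for issue in &report.issues {
+                 eprintln!("marine gate: {issue}");
+             }
+             // Keep the generation only if the score clears the bar and no issues were flagged.
+             report.passes(60.0)
+         }
+         Err(_) => false, // empty or unanalyzable audio: reject
+     }
+ }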
446
+
447
+ #### **Works cited**
448
+
449
+ 1. marine-Universal-Salience-algoritm.tex
450
+ 2. uploaded: IndexTTS-Rust README.md
examples/analyze_chris.rs ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:00940abda6dd597d7dacdbb97761fb0635d0dcc7dc30d5391fe159129008b03a
3
+ size 8470
examples/marine_test.rs ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d179d8f3adc5338e94ee2b92f366a36d03c32b51767223d1eefeb42ce9165374
3
+ size 10845
requirements.txt DELETED
@@ -1,32 +0,0 @@
1
- accelerate==1.8.1
2
- descript-audiotools==0.7.2
3
- transformers==4.52.1
4
- tokenizers==0.21.0
5
- cn2an==0.5.22
6
- ffmpeg-python==0.2.0
7
- Cython==3.0.7
8
- g2p-en==2.1.0
9
- jieba==0.42.1
10
- json5==0.10.0
11
- keras==2.9.0
12
- numba==0.58.1
13
- numpy==1.26.2
14
- pandas==2.1.3
15
- matplotlib==3.8.2
16
- munch==4.0.0
17
- opencv-python==4.9.0.80
18
- tensorboard==2.9.1
19
- librosa==0.10.2.post1
20
- safetensors==0.5.2
21
- deepspeed==0.17.1
22
- modelscope==1.27.0
23
- omegaconf
24
- sentencepiece
25
- gradio
26
- tqdm
27
- textstat
28
- huggingface_hub
29
- spaces
30
-
31
- WeTextProcessing; platform_machine != "Darwin"
32
- wetext; platform_system == "Darwin"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
src/audio/mod.rs CHANGED
@@ -4,7 +4,7 @@
4
 
5
  mod dsp;
6
  mod io;
7
- mod mel;
8
  mod resample;
9
 
10
  pub use dsp::{apply_preemphasis, dynamic_range_compression, dynamic_range_decompression, normalize_audio, normalize_audio_peak, apply_fade};
 
4
 
5
  mod dsp;
6
  mod io;
7
+ pub mod mel;
8
  mod resample;
9
 
10
  pub use dsp::{apply_preemphasis, dynamic_range_compression, dynamic_range_decompression, normalize_audio, normalize_audio_peak, apply_fade};
src/audio/resample.rs CHANGED
@@ -31,7 +31,7 @@ pub fn resample(audio: &AudioData, target_sr: u32) -> Result<AudioData> {
31
  let mut input_buffer = vec![vec![0.0f32; input_frames_needed]];
32
  let mut output_samples = Vec::new();
33
 
34
- let mut pos = 0;
35
  while pos < audio.samples.len() {
36
  // Fill input buffer
37
  let end = (pos + input_frames_needed).min(audio.samples.len());
 
31
  let mut input_buffer = vec![vec![0.0f32; input_frames_needed]];
32
  let mut output_samples = Vec::new();
33
 
34
+ let mut pos = 0;
35
  while pos < audio.samples.len() {
36
  // Fill input buffer
37
  let end = (pos + input_frames_needed).min(audio.samples.len());
src/lib.rs CHANGED
@@ -27,6 +27,7 @@ pub mod config;
27
  pub mod error;
28
  pub mod model;
29
  pub mod pipeline;
 
30
  pub mod text;
31
  pub mod vocoder;
32
 
@@ -34,6 +35,11 @@ pub use config::Config;
34
  pub use error::{Error, Result};
35
  pub use pipeline::IndexTTS;
36
 
 
 
 
 
 
37
  /// Library version
38
  pub const VERSION: &str = env!("CARGO_PKG_VERSION");
39
 
 
27
  pub mod error;
28
  pub mod model;
29
  pub mod pipeline;
30
+ pub mod quality;
31
  pub mod text;
32
  pub mod vocoder;
33
 
 
35
  pub use error::{Error, Result};
36
  pub use pipeline::IndexTTS;
37
 
38
+ // Re-export Marine quality validation
39
+ pub use quality::{
40
+ ComfortLevel, ConversationAffectSummary, MarineProsodyConditioner, MarineProsodyVector,
41
+ };
42
+
43
  /// Library version
44
  pub const VERSION: &str = env!("CARGO_PKG_VERSION");
45
 
src/quality/affect.rs ADDED
@@ -0,0 +1,445 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ //! Conversation Affect Tracking - Session-level comfort analysis
2
+ //!
3
+ //! After a conversation, Aye can determine: "This felt uneasy / ok / happy"
4
+ //! based on Marine prosody patterns over time.
5
+ //!
6
+ //! The key insight: jitter patterns reveal emotional state
7
+ //! - Rising jitter over conversation = increasing tension
8
+ //! - Stable low jitter = calm exchange
9
+ //! - High energy + low jitter = positive/confident
10
+
11
+ use super::prosody::MarineProsodyVector;
12
+
13
+ /// Comfort level classification
14
+ ///
15
+ /// After a conversation, this represents the overall emotional tone.
16
+ /// Used by Aye to self-assess: "How did I make you feel?"
17
+ #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
18
+ pub enum ComfortLevel {
19
+ /// High jitter AND rising over session - tension/nervousness
20
+ Uneasy,
21
+ /// Stable but low energy, or mildly jittery but not escalating
22
+ Neutral,
23
+ /// Good energy, low/stable jitter - positive interaction
24
+ Happy,
25
+ }
26
+
27
+ impl ComfortLevel {
28
+ /// Convert to emoji representation
29
+ pub fn emoji(&self) -> &'static str {
30
+ match self {
31
+ ComfortLevel::Uneasy => "😟",
32
+ ComfortLevel::Neutral => "😐",
33
+ ComfortLevel::Happy => "😊",
34
+ }
35
+ }
36
+
37
+ /// Convert to descriptive string
38
+ pub fn description(&self) -> &'static str {
39
+ match self {
40
+ ComfortLevel::Uneasy => "uneasy or tense",
41
+ ComfortLevel::Neutral => "neutral or stable",
42
+ ComfortLevel::Happy => "comfortable and positive",
43
+ }
44
+ }
45
+
46
+ /// Convert to numeric score (-1 = uneasy, 0 = neutral, 1 = happy)
47
+ pub fn score(&self) -> i8 {
48
+ match self {
49
+ ComfortLevel::Uneasy => -1,
50
+ ComfortLevel::Neutral => 0,
51
+ ComfortLevel::Happy => 1,
52
+ }
53
+ }
54
+ }
55
+
56
+ /// Conversation affect summary
57
+ ///
58
+ /// Aggregates Marine prosody data over an entire conversation to
59
+ /// provide session-level emotional assessment.
60
+ #[derive(Debug, Clone)]
61
+ pub struct ConversationAffectSummary {
62
+ /// Comfort level of the human speaker (if analyzed)
63
+ pub human_state: Option<ComfortLevel>,
64
+ /// Comfort level of Aye's output
65
+ pub aye_state: ComfortLevel,
66
+ /// Overall audio/structure quality (0..1)
67
+ pub quality_score: f32,
68
+ /// Number of utterances analyzed
69
+ pub utterance_count: usize,
70
+ /// Session duration in seconds
71
+ pub duration_seconds: f32,
72
+ /// Mean prosody statistics
73
+ pub mean_prosody: MarineProsodyVector,
74
+ /// Jitter trend (positive = rising, negative = falling)
75
+ pub jitter_trend: f32,
76
+ /// Energy trend (positive = rising, negative = falling)
77
+ pub energy_trend: f32,
78
+ }
79
+
80
+ impl ConversationAffectSummary {
81
+ /// Generate Aye's self-assessment message
82
+ pub fn aye_assessment(&self) -> String {
83
+ let emoji = self.aye_state.emoji();
84
+ let desc = self.aye_state.description();
85
+
86
+ let quality_desc = if self.quality_score > 0.8 {
87
+ "very good"
88
+ } else if self.quality_score > 0.6 {
89
+ "good"
90
+ } else if self.quality_score > 0.4 {
91
+ "moderate"
92
+ } else {
93
+ "low"
94
+ };
95
+
96
+ format!(
97
+ "{} Aye thinks this conversation felt {}. Audio quality was {} ({:.0}%). \
98
+ {} {} utterances over {:.1} seconds.",
99
+ emoji,
100
+ desc,
101
+ quality_desc,
102
+ self.quality_score * 100.0,
103
+ if self.jitter_trend > 0.05 {
104
+ "Tension seemed to increase."
105
+ } else if self.jitter_trend < -0.05 {
106
+ "Tension seemed to decrease."
107
+ } else {
108
+ "Emotional tone stayed consistent."
109
+ },
110
+ self.utterance_count,
111
+ self.duration_seconds
112
+ )
113
+ }
114
+
115
+ /// Generate prompt for asking human for feedback
116
+ pub fn feedback_prompt(&self) -> String {
117
+ format!(
118
+ "Aye would like to improve. How did this conversation make you feel?\n\
119
+ A) Uneasy or tense 😟\n\
120
+ B) Neutral or okay 😐\n\
121
+ C) Comfortable and positive 😊\n\n\
122
+ Aye's self-assessment: {} ({})",
123
+ self.aye_state.emoji(),
124
+ self.aye_state.description()
125
+ )
126
+ }
127
+ }
128
+
129
+ /// Conversation affect analyzer
130
+ ///
131
+ /// Collects prosody vectors over a conversation and computes
132
+ /// session-level emotional state.
133
+ pub struct ConversationAffectAnalyzer {
134
+ /// Collected prosody vectors
135
+ utterances: Vec<MarineProsodyVector>,
136
+ /// Total audio duration
137
+ total_duration_seconds: f32,
138
+ /// Configuration thresholds
139
+ config: AffectAnalyzerConfig,
140
+ }
141
+
142
+ /// Configuration for affect classification
143
+ #[derive(Debug, Clone, Copy)]
144
+ pub struct AffectAnalyzerConfig {
145
+ /// Threshold for "high" combined jitter
146
+ pub high_jitter_threshold: f32,
147
+ /// Threshold for "rising" jitter trend
148
+ pub rising_jitter_threshold: f32,
149
+ /// Threshold for "high" energy (happy indicator)
150
+ pub high_energy_threshold: f32,
151
+ }
152
+
153
+ impl Default for AffectAnalyzerConfig {
154
+ fn default() -> Self {
155
+ Self {
156
+ high_jitter_threshold: 0.4,
157
+ rising_jitter_threshold: 0.1,
158
+ high_energy_threshold: 0.5,
159
+ }
160
+ }
161
+ }
162
+
163
+ impl ConversationAffectAnalyzer {
164
+ /// Create new analyzer with default config
165
+ pub fn new() -> Self {
166
+ Self {
167
+ utterances: Vec::new(),
168
+ total_duration_seconds: 0.0,
169
+ config: AffectAnalyzerConfig::default(),
170
+ }
171
+ }
172
+
173
+ /// Create with custom configuration
174
+ pub fn with_config(config: AffectAnalyzerConfig) -> Self {
175
+ Self {
176
+ utterances: Vec::new(),
177
+ total_duration_seconds: 0.0,
178
+ config,
179
+ }
180
+ }
181
+
182
+ /// Add an utterance's prosody to the conversation
183
+ pub fn add_utterance(&mut self, prosody: MarineProsodyVector, duration_seconds: f32) {
184
+ self.utterances.push(prosody);
185
+ self.total_duration_seconds += duration_seconds;
186
+ }
187
+
188
+ /// Reset analyzer for new conversation
189
+ pub fn reset(&mut self) {
190
+ self.utterances.clear();
191
+ self.total_duration_seconds = 0.0;
192
+ }
193
+
194
+ /// Analyze conversation and produce affect summary
195
+ pub fn analyze(&self) -> Option<ConversationAffectSummary> {
196
+ if self.utterances.is_empty() {
197
+ return None;
198
+ }
199
+
200
+ let n = self.utterances.len() as f32;
201
+
202
+ // Calculate mean prosody
203
+ let mut mean_prosody = MarineProsodyVector::zeros();
204
+ for p in &self.utterances {
205
+ mean_prosody.jp_mean += p.jp_mean;
206
+ mean_prosody.jp_std += p.jp_std;
207
+ mean_prosody.ja_mean += p.ja_mean;
208
+ mean_prosody.ja_std += p.ja_std;
209
+ mean_prosody.h_mean += p.h_mean;
210
+ mean_prosody.s_mean += p.s_mean;
211
+ mean_prosody.peak_density += p.peak_density;
212
+ mean_prosody.energy_mean += p.energy_mean;
213
+ }
214
+ mean_prosody.jp_mean /= n;
215
+ mean_prosody.jp_std /= n;
216
+ mean_prosody.ja_mean /= n;
217
+ mean_prosody.ja_std /= n;
218
+ mean_prosody.h_mean /= n;
219
+ mean_prosody.s_mean /= n;
220
+ mean_prosody.peak_density /= n;
221
+ mean_prosody.energy_mean /= n;
222
+
223
+ // Calculate trends (first vs last)
224
+ let jitter_trend = if self.utterances.len() >= 2 {
225
+ let first = self.utterances.first().unwrap().combined_jitter();
226
+ let last = self.utterances.last().unwrap().combined_jitter();
227
+ last - first
228
+ } else {
229
+ 0.0
230
+ };
231
+
232
+ let energy_trend = if self.utterances.len() >= 2 {
233
+ let first = self.utterances.first().unwrap().energy_mean;
234
+ let last = self.utterances.last().unwrap().energy_mean;
235
+ last - first
236
+ } else {
237
+ 0.0
238
+ };
239
+
240
+ // Classify comfort level
241
+ let aye_state = self.classify_comfort(
242
+ mean_prosody.combined_jitter(),
243
+ jitter_trend,
244
+ mean_prosody.energy_mean,
245
+ );
246
+
247
+ let quality_score = mean_prosody.s_mean;
248
+
249
+ Some(ConversationAffectSummary {
250
+ human_state: None, // Would require analyzing human audio
251
+ aye_state,
252
+ quality_score,
253
+ utterance_count: self.utterances.len(),
254
+ duration_seconds: self.total_duration_seconds,
255
+ mean_prosody,
256
+ jitter_trend,
257
+ energy_trend,
258
+ })
259
+ }
260
+
261
+ /// Classify comfort level based on jitter and energy patterns
262
+ fn classify_comfort(
263
+ &self,
264
+ mean_jitter: f32,
265
+ trend_jitter: f32,
266
+ mean_energy: f32,
267
+ ) -> ComfortLevel {
268
+ let high_jitter = mean_jitter > self.config.high_jitter_threshold;
269
+ let rising_jitter = trend_jitter > self.config.rising_jitter_threshold;
270
+
271
+ if high_jitter && rising_jitter {
272
+ // Jitter is high AND getting worse = tension/unease
273
+ ComfortLevel::Uneasy
274
+ } else if mean_energy > self.config.high_energy_threshold && !high_jitter {
275
+ // Good energy with stable jitter = positive/happy
276
+ ComfortLevel::Happy
277
+ } else {
278
+ // In-between: stable but low energy, or slightly jittery but stable
279
+ ComfortLevel::Neutral
280
+ }
281
+ }
282
+
283
+ /// Get number of utterances collected
284
+ pub fn utterance_count(&self) -> usize {
285
+ self.utterances.len()
286
+ }
287
+
288
+ /// Get total duration
289
+ pub fn total_duration(&self) -> f32 {
290
+ self.total_duration_seconds
291
+ }
292
+ }
293
+
294
+ impl Default for ConversationAffectAnalyzer {
295
+ fn default() -> Self {
296
+ Self::new()
297
+ }
298
+ }
299
+
300
+ #[cfg(test)]
301
+ mod tests {
302
+ use super::*;
303
+
304
+ #[test]
305
+ fn test_comfort_level_descriptions() {
306
+ assert_eq!(ComfortLevel::Uneasy.emoji(), "😟");
307
+ assert_eq!(ComfortLevel::Neutral.emoji(), "😐");
308
+ assert_eq!(ComfortLevel::Happy.emoji(), "😊");
309
+
310
+ assert_eq!(ComfortLevel::Uneasy.score(), -1);
311
+ assert_eq!(ComfortLevel::Neutral.score(), 0);
312
+ assert_eq!(ComfortLevel::Happy.score(), 1);
313
+ }
314
+
315
+ #[test]
316
+ fn test_analyzer_empty_conversation() {
317
+ let analyzer = ConversationAffectAnalyzer::new();
318
+ assert!(analyzer.analyze().is_none());
319
+ }
320
+
321
+ #[test]
322
+ fn test_analyzer_single_utterance() {
323
+ let mut analyzer = ConversationAffectAnalyzer::new();
324
+ let prosody = MarineProsodyVector {
325
+ jp_mean: 0.1,
326
+ jp_std: 0.05,
327
+ ja_mean: 0.1,
328
+ ja_std: 0.05,
329
+ h_mean: 1.0,
330
+ s_mean: 0.8,
331
+ peak_density: 50.0,
332
+ energy_mean: 0.6,
333
+ };
334
+ analyzer.add_utterance(prosody, 2.0);
335
+
336
+ let summary = analyzer.analyze().unwrap();
337
+ assert_eq!(summary.utterance_count, 1);
338
+ assert_eq!(summary.duration_seconds, 2.0);
339
+ }
340
+
341
+ #[test]
342
+ fn test_uneasy_classification() {
343
+ let mut analyzer = ConversationAffectAnalyzer::new();
344
+
345
+ // First utterance: moderate jitter
346
+ analyzer.add_utterance(
347
+ MarineProsodyVector {
348
+ jp_mean: 0.3,
349
+ jp_std: 0.1,
350
+ ja_mean: 0.3,
351
+ ja_std: 0.1,
352
+ h_mean: 1.0,
353
+ s_mean: 0.5,
354
+ peak_density: 50.0,
355
+ energy_mean: 0.3,
356
+ },
357
+ 1.0,
358
+ );
359
+
360
+ // Second utterance: HIGH jitter (rising trend)
361
+ analyzer.add_utterance(
362
+ MarineProsodyVector {
363
+ jp_mean: 0.6,
364
+ jp_std: 0.2,
365
+ ja_mean: 0.5,
366
+ ja_std: 0.2,
367
+ h_mean: 0.8,
368
+ s_mean: 0.3,
369
+ peak_density: 60.0,
370
+ energy_mean: 0.4,
371
+ },
372
+ 1.0,
373
+ );
374
+
375
+ let summary = analyzer.analyze().unwrap();
376
+ assert_eq!(summary.aye_state, ComfortLevel::Uneasy);
377
+ assert!(summary.jitter_trend > 0.0); // Rising jitter
378
+ }
379
+
380
+ #[test]
381
+ fn test_happy_classification() {
382
+ let mut analyzer = ConversationAffectAnalyzer::new();
383
+
384
+ // High energy, low jitter = happy
385
+ analyzer.add_utterance(
386
+ MarineProsodyVector {
387
+ jp_mean: 0.1,
388
+ jp_std: 0.05,
389
+ ja_mean: 0.1,
390
+ ja_std: 0.05,
391
+ h_mean: 1.0,
392
+ s_mean: 0.9,
393
+ peak_density: 80.0,
394
+ energy_mean: 0.7,
395
+ },
396
+ 2.0,
397
+ );
398
+
399
+ let summary = analyzer.analyze().unwrap();
400
+ assert_eq!(summary.aye_state, ComfortLevel::Happy);
401
+ }
402
+
403
+ #[test]
404
+ fn test_neutral_classification() {
405
+ let mut analyzer = ConversationAffectAnalyzer::new();
406
+
407
+ // Low energy, moderate jitter = neutral
408
+ analyzer.add_utterance(
409
+ MarineProsodyVector {
410
+ jp_mean: 0.2,
411
+ jp_std: 0.1,
412
+ ja_mean: 0.2,
413
+ ja_std: 0.1,
414
+ h_mean: 1.0,
415
+ s_mean: 0.7,
416
+ peak_density: 40.0,
417
+ energy_mean: 0.3,
418
+ },
419
+ 1.5,
420
+ );
421
+
422
+ let summary = analyzer.analyze().unwrap();
423
+ assert_eq!(summary.aye_state, ComfortLevel::Neutral);
424
+ }
425
+
426
+ #[test]
427
+ fn test_aye_assessment_message() {
428
+ let summary = ConversationAffectSummary {
429
+ human_state: None,
430
+ aye_state: ComfortLevel::Happy,
431
+ quality_score: 0.85,
432
+ utterance_count: 5,
433
+ duration_seconds: 30.0,
434
+ mean_prosody: MarineProsodyVector::zeros(),
435
+ jitter_trend: -0.1,
436
+ energy_trend: 0.2,
437
+ };
438
+
439
+ let message = summary.aye_assessment();
440
+ assert!(message.contains("😊"));
441
+ assert!(message.contains("comfortable"));
442
+ assert!(message.contains("85%"));
443
+ assert!(message.contains("5 utterances"));
444
+ }
445
+ }
src/quality/mod.rs ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ //! Quality validation module using Marine salience
2
+ //!
3
+ //! Provides TTS output validation, prosody extraction, and conversation
4
+ //! affect tracking using the Marine algorithm.
5
+ //!
6
+ //! "Marines are not just jarheads - they are actually very intelligent"
7
+
8
+ pub mod prosody;
9
+ pub mod affect;
10
+
11
+ pub use prosody::{MarineProsodyConditioner, MarineProsodyVector};
12
+ pub use affect::{ComfortLevel, ConversationAffectSummary};
src/quality/prosody.rs ADDED
@@ -0,0 +1,421 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ //! Marine Prosody Conditioner - Extract 8D interpretable emotion vectors
2
+ //!
3
+ //! Uses Marine salience to extract prosodic features from reference audio.
4
+ //! These features are interpretable and can be directly edited for control.
5
+ //!
6
+ //! The 8D vector captures:
7
+ //! 1. Period jitter (mean & std) - pitch stability
8
+ //! 2. Amplitude jitter (mean & std) - roughness/strain
9
+ //! 3. Harmonic alignment - voiced vs noisy
10
+ //! 4. Overall salience - authenticity score
11
+ //! 5. Peak density - speech rate/intensity
12
+ //! 6. Energy - loudness
13
+
14
+ use crate::error::{Error, Result};
15
+
16
+ /// 8-dimensional prosody vector extracted from audio
17
+ ///
18
+ /// These features capture the "emotional signature" of speech:
19
+ /// - Low jitter + high energy = confident/happy
20
+ /// - High jitter + low energy = nervous/uneasy
21
+ /// - Stable patterns = calm, unstable = agitated
22
+ #[derive(Debug, Clone, Copy, PartialEq)]
23
+ pub struct MarineProsodyVector {
24
+ /// Mean period jitter (pitch stability)
25
+ /// Lower = more stable pitch, Higher = more variation
26
+ pub jp_mean: f32,
27
+
28
+ /// Standard deviation of period jitter
29
+ /// Captures consistency of pitch patterns
30
+ pub jp_std: f32,
31
+
32
+ /// Mean amplitude jitter (volume stability)
33
+ /// Lower = consistent volume, Higher = erratic
34
+ pub ja_mean: f32,
35
+
36
+ /// Standard deviation of amplitude jitter
37
+ /// Captures volume pattern consistency
38
+ pub ja_std: f32,
39
+
40
+ /// Mean harmonic alignment score
41
+ /// 1.0 = perfectly voiced, 0.0 = noise
42
+ pub h_mean: f32,
43
+
44
+ /// Mean overall salience score
45
+ /// Overall authenticity/quality rating
46
+ pub s_mean: f32,
47
+
48
+ /// Peak density (peaks per second)
49
+ /// Related to speech rate and intensity
50
+ pub peak_density: f32,
51
+
52
+ /// Mean energy level
53
+ /// Average loudness of detected peaks
54
+ pub energy_mean: f32,
55
+ }
56
+
57
+ impl MarineProsodyVector {
58
+ /// Create zero vector (baseline)
59
+ pub fn zeros() -> Self {
60
+ Self {
61
+ jp_mean: 0.0,
62
+ jp_std: 0.0,
63
+ ja_mean: 0.0,
64
+ ja_std: 0.0,
65
+ h_mean: 1.0,
66
+ s_mean: 1.0,
67
+ peak_density: 0.0,
68
+ energy_mean: 0.0,
69
+ }
70
+ }
71
+
72
+ /// Convert to f32 array for neural network input
73
+ pub fn to_array(&self) -> [f32; 8] {
74
+ [
75
+ self.jp_mean,
76
+ self.jp_std,
77
+ self.ja_mean,
78
+ self.ja_std,
79
+ self.h_mean,
80
+ self.s_mean,
81
+ self.peak_density,
82
+ self.energy_mean,
83
+ ]
84
+ }
85
+
86
+ /// Create from f32 array
87
+ pub fn from_array(arr: [f32; 8]) -> Self {
88
+ Self {
89
+ jp_mean: arr[0],
90
+ jp_std: arr[1],
91
+ ja_mean: arr[2],
92
+ ja_std: arr[3],
93
+ h_mean: arr[4],
94
+ s_mean: arr[5],
95
+ peak_density: arr[6],
96
+ energy_mean: arr[7],
97
+ }
98
+ }
99
+
100
+ /// Get combined jitter (average of period and amplitude)
101
+ pub fn combined_jitter(&self) -> f32 {
102
+ (self.jp_mean + self.ja_mean) / 2.0
103
+ }
104
+
105
+ /// Estimate emotional valence from prosody
106
+ /// Returns value from -1.0 (negative) to 1.0 (positive)
107
+ pub fn estimate_valence(&self) -> f32 {
108
+ // High energy + low jitter = positive
109
+ // Low energy + high jitter = negative
110
+ let jitter_factor = 1.0 / (1.0 + self.combined_jitter());
111
+ let energy_factor = self.energy_mean.sqrt();
112
+
113
+ // Combine factors, normalize to -1..1 range
114
+ (jitter_factor * energy_factor * 2.0 - 1.0).clamp(-1.0, 1.0)
115
+ }
116
+
117
+ /// Estimate arousal/intensity level
118
+ /// Returns value from 0.0 (calm) to 1.0 (excited)
119
+ pub fn estimate_arousal(&self) -> f32 {
120
+ // High peak density + high energy + some jitter variance = high arousal
121
+ let density_factor = (self.peak_density / 100.0).clamp(0.0, 1.0);
122
+ let energy_factor = self.energy_mean.sqrt();
123
+ let variance_factor = (self.jp_std + self.ja_std).clamp(0.0, 1.0);
124
+
125
+ ((density_factor + energy_factor + variance_factor) / 3.0).clamp(0.0, 1.0)
126
+ }
127
+ }
128
+
129
+ impl Default for MarineProsodyVector {
130
+ fn default() -> Self {
131
+ Self::zeros()
132
+ }
133
+ }
134
+
135
+ /// Marine-based prosody conditioner for TTS
136
+ ///
137
+ /// Replaces heavy Conformer-style extractors with lightweight, interpretable
138
+ /// Marine salience features. This gives you:
139
+ /// - 8D interpretable emotion vector
140
+ /// - Direct editability for control
141
+ /// - Biologically plausible processing
142
+ /// - O(n) linear time extraction
143
+ pub struct MarineProsodyConditioner {
144
+ sample_rate: u32,
145
+ jitter_low: f32,
146
+ jitter_high: f32,
147
+ min_period: u32,
148
+ max_period: u32,
149
+ ema_alpha: f32,
150
+ }
151
+
152
+ impl MarineProsodyConditioner {
153
+ /// Create new prosody conditioner for given sample rate
154
+ pub fn new(sample_rate: u32) -> Self {
155
+ // F0 range: ~60Hz (low male) to ~4kHz (includes harmonics)
156
+ let min_period = sample_rate / 4000;
157
+ let max_period = sample_rate / 60;
158
+
159
+ Self {
160
+ sample_rate,
161
+ jitter_low: 0.02,
162
+ jitter_high: 0.60,
163
+ min_period,
164
+ max_period,
165
+ ema_alpha: 0.01,
166
+ }
167
+ }
168
+
169
+ /// Extract prosody vector from audio samples
170
+ ///
171
+ /// Analyzes the audio to produce an 8D prosody vector capturing
172
+ /// the emotional/stylistic characteristics of the speech.
173
+ ///
174
+ /// # Arguments
175
+ /// * `samples` - Audio samples (typically -1.0 to 1.0 range)
176
+ ///
177
+ /// # Returns
178
+ /// * `Ok(MarineProsodyVector)` - Extracted prosody features
179
+ /// * `Err` - If insufficient peaks detected
180
+ pub fn from_samples(&self, samples: &[f32]) -> Result<MarineProsodyVector> {
181
+ if samples.is_empty() {
182
+ return Err(Error::Audio("Empty audio buffer".into()));
183
+ }
184
+
185
+ // Detect peaks and collect jitter measurements
186
+ let mut peaks: Vec<PeakInfo> = Vec::new();
187
+ let clip_threshold = 1e-3;
188
+
189
+ // Simple peak detection
190
+ for i in 1..samples.len().saturating_sub(1) {
191
+ let prev = samples[i - 1].abs();
192
+ let curr = samples[i].abs();
193
+ let next = samples[i + 1].abs();
194
+
195
+ if curr > prev && curr > next && curr > clip_threshold {
196
+ peaks.push(PeakInfo {
197
+ index: i,
198
+ amplitude: curr,
199
+ });
200
+ }
201
+ }
202
+
203
+ if peaks.len() < 3 {
204
+ // Not enough peaks for meaningful analysis
205
+ return Ok(MarineProsodyVector::zeros());
206
+ }
207
+
208
+ // Calculate inter-peak periods and jitter
209
+ let mut periods: Vec<f32> = Vec::new();
210
+ let mut amplitudes: Vec<f32> = Vec::new();
211
+ let mut jp_values: Vec<f32> = Vec::new();
212
+ let mut ja_values: Vec<f32> = Vec::new();
213
+
214
+ // Use EMA for tracking
215
+ let mut ema_period = 0.0f32;
216
+ let mut ema_amp = 0.0f32;
217
+ let mut ema_initialized = false;
218
+
219
+ for i in 1..peaks.len() {
220
+ let period = (peaks[i].index - peaks[i - 1].index) as f32;
221
+ let amp = peaks[i].amplitude;
222
+
223
+ // Check if period is in valid range
224
+ if period > self.min_period as f32 && period < self.max_period as f32 {
225
+ periods.push(period);
226
+ amplitudes.push(amp);
227
+
228
+ if !ema_initialized {
229
+ ema_period = period;
230
+ ema_amp = amp;
231
+ ema_initialized = true;
232
+ } else {
233
+ // Calculate jitter
234
+ let jp = (period - ema_period).abs() / ema_period;
235
+ let ja = (amp - ema_amp).abs() / ema_amp;
236
+ jp_values.push(jp);
237
+ ja_values.push(ja);
238
+
239
+ // Update EMA
240
+ ema_period = self.ema_alpha * period + (1.0 - self.ema_alpha) * ema_period;
241
+ ema_amp = self.ema_alpha * amp + (1.0 - self.ema_alpha) * ema_amp;
242
+ }
243
+ }
244
+ }
245
+
246
+ if jp_values.is_empty() {
247
+ return Ok(MarineProsodyVector::zeros());
248
+ }
249
+
250
+ // Compute statistics
251
+ let n = jp_values.len() as f32;
252
+ let duration_sec = samples.len() as f32 / self.sample_rate as f32;
253
+
254
+ // Mean calculations
255
+ let jp_mean = jp_values.iter().sum::<f32>() / n;
256
+ let ja_mean = ja_values.iter().sum::<f32>() / n;
257
+ let energy_mean = amplitudes.iter().map(|a| a * a).sum::<f32>() / amplitudes.len() as f32;
258
+
259
+ // Std calculations
260
+ let jp_var = jp_values.iter().map(|x| (x - jp_mean).powi(2)).sum::<f32>() / n;
261
+ let ja_var = ja_values.iter().map(|x| (x - ja_mean).powi(2)).sum::<f32>() / n;
262
+ let jp_std = jp_var.sqrt();
263
+ let ja_std = ja_var.sqrt();
264
+
265
+ // Harmonic score (simplified - assume voiced content)
266
+ let h_mean = 1.0;
267
+
268
+ // Overall salience score
269
+ let s_mean = 1.0 / (1.0 + jp_mean + ja_mean);
270
+
271
+ // Peak density
272
+ let peak_density = peaks.len() as f32 / duration_sec;
273
+
274
+ Ok(MarineProsodyVector {
275
+ jp_mean,
276
+ jp_std,
277
+ ja_mean,
278
+ ja_std,
279
+ h_mean,
280
+ s_mean,
281
+ peak_density,
282
+ energy_mean,
283
+ })
284
+ }
285
+
286
+ /// Validate TTS output quality using Marine salience
287
+ ///
288
+ /// Returns quality score and potential issues detected
289
+ pub fn validate_tts_output(&self, samples: &[f32]) -> Result<TTSQualityReport> {
290
+ let prosody = self.from_samples(samples)?;
291
+
292
+ let mut issues = Vec::new();
293
+
294
+ // Check for common TTS problems
295
+ if prosody.jp_mean < 0.005 {
296
+ issues.push("Too perfect - sounds robotic (add natural variation)");
297
+ }
298
+
299
+ if prosody.jp_mean > 0.3 {
300
+ issues.push("High period jitter - possible artifacts");
301
+ }
302
+
303
+ if prosody.ja_mean > 0.4 {
304
+ issues.push("High amplitude jitter - volume inconsistency");
305
+ }
306
+
307
+ if prosody.s_mean < 0.4 {
308
+ issues.push("Low salience - audio quality issues");
309
+ }
310
+
311
+ if prosody.peak_density < 10.0 {
312
+ issues.push("Low peak density - missing speech energy");
313
+ }
314
+
315
+ let quality_score = prosody.s_mean * 100.0;
316
+
317
+ Ok(TTSQualityReport {
318
+ prosody,
319
+ quality_score,
320
+ issues,
321
+ })
322
+ }
323
+
324
+ /// Get the configured sample rate
325
+ pub fn sample_rate(&self) -> u32 {
326
+ self.sample_rate
327
+ }
328
+ }
329
+
330
+ /// Internal peak information
331
+ struct PeakInfo {
332
+ index: usize,
333
+ amplitude: f32,
334
+ }
335
+
336
+ /// TTS quality validation report
337
+ #[derive(Debug, Clone)]
338
+ pub struct TTSQualityReport {
339
+ /// Extracted prosody vector
340
+ pub prosody: MarineProsodyVector,
341
+ /// Overall quality score (0-100)
342
+ pub quality_score: f32,
343
+ /// List of detected issues
344
+ pub issues: Vec<&'static str>,
345
+ }
346
+
347
+ impl TTSQualityReport {
348
+ /// Check if quality passes threshold
349
+ pub fn passes(&self, threshold: f32) -> bool {
350
+ self.quality_score >= threshold && self.issues.is_empty()
351
+ }
352
+ }
353
+
354
+ #[cfg(test)]
355
+ mod tests {
356
+ use super::*;
357
+
358
+ #[test]
359
+ fn test_prosody_vector_array_conversion() {
360
+ let vec = MarineProsodyVector {
361
+ jp_mean: 0.1,
362
+ jp_std: 0.05,
363
+ ja_mean: 0.2,
364
+ ja_std: 0.1,
365
+ h_mean: 0.9,
366
+ s_mean: 0.8,
367
+ peak_density: 50.0,
368
+ energy_mean: 0.3,
369
+ };
370
+
371
+ let arr = vec.to_array();
372
+ let reconstructed = MarineProsodyVector::from_array(arr);
373
+
374
+ assert_eq!(vec.jp_mean, reconstructed.jp_mean);
375
+ assert_eq!(vec.s_mean, reconstructed.s_mean);
376
+ }
377
+
378
+ #[test]
379
+ fn test_conditioner_empty_buffer() {
380
+ let conditioner = MarineProsodyConditioner::new(22050);
381
+ let result = conditioner.from_samples(&[]);
382
+ assert!(result.is_err());
383
+ }
384
+
385
+ #[test]
386
+ fn test_conditioner_silence() {
387
+ let conditioner = MarineProsodyConditioner::new(22050);
388
+ let silence = vec![0.0; 1000];
389
+ let prosody = conditioner.from_samples(&silence).unwrap();
390
+ // Should return zeros for silence
391
+ assert_eq!(prosody.peak_density, 0.0);
392
+ }
393
+
394
+ #[test]
395
+ fn test_estimate_valence() {
396
+ let positive = MarineProsodyVector {
397
+ jp_mean: 0.01,
398
+ jp_std: 0.01,
399
+ ja_mean: 0.01,
400
+ ja_std: 0.01,
401
+ h_mean: 1.0,
402
+ s_mean: 0.95,
403
+ peak_density: 100.0,
404
+ energy_mean: 0.8,
405
+ };
406
+
407
+ let negative = MarineProsodyVector {
408
+ jp_mean: 0.5,
409
+ jp_std: 0.3,
410
+ ja_mean: 0.4,
411
+ ja_std: 0.2,
412
+ h_mean: 0.7,
413
+ s_mean: 0.4,
414
+ peak_density: 30.0,
415
+ energy_mean: 0.1,
416
+ };
417
+
418
+ // Higher energy + lower jitter should give more positive valence
419
+ assert!(positive.estimate_valence() > negative.estimate_valence());
420
+ }
421
+ }
tools/convert_to_onnx.py ADDED
@@ -0,0 +1,379 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Convert IndexTTS-2 PyTorch models to ONNX format for Rust inference!
4
+
5
+ This script converts the three main models:
6
+ 1. GPT model (gpt.pth) - Autoregressive text-to-semantic generation
7
+ 2. S2Mel model (s2mel.pth) - Semantic-to-mel spectrogram conversion
8
+ 3. BigVGAN - Mel-to-waveform vocoder (already available as ONNX from NVIDIA)
9
+
10
+ Usage:
11
+ python tools/convert_to_onnx.py
12
+
13
+ Output:
14
+ models/gpt.onnx
15
+ models/s2mel.onnx
16
+ models/bigvgan.onnx (if needed, otherwise use NVIDIA's)
17
+
18
+ Why ONNX?
19
+ - Cross-platform: Works on Windows, Linux, macOS, M1/M2 Macs
20
+ - Fast: ONNX Runtime is highly optimized
21
+ - Rust-native: ort crate provides excellent ONNX Runtime bindings
22
+ - No Python: Production inference without Python dependency hell!
23
+
24
+ Author: Aye & Hue @ 8b.is
25
+ """
26
+
27
+ import os
28
+ import sys
29
+
30
+ # Setup paths
31
+ script_dir = os.path.dirname(os.path.abspath(__file__))
32
+ project_root = os.path.dirname(script_dir)
33
+ os.chdir(project_root)
34
+
35
+ # Set HF cache
36
+ os.environ['HF_HUB_CACHE'] = './checkpoints/hf_cache'
37
+
38
+ print("=" * 70)
39
+ print(" IndexTTS-2 PyTorch to ONNX Converter")
40
+ print(" For Rust inference with ort crate!")
41
+ print("=" * 70)
42
+ print()
43
+
44
+ # Check for models
45
+ if not os.path.exists("checkpoints/gpt.pth"):
46
+ print("ERROR: Models not found!")
47
+ print("Run: python tools/download_files.py -s huggingface")
48
+ sys.exit(1)
49
+
50
+ import torch
51
+ import torch.onnx
52
+ import numpy as np
53
+ from pathlib import Path
54
+
55
+ # Add reference code to path
56
+ sys.path.insert(0, "indextts - REMOVING - REF ONLY")
57
+
58
+ # Create output directory
59
+ output_dir = Path("models")
60
+ output_dir.mkdir(exist_ok=True)
61
+
62
+ print(f"PyTorch version: {torch.__version__}")
63
+ print(f"Output directory: {output_dir}")
64
+ print()
65
+
66
+
67
+ def export_speaker_encoder():
68
+ """
69
+ Export the CAM++ speaker encoder to ONNX.
70
+
71
+ This model extracts speaker embeddings from reference audio.
72
+ Input: mel spectrogram [batch, time, n_mels]
73
+ Output: speaker embedding [batch, 192]
74
+ """
75
+ print("\n" + "=" * 50)
76
+ print("Exporting Speaker Encoder (CAM++)")
77
+ print("=" * 50)
78
+
79
+ try:
80
+ from omegaconf import OmegaConf
81
+ from indextts.s2mel.modules.campplus.DTDNN import CAMPPlus
82
+
83
+ # Load config
84
+ cfg = OmegaConf.load("checkpoints/config.yaml")
85
+
86
+ # Create model
87
+ model = CAMPPlus(feat_dim=80, embedding_size=192)
88
+
89
+ # Load weights
90
+ weights_path = "./checkpoints/hf_cache/models--funasr--campplus/snapshots/fb71fe990cbf6031ae6987a2d76fe64f94377b7e/campplus_cn_common.bin"
91
+ if os.path.exists(weights_path):
92
+ state_dict = torch.load(weights_path, map_location='cpu')
93
+ model.load_state_dict(state_dict)
94
+ print(f"Loaded weights from: {weights_path}")
95
+
96
+ model.eval()
97
+
98
+ # CAMPPlus expects [batch, time, n_mels] NOT [batch, n_mels, time]!
99
+ # This is the key insight - the model processes time-series of mel features
100
+ dummy_input = torch.randn(1, 100, 80) # [batch, time, features]
101
+
102
+ # Verify forward pass works before export
103
+ with torch.no_grad():
104
+ test_output = model(dummy_input)
105
+ print(f"Forward pass works! Output shape: {test_output.shape}")
106
+
107
+ # Export to ONNX
108
+ output_path = output_dir / "speaker_encoder.onnx"
109
+ torch.onnx.export(
110
+ model,
111
+ dummy_input,
112
+ str(output_path),
113
+ input_names=['mel_spectrogram'],
114
+ output_names=['speaker_embedding'],
115
+ dynamic_axes={
116
+ 'mel_spectrogram': {0: 'batch', 1: 'time'}, # time is dim 1!
117
+ 'speaker_embedding': {0: 'batch'}
118
+ },
119
+ opset_version=18, # Use 18+ for latest features
120
+ do_constant_folding=True,
121
+ )
122
+
123
+ # Verify the export
124
+ import onnx
125
+ onnx_model = onnx.load(str(output_path))
126
+ onnx.checker.check_model(onnx_model)
127
+
128
+ print(f"✓ Exported: {output_path}")
129
+ print(f" Input: mel_spectrogram [batch, time, 80]") # Corrected!
130
+ print(f" Output: speaker_embedding [batch, 192]")
131
+ print(f"✓ ONNX model verified!")
132
+ return True
133
+
134
+ except Exception as e:
135
+ print(f"✗ Failed to export speaker encoder: {e}")
136
+ import traceback
137
+ traceback.print_exc()
138
+ return False
139
+
140
+
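
> Editor's note: the `[batch, time, n_mels]` layout above is easy to get wrong when wiring the model into `ort` on the Rust side, so it is worth sanity-checking the exported file from Python first. The following is a minimal smoke test, not part of the converter script; it assumes the export above succeeded and produced `models/speaker_encoder.onnx`, and that `onnxruntime` is installed.

```python
# Hypothetical smoke test for models/speaker_encoder.onnx.
# Requires: pip install onnxruntime numpy
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("models/speaker_encoder.onnx")

# Note the [batch, time, n_mels] layout: time varies, the feature dim is 80.
mel = np.random.randn(1, 100, 80).astype(np.float32)

(embedding,) = session.run(None, {"mel_spectrogram": mel})
assert embedding.shape == (1, 192), embedding.shape
print("speaker embedding shape:", embedding.shape)
```

The same dummy shapes can then be mirrored in the Rust tests that exercise the speaker encoder session.
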
141
+ def export_gpt_model():
142
+ """
143
+ Export the GPT autoregressive model to ONNX.
144
+
145
+ This is the most complex model - generates semantic tokens from text.
146
+ We may need to export it in parts due to KV caching.
147
+
148
+ Input: text_tokens [batch, seq_len], speaker_embedding [batch, 192]
149
+ Output: semantic_codes [batch, code_len]
150
+ """
151
+ print("\n" + "=" * 50)
152
+ print("Exporting GPT Model (Autoregressive)")
153
+ print("=" * 50)
154
+
155
+ try:
156
+ from omegaconf import OmegaConf
157
+
158
+ # Load the full model config
159
+ cfg = OmegaConf.load("checkpoints/config.yaml")
160
+
161
+ # This is tricky - GPT models with KV caching are hard to export
162
+ # We might need to:
163
+ # 1. Export just the forward pass without caching
164
+ # 2. Or export separate encoder/decoder parts
165
+
166
+ print("GPT model export is complex due to:")
167
+ print(" - Autoregressive generation with KV caching")
168
+ print(" - Dynamic sequence lengths")
169
+ print(" - Multiple internal components")
170
+ print()
171
+ print("Options:")
172
+ print(" A) Export without KV cache (slower but simpler)")
173
+ print(" B) Export encoder + single-step decoder (efficient)")
174
+ print(" C) Use torch.compile + ONNX tracing")
175
+ print()
176
+
177
+ # For now, let's try the simpler approach
178
+ from infer_v2 import IndexTTS2
179
+
180
+ # Load model
181
+ tts = IndexTTS2(
182
+ cfg_path="checkpoints/config.yaml",
183
+ model_dir="checkpoints",
184
+ use_fp16=False,
185
+ device="cpu"
186
+ )
187
+
188
+ # Get the GPT component
189
+ gpt = tts.gpt
190
+ gpt.eval()
191
+
192
+ print(f"GPT model loaded: {type(gpt)}")
193
+ print(f"Parameters: {sum(p.numel() for p in gpt.parameters()):,}")
194
+
195
+ # The GPT model architecture:
196
+ # - Text encoder (embeddings + transformer)
197
+ # - Speaker conditioning
198
+ # - Autoregressive decoder
199
+
200
+ # Let's export the text encoder first
201
+ output_path = output_dir / "gpt_encoder.onnx"
202
+
203
+ # Create dummy inputs
204
+ text_tokens = torch.randint(0, 30000, (1, 32), dtype=torch.int64)
205
+
206
+ # This will likely fail due to complex control flow
207
+ # but let's try!
208
+ print(f"Attempting GPT export (may require modifications)...")
209
+
210
+ # For now, just report what we learned
211
+ print()
212
+ print("Note: Full GPT export requires modifying the model code")
213
+ print("to remove dynamic control flow. Creating a wrapper...")
214
+
215
+ return False
216
+
217
+ except Exception as e:
218
+ print(f"✗ Failed to export GPT: {e}")
219
+ import traceback
220
+ traceback.print_exc()
221
+ return False
222
+
223
+
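
> Editor's note: "Option A (export without KV cache)" boils down to tracing a single full-sequence `forward(tokens) -> next-token logits` call with a dynamic sequence axis, and running the token-by-token loop outside the ONNX graph. The sketch below shows that pattern on a deliberately tiny stand-in module; the real IndexTTS GPT class, its inputs, and the file name are not used here, so treat every name as a placeholder.

```python
# Toy illustration of a cache-free autoregressive export (NOT the real GPT).
import torch
import torch.nn as nn

class CacheFreeWrapper(nn.Module):
    """Exposes one full-sequence forward pass: tokens -> next-token logits."""
    def __init__(self, vocab=30000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.body = nn.GRU(dim, dim, batch_first=True)   # stand-in for the GPT stack
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                 # tokens: [batch, seq_len] int64
        h, _ = self.body(self.embed(tokens))   # recompute the whole prefix each call
        return self.head(h[:, -1, :])          # logits for the next token only

model = CacheFreeWrapper().eval()
dummy = torch.randint(0, 30000, (1, 32), dtype=torch.int64)
torch.onnx.export(
    model, dummy, "gpt_cache_free_demo.onnx",
    input_names=["tokens"], output_names=["next_logits"],
    dynamic_axes={"tokens": {0: "batch", 1: "seq_len"},
                  "next_logits": {0: "batch"}},
    opset_version=18,
)
# Generation then re-runs the graph with a growing `tokens` tensor each step:
# O(n^2) compute, but no stateful KV cache has to be expressed in ONNX.
```
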
224
+ def export_s2mel_model():
225
+ """
226
+ Export the Semantic-to-Mel model (flow matching).
227
+
228
+ This converts semantic codes to mel spectrograms.
229
+ Input: semantic_codes [batch, code_len], speaker_embedding [batch, 192]
230
+ Output: mel_spectrogram [batch, 80, mel_len]
231
+ """
232
+ print("\n" + "=" * 50)
233
+ print("Exporting S2Mel Model (Flow Matching)")
234
+ print("=" * 50)
235
+
236
+ try:
237
+ from omegaconf import OmegaConf
238
+
239
+ cfg = OmegaConf.load("checkpoints/config.yaml")
240
+
241
+ print("S2Mel model (Diffusion/Flow Matching) is also complex:")
242
+ print(" - Multiple denoising steps (iterative)")
243
+ print(" - CFM (Conditional Flow Matching) requires ODE solving")
244
+ print()
245
+ print("Export strategy:")
246
+ print(" 1. Export the single denoising step")
247
+ print(" 2. Run iteration loop in Rust")
248
+ print()
249
+
250
+ return False
251
+
252
+ except Exception as e:
253
+ print(f"✗ Failed to export S2Mel: {e}")
254
+ import traceback
255
+ traceback.print_exc()
256
+ return False
257
+
258
+
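
> Editor's note: the "export a single denoising step, run the iteration loop in Rust" strategy means the ONNX graph computes one velocity/denoising update and the caller integrates it (e.g. with plain Euler steps). The sketch below shows what that outer loop could look like, written in Python with `onnxruntime` purely for illustration; `s2mel_step.onnx`, its input/output names, and the code-to-mel upsampling factor are all assumptions, not the actual export.

```python
# Hypothetical driver for a single-step CFM/denoising graph.
import numpy as np
import onnxruntime as ort

def flow_matching_decode(step_session, codes, spk, n_steps=25):
    """Euler integration: x_{t+dt} = x_t + dt * v(x_t, t, cond)."""
    batch = codes.shape[0]
    mel_len = codes.shape[1] * 2                 # assumed upsampling factor
    x = np.random.randn(batch, 80, mel_len).astype(np.float32)  # start from noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = np.full((batch,), i * dt, dtype=np.float32)
        (v,) = step_session.run(None, {
            "noisy_mel": x, "t": t,
            "semantic_codes": codes, "speaker_embedding": spk,
        })
        x = x + dt * v                           # one Euler step per ONNX call
    return x                                     # mel spectrogram [batch, 80, mel_len]

# session = ort.InferenceSession("models/s2mel_step.onnx")
# mel = flow_matching_decode(session, codes, spk)
```

The Rust port would run the same loop with the `ort` crate, keeping only the single-step graph inside ONNX.
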
259
+ def export_bigvgan():
260
+ """
261
+ Export BigVGAN vocoder to ONNX.
262
+
263
+ Good news: NVIDIA provides pre-trained BigVGAN models!
264
+ Even better: They're designed for easy ONNX export.
265
+
266
+ Input: mel_spectrogram [batch, 80, mel_len]
267
+ Output: waveform [batch, 1, wave_len]
268
+ """
269
+ print("\n" + "=" * 50)
270
+ print("Exporting BigVGAN Vocoder")
271
+ print("=" * 50)
272
+
273
+ try:
274
+ # BigVGAN from NVIDIA is easier to export
275
+ # Let's check if we already have it
276
+
277
+ print("BigVGAN options:")
278
+ print(" 1. Use NVIDIA's pre-exported ONNX (recommended)")
279
+ print(" https://github.com/NVIDIA/BigVGAN")
280
+ print()
281
+ print(" 2. Export from PyTorch weights (we'll do this)")
282
+ print()
283
+
284
+ # Try to load BigVGAN
285
+ try:
286
+ from bigvgan import bigvgan
287
+ model = bigvgan.BigVGAN.from_pretrained(
288
+ 'nvidia/bigvgan_v2_22khz_80band_256x',
289
+ use_cuda_kernel=False
290
+ )
291
+ model.eval()
292
+ model.remove_weight_norm() # Important for ONNX!
293
+
294
+ print(f"BigVGAN loaded from HuggingFace")
295
+ print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
296
+
297
+ # Create dummy input
298
+ dummy_mel = torch.randn(1, 80, 100)
299
+
300
+ # Export
301
+ output_path = output_dir / "bigvgan.onnx"
302
+ torch.onnx.export(
303
+ model,
304
+ dummy_mel,
305
+ str(output_path),
306
+ input_names=['mel_spectrogram'],
307
+ output_names=['waveform'],
308
+ dynamic_axes={
309
+ 'mel_spectrogram': {0: 'batch', 2: 'mel_length'},
310
+ 'waveform': {0: 'batch', 2: 'wave_length'}
311
+ },
312
+ opset_version=18, # Use 18+ for latest features
313
+ do_constant_folding=True,
314
+ )
315
+
316
+ print(f"✓ Exported: {output_path}")
317
+ print(f" Input: mel_spectrogram [batch, 80, mel_len]")
318
+ print(f" Output: waveform [batch, 1, wave_len]")
319
+
320
+ # Verify the export
321
+ import onnx
322
+ onnx_model = onnx.load(str(output_path))
323
+ onnx.checker.check_model(onnx_model)
324
+ print(f"✓ ONNX model verified!")
325
+
326
+ return True
327
+
328
+ except ImportError:
329
+ print("bigvgan package not installed, installing...")
330
+ os.system("pip install bigvgan")
331
+ print("Please re-run the script.")
332
+ return False
333
+
334
+ except Exception as e:
335
+ print(f"✗ Failed to export BigVGAN: {e}")
336
+ import traceback
337
+ traceback.print_exc()
338
+ return False
339
+
340
+
341
+ def main():
342
+ print("\nStarting ONNX conversion...\n")
343
+
344
+ results = {}
345
+
346
+ # Export each component
347
+ results['speaker_encoder'] = export_speaker_encoder()
348
+ results['gpt'] = export_gpt_model()
349
+ results['s2mel'] = export_s2mel_model()
350
+ results['bigvgan'] = export_bigvgan()
351
+
352
+ # Summary
353
+ print("\n" + "=" * 70)
354
+ print(" CONVERSION SUMMARY")
355
+ print("=" * 70)
356
+
357
+ for name, success in results.items():
358
+ status = "✓ SUCCESS" if success else "✗ NEEDS WORK"
359
+ print(f" {name:20} {status}")
360
+
361
+ print()
362
+
363
+ if all(results.values()):
364
+ print("All models converted! Ready for Rust inference.")
365
+ else:
366
+ print("Some models need manual intervention.")
367
+ print()
368
+ print("For complex models (GPT, S2Mel), consider:")
369
+ print(" 1. Modifying the Python code to remove dynamic control flow")
370
+ print(" 2. Using torch.jit.trace with concrete inputs")
371
+ print(" 3. Exporting subcomponents separately")
372
+ print(" 4. Using ONNX Runtime's transformer optimizations")
373
+
374
+ print()
375
+ print("Output directory:", output_dir.absolute())
376
+
377
+
378
+ if __name__ == "__main__":
379
+ main()
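
> Editor's note: once `bigvgan.onnx` exists, a quick end-to-end smoke test confirms the dynamic mel axis and the output layout before moving to the Rust side. A minimal check with `onnxruntime`, assuming the export above produced `models/bigvgan.onnx`:

```python
# Smoke test for the exported vocoder: dummy mel in, waveform out.
# Requires: pip install onnxruntime numpy
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("models/bigvgan.onnx")

mel = np.random.randn(1, 80, 100).astype(np.float32)   # [batch, 80, mel_len]
(wave,) = session.run(None, {"mel_spectrogram": mel})

# The 22 kHz / 80-band / 256x checkpoint upsamples by 256: ~100 * 256 samples.
print("waveform shape:", wave.shape)
assert wave.ndim == 3 and wave.shape[0] == 1
```
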
tools/download_files.py CHANGED
File without changes
tools/i18n/i18n.py DELETED
@@ -1,36 +0,0 @@
1
- import json
2
- import locale
3
- import os
4
-
5
- I18N_JSON_DIR : os.PathLike = os.path.join(os.path.dirname(os.path.relpath(__file__)), 'locale')
6
-
7
- def load_language_list(language):
8
- with open(os.path.join(I18N_JSON_DIR, f"{language}.json"), "r", encoding="utf-8") as f:
9
- language_list = json.load(f)
10
- return language_list
11
-
12
- def scan_language_list():
13
- language_list = []
14
- for name in os.listdir(I18N_JSON_DIR):
15
- if name.endswith(".json"):language_list.append(name.split('.')[0])
16
- return language_list
17
-
18
- class I18nAuto:
19
- def __init__(self, language=None):
20
- if language in ["Auto", None]:
21
- language = locale.getdefaultlocale()[0]
22
- # getlocale can't identify the system's language ((None, None))
23
- if not os.path.exists(os.path.join(I18N_JSON_DIR, f"{language}.json")):
24
- language = "en_US"
25
- self.language = language
26
- self.language_map = load_language_list(language)
27
-
28
- def __call__(self, key):
29
- return self.language_map.get(key, key)
30
-
31
- def __repr__(self):
32
- return "Use Language: " + self.language
33
-
34
- if __name__ == "__main__":
35
- i18n = I18nAuto(language='en_US')
36
- print(i18n)
tools/i18n/locale/en_US.json DELETED
@@ -1,49 +0,0 @@
1
- {
2
- "本软件以自拟协议开源, 作者不对软件具备任何控制力, 使用软件者、传播软件导出的声音者自负全责.": "This software is open-sourced under customized license. The author has no control over the software, and users of the software, as well as those who distribute the audio generated by the software, assume full responsibility.",
3
- "如不认可该条款, 则不能使用或引用软件包内任何代码和文件. 详见根目录LICENSE.": "If you do not agree to these terms, you are not permitted to use or reference any code or files within the software package. For further details, please refer to the LICENSE files in the root directory.",
4
- "时长必须为正数": "Duration must be a positive number",
5
- "请输入有效的浮点数": "Please enter a valid floating-point number",
6
- "使用情感参考音频": "Use emotion reference audio",
7
- "使用情感向量控制": "Use emotion vectors",
8
- "使用情感描述文本控制": "Use text description to control emotion",
9
- "上传情感参考音频": "Upload emotion reference audio",
10
- "情感权重": "Emotion control weight",
11
- "喜": "Happy",
12
- "怒": "Angry",
13
- "哀": "Sad",
14
- "惧": "Afraid",
15
- "厌恶": "Disgusted",
16
- "低落": "Melancholic",
17
- "惊喜": "Surprised",
18
- "平静": "Calm",
19
- "情感描述文本": "Emotion description",
20
- "请输入情绪描述(或留空以自动使用目标文本作为情绪描述)": "Please input an emotion description (or leave blank to automatically use the main text prompt)",
21
- "高级生成参数设置": "Advanced generation parameter settings",
22
- "情感向量之和不能超过1.5,请调整后重试。": "The sum of the emotion vectors cannot exceed 1.5. Please adjust and try again.",
23
- "音色参考音频": "Voice Reference",
24
- "音频生成": "Speech Synthesis",
25
- "文本": "Text",
26
- "生成语音": "Synthesize",
27
- "生成结果": "Synthesis Result",
28
- "功能设置": "Settings",
29
- "分句设置": "Text segmentation settings",
30
- "参数会影响音频质量和生成速度": "These parameters affect the audio quality and generation speed.",
31
- "分句最大Token数": "Max tokens per generation segment",
32
- "建议80~200之间,值越大,分句越长;值越小,分句越碎;过小过大都可能导致音频质量不高": "Recommended range: 80 - 200. Larger values require more VRAM but improves the flow of the speech, while lower values require less VRAM but means more fragmented sentences. Values that are too small or too large may lead to less coherent speech.",
33
- "预览分句结果": "Preview of the audio generation segments",
34
- "序号": "Index",
35
- "分句内容": "Content",
36
- "Token数": "Token Count",
37
- "情感控制方式": "Emotion control method",
38
- "GPT2 采样设置": "GPT-2 Sampling Configuration",
39
- "参数会影响音频多样性和生成速度详见": "Influences both the diversity of the generated audio and the generation speed. For further details, refer to",
40
- "是否进行采样": "Enable GPT-2 sampling",
41
- "生成Token最大数量,过小导致音频被截断": "Maximum number of tokens to generate. If text exceeds this, the audio will be cut off.",
42
- "请上传情感参考音频": "Please upload the emotion reference audio",
43
- "当前模型版本": "Current model version: ",
44
- "请输入目标文本": "Please input the text to synthesize",
45
- "例如:委屈巴巴、危险在悄悄逼近": "e.g. deeply sad, danger is creeping closer",
46
- "与音色参考音频相同": "Same as the voice reference",
47
- "情感随机采样": "Randomize emotion sampling",
48
- "显示实验功能": "Show experimental features"
49
- }
tools/i18n/locale/zh_CN.json DELETED
@@ -1,44 +0,0 @@
1
- {
2
- "本软件以自拟协议开源, 作者不对软件具备任何控制力, 使用软件者、传播软件导出的声音者自负全责.": "本软件以自拟协议开源, 作者不对软件具备任何控制力, 使用软件者、传播软件导出的声音者自负全责.",
3
- "如不认可该条款, 则不能使用或引用软件包内任何代码和文件. 详见根目录LICENSE.": "如不认可该条款, 则不能使用或引用软件包内任何代码和文件. 详见根目录LICENSE.",
4
- "时长必须为正数": "时长必须为正数",
5
- "请输入有效的浮点数": "请输入有效的浮点数",
6
- "使用情感参考音频": "使用情感参考音频",
7
- "使用情感向量控制": "使用情感向量控制",
8
- "使用情感描述文本控制": "使用情感描述文本控制",
9
- "上传情感参考音频": "上传情感参考音频",
10
- "情感权重": "情感权重",
11
- "喜": "喜",
12
- "怒": "怒",
13
- "哀": "哀",
14
- "惧": "惧",
15
- "厌恶": "厌恶",
16
- "低落": "低落",
17
- "惊喜": "惊喜",
18
- "平静": "平静",
19
- "情感描述文本": "情感描述文本",
20
- "请输入情绪描述(或留空以自动使用目标文本作为情绪描述)": "请输入情绪描述(或留空以自动使用目标文本作为情绪描述)",
21
- "高级生成参数设置": "高级生成参数设置",
22
- "情感向量之和不能超过1.5,请调整后重试。": "情感向量之和不能超过1.5,请调整后重试。",
23
- "音色参考音频": "音色参考音频",
24
- "音频生成": "音频生成",
25
- "文本": "文本",
26
- "生成语音": "生成语音",
27
- "生成结果": "生成结果",
28
- "功能设置": "功能设置",
29
- "分句设置": "分句设置",
30
- "参数会影响音频质量和生成速度": "参数会影响音频质量和生成速度",
31
- "分句最大Token数": "分句最大Token数",
32
- "建议80~200之间,值越大,分句越长;值越小,分句越碎;过小过大都可能导致音频质量不高": "建议80~200之间,值越大,分句越长;值越小,分句越碎;过小过大都可能导致音频质量不高",
33
- "预览分句结果": "预览分句结果",
34
- "序号": "序号",
35
- "分句内容": "分句内容",
36
- "Token数": "Token数",
37
- "情感控制方式": "情感控制方式",
38
- "GPT2 采样设置": "GPT2 采样设置",
39
- "参数会影响音频多样性和生成速度详见": "参数会影响音频多样性和生成速度详见",
40
- "是否进行采样": "是否进行采样",
41
- "生成Token最大数量,过小导致音频被截断": "生成Token最大数量,过小导致音频被截断",
42
- "显示实验功能": "显示实验功能",
43
- "例如:委屈巴巴、危险在悄悄逼近": "例如:委屈巴巴、危险在悄悄逼近"
44
- }
tools/i18n/scan_i18n.py DELETED
@@ -1,131 +0,0 @@
1
- import ast
2
- import glob
3
- import json
4
- import os
5
- from collections import OrderedDict
6
-
7
- I18N_JSON_DIR : os.PathLike = os.path.join(os.path.dirname(os.path.relpath(__file__)), 'locale')
8
- DEFAULT_LANGUAGE: str = "zh_CN" # 默认语言
9
- TITLE_LEN : int = 60 # 标题显示长度
10
- KEY_LEN : int = 30 # 键名显示长度
11
- SHOW_KEYS : bool = False # 是否显示键信息
12
- SORT_KEYS : bool = False # 是否按全局键名写入文件
13
-
14
- def extract_i18n_strings(node):
15
- i18n_strings = []
16
-
17
- if (
18
- isinstance(node, ast.Call)
19
- and isinstance(node.func, ast.Name)
20
- and node.func.id == "i18n"
21
- ):
22
- for arg in node.args:
23
- if isinstance(arg, ast.Str):
24
- i18n_strings.append(arg.s)
25
-
26
- for child_node in ast.iter_child_nodes(node):
27
- i18n_strings.extend(extract_i18n_strings(child_node))
28
-
29
- return i18n_strings
30
-
31
- def scan_i18n_strings():
32
- """
33
- scan the directory for all .py files (recursively)
34
- for each file, parse the code into an AST
35
- for each AST, extract the i18n strings
36
- """
37
- strings = []
38
- print(" Scanning Files and Extracting i18n Strings ".center(TITLE_LEN, "="))
39
- for filename in glob.iglob("**/*.py", recursive=True):
40
- try:
41
- with open(filename, "r", encoding="utf-8") as f:
42
- code = f.read()
43
- if "I18nAuto" in code:
44
- tree = ast.parse(code)
45
- i18n_strings = extract_i18n_strings(tree)
46
- print(f"{filename.ljust(KEY_LEN*3//2)}: {len(i18n_strings)}")
47
- if SHOW_KEYS:
48
- print("\n".join([s for s in i18n_strings]))
49
- strings.extend(i18n_strings)
50
- except Exception as e:
51
- print(f"\033[31m[Failed] Error occur at {filename}: {e}\033[0m")
52
-
53
- code_keys = set(strings)
54
- print(f"{'Total Unique'.ljust(KEY_LEN*3//2)}: {len(code_keys)}")
55
- return code_keys
56
-
57
- def update_i18n_json(json_file, standard_keys):
58
- standard_keys = sorted(standard_keys)
59
- print(f" Process {json_file} ".center(TITLE_LEN, "="))
60
- # 读取 JSON 文件
61
- with open(json_file, "r", encoding="utf-8") as f:
62
- json_data = json.load(f, object_pairs_hook=OrderedDict)
63
- # 打印处理前的 JSON 条目数
64
- len_before = len(json_data)
65
- print(f"{'Total Keys'.ljust(KEY_LEN)}: {len_before}")
66
- # 识别缺失的键并补全
67
- miss_keys = set(standard_keys) - set(json_data.keys())
68
- if len(miss_keys) > 0:
69
- print(f"{'Missing Keys (+)'.ljust(KEY_LEN)}: {len(miss_keys)}")
70
- for key in miss_keys:
71
- if DEFAULT_LANGUAGE in json_file:
72
- # 默认语言的键值相同.
73
- json_data[key] = key
74
- else:
75
- # 其他语言的值设置为 #! + 键名以标注未被翻译.
76
- json_data[key] = "#!" + key
77
- if SHOW_KEYS:
78
- print(f"{'Added Missing Key'.ljust(KEY_LEN)}: {key}")
79
- # 识别多余的键并删除
80
- diff_keys = set(json_data.keys()) - set(standard_keys)
81
- if len(diff_keys) > 0:
82
- print(f"{'Unused Keys (-)'.ljust(KEY_LEN)}: {len(diff_keys)}")
83
- for key in diff_keys:
84
- del json_data[key]
85
- if SHOW_KEYS:
86
- print(f"{'Removed Unused Key'.ljust(KEY_LEN)}: {key}")
87
- # 按键顺序排序
88
- json_data = OrderedDict(
89
- sorted(
90
- json_data.items(),
91
- key=lambda x: (
92
- list(standard_keys).index(x[0]) if x[0] in standard_keys and not x[1].startswith('#!') else len(json_data),
93
- )
94
- )
95
- )
96
- # 打印处理后的 JSON 条目数
97
- if len(miss_keys) != 0 or len(diff_keys) != 0:
98
- print(f"{'Total Keys (After)'.ljust(KEY_LEN)}: {len(json_data)}")
99
- # 识别有待翻译的键
100
- num_miss_translation = 0
101
- duplicate_items = {}
102
- for key, value in json_data.items():
103
- if value.startswith("#!"):
104
- num_miss_translation += 1
105
- if SHOW_KEYS:
106
- print(f"{'Missing Translation'.ljust(KEY_LEN)}: {key}")
107
- if value in duplicate_items:
108
- duplicate_items[value].append(key)
109
- else:
110
- duplicate_items[value] = [key]
111
- # 打印是否有重复的值
112
- for value, keys in duplicate_items.items():
113
- if len(keys) > 1:
114
- print("\n".join([f"\033[31m{'[Failed] Duplicate Value'.ljust(KEY_LEN)}: {key} -> {value}\033[0m" for key in keys]))
115
-
116
- if num_miss_translation > 0:
117
- print(f"\033[31m{'[Failed] Missing Translation'.ljust(KEY_LEN)}: {num_miss_translation}\033[0m")
118
- else:
119
- print(f"\033[32m[Passed] All Keys Translated\033[0m")
120
- # 将处理后的结果写入 JSON 文件
121
- with open(json_file, "w", encoding="utf-8") as f:
122
- json.dump(json_data, f, ensure_ascii=False, indent=4, sort_keys=SORT_KEYS)
123
- f.write("\n")
124
- print(f" Updated {json_file} ".center(TITLE_LEN, "=") + '\n')
125
-
126
- if __name__ == "__main__":
127
- code_keys = scan_i18n_strings()
128
- for json_file in os.listdir(I18N_JSON_DIR):
129
- if json_file.endswith(r".json"):
130
- json_file = os.path.join(I18N_JSON_DIR, json_file)
131
- update_i18n_json(json_file, code_keys)
webui.py DELETED
@@ -1,392 +0,0 @@
1
- import spaces
2
- import json
3
- import os
4
- import sys
5
- import threading
6
- import time
7
-
8
- import warnings
9
-
10
- import numpy as np
11
-
12
- warnings.filterwarnings("ignore", category=FutureWarning)
13
- warnings.filterwarnings("ignore", category=UserWarning)
14
-
15
- import pandas as pd
16
-
17
- current_dir = os.path.dirname(os.path.abspath(__file__))
18
- sys.path.append(current_dir)
19
- sys.path.append(os.path.join(current_dir, "indextts"))
20
-
21
- import argparse
22
- parser = argparse.ArgumentParser(
23
- description="IndexTTS WebUI",
24
- formatter_class=argparse.ArgumentDefaultsHelpFormatter,
25
- )
26
- parser.add_argument("--verbose", action="store_true", default=False, help="Enable verbose mode")
27
- parser.add_argument("--port", type=int, default=7860, help="Port to run the web UI on")
28
- parser.add_argument("--host", type=str, default="0.0.0.0", help="Host to run the web UI on")
29
- parser.add_argument("--model_dir", type=str, default="./checkpoints", help="Model checkpoints directory")
30
- parser.add_argument("--fp16", action="store_true", default=False, help="Use FP16 for inference if available")
31
- parser.add_argument("--deepspeed", action="store_true", default=False, help="Use DeepSpeed to accelerate if available")
32
- parser.add_argument("--cuda_kernel", action="store_true", default=False, help="Use CUDA kernel for inference if available")
33
- parser.add_argument("--gui_seg_tokens", type=int, default=120, help="GUI: Max tokens per generation segment")
34
- cmd_args = parser.parse_args()
35
-
36
- from tools.download_files import download_model_from_huggingface
37
- download_model_from_huggingface(os.path.join(current_dir,"checkpoints"),
38
- os.path.join(current_dir, "checkpoints","hf_cache"))
39
-
40
- import gradio as gr
41
- from indextts.infer_v2 import IndexTTS2
42
- from tools.i18n.i18n import I18nAuto
43
-
44
- i18n = I18nAuto(language="Auto")
45
- MODE = 'local'
46
- tts = IndexTTS2(model_dir=cmd_args.model_dir,
47
- cfg_path=os.path.join(cmd_args.model_dir, "config.yaml"),
48
- use_fp16=cmd_args.fp16,
49
- use_deepspeed=cmd_args.deepspeed,
50
- use_cuda_kernel=cmd_args.cuda_kernel,
51
- )
52
- # 支持的语言列表
53
- LANGUAGES = {
54
- "中文": "zh_CN",
55
- "English": "en_US"
56
- }
57
- EMO_CHOICES = [i18n("与音色参考音频相同"),
58
- i18n("使用情感参考音频"),
59
- i18n("使用情感向量控制"),
60
- i18n("使用情感描述文本控制")]
61
- EMO_CHOICES_BASE = EMO_CHOICES[:3] # 基础选项
62
- EMO_CHOICES_EXPERIMENTAL = EMO_CHOICES # 全部选项(包括文本描述)
63
-
64
- os.makedirs("outputs/tasks",exist_ok=True)
65
- os.makedirs("prompts",exist_ok=True)
66
-
67
- MAX_LENGTH_TO_USE_SPEED = 70
68
- with open("examples/cases.jsonl", "r", encoding="utf-8") as f:
69
- example_cases = []
70
- for line in f:
71
- line = line.strip()
72
- if not line:
73
- continue
74
- example = json.loads(line)
75
- if example.get("emo_audio",None):
76
- emo_audio_path = os.path.join("examples",example["emo_audio"])
77
- else:
78
- emo_audio_path = None
79
- example_cases.append([os.path.join("examples", example.get("prompt_audio", "sample_prompt.wav")),
80
- EMO_CHOICES[example.get("emo_mode",0)],
81
- example.get("text"),
82
- emo_audio_path,
83
- example.get("emo_weight",1.0),
84
- example.get("emo_text",""),
85
- example.get("emo_vec_1",0),
86
- example.get("emo_vec_2",0),
87
- example.get("emo_vec_3",0),
88
- example.get("emo_vec_4",0),
89
- example.get("emo_vec_5",0),
90
- example.get("emo_vec_6",0),
91
- example.get("emo_vec_7",0),
92
- example.get("emo_vec_8",0),
93
- example.get("emo_text") is not None]
94
- )
95
-
96
- def normalize_emo_vec(emo_vec):
97
- # emotion factors for better user experience
98
- k_vec = [0.75,0.70,0.80,0.80,0.75,0.75,0.55,0.45]
99
- tmp = np.array(k_vec) * np.array(emo_vec)
100
- if np.sum(tmp) > 0.8:
101
- tmp = tmp * 0.8/ np.sum(tmp)
102
- return tmp.tolist()
103
-
104
- @spaces.GPU
105
- def gen_single(emo_control_method,prompt, text,
106
- emo_ref_path, emo_weight,
107
- vec1, vec2, vec3, vec4, vec5, vec6, vec7, vec8,
108
- emo_text,emo_random,
109
- max_text_tokens_per_segment=120,
110
- *args, progress=gr.Progress()):
111
- output_path = None
112
- if not output_path:
113
- output_path = os.path.join("outputs", f"spk_{int(time.time())}.wav")
114
- # set gradio progress
115
- tts.gr_progress = progress
116
- do_sample, top_p, top_k, temperature, \
117
- length_penalty, num_beams, repetition_penalty, max_mel_tokens = args
118
- kwargs = {
119
- "do_sample": bool(do_sample),
120
- "top_p": float(top_p),
121
- "top_k": int(top_k) if int(top_k) > 0 else None,
122
- "temperature": float(temperature),
123
- "length_penalty": float(length_penalty),
124
- "num_beams": num_beams,
125
- "repetition_penalty": float(repetition_penalty),
126
- "max_mel_tokens": int(max_mel_tokens),
127
- # "typical_sampling": bool(typical_sampling),
128
- # "typical_mass": float(typical_mass),
129
- }
130
- if type(emo_control_method) is not int:
131
- emo_control_method = emo_control_method.value
132
- if emo_control_method == 0: # emotion from speaker
133
- emo_ref_path = None # remove external reference audio
134
- if emo_control_method == 1: # emotion from reference audio
135
- # normalize emo_alpha for better user experience
136
- emo_weight = emo_weight * 0.8
137
- pass
138
- if emo_control_method == 2: # emotion from custom vectors
139
- vec = [vec1, vec2, vec3, vec4, vec5, vec6, vec7, vec8]
140
- vec = normalize_emo_vec(vec)
141
- else:
142
- # don't use the emotion vector inputs for the other modes
143
- vec = None
144
-
145
- if emo_text == "":
146
- # erase empty emotion descriptions; `infer()` will then automatically use the main prompt
147
- emo_text = None
148
-
149
- print(f"Emo control mode:{emo_control_method},weight:{emo_weight},vec:{vec}")
150
- output = tts.infer(spk_audio_prompt=prompt, text=text,
151
- output_path=output_path,
152
- emo_audio_prompt=emo_ref_path, emo_alpha=emo_weight,
153
- emo_vector=vec,
154
- use_emo_text=(emo_control_method==3), emo_text=emo_text,use_random=emo_random,
155
- verbose=cmd_args.verbose,
156
- max_text_tokens_per_segment=int(max_text_tokens_per_segment),
157
- **kwargs)
158
- return gr.update(value=output,visible=True)
159
-
160
- def update_prompt_audio():
161
- update_button = gr.update(interactive=True)
162
- return update_button
163
-
164
- with gr.Blocks(title="IndexTTS Demo") as demo:
165
- mutex = threading.Lock()
166
- gr.HTML('''
167
- <h2><center>IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech</h2>
168
- <p align="center">
169
- <a href='https://arxiv.org/abs/2506.21619'><img src='https://img.shields.io/badge/ArXiv-2506.21619-red'></a>
170
- </p>
171
- ''')
172
-
173
- with gr.Tab(i18n("音频生成")):
174
- with gr.Row():
175
- os.makedirs("prompts",exist_ok=True)
176
- prompt_audio = gr.Audio(label=i18n("音色参考音频"),key="prompt_audio",
177
- sources=["upload","microphone"],type="filepath")
178
- prompt_list = os.listdir("prompts")
179
- default = ''
180
- if prompt_list:
181
- default = prompt_list[0]
182
- with gr.Column():
183
- input_text_single = gr.TextArea(label=i18n("文本"),key="input_text_single", placeholder=i18n("请输入目标文本"), info=f"{i18n('当前模型版本')}{tts.model_version or '1.0'}")
184
- gen_button = gr.Button(i18n("生成语音"), key="gen_button",interactive=True)
185
- output_audio = gr.Audio(label=i18n("生成结果"), visible=True,key="output_audio")
186
- experimental_checkbox = gr.Checkbox(label=i18n("显示实验功能"),value=False)
187
- with gr.Accordion(i18n("功能设置")):
188
- # 情感控制选项部分
189
- with gr.Row():
190
- emo_control_method = gr.Radio(
191
- choices=EMO_CHOICES_BASE,
192
- type="index",
193
- value=EMO_CHOICES_BASE[0],label=i18n("情感控制方式"))
194
- # 情感参考音频部分
195
- with gr.Group(visible=False) as emotion_reference_group:
196
- with gr.Row():
197
- emo_upload = gr.Audio(label=i18n("上传情感参考音频"), type="filepath")
198
-
199
- # 情感随机采样
200
- with gr.Row(visible=False) as emotion_randomize_group:
201
- emo_random = gr.Checkbox(label=i18n("情感随机采样"), value=False)
202
-
203
- # 情感向量控制部分
204
- with gr.Group(visible=False) as emotion_vector_group:
205
- with gr.Row():
206
- with gr.Column():
207
- vec1 = gr.Slider(label=i18n("喜"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)
208
- vec2 = gr.Slider(label=i18n("怒"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)
209
- vec3 = gr.Slider(label=i18n("哀"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)
210
- vec4 = gr.Slider(label=i18n("惧"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)
211
- with gr.Column():
212
- vec5 = gr.Slider(label=i18n("厌恶"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)
213
- vec6 = gr.Slider(label=i18n("低落"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)
214
- vec7 = gr.Slider(label=i18n("惊喜"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)
215
- vec8 = gr.Slider(label=i18n("平静"), minimum=0.0, maximum=1.0, value=0.0, step=0.05)
216
-
217
- with gr.Group(visible=False) as emo_text_group:
218
- with gr.Row():
219
- emo_text = gr.Textbox(label=i18n("情感描述文本"),
220
- placeholder=i18n("请输入情绪描述(或留空以自动使用目标文本作为情绪描述)"),
221
- value="",
222
- info=i18n("例如:委屈巴巴、危险在悄悄逼近"))
223
-
224
-
225
- with gr.Row(visible=False) as emo_weight_group:
226
- emo_weight = gr.Slider(label=i18n("情感权重"), minimum=0.0, maximum=1.0, value=0.8, step=0.01)
227
-
228
- with gr.Accordion(i18n("高级生成参数设置"), open=False,visible=False) as advanced_settings_group:
229
- with gr.Row():
230
- with gr.Column(scale=1):
231
- gr.Markdown(f"**{i18n('GPT2 采样设置')}** _{i18n('参数会影响音频多样性和生成速度详见')} [Generation strategies](https://huggingface.co/docs/transformers/main/en/generation_strategies)._")
232
- with gr.Row():
233
- do_sample = gr.Checkbox(label="do_sample", value=True, info=i18n("是否进行采样"))
234
- temperature = gr.Slider(label="temperature", minimum=0.1, maximum=2.0, value=0.8, step=0.1)
235
- with gr.Row():
236
- top_p = gr.Slider(label="top_p", minimum=0.0, maximum=1.0, value=0.8, step=0.01)
237
- top_k = gr.Slider(label="top_k", minimum=0, maximum=100, value=30, step=1)
238
- num_beams = gr.Slider(label="num_beams", value=3, minimum=1, maximum=10, step=1)
239
- with gr.Row():
240
- repetition_penalty = gr.Number(label="repetition_penalty", precision=None, value=10.0, minimum=0.1, maximum=20.0, step=0.1)
241
- length_penalty = gr.Number(label="length_penalty", precision=None, value=0.0, minimum=-2.0, maximum=2.0, step=0.1)
242
- max_mel_tokens = gr.Slider(label="max_mel_tokens", value=1500, minimum=50, maximum=tts.cfg.gpt.max_mel_tokens, step=10, info=i18n("生成Token最大数量,过小导致音频被截断"), key="max_mel_tokens")
243
- # with gr.Row():
244
- # typical_sampling = gr.Checkbox(label="typical_sampling", value=False, info="不建议使用")
245
- # typical_mass = gr.Slider(label="typical_mass", value=0.9, minimum=0.0, maximum=1.0, step=0.1)
246
- with gr.Column(scale=2):
247
- gr.Markdown(f'**{i18n("分句设置")}** _{i18n("参数会影响音频质量和生成速度")}_')
248
- with gr.Row():
249
- initial_value = max(20, min(tts.cfg.gpt.max_text_tokens, cmd_args.gui_seg_tokens))
250
- max_text_tokens_per_segment = gr.Slider(
251
- label=i18n("分句最大Token数"), value=initial_value, minimum=20, maximum=tts.cfg.gpt.max_text_tokens, step=2, key="max_text_tokens_per_segment",
252
- info=i18n("建议80~200之间,值越大,分句越长;值越小,分句越碎;过小过大都可能导致音频质量不高"),
253
- )
254
- with gr.Accordion(i18n("预览分句结果"), open=True) as segments_settings:
255
- segments_preview = gr.Dataframe(
256
- headers=[i18n("序号"), i18n("分句内容"), i18n("Token数")],
257
- key="segments_preview",
258
- wrap=True,
259
- )
260
- advanced_params = [
261
- do_sample, top_p, top_k, temperature,
262
- length_penalty, num_beams, repetition_penalty, max_mel_tokens,
263
- # typical_sampling, typical_mass,
264
- ]
265
-
266
- if len(example_cases) > 2:
267
- example_table = gr.Examples(
268
- examples=example_cases[:-2],
269
- examples_per_page=20,
270
- inputs=[prompt_audio,
271
- emo_control_method,
272
- input_text_single,
273
- emo_upload,
274
- emo_weight,
275
- emo_text,
276
- vec1,vec2,vec3,vec4,vec5,vec6,vec7,vec8,experimental_checkbox]
277
- )
278
- elif len(example_cases) > 0:
279
- example_table = gr.Examples(
280
- examples=example_cases,
281
- examples_per_page=20,
282
- inputs=[prompt_audio,
283
- emo_control_method,
284
- input_text_single,
285
- emo_upload,
286
- emo_weight,
287
- emo_text,
288
- vec1, vec2, vec3, vec4, vec5, vec6, vec7, vec8, experimental_checkbox]
289
- )
290
-
291
- def on_input_text_change(text, max_text_tokens_per_segment):
292
- if text and len(text) > 0:
293
- text_tokens_list = tts.tokenizer.tokenize(text)
294
-
295
- segments = tts.tokenizer.split_segments(text_tokens_list, max_text_tokens_per_segment=int(max_text_tokens_per_segment))
296
- data = []
297
- for i, s in enumerate(segments):
298
- segment_str = ''.join(s)
299
- tokens_count = len(s)
300
- data.append([i, segment_str, tokens_count])
301
- return {
302
- segments_preview: gr.update(value=data, visible=True, type="array"),
303
- }
304
- else:
305
- df = pd.DataFrame([], columns=[i18n("序号"), i18n("分句内容"), i18n("Token数")])
306
- return {
307
- segments_preview: gr.update(value=df),
308
- }
309
-
310
- def on_method_select(emo_control_method):
311
- if emo_control_method == 1: # emotion reference audio
312
- return (gr.update(visible=True),
313
- gr.update(visible=False),
314
- gr.update(visible=False),
315
- gr.update(visible=False),
316
- gr.update(visible=True)
317
- )
318
- elif emo_control_method == 2: # emotion vectors
319
- return (gr.update(visible=False),
320
- gr.update(visible=True),
321
- gr.update(visible=True),
322
- gr.update(visible=False),
323
- gr.update(visible=False)
324
- )
325
- elif emo_control_method == 3: # emotion text description
326
- return (gr.update(visible=False),
327
- gr.update(visible=True),
328
- gr.update(visible=False),
329
- gr.update(visible=True),
330
- gr.update(visible=True)
331
- )
332
- else: # 0: same as speaker voice
333
- return (gr.update(visible=False),
334
- gr.update(visible=False),
335
- gr.update(visible=False),
336
- gr.update(visible=False),
337
- gr.update(visible=False)
338
- )
339
-
340
- def on_experimental_change(is_exp):
341
- # 切换情感控制选项
342
- # 第三个返回值实际没有起作用
343
- if is_exp:
344
- return gr.update(choices=EMO_CHOICES_EXPERIMENTAL, value=EMO_CHOICES_EXPERIMENTAL[0]), gr.update(visible=True),gr.update(value=example_cases)
345
- else:
346
- return gr.update(choices=EMO_CHOICES_BASE, value=EMO_CHOICES_BASE[0]), gr.update(visible=False),gr.update(value=example_cases[:-2])
347
-
348
- emo_control_method.select(on_method_select,
349
- inputs=[emo_control_method],
350
- outputs=[emotion_reference_group,
351
- emotion_randomize_group,
352
- emotion_vector_group,
353
- emo_text_group,
354
- emo_weight_group]
355
- )
356
-
357
- input_text_single.change(
358
- on_input_text_change,
359
- inputs=[input_text_single, max_text_tokens_per_segment],
360
- outputs=[segments_preview]
361
- )
362
-
363
- experimental_checkbox.change(
364
- on_experimental_change,
365
- inputs=[experimental_checkbox],
366
- outputs=[emo_control_method, advanced_settings_group,example_table.dataset] # 高级参数Accordion
367
- )
368
-
369
- max_text_tokens_per_segment.change(
370
- on_input_text_change,
371
- inputs=[input_text_single, max_text_tokens_per_segment],
372
- outputs=[segments_preview]
373
- )
374
-
375
- prompt_audio.upload(update_prompt_audio,
376
- inputs=[],
377
- outputs=[gen_button])
378
-
379
- gen_button.click(gen_single,
380
- inputs=[emo_control_method,prompt_audio, input_text_single, emo_upload, emo_weight,
381
- vec1, vec2, vec3, vec4, vec5, vec6, vec7, vec8,
382
- emo_text,emo_random,
383
- max_text_tokens_per_segment,
384
- *advanced_params,
385
- ],
386
- outputs=[output_audio])
387
-
388
-
389
-
390
- if __name__ == "__main__":
391
- demo.queue(20)
392
- demo.launch(server_name=cmd_args.host, server_port=cmd_args.port)