Spaces: pythonlearnreal/F5-TTS-THAI
pythonlearnreal committed on Jul 24

Commit 106478e (verified) · 1 Parent(s): 8af61df

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes. See raw diff

Files changed (50)

.gitattributes +22 -0
.gitignore +81 -0
DEPLOYMENT_GUIDE.md +251 -0
Inference.ipynb +62 -0
LICENSE +21 -0
README.md +105 -12
README_DEPLOYMENT.md +88 -0
REFACTORING_README.md +172 -0
app-webui.bat +9 -0
app.py +7 -0
ckpts/README.md +10 -0
data/Emilia_ZH_EN_pinyin/vocab.txt +2586 -0
data/librispeech_pc_test_clean_cross_sentence.lst +0 -0
deployment/.gitignore +81 -0
deployment/README.md +89 -0
deployment/app.py +45 -0
deployment/app_minimal.py +31 -0
deployment/requirements.txt +15 -0
deployment/requirements_minimal.txt +1 -0
deployment/src/f5_tts/api.py +174 -0
deployment/src/f5_tts/cleantext/number_tha.py +145 -0
deployment/src/f5_tts/cleantext/th_repeat.py +41 -0
deployment/src/f5_tts/config.py +98 -0
deployment/src/f5_tts/configs/E2TTS_Base_train.yaml +45 -0
deployment/src/f5_tts/configs/E2TTS_Small_train.yaml +45 -0
deployment/src/f5_tts/configs/F5TTS_Base_train.yaml +48 -0
deployment/src/f5_tts/configs/F5TTS_Small_train.yaml +48 -0
deployment/src/f5_tts/eval/README.md +52 -0
deployment/src/f5_tts/eval/ecapa_tdnn.py +330 -0
deployment/src/f5_tts/eval/eval_infer_batch.py +207 -0
deployment/src/f5_tts/eval/eval_infer_batch.sh +13 -0
deployment/src/f5_tts/eval/eval_librispeech_test_clean.py +96 -0
deployment/src/f5_tts/eval/eval_seedtts_testset.py +95 -0
deployment/src/f5_tts/eval/eval_utmos.py +44 -0
deployment/src/f5_tts/eval/utils_eval.py +413 -0
deployment/src/f5_tts/f5_tts_webui.py +295 -0
deployment/src/f5_tts/infer/README.md +219 -0
deployment/src/f5_tts/infer/SHARED.md +164 -0
deployment/src/f5_tts/infer/examples/basic/basic.toml +11 -0
deployment/src/f5_tts/infer/examples/basic/basic_ref_en.wav +3 -0
deployment/src/f5_tts/infer/examples/basic/basic_ref_zh.wav +3 -0
deployment/src/f5_tts/infer/examples/multi/country.flac +3 -0
deployment/src/f5_tts/infer/examples/multi/main.flac +3 -0
deployment/src/f5_tts/infer/examples/multi/story.toml +20 -0
deployment/src/f5_tts/infer/examples/multi/story.txt +1 -0
deployment/src/f5_tts/infer/examples/multi/town.flac +3 -0
deployment/src/f5_tts/infer/examples/thai_examples/ref_gen_1.wav +3 -0
deployment/src/f5_tts/infer/examples/thai_examples/ref_gen_2.wav +3 -0
deployment/src/f5_tts/infer/examples/thai_examples/ref_gen_3.wav +3 -0
deployment/src/f5_tts/infer/examples/thai_examples/ref_gen_4.wav +3 -0

.gitattributes CHANGED

@@ -33,3 +33,25 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+deployment/src/f5_tts/infer/examples/basic/basic_ref_en.wav filter=lfs diff=lfs merge=lfs -text
+deployment/src/f5_tts/infer/examples/basic/basic_ref_zh.wav filter=lfs diff=lfs merge=lfs -text
+deployment/src/f5_tts/infer/examples/multi/country.flac filter=lfs diff=lfs merge=lfs -text
+deployment/src/f5_tts/infer/examples/multi/main.flac filter=lfs diff=lfs merge=lfs -text
+deployment/src/f5_tts/infer/examples/multi/town.flac filter=lfs diff=lfs merge=lfs -text
+deployment/src/f5_tts/infer/examples/thai_examples/ref_gen_1.wav filter=lfs diff=lfs merge=lfs -text
+deployment/src/f5_tts/infer/examples/thai_examples/ref_gen_2.wav filter=lfs diff=lfs merge=lfs -text
+deployment/src/f5_tts/infer/examples/thai_examples/ref_gen_3.wav filter=lfs diff=lfs merge=lfs -text
+deployment/src/f5_tts/infer/examples/thai_examples/ref_gen_4.wav filter=lfs diff=lfs merge=lfs -text
+deployment/src/f5_tts/infer/examples/thai_examples/tts_gen_1.wav filter=lfs diff=lfs merge=lfs -text
+deployment/src/f5_tts/infer/examples/thai_examples/tts_gen_2.wav filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/basic/basic_ref_en.wav filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/basic/basic_ref_zh.wav filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/multi/country.flac filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/multi/main.flac filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/multi/town.flac filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/thai_examples/ref_gen_1.wav filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/thai_examples/ref_gen_2.wav filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/thai_examples/ref_gen_3.wav filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/thai_examples/ref_gen_4.wav filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/thai_examples/tts_gen_1.wav filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/thai_examples/tts_gen_2.wav filter=lfs diff=lfs merge=lfs -text

.gitignore ADDED

	@@ -0,0 +1,81 @@

+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# PyTorch
+*.pth
+*.pt
+# Gradio
+.gradio/
+flagged/
+# Environment
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+# OS
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+ehthumbs.db
+Thumbs.db
+# Logs
+*.log
+logs/
+# Temporary files
+*.tmp
+*.temp
+tmp/
+temp/
+# Cache
+.cache/
+*.cache
+# Model downloads (if large)
+# ckpts/
+# models/
+# Audio files (if large)
+# *.wav
+# *.mp3
+# *.flac
+# Jupyter
+.ipynb_checkpoints

DEPLOYMENT_GUIDE.md ADDED

	@@ -0,0 +1,251 @@

+# 🚀 F5-TTS Thai WebUI Deployment Guide
+## Deploying to Hugging Face Spaces
+### Step 1: Prepare an Account and Repository
+1. **Create a Hugging Face account** (if you don't have one yet)
+   - Go to https://huggingface.co/join
+   - Sign up for a free account
+2. **Create a new Space**
+   - Go to https://huggingface.co/new-space
+   - Name the Space (e.g. `f5-tts-thai`)
+   - Select SDK: **Gradio**
+   - Select Hardware: **CPU basic** (free) or **GPU** (paid)
+### Step 2: Upload the Code
+**Option 1: Use Git (recommended)**
+```bash
+# Clone the repository created by HF Spaces
+git clone https://huggingface.co/spaces/YOUR_USERNAME/f5-tts-thai
+cd f5-tts-thai
+# Copy the files from your project
+cp -r /path/to/F5-TTS-THAI/src .
+cp /path/to/F5-TTS-THAI/app.py .
+cp /path/to/F5-TTS-THAI/requirements.txt .
+# Create README.md from README_DEPLOYMENT.md
+cp /path/to/F5-TTS-THAI/README_DEPLOYMENT.md README.md
+# Commit and push
+git add .
+git commit -m "Initial deployment"
+git push
+```
+**Option 2: Upload through the Web Interface**
+1. Go to the Space you created
+2. Click "Files and versions"
+3. Upload the files one at a time:
+   - `app.py`
+   - `requirements.txt`
+   - `README.md` (from README_DEPLOYMENT.md)
+   - the entire `src/` folder
+### Step 3: Verify the Deployment
+1. **Wait for the build**
+   - Hugging Face builds the app automatically
+   - View the logs in the "Logs" tab
+2. **Test the app**
+   - Once the build succeeds, the app's URL is shown
+   - Test each piece of functionality
+### Step 4: Advanced Configuration
+**Enable GPU (paid)**
+```yaml
+# In the README.md header
+---
+title: F5-TTS Thai
+emoji: 🎤
+colorFrom: blue
+colorTo: purple
+sdk: gradio
+sdk_version: 4.44.0
+app_file: app.py
+pinned: false
+license: mit
+python_version: "3.10"  # quoted so YAML does not read it as the float 3.1
+hardware: gpu-t4-small  # changed from cpu-basic
+---
+```
+**Setting Environment Variables**
+In the Space settings, add these variables:
+- `CUDA_VISIBLE_DEVICES=0` (for GPU)
+- `TRANSFORMERS_CACHE=/tmp` (to save storage)
+## Deploying to Gradio.app
+### Step 1: Create an Account
+1. Go to https://gradio.app
+2. Create an account and log in
+### Step 2: Deploy
+```bash
+# Install gradio
+pip install gradio
+# Launch the app with a public share link
+python app.py --share
+```
+## Optimizing for Production
+### 1. Reduce Model Size
+```python
+# In config.py, change to
+DEFAULT_MODEL_BASE = "hf://VIZINTZOR/F5-TTS-THAI/model_650000_FP16.pt"  # use FP16
+```
+### 2. Add Caching
+```python
+# In model_manager.py
+from functools import lru_cache
+@lru_cache(maxsize=1)
+def get_cached_model():
+    return load_model(...)
+```
+### 3. Tune Memory Usage
+```python
+# In app.py
+import torch
+torch.set_num_threads(2)  # reduce CPU threads
+```
+### 4. Add Error Handling
+```python
+# ใน app.py
+import gc
+import torch
+def cleanup_memory():
+    gc.collect()
+    if torch.cuda.is_available():
+        torch.cuda.empty_cache()
+```
+## Troubleshooting
+### Problem: Out of Memory
+**Fix:**
+```python
+# Use the FP16 model
+# Reduce NFE steps
+# Add memory cleanup
+```
+### Problem: Slow Loading
+**Fix:**
+```python
+# Pre-load models
+# Use model caching
+# Tune CPU/GPU settings
+```
+### Problem: Import Errors
+**Fix:**
+```python
+# Check requirements.txt
+# Add try-except around imports
+# Use a fallback interface
+```
+## Monitoring and Maintenance
+### 1. View Logs
+```bash
+# View the HF Spaces logs
+# Monitor memory usage
+# Check error rates
+```
+### 2. Update App
+```bash
+# git pull latest changes
+# test locally first
+# deploy gradually
+```
+### 3. Scale Up/Down
+```bash
+# Change hardware specs
+# Adjust concurrent users
+# optimize model loading
+```
+## Security Considerations
+### 1. Input Validation
+```python
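+def validate_audio_input(audio_file):
+    # Check the file size
+    # Check the file format
+    # Limit the audio duration
+    raise NotImplementedError  # placeholder: the guide leaves the body unwritten
+```
+### 2. Rate Limiting
+```python
+import time
+from functools import wraps
+def rate_limit(calls_per_minute=10):
+    # implement rate limiting
+    raise NotImplementedError  # placeholder: the guide leaves the body unwritten
+```
+### 3. Content Filtering
+```python
+def filter_inappropriate_content(text):
+    # Filter inappropriate content
+    # Check for spam
+    raise NotImplementedError  # placeholder: the guide leaves the body unwritten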
+```
+## Cost Optimization
+### Free Tier (CPU)
+- **Limitations**: slow, limited memory
+- **Best for**: demos, testing
+### GPU Tier (T4/A10G)
+- **Price**: ~$0.60-3.00/hour
+- **Best for**: production, fast inference
+### Cost-Saving Tips
+1. Use CPU for development
+2. Enable the GPU only when needed
+3. Use auto-shutdown
+4. Monitor usage regularly
+## Summary
+Deploying the F5-TTS Thai WebUI to cloud platforms is straightforward, and there are several options:
+✅ **Hugging Face Spaces**: easy, has a free tier
+✅ **Gradio.app**: fast, suited to quick demos
+✅ **Cloud Platforms**: AWS, GCP, Azure for enterprise use
+Choose according to your needs and budget! 🚀
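
Reviewer sketch (not part of the commit): the `validate_audio_input` stub in the guide's Security Considerations section lists three checks (size, format, duration) but leaves them unimplemented. One way they might look, assuming WAV-only uploads and using only the standard library, is below. The `MAX_BYTES` and `MAX_SECONDS` limits are hypothetical values chosen for illustration, not settings from the project.

```python
import os
import wave

MAX_BYTES = 10 * 1024 * 1024  # hypothetical size cap (10 MiB)
MAX_SECONDS = 10.0            # hypothetical duration cap


def validate_audio_input(audio_file: str) -> float:
    """Return the clip duration in seconds, or raise ValueError."""
    # Format check: this sketch only accepts WAV files.
    if not audio_file.lower().endswith(".wav"):
        raise ValueError("only .wav files are accepted in this sketch")
    # Size check.
    if os.path.getsize(audio_file) > MAX_BYTES:
        raise ValueError("file is too large")
    # Duration check, read from the WAV header.
    try:
        with wave.open(audio_file, "rb") as wav:
            duration = wav.getnframes() / wav.getframerate()
    except (wave.Error, EOFError) as exc:
        raise ValueError(f"not a readable WAV file: {exc}") from exc
    if duration > MAX_SECONDS:
        raise ValueError(f"audio is {duration:.1f}s, limit is {MAX_SECONDS}s")
    return duration
```

Raising `ValueError` keeps the validator independent of Gradio; a UI callback could catch it and show the message to the user.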

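Reviewer sketch (not part of the commit): the `rate_limit(calls_per_minute=10)` stub in DEPLOYMENT_GUIDE.md imports `time` and `wraps` but has no body. A minimal sliding-window version might look like the following. The `clock` parameter is a hypothetical addition that makes the decorator testable without sleeping; the limit is shared by all callers in the process rather than tracked per user, and the sketch is not thread-safe.

```python
import time
from collections import deque
from functools import wraps


def rate_limit(calls_per_minute=10, clock=time.monotonic):
    """Reject calls once `calls_per_minute` is exceeded in a sliding 60 s window."""
    def decorator(func):
        recent = deque()  # timestamps of accepted calls, oldest first

        @wraps(func)
        def wrapper(*args, **kwargs):
            now = clock()
            # Drop timestamps that have left the 60-second window.
            while recent and now - recent[0] >= 60.0:
                recent.popleft()
            if len(recent) >= calls_per_minute:
                raise RuntimeError("rate limit exceeded, try again later")
            recent.append(now)
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

Wrapping the inference callback with `@rate_limit(calls_per_minute=10)` would then cap how often it runs per process.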
Inference.ipynb ADDED

	@@ -0,0 +1,62 @@

+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": [],
+      "gpuType": "T4"
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    },
+    "accelerator": "GPU"
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Installation"
+      ],
+      "metadata": {
+        "id": "fXnq08ZVNMAh"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "tTdQJcckmuZ4"
+      },
+      "outputs": [],
+      "source": [
+        "!git clone https://github.com/VYNCX/F5-TTS-THAI.git\n",
+        "%cd F5-TTS-THAI\n",
+        "!pip install git+https://github.com/VYNCX/F5-TTS-THAI.git"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Usage"
+      ],
+      "metadata": {
+        "id": "wJNZvZB7PXSI"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!python src/f5_tts/f5_tts_webui.py --share"
+      ],
+      "metadata": {
+        "id": "UoKmwDmfm6qP"
+      },
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2024 Yushen CHEN
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -1,12 +1,105 @@
----
-title: F5 TTS THAI
-emoji: 📚
-colorFrom: green
-colorTo: purple
-sdk: gradio
-sdk_version: 5.38.1
-app_file: app.py
-pinned: false
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+---
+title: F5-TTS-THAI
+app_file: .
+sdk: gradio
+sdk_version: 5.38.0
+---
+# F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. Support for the Thai language.
+[![python](https://img.shields.io/badge/Python-3.10-brightgreen)](https://github.com/SWivid/F5-TTS)
+[![arXiv](https://img.shields.io/badge/arXiv-2410.06885-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2410.06885)
+[![lab](https://img.shields.io/badge/X--LANCE-Lab-grey?labelColor=lightgrey)](https://x-lance.sjtu.edu.cn/)
+[![lab](https://img.shields.io/badge/Peng%20Cheng-Lab-grey?labelColor=lightgrey)](https://www.pcl.ac.cn)
+<!-- <img src="https://github.com/user-attachments/assets/12d7749c-071a-427c-81bf-b87b91def670" alt="Watermark" style="width: 40px; height: auto"> -->
+Thai Text-to-Speech (TTS): a tool that generates speech from text with the F5-TTS model, using the Flow Matching technique
+Fine-tuned model: [VIZINTZOR/F5-TTS-THAI](https://huggingface.co/VIZINTZOR/F5-TTS-THAI)
+ - Latest model checkpoint: 1,000,000 steps
+ - Long passages and some words are still not read correctly
+# Installation
+Before you start, install:
+ - Python (version 3.10 or later recommended)
+ - [CUDA](https://developer.nvidia.com/cuda-downloads) (CUDA version 11.8 recommended)
+```sh
+git clone https://github.com/VYNCX/F5-TTS-THAI.git
+cd F5-TTS-THAI
+python -m venv venv
+# Windows (cmd) syntax; on Linux/macOS use: source venv/bin/activate
+call venv/scripts/activate
+pip install git+https://github.com/VYNCX/F5-TTS-THAI.git
+# Required for efficient use with a GPU
+pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
+```
+Alternatively, run `install.bat` to install.
+# Usage
+You can run `app-webui.bat` to start the app, or run:
+```sh
+  python src/f5_tts/f5_tts_webui.py
+```
+or
+```sh
+  f5-tts_webui
+```
+Run on [Google Colab](https://colab.research.google.com/drive/10yb4-mGbSoyyfMyDX1xVF6uLqfeoCNxV?usp=sharing)
+Tips:
+- You can set "max characters per segment" (max_chars) to reduce reading errors, but generation will be slower. Lower the NFE Step to speed it up.
+- Remember to put spaces between sentences so the text can be split into segments for generation.
+- For ref_text (the reference text), use Thai, or a Thai transliteration when the audio is in another language, so that Thai is read better, e.g. Good Morning > กู้ดมอร์นิ่ง.
+- The reference audio should be no longer than 10 seconds and, if possible, free of background noise.
+- You can lower the speed to improve pronunciation, e.g. a speed of 0.8-0.9 reduces misreadings and dropped words, but lowering it too far may let the reference audio bleed into the output.
+  <details><summary>WebUI examples</summary>
+   - Text To Speech
+   ![Example_Gradio#3](https://github.com/user-attachments/assets/9fd6bf42-3c34-41aa-8f88-3f7ea191e4f0)
+   - Multi Speech
+   ![Example_Gradio#4](https://github.com/user-attachments/assets/fc57b2d0-bef9-4454-94c3-b72ca2551265)
+# Training and Fine-tuning
+Run on Google Colab [Finetune](https://colab.research.google.com/drive/1jwzw4Jn1qF8-F0o3TND68hLHdIqqgYEe?usp=sharing), or
+install locally:
+```sh
+  cd F5-TTS-THAI
+  pip install -e .
+```
+Launch Gradio
+```sh
+  f5-tts_finetune-gradio
+```
+# Audio Examples
+- Reference audio
+- Text: ได้รับข่าวคราวของเราที่จะหาที่มันเป็นไปที่จะจัดขึ้น.
+https://github.com/user-attachments/assets/003c8a54-6f75-4456-907d-d28897e4c393
+- Generated audio 1 (same text)
+- Text: ได้รับข่าวคราวของเราที่จะหาที่มันเป็นไปที่จะจัดขึ้น.
+https://github.com/user-attachments/assets/926829f2-8d56-4f0f-8e2e-d73cfcecc511
+- Generated audio 2 (new text)
+- Text: ฉันชอบฟังเพลงขณะขับรถ เพราะช่วยให้รู้สึกผ่อนคลาย
+https://github.com/user-attachments/assets/06d6e94b-5f83-4d69-99d1-ad19caa9792b
+# References
+- [F5-TTS](https://github.com/SWivid/F5-TTS)
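
Reviewer sketch (not part of the commit): the README tips say that `max_chars` bounds each generated segment and that spaces between sentences are what let the text be split. The project's actual chunker is not shown in this diff, so the function below is only an illustration of the idea. `split_by_max_chars` is a hypothetical name, and the greedy packing is an assumption.

```python
def split_by_max_chars(text: str, max_chars: int = 100) -> list[str]:
    """Greedily pack whitespace-separated pieces into chunks of at most max_chars."""
    chunks, current = [], ""
    for piece in text.split():
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_chars or not current:
            # The piece fits, or it is an oversized piece starting a new chunk
            # (kept whole, because there is no space to split it on).
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

This also shows why the spacing tip matters: a passage with no spaces comes back as a single chunk, however long it is.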

README_DEPLOYMENT.md ADDED

	@@ -0,0 +1,88 @@

+---
+title: F5-TTS Thai
+emoji: 🎤
+colorFrom: blue
+colorTo: purple
+sdk: gradio
+sdk_version: 4.44.0
+app_file: app.py
+pinned: false
+license: mit
+python_version: "3.10"
+hardware: cpu-basic
+---
+# F5-TTS Thai 🎤
+Zero-shot Text-to-Speech for Thai, powered by the F5-TTS model
+## ✨ Features
+- **Multi-Speech Generation**: generate speech in multiple styles in a single file
+- **Voice Cloning**: clone a voice from a short sample file
+- **Thai Language Support**: full support for Thai
+- **Real-time Processing**: process in real time
+- **Segment Editing**: edit and adjust each audio segment
+## 🚀 How to Use
+### Multi-Speech Generation
+1. **Add a speech type**: click "Add speech type" (`เพิ่มประเภทคำพูด` in the UI) to add a new voice style
+2. **Upload a sample audio**: upload an audio file for each style
+3. **Enter the reference text**: type the text that matches the sample audio
+4. **Write the script**: use the format `{style name} text to speak`
+### Usage Example
+```
+{ปกติ} สวัสดีครับ มีอะไรให้ผมช่วยไหมครับ
+{เศร้า} ผมเครียดจริงๆ นะตอนนี้...
+{โกรธ} รู้ไหม! เธอไม่ควรอยู่ที่นี่!
+{กระซิบ} ฉันมีอะไรจะบอกคุณ แต่มันเป็นความลับนะ
+```
+## ⚙️ Technical Details
+### Models Used
+- **F5-TTS**: Zero-shot text-to-speech model
+- **Vocoder**: Neural vocoder for high-quality audio synthesis
+- **Text Processing**: Thai text normalization and processing
+### System Requirements
+- **RAM**: at least 4GB (8GB+ recommended)
+- **GPU**: not required, but speeds things up
+- **Storage**: ~2GB for the model and dependencies
+## 🔧 Configuration
+### Model Settings
+- **NFE Steps**: controls audio quality (16-64)
+- **Cross Fade Duration**: adjusts how segments are joined
+- **Speed**: adjusts the speaking speed
+- **CFG Strength**: adjusts the guidance strength
+### Tips for Good Results
+1. **Sample audio**: use clear audio without background noise, 5-10 seconds long
+2. **Reference text**: match the sample audio as closely as possible
+3. **Text to generate**: use clear spacing and punctuation
+4. **Settings**: start with the defaults, then fine-tune
+## 🚨 Limitations
+- Primarily supports Thai only
+- Audio quality depends on the sample audio
+- Processing time scales with text length
+- Requires internet access to download the model
+## 📝 License
+MIT License - free to use
+## 🤝 Contributing
+You can contribute at the [GitHub Repository](https://github.com/yourusername/F5-TTS-THAI)
+## 🐛 Bug Reports
+If you run into problems, please report them on the GitHub repository's Issues page
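
Reviewer sketch (not part of the commit): README_DEPLOYMENT.md describes the multi-speech script format as `{style name} text to speak`. The parser the WebUI actually uses is not visible in this part of the diff, so the following is only one plausible reading of that format. `parse_script` and the `SEGMENT` pattern are hypothetical names.

```python
import re

# A style tag in braces, followed by the text up to the next brace.
SEGMENT = re.compile(r"\{([^{}]+)\}\s*([^{}]*)")


def parse_script(script: str) -> list[tuple[str, str]]:
    """Return (style, text) pairs in the order they appear in the script."""
    return [
        (style.strip(), text.strip())
        for style, text in SEGMENT.findall(script)
        if text.strip()  # skip tags that have no text after them
    ]
```

In this sketch any text that appears before the first `{style}` tag is ignored; a real implementation might instead assign it to a default style.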

REFACTORING_README.md ADDED

	@@ -0,0 +1,172 @@

+# F5-TTS Thai WebUI - Refactoring Documentation
+## Refactoring Summary
+The file `src/f5_tts/f5_tts_webui.py` has been restructured (refactored) so that the code is better organized, easier to maintain, and extensible in the future
+## Problems with the Original Code
+- **File too large**: more than 680 lines of code in a single file
+- **Functions too long**: some functions ran to hundreds of lines
+- **Global variables**: several global variables made the code hard to trace
+- **Unclear separation of responsibilities**: UI, business logic, and model management code were mixed together
+- **Duplicated code**: similar logic was written more than once
+- **Hard to test**: the original code was difficult to unit test
+## New Structure After Refactoring
+### 1. Files Split by Responsibility (Separation of Concerns)
+```
+src/f5_tts/
+├── config.py                    # Configuration and constants
+├── model_manager.py             # Manages the F5-TTS model
+├── tts_processor.py             # Text-to-Speech and Speech-to-Text processing
+├── multi_speech_processor.py    # Multi-Speech and Segment Editing processing
+├── ui_components.py             # Gradio UI Components
+└── f5_tts_webui.py             # Main application class
+```
+### 2. Classes and Responsibilities
+#### `config.py`
+- Holds all constants and configuration
+- Model paths, default settings, UI configurations
+- UI text (examples, tips)
+#### `ModelManager` class
+- Handles loading and switching F5-TTS models
+- Supports Default, FP16, and Custom models
+- Handles vocoder loading
+- Error handling for model loading
+#### `TTSProcessor` class
+- Text-to-Speech processing
+- Handles seed generation and validation
+- Audio preprocessing and postprocessing
+- Spectrogram generation
+#### `SpeechToTextProcessor` class
+- Speech-to-Text processing with Whisper
+- Supports translation
+- Manages model configurations
+#### `MultiSpeechProcessor` class
+- Multi-Speech generation processing
+- Manages speech types and segments
+- Segment editing and regeneration
+- Silence management
+#### `UIComponents` class
+- Builds Gradio components
+- Handles speech type management
+- Separates UI logic from business logic
+#### `F5TTSWebUI` class
+- Main application class
+- Coordinates the components
+- Event handling and binding
+## Benefits of the Refactoring
+### 1. **Maintainability**
+- Each part of the code has a clear responsibility
+- Changing one part does not affect the others
+- Bugs are easier to find and fix
+### 2. **Reusability**
+- Classes can be reused in other projects
+- Components can be used independently
+### 3. **Testability**
+- Unit tests can be written for each class
+- Dependencies are easy to mock
+- Isolated testing for each piece of functionality
+### 4. **Scalability**
+- New features are easy to add
+- Implementations can change without affecting other parts
+- Supports adding new model types
+### 5. **Readability**
+- Less code in each file
+- Class and method names convey their purpose clearly
+- Complete documentation
+## Usage After Refactoring
+### Running the Application
+```python
+from f5_tts.f5_tts_webui import main
+# Or, from a shell (not Python):
+#   python -m f5_tts.f5_tts_webui --share
+```
+### Using Components Separately
+```python
+from f5_tts.model_manager import ModelManager
+from f5_tts.tts_processor import TTSProcessor
+# Create the model manager
+model_manager = ModelManager()
+# Create the TTS processor
+tts_processor = TTSProcessor(model_manager)
+# Run TTS
+result = tts_processor.infer_tts(
+    ref_audio="path/to/audio.wav",
+    ref_text="เสียงต้นฉบับ",
+    gen_text="ข้อความที่จะสร้าง"
+)
+```
+## Key Changes
+### 1. **No More Global Variables**
+- `f5tts_model` and `vocoder` were moved into `ModelManager`
+- Uses dependency injection instead of global state
+### 2. **Better Error Handling**
+- Checks for errors during model loading
+- Graceful handling of invalid inputs
+### 3. **Configuration Management**
+- All constants live in one place
+- Configuration is easy to change
+### 4. **Type Safety**
+- Uses type hints in key functions
+- Reduces the risk of runtime errors
+## Testing
+After the refactoring, tests can be written and run:
+```python
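+# Example unit tests
+from f5_tts.model_manager import ModelManager
+from f5_tts.tts_processor import TTSProcessor
+def test_model_manager():
+    manager = ModelManager()
+    assert manager.get_model() is not None
+    assert manager.get_vocoder() is not None
+def test_tts_processor():
+    model_manager = ModelManager()
+    processor = TTSProcessor(model_manager)
+    # Test TTS functionality
+    assert processor is not None  # placeholder so the test body is valid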
+```
+## Future Work
+This refactoring lays the groundwork for future development:
+1. **New model types**: support for new models is easy to add
+2. **API Endpoints**: a REST API can be built easily
+3. **Batch Processing**: add functionality for processing multiple files
+4. **Advanced Features**: add features such as voice cloning and style transfer
+5. **Performance Optimization**: performance is easier to improve
+## Summary
+This refactoring substantially improves code quality and prepares the code for future development and extension, while preserving all existing functionality
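
Reviewer sketch (not part of the commit): REFACTORING_README.md claims that passing `ModelManager` into `TTSProcessor` makes dependencies easy to mock. The real classes load model weights, so the self-contained illustration below uses stand-ins instead. `FakeModelManager` and `SketchTTSProcessor` are hypothetical; only the `get_model` and `get_vocoder` method names are taken from the README's own test example.

```python
class FakeModelManager:
    """Hypothetical stand-in that mirrors the get_model/get_vocoder names."""

    def get_model(self):
        return "fake-model"

    def get_vocoder(self):
        return "fake-vocoder"


class SketchTTSProcessor:
    """Illustrative processor that receives its manager through the constructor."""

    def __init__(self, model_manager):
        self.model_manager = model_manager

    def describe(self) -> str:
        # Uses only the injected manager, never a module-level global.
        manager = self.model_manager
        return f"{manager.get_model()} + {manager.get_vocoder()}"


def test_processor_uses_injected_manager():
    processor = SketchTTSProcessor(FakeModelManager())
    assert processor.describe() == "fake-model + fake-vocoder"
```

Because the processor only talks to whatever object it was given, a test like this runs without downloading a checkpoint or touching a GPU.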

app-webui.bat ADDED

	@@ -0,0 +1,9 @@

+@echo off
+set "current_dir=%CD%"
+call venv/scripts/activate
+python src/f5_tts/f5_tts_webui.py
+pause

app.py ADDED

	@@ -0,0 +1,7 @@

+import gradio as gr
+def greet(name):
+    return "Hello " + name + "!!"
+demo = gr.Interface(fn=greet, inputs="text", outputs="text")
+demo.launch()

ckpts/README.md ADDED

	@@ -0,0 +1,10 @@

+Pretrained model ckpts. https://huggingface.co/SWivid/F5-TTS
+```
+ckpts/
+    E2TTS_Base/
+        model_1200000.pt
+    F5TTS_Base/
+        model_1200000.pt
+```

data/Emilia_ZH_EN_pinyin/vocab.txt ADDED

	@@ -0,0 +1,2586 @@

+!
+"
+#
+$
+%
+&
+'
+(
+)
+*
++
+,
+-
+.
+/
+0
+1
+2
+3
+4
+5
+6
+7
+8
+9
+:
+;
+=
+>
+?
+@
+A
+B
+C
+D
+E
+F
+G
+H
+I
+J
+K
+L
+M
+N
+O
+P
+Q
+R
+S
+T
+U
+V
+W
+X
+Y
+Z
+[
+\
+]
+_
+a
+a1
+ai1
+ai2
+ai3
+ai4
+an1
+an3
+an4
+ang1
+ang2
+ang4
+ao1
+ao2
+ao3
+ao4
+b
+ba
+ba1
+ba2
+ba3
+ba4
+bai1
+bai2
+bai3
+bai4
+ban1
+ban2
+ban3
+ban4
+bang1
+bang2
+bang3
+bang4
+bao1
+bao2
+bao3
+bao4
+bei
+bei1
+bei2
+bei3
+bei4
+ben1
+ben2
+ben3
+ben4
+beng
+beng1
+beng2
+beng3
+beng4
+bi1
+bi2
+bi3
+bi4
+bian1
+bian2
+bian3
+bian4
+biao1
+biao2
+biao3
+bie1
+bie2
+bie3
+bie4
+bin1
+bin4
+bing1
+bing2
+bing3
+bing4
+bo
+bo1
+bo2
+bo3
+bo4
+bu2
+bu3
+bu4
+c
+ca1
+cai1
+cai2
+cai3
+cai4
+can1
+can2
+can3
+can4
+cang1
+cang2
+cao1
+cao2
+cao3
+ce4
+cen1
+cen2
+ceng1
+ceng2
+ceng4
+cha1
+cha2
+cha3
+cha4
+chai1
+chai2
+chan1
+chan2
+chan3
+chan4
+chang1
+chang2
+chang3
+chang4
+chao1
+chao2
+chao3
+che1
+che2
+che3
+che4
+chen1
+chen2
+chen3
+chen4
+cheng1
+cheng2
+cheng3
+cheng4
+chi1
+chi2
+chi3
+chi4
+chong1
+chong2
+chong3
+chong4
+chou1
+chou2
+chou3
+chou4
+chu1
+chu2
+chu3
+chu4
+chua1
+chuai1
+chuai2
+chuai3
+chuai4
+chuan1
+chuan2
+chuan3
+chuan4
+chuang1
+chuang2
+chuang3
+chuang4
+chui1
+chui2
+chun1
+chun2
+chun3
+chuo1
+chuo4
+ci1
+ci2
+ci3
+ci4
+cong1
+cong2
+cou4
+cu1
+cu4
+cuan1
+cuan2
+cuan4
+cui1
+cui3
+cui4
+cun1
+cun2
+cun4
+cuo1
+cuo2
+cuo4
+d
+da
+da1
+da2
+da3
+da4
+dai1
+dai2
+dai3
+dai4
+dan1
+dan2
+dan3
+dan4
+dang1
+dang2
+dang3
+dang4
+dao1
+dao2
+dao3
+dao4
+de
+de1
+de2
+dei3
+den4
+deng1
+deng2
+deng3
+deng4
+di1
+di2
+di3
+di4
+dia3
+dian1
+dian2
+dian3
+dian4
+diao1
+diao3
+diao4
+die1
+die2
+die4
+ding1
+ding2
+ding3
+ding4
+diu1
+dong1
+dong3
+dong4
+dou1
+dou2
+dou3
+dou4
+du1
+du2
+du3
+du4
+duan1
+duan2
+duan3
+duan4
+dui1
+dui4
+dun1
+dun3
+dun4
+duo1
+duo2
+duo3
+duo4
+e
+e1
+e2
+e3
+e4
+ei2
+en1
+en4
+er
+er2
+er3
+er4
+f
+fa1
+fa2
+fa3
+fa4
+fan1
+fan2
+fan3
+fan4
+fang1
+fang2
+fang3
+fang4
+fei1
+fei2
+fei3
+fei4
+fen1
+fen2
+fen3
+fen4
+feng1
+feng2
+feng3
+feng4
+fo2
+fou2
+fou3
+fu1
+fu2
+fu3
+fu4
+g
+ga1
+ga2
+ga3
+ga4
+gai1
+gai2
+gai3
+gai4
+gan1
+gan2
+gan3
+gan4
+gang1
+gang2
+gang3
+gang4
+gao1
+gao2
+gao3
+gao4
+ge1
+ge2
+ge3
+ge4
+gei2
+gei3
+gen1
+gen2
+gen3
+gen4
+geng1
+geng3
+geng4
+gong1
+gong3
+gong4
+gou1
+gou2
+gou3
+gou4
+gu
+gu1
+gu2
+gu3
+gu4
+gua1
+gua2
+gua3
+gua4
+guai1
+guai2
+guai3
+guai4
+guan1
+guan2
+guan3
+guan4
+guang1
+guang2
+guang3
+guang4
+gui1
+gui2
+gui3
+gui4
+gun3
+gun4
+guo1
+guo2
+guo3
+guo4
+h
+ha1
+ha2
+ha3
+hai1
+hai2
+hai3
+hai4
+han1
+han2
+han3
+han4
+hang1
+hang2
+hang4
+hao1
+hao2
+hao3
+hao4
+he1
+he2
+he4
+hei1
+hen2
+hen3
+hen4
+heng1
+heng2
+heng4
+hong1
+hong2
+hong3
+hong4
+hou1
+hou2
+hou3
+hou4
+hu1
+hu2
+hu3
+hu4
+hua1
+hua2
+hua4
+huai2
+huai4
+huan1
+huan2
+huan3
+huan4
+huang1
+huang2
+huang3
+huang4
+hui1
+hui2
+hui3
+hui4
+hun1
+hun2
+hun4
+huo
+huo1
+huo2
+huo3
+huo4
+i
+j
+ji1
+ji2
+ji3
+ji4
+jia
+jia1
+jia2
+jia3
+jia4
+jian1
+jian2
+jian3
+jian4
+jiang1
+jiang2
+jiang3
+jiang4
+jiao1
+jiao2
+jiao3
+jiao4
+jie1
+jie2
+jie3
+jie4
+jin1
+jin2
+jin3
+jin4
+jing1
+jing2
+jing3
+jing4
+jiong3
+jiu1
+jiu2
+jiu3
+jiu4
+ju1
+ju2
+ju3
+ju4
+juan1
+juan2
+juan3
+juan4
+jue1
+jue2
+jue4
+jun1
+jun4
+k
+ka1
+ka2
+ka3
+kai1
+kai2
+kai3
+kai4
+kan1
+kan2
+kan3
+kan4
+kang1
+kang2
+kang4
+kao1
+kao2
+kao3
+kao4
+ke1
+ke2
+ke3
+ke4
+ken3
+keng1
+kong1
+kong3
+kong4
+kou1
+kou2
+kou3
+kou4
+ku1
+ku2
+ku3
+ku4
+kua1
+kua3
+kua4
+kuai3
+kuai4
+kuan1
+kuan2
+kuan3
+kuang1
+kuang2
+kuang4
+kui1
+kui2
+kui3
+kui4
+kun1
+kun3
+kun4
+kuo4
+l
+la
+la1
+la2
+la3
+la4
+lai2
+lai4
+lan2
+lan3
+lan4
+lang1
+lang2
+lang3
+lang4
+lao1
+lao2
+lao3
+lao4
+le
+le1
+le4
+lei
+lei1
+lei2
+lei3
+lei4
+leng1
+leng2
+leng3
+leng4
+li
+li1
+li2
+li3
+li4
+lia3
+lian2
+lian3
+lian4
+liang2
+liang3
+liang4
+liao1
+liao2
+liao3
+liao4
+lie1
+lie2
+lie3
+lie4
+lin1
+lin2
+lin3
+lin4
+ling2
+ling3
+ling4
+liu1
+liu2
+liu3
+liu4
+long1
+long2
+long3
+long4
+lou1
+lou2
+lou3
+lou4
+lu1
+lu2
+lu3
+lu4
+luan2
+luan3
+luan4
+lun1
+lun2
+lun4
+luo1
+luo2
+luo3
+luo4
+lv2
+lv3
+lv4
+lve3
+lve4
+m
+ma
+ma1
+ma2
+ma3
+ma4
+mai2
+mai3
+mai4
+man1
+man2
+man3
+man4
+mang2
+mang3
+mao1
+mao2
+mao3
+mao4
+me
+mei2
+mei3
+mei4
+men
+men1
+men2
+men4
+meng
+meng1
+meng2
+meng3
+meng4
+mi1
+mi2
+mi3
+mi4
+mian2
+mian3
+mian4
+miao1
+miao2
+miao3
+miao4
+mie1
+mie4
+min2
+min3
+ming2
+ming3
+ming4
+miu4
+mo1
+mo2
+mo3
+mo4
+mou1
+mou2
+mou3
+mu2
+mu3
+mu4
+n
+n2
+na1
+na2
+na3
+na4
+nai2
+nai3
+nai4
+nan1
+nan2
+nan3
+nan4
+nang1
+nang2
+nang3
+nao1
+nao2
+nao3
+nao4
+ne
+ne2
+ne4
+nei3
+nei4
+nen4
+neng2
+ni1
+ni2
+ni3
+ni4
+nian1
+nian2
+nian3
+nian4
+niang2
+niang4
+niao2
+niao3
+niao4
+nie1
+nie4
+nin2
+ning2
+ning3
+ning4
+niu1
+niu2
+niu3
+niu4
+nong2
+nong4
+nou4
+nu2
+nu3
+nu4
+nuan3
+nuo2
+nuo4
+nv2
+nv3
+nve4
+o
+o1
+o2
+ou1
+ou2
+ou3
+ou4
+p
+pa1
+pa2
+pa4
+pai1
+pai2
+pai3
+pai4
+pan1
+pan2
+pan4
+pang1
+pang2
+pang4
+pao1
+pao2
+pao3
+pao4
+pei1
+pei2
+pei4
+pen1
+pen2
+pen4
+peng1
+peng2
+peng3
+peng4
+pi1
+pi2
+pi3
+pi4
+pian1
+pian2
+pian4
+piao1
+piao2
+piao3
+piao4
+pie1
+pie2
+pie3
+pin1
+pin2
+pin3
+pin4
+ping1
+ping2
+po1
+po2
+po3
+po4
+pou1
+pu1
+pu2
+pu3
+pu4
+q
+qi1
+qi2
+qi3
+qi4
+qia1
+qia3
+qia4
+qian1
+qian2
+qian3
+qian4
+qiang1
+qiang2
+qiang3
+qiang4
+qiao1
+qiao2
+qiao3
+qiao4
+qie1
+qie2
+qie3
+qie4
+qin1
+qin2
+qin3
+qin4
+qing1
+qing2
+qing3
+qing4
+qiong1
+qiong2
+qiu1
+qiu2
+qiu3
+qu1
+qu2
+qu3
+qu4
+quan1
+quan2
+quan3
+quan4
+que1
+que2
+que4
+qun2
+r
+ran2
+ran3
+rang1
+rang2
+rang3
+rang4
+rao2
+rao3
+rao4
+re2
+re3
+re4
+ren2
+ren3
+ren4
+reng1
+reng2
+ri4
+rong1
+rong2
+rong3
+rou2
+rou4
+ru2
+ru3
+ru4
+ruan2
+ruan3
+rui3
+rui4
+run4
+ruo4
+s
+sa1
+sa2
+sa3
+sa4
+sai1
+sai4
+san1
+san2
+san3
+san4
+sang1
+sang3
+sang4
+sao1
+sao2
+sao3
+sao4
+se4
+sen1
+seng1
+sha1
+sha2
+sha3
+sha4
+shai1
+shai2
+shai3
+shai4
+shan1
+shan3
+shan4
+shang
+shang1
+shang3
+shang4
+shao1
+shao2
+shao3
+shao4
+she1
+she2
+she3
+she4
+shei2
+shen1
+shen2
+shen3
+shen4
+sheng1
+sheng2
+sheng3
+sheng4
+shi
+shi1
+shi2
+shi3
+shi4
+shou1
+shou2
+shou3
+shou4
+shu1
+shu2
+shu3
+shu4
+shua1
+shua2
+shua3
+shua4
+shuai1
+shuai3
+shuai4
+shuan1
+shuan4
+shuang1
+shuang3
+shui2
+shui3
+shui4
+shun3
+shun4
+shuo1
+shuo4
+si1
+si2
+si3
+si4
+song1
+song3
+song4
+sou1
+sou3
+sou4
+su1
+su2
+su4
+suan1
+suan4
+sui1
+sui2
+sui3
+sui4
+sun1
+sun3
+suo
+suo1
+suo2
+suo3
+t
+ta1
+ta2
+ta3
+ta4
+tai1
+tai2
+tai4
+tan1
+tan2
+tan3
+tan4
+tang1
+tang2
+tang3
+tang4
+tao1
+tao2
+tao3
+tao4
+te4
+teng2
+ti1
+ti2
+ti3
+ti4
+tian1
+tian2
+tian3
+tiao1
+tiao2
+tiao3
+tiao4
+tie1
+tie2
+tie3
+tie4
+ting1
+ting2
+ting3
+tong1
+tong2
+tong3
+tong4
+tou
+tou1
+tou2
+tou4
+tu1
+tu2
+tu3
+tu4
+tuan1
+tuan2
+tui1
+tui2
+tui3
+tui4
+tun1
+tun2
+tun4
+tuo1
+tuo2
+tuo3
+tuo4
+u
+v
+w
+wa
+wa1
+wa2
+wa3
+wa4
+wai1
+wai3
+wai4
+wan1
+wan2
+wan3
+wan4
+wang1
+wang2
+wang3
+wang4
+wei1
+wei2
+wei3
+wei4
+wen1
+wen2
+wen3
+wen4
+weng1
+weng4
+wo1
+wo2
+wo3
+wo4
+wu1
+wu2
+wu3
+wu4
+x
+xi1
+xi2
+xi3
+xi4
+xia1
+xia2
+xia4
+xian1
+xian2
+xian3
+xian4
+xiang1
+xiang2
+xiang3
+xiang4
+xiao1
+xiao2
+xiao3
+xiao4
+xie1
+xie2
+xie3
+xie4
+xin1
+xin2
+xin4
+xing1
+xing2
+xing3
+xing4
+xiong1
+xiong2
+xiu1
+xiu3
+xiu4
+xu
+xu1
+xu2
+xu3
+xu4
+xuan1
+xuan2
+xuan3
+xuan4
+xue1
+xue2
+xue3
+xue4
+xun1
+xun2
+xun4
+y
+ya
+ya1
+ya2
+ya3
+ya4
+yan1
+yan2
+yan3
+yan4
+yang1
+yang2
+yang3
+yang4
+yao1
+yao2
+yao3
+yao4
+ye1
+ye2
+ye3
+ye4
+yi
+yi1
+yi2
+yi3
+yi4
+yin1
+yin2
+yin3
+yin4
+ying1
+ying2
+ying3
+ying4
+yo1
+yong1
+yong2
+yong3
+yong4
+you1
+you2
+you3
+you4
+yu1
+yu2
+yu3
+yu4
+yuan1
+yuan2
+yuan3
+yuan4
+yue1
+yue4
+yun1
+yun2
+yun3
+yun4
+z
+za1
+za2
+za3
+zai1
+zai3
+zai4
+zan1
+zan2
+zan3
+zan4
+zang1
+zang4
+zao1
+zao2
+zao3
+zao4
+ze2
+ze4
+zei2
+zen3
+zeng1
+zeng4
+zha1
+zha2
+zha3
+zha4
+zhai1
+zhai2
+zhai3
+zhai4
+zhan1
+zhan2
+zhan3
+zhan4
+zhang1
+zhang2
+zhang3
+zhang4
+zhao1
+zhao2
+zhao3
+zhao4
+zhe
+zhe1
+zhe2
+zhe3
+zhe4
+zhen1
+zhen2
+zhen3
+zhen4
+zheng1
+zheng2
+zheng3
+zheng4
+zhi1
+zhi2
+zhi3
+zhi4
+zhong1
+zhong2
+zhong3
+zhong4
+zhou1
+zhou2
+zhou3
+zhou4
+zhu1
+zhu2
+zhu3
+zhu4
+zhua1
+zhua2
+zhua3
+zhuai1
+zhuai3
+zhuai4
+zhuan1
+zhuan2
+zhuan3
+zhuan4
+zhuang1
+zhuang4
+zhui1
+zhui4
+zhun1
+zhun2
+zhun3
+zhuo1
+zhuo2
+zi
+zi1
+zi2
+zi3
+zi4
+zong1
+zong2
+zong3
+zong4
+zou1
+zou2
+zou3
+zou4
+zu1
+zu2
+zu3
+zuan1
+zuan3
+zuan4
+zui2
+zui3
+zui4
+zun1
+zuo
+zuo1
+zuo2
+zuo3
+zuo4
+{
+~
+¡
+¢
+£
+¥
+§
+¨
+©
+«
+®
+¯
+°
+±
+²
+³
+´
+µ
+·
+¹
+º
+»
+¼
+½
+¾
+¿
+À
+Á
+Â
+Ã
+Ä
+Å
+Æ
+Ç
+È
+É
+Ê
+Í
+Î
+Ñ
+Ó
+Ö
+×
+Ø
+Ú
+Ü
+Ý
+Þ
+ß
+à
+á
+â
+ã
+ä
+å
+æ
+ç
+è
+é
+ê
+ë
+ì
+í
+î
+ï
+ð
+ñ
+ò
+ó
+ô
+õ
+ö
+ø
+ù
+ú
+û
+ü
+ý
+Ā
+ā
+ă
+ą
+ć
+Č
+č
+Đ
+đ
+ē
+ė
+ę
+ě
+ĝ
+ğ
+ħ
+ī
+į
+İ
+ı
+Ł
+ł
+ń
+ņ
+ň
+ŋ
+Ō
+ō
+ő
+œ
+ř
+Ś
+ś
+Ş
+ş
+Š
+š
+Ť
+ť
+ũ
+ū
+ź
+Ż
+ż
+Ž
+ž
+ơ
+ư
+ǎ
+ǐ
+ǒ
+ǔ
+ǚ
+ș
+ț
+ɑ
+ɔ
+ɕ
+ə
+ɛ
+ɜ
+ɡ
+ɣ
+ɪ
+ɫ
+ɴ
+ɹ
+ɾ
+ʃ
+ʊ
+ʌ
+ʒ
+ʔ
+ʰ
+ʷ
+ʻ
+ʾ
+ʿ
+ˈ
+ː
+˙
+˜
+ˢ
+́
+̅
+Α
+Β
+Δ
+Ε
+Θ
+Κ
+Λ
+Μ
+Ξ
+Π
+Σ
+Τ
+Φ
+Χ
+Ψ
+Ω
+ά
+έ
+ή
+ί
+α
+β
+γ
+δ
+ε
+ζ
+η
+θ
+ι
+κ
+λ
+μ
+ν
+ξ
+ο
+π
+ρ
+ς
+σ
+τ
+υ
+φ
+χ
+ψ
+ω
+ϊ
+ό
+ύ
+ώ
+ϕ
+ϵ
+Ё
+А
+Б
+В
+Г
+Д
+Е
+Ж
+З
+И
+Й
+К
+Л
+М
+Н
+О
+П
+Р
+С
+Т
+У
+Ф
+Х
+Ц
+Ч
+Ш
+Щ
+Ы
+Ь
+Э
+Ю
+Я
+а
+б
+в
+г
+д
+е
+ж
+з
+и
+й
+к
+л
+м
+н
+о
+п
+р
+с
+т
+у
+ф
+х
+ц
+ч
+ш
+щ
+ъ
+ы
+ь
+э
+ю
+я
+ё
+і
+ְ
+ִ
+ֵ
+ֶ
+ַ
+ָ
+ֹ
+ּ
+־
+ׁ
+א
+ב
+ג
+ד
+ה
+ו
+ז
+ח
+ט
+י
+כ
+ל
+ם
+מ
+ן
+נ
+ס
+ע
+פ
+ק
+ר
+ש
+ת
+أ
+ب
+ة
+ت
+ج
+ح
+د
+ر
+ز
+س
+ص
+ط
+ع
+ق
+ك
+ل
+م
+ن
+ه
+و
+ي
+َ
+ُ
+ِ
+ْ
+ก
+ข
+ง
+จ
+ต
+ท
+น
+ป
+ย
+ร
+ว
+ส
+ห
+อ
+ฮ
+ั
+า
+ี
+ึ
+โ
+ใ
+ไ
+่
+้
+์
+ḍ
+Ḥ
+ḥ
+ṁ
+ṃ
+ṅ
+ṇ
+Ṛ
+ṛ
+Ṣ
+ṣ
+Ṭ
+ṭ
+ạ
+ả
+Ấ
+ấ
+ầ
+ậ
+ắ
+ằ
+ẻ
+ẽ
+ế
+ề
+ể
+ễ
+ệ
+ị
+ọ
+ỏ
+ố
+ồ
+ộ
+ớ
+ờ
+ở
+ụ
+ủ
+ứ
+ữ
+ἀ
+ἁ
+Ἀ
+ἐ
+ἔ
+ἰ
+ἱ
+ὀ
+ὁ
+ὐ
+ὲ
+ὸ
+���
+᾽
+ῆ
+ῇ
+ῶ
+‎
+‑
+‒
+–
+—
+―
+‖
+†
+‡
+•
+…
+‧
+‬
+′
+″
+⁄
+⁡
+⁰
+⁴
+⁵
+⁶
+⁷
+⁸
+⁹
+₁
+₂
+₃
+€
+₱
+₹
+₽
+℃
+ℏ
+ℓ
+№
+ℝ
+™
+⅓
+⅔
+⅛
+→
+∂
+∈
+∑
+−
+∗
+√
+∞
+∫
+≈
+≠
+≡
+≤
+≥
+⋅
+⋯
+█
+♪
+⟨
+⟩
+、
+。
+《
+》
+「
+」
+【
+】
+あ
+う
+え
+お
+か
+が
+き
+ぎ
+く
+ぐ
+け
+げ
+こ
+ご
+さ
+し
+じ
+す
+ず
+せ
+ぜ
+そ
+ぞ
+た
+だ
+ち
+っ
+つ
+で
+と
+ど
+な
+に
+ね
+の
+は
+ば
+ひ
+ぶ
+へ
+べ
+ま
+み
+む
+め
+も
+ゃ
+や
+ゆ
+ょ
+よ
+ら
+り
+る
+れ
+ろ
+わ
+を
+ん
+ァ
+ア
+ィ
+イ
+ウ
+ェ
+エ
+オ
+カ
+ガ
+キ
+ク
+ケ
+ゲ
+コ
+ゴ
+サ
+ザ
+シ
+ジ
+ス
+ズ
+セ
+ゾ
+タ
+ダ
+チ
+ッ
+ツ
+テ
+デ
+ト
+ド
+ナ
+ニ
+ネ
+ノ
+バ
+パ
+ビ
+ピ
+フ
+プ
+ヘ
+ベ
+ペ
+ホ
+ボ
+ポ
+マ
+ミ
+ム
+メ
+モ
+ャ
+ヤ
+ュ
+ユ
+ョ
+ヨ
+ラ
+リ
+ル
+レ
+ロ
+ワ
+ン
+・
+ー
+ㄋ
+ㄍ
+ㄎ
+ㄏ
+ㄓ
+ㄕ
+ㄚ
+ㄜ
+ㄟ
+ㄤ
+ㄥ
+ㄧ
+ㄱ
+ㄴ
+ㄷ
+ㄹ
+ㅁ
+ㅂ
+ㅅ
+ㅈ
+ㅍ
+ㅎ
+ㅏ
+ㅓ
+ㅗ
+ㅜ
+ㅡ
+ㅣ
+㗎
+가
+각
+간
+갈
+감
+갑
+갓
+갔
+강
+같
+개
+거
+건
+걸
+겁
+것
+겉
+게
+겠
+겨
+결
+겼
+경
+계
+고
+곤
+골
+곱
+공
+과
+관
+광
+교
+구
+국
+굴
+귀
+귄
+그
+근
+글
+금
+기
+긴
+길
+까
+깍
+깔
+깜
+깨
+께
+꼬
+꼭
+꽃
+꾸
+꿔
+끔
+끗
+끝
+끼
+나
+난
+날
+남
+납
+내
+냐
+냥
+너
+넘
+넣
+네
+녁
+년
+녕
+노
+녹
+놀
+누
+눈
+느
+는
+늘
+니
+님
+닙
+다
+닥
+단
+달
+닭
+당
+대
+더
+덕
+던
+덥
+데
+도
+독
+동
+돼
+됐
+되
+된
+될
+두
+둑
+둥
+드
+들
+등
+디
+따
+딱
+딸
+땅
+때
+떤
+떨
+떻
+또
+똑
+뚱
+뛰
+뜻
+띠
+라
+락
+란
+람
+랍
+랑
+래
+랜
+러
+런
+럼
+렇
+레
+려
+력
+렵
+렸
+로
+록
+롬
+루
+르
+른
+를
+름
+릉
+리
+릴
+림
+마
+막
+만
+많
+말
+맑
+맙
+맛
+매
+머
+먹
+멍
+메
+면
+명
+몇
+모
+목
+몸
+못
+무
+문
+물
+뭐
+뭘
+미
+민
+밌
+밑
+바
+박
+밖
+반
+받
+발
+밤
+밥
+방
+배
+백
+밸
+뱀
+버
+번
+벌
+벚
+베
+벼
+벽
+별
+병
+보
+복
+본
+볼
+봐
+봤
+부
+분
+불
+비
+빔
+빛
+빠
+빨
+뼈
+뽀
+뿅
+쁘
+사
+산
+살
+삼
+샀
+상
+새
+색
+생
+서
+선
+설
+섭
+섰
+성
+세
+셔
+션
+셨
+소
+속
+손
+송
+수
+숙
+순
+술
+숫
+숭
+숲
+쉬
+쉽
+스
+슨
+습
+슷
+시
+식
+신
+실
+싫
+심
+십
+싶
+싸
+써
+쓰
+쓴
+씌
+씨
+씩
+씬
+아
+악
+안
+않
+알
+야
+약
+얀
+양
+얘
+어
+언
+얼
+엄
+업
+없
+었
+엉
+에
+여
+역
+연
+염
+엽
+영
+옆
+예
+옛
+오
+온
+올
+옷
+옹
+와
+왔
+왜
+요
+욕
+용
+우
+운
+울
+웃
+워
+원
+월
+웠
+위
+윙
+유
+육
+윤
+으
+은
+을
+음
+응
+의
+이
+익
+인
+일
+읽
+임
+입
+있
+자
+작
+잔
+잖
+잘
+잡
+잤
+장
+재
+저
+전
+점
+정
+제
+져
+졌
+조
+족
+좀
+종
+좋
+죠
+주
+준
+줄
+중
+줘
+즈
+즐
+즘
+지
+진
+집
+짜
+짝
+쩌
+쪼
+쪽
+쫌
+쭈
+쯔
+찌
+찍
+차
+착
+찾
+책
+처
+천
+철
+체
+쳐
+쳤
+초
+촌
+추
+출
+춤
+춥
+춰
+치
+친
+칠
+침
+칩
+칼
+커
+켓
+코
+콩
+쿠
+퀴
+크
+큰
+큽
+키
+킨
+타
+태
+터
+턴
+털
+테
+토
+통
+투
+트
+특
+튼
+틀
+티
+팀
+파
+팔
+패
+페
+펜
+펭
+평
+포
+폭
+표
+품
+풍
+프
+플
+피
+필
+하
+학
+한
+할
+함
+합
+항
+해
+햇
+했
+행
+허
+험
+형
+혜
+호
+혼
+홀
+화
+회
+획
+후
+휴
+흐
+흔
+희
+히
+힘
+ﷺ
+ﷻ
+！
+，
+？
+�
+𠮶
+ค
+เ
+็
+ผ
+ู
+บ
+พ
+ำ
+แ
+ม
+ะ
+ิ
+ด
+ฝ
+ุ
+ล
+ื
+ถ
+ฬ
+ฟ
+ณ
+ซ
+ธ
+ช
+ฉ
+ศ
+ญ
+ภ
+ฆ
+ษ
+ฐ
+๊
+ฒ
+ฎ
+๋
+ฏ
+ฤ
+ๅ
+ฑ
+ฌ
+ฃ
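
As a reading aid for the vocabulary file above, here is a minimal sketch of how a one-token-per-line file could be turned into a token-to-id map. It assumes the token id is simply the zero-based line position, which this diff does not show; `build_vocab_map` and `sample_lines` are hypothetical names used only for illustration.

```python
def build_vocab_map(lines):
    """Map each token to its zero-based line index.

    Assumption: one token per line, and the line position is the token id.
    Only the trailing newline is stripped, so a token that is itself a
    single space survives intact.
    """
    vocab = {}
    for index, line in enumerate(lines):
        token = line.rstrip("\n")
        vocab[token] = index
    return vocab


# Tiny stand-in for vocab.txt (the real file in this commit adds 2,586 lines).
sample_lines = [" \n", "a\n", "zuo4\n", "ก\n"]
vocab_map = build_vocab_map(sample_lines)
print(vocab_map)
```

Under this assumption, inserting or deleting a line anywhere but the end would shift every later id, which may be why the extra Thai characters appear appended at the end of the file rather than merged into the existing Thai block.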

data/librispeech_pc_test_clean_cross_sentence.lst ADDED

The diff for this file is too large to render. See raw diff

deployment/.gitignore ADDED

	@@ -0,0 +1,81 @@

+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# PyTorch
+*.pth
+*.pt
+# Gradio
+.gradio/
+flagged/
+# Environment
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+# OS
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+ehthumbs.db
+Thumbs.db
+# Logs
+*.log
+logs/
+# Temporary files
+*.tmp
+*.temp
+tmp/
+temp/
+# Cache
+.cache/
+*.cache
+# Model downloads (if large)
+# ckpts/
+# models/
+# Audio files (if large)
+# *.wav
+# *.mp3
+# *.flac
+# Jupyter
+.ipynb_checkpoints

deployment/README.md ADDED

	@@ -0,0 +1,89 @@

+---
+title: F5-TTS Thai
+emoji: 🎤
+colorFrom: blue
+colorTo: purple
+sdk: gradio
+sdk_version: 4.44.0
+app_file: app.py
+pinned: false
+license: mit
+python_version: "3.10"
+hardware: cpu-basic
+short_description: Zero-shot Text-to-Speech for Thai language
+---
+# F5-TTS Thai 🎤
+Zero-shot text-to-speech for Thai, built on the F5-TTS model.
+## ✨ Features
+- **Multi-Speech Generation**: Generate speech in several styles within a single file
+- **Voice Cloning**: Clone a voice from a short sample file
+- **Thai Language Support**: Full support for the Thai language
+- **Real-time Processing**: Processes audio in real time
+- **Segment Editing**: Edit and fine-tune each audio segment individually
+## 🚀 Usage
+### Multi-Speech Generation
+1. **Add a speech type**: Click "เพิ่มประเภทคำพูด" (Add speech type) to add a new voice style
+2. **Upload reference audio**: Upload an audio file for each style
+3. **Enter the reference text**: Type the text that matches the reference audio
+4. **Write the script**: Use the format `{style name} text to speak`
+### Example
+```
+{ปกติ} สวัสดีครับ มีอะไรให้ผมช่วยไหมครับ
+{เศร้า} ผมเครียดจริงๆ นะตอนนี้...
+{โกรธ} รู้ไหม! เธอไม่ควรอยู่ที่นี่!
+{กระซิบ} ฉันมีอะไรจะบอกคุณ แต่มันเป็นความลับนะ
+```
+The style tags in this example mean normal (ปกติ), sad (เศร้า), angry (โกรธ), and whisper (กระซิบ).
+## ⚙️ Technical Details
+### Models Used
+- **F5-TTS**: Zero-shot text-to-speech model
+- **Vocoder**: Neural vocoder for high-quality audio synthesis
+- **Text Processing**: Thai text normalization and processing
+### System Requirements
+- **RAM**: At least 4GB (8GB+ recommended)
+- **GPU**: Not required, but it speeds up generation
+- **Storage**: ~2GB for the model and dependencies
+## 🔧 Configuration
+### Model Settings
+- **NFE Steps**: Controls audio quality (16-64)
+- **Cross Fade Duration**: Adjusts how adjacent segments are blended together
+- **Speed**: Adjusts the speaking speed
+- **CFG Strength**: Adjusts the strength of the guidance
+### Tips for Good Results
+1. **Reference audio**: Use a clear recording with no background noise, 5-10 seconds long
+2. **Reference text**: Match the reference audio as closely as possible
+3. **Text to generate**: Use spaces and punctuation clearly
+4. **Settings**: Start with the default values, then adjust
+## 🚨 Limitations
+- Primarily supports Thai only
+- Audio quality depends on the reference audio
+- Processing time grows with the length of the text
+- Requires an internet connection to download the model
+## 📝 License
+MIT License - free to use
+## 🤝 Contributing
+Contributions are welcome at the [GitHub Repository](https://github.com/yourusername/F5-TTS-THAI)
+## 🐛 Bug Reports
+If you run into problems, please report them on the GitHub repository's Issues page
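
The `{style name} text` script format described in the README could be parsed along these lines. This is only a sketch: `parse_script`, `TAG_PATTERN`, and the fallback style are hypothetical, and the web UI's real parser is not shown in this part of the diff.

```python
import re

# Hypothetical parser: a "{style}" tag starts a segment that runs until the next tag.
TAG_PATTERN = re.compile(r"\{([^{}]+)\}")


def parse_script(script, default_style="ปกติ"):
    """Split a multi-speech script into (style, text) pairs."""
    segments = []
    style = default_style
    # re.split with one capture group alternates: text, tag, text, tag, ...
    parts = TAG_PATTERN.split(script)
    for i, part in enumerate(parts):
        if i % 2 == 1:
            style = part.strip()          # odd positions hold the tag contents
        elif part.strip():
            segments.append((style, part.strip()))
    return segments


script = "{ปกติ} สวัสดีครับ\n{เศร้า} ผมเครียดจริงๆ"
for style, text in parse_script(script):
    print(style, "->", text)
```

Text that appears before the first tag is assigned the default style in this sketch; a stricter parser might reject it instead.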

deployment/app.py ADDED

	@@ -0,0 +1,45 @@

+#!/usr/bin/env python3
+import os
+import sys
+import gradio as gr
+# Add src to path
+current_dir = os.path.dirname(os.path.abspath(__file__))
+src_dir = os.path.join(current_dir, "src")
+if src_dir not in sys.path:
+    sys.path.insert(0, src_dir)
+def create_demo():
+    """Create the main demo interface"""
+    try:
+        from f5_tts.f5_tts_webui import F5TTSWebUI
+        app = F5TTSWebUI()
+        return app.create_gradio_interface()
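+    except Exception as e:
+        # Log the failure so it appears in the Space logs instead of being silently swallowed
+        print(f"Failed to load F5TTSWebUI: {e!r}", file=sys.stderr)
+        # Fallback interface if imports fail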
+        with gr.Blocks(title="F5-TTS Thai") as demo:
+            gr.Markdown("# F5-TTS ภาษาไทย 🎤")
+            gr.Markdown("## ⚠️ กำลังโหลดระบบ...")
+            gr.Markdown("**Status:** กำลังดาวน์โหลดและเตรียมโมเดล")
+            gr.Markdown("""
+            ### กรุณารอสักครู่...
+            - ระบบกำลังดาวน์โหลด dependencies
+            - กำลังโหลดโมเดล F5-TTS
+            - โปรเซสนี้อาจใช้เวลา 2-5 นาที
+            **หากยังไม่ทำงาน กรุณารีเฟรชหน้าใหม่**
+            """)
+            with gr.Row():
+                status_text = gr.Textbox(label="สถานะ", value="กำลังเตรียมระบบ...", interactive=False)
+                refresh_btn = gr.Button("🔄 รีเฟรช", variant="primary")
+                refresh_btn.click(fn=lambda: "รีเฟรชแล้ว", outputs=status_text)
+        return demo
+# Create the demo - THIS IS IMPORTANT FOR HF SPACES
+demo = create_demo()
+# Launch settings
+if __name__ == "__main__":
+    demo.launch()

deployment/app_minimal.py ADDED

	@@ -0,0 +1,31 @@

+import gradio as gr
+def test_function(text):
+    return f"ทดสอบสำเร็จ! คุณพิมพ์: {text}"
+# Simple demo for testing
+with gr.Blocks(title="F5-TTS Thai Test") as demo:
+    gr.Markdown("# 🧪 F5-TTS Thai - Test App")
+    gr.Markdown("แอปทดสอบเพื่อตรวจสอบว่า Hugging Face Spaces ทำงานได้")
+    with gr.Row():
+        input_text = gr.Textbox(label="ทดสอบการพิมพ์", placeholder="พิมพ์อะไรก็ได้...")
+        output_text = gr.Textbox(label="ผลลัพธ์")
+    test_btn = gr.Button("ทดสอบ", variant="primary")
+    test_btn.click(fn=test_function, inputs=input_text, outputs=output_text)
+    gr.Markdown("""
+    ### ✅ หากแอปนี้ทำงานได้ แสดงว่า:
+    - Gradio ทำงานได้ปกติ
+    - โครงสร้างไฟล์ถูกต้อง
+    - สามารถอัปโหลดแอปหลักได้
+    ### 📝 ขั้นตอนต่อไป:
+    1. แทนที่ `app_minimal.py` ด้วย `app.py`
+    2. อัปโหลดโฟลเดอร์ `src/`
+    3. อัปเดต `requirements.txt`
+    """)
+if __name__ == "__main__":
+    demo.launch()

deployment/requirements.txt ADDED

	@@ -0,0 +1,15 @@

+gradio>=4.0.0
+torch>=2.0.0
+torchaudio>=2.0.0
+numpy>=1.21.0
+soundfile>=0.12.1
+cached-path>=1.5.0
+faster-whisper>=0.9.0
+transformers>=4.30.0
+accelerate>=0.20.0
+datasets>=2.10.0
+librosa>=0.10.0
+scipy>=1.9.0
+matplotlib>=3.5.0
+Pillow>=9.0.0
+requests>=2.25.0

deployment/requirements_minimal.txt ADDED

	@@ -0,0 +1 @@


+gradio>=4.0.0

deployment/src/f5_tts/api.py ADDED

	@@ -0,0 +1,174 @@

+import random
+import sys
+from importlib.resources import files
+import soundfile as sf
+import tqdm
+from cached_path import cached_path
+from f5_tts.infer.utils_infer import (
+    hop_length,
+    infer_process,
+    load_model,
+    load_vocoder,
+    preprocess_ref_audio_text,
+    remove_silence_for_generated_wav,
+    save_spectrogram,
+    transcribe,
+    target_sample_rate,
+)
+from f5_tts.model import DiT, UNetT
+from f5_tts.model.utils import seed_everything
+class F5TTS:
+    def __init__(
+        self,
+        model_type="F5-TTS",
+        ckpt_file="",
+        vocab_file="",
+        ode_method="euler",
+        use_ema=True,
+        vocoder_name="vocos",
+        local_path=None,
+        device=None,
+        hf_cache_dir=None,
+    ):
+        # Initialize parameters
+        self.final_wave = None
+        self.target_sample_rate = target_sample_rate
+        self.hop_length = hop_length
+        self.seed = -1
+        self.mel_spec_type = vocoder_name
+        # Set device
+        if device is not None:
+            self.device = device
+        else:
+            import torch
+            self.device = (
+                "cuda"
+                if torch.cuda.is_available()
+                else "xpu"
+                if torch.xpu.is_available()
+                else "mps"
+                if torch.backends.mps.is_available()
+                else "cpu"
+            )
+        # Load models
+        self.load_vocoder_model(vocoder_name, local_path=local_path, hf_cache_dir=hf_cache_dir)
+        self.load_ema_model(
+            model_type, ckpt_file, vocoder_name, vocab_file, ode_method, use_ema, hf_cache_dir=hf_cache_dir
+        )
+    def load_vocoder_model(self, vocoder_name, local_path=None, hf_cache_dir=None):
+        self.vocoder = load_vocoder(vocoder_name, local_path is not None, local_path, self.device, hf_cache_dir)
+    def load_ema_model(self, model_type, ckpt_file, mel_spec_type, vocab_file, ode_method, use_ema, hf_cache_dir=None):
+        if model_type == "F5-TTS":
+            if not ckpt_file:
+                if mel_spec_type == "vocos":
+                    ckpt_file = str(
+                        cached_path("hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.safetensors", cache_dir=hf_cache_dir)
+                    )
+                elif mel_spec_type == "bigvgan":
+                    ckpt_file = str(
+                        cached_path("hf://SWivid/F5-TTS/F5TTS_Base_bigvgan/model_1250000.pt", cache_dir=hf_cache_dir)
+                    )
+            model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
+            model_cls = DiT
+        elif model_type == "E2-TTS":
+            if not ckpt_file:
+                ckpt_file = str(
+                    cached_path("hf://SWivid/E2-TTS/E2TTS_Base/model_1200000.safetensors", cache_dir=hf_cache_dir)
+                )
+            model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
+            model_cls = UNetT
+        else:
+            raise ValueError(f"Unknown model type: {model_type}")
+        self.ema_model = load_model(
+            model_cls, model_cfg, ckpt_file, mel_spec_type, vocab_file, ode_method, use_ema, self.device
+        )
+    def transcribe(self, ref_audio, language=None):
+        return transcribe(ref_audio, language)
+    def export_wav(self, wav, file_wave, remove_silence=False):
+        sf.write(file_wave, wav, self.target_sample_rate)
+        if remove_silence:
+            remove_silence_for_generated_wav(file_wave)
+    def export_spectrogram(self, spect, file_spect):
+        save_spectrogram(spect, file_spect)
+    def infer(
+        self,
+        ref_file,
+        ref_text,
+        gen_text,
+        show_info=print,
+        progress=tqdm,
+        target_rms=0.1,
+        cross_fade_duration=0.15,
+        sway_sampling_coef=-1,
+        cfg_strength=2,
+        nfe_step=32,
+        speed=1.0,
+        fix_duration=None,
+        remove_silence=False,
+        file_wave=None,
+        file_spect=None,
+        seed=-1,
+    ):
+        if seed == -1:
+            seed = random.randint(0, sys.maxsize)
+        seed_everything(seed)
+        self.seed = seed
+        ref_file, ref_text = preprocess_ref_audio_text(ref_file, ref_text, device=self.device)
+        wav, sr, spect = infer_process(
+            ref_file,
+            ref_text,
+            gen_text,
+            self.ema_model,
+            self.vocoder,
+            self.mel_spec_type,
+            show_info=show_info,
+            progress=progress,
+            target_rms=target_rms,
+            cross_fade_duration=cross_fade_duration,
+            nfe_step=nfe_step,
+            cfg_strength=cfg_strength,
+            sway_sampling_coef=sway_sampling_coef,
+            speed=speed,
+            fix_duration=fix_duration,
+            device=self.device,
+        )
+        if file_wave is not None:
+            self.export_wav(wav, file_wave, remove_silence)
+        if file_spect is not None:
+            self.export_spectrogram(spect, file_spect)
+        return wav, sr, spect
+if __name__ == "__main__":
+    f5tts = F5TTS()
+    wav, sr, spect = f5tts.infer(
+        ref_file=str(files("f5_tts").joinpath("infer/examples/basic/basic_ref_en.wav")),
+        ref_text="some call me nature, others call me mother nature.",
+        gen_text="""I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.""",
+        file_wave=str(files("f5_tts").joinpath("../../tests/api_out.wav")),
+        file_spect=str(files("f5_tts").joinpath("../../tests/api_out.png")),
+        seed=-1,  # random seed = -1
+    )
+    print("seed :", f5tts.seed)
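
The device fallback chain in `F5TTS.__init__` (CUDA, then XPU, then MPS, then CPU) accesses `torch.xpu` directly. A more defensive variant could look like the sketch below, assuming that some PyTorch builds may not expose every backend attribute. `pick_device` is a hypothetical helper, and it takes the module as a parameter so it can be exercised with a stub instead of a real `torch` install.

```python
from types import SimpleNamespace


def pick_device(torch_module):
    """Return the first available backend name, falling back to "cpu"."""

    def available(backend):
        # getattr guards against builds that lack the attribute entirely
        check = getattr(backend, "is_available", None)
        return bool(check and check())

    if available(getattr(torch_module, "cuda", None)):
        return "cuda"
    if available(getattr(torch_module, "xpu", None)):
        return "xpu"
    backends = getattr(torch_module, "backends", None)
    if available(getattr(backends, "mps", None)):
        return "mps"
    return "cpu"


# Stub standing in for a torch build that has no `xpu` attribute at all
fake_torch = SimpleNamespace(
    cuda=SimpleNamespace(is_available=lambda: False),
    backends=SimpleNamespace(mps=SimpleNamespace(is_available=lambda: True)),
)
print(pick_device(fake_torch))
```

With a real installation the call would be `pick_device(torch)` after `import torch`; the priority order matches the one in the class above.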

deployment/src/f5_tts/cleantext/number_tha.py ADDED

	@@ -0,0 +1,145 @@

+def number_to_thai_text(num, digit_by_digit=False):
+    # Thai numerals and place values
+    thai_digits = {
+        0: "ศูนย์", 1: "หนึ่ง", 2: "สอง", 3: "สาม", 4: "สี่",
+        5: "ห้า", 6: "หก", 7: "เจ็ด", 8: "แปด", 9: "เก้า"
+    }
+    thai_places = ["", "สิบ", "ร้อย", "พัน", "หมื่น", "แสน", "ล้าน"]
+    # Handle zero case
+    if num == 0:
+        return thai_digits[0]
+    # If digit_by_digit is True, read each digit separately
+    if digit_by_digit:
+        return " ".join(thai_digits[int(d)] for d in str(num))
+    # For very large numbers, we'll process in chunks of millions
+    if num >= 1000000:
+        millions = num // 1000000
+        remainder = num % 1000000
+        result = number_to_thai_text(millions) + "ล้าน"
+        if remainder > 0:
+            result += number_to_thai_text(remainder)
+        return result
+    # Convert number to string and reverse it for easier place value processing
+    num_str = str(num)
+    digits = [int(d) for d in num_str]
+    digits.reverse()  # Reverse to process from units to highest place
+    result = []
+    for i, digit in enumerate(digits):
+        if digit == 0:
+            continue  # Skip zeros
+        # Special case for tens place
+        if i == 1:
+            if digit == 1:
+                result.append(thai_places[i])  # "สิบ" for 10-19
+            elif digit == 2:
+                result.append("ยี่" + thai_places[i])  # "ยี่สิบ" for 20-29
+            else:
+                result.append(thai_digits[digit] + thai_places[i])
+        # Special case for units place: a trailing 1 is read "เอ็ด" after any non-zero tens digit
+        elif i == 0 and digit == 1:
+            if len(digits) > 1 and digits[1] != 0:
+                result.append("เอ็ด")  # "เอ็ด" for 11, 21, 31, ...
+            else:
+                result.append(thai_digits[digit])
+        else:
+            result.append(thai_digits[digit] + thai_places[i])
+    # Reverse back and join
+    result.reverse()
+    return "".join(result)
+def replace_numbers_with_thai(text):
+    import re
+    # Function to convert matched number to Thai text
+    def convert_match(match):
+        num_str = match.group(0).replace(',', '')
+        # Skip if the string is empty or invalid after removing commas
+        if not num_str or num_str == '.':
+            return match.group(0)
+        # Handle decimal numbers
+        if '.' in num_str:
+            parts = num_str.split('.')
+            integer_part = parts[0]
+            decimal_part = parts[1] if len(parts) > 1 else ''
+            # If integer part is empty, treat as 0
+            integer_value = int(integer_part) if integer_part else 0
+            # If integer part is too long (>7 digits), read digit by digit
+            if len(integer_part) > 7:
+                result = number_to_thai_text(integer_value, digit_by_digit=True)
+            else:
+                result = number_to_thai_text(integer_value)
+            # Add decimal part if it exists
+            if decimal_part:
+                result += "จุด " + " ".join(number_to_thai_text(int(d)) for d in decimal_part)
+            return result
+        # Handle integer numbers
+        num = int(num_str)
+        if len(num_str) > 7:  # If number exceeds 7 digits
+            return number_to_thai_text(num, digit_by_digit=True)
+        return number_to_thai_text(num)
+    # Replace all numbers (with or without commas and decimals) in the text
+    def process_text(text):
+        # Split by spaces to process each word
+        words = text.split()
+        result = []
+        for word in words:
+            # Match only valid numeric strings (allowing commas and one decimal point)
+            if re.match(r'^[\d,]+(\.\d+)?$', word):  # Valid number with optional decimal
+                result.append(convert_match(re.match(r'[\d,\.]+', word)))
+            else:
+                # If word contains non-numeric characters, read numbers digit-by-digit
+                if any(c.isdigit() for c in word):
+                    processed = ""
+                    num_chunk = ""
+                    for char in word:
+                        if char.isdigit():
+                            num_chunk += char
+                        else:
+                            if num_chunk:
+                                processed += " ".join(number_to_thai_text(int(d)) for d in num_chunk) + " "
+                                num_chunk = ""
+                            processed += char + " "
+                    if num_chunk:  # Handle any remaining numbers
+                        processed += " ".join(number_to_thai_text(int(d)) for d in num_chunk)
+                    result.append(processed.strip())
+                else:
+                    result.append(word)
+        return " ".join(result)
+    return process_text(text)
+# Test the functions
+if __name__ == "__main__":
+    # Test number_to_thai_text
+    test_numbers = [1, 12, 500, 6450, 100000, 12345678]
+    for num in test_numbers:
+        print(f"{num:,} -> {number_to_thai_text(num)}")
+    # Test with decimals and mixed text
+    test_texts = [
+        "ฉันมีเงิน 500 บาท",
+        "ราคา 123.45 บาท",
+        "บ้านเลขที่ 12 34",
+        "วันที่ 15 08 2023",
+    ]
+    for text in test_texts:
+        result = replace_numbers_with_thai(text)
+        print(f"\nOriginal: {text}")
+        print(f"Converted: {result}")
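
The place-value rules used by `number_to_thai_text` can be condensed into a small standalone sketch for values below one million. `read_below_million` is a hypothetical re-implementation, not the function from this file; it reads a trailing 1 as "เอ็ด" whenever the tens digit is non-zero, which is how numbers such as 31 are normally read.

```python
DIGITS = ["ศูนย์", "หนึ่ง", "สอง", "สาม", "สี่", "ห้า", "หก", "เจ็ด", "แปด", "เก้า"]
PLACES = ["", "สิบ", "ร้อย", "พัน", "หมื่น", "แสน"]


def read_below_million(n):
    """Spell out an integer in the range 0..999,999 in Thai."""
    if not 0 <= n < 1_000_000:
        raise ValueError("expected 0 <= n < 1,000,000")
    if n == 0:
        return DIGITS[0]
    digits = [int(d) for d in str(n)]
    parts = []
    for pos, digit in enumerate(digits):
        place = len(digits) - 1 - pos  # 0 = units, 1 = tens, ...
        if digit == 0:
            continue
        if place == 1:
            # tens: 1 -> "สิบ", 2 -> "ยี่สิบ", otherwise digit + "สิบ"
            prefix = "" if digit == 1 else "ยี่" if digit == 2 else DIGITS[digit]
            parts.append(prefix + PLACES[1])
        elif place == 0 and digit == 1 and len(digits) > 1 and digits[-2] != 0:
            parts.append("เอ็ด")
        else:
            parts.append(DIGITS[digit] + PLACES[place])
    return "".join(parts)


for value in (11, 21, 500, 6450):
    print(value, "->", read_below_million(value))
```

Working left to right with an explicit `place` index avoids the reverse-then-reverse-back step, at the cost of computing the index for every digit.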

deployment/src/f5_tts/cleantext/th_repeat.py ADDED

	@@ -0,0 +1,41 @@

+from pythainlp.tokenize import syllable_tokenize
+def remove_symbol(text):
+    symbols = "{}[]()-_?/\\|!*%$&@#^<>+-\";:~`=“”"
+    for symbol in symbols:
+        text = text.replace(symbol, '')
+    text = text.replace(" ๆ","ๆ")
+    return text
+def process_thai_repeat(text):
+    cleaned_symbols = remove_symbol(text)
+    words = syllable_tokenize(cleaned_symbols)
+    result = []
+    i = 0
+    while i < len(words):
+        if i + 1 < len(words) and words[i + 1] == "ๆ":
+            result.append(words[i])
+            result.append(words[i])
+            i += 2
+        else:
+            result.append(words[i])
+            i += 1
+    return "".join(result)
+if __name__ == "__main__":
+    # Example
+    test_cases = [
+        "วันที่ ฉันสนุกมากๆ",
+        "ดีมากๆ",
+        "บ้านสวยๆ",
+        "เขียนเร็วๆ",
+        "วันที่ ฉันสนุกมากๆ และกินอร่อยๆ"
+    ]
+    for text in test_cases:
+        result = process_thai_repeat(text)
+        print(f"Original: {text} -> Converted: {result}")
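
The repetition-mark handling in `process_thai_repeat` can be illustrated without the tokenizer. The sketch below is a single-pass variation that works on an already tokenized list; the token lists are hand-written assumptions and not necessarily what `syllable_tokenize` would return, and `expand_repeat_marks` is a hypothetical name.

```python
def expand_repeat_marks(tokens, mark="ๆ"):
    """Replace each repetition mark with a copy of the token before it."""
    expanded = []
    for token in tokens:
        if token == mark:
            if expanded:
                # repeat the most recent token instead of emitting the mark
                expanded.append(expanded[-1])
            # a mark with nothing before it is dropped
        else:
            expanded.append(token)
    return expanded


tokens = ["ดี", "มาก", "ๆ"]
print("".join(expand_repeat_marks(tokens)))
```

Unlike the look-ahead loop above, this variant also expands consecutive marks and drops a leading mark, so the two behave differently on those edge cases.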

deployment/src/f5_tts/config.py ADDED

	@@ -0,0 +1,98 @@

+"""
+Configuration settings for F5-TTS Thai WebUI
+"""
+# Model configurations
+DEFAULT_MODEL_BASE = "hf://VIZINTZOR/F5-TTS-THAI/model_1000000.pt"
+FP16_MODEL_BASE = "hf://VIZINTZOR/F5-TTS-THAI/model_650000_FP16.pt"
+VOCAB_BASE = "./vocab/vocab.txt"
+VOCAB_HF = "hf://VIZINTZOR/F5-TTS-THAI/vocab.txt"
+MODEL_CHOICES = ["Default", "FP16", "Custom"]
+# F5TTS model configuration
+F5TTS_MODEL_CFG = {
+    "dim": 1024,
+    "depth": 22,
+    "heads": 16,
+    "ff_mult": 2,
+    "text_dim": 512,
+    "conv_layers": 4
+}
+# Audio settings
+TARGET_SAMPLE_RATE = 24000
+HOP_LENGTH = 256
+# UI settings
+MAX_SPEECH_TYPES = 100
+MAX_SEGMENTS = 20
+# Default TTS settings
+DEFAULT_TTS_SETTINGS = {
+    "remove_silence": True,
+    "cross_fade_duration": 0.15,
+    "nfe_step": 32,
+    "speed": 1.0,
+    "cfg_strength": 2.0,
+    "max_chars": 250,
+    "seed": -1,
+    "no_ref_audio": False
+}
+# Whisper model settings
+WHISPER_MODELS = ['base', 'small', 'medium', 'large-v2', 'large-v3', 'large-v3-turbo']
+WHISPER_COMPUTE_TYPES = ["float32", "float16", "int8_float16", "int8"]
+WHISPER_LANGUAGES = {
+    "source": ["Auto", 'th', "en"],
+    "target": ['th', "en"]
+}
+# Example configurations
+EXAMPLES = [
+    [
+        "./src/f5_tts/infer/examples/thai_examples/ref_gen_1.wav",
+        "ได้รับข่าวคราวของเราที่จะหาที่มันเป็นไปที่จะจัดขึ้น.",
+        "พรุ่งนี้มีประชุมสำคัญ อย่าลืมเตรียมเอกสารให้เรียบร้อย"
+    ],
+    [
+        "./src/f5_tts/infer/examples/thai_examples/ref_gen_2.wav",
+        "ฉันเดินทางไปเที่ยวที่จังหวัดเชียงใหม่ในช่วงฤดูหนาวเพื่อสัมผัสอากาศเย็นสบาย.",
+        "ฉันชอบฟังเพลงขณะขับรถ เพราะช่วยให้รู้สึกผ่อนคลาย"
+    ],
+    [
+        "./src/f5_tts/infer/examples/thai_examples/ref_gen_3.wav",
+        "กู้ดอาฟเต้อนูนไนท์ทูมีทยู.",
+        "วันนี้อากาศดีมาก เหมาะกับการไปเดินเล่นที่สวนสาธารณะ"
+    ],
+    [
+        "./src/f5_tts/infer/examples/thai_examples/ref_gen_4.wav",
+        "เราอยากจะตื่นขึ้นมามั้ยคะ.",
+        "เมื่อวานฉันไปเดินเล่นที่ชายหาด เสียงคลื่นซัดฝั่งเป็นจังหวะที่ชวนให้ใจสงบ."
+    ]
+]
+TIPS_TEXT = """
+- สามารถตั้งค่า "ตัวอักษรสูงสุดต่อส่วน" หรือ max_chars เพื่อลดความผิดพลาดการอ่าน แต่ความเร็วในการสร้างจะช้าลง สามารถปรับลด NFE Step เพื่อเพิ่มความเร็วได้
+ปรับ NFE Step เหลือ 7 สามารถเพิ่มความเร็วในการสร้างได้มาก แต่เสียงที่ได้พอฟังได้.
+- อย่าลืมเว้นวรรคประโยคเพื่อให้สามารถแบ่งส่วนในการสร้างได้.
+- สำหรับ ref_text หรือ ข้อความต้นฉบับ แนะนำให้ใช้เป็นภาษาไทยหรือคำอ่านภาษาไทยสำหรับเสียงภาษาอื่น เพื่อให้การอ่านภาษาไทยดีขึ้น เช่น Good Morning > กู้ดมอร์นิ่ง.
+- สำหรับเสียงต้นแบบ ควรใช้ความยาวไม่เกิน 10 วินาที ถ้าเป็นไปได้ห้ามมีเสียงรบกวน.
+- สามารถปรับลดความเร็วให้ช้าลง ถ้าเสียงต้นฉบับมีความยาวไม่มาก เช่น 2-5 วินาที
+- การอ่านข้อความยาวๆ หรือบางคำ ยังไม่ถูกต้อง สามารถปรับลดความเร็วเพื่อให้การอ่านถูกต้องได้ เช่น ถ้าเสียงต้นฉบับมีความยาว 1-3 วินาที อาจจะต้องปรับความเร็วเหลือ 0.8-0.9.
+- โมเดลตอนนี้ยังเน้นการอ่านภาษาไทยเป็นหลัก การอ่านภาษาไทยผสมกับภาษาอังกฤษยังต้องปรับปรุง.
+"""
+MULTISPEECH_EXAMPLE_TEXT = """
+**ตัวอย่าง:**
+{ปกติ} สวัสดีครับ มีอะไรให้ผมช่วยไหมครับ
+{เศร้า} ผมเครียดจริงๆ นะตอนนี้...
+{โกรธ} รู้ไหม! เธอไม่ควรอยู่ที่นี่!
+{กระซิบ} ฉันมีอะไรจะบอกคุณ แต่มันเป็นความลับนะ.
+"""
+MULTISPEECH_PLACEHOLDER = """ป้อนสคริปต์โดยใส่ชื่อผู้พูด (หรือลักษณะอารมณ์) ไว้ที่ต้นแต่ละบล็อก ตัวอย่างเช่น:
+{ปกติ} สวัสดีครับ มีอะไรให้ผมช่วยไหมครับ
+{เศร้า} ผมเครียดจริงๆ นะตอนนี้...
+{โกรธ} รู้ไหม! เธอไม่ควรอยู่ที่นี่!
+{กระซิบ} ฉันมีอะไรจะบอกคุณ แต่มันเป็นความลับนะ."""
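
The tips in `TIPS_TEXT` say that sentences should be separated by spaces so the text can be split into parts, and that `max_chars` limits the size of each part. One way such splitting could work is sketched below; `split_into_chunks` is hypothetical, and the project's actual chunking code is not part of this file.

```python
def split_into_chunks(text, max_chars=250):
    """Greedily pack whitespace-separated pieces into chunks of at most max_chars."""
    chunks = []
    current = ""
    for piece in text.split():
        candidate = f"{current} {piece}" if current else piece
        if current and len(candidate) > max_chars:
            # the next piece would overflow, so close the current chunk
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("aa bb cc", max_chars=5))
```

A single piece longer than `max_chars` is kept whole in this sketch, which is one reason unspaced Thai text would end up in oversized chunks.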

deployment/src/f5_tts/configs/E2TTS_Base_train.yaml ADDED

	@@ -0,0 +1,45 @@

+hydra:
+  run:
+    dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}/${now:%Y-%m-%d}/${now:%H-%M-%S}
+datasets:
+  name: Emilia_ZH_EN  # dataset name
+  batch_size_per_gpu: 38400  # 8 GPUs, 8 * 38400 = 307200
+  batch_size_type: frame  # "frame" or "sample"
+  max_samples: 64  # max sequences per batch if use frame-wise batch_size. we set 32 for small models, 64 for base models
+  num_workers: 16
+optim:
+  epochs: 15
+  learning_rate: 7.5e-5
+  num_warmup_updates: 20000  # warmup updates
+  grad_accumulation_steps: 1  # note: updates = steps / grad_accumulation_steps
+  max_grad_norm: 1.0  # gradient clipping
+  bnb_optimizer: False  # use bnb 8bit AdamW optimizer or not
+model:
+  name: E2TTS_Base
+  tokenizer: pinyin
+  tokenizer_path: None  # if tokenizer = 'custom', define the path to the tokenizer you want to use (should be vocab.txt)
+  arch:
+    dim: 1024
+    depth: 24
+    heads: 16
+    ff_mult: 4
+  mel_spec:
+    target_sample_rate: 24000
+    n_mel_channels: 100
+    hop_length: 256
+    win_length: 1024
+    n_fft: 1024
+    mel_spec_type: vocos  # 'vocos' or 'bigvgan'
+  vocoder:
+    is_local: False  # use local offline ckpt or not
+    local_path: None  # local vocoder path
+ckpts:
+  logger: wandb  # wandb | tensorboard | None
+  save_per_updates: 50000  # save checkpoint per updates
+  keep_last_n_checkpoints: -1  # -1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints
+  last_per_updates: 5000  # save last checkpoint per updates
+  save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}
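
The frame-based batch size in this config can be converted into audio duration with a little arithmetic. The sketch assumes one mel frame per `hop_length` samples, which the config implies but does not state; `frames_to_seconds` is a hypothetical helper.

```python
def frames_to_seconds(frames, hop_length=256, sample_rate=24000):
    """Approximate audio duration covered by a number of mel frames."""
    return frames * hop_length / sample_rate


batch_size_per_gpu = 38400   # frames, from the config above
num_gpus = 8                 # from the inline comment in the config

total_frames = batch_size_per_gpu * num_gpus
print("frames per update:", total_frames)
print("seconds per GPU batch:", frames_to_seconds(batch_size_per_gpu))
print("minutes per update:", frames_to_seconds(total_frames) / 60)
```

Under that assumption each GPU batch holds roughly 410 seconds of audio, and one update across 8 GPUs covers a little under an hour.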

deployment/src/f5_tts/configs/E2TTS_Small_train.yaml ADDED

	@@ -0,0 +1,45 @@

+hydra:
+  run:
+    dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}/${now:%Y-%m-%d}/${now:%H-%M-%S}
+datasets:
+  name: Emilia_ZH_EN
+  batch_size_per_gpu: 38400  # 8 GPUs, 8 * 38400 = 307200
+  batch_size_type: frame  # "frame" or "sample"
+  max_samples: 64  # max sequences per batch if use frame-wise batch_size. we set 32 for small models, 64 for base models
+  num_workers: 16
+optim:
+  epochs: 15
+  learning_rate: 7.5e-5
+  num_warmup_updates: 20000  # warmup updates
+  grad_accumulation_steps: 1  # note: updates = steps / grad_accumulation_steps
+  max_grad_norm: 1.0
+  bnb_optimizer: False
+model:
+  name: E2TTS_Small
+  tokenizer: pinyin
+  tokenizer_path: None  # if tokenizer = 'custom', define the path to the tokenizer you want to use (should be vocab.txt)
+  arch:
+    dim: 768
+    depth: 20
+    heads: 12
+    ff_mult: 4
+  mel_spec:
+    target_sample_rate: 24000
+    n_mel_channels: 100
+    hop_length: 256
+    win_length: 1024
+    n_fft: 1024
+    mel_spec_type: vocos  # 'vocos' or 'bigvgan'
+  vocoder:
+    is_local: False  # use local offline ckpt or not
+    local_path: None  # local vocoder path
+ckpts:
+  logger: wandb  # wandb | tensorboard | None
+  save_per_updates: 50000  # save checkpoint per updates
+  keep_last_n_checkpoints: -1  # -1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints
+  last_per_updates: 5000  # save last checkpoint per updates
+  save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}
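
The comment on `keep_last_n_checkpoints` describes three cases: `-1` keeps everything, `0` keeps no intermediate checkpoints, and a positive value keeps the most recent N. That rule can be sketched as follows; `checkpoints_kept` is a hypothetical function, not the trainer's code.

```python
def checkpoints_kept(saved_updates, keep_last_n):
    """Return which intermediate checkpoints survive, oldest first.

    saved_updates: update counts at which checkpoints were written.
    """
    if keep_last_n < 0:
        return list(saved_updates)       # -1: keep all
    if keep_last_n == 0:
        return []                        # 0: no intermediate checkpoints
    return list(saved_updates[-keep_last_n:])


saved = [50000, 100000, 150000]          # matches save_per_updates: 50000
print(checkpoints_kept(saved, 2))
```

The separate `last_per_updates` setting is not modeled here; it appears to control a rolling "last" checkpoint rather than the numbered ones.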

deployment/src/f5_tts/configs/F5TTS_Base_train.yaml ADDED

	@@ -0,0 +1,48 @@

+hydra:
+  run:
+    dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}/${now:%Y-%m-%d}/${now:%H-%M-%S}
+datasets:
+  name: Emilia_ZH_EN  # dataset name
+  batch_size_per_gpu: 38400  # 8 GPUs, 8 * 38400 = 307200
+  batch_size_type: frame  # "frame" or "sample"
+  max_samples: 64  # max sequences per batch if use frame-wise batch_size. we set 32 for small models, 64 for base models
+  num_workers: 16
+optim:
+  epochs: 15
+  learning_rate: 7.5e-5
+  num_warmup_updates: 20000  # warmup updates
+  grad_accumulation_steps: 1  # note: updates = steps / grad_accumulation_steps
+  max_grad_norm: 1.0  # gradient clipping
+  bnb_optimizer: False  # use bnb 8bit AdamW optimizer or not
+model:
+  name: F5TTS_Base  # model name
+  tokenizer: pinyin  # tokenizer type
+  tokenizer_path: None  # if tokenizer = 'custom', define the path to the tokenizer you want to use (should be vocab.txt)
+  arch:
+    dim: 1024
+    depth: 22
+    heads: 16
+    ff_mult: 2
+    text_dim: 512
+    conv_layers: 4
+    checkpoint_activations: False  # recompute activations and save memory for extra compute
+  mel_spec:
+    target_sample_rate: 24000
+    n_mel_channels: 100
+    hop_length: 256
+    win_length: 1024
+    n_fft: 1024
+    mel_spec_type: vocos  # 'vocos' or 'bigvgan'
+  vocoder:
+    is_local: False  # use local offline ckpt or not
+    local_path: None  # local vocoder path
+ckpts:
+  logger: wandb  # wandb | tensorboard | None
+  save_per_updates: 50000  # save checkpoint per updates
+  keep_last_n_checkpoints: -1  # -1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints
+  last_per_updates: 5000  # save last checkpoint per updates
+  save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}
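
The frame-wise batch size in this config can be translated into audio duration. The following is a minimal sketch, assuming one mel frame is produced per `hop_length` samples at `target_sample_rate`; the helper name `frames_to_seconds` is hypothetical and not part of the repository.

```python
def frames_to_seconds(num_frames: int, hop_length: int = 256, sample_rate: int = 24000) -> float:
    """Approximate audio duration covered by `num_frames` mel frames."""
    return num_frames * hop_length / sample_rate


per_gpu_frames = 38400  # datasets.batch_size_per_gpu
num_gpus = 8            # taken from the config comment
total_frames = per_gpu_frames * num_gpus

per_gpu_seconds = frames_to_seconds(per_gpu_frames)
total_seconds = frames_to_seconds(total_frames)

print(f"{total_frames} frames per update")
print(f"{per_gpu_seconds:.1f} s of audio per GPU, {total_seconds / 60:.1f} min in total")
```

Under these assumptions a single GPU batch holds roughly 410 seconds of audio, which is why `max_samples` is needed as a separate cap on the number of sequences.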

deployment/src/f5_tts/configs/F5TTS_Small_train.yaml ADDED

	@@ -0,0 +1,48 @@

+hydra:
+  run:
+    dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}/${now:%Y-%m-%d}/${now:%H-%M-%S}
+datasets:
+  name: Emilia_ZH_EN
+  batch_size_per_gpu: 38400  # 8 GPUs, 8 * 38400 = 307200
+  batch_size_type: frame  # "frame" or "sample"
+  max_samples: 64  # max sequences per batch when using frame-wise batch_size; we set 32 for small models, 64 for base models
+  num_workers: 16
+optim:
+  epochs: 15
+  learning_rate: 7.5e-5
+  num_warmup_updates: 20000  # warmup updates
+  grad_accumulation_steps: 1  # note: updates = steps / grad_accumulation_steps
+  max_grad_norm: 1.0  # gradient clipping
+  bnb_optimizer: False  # use bnb 8bit AdamW optimizer or not
+model:
+  name: F5TTS_Small
+  tokenizer: pinyin
+  tokenizer_path: None  # if tokenizer = 'custom', define the path to the tokenizer you want to use (should be vocab.txt)
+  arch:
+    dim: 768
+    depth: 18
+    heads: 12
+    ff_mult: 2
+    text_dim: 512
+    conv_layers: 4
+    checkpoint_activations: False  # recompute activations and save memory for extra compute
+  mel_spec:
+    target_sample_rate: 24000
+    n_mel_channels: 100
+    hop_length: 256
+    win_length: 1024
+    n_fft: 1024
+    mel_spec_type: vocos  # 'vocos' or 'bigvgan'
+  vocoder:
+    is_local: False  # use local offline ckpt or not
+    local_path: None  # local vocoder path
+ckpts:
+  logger: wandb  # wandb | tensorboard | None
+  save_per_updates: 50000  # save checkpoint per updates
+  keep_last_n_checkpoints: -1  # -1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints
+  last_per_updates: 5000  # save last checkpoint per updates
+  save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}
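
The `${...}` references in `save_dir` are resolved from other keys of the same config. In the real project this is done by the config framework; the sketch below is only a stand-in written with the standard library to show what the resolved path looks like. The `resolve` helper and the `config` dict are hypothetical, and the `${now:...}` resolver used in `hydra.run.dir` is deliberately not handled.

```python
import re

# Only the keys referenced by save_dir, copied from the YAML above.
config = {
    "datasets": {"name": "Emilia_ZH_EN"},
    "model": {
        "name": "F5TTS_Small",
        "tokenizer": "pinyin",
        "mel_spec": {"mel_spec_type": "vocos"},
    },
}


def resolve(template: str, cfg: dict) -> str:
    """Replace each ${a.b.c} with the value found by walking cfg along the dotted path."""

    def lookup(match: re.Match) -> str:
        node = cfg
        for key in match.group(1).split("."):
            node = node[key]
        return str(node)

    # Patterns containing ':' (such as ${now:%Y-%m-%d}) do not match and are left as-is.
    return re.sub(r"\$\{([\w.]+)\}", lookup, template)


save_dir = resolve(
    "ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}",
    config,
)
print(save_dir)
```

Changing any of the four referenced keys therefore changes the checkpoint directory, so two runs with different tokenizers or vocoder types do not overwrite each other.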

deployment/src/f5_tts/eval/README.md ADDED

	@@ -0,0 +1,52 @@

+# Evaluation
+Install packages for evaluation:
+```bash
+pip install -e .[eval]
+```
+## Generating Samples for Evaluation
+### Prepare Test Datasets
+1. *Seed-TTS testset*: Download from [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval).
+2. *LibriSpeech test-clean*: Download from [OpenSLR](http://www.openslr.org/12/).
+3. Unzip the downloaded datasets and place them in the `data/` directory.
+4. Update the path for *LibriSpeech test-clean* data in `src/f5_tts/eval/eval_infer_batch.py`
+5. Our filtered LibriSpeech-PC 4-10s subset: `data/librispeech_pc_test_clean_cross_sentence.lst`
+### Batch Inference for Test Set
+To run batch inference for evaluations, execute the following commands:
+```bash
+# batch inference for evaluations
+accelerate config  # if not set before
+bash src/f5_tts/eval/eval_infer_batch.sh
+```
+## Objective Evaluation on Generated Results
+### Download Evaluation Model Checkpoints
+1. Chinese ASR Model: [Paraformer-zh](https://huggingface.co/funasr/paraformer-zh)
+2. English ASR Model: [Faster-Whisper](https://huggingface.co/Systran/faster-whisper-large-v3)
+3. WavLM Model: Download from [Google Drive](https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view).
+Then update the following scripts with the paths where you placed the evaluation model checkpoints.
+### Objective Evaluation
+Update the paths to point at your batch-inference results, then run the WER / SIM / UTMOS evaluations:
+```bash
+# Evaluation [WER] for Seed-TTS test [ZH] set
+python src/f5_tts/eval/eval_seedtts_testset.py --eval_task wer --lang zh --gen_wav_dir <GEN_WAV_DIR> --gpu_nums 8
+# Evaluation [SIM] for LibriSpeech-PC test-clean (cross-sentence)
+python src/f5_tts/eval/eval_librispeech_test_clean.py --eval_task sim --gen_wav_dir <GEN_WAV_DIR> --librispeech_test_clean_path <TEST_CLEAN_PATH>
+# Evaluation [UTMOS]. --ext: Audio extension
+python src/f5_tts/eval/eval_utmos.py --audio_dir <WAV_DIR> --ext wav
+```
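
For readers unfamiliar with the WER metric named above, the sketch below computes a word error rate as word-level edit distance divided by the reference length. It is an illustration only: the repository's own scoring lives in `run_asr_wer` in `utils_eval.py` and may normalize text differently, and the function names here are hypothetical.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token lists (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over whitespace-separated words, divided by the reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# One substitution (sat -> sit) and one deletion (the) against a 6-word reference.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {example_wer:.3f}")
```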

deployment/src/f5_tts/eval/ecapa_tdnn.py ADDED

	@@ -0,0 +1,330 @@

+# just for speaker similarity evaluation, third-party code
+# From https://github.com/microsoft/UniSpeech/blob/main/downstreams/speaker_verification/models/
+# part of the code is borrowed from https://github.com/lawlict/ECAPA-TDNN
+import os
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+""" Res2Conv1d + BatchNorm1d + ReLU
+"""
+class Res2Conv1dReluBn(nn.Module):
+    """
+    in_channels == out_channels == channels
+    """
+    def __init__(self, channels, kernel_size=1, stride=1, padding=0, dilation=1, bias=True, scale=4):
+        super().__init__()
+        assert channels % scale == 0, "{} % {} != 0".format(channels, scale)
+        self.scale = scale
+        self.width = channels // scale
+        self.nums = scale if scale == 1 else scale - 1
+        self.convs = []
+        self.bns = []
+        for i in range(self.nums):
+            self.convs.append(nn.Conv1d(self.width, self.width, kernel_size, stride, padding, dilation, bias=bias))
+            self.bns.append(nn.BatchNorm1d(self.width))
+        self.convs = nn.ModuleList(self.convs)
+        self.bns = nn.ModuleList(self.bns)
+    def forward(self, x):
+        out = []
+        spx = torch.split(x, self.width, 1)
+        for i in range(self.nums):
+            if i == 0:
+                sp = spx[i]
+            else:
+                sp = sp + spx[i]
+            # Order: conv -> relu -> bn
+            sp = self.convs[i](sp)
+            sp = self.bns[i](F.relu(sp))
+            out.append(sp)
+        if self.scale != 1:
+            out.append(spx[self.nums])
+        out = torch.cat(out, dim=1)
+        return out
+""" Conv1d + BatchNorm1d + ReLU
+"""
+class Conv1dReluBn(nn.Module):
+    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1, padding=0, dilation=1, bias=True):
+        super().__init__()
+        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, stride, padding, dilation, bias=bias)
+        self.bn = nn.BatchNorm1d(out_channels)
+    def forward(self, x):
+        return self.bn(F.relu(self.conv(x)))
+""" The SE connection of 1D case.
+"""
+class SE_Connect(nn.Module):
+    def __init__(self, channels, se_bottleneck_dim=128):
+        super().__init__()
+        self.linear1 = nn.Linear(channels, se_bottleneck_dim)
+        self.linear2 = nn.Linear(se_bottleneck_dim, channels)
+    def forward(self, x):
+        out = x.mean(dim=2)
+        out = F.relu(self.linear1(out))
+        out = torch.sigmoid(self.linear2(out))
+        out = x * out.unsqueeze(2)
+        return out
+""" SE-Res2Block of the ECAPA-TDNN architecture.
+"""
+# def SE_Res2Block(channels, kernel_size, stride, padding, dilation, scale):
+#     return nn.Sequential(
+#         Conv1dReluBn(channels, 512, kernel_size=1, stride=1, padding=0),
+#         Res2Conv1dReluBn(512, kernel_size, stride, padding, dilation, scale=scale),
+#         Conv1dReluBn(512, channels, kernel_size=1, stride=1, padding=0),
+#         SE_Connect(channels)
+#     )
+class SE_Res2Block(nn.Module):
+    def __init__(self, in_channels, out_channels, kernel_size, stride, padding, dilation, scale, se_bottleneck_dim):
+        super().__init__()
+        self.Conv1dReluBn1 = Conv1dReluBn(in_channels, out_channels, kernel_size=1, stride=1, padding=0)
+        self.Res2Conv1dReluBn = Res2Conv1dReluBn(out_channels, kernel_size, stride, padding, dilation, scale=scale)
+        self.Conv1dReluBn2 = Conv1dReluBn(out_channels, out_channels, kernel_size=1, stride=1, padding=0)
+        self.SE_Connect = SE_Connect(out_channels, se_bottleneck_dim)
+        self.shortcut = None
+        if in_channels != out_channels:
+            self.shortcut = nn.Conv1d(
+                in_channels=in_channels,
+                out_channels=out_channels,
+                kernel_size=1,
+            )
+    def forward(self, x):
+        residual = x
+        if self.shortcut:
+            residual = self.shortcut(x)
+        x = self.Conv1dReluBn1(x)
+        x = self.Res2Conv1dReluBn(x)
+        x = self.Conv1dReluBn2(x)
+        x = self.SE_Connect(x)
+        return x + residual
+""" Attentive weighted mean and standard deviation pooling.
+"""
+class AttentiveStatsPool(nn.Module):
+    def __init__(self, in_dim, attention_channels=128, global_context_att=False):
+        super().__init__()
+        self.global_context_att = global_context_att
+        # Use Conv1d with stride == 1 rather than Linear, then we don't need to transpose inputs.
+        if global_context_att:
+            self.linear1 = nn.Conv1d(in_dim * 3, attention_channels, kernel_size=1)  # equals W and b in the paper
+        else:
+            self.linear1 = nn.Conv1d(in_dim, attention_channels, kernel_size=1)  # equals W and b in the paper
+        self.linear2 = nn.Conv1d(attention_channels, in_dim, kernel_size=1)  # equals V and k in the paper
+    def forward(self, x):
+        if self.global_context_att:
+            context_mean = torch.mean(x, dim=-1, keepdim=True).expand_as(x)
+            context_std = torch.sqrt(torch.var(x, dim=-1, keepdim=True) + 1e-10).expand_as(x)
+            x_in = torch.cat((x, context_mean, context_std), dim=1)
+        else:
+            x_in = x
+        # DON'T use ReLU here! In experiments, I find ReLU hard to converge.
+        alpha = torch.tanh(self.linear1(x_in))
+        # alpha = F.relu(self.linear1(x_in))
+        alpha = torch.softmax(self.linear2(alpha), dim=2)
+        mean = torch.sum(alpha * x, dim=2)
+        residuals = torch.sum(alpha * (x**2), dim=2) - mean**2
+        std = torch.sqrt(residuals.clamp(min=1e-9))
+        return torch.cat([mean, std], dim=1)
+class ECAPA_TDNN(nn.Module):
+    def __init__(
+        self,
+        feat_dim=80,
+        channels=512,
+        emb_dim=192,
+        global_context_att=False,
+        feat_type="wavlm_large",
+        sr=16000,
+        feature_selection="hidden_states",
+        update_extract=False,
+        config_path=None,
+    ):
+        super().__init__()
+        self.feat_type = feat_type
+        self.feature_selection = feature_selection
+        self.update_extract = update_extract
+        self.sr = sr
+        torch.hub._validate_not_a_forked_repo = lambda a, b, c: True
+        try:
+            local_s3prl_path = os.path.expanduser("~/.cache/torch/hub/s3prl_s3prl_main")
+            self.feature_extract = torch.hub.load(local_s3prl_path, feat_type, source="local", config_path=config_path)
+        except Exception:  # local s3prl copy unavailable: fall back to fetching it through torch.hub
+            self.feature_extract = torch.hub.load("s3prl/s3prl", feat_type)
+        if len(self.feature_extract.model.encoder.layers) == 24 and hasattr(
+            self.feature_extract.model.encoder.layers[23].self_attn, "fp32_attention"
+        ):
+            self.feature_extract.model.encoder.layers[23].self_attn.fp32_attention = False
+        if len(self.feature_extract.model.encoder.layers) == 24 and hasattr(
+            self.feature_extract.model.encoder.layers[11].self_attn, "fp32_attention"
+        ):
+            self.feature_extract.model.encoder.layers[11].self_attn.fp32_attention = False
+        self.feat_num = self.get_feat_num()
+        self.feature_weight = nn.Parameter(torch.zeros(self.feat_num))
+        if feat_type != "fbank" and feat_type != "mfcc":
+            freeze_list = ["final_proj", "label_embs_concat", "mask_emb", "project_q", "quantizer"]
+            for name, param in self.feature_extract.named_parameters():
+                for freeze_val in freeze_list:
+                    if freeze_val in name:
+                        param.requires_grad = False
+                        break
+        if not self.update_extract:
+            for param in self.feature_extract.parameters():
+                param.requires_grad = False
+        self.instance_norm = nn.InstanceNorm1d(feat_dim)
+        # self.channels = [channels] * 4 + [channels * 3]
+        self.channels = [channels] * 4 + [1536]
+        self.layer1 = Conv1dReluBn(feat_dim, self.channels[0], kernel_size=5, padding=2)
+        self.layer2 = SE_Res2Block(
+            self.channels[0],
+            self.channels[1],
+            kernel_size=3,
+            stride=1,
+            padding=2,
+            dilation=2,
+            scale=8,
+            se_bottleneck_dim=128,
+        )
+        self.layer3 = SE_Res2Block(
+            self.channels[1],
+            self.channels[2],
+            kernel_size=3,
+            stride=1,
+            padding=3,
+            dilation=3,
+            scale=8,
+            se_bottleneck_dim=128,
+        )
+        self.layer4 = SE_Res2Block(
+            self.channels[2],
+            self.channels[3],
+            kernel_size=3,
+            stride=1,
+            padding=4,
+            dilation=4,
+            scale=8,
+            se_bottleneck_dim=128,
+        )
+        # self.conv = nn.Conv1d(self.channels[-1], self.channels[-1], kernel_size=1)
+        cat_channels = channels * 3
+        self.conv = nn.Conv1d(cat_channels, self.channels[-1], kernel_size=1)
+        self.pooling = AttentiveStatsPool(
+            self.channels[-1], attention_channels=128, global_context_att=global_context_att
+        )
+        self.bn = nn.BatchNorm1d(self.channels[-1] * 2)
+        self.linear = nn.Linear(self.channels[-1] * 2, emb_dim)
+    def get_feat_num(self):
+        self.feature_extract.eval()
+        wav = [torch.randn(self.sr).to(next(self.feature_extract.parameters()).device)]
+        with torch.no_grad():
+            features = self.feature_extract(wav)
+        select_feature = features[self.feature_selection]
+        if isinstance(select_feature, (list, tuple)):
+            return len(select_feature)
+        else:
+            return 1
+    def get_feat(self, x):
+        if self.update_extract:
+            x = self.feature_extract([sample for sample in x])
+        else:
+            with torch.no_grad():
+                if self.feat_type == "fbank" or self.feat_type == "mfcc":
+                    x = self.feature_extract(x) + 1e-6  # B x feat_dim x time_len
+                else:
+                    x = self.feature_extract([sample for sample in x])
+        if self.feat_type == "fbank":
+            x = x.log()
+        if self.feat_type != "fbank" and self.feat_type != "mfcc":
+            x = x[self.feature_selection]
+            if isinstance(x, (list, tuple)):
+                x = torch.stack(x, dim=0)
+            else:
+                x = x.unsqueeze(0)
+            norm_weights = F.softmax(self.feature_weight, dim=-1).unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
+            x = (norm_weights * x).sum(dim=0)
+            x = torch.transpose(x, 1, 2) + 1e-6
+        x = self.instance_norm(x)
+        return x
+    def forward(self, x):
+        x = self.get_feat(x)
+        out1 = self.layer1(x)
+        out2 = self.layer2(out1)
+        out3 = self.layer3(out2)
+        out4 = self.layer4(out3)
+        out = torch.cat([out2, out3, out4], dim=1)
+        out = F.relu(self.conv(out))
+        out = self.bn(self.pooling(out))
+        out = self.linear(out)
+        return out
+def ECAPA_TDNN_SMALL(
+    feat_dim,
+    emb_dim=256,
+    feat_type="wavlm_large",
+    sr=16000,
+    feature_selection="hidden_states",
+    update_extract=False,
+    config_path=None,
+):
+    return ECAPA_TDNN(
+        feat_dim=feat_dim,
+        channels=512,
+        emb_dim=emb_dim,
+        feat_type=feat_type,
+        sr=sr,
+        feature_selection=feature_selection,
+        update_extract=update_extract,
+        config_path=config_path,
+    )
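
The statistics computed in `AttentiveStatsPool.forward` can be followed on a single channel without PyTorch. The sketch below is a plain-Python illustration, assuming the attention scores are already given (in the module they come from `linear1` and `linear2`); the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats(values, scores, eps=1e-9):
    """Attention-weighted mean and standard deviation over the time axis of one channel."""
    alpha = softmax(scores)
    mean = sum(a * v for a, v in zip(alpha, values))
    second_moment = sum(a * v * v for a, v in zip(alpha, values))
    # Same form as the module: E[x^2] - E[x]^2, clamped before the square root.
    variance = max(second_moment - mean**2, eps)
    return mean, math.sqrt(variance)


# Equal scores give uniform weights, so this reduces to the ordinary mean and std.
mean, std = attentive_stats([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0])
print(mean, std)
```

The module concatenates the mean and standard deviation of every channel, which is why the following `BatchNorm1d` and `Linear` layers take `self.channels[-1] * 2` inputs.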

deployment/src/f5_tts/eval/eval_infer_batch.py ADDED

	@@ -0,0 +1,207 @@

+import os
+import sys
+sys.path.append(os.getcwd())
+import argparse
+import time
+from importlib.resources import files
+import torch
+import torchaudio
+from accelerate import Accelerator
+from tqdm import tqdm
+from f5_tts.eval.utils_eval import (
+    get_inference_prompt,
+    get_librispeech_test_clean_metainfo,
+    get_seedtts_testset_metainfo,
+)
+from f5_tts.infer.utils_infer import load_checkpoint, load_vocoder
+from f5_tts.model import CFM, DiT, UNetT
+from f5_tts.model.utils import get_tokenizer
+accelerator = Accelerator()
+device = f"cuda:{accelerator.process_index}"
+# --------------------- Dataset Settings -------------------- #
+target_sample_rate = 24000
+n_mel_channels = 100
+hop_length = 256
+win_length = 1024
+n_fft = 1024
+target_rms = 0.1
+rel_path = str(files("f5_tts").joinpath("../../"))
+def main():
+    # ---------------------- infer setting ---------------------- #
+    parser = argparse.ArgumentParser(description="batch inference")
+    parser.add_argument("-s", "--seed", default=None, type=int)
+    parser.add_argument("-d", "--dataset", default="Emilia_ZH_EN")
+    parser.add_argument("-n", "--expname", required=True)
+    parser.add_argument("-c", "--ckptstep", default=1200000, type=int)
+    parser.add_argument("-m", "--mel_spec_type", default="vocos", type=str, choices=["bigvgan", "vocos"])
+    parser.add_argument("-to", "--tokenizer", default="pinyin", type=str, choices=["pinyin", "char"])
+    parser.add_argument("-nfe", "--nfestep", default=32, type=int)
+    parser.add_argument("-o", "--odemethod", default="euler")
+    parser.add_argument("-ss", "--swaysampling", default=-1, type=float)
+    parser.add_argument("-t", "--testset", required=True)
+    args = parser.parse_args()
+    seed = args.seed
+    dataset_name = args.dataset
+    exp_name = args.expname
+    ckpt_step = args.ckptstep
+    ckpt_path = rel_path + f"/ckpts/{exp_name}/model_{ckpt_step}.pt"
+    mel_spec_type = args.mel_spec_type
+    tokenizer = args.tokenizer
+    nfe_step = args.nfestep
+    ode_method = args.odemethod
+    sway_sampling_coef = args.swaysampling
+    testset = args.testset
+    infer_batch_size = 1  # max frames. 1 for ddp single inference (recommended)
+    cfg_strength = 2.0
+    speed = 1.0
+    use_truth_duration = False
+    no_ref_audio = False
+    if exp_name == "F5TTS_Base":
+        model_cls = DiT
+        model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
+    elif exp_name == "E2TTS_Base":
+        model_cls = UNetT
+        model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
+    else:
+        # fail early instead of hitting a NameError on model_cls further down
+        raise ValueError(f"Unsupported expname: {exp_name}")
+    if testset == "ls_pc_test_clean":
+        metalst = rel_path + "/data/librispeech_pc_test_clean_cross_sentence.lst"
+        librispeech_test_clean_path = "<SOME_PATH>/LibriSpeech/test-clean"  # test-clean path
+        metainfo = get_librispeech_test_clean_metainfo(metalst, librispeech_test_clean_path)
+    elif testset == "seedtts_test_zh":
+        metalst = rel_path + "/data/seedtts_testset/zh/meta.lst"
+        metainfo = get_seedtts_testset_metainfo(metalst)
+    elif testset == "seedtts_test_en":
+        metalst = rel_path + "/data/seedtts_testset/en/meta.lst"
+        metainfo = get_seedtts_testset_metainfo(metalst)
+    else:
+        # fail early instead of hitting a NameError on metainfo further down
+        raise ValueError(f"Unsupported testset: {testset}")
+    # path to save generated wavs
+    output_dir = (
+        f"{rel_path}/"
+        f"results/{exp_name}_{ckpt_step}/{testset}/"
+        f"seed{seed}_{ode_method}_nfe{nfe_step}_{mel_spec_type}"
+        f"{f'_ss{sway_sampling_coef}' if sway_sampling_coef else ''}"
+        f"_cfg{cfg_strength}_speed{speed}"
+        f"{'_gt-dur' if use_truth_duration else ''}"
+        f"{'_no-ref-audio' if no_ref_audio else ''}"
+    )
+    # -------------------------------------------------#
+    use_ema = True
+    prompts_all = get_inference_prompt(
+        metainfo,
+        speed=speed,
+        tokenizer=tokenizer,
+        target_sample_rate=target_sample_rate,
+        n_mel_channels=n_mel_channels,
+        hop_length=hop_length,
+        mel_spec_type=mel_spec_type,
+        target_rms=target_rms,
+        use_truth_duration=use_truth_duration,
+        infer_batch_size=infer_batch_size,
+    )
+    # Vocoder model
+    local = False
+    if mel_spec_type == "vocos":
+        vocoder_local_path = "../checkpoints/charactr/vocos-mel-24khz"
+    elif mel_spec_type == "bigvgan":
+        vocoder_local_path = "../checkpoints/bigvgan_v2_24khz_100band_256x"
+    vocoder = load_vocoder(vocoder_name=mel_spec_type, is_local=local, local_path=vocoder_local_path)
+    # Tokenizer
+    vocab_char_map, vocab_size = get_tokenizer(dataset_name, tokenizer)
+    # Model
+    model = CFM(
+        transformer=model_cls(**model_cfg, text_num_embeds=vocab_size, mel_dim=n_mel_channels),
+        mel_spec_kwargs=dict(
+            n_fft=n_fft,
+            hop_length=hop_length,
+            win_length=win_length,
+            n_mel_channels=n_mel_channels,
+            target_sample_rate=target_sample_rate,
+            mel_spec_type=mel_spec_type,
+        ),
+        odeint_kwargs=dict(
+            method=ode_method,
+        ),
+        vocab_char_map=vocab_char_map,
+    ).to(device)
+    dtype = torch.float32 if mel_spec_type == "bigvgan" else None
+    model = load_checkpoint(model, ckpt_path, device, dtype=dtype, use_ema=use_ema)
+    if not os.path.exists(output_dir) and accelerator.is_main_process:
+        os.makedirs(output_dir)
+    # start batch inference
+    accelerator.wait_for_everyone()
+    start = time.time()
+    with accelerator.split_between_processes(prompts_all) as prompts:
+        for prompt in tqdm(prompts, disable=not accelerator.is_local_main_process):
+            utts, ref_rms_list, ref_mels, ref_mel_lens, total_mel_lens, final_text_list = prompt
+            ref_mels = ref_mels.to(device)
+            ref_mel_lens = torch.tensor(ref_mel_lens, dtype=torch.long).to(device)
+            total_mel_lens = torch.tensor(total_mel_lens, dtype=torch.long).to(device)
+            # Inference
+            with torch.inference_mode():
+                generated, _ = model.sample(
+                    cond=ref_mels,
+                    text=final_text_list,
+                    duration=total_mel_lens,
+                    lens=ref_mel_lens,
+                    steps=nfe_step,
+                    cfg_strength=cfg_strength,
+                    sway_sampling_coef=sway_sampling_coef,
+                    no_ref_audio=no_ref_audio,
+                    seed=seed,
+                )
+                # Final result
+                for i, gen in enumerate(generated):
+                    gen = gen[ref_mel_lens[i] : total_mel_lens[i], :].unsqueeze(0)
+                    gen_mel_spec = gen.permute(0, 2, 1).to(torch.float32)
+                    if mel_spec_type == "vocos":
+                        generated_wave = vocoder.decode(gen_mel_spec).cpu()
+                    elif mel_spec_type == "bigvgan":
+                        generated_wave = vocoder(gen_mel_spec).squeeze(0).cpu()
+                    if ref_rms_list[i] < target_rms:
+                        generated_wave = generated_wave * ref_rms_list[i] / target_rms
+                    torchaudio.save(f"{output_dir}/{utts[i]}.wav", generated_wave, target_sample_rate)
+    accelerator.wait_for_everyone()
+    if accelerator.is_main_process:
+        timediff = time.time() - start
+        print(f"Done batch inference in {timediff / 60 :.2f} minutes.")
+if __name__ == "__main__":
+    main()
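
The run-specific part of `output_dir` encodes the sampling settings, and the sway-sampling suffix is only added when the coefficient is truthy. The sketch below isolates that naming logic so the effect of `-ss 0` is easy to see; `build_run_name` is a hypothetical helper, not a function in the repository.

```python
def build_run_name(
    seed,
    ode_method,
    nfe_step,
    mel_spec_type,
    sway_sampling_coef,
    cfg_strength=2.0,
    speed=1.0,
    use_truth_duration=False,
    no_ref_audio=False,
):
    """Mirror the last path component of output_dir built in main()."""
    return (
        f"seed{seed}_{ode_method}_nfe{nfe_step}_{mel_spec_type}"
        f"{f'_ss{sway_sampling_coef}' if sway_sampling_coef else ''}"
        f"_cfg{cfg_strength}_speed{speed}"
        f"{'_gt-dur' if use_truth_duration else ''}"
        f"{'_no-ref-audio' if no_ref_audio else ''}"
    )


# Values as argparse would deliver them when the flags are given explicitly.
f5_name = build_run_name(0, "euler", 16, "vocos", -1.0)
e2_name = build_run_name(0, "midpoint", 32, "vocos", 0.0)
print(f5_name)
print(e2_name)
```

Because `0.0` is falsy, runs launched with `-ss 0` get no `_ss` suffix at all, so the directory name alone does not distinguish "sway sampling disabled" from a name built without that field. When `-ss` is omitted, argparse leaves the non-string default `-1` unconverted, so the suffix would read `_ss-1` rather than `_ss-1.0`.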

deployment/src/f5_tts/eval/eval_infer_batch.sh ADDED

	@@ -0,0 +1,13 @@

+#!/bin/bash
+# e.g. F5-TTS, 16 NFE
+accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "F5TTS_Base" -t "seedtts_test_zh" -nfe 16
+accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "F5TTS_Base" -t "seedtts_test_en" -nfe 16
+accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "F5TTS_Base" -t "ls_pc_test_clean" -nfe 16
+# e.g. Vanilla E2 TTS, 32 NFE
+accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "E2TTS_Base" -t "seedtts_test_zh" -o "midpoint" -ss 0
+accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "E2TTS_Base" -t "seedtts_test_en" -o "midpoint" -ss 0
+accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "E2TTS_Base" -t "ls_pc_test_clean" -o "midpoint" -ss 0
+# etc.

deployment/src/f5_tts/eval/eval_librispeech_test_clean.py ADDED

	@@ -0,0 +1,96 @@

+# Evaluate on LibriSpeech test-clean: ~3s prompt to generate 4-10s audio (following the VALL-E / Voicebox evaluation setup)
+import argparse
+import json
+import os
+import sys
+sys.path.append(os.getcwd())
+import multiprocessing as mp
+from importlib.resources import files
+import numpy as np
+from f5_tts.eval.utils_eval import (
+    get_librispeech_test,
+    run_asr_wer,
+    run_sim,
+)
+rel_path = str(files("f5_tts").joinpath("../../"))
+def get_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-e", "--eval_task", type=str, default="wer", choices=["sim", "wer"])
+    parser.add_argument("-l", "--lang", type=str, default="en")
+    parser.add_argument("-g", "--gen_wav_dir", type=str, required=True)
+    parser.add_argument("-p", "--librispeech_test_clean_path", type=str, required=True)
+    parser.add_argument("-n", "--gpu_nums", type=int, default=8, help="Number of GPUs to use")
+    parser.add_argument("--local", action="store_true", help="Use local custom checkpoint directory")
+    return parser.parse_args()
+def main():
+    args = get_args()
+    eval_task = args.eval_task
+    lang = args.lang
+    librispeech_test_clean_path = args.librispeech_test_clean_path  # test-clean path
+    gen_wav_dir = args.gen_wav_dir
+    metalst = rel_path + "/data/librispeech_pc_test_clean_cross_sentence.lst"
+    gpus = list(range(args.gpu_nums))
+    test_set = get_librispeech_test(metalst, gen_wav_dir, gpus, librispeech_test_clean_path)
+    ## In LibriSpeech, some speakers utilized varying voice characteristics for different characters in the book,
+    ## leading to a low similarity for the ground truth in some cases.
+    # test_set = get_librispeech_test(metalst, gen_wav_dir, gpus, librispeech_test_clean_path, eval_ground_truth = True)  # eval ground truth
+    local = args.local
+    if local:  # use local custom checkpoint dir
+        asr_ckpt_dir = "../checkpoints/Systran/faster-whisper-large-v3"
+    else:
+        asr_ckpt_dir = ""  # auto download to cache dir
+    wavlm_ckpt_dir = "../checkpoints/UniSpeech/wavlm_large_finetune.pth"
+    # --------------------------- WER ---------------------------
+    if eval_task == "wer":
+        wer_results = []
+        wers = []
+        with mp.Pool(processes=len(gpus)) as pool:
+            args = [(rank, lang, sub_test_set, asr_ckpt_dir) for (rank, sub_test_set) in test_set]
+            results = pool.map(run_asr_wer, args)
+            for r in results:
+                wer_results.extend(r)
+        wer_result_path = f"{gen_wav_dir}/{lang}_wer_results.jsonl"
+        with open(wer_result_path, "w") as f:
+            for line in wer_results:
+                wers.append(line["wer"])
+                json_line = json.dumps(line, ensure_ascii=False)
+                f.write(json_line + "\n")
+        wer = round(np.mean(wers) * 100, 3)
+        print(f"\nTotal {len(wers)} samples")
+        print(f"WER      : {wer}%")
+        print(f"Results have been saved to {wer_result_path}")
+    # --------------------------- SIM ---------------------------
+    if eval_task == "sim":
+        sims = []
+        with mp.Pool(processes=len(gpus)) as pool:
+            args = [(rank, sub_test_set, wavlm_ckpt_dir) for (rank, sub_test_set) in test_set]
+            results = pool.map(run_sim, args)
+            for r in results:
+                sims.extend(r)
+        sim = round(sum(sims) / len(sims), 3)
+        print(f"\nTotal {len(sims)} samples")
+        print(f"SIM      : {sim}")
+if __name__ == "__main__":
+    main()
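
The reported WER is the mean of the per-utterance WERs, expressed as a percentage. The sketch below reproduces that aggregation step on made-up numbers; `aggregate_wer` is a hypothetical name, and the sample values are not results from any real run.

```python
from statistics import mean


def aggregate_wer(per_utterance):
    """Mean of per-utterance WERs, as a percentage rounded to 3 decimals."""
    wers = [item["wer"] for item in per_utterance]
    return round(mean(wers) * 100, 3)


# Illustrative values only.
results = [{"wer": 0.0}, {"wer": 0.25}, {"wer": 0.5}]
summary = aggregate_wer(results)
print(f"WER      : {summary}%")
```

With this averaging a short utterance counts as much as a long one, so the figure can differ from a corpus-level WER computed over all words at once.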

deployment/src/f5_tts/eval/eval_seedtts_testset.py ADDED

	@@ -0,0 +1,95 @@

+# Evaluate with Seed-TTS testset
+import argparse
+import json
+import os
+import sys
+sys.path.append(os.getcwd())
+import multiprocessing as mp
+from importlib.resources import files
+import numpy as np
+from f5_tts.eval.utils_eval import (
+    get_seed_tts_test,
+    run_asr_wer,
+    run_sim,
+)
+rel_path = str(files("f5_tts").joinpath("../../"))
+def get_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-e", "--eval_task", type=str, default="wer", choices=["sim", "wer"])
+    parser.add_argument("-l", "--lang", type=str, default="en", choices=["zh", "en"])
+    parser.add_argument("-g", "--gen_wav_dir", type=str, required=True)
+    parser.add_argument("-n", "--gpu_nums", type=int, default=8, help="Number of GPUs to use")
+    parser.add_argument("--local", action="store_true", help="Use local custom checkpoint directory")
+    return parser.parse_args()
+def main():
+    args = get_args()
+    eval_task = args.eval_task
+    lang = args.lang
+    gen_wav_dir = args.gen_wav_dir
+    metalst = rel_path + f"/data/seedtts_testset/{lang}/meta.lst"  # seed-tts testset
+    # NOTE: paraformer-zh results vary slightly with the number of GPUs, because the batch size differs.
+    #       The zh WER of 1.254 appears to come from a 4-worker run of wer_seed_tts.
+    gpus = list(range(args.gpu_nums))
+    test_set = get_seed_tts_test(metalst, gen_wav_dir, gpus)
+    local = args.local
+    if local:  # use local custom checkpoint dir
+        if lang == "zh":
+            asr_ckpt_dir = "../checkpoints/funasr"  # paraformer-zh dir under funasr
+        elif lang == "en":
+            asr_ckpt_dir = "../checkpoints/Systran/faster-whisper-large-v3"
+    else:
+        asr_ckpt_dir = ""  # auto download to cache dir
+    wavlm_ckpt_dir = "../checkpoints/UniSpeech/wavlm_large_finetune.pth"
+    # --------------------------- WER ---------------------------
+    if eval_task == "wer":
+        wer_results = []
+        wers = []
+        with mp.Pool(processes=len(gpus)) as pool:
+            args = [(rank, lang, sub_test_set, asr_ckpt_dir) for (rank, sub_test_set) in test_set]
+            results = pool.map(run_asr_wer, args)
+            for r in results:
+                wer_results.extend(r)
+        wer_result_path = f"{gen_wav_dir}/{lang}_wer_results.jsonl"
+        with open(wer_result_path, "w") as f:
+            for line in wer_results:
+                wers.append(line["wer"])
+                json_line = json.dumps(line, ensure_ascii=False)
+                f.write(json_line + "\n")
+        wer = round(np.mean(wers) * 100, 3)
+        print(f"\nTotal {len(wers)} samples")
+        print(f"WER      : {wer}%")
+        print(f"Results have been saved to {wer_result_path}")
+    # --------------------------- SIM ---------------------------
+    if eval_task == "sim":
+        sims = []
+        with mp.Pool(processes=len(gpus)) as pool:
+            args = [(rank, sub_test_set, wavlm_ckpt_dir) for (rank, sub_test_set) in test_set]
+            results = pool.map(run_sim, args)
+            for r in results:
+                sims.extend(r)
+        sim = round(sum(sims) / len(sims), 3)
+        print(f"\nTotal {len(sims)} samples")
+        print(f"SIM      : {sim}")
+if __name__ == "__main__":
+    main()

deployment/src/f5_tts/eval/eval_utmos.py ADDED

	@@ -0,0 +1,44 @@

+import argparse
+import json
+from pathlib import Path
+import librosa
+import torch
+from tqdm import tqdm
+def main():
+    parser = argparse.ArgumentParser(description="UTMOS Evaluation")
+    parser.add_argument("--audio_dir", type=str, required=True, help="Directory containing the audio files (searched recursively).")
+    parser.add_argument("--ext", type=str, default="wav", help="Audio extension.")
+    args = parser.parse_args()
+    device = "cuda" if torch.cuda.is_available() else "xpu" if torch.xpu.is_available() else "cpu"
+    predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True)
+    predictor = predictor.to(device)
+    audio_paths = list(Path(args.audio_dir).rglob(f"*.{args.ext}"))
+    utmos_results = {}
+    utmos_score = 0
+    for audio_path in tqdm(audio_paths, desc="Processing"):
+        wav_name = audio_path.stem
+        wav, sr = librosa.load(audio_path, sr=None, mono=True)
+        wav_tensor = torch.from_numpy(wav).to(device).unsqueeze(0)
+        score = predictor(wav_tensor, sr)
+        utmos_results[str(wav_name)] = score.item()
+        utmos_score += score.item()
+    avg_score = utmos_score / len(audio_paths) if len(audio_paths) > 0 else 0
+    print(f"UTMOS: {avg_score}")
+    utmos_result_path = Path(args.audio_dir) / "utmos_results.json"
+    with open(utmos_result_path, "w", encoding="utf-8") as f:
+        json.dump(utmos_results, f, ensure_ascii=False, indent=4)
+    print(f"Results have been saved to {utmos_result_path}")
+if __name__ == "__main__":
+    main()
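
Note on device selection in `eval_utmos.py`: the script prefers CUDA, then XPU, then CPU. The sketch below expresses that preference as a standalone helper that also tolerates PyTorch builds without a `torch.xpu` attribute. The name `pick_device` and the `SimpleNamespace` stand-ins are assumptions for illustration only; in real use the `torch` module itself would be passed in.

```python
from types import SimpleNamespace


def pick_device(torch_module):
    """Prefer CUDA, then XPU, then CPU; tolerate builds that lack torch.xpu."""
    if torch_module.cuda.is_available():
        return "cuda"
    xpu = getattr(torch_module, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    return "cpu"


# Stand-ins for the torch module, used only to exercise the three branches.
fake_cuda = SimpleNamespace(cuda=SimpleNamespace(is_available=lambda: True))
fake_xpu = SimpleNamespace(
    cuda=SimpleNamespace(is_available=lambda: False),
    xpu=SimpleNamespace(is_available=lambda: True),
)
fake_cpu = SimpleNamespace(cuda=SimpleNamespace(is_available=lambda: False))

print(pick_device(fake_cuda), pick_device(fake_xpu), pick_device(fake_cpu))
```

With PyTorch installed, the call would simply be `pick_device(torch)`.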

deployment/src/f5_tts/eval/utils_eval.py ADDED

	@@ -0,0 +1,413 @@

+import math
+import os
+import random
+import string
+from pathlib import Path
+import torch
+import torch.nn.functional as F
+import torchaudio
+from tqdm import tqdm
+from f5_tts.eval.ecapa_tdnn import ECAPA_TDNN_SMALL
+from f5_tts.model.modules import MelSpec
+from f5_tts.model.utils import convert_char_to_pinyin
+# seedtts testset metainfo: utt, prompt_text, prompt_wav, gt_text, gt_wav
+def get_seedtts_testset_metainfo(metalst):
+    with open(metalst) as f:
+        lines = f.readlines()
+    metainfo = []
+    for line in lines:
+        fields = line.strip().split("|")
+        if len(fields) == 5:
+            utt, prompt_text, prompt_wav, gt_text, gt_wav = fields
+        elif len(fields) == 4:
+            utt, prompt_text, prompt_wav, gt_text = fields
+            gt_wav = os.path.join(os.path.dirname(metalst), "wavs", utt + ".wav")
+        else:
+            continue  # skip blank or malformed lines instead of reusing stale values
+        if not os.path.isabs(prompt_wav):
+            prompt_wav = os.path.join(os.path.dirname(metalst), prompt_wav)
+        metainfo.append((utt, prompt_text, prompt_wav, gt_text, gt_wav))
+    return metainfo
+# librispeech test-clean metainfo: gen_utt, ref_txt, ref_wav, gen_txt, gen_wav
+def get_librispeech_test_clean_metainfo(metalst, librispeech_test_clean_path):
+    with open(metalst) as f:
+        lines = f.readlines()
+    metainfo = []
+    for line in lines:
+        ref_utt, ref_dur, ref_txt, gen_utt, gen_dur, gen_txt = line.strip().split("\t")
+        # ref_txt = ref_txt[0] + ref_txt[1:].lower() + '.'  # if use librispeech test-clean (no-pc)
+        ref_spk_id, ref_chaptr_id, _ = ref_utt.split("-")
+        ref_wav = os.path.join(librispeech_test_clean_path, ref_spk_id, ref_chaptr_id, ref_utt + ".flac")
+        # gen_txt = gen_txt[0] + gen_txt[1:].lower() + '.'  # if use librispeech test-clean (no-pc)
+        gen_spk_id, gen_chaptr_id, _ = gen_utt.split("-")
+        gen_wav = os.path.join(librispeech_test_clean_path, gen_spk_id, gen_chaptr_id, gen_utt + ".flac")
+        metainfo.append((gen_utt, ref_txt, ref_wav, " " + gen_txt, gen_wav))
+    return metainfo
+# padded to max length mel batch
+def padded_mel_batch(ref_mels):
+    max_mel_length = torch.LongTensor([mel.shape[-1] for mel in ref_mels]).amax()
+    padded_ref_mels = []
+    for mel in ref_mels:
+        padded_ref_mel = F.pad(mel, (0, max_mel_length - mel.shape[-1]), value=0)
+        padded_ref_mels.append(padded_ref_mel)
+    padded_ref_mels = torch.stack(padded_ref_mels)
+    padded_ref_mels = padded_ref_mels.permute(0, 2, 1)
+    return padded_ref_mels
+# get prompts from metainfo containing: utt, prompt_text, prompt_wav, gt_text, gt_wav
+def get_inference_prompt(
+    metainfo,
+    speed=1.0,
+    tokenizer="pinyin",
+    polyphone=True,
+    target_sample_rate=24000,
+    n_fft=1024,
+    win_length=1024,
+    n_mel_channels=100,
+    hop_length=256,
+    mel_spec_type="vocos",
+    target_rms=0.1,
+    use_truth_duration=False,
+    infer_batch_size=1,
+    num_buckets=200,
+    min_secs=3,
+    max_secs=40,
+):
+    prompts_all = []
+    min_tokens = min_secs * target_sample_rate // hop_length
+    max_tokens = max_secs * target_sample_rate // hop_length
+    batch_accum = [0] * num_buckets
+    utts, ref_rms_list, ref_mels, ref_mel_lens, total_mel_lens, final_text_list = (
+        [[] for _ in range(num_buckets)] for _ in range(6)
+    )
+    mel_spectrogram = MelSpec(
+        n_fft=n_fft,
+        hop_length=hop_length,
+        win_length=win_length,
+        n_mel_channels=n_mel_channels,
+        target_sample_rate=target_sample_rate,
+        mel_spec_type=mel_spec_type,
+    )
+    for utt, prompt_text, prompt_wav, gt_text, gt_wav in tqdm(metainfo, desc="Processing prompts..."):
+        # Audio
+        ref_audio, ref_sr = torchaudio.load(prompt_wav)
+        ref_rms = torch.sqrt(torch.mean(torch.square(ref_audio)))
+        if ref_rms < target_rms:
+            ref_audio = ref_audio * target_rms / ref_rms
+        assert ref_audio.shape[-1] > 5000, f"Empty prompt wav: {prompt_wav}, or torchaudio backend issue."
+        if ref_sr != target_sample_rate:
+            resampler = torchaudio.transforms.Resample(ref_sr, target_sample_rate)
+            ref_audio = resampler(ref_audio)
+        # Text
+        if len(prompt_text[-1].encode("utf-8")) == 1:
+            prompt_text = prompt_text + " "
+        text = [prompt_text + gt_text]
+        if tokenizer == "pinyin":
+            text_list = convert_char_to_pinyin(text, polyphone=polyphone)
+        else:
+            text_list = text
+        # Duration, mel frame length
+        ref_mel_len = ref_audio.shape[-1] // hop_length
+        if use_truth_duration:
+            gt_audio, gt_sr = torchaudio.load(gt_wav)
+            if gt_sr != target_sample_rate:
+                resampler = torchaudio.transforms.Resample(gt_sr, target_sample_rate)
+                gt_audio = resampler(gt_audio)
+            total_mel_len = ref_mel_len + int(gt_audio.shape[-1] / hop_length / speed)
+            # # test vocoder resynthesis
+            # ref_audio = gt_audio
+        else:
+            ref_text_len = len(prompt_text.encode("utf-8"))
+            gen_text_len = len(gt_text.encode("utf-8"))
+            total_mel_len = ref_mel_len + int(ref_mel_len / ref_text_len * gen_text_len / speed)
+        # to mel spectrogram
+        ref_mel = mel_spectrogram(ref_audio)
+        ref_mel = ref_mel.squeeze(0)
+        # deal with batch
+        assert infer_batch_size > 0, "infer_batch_size should be greater than 0."
+        assert (
+            min_tokens <= total_mel_len <= max_tokens
+        ), f"Audio {utt} has duration {total_mel_len*hop_length//target_sample_rate}s out of range [{min_secs}, {max_secs}]."
+        bucket_i = math.floor((total_mel_len - min_tokens) / (max_tokens - min_tokens + 1) * num_buckets)
+        utts[bucket_i].append(utt)
+        ref_rms_list[bucket_i].append(ref_rms)
+        ref_mels[bucket_i].append(ref_mel)
+        ref_mel_lens[bucket_i].append(ref_mel_len)
+        total_mel_lens[bucket_i].append(total_mel_len)
+        final_text_list[bucket_i].extend(text_list)
+        batch_accum[bucket_i] += total_mel_len
+        if batch_accum[bucket_i] >= infer_batch_size:
+            # print(f"\n{len(ref_mels[bucket_i][0][0])}\n{ref_mel_lens[bucket_i]}\n{total_mel_lens[bucket_i]}")
+            prompts_all.append(
+                (
+                    utts[bucket_i],
+                    ref_rms_list[bucket_i],
+                    padded_mel_batch(ref_mels[bucket_i]),
+                    ref_mel_lens[bucket_i],
+                    total_mel_lens[bucket_i],
+                    final_text_list[bucket_i],
+                )
+            )
+            batch_accum[bucket_i] = 0
+            (
+                utts[bucket_i],
+                ref_rms_list[bucket_i],
+                ref_mels[bucket_i],
+                ref_mel_lens[bucket_i],
+                total_mel_lens[bucket_i],
+                final_text_list[bucket_i],
+            ) = [], [], [], [], [], []
+    # add residual
+    for bucket_i, bucket_frames in enumerate(batch_accum):
+        if bucket_frames > 0:
+            prompts_all.append(
+                (
+                    utts[bucket_i],
+                    ref_rms_list[bucket_i],
+                    padded_mel_batch(ref_mels[bucket_i]),
+                    ref_mel_lens[bucket_i],
+                    total_mel_lens[bucket_i],
+                    final_text_list[bucket_i],
+                )
+            )
+    # not only leave easy work for last workers
+    random.seed(666)
+    random.shuffle(prompts_all)
+    return prompts_all
+# get wav_res_ref_text of seed-tts test metalst
+# https://github.com/BytedanceSpeech/seed-tts-eval
+def get_seed_tts_test(metalst, gen_wav_dir, gpus):
+    with open(metalst) as f:
+        lines = f.readlines()
+    test_set_ = []
+    for line in tqdm(lines):
+        fields = line.strip().split("|")
+        if len(fields) == 5:
+            utt, prompt_text, prompt_wav, gt_text, gt_wav = fields
+        elif len(fields) == 4:
+            utt, prompt_text, prompt_wav, gt_text = fields
+        else:
+            continue  # skip blank or malformed lines instead of reusing stale values
+        if not os.path.exists(os.path.join(gen_wav_dir, utt + ".wav")):
+            continue
+        gen_wav = os.path.join(gen_wav_dir, utt + ".wav")
+        if not os.path.isabs(prompt_wav):
+            prompt_wav = os.path.join(os.path.dirname(metalst), prompt_wav)
+        test_set_.append((gen_wav, prompt_wav, gt_text))
+    num_jobs = len(gpus)
+    if num_jobs == 1:
+        return [(gpus[0], test_set_)]
+    wav_per_job = len(test_set_) // num_jobs + 1
+    test_set = []
+    for i in range(num_jobs):
+        test_set.append((gpus[i], test_set_[i * wav_per_job : (i + 1) * wav_per_job]))
+    return test_set
+# get librispeech test-clean cross sentence test
+def get_librispeech_test(metalst, gen_wav_dir, gpus, librispeech_test_clean_path, eval_ground_truth=False):
+    with open(metalst) as f:
+        lines = f.readlines()
+    test_set_ = []
+    for line in tqdm(lines):
+        ref_utt, ref_dur, ref_txt, gen_utt, gen_dur, gen_txt = line.strip().split("\t")
+        if eval_ground_truth:
+            gen_spk_id, gen_chaptr_id, _ = gen_utt.split("-")
+            gen_wav = os.path.join(librispeech_test_clean_path, gen_spk_id, gen_chaptr_id, gen_utt + ".flac")
+        else:
+            if not os.path.exists(os.path.join(gen_wav_dir, gen_utt + ".wav")):
+                raise FileNotFoundError(f"Generated wav not found: {gen_utt}")
+            gen_wav = os.path.join(gen_wav_dir, gen_utt + ".wav")
+        ref_spk_id, ref_chaptr_id, _ = ref_utt.split("-")
+        ref_wav = os.path.join(librispeech_test_clean_path, ref_spk_id, ref_chaptr_id, ref_utt + ".flac")
+        test_set_.append((gen_wav, ref_wav, gen_txt))
+    num_jobs = len(gpus)
+    if num_jobs == 1:
+        return [(gpus[0], test_set_)]
+    wav_per_job = len(test_set_) // num_jobs + 1
+    test_set = []
+    for i in range(num_jobs):
+        test_set.append((gpus[i], test_set_[i * wav_per_job : (i + 1) * wav_per_job]))
+    return test_set
+# load asr model
+def load_asr_model(lang, ckpt_dir=""):
+    if lang == "zh":
+        from funasr import AutoModel
+        model = AutoModel(
+            model=os.path.join(ckpt_dir, "paraformer-zh"),
+            # vad_model = os.path.join(ckpt_dir, "fsmn-vad"),
+            # punc_model = os.path.join(ckpt_dir, "ct-punc"),
+            # spk_model = os.path.join(ckpt_dir, "cam++"),
+            disable_update=True,
+        )  # following seed-tts setting
+    elif lang == "en":
+        from faster_whisper import WhisperModel
+        model_size = "large-v3" if ckpt_dir == "" else ckpt_dir
+        model = WhisperModel(model_size, device="cuda", compute_type="float16")
+    return model
+# WER Evaluation, the way Seed-TTS does
+def run_asr_wer(args):
+    rank, lang, test_set, ckpt_dir = args
+    if lang == "zh":
+        import zhconv
+        torch.cuda.set_device(rank)
+    elif lang == "en":
+        os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
+    else:
+        raise NotImplementedError(
+            "lang support only 'zh' (funasr paraformer-zh), 'en' (faster-whisper-large-v3), for now."
+        )
+    asr_model = load_asr_model(lang, ckpt_dir=ckpt_dir)
+    from zhon.hanzi import punctuation
+    punctuation_all = punctuation + string.punctuation
+    wer_results = []
+    from jiwer import compute_measures
+    for gen_wav, prompt_wav, truth in tqdm(test_set):
+        if lang == "zh":
+            res = asr_model.generate(input=gen_wav, batch_size_s=300, disable_pbar=True)
+            hypo = res[0]["text"]
+            hypo = zhconv.convert(hypo, "zh-cn")
+        elif lang == "en":
+            segments, _ = asr_model.transcribe(gen_wav, beam_size=5, language="en")
+            hypo = ""
+            for segment in segments:
+                hypo = hypo + " " + segment.text
+        raw_truth = truth
+        raw_hypo = hypo
+        for x in punctuation_all:
+            truth = truth.replace(x, "")
+            hypo = hypo.replace(x, "")
+        truth = truth.replace("  ", " ")
+        hypo = hypo.replace("  ", " ")
+        if lang == "zh":
+            truth = " ".join([x for x in truth])
+            hypo = " ".join([x for x in hypo])
+        elif lang == "en":
+            truth = truth.lower()
+            hypo = hypo.lower()
+        measures = compute_measures(truth, hypo)
+        wer = measures["wer"]
+        # ref_list = truth.split(" ")
+        # subs = measures["substitutions"] / len(ref_list)
+        # dele = measures["deletions"] / len(ref_list)
+        # inse = measures["insertions"] / len(ref_list)
+        wer_results.append(
+            {
+                "wav": Path(gen_wav).stem,
+                "truth": raw_truth,
+                "hypo": raw_hypo,
+                "wer": wer,
+            }
+        )
+    return wer_results
+# SIM Evaluation
+def run_sim(args):
+    rank, test_set, ckpt_dir = args
+    device = f"cuda:{rank}"
+    model = ECAPA_TDNN_SMALL(feat_dim=1024, feat_type="wavlm_large", config_path=None)
+    state_dict = torch.load(ckpt_dir, weights_only=True, map_location=lambda storage, loc: storage)
+    model.load_state_dict(state_dict["model"], strict=False)
+    use_gpu = torch.cuda.is_available()
+    if use_gpu:
+        model = model.cuda(device)
+    model.eval()
+    sims = []
+    for wav1, wav2, truth in tqdm(test_set):
+        wav1, sr1 = torchaudio.load(wav1)
+        wav2, sr2 = torchaudio.load(wav2)
+        resample1 = torchaudio.transforms.Resample(orig_freq=sr1, new_freq=16000)
+        resample2 = torchaudio.transforms.Resample(orig_freq=sr2, new_freq=16000)
+        wav1 = resample1(wav1)
+        wav2 = resample2(wav2)
+        if use_gpu:
+            wav1 = wav1.cuda(device)
+            wav2 = wav2.cuda(device)
+        with torch.no_grad():
+            emb1 = model(wav1)
+            emb2 = model(wav2)
+        sim = F.cosine_similarity(emb1, emb2)[0].item()
+        # print(f"VSim score between two audios: {sim:.4f} (-1.0, 1.0).")
+        sims.append(sim)
+    return sims
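
Note on `get_inference_prompt` in `utils_eval.py`: when `use_truth_duration` is off, the total mel length is estimated by scaling the reference frame count by the ratio of UTF-8 byte lengths of the two texts, and the result is mapped to one of `num_buckets` length buckets. The sketch below reproduces just that arithmetic in plain Python, using the function's default parameters. The helper names (`token_bounds`, `estimate_total_mel_len`, `bucket_index`) are illustrative assumptions and do not exist in the commit.

```python
import math


def token_bounds(min_secs=3, max_secs=40, target_sample_rate=24000, hop_length=256):
    """Convert the allowed duration range from seconds to mel frames."""
    min_tokens = min_secs * target_sample_rate // hop_length
    max_tokens = max_secs * target_sample_rate // hop_length
    return min_tokens, max_tokens


def estimate_total_mel_len(ref_mel_len, prompt_text, gen_text, speed=1.0):
    """Reference frames plus frames estimated from the text byte-length ratio."""
    ref_text_len = len(prompt_text.encode("utf-8"))
    gen_text_len = len(gen_text.encode("utf-8"))
    return ref_mel_len + int(ref_mel_len / ref_text_len * gen_text_len / speed)


def bucket_index(total_mel_len, min_tokens, max_tokens, num_buckets=200):
    """Map a length in [min_tokens, max_tokens] to a bucket in [0, num_buckets - 1]."""
    return math.floor((total_mel_len - min_tokens) / (max_tokens - min_tokens + 1) * num_buckets)


min_tokens, max_tokens = token_bounds()
# A 500-frame prompt with 10 bytes of text, asked to generate 20 bytes of text.
total = estimate_total_mel_len(500, "abcdefghij", "a" * 20)
print(min_tokens, max_tokens, total, bucket_index(total, min_tokens, max_tokens))
```

The `+ 1` in the denominator keeps the longest allowed utterance in the last bucket rather than one past it.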

deployment/src/f5_tts/f5_tts_webui.py ADDED

	@@ -0,0 +1,295 @@

+"""
+F5-TTS Thai WebUI - Refactored Version
+A restructured version that is better organized and easier to maintain
+"""
+import argparse
+import sys
+import os
+import gradio as gr
+# Add the src directory to Python path for imports
+current_dir = os.path.dirname(os.path.abspath(__file__))
+src_dir = os.path.dirname(current_dir)
+if src_dir not in sys.path:
+    sys.path.insert(0, src_dir)
+from f5_tts.model_manager import ModelManager
+from f5_tts.tts_processor import TTSProcessor, SpeechToTextProcessor
+from f5_tts.multi_speech_processor import MultiSpeechProcessor
+from f5_tts.ui_components import UIComponents
+from f5_tts.config import MAX_SPEECH_TYPES
+class F5TTSWebUI:
+    """Main Web UI application for F5-TTS Thai"""
+    def __init__(self):
+        self.model_manager = ModelManager()
+        self.tts_processor = TTSProcessor(self.model_manager)
+        self.stt_processor = SpeechToTextProcessor()
+        self.multi_speech_processor = MultiSpeechProcessor(self.model_manager)
+        self.ui_components = UIComponents()
+    def create_gradio_interface(self):
+        """Create the Gradio interface"""
+        with gr.Blocks(title="F5-TTS ไทย", theme=gr.themes.Ocean()) as demo:
+            gr.Markdown("# F5-TTS ภาษาไทย")
+            gr.Markdown("สร้างคำพูดจากข้อความ ด้วย Zero-shot TTS หรือ เสียงต้นฉบับ ภาษาไทย.")
+            # Model selection section
+            model_select, model_custom, model_status, load_custom_btn = self.ui_components.create_model_selection_section()
+            # Setup model selection events
+            self._setup_model_selection_events(
+                model_select, model_custom, model_status, load_custom_btn
+            )
+            # Create tabs
+            #with gr.Tab(label="Text To Speech"):
+            #    self._create_tts_tab()
+            with gr.Tab(label="Multi Speech"):
+                self._create_multispeech_tab()
+            #with gr.Tab(label="Speech to Text"):
+            #    self._create_stt_tab()
+        return demo
+    def _setup_model_selection_events(self, model_select, model_custom, model_status, load_custom_btn):
+        """Set up events for model selection"""
+        # Model selection change event
+        model_select.change(
+            fn=self.model_manager.update_custom_model_visibility,
+            inputs=model_select,
+            outputs=model_custom
+        )
+        # Load custom model button
+        load_custom_btn.click(
+            fn=self.model_manager.load_model_by_choice,
+            inputs=[model_select, model_custom],
+            outputs=model_status
+        )
+    def _create_tts_tab(self):
+        """Create the Text To Speech tab"""
+        tts_components = self.ui_components.create_tts_tab(self.tts_processor.infer_tts)
+        # Setup TTS generation
+        tts_components['controls']['generate_btn'].click(
+            fn=self.tts_processor.infer_tts,
+            inputs=[
+                tts_components['inputs']['ref_audio'],
+                tts_components['inputs']['ref_text'],
+                tts_components['inputs']['gen_text'],
+                tts_components['inputs']['remove_silence'],
+                tts_components['inputs']['cross_fade_duration'],
+                tts_components['inputs']['nfe_step'],
+                tts_components['inputs']['speed'],
+                tts_components['inputs']['cfg_strength'],
+                tts_components['inputs']['max_chars'],
+                tts_components['inputs']['seed'],
+                tts_components['inputs']['no_ref_audio']
+            ],
+            outputs=[
+                tts_components['outputs']['output_audio'],
+                tts_components['outputs']['spectrogram'],
+                tts_components['inputs']['ref_text'],
+                tts_components['outputs']['seed_output']
+            ]
+        )
+    def _create_multispeech_tab(self):
+        """Create the Multi Speech tab"""
+        ms_components = self.ui_components.create_multispeech_tab()
+        # Setup speech type management
+        self._setup_speech_type_events(ms_components)
+        # Setup multispeech generation
+        self._setup_multispeech_generation(ms_components)
+        # Setup segment editing
+        self._setup_segment_editing(ms_components)
+    def _setup_speech_type_events(self, ms_components):
+        """Set up events for speech type management"""
+        # Add speech type button
+        ms_components['controls']['add_speech_type_btn'].click(
+            fn=self.ui_components.add_speech_type_fn,
+            outputs=ms_components['controls']['speech_type_rows']
+        )
+        # Delete speech type buttons
+        for i in range(1, len(self.ui_components.speech_type_delete_btns)):
+            if self.ui_components.speech_type_delete_btns[i] is not None:
+                self.ui_components.speech_type_delete_btns[i].click(
+                    fn=self.ui_components.delete_speech_type_fn,
+                    outputs=[
+                        self.ui_components.speech_type_rows[i],
+                        self.ui_components.speech_type_names[i],
+                        self.ui_components.speech_type_audios[i],
+                        self.ui_components.speech_type_ref_texts[i]
+                    ]
+                )
+        # Insert speech type buttons
+        for i, insert_btn in enumerate(self.ui_components.speech_type_insert_btns):
+            insert_fn = self.ui_components.make_insert_speech_type_fn(i)
+            insert_btn.click(
+                fn=insert_fn,
+                inputs=[ms_components['inputs']['gen_text'], self.ui_components.speech_type_names[i]],
+                outputs=ms_components['inputs']['gen_text']
+            )
+        # Validation for generate button
+        ms_components['inputs']['gen_text'].change(
+            fn=self.multi_speech_processor.validate_speech_types,
+            inputs=[ms_components['inputs']['gen_text']] + ms_components['inputs']['speech_type_names'],
+            outputs=ms_components['controls']['generate_btn']
+        )
+    def _setup_multispeech_generation(self, ms_components):
+        """Set up multispeech generation"""
+        # Prepare inputs for generation
+        generation_inputs = [
+            ms_components['inputs']['gen_text'],
+            ms_components['inputs']['cross_fade_duration'],
+            ms_components['inputs']['nfe_step']
+        ] + (
+            ms_components['inputs']['speech_type_names'] +
+            ms_components['inputs']['speech_type_audios'] +
+            ms_components['inputs']['speech_type_ref_texts'] +
+            [ms_components['inputs']['remove_silence']] +
+            ms_components['inputs']['segment_silence_inputs']
+        )
+        # Prepare outputs for generation
+        generation_outputs = [
+            ms_components['outputs']['audio_output'],
+            ms_components['outputs']['download_btn']
+        ] + (
+            ms_components['outputs']['segment_players'] +
+            ms_components['outputs']['segment_text_inputs'] +
+            ms_components['outputs']['segment_silence_inputs'] +
+            ms_components['outputs']['segment_regen_btns'] +
+            [ms_components['state']['segments_state'], ms_components['state']['sr_state']]
+        )
+        # Generate button click
+        ms_components['controls']['generate_btn'].click(
+            fn=self._wrap_multispeech_generation,
+            inputs=generation_inputs,
+            outputs=generation_outputs
+        )
+    def _wrap_multispeech_generation(self, gen_text, cross_fade_duration, nfe_step, *args):
+        """Wrapper for multispeech generation"""
+        speech_types_data = args[:MAX_SPEECH_TYPES * 3]
+        remove_silence = args[MAX_SPEECH_TYPES * 3]
+        silence_inputs = args[MAX_SPEECH_TYPES * 3 + 1:]
+        return self.multi_speech_processor.generate_multistyle_speech(
+            gen_text,
+            cross_fade_duration,
+            nfe_step,
+            speech_types_data,
+            remove_silence,
+            silence_inputs
+        )
+    def _setup_segment_editing(self, ms_components):
+        """Set up segment editing"""
+        # Update silence button
+        ms_components['controls']['update_silence_btn'].click(
+            fn=self.multi_speech_processor.update_silence_all,
+            inputs=ms_components['inputs']['segment_silence_inputs'] + [
+                ms_components['state']['segments_state'],
+                ms_components['state']['sr_state']
+            ],
+            outputs=ms_components['outputs']['segment_players'] +
+                   ms_components['outputs']['segment_text_inputs'] +
+                   ms_components['outputs']['segment_silence_inputs'] +
+                   ms_components['outputs']['segment_regen_btns'] + [
+                       ms_components['outputs']['audio_output'],
+                       ms_components['outputs']['download_btn'],
+                       ms_components['state']['segments_state'],
+                       ms_components['state']['sr_state']
+                   ]
+        )
+        # Regenerate segment buttons
+        for i, btn in enumerate(ms_components['outputs']['segment_regen_btns']):
+            btn.click(
+                fn=self._wrap_regenerate_segment,
+                inputs=[
+                    gr.State(i),
+                    ms_components['outputs']['segment_text_inputs'][i],
+                    ms_components['outputs']['segment_silence_inputs'][i],
+                    ms_components['state']['segments_state'],
+                    ms_components['inputs']['cross_fade_duration'],
+                    ms_components['inputs']['nfe_step']
+                ],
+                outputs=ms_components['outputs']['segment_players'] +
+                       ms_components['outputs']['segment_text_inputs'] +
+                       ms_components['outputs']['segment_silence_inputs'] +
+                       ms_components['outputs']['segment_regen_btns'] + [
+                           ms_components['outputs']['audio_output'],
+                           ms_components['outputs']['download_btn'],
+                           ms_components['state']['segments_state'],
+                           ms_components['state']['sr_state']
+                       ]
+            )
+    def _wrap_regenerate_segment(self, idx, new_text, silence_ms, segments, cross_fade_duration, nfe_step):
+        """Wrapper for regenerating a segment"""
+        return self.multi_speech_processor.regenerate_segment(
+            idx, new_text, silence_ms, segments, cross_fade_duration, nfe_step
+        )
+    def _create_stt_tab(self):
+        """Create the Speech to Text tab"""
+        stt_components = self.ui_components.create_stt_tab()
+        # Setup STT generation
+        stt_components['controls']['generate_btn_stt'].click(
+            fn=self.stt_processor.transcribe_text,
+            inputs=[
+                stt_components['inputs']['ref_audio_input'],
+                stt_components['inputs']['is_translate'],
+                stt_components['inputs']['model_wp'],
+                stt_components['inputs']['compute_type'],
+                stt_components['inputs']['target_lg'],
+                stt_components['inputs']['source_lg']
+            ],
+            outputs=stt_components['outputs']['output_ref_text']
+        )
+def main():
+    """Main function for running the application"""
+    try:
+        parser = argparse.ArgumentParser(description="F5-TTS Thai WebUI - Refactored")
+        parser.add_argument("--share", action="store_true", help="Share the app")
+        args = parser.parse_args()
+        print("กำลังเริ่มต้น F5-TTS Thai WebUI...")
+        app = F5TTSWebUI()
+        demo = app.create_gradio_interface()
+        print("WebUI พร้อมใช้งาน!")
+        demo.launch(inbrowser=True, share=args.share)
+    except Exception as e:
+        print(f"เกิดข้อผิดพลาด: {e}")
+        import traceback
+        traceback.print_exc()
+if __name__ == "__main__":
+    main()
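
Note on `_wrap_multispeech_generation` in `f5_tts_webui.py`: Gradio passes every input component as one flat positional list, so the wrapper recovers the groups by slicing at offsets derived from `MAX_SPEECH_TYPES`. The sketch below shows that slicing in isolation. The value `MAX_SPEECH_TYPES = 3`, the helper name `split_generation_args`, and the sample values are assumptions for illustration; the real constant lives in `f5_tts.config`.

```python
MAX_SPEECH_TYPES = 3  # hypothetical value; the real one comes from f5_tts.config


def split_generation_args(args, max_speech_types=MAX_SPEECH_TYPES):
    """Split the flat *args tuple into (speech type data, remove_silence, silence inputs)."""
    n = max_speech_types * 3  # names + audios + reference texts, one slot per speech type
    speech_types_data = args[:n]
    remove_silence = args[n]
    silence_inputs = args[n + 1:]
    return speech_types_data, remove_silence, silence_inputs


# The flat order mirrors how the generation inputs are concatenated in the UI wiring.
names = ("Regular", "Happy", None)
audios = ("regular.wav", "happy.wav", None)
ref_texts = ("", "", None)
flat = names + audios + ref_texts + (True,) + (0, 250)

data, remove_silence, silences = split_generation_args(flat)
print(len(data), remove_silence, silences)
```

The slicing is only correct if each of the three per-type lists contains exactly `MAX_SPEECH_TYPES` components, so the UI builder and the wrapper have to agree on that constant.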

deployment/src/f5_tts/infer/README.md ADDED

	@@ -0,0 +1,219 @@

+# Inference
+The pretrained model checkpoints are available at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN); they are also downloaded automatically when you run the inference scripts.
+**More checkpoints contributed by the community, covering additional languages, are listed in [SHARED.md](SHARED.md).**
+A single generation currently supports up to **30s**, which is the **total length** of prompt plus output audio. You can still give `infer_cli` and `infer_gradio` longer text; it is split into chunks and generated automatically. Long reference audio is **clipped to ~15s**.
+To avoid possible inference failures, make sure you have read through the following instructions.
+- Use reference audio shorter than 15s and leave some silence (e.g. 1s) at the end. Otherwise the audio may be truncated in the middle of a word, leading to suboptimal generation.
+- Uppercase letters are read out letter by letter, so write ordinary words in lowercase.
+- Add spaces (" ") or punctuation (e.g. "," or ".") to introduce explicit pauses.
+- Convert numbers to Chinese characters if you want them read in Chinese; otherwise they are read in English.
+- If the generated output is blank (pure silence), check that ffmpeg is installed (installation tutorials are widely available online).
+- Try turning off `use_ema` when using an early-stage fine-tuned checkpoint (one that has gone through only a few updates).
+## Gradio App
+Currently supported features:
+- Basic TTS with Chunk Inference
+- Multi-Style / Multi-Speaker Generation
+- Voice Chat powered by Qwen2.5-3B-Instruct
+- [Custom inference with more language support](src/f5_tts/infer/SHARED.md)
+The CLI command `f5-tts_infer-gradio` is equivalent to `python src/f5_tts/infer/infer_gradio.py`, which launches a Gradio app (web interface) for inference.
+The script loads model checkpoints from Hugging Face. You can also download the files manually and update the path passed to `load_model()` in `infer_gradio.py`. Only the TTS models are loaded at startup; the ASR model is loaded for transcription only if `ref_text` is not provided, and the LLM is loaded only if you use Voice Chat.
+More flag options:
+```bash
+# Automatically launch the interface in the default web browser
+f5-tts_infer-gradio --inbrowser
+# Set the root path of the application, if it's not served from the root ("/") of the domain
+# For example, if the application is served at "https://example.com/myapp"
+f5-tts_infer-gradio --root_path "/myapp"
+```
+It can also be used as a component in a larger application:
+```python
+import gradio as gr
+from f5_tts.infer.infer_gradio import app
+with gr.Blocks() as main_app:
+    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
+    # ... other Gradio components
+    app.render()
+main_app.launch()
+```
+## CLI Inference
+The CLI command `f5-tts_infer-cli` is equivalent to `python src/f5_tts/infer/infer_cli.py`, a command-line tool for inference.
+The script loads model checkpoints from Hugging Face. You can also download the files manually and use `--ckpt_file` to specify the model to load, or edit the path directly in `infer_cli.py`.
+To use a different vocabulary, pass your `vocab.txt` file with `--vocab_file`.
+Basic inference with flags:
+```bash
+# Leaving --ref_text "" makes the ASR model transcribe the reference audio (extra GPU memory usage)
+f5-tts_infer-cli \
+--model "F5-TTS" \
+--ref_audio "ref_audio.wav" \
+--ref_text "The content, subtitle or transcription of reference audio." \
+--gen_text "Some text you want the TTS model to generate for you."
+# Choose Vocoder
+f5-tts_infer-cli --vocoder_name bigvgan --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, eg:ckpts/F5TTS_Base_bigvgan/model_1250000.pt>
+f5-tts_infer-cli --vocoder_name vocos --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, eg:ckpts/F5TTS_Base/model_1200000.safetensors>
+# More instructions
+f5-tts_infer-cli --help
+```
+A `.toml` config file allows more flexible usage.
+```bash
+f5-tts_infer-cli -c custom.toml
+```
+For example, you can pass in variables through a `.toml` file; see `src/f5_tts/infer/examples/basic/basic.toml`:
+```toml
+# F5-TTS | E2-TTS
+model = "F5-TTS"
+ref_audio = "infer/examples/basic/basic_ref_en.wav"
+# If an empty "", transcribes the reference audio automatically.
+ref_text = "Some call me nature, others call me mother nature."
+gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
+# File with text to generate. Ignores the text above.
+gen_file = ""
+remove_silence = false
+output_dir = "tests"
+```
+You can also use a `.toml` file for multi-style generation; see `src/f5_tts/infer/examples/multi/story.toml`.
+```toml
+# F5-TTS | E2-TTS
+model = "F5-TTS"
+ref_audio = "infer/examples/multi/main.flac"
+# If an empty "", transcribes the reference audio automatically.
+ref_text = ""
+gen_text = ""
+# File with text to generate. Ignores the text above.
+gen_file = "infer/examples/multi/story.txt"
+remove_silence = true
+output_dir = "tests"
+[voices.town]
+ref_audio = "infer/examples/multi/town.flac"
+ref_text = ""
+[voices.country]
+ref_audio = "infer/examples/multi/country.flac"
+ref_text = ""
+```
+Mark the text with `[main]`, `[town]`, or `[country]` wherever you want to change voice; see `src/f5_tts/infer/examples/multi/story.txt`.
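To illustrate how such voice markers divide a text, here is a minimal sketch that splits a string into `(voice, chunk)` pairs. The helper `split_by_voice` and its regular expression are assumptions for illustration and are not taken from `infer_cli.py`; text before the first marker is assigned to `main`.

```python
import re

# Matches markers such as [main], [town], [country].
VOICE_TAG = re.compile(r"\[(\w+)\]")

def split_by_voice(text, default_voice="main"):
    """Split text into (voice, chunk) pairs at each [voice] marker."""
    segments = []
    voice = default_voice
    pos = 0
    for match in VOICE_TAG.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append((voice, chunk))
        voice = match.group(1)  # the marker applies to the text that follows it
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append((voice, tail))
    return segments

story = (
    'The guest broke out with [town] "My poor dear friend." '
    '[main] So he returned to town. [country] "Goodbye," [main] said he.'
)
segments = split_by_voice(story)
for voice, chunk in segments:
    print(f"{voice}: {chunk}")
```

Each chunk would then be synthesized with the reference audio configured for that voice, for example under `[voices.town]` in the `.toml` file.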
+## Speech Editing
+To test speech editing capabilities, use the following command:
+```bash
+python src/f5_tts/infer/speech_edit.py
+```
+## Socket Realtime Client
+To communicate with the socket server, first start it:
+```bash
+python src/f5_tts/socket_server.py
+```
+<details>
+<summary>Then create a client to communicate with it</summary>
+```bash
+# If PyAudio is not installed
+sudo apt-get install portaudio19-dev
+pip install pyaudio
+```
+``` python
+# Create the socket_client.py
+import socket
+import asyncio
+import pyaudio
+import numpy as np
+import logging
+import time
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+async def listen_to_F5TTS(text, server_ip="localhost", server_port=9998):
+    client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    await asyncio.get_event_loop().run_in_executor(None, client_socket.connect, (server_ip, int(server_port)))
+    start_time = time.time()
+    first_chunk_time = None
+    async def play_audio_stream():
+        nonlocal first_chunk_time
+        p = pyaudio.PyAudio()
+        stream = p.open(format=pyaudio.paFloat32, channels=1, rate=24000, output=True, frames_per_buffer=2048)
+        try:
+            while True:
+                data = await asyncio.get_event_loop().run_in_executor(None, client_socket.recv, 8192)
+                if not data:
+                    break
+                if data == b"END":
+                    logger.info("End of audio received.")
+                    break
+                audio_array = np.frombuffer(data, dtype=np.float32)
+                stream.write(audio_array.tobytes())
+                if first_chunk_time is None:
+                    first_chunk_time = time.time()
+        finally:
+            stream.stop_stream()
+            stream.close()
+            p.terminate()
+        logger.info(f"Total time taken: {time.time() - start_time:.4f} seconds")
+    try:
+        data_to_send = f"{text}".encode("utf-8")
+        await asyncio.get_event_loop().run_in_executor(None, client_socket.sendall, data_to_send)
+        await play_audio_stream()
+    except Exception as e:
+        logger.error(f"Error in listen_to_F5TTS: {e}")
+    finally:
+        client_socket.close()
+if __name__ == "__main__":
+    text_to_send = "As a Reader assistant, I'm familiar with new technology. which are key to its improved performance in terms of both training speed and inference efficiency. Let's break down the components"
+    asyncio.run(listen_to_F5TTS(text_to_send))
+```
+</details>
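One caveat with the client above: TCP does not preserve message boundaries, so a `recv` call may return a byte count that is not a multiple of four (one float32 sample), or may deliver the `END` marker attached to audio bytes. The following is a minimal sketch of a buffer that handles both cases. The class `ChunkAligner` is hypothetical, and it assumes the server sends `b"END"` once, after a whole number of float32 samples.

```python
SAMPLE_WIDTH = 4      # bytes per float32 sample
END_MARKER = b"END"   # in-band end-of-stream marker checked by the client above

class ChunkAligner:
    """Buffer raw socket bytes and release only whole float32 samples."""

    def __init__(self):
        self._pending = b""

    def feed(self, data):
        """Return (playable_bytes, finished) for one received chunk."""
        buf = self._pending + data
        # Caveat: audio bytes could end with b"END" by chance; an in-band
        # marker cannot rule that out.
        finished = buf.endswith(END_MARKER)
        if finished:
            buf = buf[: -len(END_MARKER)]
        usable = len(buf) - len(buf) % SAMPLE_WIDTH
        self._pending = buf[usable:]  # keep the incomplete sample for later
        return buf[:usable], finished

aligner = ChunkAligner()
out1, done1 = aligner.feed(b"\x00" * 6)          # one and a half samples
out2, done2 = aligner.feed(b"\x00" * 2 + b"EN")  # rest of sample, start of marker
out3, done3 = aligner.feed(b"D")                 # marker completes
print(len(out1), len(out2), len(out3), done3)
```

In the client loop, the bytes returned by `feed` could be passed to `np.frombuffer(..., dtype=np.float32)` in place of the raw `data`, with the loop ending once `finished` is true.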

deployment/src/f5_tts/infer/SHARED.md ADDED

	@@ -0,0 +1,164 @@

+<!-- omit in toc -->
+# Shared Model Cards
+<!-- omit in toc -->
+### **Prerequisites for use**
+- This document serves as a quick lookup table for community training/finetuning results, covering various languages.
+- The models listed here are open source and rely on voluntary contributions.
+- Use these models with respect for their creators; the convenience they offer comes from the creators' efforts.
+<!-- omit in toc -->
+### **Welcome to share here**
+- Have a pretrained/finetuned result: a model checkpoint (preferably pruned to facilitate inference, i.e. keeping only `ema_model_state_dict`) and the corresponding vocab file (for tokenization).
+- Host a public [Hugging Face model repository](https://huggingface.co/new) and upload the model-related files.
+- Make a pull request adding a model card to the current page, i.e. `src/f5_tts/infer/SHARED.md`.
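As a rough illustration of the pruning step mentioned above, the sketch below keeps only the `ema_model_state_dict` entry of a checkpoint dictionary. The helper `prune_checkpoint` and the other key names in the toy checkpoint (`optimizer_state_dict`, `step`) are assumptions for illustration; real checkpoints may contain different entries.

```python
def prune_checkpoint(checkpoint):
    """Return a copy of the checkpoint that keeps only the EMA weights."""
    key = "ema_model_state_dict"
    if key not in checkpoint:
        raise KeyError(f"checkpoint has no '{key}' entry")
    return {key: checkpoint[key]}

# Toy stand-in for a loaded checkpoint; the values are placeholders.
full_checkpoint = {
    "ema_model_state_dict": {"layer.weight": [0.1, 0.2]},
    "optimizer_state_dict": {"lr": 1e-4},  # hypothetical extra entry
    "step": 1200000,                       # hypothetical extra entry
}
pruned = prune_checkpoint(full_checkpoint)
print(sorted(pruned))
```

With PyTorch, the dictionary would typically come from `torch.load(path, map_location="cpu")` and the pruned result would be written back with `torch.save(pruned, out_path)`.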
+<!-- omit in toc -->
+### Supported Languages
+- [Multilingual](#multilingual)
+    - [F5-TTS Base @ zh \& en @ F5-TTS](#f5-tts-base--zh--en--f5-tts)
+- [English](#english)
+- [Finnish](#finnish)
+    - [F5-TTS Base @ fi @ AsmoKoskinen](#f5-tts-base--fi--asmokoskinen)
+- [French](#french)
+    - [F5-TTS Base @ fr @ RASPIAUDIO](#f5-tts-base--fr--raspiaudio)
+- [Hindi](#hindi)
+    - [F5-TTS Small @ hi @ SPRINGLab](#f5-tts-small--hi--springlab)
+- [Italian](#italian)
+    - [F5-TTS Base @ it @ alien79](#f5-tts-base--it--alien79)
+- [Japanese](#japanese)
+    - [F5-TTS Base @ ja @ Jmica](#f5-tts-base--ja--jmica)
+- [Mandarin](#mandarin)
+- [Russian](#russian)
+    - [F5-TTS Base @ ru @ HotDro4illa](#f5-tts-base--ru--hotdro4illa)
+- [Spanish](#spanish)
+    - [F5-TTS Base @ es @ jpgallegoar](#f5-tts-base--es--jpgallegoar)
+## Multilingual
+#### F5-TTS Base @ zh & en @ F5-TTS
+|Model|🤗Hugging Face|Data (Hours)|Model License|
+|:---:|:------------:|:-----------:|:-------------:|
+|F5-TTS Base|[ckpt & vocab](https://huggingface.co/SWivid/F5-TTS/tree/main/F5TTS_Base)|[Emilia 95K zh&en](https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07)|cc-by-nc-4.0|
+```bash
+Model: hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.safetensors
+Vocab: hf://SWivid/F5-TTS/F5TTS_Base/vocab.txt
+Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
+```
+*Other info, e.g. author info, GitHub repo, links to sampled results, usage instructions, tutorials (blog, video, etc.) ...*
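The `hf://` entries in the model cards combine a repository ID with a file path inside that repository. The sketch below shows one way to split them apart, assuming the layout `hf://<user>/<repo>/<path>`. The helper `split_hf_uri` is hypothetical and is not necessarily how the project itself resolves these paths.

```python
def split_hf_uri(uri):
    """Split 'hf://<user>/<repo>/<path>' into (repo_id, path_in_repo)."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {uri}")
    parts = uri[len(prefix):].split("/")
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"expected hf://<user>/<repo>/<path>, got: {uri}")
    repo_id = "/".join(parts[:2])        # e.g. "SWivid/F5-TTS"
    path_in_repo = "/".join(parts[2:])   # e.g. "F5TTS_Base/vocab.txt"
    return repo_id, path_in_repo

repo_id, filename = split_hf_uri(
    "hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.safetensors"
)
print(repo_id, filename)
```

The resulting pair can be passed to `huggingface_hub.hf_hub_download(repo_id=repo_id, filename=filename)` to fetch the file locally.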
+## English
+## Finnish
+#### F5-TTS Base @ fi @ AsmoKoskinen
+|Model|🤗Hugging Face|Data|Model License|
+|:---:|:------------:|:-----------:|:-------------:|
+|F5-TTS Base|[ckpt & vocab](https://huggingface.co/AsmoKoskinen/F5-TTS_Finnish_Model)|[Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0), [Vox Populi](https://huggingface.co/datasets/facebook/voxpopuli)|cc-by-nc-4.0|
+```bash
+Model: hf://AsmoKoskinen/F5-TTS_Finnish_Model/model_common_voice_fi_vox_populi_fi_20241206.safetensors
+Vocab: hf://AsmoKoskinen/F5-TTS_Finnish_Model/vocab.txt
+Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
+```
+## French
+#### F5-TTS Base @ fr @ RASPIAUDIO
+|Model|🤗Hugging Face|Data (Hours)|Model License|
+|:---:|:------------:|:-----------:|:-------------:|
+|F5-TTS Base|[ckpt & vocab](https://huggingface.co/RASPIAUDIO/F5-French-MixedSpeakers-reduced)|[LibriVox](https://librivox.org/)|cc-by-nc-4.0|
+```bash
+Model: hf://RASPIAUDIO/F5-French-MixedSpeakers-reduced/model_last_reduced.pt
+Vocab: hf://RASPIAUDIO/F5-French-MixedSpeakers-reduced/vocab.txt
+Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
+```
+- [Online Inference with Hugging Face Space](https://huggingface.co/spaces/RASPIAUDIO/f5-tts_french).
+- [Tutorial video to train a new language model](https://www.youtube.com/watch?v=UO4usaOojys).
+- [Discussion about this training can be found here](https://github.com/SWivid/F5-TTS/issues/434).
+## Hindi
+#### F5-TTS Small @ hi @ SPRINGLab
+|Model|🤗Hugging Face|Data (Hours)|Model License|
+|:---:|:------------:|:-----------:|:-------------:|
+|F5-TTS Small|[ckpt & vocab](https://huggingface.co/SPRINGLab/F5-Hindi-24KHz)|[IndicTTS Hi](https://huggingface.co/datasets/SPRINGLab/IndicTTS-Hindi) & [IndicVoices-R Hi](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi) |cc-by-4.0|
+```bash
+Model: hf://SPRINGLab/F5-Hindi-24KHz/model_2500000.safetensors
+Vocab: hf://SPRINGLab/F5-Hindi-24KHz/vocab.txt
+Config: {"dim": 768, "depth": 18, "heads": 12, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
+```
+- Authors: SPRING Lab, Indian Institute of Technology, Madras
+- Website: https://asr.iitm.ac.in/
+## Italian
+#### F5-TTS Base @ it @ alien79
+|Model|🤗Hugging Face|Data|Model License|
+|:---:|:------------:|:-----------:|:-------------:|
+|F5-TTS Base|[ckpt & vocab](https://huggingface.co/alien79/F5-TTS-italian)|[ylacombe/cml-tts](https://huggingface.co/datasets/ylacombe/cml-tts) |cc-by-nc-4.0|
+```bash
+Model: hf://alien79/F5-TTS-italian/model_159600.safetensors
+Vocab: hf://alien79/F5-TTS-italian/vocab.txt
+Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
+```
+- Trained by [Mithril Man](https://github.com/MithrilMan)
+- Model details on [hf project home](https://huggingface.co/alien79/F5-TTS-italian)
+- Open to collaborations to further improve the model
+## Japanese
+#### F5-TTS Base @ ja @ Jmica
+|Model|🤗Hugging Face|Data (Hours)|Model License|
+|:---:|:------------:|:-----------:|:-------------:|
+|F5-TTS Base|[ckpt & vocab](https://huggingface.co/Jmica/F5TTS/tree/main/JA_25498980)|[Emilia 1.7k JA](https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07) & [Galgame Dataset 5.4k](https://huggingface.co/datasets/OOPPEENN/Galgame_Dataset)|cc-by-nc-4.0|
+```bash
+Model: hf://Jmica/F5TTS/JA_25498980/model_25498980.pt
+Vocab: hf://Jmica/F5TTS/JA_25498980/vocab_updated.txt
+Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
+```
+## Mandarin
+## Russian
+#### F5-TTS Base @ ru @ HotDro4illa
+|Model|🤗Hugging Face|Data (Hours)|Model License|
+|:---:|:------------:|:-----------:|:-------------:|
+|F5-TTS Base|[ckpt & vocab](https://huggingface.co/hotstone228/F5-TTS-Russian)|[Common voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)|cc-by-nc-4.0|
+```bash
+Model: hf://hotstone228/F5-TTS-Russian/model_last.safetensors
+Vocab: hf://hotstone228/F5-TTS-Russian/vocab.txt
+Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
+```
+- Finetuned by [HotDro4illa](https://github.com/HotDro4illa)
+- Any improvements are welcome
+## Spanish
+#### F5-TTS Base @ es @ jpgallegoar
+|Model|🤗Hugging Face|Data (Hours)|Model License|
+|:---:|:------------:|:-----------:|:-------------:|
+|F5-TTS Base|[ckpt & vocab](https://huggingface.co/jpgallegoar/F5-Spanish)|[Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli) & Crowdsourced & TEDx, 218 hours|cc0-1.0|
+- @jpgallegoar [GitHub repo](https://github.com/jpgallegoar/Spanish-F5), Jupyter Notebook and Gradio usage for Spanish model.

deployment/src/f5_tts/infer/examples/basic/basic.toml ADDED

	@@ -0,0 +1,11 @@

+# F5-TTS | E2-TTS
+model = "F5-TTS"
+ref_audio = "infer/examples/basic/basic_ref_en.wav"
+# If an empty "", transcribes the reference audio automatically.
+ref_text = "Some call me nature, others call me mother nature."
+gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
+# File with text to generate. Ignores the text above.
+gen_file = ""
+remove_silence = false
+output_dir = "tests"
+output_file = "infer_cli_basic.wav"

deployment/src/f5_tts/infer/examples/basic/basic_ref_en.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b0e22048e72414fcc1e6b6342e47a774d748a195ed34e4a5b3fcf416707f2b71
+size 256018

deployment/src/f5_tts/infer/examples/basic/basic_ref_zh.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:96724a113240d1f82c6ded1334122f0176b96c9226ccd3c919e625bcfd2a3ede
+size 324558

deployment/src/f5_tts/infer/examples/multi/country.flac ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bb15708b4b3875e37beec46591a5d89e1a9a63fdad3b8fe4a5c8738f4f554400
+size 180321

deployment/src/f5_tts/infer/examples/multi/main.flac ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4abb1107771ce7e14926fde879b959dde6db6e572476b98684f04e45e978ab19
+size 279219

deployment/src/f5_tts/infer/examples/multi/story.toml ADDED

	@@ -0,0 +1,20 @@

+# F5-TTS | E2-TTS
+model = "F5-TTS"
+ref_audio = "infer/examples/multi/main.flac"
+# If an empty "", transcribes the reference audio automatically.
+ref_text = ""
+gen_text = ""
+# File with text to generate. Ignores the text above.
+gen_file = "infer/examples/multi/story.txt"
+remove_silence = true
+output_dir = "tests"
+output_file = "infer_cli_story.wav"
+[voices.town]
+ref_audio = "infer/examples/multi/town.flac"
+ref_text = ""
+[voices.country]
+ref_audio = "infer/examples/multi/country.flac"
+ref_text = ""

deployment/src/f5_tts/infer/examples/multi/story.txt ADDED

	@@ -0,0 +1 @@

+ A Town Mouse and a Country Mouse were acquaintances, and the Country Mouse one day invited his friend to come and see him at his home in the fields. The Town Mouse came, and they sat down to a dinner of barleycorns and roots, the latter of which had a distinctly earthy flavour. The fare was not much to the taste of the guest, and presently he broke out with [town] “My poor dear friend, you live here no better than the ants. Now, you should just see how I fare! My larder is a regular horn of plenty. You must come and stay with me, and I promise you you shall live on the fat of the land.” [main] So when he returned to town he took the Country Mouse with him, and showed him into a larder containing flour and oatmeal and figs and honey and dates. The Country Mouse had never seen anything like it, and sat down to enjoy the luxuries his friend provided: but before they had well begun, the door of the larder opened and someone came in. The two Mice scampered off and hid themselves in a narrow and exceedingly uncomfortable hole. Presently, when all was quiet, they ventured out again; but someone else came in, and off they scuttled again. This was too much for the visitor. [country] “Goodbye,” [main] said he, [country] “I’m off. You live in the lap of luxury, I can see, but you are surrounded by dangers; whereas at home I can enjoy my simple dinner of roots and corn in peace.”

deployment/src/f5_tts/infer/examples/multi/town.flac ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e7d069b8ebd5180c3b30fde5d378f0a1ddac96722d62cf43537efc3c3f3a3ce8
+size 229383

deployment/src/f5_tts/infer/examples/thai_examples/ref_gen_1.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0012dc98101ec049fb2ff8e644984a1cc92c43b31327324354ed0089e31f6847
+size 376844

deployment/src/f5_tts/infer/examples/thai_examples/ref_gen_2.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f2786aa1c286b572a046a8cc69e6fc5657e0367c69c2c02912b54803fa369f08
+size 386802

deployment/src/f5_tts/infer/examples/thai_examples/ref_gen_3.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2d4c32a6fe2db670030bfa0280f307e1fa370ac33e10ded60cc44c72da452930
+size 494282

deployment/src/f5_tts/infer/examples/thai_examples/ref_gen_4.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:da5c6c828b02e966d22c5014d523e2af0879bf5baa41ae1361361741eea5ad20
+size 195884