General Introduction
F5-TTS is a novel non-autoregressive text-to-speech (TTS) system built on flow matching with a Diffusion Transformer (DiT). It significantly improves synthesis quality and efficiency by using ConvNeXt blocks to refine the text representation, making it easier to align with speech. F5-TTS supports training on multilingual datasets and offers highly natural and expressive zero-shot generation, seamless code-switching, and efficient speed control. The project is open source and aims to promote community development.
Instead of the complex modules of traditional TTS systems, such as duration models, phoneme aligners, and text encoders, the model generates speech by padding the text to the same length as the input speech and then applying a denoising process.
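The alignment-free idea can be illustrated with a toy sketch: the input characters are padded with filler tokens up to the number of speech frames, and the model learns to denoise speech conditioned on this padded sequence. The filler token and lengths below are illustrative, not the project's actual values.

```python
def pad_text_to_speech_length(chars, num_mel_frames, filler="<F>"):
    """Pad a character sequence with filler tokens so it matches the
    number of speech frames, instead of predicting per-phoneme durations."""
    if len(chars) > num_mel_frames:
        raise ValueError("text is longer than the speech frame count")
    return chars + [filler] * (num_mel_frames - len(chars))

padded = pad_text_to_speech_length(list("hello"), 8)
print(padded)  # ['h', 'e', 'l', 'l', 'o', '<F>', '<F>', '<F>']
```

The model then only has to learn a soft alignment between the padded text positions and the speech frames, rather than being given explicit durations.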
One of the major innovations of F5-TTS is its Sway Sampling strategy, which significantly improves efficiency in the inference phase and enables real-time processing. This makes it suitable for scenarios requiring fast speech synthesis, such as voice assistants and interactive speech systems.
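Sway Sampling reshapes the flow-matching time schedule so that more of the few inference steps are spent early in the denoising trajectory. A minimal sketch of the schedule, following the formula given in the F5-TTS paper, maps a uniform step u to t = u + s·(cos(πu/2) − 1 + u), where a negative coefficient s pulls the steps toward t = 0:

```python
import math

def sway_sample(u, s=-1.0):
    """Map a uniform time step u in [0, 1] to a swayed step.
    Negative s concentrates steps near t = 0 (early denoising)."""
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)

steps = [i / 8 for i in range(9)]
swayed = [sway_sample(u) for u in steps]
# The endpoints are preserved: f(0) = 0 and f(1) = 1,
# while interior steps shift toward 0 when s < 0.
```

Because the mapping keeps the endpoints fixed and stays monotonic, it changes only where the steps are spent, not the overall trajectory.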
F5-TTS supports zero-shot voice cloning: it can generate a wide range of voices and accents without large amounts of training data, and it also provides emotion control and speed adjustment features. The system's strong multilingual support makes it particularly suitable for applications that require diverse audio content, such as audiobooks, e-learning modules, and marketing materials.
Function List
- Text-to-Speech Conversion: Convert input text into natural and smooth speech.
- Zero-shot generation: generate high-quality speech without pre-recorded samples.
- Emotion reproduction: generate speech that carries a specified emotion.
- Speed control: the user can control the speed of speech generation.
- Multi-language support: supports speech generation in multiple languages.
- Open source code: complete code and model checkpoints are provided to facilitate community use and development.
Usage Help
Installation process
conda create -n f5-tts python=3.10
conda activate f5-tts
sudo apt update
sudo apt install -y ffmpeg
pip uninstall -y torch torchvision torchaudio transformers
# Install PyTorch (with CUDA support)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install transformers
pip install transformers
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
pip install -e .
# Launch the Gradio app (web interface)
f5-tts_infer-gradio
# Specify the port/host
f5-tts_infer-gradio --port 7860 --host 0.0.0.0
# Launch with a share link
f5-tts_infer-gradio --share
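After installation, a quick sanity check can confirm that the key dependencies are visible before launching the Gradio app. The helper name below is illustrative, not part of the project:

```python
import shutil
import importlib.util

def preflight():
    """Check that the external tool and Python packages F5-TTS
    relies on are installed and importable."""
    return {
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "torch": importlib.util.find_spec("torch") is not None,
        "transformers": importlib.util.find_spec("transformers") is not None,
    }

print(preflight())  # e.g. {'ffmpeg': True, 'torch': True, 'transformers': True}
```

If any entry is False, repeat the corresponding install step above before running the web interface.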
F5-TTS One-Click Installation Command
conda create -n f5-tts python=3.10 -y && \
conda activate f5-tts && \
sudo apt update && sudo apt install -y ffmpeg && \
pip uninstall -y torch torchvision torchaudio transformers && \
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 transformers && \
git clone https://github.com/SWivid/F5-TTS.git && \
cd F5-TTS && \
pip install -e . && \
f5-tts_infer-gradio --port 7860 --host 0.0.0.0
Running F5-TTS on Google Colab
Note: an ngrok account is required to obtain an auth token, which is used to tunnel the Gradio service out of the Colab environment.
!pip install pyngrok transformers gradio
!apt-get update && apt-get install -y ffmpeg
!pip uninstall -y torch torchvision torchaudio transformers
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 transformers

# Clone and install the project
!git clone https://github.com/SWivid/F5-TTS.git F5-TTS
%cd F5-TTS
!pip install -e .
!ngrok config add-authtoken 2hKI7tLqJVdnbgM8pxM4nyYP7kQ_3vL3RWtqXQUUdwY5JE4nj

# Configure ngrok and Gradio
import threading
import time
import socket
import requests
from pyngrok import ngrok

def is_port_in_use(port):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(('localhost', port)) == 0

def wait_for_server(port, timeout=60):
    start_time = time.time()
    while time.time() - start_time < timeout:
        if is_port_in_use(port):
            try:
                response = requests.get(f'http://localhost:{port}')
                if response.status_code == 200:
                    return True
            except requests.RequestException:
                pass
        time.sleep(2)
    return False

# Ensure that ngrok is not already running
ngrok.kill()

# Start Gradio in a background thread
def run_gradio():
    import sys
    import f5_tts.infer.infer_gradio
    sys.argv = ['f5-tts_infer-gradio', '--port', '7860', '--host', '0.0.0.0']
    f5_tts.infer.infer_gradio.main()

thread = threading.Thread(target=run_gradio)
thread.daemon = True
thread.start()

# Wait for the Gradio service to start
print("Waiting for the Gradio service to start...")
if wait_for_server(7860):
    print("Gradio service started")
    # Start the ngrok tunnel
    public_url = ngrok.connect(7860)
    print("\n=== Access Information ===")
    print(f"Ngrok URL: {public_url}")
    print("==========================\n")
else:
    print("Gradio service startup timed out")

# Keep the program running
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    ngrok.kill()

!f5-tts_infer-cli \
  --model "F5-TTS" \
  --ref_audio "/content/test.MP3" \
  --ref_text "Welcome to the Chief AI Sharing Circle, Microsoft has released OmniParser, a large model-based screen parsing tool. This tool is designed to enhance user interface automation." \
  --gen_text "Welcome to the Chief AI Sharing Circle, today we will demonstrate another open source speech cloning project in detail."
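When scripting many generations, it can be convenient to assemble the f5-tts_infer-cli invocation from Python. The helper below is illustrative; the flags are the ones used in the command above:

```python
import subprocess

def build_infer_cmd(ref_audio, ref_text, gen_text, model="F5-TTS"):
    """Assemble the argument list for an f5-tts_infer-cli run."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]

cmd = build_infer_cmd("/content/test.MP3", "reference transcript", "text to speak")
# subprocess.run(cmd, check=True)  # uncomment to actually invoke the CLI
```

Building the argument list instead of a shell string avoids quoting problems when the reference or generation text contains punctuation.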
Usage Process
Training the model
- Configure acceleration settings, such as using multiple GPUs and FP16:
accelerate config
- Initiate training:
accelerate launch test_train.py
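The interactive `accelerate config` step above can also be captured in a config file for non-interactive runs. A minimal sketch for two GPUs with FP16, using standard Accelerate config keys (the values are illustrative):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_processes: 2
mixed_precision: fp16
```

Pass it with `accelerate launch --config_file <path> test_train.py` to reproduce the same setup across machines.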
Inference
- Download pre-trained model checkpoints.
- Single-sample inference:
- Modify the configuration file to meet requirements, such as fixed duration and step size:
python test_infer_single.py
- Batch inference:
- Prepare the test dataset and update the path:
bash test_infer_batch.sh
Detailed Operation Procedure
- Text-to-speech conversion:
- After entering text, the system automatically converts it to speech; users can choose different speech styles and emotions.
- Zero-shot generation:
- Users do not need to provide any pre-recorded samples; the system generates high-quality speech based on the input text alone.
- Emotion reproduction:
- Users can select different emotion labels, and the system will generate speech with the corresponding emotion.
- Speed control:
- Users can control the speed of speech generation by adjusting a parameter to meet the needs of different scenarios.
- Multi-language support:
- The system supports speech generation in multiple languages, and users can choose different languages as needed.
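Because there is no duration model, the length of the generated clip is estimated from the reference audio, and the speed parameter scales that estimate. The sketch below shows this kind of heuristic; the function name and exact formula are illustrative and may differ from the repository's implementation:

```python
def estimate_duration_frames(ref_frames, ref_text_len, gen_text_len, speed=1.0):
    """Estimate total output frames: the reference frames plus a share
    proportional to the generated text length, divided by speed."""
    return ref_frames + int(ref_frames / ref_text_len * gen_text_len / speed)

normal = estimate_duration_frames(500, 40, 40, speed=1.0)  # 1000 frames
fast = estimate_duration_frames(500, 40, 40, speed=2.0)    # 750 frames
```

Raising the speed shrinks only the generated portion, so the reference segment's timing is preserved while the new speech is spoken faster.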