F5-TTS: Zero-shot voice cloning that generates smooth, emotionally rich speech

General Introduction

F5-TTS is a non-autoregressive text-to-speech (TTS) system built on flow matching with a Diffusion Transformer (DiT). It uses ConvNeXt blocks to refine the text representation so that it aligns more easily with speech, which improves both synthesis quality and efficiency. F5-TTS supports training on multilingual datasets and offers natural, expressive zero-shot generation, seamless code-switching, and speed control. The project is open source and aims to promote community development.

F5-TTS does away with the complex modules of traditional TTS systems, such as duration models, phoneme alignment, and dedicated text encoders. Instead, it pads the text to the same length as the input speech and generates speech by denoising.
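
To make the padding idea concrete, here is a minimal sketch. It assumes the text is a sequence of character tokens and that padding uses a dedicated filler token; the function name `pad_text_to_frames` and the `<F>` symbol are hypothetical and are not taken from the F5-TTS code base.

```python
def pad_text_to_frames(tokens, num_frames, filler="<F>"):
    """Pad a token sequence with filler tokens until it has num_frames entries.

    Hypothetical helper for illustration only: num_frames stands for the
    number of speech frames the text must be aligned with.
    """
    if len(tokens) > num_frames:
        raise ValueError("text is longer than the speech frame sequence")
    return list(tokens) + [filler] * (num_frames - len(tokens))


# Five characters padded to match eight speech frames.
padded = pad_text_to_frames(list("hello"), num_frames=8)
print(padded)
```

Because text and speech then share one length, the model can process them together without a separate duration or alignment module.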

One of the major innovations of F5-TTS is its Sway Sampling strategy, which significantly improves efficiency in the inference phase and makes real-time processing feasible. This makes it suitable for scenarios that require fast speech synthesis, such as voice assistants and interactive speech systems.
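
As an illustration, the sketch below assumes the sway function has the form t = u + s * (cos(pi/2 * u) - 1 + u), where u is a uniformly spaced step in [0, 1] and s is a coefficient. Both the formula and the default `s = -1.0` are assumptions here and should be checked against the paper and the project code; the function names are hypothetical.

```python
import math


def sway(u, s=-1.0):
    """Map a uniform step u in [0, 1] to a shifted flow step (assumed form)."""
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def sway_schedule(num_steps, s=-1.0):
    """Return num_steps + 1 time points from 0 to 1 after applying sway."""
    return [sway(i / num_steps, s) for i in range(num_steps + 1)]


schedule = sway_schedule(8)
for i, t in enumerate(schedule):
    print(f"step {i}: t = {t:.4f}")
```

With a negative coefficient, the endpoints stay at 0 and 1 while the intermediate points move toward 0, so more of the step budget is spent on the early part of the flow.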


F5-TTS supports zero-shot voice cloning: it can generate a wide range of voices and accents without large amounts of training data, and it also provides emotion control and speed adjustment features. The system's strong multilingual support makes it particularly suitable for applications that require diverse audio content, such as audiobooks, e-learning modules, and marketing materials.

 


 

Function List

  • Text-to-speech conversion: converts input text into natural, fluent speech.
  • Zero-shot generation: clones a voice from a short reference clip, without training on that speaker.
  • Emotional expression: supports generating speech with emotion.
  • Speed control: lets the user adjust the speed of the generated speech.
  • Multilingual support: generates speech in multiple languages.
  • Open source: complete code and model checkpoints are provided for community use and development.

 

Usage Guide

Installation process

conda create -n f5-tts python=3.10
conda activate f5-tts

sudo apt update
sudo apt install -y ffmpeg

# Remove any preinstalled versions to avoid conflicts
pip uninstall -y torch torchvision torchaudio transformers

# Install PyTorch (with CUDA support)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# install transformers
pip install transformers

git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
pip install -e .

# Launch a Gradio app (web interface)
f5-tts_infer-gradio

# Specify the port/host
f5-tts_infer-gradio --port 7860 --host 0.0.0.0

# Launch a share link
f5-tts_infer-gradio --share
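
If the app fails to start, a quick check of the environment can narrow down the cause. The sketch below is only an illustration: the helper name `check_environment` is hypothetical, and it merely tests whether the packages installed above can be found and whether `ffmpeg` is on the PATH.

```python
import importlib.util
import shutil


def check_environment(modules=("torch", "torchaudio", "transformers", "f5_tts"),
                      binaries=("ffmpeg",)):
    """Report which Python modules and command-line tools can be located."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on the PATH.
        report[name] = shutil.which(name) is not None
    return report


for item, found in check_environment().items():
    print(f"{item}: {'found' if found else 'MISSING'}")
```

Run it inside the activated `f5-tts` environment so that it inspects the same interpreter the app will use.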

F5-TTS One-Click Installation Command

conda create -n f5-tts python=3.10 -y && \
conda activate f5-tts && \
sudo apt update && sudo apt install -y ffmpeg && \
pip uninstall -y torch torchvision torchaudio transformers && \
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 && \
pip install transformers && \
git clone https://github.com/SWivid/F5-TTS.git && \
cd F5-TTS && \
pip install -e . && \
f5-tts_infer-gradio --port 7860 --host 0.0.0.0

 

Running F5-TTS on Google Colab

Note: You need to register with ngrok and obtain an authtoken so that the service running in Colab can be reached through a public tunnel.

 

!pip install pyngrok transformers gradio

# Import the required libraries
import os
from pyngrok import ngrok

!apt-get update && apt-get install -y ffmpeg
!pip uninstall -y torch torchvision torchaudio transformers
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install transformers

# Clone and install the project
!git clone https://github.com/SWivid/F5-TTS.git
%cd F5-TTS
!pip install -e .

# Replace the placeholder with your own ngrok authtoken
!ngrok config add-authtoken YOUR_NGROK_AUTHTOKEN

# Configuring ngrok and gradio
import gradio as gr
from pyngrok import ngrok
import threading
import time
import socket
import requests

def is_port_in_use(port):
    # A successful TCP connection means something is listening on the port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(('localhost', port)) == 0

def wait_for_server(port, timeout=60):
    # Poll until the server answers with HTTP 200 or the timeout expires
    start_time = time.time()
    while time.time() - start_time < timeout:
        if is_port_in_use(port):
            try:
                response = requests.get(f'http://localhost:{port}')
                if response.status_code == 200:
                    return True
            except requests.RequestException:
                pass
        time.sleep(2)
    return False

# Ensure that ngrok is not running
ngrok.kill()

# Start Gradio in a new thread
def run_gradio():
    import sys
    import f5_tts.infer.infer_gradio
    sys.argv = ['f5-tts_infer-gradio', '--port', '7860', '--host', '0.0.0.0']
    f5_tts.infer.infer_gradio.main()

thread = threading.Thread(target=run_gradio)
thread.daemon = True
thread.start()

# Waiting for the Gradio service to start
print("Waiting for the Gradio service to start...")
if wait_for_server(7860):
    print("Gradio service started")
    # Start ngrok
    public_url = ngrok.connect(7860)
    print(f"\n=== Access Information ===")
    print(f"Ngrok URL: {public_url}")
    print("===============\n")
else:
    print("Gradio service startup timeout")

# Keep the program running
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    ngrok.kill()

!f5-tts_infer-cli \
--model "F5-TTS" \
--ref_audio "/content/test.MP3" \
--ref_text "Welcome to the Chief AI Sharing Circle, Microsoft has released OmniParser, a large model-based screen parsing tool. This tool is designed to enhance user interface automation." \
--gen_text "Welcome to the Chief AI Sharing Circle, today we will demonstrate another open source speech cloning project in detail."
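
The same command can also be assembled from Python, which helps when generating several clips in a loop. This is a minimal sketch: it uses only the four flags shown above, the helper name `build_infer_command` is hypothetical, and the texts are shortened placeholders.

```python
import shlex


def build_infer_command(ref_audio, ref_text, gen_text, model="F5-TTS"):
    """Build the argument list for f5-tts_infer-cli from the flags used above."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]


cmd = build_infer_command(
    ref_audio="/content/test.MP3",
    ref_text="Transcript of the reference clip.",
    gen_text="Text to speak in the cloned voice.",
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems when the reference or target text contains quotes or spaces.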

 

Usage Process

Training the model

  1. Configure acceleration settings, such as using multiple GPUs and FP16:
    accelerate config
    
  2. Initiate training:
    accelerate launch test_train.py
    

Inference

  1. Download pre-trained model checkpoints.
  2. Single inference:
    • Modify the configuration file as needed, for example to set a fixed duration and the number of steps:
      python test_infer_single.py
      
  3. Batch inference:
    • Prepare the test dataset and update the path:
      bash test_infer_batch.sh
      

Detailed Operation Procedure

  1. Text-to-speech conversion:
    • After text is entered, the system automatically converts it to speech, and users can choose different speech styles and emotions.
  2. Zero-shot generation:
    • The user does not need to train the model on the target speaker; given a short reference clip and its transcript, the system generates high-quality speech for the input text.
  3. Emotional expression:
    • Users can select different emotion labels, and the system generates a voice with the corresponding emotion.
  4. Speed control:
    • Users can control the speed of the generated speech by adjusting parameters to suit different scenarios.
  5. Multilingual support:
    • The system supports speech generation in multiple languages, and users can choose the language they need.
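
To illustrate how a speed parameter can act on generation, the sketch below assumes that the output length is estimated from the reference clip's frames-per-character ratio and then divided by a speed factor. This is an assumption made for illustration; `estimate_total_frames` and its parameters are hypothetical names, not the project's API.

```python
def estimate_total_frames(ref_frames, ref_text_len, gen_text_len, speed=1.0):
    """Estimate the frame count for the reference clip plus the new speech.

    Assumed rule: new speech takes as many frames per character as the
    reference clip, scaled by 1 / speed.
    """
    if ref_text_len <= 0 or speed <= 0:
        raise ValueError("ref_text_len and speed must be positive")
    frames_per_char = ref_frames / ref_text_len
    return ref_frames + int(frames_per_char * gen_text_len / speed)


normal = estimate_total_frames(ref_frames=500, ref_text_len=50, gen_text_len=100)
fast = estimate_total_frames(ref_frames=500, ref_text_len=50, gen_text_len=100, speed=2.0)
print(normal, fast)
```

Under this assumption, doubling the speed halves the frames allotted to the new text while the reference portion stays unchanged.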

 
