Letting the Text Speak for Itself: What's the best way to choose from the dizzying array of TTS tools?

AI News, published 4 months ago by AI Sharing Circle

Today we're going to talk about a technology that keeps getting hotter: text-to-speech, or TTS. Simply put, it lets a computer read text aloud and makes it sound like human speech. Over the past few years TTS has become ubiquitous, from short-video dubbing and audiobook production to intelligent customer service and virtual assistants. The tools on the market are just as varied: free and paid, open source and closed source, simple online services and development libraries that demand some technical skill. Faced with so many choices, how should an ordinary user or developer pick the one that suits them?

Don't worry, today we'll take a look at some of the most popular and representative TTS tools on the market, and see what they are capable of and where they can be used.

Easy-to-use, out-of-the-box tools

For users who don't want to fiddle with code and just want to generate speech quickly, online services and packaged tools are the first choice.

TTS Maker: A free online tool whose output can be used commercially. Its strength is coverage: it supports more than 50 languages, including Chinese varieties such as Northeastern Mandarin and Cantonese, and offers a wide selection of voices. It's a good starting point for creators producing short videos, podcasts, or multilingual content. However, the underlying model isn't publicly documented, and its voices may be less natural and less controllable than those of more specialized tools.
Edge-TTS: This isn't a standalone service; it is a Python library that wraps the TTS feature built into Microsoft's Edge browser. It's easy to call, free, and the voice quality is quite good, since Microsoft's technology sits behind it. It also supports many languages and voices (40+ languages, 300+ voices). The downside is that it depends on Microsoft's interface, so stability and long-term availability are not guaranteed, and customization is limited. It suits lightweight applications and personal projects that need quick integration.
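
To give a feel for how little code Edge-TTS needs, here is a minimal sketch. It assumes the third-party edge-tts package is installed (pip install edge-tts) and that Microsoft's endpoint is reachable. The function name synthesize_to_file, the example voice en-US-AriaNeural, and the output file name are illustrative choices, not something this article prescribes.

```python
import asyncio


async def synthesize_to_file(text: str, voice: str, out_path: str) -> None:
    """Synthesize `text` with the given Edge voice and save it as an MP3."""
    # Imported lazily so this module still loads where edge-tts is missing
    # (third-party package: pip install edge-tts).
    import edge_tts

    communicate = edge_tts.Communicate(text, voice)
    # Requires network access to Microsoft's speech endpoint.
    await communicate.save(out_path)


def main() -> None:
    # "en-US-AriaNeural" is only an example voice name; pick one that suits you.
    asyncio.run(
        synthesize_to_file("Hello from Edge-TTS.", "en-US-AriaNeural", "hello.mp3")
    )
```

Calling main() should write hello.mp3 to the current directory. The package also installs an edge-tts command-line tool, and running edge-tts --list-voices prints the voice names you can pass in.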

Commercial giants and professional-grade services

If you need top-notch voice quality, high stability, and rich customization options, a commercial service is usually the better choice, though that usually means paying for it.

Microsoft Azure TTS: One of the recognized benchmarks in the industry. Its neural-network-based synthesis is natural and fluent enough to almost pass for a real person. It offers rich emotion control, and SSML (Speech Synthesis Markup Language) lets you fine-tune pronunciation, speech rate, pauses, and more. With support for over 140 languages and 400 voices, it's a first choice for enterprise applications, high-quality audiobooks, and professional virtual assistants. The pricing, of course, is enterprise-grade too.
Seed-TTS: A technology developed by ByteDance; at the time of writing, its core model does not appear to be fully open source. According to the technical report, it can edit speech content and finely control the speaking rate, which sounds well suited to audiobook post-production and other scenarios that require detailed editing of speech. It is currently oriented mainly toward Chinese, and the product form and how to obtain it depend on future official releases.
Voice Engine (from OpenAI?): The original article mentions this name, but note that OpenAI does not appear to have released a generally available product under it. "Voice Engine" may refer to third-party implementations built on OpenAI technology (e.g., GPT) or to an internal research project. OpenAI is strong in speech technology, and its voice cloning and generation capabilities have drawn a lot of attention, but there is not yet a public product explicitly called Voice Engine that everyone can use directly, so check the source of any information before relying on it.
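
To make the SSML idea from the Azure TTS entry more concrete, the sketch below builds a small SSML document that slows the speech down and adds a pause at the end. The element names (speak, voice, prosody, break) come from the SSML standard; the function name build_ssml, the example voice en-US-JennyNeural, and the default values are assumptions for illustration. Sending the document to the service through its SDK or REST API is not shown here.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(
    text: str,
    voice: str = "en-US-JennyNeural",  # example voice name, swap in your own
    lang: str = "en-US",
    rate: str = "-10%",                # relative speaking rate
    pause_ms: int = 300,               # silence appended after the text
) -> str:
    """Return an SSML document that adjusts speaking rate and adds a pause."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>"
        # escape() keeps characters such as & and < from breaking the XML.
        f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"
        f'<break time="{int(pause_ms)}ms"/>'
        "</voice></speak>"
    )
```

Keeping the markup in one small builder makes it easy to experiment with rate and pause values before paying for any synthesis calls.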

The Power of Open Source Communities: Freedom and Customization

For developers and researchers, open source TTS tools offer more freedom and room for customization. You can dig into how the models work, then modify and train them to fit your needs.

PaddleSpeech: An open source project from Baidu's PaddlePaddle, specially optimized for Chinese. One highlight is streaming synthesis: audio is generated while it is being played, so latency is low. That makes it a good fit for scenarios that need real-time responses, such as live voice announcements and intelligent customer service. It is based on mainstream models such as FastSpeech2 and HiFiGAN.
Coqui TTS: A popular open source project that grew out of Mozilla TTS. Its biggest advantage is a large library of pre-trained models covering more than 1,100 languages. Most of that coverage appears to come from the Fairseq models it can load rather than from its own XTTS model, which supports a much smaller set of languages. This is very valuable for applications that deal with many languages, especially low-resource ones. The community is active and the project is well documented.
Bark: Developed by Suno AI (known for music generation). What sets it apart is that it generates not only speech but also non-speech sounds such as music clips, background noise, laughter, and crying, and it supports mixing multiple languages. This gives it unique potential in creative audio projects, game sound design, and similar areas. It is based on the Transformer architecture.
TensorFlowTTS: As the name suggests, a TensorFlow-based TTS toolkit. It supports a variety of popular TTS models such as Tacotron 2 and FastSpeech2, used together with vocoders such as MelGAN. It's a great option for developers familiar with the TensorFlow ecosystem and for academic researchers, since it makes model experimentation and custom development easier.
Fish Speech: This project focuses on mixed multilingual generation, such as switching naturally between Chinese, English, and Japanese within a single sentence. It supports newer model architectures such as VITS2 and Bert-VITS2. It is attractive for scenarios such as multilingual podcasts and film and TV dubbing.
ChatTTS: An open source model optimized specifically for conversational scenarios. It excels at English and Chinese dialogue, generating speech with natural expressive features (e.g., laughter, hesitation, and pauses), which makes synthesized conversations sound more realistic and interactive. It is claimed to have been pre-trained on 40,000 hours of data.
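
The streaming synthesis mentioned in the PaddleSpeech entry boils down to a simple idea: split the text into small pieces and hand over each piece of audio as soon as it is ready, instead of waiting for the whole text. The sketch below illustrates that idea only; it is not PaddleSpeech's API. The synthesize argument is a hypothetical stand-in for whatever engine you use, and the punctuation-based splitter is deliberately naive (it would also split a number like 3.14).

```python
import re
from typing import Callable, Iterator, List

# Sentence-ending punctuation, Western and Chinese.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the closing punctuation."""
    pieces = (m.strip() for m in _SENTENCE.findall(text))
    return [p for p in pieces if p]


def stream_speech(
    text: str, synthesize: Callable[[str], bytes]
) -> Iterator[bytes]:
    """Yield audio sentence by sentence so playback can start early.

    `synthesize` is a placeholder: any function that turns one sentence
    into audio bytes.
    """
    for sentence in split_sentences(text):
        # Only the current sentence is synthesized before yielding.
        yield synthesize(sentence)
```

Because stream_speech is a generator, the caller can start playing the first chunk while later sentences have not been synthesized yet, which is where the low latency comes from.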

Voice Cloning: A Voice of Your Own

Voice cloning technology lets you take a small sample of someone's voice (sometimes just a few seconds or a minute) and synthesize speech that sounds like that person. The technology is interesting, but it also carries ethical risks, so be sure to comply with laws, regulations, and ethical norms when using it.

GPT-SoVITS: An open source project that combines SoVITS (a popular model for singing synthesis and voice conversion) with GPT techniques. It claims that 1 minute of voice data is enough for a good cloning result, and it has some support for Chinese dialects. It is currently one of the most popular voice cloning solutions in the community.
OpenVoice: Open-sourced by MyShell.ai. It not only clones voices but also provides fine control over the cloned voice, such as adjusting emotion, accent, and intonation. This makes it useful in scenarios that call for personalization and expressiveness, such as advertising voiceovers and multilingual virtual assistants.
Real-Time-Voice-Cloning: This project provides a graphical user interface (GUI) that makes it relatively easy for non-technical users to experiment with voice cloning. It is based on the SV2TTS model, so its results may not match the latest models, but it wins on ease of use.
F5-TTS: This project applies Diffusion Transformer (DiT) technology to zero-shot voice cloning, meaning it can imitate a voice to some degree without being trained or fine-tuned on the target speaker's data; a short reference clip is typically all it needs. It also supports emotion control. This is a relatively new research direction.
Mocking Bird: An earlier open source voice cloning project that attracted a lot of attention. Although newer solutions may have surpassed it technically, it is still a useful reference for understanding how voice cloning technology has developed.
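
Since these tools differ in how much reference audio they expect (from a few seconds up to the 1 minute claimed for GPT-SoVITS), it can help to check a clip's length before feeding it to a cloning tool. The sketch below uses only Python's standard wave module and works on uncompressed WAV files. The helper names and the 60-second default are assumptions for illustration, not a requirement of any specific project.

```python
import wave


def clip_duration_seconds(wav_file) -> float:
    """Return the duration of a WAV file (a path or a binary file object)."""
    with wave.open(wav_file, "rb") as wav:
        # Duration = number of audio frames / frames per second.
        return wav.getnframes() / wav.getframerate()


def is_long_enough(wav_file, min_seconds: float = 60.0) -> bool:
    """Check whether a reference clip meets a minimum length.

    The 60-second default mirrors the "1 minute" figure quoted above;
    adjust it to whatever the tool you use actually recommends.
    """
    return clip_duration_seconds(wav_file) >= min_seconds
```

A check like this only measures length. It says nothing about recording quality or, more importantly, about whether you have the speaker's consent.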

Other tools of interest

A number of other tools each have their own specialty:

VoiceVox: Mainly for Japanese, and especially good at generating anime-style voices. It's very popular among Japanese VTubers and creators.
EmotiVoice: Open-sourced by NetEase Youdao, it focuses on emotional speech synthesis and can generate speech with a variety of emotions such as happiness, anger, and sadness.
MetaVoice-1B: An open source model with 1B (one billion) parameters. A model of this size usually implies greater expressive power and higher speech quality, but it also requires more computational resources.
So-VITS-SVC: Mainly used for singing voice conversion (SVC), which lets a song be sung in another person's voice. It is an important branch of SoVITS technology.
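
To put the "more computational resources" from the MetaVoice-1B entry into rough numbers, a back-of-the-envelope estimate multiplies the parameter count by the bytes used per parameter. The sketch below covers the weights only and ignores activations, caches, and framework overhead, so real memory use will be higher. The helper name is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Estimate memory for model weights alone, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


# One billion parameters at 16-bit (2 bytes) and 32-bit (4 bytes) precision.
fp16_gib = weight_memory_gib(1e9, 2)
fp32_gib = weight_memory_gib(1e9, 4)
```

For 1B parameters this works out to roughly 1.9 GiB at 16-bit precision and 3.7 GiB at 32-bit, before anything else is counted.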

How to choose?

After reading all this, you may be even more confused. Don't worry, here's a simple way to choose:

Ordinary users who need a voice quickly: try TTS Maker, or look for an online tool built on Edge-TTS.
Developers integrating TTS into an application:
- Want high quality and stability, and have the budget: Microsoft Azure TTS.
- Need Chinese optimization and low latency: PaddleSpeech.
- Need the widest language coverage: Coqui TTS.
- Want to experiment with speech plus sound effects: Bark.
- Familiar with the TensorFlow ecosystem: TensorFlowTTS.
- Need to handle mixed-language text: Fish Speech.
- Focused on conversational scenarios: ChatTTS.
Want to try voice cloning:
- Want good results and an active community: GPT-SoVITS or OpenVoice.
- Need a GUI and a simple way to experiment: Real-Time-Voice-Cloning.
- Interested in cutting-edge techniques: F5-TTS.
Special requirements:
- Japanese anime-style voices: VoiceVox.
- Rich emotional expression: EmotiVoice.
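
If you want to reuse the guide above in a script or a chatbot, it can be written down as a simple lookup table. The sketch below is one way to do that; the short "need" labels are made-up keys for illustration, and the tool lists simply mirror the recommendations above.

```python
# Need labels are hypothetical keys; the tools mirror the guide above.
RECOMMENDATIONS = {
    "quick voiceover": ["TTS Maker", "Edge-TTS"],
    "high quality with budget": ["Microsoft Azure TTS"],
    "chinese low latency": ["PaddleSpeech"],
    "widest language coverage": ["Coqui TTS"],
    "speech plus sound effects": ["Bark"],
    "tensorflow ecosystem": ["TensorFlowTTS"],
    "mixed languages": ["Fish Speech"],
    "conversation": ["ChatTTS"],
    "voice cloning": ["GPT-SoVITS", "OpenVoice"],
    "voice cloning with gui": ["Real-Time-Voice-Cloning"],
    "cutting-edge cloning": ["F5-TTS"],
    "japanese anime voices": ["VoiceVox"],
    "emotional speech": ["EmotiVoice"],
}


def recommend(need: str) -> list:
    """Return the suggested tools for a need, or an empty list if unknown."""
    return RECOMMENDATIONS.get(need.strip().lower(), [])
```

For example, recommend("Voice cloning") returns the two cloning suggestions, while an unrecognized need returns an empty list rather than raising an error.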

It is important to note that open source projects usually require some technical background to deploy and use, and the results may need to be debugged and optimized on your own. Commercial services, on the other hand, provide a more stable and easy-to-use interface, but incur costs.

TTS technology is evolving very quickly, with new models and tools coming out all the time. Which tool you choose ultimately depends on your specific needs, budget and technical skills. Hopefully, this compendium will help you find the right starting point in this rapidly evolving field.

Summary of related resource links

To make them easier to find, here are the official sites or code repositories for some of the tools mentioned in the article:

TTS Maker: https://ttsmaker.com/zh-cn (online service)
Microsoft Azure TTS: https://azure.microsoft.com/en-us/products/cognitive-services/text-to-speech/ (commercial service)
PaddleSpeech: https://github.com/PaddlePaddle/PaddleSpeech (open source)
VoiceVox: https://github.com/VOICEVOX/voicevox (open source)
TensorFlowTTS: https://github.com/TensorSpeech/TensorFlowTTS (open source)
Edge-TTS: https://github.com/rany2/edge-tts (open source library)
ChatTTS: https://github.com/2noise/ChatTTS (open source)
Seed-TTS: technical report, paper, and evaluation tools (core model not fully open source)
Fish Speech: https://github.com/fishaudio/fish-speech (open source)
GPT-SoVITS: https://github.com/RVC-Boss/GPT-SoVITS (open source)
OpenVoice: https://github.com/myshell-ai/OpenVoice (open source)
Bark: https://github.com/suno-ai/bark (open source)
Coqui TTS: https://github.com/coqui-ai/tts (open source)
Real-Time-Voice-Cloning: https://github.com/CorentinJ/Real-Time-Voice-Cloning (open source)
F5-TTS: https://github.com/SWivid/F5-TTS (open source)
EmotiVoice: https://github.com/netease-youdao/EmotiVoice (open source)
MetaVoice-1B: https://github.com/metavoiceio/metavoice-src (open source)
So-VITS-SVC: https://github.com/svc-develop-team/so-vits-svc (open source)
Mocking Bird: https://github.com/babysor/MockingBird (open source)
FUNAudioLLM-CosyVoice: https://github.com/FunAudioLLM/CosyVoice (open source)
VoiceCraft: https://github.com/jasonppy/VoiceCraft (open source)
Parler-TTS: https://github.com/huggingface/parler-tts (open source)
MaskGCT: https://hf-mirror.com/amphion/MaskGCT (open source, mirror)