Amazon releases BASE TTS, the largest text-to-speech AI model to date, showing "emergent abilities"

AI News | 2 yrs ago | Published by AI Sharing Circle

Amazon researchers have trained the largest text-to-speech model to date, which they say exhibits "emergent" qualities that let it speak even complex sentences more naturally. The breakthrough may be what the technology needs to shed its unnatural sound.

Such models were always going to grow and improve, but the researchers were specifically hoping for the kind of leap in ability seen once language models pass a certain size. For reasons that are still not understood, once large language models (LLMs) grow beyond a certain point, they become far more robust and versatile, able to perform tasks they were not trained for.

This does not mean the models are gaining self-awareness or anything of the kind, only that past a certain point their performance on certain conversational AI tasks rises sharply. The team at Amazon AGI (Artificial General Intelligence; their goal is no secret) thought the same might happen as text-to-speech models grew, and their research suggests that it does.

The new model is called Big Adaptive Streamable TTS with Emergent abilities, or BASE TTS for short. The largest version of the model was trained on 100,000 hours of public domain speech, 90% of which is in English, with the remainder in German, Dutch, and Spanish.

With 980 million parameters, BASE-large is the largest model in the field. For comparison, the team also trained models with 400 million and 150 million parameters on 10,000 and 1,000 hours of audio, respectively. The idea is that if one model exhibits emergent behaviors and another does not, the range in which those behaviors begin to emerge can be narrowed down.

The results showed that the medium-sized model delivered the leap in capability the team was hoping for, not so much in ordinary speech quality (its scores were higher, but only slightly) as in the set of emergent abilities the team observed and measured. Here are some examples of tricky text mentioned in the paper:

Compound nouns: The Beckhams decided to rent a charming traditional stone country vacation home.
Emotions: "Oh my God! Are we really going to the Maldives? It's unbelievable!" Jenny squealed, jumping up and down with excitement.
Foreign words: "Mr. Henry, known for his masterful kitchen preparations, orchestrated a seven-course feast, each of which was a rare delicacy."
Paralinguistics (i.e., readable non-words): "Hush, Lucy, be quiet, we mustn't wake your brother," whispered Tom, as they walked carefully past the nursery.
Punctuation: She received a weird text message from her brother: 'Emergency at home; please call as soon as possible! Mom and Dad are worried... #FamilyFirst.'
Questions: But questions about Britain's exit from the European Union still hang in the air: after all the trials and tribulations, will ministers find answers in time?
Syntactic complexity: De Moya, who recently received a Lifetime Achievement Award, starred in a movie in 2022 that was a box office success despite receiving mixed reviews.

"These sentences are designed to contain challenging tasks: parsing sentences with complex structure, placing phrasal stress on long compound nouns, producing emotional or whispered speech, or correctly pronouncing foreign words such as 'qi' or punctuation such as '@', none of which BASE TTS is explicitly trained to perform," the authors write.

Such features typically trip up text-to-speech engines, which may mispronounce or skip words, use odd intonation, or make other errors. BASE TTS still has trouble with them, but it does far better than contemporaries such as Tortoise and VALL-E.

The site the team built for the model offers many examples of these difficult texts being read aloud naturally and fluently. Of course, the examples were chosen by the researchers, so they are necessarily cherry-picked, but they are impressive all the same.

Because the three BASE TTS models share the same architecture, it seems clear that model size and the amount of training data are what allow the larger models to handle the complexity described above. Keep in mind that this is still an experimental model and process, not a commercial model or product. Follow-up research will need to identify the inflection point at which emergent abilities appear and how to train and deploy the resulting model efficiently.

Notably, the model is "streamable," as the name suggests: it does not need to generate a whole sentence at once, but can produce audio incrementally at a relatively low bitrate. The team has also tried packaging speech metadata such as emotion and prosody into a separate, low-bandwidth stream that could accompany the plain audio.
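
To make the streaming idea concrete, here is a minimal Python sketch of incremental generation: audio is handed to the listener in small chunks as soon as enough speech tokens exist, rather than after the whole sentence is finished. This is not BASE TTS code. The `fake_decode` function is a hypothetical stand-in for a real speech decoder (it only emits silence), and the sample rate, token duration, and chunk size are assumptions chosen for illustration, not figures from the paper.

```python
from typing import Iterable, Iterator, List

SAMPLE_RATE = 16_000      # assumed sample rate, not taken from the paper
SAMPLES_PER_TOKEN = 320   # assumed: each speech token covers 20 ms of audio


def fake_decode(tokens: List[int]) -> bytes:
    """Hypothetical stand-in for a speech decoder: returns 16-bit silence."""
    return b"\x00\x00" * (SAMPLES_PER_TOKEN * len(tokens))


def stream_audio(tokens: Iterable[int], chunk_tokens: int = 5) -> Iterator[bytes]:
    """Yield audio as soon as `chunk_tokens` speech tokens are available."""
    buffer: List[int] = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) == chunk_tokens:
            yield fake_decode(buffer)  # playback can start before the end
            buffer = []
    if buffer:  # flush the final, shorter chunk
        yield fake_decode(buffer)


# Twelve placeholder tokens arrive one by one; audio comes out in three pieces.
chunks = list(stream_audio(range(12), chunk_tokens=5))
chunk_ms = [len(c) // 2 * 1000 // SAMPLE_RATE for c in chunks]
print(chunk_ms)  # [100, 100, 40]
```

With these assumed numbers the listener hears the first 100 ms of audio after five tokens instead of waiting for all twelve, which is the practical benefit of streaming: lower latency before playback begins.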

It looks like text-to-speech models may have a breakthrough moment in 2024, just in time for the election! Still, the usefulness of the technology is undeniable, especially for accessibility. The team notes that it has chosen not to release the model's source code and other data, given the risk that bad actors could exploit it. Sooner or later, though, it is likely to get out anyway.
