Amazon researchers have trained the largest text-to-speech model to date, and they report that it exhibits "emergent" qualities that let it speak complex sentences more naturally. The breakthrough could be what finally rids the technology of its telltale unnaturalness.
Text-to-speech models have kept growing and improving, and the researchers specifically hoped to see the kind of capability leap already observed in language models: for reasons that aren't fully understood, once large language models (LLMs) grow past a certain size, they become markedly more powerful and flexible, able to take on tasks they were never trained for.
That doesn't mean the models gain self-awareness or anything like it; it means that beyond a certain size threshold, their performance on certain conversational AI tasks improves sharply. Amazon's Artificial General Intelligence (AGI) team - its goal is right there in the name - suspected the same might happen as text-to-speech models scaled up, and their research suggests it does.
The new model is called Big Adaptive Streamable TTS with Emergent abilities, or BASE TTS for short. The largest version of the model was trained on 100,000 hours of public-domain speech, 90% of it in English and the rest in German, Dutch, and Spanish.
At 980 million parameters, BASE-large is the biggest model in this category. For comparison, the team also trained 400-million- and 150-million-parameter models on 10,000 and 1,000 hours of audio, respectively - the idea being that if one of these models exhibits emergent behaviors and another doesn't, the region where those behaviors begin to emerge can be pinned down.
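The three-model comparison amounts to a scaling sweep: score each model on the same test set, then look for the size at which scores jump rather than climb smoothly. A minimal sketch of that logic, with made-up placeholder scores (not numbers from the paper):

```python
# Hypothetical scaling sweep: find the smallest model size at which
# benchmark scores jump by more than a chosen margin over the previous
# size. The scores below are illustrative placeholders, NOT results
# from the BASE TTS paper.

model_scores = [
    (150_000_000, 0.52),   # 150M params, trained on 1,000 hours
    (400_000_000, 0.74),   # 400M params, trained on 10,000 hours
    (980_000_000, 0.78),   # 980M params, trained on 100,000 hours
]

def emergence_point(scores, jump=0.10):
    """Return the first model size whose score exceeds the previous
    model's score by more than `jump`, or None if growth is smooth."""
    for (_, prev), (size, curr) in zip(scores, scores[1:]):
        if curr - prev > jump:
            return size
    return None

print(emergence_point(model_scores))  # -> 400000000 with these placeholders
```

With these invented numbers the jump lands between the small and medium models, which mirrors the paper's observation that the medium model is where the new abilities show up.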
As it turned out, the medium-sized model showed the leap in capability the team was hoping for - not necessarily in ordinary speech quality (which improved, but only slightly) but in a set of emergent abilities they observed and measured. Here are some examples of tricky texts from the paper:
Compound nouns: The Beckhams decided to rent a charming stone-built quaint countryside holiday cottage.
Emotions: "Oh my gosh! Are we really going to the Maldives? That's unbelievable!" Jennie squealed, bouncing on her toes with uncontained glee.
Foreign words: "Mr. Henri, renowned for his mise en place, orchestrated a seven-course meal, each dish a pièce de résistance."
Paralinguistics (i.e., readable non-words): "Shh, Lucy, shhh, we mustn't wake your baby brother," Tom whispered, as they tiptoed past the nursery.
Punctuations: She received an odd text from her brother: 'Emergency @ home; call ASAP! Mom & Dad are worried…#familymatters.'
Questions: But the Brexit question remains: After all the trials and tribulations, will the ministers find the answers in time?
Syntactic complexities: The movie that De Moya, who was recently awarded the lifetime achievement award, starred in in 2022 was a box office hit, despite the mixed reviews.
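The categories above effectively form a small stress-test suite, and it is easy to imagine running any TTS system over them the same way. A minimal sketch of such a harness - the `TRICKY_TEXTS` structure and the `synthesize` stub are hypothetical stand-ins, not Amazon's code or API:

```python
# Hypothetical harness for exercising a TTS system on "tricky text"
# categories like those listed above. `synthesize` is a placeholder
# for whatever backend is under test -- BASE TTS itself is not public.

TRICKY_TEXTS = {
    "compound_nouns": "The Beckhams decided to rent a charming stone-built "
                      "quaint countryside holiday cottage.",
    "emotions": '"Oh my gosh! Are we really going to the Maldives?"',
    "paralinguistics": '"Shh, Lucy, shhh, we mustn\'t wake your baby brother."',
    "questions": "After all the trials and tribulations, will the ministers "
                 "find the answers in time?",
}

def synthesize(text: str) -> bytes:
    """Placeholder: a real implementation would return waveform audio."""
    return text.encode("utf-8")  # stand-in payload

def run_benchmark(texts: dict) -> dict:
    """Synthesize each category and record the payload size produced."""
    return {category: len(synthesize(text)) for category, text in texts.items()}

results = run_benchmark(TRICKY_TEXTS)
for category, size in sorted(results.items()):
    print(f"{category}: {size} bytes")
```

In a real evaluation the payload-size check would be replaced with human or automated ratings of the generated audio, which is how the paper's authors scored these categories.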
"These sentences are designed to contain challenging tasks - parsing garden-path sentences, placing phrasal stress on long-winded compound nouns, producing emotional or whispered speech, or producing the correct phonemes for foreign words like 'qi' or punctuations like '@' - none of which BASE TTS is explicitly trained to perform," the authors write.
Features like these normally trip up text-to-speech engines, which mispronounce words, skip them, use odd intonation, or make other errors. BASE TTS still had trouble, but it handled them far better than contemporaries like Tortoise and VALL-E.
There are plenty of examples of these difficult texts being spoken quite naturally on the site the researchers built for the model. The examples were of course chosen by the researchers, so they're necessarily cherry-picked, but the results are impressive all the same.
Because the three BASE TTS models share an architecture, it seems clear that model size and the extent of the training data are what let the model handle the complexities described above. Bear in mind that this is still an experimental model and process, not a commercial model or anything close to one. Later research will have to identify the tipping point at which the emergent abilities appear, and how to train and deploy the resulting model efficiently.
Notably, the model is "streamable," as the name says - meaning it doesn't need to generate whole sentences at once but can produce audio incrementally at a relatively low bitrate. The team is also trying to package speech metadata such as emotion and prosody into a separate, low-bandwidth stream that could be played back in sync with the regular audio.
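The two-stream idea can be sketched as a pair of generators consumed in lockstep - one yielding audio chunks, one yielding lightweight per-chunk metadata. The chunk format and the `emotion`/`rate` tags below are illustrative assumptions, not Amazon's actual stream format:

```python
# Illustrative sketch of streamed TTS output: audio is produced in
# small chunks rather than one whole utterance, while a second,
# low-bandwidth stream carries per-chunk metadata (emotion, prosody).
# The chunking scheme and metadata fields are assumptions.

from typing import Iterator

def audio_stream(text: str, chunk_words: int = 3) -> Iterator[bytes]:
    """Yield audio incrementally; each chunk here is a stand-in payload."""
    words = text.split()
    for i in range(0, len(words), chunk_words):
        yield " ".join(words[i:i + chunk_words]).encode("utf-8")

def metadata_stream(n_chunks: int) -> Iterator[dict]:
    """Yield lightweight per-chunk metadata in sync with the audio."""
    for i in range(n_chunks):
        yield {"chunk": i, "emotion": "neutral", "rate": 1.0}

text = "Streaming lets playback begin before the whole sentence is generated"
chunks = list(audio_stream(text))
for chunk, meta in zip(chunks, metadata_stream(len(chunks))):
    print(meta["chunk"], chunk.decode("utf-8"))
```

The point of keeping the metadata channel separate is that it costs almost no bandwidth relative to the audio, so a player can apply or ignore it without touching the audio stream itself.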
It looks like text-to-speech may be having a breakout moment in 2024 - just in time for the election! But the technology's usefulness is undeniable, particularly for accessibility. It's worth noting that the team has chosen not to publish the model's source code and other data, given the risk that bad actors could exploit it. The cat will get out of that bag eventually, though.