Doubao's end-to-end real-time voice model is live: strong on both IQ and EQ, with a commanding lead in Chinese voice dialogue

AI News | Published 1 year ago | AI Sharing Circle

Today, the Doubao app announced that its new end-to-end real-time voice call feature is officially live. Rather than staging a limited "pre-release", it is rolling out in full, free for everyone, and every user is welcome to put it to the test.

Doubao real-time voice model URL: https://team.doubao.com/realtime_voice

After trying it, we found several highlights:

First of all, Doubao really does sound human, with highly natural phrasing, tone, and breathing rhythm. When you lower your voice, Doubao switches to a whisper as well, shedding the robotic feel of earlier AI voice calls.

Secondly, however complex the Chinese dialogue, Doubao can hold its own. After a series of hands-on tests, we can say that Doubao has a commanding lead in Chinese language ability. That lead holds not only against ChatGPT and other overseas players, but also against a range of domestic AI dialogue apps.

In addition, Doubao is a chat buddy who knows a bit of everything, from astronomy to geography. It listens carefully to what the user says and to the deeper meaning behind it, responds quickly with answers that are both interesting and useful, and can search the web.

To try this feature, you need to upgrade the Doubao app to version 7.2.0, the Chinese New Year edition. After the launch, a large number of users updated right away, flocked to Doubao, and settled in for long phone-style chats with it.

Recall that in the early morning of May 14, 2024, GPT-4o arrived out of the blue and gave ChatGPT a new real-time voice call capability, in what the industry called "the launch that shook the world". Unfortunately, once ChatGPT shipped the feature, our actual experience was not as impressive as the launch demo.

Now it is Doubao's turn to shake the world. Before launch, an internal team evaluated the Doubao real-time voice model behind this feature against GPT-4o across a number of dimensions, including human-likeness, usefulness, emotional intelligence, call stability, and conversational fluency. On overall satisfaction (out of 5), the Doubao real-time voice model scored 4.36 and GPT-4o scored 3.18, and 50% of testers gave the Doubao model a full 5.

In addition, in the evaluation of model strengths and weaknesses, the Doubao real-time voice model showed a clear advantage in understanding and expressing emotion. In particular, in the "does it sound like an AI" evaluation, more than 30% of the feedback described GPT-4o as "too AI", while the corresponding share for the Doubao real-time voice model was under 2%.
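For readers who want to see how figures of this kind are typically produced, here is a minimal Python sketch. It assumes the overall satisfaction score is a plain average of per-tester ratings on a 5-point scale, which the article does not confirm. The function name `summarize` and the ratings are made up for illustration and are not the team's data.

```python
from statistics import mean


def summarize(ratings: list[int], full_marks: int = 5) -> tuple[float, float]:
    """Return (mean rating, share of ratings equal to full_marks)."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    avg = mean(ratings)
    share = sum(1 for r in ratings if r == full_marks) / len(ratings)
    return avg, share


# Hypothetical ratings, invented for this example only.
example_ratings = [5, 5, 4, 3, 5, 4, 5, 4]
avg, share = summarize(example_ratings)
print(f"mean satisfaction: {avg:.2f} / 5, full marks: {share:.0%}")
```

The same two numbers, a mean and a share of top ratings, are what the reported 4.36 and 50% would correspond to under this assumption.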

What follows is Heart of the Machine's own hands-on test. If it piques your interest, we suggest opening your Doubao app and upgrading to the 7.2.0 New Year edition to try it yourself. Given how popular it already is, latecomers may find it hard to get in.

Hands-on test: a little shocking, as science fiction becomes reality

At the end of 2024, the Doubao large model team revealed that a new end-to-end real-time voice feature would soon go live in the Doubao app, sparking a wave of anticipation among users.

After actually using it, our impression is that it is indeed more human-like and natural than expected.

One of Doubao's highlights is how well it senses and responds to the user's emotions. Why not listen to a few of our conversations with Doubao to get a feel for how human it sounds?

For example, its ability to express emotion lets it convey complex feelings in its voice, to the point where it is "hard to tell human from machine".

Doubao comes across as a skilled actor: playing out different scenarios around a 5-million-yuan lottery ticket, it is ecstatic one moment and grief-stricken the next.

Its ability to follow instructions is also very strong. We had it recite poems at various speaking speeds, and it picked up on the emotion in the poems and recited them with feeling.

It also handles empathy well. When we opened with bad news in a frustrated voice, Doubao reassured us in a calmer, warmer tone. When we recovered a positive frame of mind and complimented it in a lighter tone, Doubao switched to a perky tone. It also shows human-like paralinguistic features, including intonation, hesitation, and pauses.

Note: Some responses are delayed because they involve web queries.

At the same time, Doubao clearly offers more than emotional companionship. In the first conversation test, for example, its advice on grabbing tickets and its trip recommendations were very practical, and it retrieved up-to-the-minute information such as the weather quickly and accurately.

Indeed, Doubao's fluent conversation rests on the strong semantic understanding and information retrieval capabilities of the Doubao real-time voice model. As soon as the user starts speaking, Doubao begins analyzing every dimension of the input in depth to keep its output useful and accurate. In plain terms, it offers both "emotional value" and "practical value". (However, we also found that the Doubao real-time voice model currently supports only English and Chinese, and we hope its multilingual capability will be expanded in the future.)

Since Doubao has been "hanging out" on the Internet for a long time, its grasp of absurdist internet humor is naturally not bad either.

Note: Some responses are delayed because they involve web queries.

Of course, when you talk to Doubao, you get not just one chat buddy but countless role-play partners.

In the "Hundred Changes of Big Cameo" mode, Doubao plays characters ranging from the Monkey King to Lin Daiyu and from Grey Wolf to Lazy Goat, and its control of voice and portrayal of emotion lift the user experience to a higher level.

Since role-playing is no problem for it, storytelling comes easily too, and it switches freely between horror and comedy.

Interestingly, the Doubao app has introduced a singing feature that GPT-4o does not have. It is fun for young and old alike and looks set to go viral.

It's the end of the year, so let's have it sing a few New Year songs to close out this review:

What technology lies behind the far superior call experience?

How did the team behind Doubao achieve such smooth, natural real-time voice calls?

The core capabilities of this feature are powered by the recently launched Doubao real-time voice model.

According to the Doubao large model speech team, this is a single model that integrates speech understanding and generation and delivers true end-to-end voice dialogue. Compared with the traditional cascaded approach, it is more impressive in vocal expressiveness, controllability, and emotional responsiveness, and it offers low latency and the ability to be interrupted at any point in the conversation.
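To make the contrast with a cascade concrete, here is a toy Python sketch. It is not Doubao's implementation. It assumes the common cascade of speech recognition, a text model, and speech synthesis, in which only the transcribed words pass between stages. The `Utterance` type and both reply functions are hypothetical stand-ins, and mirroring the input volume is a deliberate simplification of "the model can hear how something was said".

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    words: str
    volume: str  # "normal" or "whisper"; stands in for all paralinguistic cues


def cascade_reply(user: Utterance) -> Utterance:
    """Toy cascade: recognition keeps only the words, so tone is lost."""
    text = user.words                      # speech recognition stage
    reply_text = f"You said: {text}"       # stand-in for a text-only model
    return Utterance(reply_text, "normal") # synthesis never saw the input volume


def end_to_end_reply(user: Utterance) -> Utterance:
    """Toy end-to-end model: the whole utterance is visible, tone included."""
    return Utterance(f"You said: {user.words}", user.volume)


whispered = Utterance("are you awake", "whisper")
print(cascade_reply(whispered).volume)     # tone information dropped
print(end_to_end_reply(whispered).volume)  # tone information preserved
```

In this simplified picture, the whisper behavior described earlier is only possible when the part of the system that chooses the reply's delivery has access to how the user spoke, not just to the transcript.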

Across the field of speech AI, building a real-time voice model for real users involves two main technical difficulties.

The first is that it is hard to balance EQ and IQ.

Many speech practitioners know that a model's conversational naturalness, usefulness, and safety often pull against one another. In other words, the challenge is to make the model not only a "top student" with solid logical reasoning, but also expressive, empathetic, and understanding, with emotional intelligence to match.

According to the team, they tackled these problems through both data and post-training algorithms, to ensure that multimodal speech dialogue data is both semantically correct and naturally expressive. They also rely on a multi-round data synthesis method to produce high-quality, highly expressive speech data, so that the generated speech stays natural and consistent.

In addition, the team regularly runs multi-dimensional evaluations of the model and uses the results to adjust the training strategy and data usage in a timely way, so that the model keeps a good balance between intelligence and expressiveness.

The second is the high bar for shipping a real product. Making the voice feature more than a toy is a serious test of a team's all-round ability.

In the past, a number of end-to-end voice releases, GPT-4o included, showed only a demo, and even when the capability was later made public, it did not necessarily win over users in practice. The reason is that developing such a feature requires algorithm, engineering, product, and testing teams to work together: they must clarify user needs, define the technical evaluation dimensions and metrics, and then collaborate closely again through model training, fine-tuning, and the steps that follow. Finally, if the product is to serve hundreds of millions of users at launch, it also faces major engineering and safety challenges.

As mentioned earlier, Doubao's new real-time voice feature was opened to everyone the moment it was announced, directly serving thousands of users. The team also tried to find the best balance in the delivered experience: with safety as the baseline, the model offers highly expressive, controllable speech and strong emotional responsiveness, while retaining strong comprehension and logic and the ability to go online to answer time-sensitive questions.

Within a framework that jointly models speech generation, speech understanding, and the text large model, the team enabled diverse forms of input and output. On the generation side, the framework keeps the output accurate and natural while lowering system latency. On the understanding side, it lets the model detect interruptions promptly and recognize when the user has finished speaking.
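The interruption behavior described above can be pictured with a small Python sketch. This is an illustration only, not how the Doubao model works internally: it assumes the reply is played back in chunks and that a voice-activity flag is available before each chunk. The function name `play_until_interrupted` and the flag list are hypothetical.

```python
from typing import Iterable


def play_until_interrupted(
    reply_chunks: Iterable[str], user_speaking: Iterable[bool]
) -> list[str]:
    """Play reply chunks in order and stop as soon as the user starts talking.

    Each flag in user_speaking is a hypothetical voice-activity reading taken
    just before the matching chunk would be played. If fewer flags than chunks
    are supplied, playback stops when the flags run out.
    """
    played = []
    for chunk, speaking in zip(reply_chunks, user_speaking):
        if speaking:
            break  # barge-in: drop the rest of the reply and return to listening
        played.append(chunk)
    return played


reply = ["chunk 1", "chunk 2", "chunk 3", "chunk 4"]
flags = [False, False, True, False]  # the user cuts in before the third chunk
print(play_until_interrupted(reply, flags))
```

The point of the sketch is the ordering: the check for user speech happens before each piece of audio is played, so the reply can stop mid-answer rather than running to the end.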

Of course, the team also takes seriously the safety issues that come with more capable models. According to the technical staff involved, they introduced a variety of safety mechanisms in the post-training phase of joint modeling, reducing risk by suppressing and filtering potentially unsafe content.

The technical team also told us that, through joint modeling, new capabilities such as instruction understanding, voice role-play, and voice control emerged unexpectedly. For example, some of the model's dialects and accents currently come from generalization over data in the pretraining phase rather than from targeted training. In this respect, speech models closely resemble language models.

Beyond the surprises, what has Doubao "disrupted"?

Among comparable products available today, Doubao offers, in our experience, the most human-like and emotionally engaging interaction. It is skilled across the board, and its Chinese language ability is far ahead of ChatGPT and other "imported products".

Finally, one might ask: aside from the surprising user experience, why has Doubao's updated end-to-end real-time voice attracted so much attention?

The key answer is that it is the first end-to-end Chinese voice system that serves hundreds of millions of users, actually works well, and is free.

Once upon a time, real-time voice dialogue with an AI was just a scene from a sci-fi movie, and it was also our most concrete picture of advanced artificial intelligence. Now that seemingly magical function lives in the Doubao app on your phone and mine, and it has gone from "far away" to "within reach".

Photo credit: The movie Her

To summarize briefly, Doubao's new end-to-end real-time voice sets two precedents:

In terms of technological change, Doubao is the first in the industry to give AI a "soul", delivering both emotional intelligence and intelligence at once. This seems to mark the end of the traditional voice assistant era. We no longer subconsciously feel that we are talking to a model trained on a huge amount of data. People and AI are beginning to form subtle emotional connections, including trust and dependence, and the plots of science fiction movies are entering everyday life.

As in classics like Her, humans never fell in love with AI because it provided unlimited knowledge, but because it delivered just the right amount of emotional value.

From the perspective of large model technology, end-to-end real-time voice calls fill one of the few remaining gaps in multimodal interaction. The ways large model applications can be used keep expanding: future products may accept any combination of text, audio, and images as input and generate any combination of text, audio, and images as output in real time. The way humans and machines interact is being disrupted, which in turn is transforming the way humans interact with each other.

At least for today's Chinese-speaking users, the launch of Doubao's end-to-end real-time voice feature offers a way of interacting through natural human language that truly lowers the barrier to accessing and experiencing advanced AI.

Six months ago, could we have imagined that Doubao would be the one to take the lead in making history?

Starting from the large language model in 2023 and running through the end of 2024, Doubao's family of large models has filled out across modalities including image, voice, music, video, and 3D. It has not only joined the first tier in China, but has also gone from "fledgling" to "shaking the world" in just a few months.

And in a large-model race crowded with competitors, whoever reaches this milestone first may determine their ranking in the field for the next decade.

How fast large models, Doubao, and domestic AI will move forward over the next year is well worth looking forward to.