In the era of AI applications, how should model capabilities evolve?

At Volcano Engine’s Winter Force Prime Power Conference, the Doubao Voice Large Model demonstrated a “conversation across time and space” experience, a sci-fi scenario underpinned by its voice replication technology. Playful features like this from domestic models are sparking the public’s imagination about AI applications, and model vendors are shifting their focus from “competing on models” to “competing on applications.”

Voice interaction is a key entry point for putting AI models into real products, and OpenAI’s end-to-end voice interaction model has sparked extensive discussion. The Doubao Voice Large Model evolved rapidly in 2024, driven by the demands of an AI application matrix. The ByteDance voice team divides it into three parts: speech synthesis, speech recognition, and voice replication.

In speech recognition, the Doubao model is highly accurate: its error rate is 10% to 40% lower than that of publicly released models in China. It can reason over context information to improve its recall rate, and it also recognizes multiple dialects. Speech synthesis is a difficult problem, yet the Doubao speech synthesis model can generate ultra-natural, high-fidelity, personalized speech that fits the context. It can synthesize different emotions and offers 260 style timbres for users to choose from. The voice replication technology can reproduce a user’s timbre from just 5 seconds of data, and multilingual replication and low-cost tuning make it more flexible still.
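To make the error-rate comparison concrete, here is a minimal sketch of how a recognition error rate and a relative reduction are commonly computed. It assumes the “10% to 40% lower” figure is a relative reduction, which the source does not specify, and it uses word-level edit distance (Chinese benchmarks often count characters instead). The function names and sample numbers are illustrative, not taken from Doubao’s evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the `baseline` error rate."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    wer = word_error_rate("turn on the living room light",
                          "turn on the living room lights")
    print(f"WER: {wer:.3f}")
    # Hypothetical error rates: a baseline of 10% against an improved 6%.
    print(f"Relative reduction: {relative_reduction(0.10, 0.06):.0%}")
```

Under this reading, a model with a 6% error rate against a 10% baseline would count as “40% lower,” even though the absolute gap is only four percentage points.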

The development of ByteDance’s models and AI applications benefits from the company’s business ecosystem on the one hand and from Volcano Engine’s model technology upgrade strategy on the other. The Doubao Voice Large Model has obtained dual certifications from the China Academy of Information and Communications Technology and is the first voice large model to be rated “leading-level.” For speech synthesis, it uses the Seed-TTS model architecture, with targeted upgrades for scenarios such as chat companionship. In marketing service scenarios, it meets the high demands that intelligent customer service places on voice output. The upgraded speech recognition model addresses accuracy problems and hard-to-recognize speech in complex scenarios, so that automatic speech recognition (ASR) can serve a wide range of use cases.

The development of the Doubao Voice Large Model offers a window into how Volcano Engine’s model capabilities and AI applications shape each other. Volcano Engine follows an efficient research and development model and has built out a comprehensive portfolio, and its multimodal large models are an important foundation for building an application ecosystem. The Doubao large language model performs excellently in evaluations by the Beijing Academy of Artificial Intelligence. The Doubao music model can create full songs and make partial modifications, enabling new features and applications when combined with other multimodal capabilities. The visual model has been upgraded for professional image editing, making text-to-image generation more controllable and professional, and it has been deployed in multiple AI applications. Volcano Engine adheres to a strategy of rooting itself in real scenarios to drive innovation, and it holds a leading position in both the generative AI IaaS field and the AI application market. AI applications may become the next driving force for the evolution of large models.