As 2024 draws to a close, has AI application in China risen to prominence? This company turns in a 95-point answer.
*Caption: in the AI-generated video comparison, the left clip was generated by Sora and the right by the domestic Zhixiang multimodal large model.*

On December 10, OpenAI officially released Sora, but the launch did not deliver the shock many expected; in some respects, a number of domestic models have even surpassed it.
At the same time, questions about the application prospects of image and video generation models have resurfaced. When Sora's preview version was released back in February, domestic AI companies were divided over whether to follow this direction. Ten months on, the companies that chose to pursue it have not only made breakthroughs in model quality but have also found scenarios that can be put into practice today.
Zhixiang Future is one such company. Founded in March 2023, its core team had already been researching video and image generation models for several years. Its Zhixiang multimodal generation large model has now been updated to version 3.0, and its understanding large model 1.0 has also been released.
Yao Ting, the company’s CTO, said that in video and image generation there is no need to wait for the base model to reach 100 points before building applications. If, on top of the base model’s existing capabilities, a company finds scenarios that genuinely solve users’ pain points and goes deep on the application side so that the end-to-end experience truly exceeds 95 points, users will pay.
## Optimizing the model starting from scenarios
### The Zhixiang multimodal generation large model reaches version 3.0
What kind of model do users actually need? From user feedback, Yao Ting observed that young people born in the 1990s and 2000s find one-minute single-shot AI videos boring, yet a simple dynamic wallpaper featuring a specific IP can get them to pay. B-side (enterprise) scenarios are similar: printing a product logo onto clothing, for example, must look natural and be directly usable.
These observations reveal the gap between model and application: researchers assume no one will pay until the base model scores 100 points, while users only want a model that scores above 95 on their specific problem. This gap made Yao Ting realize that optimizing the model from the perspective of scenario requirements is the way to build a genuinely useful product.
The Zhixiang multimodal generation large model 3.0 was refined under exactly this philosophy. It has been optimized in three major areas:
1. **Improved image quality and relevance**: The technical architecture introduces a hybrid of Diffusion Transformer (DiT) and autoregressive (AR) modeling. This hybrid retains DiT’s advantage of continuous image encoding, combines the autoregressive process with a lightweight diffusion process, improves generation quality and controllability, and also speeds up model inference (a minimal sketch of such a hybrid follows this list).
2. **More controllable camera movement and in-frame motion**: Jointly training camera movement with in-frame motion strengthens the model’s ability to learn and reproduce film-grade shots, and also makes the motion within the frame itself more natural.
3. **Better generation in characteristic scenarios**: Multi-scenario learning amplifies the product-facing strengths of the multimodal generation large model, improving output quality across distinct scenarios and covering the “last mile” of user demand. Take the IP-migration feature in marketing as an example: to exceed 95 points end to end, the technology must carefully balance how authentic the generated content looks to users against how faithfully the IP is preserved.
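To make points 1 and 2 concrete, below is a minimal PyTorch sketch of what an AR + lightweight-diffusion hybrid decoder over continuous frame latents could look like, including optional camera-trajectory conditioning. This is an illustrative assumption, not Zhixiang’s published architecture: the class name, dimensions, and conditioning scheme are all hypothetical.

```python
import torch
import torch.nn as nn

class HybridARDiffusionBlock(nn.Module):
    """Hypothetical decoder step combining an autoregressive transformer
    with a lightweight diffusion refiner over continuous frame latents.
    All names and sizes are illustrative, not Zhixiang's actual design."""

    def __init__(self, latent_dim=512, n_heads=8, n_layers=6, refine_steps=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=n_heads, batch_first=True)
        # AR backbone: predicts a coarse latent for the next frame.
        self.ar_backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Lightweight diffusion head: a few cheap denoising passes instead
        # of a full diffusion chain, which keeps inference fast.
        self.refiner = nn.Sequential(
            nn.Linear(latent_dim * 2, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim))
        self.refine_steps = refine_steps
        # Optional camera conditioning (point 2): a per-frame 6-DoF pose
        # is embedded and added to the frame latents.
        self.camera_proj = nn.Linear(6, latent_dim)

    def forward(self, past_latents, camera_traj=None):
        # past_latents: (batch, frames, latent_dim) continuous image codes,
        # the DiT-style representation the article refers to.
        h = past_latents
        if camera_traj is not None:          # (batch, frames, 6)
            h = h + self.camera_proj(camera_traj)
        t = h.size(1)
        # Causal mask: frame t may only attend to frames <= t.
        mask = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=h.device), diagonal=1)
        coarse = self.ar_backbone(h, mask=mask)[:, -1]   # next-frame draft
        # Refine the coarse draft starting from noise.
        x = torch.randn_like(coarse)
        for _ in range(self.refine_steps):
            x = x - self.refiner(torch.cat([x, coarse], dim=-1))
        return x

if __name__ == "__main__":
    block = HybridARDiffusionBlock()
    past = torch.randn(2, 16, 512)        # 16 frames already generated
    cams = torch.randn(2, 16, 6)          # hypothetical camera poses
    print(block(past, cams).shape)        # torch.Size([2, 512])
```

The design intuition matches the article’s claim: the AR pass gives controllable, frame-by-frame prediction, while the small refiner recovers the visual quality of diffusion at a fraction of the sampling cost.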
## Understanding and generation complement each other
### The Zhixiang multimodal understanding large model 1.0 is unveiled

In multimodal large models, understanding and generation reinforce each other. Zhixiang Future has therefore added understanding enhancement to the training of generation large model 3.0, and has also launched a dedicated new model, the Zhixiang multimodal understanding large model 1.0. Through object-level picture modeling and event-level spatiotemporal modeling, it achieves finer-grained and more accurate understanding of image and video content.
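As a rough illustration of the difference between “object-level” and “event-level” outputs, here is a hypothetical data structure for such a model’s results; the field names and granularity are assumptions made for illustration, not the model’s actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectTrack:
    """Object-level picture modeling: one entity tracked across frames."""
    label: str                 # e.g. "red car"
    # frame index -> bounding box (x1, y1, x2, y2)
    boxes: dict = field(default_factory=dict)

@dataclass
class Event:
    """Event-level spatiotemporal modeling: an action over a time span."""
    description: str           # e.g. "car turns left"
    start_frame: int
    end_frame: int
    participants: list = field(default_factory=list)  # ObjectTrack labels

@dataclass
class VideoUnderstanding:
    """Combined result: what is in the video and what happens in it."""
    objects: list = field(default_factory=list)
    events: list = field(default_factory=list)

    def events_at(self, frame: int):
        """All events active at a given frame (useful for retrieval/editing)."""
        return [e for e in self.events
                if e.start_frame <= frame <= e.end_frame]

# Example: a short clip of a car turning.
car = ObjectTrack("red car", {0: (10, 20, 60, 80), 30: (40, 22, 95, 85)})
turn = Event("car turns left", start_frame=5, end_frame=40,
             participants=["red car"])
result = VideoUnderstanding(objects=[car], events=[turn])
print(result.events_at(20))    # the "car turns left" event is active here
```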
Beyond that, the understanding model can feed back into understanding-enhanced multimodal generation and, combined with the multimodal generation large model, it enables a creation platform that couples multimodal retrieval with multimodal content editing and generation.
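A “retrieval + editing/generation” platform of this kind typically embeds both the user’s query and the asset library in a shared space, retrieves the best match, and hands it to the generation model for editing. The sketch below assumes precomputed embeddings and a stand-in `edit_fn`; none of these names come from Zhixiang’s platform.

```python
import numpy as np

def retrieve(query_vec, asset_vecs, asset_ids, k=3):
    """Cosine-similarity retrieval over a library of asset embeddings
    (assumed to come from a multimodal understanding model)."""
    q = query_vec / np.linalg.norm(query_vec)
    a = asset_vecs / np.linalg.norm(asset_vecs, axis=1, keepdims=True)
    scores = a @ q
    top = np.argsort(-scores)[:k]
    return [(asset_ids[i], float(scores[i])) for i in top]

def create_from_library(query_vec, asset_vecs, asset_ids, edit_fn):
    """Retrieve the best-matching asset, then pass it to an editing /
    generation step; edit_fn stands in for the generation model."""
    best_id, _ = retrieve(query_vec, asset_vecs, asset_ids, k=1)[0]
    return edit_fn(best_id)

# Demo with random embeddings standing in for real ones.
rng = np.random.default_rng(0)
library = rng.normal(size=(100, 64))            # 100 assets, 64-dim
ids = [f"clip_{i}" for i in range(100)]
query = rng.normal(size=64)                     # embedded user prompt
print(retrieve(query, library, ids, k=3))       # top-3 (id, score) pairs
print(create_from_library(query, library, ids,
                          edit_fn=lambda cid: f"edited({cid})"))
```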
## Let AIGC “fly into ordinary people’s homes”
For companies building generative models, some noteworthy trends have emerged recently. On the one hand, everyone is debating whether the scaling law has reached its limit and whether pre-training is coming to an end. On the other hand, ever higher expectations are being placed on multimodal large models. Some people