After attending the Scale With AI event in Silicon Valley and talking with researchers, entrepreneurs, and investors, I have compiled several observations and insights on the development of AI. AI is reshaping the future of the world, and Silicon Valley is undoubtedly the core engine of this transformation. But the power to reshape the world lies not only in Silicon Valley; it also lies in the efforts of a generation of Chinese practitioners.
**1. For LLMs, the Era of Pre-training is Essentially Over**
* **Pre-training is nearing a bottleneck, but post-training still offers many opportunities.** The scaling of pre-training has slowed mainly due to issues with architecture, compute, and data. For multimodal models, however, data and compute matter even more, and the key question is how to select combinations across multiple modalities. Pre-training may be over under the current architecture, but new architectures can still emerge.
* **The relationship between pre-training and RL:** Pre-training is relatively insensitive to data quality, while post-training demands much higher data quality. RL can be adapted to different tasks; in the overall pipeline, pre-training comes before the RL stage of post-training.
* **Optimizing large models affects product capabilities.** This happens mainly in post-training, and it helps with many safety issues, such as responding to cases like the Character.AI (C.AI) teen-suicide incident by serving different models to different age groups and populations.
* **Some non-consensus views may become consensus next year.** Not everyone needs the largest models; many small models have proven effective, and many tasks may not require a new model at all. Today’s large models will be next year’s small models.
* **For LLMs, the era of pre-training is basically over.** Now, the focus is on post-training, which demands high data quality.
* **Building post-training teams.** In theory, a team of five is enough (not necessarily full-time): one to build the pipeline (infrastructure), one to manage the data (data loops), one to work on the model itself (SFT), and one to judge, from the product side, how models should be arranged and to collect user data.
* **Building data pipelines.** Data loop: data enters the pipeline, generates new data, and flows back in. Efficient iteration: data annotation combined with the pipeline and A/B testing, backed by a structured data warehouse. Data input: efficient annotation and rich user feedback build the moat. Early stage: SFT (looping back to this stage repeatedly). Later stage: RL, with a heavier reliance on RLHF, where scoring guides the RL; DPO is a simplified version of RL and is prone to collapse (a minimal DPO sketch follows this list).
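As a concrete reference for the last point, here is a minimal sketch of the standard DPO objective in PyTorch. This is an illustration, not anything from the source; the tensor names and the `beta` value are assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss.

    Each input is the summed log-probability of a full response under
    the policy or the frozen reference model. `beta` controls how far
    the policy may drift from the reference; a weak reference or badly
    tuned beta is one way DPO training can collapse.
    """
    # Implicit reward: how much more the policy prefers a response
    # than the reference model does.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probs for a batch of 4 preference pairs.
logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*logps))
```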
**2. Scaling Law for Video Models: The Bottleneck is Still Early**
* **Video generation is at the GPT-1/GPT-2 stage.** The current challenge is the dataset. Image models could rely on LAION, but there is no comparably large public dataset for video, partly due to copyright issues. Each company will acquire, process, and clean data differently, leading to different model capabilities and making strong open-source versions difficult.
* **Choosing different technology stacks for different scenarios will be a trend.** After Sora, the field was expected to converge on DiT, but many technical paths remain active: GAN-based approaches, real-time generation with autoregressive models such as Oasis, and combinations of CG and CV for better consistency and control. Each company is choosing differently, and the future will see different technology stacks for different scenarios.
* **The video Scaling Law is far from reaching the level of LLMs.** The largest video models proven effective so far are around 30B parameters; at 300B there are no successful cases yet.
* **Methods to speed up video generation.** The simplest is to generate at low resolution and low frame rate. The most common is step distillation: diffusion inference is iterative, and image generation needs at least two denoising steps, so distilling it down to one step makes generation faster (see the step-distillation sketch after this section’s list).
* **Priorities for video model iteration.** Clarity, consistency, and controllability are all far from saturated, so pre-training is still in a phase of improving all three simultaneously.
* **Technical directions for accelerating long video generation.** The upper limit of DiT’s capability is unknown; if a bottleneck appears at a certain scale, new model architectures may emerge.
* **Fusion of video models with other modalities.** There will be two kinds of unification: across modalities, and between generation and understanding. For the former, the representation must be unified first. For the latter, text and speech can already be unified, but the current view is that unifying VLMs with diffusion yields less than 1 + 1 = 2.
* **There is still plenty of training data for video models.** Video data is abundant, and efficiently selecting high-quality data is what matters; how much is usable depends on one’s interpretation of copyright. But compute is also a bottleneck: even with ample data, there may not be enough compute to train on it, especially high-definition data.
* **The future of long video generation lies in storytelling.** Current video generation produces footage; future generation will be about stories, with videos generated for a purpose. Long video is not about length but about storytelling.
* **Aesthetic improvements in video generation mainly rely on post-training.** For example, Hailuo (海螺, literally “conch”) trains on large amounts of film and television data. Realism, by contrast, depends on the base model’s capabilities.
* **Two challenges for video understanding are long context and latency.**
* **Visual modalities may not be the best path to AGI.** Text is a shortcut to intelligence, while the route from text to images to video is far less efficient; video and text differ in information efficiency by a factor of hundreds (a back-of-the-envelope calculation follows this section’s list).
* **End-to-end progress in speech models is significant.** No human labeling or judgment is needed, which enables fine-grained emotional understanding and output.
* **Multi-modal models are still in the early stages.** Predicting the next 5 seconds from the first 1 second of a video is already difficult, and adding text may be even harder.
* **The technical path for multi-modal models has not fully converged.** Diffusion models produce good quality, but their model structure is still changing; autoregressive models offer good logical ability.
* **There is no consensus on how to align different modalities.** Whether video should be represented as discrete or continuous tokens is undecided. High-quality alignment is not yet common, and it is unclear whether achieving it is a science problem or an engineering problem.
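To make the step-distillation idea from the earlier bullet concrete, here is a minimal, self-contained sketch in PyTorch. The `Denoiser` class and all shapes are toy placeholders (assumptions, not any real model): the student learns to match, in one step, what the frozen teacher produces with several denoising steps.

```python
import torch
import torch.nn as nn

# Placeholder denoiser: predicts a cleaner image from a noisy one.
class Denoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x, t):
        return self.net(x)  # a real model would also condition on t

teacher = Denoiser().eval()   # frozen multi-step teacher
student = Denoiser()          # one-step student being trained
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

def teacher_sample(x, steps=8):
    """Run the teacher for several denoising steps (simplified)."""
    with torch.no_grad():
        for t in reversed(range(steps)):
            x = teacher(x, t)
    return x

noise = torch.randn(2, 3, 64, 64)        # start from pure noise
target = teacher_sample(noise, steps=8)  # teacher's multi-step output
pred = student(noise, 0)                 # student does it in one step

opt.zero_grad()
loss = nn.functional.mse_loss(pred, target)
loss.backward()
opt.step()
```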
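And to put a rough number on the text-vs-video efficiency gap mentioned above: the figures below (tokens per frame, frame rate, narration length) are illustrative assumptions, but they show how the gap quickly reaches the hundreds-to-thousands range.

```python
# Rough token budgets for describing one minute of content.
# All numbers are illustrative assumptions.
text_tokens = 300                      # ~200 words of narration

fps = 24                               # video frame rate
tokens_per_frame = 256                 # e.g. a 16x16 grid of visual tokens
video_tokens = 60 * fps * tokens_per_frame

print(video_tokens)                          # 368640 visual tokens
print(round(video_tokens / text_tokens))     # ~1229x more tokens than text
```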
**3. Embodied Intelligence: Robots with Human-like Generalization Capabilities May Not Be Realized in Our Generation**
* **Embodied robots have not yet had a ‘ChatGPT moment.’** A core reason is that robots must complete tasks in the physical world, not merely generate text in a virtual linguistic one.
* **The core problem solved by this generation of machine learning is generalization.** Generalization is the ability of an AI system to learn patterns from training data and apply them to unseen data (a formal sketch follows this section’s list).
* **The generalization challenge for this generation of robots: most situations require extrapolation.** Environmental complexity: the diversity and dynamic change of home and industrial environments. Physical interaction issues: for example, differences in the weight, angle, and wear of doors. Uncertainty in human-robot interaction: the unpredictability of human behavior places higher demands on robots.
* **Robots with fully human-like generalization capabilities may not be realized in the current or next generation.** The complexity and diversity of the real world, such as pets, children, and furniture arrangements in homes, make it difficult for robots to achieve complete generalization.
* **Stanford lab’s choice: focus on the family scene.** Stanford’s robotics lab mainly focuses on household tasks, especially those related to the aging society, such as robots that can help with daily tasks like folding quilts, picking up items, or opening bottles.
* **Defining generalization conditions based on specific scenes.** Clearly define the environments and scenes robots need to handle, such as homes, restaurants, or nursing homes. With clear scenes, the task scope can be better defined, and the possible changes in item status and environmental dynamics can be covered.
* **The contradiction between generalization and specialization.** The conflict between general models and specific task models: general models need strong generalization capabilities to adapt to diverse tasks and environments, but this usually requires a lot of data and computing resources. Specific task models are easier to commercialize but have limited capabilities and are difficult to expand to other fields.
* **The potential of embodied multi-modal models.** Integrating multi-modal data: multi-modal models can process visual, tactile, and linguistic inputs together, enhancing robots’ understanding and decision-making in complex scenes. For example, in grasping tasks, visual data helps identify an object’s position and shape, while tactile data provides additional feedback to ensure a stable grasp (a toy fusion sketch follows this section’s list).
* **The challenge of achieving a robot data loop.** The robotics field currently lacks a landmark dataset like ImageNet, making it difficult to form unified evaluation standards. Data collection is costly, especially for interactive data involving the real world. For example, collecting multi-modal data like tactile, visual, and dynamics requires complex hardware and environmental support.
* **The challenge of the Sim-to-Real Gap.** Simulators have gaps with the real world in visual rendering and physical modeling, such as friction and material properties. Robots that perform well in simulation may fail in the real world, limiting the direct application of simulation data.
* **Advantages and challenges of real data.** Real data more accurately reflects the complexity of the physical world, but collecting it is costly, and data annotation is a bottleneck, especially for multi-modal data.
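One standard way to make the notion of generalization above precise is the generalization gap; the formalization below is added for clarity and is not from the source.

```latex
% Generalization gap of a model f trained on a sample S drawn from
% the data distribution D, with loss function l:
\mathrm{gap}(f) \;=\;
\underbrace{\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f(x),y)\big]}_{\text{expected risk on unseen data}}
\;-\;
\underbrace{\frac{1}{|S|}\sum_{(x_i,y_i)\in S}\ell\big(f(x_i),y_i\big)}_{\text{average training error}}
```

The extrapolation point above is exactly this: for robots, test inputs often fall outside the support of the training sample, so the gap stays large even for models that fit their training data well.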
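And as a toy illustration of the visual-plus-tactile fusion described in the multi-modal bullet above (the architecture, feature dimensions, and sensor layout are all invented for illustration): visual and tactile inputs are encoded separately, concatenated, and fed to a small head that scores grasp stability.

```python
import torch
import torch.nn as nn

class GraspPolicy(nn.Module):
    """Toy fusion of visual and tactile inputs for grasp scoring."""
    def __init__(self):
        super().__init__()
        # Visual branch: encode an RGB image into a feature vector.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Tactile branch: encode a vector of pressure-sensor readings.
        self.touch = nn.Sequential(nn.Linear(12, 16), nn.ReLU())
        # Fusion head: concatenated features -> grasp stability score.
        self.head = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, image, tactile):
        fused = torch.cat([self.vision(image), self.touch(tactile)], dim=-1)
        return torch.sigmoid(self.head(fused))  # probability of a stable grasp

policy = GraspPolicy()
score = policy(torch.randn(1, 3, 64, 64), torch.randn(1, 12))
print(score)
```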