In-Depth Interpretation by Yu Yang of Nanjing University: What is a “World Model”?

What is a World Model?

In AI, the terms “world” and “environment” are usually used in contrast to “agent”. The fields that study agents the most are reinforcement learning and robotics, so world models and world modeling appeared first, and still appear most often, in robotics papers.

Today, the most influential work is probably the paper “World Models”, posted to arXiv in 2018 by David Ha and Jürgen Schmidhuber and later published at NeurIPS’18 as “Recurrent World Models Facilitate Policy Evolution”. The paper does not define a world model. Instead, it draws an analogy with the mental model studied in cognitive science and cites literature from 1971.

According to Wikipedia, a mental model is a mirror image of the surrounding world held in the human brain, and it may take part in cognition, reasoning, and decision-making. It consists mainly of two parts: mental representations and mental simulation. The structural diagram in the paper shows the main components of the world model. Vertically, V->z is the low-dimensional representation of the observation, implemented by a VAE. Horizontally, M->h->M->h is the sequence prediction of the representation at the next time step, implemented by an RNN. Together these two parts form the world model, corresponding to mental representations and mental simulation respectively. In fact, however, the input to the RNN is not only z but also the action. This differs from ordinary sequence prediction: because the actions can be chosen freely, the data distribution can shift arbitrarily, which poses a huge challenge.
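
The data flow described above can be sketched in a few lines of Python. This is only an illustration of the V->z and M->h structure, not the paper's implementation: random linear maps stand in for a trained VAE and RNN, the model is deterministic, and all names and dimensions (encode, step, imagine, OBS_DIM, and so on) are assumptions made for this example.

```python
import math
import random

random.seed(0)

# Hypothetical sizes, chosen only for illustration.
OBS_DIM, Z_DIM, H_DIM, ACT_DIM = 8, 2, 4, 1

def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters (assumption)."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

W_enc = rand_matrix(Z_DIM, OBS_DIM)                   # V: observation -> z
W_rnn = rand_matrix(H_DIM, Z_DIM + ACT_DIM + H_DIM)   # M: (z, action, h) -> next h
W_out = rand_matrix(Z_DIM, H_DIM)                     # M: next h -> predicted next z

def encode(obs):
    """V: compress an observation into a low-dimensional code z."""
    return [math.tanh(x) for x in matvec(W_enc, obs)]

def step(z, action, h):
    """M: one recurrent step. The action is part of the input, not just z."""
    h_next = [math.tanh(x) for x in matvec(W_rnn, z + action + h)]
    z_pred = matvec(W_out, h_next)
    return z_pred, h_next

def imagine(obs, actions):
    """Roll the model forward from one observation, feeding predictions back in."""
    z, h = encode(obs), [0.0] * H_DIM
    trajectory = []
    for a in actions:
        z, h = step(z, a, h)
        trajectory.append(z)
    return trajectory

obs = [random.random() for _ in range(OBS_DIM)]
traj_a = imagine(obs, [[1.0]] * 5)    # imagined future under one action sequence
traj_b = imagine(obs, [[-1.0]] * 5)   # a different future from the same observation
```

Starting from the same observation, the two action sequences lead to different imagined trajectories, which is exactly what separates this setting from ordinary sequence prediction.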

The World Models paper belongs to the field of reinforcement learning, and the “model” in model-based RL is no different from the world model. The early version of the paper noted that although many model-based RL methods learn a model, they do not train the policy entirely inside that model. The reason is that model-based RL long suffered from models that were not accurate enough, so a policy trained entirely in the model performed very poorly. This problem was not solved until recent years.

In the 1990 paper in which Sutton proposed the Dyna framework, this model is called an action model, emphasizing that it predicts the result of executing an action. The agent learns from real data on the one hand and from the model on the other, so that an inaccurate model does not lead to a poorly learned policy.
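
A minimal sketch of this idea is tabular Dyna-Q: each real transition updates the value estimates directly, is stored in the action model, and is then followed by a few extra updates replayed from the model. The toy chain environment, the hyperparameters, and every name below are assumptions made for illustration, not taken from Sutton's paper.

```python
import random

N_STATES, GOAL = 6, 5        # hypothetical chain: states 0..5, reward on reaching state 5
ACTIONS = (-1, +1)           # move left or move right

def env_step(s, a):
    """The real environment (a toy chain, assumed for illustration)."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

def dyna_q(episodes=50, planning_steps=10, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}               # action model: (s, a) -> (next state, reward)

    def update(s, a, r, s2):
        """One Q-learning backup."""
        target = r + gamma * max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s = 0
        while s != GOAL:
            if rng.random() < eps:
                a = rng.choice(ACTIONS)                      # explore
            else:
                best = max(q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if q[(s, b)] == best])
            s2, r = env_step(s, a)
            update(s, a, r, s2)                  # learn from real data
            model[(s, a)] = (s2, r)              # learn the action model
            for _ in range(planning_steps):      # learn from the model
                ps, pa = rng.choice(list(model))
                ps2, pr = model[(ps, pa)]
                update(ps, pa, pr, ps2)
            s = s2
    return q, model

q, model = dyna_q()
greedy_policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
```

Because the model is only replayed for state-action pairs that were actually experienced, and real data keeps correcting the same value table, errors in the model have limited room to mislead the policy.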

The Core Role of the World Model

The world model is extremely important for decision-making. If an accurate world model can be obtained, the optimal real-world decision can be found by repeated trial and error inside the model. Its core role is counterfactual reasoning: even for decisions that never appear in the data, the world model can infer what their results would be.
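
As a rough sketch of trial and error inside a model, the snippet below enumerates every short action sequence, simulates each one in a model, and keeps the best. The point-mass dynamics, the scoring rule, and all function names are hypothetical, and the model is simply assumed to be accurate, which is the hard part in practice.

```python
from itertools import product

def world_model(state, action):
    """Hypothetical model of a 1-D point mass: state = (position, velocity)."""
    pos, vel = state
    vel = vel + 0.1 * action
    pos = pos + vel
    return (pos, vel)

def imagined_return(state, plan, target=1.0):
    """Score a plan inside the model only: end near the target, moving slowly."""
    for action in plan:
        state = world_model(state, action)
    pos, vel = state
    return -abs(pos - target) - abs(vel)

def plan_by_trial_and_error(state, horizon=4, actions=(-1, 0, 1)):
    """Try every action sequence in the model and keep the highest-scoring one."""
    return max(product(actions, repeat=horizon),
               key=lambda plan: imagined_return(state, plan))

start = (0.0, 0.0)
best_plan = plan_by_trial_and_error(start)

# A "what if" query: the outcome of a plan that need not appear in any dataset.
what_if = imagined_return(start, (1, 1, -1, -1))
```

No action is executed in the real world during the search. Every candidate, including ones never observed in data, is evaluated by imagination alone.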

Anyone familiar with causal inference will recognize the term counterfactual reasoning. Turing Award winner Judea Pearl drew a ladder of causation in the popular science book “The Book of Why”. The bottom rung is “association”. The middle rung is “intervention”, of which exploration in reinforcement learning is a typical example. The top rung is “counterfactuals”, which answers “what if” questions through imagination. Pearl’s schematic diagram of counterfactual reasoning is similar in spirit to the schematic diagram of the world model in the World Models paper.

Is Sora a World Simulator?

The term “simulator” is more common in engineering. It serves the same purpose as a world model: it allows trial and error that would be too costly or too risky in the real world. OpenAI uses the phrase “world simulator”, but the meaning is unchanged.

The videos generated by Sora can only be steered by vague text prompts and are hard to control precisely. Sora is therefore more of a video tool, and it struggles to answer “what if” questions accurately, as a tool for counterfactual reasoning would have to. It is even hard to evaluate how strong Sora’s generative ability is, because it is unclear how much the demo videos differ from the training data. Moreover, parts of the videos generated by Sora do not conform to physical laws. OpenAI believes that Sora demonstrates a route to “simulators of the physical world”, but simply piling up data is not the way to more advanced intelligent technologies.