# Continuous Thought Chains by Tian Yuandong’s Team Make a Splash! Superior to CoT, Unlocking a New Paradigm for LLM Reasoning

In cognitive science, the debate over whether language exists for thinking or for communication has long been ongoing. With the rise of large language models (LLMs) and Chain of Thought (CoT), language has become the default medium for machine reasoning. But is it really the best approach? LLMs are generally confined to reasoning within the language space, expressing the reasoning process as a CoT in order to solve complex problems. However, the language space may not always be the most suitable place to reason.

To explore the reasoning potential of LLMs in an unrestricted latent space, researchers from Meta and the University of California, San Diego have proposed a new paradigm: Coconut (Chain of Continuous Thought).

## Chain of Continuous Thought: Coconut

### Method Overview
In the Coconut method, the LLM switches between a language mode and a latent mode:
- In language mode, the model behaves as a standard language model and autoregressively generates the next token.
- In latent mode, the last hidden state is used directly as the next input embedding. This hidden state represents the current reasoning state and is called a continuous thought. Special tokens `<bot>` and `<eot>` mark the start and end of the latent thought mode, respectively (a minimal sketch follows this list).
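
The core mechanic is compact enough to show directly. Below is a minimal sketch using an off-the-shelf Hugging Face GPT-2 (not the authors' implementation); it works dimensionally because GPT-2's hidden size equals its embedding size:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("2 + 3 * 5 =", return_tensors="pt").input_ids
embeds = model.transformer.wte(ids)            # (1, seq_len, 768) token embeddings

with torch.no_grad():
    out = model(inputs_embeds=embeds, output_hidden_states=True)

# Latent mode: the last hidden state at the final position becomes the next
# input embedding (a "continuous thought") instead of being decoded to a token.
thought = out.hidden_states[-1][:, -1:, :]     # (1, 1, 768), same width as wte
embeds = torch.cat([embeds, thought], dim=1)   # fed back in the next forward pass
```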

### Training
Language CoT data is used to supervise the continuous thoughts via a multi-stage curriculum. In the initial stage, the model is trained on regular CoT instances. At the k-th stage, the first k reasoning steps of the CoT are replaced by k×c continuous thoughts, where the hyperparameter c controls how many latent thoughts replace a single language reasoning step (see the sketch below).
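
A hedged sketch of how a stage-k training example might be laid out; the function name, the `<thought>` placeholder, and the toy arithmetic are illustrative, not taken from the paper's code:

```python
# Stage-k curriculum: replace the first k language reasoning steps with
# k*c continuous-thought slots between <bot> and <eot>. The slots carry no
# token; during training each is filled by the preceding hidden state.
def build_stage_example(question, cot_steps, answer, k, c):
    latent_slots = ["<thought>"] * (k * c)   # placeholders for continuous thoughts
    remaining = cot_steps[k:]                # CoT steps not yet replaced
    return question + ["<bot>"] + latent_slots + ["<eot>"] + remaining + answer

# Stage 2 with c = 2: the first two CoT steps become four continuous thoughts.
print(build_stage_example(
    question=["Q:", "3 + 4 * 2 = ?"],
    cot_steps=["4 * 2 = 8", "3 + 8 = 11"],
    answer=["A:", "11"],
    k=2, c=2,
))
```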

### Reasoning Process
The reasoning process of Coconut mirrors the decoding process of a standard language model, except that in latent mode the last hidden state is fed back directly as the next input embedding. A key question is when to switch between latent mode and language mode. In the problem-solving setting, a `<bot>` token is inserted immediately after the question tokens. For `<eot>`, there are two strategies: train a binary classifier on the latent thoughts so the model decides autonomously when to terminate latent reasoning, or pad the latent thoughts to a constant length. The experiments use the second option because it is simpler (see the sketch below).
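
Putting the pieces together, here is a hedged sketch of inference with the constant-length padding strategy. It assumes a model fine-tuned with `<bot>`/`<eot>` added to the vocabulary (a plain GPT-2 tokenizer would split them into subwords), and the thought count is an illustrative hyperparameter:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def coconut_decode(question, n_thoughts=6, max_new_tokens=32):
    # <bot> is inserted right after the question tokens.
    ids = tok(question + " <bot>", return_tensors="pt").input_ids
    embeds = model.transformer.wte(ids)
    with torch.no_grad():
        # Latent mode: a fixed number of continuous thoughts (padding strategy).
        for _ in range(n_thoughts):
            out = model(inputs_embeds=embeds, output_hidden_states=True)
            thought = out.hidden_states[-1][:, -1:, :]
            embeds = torch.cat([embeds, thought], dim=1)
        # <eot> switches back to language mode; then decode tokens greedily.
        eot = model.transformer.wte(tok(" <eot>", return_tensors="pt").input_ids)
        embeds = torch.cat([embeds, eot], dim=1)
        answer = []
        for _ in range(max_new_tokens):
            next_id = model(inputs_embeds=embeds).logits[:, -1, :].argmax(-1, keepdim=True)
            answer.append(next_id.item())
            embeds = torch.cat([embeds, model.transformer.wte(next_id)], dim=1)
    return tok.decode(answer)

print(coconut_decode("Q: 3 + 4 * 2 = ?"))
```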

## Experiments

### Datasets and Tasks
The feasibility of large language models reasoning in a continuous latent space is verified on three datasets covering two task types: mathematical reasoning and logical reasoning. For mathematical reasoning, the GSM8k dataset is used. For logical reasoning, 5-hop ProntoQA and ProsQA, a dataset developed by the research team, are adopted. ProntoQA asks the model to judge whether one category is a subclass of another in a synthetic ontology (e.g., given "Every wumpus is a yumpus" and "Alex is a wumpus", decide whether "Alex is a yumpus" holds). ProsQA is a more challenging task built from many randomly generated directed acyclic graphs, requiring the model to perform substantial planning and search.

### Experimental Setup
A pre-trained GPT-2 model is used with a learning rate of 1×10^−4 and a batch size of 128. For the mathematical reasoning task, each reasoning step is represented by 2 latent thought vectors and training proceeds through 4 curriculum stages; for the logical reasoning tasks, each step is represented by 1 latent thought vector across 7 curriculum stages. After the staged curriculum, all experiments continue training through epoch 50, and the checkpoint with the best validation accuracy is selected for the final test (the setup is summarized below).
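
For quick reference, the reported setup can be collected into a small snippet (the values are those stated above; the dict structure itself is only for readability):

```python
# Training configuration as reported in the article (layout is illustrative).
configs = {
    "gsm8k (math)":            dict(base_model="gpt2", lr=1e-4, batch_size=128,
                                     thoughts_per_step=2, curriculum_stages=4),
    "prontoqa/prosqa (logic)": dict(base_model="gpt2", lr=1e-4, batch_size=128,
                                     thoughts_per_step=1, curriculum_stages=7),
}
```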

### Baseline Methods and Coconut Variants
- Baseline methods: traditional CoT, No-CoT, iCoT, and pause tokens.
- Coconut variants: a version without curriculum learning (w/o curriculum), a version without continuous thoughts (w/o thought), and a version that replaces continuous thoughts with pause tokens (pause as thought).

## Results and Discussion

### Continuous Thought Enhances Reasoning Ability
Continuous thoughts effectively enhance the reasoning ability of large language models; Coconut even outperforms CoT on ProntoQA and ProsQA.

### “Chained” Combination Enhances Reasoning Ability
In the GSM8k experiments, Coconut outperforms other architectures trained with similar strategies. As c, the number of latent thoughts per language reasoning step, increases, model performance steadily improves, suggesting that a chaining effect analogous to CoT also exists in the latent space.

### Latent Space Reasoning is Superior to Language Reasoning in Planning-Intensive Tasks
The problem structures of GSM8k and ProntoQA are intuitive and have limited branching, so predicting the next step is relatively easy. The randomly generated DAG structure of ProsQA, by contrast, challenges the model's planning ability. There, CoT shows no significant improvement over No-CoT, while Coconut, its variants, and iCoT substantially improve reasoning, indicating that latent-space reasoning has an advantage on tasks requiring extensive planning.

### The Model Needs Guidance to Learn Latent Space Reasoning
The version of Coconut without curriculum learning performs no better than No-CoT, showing that the model needs staged guidance to learn to reason in the latent space.