DeepSeek-R1 Tops o1 on Hard Test Set

The most challenging large model test set has arrived, crafted by over a thousand experts and presenting an unprecedented level of difficulty. To date, not a single model, including the renowned o1, has scored above 10%. Here is the heart of the story:

This test set features questions compiled by more than 1000 scholars from over 500 institutions, with the final selection consisting of over 3000 questions, all at the graduate level and above. The questions span various disciplines, including mathematics, physics, chemistry, biomedicine, engineering, and social sciences, with more than 100 sub-disciplines.

Core Content of the Test Set

The test set is divided into eight categories, with mathematics dominating at 42%. Every question is required to be graduate-level in difficulty, not answerable via simple retrieval or web search, and must have a clear answer and an objective evaluation method.

Model Performance

Models like o1, known for strong reasoning capabilities, achieved only a 9.1% accuracy rate. However, DeepSeek-R1 excelled in the pure text subset, taking the top spot with a significant advantage over o1.

Selection Process

1. Questions undergo a dual review process by both large models and human judges.
2. Preliminary screening is conducted with large models: a question passes the initial filter only if the models answer it incorrectly, or, for multiple-choice questions, if model accuracy falls below random guessing.
3. After more than 70,000 attempts, 13,000 questions entered the human review phase.
4. Human review consists of two rounds: the first by domain experts and the second by the organizing committee and outstanding reviewers.
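The preliminary screening rule in step 2 can be sketched as a simple predicate. This is an illustrative reconstruction, not the organizers' actual code; the function name and signature are assumptions:

```python
def passes_initial_filter(model_answers, correct_answer, num_choices=None):
    """Hypothetical sketch of the preliminary screening rule.

    model_answers: answers given by the screening models.
    correct_answer: the question's reference answer.
    num_choices: number of options if multiple-choice, else None.
    """
    accuracy = sum(a == correct_answer for a in model_answers) / len(model_answers)
    if num_choices is not None:
        # Multiple choice: keep questions where models do no better than chance.
        return accuracy < 1.0 / num_choices
    # Open-ended: keep only questions every model answers incorrectly.
    return accuracy == 0.0
```

For example, a four-option multiple-choice question that models answer correctly 20% of the time would pass (20% < 25% chance), while an open-ended question any model gets right would be filtered out.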

Details of the Test Set

The final selection includes over 3000 questions, forming a substantial public dataset and a smaller private dataset.

– Some questions test the model’s visual capabilities, such as interpreting ancient scripts.
– Others require a combined understanding of visual and textual information, like the structure of chemical compounds in organic chemistry.
– Some purely test the knowledge base, with the difficulty still reaching the graduate level.

About the Test Set Creation

This test set aims to challenge the top models. Here’s more information:

The project was initiated by the Center for AI Safety and Scale AI, with contributors from all around the globe. Contributors whose questions are selected receive rewards ranging from $500 to $5000.

Key Points

– Model scores do not exceed 10%
– Over 3000 graduate-level questions
– Coverage of multiple disciplines
– A rigorous selection process

The test set is seen as the “ultimate human exam,” presenting a monumental challenge to today’s models. In pure text tasks, DeepSeek-R1 emerged as the best performer.