Doubao Code Model Exposed! In ByteDance’s Latest Open Source Benchmark, Its Performance in Multiple Programming Languages is Second Only to OpenAI/Claude.

Breaking News: Doubao-Coder (Preview Version) Revealed in ByteDance’s Open Source Code Model Evaluation Benchmark!

On December 25, 2024, an exciting discovery was made in the field of code models: ByteDance’s code model evaluation benchmark, FullStack Bench, has unveiled the previously undisclosed Doubao-Coder (Preview version).

Back in June 2024, ByteDance’s AI programming assistant, Doubao MarsCode, was rumored to be powered by the Doubao-Coder model. Today, it generates millions of lines of code for users every month.

The brand-new code model evaluation benchmark, FullStack Bench, addresses the limitations of existing benchmarks, which struggle to reflect the real capabilities of code models because of monotonous question types and limited coverage of domains and languages. FullStack Bench is an evaluation dataset focused on full-stack, multi-language programming: it spans 11 real-world full-stack programming scenarios, covers 16 programming languages, and contains 3,374 problems. Its application domains are extracted from Stack Overflow and cover 88.1% of the major domains there. Each problem consists of a question description, a reference solution, unit test cases, and labels; the problems were designed by programming experts and reviewed through a combination of AI and manual checks.

The team also open-sourced SandboxFusion, a sandbox for evaluating programming tasks across different languages. It is compatible with over 10 datasets, supports 23 programming languages, and can be deployed on a single server or tried online.
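To make the dataset structure concrete, here is a minimal sketch of what a FullStack Bench-style problem record and its unit-test check could look like. The field names and the `passes_all_tests` helper are hypothetical illustrations, not the dataset’s actual schema or the team’s evaluation code.

```python
# Hypothetical example of a benchmark problem record: a question description,
# a reference solution, unit test cases, and labels (field names are assumed).
problem = {
    "question": "Write a function `add(a, b)` that returns the sum of a and b.",
    "labels": {"domain": "Basic Programming", "language": "python"},
    "reference_solution": "def add(a, b):\n    return a + b\n",
    "unit_tests": [
        "assert add(1, 2) == 3",
        "assert add(-1, 1) == 0",
    ],
}

def passes_all_tests(solution_code: str, tests: list[str]) -> bool:
    """Execute a candidate solution, then run every unit test against it."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)   # define the candidate function
        for test in tests:
            exec(test, namespace)        # raises AssertionError on failure
        return True
    except Exception:
        return False

print(passes_all_tests(problem["reference_solution"], problem["unit_tests"]))  # True
```

In the real benchmark, this execution step is what SandboxFusion isolates: candidate code runs in a sandbox per language rather than in the host process as in this toy sketch.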

Evaluation Results: Based on FullStack Bench, the team evaluated the programming performance of more than 20 code models and general language models worldwide. In cross-domain performance, OpenAI o1-preview leads, and several open-source models also perform well. Performance varies significantly across fields, with the largest gap in mathematical programming. In cross-language performance, most models handle Bash programming tasks well, while results diverge widely on C++, C, and Ruby; some small models show weak multi-language ability. On difficult problems, closed-source models are generally stronger than open-source ones. Moreover, a “Reflection” strategy that feeds SandboxFusion’s execution results back to the model significantly outperforms the “Best-of-N” (BoN) sampling strategy, indicating that the sandbox’s feedback context is effective. For more details, see the arXiv link in the article or follow the official “Doubao Large Model Team” account.