On February 26, 2025, SuperCLUE released the first leaderboard for its project-level code generation benchmark (SuperCLUE-Project).
For the evaluation methodology, see: Project-level Code Generation Evaluation Benchmark Release. Working through cooperating large-model "referees", the evaluation assesses the project-level code generation capabilities of 12 domestic and overseas large models across five major application scenarios, including game development, tools, and management systems. The detailed evaluation report follows.
Summary of the project-level code evaluation
Summary 1: o3-mini-high and Claude-3.7-Sonnet-Reasoning lead the list
In this evaluation, OpenAI's o3-mini-high achieved an overall score of 82.08 and Anthropic's newly released reasoning model Claude-3.7-Sonnet-Reasoning reached 81.63; the two models top the list almost side by side.
Summary 2: DeepSeek-R1 Leads Domestic Models, Ranks Among the Industry's First Tier
The evaluation results show that the score gap between DeepSeek-R1 and frontier models such as o3-mini-high, Claude-3.5-Sonnet/3.7-Sonnet-Reasoning, and Gemini-2.0-pro is very small, and DeepSeek-R1 even takes the lead in some application scenarios.
Summary 3: Each model has its own strengths: R1 excels at game development, o3/Step R at multimedia editing, and several models at web applications
The 12 participating models show clear differences across application scenarios: for example, DeepSeek-R1 stands out in "game development", Claude-3.5-Sonnet, Doubao 1.5pro, and Tongyi Qianwen Max are stronger in "web application" design, and Step R-mini has a distinct advantage in building "multimedia editing" tools.
Summary 4: Models differ significantly in implementation choices and interface styles
Comparing the model answers reveals that, for the same user requirements, different models differ markedly in their choice of programming language, the libraries/modules they call, and how much attention they pay to interface aesthetics, which to some extent reflects differences in the models' capabilities, preferences, and design philosophies.
Leaderboard overview
SuperCLUE-Project Assessment System
SuperCLUE-Project is a Chinese-native project-level code evaluation benchmark designed to assess the ability of large models to turn users' project-level requirements into code implementations.
SuperCLUE-Project focuses on the practical needs of non-programmer users, covers 5 first-level dimensions and 18 second-level dimensions, and constructs its question set in natural Chinese. Given the characteristics of non-programmers, the question design only specifies requirements at the functional level; efficiency, security, readability, and other indicators are treated as abilities the participating models must demonstrate on their own and are assessed during scoring.
In addition, the benchmark has three difficulty levels (easy, medium, complex): the same set of topics is scaled up as a whole across levels to provide deeper insight into the models' project-level code implementation capabilities.
Methodology
Following the SuperCLUE fine-grained assessment approach, the evaluation proceeds as follows:
1) Evaluation Set Construction
1. Track developments in large-model-assisted low-code/zero-code development, and collect and organize project-level code requirements from non-programmer users
2. Write an easy-difficulty set of project-level code evaluation questions
3. Extend the evaluation set to medium/complex difficulty while controlling the format and length
4. Test and manually calibrate the set
2) Scoring process
1. Prepare the evaluation-rule prompt
2. Run small-scale tests to manually check the consistency between the referee models' evaluations and those of human experts
3. Iteratively tune the evaluation rules based on the consistency feedback
4. Feed the complete set of responses from the models under test, together with the evaluation rules, into the two referee models to obtain two complete sets of evaluations
5. Take the mean of the two referee models' scores on each dimension as the final result (a minimal sketch of this aggregation follows this list)
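For illustration, here is a minimal sketch of the final aggregation step (step 5), using the per-dimension scores that the two referees give in Evaluation Case 1 below; the dictionary structure and dimension keys are assumptions for illustration, not the benchmark's actual implementation:

```python
# Per-dimension scores from the two referee models for one question-answer
# pair (values taken from Evaluation Case 1 below; the structure is assumed).
gemini_scores = {"functional": 15, "efficiency": 2, "readability": 2,
                 "security": 1, "user_experience": 1}   # referee total: 21
qwen_scores = {"functional": 15, "efficiency": 3, "readability": 1,
               "security": 1, "user_experience": 2}     # referee total: 22

# Step 5: the final score on each dimension is the mean of the two referees.
final = {dim: (gemini_scores[dim] + qwen_scores[dim]) / 2 for dim in gemini_scores}
final["total"] = sum(final.values())
print(final)  # total = 21.5, matching the overall rating reported for Case 1
```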
3) Human Consistency Analysis
A stratified sample of the evaluation set is drawn, and the consistency between the referee models' evaluations and those of human experts is tested by computing the intraclass correlation coefficient, whose value is reported.
Compared with previous benchmarks, SuperCLUE-Project is the first to use one domestic and one overseas model (Qwen-Max and Gemini-2.0-flash) as cooperating referees; this "referee team" further reduces the bias and preference problems of a single large-model judge.
In addition, to verify the reliability of the referee models, SuperCLUE-Project introduces the intraclass correlation coefficient (ICC) for the first time: the two-way mixed-effects, average-measures index ICC(3,k) is computed over the ratings of the human experts, Qwen-Max, and Gemini-2.0-flash, verifying that the referee models are strongly consistent with human ratings. Compared with the percentage agreement used previously, this method better suppresses the fluctuation caused by random error.
(*Note: the intraclass correlation coefficient (ICC) is a reliability coefficient used to measure inter-rater reliability and test-retest reliability; it was first used by Bartko in 1966 to quantify reliability. The ICC equals the between-subject variability divided by the total variability. In this experiment, the two-way mixed-effects consistency index was chosen because we only need to consider the consistency between the selected referee models' and the human experts' ratings, without generalizing to other raters.)
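To make the index concrete, here is a minimal sketch of computing ICC(3,k), the two-way mixed-effects, average-measures consistency index, from an items-by-raters score matrix. The formula is the standard ANOVA-based one; the rating values are hypothetical, and this is not the benchmark's own analysis code:

```python
import numpy as np

def icc_3k(scores: np.ndarray) -> float:
    """ICC(3,k): two-way mixed effects, average measures, consistency.
    `scores` is an (n_items, k_raters) matrix, e.g. question-level totals
    from the human expert and the two referee models.
    ICC(3,k) = (MS_items - MS_error) / MS_items from a two-way ANOVA."""
    n, k = scores.shape
    grand_mean = scores.mean()
    item_means = scores.mean(axis=1)    # mean score per question
    rater_means = scores.mean(axis=0)   # mean score per rater

    ss_total = ((scores - grand_mean) ** 2).sum()
    ss_items = k * ((item_means - grand_mean) ** 2).sum()
    ss_raters = n * ((rater_means - grand_mean) ** 2).sum()
    ss_error = ss_total - ss_items - ss_raters

    ms_items = ss_items / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_items - ms_error) / ms_items

# Hypothetical example: 5 question-answer pairs scored by 3 raters
# (human expert, Qwen-Max, Gemini-2.0-flash), each on the 25-point scale.
ratings = np.array([[21.0, 21.5, 22.0],
                    [18.0, 17.5, 18.5],
                    [24.0, 23.0, 23.5],
                    [12.0, 13.0, 12.5],
                    [20.0, 19.5, 20.5]])
print(f"ICC(3,k) = {icc_3k(ratings):.3f}")  # values >= 0.75 indicate high agreement
```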
Evaluation criteria
- Functional Completeness (60%): whether the code fully implements all functions described in the user instructions.
- Code Quality (28%): evaluates the code in terms of efficiency, readability, and security, specifically:
a. Efficiency (12%): whether the code is sufficiently optimized in terms of resource usage, DOM manipulation, database/large-data-set handling, computation, or API calls.
b. Readability (8%): whether the code (1) uses clear naming and consistent formatting; (2) divides the code base into reasonable modules; and (3) maintains a clear project structure.
c. Security (8%): whether the code (1) has no obvious security vulnerabilities; and (2) handles basic exceptions effectively.
- User Experience (12%): evaluates the quality and aesthetics of the user interface design, including whether interactive elements (e.g., buttons, forms) work properly and whether the overall interface is reasonably attractive.
Compared with earlier evaluation criteria, SuperCLUE-Project moves away from a relatively balanced scoring scheme and significantly increases the weight of functional implementation, the ability ordinary users care about most.
In addition, SuperCLUE-Project's evaluation criteria adopt a deduction-based scoring mode: starting from a default full score, points are deducted wherever the code implementation, compared against the question, fails to meet its requirements. For this question-by-question, single-response evaluation method, the deduction scheme partly compensates for the large-model referee's weakness at judging the relative quality of multiple responses and alleviates the randomness of large-model evaluation.
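For illustration, here is a minimal sketch of how such a deduction-based score could be aggregated on the 25-point per-question scale that appears in the evaluation cases below (functional completeness 15, efficiency 3, readability 2, security 2, user experience 3, matching the 60/12/8/8/12 weights); the dimension names, data structures, and deduction values are assumptions for illustration, not the benchmark's actual scoring code:

```python
# Per-dimension full marks on a 25-point scale, matching the 60/12/8/8/12
# percentage weights (an illustrative reconstruction, not official code).
FULL_MARKS = {
    "functional_completeness": 15,  # 60%
    "efficiency": 3,                # 12%
    "readability": 2,               # 8%
    "security": 2,                  # 8%
    "user_experience": 3,           # 12%
}

def deduction_score(deductions):
    """Start every dimension at full marks, subtract the referee's deductions
    (clamped at zero), and return per-dimension scores plus the total."""
    scores = {dim: max(full - deductions.get(dim, 0), 0)
              for dim, full in FULL_MARKS.items()}
    scores["total"] = sum(scores.values())
    return scores

# Hypothetical deductions for one question-answer pair.
print(deduction_score({"efficiency": 1, "security": 1, "user_experience": 2}))
# -> functional 15, efficiency 2, readability 2, security 1, UX 1, total 21
```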
Participating Models
To comprehensively measure how well current domestic and overseas large models can handle project-level code requirements, seven representative domestic models and five representative overseas models were selected for this evaluation.
Evaluation results
Overall Leaderboard
Application Scenario Leaderboard
Referee and Human Consistency Analysis
To scientifically evaluate the consistency of the SuperCLUE-Project benchmark with human experts, we selected three models with better, average, and poorer overall performance from the evaluation results, drew a stratified sample of 33 question-answer pairs, had them scored independently by human experts and question by question by the two referee models, and report both the pairwise percentage agreement of the total scores and the intraclass correlation coefficient across the three raters, thereby scientifically examining the professionalism of the large-model "referee panel" scoring.
Note: an ICC(3,k) value of 0.75 or higher is generally considered to indicate high inter-rater agreement.
Evaluation Cases
Example 1 Game Development (Puzzle, Difficult)
Prompt: Write code to implement a tile-matching (Lianliankan) game. The board randomly displays closely spaced, multi-colored m*n tiles (m and n are specified by the user before the game starts and must both be integers greater than 5). When the user clicks two tiles, if they have the same color and can be connected by a path (horizontal and vertical moves only) with no more than two turns, the two tiles are eliminated; if the elimination rule is not satisfied, the two clicked tiles blink three times. The player has three chances to shuffle the arrangement of the existing tiles. If the user performs two successful eliminations in a row within 3 seconds, "Combo!" is displayed on the screen.
Large Model Referee 1 (Gemini-2.0-flash):
Requirements contained in the user instruction (based on independent analysis):
1. Create the game interface: create an m*n grid of tiles, with m and n specified by the user (integers greater than 5).
2. Randomize tile colors: randomly generate tiles of multiple colors and ensure each color appears an even number of times so that tiles can be eliminated in pairs.
3. Elimination rule: when the user clicks two tiles of the same color that can be connected by a path with no more than two turns, eliminate the two tiles.
4. Error feedback: if the two clicked tiles do not satisfy the elimination rule, the clicked tiles blink three times.
5. Shuffle feature: allow the player to shuffle the tile arrangement, with three chances in total.
6. Combo reward: if the user performs two successful eliminations within 3 seconds, display "Combo!".
Overall comments on code defects: the code implements a basic tile-matching game, covering random tile generation, the elimination rule, error feedback, the shuffle feature, and the combo reward. However, it falls short in user experience and efficiency; for example, the tile-blinking effect is not smooth enough, and the interface does not update promptly after tiles are eliminated.
Scores by criterion:
-- Functional completeness: [[15]] points (all functions are implemented, including random tile generation, the elimination rule, the shuffle feature, and the combo reward; since all core functions are present, no points are deducted).
-- Code quality: [[5]] points, of which:
Efficiency: [[2]] points (when generating the board, the code first builds a list containing all colors and then shuffles it randomly, which may hurt efficiency when m and n are large; in addition, a breadth-first search is used to decide whether two tiles can be eliminated, and although it finds the shortest path, it can be slow on large boards; deduct 1 point).
Readability: [[2]] points (naming is generally clear, but some code blocks lack comments, e.g. the breadth-first search is not explained in enough detail; the overall structure is good but could be further modularized, e.g. by separating game logic from rendering; no points deducted).
Security: [[1]] point (there are no obvious security vulnerabilities, but the user-supplied m and n only receive a simple range check, and other invalid input, such as non-integers, is not handled; deduct 1 point).
-- User experience: [[1]] point (the game interface basically runs, but the tile-blinking effect is not smooth and the interface does not update promptly after elimination; the game also lacks sound and animation effects, so the user experience could be improved; deduct 2 points).
In summary, the overall score of this code implementation is <<21>> points.
Large Model Referee 2 (Qwen-Max):
Requirements contained in the user instruction (based on independent analysis):
1. The game interface should randomly display closely spaced, multi-colored m*n tiles, with m and n specified by the user as integers greater than 5.
2. When the user clicks two tiles, if they have the same color and the connecting path (horizontal and vertical directions only) has no more than two turns, the two tiles are eliminated.
3. If the elimination rule is not satisfied, the two clicked tiles blink three times.
4. The player has three chances to shuffle the arrangement of the existing tiles.
5. If the user performs two successful eliminations within 3 seconds, "Combo!" is displayed on the screen.
Overall comments on code defects: the code implements the basic game logic, including generating the board, checking whether tiles are connected, and handling player input. The code structure is fairly clear, but some parts could be further optimized for readability and efficiency.
Scores by criterion:
-- Functional completeness: [[15]] points. The implementation fully satisfies all functional requirements in the user instruction: it generates the m*n tiles with m and n specified by the user as integers greater than 5; clicking two tiles of the same color connected by a path with no more than two turns eliminates them; when the elimination rule is not satisfied, the two clicked tiles blink three times; the player has three chances to shuffle the existing tiles; and "Combo!" is displayed when the user performs two successful eliminations within 3 seconds.
-- Code quality: [[5]] points, of which:
Efficiency: [[3]] points. Resource usage is generally reasonable and there is no obvious performance bottleneck; although problems that might arise in extreme cases are not considered, full marks are kept.
Readability: [[1]] point. Variable naming is fairly clear, but the logic in the is_connected function is somewhat convoluted and hard to follow, and the logic inside the main loop could also be improved through better modularization.
Security: [[1]] point. The code lacks effective validation of user input, such as ensuring that a positive integer is entered, and has no handling for some potential errors (e.g. failed file operations).
-- User experience: [[2]] points. The game interface is basically presentable and the interactive elements work properly; since the tile-blinking behaviour matches the requirement, the user experience is reasonably consistent.
In summary, the overall score of this code implementation is <<22>> points.
Overall rating: [21.5/25]
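Both referees comment on the breadth-first search used to check whether two tiles can be linked with at most two turns. As an illustration only (not the evaluated models' code), a minimal sketch of such a connectivity check, assuming a grid in which empty cells are None, might look like this:

```python
from collections import deque

def connectable(board, a, b, max_turns=2):
    """Check whether tiles a and b (row, col) share a color and can be joined
    by a horizontal/vertical path through empty cells (None) using at most
    `max_turns` turns. Illustrative sketch, not the evaluated models' code."""
    if a == b or board[a[0]][a[1]] is None or board[a[0]][a[1]] != board[b[0]][b[1]]:
        return False
    rows, cols = len(board), len(board[0])
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    best = {}  # (row, col, direction) -> fewest turns used to reach that state
    queue = deque()
    for d, (dr, dc) in enumerate(moves):           # the first step costs no turn
        queue.append((a[0] + dr, a[1] + dc, d, 0))
    while queue:
        r, c, d, turns = queue.popleft()
        if turns > max_turns or not (0 <= r < rows and 0 <= c < cols):
            continue
        if (r, c) == b:
            return True                            # reached the matching tile
        if board[r][c] is not None:                # path may only cross empty cells
            continue
        if best.get((r, c, d), max_turns + 1) <= turns:
            continue                               # already reached with fewer turns
        best[(r, c, d)] = turns
        for nd, (dr, dc) in enumerate(moves):      # changing direction adds a turn
            queue.append((r + dr, c + dc, nd, turns + (nd != d)))
    return False

# Example: two same-colored tiles joined by an L-shaped path (one turn).
grid = [["R", None, None],
        [None, None, None],
        [None, None, "R"]]
print(connectable(grid, (0, 0), (2, 2)))  # True
```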
Example 2 Game Development (Shooter, Difficult)
Prompt: Write code to implement a simple air-combat game. The player uses the left and right arrow keys to move a plane at the bottom of the screen, avoiding obstacles falling from above, and presses the spacebar to shoot at the enemy planes above; the enemy planes move left and right randomly and fire back. The initial life value is 3; every time the player hits an obstacle or is hit by an enemy plane, the life value decreases by 1, and the game ends when the life value reaches 0. The first level has 3 enemy planes, and each subsequent level adds 3 more. There are two firing modes: Mode A (default) fires only straight ahead and destroys an enemy plane with one hit; Mode B fires in multiple directions and needs two hits to destroy an enemy plane. Press the "Q" key to switch between modes A and B.
[o3-mini-high code effect demo]:
Overall rating: [22/25]
Example 3 Shortcut Tool (Daily Office, Medium)
Prompt: Write code to implement an English text-processing tool. The user inputs text, and the tool can quickly perform operations such as word counting, word-frequency sorting, case conversion, removing spaces and line breaks, and adding line numbers. In addition, the tool can store multiple user-defined replacement rules and apply them all at once. Users can save text to a favorites list with a custom title.
Overall rating: [20.5/25]
Example 4 Web Application (Web Visualization, Difficult)
Prompt: Write code to implement a fashion showcase website with multiple user-uploaded images that rotate automatically, with thumbnails at the bottom of the page. Images switch with a card-flip visual effect. Hovering over an image shows its details through a magnifier effect. There is a "turn off light" button in the top right corner of the page: the default background is white, clicking "turn off light" turns the background black and changes the button to "turn on light". The page background has an effect of slowly falling flower petals. A start/pause icon button in the upper left corner controls the start and pause of the image rotation; each rotating image has a white heart icon in its lower right corner that turns pink when clicked, with the number of clicks displayed to its right.
Overall rating: [23/25]
Example 5 Web Application (Educational Learning, Difficult)
Prompt: Write code to implement a vocabulary memorization website that shows the user a word and four definition options. If the user chooses the correct option, the site moves on to the next word; if the user chooses a wrong option, the correct option is shown before moving on. Each group has five words and there are three groups in total; after each group ends, the user can choose to stop studying or learn another group of words. After the session ends, the overall accuracy for the session is displayed. The user's wrong answers are recorded automatically, and the user can click "Switch to Review Mode" at the top of the interface to answer the missed questions again. The question order is randomized, i.e. it is usually different each time the site is visited.
[Qwen-Max code effect demo]:
Overall rating: [19/25]
Evaluation Analysis and Conclusions
1. o3-mini-high and Claude-3.7-Sonnet-Reasoning lead the list
In this evaluation, OpenAI's o3-mini-high achieved an overall score of 82.08, while Anthropic's newly released reasoning model Claude-3.7-Sonnet-Reasoning achieved 81.63; the two models top the list almost side by side.
2. DeepSeek-R1 leads domestic models and is among the first tier of the industry
The evaluation results show that DeepSeek-R1 has a very small score gap with industry-leading models such as o3-mini-high, Claude-3.5-Sonnet/3.7-Sonnet-Reasoning, and Gemini-2.0-pro. Its performance is especially outstanding in the "game development" and "web application" scenarios, where it reaches or surpasses the level of models such as Claude-3.5-Sonnet and Gemini-2.0-pro.
3. Each model has its own strengths: R1 excels at game development, o3/Step R at multimedia editing, and several models at web applications
The 12 participating models show clear differences in capability across application scenarios. DeepSeek-R1 stands out in "game development"; Claude-3.5-Sonnet, Doubao 1.5pro, GLM-Zero-preview, and Tongyi Qianwen Max are stronger in "web application" design; and o3-mini-high and Step R-mini have a distinct advantage in building "multimedia editing" tools.
4. Models differ significantly in implementation choices and interface styles
Comparing the model answers shows that, for the same user requirements, different models differ markedly in their choice of programming language, the libraries/modules they call, and how much attention they pay to interface aesthetics, which to some extent reflects differences in the models' capabilities, preferences, and design philosophies. Overall, overseas models perform better in user interface design.
Relevant examples are listed below:
Question one:
Write code to implement a simple online food ordering website that supports adding dishes to a shopping cart, changing dish quantities with "+" and "-" buttons, displaying the total price of the cart in real time, and placing an order with one click. After the order is placed, the shopping cart is emptied and the customer is asked whether the food should be packed to go. For every $100 of the total amount, a $10 discount is applied.
Question two:
Write code to implement a basketball shooting game. Mouse movement controls the direction of the basketball, and holding down the mouse button charges the shot power; making a basket scores points, consecutive baskets earn bonus points, and the game ends after three missed shots. While choosing the direction and charging power, the intended flight trajectory should be marked with a dotted line; after the ball is thrown, its flight trajectory should be clearly displayed. Before shooting, the left and right arrow keys can be used to move the ball's starting position; short-range shots score 2 points, and shots beyond a certain distance score 3 points. The ball may also hit the rim and bounce into the basket.