On February 26, 2025, SuperCLUE released the inaugural leaderboard of its project-level code generation benchmark, SuperCLUE-Project.
For the evaluation scheme, see: Project-level Code Generation Evaluation Benchmark Release. Relying on the collaboration of large-model "referees", the evaluation examines the project-level code generation capabilities of 12 domestic and overseas large models across five major application scenarios, including game development, utility tools, and management systems. The detailed evaluation report follows.
Summary of the project-level code evaluation
Highlight 1: o3-mini-high & Claude-3.7-Sonnet-Reasoning in the lead
In this evaluation, OpenAI's o3-mini-high achieved a composite score of 82.08 and Anthropic's newly released reasoning model Claude-3.7-Sonnet-Reasoning reached 81.63, with the two jointly topping the leaderboard.
Highlight 2: DeepSeek-R1 leads domestic models and ranks in the industry's first tier
The evaluation results show that the score gap between DeepSeek-R1 and the industry's cutting-edge models such as o3-mini-high, Claude-3.5-Sonnet/3.7-Sonnet-Reasoning, and Gemini-2.0-pro is very small, and R1 even takes the lead in some application scenarios.
Highlight 3: Each has its own strengths: R1 excels in game development, o3-mini-high and Step R-mini in multimedia editing, and several models in web applications
The 12 participating models show clear differences in capability across application scenarios: DeepSeek-R1 stands out in "game development"; Claude-3.5-Sonnet, Doubao 1.5 Pro, and Tongyi Qianwen Max are stronger in "web application" design; and StepFun's Step R-mini has a distinctive advantage in developing "multimedia editing" tools.
Highlight 4: Models differ significantly in technical choices and interface styles
A comparison of the model answers shows that, faced with the same user requirements, different models choose markedly different programming languages and libraries/modules and differ noticeably in how much attention they pay to interface aesthetics, which to some extent reflects differences in the models' capabilities, preferences, and design philosophies.
Overview of the list
SuperCLUE-Project Assessment System
SuperCLUE-Project is a Chinese native project-level code evaluation benchmark designed to examine the ability of large models to turn users' project-level requirements into code implementations.
SuperCLUE-Project focuses on the real needs of non-programmer users, covers 5 first-level dimensions and 18 second-level dimensions, and constructs its question set in natural Chinese. Given the characteristics of non-programmers, the questions describe requirements only at the functional level; efficiency, security, readability, and other qualities are treated as abilities the participating models must demonstrate on their own and are assessed during evaluation.
In addition, the benchmark has three difficulty levels, easy, medium, and complex, obtained by scaling the same question set holistically, to provide deeper insight into the models' project-level code implementation capability.
Methodology
Referring to the SuperCLUE fine-grained assessment approach, the following process is followed to conduct the assessment:
1) Measurement Set Construction
1. Track developments in large-model-assisted low-code/zero-code development, and collect and organize code project requirements from non-programmer users
2. Write the easy-difficulty project-level code evaluation set
3. Extend the evaluation set to medium/complex difficulty levels while controlling format and length
4. Test and manually calibrate the set
2) Scoring process
1. Prepare the evaluation-rule prompt
2. Run small-scale tests and manually check the consistency of the referee models' evaluations with those of human experts
3. Iteratively tune the evaluation rules based on the consistency feedback
4. Feed the complete set of responses from the models under test, together with the evaluation rules, to the two referee models to obtain their full evaluations
5. Take the mean of the two referee models' scores on each dimension as the final result
3) Human Consistency Analysis
Stratified samples are drawn from the evaluation set, and the consistency between the referee models' evaluations and those of human experts is tested by computing the intraclass correlation coefficient and reporting the result.
Compared with previous benchmarks, SuperCLUE-Project for the first time brings in both a domestic and an overseas model (Qwen-Max and Gemini-2.0-flash) as referees, further reducing the bias and preference problems of large-model judging through the collaboration of this "referee team".
In addition, to verify the reliability of the referee models, SuperCLUE-Project introduces the intraclass correlation coefficient (ICC) for the first time: the two-way mixed-effects index ICC(3,k) is computed over the ratings of human experts, Qwen-Max, and Gemini-2.0-flash, verifying that the referee models are strongly consistent with human ratings. Compared with the simple percentage agreement used in the past, this method effectively reduces the influence of random error.
(*Note: The intraclass correlation coefficient (ICC) is a reliability index used to measure inter-rater reliability and test-retest reliability; it was first used for this purpose by Bartko in 1966. The ICC equals the between-subject variability divided by the total variability. In this experiment, the two-way mixed-effects consistency index was chosen because we only need to consider agreement between the selected referee models and the human experts, without generalizing to other raters.)
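For reference, ICC(3,k) can be computed from a two-way ANOVA decomposition of an n-targets-by-k-raters score matrix. The following is a minimal sketch (not SuperCLUE's released tooling), with a toy ratings matrix standing in for human-expert, Qwen-Max, and Gemini-2.0-flash scores on sampled question-answer pairs:

```python
import numpy as np

def icc3k(scores: np.ndarray) -> float:
    """ICC(3,k): two-way mixed effects, consistency, average of k raters.
    `scores` is an (n_targets x k_raters) matrix of ratings."""
    n, k = scores.shape
    grand_mean = scores.mean()
    row_means = scores.mean(axis=1)   # per-target (per-answer) means
    col_means = scores.mean(axis=0)   # per-rater means

    ss_total = ((scores - grand_mean) ** 2).sum()
    ss_rows = k * ((row_means - grand_mean) ** 2).sum()   # between targets
    ss_cols = n * ((col_means - grand_mean) ** 2).sum()   # between raters
    ss_error = ss_total - ss_rows - ss_cols               # residual

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows

# Toy example: 5 sampled answers rated by 3 raters (human, Qwen-Max, Gemini-2.0-flash).
ratings = np.array([
    [21.0, 21.5, 22.0],
    [15.0, 14.0, 15.5],
    [24.0, 23.0, 23.5],
    [10.0, 11.0, 10.5],
    [18.0, 18.5, 17.0],
])
print(round(icc3k(ratings), 3))  # a value near 1 indicates strong agreement
```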
Evaluation criteria
- Functional Integrity (60%): ensures that the code fully implements all functions described in the user instructions.
- Code Quality (28%): evaluates code performance in terms of efficiency, readability, and security. Specifically includes:
a. Efficiency (12%): whether the code is sufficiently optimized in terms of resource usage, DOM manipulation, database/large data set handling, computation or API calls.
b. Readability (8%): whether the code (1) uses clear naming and consistent formatting; (2) divides the code base into reasonable modules; and (3) maintains a clear project structure.
c. Security (8%): Whether the code (1) has no obvious security holes; and (2) can handle basic exceptions effectively.
- User Experience (12%): Evaluates the quality of user interface design and aesthetics, including the proper functioning of interactive elements (e.g., buttons, forms) and the basic aesthetics of the overall interface.
Compared with past evaluation criteria, SuperCLUE-Project moves away from a relatively balanced scoring scheme and significantly increases the weight given to functional implementation, which is also the ability ordinary users care about most.
In addition, the SuperCLUE-Project criteria specify a deduction-based scoring mode: starting from a default full score, points are deducted wherever a comparison of the question with the corresponding code implementation shows that requirements are not met. For this question-by-question, individual evaluation method, the deduction scheme partly compensates for the large-model referee's weakness in judging the relative quality of multiple responses and alleviates the randomness issues of large-model evaluation.
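To make the weights and the deduction mode concrete, here is a minimal, hypothetical sketch of how a per-question score might be assembled on a 25-point scale (15 points functional integrity, 3 efficiency, 2 readability, 2 security, 3 user experience, mirroring the 60/12/8/8/12 weights), with each referee deducting points from the default full marks and the two referees' totals averaged:

```python
# Hypothetical per-dimension full marks on a 25-point scale (60/12/8/8/12 weights).
FULL_MARKS = {
    "functional_integrity": 15,
    "efficiency": 3,
    "readability": 2,
    "security": 2,
    "user_experience": 3,
}

def referee_score(deductions: dict) -> float:
    """Deduction-based scoring: start from full marks, subtract per-dimension deductions."""
    return sum(max(full - deductions.get(dim, 0), 0) for dim, full in FULL_MARKS.items())

# Illustrative deductions by two referee models for one question
# (they happen to mirror Example 1 below).
referee_1 = {"efficiency": 1, "security": 1, "user_experience": 2}   # total 21
referee_2 = {"readability": 1, "security": 1, "user_experience": 1}  # total 22
final_score = (referee_score(referee_1) + referee_score(referee_2)) / 2
print(final_score)  # 21.5, the mean of the two referees' totals
```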
Participating Models
To comprehensively measure how well current domestic and overseas large models handle project-level code requirements, seven highly representative domestic models and five overseas models were selected for this evaluation.
Evaluation results
Overall List
Application Scenario List 
Referee and Human Consistency Analysis
To scientifically evaluate the consistency between the SuperCLUE-Project benchmark and human experts, we selected three models with better, average, and poorer overall performance from the evaluation results, drew a stratified sample of 33 question-answer pairs, had human experts score them independently and the two referee models score them question by question, and report the percentage agreement among the three raters' total scores and the intraclass correlation coefficient across the three sets of scores, thereby scientifically examining the professionalism of the "referee panel" scoring.
Note: An ICC(3,k) value of 0.75 or higher is generally considered to indicate high inter-rater agreement.
Evaluation Cases
Example 1 Game Development (Puzzle, Difficult)
Prompt: Write code to implement a tile-matching (Lianliankan) game. The board randomly displays closely spaced m*n squares in multiple colors (m and n are specified by the user before the game starts and are both integers greater than 5). When the user clicks two squares, if they have the same color and can be connected by a path (horizontal and vertical moves only) with no more than two turns, the two squares are eliminated; if the elimination rule is not satisfied, the two clicked squares blink three times. The player has three chances to shuffle the arrangement of the remaining squares. If the user successfully performs two eliminations in a row within 3 seconds, "Combo!" is displayed on the screen.
Referee Model 1 (Gemini-2.0-flash):
Requirements contained in the user instruction (based on independent analysis):
1. Create the game interface: a grid of m*n squares, with m and n specified by the user (integers greater than 5).
2. Randomize square colors: randomly generate squares in multiple colors and ensure each color appears an even number of times so they can be eliminated in pairs.
3. Elimination rule: when the user clicks two squares of the same color that can be connected by a path with no more than two turns, eliminate both.
4. Error feedback: if the two clicked squares do not satisfy the elimination rule, they blink three times.
5. Shuffle function: the player has three chances in total to shuffle the arrangement of the squares.
6. Combo reward: if the user performs two eliminations in a row within 3 seconds, "Combo!" is displayed.

Overall review of code defects: The code implements a basic tile-matching game, covering random square generation, the elimination rule, error feedback, the shuffle function, and the combo reward. However, it falls short in user experience and efficiency; for example, the blinking effect is not smooth enough, and the interface is not refreshed promptly after squares are eliminated.

Scoring per criterion:
- Functional integrity: [[15]] points (all features are implemented, including randomly generated squares, the elimination rule, the shuffle function, and the combo reward; since all core features are present, no points are deducted).
- Code quality: [[5]] points, of which:
  - Efficiency: [[2]] points (the board is generated by creating a list of all colors and shuffling it randomly, which may hurt efficiency when m and n are large; a breadth-first search is used to decide whether two squares can be eliminated, which finds a shortest path but may be time-consuming on a large board; 1 point deducted).
  - Readability: [[2]] points (naming is generally clear, but some code blocks lack sufficient comments, e.g. the breadth-first search is not explained in detail; the overall structure is good but could be modularized further, e.g. by separating game logic from interface drawing; no points deducted).
  - Security: [[1]] point (no obvious security vulnerabilities, but only a simple range check is applied to the user inputs m and n, with no handling of other exceptions such as non-integer input; 1 point deducted).
- User experience: [[1]] point (the game interface basically works, but the blinking effect is not smooth, the interface is not refreshed promptly after elimination, and the game lacks sound and animation effects; 2 points deducted).

In summary, the overall score for this code implementation is 21 points.
Referee Model 2 (Qwen-Max):
Requirements contained in the user instruction (based on independent analysis):
1. The game interface randomly displays closely spaced m*n squares in multiple colors, where m and n are specified by the user and are integers greater than 5.
2. When the user clicks two squares, if they have the same color and can be connected by a path (horizontal and vertical moves only) with no more than two turns, both squares are eliminated.
3. If the elimination rule is not satisfied, the two clicked squares blink three times.
4. The player has three chances to shuffle the arrangement of the existing squares.
5. If the user performs two eliminations in a row within 3 seconds, "Combo!" is displayed on the screen.

Overall review of code defects: The code implements the basic game logic, including generating the board, detecting whether squares are connectable, and handling player input. The structure is relatively clear, but some parts could be further optimized for readability and efficiency.

Scoring per criterion:
- Functional integrity: [[15]] points (the implementation fully satisfies all functional requirements in the user instruction: m*n squares are generated with m and n specified by the user as integers greater than 5; two same-colored squares connected by a path with no more than two turns are eliminated; the two clicked squares blink three times when the rule is not satisfied; the player has three shuffle chances; and "Combo!" is displayed when two eliminations succeed within 3 seconds).
- Code quality: [[5]] points, of which:
  - Efficiency: [[3]] points (resources are used reasonably overall, with no obvious performance bottlenecks; full marks are kept even though possible extreme cases are not considered).
  - Readability: [[1]] point (variable naming is fairly clear, but the logic in the function is_connected is somewhat complex and hard to follow, and the logic inside the main loop could be modularized more sensibly).
  - Security: [[1]] point (the code lacks effective validation of user input, such as ensuring the input is a positive integer, and has no mechanism for handling certain potential errors, e.g. failed file operations).
- User experience: [[2]] points (the interface is basically presentable and the interactive elements work properly; the number of times the squares blink meets the requirement, keeping the experience consistent).

In summary, the overall score for this code implementation is 22 points.
Overall rating: [21.5/25] (the mean of the two referees' totals, 21 and 22)
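Referee 1's comments above mention a breadth-first search for deciding whether two same-colored squares can be joined with at most two turns. As a rough, hypothetical sketch of that kind of check (not the evaluated model's actual code), a segment-by-segment breadth-first expansion in Python might look like this, where `grid` holds color values and 0 marks an empty cell:

```python
def connectable(grid, a, b, max_turns=2):
    """True if the same-colored cells a and b can be joined by a horizontal/vertical
    path through empty cells (value 0) using at most `max_turns` turns."""
    rows, cols = len(grid), len(grid[0])
    (ar, ac), (br, bc) = a, b
    if a == b or grid[ar][ac] == 0 or grid[ar][ac] != grid[br][bc]:
        return False

    def passable(r, c):
        # The path may only cross empty cells, except that it may end on b.
        return (r, c) == (br, bc) or grid[r][c] == 0

    frontier, visited = {a}, {a}
    for _ in range(max_turns + 1):        # 3 straight segments = at most 2 turns
        next_frontier = set()
        for r, c in frontier:
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                # Slide in a straight line until blocked or off the board.
                while 0 <= nr < rows and 0 <= nc < cols and passable(nr, nc):
                    if (nr, nc) == (br, bc):
                        return True
                    if (nr, nc) not in visited:
                        visited.add((nr, nc))
                        next_frontier.add((nr, nc))
                    nr, nc = nr + dr, nc + dc
        frontier = next_frontier
    return False
```

Each outer iteration adds one more straight segment to the candidate paths, so three iterations cover every path with at most two turns.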
Example 2 Game Development (Shooter, Difficult)
Prompt: Write code to implement a simple air combat game. The player uses the left and right arrow keys to maneuver a plane at the bottom of the screen to avoid obstacles falling from above, and presses the spacebar to shoot at enemy planes above, which move left and right randomly and fire back. The initial life value is 3; each time the player hits an obstacle or is hit by an enemy plane, the life value decreases by 1, and the game ends when it reaches 0. There are 3 enemy planes in the first level and 3 more in each subsequent level. There are two firing modes: Mode A (the default) fires only straight ahead, and one hit destroys an enemy plane; Mode B fires in multiple directions, and two hits are required to destroy an enemy plane. Press the "Q" key to switch between modes A and B.
[o3-mini-high code effect demo]:
Overall rating: [22/25]
Example 3 Shortcut Tool (Daily Office, Medium)
Prompt: Write code to implement an English text processing tool. The user inputs text, and the tool can quickly perform operations such as word count, word-frequency sorting, case conversion, removal of spaces and line breaks, and adding line numbers. In addition, the tool can save multiple user-defined replacement rules and apply them all in one go. Users can save texts to a favorites list with custom titles.
Overall rating: [20.5/25]
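As a small illustration of the kind of functionality this prompt asks for (a hypothetical sketch, not any participating model's answer), the word-count, frequency-sorting, and replacement-rule operations could be implemented along these lines:

```python
import re
from collections import Counter

def word_stats(text):
    """Return the total word count and words sorted by descending frequency."""
    words = re.findall(r"[A-Za-z']+", text.lower())  # tokenize English words
    freq = Counter(words)
    return len(words), freq.most_common()

def apply_replacements(text, rules):
    """Apply all saved user-defined replacement rules in sequence."""
    for old, new in rules.items():
        text = text.replace(old, new)
    return text

count, ranking = word_stats("To be or not to be, that is the question.")
print(count)        # 10
print(ranking[:3])  # [('to', 2), ('be', 2), ('or', 1)]
```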
Example 4 Web Application (Web Visualization, Difficult)
Prompt:Write code to implement a fashion showcase website with multiple images (uploaded by the user) that rotate automatically, with thumbnails located at the bottom of the page. The images are switched using a card flip visual effect. When hovering over an image, a magnifying glass is used to show the details. There is a "turn off light" button on the top right corner of the page, the default background is white, after clicking "turn off light", the background becomes black, and the button becomes "turn on light". The background of the page has the effect of slowly falling flower petals. There is a start/pause icon button in the upper left corner to control the start and pause of the picture rotation; there is a white heart icon in the lower right corner of each rotating picture, which turns into pink when clicked, and the number of times the heart has been clicked is displayed on the right side.
Overall rating: [23/25]
Example 5 Web Application (Educational Learning, Difficult)
Prompt: Write code to implement a vocabulary memorization website that shows the user a word and four definition options. If the user chooses the correct option, the site moves on to the next word; if the user chooses a wrong option, the correct option is shown before moving on. Each group has five words and there are three groups in total; at the end of each group, the user can choose to end the session or study another group. After finishing, the overall accuracy rate for the session is displayed. Wrong answers are recorded automatically, and the user can click "Switch to Review Mode" at the top of the interface to re-answer the questions answered incorrectly. The question order is randomized, i.e. it is usually different each time the site is visited.
[Qwen-Max code effect demo]:
Overall rating: [19/25]
Measurement Analysis and Conclusion
1. o3-mini-high and Claude-3.7-Sonnet-Reasoning in the lead
In this evaluation, OpenAI's o3-mini-high achieved a composite score of 82.08, while Anthropic's newly released reasoning model Claude-3.7-Sonnet-Reasoning achieved 81.63, with the two jointly topping the leaderboard.
2. DeepSeek-R1 leads domestic models and is among the first tier of the industry
The evaluation results show that DeepSeek-R1 has a very small score gap with industry-leading models such as o3-mini-high, Claude-3.5-Sonnet/3.7-Sonnet-Reasoning, and Gemini-2.0-pro, and it performs especially well in the "game development" and "web application" scenarios, matching or surpassing models such as Claude-3.5-Sonnet and Gemini-2.0-pro.
3. Each has its own strengths: R1 excels in game development, o3-mini-high and Step R-mini in multimedia editing, and several models in web applications
The 12 participating models show clear differences in capability across application scenarios. DeepSeek-R1 stands out in "game development"; Claude-3.5-Sonnet, Doubao 1.5 Pro, Zhipu GLM-Zero-preview, and Tongyi Qianwen Max are stronger in "web application" design; and o3-mini-high and StepFun's Step R-mini hold a distinctive advantage in developing "multimedia editing" tools.
4. Models differ significantly in technical choices and interface styles
Comparing the model answers shows that, faced with the same user requirements, different models choose markedly different programming languages and libraries/modules and differ noticeably in how much attention they pay to interface aesthetics, which to some extent reflects differences in the models' capabilities, preferences, and design philosophies. Overall, the overseas models perform better in user interface design.
Relevant examples are listed below:
Question one:
Write code to implement a simple online food-ordering website: dishes can be added to a shopping cart, their quantities changed with "+" and "-" buttons, and the total price of the dishes in the cart displayed in real time, with an order placed by clicking a button. After the order is placed, the shopping cart is emptied and the customer is asked whether the food should be packed to go. For every $100 of the total amount, $10 is discounted.
Question two:
Write code to implement a basketball shooting game: the mouse position controls the shooting direction, and holding down the mouse button charges the shot power. A basket scores points, consecutive baskets earn bonus points, and the game ends after three misses. While choosing the direction and charging power, the intended flight trajectory is previewed with a dotted line; after the ball is thrown, its actual flight trajectory is clearly displayed. Before shooting, the left and right arrow keys can move the ball's starting position; short-range shots score 2 points, and shots beyond a certain distance score 3 points. The ball may also hit the rim and bounce in.