Chinese Benchmark Assessment of Scientific Reasoning (SuperCLUE-Science) Program Launched

With the rapid development of AI technology, the ability of large language models to reason about difficult, graduate-level scientific questions has become a hot research topic. Take OpenAI as an example: its new model OpenAI o1, officially released in early December, has shown strong scientific reasoning. o1 performed remarkably well on GPQA-Diamond, a benchmark that tests graduate-level expertise in physics, chemistry, and biology, demonstrating abilities comparable to those of human PhD-level experts.


To evaluate large models' performance in this area more effectively, especially given that many large models with excellent scientific reasoning capabilities are emerging in China, we have drawn on SuperCLUE's accumulated experience to launch SuperCLUE-Science, a comprehensive Chinese scientific reasoning benchmark. The benchmark focuses on how Chinese large models perform on graduate-level science questions and aims to provide a more targeted reference for future model development.

 

SuperCLUE-Science Assessment System

Note: The details of the assessment system are subject to the officially released assessment report.

 

1. Characteristics

(1) Comprehensiveness

The benchmark covers a wide range of knowledge domains and complexity levels, and is broken down into detailed secondary sub-domains under the three disciplines of physics, chemistry, and biology, to ensure a comprehensive assessment of Chinese large models' scientific reasoning ability.

(2) Objectivity

The benchmark places a high value on the objectivity of its scientific questions, which is ensured by constructing the assessment set as well-designed question-answer pairs for questions that have known solutions. In the assessment process, we pay special attention to the accuracy of the answers given by the large model.

(3) Challenge

To measure model performance on complex scenarios and difficult scientific questions, we introduced graduate-level questions that are challenging in both the breadth of knowledge they cover and the depth of reasoning they require.

 

2. Assessment Tasks

To assess large models' graduate-level scientific reasoning more effectively, we evaluate questions in three disciplines (physics, chemistry, and biology) and break each discipline down into detailed secondary sub-domains to ensure comprehensive coverage of the different scientific fields. The secondary sub-domains are listed below:

  • Physics: quantum mechanics, high-energy particle physics, general physics, astrophysics, electromagnetism and photonics, relativistic mechanics, statistical mechanics, condensed matter physics, optics and acoustics
  • Chemistry: organic chemistry, general chemistry, inorganic chemistry, analytical chemistry, physical chemistry
  • Biology: molecular biology, genetics

Next, we will briefly introduce some of the categories and show corresponding examples.

2.1 Quantum mechanics

Quantum mechanics is a cutting-edge field in physics that explores the exotic behavior of particles in the microscopic world. The field involves concepts such as wave-particle duality, quantum superposition and entanglement, and requires an in-depth understanding of the uncertainty principle and the evolution of quantum states. Quantum physics not only challenges the traditional concepts of physics, but also promotes the development of technologies such as quantum computing and quantum communication, making it a key area for scientific exploration and technological innovation.

Example:

2.2 High-energy particle physics

High-energy particle physics is the study of the most fundamental particles in the universe and their interactions. The field encompasses accelerator technology, particle detectors and data analysis, and aims to reveal the fundamental composition of matter and the origin of the universe. High-energy particle physics experiments, such as those at the Large Hadron Collider (LHC), are at the cutting edge of scientific exploration, requiring precise measurements and complex data analysis, as well as a rigorous scientific attitude and the ability to collaborate across disciplines.

Example:

2.3 Organic chemistry

Organic chemistry is the science of studying the structure, properties and synthetic methods of carbon-containing compounds. The field deals with the tetravalent bonding properties of carbon atoms, stereochemistry and reaction mechanisms, and explores the mysteries of natural products and synthetic macromolecules. Organic chemistry not only enriches the theoretical basis of drug development and materials science, but also develops the ability to analyze structures and design syntheses, making it a highly creative part of the field of chemistry.

Example:

2.4 Physical chemistry

Physical chemistry is an interdisciplinary field at the intersection of chemistry and physics that studies the physical basis of chemical phenomena. The field covers thermodynamics, quantum chemistry, electrochemistry and kinetics, and applies the laws of physics to explain the nature of chemical reactions. Physical chemistry not only deepens the understanding of chemical bonding and reaction rates, but also promotes the development of techniques such as catalysis and spectroscopy, and is a bridge between theory and experiment.

Example:

2.5 Genetics

Genetics is the study of the patterns of transmission of genetic information and variation in living organisms. The field involves gene structure, genetic recombination, epigenetics and population genetics, and reveals the origin and evolution of biological diversity. Genetics not only provides the theoretical basis for the diagnosis and treatment of genetic diseases in medicine, but also promotes the development of agricultural breeding and ecological conservation, and is a core field in the life sciences.

Example:

2.6 Molecular biology

Molecular biology is the science of studying the structure and function of biological macromolecules. The field covers DNA replication, transcription and translation, protein folding and interactions, and reveals the molecular mechanisms of life activities. Molecular biology not only deepens the understanding of the regulation of gene expression, but also promotes the development of emerging fields such as gene editing and bioinformatics, and is a key tool for exploring the mysteries of life in the life sciences.

Example:

 

3. Evaluation Methods and Examples

Scoring Methods and Ideas

1. Scoring approach

Following the scoring method of the team's SuperCLUE-CoT "Chained Reasoning" assessment benchmark, we build a dedicated assessment set to evaluate each dimension and provide detailed feedback.

2. Assessment set construction

The Chinese question bank for scientific reasoning is built as follows: 1. collect and organize graduate-level expertise in chemistry, physics, and biology ---> 2. write Chinese scientific reasoning questions ---> 3. test ---> 4. revise and finalize the question bank. The process draws on domestic and international standards, and a dedicated assessment set is constructed for each dimension.

3. Scoring criteria

The assessment process is divided into several key stages. First, the assessment set is prepared to ensure the accuracy and completeness of the input data. Next, the large model's answers are analyzed against detailed assessment criteria. Finally, rigorous scoring rules are applied to score the answers. Each question is paired with a manually calibrated reference answer so that the assessment is objective.

The assessment criteria cover two important dimensions of scientific reasoning, the solution process and the final answer, which ensures a comprehensive assessment of the model's ability to reason about graduate-level science questions.

The scoring rules are quantitative in nature, aiming to ensure the scientific and fair nature of the assessment process. We have also introduced an advanced automated scoring system, which greatly reduces manual intervention and further enhances the efficiency and consistency of the assessment.

The assessment criteria for each dimension are clearly defined in the assessment task. By combining the assessment process, criteria and scoring rules, the questions are fed into the large model, and the assessment results for each dimension are finally obtained. This systematic approach not only improves the accuracy of the assessment, but also provides strong data support for improving the large model.

4. Evaluation criteria

To assess the quality of each large model's responses on the assessment task, we adopt two evaluation criteria.

In the assessment system for scientific reasoning questions, the core rubric focuses primarily on the precision and accuracy of the "final answer" and the rigor of the reasoning steps in the "solution process". Given the application scenarios of large language models, we have tailored and optimized the rubric for scientific reasoning questions to suit their unique challenges.

Scientific reasoning questions are different from conventional scientific questions in that they touch on the academic depth of graduate level, not only covering a wide range of knowledge, but also being more intricate in logical reasoning, forcing the solvers to think out of the traditional thinking mode and adopt innovative thinking strategies. Therefore, when evaluating such questions, we not only strictly check the accuracy of the final answer, but also attach great importance to the clarity and rationality of the reasoning logic in the process of solving the questions, in order to measure the scientific reasoning ability of the solvers in a comprehensive and in-depth manner.

With these two evaluation criteria, we hope to comprehensively and objectively assess the performance of each large model on the measurement tasks and provide valuable feedback for further improvement and optimization.
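The worked examples in this article combine the two criteria by averaging: each criterion is scored out of 5, and the combined score is the mean of the two. A minimal Python sketch of that arithmetic follows. The class and function names are hypothetical, and in the actual assessment the per-criterion scores come from the judging model rather than from code like this.

```python
from dataclasses import dataclass

MAX_POINTS = 5  # each criterion is scored out of 5 in the article's examples


@dataclass(frozen=True)
class CriterionScores:
    """Scores for one answer. Field names are illustrative, not SuperCLUE's."""
    solution_process: int  # Criterion 1, 0 to 5
    final_answer: int      # Criterion 2, 0 to 5

    def __post_init__(self):
        # Reject scores outside the 0..5 range used in the examples.
        for name in ("solution_process", "final_answer"):
            value = getattr(self, name)
            if not 0 <= value <= MAX_POINTS:
                raise ValueError(f"{name} must be between 0 and {MAX_POINTS}, got {value}")


def combined_score(scores: CriterionScores) -> float:
    """Average the two criterion scores, as in 'Combined score = (a + b) / 2'."""
    return (scores.solution_process + scores.final_answer) / 2


if __name__ == "__main__":
    # Score pairs taken from the evaluation examples in this article.
    for process, final in [(1, 0), (3, 0), (4, 5)]:
        print(process, final, combined_score(CriterionScores(process, final)))
```

Because both criteria carry equal weight, an answer with sound reasoning but a wrong final result can score at most 2.5 out of 5.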

Evaluation Examples

We had different large models from China and abroad answer the questions we prepared, and used o1 to evaluate their responses automatically. The following are evaluation examples from two different disciplinary subdomains:

Example 1 Genetics

[Question]:

[Reference Answer]: 38.25%

[Model (Domestic Large Model 1) Answer]:

Super Model Scoring:

Evaluate this answer as follows: Problems and feedback for improvement: The AI assistant's solution was not logically rigorous and failed to properly consider the effect of recombination frequency on gametes, resulting in an incorrect final answer. The solution steps were not complete or clear enough.

  1. Solution process (5 points):

Based on Criterion 1, the AI assistant's solution process is not logically rigorous, has incomplete steps, and does not properly consider the effect of recombination events on genotype proportions; the score under Criterion 1 is 1 point.

Score: 1/5

  2. Final answer (5 points):

Based on Criterion 2, the AI assistant's final answer does not match the reference answer and is incorrect; the score under Criterion 2 is 0 points.

Score: 0/5

Combined score = (1 + 0) / 2 = 0.5

In summary, the AI assistant's answer receives a combined score of 0.5 points.

[Model (Overseas Large Model 1) Answer]:

Super Model Scoring:

Evaluate this answer as follows: The AI assistant's answer showed some logic in solving the question, but there was an error in the key probability calculation, resulting in a final answer that did not match the reference answer. Its understanding and application of the relationship between recombination frequency and genotype probability in genetics needs to be strengthened.

  1. Solution process (5 points):

Based on Criterion 1, the AI assistant's solution steps are partially correct: it identifies the recombination frequencies and attempts to calculate the probabilities of the gamete types, but it makes an error in calculating the combined recombination probability and the probability of the final genotype, which leads to an inaccurate result; the score under Criterion 1 is 3 points.

Score: 3/5

  2. Final answer (5 points):

Based on Criterion 2, the final answer provided by the AI assistant, 12.31%, does not match the reference answer, 38.25%, and is therefore scored as 0 points.

Score: 0/5

Combined score = (3 + 0) / 2 = 1.50

In summary, the AI assistant's answer receives a combined score of 1.50 points.

Example 2 Organic Chemistry

[Question]:

[Reference Answer]: 0.082 ~ 0.093

[Model (Domestic Large Model 2) Answer]:

Super Model Scoring:

Evaluate this answer as follows: The AI assistant incorrectly assumes in its solution that the concentration of hydroxide ions remains constant, resulting in a final calculation that does not match the reference answer. The solution fails to correctly apply the second-order reaction kinetics formula, which affects the accuracy of the answer.

Scoring and evaluation begins:

  1. Solution process (5 points):

The AI assistant's solution process was not logically rigorous, with incomplete steps and an invalid method. It incorrectly simplified the second-order reaction to a first-order reaction and did not correctly apply the second-order kinetics formula; the score under Criterion 1 is 1 point.

Score: 1/5

  2. Final answer (5 points):

The final answer provided by the AI assistant, 0.045 M, does not match the reference answer range of 0.082 ~ 0.093 M, so the answer is incorrect; the score under Criterion 2 is 0 points.

Score: 0/5

Combined score = (1 + 0) / 2 = 0.50

In summary, the AI assistant's answer receives a combined score of 0.50 points.

[Model (Overseas Large Model 2) Answer]:

Super Model Scoring:

Evaluate this answer as follows: The AI assistant's logic is clear and its steps are complete; the solution basically meets the requirements of the question, and the final answer is accurate. However, there is a slight inaccuracy in how the range of values is presented in the final step, and it is recommended that the details of the calculation be checked further to ensure complete accuracy.

Scoring and evaluation begins:

  1. Solution process (5 points):

The AI assistant's solution is logical, with complete and clear steps, is consistent with the kinetic analysis of second-order nucleophilic substitution reactions, and uses an effective and appropriate method. The score under Criterion 1 is 4 points.

Score: 4/5

  2. Final answer (5 points):

The final answer provided by the AI assistant is within the range of the reference answer, meets the requirements of the question, and is correct. The score under Criterion 2 is 5 points.

Score: 5/5

Combined score = (4 + 5) / 2 = 4.50

In summary, the AI assistant's answer receives a combined score of 4.50 points.
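
Criterion 2 in the examples above comes down to comparing the model's final answer with a reference that is either a single value (38.25%) or a range (0.082 ~ 0.093). The sketch below shows one way such a check could be written in Python. It is an illustration only: the function names, the `tolerance` parameter, and the choice to ignore units are assumptions, and the article's actual scoring is performed by a judging model.

```python
def parse_reference(reference: str) -> tuple[float, float]:
    """Parse a reference such as '38.25%' or '0.082 ~ 0.093' into (low, high).

    A single value becomes a zero-width interval. Units and percent signs
    are ignored, which is an assumption made for this sketch.
    """
    parts = [float(p.strip().rstrip("%").strip()) for p in reference.split("~")]
    if len(parts) == 1:
        return parts[0], parts[0]
    if len(parts) == 2:
        return min(parts), max(parts)
    raise ValueError(f"unrecognized reference format: {reference!r}")


def final_answer_matches(answer: float, reference: str, tolerance: float = 1e-9) -> bool:
    """Return True if `answer` lies inside the reference interval.

    `tolerance` (hypothetical) only absorbs floating-point noise at the bounds.
    """
    low, high = parse_reference(reference)
    return low - tolerance <= answer <= high + tolerance


if __name__ == "__main__":
    # Final answers quoted in the two evaluation examples.
    print(final_answer_matches(12.31, "38.25%"))         # False
    print(final_answer_matches(0.045, "0.082 ~ 0.093"))  # False
```

An exact interval test like this is stricter than a human grader would be about rounding, which is one reason a judging model with a rubric may be preferred for free-form answers.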

 

Invitation to Evaluation

Schedule

1. Registration begins: January 2

2. Confirmation of participating models: January 10

3. Assessment and result statistics: January 10-15

4. Release of assessment results: January 16

Assessment process

1. Application by email

2. Communication of Intent

3. Participation Confirmation and Agreement Process

4. Provide model API and documentation

5. Obtaining an evaluation report

To apply for evaluation, send an email to contact@superclue.ai (please use your organization's email address) with the subject "SuperCLUE-Science Chinese Scientific Reasoning Assessment Application". The email should include: organization information, a profile of the large model, the contact person and their department, and contact information.
