MathCLUE introduces the "National High School Mathematics Competition" evaluation: an in-depth assessment of the competition-level mathematical reasoning ability of large models. The evaluation covers several representative areas of high school mathematics, including geometry, algebra, and probability and statistics.
🔥 Evaluated model: DeepSeek-R1 (accessed at chat.deepseek.com)
DeepSeek-R1 Evaluation and Analysis
🔍 DeepSeek-R1 Tops MathCLUE's National High School Mathematics Competition Leaderboard
DeepSeek-R1 topped the National High School Mathematics Competition leaderboard with a score of 87.31 points. That is nearly 10 points ahead of o1, a leading international model, and 26.12 points higher than DeepSeek-R1-Lite-Preview. The large gain in overall score reflects a marked improvement in its mathematical reasoning and problem-solving ability.
Meanwhile, the Qwen2.5-Max results on the "National High School Mathematics Competition" are out. The model fell short of expectations, for reasons analyzed in its section.
🔥 Evaluated model: Qwen2.5-Max
Official API version called: qwen-max-2025-01-25
Qwen2.5-Max Evaluation and Analysis
🔍 Qwen2.5-Max still has room for improvement on the MathCLUE leaderboard
Qwen2.5-Max scored 33.58 points and ranked 9th in the National High School Mathematics Competition evaluation, ahead of well-known overseas models such as Claude 3.5 Sonnet (20241022), which scored 15.67 points. However, it still trails the leading large models in China and abroad by more than 30 points.
To understand this result, we analyzed the questions the model answered incorrectly in depth. On some difficult problems, the model omits the solution process and directly gives a wrong answer. Because this evaluation scores only the final answer, this may be the main reason for its low score.
Evaluation Set
The MathCLUE National High School Mathematics Competition evaluation set covers questions from the 2024 National High School Mathematics Competition and is used to rigorously evaluate large models.
Methodology
For each evaluation task, the final answer in the large model's response is checked against the reference answer to decide whether the response is correct or incorrect, and the model's accuracy is computed from these judgments. Because only the final answer is compared, the scoring is fully objective.
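The final-answer matching described above can be sketched in a few lines of Python. This is only an illustrative sketch, not MathCLUE's actual grader: the names `normalize`, `is_correct` and `accuracy` are hypothetical, the whitespace-and-`$` normalization is an assumption, and so is reporting the score as the percentage of correctly answered questions.

```python
def normalize(answer: str) -> str:
    """Assumed normalization: remove all whitespace and surrounding '$' signs."""
    return "".join(answer.split()).strip("$")


def is_correct(final_answer: str, reference: str) -> bool:
    """Binary judgment: the final answer either matches the reference or not."""
    return normalize(final_answer) == normalize(reference)


def accuracy(final_answers: list[str], references: list[str]) -> float:
    """Percentage of questions whose final answer matches the reference."""
    if len(final_answers) != len(references):
        raise ValueError("each final answer needs exactly one reference answer")
    if not references:
        raise ValueError("the evaluation set must not be empty")
    correct = sum(is_correct(a, r) for a, r in zip(final_answers, references))
    return 100.0 * correct / len(references)


# Hypothetical responses: the third final answer is wrong, so it earns nothing,
# regardless of how much of the solution process was shown.
model_answers = ["3/4", " 12 ", "x=2"]
reference_answers = ["3/4", "12", "x=3"]
print(f"{accuracy(model_answers, reference_answers):.2f}")
```

Note that plain string matching treats mathematically equivalent forms such as `1/2` and `0.5` as different answers, so a production grader would likely need a more careful comparison than this sketch provides.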