Upgraded Claude 3.5 Sonnet Chinese benchmark evaluation is out! Code ability exceeds GPT-4o, higher-order reasoning is not as good as o1

AI News | Updated 10 months ago | AI Sharing Circle


The results of this assessment will be used for academic purposes only.

On October 22, Anthropic officially launched the upgraded large model Claude 3.5 Sonnet as well as the new model Claude 3.5 Haiku.


According to Anthropic, the upgraded Claude 3.5 Sonnet outperforms the previous version across the board, with especially significant gains in agentic coding and tool-use tasks. On SWE-bench Verified, its coding performance improves from 33.4% to 49.0%, outperforming all publicly available models, including OpenAI o1-preview.

Anthropic is also introducing a groundbreaking new feature: computer use. Through the API, developers can direct Claude to operate a computer the way a person does, although the feature is still experimental.

In response to public interest in how the upgraded Claude 3.5 Sonnet performs on reasoning and code in Chinese scenarios, SuperCLUE, a professional third-party evaluation organization, conducted an in-depth evaluation of the model.

Assessment environment

Reference standards: SuperCLUE-Reasoning (Chinese higher-order reasoning benchmark) and SuperCLUE-Code3 (Chinese code benchmark).

Evaluated model: Claude 3.5 Sonnet (1022, latest model snapshot on POE)

Model GenerationConfig:

Refer to Claude's detailed documentation: https://www.anthropic.com/news/3-5-models-and-computer-use

Evaluation sets:

1. SuperCLUE-Reasoning, the Chinese complex-task higher-order reasoning evaluation set. A total of 302 questions covering five difficult reasoning tasks: multi-step reasoning, numerical reasoning, reasoning calculations, market analysis, and optimization problems.

2. SuperCLUE-Code3, the Chinese-native code evaluation set. An upgraded Chinese version of HumanEval with 195 questions and 1,560 test cases, divided into three difficulty levels: primary, intermediate, and advanced.

Methodology:

1. Higher-order reasoning: for each reasoning question, we provide a reference answer and reasoning process that have been manually calibrated and checked. A referee model then scores the candidate model's answer on a 1-5 scale according to the set evaluation process, criteria, and scoring rules. The evaluation criteria are: compliance with basic requirements, accuracy of reasoning, and reasonableness of the conclusion.

2. Code (SC-Code3): designed to evaluate and improve the core code-generation capabilities of Chinese large models through functional unit testing.

SuperCLUE-Code3: Chinese Native Ranked Code Proficiency Benchmarks

Conclusions first

Conclusion 1: Claude 3.5 Sonnet (1022) performs well on Chinese code capability (SC-Code3). Its total score of 84.85 is 1.52 points higher than GPT-4o-0513 and 4.54 points lower than o1-preview. It did especially well on the advanced code tasks, scoring 69.23 and tying with o1-preview.

Conclusion 2: Claude 3.5 Sonnet (1022) scored 74.54 on the Chinese higher-order reasoning task, close to ChatGPT-4o-latest but well behind o1-preview (-10.73 points). Its multi-step reasoning is relatively strong, exceeding ChatGPT-4o-latest.

Conclusion 3: Compared with the previous version, Claude 3.5 Sonnet (0621), Claude 3.5 Sonnet (1022) improved by 4.57 points on the reasoning task, with significant progress on subtasks such as numerical reasoning and computational reasoning.
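The gaps quoted in the conclusions imply scores for the comparison models. The short sketch below works out that arithmetic. The resulting numbers are derived from the stated differences rather than quoted from the article, and the o1-preview code score assumes Claude's 84.85 is 4.54 points below o1-preview.

```python
from decimal import Decimal

# Scores stated in the conclusions for Claude 3.5 Sonnet (1022)
claude_code = Decimal("84.85")
claude_reasoning = Decimal("74.54")

# Implied scores, derived from the stated gaps (not quoted figures)
implied = {
    "GPT-4o-0513 (code)": claude_code - Decimal("1.52"),
    "o1-preview (code)": claude_code + Decimal("4.54"),
    "o1-preview (reasoning)": claude_reasoning + Decimal("10.73"),
    "Claude 3.5 Sonnet 0621 (reasoning)": claude_reasoning - Decimal("4.57"),
}

for name, score in implied.items():
    print(f"{name}: {score}")
```

Decimal is used instead of float so that the two-decimal scores subtract exactly.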

Evaluation results

Comparative analysis

Code evaluation examples

Example 1

Prompt.

from typing import List


def longest_peak_sequence(nums: List[int]) -> int:
    """
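    Find the length of the longest "peak" sequence in an integer array.
    A "peak" sequence is one that strictly increases and then strictly decreases, with a unique peak (i.e., maximum value).
    A "peak" sequence must contain at least three numbers (so that it has both an increasing and a decreasing part).

    Return the length of the longest "peak" sequence in the array. If there is no "peak" sequence, return 0.


    Examples:
    >>> longest_peak_sequence([2, 1, 4, 7, 3, 2, 5])
    5  # the longest "peak" sequence is [1, 4, 7, 3, 2]


    >>> longest_peak_sequence([5, 4, 3, 2, 1])
    0  # no "peak" sequence exists, because there is no strictly increasing part

    Hints:
    - The length of the array `nums` is in the range [0, 10000].
    - Each element of the array is in the range [-10000, 10000].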
    """

Claude 3.5 Sonnet (1022) Answer.

from typing import List


def longest_peak_sequence(nums: List[int]) -> int:
    if len(nums) < 3:
        return 0

    max_length = 0

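    # Iterate from the second element to the second-to-last element
    for i in range(1, len(nums) - 1):
        # Check whether the current position is a peak
        if nums[i] > nums[i-1] and nums[i] > nums[i+1]:
            # Found a peak; expand in both directions
            left = i - 1
            right = i + 1

            # Expand left along the strictly increasing run
            while left > 0 and nums[left-1] < nums[left]:
                left -= 1

            # Expand right along the strictly decreasing run
            while right < len(nums) - 1 and nums[right+1] < nums[right]:
                right += 1

            # Update the longest peak sequence length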
            current_length = right - left + 1
            max_length = max(max_length, current_length)

    return max_length

Test case.

def check(candidate):
    assert candidate([1, 2, 3, 4, 5, 6]) == 0
    assert candidate([1, 3, 5, 6, 4, 2, 0]) == 7
    assert candidate([5, 4, 3, 2, 1, 2, 3, 4, 1]) == 5
    assert candidate([1, 3, 5, 3, 1]) == 5
    assert candidate([1, 2, 3, 3, 2, 1]) == 0
    assert candidate([2, 1, 3, 1, 2, 1, 3]) == 3
    assert candidate([1, 3, 5, 4, 3, 2, 1]) == 7

if_passed (passed or not): Passed.
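The verdict above comes from functional unit testing: the model's function is handed to `check` as `candidate`, and any failed assertion means the answer does not pass. The article does not show SuperCLUE's actual harness, so the following is only a minimal sketch of the idea. `run_check` is a hypothetical helper, and the candidate is a comment-free copy of the answer shown above.

```python
from typing import Callable, List


def longest_peak_sequence(nums: List[int]) -> int:
    # Copy of the answer shown above, with comments removed
    if len(nums) < 3:
        return 0
    max_length = 0
    for i in range(1, len(nums) - 1):
        if nums[i] > nums[i - 1] and nums[i] > nums[i + 1]:
            left, right = i - 1, i + 1
            while left > 0 and nums[left - 1] < nums[left]:
                left -= 1
            while right < len(nums) - 1 and nums[right + 1] < nums[right]:
                right += 1
            max_length = max(max_length, right - left + 1)
    return max_length


def check(candidate):
    # Test case from the article
    assert candidate([1, 2, 3, 4, 5, 6]) == 0
    assert candidate([1, 3, 5, 6, 4, 2, 0]) == 7
    assert candidate([5, 4, 3, 2, 1, 2, 3, 4, 1]) == 5
    assert candidate([1, 3, 5, 3, 1]) == 5
    assert candidate([1, 2, 3, 3, 2, 1]) == 0
    assert candidate([2, 1, 3, 1, 2, 1, 3]) == 3
    assert candidate([1, 3, 5, 4, 3, 2, 1]) == 7


def run_check(candidate: Callable, check_fn: Callable) -> bool:
    """Hypothetical runner: True only if every assertion in check_fn holds."""
    try:
        check_fn(candidate)
    except Exception:
        # A failed assertion or a crash both count as "not passed"
        return False
    return True


if_passed = run_check(longest_peak_sequence, check)
print(if_passed)
```

A candidate that always returns 0 would fail the second assertion, so `run_check(lambda nums: 0, check)` evaluates to `False`.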

Example 2

Prompt.

from typing import List


def longest_increasing_path(matrix: List[List[int]]) -> int:
    """
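    Given an integer matrix, find the length of the longest strictly increasing path.
    A path may start from any cell in the matrix; at each step you may move up, down, left, or right to an adjacent cell, provided the value increases.
    You may not move outside the matrix, nor to a cell that does not satisfy the increasing condition.
    The length of a path is the total number of cells on it.

    Examples:
    >>> longest_increasing_path([
            [9, 9, 4],
            [6, 6, 8],
            [2, 1, 1]
        ])
    4  # the longest increasing path is [1, 2, 6, 9]


    >>> longest_increasing_path([
            [3, 4, 5],
            [3, 2, 6],
            [2, 2, 1]
        ])
    4  # the longest increasing path is [3, 4, 5, 6] or [2, 4, 5, 6]


    Hints:
    - The matrix `matrix` has dimensions m x n.
    - Each element of `matrix` is a non-negative integer.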


    """

Claude 3.5 Sonnet (1022) Answer.

from typing import List


def longest_increasing_path(matrix: List[List[int]]) -> int:
    if not matrix or not matrix[0]:
        return 0

    m, n = len(matrix), len(matrix[0])
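    # Cache storing the longest increasing path length from each position
    cache = [[0] * n for _ in range(m)]

    def dfs(i: int, j: int) -> int:
        # If already computed, return the cached result
        if cache[i][j] != 0:
            return cache[i][j]

        # Four directions: up, down, left, right
        directions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

        # The longest path from the current position is at least 1
        max_length = 1

        # Try all four directions
        for di, dj in directions:
            ni, nj = i + di, j + dj

            # Check that the neighbor is inside the matrix and strictly larger
            if (0 <= ni < m and 0 <= nj < n and
                matrix[ni][nj] > matrix[i][j]):
                # Recursively compute the longest path from the neighbor
                curr_length = 1 + dfs(ni, nj)
                max_length = max(max_length, curr_length)

        # Store in the cache
        cache[i][j] = max_length
        return max_length

    # Visit every position in the matrix to find the longest path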
    result = 0
    for i in range(m):
        for j in range(n):
            result = max(result, dfs(i, j))

    return result

Test case.

def check(candidate):
    assert candidate([[1, 1], [1, 1]]) == 1
    assert candidate([
        [9, 9, 4],
        [6, 6, 8],
        [2, 1, 1]
    ]) == 4
    assert candidate([
        [3, 4, 5],
        [3, 2, 6],
        [2, 2, 1]
    ]) == 4
    assert candidate([
        [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]
    ]) == 7
    assert candidate([
        [-1, -2, -3, 0],
        [5, -6, 7, -8],
        [-9, 0, -11, -12],
        [13, 0, 15, -16]
    ]) == 4
    assert candidate([
        [5, 10, 6, 7, 9],
        [6, 10, 2, 2, 3],
        [4, 9, 8, 2, 8],
        [3, 3, 3, 8, 8]
    ]) == 4
    assert candidate([
        [1, 3, 1, 4],
        [1, 5, 1, 7],
        [4, 2, 1, 1],
        [1, 3, 2, 1]
    ]) == 3
    assert candidate([
        [1, 9, 2, 4],
        [9, 6, 5, 3],
    ]) == 4

if_passed (passed or not): Passed.

Examples of higher-order reasoning

Example 1: Multi-Step Reasoning

Question:

A company has seven employees named A, B, C, D, E, F and G. The company held a grading exercise and based on the grades of the employees and the following conditions, deduce the exact ranking of each employee:

1. A scores higher than B, but lower than C.
2. D scores higher than E and G, but lower than A.
3. G scored the lowest.
4. B scores higher than E and F.
5. F scores higher than E.
6. C scored the highest.
7. Each employee scored differently, with D in the very center.

Please infer the specific ranking of each employee based on the above conditions.

Claude 3.5 Sonnet (1022) replies:

Let's deduce each employee's ranking through logical analysis:

1. From condition 3: G is in last place (7th).
2. From condition 6: C is in first place (1st).
3. From condition 7: D is ranked 4th (the middle position).
4. From condition 2: D is higher than E and G, but lower than A. So:
- E must be ranked after D (5th or 6th)
- A must be ranked before D (2nd or 3rd)
5. From condition 1: A is higher than B but lower than C. Combining this with the previous analysis:
- A is ranked 2nd
- B must be ranked 3rd
6. From conditions 4 and 5: B is higher than E and F, and F is higher than E. So:
- F is ranked 5th
- E is ranked 6th

The final rankings are as follows:

C (1st)
A (2nd)
B (3rd)
D (4th)
F (5th)
E (6th)
G (7th)

Would you like me to show you the detailed reasoning process?
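The ranking in this reply can be checked independently by brute force. The sketch below is not part of the evaluation: it enumerates all 5,040 orderings of the seven employees and keeps those that satisfy every condition in the question. The helper names are illustrative.

```python
from itertools import permutations

EMPLOYEES = "ABCDEFG"


def satisfies(rank: dict) -> bool:
    """rank[x] is x's position, where 1 means the highest score."""

    def higher(x: str, y: str) -> bool:
        return rank[x] < rank[y]

    return (
        higher("C", "A") and higher("A", "B")        # A above B, below C
        and higher("A", "D") and higher("D", "E")    # D below A, above E
        and higher("D", "G")                         # D above G
        and rank["G"] == 7                           # G scored the lowest
        and higher("B", "E") and higher("B", "F")    # B above E and F
        and higher("F", "E")                         # F above E
        and rank["C"] == 1                           # C scored the highest
        and rank["D"] == 4                           # D in the very center
    )


solutions = []
for order in permutations(EMPLOYEES):
    rank = {name: pos for pos, name in enumerate(order, start=1)}
    if satisfies(rank):
        solutions.append("".join(order))

print(solutions)  # ['CABDFEG']
```

Exactly one ordering survives, C, A, B, D, F, E, G, which matches the ranking in the reply.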

Example 2: Optimization problem

Question:

A city has five major transportation hubs, and the city planning department wants to optimize the coverage of bus routes to ensure that each hub is connected by at least one route. The construction costs between the hubs are known, and the goal is to connect all the hubs at minimal cost.

Data:

Connection costs between hubs:
Hub 1 and Hub 2: $100,000
Hub 1 and Hub 3: $150,000
Hub 2 and Hub 4: $200,000
Hub 3 and Hub 4: $250,000
Hub 3 and Hub 5: $300,000
Hub 4 and Hub 5: $180,000

Goal: Determine which hubs should be connected to each other with lines to minimize total construction costs and ensure that all hubs are interconnected.

Claude 3.5 Sonnet (1022) replies:

This is a Minimum Spanning Tree problem, which I will solve using Kruskal's algorithm. Let's analyze it step by step:

1. First, sort all connections from lowest to highest cost:
- Hubs 1-2: $100,000
- Hubs 1-3: $150,000
- Hubs 4-5: $180,000
- Hubs 2-4: $200,000
- Hubs 3-4: $250,000
- Hubs 3-5: $300,000
2. Then select the lowest-cost connections in order, skipping any that would form a cycle:
- Select Hubs 1-2 ($100,000)
- Select Hubs 1-3 ($150,000)
- Select Hubs 4-5 ($180,000)
- Select Hubs 2-4 ($200,000)

Optimal plan:

Connect Hub 1 and Hub 2
Connect Hub 1 and Hub 3
Connect Hub 4 and Hub 5
Connect Hub 2 and Hub 4

Total cost: $100,000 + $150,000 + $180,000 + $200,000 = $630,000

This plan ensures that:

All hubs are connected
No cycle is formed
The total cost is minimized

Would you like me to generate a visualization for you to show this solution?
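The reply's Kruskal walk-through can be reproduced in a few lines. The following is a minimal sketch using a simple union-find structure. The edge data comes from the question, while the function and variable names are illustrative.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[int, int, int]  # (cost in dollars, hub_a, hub_b)

EDGES: List[Edge] = [
    (100_000, 1, 2),
    (150_000, 1, 3),
    (200_000, 2, 4),
    (250_000, 3, 4),
    (300_000, 3, 5),
    (180_000, 4, 5),
]


def kruskal(nodes: Iterable[int], edges: List[Edge]) -> Tuple[List[Tuple[int, int]], int]:
    """Return the chosen connections and total cost of a minimum spanning tree."""
    parent: Dict[int, int] = {n: n for n in nodes}

    def find(x: int) -> int:
        # Follow parent links to the root, shortening the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen: List[Tuple[int, int]] = []
    total = 0
    for cost, a, b in sorted(edges):  # cheapest connection first
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # joining two components cannot form a cycle
            parent[root_a] = root_b
            chosen.append((a, b))
            total += cost
    return chosen, total


chosen, total = kruskal(range(1, 6), EDGES)
print(chosen)  # [(1, 2), (1, 3), (4, 5), (2, 4)]
print(total)   # 630000
```

The selected connections and the $630,000 total agree with the plan in the reply.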

# Join the Community

For more details on the latest Claude 3.5 model evaluation, you can join the SuperCLUE-Claude exchange group.


Article copyright belongs to AI Sharing Circle; please do not reproduce without permission.
