
Upgraded Claude 3.5 Sonnet Chinese benchmark evaluation is out! Code ability exceeds GPT-4o, higher-order reasoning is not as good as o1

The results of this assessment will be used for academic purposes only.

On October 22, Anthropic officially launched the upgraded large model Claude 3.5 Sonnet as well as the new model Claude 3.5 Haiku.


 

According to Anthropic, the upgraded Claude 3.5 Sonnet outperforms its predecessor across the board, with particularly significant gains in agentic coding and tool-use tasks. On SWE-bench Verified, its coding performance improves from 33.4% to 49.0%, surpassing all publicly available models, including OpenAI o1-preview.

Anthropic is also introducing a groundbreaking new capability: computer use. Through the API, developers can direct Claude to use a computer the way a person does, though the feature is still experimental.

 

In response to public interest in how the upgraded Claude 3.5 Sonnet performs on reasoning and code in Chinese scenarios, SuperCLUE, a professional third-party evaluation organization, conducted an in-depth evaluation of the model.

Assessment environment

Reference standards: the SuperCLUE-Reasoning Chinese higher-order reasoning benchmark and the SuperCLUE-Code3 Chinese code benchmark.

Evaluated model: Claude 3.5 Sonnet (1022, latest model snapshot on POE)

Model GenerationConfig:

  • Refer to Anthropic's detailed documentation for Claude: https://www.anthropic.com/news/3-5-models-and-computer-use

Evaluation sets:

1. SuperCLUE-Reasoning, a Chinese higher-order reasoning set for complex tasks. It contains 302 questions across five difficult reasoning tasks: multi-step reasoning, numerical reasoning, reasoning calculations, market analysis, and optimization problems.

2. SuperCLUE-Code3, a native Chinese code evaluation set. It is an upgraded Chinese version of HumanEval with 195 questions and 1,560 test cases, divided into three difficulty levels: primary, intermediate, and advanced.

Methodology:

1. Higher-order reasoning: For each reasoning question, we provide a reference answer and reasoning process that have been manually calibrated and checked. A referee model then scores each candidate model's answer on a 1-5 scale, following the defined evaluation process, criteria, and scoring rules. The criteria are: compliance with basic requirements, accuracy of reasoning, and reasonableness of the conclusion.


2. Code (SC-Code3): The benchmark is designed to evaluate and improve the core code-generation capabilities of Chinese large models through functional unit tests.

SuperCLUE-Code3: Chinese Native Ranked Code Proficiency Benchmarks
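Functional unit testing of this kind can be sketched in a few lines. The harness below is only an illustration under stated assumptions: SuperCLUE's actual harness is not described in this article, and the names `run_functional_test`, `CANDIDATE`, and `TESTS` are hypothetical. It assumes each problem ships a `check(candidate)` function made of assertions, as in the examples later in this article, and counts a problem as passed only if every assertion holds.

```python
from typing import Any, Dict


def run_functional_test(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the generated function passes every assertion in check()."""
    namespace: Dict[str, Any] = {}
    try:
        exec(candidate_src, namespace)  # defines the candidate function
        exec(test_src, namespace)       # defines check(candidate)
        namespace["check"](namespace[entry_point])
    except Exception:
        # A failed assertion, a runtime error, or a syntax error all count as a failure.
        return False
    return True


# Hypothetical toy problem, not taken from SC-Code3.
CANDIDATE = '''
def add(a, b):
    return a + b
'''

TESTS = '''
def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''

print(run_functional_test(CANDIDATE, TESTS, "add"))
```

A real harness would also run the candidate in a sandbox with a timeout, since model-generated code is untrusted; that is omitted here for brevity.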


Let's start with the conclusions

Conclusion 1: Claude 3.5 Sonnet (1022) performs well in Chinese code capability (SC-Code3). Its total score reached 84.85, which is 1.52 points higher than GPT-4o-0513 and 4.54 points lower than o1-preview. It performed especially well on the advanced code tasks, scoring 69.23 and tying with o1-preview.

Conclusion 2: Claude 3.5 Sonnet (1022) scored 74.54 on the Chinese higher-order reasoning task, close to ChatGPT-4o-latest but well behind o1-preview (-10.73 points). Its multi-step reasoning is relatively strong, surpassing ChatGPT-4o-latest.

Conclusion 3: Compared with the previous version, Claude 3.5 Sonnet (0621), Claude 3.5 Sonnet (1022) improved by 4.57 points on the reasoning task, with significant progress on subtasks such as numerical reasoning and computational reasoning.

Evaluation results

[Figure not preserved: evaluation results]

Comparative analysis

[Figures not preserved: comparative analysis]

Code evaluation examples

Example 1

Prompt:

from typing import List

def longest_peak_sequence(nums: List[int]) -> int:
    """
    Find the length of the longest "peak" sequence in an array of integers.
    A "peak" sequence is strictly increasing and then strictly decreasing, and has a unique peak (i.e., a maximum value).
    A "peak" sequence must contain both an increasing and a decreasing part, so it has at least three numbers.

    Returns the length of the longest "peak" sequence in the array. If there is no "peak" sequence, returns 0.

    Example:
    >>> longest_peak_sequence([2, 1, 4, 7, 3, 2, 5])
    5 # The longest "peak" sequence is [1, 4, 7, 3, 2].

    >>> longest_peak_sequence([5, 4, 3, 2, 1])
    0 # There is no "peak" sequence, because there is no strictly increasing part.

    Tips:
    - The length of the array `nums` is in the range [0, 10000].
    - Each element of the array is in the range [-10000, 10000].
    """

Claude 3.5 Sonnet (1022) answer:

from typing import List

def longest_peak_sequence(nums: List[int]) -> int:
    if len(nums) < 3:
        return 0

    max_length = 0

    # Iterate from the second element to the penultimate element
    for i in range(1, len(nums) - 1):
        # Check if the current position is a peak
        if nums[i] > nums[i-1] and nums[i] > nums[i+1]:
            # Found a peak; expand in both directions
            left = i - 1
            right = i + 1

            # Expand to the left to find strictly increasing sequences
            while left > 0 and nums[left-1] < nums[left]:
                left -= 1

            # Expanding to the right, looking for strictly decreasing sequences
            while right < len(nums) - 1 and nums[right+1] < nums[right]:
                right += 1

            # Update longest peak sequence lengths
            current_length = right - left + 1
            max_length = max(max_length, current_length)

    return max_length

Test cases:

def check(candidate):
    assert candidate([1, 2, 3, 4, 5, 6]) == 0
    assert candidate([1, 3, 5, 6, 4, 2, 0]) == 7
    assert candidate([5, 4, 3, 2, 1, 2, 3, 4, 1]) == 5
    assert candidate([1, 3, 5, 3, 1]) == 5
    assert candidate([1, 2, 3, 3, 2, 1]) == 0
    assert candidate([2, 1, 3, 1, 2, 1, 3]) == 3
    assert candidate([1, 3, 5, 4, 3, 2, 1]) == 7

if_passed (passed or not): Passed.
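As a cross-check on this problem, the same result can be computed with a different technique. The sketch below is not Claude's answer and is not part of the benchmark: it is an alternative two-pass approach, and the name `longest_peak_two_pass` is hypothetical. It counts the strictly increasing steps ending at each index and the strictly decreasing steps starting at each index, then treats any index with both counts positive as a peak.

```python
from typing import List


def longest_peak_two_pass(nums: List[int]) -> int:
    n = len(nums)
    if n < 3:
        return 0

    up = [0] * n    # up[i]: number of strictly increasing steps ending at i
    down = [0] * n  # down[i]: number of strictly decreasing steps starting at i

    for i in range(1, n):
        if nums[i] > nums[i - 1]:
            up[i] = up[i - 1] + 1

    for i in range(n - 2, -1, -1):
        if nums[i] > nums[i + 1]:
            down[i] = down[i + 1] + 1

    best = 0
    for i in range(n):
        # A peak needs at least one rising step before it and one falling step after it.
        if up[i] > 0 and down[i] > 0:
            best = max(best, up[i] + down[i] + 1)
    return best


print(longest_peak_two_pass([2, 1, 4, 7, 3, 2, 5]))  # 5
```

Both approaches run in linear time; the two-pass version trades two extra arrays for the absence of nested loops.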

Example 2

Prompt:

from typing import List

def longest_increasing_path(matrix: List[List[int]]) -> int:
    """
    Given a matrix of integers, find the length of the longest strictly increasing path.
    A path may start at any cell in the matrix. At each step you can move up, down, left, or right to a neighboring cell, provided its value is strictly greater.
    You cannot move outside the matrix, nor onto a cell that does not satisfy the increasing condition.
    The length of the path is the total number of cells in the path.

    Example:
    >>> longest_increasing_path([
            [9, 9, 4],
            [6, 6, 8],
            [2, 1, 1]
        ])
    4 # The longest increasing path is [1, 2, 6, 9]

    >>> longest_increasing_path([
            [3, 4, 5],
            [3, 2, 6],
            [2, 2, 1]
        ])
    4 # The longest increasing path is [3, 4, 5, 6] or [2, 4, 5, 6]

    Hints:
    - The matrix `matrix` has dimensions m x n.
    - Each element of the matrix `matrix` is a non-negative integer.

    """

Claude 3.5 Sonnet (1022) answer:

from typing import List

def longest_increasing_path(matrix: List[List[int]]) -> int:
    if not matrix or not matrix[0]:
        return 0

    m, n = len(matrix), len(matrix[0])
    # cache for storing the longest incremental path length for each location
    cache = [[0] * n for _ in range(m)]

    def dfs(i: int, j: int) -> int:
        # If it has already been calculated, return the cached result directly
        if cache[i][j] != 0:
            return cache[i][j]

        # Four directions: up, down, left, right
        directions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

        # The longest path to the current location is at least 1
        max_length = 1

        # Traversing the four directions
        for di, dj in directions:
            ni, nj = i + di, j + dj

            # Check that the neighbor is inside the matrix and satisfies the increasing condition
            if (0 <= ni < m and 0 <= nj < n and
                matrix[ni][nj] > matrix[i][j]):
                # Recursively compute the longest path starting from the neighboring cell
                curr_length = 1 + dfs(ni, nj)
                max_length = max(max_length, curr_length)

        # Store the result in the cache
        cache[i][j] = max_length
        return max_length

    # Iterate through each position in the matrix to find the longest path
    result = 0
    for i in range(m):
        for j in range(n):
            result = max(result, dfs(i, j))

    return result

Test cases:

def check(candidate):
    assert candidate([[1, 1], [1, 1]]) == 1
    assert candidate([
        [9, 9, 4],
        [6, 6, 8],
        [2, 1, 1]
    ]) == 4
    assert candidate([
        [3, 4, 5],
        [3, 2, 6],
        [2, 2, 1]
    ]) == 4
    assert candidate([
        [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]
    ]) == 7
    assert candidate([
        [-1, -2, -3, 0],
        [5, -6, 7, -8],
        [-9, 0, -11, -12],
        [13, 0, 15, -16]
    ]) == 4
    assert candidate([
        [5, 10, 6, 7, 9],
        [6, 10, 2, 2, 3],
        [4, 9, 8, 2, 8],
        [3, 3, 3, 8, 8]
    ]) == 4
    assert candidate([
        [1, 3, 1, 4],
        [1, 5, 1, 7],
        [4, 2, 1, 1],
        [1, 3, 2, 1]
    ]) == 3
    assert candidate([
        [1, 9, 2, 4],
        [9, 6, 5, 3],
    ]) == 4

if_passed (passed or not): Passed.
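For comparison, the same problem can be solved without recursion. The sketch below is an alternative, not Claude's answer and not part of the benchmark; the name `longest_increasing_path_sorted` is hypothetical. It visits cells in ascending order of value, so by the time a cell is processed, every strictly smaller neighbor already holds its final path length.

```python
from typing import List


def longest_increasing_path_sorted(matrix: List[List[int]]) -> int:
    if not matrix or not matrix[0]:
        return 0

    m, n = len(matrix), len(matrix[0])
    # best[i][j]: length of the longest increasing path that ends at cell (i, j)
    best = [[1] * n for _ in range(m)]

    # Process cells from the smallest value to the largest.
    cells = sorted((matrix[i][j], i, j) for i in range(m) for j in range(n))
    for value, i, j in cells:
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            # Extend a path arriving from any strictly smaller neighbor.
            if 0 <= ni < m and 0 <= nj < n and matrix[ni][nj] < value:
                best[i][j] = max(best[i][j], best[ni][nj] + 1)

    return max(max(row) for row in best)


print(longest_increasing_path_sorted([[9, 9, 4], [6, 6, 8], [2, 1, 1]]))  # 4
```

The sort costs O(mn log mn) rather than the O(mn) of memoized depth-first search, but the iterative form cannot hit Python's recursion limit on matrices with very long increasing paths.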

Examples of higher-order reasoning

Example 1: Multi-Step Reasoning

Question:

A company has seven employees named A, B, C, D, E, F and G. The company held a performance evaluation. Based on the employees' scores and the following conditions, deduce the exact ranking of each employee:

  1. A scores higher than B, but lower than C.
  2. D scores higher than E and G, but lower than A.
  3. G scored the lowest.
  4. B scores higher than E and F.
  5. F scores higher than E.
  6. C scored the highest.
  7. Each employee scored differently, with D in the very center.

Please infer the specific ranking of each employee based on the above conditions.

Claude 3.5 Sonnet (1022) replies:

Let's deduce each employee's ranking through logical analysis:

  1. Known from condition 3: G is in last place (7th)
  2. Known from condition 6: C is at the top of the list (1st place)
  3. Known from condition 7: D is ranked 4th (middle position)
  4. We know from condition 2: D is higher than E and G, but lower than A. So:
    • E must be ranked after D (5th or 6th)
    • A must be ranked before D (2nd or 3rd)
  5. We know from condition 1: A is higher than B but lower than C. Combining this with the previous analysis:
    • A is ranked 2nd
    • B must be ranked 3rd
  6. We know from conditions 4 and 5: B is higher than E and F, and F is higher than E. So:
    • F is ranked 5th
    • E is ranked 6th

The final rankings are as follows:

  1. C (1st place)
  2. A (2nd place)
  3. B (3rd place)
  4. D (4th)
  5. F (5th)
  6. E (6th)
  7. G (7th)

Would you like me to show you the detailed reasoning process?
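The ranking in this reply can be checked mechanically. The sketch below is an illustration added for verification, not part of the benchmark or of Claude's reply; the function name `consistent_rankings` is hypothetical. It enumerates all 5,040 orderings of the seven employees and keeps those that satisfy the seven conditions, with rank 1 being the highest score.

```python
from itertools import permutations
from typing import List, Tuple


def consistent_rankings() -> List[Tuple[str, ...]]:
    solutions = []
    for order in permutations("ABCDEFG"):
        # rank[x] == 1 means x has the highest score; 7 means the lowest.
        rank = {name: pos + 1 for pos, name in enumerate(order)}

        def higher(x: str, y: str) -> bool:
            return rank[x] < rank[y]

        if (
            higher("C", "A") and higher("A", "B")       # condition 1
            and higher("A", "D") and higher("D", "E")   # condition 2
            and higher("D", "G")                        # condition 2 (continued)
            and rank["G"] == 7                          # condition 3
            and higher("B", "E") and higher("B", "F")   # condition 4
            and higher("F", "E")                        # condition 5
            and rank["C"] == 1                          # condition 6
            and rank["D"] == 4                          # condition 7
        ):
            solutions.append(order)
    return solutions


print(consistent_rankings())  # [('C', 'A', 'B', 'D', 'F', 'E', 'G')]
```

Exactly one ordering survives, so the conditions determine the ranking uniquely and it matches the answer above.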

Example 2: Optimization problem

Question:

A city has five major transportation hubs, and the city planning department wants to optimize the coverage of bus routes to ensure that each hub is connected by at least one route. The construction costs between the hubs are known, and the goal is to connect all the hubs at minimal cost.

Data:

  • Connection costs between hubs:
  • Hub 1 and Hub 2: $100,000
  • Hub 1 and Hub 3: $150,000
  • Hub 2 and Hub 4: $200,000
  • Hub 3 and Hub 4: $250,000
  • Hub 3 and Hub 5: $300,000
  • Hub 4 and Hub 5: $180,000

Goal: Determine which hubs should be connected to each other with lines to minimize total construction costs and ensure that all hubs are interconnected.

Claude 3.5 Sonnet (1022) replies:

This is a Minimum Spanning Tree problem which I will solve using Kruskal's algorithm. Let's analyze it step by step:

  1. First, sort all connections from lowest to highest cost:
    • Hub 1-2: $100,000
    • Hub 1-3: $150,000
    • Hub 4-5: $180,000
    • Hub 2-4: $200,000
    • Hub 3-4: $250,000
    • Hub 3-5: $300,000
  2. Select the lowest-cost connections in order, skipping any that would form a cycle:
    • Select Hub 1-2 ($100,000)
    • Select Hub 1-3 ($150,000)
    • Select Hub 4-5 ($180,000)
    • Select Hub 2-4 ($200,000)

Optimal plan:

  • Connect Hub 1 and Hub 2
  • Connect Hub 1 and Hub 3
  • Connect Hub 4 and Hub 5
  • Connect Hub 2 and Hub 4

Total cost: $100,000 + $150,000 + $180,000 + $200,000 = $630,000

This plan ensures that:

  1. All hubs are connected
  2. No cycle is formed
  3. The total construction cost is minimized

Would you like me to generate a visualization for you to show this solution?
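The steps in this reply correspond to a standard Kruskal implementation with a union-find structure. The sketch below is an illustration for checking the result, not code produced by Claude in the evaluation; the names `kruskal`, `Edge`, and `EDGES` are hypothetical, and the costs are the dollar amounts listed in the question.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, int]  # (cost, hub_a, hub_b)


def kruskal(nodes: List[int], edges: List[Edge]) -> Tuple[List[Edge], int]:
    parent: Dict[int, int] = {v: v for v in nodes}

    def find(v: int) -> int:
        # Follow parent links to the root, shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    chosen: List[Edge] = []
    for cost, a, b in sorted(edges):  # cheapest connection first
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # skip connections that would form a cycle
            parent[root_a] = root_b
            chosen.append((cost, a, b))
    return chosen, sum(cost for cost, _, _ in chosen)


EDGES: List[Edge] = [
    (100_000, 1, 2),
    (150_000, 1, 3),
    (200_000, 2, 4),
    (250_000, 3, 4),
    (300_000, 3, 5),
    (180_000, 4, 5),
]

chosen, total = kruskal([1, 2, 3, 4, 5], EDGES)
print(chosen)  # [(100000, 1, 2), (150000, 1, 3), (180000, 4, 5), (200000, 2, 4)]
print(total)   # 630000
```

The selected connections and the total match the plan in the reply above.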

# Join the Community

For more details on the latest Claude 3.5 model evaluation, you can join the SuperCLUE-Claude exchange group.
