The results of this assessment will be used for academic purposes only.
On October 22, Anthropic officially launched the upgraded large model Claude 3.5 Sonnet as well as the new model Claude 3.5 Haiku.
According to Anthropic, the upgraded Claude 3.5 Sonnet outperforms its predecessor across the board, with particularly significant gains on agentic coding and tool use tasks. On SWE-bench Verified, its coding performance improves from 33.4% to 49.0%, outperforming all publicly available models, including OpenAI o1-preview.
Anthropic is also introducing a groundbreaking new capability: computer use. Through the API, developers can direct Claude to use a computer the way a person does, although the feature is still experimental.
In response to public interest in how the upgraded Claude 3.5 Sonnet performs on reasoning and code in Chinese scenarios, SuperCLUE, a professional third-party evaluation organization, conducted an in-depth evaluation of the upgraded model.
Assessment environment
Reference standards: SuperCLUE-Reasoning (Chinese higher-order reasoning benchmark) and SuperCLUE-Code3 (Chinese code benchmark).
Evaluated model: Claude 3.5 Sonnet (1022, latest model snapshot on POE)
Model GenerationConfig configuration:
- Refer to Claude's detailed description document: https://www.anthropic.com/news/3-5-models-and-computer-use
Evaluation sets:
1. SuperCLUE-Reasoning, a Chinese higher-order reasoning evaluation set for complex tasks. It contains 302 questions covering five difficult reasoning tasks: multi-step reasoning, numerical reasoning, reasoning calculations, market analysis, and optimization problems.
2. SuperCLUE-Code3, a native Chinese code evaluation set. It is an upgraded Chinese version of HumanEval with 195 questions and 1,560 test cases, divided into three difficulty levels: primary, intermediate, and advanced.
Methodology:
1. Higher-order reasoning: for each reasoning question, we provide a reference answer and reasoning process that have been manually calibrated and checked. A referee model then scores each candidate model's answer on a 1-5 scale according to the defined evaluation process, criteria, and scoring rules. The criteria are compliance with basic requirements, accuracy of reasoning, and reasonableness of the conclusion.
2. Code (SC-Code3): designed to evaluate and improve the core code-generation capabilities of Chinese large models through functional unit testing.
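The functional unit testing described in the code methodology can be sketched roughly as follows. This is a minimal illustration, not SuperCLUE's actual harness: the helper name `passes_unit_tests` and the toy `add` task are hypothetical, and it assumes each item ships a `check(candidate)` function in the HumanEval style.

```python
from typing import Any, Dict


def passes_unit_tests(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the candidate function passes every assertion in check().

    Hypothetical helper for illustration only.
    """
    namespace: Dict[str, Any] = {}
    try:
        exec(candidate_src, namespace)   # defines the candidate function
        exec(test_src, namespace)        # defines check(candidate)
        namespace["check"](namespace[entry_point])
    except Exception:
        # A failed assertion or any runtime error counts as a failed item.
        return False
    return True


# Toy task (an assumption, not taken from the benchmark).
candidate_src = """
def add(a, b):
    return a + b
"""
test_src = """
def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
"""
print(passes_unit_tests(candidate_src, test_src, "add"))
```

A production harness would normally run untrusted model output in a sandboxed subprocess with a timeout rather than calling `exec` in-process.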
SuperCLUE-Code3: Chinese Native Ranked Code Proficiency Benchmarks
Conclusions first
Conclusion 1: Claude 3.5 Sonnet (1022) performs well on Chinese code capability (SC-Code3). Its total score of 84.85 is 1.52 points higher than GPT-4o-0513 and 4.54 points lower than o1-preview. It performed especially well on the advanced code tasks, scoring 69.23 and tying with o1-preview.
Conclusion 2: Claude 3.5 Sonnet (1022) scored 74.54 on the Chinese higher-order reasoning task, close to ChatGPT-4o-latest but well behind o1-preview (-10.73 points). Its multi-step reasoning is relatively strong, exceeding ChatGPT-4o-latest.
Conclusion 3: Compared with the previous version, Claude 3.5 Sonnet (0621), Claude 3.5 Sonnet (1022) improved by 4.57 points on the reasoning task, with significant progress on subtasks such as numerical reasoning and computational reasoning.
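The stated gaps imply the comparison models' scores. The short sketch below only does that arithmetic for the three gaps that are stated unambiguously; the derived values are implied by the figures above, not independently reported numbers.

```python
# Scores reported for Claude 3.5 Sonnet (1022) in the conclusions.
claude_code = 84.85
claude_reasoning = 74.54

# Scores implied by the stated gaps (derived, not independently reported).
implied = {
    "GPT-4o-0513 (code)": round(claude_code - 1.52, 2),
    "o1-preview (reasoning)": round(claude_reasoning + 10.73, 2),
    "Claude 3.5 Sonnet 0621 (reasoning)": round(claude_reasoning - 4.57, 2),
}

for model, score in implied.items():
    print(f"{model}: {score:.2f}")
```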
Evaluation results
Comparative analysis
Code evaluation examples
Example 1
Prompt:
from typing import List

def longest_peak_sequence(nums: List[int]) -> int:
    """
    Find the length of the longest "peak" sequence in an array of integers.
    A "peak" sequence is strictly increasing and then strictly decreasing, with a unique peak (i.e., a maximum value).
    A "peak" sequence must contain both an increasing part and a decreasing part, so it has at least three numbers.
    Return the length of the longest "peak" sequence in the array, or 0 if there is none.

    Example:
    >>> longest_peak_sequence([2, 1, 4, 7, 3, 2, 5])
    5  # The longest "peak" sequence is [1, 4, 7, 3, 2].
    >>> longest_peak_sequence([5, 4, 3, 2, 1])
    0  # There is no "peak" sequence, because there is no strictly increasing part.

    Hints:
    - The length of the array `nums` is in the range [0, 10000].
    - Each element of the array is in the range [-10000, 10000].
    """
Claude 3.5 Sonnet (1022) answer:
from typing import List

def longest_peak_sequence(nums: List[int]) -> int:
    if len(nums) < 3:
        return 0
    max_length = 0
    # Iterate from the second element to the penultimate element
    for i in range(1, len(nums) - 1):
        # Check whether the current position is a peak
        if nums[i] > nums[i-1] and nums[i] > nums[i+1]:
            # Found a peak; expand in both directions
            left = i - 1
            right = i + 1
            # Expand to the left while the sequence is strictly increasing
            while left > 0 and nums[left-1] < nums[left]:
                left -= 1
            # Expand to the right while the sequence is strictly decreasing
            while right < len(nums) - 1 and nums[right+1] < nums[right]:
                right += 1
            # Update the longest peak sequence length
            current_length = right - left + 1
            max_length = max(max_length, current_length)
    return max_length
Test case:
def check(candidate):
    assert candidate([1, 2, 3, 4, 5, 6]) == 0
    assert candidate([1, 3, 5, 6, 4, 2, 0]) == 7
    assert candidate([5, 4, 3, 2, 1, 2, 3, 4, 1]) == 5
    assert candidate([1, 3, 5, 3, 1]) == 5
    assert candidate([1, 2, 3, 3, 2, 1]) == 0
    assert candidate([2, 1, 3, 1, 2, 1, 3]) == 3
    assert candidate([1, 3, 5, 4, 3, 2, 1]) == 7
if_passed (passed or not): Passed.
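As an independent cross-check on Example 1, the same quantity can be computed with a different formulation: count the strictly increasing steps ending at each index and the strictly decreasing steps starting at each index, then combine them at every position that has both. This is an illustrative sketch, not the model's answer; the function name `longest_peak_linear` is hypothetical.

```python
from typing import List


def longest_peak_linear(nums: List[int]) -> int:
    """Length of the longest strictly-up-then-strictly-down run, or 0 if none."""
    n = len(nums)
    if n < 3:
        return 0
    up = [0] * n    # up[i]: strictly increasing steps ending at index i
    down = [0] * n  # down[i]: strictly decreasing steps starting at index i
    for i in range(1, n):
        if nums[i] > nums[i - 1]:
            up[i] = up[i - 1] + 1
    for i in range(n - 2, -1, -1):
        if nums[i] > nums[i + 1]:
            down[i] = down[i + 1] + 1
    best = 0
    for i in range(n):
        # A peak needs at least one step up into it and one step down out of it.
        if up[i] > 0 and down[i] > 0:
            best = max(best, up[i] + down[i] + 1)
    return best


print(longest_peak_linear([2, 1, 4, 7, 3, 2, 5]))  # 5
```

Running both implementations on the same inputs and comparing results is a cheap way to catch errors that a fixed list of assertions might miss.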
Example 2
Prompt:
from typing import List

def longest_increasing_path(matrix: List[List[int]]) -> int:
    """
    Given a matrix of integers, find the length of the longest strictly increasing path.
    A path may start at any cell in the matrix. At each step you can move up, down, left, or right to a neighboring cell, provided its value is strictly greater.
    A path cannot move outside the matrix or onto a cell that does not satisfy the increasing condition.
    The length of a path is the total number of cells in it.

    Example:
    >>> longest_increasing_path([
        [9, 9, 4],
        [6, 6, 8],
        [2, 1, 1]
    ])
    4  # The longest increasing path is [1, 2, 6, 9]
    >>> longest_increasing_path([
        [3, 4, 5],
        [3, 2, 6],
        [2, 2, 1]
    ])
    4  # The longest increasing path is [3, 4, 5, 6] or [2, 4, 5, 6]

    Hints:
    - The matrix `matrix` has dimensions m x n.
    - Each element of the matrix `matrix` is a non-negative integer.
    """
Claude 3.5 Sonnet (1022) answer:
from typing import List

def longest_increasing_path(matrix: List[List[int]]) -> int:
    if not matrix or not matrix[0]:
        return 0
    m, n = len(matrix), len(matrix[0])
    # Cache storing the longest increasing path length starting at each cell
    cache = [[0] * n for _ in range(m)]

    def dfs(i: int, j: int) -> int:
        # If already computed, return the cached result directly
        if cache[i][j] != 0:
            return cache[i][j]
        # Four directions: up, down, left, right
        directions = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        # The longest path from the current cell is at least 1
        max_length = 1
        # Try each of the four directions
        for di, dj in directions:
            ni, nj = i + di, j + dj
            # Check that the neighbor is inside the matrix and strictly greater
            if (0 <= ni < m and 0 <= nj < n and
                    matrix[ni][nj] > matrix[i][j]):
                # Recursively compute the longest path from the neighbor
                curr_length = 1 + dfs(ni, nj)
                max_length = max(max_length, curr_length)
        # Store the result in the cache
        cache[i][j] = max_length
        return max_length

    # Try every cell as a starting point to find the longest path
    result = 0
    for i in range(m):
        for j in range(n):
            result = max(result, dfs(i, j))
    return result
Test case:
def check(candidate):
    assert candidate([[1, 1], [1, 1]]) == 1
    assert candidate([
        [9, 9, 4],
        [6, 6, 8],
        [2, 1, 1]
    ]) == 4
    assert candidate([
        [3, 4, 5],
        [3, 2, 6],
        [2, 2, 1]
    ]) == 4
    assert candidate([
        [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]
    ]) == 7
    assert candidate([
        [-1, -2, -3, 0],
        [5, -6, 7, -8],
        [-9, 0, -11, -12],
        [13, 0, 15, -16]
    ]) == 4
    assert candidate([
        [5, 10, 6, 7, 9],
        [6, 10, 2, 2, 3],
        [4, 9, 8, 2, 8],
        [3, 3, 3, 8, 8]
    ]) == 4
    assert candidate([
        [1, 3, 1, 4],
        [1, 5, 1, 7],
        [4, 2, 1, 1],
        [1, 3, 2, 1]
    ]) == 3
    assert candidate([
        [1, 9, 2, 4],
        [9, 6, 5, 3],
    ]) == 4
if_passed (passed or not): Passed.
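The memoized DFS in Example 2 is recursive, so on a large matrix with one long increasing path it could in principle reach Python's default recursion limit. A non-recursive variant is sketched below under the assumption that sorting all cells by value is acceptable; it is an illustrative alternative, not the model's answer, and the name `longest_increasing_path_iterative` is hypothetical.

```python
from typing import List


def longest_increasing_path_iterative(matrix: List[List[int]]) -> int:
    """Longest strictly increasing 4-neighbor path, computed without recursion."""
    if not matrix or not matrix[0]:
        return 0
    m, n = len(matrix), len(matrix[0])
    # Visit cells from smallest to largest value so that every strictly
    # smaller neighbor is finalized before the cell that extends it.
    cells = sorted((matrix[i][j], i, j) for i in range(m) for j in range(n))
    best = [[1] * n for _ in range(m)]  # best[i][j]: longest path ending at (i, j)
    for value, i, j in cells:
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < m and 0 <= nj < n and matrix[ni][nj] < value:
                best[i][j] = max(best[i][j], best[ni][nj] + 1)
    return max(max(row) for row in best)


print(longest_increasing_path_iterative([[9, 9, 4], [6, 6, 8], [2, 1, 1]]))  # 4
```

The sort makes this O(mn log(mn)) rather than the O(mn) of the memoized DFS, a trade-off in exchange for avoiding deep recursion.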
Examples of higher-order reasoning
Example 1: Multi-Step Reasoning
Question:
A company has seven employees named A, B, C, D, E, F and G. The company held a grading exercise. Based on the employees' scores and the following conditions, deduce the exact ranking of each employee:
1. A scores higher than B, but lower than C.
2. D scores higher than E and G, but lower than A.
3. G scored the lowest.
4. B scores higher than E and F.
5. F scores higher than E.
6. C scored the highest.
7. Each employee scored differently, with D in the very center.
Please infer the specific ranking of each employee based on the above conditions.
Claude 3.5 Sonnet (1022) replies:
Let's deduce each employee's ranking through logical analysis:
- Known from condition 3: G is in last place (7th)
- Known from condition 6: C is at the top of the list (1st place)
- Known from condition 7: D is ranked 4th (middle position)
- We know from condition 2: D is higher than E and G, but lower than A. So:
  - E must be ranked after D (5th or 6th)
  - A must be ranked before D (2nd or 3rd)
- We know from condition 1: A is higher than B but lower than C. Combine this with the previous analysis:
  - A is ranked 2nd
  - B must be ranked 3rd
- We know from conditions 4 and 5: B is higher than E and F, and F is higher than E. So:
  - F is ranked 5th
  - E is ranked 6th
The final rankings are as follows:
- C (1st place)
- A (2nd place)
- B (3rd place)
- D (4th)
- F (5th)
- E (6th)
- G (7th)
Would you like me to show you the detailed reasoning process?
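The ranking in this reply can be verified exhaustively. The sketch below enumerates all 5,040 orderings of the seven employees and keeps those that satisfy every condition from the question; the helper names `higher` and `satisfies` are illustrative.

```python
from itertools import permutations
from typing import Dict, List

EMPLOYEES = "ABCDEFG"


def satisfies(rank: Dict[str, int]) -> bool:
    """rank maps employee -> position, where 1 is the highest score."""

    def higher(x: str, y: str) -> bool:
        return rank[x] < rank[y]

    return (
        higher("A", "B") and higher("C", "A")                          # condition 1
        and higher("D", "E") and higher("D", "G") and higher("A", "D")  # condition 2
        and rank["G"] == 7                                              # condition 3
        and higher("B", "E") and higher("B", "F")                       # condition 4
        and higher("F", "E")                                            # condition 5
        and rank["C"] == 1                                              # condition 6
        and rank["D"] == 4                                              # condition 7
    )


solutions: List[str] = []
for order in permutations(EMPLOYEES):
    rank = {name: pos for pos, name in enumerate(order, start=1)}
    if satisfies(rank):
        solutions.append("".join(order))

print(solutions)  # ['CABDFEG']
```

Exactly one ordering survives, which matches the final ranking given in the reply: C, A, B, D, F, E, G.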
Example 2: Optimization problem
Question:
A city has five major transportation hubs, and the city planning department wants to optimize bus route coverage so that every hub is connected by at least one route. The construction costs between hubs are known, and the goal is to connect all hubs at minimal cost.
Data:
- Connection costs between hubs:
  - Hub 1 and Hub 2: $100,000
  - Hub 1 and Hub 3: $150,000
  - Hub 2 and Hub 4: $200,000
  - Hub 3 and Hub 4: $250,000
  - Hub 3 and Hub 5: $300,000
  - Hub 4 and Hub 5: $180,000
Goal: Determine which hubs should be directly connected to minimize total construction cost while ensuring that all hubs are interconnected.
Claude 3.5 Sonnet (1022) replies:
This is a minimum spanning tree problem, which I will solve using Kruskal's algorithm. Let's analyze it step by step:
- First, sort all connections by cost, from lowest to highest:
  - Hub 1-2: $100,000
  - Hub 1-3: $150,000
  - Hub 4-5: $180,000
  - Hub 2-4: $200,000
  - Hub 3-4: $250,000
  - Hub 3-5: $300,000
- Then select the lowest-cost connections in order, skipping any that would form a cycle:
  - Select Hub 1-2 ($100,000)
  - Select Hub 1-3 ($150,000)
  - Select Hub 4-5 ($180,000)
  - Select Hub 2-4 ($200,000)
Optimal plan:
- Connect Hub 1 and Hub 2
- Connect Hub 1 and Hub 3
- Connect Hub 4 and Hub 5
- Connect Hub 2 and Hub 4
Total cost: $100,000 + $150,000 + $180,000 + $200,000 = $630,000
This plan ensures that:
- All hubs are connected
- No cycles are formed
- Total cost is minimized
Would you like me to generate a visualization for you to show this solution?
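The Kruskal steps in this reply can be reproduced with a short union-find sketch over the six connections from the question. The function and variable names are illustrative, and costs are written as plain dollar amounts.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, int]  # (cost in dollars, hub_a, hub_b)

EDGES: List[Edge] = [
    (100_000, 1, 2), (150_000, 1, 3), (200_000, 2, 4),
    (250_000, 3, 4), (300_000, 3, 5), (180_000, 4, 5),
]


def kruskal(num_hubs: int, edges: List[Edge]) -> Tuple[int, List[Tuple[int, int]]]:
    """Return (total cost, chosen connections) of a minimum spanning tree."""
    parent: Dict[int, int] = {h: h for h in range(1, num_hubs + 1)}

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total = 0
    chosen: List[Tuple[int, int]] = []
    for cost, a, b in sorted(edges):  # cheapest connection first
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # joining two separate components cannot form a cycle
            parent[root_a] = root_b
            total += cost
            chosen.append((a, b))
    return total, chosen


total, chosen = kruskal(5, EDGES)
print(total, chosen)  # 630000 [(1, 2), (1, 3), (4, 5), (2, 4)]
```

The result agrees with the reply: four connections with a total cost of $630,000, with Hub 3-4 and Hub 3-5 skipped because each would close a cycle.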