The debate over AI benchmarking, and over how AI labs report their results, is spilling into public view. AI performance has long been measured and reported in contested ways, and arguments that used to stay behind the scenes are now playing out in the open.
This week, an OpenAI employee publicly accused xAI, the AI company founded by Elon Musk, of publishing misleading benchmark results to promote its latest AI model, Grok 3. Igor Babushkin, one of xAI's co-founders, responded immediately, insisting there was nothing wrong with xAI's approach. The exchange has pushed the question of transparency in AI performance evaluation to the forefront.
The truth, as in many technical disputes, probably lies somewhere between the two sides, and it takes some digging beneath the surface to find it.
In a post on its official blog, xAI presented a chart showing Grok 3's performance on the AIME 2025 benchmark, a set of questions from a recent edition of the invite-only American Invitational Mathematics Examination that is often treated as a litmus test of an AI model's mathematical ability. Experts have long questioned the validity of AIME as an AI benchmark, which makes xAI's choice of such a contested yardstick to showcase its model worth noting in itself. Nonetheless, AIME 2025 and its earlier editions are still widely used to assess the mathematical reasoning of AI models.
In that chart, xAI claimed that both versions of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beat OpenAI's current state-of-the-art model, o3-mini-high, on AIME 2025. OpenAI employees were quick to point out on X that the chart had a critical omission: it left out o3-mini-high's AIME 2025 score at "cons@64". That selective presentation of the data is what raised questions about xAI's intentions.
You may ask: what exactly is "cons@64"? Short for "consensus@64", it is an evaluation method in which the model gets 64 attempts at each question in a benchmark, and the most frequent answer across those attempts is taken as its final answer. Unsurprisingly, this majority-vote mechanism tends to boost a model's benchmark score significantly. Leaving the cons@64 numbers out of the chart can make it look as though Grok 3 outperforms the other models when that may not be the case. Whether this kind of selective charting still counts as fair game is precisely what is in dispute.
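To make the mechanics concrete, here is a minimal sketch of how a cons@64-style score could be computed, assuming each of a model's attempts has already been reduced to a short answer string. The data, function names, and tie-breaking rule are illustrative assumptions, not xAI's or OpenAI's actual evaluation code, and the toy example uses 5 samples per problem rather than 64.

```python
from collections import Counter

def consensus_answer(samples: list[str]) -> str:
    """Return the most frequent answer among the sampled attempts (majority vote)."""
    # most_common(1) yields [(answer, count)] for the top answer; ties are broken
    # by first-seen order here, which is just one reasonable choice.
    return Counter(samples).most_common(1)[0][0]

def score_cons_at_k(all_samples: list[list[str]], references: list[str]) -> float:
    """Fraction of problems where the majority-vote answer matches the reference."""
    correct = sum(
        consensus_answer(samples) == ref
        for samples, ref in zip(all_samples, references)
    )
    return correct / len(references)

def score_at_1(all_samples: list[list[str]], references: list[str]) -> float:
    """Fraction of problems where the *first* sampled answer matches the reference
    (the "@1" figure discussed below)."""
    correct = sum(samples[0] == ref for samples, ref in zip(all_samples, references))
    return correct / len(references)

if __name__ == "__main__":
    # Toy data: 3 problems, 5 samples each (a real cons@64 run would use 64 samples).
    references = ["42", "7", "113"]
    all_samples = [
        ["41", "42", "42", "42", "40"],       # first attempt wrong, majority right
        ["7", "7", "7", "9", "7"],            # both metrics get it right
        ["110", "112", "115", "111", "113"],  # no consensus on the right answer
    ]
    print("@1     :", score_at_1(all_samples, references))       # 0.33...
    print("cons@k :", score_cons_at_k(all_samples, references))  # 0.66...
```

Even on this toy data, the first-attempt score and the majority-vote score diverge, which is exactly the kind of gap a chart conceals when it shows one metric for some models and the other for the rest.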
Looking at the actual numbers, both Grok 3 Reasoning Beta and Grok 3 mini Reasoning scored below o3-mini-high on AIME 2025 at "@1", that is, on the model's first attempt at each problem. Grok 3 Reasoning Beta also only barely edges out OpenAI's o1 model set to "medium" computing. Yet even on that comparison, xAI is still promoting Grok 3 as "the world's smartest AI." That framing rests less on rigorous evidence than on a marketing play for attention. At a moment when AI technology is moving this fast, what matters more: steady, verifiable technical progress, or winning the narrative through exaggerated hype? It is a question the whole AI industry should take seriously.