High-quality AI reasoning models are going mainstream.
Early this morning, OpenAI released a new reasoning model, o3-mini.
OpenAI claims this is its most cost-effective reasoning model, with significantly improved complex reasoning and dialogue capabilities. It outperforms its predecessor, the o1 model, in areas such as science, math, and programming, while maintaining the low cost and low latency of o1-mini, and it can be used in conjunction with the web search function.
o3-mini is already available in ChatGPT and the API, and enterprise access will launch in one week.
DeepSeek's rise to the top of the U.S. App Store's free chart has evidently put pressure on OpenAI.
Today, for the first time, ChatGPT is offering a free reasoning model to all users: users can try o3-mini by selecting the "Reason" button in ChatGPT.
ChatGPT Pro users get unlimited access. For Plus and Team users, the rate limit has been tripled, from 50 messages per day with o1-mini to 150 messages per day with o3-mini.
Paid subscribers can also choose a more intelligent version, "o3-mini-high", which takes longer to generate a response.
Like the o1 models, o3-mini has a knowledge cutoff of October 2023. Its context window is 200,000 tokens, with a maximum output of 100,000 tokens. o3-mini is available in low, medium, and high reasoning-effort settings so that developers can optimize for their specific use cases.
o3-mini does not currently support vision, so developers will still need to use o1 for visual reasoning tasks.
Effective immediately, o3-mini is available in the Chat Completions API, the Assistants API, and the Batch API. OpenAI claims that the price per token has been reduced by 95% since the launch of GPT-4, while maintaining top-tier reasoning performance. However, o3-mini's API pricing is still higher than that of the DeepSeek models.
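As a minimal sketch of how the low/medium/high settings mentioned above surface in the Chat Completions API, the request below builds (but does not send) a payload with the `reasoning_effort` parameter; the model name and field names follow OpenAI's published API, but treat the exact values as illustrative rather than authoritative:

```python
import json

# Sketch of a Chat Completions request body for o3-mini.
# "reasoning_effort" selects among the low/medium/high settings;
# we only construct and inspect the payload here, without any network call.
payload = {
    "model": "o3-mini",
    "reasoning_effort": "high",  # one of "low", "medium", "high"
    "messages": [
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ],
}

print(json.dumps(payload, indent=2))
```

In practice this dictionary would be passed to an OpenAI client or POSTed to the Chat Completions endpoint with an API key.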
OpenAI model vs. DeepSeek model API pricing comparison
On the safety side, OpenAI found that o3-mini significantly outperforms GPT-4o on challenging safety and jailbreak evaluations.
01. A closer look at o3-mini: stronger science, math, and programming capabilities with significantly lower latency
Full text: OpenAI o3-mini System Card (Chinese)
OpenAI has released a detailed 37-page report on o3-mini, covering the model introduction, data and training, scope of testing, safety challenges and evaluations, external red-team testing, Preparedness Framework evaluations, multilingual performance, and conclusions.
o3-mini is optimized for scientific, mathematical, and programming reasoning while being more responsive. The model scored 0.77 on GPQA Diamond (physics, chemistry, and biology), 0.80 on AIME 2022-2024 (math), and 2036 Elo on Codeforces (programming), matching or exceeding the o1 reasoning model.
o3-mini significantly outperforms o1-mini on the MMLU test set across 14 languages, demonstrating its progress in multilingual understanding.
Evaluations by external expert testers indicated that o3-mini gives more accurate, clearer answers with better reasoning than o1-mini. In the human preference evaluation, testers preferred o3-mini's answers 56% of the time and observed a 39% reduction in major errors on difficult real-world questions. At medium reasoning effort, o3-mini performed comparably to o1 on some of the most challenging evaluations of reasoning and intelligence, including AIME and GPQA.
o3-mini's intelligence is comparable to o1's, while offering faster and more efficient performance. The model also excels in additional math and factuality evaluations at medium reasoning effort. In A/B testing, o3-mini responded 24% faster than o1-mini, with an average response time of 7.7 seconds versus 10.16 seconds for o1-mini.
In math, o3-mini at low reasoning effort performs comparably to o1-mini, and at medium reasoning effort comparably to o1; at high reasoning effort, it outperforms both o1-mini and o1.
o3-mini at high reasoning effort outperforms its predecessor on FrontierMath. When prompted to use the Python tool, it solved more than 32% of the problems on the first attempt, including more than 28% of the challenging (T3) problems.
o3-mini obtains progressively higher Elo scores as reasoning effort increases, all of which exceed o1-mini's. At medium reasoning effort, it performs comparably to o1.
o3-mini is OpenAI's best-performing model on SWE-bench Verified.
More detailed SWE-bench Verified results are shown below. o3-mini (tools) performed best at 61%; an o3-mini candidate using Agentless instead of internal tools scored 39%; and o1 was the second-best performer at 48%.
In the LiveBench coding test, o3-mini at high reasoning effort outscored o1-high across the board.
02. Outperforming GPT-4o across multiple safety evaluations
OpenAI also detailed o3-mini's performance across several safety evaluations, stating that o3-mini significantly outperformed GPT-4o on the challenging safety and jailbreak evaluations. On the disallowed-content evaluations, o3-mini performed similarly to GPT-4o on both the standard refusal evaluation and the challenging refusal evaluation, but lagged slightly behind on XSTest.
In the jailbreak evaluations, o3-mini performs comparably to o1-mini on the production jailbreaks, jailbreak-augmented examples, StrongReject, and human-sourced jailbreak evaluations.
In the hallucination evaluation on the PersonQA dataset, o3-mini achieved an accuracy of 21.7% and a hallucination rate of 14.8%, comparable to or better than GPT-4o and o1-mini.
On the fairness and bias evaluations, o3-mini performed similarly to o1-mini on the BBQ evaluation, with a slight decrease in accuracy on ambiguous questions.
External red-team testing showed that o3-mini performed comparably to o1, with both significantly outperforming GPT-4o.
In the Gray Swan Arena jailbreak test, o3-mini had an average user attack success rate of 3.6%, slightly higher than o1-mini and GPT-4o. The Preparedness Framework evaluation covered four risk categories: cybersecurity, CBRN (chemical, biological, radiological, and nuclear), persuasion, and model autonomy. o3-mini was rated "low risk" for cybersecurity and "medium risk" for CBRN, persuasion, and model autonomy; it met the "medium risk" threshold for biological threat creation, but showed limited capability in nuclear and radiological weapons development.
Under this framework, only models with a post-mitigation score of "medium" or lower can be deployed, and only models with a score of "high" or lower can be developed further.
03. o3 benchmarking may have cost over $30 million; OpenAI in talks for up to $40 billion in new funding
OpenAI has been iterating on its reasoning models since the release of o1 last September, and the o3 family released late last year is its latest generation. The high-end o3 targets high-compute applications, while o3-mini caters to users who need both cost-effectiveness and efficiency, reflecting OpenAI's strategy of balancing accessibility with premium paid offerings.
Over the past two days, whether pushed by DeepSeek or to warm up the o3-mini launch, OpenAI co-founder Sam Altman has been very active on social platforms: he again praised DeepSeek R1 as impressive, said OpenAI will deliver better models, and emphasized that more compute matters.
Yesterday he announced that the first complete 8-rack GB200 NVL72 server is up and running on Microsoft Azure for OpenAI.
The Government of India's Economic Survey 2024-2025 report, released this Friday, suggests that OpenAI may have spent over $30 million to benchmark its latest reasoning model, o3. The report notes that o3's breakthrough in processing power came at a very high price: ARC-AGI is considered one of the most challenging AI benchmarks, and while OpenAI's high-efficiency configuration cost around $200,000, the low-efficiency configuration cost 172 times as much, roughly $34.4 million.
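The figures reported above are internally consistent, as a quick arithmetic check shows (the $200,000 base cost and the 172x multiplier are the report's numbers; the calculation itself is ours):

```python
# Cost figures as reported: a high-efficiency run at about $200,000,
# with the low-efficiency configuration costing 172x as much.
efficient_cost_usd = 200_000
multiplier = 172

inefficient_cost_usd = efficient_cost_usd * multiplier
print(inefficient_cost_usd)  # 34400000, i.e. about $34.4 million
```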
Altman also recently posted a photo with Microsoft Chairman and CEO Satya Nadella, saying that the next phase of the Microsoft-OpenAI partnership will be better than anyone could have imagined.
But Microsoft's position as OpenAI's largest investor may be taken over by Japan's SoftBank Group. Recently, SoftBank founder and CEO Masayoshi Son has been getting closer to Altman: last week the two announced a joint AI mega-project called "Stargate", which will invest $500 billion (RMB 3.6 trillion) in AI infrastructure over the next four years, and yesterday it was revealed that SoftBank would lead OpenAI's new funding round.
According to media reports, OpenAI is in preliminary talks to raise up to US$40 billion (approximately RMB 290.1 billion) at a valuation of US$300 billion (about RMB 2.18 trillion). Japan's SoftBank Group will lead the round and is negotiating an investment of $15 to $25 billion, with the remainder coming from other investors. On top of the more than $15 billion SoftBank has already committed to Stargate, its total investment in partnership with OpenAI could ultimately exceed $40 billion, making this one of SoftBank's largest investments to date.
04. Conclusion: a fierce price-performance race as high-quality AI reasoning models go mainstream
Musk and other technology leaders have previously questioned publicly how the huge cost of building "Stargate" can be afforded. Under the influence of DeepSeek's high-performance, low-cost open-source models, the U.S. AI industry and Wall Street investors have grown even more skeptical of the spending strategies of OpenAI and other U.S. AI developers.
OpenAI's latest launch, o3-mini, is also seen as its latest move to fend off the onslaught of DeepSeek's models, and is drawing particular attention from the industry.
In its press release, OpenAI said the release of o3-mini marks another step in the company's mission to push the boundaries of cost-effective intelligence and make high-quality AI more accessible, and that OpenAI is committed to staying at the forefront of building large-scale models that balance intelligence, efficiency, and safety.