On March 25, 2025 (last updated March 26), Google DeepMind released Gemini 2.5, which it claims is its most intelligent family of AI models. The first release, Gemini 2.5 Pro Experimental, performed strongly on several benchmarks, showing significant improvements in reasoning and coding in particular, and topped the LMArena leaderboard by a significant margin.
"Thinking models": not just prediction, but reasoning
Google describes the Gemini 2.5 series as "thinking models". The core of this concept is that a model works through an internal "thinking" or reasoning process before it generates a response, with the aim of improving performance and accuracy on complex problems. In artificial intelligence, "reasoning" goes beyond simple classification and prediction: it involves analyzing information, making logical inferences, understanding context and nuance, and making informed decisions accordingly.
Researchers have long explored ways to improve AI reasoning, such as reinforcement learning and chain-of-thought prompting. Building on this work, Google previously launched its first "thinking model", Gemini 2.0 Flash Thinking. Gemini 2.5 continues that line of work. Google says it will build this "thinking" capability into all of its models in the future, so that they can tackle more complex problems and support more capable, more context-aware agents. This "think first, then answer" mechanism is believed to help reduce hallucination, a common failure in AI models, and to improve the reliability of answers.
Gemini 2.5 Pro Experimental performance
According to information published by Google, Gemini 2.5 Pro Experimental is currently at the top level in handling complex tasks. It leads by a significant margin on the LMArena leaderboard, which measures human preferences. A lead there usually means that the model is not only powerful, but that users also prefer the style, coherence, and utility of its output.
In terms of specific benchmarks:
- Reasoning: The model leads on math and science benchmarks that require advanced reasoning (e.g., GPQA and AIME 2025), and does so without cost-increasing test-time techniques such as majority voting. On Humanity’s Last Exam, a dataset designed by hundreds of domain experts to assess the frontier of human knowledge and reasoning, Gemini 2.5 Pro scored 18.8% without external tools, again leading the field.
  Note: The comparison chart covers reasoning, science, and math. Model names and data, including those for OpenAI GPT-4.5 and Claude 3.7 Sonnet, are taken from charts provided by Google.
- Coding: Compared with Gemini 2.0, Gemini 2.5 takes a large step forward in coding. It excels at creating visually appealing web applications, agentic code applications, and code transformation and editing. On SWE-Bench Verified, the industry standard for evaluating agentic coding, Gemini 2.5 Pro scored 63.8% using a custom agent setup. Google also showed an example in which Gemini 2.5 Pro used its reasoning ability to generate executable code for a simple dinosaur video game from a single-line prompt.
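To make the "majority voting" mentioned in the reasoning item concrete, here is a minimal Python sketch of the general idea: sample several independent answers to the same question and keep the most frequent one. It is an illustration only, not Google's evaluation code. The function name `majority_vote` and the sample answers are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the share of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical final answers from five independent samples of one question.
samples = ["204", "204", "198", "204", "210"]
answer, share = majority_vote(samples)
print(answer, share)
```

Each extra sample is another full model call, which is why the technique raises inference cost and why a score reported without it is a single-attempt result.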
Inheritance and development: multimodality and long contexts
Gemini 2.5 inherits the core strengths of the Gemini series: native multimodality and a long context window. Gemini 2.5 Pro supports a context window of 1 million tokens at launch (with plans to expand to 2 million tokens soon) and outperforms its predecessors across this long window. This means it can understand and process very large inputs containing text, audio, images, video, and even entire code bases, and can work on complex problems that draw on different information sources. This has significant potential for scenarios that require deep understanding of long documents, analysis of complex code projects, or processing of long video content.
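As a rough illustration of what a 1 million token window means in practice, the Python sketch below estimates whether a set of text documents would fit. The 4-characters-per-token ratio is an assumed rule of thumb, not the model's real tokenizer, and the helper names and the reserved output budget are hypothetical.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # launch figure cited above
CHARS_PER_TOKEN = 4                # assumption: rough heuristic, not a real tokenizer


def estimate_tokens(text):
    """Estimate the token count of a string using ceiling division."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_window(documents, reserved_for_output=8_192):
    """Return (estimated input tokens, whether input plus reserved output fits)."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total, total + reserved_for_output <= CONTEXT_WINDOW_TOKENS


# Hypothetical inputs: three documents of 400,000 characters each.
docs = ["a" * 400_000] * 3
print(fits_in_window(docs))
```

Real token counts vary with language and content type, so an estimate like this is only a first check before sending a large input.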

The table was updated on March 26 to include the new MRCR (Multi Round Coreference Resolution) assessment.
Availability and Future Prospects
Currently, Gemini 2.5 Pro Experimental is available to developers in Google AI Studio, and Gemini Advanced users can select the model from a drop-down menu on desktop and mobile devices. The model is expected to arrive on the Vertex AI platform in the coming weeks. Google also plans to announce pricing in the coming weeks, which will let users run the model at higher rate limits for scaled production use.
The "Experimental" label suggests that the current version may still be undergoing rapid iteration and that users may encounter some instability, while Google hopes to gather extensive feedback for continuous improvement.
The release of Gemini 2.5, especially its emphasis on "thinking" and its strong benchmark performance, is another step forward in the field of large language models. Its further development and its results in practical applications are worth watching closely.