
Google Releases Gemini 2.5: Big Improvements in "Thinking" Capabilities

Google DeepMind released Gemini 2.5 on March 25, 2025 (the announcement was last updated on March 26), calling it its most intelligent family of AI models. The first release, Gemini 2.5 Pro Experimental, performed strongly on several benchmarks, with notable gains in reasoning and coding, and it tops the LMArena leaderboard by a significant margin.

 


"Thinking models": not just prediction, but reasoning

Google describes the Gemini 2.5 series as "thinking models". The core idea is that a model works through an internal "thinking" or reasoning process before generating a response, with the aim of improving performance and accuracy on complex problems. In artificial intelligence, "reasoning" goes beyond simple classification and prediction: it involves analyzing information, drawing logical inferences, understanding context and nuance, and making informed decisions accordingly.

Researchers have long explored ways to improve AI reasoning, such as reinforcement learning and chain-of-thought prompting. Building on that work, Google previously launched its first "thinking model", Gemini 2.0 Flash Thinking. With Gemini 2.5, Google says it will build this "thinking" capability into all of its future models so they can tackle more complex problems and support more capable, context-aware agents. This "think first, then answer" mechanism is believed to help reduce the "hallucinations" common in AI models and to make answers more reliable.

 

Gemini 2.5 Pro Experimental performance

According to Google, Gemini 2.5 Pro Experimental is state of the art on complex tasks. It leads LMArena, a leaderboard that measures human preferences, by a significant margin, which usually means that the model is not only capable but that users also prefer the style, coherence, and usefulness of its output.

In terms of specific benchmarks:

  • Reasoning: The model leads on math and science benchmarks that require advanced reasoning (e.g., GPQA and AIME 2025), without cost-increasing test-time techniques such as majority voting. On Humanity’s Last Exam, a dataset designed by hundreds of domain experts to assess the frontier of human knowledge and reasoning, Gemini 2.5 Pro scored 18.8% without external tools, again leading the field.
    [Figure 1: Google DeepMind releases Gemini 2.5: a new AI milestone billed as "able to think"?]
    Note: The figure above compares models on reasoning, science, and math. The model names and figures shown, including OpenAI GPT-4.5 and Claude 3.7 Sonnet, are taken from charts provided by Google.
  • Coding: Gemini 2.5 takes a big leap over Gemini 2.0 in coding, and is particularly strong at creating visually appealing web applications, agentic code applications, and code transformation and editing. On SWE-Bench Verified, the industry standard for evaluating agentic coding, Gemini 2.5 Pro scored 63.8% with a custom agent setup. Google also showed an example in which Gemini 2.5 Pro used its reasoning ability to produce executable code for a simple dinosaur video game from a single-line prompt.
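
For readers unfamiliar with the "majority voting" test-time technique mentioned in the reasoning item above, here is a minimal sketch of the general idea: sample several answers to the same question and keep the most frequent one. The names `majority_vote`, `sample_and_vote`, and `sample_answer` are hypothetical and chosen for illustration; this is not Google's evaluation code, only a generic picture of why the technique multiplies inference cost by the number of samples.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer.

    Ties are resolved in favor of the answer that appeared first,
    which is how Counter.most_common orders equal counts.
    """
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def sample_and_vote(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Call a (hypothetical) model-sampling function n_samples times and vote.

    Each call stands in for one full model response, so the cost grows
    linearly with n_samples.
    """
    answers = [sample_answer() for _ in range(n_samples)]
    return majority_vote(answers)
```

A single-attempt score, like the ones reported above, corresponds to `n_samples = 1`, where voting has no effect.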

Inheritance and development: multimodality and long contexts

Gemini 2.5 inherits the core strengths of the Gemini series: native multimodality and a long context window. Gemini 2.5 Pro supports a context window of 1 million tokens at launch (with plans to expand to 2 million soon) and outperforms its predecessors across this long window. This means it can understand and process very large inputs containing text, audio, images, video, and even entire code repositories, and handle complex problems that draw on different sources of information. That has significant potential for scenarios that require deep understanding of long documents, analysis of complex code projects, or processing of long videos.
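
To get a rough feel for what a 1-million-token window means for text, the sketch below estimates whether a set of documents would fit. It assumes about 4 characters per token, a common rule of thumb for English text and not a property of Gemini's actual tokenizer. The function names are hypothetical, and real token counts should come from the model's own tooling.

```python
import math
from typing import Iterable, Tuple


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is an assumed heuristic."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(
    documents: Iterable[str],
    context_tokens: int = 1_000_000,  # launch window size cited in the article
    chars_per_token: float = 4.0,     # assumption, not the real tokenizer
) -> Tuple[int, bool]:
    """Return (estimated total tokens, whether they fit in the window)."""
    total = sum(estimate_tokens(doc, chars_per_token) for doc in documents)
    return total, total <= context_tokens
```

Under this assumption, 1 million tokens corresponds to roughly 4 million characters of plain text; audio, image, and video inputs are counted differently and are not covered by this estimate.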

[Figure 2: Google DeepMind releases Gemini 2.5: a new AI milestone billed as "able to think"?]

The table was updated on March 26 to include the new MRCR (Multi Round Coreference Resolution) assessment.

 

Availability and Future Prospects

Gemini 2.5 Pro Experimental is currently available for developers to try in Google AI Studio, and Gemini Advanced users can select the model from a drop-down menu on desktop and mobile. The model is expected to arrive on Vertex AI in the coming weeks. Google also plans to announce pricing in the coming weeks, which will let users run it at higher rate limits for scaled production use.

The "Experimental" label suggests that the current version may still be iterating rapidly and that users may encounter some instability, while Google hopes to gather broad feedback for continuous improvement. The release of Gemini 2.5, especially its emphasis on "thinking" and its strong benchmark results, is undoubtedly another step forward for large language models, and its further development and real-world performance are worth watching closely.

May not be reproduced without permission: Chief AI Sharing Circle » Google Releases Gemini 2.5: Big Improvements in "Thinking" Capabilities