
Qwen2.5-VL Notebook Example Details: Getting Started with Multimodal Visual Models

Recently, the Qwen team launched a series of Qwen2.5-VL use case Notebook examples, a comprehensive demonstration of what the native models and APIs can do. This collection of carefully crafted Notebooks is designed to help developers and users gain a deeper understanding of Qwen2.5-VL's powerful visual understanding and to inspire more innovative applications.

 

Notebook Examples: Getting Started with Qwen2.5-VL

With these detailed Notebook examples, developers can get up to speed quickly and see for themselves how the Qwen2.5-VL model performs on each task. Whether it's parsing complex documents, performing accurate OCR, or carrying out in-depth video content comprehension, Qwen2.5-VL delivers efficient and accurate results that demonstrate its strong performance.


At the same time, the Qwen team looks forward to the community's feedback and contributions to improve and expand the capabilities of Qwen2.5-VL, working together to advance multimodal technology.

🔗 RELATED:

  • GitHub repository: https://github.com/QwenLM/Qwen2.5-VL/tree/main/cookbooks
  • Online experience: https://chat.qwenlm.ai (select the Qwen2.5-VL-72B-Instruct model)
  • ModelScope model collection: https://www.modelscope.cn/collections/Qwen25-VL-58fbb5d31f1d47
  • Alibaba Cloud Model Studio (Bailian) API: https://help.aliyun.com/zh/model-studio/user-guide/vision/

Qwen2.5-VL Notebook example overview

 

Notebook Examples Explained

01 Computer Use

This Notebook example demonstrates how to utilize Qwen2.5-VL to perform tasks related to computer usage.

Users only need to provide a screenshot of the computer desktop together with a query; the Qwen2.5-VL model can then analyze the screenshot, understand the user's intent, and generate precise operation commands, such as clicking or typing, to enable intelligent control of the computer.

👉 Notebook link: https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/computer_use.ipynb

Computer Use example
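
For illustration, here is a minimal sketch of such a call against the ModelScope API-Inference endpoint used later in this article. The screenshot file name, the prompt wording, and the JSON action schema are all illustrative assumptions; the official cookbook defines its own prompt and action space.

import base64
import json
from openai import OpenAI

client = OpenAI(
    api_key="<MODELSCOPE_SDK_TOKEN>",  # your ModelScope SDK token
    base_url="https://api-inference.modelscope.cn/v1",
)

# Encode a local desktop screenshot as a data URL (screenshot.png is a placeholder)
with open("screenshot.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text",
             "text": 'Task: open the browser. Reply with JSON only, e.g. '
                     '{"action": "click", "coordinate": [x, y]}.'},
        ],
    }],
)

# The model may wrap the JSON in explanatory text; keep only the JSON object
raw = response.choices[0].message.content
action = json.loads(raw[raw.find("{") : raw.rfind("}") + 1])
print(action)  # e.g. {"action": "click", "coordinate": [512, 384]}

A real agent would execute the returned action with an automation library, take a new screenshot, and repeat.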

02 Spatial Understanding

This Notebook example highlights Qwen2.5-VL's advanced spatial localization capabilities, including accurate object detection and localization of specific targets in an image.

The examples provide insight into how Qwen2.5-VL effectively integrates visual and linguistic understanding to accurately interpret complex scenes and enable advanced spatial reasoning.

👉 Notebook link: https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/spatial_understanding.ipynb

Spatial Understanding example
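
As a sketch of what this looks like in code, the snippet below asks the model for boxes in the bbox_2d JSON format used in the Qwen2.5-VL cookbooks and draws them with Pillow. The image file and prompt wording are placeholders, and the assumption that coordinates come back as absolute pixels of the input image is illustrative; the cookbook itself covers preprocessing and coordinate rescaling in detail.

import base64
import json
from PIL import Image, ImageDraw
from openai import OpenAI

client = OpenAI(
    api_key="<MODELSCOPE_SDK_TOKEN>",
    base_url="https://api-inference.modelscope.cn/v1",
)

# street.jpg is a placeholder for any local image
with open("street.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text",
             "text": 'Detect every car in the image. Output only JSON in the form '
                     '[{"bbox_2d": [x1, y1, x2, y2], "label": "car"}].'},
        ],
    }],
)

# The model may wrap the JSON in explanatory text; keep only the JSON array
raw = response.choices[0].message.content
detections = json.loads(raw[raw.find("[") : raw.rfind("]") + 1])

# Draw the returned boxes on the original image
img = Image.open("street.jpg")
draw = ImageDraw.Draw(img)
for det in detections:
    draw.rectangle(det["bbox_2d"], outline="red", width=3)
img.save("street_boxes.jpg")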

03 Document Parsing

This Notebook example highlights the powerful document parsing capabilities of Qwen2.5-VL. It can process documents in a variety of image formats and output the parsed results in several formats, including HTML, JSON, Markdown, and LaTeX.

Of particular note is the unique QwenVL HTML format introduced by Qwen. This format records the position of each element in the document, allowing accurate reconstruction and flexible manipulation of the document.

👉 Notebook link: https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/document_parsing.ipynb

Document Parsing example
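
A minimal sketch of such a call follows; the document image URL is a placeholder, and the instruction text is an approximation, since the exact trigger prompt for the QwenVL HTML format is defined in the cookbook itself.

from openai import OpenAI

client = OpenAI(
    api_key="<MODELSCOPE_SDK_TOKEN>",
    base_url="https://api-inference.modelscope.cn/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            # scanned_page.png stands in for any document image URL
            {"type": "image_url", "image_url": {"url": "https://example.com/scanned_page.png"}},
            # Approximate instruction; see the cookbook for the exact QwenVL HTML prompt
            {"type": "text",
             "text": "Parse this document and output it as QwenVL HTML, "
                     "keeping the position of each element."},
        ],
    }],
)
print(response.choices[0].message.content)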

04 Mobile Agent

This Notebook example demonstrates how to intelligently interact with a mobile device using Qwen2.5-VL's agent capabilities.

The example shows how the Qwen2.5-VL model generates and executes actions based on the user's query and the visual context of the mobile device, enabling convenient control of the mobile device.

👉 Notebook link: https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/mobile_agent.ipynb

Mobile Agent example
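
The query-plus-screenshot round trip is the same as in the Computer Use sketch above; what changes is how actions are executed. Assuming the model returns a tap action in the illustrative JSON schema from that sketch, it can be forwarded to a connected Android device with adb:

import json
import subprocess

# Illustrative model output for a phone screenshot, using the same assumed
# action schema as the Computer Use sketch above
model_output = '{"action": "tap", "coordinate": [540, 1170]}'

action = json.loads(model_output)
if action["action"] == "tap":
    x, y = action["coordinate"]
    # `adb shell input tap` injects a touch event at (x, y) on the connected device
    subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)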

05 OCR (Optical Character Recognition)

This Notebook example focuses on demonstrating the OCR (Optical Character Recognition) capabilities of Qwen2.5-VL, including the accurate extraction and recognition of text information from images.

Through the examples, users can see how Qwen2.5-VL accurately captures and interprets text content in complex scenarios, demonstrating its powerful text recognition capabilities.

👉 Notebook link: https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/ocr.ipynb

OCR example
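
Beyond plain transcription, OCR queries can ask for structured output. A minimal sketch, assuming a placeholder receipt image URL and illustrative field names:

from openai import OpenAI

client = OpenAI(
    api_key="<MODELSCOPE_SDK_TOKEN>",
    base_url="https://api-inference.modelscope.cn/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            # receipt.jpg is a placeholder for any image containing text
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.jpg"}},
            {"type": "text",
             "text": 'Read this receipt and return JSON with the keys '
                     '"merchant", "date", and "total".'},
        ],
    }],
)
print(response.choices[0].message.content)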

06 Universal Recognition

This Notebook example demonstrates how to use Qwen2.5-VL for generic object recognition.

The user only needs to provide an image and a query, and the Qwen2.5-VL model can analyze the image, understand the user's query intent, and provide the corresponding recognition results to achieve a comprehensive understanding of the image content.

👉 Notebook link: https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/universal_recognition.ipynb

Universal Recognition example
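
The call pattern is the same generic image-plus-question request shown in the free-compute example later in this article; only the query changes. A sketch with a placeholder image URL and question:

from openai import OpenAI

client = OpenAI(
    api_key="<MODELSCOPE_SDK_TOKEN>",
    base_url="https://api-inference.modelscope.cn/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            # photo.jpg is a placeholder for any image
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text",
             "text": "What landmark is shown in this photo? "
                     "Name it and briefly describe its surroundings."},
        ],
    }],
)
print(response.choices[0].message.content)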

07 Video Understanding

Qwen2.5-VL has powerful long video comprehension capabilities and can handle video content longer than 1 hour. This Notebook example provides an in-depth exploration of the capabilities of the Qwen2.5-VL model for video comprehension tasks.

The example is designed to demonstrate the model's potential across a wide range of video analysis scenarios, from basic OCR to complex event detection and content summarization.

👉 Notebook link: https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/video_understanding.ipynb

Video Understanding example
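
Native video input is supported in the local inference path (see the qwen_vl_utils sketch near the end of this article); with an OpenAI-compatible endpoint, one common workaround is to sample frames and send them as an ordered list of images. A sketch, assuming OpenCV is installed and lecture.mp4 is a placeholder video; the two-second sampling interval is arbitrary, and very long videos would need a cap on the number of frames:

import base64
import cv2  # pip install opencv-python
from openai import OpenAI

client = OpenAI(
    api_key="<MODELSCOPE_SDK_TOKEN>",
    base_url="https://api-inference.modelscope.cn/v1",
)

# Sample one frame every two seconds from a local video (lecture.mp4 is a placeholder)
cap = cv2.VideoCapture("lecture.mp4")
step = max(int(cap.get(cv2.CAP_PROP_FPS) * 2), 1)
frames, idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % step == 0:
        ok_enc, buf = cv2.imencode(".jpg", frame)
        if ok_enc:
            frames.append("data:image/jpeg;base64," + base64.b64encode(buf.tobytes()).decode())
    idx += 1
cap.release()

# Send the ordered frames as a list of images, followed by the question
content = [{"type": "image_url", "image_url": {"url": f}} for f in frames]
content.append({"type": "text",
                "text": "These are frames sampled in order from a video. "
                        "Summarize the main events."})

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)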

 

ModelScope Best Practices: Running the Cookbook Examples with Free Compute

In the ModelScope community, users can easily try these Cookbook examples with free compute.

First, download the Qwen2.5-VL code.

git clone https://github.com/QwenLM/Qwen2.5-VL.git

Using the model API in a Notebook: ModelScope's API-Inference service provides a free API for the Qwen2.5-VL family of models. ModelScope users can call it directly by replacing the base_url in the Cookbook and filling in their ModelScope SDK token. Detailed documentation: https://www.modelscope.cn/docs/model-service/API-Inference/intro

from openai import OpenAI

# Point the OpenAI-compatible client at ModelScope's API-Inference endpoint
client = OpenAI(
    api_key="<MODELSCOPE_SDK_TOKEN>",  # your ModelScope SDK token
    base_url="https://api-inference.modelscope.cn/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",  # ModelScope model ID
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/demo/images/bird-vl.jpg"},
                },
                {
                    "type": "text",
                    "text": "Count the number of birds in the figure, including those that are only showing their heads. To ensure accuracy, first detect their key points, then give the total number.",
                },
            ],
        }
    ],
    stream=True,
)

# stream=True returns an iterator of chunks; print the tokens as they arrive
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Using a local model in the Notebook: please select a GPU instance type.
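
For local inference, the sketch below follows the pattern from the Qwen2.5-VL README. It assumes a recent transformers release that includes Qwen2.5-VL support and the qwen-vl-utils helper package (pip install qwen-vl-utils); the 7B-Instruct checkpoint and the demo image are example choices.

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and processor (bfloat16 on whatever GPUs are available)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/demo/images/bird-vl.jpg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]

# Build the chat prompt and collect the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens and decode only the newly generated answer
answer = processor.batch_decode(generated[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)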


 

Conclusion: Try It Out and Create the Future Together

In the future, the Qwen team will continue to update and expand these Notebook examples to incorporate more useful features and application scenarios, in an effort to provide developers with more comprehensive solutions. Welcome to visit Qwen2.5-VL's GitHub repository or ModelScope to experience these Notebook examples and share your experience and innovative applications! The Qwen team is looking forward to exploring the possibilities of Qwen2.5-VL with you.
