1. Model Introduction
In the five months since Qwen2-VL was released, numerous developers have built new models on top of the Qwen2-VL vision-language model and provided valuable feedback to the Qwen team. During this time, the Qwen team focused on building even more useful vision-language models. Today, the Qwen team is pleased to introduce the newest member of the Qwen family: Qwen2.5-VL.
Major enhancements:
- Understanding things visually: Qwen2.5-VL is proficient not only in recognizing common objects such as flowers, birds, fish, and insects, but also in analyzing text, charts, icons, graphics, and layouts within images.
- Acting as an agent: Qwen2.5-VL can directly act as a visual agent that reasons and dynamically directs tools, enabling computer use and phone use.
- Understanding long videos and capturing events: Qwen2.5-VL can understand videos longer than one hour, and it now has the new ability to capture events by pinpointing the relevant video segments.
- Visual localization in different formats: Qwen2.5-VL can accurately locate objects in an image by generating bounding boxes or points, and can provide stable JSON output for coordinates and attributes (see the sample after this list).
- Generating structured output: for scanned data such as invoices, forms, and tables, Qwen2.5-VL supports structured output of their contents, which is useful in finance and business scenarios.
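As an illustration of the JSON output mentioned above, a detection prompt can ask the model to return one entry per object. The sample below is illustrative only; the field names follow the style of Qwen's published grounding examples, and the exact schema depends on the prompt:
[
  {"bbox_2d": [135, 114, 1016, 672], "label": "dog"},
  {"bbox_2d": [241, 82, 526, 540], "label": "cat"}
]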
Model Architecture:
- Dynamic resolution and frame-rate training for video understanding:
Extending dynamic resolution to the temporal dimension through dynamic FPS sampling allows the model to understand video at various sampling rates. Correspondingly, the Qwen team updated mRoPE with IDs and absolute time alignment in the temporal dimension, enabling the model to learn temporal order and speed and, ultimately, to pinpoint specific moments (a sketch of this idea follows this list).
- Streamlined and efficient vision encoder:
The Qwen team improved training and inference speed by strategically introducing window attention into the ViT. The ViT architecture was further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
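To make the absolute-time alignment concrete, here is a minimal sketch, not the official implementation: temporal position ids are tied to real frame timestamps, so the same moment maps to the same position regardless of the sampling rate. The positions-per-second granularity below is an assumed illustrative value.

# Minimal sketch, NOT the official Qwen2.5-VL code: temporal positions are
# derived from absolute frame timestamps, so a given moment receives the same
# position id whatever FPS was used for sampling.
def temporal_position_ids(frame_timestamps_s, positions_per_second=2):
    # positions_per_second is an illustrative granularity, not the real config
    return [round(t * positions_per_second) for t in frame_timestamps_s]

# The same 2-second clip sampled at 1 FPS vs. 2 FPS:
print(temporal_position_ids([0.0, 1.0, 2.0]))            # [0, 2, 4]
print(temporal_position_ids([0.0, 0.5, 1.0, 1.5, 2.0]))  # [0, 1, 2, 3, 4]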
This open-source release includes three models, with 3 billion, 7 billion, and 72 billion parameters. This repo contains the instruction-tuned 72B Qwen2.5-VL model.
Model collection:
https://www.modelscope.cn/collections/Qwen25-VL-58fbb5d31f1d47
Online demo:
https://chat.qwenlm.ai/
Tech Blog:
https://qwenlm.github.io/blog/qwen2.5-vl/
Code repository:
https://github.com/QwenLM/Qwen2.5-VL
2. Model Performance
Model Evaluation
3. Model Inference
Inference with transformers
The code for Qwen2.5-VL has been merged into the latest transformers, and it is recommended to install from source using the command:
pip install git+https://github.com/huggingface/transformers
A toolkit is provided to make it easier to handle various types of visual input, just as if you were calling an API. It supports base64-encoded data, URLs, and interleaved images and videos, and can be installed using the following command:
pip install qwen-vl-utils[decord]==0.0.8
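Before the full example below, here is an illustrative sketch of the kinds of image and video values the toolkit is documented to accept in a message. The paths and the base64 payload are placeholders, not values from the original post:

# Illustrative only: ways to reference visual inputs in a message handled by
# qwen-vl-utils / process_vision_info. Paths and the base64 string are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/local_image.jpg"},        # local file
            {"type": "image", "image": "https://example.com/image.jpg"},          # URL
            {"type": "image", "image": "data:image;base64,/9j/..."},              # base64 data URI
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},  # video file
            {"type": "text", "text": "Describe these inputs."},
        ],
    }
]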
Inference code:
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from modelscope import snapshot_download

# Download and load the model
model_dir = snapshot_download("Qwen/Qwen2.5-VL-3B-Instruct")

# Default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)

# Optional: Enable flash_attention_2 for better acceleration and memory saving
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     model_dir,
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Load the default processor
processor = AutoProcessor.from_pretrained(model_dir)

# Optional: Set custom min and max pixels for the visual token range
# min_pixels = 256 * 28 * 28
# max_pixels = 1280 * 28 * 28
# processor = AutoProcessor.from_pretrained(
#     model_dir, min_pixels=min_pixels, max_pixels=max_pixels
# )

# Define input messages
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generate output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

# Print the generated output
print(output_text)
Calling the model directly with ModelScope API-Inference
The API-Inference service on the ModelScope platform also provides day-one support for the Qwen2.5-VL series. ModelScope users can call the models directly through the API. Details on using API-Inference can be found on the model page (e.g. https://www.modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct):
Or see the API-Inference documentation:
https://www.modelscope.cn/docs/model-service/API-Inference/intro
Here is an example that calls the API with the Qwen/Qwen2.5-VL-72B-Instruct model on the bird image linked in the code below:
from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI(
    api_key="",  # ModelScope Token
    base_url="https://api-inference.modelscope.cn/v1",
)

# Create a chat completion request
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",  # ModelScope Model-Id
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://modelscope.oss-cn-beijing.aliyuncs.com/demo/images/bird-vl.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": (
                        "Count the number of birds in the figure, including those that "
                        "are only showing their heads. To ensure accuracy, first detect "
                        "their key points, then give the total number."
                    ),
                },
            ],
        }
    ],
    stream=True,
)

# Stream the response
for chunk in response:
    print(chunk.choices[0].delta.content, end="", flush=True)
4. Model fine-tuning
We introduce fine-tuning Qwen/Qwen2.5-VL-7B-Instruct with ms-swift. ms-swift is the official framework provided by the ModelScope community for fine-tuning and deploying large language models and multimodal large models. ms-swift open-source repository:
https://github.com/modelscope/ms-swift
Here, we show runnable fine-tuning demos and give the format for custom datasets.
Before you start fine-tuning, make sure your environment is ready.
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .
The image OCR fine-tuning script is as follows:
MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset AI-ModelScope/LaTeX_OCR:human_handwrite#20000 \
    --train_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --freeze_vit true \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4
GPU memory usage during training:
The video fine-tuning script is below:
# The meaning of VIDEO_MAX_PIXELS and the other parameters is documented at:
# https://swift.readthedocs.io/zh-cn/latest/Instruction/ .html#id18
nproc_per_node=2

CUDA_VISIBLE_DEVICES=0,1 \
NPROC_PER_NODE=$nproc_per_node \
VIDEO_MAX_PIXELS=100352 \
FPS_MAX_FRAMES=24 \
swift sft \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset swift/VideoChatGPT:all \
    --train_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --freeze_vit true \
    --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --deepspeed zero2
GPU memory usage during training:
The custom dataset format is as follows (the system field is optional); just pass the dataset via `--dataset`:
{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}]] , {"role": "assistant", "content": "The capital of Zhejiang is in Hangzhou."}]} {"messages": [{"role": "user", "content":"What's the difference between the two images"}, {"role": "assistant", "content": "The first one is a kitten, the second one is a puppy"}], "images": ["/xxx/x .jpg", "xxx/x.png"]} {"messages": [{"role": "system", "content": "You're a useful and harmless assistant"}, {"role": "user", "content":"
The grounding task fine-tuning script is as follows:
CUDA_VISIBLE_DEVICES=0 \
MAX_PIXELS=1003520 \
swift sft \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset 'AI-ModelScope/coco#20000' \
    --train_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --freeze_vit true \
    --gradient_accumulation_steps 16 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4
GPU memory usage during training:
The custom dataset format for the grounding task is as follows:
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content":"Describes the image"}, {"role": "assistant", " content": " and are playing on the beach"}], "images": ["/xxx/x.jpg"], "objects": {"ref": ["a dog", "a woman"], "bbox": [[331.5, 761.4, 853.5, 1594.8], [676.5, 685.8, 1099.5, 1427.4]]}} {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content":"Find the in the image"}, {"role". "assistant", "content": ""}], "images": ["/xxx/x.jpg"], "objects": {"ref": ["sheep"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}
After training is complete, use the following command to run inference on the validation set held out during training.
Here `--adapters` should be replaced with the last checkpoint folder generated by training. Since the adapters folder contains the training parameter files, there is no need to specify `--model` separately:
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream false \
    --max_batch_size 1 \
    --load_data_args true \
    --max_new_tokens 2048
Push the model to ModelScope:
CUDA_VISIBLE_DEVICES=0 \
swift export \
    --adapters output/vx-xxx/checkpoint-xxx \
    --push_to_hub true \
    --hub_model_id '' \
    --hub_token ''