Introduction
DashInfer-VLM is an inference architecture for visual multimodal large models (VLMs), especially optimized for accelerating inference of the Qwen VL models. The biggest difference between DashInfer-VLM and other VLM inference acceleration frameworks is that it separates the ViT part from the LLM part, and the ViT and LLM run in parallel without interfering with each other.
This means that image and video preprocessing, as well as ViT feature extraction, never interrupt LLM generation. The ViT and LLM can even be deployed as a disaggregated architecture; DashInfer-VLM is the first VLM serving framework in the open-source community to use this design.
In a multi-card deployment, there is a ViT processing unit on each card, which gives a very significant performance advantage in video and multi-image scenarios.
In addition, the ViT part supports memory caching, so ViT results do not need to be recomputed across multiple rounds of dialog.
Below is its architecture diagram, configured for a 4-card deployment of the 72B model.
The architecture diagram shows the process and components:
- In the ViT part, various inference engines can be used, such as TensorRT or onnxruntime (the framework exports the ViT part of the model to ONNX); TensorRT is currently supported by default.
- In the LLM part, DashInfer is used for inference.
- In the cache part, the framework supports a ViT result memory cache, an LLM prefix cache, and an LLM multimodal prefix cache (not enabled by default); a conceptual sketch of the ViT result cache follows below.
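Conceptually, the ViT result memory cache maps image or video-frame content to its precomputed ViT features, so that repeated images across requests or dialog rounds skip the ViT stage entirely. The snippet below is a minimal Python sketch of this idea, not DashInfer's actual implementation; the class name, the `compute_vit_features` callback, and the LRU policy are illustrative assumptions.

```python
import hashlib
from collections import OrderedDict

class ViTResultCache:
    """Conceptual sketch of a ViT feature memory cache (not DashInfer's real API).

    Keys are content hashes of the preprocessed image bytes, values are the
    ViT feature tensors, with a simple LRU eviction policy.
    """

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def _key(self, image_bytes: bytes) -> str:
        # Hash the raw image content so identical images map to the same entry.
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_compute(self, image_bytes: bytes, compute_vit_features):
        key = self._key(image_bytes)
        if key in self._store:
            # Cache hit: reuse features, no ViT recomputation in later dialog rounds.
            self._store.move_to_end(key)
            return self._store[key]
        # Cache miss: run ViT (e.g. via TensorRT) and remember the result.
        features = compute_vit_features(image_bytes)
        self._store[key] = features
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # Evict the least recently used entry.
        return features
```

In DashInfer-VLM this caching is handled inside the serving framework, so clients simply resend the same image or video URL in later rounds and benefit from it transparently.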
Code Address:
https://github.com/modelscope/dash-infer
Document Address:
https://dashinfer.readthedocs.io/en/latest/vlm/vlm_offline_inference_en.html
Best Practice
Experience DashInfer on the free GPU compute provided by the ModelScope community:
First, install dashinfer-vlm and TensorRT:

```python
# First install the required packages
import os

# Download and install dashinfer version 2.0.0rc2
# Download and extract the TensorRT packages using wget, if needed.
# pip install dashinfer 2.0.0rc2
#!pip install https://github.com/modelscope/dash-infer/releases/download/v2.0.0-rc2/dashinfer-2.0.0rc2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
#!wget https://modelscope.oss-cn-beijing.aliyuncs.com/releases/TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz
#!tar -xvzf TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz

# To install dashinfer, we recommend downloading the wheel locally first (the package is large),
# replacing the URL with the corresponding modelscope one.
#!wget https://modelscope.oss-cn-beijing.aliyuncs.com/releases/dashinfer-2.0.0rc3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
#!pip install ./dashinfer-2.0.0rc3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

# Install dashinfer-vlm
#!pip install dashinfer-vlm

# Install the OpenAI client
#!pip install openai==1.56.2

# Install the Python package for TensorRT from the extracted archive.
#!pip install TensorRT-10.6.0.26/python/tensorrt-10.6.0-cp310-none-linux_x86_64.whl
```
TensorRT requires environment variable configuration:
```python
import os

# Get the path to the TensorRT runtime library
trt_runtime_path = os.getcwd() + "/TensorRT-10.6.0.26/lib/"

# Get the current LD_LIBRARY_PATH environment variable value
current_ld_library_path = os.environ.get('LD_LIBRARY_PATH', '')

# Prepend the TensorRT library path to the existing value, if any
if current_ld_library_path:
    updated_ld_library_path = f"{trt_runtime_path}:{current_ld_library_path}"
else:
    updated_ld_library_path = trt_runtime_path

# Update LD_LIBRARY_PATH so the TensorRT libraries can be found at runtime
os.environ['LD_LIBRARY_PATH'] = updated_ld_library_path
```
After the environment is installed, start dashinfer-vlm to serve the model as an OpenAI-compatible server; the model can be changed to 7B, 72B, etc.
All GPU memory in the environment is used by default.
```
!dashinfer_vlm_serve --model qwen/Qwen2-VL-2B-Instruct --port 8000 --host 127.0.0.1
```
This process initializes DashInfer as well as the external engine used by the ViT (TensorRT in this case), and starts an OpenAI-compatible service.
Seeing these logs indicates that TensorRT was initialized successfully:
Seeing these logs indicates that DashInfer was initialized successfully:
Seeing these logs indicates that the openai service was initialized successfully:
Once all of this initialization succeeds, you can open another notebook for the client and benchmarking.
Notebook Address: https://modelscope.cn/notebook/share/ipynb/6ea987c5/vl-start-server.ipynb
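Before running the demos, it is worth checking from the client side that the OpenAI-compatible endpoint is reachable. A minimal sketch, assuming the server from the previous step is listening on port 8000 and exposes the standard /v1/models listing route (as most OpenAI-compatible servers do):

```python
from openai import OpenAI

# Point the client at the locally started dashinfer_vlm_serve endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Listing models is a cheap liveness check; a successful call means the
# OpenAI-compatible service is up and accepting requests.
# (Assumes the server implements the standard /v1/models route.)
for model in client.models.list().data:
    print(model.id)
```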
Image Understanding Demo
The following demo shows image understanding with multiple images:
```python
# Install the required OpenAI client version (VL support requires a recent OpenAI client).
!pip install openai==1.56.2

from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Prepare the API call for a chat completion with two images
response = client.chat.completions.create(
    model="model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Are these images different?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://farm4.staticflickr.com/3075/3168662394_7d7103de7d_z_d.jpg",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://farm2.staticflickr.com/1533/26541536141_41abe98db3_z_d.jpg",
                    },
                },
            ],
        },
    ],
    stream=True,
    max_completion_tokens=1024,
    temperature=0.1,
)

# Process the streamed response
full_response = ""
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        # Append the delta content to the full response
        full_response += delta
    print(".", end="")  # Print a dot for each chunk received

# Print the full response
print(f"\nImage: Full Response:\n{full_response}")
```
Video Comprehension Demo
Since the OpenAI API does not define a standard video interface, this framework provides a video_url content type, which automatically downloads the video, extracts frames, and analyzes them.
```python
# Video example
!pip install openai==1.56.2  # Ensure the OpenAI client supports the video link feature.

from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Create a chat completion request with a video URL
response = client.chat.completions.create(
    model="model",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Generate a compelling description that I can upload along with the video.",
                },
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://cloud.video.taobao.com/vod/JCM2awgFE2C2vsACpDESXZ3h5_iQ5yCZCypmjtEs2Ck.mp4",
                        "fps": 2,
                    },
                },
            ],
        },
    ],
    max_completion_tokens=1024,
    top_p=0.5,
    temperature=0.1,
    frequency_penalty=1.05,
    stream=True,
)

# Process the streaming response
full_response = ""
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        # Append the delta content from the chunk to the full response
        full_response += delta
    print(".", end="")  # Indicate progress with dots

# Print the complete response
print(f"\nFull Response: \n{full_response}")
```
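Because the ViT results are cached in memory, a follow-up question about the same video in a multi-round conversation does not recompute the ViT features on the server side. Below is a minimal sketch of such a follow-up round, reusing the `client` and the `full_response` produced by the streaming loop above; the message layout is an illustration, not a prescribed multi-round API.

```python
# Continue the conversation: append the first-round answer and ask a follow-up question.
followup = client.chat.completions.create(
    model="model",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Generate a compelling description that I can upload along with the video.",
                },
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://cloud.video.taobao.com/vod/JCM2awgFE2C2vsACpDESXZ3h5_iQ5yCZCypmjtEs2Ck.mp4",
                        "fps": 2,
                    },
                },
            ],
        },
        # First-round answer from the streaming example above.
        {"role": "assistant", "content": full_response},
        # Follow-up question; the video does not need to go through ViT again.
        {"role": "user", "content": "Now shorten the description to a single sentence."},
    ],
    max_completion_tokens=256,
    temperature=0.1,
    stream=False,
)
print(followup.choices[0].message.content)
```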
Benchmark
Using the image understanding example above, run a simple multi-concurrency test to measure throughput.
```python
# Benchmark
!pip install openai==1.56.2

import time
import concurrent.futures
from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Request parameters
model = "model"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Are these images different?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://farm4.staticflickr.com/3075/3168662394_7d7103de7d_z_d.jpg",
                },
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://farm2.staticflickr.com/1533/26541536141_41abe98db3_z_d.jpg",
                },
            },
        ],
    }
]

# Send a single request and measure its latency
def send_request():
    start_time = time.time()
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=False,
        max_completion_tokens=1024,
        temperature=0.1,
    )
    end_time = time.time()
    latency = end_time - start_time
    return latency

# Benchmark function: issue num_requests requests using num_workers concurrent threads
def benchmark(num_requests, num_workers):
    latencies = []
    start_time = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(send_request) for _ in range(num_requests)]
        for future in concurrent.futures.as_completed(futures):
            latencies.append(future.result())
    end_time = time.time()
    total_time = end_time - start_time
    qps = num_requests / total_time
    average_latency = sum(latencies) / len(latencies)
    throughput = num_requests * 1024 / total_time  # Assuming a response size of 1024 bytes per request
    print(f"Total Time: {total_time:.2f} seconds")
    print(f"QPS: {qps:.2f}")
    print(f"Average Latency: {average_latency:.2f} seconds")

# Main program entry
if __name__ == "__main__":
    num_requests = 100  # Total number of requests
    num_workers = 10    # Number of concurrent worker threads
    benchmark(num_requests, num_workers)
```
Test results:
Notebook Address: https://modelscope.cn/notebook/share/ipynb/5560603a/vl-test-and-benchmark.ipynb
Comprehensive performance comparison with vLLM:
For a more comprehensive and accurate performance comparison with vLLM, we used OpenGVLab/InternVL-Chat-V1-2-SFT-Data to benchmark single-concurrency, multi-concurrency, and multi-round conversation workloads on models of different sizes. Detailed reproduction scripts are available at the link, and the results are as follows:
It can be seen that DashInfer has some performance advantages in all cases, especially in multi-round conversations.
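The linked reproduction scripts cover the full methodology. As a rough illustration of why multi-round conversations benefit from the prefix and multimodal caches, the sketch below replays a dialog while appending each answer to the history, so later rounds share a growing prefix; the questions, image URL, and round count are placeholders rather than the actual benchmark settings.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Placeholder follow-up questions; the real benchmark replays conversations
# from OpenGVLab/InternVL-Chat-V1-2-SFT-Data instead.
questions = [
    "Describe this image in detail.",
    "What objects are in the foreground?",
    "Summarize your answers in one sentence.",
]
image_url = "https://farm4.staticflickr.com/3075/3168662394_7d7103de7d_z_d.jpg"

# First round carries the image; later rounds are text-only follow-ups.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": questions[0]},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
]

for round_idx, question in enumerate(questions):
    if round_idx > 0:
        # Each new round extends the shared conversation prefix,
        # which is what the prefix / multimodal caches can reuse.
        messages.append({"role": "user", "content": question})
    start = time.time()
    resp = client.chat.completions.create(
        model="model",
        messages=messages,
        stream=False,
        max_completion_tokens=512,
        temperature=0.1,
    )
    answer = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(f"Round {round_idx + 1}: {time.time() - start:.2f}s")
```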