
DashInfer-VLM: multimodal SOTA inference performance, surpassing vLLM!

Introduction

DashInfer-VLM is an inference architecture for visual multimodal large models (VLMs), optimized in particular to accelerate inference for the Qwen VL models. The biggest difference between DashInfer-VLM and other VLM inference acceleration frameworks is that it separates the ViT part from the LLM part, and the ViT and LLM run in parallel without interfering with each other.

As a result, the image and video preprocessing in the VLM, as well as the ViT feature extraction, never interrupt the LLM's token generation. The ViT and LLM can also be deployed as separate components; this is the first VLM serving framework in the open-source community to adopt such an architecture.


In a multi-card deployment, each card has its own ViT processing unit, which gives a very significant performance advantage in video and multi-image scenarios.

In addition, the ViT part supports a memory cache, so ViT features do not need to be recomputed across multiple rounds of dialog.
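For illustration, here is a minimal sketch of what such a multi-round dialog looks like from the client side, assuming the OpenAI-compatible server described in the Best Practice section below is already running; the cache itself is transparent to the caller. The same image URL appears in both turns of the conversation history, so on the second request the server can reuse the cached ViT features instead of recomputing them.

from openai import OpenAI

# Sketch only: assumes the OpenAI-compatible DashInfer server from the Best
# Practice section is running locally; the ViT memory cache is server-internal.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

image = {
    "type": "image_url",
    "image_url": {"url": "https://farm4.staticflickr.com/3075/3168662394_7d7103de7d_z_d.jpg"},
}

# Round 1: the image is preprocessed and its ViT features are computed (and cached).
history = [{"role": "user", "content": [{"type": "text", "text": "Describe this image."}, image]}]
first = client.chat.completions.create(model="model", messages=history, max_completion_tokens=256)
history.append({"role": "assistant", "content": first.choices[0].message.content})

# Round 2: the full history (including the same image) is sent again, so the
# server can serve the ViT result from its memory cache rather than re-running ViT.
history.append({"role": "user", "content": [{"type": "text", "text": "Which colors dominate it?"}]})
second = client.chat.completions.create(model="model", messages=history, max_completion_tokens=256)
print(second.choices[0].message.content)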

Below is a diagram of the architecture; the configuration shown corresponds to a 72B model deployed across 4 cards.

[Figure: DashInfer-VLM architecture diagram, 72B model on 4 cards]

 

The architecture diagram illustrates the process and components:

  • ViT part: inference can be run with various engines such as TensorRT or ONNX Runtime (the framework exports the ViT part of the model to ONNX); TensorRT is supported by default.
  • LLM part: inference is run with DashInfer.
  • Cache part: supports a ViT result memory cache, an LLM prefix cache, and an LLM multimodal prefix cache (not enabled by default).

 

Code Address:

https://github.com/modelscope/dash-infer

Document Address: 

https://dashinfer.readthedocs.io/en/latest/vlm/vlm_offline_inference_en.html

 

Best Practice

You can try out DashInfer on the free GPU compute provided by the ModelScope community:

First, install dashinfer-vlm and TensorRT.

# First install the required packages
import os

# Install dashinfer 2.0.0rc2 from the GitHub release, and download and
# extract the TensorRT package with wget, if needed.
#!pip install https://github.com/modelscope/dash-infer/releases/download/v2.0.0-rc2/dashinfer-2.0.0rc2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
#!wget https://modelscope.oss-cn-beijing.aliyuncs.com/releases/TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz
#!tar -xvzf TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz

# Alternatively, download the dashinfer wheel locally (or replace with the corresponding
# modelscope URL) and install it from disk; a local install is recommended because the package is large.
#!wget https://modelscope.oss-cn-beijing.aliyuncs.com/releases/dashinfer-2.0.0rc3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
#!pip install ./dashinfer-2.0.0rc3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

# Install dashinfer-vlm
#!pip install dashinfer-vlm

# Install the OpenAI client
#!pip install openai==1.56.2

# Install the TensorRT Python package from the extracted archive
#!pip install TensorRT-10.6.0.26/python/tensorrt-10.6.0-cp310-none-linux_x86_64.whl

 

TensorRT requires its library path to be added to the environment variables:

import os

# Path to the TensorRT runtime libraries
trt_runtime_path = os.getcwd() + "/TensorRT-10.6.0.26/lib/"

# Get the current LD_LIBRARY_PATH environment variable value
current_ld_library_path = os.environ.get('LD_LIBRARY_PATH', '')

# Add the new path to the existing value
if current_ld_library_path:
    # If LD_LIBRARY_PATH is already set, append the TensorRT path to it
    new_ld_library_path = f"{current_ld_library_path}:{trt_runtime_path}"
else:
    # Otherwise use the TensorRT path directly
    new_ld_library_path = trt_runtime_path

# Update the environment variable for the current process
os.environ['LD_LIBRARY_PATH'] = new_ld_library_path

Once the environment is installed, start dashinfer-vlm to serve the model and expose an OpenAI-compatible server. The model can be changed to the 7B, 72B, or other variants.

 

All GPU memory in the environment is used by default.

!dashinfer_vlm_serve --model qwen/Qwen2-VL-2B-Instruct --port 8000 --host 127.0.0.1

This process initializes DashInfer as well as the external engine used for the ViT (TensorRT in this case), and starts an OpenAI-compatible service.

 

Seeing these logs indicates that the TRT was initialized successfully:

[Screenshot: TensorRT initialization log]

 

Seeing these logs indicates that DashInfer was initialized successfully:

[Screenshot: DashInfer initialization log]

 

Seeing these logs indicates that the openai service was initialized successfully:

[Screenshot: OpenAI service startup log]

 

Once all of the initialization above has succeeded, you can open another notebook for the client and benchmarking.
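Optionally, before switching notebooks, you can sanity-check the endpoint from Python. This is a minimal sketch that assumes the server exposes the standard OpenAI /v1/models route; if DashInfer does not implement it, skip this step and go straight to the demos.

from openai import OpenAI

# Assumption: the server implements the standard OpenAI /v1/models listing route.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

try:
    # A successful call confirms that the OpenAI-compatible endpoint is reachable.
    for m in client.models.list():
        print("served model:", m.id)
except Exception as exc:
    print("server not ready yet:", exc)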

Notebook Address: https://modelscope.cn/notebook/share/ipynb/6ea987c5/vl-start-server.ipynb

 

Image Understanding Demo

The following demo shows image understanding with multiple images:

# Install the required OpenAI client version
!pip install openai==1.56.2  # VL support requires a recent OpenAI client.

from openai import OpenAI

# Initialize the OpenAI client pointing at the local DashInfer server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Prepare the API call for a chat completion with two images
response = client.chat.completions.create(
    model="model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Are these images different?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://farm4.staticflickr.com/3075/3168662394_7d7103de7d_z_d.jpg",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://farm2.staticflickr.com/1533/26541536141_41abe98db3_z_d.jpg",
                    },
                },
            ],
        }
    ],
    stream=True,
    max_completion_tokens=1024,
    temperature=0.1,
)

# Process the streamed response
full_response = ""
for chunk in response:
    # Append the delta content to the full response (guard against empty deltas)
    full_response += chunk.choices[0].delta.content or ""
    print(".", end="")  # Print a dot for each chunk received

# Print the full response
print(f"\nImage: Full Response:\n{full_response}")

 

Video Comprehension Demo

Since OpenAI does not define a standard video interface, this framework provides a video_url content type, which automatically downloads the video, extracts frames, and analyzes them.

# video example
!pip install openai==1.56.2  # Ensure the OpenAI client supports video link features.

from openai import OpenAI

# Initialize the OpenAI client pointing at the local DashInfer server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Create a chat completion request with a video URL
response = client.chat.completions.create(
    model="model",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Generate a compelling description that I can upload along with the video."
                },
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://cloud.video.taobao.com/vod/JCM2awgFE2C2vsACpDESXZ3h5_iQ5yCZCypmjtEs2Ck.mp4",
                        "fps": 2,
                    },
                },
            ],
        }
    ],
    max_completion_tokens=1024,
    top_p=0.5,
    temperature=0.1,
    frequency_penalty=1.05,
    stream=True,
)

# Process the streaming response
full_response = ""
for chunk in response:
    # Append the delta content from the chunk to the full response (guard against empty deltas)
    full_response += chunk.choices[0].delta.content or ""
    print(".", end="")  # Indicate progress with dots

# Print the complete response
print(f"\nFull Response: \n{full_response}")

 

Benchmark

Using the image understanding example above, run a simple multi-concurrency test to measure throughput:

# benchmark
!pip install openai==1.56.2

import time
import concurrent.futures
from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Request parameters
model = "model"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Are these images different?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://farm4.staticflickr.com/3075/3168662394_7d7103de7d_z_d.jpg",
                },
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://farm2.staticflickr.com/1533/26541536141_41abe98db3_z_d.jpg",
                },
            },
        ],
    }
]

# Send a single request and measure its latency
def send_request():
    start_time = time.time()
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=False,
        max_completion_tokens=1024,
        temperature=0.1,
    )
    end_time = time.time()
    latency = end_time - start_time
    return latency

# Benchmark function: issue num_requests requests across num_workers threads
def benchmark(num_requests, num_workers):
    latencies = []
    start_time = time.time()

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(send_request) for _ in range(num_requests)]
        for future in concurrent.futures.as_completed(futures):
            latencies.append(future.result())

    end_time = time.time()
    total_time = end_time - start_time
    qps = num_requests / total_time
    average_latency = sum(latencies) / len(latencies)

    print(f"Total Time: {total_time:.2f} seconds")
    print(f"QPS: {qps:.2f}")
    print(f"Average Latency: {average_latency:.2f} seconds")

# Main program entry
if __name__ == "__main__":
    num_requests = 100  # Total number of requests
    num_workers = 10    # Concurrent worker threads
    benchmark(num_requests, num_workers)

 

Test results:

[Screenshot: benchmark test results]

Notebook Address: https://modelscope.cn/notebook/share/ipynb/5560603a/vl-test-and-benchmark.ipynb

 

Comprehensive performance comparison with vLLM:

To compare performance with vLLM more comprehensively and accurately, we used OpenGVLab/InternVL-Chat-V1-2-SFT-Data to benchmark single-concurrency, multi-concurrency, and multi-round conversation workloads on models of different sizes. Detailed reproduction scripts are provided in the link; the results are as follows:

As can be seen, DashInfer shows a performance advantage in all cases, especially in multi-round conversations.

[Charts: DashInfer-VLM vs. vLLM benchmark results]
