Introduction
DashInfer-VLM is an inference architecture for visual multimodal large models (VLMs), especially optimized for accelerating inference of the Qwen VL models. The biggest difference between DashInfer-VLM and other VLM inference acceleration frameworks is that it separates the ViT part from the LLM part, and the ViT and LLM run in parallel without interfering with each other.
This means that image and video preprocessing, as well as ViT feature extraction, never interrupt LLM generation. The ViT and LLM can even be deployed as a disaggregated architecture; DashInfer-VLM is the first VLM serving framework in the open-source community to use this design.
In a multi-card deployment, there is a ViT processing unit on each card, which gives a very significant performance advantage in video and multi-image scenarios.
In addition, the ViT part supports memory caching, so ViT results do not need to be recomputed across multiple rounds of dialog.
Below is its architecture diagram, configured for a 4-card deployment of the 72B model.
The architecture diagram shows the process and components:
- In the ViT part, various inference engines can be used, such as TensorRT or onnxruntime (the framework exports the ViT part of the model to ONNX); TensorRT is currently supported by default.
- In the LLM part, DashInfer is used for inference.
- In the cache part, the framework supports a ViT result memory cache, an LLM prefix cache, and an LLM multimodal prefix cache (not enabled by default); a conceptual sketch of the ViT result cache follows below.
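Conceptually, the ViT result memory cache maps image or video-frame content to its precomputed ViT features, so that repeated images across requests or dialog rounds skip the ViT stage entirely. The snippet below is a minimal Python sketch of this idea, not DashInfer's actual implementation; the class name, the `compute_vit_features` callback, and the LRU policy are illustrative assumptions.

```python
import hashlib
from collections import OrderedDict

class ViTResultCache:
    """Conceptual sketch of a ViT feature memory cache (not DashInfer's real API).

    Keys are content hashes of the preprocessed image bytes, values are the
    ViT feature tensors, with a simple LRU eviction policy.
    """

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def _key(self, image_bytes: bytes) -> str:
        # Hash the raw image content so identical images map to the same entry.
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_compute(self, image_bytes: bytes, compute_vit_features):
        key = self._key(image_bytes)
        if key in self._store:
            # Cache hit: reuse features, no ViT recomputation in later dialog rounds.
            self._store.move_to_end(key)
            return self._store[key]
        # Cache miss: run ViT (e.g. via TensorRT) and remember the result.
        features = compute_vit_features(image_bytes)
        self._store[key] = features
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # Evict the least recently used entry.
        return features
```

In DashInfer-VLM this caching is handled inside the serving framework, so clients simply resend the same image or video URL in later rounds and benefit from it transparently.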
Code Address:
https://github.com/modelscope/dash-infer
Document Address:
https://dashinfer.readthedocs.io/en/latest/vlm/vlm_offline_inference_en.html
Best Practice
Experience DashInfer on the free GPU compute provided by the ModelScope community:
First, install dashinfer-vlm and TensorRT:

```python
# First install the required packages
import os

# Download and install dashinfer version 2.0.0rc2
# Download and extract the TensorRT packages using wget, if needed.
# pip install dashinfer 2.0.0rc2
#!pip install https://github.com/modelscope/dash-infer/releases/download/v2.0.0-rc2/dashinfer-2.0.0rc2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
#!wget https://modelscope.oss-cn-beijing.aliyuncs.com/releases/TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz
#!tar -xvzf TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz

# To install dashinfer, we recommend downloading the wheel locally first (the package is large),
# replacing the URL with the corresponding modelscope one.
#!wget https://modelscope.oss-cn-beijing.aliyuncs.com/releases/dashinfer-2.0.0rc3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
#!pip install ./dashinfer-2.0.0rc3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

# Install dashinfer-vlm
#!pip install dashinfer-vlm

# Install the OpenAI client
#!pip install openai==1.56.2

# Install the Python package for TensorRT from the extracted archive.
#!pip install TensorRT-10.6.0.26/python/tensorrt-10.6.0-cp310-none-linux_x86_64.whl
```
TensorRT requires environment variable configuration:
```python
import os

# Get the path to the TensorRT runtime library
trt_runtime_path = os.getcwd() + "/TensorRT-10.6.0.26/lib/"

# Get the current LD_LIBRARY_PATH environment variable value
current_ld_library_path = os.environ.get('LD_LIBRARY_PATH', '')

# Prepend the TensorRT library path to the existing value, if any
if current_ld_library_path:
    updated_ld_library_path = f"{trt_runtime_path}:{current_ld_library_path}"
else:
    updated_ld_library_path = trt_runtime_path

# Update LD_LIBRARY_PATH so the TensorRT libraries can be found at runtime
os.environ['LD_LIBRARY_PATH'] = updated_ld_library_path
```
After the environment is installed, start dashinfer-vlm to serve the model as an OpenAI-compatible server; the model can be changed to 7B, 72B, etc.
All GPU memory in the environment is used by default.
```
!dashinfer_vlm_serve --model qwen/Qwen2-VL-2B-Instruct --port 8000 --host 127.0.0.1
```
This process initializes DashInfer as well as the external engine used by the ViT (TensorRT in this case), and starts an OpenAI-compatible service.
Seeing these logs indicates that TensorRT was initialized successfully:
Seeing these logs indicates that DashInfer was initialized successfully:
Seeing these logs indicates that the openai service was initialized successfully:
Once all of this initialization succeeds, you can open another notebook for the client and benchmarking.
Notebook Address: https://modelscope.cn/notebook/share/ipynb/6ea987c5/vl-start-server.ipynb
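Before running the demos, it is worth checking from the client side that the OpenAI-compatible endpoint is reachable. A minimal sketch, assuming the server from the previous step is listening on port 8000 and exposes the standard /v1/models listing route (as most OpenAI-compatible servers do):

```python
from openai import OpenAI

# Point the client at the locally started dashinfer_vlm_serve endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Listing models is a cheap liveness check; a successful call means the
# OpenAI-compatible service is up and accepting requests.
# (Assumes the server implements the standard /v1/models route.)
for model in client.models.list().data:
    print(model.id)
```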
Image Understanding Demo
The following demo shows image understanding with multiple images:
```python
# Install the required OpenAI client version (VL support requires a recent OpenAI client).
!pip install openai==1.56.2

from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Prepare the API call for a chat completion with two images
response = client.chat.completions.create(
    model="model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Are these images different?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://farm4.staticflickr.com/3075/3168662394_7d7103de7d_z_d.jpg",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://farm2.staticflickr.com/1533/26541536141_41abe98db3_z_d.jpg",
                    },
                },
            ],
        },
    ],
    stream=True,
    max_completion_tokens=1024,
    temperature=0.1,
)

# Process the streamed response
full_response = ""
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        # Append the delta content to the full response
        full_response += delta
    print(".", end="")  # Print a dot for each chunk received

# Print the full response
print(f"\nImage: Full Response:\n{full_response}")
```
Video Comprehension Demo
Since the OpenAI API does not define a standard video interface, this framework provides a video_url content type, which automatically downloads the video, extracts frames, and analyzes them.
```python
# Video example
!pip install openai==1.56.2  # Ensure the OpenAI client supports the video link feature.

from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Create a chat completion request with a video URL
response = client.chat.completions.create(
    model="model",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Generate a compelling description that I can upload along with the video.",
                },
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://cloud.video.taobao.com/vod/JCM2awgFE2C2vsACpDESXZ3h5_iQ5yCZCypmjtEs2Ck.mp4",
                        "fps": 2,
                    },
                },
            ],
        },
    ],
    max_completion_tokens=1024,
    top_p=0.5,
    temperature=0.1,
    frequency_penalty=1.05,
    stream=True,
)

# Process the streaming response
full_response = ""
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        # Append the delta content from the chunk to the full response
        full_response += delta
    print(".", end="")  # Indicate progress with dots

# Print the complete response
print(f"\nFull Response: \n{full_response}")
```
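Because the ViT results are cached in memory, a follow-up question about the same video in a multi-round conversation does not recompute the ViT features on the server side. Below is a minimal sketch of such a follow-up round, reusing the `client` and the `full_response` produced by the streaming loop above; the message layout is an illustration, not a prescribed multi-round API.

```python
# Continue the conversation: append the first-round answer and ask a follow-up question.
followup = client.chat.completions.create(
    model="model",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Generate a compelling description that I can upload along with the video.",
                },
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://cloud.video.taobao.com/vod/JCM2awgFE2C2vsACpDESXZ3h5_iQ5yCZCypmjtEs2Ck.mp4",
                        "fps": 2,
                    },
                },
            ],
        },
        # First-round answer from the streaming example above.
        {"role": "assistant", "content": full_response},
        # Follow-up question; the video does not need to go through ViT again.
        {"role": "user", "content": "Now shorten the description to a single sentence."},
    ],
    max_completion_tokens=256,
    temperature=0.1,
    stream=False,
)
print(followup.choices[0].message.content)
```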
Benchmark
Using the image understanding example above, run a simple multi-concurrency test to measure throughput.
```python
# Benchmark
!pip install openai==1.56.2

import time
import concurrent.futures
from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Request parameters
model = "model"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Are these images different?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://farm4.staticflickr.com/3075/3168662394_7d7103de7d_z_d.jpg",
                },
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://farm2.staticflickr.com/1533/26541536141_41abe98db3_z_d.jpg",
                },
            },
        ],
    }
]

# Send a single request and measure its latency
def send_request():
    start_time = time.time()
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=False,
        max_completion_tokens=1024,
        temperature=0.1,
    )
    end_time = time.time()
    latency = end_time - start_time
    return latency

# Benchmark function: issue num_requests requests using num_workers concurrent threads
def benchmark(num_requests, num_workers):
    latencies = []
    start_time = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(send_request) for _ in range(num_requests)]
        for future in concurrent.futures.as_completed(futures):
            latencies.append(future.result())
    end_time = time.time()
    total_time = end_time - start_time
    qps = num_requests / total_time
    average_latency = sum(latencies) / len(latencies)
    throughput = num_requests * 1024 / total_time  # Assuming a response size of 1024 bytes per request
    print(f"Total Time: {total_time:.2f} seconds")
    print(f"QPS: {qps:.2f}")
    print(f"Average Latency: {average_latency:.2f} seconds")

# Main program entry
if __name__ == "__main__":
    num_requests = 100  # Total number of requests
    num_workers = 10    # Number of concurrent worker threads
    benchmark(num_requests, num_workers)
```
Test results:
Notebook Address: https://modelscope.cn/notebook/share/ipynb/5560603a/vl-test-and-benchmark.ipynb
Comprehensive performance comparison with vLLM:
For a more comprehensive and accurate performance comparison with vLLM, we used OpenGVLab/InternVL-Chat-V1-2-SFT-Data to benchmark single-concurrency, multi-concurrency, and multi-round conversation workloads on models of different sizes. Detailed reproduction scripts are available at the link, and the results are as follows:
It can be seen that DashInfer has some performance advantages in all cases, especially in multi-round conversations.
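The linked reproduction scripts cover the full methodology. As a rough illustration of why multi-round conversations benefit from the prefix and multimodal caches, the sketch below replays a dialog while appending each answer to the history, so later rounds share a growing prefix; the questions, image URL, and round count are placeholders rather than the actual benchmark settings.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Placeholder follow-up questions; the real benchmark replays conversations
# from OpenGVLab/InternVL-Chat-V1-2-SFT-Data instead.
questions = [
    "Describe this image in detail.",
    "What objects are in the foreground?",
    "Summarize your answers in one sentence.",
]
image_url = "https://farm4.staticflickr.com/3075/3168662394_7d7103de7d_z_d.jpg"

# First round carries the image; later rounds are text-only follow-ups.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": questions[0]},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
]

for round_idx, question in enumerate(questions):
    if round_idx > 0:
        # Each new round extends the shared conversation prefix,
        # which is what the prefix / multimodal caches can reuse.
        messages.append({"role": "user", "content": question})
    start = time.time()
    resp = client.chat.completions.create(
        model="model",
        messages=messages,
        stream=False,
        max_completion_tokens=512,
        temperature=0.1,
    )
    answer = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(f"Round {round_idx + 1}: {time.time() - start:.2f}s")
```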