InternLM-XComposer: A Multimodal Large Model for Long-Text Output and Image/Video Understanding


Overview

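InternLM-XComposer is an open-source vision-language multimodal large model project developed by the InternLM team and hosted on GitHub. Built on the InternLM language model, it handles multimodal data including text, images, and video, and is used for image-text composition, image understanding, and video analysis. The project is known for supporting long contexts of up to 96K, processing 4K high-resolution images, and fine-grained video understanding; it is reported to reach performance comparable to GPT-4V with only 7B parameters. Code, model weights, and detailed documentation are available on GitHub, which makes the project suitable for researchers, developers, and anyone interested in multimodal AI. As of February 2025, several versions have been released, including InternLM-XComposer-2.5 and OmniLive, each continuing to refine the multimodal interaction experience.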

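Feature List

Ultra-long context support: handles mixed image-text content up to 96K in length, suited to complex tasks.
High-resolution image understanding: analyzes images from 336 pixels up to 4K while preserving fine detail.
Fine-grained video understanding: decomposes a video into multiple frames to capture dynamic details.
Image-text composition: generates illustrated articles or web page content from an instruction.
Multi-turn, multi-image dialogue: accepts several images as input and supports continued conversation about them.
Open-source model support: provides multiple model weights and fine-tuning code for downstream development.
Multimodal streaming interaction: the OmniLive version supports long-duration video and audio processing.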

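Usage Guide

InternLM-XComposer is an open-source project hosted on GitHub, and installing and using it requires some programming experience. The steps below walk through setup and basic use.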

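Installation

1. Environment preparation

- Make sure Python 3.9 or later is installed on your machine.
- An NVIDIA GPU with CUDA support is required (CUDA 11.x or 12.x recommended).
- Install Git so you can clone the repository.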
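
The prerequisites above can be checked before installing anything else. The following is a minimal sketch, not part of the project: `check_environment` is a hypothetical helper name, and the CUDA check only works once PyTorch is installed, so it reports `None` until then.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in the list above


def check_environment(min_python=MIN_PYTHON):
    """Return a dict describing whether each prerequisite appears to be met."""
    report = {
        "python_ok": sys.version_info[:2] >= min_python,
        "git_found": shutil.which("git") is not None,
    }
    try:
        import torch  # only importable after the dependencies are installed
        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        report["cuda_available"] = None  # unknown until PyTorch is installed
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```

A `False` for `python_ok` or `git_found` means the corresponding prerequisite should be installed before continuing.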

2. Clone the project
Run the following commands in a terminal to download the project locally:

git clone https://github.com/InternLM/InternLM-XComposer.git
cd InternLM-XComposer

3. Create a virtual environment. Use Conda or another virtual-environment tool to isolate dependencies:

conda create -n internlm python=3.9 -y
conda activate internlm

4. Install dependencies. Install the required libraries as described in the official documentation:

pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.33.2 timm==0.4.12 sentencepiece==0.1.99 gradio==4.13.0 markdown2==4.4.10 xlsxwriter==3.1.2 einops

- Optional: install flash-attention 2 to reduce GPU memory usage:

pip install flash-attn --no-build-isolation

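5. Download model weights. The project supports downloading pretrained models from Hugging Face, for example:

import torch
from transformers import AutoModel
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-7b', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval()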

6. Verify the installation. Run the example code to confirm the environment works:

python -m torch.distributed.run --nproc_per_node=1 example_code/simple_chat.py
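
Because step 4 pins exact package versions, a quick version comparison can help explain a failing example run. The sketch below is an illustration rather than a project tool: `PINNED` copies the pins from the pip commands above, and `installed_version` and `find_mismatches` are hypothetical helper names.

```python
from importlib import metadata

# Versions copied from the pip commands in step 4 (local tags such as +cu117 omitted).
PINNED = {
    "torch": "2.0.1",
    "torchvision": "0.15.2",
    "transformers": "4.33.2",
    "timm": "0.4.12",
    "sentencepiece": "0.1.99",
    "gradio": "4.13.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # Drop a local build suffix such as "+cu117" before comparing.
        base = found.split("+")[0] if found else None
        if base != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found}")
```

An empty result means every pinned package is installed at the expected version; a `found` value of `None` means the package is missing.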

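Main Feature Workflows

1. Image-text composition

Overview: generates content that combines text and images, such as an article or a web page, from a user instruction.
Steps:

Prepare the input: write a text instruction (for example, "Write an article about travel that includes three images").
Run the code:

from transformers import AutoModel, AutoTokenizer
# Load the model and tokenizer; trust_remote_code is needed for the custom model code
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-7b', trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-7b', trust_remote_code=True)
query = "Write an article about travel that includes three images"
# Deterministic decoding with beam search
response, _ = model.chat(tokenizer, query, do_sample=False, num_beams=3)
print(response)

Output: the model generates interleaved image-text content, with image descriptions embedded automatically in the text.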

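2. High-resolution image understanding

Overview: analyzes a high-resolution image and returns a detailed description.
Steps:

Prepare the image: place the image file in a local directory (for example, examples/dubai.png).
Run the code:

import torch
# Reuses the model and tokenizer loaded in the previous example
query = "Analyze this image in detail"
image = ['examples/dubai.png']
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3)
print(response)

Output: the model returns a detailed description of the image content, such as buildings and colors.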
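
To give a feel for how much more work a 4K input represents than a 336-pixel one, the sketch below counts the fixed-size tiles needed to cover an image. This is an assumption for illustration only: the article does not describe the model's preprocessing, and the 336-pixel tile size is borrowed from the lower bound in the feature list. `tile_grid` and `tile_count` are hypothetical helper names.

```python
import math

TILE = 336  # assumed tile edge in pixels (hypothetical, not confirmed by the source)


def tile_grid(width, height, tile=TILE):
    """Return (columns, rows) of tile-sized patches needed to cover the image."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return math.ceil(width / tile), math.ceil(height / tile)


def tile_count(width, height, tile=TILE):
    """Total number of tiles covering a width x height image."""
    cols, rows = tile_grid(width, height, tile)
    return cols * rows
```

Under this assumption, a 3840 x 2160 frame needs a 12 x 7 grid (84 tiles), compared with a single tile for a 336 x 336 image.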

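3. Video analysis

Overview: decomposes a video into frames and describes the content.
Steps:

Prepare the video: download a sample video (for example, liuxiang.mp4).
Use the OmniLive version:

from lmdeploy import pipeline
# Note: load_video is not defined or imported in this snippet; it must be supplied
# separately (for example, by a frame-loading helper from the project's code).
pipe = pipeline('internlm/internlm-xcomposer2d5-ol-7b')
video = load_video('liuxiang.mp4')
query = "Describe the content of this video"
response = pipe((query, video))
print(response.text)

Output: returns a detailed description of the video frames, such as actions or scenes.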
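
Decomposing a video into frames usually means choosing a subset of frames rather than using all of them. The sketch below shows one common approach, uniform sampling; it is an assumption for illustration and not necessarily how the project selects frames. `sample_frame_indices` is a hypothetical helper name.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick up to num_samples evenly spaced frame indices from a video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Fewer frames than requested: keep every frame.
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the frame at the centre of each of the num_samples equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]
```

Sampling from segment centres avoids always picking the first frame, which is often a title card or a black frame.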

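4. Multi-turn, multi-image dialogue

Overview: accepts several images as input and supports continued conversation about them.
Steps:

Prepare several images (for example, cars1.jpg, cars2.jpg, cars3.jpg).
Run the code:

# One <ImageHere> placeholder per image, in the same order as the images list
query = "Image1 <ImageHere>; Image2 <ImageHere>; Image3 <ImageHere>; Analyze the strengths and weaknesses of these three cars"
images = ['examples/cars1.jpg', 'examples/cars2.jpg', 'examples/cars3.jpg']
response, _ = model.chat(tokenizer, query, images, do_sample=False, num_beams=3)
print(response)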
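
The query above follows a regular pattern, one `<ImageHere>` placeholder per image, so it can be built programmatically. The sketch below also shows one way to carry a conversation across turns. Both parts are illustrative assumptions: the helper names are hypothetical, and `chat_fn` stands for a wrapper you would write around `model.chat`, on the assumption that the second return value (discarded as `_` above) is the conversation history.

```python
PLACEHOLDER = "<ImageHere>"  # image slot marker used in the query above


def build_multi_image_query(num_images, question):
    """Build 'Image1 <ImageHere>; Image2 <ImageHere>; ...; question'."""
    if num_images < 1:
        raise ValueError("at least one image is required")
    slots = [f"Image{i} {PLACEHOLDER}" for i in range(1, num_images + 1)]
    return "; ".join(slots + [question])


def placeholders_match(query, images):
    """True when the query has exactly one placeholder per image path."""
    return query.count(PLACEHOLDER) == len(images)


def run_turns(chat_fn, first_query, follow_ups):
    """Ask follow-up questions, passing the returned history back each turn.

    chat_fn is a hypothetical callable (query, history) -> (response, history).
    """
    responses = []
    response, history = chat_fn(first_query, [])
    responses.append(response)
    for query in follow_ups:
        response, history = chat_fn(query, history)
        responses.append(response)
    return responses
```

Checking `placeholders_match(query, images)` before calling the model catches a mismatch between the number of placeholders and the number of image paths early.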