Long-VITA: A Visual Language Model Supporting Ultra-Long Contexts

General Introduction

Long-VITA is an open-source multimodal large model developed by the VITA-MLLM team, focused on visual-language tasks with extremely long contexts. It can analyze images, videos, and text together and supports inputs of up to 1 million tokens, targeting scenarios such as video understanding, high-resolution image parsing, and multimodal agent reasoning. Compared with other models, Long-VITA performs well on short-context tasks while offering a breakthrough advantage in long-sequence processing. Developed in collaboration with Tencent Youtu Lab, Nanjing University, and Xiamen University, the project is trained entirely on open-source datasets, supports both NPU and GPU platforms, and aims to give the open-source community a powerful tool for long-context multimodal research. The model code, training recipes, and weights are public, making it suitable for researchers and developers exploring cutting-edge multimodal AI applications.


Feature List

  • Ultra-long context processing: Supports image, video, and text inputs of up to 1 million tokens or 4K video frames for complex scene analysis.
  • Multimodal understanding: Integrates image, video, and text processing to analyze multiple data types simultaneously.
  • Efficient distributed inference: Uses context parallelism to run inference efficiently on very long inputs.
  • Open-source dataset training: Trained on 17 million publicly available samples to ensure reproducibility and transparency.
  • Cross-platform support: Compatible with Ascend NPUs and Nvidia GPUs, adapting flexibly to different hardware environments.
  • Short-context optimization: Maintains leading performance on traditional multimodal tasks, balancing long- and short-sequence requirements.
  • Logits-masked language modeling: An innovative language-model head design that improves long-sequence inference.

 

Usage Guide

Long-VITA is an open source project that allows users to obtain code and model weights through a GitHub repository and deploy them for use locally or on a server. Below is a detailed guide to help users get started and explore its powerful features.

Installation process

  1. Clone the repository
    Open a terminal and run the following commands to clone the Long-VITA repository:

    git clone https://github.com/VITA-MLLM/Long-VITA.git
    cd Long-VITA

This will download all the code and documentation for the project.

  2. Create a virtual environment
    Use Conda to create a separate Python environment to keep dependencies isolated:

    conda create -n long-vita python=3.10 -y
    conda activate long-vita
    
  3. Install dependencies
    Install the Python packages the project needs:

    pip install --upgrade pip
    pip install -r requirements.txt
    

    For faster inference, Flash Attention can optionally be installed:

    pip install flash-attn --no-build-isolation
    
  4. Download the model weights
    Long-VITA is released in several versions (e.g., 16K, 128K, and 1M tokens) that can be downloaded from Hugging Face (see the example command after this list).

  5. Configure the hardware environment
    • Nvidia GPU: Make sure CUDA and cuDNN are installed, then set the environment variable:
      export CUDA_VISIBLE_DEVICES=0
      
    • Ascend NPU: Configure your MindSpeed or Megatron environment according to the official documentation.
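
For step 4, the weights can be pulled from Hugging Face with the huggingface-cli tool. The snippet below is a minimal sketch: the repository name and local directory are assumptions, so check the model cards linked in the Long-VITA README for the exact identifiers (the 128K and 1M variants follow the same pattern).

    # Minimal sketch: the repository name and local directory are assumptions, not confirmed paths
    pip install -U "huggingface_hub[cli]"
    huggingface-cli download VITA-MLLM/Long-VITA-16K_HF --local-dir checkpoints/Long-VITA-16K_HF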

Usage

Long-VITA supports two main modes of operation, inference and evaluation. The steps for each are described below.

Running Inference

  1. Prepare the input data
    • Image: Place an image file (e.g. .jpg or .png) in the asset folder.
    • Video: Common video formats (e.g. .mp4) are supported; place the file at the specified path.
    • Text: Write a question or instruction and save it as a .txt file, or enter it directly on the command line.
  2. Run the inference command
    For image understanding, for example, run the following command:

    CUDA_VISIBLE_DEVICES=0 python video_audio_demo.py \
    --model_path [path to model weights] \
    --image_path asset/sample_image.jpg \
    --model_type qwen2p5_instruct \
    --conv_mode qwen2p5_instruct \
    --question "Describe the content of this image."
    

    For video input, add the --video_path parameter:

    --video_path asset/sample_video.mp4
    
  3. View the output
    The model prints its results to the terminal, such as an image description or video analysis.

Evaluating Performance

  1. Prepare the evaluation dataset
    Download a benchmark dataset (e.g. Video-MME) and organize the files in the required structure.
  2. Run the evaluation script
    Evaluate using the provided script:

    bash script/evaluate.sh [model path] [dataset path]
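    # For example, with hypothetical local paths (substitute your own checkpoint and dataset locations):
    bash script/evaluate.sh checkpoints/Long-VITA-128K_HF data/Video-MME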
    

Featured Functions in Practice

Ultra-Long Context Processing

  • Procedure:
    1. Select the Long-VITA-1M model and make sure the total token count of the input data (e.g., a long video or multiple high-resolution images) does not exceed 1 million.
    2. Use the --max_seq_len 1048576 parameter to set the maximum sequence length.
    3. Run inference and observe how the model handles long-sequence tasks (e.g., video summary generation); a hedged full command is sketched after this list.
  • Example: Input an hour-long video and ask "Summarize the main plot of the video"; the model outputs a concise text summary.
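
Putting the procedure together, a full invocation might look like the sketch below. It reuses the demo script and flags shown earlier in this guide; whether video_audio_demo.py accepts --max_seq_len in exactly this form, and the sample video path, are assumptions to verify against the repository's scripts.

    # Sketch only: the --max_seq_len flag on this demo script and the video path are assumptions
    CUDA_VISIBLE_DEVICES=0 python video_audio_demo.py \
    --model_path [path to Long-VITA-1M weights] \
    --video_path asset/long_video.mp4 \
    --model_type qwen2p5_instruct \
    --conv_mode qwen2p5_instruct \
    --max_seq_len 1048576 \
    --question "Summarize the main plot of the video."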

Multimodal Understanding

  • Procedure:
    1. Prepare multimodal inputs, such as image + text or video + question.
    2. Specify both --image_path and --question on the command line (a full hedged command follows this list), e.g.:
      --image_path asset/sample_image.jpg --question "Who is the person in the picture?"

    3. The model combines the visual and textual information to generate an answer.
  • Example: Input a photo of a celebrity and the question "What is he doing?"; the model describes the action in the photo.
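
For completeness, the flags above drop into the same demo invocation used in the inference section; this is a sketch assuming the same script, model type, and conversation mode apply, with the weights path left as a placeholder.

    # Sketch reusing the demo command from the inference section; the weights path is a placeholder
    CUDA_VISIBLE_DEVICES=0 python video_audio_demo.py \
    --model_path [path to model weights] \
    --image_path asset/sample_image.jpg \
    --model_type qwen2p5_instruct \
    --conv_mode qwen2p5_instruct \
    --question "Who is the person in the picture?"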

Distributed Inference

  • Procedure:
    1. Configure a multi-GPU environment by setting CUDA_VISIBLE_DEVICES=0,1,2,3.
    2. Launch with context parallelism (a combined sketch follows this list):
      python -m torch.distributed.launch --nproc_per_node=4 video_audio_demo.py [arguments]

    3. The model automatically distributes the work across the devices to speed up processing.
  • Example: For very long videos, distributed inference can cut processing time from hours to minutes.
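
Combining the two steps, a four-GPU launch might look like the following sketch; the demo flags are carried over from earlier sections, and which of them are supported under distributed launch is an assumption to check against the repository's own scripts.

    # Sketch: 4-GPU context-parallel launch; verify the accepted flags against the repo's scripts
    CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \
        video_audio_demo.py \
        --model_path [path to model weights] \
        --video_path asset/long_video.mp4 \
        --model_type qwen2p5_instruct \
        --conv_mode qwen2p5_instruct \
        --question "Summarize the main plot of the video."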

Caveats

  • Ensure your hardware has enough memory; a 1M-token input may require more than 32 GB of GPU memory.
  • Input data needs to be preprocessed (e.g., video frame extraction, sketched below) to match the model's requirements.
  • Downloading the weights requires a stable, high-speed internet connection.
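
If frames need to be extracted ahead of time, a standard ffmpeg call can handle it; the one-frame-per-second rate and the file paths below are only illustrative assumptions.

    # Extract one frame per second from a sample video into numbered JPEGs (paths and rate are illustrative)
    mkdir -p asset/frames
    ffmpeg -i asset/sample_video.mp4 -vf fps=1 asset/frames/frame_%04d.jpg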

With these steps, users can easily deploy Long-VITA and experience its ultra-long context and multimodal understanding capabilities for researching, developing, or testing multimodal AI applications.
