
VideoMind: an open source project for timestamp-based video grounding and Q&A

General Introduction

VideoMind is an open source multimodal AI tool for reasoning, question answering and summary generation over long videos. It was developed by Ye Liu of the Hong Kong Polytechnic University together with a team from Show Lab at the National University of Singapore. The tool mimics the way humans understand videos: it breaks a task into planning, grounding, verifying and answering, and solves the problem step by step. Its core technique is Chain-of-LoRA, which switches between these roles on a single base model, keeping computation and memory costs low. The technical report was published on March 17, 2025, and the code, models and dataset were released on March 21, 2025. All resources are available on GitHub and Hugging Face.

Rather than transcribing the audio to text, the project works purely from the visual stream and localizes content precisely by timestamp. This makes it practical for automated video trimming and for teaching scenarios.


Try it online: https://huggingface.co/spaces/yeliudev/VideoMind-2B

 

Function List

  • Long-video temporal grounding: finds the time span in the video that is relevant to the question.
  • Video Q&A: extracts accurate answers from the video based on user questions.
  • Video summary generation: distills the key content into a concise overview.
  • Multimodal agent framework: a planner, grounder, verifier and answerer collaborate on each task.
  • Chain-of-LoRA: a single model takes on multiple roles, reducing resource consumption.
  • Multiple hardware backends: runs on NVIDIA GPUs or Ascend NPUs.
  • Efficient training: integrates DeepSpeed, LoRA, FlashAttention2 and other techniques.
  • Online demo: try the features quickly through a Gradio interface.
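
To make the agent framework in the list above more concrete, here is a minimal sketch of how a planner, a grounder (locator), a verifier (validator) and an answerer (responder) could cooperate on top of one shared model. Every name in it (SingleBackboneAgent, answer_question, the stub role functions) is hypothetical and chosen for illustration only; it is not VideoMind's actual API, and the stubs return canned values instead of running a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SingleBackboneAgent:
    """Toy stand-in for one base model with a swappable role (hypothetical)."""
    roles: Dict[str, Callable]
    active: str = ""

    def call(self, role, *args):
        # Switching roles only changes which role is active;
        # in this sketch the "backbone" object itself is shared.
        self.active = role
        return self.roles[role](*args)


def answer_question(agent, video, question):
    """Plan, ground, verify, then answer. Returns (answer, chosen_span)."""
    plan = agent.call("planner", question)
    span = None
    if "grounder" in plan:
        candidates = agent.call("grounder", video, question)
        if "verifier" in plan:
            # Keep the candidate span the verifier scores highest.
            scores = [agent.call("verifier", video, question, c) for c in candidates]
            span = candidates[scores.index(max(scores))]
        else:
            span = candidates[0]
    return agent.call("answerer", video, question, span), span


# Stub roles with canned outputs; spans are (start_seconds, end_seconds).
stub_roles = {
    "planner": lambda q: ["grounder", "verifier", "answerer"],
    "grounder": lambda v, q: [(10.0, 20.0), (312.0, 454.0)],
    "verifier": lambda v, q, span: 0.9 if span[0] > 300 else 0.2,
    "answerer": lambda v, q, span: f"answer from {span[0]:.0f}s to {span[1]:.0f}s",
}

agent = SingleBackboneAgent(roles=stub_roles)
answer, span = answer_question(agent, "sample.mp4", "Where is AI discussed?")
print(answer, span)
```

The point of the design is that the verifier filters the grounder's candidates before the answerer sees the video, so the answer is tied to a specific time span.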

 

Usage Guide

VideoMind is an open source project, so users need to download the code and configure the environment before using it. The following guide covers installation and feature usage.

Installation

  1. Clone the code repository
    Run the following command in a terminal to download the VideoMind source code:
git clone https://github.com/yeliudev/VideoMind.git

    Then change into the project directory:

cd VideoMind
  2. Set up the Python environment
    Create a virtual environment with conda (Python 3.12 is recommended):
conda create -n videomind python=3.12 -y
conda activate videomind
  3. Install the dependencies
    Run the following command to install the required libraries:
pip install -r requirements.txt

    For efficient training, also install DeepSpeed:

pip install deepspeed
  4. Download the models and datasets
  • Visit https://huggingface.co/collections/yeliudev/videomind-67dd41f42c57f0e7433afb36 to download a pre-trained model (e.g. VideoMind-2B) and extract it into the models folder.
  • The dataset is available at https://huggingface.co/datasets/yeliudev/VideoMind-Dataset. It includes 27 video datasets and VideoMind-SFT (481K training samples); place it in the data directory.
  5. Test the installation
    Run the test script to confirm that the environment is set up correctly:
python test.py
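
The download step above can also be scripted. The sketch below uses snapshot_download from the huggingface_hub package (install it with pip install huggingface_hub). The dataset repository id is taken from the dataset URL above; the model repository id yeliudev/VideoMind-2B is an assumption based on the model name, so check it against the collection page before relying on it. The helper names download_plan and download_all are made up for this example.

```python
from pathlib import Path

MODEL_REPO = "yeliudev/VideoMind-2B"         # assumed repo id, verify on the collection page
DATASET_REPO = "yeliudev/VideoMind-Dataset"  # from the dataset URL in this guide


def download_plan(root="."):
    """Describe what to download and where, using the folder layout from this guide."""
    root = Path(root)
    return [
        {"repo_id": MODEL_REPO, "repo_type": "model",
         "local_dir": str(root / "models" / "VideoMind-2B")},
        {"repo_id": DATASET_REPO, "repo_type": "dataset",
         "local_dir": str(root / "data")},
    ]


def download_all(root="."):
    """Fetch every entry of the plan from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the plan can be inspected without the package installed.
    from huggingface_hub import snapshot_download
    for job in download_plan(root):
        snapshot_download(**job)


if __name__ == "__main__":
    for job in download_plan("."):
        print(job["repo_id"], "->", job["local_dir"])
    # Uncomment to start the download (the full dataset is large):
    # download_all(".")
```

Since the dataset bundles 27 video datasets, it may be worth passing the allow_patterns argument of snapshot_download to fetch only the parts you need.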

Main Functions

1. Long-video temporal grounding

  • Prepare the video: place the video file in the data/videos directory. MP4, AVI and other formats are supported.
  • Run the command:
python run_grounding.py --video_path data/videos/sample.mp4 --query "Find the part that discusses AI"
  • Result: a time range is returned, such as "00:05:12 - 00:07:34".
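
If the returned range is to be fed into another tool (for example a video cutter), it helps to convert it to seconds first. The following is a small sketch that assumes the output has exactly the "HH:MM:SS - HH:MM:SS" shape shown above; the function name parse_time_range is made up for this example and is not part of VideoMind.

```python
import re

_TIME = r"(\d+):([0-5]\d):([0-5]\d)"
_RANGE = re.compile(rf"^\s*{_TIME}\s*-\s*{_TIME}\s*$")


def parse_time_range(text):
    """Convert 'HH:MM:SS - HH:MM:SS' into (start_seconds, end_seconds)."""
    match = _RANGE.match(text)
    if match is None:
        raise ValueError(f"not a time range: {text!r}")
    h1, m1, s1, h2, m2, s2 = (int(group) for group in match.groups())
    start = h1 * 3600 + m1 * 60 + s1
    end = h2 * 3600 + m2 * 60 + s2
    if end < start:
        raise ValueError("end of range is before its start")
    return start, end


start, end = parse_time_range("00:05:12 - 00:07:34")
print(start, end, end - start)  # 312 454 142
```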

2. Video Q&A

  • Enter the question: create a new query.txt file and write your question in it, such as "Who is the main character in the video?".
  • Run the command:
python run_qa.py --video_path data/videos/sample.mp4 --query_file query.txt
  • Get the answer: the terminal displays the result, such as "The protagonist is Li Ming".

3. Video summary generation

  • Run the command:
python run_summary.py --video_path data/videos/sample.mp4
  • View the summary: the result is saved as summary.txt, for example "The video introduces a science and technology lecture".

Feature Details

Chain-of-LoRA technology

  • VideoMind uses a single model and switches roles via LoRA adapters. The configuration lives in config/lora.yaml, where the rank parameter can be adjusted to improve results:
lora:
  rank: 16
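
For intuition about what the rank setting trades off: in standard LoRA, a frozen weight matrix of shape d_out x d_in is paired with two small trainable matrices of shapes d_out x rank and rank x d_in, so the number of trainable parameters grows linearly with the rank. The sketch below only does this arithmetic. The layer size of 1536 is an illustrative assumption, not a dimension taken from VideoMind, and the function names are made up for the example.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA pair: (rank x d_in) plus (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters of the frozen dense weight matrix the adapter attaches to."""
    return d_in * d_out


d = 1536  # hypothetical hidden size, for illustration only
for rank in (8, 16, 32):
    added = lora_trainable_params(d, d, rank)
    share = added / full_params(d, d)
    print(f"rank={rank:>2}: {added} trainable params ({share:.2%} of the dense layer)")
```

A higher rank gives each role's adapter more capacity at the cost of more trainable parameters and memory, which is why the value is worth tuning rather than simply maximizing.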

Online Demo

  • No local installation is needed: visit https://huggingface.co/spaces/yeliudev/VideoMind-2B, upload a video and enter your question to try the features directly. For more information, see docs/DEMO.md.

Model Training

  • Prepare the data: use VideoMind-SFT or the 27 other datasets, placed in the data directory.
  • Start training: run:
python train.py --dataset data/VideoMind-SFT --model models/VideoMind-2B
  • Monitor training: TensorBoard and Wandb are supported for viewing progress. See docs/TRAIN.md for configuration details.

Notes

  • When a video file is too large, process it in segments to avoid running out of memory.
  • Ascend NPU users need to install the Ascend driver and add the --device ascend parameter when running.
  • For more information on debugging and evaluation, see docs/EVAL.md, which includes benchmarking guidelines.
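
The segment-wise processing suggested in the notes above can be planned with a few lines of Python. This is a minimal sketch under stated assumptions: the function name plan_segments, the 10-minute default segment length and the 10-second overlap are arbitrary illustrative choices, not VideoMind settings. It only computes the boundaries; cutting the video and running each piece is left to your tooling.

```python
def plan_segments(duration_s, segment_s=600.0, overlap_s=10.0):
    """Split [0, duration_s] into windows of at most segment_s seconds.

    Consecutive windows share overlap_s seconds so that an event lying on a
    boundary is fully contained in at least one window.
    """
    duration_s = float(duration_s)
    segment_s = float(segment_s)
    overlap_s = float(overlap_s)
    if duration_s <= 0 or segment_s <= 0:
        raise ValueError("duration and segment length must be positive")
    if not 0 <= overlap_s < segment_s:
        raise ValueError("overlap must be non-negative and shorter than a segment")

    segments = []
    start = 0.0
    step = segment_s - overlap_s
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        segments.append((start, end))
        if end >= duration_s:
            break  # the last window already reaches the end of the video
        start += step
    return segments


for start, end in plan_segments(1500, segment_s=600, overlap_s=60):
    print(f"{start:.0f}s to {end:.0f}s")
```

Time ranges found inside a segment are relative to that segment, so add the segment's start offset to map them back onto the full video.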

With these steps, users can quickly install and use VideoMind to handle long video tasks.


 

Application Scenarios

  1. Video Content Analysis
    Users can use VideoMind to analyze long YouTube videos, locate key segments or generate summaries to save viewing time.
  2. Education and Research
    Students or researchers processing lecture videos can ask "What algorithms were covered?" and get the answers, organizing notes more efficiently.
  3. Content Moderation
    Platform administrators can use it to check whether video content is compliant, for example by locating clips that violate the rules.

 

FAQ

  1. How long a video can VideoMind handle?
    It depends on the hardware: 16 GB of RAM can handle an hour of video, and 32 GB supports longer content.
  2. What is the difference between the online demo and a local deployment?
    The online version requires no installation but has limited functionality; the local version supports custom training and large-scale processing.
  3. How are the datasets used?
    The 27 datasets are used for training and evaluation, and VideoMind-SFT is suited to fine-tuning the model. See https://huggingface.co/datasets/yeliudev/VideoMind-Dataset for details.
May not be reproduced without permission: Chief AI Sharing Circle