AI Personal Learning
and practical guidance
CyberKnife Drawing Mirror

VideoRAG: A RAG framework for understanding ultra-long videos with support for multimodal retrieval and knowledge graph construction

General Introduction

VideoRAG is a retrieval-enhanced generative framework designed for processing and understanding very long contextual videos. The tool combines a graph-driven textual knowledge base with hierarchical multimodal context encoding to efficiently process hundreds of hours of video content on a single NVIDIA RTX 3090 GPU. videoRAG maintains consistency across video semantics and optimizes retrieval efficiency by dynamically constructing a knowledge graph. Developed by the Department of Data Science at the University of Hong Kong, the project aims to provide users with a powerful tool to process complex video data.

VideoRAG: A RAG Framework for Understanding Ultra-Long Videos with Support for Multimodal Retrieval and Knowledge Graph Construction-1


 

Function List

  • Efficient handling of very long contextual videos: Process hundreds of hours of video content with a single NVIDIA RTX 3090 GPU.
  • Structured Video Knowledge Index: Distill hundreds of hours of video content into a concise knowledge graph.
  • multimodal search: Combine textual semantics and visual content to identify the most relevant videos to provide a comprehensive response.
  • Newly created LongerVideos benchmark: Contains over 160 videos totaling 134 hours of lectures, documentaries and entertainment.
  • dual-channel architecture: Combining a graph-driven textual knowledge base and hierarchical multimodal context encoding to maintain cross-video semantic consistency.

 

Using Help

Installation process

  1. Create and activate the conda environment:
   conda create --name videorag python=3.11
conda activate videorag
  1. Install the necessary Python packages:
   pip install numpy==1.26.4 torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2
pip install accelerate==0.30.1 bitsandbytes==0.43.1 moviepy==1.0.3
pip install git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
pip install timm==0.6.7 ftfy regex einops fvcore eva-decord==0.6.1 iopath matplotlib types-regex cartopy
pip install ctranslate2==4.4.0 faster_whisper neo4j hnswlib xxhash nano-vectordb
pip install transformers==4.37.1 tiktoken openai tenacity
  1. Install ImageBind:
   cd ImageBind
pip install .
  1. Download the necessary checkpoint files:
   git clone https://huggingface.co/openbmb/MiniCPM-V-2_6-int4
git clone https://huggingface.co/Systran/faster-distil-whisper-large-v3
mkdir .checkpoints
cd .checkpoints
wget https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth
cd ..

Usage Process

  1. Video Knowledge Extraction: Multiple videos are fed into VideoRAG and the system automatically extracts and builds a knowledge graph.
  2. Query Response: Users can enter a query and VideoRAG will provide a comprehensive response based on the constructed knowledge graph and multimodal search mechanism.
  3. Multi-language support: Currently VideoRAG has only been tested in English environment, if you need to deal with multi-language video, it is recommended to modify the WhisperModel in asr.py.

Main Functions

  • Video Upload: Upload video files to the system, which will automatically process and extract knowledge.
  • Query Input: Enter a question in the query box and the system will provide a detailed answer based on the knowledge graph and multimodal search mechanism.
  • Results Showcase: The system displays relevant video clips and text responses that users can click on to view details.
May not be reproduced without permission:Chief AI Sharing Circle " VideoRAG: A RAG framework for understanding ultra-long videos with support for multimodal retrieval and knowledge graph construction
en_USEnglish