
VLM-R1: A Visual Language Model for Localizing Image Targets through Natural Language

General Introduction

VLM-R1 is an open-source visual language model project developed by Om AI Lab and hosted on GitHub. The project applies DeepSeek's R1 approach to the Qwen2.5-VL model and significantly improves the model's stability and generalization on visual understanding tasks through reinforcement learning (the R1 method) and supervised fine-tuning (SFT). VLM-R1 is particularly strong at referring expression comprehension (REC): answering questions such as "Where is the red cup in the picture?" and pinpointing the target in the image. The project provides detailed installation scripts, dataset support, and training code for developers and researchers exploring visual language tasks. As of February 2025, the project had garnered nearly 2,000 GitHub stars, reflecting the wide interest in multimodal AI.


Demo address: https://huggingface.co/spaces/omlab/VLM-R1-Referral-Expression
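
If you want to call the hosted demo programmatically rather than through the browser, the Space can be reached with the gradio_client package. The endpoint names and input signature are defined by the Space itself and are not documented here, so the minimal sketch below only connects and prints the API description; adapt any concrete client.predict(...) call from that output.

    # Minimal sketch: connect to the public VLM-R1 demo Space via gradio_client.
    # The exact endpoint names and parameters are defined by the Space, so we
    # inspect them first instead of guessing.
    from gradio_client import Client

    # Connect to the Hugging Face Space hosting the REC demo
    client = Client("omlab/VLM-R1-Referral-Expression")

    # Print the Space's callable endpoints and their expected inputs/outputs;
    # use this to build a concrete client.predict(...) call.
    client.view_api()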


 

Feature List

  • Referring Expression Comprehension (REC): Parses natural language instructions to locate a specific target in an image (see the inference sketch after this list).
  • Joint image and text processing: Supports simultaneous image and text input to generate accurate analysis results.
  • Reinforcement learning optimization: Improves model performance on complex visual tasks by training with the R1 method.
  • Open source training code: Full training scripts and configuration files are provided for easy customization of the model.
  • Dataset Support: Built-in COCO and RefCOCO dataset download and processing capabilities simplify the development process.
  • High-performance inference support: Compatible with Flash Attention and other technologies to improve computing efficiency.
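
To make the REC feature concrete, here is a hedged sketch of local inference, assuming a trained VLM-R1 checkpoint keeps the standard Qwen2.5-VL interface in the Transformers library. "Qwen/Qwen2.5-VL-3B-Instruct" is only a stand-in (it is the base model used in the training command later in this article); swap in your own VLM-R1 checkpoint path, and replace the placeholder image filename with a real file.

    # Hedged sketch of referring expression comprehension (REC) inference,
    # assuming the checkpoint follows the standard Qwen2.5-VL interface.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"  # placeholder: use a trained VLM-R1 checkpoint

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(MODEL_ID)

    image = Image.open("example.jpg")  # placeholder image path
    question = "Where is the red cup in the picture?"

    # One user turn containing an image slot plus the natural-language query
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    }]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens (drop the prompt)
    answer = processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
    print(answer)  # expected to contain the target's location, e.g. bounding-box coordinates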

 

Usage Guide

Installation process

VLM-R1 is a Python-based project that requires some environment setup before it can run. The detailed installation and usage steps below will help you get started quickly.

1. Environment preparation

  • Install Anaconda: Anaconda is recommended for managing the Python environment to ensure compatibility. Download it from the Anaconda official website, then open a terminal once installation completes.
  • Create a virtual environment: Type the following command in the terminal to create a Python 3.10 environment named vlm-r1:
    conda create -n vlm-r1 python=3.10
  • Activate the environment: Activate the environment you just created:
    conda activate vlm-r1
    

2. Install project dependencies

  • Clone the project: Download the VLM-R1 code repository locally. Open a terminal and type:
    git clone https://github.com/om-ai-lab/VLM-R1.git
    cd VLM-R1
    
  • Run the installation script: The project provides a setup.sh script that installs the dependencies automatically. Run it in the terminal:
    bash setup.sh
    

    This script installs core libraries such as PyTorch and Transformers, ensuring the environment is ready.
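
After setup.sh finishes, a quick check that the key libraries import and that the GPU is visible can save time later. This is only a sanity check, not part of the project itself:

    # Quick sanity check after setup.sh: verify the core libraries import
    # and that a CUDA-capable GPU is visible to PyTorch.
    import torch
    import transformers

    print("PyTorch:", torch.__version__)
    print("Transformers:", transformers.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))
        print("bfloat16 supported:", torch.cuda.is_bf16_supported())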

3. Data preparation

  • Download COCO Dataset: VLM-R1 is trained using the COCO Train2014 image dataset. Run the following command to download and unzip it:
    wget http://images.cocodataset.org/train2014/train2014.zip
    unzip train2014.zip -d <path/to/coco/images>
    

    Make a note of the unzip path; it will be needed in the subsequent configuration.

  • Download the RefCOCO annotation files: The RefCOCO dataset is used for the referring expression task. The download link is in the project documentation; unzip the files and place them in the appropriate directory (a quick layout check is sketched below).
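
Before moving on to training, it is worth confirming the data landed where you expect. The sketch below only checks file counts and that the annotation file parses; the paths and the annotation filename are placeholders for your own unzip locations, and the exact annotation format is defined by the project documentation.

    # Small sketch to confirm the dataset layout. Paths are placeholders.
    import json
    from pathlib import Path

    image_root = Path("./train2014")                 # COCO Train2014 images
    annotation_file = Path("./refcoco_train.json")   # hypothetical RefCOCO annotation filename

    images = list(image_root.glob("*.jpg"))
    print(f"Found {len(images)} images under {image_root}")  # COCO Train2014 has 82,783 images

    if annotation_file.exists():
        with open(annotation_file) as f:
            annotations = json.load(f)
        print(f"Loaded {len(annotations)} annotation entries from {annotation_file}")
    else:
        print("Annotation file not found; check the path given in the project documentation.")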

4. Training models

  • Configure training parameters: Enter the src/open-r1-multimodal directory and edit the parameters in the training script. Example:
    cd src/open-r1-multimodal
    

    Modify grpo_rec.py, or specify the parameters when running the command. The following is an example command:

    torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port="12346" \
    src/open_r1/grpo_rec.py \
    --deepspeed local_scripts/zero3.json \
    --output_dir output/my_model \
    --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
    --dataset_name data_config/rec.yaml \
    --image_root <path/to/coco/train2014> \
    --max_prompt_length 1024 \
    --num_generations 8 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --torch_dtype bfloat16 \
    --num_train_epochs 2 \
    --save_steps 100
    
    • Parameter Description:
      • --nproc_per_node: Number of GPUs, to be adjusted for your hardware.
      • --image_root: Replace with your COCO dataset path; a quick pre-flight check of these paths is sketched after this list.
      • --output_dir: The model save path.
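
Before launching an eight-GPU run, it helps to confirm that the files referenced in the command exist and that enough GPUs are visible. A minimal pre-flight sketch, using the same placeholder image path as above:

    # Pre-flight check before torchrun: confirm the referenced files exist and
    # that the visible GPU count covers --nproc_per_node. Paths are placeholders.
    from pathlib import Path
    import torch

    checks = {
        "deepspeed config": Path("local_scripts/zero3.json"),
        "dataset config": Path("data_config/rec.yaml"),
        "image root": Path("<path/to/coco/train2014>"),  # replace with your --image_root
    }
    for name, path in checks.items():
        print(f"{name}: {path} -> {'OK' if path.exists() else 'MISSING'}")

    nproc_per_node = 8  # must not exceed the number of visible GPUs
    print(f"Visible GPUs: {torch.cuda.device_count()} (requested {nproc_per_node})")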

5. Feature walkthrough

Referring Expression Comprehension (REC)
  • Run the test script: Once training is complete, use the provided test script to verify the model. Enter the src/eval directory:
    cd src/eval
    python test_rec_r1.py --model_path <path/to/checkpoint> --image_root <path/to/train2014> --annotation_path <path/to/refcoco_annotations>
    
  • Input example: Provide an image and a question such as "Where is the blue car in the picture?"; the model returns the coordinates or a description of the target location (see the parsing sketch below).
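
The exact output format depends on the prompt template used in training, so treat the following as a sketch under one assumption: the reply contains the bounding box as four numbers, e.g. "[120, 45, 310, 220]". Under that assumption, a simple regex is enough to pull the box out of the text.

    # Hedged sketch: extract a [x1, y1, x2, y2] box from the model's text reply.
    # The reply format depends on the training prompt; this just finds the first
    # group of four numbers in the string.
    import re

    def extract_bbox(reply: str):
        """Return the first four numbers found in the reply as floats, or None."""
        match = re.search(
            r"\[?\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\]?",
            reply,
        )
        return [float(g) for g in match.groups()] if match else None

    print(extract_bbox("The blue car is at [120, 45, 310, 220]."))  # -> [120.0, 45.0, 310.0, 220.0]
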
Image and Text Analysis
  • Prepare the input: Place the image file and question text in the specified directory, or specify the paths directly in the script.
  • Run inference: Using the test script above, the model outputs an analysis of the image content, such as object category and position.
Customized training
  • Modify the dataset: To use your own dataset, edit data_config/rec.yaml and add your image paths and annotation files (it helps to inspect the existing file first, as sketched below).
  • Adjust hyperparameters: Modify parameters in grpo_rec.py, such as the learning rate and batch size, to suit your task.
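
Rather than guessing the schema of data_config/rec.yaml, you can print the file shipped with the repository and mirror its structure when adding your own entries. A minimal sketch (requires PyYAML, which the setup script's dependencies normally pull in; otherwise pip install pyyaml):

    # Inspect the shipped REC dataset config before editing it for a custom dataset.
    # Whatever keys appear here are what the training script expects.
    import yaml
    from pprint import pprint

    with open("data_config/rec.yaml") as f:
        config = yaml.safe_load(f)

    pprint(config)  # mirror this structure when adding your own image/annotation paths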

6. Cautions

  • Hardware requirements: A GPU with at least 8GB of memory is recommended; if resources are limited, lower the num_generations parameter to reduce the memory footprint.
  • Debug mode: During training, you can set export DEBUG_MODE="true" to view detailed logs.
  • Community support: If you run into problems, ask on the GitHub Issues page; the Om AI Lab team and community will help.

With the steps above, users can install and use VLM-R1 end to end, whether for research on vision tasks or for building real-world applications.

May not be reproduced without permission: Chief AI Sharing Circle » VLM-R1: A Visual Language Model for Localizing Image Targets through Natural Language

