
SegAnyMo: open source tool to automatically segment arbitrary moving objects from video

General Introduction

SegAnyMo is an open source project developed by researchers at UC Berkeley and Peking University, including Nan Huang. The tool focuses on video processing: it automatically recognizes and segments arbitrary moving objects in a video, such as people, animals, or vehicles. It combines technologies such as TAPNet, DINOv2, and SAM2, and the associated work is to be presented at CVPR 2025. The project code is fully public, and developers, researchers, and video processing enthusiasts are free to download, use, or modify it. The goal of SegAnyMo is to simplify the analysis of motion in video and to provide an efficient segmentation solution.



 

Function List

  • Automatically detects moving objects in a video and generates accurate segmentation masks.
  • Supports video input (e.g. MP4, AVI) as well as image sequences.
  • Provides pre-trained models for rapid deployment and testing.
  • Integrates TAPNet to generate 2D tracking trajectories that capture motion information.
  • Uses DINOv2 to extract semantic features and improve segmentation accuracy.
  • Uses SAM2 to refine masks for pixel-level segmentation.
  • Supports training on custom datasets to adapt to different scenarios.
  • Outputs visualized results for easy inspection and adjustment.

 

Usage Guide

SegAnyMo requires some technical background and is aimed mainly at users with programming experience. Below is a detailed installation and usage guide.

Installation process

  1. Prepare hardware and software
    The project was developed on Ubuntu 22.04, and an NVIDIA RTX A6000 or a similar CUDA-capable graphics card is recommended. Git and Anaconda must be installed beforehand.

    • Clone the code repository:
      git clone --recurse-submodules https://github.com/nnanhuang/SegAnyMo
      
    • Go to the project directory:
      cd SegAnyMo
      
  2. Creating a Virtual Environment
    Create a separate Python environment with Anaconda to avoid dependency conflicts.

    • Creating the environment:
      conda create -n seg python=3.12.4
      
    • Activate the environment:
      conda activate seg
      
  3. Installing core dependencies
    Install PyTorch and other necessary libraries.

    • Install PyTorch (supports CUDA 12.1):
      conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
      
    • Install other dependencies:
      pip install -r requirements.txt
      
    • Install xformers (to accelerate inference):
      pip install -U xformers --index-url https://download.pytorch.org/whl/cu121
      
  4. Installation of DINOv2
    DINOv2 is used for feature extraction.

    • Go to the preprocessing directory and clone the repository:
      cd preproc && git clone https://github.com/facebookresearch/dinov2
      
  5. Installing SAM2
    SAM2 is the mask refinement core.

    • From the project root, enter the SAM2 directory:
      cd sam2
      
    • Installation:
      pip install -e .
      
    • Download the pre-trained model:
      cd checkpoints && ./download_ckpts.sh && cd ../..
      
  6. Installation of TAPNet
    TAPNet is used to generate 2D tracking traces.

    • Go to the TAPNet directory:
      cd preproc/tapnet
      
    • Installation:
      pip install .
      
    • Download the model:
      cd ../checkpoints && wget https://storage.googleapis.com/dm-tapnet/bootstap/bootstapir_checkpoint_v2.pt
      
  7. Verify Installation
    Check that the environment works:
      python -c "import torch; print(torch.cuda.is_available())"

    If the command prints True, the installation succeeded.
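
    For a slightly more detailed check, the one-liner below (a sketch, assuming a single CUDA device) also prints the installed PyTorch version and the name of the detected GPU:

      python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no GPU')"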

Usage

Data preparation

SegAnyMo supports video or image sequence input. The data needs to be organized in the following structure:

data
├── images
│   ├── scene_name
│   │   ├── image_name
│   │   ├── ...
├── bootstapir
│   ├── scene_name
│   │   ├── image_name
│   │   ├── ...
├── dinos
│   ├── scene_name
│   │   ├── image_name
│   │   ├── ...
├── depth_anything_v2
│   ├── scene_name
│   │   ├── image_name
│   │   ├── ...
  • If the input is a video, use a tool such as FFmpeg to extract frames into the images folder (see the example below).
  • If the input is an image sequence, place the images directly into the corresponding directory.
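
A minimal frame-extraction sketch with FFmpeg is shown below. The input file input.mp4 and the folder name scene_name are placeholders for illustration; zero-padded numeric frame names also keep the images compatible with SAM2's default naming expectations (see the note in the mask-generation step):

  mkdir -p data/images/scene_name
  ffmpeg -i input.mp4 -qscale:v 2 data/images/scene_name/%05d.jpg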

Operational pre-processing

  1. Generate depth maps, features and trajectories
    Use the following commands to process the data (about 10 minutes, depending on the amount of data):

    • For image sequences:
      python core/utils/run_inference.py --data_dir $DATA_DIR --gpus 0 --depths --tracks --dinos --e
      
    • For a video:
      python core/utils/run_inference.py --video_path $VIDEO_PATH --gpus 0 --depths --tracks --dinos --e
      

    Parameter Description:

    • --e enables efficient mode, which lowers the frame rate and resolution to speed up processing.
    • --step 10 means that one out of every 10 frames is used as a query frame; lower the value to improve accuracy (see the example below).
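
    For instance, to trade some speed for accuracy, efficient mode can be combined with a smaller query-frame step. The command below is a sketch that assumes --step is accepted by run_inference.py exactly as described above:

      python core/utils/run_inference.py --data_dir $DATA_DIR --gpus 0 --depths --tracks --dinos --e --step 5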

Predicting motion trajectories

  1. Download model weights
    Download the pre-trained model from Hugging Face or Google Drive, then write its path into the resume_path field of configs/example_train.yaml.

    • Running trajectory prediction:
      python core/utils/run_inference.py --data_dir $DATA_DIR --motin_seg_dir $OUTPUT_DIR --config_file configs/example_train.yaml --gpus 0 --motion_seg_infer --e
      

    The output is saved in $OUTPUT_DIR.
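
    A concrete invocation might look like the following; the paths are placeholders chosen for illustration, with $DATA_DIR being the same data directory used during preprocessing and $OUTPUT_DIR any writable results folder:

      DATA_DIR=./data
      OUTPUT_DIR=./results/motion_seg
      python core/utils/run_inference.py --data_dir $DATA_DIR --motin_seg_dir $OUTPUT_DIR --config_file configs/example_train.yaml --gpus 0 --motion_seg_infer --e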

Generate Segmentation Mask

  1. Refine the masks with SAM2
    • Run Mask Generation:
      python core/utils/run_inference.py --data_dir $DATA_DIR --sam2dir $RESULT_DIR --motin_seg_dir $OUTPUT_DIR --gpus 0 --sam2 --e
      

    Parameter Description:

    • $DATA_DIR is the path to the original images.
    • $RESULT_DIR is the path where the masks are saved.
    • $OUTPUT_DIR is the path to the trajectory prediction results.
      Note: by default SAM2 expects .jpg or .jpeg images whose file names are plain numbers. If your files do not match, either modify the code or rename the files (a renaming sketch follows this note).
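
    A minimal renaming sketch is shown below; run it inside the scene's image folder. It assumes the frames are already .jpg files and that their lexicographic order matches the intended frame order, so back up the folder before running it:

      i=0
      for f in *.jpg; do
          mv "$f" "$(printf '%05d.jpg' "$i")"   # rename to zero-padded numbers: 00000.jpg, 00001.jpg, ...
          i=$((i+1))
      done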

Assessment results

  1. Download pre-computed results
    Download the official masks from Google Drive for comparison.

    • Evaluate the DAVIS dataset:
      CUDA_VISIBLE_DEVICES=0 python core/eval/eval_mask.py --res_dir $RES_DIR --eval_dir $GT_DIR --eval_seq_list core/utils/moving_val_sequences.txt
      
    • For a more detailed evaluation:
      cd core/eval/davis2017-evaluation && CUDA_VISIBLE_DEVICES=0 python evaluation_method.py --task unsupervised --results_path $MASK_PATH
      

Customized training

  1. Data preprocessing
    The HOI4D dataset is used as an example:
python core/utils/process_HOI.py
python core/utils/run_inference.py --data_dir $DATA_DIR --gpus 0 --tracks --depths --dinos

RGB images and dynamic masks are required for custom datasets.

  • Check data integrity:
    python current-data-dir/dynamic_stereo/dynamic_replica_data/check_process.py
    
  • Clean up intermediate data to save space:
    python core/utils/run_inference.py --data_dir $DATA_DIR --gpus 0 --clean
    
  2. Model training
    Modify the configs/$CONFIG.yaml configuration and train with datasets such as Kubric or HOI4D:
CUDA_VISIBLE_DEVICES=0 python train_seq.py ./configs/$CONFIG.yaml
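
For example, with the sample configuration referenced earlier (assuming configs/example_train.yaml is also intended for training, as its name suggests), the call would be:

  CUDA_VISIBLE_DEVICES=0 python train_seq.py ./configs/example_train.yaml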

 

Application Scenarios

  1. Video post-production
    Separate moving objects (e.g. running people) from the video, generate masks and use them for effects compositing.
  2. Behavior Analysis Research
    Track animal or human movement trajectories and analyze behavioral patterns.
  3. Autonomous Driving Development
    Segmentation of moving objects (e.g., vehicles, pedestrians) in driving videos to optimize the perception system.
  4. Surveillance system optimization
    Extract abnormal motion from surveillance video to improve security efficiency.

 

QA

  1. Is a GPU required?
    Yes. An NVIDIA graphics card with CUDA support is recommended; otherwise processing is very slow.
  2. Does it support real-time processing?
    The current version is intended for offline processing; real-time applications require additional optimization on your own.
  3. How much disk space does training require?
    It depends on the dataset: a small dataset takes a few gigabytes, while a large one may take hundreds of gigabytes.
  4. How can segmentation accuracy be improved?
    Lower the --step value, or train the model with more labeled data.