Gaze-LLE: A Target Prediction Tool for Character Gaze in Video

General Introduction

Gaze-LLE is a gaze target prediction tool built on a large-scale learned encoder. Developed by Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, and James M. Rehg, the project enables efficient gaze target prediction with frozen pre-trained visual foundation models such as DINOv2. The architecture is deliberately simple: the pre-trained visual encoder is frozen and only a lightweight gaze decoder is learned, which reduces the number of learnable parameters by 1-2 orders of magnitude compared to previous work and requires no additional input modalities such as depth or pose information.


Function List

  • Gaze target prediction: efficiently predicts gaze targets on top of a pre-trained visual encoder.
  • Multi-person prediction: supports gaze prediction for multiple people in a single image.
  • Pre-trained models: provides several pre-trained models covering different backbones and training datasets.
  • Lightweight architecture: learns only a lightweight gaze decoder on top of a frozen pre-trained visual encoder.
  • No extra input modalities: requires no additional depth or pose inputs.

 

Using Help

Installation process

  1. Clone the repository:
   git clone https://github.com/fkryan/gazelle.git
cd gazelle
  2. Create a virtual environment and install the dependencies:
   conda env create -f environment.yml
conda activate gazelle
pip install -e .
  3. Optional: install xformers to accelerate attention computation (if your system supports it):
   pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118

Using pre-trained models

Gaze-LLE provides a variety of pre-trained models that users can download and use as needed:

  • gazelle_dinov2_vitb14: based on DINOv2 ViT-B, trained on GazeFollow.
  • gazelle_dinov2_vitl14: based on DINOv2 ViT-L, trained on GazeFollow.
  • gazelle_dinov2_vitb14_inout: based on DINOv2 ViT-B, trained on GazeFollow and VideoAttentionTarget.
  • gazelle_dinov2_vitl14_inout: based on DINOv2 ViT-L, trained on GazeFollow and VideoAttentionTarget.

Usage Example

  1. Load the model from PyTorch Hub:
   import torch
model, transform = torch.hub.load('fkryan/gazelle', 'gazelle_dinov2_vitb14')
  2. See the demo notebook on Google Colab to learn how to detect the gaze target of every person in an image.
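Once loaded, the `model` and `transform` can be applied to an image together with one or more head bounding boxes. The sketch below illustrates the flow; the dict-based input format (`images`, `bboxes` normalized to [0, 1]) and the `heatmap` output key follow the repository's README, while the image path and box coordinates are placeholders:

```python
def normalize_bbox(bbox_xyxy, width, height):
    """Scale a pixel-space head box (xmin, ymin, xmax, ymax) into [0, 1],
    the coordinate convention the hub models expect."""
    x1, y1, x2, y2 = bbox_xyxy
    return (x1 / width, y1 / height, x2 / width, y2 / height)

# Sketch of a full forward pass (commented out, since it downloads weights):
# import torch
# from PIL import Image
# model, transform = torch.hub.load('fkryan/gazelle', 'gazelle_dinov2_vitb14')
# image = Image.open('demo.jpg').convert('RGB')                # placeholder path
# head_box = normalize_bbox((120, 40, 220, 160), *image.size)  # example box
# with torch.no_grad():
#     out = model({'images': transform(image).unsqueeze(0),
#                  'bboxes': [[head_box]]})  # one inner list of boxes per image
# heatmap = out['heatmap'][0][0]  # assumed key: one heatmap per person
```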

Gaze prediction

Gaze-LLE supports gaze prediction for multiple people: a single image is encoded once, and the shared features are then used to predict the gaze targets of every person in it. The model outputs a spatial heatmap representing the probability of the gaze target's location in the scene, with values in [0, 1], where 1 marks the most likely gaze target location.
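The predicted heatmap can be reduced to a single pixel coordinate by taking its argmax and rescaling to the image size. A minimal sketch with a synthetic 64×64 heatmap (the real output resolution may differ):

```python
import torch

def heatmap_to_point(heatmap, img_w, img_h):
    """Map the argmax of an (H, W) gaze heatmap to pixel coordinates."""
    h, w = heatmap.shape
    idx = torch.argmax(heatmap)          # index into the flattened heatmap
    row, col = divmod(idx.item(), w)
    # Scale grid coordinates to image pixels (center of the winning cell).
    return ((col + 0.5) * img_w / w, (row + 0.5) * img_h / h)

# Synthetic example: a 64x64 heatmap whose peak sits at row 16, col 48.
hm = torch.zeros(64, 64)
hm[16, 48] = 1.0
x, y = heatmap_to_point(hm, img_w=640, img_h=480)  # → (485.0, 123.75)
```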

May not be reproduced without permission: Chief AI Sharing Circle, "Gaze-LLE: A Target Prediction Tool for Character Gaze in Video"
