General Introduction
Gaze-LLE is a gaze target prediction tool built on large-scale learned encoders. Developed by Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, and James M. Rehg, the project aims to enable efficient gaze target prediction with pre-trained visual foundation models such as DINOv2. The architecture is deliberately simple: the pre-trained visual encoder is frozen and only a lightweight gaze decoder is learned, which cuts the number of learnable parameters by 1-2 orders of magnitude compared to prior work and requires no additional input modalities such as depth or pose.
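The split between a frozen encoder and a small trainable decoder is easiest to see in a toy sketch. The module below is purely illustrative: the stand-in encoder, channel sizes, and convolutional head are invented for this example, and the real Gaze-LLE decoder is more involved than a single convolution.

import torch
import torch.nn as nn

# Illustrative stand-in for a pre-trained backbone; the real model uses a DINOv2 ViT.
encoder = nn.Sequential(nn.Conv2d(3, 768, kernel_size=14, stride=14))

class ToyGazeLLE(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False   # the backbone is frozen, never updated
        # A tiny trainable head standing in for the lightweight gaze decoder.
        self.decoder = nn.Conv2d(768, 1, kernel_size=1)

    def forward(self, x):
        with torch.no_grad():
            feats = self.encoder(x)   # frozen feature extraction
        return self.decoder(feats)    # only these parameters are learned

model = ToyGazeLLE(encoder)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # counts only the decoder's weights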
Function List
- Gaze target prediction: Efficient prediction of gaze targets based on pre-trained visual encoders.
- Multi-person prediction: Supports gaze prediction for multiple people in a single image.
- Pre-trained models: Provides a variety of pre-trained models to support different backbone networks and training data.
- Lightweight architecture: Learns only a lightweight gaze decoder on top of a frozen pre-trained visual encoder.
- No extra input modalities: Requires no additional depth or pose inputs.
Usage Guide
Installation
- Clone the repository:
git clone https://github.com/fkryan/gazelle.git
cd gazelle
- Create a virtual environment and install dependencies:
conda env create -f environment.yml
conda activate gazelle
pip install -e .
- Optional: Install xformers to accelerate attention calculations (if supported by the system):
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118
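A quick way to verify the environment afterwards is the check below. This is a hedged sketch: the gazelle module name is assumed from the repository layout installed by pip install -e ., and xformers is only importable if you installed it in the optional step.

import torch
import gazelle  # module name assumed from the repo's `pip install -e .`
print("CUDA available:", torch.cuda.is_available())
try:
    import xformers
    print("xformers:", xformers.__version__)
except ImportError:
    print("xformers not installed; attention falls back to the default implementation")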
Using pre-trained models
Gaze-LLE provides several pre-trained models that users can download and use as needed (a short snippet for listing the available hub models follows this list):
- gazelle_dinov2_vitb14: Based on DINOv2 ViT-B, trained on GazeFollow.
- gazelle_dinov2_vitl14: Based on DINOv2 ViT-L, trained on GazeFollow.
- gazelle_dinov2_vitb14_inout: Based on DINOv2 ViT-B, trained on GazeFollow and VideoAttentionTarget.
- gazelle_dinov2_vitl14_inout: Based on DINOv2 ViT-L, trained on GazeFollow and VideoAttentionTarget.
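To confirm the exact names the hub entry point exposes, you can query PyTorch Hub directly. This uses the standard torch.hub.list call; the names returned are determined by the repository's hubconf.py and should match the models above.

import torch
# Lists every entry point defined in the repo's hubconf.py.
print(torch.hub.list('fkryan/gazelle'))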
Usage Example
- Load a model via PyTorch Hub (a fuller inference sketch follows this list):
import torch
model, transform = torch.hub.load('fkryan/gazelle', 'gazelle_dinov2_vitb14')
- See the demo notebook on Google Colab to learn how to detect the gaze target of every person in an image.
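Building on the hub call above, a minimal end-to-end inference sketch might look like the following. The image path and the normalized head bounding box are placeholders, and the input/output dictionary format is taken from the repository's README; treat this as a sketch rather than authoritative API documentation.

import torch
from PIL import Image

model, transform = torch.hub.load('fkryan/gazelle', 'gazelle_dinov2_vitb14')
model.eval()

img = Image.open('example.jpg').convert('RGB')  # placeholder path
img_tensor = transform(img).unsqueeze(0)        # add a batch dimension

# One normalized (xmin, ymin, xmax, ymax) head box per person, per image;
# this particular box is hypothetical.
bboxes = [[(0.1, 0.2, 0.5, 0.7)]]

with torch.no_grad():
    output = model({"images": img_tensor, "bboxes": bboxes})

heatmap = output["heatmap"][0][0]  # prediction for the first person in the first image
print(heatmap.shape)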
Multi-person gaze prediction
Gaze-LLE supports gaze prediction for multiple people: a single image is encoded once, and the shared features are then used to predict the gaze target of each person in the image. The model outputs a spatial heatmap over the scene with values in [0, 1], where 1 marks the most likely gaze target location.
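The heatmap is low-resolution relative to the input image, so a common post-processing step is to take its argmax and rescale to pixel coordinates. This is a generic sketch, not a utility provided by the project:

import torch

def heatmap_to_point(heatmap, img_w, img_h):
    """Map the heatmap's argmax back to pixel coordinates in the original image."""
    h, w = heatmap.shape
    idx = torch.argmax(heatmap).item()
    y, x = divmod(idx, w)
    # Scale grid coordinates up to image pixels.
    return (x / w * img_w, y / h * img_h)

# e.g. heatmap_to_point(output["heatmap"][0][0], 1920, 1080)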