
HumanOmni: a multimodal large model for analyzing human emotions and actions in video

General Introduction

HumanOmni is an open-source multimodal large model developed by the HumanMLLM team and hosted on GitHub. It focuses on analyzing human-centered video and processes both visuals and sound to help understand emotions, actions, and conversational content. The project was pre-trained on 2.4 million human-centered video clips with 14 million instructions, and fine-tuned on 50,000 hand-labeled video clips with more than 100,000 instructions. HumanOmni handles face-, body-, and interaction-related scenes with three dedicated branches and dynamically adjusts how their features are fused based on the input. It is the industry's first human-centered multimodal model and outperforms many comparable models. The team has also released R1-Omni on top of it, which is the first to incorporate reinforcement learning to improve reasoning. The code and part of the datasets are openly available to researchers and developers.


Function List

  • Emotion recognition: Analyzes facial expressions and voice tone in a video to determine a person's emotion, such as happy, angry, or sad.
  • Facial expression description: Recognizes and describes facial details, such as a smile or a frown.
  • Action understanding: Analyzes people's movements in a video and describes what they are doing, such as walking or waving.
  • Speech processing: Extracts content from the audio track, supporting speech recognition and intonation analysis.
  • Multimodal fusion: Combines visuals and sound to understand complex scenes and provide more accurate analysis.
  • Dynamic branch adjustment: Handles different scenes with three branches (face, body, and interaction), automatically adjusting their weights.
  • Open-source support: Provides code, pre-trained models, and partial datasets for secondary development.

 

Using Help

HumanOmni is aimed at users with a technical background, such as developers and researchers. The installation and usage steps below are detailed enough to get started right away.

Installation process

To run HumanOmni, you need to prepare your environment first. The following are the specific steps:

  1. Check hardware and software requirements
    • Operating System: Supports Linux, Windows or macOS.
    • Python: requires version 3.10 or higher.
    • CUDA: 12.1 or higher recommended (if using a GPU).
    • PyTorch: Requires version 2.2 or higher, with CUDA support.
    • Hardware: NVIDIA GPUs are recommended, CPUs work but are slow.
  2. Download the code
    Open a terminal and run the following commands to download the project:
git clone https://github.com/HumanMLLM/HumanOmni.git
cd HumanOmni
  3. Create a virtual environment
    Create a separate environment with Conda to avoid dependency conflicts:
conda create -n humanOmni python=3.10 -y
conda activate humanOmni
  4. Install dependencies
    The project includes a requirements.txt file listing the required libraries. Run the following commands to install them:
pip install --upgrade pip
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
  5. Download model weights
    HumanOmni comes in three models:
  • HumanOmni-Video: processes video, 7B parameters.
  • HumanOmni-Audio: processes audio, 7B parameters.
  • HumanOmni-Omni: fuses video and audio, 7B parameters (referred to as HumanOmni).
    Download from Hugging Face or ModelScope, for example:
  • HumanOmni-7B
  • HumanOmni-7B-Video
    Download the weights and place them in the project folder.
  6. Verify the installation
    Check the environment with a test command:
python inference.py --modal video --model_path ./HumanOmni_7B --video_path test.mp4 --instruct "Describe this video."

If the video description is output, the installation is successful.
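If you prefer to call the model from your own Python code rather than typing the command each time, one simple option is to wrap the inference.py command line shown above with subprocess. The sketch below is a minimal, unofficial example that relies only on the flags used in this guide (--modal, --model_path, --video_path, --instruct); run it from the project root and adjust the paths as needed.

# run_humanomni.py -- minimal sketch: call the HumanOmni CLI from Python.
# Uses only the flags documented in this guide; paths are placeholders.
import subprocess
import torch  # optional: report whether a GPU is visible before running

def run_inference(video_path, instruct, modal="video_audio", model_path="./HumanOmni_7B"):
    """Invoke inference.py with the given prompt and return its console output."""
    cmd = [
        "python", "inference.py",
        "--modal", modal,
        "--model_path", model_path,
        "--video_path", video_path,
        "--instruct", instruct,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print("CUDA available:", torch.cuda.is_available())
    print(run_inference("test.mp4", "Describe this video."))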

Functional operation flow

At the heart of HumanOmni is analyzing video and audio. Below is a detailed breakdown of how the main features work.

1. Emotion recognition

  • Steps
  • Prepare a video containing a person (e.g. sample.mp4).
  • Run the command:
python inference.py --modal video_audio --model_path ./HumanOmni_7B --video_path sample.mp4 --instruct "Which emotion is most obvious?"
  • The model outputs an emotion such as "angry" or "happy".
  • Notes
  • The video should be clear, with recognizable facial expressions and voices.
  • Longer videos may require more computation time.

2. Facial expression description

  • Steps
  • Provide the video and run:
python inference.py --modal video --model_path ./HumanOmni_7B --video_path sample.mp4 --instruct "What's the major facial expression?"
  • The output may be "smile" or "frown" with a brief description.
  • Tips
  • Testing with short 10-30 second clips works better (a clipping sketch follows this section).
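If your source footage is long, cutting a short test clip first keeps runs fast. The sketch below is not part of HumanOmni; it assumes ffmpeg is installed on your system and simply copies the first 30 seconds of a video without re-encoding.

# clip_video.py -- minimal sketch: cut a 30-second test clip with ffmpeg (assumed installed).
import subprocess

def clip(src, dst, seconds=30):
    """Copy the first `seconds` of `src` into `dst` without re-encoding."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-t", str(seconds), "-c", "copy", dst],
        check=True,
    )

if __name__ == "__main__":
    clip("sample.mp4", "sample_30s.mp4")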

3. Action understanding

  • Steps
  • Provide the video and run:
python inference.py --modal video --model_path ./HumanOmni_7B --video_path sample.mp4 --instruct "Describe the major action in detail."
  • Outputs a description of the action, such as "a person is walking".
  • Tips
  • Make sure the action is obvious and avoid background clutter.

4. Speech processing

  • Steps
  • Provide a video with audio and run:
python inference.py --modal audio --model_path ./HumanOmni_7B --video_path sample.mp4 --instruct "What did the person say?"
  • Outputs the speech content, such as "Dogs are sitting by the door".
  • Notes
  • The audio should be clear; it works best without background noise.

5. Multimodal fusion

  • Steps
  • Provide a video with audio and run:
python inference.py --modal video_audio --model_path ./HumanOmni_7B --video_path sample.mp4 --instruct "Describe this video."
  • The model gives a full description combining the visuals and the sound.
  • Advantages
  • Captures the correlation between emotions and actions for a more comprehensive analysis (a multi-prompt sketch follows this section).
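Since the five analyses above differ only in the --modal flag and the instruction text, they can be run in one pass. The following sketch reuses the exact prompts from this guide and the documented inference.py flags; the model and video paths are placeholders to adjust for your setup.

# analyze_video.py -- minimal sketch: run the prompts from this guide against one video.
import subprocess

MODEL_PATH = "./HumanOmni_7B"
VIDEO_PATH = "sample.mp4"

# (modal, instruction) pairs taken from the examples above.
TASKS = [
    ("video_audio", "Which emotion is most obvious?"),
    ("video", "What's the major facial expression?"),
    ("video", "Describe the major action in detail."),
    ("audio", "What did the person say?"),
    ("video_audio", "Describe this video."),
]

for modal, instruct in TASKS:
    out = subprocess.run(
        ["python", "inference.py",
         "--modal", modal,
         "--model_path", MODEL_PATH,
         "--video_path", VIDEO_PATH,
         "--instruct", instruct],
        capture_output=True, text=True, check=True,
    )
    print(f"[{modal}] {instruct}\n{out.stdout.strip()}\n")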

6. Training on a custom dataset

  • Steps
  • Prepare a JSON data file containing the video path and an instruction dialog (a sketch for generating this file follows this section). For example:
[
  {
    "video": "path/to/video.mp4",
    "conversations": [
      {"from": "human", "value": "What's the emotion?"},
      {"from": "gpt", "value": "sad"}
    ]
  }
]
  • Download the HumanOmni-7B-Video and HumanOmni-7B-Audio weights.
  • Run the training script:
bash scripts/train/finetune_humanomni.sh
  • Uses
  • You can fine-tune the model on your own video data.
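The training JSON shown above can be generated with a short script instead of being written by hand. The sketch below follows the field names from the sample (video, conversations, from, value); the annotation list and the output filename train_data.json are placeholders, and you should check scripts/train/finetune_humanomni.sh for where the data path is actually configured.

# build_dataset.py -- minimal sketch: write a training JSON in the format shown above.
import json

# Placeholder annotations: (video path, question, answer).
annotations = [
    ("path/to/video.mp4", "What's the emotion?", "sad"),
    ("path/to/another.mp4", "Describe the major action in detail.", "a person is waving"),
]

records = [
    {
        "video": video,
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": answer},
        ],
    }
    for video, question, answer in annotations
]

with open("train_data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

print(f"Wrote {len(records)} records to train_data.json")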

Frequently Asked Questions

  • Runtime error: Check that your Python and PyTorch versions match the requirements (see the version-check snippet below).
  • Model load failure: Confirm that the path is correct and that there is enough disk space (the model takes about 10 GB).
  • Inaccurate results: Switch to a clearer video or rephrase the instruction.
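For the version-mismatch case in particular, it helps to print exactly what is installed. This is a generic environment check, not part of HumanOmni:

# check_env.py -- print Python, PyTorch, and CUDA details to diagnose version mismatches.
import sys
import torch

print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))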

With these steps, users can easily install and use HumanOmni and experience its powerful features.

 

Application Scenarios

  1. Educational research
    Analyze classroom videos to identify student mood and engagement and help teachers adjust their teaching style.
  2. Medical assistance
    Analyze a patient's expression and tone of voice to help physicians assess psychological states such as anxiety or depression.
  3. Film and television production
    Analyze characters' emotions and actions to generate subtitles or plot descriptions and improve creative efficiency.
  4. Social analysis
    Analyze meeting videos to understand participants' emotions and behaviors and optimize communication.

 

QA

  1. What file formats are supported?
    MP4 is supported; the audio must be embedded in the video.
  2. Do I need an internet connection?
    No. Once the code and models are downloaded, everything runs offline.
  3. How does the model perform?
    On emotion understanding, HumanOmni achieves a UAR of 74.86% on the DFEW dataset, far exceeding GPT-4o's 50.57%. Its average score on action understanding is 72.6, higher than Qwen2-VL-7B's 67.7.
  4. Can ordinary people use it?
    Basic programming skills are required. If you cannot code, it is best to ask someone with a technical background for help.