Megrez-3B-Omni: an on-device multimodal model for understanding and analyzing text, images, and audio

Latest AI Resources, updated 7 months ago, AI Sharing Circle

General Introduction

Infini-Megrez is an edge intelligence solution developed by Infinigence AI that aims to deliver efficient multimodal understanding and analysis through hardware-software co-design. At the heart of the project is the Megrez-3B model, which supports integrated image, text, and audio understanding with high accuracy and fast inference. The model performs well on a number of mainstream benchmarks and is suited to tasks such as scene understanding and optical character recognition (OCR). The project provides complete deployment code so that developers can apply it on a variety of platforms.

Function List

Image understanding: Builds image tokens with SigLip-400M and performs well on benchmarks such as MME, MMVet, and OCRBench.
Language understanding: Maintains strong text comprehension and performs well on benchmarks such as C-EVAL and MMLU.
Speech understanding: Supports Chinese and English voice input, multi-turn dialogue, and responses to voice commands.
Fast inference: Achieves up to a 300% inference speedup through hardware-software co-design.
Ease of use: Adopts the classic LLaMA architecture, making it easy for developers to deploy on a variety of platforms.
Rich applications: Provides a full-stack WebSearch solution that automatically decides when to call search in order to produce better summaries.

Usage Guide

Installation process

Clone the repository: Clone the Infini-Megrez repository by running the following command in a terminal:

   git clone https://github.com/infinigence/Infini-Megrez.git

Install dependencies: Go to the project directory and install the required dependencies:

   cd Infini-Megrez
   pip install -r requirements.txt

Download the model: Download the required model files by following the instructions in the README file, and place them in the specified directory.

Guidelines for use

Image understanding:
- Place the image files in the specified directory.
- Run the image understanding script:
```
 python image_understanding.py --input_dir ./images
```
- View the output with image markers and analysis results.
Language understanding:
- Place the text files in the specified directory.
- Run the language understanding script:
```
 python text_understanding.py --input_dir ./texts
```
- View the output, which contains the text analysis and comprehension results.
Speech understanding:
- Place the audio files in the specified directory.
- Run the speech understanding script:
```
 python speech_understanding.py --input_dir ./audios
```
- View the output with speech-to-text and analysis results.
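
Before running any of the scripts above, it can help to confirm that each input directory exists and actually contains files of the expected kind. The following is a minimal sketch of such a check, not part of the project: the `collect_inputs` helper and the extension sets are assumptions made for illustration, since the article does not say which file formats the scripts accept.

```python
from pathlib import Path

# Assumed extension sets; the article does not specify accepted formats.
ASSUMED_EXTENSIONS = {
    "image": {".jpg", ".jpeg", ".png"},
    "text": {".txt"},
    "audio": {".wav", ".mp3"},
}


def collect_inputs(input_dir, modality):
    """Return the sorted files in input_dir whose suffix matches the modality."""
    root = Path(input_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"input directory not found: {root}")
    allowed = ASSUMED_EXTENSIONS[modality]
    # Compare suffixes case-insensitively so "photo.PNG" is also accepted.
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in allowed
    )


if __name__ == "__main__":
    # Directory names follow the commands shown in this guide.
    for modality, directory in [("image", "./images"), ("text", "./texts"), ("audio", "./audios")]:
        try:
            files = collect_inputs(directory, modality)
            print(f"{modality}: {len(files)} file(s) in {directory}")
        except FileNotFoundError as err:
            print(f"{modality}: {err}")
```

Running the check first surfaces a missing or empty directory as a clear message instead of a failure partway through a script.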

Featured Function Workflows

Multimodal understanding:
- Place the image, text, and audio files in the corresponding directories.
- Run the multimodal understanding script:
```
 python multimodal_understanding.py --image_dir ./images --text_dir ./texts --audio_dir ./audios
```
- View the results of a comprehensive analysis that includes joint comprehension and analysis of images, text, and speech.
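
The article does not say how files in the three directories are matched to one another for joint analysis. One plausible convention, sketched below purely as an assumption, is to pair files that share a base name (for example `a.jpg`, `a.txt`, and `a.wav`). The `pair_by_stem` helper is hypothetical and is not part of Infini-Megrez.

```python
from collections import defaultdict
from pathlib import Path


def pair_by_stem(image_dir, text_dir, audio_dir):
    """Group files from the three directories by shared file stem.

    Assumption: files that belong together share a base name. If one
    directory holds two files with the same stem, the later one wins.
    """
    samples = defaultdict(dict)
    sources = (("image", image_dir), ("text", text_dir), ("audio", audio_dir))
    for modality, directory in sources:
        for path in sorted(Path(directory).glob("*")):
            if path.is_file():
                samples[path.stem][modality] = path
    # Keep only samples for which all three modalities are present.
    return {stem: parts for stem, parts in sorted(samples.items()) if len(parts) == 3}


if __name__ == "__main__":
    # Directory names follow the command shown in this guide.
    complete = pair_by_stem("./images", "./texts", "./audios")
    for stem, parts in complete.items():
        print(stem, {modality: path.name for modality, path in parts.items()})
```

Listing the complete triples up front makes it easy to spot an image that has no matching text or audio file.
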
WebSearch solution:
- Configure the WebSearch module and make sure the network connection is working.
- Run the WebSearch script:
```
 python websearch.py --query "your query here"
```
- View search results and summaries. The system automatically determines whether the search function needs to be invoked and provides optimized summary results.
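
The article does not describe how the system decides whether to invoke search, and in the real project that decision is presumably made by the model itself. To illustrate only the control flow (decide, optionally search, then summarize), the sketch below swaps in a simple keyword heuristic. The `needs_search` and `answer` functions, the cue list, and the `search_fn` and `summarize_fn` callbacks are all hypothetical.

```python
import re

# Hypothetical cue words suggesting the answer depends on fresh information.
FRESHNESS_CUES = ("today", "latest", "current", "news", "price", "weather")


def needs_search(query):
    """Heuristic stand-in for the model's decision: is the query time-sensitive?"""
    lowered = query.lower()
    words = set(re.findall(r"[a-z]+", lowered))
    has_cue = any(cue in words for cue in FRESHNESS_CUES)
    mentions_year = re.search(r"\b20\d{2}\b", lowered) is not None
    return has_cue or mentions_year


def answer(query, search_fn, summarize_fn):
    """Route the query: fetch search results only when needed, then summarize."""
    context = search_fn(query) if needs_search(query) else []
    return summarize_fn(query, context)


if __name__ == "__main__":
    # Stub callbacks so the sketch runs without network access.
    fake_search = lambda q: [f"result for: {q}"]
    fake_summarize = lambda q, ctx: f"{q} -> used {len(ctx)} search result(s)"
    print(answer("What is the weather today?", fake_search, fake_summarize))
    print(answer("Explain the LLaMA architecture", fake_search, fake_summarize))
```

Keeping the decision in its own function means the heuristic can later be replaced by a model-driven check without changing the routing code.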

With the steps above, users can explore and use the functions of Infini-Megrez for efficient multimodal understanding and analysis.