Ming-Lite-Omni - Open-source unified multimodal large model from Ant Group's Bailing team
What is Ming-Lite-Omni?
Ming-Lite-Omni is an open-source unified multimodal large model from Ant Group's Bailing large-model team, built on an efficient Mixture of Experts (MoE) architecture. It processes multimodal data such as text, images, audio, and video, and offers strong comprehension and generation capabilities. The model is optimized for computational efficiency, supports large-scale data processing and real-time interaction, and is highly scalable, which gives it a wide range of application scenarios and makes it an integrated intelligent solution with broad prospects.

Main features of Ming-Lite-Omni
- Multimodal interaction: Supports text, image, audio, and video inputs and outputs for a natural, fluid interaction experience, and supports multi-round dialog for coherent conversations (see the sketch after this list).
- Understanding and generation: Strong comprehension capabilities to accurately recognize and understand data across modalities, plus efficient generation of high-quality text, image, audio, and video content.
- Efficient processing: The MoE architecture optimizes computational efficiency and supports large-scale data processing and real-time interaction.
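As a hedged illustration of multi-round dialog, the sketch below reuses the message schema from the usage section further down; the "ASSISTANT" role name and the example replies are assumptions and may differ from the model's actual chat template.
# A minimal multi-turn sketch reusing the message schema shown later in this article.
# The "ASSISTANT" role name and the example contents are assumptions, not confirmed API.
messages = [
    {"role": "HUMAN", "content": [{"type": "text", "text": "What kind of flower is this?"}]},
    {"role": "ASSISTANT", "content": [{"type": "text", "text": "It looks like a sunflower."}]},  # assumed role name
    {"role": "HUMAN", "content": [{"type": "text", "text": "How should I care for it?"}]},
]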
Ming-Lite-Omni's official address
- HuggingFace model library: https://huggingface.co/inclusionAI/Ming-Lite-Omni
How to use Ming-Lite-Omni
- Environment preparation:
- Install Python: Python 3.8 or higher is recommended; download and install it from the official Python website.
- Install dependencies: Run the following commands in a terminal to install the required libraries:
pip install -r requirements.txt
pip install data/matcha_tts-0.0.5.1-cp38-cp38-linux_x86_64.whl
pip install diffusers==0.33.0
pip install nvidia-cublas-cu12==12.4.5.8 # if using an NVIDIA GPU
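After installing, a quick sanity check such as the one below (generic PyTorch/Transformers calls, not part of the official instructions) confirms that the libraries import and that the GPU is visible:
# Quick environment check (generic PyTorch/Transformers calls, not model-specific).
import torch
import transformers

print("transformers version:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))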
- Download the model: Clone the Ming-Lite-Omni repository from Hugging Face:
git clone https://huggingface.co/inclusionAI/Ming-Lite-Omni
cd Ming-Lite-Omni
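Hugging Face model repositories typically store their weights with Git LFS; if a plain git clone does not pull the large files, the standard huggingface_hub API offers an alternative download path. The snippet below is a hedged sketch, not part of the official instructions:
# Alternative download using the standard huggingface_hub API (a generic sketch).
# Downloads the repository, including weight files, into local_dir.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="inclusionAI/Ming-Lite-Omni",
    local_dir="Ming-Lite-Omni",
)
print("Model downloaded to:", local_dir)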
- Load the model: Use the following code to load the model and processor:
import os
import torch
from transformers import AutoProcessor, GenerationConfig
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# Set the model path (point this at the model directory downloaded above)
model_path = "Ming-Lite-Omni-Preview"

# Load the model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
).to("cuda")

# Load the processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
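If no CUDA device is available, a device fallback along the lines below keeps the loading step usable; this is a generic PyTorch pattern, not part of the official instructions, it reuses the model_path and imports defined above, and CPU inference will be slow:
# Generic device fallback (a sketch, not from the official instructions):
# use bfloat16 on GPU, float32 on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
).to(device)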
- Prepare input data: Prepare the input data according to the model's requirements. Ming-Lite-Omni accepts several input modalities; text and image inputs are shown below as examples.
- Text input:
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "Please describe the living habits of parrots in detail."}
        ],
    },
]
- Image input:
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join("assets", "flowers.jpg")},
            {"type": "text", "text": "What kind of flower is this?"}
        ],
    },
]
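The article only shows text and image inputs; for audio and video, the message format presumably mirrors the image case. The "audio" and "video" type and field names and the file paths below are assumptions by analogy, not confirmed API:
# Hedged sketch: audio and video messages by analogy with the image example above.
# The "audio"/"video" type and field names and the file paths are assumptions.
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "video", "video": os.path.join("assets", "clip.mp4")},    # hypothetical file
            {"type": "audio", "audio": os.path.join("assets", "speech.wav")},  # hypothetical file
            {"type": "text", "text": "Summarize what is happening in this clip."}
        ],
    },
]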
- Data preprocessing: Preprocess the input data with the processor:
text = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
# Cast the multimodal feature tensors to bfloat16 to match the model weights
for k in inputs.keys():
    if k in ("pixel_values", "pixel_values_videos", "audio_feats"):
        inputs[k] = inputs[k].to(dtype=torch.bfloat16)
- Model inference: Invoke the model to run inference and generate the output:
generation_config = GenerationConfig.from_dict({'no_repeat_ngram_size': 10})
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=True,
    eos_token_id=processor.gen_terminator,
    generation_config=generation_config,
)
# Strip the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
- Output result: The generated output can then be further processed or presented as needed.
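For convenience, the preprocessing and generation steps above can be collected into one helper. The sketch below only rearranges the article's own code; the function name chat and the final call are illustrative choices, not part of the official API:
# Convenience wrapper around the preprocessing and generation steps shown above.
# The function name `chat` is an illustrative choice, not part of the official API.
def chat(messages, max_new_tokens=512):
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        audios=audio_inputs,
        return_tensors="pt",
    ).to(model.device)
    for k in inputs.keys():
        if k in ("pixel_values", "pixel_values_videos", "audio_feats"):
            inputs[k] = inputs[k].to(dtype=torch.bfloat16)
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        use_cache=True,
        eos_token_id=processor.gen_terminator,
        generation_config=GenerationConfig.from_dict({"no_repeat_ngram_size": 10}),
    )
    # Keep only the newly generated tokens and decode them
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

print(chat(messages))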
Core Advantages of Ming-Lite-Omni
- Multimodal fusion: Supports multimodal input and output of text, images, audio, and video for full multimodal interaction.
- Efficient architecture: Built on a Mixture of Experts (MoE) architecture whose dynamic routing optimizes computational efficiency and reduces wasted resources.
- Unified understanding and generation: The encoder-decoder architecture supports integrated comprehension and generation, providing a coherent interactive experience.
- Optimized inference: A hybrid linear attention mechanism reduces computational complexity and supports real-time interaction, making it suitable for fast-response scenarios.
- Wide applicability: Suitable for fields such as intelligent customer service, content creation, education, healthcare, and smart office.
- Open source and community support: An open-source model with rich community resources that help developers get started and innovate quickly.
People for whom Ming-Lite-Omni is suitable
- Business users: Technology companies and content-creation businesses that need efficient multimodal solutions.
- Educators and students: Teachers and students who want to use AI to assist teaching and learning.
- Healthcare practitioners: Healthcare workers who need assistance with medical-record analysis and medical-image interpretation.
- Smart office users: Business employees and managers who need to work with documents and improve office efficiency.
- General consumers: Individual users who use smart devices and want to generate creative content.