UniPixel - A pixel-level multimodal model open-sourced by The Hong Kong Polytechnic University, Tencent, the Chinese Academy of Sciences, and others
What is UniPixel?
UniPixel is a multimodal model jointly proposed by The Hong Kong Polytechnic University, Tencent, the Chinese Academy of Sciences, and vivo for pixel-level visual language understanding. By unifying object referring and segmentation capabilities, it supports a variety of fine-grained tasks, such as image segmentation, video segmentation, region understanding, and PixelQA.

UniPixel's core strength is its pixel-level reasoning capability: it generates accurate pixel-level masks from language descriptions, realizing a deep fusion of language and vision. It performs well on several benchmarks; for example, UniPixel-3B reaches 62.1 J&F on the ReVOS reasoning segmentation benchmark, outperforming all prior models. UniPixel also releases a rich set of model weights and datasets and supports flexible hardware setups and efficient training techniques, which greatly facilitates research and applications. It holds promise for applications in intelligent surveillance, content creation, education, medical image analysis, and autonomous driving.
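For readers unfamiliar with the task, the sketch below captures the input/output contract of referring segmentation: a language expression goes in, a pixel mask comes out. The names here (`ReferringSegmenter`, `segment`) are hypothetical and only illustrate the contract; UniPixel's actual inference API lives in the GitHub repository linked below.

```python
# Hypothetical interface sketch of referring segmentation, NOT UniPixel's
# actual API. It only illustrates the task's input/output contract.
from typing import Protocol

import numpy as np


class ReferringSegmenter(Protocol):
    def segment(self, image: np.ndarray, expression: str) -> np.ndarray:
        """Given an (H, W, 3) image and a referring expression such as
        'the dog on the left', return a boolean (H, W) mask selecting
        the pixels that match the expression."""
        ...
```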
Features of UniPixel
- Pixel-level visual language understanding: UniPixel aligns language descriptions with visual content at the pixel level, supporting a variety of fine-grained tasks such as image segmentation, video segmentation, and region understanding.
- Unified object referring and segmentation: seamlessly integrates object referring and segmentation, generating pixel-level masks directly from language descriptions and providing the basis for complex visual reasoning.
- Multi-task support: performs well on several benchmarks, including ReVOS, MeViS, and Ref-YouTube-VOS, and also supports the PixelQA task for joint object referring, segmentation, and question answering (see the data-structure sketch after this list).
- Flexible visual prompt processing: flexibly handles visual prompt inputs, generates masks, and performs reasoning; supports single-frame and multi-frame video region understanding to suit different scenarios.
- Strong reasoning: the UniPixel-7B model performs well on complex visual reasoning tasks; for example, on the VideoRefer-Bench-Q question-answering benchmark it achieves 74.1% accuracy, outperforming several strong baseline models.
- Model weights and dataset availability: provides model weights for both UniPixel-3B and UniPixel-7B, along with raw images/videos and preprocessed annotations for 23 referring/segmentation/QA datasets, a rich resource for research and applications.
- Training and evaluation support: the codebase supports training and evaluation on multiple datasets and benchmarks, with flexible hardware setups, efficient training techniques, and customizable base LLMs and chat templates for ease of use and optimization.
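To make the PixelQA idea concrete, here is a hedged sketch of what a request/response pair for joint referring, segmentation, and question answering might look like. All class and field names are illustrative assumptions, not UniPixel's actual schema.

```python
# Illustrative data structures for a PixelQA-style task. All names are
# assumptions for exposition and do not mirror UniPixel's real schema.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class VisualPrompt:
    frame_idx: int                      # which video frame the prompt targets
    points: list[tuple[int, int]]       # clicked (x, y) pixel coordinates


@dataclass
class PixelQARequest:
    frames: list[np.ndarray]            # (H, W, 3) uint8 video frames
    question: str                       # e.g. "What is the marked region holding?"
    prompts: list[VisualPrompt] = field(default_factory=list)


@dataclass
class PixelQAResponse:
    answer: str                         # free-form text answer
    masks: list[np.ndarray]             # per-frame boolean masks for the region
```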
Core Benefits of UniPixel
- Pixel-level alignment capability: UniPixel's ability to achieve pixel-level alignment of linguistic descriptions with visual content is one of its core strengths, allowing it to excel in fine-grained visual language understanding tasks.
- Unified framework design: object referring and segmentation are seamlessly integrated into a single model; this unified design improves efficiency and provides a strong foundation for complex visual reasoning tasks.
- Multi-task adaptability: supports a wide range of tasks, including image segmentation, video segmentation, region understanding, and PixelQA, demonstrating broad adaptability across application scenarios.
- Excellent performance: achieves strong results on several benchmarks; on the ReVOS reasoning segmentation benchmark, UniPixel-3B reaches 62.1 J&F, outperforming all prior models (a simplified J&F computation is sketched after this list).
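For context, J&F is the standard video object segmentation metric: J is region similarity (the Jaccard index, i.e. IoU between predicted and ground-truth masks) and F is contour accuracy (an F-measure over mask boundaries); the reported score is their mean. The sketch below is a simplified approximation; the official DAVIS-style evaluation matches boundaries with a tolerance proportional to the image diagonal, which is replaced here by a fixed dilation radius.

```python
# Simplified J&F computation for boolean masks (an approximation of the
# official DAVIS evaluation; the boundary tolerance is a fixed radius).
import numpy as np
from scipy import ndimage


def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection over union of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0


def boundary_f(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """Contour accuracy F: boundary precision/recall within a small tolerance."""
    pb = pred ^ ndimage.binary_erosion(pred)   # 1-pixel-wide predicted boundary
    gb = gt ^ ndimage.binary_erosion(gt)       # 1-pixel-wide ground-truth boundary
    pb_d = ndimage.binary_dilation(pb, iterations=tol)
    gb_d = ndimage.binary_dilation(gb, iterations=tol)
    precision = (pb & gb_d).sum() / max(pb.sum(), 1)
    recall = (gb & pb_d).sum() / max(gb.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)


def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean of region similarity and contour accuracy, as reported on ReVOS."""
    return 0.5 * (jaccard(pred, gt) + boundary_f(pred, gt))
```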
UniPixel's official website and resources
- Project website: https://polyu-chenlab.github.io/unipixel/
- GitHub repository: https://github.com/PolyU-ChenLab/UniPixel
- HuggingFace dataset: https://huggingface.co/datasets/PolyU-ChenLab/UniPixel-SFT-1M
- arXiv technical paper: https://arxiv.org/pdf/2509.18094
- Online demo: https://huggingface.co/spaces/PolyU-ChenLab/UniPixel
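As a concrete starting point, the SFT dataset linked above can be fetched with the standard `huggingface_hub` client (assuming `pip install huggingface_hub`); the repo id is taken from the HuggingFace link.

```python
# Minimal sketch: download the UniPixel-SFT-1M dataset snapshot locally.
# Requires `pip install huggingface_hub`; this is a large download, so
# make sure the target disk has sufficient space.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="PolyU-ChenLab/UniPixel-SFT-1M",
    repo_type="dataset",
)
print(f"Dataset files available at: {local_dir}")
```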
Who UniPixel is for
- Artificial intelligence researchers: UniPixel offers researchers a powerful multimodal model for exploring cutting-edge techniques in visual language understanding, image segmentation, video processing, and more.
- Computer vision engineers: the model suits engineers who need image and video segmentation, object detection, and region understanding in real-world projects, improving development efficiency and application performance.
- Machine learning developers: for developers building multimodal applications, UniPixel provides model weights and datasets that facilitate rapid model construction and optimization.
- Data scientists: UniPixel's multi-task support and strong reasoning capabilities make it a powerful tool for working with complex visual data.
- Educators: in education, UniPixel can power interactive teaching tools that help students understand and analyze visual information, improving learning outcomes.
- Medical imaging analysts: in medical image processing, UniPixel can accurately segment lesion regions, assisting doctors with diagnosis and treatment planning and improving efficiency and accuracy.