Cross-Device Multimodal Benchmarks for On-Device Generative AI with Nexa Compressed Inference

AI News · Published 7 months ago · AI Sharing Circle

Executive Summary

Nexa's on-device inference framework makes deploying generative AI models on the device seamless and efficient. It supports a wide range of chipsets, including AMD, Qualcomm, Intel, NVIDIA, and custom chips, and is compatible with all major operating systems. We provide benchmark data for generative AI models on a variety of common tasks, each tested on different types of devices across a range of TOPS performance levels.

Core strengths:

  1. Multimodal capability - Supports text, audio, video, and vision generative AI tasks
  2. Broad hardware compatibility - Runs AI models on PCs, laptops, mobile devices, and embedded systems
  3. Leading performance - With our edge inference framework, NexaQuant, models run 2.5x faster and storage and memory requirements are reduced by 4x, while maintaining high accuracy
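
To make the 4x storage figure concrete, here is a rough back-of-the-envelope sketch of how weight storage scales with bits per weight. The article does not state which bit-widths NexaQuant uses, so the 16-bit baseline, the 4-bit target, and the 3B parameter count below are illustrative assumptions only, and the estimate ignores activations, caches, and runtime overhead.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> float:
    """Approximate bytes needed to store the model weights alone."""
    return num_params * bits_per_weight / 8


# Illustrative assumptions, not figures taken from the benchmarks.
NUM_PARAMS = 3_000_000_000  # hypothetical 3B-parameter model
BASELINE_BITS = 16          # assumed unquantized precision
QUANTIZED_BITS = 4          # assumed quantized precision

baseline = weight_bytes(NUM_PARAMS, BASELINE_BITS)
quantized = weight_bytes(NUM_PARAMS, QUANTIZED_BITS)
reduction = baseline / quantized

print(f"{BASELINE_BITS}-bit weights: {baseline / 1e9:.1f} GB")
print(f"{QUANTIZED_BITS}-bit weights: {quantized / 1e9:.1f} GB")
print(f"Reduction: {reduction:.0f}x")
```

Under these assumptions the ratio depends only on the two bit-widths, so any 16-bit to 4-bit conversion gives the same 4x reduction in weight storage regardless of model size.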

Why on-device AI?

Deploying AI models directly on the device has several advantages over relying on cloud APIs:

  • Privacy and security - Keeping data on the device ensures confidentiality
  • Lower cost - No need to pay for expensive cloud-based inference
  • Speed and responsiveness - Low-latency inference without relying on the network
  • Offline capability - AI applications remain usable in areas with poor connectivity

With Nexa edge inference technology, developers can efficiently run generative AI models on a wide range of devices while minimizing resource consumption.

New Trends in Multimodal AI Applications

Nexa AI's on-device deployment supports multimodal AI, enabling applications to process and integrate multiple data types:

  • Text AI - Chatbots, document summarization, programming assistants
  • Speech-to-speech AI - Real-time voice translation, AI voice assistants
  • Visual AI - Object detection, image description, document OCR processing

With NexaQuant, our multimodal models achieve excellent compression and acceleration while maintaining top performance.

Cross-Device Generative AI Task Performance Benchmarks

We provide benchmark data for generative AI models on a variety of common tasks, each tested on different types of devices across a range of TOPS performance levels. If you have a specific device and target use case, you can refer to a device with similar performance to estimate its processing power:

Generative AI tasks covered:

  • Speech to speech
  • Text to text
  • Vision to text

Covered device types:

  • Modern notebook chips - Optimized for on-device AI processing on desktops and laptops
  • Flagship mobile chips - AI models running on smartphones and tablets
  • Embedded systems (~4 TOPS) - Low-power devices for edge computing applications
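
The benchmarks in this article report two timing metrics per device: time to first token (TTFT) and decoding speed. The article does not describe its measurement harness, so the following is only a minimal sketch of one common way to compute both from a streaming token generator. `measure_stream` and `fake_model` are hypothetical names, and the fake model advances a simulated clock instead of running real inference so the example is deterministic.

```python
import time
from typing import Callable, Iterable, Iterator, List


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a token stream and return TTFT and decoding speed.

    TTFT: time from the start of the request until the first token arrives.
    Decoding speed: tokens per second over the tokens after the first one
    (an assumed definition; the article does not specify its own).
    """
    start = clock()
    first_token_time = None
    last_token_time = start
    count = 0
    for _ in tokens:
        now = clock()
        if first_token_time is None:
            first_token_time = now
        last_token_time = now
        count += 1
    if first_token_time is None:
        raise ValueError("stream produced no tokens")
    decode_time = last_token_time - first_token_time
    speed = (count - 1) / decode_time if decode_time > 0 else float("nan")
    return {
        "ttft_s": first_token_time - start,
        "tokens_per_s": speed,
        "tokens": count,
    }


def fake_model(sim_clock: List[float], prefill_s: float,
               per_token_s: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a model: advances a simulated clock."""
    sim_clock[0] += prefill_s  # simulated prompt processing
    for i in range(n_tokens):
        if i > 0:
            sim_clock[0] += per_token_s  # simulated per-token decode
        yield f"tok{i}"


# Simulated run: 0.5 s of prefill, then 0.1 s per token, 11 tokens total.
sim = [0.0]
stats = measure_stream(fake_model(sim, 0.5, 0.1, 11), clock=lambda: sim[0])
print(f"TTFT: {stats['ttft_s']:.2f} s, "
      f"decoding speed: {stats['tokens_per_s']:.2f} tokens/second")
```

To time a real model, pass its streaming iterator to `measure_stream` and leave `clock` at its default; the injected clock exists only to make the sketch reproducible.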

Speech-to-speech benchmarking

Evaluates real-time speech interaction with language models: audio input is processed to generate audio output.

| Device type | Chip & device | Latency (TTFT) | Decoding speed | Average peak memory |
| --- | --- | --- | --- | --- |
| Modern notebook chip (GPU) | Apple M3 Pro GPU | 0.67 seconds | 20.46 tokens/second | ~990 MB |
| Modern notebook chip (iGPU) | AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M) | 1.01 seconds | 19.28 tokens/second | ~990 MB |
| Modern notebook chip (CPU) | Intel Core Ultra 7 268V | 1.89 seconds | 11.88 tokens/second | ~990 MB |
| Flagship mobile chip (CPU) | Qualcomm Snapdragon 8 Gen 3 (Samsung S24) | 1.45 seconds | 9.13 tokens/second | ~990 MB |
| Embedded IoT system (CPU) | Raspberry Pi 4 Model B | 6.9 seconds | 4.5 tokens/second | ~990 MB |

Speech-to-Speech Benchmarking Using Moshi with NexaQuant

Text-to-text benchmarking

Evaluates the performance of AI models that generate text from text input.

| Device type | Chip & device | Initial latency (TTFT) | Decoding speed | Average peak memory |
| --- | --- | --- | --- | --- |
| Modern notebook chip (GPU) | Apple M3 Pro GPU | 0.12 seconds | 49.01 tokens/second | ~2580 MB |
| Modern notebook chip (iGPU) | AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M) | 0.19 seconds | 30.54 tokens/second | ~2580 MB |
| Modern notebook chip (CPU) | Intel Core Ultra 7 268V | 0.63 seconds | 14.35 tokens/second | ~2580 MB |
| Flagship mobile chip (CPU) | Qualcomm Snapdragon 8 Gen 3 (Samsung S24) | 0.27 seconds | 10.89 tokens/second | ~2580 MB |
| Embedded IoT system (CPU) | Raspberry Pi 4 Model B | 1.27 seconds | 5.31 tokens/second | ~2580 MB |

Text-to-text benchmarking using llama-3.2 with NexaQuant
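
TTFT and decoding speed can be combined into a rough estimate of how long a complete reply takes on each device. The sketch below uses the text-to-text figures reported above and a simple model, total time ≈ TTFT + output tokens / decoding speed. The 100-token reply length and the function name are assumptions for illustration, and the model ignores prompt length, thermal throttling, and other real-world effects.

```python
# (TTFT in seconds, decoding speed in tokens/second) from the
# text-to-text benchmark figures reported in this article.
TEXT_TO_TEXT = {
    "Apple M3 Pro GPU": (0.12, 49.01),
    "AMD Ryzen AI 9 HX 370 iGPU": (0.19, 30.54),
    "Intel Core Ultra 7 268V": (0.63, 14.35),
    "Qualcomm Snapdragon 8 Gen 3": (0.27, 10.89),
    "Raspberry Pi 4 Model B": (1.27, 5.31),
}


def estimated_response_seconds(ttft_s: float, tokens_per_s: float,
                               output_tokens: int) -> float:
    """Rough end-to-end time: wait for the first token, then decode the rest."""
    return ttft_s + output_tokens / tokens_per_s


OUTPUT_TOKENS = 100  # hypothetical reply length, not from the article

for device, (ttft, speed) in TEXT_TO_TEXT.items():
    total = estimated_response_seconds(ttft, speed, OUTPUT_TOKENS)
    print(f"{device}: ~{total:.1f} s for {OUTPUT_TOKENS} tokens")
```

For a 100-token reply this works out to roughly 2 seconds on the Apple M3 Pro GPU and about 20 seconds on the Raspberry Pi 4, which shows that for longer outputs decoding speed dominates the total time far more than TTFT does.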

Visual-to-text benchmarking

Evaluates the AI's ability to analyze visual input, generate responses, extract key visual information, and dynamically guide tools: visual input, text output.

| Device type | Chip & device | Initial latency (TTFT) | Decoding speed | Average peak memory |
| --- | --- | --- | --- | --- |
| Modern notebook chip (GPU) | Apple M3 Pro GPU | 2.62 seconds | 86.77 tokens/second | ~1093 MB |
| Modern notebook chip (iGPU) | AMD Ryzen AI 9 HX 370 iGPU (Radeon 890M) | 2.14 seconds | 83.41 tokens/second | ~1093 MB |
| Modern notebook chip (CPU) | Intel Core Ultra 7 268V | 9.43 seconds | 45.65 tokens/second | ~1093 MB |
| Flagship mobile chip (CPU) | Qualcomm Snapdragon 8 Gen 3 (Samsung S24) | 7.26 seconds | 27.66 tokens/second | ~1093 MB |
| Embedded IoT system (CPU) | Raspberry Pi 4 Model B | 22.32 seconds | 6.15 tokens/second | ~1093 MB |

Visual-to-Text Benchmarking Using OmniVLM with NexaQuant
