
Aphrodite Engine: an efficient LLM inference engine that supports multiple quantization formats and distributed inference.

General Introduction

Aphrodite Engine is the official backend engine for PygmalionAI, designed to provide the inference endpoint for PygmalionAI sites and to support rapid deployment of Hugging Face-compatible models. The engine builds on vLLM's PagedAttention technology to enable efficient K/V cache management and continuous batching, which significantly improves inference speed and memory utilization. Aphrodite Engine supports multiple quantization formats and distributed inference across a wide range of modern GPU and TPU devices.

 

Feature List

  • Continuous batching: Processes many requests concurrently to improve inference throughput.
  • PagedAttention: Optimizes K/V cache management to improve memory utilization.
  • CUDA-optimized kernels: Custom GPU kernels that improve inference performance.
  • Quantization support: Supports multiple quantization formats such as AQLM, AWQ, and Bitsandbytes.
  • Distributed inference: Runs a single model across multiple GPUs, with an 8-bit KV cache for high context lengths and high throughput.
  • Multi-device support: Compatible with NVIDIA, AMD, and Intel GPUs as well as Google TPUs.
  • Docker deployment: Provides Docker images to simplify the deployment process.
  • API compatibility: Exposes an OpenAI-compatible API for easy integration into existing systems.

 

Usage Guide

Installation process

  1. Install dependencies:
    • Make sure Python 3.8 to 3.12 is installed on your system.
    • For Linux users, the following command is recommended to install the dependencies:
     sudo apt update && sudo apt install python3 python3-pip git wget curl bzip2 tar
    
    • For Windows users, installing under WSL2 is recommended:
     wsl --install
     sudo apt update && sudo apt install python3 python3-pip git wget curl bzip2 tar
    
  2. Install Aphrodite Engine:
    • Install with pip:
     pip install -U aphrodite-engine
    
  3. Start the model:
    • Run the following command to start a model:
      aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct
    • This starts an OpenAI-compatible API server on the default port 2242.
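Once the server is up, it speaks the OpenAI chat-completions protocol. A minimal sketch of querying it from Python with only the standard library (this assumes the server is running locally on the default port 2242 and that the endpoint path follows the usual OpenAI `/v1/chat/completions` convention):

```python
import json
import urllib.request

# Assumed local endpoint: Aphrodite serves an OpenAI-compatible API on port 2242 by default.
API_URL = "http://localhost:2242/v1/chat/completions"

# An OpenAI-style chat completion payload for the model started above.
payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}

def query(url: str = API_URL) -> dict:
    """POST the payload to the server and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With a running server, the generated text would be at:
# query()["choices"][0]["message"]["content"]
```

Because the response follows the OpenAI schema, any OpenAI-compatible client library should also work by pointing its base URL at the local server.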

Deploying with Docker

  1. Pull the Docker image:
   docker pull alpindale/aphrodite-openai:latest
  2. Run the Docker container:
   docker run --runtime nvidia --gpus all \
     -v ~/.cache/huggingface:/root/.cache/huggingface \
     -p 2242:2242 \
     --ipc=host \
     alpindale/aphrodite-openai:latest \
     --model NousResearch/Meta-Llama-3.1-8B-Instruct \
     --tensor-parallel-size 8 \
     --api-keys "sk-empty"

Main Feature Workflow

  1. Continuous batching:
    • Aphrodite Engine significantly improves inference speed by processing multiple requests concurrently through continuous batching. Users simply specify the batching parameters at startup.
  2. PagedAttention:
    • This technique optimizes K/V cache management and improves memory utilization. No additional configuration is required; the system applies the optimization automatically.
  3. Quantization support:
    • A variety of quantization formats are supported, such as AQLM, AWQ, and Bitsandbytes. Users can specify the desired quantization format when starting the model:
     aphrodite run --quant-format AQLM meta-llama/Meta-Llama-3.1-8B-Instruct
    
  4. Distributed inference:
    • An 8-bit KV cache is supported for high context lengths and high throughput. Users can start distributed inference across multiple GPUs with the following command:
     aphrodite run --tensor-parallel-size 8 meta-llama/Meta-Llama-3.1-8B-Instruct
    
  5. API integration:
    • Aphrodite Engine provides an OpenAI-compatible API for easy integration into existing systems. Users can start an authenticated API server with the following command:
      aphrodite run --api-keys "your-api-key" meta-llama/Meta-Llama-3.1-8B-Instruct
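When the server is started with an API key, clients must authenticate the same way they would against the OpenAI API: with a `Bearer` token in the `Authorization` header. A minimal sketch of building such an authenticated request with the standard library (the key value and the local URL are illustrative placeholders; use whatever you passed to `--api-keys`):

```python
import json
import urllib.request

# Placeholder key for illustration; must match the value given to --api-keys at startup.
API_KEY = "your-api-key"
API_URL = "http://localhost:2242/v1/chat/completions"  # assumed default port

def build_request(prompt: str) -> urllib.request.Request:
    """Build an authenticated OpenAI-style chat request for the local server."""
    body = {
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",  # OpenAI-compatible auth header
        },
    )

req = build_request("Hello!")
# urllib.request.urlopen(req) would send it once the server is running.
```

Existing OpenAI client code can typically be pointed at this server by changing only the base URL and the API key.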

May not be reproduced without permission:Chief AI Sharing Circle " Aphrodite Engine: an efficient LLM inference engine that supports multiple quantization formats and distributed inference.
