Awex - Ant Group's open-source high-performance weight exchange framework
What's Awex?
Awex is a high-performance weight exchange framework open-sourced by Ant Group, designed for large-scale parameter synchronization in reinforcement learning. It can complete terabyte-scale parameter exchange within seconds, which significantly improves the efficiency of training and inference; on a thousand-GPU cluster, a trillion-parameter model can be fully synchronized in 6 seconds.

With a unified model adaptation layer, Awex automatically handles the differences in tensor format between engines and is compatible with a variety of model architectures. It supports zero-redundancy transmission and in-place updates, transmitting only the necessary slices to reduce GPU-memory copy overhead, and it supports multiple transport modes, such as NCCL, RDMA, and shared memory, to make full use of hardware bandwidth. Awex is also compatible with heterogeneous deployments, supporting both colocated (shared-GPU) and disaggregated (separate-GPU) modes to fit a variety of training scenarios.
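To make the problem concrete, the snippet below shows the naive baseline that a framework like Awex improves on: pushing a trainer's weights to inference replicas with a plain torch.distributed broadcast over NCCL. This is a hedged illustration of the general pattern only, not Awex's actual API; the function name and setup here are hypothetical.

```python
# Illustrative baseline only -- NOT Awex's API. This is the generic
# "trainer pushes weights to inference ranks" pattern that Awex
# accelerates; the function name and rank layout are hypothetical.
import torch
import torch.distributed as dist

@torch.no_grad()
def broadcast_weights(model: torch.nn.Module, src_rank: int = 0) -> None:
    """Broadcast every parameter from the training rank to all other ranks.

    Each tensor is sent in full over NCCL, so the cost scales with total
    model size; Awex's zero-redundancy transfer instead ships only the
    slices a receiver does not already hold.
    """
    for param in model.parameters():
        # Receivers overwrite their existing GPU buffers in place,
        # so no new device memory is allocated.
        dist.broadcast(param.data, src=src_rank)

# Usage (one process per GPU, after dist.init_process_group("nccl")):
#   broadcast_weights(inference_model, src_rank=0)
```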

Features of Awex
- Extreme synchronization performance: Quickly completes synchronization of terabytes of parameters in large-scale clusters, significantly improving reinforcement learning training and inference efficiency; for example, on a thousand-GPU cluster, a trillion-parameter model can be fully synchronized in 6 seconds.
- Universal Model Adaptation Layer (UMA): Automatically handles tensor format and layout differences between training and inference engines, supports multiple model architectures, and reduces development and deployment complexity (see the sketch after this list).
- Zero-redundancy transmission with in-place updates: Transmits only the necessary parameter slices, and the inference side updates GPU memory in place, avoiding the overhead of reallocating and copying GPU memory and improving resource utilization.
- Multi-mode transport support: Compatible with transport modes such as NCCL, RDMA, and shared memory, making full use of the bandwidth of different hardware while reducing long-tail latency and improving overall transfer performance.
- Heterogeneous deployment compatibility: Supports colocated (shared-GPU) and disaggregated (separate-GPU) modes, adapting to both synchronous and asynchronous reinforcement learning algorithms and meeting diverse deployment needs.
- Flexible pluggable architecture: Supports customized weight sharding and layout behavior for different models, and allows new training and inference engines to be plugged in, offering good scalability and flexibility.
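As an example of the layout differences an adaptation layer has to bridge, a training engine may store attention projections as one fused QKV matrix while an inference engine expects separate Q, K, and V tensors. The sketch below performs that translation by hand; it is a hypothetical illustration of the kind of transform UMA automates, not Awex code.

```python
# Hand-rolled example of a layout translation that a model adaptation
# layer must perform; shapes and names are hypothetical, not Awex's.
import torch

def split_fused_qkv(fused: torch.Tensor) -> tuple[torch.Tensor, ...]:
    """Split a training-side fused QKV projection of shape
    [3 * hidden, hidden] into the three [hidden, hidden] tensors
    an inference engine may expect."""
    q, k, v = fused.chunk(3, dim=0)
    # .contiguous() gives each slice its own dense storage so it can be
    # copied into the inference engine's buffers independently.
    return q.contiguous(), k.contiguous(), v.contiguous()
```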
Awex's core strengths
- High-performance synchronization: Completes terabyte-scale parameter synchronization within seconds on large-scale clusters, significantly improving reinforcement learning training and inference efficiency; on a thousand-GPU cluster, a trillion-parameter model can be fully synchronized in 6 seconds.
- High compatibility: Automatically adapts tensor formats and layouts across different training and inference engines, supports multiple model architectures, and reduces development and deployment complexity.
- Efficient transmission: Transmits only the necessary parameter slices, and the inference side updates GPU memory in place, avoiding reallocation and copy overhead and improving resource utilization (sketched after this list).
- Multi-mode transport support: Compatible with transport modes such as NCCL, RDMA, and shared memory, making full use of hardware bandwidth while reducing long-tail latency.
- Flexible architecture: Supports customized weight sharding and layout behavior, and allows new training and inference engines to be plugged in, offering good scalability and flexibility.
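The in-place update idea from the list above can be sketched in a few lines: copy each received slice directly into the tensor the inference engine is already serving from, so no GPU memory is reallocated. This is a minimal sketch assuming the engine exposes its live parameters as a name-to-tensor mapping; it is not Awex's actual interface.

```python
# Minimal sketch of in-place weight updates, assuming the inference engine
# exposes its live parameters as a name -> tensor mapping (hypothetical).
import torch

@torch.no_grad()
def update_in_place(live_params: dict[str, torch.Tensor],
                    received: dict[str, torch.Tensor]) -> None:
    """Write received parameter slices into the buffers the engine is
    already serving from, avoiding any reallocation of GPU memory."""
    for name, tensor in received.items():
        # copy_ reuses the existing storage; non_blocking=True lets a
        # host-to-device copy overlap with compute when the source is pinned.
        live_params[name].copy_(tensor, non_blocking=True)
```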
What is the official website of Awex?
- GitHub repository: https://github.com/inclusionAI/asystem-awex
Who is Awex for?
- Deep learning and reinforcement learning researchers: Researchers who need to train and run inference efficiently on large-scale clusters, especially teams working with very large parameter models; Awex can significantly improve their productivity.
- AI engineers: Engineers who develop and deploy reinforcement learning systems in an enterprise or organization; Awex helps them quickly synchronize weights between training and inference and optimize system performance.
- Cloud computing and data center operators: Teams managing large-scale computing resources; Awex's efficient parameter synchronization improves resource utilization and overall data center operational efficiency.
- High-performance computing (HPC) developers: Professionals who work with large-scale data and complex computational tasks; Awex's multi-mode transport and flexible architecture meet their needs in high-performance computing environments.