General Introduction
DeepEP is an open-source communication library developed by the deepseek-ai team, focused on improving the training and inference efficiency of Mixture-of-Experts (MoE) models under Expert Parallelism (EP). It provides high-throughput, low-latency communication for large-scale distributed systems by optimizing data exchange between GPUs. DeepEP supports NVLink and RDMA, is compatible with low-precision formats such as FP8, and provides kernels tuned separately for training and inference scenarios. The library has been battle-tested in the DeepSeek team's production environment, especially for MoE models that require cross-node collaboration, where it can significantly improve overall system performance, making it a useful tool for AI researchers and developers building efficient deep learning systems. DeepEP is open-sourced on GitHub, and the community is welcome to contribute to its improvement.
Feature List
- Efficient all-to-all communication: Optimizes all-to-all communication between GPUs, supporting intra-node NVLink and inter-node RDMA for fast, stable data exchange.
- High-throughput training support: Provides kernels designed for training and inference prefilling that handle large-scale data transfers and improve training efficiency.
- Low-latency inference kernels: For the inference decoding phase, pure RDMA is used to minimize latency, making these kernels suitable for real-time applications.
- FP8 low-precision support: Native support for FP8 dispatch reduces communication and compute costs while maintaining performance in resource-sensitive environments.
- Flexible resource control: Supports adjusting the number of streaming multiprocessors (SMs) used, so developers can tune the configuration to their hardware.
- Communication-computation overlap: A hook-based mechanism overlaps communication with computation to improve GPU utilization.
- Cross-domain bandwidth forwarding: Provides efficient data forwarding from the NVLink domain to the RDMA domain, aligned with the group-limited gating algorithm used by DeepSeek-V3.
Usage Guide
Installation
DeepEP is an open-source project hosted on GitHub; you need to download the source and configure the environment manually. The detailed installation steps are as follows:
1. Prerequisites
- Operating system: Linux (e.g., Ubuntu 20.04 or later) is recommended for compatibility with the GPU and network hardware.
- Hardware: An NVLink- or RDMA-capable GPU (e.g., NVIDIA H800) connected to a high-speed network (e.g., 400 Gb/s InfiniBand).
- Software dependencies:
- CUDA Toolkit (a version compatible with your hardware, e.g., CUDA 11.x or 12.x).
- NCCL (NVIDIA Collective Communications Library).
- A modified version of NVSHMEM (DeepEP relies on its communication capabilities; it must be installed separately).
- Python 3.8+ (for running the tests and scripts). A quick environment check is sketched after this list.
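Assuming PyTorch is already installed in the Python environment (DeepEP's test scripts are Python-based), the following minimal sketch sanity-checks the visible GPUs and the reported CUDA version before you start building; it uses only standard PyTorch calls and does not depend on DeepEP:

import torch

assert torch.cuda.is_available(), "No CUDA device visible"
print("CUDA version reported by PyTorch:", torch.version.cuda)
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.multi_processor_count} SMs, "
          f"{props.total_memory / 1e9:.1f} GB")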
2. Download the DeepEP source code
Open a terminal and run the following command to clone the repository:
git clone https://github.com/deepseek-ai/DeepEP.git
cd DeepEP
3. Install NVSHMEM
DeepEP relies on a modified version of NVSHMEM; please refer to the NVSHMEM installation guide provided with the repository. In brief:
- Download the NVSHMEM source code and apply the patch provided by DeepEP (located at third-party/nvshmem.patch in the DeepEP repository).
- Compile and install:
cd nvshmem
patch -p1 < ../third-party/nvshmem.patch
make -j && sudo make install
4. Compile DeepEP
Go to the DeepEP directory and compile the communication library:
make
After compilation, the kernel files are generated and can be called from your project.
5. Configure environment variables
To ensure that DeepEP operates correctly, NVSHMEM-related parameters such as the virtual-lane assignment need to be set:
export NVSHMEM_IB_SL=0  # set the InfiniBand service level (virtual lane) to avoid traffic conflicts
Additional configuration is available if you need to enable adaptive routing (low latency kernels only):
export NVSHMEM_ENABLE_ADAPTIVE_ROUTING=1
6. Test the installation
Run the provided test scripts to verify DeepEP functionality:
python tests/test_low_latency.py
If the output shows successful communication, the installation is complete.
Usage
DeepEP is used primarily by integrating it into an MoE model's training or inference workflow. The following is a how-to guide for the main features:
Feature 1: High-throughput training
DeepEP's high-throughput kernels are suited to distributed training of MoE models. Assuming you have a DeepSeek-V3-based model, follow the steps below:
- Prepare the model and data: Make sure your MoE model is configured with expert-parallel logic and the training dataset is ready.
- Call the DeepEP kernels: Introduce DeepEP's all-to-all communication interface in the training script. Illustrative pseudocode (a hedged Python sketch follows this list):
#include "deep_ep.h" void moe_train(float* input, float* output, int size) { deep_ep_all_to_all(input, output, size, FP8); }
- Configure the hardware: Specify the GPU devices to use, for example:
CUDA_VISIBLE_DEVICES=0,1,2,3 ./train_script
- Run training: Once training starts, DeepEP automatically optimizes communication over NVLink and RDMA.
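For reference, the upstream repository also exposes a Python interface. The sketch below shows one dispatch/combine round per MoE layer; the names deep_ep.Buffer, get_dispatch_layout, dispatch and combine follow the upstream README, but the exact signatures here are assumptions and may differ between versions, so treat this as pseudocode and check the repository before use:

import torch
import torch.distributed as dist
import deep_ep  # assumed Python package name

# The buffer is created once per rank, e.g.
#   buffer = deep_ep.Buffer(ep_group, num_nvl_bytes, num_rdma_bytes)
# where ep_group is the expert-parallel process group.

def run_local_experts(tokens):
    # Placeholder for your expert FFN computation on the received tokens.
    return tokens

def moe_forward(buffer, x, topk_idx, topk_weights, num_experts):
    # Work out how many tokens each rank and expert will receive for this batch.
    num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, \
        is_token_in_rank, _ = buffer.get_dispatch_layout(topk_idx, num_experts)
    # All-to-all dispatch: send each token to the ranks hosting its selected experts.
    recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert, handle, _ = \
        buffer.dispatch(x, topk_idx=topk_idx, topk_weights=topk_weights,
                        num_tokens_per_rank=num_tokens_per_rank,
                        num_tokens_per_rdma_rank=num_tokens_per_rdma_rank,
                        is_token_in_rank=is_token_in_rank,
                        num_tokens_per_expert=num_tokens_per_expert)
    expert_out = run_local_experts(recv_x)
    # All-to-all combine: return expert outputs to each token's home rank and reduce.
    combined_x, _, _ = buffer.combine(expert_out, handle)
    return combined_x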
Feature 2: Low-latency inference
The low-latency kernels are suited to real-time inference tasks such as online dialogue systems:
- Load the model: Load the pre-trained MoE model into memory.
- Call the inference kernel: Use the pure-RDMA communication interface. Illustrative pseudocode (a hedged Python sketch follows this list):
#include "deep_ep.h" void moe_infer(float* query, float* result, int batch_size) { deep_ep_low_latency_all_to_all(query, result, batch_size); }
- Test inference speed: Run the following command to measure latency:
python tests/test_inference.py --batch_size 128 --hidden_size 7168
The output shows the per-batch inference time, so you can check that real-time requirements are met.
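As with the training path, the upstream repository exposes Python entry points for the low-latency kernels. The sketch below assumes the method names low_latency_dispatch and low_latency_combine from the upstream README; the signatures are assumptions and may differ, so treat it as pseudocode:

import torch
import deep_ep  # assumed Python package name

def run_local_experts(tokens):
    # Placeholder for your expert FFN computation on the received tokens.
    return tokens

def moe_decode_step(buffer, hidden_states, topk_idx, topk_weights,
                    num_max_dispatch_tokens_per_rank, num_experts):
    # Pure-RDMA dispatch for the decoding phase; the returned hook lets the
    # receive overlap with computation (communication-computation overlap).
    recv_states, recv_count, handle, event, hook = buffer.low_latency_dispatch(
        hidden_states, topk_idx, num_max_dispatch_tokens_per_rank, num_experts)
    expert_out = run_local_experts(recv_states)
    # Send expert outputs back to each token's home rank, weighted by the router.
    combined, event, hook = buffer.low_latency_combine(
        expert_out, topk_idx, topk_weights, handle)
    return combined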
Feature 3: FP8 optimization
DeepEP supports FP8 dispatch to reduce communication and compute costs:
- Enable FP8 mode: Specify FP8 as the data type when calling the communication interface.
- Verify accuracy: Run the test script to compare the performance and accuracy of FP8 against BF16 (a standalone sketch of such a check follows this list):
python tests/test_fp8.py
- Apply to production: Integrate the FP8 configuration into your existing training or inference pipeline.
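The repository's FP8 test is not reproduced here; the following standalone sketch only illustrates the kind of comparison such a test performs, quantizing the same tensor to FP8 (e4m3) and to BF16 and measuring the round-trip error against the FP32 reference (requires PyTorch 2.1+ for torch.float8_e4m3fn):

import torch

x = torch.randn(4096, 7168, dtype=torch.float32)

# Per-tensor scale so values fit the FP8 e4m3 range (max representable ~448).
scale = x.abs().max() / 448.0
x_fp8 = (x / scale).to(torch.float8_e4m3fn).to(torch.float32) * scale
x_bf16 = x.to(torch.bfloat16).to(torch.float32)

print("FP8  max abs error:", (x - x_fp8).abs().max().item())
print("BF16 max abs error:", (x - x_bf16).abs().max().item())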
Feature 4: Resource control and optimization
Adjust the number of SMs used by DeepEP to fit your hardware:
- Check the hardware SM count: Use nvidia-smi to identify the GPU model and look up its SM count, or query it programmatically (see the sketch after this list).
- Set the SM limit: Specify it in your script, for example:
deep_ep_set_sm_limit(32); // illustrative call: limit DeepEP to 32 SMs
- Test performance: Run benchmarks after adjusting the SM count to find the best configuration.
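The SM count can also be read programmatically, and the upstream repository exposes a static Buffer.set_num_sms call for capping how many SMs the normal (high-throughput) kernels may use; that name is taken from the upstream README and should be treated as an assumption:

import torch
import deep_ep  # assumed Python package name

# Query the SM (streaming multiprocessor) count of the first visible GPU.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.multi_processor_count} SMs")

# Cap the SMs available to DeepEP's normal kernels; 32 mirrors the example
# above and is a tuning knob, not a recommendation.
deep_ep.Buffer.set_num_sms(32)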
Caveats
- Network configuration: DeepEP is primarily tested on InfiniBand networks; RoCE compatibility requires additional verification.
- Adaptive routing: Only the low-latency kernels support this feature; enabling it for the normal kernels may result in deadlocks.
- Cluster tuning: It is recommended to run all the test scripts in the tests/ directory to automatically tune the configuration for your cluster.
With these steps, you can get started with DeepEP quickly and take full advantage of its communication optimizations in your MoE models.