General Introduction
DeepEP is an open-source communication library developed by the deepseek-ai team, focused on improving the training and inference efficiency of Mixture-of-Experts (MoE) models under Expert Parallelism (EP). It provides high-throughput, low-latency communication for large-scale distributed systems by optimizing data exchange between GPUs. DeepEP supports NVLink and RDMA, is compatible with low-precision formats such as FP8, and provides kernels tuned separately for training and inference scenarios. The library has been battle-tested in the DeepSeek team's production environment, especially for MoE models that require cross-node collaboration, where it can significantly improve overall system performance, making it a useful tool for AI researchers and developers building efficient deep learning systems. DeepEP is open-sourced on GitHub, and the community is welcome to contribute to its improvement.
Feature List
- Efficient all-to-all communication: Optimizes all-to-all communication between GPUs, supporting intra-node NVLink and inter-node RDMA for fast, stable data exchange.
- High-throughput training support: Provides kernels designed for training and inference prefilling that handle large-scale data transfers and improve training efficiency.
- Low-latency inference kernels: For the inference decoding phase, pure RDMA is used to minimize latency, making these kernels suitable for real-time applications.
- FP8 low-precision support: Native support for FP8 dispatch reduces communication and compute costs while maintaining performance in resource-sensitive environments.
- Flexible resource control: Supports adjusting the number of streaming multiprocessors (SMs) used, so developers can tune the configuration to their hardware.
- Communication-computation overlap: A hook-based mechanism overlaps communication with computation to improve GPU utilization.
- Cross-domain bandwidth forwarding: Provides efficient data forwarding from the NVLink domain to the RDMA domain, aligned with the group-limited gating algorithm used by DeepSeek-V3.
Usage Guide
Installation
DeepEP is an open-source project hosted on GitHub; you need to download the source and configure the environment manually. The detailed installation steps are as follows:
1. Prerequisites
- Operating system: Linux (e.g., Ubuntu 20.04 or later) is recommended for compatibility with the GPU and network hardware.
- Hardware: An NVLink- or RDMA-capable GPU (e.g., NVIDIA H800) connected to a high-speed network (e.g., 400 Gb/s InfiniBand).
- Software dependencies:
- CUDA Toolkit (a version compatible with your hardware, e.g., CUDA 11.x or 12.x).
- NCCL (NVIDIA Collective Communications Library).
- A modified version of NVSHMEM (DeepEP relies on its communication capabilities; it must be installed separately).
- Python 3.8+ (for running the tests and scripts). A quick environment check is sketched after this list.
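Assuming PyTorch is already installed in the Python environment (DeepEP's test scripts are Python-based), the following minimal sketch sanity-checks the visible GPUs and the reported CUDA version before you start building; it uses only standard PyTorch calls and does not depend on DeepEP:

import torch

assert torch.cuda.is_available(), "No CUDA device visible"
print("CUDA version reported by PyTorch:", torch.version.cuda)
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.multi_processor_count} SMs, "
          f"{props.total_memory / 1e9:.1f} GB")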
2. Download the DeepEP source code
Open a terminal and run the following command to clone the repository:
git clone https://github.com/deepseek-ai/DeepEP.git
cd DeepEP
3. Install NVSHMEM
DeepEP relies on a modified version of NVSHMEM; please refer to the NVSHMEM installation guide provided with the repository. In brief:
- Download the NVSHMEM source code and apply the patch provided by DeepEP (located at third-party/nvshmem.patch in the DeepEP repository).
- Compile and install:
cd nvshmem
patch -p1 < ../third-party/nvshmem.patch
make -j && sudo make install
4. Compile DeepEP
Go to the DeepEP directory and compile the communication library:
make
After compilation, the kernel files are generated and can be called from your project.
5. Configure environment variables
To ensure that DeepEP operates correctly, NVSHMEM-related parameters such as the virtual-lane assignment need to be set:
export NVSHMEM_IB_SL=0  # set the InfiniBand service level (virtual lane) to avoid traffic conflicts
Additional configuration is available if you need to enable adaptive routing (low latency kernels only):
export NVSHMEM_ENABLE_ADAPTIVE_ROUTING=1
6. Test the installation
Run the provided test scripts to verify DeepEP functionality:
python tests/test_low_latency.py
If the output shows successful communication, the installation is complete.
Usage
DeepEP is used primarily by integrating it into an MoE model's training or inference workflow. The following is a how-to guide for the main features:
Feature 1: High-throughput training
DeepEP's high-throughput kernels are suited to distributed training of MoE models. Assuming you have a DeepSeek-V3-based model, follow the steps below:
- Prepare the model and data: Make sure your MoE model is configured with expert-parallel logic and the training dataset is ready.
- Call the DeepEP kernels: Introduce DeepEP's all-to-all communication interface in the training script. Illustrative pseudocode (a hedged Python sketch follows this list):
#include "deep_ep.h" void moe_train(float* input, float* output, int size) { deep_ep_all_to_all(input, output, size, FP8); }
- Configure the hardware: Specify the GPU devices to use, for example:
CUDA_VISIBLE_DEVICES=0,1,2,3 ./train_script
- Run training: Once training starts, DeepEP automatically optimizes communication over NVLink and RDMA.
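For reference, the upstream repository also exposes a Python interface. The sketch below shows one dispatch/combine round per MoE layer; the names deep_ep.Buffer, get_dispatch_layout, dispatch and combine follow the upstream README, but the exact signatures here are assumptions and may differ between versions, so treat this as pseudocode and check the repository before use:

import torch
import torch.distributed as dist
import deep_ep  # assumed Python package name

# The buffer is created once per rank, e.g.
#   buffer = deep_ep.Buffer(ep_group, num_nvl_bytes, num_rdma_bytes)
# where ep_group is the expert-parallel process group.

def run_local_experts(tokens):
    # Placeholder for your expert FFN computation on the received tokens.
    return tokens

def moe_forward(buffer, x, topk_idx, topk_weights, num_experts):
    # Work out how many tokens each rank and expert will receive for this batch.
    num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, \
        is_token_in_rank, _ = buffer.get_dispatch_layout(topk_idx, num_experts)
    # All-to-all dispatch: send each token to the ranks hosting its selected experts.
    recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert, handle, _ = \
        buffer.dispatch(x, topk_idx=topk_idx, topk_weights=topk_weights,
                        num_tokens_per_rank=num_tokens_per_rank,
                        num_tokens_per_rdma_rank=num_tokens_per_rdma_rank,
                        is_token_in_rank=is_token_in_rank,
                        num_tokens_per_expert=num_tokens_per_expert)
    expert_out = run_local_experts(recv_x)
    # All-to-all combine: return expert outputs to each token's home rank and reduce.
    combined_x, _, _ = buffer.combine(expert_out, handle)
    return combined_x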
Feature 2: Low-latency inference
The low-latency kernels are suited to real-time inference tasks such as online dialogue systems:
- Load the model: Load the pre-trained MoE model into memory.
- Call the inference kernel: Use the pure-RDMA communication interface. Illustrative pseudocode (a hedged Python sketch follows this list):
#include "deep_ep.h" void moe_infer(float* query, float* result, int batch_size) { deep_ep_low_latency_all_to_all(query, result, batch_size); }
- Test inference speed: Run the following command to measure latency:
python tests/test_inference.py --batch_size 128 --hidden_size 7168
The output shows the per-batch inference time, so you can check that real-time requirements are met.
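As with the training path, the upstream repository exposes Python entry points for the low-latency kernels. The sketch below assumes the method names low_latency_dispatch and low_latency_combine from the upstream README; the signatures are assumptions and may differ, so treat it as pseudocode:

import torch
import deep_ep  # assumed Python package name

def run_local_experts(tokens):
    # Placeholder for your expert FFN computation on the received tokens.
    return tokens

def moe_decode_step(buffer, hidden_states, topk_idx, topk_weights,
                    num_max_dispatch_tokens_per_rank, num_experts):
    # Pure-RDMA dispatch for the decoding phase; the returned hook lets the
    # receive overlap with computation (communication-computation overlap).
    recv_states, recv_count, handle, event, hook = buffer.low_latency_dispatch(
        hidden_states, topk_idx, num_max_dispatch_tokens_per_rank, num_experts)
    expert_out = run_local_experts(recv_states)
    # Send expert outputs back to each token's home rank, weighted by the router.
    combined, event, hook = buffer.low_latency_combine(
        expert_out, topk_idx, topk_weights, handle)
    return combined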
Feature 3: FP8 optimization
DeepEP supports FP8 dispatch to reduce communication and compute costs:
- Enable FP8 mode: Specify FP8 as the data type when calling the communication interface.
- Verify accuracy: Run the test script to compare the performance and accuracy of FP8 against BF16 (a standalone sketch of such a check follows this list):
python tests/test_fp8.py
- Apply to production: Integrate the FP8 configuration into your existing training or inference pipeline.
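The repository's FP8 test is not reproduced here; the following standalone sketch only illustrates the kind of comparison such a test performs, quantizing the same tensor to FP8 (e4m3) and to BF16 and measuring the round-trip error against the FP32 reference (requires PyTorch 2.1+ for torch.float8_e4m3fn):

import torch

x = torch.randn(4096, 7168, dtype=torch.float32)

# Per-tensor scale so values fit the FP8 e4m3 range (max representable ~448).
scale = x.abs().max() / 448.0
x_fp8 = (x / scale).to(torch.float8_e4m3fn).to(torch.float32) * scale
x_bf16 = x.to(torch.bfloat16).to(torch.float32)

print("FP8  max abs error:", (x - x_fp8).abs().max().item())
print("BF16 max abs error:", (x - x_bf16).abs().max().item())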
Feature 4: Resource control and optimization
Adjust the number of SMs used by DeepEP to fit your hardware:
- Check the hardware SM count: Use nvidia-smi to identify the GPU model and look up its SM count, or query it programmatically (see the sketch after this list).
- Set the SM limit: Specify it in your script, for example:
deep_ep_set_sm_limit(32); // illustrative call: limit DeepEP to 32 SMs
- Test performance: Run benchmarks after adjusting the SM count to find the best configuration.
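The SM count can also be read programmatically, and the upstream repository exposes a static Buffer.set_num_sms call for capping how many SMs the normal (high-throughput) kernels may use; that name is taken from the upstream README and should be treated as an assumption:

import torch
import deep_ep  # assumed Python package name

# Query the SM (streaming multiprocessor) count of the first visible GPU.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.multi_processor_count} SMs")

# Cap the SMs available to DeepEP's normal kernels; 32 mirrors the example
# above and is a tuning knob, not a recommendation.
deep_ep.Buffer.set_num_sms(32)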
Caveats
- Network configuration: DeepEP is primarily tested on InfiniBand networks; RoCE compatibility requires additional verification.
- Adaptive routing: Only the low-latency kernels support this feature; enabling it for the normal kernels may result in deadlocks.
- Cluster tuning: It is recommended to run all the test scripts in the tests/ directory to automatically tune the configuration for your cluster.
With these steps, you can get started with DeepEP quickly and take full advantage of its communication optimizations in your MoE models.