
Doubao-1.5-pro Released: A New Multimodal Base Model for Extreme Balance


🌟 Model profile

Doubao-1.5-pro is built on a highly sparse MoE (Mixture-of-Experts) architecture. The compute and memory-access characteristics differ significantly across the four quadrants formed by Prefill/Decode and Attention/FFN. For each quadrant we pair heterogeneous hardware with a tailored low-precision optimization strategy, significantly increasing throughput while keeping latency low, and reducing total cost while jointly optimizing TTFT (time to first token) and TPOT (time per output token), achieving an extreme balance between model performance and inference efficiency.

  • Small activated-parameter count: with only a fraction of the activated parameters, it exceeds the performance of much larger dense models.
  • Broad applicability: it performs strongly across a wide range of evaluation benchmarks.

📊 Performance Evaluation

[Figure: Doubao-1.5-pro results on multiple benchmarks]


Notes:

  • Metrics for the other models in the table are taken from their officially published results; unpublished numbers come from our internal evaluation platform.
  • GPT-4o-0806 results on public language-model evaluations are taken from simple-evals.

⚙️ Balancing Performance and Inference Efficiency

Efficient MoE Architecture

  • Uses a sparse MoE architecture to optimize training and inference efficiency jointly (a minimal routing sketch follows this list).
  • Research highlight: a sparsity scaling law is used to determine the optimal balance point between performance and efficiency.
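To make the "few activated parameters, many total parameters" idea concrete, here is a minimal top-k routing sketch in PyTorch. The layer sizes, expert count, and top_k below are illustrative assumptions, not Doubao-1.5-pro's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy top-k sparse MoE layer: only top_k of n_experts run per token."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: [tokens, d_model]
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique().tolist():
                mask = idx[:, k] == e                      # tokens routed to expert e
                out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out

y = SparseMoE()(torch.randn(8, 512))   # activated params are a small fraction of total
```

Because each token touches only top_k experts, the activated-parameter count stays small even as total capacity grows, which is the lever the sparsity scaling law tunes.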

[Figure: training loss vs. model performance comparison]


Notes:

  • The Doubao-MoE model outperforms a dense model (Doubao-Dense) with 7 times its number of activated parameters.
  • The Doubao-Dense model trains more efficiently than Llama 3.1-405B; data quality and hyperparameter optimization are the key factors.

🚀 High-Performance Inference

Optimizing for compute and memory-access characteristics

Doubao-1.5-pro's inference is optimized across the four quadrants formed by crossing Prefill/Decode with Attention/FFN, whose compute and memory-access characteristics differ markedly.

In the Prefill phase, communication and memory access are not the bottleneck, but compute saturates easily. Exploiting the one-way (causal) attention of LLMs, we run Chunk-PP (chunked pipeline-parallel) Prefill serving on devices with high compute-to-memory-bandwidth ratios, bringing the online system's Tensor Core utilization close to 60%.
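To illustrate why chunk scheduling matters under causal attention, here is a minimal sketch of a chunk cost model. The chunk size, head dimension, and constants are illustrative assumptions, not the production cost model:

```python
def chunk_attention_cost(prefix_len: int, chunk_len: int, d_head: int = 128) -> float:
    """With causal attention, a chunk of c tokens after p processed tokens
    attends to p + ~c/2 keys on average, so later chunks cost more."""
    avg_keys = prefix_len + (chunk_len + 1) / 2            # average keys per query
    return chunk_len * avg_keys * d_head                   # FLOPs up to a constant

def split_into_chunks(seq_len: int, chunk: int = 2048):
    """Split one sequence into (prefix_len, chunk_len) pieces for chunked prefill."""
    return [(p, min(chunk, seq_len - p)) for p in range(0, seq_len, chunk)]

# Costs grow with chunk position; a dynamic batcher can mix chunks from
# different queries so every card receives a similar total cost.
costs = [chunk_attention_cost(p, c) for p, c in split_into_chunks(10_000)]
```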

  • Prefill Attention: we extend the open-source FlashAttention with an 8-bit implementation using MMA/WGMMA instructions; combined with a per-N-tokens-per-sequence quantization strategy (sketched after this list), this phase runs losslessly on GPUs of different architectures. Meanwhile, by modeling the attention cost of chunks of different lengths and applying a dynamic cross-query batching strategy, we balance load across cards during Chunk-PP serving, effectively eliminating idle time caused by load imbalance.
  • Prefill FFN: W4A8 quantization effectively reduces the memory-access overhead of sparse MoE experts, and the cross-query batching strategy feeds more inputs into the FFN stage, raising MFU to 0.8.
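Below is a minimal sketch of what a per-N-tokens-per-sequence INT8 scheme can look like: one scale per group of N consecutive tokens within each sequence. The group size and clamping are assumptions for illustration, not the production kernel:

```python
import torch
import torch.nn.functional as F

def quantize_per_n_tokens(x: torch.Tensor, n: int = 128):
    """x: [seq_len, d] activations of one sequence -> (int8 values, per-group scales)."""
    seq_len, d = x.shape
    pad = (-seq_len) % n
    xp = F.pad(x, (0, 0, 0, pad))                     # pad seq_len to a multiple of n
    groups = xp.view(-1, n, d)                        # [n_groups, n, d]
    scale = (groups.abs().amax(dim=(1, 2)) / 127.0).clamp(min=1e-8)
    q = (groups / scale.view(-1, 1, 1)).round().clamp(-127, 127).to(torch.int8)
    return q.view(-1, d)[:seq_len], scale

def dequantize_per_n_tokens(q: torch.Tensor, scale: torch.Tensor, n: int = 128):
    seq_len, d = q.shape
    pad = (-seq_len) % n
    qp = F.pad(q, (0, 0, 0, pad)).view(-1, n, d).float()
    return (qp * scale.view(-1, 1, 1)).view(-1, d)[:seq_len]

x = torch.randn(300, 64)
q, s = quantize_per_n_tokens(x)
err = (dequantize_per_n_tokens(q, s) - x).abs().max()  # small reconstruction error
```

Scoping scales to short token groups rather than whole sequences keeps outlier tokens from inflating everyone else's quantization step, which is what makes the 8-bit path effectively lossless.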

In the Decode phase, compute is not the bottleneck, but communication and memory demands are relatively high. We serve this phase on devices with lower compute and memory bandwidth to obtain a higher ROI, and we apply a very low-cost sampling and speculative decoding strategy (sketched after the list below) to reduce TPOT.

  • Decode Attention: deployed with TP (tensor parallelism); heuristic search plus an aggressive long-sequence splitting strategy optimizes the common case where KV lengths differ widely across queries within a single batch. For accuracy, the same per-N-tokens-per-sequence quantization is retained, and the attention computation during random sampling is optimized so that the KV Cache is read only once.
  • Decode FFN: retains W4A8 quantization and is deployed with EP (expert parallelism).
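For reference, here is a minimal greedy draft-then-verify speculative decoding step. This is the generic textbook scheme (hypothetical `target`/`draft` callables, greedy acceptance), not necessarily Doubao-1.5-pro's exact sampler:

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ctx: torch.Tensor, k: int = 4) -> torch.Tensor:
    """ctx: [1, L] token ids; target/draft map ids -> logits [1, len, vocab]."""
    proposal = ctx
    for _ in range(k):                                     # k cheap draft steps
        nxt = draft(proposal)[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=-1)
    # One target forward over the whole proposal verifies all k drafts at once.
    t_tokens = target(proposal)[:, ctx.shape[1] - 1 : -1].argmax(-1)   # [1, k]
    drafted = proposal[:, ctx.shape[1]:]                                # [1, k]
    agree = (t_tokens == drafted).long().cumprod(-1)       # longest agreeing prefix
    n_accept = int(agree.sum())
    out = torch.cat([ctx, drafted[:, :n_accept]], dim=-1)
    if n_accept < k:                                       # append target's correction
        out = torch.cat([out, t_tokens[:, n_accept : n_accept + 1]], dim=-1)
    return out
```

Each accepted draft token replaces one full target decode step, so TPOT drops roughly in proportion to the acceptance rate.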

Overall, we implemented the following optimizations in the PD-disaggregated (separate Prefill/Decode) serving system:

  • A customized RPC backend for tensor transfer, with zero-copy and multi-stream parallelism optimizing transfer efficiency over TCP/RDMA networks, which in turn improves KV Cache transfer efficiency under PD disaggregation.
  • Flexible allocation and dynamic scaling of the Prefill and Decode clusters, with independent HPA autoscaling per role, ensuring neither side carries redundant compute and that the compute split between the two matches the actual online traffic pattern.
  • Asynchronous overlap of GPU computation with CPU pre- and post-processing in the framework: while the GPU executes step N, the CPU launches the kernels for step N+1 in advance (sketched after this list), keeping the GPU fully occupied and making the framework's overhead invisible to GPU inference. In addition, our self-developed server cluster solution flexibly supports low-cost chips, bringing hardware cost significantly below industry solutions; customized network cards and a self-developed network protocol substantially improve packet-communication efficiency; and at the operator level we overlap computation with communication, ensuring stable and efficient multi-machine distributed inference.
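As a rough illustration of the GPU/CPU asynchronization idea, the sketch below relies on CUDA kernel launches being asynchronous: while the GPU runs step N, the host preprocesses and enqueues step N+1. The `preprocess`/`postprocess` hooks are hypothetical placeholders, not the actual framework API:

```python
import torch

def pipelined_serve(model, requests, preprocess, postprocess):
    """While the GPU executes step N, the CPU prepares and enqueues step N+1."""
    in_flight = None                                   # (gpu_output, request) of step N
    for req in requests:
        batch = preprocess(req)                        # CPU work for the next step
        # Async launch returns immediately; use pinned host memory so the
        # non_blocking copy truly overlaps with GPU compute.
        out = model(batch.cuda(non_blocking=True))
        if in_flight is not None:
            prev_out, prev_req = in_flight
            postprocess(prev_out.cpu(), prev_req)      # CPU post-proc overlaps GPU compute
        in_flight = (out, req)
    if in_flight is not None:
        prev_out, prev_req = in_flight
        postprocess(prev_out.cpu(), prev_req)
```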

 

🎯 Data labeling: no shortcuts

  • We built an efficient data production system that combines a human labeling team with model self-improvement techniques, significantly raising data quality.

🖼️ Multimodal Capabilities

Visual Multimodality: Complex Scenes Handled with Ease

Dynamic resolution training: 60% throughput improvement

  • Solves the uneven-load problem of the visual encoder and significantly improves efficiency (a balancing sketch follows this list).
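One plausible way to balance visual-encoder load under dynamic resolution is to pack images by their visual-token count, since a ViT-style encoder's cost scales with the number of patches. The sketch below uses a largest-first greedy assignment; the patch size and device count are illustrative assumptions, not the production scheduler:

```python
import heapq

def balance_by_patches(images, n_devices: int = 8, patch: int = 14):
    """images: list of (image_id, height, width) -> per-device id lists."""
    cost = lambda h, w: (h // patch) * (w // patch)        # ViT-style visual-token count
    order = sorted(images, key=lambda im: -cost(im[1], im[2]))   # largest first
    heap = [(0, d) for d in range(n_devices)]              # (total_tokens, device)
    heapq.heapify(heap)
    batches = [[] for _ in range(n_devices)]
    for img_id, h, w in order:
        load, d = heapq.heappop(heap)                      # currently lightest device
        batches[d].append(img_id)
        heapq.heappush(heap, (load + cost(h, w), d))
    return batches
```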

✅ Summary

Doubao-1.5-pro finds the optimal balance between high performance and low inference cost, and makes breakthroughs in multimodal scenarios:

  • Innovative sparse architecture design.
  • High quality training data and optimization system.
  • Setting a new benchmark for multimodal technology.
