Summary
February 10, 2025: Support for DeepSeek-R1 and V3 on a single GPU (24GB VRAM) or multiple GPUs, with 382GB of RAM, delivering speedups of up to 3-28x.
Greetings, everyone, from the KTransformers team (formerly known for its CPU/GPU hybrid inference open-source project supporting DeepSeek-V2).
The KTransformers team has received many requests for DeepSeek-R1/V3 support and is very excited to announce that it has finally been delivered!
Sorry for the wait, but the KTransformers team has been cooking up something truly amazing!
Today, the KTransformers team is proud to announce support for DeepSeek-R1/V3, as shown in the video below:
https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285
- [UPDATED] Local 671B DeepSeek-Coder-V3/R1: runs the Q4_K_M version with only 14GB of VRAM and 382GB of RAM.
- Prefill speed (tokens/s):
  - KTransformers: 54.21 (32 cores) → 74.362 (dual-socket, 2×32 cores) → 255.26 (optimized AMX-based MoE kernel, V0.3 only) → 286.55 (selective use of 6 experts, V0.3 only)
  - Compared with llama.cpp's 10.31 tokens/s on 2×32 cores, that is up to a 27.79x speedup.
- Decode speed (tokens/s):
  - KTransformers: 8.73 (32 cores) → 11.26 (dual-socket, 2×32 cores) → 13.69 (selective use of 6 experts, V0.3 only)
  - Compared with llama.cpp's 4.51 tokens/s on 2×32 cores, that is up to a 3.03x speedup.
The KTransformers team also previewed upcoming optimizations, including an Intel AMX-accelerated kernel and a selective expert activation method, which will significantly improve performance. With the V0.3 preview, prefill reaches up to 286 tokens/s, up to 28x faster than local inference with llama.cpp!
The binary distribution is available now, and the source code will be released as soon as possible! The wheel packages are available for download.
Prerequisites
The KTransformers team ran its best-performing tests (V0.2) on the following configuration:
- CPU: Intel(R) Xeon(R) Gold 6454S (2 NUMA nodes)
- GPU: RTX 4090D with 24GB VRAM
- Memory: standard DDR5-4800 server memory, 1TB in total
Benchmarking results
V0.2
Setup
- Model: DeepSeek-V3, Q4_K_M (int4)
- CPU: Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: RTX 4090D with 24GB VRAM
- All tests were run after a full warm-up
Memory consumption:
- Single-socket: 382GB RAM, at least 14GB VRAM
- Dual-socket: 1TB RAM, at least 14GB VRAM
Benchmarking results
The "6 experts" scenario is part of the V0.3 preview.
| Prompt (500 tokens) | Dual-socket KTrans (6 experts) | Dual-socket KTrans (8 experts) | Single-socket KTrans (6 experts) | Single-socket KTrans (8 experts) | llama.cpp (8 experts) |
| --- | --- | --- | --- | --- | --- |
| Prefill token/s | 97.32 | 82.94 | 65.14 | 54.21 | 10.31 |
| Decode token/s | 13.69 | 12.208 | 10.303 | 8.73 | 4.51 |
Decoding speed increases by up to 3.03x and prefill speed by up to 9.44x. KTransformers' decoding acceleration is not as dramatic as its prefill acceleration, so there is still plenty of room for decoding optimization.
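As a quick sanity check, these speedup figures follow directly from the benchmark table above; the numbers below are taken from that table, and the script itself is only illustrative:

```python
# Quick check of the quoted speedups against the V0.2 benchmark table above.
ktrans_best = {"prefill": 97.32, "decode": 13.69}  # dual-socket KTrans, 6 experts
llama_cpp = {"prefill": 10.31, "decode": 4.51}     # llama.cpp, 8 experts, 2x32 cores

for phase in ("prefill", "decode"):
    print(f"{phase} speedup: {ktrans_best[phase] / llama_cpp[phase]:.2f}x")
# prefill speedup: 9.44x
# decode speedup: 3.04x (reported above, rounded down, as 3.03x)
```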
V0.3-Preview
Setup
- Model: DeepSeek-V3 BF16 (quantized online to int8 for the CPU and int4 for the GPU)
- CPU: Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: (1-4)x RTX 4090D with 24GB VRAM (longer prompts require more VRAM)
Memory consumption:
- 644GB RAM, at least 14GB VRAM
Benchmarking results
| Prompt length | 1K | 2K | 4K | 8K |
| --- | --- | --- | --- | --- |
| KTrans (8 experts) prefill token/s | 185.96 | 255.26 | 252.58 | 195.62 |
| KTrans (6 experts) prefill token/s | 203.70 | 286.55 | 271.08 | 207.20 |
KTrans V0.3 prefill is up to 3.45x faster than KTrans V0.2 and up to 27.79x faster than llama.cpp. This prefill speedup is truly impressive, and it is clear that KTransformers has put a lot of effort into prefill optimization.
Decoding speed is the same as in KTrans V0.2 (6-expert version), so it is omitted here; the V0.3 release evidently focuses mainly on prefill improvements.
The main acceleration comes from:
- The Intel AMX instruction set and a cache-friendly memory layout designed by the KTransformers team
- An expert selection strategy that activates fewer experts, based on offline profiling results on out-of-domain data
According to the KTransformers team's experiments with DeepSeek-V2, DeepSeek-V3, and DeepSeek-R1, slightly reducing the number of experts activated during inference does not change the output quality, while both decoding and prefill get faster, which is encouraging. The team's demo therefore makes use of this finding. The expert selection strategy appears to be the key to the speedup, but ensuring that output quality does not degrade will require more testing and verification.
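To make the idea concrete, here is a minimal sketch (plain PyTorch, not KTransformers code) of an MoE routing step where the number of activated experts is simply reduced from the usual 8 to 6. The softmax top-k router and the tensor sizes here are illustrative assumptions; DeepSeek-V3's actual router is more elaborate.

```python
import torch

def route_tokens(hidden, router_weight, num_experts_active=6):
    """Minimal MoE routing sketch: pick the top-k experts per token.

    Reducing num_experts_active from 8 to 6 trades a little routing
    fidelity for less expert computation per token, which is the idea
    behind the "6 experts" numbers above.
    """
    # hidden: [num_tokens, hidden_dim], router_weight: [num_experts, hidden_dim]
    logits = hidden @ router_weight.t()              # [num_tokens, num_experts]
    scores = torch.softmax(logits, dim=-1)
    topk_scores, topk_ids = torch.topk(scores, k=num_experts_active, dim=-1)
    # Renormalize so the selected experts' weights still sum to 1.
    topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_ids, topk_scores                     # which experts to run, and their weights

# Example: 4 tokens, 7168-dim hidden states, 256 routed experts (DeepSeek-V3-like sizes).
hidden = torch.randn(4, 7168)
router_weight = torch.randn(256, 7168)
ids, weights = route_tokens(hidden, router_weight, num_experts_active=6)
print(ids.shape, weights.shape)  # torch.Size([4, 6]) torch.Size([4, 6])
```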
How it works
V0.2 Demo
Single-socket version (32 cores)
The test command for the KTransformers team's local_chat is:
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path <model_path> --gguf_path <gguf_path> --prompt_file <prompt_file> --cpu_infer 33 --cache_lens 1536
<model_path> can be a local path or an online Hugging Face path such as deepseek-ai/DeepSeek-V3. If you run into connection problems, try using a mirror (hf-mirror.com).
<gguf_path> can also be an online path, but because the files are large, the KTransformers team recommends downloading the model and quantizing it into the format you want.
The numactl -N 1 -m 1 prefix is there to avoid data transfers between NUMA nodes.
Dual-socket version (64 cores)
Before installing (with install.sh or make dev_install), set the environment variable USE_NUMA=1 via export USE_NUMA=1. (If KTransformers is already installed, reinstall it with this environment variable set.)
The test command for the KTransformers team's local_chat is:
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
export USE_NUMA=1
make dev_install # or sh ./install.sh
python ./ktransformers/local_chat.py --model_path <model_path> --gguf_path <gguf_path> --prompt_file <prompt_file> --cpu_infer 65 --cache_lens 1536
The parameters have the same meaning as before; however, since two sockets are used, cpu_infer is set to 65.
V0.3 Demo
Dual-socket version (64 cores)
The test command for the KTransformers team's local_chat is:
wget https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl
pip install ./ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl
python -m ktransformers.local_chat --model_path <model_path> --gguf_path <gguf_path> --prompt_file <prompt_file> --cpu_infer 65 --cache_lens 1536
The parameters have the same meaning as in V0.2; however, since two sockets are used, cpu_infer is set to 65.
Some explanations
- The KTransformers team also wanted to make further use of the two NUMA nodes on the Xeon Gold CPUs. To avoid the cost of data transfers between nodes, the key matrices are "copied" onto both nodes, which consumes more memory but speeds up both prefill and decoding. However, this approach uses a lot of memory and is slow to load the weights, so please be patient while the model loads. The team plans to optimize this large memory overhead, so stay tuned. The weight copying clearly helps speed, but the memory footprint is a real problem, and it will be interesting to see how the team addresses it.
- The command parameter --cpu_infer 65 specifies how many cores to use (exceeding the number of physical cores is fine, but more is not always better; adjust it to slightly less than your actual number of cores).
- Why hybrid CPU/GPU inference? DeepSeek's MLA operators are computationally intensive. While it is possible to run everything on the CPU, offloading the heavy computation to the GPU brings a dramatic performance boost. With the CPU handling the expert computation and the GPU handling MLA/KVCache, this hybrid inference strategy makes good use of both processors (see the sketch after this list).
- Where does the speed boost come from?
  - Expert offload: unlike traditional layer- or KVCache-based offloading (as in llama.cpp), the KTransformers team offloads the expert computation to the CPU and MLA/KVCache to the GPU, which fits DeepSeek's architecture and yields optimal efficiency.
  - Intel AMX optimization: the team's AMX-accelerated kernel has been carefully tuned to run several times faster than existing llama.cpp implementations. The team plans to open-source this kernel after cleanup and is considering contributing the code upstream to llama.cpp. The AMX instruction set appears to be one of the key factors in KTransformers' speedup.
- Why Intel CPUs? Intel is currently the only CPU vendor that supports an AMX-like instruction set, which offers significantly better performance than AVX-only alternatives. For now, Intel CPUs are the way to go for the best KTransformers performance.
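To illustrate the expert-offload idea described above, here is a minimal PyTorch sketch, not KTransformers' actual implementation: attention (and, in a real server, its KV cache) runs on the GPU, while the routed expert MLPs stay in host memory and run on the CPU. The class name, layer sizes, and the simple softmax router are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Use the GPU if available; the point of the sketch is attention on the
# accelerator and expert MLPs in host memory on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

class HybridMoELayer(nn.Module):
    """Illustrative hybrid layer: attention on the GPU, routed experts on the CPU."""

    def __init__(self, hidden_dim=1024, num_experts=8, top_k=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True).to(device)
        self.router = nn.Linear(hidden_dim, num_experts).to(device)
        # Expert MLPs stay on the CPU (host memory), mirroring the expert-offload idea.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, 4 * hidden_dim), nn.GELU(),
                          nn.Linear(4 * hidden_dim, hidden_dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):
        # 1) Attention (and, in a real server, the KV cache) runs on the GPU.
        attn_out, _ = self.attn(x, x, x)
        # 2) Routing decision on the GPU; the heavy expert FFNs run on the CPU.
        scores = torch.softmax(self.router(attn_out), dim=-1)
        topk_scores, topk_ids = torch.topk(scores, self.top_k, dim=-1)
        x_cpu, out_cpu = attn_out.cpu(), torch.zeros_like(attn_out, device="cpu")
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (topk_ids[..., k] == e).cpu()
                if mask.any():
                    out_cpu[mask] += topk_scores[..., k].cpu()[mask, None] * expert(x_cpu[mask])
        # 3) Move the (small) expert outputs back to the GPU for the next layer.
        return attn_out + out_cpu.to(device)

layer = HybridMoELayer()
tokens = torch.randn(1, 16, 1024, device=device)
print(layer(tokens).shape)  # torch.Size([1, 16, 1024])
```

The appeal of this split is that the expert weights, which make up the bulk of a DeepSeek-style MoE model, never have to be copied to the GPU; only the comparatively small per-token activations cross between host and device.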