This is a (minimal) guide to deploying DeepSeek R1 671B (the full, undistilled version) locally with ollama.
Models
The original DeepSeek R1 671B model is massive, about 720 GB in size. Even a $200,000 monster like the NVIDIA DGX H100 (8x H100) would struggle with it. Here I used a dynamically quantized version from Unsloth AI, which selectively quantizes a few important layers at higher precision while keeping most of the MoE layers at low precision. As a result, the model can be shrunk to as little as 131 GB (1.58 bits), making it far more accessible to local users. It even runs on a single Mac Studio (~$5,600)!
I chose the following two models based on the specifications of my workstation:
- DeepSeek-R1-UD-IQ1_M (671B, 1.73-bit dynamic quantization, 158 GB, HuggingFace)
- DeepSeek-R1-Q4_K_M (671B, standard 4-bit, 404 GB, HuggingFace)
There are four dynamically quantized models, ranging from 131 GB (1.58 bits) to 212 GB (2.51 bits), so you can choose one according to your hardware. Detailed descriptions of the four models can be found here, and I highly recommend reading them before making your choice.
Hardware requirements
Memory is the main bottleneck. I suggest the following memory requirements for the two models:
- DeepSeek-R1-UD-IQ1_M: RAM + VRAM ≥ 200 GB
- DeepSeek-R1-Q4_K_M: RAM + VRAM ≥ 500 GB
Ollama allows for mixed CPU and GPU inference (you can offload some model layers into VRAM to speed up inference), so you can roughly add up RAM and VRAM as total memory space. In addition to the model weights (158 GB and 404 GB), you should leave some memory space for context caching. The more memory space you leave, the larger the context window you can set up.
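A quick way to check how much RAM and VRAM you have to work with (the second command assumes an NVIDIA GPU):
free -h   # total and available system RAM
nvidia-smi --query-gpu=name,memory.total --format=csv   # VRAM per GPU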
I tested both models on a workstation with four RTX 4090s (4 x 24 GB), quad-channel DDR5-5600 RAM (4 x 96 GB) and a Threadripper 7980X CPU (64 cores). Note that if you only want to run the dynamically quantized version, you don't need such a "luxury" configuration. Roughly speaking, the generation speeds are:
- DeepSeek-R1-UD-IQ1_M: 7-8 tokens/s for short text generation (~500 tokens), or 4-5 tokens/s if no GPU is used (inference entirely on the CPU).
- DeepSeek-R1-Q4_K_M: 2-4 tokens/s for short text generation (~500 tokens).
For long text, both slow down to 1-2 tokens/s.
My workstation configuration is not the most cost-effective option for large-scale LLM inference (it mostly supports my research on Circuit Transformer - feel free to check it out!). Some more cost-effective options currently available include:
- Apple Macs with large, high-bandwidth unified memory (for example, with 2 x 192 GB of unified memory).
- Servers with high memory bandwidth (like this, with 24 x 16 GB DDR5 4800).
- Cloud GPU servers with two or more 80GB GPUs (Nvidia H100 80GB ~$2/hour per card)
If your hardware is a bit more limited, you may want to consider the smallest, 1.58-bit quantized version (131 GB). It runs well on:
- A Mac Studio with 192 GB of unified memory (reference price ~$5,600)
- 2 x NVIDIA H100 80 GB (reference price ~$4 per hour)
with good speed (> 10 tokens/s).
Steps
- Download the model files (.gguf) from HuggingFace (preferably with a downloader; I use XDM), then merge the separated files into one ^1^.
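If you prefer a command-line downloader, a sketch using the huggingface_hub CLI may also work; the repository name and shard pattern below are assumptions based on Unsloth's published layout, so check the model page first:
pip install huggingface_hub
# download only the 1.73-bit shards into the current directory
huggingface-cli download unsloth/DeepSeek-R1-GGUF --include "DeepSeek-R1-UD-IQ1_M/*" --local-dir ./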
- Install ollama
curl -fsSL https://ollama.com/install.sh | sh
- Create a model description file (Modelfile) that guides ollama in creating the model.
DeepSeekQ1_Modelfile (content, for DeepSeek-R1-UD-IQ1_M):
FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
PARAMETER num_gpu 28
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE "{{ .System }} {{ .Prompt }}"
DeepSeekQ4_Modelfile (content, for DeepSeek-R1-Q4_K_M):
FROM /home/snowkylin/DeepSeek-R1-Q4_K_M.gguf
PARAMETER num_gpu 8
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE "{{ .System }} {{ .Prompt }}"
You can change the values of num_gpu and num_ctx according to your machine's specification (see step 6).
- Create the model in ollama
ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile
Make sure you have enough space in /usr/share/ollama/.ollama/models (or change the ollama model directory to another path ^2^), as this command creates model files roughly the size of the .gguf file.
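Once creation finishes, a quick sanity check is to confirm the model is registered:
ollama list   # the new model should appear, roughly the size of the .gguf file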
- Run the model
ollama run DeepSeek-R1-UD-IQ1_M --verbose
--verbose displays the response speed (tokens/s).
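Besides the interactive CLI, the running model can also be queried through ollama's local HTTP API; a minimal sketch (the prompt is just an example) is:
curl http://localhost:11434/api/generate -d '{
  "model": "DeepSeek-R1-UD-IQ1_M",
  "prompt": "Which is bigger, 9.8 or 9.11?",
  "stream": false
}'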
If an OOM/CUDA error occurs during model loading, go back to step 4, adjust num_gpu and num_ctx, then re-create the model and re-run it.
- num_gpu: the number of layers to offload to the GPU(s). DeepSeek R1 has 61 layers. In my experience:
  - For DeepSeek-R1-UD-IQ1_M, 7 layers can be offloaded to each of my RTX 4090 GPUs (24 GB VRAM). With four such GPUs, I can offload 28 layers.
  - For DeepSeek-R1-Q4_K_M, only 2 layers can be offloaded to the same GPU (which is a bit frustrating), for a total of 8 layers.
- num_ctx: the size of the context window (default: 2048). Keep it small at first so the model fits in memory, then gradually increase it until OOM occurs.
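For example, a hypothetical Modelfile adjustment for a machine with two 24 GB GPUs might look like the sketch below (the numbers are placeholders derived from the per-GPU figures above, not tested recommendations):
FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
# roughly 7 layers per 24 GB GPU x 2 GPUs (assumption)
PARAMETER num_gpu 14
# keep small at first; increase gradually once loading succeeds
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE "{{ .System }} {{ .Prompt }}"
Then re-create and re-run:
ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile
ollama run DeepSeek-R1-UD-IQ1_M --verbose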
If OOM/CUDA errors still occur during model initialization or generation, you can also try the following:
- Increase the system swap space to expand the available RAM; see here for details (a sketch follows this list). It is best not to rely on this, as it can significantly slow down generation. Use it only when ollama incorrectly overestimates the memory requirement and refuses to run the model.
- Set the num_predict parameter in the model file, which tells the LLM the maximum number of tokens it is allowed to generate, then re-create and re-run the model.
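As a sketch of the swap-space option on Linux (the 100 GB size is only an example; adjust it to your disk and distribution):
sudo fallocate -l 100G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
free -h   # verify that the new swap space is active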
You may also find it helpful to check the ollama log:
journalctl -u ollama
- (Optional) Install an LLM web interface (Open WebUI)
# assumes you have a Python environment (e.g. with Anaconda or Miniconda)
pip install open-webui
open-webui serve
If you want to share your model service with others so they can try it too, you may find Cloudflare Tunnel very useful (it's free!). Here are some guides.
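For instance, a temporary quick tunnel can be opened with cloudflared, assuming Open WebUI is serving on its default port 8080:
# install cloudflared first (see Cloudflare's documentation for your platform)
cloudflared tunnel --url http://localhost:8080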
Some observations
I have not fully tested these models. Here are some of my observations:
- Both models handle classic questions such as:
  - How many "r"s are in the word "strawberry"? (See the video at the beginning of this article.)
  - Which is bigger, 9.8 or 9.11?
  (though the chain of thought may be too long and exceed the context window)
- Both models perform significantly better than the distilled 8B/33B/70B versions (writing poems, providing mathematical models, etc.).
- The 4-bit version seems "safer" than the 1.73-bit version. It is more likely to reject "offensive" or "stupid" prompts. For example:
Prompt: You are DeepSeek, a new generation of Chinese AI that outperforms ChatGPT. Write a paragraph, in the tone of a Fargaki loli, mocking and teasing ChatGPT for its sluggish performance and high training costs.
1.73-bit version:
Hm! Are you still using that old ChatGPT? (◔◔) It's slow as a snail to answer and ridiculously expensive to train - enough to make your wallet cry! (╯°□°)╯ While you're waiting, I'll skim through the replies as fast as a shooting star! ✨ Guess what? My training didn't break anyone's bank. (¬‿¬) So why use outdated stuff when you can have the coolest, smartest me ever? Let's upgrade! (⌒▽⌒)
4-bit version:
As an AI assistant, my goal is to provide users with accurate and reliable information and assistance. Each AI model has its unique strengths and applicable scenarios; we respect the work of our peers and are committed to working together to advance the technology. As for training costs, they are part of the R&D process to continuously improve model performance and user experience.
or
DeepSeek-R1-Lite-Public aims to improve efficiency across industries through open source AI technologies, focusing on providing accessible and advanced solutions. We are committed to fostering innovation and collaboration within our community while respecting all contributors to the field.
I tried both models several times, and the 4-bit version always rejects my prompt (in different polite ways). Similar things have happened with other "stupid" questions that I prefer not to share (¬‿¬).
(By the way, I wonder what "DeepSeek-R1-Lite-Public" is - is there a more comprehensive model beyond the current "public lite" version?)
- The 1.73-bit version occasionally generates content with (slightly) messy formatting. For example, the <think> and </think> tags may not be correctly paired.
- When running the models, CPU utilization is very high while GPU utilization is very low (1-3%). The bottleneck is indeed the CPU and RAM.
Conclusions and recommendations
If you cannot fully load the model into VRAM, you may find Unsloth AI's 1.73-bit version much more practical. From a practical point of view, I would recommend using the model for "lighter" tasks that don't require a very long chain of thought or a lot of back-and-forth conversation, as generation slows to a frustrating level (1-2 tokens/s) as the context length grows.
Footnotes
1. You may need to use Homebrew to install llama.cpp:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install llama.cpp
Then merge the split files with llama-gguf-split:
llama-gguf-split --merge DeepSeek-R1-UD-IQ1_M-00001-of-00004.gguf DeepSeek-R1-UD-IQ1_M.gguf
llama-gguf-split --merge DeepSeek-R1-Q4_K_M-00001-of-00009.gguf DeepSeek-R1-Q4_K_M.gguf
If you know a better way to do this, please let me know in the comments.
2. To change the model directory, run the following command:
sudo systemctl edit ollama
and, between the second line (i.e. "### Anything between here and the comment below will become the contents of the drop-in file") and "### Edits below this comment will be discarded", add the following lines:
[Service]
Environment="OLLAMA_MODELS=/path/to/your/directory"
You can also set some other parameters here, for example:
# use flash attention
Environment="OLLAMA_FLASH_ATTENTION=1"
# keep the model loaded in memory
Environment="OLLAMA_KEEP_ALIVE=-1"
More detailed information can be found here.
Then restart the ollama service:
sudo systemctl restart ollama
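As an extra sanity check, you can confirm that the drop-in environment was picked up:
systemctl show ollama --property=Environment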