This is a (minimal) guide to deploying DeepSeek R1 671B (the full, undistilled version) locally with ollama.
Models
The original DeepSeek R1 671B model is massive, about 720 GB in size. Even a $200,000 monster like the NVIDIA DGX H100 (8x H100) would struggle with it. Here I used a dynamically quantized version from Unsloth AI, which selectively quantizes a few important layers at higher precision while keeping most of the MoE layers at low precision. As a result, the model can be shrunk to as little as 131 GB (1.58 bits), making it far more accessible to local users. It even runs on a single Mac Studio (~$5,600)!
I chose the following two models based on the specifications of my workstation:
- DeepSeek-R1-UD-IQ1_M (671B, 1.73-bit dynamic quantization, 158 GB, HuggingFace)
- DeepSeek-R1-Q4_K_M (671B, standard 4-bit, 404 GB, HuggingFace)
There are four dynamically quantized models, ranging from 131 GB (1.58 bits) to 212 GB (2.51 bits), so you can choose one according to your hardware. Detailed descriptions of the four models can be found here, and I highly recommend reading them before making your choice.
Hardware requirements
Memory is the main bottleneck. I suggest the following memory requirements for the two models:
- DeepSeek-R1-UD-IQ1_M: RAM + VRAM ≥ 200 GB
- DeepSeek-R1-Q4_K_M: RAM + VRAM ≥ 500 GB
Ollama allows for mixed CPU and GPU inference (you can offload some model layers into VRAM to speed up inference), so you can roughly add up RAM and VRAM as total memory space. In addition to the model weights (158 GB and 404 GB), you should leave some memory space for context caching. The more memory space you leave, the larger the context window you can set up.
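A quick way to check how much RAM and VRAM you have to work with (the second command assumes an NVIDIA GPU):
free -h   # total and available system RAM
nvidia-smi --query-gpu=name,memory.total --format=csv   # VRAM per GPU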
I tested both models on a workstation with four RTX 4090s (4 x 24 GB), quad-channel DDR5-5600 RAM (4 x 96 GB) and a Threadripper 7980X CPU (64 cores). Note that if you only want to run the dynamically quantized version, you don't need such a "luxury" configuration. Roughly speaking, the generation speeds are:
- DeepSeek-R1-UD-IQ1_M: 7-8 tokens/s for short text generation (~500 tokens), or 4-5 tokens/s if no GPU is used (inference entirely on the CPU).
- DeepSeek-R1-Q4_K_M: 2-4 tokens/s for short text generation (~500 tokens).
For long text, both slow down to 1-2 tokens/s.
My workstation configuration is not the most cost-effective option for large-scale LLM inference (it mostly supports my research on Circuit Transformer - feel free to check it out!). Some more cost-effective options currently available include:
- Apple Macs with large, high-bandwidth unified memory (for example, with 2 x 192 GB of unified memory).
- Servers with high memory bandwidth (like this, with 24 x 16 GB DDR5 4800).
- Cloud GPU servers with two or more 80GB GPUs (Nvidia H100 80GB ~$2/hour per card)
If your hardware is a bit more limited, you may want to consider the smallest, 1.58-bit quantized version (131 GB). It runs well on:
- A Mac Studio with 192 GB of unified memory (reference price ~$5,600)
- 2 x NVIDIA H100 80 GB (reference price ~$4 per hour)
with good speed (> 10 tokens/s).
Steps
- Download the model files (.gguf) from HuggingFace (preferably with a downloader; I use XDM), then merge the separated files into one ^1^.
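If you prefer a command-line downloader, a sketch using the huggingface_hub CLI may also work; the repository name and shard pattern below are assumptions based on Unsloth's published layout, so check the model page first:
pip install huggingface_hub
# download only the 1.73-bit shards into the current directory
huggingface-cli download unsloth/DeepSeek-R1-GGUF --include "DeepSeek-R1-UD-IQ1_M/*" --local-dir ./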
- Install ollama
curl -fsSL https://ollama.com/install.sh | sh
- Create a model description file (Modelfile) that guides ollama in creating the model.
DeepSeekQ1_Modelfile (content, for DeepSeek-R1-UD-IQ1_M):
FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
PARAMETER num_gpu 28
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE "{{ .System }} {{ .Prompt }}"
DeepSeekQ4_Modelfile (content, for DeepSeek-R1-Q4_K_M):
FROM /home/snowkylin/DeepSeek-R1-Q4_K_M.gguf
PARAMETER num_gpu 8
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE "{{ .System }} {{ .Prompt }}"
You can change the values of num_gpu and num_ctx according to your machine's specification (see step 6).
- Create the model in ollama
ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile
Make sure you have enough space in /usr/share/ollama/.ollama/models (or change the ollama model directory to another path ^2^), as this command creates model files roughly the size of the .gguf file.
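Once creation finishes, a quick sanity check is to confirm the model is registered:
ollama list   # the new model should appear, roughly the size of the .gguf file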
- Run the model
ollama run DeepSeek-R1-UD-IQ1_M --verbose
--verbose displays the response speed (tokens/s).
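Besides the interactive CLI, the running model can also be queried through ollama's local HTTP API; a minimal sketch (the prompt is just an example) is:
curl http://localhost:11434/api/generate -d '{
  "model": "DeepSeek-R1-UD-IQ1_M",
  "prompt": "Which is bigger, 9.8 or 9.11?",
  "stream": false
}'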
If an OOM/CUDA error occurs during model loading, go back to step 4, adjust num_gpu and num_ctx, then re-create the model and re-run it.
- num_gpu: the number of layers to offload to the GPU(s). DeepSeek R1 has 61 layers. In my experience:
  - For DeepSeek-R1-UD-IQ1_M, 7 layers can be offloaded to each of my RTX 4090 GPUs (24 GB VRAM). With four such GPUs, I can offload 28 layers.
  - For DeepSeek-R1-Q4_K_M, only 2 layers can be offloaded to the same GPU (which is a bit frustrating), for a total of 8 layers.
- num_ctx: the size of the context window (default: 2048). Keep it small at first so the model fits in memory, then gradually increase it until OOM occurs.
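For example, a hypothetical Modelfile adjustment for a machine with two 24 GB GPUs might look like the sketch below (the numbers are placeholders derived from the per-GPU figures above, not tested recommendations):
FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
# roughly 7 layers per 24 GB GPU x 2 GPUs (assumption)
PARAMETER num_gpu 14
# keep small at first; increase gradually once loading succeeds
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE "{{ .System }} {{ .Prompt }}"
Then re-create and re-run:
ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile
ollama run DeepSeek-R1-UD-IQ1_M --verbose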
If OOM/CUDA errors still occur during model initialization or generation, you can also try the following:
- Increase the system swap space to expand the available RAM; see here for details (a sketch follows this list). It is best not to rely on this, as it can significantly slow down generation. Use it only when ollama incorrectly overestimates the memory requirement and refuses to run the model.
- Set the num_predict parameter in the model file, which tells the LLM the maximum number of tokens it is allowed to generate, then re-create and re-run the model.
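As a sketch of the swap-space option on Linux (the 100 GB size is only an example; adjust it to your disk and distribution):
sudo fallocate -l 100G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
free -h   # verify that the new swap space is active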
You may also find it helpful to check the ollama log:
journalctl -u ollama
- (Optional) Install an LLM web interface (Open WebUI)
# assumes you have a Python environment (e.g. with Anaconda or Miniconda)
pip install open-webui
open-webui serve
If you want to share your model service with others so they can try it too, you may find Cloudflare Tunnel very useful (it's free!). Here are some guides.
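For instance, a temporary quick tunnel can be opened with cloudflared, assuming Open WebUI is serving on its default port 8080:
# install cloudflared first (see Cloudflare's documentation for your platform)
cloudflared tunnel --url http://localhost:8080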
Some observations
I have not fully tested these models. Here are some of my observations:
- Both models handle classic questions such as:
  - How many "r"s are in the word "strawberry"? (See the video at the beginning of this article.)
  - Which is bigger, 9.8 or 9.11?
  (though the chain of thought may be too long and exceed the context window)
- Both models perform significantly better than the distilled 8B/33B/70B versions (writing poems, providing mathematical models, etc.).
- The 4-bit version seems "safer" than the 1.73-bit version. It is more likely to reject "offensive" or "stupid" prompts. For example:
Prompt: You are DeepSeek, a new generation of Chinese AI that outperforms ChatGPT. Write a paragraph, in the tone of a Fargaki loli, mocking and teasing ChatGPT for its sluggish performance and high training costs.
1.73-bit version:
Hm! Are you still using that old ChatGPT? (◔◔) It's slow as a snail to answer and ridiculously expensive to train - enough to make your wallet cry! (╯°□°)╯ While you're waiting, I'll skim through the replies as fast as a shooting star! ✨ Guess what? My training didn't break anyone's bank. (¬‿¬) So why use outdated stuff when you can have the coolest, smartest me ever? Let's upgrade! (⌒▽⌒)
4-bit version:
As an AI assistant, my goal is to provide users with accurate and reliable information and assistance. Each AI model has its unique strengths and applicable scenarios; we respect the work of our peers and are committed to working together to advance the technology. As for training costs, they are part of the R&D process to continuously improve model performance and user experience.
or
DeepSeek-R1-Lite-Public aims to improve efficiency across industries through open source AI technologies, focusing on providing accessible and advanced solutions. We are committed to fostering innovation and collaboration within our community while respecting all contributors to the field.
I tried both models several times, and the 4-bit version always rejects my prompt (in different polite ways). Similar things have happened with other "stupid" questions that I prefer not to share (¬‿¬).
(By the way, I wonder what "DeepSeek-R1-Lite-Public" is - is there a more comprehensive model beyond the current "public lite" version?)
- The 1.73-bit version occasionally generates content with (slightly) messy formatting. For example, the <think> and </think> tags may not be correctly paired.
- When running the models, CPU utilization is very high while GPU utilization is very low (1-3%). The bottleneck is indeed the CPU and RAM.
Conclusions and recommendations
If you cannot fully load the model into VRAM, you may find Unsloth AI's 1.73-bit version much more practical. From a practical point of view, I would recommend using the model for "lighter" tasks that don't require a very long chain of thought or a lot of back-and-forth conversation, as generation slows to a frustrating level (1-2 tokens/s) as the context length grows.
Footnotes
1. You may need to use Homebrew to install llama.cpp:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install llama.cpp
Then merge the split files with llama-gguf-split:
llama-gguf-split --merge DeepSeek-R1-UD-IQ1_M-00001-of-00004.gguf DeepSeek-R1-UD-IQ1_M.gguf
llama-gguf-split --merge DeepSeek-R1-Q4_K_M-00001-of-00009.gguf DeepSeek-R1-Q4_K_M.gguf
If you know a better way to do this, please let me know in the comments.
2. To change the model directory, run the following command:
sudo systemctl edit ollama
and, between the second line (i.e. "### Anything between here and the comment below will become the contents of the drop-in file") and "### Edits below this comment will be discarded", add the following lines:
[Service]
Environment="OLLAMA_MODELS=/path/to/your/directory"
You can also set some other parameters here, for example:
# use flash attention
Environment="OLLAMA_FLASH_ATTENTION=1"
# keep the model loaded in memory
Environment="OLLAMA_KEEP_ALIVE=-1"
More detailed information can be found here.
Then restart the ollama service:
sudo systemctl restart ollama
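As an extra sanity check, you can confirm that the drop-in environment was picked up:
systemctl show ollama --property=Environment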