Running DeepSeek-R1 on an RTX 4090 graphics card. The recommended first choice is the full 671B model with Q4_K_M quantization, followed by a 14B or 32B quantized version. The full model relies on KTransformers; if that framework is too much trouble to learn, Unsloth's dynamic quantization is an alternative, covered in a separate guide: "DeepSeek R1 671B Local Deployment Tutorial: Based on Ollama and Dynamic Quantization". Which option to pick depends on whether you need "maximum capability" or "more speed".
1️⃣ Can an RTX 4090 really run the full 671B model?
Yes! The Tsinghua team's KTransformers framework lets a single RTX 4090 run the full model.
- VRAM requirement: with Q4_K_M quantization, only 14 GB of VRAM is needed on the GPU, so the 4090's 24 GB is plenty.
- Speed: prompt prefill runs at up to 286 tokens/s and generation at about 14 tokens/s, which is already faster than most people can read.
- Use cases: tasks that need complex reasoning, such as writing code or multi-turn conversations.
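To put those speeds in perspective, here is a quick back-of-envelope estimate of how long one request would take; the token counts are illustrative assumptions, not figures from this article:

```python
PREFILL_TOK_PER_S = 286  # prompt-processing speed reported above
DECODE_TOK_PER_S = 14    # generation speed reported above

def request_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Rough wall-clock time for one request: prefill time + decode time."""
    return prompt_tokens / PREFILL_TOK_PER_S + output_tokens / DECODE_TOK_PER_S

# A hypothetical 1000-token prompt with a 500-token answer:
total = request_seconds(1000, 500)  # ~3.5 s prefill + ~35.7 s decode, ~39 s total
```

So most of the wait comes from generation, not from reading your prompt, which is why the next section focuses on generation speed.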
2️⃣ Too slow? Try a smaller version
If you find 14 tokens/second too slow, you can choose a smaller model:
- 14B quantized version: needs only about 6.5 GB of VRAM and generates faster; good for everyday writing and translation.
- 32B quantized version: needs 14.9 GB of VRAM and supports long-text tasks (e.g., analyzing an entire paper).
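A rough rule of thumb makes these sizes plausible and shows why the 671B model cannot fit on the GPU alone. Assuming a Q4_K_M file averages roughly 4.5 bits per weight (an approximation; real GGUF files mix several quantization types, and actual VRAM use differs from file size), the total model size is about:

```python
def q4_size_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Rough quantized model size in GB: parameters x bits / 8, ignoring overhead."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(round(q4_size_gb(14), 1))   # 14B  -> ~7.9 GB
print(round(q4_size_gb(32), 1))   # 32B  -> ~18.0 GB
print(round(q4_size_gb(671), 1))  # 671B -> ~377.4 GB, far beyond 24 GB of VRAM
```

Even quantized, the full 671B model is hundreds of gigabytes, so most of it has to live in CPU RAM; only the GPU-resident slice needs to fit in 24 GB.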
3️⃣ Why can the full model run at all?
The technical trick here is quantization plus compute offloading:
- Quantization: "compresses" the model weights to a smaller size; 4-bit quantization (Q4), for example, cuts the memory footprint by about 70%.
- Compute offloading: hands the less important computation to the CPU and lets the GPU focus on what it does best.
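To illustrate the quantization half of the trick, here is a toy 4-bit symmetric quantizer. This is a teaching sketch, not the actual Q4_K_M algorithm, which uses per-block scales and offsets:

```python
def quantize_4bit(weights):
    """Map each float to one of 16 integer levels (-8..7) plus one shared scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # guard against all-zero input
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 0.07, -0.21, 0.44]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
# Each weight now takes 4 bits instead of 16 (fp16): a 75% reduction before
# scale overhead, which is why "about 70%" savings is quoted in practice.
max_err = max(abs(a - b) for a, b in zip(weights, restored))  # at most half a step
```

The reconstruction error stays within half a quantization step, small enough that model quality degrades only modestly, while the memory footprint shrinks by roughly the 70% mentioned above.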
4️⃣ How does this compare with other GPU options?
For running the same full model, the alternatives are:
- H100 GPU cluster: costs hundreds of thousands of dollars; faster, but out of reach for ordinary users.
- Domestic (Chinese-made) GPUs: compatibility is still lacking, so it is easy to hit pitfalls.
Verdict: the 4090 is currently the most cost-effective option.
5️⃣ Deployment Tips
- Use the KTransformers framework: it can be deployed with one click and ships with a ChatGPT-style interface.
- If you are running low on VRAM, try the "activate only 6 experts" mode, which also speeds things up a bit.
Bottom line: pick the quantized 671B for the smartest AI, pick 14B/32B for smooth conversation; the 4090 handles them all!