General Introduction
Chitu ("Red Rabbit") is an open-source inference framework developed by the PACMAN team at Tsinghua University, built specifically for serving large language models. Chitu supports a wide range of hardware, from NVIDIA GPUs to domestic Chinese chips, and scales from a single machine to large clusters. Its highlight is inference with FP8 models, which can dramatically cut costs: for example, it can run DeepSeek-671B on A800 GPUs with half the number of GPUs while delivering more than 3x the output speed of vLLM. The code is publicly available on GitHub and is free for businesses and individuals to download and use. It is an out-of-the-box tool for production environments, well suited to teams that want to save money without sacrificing performance.
Feature List
- Supports FP8 and BF16 model inference, running at low cost and high performance even on older GPUs and domestic chips.
- Adapts to a wide range of hardware, from pure CPU setups to NVIDIA GPUs such as the A800 and H20, up to large-scale clusters.
- Optimizes inference speed with CUDA Graph for faster output on a single request.
- Provides a service interface so that models can be invoked directly via HTTP requests.
- Supports multi-node distributed inference, suitable for high-volume workloads.
- Open-source code that companies can modify or optimize as needed.
Usage Guide
Installation Process
Installing Chitu is not complicated, but it requires some preparation. The detailed steps are below:
- Prepare the environment
  - System: Ubuntu 22.04 is recommended, on a machine with an NVIDIA GPU (e.g. A800 or H20).
  - Software: install Git, Python 3.10, CUDA 12.1 (adjust the version for your GPU), and PyTorch 2.1.
  - Example commands:

```bash
sudo apt update && sudo apt install -y git python3.10 python3-pip
pip install -U torch==2.1 --index-url https://download.pytorch.org/whl/cu121
```
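Before installing, it is worth verifying that the driver and CUDA toolchain are actually visible; a quick sanity check, a sketch only:

```bash
# Confirm the GPU and driver are visible
nvidia-smi
# Confirm the CUDA compiler version matches what you plan to build against (e.g. 12.1)
nvcc --version
# Confirm PyTorch sees the GPU and report which CUDA it was built with
python3 -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```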
- Download the code
  - Clone the Chitu repository locally with Git:

```bash
git clone --recursive https://github.com/thu-pacman/chitu
cd chitu
```
- Install dependencies
  - Install the required Python packages and the build toolchain:

```bash
pip install -r requirements-build.txt
pip install flash-attn
```
- Compile and install
  - Set the compilation parameters (adjust TORCH_CUDA_ARCH_LIST for your GPU, e.g. 8.0 for the A800) and compile:

```bash
TORCH_CUDA_ARCH_LIST=8.0 CHITU_SETUP_JOBS=4 MAX_JOBS=4 pip install --no-build-isolation .
```
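If you are unsure of your GPU's compute capability, you can query it through PyTorch rather than looking it up; a minimal sketch (the printed pair, e.g. (8, 0), maps to a TORCH_CUDA_ARCH_LIST value of 8.0):

```bash
# Print the GPU's compute capability, e.g. "(8, 0)" on an A800,
# which corresponds to TORCH_CUDA_ARCH_LIST=8.0
python3 -c "import torch; print(torch.cuda.get_device_capability())"
```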
- Check the installation
  - Run a quick test to make sure everything works:

```bash
torchrun --nproc_per_node 1 test/single_req_test.py
```
How to Use
Once Chitu is installed, you can start the service from the command line or run a one-off test. Here is how:
Starting the Inference Service
- Configure the model path
  - Place the model files (e.g. DeepSeek-R1) in a local directory such as `/data/DeepSeek-R1`, then point the `models.ckpt_dir` parameter of the launch command at that path.
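If you do not yet have the weights locally, one common way to fetch them is with the Hugging Face CLI; a sketch, assuming you use the official deepseek-ai/DeepSeek-R1 repository (the full checkpoint is very large, so check your disk space first):

```bash
# Hypothetical example: download the DeepSeek-R1 weights into the
# directory referenced by the commands below (adjust the path as needed)
pip install -U "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir /data/DeepSeek-R1
```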
- Start the service
  - Start a standalone service with torchrun, listening on port 21002:

```bash
export WORLD_SIZE=1
torchrun --nproc_per_node 1 chitu/serve.py \
    serve.port=21002 \
    models=DeepSeek-R1 \
    models.ckpt_dir=/data/DeepSeek-R1 \
    infer.use_cuda_graph=True \
    request.max_new_tokens=100
```
- Test the service
  - Send a request with curl and check that the model answers properly:

```bash
curl localhost:21002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello, what is Chitu?"}]}'
```

  - The result comes back as JSON and contains the model's answer.
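The raw JSON can be hard to read in a terminal, so it helps to pipe it through a formatter. A minimal sketch, assuming python3 is available (jq works equally well):

```bash
# Pretty-print the service's JSON response; json.tool ships with the Python standard library
curl -s localhost:21002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello, what is Chitu?"}]}' \
    | python3 -m json.tool
```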
Running a Single Inference Test
- If you don't want to start the service, you can test the model output directly:

```bash
torchrun --nproc_per_node 1 test/single_req_test.py \
    models=DeepSeek-R1 \
    models.ckpt_dir=/data/DeepSeek-R1 \
    request.max_new_tokens=64
```

- The output is printed in the terminal, showing what the model generated.
Multi-Node Distributed Inference
- Prepare multiple machines
  - Make sure Chitu and its dependencies are installed on every machine, and that the model files sit on shared storage.
- Start the distributed run
  - Run on 2 machines with 8 GPUs each (see the sketch at the end of this section for the extra rendezvous flags torchrun needs across machines):

```bash
torchrun --nnodes 2 --nproc_per_node 8 test/single_req_test.py \
    request.max_new_tokens=64 \
    infer.pp_size=2 \
    infer.tp_size=8 \
    models=DeepSeek-R1 \
    models.ckpt_dir=/data/DeepSeek-R1
```
- Check the effect
  - A multi-node run produces output faster than a single machine and is suited to handling high volumes of requests.
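When torchrun spans machines it needs a shared rendezvous point; by default it assumes localhost, which only works on a single node. A sketch of the extra flags, assuming node 0 is reachable at 192.0.2.10 (a placeholder address) with port 29500 open:

```bash
# On node 0 (the master); the address and port are placeholder values
torchrun --nnodes 2 --nproc_per_node 8 \
    --node_rank 0 --master_addr 192.0.2.10 --master_port 29500 \
    test/single_req_test.py \
    request.max_new_tokens=64 \
    infer.pp_size=2 \
    infer.tp_size=8 \
    models=DeepSeek-R1 \
    models.ckpt_dir=/data/DeepSeek-R1

# On node 1, run the same command with --node_rank 1
```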
Featured Functions
Saving Money and Gaining Speed with FP8 Models
- Chitu supports models in FP8 format, which use fewer GPUs and run faster than BF16.
- Operation: add `infer.soft_fp8=True` at startup and make sure the model weights are in FP8 format. For example:

```bash
torchrun --nproc_per_node 1 chitu/serve.py \
    serve.port=21002 \
    models=DeepSeek-R1 \
    models.ckpt_dir=/data/DeepSeek-R1 \
    infer.soft_fp8=True
```
Accelerating with CUDA Graph
- Single requests can be accelerated with CUDA Graph by adding the parameter `infer.use_cuda_graph=True`.
- To test the effect, run a single inference with and without the flag and compare the speed, as in the sketch below.
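One rough way to compare, a sketch only: time the single-request test with the flag on and off. Wall-clock time includes model loading, so treat the difference as a coarse signal; `infer.use_cuda_graph=False` is assumed here to be the way to disable the feature.

```bash
# Time one inference with CUDA Graph enabled...
time torchrun --nproc_per_node 1 test/single_req_test.py \
    models=DeepSeek-R1 models.ckpt_dir=/data/DeepSeek-R1 \
    request.max_new_tokens=64 infer.use_cuda_graph=True

# ...and one with it disabled (assumed flag value), then compare wall-clock times
time torchrun --nproc_per_node 1 test/single_req_test.py \
    models=DeepSeek-R1 models.ckpt_dir=/data/DeepSeek-R1 \
    request.max_new_tokens=64 infer.use_cuda_graph=False
```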
Performance Testing
- Chitu ships with a benchmarking tool to measure throughput and latency:

```bash
python benchmarks/benchmark_serving.py \
    --model "deepseek-r1" \
    --iterations 10 \
    --seq-len 10 \
    --base-url http://localhost:21002
```

- The results show how many tokens are processed per second, which helps you tune your deployment.
Caveats
- With multiple nodes, the network must be stable, or connections will drop.
- The GPU may report an OOM error when memory runs short; lower the `infer.max_seq_len` value or adjust the node count.
- Domestic chip support is still being optimized and may require code changes to adapt.

Chitu is not difficult to use; follow the steps above and you will be up and running. Its documentation and community are on GitHub, so you can open an issue if you have questions.