General Introduction
Chitu ("Red Rabbit") is an open-source inference framework developed by the PACMAN team at Tsinghua University, built specifically for serving large language models. Chitu supports a wide range of hardware, from NVIDIA GPUs to domestic Chinese chips, and scales from a single machine to large clusters. Its highlight is inference with FP8 models, which can dramatically cut costs: for example, it can run DeepSeek-671B on A800 GPUs with half the number of GPUs while delivering more than 3x the output speed of vLLM. The code is publicly available on GitHub and is free for businesses and individuals to download and use. It is an out-of-the-box tool for production environments, well suited to teams that want to save money without sacrificing performance.
Feature List
- Supports FP8 and BF16 model inference, running at low cost and high performance even on older GPUs and domestic chips.
- Adapts to a wide range of hardware, from pure CPU setups to NVIDIA GPUs such as the A800 and H20, up to large-scale clusters.
- Optimizes inference speed with CUDA Graph for faster output on a single request.
- Provides a service interface so that models can be invoked directly via HTTP requests.
- Supports multi-node distributed inference, suitable for high-volume workloads.
- Open-source code that companies can modify or optimize as needed.
Usage Guide
Installation Process
Installing Chitu is not complicated, but it requires some preparation. The detailed steps are below:
- Prepare the environment
  - System: Ubuntu 22.04 is recommended, on a machine with an NVIDIA GPU (e.g. A800 or H20).
  - Software: install Git, Python 3.10, CUDA 12.1 (adjust the version for your GPU), and PyTorch 2.1.
  - Example commands:

```bash
sudo apt update && sudo apt install -y git python3.10 python3-pip
pip install -U torch==2.1 --index-url https://download.pytorch.org/whl/cu121
```
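Before installing, it is worth verifying that the driver and CUDA toolchain are actually visible; a quick sanity check, a sketch only:

```bash
# Confirm the GPU and driver are visible
nvidia-smi
# Confirm the CUDA compiler version matches what you plan to build against (e.g. 12.1)
nvcc --version
# Confirm PyTorch sees the GPU and report which CUDA it was built with
python3 -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```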
- Download the code
  - Clone the Chitu repository locally with Git:

```bash
git clone --recursive https://github.com/thu-pacman/chitu
cd chitu
```
- Install dependencies
  - Install the required Python packages and the build toolchain:

```bash
pip install -r requirements-build.txt
pip install flash-attn
```
- Compile and install
  - Set the compilation parameters (adjust TORCH_CUDA_ARCH_LIST for your GPU, e.g. 8.0 for the A800) and compile:

```bash
TORCH_CUDA_ARCH_LIST=8.0 CHITU_SETUP_JOBS=4 MAX_JOBS=4 pip install --no-build-isolation .
```
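If you are unsure of your GPU's compute capability, you can query it through PyTorch rather than looking it up; a minimal sketch (the printed pair, e.g. (8, 0), maps to a TORCH_CUDA_ARCH_LIST value of 8.0):

```bash
# Print the GPU's compute capability, e.g. "(8, 0)" on an A800,
# which corresponds to TORCH_CUDA_ARCH_LIST=8.0
python3 -c "import torch; print(torch.cuda.get_device_capability())"
```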
- Check the installation
  - Run a quick test to make sure everything works:

```bash
torchrun --nproc_per_node 1 test/single_req_test.py
```
How to Use
Once Chitu is installed, you can start the service from the command line or run a one-off test. Here is how:
Starting the Inference Service
- Configure the model path
  - Place the model files (e.g. DeepSeek-R1) in a local directory such as `/data/DeepSeek-R1`, then point the `models.ckpt_dir` parameter of the launch command at that path.
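If you do not yet have the weights locally, one common way to fetch them is with the Hugging Face CLI; a sketch, assuming you use the official deepseek-ai/DeepSeek-R1 repository (the full checkpoint is very large, so check your disk space first):

```bash
# Hypothetical example: download the DeepSeek-R1 weights into the
# directory referenced by the commands below (adjust the path as needed)
pip install -U "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir /data/DeepSeek-R1
```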
- Start the service
  - Start a standalone service with torchrun, listening on port 21002:

```bash
export WORLD_SIZE=1
torchrun --nproc_per_node 1 chitu/serve.py \
    serve.port=21002 \
    models=DeepSeek-R1 \
    models.ckpt_dir=/data/DeepSeek-R1 \
    infer.use_cuda_graph=True \
    request.max_new_tokens=100
```
- Test the service
  - Send a request with curl and check that the model answers properly:

```bash
curl localhost:21002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello, what is Chitu?"}]}'
```

  - The result comes back as JSON and contains the model's answer.
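The raw JSON can be hard to read in a terminal, so it helps to pipe it through a formatter. A minimal sketch, assuming python3 is available (jq works equally well):

```bash
# Pretty-print the service's JSON response; json.tool ships with the Python standard library
curl -s localhost:21002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello, what is Chitu?"}]}' \
    | python3 -m json.tool
```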
Running a Single Inference Test
- If you don't want to start the service, you can test the model output directly:

```bash
torchrun --nproc_per_node 1 test/single_req_test.py \
    models=DeepSeek-R1 \
    models.ckpt_dir=/data/DeepSeek-R1 \
    request.max_new_tokens=64
```

- The output is printed in the terminal, showing what the model generated.
Multi-Node Distributed Inference
- Prepare multiple machines
  - Make sure Chitu and its dependencies are installed on every machine, and that the model files sit on shared storage.
- Start the distributed run
  - Run on 2 machines with 8 GPUs each (see the sketch at the end of this section for the extra rendezvous flags torchrun needs across machines):

```bash
torchrun --nnodes 2 --nproc_per_node 8 test/single_req_test.py \
    request.max_new_tokens=64 \
    infer.pp_size=2 \
    infer.tp_size=8 \
    models=DeepSeek-R1 \
    models.ckpt_dir=/data/DeepSeek-R1
```
- Check the effect
  - A multi-node run produces output faster than a single machine and is suited to handling high volumes of requests.
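When torchrun spans machines it needs a shared rendezvous point; by default it assumes localhost, which only works on a single node. A sketch of the extra flags, assuming node 0 is reachable at 192.0.2.10 (a placeholder address) with port 29500 open:

```bash
# On node 0 (the master); the address and port are placeholder values
torchrun --nnodes 2 --nproc_per_node 8 \
    --node_rank 0 --master_addr 192.0.2.10 --master_port 29500 \
    test/single_req_test.py \
    request.max_new_tokens=64 \
    infer.pp_size=2 \
    infer.tp_size=8 \
    models=DeepSeek-R1 \
    models.ckpt_dir=/data/DeepSeek-R1

# On node 1, run the same command with --node_rank 1
```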
Featured Functions
Saving Money and Gaining Speed with FP8 Models
- Chitu supports models in FP8 format, which use fewer GPUs and run faster than BF16.
- Operation: add `infer.soft_fp8=True` at startup and make sure the model weights are in FP8 format. For example:

```bash
torchrun --nproc_per_node 1 chitu/serve.py \
    serve.port=21002 \
    models=DeepSeek-R1 \
    models.ckpt_dir=/data/DeepSeek-R1 \
    infer.soft_fp8=True
```
Accelerating with CUDA Graph
- Single requests can be accelerated with CUDA Graph by adding the parameter `infer.use_cuda_graph=True`.
- To test the effect, run a single inference with and without the flag and compare the speed, as in the sketch below.
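One rough way to compare, a sketch only: time the single-request test with the flag on and off. Wall-clock time includes model loading, so treat the difference as a coarse signal; `infer.use_cuda_graph=False` is assumed here to be the way to disable the feature.

```bash
# Time one inference with CUDA Graph enabled...
time torchrun --nproc_per_node 1 test/single_req_test.py \
    models=DeepSeek-R1 models.ckpt_dir=/data/DeepSeek-R1 \
    request.max_new_tokens=64 infer.use_cuda_graph=True

# ...and one with it disabled (assumed flag value), then compare wall-clock times
time torchrun --nproc_per_node 1 test/single_req_test.py \
    models=DeepSeek-R1 models.ckpt_dir=/data/DeepSeek-R1 \
    request.max_new_tokens=64 infer.use_cuda_graph=False
```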
Performance Testing
- Chitu ships with a benchmarking tool to measure throughput and latency:

```bash
python benchmarks/benchmark_serving.py \
    --model "deepseek-r1" \
    --iterations 10 \
    --seq-len 10 \
    --base-url http://localhost:21002
```

- The results show how many tokens are processed per second, which helps you tune your deployment.
Caveats
- With multiple nodes, the network must be stable, or connections will drop.
- The GPU may report an OOM error when memory runs short; lower the `infer.max_seq_len` value or adjust the node count.
- Domestic chip support is still being optimized and may require code changes to adapt.

Chitu is not difficult to use; follow the steps above and you will be up and running. Its documentation and community are on GitHub, so you can open an issue if you have questions.