
LitServe: Rapidly Deploy Enterprise-Grade Inference Services for General-Purpose AI Models

General Introduction

LitServe is an open-source AI model serving engine from Lightning AI. Built on FastAPI, it focuses on rapidly deploying inference services for general-purpose AI models, from large language models (LLMs), vision models, and audio models to classical machine learning models. It provides batching, streaming, and GPU auto-scaling, and delivers at least a 2x performance improvement over plain FastAPI. LitServe is easy to use and highly flexible: it can be self-hosted or fully managed through Lightning Studios, which makes it well suited for researchers, developers, and enterprises that need to build efficient model inference APIs quickly. The project emphasizes enterprise-grade features such as security, scalability, and high availability, so production deployments are ready out of the box.



 

Feature List

  • Rapid inference deployment: quickly turns models from frameworks such as PyTorch, JAX, and TensorFlow into APIs.
  • Batching: merges multiple inference requests into a single batch to improve throughput.
  • Streaming: streams inference results in real time, suitable for continuous-response scenarios.
  • GPU auto-scaling: dynamically adjusts GPU resources to the inference load to optimize performance.
  • Compound AI systems: lets multiple models infer collaboratively to build complex services.
  • Self-hosting and cloud hosting: supports local deployment or managed deployment through Lightning Studios.
  • vLLM integration: optimizes inference performance for large language models.
  • OpenAPI compatibility: automatically generates standard API documentation for easy testing and integration.
  • Full model support: covers the inference needs of LLMs, vision, audio, embedding, and other models.
  • Server optimization: multi-process request handling with inference more than 2x faster than plain FastAPI.

 

Usage Guide

Installation process

LitServe installs easily with Python's pip. The detailed steps are below:

1. Preparing the environment

Ensure that Python 3.8 or later is installed on your system; a virtual environment is recommended:

python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows

2. Installation of LitServe

Run the following command to install the stable version:

pip install litserve

If you need the latest features, you can install the development version:

pip install git+https://github.com/Lightning-AI/litserve.git@main

3. Verify the installation

Verify that it was successful:

python -c "import litserve; print(litserve.__version__)"

If the version number prints, the installation is complete.

4. Optional dependencies

If you need GPU support, install the GPU version of the corresponding framework, for example:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
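
If you are unsure whether the GPU build was picked up, a quick check like the following (assuming PyTorch is your framework) confirms that CUDA is visible:

# Quick check (assumes PyTorch): confirm that the CUDA build is active
import torch

print(torch.__version__)          # e.g. 2.x.x+cu121
print(torch.cuda.is_available())  # True if a CUDA GPU is visible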

How to use LitServe

LitServe turns AI models into inference services through clean code. Here's how it works in detail:

1. Create a simple inference service

The following example builds a composite inference service from two models:

import litserve as ls

class SimpleLitAPI(ls.LitAPI):
    def setup(self, device):
        # Initialization: load models or data
        self.model1 = lambda x: x ** 2  # squaring model
        self.model2 = lambda x: x ** 3  # cubing model

    def decode_request(self, request):
        # Parse the request payload
        return request["input"]

    def predict(self, x):
        # Composite inference across both models
        squared = self.model1(x)
        cubed = self.model2(x)
        return squared + cubed

    def encode_response(self, output):
        # Format the inference result
        return {"output": output}

if __name__ == "__main__":
    server = ls.LitServer(SimpleLitAPI(), accelerator="auto")
    server.run(port=8000)
  • Run: save the file as server.py and start it with python server.py.
  • Test: send an inference request with curl (a Python client sketch follows the expected output):
    curl -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"input": 4.0}'
    

    Expected output: {"output": 80.0} (16 + 64).
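
If you prefer Python to curl, a minimal client sketch using the third-party requests library (an assumption; any HTTP client works) looks like this:

# Minimal client sketch using the requests library
import requests

resp = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"input": 4.0},
    timeout=30,
)
print(resp.json())  # expected: {'output': 80.0}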

2. Enable batched inference

Modify the code to support batch processing:

server = ls.LitServer(SimpleLitAPI(), max_batch_size=4, accelerator="auto")
  • How it works: max_batch_size=4 means up to 4 inference requests are merged and processed together, which improves efficiency.
  • Test method: send the request several times and observe the improved throughput; a sketch of custom batching hooks follows the test command:
    curl -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"input": 5.0}'
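
By default, LitServe passes the batched inputs to predict as a list. If you want to control how requests are combined, current LitServe documentation describes optional batch and unbatch hooks on LitAPI and a batch_timeout argument on LitServer; the sketch below assumes those are available in your installed version:

# Sketch (assumes LitServe's optional batch/unbatch hooks and batch_timeout)
import numpy as np
import litserve as ls

class BatchedLitAPI(ls.LitAPI):
    def setup(self, device):
        # Same toy composite model as above: x**2 + x**3
        self.model = lambda x: x ** 2 + x ** 3

    def decode_request(self, request):
        return request["input"]

    def batch(self, inputs):
        # Stack the decoded inputs from several requests into one array
        return np.asarray(inputs)

    def predict(self, x):
        # One vectorized call handles the whole batch
        return self.model(x)

    def unbatch(self, output):
        # Split the batched result back into one output per request
        return [float(o) for o in output]

    def encode_response(self, output):
        return {"output": output}

if __name__ == "__main__":
    server = ls.LitServer(BatchedLitAPI(), max_batch_size=4, batch_timeout=0.05)
    server.run(port=8000)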
    

3. Configure streaming inference

For real-time inference scenarios:

import litserve as ls

class StreamLitAPI(ls.LitAPI):
    def setup(self, device):
        # Toy model that produces five results per input
        self.model = lambda x: [x * i for i in range(5)]

    def decode_request(self, request):
        return request["input"]

    def predict(self, x):
        # Yield results one at a time for streaming
        for result in self.model(x):
            yield result

    def encode_response(self, outputs):
        # Encode each streamed item as it is produced
        for output in outputs:
            yield {"output": output}

server = ls.LitServer(StreamLitAPI(), stream=True, accelerator="auto")
server.run(port=8000)
  • How it works: stream=True enables streaming; predict uses yield to return results one by one, and encode_response yields each encoded chunk.
  • Test method: use a client that supports streaming responses, such as curl with --no-buffer (a Python client sketch follows the command):
    curl --no-buffer -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"input": 2}'
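
A rough Python equivalent using the requests library is sketched below; the exact chunk framing depends on the server version, so this simply prints raw chunks as they arrive:

# Streaming client sketch using requests (prints raw chunks as received)
import requests

with requests.post(
    "http://127.0.0.1:8000/predict",
    json={"input": 2},
    stream=True,
) as resp:
    for chunk in resp.iter_lines():
        if chunk:
            print(chunk.decode())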
    

4. GPU auto-scaling

If a GPU is available, LitServe automatically optimizes inference:

  • How it works: accelerator="auto" detects GPUs and uses them when available.
  • Verification: check the logs after startup to confirm that the GPU is being used.
  • Environment requirements: make sure the GPU build of your framework (e.g. PyTorch) is installed; an explicit-configuration sketch follows this list.
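
If you prefer to pin the hardware configuration instead of relying on auto-detection, LitServer also accepts explicit settings. The devices and workers_per_device arguments below are taken from current LitServe documentation and may differ across versions; the sketch reuses SimpleLitAPI from section 1:

# Sketch: pin the server to GPUs explicitly (argument names assumed from
# current LitServe docs; verify against your installed version)
import litserve as ls

# SimpleLitAPI is assumed to be defined in the same file (see section 1)
server = ls.LitServer(
    SimpleLitAPI(),
    accelerator="gpu",     # force GPU instead of auto-detection
    devices=2,             # number of GPUs to use
    workers_per_device=1,  # inference workers per GPU
)
server.run(port=8000)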

5. Deploy a more complex model (BERT example)

Deploy Hugging Face's BERT model inference service:

from transformers import BertTokenizer, BertModel
import litserve as ls

class BertLitAPI(ls.LitAPI):
    def setup(self, device):
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertModel.from_pretrained("bert-base-uncased").to(device)

    def decode_request(self, request):
        return request["text"]

    def predict(self, text):
        inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
        outputs = self.model(**inputs)
        # Mean-pool the last hidden state into a sentence embedding
        return outputs.last_hidden_state.mean(dim=1).tolist()

    def encode_response(self, output):
        return {"embedding": output}

server = ls.LitServer(BertLitAPI(), accelerator="auto")
server.run(port=8000)
  • Run: after starting the script, the service is available at http://127.0.0.1:8000/predict.
  • Test (a Python usage sketch follows the command):
    curl -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"text": "Hello, world!"}'
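
The returned embedding can be consumed directly from Python. The sketch below (using the third-party requests and numpy libraries, an assumption) requests embeddings for two sentences and compares them with cosine similarity:

# Usage sketch: fetch embeddings from the service and compare them
import numpy as np
import requests

def embed(text):
    resp = requests.post(
        "http://127.0.0.1:8000/predict",
        json={"text": text},
        timeout=60,
    )
    # The service returns a nested list of shape (1, 768); flatten it
    return np.asarray(resp.json()["embedding"]).ravel()

a = embed("Hello, world!")
b = embed("Hi there, world!")
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cosine:.3f}")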
    

6. Integrate vLLM to deploy LLM inference

Efficient inference for large language models:

import litserve as ls
from vllm import LLM, SamplingParams

class LLMLitAPI(ls.LitAPI):
    def setup(self, device):
        self.model = LLM(model="meta-llama/Llama-3.2-1B", dtype="float16")
        self.sampling_params = SamplingParams(max_tokens=50)

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        outputs = self.model.generate(prompt, self.sampling_params)
        return outputs[0].outputs[0].text

    def encode_response(self, output):
        return {"response": output}

server = ls.LitServer(LLMLitAPI(), accelerator="auto")
server.run(port=8000)
  • Install vLLM: pip install vllm.
  • Test (a standalone vLLM sketch follows the command):
    curl -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"prompt": "What is AI?"}'
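
For reference, the underlying vLLM call used in predict can be exercised on its own, outside LitServe. This standalone sketch assumes the vLLM offline API (LLM.generate with SamplingParams), which may change between vLLM releases:

# Standalone sketch of the vLLM call used above (run separately from the server)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B", dtype="float16")
params = SamplingParams(max_tokens=50, temperature=0.7)
outputs = llm.generate(["What is AI?"], params)
print(outputs[0].outputs[0].text)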
    

7. View the API documentation

  • How it works: open http://127.0.0.1:8000/docs to test the inference service interactively.
  • Note: the documentation follows the OpenAPI standard and lists all endpoint details.

8. Hosting options

  • Self-hosted: run the code on your own machine or server.
  • Cloud hosting: deploy through Lightning Studios; this requires an account and provides load balancing, auto-scaling, and more.

Operating Tips

  • Timeouts: set timeout=60 during testing to avoid inference timeouts (see the sketch after this list).
  • Logging: check the terminal logs at startup to troubleshoot problems.
  • Optimization: refer to the official documentation for advanced features such as authentication and Docker deployment.
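
A minimal sketch of the timeout setting, reusing SimpleLitAPI from section 1; the timeout parameter name is taken from current LitServe documentation and may differ across versions:

# Sketch: raise the per-request timeout to 60 seconds
import litserve as ls

# SimpleLitAPI is assumed to be defined in the same file (see section 1)
server = ls.LitServer(SimpleLitAPI(), accelerator="auto", timeout=60)
server.run(port=8000)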

With rapid deployment and optimized inference serving, LitServe covers the full range of needs from prototyping to enterprise-grade applications.
