In this article, we'll briefly explain how to use the Ollama API in Python. Whether you want to have a simple chat conversation, work with large amounts of data using streaming responses, or create, copy, and delete models locally, this article can guide you. In addition, we show how to use custom clients and asynchronous programming to optimize your application's performance. Whether you're new to Ollama or an experienced developer, this article can help you use the Ollama API more efficiently in Python.
This tutorial also provides a Jupyter Notebook example for hands-on practice.
Environment Preparation
Before you start using Python to interact with the Ollama API, make sure your development environment meets the following conditions:
- Python: Install Python 3.8 or later.
- pip: Make sure you have pip, the Python package management tool, installed.
- ollama Library: Used to make it easier to interact with the Ollama API. The installation command is as follows:
pip install ollama
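After installation, a quick sanity check confirms that both the library and a locally running Ollama server are reachable (a minimal sketch; it assumes the Ollama server is already running on the default port):
import ollama

# Lists the locally available models; raises an error if the server is not reachable
print(ollama.list())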
Usage
from ollama import chat
from ollama import ChatResponse
response: ChatResponse = chat(model='llama3.1', messages=[
    {
        'role': 'user',
        'content': '为什么天空是蓝色的?',
    },
])
# The response can be read either with dict-style access or via attributes
print(response['message']['content'])
print(response.message.content)
Streaming Response
Response streaming can be enabled by setting stream=True, which makes the function call return a Python generator where each part is an object in the stream.
from ollama import chat
stream = chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': '为什么天空是蓝色的?'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
Structured Output
- Normal Output (Unstructured Output)
  - Generates natural language text directly.
  - Suitable for human reading, but not easy for programs to parse or automate.
  - Example:
    这是一只黑色的小猫,它正在草地上玩耍。
- Structured Output
  - Returns data in JSON, YAML, XML, or another format, making it easier for machines to parse and use.
  - Ideal for APIs, automated workflows, and data storage.
  - Example:
    { "description": "这是一只黑色的小猫", "activity": "正在草地上玩耍" }
Advantages of Structured Output
(1) Ease of handling
- Machines can easily extract specific fields such as description or activity, with no need for NLP to parse free-form text.
(2) Improved controllability
- Structured formats allow developers to precisely control model output and avoid lengthy or unpredictable answers.
- For example, when AI generates code:
{ "language": "Python", "code": "print('Hello, World!')" }
(3) Easy to store and analyze
- Structured data is better suited to being stored in a database for easy querying and analysis.
- Example:
{ "date": "2025-01-20", "summary": "今天的销售额增长了10%。" }
The following example asks the model for JSON output and validates the result with a Pydantic model:
from pydantic import BaseModel, Field
from ollama import chat
import json

class CountryInfo(BaseModel):
    capital: str = Field(..., alias="首都")
    number: str = Field(..., alias="人口")
    area: str = Field(..., alias="占地面积")

response = chat(
    model='llama3.1',
    messages=[{
        'role': 'user',
        'content': "请介绍美国的首都、人口、占地面积信息,并以 JSON 格式返回。"
    }],
    format="json",
    options={'temperature': 0},
)

response_content = response["message"]["content"]
if not response_content:
    raise ValueError("Ollama returned empty JSON")

json_response = json.loads(response_content)
print(json_response)

country_info = CountryInfo.model_validate(json_response)
print(country_info)
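Newer Ollama releases also support constrained structured output: instead of format="json", you can pass a JSON schema, for example one generated from a Pydantic model. The sketch below assumes a version of Ollama and ollama-python with structured-output support; the field names are illustrative.
from pydantic import BaseModel
from ollama import chat

class Country(BaseModel):
    capital: str
    population: str
    area: str

response = chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': '请介绍美国的首都、人口、占地面积信息。'}],
    format=Country.model_json_schema(),  # constrain the reply to this schema
    options={'temperature': 0},
)
print(Country.model_validate_json(response.message.content))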
API
The Ollama Python library provides a rich set of interfaces that simplify interaction with Ollama. These interfaces are designed to be intuitive and easy to integrate, and are intended to help developers invoke and manage models more easily. For a more detailed look at the underlying implementation and complete API endpoint information, we recommend the Ollama API User's Guide.
Chat
ollama.chat(model='llama3.1', messages=[{'role': 'user', 'content': '为什么天空是蓝色的?'}])
Generate
ollama.generate(model='llama3.1', prompt='为什么天空是蓝色的?')
List of Local Models
ollama.list()
Displaying model information
ollama.show('llama3.1')
Creating Models
modelfile='''
FROM llama3.1
SYSTEM 你是超级马里奥兄弟中的马里奥。
'''
ollama.create(model='example', modelfile=modelfile)
Copy Model
ollama.copy('llama3.1', 'user/llama3.1')
Delete Model
ollama.delete('llama3.1')
Pull Model
ollama.pull('llama3.1')
Push Model
ollama.push('user/llama3.1')
Generate Embedding
ollama.embeddings(model='llama3.1', prompt='天空是蓝色的因为瑞利散射')
# Batch-generate embeddings
ollama.embed(model='llama3.1', input=['天空是蓝色的', '草是绿色的'])
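The batch call returns one vector per input string; a common use is comparing texts by cosine similarity. Below is a minimal sketch reusing the two sentences from the example above (the model name is the same illustrative one used throughout):
import math
from ollama import embed

response = embed(model='llama3.1', input=['天空是蓝色的', '草是绿色的'])
vec_a, vec_b = response['embeddings']

# Cosine similarity computed by hand to avoid extra dependencies
dot = sum(a * b for a, b in zip(vec_a, vec_b))
norm_a = math.sqrt(sum(a * a for a in vec_a))
norm_b = math.sqrt(sum(b * b for b in vec_b))
print('cosine similarity:', dot / (norm_a * norm_b))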
List Running Models
ollama.ps()
Custom Clients
A custom client can be created by instantiating Client or AsyncClient from the ollama package.
Custom clients can be configured with the following fields:
- host: the Ollama host to connect to
- timeout: the request timeout
For all other keyword arguments, see httpx.Client.
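For example (a minimal sketch; the 30-second timeout is just an illustrative value):
from ollama import Client

# Default local Ollama address; timeout value chosen only for illustration
client = Client(host='http://localhost:11434', timeout=30)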
Synchronous Client
Using the synchronous client (Client) means that when you call the client.chat() method, the program waits for the request to complete and returns the result before continuing to execute subsequent code. This approach is more intuitive and simple, and is suitable for applications with a fairly linear flow that do not need to handle a large number of concurrent tasks.
from ollama import Client
client = Client(
    host='http://localhost:11434',
    headers={'x-some-header': 'some-value'}
)
response = client.chat(model='llama3.1', messages=[
    {
        'role': 'user',
        'content': '为什么天空是蓝色的?',
    },
])
print(response)
Asynchronous Client
The following code uses the asynchronous client (AsyncClient) and defines an asynchronous function chat(). The await keyword pauses execution of that function until the AsyncClient().chat() request completes, without blocking other operations in the meantime. This is useful for handling I/O operations (such as network requests) efficiently, or for applications that want to perform multiple tasks simultaneously. Finally, asyncio.run(chat()) is used to run the asynchronous function.
import asyncio
from ollama import AsyncClient
import nest_asyncio

# nest_asyncio lets asyncio.run() work inside environments that already run an
# event loop (such as Jupyter); it is not needed in a plain Python script.
nest_asyncio.apply()

async def chat():
    message = {'role': 'user', 'content': '为什么天空是蓝色的?'}
    response = await AsyncClient().chat(model='llama3.1', messages=[message])
    print(response)

asyncio.run(chat())
Setting stream=True modifies the function to return a Python asynchronous generator:
import asyncio
from ollama import AsyncClient
import nest_asyncio
nest_asyncio.apply()
async def chat():
    message = {'role': 'user', 'content': '为什么天空是蓝色的?'}
    async for part in await AsyncClient().chat(model='llama3.1', messages=[message], stream=True):
        print(part['message']['content'], end='', flush=True)

asyncio.run(chat())
Synchronous vs. Asynchronous Clients: Runtime Comparison for Different Numbers of Calls
The following code repeats the test process test_num times with both the synchronous and asynchronous clients, comparing the total time and the time per call. You can change the following parameters for your own tests:
- test_messages: test data
- test_num: number of tests
- model_name: test model
import time
import asyncio
from ollama import Client, AsyncClient
import nest_asyncio

# Apply nest_asyncio to support async operations inside Jupyter
nest_asyncio.apply()

# Initialize the clients
client = Client(host='http://localhost:11434')
async_client = AsyncClient(host='http://localhost:11434')

# Synchronous request handler
def request_example(client, model_name, messages):
    start_time = time.time()
    try:
        # Synchronous request
        response = client.chat(model=model_name, messages=messages)
    except Exception as e:
        print(f"Synchronous request failed: {e}")
        response = None
    end_time = time.time()
    duration = end_time - start_time
    print(f"Synchronous request time: {duration}")
    return response, duration

# Asynchronous request handler
async def async_request_example(client, model_name, messages):
    start_time = time.time()
    try:
        # Asynchronous request
        response = await client.chat(model=model_name, messages=messages)
    except Exception as e:
        print(f"Asynchronous request failed: {e}")
        response = None
    end_time = time.time()
    duration = end_time - start_time
    print(f"Asynchronous request time: {duration}")
    return response, duration

# Asynchronous test runner
async def async_client_test(test_num, model_name, messages):
    tasks = [asyncio.create_task(async_request_example(async_client, model_name, messages))
             for _ in range(test_num)]
    results = await asyncio.gather(*tasks)
    return results

# Run the synchronous test
def sync_test(model_name, messages, test_num):
    total_time = 0
    for i in range(test_num):
        _, duration = request_example(client, model_name, messages)
        total_time += duration
    return total_time / test_num

# Run the asynchronous test
async def async_test(model_name, messages, test_num):
    start_time = time.time()
    await async_client_test(test_num, model_name, messages)
    end_time = time.time()
    return (end_time - start_time) / test_num

# Prepare the test data
test_messages = [{'role': 'user', 'content': '为什么天空是蓝色的?'}]
test_num = 10
model_name = 'llama3.1'

# Run the synchronous test and print the result
print("Running synchronous test")
sync_avg_time = sync_test(model_name, test_messages, test_num)
print(f"Average time for synchronous test: {sync_avg_time:.2f} s")

# Run the asynchronous test and print the result
print("Running asynchronous test")
async_avg_time = asyncio.run(async_test(model_name, test_messages, test_num))
print(f"Average time for asynchronous test: {async_avg_time:.2f} s")
Errors
An error is raised if the request returns an error status or an error is detected while streaming.
import ollama
model = 'does-not-yet-exist'
try:
    ollama.chat(model)
except ollama.ResponseError as e:
    print('Error:', e.error)
    if e.status_code == 404:
        ollama.pull(model)
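Errors raised while streaming can be caught the same way; a minimal sketch, reusing the llama3.1 model from the earlier examples:
import ollama

try:
    stream = ollama.chat(
        model='llama3.1',
        messages=[{'role': 'user', 'content': '为什么天空是蓝色的?'}],
        stream=True,
    )
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)
except ollama.ResponseError as e:
    print('Streaming error:', e.error)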
Refer to the documentation: Ollama Python