In this article, we'll briefly explain how to use the Ollama Python library. Whether you want to have a simple chat conversation, stream large responses, or manage models locally (creation, copying, deletion, and so on), this article can guide you. In addition, we show how to use custom clients and asynchronous programming to optimize your application's performance. Whether you're new to Ollama or an experienced developer, this article can help you use the Ollama API more efficiently in Python.
This tutorial also provides a Jupyter Notebook example so you can follow along hands-on.
Environment Preparation
Before you start using Python to interact with the Ollama API, make sure your development environment meets the following conditions:
- Python: Install Python 3.8 or later.
- pip: Make sure you have pip, the Python package management tool, installed.
- ollama library: makes it easier to interact with the Ollama API. Install it with the following command:
pip install ollama
Usage
from ollama import chat
from ollama import ChatResponse

response: ChatResponse = chat(model='llama3.1', messages=[
    {
        'role': 'user',
        'content': 'Why is the sky blue?',
    },
])
# Both access styles work: dictionary-style indexing and attribute access.
print(response['message']['content'])
print(response.message.content)
Streaming Response
Response streaming can be enabled by setting stream=True, so that the function call returns a Python generator in which each part is one chunk of the stream.
from ollama import chat

stream = chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True,
)

# Print each chunk as it arrives.
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
Structured Output
- Normal Output (Unstructured Output)
- Generate natural language text directly.
- Suitable for human reading, but not easy for programs to parse or automate.
- Example:
A black kitten is playing in the grass.
- Structured Output
- Return data in JSON, YAML, XML, or other formats to make it easier for machines to parse and use.
- Ideal for APIs, automated workflows and data storage.
- Example:
{ "description": "This is a black kitten", "activity": "Playing in the grass." }
Advantages of Structured Output
(1) Ease of handling
- Machines can easily extract specific fields such as description or activity, without needing NLP to parse free-form text (see the short sketch after this list).
(2) Improvement of controllability
- Structured formats allow developers to precisely control model output and avoid lengthy or unpredictable answers.
- For example, when AI generates code:
{ "language": "Python", "code": "print('Hello, World!') "code": "print('Hello, World!')" }
(3) Easy to store and analyze
- Structured data is better suited to be stored in a database for easy querying and analysis.
- Example:
{ "date": "2025-01-20",. "summary": "Today's sales are up 10%." }
from pydantic import BaseModel, Field
from ollama import chat
import json

class CountryInfo(BaseModel):
    capital: str = Field(..., alias="Capital")
    number: str = Field(..., alias="Population")
    area: str = Field(..., alias="Area")

response = chat(
    model='llama3.1',
    messages=[{
        'role': 'user',
        'content': "Please provide information about the capital, population, and area of the United States, and return it in JSON format using the keys Capital, Population, and Area."
    }],
    format="json",
    options={'temperature': 0},
)

response_content = response["message"]["content"]

if not response_content:
    raise ValueError("The JSON returned by Ollama is empty")

json_response = json.loads(response_content)
print(json_response)

friends_response = CountryInfo.model_validate(json_response)
print(friends_response)
API
The Ollama Python library provides a rich set of interfaces that simplify interaction with Ollama. These interfaces are designed to be intuitive and easy to integrate, and are intended to help developers invoke and manage models more easily. For a more detailed look at the underlying implementation and complete API endpoint information, we recommend the Ollama API User's Guide.
Chat
ollama.chat(model='llama3.1', messages=[{'role': 'user', 'content': 'Why is the sky blue?'}])
Generate
ollama.generate(model='llama3.1', prompt='Why is the sky blue?')
List of Local Models
ollama.list()
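As a small sketch (not from the original tutorial), the result of list() can be iterated to inspect the locally available models; recent ollama-python releases expose them under a models collection, and the per-model field names can differ between versions, so each entry is printed whole here.

import ollama

# Print every locally available model entry as-is; adjust field access
# to match the keys your installed ollama-python version returns.
for m in ollama.list()['models']:
    print(m)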
Displaying model information
ollama.show('llama3.1')
Creating Models
modelfile='''
FROM llama3.1
SYSTEM You are Mario from Super Mario Bros.
'''
ollama.create(model='example', modelfile=modelfile)
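Once created, the custom model can be called like any other; a brief sketch, assuming the 'example' model defined above:

import ollama

# Chat with the newly created model; it should answer in the Mario
# persona set by the SYSTEM instruction above.
reply = ollama.chat(model='example', messages=[
    {'role': 'user', 'content': 'Who are you?'},
])
print(reply['message']['content'])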
Copy a Model
ollama.copy('llama3.1', 'user/llama3.1')
Delete Model
ollama.delete('llama3.1')
Pull a Model
ollama.pull('llama3.1')
Push a Model
ollama.push('user/llama3.1')
Generate Embeddings
ollama.embeddings(model='llama3.1', prompt='The sky is blue because of Rayleigh scattering')
# Batch generation of embeddings
ollama.embed(model='llama3.1', input=['sky is blue', 'grass is green'])
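As a short illustration (an assumption-laden sketch, not from the original tutorial), the batch embeddings returned by embed() can be compared directly, for example with cosine similarity; this assumes the response exposes an embeddings field holding one vector per input string.

import math
import ollama

resp = ollama.embed(model='llama3.1', input=['sky is blue', 'grass is green'])
v1, v2 = resp['embeddings']  # one vector per input string

# Cosine similarity between the two sentence embeddings.
dot = sum(a * b for a, b in zip(v1, v2))
norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
print('cosine similarity:', dot / norm)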
List Running Models
ollama.ps()
Custom Clients
A custom client can be created by instantiating Client or AsyncClient from the ollama package.
Custom clients accept the following fields:
- host: the Ollama host to connect to
- timeout: the request timeout
For all other keyword arguments, see httpx.Client.
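For example, a minimal sketch (assuming a local Ollama instance on the default port) that sets both fields; extra keyword arguments such as headers are passed through to httpx.Client:

from ollama import Client

client = Client(
    host='http://localhost:11434',  # Ollama host to connect to
    timeout=30,                     # seconds before a request is aborted
)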
Synchronous Client
With the synchronous client (Client), a call to the client.chat() method waits for the request to complete and returns the result before subsequent code runs. This approach is intuitive and simple, and suits applications with a linear flow that do not need to handle many concurrent tasks.
from ollama import Client

client = Client(
    host='http://localhost:11434',
    headers={'x-some-header': 'some-value'}
)

response = client.chat(model='llama3.1', messages=[
    {
        'role': 'user',
        'content': 'Why is the sky blue?',
    },
])
print(response)
Asynchronous Client
This code uses the asynchronous client (AsyncClient) and defines an asynchronous function chat(). With the await keyword, execution of this function is suspended until the AsyncClient().chat() request completes, without blocking other operations in the meantime. This is useful for efficiently handling I/O-bound operations (such as network requests) or for applications that want to perform multiple tasks concurrently. In addition, asyncio.run(chat()) is used to run the asynchronous function.
import asyncio
from ollama import AsyncClient
import nest_asyncio

# Needed to call asyncio.run() in environments that already run an event
# loop, such as Jupyter.
nest_asyncio.apply()

async def chat():
    message = {'role': 'user', 'content': 'Why is the sky blue?'}
    response = await AsyncClient().chat(model='llama3.1', messages=[message])
    print(response)

asyncio.run(chat())
Setting stream=True modifies the function call to return a Python asynchronous generator:
import asyncio
from ollama import AsyncClient
import nest_asyncio

nest_asyncio.apply()

async def chat():
    message = {'role': 'user', 'content': 'Why is the sky blue?'}
    # Iterate over the asynchronous generator as chunks arrive.
    async for part in await AsyncClient().chat(model='llama3.1', messages=[message], stream=True):
        print(part['message']['content'], end='', flush=True)

asyncio.run(chat())
Synchronous & Asynchronous Client Time Consumption Comparison Test with Different Number of Calls
The following piece of code calls the synchronous and asynchronous client repeats respectively test_num
Sub-quiz process, comparing the total time needed and the single time, the user can change the following parameters for testing:
- test_messages: test data
- test_num: number of tests
- model_name: test model
import time
import asyncio
from ollama import Client, AsyncClient
import nest_asyncio

# Apply nest_asyncio to support asynchronous operations in Jupyter
nest_asyncio.apply()

# Initialize the clients
client = Client(host='http://localhost:11434')
async_client = AsyncClient(host='http://localhost:11434')

# Synchronous request handler
def request_example(client, model_name, messages):
    start_time = time.time()
    try:
        # Synchronous request
        response = client.chat(model=model_name, messages=messages)
    except Exception as e:
        print(f"Synchronous request failed: {e}")
        response = None
    end_time = time.time()
    duration = end_time - start_time
    print(f"Synchronous request time: {duration}")
    return response, duration

# Asynchronous request handler
async def async_request_example(client, model_name, messages):
    start_time = time.time()
    try:
        # Asynchronous request
        response = await client.chat(model=model_name, messages=messages)
    except Exception as e:
        print(f"Asynchronous request failed: {e}")
        response = None
    end_time = time.time()
    duration = end_time - start_time
    print(f"Asynchronous request time: {duration}")
    return response, duration

# Asynchronous request test function
async def async_client_test(test_num, model_name, messages):
    tasks = [asyncio.create_task(async_request_example(async_client, model_name, messages))
             for _ in range(test_num)]
    results = await asyncio.gather(*tasks)
    return results

# Run the synchronous test
def sync_test(model_name, messages, test_num):
    total_time = 0
    for i in range(test_num):
        _, duration = request_example(client, model_name, messages)
        total_time += duration
    return total_time / test_num

# Run the asynchronous test
async def async_test(model_name, messages, test_num):
    start_time = time.time()
    await async_client_test(test_num, model_name, messages)
    end_time = time.time()
    return (end_time - start_time) / test_num

# Prepare the test data
test_messages = [{'role': 'user', 'content': 'Why is the sky blue?'}]
test_num = 10
model_name = 'llama3.1'

# Run the synchronous test and print the results
print("Running the synchronous test")
sync_avg_time = sync_test(model_name, test_messages, test_num)
print(f"Synchronous test average time: {sync_avg_time:.2f} seconds")

# Run the asynchronous test and print the results
print("Running the asynchronous test")
async_avg_time = asyncio.run(async_test(model_name, test_messages, test_num))
print(f"Asynchronous test average time: {async_avg_time:.2f} seconds")
Errors
An error is raised if the request returns an error status or an error is detected while streaming.
import ollama

model = 'does-not-yet-exist'

try:
    ollama.chat(model)
except ollama.ResponseError as e:
    print('Error:', e.error)
    # Pull the model if it is missing locally.
    if e.status_code == 404:
        ollama.pull(model)
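The same exception type can surface mid-stream; a brief sketch of wrapping the streaming loop (the model name here is just an example):

import ollama

try:
    stream = ollama.chat(
        model='llama3.1',
        messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
        stream=True,
    )
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)
except ollama.ResponseError as e:
    # Errors detected while streaming are raised from the loop body.
    print('Error:', e.error)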
Refer to the documentation: Ollama Python