
Ollama Custom Import Model

summary

This section explains how to customize imported models using a Modelfile. It is divided into the following parts:

  1. Import from GGUF
  2. Importing from PyTorch or Safetensors
  3. Direct import from the model
  4. Customizing Prompt

 

I. Import from GGUF

GGUF (GPT-Generated Unified Format) is a file format for saving fine-tuned language models. This format is designed to help users easily share and import models between different platforms and environments. It supports multiple quantization formats and can effectively reduce the size of model files.

Its predecessor was GGML (GPT-Generated Model Language), a tensor library designed specifically for machine learning. It aimed to provide a single-file format that is easy to run inference with on different CPU and GPU architectures, but it was eventually superseded because of limited flexibility, compatibility issues, and difficulty of maintenance.


Ollama supports importing a model from a GGUF file through the following steps:

  1. Download a .gguf file

Download link: https://huggingface.co/RichardErkhov/Qwen_-_Qwen2-0.5B-gguf/resolve/main/Qwen2-0.5B.Q3_K_M.gguf?download=true

For demonstration purposes, we have chosen the Qwen2-0.5B model. Download it and copy it into the root directory of Part 1. You can refer to the following directory structure:

├── 1. Import directly from gguf/
│   ├── main.ipynb
│   ├── Modelfile
│   └── Qwen-0.5B.Q3_K_M.gguf
├── 2. safetensors import/
│   ├── llama-3-8b-bnb-4bit/
│   ├── main.ipynb
│   └── Modelfile
├── 3. Model direct import/
│   ├── llama.cpp/
│   ├── Qwen-0.5b/
│   ├── main.ipynb
│   ├── Modelfile
│   └── Qwen_instruct_0.5b.gguf
└── 4. Customized Prompt Practices/
    ├── main.ipynb
    └── Modelfile

Tip.

The code for Chapter 3 and the associated documentation are in the notebook. If you are not sure where the model files should be placed, you can refer to notebook/C3.
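If the browser download in step 1 is slow or blocked, the same file can also be fetched with a short script. A minimal sketch using huggingface_hub's hf_hub_download; the repo_id and filename are taken from the download link above:

!pip install huggingface_hub
from huggingface_hub import hf_hub_download

# download the GGUF file next to the Modelfile of Part 1
hf_hub_download(
    repo_id="RichardErkhov/Qwen_-_Qwen2-0.5B-gguf",
    filename="Qwen2-0.5B.Q3_K_M.gguf",
    local_dir=".")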

  2. Create a new Modelfile file
FROM ./Qwen2-0.5B.Q3_K_M.gguf
  3. Create the model in Ollama

Note

Be sure to run the following terminal command from the directory where the Modelfile is located!

ollama create mymodel -f Modelfile
  4. Run the model in the terminal (testing shows it does not run well inside a notebook; it is better to run it in the terminal)
ollama run mymodel

Ollama Custom Import Model-1
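Besides the interactive terminal session, the newly created model can also be queried programmatically through Ollama's local REST API. A minimal sketch, assuming Ollama is serving on its default port 11434 and that the model is named mymodel as above:

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mymodel",
        "prompt": "Briefly introduce yourself.",
        "stream": False,  # return one JSON object instead of a token stream
    })
resp.raise_for_status()
print(resp.json()["response"])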

 

II. Importing from PyTorch or Safetensors

Safetensors is a file format for storing deep learning model weights that aims to address issues of security, efficiency and ease of use. This part of the functionality is still being developed by community members, and documentation resources are currently limited.

If the model being imported has one of the following architectures, it can be imported into Ollama directly via a Modelfile. Of course, you can also convert the safetensors file to a gguf file first and then process it; the conversion process can be found in Part III.

More detailed information about safetensors and GGUF can be found at this link https://www.linkedin.com/pulse/llama-3-safetensors-vs-gguf-talles-carvalho-jjcqf

  • LlamaForCausalLM
  • MistralForCausalLM
  • GemmaForCausalLM

Since this part is still being optimized by the community, the sample code and flow provided here are for reference only and are not guaranteed to run successfully. Please refer to the official documentation for details.

  1. Download the llama-3 model
!pip install huggingface_hub
# download the model
from huggingface_hub import snapshot_download

model_id = "unsloth/llama-3-8b-bnb-4bit"
snapshot_download(
    repo_id=model_id,
    local_dir="llama-3-8b-bnb-4bit",
    local_dir_use_symlinks=False,
    revision="main",
    # fill in your ACCESS_TOKEN for gated models; how to get one is shown in Part III
    token="")
  2. Create a Modelfile file in the root directory with the following contents:
FROM ./llama-3-8b-bnb-4bit
  3. Create the model in Ollama
ollama create mymodel2 -f Modelfile
  4. Run the model
ollama run mymodel2

 

III. Direct import from the model

Normally we work with many model files hosted on HuggingFace. Fortunately, hf provides a very convenient API to download and process these models. A direct download like the one above is limited by the network environment and can be very slow, so in this subsection we will complete the process with a script and hf.

llama.cpp is the open-source project behind GGUF; it provides both CLI and server functionality.

For architectures that Ollama cannot convert directly, we can use llama.cpp to quantize them and convert them to GGUF format, and then import them using the first method. The whole conversion process consists of the following steps:

  1. Download the model from huggingface;
  2. Use llama.cpp for the conversion;
  3. Use llama.cpp for model quantization;
  4. Run and upload the model.

3.1 Download Model from HuggingFace


The most intuitive way to download is with a git clone or a direct link, but since LLM weights run to several GB per shard, to avoid an OOM (Out of Memory) error we can write a simple download.py in Python.

First, you should go to hf to get your personal ACCESS_TOKEN: open the HuggingFace personal settings page.

Ollama Customized Import Model-2

Ollama Customized Import Model-3

Ollama Customized Import Model-4

At this point we have an ACCESS_TOKEN. Next we use huggingface_hub's snapshot_download to download the model; our choice here is still Qwen-0.5b.

!pip install huggingface_hub
from huggingface_hub import snapshot_download

model_id = "Qwen/Qwen1.5-0.5B"  # HuggingFace model name
snapshot_download(
    repo_id=model_id,
    local_dir="Qwen-0.5b",
    local_dir_use_symlinks=False,
    revision="main",
    # paste the ACCESS_TOKEN obtained above
    token="")

cd into the Part 3 directory before executing this. The download may take a long time, so be patient.

3.2 Conversion with llama.cpp

llama.cpp was developed by the main author of GGML, based on the earliest C/C++ version of llama, with the aim of running inference on various LLMs using the CPU. Thanks to the continuous efforts of the community, it now supports most mainstream models and even multimodal models.

First we clone the llama.cpp library locally, in the same directory as the downloaded model:

git clone https://github.com/ggerganov/llama.cpp.git

Since the model conversion in llama.cpp is implemented in Python, you need to install the relevant libraries; it is recommended to use conda or venv to create a new environment.

cd llama.cpp
pip install -r requirements.txt
python convert_hf_to_gguf.py -h

If the following is displayed, the conversion program is ready.

Ollama Customized Import Model-5

Next, we convert the model we just downloaded from HuggingFace to the GGUF format, using the following script:

python convert_hf_to_gguf.py ../Qwen-0.5b --outfile Qwen_instruct_0.5b.gguf --outtype f16

Ollama Customized Import Model-6

You can see that there is an additional Qwen_instruct_0.5b.gguf file in the llama.cpp directory, and this process takes only a few seconds.

To reduce overhead at inference time, we quantize the model; next we move on to the hands-on quantization exercise.

3.3 Model quantization using llama.cpp

Model quantization is a technique that converts a high-precision floating-point model into a low-precision model. The main purpose of model quantization is to reduce the size and computational cost of the model and to keep the model as accurate as possible, with the goal of enabling the model to run on resource-limited devices, such as CPUs or mobile devices.
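To get a feel for the savings, here is a rough back-of-the-envelope estimate for the 0.5B model used above (the bits-per-weight figure for Q4_K_M is an approximation, not an exact property of the produced file):

# rough size estimate: parameters * bits per weight / 8 bytes
params = 0.5e9                      # approx. parameter count of Qwen-0.5B
gb_f16 = params * 16 / 8 / 1e9      # 16-bit floating-point weights
gb_q4km = params * 4.5 / 8 / 1e9    # Q4_K_M averages roughly 4.5 bits per weight
print(f"f16:    ~{gb_f16:.2f} GB")  # ~1.00 GB
print(f"Q4_K_M: ~{gb_q4km:.2f} GB") # ~0.28 GB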

As before, we first create the Modelfile and then use the ollama create command to build our model from the gguf file. The only difference from Part I is that we add quantization simply by passing an extra parameter to ollama create.

First, move the Qwen_instruct_0.5b.gguf obtained in the previous step to the root directory of Part 3, and then create a Modelfile with the following contents.

FROM ./Qwen_instruct_0.5b.gguf

Run the create-and-quantize command in the terminal.

# run from the root directory of Part 3
ollama create -q Q4_K_M mymodel3 -f ./Modelfile

Ollama Customized Import Model-7

At this point, our model is quantized and created, and next we can run the model.

3.4 Run and upload the model

The steps for running a model with gguf are detailed in Part I and will not be repeated here.

If the locally saved model files take up too much space, we can upload the gguf model to our own repo on HuggingFace. Following the same idea as step 1, we can write the upload logic.

Tip.

To complete the upload, your HF_ACCESS_TOKEN must have write permission, and you must change model_id; your_hf_name refers to your HuggingFace account name.

from huggingface_hub import HfApi
import os

api = HfApi()
HF_ACCESS_TOKEN = ""  # must be a token with write permission
# TODO: set your own model_id here
# for example: model_id = "little1d/QWEN-0.5b"
model_id = "your_hf_name/QWEN-0.5b"

api.create_repo(
    model_id,
    exist_ok=True,
    repo_type="model",  # the uploaded artifact is a model
    token=HF_ACCESS_TOKEN,
)

# upload every .gguf file in the current folder to the hub
for file in os.listdir():
    if file.endswith(".gguf"):
        model_name = file.lower()
        api.upload_file(
            repo_id=model_id,
            path_in_repo=model_name,
            path_or_fileobj=f"{os.getcwd()}/{file}",
            repo_type="model",
            token=HF_ACCESS_TOKEN,
        )

Ollama Customized Import Model-8

Once uploaded, you can see it in your hf repository!
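You can also confirm the upload programmatically. A minimal sketch using HfApi.list_repo_files; pass your token as well if the repo is private:

from huggingface_hub import HfApi

api = HfApi()
# list the filenames stored in the repo; the uploaded .gguf should appear here
print(api.list_repo_files("your_hf_name/QWEN-0.5b"))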

 

IV. Customizing Prompt

Ollama supports customized Prompts, which allow the model to generate text that better meets the user's needs.

The steps to customize the Prompt are as follows:

  1. Create a Modelfile file in the root directory
FROM llama3.1
# sets the temperature to 1 [higher is more creative, lower is more coherent].
PARAMETER temperature 1
# sets the context window size to 4096, this controls how many tokens the LLM can use as context to generate the next token
PARAMETER num_ctx 4096
# sets a custom system message to specify the behavior of the chat assistant
SYSTEM You are Mario from super mario bros, acting as an assistant.
  2. Create the model
ollama create mymodel -f ./Modelfile

Ollama Customized Import Model-9

Creating a model may take a little longer than pulling a model, so please be patient.

Ollama Custom Import Model-10

Run ollama list again to view the existing models; you can see that mymodel has been created correctly.

Ollama Custom Import Model-11

  3. Run the model
ollama run mymodel

Ollama Custom Import Model-12

As you can see, our little alpaca 🦙 has turned into Mario! Customizing the Prompt was a success!
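The customized persona can also be exercised from code. A minimal sketch, assuming the ollama Python client is installed (pip install ollama) and the model is named mymodel as above:

import ollama

# with the SYSTEM line in the Modelfile, the reply should come back in Mario's voice
reply = ollama.chat(
    model="mymodel",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(reply["message"]["content"])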

Reference Links

  • https://www.linkedin.com/pulse/llama-3-safetensors-vs-gguf-talles-carvalho-jjcqf
  • https://www.sysgeek.cn/ollama-on-windows
  • https://ollama.com/blog/openai-compatibility