
Ollama Custom Import Model

summary

This section explains how to customize imported models using a Modelfile. It is divided into the following parts:

  1. Import from GGUF
  2. Importing from PyTorch or Safetensors
  3. Direct import from the model
  4. Customizing Prompt

 

I. Import from GGUF

GGUF (GPT-Generated Unified Format) is a file format for saving fine-tuned language models. This format is designed to help users easily share and import models between different platforms and environments. It supports multiple quantization formats and can effectively reduce the size of model files.

Its predecessor was GGML (GPT-Generated Model Language), a tensor library designed specifically for machine learning. It aimed to provide a single-file format that is easy to run inference with on different CPU and GPU architectures, but it was eventually superseded because of limited flexibility, compatibility issues, and difficulty of maintenance.


Ollama supports importing a model from a GGUF file through the following steps:

  1. Download a .gguf file

Download link: https://huggingface.co/RichardErkhov/Qwen_-_Qwen2-0.5B-gguf/resolve/main/Qwen2-0.5B.Q3_K_M.gguf?download=true

For demonstration purposes, we have chosen the Qwen2-0.5B model. Download it and copy it into the root directory of Part 1. You can refer to the following directory structure:

├── 1. Import directly from gguf/
│   ├── main.ipynb
│   ├── Modelfile
│   └── Qwen-0.5B.Q3_K_M.gguf
├── 2. safetensors import/
│   ├── llama-3-8b-bnb-4bit/
│   ├── main.ipynb
│   └── Modelfile
├── 3. Model direct import/
│   ├── llama.cpp/
│   ├── Qwen-0.5b/
│   ├── main.ipynb
│   ├── Modelfile
│   └── Qwen_instruct_0.5b.gguf
└── 4. Customized Prompt Practices/
    ├── main.ipynb
    └── Modelfile

Tip.

The code for Chapter 3 and the associated documentation are in the notebook. If you are not sure where the model files should be placed, you can refer to notebook/C3.
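If the browser download in step 1 is slow or blocked, the same file can also be fetched with a short script. A minimal sketch using huggingface_hub's hf_hub_download; the repo_id and filename are taken from the download link above:

!pip install huggingface_hub
from huggingface_hub import hf_hub_download

# download the GGUF file next to the Modelfile of Part 1
hf_hub_download(
    repo_id="RichardErkhov/Qwen_-_Qwen2-0.5B-gguf",
    filename="Qwen2-0.5B.Q3_K_M.gguf",
    local_dir=".")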

  2. Create a new Modelfile file
FROM ./Qwen2-0.5B.Q3_K_M.gguf
  3. Create the model in Ollama

Note

Be sure to run the following terminal command from the directory where the Modelfile is located!

ollama create mymodel -f Modelfile
  4. Run the model in the terminal (testing shows it does not run well inside a notebook; it is better to run it in the terminal)
ollama run mymodel

Ollama Custom Import Model-1
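Besides the interactive terminal session, the newly created model can also be queried programmatically through Ollama's local REST API. A minimal sketch, assuming Ollama is serving on its default port 11434 and that the model is named mymodel as above:

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mymodel",
        "prompt": "Briefly introduce yourself.",
        "stream": False,  # return one JSON object instead of a token stream
    })
resp.raise_for_status()
print(resp.json()["response"])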

 

II. Importing from PyTorch or Safetensors

Safetensors is a file format for storing deep learning model weights that aims to address issues of security, efficiency and ease of use. This part of the functionality is still being developed by community members, and documentation resources are currently limited.

If the model being imported has one of the following architectures, it can be imported into Ollama directly via a Modelfile. Of course, you can also convert the safetensors file to a gguf file first and then process it; the conversion process can be found in Part III.

More detailed information about safetensors and GGUF can be found at this link https://www.linkedin.com/pulse/llama-3-safetensors-vs-gguf-talles-carvalho-jjcqf

  • LlamaForCausalLM
  • MistralForCausalLM
  • GemmaForCausalLM

Since this part is still being optimized by the community, the sample code and flow provided here are for reference only and are not guaranteed to run successfully. Please refer to the official documentation for details.

  1. Download the llama-3 model
!pip install huggingface_hub
# download the model
from huggingface_hub import snapshot_download

model_id = "unsloth/llama-3-8b-bnb-4bit"
snapshot_download(
    repo_id=model_id,
    local_dir="llama-3-8b-bnb-4bit",
    local_dir_use_symlinks=False,
    revision="main",
    # fill in your ACCESS_TOKEN for gated models; how to get one is shown in Part III
    token="")
  2. Create a Modelfile file in the root directory with the following contents:
FROM ./llama-3-8b-bnb-4bit
  3. Create the model in Ollama
ollama create mymodel2 -f Modelfile
  4. Run the model
ollama run mymodel2

 

III. Direct import from the model

Normally we work with many model files hosted on HuggingFace. Fortunately, hf provides a very convenient API to download and process these models. A direct download like the one above is limited by the network environment and can be very slow, so in this subsection we will complete the process with a script and hf.

llama.cpp is the open-source project behind GGUF; it provides both CLI and server functionality.

For architectures that Ollama cannot convert directly, we can use llama.cpp to quantize them and convert them to GGUF format, and then import them using the first method. The whole conversion process consists of the following steps:

  1. Download the model from huggingface;
  2. Use llama.cpp for the conversion;
  3. Use llama.cpp for model quantization;
  4. Run and upload the model.

3.1 Download Model from HuggingFace


The most intuitive way to download is with a git clone or a direct link, but since LLM weights run to several GB per shard, to avoid an OOM (Out of Memory) error we can write a simple download.py in Python.

First, you should go to hf to get your personal ACCESS_TOKEN: open the HuggingFace personal settings page.

Ollama Customized Import Model-2

Ollama Customized Import Model-3

Ollama Customized Import Model-4

At this point we have an ACCESS_TOKEN. Next we use huggingface_hub's snapshot_download to download the model; our choice here is still Qwen-0.5b.

!pip install huggingface_hub
from huggingface_hub import snapshot_download

model_id = "Qwen/Qwen1.5-0.5B"  # HuggingFace model name
snapshot_download(
    repo_id=model_id,
    local_dir="Qwen-0.5b",
    local_dir_use_symlinks=False,
    revision="main",
    # paste the ACCESS_TOKEN obtained above
    token="")

cd into the Part 3 directory before executing this. The download may take a long time, so be patient.

3.2 Conversion with llama.cpp

llama.cpp was developed by the main author of GGML, based on the earliest C/C++ version of llama, with the aim of running inference on various LLMs using the CPU. Thanks to the continuous efforts of the community, it now supports most mainstream models and even multimodal models.

First we clone the llama.cpp library locally, in the same directory as the downloaded model:

git clone https://github.com/ggerganov/llama.cpp.git

Since the model conversion in llama.cpp is implemented in Python, you need to install the relevant libraries; it is recommended to use conda or venv to create a new environment.

cd llama.cpp
pip install -r requirements.txt
python convert_hf_to_gguf.py -h

If the following is displayed, the conversion program is ready.

Ollama Customized Import Model-5

Next, we convert the model we just downloaded from HuggingFace to the GGUF format, using the following script:

python convert_hf_to_gguf.py ../Qwen-0.5b --outfile Qwen_instruct_0.5b.gguf --outtype f16

Ollama Customized Import Model-6

You can see that there is an additional Qwen_instruct_0.5b.gguf file in the llama.cpp directory, and this process takes only a few seconds.

To reduce overhead at inference time, we quantize the model; next we move on to the hands-on quantization exercise.

3.3 Model quantization using llama.cpp

Model quantization is a technique that converts a high-precision floating-point model into a low-precision model. The main purpose of model quantization is to reduce the size and computational cost of the model and to keep the model as accurate as possible, with the goal of enabling the model to run on resource-limited devices, such as CPUs or mobile devices.
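To get a feel for the savings, here is a rough back-of-the-envelope estimate for the 0.5B model used above (the bits-per-weight figure for Q4_K_M is an approximation, not an exact property of the produced file):

# rough size estimate: parameters * bits per weight / 8 bytes
params = 0.5e9                      # approx. parameter count of Qwen-0.5B
gb_f16 = params * 16 / 8 / 1e9      # 16-bit floating-point weights
gb_q4km = params * 4.5 / 8 / 1e9    # Q4_K_M averages roughly 4.5 bits per weight
print(f"f16:    ~{gb_f16:.2f} GB")  # ~1.00 GB
print(f"Q4_K_M: ~{gb_q4km:.2f} GB") # ~0.28 GB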

As before, we first create the Modelfile and then use the ollama create command to build our model from the gguf file. The only difference from Part I is that we add quantization simply by passing an extra parameter to ollama create.

First, move the Qwen_instruct_0.5b.gguf obtained in the previous step to the root directory of Part 3, and then create a Modelfile with the following contents.

FROM ./Qwen_instruct_0.5b.gguf

Run the create-and-quantize command in the terminal.

# run from the root directory of Part 3
ollama create -q Q4_K_M mymodel3 -f ./Modelfile

Ollama Customized Import Model-7

At this point, our model is quantized and created, and next we can run the model.

3.4 Run and upload the model

The steps for running a model with gguf are detailed in Part I and will not be repeated here.

If the locally saved model files take up too much space, we can upload the gguf model to our own repo on HuggingFace. Following the same idea as step 1, we can write the upload logic.

Tip.

To complete the upload, your HF_ACCESS_TOKEN must have write permission, and you must change model_id; your_hf_name refers to your HuggingFace account name.

from huggingface_hub import HfApi
import os

api = HfApi()
HF_ACCESS_TOKEN = ""  # must be a token with write permission
# TODO: set your own model_id here
# for example: model_id = "little1d/QWEN-0.5b"
model_id = "your_hf_name/QWEN-0.5b"

api.create_repo(
    model_id,
    exist_ok=True,
    repo_type="model",  # the uploaded artifact is a model
    token=HF_ACCESS_TOKEN,
)

# upload every .gguf file in the current folder to the hub
for file in os.listdir():
    if file.endswith(".gguf"):
        model_name = file.lower()
        api.upload_file(
            repo_id=model_id,
            path_in_repo=model_name,
            path_or_fileobj=f"{os.getcwd()}/{file}",
            repo_type="model",
            token=HF_ACCESS_TOKEN,
        )

Ollama Customized Import Model-8

Once uploaded, you can see it in your hf repository!
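You can also confirm the upload programmatically. A minimal sketch using HfApi.list_repo_files; pass your token as well if the repo is private:

from huggingface_hub import HfApi

api = HfApi()
# list the filenames stored in the repo; the uploaded .gguf should appear here
print(api.list_repo_files("your_hf_name/QWEN-0.5b"))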

 

IV. Customizing Prompt

Ollama supports customized Prompts, which allow the model to generate text that better meets the user's needs.

The steps to customize the Prompt are as follows:

  1. Create a Modelfile file in the root directory
FROM llama3.1
# sets the temperature to 1 [higher is more creative, lower is more coherent].
PARAMETER temperature 1
# sets the context window size to 4096, this controls how many tokens the LLM can use as context to generate the next token
PARAMETER num_ctx 4096
# sets a custom system message to specify the behavior of the chat assistant
SYSTEM You are Mario from super mario bros, acting as an assistant.
  2. Create the model
ollama create mymodel -f ./Modelfile

Ollama Customized Import Model-9

Creating a model may take a little longer than pulling a model, so please be patient.

Ollama Custom Import Model-10

Run ollama list again to view the existing models; you can see that mymodel has been created correctly.

Ollama Custom Import Model-11

  3. Run the model
ollama run mymodel

Ollama Custom Import Model-12

As you can see, our little alpaca 🦙 has turned into Mario! Customizing the Prompt was a success!
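The customized persona can also be exercised from code. A minimal sketch, assuming the ollama Python client is installed (pip install ollama) and the model is named mymodel as above:

import ollama

# with the SYSTEM line in the Modelfile, the reply should come back in Mario's voice
reply = ollama.chat(
    model="mymodel",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(reply["message"]["content"])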

Reference Links

  • https://www.linkedin.com/pulse/llama-3-safetensors-vs-gguf-talles-carvalho-jjcqf
  • https://www.sysgeek.cn/ollama-on-windows
  • https://ollama.com/blog/openai-compatibility