
LlamaEdge: the quickest way to run and fine-tune LLMs locally

General Introduction

LlamaEdge is an open-source project designed to simplify running and fine-tuning large language models (LLMs) on local or edge devices. The project supports the Llama2 family of models and provides OpenAI-compatible API services, so users can easily create and run LLM inference applications. LlamaEdge leverages the Rust and Wasm technology stacks as a powerful alternative for AI inference. Users can get models up and running with simple command-line operations, and can fine-tune and extend them as needed.

 

Function List

  • Run LLMs locally: Supports running the Llama2 family of models on local or edge devices.
  • OpenAI-compatible API services: Provides OpenAI API-compatible service endpoints supporting chat, speech-to-text, text-to-speech, image generation, and more.
  • Cross-platform support: Runs on a wide range of CPU and GPU devices via portable Wasm applications.
  • Quick start: Models can be downloaded and run with simple command-line operations.
  • Fine-tuning and extension: Users can modify and extend the source code to meet specific needs.
  • Documentation and tutorials: Detailed official documentation and tutorials help users get started quickly.

 

Using Help

Installation process

  1. Install WasmEdge: First, install WasmEdge with the following command:
   curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash
  2. Download an LLM model file: Taking the Meta Llama 3.2 1B model as an example, download it with:
   curl -LO https://huggingface.co/second-state/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q5_K_M.gguf
  3. Download the LlamaEdge CLI chat application: Download the cross-platform Wasm application with:
   curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llama-chat.wasm
  4. Run the chat application: Chat with the LLM using:
   wasmedge --dir .:. --nn-preload default:GGML:AUTO:Llama-3.2-1B-Instruct-Q5_K_M.gguf llama-chat.wasm -p llama-3-chat

Feature workflow

  1. Start the API service: The API service can be started with the following command (see the example request after this list):
   wasmedge --dir .:. --env API_KEY=your_api_key llama-api-server.wasm --model-name llama-3.2-1B --prompt-template llama-chat --reverse-prompt "[INST]" --ctx-size 32000
  2. Interact with the LLM through the web interface: After starting the API service, you can interact with the local LLM through the web interface.
  3. Create custom API services: Custom API service endpoints can be created as needed, such as speech-to-text, text-to-speech, and image generation.
  4. Fine-tune and extend: Users can modify configuration files and parameters in the source code to fulfill specific functional requirements.
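
Once the API service from step 1 is running, it accepts standard OpenAI-style requests. A minimal sketch, assuming the server listens on the default port 8080 and that the API_KEY value set above is checked as a Bearer token:

   curl -X POST http://localhost:8080/v1/chat/completions \
     -H 'Authorization: Bearer your_api_key' \
     -H 'Content-Type: application/json' \
     -d '{"model": "llama-3.2-1B", "messages": [{"role": "user", "content": "Hello!"}]}'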

 

Running the distilled DeepSeek-R1 model on your laptop with LlamaEdge

DeepSeek-R1 is a powerful and versatile AI model that challenges established players such as OpenAI with its advanced reasoning capabilities, cost-effectiveness, and open-source availability. While it has some limitations, its innovative approach and robust performance make it an invaluable tool for developers, researchers, and organizations. For those interested in exploring its capabilities, the model and its distilled versions are available on platforms such as Hugging Face and GitHub.


Trained by a GPU-constrained Chinese team, it excels at math, coding, and even some fairly complex reasoning. Most interestingly, the version we run here is a distilled model, meaning it is smaller and more efficient than the giant model it is based on. This is important because it makes the model practical for people to actually run and build on.


In this article we will cover:

  • How to run the open-source DeepSeek model on your own device
  • How to create OpenAI-compatible API services using the latest DeepSeek models

We will use LlamaEdge (Rust + Wasm tech stack) to develop and deploy applications for this model. No need to install complex Python packages or C++ toolchains! Learn why we chose this technology.

 

Run the DeepSeek-R1-Distill-Llama-8B model on your own device!

Step 1: Install WasmEdge with the following command:

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s -- -v 0.14.1

Step 2: Download the quantized DeepSeek-R1-Distill-Llama-8B-GGUF model file.

This may take some time as the size of the model is 5.73 GB.

curl -LO https://huggingface.co/second-state/DeepSeek-R1-Distill-Llama-8B-GGUF/resolve/main/DeepSeek-R1-Distill-Llama-8B-Q5_K_M.gguf

Step 3: Download the LlamaEdge API server application.

It is also a cross-platform, portable Wasm application that runs on many CPU and GPU devices.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm

Step 4: Download the chatbot UI.

This lets you interact with the DeepSeek-R1-Distill-Llama-8B model in a browser.

curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz

Next, start the LlamaEdge API server for the model using the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:DeepSeek-R1-Distill-Llama-8B-Q5_K_M.gguf \
  llama-api-server.wasm \
  --prompt-template llama-3-chat \
  --ctx-size 8096

Then, open your browser and visit http://localhost:8080 to start chatting! Or you can send an API request to the model.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}], "model": "DeepSeek-R1-Distill-Llama-8B"}'

{"id": "chatcmpl-68158f69-8577-4da2-a24b-ae8614f88fea", "object": "chat.completion", "created":1737533170, "model": "default", "choices". [{"index":0, "message":{"content": "The capital of France is Paris.\n\n\nThe capital of France is Paris.\n\nThe capital of France is Paris.\n\n, "role":" assistant"}, "finish_reason": "stop", "logprobs":null}], "usage":{"prompt_tokens":34, "completion_tokens":18, "total_tokens":52}}

 

Creating OpenAI-compatible API services for DeepSeek-R1-Distill-Llama-8B

LlamaEdge is lightweight and requires no daemon or sudo to run. It can be easily embedded into your own applications! With support for both chat and embedding models, LlamaEdge can be an alternative to the OpenAI API inside applications on your local computer!

Next, we'll show how to start a full API server that serves both the DeepSeek-R1 model and an embedding model. The API server will have both the chat/completions and embeddings endpoints. In addition to the steps in the previous section, we need to:

Step 5: Download the embedding model.

curl -LO https://huggingface.co/second-state/Nomic-embed-text-v1.5-Embedding-GGUF/resolve/main/nomic-embed-text-v1.5.f16.gguf

We can then start the LlamaEdge API server with chat and embedding models using the following command line. For more detailed instructions, see the documentation - Starting the LlamaEdge API Service.

wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:DeepSeek-R1-Distill-Llama-8B-Q5_K_M.gguf \
  --nn-preload embedding:GGML:AUTO:nomic-embed-text-v1.5.f16.gguf \
  llama-api-server.wasm -p llama-3-chat,embedding \
  --model-name DeepSeek-R1-Distill-Llama-8B,nomic-embed-text-v1.5.f16 \
  --ctx-size 8192,8192 \
  --batch-size 128,8192 \
  --log-prompts --log-stat
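
With the server running, the embeddings endpoint can be exercised the same way as chat. A minimal sketch, assuming the standard OpenAI-style request body and the embedding model name registered above:

curl -X POST http://localhost:8080/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model": "nomic-embed-text-v1.5.f16", "input": ["LlamaEdge runs LLMs on edge devices."]}'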

 

Finally, you can follow these tutorials to integrate the LlamaEdge API server with other agent frameworks as a replacement for OpenAI. Specifically, replace the OpenAI API settings with the following values in your application or agent configuration.

Config option                  Value
Base API URL                   http://localhost:8080/v1
Model name (chat model)        DeepSeek-R1-Distill-Llama-8B
Model name (text embedding)    nomic-embed-text-v1.5.f16
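
For example, the official OpenAI SDKs (and many agent frameworks built on them) read these settings from environment variables, so the switch-over can be as simple as the sketch below; the key value is arbitrary here, assuming no API key was configured on the server:

export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=sk-anything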

That's it! Visit the LlamaEdge repository now and build your first AI Agent! If you find it interesting, please star our repo here. If you have any questions about running this model, please also head over to the repo to ask questions or schedule a demo with us to run your own LLM across devices!
