General Introduction
LlamaEdge is an open-source project designed to simplify the process of running and fine-tuning large language models (LLMs) on local or edge devices. The project supports the Llama2 family of models and provides OpenAI-compatible API services that let users easily create and run LLM inference applications. LlamaEdge leverages the Rust and Wasm technology stacks to provide a powerful alternative for AI inference. Users can quickly get models up and running with simple command-line operations and can fine-tune and extend them as needed.
Feature List
- Running LLM locally: Supports running the Llama2 family of models on local or edge devices.
- OpenAI Compatible API Services: Provides OpenAI API-compatible service endpoints that support chat, speech-to-text, text-to-speech, image generation, and more.
- Cross-platform support: Supports a wide range of CPU and GPU devices and provides cross-platform Wasm applications.
- Quick start: Models can be quickly downloaded and run with simple command-line operations.
- Fine-tuning and expansion: Users can modify and extend the source code as needed to meet specific needs.
- Documentation and Tutorials: Detailed official documentation and tutorials are provided to help users get started quickly.
Usage Guide
Installation Process
- Install WasmEdge: First, install WasmEdge with the following command:
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash
- Download an LLM model file: Using the Meta Llama 3.2 1B model as an example, download it with the following command:
curl -LO https://huggingface.co/second-state/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q5_K_M.gguf
- Download LlamaEdge CLI Chat Application: Use the following command to download the cross-platform Wasm application:
curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llama-chat.wasm
- Run the chat application: Use the following command to chat with the LLM:
wasmedge --dir .:. --nn-preload default:GGML:AUTO:Llama-3.2-1B-Instruct-Q5_K_M.gguf llama-chat.wasm -p llama-3-chat
Operation Workflow
- Starting the API service: The API service can be started with the following command (a quick way to verify it is running is shown after this list):
wasmedge --dir .:. --env API_KEY=your_api_key --nn-preload default:GGML:AUTO:Llama-3.2-1B-Instruct-Q5_K_M.gguf llama-api-server.wasm --model-name llama-3.2-1B --prompt-template llama-3-chat --ctx-size 32000
- Interacting with LLM using the web interface: After starting the API service, you can interact with the local LLM through the web interface.
- Creating custom API services: Custom API service endpoints can be created as needed, such as speech-to-text, text-to-speech, image generation, and so on.
- Fine-tuning and expansion: Users can modify configuration files and parameters in the source code to fulfill specific functional requirements.
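As referenced above, a minimal way to check that the API service is up is to query its model list. This is a sketch that assumes the server is listening on the default port 8080 and was started with the API_KEY environment variable shown above:

# List the models served by the local LlamaEdge API server.
# The Bearer token must match the API_KEY value the server was started with.
curl -s http://localhost:8080/v1/models -H 'Authorization: Bearer your_api_key'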
Quickly Run the DeepSeek-R1 Distilled Model on a Laptop with LlamaEdge
DeepSeek-R1 is a powerful and versatile AI model that challenges established players such as OpenAI with its advanced inference capabilities, cost-effectiveness, and open-source availability. While it has some limitations, its innovative approach and robust performance make it a valuable tool for developers, researchers, and organizations. For those interested in exploring its capabilities, the model and its distilled versions are available on platforms such as Hugging Face and GitHub.
Trained by a GPU-constrained Chinese team, it excels at math, coding, and even fairly complex reasoning. Most interestingly, the version we run here is a distilled model, meaning it is smaller and more efficient than the giant model it is based on. This matters because it makes the model practical for people to actually use and build on.
In this article, we will cover:
- How to run the open-source DeepSeek model on your own device
- How to create OpenAI-compatible API services using the latest DeepSeek models
We will use LlamaEdge (Rust + Wasm tech stack) to develop and deploy applications for this model. No need to install complex Python packages or C++ toolchains! Learn why we chose this technology.
Run the DeepSeek-R1-Distill-Llama-8B model on your own device!
Step 1: Install WasmEdge via the following command line.
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s -- -v 0.14.1
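To confirm the installation succeeded, you can load the environment file the installer writes and print the runtime version. This is a minimal check that assumes the default installation location of ~/.wasmedge:

# Load the WasmEdge environment variables set up by the installer (default location)
source $HOME/.wasmedge/env
# Print the installed WasmEdge version to confirm the install worked
wasmedge --version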
Step 2: Download the quantized DeepSeek-R1-Distill-Llama-8B-GGUF model file.
This may take some time as the size of the model is 5.73 GB.
curl -LO https://huggingface.co/second-state/DeepSeek-R1-Distill-Llama-8B-GGUF/resolve/main/DeepSeek-R1-Distill-Llama-8B-Q5_K_M.gguf
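Since the download is large, it is worth confirming the file arrived intact. A quick sanity check is to compare the size on disk against the expected 5.73 GB:

# Show the downloaded model file with a human-readable size (expect roughly 5.7G)
ls -lh DeepSeek-R1-Distill-Llama-8B-Q5_K_M.gguf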
Step 3: Download the LlamaEdge API server application.
It is also a cross-platform, portable Wasm application that runs on many CPU and GPU devices.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm
Step 4: Download the chatbot UI to interact with the DeepSeek-R1-Distill-Llama-8B model in a browser.
curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz
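After extraction you should have the web UI assets on disk. As a sketch, assuming the tarball unpacks into a chatbot-ui directory (the location the API server serves the web UI from by default), you can verify it like this:

# List the extracted web UI files; the API server serves this directory by default
ls chatbot-ui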
Next, start the LlamaEdge API server for the model using the following command line.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:DeepSeek-R1-Distill-Llama-8B-Q5_K_M.gguf \
  llama-api-server.wasm \
  --prompt-template llama-3-chat \
  --ctx-size 8096
Then, open your browser and visit http://localhost:8080 to start chatting! Or you can send an API request to the model.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}], "model": "DeepSeek-R1-Distill-Llama-8B"}'

The server responds with a JSON body like the following:

{"id":"chatcmpl-68158f69-8577-4da2-a24b-ae8614f88fea","object":"chat.completion","created":1737533170,"model":"default","choices":[{"index":0,"message":{"content":"The capital of France is Paris.\n\n\nThe capital of France is Paris.\n\nThe capital of France is Paris.\n\n","role":"assistant"},"finish_reason":"stop","logprobs":null}],"usage":{"prompt_tokens":34,"completion_tokens":18,"total_tokens":52}}
Creating OpenAI-compatible API services for DeepSeek-R1-Distill-Llama-8B
LlamaEdge is lightweight and requires no daemon or sudo to run. It can be easily embedded into your own applications! With support for chat and embedding models, LlamaEdge can be a drop-in alternative to the OpenAI API inside applications on your local computer!
Next, we'll show how to start a full API server that serves both the DeepSeek-R1 model and an embedding model. The API server will have both the chat/completions and embeddings endpoints. In addition to the steps in the previous section, we need to:
Step 5: Download the embedding model.
curl -LO https://huggingface.co/second-state/Nomic-embed-text-v1.5-Embedding-GGUF/resolve/main/nomic-embed-text-v1.5.f16.gguf
We can then start the LlamaEdge API server with both the chat and embedding models using the following command line. For more detailed instructions, see the documentation: Starting the LlamaEdge API Service.
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:DeepSeek-R1-Distill-Llama-8B-Q5_K_M.gguf \
  --nn-preload embedding:GGML:AUTO:nomic-embed-text-v1.5.f16.gguf \
  llama-api-server.wasm -p llama-3-chat,embedding \
  --model-name DeepSeek-R1-Distill-Llama-8B,nomic-embed-text-v1.5.f16 \
  --ctx-size 8192,8192 \
  --batch-size 128,8192 \
  --log-prompts --log-stat
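With the server running, you can exercise the embeddings endpoint with an OpenAI-style request. A minimal sketch, assuming the default port and the embedding model name registered in the command above:

# Request an embedding vector from the local embeddings endpoint
curl -X POST http://localhost:8080/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model": "nomic-embed-text-v1.5.f16", "input": ["LlamaEdge runs LLMs on local and edge devices."]}'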
Finally, you can follow these tutorials to integrate the LlamaEdge API server with other agent frameworks as a drop-in replacement for OpenAI. Specifically, replace the OpenAI API settings with the following values in your application or agent configuration.
Config option | Value
---|---
Base API URL | http://localhost:8080/v1
Model name (chat model) | DeepSeek-R1-Distill-Llama-8B
Model name (text embedding) | nomic-embed-text-v1.5.f16
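For frameworks that read the standard OpenAI environment variables, the same values can often be supplied without touching code. This is a sketch that assumes your framework honors OPENAI_BASE_URL and OPENAI_API_KEY (variable names vary by SDK):

# Point OpenAI-compatible SDKs at the local LlamaEdge server
export OPENAI_BASE_URL=http://localhost:8080/v1
# Any non-empty key works unless the server was started with an API_KEY
export OPENAI_API_KEY=your_api_key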
That's it! Visit the LlamaEdge repository now and build your first AI agent! If you find it interesting, please star our repo here. If you have any questions about running this model, head over to the repo to ask questions, or schedule a demo with us to run your own LLMs across devices!