Summary
This document explains how to use the LlamaIndex framework to build a local RAG (Retrieval-Augmented Generation) application. By integrating LlamaIndex, you can build a RAG system that combines retrieval and generation to improve both the efficiency of information retrieval and the relevance of generated content. A custom local knowledge base path can be indexed by LlamaIndex and then used for contextual conversations.
Note: This document contains the core code snippets with detailed explanations. The full code can be found in the accompanying notebook.
1. Model downloads
This example uses the llama3.1 model; you can choose an appropriate model according to your own computer configuration.
ollama pull llama3.1
ollama pull nomic-embed-text
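You can optionally confirm that both models are available locally:
ollama list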
2. Installation of dependencies
pip install llama-index-llms-ollama
pip install llama-index-embeddings-ollama
pip install -U llama-index-readers-file
3. Loading data
Loads all documents in the data folder of the current directory into memory.
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("data").load_data()
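As a quick sanity check (not in the original snippet), you can inspect how many documents were loaded and where they came from:
# Optional sanity check: number of loaded documents and their source metadata
print(len(documents))
print(documents[0].metadata)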
4. Construction of indexes
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
Settings.llm = Ollama(model="llama3.1", request_timeout=360.0)
index = VectorStoreIndex.from_documents(documents)
- Settings.embed_model: the global embed_model attribute. The sample code assigns the embedding model created above to this global attribute.
- Settings.llm: the global llm attribute. The sample code assigns the language model created above to this global attribute.
- VectorStoreIndex.from_documents: builds the index from the previously loaded documents, converting them into vectors for fast retrieval.
Setting these global attributes via Settings means the corresponding models are used by default during the later index building and querying steps.
5. Query data
query_engine = index.as_query_engine()
response = query_engine.query("What is Datawhale?")
print(response)
- index.as_query_engine(): creates a query engine from the previously built index. The query engine accepts queries and returns responses based on the retrieved content.
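If you also want to see which document chunks grounded the answer, the response object exposes the retrieved nodes. A small optional sketch, not part of the original code:
# Optional: inspect the chunks retrieved for this query
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.metadata.get("file_name"))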
6. Retrieving context for dialogues
Since the retrieved context can take up a large share of the available LLM context window, the chat history should be configured with a smaller token limit.
# Retrieve Context for Conversation
from llama_index.core.memory import ChatMemoryBuffer
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)
chat_engine = index.as_chat_engine(
chat_mode="context",
memory=memory,
system_prompt=(
"You are a chatbot, able to have normal interactions."
),
)
response = chat_engine.chat("What is Datawhale?")
print(response)
chat_mode: choose the mode that fits your usage scenario; a short sketch of switching modes follows the list. The supported modes are:
- best (default): use an agent (react or openai) with the query engine tool;
- context: use a retriever to fetch context;
- condense_question: condense the question;
- condense_plus_context: condense the question and use a retriever to fetch context;
- simple: a simple chat engine that uses the LLM directly;
- react: use a ReAct agent with the query engine tool;
- openai: use an OpenAI agent with the query engine tool.
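As a minimal sketch (not from the original notebook; the example question is illustrative), the same index and memory can be reused with a different mode such as condense_plus_context:
# Sketch: same index and memory, different chat mode
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    memory=memory,
)
print(chat_engine.chat("What open-source projects does Datawhale maintain?"))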
7. Storing and loading vector indexes
- storage_context.persist: stores the vector index to disk.
- load_index_from_storage: loads the vector index from storage.
# Storage vector index
persist_dir = 'data/'
index.storage_context.persist(persist_dir=persist_dir)
# Load vector index
from llama_index.core import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
index = load_index_from_storage(storage_context)
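A common pattern (a sketch, not from the original code; the storage directory name is only an example) is to load a persisted index when one exists and build it only once otherwise:
# Sketch: load the persisted index if present, otherwise build and persist it
import os

persist_dir = "storage"  # example directory, kept separate from the raw "data" folder
if os.path.exists(persist_dir):
    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    index = load_index_from_storage(storage_context)
else:
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=persist_dir)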
8. Streamlit application
This example also implements a Streamlit application; see app.py for the full code.
The required dependencies are listed below:
llama_index==0.10.62
streamlit==1.36.0
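The sketch below illustrates what such an app can look like. It is a simplified, hypothetical version built on the index pipeline above, not the actual app.py.
# Minimal Streamlit chat sketch (hypothetical, not the project's app.py)
import streamlit as st
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

st.title("Local RAG with LlamaIndex + Ollama")

@st.cache_resource  # build the index only once per server process
def build_chat_engine():
    Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
    Settings.llm = Ollama(model="llama3.1", request_timeout=360.0)
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    return index.as_chat_engine(chat_mode="context")

chat_engine = build_chat_engine()

if prompt := st.chat_input("Ask a question about your documents"):
    with st.chat_message("user"):
        st.write(prompt)
    with st.chat_message("assistant"):
        st.write(chat_engine.chat(prompt).response)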
Note
In app.py, to avoid reloading the models during a continuous conversation, you can configure the environment variables OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS so that the models stay loaded in the Ollama server. Because this keeps multiple models loaded, it requires at least an additional 8 GB of RAM.
- OLLAMA_NUM_PARALLEL: handle multiple requests to a single model at the same time.
- OLLAMA_MAX_LOADED_MODELS: load multiple models at the same time.
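For example (the values below are illustrative, not taken from the original project), the variables can be set before starting the Ollama server:
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_MAX_LOADED_MODELS=2
ollama serve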
Demo
- Single-document question answering (Q&A)
- Multi-document question answering (Q&A)
References: LlamaIndex Documentation