
Building a Local RAG Application with Ollama+LlamaIndex

Summary

This document explains how to use the LlamaIndex framework, together with locally served Ollama models, to build a local RAG (Retrieval-Augmented Generation) application. The resulting system combines retrieval and generation to improve both the efficiency of information retrieval and the relevance of the generated content. The path to a local knowledge base can be customized; LlamaIndex indexes its contents, which are then used as context in conversations.

Note: This document contains the core code snippets and detailed explanations. The full code can be found in the accompanying notebook.

 

1. Model downloads

This example uses the llama3.1 model; choose a model that suits your own machine's configuration.

ollama pull llama3.1
ollama pull nomic-embed-text
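
To confirm the Ollama server is running and both models have been pulled, you can query Ollama's local REST API. This is an optional check; it assumes the default endpoint http://localhost:11434 and that the requests package is available.

import requests

# List the models available on the local Ollama server
tags = requests.get("http://localhost:11434/api/tags", timeout=5).json()
print([m["name"] for m in tags.get("models", [])])
# Expect entries such as "llama3.1:latest" and "nomic-embed-text:latest"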

 

2. Installation of dependencies

pip install llama-index-llms-ollama
pip install llama-index-embeddings-ollama
pip install -U llama-index-readers-file

 

3. Loading data

Loads all documents in the data folder of the current directory into memory.

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
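
As an optional sanity check, you can inspect what was loaded; the attributes below are part of LlamaIndex's Document interface:

# Each entry is a Document carrying text plus file metadata
print(f"Loaded {len(documents)} documents")
print(documents[0].metadata)    # e.g. file_path, file_name
print(documents[0].text[:200])  # preview the first 200 characters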

 

4. Building the index

from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
Settings.llm = Ollama(model="llama3.1", request_timeout=360.0)
index = VectorStoreIndex.from_documents(
    documents,
)
  • Settings.embed_model: the global embedding model attribute. The sample code assigns the newly created embedding model to this global attribute;
  • Settings.llm: the global language model attribute. The sample code assigns the newly created language model to this global attribute;
  • VectorStoreIndex.from_documents: builds an index from the previously loaded documents, converting them into vectors for fast retrieval.

Setting these global attributes via Settings means the corresponding models are used by default in the index building and querying steps that follow.
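
As an alternative to the global Settings, LlamaIndex also accepts the models as per-call arguments. A minimal sketch, reusing the classes imported above:

# Per-call configuration instead of global Settings
embed_model = OllamaEmbedding(model_name="nomic-embed-text")
llm = Ollama(model="llama3.1", request_timeout=360.0)

index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
query_engine = index.as_query_engine(llm=llm)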

 

5. Querying data

query_engine = index.as_query_engine()
response = query_engine.query("What is Datawhale?")
print(response)
  • index.as_query_engine(): creates a query engine from the previously built index. The query engine receives a query, retrieves relevant context, and returns a generated response.
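
To see which document chunks were retrieved to ground the answer, you can inspect the response's source nodes. This is an optional sketch; score, node, and metadata are part of LlamaIndex's NodeWithScore interface.

# Show the retrieved chunks, their similarity scores, and source files
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.metadata.get("file_name"))
    print(node_with_score.node.get_content()[:100])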

 

6. Retrieving context for dialogues

Since the retrieved context can take up much of the available LLM context window, the chat history should be configured with a smaller token limit.

# Retrieve context for conversation
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=1500)
chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt=(
        "You are a chatbot, able to have normal interactions."
    ),
)
response = chat_engine.chat("What is Datawhale?")
print(response)

chat_mode can be chosen according to the usage scenario; the supported modes are listed below (a short example follows the list):

  • best (default): use an agent (react or openai) together with the query engine tool;
  • context: use the retriever to fetch context for every message;
  • condense_question: condense the chat history and the new message into a standalone question, then query the index;
  • condense_plus_context: condense the question and also use the retriever to fetch context;
  • simple: a simple chat engine that talks to the LLM directly;
  • react: use a ReAct agent with the query engine tool;
  • openai: use an OpenAI agent with the query engine tool.
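
For example, condense_plus_context is often a good fit for multi-turn RAG conversations, because follow-up questions are rewritten into standalone queries before retrieval. A brief sketch, reusing the index and ChatMemoryBuffer import from above:

# Condense follow-up questions with the chat history, then retrieve context
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    memory=ChatMemoryBuffer.from_defaults(token_limit=1500),
)
print(chat_engine.chat("What is Datawhale?"))
print(chat_engine.chat("What does it offer?"))  # follow-up is condensed with the history
chat_engine.reset()  # clear the conversation history when starting over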

 

7. Storing and loading vector indexes

  • storage_context.persist: stores the vector index to disk;
  • load_index_from_storage: loads the vector index from disk.
# Store the vector index
persist_dir = 'data/'
index.storage_context.persist(persist_dir=persist_dir)

# Load the vector index
from llama_index.core import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
index = load_index_from_storage(storage_context)
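
Note that this example persists into data/, the same folder the source documents are read from. A common pattern is to persist into a separate directory and rebuild only when nothing has been persisted yet; a sketch assuming a hypothetical storage/ directory and the documents loaded earlier:

import os

persist_dir = "storage/"  # hypothetical directory, kept separate from the source data
if os.path.isdir(persist_dir):
    # Reuse the previously persisted index
    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    index = load_index_from_storage(storage_context)
else:
    # Build the index once and persist it for the next run
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=persist_dir)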

 

8. Streamlit application

This example also implements a Streamlit application; the full code can be found in app.py (a minimal sketch is shown after the dependency list below).


The required dependencies are listed below:

llama_index==0.10.62
streamlit==1.36.0
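
app.py is not reproduced here, but below is a minimal sketch of what such a Streamlit front end might look like (the actual app.py may differ; the chat engine is cached so the index is built only once per server process):

import streamlit as st
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

@st.cache_resource  # build the index once and reuse it across reruns
def build_chat_engine():
    Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
    Settings.llm = Ollama(model="llama3.1", request_timeout=360.0)
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    return index.as_chat_engine(chat_mode="context")

st.title("Local RAG with Ollama + LlamaIndex")
chat_engine = build_chat_engine()

if prompt := st.chat_input("Ask a question about your documents"):
    with st.chat_message("user"):
        st.write(prompt)
    with st.chat_message("assistant"):
        st.write(chat_engine.chat(prompt).response)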

Note

In app.py, to avoid reloading the models on every turn of a continuous conversation, you can configure the environment variables OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS for the Ollama server. Because this keeps multiple models loaded at once, it requires at least an additional 8 GB of RAM.

  • OLLAMA_NUM_PARALLEL: handle multiple requests to a single model at the same time.
  • OLLAMA_MAX_LOADED_MODELS: keep multiple models loaded at the same time.

Sample Display

  1. Single-text Q&A

[Screenshot: Building a Local RAG Application with Ollama+LlamaIndex - single-text Q&A]

  2. Multi-text Q&A

[Screenshot: Building a Local RAG Application with Ollama+LlamaIndex - multi-text Q&A]

 

References: LlamaIndex Documentation
