Summary
This document explains how to use the LlamaIndex framework to build a local RAG (Retrieval-Augmented Generation) application. By integrating LlamaIndex, you can build a RAG system that combines retrieval and generation to improve both the efficiency of information retrieval and the relevance of generated content. A custom local knowledge base path can be indexed by LlamaIndex and then used for contextual conversations.
Note: This document contains the core code snippets with detailed explanations. The full code can be found in the accompanying notebook.
1. Model downloads
This example uses the llama3.1 model; you can choose an appropriate model according to your own computer configuration.
ollama pull llama3.1
ollama pull nomic-embed-text
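You can optionally confirm that both models are available locally:
ollama list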
2. Installation of dependencies
pip install llama-index-llms-ollama
pip install llama-index-embeddings-ollama
pip install -U llama-index-readers-file
3. Loading data
Loads all documents in the data folder of the current directory into memory.
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("data").load_data()
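As a quick sanity check (not in the original snippet), you can inspect how many documents were loaded and where they came from:
# Optional sanity check: number of loaded documents and their source metadata
print(len(documents))
print(documents[0].metadata)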
4. Construction of indexes
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
Settings.llm = Ollama(model="llama3.1", request_timeout=360.0)
index = VectorStoreIndex.from_documents(documents)
- Settings.embed_model: the global embed_model attribute. The sample code assigns the embedding model created above to this global attribute.
- Settings.llm: the global llm attribute. The sample code assigns the language model created above to this global attribute.
- VectorStoreIndex.from_documents: builds the index from the previously loaded documents, converting them into vectors for fast retrieval.
Setting these global attributes via Settings means the corresponding models are used by default during the later index building and querying steps.
5. Query data
query_engine = index.as_query_engine()
response = query_engine.query("What is Datawhale?")
print(response)
- index.as_query_engine(): creates a query engine from the previously built index. The query engine accepts queries and returns responses based on the retrieved content.
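If you also want to see which document chunks grounded the answer, the response object exposes the retrieved nodes. A small optional sketch, not part of the original code:
# Optional: inspect the chunks retrieved for this query
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.metadata.get("file_name"))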
6. Retrieving context for dialogues
Since the retrieved context can take up a large share of the available LLM context window, the chat history should be configured with a smaller token limit.
# Retrieve Context for Conversation
from llama_index.core.memory import ChatMemoryBuffer
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)
chat_engine = index.as_chat_engine(
chat_mode="context",
memory=memory,
system_prompt=(
"You are a chatbot, able to have normal interactions."
),
)
response = chat_engine.chat("What is Datawhale?")
print(response)
chat_mode: choose the mode that fits your usage scenario; a short sketch of switching modes follows the list. The supported modes are:
- best (default): use an agent (react or openai) with the query engine tool;
- context: use a retriever to fetch context;
- condense_question: condense the question;
- condense_plus_context: condense the question and use a retriever to fetch context;
- simple: a simple chat engine that uses the LLM directly;
- react: use a ReAct agent with the query engine tool;
- openai: use an OpenAI agent with the query engine tool.
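As a minimal sketch (not from the original notebook; the example question is illustrative), the same index and memory can be reused with a different mode such as condense_plus_context:
# Sketch: same index and memory, different chat mode
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    memory=memory,
)
print(chat_engine.chat("What open-source projects does Datawhale maintain?"))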
7. Storing and loading vector indexes
- storage_context.persist: stores the vector index to disk.
- load_index_from_storage: loads the vector index from storage.
# Storage vector index
persist_dir = 'data/'
index.storage_context.persist(persist_dir=persist_dir)
# Load vector index
from llama_index.core import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
index = load_index_from_storage(storage_context)
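A common pattern (a sketch, not from the original code; the storage directory name is only an example) is to load a persisted index when one exists and build it only once otherwise:
# Sketch: load the persisted index if present, otherwise build and persist it
import os

persist_dir = "storage"  # example directory, kept separate from the raw "data" folder
if os.path.exists(persist_dir):
    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    index = load_index_from_storage(storage_context)
else:
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=persist_dir)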
8. Streamlit application
This example also implements a Streamlit application; see app.py for the full code.
The required dependencies are listed below:
llama_index==0.10.62
streamlit==1.36.0
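The sketch below illustrates what such an app can look like. It is a simplified, hypothetical version built on the index pipeline above, not the actual app.py.
# Minimal Streamlit chat sketch (hypothetical, not the project's app.py)
import streamlit as st
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

st.title("Local RAG with LlamaIndex + Ollama")

@st.cache_resource  # build the index only once per server process
def build_chat_engine():
    Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
    Settings.llm = Ollama(model="llama3.1", request_timeout=360.0)
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    return index.as_chat_engine(chat_mode="context")

chat_engine = build_chat_engine()

if prompt := st.chat_input("Ask a question about your documents"):
    with st.chat_message("user"):
        st.write(prompt)
    with st.chat_message("assistant"):
        st.write(chat_engine.chat(prompt).response)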
Note
In app.py, to avoid reloading the models during a continuous conversation, you can configure the environment variables OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS so that the models stay loaded in the Ollama server. Because this keeps multiple models loaded, it requires at least an additional 8 GB of RAM.
- OLLAMA_NUM_PARALLEL: handle multiple requests to a single model at the same time.
- OLLAMA_MAX_LOADED_MODELS: load multiple models at the same time.
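For example (the values below are illustrative, not taken from the original project), the variables can be set before starting the Ollama server:
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_MAX_LOADED_MODELS=2
ollama serve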
Demo
- Single-document question answering (Q&A)
- Multi-document question answering (Q&A)
References: LlamaIndex Documentation