
ZEP-Graphiti: A Temporal Knowledge Graph Architecture for Agent Memory

Quick read

The Challenge of Agent Memory and Zep's Innovation

AI agents face memory bottlenecks in complex tasks. Traditional Large Language Model (LLM)-based agents are constrained by the context window, which makes it difficult to integrate long-term conversation history and dynamic data, limiting performance and making them prone to hallucinations. Zep is an innovative knowledge graph architecture whose core component, Graphiti, gives agents a powerful memory for coping with dynamic information environments. It addresses the difficulty existing RAG approaches have with temporal information and cross-session reasoning, providing more reliable memory support for enterprise applications.

 

Implementation: Graphiti - a time-aware knowledge graph engine

Graphiti realizes agent memory through multi-level knowledge graph construction, dynamic management of temporal information, and an efficient retrieval mechanism.

  1. Multi-level knowledge graph construction:
    • Episode subgraph: raw data (conversation messages) is stored as episode nodes containing the text and a timestamp.
    • Semantic entity subgraph: entities (people, places, products, etc.) extracted from the episode nodes become semantic entity nodes.
    • Community subgraph: entity nodes on the same topic are aggregated into community nodes.
    • This hierarchical structure improves scalability and lets agents understand information more efficiently (a minimal data-model sketch follows this list).
  2. Dynamic management of temporal information:
    • Bi-temporal modeling: records event time (timeline T) and data-ingestion time (timeline T′) and handles both absolute and relative time expressions.
    • Dynamic updates: the knowledge graph is updated in real time and temporal conflicts are resolved.
    • Graphiti's time-awareness keeps information accurate and up to date.
  3. Efficient retrieval:
    • Multi-method fusion: cosine similarity search, full-text search, and breadth-first search are combined to find relevant context quickly and accurately.
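To make the structure above concrete, here is a minimal sketch in Python of how the three node levels and a bi-temporally stamped fact edge might be represented. The class and field names are our own simplification for illustration and are not Graphiti's actual data model.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional
from uuid import uuid4


@dataclass
class EpisodeNode:
    """Raw input (e.g. a chat message), kept losslessly."""
    body: str
    t_ref: datetime        # timeline T: when the event happened
    t_ingested: datetime   # timeline T': when the system stored it
    uid: str = field(default_factory=lambda: str(uuid4()))


@dataclass
class EntityNode:
    """Semantic entity extracted from one or more episodes."""
    name: str
    summary: str
    uid: str = field(default_factory=lambda: str(uuid4()))


@dataclass
class CommunityNode:
    """Cluster of strongly connected entities plus a high-level summary."""
    name: str
    summary: str
    member_uids: List[str] = field(default_factory=list)
    uid: str = field(default_factory=lambda: str(uuid4()))


@dataclass
class FactEdge:
    """Relationship between two entities with bi-temporal validity fields."""
    source_uid: str
    target_uid: str
    fact: str
    t_valid: Optional[datetime] = None    # fact became true (timeline T)
    t_invalid: Optional[datetime] = None  # fact stopped being true (timeline T)
    t_created: Optional[datetime] = None  # system learned the fact (timeline T')
    t_expired: Optional[datetime] = None  # system invalidated the fact (timeline T')
```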

 

Enterprise application: a complete example workflow

Example: an enterprise customer-support agent


User input:
The user sends a message through the chat interface, "Hi, I'm having a problem with the HP LaserJet Pro MFP M28w printer I purchased last month, the printouts have ghosting. I've contacted tech support before and they told me that restarting the printer would fix the problem, but I've tried that and it still doesn't work."

1. Episode node ingestion:
Graphiti ingests the user's latest message into the knowledge graph as a new episode node with the current timestamp (e.g., February 20, 2024, 10:30 AM). At the same time, the system retrieves historical conversation episodes related to the printer problem.

2. Entity and relationship extraction:

  • Entity extraction:
    • Printer model: HP LaserJet Pro MFP M28w
    • Purchase date: January 10, 2024
    • Problem description: printed documents have ghosting
    • Previous contact: January 15, 2024
    • Previously suggested solution: restart the printer
  • Relationship extraction:
    • [User] purchased [Printer]
    • [Printer] has the problem [Printouts have ghosting]
    • [Restart printer] is [an attempted solution]

3. Community detection:
The system aggregates all entities and relationships related to the printer "HP LaserJet Pro MFP M28w" into one community for more efficient retrieval.

4. Dynamic information update:
Graphiti checks whether any information needs to be updated. Because the user reports that restarting the printer did not resolve the issue, the system updates the [Solution status] of [Restart printer] to [Unresolved].

5. Retrieve relevant context:
The system uses multiple search methods to find relevant information (a toy sketch of how the candidates can be merged follows this list):

  • Full-text search: retrieve historical conversations containing keywords such as "printer", "ghosting", and "HP LaserJet Pro MFP M28w".
  • Cosine similarity search: find conversations describing similar printing problems, such as "printouts are blurry" or "print colors are distorted".
  • Breadth-first search: starting from the [HP LaserJet Pro MFP M28w] community, find all common problems and recommended solutions related to this printer model.
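The toy sketch below illustrates how candidates from the three search methods could be merged for this query. The corpus, embeddings, and adjacency are fabricated stand-ins for illustration only; Zep uses BM25, a real embedding model, and the actual knowledge graph rather than these simplistic substitutes.

```python
import numpy as np

# Toy corpus of fact edges already in the graph (illustrative only).
facts = [
    "HP LaserJet Pro MFP M28w printouts have ghosting",
    "Restarting the printer did not resolve the ghosting issue",
    "User purchased the HP LaserJet Pro MFP M28w on 2024-01-10",
    "Cleaning the print head can fix ghosting on laser printers",
]
# Pretend embeddings; in Zep these come from a real embedding model.
rng = np.random.default_rng(0)
emb = {i: rng.normal(size=8) for i in range(len(facts))}

def fulltext_candidates(query: str) -> set[int]:
    """Stand-in for BM25: keyword overlap between the query and fact text."""
    q = set(query.lower().split())
    return {i for i, f in enumerate(facts) if q & set(f.lower().split())}

def cosine_candidates(query_vec: np.ndarray, k: int = 2) -> set[int]:
    """Stand-in for cosine-similarity search over fact embeddings."""
    sims = {i: float(v @ query_vec / (np.linalg.norm(v) * np.linalg.norm(query_vec)))
            for i, v in emb.items()}
    return set(sorted(sims, key=sims.get, reverse=True)[:k])

def bfs_candidates(seeds: set[int], adjacency: dict[int, set[int]]) -> set[int]:
    """Stand-in for breadth-first search: facts one hop from the seed facts."""
    out = set(seeds)
    for s in seeds:
        out |= adjacency.get(s, set())
    return out

adjacency = {0: {1, 2}, 3: {0}}  # toy graph neighbourhood
query = "HP LaserJet ghosting printouts"
candidates = (fulltext_candidates(query)
              | cosine_candidates(rng.normal(size=8))
              | bfs_candidates({0}, adjacency))
print([facts[i] for i in sorted(candidates)])
```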

6. Generate a response:
The agent generates the following response based on the retrieved context:

"Hello! Thank you for your feedback. We understand that the HP LaserJet Pro MFP M28w printer you purchased is printing documents with heavy shadows and you have tried restarting the printer before, but the problem still persists. In order to better resolve your issue, we suggest you try the following steps:

  1. Check the cartridges: Ensure that the cartridges are installed correctly and that the ink level is sufficient.
  2. Clean the print head: A clogged print head may cause ghosting, try cleaning the print head (refer to the printer's manual for the procedure).
  3. Update the driver: Please go to the HP website to download and install the latest printer driver.

If the problem persists after you try the above steps, please provide the following information so we can troubleshoot further:

  • What version of operating system are you using?
  • Does the ghosting problem occur in all documents or in specific documents?

Thank you for your cooperation!"

7. Knowledge update:
If the user solves the problem by replacing the cartridge, the agent records [Replace cartridge] as an effective solution to the [Printouts have ghosting] problem and updates the timestamps of the related entities and relationships for future reference.

Experimental results and future prospects

In the LongMemEval evaluation with the gpt-4o model, Zep improves accuracy by 18.5% over the baseline while reducing response latency by 90%. These results demonstrate Zep's significant advantages in memory accuracy, dynamic updating, efficient retrieval, and scalability.

Future research directions include:

  • Model fine-tuning: optimizing the accuracy and efficiency of knowledge extraction.
  • Domain Ontology Integration: Enhanced understanding and reasoning about specific domains.
  • New benchmark development: advancing the evaluation of memory systems.

Related project: Graphiti, a dynamic knowledge graph construction and query tool (a time-aware long-term memory solution)


Abstract

We introduce Zep, a novel memory layer service for AI agents that outperforms the current state-of-the-art system, MemGPT, on the Deep Memory Retrieval (DMR) benchmark. In addition, Zep performs well in evaluations that are more comprehensive and challenging than DMR and that better reflect real-world enterprise use cases. While existing Large Language Model (LLM)-based Retrieval-Augmented Generation (RAG) frameworks are limited to static document retrieval, enterprise applications need to dynamically integrate knowledge from a variety of sources, including ongoing conversations and business data. Zep addresses this fundamental limitation with its core component, Graphiti, a temporally aware knowledge graph engine that dynamically synthesizes unstructured conversational data and structured business data while maintaining historical relationships. On the DMR benchmark established by the MemGPT team, Zep demonstrates superior performance (94.8% vs. 93.4%). Beyond DMR, Zep's capabilities were further validated on the more demanding LongMemEval benchmark, which better reflects enterprise use cases through complex temporal reasoning tasks. In this evaluation, Zep improves accuracy by up to 18.5% while reducing response latency by 90% compared with the baseline implementation. These gains are particularly significant on enterprise-critical tasks such as cross-session information synthesis and long-term context maintenance, demonstrating Zep's effectiveness in real-world applications.

 

1. Introduction

In recent years, the impact of Transformer-based Large Language Models (LLMs) on industry and the research community has attracted considerable attention [1]. One of the main applications of LLMs is the development of chat-based agents. However, the capabilities of these agents are limited by the LLM context window, effective context utilization, and the knowledge acquired during pre-training. Additional context is therefore needed to provide out-of-domain (OOD) knowledge and reduce hallucinations.

Retrieval-Augmented Generation (RAG) has become a key area of interest in LLM applications. RAG uses Information Retrieval (IR) techniques pioneered over the last fifty years [2] to provide the necessary domain knowledge to LLMs.

Current RAG approaches focus on broad domain knowledge and relatively static corpora, i.e., the content of documents added to the corpus rarely changes. For agents to become pervasive in our daily lives, autonomously solving problems from the trivial to the highly complex, they will need access to a large, continually evolving corpus of data derived from users' interactions with the agent, together with related business and world data. We believe that giving agents this kind of broad, dynamic "memory" is a key component of realizing this vision, and we do not believe current RAG approaches are suited to this future. Since entire dialog histories, business datasets, and other domain-specific content cannot fit effectively within LLM context windows, new approaches to agent memory need to be developed. Adding memory to LLM-driven agents is not a new idea; the concept has been explored previously in MemGPT [3].

Recently, knowledge graphs (KGs) have been used to augment RAG architectures to address many of the shortcomings of traditional IR techniques [4]. In this paper, we introduce Zep [5], a memory layer service powered by Graphiti [6], a dynamic, temporally aware knowledge graph engine. Zep ingests and synthesizes both unstructured message data and structured business data. The Graphiti KG engine dynamically updates the knowledge graph in a non-lossy manner, preserving facts and relationships on a timeline, including their expiration dates. This approach enables the knowledge graph to represent a complex, evolving world.

Since Zep is a production system, we place great importance on the accuracy, latency, and scalability of its memory retrieval mechanisms. We evaluate the effectiveness of these mechanisms using two existing benchmarks: the Deep Memory Retrieval task (DMR) of MemGPT [3] and the LongMemEval benchmark [7].

 

2. Knowledge graph

In Zep, memory is supported by a dynamic, temporally aware knowledge graph G = (N, E, ϕ), where N denotes nodes, E denotes edges, and ϕ: E → N × N is the formal incidence function. The graph contains three hierarchical subgraphs: the episode subgraph, the semantic entity subgraph, and the community subgraph.

- Episode subgraph Ge:

Episode nodes (episodes), ni ∈ Ne, contain raw input data in the form of messages, text, or JSON. Episodes are stored as a non-lossy data store from which semantic entities and relationships are extracted. Episode edges, ei ∈ Ee ⊆ ϕ*(Ne × Ns), connect episodes to the semantic entities they reference.

- Semantic entity subgraph Gs:

The semantic entity subgraph is built on top of the episode subgraph. Entity nodes (entities), ni ∈ Ns, represent entities extracted from episodes and resolved against existing graph entities. Entity edges (semantic edges), ei ∈ Es ⊆ ϕ*(Ns × Ns), represent relationships between entities extracted from episodes.

- Community subgraph Gc:

The community subgraph forms the top level of the Zep knowledge graph. Community nodes (communities), ni ∈ Nc, represent clusters of strongly connected entities. Communities contain high-level summaries of these clusters and provide a more comprehensive, interconnected view of the structure of Gs. Community edges, ei ∈ Ec ⊆ ϕ*(Nc × Ns), connect communities to their entity members.

The dual storage of raw episodic data and derived semantic entity information mirrors psychological models of human memory. These models distinguish between episodic memory, which represents discrete events, and semantic memory, which captures associations between concepts and their meanings [8]. This approach allows LLM agents using Zep to develop more complex and subtle memory structures that better align with our understanding of the human memory system. Knowledge graphs provide an effective medium for representing these memory structures, and our distinct episodic and semantic subgraphs draw on a similar approach in AriGraph [9].

Our use of community nodes to represent high-level structures and domain concepts builds on the work of GraphRAG [4], enabling a more comprehensive global understanding of the domain. The resulting hierarchical organization, from episodes to facts to entities to communities, extends existing hierarchical RAG strategies [10][11].

2.1 Episodes

Zep's graph construction starts by ingesting raw data units called episodes. Episodes can be one of three core types: message, text, or JSON. While each type requires specific processing during graph construction, this paper focuses on the message type because our experiments concentrate on conversational memory. In our context, a message consists of a relatively short piece of text (several messages can fit within an LLM context window) together with the associated participant who produced it.

Each message includes a reference timestamp, tref, indicating when the message was sent. This temporal information allows Zep to accurately recognize and extract relative or partial dates mentioned in the message content (e.g., "next Thursday", "in two weeks", or "last summer"). Zep implements a bi-temporal model in which timeline T represents the chronological ordering of events and timeline T′ represents the transactional order of Zep's data ingestion. While the T′ timeline serves the traditional purpose of database auditing, the T timeline provides an additional dimension for modeling the dynamic nature of conversational data and memory. This bi-temporal approach represents a novel advance in LLM-based knowledge graph construction and underlies Zep's unique capabilities compared with previous graph-based RAG proposals.
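As a toy illustration of the two timelines, the snippet below anchors a relative time expression to the reference timestamp tref and records the transaction time separately. In Zep the dating itself is performed by an LLM prompt (see the appendix); this deterministic calculation, written only for this article, merely shows what the resulting timestamps represent.

```python
from datetime import datetime, timedelta, timezone

# Reference timestamp t_ref: when the message was sent (timeline T).
t_ref = datetime(2024, 2, 20, 10, 30, tzinfo=timezone.utc)

# "I started my new job two weeks ago" -> anchor the relative phrase to t_ref.
t_valid = t_ref - timedelta(weeks=2)

# Transaction time t'_created (timeline T'): when the system ingested the fact.
t_created = datetime.now(timezone.utc)

print(f"valid_at={t_valid.isoformat()}  created_at={t_created.isoformat()}")
```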

Episode edges, Ee, connect episodes to the entity nodes extracted from them. Episodes and their derived semantic edges maintain bidirectional indices that track the relationship between an edge and its source episode. This design reinforces the lossless nature of Graphiti's episode subgraph by enabling both forward and backward traversal: semantic artifacts can be traced back to their sources for references or citations, and episodes can quickly retrieve their associated entities and facts. While these connections are not directly examined in the experiments of this paper, they will be explored in future work.

2.2 Semantic entities and facts

2.2.1 Entities

Entity extraction is the initial phase of episode processing. During ingestion, the system processes the current message content along with the last n messages to provide context for named entity recognition. For this paper and the general Zep implementation, n = 4, giving two complete conversation turns for context. Given our focus on message processing, the speaker is automatically extracted as an entity. After the initial entity extraction, we apply a reflection technique inspired by Reflexion [12] to minimize hallucinations and improve extraction coverage. The system also extracts an entity summary from the episode to facilitate subsequent entity resolution and retrieval.

After extraction, the system embeds each entity name into a 1024-dimensional vector space. This embedding enables retrieval of similar nodes through a cosine similarity search over existing entity nodes. The system also performs a separate full-text search over existing entity names and summaries to identify additional candidate nodes. These candidates, together with the episode context, are then passed to the LLM with our entity resolution prompt. When the system identifies a duplicate entity, it generates an updated name and summary.
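A minimal sketch of this candidate-generation step: embed the new entity name, compare it with existing entity-name embeddings by cosine similarity, and pass the close matches (together with full-text hits) to the LLM resolution prompt. The trigram embedding and threshold here are toy stand-ins for the 1024-dimensional model embeddings Zep actually uses.

```python
import numpy as np

def embed(name: str, dim: int = 256) -> np.ndarray:
    """Toy character-trigram embedding; Zep embeds names with a real model."""
    v = np.zeros(dim)
    s = f"  {name.lower()}  "
    for i in range(len(s) - 2):
        v[hash(s[i:i + 3]) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

existing_entities = {
    "HP LaserJet Pro MFP M28w": embed("HP LaserJet Pro MFP M28w"),
    "User": embed("User"),
}

def duplicate_candidates(new_name: str, threshold: float = 0.4) -> list[str]:
    """Existing entity names close enough to be checked by the LLM prompt."""
    q = embed(new_name)
    scored = {name: float(q @ vec) for name, vec in existing_entities.items()}
    return [n for n, s in sorted(scored.items(), key=lambda kv: -kv[1]) if s >= threshold]

print(duplicate_candidates("HP LaserJet M28w printer"))
# Likely: ['HP LaserJet Pro MFP M28w'] -> passed to the entity-resolution prompt
```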

After entity extraction and resolution, the system integrates the data into the knowledge graph using predefined Cypher queries. We chose this approach over LLM-generated database queries to ensure a consistent schema and to reduce the possibility of hallucinations.
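For illustration, a parameterized upsert of an entity and a fact edge through the official neo4j Python driver might look like the sketch below. These are generic Cypher templates written for this example, not Zep's actual predefined queries, and the connection details are placeholders.

```python
from neo4j import GraphDatabase  # pip install neo4j

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Fixed, parameterized Cypher templates: the LLM supplies values, never query text.
UPSERT_ENTITY = """
MERGE (e:Entity {uuid: $uuid})
SET e.name = $name,
    e.summary = $summary
"""

UPSERT_FACT = """
MATCH (a:Entity {uuid: $src}), (b:Entity {uuid: $dst})
MERGE (a)-[r:RELATES_TO {uuid: $edge_uuid}]->(b)
SET r.fact = $fact,
    r.valid_at = $valid_at
"""

with driver.session() as session:
    session.run(UPSERT_ENTITY, uuid="e-1", name="HP LaserJet Pro MFP M28w",
                summary="Laser printer purchased by the user in January 2024")
    session.run(UPSERT_ENTITY, uuid="e-2", name="User",
                summary="Customer reporting a printing problem")
    session.run(UPSERT_FACT, src="e-2", dst="e-1", edge_uuid="f-1",
                fact="User purchased the HP LaserJet Pro MFP M28w",
                valid_at="2024-01-10T00:00:00Z")
driver.close()
```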

Selected prompts used in graph construction are provided in the appendix.

2.2.2 Facts

The system also extracts facts from each episode as edges between the entities they relate, with each fact containing its key predicate. The same fact can be extracted multiple times between different entities, enabling Graphiti to model complex multi-entity facts by implementing hyperedges.

After extraction, the system generates embeddings for facts in preparation for graph integration, then performs edge de-duplication through a process similar to entity resolution. The hybrid search for related edges is restricted to edges that exist between the same pair of entities as the new edge. This restriction not only prevents erroneous merging of similar edges between different entity pairs, but also significantly reduces the computational complexity of de-duplication by limiting the search space to a subset of edges associated with a particular entity pair.

2.2.3 Temporal extraction and edge invalidation

A key differentiating feature of Graphiti compared with other knowledge graph engines is its ability to manage dynamic information updates through temporal extraction and edge invalidation.

The system uses tref to extract temporal information about facts from the episode context. This enables accurate extraction and dating of both absolute time expressions (e.g., "Alan Turing was born on June 23, 1912") and relative ones (e.g., "I started my new job two weeks ago"). Consistent with the bi-temporal modeling approach, the system tracks four timestamps: t′created and t′expired ∈ T′ record when a fact is created or invalidated in the system, while tvalid and tinvalid ∈ T track the time range over which the fact held. These temporal data points are stored on the edge alongside the other fact information.

The introduction of new edges can invalidate existing edges in the database. The system uses an LLM to compare new edges against semantically related existing edges to identify potential contradictions. When a temporal conflict is identified, the system invalidates the affected edge by setting its tinvalid to the tvalid of the invalidating edge. In line with the transaction timeline T′, Graphiti always gives priority to new information when determining edge invalidation.
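A minimal sketch of this invalidation rule, assuming the contradiction itself has already been flagged by the LLM comparison step: the contradicted edge's tinvalid is set to the tvalid of the new edge, and the transaction timeline records when the system expired it. The class and function names are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class FactEdge:
    fact: str
    t_valid: Optional[datetime] = None
    t_invalid: Optional[datetime] = None
    t_expired: Optional[datetime] = None  # transaction timeline T'


def invalidate_if_contradicted(existing: FactEdge, new: FactEdge,
                               contradicts: bool) -> None:
    """Close the existing edge's validity interval at the moment the new fact
    became true, and record the transaction time at which it was expired."""
    if contradicts and existing.t_invalid is None:
        existing.t_invalid = new.t_valid                  # timeline T
        existing.t_expired = datetime.now(timezone.utc)   # timeline T'


old = FactEdge("Restarting the printer resolves the ghosting issue",
               t_valid=datetime(2024, 1, 15, tzinfo=timezone.utc))
new = FactEdge("Restarting the printer did not resolve the ghosting issue",
               t_valid=datetime(2024, 2, 20, 10, 30, tzinfo=timezone.utc))

# In Zep the contradiction check is made by an LLM; here it is a given flag.
invalidate_if_contradicted(old, new, contradicts=True)
print(old.t_invalid)  # 2024-02-20 10:30:00+00:00
```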

This comprehensive approach allows data to be dynamically added to Graphiti as conversations evolve, while maintaining the current state of the relationship and a history of relationships that have evolved over time.

2.3 Community

After building the episode and semantic subgraphs, the system constructs the community subgraph through community detection. While our community detection approach builds on the techniques described in GraphRAG [4], we employ the label propagation algorithm [13] rather than the Leiden algorithm [14]. This choice is motivated by label propagation's straightforward dynamic extension, which allows the system to maintain accurate community representations for longer as new data enters the graph, deferring the need for a full community refresh.

Dynamic expansion implements the logic of a single recursive step of label propagation. When the system adds a new entity node ni ∈ Ns to the graph, it surveys the communities of the neighboring nodes. The new node is assigned to the community held by the majority of its neighbors, and the community summary and graph are updated accordingly. While this dynamic update keeps community expansion efficient as data flows into the system, the resulting communities gradually drift from those a full label propagation run would produce, so periodic community refreshes remain necessary. Nevertheless, this dynamic updating strategy is a practical heuristic that significantly reduces latency and LLM inference cost.
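The dynamic assignment step amounts to a single label-propagation vote over the new node's neighbors, as in the sketch below. The node names, adjacency, and community ids are fabricated for the printer example.

```python
from collections import Counter
from typing import Dict, Optional, Set

# community[node] -> community id; adjacency[node] -> neighbouring nodes.
community: Dict[str, int] = {
    "printer issues": 1, "ghosting": 1, "toner cartridge": 1, "user account": 2,
}
adjacency: Dict[str, Set[str]] = {
    "HP LaserJet Pro MFP M28w": {"printer issues", "ghosting", "user account"},
}

def assign_community(new_node: str) -> Optional[int]:
    """Single recursive label-propagation step: adopt the community label held
    by the majority of the new node's neighbours (ties broken arbitrarily)."""
    labels = [community[n] for n in adjacency.get(new_node, set()) if n in community]
    if not labels:
        return None  # no labelled neighbours yet; a periodic full refresh will place it
    winner, _ = Counter(labels).most_common(1)[0]
    community[new_node] = winner
    return winner

print(assign_community("HP LaserJet Pro MFP M28w"))  # -> 1 (two of three neighbours)
```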

Following [4], our community nodes contain summaries of their member nodes produced through iterative map-reduce-style summarization. However, our retrieval approach differs substantially from GraphRAG's map-reduce approach [4]. To support it, we generate community names containing key terms and relevant topics from the community summaries. These names are embedded and stored to enable cosine similarity search.

 

3. Memory retrieval

Zep's memory retrieval system provides powerful, sophisticated, and highly configurable functionality. Overall, the Zep graph search API implements a function f: S → S, which accepts a text-string query α ∈ S as input and returns a text-string context β ∈ S as output. The output β contains the formatted data from nodes and edges that an LLM agent needs to generate an accurate response to the query α. The process f(α) → β consists of three distinct steps:

- Search (φ): The process first identifies candidate nodes and edges that may contain relevant information. Although Zep employs several distinct search methods, the overall search function can be expressed as φ: S → Es^n × Ns^n × Nc^n. Thus, φ converts a query into a 3-tuple containing lists of semantic edges, entity nodes, and community nodes, the three graph object types that carry relevant textual information.

- Reranker (ρ): The second step reranks the search results. The reranker function or model takes the lists of search results and produces reordered versions of them: ρ: Es^n × Ns^n × Nc^n → Es^n × Ns^n × Nc^n.

- Constructor (χ): In the final step, the constructor converts the relevant nodes and edges into a text context: χ: Es^n × Ns^n × Nc^n → S. For each ei ∈ Es, χ returns the fact together with its tvalid and tinvalid fields; for each ni ∈ Ns, it returns the name and summary fields; and for each ni ∈ Nc, it returns the summary field.

With these definitions in place, f can be expressed as the composition of the three components: f(α) = χ(ρ(φ(α))) = β.
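Read as code, the pipeline is simply the composition of three functions. The sketch below uses trivial stand-ins for search, reranking, and construction to show how f(α) = χ(ρ(φ(α))) fits together; none of the function bodies reflect Zep's real implementations.

```python
from typing import List, Tuple

SearchResult = Tuple[List[str], List[str], List[str]]  # (edges, entities, communities)

def phi(query: str) -> SearchResult:
    """Search: gather candidate semantic edges, entity nodes, and community nodes
    (toy stand-in for BM25 + cosine similarity + breadth-first search)."""
    edges = ["User purchased HP LaserJet Pro MFP M28w (2024-01-10 - present)",
             "Restarting the printer did not resolve the ghosting (2024-02-20 - present)"]
    entities = ["HP LaserJet Pro MFP M28w: laser printer owned by the user"]
    communities = ["Printer support: issues and fixes for the user's printer"]
    return edges, entities, communities

def rho(result: SearchResult, query: str) -> SearchResult:
    """Rerank: order each candidate list by naive keyword overlap with the query."""
    def score(text: str) -> int:
        return len(set(query.lower().split()) & set(text.lower().split()))
    edges, entities, communities = result
    return (sorted(edges, key=score, reverse=True),
            sorted(entities, key=score, reverse=True),
            sorted(communities, key=score, reverse=True))

def chi(result: SearchResult) -> str:
    """Construct: render facts, entity summaries, and community summaries."""
    edges, entities, communities = result
    return "\n".join(["<FACTS>", *edges, "</FACTS>",
                      "<ENTITIES>", *entities, "</ENTITIES>",
                      "<COMMUNITIES>", *communities, "</COMMUNITIES>"])

def f(query: str) -> str:
    # f(alpha) = chi(rho(phi(alpha))) = beta
    return chi(rho(phi(query), query))

print(f("ghosting on HP LaserJet printouts"))
```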

Example context string template:

FACTS and ENTITIES represent context relevant to the current conversation.
These are the most relevant facts and their valid date ranges. If a fact is about an event, the event occurred during this period.
Format: FACT (date range: from - to)
<FACTS>
{facts}
</FACTS>
These are the most relevant entities
ENTITY_NAME: entity summary
<ENTITIES>
{entities}
</ENTITIES>

3.1 Search

Zep implements three search functions: cosine semantic similarity search (φcos), Okapi BM25 full-text search (φbm25), and breadth-first search (φbfs). The first two use Neo4j's Lucene-backed implementations [15][16]. Each search function offers different strengths in identifying relevant documents, and together they provide comprehensive coverage of candidate results before reranking. The search fields differ across the three object types: for Es we search the fact field; for Ns, the entity name; and for Nc, the community name, which includes the relevant keywords and phrases covered by the community. Although developed independently, our community search parallels LightRAG's high-level key search approach [17]. Combining LightRAG's approach with graph-based systems such as Graphiti is a promising direction for future research.

While cosine similarity and full-text search are well established in RAG [18], breadth-first search over knowledge graphs has received limited attention in the RAG domain, with notable exceptions in graph-based RAG systems such as AriGraph [9] and Distill-SynthKG [19]. In Graphiti, the breadth-first search identifies additional nodes and edges within n hops of the initial search results to enrich them. Moreover, φbfs can accept nodes as search parameters, providing greater control over the search. This capability is particularly valuable when recent episodes are used as seeds for the breadth-first search, allowing the system to incorporate recently mentioned entities and relationships into the retrieved context.

Each of these three search methods targets a different aspect of similarity: full-text search identifies word similarity, cosine similarity captures semantic similarity, and breadth-first search reveals contextual similarity - closer nodes and edges in the graph appear in more similar conversation contexts. This multifaceted approach to candidate outcome identification maximizes the likelihood of discovering the best context.

3.2 Reranker

While the initial search aims for high recall, the reranker improves precision by prioritizing the most relevant results. Zep supports existing reranking approaches such as Reciprocal Rank Fusion (RRF) [20] and Maximal Marginal Relevance (MMR) [21]. Zep also implements a graph-based episode-mentions reranker that prioritizes results by how frequently an entity or fact has been mentioned in the conversation, making frequently referenced information more readily accessible, and a node-distance reranker that orders results by their graph distance from a designated central node, localizing context to a specific region of the knowledge graph. The most sophisticated reranking capability employs a cross-encoder, which generates relevance scores by evaluating nodes and edges against the query with cross-attention, though this method has the highest computational cost.
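As a concrete example of one of these rerankers, here is a minimal Reciprocal Rank Fusion sketch: each candidate receives a score of 1/(k + rank) from every result list that contains it, so items ranked well by several search methods float to the top. The fact identifiers are invented for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked candidate lists: each item scores sum(1 / (k + rank)),
    so items ranked highly by multiple search methods rise to the top."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25   = ["fact-ghosting", "fact-purchase", "fact-restart"]
cosine = ["fact-restart", "fact-ghosting", "fact-cartridge"]
bfs    = ["fact-cartridge", "fact-ghosting"]

print(reciprocal_rank_fusion([bm25, cosine, bfs]))
# 'fact-ghosting' wins: it appears near the top of all three lists.
```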

 

4. Experiments

This section analyzes two experiments conducted on LLM memory benchmarks. The first evaluation uses the Deep Memory Retrieval (DMR) task developed in [3], which uses a 500-conversation subset of the multi-session chat dataset introduced in "Beyond Goldfish Memory: Long-Term Open-Domain Conversation" [22]. The second evaluation uses the benchmark from "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory" [7]. Specifically, we use the LongMemEval_s dataset, which provides extensive conversational context with an average length of about 115,000 tokens.

For both experiments, we integrated the dialog history into the Zep knowledge graph via Zep's API. We then retrieve the 20 most relevant edges (facts) and entity nodes (entity summaries) using the techniques described in Section 3. The system reformats this data into context strings that match the functionality provided by Zep's Memory API.

While these experiments demonstrate Graphiti's key search capabilities, they represent a subset of the system's complete search functionality. This focused scope allows for clear comparisons with existing benchmark tests, while preserving the additional ability to explore the knowledge graph for future work.

4.1 Model selection

Our experimental implementation uses BAAI's BGE-m3 model for the reranking and embedding tasks [23][24]. For graph construction we use gpt-4o-mini-2024-07-18, and for response generation the chat agents use gpt-4o-mini-2024-07-18 and gpt-4o-2024-11-20 to answer from the provided context.

To ensure direct comparability with MemGPT's DMR results, we also performed a DMR evaluation using gpt-4-turbo-2024-04-09.

The experimental notebooks will be publicly available via our GitHub repository, and the relevant experimental prompts are included in the appendix.

Table 1: Deep Memory Retrieval (DMR)

Memory                      Model         Score
Recursive summarization†    gpt-4-turbo   35.3%
Conversation summaries      gpt-4-turbo   78.6%
Conversation summaries      gpt-4o-mini   88.0%
Full conversation           gpt-4-turbo   94.4%
Full conversation           gpt-4o-mini   98.0%
MemGPT†                     gpt-4-turbo   93.4%
Zep                         gpt-4-turbo   94.8%
Zep                         gpt-4o-mini   98.2%

† Results are reported in [3].

4.2 Deep Memory Retrieval (DMR)

The Deep Memory Retrieval evaluation, introduced by [3], consists of 500 multi-session conversations, each containing 5 chat sessions with up to 12 messages per session. Each conversation includes a question-answer pair for memory evaluation. The MemGPT framework [3] currently leads this benchmark with an accuracy of 93.4% using gpt-4-turbo, a significant improvement over the 35.3% recursive-summarization baseline.

To establish comparison baselines, we implemented two common LLM memory approaches: full conversation context and session summaries. Using gpt-4-turbo, the full-conversation baseline achieves 94.4% accuracy, slightly above MemGPT's reported result, while the session-summary baseline reaches 78.6%. Both approaches improve with gpt-4o-mini: 98.0% for the full conversation and 88.0% for session summaries. We were unable to reproduce MemGPT's results with gpt-4o-mini because its published work lacks sufficient methodological detail.

We then evaluated Zep's performance by ingesting the conversations and using its search functionality to retrieve the 10 most relevant nodes and edges. An LLM judge compares the agent's response with the provided correct answer. Zep achieves 94.8% accuracy with gpt-4-turbo and 98.2% with gpt-4o-mini, marginal improvements over MemGPT and the corresponding full-conversation baselines. However, these results must be placed in context: each conversation contains only about 60 messages and fits easily within current LLM context windows.

The limitations of the DMR evaluation go beyond its small size. Our analysis reveals significant weaknesses in the benchmark's design. The evaluation relies exclusively on single-turn fact-retrieval questions and cannot assess complex memory understanding. Many questions use vague wording, invoking concepts such as a "favorite relaxation drink" or "weird hobby" that were never described in those terms in the conversation. Crucially, the dataset poorly represents real-world enterprise use cases for LLM agents. The high scores achieved by the simple full-context approach with modern LLMs further highlight the benchmark's inadequacy for evaluating memory systems.

This shortcoming is further underscored by the findings in [7], which show that LLM performance on the LongMemEval benchmark degrades rapidly as conversation length increases. The LongMemEval dataset [7] addresses these weaknesses by providing longer, more coherent conversations that better reflect enterprise scenarios, along with a more diverse set of evaluation questions.

4.3 LongMemEval (LME)

We evaluated Zep using the LongMemEval_s dataset, which provides conversations and questions representative of LLM agents in real-world enterprise applications. The LongMemEval_s dataset poses a significant challenge to existing LLM and commercial memory solutions [7], with conversations averaging roughly 115,000 tokens in length. This length, while substantial, still fits within the context windows of current frontier models, allowing us to establish a meaningful full-context baseline against which to evaluate Zep.

The dataset contains six question types: single-session user, single-session assistant, single-session preference, multi-session, knowledge update, and temporal reasoning. These categories are not uniformly distributed in the dataset; for details, we refer the reader to [7].

We conducted all experiments between December 2024 and January 2025, using a consumer laptop at a residential location in Boston, MA to connect to Zep's service hosted in AWS us-west-2. This distributed setup introduces additional network latency into Zep's measured performance; that latency is not present in our baseline evaluations.

For answer evaluation, we used GPT-4o with the question-type-specific prompts provided in [7], which have been shown to correlate highly with human evaluators.

4.3.1 LongMemEval and MemGPT

To establish a comparison between Zep and the current state-of-the-art MemGPT system [3], we attempted to evaluate MemGPT on the LongMemEval dataset. Since the current MemGPT framework does not support direct ingestion of an existing message history, we implemented a workaround by adding the conversation messages to the archival history. However, we were unable to obtain successful question answering with this approach. We look forward to seeing this benchmark evaluated by other research teams, as comparable performance data would benefit the broader development of LLM memory systems.

4.3.2 LongMemEval results

Zep demonstrates significant improvements in both accuracy and latency compared with the baselines. With gpt-4o-mini, Zep improves accuracy by 15.2% over the baseline, and with gpt-4o by 18.5%. The much smaller prompts also reduce latency and cost substantially compared with the baseline implementation.

Table 2: LongMemEval_s

Memory         Model        Score   Latency   Latency IQR   Avg. context tokens
Full context   gpt-4o-mini  55.4%   31.3 s    8.76 s        115k
Zep            gpt-4o-mini  63.8%   3.20 s    1.31 s        1.6k
Full context   gpt-4o       60.2%   28.9 s    6.01 s        115k
Zep            gpt-4o       71.2%   2.58 s    0.684 s       1.6k

Analysis by question type shows that, with gpt-4o-mini, Zep improves over the baseline in four of the six categories, with the largest gains on the complex question types: single-session preference, multi-session, and temporal reasoning. With gpt-4o, Zep additionally improves in the knowledge-update category, highlighting that it is more effective when paired with more capable models. However, additional development may be needed to help less capable models make use of Zep's temporal data.

Table 3: LongMemEval_s results by question type

Question type              Model        Full context   Zep      Delta
Single-session preference  gpt-4o-mini  30.0%          53.3%    77.7%↑
Single-session assistant   gpt-4o-mini  81.8%          75.0%    9.06%↓
Temporal reasoning         gpt-4o-mini  36.5%          54.1%    48.2%↑
Multi-session              gpt-4o-mini  40.6%          47.4%    16.7%↑
Knowledge update           gpt-4o-mini  76.9%          74.4%    3.36%↓
Single-session user        gpt-4o-mini  81.4%          92.9%    14.1%↑
Single-session preference  gpt-4o       20.0%          56.7%    184%↑
Single-session assistant   gpt-4o       94.6%          80.4%    17.7%↓
Temporal reasoning         gpt-4o       45.1%          62.4%    38.4%↑
Multi-session              gpt-4o       44.3%          57.9%    30.7%↑
Knowledge update           gpt-4o       78.2%          83.3%    6.52%↑
Single-session user        gpt-4o       81.4%          92.9%    14.1%↑

These results demonstrate Zep's ability to improve performance across model scales, with the most significant gains on complex and nuanced question types when paired with more capable models. The latency improvements are especially notable, with Zep reducing response times by approximately 90% while maintaining higher accuracy.

The performance regression on the single-session-assistant questions, 17.7% for gpt-4o and 9.06% for gpt-4o-mini, is a notable exception to Zep's otherwise consistent improvements and indicates that further research and engineering work is needed.

 

5. Conclusion

We have presented Zep, a graph-based approach to LLM memory that combines semantic and episodic memory with entity and community summarization. Our evaluation shows that Zep achieves state-of-the-art performance in existing memory benchmarks, while reducing token cost and operating at significantly lower latency.

While the results achieved by Graphiti and Zep are impressive, they may only be preliminary advances in graph-based memory systems. Multiple avenues of research could be built upon these two frameworks, including the integration of other GraphRAG methods into the Zep paradigm, as well as novel extensions of our work.

Research has demonstrated the value of fine-tuned models for LLM entity and edge extraction in the GraphRAG paradigm, improving accuracy while reducing cost and latency [19][25]. Similarly, fine-tuning models for Graphiti's prompts may enhance knowledge extraction, especially for complex conversations. Furthermore, while current research on LLM-generated knowledge graphs has mostly operated without formal ontologies [9][4][17][19][26], domain-specific ontologies have significant potential. Graph ontologies, foundational in pre-LLM knowledge graph work, deserve further exploration within the Graphiti framework.

Our search for suitable memory benchmarks revealed limited options; existing benchmarks often lack robustness and sophistication, frequently defaulting to simple needle-in-a-haystack fact-retrieval questions [3]. The field needs additional memory benchmarks, especially ones reflecting business applications such as customer-experience tasks, to effectively evaluate and differentiate memory approaches. Notably, existing benchmarks are insufficient for assessing Zep's ability to process and synthesize conversation history together with structured business data. While Zep focuses on LLM memory, its traditional RAG capabilities should also be evaluated against the established benchmarks in [17], [27], and [28].

The scalability of production systems, including cost and latency, has not been adequately addressed in the current literature on LLM memories and RAG systems. We include latency benchmarking of retrieval mechanisms to begin to address this gap, following the example of the authors of LightRAG in prioritizing these metrics.

6. Appendix

6.1 Graph construction prompts

6.1.1 Entity extraction

<CURRENT MESSAGE>
{current_message}
</CURRENT MESSAGE>
Based on the conversation above, extract entity nodes that are explicitly or implicitly mentioned in the CURRENT MESSAGE:
Guidelines:
1. Always extract the speaker/actor as the first node. The speaker is the part before the colon in each line of dialogue.
2. Extract other significant entities, concepts, or actors mentioned in the CURRENT MESSAGE.
3. Do not create nodes for relationships or actions.
4. Do not create nodes for temporal information such as dates, times, or years (these will be added to edges later).
5. Be as explicit as possible when naming nodes; use full names.
6. Do not extract only the entities that are explicitly mentioned

 

6.1.2 Entity resolution

<PREVIOUS MESSAGES>
{previous_messages}
</PREVIOUS MESSAGES>
<CURRENT MESSAGE>
{current_message}
</CURRENT MESSAGE>
<EXISTING NODES>
{existing_nodes}
</EXISTING NODES>
Given the existing nodes, the message, and the previous messages above, determine whether the new node extracted from the conversation is a duplicate of one of the existing nodes.
<NEW NODE>
{new_node}
</NEW NODE>
Task:
1. If the new node represents the same entity as any of the existing nodes, return 'is_duplicate: true' in the response. Otherwise, return 'is_duplicate: false'.
2. If is_duplicate is true, also return the uuid of the existing node in the response.
3. If is_duplicate is true, return the most complete full name of the node.
Guidelines:
1. Use both the name and the summary of the nodes to determine whether the entities are duplicates; duplicate nodes may have different names.

6.1.3 Fact extraction

<PREVIOUS MESSAGES>
{previous_messages}
</PREVIOUS MESSAGES>
<CURRENT MESSAGE>
{current_message}
</CURRENT MESSAGE>
<ENTITIES>
{entities}
</ENTITIES>
Given the above messages and entities, extract all facts related to the listed entities from the current message.
Guidelines:
1. Only extract facts between the provided entities.
2. Each fact should represent a clear relationship between two distinct nodes.
3. The relation_type should be a concise, all-caps description of the fact (e.g., LOVES, IS_FRIENDS_WITH,
WORKS_FOR).
4. Provide a more detailed fact containing all relevant information.
5. Consider the temporal aspects of the relationship, if relevant.

6.1.4 Fact resolution

Given the following context, determine whether the NEW EDGE represents the same information as any edge in the list of EXISTING EDGES.
<EXISTING EDGES>
{existing_edges}
</EXISTING EDGES>
<NEW EDGE>
{new_edge}
</NEW EDGE>
Task:
If the NEW EDGE represents the same factual information as any edge in EXISTING EDGES, return 'is_duplicate: true' in the response. Otherwise, return 'is_duplicate: false'.
If is_duplicate is true, also return the uuid of the existing edge in the response.
Guidelines:
The facts do not need to be identical to be duplicates; they only need to express the same information.

6.1.5 Time extraction

<PREVIOUS MESSAGES>
{previous_messages}
</PREVIOUS MESSAGES>
<CURRENT MESSAGE>
{current_message}
</CURRENT MESSAGE>
<REFERENCE TIMESTAMP>
{reference_timestamp}
</REFERENCE TIMESTAMP>
<FACT>
{fact}
</FACT>
IMPORTANT: Only extract time information if it is part of the provided fact; otherwise ignore any times mentioned.
If only a relative time is mentioned (e.g., "10 years ago", "2 minutes ago"), do your best to determine the specific date based on the provided reference timestamp.
If the relationship is not spanning in nature but a date can still be determined, set only `valid_at`.
Definitions:
- valid_at: the date and time at which the relationship described by the fact became true or was established.
- invalid_at: the date and time at which the relationship described by the fact stopped being true or ended.
Task:
Analyze the conversation and determine whether the fact contains date information. Only set dates when they are explicitly related to the establishment or change of the relationship.
Guidelines:
1. Use ISO 8601 format (YYYY-MM-DDTHH:MM:SS.SSSSSSZ) for datetimes.
2. Use the reference timestamp as the current time when calculating `valid_at` and `invalid_at`.
3. If the fact is stated in the present tense, use the reference timestamp as the `valid_at` date.
4. If no temporal information establishing or changing the relationship is found, set the fields to null.
5. Do NOT infer dates from related events; only use dates directly tied to establishing or changing the relationship.
6. If a relative time directly related to the relationship is mentioned, calculate the actual datetime based on the reference timestamp.
7. If only a date is mentioned without a specific time, default to 00:00:00 (midnight).
8. If only a year is mentioned, use January 1st of that year at 00:00:00.
9. Always include the timezone offset (use Z for UTC if no specific timezone is mentioned).