Question: Knowledge graphs matter, and the DeepSeek language model is popular right now. Can it be used to build knowledge graphs quickly? I wanted to try it for real and see how well DeepSeek does at extracting information, fusing knowledge, and generating a graph from scratch.
Methods: I ran three experiments to test DeepSeek's knowledge graph building capabilities:
- Building a graph from a single article: Give DeepSeek an article and see if it can accurately extract the information and build a graph.
- Fusing multiple articles into a graph: Give DeepSeek several articles and see if it can add new knowledge to an existing graph.
- Generating a graph from a topic: Give DeepSeek only the topic of the graph, with no article, and see if it can come up with a graph on its own.
Results: The experiments show that DeepSeek builds and fuses knowledge graphs well, but there is still room for improvement.
Below are the exact procedure and results of each experiment, explained in plain terms.
I. Extracting Knowledge from a Single Article to Build a Graph
Purpose: Test DeepSeek's ability to extract knowledge from an article and build a knowledge graph.
Process: I found an article about the NebulaGraph graph database and had DeepSeek read it, extract the key information, and then generate NebulaGraph query language (nGQL) code to create a knowledge graph.
Instructions: "Can you help me generate a knowledge graph? I'll give you the article, you extract the key information and build the knowledge graph with nGQL code."
Operation: Copy and paste the article content directly to DeepSeek.
DeepSeek generated nGQL code:
# Schema Definition (TAG type)
CREATE TAG IF NOT EXISTS product (
  name STRING,
  description STRING
);
CREATE TAG IF NOT EXISTS company (
  name STRING
);
CREATE TAG IF NOT EXISTS feature (
  name STRING
);
CREATE TAG IF NOT EXISTS tool (
  name STRING
);
CREATE TAG IF NOT EXISTS scenario (
  name STRING
);
CREATE TAG IF NOT EXISTS language (
  name STRING
);
CREATE TAG IF NOT EXISTS auth_service (
  name STRING
);
CREATE TAG IF NOT EXISTS ecosystem (
  name STRING
);
# Schema Definition (EDGE type)
CREATE EDGE IF NOT EXISTS has_feature();
CREATE EDGE IF NOT EXISTS used_by();
CREATE EDGE IF NOT EXISTS provides_tool();
CREATE EDGE IF NOT EXISTS supports_language();
CREATE EDGE IF NOT EXISTS applicable_scenario();
CREATE EDGE IF NOT EXISTS compatible_with();
CREATE EDGE IF NOT EXISTS supports_auth();
CREATE EDGE IF NOT EXISTS integrates_with();
# Data Insertion - Core Product
INSERT VERTEX product(name, description) VALUES
"NebulaGraph": ("NebulaGraph", "Open source distributed graph database supporting datasets with hundreds of billions of vertices and trillions of edges");
# Data Insertion - Product Features
INSERT VERTEX feature(name) VALUES
"high_performance": ("high performance"),
"open_source": ("open source"),
"flexible_modeling": ("flexible data modeling"),
"read_write_balance": ("Read/Write Balance"),
"compatibility": ("Query language compatibility");
/* ... n more rows of data omitted here ... */
Results: The code DeepSeek generated is sound. The syntax is correct, and it batches multiple rows into a single INSERT statement, which is efficient. After running the code in NebulaGraph, the graph looks like this:
The graph is centered on the NebulaGraph vertex, with the related information connected to it.
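To make the batching point concrete, here is a minimal Python sketch of how several rows can be rendered into one INSERT VERTEX statement of the same shape as the generated code. The helper name build_insert_vertex and its arguments are my own invention for illustration, not part of NebulaGraph or DeepSeek, and it only handles simple string and number values.

```python
import json


def build_insert_vertex(tag, props, rows):
    """Build one batched nGQL INSERT VERTEX statement (hypothetical helper).

    tag   -- tag name, e.g. "feature"
    props -- property names, e.g. ["name"]
    rows  -- mapping of vertex ID -> tuple of property values
    """
    def literal(value):
        # json.dumps gives double-quoted, escaped strings and plain numbers.
        # Assumption: that is close enough to nGQL literals for simple data.
        return json.dumps(value, ensure_ascii=False)

    lines = []
    for vid, values in rows.items():
        if len(values) != len(props):
            raise ValueError(f"{vid}: expected {len(props)} values, got {len(values)}")
        rendered = ", ".join(literal(v) for v in values)
        lines.append(f"{literal(vid)}: ({rendered})")
    header = f"INSERT VERTEX {tag}({', '.join(props)}) VALUES"
    # All rows share one statement, separated by commas and closed by ";".
    return header + "\n" + ",\n".join(lines) + ";"


statement = build_insert_vertex(
    "feature",
    ["name"],
    {
        "high_performance": ("high performance",),
        "flexible_modeling": ("flexible data modeling",),
    },
)
print(statement)
```

One statement per tag keeps the number of round trips to the database low, which is the efficiency mentioned above.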
II. Fusing Knowledge Across Multiple Articles
Purpose: Test DeepSeek's ability to fuse knowledge from multiple articles and extend an existing graph.
Process: First build the graph from one article, then give DeepSeek more articles related to NebulaGraph (encyclopedia entries from Baidu Baike) and have it merge the new knowledge into the existing graph.
Instructions: "Next, here is another article. Try to merge it into the graph using the current table structure. If the table structure needs to change, use the ALTER command."
Operation: The summaries of the encyclopedia entries for NebulaGraph and for 360 were given to DeepSeek one at a time.
The nGQL code generated by DeepSeek (NebulaGraph entry):
# Structure Change 1: Add license, version, and time attributes to the product tag
ALTER TAG product ADD (
  license STRING,
  current_version STRING,
  release_date STRING,
  last_updated STRING
);
# Structure Change 2: Add the new developed_by relationship type
CREATE EDGE IF NOT EXISTS developed_by();
# Data Update - Add Product Attributes
INSERT VERTEX product(name, description, license, current_version, release_date, last_updated)
VALUES
"NebulaGraph": ("NebulaGraph", "Distributed Graph Database", "Apache 2.0", "v3.3.0", "2019-05-15", "2022-11-14");
# Add development company (reuse company tag)
INSERT VERTEX company(name)
VALUES
"Yueshu": ("Hangzhou Yueshu Technology Co.");
# New features (reuse feature tag)
INSERT VERTEX feature(name)
VALUES
"acl_mechanism": ("ACL Mechanism"),
"linear_scaling": ("linear scaling"),
"data_recovery": ("data recovery"),
"htap": ("HTAP"),
"nGQL_language": ("nGQL Query Language");
# Relationship Establishment (reuse of existing relationship types)
INSERT EDGE has_feature()
VALUES
"NebulaGraph"->"acl_mechanism":(),
"NebulaGraph"->"linear_scaling":(),
"NebulaGraph"->"data_recovery":(),
"NebulaGraph"->"htap":(),
"NebulaGraph"->"nGQL_language":();
# Adding a new developed relationship
INSERT EDGE developed_by()
VALUES
"NebulaGraph"->"Yueshu":();
The nGQL code generated by DeepSeek (360 entry):
# Structure Change 1: Add founder, founded_date and revenue attributes to the company tag
ALTER TAG company ADD (founder STRING, founded_date STRING, revenue DOUBLE);
# Structure Change 2: Add the event tag
CREATE TAG IF NOT EXISTS event (
  type STRING,
  date STRING,
  amount DOUBLE,
  details STRING
);
# Data Update: Add 360 Company Information
INSERT VERTEX company(name, founder, founded_date, revenue)
VALUES
"360": ("Qihoo360", "Zhou Hongyi", "2005-09", 116.15);
# Data insertion: product ecosystem (reuse product tag)
INSERT VERTEX product(name)
VALUES
"360_Security_Guard": ("360 Security Guard"),
"360_Browser": ("360 Browser"),
"360_Cloud": ("360 Security Cloud");
# Data insertion: significant events (using the new event tag)
INSERT VERTEX event(type, date, amount, details)
VALUES
"acquisition_360.com": ("Domain Acquisition", "2015-02-04", 1700.0, "110 Million RMB Acquisition of 360.com"),
"privatization": ("privatization", "2015-12", 9300.0, "$9.3 Billion Privatization Deal");
# Relationship building: reuse the developed_by edge to connect products
Results: DeepSeek can modify the table structure based on the new article (for example, adding fields to the product and company tags), and it also added a new relationship type. As instructed, it used the ALTER command to change the table structure. One small problem: its comments used --, which nGQL does not recognize as a comment marker. Manually changing them to # fixes it.
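The manual comment fix can also be scripted. The following is a rough Python sketch, assuming the -- markers only appear at the start of a line, as they did in my output. The function name fix_comment_markers is hypothetical, and the sample script is made up for the demonstration.

```python
import re

# Matches a whole-line comment that starts with "--", keeping the indentation.
_DASH_COMMENT = re.compile(r"^(\s*)--\s?(.*)$")


def fix_comment_markers(script):
    """Rewrite whole-line "--" comments as "#" comments.

    Only lines that begin with "--" are touched, so a "--" that appears
    inside a statement (for example in a MATCH pattern) is left alone.
    """
    fixed = []
    for line in script.splitlines():
        match = _DASH_COMMENT.match(line)
        if match:
            indent, text = match.groups()
            line = f"{indent}# {text}".rstrip()
        fixed.append(line)
    return "\n".join(fixed)


# Made-up sample in the style of the generated script.
generated = """-- Structure Change 2: Add new relationship type
CREATE EDGE IF NOT EXISTS developed_by();
  --indented note
MATCH (a)--(b) RETURN a LIMIT 1;"""

print(fix_comment_markers(generated))
```

Trailing comments on the same line as a statement are deliberately not handled, because telling them apart from -- inside a string or pattern would need a real parser.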
After running the code in the database, the fused graph looks like this:
Summary: DeepSeek is good at building graphs and fusing knowledge. However, the DeepSeek server is sometimes slow to respond, and NebulaGraph applies schema changes asynchronously, so they take a while to come into effect.
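Because schema changes are applied asynchronously, an INSERT issued right after an ALTER or CREATE can fail until the change is visible. One way to cope is a simple retry loop, sketched below in Python. The execute callable is a stand-in I made up (it returns True on success); it is not a NebulaGraph client API, and the attempt count and delay are assumptions to tune.

```python
import time


def run_with_retry(execute, statement, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call execute(statement) until it succeeds or attempts run out.

    `execute` is a hypothetical callable supplied by the caller; it should
    return True on success and False on failure.
    """
    for attempt in range(1, attempts + 1):
        if execute(statement):
            return attempt  # number of tries that were needed
        if attempt < attempts:
            sleep(delay_seconds)  # give the schema change time to take effect
    raise RuntimeError(f"statement still failing after {attempts} attempts: {statement}")


# Stand-in for a real client: fails twice, as if the new tag were not visible yet.
calls = {"count": 0}


def fake_execute(statement):
    calls["count"] += 1
    return calls["count"] >= 3


tries = run_with_retry(
    fake_execute,
    'INSERT VERTEX event(type) VALUES "e1": ("demo");',
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(tries)
```

With a real client, execute would wrap whatever call sends the statement and checks whether it succeeded.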
III. Topic-Based Generation: Building an Agriculture Graph from Scratch
Purpose: Test DeepSeek's ability to generate a knowledge graph from a topic alone, without any article.
Process: Start a new conversation and have DeepSeek generate the graph code directly for the topic "Agriculture". No article is given, so DeepSeek is free to improvise.
Instructions: "I want to build an agricultural knowledge graph. Use the NebulaGraph database and generate an nGQL script. Use UpperCamelCase for table names and lowerCamelCase for attribute names. Wrap schema names in backticks (``) to avoid keyword conflicts. Generate at least 3000 nodes."
DeepSeek generated nGQL code (partial):
# ======================================= #
# Create the graph space
# ======================================= #
CREATE SPACE `Agriculture` (vid_type = FIXED_STRING(32));
USE `Agriculture`;
# ======================================= #
# Define the tag schema (UpperCamelCase naming)
# ======================================= #
CREATE TAG IF NOT EXISTS `Crop` (
  `cropId` int,
  `cropName` string,
  `growthPeriod` int,
  `suitableTemp` string,
  `yieldPerMu` float
);
# ======================================= #
# Many statements are omitted here
# ======================================= #
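The naming rules from my instruction (UpperCamelCase tags, lowerCamelCase properties, backtick-wrapped names) are easy to apply mechanically. Here is a small Python sketch of those rules; every function name in it is my own, and the property list is just a subset of the Crop tag above written in snake_case for the demonstration.

```python
import re


def split_words(name):
    """Split snake_case, kebab-case, or space-separated names into words."""
    return [w for w in re.split(r"[_\-\s]+", name) if w]


def upper_camel(name):
    """crop_rotation -> CropRotation (used here for tag names)."""
    return "".join(w[0].upper() + w[1:] for w in split_words(name))


def lower_camel(name):
    """growth_period -> growthPeriod (used here for property names)."""
    camel = upper_camel(name)
    return camel[0].lower() + camel[1:] if camel else camel


def quoted(name):
    """Wrap a schema name in backticks so it cannot clash with a keyword."""
    return f"`{name}`"


def create_tag(tag, props):
    """Render a CREATE TAG statement from {property_name: type} pairs."""
    columns = ",\n".join(f"  {quoted(lower_camel(p))} {t}" for p, t in props.items())
    return f"CREATE TAG IF NOT EXISTS {quoted(upper_camel(tag))} (\n{columns}\n);"


print(create_tag("crop", {"crop_id": "int", "growth_period": "int", "yield_per_mu": "float"}))
```

A helper like this can also serve as a check: if a generated name changes when passed through upper_camel or lower_camel, it did not follow the convention.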
Problems and improvements:
- Node count limit: I initially asked for 3000 nodes. DeepSeek declined and instead gave me Python code for importing CSV files. I didn't want to use Python, so I reduced the number of nodes.
- The comment problem came back: The code comments used -- again, so I had to point out the problem a second time.
Improvement Instructions: "Use # for comments, no Python code, 3000 nodes is too many. Just give me the nGQL script for 50 nodes."
Follow-up dialog and instructions: To refine the graph, I kept talking to DeepSeek, asking it to supplement the data, strengthen the associations, organize the graph by taxonomic rank (phylum, order, family, genus, and species), and generate crop rotation data.
For example, my instructions:
- "Supplement the data so that it is more strongly connected."
- "Build a graph of these classifications: phylum, order, family, genus, and species."
- "Identify the crops to avoid and the yield-boosting crops when rotating the existing crops."
- "Combine this with the crop data already in the graph and give nGQL scripts in the same format as before."
An interlude: At one point DeepSeek wrote an INSERT statement in Cypher syntax, which nGQL does not support. After I pointed this out, it corrected the statement.
Instructions: "This insert statement is not nGQL syntax. Change it so that DDL comes first and DML comes second."
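The "DDL first, DML second" ordering can be illustrated with a short Python sketch. It classifies each statement by its first keyword, which is a simplification I am assuming is good enough for scripts like these; the function names and the keyword set are mine, not anything defined by nGQL.

```python
# Leading keywords treated as data statements (DML) in this sketch.
DML_KEYWORDS = {"INSERT", "UPDATE", "UPSERT", "DELETE"}


def is_dml(statement):
    """Classify a statement by its first keyword."""
    words = statement.strip().split(None, 1)
    return bool(words) and words[0].upper() in DML_KEYWORDS


def ddl_first(statements):
    """Return schema statements first, then data statements.

    The relative order inside each group is preserved, so CREATE SPACE
    still comes before USE, and vertices still come before their edges.
    """
    ddl = [s for s in statements if not is_dml(s)]
    dml = [s for s in statements if is_dml(s)]
    return ddl + dml


mixed = [
    "CREATE TAG IF NOT EXISTS company (name STRING);",
    'INSERT VERTEX company(name) VALUES "Yueshu": ("Hangzhou Yueshu Technology Co.");',
    "CREATE EDGE IF NOT EXISTS developed_by();",
    'INSERT EDGE developed_by() VALUES "NebulaGraph"->"Yueshu":();',
]

for statement in ddl_first(mixed):
    print(statement)
```

Putting all schema statements up front also makes it easier to wait once for the schema to take effect before any data is inserted.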
Final data volume: After a few rounds of dialog, the amount of data is as shown:
Graph view: Expand a few random nodes and take a look:
Example of yield-boosting crop rotation combinations:
IV. Summary
Conclusion: DeepSeek performs well at knowledge graph construction and fusion, as the experiments show:
- Information extraction is fast and accurate: DeepSeek quickly extracts key information from text and generates valid nGQL scripts. Its language comprehension is strong enough to recognize entities, relationships, and events.
- Strong knowledge fusion: DeepSeek fuses knowledge from multiple articles well, and can expand and update the graph based on new articles, which helps keep the graph complete and accurate.
- It can build a graph from scratch: Even without an article, it can generate a graph from a topic alone. There were some syntax hiccups along the way, but after adjustments it produced usable scripts.
- Details need polish: Scripts generated by DeepSeek occasionally have syntax issues, such as the wrong comment marker. When generating a large number of nodes, the server may be slow to respond. Keep these problems in mind in real use.