Background
The Chief AI Sharing Circle site has accumulated a large number of practical prompts and AI tools, but the site's built-in keyword search often fails to turn up the exact resource a reader needs. Having excellent video generation tools on the site that simply cannot be found is hard to accept.
Since I cannot develop the website itself, I have to rely on external tools for search:
Relying on a search engine's "site search" operator (for example, searching site:aisharenet.com plus a keyword) works, but it is cumbersome and the site is not fully indexed:
Of course, I also cannot build semantic search over the site content myself, let alone wrap it in a good interface, so the problem boils down to:
How to convert website content into an easily retrievable knowledge base.
Content Analysis
For both AI tools and prompt guides, the header area already describes the content's key characteristics clearly, while the content area, though more detailed, is scattered text that hurts retrieval quality. The content also includes images, which I would like to surface as previews for readers.
Example of an AI tool post
Example of a prompt (instruction) post
Thinking Through the Retrieval Strategy
1. Mix the title and content into one block and use the whole thing for semantic retrieval
Pros: the content stays complete
Cons: too much content makes retrieval imprecise
2. Retrieve only the title, then reference the full content by title
Pros: accurate retrieval
Cons: the effective retrieval scope shrinks
3. Feed the title and content to a large model and split them into QA pairs
Pros: greatly improves effective retrieval coverage
Cons: higher processing and time costs; important content and the structure of the original text are lost
PS: without any development experience you can deploy the DIFY project to batch-generate QA pairs; not demonstrated here.
4. Knowledge graph
The content is not a good fit; ignore this option.
Besides, I plan to rely on free, open agent platforms, and they do not support knowledge graphs either.
I choose strategy 2: it is simple and efficient. Although the effective retrieval scope shrinks, it can be optimized incrementally through continuous iteration.
The body of each post does not really need to participate in retrieval either: retrieving semantically against the title alone reduces the anomalies a large model produces when handling long contexts, and returning the URL lets readers view the complete article.
Picking a Platform for the Search Tool
Which third-party platform should host the semantic search?
There are many free platforms that support knowledge bases, such as Tencent Yuanqi, Zhipu, Coze, and Wenxin. Here I will choose a platform that supports importing QA pairs for retrieval.
Retrieving QA pairs: the user's query is matched against stored question A, the corresponding answer B is returned to the large model, and B is used as reference content to answer the user's question.
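To make the mechanism concrete, here is a toy sketch of QA-pair retrieval. It only illustrates the idea, not any platform's actual implementation: similarity here is plain TF-IDF cosine rather than a real semantic embedding, and the sample titles and URLs are made up.

```python
# Toy illustration of QA-pair retrieval (not Wenxin's or any platform's real code).
# The user query is matched against the stored *questions*; the paired *answer*
# (here, a URL) is what gets handed to the LLM as reference content.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical QA pairs: question = post title, answer = post URL
qa_pairs = [
    ("Example video generation tool", "https://www.aisharenet.com/example-video-tool/"),
    ("Example prompt-writing guide", "https://www.aisharenet.com/example-prompt-guide/"),
]

questions = [q for q, _ in qa_pairs]
vectorizer = TfidfVectorizer().fit(questions)
question_vectors = vectorizer.transform(questions)

def retrieve(user_query: str) -> str:
    """Return the answer paired with the most similar stored question."""
    sims = cosine_similarity(vectorizer.transform([user_query]), question_vectors)[0]
    return qa_pairs[sims.argmax()][1]  # answer B for the best-matching question A

print(retrieve("video generation tool"))  # -> the matching post's URL
```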
Which platform has better semantic understanding is not a concern here; their baseline performance is all roughly up to standard.
Where will users access it?
My main channel is the WeChat official account, so users should be able to search from within the official account.
Zhipu is fine, but I choose the Wenxin Agent platform, which gives clearer operating guidance when configuring QA rules. A Wenxin agent can also be published to Baidu to acquire users. Recommended reading: Killer traffic portal: using AI agents to bring long-term external traffic to websites and official accounts
Step-by-Step Tutorial
1. Export the XML file from WordPress (in the admin dashboard, Tools → Export → All content downloads a WXR-format XML file)
2. Convert the XML to Markdown format
2.1 Click here to download the blog2md project, then unzip it to the directory D:\222\blog2md
2.2 In the blog2md directory, right-click and open a shell (PowerShell) terminal.
2.3 You will most likely need to install the dependency; enter the following commands
Installation command: npm install xml2js   Verification command: npm list xml2js
2.4 Rename the exported XML file to 111.xml, place it in the D:\222\blog2md directory, and run the following command
node index.js w 111.xml out
2.5 The directory D:\222\blog2md\out is now generated; open it to verify that the generated content is correct.
3. Convert the Markdown to Excel format
The Markdown content is well structured, so it is easy to extract. Here I had ChatGPT write a regular expression and ran it in Python.
I want to extract: the file name (the file name maps to the URL, e.g. https://www.aisharenet.com/anse/), the title, and the content area (everything below the --- separator).
3.1 After running the Python script, the file output.xlsx is generated in the current directory.
Script content:
Save the script with any file name, e.g. 111.py, and put it in any directory; here I put it in D:\222\blog2md.
Run it from the command line (by default the command line cannot execute 111.py directly; you must add the .\ prefix)
.\111.py
The script code is as follows; save it as 111.py (the original was generated by ChatGPT).
Directory the md files are read from: folder_path = "D:\\222\\blog2md\\out"
Excel file generated in the current directory: output_file = "output.xlsx"
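A minimal sketch of such a script, assuming blog2md's output format (a header block containing a title: line, terminated by a --- separator, with the post slug as the file name); it needs pandas and openpyxl installed (pip install pandas openpyxl) and is a reconstruction, not necessarily the author's exact code:

```python
# Minimal reconstruction of the conversion script (the author's original was
# generated by ChatGPT). Assumes blog2md output: a header block with a "title:"
# line, a "---" separator, and the post slug as the file name.
import os
import re
import pandas as pd

folder_path = "D:\\222\\blog2md\\out"   # directory containing the generated .md files
output_file = "output.xlsx"             # Excel file written to the current directory

rows = []
for name in os.listdir(folder_path):
    if not name.endswith(".md"):
        continue
    slug = os.path.splitext(name)[0]
    url = f"https://www.aisharenet.com/{slug}/"   # rebuild the post URL from the slug

    with open(os.path.join(folder_path, name), encoding="utf-8") as f:
        text = f.read()

    # Title: taken from the "title: ..." line in the header block
    m = re.search(r"^title:\s*['\"]?(.*?)['\"]?\s*$", text, re.MULTILINE)
    title = m.group(1) if m else slug

    # Content: everything after the first "---" separator line
    parts = re.split(r"^---\s*$", text, maxsplit=1, flags=re.MULTILINE)
    content = parts[1].strip() if len(parts) > 1 else text

    rows.append({"url": url, "title": title, "content": content})

# Requires openpyxl for .xlsx output
pd.DataFrame(rows).to_excel(output_file, index=False)
print(f"Wrote {len(rows)} rows to {output_file}")
```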
3.2 Organize output.xlsx into the knowledge base file to upload
Only the title is kept, and the complete URL is concatenated for each entry.
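As a sketch of that step (the column names follow the reconstruction above and are assumptions, not a template required by the platform):

```python
# Hypothetical follow-up: turn output.xlsx into the QA-pair sheet to upload,
# keeping only the title (as the question) and the full URL (as the answer).
import pandas as pd

df = pd.read_excel("output.xlsx")        # produced by the previous script
qa = pd.DataFrame({
    "question": df["title"],             # retrieval happens against the title only
    "answer": df["url"],                 # the URL lets readers open the full post
})
qa.to_excel("knowledge_base.xlsx", index=False)
```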
4. Upload the knowledge base to the Wenxin Agent platform
4.1 Open the Wenxin Agent platform and upload the knowledge base
4.2 Upload the Excel file
4.3 Customize which column is used for retrieval (this is the reason for choosing Wenxin; other tools lack this interface)
For more tips on organizing the knowledge base, read: Wenxin Intelligent Body Tutorial: (4) Processing Documents and Synchronizing to the Knowledge Base
5. Create the agent and publish it for use
5.1 Create the agent
Here we just do a simple configuration without getting bogged down in details. Start creating the agent...
You can also try low-code mode to build the agent and add logic for choosing among multiple knowledge bases; after all, the site has many categories. I will not demonstrate that here; readers interested in low-code can read: Wenxin Intelligent Body Tutorial: (5) Orchestrating the Agent Workflow
5.2 Configure the agent
Turn off features unrelated to the knowledge base to avoid anomalies; I leave the other settings at their defaults without fine-tuning.
Briefly test the hit rate of knowledge-base recall, otherwise it is easy to match irrelevant content.
5.3 Debugging and Previewing the Output
5.4 Publish the agent
Wrapping Up
In the end, you get an agent that can quickly look up AI tools right inside the official account, completely free! And thanks to the Wenxin distribution channels (Wenxin Intelligent Body Platform: Intelligent Body Applications Built on Complete Distribution Channels and a Commercial Closed Loop), this tool will also be published to the Baidu homepage, giving users another way to reach it.