Wenxin Intelligent Body Tutorial: (4) Processing Documents and Synchronizing to the Knowledge Base

AI hands-on tutorials | Updated 12 months ago | AI Sharing Circle

Introduction to the Knowledge Base

The knowledge base is the data foundation for the agent's answers (the "intelligent body" in the platform's name is the agent). It suits developers who have accumulated specialized data, as well as developers who need accurate, professional output.
Upload your own data in the knowledge base module. When the large model interacts with a user, it retrieves similar content from the knowledge base and polishes that content into the final result, which effectively limits the scope of what the model generates.
Wenxin Intelligent Body Platform fully respects and safeguards the security of your proprietary data. It will not use submitted data to train or improve general-purpose large models, and it has not yet opened an exclusive model training capability.

1. Usage Scenario

Developing zero-code agents that reference a knowledge base and restrict retrieval to it;
Developing low-code agents that reference a knowledge base;
Quickly developing data plug-ins that reference a knowledge base.

2. Knowledge base portal

Entry 1: After logging in to the platform, use the left navigation to open the Knowledge Base module.

Entry 2: When developing a zero-code agent, click "New Knowledge Base" on the Create Agent page to add data.

Entry 3: When developing a low-code agent, drag the knowledge base component onto the visual orchestration page and click "New Knowledge Base" to open the Knowledge Base module.

Entry 4: When developing a data plug-in, click "New Knowledge Base" on the Edit Plugin page to open the Knowledge Base module.

3. Knowledge base creation

Step 1: Upload the data.

There are 3 ways to upload knowledge base data: ① upload local files, ② submit a web address, ③ import from Baidu Netdisk. One account can create up to 100 knowledge bases, and the total capacity of all knowledge bases cannot exceed 1 GB. One knowledge base can hold up to 100 files or URLs, with a total capacity of no more than 200 MB.

①Local files

Currently supports text, image, audio, and video files in the following formats: txt, md, docx, pdf, xlsx, csv, png, jpg, jpeg, m4a, mp3, mp4, mov, mpeg. Video files can be uploaded, but video content recognition is not supported for the time being.

Data type	Format	Upload instructions
Text	txt	File size up to 50 MB.
Text	md	File size up to 50 MB.
Text	docx	File size up to 50 MB. Images are not supported for the time being: images in the file are filtered out and only the text is kept.
Text	pdf	File size up to 50 MB. Images are not supported for the time being: images in the file are filtered out and only the text is kept. Scanned documents of up to 50 pages are supported.
Text	xlsx	File size up to 50 MB. xlsx is the recommended format for data files. The uploaded xlsx should contain table headers so that the model can still understand the data after the file is split and can query and aggregate it more accurately.
Text	csv	File size up to 50 MB.
Image	png	30 px ≤ side length ≤ 4096 px, aspect ratio within 3:1, file size up to 20 MB. Up to 500 images per knowledge base. Recognition is more accurate when the image contains physical objects.
Image	jpg	30 px ≤ side length ≤ 4096 px, aspect ratio within 3:1, file size up to 20 MB. Up to 500 images per knowledge base. Recognition is more accurate when the image contains physical objects.
Image	jpeg	30 px ≤ side length ≤ 4096 px, aspect ratio within 3:1, file size up to 20 MB. Up to 500 images per knowledge base. Recognition is more accurate when the image contains physical objects.
Audio	m4a	File size up to 50 MB. Audio is converted to text through intelligent recognition.
Audio	mp3	File size up to 50 MB. Audio is converted to text through intelligent recognition.
Video	mp4	File size up to 200 MB. Converts video to text through intelligent recognition.
Video	mov	File size up to 200 MB. Converts video to text through intelligent recognition.
Video	mpeg	File size up to 200 MB. Converts video to text through intelligent recognition.
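
The per-format limits for local files can be checked before uploading. The following is a minimal sketch, not part of the platform: `check_upload` and its parameters are hypothetical names, "M" is assumed to mean 1024 × 1024 bytes, and the image dimensions are assumed to be supplied by the caller.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: the "M" in the limits means mebibytes

# Per-format size limits in bytes, transcribed from the local-file table.
SIZE_LIMITS = {
    **{ext: 50 * MB for ext in ("txt", "md", "docx", "pdf", "xlsx", "csv", "m4a", "mp3")},
    **{ext: 20 * MB for ext in ("png", "jpg", "jpeg")},
    **{ext: 200 * MB for ext in ("mp4", "mov", "mpeg")},
}
IMAGE_EXTS = {"png", "jpg", "jpeg"}


def check_upload(filename, size_bytes, width=None, height=None):
    """Return a list of problems; an empty list means the file looks acceptable."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in SIZE_LIMITS:
        return [f"unsupported format: .{ext}"]
    problems = []
    if size_bytes > SIZE_LIMITS[ext]:
        problems.append(f"file exceeds {SIZE_LIMITS[ext] // MB} MB")
    # Image-only checks run when the caller provides pixel dimensions.
    if ext in IMAGE_EXTS and width is not None and height is not None:
        if not (30 <= width <= 4096 and 30 <= height <= 4096):
            problems.append("side length must be between 30 and 4096 px")
        if max(width, height) > 3 * min(width, height):
            problems.append("aspect ratio must be within 3:1")
    return problems
```

For example, `check_upload("poster.png", 2 * MB, width=4000, height=1000)` reports an aspect-ratio problem, because 4000 is more than three times 1000. The sketch does not cover the per-knowledge-base totals (100 files or URLs, 200 MB, 500 images).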

②Website Submission

After entering the web page address, click the "Identify" button to extract the text data from the page. Only publicly accessible pages that Baidu has indexed can be identified. Pages that require login, or sites that have not authorized Baidu to index them, will fail to identify.
You can set how often the knowledge base is automatically re-identified and updated, based on how frequently the web page changes.

③Baidu Netdisk Import

The first time you use this option, you need to authorize access to your Baidu Netdisk account data. After authorization succeeds, you can select files from your netdisk.
How long a netdisk import takes depends on the download speed of the netdisk files. If it takes a long time, you can choose background processing.

Step 2: Data processing.

Large models currently place strict limits on the number of input and output characters, and knowledge base content counts as input, so it must stay within the model's input limit. Text segmentation cuts long text into short paragraphs and removes irrelevant information, so that the most relevant content can be passed in without exceeding the limit. For images, the platform first calls a model to annotate the image content so that the large model can understand it more accurately. Currently, 2 to 3 knowledge base paragraphs can be passed to the large model, so related content should be kept within 3 paragraphs wherever possible.

Text Segmentation: The platform provides "default segmentation" and "customized segmentation". Developers can cut long text into multiple segments at chosen text markers, punctuation, spaces, line breaks, and so on, so that the model understands the content more accurately. Segments are cut according to the configured segmentation method and stay within the maximum segment characters.

For how to set up segments for novels, customer-service Q&A content, tabular data, and other content, see "How to set up document segmentation (with examples)".

Form Setup: The table header of a spreadsheet file is the key information the large model uses to understand the table. By default, the 1st row of the table is treated as the header. You can also mark the header manually to match the actual table structure.

Multimedia settings: By default, the large model is called to recognize image and audio content and generate text annotations. These annotations help the retrieval stage understand the images and audio and recall them more accurately. If a generated annotation is wrong, you can correct it manually. Video recognition is not yet available and is planned for a later release.

4. Knowledge base utilization

Way 1: When developing a zero-code agent, select a knowledge base on the Create Agent page. You can observe how the knowledge base is called and improve retrieval recall by tuning the retrieval parameters. For details, see the common Q&A on knowledge base calls.

Way 2: When developing a low-code agent, drag the knowledge base component onto the visual orchestration page and select a knowledge base that has already been created.

Way 3: When developing a data plug-in, select a knowledge base that has already been created.

How to set up document segmentation (with examples)

1. When do I need to change document segmentation?

The data is structured;
The agent's or plug-in's output successfully hits the knowledge base but contains too much irrelevant information.

2. How to set up file segmentation

Segmentation cuts long text into short paragraphs and removes as much irrelevant information as possible from the retrieved content, so that the model can process and understand it more effectively.

Wenxin Intelligent Body Platform provides default segmentation and customized segmentation. Different types of documents call for different segmentation configurations.

Maximum Segment Characters: the upper limit on the number of characters in a paragraph after a long text is cut, not the exact length of every paragraph. You can enter any number from 50 to 512.

Paragraph overlapping characters: the maximum number of characters that the beginning of a segment may repeat from the end of the previous segment. You can enter any number from 0 to 500, and the value must be less than the maximum paragraph characters. Overlap preserves as much of the original meaning of the cut segments as possible, avoids incomplete statements caused by cutting, and helps the model understand the content more accurately and completely.

Segmentation mode: the separator symbols used to cut long text. You can choose from the commonly used separators or enter any symbol. When the text is cut, the cutting position is chosen according to the order in which the separators are listed.

Note: The number of segments in a single knowledge base cannot exceed 7 million, so set segments sensibly.
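
To make the interaction of the three settings concrete, here is a minimal sketch of a splitter that honors a maximum length, an overlap, and a priority-ordered separator list. It is an illustration only: the platform's actual algorithm is not documented here, and `split_text` and its default separators are hypothetical.

```python
def split_text(text, max_chars=512, overlap=0,
               separators=("\n\n", "\n", "。", ".", " ")):
    """Greedy splitter: cut at the highest-priority separator that fits."""
    if not 50 <= max_chars <= 512:
        raise ValueError("max_chars must be between 50 and 512")
    if not 0 <= overlap <= 500 or overlap >= max_chars:
        raise ValueError("overlap must be in 0..500 and smaller than max_chars")
    segments = []
    pos = 0
    while pos < len(text):
        window = text[pos:pos + max_chars]
        if pos + max_chars >= len(text):
            segments.append(window)  # the remainder fits in one segment
            break
        cut = len(window)  # fall back to a hard cut at the limit
        for sep in separators:
            # Search only past the overlap so every cut moves forward.
            idx = window.rfind(sep, overlap)
            if idx != -1:
                cut = idx + len(sep)
                break
        segments.append(window[:cut])
        pos += cut - overlap  # the next segment repeats the last `overlap` chars
    return segments
```

With `overlap=0` the segments concatenate back to the original text. With a non-zero overlap, each segment begins with the last `overlap` characters of the previous one, which is the behavior the overlap setting describes.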

3. Segmented cases

Case 1: Long Text Content Segmentation Case

Scope of application: novels, e-books, articles, company introductions, theses, patent documents, and other long texts where the model needs to understand the semantics in context.

Example file: The Man in the Suit.docx

Segmentation approach:

Default segmentation is recommended. To view the specific segmentation results, download the example file and create a knowledge base.
- Maximum paragraph characters: paragraphs in long texts are generally long, and neighboring paragraphs often depend on one another, so set the maximum paragraph characters somewhat larger. Each segment should contain a complete unit of meaning so that the model understands it accurately.
- Paragraph Overlap Characters: when paragraphs need to be understood in context, set the overlap as needed so that related context appears within a single segment.
- Segmentation: the default separators cover most text. If the results are unsuitable, look for the symbols in the document that mark good cutting positions and select or type them as separators. The text is then cut according to the order of the separators.

Follow-up optimization: try to keep text with the same meaning within one segment. If the character limit prevents a passage from fitting into one segment, use paragraph overlap to link neighboring segments. This raises the probability that they are retrieved together, so the model can combine them when producing its output.

Model retrieval results: [screenshot not included]

Model retrieval output: [screenshot not included]

Case 2: The case of structural content segmentation

Scope of application: content with distinctive structural features, such as customer service chat logs, sales scripts and other question-and-answer exchanges, and text forms, where the model needs to understand the semantics of the content within each structure.

Example file: Wenxin Intelligent Body Platform FAQ.docx

Segmentation approach:

Custom segmentation is recommended, so that the text within one structure stays in one segment. To view the specific segmentation results, download the example file and create a knowledge base.
- Maximum paragraph characters: first look at the structure of the original text and estimate the average number of characters per structure by sampling a few representative ones, then set the maximum paragraph characters to that average. For example, the sample document has a question-and-answer structure with 2 paragraphs averaging 340 characters, so the maximum paragraph characters is set to 340.
- Paragraph overlap characters: set the overlap to 0. If the character limit prevents a structure from fitting into one segment, use overlap to link neighboring segments. This raises the probability that they are retrieved together, so the model can combine them when producing its output.
- Segmentation: the document has a distinctive structure in which every question-and-answer pair is marked with "question" and "answer". To cut by question-and-answer pair, use "question" as the separator and cut before the separator. Each resulting segment is then one question-and-answer pair.
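
Cutting before a marker, as described in the last bullet, can be sketched with a regular expression. This is an illustration under assumptions: the marker text `"Question"` and the helper `split_before_marker` are hypothetical, and the marker is assumed to start a line.

```python
import re


def split_before_marker(text, marker="Question"):
    """Cut the text immediately before each line that starts with `marker`."""
    # Zero-width match: the marker itself stays at the start of each segment.
    pattern = re.compile(rf"^(?={re.escape(marker)})", re.MULTILINE)
    return [part.strip() for part in pattern.split(text) if part.strip()]


faq = (
    "Question: What is a knowledge base?\n"
    "Answer: The data the agent draws on.\n"
    "Question: How many files can one knowledge base hold?\n"
    "Answer: Up to 100 files or URLs.\n"
)
segments = split_before_marker(faq)
```

Here `segments` holds two entries, each containing one question together with its answer. Anchoring the pattern to the start of a line keeps the word "Question" inside an answer from triggering a cut.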

Model retrieval results: [screenshot not included]

Model retrieval output: [screenshot not included]

Case 3: Excel Data Type Content Segmentation Case

Scope of application: Excel table data used for specific data queries and data statistics, where rows have no relationship to one another other than through statistics.

Example file: 2023 Movie Box Office Data.xlsx

Segmentation approach:

If statistical analysis is required, keep the data that must be calculated together within 1 to 3 segments wherever possible (the current model limits knowledge base input to a maximum of 2000 characters), and preserve the integrity of the original data passed to the model so that the final statistics are correct.

Custom segmentation is recommended, to preserve the integrity of the raw data passed to the model so that the final statistics are correct. To view the specific segmentation results, download the example file and create a knowledge base.
- Maximum Paragraph Characters: to keep retrieved paragraphs complete, set the maximum paragraph characters to the upper limit of 512.
- Paragraph Overlap Characters: set the overlap to 0 so that overlapping characters do not take up space in each paragraph.
- Segmentation: table data can be cut directly by row, so select "Line Feed" as the separator.

Follow-up optimization: because the model limits knowledge base input to a maximum of 2,000 characters, keep the data to be calculated within 1 to 3 paragraphs wherever possible. For larger-scale statistics, upload an Excel sheet with no more than 2 columns, so that all the data needed for the statistics fits within the 3 paragraphs passed to the model.
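
The budget described above can be estimated before uploading. With a 512-character segment limit and at most 3 segments, at most 3 × 512 = 1536 characters of row data reach the model, which is within the 2,000-character limit. The sketch below packs whole rows into segments to show how many rows fit. `pack_rows` is a hypothetical helper, and it ignores however the platform attaches table headers to segments.

```python
def pack_rows(rows, max_chars=512):
    """Greedily pack whole rows into segments of at most max_chars characters."""
    segments, current = [], ""
    for row in rows:
        line = row + "\n"  # rows are separated by line feeds
        if len(line) > max_chars:
            raise ValueError("a single row exceeds the segment limit")
        if len(current) + len(line) > max_chars:
            segments.append(current)
            current = ""
        current += line
    if current:
        segments.append(current)
    return segments


# 30 rows of 49 characters each: 10 rows (500 characters) fit per segment.
rows = ["x" * 49] * 30
segments = pack_rows(rows)
fits_in_model_input = len(segments) <= 3 and sum(map(len, segments)) <= 2000
```

In this example the 30 rows fill exactly 3 segments, so a statistic over all of them could in principle see every row. Wider rows reduce the number of rows per segment, which is why the advice above is to keep the sheet narrow.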

Model retrieval results: [screenshot not included]

Polished output results: [screenshot not included]

Attention:

Table headers matter for retrieving segmented results because they are the key information the model uses to understand the data. Headers should therefore have clear meanings and avoid obscure words that the model cannot understand.
For plug-ins or agents that need statistical analysis, stating detailed calculation steps in the plug-in's or agent's instruction prompt can improve the accuracy of the model's statistical results.

Common Q&A on knowledge base calls

Q1: When I preview a knowledge base call, I see "System Exception" or "Service Exception". What should I do?
A: Sorry for the disruption. "System Exception" and "Service Exception" occur only occasionally. After the prompt appears, try refreshing, leaving the current page and opening it again, or clearing the cache, and then retry. Normal use should resume.

Q2: What if my knowledge base is not recalled?
A: The knowledge base may contain nothing relevant to the question. Go to the Knowledge Base Management page and check whether there is relevant content. If there is none, add content to the knowledge base to cover the question. If there is relevant content but it is not recalled, see Q3.

Q3: My knowledge base has relevant content, but I keep getting the message "No relevant knowledge base recalled". How can I get it recalled?
A: There are 2 things to try.
First, go to the Knowledge Base Management page and check the relevant content. If it has semantic problems, edit the content to fix them.

Second, lower the [retrieval relevance threshold] in the knowledge base recall configuration and test the recall effect again. Note: the [retrieval relevance threshold] applies globally to the current agent, so configure it with most of your usage scenarios in mind. If you only need to fix an individual case, submit the ideal answer through [Feedback] to correct the answer the model generates.

Q4: The recalled results are not relevant to the user's question, but the relevance value given by the system is quite high. How can I fix this?
A: There are 3 ways to try to solve this problem:
1. Edit the recalled paragraph to delete the description that causes the match, then preview again to see whether it is still recalled;
2. Adjust the knowledge base recall configuration. When the irrelevant results rank near the bottom of the recalled results, try raising the [retrieval relevance threshold] and reducing the [maximum number of recalled paragraphs] and [maximum number of paragraph characters];
3. If you only need to fix an individual case, submit the ideal answer through [Feedback] to correct the answer the model generates.

Q5: Only some of the relevant results in the knowledge base are recalled, and I would like the others to be recalled as well. What should I do?
A: There are 2 ways to try to solve this problem:
1. Adjust the knowledge base recall configuration: try lowering the [retrieval relevance threshold] and increasing the [maximum number of recalled paragraphs] and [maximum number of paragraph characters];
2. If you only need to fix an individual case, submit the ideal answer through [Feedback] to correct the answer the model generates.
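
The effect of the three recall settings in Q4 and Q5 can be pictured with a toy model. This is only a sketch under assumptions: the platform's actual retrieval pipeline is not described here, relevance scores are assumed to lie between 0 and 1, the paragraph character limit is assumed to truncate rather than discard, and `select_recalled` is a hypothetical name.

```python
def select_recalled(scored_paragraphs, threshold, max_paragraphs, max_chars):
    """Keep paragraphs at or above the threshold, best first, within the limits."""
    kept = [(score, text) for score, text in scored_paragraphs if score >= threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return [text[:max_chars] for _, text in kept[:max_paragraphs]]


candidates = [
    (0.92, "refund policy"),
    (0.71, "shipping times"),
    (0.40, "company history"),
]
strict = select_recalled(candidates, threshold=0.8, max_paragraphs=3, max_chars=512)
loose = select_recalled(candidates, threshold=0.5, max_paragraphs=3, max_chars=512)
```

In this toy model, `strict` keeps only the highest-scoring paragraph, which mirrors the Q4 advice for trimming low-ranked irrelevant results, while `loose` also admits the second paragraph, which mirrors the Q5 advice for recalling more.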

Q6: The recall results all look fine, but the final output has nothing to do with my knowledge base. Why?
A: This happens because the model filters out the recalled knowledge base results when polishing the answer. To solve it, add requirements for using the knowledge base to the agent's persona settings. Examples:
- Template 1: When the user asks a question, the knowledge base must be retrieved. When no result is retrieved, output "I'm sorry, I don't know much about this issue, let's talk about something else~".
- Template 2: When a user asks a question, generate the answer by giving priority to the results recalled from the knowledge base.