Solving Broken or Truncated JSON Output from LLMs with Long Contexts

AI hands-on tutorials | Published 8 months ago | AI Sharing Circle


I. Preface

Although most LLMs can now output tens of thousands of tokens, in real business deployments we still often find that the LLM does not produce output in the format we specified.

This matters most when we need structured output so that downstream tasks can continue.

Because JSON is inherently plain-text friendly, structured LLM output is usually requested as JSON; the JSON is then extracted and converted into a dictionary for the downstream task.

However, there are many reasons why we may not get the complete JSON output.

II. Several cases where only half of the JSON is output

2.1 Output exceeds the LLM's token limit

The input context is too long, leaving too few tokens for the output, so the LLM cannot give a complete answer and the JSON stops halfway.
The task is complex enough to require the CoT technique, and the LLM's many intermediate reasoning steps leave little room for the final result. Either no final result is produced, or it stops halfway through.
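
Both cases above come down to the same arithmetic. The following is a minimal sketch, assuming a model whose prompt and output share one context window; the function name and all the numbers are hypothetical and not taken from any particular model.

```python
def remaining_answer_budget(context_window: int, prompt_tokens: int,
                            max_output_tokens: int, reasoning_tokens: int = 0) -> int:
    """Tokens left for the final JSON answer (all inputs are illustrative)."""
    # The output can exceed neither what is left in the context window
    # nor the model's own cap on generated tokens.
    output_budget = min(max_output_tokens, context_window - prompt_tokens)
    # Chain-of-thought text is generated output too, so it eats into the budget.
    return max(0, output_budget - reasoning_tokens)


# Case 1: a very long prompt leaves little room for the answer.
print(remaining_answer_budget(32_000, 31_500, 4_096))        # 500
# Case 2: long intermediate reasoning uses up most of the output cap.
print(remaining_answer_budget(32_000, 2_000, 4_096, 3_900))  # 196
```

If the JSON you expect needs more tokens than this budget, the output will stop partway through.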

2.2 Sensitive-word filtering cuts off the LLM output

To prevent LLMs from being abused and outputting inappropriate content, results are usually filtered or blocked for sensitive words before they are returned. What counts as sensitive differs from region to region, so the filtering depends on the location. A few years ago, for example, Meta came up with a criminally inclined AI, which is not allowed anywhere in the world.

So when your LLM outputs sensitive content in an uncontrolled way, the output is truncated and you are left with a half-finished result.

2.3 Failure to comply strictly with instructions

This is much more common. Even the best LLMs have lapses, and there is no guarantee that the output will be in the format you want.

In the output JSON, we often run into situations such as:

Missing critical punctuation, for example: {"question":How are you!} Without the double quotes, this output cannot be parsed.
Punctuation in the wrong form, for example: {"question":"How are you!" ，"answer":"I am fine"} The comma in the middle is full-width rather than half-width, which also makes the JSON fail to parse.
There are many other scenarios that I won't list here. The following sections show how to solve these types of problems.
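
To see both failures concretely, here is a small sketch using Python's standard json module. The helper name try_parse is hypothetical and exists only for this illustration.

```python
import json

samples = {
    "missing quotes": '{"question":How are you!}',
    "full-width comma": '{"question":"How are you!" ，"answer":"I am fine"}',
}


def try_parse(text: str):
    """Return the parsed dict, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # err.pos is the character offset where the parser gave up.
        print(f"parse failed at position {err.pos}: {err.msg}")
        return None


for label, text in samples.items():
    print(label, "->", try_parse(text))
```

Both samples return None, while the same content with correct half-width punctuation and quotes parses into a dictionary.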

III. An open source solution

There are usually three ways to solve the three problems mentioned above:

Complete the JSON: mainly for cases where only half of the JSON answer was output. The incomplete JSON is closed off into a complete JSON document so that it can be parsed.
Fix the JSON: mainly for JSON answers with missing or incorrect punctuation.
Discard part of the JSON content: mainly for special cases in the JSON that cannot be handled otherwise. The well-formed content is kept and the irregular content is discarded.
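
The first approach, completion, can be sketched with a simple stack of open brackets. This is a simplified illustration written for this article, not the implementation of any particular library. The function name complete_json is an assumption, and it only handles fragments cut off inside a string value, after a comma, or after a colon. Truncated keys, literals such as tr, and partial numbers are not handled.

```python
import json


def complete_json(fragment: str) -> str:
    """Close whatever a truncated JSON fragment left open (simplified sketch)."""
    closers = []        # pending '}' / ']' characters, innermost last
    in_string = False
    escaped = False
    for ch in fragment:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()

    result = fragment
    if in_string:
        if escaped:
            result = result[:-1]   # drop a dangling backslash
        result += '"'              # close the unterminated string
    result = result.rstrip()
    if result.endswith(","):
        result = result[:-1]       # a trailing comma is not valid JSON
    elif result.endswith(":"):
        result += "null"           # a key with no value yet
    return result + "".join(reversed(closers))


print(complete_json('{"a":'))                         # {"a":null}
print(json.loads(complete_json('{"items": [1, 2,')))  # {'items': [1, 2]}
print(json.loads(complete_json('{"answer": "Par')))   # {'answer': 'Par'}
```

Tracking whether the scanner is inside a string is what keeps brackets that appear in string values from being counted as structure.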

3.1 streaming-json-py solution

This is an open-source solution, available as:

streaming-json-go
streaming-json-py
streaming-json-js

These three language implementations cover the needs of different environments.

Here's a look at the results in action

pip install streamingjson

import streamingjson

lexer = streamingjson.Lexer()

# append your JSON segment
lexer.append_string('{"a":')

# complete the JSON
print(lexer.complete_json()) # will print `{"a":null}`

# append more JSON segment
lexer.append_string('[tr')

# complete the JSON again
print(lexer.complete_json()) # will print `{"a":[true]}`

This open-source solution focuses on completing the JSON.

Alternative: fix invalid JSON strings and resolve possible formatting errors in JSON data generated by LLMs.

IV. An improved solution

Our own improved scheme first detects missing and incorrect punctuation in the JSON and corrects the format, then completes the JSON, and finally extracts the JSON part of the content to obtain the structured result.
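
The article does not publish the code for this scheme, so the following is only a sketch of two of the steps described above: extracting the JSON part from the surrounding text and correcting full-width punctuation. The function names and the FULLWIDTH table are assumptions made for illustration, and the completion step is left out here.

```python
import json

# Full-width punctuation mapped to the ASCII characters JSON expects.
FULLWIDTH = {"，": ",", "：": ":"}


def extract_json_text(reply: str) -> str:
    """Return the text from the first '{' to the last '}' (or to the end)."""
    start = reply.find("{")
    if start == -1:
        raise ValueError("no JSON object found in reply")
    end = reply.rfind("}")
    return reply[start:end + 1] if end > start else reply[start:]


def normalize_punctuation(text: str) -> str:
    """Replace full-width punctuation, but only outside string values."""
    out = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            out.append(ch)             # string contents are left untouched
        else:
            if ch == '"':
                in_string = True
            out.append(FULLWIDTH.get(ch, ch))
    return "".join(out)


reply = ('The answer you need is ```json\n'
         '{"question": "Where will the 2024 Olympics be held?" ，'
         '"answer"："Paris，France"}\n```')
fixed = normalize_punctuation(extract_json_text(reply))
print(json.loads(fixed))
# {'question': 'Where will the 2024 Olympics be held?', 'answer': 'Paris，France'}
```

Only punctuation outside string values is rewritten, so a full-width comma that is part of the answer text survives unchanged.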

An example:

output = '''The answer you need is ```json\n{"question": "Where will the 2024 Olympics be held?"\n，\n"answer": "Paris\n'''

There are three errors here:

The comma ， is a full-width (Chinese) comma, so it cannot be parsed correctly.
"Paris has an unmatched quote: the closing quote is missing.
The closing triple backticks and } are missing.

In the end, all of the above errors were corrected, giving the correct output:

{'question': 'Where will the 2024 Olympics be held?', 'answer': 'Paris'}

Article copyright belongs to AI Sharing Circle; please do not reproduce without permission.
