
Solving Broken or Truncated JSON Output from LLMs with Long Contexts

I. Preface

Although most LLMs can now output tens of thousands of tokens, when we actually deploy them in business scenarios we still often find that the model does not follow the output format we specified.

This matters especially when we need structured output to drive downstream tasks.


Since JSON is plain text by nature, when you want structured output from an LLM you usually ask for JSON, then extract it and convert it into a dictionary to continue with the downstream task.
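A minimal illustration of that happy path with Python's standard json module (the sample strings are invented for illustration):

```python
import json

# Happy path: the model returned well-formed JSON,
# so we can load it straight into a dict for downstream use.
llm_output = '{"question": "How are you?", "answer": "Fine, thanks."}'
record = json.loads(llm_output)
print(record["answer"])  # Fine, thanks.
```

Everything below deals with the cases where this `json.loads` call fails.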

But many things can prevent us from getting complete, valid JSON output.

 

II. Cases Where JSON Is Output Only Halfway

2.1 Output exceeds the LLM's token limit

  1. The input context is too long, leaving too few tokens for the output, so the LLM cannot give a complete answer and the JSON breaks off halfway.
  2. The task is complex and uses CoT (chain-of-thought) prompting, so the LLM spends many tokens on intermediate reasoning and has little room left for the final result. Either you get no final result at all, or it stops halfway through the output.

2.2 Sensitive-word filtering intercepts the LLM's output

To prevent LLMs from being abused into producing inappropriate content, providers typically filter or block sensitive words before returning results, and what counts as sensitive differs by region. Recall how, a few years ago, Meta produced a criminally inclined AI, something no jurisdiction would tolerate.

So when your LLM outputs sensitive content in an uncontrolled way, the response may be truncated and you receive only a partial result.

2.3 Failure to comply strictly with instructions

This is much more common. Even the best LLMs have their off moments, and there is no guarantee the output will be in the format you asked for.

When parsing JSON output, we often run into situations like these:

  1. Missing critical punctuation. For example, {"question": How are you!} cannot be parsed because the value lacks double quotes.
  2. Punctuation in the wrong form or place. For example, {"question": "How are you!"，"answer": "I'm fine"} uses a full-width comma instead of a half-width one, which also makes the JSON fail to parse.
  3. There are many other scenarios; rather than list them all, let's look at how to solve these classes of problems.
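For punctuation errors like the full-width comma above, a small normalization pass is often enough. A minimal sketch (the mapping table and function name are my own, not from any particular library):

```python
import json

# Common full-width (Chinese) punctuation mapped to the ASCII
# characters a JSON parser expects.
FULLWIDTH_MAP = {
    "，": ",", "：": ":", "“": '"', "”": '"',
    "｛": "{", "｝": "}",
}

def normalize_punctuation(text: str) -> str:
    # Naive heuristic: this also rewrites full-width characters
    # inside string values, so apply it with care.
    for fullwidth, halfwidth in FULLWIDTH_MAP.items():
        text = text.replace(fullwidth, halfwidth)
    return text

broken = '{"question": "How are you!"，"answer": "I am fine"}'
print(json.loads(normalize_punctuation(broken)))
# {'question': 'How are you!', 'answer': 'I am fine'}
```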

 

III. An open source solution

There are usually three ways to address the problems above:

  1. Complete the JSON: for the case where only half the JSON was output, turn the partial JSON into a complete document so that it can be parsed.
  2. Repair the JSON: for answers with missing or incorrect punctuation.
  3. Discard part of the JSON: for special cases the first two approaches cannot handle, keep the well-formed content and drop the irregular parts.
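The third strategy, discarding irregular content, can be sketched as parsing progressively shorter prefixes of the text until one is valid JSON (a hypothetical helper, assuming the well-formed part comes first):

```python
import json

def parse_longest_valid_prefix(text: str):
    """Keep the normal content, discard a malformed tail: try ever
    shorter prefixes of the text until one parses as JSON."""
    start = text.find("{")
    if start == -1:
        return None
    candidate = text[start:]
    # O(n^2) worst case, but fine for typical LLM answers.
    for end in range(len(candidate), 1, -1):
        try:
            return json.loads(candidate[:end])
        except json.JSONDecodeError:
            continue
    return None

messy = '{"question": "How are you?", "answer": "Fine"} plus trailing chatter'
print(parse_longest_valid_prefix(messy))
# {'question': 'How are you?', 'answer': 'Fine'}
```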

3.1 streaming-json-py solution

This open-source solution provides libraries in three programming languages to suit different environments:

  • streaming-json-go
  • streaming-json-py
  • streaming-json-js

Here's how it looks in action:

pip install streamingjson

import streamingjson

lexer = streamingjson.Lexer()

# append your JSON segment
lexer.append_string('{"a":')

# complete the JSON
print(lexer.complete_json())  # will print `{"a":null}`

# append more JSON segment
lexer.append_string('[tr')

# complete the JSON again
print(lexer.complete_json())  # will print `{"a":[true]}`

This open-source solution focuses on completing partial JSON.

Alternatives: other tools focus on fixing invalid JSON strings and resolving formatting errors in LLM-generated JSON data.

 

IV. An Improved Scheme

Our own improved scheme first detects missing or incorrect punctuation in the JSON and corrects the format, then completes the JSON, and finally extracts the JSON portion to obtain the structured result.

Here is an example:

 

output = '''The answer you need is ```json\n{"question": "Where will the 2024 Olympics be held?”\n,\n "answer": "Paris\n'''

There are three errors here:

  • The closing quote after the question is a full-width (Chinese) ”, which cannot be parsed correctly.
  • The value "Paris" is missing its closing quote.
  • The closing ``` fence and the closing } are missing.

In the end all of the above errors are corrected and the right output is produced:

{'question': 'Where will the 2024 Olympics be held?', 'answer': 'Paris'}
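A sketch of such a repair pipeline, combining the steps on the example above (the helper name and implementation details are illustrative assumptions, not the author's actual code):

```python
import json

# Full-width punctuation that commonly leaks into LLM output.
FULLWIDTH = {"“": '"', "”": '"', "，": ",", "：": ":"}

def repair_llm_json(raw: str):
    """Sketch: normalize punctuation -> extract -> complete -> parse."""
    # 1. Replace full-width punctuation with ASCII equivalents.
    for fullwidth, halfwidth in FULLWIDTH.items():
        raw = raw.replace(fullwidth, halfwidth)
    # 2. Keep only the part from the first '{' on
    #    (drops surrounding prose and a leading ```json fence).
    start = raw.find("{")
    if start == -1:
        return None
    fragment = raw[start:]
    # 3. Scan once, tracking string state and unclosed brackets.
    stack, in_string, escaped = [], False, False
    for ch in fragment:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    # 4. Close an unterminated string, then any open brackets.
    completed = fragment.rstrip()
    if in_string:
        completed += '"'
    completed += "".join(reversed(stack))
    return json.loads(completed)

raw = ('The answer you need is ```json\n'
       '{"question": "Where will the 2024 Olympics be held?”\n'
       ',\n "answer": "Paris\n')
print(repair_llm_json(raw))
# {'question': 'Where will the 2024 Olympics be held?', 'answer': 'Paris'}
```

This sketch does not cover every case: a fragment ending right after a key (such as {"a":) would still fail to parse, which is where libraries like streaming-json-py go further by inserting null for dangling keys.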
May not be reproduced without permission: Chief AI Sharing Circle.
