I. Preface
Most LLMs can now output tens of thousands of tokens. However, when we actually put them to work in a business setting, we still often find that the LLM does not follow the output format we specified.
In particular, we need the output to be structured so that downstream tasks can continue.
Because JSON is inherently plain-text friendly, structured LLM output is usually requested as JSON; the JSON is then extracted and converted into a dictionary for the downstream task.
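As a minimal sketch of that extract-and-convert step, assuming the reply contains exactly one JSON object surrounded by ordinary text (the function name `extract_json` and the sample reply are made up for illustration):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Pull the outermost {...} span out of an LLM reply and parse it."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the reply")
    return json.loads(match.group(0))

fence = "`" * 3  # a markdown code fence, built here to keep the sample readable
reply = (
    f"Here is the result:\n{fence}json\n"
    '{"question": "How are you?", "answer": "I am fine"}'
    f"\n{fence}"
)
result = extract_json(reply)
print(result["answer"])  # I am fine
```

This only works when the JSON in the reply is complete and valid, which is exactly what cannot be taken for granted.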
But there are many reasons why we may not get the complete JSON output.
II. Several cases where only half of the JSON is output
2.1 Output exceeds the LLM's token limit
- The input context is too long, leaving too few tokens for the LLM to give a complete answer, so the JSON stops halfway through.
- The task is complex enough to require the CoT technique, and the LLM spends most of its output on intermediate reasoning, leaving little room for the final result. Either you get no final result at all, or the result stops halfway through.
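Both cases come down to the same arithmetic. Here is a rough sketch, assuming a simplified model in which the prompt, the reasoning text, and the final answer all share one token limit; the function name and every number are hypothetical:

```python
def remaining_output_budget(context_window: int, prompt_tokens: int,
                            reasoning_tokens: int = 0) -> int:
    """Tokens left for the final answer after the prompt and any CoT text."""
    return max(context_window - prompt_tokens - reasoning_tokens, 0)

# Hypothetical numbers, chosen only for illustration.
print(remaining_output_budget(8192, 7900))        # long input: 292 tokens left
print(remaining_output_budget(8192, 6000, 2100))  # long reasoning: 92 tokens left
```

If the JSON you expect needs more tokens than the remaining budget, it will be cut off partway.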
2.2 Sensitive-word filtering cuts off the LLM output
To prevent LLMs from being abused and producing inappropriate content, output shown to external users is usually filtered or blocked for sensitive words. What counts as sensitive differs by region, so the filtering depends on the location. Just as, a few years ago, Meta produced an AI with criminal leanings, which is not acceptable anywhere in the world.
So when your LLM produces sensitive content in an uncontrolled way, the output can be truncated, leaving you with a half-finished result.
2.3 Failure to follow instructions strictly
This is much more common. Even the best LLMs have off moments, and there is no guarantee that the output will be in the format you want.
In the JSON output we often run into situations like these:
- Missing critical punctuation; for example:
{"question": How are you!}
Without the double quotes around the value, this output cannot be parsed.
- Punctuation of the wrong kind; for example:
{"question": "How are you!"，"answer": "I'm fine"}
The comma in the middle is full-width, not half-width, which also causes JSON parsing to fail.
There are many other scenarios, so I won't list them all. Here is how to solve these kinds of problems.
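To see both failures concretely, the following sketch feeds the two cases to Python's standard `json` module: an unquoted value, and a full-width comma between two pairs. The helper name `can_parse` is made up for illustration.

```python
import json

def can_parse(text: str) -> bool:
    """Return True if text is valid JSON, printing the parser error otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError as err:
        print(f"failed: {err.msg} at char {err.pos}")
        return False

samples = [
    '{"question": How are you!}',                          # value lacks double quotes
    '{"question": "How are you!"，"answer": "I\'m fine"}',  # full-width comma
]

for text in samples:
    can_parse(text)
```

Both samples are rejected by the parser, so neither can be turned into a dictionary without being repaired first.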
III. An open source solution
There are usually three ways to solve the three problems mentioned above:
- Complete the JSON: mainly for answers where only half of the JSON was output. The incomplete half is closed off into a well-formed JSON document so that it can be parsed.
- Fix the JSON: mainly for JSON answers with missing or incorrect punctuation.
- Discard part of the JSON content: mainly for special cases that cannot be repaired. Keep the well-formed content and discard the irregular parts.
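The first approach, completion, can be sketched with a small stack that tracks what is still open. This is only an illustration of the idea, not how any particular library implements it, and the function name `complete_json` is my own. It closes an open string and any open arrays or objects; it does not handle a dangling key, a trailing comma, or a half-written literal such as `tr`.

```python
import json

def complete_json(fragment: str) -> str:
    """Close any string, array, or object left open in a truncated JSON fragment."""
    stack = []          # closing brackets still owed, innermost last
    in_string = False
    escaped = False
    for ch in fragment:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]" and stack:
            stack.pop()
    closing_quote = '"' if in_string else ""
    return fragment + closing_quote + "".join(reversed(stack))

completed = complete_json('{"items": ["a", "b')
print(completed)              # {"items": ["a", "b"]}
print(json.loads(completed))  # {'items': ['a', 'b']}
```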
3.1 The streaming-json-py solution
This is an open source solution, and it provides:
- streaming-json-go
- streaming-json-py
- streaming-json-js
These three language implementations cover the needs of different environments.
Here is a look at it in action:
pip install streamingjson
import streamingjson

lexer = streamingjson.Lexer()
# append your JSON segment
lexer.append_string('{"a":')
# complete the JSON
print(lexer.complete_json())  # will print `{"a":null}`
# append more JSON segment
lexer.append_string('[tr')
# complete the JSON again
print(lexer.complete_json())  # will print `{"a":[true]}`
This open source solution focuses on completing the JSON.
Alternatives: tools that fix invalid JSON strings and resolve possible formatting errors in JSON data generated by LLMs.
IV. An improved approach
Our own improved approach works in three steps: first detect missing or incorrect punctuation in the JSON and correct the format, then complete the JSON where it was cut off, and finally extract the JSON part of the content to get the structured result.
Two examples.
output = '''The answer you need is ```json\n{"question": "Where will the 2024 Olympics be held?"\n，\n"answer": "Paris\n'''
There are three errors here:
- The comma "，" is a Chinese (full-width) character and cannot be parsed correctly
- "Paris is missing its closing quote
- The closing triple backticks and } are missing
In the end, all of the above errors were corrected, giving the correct output:
{'question': 'Where will the 2024 Olympics be held?', 'answer': 'Paris'}
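The repair of this example can be sketched in a few dozen lines of standard-library Python. This is a simplified illustration of the fix, complete, extract idea described above, not our full implementation; the name `repair_json` and the `FULL_WIDTH` table are assumptions made for the sketch. It handles only full-width commas and colons outside strings, an unterminated string, and unclosed brackets.

```python
import json

# Full-width punctuation mapped to the half-width form JSON expects.
FULL_WIDTH = {"，": ",", "：": ":"}

def repair_json(text: str) -> dict:
    """Extract, normalise, and complete the JSON object embedded in an LLM reply."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    # Drop the leading chatter, trailing whitespace, and a trailing code fence if any.
    fragment = text[start:].rstrip()
    if fragment.endswith("`" * 3):
        fragment = fragment[:-3].rstrip()

    out, stack = [], []
    in_string = escaped = False
    for ch in fragment:
        if in_string:
            out.append(ch)  # never rewrite characters inside a string
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        ch = FULL_WIDTH.get(ch, ch)  # fix punctuation outside strings
        out.append(ch)
        if ch == '"':
            in_string = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]" and stack:
            stack.pop()
    if in_string:
        out.append('"')          # close the unterminated string
    out.extend(reversed(stack))  # close any open arrays and objects
    return json.loads("".join(out))

fence = "`" * 3
output = (
    "The answer you need is " + fence + "json\n"
    '{"question": "Where will the 2024 Olympics be held?"\n，\n"answer": "Paris\n'
)
print(repair_json(output))
# {'question': 'Where will the 2024 Olympics be held?', 'answer': 'Paris'}
```

Stripping trailing whitespace before closing the string is what keeps the stray newline out of the final "Paris" value.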