PaddlePaddle PP-TableMagic: Structured Information Extraction for Complex Tables

Table recognition aims to parse tables in images, accurately identifying the table structure and cell locations, and to restore them to a structured format such as HTML. Even today, a large amount of important tabular data still exists only in unstructured form (for example, photographed statistics tables in scanned documents or data tables in PDF financial statements) and cannot be processed automatically as-is. Table recognition has therefore become a key technology for intelligent document understanding and automated data analysis. High-performance table recognition is valuable in financial statement processing, scientific data analysis, insurance claims accounting, and similar fields, where it can significantly improve efficiency and reduce human error. However, traditional general-purpose table recognition models often struggle to adapt to the complex table formats found across application scenarios. To address this, PaddlePaddle (Flying Paddle) has launched a new table recognition solution, PP-TableMagic.

PP-TableMagic results

PP-TableMagic Technical Analysis

Limitations of Current Technology

Common table recognition solutions currently adopt the following framework: the user inputs a table image, the model jointly predicts the table's HTML structure and its cell positions, and the two outputs are then combined into a complete HTML table. This approach performs well on common, simple tables, but it has two problems:

Table recognition models usually have few parameters, and the two tasks (table structure prediction and cell position prediction) differ considerably in their objectives and in the semantic level of the features they rely on, so joint optimization hits a performance ceiling.
When users fine-tune the model for a specific scenario, fine-tuning on certain types of table data can improve one thing at the expense of another: accuracy on the fine-tuned table categories rises while accuracy on other categories falls, so overall performance may drop instead of improving.
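
The conventional framework described above can be illustrated with a minimal sketch. It assumes a simplified structure vocabulary in which every cell is emitted as a single `<td></td>` token, and it takes already-recognized cell texts in reading order. The function name `build_html` and the token set are illustrative assumptions, not the output format of any particular model; real models also emit tokens for row and column spans.

```python
from html import escape


def build_html(structure_tokens, cell_texts):
    """Fill each empty cell token in a structure sequence with its text.

    Assumes one "<td></td>" token per cell and cell_texts in reading order
    (a simplification made for this sketch).
    """
    parts = []
    texts = iter(cell_texts)
    for token in structure_tokens:
        if token == "<td></td>":
            # Missing texts become empty cells; text is HTML-escaped.
            parts.append(f"<td>{escape(next(texts, ''))}</td>")
        else:
            parts.append(token)
    return "".join(parts)


tokens = ["<table>", "<tr>", "<td></td>", "<td></td>", "</tr>", "</table>"]
print(build_html(tokens, ["Item", "Price"]))
# prints: <table><tr><td>Item</td><td>Price</td></tr></table>
```

Because structure and cell positions come from a single model in this framework, an error in either output directly corrupts the merged table.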

PP-TableMagic Technical Solution and Principles

To get the most out of lightweight table recognition models and to support targeted fine-tuning on any type of table data, PP-TableMagic adopts the structure shown in the figure below:

PP-TableMagic adopts a dual-stream architecture: it first classifies each table as wired (with ruled borders) or wireless (without ruled borders). It then splits the end-to-end table recognition task into two sub-tasks, cell detection and table structure recognition, and finally obtains the complete HTML table prediction through a self-optimizing result fusion algorithm. Specifically:

The PaddlePaddle team developed the lightweight table classification model PP-LCNet_x1_0_table_cls in-house to classify wired and wireless tables with high accuracy.
The team released the industry's first open-source table cell detection model, RT-DETR-L_table_cell_det, with pre-trained weights for wired table cell detection (RT-DETR-L_wired_table_cell_det) and wireless table cell detection (RT-DETR-L_wireless_table_cell_det), to locate cells precisely across table types.
PaddlePaddle introduced a new table structure recognition model, SLANeXt, which parses table structure better than SLANet and SLANet_plus and so yields more accurate table HTML structures.
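
As a rough illustration of the dual-stream flow just described, the sketch below routes an image through a classifier, then through the cell detector and structure recognizer of the matching branch, and finally through a fusion step. All names here (`Branch`, `recognize_table`, the stub callables) are hypothetical, and the fusion step is a placeholder; the actual self-optimizing fusion algorithm is not described in this article.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Branch:
    """One stream of the pipeline: models specialized for one table type."""
    detect_cells: Callable[[object], List[Box]]
    recognize_structure: Callable[[object], List[str]]


def recognize_table(image, classify, branches: Dict[str, Branch], fuse):
    """Classify the table, run the matching branch, then fuse both outputs."""
    table_type = classify(image)      # e.g. "wired" or "wireless"
    branch = branches[table_type]     # pick the specialized models
    cells = branch.detect_cells(image)
    structure = branch.recognize_structure(image)
    return fuse(structure, cells)


# Stub models standing in for the real networks (assumptions for the demo).
branches = {
    "wired": Branch(
        detect_cells=lambda img: [(0, 0, 10, 10)],
        recognize_structure=lambda img: ["<table>", "<tr>", "<td></td>", "</tr>", "</table>"],
    ),
    "wireless": Branch(
        detect_cells=lambda img: [],
        recognize_structure=lambda img: ["<table>", "</table>"],
    ),
}

result = recognize_table(
    "table.jpg",
    classify=lambda img: "wired",
    branches=branches,
    fuse=lambda structure, cells: {"structure": structure, "cells": cells},
)
print(result["cells"])
```

Keeping each branch as an independent pair of models is what makes it possible to retrain one of them without touching the others.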

Within the PP-TableMagic framework, SLANeXt, the new table structure recognition model developed by PaddlePaddle, is particularly important. Table structure recognition is the most critical step in table recognition, and predicting an HTML representation from a table image relies on high-level image features. SLANeXt therefore uses Vary-ViT-B, which has stronger feature representation capability, as its visual encoder and feeds the extracted features into SLAHead for more accurate structure recognition. Beyond the architectural change, the training strategy was also improved: using PaddlePaddle's self-built full dataset plus a high-quality fine-tuning dataset, a new three-stage pre-training strategy produces separate structure recognition weights for wired tables and wireless tables.

To evaluate SLANeXt's table recognition capability, the R&D team ran tests on several kinds of datasets. The experimental results are as follows:

On an internal high-level table recognition evaluation set:

Based on partners' real business data:

The experimental results show that SLANeXt significantly outperforms SLANet_plus.

Algorithmic Applications

With PP-TableMagic you can not only use its HTML table prediction capability to process tables directly, but also exploit its modular structure for customized model fine-tuning.

When fine-tuning other end-to-end table recognition models to fix bad cases, often only data of the failing type can be collected, so it is hard to build a large training set. This leads to a trade-off in which gains on one table type come at the cost of others, and overall performance falls rather than rises.

In addition, fine-tuning an end-to-end table recognition model requires annotating both the table structure and the cell positions of the training data, which is time-consuming and laborious in most application scenarios.

With PP-TableMagic's multi-model architecture, improving performance on a particular type of table requires fine-tuning only the one or few models that matter most, which minimizes the impact on recognition of other table types.

As a result, when PP-TableMagic is fine-tuned in real scenarios, recognition performance on the different table types is largely independent, and data only needs to be annotated for the model being fine-tuned, which saves considerable labeling effort.
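
The idea of replacing a single model while leaving the rest untouched can be sketched as a small configuration edit. The dictionary layout below (a `SubModules` mapping whose entries carry `model_name` and `model_dir`), the module keys, and the weights path are assumptions made for illustration; check the pipeline configuration file shipped with your PaddleX version for the real key names.

```python
import copy


def with_finetuned_model(config, module_name, model_dir):
    """Return a copy of a pipeline config with one sub-module using new weights.

    The "SubModules" / "model_dir" layout is assumed for this sketch.
    """
    if module_name not in config["SubModules"]:
        raise KeyError(f"unknown sub-module: {module_name}")
    updated = copy.deepcopy(config)  # leave the original config untouched
    updated["SubModules"][module_name]["model_dir"] = model_dir
    return updated


# Hypothetical, abbreviated config with two of the pipeline's models.
base_config = {
    "pipeline_name": "table_recognition_v2",
    "SubModules": {
        "TableClassification": {
            "model_name": "PP-LCNet_x1_0_table_cls",
            "model_dir": None,
        },
        "WirelessTableCellsDetection": {
            "model_name": "RT-DETR-L_wireless_table_cell_det",
            "model_dir": None,
        },
    },
}

custom = with_finetuned_model(
    base_config,
    "WirelessTableCellsDetection",
    "./output/best_model/inference",  # hypothetical path to fine-tuned weights
)
print(custom["SubModules"]["WirelessTableCellsDetection"]["model_dir"])
```

Only the wireless cell detector changes here, so wired tables and the table classifier keep their original behavior.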

Developers with strong coding skills can also adjust PP-TableMagic's architecture directly at the branch level. As shown in the figure below, when a particular type of table data proves especially important, a separate branch can be set up to process it, which can greatly improve overall table recognition capability.

PP-TableMagic delivers strong performance and supports highly customizable, flexible, targeted model fine-tuning, so it can be tuned for the best table recognition performance in a wide range of application scenarios. It is the first open-source solution to offer highly customizable table recognition across all scenarios.

Getting Started

Installation

Install PaddlePaddle:

# CPU version
python -m pip install paddlepaddle==3.0.0rc0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
# GPU version (CUDA 11.8); requires GPU driver version ≥450.80.02 (Linux) or ≥452.39 (Windows)
python -m pip install paddlepaddle-gpu==3.0.0rc0 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
# GPU version (CUDA 12.3); requires GPU driver version ≥545.23.06 (Linux) or ≥545.84 (Windows)
python -m pip install paddlepaddle-gpu==3.0.0rc0 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/

Install the PaddleX wheel package:

pip install https://paddle-model-ecology.bj.bcebos.com/paddlex/whl/paddlex-3.0.0rc0-py3-none-any.whl

Quick Start

PP-TableMagic can be used out of the box.

PaddleX provides an easy-to-use Python API that runs model prediction in just a few lines of code.
Download the test image:

https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/table_recognition_v2.jpg

PaddleX supports calling PP-TableMagic from the command line or from a Python script. In PaddleX, PP-TableMagic corresponds to the table_recognition_v2 pipeline.

Command line method:

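# Run the pipeline on the downloaded test image (use --device cpu if no GPU is available)
paddlex --pipeline table_recognition_v2 \
    --use_doc_orientation_classify=False \
    --use_doc_unwarping=False \
    --input table_recognition_v2.jpg \
    --save_path ./output \
    --device gpu:0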

Python scripting method:

from paddlex import create_pipeline

# Create the PP-TableMagic pipeline (table_recognition_v2 in PaddleX)
pipeline = create_pipeline(pipeline="table_recognition_v2")

# Run prediction on the downloaded test image
output = pipeline.predict(
    input="table_recognition_v2.jpg",
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
)

# Print each result and save it in several formats
for res in output:
    res.print()
    res.save_to_img("./output/")
    res.save_to_xlsx("./output/")
    res.save_to_html("./output/")
    res.save_to_json("./output/")

After the run completes, the recognition results are saved under the specified path.
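
Once a table has been saved as HTML, it can be loaded back into rows for further processing. The following is a minimal sketch using only the Python standard library; the helper names `TableRows` and `html_table_to_rows` are illustrative, and the parser ignores rowspan and colspan, so merged cells appear only once in their row.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html_text):
    """Parse an HTML table string into a list of rows of cell strings."""
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    return parser.rows


sample = (
    "<table><tr><td>Item</td><td>Price</td></tr>"
    "<tr><td>Pen</td><td>1.50</td></tr></table>"
)
print(html_table_to_rows(sample))
# prints: [['Item', 'Price'], ['Pen', '1.50']]
```

The same function can be pointed at the contents of a saved HTML file by reading the file into a string first.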

Secondary Development

If you are satisfied with PP-TableMagic's results, you can go straight to high-performance inference, service deployment, or on-device deployment of the pipeline. If your table scenario is highly specialized and there is still room for improvement, you can use PaddleX to carry out targeted secondary development (fine-tuning) of one or several models in PP-TableMagic on your own scenario data, taking full advantage of PP-TableMagic's customizable fine-tuning. PaddleX uses unified commands for data validation, model training, and evaluation and inference, so you do not need to understand the underlying deep learning principles: prepare your scenario data in the required format and run the commands to complete a model iteration. The following shows the secondary development process for the wireless table cell detection model RT-DETR-L_wireless_table_cell_det:

python main.py -c paddlex/configs/modules/table_cells_detection/RT-DETR-L_wireless_table_cell_det.yaml \
    -o Global.mode=train \
    -o Global.dataset_dir=./path_to_your_datasets

All the other models also support secondary development. For details, see:

https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-rc/docs/pipeline_usage/tutorials/ocr_pipelines/table_recognition_v2.md#4-%E4%BA%8C%E6%AC%A1%E5%BC%80%E5%8F%91

Service Deployment

PaddleX also provides service deployment for PP-TableMagic: the table recognition inference capability is wrapped as a service, and clients obtain table recognition results through network requests.

PaddleX offers two service deployment options: basic service deployment and high-stability service deployment. Basic service deployment is a simple, easy-to-use option with low development cost that lets users deploy quickly and verify results. High-stability service deployment is built on NVIDIA Triton Inference Server and provides higher stability and higher performance.
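
To make the client side of such a service concrete, here is a minimal sketch that posts a base64-encoded image as JSON using only the Python standard library. The server address, the `/table-recognition` path, and the `file` and `fileType` field names are assumptions for illustration; confirm the actual endpoint and request schema in the PaddleX serving documentation before relying on them.

```python
import base64
import json
import urllib.request

# Assumed address and endpoint path of a locally running service.
DEFAULT_URL = "http://localhost:8080/table-recognition"


def build_request(image_bytes, url=DEFAULT_URL):
    """Build a JSON POST request carrying the image as base64 text."""
    payload = {
        "file": base64.b64encode(image_bytes).decode("ascii"),  # assumed field name
        "fileType": 1,  # assumed convention: 1 marks an image input
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def recognize_remote(image_path, url=DEFAULT_URL):
    """Send one image to the service and return the decoded JSON response."""
    with open(image_path, "rb") as f:
        request = build_request(f.read(), url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a service running, `recognize_remote("table_recognition_v2.jpg")` would return the parsed response; its structure depends on the deployed service version.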

For more information on PP-TableMagic, see the official PaddleX pipeline documentation:

https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-rc/docs/pipeline_usage/tutorials/ocr_pipelines/table_recognition_v2.md