An open source service that automatically parses PDF content and extracts text and tables

Latest AI Resources | Released 4 months ago | AI Sharing Circle

General Introduction

pdf-document-layout-analysis is an open source service that automatically analyzes the layout of PDF documents. It identifies the text, titles, images, tables, formulas and other elements on each page and determines their correct reading order. The tool also supports OCR, so scanned PDFs can be converted into searchable text. It runs on Docker and provides two models: a visual model (Vision Grid Transformer, or VGT) and a LightGBM model. The former is highly accurate but resource-intensive; the latter is fast and light on resources. The current version is v0.0.21. It is free and open source on GitHub, and suits researchers, archivists and anyone else who needs to process PDFs.

Feature List

Automatically recognizes text, titles, images, tables, formulas and other elements in PDF pages.
Supports OCR to convert scanned PDFs into searchable text.
Determines the correct reading order of page elements.
Provides two analysis modes: the visual model (VGT) and the LightGBM model.
Extracts tables and outputs them in several formats, such as Markdown, LaTeX and HTML.
Extracts formulas and outputs them in LaTeX format by default.
Supports multi-language OCR, such as English and Korean.
Provides an API for integration into other projects.
Supports visual output, generating a PDF with annotations.

Usage Guide

Installation process

This tool runs with Docker and the installation steps are as follows:

Preparing the environment
Install Docker first: go to the Docker website, then download and install it. After installation, type in the terminal:

docker --version

If a version number is displayed, the installation succeeded. If you are using a GPU, you also need to install the NVIDIA Container Toolkit; refer to its installation guide.

Pulling the image
Enter the command in the terminal to pull the tool image:

With a GPU:

docker pull huridocs/pdf-document-layout-analysis:v0.0.21

Without a GPU (the same image is used):

docker pull huridocs/pdf-document-layout-analysis:v0.0.21

Running the service
Start the service in one of two ways:

With a GPU:

docker run --rm --name pdf-analysis --gpus '"device=0"' -p 5060:5060 huridocs/pdf-document-layout-analysis:v0.0.21

Without a GPU:

docker run --rm --name pdf-analysis -p 5060:5060 huridocs/pdf-document-layout-analysis:v0.0.21

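When the service starts, it listens on port 5060 by default. If that port is already in use on the host, map a different host port, such as 5061, by changing the -p option to -p 5061:5060, and use localhost:5061 in the commands that follow.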
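
As a minimal sketch of the port change described above, the helper below starts the CPU variant of the service on a chosen host port. The function names port_mapping and start_on_port are hypothetical and not part of the tool; the container side of the mapping stays at 5060 because that is where the service listens.

```shell
# Hypothetical helper: build the -p value that maps a host port to the
# container port 5060 used by the service.
port_mapping() {
  printf '%s:5060\n' "${1:-5060}"
}

# Hypothetical helper: start the service (CPU variant) on the given host port.
start_on_port() {
  docker run --rm --name pdf-analysis -p "$(port_mapping "$1")" \
    huridocs/pdf-document-layout-analysis:v0.0.21
}
```

Running start_on_port 5061 would make the service reachable at localhost:5061, so the port in the later curl commands has to change accordingly.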

Verifying the service
Open your browser and visit http://localhost:5060/info. If version information is returned, the service is running normally.
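
The same check can be scripted. The sketch below polls the /info endpoint until it answers, which may help when the container needs a moment before it accepts requests (an assumption; the source does not state start-up times). The function name wait_for_service, the retry count and the 2-second pause are all illustrative choices.

```shell
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Arguments: $1 = URL (default: the /info endpoint), $2 = max attempts (default: 30).
wait_for_service() {
  url="${1:-http://localhost:5060/info}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s silences progress output, -f turns HTTP errors into a non-zero exit.
    if curl -sf "$url" >/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  return 1
}
```

For example, wait_for_service http://localhost:5060/info 10 returns success as soon as the endpoint responds and fails after ten unsuccessful attempts.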

How to use the main features

The tool operates through an API with the following common functions:

1. OCR function

Use OCR to convert a scanned PDF into searchable text.

Steps:
Prepare a PDF such as test.pdf, then run in the terminal:

curl -X POST -F 'language=en' -F 'file=@/path/to/test.pdf' localhost:5060/ocr --output result.pdf

language=en selects English and can be replaced with kor (Korean), etc. The supported languages can be viewed with curl localhost:5060/info.
/path/to/test.pdf is the file path, e.g. /home/user/test.pdf.
The output file result.pdf will be saved in the current directory.
Result:
You get a searchable PDF whose text can be copied.
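
When many scans need processing, the single-file request above can be wrapped in a loop. This is a sketch only: the function name ocr_directory and the _ocr.pdf output suffix are hypothetical choices, while the request itself uses the same fields and endpoint as the command shown earlier.

```shell
# Hypothetical helper: run OCR on every PDF in a directory.
# Arguments: $1 = directory with scanned PDFs, $2 = language code (default: en).
ocr_directory() {
  dir="$1"
  lang="${2:-en}"
  for pdf in "$dir"/*.pdf; do
    # Skip the unexpanded pattern when the directory has no PDFs.
    [ -e "$pdf" ] || continue
    # Skip files produced by an earlier run.
    case "$pdf" in *_ocr.pdf) continue ;; esac
    out="${pdf%.pdf}_ocr.pdf"
    # Same request as the single-file command; -f makes curl fail on HTTP errors.
    curl -f -X POST -F "language=${lang}" -F "file=@${pdf}" \
      localhost:5060/ocr --output "$out" || echo "OCR failed for ${pdf}" >&2
  done
}
```

Calling ocr_directory /home/user/scans kor would write a searchable copy next to each original, for example test_ocr.pdf beside test.pdf.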

2. Layout analysis

To extract the elements in the PDF and analyze the layout:

Steps:
Run:

curl -X POST -F 'file=@/path/to/test.pdf' localhost:5060 --output analysis.json

The output file analysis.json contains element information such as location and type (text, table, etc.).
Result:
The JSON file lists the details of each element.
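
The source does not show the structure of analysis.json. Assuming each element is a JSON object carrying a "type" field with a string value (an assumption; check your own output first), the element types can be tallied with standard tools. The function name count_element_types is hypothetical.

```shell
# Hypothetical helper: count how often each element type occurs.
# Assumes elements look like {"type": "Text", ...}; adjust the key if your
# output differs.
count_element_types() {
  grep -o '"type"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" \
    | sed 's/.*"\([^"]*\)"$/\1/' \
    | sort | uniq -c | sort -rn
}
```

Running count_element_types analysis.json prints one line per type, most frequent first. A text-matching approach like this is only a quick inspection aid; a real JSON parser is the safer choice for anything beyond that.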

3. Fast mode

For faster processing, use the LightGBM model by adding the parameter fast=true:

curl -X POST -F 'file=@/path/to/test.pdf' -F 'fast=true' localhost:5060 --output fast_analysis.json

Note: fast mode is quicker but slightly less accurate.

4. Table and formula extraction

Table extraction:
Specify the output format (e.g. Markdown):

curl -X POST -F 'file=@/path/to/test.pdf' -F 'extraction_format=markdown' localhost:5060 --output table.json

The supported formats are markdown, latex and html.

Formula extraction:
Formulas are output in LaTeX format by default and can be obtained directly with the layout analysis command.
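
The layout, fast mode and table extraction requests differ only in their form fields, so they can be gathered into one wrapper. This is a sketch under assumptions: the function name analyze_pdf and its argument order are hypothetical, and the source only shows fast and extraction_format used separately, so whether the service accepts both in a single request is not confirmed.

```shell
# Hypothetical helper wrapping the layout analysis request.
# Arguments: $1 = input PDF, $2 = output JSON,
#            $3 = "true" for fast mode (optional),
#            $4 = extraction format: markdown, latex or html (optional).
analyze_pdf() {
  pdf="$1"
  out="$2"
  fast="${3:-false}"
  format="${4:-}"
  # Start from the basic request, then append optional form fields.
  set -- -X POST -F "file=@${pdf}"
  if [ "$fast" = "true" ]; then
    set -- "$@" -F 'fast=true'
  fi
  if [ -n "$format" ]; then
    set -- "$@" -F "extraction_format=${format}"
  fi
  curl "$@" localhost:5060 --output "$out"
}
```

For example, analyze_pdf /path/to/test.pdf table.json false latex sends the same request as the table extraction command, with latex instead of markdown.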

5. Visualization output

To see an annotated PDF:

curl -X POST -F 'file=@/path/to/test.pdf' localhost:5060/visualize --output visualized.pdf

Result:
The output PDF is annotated with the location and type of each element.

6. Adding language support

Only a few languages are supported by default. To add more (e.g. Chinese):

Enter the container:

docker exec -it --user root pdf-analysis /bin/bash

Install the language pack, e.g. Chinese:

apt-get install tesseract-ocr-chi-sim

Check:

curl localhost:5060/info

If chi_sim appears in the output, the installation succeeded.
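
The check can also be automated. Since the source does not describe the format of the /info response, the sketch below simply searches it for the language code as a whole word; the function name has_language is hypothetical.

```shell
# Hypothetical helper: succeed if the /info response mentions the language code.
has_language() {
  curl -s localhost:5060/info | grep -qw -- "$1"
}
```

For example, has_language chi_sim && echo "Chinese is available" prints the message only when the code is listed.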

7. Stopping the service

To stop the service:

docker stop pdf-analysis

Output element order

The results of the analysis are organized in a specific order. The tool uses Poppler to determine the initial reading order, which is then adjusted according to the element type:

Headers are placed first, at the top of the page, keeping their own internal order.
Regular elements (text, tables, etc.) follow, arranged in average reading order.
Footers and footnotes are placed last.
Elements without text (e.g., images) are positioned according to the nearest element that has text.

Caveats

Hardware requirements: The visual model needs a GPU with 5GB of video memory; without a GPU it runs on the CPU but is slow. The LightGBM model runs on the CPU only and requires 2GB of RAM.
Speed: On a 15-page academic paper, fast mode takes 0.42 sec/page, VGT on a GPU takes 1.75 sec/page, and VGT on a CPU takes 13.5 sec/page.
Troubleshooting: When something goes wrong, view the logs:

docker logs pdf-analysis
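
To put the per-page speeds listed above in perspective, the sketch below multiplies a page count by a per-page rate. The function name estimate_seconds is hypothetical, and the result is only a rough guide, since the quoted rates come from one 15-page academic paper and other documents may behave differently.

```shell
# Hypothetical helper: estimate total processing time in seconds
# as pages x seconds per page.
estimate_seconds() {
  awk -v pages="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", pages * rate }'
}

# Rates quoted for the 15-page academic paper.
estimate_seconds 15 0.42   # fast mode (LightGBM)
estimate_seconds 15 1.75   # VGT on a GPU
estimate_seconds 15 13.5   # VGT on a CPU
```

At these rates the 15-page paper takes roughly 6 seconds in fast mode, 26 seconds with VGT on a GPU, and over 3 minutes with VGT on a CPU.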

These features and steps will help you get started quickly and handle a variety of PDF needs.

Application scenarios

Academic research
Researchers use it to extract tables and formulas from papers and organize data more efficiently.
Document management
Archivists convert scans of old documents into searchable PDFs that are easy to search.
Legal work
Attorneys analyze contract PDFs to quickly locate clauses and tables.

FAQ

Is there a fee?
No. It is an open source tool, free to download and use from GitHub.
Do I need an internet connection?
An internet connection is required to download the image; after that, the tool can run offline.
Does it support Chinese?
Yes. The Chinese language pack must be installed manually (e.g. tesseract-ocr-chi-sim). Results are slightly worse than for English but usable.