General Introduction
pdf-document-layout-analysis is an open-source tool from HURIDOCS that automatically analyzes the layout of PDF documents: it identifies text, titles, images, tables, formulas, and other elements on each page and determines their correct reading order. It also supports OCR, so scanned PDFs can be converted to searchable text. The tool runs in Docker and provides two models: a visual model (Vision Grid Transformer, or VGT) and a LightGBM model. The former is more accurate but resource-intensive; the latter is faster and lighter. The version described here is v0.0.21. It is free and open source on GitHub and is suited to researchers, archivists, and others who work with PDFs.
Function List
- Automatically recognizes text, titles, images, tables, formulas, and other elements in PDF pages.
- Supports OCR to convert scanned PDFs to searchable text.
- Determines the correct reading order of page elements.
- Provides two analysis modes: the visual model (VGT) and the LightGBM model.
- Extracts tables and supports several output formats: Markdown, LaTeX, and HTML.
- Extracts formulas and outputs them as LaTeX by default.
- Supports multi-language OCR, such as English and Korean.
- Provides an API for integration into other projects.
- Supports visual output by generating an annotated PDF.
Usage Guide
Installation process
The tool runs in Docker. The installation steps are as follows:
- Prepare the environment
Install Docker first: download it from the Docker website and install it. After installation, type in the terminal:
docker --version
If a version number is displayed, the installation succeeded. To use a GPU, you also need to install the NVIDIA Container Toolkit; refer to its installation guide.
- Pull the image
Enter the command in the terminal to pull the tool image (the same image is used with or without a GPU):
- With a GPU:
docker pull huridocs/pdf-document-layout-analysis:v0.0.21
- Without a GPU:
docker pull huridocs/pdf-document-layout-analysis:v0.0.21
- Run the service
Start the service in one of two ways:
- With a GPU:
docker run --rm --name pdf-analysis --gpus '"device=0"' -p 5060:5060 huridocs/pdf-document-layout-analysis:v0.0.21
- Without a GPU:
docker run --rm --name pdf-analysis -p 5060:5060 huridocs/pdf-document-layout-analysis:v0.0.21
Once started, the service listens on port 5060 by default. If that port is already in use, publish the service on another host port, such as 5061, by changing the mapping to -p 5061:5060.
- Verify the service
Open your browser and visit http://localhost:5060/info
If version information is returned, the service is running normally.
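The same check can be scripted. The following is a minimal sketch using only the Python standard library; the helper names `info_url` and `fetch_info` are illustrative, and the body is returned as plain text because the exact structure of the /info response is not described here.

```python
import urllib.request


def info_url(host: str = "localhost", port: int = 5060) -> str:
    """Build the URL of the /info endpoint for a given host and port."""
    return f"http://{host}:{port}/info"


def fetch_info(url: str, timeout: float = 5.0) -> str:
    """Return the raw response body of the /info endpoint as text.

    Raises urllib.error.URLError if the service is not reachable.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example (requires the container to be running):
#   print(fetch_info(info_url()))
# If the service was published on host port 5061 instead:
#   print(fetch_info(info_url(port=5061)))
```

A connection error at this point usually means the container is not running or is published on a different port.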
How to use the main features
The tool operates through an API with the following common functions:
1. OCR function
To convert scanned PDF to searchable text, you can use OCR.
- Procedure:
Prepare a PDF such as test.pdf, then run in the terminal:
curl -X POST -F 'language=en' -F 'file=@/path/to/test.pdf' localhost:5060/ocr --output result.pdf
language=en selects English and can be replaced with kor (Korean), etc. The supported languages can be listed with curl localhost:5060/info. /path/to/test.pdf is the file path, e.g. /home/user/test.pdf. The output file result.pdf is saved in the current directory.
- Result:
You get a searchable PDF whose text can be copied.
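For integration into a script, the same OCR request can be sent from Python. This is a sketch under stated assumptions rather than an official client: it uses only the standard library, the endpoint (/ocr) and form fields (`language`, `file`) are taken from the curl command above, and the helper names `encode_multipart` and `ocr_pdf` are hypothetical.

```python
import uuid
import urllib.request
from pathlib import Path
from typing import Dict, Tuple


def encode_multipart(fields: Dict[str, str], file_field: str, filename: str,
                     file_bytes: bytes, boundary: str = "") -> Tuple[bytes, str]:
    """Encode form fields plus one PDF file as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: application/pdf\r\n\r\n'.encode("utf-8")
    )
    parts.append(file_bytes)
    parts.append(f'\r\n--{boundary}--\r\n'.encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def ocr_pdf(pdf_path: str, out_path: str, language: str = "en",
            base_url: str = "http://localhost:5060") -> None:
    """POST a PDF to /ocr and save the PDF that the service returns."""
    source = Path(pdf_path)
    body, content_type = encode_multipart(
        {"language": language}, "file", source.name, source.read_bytes()
    )
    request = urllib.request.Request(
        f"{base_url}/ocr", data=body,
        headers={"Content-Type": content_type}, method="POST",
    )
    with urllib.request.urlopen(request) as response:
        Path(out_path).write_bytes(response.read())


# Example (requires the running service and a real file):
#   ocr_pdf("/home/user/test.pdf", "result.pdf", language="en")
```

Building the multipart body by hand avoids a third-party dependency; with a library such as requests installed, the upload would be shorter.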
2. Layout analysis
To extract the elements in the PDF and analyze the layout:
- Procedure:
Run:
curl -X POST -F 'file=@/path/to/test.pdf' localhost:5060 --output analysis.json
The output file analysis.json contains element information such as location and type (text, table, etc.).
- Result:
The JSON file lists the details of each element.
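A quick way to inspect analysis.json is to count the elements by type. The sketch below assumes the file holds a JSON list of objects that each carry a "type" field; that field name is an assumption and should be checked against your own output. The helper names are illustrative.

```python
import json
from collections import Counter
from typing import Dict, List


def count_by_type(segments: List[dict]) -> Dict[str, int]:
    """Count how many elements of each type the analysis returned.

    Elements without a "type" field (an assumed key) are counted as "unknown".
    """
    return dict(Counter(seg.get("type", "unknown") for seg in segments))


def load_segments(path: str = "analysis.json") -> List[dict]:
    """Load the saved analysis result from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Example (after running the curl command above):
#   print(count_by_type(load_segments("analysis.json")))
```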
3. Fast mode
For faster processing, use the LightGBM model by adding the parameter fast=true:
curl -X POST -F 'file=@/path/to/test.pdf' -F 'fast=true' localhost:5060 --output fast_analysis.json
- Note: this mode is faster but slightly less accurate.
4. Table and formula extraction
- Extract tables:
Specify the format (e.g. Markdown):
curl -X POST -F 'file=@/path/to/test.pdf' -F 'extraction_format=markdown' localhost:5060 --output table.json
The supported formats are markdown, latex, and html.
- Extract formulas:
Formulas are output as LaTeX by default, so they can be obtained directly with the layout analysis command.
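Once table.json is saved, the converted tables or formulas can be pulled out with a few lines of Python. This sketch assumes the response is a JSON list of objects whose "type" is "Table" or "Formula" and whose "text" field carries the converted markup; these field names and type labels are assumptions, not confirmed here, so verify them against your own output.

```python
import json
from typing import List, Tuple


def texts_of_type(segments: List[dict], wanted: str) -> List[Tuple[int, str]]:
    """Return (page_number, text) pairs for elements of one type.

    The keys "type", "page_number", and "text" are assumed field names.
    """
    return [
        (seg.get("page_number", 0), seg.get("text", ""))
        for seg in segments
        if seg.get("type") == wanted
    ]


def load_tables(path: str = "table.json") -> List[Tuple[int, str]]:
    """Read a saved response and keep only the table elements."""
    with open(path, encoding="utf-8") as handle:
        return texts_of_type(json.load(handle), "Table")


# Example (after running the curl command above):
#   for page, markup in load_tables("table.json"):
#       print(f"--- table on page {page} ---")
#       print(markup)
```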
5. Visualization output
To get an annotated PDF:
curl -X POST -F 'file=@/path/to/test.pdf' localhost:5060/visualize --output visualized.pdf
- Result:
The output PDF is annotated with the location and type of each element.
6. Adding language support
Only a few languages are supported by default. To add more (e.g. Chinese):
- Enter the container:
docker exec -it --user root pdf-analysis /bin/bash
- Install the language pack, e.g. Chinese:
apt-get install tesseract-ocr-chi-sim
- Check:
curl localhost:5060/info
If chi_sim appears in the output, the installation succeeded.
7. Stopping the service
To stop the service:
docker stop pdf-analysis
Output element order
The results of the analysis are organized in a specific order. The tool uses Poppler to determine the initial reading order, which is then adjusted according to the element type:
- Headers are placed first on each page, keeping their internal order.
- Common elements (text, tables, etc.) are sorted by their average reading order.
- Footers and footnotes are placed last.
- Elements without text (e.g., images) are ordered according to the order of the nearest element with text.
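The grouping described in the first three rules can be illustrated with a short sketch. It is not the tool's actual implementation: the type labels ("Page header", "Page footer", "Footnote") are assumed, the input is taken to be one page's elements already in the initial reading order, and the rule for elements without text is left out.

```python
from typing import Dict, List

# Assumed type labels; check them against the "type" values in your output.
HEADER_TYPES = {"Page header"}
TRAILING_TYPES = {"Page footer", "Footnote"}


def order_page(segments: List[Dict]) -> List[Dict]:
    """Group one page's elements: headers, then body, then footers/footnotes.

    `segments` must already be in the initial reading order. Each group
    keeps that relative order, because list comprehensions preserve it.
    """
    headers = [s for s in segments if s["type"] in HEADER_TYPES]
    trailing = [s for s in segments if s["type"] in TRAILING_TYPES]
    body = [s for s in segments
            if s["type"] not in HEADER_TYPES | TRAILING_TYPES]
    return headers + body + trailing
```

Keeping each group in its original relative order means the sketch only moves elements between groups and never reorders them within one.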
Notes
- Hardware requirements: the visual model (VGT) needs a GPU with 5 GB of video memory; without a GPU it runs on the CPU and is much slower. The LightGBM model runs on the CPU only and needs 2 GB of RAM.
- Speed: on a 15-page academic paper, fast mode takes 0.42 sec/page, VGT on a GPU takes 1.75 sec/page, and VGT on a CPU takes 13.5 sec/page.
- Debugging: view the logs when something goes wrong:
docker logs pdf-analysis
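The per-page timings in the notes above can be turned into a rough estimate for larger jobs. The sketch below simply scales those figures linearly, which is an assumption: the numbers come from a single 15-page academic paper, and real throughput depends on hardware and page content. The mode keys are illustrative names.

```python
# Seconds per page, taken from the benchmark figures quoted above.
SECONDS_PER_PAGE = {
    "fast": 0.42,      # LightGBM model
    "vgt_gpu": 1.75,   # visual model on a GPU
    "vgt_cpu": 13.5,   # visual model on a CPU
}


def estimate_seconds(pages: int, mode: str) -> float:
    """Estimate total processing time, assuming time grows linearly with pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return round(pages * SECONDS_PER_PAGE[mode], 2)


# Example: compare the three modes for a 200-page document.
#   for mode in SECONDS_PER_PAGE:
#       print(mode, estimate_seconds(200, mode), "seconds")
```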
These features and steps will help you get started quickly and handle a variety of PDF needs.
Application scenarios
- Academic research
Researchers use it to extract tables and formulas from papers and organize data more efficiently.
- File management
Archivists convert scans of old documents into searchable PDFs that are easy to find.
- Legal work
Attorneys analyze contract PDFs to quickly locate clauses and tables.
Q&A
- Is there a fee?
No. This is an open-source tool, free to download and use from GitHub.
- Do I need an internet connection?
An internet connection is required to download the image; after that, the tool can run offline.
- Does it support Chinese?
Yes. The Chinese language pack must be installed manually (e.g. tesseract-ocr-chi-sim); results are slightly less accurate than for English but usable.