
ViTLP: A Visually Guided Generative Text-Layout Pre-trained Model for Extracting Structured Data from Typographically Complex PDF Documents

General Introduction

ViTLP (Visually Guided Generative Text-Layout Pre-training for Document Intelligence) is an open-source project that aims to enhance document intelligence through a visually guided generative text-layout pre-trained model. The project was developed by the Veason-silverbullet team and presented at NAACL 2024. The ViTLP model can localize and recognize OCR text, and a pre-trained ViTLP-medium (380M) checkpoint is provided, which users can access on Hugging Face. The code and model weights are available on GitHub and support OCR processing of document images as well as text-layout generation.


Function List

  • OCR text localization and recognition: The ViTLP model performs efficient OCR text localization and recognition (an illustrative output schema is sketched after this list).
  • Pre-trained models: A ViTLP-medium (380M) pre-trained checkpoint is provided, which can be used directly or fine-tuned.
  • Document image processing: Supports uploading document images for OCR processing.
  • Model fine-tuning: Provides fine-tuning tools for continued training on OCR and VQA datasets.
  • Document synthesis tools: Provides tools for synthesizing documents with bounding-box metadata.
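
The features above revolve around one core output: recognized text paired with its location on the page. The sketch below shows an illustrative schema for such output; it is an assumption made for clarity, not ViTLP's actual data format, and the fields produced by ocr.py may differ.

   from dataclasses import dataclass
   from typing import List, Tuple

   @dataclass
   class Word:
       text: str                            # recognized text
       bbox: Tuple[int, int, int, int]      # (x0, y0, x1, y1) in pixel coordinates

   @dataclass
   class Page:
       image_path: str
       width: int
       height: int
       words: List[Word]

   # Hypothetical example of one recognized page
   page = Page(
       image_path="invoice_001.png",
       width=1000,
       height=1400,
       words=[
           Word("Invoice", (120, 80, 260, 110)),
           Word("#2024-001", (270, 80, 410, 110)),
       ],
   )
   print(f"{len(page.words)} words recognized on {page.image_path}")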

 

Usage Help

Installation Process

  1. Clone the ViTLP project code:
   git clone https://github.com/Veason-silverbullet/ViTLP
   cd ViTLP
  2. Install the dependencies:
   pip install -r requirements.txt
  3. Download the pre-trained checkpoint (a quick sanity check is sketched below):
   mkdir -p ckpts/ViTLP-medium
   git clone https://huggingface.co/veason/ViTLP-medium ckpts/ViTLP-medium
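
After downloading, you can quickly sanity-check the checkpoint directory. The file names below are assumptions based on a typical Hugging Face checkpoint layout (a config file plus weight files); the actual files shipped in ViTLP-medium may differ.

   import os

   # Hypothetical sanity check: confirm the checkpoint directory is populated.
   ckpt_dir = "ckpts/ViTLP-medium"
   expected_any = ["config.json", "pytorch_model.bin", "model.safetensors"]  # assumed names

   files = os.listdir(ckpt_dir)
   print(f"{len(files)} files in {ckpt_dir}: {sorted(files)}")
   if not any(name in files for name in expected_any):
       print("Warning: no config/weight file found -- re-check the git clone step.")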

Usage Process

  1. OCR text recognition:
    • Run the OCR script:
     python ocr.py
    • Upload a document image, and the model will automatically perform OCR and output the results.
  2. Model fine-tuning:
    • Refer to the instruction file in the ./finetuning directory for continued training on the OCR and VQA datasets.
    • Use the document synthesis tool to generate synthetic documents with bounding-box metadata to enhance model training.
  3. Batch decoding:
    • Run the batch decoding script:
      bash decode.sh
    • The script batch-processes document images and outputs the OCR results; a hypothetical Python equivalent is sketched after this list.
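
As a rough illustration of what the batch decoding step does conceptually, the sketch below walks a directory of page images and writes per-page OCR results to JSON. The run_ocr helper is hypothetical and stands in for the model call performed by the project's scripts; it is not part of the ViTLP codebase.

   import json
   from pathlib import Path

   def run_ocr(image_path: Path) -> list:
       # Hypothetical stand-in for the model call made by ocr.py / decode.sh.
       # A real implementation would load the ViTLP checkpoint and decode the image.
       return [{"text": "example", "bbox": [0, 0, 100, 30]}]

   def batch_decode(image_dir: str, output_dir: str) -> None:
       out = Path(output_dir)
       out.mkdir(parents=True, exist_ok=True)
       for image_path in sorted(Path(image_dir).glob("*.png")):
           words = run_ocr(image_path)                      # text + bounding boxes
           result_path = out / f"{image_path.stem}.json"
           result_path.write_text(json.dumps(words, indent=2))
           print(f"{image_path.name}: {len(words)} words -> {result_path}")

   batch_decode("images", "ocr_results")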

Detailed Function Operation

  • OCR text localization and recognition: After a document image is uploaded, the model automatically detects and recognizes text regions, outputting the text content together with its location information.
  • Model fine-tuning: Users can further train the model on their own datasets with the provided fine-tuning tools to improve recognition in specific scenarios.
  • Document synthesis tools: The synthesis tool generates documents with bounding-box metadata, helping the model learn text layout and structure during training; a minimal synthesis sketch follows this list.
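
To make "documents with bounding-box metadata" concrete, here is a minimal synthesis sketch using Pillow. It is not the project's synthesis tool; it simply renders a few text lines on a blank page and saves each line's bounding box alongside the image.

   import json
   from PIL import Image, ImageDraw

   # Minimal synthetic-document sketch (not the ViTLP synthesis tool):
   # render text lines on a blank page and record their bounding boxes.
   page = Image.new("RGB", (800, 1000), "white")
   draw = ImageDraw.Draw(page)

   lines = ["Purchase Order", "Date: 2024-05-01", "Total: $1,250.00"]
   metadata = []
   y = 50
   for text in lines:
       bbox = draw.textbbox((60, y), text)   # (x0, y0, x1, y1) of the rendered text
       draw.text((60, y), text, fill="black")
       metadata.append({"text": text, "bbox": list(bbox)})
       y = bbox[3] + 20                      # move below the previous line

   page.save("synthetic_page.png")
   with open("synthetic_page.json", "w") as f:
       json.dump(metadata, f, indent=2)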