General Introduction
TF-ID (Table/Figure IDentifier) is a family of object detection models fine-tuned to extract tables and figures from academic papers, with or without their caption text. The project was created by Yifei Hu and is open-sourced on GitHub. It provides the complete training code, model weights, and manually labeled datasets, all released under the MIT license.
Function List
- Extract tables and figures from academic papers
- Support extraction with or without caption text
- Provide complete training code and model weights
- Support extracting tables and figures from PDF files
- Offer multiple model versions to suit different needs
Usage Guide
Installation process
- Clone the repository:
git clone https://github.com/ai8hyf/TF-ID
cd TF-ID
- Download the dataset: download the dataset archive from Hugging Face and extract it into the ./images directory.
wget https://huggingface.co/datasets/yifeihu/TF-ID-arxiv-papers/resolve/main/arxiv_paper_images.zip
unzip arxiv_paper_images.zip -d ./images
- Convert the dataset format:
python coco_to_florence.py
- Train the model:
accelerate launch train.py
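
The dataset conversion step can be pictured with a small sketch. It assumes the usual Florence-2 object-detection convention, in which each box is written as a label followed by four `<loc_N>` tokens holding the corner coordinates scaled to the range 0-999. The function names and the sample annotations below are hypothetical, and the repository's coco_to_florence.py may differ in detail.

```python
def coco_box_to_florence(label, bbox, img_w, img_h, bins=1000):
    """Turn one COCO box [x, y, width, height] (pixels) into a label+location string.

    Assumption: corner coordinates are quantized into `bins` buckets
    (0 .. bins-1) and written as <loc_x1><loc_y1><loc_x2><loc_y2>.
    """
    x, y, w, h = bbox
    # COCO stores top-left corner plus size; convert to two corners.
    corners = [(x, img_w), (y, img_h), (x + w, img_w), (y + h, img_h)]
    tokens = []
    for value, size in corners:
        q = int(value * bins / size)      # scale pixel value to a bucket index
        q = max(0, min(bins - 1, q))      # clamp to the valid bucket range
        tokens.append(f"<loc_{q}>")
    return label + "".join(tokens)


def coco_image_to_florence(annotations, categories, img_w, img_h):
    """Concatenate the strings for every annotation on one page image."""
    return "".join(
        coco_box_to_florence(categories[a["category_id"]], a["bbox"], img_w, img_h)
        for a in annotations
    )


# Hypothetical sample: one table and one figure on a 1000x2000 page image.
categories = {1: "table", 2: "figure"}
annotations = [
    {"category_id": 1, "bbox": [100, 200, 300, 400]},
    {"category_id": 2, "bbox": [500, 1000, 500, 1000]},
]
suffix = coco_image_to_florence(annotations, categories, 1000, 2000)
print(suffix)
```

Scaling to a fixed number of buckets keeps the target text independent of the page resolution, which is why the image width and height are needed for every box.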
Usage Process
- Extract tables and figures from a single image:
python inference.py --image_path path/to/image.png
- Extract all tables and figures from a PDF file:
python pdf_to_table_figures.py --pdf_path path/to/paper.pdf --output_dir ./sample_output
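
To process a whole folder of papers, the PDF command above can be repeated once per file. The following is a minimal sketch that only reuses the `--pdf_path` and `--output_dir` options shown in this guide; the helper names `build_commands` and `run_all` and the folder names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(pdf_dir, output_root):
    """Return one argument list per PDF found directly inside pdf_dir."""
    commands = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        out_dir = Path(output_root) / pdf.stem  # one output folder per paper
        commands.append([
            sys.executable, "pdf_to_table_figures.py",
            "--pdf_path", str(pdf),
            "--output_dir", str(out_dir),
        ])
    return commands


def run_all(commands):
    """Run each command in turn and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_all(build_commands("papers", "sample_output"))` from the repository root would then write each paper's crops to its own subfolder, assuming the script accepts the options exactly as listed above.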
Detailed Operation Procedure
- Extract tables and figures from a single image:
  - Pass the image path to the inference.py script, which uses the default TF-ID-large model to detect the tables and figures in the image.
  - The results are returned as bounding boxes that mark the position of each table and figure in the image.
- Extract all tables and figures from a PDF file:
  - Pass the PDF file path to the pdf_to_table_figures.py script, which extracts all tables and figures from the PDF and saves the cropped images to the specified output directory.
  - The TF-ID-large model is used by default; to switch to another model version, change the script's model_id parameter.
- Train the model:
  - After cloning the repository and downloading the dataset, run the coco_to_florence.py script to convert the dataset to Florence-2 format.
  - Run the accelerate launch train.py command to start training; checkpoint files are saved during training.
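
The step from bounding boxes to cropped images can be sketched as follows. The sketch assumes the detection result is a dictionary with `bboxes` (pixel coordinates as `[x1, y1, x2, y2]`) and `labels`, which is the shape Florence-2 style post-processing commonly returns. The helper `crop_plan`, the `pad` margin, and the file naming scheme are hypothetical and are not taken from the repository.

```python
import math


def crop_plan(result, page_w, page_h, pad=0):
    """Map a detection result to (filename, (left, top, right, bottom)) pairs.

    Assumption: result looks like
    {"bboxes": [[x1, y1, x2, y2], ...], "labels": ["table", ...]}.
    """
    counts = {}  # running index per label, used in the file names
    plan = []
    for (x1, y1, x2, y2), label in zip(result["bboxes"], result["labels"]):
        # Round outward, add the margin, then clamp to the page bounds.
        left = max(0, math.floor(x1) - pad)
        top = max(0, math.floor(y1) - pad)
        right = min(page_w, math.ceil(x2) + pad)
        bottom = min(page_h, math.ceil(y2) + pad)
        if right <= left or bottom <= top:
            continue  # skip empty or inverted boxes
        index = counts.get(label, 0)
        counts[label] = index + 1
        plan.append((f"{label}_{index}.png", (left, top, right, bottom)))
    return plan


# Hypothetical result for a 600x1200 page image.
result = {
    "bboxes": [[34.2, 50.7, 580.9, 410.1], [-5.0, 600.0, 300.0, 1250.0]],
    "labels": ["table", "figure"],
}
plan = crop_plan(result, page_w=600, page_h=1200, pad=4)
for name, box in plan:
    print(name, box)
```

Each box tuple is in the (left, top, right, bottom) order that Pillow's `Image.crop` expects, so the plan can be applied directly to the page image before saving.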