PDF Craft: An Open Source Tool for Converting Scanned PDFs to Markdown

Published 5 months ago by AI Sharing Circle


General Introduction

PDF Craft is an open source tool for converting scanned book PDFs to Markdown. It is developed by oomol-lab, hosted on GitHub, and aimed at users who like to organize their eBooks. The tool runs AI models locally, so once the models have been downloaded it needs no Internet connection, which protects privacy and keeps operation simple. It extracts the body text from scanned pages, removes clutter such as headers and footers, and produces clean Markdown files, which makes it especially suitable for organizing old books or research materials.

Function List

Convert scanned book PDFs to Markdown, with all processing done locally.
Extract body content and automatically filter out headers, footers and page numbers.
Handle text that spans pages and keep sentences coherent.
Capture illustrations and tables as screenshots and embed them in the Markdown file.
Use AI to analyze page layout and arrange text in reading order.
Can be extended to produce EPUB eBook files.

Usage Guide

PDF Craft specializes in converting scanned book PDFs to Markdown. Here are the detailed installation and usage steps to help you get started quickly.

Installation process

Preparing the environment
You will need a computer with Python 3.8 or above installed. Make sure the hard disk has enough space to store the AI model.
Download Code
Open a terminal and clone the project:

git clone https://github.com/oomol-lab/pdf-craft.git

Then change into the project directory:

cd pdf-craft

Installation of dependencies
Enter the following command to install the required libraries:

pip install -r requirements.txt

If you have a GPU, you can add CUDA support:

pip install torch --extra-index-url https://download.pytorch.org/whl/cu117

Getting the model
The first time you run the tool, it automatically downloads the AI models it needs (e.g. DocLayout-YOLO), so keep the network connection available. The models are saved to <model_dir_path> (which can be set in the code).

Workflow

Convert to Markdown

Prepare PDF
Put the scanned book PDF in a folder, for example /path/to/pdf/book.pdf.
Run the conversion
Run the following Python code, for example from a script or the Python interpreter:

from pdf_craft import PDFPageExtractor, MarkDownWriter

# Create the extractor; the AI models are stored in model_dir_path.
extractor = PDFPageExtractor(device="cpu", model_dir_path="/path/to/model/dir/path")

# Write each extracted block to the Markdown file; illustrations go to image_dir.
with MarkDownWriter(markdown_path="/path/to/output.md", image_dir="images", encoding="utf-8") as md:
    for block in extractor.extract(pdf="/path/to/pdf/book.pdf"):
        md.write(block)

device="cpu": Runs on the CPU. To use a GPU, change it to device="cuda:0".
markdown_path: Output Markdown file path.
image_dir: Illustration save directory.
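
The device value can also be chosen automatically. The following is a minimal sketch, assuming PyTorch is what determines whether CUDA is usable (the CUDA install step suggests this, but it is not confirmed here). The helper name pick_device is hypothetical and is not part of pdf_craft.

```python
def pick_device() -> str:
    """Return "cuda:0" when PyTorch reports a usable GPU, otherwise "cpu".

    Hypothetical helper, not part of pdf_craft.
    """
    try:
        import torch  # only needed for the GPU check
    except ImportError:
        # Without PyTorch there is nothing to query, so fall back to the CPU.
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The returned string can then be passed as the device argument when creating the extractor.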

View Results
When the conversion finishes, open /path/to/output.md and check the content. Illustrations are automatically saved to the images folder.
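
Checking the result can be partly automated. The sketch below lists image links in the output file whose target files are missing. It assumes that illustrations are referenced with standard Markdown image syntax and with paths relative to the Markdown file. Both points are assumptions about the output, and the function names are illustrative.

```python
import re
from pathlib import Path
from typing import List

# Matches Markdown image syntax: ![alt](target) or ![alt](target "title")
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def find_image_targets(markdown_text: str) -> List[str]:
    """Return the link targets of all images referenced in the text."""
    return IMAGE_LINK.findall(markdown_text)


def missing_images(markdown_path: str) -> List[str]:
    """List image targets in the file that do not exist on disk.

    Targets are resolved relative to the folder containing the Markdown file.
    """
    md_file = Path(markdown_path)
    text = md_file.read_text(encoding="utf-8")
    return [
        target
        for target in find_image_targets(text)
        if not (md_file.parent / target).exists()
    ]
```

An empty list from missing_images("/path/to/output.md") would mean that every referenced illustration was found.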

Featured Function Operation

Text extraction
The tool recognizes scanned pages, removes headers and footers, and keeps only the body text, so you don't need to clean up the clutter manually.
Cross-page processing
If a sentence is cut off by a page break, PDF Craft automatically joins the two parts so that the text flows smoothly.
Illustration embedding
Images and tables in scanned books are captured as screenshots and embedded in the Markdown. You can find them in the images folder.
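
To make the idea of cross-page processing concrete, here is a simple punctuation-based heuristic for joining a paragraph that was cut by a page break. This is not PDF Craft's actual method, which is not described here. It is only an illustrative sketch that assumes English-like text already split into paragraphs per page.

```python
from typing import List

# Characters that usually close a sentence in English-like text.
SENTENCE_END = tuple(".!?:;\"')]")


def merge_across_pages(pages: List[List[str]]) -> List[str]:
    """Merge per-page paragraph lists into one list of paragraphs.

    Heuristic: if the last paragraph collected so far does not end with
    sentence-ending punctuation, the first paragraph of the next page is
    treated as its continuation.
    """
    merged: List[str] = []
    for paragraphs in pages:
        at_page_start = True
        for paragraph in paragraphs:
            text = paragraph.strip()
            if not text:
                continue  # ignore empty paragraphs
            if at_page_start and merged and not merged[-1].endswith(SENTENCE_END):
                if merged[-1].endswith("-"):
                    # A word hyphenated at the break: drop the hyphen and join.
                    merged[-1] = merged[-1][:-1] + text
                else:
                    merged[-1] = merged[-1] + " " + text
            else:
                merged.append(text)
            at_page_start = False
    return merged


pages = [
    ["Intro paragraph.", "The quick brown fox jumps over the"],
    ["lazy dog.", "Next paragraph."],
]
print(merge_across_pages(pages))
```

A heuristic like this fails on headings or captions that have no closing punctuation, which is one reason a layout-aware model is a better fit for real books.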

Tips

Scanned PDFs should be clear and legible, otherwise recognition errors are likely.
The first run downloads the models; after that, the tool works offline.
If conversion is slow, try GPU acceleration or reduce the number of pages.
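
One way to reduce the number of pages per run is to convert a long book in smaller page ranges. The sketch below only computes the ranges. Splitting the PDF itself would need a separate PDF library and is not shown, and nothing here implies that PDF Craft accepts page ranges directly. The function name page_ranges is illustrative.

```python
from typing import Iterator, Tuple


def page_ranges(total_pages: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield 1-based inclusive (first_page, last_page) ranges covering the book."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size must be >= 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


# A 230-page book in chunks of at most 100 pages.
print(list(page_ranges(230, 100)))  # [(1, 100), (101, 200), (201, 230)]
```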

Application Scenarios

Organize old books
If you have scanned PDFs of old books that you want to convert to Markdown for editing, PDF Craft can remove the clutter and produce clean files.
Research data conversion
Scholars who need to convert scanned papers into Markdown for note taking can use the tool, which preserves text and illustrations for easy citation.
E-book production
If you want to turn scanned PDFs into editable Markdown documents, PDF Craft offers a simple solution.

QA

Does it only support scanned PDFs?
It is mainly optimized for scanned book PDFs. Ordinary text PDFs will also work, but the results may not be as good as with scanned documents.
What happens to the images after conversion?
Images are saved as screenshots to the specified folder, and links to them are automatically embedded in the Markdown.
Why is the first run slow?
Because the AI model has to be downloaded first. Later runs are faster.