General Introduction
PDF Craft is an open source tool for converting scanned book PDFs to Markdown. It is developed by oomol-lab and hosted on GitHub, and is aimed at users who like to organize their eBooks. The tool runs on a local AI model, so once the model has been downloaded it needs no Internet connection, which protects privacy. It extracts the body text from scanned documents, removes clutter such as headers and footers, and produces clean Markdown files, which makes it especially useful for organizing old books or research materials.
Function List
- Convert scanned book PDFs to Markdown, with all processing done locally.
- Extract body content and automatically filter headers, footers and page numbers.
- Handle text across pages and keep sentences coherent.
- Capture illustrations and tables as screenshots and embed them in the Markdown file.
- Use AI to analyze page layout and organize text in reading order.
- Output can be extended to EPUB format to generate eBook files.
Using Help
PDF Craft specializes in converting scanned book PDFs to Markdown. Here are the detailed installation and usage steps to help you get started quickly.
Installation process
- Preparing the environment
You will need a computer with Python 3.8 or above installed. Make sure the hard disk has enough space to store the AI model.
- Download the code
Open a terminal and enter the following command to clone the project:
git clone https://github.com/oomol-lab/pdf-craft.git
Then change into the project directory:
cd pdf-craft
- Installation of dependencies
Enter the following command to install the required libraries:
pip install -r requirements.txt
If you have a GPU, you can add CUDA support:
pip install torch --extra-index-url https://download.pytorch.org/whl/cu117
- Getting the model
The first time you run it, the tool will automatically download the AI model (e.g. DocLayout-YOLO). Keep the network connection open during this step. The model will be saved to <model_dir_path> (can be set in the code).
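The environment requirements above can be checked before installing anything. The following is a minimal sketch that uses only the Python standard library. The 5 GiB free-space threshold (`MIN_FREE_GB`) and the helper names are assumptions for illustration, since the size of the model files is not stated here.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)   # minimum Python version stated in the requirements
MIN_FREE_GB = 5.0     # hypothetical threshold; adjust to the real model size


def python_ok(version=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the given (major, minor, ...) version meets the minimum."""
    return tuple(version[:2]) >= minimum


def free_space_gb(path="."):
    """Return the free disk space, in GiB, on the volume that contains `path`."""
    return shutil.disk_usage(path).free / 1024 ** 3


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    free = free_space_gb(".")
    verdict = "looks sufficient" if free >= MIN_FREE_GB else "may be too little"
    print(f"Free disk space: {free:.1f} GiB ({verdict})")
```

Pass the directory you plan to use for the model to `free_space_gb` if it is on a different disk from the current directory.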
Workflow
Convert to Markdown
- Prepare PDF
Put the scanned book PDF in a folder, for example /path/to/pdf/book.pdf
- Run the conversion
Run the following Python code (for example, saved as a script or entered in a Python interpreter):
from pdf_craft import PDFPageExtractor, MarkDownWriter

# Create the extractor; use device="cuda:0" to run on a GPU.
extractor = PDFPageExtractor(device="cpu", model_dir_path="/path/to/model/dir/path")

# Write each extracted block to the Markdown file; illustrations go to image_dir.
with MarkDownWriter(markdown_path="/path/to/output.md", image_dir="images", encoding="utf-8") as md:
    for block in extractor.extract(pdf="/path/to/pdf/book.pdf"):
        md.write(block)
- device="cpu": runs on the CPU. To use a GPU, set device="cuda:0".
- markdown_path: path of the output Markdown file.
- image_dir: directory where illustrations are saved.
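Because the `device` parameter takes either "cpu" or "cuda:0", the choice can also be made at run time. The sketch below is one possible way to do that, assuming PyTorch is the library that provides CUDA support (as the installation step suggests). `pick_device` is a hypothetical helper and not part of pdf_craft.

```python
def pick_device():
    """Return "cuda:0" if PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # optional dependency; may be missing on CPU-only setups
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Selected device:", pick_device())
```

The result could then be passed to the extractor, for example `PDFPageExtractor(device=pick_device(), model_dir_path="/path/to/model/dir/path")`.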
- View Results
When the conversion is finished, open /path/to/output.md to check the content. Illustrations are automatically saved to the images folder.
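Part of checking the result can be automated, for instance confirming that every illustration referenced in the Markdown file actually exists on disk. The following is a rough sketch under two assumptions that are not confirmed here: that the output uses standard Markdown image syntax (`![alt](path)`), and that image paths are relative to the folder containing the Markdown file. The function names are hypothetical.

```python
import re
from pathlib import Path

# Matches standard Markdown image syntax: ![alt](target) or ![alt](target "title")
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def image_targets(markdown_text):
    """Return the link targets of all Markdown image references, in order."""
    return IMAGE_LINK.findall(markdown_text)


def missing_images(markdown_text, base_dir):
    """Return the local image targets that do not exist under `base_dir`."""
    missing = []
    for target in image_targets(markdown_text):
        if "://" in target:
            continue  # remote image, nothing to check on disk
        if not (Path(base_dir) / target).exists():
            missing.append(target)
    return missing
```

To use it, read the output file with `Path("/path/to/output.md").read_text(encoding="utf-8")` and pass the text together with the folder that contains the file. An empty list means every referenced image was found.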
Featured Function Operation
- Text extraction
The tool recognizes scanned pages, removes headers and footers, and keeps only the body text. You don't need to clean up the clutter manually.
- Cross-page processing
If a sentence is split by a page break, PDF Craft automatically joins it so that the text flows smoothly.
- Illustration embedding
Images and tables in scanned books are captured as screenshots and embedded in the Markdown file. You can find them in the images folder.
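To make the idea of cross-page processing concrete, here is a simplified sketch of how a sentence split by a page break could be rejoined. It is not PDF Craft's actual algorithm, which relies on AI layout analysis. It is only a punctuation-based heuristic, and the names `ends_sentence` and `join_pages` are hypothetical.

```python
SENTENCE_ENDINGS = ".!?:;。！？：；"   # characters that normally end a sentence
CLOSERS = "\"')]”’」』"               # quotes and brackets that may follow them


def ends_sentence(text):
    """Return True if `text` ends with sentence-final punctuation."""
    stripped = text.rstrip(CLOSERS)
    return stripped.endswith(tuple(SENTENCE_ENDINGS))


def join_pages(pages):
    """Merge per-page paragraph lists, rejoining a paragraph cut by a page break.

    `pages` is a list of pages; each page is a list of paragraph strings
    with headers and footers already removed.
    """
    merged = []
    for paragraphs in pages:
        at_page_start = True
        for paragraph in paragraphs:
            text = paragraph.strip()
            if not text:
                continue
            if at_page_start and merged and not ends_sentence(merged[-1]):
                # The previous page stopped mid-sentence: continue that paragraph.
                if merged[-1].endswith("-"):
                    # Undo end-of-line hyphenation (also drops genuine hyphens).
                    merged[-1] = merged[-1][:-1] + text
                else:
                    merged[-1] = merged[-1] + " " + text
            else:
                merged.append(text)
            at_page_start = False
    return merged
```

A heuristic like this fails on headings or captions that lack final punctuation, which is one reason a layout-aware model is a better fit for real scanned books.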
Tips
- Scans should be clear; low-quality scans may cause recognition errors.
- The first run downloads the model. After that, the tool can be used offline.
- If it's slow, try GPU acceleration or reducing the number of pages.
Application scenarios
- Organize old books
If you have scanned PDFs of old books that you want to convert to Markdown for editing, PDF Craft can remove the clutter and produce clean files.
- Research data conversion
Scholars need to convert scanned papers into Markdown for note taking. The tool preserves the text and illustrations for easy citation.
- E-book production
If you want to turn scanned PDFs into editable Markdown documents, PDF Craft offers a simple solution.
Q&A
- Does it only support scanned PDFs?
It is mainly optimized for scanned book PDFs. Regular text-based PDFs will also work, but the results may not be as good as with scanned documents.
- What happens to the images after conversion?
Each image is saved as a screenshot to the specified folder, and a link to it is automatically embedded in the Markdown file.
- Why is the first run slow?
Because the AI model has to be downloaded first. Later runs are faster.