AI Personal Learning
and practical guidance

PDF Craft: An Open Source Tool for Converting Scanned PDF Documents to Markdown

General Introduction

PDF Craft is an open source tool that converts scanned PDFs of books to Markdown. It is developed by oomol-lab, hosted on GitHub, and aimed at users who like to organize their e-books. The tool runs local AI models, so once the models are downloaded it needs no Internet connection, which protects privacy. It extracts the body text from scanned documents, removes clutter such as headers and footers, and produces clean Markdown files, which makes it especially useful for organizing old books or research materials.



 

Function List

  • Converts scanned book PDFs to Markdown, with all processing done locally.
  • Extracts body content and automatically filters out headers, footers, and page numbers.
  • Handles text that spans pages and keeps sentences coherent.
  • Captures illustrations and tables as screenshots and embeds them in the Markdown file.
  • Uses AI to analyze page layout and arrange text in reading order.
  • Can be extended to generate EPUB e-book files.

 

Usage Guide

PDF Craft specializes in converting scanned book PDFs to Markdown. The detailed installation and usage steps below will help you get started quickly.

Installation process

  1. Prepare the environment
    You need a computer with Python 3.8 or later installed. Make sure the hard disk has enough free space to store the AI model.
  2. Download the code
    Open a terminal and run the following command to clone the project:
git clone https://github.com/oomol-lab/pdf-craft.git

Then change into the project directory:

cd pdf-craft
  3. Install the dependencies
    Run the following command to install the required libraries:
pip install -r requirements.txt

If you have an NVIDIA GPU, you can install a CUDA-enabled build of PyTorch:

pip install torch --extra-index-url https://download.pytorch.org/whl/cu117
  4. Get the model
    On the first run, the tool automatically downloads the AI models (e.g. DocLayout-YOLO), so keep the network connection available. The models are saved to the directory passed as model_dir_path, which you set in the code.
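
Before the first run, it can help to confirm that the machine meets the prerequisites listed above. The following is a minimal sketch, not part of PDF Craft: the function name check_environment and the 5 GB free-space threshold are illustrative assumptions, since the article does not state how much space the models need.

```python
import importlib.util
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version named in the installation steps


def check_environment(path=".", min_free_gb=5.0):
    """Return a dict describing whether this machine looks ready.

    min_free_gb is an arbitrary illustrative threshold, not a documented
    requirement of PDF Craft.
    """
    free_gb = shutil.disk_usage(path).free / 1024**3
    report = {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "free_gb": round(free_gb, 1),
        "disk_ok": free_gb >= min_free_gb,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch  # imported lazily so the check runs without PyTorch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except (ImportError, OSError):
            # A broken PyTorch install is reported as "no CUDA".
            pass
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If cuda_available is reported as False, the CPU setting used in the conversion step is the safe choice.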

Workflow

Convert to Markdown

  1. Prepare the PDF
    Put the scanned book PDF in a folder, for example /path/to/pdf/book.pdf.
  2. Run the conversion
    Save the following code as a Python script (or enter it in a Python interpreter) and run it:
from pdf_craft import PDFPageExtractor, MarkDownWriter

# Load the local models; use device="cuda:0" to run on a GPU.
extractor = PDFPageExtractor(device="cpu", model_dir_path="/path/to/model/dir/path")
# Write each extracted block to the Markdown file; illustrations go to image_dir.
with MarkDownWriter(markdown_path="/path/to/output.md", image_dir="images", encoding="utf-8") as md:
    for block in extractor.extract(pdf="/path/to/pdf/book.pdf"):
        md.write(block)
  • device="cpu": runs on the CPU. For GPU support, use device="cuda:0".
  • markdown_path: path of the output Markdown file.
  • image_dir: directory where illustrations are saved.
  3. View the results
    When the conversion finishes, open /path/to/output.md to check the content. Illustrations are automatically saved to the images folder.
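
If you have several books to convert, the same calls can be wrapped in a loop. The sketch below is one possible arrangement, not an official batch mode: plan_outputs and convert_all are hypothetical helper names, and the PDFPageExtractor and MarkDownWriter calls simply repeat the ones shown in step 2. Each book gets its own output folder because the article does not say how a relative image_dir is resolved, so separate folders are a cautious way to keep illustrations from different books apart.

```python
from pathlib import Path


def plan_outputs(pdf_dir, out_dir):
    """Pair every PDF in pdf_dir with a Markdown path under out_dir/<book name>/."""
    plan = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        plan.append((pdf, Path(out_dir) / pdf.stem / f"{pdf.stem}.md"))
    return plan


def convert_all(pdf_dir, out_dir, model_dir, device="cpu"):
    """Convert each planned PDF with the same calls used in step 2."""
    # Imported here so plan_outputs can be tried without pdf_craft installed.
    from pdf_craft import PDFPageExtractor, MarkDownWriter

    extractor = PDFPageExtractor(device=device, model_dir_path=str(model_dir))
    for pdf, md_path in plan_outputs(pdf_dir, out_dir):
        md_path.parent.mkdir(parents=True, exist_ok=True)
        with MarkDownWriter(
            markdown_path=str(md_path), image_dir="images", encoding="utf-8"
        ) as md:
            for block in extractor.extract(pdf=str(pdf)):
                md.write(block)


if __name__ == "__main__":
    # Dry run: print what would be converted, without loading any model.
    for pdf, md_path in plan_outputs("/path/to/pdf", "/path/to/output"):
        print(pdf, "->", md_path)
```

Creating the extractor once and reusing it for every book avoids reloading the models for each file.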

Featured Functions

  • Text extraction
    The tool recognizes scanned pages, removes headers and footers, and keeps only the body text, so you don't need to clean up the clutter manually.
  • Cross-page processing
    If a sentence is split by a page break, PDF Craft automatically joins the two parts so that the text reads continuously.
  • Illustration embedding
    Images and tables in scanned books are captured as screenshots and embedded in the Markdown file. You can find the image files in the images folder.
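
To make the idea of cross-page processing concrete, here is a simplified illustration. It is not PDF Craft's actual algorithm, which the article does not describe. The hypothetical function join_pages uses a plain punctuation heuristic: if the last paragraph of one page does not end with sentence-ending punctuation, the first paragraph of the next page is treated as its continuation.

```python
# Characters that are taken to mark the end of a sentence (an assumption).
SENTENCE_END = (".", "!", "?", ":", ";", '"', "”", "。", "！", "？")


def join_pages(pages):
    """Merge per-page paragraph lists into one list of paragraphs.

    pages is a list of pages, each page being a list of paragraph strings.
    A paragraph cut off by a page break is glued to its continuation.
    """
    merged = []
    for paragraphs in pages:
        first_on_page = True
        for text in paragraphs:
            text = text.strip()
            if not text:
                continue
            if first_on_page and merged and not merged[-1].endswith(SENTENCE_END):
                if merged[-1].endswith("-"):
                    # Word hyphenated across the page break: drop the hyphen.
                    merged[-1] = merged[-1][:-1] + text
                else:
                    merged[-1] += " " + text
            else:
                merged.append(text)
            first_on_page = False
    return merged


if __name__ == "__main__":
    pages = [
        ["Intro paragraph.", "The quick brown fox jumps over the"],
        ["lazy dog.", "Next paragraph."],
    ]
    for paragraph in join_pages(pages):
        print(paragraph)
```

A heuristic this simple would wrongly merge a heading that has no final punctuation with the text after it, which is one reason a layout-aware model is better suited to the job.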

Tips

  • The scanned PDF should be clear and legible; otherwise recognition errors are likely.
  • The first run downloads the model; after that, the tool works offline.
  • If conversion is slow, try GPU acceleration or process fewer pages at a time.

 

Application Scenarios

  1. Organizing old books
    If you have scanned PDFs of old books that you want to edit as Markdown, PDF Craft removes the clutter and produces clean files.
  2. Research data conversion
    Scholars who need to convert scanned papers to Markdown for note-taking can use the tool to preserve text and illustrations for easy citation.
  3. E-book production
    If you want to turn scanned PDFs into editable Markdown documents, PDF Craft offers a simple solution.

 

FAQ

  1. Does it only support scanned PDFs?
    It is mainly optimized for scanned book PDFs. Ordinary text PDFs also work, but the results may not be as good as with scanned documents.
  2. What happens to the images after conversion?
    Each image is saved as a screenshot in the specified folder, and a link to it is automatically embedded in the Markdown file.
  3. Why is the first run slow?
    Because the AI model has to be downloaded first. Later runs are faster.
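
If you move the converted Markdown file, the embedded image links can break. The following sketch checks for that. It assumes the links use standard Markdown image syntax, ![alt](path), with paths relative to the Markdown file; the article does not confirm the exact link format, and missing_images is a hypothetical helper, not part of PDF Craft.

```python
import re
from pathlib import Path

# Captures the target of a Markdown image link such as ![alt](images/1.png).
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")
# Matches targets that start with a URL scheme, e.g. https://
REMOTE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*://")


def missing_images(markdown_path):
    """Return local image targets in the Markdown file that are not on disk."""
    md_file = Path(markdown_path)
    text = md_file.read_text(encoding="utf-8")
    missing = []
    for target in IMAGE_LINK.findall(text):
        if REMOTE.match(target):
            continue  # remote URLs cannot be checked on disk
        if not (md_file.parent / target).exists():
            missing.append(target)
    return missing


if __name__ == "__main__":
    import sys

    # Usage: python check_images.py /path/to/output.md
    for target in missing_images(sys.argv[1]):
        print("missing:", target)
```

An empty result means every local image referenced in the file was found next to it.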
May not be reproduced without permission: Chief AI Sharing Circle » PDF Craft: PDF scanned documents to Markdown open source tools