Zerox: Converting PDF, DOCX, and Images to Markdown with High-Precision Vision-Model OCR



General Introduction

Zerox is an open source project that converts PDF, DOCX, images, and other documents to Markdown using vision models. Developed by the getomni-ai team, it provides a simple and efficient OCR (Optical Character Recognition) solution. Zerox is available for both Node and Python, and uses graphicsmagick and ghostscript to turn PDF pages into images. Users supply a file path and an OpenAI API key to convert documents to Markdown, including documents with complex layouts such as tables and charts.

Function List

Supports conversion of PDF, DOCX, images, and other file formats
Provides packages for both Node and Python
Efficient OCR processing using vision models
Uses graphicsmagick and ghostscript for PDF-to-image processing, installing them automatically where possible
Supports both file path and URL input
Provides a variety of optional parameters, such as concurrency, page orientation correction, and error handling mode
Supports pre-processing and post-processing callback functions
Optionally saves conversion results to a specified directory

Usage Guide

Installation

Node version

Install Node.js and npm
Run the command npm install zerox
Make sure that graphicsmagick and ghostscript are installed on your system. If they are not, run the following commands (shown for Debian/Ubuntu):

   sudo apt-get update
   sudo apt-get install -y graphicsmagick ghostscript

Python version

Install Python and pip
Run the command pip install py-zerox
Make sure that graphicsmagick and ghostscript are installed on your system. If they are not, run the following commands (shown for Debian/Ubuntu):

   sudo apt-get update
   sudo apt-get install -y graphicsmagick ghostscript
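
Before running a conversion, it can help to confirm that both tools are actually on the PATH. The following is a minimal sketch, assuming the usual Linux executable names gm (GraphicsMagick) and gs (Ghostscript). The helper name missing_tools is made up for this example and is not part of Zerox.

```python
import shutil


def missing_tools(tools=("gm", "gs")):
    """Return the names in `tools` that cannot be found on the PATH."""
    # shutil.which returns None when no matching executable is found
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("graphicsmagick and ghostscript were found")
```

If the list is not empty, install the missing packages with the commands above and run the check again.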

Usage

Node version

Import the zerox module:

   import { zerox } from "zerox";

Use the file path for conversion:

   const result = await zerox({
     filePath: "path/to/file.pdf",
     openaiAPIKey: process.env.OPENAI_API_KEY,
   });

Use the URL for conversion:

   const result = await zerox({
     filePath: "https://example.com/file.pdf",
     openaiAPIKey: process.env.OPENAI_API_KEY,
   });

Python version

Import the zerox function from the pyzerox package:

   from pyzerox import zerox

Use the file path for conversion:

   import asyncio

   # zerox is a coroutine function, so run it inside an event loop
   result = asyncio.run(zerox(
       file_path="path/to/file.pdf",
       openai_api_key="your_openai_api_key",
   ))

Use the URL for conversion:

   import asyncio

   # zerox is a coroutine function, so run it inside an event loop
   result = asyncio.run(zerox(
       file_path="https://example.com/file.pdf",
       openai_api_key="your_openai_api_key",
   ))
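
Hard-coding the key, as in the Python examples above, is fine for a quick test, but reading it from the environment keeps it out of source control, which is what the Node examples do with process.env.OPENAI_API_KEY. The following is a minimal sketch of that pattern. The helper load_api_key is hypothetical and is not provided by Zerox.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the conversion."
        )
    return key
```

The returned value can then be passed wherever the examples use the "your_openai_api_key" placeholder.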

Main features and workflow

File conversion: Provide the path or URL of the file and call the zerox function, which converts the file and returns the text in Markdown format.
Concurrent processing: Set the concurrency parameter to control how many pages are processed at the same time, which improves throughput.
Page orientation correction: Enabled by default, so that the converted text is oriented correctly.
Error handling mode: Set the errorMode parameter to choose whether errors are ignored or thrown.
Pre- and post-processing callbacks: Supply callback functions to perform custom actions before and after each page is processed.
Save results: Set the outputDir parameter to save the conversion result to the specified directory.
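
To make the concurrency and error handling options more concrete, here is a minimal sketch of how a page-level concurrency limit and an error mode could interact. It is not Zerox's actual implementation: convert_pages and fake_convert are hypothetical names, the model call is simulated, and "THROW" is an assumed name for the mode opposite to "IGNORE".

```python
import asyncio


async def convert_pages(pages, convert_page, concurrency=10, error_mode="IGNORE"):
    """Convert pages with at most `concurrency` conversions running at once.

    `convert_page` is a hypothetical async callable standing in for the
    per-page OCR request.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(page):
        async with semaphore:
            try:
                return await convert_page(page)
            except Exception:
                if error_mode == "THROW":
                    raise
                return None  # IGNORE: skip the failed page and keep going

    # gather preserves input order, so results line up with pages
    return await asyncio.gather(*(worker(page) for page in pages))


async def fake_convert(page):
    # Stand-in for a model call; page 3 fails to exercise the IGNORE path
    await asyncio.sleep(0)
    if page == 3:
        raise ValueError("simulated OCR failure")
    return f"# Page {page}"


results = asyncio.run(convert_pages([1, 2, 3, 4], fake_convert, concurrency=2))
```

With the ignore mode, the failed page shows up as None in the results and the other pages are still returned. With the throw mode, the first failure aborts the whole run.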

Sample code

Node version

import { zerox } from "zerox";

const result = await zerox({
  filePath: "path/to/file.pdf",
  openaiAPIKey: process.env.OPENAI_API_KEY,
  cleanup: true,
  concurrency: 10, // number of pages processed at the same time
  correctOrientation: true, // enabled by default
  errorMode: "IGNORE", // ignore errors instead of throwing them
  maintainFormat: false,
  maxRetries: 1,
  maxTesseractWorkers: -1,
  model: "gpt-4o-mini",
  onPostProcess: async ({ page, progressSummary }) => {}, // runs after each page is processed
  onPreProcess: async ({ imagePath, pageNumber }) => {}, // runs before each page is processed
  outputDir: "output", // directory where the result is saved
  pagesToConvertAsImages: -1,
});

Python version

import asyncio

from pyzerox import zerox

# zerox is a coroutine function, so run it inside an event loop
result = asyncio.run(zerox(
    file_path="path/to/file.pdf",
    openai_api_key="your_openai_api_key",
    cleanup=True,
    concurrency=10,
    correct_orientation=True,
    error_mode="IGNORE",
    maintain_format=False,
    max_retries=1,
    max_tesseract_workers=-1,
    model="gpt-4o-mini",
    on_post_process=lambda page, progress_summary: None,
    on_pre_process=lambda image_path, page_number: None,
    output_dir="output",
    pages_to_convert_as_images=-1,
))
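
The placeholder lambdas in the Python sample do nothing. As an illustration of what the callbacks could be used for, the sketch below collects a simple progress log. It assumes only the argument names shown in the sample above. The ProgressLog class is hypothetical, and the values passed in the simulated calls are invented, since the exact shape of page and progress_summary is not specified here.

```python
class ProgressLog:
    """Collects per-page events reported through the two callbacks."""

    def __init__(self):
        self.events = []

    def on_pre_process(self, image_path, page_number):
        # Intended to run before a page is sent for conversion
        self.events.append(("start", page_number, image_path))

    def on_post_process(self, page, progress_summary):
        # Intended to run after a page finishes; both arguments are kept as-is
        self.events.append(("done", page, progress_summary))


log = ProgressLog()
# Simulate the calls a converter would make for a single page
log.on_pre_process("tmp/page_1.png", 1)
log.on_post_process({"page": 1, "content": "# Title"}, {"completed": 1, "total": 1})
```

The bound methods log.on_pre_process and log.on_post_process could then be passed in place of the lambdas.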

Zerox uses a combination of libreoffice and graphicsmagick to convert documents to images. For files that are neither images nor PDFs, libreoffice first converts the file to PDF, which is then converted to images. The supported file extensions are:

[
"pdf", // Portable Document Format
"doc", // Microsoft Word 97-2003
"docx", // Microsoft Word 2007-2019
"odt", // OpenDocument Text
"ott", // OpenDocument Text Template
"rtf", // Rich Text Format
"txt", // Plain Text
"html", // HTML Document
"htm", // HTML Document (alternative extension)
"xml", // XML Document
"wps", // Microsoft Works Word Processor
"wpd", // WordPerfect Document
"xls", // Microsoft Excel 97-2003
"xlsx", // Microsoft Excel 2007-2019
"ods", // OpenDocument Spreadsheet
"ots", // OpenDocument Spreadsheet Template
"csv", // Comma-Separated Values
"tsv", // Tab-Separated Values
"ppt", // Microsoft PowerPoint 97-2003
"pptx", // Microsoft PowerPoint 2007-2019
"odp", // OpenDocument Presentation
"otp", // OpenDocument Presentation Template
];
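
The routing described above can be sketched as a small lookup: PDFs go straight to the image step, the other listed extensions pass through a PDF conversion first, and images need no preprocessing. This is an illustration under stated assumptions, not Zerox's own code. The function name conversion_steps, the step labels, and the set of image extensions are all made up for the example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions from the list above that need the libreoffice step (all except pdf)
OFFICE_EXTENSIONS = {
    "doc", "docx", "odt", "ott", "rtf", "txt", "html", "htm", "xml", "wps", "wpd",
    "xls", "xlsx", "ods", "ots", "csv", "tsv", "ppt", "pptx", "odp", "otp",
}
# Assumption: a few common image extensions, not taken from the list above
IMAGE_EXTENSIONS = {"png", "jpg", "jpeg"}


def conversion_steps(file_path):
    """Return the ordered preprocessing steps for a local path or URL."""
    # For URLs, look only at the path so a query string does not hide the extension
    path = urlparse(file_path).path if "://" in file_path else file_path
    extension = PurePosixPath(path).suffix.lower().lstrip(".")
    if extension in IMAGE_EXTENSIONS:
        return []
    if extension == "pdf":
        return ["pdf_to_image"]
    if extension in OFFICE_EXTENSIONS:
        return ["to_pdf", "pdf_to_image"]
    raise ValueError(f"Unsupported file type: {extension or 'none'}")
```

For example, conversion_steps("report.docx") yields both steps, while a PDF URL with a query string yields only the image step.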