General Introduction
Zerox is an open-source project that converts PDF, DOCX, images, and other documents to Markdown using vision models. Developed by the getomni-ai team, it provides a simple and efficient OCR (Optical Character Recognition) solution. Zerox is available for both Node and Python and uses graphicsmagick and ghostscript for PDF-to-image processing. Given a file path and an OpenAI API key, it converts documents to Markdown, including documents with complex layouts such as tables and charts.
Feature List
- Converts PDF, DOCX, images, and other file formats
- Supports both Node and Python
- Efficient OCR processing using vision models
- Uses graphicsmagick and ghostscript for PDF-to-image processing; these are normally installed automatically, but may need to be installed manually
- Supports both file path and URL input
- Provides a variety of optional parameters, such as concurrent processing, page orientation correction, and error handling mode
- Support for pre-processing and post-processing callback functions
- Option to save conversion results to a specified directory
User Guide
Installation
Node version
- Install Node.js and npm.
- Run the command:
npm install zerox
- Make sure that graphicsmagick and ghostscript are installed on your system. If they are not, run the following commands:
sudo apt-get update
sudo apt-get install -y graphicsmagick ghostscript
Python version
- Install Python and pip
- Run the command:
pip install zerox
- Make sure that graphicsmagick and ghostscript are installed on your system. If they are not, run the following commands:
sudo apt-get update
sudo apt-get install -y graphicsmagick ghostscript
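Before running a conversion, it can help to confirm that both system dependencies are actually reachable. The following is a minimal sketch, not part of Zerox itself. It assumes that GraphicsMagick is exposed as the `gm` executable and Ghostscript as `gs`, which is the usual naming on Linux and macOS; `find_missing_tools` is a hypothetical helper name.

```python
import shutil

# Assumption: GraphicsMagick installs the `gm` executable and Ghostscript
# installs `gs`. Adjust the names if your platform differs.
REQUIRED_TOOLS = {"graphicsmagick": "gm", "ghostscript": "gs"}


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("graphicsmagick and ghostscript were found on PATH.")
```

If the check reports a missing tool, the apt-get commands above install both packages on Debian-based systems.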
Usage
Node version
- Import the zerox module:
import { zerox } from "zerox";
- Use the file path for conversion:
const result = await zerox({
  filePath: "path/to/file.pdf",
  openaiAPIKey: process.env.OPENAI_API_KEY,
});
- Use the URL for conversion:
const result = await zerox({
  filePath: "https://example.com/file.pdf",
  openaiAPIKey: process.env.OPENAI_API_KEY,
});
Python version
- Import the zerox module:
from zerox import zerox
- Use the file path for conversion:
result = zerox(
file_path="path/to/file.pdf",
openai_api_key="your_openai_api_key"
)
- Use the URL for conversion:
result = zerox(
file_path="https://example.com/file.pdf",
openai_api_key="your_openai_api_key"
)
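Since the same argument accepts either a local path or a URL, and the API key is better read from the environment than hardcoded, a small amount of validation before the call can catch mistakes early. The sketch below is only an illustration: `is_remote` and `build_arguments` are hypothetical helpers, and the keyword names simply mirror the examples above.

```python
import os
from urllib.parse import urlparse


def is_remote(file_path: str) -> bool:
    """Return True when file_path looks like an http(s) URL."""
    return urlparse(file_path).scheme in ("http", "https")


def build_arguments(file_path: str) -> dict:
    """Validate the input and collect the two arguments used in the examples."""
    # Local paths are checked up front; URLs are passed through unchanged.
    if not is_remote(file_path) and not os.path.isfile(file_path):
        raise FileNotFoundError(f"No such local file: {file_path}")
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {"file_path": file_path, "openai_api_key": api_key}
```

The returned dictionary could then be unpacked into the call shown above, for example `zerox(**build_arguments("path/to/file.pdf"))`.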
Main feature workflow
- File conversion: Provide the path or URL of the file and call the zerox function; it converts the file and returns the text in Markdown format.
- Concurrent processing: Set the concurrency parameter to control the number of pages processed at the same time and improve processing efficiency.
- Page orientation correction: Page orientation correction is enabled by default to ensure that the converted text is oriented correctly.
- Error handling mode: Set the errorMode parameter to choose whether errors are ignored or thrown.
- Pre- and post-processing callbacks: Callback functions let you perform custom actions before and after each page is processed.
- Save results: Set the outputDir parameter to save the conversion result to the specified directory.
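To make the interaction between concurrency, error mode, and callbacks more concrete, here is a generic sketch of the pattern these options describe. It is not Zerox's implementation: `convert_pages` and `fake_convert` are hypothetical names, the page converter is a stand-in for the vision-model call, and the `"THROW"` value is an assumed counterpart to the `"IGNORE"` mode.

```python
import asyncio


async def convert_pages(pages, convert_page, concurrency=10, error_mode="IGNORE",
                        on_pre_process=None, on_post_process=None):
    """Convert pages with at most `concurrency` conversions in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(page_number, page):
        async with semaphore:
            if on_pre_process:
                on_pre_process(page_number)
            try:
                content = await convert_page(page)
            except Exception:
                if error_mode == "THROW":
                    raise
                content = None  # IGNORE: record the failure and keep going
            if on_post_process:
                on_post_process(page_number, content)
            return content

    tasks = [run_one(number, page) for number, page in enumerate(pages, start=1)]
    # gather preserves input order, so results line up with page numbers.
    return await asyncio.gather(*tasks)


async def fake_convert(page):
    # Stand-in for the real conversion; fails on one page to show error handling.
    await asyncio.sleep(0)
    if page == "bad":
        raise ValueError("conversion failed")
    return f"# {page}"


if __name__ == "__main__":
    result = asyncio.run(
        convert_pages(["intro", "bad", "summary"], fake_convert, concurrency=2)
    )
    print(result)  # prints ['# intro', None, '# summary']
```

A semaphore keeps the number of simultaneous conversions bounded, which is the usual way to trade throughput against API rate limits and memory use.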
Sample code
Node version
import { zerox } from "zerox";
const result = await zerox({
  filePath: "path/to/file.pdf", // Local path or URL of the document
  openaiAPIKey: process.env.OPENAI_API_KEY,
  cleanup: true,
  concurrency: 10, // Number of pages processed at the same time
  correctOrientation: true, // Page orientation correction
  errorMode: "IGNORE", // Ignore errors instead of throwing them
  maintainFormat: false,
  maxRetries: 1,
  maxTesseractWorkers: -1,
  model: "gpt-4o-mini",
  onPostProcess: async ({ page, progressSummary }) => {}, // Runs after each page is processed
  onPreProcess: async ({ imagePath, pageNumber }) => {}, // Runs before each page is processed
  outputDir: "output", // Directory where the results are saved
  pagesToConvertAsImages: -1,
});
Python version
from zerox import zerox
result = zerox(
    file_path="path/to/file.pdf",  # Local path or URL of the document
    openai_api_key="your_openai_api_key",
    cleanup=True,
    concurrency=10,  # Number of pages processed at the same time
    correct_orientation=True,  # Page orientation correction
    error_mode="IGNORE",  # Ignore errors instead of raising them
    maintain_format=False,
    max_retries=1,
    max_tesseract_workers=-1,
    model="gpt-4o-mini",
    on_post_process=lambda page, progress_summary: None,  # Runs after each page is processed
    output_dir="output",  # Directory where the results are saved
    pages_to_convert_as_images=-1,
)
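If you would rather write the converted Markdown to disk yourself than rely on the output directory option, a few lines of standard-library code are enough. This is a minimal sketch under a stated assumption: the converted pages are available as a list of Markdown strings. The source does not describe the structure of the returned result, so `save_markdown` and its arguments are hypothetical.

```python
from pathlib import Path


def save_markdown(pages, source_name, output_dir="output"):
    """Join page Markdown and write it to <output_dir>/<source stem>.md."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # Create the directory if needed
    target = out_dir / (Path(source_name).stem + ".md")
    # Separate pages with a blank line so Markdown blocks do not run together.
    target.write_text("\n\n".join(pages), encoding="utf-8")
    return target
```

For example, `save_markdown(["# Page 1", "# Page 2"], "path/to/file.pdf")` would write `output/file.md`.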
Document-to-image conversion is done with a combination of libreoffice and graphicsmagick. For non-image, non-PDF files, libreoffice converts the file to PDF, which is then converted to images. The following file types are supported:
[
"pdf", // Portable Document Format
"doc", // Microsoft Word 97-2003
"docx", // Microsoft Word 2007-2019
"odt", // OpenDocument Text
"ott", // OpenDocument Text Template
"rtf", // Rich Text Format
"txt", // Plain Text
"html", // HTML Document
"htm", // HTML Document (alternative extension)
"xml", // XML Document
"wps", // Microsoft Works Word Processor
"wpd", // WordPerfect Document
"xls", // Microsoft Excel 97-2003
"xlsx", // Microsoft Excel 2007-2019
"ods", // OpenDocument Spreadsheet
"ots", // OpenDocument Spreadsheet Template
"csv", // Comma-Separated Values
"tsv", // Tab-Separated Values
"ppt", // Microsoft PowerPoint 97-2003
"pptx", // Microsoft PowerPoint 2007-2019
"odp", // OpenDocument Presentation
"otp", // OpenDocument Presentation Template
];
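The routing described above (PDFs go straight to image conversion, other listed documents pass through libreoffice first) can be illustrated with a short lookup. This is a sketch of the idea rather than Zerox's own code: `conversion_route` and its return labels are hypothetical, and image files are reported as not listed because the extension list above does not include them.

```python
from pathlib import Path

# Extensions copied from the list above.
SUPPORTED_EXTENSIONS = {
    "pdf", "doc", "docx", "odt", "ott", "rtf", "txt", "html", "htm", "xml",
    "wps", "wpd", "xls", "xlsx", "ods", "ots", "csv", "tsv", "ppt", "pptx",
    "odp", "otp",
}


def conversion_route(file_name: str) -> str:
    """Return which conversion path a file would take, based on its extension."""
    extension = Path(file_name).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        return "not-listed"  # For example images, which are handled separately
    if extension == "pdf":
        return "pdf-to-image"
    # Every other listed type is first converted to PDF by libreoffice.
    return "libreoffice-to-pdf-then-image"
```

Checking the extension before starting a conversion gives a clear error for unexpected inputs instead of a failure deep inside the image pipeline.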